Releases: jorgecarleitao/arrow2
v0.17.0
What's Changed
- Fixed writing nested parquet by @jorgecarleitao in #1390
- Fixed error in writing sliced binary by @jorgecarleitao in #1391
- Fixed broken guide link by @kjschiroo in #1395
- Changed methods to slice arrays by @jorgecarleitao in #1396
- Fixed writing of sliced arrays to parquet by @jorgecarleitao in #1397
- Simplified code via DRY by @jorgecarleitao in #1398
- Improved API of getting mutable from Buffer by @jorgecarleitao in #1399
- Simplified code by @jorgecarleitao in #1401
- Improved support for date64 written by pyarrow to parquet by @jorgecarleitao in #1402
- Fixed nested boolean offset by @ritchie46 in #1404
- Added apply_validity and set_validity to mutable utf8 array by @Arty-Maly in #1406
- Fixed ahash dependency for wasm by @hzuo in #1407
- Added cast for FixedSizeBinary to (Large)Binary by @ritchie46 in #1403
- Updated base64 to 0.21 by @WindSoilder in #1408
- Fixed statistics writing flag and correct null_count in dictionaries by @ritchie46 in #1414
- Added convenience accessor array.get by @ozgrakkurt in #1416
- Re-exported the `bloom_filter` module from `parquet2` crate by @ozgrakkurt in #1420
- Added support for MapArray read and write to parquet by @b41sh in #1419
- Added support for decimal256 read/write in parquet by @TCeason in #1412
- Added support for JSON serialization of dictionary by @ritchie46 in #1424
- Added MapScalar by @b41sh in #1428
- Changed encoded float::Inf as null in json by @SimonSchneider in #1427
- Added `set_len` method to Buffer by @haixuanTao in #1374
- Fixed issue with Time32/Time64 datatype in csv reader by @christophe-petitjean in #1425
- Made `num_values` public by @b41sh in #1431
- Made `len/len_proxy` consistent with `Offsets` by @ritchie46 in #1434
- Added memmap `&[u8]` as `BooleanArray` by @ritchie46 in #1436
- Added impl_mutable_array_mut_validity macro for mutable arrays by @Arty-Maly in #1435
- Added buffer interoperability with arrow-rs by @tustvold in #1437
- Changed async ipc writer to accept schema by value by @ritchie46 in #1439
- Updated multiversion and support wider registers by @ritchie46 in #1440
- Updated dependencies by @ritchie46 in #1441
- Added interoperability with arrow-schema by @tustvold in #1442
New Contributors
- @kjschiroo made their first contribution in #1395
- @WindSoilder made their first contribution in #1408
- @TCeason made their first contribution in #1412
- @haixuanTao made their first contribution in #1374
- @christophe-petitjean made their first contribution in #1425
Full Changelog: v0.16.0...v0.17.0
v0.16.0
A new release is here! Thank you everyone that contributed to it! 🙇
Breaking changes:
- Made IPC writer take owned schema #1361 (ritchie46)
- Correctly update child-offsets in `GrowableUnion` #1360 (jleibs)
Fixed bugs:
- invalid written parquet file of nested structures. (Mixing list with structs) #1325
- Fix incorrect downcast in `estimated_size_bytes` #1351 (jleibs)
- fix(parquet): nested struct /list writing #1347 (ritchie46)
- Fixed csv infer_schema on empty fields #1342 (tripokey)
Enhancements:
- Added support for `take` of `FixedSizeListArray` #1386 (kylebarron)
- Renamed `factory` argument on parquet read functions to `reader_factory` #1380 (ozgrakkurt)
- Made some structs and functions public #1375 (b41sh)
- Added `Utf8Array::apply_validity` #1367 (Arty-Maly)
- Added set/get scratches #1363 (ritchie46)
- Amortized intermediate allocations in IPC writer #1362 (ritchie46)
- Improved clippy #1353 (jorgecarleitao)
Documentation updates:
- Fixed typo in `OffsetsBuffer` docs #1373 (DzenanJupic)
- Update README.md to fix capitalization and spelling #1338 (yerke)
Testing updates:
New Contributors
- @yerke made their first contribution in #1338
- @jleibs made their first contribution in #1351
- @tripokey made their first contribution in #1342
- @Arty-Maly made their first contribution in #1367
- @DzenanJupic made their first contribution in #1373
- @kylebarron made their first contribution in #1386
v0.15.0
A new release is here, adding a number of new features and improvements to arrow2. Thank you to everyone that contributed to it!
This release adds support for a new format, the "record" JSON format, contributed by @AnIrishDuck, a new trait `TryExtendFromSelf` to efficiently concatenate an array into an existing mutable array, and multiple performance improvements by @sundy-li and @ritchie46. Finally, we have a new API, `OffsetsBuffer` and `Offsets`, proposed by @ritchie46, which allows creating variable-sized arrays without having to re-check the offsets.
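The idea of validating offsets once, so that later code can trust them, can be sketched in plain Rust. This is only an illustration under stated assumptions: `CheckedOffsets`, `try_new`, `range` and `get` are hypothetical names for this sketch and are not arrow2's `Offsets` / `OffsetsBuffer` API.

```rust
/// Hypothetical stand-in for a validated offsets container (NOT arrow2's type).
#[derive(Debug, PartialEq)]
struct CheckedOffsets(Vec<i64>);

impl CheckedOffsets {
    /// Validates once that the offsets are non-empty, start at a
    /// non-negative value, and never decrease.
    fn try_new(offsets: Vec<i64>) -> Result<Self, String> {
        match offsets.first() {
            None => return Err("offsets must have at least one element".to_string()),
            Some(first) if *first < 0 => {
                return Err("offsets must be non-negative".to_string())
            }
            _ => {}
        }
        if offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err("offsets must be monotonically increasing".to_string());
        }
        Ok(Self(offsets))
    }

    /// Number of variable-sized items described by the offsets.
    fn len(&self) -> usize {
        self.0.len() - 1
    }

    /// Start and end of item `i`; ordering was already checked in `try_new`.
    fn range(&self, i: usize) -> (usize, usize) {
        (self.0[i] as usize, self.0[i + 1] as usize)
    }
}

/// Reads item `i` out of a flat values buffer.
/// Assumes the last offset does not exceed `values.len()`.
fn get<'a>(values: &'a str, offsets: &CheckedOffsets, i: usize) -> &'a str {
    let (start, end) = offsets.range(i);
    &values[start..end]
}

fn main() {
    // Three items: "hi", "" and "you", stored back to back.
    let offsets = CheckedOffsets::try_new(vec![0, 2, 2, 5]).unwrap();
    let values = "hiyou";
    for i in 0..offsets.len() {
        println!("{:?}", get(values, &offsets, i));
    }
    // Decreasing offsets are rejected at construction time.
    assert!(CheckedOffsets::try_new(vec![0, 3, 1]).is_err());
}
```

Because the invariant is enforced by the constructor, any function that receives a `CheckedOffsets` can skip re-validation, which is the general benefit the release notes describe.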
This release also features a number of contributions from first contributors:
- @benesch made their first contribution in #1271
- @RinChanNOWWW made their first contribution in #1287
- @datapythonista made their first contribution in #1290
- @sandflee made their first contribution in #1286
- @Samrose-Ahmed made their first contribution in #1279
- @jondo2010 made their first contribution in #1300
- @cyr made their first contribution in #1318
- @universalmind303 made their first contribution in #1321
Thank you everyone for the great work this year, and happy festivities everyone!
Breaking changes:
- Added values' capacity to `MutableBinaryArray::reserve` #1277
- Removed `from_data` from all arrays #1328 (jorgecarleitao)
- Added `Offsets` and `OffsetsBuffer` #1316 (jorgecarleitao)
- Bumped parquet2 dependency #1304 (ritchie46)
- Added data_pagesize_limit to write parquet pages #1303 (sundy-li)
- Bumped arrow-format to 0.8 #1298 (Xuanwo)
- Improved iterators #1270 (jorgecarleitao)
New features:
- Added `TryExtendFromSelf` #1278 (jorgecarleitao)
- Added support for JSON ser/de records layout #1275 (AnIrishDuck)
Fixed bugs:
- Parquet writes all values of sliced arrays? #1323
- Avro schema: Invalid record names #1269
- Fixed writing nested/sliced arrays to parquet #1326 (ritchie46)
- Fixed failing to accept dictionary full of nulls #1312 (ritchie46)
- Added support for Extension types in ffi #1300 (jondo2010)
- Fixed error in memory usage of sliced binary/list/utf8arrays #1293 (ritchie46)
- Fixed descending ordering when specify nulls first #1286 (sandflee)
- Added avro record names when converting arrow schema to avro #1279 (Samrose-Ahmed)
Enhancements:
- Fixed clippy #1336 (jorgecarleitao)
- Improved `UnionArray` #1331 (jorgecarleitao)
- Bumped json-deserializer version #1321 (universalmind303)
- Removed flushing during arrow IPC writing to improve performance when using a buffered writer #1318 (cyr)
- Improved performance of check_indexes #1313 (ritchie46)
- Improved performance of checking offsets `~-64-73%` #1305 (ritchie46)
- Added `reserve` to pushable containers in parquet extend_from_decoder #1301 (ritchie46)
- Optimized slicing #1285 (jorgecarleitao)
- Improved ZipValidity iterators #1284 (ritchie46)
- Added `MutableBinaryValuesArray` #1276 (jorgecarleitao)
Documentation updates:
- Fixed link from the API to the guide #1290 (datapythonista)
v0.14.1
A couple of backward-compatible bug fixes and improvements that everyone benefits from :)
Thank you @cjermain, @shaeqahmed and @ozgrakkurt! 🙇
Fixed bugs:
- Potential bug in reading lists from avro? #1252
- Removed un-used code #1258 (jorgecarleitao)
- Fixed error reading unbounded Avro list #1253 (jorgecarleitao)
- Add missing call to `try_push_valid` for nested avro deserialization #1248 (shaeqahmed)
Enhancements:
- Bump json_deserializer version to 0.4.1 #1261 (cjermain)
- Fixed clippy for 1.60 #1259 (jorgecarleitao)
- Added `BinaryArray::into_mut` and double-ended support for its iterator #1255 (ozgrakkurt)
Testing updates:
- Improved test for nullable struct read from Avro #1250 (jorgecarleitao)
v0.14.0
Another release of arrow2 is here!
Besides API improvements to reading IPC and parquet, there are two main new features, the ability to memory map arrow files (check out https://jorgecarleitao.github.io/arrow2/v0.14.0/guide/io/ipc_mmap.html) and support for decimal 256.
The following people made their first contribution to the crate:
- @daniel-martinez-maqueda-sap made their first contribution in #1204
- @AnIrishDuck made their first contribution in #1211
- @samkaufman made their first contribution in #1213
- @teymour-aldridge made their first contribution in #1225
- @poga made their first contribution in #1234
- @knil-sama made their first contribution in #1237
Thank you everyone for all the issues, PRs and ideas!
Breaking changes:
- Removed `Count` (parquet statistics) #1217 (jorgecarleitao)
- Exposed parquet indexed page filtering to `FileReader` #1216 (jorgecarleitao)
- Simpler IPC API #1208 (jorgecarleitao)
- Migrated Avro code to avro-schema repo #1199 (jorgecarleitao)
- Added support for decimal 256 #1194 (jorgecarleitao)
New features:
- Added support for decoding delta-length-encoded binary (parquet) #1228 (jorgecarleitao)
- Added support to read and write Parquet's delta-bitpacked (integer encoding) #1226 (jorgecarleitao)
- Added support for parquet sidecar to `FileReader` #1215 (jorgecarleitao)
- Write 64bit aligned IPC files #1201 (jorgecarleitao)
- Added support to mmap IPC format #1197 (jorgecarleitao)
- Added `MutableStructArray` #1196 (hohav)
Fixed bugs:
- Stack overflow in parquet RowGroupReader with groups_filter #1206
- Fixed comparison and validity kernels #1243 (ritchie46)
- Fixed reading nested stats #1240 (jorgecarleitao)
- `FileSink` now closes the underlying writer. #1213 (samkaufman)
- Fixed JSON infer order #1212 (jorgecarleitao)
- Fixed StackOverflow in skipping many parquet row groups #1210 (jorgecarleitao)
- Fix escaped like wildcards #1204 (daniel-martinez-maqueda-sap)
- Removed println :( #1203 (jorgecarleitao)
Enhancements:
- Added schema to FileReader #1246 (jorgecarleitao)
- Simpler nested parquet read #1241 (jorgecarleitao)
- Removed unneeded code #1229 (jorgecarleitao)
- Improved `MutableStruct::push` #1223 (hohav)
- Reduced binary size #1221 (jorgecarleitao)
- Added utf8 <> binary cast #1220 (jorgecarleitao)
- split parquet compression backend features #1207 (ritchie46)
- Improved API of `mmap` #1205 (ritchie46)
- Added `MutableArray::reserve` #1202 (jorgecarleitao)
- Delayed dict #1185 (jorgecarleitao)
Documentation updates:
- Fixed guide and improved examples #1247 (jorgecarleitao)
- Added documentation on parquet compatibility under `TimeUnit`. #1238 (TurnOfACard)
- Fixed typo in error message for impl StructArray #1237 (knil-sama)
- Fixed incorrect command in doc for generating ORC files #1234 (poga)
- Improved github page generation #1233 (jorgecarleitao)
- Fix a typo in the docs #1225 (teymour-aldridge)
- Fix some doc links/typos #1211 (AnIrishDuck)
Testing updates:
- Fixed clippy warnings #1227 (jorgecarleitao)
- Updated integration test #1214 (jorgecarleitao)
v0.13.1
Thanks @daniel-martinez-maqueda-sap!
Fixed bugs:
- Fix escaped like wildcards #1204 (daniel-martinez-maqueda-sap)
- Removed println :( #1203 (jorgecarleitao)
v0.13.0
A new version (0.13) is now available on crates.io! 🎉🎉🎉🎉
This is another large release of arrow2. Among the many, many changes (see below), it is worth noting:
- Added copy-on-write API to perform operations in place, improving performance of expressions like `(a + b) * 2` by a factor of 2-10x
- Added support to read from Apache ORC format
- Added support for projection and limit pushdown when reading from Arrow IPC format
- Added support for `f16`
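To illustrate why a copy-on-write API helps with an expression like `(a + b) * 2`, here is a minimal sketch using only the Rust standard library. It is not arrow2's API: `apply_cow` and `add` are hypothetical helpers, and a plain `Arc<Vec<i32>>` stands in for an array's buffer. The intermediate `a + b` is uniquely owned, so the multiplication can reuse its allocation instead of allocating again.

```rust
use std::sync::Arc;

/// Applies `op` in place when the buffer is uniquely owned; otherwise
/// allocates a new buffer and leaves the shared one untouched.
fn apply_cow(mut values: Arc<Vec<i32>>, op: impl Fn(i32) -> i32) -> Arc<Vec<i32>> {
    match Arc::get_mut(&mut values) {
        // Sole owner: mutate the existing allocation.
        Some(unique) => unique.iter_mut().for_each(|v| *v = op(*v)),
        // Shared with someone else: fall back to a fresh allocation.
        None => values = Arc::new(values.iter().map(|&v| op(v)).collect::<Vec<i32>>()),
    }
    values
}

/// Element-wise sum; always allocates, producing a uniquely owned buffer.
fn add(a: &[i32], b: &[i32]) -> Arc<Vec<i32>> {
    Arc::new(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

fn main() {
    let a = vec![1, 2, 3];
    let b = vec![10, 20, 30];

    // (a + b) * 2: the sum is a temporary nobody else holds.
    let sum = add(&a, &b);
    let ptr_before = sum.as_ptr();
    let result = apply_cow(sum, |v| v * 2);
    assert_eq!(*result, vec![22, 44, 66]);
    // The multiplication reused the allocation made by the addition.
    assert_eq!(ptr_before, result.as_ptr());

    // A shared buffer is copied instead, so other owners are unaffected.
    let shared = Arc::new(vec![1, 2]);
    let other_owner = Arc::clone(&shared);
    let bumped = apply_cow(shared, |v| v + 1);
    assert_eq!(*other_owner, vec![1, 2]);
    assert_eq!(*bumped, vec![2, 3]);
}
```

The uniqueness check is what makes the in-place path safe: mutation only happens when no other reference can observe it.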
Thank you to the numerous contributors, both via PRs and issues, that resulted in this fantastic release 🙇
Breaking changes:
- Made `nested` argument of `array_to_pages` non-owning #1174
- Replaced `Result` by `panic` in boolean comparison #1159 (jorgecarleitao)
- Improved dictionary invariants #1137 (jorgecarleitao)
- Change signature of PrimitiveScalar::value to return reference #1129 (ncpenke)
- Removed need to pass encodings by value #1123 (ritchie46)
- Removed unused `NativeType::to_ne_bytes` #1112 (jorgecarleitao)
- Avoid clone in `with_validity` #1104 (jorgecarleitao)
- Reduced need of `unsafe` in FFI #1100 (jorgecarleitao)
- Removed `Buffer::into_mut` and `make_mut` functions #1089 (jorgecarleitao)
- Renamed `Bitmap::null_count` to `Bitmap::unset_bits` #1087 (jorgecarleitao)
- Made `chunk_size` optional in parquet's `column_iter_to_arrays` #1055 (jorgecarleitao)
- Migrated from `Arc<dyn Array>` to `Box<dyn Array>` #1042 (jorgecarleitao)
New features:
- Added support to read ORC #1189 (jorgecarleitao)
- Added support for limit pushdown to IPC reading #1135 (jorgecarleitao)
- Added support to write and read Intervals from and to parquet #1122 (jorgecarleitao)
- Added support to write `FixedSizeBinary` to Avro #1118 (jorgecarleitao)
- Added support for projections in reading IPC streams #1097 (joshuataylor)
- Added support to write parquet `_metadata` sidecar #1063 (jorgecarleitao)
- Added cow APIs (2x-10x vs non-cow) #1061 (jorgecarleitao)
- Added support to read and write f16 #1051 (jorgecarleitao)
Fixed bugs:
- Fixed not-implemented error when reading plain, after-dict pages for fix-len-binary from parquet #1192 (jorgecarleitao)
- Fixed error in decoding nested multi-page columns from parquet #1188 (jorgecarleitao)
- Fixed error in counting items in nested parquet #1182 (jorgecarleitao)
- Fixed reading stats from int96 parquet #1181 (jorgecarleitao)
- Fixed limit pushdown in parquet #1180 (jorgecarleitao)
- use `FnOnce` for `PrimitiveArray::apply_validity` #1176 (ritchie46)
- release memory on predicate with 0% selectivity #1163 (ritchie46)
- Fixed error in reading `Struct<List<...>>` from parquet #1150 (jorgecarleitao)
- Fixed IPC projection #1149 (ritchie46)
- Fixed casting dictionary keys #1143 (ritchie46)
- Fixed reading arrays from parquet with required children #1140 (jorgecarleitao)
- Fixed panic in deserializing nested statistics #1139 (jorgecarleitao)
- Aligned name of `FixedSizeBinaryArray::values_iter` #1117 (jorgecarleitao)
- Fixed error in `FixedSizeListArray::new_null` #1114 (jorgecarleitao)
- Fixed panic in writing dictionaries to parquet #1113 (jorgecarleitao)
- Fixed error in reading chunked parquet #1108 (jorgecarleitao)
- Raise error when invalid fields are passed to flight #1093 (jorgecarleitao)
- Made IPC projection not sort projection #1082 (jorgecarleitao)
- Fixed error in chunked_mut bitmap #1081 (jorgecarleitao)
- Fixed panic in bitmap assign_mut #1078 (ritchie46)
- Panic-free read of IPC files #1075 (jorgecarleitao)
- Bumped parquet2 (minor) requirement #1071 (jorgecarleitao)
- Fixed divide by zero on reading empty row group #1062 (jorgecarleitao)
- Fixed missing validation of number of encodings passed when writing to parquet #1057 (jorgecarleitao)
Enhancements:
- Improved performance of reading Binary from parquet #1190 (ritchie46)
- Bumped to latest nightly #1186 (gyscos)
- Improved error message #1179 (jorgecarleitao)
- Added support to read and write nested dictionaries to parquet #1175 (jorgecarleitao)
- Added `MutableUtf8Array::into_data` #1170 (ritchie46)
- Added `Default` for `Utf8Array` #1169 (ritchie46)
- fix(parquet): allow to read other logical types from parquet #1168 (sundy-li)
- fix(parquet): enforce to use ParquetTimeUnit::Nanoseconds for PhysicalType::Int96 #1167 (sundy-li)
- Added constructor `MutableFixedSizeListArray::new_from` #1161 (hohav)
- Removed unneeded `Default` constraint #1157 (hohav)
- Improved checks to safety invariants in FFI #1154 (jorgecarleitao)
- Removed un-needed indirection #1153 (jorgecarleitao)
- Soften generic constraint of `Buffer` #1152 (sundy-li)
- Use ahash by default #1148 (ritchie46)
- Reduced bound checks [#1142](https://github.com/j...
v0.12.0
A new version of arrow2 is now available on crates.io. 🎉🎉🎉
See below all great things that were released 🚀. But before that, thank you so much to everyone that contributed to this release: 🙇
@ahmedriza, @dexterduck, @GPSnoopy, @HaoYang670, @SimonSchneider, @TurnOfACard, @aptr322, @arxra, @b41sh, @cjermain, @dbr, @jorgecarleitao, @ritchie46
Breaking changes:
- Require one encoding per parquet column on write #1012
- Bumped parquet2 #1035 (jorgecarleitao)
- Improved performance of deserializing JSON (2x) #1024 (jorgecarleitao)
- Remove `from_trusted_len_*` from `Buffer` #1020 (jorgecarleitao)
- Bumped arrow-format #1011 (jorgecarleitao)
- Replace `fn Offset::is_large()` as `const Offset::IS_LARGE` #1002 (HaoYang670)
- Renamed `ArrowError` to `Error` #993 (jorgecarleitao)
New features:
- Added support to deserialize `MapArray` from parquet #1045 (jorgecarleitao)
- Added support for random access reads from IPC #1034 (jorgecarleitao)
- Added support for custom sort `build_compare_fn` #1016 (b41sh)
- Added support to write nested parquet #1007 (jorgecarleitao)
- Added support for deserializing JSON from iterator #989 (cjermain)
Fixed bugs:
- Writing of `ListArray` does not preserve all values #1008
- Write a two-dimensional list to parquet file failed #992
- Writing to Parquet fails for extension types that contain lists #830
- Fixed using lower limit than size of first parquet row group #1046 (arxra)
- Fixed error in consuming sliced `FixedSizedBinary` from c data interface (FFI) #1026 (jorgecarleitao)
- Fixed lexsort limit equal or greater than row_count #1021 (b41sh)
- Fixed error in reading nested parquet structs #1015 (jorgecarleitao)
- Fixed panic on debug print of invalid timezones #1013 (jorgecarleitao)
- Treat empty timezone string as no-timezone #1009 (dbr)
- Fixed encoding of `NaN` to json #990 (SimonSchneider)
- Fixed error in writing `ListArray` to parquet #984 (jorgecarleitao)
- Fixed decoding Binary Plain pages with dictionary pages #982 (aptr322)
Enhancements:
- Added `Debug` and `PartialEq` for `MapArray` #1043 (jorgecarleitao)
- Exposed compression levels for parquet #1041 (ritchie46)
- Added `.arced`/`.boxed` to arrays #1040 (jorgecarleitao)
- Added utility to create encodings #1018 (jorgecarleitao)
- Made `parquet_to_arrow_schema` public #1006 (martingallagher)
- Sped up `min_max_boolean` for the case where all values are null #1005 (HaoYang670)
- Simplified `min_max_string` and `min_max_binary` #1004 (HaoYang670)
- Added support for Decimal in `build_compare` #998 (GPSnoopy)
- remove accidental quadratic null_count #991 (ritchie46)
- Aligns MutableDictionaryArray's with MutablePrimitiveArrays with TryPush #981 (TurnOfACard)
Documentation updates:
- Cleaned docs for BinaryArray #1047 (jorgecarleitao)
- Improved API docs for `MutableBitmap` #1025 (jorgecarleitao)
- Improved docs for `bitmap` #1022 (jorgecarleitao)
- Improved API docs for `PrimitiveArray` and `Utf8Array` #1017 (jorgecarleitao)
- Fixed dev guide #1003 (jorgecarleitao)
Testing updates:
- Added more tests #1029 (jorgecarleitao)
- Moved coverage reporting to `cargo-llvm-cov` #1028 (jorgecarleitao)
- Added more tests (increase coverage) #1027 (jorgecarleitao)
- Moved tests from lib to `tests` #1001 (jorgecarleitao)
- Allowed feature-specific test runs #985 (jorgecarleitao)
v0.11.2
New features:
- Added support to append to existing IPC Arrow file #972 (jorgecarleitao)
- Added pop to utf8/binary/fixedSize MutableArray #966 (ygf11)
- Added support for union scalars #930 (ncpenke)
Fixed bugs:
- Added support to read nested binary from parquet #978 (jorgecarleitao)
- Fixed empty reader panic for NDJSON type infer #974 (Roberto-XY)
- Prevented SO in large parquet files #973 (ritchie46)
- Fixed API bug in `async` read of IPC metadata #969 (jorgecarleitao)
- Fixed writing required list to parquet #968 (jorgecarleitao)
Enhancements:
- Added support Parquet deserialize LargeList and Uint data types #979 (b41sh)
- Made reading of IPC dictionaries lazy #971 (jorgecarleitao)
- Allowed creating IPC `FileWriter` without writing to the file #970 (jorgecarleitao)
v0.11.0
Arrow2 v0.11.0 is out!! 🎉🎉🎉
This release mainly focuses on improving parquet support, building on the previous release. In particular, we have the main ingredients to read indexed parquet pages, which allows skipping the deserialization of individual pages, and as of this version parquet files are written with page indexes. There is still some work to do to improve the frontend API for skipping pages via statistics, which is left for the next version.
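As a rough illustration of how a page index lets a reader skip pages, the sketch below filters pages by their minimum and maximum values. It is a conceptual example only: `PageStats` and `pages_to_read` are hypothetical names, and the code does not use arrow2's or parquet2's actual reader API.

```rust
/// Hypothetical per-page statistics, loosely modelled on what a page
/// index records for a column (NOT a parquet2 type).
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Returns the indices of the pages whose [min, max] range may contain
/// `target`. All other pages can be skipped without deserializing them.
fn pages_to_read(index: &[PageStats], target: i64) -> Vec<usize> {
    index
        .iter()
        .enumerate()
        .filter(|(_, page)| page.min <= target && target <= page.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Three pages of a sorted column.
    let index = [
        PageStats { min: 0, max: 9 },
        PageStats { min: 10, max: 19 },
        PageStats { min: 20, max: 29 },
    ];
    // Only the second page can contain the value 15.
    println!("{:?}", pages_to_read(&index, 15));
    // No page can contain 42, so nothing needs to be read.
    println!("{:?}", pages_to_read(&index, 42));
}
```

The statistics only prove that a page cannot match; a selected page still has to be deserialized and filtered row by row.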
This version also contains multiple bug fixes.
Thanks everyone that contributed to this release (individual PRs below)! 🙇
Changelog
Breaking changes:
- Refactored parquet statistics deserialization #962 (jorgecarleitao)
- Made GroupFilter `Send + Sync` #947 (jorgecarleitao)
New features:
- Added support for non-ordered projections to IPC reading #961 (jorgecarleitao)
- Added support for reading indexed parquet pages #923 (jorgecarleitao)
Fixed bugs:
- Parquet regression: `exceptions.ArrowErrorException: NotYetImplemented("Can't read Dictionary(UInt32, LargeUtf8, false) from parquet")` #955
- Reading Parquet binary column panics during deserialization 'attempt to subtract with overflow' #944
- Reading Parquet file written by pyarrow with `lz4` compression fails with `OutOfSpec("Thrift out of range")` #940
- Issues when trying to create a parquet file with FixedSizedListArray #691
- Fixed bug in writing csv with buffer resizing #965 (ritchie46)
- Fixed bug in reading binary parquet #945 (jorgecarleitao)
- Fixed error in writing fixedSizeListArray to parquet #941 (jorgecarleitao)
- Fixed support to read dict nested binary parquet #924 (jorgecarleitao)
Enhancements:
- Reduced memory usage in reading parquet #964 (jorgecarleitao)
- Simpler IPC code #939 (jorgecarleitao)
- don't allocate string when writing to csv #935 (ritchie46)
- Removed un-needed generic parameter #927 (jorgecarleitao)
- update to odbc-api 0.36.0 #925 (pacman82)
Documentation updates:
- Fixed example of parallel read via rayon #958 (jorgecarleitao)
- Fixed guide deployment #931 (jorgecarleitao)
- Typo fix #919 (bkmgit)
Testing updates:
- Fixed patch of integration tests #960 (jorgecarleitao)
- Added test for MapArray #942 (jorgecarleitao)
- Fixed wrong clippy warning #938 (jorgecarleitao)