Return format of Arrow Stream data to an Elixir NIF #1033

joshuataylor · 2022-05-31T06:16:44Z

Hi!

This is more about my lack of Rust knowledge and the "best" way to return data back to Elixir.

Basically, here is my use case:

I am working on an Elixir Snowflake connector, which currently returns JSON, but Snowflake (a cloud database/data warehouse) can also send results as Arrow streams. These are the files that make up your result set when querying, and they can depend on your query.
Elixir does not have an Apache Arrow implementation, but using NIFs is acceptable for this use case. A NIF is a function implemented in C instead of Erlang, but it can also be written in Rust using Rustler.
I then plan on passing the response we get back from Snowflake to the Rust package I'll create (let's call it snowflake_elixir_rustler for the sake of simplicity), into a function called "parse_arrow_stream" or something similar.
Read Arrow streams works perfectly for this use case, and I can parse the file! 🎉 (I'm still working on parsing the binary, which is the next step.)
This function needs to return data in a format Elixir can understand. I believe it can read back pretty much any Rust type, so the main question is how to do this effectively.
So I am wondering what the best way to do this is. Do I return a vec of files? Do I return a RecordBatch?
I would prefer the columns to be returned in an array, so that nothing needs to be converted on the Elixir side.
req_elixir returns data in columns and headers, alongside metadata (don't worry about the num rows/success; that's handled by the actual Snowflake JSON response that contains the Arrow files), like this:

%ReqSnowflake.Result{
  columns: ["L_ORDERKEY", "L_PARTKEY"],
  num_rows: 2,
  rows: [[3_000_001, 14406], [3_000_002, 34422]],
  success: true
}
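
To make the shape of that result concrete, here is a minimal Rust sketch of turning column-major data (how Arrow stores it) into the row-major `rows` list shown above. The `Value` enum and the `columns_to_rows` function are hypothetical names used only for illustration; they are not part of arrow2, Rustler, or req_snowflake.

```rust
/// A single cell value. Hypothetical stand-in for whatever term type
/// the NIF would eventually hand back to Elixir.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Text(String),
    Null,
}

/// Turn column-major data (one Vec per column) into row-major data
/// (one Vec per row). Returns None if the columns differ in length.
pub fn columns_to_rows(columns: &[Vec<Value>]) -> Option<Vec<Vec<Value>>> {
    // An empty column list yields zero rows.
    let num_rows = columns.first().map_or(0, |c| c.len());
    if columns.iter().any(|c| c.len() != num_rows) {
        return None;
    }
    // Row r is made of the r-th element of every column.
    let rows: Vec<Vec<Value>> = (0..num_rows)
        .map(|r| columns.iter().map(|c| c[r].clone()).collect())
        .collect();
    Some(rows)
}
```

Passing the two columns `[3_000_001, 3_000_002]` and `[14406, 34422]` would produce the two rows from the example result. Returning the columns untouched skips this transposition entirely, which is one argument for a column-oriented return format.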

Sorry for the long ramble. I'm just looking for opinions on how to return the data. It shouldn't be many gigabytes: from my initial testing, most files are a couple of megabytes at most.

Example Arrow Stream file here (from Snowflake example data): https://github.com/joshuataylor/arrow_fail_read_example/blob/main/example_snowflake_data

Tracking issue for req_snowflake - joshuataylor/req_snowflake#7

edit:

I think the best option at this point is to do something similar to what the CSV writer does, but writing back to Elixir types.

Answered by jorgecarleitao

May 31, 2022

jorgecarleitao (Maintainer) · 2022-05-31T15:19:22Z

This is a very interesting question!

Generally, there are two questions to answer:

1. Which allocator are you planning to use to handle Snowflake's Arrow data?
   1.1. Rust's allocator
   1.2. Elixir's allocator
2. Which in-memory format do you plan to use in your application?
   2.1. Arrow
   2.2. Custom in-memory format

From the answers to these questions, we can work out what we need:

a. (1.1, 2.1) - a mechanism to expose shared structs like Arc<dyn Array> to Elixir (this is what the Arrow C data interface is intended for)
b. (1.1, 2.2) - a mechanism to expose non-owning structs like &[T] to Elixir, and a mechanism to convert them to the custom format
c. (1.2, 2.1) - a mechanism to expose Arc<dyn Array> to Elixir and a mechanism to memcopy that to Elixir's own regions / allocated regions
d. (1.2, 2.2) - same as (1.1, 2.2)
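
As a rough illustration of the three kinds of mechanism named in this list, the following std-only Rust sketch contrasts sharing, borrowing, and copying a buffer. The function names are hypothetical, and the destination `Vec` merely stands in for memory owned by another allocator; nothing here touches Elixir or Rustler.

```rust
use std::sync::Arc;

/// Sharing (as in option a): cloning the Arc only bumps a reference
/// count, so both handles point at the same heap buffer.
pub fn share(values: &Arc<Vec<i32>>) -> Arc<Vec<i32>> {
    Arc::clone(values)
}

/// Non-owning view (as in option b): no allocation and no copy, but the
/// slice cannot outlive the Arc it was borrowed from.
pub fn borrow(values: &Arc<Vec<i32>>) -> &[i32] {
    values.as_slice()
}

/// Memcopy (as in option c): the whole buffer is copied in one go into
/// memory owned by the receiver. `dest` is a stand-in for a region
/// allocated by the other side.
pub fn copy_into(values: &[i32], dest: &mut Vec<i32>) {
    dest.clear();
    dest.extend_from_slice(values);
}
```

Sharing and borrowing leave the data where it is, so they only work if the receiver understands the layout. Copying works across allocators but still requires the receiver to accept a contiguous typed buffer.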

Starting very small, consider the following struct:

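use std::sync::Arc;

struct Array {
    // Values buffer, shared through reference counting.
    pub values: Arc<Vec<i32>>,
    // Optional validity (null) buffer; None means every value is valid.
    pub validity: Option<Arc<Vec<u8>>>,
}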

How would we map this to Elixir? Is it possible to memcopy e.g. let a: &[i32] = values.as_ref() into Elixir's allocator? Going through Rustler's code, I can see that a Vec<T> is decoded item by item, so it seems that the answer is generally no.

Thus, without an Elixir Arrow library, we can't benefit from sharing a reference or even memcopying the whole Vec in one go. Therefore, we are left with options b) and d): we need to decode value by value (so O(N)) when moving Arrow -> Elixir.
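
A minimal sketch of what "decode value by value" means for the struct above. It assumes the validity buffer is an Arrow-style bitmap (one bit per slot, least-significant bit first); the simplified struct in this post does not state that, so treat it as an assumption. The `decode` function is a hypothetical name.

```rust
use std::sync::Arc;

pub struct Array {
    pub values: Arc<Vec<i32>>,
    pub validity: Option<Arc<Vec<u8>>>,
}

/// Decode the array one slot at a time into owned values, so the work
/// is O(N) in the number of slots. Panics if the bitmap is too short
/// for the number of values.
pub fn decode(array: &Array) -> Vec<Option<i32>> {
    array
        .values
        .iter()
        .enumerate()
        .map(|(i, &v)| match &array.validity {
            // Assumed layout: bit i lives in byte i / 8 at position
            // i % 8, counting from the least-significant bit.
            Some(bits) if (bits[i / 8] & (1u8 << (i % 8))) == 0 => None,
            // Bit set, or no validity buffer at all: the slot is valid.
            _ => Some(v),
        })
        .collect()
}
```

Each slot is touched once and re-wrapped, which is the per-value cost the paragraph above refers to. A receiver that understood the Arrow layout could skip this loop.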

In this case, I would probably try d), like:

1. implement Serde's Serializer for e.g. PrimitiveArray<T>, as is done by serde_rustler.
2. leverage serde_rustler to map serde types to Elixir.

Note that this does incur a CPU cost, as we are doing a traditional ser-de conversion between Arrow and another in-memory format (Arrow -> Serde -> Elixir). The benefit of Arrow here is on the Snowflake side: Arrow is likely much faster to transmit than JSON for large data.

With a minimal Elixir Arrow library (i.e. one that implements reading from the C data interface / FFI), we could leverage Rust's implementation for everything else (e.g. Arrow IPC, Parquet, etc.) by passing Rust's Arc<dyn Array> to Elixir. We would then need some functions to print and operate on these in Elixir.

Hope this helps somehow.

joshuataylor (Author) · 2022-05-31T23:02:28Z

It does!

Thank you so much for your incredibly thorough answer!

I think a first step is to compare how fast it is to encode to Elixir types in Rust and return them versus using the foreign function interface and doing the conversion in Elixir.

As my first experiment, I'm skipping serde and just encoding the values using Rustler's .encode, which I'll then benchmark.
I'll then use the foreign function interface to return a chunk back (multiple ffi::export_array_to_c calls).

I'll post back my findings, as they will probably be relevant for other languages too, and for a generic Elixir Arrow binding at some point (this is just a Snowflake-specific binding right now, where we read Arrow streaming files).

edit:

For the first implementation, I've ended up just serialising to Elixir formats, which is great as we don't have to convert the column types like we do with the JSON implementation (https://github.com/joshuataylor/req_snowflake/blob/initial/lib/req_snowflake/req_snowflake.ex#L251). It's also absurdly fast so far (I honestly thought I hadn't set up the benchmark properly). I'll provide benchmark results once I get the rest of the types mapped.
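
To illustrate why the typed path avoids a conversion step, here is a hypothetical std-only sketch. In the JSON-style path each cell arrives as a string and is parsed according to column metadata; in the Arrow-style path the buffer is already typed and each value is simply wrapped. The `ColumnType` and `Value` enums and both function names are invented for this example and do not mirror req_snowflake or Rustler.

```rust
/// Hypothetical column metadata, as a JSON response might describe it.
#[derive(Debug, Clone, Copy)]
pub enum ColumnType {
    Integer,
    Float,
    Text,
}

/// Hypothetical dynamically typed cell value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Float(f64),
    Text(String),
}

/// JSON-style path: parse each string cell according to its column type.
pub fn cast_json_cell(raw: &str, ty: ColumnType) -> Result<Value, String> {
    match ty {
        ColumnType::Integer => raw.parse::<i64>().map(Value::Int).map_err(|e| e.to_string()),
        ColumnType::Float => raw.parse::<f64>().map(Value::Float).map_err(|e| e.to_string()),
        ColumnType::Text => Ok(Value::Text(raw.to_string())),
    }
}

/// Arrow-style path: the values are already i64, so no parsing is needed.
pub fn wrap_arrow_ints(values: &[i64]) -> Vec<Value> {
    values.iter().copied().map(Value::Int).collect()
}
```

The second function still visits every value, but it never inspects text or handles parse failures.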

joshuataylor (Author) · 2022-06-04T13:21:59Z

Again, thank you so much for your explanation, it really helped.

I went with the approach of serialising to Rust types, which Rustler can just convert for me. I'm looking into other options to see if there is another way to return data to Elixir using some tricks, but from my initial testing it's fast. Snowflake tends to send files ranging from 100 KB to 1 MB, up to 20 MB, so it's not a huge amount of data to parse anyway.

I've got an initial PR here - joshuataylor/snowflake_arrow#1

Here are some initial benchmarking results; these don't cast into Elixir structs etc. yet:

Laptop, Apple MacBook M1

CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        1.15 K        0.87 ms    ±21.48%        0.79 ms        1.20 ms
arrow_large (9.4mb)       0.148 K        6.76 ms    ±25.01%        5.91 ms        9.32 ms

Comparison:
arrow_small (368kb)        1.15 K
arrow_large (9.4mb)       0.148 K - 7.77x slower +5.89 ms

Desktop (slowish single core performance)

Operating System: Linux
CPU Information: AMD Ryzen Threadripper 2990WX 32-Core Processor
Number of Available Cores: 64
Available memory: 125.82 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        488.32        2.05 ms    ±44.27%        1.45 ms        3.80 ms
arrow_large (9.4mb)         56.52       17.69 ms     ±9.90%       17.45 ms       24.26 ms

Comparison: 
arrow_small (368kb)        488.32
arrow_large (9.4mb)         56.52 - 8.64x slower +15.65 ms

Desktop, AMD Ryzen 5 5600X

Operating System: Linux
CPU Information: AMD Ryzen 5 5600X 6-Core Processor
Number of Available Cores: 12
Available memory: 31.33 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        1.08 K        0.92 ms    ±22.65%        0.83 ms        1.28 ms
arrow_large (9.4mb)       0.121 K        8.25 ms    ±28.22%        7.47 ms       11.54 ms

Comparison: 
arrow_small (368kb)        1.08 K
arrow_large (9.4mb)       0.121 K - 8.94x slower +7.33 ms
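
For reading the tables above: the "x slower" figure is the ratio of iterations per second, and an effective throughput can be derived from file size and average time. The small Rust sketch below does that arithmetic. It assumes "mb" and "kb" mean decimal megabytes and kilobytes, which the benchmark labels do not specify, and both function names are hypothetical.

```rust
/// Effective throughput in megabytes per second for a payload of
/// `size_mb` megabytes decoded in `average_ms` milliseconds.
pub fn throughput_mb_per_s(size_mb: f64, average_ms: f64) -> f64 {
    size_mb / (average_ms / 1000.0)
}

/// How many times slower the second run is than the first, computed
/// from iterations per second as in the comparison rows.
pub fn slowdown(fast_ips: f64, slow_ips: f64) -> f64 {
    fast_ips / slow_ips
}
```

With the M1 numbers, `slowdown(1150.0, 148.0)` reproduces the reported 7.77x, while `throughput_mb_per_s` gives roughly 420 MB/s for the small file and roughly 1,390 MB/s for the large one. Under the stated unit assumption, the larger file is decoded more efficiently per byte, which would be consistent with a fixed per-call overhead.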

1 reply

jorgecarleitao Jun 5, 2022
Maintainer

Awesome. Thanks a lot for the feedback and for sharing the benchmarks. Do you know how they compare against the JSON version? I'm curious about the performance against the baseline.

joshuataylor (Author) · 2022-06-05T07:07:07Z

It's hard to compare against the JSON implementation, as there we get back strings/integers and then also need to cast them. I'll create proper benchmarks later that compare the full decode process we have to go through with the JSON approach (as we want to map these to Elixir types), whereas with Arrow we can do this all in Rust and return what we need to Elixir.

But for parsing the files, here are similar JSON files from the same result set, closest to the file size that Snowflake returns. This uses Jason, which is a pure Elixir JSON decoder. If we wanted a fair comparison we should also test with a JSON NIF, but 🤷‍♂️

Removed benchmarks as I realised I was only getting partial results for larger files.
