I am wondering if there is a way in gleam to better distribute jobs across available agents rather than just executors. Consider the following example:
1. Read a text file containing (key, value) pairs, where many keys are repeated.
2. Partition by key, then process each value and transform it into something else.
3. Write the values to disk, distributed evenly by key across the agents.
Let's say there are 6 possible keys, ["key1", "key2", ... , "key6"], and I am running 3 agents on different machines. Ideally, I would like to have the following distribution after my flow:
- agent1: only has values with two different keys
- agent2: only has values with two different keys
- agent3: only has values with two different keys
For the example, consider the following flow:
```go
const NUM_AGENTS = 3

...

f := flow.New("example").
	Read(file.Txt("data/*", 4)).
	Map("process the values", ProcessValues).
	PartitionByKey("shard by key", NUM_AGENTS).
	Map("write to local file on agent's filesystem", WriteToFile)

...
```
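For completeness, the mappers are registered the usual way with gio and the flow runs in distributed mode. The sketch below is roughly what the full program looks like; the mapper bodies are simplified placeholders (the comma-separated line format and the `distributed.Option()` call are just how I've sketched it here, not necessarily my exact setup):

```go
package main

import (
	"strings"

	"github.com/chrislusf/gleam/distributed"
	"github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/plugins/file"
)

const NUM_AGENTS = 3

var (
	// Mapper IDs referenced by the flow; gio runs these on the agents.
	ProcessValues = gio.RegisterMapper(processValues)
	WriteToFile   = gio.RegisterMapper(writeToFile)
)

// processValues splits a text line into (key, value) and transforms the value.
func processValues(row []interface{}) error {
	parts := strings.SplitN(gio.ToString(row[0]), ",", 2) // assuming comma-separated lines
	if len(parts) != 2 {
		return nil // skip malformed lines
	}
	key, value := parts[0], parts[1]
	// ... real transformation of value elided ...
	gio.Emit(key, value)
	return nil
}

// writeToFile appends each (key, value) row to a file on the agent's local disk.
func writeToFile(row []interface{}) error {
	// ... actual local write elided ...
	return nil
}

func main() {
	gio.Init() // must run first so the same binary can act as a mapper on the agents

	flow.New("example").
		Read(file.Txt("data/*", 4)).
		Map("process the values", ProcessValues).
		PartitionByKey("shard by key", NUM_AGENTS).
		Map("write to local file on agent's filesystem", WriteToFile).
		Run(distributed.Option())
}
```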
Using this flow, I don't get the results I'm looking for. Instead, gleam just picks whatever executors are available, regardless of which agent they run on, so I might end up with something like this:
- agent1: has values with three different keys
- agent2: has values with one key
- agent3: has values with two different keys
Is there a way to prioritize available agents, rather than just available executors, without manually running jobs on them? In particular, I would like the partition step to shard the 6 keys into groups of 2, 2, and 2 on separate agents, and to use only the executors available on those agents.
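To make the intent concrete, the sketch below (same package and imports as the program above) shows the kind of explicit key-to-shard assignment I have in mind. The `desiredShard` map, the `AssignShard` mapper, and the "assign shard" step name are all mine, purely for illustration, not a gleam API:

```go
// Illustrative only: the explicit key -> shard split I'm after (2, 2, 2).
var desiredShard = map[string]int{
	"key1": 0, "key2": 0, // intended for agent1
	"key3": 1, "key4": 1, // intended for agent2
	"key5": 2, "key6": 2, // intended for agent3
}

// AssignShard re-keys each (key, value) row by its shard index, so the
// following PartitionByKey groups rows by intended shard rather than by raw key.
var AssignShard = gio.RegisterMapper(assignShard)

func assignShard(row []interface{}) error {
	key := gio.ToString(row[0])
	gio.Emit(desiredShard[key], key, row[1])
	return nil
}
```

The flow would then gain an extra `.Map("assign shard", AssignShard).` step just before `PartitionByKey("shard by key", NUM_AGENTS)`, but as far as I can tell the placement of the resulting partitions is still entirely up to the scheduler: there is no way to say "partition 0 goes to agent1, partition 1 to agent2, ...", which is the part I'd like to influence.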