Request: Support sharing mutable memory between host and guest #594
Even though the high-level semantics look like "copies" are being made when passing component-level values into and out of components from the host, in a real host implementation, the […] That being said, maybe there are other use cases not addressed by the above two points, so I'd be happy to dig into other concrete use cases.
@lukewagner for the concrete use case in AICI, I think @mmoskal can speak to the latency constraints of AICI better than I could. That said, to add specificity to this GitHub issue, I'll describe some of the workings of guided LLM generation in the context of AICI. There are three processes involved:
The LLM inference server can serve multiple requests concurrently in a batch. Each request is called a sequence, and there is a 1:1 correspondence between sequences produced by the LLM and AICI controllers. The LLM inference server processes all sequences at once in a batch, producing one token per sequence per decoding cycle. A decoding cycle, for our purposes, consists of two phases: logit generation and sampling. Generation is the more compute-intensive step, and depending on the language model and inference server, this step takes from 100ms down to under 2ms (e.g., Groq).

Each AICI controller is tasked with, given the result of the previous step's sampling, producing a set of logit biases as an array. These logit biases are of a fixed size for a given language model, on the order of 200-500 KiB (between 32k and 100k 32-bit floating point numbers). After generation and before or during sampling, these biases are applied to affect the token chosen for the sequence during sampling.

This situation lends itself extremely well to mutable shared memory. The buffers are fixed size, multiple processes or threads are involved, and the shared memory could be mapped into coherent memory on an accelerator. This matters because the memory subsystem here is "slow", and round-tripping through main system memory is a real cause of poor performance in LLMs. We don't want to stall the inference server for any reason, as a stall either delays the entire batch or wastes compute as the sequences that miss a deadline must backtrack.

While I could imagine a mechanism that uses an io_uring-like buffer, managing a pair of buffers across these processes seems much more complex - and error prone. And while I'm familiar with the Are You Sure You Want to Use MMAP in Your Database Management System? paper, I think it's hard to overstate how many projects (databases or otherwise) have bootstrapped themselves by using memory mapped IO to let operating system kernels do the heavy lifting, building a proof of concept prior to building their own primitives.
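To make the shape of that inner loop concrete, here is a minimal sketch (plain Rust, not AICI's actual code) of how a vocab-sized bias array produced by a controller might be applied to a sequence's logits before sampling; the function names and the greedy sampler are illustrative assumptions.

```rust
/// Illustrative only: add a per-sequence bias array to that sequence's logits,
/// then pick a token. A real inference server would do this on the accelerator
/// with a proper sampler; `apply_bias` and `greedy_sample` are made-up names.
fn apply_bias(logits: &mut [f32], bias: &[f32]) {
    // Both arrays are vocab-sized (e.g. 32k-100k f32s).
    assert_eq!(logits.len(), bias.len());
    for (l, b) in logits.iter_mut().zip(bias) {
        *l += *b; // a bias of f32::NEG_INFINITY effectively masks a token out
    }
}

/// Greedy sampling, just for the sketch: take the highest-scoring token id.
fn greedy_sample(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let mut logits = vec![0.5f32, 1.0, 0.25, 2.0];
    let mut bias = vec![0.0f32; logits.len()];
    bias[3] = f32::NEG_INFINITY; // the controller forbids token 3
    apply_bias(&mut logits, &bias);
    assert_eq!(greedy_sample(&logits), 1); // token 1 now wins
}
```

The interesting property for this issue is that `apply_bias` only ever needs a view of a fixed-size buffer, which is why a shared, mutable mapping (rather than copying the array in and out each decoding cycle) is attractive.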
Another salient, concrete example comes to mind from my day job at Pulumi. This was another use case where memory mapped IO is valuable for reducing latency and resident memory.

Pulumi is an infrastructure as code system that uses a plugin architecture to support managing various cloud providers with various languages. Cloud provider plugins implement a schema describing the resources they manage, and schemas range in size from kilobytes to over 100MiB for major clouds. Language plugins support writing Pulumi code in six languages or interpreter runtimes (Node.js, Python, Go, .NET, Java, and YAML). The Pulumi engine and CLI manage launching these plugin processes, and each kind of plugin implements a gRPC protocol. Most of our languages use a generated per-provider SDK to help ensure the correctness of programs written in Pulumi. The YAML language runtime, however, does not have any SDK - instead, at runtime, it relies on a provider plugin's schema to perform type-checking.

For one provider in particular, the schema is on the order of 100MiB, and if naively implemented, the YAML language plugin would request the schema from the engine, and the engine would intermediate and request the schema from the provider plugin. This would result in an excessive number of copies of the schema present in memory, which can result in OOMs in memory-constrained CI/CD systems. The solution implemented was to support caching the schema to a file and use memory mapped IO. (Speaking as an individual who likes to hack on these things.) If the Pulumi engine were to support provider plugins implemented in WASM, I expect we would still want to use memory mapped IO for the […]

Edit: I should add here that while it may seem "obvious" to simply ship the schema as a file alongside the plugin, some cloud provider plugins rely on the schema as an artifact to determine the cloud APIs to call or how to encode/decode RPCs. The plugin in particular that had a 100MiB schema is one of those. And of course, it's much nicer to be able to ship a single binary with embedded data than a set of files, and it leaves less room for error. There are more design constraints here than I can go into in one aside though. :)
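For illustration (this is not Pulumi's actual code), a minimal Rust sketch of the schema-caching approach, assuming the schema has already been written to a cache file; the path and the choice of the `memmap2` crate are assumptions.

```rust
use std::fs::File;

use memmap2::Mmap; // memmap2 = "0.9" (illustrative choice of crate)

fn main() -> std::io::Result<()> {
    // Hypothetical cache location; Pulumi's real cache layout is not shown here.
    let file = File::open("/tmp/pulumi-schema-cache/provider.json")?;

    // Map the file read-only. The pages are backed by the page cache, so every
    // process mapping the same file shares the same physical memory, and the
    // kernel can evict clean pages under pressure instead of OOMing.
    let schema: Mmap = unsafe { Mmap::map(&file)? };

    // `Mmap` derefs to &[u8]; parse or index into it lazily rather than copying
    // the whole ~100MiB schema into an owned buffer.
    println!("schema is {} bytes", schema.len());
    Ok(())
}
```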
@lukewagner IIUC, both io_uring-style streaming and a directly usable i32 offset share the same strong assumption: that the wasm code plays the producer (allocating resources and managing them). In those cases, the question becomes how to let the host consume data in the linear memory efficiently. But on the other side, wasm modules are plugins and actors in a bigger system. The host manages resources and holds data, like incoming messages, raw bytes and so on. Because of the security of the sandbox, we need a copy-in to pass arguments and a copy-out to return results for a host-wasm call. Even worse, that data isn't only from MMU-managed memory.
I suggest this issue should be moved to https://github.com/WebAssembly/ComponentModel - any concerns for how Wasm primitives can be shared between guests, or host and guest, are in the Component Model's domain, and WASI just builds on top of what the CM provides.
Happy to move discussion over to the Component Model repo (also, the URL has a hyphen, so it's: https://github.com/WebAssembly/component-model/), but just to try to reply with the context already given above:

@AaronFriel Thanks for providing all that background. From what I was able to understand, it sounds like what you are describing is very much a streaming use case. In this context, I think the io_uring-style of ABI (in which pointers are submitted for asynchronous reading-out-of and writing-into) should have great performance (potentially even better than the mmap approach, given that memory accesses that result in I/O are blocking). To be clear, I'm not saying that the idea is to literally standardize io_uring (in particular, there wouldn't be request/response ring buffers managed by wasm) -- I'm just describing the rough shape of the ABI (in which pointers are submitted when initiating async I/O operations). In a Windows setting, we could say this is "Overlapped I/O"-style.
Yes, but if we are talking about data that is to be read or written by wasm compiled from linear-memory languages (like C, C++ and Rust), that wasm expects the data to be in the default linear memory, and thus, one way or another, we have to get the data into that default linear memory. We can imagine trying to pass in a pointer to external host memory (via […]). Instead, with the io_uring-style of ABI, we can acknowledge that, independent of the above issue, there is inevitably a copy from the kernel (or DMA engine or smart NIC) into user space anyway, so let's have that one copy go directly into or out of the default linear memory (by supplying the […]).
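To illustrate only the rough shape of that ABI (the names below are invented, and this is not a proposed WASI interface), here is a host-side sketch using Wasmtime, where the guest submits a (pointer, length) pair into its own linear memory and the host copies incoming bytes directly into that region:

```rust
use wasmtime::{Caller, Linker, Memory};

/// Illustrative host import: the guest passes (ptr, len) into its own default
/// linear memory, and the host writes the next chunk of input directly there,
/// so the unavoidable kernel/user-space copy lands straight in linear memory.
fn add_host_read(linker: &mut Linker<Vec<u8>>) -> anyhow::Result<()> {
    linker.func_wrap(
        "host",
        "read_into", // made-up import name, not a real WASI function
        |mut caller: Caller<'_, Vec<u8>>, ptr: u32, len: u32| -> anyhow::Result<u32> {
            let memory: Memory = caller
                .get_export("memory")
                .and_then(|e| e.into_memory())
                .ok_or_else(|| anyhow::anyhow!("guest must export `memory`"))?;

            // For the sketch, the Store's data is the pending input; a real
            // host would pull from a socket, file, or completion queue here.
            let pending = std::mem::take(caller.data_mut());
            let n = pending.len().min(len as usize);

            // One copy, directly into the guest's default linear memory.
            memory.write(&mut caller, ptr as usize, &pending[..n])?;
            Ok(n as u32)
        },
    )?;
    Ok(())
}
```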
I'm not sure I agree. For the description of the LLM problem, the data is of a fixed, well-defined size and shared by multiple processes, and synchronization over main memory can even be considered "slow". We could imagine the buffer backing the logit biases being mapped into coherent memory on an accelerator, or shared via IPC with multiple other processes written in other languages.

For the description of plugin provider schemas and large RPC calls, streaming might be more appropriate, but in practice what we saw was that the duplication of pages in memory on both sides of the RPC interface doubled or tripled peak memory usage. In the scenario described, three processes are involved: a provider plugin, the engine, and a language host plugin. If the engine is intermediating, it should not need to materialize a copy of the schema in memory. However, the provider plugin would read all of its […] I believe that this will result in memory pressure; I think the kernel will be fine evicting .rodata pages or equivalent, but it still results in at least one extra copy in memory.

That is, I think streaming IO has overheads that aren't recognized here, and while the latencies might be slight, they are non-zero, and it seems weird to contort the system such that the memory must be owned by a singular WASM guest. For guest-to-guest memory sharing, it sounds like this naturally requires multiple copies, or in the LLM use case, an entirely different process on the host could own the memory, which could be mapped by the IOMMU into a different device!
Ah, is the data shared read-only, and is the goal here to have the same physical pages mapped into the virtual address space of each process?
For the LLM use case, it is mutable.
To try to understand which part of the whole architecture we're talking about here: for this mutable data, are we talking about the big arrays of logit biases you mentioned above:

> Each AICI controller is tasked with, given the result of the previous step's sampling, producing a set of logit biases as an array.
? If so, then it sounded like while, at the low level, there is memory mutation, at a high level, what's happening is that data is being passed into the guest, which computes some new data that is then passed out of the guest. I'm also guessing (but let me know if that's not right) that a single wasm instance is reused many times in a row (or even concurrently). Thus, at a high level, it seems like one could think of the wasm as implementing a big function from inputs to outputs (logit biases).

Given that, the point I was also just making above is that wasm has a hard time working with anything outside of its default linear memory. Thus, we somehow need to get the input data into linear memory for the wasm to be able to access it. Similarly, the output logit arrays will also be written into the same linear memory, because that's all the guest can write to directly from regular C/C++/Rust/... code. But let me know if you've built something with wasm that works differently using multiple memories or references or something else; I can imagine various hypothetical alternative approaches, but so far they have all appeared difficult to get working in practice (e.g., with existing C/C++ code), and I'm always interested to hear if folks have built something that works.

Otherwise, it seems like the basic problem statement is: how do we efficiently bulk-move data into and out of wasm's linear memory as part of making this big function call? Now, "linear memory" is just a spec-level concept that doesn't have to be implemented with plain anonymous vmem in the runtime: it is also possible (and some engines already do this) to map files or other things into linear memory (e.g., using […]). It's hard to discuss this question in the abstract because how precisely mapping works varies greatly by API and OS and more; this is also why it's hard to spec mapping portably at the wasm layer without simply ruling out whole classes of embeddings. But if you have a particular host API that you'd like to discuss in the context of this LLM scenario, I'd be happy to dig into that.
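As a purely conceptual sketch of that kind of engine-level trick (this is host/runtime implementation territory, not something guest code or current WASI APIs express; the function and its parameters are made up, and it assumes Linux mmap semantics and a page-aligned offset):

```rust
use std::fs::File;
use std::os::fd::AsRawFd; // also assumes the `libc` crate, e.g. libc = "0.2"

/// Conceptual only: replace a page-aligned range inside an already-reserved
/// linear-memory region with read-only, file-backed pages, so guest loads at
/// `guest_offset` see the file's contents without an intermediate copy.
/// Real engines hide this kind of thing behind their own memory-creation hooks.
unsafe fn map_file_into_linear_memory(
    linear_base: *mut u8, // base address of the reserved linear memory
    guest_offset: usize,  // page-aligned offset chosen inside that memory
    file: &File,
    len: usize,
) -> std::io::Result<()> {
    let addr = unsafe {
        libc::mmap(
            linear_base.add(guest_offset) as *mut libc::c_void,
            len,
            libc::PROT_READ,
            // MAP_FIXED swaps out the anonymous pages already mapped at this
            // address for pages backed by the file.
            libc::MAP_PRIVATE | libc::MAP_FIXED,
            file.as_raw_fd(),
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```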
I think the current (pre-Component) implementation in AICI is helpful to look at:
Due to the reasons you've described above, it's not easy today to share the memory; instead, and I believe to further reduce overheads, a bitmask is used. But it would be great, I think, to fully obviate the need for the bitmask, without compromise, by enabling the host to export functions which loan memory to a guest (a la […]). Even better if this works with shared memory, because I think it would allow the AICI runtime host to be almost entirely "hands off" on the shared memory and the inner loop as it acts as an intermediary between the LLM and the AICI controller guests. This would give the guests the ability to write full logit maps, and even participate in the sampling step of the LLM.
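As a hypothetical guest-side sketch of what "the host loans memory to a guest" could look like (none of these imports exist in AICI or WASI today; the module name, functions, and the assumption that the host has arranged for a region of the guest's linear memory to alias the shared buffer are all made up for illustration):

```rust
// Hypothetical guest-side view. The assumption: the host has arranged (e.g.
// via mapping) for a region of this module's linear memory to alias the shared
// logit-bias buffer, and the guest asks where that loaned region is instead of
// allocating its own buffer and copying.
#[link(wasm_import_module = "aici_host")] // made-up import module name
extern "C" {
    /// Pointer (in this module's linear memory) to the loaned bias buffer.
    fn loaned_bias_ptr() -> *mut f32;
    /// Number of f32 entries in the loaned buffer (the vocab size).
    fn loaned_bias_len() -> usize;
}

/// Borrow the loaned region as an ordinary mutable slice for one decoding step.
fn with_logit_bias<R>(f: impl FnOnce(&mut [f32]) -> R) -> R {
    // Safety: we trust the host to hand back a valid, exclusively-loaned
    // region of our own linear memory for the duration of this call.
    let bias =
        unsafe { std::slice::from_raw_parts_mut(loaned_bias_ptr(), loaned_bias_len()) };
    f(bias)
}

fn forbid_token(token_id: usize) {
    with_logit_bias(|bias| bias[token_id] = f32::NEG_INFINITY);
}
```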
Thanks for all the links, that definitely helps paint a better picture of the system we're talking about. I didn't read all the code, so I may be missing the bigger picture, but from my brief reading, it looks like the shmem is being mmap'ed into a location outside of wasm's default linear memory (pointed to by […]). It is worth asking how we could further improve upon this situation with even fancier mapping techniques that further reduce copying, of course. But at least, if I'm not missing something, we'd not be regressing things as-is, so we could open this up as a future optimization discussion.
It looks like the WASI specification, with the component model / interface types specification as written today, does not support a shared memory type. Mutable shared memory is an important capability for many kinds of software that have tight bounds on latency and/or would benefit from zero-copy IO, ranging from:

- `mmap` backed buffers from the host

In fact, the current design of the WASI specification makes it more difficult to share memory between host and guest as compared to the pre-component-model specification, by constraining input & output codegen to specific named types.
There are many ways to address this gap, e.g.: `list<T>` could support being declared as a `borrow<list<T>>`. In Rust, this would be received as a `&mut [T]`, in JavaScript a `TypedArray`, and so on.

For a concrete use case, the Microsoft AI Controller Interface project uses shared memory to reduce latency on mutating moderately large arrays of floating point numbers representing the logits of a large language model.
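To make the suggestion concrete, here is a purely illustrative Rust-side sketch of what such generated guest bindings could look like; the WIT fragment in the comment and the trait are hypothetical, since no `borrow<list<T>>` binding exists today:

```rust
// Hypothetical WIT, if borrowed lists existed at the component level:
//
//   adjust-logits: func(biases: borrow<list<f32>>)
//
// and a hypothetical shape for the generated guest-side trait:
pub trait Controller {
    /// The host lends a view of the buffer; the guest mutates it in place
    /// instead of receiving an owned list and returning a new one (which is
    /// what today's `list<f32>` lowering implies).
    fn adjust_logits(biases: &mut [f32]);
}

struct MyController;

impl Controller for MyController {
    fn adjust_logits(biases: &mut [f32]) {
        // Example policy: mask out token 0.
        if let Some(b) = biases.first_mut() {
            *b = f32::NEG_INFINITY;
        }
    }
}
```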