Skip to content

Run LLaMA inference on CPU, with Rust πŸ¦€πŸš€πŸ¦™

License

Notifications You must be signed in to change notification settings

leonardohn/llama-rs

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

LLaMA-rs

Do the LLaMA thing, but now in Rust πŸ¦€πŸš€πŸ¦™

A llama riding a crab, AI-generated

Image by @darthdeus, using Stable Diffusion

ko-fi

Latest version MIT Discord

Gif showcasing language generation using llama-rs

LLaMA-rs is a Rust port of the llama.cpp project. This allows running inference for Facebook's LLaMA model on a CPU with good performance using full precision, f16 or 4-bit quantized versions of the model.

Just like its C++ counterpart, it is powered by the ggml tensor library, achieving the same performance as the original code.

Getting started

Make sure you have a Rust 1.65.0 or above and C toolchain1 set up, and get a copy of the model's weights2.

llama-rs is a Rust library, while llama-cli is a CLI application that wraps llama-rs and offers basic inference capabilities.

The following instructions explain how to build llama-cli.

NOTE: For best results, make sure to build and run in release mode. Debug builds are going to be very slow.

Building using cargo

Run

cargo install --git https://github.com/rustformers/llama-rs llama-cli

to install llama-cli to your Cargo bin directory, which rustup is likely to have added to your PATH.

It can then be run through llama-cli.

Building from repository

Clone the repository, and then build it through

cargo build --release

The resulting binary will be at target/release/llama-cli[.exe].

It can also be run directly through Cargo, using

cargo run --release -- <ARGS>

This is useful for development.

Running

For example, try the following prompt:

llama-cli -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"

Some additional things to try:

  • Use --help to see a list of available options.

  • If you have the alpaca-lora weights, try --repl mode!

    llama-cli -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt --repl

    Gif showcasing alpaca repl mode

  • Sessions can be loaded (--load-session) or saved (--save-session) to file. To automatically load and save the same session, use --persist-session. This can be used to cache prompts to reduce load time, too:

    Gif showcasing prompt caching

    (This GIF shows an older version of the flags, but the mechanics are still the same.)

Q&A

Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me to rewrite this in the name of the Holy crab.

Seriously now.

Come on! I don't want to get into a flame war. You know how it goes, something something memory something something cargo is nice, don't make me say it, everybody knows this already.

I insist.

Sheesh! Okaaay. After seeing the huge potential for llama.cpp, the first thing I did was to see how hard would it be to turn it into a library to embed in my projects. I started digging into the code, and realized the heavy lifting is done by ggml (a C library, easy to bind to Rust) and the whole project was just around ~2k lines of C++ code (not so easy to bind). After a couple of (failed) attempts to build an HTTP server into the tool, I realized I'd be much more productive if I just ported the code to Rust, where I'm more comfortable.

Is this the real reason?

Haha. Of course not. I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust.

How is this different from llama.cpp?

This is a reimplementation of llama.cpp that does not share any code with it outside of ggml. This was done for a variety of reasons:

  • llama.cpp requires a C++ compiler, which can cause problems for cross-compilation to more esoteric platforms. An example of such a platform is WebAssembly, which can require a non-standard compiler SDK.
  • Rust is easier to work with from a development and open-source perspective; it offers better tooling for writing "code in the large" with many other authors. Additionally, we can benefit from the larger Rust ecosystem with ease.
  • We would like to make ggml an optional backend (see this issue).

In general, we hope to build a solution for model inferencing that is as easy to use and deploy as any other Rust crate.

Footnotes

  1. A modern-ish C toolchain is required to compile ggml. A C++ toolchain should not be necessary. ↩

  2. The only legal source to get the weights at the time of writing is this repository. The choice of words also may or may not hint at the existence of other kinds of sources. ↩

About

Run LLaMA inference on CPU, with Rust πŸ¦€πŸš€πŸ¦™

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Rust 100.0%