Replies: 8 comments 4 replies
-
Your grammar isn't correct, by the way; use the ones from here: https://github.com/ggerganov/llama.cpp/tree/master/grammars
-
I am aware that the grammar I am using does not, strictly speaking, match the JSON format. For the particular purpose I am using it for, I have made some changes that work better for me. However, even using e.g. https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf, I still have the same issue as with my own grammar (GPU usage is restricted to ~30%, ~30 tokens/sec vs ~80 tokens/sec with no grammar), and the issue is much more severe with Llama 3 than with other models (like Solar 10.7B, where I do get 100% GPU usage).
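To be concrete, the comparison is simply the same run with and without the grammar flag, roughly along these lines (the model file, prompt, and GPU-layer count below are placeholders, not the exact command used for the numbers above):

```sh
# with grammar: ~30 tok/s, GPU usage capped around 30%
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -n 256 \
  --grammar-file grammars/json.gbnf \
  -p "Answer with a JSON object."

# without grammar: ~80 tok/s
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -n 256 \
  -p "Answer with a JSON object."
```

(The binary is `main` here; on newer llama.cpp builds it is `llama-cli`, but the flags are the same.)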
-
I ran it a few days ago using this GGUF: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF. I used a JSON grammar (a custom one), and on an RTX 4090 I was at 110 tok/s (output).
-
Maybe it's a 3090 issue?
-
I'm getting the same issue as well, so it must be a llama.cpp problem.
-
I ran numerous tests for a blog post on this exact topic (see here), and, with a smaller vocabulary than Llama 3's, I saw latencies even worse than the ones reported here. I'm currently working on a revision that uses CodeGemma (i.e. an even larger vocabulary than Llama 3's), and the results further confirm the scaling issues.
-
You're right. In my first message I checked the token output, not the sampling time; it's clearly slower on sampling time.
-
Well well well, here we go. @nagolinc I'm having the same issue right now; did you find a fix?
-
Has anyone else noticed that Llama 3 runs much slower when using a grammar file than other models do?
Llama-3-Smaug-8B-IQ4_XS.gguf, no grammar
Llama-3-Smaug-8B-IQ4_XS.gguf, with grammar
So I'm getting way fewer tokens/sec when using a grammar. The slowdown isn't nearly as bad with other models.
nous-hermes-2-solar-10.7b-misaligned.Q5_K_M.gguf, no grammar
nous-hermes-2-solar-10.7b-misaligned.Q5_K_M.gguf, with grammar
This is the grammar I'm using (I just want JSON-formatted output):
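A minimal JSON grammar in llama.cpp's GBNF syntax, along the lines of the repository's grammars/json.gbnf, looks roughly like this (an illustrative sketch; not the exact grammar used for the timings above):

```
# Minimal JSON grammar sketch in GBNF -- illustrative, not the verbatim file used above
root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws

object ::= "{" ws (member ("," ws member)*)? "}" ws
member ::= string ":" ws value
array  ::= "[" ws (value ("," ws value)*)? "]" ws

string ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]))* "\"" ws
number ::= "-"? [0-9]+ ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
ws     ::= [ \t\n]*
```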