Replies: 8 comments 4 replies
-
Your grammar isn't correct, by the way; use the ones from here: https://github.com/ggerganov/llama.cpp/tree/master/grammars
-
I am aware that the grammar I am using does not, strictly speaking, match the JSON format. For the particular purpose I am using it for, I have made some changes that work better for me. However, even using e.g. https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf, I still have the same issue as with my own grammar (GPU usage is restricted to ~30%, ~30 tokens/sec vs ~80 tokens/sec with no grammar), and the issue is much more severe with Llama 3 than with other models (like Solar 10.7B, where I do get 100% GPU usage).
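To be concrete, the comparison is simply the same run with and without the grammar flag, roughly along these lines (the model file, prompt, and GPU-layer count below are placeholders, not the exact command used for the numbers above):

```sh
# with grammar: ~30 tok/s, GPU usage capped around 30%
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -n 256 \
  --grammar-file grammars/json.gbnf \
  -p "Answer with a JSON object."

# without grammar: ~80 tok/s
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -n 256 \
  -p "Answer with a JSON object."
```

(The binary is `main` here; on newer llama.cpp builds it is `llama-cli`, but the flags are the same.)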
-
I ran it a few days ago using this GGUF: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF. I used a JSON grammar (a custom one), and on an RTX 4090 I was at 110 tok/s (output).
-
Maybe it's a 3090 issue?
-
I'm getting the same issue as well, so it must be a llama.cpp problem.
-
I ran numerous tests for a blog post on this exact topic (see here), and, with a smaller vocabulary than Llama 3's, I saw latencies even worse than the ones reported here. I'm currently working on a revision that uses CodeGemma (i.e. an even larger vocabulary than Llama 3's), and the results further confirm the scaling issues.
-
You're right. In my first message I checked the token output, not the sampling time; it's clearly slower on sampling time.
-
Well well well, here we go. @nagolinc I'm having the same issue right now; did you find a fix?
-
Has anyone else noticed that Llama 3 runs much slower when using a grammar file than other models do?
Llama-3-Smaug-8B-IQ4_XS.gguf, no grammar
Llama-3-Smaug-8B-IQ4_XS.gguf, with grammar
So I'm getting way fewer tokens/sec when using a grammar. The slowdown isn't nearly as bad with other models.
nous-hermes-2-solar-10.7b-misaligned.Q5_K_M.gguf, no grammar
nous-hermes-2-solar-10.7b-misaligned.Q5_K_M.gguf, with grammar
This is the grammar I'm using (I just want JSON-formatted output):
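A minimal JSON grammar in llama.cpp's GBNF syntax, along the lines of the repository's grammars/json.gbnf, looks roughly like this (an illustrative sketch; not the exact grammar used for the timings above):

```
# Minimal JSON grammar sketch in GBNF -- illustrative, not the verbatim file used above
root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws

object ::= "{" ws (member ("," ws member)*)? "}" ws
member ::= string ":" ws value
array  ::= "[" ws (value ("," ws value)*)? "]" ws

string ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]))* "\"" ws
number ::= "-"? [0-9]+ ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
ws     ::= [ \t\n]*
```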