Replies: 3 comments
-
I would like to know this too. Does llama-cpp-python have a native, built-in way to do this?
-
I also tried the approaches described here, but they don't work.
-
Any solution?
-
Hi,
I am running on a GPU using the following command:
python3 -m llama_cpp.server --host xx.xx.xxx.xx --port 4444 --model /home/user1/llama.cpp-old/models/codellama-7b-instruct.Q8_0.gguf --n_gpu_layers -1 --n_threads 5 --n_threads_batch 5 --interrupt_requests false
However, I still can't get concurrent inference requests to work. I need multiple inference requests to return answers at the same time, but currently all requests get queued and handled one at a time.
Any idea how to do that?
Thank you
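To see whether requests are actually being served in parallel or just queued, one option is to fire several requests at once from a small client and compare their latencies. This is only a test harness, not a fix: it assumes the server's OpenAI-compatible /v1/completions endpoint is reachable at the host and port from the command above, and the prompt, max_tokens, and worker count are placeholder values.
```python
# Minimal sketch: send several completion requests in parallel and time them.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Host/port taken from the server command above; adjust to your setup.
BASE_URL = "http://xx.xx.xxx.xx:4444/v1/completions"

def one_request(i: int) -> float:
    """Send a single completion request and return its wall-clock latency."""
    start = time.time()
    requests.post(
        BASE_URL,
        json={
            # The model field may be ignored when only one model is served.
            "model": "codellama-7b-instruct.Q8_0.gguf",
            "prompt": f"Request {i}: write a one-line Python comment.",
            "max_tokens": 32,
        },
        timeout=120,
    )
    return time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        latencies = list(pool.map(one_request, range(4)))
    # If the server queues requests, latencies grow roughly linearly
    # (1x, 2x, 3x, ...); if they run concurrently, they stay close together.
    print([f"{t:.1f}s" for t in latencies])
```
If the measured latencies climb step by step, the server is processing one request at a time, which matches the queueing behavior described above.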