Replies: 3 comments
-
I would like to know this too. Does llama-cpp-python have a native, built-in way to do this?
-
I also tried the approaches described here, but they don't work.
-
Any solution?
-
Hi,
I am running on a GPU using the following command:
python3 -m llama_cpp.server --host xx.xx.xxx.xx --port 4444 --model /home/user1/llama.cpp-old/models/codellama-7b-instruct.Q8_0.gguf --n_gpu_layers -1 --n_threads 5 --n_threads_batch 5 --interrupt_requests false
However, I still can't get concurrent inference requests to work. I need multiple inference requests to return answers at the same time, but currently all requests get queued and handled one at a time.
Any idea how to do that?
Thank you
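To see whether requests are actually being served in parallel or just queued, one option is to fire several requests at once from a small client and compare their latencies. This is only a test harness, not a fix: it assumes the server's OpenAI-compatible /v1/completions endpoint is reachable at the host and port from the command above, and the prompt, max_tokens, and worker count are placeholder values.
```python
# Minimal sketch: send several completion requests in parallel and time them.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Host/port taken from the server command above; adjust to your setup.
BASE_URL = "http://xx.xx.xxx.xx:4444/v1/completions"

def one_request(i: int) -> float:
    """Send a single completion request and return its wall-clock latency."""
    start = time.time()
    requests.post(
        BASE_URL,
        json={
            # The model field may be ignored when only one model is served.
            "model": "codellama-7b-instruct.Q8_0.gguf",
            "prompt": f"Request {i}: write a one-line Python comment.",
            "max_tokens": 32,
        },
        timeout=120,
    )
    return time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        latencies = list(pool.map(one_request, range(4)))
    # If the server queues requests, latencies grow roughly linearly
    # (1x, 2x, 3x, ...); if they run concurrently, they stay close together.
    print([f"{t:.1f}s" for t in latencies])
```
If the measured latencies climb step by step, the server is processing one request at a time, which matches the queueing behavior described above.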