== Install vLLM ==

Use `envs/env-vllm.yml` to create a Miniforge env.

 $ source conda-init.sh
 (base) $ conda env create -f envs/env-vllm.yml
 ...

Activate the vLLM environment.

 (base) $ conda activate vllm
 (vllm) $

Install the dependencies.

 (vllm) $ pip install --upgrade pip
 ...
 (vllm) $ pip install torch torchvision
 ...

Clone vLLM.

 (vllm) $ git clone https://github.com/vllm-project/vllm.git

Build it.

 (vllm) $ cd vllm
 (vllm) $ pip install -e .
 ...

Done!

== Test With a Model ==

Download a model, such as https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct[meta-llama/Llama-3.1-8B-Instruct], and run it.

 (vllm) $ hf download meta-llama/Llama-3.1-8B-Instruct
 ...
 (vllm) $ python3
 ...
 >>> from vllm import LLM, SamplingParams
 >>> import torch
 >>> torch.backends.mps.is_available()
 True
 >>> llm = LLM(model="MODEL_PATH_OR_NAME", tensor_parallel_size=1, trust_remote_code=False, dtype="float16")
 ...
 INFO 03-29 15:00:00 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
 INFO 03-29 15:00:00 [model.py:549] Resolved architecture: LlamaForCausalLM
 WARNING 03-29 15:00:00 [model.py:2016] Casting torch.bfloat16 to torch.float16.
 INFO 03-29 15:00:00 [model.py:1678] Using max model len 131072
 WARNING 03-29 15:00:00 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
 INFO 03-29 15:00:00 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4096.
 INFO 03-29 15:00:00 [vllm.py:786] Asynchronous scheduling is enabled.
 ...
 (EngineCore pid=83442) INFO 03-29 15:00:05 [cpu_worker.py:109] Warning: NUMA is not enabled in this build. `init_cpu_threads_env` has no effect to setup thread affinity.
 (EngineCore pid=83442) INFO 03-29 15:00:05 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.223.249:52310 backend=gloo
 (EngineCore pid=83442) INFO 03-29 15:00:08 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
 (EngineCore pid=83442) INFO 03-29 15:00:08 [cpu_model_runner.py:71] Starting to load model MODEL_PATH_OR_NAME...
 Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
 ...
 >>>

MODEL_PATH_OR_NAME is either something like:

* `/Users/johndoe/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/` if it is a path, or
* `meta-llama/Llama-3.1-8B-Instruct` if it is a name

Configure sampling and ask a question.

 >>> sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
 >>> prompt = "You are a helpful assistant from 18th-century England. Respond using appropriate language. Introduce the topic of medieval castle design in five sentences."
 >>> outputs = llm.generate([prompt], sampling_params)
 Rendering prompts: 100%|...| 1/1 [00:00<00:00, 17.47it/s]
 Processed prompts: 100%|...| 1/1 [00:21<00:00, 21.17s/it, est. speed input: 1.46 toks/s, output: 4.72 toks/s]
 >>> for output in outputs:
 ...     print(output.outputs[0].text)
 ...
 Good morrow to thee, kind sir or madam! I am at thy service. 'Tis a grand pleasure to converse with one such as thyself. I daresay, we find ourselves in the midst of a most fascinating discussion, one that touches upon the grand structures of old - the medieval castles. The designs of these fortified strongholds, as they stood sentinel over the land, were a testament to the ingenuity and craftsmanship of their builders, who strove to create
 >>> exit()
 (EngineCore pid=83442) INFO 03-29 15:07:38 [core.py:1210] Shutdown initiated (timeout=0)
 (EngineCore pid=83442) INFO 03-29 15:07:38 [core.py:1233] Shutdown complete

Serve the model.
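As an aside, the interactive session above can also be captured as a standalone script. The sketch below uses the same model settings and sampling parameters; `build_prompt` and `run_demo` are hypothetical helpers, and `MODEL_PATH_OR_NAME` stands in for a real path or name as described above.

```python
def build_prompt(persona, topic, sentences="five"):
    # Hypothetical helper that assembles the same prompt used in the REPL session.
    return (
        f"You are a helpful assistant from {persona}. "
        "Respond using appropriate language. "
        f"Introduce the topic of {topic} in {sentences} sentences."
    )

def run_demo(model="MODEL_PATH_OR_NAME"):
    # Heavy import kept inside the function so the module loads without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=1,
              trust_remote_code=False, dtype="float16")
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
    prompt = build_prompt("18th-century England", "medieval castle design")
    for output in llm.generate([prompt], params):
        print(output.outputs[0].text)
```

Instead of offline inference, the same model can also be served over HTTP.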
 (vllm) $ python -m vllm.entrypoints.api_server \
     --model MODEL_PATH_OR_NAME \
     --tensor-parallel-size 1 \
     --host 0.0.0.0 \
     --port 8000 \
     --dtype float16
 INFO 03-29 15:32:01 [api_server.py:129] vLLM API server version 0.18.1rc1.dev221+g43cc5138e
 ...
 INFO 03-29 15:32:01 [model.py:549] Resolved architecture: LlamaForCausalLM
 WARNING 03-29 15:32:01 [model.py:2016] Casting torch.bfloat16 to torch.float16.
 INFO 03-29 15:32:01 [model.py:1678] Using max model len 131072
 WARNING 03-29 15:32:01 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
 INFO 03-29 15:32:01 [vllm.py:786] Asynchronous scheduling is enabled.
 INFO 03-29 15:32:01 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
 ...
 (EngineCore pid=84256) INFO 03-29 15:32:06 [cpu_worker.py:109] Warning: NUMA is not enabled in this build. `init_cpu_threads_env` has no effect to setup thread affinity.
 (EngineCore pid=84256) INFO 03-29 15:32:06 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.223.249:53531 backend=gloo
 (EngineCore pid=84256) INFO 03-29 15:32:06 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
 (EngineCore pid=84256) INFO 03-29 15:32:06 [cpu_model_runner.py:71] Starting to load model /Users/johndoe/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/...
 Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00
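Once the server is up, it can be queried over HTTP. The demo `api_server` exposes a `POST /generate` endpoint; the snippet below is a minimal stdlib client sketch, assuming the server is on `localhost:8000` and that extra JSON fields in the body are forwarded to `SamplingParams` (as the demo server does).

```python
import json
from urllib import request

API_URL = "http://localhost:8000/generate"  # demo api_server endpoint (assumed local)

def build_payload(prompt, max_tokens=100, temperature=0.7, top_p=0.95):
    # Field names mirror SamplingParams; the demo server passes extra
    # JSON keys straight into SamplingParams.
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": False,
    }

def generate(prompt, **sampling):
    # Blocking, non-streaming request; the demo server responds with
    # a JSON object containing the generated text.
    data = json.dumps(build_payload(prompt, **sampling)).encode("utf-8")
    req = request.Request(API_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

The same request can be made from the shell with `curl`, e.g. `curl -s http://localhost:8000/generate -H 'Content-Type: application/json' -d '{"prompt": "Introduce medieval castle design.", "max_tokens": 100}'`.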