= vLLM =

== Install vLLM ==

Use `envs/env-vllm.yml` to create a Miniforge environment.

----
$ source conda-init.sh
(base) $ conda env create -f envs/env-vllm.yml
...
----
Activate the vLLM environment.

----
(base) $ conda activate vllm
(vllm) $
----
Install the dependencies.

----
(vllm) $ pip install --upgrade pip
...
(vllm) $ pip install torch torchvision
...
----
Clone vLLM.

----
(vllm) $ git clone https://github.com/vllm-project/vllm.git
----

Build it.

----
(vllm) $ cd vllm
(vllm) $ pip install -e .
...
----

Done!
== Test With a Model ==

Download a model, such as https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct[meta-llama/Llama-3.1-8B-Instruct], and run it.

----
(vllm) $ hf download meta-llama/Llama-3.1-8B-Instruct
...
(vllm) $ python3
...
>>> from vllm import LLM, SamplingParams
>>> import torch
>>> torch.backends.mps.is_available()
True
>>> llm = LLM(model="MODEL_PATH_OR_NAME", tensor_parallel_size=1, trust_remote_code=False, dtype="float16")
...
INFO 03-29 15:00:00 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-29 15:00:00 [model.py:549] Resolved architecture: LlamaForCausalLM
WARNING 03-29 15:00:00 [model.py:2016] Casting torch.bfloat16 to torch.float16.
INFO 03-29 15:00:00 [model.py:1678] Using max model len 131072
WARNING 03-29 15:00:00 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
INFO 03-29 15:00:00 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4096.
INFO 03-29 15:00:00 [vllm.py:786] Asynchronous scheduling is enabled.
...
(EngineCore pid=83442) INFO 03-29 15:00:05 [cpu_worker.py:109] Warning: NUMA is not enabled in this build. `init_cpu_threads_env` has no effect to setup thread affinity.
(EngineCore pid=83442) INFO 03-29 15:00:05 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.223.249:52310 backend=gloo
(EngineCore pid=83442) INFO 03-29 15:00:08 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=83442) INFO 03-29 15:00:08 [cpu_model_runner.py:71] Starting to load model MODEL_PATH_OR_NAME...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:03<00:10, 3.60s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:10, 5.44s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:16<00:05, 5.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.49s/it]
(EngineCore pid=83442)
(EngineCore pid=83442) INFO 03-29 15:00:26 [default_loader.py:384] Loading weights took 17.97 seconds
(EngineCore pid=83442) INFO 03-29 15:00:26 [kv_cache_utils.py:1319] GPU KV cache size: 262,144 tokens
(EngineCore pid=83442) INFO 03-29 15:00:26 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 2.00x
(EngineCore pid=83442) INFO 03-29 15:00:31 [cpu_model_runner.py:82] Warming up model for the compilation...
(EngineCore pid=83442) INFO 03-29 15:01:02 [decorators.py:640] saved AOT compiled function to .../.cache/vllm/torch_compile_cache/torch_aot_compile/36db133f7a06cf0d1706adf64d0b95f254e21249fbbb1a84fadaf605c9b6d09f/rank_0_0/model
(EngineCore pid=83442) INFO 03-29 15:01:14 [monitor.py:76] Initial profiling/warmup run took 12.29 s
(EngineCore pid=83442) INFO 03-29 15:01:14 [cpu_model_runner.py:92] Warming up done.
(EngineCore pid=83442) INFO 03-29 15:01:14 [core.py:283] init engine (profile, create kv cache, warmup model) took 47.40 seconds
(EngineCore pid=83442) INFO 03-29 15:01:15 [vllm.py:786] Asynchronous scheduling is disabled.
(EngineCore pid=83442) WARNING 03-29 15:01:15 [vllm.py:855] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=83442) WARNING 03-29 15:01:15 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
>>>
----
`MODEL_PATH_OR_NAME` is either:

* a local path, such as `/Users/johndoe/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/`, or
* a Hugging Face model name, such as `meta-llama/Llama-3.1-8B-Instruct`
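The path variant above follows the Hugging Face hub cache layout: an `org/name` repo is stored as `models--org--name` under `~/.cache/huggingface/hub`, with snapshots one level below. A small stdlib-only sketch for locating that directory from a model name (`hf_cache_dir` is a hypothetical helper, not part of vLLM or `huggingface_hub`):

```python
from pathlib import Path

def hf_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Map a model name to its Hugging Face hub cache directory.

    The hub stores an 'org/name' repo as 'models--org--name'; the
    snapshot paths passed to vLLM live under <dir>/snapshots/<revision>/.
    """
    return Path(cache_root).expanduser() / ("models--" + repo_id.replace("/", "--"))

# e.g. ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct
print(hf_cache_dir("meta-llama/Llama-3.1-8B-Instruct"))
```

The exact snapshot revision under `snapshots/` depends on what `hf download` fetched, so list that subdirectory rather than hard-coding a hash.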
Configure sampling and ask a question.

----
>>> sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
>>> prompt = "You are a helpful assistant from 18th-century England. Respond using appropriate language. Introduce the topic of medieval castle design in five sentences."
>>> outputs = llm.generate([prompt], sampling_params)
Rendering prompts: 100%|...| 1/1 [00:00<00:00, 17.47it/s]
Processed prompts: 100%|...| 1/1 [00:21<00:00, 21.17s/it, est. speed input: 1.46 toks/s, output: 4.72 toks/s]
>>> for output in outputs:
...     print(output.outputs[0].text)
...
Good morrow to thee, kind sir or madam! I am at thy service. 'Tis a grand pleasure to converse with one such as thyself. I daresay, we find ourselves in the midst of a most fascinating discussion, one that touches upon the grand structures of old - the medieval castles. The designs of these fortified strongholds, as they stood sentinel over the land, were a testament to the ingenuity and craftsmanship of their builders, who strove to create
>>> exit()
(EngineCore pid=83442) INFO 03-29 15:07:38 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=83442) INFO 03-29 15:07:38 [core.py:1233] Shutdown complete
----
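The interactive session can also be collected into a small batch script. This is a sketch: the file name `offline_generate.py` and the argument handling are assumptions, and the vLLM import is deferred so the file only pulls in the engine when actually run inside the vllm environment.

```python
# offline_generate.py -- batch version of the REPL session above (a sketch).
# Usage (inside the vllm env): python offline_generate.py MODEL_PATH_OR_NAME
import sys

def generate(model: str, prompts: list[str]) -> None:
    """Load the model once, then print one completion per prompt."""
    from vllm import LLM, SamplingParams  # heavy import, deferred on purpose

    llm = LLM(model=model, tensor_parallel_size=1,
              trust_remote_code=False, dtype="float16")
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

if __name__ == "__main__" and len(sys.argv) > 1:
    generate(sys.argv[1], ["Introduce medieval castle design in five sentences."])
```

Batching several prompts into one `llm.generate` call amortizes the model load and warmup, which dominate the timings in the log above.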
Serve the model.

----
(vllm) $ python -m vllm.entrypoints.api_server \
    --model MODEL_PATH_OR_NAME \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype float16
INFO 03-29 15:32:01 [api_server.py:129] vLLM API server version 0.18.1rc1.dev221+g43cc5138e
...
INFO 03-29 15:32:01 [model.py:549] Resolved architecture: LlamaForCausalLM
WARNING 03-29 15:32:01 [model.py:2016] Casting torch.bfloat16 to torch.float16.
INFO 03-29 15:32:01 [model.py:1678] Using max model len 131072
WARNING 03-29 15:32:01 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
INFO 03-29 15:32:01 [vllm.py:786] Asynchronous scheduling is enabled.
INFO 03-29 15:32:01 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
...
(EngineCore pid=84256) INFO 03-29 15:32:06 [cpu_worker.py:109] Warning: NUMA is not enabled in this build. `init_cpu_threads_env` has no effect to setup thread affinity.
(EngineCore pid=84256) INFO 03-29 15:32:06 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.223.249:53531 backend=gloo
(EngineCore pid=84256) INFO 03-29 15:32:06 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=84256) INFO 03-29 15:32:06 [cpu_model_runner.py:71] Starting to load model /Users/johndoe/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:03<00:10, 3.45s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:06<00:06, 3.48s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:04, 4.80s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00, 3.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00, 3.63s/it]
(EngineCore pid=84256)
(EngineCore pid=84256) INFO 03-29 15:32:21 [default_loader.py:384] Loading weights took 14.52 seconds
(EngineCore pid=84256) INFO 03-29 15:32:21 [kv_cache_utils.py:1319] GPU KV cache size: 262,144 tokens
(EngineCore pid=84256) INFO 03-29 15:32:21 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 2.00x
(EngineCore pid=84256) INFO 03-29 15:32:25 [cpu_model_runner.py:82] Warming up model for the compilation...
(EngineCore pid=84256) INFO 03-29 15:32:32 [decorators.py:640] saved AOT compiled function to /Users/johndoe/.cache/vllm/torch_compile_cache/torch_aot_compile/2be6100cc84590c09dcb4ec4e92bd60957f53fa2bafccb2237b1ff596c6c8db7/rank_0_0/model
(EngineCore pid=84256) INFO 03-29 15:32:39 [monitor.py:76] Initial profiling/warmup run took 6.59 s
(EngineCore pid=84256) INFO 03-29 15:32:39 [cpu_model_runner.py:92] Warming up done.
(EngineCore pid=84256) INFO 03-29 15:32:39 [core.py:283] init engine (profile, create kv cache, warmup model) took 18.06 seconds
(EngineCore pid=84256) INFO 03-29 15:32:39 [vllm.py:786] Asynchronous scheduling is disabled.
(EngineCore pid=84256) WARNING 03-29 15:32:39 [vllm.py:855] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=84256) WARNING 03-29 15:32:39 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 32.0 GiB for KV cache.
INFO 03-29 15:32:39 [launcher.py:37] Available routes are:
INFO 03-29 15:32:39 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
INFO 03-29 15:32:39 [launcher.py:46] Route: /docs, Methods: HEAD, GET
INFO 03-29 15:32:39 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-29 15:32:39 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
INFO 03-29 15:32:39 [launcher.py:46] Route: /health, Methods: GET
INFO 03-29 15:32:39 [launcher.py:46] Route: /generate, Methods: POST
INFO:     Started server process [84248]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
----
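Since model loading and warmup take a while, it can help to wait for the `/health` route listed in the startup log before sending requests. A stdlib-only sketch (`wait_for_health` is a hypothetical helper, not part of vLLM):

```python
import time
from urllib import error, request

def wait_for_health(url: str = "http://localhost:8000/health",
                    timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll GET /health until it answers 200 OK or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # server still starting up; retry after a pause
        time.sleep(interval)
    return False
```

A shell equivalent is simply looping over `curl -sf http://localhost:8000/health`.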
Query it.

----
$ curl -XPOST -H "Content-Type: application/json" -d '{"prompt": "You are a helpful assistant from the 18th century England. Respond using appropriate language. Introduce the topic of medieval castle design in five sentences. Only respond once and do not repeat the answer.", "max_tokens": 1000, "temperature": 0.5}' http://localhost:8000/generate | jq
----
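The same request can be issued from Python with only the standard library. A sketch: `build_generate_request` is a hypothetical helper, and the payload fields mirror the curl command above for this demo server's `/generate` route.

```python
import json
from urllib import request

GENERATE_URL = "http://localhost:8000/generate"  # route from the startup log above

def build_generate_request(prompt: str, max_tokens: int = 1000,
                           temperature: float = 0.5) -> request.Request:
    """Build the same POST /generate request that the curl command sends."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens,
                          "temperature": temperature}).encode("utf-8")
    return request.Request(GENERATE_URL, data=payload,
                           headers={"Content-Type": "application/json"})

req = build_generate_request("Introduce the topic of medieval castle design in five sentences.")
print(req.get_full_url())  # http://localhost:8000/generate
# With the server running, send it and parse the JSON body:
#   json.load(request.urlopen(req))
```

`urllib.request.Request` defaults to POST whenever `data` is set, matching `curl -XPOST`.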