Deploying Qwen3-30B-A3B on a Huawei Ascend 910B Server and Benchmarking Inference Performance with EvalScope
Run the Qwen3-30B-A3B model on Huawei Ascend 910B NPUs with the MindIE and vllm-ascend inference engines, then run a simple inference performance benchmark.
1. Preparation
1.1 Environment
| Item | Value | Notes |
|---|---|---|
| Model | Qwen3-30B-A3B | MindIE needs at least 2 cards to run this model; 4 are recommended |
| Server model | Atlas 800I A2 | 1 unit |
| NPU | 910B4 | 64 GB per card |
| Driver | >=24.1.0 | |
| MindIE | >=2.1.RC1 | |
| vllm-ascend | v0.11.0rc0 | |
The server has 8 cards in total: cards 0-3 are given to MindIE and cards 4-7 to vllm-ascend.
1.2 Model preparation
```bash
modelscope download --model Qwen/Qwen3-30B-A3B --local_dir /model/Qwen3-30B-A3B
chmod -R 750 /model/Qwen3-30B-A3B
```
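Optionally, verify the download completed before wiring the weights into either engine. A minimal sketch, assuming the weights are sharded as safetensors under the path above; the script name is illustrative:

```python
# verify_model.py -- quick sanity check on the downloaded weights
from pathlib import Path

root = Path("/model/Qwen3-30B-A3B")
files = sorted(root.glob("*.safetensors"))
total_gb = sum(f.stat().st_size for f in files) / 1e9

# The tokenizer and model config must be present alongside the shards
assert (root / "config.json").exists(), "config.json missing"
print(f"{len(files)} safetensors shards, {total_gb:.1f} GB total")
```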
2. Running with MindIE
2.1 Start the container
```bash
docker run -itd --privileged --name=qwen3-30b-a3b-mindie --net=host --shm-size=500g \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
  -v /usr/local/sbin/:/usr/local/sbin/ \
  -v /var/log/npu/slog/:/var/log/npu/slog \
  -v /var/log/npu/profiling/:/var/log/npu/profiling \
  -v /var/log/npu/dump/:/var/log/npu/dump \
  -v /var/log/npu/:/usr/slog \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /model:/model \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts \
  /bin/bash
```
2.2 Edit the configuration
```bash
docker exec -it qwen3-30b-a3b-mindie bash
cd /usr/local/Ascend/mindie/latest/mindie-service
vim conf/config.json
```
The parameters that must be changed are listed below (not in the order they appear in the config file); other parameters can be adjusted as needed. A small patch sketch after the config file applies the same edits programmatically.
"httpsEnabled":false,禁用 HTTPS"npuDeviceIds":[[0,1,2,3]],NPU 卡号,下标从0开始"modelName":qwen3-30b-a3b,模型名称,后续调用模型服务时使用"modelWeightPath":/model/Qwen3-30B-A3B,挂载到容器内模型权重路径"worldSize":4,模型使用的 NPU 卡总数量
- `ASCEND_RT_VISIBLE_DEVICES` selects physical cards, while `npuDeviceIds` uses logical card IDs: whatever physical cards `ASCEND_RT_VISIBLE_DEVICES` selects, the indices in `npuDeviceIds` always start from 0. For example, with `ASCEND_RT_VISIBLE_DEVICES=4,5,6,7`, `npuDeviceIds` would still be `[[0,1,2,3]]`.
- An NPU can be mounted into multiple containers, but can only be used by one container.
The configuration file used for testing is shown below:
{"Version" : "1.0.0","ServerConfig" :{"ipAddress" : "0.0.0.0","managementIpAddress" : "127.0.0.2","port" : 1025,"managementPort" : 1026,"metricsPort" : 1027,"allowAllZeroIpListening" : true,"maxLinkNum" : 1000,"httpsEnabled" : false,"fullTextEnabled" : false,"tlsCaPath" : "security/ca/","tlsCaFile" : ["ca.pem"],"tlsCert" : "security/certs/server.pem","tlsPk" : "security/keys/server.key.pem","tlsPkPwd" : "security/pass/key_pwd.txt","tlsCrlPath" : "security/certs/","tlsCrlFiles" : ["server_crl.pem"],"managementTlsCaFile" : ["management_ca.pem"],"managementTlsCert" : "security/certs/management/server.pem","managementTlsPk" : "security/keys/management/server.key.pem","managementTlsPkPwd" : "security/pass/management/key_pwd.txt","managementTlsCrlPath" : "security/management/certs/","managementTlsCrlFiles" : ["server_crl.pem"],"kmcKsfMaster" : "tools/pmt/master/ksfa","kmcKsfStandby" : "tools/pmt/standby/ksfb","inferMode" : "standard","interCommTLSEnabled" : true,"interCommPort" : 1121,"interCommTlsCaPath" : "security/grpc/ca/","interCommTlsCaFiles" : ["ca.pem"],"interCommTlsCert" : "security/grpc/certs/server.pem","interCommPk" : "security/grpc/keys/server.key.pem","interCommPkPwd" : "security/grpc/pass/key_pwd.txt","interCommTlsCrlPath" : "security/grpc/certs/","interCommTlsCrlFiles" : ["server_crl.pem"],"openAiSupport" : "vllm","tokenTimeout" : 600,"e2eTimeout" : 600,"distDPServerEnabled":false},"BackendConfig" : {"backendName" : "mindieservice_llm_engine","modelInstanceNumber" : 1,"npuDeviceIds" : [[0,1,2,3]],"tokenizerProcessNumber" : 8,"multiNodesInferEnabled" : false,"multiNodesInferPort" : 1120,"interNodeTLSEnabled" : true,"interNodeTlsCaPath" : "security/grpc/ca/","interNodeTlsCaFiles" : ["ca.pem"],"interNodeTlsCert" : "security/grpc/certs/server.pem","interNodeTlsPk" : "security/grpc/keys/server.key.pem","interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt","interNodeTlsCrlPath" : "security/grpc/certs/","interNodeTlsCrlFiles" : ["server_crl.pem"],"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa","interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb","ModelDeployConfig" :{"maxSeqLen" : 8192,"maxInputTokenLen" : 6144,"truncation" : false,"ModelConfig" : [{"modelInstanceType" : "Standard","modelName" : "qwen3-30b-a3b","modelWeightPath" : "/model/Qwen3-30B-A3B","worldSize" : 4,"cpuMemSize" : 5,"npuMemSize" : -1,"backendType" : "atb","trustRemoteCode" : false,"async_scheduler_wait_time": 120,"kv_trans_timeout": 10,"kv_link_timeout": 1080}]},"ScheduleConfig" :{"templateType" : "Standard","templateName" : "Standard_LLM","cacheBlockSize" : 128,"maxPrefillBatchSize" : 10,"maxPrefillTokens" : 6144,"prefillTimeMsPerReq" : 150,"prefillPolicyType" : 0,"decodeTimeMsPerReq" : 50,"decodePolicyType" : 0,"maxBatchSize" : 200,"maxIterTimes" : 2048,"maxPreemptCount" : 0,"supportSelectBatch" : false,"maxQueueDelayMicroseconds" : 5000}}
}
JSON
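For repeatable deployments it can be handier to patch just the must-change keys than to hand-edit the file in vim. A minimal sketch, assuming the stock config.json at the path above; the script name and the card list are illustrative:

```python
# patch_config.py -- hypothetical helper applying the five edits from section 2.2
import json

CONFIG = "/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"

with open(CONFIG) as f:
    cfg = json.load(f)

# Disable HTTPS and assign logical NPU card IDs
cfg["ServerConfig"]["httpsEnabled"] = False
cfg["BackendConfig"]["npuDeviceIds"] = [[0, 1, 2, 3]]

# Point the first (and only) model instance at the mounted weights
model = cfg["BackendConfig"]["ModelDeployConfig"]["ModelConfig"][0]
model["modelName"] = "qwen3-30b-a3b"
model["modelWeightPath"] = "/model/Qwen3-30B-A3B"
model["worldSize"] = 4

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=4)
```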
2.3 Start the model service
```bash
docker exec -it qwen3-30b-a3b-mindie bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
cd /usr/local/Ascend/mindie/latest/mindie-service
./bin/mindieservice_daemon
```
When the log prints `Daemon start success!`, the model service has started successfully.
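Before benchmarking, a quick smoke test confirms the endpoint answers. A minimal sketch, assuming the OpenAI-compatible chat route is served on the `port` from the config above (1025, with `openAiSupport` set to `vllm`); the script name is illustrative and only the standard `requests` library is used:

```python
# check_chat.py -- hypothetical smoke test for an OpenAI-compatible endpoint
import sys
import requests

# Defaults to the MindIE port from config.json; pass another URL as argv[1]
url = sys.argv[1] if len(sys.argv) > 1 else "http://127.0.0.1:1025/v1/chat/completions"

payload = {
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```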
3. Running with vllm-ascend
Running the model with vllm-ascend is more convenient: mount cards 4-7 into the container at startup, and the model service will use cards 4-7.
```bash
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run -itd \
  --name qwen3-30b-a3b-vllm-ascend \
  --shm-size=1g \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -v /model:/model \
  -p 8000:8000 \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -e VLLM_USE_MODELSCOPE=True \
  $IMAGE \
  vllm serve /model/Qwen3-30B-A3B --served-model-name qwen3-30b-a3b --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --enable_expert_parallel --gpu-memory-utilization 0.9 --enable-prefix-caching --enable-chunked-prefill
```
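The same hypothetical `check_chat.py` smoke test from section 2.3 works here too; just point it at the vllm-ascend endpoint: `python3 check_chat.py http://127.0.0.1:8000/v1/chat/completions`.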
4. Inference performance testing
4.1 Benchmark tool
EvalScope is the benchmark tool; an example script is shown below.
EvalScope supports both a command-line and a Python interface, and the choice does not affect the results. A Python script is used here so that it is easy to wrap for batch runs (a batched variant is sketched after the run command below).
```python
# llm-bench.py
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1],
    number=[10],
    model='qwen3-30b-a3b',
    url='http://127.0.0.1:8000/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='/model/Qwen3-30B-A3B',
    extra_args={'ignore_eos': True}
)

results = run_perf_benchmark(task_cfg)
```
Run it:
```bash
python3 llm-bench.py
```
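To sweep the concurrency levels of section 4.2 in one go, note that `parallel` and `number` are already lists in llm-bench.py. A sketch of a batched run, assuming that paired entries of the two lists each define one benchmark run (as those list arguments suggest); the sweep values below mirror the first rows of the result table:

```python
# llm-bench-sweep.py -- batched variant of llm-bench.py (sketch)
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

# (concurrency, total requests) pairs, taken from the result table below
sweep = [(1, 50), (16, 50), (32, 64), (64, 128), (128, 256)]

task_cfg = Arguments(
    parallel=[p for p, _ in sweep],
    number=[n for _, n in sweep],
    model='qwen3-30b-a3b',
    url='http://127.0.0.1:8000/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='/model/Qwen3-30B-A3B',
    extra_args={'ignore_eos': True},
)

results = run_perf_benchmark(task_cfg)
```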
4.2 Results
The results are shown below; overall, MindIE performs somewhat better than vllm-ascend.
- Input length is 1024 tokens, output length is 256 tokens
- vllm-ascend exited abnormally beyond 448 concurrent requests, so only the cases below were tested for now
- Results are for reference only
| Concurrency (batch size) | Requests | MindIE TTFT (s) | MindIE TPOT (s) | MindIE throughput (tokens/s) | vllm-ascend TTFT (s) | vllm-ascend TPOT (s) | vllm-ascend throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| 1 | 50 | 0.0958 | 0.0219 | 229.0896 | 0.2579 | 0.0216 | 225.4876 |
| 16 | 50 | 0.4537 | 0.0284 | 2198.1873 | 2.2589 | 0.0319 | 1623.3467 |
| 32 | 64 | 0.7763 | 0.0372 | 3988.11 | 5.4252 | 0.0336 | 2922.9758 |
| 64 | 128 | 1.3651 | 0.0506 | 5727.4234 | 8.9482 | 0.0471 | 3900.7664 |
| 96 | 192 | 1.9668 | 0.061 | 6987.6632 | 12.3918 | 0.0593 | 4449.3021 |
| 128 | 256 | 2.5768 | 0.0708 | 7896.7329 | 17.6229 | 0.0733 | 4491.8562 |
| 192 | 384 | 3.7235 | 0.096 | 8679.6975 | 28.9566 | 0.0905 | 4709.636 |
| 224 | 448 | 3.9525 | 0.1051 | 8017.7596 | 33.6394 | 0.1131 | 4579.7825 |
| 256 | 512 | 4.1468 | 0.1167 | 8398.7053 | 28.033 | 0.1085 | 5866.5547 |
| 320 | 640 | 4.7421 | 0.1451 | 8281.9079 | 35.1066 | 0.1014 | 5879.3744 |
| 384 | 768 | 5.5375 | 0.1748 | 8702.9784 | 39.3735 | 0.1053 | 6652.2664 |
| 448 | 896 | 6.0939 | 0.2022 | 8498.057 | 56.4764 | 0.1272 | 4731.1409 |