
Deploying Qwen3-30B-A3B on a Huawei Ascend 910B Server and Benchmarking Inference Performance with EvalScope

Run the Qwen3-30B-A3B model on Huawei Ascend 910B NPUs with the MindIE and vllm-ascend inference engines, then run a quick inference performance benchmark.

1. Preparation

1.1 Environment

| Component | Spec / Version | Notes |
|---|---|---|
| Model | Qwen3-30B-A3B | MindIE needs at least 2 cards to run this model; 4 are recommended |
| Server | Atlas 800I A2 | 1 node |
| NPU | 910B4 | 64 GB per card |
| Driver | >= 24.1.0 | |
| MindIE | >= 2.1.RC1 | |
| vllm-ascend | v0.11.0rc0 | |

The server has 8 cards in total: cards 0-3 are given to MindIE and cards 4-7 to vllm-ascend.
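
The host's card inventory and numbering can be confirmed with the driver's npu-smi tool before splitting the cards between the two engines:

# Expect 8 x 910B4 cards, IDs 0-7
npu-smi info

Bash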

1.2 Model Preparation

modelscope download --model Qwen/Qwen3-30B-A3B --local_dir /model/Qwen3-30B-A3B
chmod -R 750 /model/Qwen3-30B-A3B

Bash

2. Running with MindIE

2.1 Start the Container

docker run -itd --privileged --name=qwen3-30b-a3b-mindie --net=host --shm-size=500g \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
  -v /usr/local/sbin/:/usr/local/sbin/ \
  -v /var/log/npu/slog/:/var/log/npu/slog \
  -v /var/log/npu/profiling/:/var/log/npu/profiling \
  -v /var/log/npu/dump/:/var/log/npu/dump \
  -v /var/log/npu/:/usr/slog \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /model:/model \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts \
  /bin/bash

Bash
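
A quick sanity check that the container actually sees the four mapped cards (npu-smi may also be available inside the container via the mounted /usr/local/sbin for a fuller report):

docker exec qwen3-30b-a3b-mindie ls /dev/davinci0 /dev/davinci1 /dev/davinci2 /dev/davinci3

Bash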

2.2 Edit the Configuration

docker exec -it qwen3-30b-a3b-mindie bash
cd /usr/local/Ascend/mindie/latest/mindie-service
vim conf/config.json

Bash

The parameters that must be changed are listed below (not in the order they appear in the config file); other parameters can be adjusted as needed:

  • "httpsEnabled"false,禁用 HTTPS
  • "npuDeviceIds"[[0,1,2,3]],NPU 卡号,下标从0开始
  • "modelName"qwen3-30b-a3b,模型名称,后续调用模型服务时使用
  • "modelWeightPath"/model/Qwen3-30B-A3B,挂载到容器内模型权重路径
  • "worldSize"4,模型使用的 NPU 卡总数量
  1. ASCEND_RT_VISIBLE_DEVICES选择物理卡,npuDeviceIds使用逻辑卡。即无论 ASCEND_RT_VISIBLE_DEVICES选择了什么卡,npuDeviceIds下标一律从0开始。
  2. NPU 可以被多个容器挂载,但只能被一个容器使用。
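
For example, if the service were pinned to physical cards 4-7 instead (illustrative only, not the setup used in this post), the mapping would look like this:

# Physical cards 4-7 are exposed to the process...
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
# ...but config.json still addresses them as logical cards 0-3:
# "npuDeviceIds" : [[0,1,2,3]]

Bash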

The configuration file used for testing is shown below:

{"Version" : "1.0.0","ServerConfig" :{"ipAddress" : "0.0.0.0","managementIpAddress" : "127.0.0.2","port" : 1025,"managementPort" : 1026,"metricsPort" : 1027,"allowAllZeroIpListening" : true,"maxLinkNum" : 1000,"httpsEnabled" : false,"fullTextEnabled" : false,"tlsCaPath" : "security/ca/","tlsCaFile" : ["ca.pem"],"tlsCert" : "security/certs/server.pem","tlsPk" : "security/keys/server.key.pem","tlsPkPwd" : "security/pass/key_pwd.txt","tlsCrlPath" : "security/certs/","tlsCrlFiles" : ["server_crl.pem"],"managementTlsCaFile" : ["management_ca.pem"],"managementTlsCert" : "security/certs/management/server.pem","managementTlsPk" : "security/keys/management/server.key.pem","managementTlsPkPwd" : "security/pass/management/key_pwd.txt","managementTlsCrlPath" : "security/management/certs/","managementTlsCrlFiles" : ["server_crl.pem"],"kmcKsfMaster" : "tools/pmt/master/ksfa","kmcKsfStandby" : "tools/pmt/standby/ksfb","inferMode" : "standard","interCommTLSEnabled" : true,"interCommPort" : 1121,"interCommTlsCaPath" : "security/grpc/ca/","interCommTlsCaFiles" : ["ca.pem"],"interCommTlsCert" : "security/grpc/certs/server.pem","interCommPk" : "security/grpc/keys/server.key.pem","interCommPkPwd" : "security/grpc/pass/key_pwd.txt","interCommTlsCrlPath" : "security/grpc/certs/","interCommTlsCrlFiles" : ["server_crl.pem"],"openAiSupport" : "vllm","tokenTimeout" : 600,"e2eTimeout" : 600,"distDPServerEnabled":false},"BackendConfig" : {"backendName" : "mindieservice_llm_engine","modelInstanceNumber" : 1,"npuDeviceIds" : [[0,1,2,3]],"tokenizerProcessNumber" : 8,"multiNodesInferEnabled" : false,"multiNodesInferPort" : 1120,"interNodeTLSEnabled" : true,"interNodeTlsCaPath" : "security/grpc/ca/","interNodeTlsCaFiles" : ["ca.pem"],"interNodeTlsCert" : "security/grpc/certs/server.pem","interNodeTlsPk" : "security/grpc/keys/server.key.pem","interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt","interNodeTlsCrlPath" : "security/grpc/certs/","interNodeTlsCrlFiles" : ["server_crl.pem"],"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa","interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb","ModelDeployConfig" :{"maxSeqLen" : 8192,"maxInputTokenLen" : 6144,"truncation" : false,"ModelConfig" : [{"modelInstanceType" : "Standard","modelName" : "qwen3-30b-a3b","modelWeightPath" : "/model/Qwen3-30B-A3B","worldSize" : 4,"cpuMemSize" : 5,"npuMemSize" : -1,"backendType" : "atb","trustRemoteCode" : false,"async_scheduler_wait_time": 120,"kv_trans_timeout": 10,"kv_link_timeout": 1080}]},"ScheduleConfig" :{"templateType" : "Standard","templateName" : "Standard_LLM","cacheBlockSize" : 128,"maxPrefillBatchSize" : 10,"maxPrefillTokens" : 6144,"prefillTimeMsPerReq" : 150,"prefillPolicyType" : 0,"decodeTimeMsPerReq" : 50,"decodePolicyType" : 0,"maxBatchSize" : 200,"maxIterTimes" : 2048,"maxPreemptCount" : 0,"supportSelectBatch" : false,"maxQueueDelayMicroseconds" : 5000}}
}

JSON
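
If you prefer scripting the required edits over hand-editing in vim, here is a minimal sketch that patches the stock config in place (it assumes python3 is on the container's PATH, which the py311 image name suggests):

docker exec -i qwen3-30b-a3b-mindie python3 - <<'EOF'
import json

# Patch only the parameters listed in section 2.2
path = '/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json'
with open(path) as f:
    cfg = json.load(f)

cfg['ServerConfig']['httpsEnabled'] = False
backend = cfg['BackendConfig']
backend['npuDeviceIds'] = [[0, 1, 2, 3]]
model = backend['ModelDeployConfig']['ModelConfig'][0]
model['modelName'] = 'qwen3-30b-a3b'
model['modelWeightPath'] = '/model/Qwen3-30B-A3B'
model['worldSize'] = 4

with open(path, 'w') as f:
    json.dump(cfg, f, indent=4)
EOF

Bash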

2.3 Start the Model Service

docker exec -it qwen3-30b-a3b-mindie bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
cd /usr/local/Ascend/mindie/latest/mindie-service
./bin/mindieservice_daemon

Bash

The model service has started successfully once the log prints Daemon start success!
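
Because openAiSupport is set to vllm in the config, the service should answer OpenAI-style requests on port 1025. A quick smoke test (the request body below follows the standard OpenAI chat schema and is an illustrative assumption, not taken from the original post):

curl -s http://127.0.0.1:1025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'

Bash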

3. Running with vllm-ascend

Running the model with vllm-ascend is simpler: mount cards 4-7 into the container when starting it, and the model service will use cards 4-7.

export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run -itd \
  --name qwen3-30b-a3b-vllm-ascend \
  --shm-size=1g \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -v /model:/model \
  -p 8000:8000 \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -e VLLM_USE_MODELSCOPE=True \
  $IMAGE \
  vllm serve /model/Qwen3-30B-A3B \
    --served-model-name qwen3-30b-a3b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --enable_expert_parallel \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching \
    --enable-chunked-prefill

Bash
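
Model loading takes a few minutes; once the container log settles, vLLM's standard OpenAI-compatible endpoints can be used to verify the service:

# Should list "qwen3-30b-a3b" as an available model
curl -s http://127.0.0.1:8000/v1/models

Bash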

4. Inference Performance Testing

4.1 Test Tool

EvalScope was chosen as the test tool; a sample script is shown below.

EvalScope can be driven either from the command line or from a Python script; the choice of mode does not affect the results. The Python script approach is used here, wrapped in our own scripts for batch execution (a CLI equivalent is sketched after the run command below).

# llm-bench.py
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1],
    number=[10],
    model='qwen3-30b-a3b',
    url='http://127.0.0.1:8000/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='/model/Qwen3-30B-A3B',
    extra_args={'ignore_eos': True},
)
results = run_perf_benchmark(task_cfg)

Python

Run it:

python3 llm-bench.py

Bash
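
For reference, a roughly equivalent CLI invocation is sketched below. The flag names mirror the Arguments fields above but are an assumption rather than something taken from the original post, so verify them against evalscope perf --help:

evalscope perf \
  --parallel 1 \
  --number 10 \
  --model qwen3-30b-a3b \
  --url http://127.0.0.1:8000/v1/chat/completions \
  --api openai \
  --dataset random \
  --min-tokens 1024 \
  --max-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path /model/Qwen3-30B-A3B \
  --extra-args '{"ignore_eos": true}'

Bash

Since parallel and number accept lists (as the script above already shows), a single Python run with, say, parallel=[1, 16, 32] and number=[50, 50, 64] can sweep several concurrency levels in one go.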

4.2 Test Results

The results are shown below. Overall, MindIE performs somewhat better than vllm-ascend.

  1. Input length is 1024 tokens and output length is 256 tokens.
  2. vllm-ascend exited abnormally once concurrency went past 448, so only the cases below were tested for now.
  3. Results are for reference only.
| Concurrency (batch size) | Requests | MindIE TTFT (s) | MindIE TPOT (s) | MindIE throughput (tokens/s) | vllm-ascend TTFT (s) | vllm-ascend TPOT (s) | vllm-ascend throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| 1 | 50 | 0.0958 | 0.0219 | 229.0896 | 0.2579 | 0.0216 | 225.4876 |
| 16 | 50 | 0.4537 | 0.0284 | 2198.1873 | 2.2589 | 0.0319 | 1623.3467 |
| 32 | 64 | 0.7763 | 0.0372 | 3988.11 | 5.4252 | 0.0336 | 2922.9758 |
| 64 | 128 | 1.3651 | 0.0506 | 5727.4234 | 8.9482 | 0.0471 | 3900.7664 |
| 96 | 192 | 1.9668 | 0.061 | 6987.6632 | 12.3918 | 0.0593 | 4449.3021 |
| 128 | 256 | 2.5768 | 0.0708 | 7896.7329 | 17.6229 | 0.0733 | 4491.8562 |
| 192 | 384 | 3.7235 | 0.096 | 8679.6975 | 28.9566 | 0.0905 | 4709.636 |
| 224 | 448 | 3.9525 | 0.1051 | 8017.7596 | 33.6394 | 0.1131 | 4579.7825 |
| 256 | 512 | 4.1468 | 0.1167 | 8398.7053 | 28.033 | 0.1085 | 5866.5547 |
| 320 | 640 | 4.7421 | 0.1451 | 8281.9079 | 35.1066 | 0.1014 | 5879.3744 |
| 384 | 768 | 5.5375 | 0.1748 | 8702.9784 | 39.3735 | 0.1053 | 6652.2664 |
| 448 | 896 | 6.0939 | 0.2022 | 8498.057 | 56.4764 | 0.1272 | 4731.1409 |
