Qwen3-VL: Integrating Image Preprocessing and Augmentation with the Processor

张开发
2026/4/3 23:47:08 · 15 min read
## Background

When training Qwen3-VL, we often want the data pipeline to apply some image preprocessing before handing images to the Processor. This is standard practice for vision models: random crops, random brightness jitter, and so on. The goal is strict consistency: a preprocessed numpy array / torch.Tensor / PIL Image, once encoded by the Processor, must produce exactly the same result as reading the (identically processed) image in directly. There are a few pitfalls here worth recording, hence this post. Everything below also applies to Qwen2-VL and Qwen2.5-VL.

## The basic data-processing flow

First, load the model and the Processor:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_NAME = "Qwen/Qwen3-VL-4B-Instruct"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_NAME, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)
```

The image preprocessing pipeline usually lives in the data collator, so here is a stripped-down collator. `piece_of_data` mimics a single item from a `datasets` dataset. We only operate on the image, and we apply no actual transform here (the image passes through unchanged); the point is to demonstrate the data interfaces. The example covers four image formats: path string, tensor, numpy, and PIL.

```python
import cv2
import numpy as np
import torch
from PIL import Image

piece_of_data = [
    {
        "image": "./sample_img.jpg",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image", "image": "./sample_img.jpg"},
                ],
            },
            # {
            #     "role": "assistant",
            #     "content": [{"type": "text", "text": "The image is a street scene with a car and a person."}],
            # },
        ],
    }
]


def my_collator(inputs):
    # ... other required steps ...
    texts = [
        processor.apply_chat_template(example["messages"], tokenize=False)
        for example in inputs
    ]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / preprocessing here
        image_string = image_strs[i]  # path-string input
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)  # tensor input
        image_np = image_tensor.numpy()  # numpy input
        # Do not cast the dtype here, to avoid losing information
        image_pil = Image.fromarray(np.uint8(image_np * 255))  # PIL input
        image_inputs.append(image_string)  # here: the path-string variant
    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )
    # ... other required steps ...
    return batch


print(my_collator(piece_of_data))
```
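As a concrete example of what could go in the `YOUR_TRANSFORM` slot, here is a minimal augmentation sketch using torchvision (an assumption: torchvision is installed; the specific transforms and sizes are illustrative, not part of the original pipeline). It operates on the float (C, H, W) tensor in [0, 1], which is the layout at the point where the placeholder sits:

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for the YOUR_TRANSFORM placeholder above.
# Expects a float (C, H, W) tensor with values in [0, 1].
YOUR_TRANSFORM = T.Compose([
    T.RandomResizedCrop(size=(448, 448), scale=(0.8, 1.0)),  # random crop
    T.ColorJitter(brightness=0.2),                           # random brightness jitter
])
```

Note that the collator already clamps to [0, 1] right after this step, so a brightness factor greater than 1 cannot push values out of range.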
(The figure above shows the sample image, `sample_img.jpg`.)

## Baseline: path-string input

When the code above passes the path string directly, we are reading the original image. The output is as follows (the long runs of image-pad token id 151655 in `input_ids` and of 1s in `attention_mask` are abridged here):

```
{'input_ids': tensor([[151644, 872, 198, 74785, 279, 2168, 11, 2291, 6529, 311, 1894, 13,
                       151652, 151655, 151655, ..., 151655, 151653, 151645, 198]]),
 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1]]),
 'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]),
 'image_grid_thw': tensor([[ 1, 30, 40]])}
```

As the related article https://blog.csdn.net/qq_40672115/article/details/151675269?spm=1001.2014.3001.5502 explains, `pixel_values` is the actual encoded image tensor. Keep this `pixel_values` in mind; it is the reference for every other input format below.
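The run of repeated token id 151655 is the image placeholder (`<|image_pad|>`) sequence sitting between the vision start/end tokens 151652/151653. Its length is fixed by `image_grid_thw`: assuming the default `merge_size = 2` (the value used by Qwen2-VL's image processor, and consistent with the output here), the 30 × 40 patch grid collapses to (30/2) × (40/2) = 300 placeholder tokens. A quick check, as a sketch:

```python
# Sanity check (assumptions: merge_size == 2, and 151655 is the image-pad id).
# The number of placeholder tokens should equal prod(image_grid_thw) / merge_size**2.
batch = my_collator(piece_of_data)
image_pad_id = 151655
n_pads = (batch["input_ids"] == image_pad_id).sum().item()
t, h, w = batch["image_grid_thw"][0].tolist()
print(n_pads, t * h * w // (2 * 2))  # expected to match: 300 == 300
```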
## numpy input

Here we use an RGB image with values in 0.0-1.0 and shape (H, W, C). Changing `image_inputs.append(image_string)` in the code above to `image_inputs.append(image_np)`, we find that `pixel_values` suddenly becomes:

```
pixel_values: tensor([[-0.9926, -0.9926, -0.9926,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9923, -0.9925, -0.9926],
        ...,
        [-0.9966, -0.9967, -0.9968,  ..., -0.9964, -0.9964, -0.9960],
        [-0.9960, -0.9963, -0.9965,  ..., -0.9960, -0.9961, -0.9960],
        [-0.9963, -0.9962, -0.9962,  ..., -0.9958, -0.9961, -0.9964]])
```

What happened? Looking at the `Qwen2VLImageProcessor` source, images are expected to be in the 0-255 range by default. If you pass a 0.0-1.0 image, you must set `do_rescale=False`, which can be done by passing `images_kwargs={"do_rescale": False}` to the processor (a short calculation after the source excerpt below shows why the values otherwise collapse to about -0.99). Also, experiments confirm that the expected channel order is RGB, not OpenCV's default BGR, and that the recommended shape is (H, W, C).

```python
def _preprocess(
    self,
    images: Union[ImageInput, VideoInput],
    do_resize: Optional[bool] = None,
    size: Optional[dict[str, int]] = None,
    resample: Optional[PILImageResampling] = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[float] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, list[float]]] = None,
    image_std: Optional[Union[float, list[float]]] = None,
    patch_size: Optional[int] = None,
    temporal_patch_size: Optional[int] = None,
    merge_size: Optional[int] = None,
    do_convert_rgb: Optional[bool] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """
    Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

    Args:
        images (`ImageInput`):
            Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255.
            If pixel values range from 0 to 1, set `do_rescale=False`.
        vision_info (`list[Dict]`, *optional*):
            Optional list of dictionaries containing additional information about vision inputs.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
        resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Scale factor to use if rescaling the image.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
            Mean to use if normalizing the image. Can be a float or a list of floats corresponding
            to the number of channels in the image.
        image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
            Standard deviation to use if normalizing the image. Can be a float or a list of floats
            corresponding to the number of channels in the image.
        patch_size (`int`, *optional*, defaults to `self.patch_size`):
            The spatial patch size of the vision encoder.
        temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
            The temporal patch size of the vision encoder.
        merge_size (`int`, *optional*, defaults to `self.merge_size`):
            The merge size of the vision encoder to llm encoder.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
```
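To see why the broken values sit near -0.99: with `do_rescale` left on, the 0-1 input is multiplied by `rescale_factor = 1/255` a *second* time before normalization. A back-of-the-envelope check (an assumption: `image_mean = image_std = 0.5`, which is consistent with the tensors printed above):

```python
# Why double-rescaling collapses everything to about -0.99.
# Assumption: image_mean == image_std == 0.5 (consistent with the outputs above).
p = 1.0                        # a maximally bright pixel from a 0-1 image
rescaled = p * (1.0 / 255.0)   # the unwanted second rescale: ~0.0039
normalized = (rescaled - 0.5) / 0.5
print(normalized)              # ~ -0.992 -- the whole image lands in this narrow band
```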
Therefore, the code we should actually use is:

```python
def my_collator(inputs):
    # ... other required steps ...
    texts = [
        processor.apply_chat_template(example["messages"], tokenize=False)
        for example in inputs
    ]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()
        # Do not cast the dtype here, to avoid losing information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_np)  # the 0-1, RGB, (H, W, C) numpy array
    batch = processor(
        text=texts,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
        images_kwargs={"do_rescale": False},  # note: changed here
    )
    # ... other required steps ...
    return batch
```

This time the output is:

```
pixel_values: tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]])
```

which matches the baseline, as expected.

## Tensor input

Here, too, the image is RGB with values in 0.0-1.0 and shape (H, W, C). The collator is identical to the numpy version above except for the final append:

```python
        image_inputs.append(image_tensor)  # the 0-1, RGB, (H, W, C) torch.Tensor
```

with `images_kwargs={"do_rescale": False}` kept in place. The output `pixel_values` is again identical to the baseline, as expected.
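A side note on tensor layout: if your transform pipeline leaves the tensor channels-first (C, H, W), the image processor otherwise has to guess the layout. Since `_preprocess` exposes `input_data_format` (see the excerpt above), you can state the layout explicitly instead of permuting. A sketch, assuming `images_kwargs` routes this kwarg the same way it routes `do_rescale`, and with `image_tensor_chw` as a hypothetical float (3, H, W) tensor in [0, 1]:

```python
# Hypothetical: pass a channels-first (3, H, W) tensor without permuting,
# telling the image processor its layout explicitly.
batch = processor(
    text=texts,
    images=[image_tensor_chw],  # assumed: float (3, H, W) tensor in [0, 1]
    return_tensors="pt",
    padding=True,
    images_kwargs={"do_rescale": False, "input_data_format": "channels_first"},
)
```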
## PIL input

Here the image is RGB with values in 0-255. In this case there is no need to set `do_rescale: False`; a PIL image is already in the 0-255 range the processor expects. Again only the tail of the collator changes:

```python
        image_inputs.append(image_pil)  # the 0-255 RGB PIL.Image
    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )
```

The output `pixel_values` is once more identical to the baseline, as expected.

## Summary

This post addressed the core requirement when adding image preprocessing to Qwen3-VL (and Qwen2/2.5-VL) training: a preprocessed image, encoded by the Processor, must match the result of reading that image in directly. With a stripped-down data collator we tested four input routes and pinned down what each one needs:

- Path string: the baseline; no extra configuration.
- numpy array and torch.Tensor (values 0.0-1.0, RGB, (H, W, C)): must pass `images_kwargs={"do_rescale": False}`, otherwise the default rescale is applied a second time and the encoding is destroyed.
- PIL Image (values 0-255, RGB): matches the Processor's default logic; no extra arguments needed.

In addition, watch the channel order during conversion (BGR to RGB) and preserve precision: avoid unnecessary dtype casts so that preprocessing loses no image information. With these rules in place, random crops, brightness jitter, and other augmentations can be slotted into the pipeline while keeping the data-processing pipeline consistent and stable.
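As a closing sanity check, here is a small sketch that runs all four input formats through the processor and verifies the encodings agree (an assumption: `texts`, `image_np`, `image_tensor`, and `image_pil` from the collator body are in scope; the tolerance is loose because the uint8 round-trip for the PIL variant can differ by up to 1/255 before normalization):

```python
# Consistency test (sketch): all four input formats should encode the same.
def encode(img, **images_kwargs):
    return processor(
        text=texts, images=[img], return_tensors="pt", padding=True,
        images_kwargs=images_kwargs,
    )["pixel_values"]

pv_path   = encode("./sample_img.jpg")
pv_np     = encode(image_np, do_rescale=False)
pv_tensor = encode(image_tensor, do_rescale=False)
pv_pil    = encode(image_pil)

for name, pv in [("numpy", pv_np), ("tensor", pv_tensor), ("pil", pv_pil)]:
    print(name, "matches baseline:", torch.allclose(pv_path, pv, atol=1e-2))
```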
