Qwen3-VL: Integrating Image Preprocessing and Augmentation with the Processor

张开发
2026/4/3 23:47:08 · 15 min read
## Background

When training Qwen3-VL, we often want the data pipeline to apply some image preprocessing before handing images to the Processor. This is standard practice for vision models: random crops, random brightness jitter, and so on. The goal is strict consistency: a preprocessed numpy array / torch.Tensor / PIL Image, once encoded by the Processor, must produce exactly the same result as reading the (identically processed) image in directly. There are a few pitfalls here worth recording, hence this post. Everything below also applies to Qwen2-VL and Qwen2.5-VL.

## The basic data-processing flow

First, load the model and the Processor:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_NAME = "Qwen/Qwen3-VL-4B-Instruct"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_NAME, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)
```

The image preprocessing pipeline usually lives in the data collator, so here is a stripped-down collator. `piece_of_data` mimics a single item from a `datasets` dataset. We only operate on the image, and we apply no actual transform here (the image passes through unchanged); the point is to demonstrate the data interfaces. The example covers four image formats: path string, tensor, numpy, and PIL.

```python
import cv2
import numpy as np
import torch
from PIL import Image

piece_of_data = [
    {
        "image": "./sample_img.jpg",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image", "image": "./sample_img.jpg"},
                ],
            },
            # {
            #     "role": "assistant",
            #     "content": [{"type": "text", "text": "The image is a street scene with a car and a person."}],
            # },
        ],
    }
]


def my_collator(inputs):
    # ... other required steps ...
    texts = [
        processor.apply_chat_template(example["messages"], tokenize=False)
        for example in inputs
    ]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / preprocessing here
        image_string = image_strs[i]  # path-string input
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)  # tensor input
        image_np = image_tensor.numpy()  # numpy input
        # Do not cast the dtype here, to avoid losing information
        image_pil = Image.fromarray(np.uint8(image_np * 255))  # PIL input
        image_inputs.append(image_string)  # here: the path-string variant
    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )
    # ... other required steps ...
    return batch


print(my_collator(piece_of_data))
```
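As a concrete example of what could go in the `YOUR_TRANSFORM` slot, here is a minimal augmentation sketch using torchvision (an assumption: torchvision is installed; the specific transforms and sizes are illustrative, not part of the original pipeline). It operates on the float (C, H, W) tensor in [0, 1], which is the layout at the point where the placeholder sits:

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for the YOUR_TRANSFORM placeholder above.
# Expects a float (C, H, W) tensor with values in [0, 1].
YOUR_TRANSFORM = T.Compose([
    T.RandomResizedCrop(size=(448, 448), scale=(0.8, 1.0)),  # random crop
    T.ColorJitter(brightness=0.2),                           # random brightness jitter
])
```

Note that the collator already clamps to [0, 1] right after this step, so a brightness factor greater than 1 cannot push values out of range.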
(The figure above shows the sample image, `sample_img.jpg`.)

## Baseline: path-string input

When the code above passes the path string directly, we are reading the original image. The output is as follows (the long runs of image-pad token id 151655 in `input_ids` and of 1s in `attention_mask` are abridged here):

```
{'input_ids': tensor([[151644, 872, 198, 74785, 279, 2168, 11, 2291, 6529, 311, 1894, 13,
                       151652, 151655, 151655, ..., 151655, 151653, 151645, 198]]),
 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1]]),
 'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]),
 'image_grid_thw': tensor([[ 1, 30, 40]])}
```

As the related article https://blog.csdn.net/qq_40672115/article/details/151675269?spm=1001.2014.3001.5502 explains, `pixel_values` is the actual encoded image tensor. Keep this `pixel_values` in mind; it is the reference for every other input format below.
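The run of repeated token id 151655 is the image placeholder (`<|image_pad|>`) sequence sitting between the vision start/end tokens 151652/151653. Its length is fixed by `image_grid_thw`: assuming the default `merge_size = 2` (the value used by Qwen2-VL's image processor, and consistent with the output here), the 30 × 40 patch grid collapses to (30/2) × (40/2) = 300 placeholder tokens. A quick check, as a sketch:

```python
# Sanity check (assumptions: merge_size == 2, and 151655 is the image-pad id).
# The number of placeholder tokens should equal prod(image_grid_thw) / merge_size**2.
batch = my_collator(piece_of_data)
image_pad_id = 151655
n_pads = (batch["input_ids"] == image_pad_id).sum().item()
t, h, w = batch["image_grid_thw"][0].tolist()
print(n_pads, t * h * w // (2 * 2))  # expected to match: 300 == 300
```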
## numpy input

Here we use an RGB image with values in 0.0-1.0 and shape (H, W, C). Changing `image_inputs.append(image_string)` in the code above to `image_inputs.append(image_np)`, we find that `pixel_values` suddenly becomes:

```
pixel_values: tensor([[-0.9926, -0.9926, -0.9926,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9923, -0.9925, -0.9926],
        ...,
        [-0.9966, -0.9967, -0.9968,  ..., -0.9964, -0.9964, -0.9960],
        [-0.9960, -0.9963, -0.9965,  ..., -0.9960, -0.9961, -0.9960],
        [-0.9963, -0.9962, -0.9962,  ..., -0.9958, -0.9961, -0.9964]])
```

What happened? Looking at the `Qwen2VLImageProcessor` source, images are expected to be in the 0-255 range by default. If you pass a 0.0-1.0 image, you must set `do_rescale=False`, which can be done by passing `images_kwargs={"do_rescale": False}` to the processor (a short calculation after the source excerpt below shows why the values otherwise collapse to about -0.99). Also, experiments confirm that the expected channel order is RGB, not OpenCV's default BGR, and that the recommended shape is (H, W, C).

```python
def _preprocess(
    self,
    images: Union[ImageInput, VideoInput],
    do_resize: Optional[bool] = None,
    size: Optional[dict[str, int]] = None,
    resample: Optional[PILImageResampling] = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[float] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, list[float]]] = None,
    image_std: Optional[Union[float, list[float]]] = None,
    patch_size: Optional[int] = None,
    temporal_patch_size: Optional[int] = None,
    merge_size: Optional[int] = None,
    do_convert_rgb: Optional[bool] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """
    Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

    Args:
        images (`ImageInput`):
            Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255.
            If pixel values range from 0 to 1, set `do_rescale=False`.
        vision_info (`list[Dict]`, *optional*):
            Optional list of dictionaries containing additional information about vision inputs.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
        resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Scale factor to use if rescaling the image.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
            Mean to use if normalizing the image. Can be a float or a list of floats corresponding
            to the number of channels in the image.
        image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
            Standard deviation to use if normalizing the image. Can be a float or a list of floats
            corresponding to the number of channels in the image.
        patch_size (`int`, *optional*, defaults to `self.patch_size`):
            The spatial patch size of the vision encoder.
        temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
            The temporal patch size of the vision encoder.
        merge_size (`int`, *optional*, defaults to `self.merge_size`):
            The merge size of the vision encoder to llm encoder.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
```
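To see why the broken values sit near -0.99: with `do_rescale` left on, the 0-1 input is multiplied by `rescale_factor = 1/255` a *second* time before normalization. A back-of-the-envelope check (an assumption: `image_mean = image_std = 0.5`, which is consistent with the tensors printed above):

```python
# Why double-rescaling collapses everything to about -0.99.
# Assumption: image_mean == image_std == 0.5 (consistent with the outputs above).
p = 1.0                        # a maximally bright pixel from a 0-1 image
rescaled = p * (1.0 / 255.0)   # the unwanted second rescale: ~0.0039
normalized = (rescaled - 0.5) / 0.5
print(normalized)              # ~ -0.992 -- the whole image lands in this narrow band
```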
Therefore, the code we should actually use is:

```python
def my_collator(inputs):
    # ... other required steps ...
    texts = [
        processor.apply_chat_template(example["messages"], tokenize=False)
        for example in inputs
    ]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()
        # Do not cast the dtype here, to avoid losing information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_np)  # the 0-1, RGB, (H, W, C) numpy array
    batch = processor(
        text=texts,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
        images_kwargs={"do_rescale": False},  # note: changed here
    )
    # ... other required steps ...
    return batch
```

This time the output is:

```
pixel_values: tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]])
```

which matches the baseline, as expected.

## Tensor input

Here, too, the image is RGB with values in 0.0-1.0 and shape (H, W, C). The collator is identical to the numpy version above except for the final append:

```python
        image_inputs.append(image_tensor)  # the 0-1, RGB, (H, W, C) torch.Tensor
```

with `images_kwargs={"do_rescale": False}` kept in place. The output `pixel_values` is again identical to the baseline, as expected.
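A side note on tensor layout: if your transform pipeline leaves the tensor channels-first (C, H, W), the image processor otherwise has to guess the layout. Since `_preprocess` exposes `input_data_format` (see the excerpt above), you can state the layout explicitly instead of permuting. A sketch, assuming `images_kwargs` routes this kwarg the same way it routes `do_rescale`, and with `image_tensor_chw` as a hypothetical float (3, H, W) tensor in [0, 1]:

```python
# Hypothetical: pass a channels-first (3, H, W) tensor without permuting,
# telling the image processor its layout explicitly.
batch = processor(
    text=texts,
    images=[image_tensor_chw],  # assumed: float (3, H, W) tensor in [0, 1]
    return_tensors="pt",
    padding=True,
    images_kwargs={"do_rescale": False, "input_data_format": "channels_first"},
)
```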
## PIL input

Here the image is RGB with values in 0-255. In this case there is no need to set `do_rescale: False`; a PIL image is already in the 0-255 range the processor expects. Again only the tail of the collator changes:

```python
        image_inputs.append(image_pil)  # the 0-255 RGB PIL.Image
    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )
```

The output `pixel_values` is once more identical to the baseline, as expected.

## Summary

This post addressed the core requirement when adding image preprocessing to Qwen3-VL (and Qwen2/2.5-VL) training: a preprocessed image, encoded by the Processor, must match the result of reading that image in directly. With a stripped-down data collator we tested four input routes and pinned down what each one needs:

- Path string: the baseline; no extra configuration.
- numpy array and torch.Tensor (values 0.0-1.0, RGB, (H, W, C)): must pass `images_kwargs={"do_rescale": False}`, otherwise the default rescale is applied a second time and the encoding is destroyed.
- PIL Image (values 0-255, RGB): matches the Processor's default logic; no extra arguments needed.

In addition, watch the channel order during conversion (BGR to RGB) and preserve precision: avoid unnecessary dtype casts so that preprocessing loses no image information. With these rules in place, random crops, brightness jitter, and other augmentations can be slotted into the pipeline while keeping the data-processing pipeline consistent and stable.
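As a closing sanity check, here is a small sketch that runs all four input formats through the processor and verifies the encodings agree (an assumption: `texts`, `image_np`, `image_tensor`, and `image_pil` from the collator body are in scope; the tolerance is loose because the uint8 round-trip for the PIL variant can differ by up to 1/255 before normalization):

```python
# Consistency test (sketch): all four input formats should encode the same.
def encode(img, **images_kwargs):
    return processor(
        text=texts, images=[img], return_tensors="pt", padding=True,
        images_kwargs=images_kwargs,
    )["pixel_values"]

pv_path   = encode("./sample_img.jpg")
pv_np     = encode(image_np, do_rescale=False)
pv_tensor = encode(image_tensor, do_rescale=False)
pv_pil    = encode(image_pil)

for name, pv in [("numpy", pv_np), ("tensor", pv_tensor), ("pil", pv_pil)]:
    print(name, "matches baseline:", torch.allclose(pv_path, pv, atol=1e-2))
```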
