Taobao Image Search (Pailitao) and Full-Shop Product Scraping: Anti-Bot Analysis and Implementation
This article walks through the core implementation of Taobao image search (Pailitao) and full-shop product scraping, covering the whole pipeline of packet capture → anti-bot analysis → working code → error handling, with an emphasis on compliance and risk-control avoidance. It is aimed at developers with basic Python crawling experience (for technical research only; commercial use is strictly prohibited).
| Anti-bot measure | Symptom | Countermeasure |
|---|---|---|
| Login-state check | Requests without a Cookie get a captcha / blank page | Log in manually and capture the Cookie + `_tb_token_` |
| Signed requests | The `sign` parameter is generated dynamically by front-end JS | Capture the encryption JS and run it with `execjs` |
| Rate limiting | High-frequency requests from one IP / account trigger 403 / throttling | Random delays + proxy IPs + spread-out request times |
| Device fingerprinting | Detects WebDriver / a fixed UA | Spoof the UA + hide the `webdriver` flag |
| Captchas | Slider / SMS verification (triggered by high frequency) | Solve manually / use a solving service (e.g. Chaojiying) |
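To make the "random delays" countermeasure in the table concrete, here is a minimal, self-contained pacing sketch. The 3 s base and 2 s jitter are illustrative values, not Taobao-documented thresholds:

```python
import random
import time


def jittered_delay(base=3.0, jitter=2.0):
    """Return a randomized inter-request delay so the interval is never fixed."""
    return base + random.uniform(0.0, jitter)


def throttle(last_request_ts, base=3.0, jitter=2.0):
    """Sleep until at least a jittered delay has elapsed since the last request."""
    wait = jittered_delay(base, jitter) - (time.time() - last_request_ts)
    if wait > 0:
        time.sleep(wait)
    return time.time()  # new "last request" timestamp
```

Call `last_ts = throttle(last_ts)` before each request instead of a fixed `time.sleep(3)`, so the interval varies on every request.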
Install the dependencies (note: the `execjs` module is published on PyPI as `PyExecJS`):

```bash
pip install requests parsel PyExecJS pycryptodome pillow fake-useragent requests-toolbelt retry
```

- `execjs`: executes the front-end encryption JS; `pycryptodome`: encryption / decryption; `pillow`: image preprocessing; `retry`: request retries; `requests-toolbelt`: large-upload handling.
- Optional: `mitmproxy` / Charles (packet capture); Node.js as a faster JS runtime for `execjs`.
- Capture tooling: Charles / Fiddler / the browser F12 Network panel, used to capture the real endpoint, parameters, and encryption JS on `taobao.com` / `tmall.com`.

Image-search endpoint: `https://s.taobao.com/simba/imgSearch.htm` (POST). Key parameters:

| Parameter | Description |
|---|---|
| `imageBase64` | Base64-encoded image (without the `data:image/jpeg;base64,` prefix) |
| `_tb_token_` | Login token (extracted from the Cookie / page source) |
| `sign` | Request signature (generated by front-end JS) |
| `timestamp` | Millisecond timestamp |
The matched goods come back in `data.list`. Taobao restricts image size and format (≤1 MB, JPG/PNG recommended), so compress the image and convert it to Base64:
```python
import base64
from io import BytesIO

from PIL import Image


def image_to_base64(image_path, max_size=(800, 800)):
    """
    Convert an image to Base64 (compressed, data-URI prefix stripped).
    :param image_path: local image path or image URL
    :param max_size: maximum size (width, height)
    :return: Base64 string
    """
    try:
        # Fetch remote images; open local ones directly
        if image_path.startswith(('http://', 'https://')):
            import requests
            resp = requests.get(image_path, timeout=10)
            img = Image.open(BytesIO(resp.content))
        else:
            img = Image.open(image_path)
        # Compress while keeping the aspect ratio
        img.thumbnail(max_size)
        # Convert to JPEG (the RGB conversion avoids PNG alpha-channel errors)
        buffer = BytesIO()
        img.convert("RGB").save(buffer, format='JPEG', quality=80)
        # Base64-encode without the data:image/jpeg;base64, prefix
        return base64.b64encode(buffer.getvalue()).decode('utf-8')
    except Exception as e:
        raise ValueError(f"Image-to-Base64 conversion failed: {e}")


if __name__ == "__main__":
    print(image_to_base64("./test.jpg"))
```

Taobao's `sign` parameter is generated by front-end encryption JS; the steps are:
1. Locate the JS file that produces `sign` (browser F12 → Sources → search for `sign` / `md5`);
2. Extract the encryption logic and execute it with `execjs` to generate `sign`.

```python
import time

import execjs

# Taobao Pailitao encryption JS (simplified example -- update it from a fresh
# packet capture; a Node.js runtime is needed because of require('crypto'))
ENCRYPT_JS = """
const crypto = require('crypto');
function getSign(params, tb_token) {
    // 1. Sort parameters by key
    const keys = Object.keys(params).sort();
    let str = '';
    for (let k of keys) { str += k + params[k]; }
    // 2. Append tb_token
    str += tb_token;
    // 3. MD5-hash and uppercase
    return crypto.createHash('md5').update(str).digest('hex').toUpperCase();
}
"""


def generate_sign(params, tb_token):
    """Generate the sign signature."""
    try:
        ctx = execjs.compile(ENCRYPT_JS)          # compile the JS
        return ctx.call("getSign", params, tb_token)  # call the function
    except Exception as e:
        raise RuntimeError(f"Failed to generate sign: {e}")


if __name__ == "__main__":
    params = {
        "imageBase64": "test-base64",
        "t": str(int(time.time() * 1000)),
        "_tb_token_": "your _tb_token_"
    }
    print(generate_sign(params, params["_tb_token_"]))
```

```python
import json
import time

import requests
from fake_useragent import UserAgent
from retry import retry

# -------------------------- Configuration --------------------------
# Copy from the browser after logging in (F12 -> Network -> any request -> Cookie)
COOKIE = "your Taobao Cookie"
# Extract from the page source / request params (F12 -> Elements -> search _tb_token_)
TB_TOKEN = "your _tb_token_"
# Proxy (optional; a high-anonymity proxy is recommended)
PROXIES = {
    # "http": "http://127.0.0.1:7890",
    # "https": "http://127.0.0.1:7890"
}
UA = UserAgent()  # random UA


# -------------------------- Core function --------------------------
@retry(tries=3, delay=2, backoff=2)  # retry 3 times with increasing delay
def taobao_image_search(image_path):
    """
    Taobao image search (Pailitao).
    :param image_path: local image path or image URL
    :return: parsed list of goods
    """
    # 1. Convert the image to Base64
    image_base64 = image_to_base64(image_path)
    # 2. Build the base parameters
    timestamp = str(int(time.time() * 1000))
    base_params = {
        "action": "imgSearch",
        "imageBase64": image_base64,
        "t": timestamp,
        "_tb_token_": TB_TOKEN,
        "q": "",
        "spm": "a21bo.jianhua.201866-taobao-item.1",
        "callback": f"jsonp_{timestamp}_{timestamp}"
    }
    # 3. Generate the sign signature
    base_params["sign"] = generate_sign(base_params, TB_TOKEN)
    # 4. Build the request headers
    headers = {
        "User-Agent": UA.random,
        "Cookie": COOKIE,
        "Referer": "https://shang.taobao.com/",
        "Origin": "https://shang.taobao.com",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
    }
    # 5. Send the request
    try:
        resp = requests.post(
            url="https://s.taobao.com/simba/imgSearch.htm",
            data=base_params,
            headers=headers,
            proxies=PROXIES,
            timeout=20,
            verify=False  # skip SSL verification (optional)
        )
        resp.raise_for_status()  # raise on HTTP errors
        # Unwrap the JSONP envelope, if present
        resp_text = resp.text
        if resp_text.startswith("jsonp_"):
            resp_text = resp_text[resp_text.find("(") + 1: resp_text.rfind(")")]
        # Parse the (possibly unwrapped) text -- not resp.json(), which would
        # choke on the JSONP wrapper
        result = json.loads(resp_text)
        if not result.get("success"):
            raise ValueError(f"API returned failure: {result.get('msg')}")
        # Extract the core goods fields
        goods_list = result.get("data", {}).get("list", [])
        parsed_data = []
        for goods in goods_list:
            parsed_data.append({
                "item_id": goods.get("itemId"),
                "title": goods.get("title"),
                "price": goods.get("price"),
                "url": goods.get("clickUrl"),
                "shop_name": goods.get("shopName"),
                "shop_id": goods.get("shopId"),
                "sales": goods.get("sales"),
                "similarity": goods.get("similarity")  # image similarity
            })
        return parsed_data
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Request failed: {e}")
    except Exception as e:
        raise RuntimeError(f"Parsing failed: {e}")


# -------------------------- Test call --------------------------
if __name__ == "__main__":
    try:
        result = taobao_image_search("./test.jpg")  # replace with your image path
        print(f"Found {len(result)} similar items:")
        for idx, goods in enumerate(result, 1):
            print(f"\n[{idx}] {goods}")
    except Exception as e:
        print(f"Execution failed: {e}")
```

| Issue | Cause | Fix |
|---|---|---|
| `sign` validation fails | The JS encryption logic has changed | Re-capture and extract the latest encryption JS |
| Login state invalid | The Cookie / `_tb_token_` has expired | Log in to Taobao again and copy the latest Cookie / `_tb_token_` |
| A captcha page is returned | High-frequency requests / IP risk control | Switch IPs + increase delays + solve the slider manually |
| Image parsing fails | Image format / size out of spec | Convert to JPEG + compress to within 800x800 |
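The "captcha page is returned" row can be detected programmatically before parsing, so the crawler can back off instead of crashing. A heuristic sketch; the marker strings and status codes are assumptions drawn from commonly observed Taobao risk-control pages, not a documented contract:

```python
def looks_like_risk_control(status_code: int, body: str) -> bool:
    """Heuristically detect a Taobao risk-control / captcha page."""
    markers = ("punish", "captcha", "security-check", "验证码")
    return status_code in (403, 302) or any(m in body for m in markers)
```

Usage: after each response, call `looks_like_risk_control(resp.status_code, resp.text)`; on `True`, sleep longer and rotate the proxy instead of feeding the page to the parser.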
Taobao shop links come in several formats, so `shop_id` extraction must handle all of them:
```python
import re


def extract_shop_id(shop_url):
    """
    Extract the shop_id from a shop URL.
    Supported formats:
    1. https://shop12345678.taobao.com/
    2. https://xxx.taobao.com/shop/view_shop.htm?user_number_id=12345678
    3. https://detail.tmall.com/seller_view.htm?user_id=12345678
    """
    patterns = (
        r"shop(\d+)\.taobao\.com",  # shopXXXXXX.taobao.com
        r"user_number_id=(\d+)",    # user_number_id=XXXXXX
        r"user_id=(\d+)",           # Tmall user_id=XXXXXX
    )
    for pattern in patterns:
        match = re.search(pattern, shop_url)
        if match:
            return match.group(1)
    raise ValueError(f"Could not extract a shop ID from: {shop_url}")


if __name__ == "__main__":
    print(extract_shop_id("https://shop12345678.taobao.com/"))  # prints 12345678
```

Shop goods endpoint: `https://s.taobao.com/search?q=&seller_id={shop_id}&page={page}` (GET). `page` starts at 1, and an empty result list marks the last page; the goods data is embedded in the page's `g_page_config` JS variable (JSON format).
```python
import json
import re
import time

import requests
from fake_useragent import UserAgent
from parsel import Selector
from retry import retry

# -------------------------- Configuration --------------------------
COOKIE = "your Taobao Cookie"
PROXIES = {
    # "http": "http://127.0.0.1:7890",
    # "https": "http://127.0.0.1:7890"
}
REQUEST_DELAY = 3  # request delay (seconds)
MAX_PAGE = 50      # maximum pages to crawl (guards against infinite loops)
UA = UserAgent()


# -------------------------- Helpers --------------------------
def parse_taobao_goods_html(html):
    """Parse the Taobao goods-list HTML and extract the goods data."""
    selector = Selector(text=html)
    # Extract g_page_config (the core data blob)
    config_str = selector.css("script:contains('g_page_config')::text").get()
    if not config_str:
        return []
    # Clean up the JSON string (strip the surrounding JS)
    try:
        config_str = re.search(r"g_page_config = (.*?);\s+g_srp_loadCss", config_str).group(1)
        config_data = json.loads(config_str)
    except (AttributeError, json.JSONDecodeError):
        # AttributeError: the regex found no match
        return []
    # Extract the goods list
    goods_list = config_data.get("mods", {}).get("itemlist", {}).get("data", {}).get("auctions", [])
    parsed_data = []
    for goods in goods_list:
        parsed_data.append({
            "item_id": goods.get("nid"),
            "title": goods.get("raw_title"),
            "price": goods.get("view_price"),
            "sales": goods.get("view_sales"),
            "url": f"https://item.taobao.com/item.htm?id={goods.get('nid')}",
            "shop_name": goods.get("nick"),
            "shop_id": goods.get("user_id"),
            "location": goods.get("item_loc"),
            "is_tmall": goods.get("is_tmall", False)
        })
    return parsed_data


@retry(tries=2, delay=2)
def crawl_shop_page(shop_id, page_no):
    """Crawl a single page of goods."""
    url = f"https://s.taobao.com/search?q=&seller_id={shop_id}&page={page_no}"
    headers = {
        "User-Agent": UA.random,
        "Cookie": COOKIE,
        "Referer": f"https://shop{shop_id}.taobao.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9"
    }
    resp = requests.get(url=url, headers=headers, proxies=PROXIES, timeout=20, verify=False)
    resp.raise_for_status()
    return parse_taobao_goods_html(resp.text)


# -------------------------- Core crawler --------------------------
def crawl_taobao_shop_all_goods(shop_url):
    """Crawl every item in a shop."""
    # 1. Extract the shop ID (extract_shop_id is defined above)
    shop_id = extract_shop_id(shop_url)
    print(f"Crawling shop ID: {shop_id}")
    # 2. Page through the listings
    all_goods = []
    page_no = 1
    while page_no <= MAX_PAGE:
        print(f"Crawling page {page_no}...")
        try:
            # Random delay (avoid a fixed interval)
            time.sleep(REQUEST_DELAY + float(time.time() % 1))
            page_goods = crawl_shop_page(shop_id, page_no)
            if not page_goods:
                print(f"Page {page_no} is empty; crawl finished")
                break
            all_goods.extend(page_goods)
            page_no += 1
        except Exception as e:
            print(f"Page {page_no} failed: {e}")
            # On failure, rotate the proxy / back off; simplified here to a skip
            page_no += 1
            continue
    # 3. Save the data
    out_file = f"taobao_shop_{shop_id}_goods.json"
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(all_goods, f, ensure_ascii=False, indent=4)
    print(f"Done! {len(all_goods)} items saved to {out_file}")
    return all_goods


# -------------------------- Test call --------------------------
if __name__ == "__main__":
    try:
        crawl_taobao_shop_all_goods("https://shop12345678.taobao.com/")  # target shop URL
    except Exception as e:
        print(f"Execution failed: {e}")
```

Possible optimizations: read `totalCount` from `g_page_config` and compute the total page count from it, instead of relying on the `MAX_PAGE` cap.
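The `totalCount` idea amounts to one line of arithmetic. A sketch; the 44-items-per-page default is an assumption about Taobao's search page size, so take the real value from an actual capture:

```python
import math


def total_pages(total_count: int, page_size: int = 44) -> int:
    """Derive the number of result pages from g_page_config's totalCount."""
    return math.ceil(total_count / page_size) if total_count > 0 else 0
```

With this, the crawl loop can run `for page_no in range(1, total_pages(total_count) + 1)` instead of probing until an empty page.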
Rotate IPs via a proxy pool (the pool URL below is a placeholder):

```python
def get_proxy():
    """Fetch an IP from your proxy pool."""
    resp = requests.get("http://your-proxy-pool/get")
    return {"http": resp.text, "https": resp.text}

# Use it per request: proxies=get_proxy()
```

For the login state, use `selenium` to simulate login (the slider captcha must be handled) and extract the Cookie / `_tb_token_` automatically.
A captcha-solving service can handle the slider, e.g. Chaojiying:

```python
# Example: Chaojiying captcha solving (requires a registered account)
def get_captcha_result(captcha_img_path):
    import requests
    params = {
        "user": "your account",
        "pass2": "MD5 of your password",
        "softid": "your software ID",
        "codetype": "1902"  # slider-captcha type
    }
    files = {"userfile": open(captcha_img_path, "rb")}
    resp = requests.post("http://upload.chaojiying.net/Upload/Processing.php",
                         data=params, files=files)
    return resp.json()["pic_str"]
```

Save the encryption JS to a standalone `.js` file to avoid recompiling it on every request:
```python
# Load the external JS file
def load_encrypt_js(js_path):
    with open(js_path, "r", encoding="utf-8") as f:
        return f.read()

ctx = execjs.compile(load_encrypt_js("./taobao_sign.js"))
```

Hide the `webdriver` fingerprint (when using selenium):
```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
```

For long-term, stable use you must integrate the official Open Platform API:
- Image search: `taobao.pai.litao.search` (requires a permission application);
- Shop item search: `taobao.seller.item.search` (requires enterprise qualification);
```python
import hashlib
import time

import requests


def taobao_open_api(app_key, app_secret, method, params):
    """Call a Taobao Open Platform API via the router endpoint."""
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    params.update({
        "method": method,
        "app_key": app_key,
        "timestamp": timestamp,
        "format": "json",
        "v": "2.0",
        "sign_method": "md5"
    })
    # Generate the signature: key-sorted key+value pairs, secret appended, uppercase MD5
    sign_str = ''.join(f"{k}{v}" for k, v in sorted(params.items())) + app_secret
    params["sign"] = hashlib.md5(sign_str.encode()).hexdigest().upper()
    resp = requests.get("https://eco.taobao.com/router/rest", params=params)
    return resp.json()


# Example call (replace with a real AppKey / AppSecret)
# result = taobao_open_api(
#     app_key="your AppKey",
#     app_secret="your AppSecret",
#     method="taobao.seller.item.search",
#     params={"seller_id": "shop ID", "page_no": 1}
# )
```
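The MD5 signing scheme used in the router call above (key-sorted key+value concatenation with the app secret appended, uppercase MD5 hex) can be factored into a standalone helper that is testable offline. This mirrors the inline logic rather than adding anything new:

```python
import hashlib


def top_sign(params: dict, app_secret: str) -> str:
    """Uppercase-MD5 sign over key-sorted key+value pairs plus the app secret."""
    base = ''.join(f"{k}{v}" for k, v in sorted(params.items())) + app_secret
    return hashlib.md5(base.encode("utf-8")).hexdigest().upper()
```

Keeping the signing step in one pure function makes it easy to compare your output against a known-good capture before debugging anything network-related.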