Target URL
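Analyzing the page's resources shows that Toutiao (今日头条) loads its memes in two ways.
The images present when the page first opens are loaded statically.
As you scroll down the page, further memes are loaded dynamically.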
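Fetching the statically loaded memes

Requesting the static page

    import requests

    # Search results page for the keyword "可爱表情包" (cute memes), first page
    url = 'https://so.toutiao.com/search?keyword=%E5%8F%AF%E7%88%B1%E8%A1%A8%E6%83%85%E5%8C%85&pd=atlas&source=input&dvpf=pc&aid=4916&page_num=0'

    response = requests.get(url)

Checking the length of the returned page gives:

1371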
The page content is wrong: it is far too short, which suggests that the site has anti-scraping measures in place. Sending browser-like request headers (a User-Agent and a cookie) gets around this.

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'cookie': 'tt_webid=7070411004517713438; _S_IPAD=0; MONITOR_WEB_ID=7070411004517713438; _S_DPR=1.5; _S_WIN_WH=1494_791; ttwid=1%7CWRYiyMilfWWPqkGJbYTLHNlKONhX8FbDm05Zqmp8IGI%7C1662364466%7Cf0ab5e3c3dacc114198107705e1107693379ebe9b45225f4e44c924923d7cca4; _tea_utm_cache_4916=undefined',
    }

    # Repeat the request, this time with the browser-like headers
    response = requests.get(url, headers=headers)

The page content is now normal.
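The check that motivated the extra headers can be sketched as a small helper. This is only an illustration: the function name `looks_blocked` and the 5000-character threshold are assumptions chosen for the example, not values taken from the site.

```python
def looks_blocked(html: str, min_length: int = 5000,
                  marker: str = '"img_url"') -> bool:
    """Return True when the page is probably a stub served to a scraper.

    `min_length` is an arbitrary illustrative threshold; `marker` is the
    field name that the image data is expected to contain.
    """
    return len(html) < min_length or marker not in html
```

With the first response above (1371 characters) such a helper would return `True`, signalling that the request should be retried with headers.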
Extracting the meme source URLs

    import re

    # Each image URL sits between the "img_url" and "img_small_url" fields
    imgs_url = re.findall('"img_url":"(.*?)","img_small_url"', response.text)
    len(imgs_url)

40
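To see what the pattern captures, here is a minimal sketch that runs it on a made-up string. The `sample` text and the `example.com` URLs are invented for illustration; only the regular expression comes from the walkthrough.

```python
import re

# Same pattern as above, compiled once for reuse
IMG_URL_PATTERN = re.compile(r'"img_url":"(.*?)","img_small_url"')

# Invented fragment that mimics the field order the pattern relies on
sample = (
    '{"img_url":"https://example.com/a~640x640.jpeg",'
    '"img_small_url":"https://example.com/a~200x200.jpeg"},'
    '{"img_url":"https://example.com/b~640x640.jpeg",'
    '"img_small_url":"https://example.com/b~200x200.jpeg"}'
)

urls = IMG_URL_PATTERN.findall(sample)
print(urls)
# ['https://example.com/a~640x640.jpeg', 'https://example.com/b~640x640.jpeg']
```

The non-greedy `(.*?)` stops at the first `","img_small_url"` that follows, so the pattern depends on those two fields staying adjacent in the page source.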
Saving all images locally

Getting the image name

    # The last path segment of the URL serves as the file name
    imgs_url[2].split('/')[-1]

'53dca2fabe514b5ca2a39a4cf741ee90~640x640.jpeg'
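Splitting on `/` works for URLs like the one above. As a hedged alternative, the standard library can also discard a query string if one is present. The helper name `image_name` and the `default` fallback are assumptions made for this sketch.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def image_name(url: str, default: str = "image") -> str:
    """Return the last path segment of `url`, ignoring query and fragment."""
    path = urlsplit(url).path  # drops any ?query or #fragment part
    name = posixpath.basename(unquote(path))
    return name or default  # fall back when the URL ends with a slash
```

For example, `image_name('https://example.com/dir/53dca2fabe514b5ca2a39a4cf741ee90~640x640.jpeg?from=search')` yields the same file name as shown above.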
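Saving the images locally

    import os

    # open() fails if the target folder is missing, so create it first
    os.makedirs('imgs', exist_ok=True)

    for img_url in imgs_url:
        # Use the last path segment as the file name
        img_name = img_url.split('/')[-1]
        img_res = requests.get(img_url)
        # Write the raw bytes of the image to disk
        with open('imgs/' + img_name, 'wb') as f:
            f.write(img_res.content)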
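The save loop can be wrapped in a function that creates the folder and keeps going when a single download fails. This is a sketch under assumptions: `save_images` and its `fetch` parameter are hypothetical names, and `fetch` stands for any callable that returns the image bytes for a URL.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def save_images(urls: Iterable[str], out_dir: str,
                fetch: Callable[[str], bytes]) -> List[Path]:
    """Download each URL with `fetch` and write it under `out_dir`."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved: List[Path] = []
    for url in urls:
        name = url.split("/")[-1]  # same naming rule as the loop above
        try:
            data = fetch(url)
        except Exception as exc:  # one failed download should not stop the rest
            print(f"skipped {url}: {exc}")
            continue
        path = target / name
        path.write_bytes(data)
        saved.append(path)
    return saved
```

With Requests, `fetch` could be `lambda u: requests.get(u, timeout=10).content`; passing it in as a parameter also makes the function easy to try out without network access.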
Dynamic loading

Requesting the dynamically loaded data

    # Same search, but with page_num=2 and rawJSON=1 so the server returns JSON
    url2 = 'https://so.toutiao.com/search?keyword=%E5%8F%AF%E7%88%B1%E8%A1%A8%E6%83%85%E5%8C%85&pd=atlas&source=input&dvpf=pc&aid=4916&page_num=2&rawJSON=1&search_id=20220914082011010150065167076D647B'

    response2 = requests.get(url2, headers=headers)
Parsing the data

    # Decode the JSON body into Python objects
    res2_json = response2.json()

    # Each entry under rawData -> data carries one image URL
    imgs_url = [item['img_url'] for item in res2_json['rawData']['data']]
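Indexing straight into `res2_json['rawData']['data']` raises an error if a key is missing. A more defensive variant might look like the sketch below. The function name `extract_img_urls` is hypothetical, and the payload shape is assumed to match what the walkthrough shows (`rawData` → `data` → `img_url`).

```python
from typing import Any, Dict, List


def extract_img_urls(payload: Dict[str, Any]) -> List[str]:
    """Collect img_url values, tolerating missing or empty fields."""
    raw = payload.get("rawData")
    if not isinstance(raw, dict):
        return []
    items = raw.get("data") or []
    return [
        item["img_url"]
        for item in items
        if isinstance(item, dict) and item.get("img_url")
    ]
```

It would be called as `extract_img_urls(response2.json())` and returns an empty list instead of raising when the response has an unexpected shape.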
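Saving the images locally

    # The imgs/ folder must already exist
    for img_url in imgs_url:
        # Use the last path segment as the file name
        img_name = img_url.split('/')[-1]
        img_res = requests.get(img_url)
        # Write the raw bytes of the image to disk
        with open('imgs/' + img_name, 'wb') as f:
            f.write(img_res.content)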
Fetching the memes from a specific page

    n = int(input('Enter the page number to scrape: '))

    # Only page_num changes from page to page
    url2 = f'https://so.toutiao.com/search?keyword=%E5%8F%AF%E7%88%B1%E8%A1%A8%E6%83%85%E5%8C%85&pd=atlas&source=input&dvpf=pc&aid=4916&page_num={n}&rawJSON=1&search_id=20220914082011010150065167076D647B'

    response2 = requests.get(url2, headers=headers)
    res2_json = response2.json()

    # Collect the image URLs on this page
    imgs_url = [item['img_url'] for item in res2_json['rawData']['data']]

    # Download and save every image (the imgs/ folder must already exist)
    for img_url in imgs_url:
        img_name = img_url.split('/')[-1]
        img_res = requests.get(img_url)
        with open('imgs/' + img_name, 'wb') as f:
            f.write(img_res.content)

Enter the page number to scrape: 10
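Rather than pasting the page number into a long string, the query can be assembled with `urllib.parse.urlencode`. This is a minimal sketch: `build_search_url` and `BASE_URL` are names invented here, while the parameter names and values are copied from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://so.toutiao.com/search"


def build_search_url(page_num: int, search_id: str,
                     keyword: str = "可爱表情包") -> str:
    """Build the rawJSON search URL for one page of results."""
    params = {
        "keyword": keyword,      # percent-encoded as UTF-8 by urlencode
        "pd": "atlas",
        "source": "input",
        "dvpf": "pc",
        "aid": 4916,
        "page_num": page_num,
        "rawJSON": 1,
        "search_id": search_id,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling `build_search_url(2, '20220914082011010150065167076D647B')` reproduces the `url2` string from the dynamic-loading step, and changing `keyword` would target a different search.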
Scraping memes in bulk

    num = int(input('Enter the number of pages to scrape: '))

    # Dynamic pages are requested from page_num=2 up to and including num
    for n in range(2, num + 1):
        url2 = f'https://so.toutiao.com/search?keyword=%E5%8F%AF%E7%88%B1%E8%A1%A8%E6%83%85%E5%8C%85&pd=atlas&source=input&dvpf=pc&aid=4916&page_num={n}&rawJSON=1&search_id=20220914082011010150065167076D647B'

        response2 = requests.get(url2, headers=headers)
        res2_json = response2.json()

        # Collect the image URLs on this page
        imgs_url = [item['img_url'] for item in res2_json['rawData']['data']]

        # Download and save every image (the imgs/ folder must already exist)
        for img_url in imgs_url:
            img_name = img_url.split('/')[-1]
            img_res = requests.get(img_url)
            with open('imgs/' + img_name, 'wb') as f:
                f.write(img_res.content)

Enter the number of pages to scrape: 3
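When looping over many pages it can help to pause between requests, skip pages that fail, and avoid downloading the same image twice. The sketch below illustrates one way to do that. `collect_img_urls` and `fetch_json` are hypothetical names, the one-second pause is an arbitrary choice, and treating an empty page as the end of the results is an assumption, not documented site behavior.

```python
import time
from typing import Any, Callable, Dict, Iterable, List


def collect_img_urls(pages: Iterable[int],
                     fetch_json: Callable[[int], Dict[str, Any]],
                     pause: float = 1.0) -> List[str]:
    """Gather unique image URLs across pages, in first-seen order."""
    seen = set()
    urls: List[str] = []
    for page in pages:
        try:
            payload = fetch_json(page)
        except Exception as exc:  # skip a page that fails to load or parse
            print(f"page {page} failed: {exc}")
            continue
        items = (payload.get("rawData") or {}).get("data") or []
        if not items:
            break  # assumption: an empty page means there are no more results
        for item in items:
            url = item.get("img_url")
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
        time.sleep(pause)  # be polite to the server between pages
    return urls
```

Here `fetch_json` would wrap the request from the loop above, for example a function that builds the URL for page `n`, calls `requests.get(..., headers=headers)` and returns `.json()`. The resulting list can then be passed to the same save loop.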