Extracting Data with Regular Expressions in a Python Crawler - 白红宇

Published: 2019-03-25

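Python Crawler Demo: Extracting Data with Regular Expressions

1. What Is a Regular Expression

A regular expression (regex for short) is a powerful text-matching tool: by defining a pattern, you can quickly extract or filter structured data from a web page. A pattern combines ordinary characters (such as letters and digits) with special metacharacters (such as . * ?) to describe complex matching rules.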
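As a minimal sketch of how ordinary characters and metacharacters combine, the snippet below runs a few patterns over a made-up string. The sample text and variable names are assumptions chosen for illustration; they do not come from the target page.

```python
import re

# Hypothetical sample text, used only to illustrate the metacharacters.
sample = "price: 42 yuan, discount: 7 yuan"

# \d+ matches one or more digits; findall returns every non-overlapping match.
numbers = re.findall(r"\d+", sample)

# . matches any character except a newline and * repeats it zero or more times.
# By default the repetition is greedy (it runs to the last possible colon);
# a trailing ? makes it lazy (it stops at the first colon).
greedy = re.search(r"p.*:", sample).group(0)
lazy = re.search(r"p.*?:", sample).group(0)

print(numbers)
print(greedy)
print(lazy)
```

The lazy form `.*?` is the one the rest of this article relies on, because it stops at the first closing delimiter instead of swallowing everything up to the last one.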

2. Scraping a 《男人装》 (FHM) Web Page

Steps

Fetch the target page

Use the urllib.request library to send a GET request and fetch the HTML of the target page.

from urllib import request
import re
import os

# URL of the target page
url = 'http://enrz.com/fhm/2016/12/17/74914.html'
req = request.Request(url)
html = request.urlopen(req)
content = html.read().decode('utf-8')
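The fetch step can also be wrapped in a small helper. The sketch below is one possible variant, not the article's own code: the function name `fetch_html`, the User-Agent string, the 10-second timeout, and the UTF-8 fallback are all assumptions.

```python
from urllib import request


def fetch_html(url, timeout=10):
    """Fetch a URL and return its body decoded as text."""
    # Hypothetical User-Agent string; some sites reject urllib's default one.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The timeout keeps the call from hanging forever on a dead connection.
    with request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_html(url)` would then stand in for the Request, urlopen, read, and decode sequence, with the difference that a page served in another encoding is decoded according to its own header.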

Extract the page title

Write a regular expression that matches the content inside the <h2> tag.

# Regular expression that matches the title
title_pattern = r'<h2>(.*?)</h2>'
# Search for the title and print it
title_match = re.search(title_pattern, content)
if title_match:
    print(f"Page title: {title_match.group(1)}")
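Real pages often put attributes on the heading tag or break the title across lines. The sketch below shows one way to tolerate both. The HTML fragment is made up for illustration, and the real page's markup may differ.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = '<div><h2 class="title">\n  Winter Cover Story\n</h2></div>'

# <h2[^>]*> also accepts attributes on the tag; re.S lets . match newlines,
# so a title that spans several lines is still captured.
title_pattern = re.compile(r"<h2[^>]*>(.*?)</h2>", re.S)

match = title_pattern.search(sample_html)
# search() returns None when nothing matches, so guard before calling group().
title = match.group(1).strip() if match else None
print(title)
```

Stripping the captured text removes the surrounding whitespace and newlines, which matters if the title is later used as a folder name.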

Extract the image path

Write a regular expression that captures the src attribute of an image.

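# Regular expression for the image src attribute (the tag markup of the
# original pattern was lost; this form, matching any <img> tag, is assumed)
pic_src_pattern = r'<img[^>]*?src="(.*?)"'
# Search for the first image path and print it
pic_src_match = re.search(pic_src_pattern, content)
if pic_src_match:
    print(f"Image path: {pic_src_match.group(1)}")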
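A src value may be quoted either way and may be a relative path. The following sketch collects every image path from a made-up fragment and resolves each against the page address. The page URL, the HTML, and the variable names are assumptions for illustration only.

```python
import re
from urllib.parse import urljoin

# Hypothetical page URL and HTML fragment, for illustration only.
page_url = "http://example.com/gallery/index.html"
sample_html = (
    '<p><img class="pic" src="/uploads/a.jpg" alt="a"></p>'
    "<p><img src='http://cdn.example.com/b.png'></p>"
)

# Accept single or double quotes around the attribute value; the backreference
# \1 requires the closing quote to match the opening one.
img_pattern = re.compile(r"""<img[^>]*?\ssrc=(["'])(.*?)\1""", re.I)

# With two groups, findall returns (quote, url) tuples; keep only the URL.
srcs = [src for _quote, src in img_pattern.findall(sample_html)]
# Resolve relative paths against the page URL so they can be requested.
image_urls = [urljoin(page_url, src) for src in srcs]
print(image_urls)
```

Regular expressions are adequate for a fixed, known page layout like this one; for arbitrary HTML an actual parser is usually the more robust choice.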

Create the storage folder

Create a folder named after the page title.

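# Use the title as the folder name (fall back to 'untitled' if none was found)
dir_name = title_match.group(1).strip() if title_match else 'untitled'
# Create the folder under the current working directory
dir_path = os.path.join(os.getcwd(), dir_name)
# exist_ok=True makes the call safe when the folder already exists
os.makedirs(dir_path, exist_ok=True)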
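A page title can contain characters that are not allowed in folder names, such as / or ?. The sketch below shows one way to clean the title first. The helper name `safe_dir_name`, the fallback name, and the sample title are hypothetical.

```python
import os
import re
import tempfile


def safe_dir_name(title, fallback="untitled"):
    """Turn a page title into a name that is safe to use as a folder."""
    # Replace characters that Windows or POSIX file systems reject.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title)
    # Trim leading and trailing spaces, dots, and underscores.
    cleaned = cleaned.strip(" ._")
    return cleaned or fallback


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    dir_path = os.path.join(base, safe_dir_name("FHM: 2016/12 cover?"))
    # exist_ok=True makes the call safe to repeat.
    os.makedirs(dir_path, exist_ok=True)
    created = os.path.isdir(dir_path)
print(created)
```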

Download and save the images

Loop over the image paths and save each image into the folder created above.

# Find every image path on the page
pics = re.findall(pic_src_pattern, content)
for pic in pics:
    pic_path = os.path.join(dir_path, pic.split('/')[-1])
    # Skip images that have already been saved
    if os.path.exists(pic_path):
        continue
    try:
        image_data = request.urlopen(request.Request(pic)).read()
    except (OSError, ValueError) as err:
        # Report the failure and move on to the next image
        print(f"Failed to download {pic}: {err}")
        continue
    # Save the image
    with open(pic_path, 'wb') as f:
        f.write(image_data)
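Taking the last path segment as the file name works for plain URLs, but it keeps any query string and can save the same image twice. The sketch below is one possible refinement. The helper names `filename_from_url` and `plan_downloads` and the default file name are hypothetical.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image"):
    """Derive a local file name from an image URL."""
    # urlsplit separates the path from any ?query or #fragment part,
    # which url.split('/')[-1] would otherwise keep in the file name.
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    return name or default


def plan_downloads(urls, dir_path):
    """Map each distinct URL to the local path it would be saved to."""
    plan = {}
    for url in urls:
        # dict keys are unique, so a URL that appears twice is planned once.
        plan.setdefault(url, os.path.join(dir_path, filename_from_url(url)))
    return plan
```

The download loop could then iterate over `plan_downloads(pics, dir_path).items()` and write each response body to the planned path.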

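Done!

The downloaded files are saved in the folder named by dir_name:

Each image:
- File name (taken from the last segment of the image path)
- Full path (saved as dir_name/image file name)

A detailed extraction log file (optional extension; not produced by the code above)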
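The optional log file could be produced with the standard logging module. The sketch below is only one way to do it: the helper name `make_download_logger`, the logger name, the log file name, and the logged messages are all made up for illustration.

```python
import logging
import os
import tempfile


def make_download_logger(log_path):
    """Return a logger that appends one line per event to log_path."""
    logger = logging.getLogger("image_downloader")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler


# Demonstrate with a temporary directory and made-up file names.
with tempfile.TemporaryDirectory() as base:
    log_path = os.path.join(base, "download.log")
    logger, handler = make_download_logger(log_path)
    logger.info("saved %s", "a.jpg")
    logger.warning("failed %s", "b.jpg")
    # Detach and close the handler so the file is flushed and released.
    logger.removeHandler(handler)
    handler.close()
    with open(log_path, encoding="utf-8") as f:
        log_lines = f.read().splitlines()
print(len(log_lines))
```

In the download loop, the info call would go after a successful write and the warning call into the exception handler, so the log records which images were saved and which failed.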

Reposted from: http://szdyk.baihongyu.com/
