A Complete Practical Guide to Python Web Scraping: From Beginner to Pro
Web scraping is a core data-collection technique that lets you automatically harvest large amounts of data from the internet. Starting from zero, this article walks through the full Python scraping stack: HTTP requests, data parsing, anti-scraping countermeasures, and async acceleration, to help you become proficient at building crawlers.
1. Crawler Fundamentals
A web crawler (also called a web spider) is a program that automatically fetches information from the web according to a set of rules. Its core workflow boils down to four steps (a minimal end-to-end sketch follows the list):
1. Send a request: issue an HTTP request to the target site
2. Receive the response: get the HTML/JSON data returned by the server
3. Parse the data: extract the information you need from the response
4. Store the data: save the extracted data to a local file or database
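Here is a minimal sketch of those four steps in one place; the target URL and page structure (a <title> tag) are placeholders for illustration:
import requests
from lxml import etree
# Steps 1-2: send the request and receive the response
response = requests.get("https://example.com", timeout=10)
# Step 3: parse the response (here, grab the page title via XPath)
tree = etree.HTML(response.text)
title = "".join(tree.xpath("//title/text()"))
# Step 4: store the extracted data
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)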
1.1 HTTP Protocol Basics
Understanding the HTTP protocol is the first step in crawler development. An HTTP request consists of three parts: the request line, the request headers, and the request body:
import requests
# Send a GET request - the most common request type
response = requests.get(
    url="https://httpbin.org/get",
    params={"key": "value", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    },
    timeout=10
)
print(f"状态码: {response.status_code}")
# Send a POST request with form-encoded data
post_response = requests.post(
    url="https://httpbin.org/post",
    data={"username": "admin", "password": "123456"},
)
# For a JSON body, use json= instead of data= (passing both makes requests
# ignore json=); requests sets the Content-Type header automatically
json_response = requests.post(
    url="https://httpbin.org/post",
    json={"key": "value"},
)
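Before parsing, it is worth validating the response. A short sketch against the same httpbin.org endpoint; the checks shown are standard requests API:
import requests
response = requests.get("https://httpbin.org/get", timeout=10)
# Raise HTTPError on 4xx/5xx instead of silently parsing an error page
response.raise_for_status()
# requests guesses the text encoding; apparent_encoding is often more reliable
response.encoding = response.apparent_encoding
print(response.status_code)                  # e.g. 200
print(response.headers.get("Content-Type"))  # response headers are a dict-like
print(response.json())                       # parse a JSON body directly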
1.2 Session Management
Many sites only expose data after you log in; a Session keeps the login state across requests:
import requests
# Create a Session object; cookies are managed automatically
session = requests.Session()
# Simulate a login
login_url = "https://example.com/login"
login_data = {"username": "your_username", "password": "your_password"}
response = session.post(login_url, data=login_data)
# Access an authenticated page after logging in
profile_url = "https://example.com/profile"
profile = session.get(profile_url)
print(profile.text)
# Close the session
session.close()
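Session also works as a context manager, which closes it (and its pooled connections) automatically even when an exception is raised; an equivalent variant of the flow above:
import requests
with requests.Session() as session:
    session.post(
        "https://example.com/login",
        data={"username": "your_username", "password": "your_password"},
    )
    profile = session.get("https://example.com/profile")
    print(profile.text)
# the session is closed automatically when the with-block exits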
2. Data Parsing Techniques
Once you have the page's HTML, you need to extract the data from it. Three parsing approaches are common: regular expressions, XPath, and CSS selectors.
2.1 Parsing with Regular Expressions
Regular expressions are the most basic approach, suited to simple text matching:
import re
import requests
html = requests.get("https://example.com").text
# Extract all links (fragile on real-world HTML; fine for simple pages)
links = re.findall(r'<a href="(.*?)">.*?</a>', html, re.S)
print(f"Found {len(links)} links")
# Extract email addresses
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", html)
# Extract phone numbers (mainland China)
phones = re.findall(r"1[3-9]\d{9}", html)
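When the same pattern runs against many pages, pre-compiling it with re.compile avoids re-parsing the pattern on every call; a small sketch with placeholder page text:
import re
# Compile once, reuse across pages
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
pages = ["contact: alice@example.com", "no email on this page"]
for page in pages:
    for match in EMAIL_RE.finditer(page):
        print(match.group())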
2.2 Parsing with XPath (Recommended)
XPath is the XML path language; it is powerful yet readable, and very fast when paired with the lxml library:
from lxml import etree
import requests
html = requests.get("https://example.com").text
tree = etree.HTML(html)
# Basic path selection
titles = tree.xpath("//h1/text()")
print(f"H1 titles: {titles}")
# Attribute selection
links = tree.xpath("//a/@href")
images = tree.xpath("//img/@src")
# Conditional filtering
hot_items = tree.xpath("//div[contains(@class, 'hot')]/ul/li")
# Extract multiple fields per node
for item in tree.xpath("//div[@class='product']"):
    title = item.xpath("./h2/text()")[0]
    price = item.xpath(".//span[@class='price']/text()")[0]
    link = item.xpath(".//a/@href")[0]
    print(f"{title} - {price} - {link}")
2.3 Parsing with BeautifulSoup
BeautifulSoup offers a more Pythonic API and is well suited to rapid development:
from bs4 import BeautifulSoup
import requests
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "lxml")
# Tag access
title = soup.title.string
first_h1 = soup.h1.text
# find and find_all
all_links = soup.find_all("a")
first_div = soup.find("div")
# Conditional filtering
articles = soup.find_all("div", class_="article")
# Get attributes and text
for link in soup.find_all("a"):
    href = link.get("href")
    text = link.get_text(strip=True)
    if href:
        print(f"{text}: {href}")
3. Anti-Scraping Countermeasures
To protect their data, websites deploy a variety of anti-scraping mechanisms. Below are the common ones and how to deal with them:
3.1 User-Agent Detection
The most basic anti-scraping measure is inspecting the User-Agent to spot non-browser requests:
import requests
from fake_useragent import UserAgent
# Generate a random User-Agent
ua = UserAgent()
def get_random_headers():
    return {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }
response = requests.get("https://example.com", headers=get_random_headers())
3.2 IP Bans and Proxies
Frequent requests will get your IP banned; a proxy pool is an effective workaround:
import requests
import random
PROXY_POOL = [
    {"http": "http://ip1:port", "https": "https://ip1:port"},
    {"http": "http://ip2:port", "https": "https://ip2:port"},
]
def request_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            proxy = random.choice(PROXY_POOL)
            response = requests.get(
                url,
                proxies=proxy,
                headers={"User-Agent": "Mozilla/5.0..."},
                timeout=10
            )
            if response.status_code == 200:
                return response
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            continue
    return None
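Often you do not need a proxy pool at all: keeping the request rate low and backing off after failures is enough to stay under the ban threshold on many sites. A simple sketch combining per-attempt exponential backoff with random jitter; the delay values are illustrative:
import random
import time
import requests
def polite_get(url, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass
        # Exponential backoff with jitter: ~1s, ~2s, ~4s ...
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None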
4. Speeding Up with Asynchronous Requests
A synchronous crawler spends most of its time waiting on I/O; an asynchronous crawler keeps many requests in flight at once, dramatically improving throughput:
import asyncio
import aiohttp
async def fetch_one(session, url, semaphore):
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                return await response.text()
        except Exception as e:
            print(f"Request failed {url}: {e}")
            return None
async def fetch_all(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    connector = aiohttp.TCPConnector(limit=100)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_one(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
# Usage example
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
results = asyncio.run(fetch_all(urls, max_concurrent=20))
print(f"Successfully fetched {sum(1 for r in results if r)} pages")
5. Data Storage
Scraped data needs to be persisted. The common options are plain files and databases:
import json
import csv
import sqlite3
# 1. JSON file storage
def save_to_json(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
# 2. CSV file storage
def save_to_csv(data_list, filename):
    if not data_list:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data_list[0].keys())
        writer.writeheader()
        writer.writerows(data_list)
# 3. SQLite database storage
def init_db(db_path="spider.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            url TEXT UNIQUE,
            content TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn
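With the table created, inserting scraped records is a single parameterized statement; INSERT OR IGNORE skips rows whose url already exists thanks to the UNIQUE constraint. A usage sketch for init_db with placeholder values:
conn = init_db()
conn.execute(
    "INSERT OR IGNORE INTO articles (title, url, content) VALUES (?, ?, ?)",
    ("Sample title", "https://example.com/post/1", "article body..."),
)
conn.commit()
conn.close()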
6. A Complete Hands-On Project
Below is a complete crawler project covering the whole pipeline from fetching pages to storing data:
import requests
from lxml import etree
import sqlite3
import time
import random
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class NewsSpider:
    def __init__(self, base_url, db_path="news.db"):
        self.base_url = base_url
        self.session = requests.Session()
        self.db = self._init_db(db_path)
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        })

    def _init_db(self, db_path):
        conn = sqlite3.connect(db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                url TEXT UNIQUE,
                content TEXT,
                publish_time TEXT
            )
        """)
        conn.commit()
        return conn

    def get_article_list(self, page=1):
        url = f"{self.base_url}/page/{page}"
        response = self.session.get(url, timeout=10)
        tree = etree.HTML(response.text)
        return tree.xpath("//div[@class='article-item']/a/@href")

    def parse_article(self, url):
        response = self.session.get(url, timeout=10)
        tree = etree.HTML(response.text)
        return {
            "title": tree.xpath("//h1/text()")[0].strip(),
            "url": url,
            "content": "".join(tree.xpath("//div[@class='content']//p/text()"))
        }

    def run(self, max_pages=5):
        for page in range(1, max_pages + 1):
            urls = self.get_article_list(page)
            for url in urls:
                article = self.parse_article(url)
                if article:
                    self.db.execute(
                        "INSERT OR IGNORE INTO articles (title, url, content) VALUES (?, ?, ?)",
                        (article["title"], article["url"], article["content"])
                    )
                    self.db.commit()
                # Random delay to avoid hammering the server
                time.sleep(random.uniform(1, 3))
        logger.info("Crawl finished")

if __name__ == "__main__":
    spider = NewsSpider("https://news.example.com")
    spider.run(max_pages=3)
Summary
Python web scraping is a multi-disciplinary skill that spans the HTTP protocol, data parsing, anti-scraping countermeasures, asynchronous programming, and more. Going from basics to practice, this article covered:
1. HTTP protocol basics: understanding the request/response cycle and Session management
2. Data parsing techniques: regular expressions, XPath, and BeautifulSoup
3. Anti-scraping countermeasures: User-Agent spoofing and proxy pools
4. Asynchronous acceleration: high-concurrency fetching with aiohttp
5. Data storage options: JSON, CSV, and SQLite
6. A complete hands-on project: a full crawler from design to implementation
With these skills you can handle the data-collection needs of most websites. Remember: obey each site's robots.txt rules, respect data copyright, and keep your request rate reasonable.
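On the robots.txt point, the standard library ships a parser; a minimal sketch of checking a URL before crawling it (the user-agent name "MySpider" is a placeholder):
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# can_fetch(user_agent, url) reports whether the rules allow the request
if rp.can_fetch("MySpider", "https://example.com/page/1"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")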
Permalink: https://www.kkkliao.cn/?id=922 — reproduction requires permission!
Copyright notice: published by 廖万里的博客 (Liao Wanli's Blog); please credit the source when reposting.