
A Hands-On Python Web Crawler Tutorial: Building a Data Collection System from Scratch

廖万里 · 3 months ago (04-05) · Study Notes

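A web crawler is a powerful tool for acquiring data: with one, you can extract useful data from the vast amount of content on the internet. This article goes step by step from basic concepts to a practical case study and builds a complete Python crawler system.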

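1. Core Crawler Concepts

Before writing any code, we need to understand how a crawler works. A web crawler is an automated program that imitates a browser: it sends requests to a web server, retrieves the page content, and extracts the data it needs from that content.

The crawler workflow can be summarized in four steps:

Send the request: send an HTTP request to the target site and retrieve the page content
Parse the content: use a parsing library to extract the target data from the page
Store the data: save the extracted data to a file or a database
Continue crawling: follow links to other pages as needed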
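
The four steps above can be sketched as one small loop. This is only a minimal illustration, not the crawler built later in the article: `FAKE_SITE`, `fetch`, `parse`, and `crawl` are hypothetical names, the "site" is an in-memory dictionary so that the sketch runs without network access, and regular expressions stand in for a real parsing library.

```python
import json
import re
from typing import Dict, List

# Hypothetical in-memory "site" so the sketch runs offline.
FAKE_SITE: Dict[str, str] = {
    "/page1": '<h1>First page</h1><a href="/page2">next</a>',
    "/page2": "<h1>Second page</h1>",
}


def fetch(url: str) -> str:
    """Step 1: send the request (here: look the page up in FAKE_SITE)."""
    return FAKE_SITE[url]


def parse(html: str) -> dict:
    """Step 2: extract the heading and the outgoing links."""
    title = re.search(r"<h1>(.*?)</h1>", html)
    links = re.findall(r'href="([^"]+)"', html)
    return {"title": title.group(1) if title else None, "links": links}


def crawl(start_url: str) -> List[dict]:
    """Run the fetch -> parse -> store -> continue loop."""
    records = []
    queue, seen = [start_url], {start_url}
    while queue:
        url = queue.pop(0)
        page = parse(fetch(url))
        # Step 3: store the data (a list here; a file or database in practice).
        records.append({"url": url, "title": page["title"]})
        # Step 4: continue with pages that have not been visited yet.
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return records


records = crawl("/page1")
print(json.dumps(records, indent=2))
```

The `seen` set is what keeps the loop from visiting the same page twice.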

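1.1 HTTP Basics

Understanding HTTP is the foundation of crawler development. When we enter a URL in the browser, the browser sends an HTTP request to the server and the server returns a response. A crawler reproduces this exchange.

A complete HTTP request consists of the following parts:

# Basic components of an HTTP request
"""
Request method: GET, POST, PUT, DELETE, etc.
Request headers: User-Agent, Cookie, Referer, etc.
Request body: the data carried by a POST request
Request URL: the address of the target resource
"""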

# Example of common request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    # The Cookie header keeps the login session alive
    "Cookie": "session_id=abc123; user_token=xyz789"
}
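
To make the four parts of a request concrete, here is a small sketch that builds a request object with the standard library and inspects it without sending anything. The URL, form field, and header values are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Request body: form data encoded as bytes (placeholder value).
form_body = urlencode({"username": "test_user"}).encode("utf-8")

req = Request(
    "https://httpbin.org/post",              # request URL
    data=form_body,                          # request body
    headers={                                # request headers
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://httpbin.org/",
    },
    method="POST",                           # request method
)

# Nothing has been sent yet; we only look at what would be sent.
request_headers = dict(req.header_items())
print(req.get_method())
print(req.full_url)
print(request_headers)
print(req.data)
```

Note that `urllib` stores header names in capitalized form, so `User-Agent` is reported as `User-agent`; HTTP header names are case-insensitive, so this does not change the request.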

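1.2 Page Parsing Techniques

Once we have the page content, we need to extract the target data from it. Three parsing techniques are commonly used:

Regular expressions: the most basic yet flexible approach, suitable for simple text extraction
BeautifulSoup: the most popular HTML parsing library for Python, with a friendly API
XPath: a powerful path selection language, suitable for complex structured extraction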
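
The following sketch extracts the same link titles in three ways so the approaches can be compared side by side. To keep it runnable with the standard library alone, it makes two substitutions: the built-in `html.parser.HTMLParser` stands in for BeautifulSoup, and the limited XPath subset supported by `xml.etree.ElementTree` stands in for a full XPath engine such as lxml. The `SNIPPET` markup and the `LinkTextParser` class are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = (
    '<ul class="article-list">'
    '<li class="item"><a href="/a/1">First</a></li>'
    '<li class="item"><a href="/a/2">Second</a></li>'
    "</ul>"
)

# 1. Regular expression: short, but tied to the exact markup.
regex_titles = re.findall(r'<a href="[^"]*">([^<]*)</a>', SNIPPET)


# 2. HTML parser: reacts to tags as they are read, tolerant of messy HTML.
class LinkTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles.append(data)


parser = LinkTextParser()
parser.feed(SNIPPET)
parser_titles = parser.titles

# 3. XPath (ElementTree subset): select by structure and attribute.
#    ElementTree requires well-formed XML, which this snippet happens to be.
root = ET.fromstring(SNIPPET)
xpath_titles = [a.text for a in root.findall(".//li[@class='item']/a")]

print(regex_titles, parser_titles, xpath_titles)
```

All three produce the same list here; the difference shows up when the markup changes, where the regular expression is usually the first to break.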

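2. Environment Setup and Basic Libraries

2.1 Installing the Required Libraries

# Install the libraries commonly used for crawling
# pip install requests beautifulsoup4 lxml

# Verify the installation
import requests
from bs4 import BeautifulSoup

print(f"requests version: {requests.__version__}")
print("BeautifulSoup installed successfully!")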

2.2 requests Basics

requests is the most popular HTTP library for Python. Its design philosophy is "HTTP for Humans", and it is simple and intuitive to use.

import requests

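# Basic GET request
def basic_get_request():
    """Send the simplest possible GET request."""
    url = "https://httpbin.org/get"
    # Always set a timeout; without one, requests can wait indefinitely
    response = requests.get(url, timeout=10)

    # Check the response status
    if response.status_code == 200:
        print("Request succeeded!")
        print(f"Response body: {response.json()}")
    else:
        print(f"Request failed, status code: {response.status_code}")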

# GET request with parameters
def get_with_params():
    """Send a GET request with query parameters."""
    url = "https://httpbin.org/get"
    params = {
        "keyword": "Python爬虫",
        "page": 1,
        "limit": 20
    }

    # requests encodes params into the query string
    response = requests.get(url, params=params, timeout=10)
    print(f"Actual request URL: {response.url}")
    print(f"Response data: {response.json()}")

# POST request (used for form submission)
def post_request():
    """Send a POST request."""
    url = "https://httpbin.org/post"
    data = {
        "username": "test_user",
        "password": "test_pass"
    }

    # data= sends the dict as a form-encoded body
    response = requests.post(url, data=data, timeout=10)
    print(f"POST response: {response.json()}")

# Request with custom headers
def request_with_headers():
    """Send a request that looks like it comes from a browser."""
    url = "https://httpbin.org/headers"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
    }

    response = requests.get(url, headers=headers, timeout=10)
    print(f"Headers seen by the server: {response.json()}")
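
When query parameters are passed as a dictionary, they end up percent-encoded in the query string of the final URL. The sketch below approximates that step with the standard library so the encoding can be inspected offline; `build_url` is a hypothetical helper, not part of requests, and the exact string requests produces may differ in small details.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a percent-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


url = build_url(
    "https://httpbin.org/get",
    {"keyword": "Python爬虫", "page": 1, "limit": 20},
)
print(url)

# Decoding the query string recovers the original values (as strings).
query = parse_qs(urlsplit(url).query)
print(query)
```

Non-ASCII characters are encoded as UTF-8 bytes, which is why the Chinese keyword appears as a run of `%XX` sequences in the URL.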

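2.3 Parsing Pages with BeautifulSoup

BeautifulSoup is one of the most widely used HTML parsing libraries for Python. It turns a complex HTML document into a tree structure, which makes it easy to locate and extract data.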

from bs4 import BeautifulSoup
import requests

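def parse_html_example():
    """BeautifulSoup parsing example."""
    # Sample HTML (the href values and wrapper classes are placeholders)
    html_doc = """
    <html>
    <head><title>Crawler Test Page</title></head>
    <body>
        <div class="container">
            <h1 id="main-title">Python Crawler Tutorial</h1>
            <ul class="article-list">
                <li class="item"><a href="/article/1">First article</a></li>
                <li class="item"><a href="/article/2">Second article</a></li>
                <li class="item"><a href="/article/3">Third article</a></li>
            </ul>
            <p class="description">This is a test page</p>
        </div>
    </body>
    </html>
    """

    # Create the BeautifulSoup object
    soup = BeautifulSoup(html_doc, "html.parser")

    # 1. Read the text of a tag
    title = soup.title.string
    print(f"Page title: {title}")

    # 2. Find an element by id
    main_title = soup.find(id="main-title")
    print(f"Main title: {main_title.string}")

    # 3. Find all elements with a given class
    items = soup.find_all(class_="item")
    for item in items:
        link = item.find("a")
        # Single quotes inside the f-string keep this valid before Python 3.12
        print(f"Article link: {link.string} -> {link['href']}")

    # 4. Use a CSS selector
    article_links = soup.select("ul.article-list li a")
    for link in article_links:
        print(f"CSS selector result: {link.string}")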

# In practice: scraping a real page
def scrape_real_page():
    """Example of scraping a real page."""
    url = "https://example.com"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all links
    links = soup.find_all("a")
    for link in links:
        href = link.get("href")
        # get_text() also works when the <a> tag contains nested tags
        text = link.get_text(strip=True)
        if href and text:
            print(f"Link: {text} -> {href}")

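3. Case Study: Building a News Crawler

We now work through a complete case study that shows how to build a news data collection system.

3.1 Crawler Architecture

"""
News crawler system architecture

1. Fetch module: sends requests and retrieves page content
2. Parse module: extracts news data from the pages
3. Storage module: saves the data to a file or a database
4. Scheduling module: manages crawl tasks and controls the crawl rate
"""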

import requests
from bs4 import BeautifulSoup
import json
import time
import random
from typing import List, Dict, Optional
from dataclasses import dataclass
from datetime import datetime

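# Define the data structure
@dataclass
class NewsArticle:
    """A single news article."""
    title: str           # Title
    author: str          # Author
    publish_time: str    # Publication time
    content: str         # Content summary
    url: str             # Link to the original article
    crawl_time: str      # Time the article was crawled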

class NewsSpider:
    """News crawler."""

    def __init__(self):
        # Request headers that imitate a browser
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        }
        # A session reuses connections and keeps cookies between requests
        self.session = requests.Session()
        self.session.headers.update(self.headers)
        # URLs already seen (for de-duplication)
        self.crawled_urls = set()

    def fetch_page(self, url: str, retry: int = 3) -> Optional[str]:
        """
        Fetch the content of a page.

        Args:
            url: target URL
            retry: number of attempts

        Returns:
            The page HTML, or None if every attempt failed
        """
        for i in range(retry):
            try:
                response = self.session.get(url, timeout=10)
                # Raise for 4xx/5xx responses so they are retried as well
                response.raise_for_status()
                # Guess the encoding from the body to avoid garbled text
                response.encoding = response.apparent_encoding
                return response.text
            except requests.RequestException as e:
                print(f"Request failed (attempt {i+1}/{retry}): {e}")
                time.sleep(2)  # Wait before retrying
        return None
    
    def parse_news_list(self, html: str) -> List[str]:
        """
        Extract news links from a list page.

        Args:
            html: HTML of the list page

        Returns:
            List of news detail page URLs
        """
        soup = BeautifulSoup(html, "html.parser")
        news_links = []

        # Find all news links (adjust the selector to the real site structure)
        for link in soup.select("a.news-item"):
            href = link.get("href")
            # Skip links that were already collected on an earlier page
            if href and href not in self.crawled_urls:
                news_links.append(href)
                self.crawled_urls.add(href)

        return news_links
    
    def parse_news_detail(self, html: str, url: str) -> Optional[NewsArticle]:
        """
        Parse a news detail page.

        Args:
            html: HTML of the detail page
            url: URL of the news article

        Returns:
            A NewsArticle, or None if parsing failed
        """
        soup = BeautifulSoup(html, "html.parser")

        try:
            # Extract the title. get_text() is used instead of .string
            # because .string is None when the element has nested tags.
            title_elem = soup.select_one("h1.article-title")
            title = title_elem.get_text(strip=True) if title_elem else "Unknown title"

            # Extract the author
            author_elem = soup.select_one(".author-name")
            author = author_elem.get_text(strip=True) if author_elem else "Unknown author"

            # Extract the publication time
            time_elem = soup.select_one(".publish-time")
            publish_time = time_elem.get_text(strip=True) if time_elem else "Unknown time"

            # Extract the content (first 200 characters as a summary)
            content_elem = soup.select_one(".article-content")
            content = content_elem.get_text(strip=True)[:200] if content_elem else ""

            return NewsArticle(
                title=title,
                author=author,
                publish_time=publish_time,
                content=content,
                url=url,
                crawl_time=datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            )
        except Exception as e:
            print(f"Parsing failed: {e}")
            return None
    
    def save_to_json(self, articles: List[NewsArticle], filename: str):
        """Save the articles to a JSON file."""
        data = [article.__dict__ for article in articles]
        with open(filename, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            json.dump(data, f, ensure_ascii=False, indent=2)
        print(f"Saved {len(articles)} articles to {filename}")
    
    def run(self, start_url: str, max_pages: int = 5):
        """
        Run the crawler.

        Args:
            start_url: starting URL
            max_pages: maximum number of list pages to crawl
        """
        articles = []

        for page in range(1, max_pages + 1):
            print(f"Crawling page {page}...")

            # Build the list page URL
            list_url = f"{start_url}?page={page}"

            # Fetch the list page
            html = self.fetch_page(list_url)
            if not html:
                continue

            # Parse the news links
            news_links = self.parse_news_list(html)
            print(f"Found {len(news_links)} articles")

            # Crawl each news detail page
            for link in news_links:
                detail_html = self.fetch_page(link)
                if detail_html:
                    article = self.parse_news_detail(detail_html, link)
                    if article:
                        articles.append(article)
                        print(f"Crawled: {article.title}")

                # Random delay to reduce the risk of being blocked
                time.sleep(random.uniform(1, 3))

        # Save the data
        self.save_to_json(articles, "news_data.json")
        print(f"Crawl finished, {len(articles)} articles collected")

# Usage example
if __name__ == "__main__":
    spider = NewsSpider()
    spider.run("https://example-news-site.com/news", max_pages=3)
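
One detail worth handling in a real crawl: the `href` values found on a list page are often relative, and the same article can appear under several spellings (for example with a `#comments` fragment). The sketch below shows one way to normalize links before fetching them. `normalize_links` is a hypothetical helper, not part of the spider above, and the sample URLs are made up.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def normalize_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against page_url, drop fragments, skip non-HTTP links."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin turns "/news/101.html" into a full URL based on page_url.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # e.g. javascript: or mailto: links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "https://example-news-site.com/news?page=1",
    [
        "/news/101.html",
        "/news/102.html",
        "https://example-news-site.com/news/101.html#comments",
        "javascript:void(0)",
    ],
)
print(links)
```

Normalizing first also makes the de-duplication set more effective, because two spellings of the same URL collapse into one entry.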

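4. Anti-Crawler Measures and How to Handle Them

As crawling has become widespread, more and more sites deploy anti-crawler mechanisms. Understanding these mechanisms and knowing how to respond is an essential skill for a crawler engineer.

4.1 Common Anti-Crawler Techniques

"""
Common anti-crawler techniques:

1. User-Agent detection
   - Checks whether the User-Agent request header belongs to a browser

2. IP rate limiting
   - Limits the request rate of a single IP
   - Bans the IP once a threshold is exceeded

3. Cookie/Session verification
   - Requires login before access
   - Tracks user behavior through cookies

4. JavaScript rendering
   - Data is loaded dynamically by JS
   - Requesting the HTML directly does not return the data

5. CAPTCHA
   - Shows a CAPTCHA on high-frequency access
   - Image CAPTCHAs, slider CAPTCHAs, etc.

6. Browser fingerprinting
   - Inspects Canvas, WebGL, fonts, and other fingerprints
   - Determines whether the client is a real browser
"""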

# Example countermeasures
import random
import time

import requests

# 1. User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
]

def get_random_user_agent():
    """Return a random User-Agent."""
    return random.choice(USER_AGENTS)

# 2. Request delays with randomization
def smart_delay():
    """Delay in a way that resembles a real user."""
    # Base delay of 1-3 seconds, chosen at random
    delay = random.uniform(1, 3)
    # Occasionally add a long delay (as if the user were reading)
    if random.random() < 0.1:
        delay += random.uniform(3, 8)
    time.sleep(delay)

# 3. Proxy IPs
def request_with_proxy(url: str, proxy_list: list):
    """Send a request through the first proxy that works."""
    for proxy in proxy_list:
        try:
            # Both entries use an http:// proxy URL; HTTPS traffic is tunneled
            proxies = {
                "http": f"http://{proxy}",
                "https": f"http://{proxy}"
            }
            response = requests.get(
                url,
                headers={"User-Agent": get_random_user_agent()},
                proxies=proxies,
                timeout=10
            )
            return response
        except requests.RequestException:
            # This proxy failed; try the next one
            continue
    return None
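
Fixed delays are easy to implement, but when a site is rate limiting or temporarily failing, a common refinement is to wait longer after each failed attempt and add randomness so that retries do not arrive in lockstep. The sketch below shows exponential backoff with "full jitter". `backoff_delay` and `retry_with_backoff` are hypothetical helpers, and the `base` and `cap` values are arbitrary defaults; the `sleep` parameter exists so the demo can record delays instead of actually waiting.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, sleeping with backoff between failures."""
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
    raise ValueError("retries must be at least 1")


# Demo: a function that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


recorded_delays: List[float] = []
result = retry_with_backoff(flaky_fetch, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In a crawler, `func` would wrap the actual request, and `exceptions` would typically be narrowed to the network errors that are worth retrying.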

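4.2 Handling Dynamic Content with Selenium

"""
Selenium is a browser automation tool originally built for testing.
It drives a real browser,
which makes it suitable for pages whose content is loaded by JavaScript.

Install: pip install selenium webdriver-manager
"""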

import time  # used for the pause after scrolling

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_dynamic_page(url: str):
    """Scrape a dynamically loaded page with Selenium."""

    # Configure Chrome options
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Headless mode (no visible window)
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # get_random_user_agent() is the helper defined in section 4.1
    options.add_argument(f"--user-agent={get_random_user_agent()}")

    # Start the browser
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )

    try:
        # Open the page
        driver.get(url)

        # Wait for the content to load (at most 10 seconds)
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "content")))

        # Scroll to the bottom (triggers lazy loading)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

        # Extract the data
        elements = driver.find_elements(By.CSS_SELECTOR, ".news-item")
        for elem in elements:
            title = elem.find_element(By.TAG_NAME, "h3").text
            print(f"Title: {title}")

        # Return the rendered page source
        page_source = driver.page_source
        return page_source

    finally:
        # Always close the browser, even if an error occurred
        driver.quit()

# Usage example
if __name__ == "__main__":
    url = "https://example-dynamic-site.com"
    html = scrape_dynamic_page(url)

5. Data Storage and Processing

"""
Several ways to store the data
"""

import json
import csv
import sqlite3
from datetime import datetime

# 1. JSON storage
def save_to_json(data: list, filename: str):
    """Save the data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# 2. CSV storage (utf-8-sig adds a BOM so Excel detects the encoding)
def save_to_csv(data: list, filename: str):
    """Save the data to a CSV file."""
    if not data:
        return
    
    with open(filename, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

# 3. SQLite database storage
def save_to_sqlite(data: list, db_name: str = "spider.db"):
    """Save the data to a SQLite database."""
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Create the table (url is UNIQUE so duplicates can be ignored)
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            author TEXT,
            publish_time TEXT,
            content TEXT,
            url TEXT UNIQUE,
            crawl_time TEXT
        )
    """)
    
    # Insert the data; rows with an existing url are skipped
    for item in data:
        cursor.execute("""
            INSERT OR IGNORE INTO articles 
            (title, author, publish_time, content, url, crawl_time)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            item.get("title"),
            item.get("author"),
            item.get("publish_time"),
            item.get("content"),
            item.get("url"),
            item.get("crawl_time")
        ))
    
    conn.commit()
    conn.close()

# Usage example
sample_data = [
    {
        "title": "Getting Started with Python Crawlers",
        "author": "张三",
        "publish_time": "2026-04-04",
        "content": "This is a crawler tutorial...",
        "url": "https://example.com/1",
        "crawl_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }
]

# save_to_json(sample_data, "data.json")
# save_to_csv(sample_data, "data.csv")
# save_to_sqlite(sample_data)
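
The combination of a `UNIQUE` column and `INSERT OR IGNORE` is what makes repeated crawls safe to store. The sketch below demonstrates the effect with an in-memory database and a reduced two-column table. `insert_articles` and the sample rows are illustrative only, not part of the storage functions above.

```python
import sqlite3
from typing import Dict, List


def insert_articles(conn: sqlite3.Connection, rows: List[Dict[str, str]]) -> int:
    """Insert rows, skipping duplicate URLs; return how many rows were added."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            url TEXT UNIQUE
        )
        """
    )
    before = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO articles (title, url) VALUES (:title, :url)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return after - before


rows = [
    {"title": "A", "url": "https://example.com/1"},
    {"title": "A (duplicate)", "url": "https://example.com/1"},
    {"title": "B", "url": "https://example.com/2"},
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
first_run = insert_articles(conn, rows)
second_run = insert_articles(conn, rows)  # simulates crawling the same pages again
titles = [row[0] for row in conn.execute("SELECT title FROM articles ORDER BY id")]
conn.close()

print(first_run, second_run, titles)
```

The first run stores two rows, because the duplicate URL is ignored, and the second run stores none.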

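6. Crawler Ethics and Law

When using a crawler, we must follow ethical norms as well as laws and regulations:

Respect robots.txt: check the site's robots.txt file and follow its crawling rules
Control the crawl rate: avoid putting excessive load on the server
Respect copyright: do not crawl and redistribute copyrighted content
Protect privacy: do not crawl personal private data
Use data lawfully: make sure the data is used in a legal and compliant way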

import urllib.robotparser

def check_robots_txt(base_url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt allows crawling."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()  # downloads and parses robots.txt

    # Check whether a specific path may be crawled
    test_path = "/"
    return rp.can_fetch(user_agent, test_path)

# Usage example
# allowed = check_robots_txt("https://example.com")
# print(f"Crawling allowed: {allowed}")
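
The same parser can also be fed the text of a robots.txt file directly, which is convenient for checking individual paths and for reading a `Crawl-delay` directive when the site provides one. The sketch below works offline; the robots.txt content and the `MySpider` user agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

news_allowed = rp.can_fetch("MySpider", "https://example.com/news/1")
admin_allowed = rp.can_fetch("MySpider", "https://example.com/admin/users")
delay = rp.crawl_delay("MySpider")  # None when the file has no Crawl-delay

print(news_allowed, admin_allowed, delay)
```

Checking the exact URL that is about to be fetched gives a more precise answer than testing only the site root, and the crawl delay can be used as a lower bound for the pause between requests.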