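Python Web Scraping in Practice - From Beginner to Mastery

廖万里 · 2 months ago (03-16) · Study Notes

# Python Web Scraping in Practice

Preface

In the internet era, data is one of the most valuable resources. Web scraping with Python lets us collect data from the web automatically, and it is widely used in data analysis, machine learning, market research, and similar fields. This tutorial starts from scratch, covers the core techniques of Python web scraping, and reinforces them with several hands-on case studies.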

---

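Chapter 1: Web Scraping Fundamentals

1.1 What Is a Web Crawler

A web crawler is an automated program that mimics browser behavior and fetches information from the internet according to a set of rules. Its workflow can be summarized as:

1. Send an HTTP request to the target website
2. Receive the response data returned by the server
3. Parse the data and extract the required information
4. Store the data locally or in a database

1.2 Legality of Web Scraping

When developing a crawler, always follow these principles:

Respect robots.txt: it is the site's statement of which paths crawlers may access
Control request frequency: avoid putting excessive load on the server
Do not scrape sensitive information: respect user privacy and the site's rights
Comply with laws and regulations: never use a crawler for illegal purposes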
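
The robots.txt check can be automated with the standard library. The sketch below parses a hand-written robots.txt (the rules and the `demo-crawler` name are assumptions for illustration) so that it runs offline; a real crawler would instead point the parser at the site's file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load the live file with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "demo-crawler"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if robots.txt permits USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.com/books/index.html"))
print(is_allowed("https://example.com/private/account.html"))
print(parser.crawl_delay(USER_AGENT))  # delay requested by the site, if any
```

Calling `is_allowed()` before every request, and sleeping for at least the reported crawl delay, covers the first two principles in the list above.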

⚠️ Important: this tutorial is for learning and discussion only. Do not use crawlers for illegal purposes.

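1.3 Setting Up the Development Environment

Install Python:

Python 3.8 or later is recommended.

# Check the Python version
python --version
# Expected output: Python 3.x.x

Create a virtual environment:

# Create the virtual environment
python -m venv crawler_env

# Activate it on Windows
crawler_env\Scripts\activate

# Activate it on Linux/macOS
source crawler_env/bin/activate

Install the core libraries: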

pip install requests beautifulsoup4 lxml scrapy selenium pandas

---

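Chapter 2: The HTTP Protocol and Request Libraries

2.1 HTTP Basics

HTTP (HyperText Transfer Protocol) is the protocol that clients and servers use to communicate. Understanding it is the foundation for learning web scraping.

Common request methods:

| Method | Description |
|--------|-------------|
| GET | Retrieve a resource |
| POST | Submit data |
| PUT | Update a resource |
| DELETE | Delete a resource |

Common status codes:

| Status code | Meaning |
|-------------|---------|
| 200 | Request succeeded |
| 301/302 | Redirect |
| 400 | Bad request |
| 403 | Forbidden |
| 404 | Resource not found |
| 500 | Internal server error |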
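
The first digit of a status code tells a crawler how to react: 2xx means the body can be parsed, 4xx usually means the request itself must change, and 5xx is often worth retrying later. As a small sketch, the helper below (the name `describe_status` is made up for this example) uses the standard library's `http.HTTPStatus` to turn a code into a readable label.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status
    category = CATEGORIES.get(code // 100, "unknown")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```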

2.2 Getting Started with Requests

Requests is one of the most popular HTTP libraries for Python, with a concise and elegant API.

Basic usage:

import requests

# GET request
response = requests.get('https://httpbin.org/get')
print(response.status_code)  # status code
print(response.text)         # response body as text
print(response.json())       # response body parsed as JSON

# POST request
data = {'username': 'admin', 'password': '123456'}
response = requests.post('https://httpbin.org/post', data=data)
print(response.json())

Adding request headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
}
response = requests.get('https://example.com', headers=headers)

Handling cookies and sessions:

# Use a Session to keep cookies across requests
session = requests.Session()

# Log in
login_data = {'username': 'admin', 'password': '123456'}
session.post('https://example.com/login', data=login_data)

# Visit a page that requires login
response = session.get('https://example.com/profile')
print(response.text)

Handling timeouts and exceptions:

import requests
# Note: this ConnectionError is the Requests exception, which shadows the built-in of the same name
from requests.exceptions import Timeout, ConnectionError, RequestException

try:
    response = requests.get('https://example.com', timeout=5)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except Timeout:
    print("Request timed out")
except ConnectionError:
    print("Connection failed")
except RequestException as e:
    print(f"Request error: {e}")
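
Catching an exception is often only half of the job: transient failures such as timeouts are usually worth retrying. The sketch below is one possible pattern, not part of Requests itself. `retry_with_backoff` is a hypothetical helper that doubles the wait after each failure (exponential backoff), and the fake request function stands in for a real HTTP call so that the example runs offline.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a fake request that fails twice, then succeeds.
calls = {"count": 0}
waits = []  # records the delays instead of actually sleeping


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry_with_backoff(flaky_request, attempts=3, base_delay=0.5,
                            retry_on=(TimeoutError,), sleep=waits.append)
print(result, waits)
```

With Requests, the same helper could wrap `lambda: requests.get(url, timeout=5)` and pass `retry_on=(Timeout, ConnectionError)` using the exception classes imported above.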

---

Chapter 3: Data Parsing Techniques

3.1 Regular Expressions

Regular expressions are the most basic way to extract data and work well on text with a simple structure.

Common patterns:

import re

html = '''
<div class="content">
    <h2>标题1</h2>
    <p>价格：99元</p>
    <h2>标题2</h2>
    <p>价格：199元</p>
</div>
'''

# Extract every title
titles = re.findall(r'<h2>(.*?)</h2>', html)
print(titles)  # ['标题1', '标题2']

# Extract the prices
prices = re.findall(r'价格：(\d+)元', html)
print(prices)  # ['99', '199']

# Use named groups to pair each title with its price
pattern = r'<h2>(?P<title>.*?)</h2>\s*<p>价格：(?P<price>\d+)元</p>'
matches = re.finditer(pattern, html)
for match in matches:
    print(match.group('title'), match.group('price'))
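
The patterns above depend on the non-greedy quantifier `.*?`. The short comparison below, on a made-up one-line snippet, shows what the greedy form would capture instead, and why `re.DOTALL` is needed once an element spans several lines.

```python
import re

html = "<h2>Title 1</h2><h2>Title 2</h2>"

# Greedy: .* runs to the LAST </h2>, swallowing the tags in between.
greedy = re.findall(r"<h2>(.*)</h2>", html)
# Non-greedy: .*? stops at the FIRST </h2> after each <h2>.
lazy = re.findall(r"<h2>(.*?)</h2>", html)

print(greedy)  # ['Title 1</h2><h2>Title 2']
print(lazy)    # ['Title 1', 'Title 2']

# By default "." does not match a newline, so multi-line content needs re.DOTALL.
multiline = "<p>line one\nline two</p>"
without_flag = re.findall(r"<p>(.*?)</p>", multiline)
with_flag = re.findall(r"<p>(.*?)</p>", multiline, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```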

3.2 Parsing with BeautifulSoup

BeautifulSoup is one of the most approachable HTML/XML parsing libraries and supports several underlying parsers.

Basic usage:

from bs4 import BeautifulSoup
import requests

# Fetch the page
response = requests.get('https://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'lxml')

# Find elements by tag name
titles = soup.find_all('h3')
for title in titles:
    print(title.text)

# Find elements by class
books = soup.find_all('article', class_='product_pod')
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    print(f"{title}: {price}")

# Find elements with a CSS selector
prices = soup.select('p.price_color')
for price in prices:
    print(price.text)

Advanced lookups:

from bs4 import BeautifulSoup

html = '''
<div class="container">
    <ul id="list">
        <li class="item active">项目1</li>
        <li class="item">项目2</li>
        <li class="item">项目3</li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

# Combine tag name and class
item = soup.find('li', class_='active')
print(item.text)  # 项目1

# Navigate to the parent and to sibling elements
parent = item.parent
print(parent.name)  # ul
next_sibling = item.find_next_sibling('li')
print(next_sibling.text)  # 项目2

# Read an attribute
print(item.get('class'))  # ['item', 'active']

3.3 Parsing with XPath

XPath is a powerful path expression language that suits complex XML/HTML parsing.

from lxml import etree
import requests

response = requests.get('https://books.toscrape.com/')
html = etree.HTML(response.text)

# XPath syntax cheat sheet
# /   selects from the root node
# //  selects matching nodes anywhere in the document, regardless of position
# @   selects attributes
# .   selects the current node
# ..  selects the parent of the current node

# Get every book title
titles = html.xpath('//h3/a/@title')
print(titles)

# Get every price
prices = html.xpath('//p[@class="price_color"]/text()')
print(prices)

# Get the full record for each book
books = html.xpath('//article[@class="product_pod"]')
for book in books:
    title = book.xpath('.//h3/a/@title')[0]
    price = book.xpath('.//p[@class="price_color"]/text()')[0]
    # The class attribute holds the rating as a word, e.g. "star-rating Three"
    rating = book.xpath('.//p[contains(@class, "star-rating")]/@class')[0]
    print(f"Title: {title}, price: {price}, rating: {rating}")
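
To experiment with path expressions without a network connection or lxml, the standard library's `xml.etree.ElementTree` can be used on a small well-formed snippet. Note the assumptions: the sample markup below is hand-written to resemble the structure above, and ElementTree supports only a subset of XPath (no `text()` or `@attr` as the final step, no `contains()`), so attributes and text are read through `.get()` and `.findtext()` instead.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample that mimics the product list structure.
SAMPLE = """
<section>
  <article class="product_pod">
    <h3><a href="book-1.html" title="Book One">Book O...</a></h3>
    <p class="price_color">51.77</p>
  </article>
  <article class="product_pod">
    <h3><a href="book-2.html" title="Book Two">Book T...</a></h3>
    <p class="price_color">53.74</p>
  </article>
</section>
"""

root = ET.fromstring(SAMPLE.strip())

books = []
# ".//" searches all descendants; "[@class='...']" filters on an attribute value.
for article in root.findall(".//article[@class='product_pod']"):
    link = article.find("./h3/a")  # path relative to the current node
    price = article.findtext("./p[@class='price_color']")
    books.append((link.get("title"), float(price)))

print(books)
```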

---

Chapter 4: Case Study 1: Scraping a Novel Website

4.1 Requirements

Goal: scrape the chapter list and chapter text from a novel website and save them as local text files.

Technology choices:

Request library: requests
Parsing library: BeautifulSoup
Storage: txt files

4.2 Implementation

import os
import random
import re
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class NovelCrawler:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_page(self, url, retry=3):
        """Fetch a page; return its HTML, or None if every attempt fails."""
        for i in range(retry):
            try:
                response = self.session.get(url, headers=self.headers, timeout=10)
                response.raise_for_status()  # treat 4xx/5xx responses as failures
                response.encoding = 'utf-8'  # change this if the site uses another encoding
                return response.text
            except requests.RequestException as e:
                print(f"Request failed (attempt {i+1}/{retry}): {e}")
                time.sleep(2)
        return None

    def parse_chapter_list(self, html):
        """Parse the chapter list."""
        soup = BeautifulSoup(html, 'lxml')
        chapters = []

        # Adjust the selector to the structure of the actual site
        chapter_list = soup.select('#list dd a')

        for chapter in chapter_list:
            title = chapter.text.strip()
            # urljoin handles relative, root-relative, and absolute hrefs
            url = urljoin(self.base_url, chapter['href'])
            chapters.append({
                'title': title,
                'url': url
            })

        return chapters

    def parse_content(self, html):
        """Parse the text of one chapter."""
        soup = BeautifulSoup(html, 'lxml')

        # Adjust the selector to the structure of the actual site
        content_div = soup.select_one('#content')

        if content_div:
            content = content_div.text.strip()
            # Clean up non-breaking spaces and carriage returns
            content = content.replace('\xa0', ' ').replace('\r', '')
            return content
        return None

    def crawl(self, novel_name):
        """Main crawl routine."""
        # Create the output directory
        save_dir = f'./novels/{novel_name}'
        os.makedirs(save_dir, exist_ok=True)

        # Fetch the chapter list
        print("Fetching the chapter list...")
        index_html = self.get_page(self.base_url)
        if index_html is None:
            print("Could not fetch the chapter list; aborting")
            return
        chapters = self.parse_chapter_list(index_html)
        print(f"Found {len(chapters)} chapters")

        # Crawl each chapter
        for i, chapter in enumerate(chapters):
            print(f"Crawling: {chapter['title']} ({i+1}/{len(chapters)})")

            content_html = self.get_page(chapter['url'])
            content = self.parse_content(content_html) if content_html else None

            if content:
                # Replace characters that are not allowed in file names
                safe_title = re.sub(r'[\\/:*?"<>|]', '_', chapter['title'])
                filename = f"{save_dir}/{i+1:04d}_{safe_title}.txt"
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(chapter['title'] + '\n\n' + content)

            # Random delay to reduce the risk of being blocked
            time.sleep(random.uniform(0.5, 1.5))

        print("Crawl finished!")


# Usage example
if __name__ == '__main__':
    crawler = NovelCrawler('https://example.com/novel/')
    crawler.crawl('novel_name')
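
Two details of this crawler are easy to get wrong: building chapter URLs from `href` values, and turning chapter titles into file names. Plain string concatenation only works when every `href` is a bare relative name, and titles may contain characters that file systems reject. The sketch below illustrates both points offline; the helper names `chapter_url` and `safe_filename` are made up for this example.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://example.com/novel/"  # same placeholder site as above


def chapter_url(href):
    """Resolve an href against the index page, whatever form it takes."""
    return urljoin(BASE_URL, href)


def safe_filename(title, index):
    """Build a file name, replacing characters that are invalid on Windows."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{index:04d}_{cleaned}.txt"


print(chapter_url("chapter-1.html"))                    # relative href
print(chapter_url("/novel/chapter-1.html"))             # root-relative href
print(chapter_url("https://cdn.example.com/c/1.html"))  # absolute href
print(BASE_URL + "/novel/chapter-1.html")               # naive concatenation
print(safe_filename("Chapter 1: Who/What?", 1))
```

The fourth line prints `https://example.com/novel//novel/chapter-1.html`, which is the kind of broken URL that concatenation produces for root-relative links.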

---

Chapter 5: Case Study 2: Scraping Product Data from an E-commerce Platform

5.1 Requirements

Goal: scrape product information from an e-commerce platform, including product name, price, sales volume, and shop, and save it as a CSV file.

5.2 Handling Dynamic Loading

Many e-commerce platforms load product data dynamically with JavaScript, so the underlying API endpoint has to be identified first.

import random
import time

import pandas as pd
import requests


class EcommerceCrawler:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Referer': 'https://example.com',
            'Accept': 'application/json',
        }
        self.products = []

    def search_products(self, keyword, page=1):
        """Search for products and return one page of results."""
        # API endpoint (find the real one with the browser's developer tools)
        api_url = 'https://api.example.com/search'

        params = {
            'keyword': keyword,
            'page': page,
            'pageSize': 50,
        }

        try:
            response = self.session.get(
                api_url,
                headers=self.headers,
                params=params,
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            return self.parse_product_data(data)
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a body that is not valid JSON
            print(f"Request failed: {e}")
            return []

    def parse_product_data(self, data):
        """Parse product data."""
        products = []

        # Adapt this to the structure the API actually returns;
        # "or {}" also covers a response where "data" is null
        items = (data.get('data') or {}).get('items') or []

        for item in items:
            product = {
                'Product ID': item.get('id'),
                'Product name': item.get('title'),
                'Price': item.get('price'),
                'Sales': item.get('sales'),
                'Shop name': item.get('shopName'),
                'Product URL': item.get('url'),
                'Rating': item.get('rating'),
            }
            products.append(product)

        return products

    def crawl(self, keyword, max_pages=10):
        """Crawl several pages of results."""
        print(f"Searching for: {keyword}")

        for page in range(1, max_pages + 1):
            print(f"Crawling page {page}...")
            products = self.search_products(keyword, page)

            if not products:
                print("No more data")
                break

            self.products.extend(products)
            time.sleep(random.uniform(1, 3))

        print(f"Crawled {len(self.products)} products in total")

    def save_to_csv(self, filename):
        """Save the results to a CSV file."""
        df = pd.DataFrame(self.products)
        # utf-8-sig adds a BOM so that Excel detects the encoding
        df.to_csv(filename, index=False, encoding='utf-8-sig')
        print(f"Data saved to: {filename}")


# Usage example
if __name__ == '__main__':
    crawler = EcommerceCrawler()
    crawler.crawl('mechanical keyboard', max_pages=5)
    crawler.save_to_csv('./products.csv')
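
The parsing step is worth testing offline against hand-written payloads before pointing the crawler at a live API. One pitfall: chaining `.get('data', {})` only covers a missing key, not an explicit `null`. The sketch below assumes the same hypothetical `data.items` response shape as the example above; `extract_items` is a made-up helper name.

```python
def extract_items(payload):
    """Return the item list from a search response, or [] if it is absent."""
    data = payload.get("data") or {}  # covers both a missing key and null
    return data.get("items") or []


samples = [
    {"data": {"items": [{"id": 1, "title": "Keyboard A", "price": 299}]}},
    {"data": {"items": []}},  # an empty last page
    {"data": None},           # the API answers with "data": null
    {},                       # the key is missing entirely
]

counts = [len(extract_items(payload)) for payload in samples]
print(counts)

# The naive chain breaks on the null case, because the default {} is only
# used when the key is missing, not when its value is None.
try:
    {"data": None}.get("data", {}).get("items", [])
    naive_ok = True
except AttributeError as exc:
    naive_ok = False
    print("naive chain failed:", exc)
```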

---

Chapter 6: Case Study 3: Handling JavaScript-Rendered Pages with Selenium

6.1 About Selenium

Selenium is a browser automation tool, originally built for testing, that drives a real browser. This makes it suitable for pages rendered dynamically by JavaScript.

6.2 Environment Setup

# Install Selenium, plus webdriver-manager to manage browser drivers automatically
pip install selenium webdriver-manager

6.3 Full Example: Scraping an Infinite-Scroll Page

import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumCrawler:
    def __init__(self, headless=True):
        """Start the browser."""
        options = webdriver.ChromeOptions()

        if headless:
            options.add_argument('--headless')

        options.add_argument('--disable-gpu')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--window-size=1920,1080')

        # Set the User-Agent
        options.add_argument(
            'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )

        # Download and install a matching driver automatically
        service = Service(ChromeDriverManager().install())
        self.driver = webdriver.Chrome(service=service, options=options)
        self.wait = WebDriverWait(self.driver, 10)

    def scroll_to_bottom(self, times=5):
        """Scroll down repeatedly to load more content."""
        for i in range(times):
            # Scroll to the bottom of the page
            self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            )
            print(f"Scroll {i+1}/{times}")
            time.sleep(2)  # wait for new content to load

            # Confirm that at least one item is present in the page
            try:
                self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '.item'))
                )
            except TimeoutException:
                print("Timed out waiting for items")

    def crawl_infinite_scroll(self, url):
        """Scrape an infinite-scroll page."""
        self.driver.get(url)
        time.sleep(3)

        # Scroll to load more items
        self.scroll_to_bottom(10)

        # Extract the data
        items = self.driver.find_elements(By.CSS_SELECTOR, '.item')
        data = []

        for item in items:
            try:
                title = item.find_element(By.CSS_SELECTOR, '.title').text
                price = item.find_element(By.CSS_SELECTOR, '.price').text
                data.append({
                    'title': title,
                    'price': price
                })
            except NoSuchElementException:
                # Skip items that lack a title or price
                continue

        return data

    def close(self):
        """Close the browser."""
        self.driver.quit()


# Usage example
if __name__ == '__main__':
    crawler = SeleniumCrawler(headless=False)

    try:
        data = crawler.crawl_infinite_scroll('https://example.com/list')
        df = pd.DataFrame(data)
        df.to_csv('./scroll_data.csv', index=False)
        print(f"Crawl finished, {len(data)} records in total")
    finally:
        crawler.close()
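
Scrolling a fixed number of times either wastes time on short pages or stops too early on long ones. A common alternative, sketched here as a variation rather than as part of the example above, is to keep scrolling until the page height stops growing. The hypothetical `scroll_until_stable` helper takes the height lookup and the scroll action as callables, so it can be exercised offline with a fake page.

```python
import time


def scroll_until_stable(get_height, scroll_down, max_rounds=20,
                        pause=2.0, sleep=time.sleep):
    """Scroll until the page height stops changing; return the rounds used."""
    last_height = get_height()
    for round_no in range(1, max_rounds + 1):
        scroll_down()
        sleep(pause)  # give the page time to load new items
        new_height = get_height()
        if new_height == last_height:
            return round_no  # nothing new was loaded
        last_height = new_height
    return max_rounds


class FakePage:
    """Offline stand-in for a browser: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.position = 0

    def height(self):
        return self.heights[self.position]

    def scroll(self):
        self.position = min(self.position + 1, len(self.heights) - 1)


page = FakePage()
rounds = scroll_until_stable(page.height, page.scroll, sleep=lambda seconds: None)
print(rounds)
```

With Selenium, `get_height` could be `lambda: driver.execute_script("return document.body.scrollHeight")` and `scroll_down` the same `window.scrollTo` script used in `scroll_to_bottom`.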

---

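Chapter 7: Anti-Scraping Measures and Countermeasures

7.1 Common Anti-Scraping Mechanisms

| Type | Description | Countermeasure |
|------|-------------|----------------|
| User-Agent detection | Checks the User-Agent request header | Send a realistic browser UA |
| IP rate limiting | Too many requests from a single IP | Use a pool of proxy IPs |
| Cookie validation | Access requires cookies | Keep the session with Session |
| JavaScript rendering | Data is loaded dynamically by JS | Use Selenium/Playwright |
| CAPTCHA | Frequent access triggers a CAPTCHA | OCR or manual handling |
| Login wall | Access requires login | Simulate login to obtain cookies |

7.2 Using Proxy IPs

import random

import requests


class ProxyCrawler:
    def __init__(self):
        # Placeholder entries: replace host, port, and scheme with the values
        # from your proxy provider (many HTTP proxies expect an http:// URL
        # for both keys)
        self.proxies_pool = [
            {'http': 'http://proxy1:port', 'https': 'https://proxy1:port'},
            {'http': 'http://proxy2:port', 'https': 'https://proxy2:port'},
            # more proxies...
        ]

    def get_random_proxy(self):
        """Pick a proxy at random."""
        return random.choice(self.proxies_pool)

    def request_with_proxy(self, url, retry=3):
        """Send a request through a proxy, switching proxies on failure."""
        for i in range(retry):
            proxy = self.get_random_proxy()
            try:
                response = requests.get(
                    url,
                    proxies=proxy,
                    timeout=10,
                    headers={'User-Agent': 'Mozilla/5.0...'}
                )
                if response.status_code == 200:
                    return response
            except requests.RequestException as e:
                print(f"Proxy {proxy} failed: {e}")
                continue

        return None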

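7.3 Adding Request Delays

import random
import time

import requests


def smart_delay(min_delay=1, max_delay=3):
    """Sleep for a random interval to avoid a fixed request rhythm."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)


# Use it inside a loop
urls = ['https://example.com/page/1', 'https://example.com/page/2']  # placeholder URLs
for url in urls:
    response = requests.get(url, timeout=10)
    # process the response...
    smart_delay(1, 2)  # wait a random 1-2 seconds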
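
A fixed or random sleep adds its delay on top of however long each request took. If the goal is a ceiling on request frequency, one alternative is to enforce a minimum interval between the starts of consecutive requests. The class below is a sketch of that idea; `MinIntervalLimiter` is a hypothetical name, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


class FakeClock:
    """Simulated time source so the demo runs instantly."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
limiter = MinIntervalLimiter(2.0, clock=fake.time, sleep=fake.sleep)

limiter.wait()   # first call: nothing to wait for
fake.now += 0.5  # pretend the request took 0.5 s
limiter.wait()   # sleeps for the remaining 1.5 s
fake.now += 3.0  # pretend the next request was slow
limiter.wait()   # the interval has already passed: no sleep
print(fake.slept)
```

In a real loop, a single `MinIntervalLimiter(2.0)` would be created once and `limiter.wait()` called just before each request.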

---

Chapter 8: Data Storage

8.1 Saving as JSON

import json

data = [
    {'name': '商品1', 'price': 99},
    {'name': '商品2', 'price': 199},
]

# Save (ensure_ascii=False keeps non-ASCII text readable in the file)
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Load
with open('data.json', 'r', encoding='utf-8') as f:
    loaded_data = json.load(f)
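
`json.dump` rewrites the whole file each time, so a crash in the middle of a long crawl can lose everything collected so far. One common alternative, shown here as a sketch rather than as the approach used above, is the JSON Lines layout: one JSON object per line, appended as soon as each record is scraped. The helper names and the temporary file path are made up for this example.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read every record back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the demo from touching real files.
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")

append_record(path, {"name": "商品1", "price": 99})
append_record(path, {"name": "商品2", "price": 199})

records = read_records(path)
print(records)
```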

8.2 Saving as CSV

import pandas as pd

data = [
    {'name': '商品1', 'price': 99},
    {'name': '商品2', 'price': 199},
]
df = pd.DataFrame(data)
# utf-8-sig adds a BOM so that Excel detects the encoding correctly
df.to_csv('data.csv', index=False, encoding='utf-8-sig')
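
When pandas is not otherwise needed, the standard library's `csv` module produces the same kind of file. This sketch writes to an in-memory buffer so that the result can be inspected directly.

```python
import csv
import io

data = [
    {"name": "商品1", "price": 99},
    {"name": "商品2", "price": 199},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()    # header row taken from fieldnames
writer.writerows(data)  # one row per dict

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, open it with `open('data.csv', 'w', newline='', encoding='utf-8-sig')` and pass the file object to `csv.DictWriter`; `newline=''` prevents blank lines between rows on Windows.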

8.3 Storing in a Database

import sqlite3

import pandas as pd

# Open the database connection
conn = sqlite3.connect('crawler.db')

# Create the table
conn.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price REAL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Insert data (the ? placeholders keep the query parameterized)
data = [
    ('商品1', 99.0),
    ('商品2', 199.0),
]
conn.executemany('INSERT INTO products (name, price) VALUES (?, ?)', data)
conn.commit()

# Read the data back
df = pd.read_sql('SELECT * FROM products', conn)
print(df)
conn.close()
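
Running a crawler twice with a plain `INSERT` stores every product twice. A possible remedy, sketched below, is to give each row a natural key and let SQLite skip duplicates with `INSERT OR IGNORE`. The `url` column is an assumption added for this example (it is not part of the table above), and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,  -- hypothetical natural key used for de-duplication
        name TEXT,
        price REAL
    )
""")

rows = [
    ("https://example.com/p/1", "商品1", 99.0),
    ("https://example.com/p/2", "商品2", 199.0),
    ("https://example.com/p/1", "商品1", 99.0),  # duplicate from a second run
]

with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany(
        "INSERT OR IGNORE INTO products (url, name, price) VALUES (?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)
conn.close()
```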

---

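Summary

This tutorial covered the core techniques of Python web scraping, from HTTP requests and data parsing to hands-on case studies, to help you build a complete picture of how crawlers work.

Key points:

1. Choosing a request library: requests for simple pages, Selenium for complex ones
2. Parsing techniques: regular expressions, BeautifulSoup, and XPath each have their strengths
3. Dealing with anti-scraping measures: set request headers, use proxies, and control request frequency
4. Data storage: choose JSON, CSV, or a database according to your needs

Study tips:

Practice a lot, starting with simple static pages
Learn to analyze web pages with the browser's developer tools
Comply with laws and regulations and with each site's robots.txt
Keep following developments in anti-scraping techniques

Best of luck on your Python web scraping journey!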

Permalink: https://www.kkkliao.cn/?id=646 (authorization is required to repost)


© 2022-2026 Technical support provided by 天桥区万策云网络工作室, 东莞市东城万策智联网络工作室, and 济南高新区万策网络工作室
Public security filing: 鲁公网安备 37010502001945号
ICP filing: 鲁ICP备2026009861号-1

Powered By Z-BlogPHP. Theme by TOYEAN.
