Learning to Write a Simple Crawler

How a simple crawler works

Posted by xiaoh on May 21, 2016

Contents

  1. Controller
  2. Downloader
  3. Parser
  4. URL Manager
  5. Outputer
  6. References

It has been a long time since I last updated this blog. I have been tidying up some crawler work recently; I have written plenty of crawlers before but never organized any of it, so this post distills the basics of a Python crawler.


Controller

The controller ties the crawler together: it acts as the scheduler and coordinates the work of all the other components.

#!/usr/bin/python
#-*- coding:utf-8 -*-

#############################################
# File Name: spider.py
# Author: xiaoh
# Mail: xiaoh@about.me
# Created Time:  2016-05-20 23:31:49
#############################################

from url_manager import *
from html_downloader import *
from html_parser import *
from html_outputer import *

class SpiderMain():
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.outputer = HtmlOutputer()

    def craw(self, start_url):
        self.urls.add_new_url(start_url)
        count = 1
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)

                print 'Downloaded page %d, title: %s' % (count, new_data['title'])

                # Stop once 2000 pages have been crawled
                if count >= 2000:
                    break

                count = count + 1
            except Exception as e:
                print type(e)
                print e

        self.outputer.output_html()

if __name__ == "__main__":
    start_url = "http://baike.baidu.com/view/6687996.htm"
    spider = SpiderMain()
    spider.craw(start_url)

Downloader

The downloader does the actual fetching of pages. A simple downloader just calls a library function, but a more serious one has to deal with IP rate limiting (use proxies), cookie checks (simulate a login), obfuscated scripts (drive a real browser), and so on. Here I only cover the most basic downloader; the deeper topics will come in later posts, and a rough sketch of a slightly more robust version follows the code below.

#!/usr/bin/python
#-*- coding:utf-8 -*-

#############################################
# File Name: html_downloader.py
# Author: xiaoh
# Mail: xiaoh@about.me
# Created Time:  2016-05-21 19:00:54
#############################################

import requests

class HtmlDownloader():
    def download(self, url):
        if url is None:
            return None
        req = requests.get(url)
        if req.status_code != 200:
            return None
        return req.content
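
The paragraph above mentions proxies and simulated logins. As a rough sketch of my own (the User-Agent string and the proxy example are placeholders, not anything from the tutorial), the same downloader could send a browser-like header, apply a timeout, and optionally route requests through a proxy:

import requests

class RobustHtmlDownloader():
    # Sketch only: the header and proxy values below are placeholders
    def __init__(self, proxies=None):
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.proxies = proxies  # e.g. {'http': 'http://127.0.0.1:8080'}

    def download(self, url):
        if url is None:
            return None
        try:
            req = requests.get(url, headers=self.headers,
                               proxies=self.proxies, timeout=10)
        except requests.RequestException:
            return None
        if req.status_code != 200:
            return None
        return req.content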

Parser

The parser extracts the content we want from the downloaded page. There are many ways to do this, but broadly they fall into two camps: fuzzy matching (regular expressions) and structured parsing (html.parser / lxml). Fuzzy matching is usually faster, but regular expressions can get tricky; structured parsing is much simpler, and that is what we use here (I have written more than enough regexes already).

For structured parsing there is a third-party library called BeautifulSoup, which handles web pages nicely (really, it is just easier to use).
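
To make the difference concrete, here is a tiny example of my own (not from the tutorial) that pulls the same links out of an HTML snippet first with a regular expression and then with BeautifulSoup; both print ['/view/123.htm', '/view/456.htm']:

import re
from bs4 import BeautifulSoup

html = '<a href="/view/123.htm">A</a> <a href="/view/456.htm">B</a>'

# Fuzzy matching: fast, but the pattern must anticipate every markup variation
print re.findall(r'href="(/view/\d+\.htm)"', html)

# Structured parsing: just walk the parsed document tree
soup = BeautifulSoup(html, 'html.parser')
print [a['href'] for a in soup.find_all('a')]

With that settled, the full parser for this crawler is below.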

#!/usr/bin/python
#-*- coding:utf-8 -*-

#############################################
# File Name: html_parser.py
# Author: xiaoh
# Mail: xiaoh@about.me
# Created Time:  2016-05-21 19:03:28
#############################################

import re
import urlparse
from bs4 import BeautifulSoup

class HtmlParser():

    def parse(self, url, content):
        if url is None or content is None:
            return
        soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data

    def _get_new_urls(self, url, soup):
        # Baike entry pages link to one another with relative URLs like /view/12345.htm
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        new_urls = set()
        for link in links:
            new_url = link['href']
            new_full_url = urlparse.urljoin(url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, url, soup):
        # The entry title and summary live in Baike-specific containers
        title = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
        summary = soup.find('div', class_="lemma-summary")
        return {
            "title":title.text,
            "summary":summary.text,
            "url":url
        }
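
To check the parser in isolation, you can feed it a tiny hand-written page that mimics the Baidu Baike markup the selectors expect (the snippet below is made up purely for illustration):

from html_parser import HtmlParser

parser = HtmlParser()
page = '''
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary">Python is a programming language.</div>
<a href="/view/123.htm">related entry</a>
'''
urls, data = parser.parse('http://baike.baidu.com/view/6687996.htm', page)
print urls             # set(['http://baike.baidu.com/view/123.htm'])
print data['title']    # Python
print data['summary']  # Python is a programming language.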

URL Manager

The URL manager's main job is to store the list of URLs to crawl and to deduplicate them.

#!/usr/bin/python
#-*- coding:utf-8 -*-

#############################################
# File Name: url_manager.py
# Author: xiaoh
# Mail: xiaoh@about.me
# Created Time:  2016-05-21 18:53:35
#############################################


class UrlManager():
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def get_new_url(self):
        if not self.has_new_url():
            return None
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def has_new_url(self):
        return len(self.new_urls) != 0
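
A quick check of the deduplication behaviour (an example of my own, assuming the class above is saved as url_manager.py):

from url_manager import UrlManager

manager = UrlManager()
manager.add_new_url('http://baike.baidu.com/view/1.htm')
manager.add_new_url('http://baike.baidu.com/view/1.htm')  # duplicate, ignored
print manager.has_new_url()  # True
print manager.get_new_url()  # http://baike.baidu.com/view/1.htm
manager.add_new_url('http://baike.baidu.com/view/1.htm')  # already crawled, still ignored
print manager.has_new_url()  # False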

Outputer

The outputer takes the collected data, organizes it, and writes it out. This part is easy to customize to load results into a database or write them to files in whatever format you need; a sketch of a JSON variant follows the code below.

#!/usr/bin/python
#-*- coding:utf-8 -*-

#############################################
# File Name: html_outputer.py
# Author: xiaoh
# Mail: xiaoh@about.me
# Created Time:  2016-05-21 19:31:36
#############################################


class HtmlOutputer():
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')

        fout.write('<html><head><meta charset="utf-8"></head><body><table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'].encode('utf-8'))
            fout.write('<td>%s</td>' % data['summary'].encode('utf-8'))
            fout.write('</tr>')
        fout.write('</table></body></html>')
        fout.close()
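
As mentioned above, the output step is the natural place to customize. As a sketch of my own (using the standard json module, not part of the original tutorial), a variant that writes one JSON object per line instead of an HTML table could look like this:

import json

class JsonOutputer():
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_json(self):
        # One JSON object per line, so the file can be processed line by line later
        fout = open('output.jsonl', 'w')
        for data in self.datas:
            fout.write(json.dumps(data, ensure_ascii=False).encode('utf-8'))
            fout.write('\n')
        fout.close()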

References

This post is mainly a summary of, and practice based on, an online video tutorial; see the link below for the details.

END