Python Crawler Study Notes: Multithreaded Crawling

Published: 2016-12-04 12:33:49 | Category: ASP tutorials | Source: compiled from the web

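Summary: Installing and using XPath. 1. Introduction to XPath. Having just learned regular expressions and grown comfortable with them, we are now going to replace them with XPath. Some readers will call that a raw deal: had they known, they would have started with XPath and saved themselves the trouble.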

The printed output is:

http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=150
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=200
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=250
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=300
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=350
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=400
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=450
Single-threaded time: 7.26399993896 s
Multithreaded time: 2.49799990654 s

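As for why the links above are spaced 50 apart: I noticed that on Baidu Tieba the pn value increases by 50 each time you turn a page. In the output above, the multithreaded run took about 2.5 s against about 7.3 s for the single-threaded run, roughly a threefold speedup. I will not explain the multithreading in the code above in detail; anyone who has worked with Java will find it familiar, since the ideas are much the same. If you have not used Java, I am afraid you will have to work through the code above on your own.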
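For readers without a Java background, the pattern behind those timings can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that produced the output above: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, the helper names `run_sequential` and `run_threaded` are made up for illustration, and the URL template is inferred from the printed links. `multiprocessing.dummy.Pool` is a pool of threads that exposes the same `map` interface as the process pool.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool

# Assumed URL template, inferred from the printed links; pn advances by 50 per page.
BASE = 'http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=%d'
urls = [BASE % pn for pn in range(50, 500, 50)]  # pn = 50, 100, ..., 450


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url): waits, then returns the URL."""
    time.sleep(0.1)  # simulated network latency
    return url


def run_sequential(targets):
    """Fetch the targets one after another and return (results, elapsed seconds)."""
    start = time.time()
    results = [fake_fetch(u) for u in targets]
    return results, time.time() - start


def run_threaded(targets, workers=4):
    """Fetch the targets with a pool of worker threads and return (results, elapsed seconds)."""
    start = time.time()
    pool = ThreadPool(workers)
    results = pool.map(fake_fetch, targets)  # blocks until all are done, keeps input order
    pool.close()
    pool.join()
    return results, time.time() - start


if __name__ == '__main__':
    _, seq_time = run_sequential(urls)
    _, thr_time = run_threaded(urls)
    print('sequential: %.2f s' % seq_time)
    print('threaded:   %.2f s' % thr_time)
```

Threads help here because each worker spends most of its time waiting on I/O, so the waits overlap. For CPU-bound work the speedup would be much smaller.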

Hands-on -- Scraping Book Information from Dangdang

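I have always bought my books on Dangdang, so now that I know how to scrape information with Python, the first target is Dangdang's book information. The finished result of this exercise looks like this:

[Image: screenshot of the finished result]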

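Searching Dangdang for Java returned 89 pages of results, of which I crawled the first 80. To compare multithreaded and single-threaded efficiency I timed both: the single-threaded crawl took 67 s, while the multithreaded crawl took only 15 s.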

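How to scrape a page was already covered in the XPath section above: open the page, right-click and choose Inspect, read the page's HTML, find the pattern, and extract the information. I will not repeat that here. The code is short, so the complete source is included below without a line-by-line walkthrough.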
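Before reading the full program, the find-the-pattern-then-extract step can be tried offline on a small snippet. The sketch below makes several assumptions: the HTML in `SAMPLE` and the id `results` are invented for illustration and are far simpler than a real search page, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs. ElementTree supports only a subset of XPath and selects elements, so the attribute and text are read from the matched elements; with lxml, the `/@title` and `/text()` steps used in the article's code return the strings directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search-result page.
SAMPLE = """
<html><body>
<ul id="results">
  <li>
    <a title="Book A" href="#">Book A</a>
    <p>price</p><p>rating</p>
    <p><span>Author A</span><span>2016-01-01</span></p>
  </li>
  <li>
    <a title="Book B" href="#">Book B</a>
    <p>price</p><p>rating</p>
    <p><span>Author B</span><span>2015-06-01</span></p>
  </li>
</ul>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Pattern found by inspecting the markup: every result is an <li> under the
# list with a known id; the title is an attribute of its <a>, and the detail
# is the first <span> inside the third <p>.
titles = [a.get('title') for a in root.findall(".//ul[@id='results']/li/a")]
details = [s.text for s in root.findall(".//ul[@id='results']/li/p[3]/span[1]")]

for title, detail in zip(titles, details):
    print(title, '|', detail)
```

The complete program applies the same two queries, one for titles and one for details, to live pages.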

# coding=utf8
import re
import threading
import time

import requests
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool

write_lock = threading.Lock()  # serializes writes to the shared output file


def changepage(url, total):
  # Build the URLs from the page named in url up to page total (inclusive).
  urls = []
  nowpage = int(re.search(r'page_index=(\d+)', url).group(1))
  for i in range(nowpage, total + 1):
    link = re.sub(r'page_index=\d+', 'page_index=%s' % i, url)
    urls.append(link)
  return urls


def spider(url):
  # Download one result page and extract titles and details with XPath.
  html = requests.get(url)
  selector = etree.HTML(html.text)
  title = selector.xpath('//*[@id="component_0__0__6612"]/li/a/@title')
  detail = selector.xpath('//*[@id="component_0__0__6612"]/li/p[3]/span[1]/text()')
  saveinfo(title, detail)


def saveinfo(title, detail):
  # zip() pairs each title with its detail and stops at the shorter list.
  with write_lock:
    for t, d in zip(title, detail):
      f.write(t + '\n')
      f.write(d + '\n\n')


if __name__ == '__main__':
  pool = ThreadPool(4)
  f = open('info.txt', 'a', encoding='utf-8')
  url = 'http://search.dangdang.com/?key=Java&act=input&page_index=1'
  urls = changepage(url, 80)

  time1 = time.time()
  pool.map(spider, urls)
  pool.close()
  pool.join()
  time2 = time.time()

  f.close()
  print('Crawl finished!')
  print('Multithreaded time: ' + str(time2 - time1) + ' s')

  # Single-threaded version for comparison:
  # time1 = time.time()
  # for each in urls:
  #   spider(each)
  # time2 = time.time()
  # f.close()
  # print('Single-threaded time: ' + str(time2 - time1) + ' s')

As you can see, everything used in the code above was covered in detail in the sections on XPath and parallelization, so it should be easy to read.
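One design point is worth a second look: the program shares a single file handle `f` among the worker threads. Because `pool.map` returns the workers' results in input order, a possible variant is to let each worker return its records and do all the writing in the main thread. The following is only a sketch of that idea: `fake_spider` and `write_records` are hypothetical names, the worker fabricates placeholder records instead of downloading anything, and an in-memory `io.StringIO` stands in for `info.txt`.

```python
import io
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool


def fake_spider(url):
    """Hypothetical worker: returns (title, detail) pairs instead of writing them."""
    page = url.rsplit('=', 1)[-1]  # page number taken from the end of the URL
    return [('Title %s-%d' % (page, n), 'Detail %s-%d' % (page, n)) for n in (1, 2)]


def write_records(out, pages):
    """Write every record of every page; meant to be called from the main thread only."""
    count = 0
    for records in pages:
        for title, detail in records:
            out.write(title + '\n' + detail + '\n\n')
            count += 1
    return count


urls = ['http://search.dangdang.com/?key=Java&act=input&page_index=%d' % i
        for i in range(1, 4)]

pool = ThreadPool(4)
pages = pool.map(fake_spider, urls)  # one list of records per URL, in URL order
pool.close()
pool.join()

out = io.StringIO()  # stands in for open('info.txt', 'a', encoding='utf-8')
print(write_records(out, pages), 'records written')
```

With this layout the output file is also ordered by page, which is not guaranteed when each thread writes as soon as it finishes.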

That brings this series of articles on Python crawlers to a close. Thank you for reading.
