Scrapy in Practice: Searching for and Scraping Job Listings from 前程无忧 (51job) (Basics)

Levy_X 2017-09-05


I. Development environment

OS: Windows 7 Ultimate, 64-bit
Python: 2.7.10
Scrapy: 1.0.3
MySQL: 5.6.21
Sublime Text 2: 2.0.2

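II. Goal

On 前程无忧 (51job), entering a keyword into the job search returns the matching job listings. Here we implement this with a Scrapy crawler that collects the relevant listings automatically and saves them both as a .json file and in a MySQL database.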

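III. Implementation steps

Scrapy is a popular Python crawling framework. The basic workflow for building a Scrapy crawler is:

1. Create a new crawler project with scrapy startproject spiderproject, where spiderproject is a project name of your choosing. In this example we run scrapy startproject qcwy, so qcwy is our project name.

2. Define the Item structure for the data to be extracted, in items.py.

3. Implement data storage in pipelines.py. This is where scraped data can be saved to a .json file, to MySQL, SQLite, or MongoDB, or to any other format or database you need.

4. In the spiders directory under the project folder created in step 1, create a new .py file, preferably with "spider" in its name, since this file implements the core component, the Spider.

5. Adjust the relevant settings in settings.py.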
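Step 3 above can be sketched without Scrapy installed, because an item pipeline is a plain Python class: Scrapy calls open_spider once at the start, process_item for every item, and close_spider once at the end. The class name JsonLinesPipeline and the default file name items.jl are hypothetical, chosen only for illustration, and the sketch targets Python 3 rather than the Python 2.7 environment listed above.

```python
# -*- coding: utf-8 -*-
import io
import json


class JsonLinesPipeline(object):
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path='items.jl'):
        # 'items.jl' is an assumed default, not a name used by the article.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once by Scrapy when the spider starts.
        self.file = io.open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        # Returning the item hands it on to the next enabled pipeline.
        return item

    def close_spider(self, spider):
        # Called once by Scrapy when the spider finishes.
        self.file.close()
```

Because the hooks are ordinary methods, the class can be exercised by hand (passing None as the spider) before it is wired into a project.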

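The following detailed example shows how to achieve our goal with Scrapy:

1. In CMD, enter cd c:\ to go to the root of drive C.

2. Enter scrapy startproject qcwy. A qcwy folder now appears in the root of drive C (this article assumes your development environment is set up correctly; for configuration details, see the Scrapy documentation).

Inside the qcwy folder there is another qcwy folder and the project configuration file scrapy.cfg, which you can leave alone. Go into the inner qcwy folder and you will see the generated project files, including items.py, pipelines.py, settings.py, and the spiders directory.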

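3. Define the Item class for the data to be extracted

The full code of items.py:

    from scrapy.item import Item, Field


    class QcwyItem(Item):
        # Item structure for the scraped fields
        title = Field()       # job title
        link = Field()        # link to the job detail page
        company = Field()     # company name
        updatetime = Field()  # last update date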

4. The full code of pipelines.py:

    import codecs
    import json
    import logging

    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi

    logger = logging.getLogger(__name__)


    class QcwyJsonPipeline(object):
        def __init__(self):
            self.file = codecs.open('..\\qcwy\\qcwy\\qcwy.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Write one JSON object per line and keep non-ASCII text unescaped
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item

        def close_spider(self, spider):
            # Scrapy calls close_spider on each pipeline when the spider finishes
            self.file.close()


    class QcwyMySQLPipeline(object):
        '''Stores items in MySQL through a Twisted connection pool.'''

        def __init__(self):
            self.connpool = adbapi.ConnectionPool(
                'MySQLdb',
                host='127.0.0.1',
                db='qcwy',
                user='root',
                passwd='',
                cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8',
                use_unicode=True,
            )

        def process_item(self, item, spider):
            # Run the insert in the pool's thread so the crawl is not blocked
            query = self.connpool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            # Skip items without a title
            if item.get('title'):
                tx.execute(
                    'insert into detail (title, link, company, updatetime) values(%s, %s, %s, %s)',
                    (item['title'], item['link'], item['company'], item['updatetime']))

        def handle_error(self, failure):
            # Log database errors instead of stopping the crawl
            logger.error(failure)
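The MySQL pipeline inserts into a table named detail, but the article does not show how that table is defined. The sketch below is one possible layout, using the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a database server. The column names come from the insert statement in the pipeline; the column types, the id column, and the helper name save_item are assumptions.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Assumed schema: column names are taken from the pipeline's INSERT statement,
# while the types and the id column are guesses for illustration.
CREATE_DETAIL = """
CREATE TABLE IF NOT EXISTS detail (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    link TEXT,
    company TEXT,
    updatetime TEXT
)
"""


def save_item(conn, item):
    """Insert one item if it has a title, mirroring _conditional_insert."""
    if not item.get('title'):
        return False
    # sqlite3 uses ? placeholders; MySQLdb uses %s as in the pipeline.
    conn.execute(
        'INSERT INTO detail (title, link, company, updatetime) VALUES (?, ?, ?, ?)',
        (item['title'], item['link'], item['company'], item['updatetime']),
    )
    conn.commit()
    return True


conn = sqlite3.connect(':memory:')
conn.execute(CREATE_DETAIL)
save_item(conn, {
    'title': u'中级后端Python工程师',
    'link': 'http://search./job/70631365,c.html',
    'company': u'广东威法科技发展有限公司',
    'updatetime': '2015-09-29',
})
```

Passing the values as a parameter tuple, as the pipeline also does, lets the driver handle quoting, which matters for scraped text that may contain quotes.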

5. Implementing the Spider

    # -*- coding: utf-8 -*-
    import sys
    import urllib

    import scrapy

    from qcwy.items import QcwyItem

    # Python 2 workaround so implicit str/unicode conversions use UTF-8
    reload(sys)
    sys.setdefaultencoding('utf-8')

    keyword = 'Python'
    # Percent-encode the keyword so that it is valid inside a URL
    keywordcode = urllib.quote(keyword)


    class TestfollowSpider(scrapy.Spider):
        name = 'qcwysearch'
        allowed_domains = ['']
        start_urls = [
            'http://search./jobsearch/search_result.php?fromJs=1&jobarea=030200%2C00&funtype=0000&industrytype=00&keyword=' + keywordcode,
        ]

        def parse(self, response):
            # The response for the start URL is already the first result page,
            # so parse it directly instead of requesting the same URL again
            return self.parse_dir_contents(response)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//table[@id="resultList"]/tr[@class="tr0"]'):
                item = QcwyItem()
                temp = sel.xpath('td[@class="td1"]/a/text()').extract()
                if len(temp) > 0:
                    # Rebuild the title from the first and last text nodes
                    # with the keyword between them
                    item['title'] = temp[0] + keyword + temp[-1]
                else:
                    item['title'] = keyword
                item['link'] = sel.xpath('td[@class="td1"]/a/@href').extract()[0]
                item['company'] = sel.xpath('td[@class="td2"]/a/text()').extract()[0]
                item['updatetime'] = sel.xpath('td[@class="td4"]/span/text()').extract()[0]
                yield item

            # Follow the "next page" link in the last cell of the pager, if any
            next_page = response.xpath('//table[@class="searchPageNav"]/tr/td[last()]/a/@href')
            if next_page:
                url = response.urljoin(next_page[0].extract())
                yield scrapy.Request(url, self.parse_dir_contents)
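The spider percent-encodes the keyword before appending it to the search URL. With the ASCII keyword Python the encoding changes nothing, so the effect is easier to see with a Chinese keyword. The following is a minimal sketch that runs on both Python 2 and Python 3; the helper name encode_keyword and the example.com base URL are hypothetical and stand in for the real search URL.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used in the spider


def encode_keyword(keyword):
    """Percent-encode a unicode search keyword as UTF-8 for a query string."""
    return quote(keyword.encode('utf-8'))


# Hypothetical base URL, used only for illustration.
BASE_URL = 'http://example.com/search?keyword='

print(BASE_URL + encode_keyword(u'Python'))
print(BASE_URL + encode_keyword(u'爬虫'))
# → http://example.com/search?keyword=Python
# → http://example.com/search?keyword=%E7%88%AC%E8%99%AB
```

Encoding to UTF-8 bytes first keeps the result identical on both Python versions; whether the target site expects UTF-8 or another encoding such as GBK in its query string is something to check against the site itself.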

6. Add the following to settings.py:

    ITEM_PIPELINES = {
        'qcwy.pipelines.QcwyJsonPipeline': 300,
        'qcwy.pipelines.QcwyMySQLPipeline': 800,
    }
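The integers in ITEM_PIPELINES set the order of the pipelines: Scrapy passes each item through the enabled pipelines in ascending order of these values, which by convention fall between 0 and 1000. With the values above, every item is written to the JSON file first and to MySQL second. The helper pipeline_order below is hypothetical and only illustrates that ordering; it is not part of Scrapy.

```python
ITEM_PIPELINES = {
    'qcwy.pipelines.QcwyJsonPipeline': 300,
    'qcwy.pipelines.QcwyMySQLPipeline': 800,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by priority, lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(pipeline_order(ITEM_PIPELINES))
# → ['qcwy.pipelines.QcwyJsonPipeline', 'qcwy.pipelines.QcwyMySQLPipeline']
```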

7. In CMD, enter cd c:\qcwy to go to the project directory.

Enter scrapy crawl qcwysearch

Then open the generated json file, which contains the information we wanted:

    {"company": "广东威法科技发展有限公司", "link": "http://search./job/70631365,c.html", "updatetime": "2015-09-29", "title": "中级后端Python工程师"}
    {"company": "珠海横琴新区盖网通传媒有限公司广州分公司", "link": "http://search./job/69706486,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州明朝互动科技股份有限公司", "link": "http://search./job/69898292,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州市赛酷比软件有限公司", "link": "http://search./job/69816266,c.html", "updatetime": "2015-09-29", "title": "分布式软件架构师(Python..."}
    {"company": "广州市赛酷比软件有限公司", "link": "http://search./job/69816017,c.html", "updatetime": "2015-09-29", "title": "爬虫工程师Python爬虫工程师"}
    {"company": "广州杰升信息科技有限公司", "link": "http://search./job/51077705,c.html", "updatetime": "2015-09-29", "title": "工程师Python工程师"}
    {"company": "广州飞屋网络科技有限公司", "link": "http://search./job/71800002,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州飞屋网络科技有限公司", "link": "http://search./job/59931502,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州达新信息科技有限公司", "link": "http://search./job/69614188,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州埃立方通信技术有限公司", "link": "http://search./job/71920362,c.html", "updatetime": "2015-09-29", "title": "软件工程师Python软件工程师"}
    {"company": "广州黑珍珠计算机技术有限公司", "link": "http://search./job/60592771,c.html", "updatetime": "2015-09-29", "title": "服务器端程序员Python服务器端程序员"}
    {"company": "聚力精彩（北京）信息技术有限公司", "link": "http://search./job/72246020,c.html", "updatetime": "2015-09-29", "title": "高级开发工程师AEBPython高级开发工程师AEB"}
    {"company": "广州七乐康药业连锁有限公司", "link": "http://search./job/71362309,c.html", "updatetime": "2015-09-29", "title": "中级开发工程师Python中级开发工程师"}
    {"company": "广州七乐康药业连锁有限公司", "link": "http://search./job/69404827,c.html", "updatetime": "2015-09-29", "title": " 初级开发工程师Python 初级开发工程师"}
    {"company": "广州七乐康药业连锁有限公司", "link": "http://search./job/69207304,c.html", "updatetime": "2015-09-29", "title": "资深开发工程师Python资深开发工程师"}
    {"company": "贵州格安科技有限公司广州分公司", "link": "http://search./job/63157264,c.html", "updatetime": "2015-09-29", "title": "高级Python开发工程师"}
    {"company": "贵州格安科技有限公司广州分公司", "link": "http://search./job/60035771,c.html", "updatetime": "2015-09-29", "title": " 开发工程师Python 开发工程师"}
    {"company": "珠海麒润网络科技有限公司", "link": "http://search./job/51153619,c.html", "updatetime": "2015-09-29", "title": "PHP/Python开发工程师"}
    {"company": "广州百伦贸易有限公司", "link": "http://search./job/71097137,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}
    {"company": "广州暴雨网络技术有限公司", "link": "http://search./job/69546409,c.html", "updatetime": "2015-09-29", "title": "服务端开发工程师Python服务端开发工程师"}
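Since QcwyJsonPipeline writes one JSON object per line, the file can be read back line by line with the standard json module. The sketch below counts listings per company, assuming that one-object-per-line layout; the function name count_by_company is hypothetical, and the three sample records are copied from the output above.

```python
# -*- coding: utf-8 -*-
import json
from collections import Counter


def count_by_company(lines):
    """Parse JSON-lines text and count job listings per company."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record['company']] += 1
    return counts


# Three records taken from the crawl output shown in the article.
sample = [
    u'{"company": "广州飞屋网络科技有限公司", "link": "http://search./job/71800002,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}',
    u'{"company": "广州飞屋网络科技有限公司", "link": "http://search./job/59931502,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}',
    u'{"company": "广州达新信息科技有限公司", "link": "http://search./job/69614188,c.html", "updatetime": "2015-09-29", "title": "开发工程师Python开发工程师"}',
]

for company, n in count_by_company(sample).most_common():
    print(company, n)
```

To run it on the real output, the same function accepts an open file object, for example one returned by io.open on the json file with encoding='utf-8'.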

Viewing the MySQL database in phpMyAdmin, you can see the scraped records stored in the detail table.