michaelh0226 posted on 2013-01-30 01:43:28

Crawl a website with scrapy

 
Introduction

In this article, we are going to see how to scrape information from a website, in particular from all pages that share a common URL pattern. We will see how to do that with Scrapy, a powerful yet simple scraping and web-crawling framework.
For example, you might be interested in scraping information about each article of a blog and storing that information in a database. To achieve this, we will implement a simple spider using Scrapy, which will crawl the blog and store the extracted data in a MongoDB database.
We will assume that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.
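Assuming pip is available on your system, the two prerequisites can be installed in one line:

```shell
# Install the two Python packages this article relies on.
pip install scrapy pymongo
```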
If you have never toyed around with Scrapy, you should first read this short tutorial.
First step: identify the URL pattern(s)

In this example, we'll see how to extract the following information from each isbullsh.it blog post:

- title
- author
- tag
- release date
- url
We’re lucky: all posts share the same URL pattern, http://isbullsh.it/YYYY/MM/title. These links can be found on the paginated pages of the site homepage.
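As a quick sanity check, this pattern can be written as a regular expression. The exact slug and trailing-slash rules are assumptions, since the original article does not spell them out:

```python
import re

# Matches http://isbullsh.it/YYYY/MM/title, with an optional trailing slash.
POST_URL_RE = re.compile(r"^http://isbullsh\.it/\d{4}/\d{2}/[^/]+/?$")

print(bool(POST_URL_RE.match("http://isbullsh.it/2012/04/some-title")))  # True
print(bool(POST_URL_RE.match("http://isbullsh.it/about")))               # False
```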
What we need is a spider which will follow all links matching this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.
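The integrity check can be as simple as verifying that every expected field was actually scraped before writing an item to MongoDB. A minimal sketch, with field names taken from the list above (the function name is ours, not from the original article):

```python
# Fields every scraped item must carry before being stored.
REQUIRED_FIELDS = ("title", "author", "tag", "date", "link")


def is_valid(item):
    """Return True if every required field is present and non-empty."""
    return all(item.get(field) for field in REQUIRED_FIELDS)


print(is_valid({"title": "A post", "author": "Someone", "tag": "scrapy",
                "date": "2012-04-01",
                "link": "http://isbullsh.it/2012/04/a-post"}))  # True
print(is_valid({"title": "A post"}))                            # False
```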
Building the spider

We create a Scrapy project, following the instructions from their tutorial. We obtain the following project structure:
isbullshit_scraping/
├── isbullshit
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── isbullshit_spiders.py
└── scrapy.cfg

We begin by defining, in items.py, the item structure which will contain the extracted information:
```python
from scrapy.item import Item, Field


class IsBullshitItem(Item):
    title = Field()
    author = Field()
    tag = Field()
    date = Field()
    link = Field()
```