How do I get a Scrapy pipeline to fill my MongoDB with my items? Here is what my code looks like at the moment, which reflects the information I got from the Scrapy documentation. I also want to mention that I have tried returning items instead of yielding them, and tried using item loaders; all methods seem to have the same outcome. On that note, if I run the command

mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json

my database gets populated (as long as I yield and don't return items)... I would really love to get this pipeline working, though.
Okay, so here is my code. First, my spider:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from capstone.items import CapstoneItem
class CongressSpider(CrawlSpider):
    name = "congress"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
        'https://www.congress.gov/members',
    ]

    # Creating a rule for my crawler: I only want it to continue to the next page,
    # not follow any other links.
    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }
here is my settings.py:
BOT_NAME = 'capstone'
SPIDER_MODULES = ['capstone.spiders']
NEWSPIDER_MODULE = 'capstone.spiders'
ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congress'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10
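A quick sanity check, assuming the scrapy CLI is run from inside the project directory: the settings command prints what Scrapy actually loaded for a given name, so an empty dict here means no pipeline is registered at all:

$ scrapy settings --get ITEM_PIPELINES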
here is my pipeline.py:

import pymongo
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection details out of settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() is the pymongo 3.x call; the older insert() is deprecated.
        self.db[self.collection_name].insert_one(dict(item))
        return item
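As an aside, DropItem is imported but never raised; the usual pattern (and the one in the Real Python tutorial linked further down) is to validate inside process_item and discard incomplete items. A minimal sketch, assuming a row with no member name should be thrown away:

def process_item(self, item, spider):
    if not item.get('member'):
        # Discard items that scraped no member name; Scrapy logs the drop.
        raise DropItem("Missing member name in %s" % item)
    self.db[self.collection_name].insert_one(dict(item))
    return item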
here is my items.py:

import scrapy


class CapstoneItem(scrapy.Item):
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()
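One thing to watch out for here: the spider yields dicts with the key 'Served' (capital S), while items.py declares the field as lowercase served. Plain dicts don't care, but scrapy.Item raises a KeyError for any field that isn't declared, so switching to items means the names have to line up ('District' is capitalized in both, so that one is fine). A minimal sketch of the item-based version of the loop, assuming 'Served' is renamed to served to match items.py:

from capstone.items import CapstoneItem

# inside CongressSpider -- a sketch of the loop body, not the full spider
def parse_page(self, response):
    for search in response.selector.xpath(".//li[@class='compact']"):
        item = CapstoneItem()
        item['member'] = ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip()
        # state, District, and party follow the same pattern as the dict version above
        item['served'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip()
        yield item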
Last but not least, my output looks like this:
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8007,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'downloader/response_bytes': 757157,
'downloader/response_count': 24,
'downloader/response_status_count/200': 24,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
'item_scraped_count': 2139,
'log_count/DEBUG': 2164,
'log_count/INFO': 11,
'request_depth_max': 22,
'response_received_count': 24,
'scheduler/dequeued': 23,
'scheduler/dequeued/memory': 23,
'scheduler/enqueued': 23,
'scheduler/enqueued/memory': 23,
'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)
So it seems to me like I am not getting any errors, and my items were scraped. If I had run it with -o myfile.json I could import that file into my MongoDB, but the pipeline just isn't doing anything!
mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
> show dbs
congress 0.078GB
local 0.078GB
> use congress
switched to db congress
> show collections
members
system.indexes
> db.members.count()
0
>
I suspect my problem has to do with my settings file. I am new to Scrapy and MongoDB, and I have a feeling I haven't told Scrapy where my MongoDB is correctly. Here are some other sources I found; I tried using them as examples, but everything I tried just led to the same result (scraping was done, Mongo was empty): https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/ and https://github.com/sebdah/scrapy-mongodb. I have a bunch more sources, but not enough reputation to post more, unfortunately. Anyway, any thoughts would be much appreciated. Thanks.
I commented out my line of code that said

ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}

and uncommented the line that was already inside the generated settings file, well below all my other settings:

ITEM_PIPELINES = {
    'capstone.pipelines.MongoDBPipeline': 300,
}

At first the only difference I could see was the newlines, but the real difference is the spelling: my line said ITEM_PIPLINES (missing the second E), and Scrapy silently ignores setting names it does not recognize. After getting this to work, I started getting Python errors from the typos that were in my pipeline file (for one, I had left out the comma after mongo_uri=crawler.settings.get('MONGO_URI') in from_crawler). The giveaway that my pipeline had never been connecting was this line in the output, printed before any items were scraped:

[scrapy.middleware] INFO: Enabled item pipelines:
[]

After changing my settings I got this:

[scrapy.middleware] INFO: Enabled item pipelines:
['capstone.pipelines.MongoDBPipeline']
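With the pipeline finally enabled, the quickest confirmation is the same mongo shell check from above; db.members.count() should now match Scrapy's item_scraped_count instead of returning 0:

> use congress
switched to db congress
> db.members.count()  // expect this to match item_scraped_count (2139 in the run above)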