How can I get scrapy pipelines to fill my mongodb with my items?


How do I get a Scrapy pipeline to fill my MongoDB database with my items? Here is what my code looks like at the moment, which reflects the information I got from the Scrapy documentation. I also want to mention that I have tried returning items instead of yielding them, as well as using item loaders; all methods seem to have the same outcome. On that note, if I run the command `mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json`, my database gets populated (as long as I yield and don't return items). I would really love to get this pipeline working, though.

okay, so here is my code.

here is my spider:

    import scrapy

    from scrapy.selector import Selector
    from scrapy.loader import ItemLoader
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import HtmlResponse
    from capstone.items import CapstoneItem

    class CongressSpider(CrawlSpider):
        name = "congress"
        allowed_domains = [""]
        start_urls = [
            # (start URL omitted from the post)
        ]

        # Rule for my crawler: only continue to the next page, don't follow any other links.
        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

        def parse_page(self, response):
            for search in response.selector.xpath(".//li[@class='compact']"):
                yield {
                    'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                    'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                    'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                    'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                    'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
                }

here is my settings.py:

    BOT_NAME = 'capstone'

    SPIDER_MODULES = ['capstone.spiders']
    NEWSPIDER_MODULE = 'capstone.spiders'

    ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'congress'

here is my pipelines.py:

    import pymongo


    class MongoDBPipeline(object):
        collection_name = 'members'

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item
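To sanity-check the pipeline logic without a live crawl or a running MongoDB server, something along these lines can drive `process_item` directly with a stub collection (the `FakeCollection` class and the sample item are mine, not from the original project):

```python
# Minimal stub standing in for a pymongo collection, so the pipeline's
# process_item logic can be exercised without a running MongoDB server.
class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class MongoDBPipeline(object):
    """Trimmed copy of the pipeline above, minus the pymongo connection."""
    collection_name = 'members'

    def __init__(self, db):
        self.db = db

    def process_item(self, item, spider):
        # Insert a plain dict copy, then hand the item on downstream.
        self.db[self.collection_name].insert_one(dict(item))
        return item


db = {'members': FakeCollection()}
pipeline = MongoDBPipeline(db)

item = {'member': 'Jane Doe', 'state': 'California', 'party': 'D'}
returned = pipeline.process_item(item, spider=None)

print(len(db['members'].docs))  # 1: the item was inserted
print(returned is item)         # True: the item is passed on downstream
```

If this sketch works but the real pipeline still writes nothing, the problem is almost certainly that the pipeline is never being invoked at all, not that the insert logic is wrong.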

here is my items.py:

    class CapstoneItem(scrapy.Item):
        member = scrapy.Field()
        state = scrapy.Field()
        District = scrapy.Field()
        party = scrapy.Field()
        served = scrapy.Field()

last but not least, my output looks like this:

    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 8007,
    'downloader/request_count': 24,
    'downloader/request_method_count/GET': 24,
    'downloader/response_bytes': 757157,
    'downloader/response_count': 24,
    'downloader/response_status_count/200': 24,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
    'item_scraped_count': 2139,
    'log_count/DEBUG': 2164,
    'log_count/INFO': 11,
    'request_depth_max': 22,
    'response_received_count': 24,
    'scheduler/dequeued': 23,
    'scheduler/dequeued/memory': 23,
    'scheduler/enqueued': 23,
    'scheduler/enqueued/memory': 23,
    'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)

so it seems to me like I am not getting any errors, and my items were scraped. If I run it with `-o myfile.json`, I can import the file into my MongoDB, but the pipeline just isn't doing anything!

     MongoDB shell version: 3.2.12
     connecting to: test
     Server has startup warnings:
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **    WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
     > show dbs
     congress  0.078GB
     local     0.078GB
     > use congress
     switched to db congress
     > show collections
     > db.members.count()

I suspect my problem has to do with my settings file. I am new to Scrapy and MongoDB, and I have a feeling I haven't correctly told Scrapy where my MongoDB instance is. Here are some other sources I found; I tried using them as examples, but everything I tried led to the same result (scraping finished, Mongo stayed empty). I have a bunch more sources but not enough reputation to post more, unfortunately. Anyway, any thoughts would be much appreciated. Thanks.

I commented out my line of code that said

    ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}

and I uncommented the line that was already inside the settings file:

    ITEM_PIPELINES = {
        'capstone.pipelines.MongoDBPipeline': 300,
    }

The real difference turned out to be the spelling: my line said `ITEM_PIPLINES`, but the setting Scrapy actually reads is `ITEM_PIPELINES`, so my pipeline was never enabled. (The working line was also set well below all my other settings.) After getting this to work, I started getting Python errors about the typos in my pipeline file. I figured out that my pipeline wasn't connecting because of this output before my items were being scraped:

    [scrapy.middleware] INFO: Enabled item pipelines: []

After I changed my settings, I got this:

    [scrapy.middleware] INFO: Enabled item pipelines: ['capstone.pipelines.MongoDBPipeline']
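In hindsight, the empty `Enabled item pipelines: []` line was the key clue: Scrapy looks up settings by exact name, so a misspelled key is silently ignored rather than raising an error. A small sketch of the lookup (the dictionary here is illustrative, not Scrapy's actual settings object):

```python
# A misspelled settings key is silently ignored: the lookup uses the
# exact name ITEM_PIPELINES, so the typo'd entry is simply never read.
settings = {'ITEM_PIPLINES': {'capstone.pipelines.MongoDBPipeline': 300}}  # typo: missing "E"

enabled = settings.get('ITEM_PIPELINES', {})  # the name Scrapy actually reads
print(enabled)  # {} -> logged as "Enabled item pipelines: []"
```

So whenever the crawl finishes cleanly but the database stays empty, checking that startup log line is a much faster diagnostic than re-reading the pipeline code.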