Deleting text in a title while scraping

I am currently trying to scrape a YouTube playlist with Scrapy. The scrape works, but I would like to get only a portion of the title. For example, the video title is: 'Et si on mangeait la connaissance? | Idriss Aberkane | TEDxPanthéonSorbonne'. Through scrap
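Once the full title string has been scraped, trimming it is plain string handling. A minimal sketch, assuming the parts are separated by " | " as in the example above:

```python
# Title string taken from the question; the separator " | " is an assumption
# based on the example.
title = "Et si on mangeait la connaissance? | Idriss Aberkane | TEDxPanthéonSorbonne"

# Keep only the part before the first " | " separator.
short_title = title.split(" | ")[0].strip()
print(short_title)  # Et si on mangeait la connaissance?
```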

Scrapy - Get Duplicate Items When Adding Items Using a Loop

I am crawling data from a JSON response, extracting it into an item with a for loop, and all I get is the last record overwriting all the previous records made by the loop. Here is my code: def parse_centers_and_ambulances(self, response): json_response
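A likely cause for this symptom is creating the item once outside the loop, so every appended element refers to the same object. A sketch with plain dicts (the records and field name are hypothetical stand-ins for the JSON response):

```python
# Hypothetical records standing in for entries of a JSON response.
json_response = [{"name": "A"}, {"name": "B"}, {"name": "C"}]

# Buggy pattern: one shared dict mutated on every iteration --
# every element of items_buggy ends up pointing at the same object.
item = {}
items_buggy = []
for entry in json_response:
    item["name"] = entry["name"]
    items_buggy.append(item)
print([i["name"] for i in items_buggy])  # ['C', 'C', 'C']

# Fix: create a fresh item inside the loop (in Scrapy, instantiate the
# Item class inside the for loop before filling its fields).
items_ok = []
for entry in json_response:
    item = {"name": entry["name"]}
    items_ok.append(item)
print([i["name"] for i in items_ok])  # ['A', 'B', 'C']
```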

In Scrapy, how to set the time limit for each URL?

I am trying to crawl multiple websites using Scrapy's link extractor with follow=True (recursive). I am looking for a way to set a time limit for crawling each URL in the start_urls list. Thanks. import scrapy class DmozItem(scrapy.Item): title = scra
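Scrapy has no built-in per-start-URL time budget, but two related knobs exist: DOWNLOAD_TIMEOUT bounds each individual request, and CLOSESPIDER_TIMEOUT stops the whole crawl. A sketch of the settings fragment (values are illustrative):

```python
# Sketch of relevant Scrapy settings (a config fragment, not a full spider).
custom_settings = {
    "DOWNLOAD_TIMEOUT": 30,       # seconds the downloader waits per request
    "CLOSESPIDER_TIMEOUT": 3600,  # seconds before the entire crawl is closed
}
# Per-request variant: yield scrapy.Request(url, meta={"download_timeout": 30})
```

Limiting the time spent on one start URL's whole subtree would need custom logic, e.g. timestamping in meta and dropping requests once the budget is spent.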

Find email addresses in the body using Scrapy

I am trying to find all the email addresses on a page using Scrapy. I found an XPath which should return the email addresses, but when I run the code below it doesn't find any email addresses (which I know are there), and I get errors like: File "C:\Ana
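XPath alone is a poor fit for pattern matching; the usual approach is to pull the text and apply a regex (in Scrapy, `response.xpath('//body//text()').re(...)` combines both steps). A minimal sketch with made-up HTML and addresses:

```python
import re

# The HTML and addresses here are invented for illustration.
html = "<p>Contact: alice@example.com or bob.smith@mail.example.org</p>"

# Simple email-like pattern; real-world email matching has many edge cases.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```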

Use Scrapy to retrieve nested JSON data?

I am trying to write a Web app that crawls info from Sony's PlayStation store. I've found the JSON file that has the data I want, but I'm wondering how to use Scrapy to store only certain elements of the JSON file? Here's part of the JSON data: { "ag
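When the payload is JSON, parsing it with the standard library and indexing into the structure is usually simpler than selectors. A sketch with a hypothetical fragment shaped like the store response (the real keys on the PlayStation store may differ):

```python
import json

# Invented fragment for illustration; real field names may differ.
raw = '''
{
  "age_limit": 12,
  "name": "Sample Game",
  "skus": [{"display_price": "$19.99", "name": "Full Game"}]
}
'''

data = json.loads(raw)  # in a Scrapy callback: json.loads(response.text)
price = data["skus"][0]["display_price"]  # drill into the nested structure
print(data["name"], price)  # Sample Game $19.99
```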

Scrapy XPath for RateMyProfessors

I am new to Scrapy and have already spent so much time on this simple program, but I cannot figure it out. I used Chrome to inspect the XPath for the links to all the professors on this page and used the console to test the XPath. When I put the "correct xpath

Authentication form / Connecting to a site using Scrapy

I am a beginner with Scrapy. I am trying to log in to a site so that I can scrape it, but I am stuck. Below is the code in my spider: import scrapy from scrapy.http import FormRequest from scrapy.selector import HtmlXPathSelector from scr

Scrapy spider does not scrape properly

I am using Python.org 2.7 64 bit shell on Windows Vista. I have Scrapy installed and it seems to be stable and working. However, I have copied the following simple piece of code: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXP

Allow only internal links in a web-wide Scrapy crawl

I am using Scrapy to crawl thousands of websites. I have a large list of domains to crawl. Everything works fine, except that the crawler follows external links too, which is why it crawls far more domains than necessary. I already tried to use "all
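In Scrapy the usual fix is to list the domains in the spider's `allowed_domains` (enforced by the offsite middleware) or to pass `allow_domains` to the LinkExtractor. The core check those mechanisms perform can be sketched with the standard library:

```python
from urllib.parse import urlsplit

def is_internal(url, allowed_domains):
    """Return True if url's host is one of the allowed domains or a subdomain."""
    host = urlsplit(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# Example domains are hypothetical.
allowed = ["example.com"]
print(is_internal("http://www.example.com/page", allowed))  # True
print(is_internal("http://other.org/page", allowed))        # False
```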

XPath gets text from multiple lines

I have this html <td width="70%">REGEN REAL ESTATE, Dubai – U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view
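When text is split across an element's own text and its children's text and tails, selecting a single text node misses most of it; the usual answer is to collect every descendant text node (in Scrapy, `//td//text()`) and join. A standard-library sketch over a simplified, well-formed stand-in for the `<td>` above:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <td> in the question: the text is split
# between the element's own text and the link's text and tail.
td = ET.fromstring(
    '<td>REGEN REAL ESTATE, Dubai - U.A.E '
    '<a href="http://www.justproperty.com/">RERA ID: 12087</a>'
    ' Licensed Property Brokers</td>'
)

# itertext() walks every descendant text node in document order --
# the same idea as the XPath //td//text() in Scrapy.
full_text = " ".join(" ".join(td.itertext()).split())
print(full_text)
```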

Scrapy and the XPath 'contains' function syntax

I'm running Scrapy 0.20.2. $ scrapy shell "http://newyork.craigslist.org/ata/" I would like to build a list of all links to advertisement pages, leaving aside index.html. $ sel.xpath('//a[contains(@href,html)]') ... <Selector xpath='//a[conta

Scrapy Tutorial (Python) - ImportError: Error loading object

I am trying to run the basic Scrapy tutorial on Win32. When I try scrapy crawl dmoz, it shows me the following error: File "C:\Python27\lib\site-packages\scrapy\utils\misc.py", line 40, in load_object raise ImportError, "Error loading

Scrapy Very basic example

Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website. They were trying to run the command: scrapy crawl mininova.org -o scraped_data.json -t json I don't quite understand what this means. look

Scrapy: transmitting data between two spiders

I need to create a spider that crawls some data from a web site. Part of the data is an external URL. I already created the spider that crawls the data from the root site, and now I want to write the spider for the external web pages. I was thinking of
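Rather than two spiders, the usual Scrapy pattern is a single spider that yields a Request for the external URL and carries the already-scraped fields along in `meta` (newer versions also offer `cb_kwargs`), finishing the item in the second callback. A plain-Python sketch of that hand-off (function and field names are hypothetical):

```python
# Stand-in for Scrapy's meta-passing pattern between two callbacks.
def parse_root(page):
    item = {"title": page["title"]}
    external_url = page["external_url"]
    # in Scrapy: yield scrapy.Request(external_url,
    #                                 meta={"item": item},
    #                                 callback=self.parse_external)
    return external_url, item

def parse_external(external_page, item):
    # in Scrapy: item = response.meta["item"]
    item["external_title"] = external_page["title"]  # finish the item here
    return item

root = {"title": "Root page", "external_url": "http://other.example/"}
url, partial = parse_root(root)
result = parse_external({"title": "External page"}, partial)
print(result)  # {'title': 'Root page', 'external_title': 'External page'}
```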

Scrapy: HTML XPath selector returns the result as HTML?

How do I retrieve all of the HTML contained inside a tag? hxs = HtmlXPathSelector(response) element = hxs.select('//span[@class="title"]/') perhaps = hxs.select('//span[@class="title"]/html()') html_of_tag = ? EDIT: if I look at the do
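In Scrapy, extracting the matched node yields its outer HTML; the inner HTML is commonly obtained by selecting the node's children, e.g. `''.join(hxs.select('//span[@class="title"]/node()').extract())`. The same idea, demonstrated with the standard library on a made-up span:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the <span class="title"> node in the question.
span = ET.fromstring('<span class="title">Hello <b>world</b>!</span>')

# "Inner HTML": the element's own leading text plus each child serialized
# (tostring includes a child's tail text, e.g. the trailing "!").
inner = (span.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in span
)
print(inner)  # Hello <b>world</b>!
```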

How to stop Scrapy from ignoring the hash fragment

I am working with Scrapy. I have a site to scrape whose URLs include a hash fragment, but when I run it, Scrapy downloads the response while ignoring the hash fragment. For example, this is the URL with the hash fragment: url="www.example.com/hash-tag.php#user_id-654" and t
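This is not Scrapy-specific: the part after `#` is never sent to the server by any HTTP client, so the content behind it is normally loaded client-side by JavaScript, and the underlying AJAX URL has to be found and requested directly. The split can be seen with the standard library:

```python
from urllib.parse import urlsplit

# The fragment lives only on the client side: an HTTP request carries the
# path and query, never the part after '#'.
url = "http://www.example.com/hash-tag.php#user_id-654"
parts = urlsplit(url)
print(parts.path)      # /hash-tag.php -> what the server sees
print(parts.fragment)  # user_id-654   -> handled by the browser/JS only
```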

Scrapy.bat explanation

In the Python Scrapy framework there is a scrapy.bat file: @echo off setlocal "%~dp0..\python" "%~dp0scrapy" %* endlocal Could someone explain what this does? Especially this line "%~dp0..\python" "%~dp0scrapy" %*.T