Scraping text with no tag (BeautifulSoup)

I am very new to scraping. As I understand it, BeautifulSoup only extracts the data found inside tags (with functions like get, find, find_all ...). The source code of the website I am scraping displays the various items inside the same tag and t

Reliably scraping stock price charts

The Problem: My goal is to automate scraping a table with currency prices from this website (stock prices). As the stock broker doesn't provide an API, I'm forced to find workarounds. I have already searched for applications for this purpose in order to

Get all the href links using Selenium in Python

I am practicing Selenium in Python and I want to fetch all the links on a web page using Selenium. For example, I want all the links in 'a href' tags from this website: I've written a script and it is working. But, it's

Remove white space from scraped text

$url = 'MyUrl';
$contents = file_get_contents($url);
function scrape_between($data, $start, $end){
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    $data = substr($data, 0, $stop);
    return $data;
}
$

Need advice on how to speed up the web scraper

I am still pretty new to this. I am trying to pull data from web pages, but this method I have implemented seems a bit slow. I used the time module to narrow down the cause of the lag. requests.get(url) took the majority of the time (1-5 seconds) sou

Authentication form / Connecting to a site using Scrapy

I am a beginner with Scrapy. I am trying to log in to a site so that I can scrape it, but I am stuck. Below is the code in my spider. Spider: import scrapy from scrapy.http import FormRequest from scrapy.selector import HtmlXPathSelector from scr

Python - Login to Web Scrape

I'm trying to web-scrape a page that requires me to be logged in. I have done this using the .ROBLOSECURITY cookie; however, that cookie changes every few days. I want to instead log in using the login form and Python. The form and

Waiting for a table to fully load using Selenium with Python

I want to scrape some data from a table on a page, so I am only interested in the data in the table. Earlier I was using Mechanize, but I found that sometimes some of the data is missing, especially at the bottom of the table. Googling, I found

Search for a JavaScript variable with the re module in an HTML page

I have an HTML page with JavaScript. There is some JS code in the HTML: <script type="text/javascript"> var viewAllLimiter = 0; </script> How can I find viewAllLimiter and get its value? I have tried: #hh2 - opened page with
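A minimal sketch of one way to approach this with Python's re module, assuming the variable is declared with var and holds a plain integer as in the snippet above. The helper name find_js_int and the sample page are hypothetical, made up for illustration; a regular expression like this will not cope with arbitrary JavaScript.

```python
import re

# Sample page modeled on the snippet in the question (hypothetical wrapper markup).
html = """
<html><head>
<script type="text/javascript"> var viewAllLimiter = 0; </script>
</head><body></body></html>
"""


def find_js_int(source, name):
    """Return the integer assigned in `var <name> = <int>;`, or None if absent."""
    # re.escape guards against names that contain regex metacharacters.
    pattern = r"\bvar\s+" + re.escape(name) + r"\s*=\s*(-?\d+)\s*;"
    match = re.search(pattern, source)
    return int(match.group(1)) if match else None


value = find_js_int(html, "viewAllLimiter")
print(value)
```

Returning None for a missing variable lets the caller tell "not found" apart from a value of 0.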

Scraper returns an empty array

I'm a bit new at curl and xpath so still learning the ins and outs. I have written a scraper but when I try to show the scraped data via an array, nothing shows up. So what is wrong with my code? <?php ini_set("display_errors", "1"

Analyze robots.txt using Java and identify whether a URL is allowed

I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed. I am pretty sure that jsoup is not made for this and it's all about we
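The question asks about Java and jsoup; purely as an illustration of the check itself, here is a rough sketch in Python using the standard library's urllib.robotparser. The robots.txt content, the crawler name "MyCrawler", and the example.com URLs are all assumptions made up for the example, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text already held in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "MyCrawler" is an assumed user-agent name; it falls under the "*" rules.
allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed, blocked)
```

The same idea carries over to other languages: parse the rules once per host, then consult them before every request.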

Web scraping / scraping

In what ways can we connect to bank sites to scrape data from them? I have referred to this site, for know

How to reach the desired node in the result of xpath?

As mentioned in the question title, I am trying the code below to reach the desired node in the XPath result. <?php $xpath = '//*[@id="topsection"]/div[3]/div[2]/div[1]/div/div[1]'; $html = new DOMDocument(); @$html->loadHTMLFile('http://w

Scrapy: transmitting data between two spiders

I need to create a spider that crawls for some data from a web site. Part of the data is an external URL. I have already created the spider that crawls the data from the root site, and now I want to write the spider for external web pages. I was thinking of

HTML table on an HTML page - no XML

I'm trying to grab data from an HTML table on a website. No XML is involved. <table id="e-cal-table" class="e-cal-table" width="100%"> <tr> <th>Date</th> <th>Time</th> <th>Currency<
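A rough sketch of pulling rows out of such a table with Python's built-in html.parser, so no XML tooling is needed. The TableParser class is a hypothetical helper, and the sample markup below (including the closed third header and the data row) is invented for illustration, since the original table is cut off. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every row in the table with the given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag in ("th", "td") and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_data(self, data):
        # Only text inside a header or data cell is kept.
        if self.in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag in ("th", "td") and self.in_cell:
            self._row.append("".join(self._cell).strip())
            self.in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self.in_table = False


# Hypothetical sample; only the table attributes and first headers come from the question.
SAMPLE = """
<table id="e-cal-table" class="e-cal-table" width="100%">
<tr> <th>Date</th> <th>Time</th> <th>Currency</th> </tr>
<tr> <td>2024-01-02</td> <td>08:30</td> <td>USD</td> </tr>
</table>
"""

parser = TableParser("e-cal-table")
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```

Each row comes back as a list of strings, with the header row first.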

PHP regex to return <option> values

Just wondering if you can help me out a bit with a little task I'm trying to do in PHP. I have text that looks something like this in a file: (random html) ... <OPTION VALUE="195" SELECTED>Physical Chem <OPTION VALUE="239">

Retrieve website comments using Disqus

I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: I realize that CNN uses Disqus for their comment discussion. As the comm

Does threading violate robots.txt?

I'm new to scraping and I recently realized that threading is probably the way to go to crawl a site quickly. Before I begin hacking that out though, I figured it would probably be intelligent to determine whether or not that will end up getting me t

Problem pulling data from a website in .NET and C#

I have written a web scraping program to go to a list of pages and write all the html to a file. The problem is that when I pull a block of text some of the characters get written as '�'. How do I pull those characters into my text file? Here is my c