Perl Script, Web Scraper

I am new to the Perl language and have this script which scrapes the Amazon website for reviews. Every time I run it I get a compilation error. I was wondering if someone could shed some light on what's wrong with it.

#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and write to a file
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;

# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;

# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)[RETURN]
\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

    my($rating,$title,$date,$reviewer,$review) = [RETURN]
    ($1||'',$2||'',$3||'',$4||'',$5||'');
    $reviewer =~ s!<.+?>!!g;     # drop all HTML tags
    $reviewer =~ s!\(.+?\)!!g;   # remove anything in parentheses
    $reviewer =~ s!\n!!g;        # remove newlines
    $review =~ s!<.+?>!!g;       # drop all HTML tags
    $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape HTML entities

    # Print the results
    print "$title\n" . "$date\n" . "by $reviewer\n" .
          "$rating stars.\n\n" . "$review\n\n";

}


The compilation errors are caused by the "[RETURN]" markers that appear twice in your code. They look like line-continuation symbols from wherever the listing was copied; they aren't valid Perl, so the parser chokes on them. When I removed them and joined each statement back onto a single line, the code compiled without problems.
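For reference, this is roughly what those two statements look like with the markers removed and the lines rejoined (the patterns themselves are unchanged from your post):

while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

    my($rating,$title,$date,$reviewer,$review) = ($1||'',$2||'',$3||'',$4||'',$5||'');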

Amazon doesn't really like people scraping its web site, which is why it provides an API that gives you access to its content. There's also a Perl module for using that API, Net::Amazon. You should use that instead of fragile web-scraping techniques.
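As a rough sketch of what that looks like (the token and secret key below are placeholders for your own Amazon Web Services credentials, and exactly which review data comes back depends on what Amazon exposes through the API), a basic item lookup with Net::Amazon goes something like this:

#!/usr/bin/perl
use strict;
use warnings;
use Net::Amazon;

# Take the asin from the command line, as in your script
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Placeholder credentials - sign up with Amazon Web Services to get your own
my $ua = Net::Amazon->new(
    token      => 'YOUR_AMZN_TOKEN',
    secret_key => 'YOUR_AMZN_SECRET_KEY',
);

# Look up a single item by its ASIN
my $response = $ua->search(asin => $asin);

if ($response->is_success) {
    print $response->as_string(), "\n";   # dump the item data returned by the API
} else {
    print "Error: ", $response->message(), "\n";
}

That gives you structured data back from Amazon instead of HTML you have to pick apart with regexes, so it won't break every time they change their page layout.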