I'm currently working on a Perl script to gather data from the QuakeLive website. Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.
I tried anchoring the regex on the favorites image, but without success. If it's any use, I'm already using WWW::Mechanize in the script.
I think the problem could be that the paragraphs holding those elements have a class name, whereas the earlier ones had none.
You can find an example profile HERE.
Note that for the previous part of the page, it worked using code like:
$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";
The immediate problem is that you have:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,
my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};
would save you the trouble of having to trim whitespace afterwards.
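To make the first pattern concrete, here is a minimal sketch that runs it against a hard-coded copy of the markup. The `$content` string and the `fav` helper are made up for illustration; in the real script `$content` would come from WWW::Mechanize. The `Game Type` and `Weapon` paragraphs are assumed to use the same markup as the Arena one, which I have not checked against the live page.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hard-coded stand-in for the page source (assumption: the other
# favourites use the same markup as the Arena paragraph).
my $content = <<'HTML';
<p class="prf_faves">
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
<p class="prf_faves">
<b>Game Type:</b> Clan Arena
<div class="cl"></div>
</p>
<p class="prf_faves">
<b>Weapon:</b> Rocket Launcher
<div class="cl"></div>
</p>
HTML

# Hypothetical helper: capture everything after the label up to the
# next '<', then trim trailing whitespace.
sub fav {
    my ($html, $label) = @_;
    my ($value) = $html =~ m{<b>\Q$label\E:</b>\s*([^<]+)};
    return unless defined $value;
    $value =~ s/\s+\z//;
    return $value;
}

for my $label ('Arena', 'Game Type', 'Weapon') {
    my $value = fav($content, $label);
    print "$label: ", (defined $value ? $value : '(not found)'), "\n";
}
```

Using `[^<]+` rather than `\S+` keeps multi-word values such as "Clan Arena" intact, at the cost of the trailing-whitespace trim.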
Note that it is easy for such regex based solutions to be fooled with simple things like commented out snippets in the source. E.g., if the source were to be changed to:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
your script would be in trouble, whereas a solution using an HTML parser would not.
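A quick way to see the failure mode is to feed the regex a commented-out entry that has exactly the same shape as the live one. The markup below is a hypothetical variation on the snippet above (the "Old Arena" comment is invented for the demonstration), so treat it as a sketch rather than something taken from the real page.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical page source: the commented-out entry matches the
# pattern just as well as the real one does.
my $content = <<'HTML';
<p class="prf_faves">
<!-- <b>Arena:</b> Old Arena -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
HTML

# The regex takes the first match, which sits inside the comment.
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
$favarena =~ s/\s+\z//;

print "Regex captured: '$favarena'\n";
```

Here the capture is `Old Arena -->` rather than `Campgrounds`, because the pattern has no notion of what is and is not inside a comment.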
An example using HTML::TokeParser::Simple:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );
# Visit every <p> start tag and keep only those with class "prf_faves".
while ( my $tag = $p->get_tag('p') ) {
    next unless $tag->is_start_tag;
    next unless defined (my $class = $tag->get_attr('class'));
    next unless grep { /^prf_faves\z/ } split ' ', $class;
    $p->get_tag('b');                  # skip ahead to the label
    my $type = $p->get_text('/b');     # e.g. "Arena:"
    my $value = $p->get_text('/p');    # text up to the closing </p>
    $value =~ s/^\s+|\s+\z//g;         # trim both ends
    print "$type $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
And, here is an example using HTML::TreeBuilder:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');
# Collect every <p> whose class list contains "prf_faves".
my @p = $tree->look_down(_tag => 'p', sub {
    return unless defined (my $class = $_[0]->attr('class'));
    return unless grep { /^prf_faves\z/ } split ' ', $class;
    return 1;
});
for my $p ( @p ) {
    my $text = $p->as_text;
    $text =~ s/^\s+|\s+\z//g;    # trim both ends
    my ($type, $value) = split ': ', $text, 2;
    print "$type: $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.
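Since the script already holds the page source in a string via WWW::Mechanize, a parser built on HTML::Parser can read that fragment directly, with no temporary file. The following is a minimal sketch using plain HTML::TokeParser; the `$html` string stands in for `$mech->content`, and the classless "Wins" paragraph with its value is invented purely to show that it gets skipped.

```perl
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser;

# Hard-coded fragment standing in for $mech->content (illustrative only).
my $html = <<'HTML';
<p>
<b>Wins:</b> 10<br />
</p>
<p class="prf_faves">
<img src="none.gif" width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
HTML

# Passing a reference to a scalar makes the parser read from the string.
my $p = HTML::TokeParser->new(\$html) or die "Cannot set up parser";

my %fav;
while ( my $tag = $p->get_tag('p') ) {
    my $class = $tag->[1]{class};    # attribute hash of the start tag
    next unless defined $class;
    next unless grep { $_ eq 'prf_faves' } split ' ', $class;

    $p->get_tag('b');                          # skip ahead to the label
    my $label = $p->get_trimmed_text('/b');    # e.g. "Arena:"
    my $value = $p->get_trimmed_text('/p');    # text up to the closing </p>
    $label =~ s/:\z//;
    $fav{$label} = $value;
}

print "$_: $fav{$_}\n" for sort keys %fav;
```

Because the tokenizer does not insist on a complete, well-formed document, the missing `<html>` and `<body>` wrappers are not a problem here.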