Unable to install XML::LibXML on Windows

I am trying to use XPath to extract some HTML tags and data, and for that I need the XML::LibXML module.

I tried installing it from the CPAN shell, but it does not install.

I followed the installation instructions on the CPAN site, which say to install the libxml2, iconv and zlib wrappers before installing XML::LibXML, but that did not work either.

Also, if there is any other simpler module that gets my task done, please let me know.

The task at hand:

I am searching for a specific <dd> tag on an HTML page that is really big (around 5000 to 10000 <dd> and <dt> tags). So I am writing a script that matches the content within a <dd> tag and fetches the content within the corresponding (next) <dt> tag.

I wish I could have been a little clearer. Any help is greatly appreciated.


If you are using ActiveState Perl, you should add the repositories listed at ActivePerl 10xx Win32 PPM packages to ppm and then use

ppm install XML::LibXML

Trying to parse HTML as XML is generally not a pleasant task, because real-world HTML is often not well-formed XML. I think HTML::TokeParser is better suited to this task.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

my $p = HTML::TokeParser->new(\*DATA);

my @definitions;

# Walk each <dl>, then pair every <dt> with the <dd> that follows it.
while ( $p->get_tag('dl') ) {
    while ( $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');
        $p->get_tag('dd');    # skip ahead to the matching <dd>
        my $defn = $p->get_trimmed_text('/dd');
        push @definitions, [$term, $defn];
    }
}

use Data::Dumper;
print Dumper \@definitions;

__DATA__
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>

Output:

$VAR1 = [
          [
            'One',
            '1'
          ],
          [
            'Two',
            '2'
          ]
        ];
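
Since the question is about finding one specific entry rather than collecting all of them, the same approach can be narrowed to stop at the first match. The following is only a sketch under a few assumptions: the helper name find_definition is made up for illustration, the page is already in a string, and the lookup goes from a <dt> to the <dd> that follows it, as in the script above (swap the tag names if your page is laid out the other way round).

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): return the text of
# the first <dd> that follows a <dt> whose trimmed text equals $wanted,
# or undef if no such <dt> is found.
sub find_definition {
    my ($html, $wanted) = @_;

    # A reference to a scalar is parsed as a literal document.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    while ( $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');
        next unless $term eq $wanted;

        # Stop at the first match instead of collecting every pair.
        $p->get_tag('dd') or return;
        return $p->get_trimmed_text('/dd');
    }
    return;
}

# Same sample data as in the script above.
my $html = <<'HTML';
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>
HTML

my $defn = find_definition($html, 'Two');
print defined $defn ? "$defn\n" : "not found\n";
```

With the sample data this should print 2. Returning as soon as the term matches avoids scanning the rest of a page that has thousands of <dt> and <dd> tags.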