I am working on a data-mining project for which I need to analyse how the discussion progresses in a forum thread. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, joining date, etc.), and the text of the post.
However, when using standard scraping tools (like Scrapy in Python), I need to write regular expressions to detect these fields in the page's HTML source. Because the tags vary with the type of forum, maintaining regular expressions for every forum is becoming a major problem. Is there a standard bank of such regular expressions that can be used based on the type of forum?
Or is there any other technique to extract these fields from a forum's page?
I wrote some configuration files for some major forums. I hope you can decipher them and infer how to do the parsing.
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
enclosed_section is the element (a table in these configurations) that contains the links to all the threads.
thread is where you'll find the link to each thread.
list_next_page is the link to the next page of the list of threads.
post is the div with the post text.
thread_next_page is the link to the next page of the thread.
enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
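To make the rule format more concrete, here is a minimal sketch of how one of these lines might be read and applied. It rests on assumptions inferred from the configurations above, not on a documented format: fields are separated by commas, `attributes:name;value` means "the attribute `name` must equal `value`", and a value starting with `RE` is treated as a regular expression matched from the start of the attribute. The names `parse_rule`, `attr_matches`, `RuleMatcher` and `find_tags` are hypothetical, and the sketch uses Python's standard `html.parser` module rather than Scrapy. Rules of the form `type:next_page` and the `reg_exp` and `inside_tag_attribute` fields are parsed but not applied here.

```python
import re
from html.parser import HTMLParser


def parse_rule(line):
    """Split 'name=key:value,key:value' into (name, {key: value}).

    Assumption: commas only separate fields, so a reg_exp value
    containing a comma would not survive this simple split.
    """
    name, _, spec = line.strip().partition("=")
    fields = dict(part.partition(":")[::2] for part in spec.split(","))
    return name, fields


def attr_matches(rule, attrs):
    """Check the 'attributes:name;value' constraint against a tag's attributes."""
    attr_name, _, wanted = rule.get("attributes", "").partition(";")
    if not attr_name:
        return True  # empty 'attributes:' means no constraint
    actual = attrs.get(attr_name) or ""
    if wanted.startswith("RE"):
        # Assumed meaning of the RE prefix: regex anchored at the start.
        return re.match(wanted[2:], actual) is not None
    return actual == wanted


class RuleMatcher(HTMLParser):
    """Collect the attributes of every start tag that satisfies one rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.rule.get("tag") and attr_matches(self.rule, attrs):
            self.matches.append(attrs)


def find_tags(rule, html):
    matcher = RuleMatcher(rule)
    matcher.feed(html)
    return matcher.matches


# Made-up HTML fragment, shaped like the first configuration expects.
html = """<table id="threadslist">
<a id="thread_title_101" href="/t/101">First thread</a>
<a id="thread_title_102" href="/t/102">Second thread</a>
<a id="other" href="/about">About</a>
</table>"""

name, rule = parse_rule("thread=tag:a,attributes:id;REthread_title_")
links = [m["href"] for m in find_tags(rule, html)]
print(name, links)  # thread ['/t/101', '/t/102']
```

With this layout, supporting a new forum means writing a new set of rule lines rather than new scraping code, which seems to be the point of keeping the selectors in configuration files.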