I am writing my current program in Java and have run into a problem.
I have to parse a big .rdf file (XML format), 1.60 GB in size, and then insert the parsed data into a MySQL server on localhost.
After googling, I decided to use a SAX parser in my code. Many sites recommended the SAX parser over the DOM parser, saying that the SAX parser is much faster.
However, when I executed my code using the SAX parser, I found that my program runs very slowly. A senior in my lab told me that the slowness might come from the file I/O.
In 'javax.xml.parsers.SAXParser', an 'InputStream' is used for file input, which he said could make the job slow compared to using the 'Scanner' or 'BufferedReader' classes.
My questions are: 1. Are SAX parsers good for parsing large-scale XML documents?
My program took 10 minutes to parse a 14 MB sample file and insert the data into MySQL on localhost. Meanwhile, another senior in my lab, who wrote a similar program using a DOM parser, parses the 1.60 GB XML file and saves the data in about an hour.
2. How can I use a 'BufferedReader' instead of an 'InputStream' while using the SAX parser library?
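For reference, the SAX API can consume a character stream: org.xml.sax.InputSource accepts any Reader. A minimal sketch of that wiring, assuming a UTF-8 encoded file and reusing the kt-content.rdf.u8 file name from the code below (the no-op DefaultHandler is just a placeholder):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderParse {
    public static void main(String[] args) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        // InputSource can wrap any Reader, including a BufferedReader
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("kt-content.rdf.u8"), StandardCharsets.UTF_8));
        saxParser.parse(new InputSource(reader), new DefaultHandler());
        reader.close();
    }
}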
This is my first question on Stack Overflow, so any kind of advice would be appreciated. Thank you for reading.
Edit (added after receiving initial feedback): I should have posted my code to clarify my problem; I apologize for that.
package xml_parse;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import java.io.BufferedInputStream;
import java.io.FileInputStream;

public class Readxml extends DefaultHandler {
    Connection con = null;
    String[] chunk; // to check /A/, /B/, /C/ kind of stuff.

    public Readxml() throws SQLException {
        // connect to the local MySQL database
        con = DriverManager.getConnection("jdbc:mysql://localhost/lab_first",
                "root", "2030kimm!");
    }

    public void getXml() {
        try {
            // obtain and configure a SAX-based parser
            SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
            // obtain an object for the SAX parser
            SAXParser saxParser = saxParserFactory.newSAXParser();

            // default handler for the SAX parser;
            // all three callback methods are written in the handler's body
            DefaultHandler default_handler = new DefaultHandler() {
                String topic_gate = "close", category_id_gate = "close",
                        new_topic_id, new_catid, link_url;
                java.sql.Statement st = con.createStatement();

                public void startElement(String uri, String localName,
                        String qName, Attributes attributes) throws SAXException {
                    if (qName.equals("Topic")) {
                        topic_gate = "open";
                        new_topic_id = attributes.getValue(0);
                        // escape apostrophes for the SQL query
                        new_topic_id = new_topic_id.replace("'", "''");
                        if (new_topic_id.contains("International"))
                            topic_gate = "close";
                        if (new_topic_id.equals("") == false) {
                            chunk = new_topic_id.split("/");
                            for (int i = 0; i < chunk.length - 1; i++)
                                if (chunk[i].length() == 1) {
                                    topic_gate = "close";
                                    break;
                                }
                        }
                        if (new_topic_id.startsWith("Top/"))
                            new_topic_id = new_topic_id.replace("Top/", "");
                    }
                    if (topic_gate.equals("open") && qName.equals("catid"))
                        category_id_gate = "open";

                    // add each new link to the "links" table (MySQL)
                    if (topic_gate.equals("open") && qName.contains("link")) {
                        link_url = attributes.getValue(0);
                        link_url = link_url.replace("'", "''"); // escape apostrophes
                        String insert_links_command = "insert into links(link_url, catid) values('"
                                + link_url + "', " + new_catid + ");";
                        try {
                            st.executeUpdate(insert_links_command);
                        } catch (SQLException e) {
                            e.printStackTrace();
                        }
                    }
                }

                public void characters(char ch[], int start, int length)
                        throws SAXException {
                    if (category_id_gate.equals("open")) {
                        new_catid = new String(ch, start, length);
                        // add a new row to the "topics" table (MySQL)
                        String insert_topics_command = "insert into topics(topic_id, catid) values('"
                                + new_topic_id + "', " + new_catid + ");";
                        try {
                            st.executeUpdate(insert_topics_command);
                        } catch (SQLException e) {
                            e.printStackTrace();
                        }
                    }
                }

                public void endElement(String uri, String localName, String qName)
                        throws SAXException {
                    if (qName.equals("Topic"))
                        topic_gate = "close";
                    if (qName.equals("catid"))
                        category_id_gate = "close";
                }
            };

            // BufferedInputStream!!
            String filepath = null;
            BufferedInputStream buffered_input = null;

            /*
            // Content
            filepath = "C:/Users/Kim/Desktop/2016여름/content.rdf.u8/content.rdf.u8";
            buffered_input = new BufferedInputStream(new FileInputStream(filepath));
            saxParser.parse(buffered_input, default_handler);

            // Adult
            filepath = "C:/Users/Kim/Desktop/2016여름/ad-content.rdf.u8";
            buffered_input = new BufferedInputStream(new FileInputStream(filepath));
            saxParser.parse(buffered_input, default_handler);
            */

            // Kids-and-Teens
            filepath = "C:/Users/Kim/Desktop/2016여름/kt-content.rdf.u8";
            buffered_input = new BufferedInputStream(new FileInputStream(filepath));
            saxParser.parse(buffered_input, default_handler);
            System.out.println("Finished.");
        } catch (SQLException sqex) {
            System.out.println("SQLException: " + sqex.getMessage());
            System.out.println("SQLState: " + sqex.getSQLState());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This is the whole code of my program.
My original code from yesterday did the file I/O the following way (instead of using a 'BufferedInputStream'):
saxParser.parse("file:///C:/Users/Kim/Desktop/2016여름/content.rdf.u8/content.rdf.u8",
        default_handler);
I expected some speed improvement after switching to a 'BufferedInputStream', but the speed didn't improve at all. I am having trouble identifying the bottleneck causing the speed issue. Thank you very much.
The .rdf file being read in this code is about 14 MB in size, and it takes my computer about 11 minutes to execute this code.
Are SAX parsers good for parsing large-scale XML documents?
Yes, SAX and StAX parsers are clearly the best choices for parsing big XML documents, as they consume little memory and CPU. That is not the case for DOM parsers, which load the entire document into memory and are clearly the wrong choice here.
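For comparison, a StAX parser pulls the document one event at a time, much like SAX but under your control. A minimal skeleton, reusing the Topic element name and file name from the posted code:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSkeleton {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new FileInputStream("kt-content.rdf.u8"));
        while (reader.hasNext()) {
            // next() advances the cursor and returns the event type
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "Topic".equals(reader.getLocalName())) {
                // read attributes/text here, one event at a time,
                // without ever holding the whole document in memory
            }
        }
        reader.close();
    }
}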
Response update: Regarding your code, to me your slowness issue is related more to how you store your data in the database. Your current code executes its queries in auto-commit mode, while you should use transactions for better performance, since you have a lot of data to insert; read this for a better understanding. To reduce the round trips between the database and your application, you should also consider using batch updates, like in this good example.
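A minimal sketch combining both ideas, assuming the links table from the posted code (the batch size is an arbitrary choice, fetchParsedRows() is a hypothetical stand-in for the values your SAX callbacks collect, and the credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsertSketch {
    static final int BATCH_SIZE = 1000; // arbitrary; tune for your data

    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/lab_first", "root", "password");
        con.setAutoCommit(false); // transactional mode: no commit after every insert

        // a PreparedStatement also makes the manual apostrophe escaping unnecessary
        PreparedStatement ps = con.prepareStatement(
                "insert into links(link_url, catid) values(?, ?)");
        int count = 0;
        for (String[] row : fetchParsedRows()) { // (link_url, catid) pairs
            ps.setString(1, row[0]);
            ps.setString(2, row[1]);
            ps.addBatch();
            if (++count % BATCH_SIZE == 0)
                ps.executeBatch(); // one round trip for the whole batch
        }
        ps.executeBatch(); // flush the remaining rows
        con.commit();      // commit the whole run at once
        ps.close();
        con.close();
    }

    // hypothetical stand-in for the data gathered by the SAX handler
    static List<String[]> fetchParsedRows() {
        return java.util.Collections.emptyList();
    }
}

In your handler, that would mean calling ps.addBatch() from startElement()/characters() instead of st.executeUpdate(), then calling executeBatch() and commit() once saxParser.parse() returns.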