Comparing two CSV files in Java

We need to compare two CSV files. Say the first file has a few rows, and the second file has the same number of rows or more. Most of the rows will be the same in both files. I am looking for the best approach to diff the two files and read only those rows in the second file that differ from the first file. The application processing the files is written in Java.

What are the best approaches for this?

Note: it would be great if we could tell whether a row was updated, inserted or deleted in the second file.

Requirements:

  1. There won't be any duplicate records.
  2. File 1 and file 2 could have the same number of records, with a few rows having updated values in file 2 (records updated).
  3. File 2 could have a few rows removed (these are treated as records deleted).
  4. File 2 could have a few new rows added (these are treated as records inserted).
  5. One of the columns can be treated as the primary key of the record; it won't change between the two files.

One method for doing this would be to use Java's Set interface: read each line as a string, add it to a set, then call removeAll() on the first set with the second set as the argument, which leaves the first set holding only the rows that do not appear in the second. This, of course, assumes that there are no duplicate rows in the files.

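// Using Apache Commons IO's FileUtils to read in the files.
// Needs: java.io.File, java.nio.charset.StandardCharsets, java.util.HashSet,
// java.util.Set, org.apache.commons.io.FileUtils
Set<String> f1 = new HashSet<String>(FileUtils.readLines(new File("file1.csv"), StandardCharsets.UTF_8));
Set<String> f2 = new HashSet<String>(FileUtils.readLines(new File("file2.csv"), StandardCharsets.UTF_8));
f1.removeAll(f2); // f1 now contains only the lines which are not in f2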
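
The snippet above only gives one direction (lines of file 1 that are missing from file 2). Here is a minimal sketch of the same set idea run in both directions, using only the JDK (java.nio.file.Files rather than FileUtils). The class name LineSetDiff and the method onlyIn are made up for illustration, and rows are compared as raw strings, so even a whitespace change counts as a difference.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of any library.
class LineSetDiff {

    // Lines present in 'left' but absent from 'right', kept in their original order.
    static Set<String> onlyIn(List<String> left, List<String> right) {
        Set<String> result = new LinkedHashSet<String>(left);
        result.removeAll(new HashSet<String>(right));
        return result;
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the snippet above.
        List<String> file1 = Files.readAllLines(Paths.get("file1.csv"), StandardCharsets.UTF_8);
        List<String> file2 = Files.readAllLines(Paths.get("file2.csv"), StandardCharsets.UTF_8);

        // Deleted rows, plus the old version of any updated row.
        System.out.println("Only in file 1: " + onlyIn(file1, file2));
        // Inserted rows, plus the new version of any updated row.
        System.out.println("Only in file 2: " + onlyIn(file2, file1));
    }
}
```

Without a key column this cannot tell an updated row from a delete followed by an insert; it only says which lines are unique to each file.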

Update

Okay, so you have a PK field. I'll just assume you know how to get that from your string; use OpenCSV, a regex, or whatever you want. Use a HashMap instead of the HashSet above, with the PK as the key and the row as the value.

// Needs: java.util.ArrayList, java.util.HashMap, java.util.List, java.util.Map
Map<String, String> f1 = new HashMap<String, String>();
Map<String, String> f2 = new HashMap<String, String>();
// read f1, f2; use PK field as the key
List<String> deleted = new ArrayList<String>();
List<String> updated = new ArrayList<String>();
for (Map.Entry<String, String> entry : f1.entrySet()) {
    if (!f2.containsKey(entry.getKey())) {
        // key is gone from file 2
        deleted.add(entry.getValue());
    } else {
        if (!f2.get(entry.getKey()).equals(entry.getValue())) {
            // same key, different row; this adds file 1's version,
            // use f2.get(entry.getKey()) if you want the new one
            updated.add(entry.getValue());
        }
    }
}
for (String key : f1.keySet()) {
    f2.remove(key);
}
// f2 now contains only "new" rows
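
For completeness, here is a self-contained sketch of the same PK-based comparison that also fills in the "read f1, f2" step. Everything in it is an assumption made for illustration: the class name CsvKeyDiff, the file names, the PK sitting in column 0, and above all the naive split on commas, which breaks on quoted fields containing commas (swap in a real CSV parser such as OpenCSV for those). Unlike the snippet above, it reports the file 2 version of an updated row.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
class CsvKeyDiff {
    final List<String> inserted = new ArrayList<String>();
    final List<String> updated = new ArrayList<String>();
    final List<String> deleted = new ArrayList<String>();

    // Index rows by the value in column pkIndex.
    // Assumes plain comma-separated fields with no quoted commas.
    static Map<String, String> index(List<String> rows, int pkIndex) {
        Map<String, String> byKey = new LinkedHashMap<String, String>();
        for (String row : rows) {
            if (row.isEmpty()) {
                continue; // skip blank lines
            }
            byKey.put(row.split(",", -1)[pkIndex], row);
        }
        return byKey;
    }

    static CsvKeyDiff compare(List<String> oldRows, List<String> newRows, int pkIndex) {
        Map<String, String> oldByKey = index(oldRows, pkIndex);
        Map<String, String> newByKey = index(newRows, pkIndex);
        CsvKeyDiff diff = new CsvKeyDiff();
        for (Map.Entry<String, String> e : newByKey.entrySet()) {
            String oldRow = oldByKey.get(e.getKey());
            if (oldRow == null) {
                diff.inserted.add(e.getValue());   // key only in file 2
            } else if (!oldRow.equals(e.getValue())) {
                diff.updated.add(e.getValue());    // same key, row changed
            }
        }
        for (Map.Entry<String, String> e : oldByKey.entrySet()) {
            if (!newByKey.containsKey(e.getKey())) {
                diff.deleted.add(e.getValue());    // key only in file 1
            }
        }
        return diff;
    }

    public static void main(String[] args) throws IOException {
        List<String> file1 = Files.readAllLines(Paths.get("file1.csv"), StandardCharsets.UTF_8);
        List<String> file2 = Files.readAllLines(Paths.get("file2.csv"), StandardCharsets.UTF_8);
        CsvKeyDiff diff = compare(file1, file2, 0); // assumes the PK is the first column
        System.out.println("Inserted: " + diff.inserted);
        System.out.println("Updated:  " + diff.updated);
        System.out.println("Deleted:  " + diff.deleted);
    }
}
```

An identical header line in both files is indexed like any other row and drops out as unchanged. Both files are held in memory, which should be fine for moderately sized files; for very large ones, sorting both by PK and walking them in parallel would be the usual alternative.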