I have two sentences that needed to be compared to each-other. The final result is how much percent one sentence contains in the other, my problem is that I have 100.000 records that need to be compared with lets say another 10. That is 1.000.000 loops, which in my algorithm is very slow.
This is the algorithm that I am using:
private double BreakStringsAndCheck(string s1, string s2)
{
if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
return (double)0;
string[] firstArray = s1.Split(' ');
string[] secondArray = s2.Split(' ');
if (firstArray.Length > secondArray.Length)
{
string[] tempArray = firstArray;
firstArray = secondArray;
secondArray = tempArray;
}
double value = 0;
for (int i = 0; i < firstArray.Length; i++)
for (int j = 0; j < secondArray.Length; j++)
value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;
return findLongest ? value : value / firstArray.Length;
}
It's a small method but it is not very fast. From my testing I can do 40-60 comparisons in 1 second, that is almost 5 hours for 1.000.000 loops.
Can some one think of another method or logic that is much faster than this?
Update:
I will try to explain the problem with more details. I have database with more than 100.000 records, and every day I insert, and compare 10-20 new record in this database. This records are sentences from 2 to 10 words and I need to write fast method that will compare this new records with those in database, the result should be percentage of how much one sentence contains the words from the other.
I need the records that has more than 70% word match.
I hope that I'm clear now.
I'm not a C# programmer, but here are a few general tips:
- Move the floating point arithmetic out of the loop. You should be able to count the characters that match and do the division later.
- You should be able to run each "long" loop in a separate thread of execution since the data is static. I would spawn a separate thread for each of your "10" sentences and run them in parallel.
- You might want to remove the call to
split
if you can. Basically, remove any extra memory allocations.
The final thought is to grab an algorithms book or google for text processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP v3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.