The best way to compare IP addresses quickly

I'm parsing two CSV files which contain IP addresses. The first is a source CSV, and the second is a "Blacklist".

Because of the size of the source file, I'm trying to optimize the speed at which I find IP addresses that match the blacklist.

EDIT: The blacklist consists of IP address "blocks". This means that each record in the blacklist has two IP addresses: a start of block (e.g. 216.254.128.0) and an end of block (e.g. 216.254.223.255).

This means that direct lookups and the like will NOT work.

I'm wondering what's the best way to approach this. The brute force method would be:

String[] parts = sourceIP.split("\\."); // String array, each element is text between dots

int hi = 255;
int lo = 0;

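int mid = lo + (hi - lo) / 2; // midpoint of the current [lo, hi] window

if (Integer.parseInt(parts[0]) > mid) {
    lo = mid + 1; // the octet lies in the upper half, so raise the lower bound
}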

I could then repeat this for each part to decide whether or not the IP address is in the black list.

This seems pretty aggressive and with 4k+ records, this could take a very, very long time.

It could take 10+ iterations to decide each part and that would then have to be repeated to check the "High" part of the IP blocks in the blacklist. That's 80+ iterations per record.

I'm hoping to get some input here to see the best method for comparing IP addresses.

What are your thoughts?

Would it be possible to use a quick bitwise mask to compare values rapidly by serializing InetAddress?

FILE STRUCTURE CLARIFICATION:

Source IP File:

Contains a list of records from a database (approx. 4k). Each record contains names, addresses, emails, and an IP address.

Blacklist:

Contains 4.2k records. Each record is an IP Address "Block". This consists of two IP Addresses. 1. Start and 2. End.

If the record in the Source list has an IP address that's found in the blacklist, I need to save that record and add it to a new file.


I assume you're talking about IPv4 addresses of the form xxx.xxx.xxx.xxx.

You can easily convert an IP address into an integer. Each segment (i.e. xxx) is 8 bits (i.e. one byte, holding a value from 0 to 255). So four of them together make a 32-bit integer. So, given an IP address like "192.168.100.12", you can split it into its four parts, parse each one to a byte and create an integer. Say, for example, that you created a byte array of the segments:

ipBytes[0] = 192;
ipBytes[1] = 168;
ipBytes[2] = 100;
ipBytes[3] = 12;

You can turn that into an integer:

int ipAddress = ipBytes[0];
ipAddress = (ipAddress << 8) | ipBytes[1];
ipAddress = (ipAddress << 8) | ipBytes[2];
ipAddress = (ipAddress << 8) | ipBytes[3];

There are more efficient ways to do that, but you get the idea. Your language's runtime library might already have something that'll parse an IP address and give you the bytes to make it an integer.
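Since the question is about Java, here is a minimal sketch of that conversion. The class name Ipv4 and its two methods are made up for illustration. One Java-specific assumption is baked in: Java's int and byte are signed, so the value is kept in a long and each byte is masked with 0xFF. Otherwise any address above 127.255.255.255 would come out negative and compare incorrectly against addresses below it.

```java
final class Ipv4 {
    private Ipv4() {}

    /** Parses dotted-quad text such as "192.168.100.12" into an unsigned 32-bit value held in a long. */
    static long parse(String dotted) {
        String[] parts = dotted.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part); // throws NumberFormatException on non-numeric text
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + dotted);
            }
            value = (value << 8) | octet; // shift previous octets up, append this one
        }
        return value;
    }

    /** Same packing, starting from four raw address bytes (most significant first). */
    static long fromBytes(byte[] raw) {
        if (raw.length != 4) {
            throw new IllegalArgumentException("Expected 4 bytes, got " + raw.length);
        }
        long value = 0;
        for (byte b : raw) {
            value = (value << 8) | (b & 0xFF); // mask to undo Java's sign extension
        }
        return value;
    }
}
```

The fromBytes variant is there because InetAddress.getAddress() returns the address as a byte array in that order, so the library can do the parsing for you if you prefer.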

You have a set of IP address ranges that you want to check your source addresses against. Load each of the ranges into a structure like this:

class IPRange
{
    public int startIp;
    public int stopIp;
}

And store those in an array or list. Then sort the list by starting IP address.

For each source IP address, convert it to an integer and binary search the list on the starting IP address. The source address itself probably won't appear as a starting address, but the search can still be made to end at the last range whose starting IP address is less than or equal to the source address (if there's no such range, the address isn't blacklisted). You then just have to check the source address against that range's ending IP address to see whether it falls inside.
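A rough Java sketch of that lookup follows. The names RangeLookup, IpRange and contains are hypothetical, the addresses are assumed to be already converted to unsigned 32-bit values stored in a long, both ends of a range are treated as inclusive, and the ranges are assumed not to overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RangeLookup {
    /** One blacklist block; start and end are inclusive. */
    static final class IpRange {
        final long start;
        final long end;

        IpRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    private final List<IpRange> ranges;

    /** Copies the input and sorts it by starting address. */
    RangeLookup(List<IpRange> input) {
        ranges = new ArrayList<>(input);
        ranges.sort(Comparator.comparingLong((IpRange r) -> r.start));
    }

    /** Returns true if ip falls inside one of the (non-overlapping) ranges. */
    boolean contains(long ip) {
        int lo = 0;
        int hi = ranges.size() - 1;
        int candidate = -1; // index of the last range seen with start <= ip
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ranges.get(mid).start <= ip) {
                candidate = mid; // possible match, but a later range may start closer to ip
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Only the end of the candidate range is left to check.
        return candidate >= 0 && ip <= ranges.get(candidate).end;
    }
}
```

Build the RangeLookup once from the blacklist file, then call contains for each record in the source file and write out the records for which it returns true.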

Binary search is O(log n). If you're searching a list of 4,300 ranges, it's going to take at most 13 probes to find an address in the array. That should be plenty fast enough, even when doing 4,000 different searches. You're only talking on the order of 50,000 total probes of the range array.

A couple of notes:

First, as I said above, I assume you're talking about IPv4 addresses. If you're talking about IPv6 addresses, the same concepts still apply, but an IPv6 address is 128 bits, so you'll need a 128-bit integer (or a pair of 64-bit integers, or a big-integer type). I don't know enough about IPv6 to say how best to convert the address text to such a value. Probably you should depend on your runtime library to get the address bytes.
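In Java, one possible way to do that is to let InetAddress parse the text and hand the 16 raw bytes to BigInteger, which compares correctly as an unsigned value. This is only a sketch: the class name Ipv6Value is invented, and it assumes the input is always an IPv6 literal (given a host name, InetAddress.getByName would attempt a DNS lookup, and given an IPv4 literal it returns only 4 bytes).

```java
import java.math.BigInteger;
import java.net.InetAddress;
import java.net.UnknownHostException;

final class Ipv6Value {
    private Ipv6Value() {}

    /** Converts an IPv6 literal such as "2001:db8::1" to a non-negative BigInteger. */
    static BigInteger parse(String literal) {
        try {
            byte[] raw = InetAddress.getByName(literal).getAddress();
            // signum 1 makes BigInteger treat the bytes as an unsigned big-endian magnitude
            return new BigInteger(1, raw);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not an IP literal: " + literal, e);
        }
    }
}
```

The resulting values can be sorted and binary searched with compareTo in the same way as the 32-bit case.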

Second: I assume that ranges don't overlap. That is, you won't have something like:

start range    end range
192.168.1.1    192.168.2.255
192.168.2.1    192.168.3.255

If you have that, then an IP address could fall within either of those ranges. You could potentially construct overlapping ranges that would allow addresses to fall through the cracks. If you have overlapping ranges, the problem becomes a little bit more complicated.
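One common way to sidestep the problem, not something the blacklist format requires, is to merge overlapping ranges once after sorting, so the binary search only ever sees disjoint ranges. Without that, a short range nested inside a longer one can hide the longer one: with 1-100 and 50-60, a search for 70 lands on 50-60 and reports a miss. The sketch below is an assumption-laden illustration: RangeMerger is a made-up name, each range is a two-element long array of {start, end} with inclusive ends, and directly adjacent ranges are merged as well.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RangeMerger {
    private RangeMerger() {}

    /** Returns a new list of disjoint ranges, sorted by start, covering the same addresses. */
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1] + 1) {
                // Overlaps or touches the previous range: extend it if needed.
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new long[] {r[0], r[1]}); // copy so the caller's arrays stay untouched
            }
        }
        return merged;
    }
}
```

After merging, the binary search on starting addresses described above works unchanged.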