November 26th, 2014

Hash Tables 2

As previously discussed, wicrawl uses a custom hash table to keep track of who we hear talking on the radio while we’re sniffing. 
In the 802.11 spec, each radio transmitter on the medium has an EUI-48 address which it uses to identify itself when transmitting.  When we see one, we use the Jenkins hash to squash the 6 bytes of hardware address down to 4, since wicrawl usually runs on 32-bit computers and it’s a lot faster to deal with 32-bit numbers than 48-bit ones.
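The post doesn’t show the actual routine, so as a sketch, here’s Jenkins’s well-known “one-at-a-time” variant applied to a 6-byte MAC (the kernel-derived routine wicrawl uses mixes whole words at a time, but the shape of the idea is the same: fold all 48 bits down into one 32-bit word):

```c
#include <stdint.h>
#include <stddef.h>

/* Jenkins's "one-at-a-time" hash, squashing a 6-byte EUI-48 address
 * down to a 32-bit value.  An illustrative stand-in, not the exact
 * routine from the wicrawl source. */
static uint32_t mac_hash(const uint8_t mac[6])
{
    uint32_t h = 0;
    for (size_t i = 0; i < 6; i++) {
        h += mac[i];        /* fold in one byte */
        h += h << 10;       /* mix */
        h ^= h >> 6;
    }
    h += h << 3;            /* final avalanche */
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```

Every step after the byte is folded in is invertible, so two MACs that differ anywhere end up with well-scattered 32-bit results.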

The key to the hash table is modulo arithmetic, which we use to turn the hash function’s output into a table index.  Remember how to do division?  With whole numbers, the result of a division has two parts: the quotient and the remainder.  With modulo arithmetic, we throw away the quotient and keep the remainder.  Here’s how this works in the discovery engine: say the hash table has 8 slots.  If the hash of some MAC address is 17, we divide 17 / 8 = 2 remainder 1.  Since we throw away the quotient, we use only the remainder, 1, to index the table.
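In code, the whole trick is one remainder operation (the helper name here is invented, not from the wicrawl source):

```c
/* Map a hash value to a table slot by keeping only the remainder
 * of the division.  Illustrative helper, not the actual code. */
static unsigned index_for(unsigned hash, unsigned table_size)
{
    /* e.g. 17 / 8 = quotient 2 (discarded), remainder 1 (the slot) */
    return hash % table_size;
}
```

When the table size is a power of two, `hash & (table_size - 1)` computes the same remainder without a divide, which is one reason hash tables tend to grow by doubling.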

This is great, because we don’t have to search the whole table to find the entry for our MAC address.  No matter how big the table gets, it takes the same number of steps to figure out what index our data is at: it’s a constant-time search algorithm.  

Now, two different MAC addresses can land on the same table index (either because their hashes match, or because different hashes leave the same remainder), and we call this a collision.  In the discovery engine, we deal with this by hanging a linked list of database entries off of each hash table slot.  With the default hash table settings, you can scan up to about a thousand APs before these chains average more than a few entries.  If people start reporting workloads like that, we can implement the next level of scalability: doubling the size of the hash table.  Done naively, this is a time-consuming operation, because every entry has to be re-inserted into the new table, which can hang the discovery engine for several milliseconds.  That doesn’t sound like much, but remember that wardrivers sometimes have only a second or two of bidirectional connectivity to the APs they’re scanning, and any delay hurts their ability to crawl the APs they see.
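A minimal sketch of that chaining scheme (all names are hypothetical, not the actual wicrawl structures, and the hash here is a simple stand-in for the Jenkins routine):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8              /* small, for illustration */

/* Hypothetical names -- not the actual wicrawl structures. */
struct ap_entry {
    uint8_t mac[6];
    struct ap_entry *next;      /* chain of entries sharing a slot */
};

static struct ap_entry *buckets[NBUCKETS];

/* Stand-in for the Jenkins hash: any 32-bit hash of the MAC works. */
static uint32_t mac_hash(const uint8_t mac[6])
{
    uint32_t h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h;
}

/* Constant-time bucket selection, then a short walk down the chain. */
static struct ap_entry *find_or_add(const uint8_t mac[6])
{
    uint32_t idx = mac_hash(mac) % NBUCKETS;

    for (struct ap_entry *e = buckets[idx]; e; e = e->next)
        if (memcmp(e->mac, mac, 6) == 0)
            return e;                   /* already known */

    struct ap_entry *e = calloc(1, sizeof *e);
    memcpy(e->mac, mac, 6);
    e->next = buckets[idx];             /* push onto the chain */
    buckets[idx] = e;
    return e;
}
```

The lookup stays effectively constant-time as long as the chains stay short, which is exactly why the table eventually has to grow.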

The solution is to grow the hash table in the background: migrate the existing entries as time permits, and add any new entries to both the old and new tables as they come in.  When the migration finishes and the insert queue goes empty, you do a pointer swap to make the new table the “official” table, and delete the old one in the background.
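Here’s a single-threaded sketch of that dance, keyed by raw 32-bit values to keep it short.  Every name is invented for illustration, and the real version would run the migration and the teardown on a background thread rather than inline:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy chained table keyed by 32-bit hash values. */
struct node   { uint32_t key; struct node *next; };
struct htable { size_t size; struct node **buckets; };

static struct htable *active;   /* the "official" table readers use  */
static struct htable *growing;  /* non-NULL while a grow is in flight */

static struct htable *ht_new(size_t size)
{
    struct htable *t = malloc(sizeof *t);
    t->size = size;
    t->buckets = calloc(size, sizeof *t->buckets);
    return t;
}

static void ht_insert(struct htable *t, uint32_t key)
{
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = t->buckets[key % t->size];
    t->buckets[key % t->size] = n;
}

static int ht_lookup(const struct htable *t, uint32_t key)
{
    for (const struct node *n = t->buckets[key % t->size]; n; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}

/* While a grow is in flight, new entries land in both tables. */
static void insert(uint32_t key)
{
    ht_insert(active, key);
    if (growing)
        ht_insert(growing, key);
}

static void start_grow(void)
{
    growing = ht_new(active->size * 2);
}

/* Stand-in for the background migration of pre-existing entries. */
static void migrate(void)
{
    for (size_t i = 0; i < active->size; i++)
        for (struct node *n = active->buckets[i]; n; n = n->next)
            if (!ht_lookup(growing, n->key))
                ht_insert(growing, n->key);
}

/* The pointer swap: the new table becomes official. */
static void finish_grow(void)
{
    struct htable *old = active;
    active = growing;
    growing = NULL;
    (void)old;  /* would be freed in the background; leaked for brevity */
}
```

The swap itself is a single pointer assignment, so the discovery engine never stalls for the milliseconds a stop-the-world rehash would cost.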

As an aside, I grabbed the implementation we used from the Linux kernel.  In there, the hash routine is used for the Linux kernel’s rather nice neighbor table facility that’s available to any network protocol that needs to keep track of peers (which is basically all of them).  Other BSD-derived stacks like NT and Solaris have the IP stack married to the Ethernet part of things, so the code to keep track of peers isn’t easily available to be used by other protocols you might want to run.  BSD zealots might say that the fully-factored approach Linux uses is slower, but the numbers don’t seem to back this up.



