April 18th, 2014

Hash Tables

wicrawl’s discovery engine is getting several features added to it in the near future.  One of them requires something more sophisticated than the “did we see this AP?” BSSID table we currently use, so we have to add a hash table to read and write the new data quickly.
In any decent computer science program, the curriculum is going to include a lot of talk about complexity, because time is money and programs need to run quickly.  90% of the time a CS major thinks about it, they’re classifying an algorithm as either linear or constant-time.  A linear algorithm gets slower depending on how much data is involved, whereas a constant-time algorithm takes about the same time no matter how much data is involved.  Of course, there is a limit: you have to fit all the data into RAM, and it has to take about the same amount of time to read or write any RAM cell in the system.
When you need to keep track of a small number of things in a program — say, the numbers in your cell phone — the easiest thing to do is to tack new entries onto the end of a dynamically allocated list of entries, and to look up entries with a for loop that walks the list looking for the entry in question.  This is fine if you have a small number of entries in the list, like your phone numbers, but once the list starts getting really long, you’re better off looking for a constant-time method.

The usual way to keep track of stuff with a constant-time algorithm is a hash table.  In reality, hash tables are only constant-time on average; if you give them funny input, they can be just as bad as a linear search through the list.  Figuring out this average/worst cast business for a given algorithm is part of the reason why complexity gets its own branch of mathematics.

It’s really easy to teach this topic in a way that makes CS students hate it.  In reality, hash tables and hash functions aren’t very hard, and can be implemented by anyone who understands prime numbers and finite-precision arithmatic (which is where the “32-bit” in your processor’s description comes from).  We’ll discuss the hash strategies used by common network stacks in a later post, then we’ll figure out how to implement a good one to keep track of the per-BSSID data in wicrawl’s discovery engine.

Leave a Response

Imhotep theme designed by Chris Lin. Proudly powered by Wordpress.
XHTML | CSS | RSS | Comments RSS