RSS 2.0

Personal Info:

Joe Send mail to the author(s) works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2008, Joe Duffy

 
 Sunday, August 13, 2006

I've run across several algorithms lately that have benefited from the use of a Bloom filter. This led me to dig up the original paper (Space/time trade-offs in hash coding with allowable errors) in which the idea was proposed. What surprised me a little was that this technique was invented 35 years ago, by a fellow by the name of Burton Bloom, and yet it remains a simple and effective way to speed up a certain category of modern software problems.

The technique summarizes the contents of a set into a concise value (the filter) from which quick answers may be computed. This value is typically generated using some form of hash-coding of the set elements and stored within a bitmap, enabling ultra-fast bitwise updates and queries. This value promises to never return a false negative, although it is permitted to return false positives. So long as the false positive rate is low, many queries that would ordinarily have come up empty handed after a lengthy search will be retired in constant time. As an element is added to the set, the value is updated to reflect its presence. A tricky part of this technique, however, is that, because the value often summarizes the elements using one-bit-to-many-elements, removing an element from the set typically cannot just reset the corresponding bit. As items are removed over time, this can lead to a stale value, an increasing rate of false positives, and a higher number of lengthy searches. Thus the value must be periodically recomputed to combat this problem.

Let's take an example. Say you had a balanced binary tree that was frequently searched, infrequently deleted from, and whose searches more often lead to misses than they do hits. You can expect O(log n) search time. That's not bad, but for large values of n you may have incentive to speed up the common case, which happens to be the worst case for a perfectly balanced binary tree: a search that turns up empty. Using a Bloom filter gives you a slightly modified algorithm whose search complexity is Ω(1) for our problem's best case, and still O(log n) otherwise. This also adds some notable cost due to the need to periodically recompute the filter value, which is an O(n) operation, but since we infrequently delete items, we expect this to pay off in spades.

Say we already had a BinaryTree<T> data structure. We could easily adopt a Bloom filter with a series of minor modifications.

First, a new field to remember the filter's value. I've chosen a 64-bit bitmap for illustration:

long filterValue = 0;

And since we've used a bitmap, we need a routine to calculate any arbitrary element's single bit position. Remember, false positives are OK, and therefore multiple elements may share the same bit:

long GetFilterValueForElement(T e) {
    return 1 << (e.GetHashCode() % (sizeof(long) * 8));
}

When adding an item to the set, we must change the filter's value:

filterValue |= GetFilterValueForElement(e);

Here's the beneficial change. While searching for an item, we can add a quick check up front to speed up queries:

if ((filterValue & GetFilterValueForElement(e)) == 0) {
    // Element not found: we can be assured this is correct.
}
// Perform the existing, lengthier search. might be a false positive.

And of course, we must periodically do the O(n) operation on the tree to recompute the filter. We might do this whenever the tree becomes empty and every n deletions, for example:

if (isEmpty || (++deletionCount % recomputeFilterPeriod) == 0) {
    filterValue = 0;
    foreach (T e in this) {
        filterValue |= GetFilterValueForElement(e);
    }
}

You could even keep track of the number of false positives seen, and use that to determine when to initiate the recomputation.

This technique clearly won't work well in data structures and problems in which the deletion-to-query ratio is high, but there are plenty of situations where this technique helps tremendously. Logs of various forms, for example, exhibit the property that they are added to but never deleted. Depending on the density of the data structure, you may want to use an array instead of a bitmap--or a bitmap with more bits--to represent the summary. I ran a quick test and the simple bit-shifting hash-coding mechanism above yielded a full filter (~0) after only 227 random object allocations. You can of course even decide to reduce the density of your bitmap dynamically by detecting a close-to-full map and upgrading to a larger filter data structure. A dense filter often corresponds to a higher false positives rate, in which case you simply waste time maintaining and checking a value that doesn't give you any benefit.

 

Recent Entries:

Search:

Browse by Date:
<November 2008>
SunMonTueWedThuFriSat
2627282930311
2345678
9101112131415
16171819202122
23242526272829
30123456

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...