RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Monday, January 08, 2007

I will be giving a PLINQ talk at the Declarative Aspects of Multicore Programming (DAMP) workshop at POPL next week:

[Update: 1/23/07 -- added a link to the slide deck below.]

PLINQ: A Query Language for Data Parallel Programming

Microsoft’s language integrated query (LINQ) technology provides an expressive syntax and suite of APIs which facilitate classic data parallel operations: filtering, mapping, reductions, loop tiling, sorts, nested loops parallelism, prefix scans, and more.  In this talk, we look at a new set of extensions to the LINQ technology, called parallel LINQ (PLINQ), which automatically optimizes and parallelizes query operations based on dynamic runtime information.  We believe that the use of a familiar SQL-like syntax will broaden the reach of PLINQ in industry, and provides programmers with a more declarative and flexible way of expressing data-intensive computations.

Download presentation (PPT).

The workshop is being chaired by Guy Blelloch from CMU, is located in Nice, France, and features some interesting recent research in the field of parallel programming.  If anybody will be in town and wants to meet up, please drop me a line at joedu AT microsoft.

1/8/2007 1:09:41 PM (Pacific Standard Time, UTC-08:00)  #   

 Sunday, January 07, 2007

Jim Larus, a Microsoft colleague of mine, recently co-authored a book on Transactional Memory with Ravi Rajwar from Intel, one of the most prominent authorities in the TM community.  It just became available.  You can purchase and download it online at Morgan & Claypool's website: Transactional Memory (Synthesis Lectures on Computer Architecture).  The series of which it is part is new, but there are at least a few other great books in its pipeline, including a CMP (chip multi-processor) architecture book by Kunle Olukotun, from Stanford and an architect of Sun's Niagara processor.

1/7/2007 12:12:42 PM (Pacific Standard Time, UTC-08:00)  #   

 Saturday, December 30, 2006

I’m a big food addict. When I first moved to the Seattle area from Boston about three years ago, I was initially quite depressed. I expected that Seattle’s food scene would be totally bland in comparison to Boston’s, and NYC was no longer a “quick” drive away (for the occasional binge). Sadly, I still can’t say that most of Seattle’s top restaurants come even close to NYC, but it’s much better than I expected. Many of the better restaurants are right up there with Boston’s. The selection is much smaller, sure, but after a few years I’ve settled on some favorites. I’ve listed these below in approximate order of my most to least favorite.

(Note that the list isn’t complete. I’ll be updating it as time goes by.)

For anybody who recently moved to the area, is visiting, or has been here for a while and hasn’t checked out the food scene, this list might come in handy when making a choice. I even gave each five 1-5 star ratings—overall experience, food taste, food presentation, wine list, value, and service, just like a famous food critic would. (My descriptions, on the other hand, make it quite apparent that I’m not one.) Overall experience is not just the average of the other four, it’s really just how I rate the restaurant’s dining experience in general. A bit arbitrary, yes. I’ve also marked each with a rough guideline on price/person: $ is $0-$20, $$ is $20-$50, $$$ is $50-$100, $$$$ is $100-$200, and $$$$$ is >$200.

All this list really indicates is my personal likes and dislikes; grab a Zagat’s Guide for more detailed and thorough ratings. You also might want to check out tastingmenu.com. It’s a wonderful site with very detailed essays on individual meals and beautiful photographs. Everybody has their own personal taste, and I happen to disagree with some of their ratings, but that’s to be expected.

5 STARS OVERALL

Mistral
http://mistralseattle.com/
New French -- Seattle, WA (Belltown)
Food taste: 5
Food presentation: 5
Wine list: N/A (pairings)
Service: 3
Price: $$$$$

Mistral rivals some of the best restaurants in the US, and certainly beats out all others I’ve eaten at in Seattle, including the Herbfarm. This is pure foodie paradise and a great way to spend 5 hours. Dishes are elegantly and classically presented, but with many new-age molecular gastronomy components and techniques weaved seamlessly throughout. While the food is clearly French-heavy, most dishes were minimalist and surprisingly light, with judicious use of cream and butter.

I recommend the Mistral Experience, which is 9 courses and 6 (or so) paired wines. I actually ended up with 8 or 9 glasses, due to a complimentary glass of Champagne, an extra glass of dessert wine, and a few surprises tossed into the mix. Mistral offers only tasting menus, no a la carte, which surprisingly caused an entire six top to get up and leave(!) the first time I were there. Perhaps it was the price, which somehow ended up at $600 or so for two people. It was well worth it, though. The chef had absolutely no problem creating a 9 course vegetarian menu on the fly. And yes, I always ask in advance about this when I make reservations, as I would feel rude dropping in to a tasting menu-only restaurant and expecting them to completely rearrange the menu to be vegetarian friendly. Most good restaurants don’t mind.

I was at first taken aback by Mistral’s very simple interior design. It’s a pale white with hardwood floors, 12 or so tables, and subtle and warm candle lighting. The ceilings are very high, and there are only a couple windows, aside from the front glass. The result is an elegant, cozy, and quaint atmosphere, but also feels a little like you’re at a cocktail party in some chic art studio located in a dark LA alleyway, with a décor which hasn’t yet been filled out. After a few courses, the initial shock passed, and I eventually found it calming.

I have to admit the wait-staff took me off guard. The Maître d' was aloof and seemingly 100% clueless, and our waiter was a bit aggressive. To be honest, I was completely put off by his demeanor and surfer-like descriptions of the food, wine, etc. (duuuude). But to be fair… after a few courses, it became evident why he works there. He was absolutely passionate about the food and wine, and was oozing with detail after detail about the cuisine. He had a story to tell about every dish and every wine. In the end, he gave a different, but good, experience compared to what I was expecting. In a phrase, decidedly unpretentious and food geeky. The 3 rating for service is primarily because of the Maître d', not him.

Sadly I don’t tend to take very good notes when I dine out. Here are some things that stand out in my mind. Robuchon-like cauliflower soup, with nice European (high fat content) butter. Ferran Adria-like carrot and cucumber foams, without feeling ridiculously out of place. Sous-vide hamachi tuna, which was absolutely delicious and unexpected. Sous-vide as a cooking technique creates an odd meaty texture for fish (particularly tuna), but cooks it perfectly and evenly throughout. Plenty of exotic fishes and meats, mallard duck, foie gras. Fingerling potato puree with chive oil. Many, many fruit reductions, including pomegranate and quince. There were two desserts, one of which was a bursting(!) with vibrant flavor passion fruit sorbet, and tasted, well, like I had just crushed open and slurped the innards of a passion fruit. It had just the right level of pleasing sourness, and thankfully wasn’t enhanced with any sweetener at all. Desserts like maple ice cream, little hand-made cookies, and pastry. Everything had a place on the plate and wandered only a little bit, keeping the palate sufficiently amused.

The paired wines are perfectly selected, jumping all over the map, and don’t feel like just an afterthought. Pace was perfect. Given the size of the meal and starting time (9 courses, starting at 7pm), you might expect to have been a little rushed, but I wasn't. And the portions were just the right size; at the end, I was no more full than if I had eaten a typical 4-course dinner at a slightly less extravagant restaurant.

Rover’s
http://www.rovers-seattle.com/
Classic French -- Seattle, WA (Madison Park)
Food taste: 5
Food presentation: 4
Wine list: 5
Service: 4
Price: $$$$$

Rover’s cranks out classic French cuisine like no other Seattle restaurant. What makes Rover’s truly special, aside from its perfect execution and taste, are that dishes are typically locally-inspired, highlighting seasonal, fresh ingredients. All of this integrated into a truly French menu that whisks you away to Paris. OK, sure they always have great diver scallops, foie gras, truffles, … and I’m not sure how much of this is from Washington and Oregon, but these focal points are at least augmented with seasonal and regional flavors. This isn’t country style comfort food… it’s high end, small-to-medium sized dishes, but with plenty of cream, heavy butter, and warm flavors.

Go for the 8-course tasting menu, I say. You’ll feel it more than most 8-course tasting menus due to the heavy French flavors, but you won’t regret it. The wine list rocks and will keep you occupied hunting for a bottle for a couple days if you’re into that kind of thing, although you’ll have a tough time finding a bottle for under $200. Rover’s never disappoints. It puts all other classic French restaurants in Seattle to shame (of which there are very few, of course).

4 STARS OVERALL

The Herbfarm
http://www.herbfarm.com/
New American --Woodinville, WA
Food taste: 4
Food presentation: 4
Wine list: 5
Service: 4
Price: $$$$$

I’ve been here many, many times, and I’ve never been disappointed. They have only one seating per night of about 100 people (just an approximate guess), and the atmosphere is very much like going our for a night at the theatre. The menu is completely prix fixe, no a la carte. It makes for a nice, 5 hour-long evening out. If you show up 30 minutes early, you get a tour of the herb garden. After doing this a couple times, it gets a little old (since the script doesn’t change at all), but you do get to munch on some really fresh herbs to whet the appetite.

What truly makes the Herbfarm great is the ever-changing, seasonal, and locally inspired menu. If you check out their website, you’ll notice the menu rotates every two weeks or so, featuring locally available and seasonally fresh ingredients. The Mycologist’s Dream is our favorite (here’s a sample menu), featuring dozens of varieties of mushrooms, often foraged the morning before the meal from all over the state of Washington. And it’s great for vegetarians.

The food is squarely in the New American category. The dishes are very good and usually vibrantly seasoned with fresh herbs. The simple, light dishes are definitely what they do best—e.g. consommés, poached fish, simple mushroom dishes—but when things get heavy, the herbs tend to get lost in the mix and the result is usually good, but not special. I must admit that I find the style to be borderline “tired,” but that’s probably a reflection of how I feel about New American restaurants in general as I write this (you'll notice there are quite a few in Washington). Because the Herbfarm's specialty is locally produced, seasonal ingredients, the menu style can fluctuate quite a bit: without these light, herby dishes, it’s hard to find a common theme from one visit to the restaurant to the next. As a result you don't really go back craving your favorite dish as you would many other restaurants.

The paired wine selections are usually good, but sometimes not stellar. They have a massive and wonderful winelist which due to the paired wines you seldom get to explore. But in addition to the 5 paired wines, you can order others from their list. Last time, I ordered a tasty flight of Madeiras, ranging from 1890 to 1960.

The Harvest Vine
http://www.harvestvine.com/
Spanish / Tapas -- Madison Park, WA
Food taste: 5
Food presentation: 4
Wine list: 3
Service: 3
Price: $$$

I’ve died and gone to… Basque, Spain! This teensy little restaurant seats probably 40 people total, 20 seats of which are available at or around the tapas bar. They always keep those seats available for walk-ins, so you can usually get lucky on a weeknight (or at the right time on weekends; I’m not telling, sorry :) ).

Foods range from pure Spanish comfort dishes, like rustic cheese, three kinds of olives prepared in three different marinades, fried piquillo peppers with sea salt, fresh white anchovies, delicious, oozing with fat Serrano ham, blood sausages, and chorizo to new-age, but Spanish-inspired, treats like foie gras, monk fish liver, squab, ox, and scallops. The wine list is good, though not great, featuring almost exclusively Spanish and Portuguese wines. I’m salivating for a glass of Jerez right now (a fortified dessert wine, made the same way as Sherry, in fact Sherry is just the English word for Jerez (similar to how Sherry is Xérès in French)).

The Harvest Vine is always a treat.

Café Juanita
http://www.cafejuanita.com/
Northern Italian -- Kirkland, WA
Food taste: 4
Food presentation: 4
Wine list: 3
Service: 3
Price: $$$$

This is one of the best Northern Italian restaurants I’ve ever eaten at. And this is having eaten at plenty of Italian restaurants in the Boston area which is famous for its Italian North End neighborhood. There’s always a good mix of hearty and light dishes, with plenty of exotic meat, foul, and fish, spanning things like wagyu beef, prosciutto, braised octopus, boar, venison, foie gras, veal sweetbreads, …, the list just goes on and on. Just about every time I’ve gone, they have had a seasonal Mediterranean fish, usually baked with a simple ragout of herbs, olives, capers, and served whole. Last week it was Branzino (a.k.a. spigola). I’ve never been disappointed. They have just an OK selection of vegetarian dishes and salads.

They also have some wonderful cheeses, which you can often get paired with an aperitif (like grappa, a balsamic vinegar martini, etc.). The desserts are also very good. I’ll never forget the olive oil ice cream that I had there last winter: surprisingly delicious! Service is spotty, sometimes great, sometimes just OK, really depending on the server. Timing is usually very good but sometimes off. The décor is pretty bland, but as you can tell by my lack of a décor rating for the restaurants, that kind of thing doesn’t really put me off that much so long as the food knocks me out.

Lark
http://www.larkseattle.com/
New American -- Seattle, WA (Capitol Hill)
Food taste: 5
Food presentation: 4
Wine list: 2
Service: 3
Price: $$$

Lark is a new favorite of mine, and one at which I find myself at least one night every other week. Like the Harvest Vine, the restaurant is very walk-in friendly and thus conducive to impulsive dining out decisions.

The food at Lark is simple and just a tiny bit bigger than tapas sized dishes. This is classic New American food, but with some interesting twists and inventive ingredient combinations. A couple weeks back I had foie gras terrine with quince jam; they frequently have fresh, seasonal mushrooms sautéed delicately with just a little herb (like parsley) and sea salt; yellowtail carpaccio; always a wonderful squab or pheasant dish, usually with something like an apple frisee or nut compote. They also have some really great classics. You can’t go wrong with some cheese, of which they have a great selection, served with fresh, local honeycomb. Rosti potatoes with clabber cream, pommes de terre “Robuchon”, gnudi, and any of their salads are always a treat.

I have two gripes about Lark. First is that the winelist is pretty much crap. Thankfully I’m usually in on a weeknight, which means I’m up for at most a glass or two of something, in which case their by-the-glass list isn’t too far off the mark. But if you’re looking for a bottle, I’d recommend bringing your own instead. The second is timing. At Lark they tend to assume that you’re going to share a lot of the dishes. When I order things in a particular order, saying clearly “as the first course”, “as the main course”, etc., I am saying this for a reason. Pay attention! The food more than makes up for both problems, thankfully.

3 STARS OVERALL

Crush
New American -- Seattle, WA (Capitol Hill)
Food taste: 3
Food presentation: 4
Wine list: 3
Service: 3
Price: $$$$

Crush is a strong, solid, very good restaurant. I can’t honestly say that the food is great enough to put it up into the 4 star category, however, mostly because the composition of ingredients and flavors is at the same time predictable and over the top. Many dishes have way too much going on, but then again many are simple and delicious. Particularly given the interior décor, which consists of plasticy white chairs located in a white-and-black decorated old house, and the difficulty in getting reservations, I get the impression they are trying to appeal to the non-foodie chic crowd a little bit too much.

I tremendously enjoyed the sautéed Hudson Valley foie gras with warm huckleberries on a slice of pumpernickel bread. It had huge, warm flavors and awesome fat content. The sea scallops and Maui onion confit was also terrific. When I think of Crush, I think of consistently good New American food where you can’t go wrong. I don’t think I’ve had a dish I didn’t like there, although their portions are often a little bit much for me.

Il Terrazzo Carmine
http://ilterrazzocarmine.com/
Italian -- Seattle, WA (Pioneer Square)
Food taste: 3
Food presentation: 3
Wine list: 4
Service: 2
Price: $$$$

Restaurant Zoë
http://www.restaurantzoe.com/
New American -- Seattle, WA (Belltown)
Food taste: 3
Food presentation: 3
Wine list: 2
Service: 3
Price: $$$

Café Campagne
http://www.campagnerestaurant.com/cafe_home.html
French (Bistro) -- Seattle, WA (Pike Place)
Food taste: 4
Food presentation: 3
Wine list: 2
Service: 2
Price: $$$

Dahlia Lounge
Pacific Northwest -- http://tomdouglas.com/dahlia/
Seattle, WA (Downtown)
Food taste: 3
Food presentation: 3
Wine list: 3
Service: 2
Price: $$$

Dahlia is Tom Douglas’s most upscale restaurant. In some ways, it feels like Tom is trying too hard with this one. Having eaten at his others, he is in his element when the dishes are very rustic with lots of complex flavors woven throughout, sort of sloppy, exemplified by the Palace Kitchen. I call it “hunting lodge food”, since there’s usually some game meat and generally themes that you’d imagine enjoying after a long day’s hunt. (I don’t hunt by the way, so this is purely inspired by movies, books, and other works of fiction on the topic. :P) Dahlia, on the other hand, tries to be a little minimalist while at the same time giving the same huge, rustic, and yes, sloppy feel to the dishes. If it sounds strange, it is. It kind of doesn't work for me, although some dishes are hits: I am completely in love with their Italian Bread Salad, with rustic bread cubes, fresh mozarella, kalamata olives, fruity olive oil, chunky strips of cured ham, etc. as a lunch.

I think everybody should try Dahlia, but it’s not the kind of place I’d go to on, say, a weekly basis. The food can get a little repetitive and is very predictable, as are all of Tom’s restaurants. It’s almost comfort food, but at that price, I have a dozen other comfort food restaurants that I’d rather hit up. It’s a matter of personal taste really. The bakery next door has some serious artesianal bread and pastries. Check it out; the bakery is definitely worthy of a trip on Sunday morning just to get a loaf to snack on.

Nishino
http://www.nishinorestaurant.com/
Japanese -- Seattle, WA (Madison Park)
Food taste: 3
Food presentation: 3
Wine list: 3
Service: 2
Price: $$$

Lola
http://tomdouglas.com/lola/
New Greek -- Seattle, WA (Downtown)
Food taste: 4
Food presentation: 3
Wine list: 2
Service: 2
Price: $$$

Veil
http://www.veilrestaurant.com/
New American -- Seattle, WA (Pioneer Square)
Food taste: 3
Food presentation: 4
Wine list: 2
Service: 3
Price: $$$

2 STARS OVERALL

Tango
http://tangorestaurant.com/
Spanish / Tapas -- Seattle, WA (Capitol Hill)
Food taste: 3
Food presentation: 3
Wine list: 3
Service: 3
Price: $$$

Palace Kitchen
http://tomdouglas.com/palace/
Pacific Northwest -- Seattle, WA (Downtown)
Food taste: 4
Food presentation: 2
Wine list: 2
Service: 3
Price: $$$

Campagne
http://www.campagnerestaurant.com/camp_home.html
Classic French -- Seattle, WA (Pike Place)
Food taste: 3
Food presentation: 3
Wine list: 4
Service: 2
Price: $$$$

Seastar
http://www.seastarrestaurant.com/
Bellevue, WA
Food taste: 3
Food presentation: 3
Wine list: 3
Service: 3
Price: $$$

Purple Café and Wine Bar
http://www.thepurplecafe.com/
Pacific Northwest -- Kirkland, WA, Woodinville, WA, and Seattle, WA (Downtown)
Food taste: 2
Food presentation: 2
Wine list: 3
Service: 3
Price: $$$

Monsoon
http://www.monsoonseattle.com/
New Vietnamese -- Seattle, WA (Capitol Hill)
Food taste: 3
Food presentation: 3
Wine list: 2
Service: 3
Price: $$$

Cactus
http://www.cactusrestaurants.com/
Mexican -- Kirkland, WA, Seattle, WA (Madison Park), and Seattle, WA (Alki Beach)
Food taste: 4
Food presentation: 3
Wine list: 1
Service: 2
Price: $$

Brasa
http://www.brasa.com/
New Mediterranean -- Seattle, WA (Belltown)
Food taste: 3
Food presentation: 2
Wine list: 2
Service: 2
Price: $$$

1 STAR OVERALL

Many of these places are comfort food and quick bite places. It’s a little unfair to label them “1 star” since comparing a great Southern Indian restaurant to somewhere like Mistral is like comparing apples to oranges. Each is amazingly good, but are for vastly different purposes (and with different price points). C’est la vie.

The Dish
Breakfast -- Seattle, WA (Ballard)
Food taste: 3
Food presentation: 1
Wine list: N/A
Service: 2
Price: $

Udupi Palace
Indian (Southern) -- Bellevue, WA
Food taste: 4
Food presentation: 1
Wine list: N/A
Service: 1
Price: $

Preet’s
http://www.preets.com/
Indian (Punjabi) -- Redmond, WA
Food taste: 3
Food presentation: 2
Wine list: N/A
Service: 2
Price: $

Kabul
http://www.kabulrestaurant.com/
Afghan -- Seattle, WA
Food taste: 3
Food presentation: 2
Wine list: N/A
Service: 2
Price: $$ (a little expensive for what you get)

Serious Pie
http://tomdouglas.com/serious/
Seattle, WA (Downtown)
Food taste: 3
Food presentation: 1
Wine list: 1
Service: 1
Price: $$

Wild Ginger
http://www.wildginger.net/
Pacific Rim -- Seattle, WA (Downtown)
Food taste: 2
Food presentation: 2
Wine list: 2
Service: 2
Price: $$$ (overpriced)

Szmania’s
http://www.szmanias.com/
German -- Seattle, WA (Magnolia)
Food taste: 3
Food presentation: 2
Wine list: 2
Service: 2
Price: $$$

Typhoon
http://www.typhoonrestaurants.com/
Thai -- Redmond, WA (Pioneer Square)
Food taste: 3
Food presentation: 3
Wine list: N/A (good tea list, though)
Service: 1
Price: $$

Mediterranean Kitchen
Middle Eastern -- Bellevue, WA and Seattle, WA (Queen Anne)
Food taste: 2
Food presentation: 2
Wine list: 1
Service: 2
Price: $$

I Love Sushi
http://www.ilovesushi.com/
Bellevue, WA and Seattle, WA (Lake Union)
Food taste: 2
Food presentation: 2
Wine list: N/A
Service: 2
Price: $

This place serves up good middle-of-the-road sushi, not too great, not too bad either. The raw fish is where it’s at here—I’m not overly crazy about the cooked foods. It gets busy on weekdays during lunch, but you can snag a Bento Box to go which is a fun treat.

Third Floor Fish Café
http://www.fishcafe.com/
Pacific Northwest -- Kirkland, WA
Food taste: 3
Food presentation: 2
Wine list: 2
Service: 2
Price: $$

Shamiana
http://shamianarestaurant.com/
Indian and Pakistani -- Kirkland, WA
Food taste: 3
Food presentation: 2
Wine list: 1
Service: 2
Price: $$

Marina Park Grill
http://www.marinaparkgrill.com/
Pacific Northwest -- Kirkland, WA
Food taste: 2
Food presentation: 2
Wine list: 2
Service: 2
Price: $$

Some pretty decent seafood dishes and seaside fare, like burgers and salads. It’s right on the waterfront and is a great place to stop by while walking around Kirkland and hanging out at the park. All around decent place with a not-too-shabby by-the-glass winelist. I wouldn’t go out of your way to eat here for dinner, but it makes a great late Sunday afternoon lunch on a warm summer day. There’s a little ice cream shop right around the corner for a nice treat afterwards while you walk through the park.

Chutney’s
http://www.chutneys.com/
Indian -- Bellevue, WA
Food taste: 2
Food presentation: 2
Wine list: 1
Service: 1
Price: $$

Malay Satay Hut
http://www.chutneys.com/
Malaysian -- Redmond, WA
Food taste: 3
Food presentation: 1
Wine list: N/A
Service: 1
Price: $

NEXT ON MY LIST…

These are restaurants I haven’t been to yet, but that I’ve heard first hand are quite good. They are listed in the order in which I plan on visiting them (or in the case of Gypsy, hope ;) ). And yes, I am rather embarrassed I haven’t tried Canlis or Lampreia yet. I suspect both will be either 4 or 5 stars. My intent is to do so in the first couple weeks of the new year.

Gypsy
Invitation only. :(

Lampreia
http://www.lampreiarestaurant.com/

Canlis
http://www.canlis.com/

Le Gourmand

La Carta de Oaxaca
http://www.lacartadeoaxaca.com/

Cascadia
http://www.cascadiarestaurant.com/html/

Carmelita
http://www.carmelita.net/

Izumi

12/30/2006 2:12:39 PM (Pacific Standard Time, UTC-08:00)  #   

 Friday, December 29, 2006

Deadlocks aren’t always because you’ve taken locks in the wrong order.

In many systems, tasks communicate with other tasks through shared buffers.  In a concurrent shared memory system, these buffers might be simple queues shared between many threads.  In COM and Windows GUI programs these buffers might take the form of a window’s message queue.  In any case, if some task A performs a synchronous message send to task B, and task B does a synchronous send to task A, near simultaneously, and if neither task continues to process incoming messages, both will be blocked forever.

This is the classic reentrancy versus deadlock problem.  Ensuring that both A and B continue to process incoming messages while blocked on a send will eliminate the deadlock, albeit at the cost of possible reentrancy headaches.  Better yet, you could just send messages asynchronously.

Things can get quite a bit more complicated than this, of course.  Imagine that we have three operations, A, B, and C, being run over N concurrent streams of data.  We use data parallelism to partition the input data into and replicate the operations over the N streams, such that we have A0…AN-1, B0…BN-1, and C0…CN-1 operating on disjoint input.  A0 produces data for B0 which produces data for C0, and so on.  Elements are pulled on demand from the leaves (A0…AN-1) to the root (C0…CN-1), using a single execution resource (like a thread) per stream, i.e. E0…EN-1.

This is quite a bit like many real data parallel systems, including stream processing.

Imagine that sometimes AN finds that some input data must be given to BM instead of BN (where N != M); there is a similar story for B and C too.  We might be tempted to use some form of shared buffer to perform the inter-task communication here.  In other words, when A0 finds something for B1, it sends the data to it by placing it into B1’s input buffer.  This might be done asynchronously, i.e. A0 needn’t wait for B1 to actually consume the message, hence avoiding the sort of deadlock we noted earlier.

Unfortunately, since it might take some unknowable amount of time for B1 to process its input, we might worry about excessive memory usage for these buffers.  So we could put a bound on its maximum size using an ordinary bounded buffer… but once we’ve done that, we have turned what was an asynchronous send into a possibly-synchronous one, and in doing so introduced the same deadlock problem with which I began this whole discussion.

We could solve this by ensuing that, whenever a task must block because the destination buffer is full, it also processes incoming messages in its own buffer.  In other words, we use reentrancy.  Sadly, things are not always quite so simple.

Imagine this case: A0 has found data for B1, but B1’s buffer has become full.  So A0 is now waiting for B1 to process messages from its buffer to make room.  Nobody else will produce data for A0 at this point, so it’s stuck waiting.  Sadly, B1 has too become blocked trying to send a message that it has found for C0.  Because the same execution resource E0 that must free space in C0’s buffer is currently blocked in A0 waiting for B1 to free space from its buffer, and because the execution resource E1 is also waiting for E0, we now have a very convoluted deadlock on our hands.

The solution?  There’s no reason to keep the execution resource E0 occupied in A0 waiting for B1 in this case.  E0 could instead be freed up to run C0, freeing space for B1, and untangling the system.  Reentrancy strikes again, but this time, in a good way.  Note that in a heterogeneous system where these buffers are not controlled by the same resource, this solution is difficult to realize in practice.  Maybe A uses a custom bounded buffer written in C# to communicate, B uses SendMessages and COM message queues, and C uses GUI messages.  In this case, orchestrating the waiting to be deadlock free becomes a real challenge.

12/29/2006 11:27:31 AM (Pacific Standard Time, UTC-08:00)  #   

 Friday, December 22, 2006

You might have noticed I'm blogging a lot more about native code than I used to.  Most of my day-to-day work still has to do with managed code, but I've been dazzled by all the new cool things in Vista that you can't do yet in managed code (and probably won't be able to for some time to come, since the CLR needs to keep running on old OSes).  And, besides, many low-level Windows changes impact managed code too, like the fairness post from last week.

My question to you is:  Do you vehemently like or dislike either category of post?  Does the split between the two work?  I'm trying hard to keep it at about 50/50. 

I'm wondering not just for blogging reasons.  As you might imagine, my book is looking to be very similar to my blog.  It is, after all, "Concurrent Programming on Windows", which sometimes involves using native code to access features that managed doesn't expose to you.  Similarly, the CLR gives you many things that Windows alone doesn't give you.  I'm trying reeeeealllly hard to ensure the prose is not schizophrenic, flows and progresses nicely, and doesn't repeat things: e.g. "here's how you do it in VC++, here's how you do it in C#, ..."  This can be a challenge for many things, but is obviously very important.  Thankfully a huge portion of the book is more about applying the technologies generally, which tends to be pretty common across both environments.

So ... what do you think?

12/22/2006 11:56:18 AM (Pacific Standard Time, UTC-08:00)  #   

If you need RegisterWaitForSingleObject-like behavior in the new Vista threadpool -- a great feature, by the way, that amortizes the cost of losing a thread to waiting by cramming up to 63 waits into a single threadpool thread, leading to better scalability on systems that must wait for a lot of things concurrently, sometimes for long or unpredictable times (and the magic behind ASP.NET Asynchronous Pages) -- you'll want to look at the SetThreadpoolWait and related APIs.  You get basically the same functionality, with a mostly cleaned up interface and the added benefit of having cleanup groups to take care of resource cleanup (if you so choose):

VOID WINAPI SetThreadpoolWait(PTP_WAIT pwa, HANDLE h, PFILETIME pftTimeout);

One thing annoyed me about this API when I first tried to use it.  To be truthful, it still does.  Instead of a DWORD timeout specified in 1ms intervals, you'll notice you have to supply a pointer to a FILETIME data structure.  Wha-wha-whaat?  This differs from just about every other wait API on Windows and is not as simple as it sounds.

The no-timeout case, which is typically -1 (a.k.a. INFINITE), is simple enough: just pass a NULL pointer.  The "check, but immediately timeout if unsignaled without waiting," variant, typically specified with 0, is pretty straightforward.  Just initialize your FILETIME to contain 0's.  There are many ways to do this but using a struct field initializer is probably the easiest:

FILETIME ft = {0,0};
SetThreadpoolWait(..., &ft);

But for real timeouts, how the heck do you get your hands on a properly initialized FILETIME?  OK, here's where things get nasty.  You can grab ahold of a FILETIME from many places, including from an actual file's creation or modification time, but you probably don't want to do that.  GetSystemTimeAsFileTime(&ft) will turn the current time into a FILETIME, which is useless without additional math to turn it into a time relative to now (otherwise, you'd just use a {0,0} FILETIME as noted above).  Alternatively, you could start with a SYSTEMTIME (the current can be retrieved with GetSystemTime), which allows you to define a precise year, month, day, hour, minute, second, and/or millisecond, and then turn that into a FILETIME with SystemTimeToFileTime(..., &ft).  But how many times do you really have a precise absolute date and time at which you want your wait to stop?  Yes, February 3rd, 2019, at 6:29 AM (and 39.66 seconds) please.

Whew!  At this point, you're probably bewildered.  Most wait timeouts are defined in terms of relative time.  Thankfully, you can still define a relative time when calling SetThreadpoolWait.  But sadly you have to resort to some messy casting and data conversion to do this.  How do you do that?  Simple.  If the FILETIME's contents represent a negative integer, then the threadpool adds the negation of that to the current system time during registration to come up with an absolute time.  So it does the GetSystemTimeAsFileTime/addition nonsense for you.  Well, how the heck do you do create a negative FILETIME anyway?  "Easy"!

__declspec(align(8)) FILETIME ft;
*reinterpret_cast<LONG64 *>(&ft) = -1000;
SetThreadpoolWait(..., &ft);

That causes the threadpool to use a timeout of 1000 from the current time.  1000 units of what you might ask?  FILETIME represents its time with 100ns units, so this particular timeout of -1000 is interpreted to mean 100 microseconds from now.  Generally, if you have n milliseconds, you'd calculate n * 1000 (to get microsecond units) * 10 (to get 100 nanosecond units).

[Update: 12/26/2006: I changed the examples to use __declspec(align(8)), ensuring that the FILETIME is aligned on an 8-byte boundary.  Pavel noted in comments to this post that, since FILETIME consists of two DWORDs, VC++ will align on 4-byte boundaries.  On some architectures -- like IA64 -- treating these as a single 64-bit aggregate piece of data will present alignment faults to the software.  (This is unlike the behavior you'll see on X86 and X64, where such faults are fixed transparently by the hardware and/or OS, albeit at a performance hit.  You can ask the OS to fix these up on IA64 automatically, with SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT), but this results in even worse performance than on other architectures.)  Raymond wrote about this previously.  One solution is to align manually, as shown here; another is to memcpy data to and from the FILETIME/LONG64; yet another, which is probably the cleanest, is to simply access the FILETIME's fields directly.  E.g. FILETIME ft = { -m * 1000 * 10, ~0 } or ft.dwLowDateTime = -m * 1000 * 10; ft.dwHighDateTime = ~0, for some milliseconds timeout m.]

To save you some time and frustration in this process, you can just use the following little shim in your code.  I do.  It gives you a nice, relative DWORD-based timeout interface to the SetThreadpoolWait API:

VOID SetThreadpoolWaitWithMsTimeout(PTP_WAIT pwa, HANDLE h, DWORD dwMilliseconds) {
    __declspec(align(8)) FILETIME ft;
    PFILETIME pft;

    if (dwMilliseconds == -1) {
        // Just pass NULL to the wait API to signal "no timeout".
        pft = NULL;
    } else {
        pft = &ft;

        // FILETIME uses 100ns intervals. To convert milliseconds into 100ns units,
        // we must multiply by 1000 (to get microseconds) and then by 10 (to get 100ns).
        // We take the negative of this number to specify "relative to now" time.
        *reinterpret_cast<LONG64 *>(pft) = -dwMilliseconds * 1000 * 10;
    }

    SetThreadpoolWait(pwa, h, pft);
}

To be honest, I don't know the reasoning behind the break from tradition here.  [Update: 12/26/2006: Richard pointed out, very correctly, that the NTDLL implementations have alwasy dealt in terms of FILETIMEs and performed DWORD ms to 100ns conversions; in that sense, this is hardly a "break from tradition" for Windows generally, but is certainly a new direction for KERNEL32 itself.]  Using a FILETIME clearly causes some trickiness for something that should be damn simple, but it does have one distinct advantage.  The DWORD-based APIs only permit you to specify relative time, whereas the FILETIME approach allows you to specify relative or absolute, depending on what you need.  I have to admit that I can’t think of many cases with wait registrations where you’d want an absolute timeout, but I might be unimaginative.  This also has the benefit of removing some degree of ambiguity: with relative, you always have to wonder: time relative to what?  Is it relative to the time at which an actual wait thread sees the registration and adds the HANDLE to its list of waitable objects?  Or is it relative to the time at which the call to register the wait is made?  Or something else?  In the end, I’d have preferred a separate API for absolute time… I just like my DWORD-based relative timeouts.  They are familiar and comfortable.  Continuing to use RegisterWaitForSingleObject is an option, but doesn't let you use environments, cleanup groups, etc.  Tradeoffs abound.

12/22/2006 11:41:51 AM (Pacific Standard Time, UTC-08:00)  #   

 Thursday, December 14, 2006

Raymond just posted a brief entry about lock convoys.  While he links to a few other relevant posts, none of them mention the new anti-convoy features that all Windows locks now use.  So I thought that I would take the chance to do just that.

Many people claim that fair locks lead to convoys.  In my experience, however, few people really know the reason why.  Before Windows Vista (client OS) and Windows Server SP1 (server OS), mutexes, critical sections, and internal locks like kernel pushlocks used a lock handoff mechanism to guarantee true fairness.  In other words, the lock would not actually become "available" when released so long as there were threads waiting to acquire it.  Instead, the thread releasing the lock would modify the lock's state so that it appeared as if the next thread in the wait queue already owned it.  The new owner thread would then simply wake up, typically via the releaser signaling an event, and the owner would find that it already owned the lock, proceeding happily.  No other thread can "sneak in" between the time the new owner is woken and the time that it is actually scheduled for execution with this design.

While this sounds nice and, well, fair, it exacerbates convoys.  Why?  Because it effectively extends the lock hold time by the communication and context switch latency required to wake and reschedule the new owner.  Context switches on Windows are anything but cheap, and tend to cost anywhere from 4,000 to 10,000+ cycles.  Assume C represents the cost of a context switch.  Then with a truly fair lock, the lock will be in an intermediary handed off state, with no thread actually running code under its protection, for about C cycles.  It's actually worse than this.  Assuming the system is busy, a thread that is woken just goes to the end of the OS's thread scheduler queue, and is required to wait until it gets allocated a timeslice in which to execute.  This can make C much larger in practice.  And of course on highly loaded systems the condition worsens, which can add insult to injury (as we see momentarily).

To cope with the possibility of a scheduling delay, Windows uses something called a priority boost for any thread waiting on an auto-reset event (or that owns a window): the boost temporarily increases the target thread's priority which subsequently decays after it gets scheduled.  Assuming no other high priority threads are runnable, this ensures the latency is very close to C.  But C's still pretty darned big...

To illustrate the problem with the fair handoff scheme, imagine we have a lock L for which a new thread arrives every 2,000 cycles.  Each such thread runs for 1,000 cycles under the protection of the lock.  No problem, right?  On average, the lock will be held for 1,000 cycles, unheld for 1,000 cycles, and so on.  Assuming the arrival rate is somewhat random, but statistically averages out to the values mentioned, then occasionally we might get some contention, requiring waiting.  But for every contentious acquire, there should also be a big gap in time where there are no owners (or where wait queues can be drained).  A system with these characteristics should balance out well.  It should survive until threads begin arriving at a frequency of more than 1,000 cycles, give or take some epsilon, which is actually a doubling in throughput.  It's not even close to capacity.  (Real systems depend on many more factors than this simplistic view, but you get the point.)

As soon as the lock is fair, however, this scheme quickly becomes untenable and will come to a grinding halt.  Imagine that thread T0 acquires L at cycle 0; if it just so happens that T1 tries to acquire it at cycle 500, then T1 will have to wait.  Remember, on average, the arrival rate is 1,000 cycles, but that's just an average.  We expect the occasional wait to occur.  This wait, unfortunately, causes a domino effect from which the system will never recover.  T0 then releases L at cycle 1,000, as expected, handing off ownership to T1; sadly, T1 doesn't actually start running inside the lock until 5,000 (assuming 4,000 for C, and assuming no scheduling delay); in the 4,000 cycles it took for T1 to wake back up and start running, we expect 2 new threads will have arrived on average; these threads would see L as owned by T1 and respond by waiting.  By the time those threads execute, another 10,000 cycles (2*(C+1,000)) will have passed, and another 4 threads will have begun waiting.  And so on.  This process repeats indefinitely, the requests pile up (hopefully with a bound), and disaster strikes.  The system simply won't scale this way.

If you remove the strict fairness policy, however, the system scales.  And that, my friends, is why all of the locks in Windows are now unfair.

Of course, Windows locks are still a teensy bit fair.  The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues.  (APCs regularly disturb this ordering -- a topic for another day -- which actually calls into question the merit of the original design goal of attaining fairness in the first place.)  Now when a lock becomes unowned, a FIFO waking algorithm is still used, but the lock is immediately marked as unavailable.  Another thread can sneak in and take the lock before the woken thread is even scheduled (although priority boosting is still, somewhat questionably, in the system).  If another thread steals the lock, the woken thread may subsequently have to rewait, meaning it must go to the back of the queue, again disturbing the nice FIFO ordering.

The change to unfair locks clearly has the risk of leading to starvation.  But, statistically speaking, timing in concurrent systems tends to be so volatile that each thread will eventually get its turn to run, probabilistically speaking.  Many more programs would suffer from the convoy problems resulting from fair locks than would notice starvation happening in production systems as a result of unfair locks.

12/14/2006 10:31:42 PM (Pacific Standard Time, UTC-08:00)  #   

 Wednesday, December 06, 2006

I took this past week off so that I could work on my book.  Well, I'm happy to report that I've been successfully writing like a madman, averaging around 15-20 solid pages per day.  I still have a long way to go, but I'm getting more confident with the passing of each day that this book will be...  well...  a book that I'd actually like to sit down and read.

[Update: 12/7/2006: Correction made -- the CLR's JITs do not generate manual alignment code.  Instead, we defer to the costly OS handler for alignment fixups.]

In the process of writing the section on data alignment, I realized there is very little documentation on the alignment policy used by the CLR.  This is in contrast to Kang Su Gatlin's wonderful MSDN treatise on the subject for VC++, which leaves absolutely nothing hidden in the closet.  Well, I still don't have all of the answers for you.  Sorry.  You'll have to wait for the book.  But in the meantime, I've discovered that there's a myth that deserves a little debunking.

In the MSDN documentation for InterlockedCompareExchange64, it says:

"The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems."

I've also heard and read this from other various sources.  I've heard, for example, that LOCK CMPXCHG8B will still do a load/compare/store sequence, but that, if the address isn't 8-byte aligned, the instruction will not be atomic.  This would lead to sporadic atomicity failures, probably even more difficult to track down than a typical race.  Given that the CLR doesn't faithfully align 64-bit data types on 8-byte boundaries (as we'll see momentarily), I suddenly feared that Interlocked.CompareExchange(ref Int64, ...) was HORRIBLY broken.  Without an MP machine at home, I couldn't test this out, so I decided to do a little digging.

In the manuals for many AMD processors and older Intel X86 processors, I found no reference to CMPXCHG8B requiring an aligned address.  What I did find, however, in the Intel 64-bit and IA32 System Programmer's Manual Part A was the following (emphasis mine):

"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:

  • Any boundary for an 8-bit access (locked or otherwise).
  • 16-bit boundary for locked word accesses.
  • 32-bit boundary for locked doubleword accesses.
  • 64-bit boundary for locked quadword accesses."

If I'm reading that right, this means the common wisdom around 8-byte alignment and LOCK CMPXCHG8B is hogwash.  (Sadly, proving the absence of some flaky processor that crashes or has unpredictable behavior under certain circumstances is rather difficult, especially if someone at some point though it was true enough to put it in the MSDN documentation.  If somebody out there knows of a real case -- and it's not just hear say -- please let me know!)  If this is true of all X86 processors, it means that Interlocked.CompareExchange(ref Int64, ...) isn't horribly broken on the CLR after all.  (Yaay.)  It would have been broken...  because, as I said earlier, the CLR does NOT align 64-bit values on 8-byte address boundaries...

Conversing briefly with Simon Hall over email, the dev that owns most (all?) of the type layout infrastructure, I've concluded the following:  CLR type layout tries to eliminate all misaligned data layout through a combination of padding and field reordering.  This means that data of >= 8-bytes on 64-bit always begins on 8-byte boundaries, and data of >= 4-bytes on 32-bit always begins (at least) on 4-byte boundaries.  I say "at least" because emperical evidence shows that type layout actually aligns many 8-byte fields on 8-byte boundaries, even on 32-bit.  (It turns out this doesn't matter much...  neither the 32-bit JIT nor the GC respect this when allocating data.)  In summary, the CLR ensures that no field that could have fit inside a single 4/8-byte segment ever spills across a boundary.  The CLR also adds necessary padding to StructLayout(Sequential) types, while still preserving the original field ordering.

Therefore, the only cases where we end up with truly misaligned data is with StructLayout(Explicit) and StructLayout(Pack=...) types.  For example the simple struct, struct S { [FieldOffset(6)] int i; }, will always be misaligned, on 32- and 64-bit alike.  In such cases, our JIT simply generates the naive code and lets the OS perform misalignment fixups.  This is actually rather costly, as Kang Su's aforementioned article explaines.  We could have, like the VC++ compiler, generate the manual alignment code using a combination of loads and shifts, but my guess is that most of our customers don't care and will never notice.

To preserve the hard work done by type layout, our JITs and the GC guarantee that all allocated data is aligned on at least 4-byte (on 32-bit) or 8-byte (on 64-bit) boundaries.  I say "at least" once again because I know, for example, that VC++ aligns stack frames on 16-byte boundaries for 64-bit.  I don't claim to understand why.  We might do something similar.

Here's an interesting program that just prints out a few field addresses, and whether things are 8-byte aligned.  You'll interestingly notice that the int/long fields that are adjacent to one another are padded with 4-bytes in between on 32- and 64-bit, but that the JIT and GC only align on 4-byte addresses on 32-bit.  I presume this is so that the layout doesn't have to change between 32- and 64-bit, but I can't say for sure:

using System;
using System.Runtime.InteropServices;

class C {
    internal S s;
}

struct S {
    internal int x;
    internal long y;
    internal byte z;
}

unsafe class P {
    static void Main(string[] args) {
        int pad = 5;
        if (args.Length > 0) pad = int.Parse(args[0]);

        Console.WriteLine("Field\t[Begin\tEnd)\t%8");
        PrintStackS(pad);
        PrintHeapS(pad);
    }

    static void PrintStackS(int x) {
        int * pad = stackalloc int[x];
        S * s = stackalloc S[1];
        PrintAddr(s);
    }

    static void PrintHeapS(int x) {
        for (int i = 0; i < x; i++) new object();
        C c = new C();
        fixed (S * pcs = &c.s) {
            PrintAddr(pcs);
        }
    }

    static unsafe void PrintAddr(S * ps) {
        ulong xa = new UIntPtr(&ps->x).ToUInt64();
        Console.WriteLine("X\t{0:X}\t{1:X}\t{2}", xa, xa + sizeof(int), xa % 8);
        ulong ya = new UIntPtr(&ps->y).ToUInt64();
        Console.WriteLine("Y\t{0:X}\t{1:X}\t{2}", ya, ya + sizeof(long), ya % 8);
        ulong za = new UIntPtr(&ps->z).ToUInt64();
        Console.WriteLine("Z\t{0:X}\t{1:X}\t{2}", za, za + sizeof(byte), za % 8);
    }
}

Running it with a few different inputs yields these results:

C:\Temp>8by
Field   [Begin  End)    %8
X       12F440  12F444  0
Y       12F448  12F450  0
Z       12F450  12F451  0
X       1273670 1273674 0
Y       1273678 1273680 0
Z       1273680 1273681 0

C:\Temp>8by 2
Field   [Begin  End)    %8
X       12F44C  12F450  4
Y       12F454  12F45C  4
Z       12F45C  12F45D  4
X       1273664 1273668 4
Y       127366C 1273674 4
Z       1273674 1273675 4

If the CLR ever decides to support a 128 CAS operation, Interlocked.CompareExchange(ref Int128, ...), which I hope we will, we would need to guarantee alignment on 16-byte boundaries.  In comparison to CMPXCHG8B, CMPXCHG16B does indeed fail when issued against an address that isn't 16-byte aligned.  Instead of failing silently, a GP fault is generated.  This is difficult, because not only must type layout respect the alignment (you can already get this with StructLayout(..., Pack=16)), but the JIT and the GC would also need to allocate correctly.  Or, of course, you could over-allocate a chunk of data and shift the start pointer to the first aligned address inside of it.  This might work for the stack, but for GC allocated data this is going to keep shifting around on you, and probably won't work very well.  Before the CLR supports Interlocked.CompareExchange(ref Int128, ...), however, I suppose we ought to provide an Int128.  :)

12/6/2006 10:06:36 PM (Pacific Standard Time, UTC-08:00)  #   

 Friday, December 01, 2006

I recently left the CLR team to join a new team focusing on parallelism on Microsoft's platform in the 3-10 year timeframe.

I am leading design and development of the Parallel LINQ (PLINQ) project that I alluded to here.

We're looking for supersmart technical people to join the team and help change the face of programming for anybody writing code on the CLR or VC++. PLINQ isn't the only project. Solid CS skills are a must, but you don't necessarily have to be a concurrency guru (right away).

These job postings have some detail:

http://members.microsoft.com/careers/search/results.aspx?FromCP=Y&JobCategoryCodeID=&JobLocationCodeID=&JobProductCodeID=&JobTitleCodeID=&Divisions=&TargetLevels=&Keywords=2012%20&JobCode=&ManagerAlias=&Interval=10

If something catches your eye, respond on the jobs site or send me your resume directly at joedu AT you-know-where DOT com (i.e. microsoft DOT com).

I'll of course still be blogging about everything concurrency related, so you won't notice much of a difference content-wise. And I'm still happy to answer any CLR related questions and help you find the right folks inside the team to listen to your feedback.

12/1/2006 8:54:32 PM (Pacific Standard Time, UTC-08:00)  #   

 Tuesday, November 28, 2006

There’s surprisingly little information out there on Windows keyed events.  That’s probably because the feature is hidden inside the OS, not exposed publicly through the Windows SDK, and is used only by a small fraction of kernel- and user-mode OS features.  The Windows Internals book makes brief mention of them, but only in passing on page 160.  Since keyed events have been revamped markedly for Vista, a quick write up on them felt appropriate.  I had the pleasure to chat at length today with the developer who designed and implemented the feature back in Windows XP.  (I typically try to get work done during the day, but it seems the whole Microsoft campus was offline, aside from the two of us, due to the 1 or 2 inches of snow we received last evening).  I doubt much of this will make it into my book, since knowing it all won’t necessarily help you write better concurrent software.

First here’s the quick backgrounder on why keyed events were even necessary.  Before Windows 2000, critical sections, when initialized, were truly initialized.  That meant their various dynamically allocated blobs of memory were allocated, contained the right bit patterns, and also that the per-section auto-reset event that is used to block and wake threads under contention was allocated and ready.  Unfortunately, there are a finite number of kernel object HANDLEs per process, of which auto-reset events consume one, and each object consumes some amount of non-pageable pool memory.  It also turns out lots of code uses critical sections.  Around the Windows 2000 time frame, a lot more people were writing multithreaded code, primarily for server SMP programs.  It’s relatively common now-a-days to have hundreds or thousands of them in a single process.  And many critical sections are used only occasionally (or never at all), meaning the auto-reset event often isn't even necessary!  Aside from the auto-reset event, the entire critical section is pageable and has no impact on a fixed size resource.

This was a problem, and had big scalability impacts.  So starting with Windows 2000, the kernel team decided that allocation of the event would be delayed until it’s first needed.  That means EnterCriticalSection had to, in response to the first contended acquire, allocate the event.  But there’s a problem with this.  If memory is low, or the number of HANDLEs in the process had been exhausted, this lazy allocation would actually fail.  Suddenly EnterCriticalSection, which would never have failed previously (stack overflow aside), could throw an exception.  What’s worse, you couldn’t really recover from these exceptions: the CRITICAL_SECTION data structure was left in an unusable and damaged state.  But wait, it gets worse.  I’m told there was a sizeable cleanup initiated that involved filing many, many bugs to fix code that used EnterCriticalSection throughout the Windows and related code-bases.  Unfortunately, then people realized that LeaveCriticalSection could also fail under some even more obscure circumstances.  (If EnterCriticalSection fails, throwing an out of memory exception, the subsequent LeaveCriticalSection would see the damaged state and think it could help out by allocating the event.  This too could fail, causing even more corruption.)  What to do?  Wrap each call to EnterCriticalSection AND LeaveCriticalSection in its own separate __try/__except clause?  And do precisely what in response, since the data structure was completely hosed anyway?

The bottom line was that no human being could possibly write reliable software using critical sections.  Terminating the process, or isolating those bits of code that used such a damaged critical section somehow, were the only intelligible responses.  Most of Microsoft's software, including the CRT and plenty of important applications, would probably not do anything, and remain busted.

Still, the people responsible for the original change believed strongly that the impacts to reliability were the lesser of two evils: that limiting Windows scalability so fundamentally was a complete non-starter.  As a short-term solution, then, InitializeCriticalSectionAndSpinCount was hacked so that passing a dwSpinCount argument with its high-bit set, e.g. InitializeCriticalSectionAndSpinCount(&cs, 0x80000000 | realSpinCount), would revert to the pre-Windows 2000 behavior of pre-allocating the event object.  No longer would low resources possibly cause exceptions out of EnterCriticalSection and LeaveCriticalSection.  But all that code written to use the ordinary InitializeCriticalSection API was still vulnerable.  And this just pushed the fundamental reliability vs. scalability decision back onto the poor developer.  What a horrible choice to have to make.

This is when keyed events were born.  They were added to Windows XP as a new kernel object type, and there is always one global event \KernelObjects\CritSecOutOfMemoryEvent, shared among all processes.  There is no need for any of your code to initialize or create it—it’s always there and always available, regardless of the amount of resources on the machine.  Having it there always adds a single HANDLE per process, which is a very small price to pay for the benefit that comes along with it.  If you dump the handles with !handle in WinDbg, you’ll always see one of type KeyedEvent.  Well, what does it do?

A keyed event allows threads to set or wait on it, just like an ordinary event.  But having just a single, global event would be pretty useless, given that we’d like to somehow solve the original critical section problem, which effectively requires a single event per critical section.  Here's where the ingenuity arises.  When a thread waits on or sets the event, they specify a key.  This key is just a pointer-sized value, and represents a unique identifier for the event in question.  When a thread sets an event for key K, only a single thread that has begun waiting on K is woken (like an auto-reset event).  Only waiters in the current process are woken, so K is effectively isolated between processes although there’s a global event.  K is most often just a memory address.  And there you go: you have an arbitrarily large number of events in the process (bounded by the addressable bytes in the system), but without the cost of allocating a true event object for each one.

By the way, if N waiters must be woken, the same key N is set multiple times, meaning for manual-reset-style sets, the list of waiters somehow needs to be tracked.  (Although not an issue for critical sections, this comes up for SRWLs, noted below.)  This gives rise to a subtle corner case: if a setter finds the wait list associated with K to be empty, it must wait for a thread to arrive.  Yes, that means the thread setting the event can wait too.  Why?  Because it's just how keyed events work; without it, there would be extra synchronization needed to ensure a waiter didn't record that it was about to wait (e.g. in the critical section bits), the setter to see this and set the keyed event (and leave), and lastly the waiter to actually get around to waiting on the keyed event.  This would lead to a missed pulse, and possibly deadlock, if it weren't for the current behavior.

So you can probably imagine how this solves the original problem.  When a critical section finds that it can’t allocate a dedicated event due to low resources, it will just wait and set the keyed event, using the critical section’s address in memory as the key K.  You might think: well, gosh, with this nifty new keyed events thingamajiggit, why didn’t they get rid of per-critical section events entirely?  I did at least.

There are admittedly some drawbacks to keyed events.  First and foremost, the implementation in Windows XP was not the most efficient one.  It maintained the wait list as a linked list, so finding and setting a key required an O(n) traversal.  Here n is the number of threads waiting globally in the system.  The head of the list is in the keyed event object itself, and entries in the linked list are threaded by reusing a chunk of memory on the waiting thread’s ETHREAD for forward- and back-links—cleverly avoiding any dynamic allocation whatsoever (aside from the ETHREAD memory which is already allocated at thread creation time).  But given that the event is shared physically across the entire machine, depending on a linked list like this for all critical sections globally would not have scaled very well at all.  And this sharing can also result in contention that is difficult to explain, since threads have to use synchronization when accessing the list.

[Update: 2/2/2007: Neill, the dev I mentioned at the outset, just emailed me a correction to my original write-up.  I had incorrectly stated that the forward- and back-links happen through TEB memory (which is user-mode); they actually use ETHREAD memory.]

But keyed events have improved quite a bit in Windows Vista.  Instead of using a linked list, they now use a hash-table scheme, trading the possibility of hash collisions (and hence some amount of contention unpredictability) in favor of improved lookup performance.  This improvement was good enough to use them as the sole event mechanism for the new “slim” reader/writer locks (SRWLs), condition variables, and one-time initialization APIs.  Yes, you heard that right…  None of these new features use traditional events under the covers.  They use keyed events instead.  This is in part why the new SRWLs are super light-weight, taking up only a pointer-sized bit of data and not requiring any event objects whatsoever.  Critical sections still use auto-reset events, but I understand that this is primarily for AppCompat reasons.  It’s admittedly nice when debugging to be able to grab hold of the HANDLE for the internal event and dump information about it, something you can’t do with keyed events, and plenty of customers depend on this information being there.

The improvement that keyed events offer to reliability and the alleviation of HANDLE and non-pageable pressure is overall a very welcome one, and one that will undoubtedly pave the way for new synchronization OS features in the future.  I personally hope that one day they are made available to user-mode code through the Windows SDK.

11/28/2006 9:32:46 PM (Pacific Standard Time, UTC-08:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<January 2007>
SunMonTueWedThuFriSat
31123456
78910111213
14151617181920
21222324252627
28293031123
45678910

Browse by Category:

Notables: