|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Friday, December 22, 2006
You might have noticed I'm blogging a lot more about native code than I used to. Most of my day-to-day work still has to do with managed code, but I've been dazzled by all the new cool things in Vista that you can't do yet in managed code (and probably won't be able to for some time to come, since the CLR needs to keep running on old OSes). And, besides, many low-level Windows changes impact managed code too, like the fairness post from last week.
My question to you is: Do you vehemently like or dislike either category of post? Does the split between the two work? I'm trying hard to keep it at about 50/50.
I'm wondering not just for blogging reasons. As you might imagine, my book is looking to be very similar to my blog. It is, after all, "Concurrent Programming on Windows", which sometimes involves using native code to access features that managed doesn't expose to you. Similarly, the CLR gives you many things that Windows alone doesn't give you. I'm trying reeeeealllly hard to ensure the prose is not schizophrenic, flows and progresses nicely, and doesn't repeat things: e.g. "here's how you do it in VC++, here's how you do it in C#, ..." This can be a challenge for many things, but is obviously very important. Thankfully a huge portion of the book is more about applying the technologies generally, which tends to be pretty common across both environments.
So ... what do you think?
If you need RegisterWaitForSingleObject-like behavior in the new Vista threadpool -- a great feature, by the way, that amortizes the cost of losing a thread to waiting by cramming up to 63 waits into a single threadpool thread, leading to better scalability on systems that must wait for a lot of things concurrently, sometimes for long or unpredictable times (and the magic behind ASP.NET Asynchronous Pages) -- you'll want to look at the SetThreadpoolWait and related APIs. You get basically the same functionality, with a mostly cleaned up interface and the added benefit of having cleanup groups to take care of resource cleanup (if you so choose):
VOID WINAPI SetThreadpoolWait(PTP_WAIT pwa, HANDLE h, PFILETIME pftTimeout);
One thing annoyed me about this API when I first tried to use it. To be truthful, it still does. Instead of a DWORD timeout specified in 1ms intervals, you'll notice you have to supply a pointer to a FILETIME data structure. Wha-wha-whaat? This differs from just about every other wait API on Windows and is not as simple as it sounds.
The no-timeout case, which is typically -1 (a.k.a. INFINITE), is simple enough: just pass a NULL pointer. The "check, but immediately timeout if unsignaled without waiting," variant, typically specified with 0, is pretty straightforward. Just initialize your FILETIME to contain 0's. There are many ways to do this but using a struct field initializer is probably the easiest:
FILETIME ft = {0,0}; SetThreadpoolWait(..., &ft);
But for real timeouts, how the heck do you get your hands on a properly initialized FILETIME? OK, here's where things get nasty. You can grab ahold of a FILETIME from many places, including from an actual file's creation or modification time, but you probably don't want to do that. GetSystemTimeAsFileTime(&ft) will turn the current time into a FILETIME, which is useless without additional math to turn it into a time relative to now (otherwise, you'd just use a {0,0} FILETIME as noted above). Alternatively, you could start with a SYSTEMTIME (the current can be retrieved with GetSystemTime), which allows you to define a precise year, month, day, hour, minute, second, and/or millisecond, and then turn that into a FILETIME with SystemTimeToFileTime(..., &ft). But how many times do you really have a precise absolute date and time at which you want your wait to stop? Yes, February 3rd, 2019, at 6:29 AM (and 39.66 seconds) please.
Whew! At this point, you're probably bewildered. Most wait timeouts are defined in terms of relative time. Thankfully, you can still define a relative time when calling SetThreadpoolWait. But sadly you have to resort to some messy casting and data conversion to do this. How do you do that? Simple. If the FILETIME's contents represent a negative integer, then the threadpool adds the negation of that to the current system time during registration to come up with an absolute time. So it does the GetSystemTimeAsFileTime/addition nonsense for you. Well, how the heck do you do create a negative FILETIME anyway? "Easy"!
__declspec(align(8)) FILETIME ft; *reinterpret_cast<LONG64 *>(&ft) = -1000; SetThreadpoolWait(..., &ft);
That causes the threadpool to use a timeout of 1000 from the current time. 1000 units of what you might ask? FILETIME represents its time with 100ns units, so this particular timeout of -1000 is interpreted to mean 100 microseconds from now. Generally, if you have n milliseconds, you'd calculate n * 1000 (to get microsecond units) * 10 (to get 100 nanosecond units).
[Update: 12/26/2006: I changed the examples to use __declspec(align(8)), ensuring that the FILETIME is aligned on an 8-byte boundary. Pavel noted in comments to this post that, since FILETIME consists of two DWORDs, VC++ will align on 4-byte boundaries. On some architectures -- like IA64 -- treating these as a single 64-bit aggregate piece of data will present alignment faults to the software. (This is unlike the behavior you'll see on X86 and X64, where such faults are fixed transparently by the hardware and/or OS, albeit at a performance hit. You can ask the OS to fix these up on IA64 automatically, with SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT), but this results in even worse performance than on other architectures.) Raymond wrote about this previously. One solution is to align manually, as shown here; another is to memcpy data to and from the FILETIME/LONG64; yet another, which is probably the cleanest, is to simply access the FILETIME's fields directly. E.g. FILETIME ft = { -m * 1000 * 10, ~0 } or ft.dwLowDateTime = -m * 1000 * 10; ft.dwHighDateTime = ~0, for some milliseconds timeout m.]
To save you some time and frustration in this process, you can just use the following little shim in your code. I do. It gives you a nice, relative DWORD-based timeout interface to the SetThreadpoolWait API:
VOID SetThreadpoolWaitWithMsTimeout(PTP_WAIT pwa, HANDLE h, DWORD dwMilliseconds) { __declspec(align(8)) FILETIME ft; PFILETIME pft;
if (dwMilliseconds == -1) { // Just pass NULL to the wait API to signal "no timeout". pft = NULL; } else { pft = &ft;
// FILETIME uses 100ns intervals. To convert milliseconds into 100ns units, // we must multiply by 1000 (to get microseconds) and then by 10 (to get 100ns). // We take the negative of this number to specify "relative to now" time. *reinterpret_cast<LONG64 *>(pft) = -dwMilliseconds * 1000 * 10; }
SetThreadpoolWait(pwa, h, pft); }
To be honest, I don't know the reasoning behind the break from tradition here. [Update: 12/26/2006: Richard pointed out, very correctly, that the NTDLL implementations have alwasy dealt in terms of FILETIMEs and performed DWORD ms to 100ns conversions; in that sense, this is hardly a "break from tradition" for Windows generally, but is certainly a new direction for KERNEL32 itself.] Using a FILETIME clearly causes some trickiness for something that should be damn simple, but it does have one distinct advantage. The DWORD-based APIs only permit you to specify relative time, whereas the FILETIME approach allows you to specify relative or absolute, depending on what you need. I have to admit that I can’t think of many cases with wait registrations where you’d want an absolute timeout, but I might be unimaginative. This also has the benefit of removing some degree of ambiguity: with relative, you always have to wonder: time relative to what? Is it relative to the time at which an actual wait thread sees the registration and adds the HANDLE to its list of waitable objects? Or is it relative to the time at which the call to register the wait is made? Or something else? In the end, I’d have preferred a separate API for absolute time… I just like my DWORD-based relative timeouts. They are familiar and comfortable. Continuing to use RegisterWaitForSingleObject is an option, but doesn't let you use environments, cleanup groups, etc. Tradeoffs abound.
 Thursday, December 14, 2006
Raymond just posted a brief entry about lock convoys. While he links to a few other relevant posts, none of them mention the new anti-convoy features that all Windows locks now use. So I thought that I would take the chance to do just that.
Many people claim that fair locks lead to convoys. In my experience, however, few people really know the reason why. Before Windows Vista (client OS) and Windows Server SP1 (server OS), mutexes, critical sections, and internal locks like kernel pushlocks used a lock handoff mechanism to guarantee true fairness. In other words, the lock would not actually become "available" when released so long as there were threads waiting to acquire it. Instead, the thread releasing the lock would modify the lock's state so that it appeared as if the next thread in the wait queue already owned it. The new owner thread would then simply wake up, typically via the releaser signaling an event, and the owner would find that it already owned the lock, proceeding happily. No other thread can "sneak in" between the time the new owner is woken and the time that it is actually scheduled for execution with this design.
While this sounds nice and, well, fair, it exacerbates convoys. Why? Because it effectively extends the lock hold time by the communication and context switch latency required to wake and reschedule the new owner. Context switches on Windows are anything but cheap, and tend to cost anywhere from 4,000 to 10,000+ cycles. Assume C represents the cost of a context switch. Then with a truly fair lock, the lock will be in an intermediary handed off state, with no thread actually running code under its protection, for about C cycles. It's actually worse than this. Assuming the system is busy, a thread that is woken just goes to the end of the OS's thread scheduler queue, and is required to wait until it gets allocated a timeslice in which to execute. This can make C much larger in practice. And of course on highly loaded systems the condition worsens, which can add insult to injury (as we see momentarily).
To cope with the possibility of a scheduling delay, Windows uses something called a priority boost for any thread waiting on an auto-reset event (or that owns a window): the boost temporarily increases the target thread's priority which subsequently decays after it gets scheduled. Assuming no other high priority threads are runnable, this ensures the latency is very close to C. But C's still pretty darned big...
To illustrate the problem with the fair handoff scheme, imagine we have a lock L for which a new thread arrives every 2,000 cycles. Each such thread runs for 1,000 cycles under the protection of the lock. No problem, right? On average, the lock will be held for 1,000 cycles, unheld for 1,000 cycles, and so on. Assuming the arrival rate is somewhat random, but statistically averages out to the values mentioned, then occasionally we might get some contention, requiring waiting. But for every contentious acquire, there should also be a big gap in time where there are no owners (or where wait queues can be drained). A system with these characteristics should balance out well. It should survive until threads begin arriving at a frequency of more than 1,000 cycles, give or take some epsilon, which is actually a doubling in throughput. It's not even close to capacity. (Real systems depend on many more factors than this simplistic view, but you get the point.)
As soon as the lock is fair, however, this scheme quickly becomes untenable and will come to a grinding halt. Imagine that thread T0 acquires L at cycle 0; if it just so happens that T1 tries to acquire it at cycle 500, then T1 will have to wait. Remember, on average, the arrival rate is 1,000 cycles, but that's just an average. We expect the occasional wait to occur. This wait, unfortunately, causes a domino effect from which the system will never recover. T0 then releases L at cycle 1,000, as expected, handing off ownership to T1; sadly, T1 doesn't actually start running inside the lock until 5,000 (assuming 4,000 for C, and assuming no scheduling delay); in the 4,000 cycles it took for T1 to wake back up and start running, we expect 2 new threads will have arrived on average; these threads would see L as owned by T1 and respond by waiting. By the time those threads execute, another 10,000 cycles (2*(C+1,000)) will have passed, and another 4 threads will have begun waiting. And so on. This process repeats indefinitely, the requests pile up (hopefully with a bound), and disaster strikes. The system simply won't scale this way.
If you remove the strict fairness policy, however, the system scales. And that, my friends, is why all of the locks in Windows are now unfair.
Of course, Windows locks are still a teensy bit fair. The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues. (APCs regularly disturb this ordering -- a topic for another day -- which actually calls into question the merit of the original design goal of attaining fairness in the first place.) Now when a lock becomes unowned, a FIFO waking algorithm is still used, but the lock is immediately marked as unavailable. Another thread can sneak in and take the lock before the woken thread is even scheduled (although priority boosting is still, somewhat questionably, in the system). If another thread steals the lock, the woken thread may subsequently have to rewait, meaning it must go to the back of the queue, again disturbing the nice FIFO ordering.
The change to unfair locks clearly has the risk of leading to starvation. But, statistically speaking, timing in concurrent systems tends to be so volatile that each thread will eventually get its turn to run, probabilistically speaking. Many more programs would suffer from the convoy problems resulting from fair locks than would notice starvation happening in production systems as a result of unfair locks.
 Wednesday, December 06, 2006
I took this past week off so that I could work on my book. Well, I'm happy to report that I've been successfully writing like a madman, averaging around 15-20 solid pages per day. I still have a long way to go, but I'm getting more confident with the passing of each day that this book will be... well... a book that I'd actually like to sit down and read.
[Update: 12/7/2006: Correction made -- the CLR's JITs do not generate manual alignment code. Instead, we defer to the costly OS handler for alignment fixups.]
In the process of writing the section on data alignment, I realized there is very little documentation on the alignment policy used by the CLR. This is in contrast to Kang Su Gatlin's wonderful MSDN treatise on the subject for VC++, which leaves absolutely nothing hidden in the closet. Well, I still don't have all of the answers for you. Sorry. You'll have to wait for the book. But in the meantime, I've discovered that there's a myth that deserves a little debunking.
In the MSDN documentation for InterlockedCompareExchange64, it says:
"The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems."
I've also heard and read this from other various sources. I've heard, for example, that LOCK CMPXCHG8B will still do a load/compare/store sequence, but that, if the address isn't 8-byte aligned, the instruction will not be atomic. This would lead to sporadic atomicity failures, probably even more difficult to track down than a typical race. Given that the CLR doesn't faithfully align 64-bit data types on 8-byte boundaries (as we'll see momentarily), I suddenly feared that Interlocked.CompareExchange(ref Int64, ...) was HORRIBLY broken. Without an MP machine at home, I couldn't test this out, so I decided to do a little digging.
In the manuals for many AMD processors and older Intel X86 processors, I found no reference to CMPXCHG8B requiring an aligned address. What I did find, however, in the Intel 64-bit and IA32 System Programmer's Manual Part A was the following (emphasis mine):
"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:
- Any boundary for an 8-bit access (locked or otherwise).
- 16-bit boundary for locked word accesses.
- 32-bit boundary for locked doubleword accesses.
- 64-bit boundary for locked quadword accesses."
If I'm reading that right, this means the common wisdom around 8-byte alignment and LOCK CMPXCHG8B is hogwash. (Sadly, proving the absence of some flaky processor that crashes or has unpredictable behavior under certain circumstances is rather difficult, especially if someone at some point though it was true enough to put it in the MSDN documentation. If somebody out there knows of a real case -- and it's not just hear say -- please let me know!) If this is true of all X86 processors, it means that Interlocked.CompareExchange(ref Int64, ...) isn't horribly broken on the CLR after all. (Yaay.) It would have been broken... because, as I said earlier, the CLR does NOT align 64-bit values on 8-byte address boundaries...
Conversing briefly with Simon Hall over email, the dev that owns most (all?) of the type layout infrastructure, I've concluded the following: CLR type layout tries to eliminate all misaligned data layout through a combination of padding and field reordering. This means that data of >= 8-bytes on 64-bit always begins on 8-byte boundaries, and data of >= 4-bytes on 32-bit always begins (at least) on 4-byte boundaries. I say "at least" because emperical evidence shows that type layout actually aligns many 8-byte fields on 8-byte boundaries, even on 32-bit. (It turns out this doesn't matter much... neither the 32-bit JIT nor the GC respect this when allocating data.) In summary, the CLR ensures that no field that could have fit inside a single 4/8-byte segment ever spills across a boundary. The CLR also adds necessary padding to StructLayout(Sequential) types, while still preserving the original field ordering.
Therefore, the only cases where we end up with truly misaligned data is with StructLayout(Explicit) and StructLayout(Pack=...) types. For example the simple struct, struct S { [FieldOffset(6)] int i; }, will always be misaligned, on 32- and 64-bit alike. In such cases, our JIT simply generates the naive code and lets the OS perform misalignment fixups. This is actually rather costly, as Kang Su's aforementioned article explaines. We could have, like the VC++ compiler, generate the manual alignment code using a combination of loads and shifts, but my guess is that most of our customers don't care and will never notice.
To preserve the hard work done by type layout, our JITs and the GC guarantee that all allocated data is aligned on at least 4-byte (on 32-bit) or 8-byte (on 64-bit) boundaries. I say "at least" once again because I know, for example, that VC++ aligns stack frames on 16-byte boundaries for 64-bit. I don't claim to understand why. We might do something similar.
Here's an interesting program that just prints out a few field addresses, and whether things are 8-byte aligned. You'll interestingly notice that the int/long fields that are adjacent to one another are padded with 4-bytes in between on 32- and 64-bit, but that the JIT and GC only align on 4-byte addresses on 32-bit. I presume this is so that the layout doesn't have to change between 32- and 64-bit, but I can't say for sure:
using System; using System.Runtime.InteropServices;
class C { internal S s; }
struct S { internal int x; internal long y; internal byte z; }
unsafe class P { static void Main(string[] args) { int pad = 5; if (args.Length > 0) pad = int.Parse(args[0]);
Console.WriteLine("Field\t[Begin\tEnd)\t%8"); PrintStackS(pad); PrintHeapS(pad); }
static void PrintStackS(int x) { int * pad = stackalloc int[x]; S * s = stackalloc S[1]; PrintAddr(s); }
static void PrintHeapS(int x) { for (int i = 0; i < x; i++) new object(); C c = new C(); fixed (S * pcs = &c.s) { PrintAddr(pcs); } }
static unsafe void PrintAddr(S * ps) { ulong xa = new UIntPtr(&ps->x).ToUInt64(); Console.WriteLine("X\t{0:X}\t{1:X}\t{2}", xa, xa + sizeof(int), xa % 8); ulong ya = new UIntPtr(&ps->y).ToUInt64(); Console.WriteLine("Y\t{0:X}\t{1:X}\t{2}", ya, ya + sizeof(long), ya % 8); ulong za = new UIntPtr(&ps->z).ToUInt64(); Console.WriteLine("Z\t{0:X}\t{1:X}\t{2}", za, za + sizeof(byte), za % 8); } }
Running it with a few different inputs yields these results:
C:\Temp>8by Field [Begin End) %8 X 12F440 12F444 0 Y 12F448 12F450 0 Z 12F450 12F451 0 X 1273670 1273674 0 Y 1273678 1273680 0 Z 1273680 1273681 0
C:\Temp>8by 2 Field [Begin End) %8 X 12F44C 12F450 4 Y 12F454 12F45C 4 Z 12F45C 12F45D 4 X 1273664 1273668 4 Y 127366C 1273674 4 Z 1273674 1273675 4
If the CLR ever decides to support a 128 CAS operation, Interlocked.CompareExchange(ref Int128, ...), which I hope we will, we would need to guarantee alignment on 16-byte boundaries. In comparison to CMPXCHG8B, CMPXCHG16B does indeed fail when issued against an address that isn't 16-byte aligned. Instead of failing silently, a GP fault is generated. This is difficult, because not only must type layout respect the alignment (you can already get this with StructLayout(..., Pack=16)), but the JIT and the GC would also need to allocate correctly. Or, of course, you could over-allocate a chunk of data and shift the start pointer to the first aligned address inside of it. This might work for the stack, but for GC allocated data this is going to keep shifting around on you, and probably won't work very well. Before the CLR supports Interlocked.CompareExchange(ref Int128, ...), however, I suppose we ought to provide an Int128. :)
 Friday, December 01, 2006
I recently left the CLR team to join a new team focusing on parallelism on Microsoft's platform in the 3-10 year timeframe.
I am leading design and development of the Parallel LINQ (PLINQ) project that I alluded to here.
We're looking for supersmart technical people to join the team and help change the face of programming for anybody writing code on the CLR or VC++. PLINQ isn't the only project. Solid CS skills are a must, but you don't necessarily have to be a concurrency guru (right away).
These job postings have some detail:
http://members.microsoft.com/careers/search/results.aspx?FromCP=Y&JobCategoryCodeID=&JobLocationCodeID=&JobProductCodeID=&JobTitleCodeID=&Divisions=&TargetLevels=&Keywords=2012%20&JobCode=&ManagerAlias=&Interval=10
If something catches your eye, respond on the jobs site or send me your resume directly at joedu AT you-know-where DOT com (i.e. microsoft DOT com).
I'll of course still be blogging about everything concurrency related, so you won't notice much of a difference content-wise. And I'm still happy to answer any CLR related questions and help you find the right folks inside the team to listen to your feedback.
 Tuesday, November 28, 2006
There’s surprisingly little information out there on Windows keyed events. That’s probably because the feature is hidden inside the OS, not exposed publicly through the Windows SDK, and is used only by a small fraction of kernel- and user-mode OS features. The Windows Internals book makes brief mention of them, but only in passing on page 160. Since keyed events have been revamped markedly for Vista, a quick write up on them felt appropriate. I had the pleasure to chat at length today with the developer who designed and implemented the feature back in Windows XP. (I typically try to get work done during the day, but it seems the whole Microsoft campus was offline, aside from the two of us, due to the 1 or 2 inches of snow we received last evening). I doubt much of this will make it into my book, since knowing it all won’t necessarily help you write better concurrent software.
First here’s the quick backgrounder on why keyed events were even necessary. Before Windows 2000, critical sections, when initialized, were truly initialized. That meant their various dynamically allocated blobs of memory were allocated, contained the right bit patterns, and also that the per-section auto-reset event that is used to block and wake threads under contention was allocated and ready. Unfortunately, there are a finite number of kernel object HANDLEs per process, of which auto-reset events consume one, and each object consumes some amount of non-pageable pool memory. It also turns out lots of code uses critical sections. Around the Windows 2000 time frame, a lot more people were writing multithreaded code, primarily for server SMP programs. It’s relatively common now-a-days to have hundreds or thousands of them in a single process. And many critical sections are used only occasionally (or never at all), meaning the auto-reset event often isn't even necessary! Aside from the auto-reset event, the entire critical section is pageable and has no impact on a fixed size resource.
This was a problem, and had big scalability impacts. So starting with Windows 2000, the kernel team decided that allocation of the event would be delayed until it’s first needed. That means EnterCriticalSection had to, in response to the first contended acquire, allocate the event. But there’s a problem with this. If memory is low, or the number of HANDLEs in the process had been exhausted, this lazy allocation would actually fail. Suddenly EnterCriticalSection, which would never have failed previously (stack overflow aside), could throw an exception. What’s worse, you couldn’t really recover from these exceptions: the CRITICAL_SECTION data structure was left in an unusable and damaged state. But wait, it gets worse. I’m told there was a sizeable cleanup initiated that involved filing many, many bugs to fix code that used EnterCriticalSection throughout the Windows and related code-bases. Unfortunately, then people realized that LeaveCriticalSection could also fail under some even more obscure circumstances. (If EnterCriticalSection fails, throwing an out of memory exception, the subsequent LeaveCriticalSection would see the damaged state and think it could help out by allocating the event. This too could fail, causing even more corruption.) What to do? Wrap each call to EnterCriticalSection AND LeaveCriticalSection in its own separate __try/__except clause? And do precisely what in response, since the data structure was completely hosed anyway?
The bottom line was that no human being could possibly write reliable software using critical sections. Terminating the process, or isolating those bits of code that used such a damaged critical section somehow, were the only intelligible responses. Most of Microsoft's software, including the CRT and plenty of important applications, would probably not do anything, and remain busted.
Still, the people responsible for the original change believed strongly that the impacts to reliability were the lesser of two evils: that limiting Windows scalability so fundamentally was a complete non-starter. As a short-term solution, then, InitializeCriticalSectionAndSpinCount was hacked so that passing a dwSpinCount argument with its high-bit set, e.g. InitializeCriticalSectionAndSpinCount(&cs, 0x80000000 | realSpinCount), would revert to the pre-Windows 2000 behavior of pre-allocating the event object. No longer would low resources possibly cause exceptions out of EnterCriticalSection and LeaveCriticalSection. But all that code written to use the ordinary InitializeCriticalSection API was still vulnerable. And this just pushed the fundamental reliability vs. scalability decision back onto the poor developer. What a horrible choice to have to make.
This is when keyed events were born. They were added to Windows XP as a new kernel object type, and there is always one global event \KernelObjects\CritSecOutOfMemoryEvent, shared among all processes. There is no need for any of your code to initialize or create it—it’s always there and always available, regardless of the amount of resources on the machine. Having it there always adds a single HANDLE per process, which is a very small price to pay for the benefit that comes along with it. If you dump the handles with !handle in WinDbg, you’ll always see one of type KeyedEvent. Well, what does it do?
A keyed event allows threads to set or wait on it, just like an ordinary event. But having just a single, global event would be pretty useless, given that we’d like to somehow solve the original critical section problem, which effectively requires a single event per critical section. Here's where the ingenuity arises. When a thread waits on or sets the event, they specify a key. This key is just a pointer-sized value, and represents a unique identifier for the event in question. When a thread sets an event for key K, only a single thread that has begun waiting on K is woken (like an auto-reset event). Only waiters in the current process are woken, so K is effectively isolated between processes although there’s a global event. K is most often just a memory address. And there you go: you have an arbitrarily large number of events in the process (bounded by the addressable bytes in the system), but without the cost of allocating a true event object for each one.
By the way, if N waiters must be woken, the same key N is set multiple times, meaning for manual-reset-style sets, the list of waiters somehow needs to be tracked. (Although not an issue for critical sections, this comes up for SRWLs, noted below.) This gives rise to a subtle corner case: if a setter finds the wait list associated with K to be empty, it must wait for a thread to arrive. Yes, that means the thread setting the event can wait too. Why? Because it's just how keyed events work; without it, there would be extra synchronization needed to ensure a waiter didn't record that it was about to wait (e.g. in the critical section bits), the setter to see this and set the keyed event (and leave), and lastly the waiter to actually get around to waiting on the keyed event. This would lead to a missed pulse, and possibly deadlock, if it weren't for the current behavior.
So you can probably imagine how this solves the original problem. When a critical section finds that it can’t allocate a dedicated event due to low resources, it will just wait and set the keyed event, using the critical section’s address in memory as the key K. You might think: well, gosh, with this nifty new keyed events thingamajiggit, why didn’t they get rid of per-critical section events entirely? I did at least.
There are admittedly some drawbacks to keyed events. First and foremost, the implementation in Windows XP was not the most efficient one. It maintained the wait list as a linked list, so finding and setting a key required an O(n) traversal. Here n is the number of threads waiting globally in the system. The head of the list is in the keyed event object itself, and entries in the linked list are threaded by reusing a chunk of memory on the waiting thread’s ETHREAD for forward- and back-links—cleverly avoiding any dynamic allocation whatsoever (aside from the ETHREAD memory which is already allocated at thread creation time). But given that the event is shared physically across the entire machine, depending on a linked list like this for all critical sections globally would not have scaled very well at all. And this sharing can also result in contention that is difficult to explain, since threads have to use synchronization when accessing the list.
[Update: 2/2/2007: Neill, the dev I mentioned at the outset, just emailed me a correction to my original write-up. I had incorrectly stated that the forward- and back-links happen through TEB memory (which is user-mode); they actually use ETHREAD memory.]
But keyed events have improved quite a bit in Windows Vista. Instead of using a linked list, they now use a hash-table scheme, trading the possibility of hash collisions (and hence some amount of contention unpredictability) in favor of improved lookup performance. This improvement was good enough to use them as the sole event mechanism for the new “slim” reader/writer locks (SRWLs), condition variables, and one-time initialization APIs. Yes, you heard that right… None of these new features use traditional events under the covers. They use keyed events instead. This is in part why the new SRWLs are super light-weight, taking up only a pointer-sized bit of data and not requiring any event objects whatsoever. Critical sections still use auto-reset events, but I understand that this is primarily for AppCompat reasons. It’s admittedly nice when debugging to be able to grab hold of the HANDLE for the internal event and dump information about it, something you can’t do with keyed events, and plenty of customers depend on this information being there.
The improvement that keyed events offer to reliability and the alleviation of HANDLE and non-pageable pressure is overall a very welcome one, and one that will undoubtedly pave the way for new synchronization OS features in the future. I personally hope that one day they are made available to user-mode code through the Windows SDK.
 Saturday, November 18, 2006
I was surprised to find out that attempting to acquire an orphaned native "slim" reader/writer lock (SRWL) on the shutdown path hangs on Windows Vista. Unlike orphaned critical section acquisitions during shutdown on Windows -- which, in Vista cause the process to terminate immediately, and pre-Vista enjoyed "weakening" to avoid deadlocking at the risk of seeing corrupt state -- SRWL's AcquireSRWLockXXX methods are not shutdown aware.
This is actually pretty dangerous, and effectively means you should stay as far away from SRWLs during shutdown as possible. Avoiding synchronization and any sort of cross-thread coordination in DllMain is generally a good rule of thumb anyway, since it runs under the protection of the loader lock, has to tolerate very harsh conditions, and often runs w/out the presence of other active threads.
But this means something even stronger and sets SRWLs distinctly apart from Win32 critical sections. If you're writing a reusable native library whose functionality somebody might conceivably want to use on the shutdown path, you really ought not to be using SRLWs internally. If app developers don't realize you employ internal SRWL synchronization, they might call you and, every so often when the stars align, their users will experience a random hang during shutdown. Library authors might consider giving TryXXX variants of their APIs so that developers can at least deal with the case in which a SRWL has been orphaned.
Notice that hanging is similar to managed code's approach to lock acquisitions during shutdown. There's a big difference, though: the CLR anoints a shutdown watchdog thread that kills the process after 2 seconds when a hang occurs, whereas native code won't. The native hang persists indefinitely.
I'd be a much happier guy if SRWLs mirrored the new Vista orphaned critical section shutdown behavior.
 Sunday, November 12, 2006
I just read Dijkstra’s "My recollection of operating system design", note #1303. He describes issues with the design and implementation of THE operating system, written several years afterward in retrospect. Theme-wise, he focuses on concurrency, resource management and scheduling, fault tolerance, and real-time considerations for fairness. I found the paper to be quite fascinating and easy to read. I enjoyed the theme of concurrency woven throughout, particularly reading about both the failed and successful approaches. The original hand-written note can be found here.
Here are some choice quotes that I was drawn to with some comments by yours truly:
"... [with this] we have achieved CCC (= Completely Concealed Concurrency) in the sense that our two machines are now functionality equivalent in the sense that, fed with the same program, they will produce exactly the same output. As long as speed of execution is unobservable, it is impossible to determine which of the two machines did execute your program." (pp. 5)
[Comment: Although the details are quite different, I was struck by CCC’s goal of isolation among tasks running concurrently in the system and its resemblance to STM, at least in terms of overarching design goals.]
"Thus CCC improved the efficiency ... but the comfortable invisibility of concurrency had a high price ... To maximize throughput, we would like each reader to read at its maximum speed most of the time ... but with CCC, the calculator has no means of identifying that input stream: the invisibility of concurrency has made the notion of "first-come-first-served" inapplicable. The moral of the story was that, essentially for the sake of efficiency, concurrency should become somewhat visible. It became so, and then, all hell broke loose." (pp. 7-8)
[Comment: Again, STM often requires breaking this isolation to improve scalability of certain data structures, a la open nesting (see Moss ’06). Although there isn’t as much practical experience with such things, it is quite clear that open nesting can certainly cause all hell to break loose (see Agrawal, et. al, '06).)]
"In retrospect, the problems were in roughly three areas: (i) The basic mechanics of switching the central processor from one task to another, (ii) The strategy for scheduling the central processor and the scope of its commitment, (iii) The logical problems that emerge when a number of resources are shared by a number of competitors ... Those years taught me that the ability to detect timely that some theory is needed is crucial for successful software design." (pp. 9)
"We learned to distinguish between "essential urgency," where failure to be in time would derail the computation, and "moral urgency," where the failure would only hurt the efficiency... [this] taught us that in resource allocation one could (and therefore should) separate the concerns of necessity and of desirability." (pp. 18)
[Comment: This is a great lesson to learn when it comes to reliability in general. There is a large difference between statistically reliable and completely reliable. All software is statistically reliable to a degree, it’s only a matter of the statistical details, in particular the statistical frequency of failure, whereas very little software, at least on commercial non-real-time systems like Windows, is completely reliable, in other words, cannot fail.]
"With the symmetry between CPU and peripheral clearly understood, symmetric synchronizing primitives could not fail to be conceived and I introduced the operations P and V on semaphores. Initially my semaphores were binary ... When I showed this to Bram and Carel, Carel immediately suggested to generalize the binary semaphore to a natural semaphore ... How valuable this generalization was we would soon discover." (pp. 23-24)
"Many peripherals could be designed in such a way that they did not confront the CPU with real-time obligations ... but the problem was, that these same devices would often now confront the CPU with situations of "moral urgency" ... The solution was to allow the CPU to build up a queue of commands and to enable the channel to switch without CPU intervention to the next command in the queue. Synchronization was controlled by two counting semaphores in what we now know as the producer/consumer arrangement ..." (pp. 24)
[Comment: Obviously when scheduling concurrent tasks, e.g. in a user-mode scheduler, one has to worry about fairness and prioritization too. His writings are generally applicable.]
"In retrospect I am sorry that I postponed widespread publication until 1967 and think that I would have served mankind better if I had enabled The Evil One to improve its product." (pp. 27)
[Comment: I had to laugh at this. Dijkstra wrote this regarding his design for semaphores and synchronizing channels, referring to IBM as "The Evil One" and IBM/360 as "its product". Times have changed things, but not by much: Microsoft is now, I suppose, "The Evil One," at least for many people.]
"For economic reasons one wants to avoid in a multiprogramming system the so-called "busy form of waiting," in which the central processor is unproductively engaged in the waiting cycle of a temporarily stopped program, for it is better to grant the processor to a program that may actually proceed." (pp. 28)
"Halfway the multiprogramming project in Eindhoven I realized that we would have been in grave difficulties had we not seen in time the possibility of definitely unintended deadlocks. From that experience we concluded that a successful systems designer should recognize as early as possible situations in which some theory was needed. In our case we needed enough theory about deadlocks and their prevention to develop what became known as "the banker's algorithm," without which the multiprogramming system would only have been possible in a very crude form." (pp. 36)
I hope at least one other person out there finds this interesting. We've certainly made progress over the past 40+ years, although maybe not as much as we think. As Dijkstra says in this note several times, perhaps one of the largest improvements in the state of the art since he worked on THE has been the development of terminology. It's incredibly hard to build on top of prior work if you haven't the terms to talk and reason about what has worked and what has failed.
 Thursday, November 09, 2006
The CLR tried to add support for fibers in Whidbey. This was done in response to SQL Server Yukon hosting the runtime in process, aka SQLCLR. Eventually, mostly due to schedule pressure and a long stress bug tail related to fiber-mode, we threw up our hands and declared it unsupported. Given the choice between fixing mainline stress bugs (which almost exclusively use the unhosted CLR, meaning OS threads) and fixing fiber-related stress bugs, the choice was a fairly straightforward one. This impacts SQL customers that want to run in fiber mode, but there are much fewer of those than those who want to run in thread mode.
Perhaps the biggest thing we did to support fibers intrinsically in the runtime was to decouple the CLR thread object from the physical OS thread. Since most managed code accesses thread-specific state through this façade, we are able to redirect calls to threads or fibers as appropriate. And we of course plumbed the EE to call out to hosts so they can perform task management at various points in the code, enabling a non-preemptive scheduler to do its job. When a CLR host with a registered TaskManager object is detected, we defer many tasks to it that we’d ordinarily implement with OS calls. For example, instead of just creating a new OS thread, we will call out through the TaskManager interface so that the thread can use a fiber if it wishes.
We do various other things of interest:
- Because the CLR thread object is per-fiber, any information hanging off of it is also per-fiber. Thread.ManagedThreadId returns a stable ID that flows around with the CLR thread. It is not dependent on the identity of the physical OS thread, which means using it implies no form of affinity. Different fibers running on the same thread return different IDs. Impersonation and locale is also carried around with the fiber instead of the thread. This also ensures we can properly walk stacks, propagate exceptions, and report all of the active roots held on stack frames (for all fibers) to the GC.
- Managed TLS is stored in FLS if a fiber is being used. This includes the ThreadStaticAttribute and Thread.GetData and Thread.SetData routines. We avoid introducing thread affinity when these APIs are used.
- Any time we block in managed code or at most places in the runtime, we call out to the host so that it may SwitchToFiber. This includes calls to WaitHandle.WaitOne, contentious Monitor.Enters, Thread.Sleep, and Thread.Join, as well as any other APIs that use those internally. Some managed code blocks by P/Invoking, either intentionally or unintentionally, which leaves us helpless. The sockets classes in Whidbey, for instance, make possibly-blocking calls to Win32. These should really be cleaned up. Not only does it prevent us from switching in fiber mode, but it also prevents us from pumping the message queue on an STA thread. Apps do this too, such as P/Invoking to MsgWaitForMultipleObjects in order to do some custom message pumping code. The lack of coordination with blocking in the kernel also makes it way too easy to accidentally forfeit an entire CPU for lengthy periods of time.
- We do some things during a fiber switch to shuffle data in and out of TLS. This includes copying the current thread object pointer and AppDomain index from FLS to TLS, for example, as well as doing general book-keeping that is used by the internal fiber switching routines (SwitchIn and SwitchOut).
- Our CLR internal critical sections coordinate with the host. Anytime we create or wait on an event, it is a thin wrapper that calls out to the host. This meant sacrificing some freedom around waiting, like doing away with WaitForMultipleObjectsEx with WAIT_ANY and WAIT_ALL, but ensures seamless integration with a fiber-mode host.
- All thread creation, aborts, and joins are host aware, and call out to the host so they can ensure these events are processed correctly given an alternative scheduling mechanism.
None of this logic kicks in if fibers are used underneath the CLR. It all requires close coordination between the host which is doing user-mode scheduling and the CLR which is executing the code running on those fibers. If you call into managed code on a thread that was converted to a fiber, and then later switch fibers without involvement w/ the CLR, things will break badly. Our stack walks and exception propagation will rely on the wrong fiber’s stack, the GC will fail to find roots for stacks that aren’t live on threads, among many, many other things.
Important areas of the BCL and runtime that can introduce thread affinity, then make a call that might block, and later release thread affinity—such as the acquisition and release of an OS CRITICAL_SECTION or Mutex—have been annotated with calls to Thread.BeginThreadAffinity and Thread.EndThreadAffinity. These APIs call out to the host who maintains a recursion counter to track regions of affinity. If a blocking operation happens inside such a region (i.e. count > 0), the host should avoid rescheduling another fiber on the current thread and/or moving the current fiber to another thread. This can create CPU stalls, so we try to avoid it, but is better than the consequence of ignoring the affinity.
In reality, there is little code today that actually uses these APIs. Large swaths of the .NET Framework have not yet been modified to use these calls and thus remain unprotected. We inherit a lot of the affinity problems from Win32. This can have a dramatic impact on reliability and correctness when used in a fiber-mode host. Switching a fiber that has acquired OS thread affinity can result in data being accidentally shared between units of work (like the ownership of a lock) or movement of work to a separate thread (which then expects to find some TLS, but is surprised when it isn’t there). Both are very bad. If we were serious about supporting fibers underneath managed code, we really ought to do an audit of the libraries to find any dangerous unmarked P/Invokes or OS thread affinity.
Spin loops without going through the user-mode scheduler first potentially wastefully burn CPU cycles. A lot of the .NET Framework and some of the CLR itself spins without host coordination. While not disastrous, presuming they all fall back to waiting eventually, this can have a negative impact on performance and scalability.
The 2.0 CLR’s policy in response to stack overflow is to FailFast the whole process. Too much of Win32 is unreliable in the face of overflow to try and continue running. With fibers in the picture it might be attractive to reserve smaller stacks since presumably the smaller work items will need less. And you're apt to have a lot more of them. This is a dangerous game to play. This trades off some amount of committed memory for an increased chance of overflowing the stack, an event that is clearly catastrophic.
Fibers and debuggers don’t interact well today either. Most rely on Win32 CONTEXTs pulled from the OS thread, in a fiber-unaware way. Depending on the frequency at which it resamples the context, this can get out of sync in the face of fiber switches. Even if you’ve suspended all threads, you’ll not be able to peer into the stacks of fibers that aren’t currently scheduled. FuncEval and EnC also depend on thread suspension and coordination in a way that makes it hard to predict will happen when fibers are added to the mix. A lot of the debugging libraries we have, such as System.Diagnostics, are also not fiber-aware and may yield surprising answers to API calls.
In the end, remember that we decided to cut fiber support because of stress bugs. Most of these stress bugs wouldn’t have actually blocked the simple, short-running scenarios, but would have plagued a long-running host like SQL Server. The ICLRTask::SwitchOut API was cut, which is unfortunate: it means you can’t switch out a task while it is running, which effectively makes it impossible to build a fiber-mode host on the 2.0 RTM CLR. Thankfully, re-enabling it (for those playing w/ SSCLI) would be a somewhat trivial exercise.
 Wednesday, November 01, 2006
People often ask whether they should use EventWaitHandle objects or the Monitor.Wait, Pulse, and PulseAll methods for synchronization. There is no simple answer to this question; although, as with most software problems, it can be summarized as: It depends.
EventWaitHandle comes in two flavors, auto- and manual-reset. EventWaitHandle subclasses WaitHandle and offers two subclasses for convenience: AutoResetEvent and ManualResetEvent. These are just thin wrappers on top of the CreateEvent and related APIs in Win32. The differences are deceivingly simple.
Auto-reset, when signaled with the EventWaitHandle.Set API (kernel32!SetEvent internally), allows one thread to witness the signal before the event automatically transitions back to the unsignaled state. If there are any waiting threads, one will be chosen and unblocked. The waiting threads are maintained in a FIFO queue, but it’s not strictly FIFO for the same reasons very few things on Windows are FIFO: various events, like device IO completion, kernel-, and user-mode APCs can wake a thread temporarily, removing it and then requeuing it in the wait queue. If a thread is constantly woken to process device IO, it might be starved indefinitely (if the queue is long). If no threads were waiting at the time of this signal, the next thread to wait on the event will not block, and instead just moves the event back to the unsignaled state and returns. This is all done atomically so you are guaranteed only one thread will ever witness a signal.
Manual-reset, when signaled, wakes all threads that are waiting on it. As its name implies, it must be manually reset with the EventWaitHandle.Reset API (kernel32!ResetEvent internally). While the event remains signaled, any threads that try to wait on it will not block and just return from the wait function immediately.
Signaling an already-signaled event has no effect. It’s easy to get into trouble in this area with auto-reset events. If you signal the event N times, expecting N threads to see the signals and do some amount of processing, you’re betting the farm on a race condition, for instance. This is easy to get wrong, very easily leading to deadlocks. If you have a shared buffer, an attractive design might to simply call Set on the event each time a new item arrives. The thinking might be that, while threads might not be sleeping, at least one thread will wake up per item and process it. This thinking is dead wrong and can get you into a quagmire. The waking threads would need to contain a loop ‘while (!empty) { … }’ before going back to sleep, otherwise one of the signals may go missing. If production of new items depended on consumers making forward progress, the program might lock up. And it entirely depends on consumers going to sleep in the first place which, if producers typically produce faster than consumers, might only happen occassionally (and hence not show up during testing).
Monitor.Wait, Pulse, and PulseAll are very different from their close Win32 event cousins. They are much more akin to the new Windows Vista APIs, SleepConditionVariableCS and SleepConditionVariableSRW. Wait will exit the monitor (lock) for the object in question until another thread pulses the object. Once the thread wakes up, it immediately reacquires the lock on the object. Pulse wakes up one waiting thread, in FIFO order, while PulseAll wakes all waiting threads. Notice that the monitor has no residual effect from the pulse; that is, if no threads were waiting at the exact moment of a pulse, there is no evidence that it actually happened. This leads to the notorious missed-pulse problem. To solve it, you just have to ensure that the wait condition is always tested (in a loop) around the Wait.
Note that Wait does something a little dirty. It releases an arbitrary amount of recursive acquisitions. As soon as it does this, other threads can acquire the monitor. If you are not careful with recursion, you can end up Waiting with broken invariants, accidentally letting other threads peer into this state. This is just another bit of evidence that recursion is something that is best avoided.
The first major consideration to make when selecting between EventWaitHandle and Monitor’s methods is whether you need a stand-alone event or a real condition variable that is integrated with locks. That is, the two have very distinct and disjoint feature sets. Win32 events also let you do more sophisticated waits, with the WaitHandle.WaitAll or WaitAny APIs, allowing you to wait for all of the events or a single event in an array to be signaled. So which feature-set do you want? That Win32 events give you events without the synchronization looks simpler, but is probably misleading. You typically need to manage mutual exclusion in some way with events, too, so you’ll end up using a monitor, ReaderWriterLock, Mutex, etc. in addition. The one benefit is that you have more control over locking and can be more conscious of certain policies like recursion. The fact that a Win32 event “sticks” in the signaled state can also be useful to avoid the missed-pulse problem, although with some discipline it is easily avoided with monitors too. Often people end up building a sticky event with a bool and monitor pair. One-time or lazy initialization is an example of this.
Win32 events are fairly heavyweight too. Each one consumes some amount of kernel memory, and setting, resetting, and waiting on one incurs somewhat expensive kernel transitions. In managed code, simply allocating one increases pressure on the GC because of yet another finalizable object to track. In a well-tuned system, you have to manage events carefully, which usually means Disposing of them far before the GC’s finalizer thread has a chance to see one. Even cleverer systems will pool them to amortize the cost of creating and closing the events. This is a double-edged sword. The V1.1 ReaderWriterLock we shipped in the CLR pools events. In my opinion, this is a little too clever and myopic: a good solution would pool events across many components in the process, not just ReaderWriterLocks. Imagine if each type we shipped tried to maintain its own pool of events.
As you may have guessed, Monitor actually uses Windows events underneath it all. Each CLR thread has a manual-reset event, allocated when it is created (or lazily when the thread first wanders into managed code). When a Wait is issued, this per-thread event is stuck on the tail of a linked list associated with the target object’s sync block. We can use a single event per thread since a thread can only ever be waiting for a single object at a given time. (You can’t do a WAIT_ANY or WAIT_ALL on monitors.) The thread then releases the lock on the object (accounting for any recursion), waits on this event, and then reacquires the lock on the object (again, accounting for any recursion). When a Pulse is issued, the head of the object’s linked list of waiters is popped off and its associated event is signaled. Similarly, PulseAll clears the entire linked list and signals all of the events. Notice I said that Pulse operates on the head of the list: we use a strict FIFO ordering (as of 2.0). And since we don’t remove the list entries in the face of an APC, there is no risk of perturbing the FIFO ordering, aside from premature exits due to thread aborts or interruptions.
There are a few things to note about this.
The signals on the thread events happen while the signaler still owns the lock. In other words, the thread calling Pulse(o) will still own the lock on o for some time after the call, yet the thread that called Wait(o) will immediately wake after the Pulse and try to acquire this lock (failing and waiting). Yes, all woken threads have to immediately wait when attempting to reacquire this lock, which is actually pretty crappy. If you’re using PulseAll, this could have a noticeable (and in some cases, dramatic) impact on scalability. Windows uses priority boosts to “hand off” the current time-slice to the recipient of an event signal, similar to what occurs when a GUI event is enqueued into a thread’s message queue, which just exacerbates this effect. You’re just about guaranteed that there will be a scheduler ping-pong effect immediately after a pulse. I am honestly surprised we don’t enqueue the Pulse/PulseAll calls on the object’s sync-block, processing them only once the lock has been exited. Yet another benefit to using events is that you can devise algorithms that signal events outside of critical sections, often leading to improved scalability.
We also don’t do any form of spinning. Events are generally speaking very volatile in terms of timing, so spinning only buys you something if you know that the occurrence of events are frequent enough that wait-avoidance will pay off. In many low-level concurrent algorithms, this is a worth-while technique, just as with spinning while trying to acquire a CRITICAL_SECTION in Win32 (see InitializeCriticalSectionWithSpinCount and SetCriticalSectionSpinCount) can improve scalability by avoiding expensive kernel-mode transitions due to waiting. In fact, it’s conceivable that somebody would want to use an event that never did a real wait, particularly if you’re dealing with a very tiny race condition that is expected to arise very infrequently. This is also dangerous, however, as it can lead to those rare CPU spikes that are almost impossible to debug and discern from a crash dump. This is pretty simple to build, but very hard to fine-tune so that it performs adequately.
So in the end, I will simply fall back to my original answer: It depends.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 26 | 27 | 28 | 29 | 30 | 1 | 2 | | 3 | 4 | 5 | 6 | 7 | 8 | 9 | | 10 | 11 | 12 | 13 | 14 | 15 | 16 | | 17 | 18 | 19 | 20 | 21 | 22 | 23 | | 24 | 25 | 26 | 27 | 28 | 29 | 30 | | 31 | 1 | 2 | 3 | 4 | 5 | 6 |
Browse by Category:
Notables:
|