User Mode Scheduling in Windows 7

Don’t use threads. Or more precisely, don’t over-use them. It’s one of the first things fledgling programmers learn after they start using threads, because threading involves a lot of overhead. In short, using more threads may improve concurrency, but it will give you less overall throughput as more processing goes into simply managing the threads instead of letting them run. So programmers learn to use threads sparingly.

When normal threads run out of time, or block on something like a mutex or I/O, they hand off control to the operating system kernel. The kernel then finds a new thread to run, and switches back to user mode to run it. This context switching is what User Mode Scheduling looks to alleviate.

User Mode Scheduling can be thought of as a cross between threads and thread pools. An application creates one or more UMS scheduler threads—typically one for each processor. It then creates several UMS worker threads for each scheduler thread. The worker threads are the ones that run your actual code. Whenever a worker thread runs out of time, it is put on the end of its scheduler thread’s queue. If a worker thread blocks, it is put on a waiting list to be re-queued by the kernel when whatever it was waiting on finishes. The scheduler thread then takes the worker thread from the top of the queue and starts running it. Like the name suggests, this happens entirely in user mode, avoiding the expensive user->kernel->user-mode transitions. Letting each thread run for exactly as long as it needs helps to solve the throughput problem. Work is only put into managing threads when absolutely necessary instead of in ever smaller time slices, leaving more time to run your actual code.
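The scheduler side of this can be sketched with the documented Win32 UMS calls (`CreateUmsCompletionList`, `EnterUmsSchedulingMode`, `DequeueUmsCompletionListItems`, `ExecuteUmsThread`). This is a minimal, Windows 7 x64-only sketch, not a complete implementation: error handling is omitted, the fixed-size ready array and the `RunSchedulerThread` name are illustrative, and worker creation is not shown.

```c
#include <windows.h>

// Per-scheduler-thread state; a real implementation would make these
// thread-local, with one set per scheduler thread.
static PUMS_COMPLETION_LIST g_completion_list;
static PUMS_CONTEXT g_ready[64];   // illustrative fixed-size ready queue
static int g_ready_count;

// Invoked in user mode when the scheduler starts and, again from the top,
// each time a worker blocks in the kernel or yields.
static VOID CALLBACK SchedulerProc(UMS_SCHEDULER_REASON reason,
                                   ULONG_PTR activation_payload,
                                   PVOID scheduler_param)
{
    for (;;) {
        if (g_ready_count == 0) {
            // Wait for the kernel to hand back workers whose waits finished.
            PUMS_CONTEXT list;
            if (!DequeueUmsCompletionListItems(g_completion_list,
                                               INFINITE, &list))
                return;
            while (list != NULL && g_ready_count < 64) {
                g_ready[g_ready_count++] = list;
                list = GetNextUmsListItem(list);
            }
        }
        // Pick the next ready worker and switch to it entirely in user
        // mode; ExecuteUmsThread does not return if the switch succeeds.
        ExecuteUmsThread(g_ready[--g_ready_count]);
    }
}

// Turn the calling thread into a UMS scheduler thread; typically you
// would run one of these per processor.
static void RunSchedulerThread(void)
{
    UMS_SCHEDULER_STARTUP_INFO info = {0};
    CreateUmsCompletionList(&g_completion_list);
    info.UmsVersion     = UMS_VERSION;
    info.CompletionList = g_completion_list;
    info.SchedulerProc  = SchedulerProc;
    info.SchedulerParam = NULL;
    EnterUmsSchedulingMode(&info);
}
```

Workers can also give up the processor voluntarily with `UmsThreadYield`, which lands back in `SchedulerProc` the same way a blocking call does.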

A good side effect of this is that UMS threads also help to alleviate the cache thrashing problems typical in heavily-threaded applications. Setting aside your data sharing patterns, each thread still needs its own storage for stack space, processor context, and thread-local storage. Every time a context switch happens, some data may need to be pushed out of caches in order to load some kernel-mode code and the next thread’s data. By switching between threads less often, cache can be put to better use for the task at hand.

If you have ever had a chance to use some of the more esoteric APIs included with Windows, you might be wondering why we need UMS threads when we have fibers, which offer similar co-operative multitasking. Fibers have a lot of special exceptions. There are things that aren’t safe to do with them. Libraries that rely on thread-local storage, for instance, will likely walk all over themselves if used from within fibers. A UMS thread, on the other hand, is a full-fledged thread—they support TLS and have no real special caveats to keep in mind while using them.

I still wouldn’t count out thread pools just yet. UMS threads are still more expensive than a thread pool, and the large memory requirements of a thread still apply here, so things like per-client threads in internet daemons are still out of the question if you want to be massively scalable. More likely, UMS threads will be most useful for building thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given blocking tasks, and UMS threads will at least help keep their time slice usage optimal.
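A pool built on UMS would still need to create its workers specially: a worker is an ordinary thread created with a UMS context attached via the `PROC_THREAD_ATTRIBUTE_UMS_THREAD` attribute. A hedged sketch of that creation step, again Windows 7 x64-only, with error handling omitted and the `CreateUmsWorker` name being illustrative:

```c
#include <windows.h>

// Create a worker bound to the given completion list, using the documented
// dance: CreateUmsThreadContext plus PROC_THREAD_ATTRIBUTE_UMS_THREAD on
// CreateRemoteThreadEx.
static HANDLE CreateUmsWorker(PUMS_COMPLETION_LIST completion_list,
                              LPTHREAD_START_ROUTINE entry, PVOID arg)
{
    PUMS_CONTEXT context;
    UMS_CREATE_THREAD_ATTRIBUTES ums = {0};
    SIZE_T attr_size = 0;
    LPPROC_THREAD_ATTRIBUTE_LIST attrs;
    HANDLE thread;

    CreateUmsThreadContext(&context);

    ums.UmsVersion        = UMS_VERSION;
    ums.UmsContext        = context;
    ums.UmsCompletionList = completion_list;

    // Attribute lists are sized with a first call, then allocated and filled.
    InitializeProcThreadAttributeList(NULL, 1, 0, &attr_size);
    attrs = (LPPROC_THREAD_ATTRIBUTE_LIST)
        HeapAlloc(GetProcessHeap(), 0, attr_size);
    InitializeProcThreadAttributeList(attrs, 1, 0, &attr_size);
    UpdateProcThreadAttribute(attrs, 0, PROC_THREAD_ATTRIBUTE_UMS_THREAD,
                              &ums, sizeof(ums), NULL, NULL);

    // A new UMS worker does not run on its own; the kernel queues it to the
    // completion list for a scheduler thread to pick up and ExecuteUmsThread.
    thread = CreateRemoteThreadEx(GetCurrentProcess(), NULL, 0, entry, arg,
                                  0, attrs, NULL);

    DeleteProcThreadAttributeList(attrs);
    HeapFree(GetProcessHeap(), 0, attrs);
    return thread;
}
```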

From what I understand, the team behind Microsoft’s Concurrency Runtime, to be included with Visual C++ 2010, was one of the primary forces behind UMS threads. They worked very closely with the kernel folks to find the most scalable way to enable the super-parallel code that will be possible with the CR.

Posted on April 23, 2009 in Coding, Microsoft, Scalability, Threading, Windows 7