April 2009 Archives

The Password is Courage

I’m not a big anime watcher, usu­ally only falling back on it when I’m in a cod­ing slump and feel­ing re­ally bored. It was on one such oc­ca­sion that I picked up Soul Eater. It’s rare to get such a char­ac­ter-dri­ven story out of main­stream anime. The char­ac­ters, music, art de­sign, style… all of it clicked with me. I fell in love with it from the first episode.

I stopped watch­ing about a month be­fore the story ended so that I could see the final five episodes all in one go. Well, it ended late last month and I only just got around to watch­ing them. Maybe be­cause I was busy, maybe be­cause I didn’t want to admit to my­self that it was over. Lit­tle of both, prob­a­bly.

In the end it was a dis­ap­point­ment. Soul Eater could have been so much more, if only they had fo­cused on the char­ac­ters in­stead of de­volv­ing into a huge ac­tion bat­tle. I guess such an end­ing it is to be ex­pected from some­thing based on a shōnen manga, though from what I’ve heard it de­vi­ated quite a bit from the source, which is still on­go­ing. Maybe I’ll pick that up.

And another…

BSG got a shoutout in last night’s 30 Rock! Elisa (Salma Hayek) was in a shirt, and Frank (Judah Fried­lan­der) said “What the frak?!”.

Salma Hayek in her “What the frak?!” shirt

User Mode Scheduling in Windows 7

Don’t use threads. Or more pre­cisely, don’t over-use them. It’s one of the first thing fledg­ling pro­gram­mers learn after they start using threads. This is be­cause thread­ing in­volves a lot of over­head. In short, using more threads may im­prove con­cur­rency, but it will give you less over­all through­put as more pro­cess­ing is put into sim­ply man­ag­ing the threads in­stead of let­ting them run. So pro­gram­mers learn to use threads spar­ingly.

When nor­mal threads run out of time, or block on some­thing like a mutex or I/O, they hand off con­trol to the op­er­at­ing sys­tem ker­nel. The ker­nel then finds a new thread to run, and switches back to user-mode to run the thread. This con­text switch­ing is what User Mode Sched­ul­ing looks to al­le­vi­ate.

User Mode Sched­ul­ing can be thought of as a cross be­tween threads and thread pools. An ap­pli­ca­tion cre­ates one or more UMS sched­uler threads—typ­i­cally one for each proces­sor. It then cre­ates sev­eral UMS worker threads for each sched­uler thread. The worker threads are the ones that run your ac­tual code. When­ever a worker thread runs out of time, it is put on the end of its sched­uler thread’s queue. If a worker thread blocks, it is put on a wait­ing list to be re-queued by the ker­nel when what­ever it was wait­ing on fin­ishes. The sched­uler thread then takes the worker thread from the top of the queue and starts run­ning it. Like the name sug­gests, this hap­pens en­tirely in user-mode, avoid­ing the ex­pen­sive user->ker­nel->user-mode tran­si­tions. Let­ting each thread run for ex­actly as long as it needs helps to solve the through­put prob­lem. Work is only put into man­ag­ing threads when ab­solutely nec­es­sary in­stead of in ever smaller time slices, leav­ing more time to run your ac­tual code.

A good side ef­fect of this is UMS threads also help to al­le­vi­ate the cache thrash­ing prob­lems typ­i­cal in heav­ily-threaded ap­pli­ca­tions. For­get­ting your data shar­ing pat­terns, each thread still needs its own stor­age for stack space, proces­sor con­text, and thread-lo­cal stor­age. Every time a con­text switch hap­pens, some data may need to be pushed out of caches in order to load some ker­nel-mode code and the next thread’s data. By switch­ing be­tween threads less often, cache can be put to bet­ter use for the task at hand.

If you have ever had a chance to use some of the more es­o­teric APIs in­cluded with Win­dows, you might be won­der­ing why we need UMS threads when we have fibers which offer sim­i­lar co-op­er­a­tive mul­ti­task­ing. Fibers have a lot of spe­cial ex­cep­tions. There are things that aren’t safe to do with them. Li­braries that rely on thread-lo­cal stor­age, for in­stance, will likely walk all over them­selves if used from within fibers. A UMS thread on the other hand is a full fledged thread—they sup­port TLS and no have no real spe­cial things to keep in mind while using them.

I still wouldn’t count out thread pools just yet. UMS threads are still more ex­pen­sive than a thread pool and the large mem­ory re­quire­ments of a thread still apply here, so things like per-client threads in in­ter­net dae­mons are still out of the ques­tion if you want to be mas­sively scal­able. More likely, UMS threads will be most use­ful for build­ing thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given block­ing tasks, and UMS threads will at least help keep their time slice usage op­ti­mal.

From what I un­der­stand the team be­hind Mi­crosoft’s Con­cur­rency Run­time, to be in­cluded with Vi­sual C++ 2010, was one of the pri­mary forces be­hind UMS threads. They worked very closely with the ker­nel folks to find the most scal­able way to en­able the su­per-par­al­lel code that will be pos­si­ble with the CR.

QRJDQ Lives!

Quake Rocket Jump­ing done Quick is some­thing new. Think Quake done Quick, but for rocket jump­ing. Sim­ple huh? The first video, rjx­treme, is up and wait­ing to be im­proved!

CSI Cameos

Lots of BSG peo­ple in last week’s sci-fi fo­cused CSI (A Space Odd­ity). I spot­ted Grace Park (Boomer/Athena/Eight), Kate Ver­non (Ellen), Rekha Sharma (Tory), and Ron Moore (se­ries cre­ator). They also men­tioned tak­ing a decades-old TV show and up­dat­ing it to make a hit. Hehe.

Optimizing IP range searching in PeerGuardian

I was work­ing on some­thing com­pletely dif­fer­ent last night, when an el­e­gant idea came to mind on how to sig­nif­i­cantly speed up Peer­Guardian’s IP search­ing. It’s funny how an idea can just pop into the mind about a prob­lem that hasn’t been thought of in a long time.

Right now Peer­Guardian uses a bi­nary search to match IPs. This is al­ready pretty ef­fi­cient, run­ning in ⌈log 2 N⌉—so for 100,000 IP ranges, about 16 tests need to be done. This has the ad­di­tional ad­van­tage of hav­ing no mem­ory over­head.

My idea is to use a struc­ture sim­i­lar to a B+tree, pack­ing as many IPs and branch point­ers into a cache line as pos­si­ble. On today’s ar­chi­tec­tures, a cache line is typ­i­cally 64 bytes, so 8 IPs and 9 branch point­ers would fit on each node, mak­ing it only need to read about ⌈log 9 N⌉ nodes to find a match. So in order to find a match in 100,000 IP ranges, only about 5 nodes would need to be read.

CPUs al­ways read and cache data in blocks (a cache line), so an al­go­rithm that keeps this in mind to min­i­mize mem­ory reads and max­i­mize cache usage should be in­cred­i­bly fast. Even though this in­tro­duces sig­nif­i­cant over­head for branch point­ers (about 2x the stor­age would be re­quired), it should be far more ef­fi­cient over­all.

But this al­go­rithm im­proves in an­other way too: branch­ing. I’m talk­ing in terms of branch in­struc­tions, not the branch point­ers men­tioned above. The fewer branches code takes, the faster a su­per­scalar or pipelined CPU will be able to run your code. For this al­go­rithm, an en­tire node could be processed (that is, com­par­ing the IPs and de­ter­min­ing which branch node to go into) with zero branches using in­te­ger SSE2 (PCMPGTD, PMOVM­SKB), and bit-scan for­ward (BSF).

I can’t be sure how much of a speed dif­fer­ence this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on Peer­Guardian for quite a while, so I don’t know if this will ever make it into PG. We’re look­ing for a new coder with more time on their hands.

Fast & Furious

Fast & Furious

It’s a video game (Need 4 Speed) made into a movie (The Fast and the Fu­ri­ous) made into a video game movie (Fast & Fu­ri­ous).

Okay, maybe that’s not the of­fi­cial story, but that’s sure how it felt. Their weird use of GPS felt like an in-game HUD. It even had an “ n miles left” an­nouncer voice you might find in time tri­als. Var­i­ous scenes felt like the third-per­son view so com­mon in games like Need 4 Speed.

Ad­e­quately fun ac­tion, but over­all silly story and plenty of bad lines.

New site

New site with blog & pro­jects merged, plus a nicer de­sign.

Kendo and Ramen

Cherry Blossom Festival

Today I went down to Lit­tle Tokyo to grab some ramen for lunch, and hap­pened upon the Cherry Blos­som Fes­ti­val. I only stayed a lit­tle while, to see some sumo and kendo per­for­mances.

I got there late for the sumo un­for­tu­nately, but I did get to see a few matches where Dan Kalbfleisch wiped the floor with some other guys.

The kendo demon­stra­tion started with sen­sei Cary Yoshio Mi­zobe per­form­ing tameshi­giri—cut­ting a tatami omote with a katana. His stu­dents went on from there to show off their moves with shi­nai. Sen­sei Mi­zobe was ex­plain­ing one of the moves: tsuki, a stab to the throat ap­par­ently dif­fi­cult enough that he only lets his black belt stu­dents per­form it, to lessen the risk of not hav­ing enough pre­ci­sion and in­jur­ing the op­po­nents. He said he was hired to train Brit­tany Mur­phy to per­form it for her new movie, The Ramen Girl. The only prob­lem is, they wanted him to train her on this ad­vanced move in eight hours. His only ad­vice was to to­tally fake it out with cam­era tricks, or risk in­jury. Thought that was funny :)

Bear McCreary to score Capcom’s Dark Void

8-bit Dark Void

Bear Mc­Creary, com­poser of Bat­tlestar Galac­tica, let out the in­for­ma­tion today that he is scor­ing Cap­com’s new game, Dark Void. I re­mem­ber some videos of this com­ing out a while ago, from E3 maybe, and it looked like a pretty fun game but after so much time I had for­got­ten about it. These new videos make it look as awe­some as ever, though, and with Bear scor­ing it, you know the music will be amaz­ing!

If you like his music, con­sider post­ing at Cap­com where Bear is try­ing to get au­tho­riza­tion to re­lease a sound­track CD.