Toldain Talks

Because reading me sure beats working!


Toldain started as an EverQuest character. I've played him in EQ2, WoW, Vanguard, LOTRO, and Zork Online. And then EVE Online, where I'm 3 million years old, rather than my usual 3000. Currently I'm mostly playing DDO. But I still have fabulous red hair. In RL, I am a software developer who has worked on networked games, but not MMORPGs.

Monday, August 23, 2010

Geeking Out In Space

CCP Veritas has a post up describing how he and his colleagues tracked down some issues with module reactivation in EVE. During my life as a programmer, I've come to love stories about tracking down elusive bugs, and this is a really great example.

We'll start at the first thing we noticed when digging at it: the system responsible for telling the server when modules should be turned off or repeated would get minutes behind in processing when fleet fights happen while other systems remain reasonably responsive. This system, named Dogma, handles module activation/repeat/deactivation, as well as the actual effects of those modules. [...]

Tasks on the EVE servers use a time-sharing technique called cooperative multitasking which, in short, means that a task has to willingly yield execution to other tasks, otherwise it will run forever. In this case, it would seem the part of Dogma handling module repeat and deactivation was being too nice - yielding execution too much.

Looking at the code some, a theory emerged as to why. There was an error case that stuck out as odd - if an effect was supposed to be stopped or repeated, but the effect system itself didn't agree that it was time yet, the code would throw up its hands and give up. If that error case gets hit, the processing loop would yield to other systems early. A code comment was very reassuring though - this error was supposedly "rare."

The "rare" error happened 1.5 million times in the month of June, 2010 on TQ.
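The scheduling behavior described in the quote can be sketched with Python generators, where each `yield` hands control back to the scheduler. This is only an illustration, not CCP's code: `module_task`, `turns_to_drain`, the batch size, and the choice to re-queue an effect that is "not ready yet" are all assumptions made for the example.

```python
from collections import deque


def module_task(pending, processed, batch_size, is_ready):
    """Hypothetical stand-in for the repeat/deactivate loop.

    Handles up to batch_size effects per turn, then yields so other
    tasks can run. If an effect is not ready yet (the "rare" error
    case), it gives up the rest of its turn and yields early.
    """
    while pending:
        for _ in range(batch_size):
            if not pending:
                break
            effect = pending.popleft()
            if not is_ready(effect):
                # ASSUMPTION: the effect is re-queued for a later turn.
                pending.append(effect)
                break  # yield early, wasting the rest of this turn
            processed.append(effect)
        yield  # cooperative yield: every other task gets to run now


def turns_to_drain(n_effects, batch_size, not_ready_first_time):
    """Count how many scheduler turns it takes to process every effect."""
    pending = deque(range(n_effects))
    processed = []
    seen = set()

    def is_ready(effect):
        # Effects in not_ready_first_time fail once, then succeed.
        if effect in not_ready_first_time and effect not in seen:
            seen.add(effect)
            return False
        return True

    turns = 0
    for _ in module_task(pending, processed, batch_size, is_ready):
        turns += 1
    return turns, processed


# No early yields: 100 effects in batches of 10.
healthy_turns, healthy_done = turns_to_drain(100, 10, set())

# Every fifth effect trips the "not ready yet" case once.
degraded_turns, degraded_done = turns_to_drain(100, 10, set(range(0, 100, 5)))

print(healthy_turns, degraded_turns)
```

Both runs finish the same 100 effects, but the second needs noticeably more turns. In a real cooperative scheduler every extra turn means waiting for all the other tasks to run again, which is how a task that is "too nice" can fall minutes behind while everything else stays responsive.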

Great story, and an interesting followup. By all means, read it. After I read it, I did some calculations, presented below:

So, it seems to me that the problem was really quite rare. And it happened all the time.
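As a rough back-of-the-envelope sketch of how something can be both rare and constant: only the 1.5 million count and the month of June come from the quoted post. The total number of module cycles below (`ASSUMED_MODULE_CYCLES`) is a made-up placeholder, not a figure from CCP, and is there only to show the shape of the arithmetic.

```python
ERRORS_IN_JUNE = 1_500_000  # from the quoted post
DAYS_IN_JUNE = 30
SECONDS_PER_DAY = 24 * 60 * 60

errors_per_day = ERRORS_IN_JUNE / DAYS_IN_JUNE
errors_per_second = errors_per_day / SECONDS_PER_DAY

# ASSUMPTION: a hypothetical total of module cycles for the month,
# chosen purely for illustration. The real number is not in the post.
ASSUMED_MODULE_CYCLES = 10_000_000_000
error_fraction = ERRORS_IN_JUNE / ASSUMED_MODULE_CYCLES

print(f"{errors_per_day:,.0f} errors per day")
print(f"{errors_per_second:.2f} errors per second")
print(f"{error_fraction:.4%} of the assumed module cycles")
```

Under that assumed total, the error touches a tiny fraction of module cycles, yet it still fires tens of thousands of times a day, on average more than once every two seconds.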


Magson said...

Reminds me of something once said about "Lies, Damn Lies, and Statistics..."

Isn't it amazing how, when raw numbers get huge, even minuscule percentages (rare events) become commonplace? Kinda like the post office. They handle billions of pieces of mail and only misroute/lose something like 0.0017% of them, and yet "everyone knows" the post office is horrible at losing stuff. Sounds similar to this "rare error."

3:28 PM  
