Multithreading and job queues

17 Novembre, 20192 Dicembre, 2019 Daniele Carbone

Multithreading is one essential element in modern game programming, but it’s also the nightmare of many developers and the cause of countless bugs. The way i think about it is like the combination of one unsolvable problem and two solvable problems. The first one is a direct consequence of the nature of concurrency: without multithreading our code is deterministic, while after introducing concurrency the determinism is lost. We can design very sophisticated thread managers, but in the end it’s always the operating system that makes the call about the execution ordering, and we have no control over that. The main consequence of this problem is that it becomes very difficult to reason about your code. The second problem is represented by our necessity to decide which thread executes what task. This is of course solvable, but as many other solvable problems there are more efficient solutions than others. And finally, we have the issue of communicating between threads, particularly in order to handle synchronization and shared data. In this post i want to discuss one neat way in particular of implementing multithreading, one that hopefully solves the last two problems in an efficient way: the job queue (also known as work queue, or task system, or scheduler, or whatever, you get the point!).

A Fresh Perspective on Concurrency

Let’s start from the basic way you might want to deal with concurrency in a first place. You have a bunch of threads and a bunch of semi-independent systems in your engine that you want to parallelize. The most straightforward way to deal with the first solvable problem i mentioned before is to assign one thread to each system, and therefore you can create your AI thread, the audio thread, the debug system thread, the asset loading thread, etc.. However, if you care about performance and optimization you will discover that this system is actually very inefficient. Maybe your AI thread is always going to be full of work, but what about the audio thread, or the asset loading one? Probably your application is not going to import assets all the time, or to play hundreds of audio files together, and as a consequence you might have some thread that is starving for some additional work. Another way of thinking about who executes what, is to schedule the threads not according to the different conceptual systems that compose your engine, but according to the concept of priority. All the needed tasks can be sent into a queue with the desired priority level, without making differences about their own nature. The queue itself can be FIFO, meaning that tasks with the same priority start to be processed in order of arrival (but can finish in a complete different order of course), and circular to have wrapping for read and write indices. This system ensures that each thread has always some work to do, provided that the subdivision of priorities is actually balanced. However, we need to solve a new problem now! How can we be sure that the same task is not executed by multiple threads? This was not a possibility before, since the threads were organized by categories and the same task could only be executed by a single thread. The answer comes from special x64 instructions such as InterlockedCompareExchange, designed in order to guarantee the univocity of the execution with respect to multiple competing threads. The way this function works is by atomically comparing a destination value and a comparand value, both representing the index of the next entry to read, and if they are equal the destination is replaced with a third value (usually the incremented index). The function returns the original destination value, and since the first thread that wants to execute the task is going to increment the index, it’s possible to guarantee thread safety by executing the task only if the return value is equal to the original index. The function adds a little bit of overhead to the system, but we can minimize its impact by designing our tasks to be long enough to limit the amount of function calls.

Up to this point the idea of a job queue seems convincing, but let’s stress the system a little bit more. Tasks can be added during the lifetime of our application and the queue can execute them using a busy-waiting technique. In this way, after a thread finishes some work it can be put to sleep for the desired amount of time. This is very important because we want our program to execute also the rest of the code, without being stuck forever into the queue, but it is also useful in order to limit the power consumption of our application. However, the communication problem we mentioned before is very relevant right now. How can we signal to the operating system when to put a thread to sleep or to resume its activity? A common solution to this issue is to use a semaphore, which is a countable wait primitive capable of controlling a shared resource between threads. A semaphore handles a count between a minimum and a maximum value (in this case equal to the amount of threads). Initially the count is set to zero and this can be interpreted as zero available threads. When the count is incremented, a number of threads proportional to the count itself are awaken from their sleep and are capable of executing the tasks. Each time the game logic sends the necessary data for a task it has to increment the semaphore by one, asking for the attention of at least one thread. In Win32 this increment can be performed by calling the ReleaseSemaphore function. Finally, after a thread wakes up it decrements the count by one, effectively signaling to be unavailable for the time being.

Implementing the job queue

Now it’s time to turn all this fancy reasoning into practice. Using C++ and the Win32 API to begin with, we can spawn a new thread by calling the CreateThread function. As with any Win32 function it’s useful to read the related documentation, which you can find here (https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createthread). After reading the explanation of the function arguments we discover that it’s necessary to include an application-defined function, called ThreadProc, which represents the entry point of the thread. Moreover, we can pass a pointer to a variable or a struct that we want to be used by the thread, and in our case this struct is going to be our work queue itself. Let’s start by defining the queue as a collection of entries, a semaphore and a bunch of volatile variables that are going to be useful in order to control the queue itself. The entries in turn contain a data component and a callback that represents the actual task:

// We pre-declare the platform_work_queue to break the circular dependency
// The callback is defined through a macro because in this way we only need to 
// specify its name, instead of writing every time also the prototype
struct platform_work_queue;
#define PLATFORM_WORK_QUEUE_CALLBACK(name) void name(platform_work_queue* queue, void* data)
typedef PLATFORM_WORK_QUEUE_CALLBACK(platform_work_queue_callback);

struct platform_work_queue_entry
{
	void* data;
	platform_work_queue_callback* callback;
};

struct platform_work_queue
{
	platform_work_queue_entry entries[256];

	HANDLE semaphoreHandle;
	
	uint32 volatile completionGoal;
	uint32 volatile completionCount;
	uint32 volatile nextEntryToRead;
	uint32 volatile nextEntryToWrite;
};

The volatile keyword is a way to warn the compiler that a variable might be changed externally (by another thread in this case), and therefore it’s not allowed to optimize it away. This is another very important concept in multithreading programming, which is also linked to the problem of synchronization and communication: the compiler doesn’t know about the existence of other threads, and therefore it might choose to hide away some variable into a register if the current thread is not going to use it. Specifying variables as volatile is not only useful, but also necessary in these cases. Before the creation of the threads we initialize all volatile variables to zero and we create and assign a semaphore to the queue:

// Right now we only need to pass the queue into the thread, but if some additional
// non queue-related data is needed this is the struct we have to fill
struct win32_thread_startup
{
	platform_work_queue* queue;
};

void createWorkQueue(platform_work_queue* queue, uint32 threadCount, win32_thread_startup* threadStartups)
{
	queue->completionCount = 0;
	queue->completionGoal = 0;
	queue->nextEntryToRead = 0;
	queue->nextEntryToWrite = 0;

	uint32 initialCount = 0;
	queue->semaphoreHandle = CreateSemaphoreEx(0, initialCount, threadCount, 0, 0, SEMAPHORE_ALL_ACCESS);

	for (uint32 threadIndex = 0; threadIndex < threadCount; ++threadIndex)
	{
		win32_thread_startup* threadStartup = threadStartups + threadIndex;
		threadStartup->queue = queue;

		DWORD threadID;
		HANDLE threadHandle = CreateThread(0, 0, ThreadProc, threadStartup, 0, &threadID);
		CloseHandle(threadHandle);
	}
}

// The concept of priorities can be implemented by spawning an array of threads associated
// to a single queue. Different queues may handle a different number of threads, and the
// latter is what ultimately determines their priority level. Here is an example about
// the creation of two queues, one high priority and one low priority

// High priority queue
win32_thread_startup HPThreadStartups[6] = {};
platform_work_queue HPQueue = {};
createWorkQueue(&HPQueue, ArrayCount(HPThreadStartups), HPThreadStartups);

// Low priority queue
win32_thread_startup LPThreadStartups[2] = {};
platform_work_queue LPQueue = {};
createWorkQueue(&LPQueue, ArrayCount(LPThreadStartups), LPThreadStartups);

Now let’s see how to define the ThreadProc function. At first we extract the queue that is passed through the win32_thread_startup struct, and then we execute the busy-waiting loop:

DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    win32_thread_startup* threadStartup = (win32_thread_startup*)lpParameter;
    platform_work_queue* queue = threadStartup->queue;
        
    for (;;)
    {
    	if (processNextWorkQueueEntry(queue))
    		// decrease semaphore count by 1 (on wakeup, not on entry)
    		WaitForSingleObjectEx(queue->semaphoreHandle, INFINITE, FALSE);		
    }
}

The function processNextWorkQueueEntry implements the actual FIFO logic for the queue. At the beginning we save the index of the next entry to read into a comparand variable. If this index is equal to the one determined by the corresponding addWorkQueueEntry call, then the thread has work to do, otherwise it is going to sleep. If we are working, now it’s the time to execute the InterlockedCompareExchange trick by letting the first available thread to increment the entry index (note the queue circularity through the modulus operator). Finally, only this lucky first worker has the privilege to execute the callback and to atomically increment the completion count:

bool32 processNextWorkQueueEntry(platform_work_queue* queue)
{
	bool32 shouldSleep = false;

	// Implement circular FIFO queue
	uint32 originalNextEntryToRead = queue->nextEntryToRead;
	uint32 newNextEntryToRead = ((queue->nextEntryToRead + 1) % ArrayCount(queue->entries));

	// If there is work to do
	if (originalNextEntryToRead != queue->nextEntryToWrite)
	{
		// Increment index
		uint32 index = InterlockedCompareExchange((LONG volatile*)&queue->nextEntryToRead, newNextEntryToRead,
			                                  originalNextEntryToRead);

		if (index == originalNextEntryToRead)
		{
			platform_work_queue_entry entry = queue->entries[index];
			entry.callback(queue, entry.data);
			InterlockedIncrement((LONG volatile*)&queue->completionCount);
		}
	}
	else
		shouldSleep = true;

	return shouldSleep;
}

// Sometimes we want to execute all the task inside the queue, and
// to start filling it again after some event (such as hot code reloading)
void processAllWorkQueue(platform_work_queue* queue)
{
    while (queue->completionGoal != queue->completionCount)
    	processNextWorkQueueEntry(queue);

    queue->completionGoal = 0;
    queue->completionCount = 0;
}

Now, the queue can start working only after we prepared some task for it to crunch, and the work load is dictated of course by the necessities of the application. Each task is characterized by its data and callback, and these two elements together with the queue itself are going to be the arguments of the addWorkQueueEntry function:

void addWorkQueueEntry(platform_work_queue* queue, platform_work_queue_callback* callback, void* data)
{
	// Implement circular FIFO logic
	uint32 newNextEntryToWrite = ((queue->nextEntryToWrite + 1) % ArrayCount(queue->entries));

	// Make sure the cicular queue hasn't wrapped around, before writing
	Assert(newNextEntryToWrite != queue->nextEntryToRead);

	platform_work_queue_entry* entry = queue->entries + queue->nextEntryToWrite;
	entry->data = data;
	entry->callback = callback;
	++queue->completionGoal;

	// Protect from compiler aggressive code optimization
	_WriteBarrier();

	queue->nextEntryToWrite = newNextEntryToWrite;

	// Increase semaphore count by 1 and return previous one
	ReleaseSemaphore(queue->semaphoreHandle, 1, 0);
}

In this last function, we update the circular index of the work to be executed and we insert data and callback into the queue. Note how the _WriteBarrier() call is used in order to protect against aggressive compiler optimizations that may move the code up and down the boundary without considering the multithreading context. Finally we increment the semaphore, forcing at least one thread to become interested in our task.

Wow, that was quite a ride but i think the multithreaded job queue was worth it. I am sure the system can still be greatly improved, but this is what i am currently toying with. I hope to be back soon with some update!

Hot Code Reloading

17 Novembre, 201923 Novembre, 2019 Daniele Carbone

The videogame development industry relies heavily on fast programming languages such as C/C++, for obvious reasons, but most of the time also some scripting language can be used on top of the core layers of the engine. Languages like Python, C# or Lua allow for an easy and modular production of gameplay code, which is often based on the fine tuning of several parameters (especially at the final stages of development). Without a scripting language it would be necessary to recompile and relink the code for each tiny modification of any parameter, and this of course lowers the productivity of the entire team. A piece of code written with a scripting language can be modified and reloaded at runtime, a technique usually known as hot code reloading. However, the integration of several different programming languages increases the overall complexity of the engine, and requires to write one or more very robust parsers that are capable of translating the information from one language to the other. The aim of this post is to describe a technique for hot code reloading that doesn’t require to use a scripting language, but can be directly implemented in C/C++ with a slight modification of the engine architecture. Yes, you heard me well, it’s actually possible to leave the executable running and to modify any piece of code at runtime, immediately seeing the results and eventually keep tuning the transparency of that pesky particle system. The first time i even heard of this method was during an episode of Handmade Hero, the awesome streaming about game engine coding from Casey Muratori. I’ll simply describe my own implementation and my experience in using this feature, possibly also demystifying that black magic feeling behind it!

Requirements

The first key component for the implementation of hot code reloading is the separation between platform code and game code. This feature is already a good idea without hot reloading in mind, because it makes easier to port the game to a different platform (even on console) since the code that interacts with the operating system is decoupled from the actual game. Platform and game can interact in the same way a scripting language interacts with the engine, provided that the platform code is built as an executable and the game is built as a dll. The platform code keeps the entire application running, while we can modify the game dll at will and reloading it each time a new recompilation process is sensed by the platform itself. In particular, since the game is going to be imported just like a library, we need to identify and group its main functions in order to export them as extern function pointers. For example we might choose to develop game logic and rendering as one library, audio processing as another and finally also a library for eventual debug services. Of course each of these function pointers can contain calls to many other functions, just like a main function contains the entire codebase in itself, even if technically the main function of a dll is a little bit different thing and it is optional. I have used three general arguments for our function pointers (memory, inputs and render commands) that need to be provided by the platform code, but of course the entire system can be expanded if needed. It’s important to remember that the following code has to be defined in a header file which is visible from both platform and dll code:

// At first we introduce a macro that simply defines a function, and then we define a C-style function pointer. 
// This allows to eventually introduce a "stub" version of itself, for example in order to handle cases of failed initialization
#define GAME_UPDATE_AND_RENDER(name) void name(game_memory* memory, game_input* input, game_render_commands* renderCommands)
typedef GAME_UPDATE_AND_RENDER(game_update_and_render);

#define GAME_GET_SOUND_SAMPLES(name) void name(game_memory* memory, game_sound_output_buffer* soundBuffer)
typedef GAME_GET_SOUND_SAMPLES(game_get_sound_samples);

#define DEBUG_GAME_END_FRAME(name) void name(game_memory* memory, game_input* input, game_render_commands* renderCommands)
typedef DEBUG_GAME_END_FRAME(debug_game_end_frame);

// We specify that this function is going to be exported. Extern "C" is for name mangling
extern "C" __declspec(dllexport) GAME_UPDATE_AND_RENDER(gameUpdateAndRender)
{
	// Implement the game
	...
}

// Repeat also for audio and debug services
...

The second key component to a successful hot code reloading is to handle correctly the memory. If we ask for memory in the game code, in fact, after the reloading process all pointers are not going to be valid anymore. This is another good reason to keep platform and game as separated chunks! The platform code, being the one responsible for the comunication with the OS, acts as the gatekeeper for memory and it is the one that holds the pointers. In this way, even when the game is reloaded the pointers to memory are still going to be valid because the platform code is always running. The game of course has the right to ask for more memory, but the request has to pass through the platform layer every time. Many game engines that i know already apply this strategy for memory management (which is a big allocation at startup followed by eventual expansions when needed), but if your engine absolutely needs to call new or malloc every time also in the dll then maybe hot code reloading can be a tough feature to introduce in your codebase. For this reason, it would be ideal to plan for hot code reloading during the early stages of development.

Implementation

Using Windows and the Win32 API, the actual code reloading can be achieved by retrieving the function pointers with GetProcAddress and assigning them to pointers of the platform layer. This of course needs to be done once before the main game loop, and can be repeated each time the current dll write time is different than the last one. As you can see in the next code snippet, every time we load the game code we also save the current write time in the win32_game_code struct:

struct win32_game_code
{
	HMODULE gameCodeDLL;
	FILETIME lastDLLwriteTime;
	
	game_update_and_render* updateAndRender;
	game_get_sound_samples* getSoundSamples;
	debug_game_end_frame* DEBUG_EndFrame;

	bool32 isValid;
}

inline FILETIME getLastWriteTime(char* fileName)
{
	FILETIME lastWriteTime = {};
	WIN32_FILE_ATTRIBUTE_DATA data;
	if (GetFileAttributesEx(fileName, GetFileExInfoStandard, &data))
		lastWriteTime = data.ftLastWriteTime;

	return lastWriteTime;
}

win32_game_code loadGameCode(char* sourceDLLname, char* tempDLLname, char* lockFileName)
{
	win32_game_code result = {};

	WIN32_FILE_ATTRIBUTE_DATA ignored;
	if (!GetFileAttributesEx(lockFileName, GetFileExInfoStandard, &ignored))
	{
		result.lastDLLwriteTime = getLastWriteTime(sourceDLLname);

		CopyFile(sourceDLLname, tempDLLname, FALSE);
		result.gameCodeDLL = LoadLibraryA(tempDLLname);
		if (result.gameCodeDLL)
		{
			result.updateAndRender = (game_update_and_render*)GetProcAddress(result.gameCodeDLL, "gameUpdateAndRender");
			result.getSoundSamples = (game_get_sound_samples*)GetProcAddress(result.gameCodeDLL, "gameGetSoundSamples");
			result.DEBUG_EndFrame = (debug_game_end_frame*)GetProcAddress(result.gameCodeDLL, "DEBUGGameEndFrame");
			result.isValid = (result.updateAndRender && result.getSoundSamples && result.DEBUG_EndFrame);
		}
	}

	if (!result.isValid)
	{
		result.updateAndRender = 0;
		result.getSoundSamples = 0;
		result.DEBUG_EndFrame = 0;
	}

	return result;
}

void unloadGameCode(win32_game_code* gameCode)
{
	if (gameCode->gameCodeDLL)
	{
		FreeLibrary(gameCode->gameCodeDLL);
		gameCode->gameCodeDLL = 0;
	}

	gameCode->isValid = false;
	gameCode->updateAndRender = 0;
	gameCode->getSoundSamples = 0;
}

The arguments of the loadGameCode function are three strings: source, temp and lock names. The first two are the paths to the locations of game dll and a temporary file that holds its copy, and they can be hardcoded or can be determined at runtime using a combination of Win32 functions (such as GetModuleFileNameA) and string manipulation. The last one, the lock, has to do with Visual Studio locking the .pdb file even after the dll is unloaded. One way to overcome this issue is to force Visual Studio to generate a pdb with a different name every time the code is recompiled, for example using the following expression:

// Right click the DLL project and select Properties->Configuration Properties->Linker->Debugging->Generate Program Database File
$(OutDir)$(TargetName)-$([System.DateTime]::Now.ToString("TMA_mm_ss_fff")).pdb

However, now the pdb files are going to pile up in the output folder after every recompilation. We can define a pre-build event for the dll in which we delete all .pdb files:

// Right click the DLL project and select Properties->Configuration Properties->Build Events->Pre-Build Event
del "$(Outdir)*.pdb" > NUL 2> NUL

The final issue we have with the .pdb file is that the MSVC compiler actually writes the dll file before it, and therefore during code hotloading the dll is going to be loaded immediately while the .pdb still has to be written. At this point Visual Studio would fail to load the .pdb, and debugging the code after reloading would be impossible. This is why we create the lock file passed as argument in loadGameCode! This lock is just a dummy file that is created in a pre-build event and deleted in a post-build event:

// During pre-build event
del "$(Outdir)*.pdb" > NUL 2> NUL 
echo WAITING FOR PDB > "$(Outdir)lock.tmp"

// During post-build event
del "$(Outdir)lock.tmp"

During its lifetime, the lock file is a living proof that the code is still being loaded. For this reason, in the loadGameCode function we check if the lock file is not present by calling if (!GetFileAttributesEx(…)) and only under this circumstance we proceed to load the dll. This ensures that also the .pdb file has been written and allows to debug the code as usual.

Finally, let’s see how to actually trigger the hotloading after each recompilation:

// Before game loop (platform layer)
win32_game_code game = loadGameCode(sourceGameCodeDLLpath, tempGameCodeDLLpath, gameCodeLockpath);

...

// Inside the game loop (platform layer), between the update and the render
FILETIME newDLLwriteTime = getLastWriteTime(sourceGameCodeDLLFullPath); 					
bool32 isExeReloaded = false;

// Check if the name of the DLL has changed. Since we create a unique DLL name each time, this event
// indicates that a code reloading needs to happen
bool32 doesExeNeedReloading = (CompareFileTime(&newDLLwriteTime, &game.lastDLLwriteTime) != 0);
if (doesExeNeedReloading)
{
	// If the code is multithreaded and a work queue is implemented, complete all
	// the task now because the callbacks may point to invalid memory after the reloading
	...
	
	// If the debug system is designed to record events along the codebase, now it's the
	// time to stop it
	...

	unloadGameCode(&game);
	game = loadGameCode(sourceGameCodeDLLpath, tempGameCodeDLLpath, gameCodeLockpath);
	isExeReloaded = true;
}

That’s it! A current limit of this hot code reloading implementation is the lack of robustness towards changes of the data layout inside classes and/or structs. If variables are added, removed or reordered inside a class the static memory is going to fail because it would read/write from wrong locations. While it’s possible to solve this issue by improving both memory management and the reloading code, possibly with some mild refactoring, i believe that a feature like this should be dictated by real necessities from the gameplay programming side. If changing data layout at runtime were effectively a time-saving feature for some specific type of debugging or tuning, then i would be interested in investing more time to improve the system. Otherwise, i would just consider this as premature optimization, and you already know what more experienced programmers than me have to say about that!

Demo

In this example i launch the executable outside the debugger. I am interested in tuning the movement speed of an animation, and thanks to hot code reloading i can change the source code and immediately see the effect on my application:

That’s all for now, happy code reloading!

Daniele Carbone

Game Engine Programmer

Categoria: General Programming

Multithreading and job queues

Hot Code Reloading