Click here to Skip to main content
15,440,991 members
Articles / Programming Languages / C++
Technical Blog
Posted 15 Nov 2019

Tagged as

Stats

4.4K views
11 bookmarked

Exploring Multi-Threading in C++

Rate me:
Please Sign up or sign in to vote.
5.00/5 (7 votes)
15 Nov 2019CPOL4 min read
Multi-threading in C++

The pursuit of performance is something that interests me as a developer, so as a learning exercise I decided to experiment and consolidate my knowledge about multi-threading. Nowadays it’s becoming even more important since our CPUs get more and more cores. Modern game engines and applications use multiple CPU cores to stay fast and responsive.

Series of Articles

Setup and Baseline Result

As a test case, I decided to create a series of tasks, some small and others big, to simulate different workload types. As an easy test case, I grabbed a method to calculate Pi, and run that method multiple times, depending on how heavy I want the workload to be.

C++
double CalcPi(int n)
{
	double sum = 0.0;
	int sign = 1;
	for (int i = 0; i < n; ++i)
	{
		sum += sign / (2.0 * i + 1.0);
		sign *= -1;
	}
	return 4.0 * sum;
}

Now I create a couple of different Jobs running CalcPi and add them into a vector or a queue (depending on the test I’m running). My CalcPiJob class looks something like this:

C++
class CalcPiJob
{
public:
	CalcPiJob(int iterations)
		: m_iterations(iterations)
	{ }

	void DoWork()
	{
		float p = 0.0f;
		for (int i = 0; i < m_iterations; ++i) {
			p += CalcPi(m_iterations);
		}

		p /= m_iterations;
		std::this_thread::sleep_for(std::chrono::milliseconds(Settings::ThreadPause));
	}

private:
	int m_iterations;
};

Creating a series of different workload types looks something like:

C++
std::queue<CalcPiJob*> GetJobsQ()
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobQ;
}

std::vector<CalcPiJob*> GetJobVector()
{
	std::vector<CalcPiJob*> jobs;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobs;
}

I have also defined a couple of constants to help out.

C++
struct Settings
{
	enum class Priority : int {
		Low = 0,
		Medium,
		High
	};

	static const int JobCountLow = 120;
	static const int JobCountMedium = 60;
	static const int JobCountHigh = 25;

	static const int ThreadPause = 100;

	static const int IterationCountLow = 5000;
	static const int IterationCountMedium = 10000;
	static const int IterationCountHigh = 20000;

	static const int PrecisionHigh = 100;
	static const int PrecisionMedium = 100;
	static const int PrecisionLow = 100;
};

Now for baseline, I go through all Jobs and execute DoWork sequentially.

C++
void RunSequential()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();
	while (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		job->DoWork();
		delete job;
	}
}

I’m running all my tests on a i7 4770K, that has 4 cores and 8 threads. All timings were taken from a release build, and all profile images from debug builds (for illustration of workload purposes).

Sequential run time: 20692 ms

Image 1

First Worker Thread

Let the interesting part begin. As an easy step towards a multi-threading application, I’m going to create only one thread, to share the workload with the main thread.

This already brings a few new concepts to be aware of such as sharing data across multiple threads. We protect our data access with a std::mutex, and lock it with std::scoped_lock (introduced in C++17. Use similar std::lock_guard if your compiler doesn’t support it).

You’ll need a few includes first.

C++
// you should already have these.
#include <vector>
#include <queue>

#include <thread> // thread support
#include <mutex>  // mutex support
#include <atomic> // atomic variables
#include <future> // later on for std::async
C++
CalcPiJob* GetAndPopJob(std::queue<CalcPiJob*>& jobQ, std::mutex& mutex)
{
	std::scoped_lock<std::mutex> lock(mutex);
	if (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		return job;
	}
	return nullptr;
}

GetAndPopJob does exactly what it says, it will get a job if one exists and pop it from the queue. empty(), front() and pop() are protected inside this method with the use of the std::scoped_lock.

C++
void ExecuteJobsQ(std::atomic<bool>& hasWork, 
	std::queue<CalcPiJob*>& jobQ, 
	std::mutex& mutex)
{
	while (hasWork)
	{
		CalcPiJob* currentJob = GetAndPopJob(jobQ, mutex);
		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			hasWork = false;
		}
	}
}

ExecuteJobsQ will run in the main thread and the worker thread. It gets a job, executes it, and continues until there is no more work to do.

C++
// global mutex for read/write access to Job Queue
static std::mutex g_mutexJobQ;

void RunOneThread()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> jobsPending = true;

	// Starting new thread
	std::thread t([&]() {
		ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);
	});

	// main thread, also does the same.
	ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);

	t.join();
}

One worker thread run time: 10396 ms

Image 2

The image above shows the execution of the jobs, the larger ones first, then the medium sized ones and lastly the smaller ones. This was the order at which the tasks were added into the queue.

More Worker Threads

Now this is nice, so let's add more threads! How many? Well, I know my CPU has 8 threads, but nothing guarantees they will only run for my program though. Operating system time slice program execution across multiple cores/threads, so even if you create more threads than your max CPU threads, there’s no “problem” because the operating system will switch execution time for them on its own.

C++ provides us a way of determining how many concurrent threads our system supports, so lets just use that: std::thread::hardware_concurrency().

C++
void RunThreaded()
{
	// -1 to make space for main thread
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> hasJobsLeft = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

Image 3
8 threads

Now this is a nicer view. 7 worker threads working with the main thread to process all jobs. Again, first we see the bigger jobs, then medium, then smaller ones being processed. This is being processed in the order they were added.

Async Tasks

When spawning tasks with std::async, we don’t manually create threads, they are spawned from a thread pool.

C++
void RunJobsOnAsync()
{
	std::vector<CalcPiJob*> jobs = GetJobVector();

	std::vector<std::future<void>> futures;
	for (int i = 0; i < jobs.size(); ++i)
	{
		auto j = std::async([&jobs, i]() {
			jobs[i]->DoWork();
			});
		futures.push_back(std::move(j));
	}

	// Wait for Jobs to finish, .get() is a blocking operation.
	for (int i = 0; i < futures.size(); ++i)
	{
		futures[i].get();
	}

	for (int i = 0; i < jobs.size(); ++i)
	{
		delete jobs[i];
	}
}

Run time: 2220 ms

Image 4

Overview

This time table only serves as an overview for this particular case. Of course, in real applications, results vary.

Test Run Time (ms) Improvement
Sequential 20692 1.x
One Thread 10396 1.99x
Threaded 2625 7.88x
Async Tasks 2220 9.3x

The sample codes are my exploration of this specific case and by no means is free of bugs. But it is interesting to see how the code would run across multiple threads, how to synchronize and make the most of my system.

All screenshots are taken with the debug version of the program, so we could clearly see the workload in the profiler. For that, I used Superluminal Profiler. I found out that it is an amazing, lightweight profiler. You can also use Intel’s VTune for free.

Continue Reading

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United Kingdom United Kingdom
http://mikeadev.net/about-me/

Comments and Discussions

 
QuestionVery nice, but... Pin
Member 1466094219-Nov-19 3:39
MemberMember 1466094219-Nov-19 3:39 
First of all, nice job. And I really mean it.

I just got a little worried when you stated that you got 9.x performance improvement on a 8 threaded cpu. That is just not possible. When you add threads, you have the same amount of work being executed in paralell, and an additional amount of work for the thread management. Amount of work = cpu time. It's math + physics.

I gave a quick look at your code, and saw that you have a line of code that implements a wait:
std::this_thread::sleep_for(std::chrono::milliseconds(Settings::ThreadPause));


If you remove this line, than you are going to see actual processing times. You are going to have gains, but without the "fat" to be burned, the serial time is going to get closer to the multi-threaded time.

I am too lazy to actually get the code, compile, get execution times, put them into excel and create a page with it... Smile | :)

In spite of everything I wrote, I would like to say it again: very nice work.
AnswerRe: Very nice, but... Pin
Michael Adaixo19-Nov-19 12:22
MemberMichael Adaixo19-Nov-19 12:22 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.