I am looking for more information on how to prevent thread starvation, blocking and race conditions in an application or service that could be running up to 200 threads simultaneously across multiple processors/cores.

Most of the examples I have found online show how to coordinate between 3 or 4 threads at most, and I am planning an app that will have hundreds.

I've made applications that have run up to 30 threads, but when going way beyond that I am worried about some threads not getting any CPU time, not to mention synchronization.

Where can I find some really good articles and blogs?
What are the best books on the subject?
There are a lot of books and articles on the subject. I am looking for the most informative and instructional ones, but not ones written in engineering language above my understanding.

All suggestions are appreciated.
Comments
Mehdi Gholam 15-Nov-13 10:41am    
What are you trying to do?
Sergey Alexandrovich Kryukov 15-Nov-13 10:57am    
First of all, the best and most radical method of preventing thread starvation is not writing code. :-)

The approaches that are good for 2-3 threads are generally good for 200. But do you have 100 or at least 20 cores? I doubt it. The first thing to think about is why you are using threads. They rarely help to improve the throughput of the system. (They do, I say, rarely.) Threads are more useful for programming something that is logically parallel, due to the intrinsic nature of some subjects: communication on several independent lines, with several devices, several players, several characters of some game, something like that...

—SA
pasztorpisti 15-Nov-13 19:10pm    
"First of all, the best and most radical method of preventing thread starvation is not writing code."
:thumbsup: :-) :-) :-) Not writing code is what I usually advise to avoid bugs...
Sergey Alexandrovich Kryukov 15-Nov-13 20:58pm    
:-)
Bibin_Babu 30-May-14 14:13pm    
:)

1 solution

There are too many scenarios to cover, so it is impossible to answer your question in general.

That said, running more active threads than you have cores is usually counterproductive. Extra threads only help if most of them spend most of their time sleeping, for example while waiting on an I/O operation such as a recv() call that receives data from the network. Even then you want to actively control how many threads execute in parallel: starting 200 threads and hoping that only 4 of them will *want* to run at the same time on a 4-core machine is not good practice.
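A minimal Python sketch of capping the number of threads for I/O-bound work, assuming a fixed-size thread pool is acceptable. `MAX_WORKERS`, `fake_io_task` and `run_io_tasks` are hypothetical names, and `time.sleep` merely stands in for a blocking call such as recv():

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # hypothetical cap; tune it by measuring on the target machine

_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of tasks observed running at the same time


def fake_io_task(task_id):
    """Pretend to wait on the network, tracking how many tasks overlap."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # stand-in for a blocking recv()
    with _lock:
        _active -= 1
    return task_id


def run_io_tasks(count):
    """Run `count` tasks, but never on more than MAX_WORKERS threads."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fake_io_task, range(count)))


results = run_io_tasks(200)
```

All 200 tasks complete, yet `peak_active` never exceeds the cap, so the number of threads competing for the cores stays under your control rather than the OS scheduler's.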

If you have to run 200 parallel tasks that are active all the time, then start only as many threads as you have cores and split the tasks into small jobs. You then execute these small jobs on your threads in whatever order you want (possibly with prioritization). This is already close to writing your own scheduler. A technique that often leads to a more sophisticated solution is green threads / coroutines (the deprecated POSIX ucontext API, still available on some Linux distros, and the fiber API on Windows).
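The "one thread per core, many small jobs" idea could be sketched roughly as follows in Python. `run_jobs` and `make_job` are hypothetical helpers, and the priorities are arbitrary. Note that in standard CPython the threads will not execute Python bytecode in parallel, so this only illustrates the structure, not a speedup:

```python
import os
import queue
import threading


def run_jobs(jobs, worker_count=None):
    """Run (priority, callable) pairs on a fixed set of worker threads.

    Lower priority values are picked up first. Results are returned in
    submission order.
    """
    worker_count = worker_count or os.cpu_count() or 1
    pending = queue.PriorityQueue()
    for seq, (priority, fn) in enumerate(jobs):
        # seq breaks priority ties, so the callables are never compared
        pending.put((priority, seq, fn))

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                _, seq, fn = pending.get_nowait()
            except queue.Empty:
                return  # every job was queued up front, so empty means done
            value = fn()
            with results_lock:
                results[seq] = value

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]


def make_job(start, stop):
    return lambda: sum(range(start, stop))


# 200 logical tasks, each split into 4 small jobs of 100 numbers
jobs = []
for task in range(200):
    for part in range(4):
        start = task * 400 + part * 100
        jobs.append((task % 3, make_job(start, start + 100)))

partial_sums = run_jobs(jobs)
total = sum(partial_sums)
```

The 200 tasks never get 200 threads. They become 800 short jobs that a handful of workers drain from one queue, which is where your own ordering and prioritization logic can live.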

Using green threads you can split a thread into many green threads, and you perform the task switching/cooperation between these green threads explicitly in user space. Let's say you have 4 cores. You can start 4 threads, and if you split each thread into 100 green threads then you have 400 green threads, but only 4 of them are executing in parallel at any given point in time. You can write your own scheduler because you control the task switching between the green threads.
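As a rough illustration of explicit user-space switching, here is a sketch that uses Python generators as a stand-in for ucontext contexts or Windows fibers (they are not the same mechanism). `green_task` and `run_round_robin` are hypothetical names, and everything runs on a single OS thread. In the design described above you would run one such scheduler per core:

```python
from collections import deque


def green_task(name, steps, log):
    """A cooperative task that does `steps` units of work."""
    for step in range(steps):
        log.append((name, step))
        yield  # explicit switch point: hand control back to the scheduler


def run_round_robin(tasks):
    """A tiny user-space scheduler: resume each ready task in turn."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # run the task until its next switch point
        except StopIteration:
            continue  # task finished, do not reschedule it
        ready.append(task)


log = []
run_round_robin([
    green_task("a", 2, log),
    green_task("b", 3, log),
    green_task("c", 1, log),
])
```

Because the switch points are explicit, the scheduling policy is entirely yours. Replacing the deque with a priority queue, for instance, would turn round-robin into priority scheduling.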

Writing, for example, a web server with green threads has the benefit that the multithreaded and the async/green-threaded implementations can share the same servlet code if it is written well. You put the task switching into your own implementation of otherwise blocking calls such as socket read/write, while in multithreaded mode the same calls simply invoke the blocking OS equivalents.
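A hedged sketch of that idea in Python: one handler, two interchangeable connection types. All names here (`BlockingConn`, `CooperativeConn`, `handle_request`, `run_to_completion`) are hypothetical. Python generators are stackless, so the handler has to use `yield from`. With real stackful fibers the handler would be plain code and the switch would be hidden inside `read()`:

```python
from collections import deque


class BlockingConn:
    """Multithreaded mode: read() would call the blocking OS function."""

    def __init__(self, chunks):
        self._chunks = deque(chunks)  # pre-filled data stands in for a socket

    def read(self):
        return self._chunks.popleft() if self._chunks else b""
        yield  # unreachable; only makes read() a generator like the other mode


class CooperativeConn:
    """Green-thread mode: read() switches away instead of blocking."""

    def __init__(self):
        self.inbox = deque()

    def read(self):
        while not self.inbox:
            yield  # nothing to read yet: let another green thread run
        return self.inbox.popleft()


def handle_request(conn, out):
    """The shared 'servlet' code: identical for both connection types."""
    data = yield from conn.read()
    out.append(data.upper())


def run_to_completion(gen):
    """Driver for blocking mode: the handler never really suspends."""
    for _ in gen:
        pass


# Blocking mode: the handler runs straight through.
blocking_out = []
run_to_completion(handle_request(BlockingConn([b"ping"]), blocking_out))

# Cooperative mode: the handler suspends until data arrives.
coop_conn = CooperativeConn()
coop_out = []
handler = handle_request(coop_conn, coop_out)
next(handler)                   # suspends inside read(): no data yet
coop_conn.inbox.append(b"pong")  # a scheduler would do this on socket readiness
for _ in handler:               # resume: read() returns and the handler finishes
    pass
```

The point is that `handle_request` does not know which mode it runs in. Only the implementation behind `read()` changes.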
 
Comments
Sergey Alexandrovich Kryukov 15-Nov-13 21:41pm    
Sure, a 5.
—SA
pasztorpisti 16-Nov-13 9:32am    
Thank you!
Foothill 26-Jan-14 15:03pm    
This is a good start. Your answer suggests that you are very knowledgeable in this area. Say that each separate thread was performing very simple calculations. Is it feasible to use the simplified architecture of nVidia's CUDA cores for this kind of massively parallel programming?
pasztorpisti 26-Jan-14 16:12pm    
I don't have experience with CUDA, but I know that its usage is very similar in essence to that of other platforms used for massively parallel computing. In optimization in general, and in multithreaded optimization in particular, there is one terribly important thing you cannot skip: always measure the actual performance of your program on the ACTUAL TARGET PLATFORMS/HARDWARE. Of course it is important that the code is algorithmically sound, but your main logic also has to distribute the computable tasks well to make efficient use of your computation capacity. In a machine with CUDA you have several CPU cores and GPU cores. If you write your code for CUDA only, then your CPU cores will only upload the jobs/data to the hardware controlled by CUDA and then just wait for the CUDA API to finish. When CUDA finishes you have to download the results. Upload and download also have a cost. If the upload/download cost is close to the cost of computing the task on the CPU, then it isn't worth performing that small task with CUDA. To measure the times, that is, to "profile" your application, you need the right tools. You should look for tools that can monitor all of your processing units, including CPUs and GPUs. CUDA has such tools. Again, there is no golden rule when it comes to optimization. It's just a dream that you write your code along predefined guidelines and then it runs "optimally" everywhere. I work in the games industry, where optimization happens quite often. If your application has to run "perfectly" at least on iPad4 and some specific Android devices, then you have to profile/measure your app on this hardware and fine-tune your code and job distribution logic on these platforms. If your app has to run on a PC with 8 cores (3 GHz each) and 4 GB of RAM, then you have to optimize your code for this hardware.
If you optimize your program for specific hardware, then it will probably not perform "optimally" on hardware that has less of some resource than the target you optimized for. In games, for example, it isn't rare to balance between CPU and GPU to "run the sh*t out of the hardware". You measure how much CPU/GPU is used, and if you find that you usually have some idle time on the CPU (on your target hardware), then you may want to offload some tasks from the GPU to the CPU (several tasks can be performed on either, for example character animation/skinning). Your application is close to optimal if it uses all cores at nearly 100% without stalling on any of the threads. In practice this is very hard to achieve, especially if you have multiple target platforms.
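The upload/compute/download trade-off described above can be written down as a simple comparison of measured times. This is only a sketch in Python: `worth_offloading`, `measure` and the `margin` parameter are hypothetical, and the numbers fed into it must come from profiling on the real target hardware, not from guesses:

```python
import time


def measure(fn, repeats=5):
    """Return the best wall-clock time (seconds) of `fn` over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def worth_offloading(cpu_time, upload_time, device_time, download_time, margin=1.0):
    """True if the full offload round trip beats computing on the CPU.

    `margin` is a hypothetical safety factor: with margin=1.5 the round trip
    must be at least 1.5 times faster before offloading is chosen.
    """
    round_trip = upload_time + device_time + download_time
    return round_trip * margin < cpu_time


# Small task: the transfer cost dominates, so it should stay on the CPU.
small_task = worth_offloading(cpu_time=0.002, upload_time=0.001,
                              device_time=0.0001, download_time=0.001)

# Large task: the compute cost dominates, so offloading pays off.
large_task = worth_offloading(cpu_time=2.0, upload_time=0.05,
                              device_time=0.1, download_time=0.05)

# Example of obtaining a CPU-side timing for a stand-in workload.
cpu_seconds = measure(lambda: sum(i * i for i in range(10_000)))
```

The decision flips purely on the measured numbers, which is why the same code can be worth offloading on one machine and not on another.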
Foothill 26-Jan-14 16:47pm    
So, in theory, it is possible to have a multi-core processor act as the traffic cop directing data to and from waiting threads running on a GPU. From the viewpoint of pure data computation, using the large-scale architecture of modern x86 and x64 processors is like using an earth mover to dig a post hole. I am approaching this from the idea that GPU architectures are already optimal for performing millions of simple calculations per second and are already designed to run in parallel. Is something like that even possible?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


