I have some thoughts about this:
1) I hope you don't go through all this if ProcessorCount==1,
or maybe you should, just to make sure it is also correct in that case?
2) in my experience it works a lot easier when you never store end (the last
index), but instead store end+1 (one past the last index); that saves a lot
of mistakes, and it fits the for-loop paradigm.
3) I would not try to divide the job into exactly even parts; when you split
100 into 48+52 it won't matter that much.
4) I would be very careful not to have data that resides in the same cache line
(more precisely in addresses that differ by less than the cache line size)
be written by different processors. It would cause a lot of invalidate,
flush and reload operations, and would be devastating for performance.
And yes this is hard to achieve in a high-level language.
I hope, but am not absolutely sure, that arrays are always allocated at
an address that is a multiple of a sufficiently high power of 2
(say a multiple of 256 B). You can check this somewhat by experimenting;
better is to have it specified somewhere, but I don't recall having seen that.
5) 3+4 together means: I would, still thinking of arrays, split on
boundaries that correspond to 256 B or more, hence round the chunk size up
to a multiple of 64 elements when dealing with 4 B ints, etc.
6) I do recall some Intel articles that suggested making structs artificially
larger when you can afford it, just to achieve my point 4 (e.g. for control
data, such as task control blocks).
One of them showed how you can really kill a multi-processor system: have
two threads spin on locks in 2 data items that belong to the same cache line!
7) on the other hand, if you over-align your data, you increase the possibility
of cache thrashing; example: if you need repetitive access to only 1000 bytes,
but all these bytes sit at a large power-of-2 stride, they compete for very
few cache sets. In the worst case (a stride equal to the size of one way,
i.e. 256 KB for a 2 MB 8-way cache) they would constantly miss even in a
cache that big, since cache associativity nowadays is something like 8-way,
so there would only be 8 cache line candidates to cache the requested data
(the candidate in each way being determined by the lowest address bits,
excluding those that correspond to the cache line width). A 1 KB stride is
milder but, assuming 64 B lines, still leaves 15 out of every 16 sets unused.
8) in conclusion, I suggest you use variables for your basic parameters
(such as number of processors, estimated safe alignment (my 256 B above), etc.).
And once you have it running, just do some more experiments with slightly
different values for those variables, just to see what really helps.
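
To make points 2, 3 and 5 concrete, here is a rough sketch in Python (picked
only for brevity; split_range, elem_size and align_bytes are names I made up).
It assumes the array itself starts at an aligned address, which as said in
point 4 is not guaranteed:

```python
def split_range(n_items, n_workers, elem_size=4, align_bytes=256):
    """Split [0, n_items) into at most n_workers half-open (start, end) chunks.

    Every boundary except the final end is a multiple of
    align_bytes // elem_size elements, so the chunks are only roughly even.
    """
    step = max(1, align_bytes // elem_size)   # 64 elements for 4 B ints
    per_worker = -(-n_items // n_workers)     # ceiling division
    chunk = -(-per_worker // step) * step     # round up to a multiple of step
    chunks = []
    start = 0
    while start < n_items:
        end = min(start + chunk, n_items)     # end is exclusive ("end+1")
        chunks.append((start, end))
        start = end
    return chunks
```

With these defaults split_range(100, 2) gives [(0, 64), (64, 100)]: uneven
parts, but the boundary sits on a 256 B multiple. With n_workers=1 you simply
get [(0, 100)], so the single-processor case of point 1 goes through the
same code.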
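
The struct padding from point 6 can be illustrated with Python's ctypes, purely
to show the layout (PaddedCounter and ALIGN are my own hypothetical names, and
256 is just the estimate from point 4). Note this only spaces the items apart;
it does not guarantee the first one starts on an aligned address:

```python
import ctypes

ALIGN = 256  # assumed "safe" spacing in bytes; tune it as suggested in point 8


class PaddedCounter(ctypes.Structure):
    # One 8-byte counter plus filler, so each instance occupies ALIGN bytes
    # and two neighbouring counters can never share a cache line.
    _fields_ = [
        ("value", ctypes.c_int64),
        ("_pad", ctypes.c_char * (ALIGN - ctypes.sizeof(ctypes.c_int64))),
    ]


# One slot per thread; consecutive slots are ALIGN bytes apart.
counters = (PaddedCounter * 4)()
spacing = ctypes.addressof(counters[1]) - ctypes.addressof(counters[0])
```

Here spacing comes out as 256, at the cost of 248 wasted bytes per counter,
which is why this is only worth it for a handful of control items.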
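
For point 7, a crude model shows how a stride maps onto cache sets. It assumes
64 B lines, a 2 MB 8-way cache, and a plain "line number modulo number of sets"
index; real hardware may index differently, and sets_touched is a made-up name:

```python
def sets_touched(stride, count, cache_bytes=2 * 1024 * 1024, ways=8,
                 line_bytes=64):
    """Count the distinct cache sets hit by `count` accesses `stride` bytes apart."""
    n_sets = cache_bytes // (ways * line_bytes)   # 4096 sets with the defaults
    return len({(i * stride // line_bytes) % n_sets for i in range(count)})
```

Under those assumptions, 1000 accesses at a 64 B stride touch 1000 different
sets, a 1 KB stride touches only 256 sets, and a 256 KB stride (the size of one
way) lands everything in a single set with just 8 lines available. How bad it
gets depends on the stride, which is one more reason to experiment.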
Good luck, and please keep posting your progress.