For loop performance and multithreaded performance issues


I was kind of bored, so I wanted to try out std::thread and eventually measure the performance of single- and multithreaded console applications. This is a two-part question. I started with a single-threaded sum of a large vector of ints (800,000 ints).

int sum = 0;
auto start = chrono::high_resolution_clock::now();

for (int i = 0; i < 800000; ++i)
    sum += ints[i];

auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
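Put together, a minimal compilable version of that measurement could look like the sketch below (the function name is mine, and it assumes `ints` has already been filled):

```cpp
#include <chrono>
#include <vector>

// Times the index-based loop and returns the sum; the elapsed time in
// milliseconds is written to *ms. A sketch of the measurement above.
int timed_index_sum(const std::vector<int>& ints, double* ms) {
    int sum = 0;
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < static_cast<int>(ints.size()); ++i)
        sum += ints[i];

    auto end = std::chrono::high_resolution_clock::now();
    *ms = std::chrono::duration<double, std::milli>(end - start).count();
    return sum;
}
```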

Then I added a range-based and an iterator-based for loop and measured them the same way with chrono::high_resolution_clock.

for (auto& val : ints)
    sum += val;

for (auto it = ints.begin(); it != ints.end(); ++it)
    sum += *it;
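For what it's worth, all three loops compute the same value, and std::accumulate expresses it directly; in a release build the optimizer typically turns all of them into essentially the same code:

```cpp
#include <numeric>
#include <vector>

// Computes the same sum as the three hand-written loops above.
int sum_with_accumulate(const std::vector<int>& ints) {
    return std::accumulate(ints.begin(), ints.end(), 0);
}
```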

At this point console output looked like:

index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms

This was a debug build, so I switched to release and the difference shrank to ~1ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a 1ms difference in release mode?

I moved on to thread creation and tried a parallel sum of the vector with this lambda (capturing everything by reference so I could use the vector of ints and a mutex declared earlier), using an index-based for.

auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;

    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;

    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;

    lock_guard<mutex> lock(m); // m is the mutex declared earlier
    sum += partial_sum;
};

for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));
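A self-contained sketch of the whole parallel sum, including the join and the total timing that produce the "8 threads total" line below, might look like this (the function name and mutex name are mine):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Splits `ints` into equal chunks, sums each chunk on its own thread,
// and combines the partial sums under a mutex.
int parallel_sum(const std::vector<int>& ints, int num_threads) {
    int sum = 0;
    std::mutex m;
    std::vector<std::thread> threads;
    int chunk = static_cast<int>(ints.size()) / num_threads;

    auto func = [&](int start, int total, int index) {
        (void)index; // kept to mirror the question's signature
        int partial_sum = 0;
        for (int i = start; i < start + total; ++i)
            partial_sum += ints[i];
        std::lock_guard<std::mutex> lock(m); // protect the shared sum
        sum += partial_sum;
    };

    auto s = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_threads; ++i)
        threads.push_back(std::thread(func, i * chunk, chunk, i));
    for (auto& t : threads)
        t.join(); // without the joins, sum could be read too early
    auto e = std::chrono::high_resolution_clock::now();

    std::cout << num_threads << " threads total: "
              << std::chrono::duration<double, std::milli>(e - s).count()
              << "ms\n";
    return sum;
}
```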

Basically, each thread summed 1/8 of the total array, and the final console output was:

thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms

So I guess the second part of this question is: what's going on here? A solution with 2 threads also ended up at ~30ms. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 hardware threads, so yes, I know I didn't count the main thread, but I tried it with 7 worker threads and got pretty much the same result.
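One mutex-free way to structure a parallel sum (a sketch, not the only correct approach): have each worker return its partial sum through std::async and add the futures up in the calling thread, so no thread ever writes to shared state:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Each task returns its own partial sum; the only combine step happens
// in the calling thread, so no mutex is needed.
int async_sum(const std::vector<int>& ints, int num_tasks) {
    std::vector<std::future<int>> parts;
    int chunk = static_cast<int>(ints.size()) / num_tasks;

    for (int i = 0; i < num_tasks; ++i) {
        auto first = ints.begin() + i * chunk;
        auto last  = (i == num_tasks - 1) ? ints.end() : first + chunk;
        parts.push_back(std::async(std::launch::async,
            [first, last] { return std::accumulate(first, last, 0); }));
    }

    int sum = 0;
    for (auto& f : parts)
        sum += f.get(); // blocks until that task finishes
    return sum;
}
```

The last chunk absorbs any remainder, so sizes that don't divide evenly still sum correctly.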

EDIT: Sorry, forgot to mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 toolset, or whatever it's called.

EDIT2: Here's the whole main function:

With optimisation turned off, all the method calls that happen behind the scenes are likely real calls. Inline functions are likely not inlined but actually called. For template code, you really need to turn on optimisation to stop the code being executed literally as written. For example, your iterator loop likely calls ints.end() 800,000 times, and operator!= 800,000 times for the comparison, which in turn calls operator== and so on.
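You can see part of this effect by hoisting the end() call out of the loop condition, which removes one of those per-iteration calls in a debug build (the release optimizer does this for you anyway):

```cpp
#include <vector>

// Calls ints.end() once instead of on every iteration; in a debug build
// this removes one real function call per element.
int iterator_sum_hoisted(const std::vector<int>& ints) {
    int sum = 0;
    for (auto it = ints.begin(), last = ints.end(); it != last; ++it)
        sum += *it;
    return sum;
}
```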

For the multithreaded code: processors are complicated, operating systems are complicated, and your code isn't alone on the computer. Your CPU can change its clock speed, go into turbo mode, or throttle itself for heat protection. And rounding the times to milliseconds isn't really helpful either: one thread could take 6.49 milliseconds and another 6.51, and they'd get rounded differently.
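A rough way to get past that granularity (a sketch; the function name and parameters are mine): repeat the measured work many times and divide, so the quantity you report is well above the clock's resolution:

```cpp
#include <chrono>
#include <vector>

// Runs the summation `reps` times and returns the average time per run in
// milliseconds, so clock granularity and one-off jitter average out.
// The final sum is written to *out_sum so the work isn't optimized away.
double average_sum_ms(const std::vector<int>& ints, int reps, int* out_sum) {
    int sum = 0;
    auto s = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; ++r) {
        sum = 0;
        for (std::size_t i = 0; i < ints.size(); ++i)
            sum += ints[i];
    }
    auto e = std::chrono::high_resolution_clock::now();
    *out_sum = sum;
    return std::chrono::duration<double, std::milli>(e - s).count() / reps;
}
```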