Severe performance loss when alternating the number of OpenMP parallel threads
The following code alternates the number of threads used by successive parallel-for regions.
#include <iostream>
#include <algorithm>
#include <chrono>
#include <vector>
#include <omp.h>

std::vector<float> v;

// Runs two parallel-for regions back to back, each with its own thread count.
float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{threadsFirst, threadsSecond};
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
        #pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] *= 0.002F; // scale back down to keep the values bounded
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}
int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; ++j) {
            float minT = 1e9F;
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001F;
                if (i > 20) { // skip the first iterations as warm-up
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " << j << " and " << j + threadAdd << " threads -- Min: "
                      << minT << "ms Max: " << maxT << "ms Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}
When compiled and run with Visual Studio 2019 in Release mode, this is the output:
Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms Max: 2.47ms Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms Max: 1.234ms Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms Max: 0.759ms Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms Max: 0.578ms Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms Max: 0.676ms Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms Max: 0.999ms Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms Max: 0.786ms Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms Max: 0.948ms Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms Max: 0.504ms Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms Max: 0.702ms Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms Max: 1.104ms Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms Max: 1.01ms Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms Max: 3.577ms Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms Max: 0.792ms Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms Max: 0.723ms Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms Max: 0.858ms Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms Max: 3.501ms Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms Max: 4.809ms Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms Max: 14.394ms Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms Max: 8.572ms Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms Max: 15.739ms Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms Max: 16.787ms Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms Max: 39.971ms Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms Max: 45.473ms Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms Max: 31.844ms Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms Max: 21.199ms Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms Max: 21.608ms Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms Max: 18.779ms Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms Max: 26.991ms Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms Max: 27.701ms Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms Max: 26.351ms Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms Max: 40.517ms Avg: 22.0216ms
The first batch performs multiple runs with increasing thread counts, alternating between parallel regions that use the same number of threads. This batch shows the expected behavior: performance improves as the thread count grows.
The second batch then runs the same code, but alternates between parallel fors where one region uses one more thread than the other. This batch suffers a severe performance penalty, with computation times increasing by a factor of 50 to 100.
Compiling and running with gcc on Ubuntu produces the expected behavior: both batches perform similarly.
So the question is: what causes this huge performance loss when using Visual Studio?
After the experiments described in the comments on the question, and for lack of a better explanation, this appears to be a bug in the Visual Studio OpenMP runtime.