I use OpenCL as the platform for GPU programming. The code intended to run on the graphics hardware is based on C, so it is quite easy to use and understand. The OpenCL specification defines an optional extension for double precision floating point arithmetic, and this extension is supported by the NVIDIA driver on the GPU I use (a GeForce GTX 285). So my plan was to read back partial results from the GPU into my code and compare them to the corresponding data produced by the original program.
Without further ado, I will just say what I found out after many hours of meticulous testing. Even for the basic arithmetic operations, the results of calculations are not necessarily the same. I got to a piece of code which calculated an expression like "(a-b)*(a-b)+(x-y)*(x-y)", and the result was one bit off compared to the number produced by C or Java code. At first I thought that the GPU implemented floating point rounding incorrectly: that the result of the expression fell between two neighbouring double values and the incorrect one was selected. It turned out that this wasn't exactly the case.
I've created a very simple C program to make sure that nothing irrelevant interfered with my test. The GPU code consists of the following function:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void test(double a, double b, double x, double y,
                   __global double* c) {
  c[0]=a-b;
  c[1]=(a-b)*(a-b);
  c[2]=x-y;
  c[3]=(x-y)*(x-y);
  c[4]=(a-b)*(a-b)+(x-y)*(x-y);
}
I won't digress by explaining the details of this OpenCL C code. It takes four double arguments and returns the results of five expressions in the buffer 'c'.
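For completeness, here is a minimal host-side sketch that builds such a kernel and reads the buffer 'c' back. It is only a sketch with all error checking omitted, not the exact program I used, but the standard OpenCL calls are what matters:

#include <stdio.h>
#include <CL/cl.h>

/* the kernel source shown above, embedded as a string */
static const char *source =
  "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
  "__kernel void test(double a, double b, double x, double y,\n"
  "                   __global double* c) {\n"
  "  c[0]=a-b;\n"
  "  c[1]=(a-b)*(a-b);\n"
  "  c[2]=x-y;\n"
  "  c[3]=(x-y)*(x-y);\n"
  "  c[4]=(a-b)*(a-b)+(x-y)*(x-y);\n"
  "}\n";

int main(void) {
  double a=22, b=12.020105095868775, x=201, y=210;
  double c[5];
  cl_int err;
  size_t one = 1;
  cl_platform_id platform;
  cl_device_id device;

  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  cl_kernel kernel = clCreateKernel(program, "test", &err);

  /* output buffer for the five results */
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);
  clSetKernelArg(kernel, 0, sizeof(double), &a);
  clSetKernelArg(kernel, 1, sizeof(double), &b);
  clSetKernelArg(kernel, 2, sizeof(double), &x);
  clSetKernelArg(kernel, 3, sizeof(double), &y);
  clSetKernelArg(kernel, 4, sizeof(cl_mem), &buf);

  /* one work-item is enough, the kernel computes five scalars */
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &one, NULL, 0, NULL, NULL);
  clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

  for (int i = 0; i < 5; i++)
    printf("c[%d] = %.17g\n", i, c[i]);
  return 0;
}

The only part that is interesting for this post is the last two enqueue calls: the kernel is executed and the resulting doubles are copied back, so they can be compared bit by bit with the values the CPU computes.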
The values of the parameters are declared in the C code as follows:
double a=22;
double b=12.020105095868775;
double x=201;
double y=210;
I compared the values returned by the GPU code with those that the C program itself computed, using a non-converting cast to long to ensure that even a single-bit difference becomes visible.
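The non-converting cast can be done in C with memcpy, for instance (long long rather than long, since a plain long is not guaranteed to be 64 bits wide):

#include <stdio.h>
#include <string.h>

/* reinterpret the 64 bits of a double as an integer, without conversion */
static long long bits(double d) {
  long long r;
  memcpy(&r, &d, sizeof r);
  return r;
}

int main(void) {
  double a=22, b=12.020105095868775;
  /* the same helper is applied to the doubles read back from the GPU buffer */
  printf("%lld\n", bits((a-b)*(a-b)));
  return 0;
}

The integer printed for (a-b)*(a-b) should be exactly the value shown in the CPU column of the table below.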
These are the results:
a-b      CPU: 4621807799426188662,  GPU: 4621807799426188662  -> 1
(a-b)^2  CPU: 4636709024391772619,  GPU: 4636709024391772619  -> 1
x-y      CPU: -4602115869219225600, GPU: -4602115869219225600 -> 1
(x-y)^2  CPU: 4635400285215260672,  GPU: 4635400285215260672  -> 1
sum      CPU: 4640558254430887142,  GPU: 4640558254430887141  -> 0
The last column is the result of a simple double equality check. Every subexpression appears to be the same on both sides, except for the very last one. Very strange. Here were two pairs of bitwise-identical numbers, (a-b)^2 and (x-y)^2 as computed on the CPU and on the GPU, and yet adding them yielded different results. This is when I thought that addition doesn't work correctly on the GPU side. To make sure that this was the case, I planted the values of (a-b)^2 and (x-y)^2 as double literals into the OpenCL code, checked that they were the values I wanted to add (by returning them in the output buffer, where I saw the exact same values as in the table above), and added them. The result was the CPU version of the sum.
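One way to obtain such literals is to print the CPU values with 17 significant digits, which is enough to reproduce a double exactly; for example:

#include <stdio.h>

/* print the two operands with enough digits to round-trip a double exactly */
int main(void) {
  double a=22, b=12.020105095868775, x=201, y=210;
  printf("%.17g\n", (a-b)*(a-b));
  printf("%.17g\n", (x-y)*(x-y));   /* this one is exactly 81 */
  return 0;
}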
What can this mean? Only one thing: although these pairs of numbers appear to be the same when observed from CPU code, they actually are not, because their sums differ. My conclusion is that there are hidden details in these doubles on the GPU side: internally, the numbers seem to be carried at a precision higher than 64 bits. It is quite possible that the same is true on the CPU, but the internal precisions must be different. Since the double data type is strictly 64 bits wide, this additional data cannot be observed directly; it only manifests itself in the results of further operations, when the numerical difference accumulates to an amount that becomes visible after rounding to the nearest floating point value representable in 64 bits.
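One plausible source of such extra intermediate precision, although I cannot confirm that this is what happens on this particular GPU, is a fused multiply-add, where a*b+c is computed with the product kept unrounded until the final addition. The effect is easy to illustrate on the CPU with the C99 fma() function; the sketch below (link with -lm, and assuming the compiler does not contract the plain expression on its own) prints the bit patterns of the separately rounded expression and of the fused variant, so both can be compared against the table above:

#include <math.h>
#include <stdio.h>
#include <string.h>

static long long bits(double d) {
  long long r;
  memcpy(&r, &d, sizeof r);
  return r;
}

int main(void) {
  double a=22, b=12.020105095868775, x=201, y=210;
  /* ordinary evaluation: both products are rounded before the addition */
  double plain = (a-b)*(a-b) + (x-y)*(x-y);
  /* fused variant: the first product stays unrounded until the addition */
  double fused = fma(a-b, a-b, (x-y)*(x-y));
  printf("plain: %lld\nfused: %lld\n", bits(plain), bits(fused));
  return 0;
}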
Conclusion
Different hardware has different internal numerical precision, even if the data types it exposes are equivalent. So under these conditions, one cannot hope that performing the exact same calculations on the exact same input results in the exact same output. There is nothing that can be done about this.