I recently started going through an amazing Udacity course on Parallel Programming. Having been working on image processing and computer vision for quite some time now, I have realized that CPUs are NOT designed for image processing applications. Even the oh-so-optimized OpenCV implementations of computer vision algorithms in C/C++ are not real time when working on something as computationally expensive as variational optical flow. However, if you use the in-built CUDA module (in OpenCV 3.0), the performance is way better.
CPUs are not getting any faster, due to limitation of clock speeds which have virtually remained the same since the past 5 years or so. Increasing this clock speed has become close to impossible, since increasing clock speeds increases the power consumption, which makes it difficult to cool the CPU. So, for faster computations, GPUs are the way to go.
GPU vs CPU
My computer has a quad-core processor with hyper threading (an Intel i7 Ivy Bridge). This means that, in the best case, I can have at most 8-threads truly running in parallel. On the other hand, the low-end GPU that I have (nVidia GeForce GT630M) has 96 cores!
In general, a CPU has a few, very powerful computation cores, where as a GPU has a very large number of smaller, less powerful computation cores. The time taken to perform any one particular task is less on the CPU, but if you need to performs thousands of such tasks, then the GPU would beat the CPU.
One more important thing to note is that, while designing CPUs engineers optimize for latency. On the other hand, maximum thorughput is what the designers are aiming at while making GPUs.
Throughput vs Latency
Throughput: It is defined as the amount of work done in unit time. For example, I need to transport 50 bags of rice from ground floor of a building to the 10th floor, and I can carry at most two bags at a time. Let the time taken in each trip be 5 minutes. So, the amount of work done in one hour would be 2*(60/5) = 24 bags. I can say that the throughput is 24 bags/hr.
Latency: It is defined as the amount of time taken to perform a particular task. In the previous example, it would take take 125 minutes to take all the 50 bags, and hence the latency (measures in time units) is 125 minutes.
CUDA is a framework developed by nVidia for writing programs that run both on the GPU and the CPU. On the CPU side, you can write programs in C, and then used some extensions to C (written by nVidia) to write programs that run on the GPU. These programs that run on the GPU are called kernels. A kernel looks like a serial program, but the CPU launches on a large number of threads on the GPU. In CUDA, the CPU is referred to as the host while the GPU is referred to as the device. In this relationship between the CPU and the GPU, the CPU is the alpha. The CPU and the GPU have separate memories, and can perform operations only on the data that is stored in their own memory. The CPU can allocate memory on the GPU, copy data from the CPU memory to the GPU memory, launch kernels on hundreds of thread on the GPU, and copy back the results from the GPU memory. The GPU, on the other hand, can only respond to call of memory copy made by the CPU, and cannot make its own requests for data transfer.
Skeleton of a CUDA program:
- Allocate memory on the GPU
- Transfer data from the CPU memory to the GPU memory
- Perform the computations on the GPU
- Copy the results from the GPU to the CPU
A sample code in CUDA, which calculates the cubes of all integers from 1 to 64.
If you have the nVidia CUDA toolkit installed, you can compile and run the above program using: