CUDA Study Notes 1: Udacity

Source

https://classroom.udacity.com/courses/cs344/lessons/55120467/concepts/670743010923

CPU and GPU

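A heterogeneous computer contains two different kinds of processor: a CPU and a GPU. The CUDA programming model lets us run part of a program on the GPU. The part of the program that runs on the CPU is called the host, and the part that runs on the GPU is called the device. The model also assumes that the host and the device each have their own separate memory. In the relationship between the two, the CPU is in charge: it tells the GPU what to do.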

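What the CPU (host) does:

Copy data from the CPU to the GPU
Copy data from the GPU to the CPU

Both of these steps use cudaMemcpy.

Allocate GPU memory: cudaMalloc
Launch a kernel on the GPU

In a typical program the order is: allocate GPU memory, copy the input to the GPU, launch the kernel, then copy the result back to the CPU.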

CUDA program example

#include <stdio.h>

__global__ void square(float * d_out, float * d_in){
    int idx = threadIdx.x;//threadIdx is a structure
    float f = d_in[idx];
    d_out[idx] = f * f;
}

int main(int argc, char ** argv) {
	const int ARRAY_SIZE = 64;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

	// generate the input array on the host
	float h_in[ARRAY_SIZE];
	for (int i = 0; i < ARRAY_SIZE; i++) {
		h_in[i] = float(i);
	}
	float h_out[ARRAY_SIZE];

	// declare GPU memory pointers
	float * d_in;
	float * d_out;

	// allocate GPU memory
	cudaMalloc((void**) &d_in, ARRAY_BYTES);
	cudaMalloc((void**) &d_out, ARRAY_BYTES);

	// transfer the array to the GPU
	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	// launch the kernel
	square<<<1, ARRAY_SIZE>>>(d_out, d_in);

	// copy back the result array to the CPU
	cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

	// print out the resulting array
	for (int i =0; i < ARRAY_SIZE; i++) {
		printf("%f", h_out[i]);
		printf(((i % 4) != 3) ? "\t" : "\n");
	}

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}
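
The program above ignores the status that each CUDA runtime call returns. As a minimal sketch, the same allocate, copy, launch, copy-back sequence can be wrapped with checks. The names cudaOk and squareOnGpu are hypothetical helpers introduced here for illustration and are not part of the course code; the sketch assumes n is at most 1024 so that a single block is enough.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the course): report a failed CUDA call.
// Returns true when the call succeeded.
static bool cudaOk(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void square(float *d_out, const float *d_in) {
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f;
}

// Hypothetical wrapper: squares n floats in one block (assumes n <= 1024),
// checking the status of every runtime call along the way.
bool squareOnGpu(const float *h_in, float *h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in = NULL;
    float *d_out = NULL;

    bool ok = cudaOk(cudaMalloc((void **)&d_in, bytes), "cudaMalloc d_in")
           && cudaOk(cudaMalloc((void **)&d_out, bytes), "cudaMalloc d_out")
           && cudaOk(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice),
                     "copy to device");
    if (ok) {
        square<<<1, n>>>(d_out, d_in);
        // The launch itself returns nothing; cudaGetLastError reports launch errors.
        ok = cudaOk(cudaGetLastError(), "kernel launch")
          && cudaOk(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost),
                    "copy to host");
    }

    cudaFree(d_in);   // cudaFree on a null pointer does nothing
    cudaFree(d_out);
    return ok;
}
```

Calling squareOnGpu(h_in, h_out, ARRAY_SIZE) from main would replace the unchecked calls in the example, and a false return value signals that one of the steps failed.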

Configuring the kernel launch

Take this launch as an example:

square<<<1, ARRAY_SIZE>>>(d_out, d_in);
// General form: square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(d_out, d_in);
// This launches bx*by*bz blocks, each with tx*ty*tz threads. shmem is the shared memory per block in bytes and defaults to 0.

We launch the kernel with the launch parameters 1, ARRAY_SIZE and pass it the arguments d_out, d_in.
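
To make the dim3 form of the launch parameters concrete, here is a small sketch that launches a single two-dimensional block. The kernel fillCoords and the wrapper launchFill2D are hypothetical names made up for this illustration; each thread simply records its own (x, y) position so the layout can be inspected on the host.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread encodes its (x, y) position in the block
// as y * 100 + x and stores it at its flattened position.
__global__ void fillCoords(int *d_out) {
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    d_out[idx] = threadIdx.y * 100 + threadIdx.x;
}

// Hypothetical wrapper: launches one block of tx * ty threads
// (assumes tx * ty <= 1024) and copies the result into h_out.
bool launchFill2D(int *h_out, int tx, int ty) {
    size_t bytes = (size_t)tx * ty * sizeof(int);
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        return false;
    }

    // dim3 components that are not given default to 1, so this is a
    // 1 x 1 x 1 grid of tx x ty x 1 blocks; shared memory stays at 0.
    fillCoords<<<dim3(1, 1, 1), dim3(tx, ty)>>>(d_out);

    cudaError_t err = cudaGetLastError();  // catches an invalid launch configuration
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

With tx = 8 and ty = 4, the thread at x = 3, y = 2 writes the value 203 to position 2 * 8 + 3 = 19. Writing <<<1, ARRAY_SIZE>>> is shorthand for <<<dim3(1, 1, 1), dim3(ARRAY_SIZE, 1, 1)>>>.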

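Here we launch 64 threads, that is, one block of 64 threads. When launching a kernel:

many blocks can run at the same time;
each block holds many threads; newer GPUs support up to 1024 threads per block, so never ask for more than 1024.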
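
Rather than hard-coding the limit, the per-block thread limit can be read from the device at run time. The sketch below uses cudaGetDeviceProperties from the CUDA runtime; the function name maxThreadsPerBlock is a hypothetical wrapper chosen for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper: returns the maximum number of threads per block
// for the given device, or -1 if the query fails.
int maxThreadsPerBlock(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerBlock;
}
```

Calling maxThreadsPerBlock(0) queries the first GPU in the system; a launch that asks for more threads per block than this value fails.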

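In the threadIdx structure, x, y, and z are the thread's indices within its block along each dimension.

blockDim: the size of a block, that is, how many threads it has along each dimension;

blockIdx: the index of the block within the grid;

gridDim: the size of the grid, that is, how many blocks it has along each dimension;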
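
These built-in variables are usually combined to give every thread a unique global index, which is what allows an array larger than one block to be processed. The following is a minimal sketch under assumptions: the kernel squareLarge, the wrapper squareLargeOnGpu, and the block size of 256 are illustrative choices, not something taken from the course.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: squares n floats using a one-dimensional grid of blocks.
__global__ void squareLarge(float *d_out, const float *d_in, int n) {
    // Global index: offset of this block plus position inside the block.
    // The grid holds gridDim.x * blockDim.x threads in total.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // the last block may have threads past the end of the array
        float f = d_in[idx];
        d_out[idx] = f * f;
    }
}

// Hypothetical wrapper: picks enough blocks of 256 threads to cover n elements.
bool squareLargeOnGpu(const float *h_in, float *h_out, int n) {
    const int threads = 256;                       // assumed block size
    int blocks = (n + threads - 1) / threads;      // round up
    size_t bytes = n * sizeof(float);
    float *d_in = NULL;
    float *d_out = NULL;

    bool ok = cudaMalloc((void **)&d_in, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        squareLarge<<<blocks, threads>>>(d_out, d_in, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

For n = 2000 this launches 8 blocks of 256 threads (2048 threads in total), and the bounds check keeps the extra 48 threads from touching memory outside the array.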

Summary

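When we write a kernel, it looks like a program that runs on a single thread;
When we run the program, we launch that kernel from CPU code;
Inside the kernel, each thread knows its own index.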