Wednesday, 11 September 2013

self-define kernel for matrix multiplication in CUDA

I am just starting CUDA programming and I am learning about kernel
design for matrix multiplication. I copied the main code I found online and
added my own part to implement the multiplication of A (MxM) and B (MxN):
#include <iostream>
#include <iomanip>
#include <fstream>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cusp/complex.h>

using namespace std;

const int M = 55, N = 73;
typedef cusp::complex<double> Complex;

// Computes one element of C = A * B; note that a single dimension n
// is used for every stride, so this assumes n x n matrices.
__global__ void kernelFunc(Complex* ad, Complex* bd, Complex* cd, int n)
{
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;  // column index
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;  // row index
    if (x < n && y < n)
    {
        Complex v = Complex(0.0, 0.0);
        for (int i = 0; i < n; i++)
            v += ad[y * n + i] * bd[i * n + x];
        cd[y * n + x] = v;
    }
}

int main(int argc, char* argv[])
{
    std::vector<Complex> _A(M * M);
    std::vector<Complex> _B(M * N);

    // Device buffers
    Complex *A, *B, *C;
    cudaMalloc((void**)&A, M * M * sizeof(Complex));
    cudaMalloc((void**)&B, M * N * sizeof(Complex));
    cudaMalloc((void**)&C, M * N * sizeof(Complex));

    // Fill the host data and copy it to the device
    for (int i = 0; i < M * M; i++) _A[i] = Complex((double)i, (double)i);
    for (int i = 0; i < M * N; i++) _B[i] = Complex(1.0, 0.0);
    cudaMemcpy(A, &_A[0], (M * M) * sizeof(Complex), cudaMemcpyHostToDevice);
    cudaMemcpy(B, &_B[0], (M * N) * sizeof(Complex), cudaMemcpyHostToDevice);

    // One 32x32 thread block per tile of the output
    dim3 block(32, 32);
    dim3 grid((M + 31) / 32, (M + 31) / 32);
    kernelFunc<<<grid, block>>>(A, B, C, M);

    // Copy the result back (reusing _B as the host buffer)
    cudaMemcpy(&_B[0], C, (M * N) * sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
But it was stated online that the involved matrices must be square, so
does that mean this code cannot be used on matrices of arbitrary
dimensions? I don't understand how to choose the number of blocks and the
grid dimensions to fit my problem. The matrix in my problem has dimension
MxM with M an odd number. I have tried the code above on a small matrix
and it seems to work, but I have to apply it to a pretty big matrix and I
don't know whether it will work for the big matrix too.
