The VBA code is pretty slow and cannot be executed asynchronously. The aim of this article is to show how to execute OpenCL code simultaneously at GPGPU and CPU from VBA. The performance measurements of OpenCL code executed on CPU shows that it is even faster than the native C# code.
OpenCL is a C99 based language to program the GPGPU and CPU (the Wikipedia mentioned also DSPs and FPGAs). The positive of OpenCL is that the same code can be executed on GPGPU and CPU without any changes and the number of supported platforms is enormous. From the other side, the most widespread programming language for non-professionals, VBA, supports only calculations at one processor and cannot asynchronously execute the code.
I like Excel very much because of its interactivity, but sometimes for massive calculations, one needs a little more power. So, I hope this article will close this gap and will show how to add the support of multiplatform / multithreading calculations on GPGPU / CPU to Excel.
The purpose of this short article is to show how to add the support of OpenCL to the VBA. Since my needs were limited only by increasing of computational power of Excel, no image processing and corresponding topics were implemented / discussed. For those, who like the solution-from-the-box, there is an installer (Windows only) that installs the
ClooWrapperVBA library and registers it. For others, who do not trust installers, the same content is available also as zip-file. You will have only to register DLL using "register.bat".
Who benefits from the library?
- The hobby programmers who make their models in Excel. That's me!
- The people who would like to learn OpenCL, but do not want to install additional programs like Visual Studio.
- The scientists of old school, who do their investigations in Excel. I have heard about that samurais.
- .NET 4.0 Framework
- Installed Excel
- OpenCL.dll in Windows folder (normally exist on each Windows computer).
How the code is organized:
- Sources contain sources of the wrapper (in C#) and VBA code. Demo Excel table is not attached to the sources.
- Demo Excel table can be found after installation in "C:\Program Files (x86)\ClooWrapperVBA\demo".
- If you want to compile the C# sources by yourself, the demo Excel table can be found in zip file in folder "demo".
Let's Burn the GPU/CPU!
The DLL is written in C# and contains COM interface to Cloo, accessible from Excel. Cloo is an open source wrapper to execute OpenCL code from .NET.
Let's start with a simple program that performs matrix multiplication code written in OpenCL.
Dim clooConfiguration As New ClooWrapperVBA.Configuration
At first, that would be interesting to check which hardware is available to us. The code snippet below iterates through all available platforms and all devices inside each platform and gets their configurations:
nPlatforms = clooConfiguration.Platforms
For i = 1 To nPlatforms
result = clooConfiguration.SetPlatform(i - 1)
If result Then
platformName = clooConfiguration.Platform.PlatformName
platformVendor = clooConfiguration.Platform.PlatformVendor
platformVersion = clooConfiguration.Platform.PlatformVersion
nDevices = clooConfiguration.Platform.Devices
For j = 1 To nDevices
result = clooConfiguration.Platform.SetDevice(j - 1)
If result Then
deviceType = clooConfiguration.Platform.Device.DeviceType
deviceName = clooConfiguration.Platform.device.DeviceName
deviceVendor = clooConfiguration.Platform.device.DeviceVendor
maxComputeUnits = clooConfiguration.Platform.device.MaxComputeUnits
deviceAvailable = clooConfiguration.Platform.device.DeviceAvailable
compilerAvailable = clooConfiguration.Platform.device.CompilerAvailable
deviceVersion = clooConfiguration.Platform.device.DeviceVersion
driverVersion = clooConfiguration.Platform.device.DriverVersion
globalMemorySize = clooConfiguration.Platform.device.GlobalMemorySize
maxClockFrequency = clooConfiguration.Platform.device.MaxClockFrequency
maxMemoryAllocationSize = _
openCLCVersionString = _
The important configuration settings are:
deviceType ("GPU" / "CPU")
maxComputeUnits - number of processors / threads that can be used
SetDevice functions return
true if the functions were executed successfully,
false, if not.
If no platforms or devices were found, then check if OpenCL.dll exist in "Windows" folder (normally in "C:\Windows\"). If you have no OpenCL.dll in "Windows" folder, then copy it from other computer or Google for it. Do antivirus check in any case! Another possible reason, the GPGPU / CPU drivers are too old and not supported by OpenCL. Google the latest drivers for your CPU / GPGPU.
Matrix Multiplication in OpenCL
Now, let us multiply two floating point matrices M1[p, q] and M2[q, r] using the first available device that can compile OpenCL sources.
First of all, the OpenCL sources will be read.
Open Application.ActiveWorkbook.Path & "\cl\MatrixMultiplication.cl" For Binary As #1
sources = Space$(LOF(1))
Get #1, , sources
The matrix multiplication OpenCL code of two floating point matrices is shown below:
__kernel void FloatMatrixMult_
(__global float* MResp, __global float* M1, __global float* M2, __global int* q)
int i = get_global_id(0);
int j = get_global_id(1);
int p = get_global_size(0);
int r = get_global_size(1);
MResp[i + p * j] = 0;
int QQ = q;
for (int k = 0; k < QQ; k++)
MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j];
If the found device has a compiler (
= true), we will try to compile the OpenCL sources:
Set progDevice = New ClooWrapperVBA.ProgramDevice
result = progDevice.Build(sources, "", platformId, deviceId, cpuCounter, buildLogs)
The parameters are:
sources - the sources as a plain text, read from the file at previous step.
- Second parameter ("") contains compiler options. In the simplest case, it can be empty ("", not
Nothing). Among the common compiler options, like "
-w" (Inhibit all warning messages), you can also define here commonly used constants ("
-D name=definition") and use them in the OpenCL code. The complete list of compiler options can be found at Khronos web page.
- The fifth parameter,
cpuCounter, defines the device index of some certain type ("
CPU" or "
GPU"). Since your platform can have thousands of devices of different type, this parameter distinguishes between different devices of the same type.
buildLogs shows the current compiler logs. It's important to check if it contains any errors. For me, it was very helpful to check the build logs for warnings when I developed my first own OpenCL code. The warnings were mostly the reasons of the kernel crash during execution. The errors are also accumulated in property
errorString = progDevice.ErrorString
If the compilation was successful (
= true), we have to define the kernel which we want to execute. The single parameter of the
CreateKernel function is a kernel name (
result = progDevice.CreateKernel("DoubleMatrixMult")
Then we have to set input and output arrays of the kernel. OpenCL typically uses vectors (1-dimensional arrays) as input parameters:
result = progDevice.SetMemoryArgument_Double(0, vecResp)
result = progDevice.SetMemoryArgument_Double(1, vecM1)
result = progDevice.SetMemoryArgument_Double(2, vecM2)
result = progDevice.SetMemoryArgument_Long(3, vecQ)
To set kernel arguments, there are six functions:
- To set arrays:
SetMemoryArgument_Long (corresponds to array of integers in C#)
SetMemoryArgument_Single (corresponds to array of floats in C#)
- To set values:
SetValueArgument_Long (corresponds to integer in C#)
SetValueArgument_Single (corresponds to float in C#)
The first parameter, argument index, starts with
0 for first argument and must be increased for the next arguments. It is also very important to set variables in a right sequence. First, the variable with argument index
0, then with argument index
1 and so on.
Now, the input array sizes must be set in the
globalWorkSize(0) = p
globalWorkSize(1) = r
Finally, we can start kernel execution. The function
ExecuteSync returns to VBA only when its execution is finished. If you want to run your OpenCL code in asynchronous mode on several devices, then you have to read "Advanced topics":
result = progDevice.ExecuteSync(globalWorkOffset, globalWorkSize, localWorkSize)
The results of execution must be taken from the kernel using appropriate
result = progDevice.GetMemoryArgument_Double(0, vecResp)
The first parameter, argument
index, has the same meaning as the argument index parameter of "
Set" functions, but you can use the "
Get" functions in arbitrary order.
Finally, the memory should be cleaned from the arrays and all instantiated
result = progDevice.ReleaseMemObject(3)
result = progDevice.ReleaseMemObject(2)
result = progDevice.ReleaseMemObject(1)
result = progDevice.ReleaseMemObject(0)
result = progDevice.ReleaseKernel
result = progDevice.ReleaseProgram
The releasing functions have to be called in the following sequence:
- Release memory arguments, starting with highest argument index
- Release kernel
Finally, note, that all COM-visible functions of DLL return boolean:
true, if the function was executed successfully,
false, if not. Any function returned
false, then it is helpful to check errors in property
The basic example discussed in details before contains only a small part of possibilities of Cloo. The complete possibilities of
ClooWrapperVBA can be tested in corresponding Excel sheets ("
Simply press "Configuration" button and you will get the truth about your system. :-)
The sheet performs simple performance measurement using and OpenCL code from the article "How to Use Your GPU in .NET". The performance test is done at first found CPU and GPGPU device in single and double precisions.
Another performance test is done for multiplication of two 1200*1200 double precision matrices in VBA and first found CPU / GPGPU. The results of calculations on CPU / GPGPU are compared to VBA results (see cells C3:C4) and the computation time was measured (cells B2:B4). The results show that the OpenCL code executed on GPGPU is 300 times faster than VBA (cell B9) and also 8 times faster that native C# code! (Simply imagine, my Excel is 8 times faster than native C#!!!) The "native" calculations were done in C# and contain the same multiplication of two matrices 1200*1200. The native performance measurement was performed only on my computer and therefore stay unchanged. If you want, you can write this part on any programming language you want (C#/C/C++) and update the execution time manually. The reason I didn't add it to
ClooWrapperVBA is because it was a single test case and I wanted to have only a clean source of
Asynchronous Execution of OpenCL Code
There are two additional functions to execute the kernel in asynchronous mode:
result = programDevice.ExecuteBackground_
(globalWorkOffset, globalWorkSize, localWorkSize, THREAD_PRIORITY)
In the snippet below, the OpenCL code is executed at CPU and GPGPU simultaneously in infinite loop:
- Every 100 ms, the state of the execution at CPU / GPGPU is checked using
ExecutionCompleted function (
true - execution is completed,
false - the OpenCL code is still running).
- If execution was completed, the output array will be read from the kernel, the new input array will be written to the kernel and the OpenCL code will be started at desired device using
- The infinite loop will run until the
20) will be reached.
While Not allTasks_Completed
For i = 1 To progDevices.Count
If progDevices.Item(i).ProgramDevice.ExecutionCompleted Then
result = progDevices.Item(i).ProgramDevice.GetMemoryArgument_Double_
finishedTasks = finishedTasks + 1
If startedTasks < MAX_TASKS Then
result = progDevices.Item(i).ProgramDevice.SetMemoryArgument_Double_
(globalWorkOffset, globalWorkSize, localWorkSize, _
THREAD_PRIORITY, AddressOf Asynchronous.CPU_Task_Completed)
result = progDevices.Item(i).ProgramDevice.ExecuteBackground_
(globalWorkOffset, globalWorkSize, localWorkSize, THREAD_PRIORITY)
startedTasks = startedTasks + 1
currentTaskId(i) = startedTasks
progDevices.Item(i).ProgramDevice.ExecutionCompleted = False
If startedTasks = finishedTasks Then
allTasks_Completed = True
result = programDevice.ExecuteAsync(globalWorkOffset, globalWorkSize, _
localWorkSize, THREAD_PRIORITY, AddressOf Asynchronous.CPU_Task_Completed)
- The first versions of Excel demo table used the function
ExecuteAsync to run OpenCL code in asynchronous mode. The function used callback to the VBA function when the execution was completed. But later, I found that it is rather problematic to debug the code which uses callback functions. Excel, for example, crashes if you are currently in debugger and the code received a callback. Even writing the results to the sheet can cause an Excel crash. Therefore, in the test table, I do not use this function even if it is still in the COM interface.
The common parameter of both functions,
THREAD_PRIORITY, sets one of five priorities (0 - "
Lowest", 1 - "
BelowNormal", 2 - "
Normal", 3 - "
AboveNormal", 4 - "
Highest"), but I'm pretty sure that this parameter has no influence on the execution of the OpenCL code. I simply added it because I had such possibility :-)
The sheet initializes OpenCL program to multiply two 2000*2000 matrices at first found CPU and GPGPU devices and run it 20 times in asynchronous mode. So, every time the program is executed at CPU and GPGPU asynchronously. The state of execution is shown in the sheet. On my computer, the number of finished CPU calculations is 4 times smaller than a number of GPGPU processes, which correspond to performance measurements done before.
Run ClooWrapperVBA from VBScript
Actually, it was not a big surprise for me that VBScript can work with COM-DLLs, but it was still quite surprising. The
ClooWrapperVBA can obtain the platform/device configurations and write them out. The calculations using
ClooWrapperVBA are at the moment not possible because the variables in VBScript are of type
Variant and must be sent to DLL using
ArrayList, which will make DLL more complicated. Since the usage from VBScript was not intended, the
ClooWrapperVBA can be used in VBScript only to write out your current platform/device configuration.
Conclusion and Points of Interest
The sources are available at my GitHub page and it would be great if you (yes, especially YOU!) will create a pull request. Any improvements are highly appreciated!
- 15th May, 2022: Initial version