
SIMD - Single Instruction Multiple Data


Hi! In today’s lecture, we learned about SIMD - Single Instruction Multiple Data. This is a great tool to process data in bulk. So, instead of working one element at a time, depending on the element size, we can do 16, 8, 4 or 2 at a time. When the compiler applies this on its own, the technique is called auto-vectorization, and it falls into the category of machine instruction optimization that I mentioned in my last post.

If the machine is SIMD-enabled, the compiler can use it when translating a sum loop, for example. If we are summing 8-bit numbers with 128-bit SIMD, it can be up to 16 times faster. However, the compiler may decide that it is not safe to use SIMD due to overlapping or non-aligned data. In fact, the compiler will not apply SIMD in many cases, so we need to get our hands dirty and inject some assembly. I’ll show you how to do it in a second.
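
For example, a simple reduction loop like the one below is a good candidate for auto-vectorization. This is my own sketch, not course code; with GCC, building at -O3 (which turns on -ftree-vectorize) and adding -fopt-info-vec will report which loops were vectorized.

// My own sketch of a loop the compiler may auto-vectorize (not course code).
// Build with something like: gcc -O3 -fopt-info-vec sum.c
#include <stdint.h>
#include <stddef.h>

int32_t sum_bytes(const uint8_t *data, size_t n) {
    int32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];   // 16 x 8-bit lanes fit in one 128-bit SIMD register
    }
    return total;
}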

Here are the possible lane configurations of a 128-bit AArch64 Advanced SIMD register:
16 x 8 bits
8 x 16 bits
4 x 32 bits
2 x 64 bits
1 x 128 bits

Reading the ARM manual, we can find a lot of SIMD instructions. Bringing back the volume example, we can process 8 values at a time and not worry about overflow. The magic instruction is SQDMULH – Signed Saturating Doubling Multiply returning High half. With a name like that, it must make coffee too! Well, no. It multiplies the two source registers lane by lane, doubles the result, and writes the high half into the destination register, discarding the fraction portion – and instead of overflowing, it saturates to the minimum or maximum value. It is precisely what we need to apply the volume in one instruction.
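
To make that name a little less scary, here is a scalar sketch of what SQDMULH does to a single 16-bit lane. This is my own illustration based on my reading of the manual, assuming Q15 fixed-point values:

// Scalar illustration of SQDMULH on one int16_t lane (my own sketch).
#include <stdint.h>

int16_t sqdmulh_scalar(int16_t a, int16_t b) {
    int64_t product = 2 * (int64_t)a * (int64_t)b;  // doubling multiply, computed wide
    int64_t high    = product >> 16;                // keep the high half, drop the fraction
    if (high >  32767) high =  32767;               // saturate instead of overflowing
    if (high < -32768) high = -32768;               // (only a = b = -32768 can overflow)
    return (int16_t)high;
}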

Now let’s mix some C and Assembly, shall we?

The syntax is:
__asm__ ("assembly code" : outputs : inputs : clobbers);

Warning: this will break portability. It is a good idea to have compiler flags or preprocessor checks to “pick” the right portion of the code based on the architecture being compiled for. Here we are not doing that.

This is the code provided by our instructor. Do you see the loop in C and the ASM instructions inside? Line 52 processes 8 values per iteration using the magic single instruction SQDMULH. It is fast! The code, as it is, will only work on AArch64, though.
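
As a rough idea of what that loop looks like, here is a sketch of my own (not the instructor's exact file), assuming int16_t samples scaled in place and a volume factor already converted to Q15 fixed point:

// My reconstruction of a volume-scaling loop with inline AArch64 assembly.
// Assumes count is a multiple of 8 and vol_frac is a Q15 factor (e.g. 0.75 * 32767).
#include <stdint.h>
#include <stddef.h>

void scale_volume(int16_t *samples, size_t count, int16_t vol_frac) {
    int16_t *cursor = samples;
    int16_t *limit  = samples + count;

    // Duplicate the volume factor into all 8 lanes of v1.
    // Note: this sketch relies on v1 staying untouched between the two asm blocks.
    __asm__ ("dup v1.8h, %w[vol]" : : [vol] "r" (vol_frac) : "v1");

    while (cursor < limit) {
        __asm__ (
            "ldr q0, [%[ptr]]            \n\t"  // load 8 x 16-bit samples
            "sqdmulh v0.8h, v0.8h, v1.8h \n\t"  // saturating doubling multiply, high half
            "str q0, [%[ptr]], #16       \n\t"  // store and post-increment by 16 bytes
            : [ptr] "+r" (cursor)
            :
            : "memory", "v0"
        );
    }
}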


If, like me, you don’t like assembly, intrinsics will help. The GCC compiler offers function-like representations of the assembly instructions. I think that helps, but it also has its limitations. Here is the same example, but using intrinsics. Take a look at line 42.
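
Again, this is my own sketch of what the intrinsics version might look like, using the NEON intrinsics from <arm_neon.h> (vdupq_n_s16, vld1q_s16, vqdmulhq_s16, vst1q_s16):

// My own sketch of the same volume scaling using NEON intrinsics.
// Assumes int16_t samples and a Q15 volume factor; leftover samples (count % 8)
// would still need a scalar tail loop.
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

void scale_volume_intrinsics(int16_t *samples, size_t count, int16_t vol_frac) {
    int16x8_t vol = vdupq_n_s16(vol_frac);        // duplicate the factor into all 8 lanes

    for (size_t i = 0; i + 8 <= count; i += 8) {
        int16x8_t in  = vld1q_s16(samples + i);   // load 8 samples
        int16x8_t out = vqdmulhq_s16(in, vol);    // the SQDMULH instruction as an intrinsic
        vst1q_s16(samples + i, out);              // store 8 samples back
    }
}

Notice how the intrinsics version reads almost like regular C, and the compiler takes care of the registers for us.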


This is it for today. I’m working on profiling my awk build. Stay tuned!
