SIMD - Single Instruction Multiple Data


Hi! In today’s lecture, we learned about SIMD (Single Instruction, Multiple Data). It is a great tool for processing data in bulk: instead of handling one value at a time, a single instruction can handle 16, 8, 4, or 2 values at once, depending on the size of each value. When the compiler generates SIMD code for us, the technique is called auto-vectorization, and it falls into the category of machine instruction optimization that I mentioned in my last post.

If the machine supports SIMD, the compiler can use it when translating a sum loop, for example. If we are summing 8-bit numbers with 128-bit registers, each instruction handles 16 values, so the loop can run up to 16 times faster. However, the compiler may decide that it is not safe to use SIMD, for instance when the arrays might overlap or the data might not be aligned. In fact, the compiler will skip SIMD in many cases, so sometimes we need to get our hands dirty and inject some assembly. I’ll show you how to do it in a second.
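As a rough sketch of the overlap problem, here is a plain C loop that adds two arrays of 8-bit values. The function name `add_u8` is my own, not from the lecture. The `restrict` qualifiers promise the compiler that the three arrays do not overlap, which removes one of the reasons it might refuse to vectorize. Whether SIMD instructions are actually emitted still depends on the compiler, the target, and the optimization flags, so take this as an illustration and not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: dst[i] = a[i] + b[i] for n unsigned 8-bit values.
// "restrict" tells the compiler that dst, a and b never overlap, so it is
// free to load and store several elements at once if it chooses to.
void add_u8(uint8_t *restrict dst,
            const uint8_t *restrict a,
            const uint8_t *restrict b,
            size_t n) {
    for (size_t i = 0; i < n; i++) {
        // The cast makes the wrap-around on 8-bit overflow explicit.
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

With 128-bit registers, a vectorized version of this loop would handle 16 of these additions per instruction.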

Here are the lanes of the 128-bit AArch64 Advanced SIMD:
16 x 8 bits
8 x 16 bits
4 x 32 bits
2 x 64 bits
1 x 128 bits
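The list above is just a division: 128 bits divided by the width of each element. A tiny sketch, where the name `lanes_per_register` is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

// Width of one AArch64 Advanced SIMD register, in bits.
#define VECTOR_BITS 128

// Hypothetical helper: how many elements of the given size (in bytes)
// fit side by side in one 128-bit register.
size_t lanes_per_register(size_t element_bytes) {
    return VECTOR_BITS / (element_bytes * 8);
}
```

For example, `lanes_per_register(sizeof(int16_t))` gives the 8 lanes used for the 16-bit audio samples later in this post.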

Reading the ARM manual, we can find a lot of SIMD instructions. Going back to the volume example, we can process 8 values at a time without worrying about overflow. The magic instruction is SQDMULH (Signed saturating Doubling Multiply returning High half). With that name, it must make coffee too! Well, no. It multiplies each lane of the second operand by the matching lane of the third, doubles the product, and puts the high half of the result into the first operand, discarding the fractional part. It does not overflow: when a result is out of range, it is clamped to the minimum or maximum value. It is precisely what we need to scale the volume in one instruction.
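To convince myself of what the instruction does, I find it useful to model a single 16-bit lane in plain C. This is only a sketch of my reading of the instruction, not the real thing: the function name `sqdmulh_lane` is made up, and the code assumes that `>>` on a negative value is an arithmetic shift (which is what GCC and Clang do).

```c
#include <stdint.h>

// Scalar model of one 16-bit lane of SQDMULH (illustration only).
int16_t sqdmulh_lane(int16_t a, int16_t b) {
    // Multiply and double; 64-bit math keeps the C code free of overflow.
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;

    // Saturate to the signed 32-bit range.
    // Only -32768 * -32768 can actually go past the maximum.
    if (doubled > INT32_MAX) doubled = INT32_MAX;
    if (doubled < INT32_MIN) doubled = INT32_MIN;

    // Return the high half: the upper 16 bits of the 32-bit result.
    // Assumes an arithmetic right shift for negative values.
    return (int16_t)(doubled >> 16);
}
```

Treating both inputs as fractions of 32768, `sqdmulh_lane(16384, 24575)` is roughly 0.5 times 0.75, and the one case that would overflow, `sqdmulh_lane(-32768, -32768)`, is clamped to 32767.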

Now let’s mix some C and Assembly, shall we?

The syntax is:
__asm__ ("assembly code" : outputs : inputs : clobbers);

Warning: this breaks portability. It is a good idea to use compile-time checks, such as preprocessor conditionals or compiler flags, to “pick” the right portion of the code for the architecture being compiled. Here we are not doing that.
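To make the warning concrete, here is a minimal sketch of how the code could pick a path at compile time. It relies on the `__aarch64__` macro that GCC and Clang predefine when targeting AArch64, and the function name `scale_samples` is my own, not part of the lab code. On AArch64 it uses the intrinsics from `arm_neon.h` (the same ones that show up later in this post); everywhere else it falls back to plain C that should give the same results, assuming `>>` on a negative value is an arithmetic shift.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: scales n samples in place by a fixed-point volume.
// n is assumed to be a multiple of 8, as in the lab code.
void scale_samples(int16_t *samples, size_t n, int16_t vol_int) {
#if defined(__aarch64__)
    // AArch64 build: eight samples per iteration.
    int16x8_t vol = vdupq_n_s16(vol_int);
    for (size_t i = 0; i < n; i += 8) {
        vst1q_s16(samples + i, vqdmulhq_s16(vld1q_s16(samples + i), vol));
    }
#else
    // Any other architecture: one sample at a time in plain C.
    for (size_t i = 0; i < n; i++) {
        // (a * b) >> 15 keeps the same bits as the high half of 2 * a * b.
        int32_t scaled = ((int32_t)samples[i] * vol_int) >> 15;
        // Saturate the single case that does not fit (-32768 * -32768).
        if (scaled > INT16_MAX) scaled = INT16_MAX;
        samples[i] = (int16_t)scaled;
    }
#endif
}
```

The caller does not need to know which branch was compiled, which is the whole point: the same source builds everywhere, and only AArch64 gets the SIMD path.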

This is the code provided by our instructor. Do you see the loop in C and the ASM instructions inside? The sqdmulh line processes 8 values per iteration using the magic single instruction SQDMULH. It is fast! The code, as it is, will only work on AArch64, though.

// vol_inline.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29-2019.10.02 - Licensed under GPLv3.
// For the SIMD lab in the Seneca College SPO600 Course
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
int main() {
    int16_t* data; // input array
    int16_t* limit; // end of input array
    // these variables will be used in our assembler code, so we're going
    // to hand-allocate which register they are placed in
    // Q: what is an alternate approach?
    register int16_t* cursor asm("r20"); // input cursor
    register int16_t vol_int asm("r22"); // volume as int16_t
    int x; // array iterator
    int ttl = 0; // array total
    data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    srand(-1);
    printf("Generating sample data.\n");
    for (x = 0; x < SAMPLES; x++) {
        data[x] = (rand()%65536)-32768;
    }
    // --------------------------------------------------------------------
    cursor = data;
    limit = data + SAMPLES;
    // set vol_int to fixed-point representation of 0.75
    // Q: should we use 32767 or 32768 in next line? why?
    vol_int = (int16_t) (0.75 * 32767.0);
    printf("Scaling samples.\n");
    // Q: what does it mean to "duplicate" values in the next line?
    __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
    while ( cursor < limit ) {
        __asm__ (
            "ldr q0, [%[cursor]], #0 \n\t"
            // load eight samples into q0 (v0.8h)
            // from in_cursor
            "sqdmulh v0.8h, v0.8h, v1.8h \n\t"
            // multiply each lane in v0 by v1*2
            // saturate results
            // store upper 16 bits of results into
            // the corresponding lane in v0
            // Q: Why is #16 included in the str line
            // but not in the ldr line?
            "str q0, [%[cursor]],#16 \n\t"
            // store eight samples to [cursor]
            // post-increment cursor by 16 bytes
            // and store back into the pointer register
            // Q: What do these next three lines do?
            : [cursor]"+r"(cursor)
            : "r"(cursor)
            : "memory"
        );
    }
    // --------------------------------------------------------------------
    printf("Summing samples.\n");
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+data[x])%1000;
    }
    // Q: are the results usable? are they correct?
    printf("Result: %d\n", ttl);
    return 0;
}
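One of the questions in the listing asks whether the volume should be multiplied by 32767 or 32768. My reading, sketched below with a made-up helper called `volume_to_fixed`, is that 32767 is the safe choice because it is the largest value an int16_t can hold, so even a volume of 1.0 still fits. I'm assuming the volume stays between 0.0 and 1.0.

```c
#include <stdint.h>

// Hypothetical helper: converts a volume factor in [0.0, 1.0] to the
// 16-bit fixed-point value used by the lab code.
int16_t volume_to_fixed(double volume) {
    // 32767 is INT16_MAX, so volume = 1.0 maps to the largest valid value.
    // With 32768.0, a volume of 1.0 would not fit in an int16_t.
    return (int16_t)(volume * 32767.0);
}
```

For the 0.75 used in the lab, 0.75 times 32767 is 24575.25, and the cast drops the fraction, leaving 24575.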

If, like me, you don’t like assembly, intrinsics will help. GCC provides intrinsics: C functions that stand in for the assembly instructions. I think they help, but they also have their limitations. Here is the same example, but using intrinsics. Take a look at the vst1q_s16 line inside the while loop.

// vol_intrinsics.c :: volume scaling in C using AArch64 Intrinsics
// Chris Tyler 2019.10.02 - Licensed under GPLv3
// For the SIMD lab in the Seneca College SPO600 Course
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#include "vol.h"
int main() {
    int16_t* data; // data array
    int16_t* limit; // end of input array
    register int16_t* cursor asm("r20"); // array cursor (pointer)
    register int16_t vol_int asm("r22"); // volume as int16_t
    int x; // array iterator
    int ttl = 0; // array total
    data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    srand(-1);
    printf("Generating sample data.\n");
    for (x = 0; x < SAMPLES; x++) {
        data[x] = (rand()%65536)-32768;
    }
    // --------------------------------------------------------------------
    cursor = data; // Pointer to start of array
    limit = data + SAMPLES;
    vol_int = (int16_t) (0.75 * 32767.0);
    printf("Scaling samples.\n");
    while ( cursor < limit ) {
        // Q: What do these intrinsic functions do?
        // (See gcc intrinsics documentation)
        vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));
        // Q: Why is the increment below 8 instead of 16 or some other value?
        // Q: Why is this line not needed in the inline assembler version
        // of this program?
        cursor += 8;
    }
    // --------------------------------------------------------------------
    printf("Summing samples.\n");
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+data[x])%1000;
    }
    // Q: Are the results usable? Are they accurate?
    printf("Result: %d\n", ttl);
    return 0;
}
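About the increment of 8 in the intrinsics version: pointer arithmetic in C counts elements, not bytes. The small sketch below (the function name `cursor_step_bytes` is hypothetical) measures how far an int16_t pointer really moves when we add 8 to it.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns how many bytes an int16_t cursor advances
// when it is incremented by 8 elements.
size_t cursor_step_bytes(void) {
    int16_t block[8] = {0};
    int16_t *cursor = block;
    // One past the end of the array: fine to compute, not to dereference.
    int16_t *next = cursor + 8;
    // Compare the two addresses as byte pointers.
    return (size_t)((char *)next - (char *)cursor);
}
```

Eight samples of two bytes each make 16 bytes, which matches the #16 post-increment on the str instruction in the inline assembly version.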

That’s it for today. I’m working on profiling my awk build. Stay tuned!
