Skip to main content

Project Stage 3

Photo by NASA on Unsplash

Hello! In this post, I’ll make a list of optimization opportunities that I identified on the AWK project based on what I’ve learned in the SPO600 classes. There are two types of optimizations: portable and platform-specific.

Portable optimizations are the ones that work everywhere, like better algorithms and implementations, and also compiler building flags.
Platform-specific, on the other hand, works only for a targeted architecture. Like the SIMD instructions available only on Arch64 and many others specific for x86_64. It is possible to “force” the usage of such instructions according to the targeted hardware. We can do that on compilation time, and also on run-time.

Now that we know our options, let’s dig in. According to my previous post, the functions nematch and readrec are the hotspots. Here is the command line used to run the awk:

./awk 'BEGIN {FS = "<|:|=";} {if ($8 == "DDD>") a ++;} END {print "count: " a;}' bmark-data.txt

The nematch function is designed to match the regular expression informed in the command line with the characters of each line of the file bmark-data.txt. We have an opportunity here. We could change the regex algorithm for a better one. The same applies to the readrec function, which is designed to read the bmark-data.txt. If we could find faster algorithms, it would make a significant improvement. 

It seems that the awk is not multithreading. It would be great to have a flag or so to spawn multiple processes to deal with a heavy load. In my experience, there is always gain doing things in parallel. That is valid for nametch and readrec, and its parents refldbld and getrec.

All the suggestions above are portable. Also, there are some platform-specific that it would help as well.

I don’t know if it is possible to implement due to the uncertainty of the input (regex and data file). Still, I would try to implement SIMD as low-level parallelization. Instead of processing one by one, making two or four each time would improve the performance drastically. Here, I think that the refldbld and getrec are good candidates for vectorization too.

Also, we could search for applicable platform-specific optimizations that the GCC compiler is not applying, so we can rewrite the code to make it easy for the compiler.

I didn’t find any hardware-specific optimization in the awk project. That might be a clue that there is space for such improvement, not only for Arch64 but also for x86_64. I recognize that this path is hard and requires a lot of knowledge of the hardware and its instructions. Hopefully, the community can help with that.

Finally, playing with the compiler optimization flags can produce excellent results, as demonstrated in Project Stage 1. It doesn’t require changing the code, but definitely, it needs testing.

So, here is the list of my suggestions:

1 – Replace regular expressions algorithm on nametch;
2 – Replace read text file algorithm on readrec;
3 – Implement parallelism on nametch, readrec, refldbld and getrec;
4 – Restructure the code to facilitate SIMD and other hardware-specific optimizations;
5 – There are no platform-specific optimizations, so check everywhere, starting from nametch and readrec;
6 - Try out compiler optimization flags;

This is my last post. We’ve reached the end of the SPO600 course. It was a bumpy journey into assembly language, instructions, bits and bytes, but I liked it. It gave me more details on the software’s lowest level. Even being an upper-level developer (Java, C++, Typescript, PLSQL), I’ll store all knowledge acquired on SPO600 in a special place in my toolbox. Thank you, Chris, all the best.

Comments

Popular posts from this blog

SIMD - Single Instruction Multiple Data

Photo by  Vladimir Patkachakov  on  Unsplash Hi! Today’s lecture, we learned SIMD - Single Instruction Multiple Data. This is a great tool to process data in a bulk fashion. So, instead of doing one by one, based on the variable size, we can do 16, 8, 4 or 2 at the time. This technique is called auto-vectorization resources, and it falls into the category of machine instruction optimization that I mentioned in my last post. If the machine is SIMD enabled, the compiler can use it when translating a sum loop, for example. If we are summing 8 bits numbers, using SIMD, it will be 16 times faster. However, the compiler can figure that it is not safe to use SIMD due to overlapping or non-aligned data. In fact, the compiler will not apply SIMD in most cases, so we need to get our hands dirty and inject some assembly. I’ll show you how to do it in a second. Here are the lanes of the 128-bit AArch64 Advanced SIMD: 16 x 8 bits 8 x 16 bits 4 x 32 bits 2 x 64 bits 1 x ...

Going Faster

Photo by  Anders Jildén  on  Unsplash Today’s topic is compiler optimizations. Besides translating our source code into machine binary executable, the compiler, based on optimization parameters, can also produce faster executables. Just by adding some parameters to the compiler, we can get a better performance or a smaller executable, for example. There are hundreds of those parameters which we can turn on or off using the prefix -f and -fno. However, instead of doing one by one, we can use the features mode using the -O param. It ranges from 0 (no optimization – the default) to 3 (the highest). Using those parameters has a cost —usually, the faster, the larger executable. How does the compiler make it faster if my code is perfect? I’m going to put some methods here, but if you want, here is more detail . Also, bear in mind that most of the optimizations are done in the intermediate representation of the program. So, the examples below are rewritten just to...

Colour Selection

Photo by Scott Webb  on  Unsplash Today I'll talk about Lab 4. We had to pick two tasks out of four and develop the solution using my least favour language: assembly. Our group chose the options 2 (data input form) and 4 (screen colour selector) thinking that would be the easiest ones. The other options were adding calculator and hexdump. This post will talk about the colour selector, and my next will be the input form. The colour selector project was quite easy to do relatively. There are only 16 colours available (0 to F in hex). We have to list them in the text area and allow the selection using the cursor (up and down). Once the colour is selected, we have to paint the graph area. The graph area we did before. Basically, we have to store the colour code for every pixel in the display using the memory location between $0200 and $05FF. In the last post, we deal with up and down keys to change the numeric display. However, we never dealt with character display bef...