
Performance Tuning Hero

Photo by Ayo Ogunseinde on Unsplash
Week 8! I’m Rodrigo, and this is my blog about my SPO 600 course. I’ve been posting since January, and so far I haven’t told you what SPO stands for, right? It stands for Software Portability and Optimization. Today we will approach the Optimization part differently: instead of squeezing more performance out of the compiler, we will look at how the software itself works.

I have a good deal of experience in performance tuning with Oracle Database, PL/SQL and SQL. I can say that, by far, the most significant gains in execution time come from how the software is designed. In this course, we are going to follow the same steps I have used there:

First, do not touch the code without knowing how bad it is. Benchmarking is a must. Measurements taken before, during and after the work will guide us and justify the hours of analysis and development. Do it right, like a methodical scientist collecting vital data, without rushing. The more in-depth the data you collect, the easier the next steps will be.
Second, target the right piece of the software. Don’t waste time analyzing tasks that already perform well. Instead, aim at the ones that take the most time or consume the most CPU or memory.
Third, experiment with changes and compare the execution time with the baseline from the first step. This works as a trial-and-error loop.
Lastly, implement the solution and compare the results with the first step.
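The "measure before, during and after" loop can be sketched in C. This is a minimal sketch, assuming a POSIX system (such as the Linux machines used in the course) where `clock_gettime` with `CLOCK_MONOTONIC` is available. The names `baseline`, `best_time` and `elapsed_seconds` are hypothetical; replace `baseline` with the code you are tuning and time the candidate version the same way.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature of the code under test: processes n items. */
typedef void (*workload_fn)(size_t n);

/* Hypothetical baseline workload. It stores its result in a volatile
 * variable so the work is not discarded as dead code. */
static volatile uint64_t sink;
static void baseline(size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (uint64_t)i * i;
    }
    sink = sum;
}

/* Wall-clock seconds elapsed between two CLOCK_MONOTONIC readings. */
static double elapsed_seconds(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Runs fn(n) `runs` times and returns the fastest run in seconds.
 * Taking the minimum filters out some noise from other processes. */
static double best_time(workload_fn fn, size_t n, int runs) {
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        fn(n);
        clock_gettime(CLOCK_MONOTONIC, &end);
        double t = elapsed_seconds(start, end);
        if (best < 0.0 || t < best) {
            best = t;
        }
    }
    return best;
}
```

Calling `best_time(baseline, 1000000, 5)` before a change and the same call on the new version afterwards gives the before/after numbers that steps one, three and four compare.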

We can use simple tools to collect data, for example, a bash script or the time command. There are commercial tools out there that collect much more data, but sometimes too much data is overwhelming. Try to keep it simple at first and collect more if needed.
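The `time` command reports wall-clock, user and system time for a whole run. A program can also collect similar numbers about itself with the POSIX `getrusage` call, including peak memory. This is a sketch, assuming Linux, where `ru_maxrss` is reported in kilobytes (other systems may use different units). The `usage_report` struct and the `read_usage` and `timeval_seconds` helpers are hypothetical names for illustration.

```c
#include <sys/resource.h>

/* Roughly the CPU and memory figures `time` can show, read from inside
 * the calling process. */
struct usage_report {
    double user_seconds;   /* CPU time spent in user mode */
    double system_seconds; /* CPU time spent in the kernel */
    long   max_rss;        /* peak resident set size; kilobytes on Linux */
};

/* Converts a struct timeval into seconds. */
static double timeval_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fills *out with the calling process's resource usage so far.
 * Returns 0 on success and -1 on failure, like getrusage itself. */
static int read_usage(struct usage_report *out) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    out->user_seconds = timeval_seconds(ru.ru_utime);
    out->system_seconds = timeval_seconds(ru.ru_stime);
    out->max_rss = ru.ru_maxrss;
    return 0;
}
```

Calling `read_usage` before and after a suspect section and subtracting the CPU times is a cheap way to spot the piece that consumes the most CPU, before reaching for heavier tools.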

So, we found the bad guy. Now what? Try a completely different approach. I like the sound volume example our professor explained. Changing the music’s volume is not as trivial as I thought. Digital audio takes the sound wave and translates it into tiny dots taken at equally spaced moments along the wave. This process is called sampling. The position of a number in the list is the X coordinate (time), and the number itself is the Y coordinate (the amplitude). We end up with a massive list of values, all of them at volume 1.000 (the maximum). Our volume ranges from 0.000 to 1.000. If we need to turn it down so we don’t disturb the neighbours, we have to multiply every number by the volume factor. Easy, right? Is there no other way?
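The first (naive) method, multiplying every sample by the volume factor, can be sketched like this. It assumes signed 16-bit samples (`int16_t`) and a volume factor between 0.0 and 1.0; the function name `scale_naive` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Method 1 (naive): convert each 16-bit sample to floating point,
 * multiply by the volume factor (0.0 to 1.0) and convert back.
 * The conversion back to int16_t truncates toward zero. */
static void scale_naive(int16_t *samples, size_t count, float volume) {
    for (size_t i = 0; i < count; i++) {
        samples[i] = (int16_t)(samples[i] * volume);
    }
}
```

Every single sample pays for an int-to-float conversion, a floating-point multiply and a conversion back, which is exactly the cost the other methods try to avoid.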

I was amazed that our professor listed FOUR ways to do it. The second one is a lookup table. With 16-bit samples, there are only 65,536 possible values. What if we create a table with the volume, the original value and the calculated one? In this case, instead of doing the math, we simply look up the value. This approach takes more memory, though.
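A minimal sketch of the lookup-table idea, assuming the table is rebuilt whenever the volume changes, with one entry per possible 16-bit sample value (65,536 entries of 2 bytes, so 128 KiB of memory). The names `volume_table`, `build_volume_table` and `scale_lookup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One precomputed output for every possible int16_t input. */
static int16_t volume_table[65536];

/* Rebuild the table for a new volume: 65,536 multiplications once,
 * instead of one multiplication per sample. */
static void build_volume_table(float volume) {
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++) {
        /* Offset by 32768 so the index runs from 0 to 65535. */
        volume_table[v - INT16_MIN] = (int16_t)(v * volume);
    }
}

/* Method 2: replace the multiplication with a table lookup. */
static void scale_lookup(int16_t *samples, size_t count) {
    for (size_t i = 0; i < count; i++) {
        samples[i] = volume_table[samples[i] - INT16_MIN];
    }
}
```

The trade-off is the one mentioned above: memory for the table (and possible cache misses on lookups) in exchange for skipping the per-sample math.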
The third method also does the math, but in binary, using 32-bit integers. This avoids the floating-point conversion used in the first method.
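One way to do the volume math with 32-bit integers only is to store the volume as a scaled integer. This sketch assumes 8 fractional bits, so 256 represents 1.000 and 128 represents 0.500; the scale of 256 and the names `volume_to_fixed` and `scale_fixed` are illustrative choices, not taken from the course.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts a volume between 0.0 and 1.0 into an integer with 8
 * fractional bits (1.0 -> 256, 0.5 -> 128). Done once per volume change. */
static int32_t volume_to_fixed(float volume) {
    return (int32_t)(volume * 256.0f);
}

/* Method 3: integer-only scaling. A 16-bit sample times a value of at
 * most 256 fits easily in 32 bits; shifting right by 8 divides by 256.
 * Right-shifting negative values is implementation-defined in C; GCC and
 * Clang do an arithmetic shift, which rounds toward minus infinity. */
static void scale_fixed(int16_t *samples, size_t count, int32_t fixed_volume) {
    for (size_t i = 0; i < count; i++) {
        samples[i] = (int16_t)((samples[i] * fixed_volume) >> 8);
    }
}
```

The only floating-point operation left is the one-time conversion of the volume; the per-sample work is an integer multiply and a shift.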
The fourth is called fixed-point math. It uses SIMD instructions to do the math in parallel. By the way, parallelism is a valid tool for dealing with performance issues and can be relatively easy to implement.
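A sketch of the SIMD idea, using the GCC/Clang vector extensions instead of hand-written assembly or intrinsics (a swapped-in technique, assumed available in GCC 9 or later, or Clang, because of `__builtin_convertvector`). It applies the same integer math as a scaled-integer version, with 256 meaning volume 1.000, but to eight samples per operation. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes fit one 128-bit SIMD register. The 32-bit version
 * is 256 bits wide; the compiler lowers it to whatever the target has,
 * typically two 128-bit operations on AArch64 Advanced SIMD. */
typedef int16_t v8i16 __attribute__((vector_size(16)));
typedef int32_t v8i32 __attribute__((vector_size(32)));

/* fixed_volume uses 8 fractional bits: 256 = 1.0, 128 = 0.5. */
static void scale_fixed_simd(int16_t *samples, size_t count, int32_t fixed_volume) {
    size_t i = 0;
    /* Main loop: eight samples per iteration. */
    for (; i + 8 <= count; i += 8) {
        v8i16 in;
        memcpy(&in, samples + i, sizeof in);              /* unaligned-safe load */
        v8i32 wide = __builtin_convertvector(in, v8i32);  /* widen to 32 bits */
        wide = (wide * fixed_volume) >> 8;                 /* multiply all lanes, then shift */
        v8i16 out = __builtin_convertvector(wide, v8i16);  /* narrow back to 16 bits */
        memcpy(samples + i, &out, sizeof out);
    }
    /* Tail: fewer than eight leftover samples, handled one at a time. */
    for (; i < count; i++) {
        samples[i] = (int16_t)((samples[i] * fixed_volume) >> 8);
    }
}
```

The tail loop is needed because the sample count is not always a multiple of eight; widening to 32 bits keeps the multiplication from overflowing the 16-bit lanes.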

To conclude, being creative and thinking outside the box is an excellent skill for tuning applications. Also, in-depth knowledge of the business and the program tends to produce better results. I like the feeling of being the one who defeated the bad guy slowing down our lovely software. Don't you?

See you.
