I’ve been working at a company for two years now (the longest I have ever stayed at the same company). I wasn’t supposed to do a lot of C++, but that is what I ended up doing, on a computation-intensive piece of software. Along the way I had to learn a thing or two about improving performance.
I wrote these lines for newcomers to C++ performance optimization.
Don’t forget that premature optimization is the root of all evil. Get the logic right first.
If you are not very experienced with C++ and you don’t have a lot of time, in many cases you will build a faster program, faster, in another language like Java. The “C++ is the fastest language, I must use it” attitude is in many cases the best way to produce bullshit. It is much easier to build, debug and profile code in Java or C#, and the CPU or memory overhead of these languages isn’t relevant for most applications.
These are the most interesting pages on the subject; you should absolutely read them.
This is already described here. Basically, you have to test –> profile –> understand –> fix.
CPU profiling gives excellent results. Some people dismiss profilers as tools that give wrong measures because of their intrusive way of working, but gprof / pprof do sampling profiling. So they actually give very, very insightful results (even if not exact CPU time).
Paradigm is a great word. I recently discovered it (because I don’t know that many words) and I like it a lot, mostly because it explains a lot of things (like failures, which are very interesting).
Think like a CPU
If you need to achieve very good performance, you need to think like a CPU, and I mean at the lowest level of a modern CPU. Then, once you think you have found the perfect way to make things fast, you should TEST it, in situations as close as possible to reality. If a benchmark makes a very simple use of memory, it can easily be optimized by the CPU and its cache, and you could measure 20x better performance than what you will reach in reality.
Use memory in a smart way
If you need to read data sequentially, the best performance is reached with arrays, mostly because the CPU can prefetch the array you are reading and cache its content. If you often need to move things around, you should use pointers, so you don’t end up copying a lot of data. But don’t forget that pointers consume memory themselves (8 bytes each on a 64-bit platform), and when they aren’t necessary you can use smaller offsets instead.
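To make the sequential-read point concrete, here is a minimal sketch (names are mine, not from any particular codebase) contrasting a contiguous array with a pointer-chasing linked list; both compute the same sum, but the vector version lets the prefetcher stream memory while each list node costs a dependent pointer load:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Contiguous storage: sequential reads the CPU can prefetch ahead of.
long long sum_vector(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) s += x;
    return s;
}

// Node-based storage: each step is a dependent pointer load,
// so the prefetcher cannot help much.
long long sum_list(const std::list<int>& l) {
    long long s = 0;
    for (int x : l) s += x;
    return s;
}
```

Timing the two over a few million elements is an easy way to see the cache effect for yourself.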
Make it compiler friendly
Your C++ compiler will do as much as it can to optimize your code. Every call through a virtual class or a function pointer prevents it from doing some optimizations (inlining in particular), so think carefully before using them. Use templates as much as you can.
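A small illustrative sketch of the difference (the types here are invented for the example): the virtual version dispatches through a vtable, while the template version knows the concrete type at compile time, so the compiler can inline the call and optimize the loop around it:

```cpp
#include <cassert>

// Virtual dispatch: the call goes through a vtable pointer,
// which usually blocks inlining.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

double total_area_virtual(const Shape& s, int n) {
    double t = 0;
    for (int i = 0; i < n; ++i) t += s.area();  // indirect call each iteration
    return t;
}

// Template version: the concrete type is known at compile time,
// so area() can be inlined into the loop.
template <typename T>
double total_area_template(const T& s, int n) {
    double t = 0;
    for (int i = 0; i < n; ++i) t += s.area();  // direct, inlinable call
    return t;
}
```

Both return the same result; only the template version gives the optimizer a fair chance.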
Google-PerfTools is a set of tools to create more robust and efficient applications.
The tcmalloc (TC stands for “thread caching”) library overrides the malloc/realloc/free functions and new/delete operators to make them use a virtual heap that recycles freed memory chunks. This avoids a lot of system calls. It also has a cost: it consumes a little more memory (but so little that you won’t even notice it).
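For reference, here are the two usual ways of plugging tcmalloc into a program (the library path is illustrative and may differ on your distribution):

```shell
# 1. Link it into the binary at build time:
g++ -O2 myapp.cc -o myapp -ltcmalloc

# 2. Or inject it into an existing binary without rebuilding,
#    via the dynamic loader:
LD_PRELOAD=/usr/lib/libtcmalloc.so ./myapp
```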
In our program, which already has its own memory-recycling mechanism (for the most important parts), we got a 30% execution speed improvement.
A colleague (RON) did some comparative tests of memory allocation libraries on another piece of software. In these simple tests, tcmalloc outperformed all the others in speed (a 40% improvement, which is pretty hard to beat) and also reduced memory consumption by 25%, while we saw no change at all with the other libraries.
The CPU and memory profiling tools are both very good. CPU profiling and heap profiling can each be enabled either by setting an environment variable or by starting/stopping them at runtime. In my case, we have our own scripting language (I would have preferred we didn’t; wrappers are always a better option), so I added some functions to start and stop the profiling at any time.
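The runtime start/stop hooks are just thin wrappers around the gperftools API. This fragment is a sketch only (it needs gperftools installed and linking with -lprofiler, and run_expensive_job stands in for whatever workload you want to measure):

```cpp
// Sketch: requires gperftools headers and -lprofiler at link time.
#include <gperftools/profiler.h>

void run_expensive_job();  // hypothetical workload, defined elsewhere

void profiled_run() {
    ProfilerStart("job.prof");  // begin sampling; samples go to job.prof
    run_expensive_job();
    ProfilerStop();             // flush the profile and stop sampling
}
```

The resulting file is then analyzed with pprof, e.g. “pprof --text ./myapp job.prof”.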
The complete software is statically built, so even calls to our internal calculation libraries are inlined by the compiler. That makes it a lot harder to understand where time is actually spent. To prevent this from happening, you can add “__attribute__((noinline))” to a function’s signature. In our case, this allowed a simple grep on the API namespace to get the percentage of time spent in each function of the API.
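What that looks like in practice (the function here is invented for the example; the attribute is the GCC/Clang one): the noinline marker keeps the function as a distinct frame in profiler output instead of letting it be folded into its callers:

```cpp
#include <cassert>

// GCC/Clang attribute: forbid inlining so this function shows up
// under its own name in the profile.
__attribute__((noinline))
long long dot(const int* a, const int* b, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += static_cast<long long>(a[i]) * b[i];
    return s;
}
```

The usual caveat applies: noinline costs you the inlining it prevents, so it is a measurement aid, not something to leave on hot paths permanently.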
The Google performance library also has a heap-checking tool for finding memory leaks, and this one too is started by defining an environment variable.
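For example (the binary must be linked against tcmalloc for these to take effect):

```shell
# Leak checking: prints a leak report when the program exits.
HEAPCHECK=normal ./myapp

# Heap profiling works the same way, dumping profiles with the
# given prefix:
HEAPPROFILE=/tmp/myapp.hprof ./myapp
```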
By the way: why is it so great to be able to heap-check, CPU-profile and/or heap-profile from anywhere? Because you can easily enable it in production and/or on end-user computers. And the most important thing: you have the same version everywhere. You don’t have to maintain different builds and ship them to your users.
I’d like to add a few things:
In my tests, adding memory prefetching code was very messy and rarely gave any performance improvement. This is the kind of optimization that is best forgotten (at least on x86).
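For the curious, manual prefetching typically means sprinkling the GCC/Clang builtin through a loop, as in this sketch (the 16-element look-ahead distance is an arbitrary choice for illustration). On modern x86 the hardware prefetcher usually does this on its own for a sequential scan, which is why the clutter rarely pays off:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hint that a[i + 16] will be needed soon; the sum itself is unchanged.
long long sum_with_prefetch(const std::vector<int>& a) {
    long long s = 0;
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n) __builtin_prefetch(&a[i + 16]);
        s += a[i];
    }
    return s;
}
```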
The compiler you use changes a lot of things. MSVC, G++/Linux and G++/Cygwin/Windows deliver totally different performance for the same code.
All of this also applies to other languages. In Java or C#, profiling is much simpler.
I tend to think that people with a good understanding of C/C++ optimization are the most likely to write efficient software in other languages, because they understand what things cost at the CPU and memory level.