What is performance?
There's a cohort of smart folks these days who don't like the word "performant", presumably because it's unspecific. I get it. I've always charitably interpreted "performant" to mean "reasonably optimized for some mix of minimizing latency, maximizing throughput, minimizing memory usage, etc., without unjustifiably sacrificing any other performance characteristic." Obviously some of these goals compete with each other, so the word is ~nonspecific.
I think the best articulation of performance is: efficiently minimize space, time, and energy. At a certain level, thinking about performance starts to feel like an entry level physics course on thermodynamics.
Why care about performance?
Performance is inherently about resourcefulness. Our world has limited space and energy, and we have limited time. We ought not to waste these finite resources. To me, resourcefulness and sustainability are sufficient answers to the why question, but I also just like tools that are fast.
Performance Characteristics
| Performance Characteristic | Classification |
|---|---|
| RAM usage | Space |
| Instruction count | Time |
| Execution latency | Time |
| Throughput | Time (a scalar measure of work per unit time) |
| Code size (e.g. binary file size) | Space |
| Number of cores used | Energy |
| CPU load time (non-idle time) | Energy |
| Number of computers used | Energy |
| Clock rate of CPU(s) | Energy |
| Clock rate of memory | Energy |
| Data transfer (across links) | Time × Space × Energy |
The above are all, more or less, first-principles-level performance characteristics. Working backwards from these, we can derive various causes and mechanisms that influence them.
Performance influencers
| Performance Influencer | Relevant Characteristic | Comments / how to measure |
|---|---|---|
| TLB miss rate | Execution latency | e.g. `perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses` |
| L1/2/3 cache hit rate | Execution latency | e.g. `perf stat -e L1-dcache-load-misses` (also icache events for instructions) |
| Register pressure | Execution latency | ~ |
| Saturated network IO | Time, Space | `sar -n DEV 1` |
| Saturated disk IO | Time, Space | `iostat -xz 1` |
| Unreliable network; TCP retransmits | Time, Space | `sar -n TCP,ETCP 1` |
| Noisy neighbors | Time, Space | `pidstat 1`, `vmstat 1`, `uptime`, `dmesg` |
| Memory leak | RAM usage | `valgrind`, coredump |
| Context switches | Time | e.g. `perf record -F64 -e cpu-clock -e cs -a -g` |
| Branch mispredicts | Time | e.g. `perf stat -e branch-misses` |
| False sharing | Time | `perf c2c` (see the sketch after this table) |
| Mutex contention | Time | Measure time to acquire locks and number of lock waiters; see also critical sections. |
| Cache locality | Time | L1/2/3 cache hit rate; look for data that is always accessed together but stored far apart |
| Uniformly smeared access patterns | Time | L1/2/3 cache hit rate, TLB misses, analyze access patterns |
| Doing something when you could do nothing | Time, Space, Energy | Critical thinking |
| Not batching work | Time, Space, Energy | Critical thinking |
| Allocating too much | Time, Space, Energy | Coredumps, `valgrind`, critical thinking |
| Syscalls | Time, Space | e.g. `perf top -e raw_syscalls:sys_enter -ns comm`, `strace` |
| Page faults | Time | e.g. `perf record -e page-faults -ag` |
| CPU migrations | Time | e.g. `perf record -e migrations -a` |
| Overserialized processing | Time | Critical thinking; look for work that is embarrassingly parallel |
| Automatic vectorization | Time | Inspect the generated ASM |
| Manual SIMD vectorization | Time | e.g. `std::simd` |
| Copies | Time | Zero-copy serialization frameworks are popular for good reason; avoid copying where possible. |
| Memory striding | Time | If memory access is predictable, the CPU can prefetch data and pipeline instructions. |
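To make one of these influencers concrete, here's a minimal sketch of false sharing (my illustration, using the GCC/Clang `__atomic` builtins and assuming 64-byte cache lines): two threads each increment their own counter, but when the counters share a cache line, that line ping-pongs between cores.

```c
// false_sharing.c: build with cc -O2 -pthread false_sharing.c
// (add -DPAD=1 to give each counter its own cache line)
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

#ifndef PAD
#define PAD 0
#endif

struct counter {
    unsigned long n;
#if PAD
    char pad[64 - sizeof(unsigned long)]; // assumes 64-byte cache lines
#endif
};

static struct counter counters[2];

// Each thread hammers its own counter. With PAD=0 both counters live
// on the same cache line, so the two cores fight over it even though
// they never touch each other's data.
static void *bump(void *arg) {
    struct counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        __atomic_fetch_add(&c->n, 1, __ATOMIC_RELAXED);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, bump, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%lu %lu\n", counters[0].n, counters[1].n);
    return 0;
}
```

Build it twice, with and without `-DPAD=1`, and compare `perf stat -e cache-misses` (or `perf c2c record` / `perf c2c report`) across the two runs; the padded version should miss far less and finish noticeably faster.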
Performance quips
- A CPU cache miss is a lost opportunity to have executed ~500 instructions!
- Doing nothing is always faster than doing something.
- Whenever you get a chance to line things up linearly and next to each other in space and time, it's probably good for performance; see the sketch below. (The exception is when you're trying to avoid hot keys and hot partitions, but that's more of a distributed-systems performance problem.)
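To put a number on that last quip, here's a small sketch (again mine, nothing canonical): summing the same matrix in row-major versus column-major order. The row-major walk is sequential and prefetcher-friendly; the column-major walk strides `N * 4` bytes between reads, so nearly every access touches a fresh cache line.

```c
// locality.c: build with cc -O2 locality.c, then compare
//   perf stat -e L1-dcache-load-misses ./a.out row
//   perf stat -e L1-dcache-load-misses ./a.out col
#include <stdio.h>
#include <string.h>

#define N 4096

float m[N][N]; // ~64 MB, far larger than any CPU cache

int main(int argc, char **argv) {
    int by_rows = argc < 2 || strcmp(argv[1], "col") != 0;

    // Touch every element once so the pages are actually populated.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = (float)(i ^ j);

    double sum = 0.0;
    if (by_rows) {
        // Sequential: the next element is 4 bytes away, so each
        // 64-byte cache line serves 16 consecutive reads.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
    } else {
        // Strided: the next element is N*4 bytes away, so nearly
        // every read pulls in a fresh cache line.
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
    }
    printf("%f\n", sum);
    return 0;
}
```

Same arithmetic, same data; only the traversal order differs, and the column version is typically several times slower.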
What about that whole "root of all evil" thing?
You're misquoting Donald Knuth, and that's not what he was saying. Here's the quote in its proper context, along with the paragraph that comes right before it, from the paper it appears in, Structured Programming with go to Statements.
> The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.
>
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
The thing is, now 50 years later... we do not live in a world of engineers who are doing much premature optimization. In fact, we've swung to the other side where we're prematurely non-optimizing.
Good performance resources?
- Bryan Cantrill, e.g. Relative Performance of C and Rust
- Brendan Gregg, especially the perf-tools repo, which is a collection of `ftrace`, `perf`, and `eBPF` tools for measuring performance characteristics.
- Mechanical Sympathy: I skim/skip all the Java stuff, but the perf-related stuff is great.
- Little's law
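On that last one: Little's law (L = λW, i.e. the mean number of items in a system equals the arrival rate times the mean time each item spends in it) is great back-of-envelope math for performance work. For example, a service absorbing 1,000 requests/s at 50 ms mean latency holds about 1,000 × 0.05 = 50 requests in flight at steady state.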