What is performance?
There's a cohort of smart folks these days who don't like the word "performant", presumably because it's unspecific. I get it. I've always charitably interpreted "performant" to mean "reasonably optimized for some mix of minimizing latency, maximizing throughput, minimizing memory usage, etc., without unjustifiably sacrificing any other performance characteristic." Obviously some of these goals compete with each other, so the word is ~nonspecific.
I think the best articulation of performance is: efficiently minimize space, time, and energy. At a certain level, thinking about performance starts to feel like an entry level physics course on thermodynamics.
Why care about performance?
Performance is inherently about resourcefulness. Our world has limited space and energy, and we have limited time. We ought not to waste these finite resources. To me, resourcefulness and sustainability are sufficient answers to the why question, but I also just like tools that are fast.
Performance Characteristics
| Performance Characteristic | Classification |
|---|---|
| RAM usage | Space |
| Instruction count | Time |
| Execution latency | Time |
| Throughput | Time (a scalar measure of work per unit time) |
| Code size (e.g. binary file size) | Space |
| Number of cores used | Energy |
| CPU load time (non-idle time) | Energy |
| Number of computers used | Energy |
| Clock rate of CPU(s) | Energy |
| Clock rate of memory | Energy |
| Data transfer (across links) | Time × Space × Energy |
The above are all, more or less, first-principles-level performance characteristics. Working backwards from these, we can derive various causes and mechanisms that influence them.
Performance influencers
| Performance Influencer | Relevant Characteristic | Comments / how to measure |
|---|---|---|
| TLB miss rate | Execution latency | e.g. `perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses` |
| L1/2/3 cache hit rate | Execution latency | e.g. `perf stat -e L1-dcache-load-misses` (also icache events for instructions) |
| Register pressure | Execution latency | ~ |
| Saturated network IO | Time, Space | `sar -n DEV 1` |
| Saturated disk IO | Time, Space | `iostat -xz 1` |
| Unreliable network; TCP retransmits | Time, Space | `sar -n TCP,ETCP 1` |
| Noisy neighbors | Time, Space | `pidstat 1`, `vmstat 1`, `uptime`, `dmesg` |
| Memory leak | RAM usage | `valgrind`, coredump |
| Context switches | Time | e.g. `perf record -F64 -e cpu-clock -e cs -a -g` |
| Branch mispredicts | Time | e.g. `perf stat -e branch-misses` |
| False sharing | Time | `perf c2c` (see the sketch after this table) |
| Mutex contention | Time | Measure time to acquire locks and number of lock waiters; see also critical sections. |
| Cache locality | Time | L1/2/3 cache hit rate; look for data that is always accessed together but stored far apart |
| Uniformly smeared access patterns | Time | L1/2/3 cache hit rate, TLB misses, analyze access patterns |
| Doing something when you could do nothing | Time, Space, Energy | Critical thinking |
| Not batching work | Time, Space, Energy | Critical thinking |
| Allocating too much | Time, Space, Energy | Coredumps, `valgrind`, critical thinking |
| Syscalls | Time, Space | e.g. `perf top -e raw_syscalls:sys_enter -ns comm`, `strace` |
| Page faults | Time | e.g. `perf record -e page-faults -ag` |
| CPU migrations | Time | e.g. `perf record -e migrations -a` |
| Overserialized processing | Time | Critical thinking; look for work that is embarrassingly parallel |
| Automatic vectorization | Time | Inspect the generated ASM |
| Manual SIMD vectorization | Time | e.g. `std::simd` |
| Copies | Time | Zero-copy serialization frameworks are popular for good reason; avoid copying where possible. |
| Memory striding | Time | If memory access is predictable, the CPU can prefetch data and pipeline instructions. |
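To make one of these influencers concrete, here's a minimal sketch of false sharing (my illustration, using the GCC/Clang `__atomic` builtins and assuming 64-byte cache lines): two threads each increment their own counter, but when the counters share a cache line, that line ping-pongs between cores.

```c
// false_sharing.c: build with cc -O2 -pthread false_sharing.c
// (add -DPAD=1 to give each counter its own cache line)
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

#ifndef PAD
#define PAD 0
#endif

struct counter {
    unsigned long n;
#if PAD
    char pad[64 - sizeof(unsigned long)]; // assumes 64-byte cache lines
#endif
};

static struct counter counters[2];

// Each thread hammers its own counter. With PAD=0 both counters live
// on the same cache line, so the two cores fight over it even though
// they never touch each other's data.
static void *bump(void *arg) {
    struct counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        __atomic_fetch_add(&c->n, 1, __ATOMIC_RELAXED);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, bump, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%lu %lu\n", counters[0].n, counters[1].n);
    return 0;
}
```

Build it twice, with and without `-DPAD=1`, and compare `perf stat -e cache-misses` (or `perf c2c record` / `perf c2c report`) across the two runs; the padded version should miss far less and finish noticeably faster.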
Performance quips
- A CPU cache miss is a lost opportunity to have executed ~500 instructions!
- Doing nothing is always faster than doing something.
- Whenever you get a chance to line things up linearly and next to each other in space and time, it's probably good for performance; see the sketch below. (The exception is when you're trying to avoid hot keys and hot partitions, but that's more of a distributed-systems performance problem.)
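To put a number on that last quip, here's a small sketch (again mine, nothing canonical): summing the same matrix in row-major versus column-major order. The row-major walk is sequential and prefetcher-friendly; the column-major walk strides `N * 4` bytes between reads, so nearly every access touches a fresh cache line.

```c
// locality.c: build with cc -O2 locality.c, then compare
//   perf stat -e L1-dcache-load-misses ./a.out row
//   perf stat -e L1-dcache-load-misses ./a.out col
#include <stdio.h>
#include <string.h>

#define N 4096

float m[N][N]; // ~64 MB, far larger than any CPU cache

int main(int argc, char **argv) {
    int by_rows = argc < 2 || strcmp(argv[1], "col") != 0;

    // Touch every element once so the pages are actually populated.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = (float)(i ^ j);

    double sum = 0.0;
    if (by_rows) {
        // Sequential: the next element is 4 bytes away, so each
        // 64-byte cache line serves 16 consecutive reads.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
    } else {
        // Strided: the next element is N*4 bytes away, so nearly
        // every read pulls in a fresh cache line.
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
    }
    printf("%f\n", sum);
    return 0;
}
```

Same arithmetic, same data; only the traversal order differs, and the column version is typically several times slower.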
What about that whole "root of all evil" thing?
You're misquoting Donald Knuth, and that's not what he was saying. Here's the quote in its proper context, along with the paragraph that comes right before it, from the paper it appears in, Structured Programming with go to Statements.
> The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.
>
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
The thing is, now 50 years later... we do not live in a world of engineers who are doing much premature optimization. In fact, we've swung to the other side where we're prematurely non-optimizing.
Good performance resources?
- Bryan Cantrill, e.g. Relative Performance of C and Rust
- Brendan Gregg, especially the perf-tools repo, which is a collection of `ftrace`, `perf`, and `eBPF` tools for measuring performance characteristics.
- Mechanical Sympathy: I skim/skip all the Java stuff, but the perf-related stuff is great.
- Little's law
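On that last one: Little's law (L = λW, i.e. the mean number of items in a system equals the arrival rate times the mean time each item spends in it) is great back-of-envelope math for performance work. For example, a service absorbing 1,000 requests/s at 50 ms mean latency holds about 1,000 × 0.05 = 50 requests in flight at steady state.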