I just spent the last few days optimizing some CUDA kernels to go super fast.

It's *so much* effort to worry about memory alignment, access patterns, getting the compiler to emit wide loads and other specialized instructions on the hot path, vector math, math intrinsics, register usage, warp schedulers, etc.

But it's also *super* rewarding to see the huge impact all that stuff has on performance. It feels like all that hard work pays off when your code literally gets 10x faster! =D

