I just spent the last few days optimizing some CUDA kernels to go super fast.
It's *so much* effort to worry about memory alignments, access patterns, getting the compiler to emit wide load instructions and other special things on the hot path, vector math, math intrinsics, register usage, warp schedulers, etc.
But it's also *super* rewarding to see the huge impact all that stuff has on performance. It feels like all that hard work pays off when your code literally gets 10x faster! =D
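To make one of those items concrete, here is a minimal sketch of the "wide load" trick. It is not taken from the kernels mentioned above: the names `scale_scalar`, `scale_vec4`, and `run_scale_check` are hypothetical, and it assumes the element count is a multiple of 4 and that the buffers come straight from `cudaMalloc` (so they are aligned well enough for `float4`). Reading a `float4` per thread typically lets the compiler emit a single 16-byte load instead of four 4-byte ones. Whether that actually helps depends on the kernel and the GPU.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Scalar baseline: each thread loads and stores one 4-byte float.
__global__ void scale_scalar(const float* __restrict__ in,
                             float* __restrict__ out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Vectorized variant: each thread loads and stores one float4 (16 bytes).
// Assumption: both pointers are 16-byte aligned and n4 == n / 4 exactly.
__global__ void scale_vec4(const float4* __restrict__ in,
                           float4* __restrict__ out, float k, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];  // one wide load
        v.x *= k; v.y *= k; v.z *= k; v.w *= k;
        out[i] = v;        // one wide store
    }
}

// Hypothetical host helper: runs both kernels and checks them against the CPU.
bool run_scale_check(int n) {
    if (n <= 0 || n % 4 != 0) return false;  // sketch only handles multiples of 4
    const float k = 2.0f;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_in(n), h_scalar(n), h_vec(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1024);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    bool ok = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    const int threads = 256;
    scale_scalar<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, k, n);
    ok = ok && cudaMemcpy(h_scalar.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    const int n4 = n / 4;
    scale_vec4<<<(n4 + threads - 1) / threads, threads>>>(
        reinterpret_cast<const float4*>(d_in),
        reinterpret_cast<float4*>(d_out), k, n4);
    ok = ok && cudaMemcpy(h_vec.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);

    // Inputs are small integers, so multiplying by 2 is exact in float.
    for (int i = 0; ok && i < n; ++i) {
        ok = (h_scalar[i] == k * h_in[i]) && (h_vec[i] == k * h_in[i]);
    }
    return ok;
}
```

Calling `run_scale_check(1 << 20)` from a `main` and compiling with `nvcc` should confirm that both kernels produce the same results. Timing them (for example with `cudaEvent_t` or Nsight Compute) is the way to see whether the wider accesses pay off on a given card.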