I just spent the last few days optimizing some CUDA kernels to go super fast.

It's *so much* effort to worry about memory alignment, access patterns, getting the compiler to emit wide load instructions and other special things on the hot path, vector math, math intrinsics, register usage, warp schedulers, etc.

But it's also *super* rewarding to see the huge impact all that stuff has on performance. It feels like all that hard work pays off when your code literally gets 10x faster! =D
