This post implements the solutions to GPU Puzzles (which use Numba) in Triton.
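Why Triton and not CUDA? The simple reason is that I only have a MacBook. To learn GPU programming, it's usually best to have a GPU that supports CUDA. With Triton, you can still implement kernels and check them for correctness on just a CPU. Triton is also a higher-level, "block-based" programming model than CUDA: instead of writing code for a single thread, you write code for a program instance that operates on a whole block of elements at once.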
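To make the "block-based" idea concrete, here is a rough sketch in plain NumPy. It is not Triton code and uses no Triton API; it only emulates the two mental models on the CPU. The function names (`add_ten_per_thread`, `add_ten_per_block`), the add-10 operation, and the sizes are made up for illustration.

```python
import numpy as np

def add_ten_per_thread(x, out, thread_id):
    # Thread-style view (CUDA-like): each thread handles one element
    # and must guard against running past the end of the array.
    if thread_id < x.shape[0]:
        out[thread_id] = x[thread_id] + 10

def add_ten_per_block(x, out, program_id, block_size):
    # Block-style view (Triton-like): each program instance handles
    # a whole block of offsets, with a mask for the ragged last block.
    offsets = program_id * block_size + np.arange(block_size)
    mask = offsets < x.shape[0]
    valid = offsets[mask]
    out[valid] = x[valid] + 10

x = np.arange(10, dtype=np.float32)
out_thread = np.zeros_like(x)
out_block = np.zeros_like(x)

# Emulate launching 16 threads (more than needed, hence the bounds check).
for tid in range(16):
    add_ten_per_thread(x, out_thread, tid)

# Emulate launching ceil(10 / 4) program instances of block size 4.
block_size = 4
num_programs = (x.shape[0] + block_size - 1) // block_size
for pid in range(num_programs):
    add_ten_per_block(x, out_block, pid, block_size)
```

Both loops produce the same result; the difference is the unit of work you reason about. In the block-style version the bounds check becomes a mask over a vector of offsets, which is roughly the pattern a Triton kernel follows.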