> GPUs are not necessarily optimized for power efficiency.
Actually they are, as are most other specialized architectures. If you divide their throughput by their power consumption, they're pretty efficient in terms of FLOPS per watt, and even more so at reduced precision, which is often enough. That's because their throughput is extremely high.
Basically the way a GPU works is it has clusters of really dumb cores, all of which essentially execute the same instruction at any given time. It's also very economical to spin up tens of thousands (or even millions) of threads. These aren't like OS threads, though: there's no OS thread abstraction at all; they're scheduled by the hardware. It's also pretty difficult to get the theoretical throughput numbers out of a GPU with most algorithms because of these architectural limitations. NVIDIA spends a ton of money on libraries so others don't have to deal with this, but it's not trivial at all. I took a course on GPU programming years ago, hoping to reuse some of the tricks in my CPU work, and it's so alien that the approaches are largely orthogonal.
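To make that concrete, here's a minimal CUDA sketch (the kernel, the names, and the sizes are mine, purely for illustration): one trivially parallel kernel, millions of hardware-scheduled threads, every one of them running the same code on a different element.

    #include <cuda_runtime.h>

    // Every thread does the exact same thing to one element (SIMT execution).
    __global__ void scale(float* data, float gain, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= gain;
    }

    int main() {
        const int n = 1 << 24;  // ~16.7M elements, one thread each
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        // Millions of threads are cheap: the hardware scheduler swaps warps
        // in and out to hide memory latency, and the OS never sees any of it.
        int block = 256;
        int grid = (n + block - 1) / block;  // enough blocks to cover all n
        scale<<<grid, block>>>(d, 0.5f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }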
The key thing to understand, though, is that a GPU can't realize any of that amazing throughput on the tiny data buffers a DSP typically deals with, and unlike on a DSP, on a GPU latency is not even a second thought but a distant third. GPUs need pretty large, contiguous slabs of data to spin up thousands of hardware threads and get even within an order of magnitude of their claimed maximum. They also need highly predictable ("coalesced") memory read and write patterns; without them, performance takes a massive nosedive.
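To illustrate the "coalesced" part, a sketch of the difference (the kernel names and the stride trick are mine): in the first kernel the 32 threads of a warp touch 32 consecutive floats, which the memory system can service in a couple of wide transactions; in the second they're scattered, and each load can turn into its own transaction.

    // Coalesced: thread i reads element i, so a warp's 32 loads land in
    // one or two contiguous 128-byte segments.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighboring threads read addresses `stride` floats apart,
    // so those same 32 loads can hit up to 32 different segments.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(long long)i * stride % n];
    }

Both get launched the same way as the sketch above and execute the same number of instructions, but the strided one is typically several times slower for any nontrivial stride.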
When dealing with audio, large chunks of time-series data imply massive latency. You can "mitigate latency" like that dude describes, but in doing so you'll very likely make using the GPU not worthwhile. And that's before you consider the inherent latency of the system the GPU is plugged into. The CPU-based modelers suck not because the CPU doesn't have enough FLOPS; most CPUs made in the past 5 years have more than enough, with a sufficient amount of elbow grease. It's that latency can't really be guaranteed on a typical consumer OS, especially under load.
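The buffering arithmetic alone shows why (48 kHz and the buffer sizes below are my assumptions, just to put numbers on it):

    #include <cstdio>

    // Latency contributed by buffering alone: buffer_size / sample_rate.
    int main() {
        const double rate = 48000.0;  // samples per second
        const int sizes[] = {64, 256, 16384, 262144};
        for (int s : sizes)
            printf("%7d samples -> %8.2f ms\n", s, 1000.0 * s / rate);
        return 0;
    }

A 64-sample buffer is about 1.3 ms, fine for real-time monitoring; the tens-of-thousands-of-samples slabs a GPU wants are already hundreds of milliseconds before the signal even crosses the PCIe bus.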