After more analysis, the TL;DR is that yes, the slowdown is due to memory locality, and yes, the pixel order is to blame. More interestingly, by writing the shader differently, we can greatly surpass the fragment shader's performance, though we obviously shouldn't rely on being able to do that regularly.
First, to expand on the analysis: the best way to figure out what's going on in the GPU is to ask it. In this case, the relevant tool is NVIDIA Nsight. After some fiddling, I got directly comparable results, which indicated that memory is the bottleneck in both cases, and that the bottleneck is worse in the compute shader's case.
Since the actual shader code is substantively identical at the assembly level (see above), and (slightly-better-than-)equal performance can be achieved by removing memory from the equation by altering the shading code, we can be confident that the pixel shading order is to blame.
Perhaps we can find a better shading order?
Spoiler alert: we can. After some experimentation, I arrived at a new shader that uses a global queue of tiles: each warp grabs a tile and shades the pixels within it in scanline order. This turns out to be 50% faster than the fragment shader!
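To make the tile-queue idea concrete, here is a minimal CPU-side sketch in Python of the traversal order it produces. It is only a model, not the actual shader: the function name `tile_queue_order`, the tile sizes, and the round-robin interleaving of warps are all assumptions made for illustration, and a real GPU schedules warps however it likes.

```python
import itertools

def tile_queue_order(width, height, tile_w, tile_h, num_warps=1):
    """Return [(warp_id, tile, x, y), ...] in the order pixels get shaded.

    Hypothetical model: a shared counter plays the role of the global
    atomic counter, and warps are stepped round-robin, one pixel at a time.
    """
    tiles_x = -(-width // tile_w)   # ceiling division
    tiles_y = -(-height // tile_h)
    num_tiles = tiles_x * tiles_y
    next_tile = itertools.count()   # stands in for the global atomic counter

    def warp(warp_id):
        while True:
            tile = next(next_tile)  # claim the next tile from the queue
            if tile >= num_tiles:
                return              # queue exhausted: this warp is done
            x0 = (tile % tiles_x) * tile_w
            y0 = (tile // tiles_x) * tile_h
            # Shade the tile in scanline order, clipped to the framebuffer.
            for y in range(y0, min(y0 + tile_h, height)):
                for x in range(x0, min(x0 + tile_w, width)):
                    yield warp_id, tile, x, y

    warps = [warp(i) for i in range(num_warps)]
    order = []
    while warps:
        for w in list(warps):
            step = next(w, None)
            if step is None:
                warps.remove(w)
            else:
                order.append(step)
    return order
```

With a single warp, `tile_queue_order(4, 2, 2, 2)` walks tile 0 in scanline order and then tile 1. With more warps, each one claims a different tile from the same counter, so neighboring tiles tend to be in flight at roughly the same time.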
Here's an animation from my HPG paper's presentation this week, which touched on this issue:
(It can be embiggened, which you may wish to do if you're having trouble reading the text.)
This summarizes the results of these experiments, along with the performance numbers for each and a visualization of what I surmise is going on behind the scenes (simplified: only one warp is shown, it is 8 wide, and latency-hiding is not visualized).
On the left, we have the fragment shader, labeled "Vendor Magic Goes Here". We don't know what the vendor is doing for their fragment shader pixel traversal order (though we could get hints e.g. by writing out atomic variables, etc.), but overall it works really well.
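The atomic-variable trick mentioned above can be modeled on the CPU. The Python sketch below assumes each shaded pixel takes a "ticket" from one global counter and writes it to an output image; sorting the pixels by ticket afterwards recovers the order in which they were shaded. The names `record_tickets` and `recover_order` are hypothetical, and on a real GPU the counter would be an atomic increment in the shader, which may itself perturb the timing being measured.

```python
import itertools

def record_tickets(shading_order, width, height):
    """Write each pixel's ticket (its position in the shading order) to an image."""
    ticket = itertools.count()  # stands in for a global atomic counter
    image = [[None] * width for _ in range(height)]
    for x, y in shading_order:
        image[y][x] = next(ticket)
    return image

def recover_order(image):
    """Sort pixels by ticket to reconstruct the order they were shaded in."""
    cells = [(t, x, y)
             for y, row in enumerate(image)
             for x, t in enumerate(row)
             if t is not None]
    return [(x, y) for _, x, y in sorted(cells)]
```

Rendering the ticket values as a color ramp would give a picture of the traversal order, which is the kind of hint one could collect about the vendor's fragment shader scheduling.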
In the middle, we have the original compute shader I described (with
0), which divides the framebuffer into rectangular work groups. The work groups are probably executed in a mostly reasonable order, precisely to mitigate these caching effects, but they are not guaranteed to run in any particular order, and because of latency hiding they will not. This explains why the groups walk over the framebuffer in a mostly coherent fashion while still skipping around a little. This version runs at half the speed of the fragment shader, and I believe the skipping around is a reasonable starting guess for the additional memory incoherency revealed in the profile.
Finally, we have the tiles version. Because the tiles are handed out from a queue (defined by a global counter, visualized above the tile), the tiles and pixels are processed more nearly in order (neglecting latency hiding and other thread groups). I believe this is a reasonable starting guess as to why this version turns out to be 50% faster than the fragment shader.
It is important to stress that, while these results are correct for this particular experiment, with these particular drivers, these results do not necessarily generalize. This is likely specific to this particular scene, view, and platform configuration, and this behavior may actually even be a bug. This is definitely interesting to play with, but don't go ripping apart your renderer (only) because of one datapoint from a narrowly-defined experiment.
Indeed, what kicked off this whole investigation was that a (more complex) compute shader had declined in relative performance since it was last profiled in 2018, using the same code on the same hardware. The only difference was an updated driver.
The lesson is simple: pixel shading orders are hard, and as much as possible they are best left to the GPU vendor to determine. Compute shaders give us the option to do shading-like operations, but we should not expect to be able to reliably exceed the performance of fragment shaders (even if, occasionally, we spectacularly can), because our implementations are not based on insider knowledge of how to optimize for the particular GPU, assuming there is even a single particular GPU to target.
So, if you're thinking about shading orders, that's really something the GPU should be doing for you: take it up with the vendor. The main reason to use a compute shader would be if you want the convenience or flexibility. Of course, if you thoroughly profile and see a performance gain, and you have reason to expect that the GPU infrastructure you are building on top of will not shift beneath your feet (e.g. you are writing for a console), then maybe using a compute shader is the right choice.