Performance of compute shaders vs. fragment shaders for deferred rendering



I wrote a deferred renderer that can execute its shading pass with either a fragment shader or a compute shader. Unfortunately, the compute-shader implementation runs slower, and I'm trying to understand why.

I believe I understand the proximate cause: memory locality when accessing textures. Somehow, the fragment shader's texture accesses are more coherent than the compute shader's.


To demonstrate this, I stripped out everything except the shadow-mapping code and then changed it to sample randomly. Something like (GLSL pseudocode):

uniform sampler2D tex_shadowmap;

uniform float param;

#ifdef COMPUTE_SHADER
layout(local_size_x=8, local_size_y=4, local_size_z=1) in;
#endif

struct RNG { uint64_t state; uint64_t inc; } _rng;
void rand_seed(ivec2 coord) { /*seed `_rng` with hash of `coord`*/ }
float rand_float() { /*return random float in [0,1]*/ }

void main() {
    rand_seed(/*pixel coordinate*/);

    vec4 light_coord = /*vertex in scaled/biased light's NDC*/;
    vec3 shadowmap_test_pos = light_coord.xyz / light_coord.w;

    float rand_shadow = 0.0;
    for (int i = 0; i < 200; ++i) {
        // Blend between the screen-correlated shadow-map coordinate and a fully random one.
        vec2 coord = fract(mix(shadowmap_test_pos.xy, vec2(rand_float(), rand_float()), param));
        float tap = textureLod(tex_shadowmap, coord, 0.0).r;
        // Depth comparison with a small bias; count the unshadowed taps.
        rand_shadow += clamp(shadowmap_test_pos.z, 0.0, 1.0) <= tap + 0.00001 ? 1.0 : 0.0;
    }
    vec4 color = vec4(vec3(rand_shadow) / 200.0, 1.0);

    /*[set `color` into output]*/
}

With param set to 0, the shadow map is sampled at shadowmap_test_pos and we get correct hard shadows for the scene. In that case the shadow-map lookup locations are somewhat correlated with the pixel coordinates, so we expect good performance. With param set to 1, we get a completely random texture coordinate, vec2(rand_float(), rand_float()), so the lookups are completely uncorrelated with the pixel coordinates and we expect poor performance.

When we try more values of param and measure the latency of the shadow pass with a timer query, something very interesting happens:

plot of param vs. latency

As you can see, with completely random coordinates (param = 1, right side) the fragment shader and the compute shader have the same performance. But as the coordinates become less random, whatever the fragment shader is doing to make its accesses more coherent starts to kick in. With deterministic coordinates correlated to screen position (param ≈ 0, left side), the fragment shader wins by 2× (note: the param = 0 case is omitted because the GLSL compiler optimizes the loop away).

What's particularly strange is that the fragment shader's advantage seems to depend on the texture sample coordinates being correlated with the pixel coordinates. For example, if I use (0, 1) as the deterministic coordinate instead of (0, 0), the effect disappears and both shaders have the same performance for any param.

The source and compiled code for these shaders are essentially identical. Apart from some setup and writing the data out (which could plausibly change things a little), the shaders are the same. You can see my comparison of the PTX disassembly here. Most of the loop body is taken up by the inlined RNG, but the point is that it is the same loop.

Note: the hardware tested was an NVIDIA GTX 1080 with current (446.14) drivers.


My question is basically: what can I do about this? I'm working in 8×4 tiles in the compute shader, but who knows what the fragment shader is doing. Still, I wouldn't expect whatever magical secret shading order the fragment shader uses to be so much better that running the same actual code produces a >2× difference in performance. (FWIW, I've tried different work-group sizes; nothing really changed the behavior above.)
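For concreteness, here is a minimal sketch of what that tiling amounts to on the compute side (the image binding, format, and stand-in output are illustrative, not the renderer's actual code). The work-group size declaration is essentially the only knob this mapping exposes:

#version 430
layout(local_size_x = 8, local_size_y = 4, local_size_z = 1) in;   // also tried e.g. 32x1, 16x2, 8x8

layout(rgba16f, binding = 0) uniform writeonly image2D img_output; // illustrative binding/format

void main() {
    // Work group (gx, gy) covers the 8x4 rectangle of pixels whose corner is (8*gx, 4*gy).
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);

    vec4 color = vec4(0.0);   // ...the shading loop from the listing above goes here...
    imageStore(img_output, pixel, color);
}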

There are some general discussions of how the different kinds of shaders work, but I haven't found anything that explains this. And while driver issues have caused weird behavior in the past, compute shaders have been in core GL for almost 8 years now, and using them for deferred shading is an obvious, arguably common, use case that I would expect to work well.

What am I missing here?


After more analysis, the TL;DR here is that, yes, the slowdown is due to memory locality, and yes the pixel order is to blame. More interestingly, by writing the shader differently, we can greatly surpass the fragment shader's performance—though we obviously shouldn't rely on being able to do that regularly.


First, to expand on the analysis: the best way to figure out what's going on in the GPU is to ask it. In this case, the relevant tool is NVIDIA NSight. After some fiddling, I got directly comparable results, which indicated that in both cases, memory is the bottleneck, and in the compute shader's case, it is worse.

Since the actual shader code is substantively identical at the assembly level (see above), and (slightly-better-than-)equal performance can be achieved by removing memory from the equation by altering the shading code, we can be confident that the pixel shading order is to blame.


Perhaps we can find a better shading order?

Spoiler alert: we can. After some experimentation, consider a new shader where there is a global queue of tiles, and each warp grabs a tile and shades the pixels within it in scanline order. This turns out to be 50% faster than the fragment shader!
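Here is a minimal sketch of that idea, assuming a 32-wide work group stands in for one warp and a framebuffer that is a whole number of tiles; the tile size, counter, bindings, and names are illustrative, not the exact shader from the experiment:

#version 430
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

layout(binding = 0) uniform atomic_uint next_tile;                  // the global tile queue
layout(rgba16f, binding = 0) uniform writeonly image2D img_output;  // illustrative output binding

const ivec2 TILE_SIZE = ivec2(32, 8);
uniform ivec2 num_tiles;                                            // tile grid dimensions

shared uint tile_index;

void main() {
    for (;;) {
        // One lane grabs the next tile for the whole warp/work group.
        if (gl_LocalInvocationIndex == 0u)
            tile_index = atomicCounterIncrement(next_tile);
        barrier();
        uint tile = tile_index;
        if (tile >= uint(num_tiles.x * num_tiles.y)) return;        // queue exhausted

        ivec2 tile_origin = ivec2(int(tile) % num_tiles.x,
                                  int(tile) / num_tiles.x) * TILE_SIZE;

        // Shade the tile's pixels in scanline order, 32 at a time.
        for (int i = int(gl_LocalInvocationIndex); i < TILE_SIZE.x * TILE_SIZE.y; i += 32) {
            ivec2 pixel = tile_origin + ivec2(i % TILE_SIZE.x, i / TILE_SIZE.x);
            vec4 color = vec4(0.0);   // ...the same shading loop as before...
            imageStore(img_output, pixel, color);
        }
        barrier();   // everyone finishes the tile before lane 0 grabs the next one
    }
}

Dispatched persistent-threads style with enough groups to fill the GPU, each group keeps pulling tiles until the counter runs past the tile count, so pixels end up being visited in something close to tile-by-tile scanline order.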


Here's an animation from my HPG paper's presentation this week, which touched on this issue: [animation comparing the three shading orders] (It can be embiggened, which you may wish to do if you're having trouble reading the text.)

This summarizes the results of these experiments, along with the performance numbers for each and a visualization of what I surmise is going on behind the scenes (simplified: only one warp is shown, it is 8 wide, and latency-hiding is not visualized).

On the left, we have the fragment shader, labeled "Vendor Magic Goes Here". We don't know what the vendor is doing for their fragment shader pixel traversal order (though we could get hints e.g. by writing out atomic variables, etc.), but overall it works really well.
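For instance, one way to get such hints (a sketch with made-up buffer names and bindings, not something from the experiment) is to tag every pixel with a global atomic "ticket" as it is shaded and then visualize the resulting order; the same trick in the compute path gives a direct comparison:

#version 430
layout(binding = 0) uniform atomic_uint shade_order;                   // hands out increasing tickets

layout(std430, binding = 1) writeonly buffer OrderBuf { uint order[]; };
uniform ivec2 framebuffer_size;

out vec4 frag_color;

void main() {
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    // Earlier-shaded pixels receive smaller ticket values.
    uint ticket = atomicCounterIncrement(shade_order);
    order[pixel.y * framebuffer_size.x + pixel.x] = ticket;

    frag_color = vec4(0.0);   // the real shading is unchanged
}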

In the middle, we have the original compute shader I described (with param = 0), which divides the framebuffer into rectangular work groups. Notice that the work groups are probably launched in a mostly reasonable order precisely to mitigate these caching effects, but they are not guaranteed to execute in any particular order (and, because of latency hiding, they will not). This explains why the groups walk over the framebuffer in a mostly coherent fashion but still skip around a little. This version is half the speed of the fragment shader, and I believe that skipping around is a reasonable starting guess for the additional memory incoherency revealed in the profile.

Finally, we have the tiled version. Because the tiles are pulled from a queue (defined by a global counter, visualized above the tile), the tiles and pixels are processed much more nearly in order (neglecting latency hiding and other thread groups). I believe this is a reasonable starting guess as to why this version turns out to be 50% faster than the fragment shader.

It is important to stress that, while these results are correct for this particular experiment, with these particular drivers, these results do not necessarily generalize. This is likely specific to this particular scene, view, and platform configuration, and this behavior may actually even be a bug. This is definitely interesting to play with, but don't go ripping apart your renderer (only) because of one datapoint from a narrowly-defined experiment.


Indeed, what kicked off this whole investigation was that the performance of a (more-complex) compute shader had declined in relative performance since it was last profiled in 2018, using the same code on the same hardware. The only difference was an updated driver.

The lesson is simple: pixel shading orders are hard, and as much as possible they are best left to the GPU vendor to determine. Compute shaders give us the option to do shading-like operations, but we should not expect to be able to reliably exceed the performance of fragment shaders (even if, occasionally, we spectacularly can) because our implementations are not based on insider knowledge of how to optimize for the particular GPU—even when there is a single particular GPU at all.

So, if you're thinking about shading orders, that's really something the GPU should be doing for you: take it up with the vendor. The main reason to use a compute shader would be if you want the convenience or flexibility. Of course, if you thoroughly profile and see a performance gain, and you have reason to expect that the GPU infrastructure you are building on top of will not shift beneath your feet (e.g. you are writing for a console), then maybe using a compute shader is the right choice.