Register pressure in a compute shader


4

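I'm writing a ray tracer using DirectCompute / HLSL. First, eye rays are generated (one per pixel). The rays are then traced, shaded and reflected in a loop. In addition, a shadow ray is generated for each light source and tested for occlusion. As the scene structure I use a KD-tree with ropes$^1$.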

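I analyzed the shader with Joshua Barczak's Pyramid, and it looks like I have a major problem with register pressure (mostly vector registers). On the Fiji GCN architecture the shader uses 85 SGPRs and 81 VGPRs, which limits the number of simultaneous wavefronts per SIMD to 3 (the VGPRs are the limiting factor).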
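To make the arithmetic behind that limit of 3 explicit, here is a minimal C++ sketch. The hardware constants (256 VGPRs per lane per SIMD, allocation in blocks of 4, at most 10 resident wavefronts per SIMD) are assumptions about GCN that are not stated in the post, and `wavesPerSimdFromVgprs` is a hypothetical helper name. It only models the VGPR limit; SGPRs and LDS can cap occupancy as well.

```cpp
#include <algorithm>

// Assumed GCN limits (not taken from the post):
constexpr int kVgprsPerSimdLane = 256;  // VGPRs available per lane on one SIMD
constexpr int kVgprGranularity  = 4;    // VGPRs are allocated in blocks of 4
constexpr int kMaxWavesPerSimd  = 10;   // hardware cap on resident wavefronts

// Hypothetical helper: wavefronts per SIMD allowed by VGPR usage alone.
int wavesPerSimdFromVgprs(int vgprsPerThread) {
    if (vgprsPerThread <= 0) return kMaxWavesPerSimd;
    // Round the request up to the allocation granularity.
    int allocated = (vgprsPerThread + kVgprGranularity - 1)
                    / kVgprGranularity * kVgprGranularity;
    return std::min(kMaxWavesPerSimd, kVgprsPerSimdLane / allocated);
}
```

Under these assumptions, 81 VGPRs round up to 84 and give 3 wavefronts, and the shader would have to drop to 64 VGPRs or fewer to reach 4.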

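I also implemented a simple counter in the shader that atomically increments a value when a thread starts, atomically decrements it when the thread finishes, and keeps track of the maximum number of simultaneously running shader threads. I managed to get rid of some data and raise that maximum from 8192 to 9216, which resulted in a relative speedup of ~13%. With an empty dummy shader I get 16384 concurrent threads.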
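The post does not show the counter itself, so the following is only a CPU-side C++ model of the pattern it describes, using `std::atomic` in place of the shader atomics. In HLSL the same idea would presumably be built from `InterlockedAdd` and `InterlockedMax` on a UAV. `ConcurrencyCounter`, `enter` and `leave` are hypothetical names.

```cpp
#include <atomic>

// CPU stand-in for the shader-side counter (assumption: the real one lives in a UAV).
struct ConcurrencyCounter {
    std::atomic<int> active{0};  // threads currently between enter() and leave()
    std::atomic<int> peak{0};    // highest value of `active` observed so far

    // Call at thread start: bump the active count and update the running maximum.
    void enter() {
        int now = active.fetch_add(1) + 1;
        int seen = peak.load();
        // Atomic max: retry until peak >= now or our store succeeds.
        while (seen < now && !peak.compare_exchange_weak(seen, now)) {}
    }

    // Call at thread end.
    void leave() { active.fetch_sub(1); }
};
```

Note that the extra atomics themselves cost time and registers, so a counter like this is best treated as a diagnostic and removed for final timing runs.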

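I tried to eliminate excess variables, especially vectors, by making their lifetimes as short as possible. I also tried storing some variables in groupshared memory and reading and writing them directly from there. None of this changed the register count at all.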

Are there any practical tips for lowering register pressure? Is this as serious as I think it is?

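I have also thought about splitting the ray tracer into several kernels. As I understand it, this should relieve the register pressure considerably, but it would also take a lot of coding effort. Does this sound like an idea worth trying?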


$^1$ Popov et al.: Stackless KD-Tree Traversal for High Performance GPU Ray Tracing (2007).

4

Lowering register pressure doesn't necessarily give you a performance boost. I recently went through this exercise myself on GCN hardware (for a simple ray tracer) and reduced register pressure enough to raise occupancy from 2 to 4, which had no impact on performance. Reducing the pressure is generally a good idea if you need to hide memory latency, but whether it helps depends on what the real bottlenecks in your shaders are.

Since you are working on a GPU ray tracer, you might get a bigger performance improvement by reducing thread divergence instead, for example by grouping rays by location and direction so that the rays in each wave follow similar paths through the KD-tree. I'm not familiar with the paper you refer to, though, and you may already have considered this.
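One possible way to group rays by location and direction is to sort them by a key made of the direction octant and a Morton code of the quantized origin. The C++ sketch below is a CPU-side illustration under stated assumptions, not a tuned GPU implementation: the `Ray` layout, the 10-bit grid, the cubic scene bounds `[lo, hi]` and all function names are hypothetical, and on the GPU the sort would be a parallel one.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray {
    float ox, oy, oz;  // origin
    float dx, dy, dz;  // direction
};

// Direction octant: one bit per negative direction component.
inline std::uint32_t octant(const Ray& r) {
    return (r.dx < 0.0f ? 1u : 0u) | (r.dy < 0.0f ? 2u : 0u) | (r.dz < 0.0f ? 4u : 0u);
}

// Quantize a coordinate to a grid cell in [0, 1023], assuming bounds [lo, hi].
inline std::uint32_t cell(float v, float lo, float hi) {
    float t = std::min(std::max((v - lo) / (hi - lo), 0.0f), 1.0f);
    return std::min<std::uint32_t>(1023u, static_cast<std::uint32_t>(t * 1024.0f));
}

// Interleave three 10-bit cell indices into a 30-bit Morton code.
inline std::uint32_t morton3(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    std::uint32_t code = 0;
    for (int bit = 0; bit < 10; ++bit) {
        code |= ((x >> bit) & 1u) << (3 * bit);
        code |= ((y >> bit) & 1u) << (3 * bit + 1);
        code |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return code;
}

// Sort key: octant in the high bits, spatial Morton code in the low 30 bits.
inline std::uint64_t coherenceKey(const Ray& r, float lo, float hi) {
    std::uint32_t m = morton3(cell(r.ox, lo, hi), cell(r.oy, lo, hi), cell(r.oz, lo, hi));
    return (static_cast<std::uint64_t>(octant(r)) << 30) | m;
}

// Reorder rays so that neighbours in the array share direction octant and region.
void sortRaysForCoherence(std::vector<Ray>& rays, float lo, float hi) {
    std::stable_sort(rays.begin(), rays.end(), [lo, hi](const Ray& a, const Ray& b) {
        return coherenceKey(a, lo, hi) < coherenceKey(b, lo, hi);
    });
}
```

After sorting, consecutive blocks of 64 rays can be assigned to one wavefront each. Whether the sorting cost pays off depends on how incoherent the rays are, so it is more likely to help for reflection bounces than for the already coherent eye rays.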