CUDA / Optimization

Accelerating 3D Gaussian Splatting

Research-focused performance optimizations for the CUDA rasterization backend used in 3D Gaussian Splatting training.

Figure: Rendered view produced by the optimized 3D Gaussian Splatting pipeline.

3D Gaussian Splatting (Kerbl et al. 2023) enables high-quality, real-time novel view synthesis, but training remains performance-sensitive due to heavy forward and backward rasterization passes.

The codebase is a derivative work of graphdeco-inria/diff-gaussian-rasterization, extended to study memory behavior, kernel efficiency, and training throughput under visual-quality constraints.

This project profiles and accelerates the CUDA kernels by targeting:

  • Redundant global memory traffic
  • Cache behavior (L1/TEX vs shared memory)
  • Atomic operation pressure in backward passes
  • Bandwidth-dominated workloads

Optimizations

1. Image Chunk Size Allocation Fix

Corrected the buffer allocation by separating per-pixel buffers from per-tile buffers, reducing temporary image-chunk memory from 16.0 MB → 8.4 MB without changing runtime behavior.
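A minimal host-side sketch of the sizing split; the struct and field names below are hypothetical, not the actual image-state layout in diff-gaussian-rasterization. The point is that per-pixel buffers scale with the pixel count, while the tile-range table only needs one entry per 16×16 tile.

    #include <cuda_runtime.h>   // uint2
    #include <cstdint>
    #include <cstddef>

    // Illustrative sketch of the allocation split (names are hypothetical).
    constexpr int TILE = 16;

    struct PerPixelState {
        float*    accum_alpha;  // final transmittance per pixel
        uint32_t* n_contrib;    // number of contributing Gaussians per pixel
    };

    struct PerTileState {
        uint2* ranges;          // [start, end) into the sorted Gaussian list, per tile
    };

    // Size the two chunks independently instead of sizing everything by W*H.
    size_t perPixelChunkBytes(int width, int height) {
        size_t pixels = size_t(width) * size_t(height);
        return pixels * (sizeof(float) + sizeof(uint32_t));
    }

    size_t perTileChunkBytes(int width, int height) {
        size_t tiles = size_t((width + TILE - 1) / TILE) *
                       size_t((height + TILE - 1) / TILE);
        return tiles * sizeof(uint2);
    }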

2. Shared-Memory Caching

Extended the cooperative fetch stage to cache per-Gaussian feature vectors and depth values. This eliminated redundant per-thread global loads inside the innermost compositing loop.
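A simplified sketch of the extended cooperative fetch, assuming one 16×16 tile per block; the identifiers, feature layout, and alpha term are illustrative rather than the exact kernel code.

    #include <cuda_runtime.h>

    #define BLOCK_SIZE 256   // one 16x16 tile, one thread per pixel
    #define CHANNELS   3

    // Illustrative forward-pass sketch: each batch of BLOCK_SIZE Gaussians is
    // fetched cooperatively into shared memory, so the inner compositing loop
    // reads features (and depths) from shared memory instead of global memory.
    __global__ void renderTileSketch(const int*    __restrict__ point_list,     // sorted Gaussian ids for this tile
                                     const float2* __restrict__ points_xy,      // projected 2D means
                                     const float4* __restrict__ conic_opacity,  // inverse 2D covariance + opacity
                                     const float*  __restrict__ features,       // per-Gaussian color features
                                     const float*  __restrict__ depths,
                                     int range_start, int range_end,
                                     float* __restrict__ out_color)
    {
        __shared__ float2 s_xy[BLOCK_SIZE];
        __shared__ float4 s_conic_opacity[BLOCK_SIZE];
        __shared__ float  s_depth[BLOCK_SIZE];             // newly cached; feeds depth output / backward, omitted here
        __shared__ float  s_feat[BLOCK_SIZE * CHANNELS];   // newly cached feature vectors

        const float2 pixf = make_float2(float(threadIdx.x % 16), float(threadIdx.x / 16)); // pixel inside the tile
        float accum[CHANNELS] = { 0.f, 0.f, 0.f };
        float T = 1.f;                                     // running transmittance

        const int todo = range_end - range_start;
        for (int batch = 0; batch * BLOCK_SIZE < todo; batch++) {
            // Cooperative fetch: one Gaussian per thread per batch.
            const int fetch = batch * BLOCK_SIZE + threadIdx.x;
            if (fetch < todo) {
                const int g = point_list[range_start + fetch];
                s_xy[threadIdx.x]            = points_xy[g];
                s_conic_opacity[threadIdx.x] = conic_opacity[g];
                s_depth[threadIdx.x]         = depths[g];   // previously one global load per pixel
                for (int c = 0; c < CHANNELS; c++)          // previously CHANNELS global loads per pixel
                    s_feat[threadIdx.x * CHANNELS + c] = features[g * CHANNELS + c];
            }
            __syncthreads();

            // Inner compositing loop now touches only shared memory.
            const int count = min(BLOCK_SIZE, todo - batch * BLOCK_SIZE);
            for (int j = 0; j < count; j++) {
                const float2 d  = make_float2(s_xy[j].x - pixf.x, s_xy[j].y - pixf.y);
                const float4 co = s_conic_opacity[j];
                const float power = -0.5f * (co.x * d.x * d.x + co.z * d.y * d.y) - co.y * d.x * d.y;
                const float alpha = fminf(0.99f, co.w * __expf(power));
                if (alpha < 1.f / 255.f)
                    continue;
                for (int c = 0; c < CHANNELS; c++)
                    accum[c] += T * alpha * s_feat[j * CHANNELS + c];
                T *= (1.f - alpha);
            }
            __syncthreads();
        }

        const int pix = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        for (int c = 0; c < CHANNELS; c++)
            out_color[pix * CHANNELS + c] = accum[c];
    }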

Figure: Nsight Compute memory chart showing the shift from L1/TEX traffic to explicit shared-memory reuse.
  • -95.7% global memory instructions
  • -68.5% L1/TEX activity
  • +68.7% shared-memory loads
  • Kernel time reduced 2.85 ms → 2.50 ms

This shifts data reuse from implicit cache behavior to explicit shared-memory management, yielding more predictable access patterns and slightly higher SM utilization.

Note: at higher iteration counts, both memory-access strategies show similar runtimes.

3. Atomic Operation Reduction (Experimental)

Targets the backward rasterization kernel by staging partial results in shared memory to reduce repeated atomic updates.

Achieved 2.88× faster training (190 s → 66 s), though with some visual-quality degradation due to numerical interference.
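A minimal sketch of the staging pattern, assuming every thread of a tile processes the same Gaussian at the same loop step (inactive pixels contribute zero); the function and variable names are hypothetical, not the actual backward kernel.

    #include <cuda_runtime.h>

    #define BLOCK_SIZE 256

    // Illustrative sketch: the block first reduces its per-thread gradient
    // contributions in shared memory, then a single thread commits one
    // atomicAdd instead of BLOCK_SIZE atomicAdds.
    __device__ void accumulateGradientStaged(float* __restrict__ dL_dparam,
                                             int gaussian_id, float my_contrib)
    {
        __shared__ float partial[BLOCK_SIZE];
        partial[threadIdx.x] = my_contrib;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        // One global atomic per block instead of one per thread.
        if (threadIdx.x == 0)
            atomicAdd(&dL_dparam[gaussian_id], partial[0]);
        __syncthreads();   // keep the shared buffer safe for the next call
    }

One consequence of staging is that the floating-point accumulation order changes, so gradients are not bit-identical to the fully atomic version.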

4. Mixed-Precision Spherical Harmonics

Stores SH coefficients in FP16 to reduce bandwidth/VRAM usage, converting to FP32 in registers for arithmetic.

  • +45.6% training throughput
  • -18% peak VRAM usage
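A small sketch of the storage/compute split, assuming FP16 (__half) storage and a hypothetical per-Gaussian basis array; the project's actual SH layout may differ.

    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    #define SH_COEFFS 16   // degree-3 SH: 16 coefficients per color channel

    // Illustrative sketch: SH coefficients live in global memory as FP16
    // (half the bandwidth and VRAM of FP32) and are widened to FP32 in
    // registers before any arithmetic. `sh_basis` holds basis values assumed
    // precomputed per Gaussian from its view direction.
    __global__ void evalSHColorSketch(const __half* __restrict__ sh,        // [N * 3 * SH_COEFFS], FP16 storage
                                      const float*  __restrict__ sh_basis,  // [N * SH_COEFFS], FP32
                                      float*        __restrict__ colors,    // [N * 3], FP32 output
                                      int num_gaussians)
    {
        const int g = blockIdx.x * blockDim.x + threadIdx.x;
        if (g >= num_gaussians)
            return;

        for (int c = 0; c < 3; c++) {
            float result = 0.f;
            for (int i = 0; i < SH_COEFFS; i++) {
                // FP16 load -> FP32 register conversion -> FP32 multiply-accumulate.
                const float coeff = __half2float(sh[(g * 3 + c) * SH_COEFFS + i]);
                result += coeff * sh_basis[g * SH_COEFFS + i];
            }
            colors[g * 3 + c] = result;
        }
    }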
Figure (iteration 30): early stages; overhead leads to lower performance.
Figure (iteration 300): transition phase.
Figure (iteration 3000): reduced memory traffic yields a significant speedup.

Validation & Metrics

All optimizations are evaluated under identical training settings and validated using profiling tools.

  • Image Quality: NVIDIA FLIP
  • Profiling: NVIDIA Nsight Systems, Nsight Compute
  • Metrics: Iteration time, Throughput (it/s), Memory Traffic
  • Datasets: Playroom (baseline) & Truck (high-load)
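For context, a minimal sketch of how per-iteration time (and thus it/s) can be measured with CUDA events; the reported numbers were collected with the Nsight tools rather than this helper, and `launch_iteration` is a placeholder for one training iteration's kernel launches.

    #include <cuda_runtime.h>

    // Minimal CUDA-event timing sketch for iteration time / throughput.
    float timeIterationMs(void (*launch_iteration)())
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch_iteration();            // forward + backward rasterization, etc.
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);   // throughput (it/s) = 1000 / ms
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }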

Authors

Kaan Altıntaş

  • Image Chunk Size Allocation Fix
  • Shared-Memory Caching

Storm van Wassenaar

  • Atomic Operation Reduction

Ye Hua

  • Mixed-Precision Spherical Harmonics