CUDA / Optimization

Accelerating 3D Gaussian Splatting

Research-focused performance optimizations for the CUDA rasterization backend used in 3D Gaussian Splatting training.

Figure: Rendered view produced by the optimized 3D Gaussian Splatting pipeline.

3D Gaussian Splatting (Kerbl et al. 2023) enables high-quality, real-time novel view synthesis, but training remains performance-sensitive due to heavy forward and backward rasterization passes.

The codebase is a derivative work of graphdeco-inria/diff-gaussian-rasterization, extended to study memory behavior, kernel efficiency, and training throughput under visual-quality constraints.

This project profiles and accelerates the CUDA kernels by targeting:

  • Redundant global memory traffic
  • Cache behavior (L1/TEX vs shared memory)
  • Atomic operation pressure in backward passes
  • Bandwidth-dominated workloads

Optimizations

1. Image Chunk Size Allocation Fix

Corrected the buffer allocation by separating per-pixel buffers from per-tile buffers, reducing temporary image-chunk memory from 16.0 MB → 8.4 MB without changing runtime behavior.
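A minimal host-side sketch of the sizing split; the struct and field names below are hypothetical, not the actual image-state layout in diff-gaussian-rasterization. The point is that per-pixel buffers scale with the pixel count, while the tile-range table only needs one entry per 16×16 tile.

    #include <cuda_runtime.h>   // uint2
    #include <cstdint>
    #include <cstddef>

    // Illustrative sketch of the allocation split (names are hypothetical).
    constexpr int TILE = 16;

    struct PerPixelState {
        float*    accum_alpha;  // final transmittance per pixel
        uint32_t* n_contrib;    // number of contributing Gaussians per pixel
    };

    struct PerTileState {
        uint2* ranges;          // [start, end) into the sorted Gaussian list, per tile
    };

    // Size the two chunks independently instead of sizing everything by W*H.
    size_t perPixelChunkBytes(int width, int height) {
        size_t pixels = size_t(width) * size_t(height);
        return pixels * (sizeof(float) + sizeof(uint32_t));
    }

    size_t perTileChunkBytes(int width, int height) {
        size_t tiles = size_t((width + TILE - 1) / TILE) *
                       size_t((height + TILE - 1) / TILE);
        return tiles * sizeof(uint2);
    }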

2. Shared-Memory Caching

Extended the cooperative fetch stage to cache per-Gaussian feature vectors and depth values. This eliminated redundant per-thread global loads inside the innermost compositing loop.
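A simplified sketch of the extended cooperative fetch, assuming one 16×16 tile per block; the identifiers, feature layout, and alpha term are illustrative rather than the exact kernel code.

    #include <cuda_runtime.h>

    #define BLOCK_SIZE 256   // one 16x16 tile, one thread per pixel
    #define CHANNELS   3

    // Illustrative forward-pass sketch: each batch of BLOCK_SIZE Gaussians is
    // fetched cooperatively into shared memory, so the inner compositing loop
    // reads features (and depths) from shared memory instead of global memory.
    __global__ void renderTileSketch(const int*    __restrict__ point_list,     // sorted Gaussian ids for this tile
                                     const float2* __restrict__ points_xy,      // projected 2D means
                                     const float4* __restrict__ conic_opacity,  // inverse 2D covariance + opacity
                                     const float*  __restrict__ features,       // per-Gaussian color features
                                     const float*  __restrict__ depths,
                                     int range_start, int range_end,
                                     float* __restrict__ out_color)
    {
        __shared__ float2 s_xy[BLOCK_SIZE];
        __shared__ float4 s_conic_opacity[BLOCK_SIZE];
        __shared__ float  s_depth[BLOCK_SIZE];             // newly cached; feeds depth output / backward, omitted here
        __shared__ float  s_feat[BLOCK_SIZE * CHANNELS];   // newly cached feature vectors

        const float2 pixf = make_float2(float(threadIdx.x % 16), float(threadIdx.x / 16)); // pixel inside the tile
        float accum[CHANNELS] = { 0.f, 0.f, 0.f };
        float T = 1.f;                                     // running transmittance

        const int todo = range_end - range_start;
        for (int batch = 0; batch * BLOCK_SIZE < todo; batch++) {
            // Cooperative fetch: one Gaussian per thread per batch.
            const int fetch = batch * BLOCK_SIZE + threadIdx.x;
            if (fetch < todo) {
                const int g = point_list[range_start + fetch];
                s_xy[threadIdx.x]            = points_xy[g];
                s_conic_opacity[threadIdx.x] = conic_opacity[g];
                s_depth[threadIdx.x]         = depths[g];   // previously one global load per pixel
                for (int c = 0; c < CHANNELS; c++)          // previously CHANNELS global loads per pixel
                    s_feat[threadIdx.x * CHANNELS + c] = features[g * CHANNELS + c];
            }
            __syncthreads();

            // Inner compositing loop now touches only shared memory.
            const int count = min(BLOCK_SIZE, todo - batch * BLOCK_SIZE);
            for (int j = 0; j < count; j++) {
                const float2 d  = make_float2(s_xy[j].x - pixf.x, s_xy[j].y - pixf.y);
                const float4 co = s_conic_opacity[j];
                const float power = -0.5f * (co.x * d.x * d.x + co.z * d.y * d.y) - co.y * d.x * d.y;
                const float alpha = fminf(0.99f, co.w * __expf(power));
                if (alpha < 1.f / 255.f)
                    continue;
                for (int c = 0; c < CHANNELS; c++)
                    accum[c] += T * alpha * s_feat[j * CHANNELS + c];
                T *= (1.f - alpha);
            }
            __syncthreads();
        }

        const int pix = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        for (int c = 0; c < CHANNELS; c++)
            out_color[pix * CHANNELS + c] = accum[c];
    }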

Figure: Nsight Compute memory chart showing the shift from L1/TEX traffic to explicit shared-memory reuse.
  • -95.7% global memory instructions
  • -68.5% L1/TEX activity
  • +68.7% shared-memory loads
  • Kernel time reduced 2.85 ms → 2.50 ms

This shifts data reuse from implicit cache behavior to explicit shared-memory management, yielding more predictable access patterns and slightly higher SM utilization.

Note: at higher iteration counts, both memory-access strategies show similar runtimes.

3. Atomic Operation Reduction (Experimental)

Targets the backward rasterization kernel by staging partial results in shared memory to reduce repeated atomic updates.

Achieved 2.88× faster training (190 s → 66 s), though with some visual-quality degradation due to numerical interference.
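A minimal sketch of the staging pattern, assuming every thread of a tile processes the same Gaussian at the same loop step (inactive pixels contribute zero); the function and variable names are hypothetical, not the actual backward kernel.

    #include <cuda_runtime.h>

    #define BLOCK_SIZE 256

    // Illustrative sketch: the block first reduces its per-thread gradient
    // contributions in shared memory, then a single thread commits one
    // atomicAdd instead of BLOCK_SIZE atomicAdds.
    __device__ void accumulateGradientStaged(float* __restrict__ dL_dparam,
                                             int gaussian_id, float my_contrib)
    {
        __shared__ float partial[BLOCK_SIZE];
        partial[threadIdx.x] = my_contrib;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        // One global atomic per block instead of one per thread.
        if (threadIdx.x == 0)
            atomicAdd(&dL_dparam[gaussian_id], partial[0]);
        __syncthreads();   // keep the shared buffer safe for the next call
    }

One consequence of staging is that the floating-point accumulation order changes, so gradients are not bit-identical to the fully atomic version.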

4. Mixed-Precision Spherical Harmonics

Stores SH coefficients in FP16 to reduce bandwidth/VRAM usage, converting to FP32 in registers for arithmetic.

  • +45.6% training throughput
  • -18% peak VRAM usage
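A small sketch of the storage/compute split, assuming FP16 (__half) storage and a hypothetical per-Gaussian basis array; the project's actual SH layout may differ.

    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    #define SH_COEFFS 16   // degree-3 SH: 16 coefficients per color channel

    // Illustrative sketch: SH coefficients live in global memory as FP16
    // (half the bandwidth and VRAM of FP32) and are widened to FP32 in
    // registers before any arithmetic. `sh_basis` holds basis values assumed
    // precomputed per Gaussian from its view direction.
    __global__ void evalSHColorSketch(const __half* __restrict__ sh,        // [N * 3 * SH_COEFFS], FP16 storage
                                      const float*  __restrict__ sh_basis,  // [N * SH_COEFFS], FP32
                                      float*        __restrict__ colors,    // [N * 3], FP32 output
                                      int num_gaussians)
    {
        const int g = blockIdx.x * blockDim.x + threadIdx.x;
        if (g >= num_gaussians)
            return;

        for (int c = 0; c < 3; c++) {
            float result = 0.f;
            for (int i = 0; i < SH_COEFFS; i++) {
                // FP16 load -> FP32 register conversion -> FP32 multiply-accumulate.
                const float coeff = __half2float(sh[(g * 3 + c) * SH_COEFFS + i]);
                result += coeff * sh_basis[g * SH_COEFFS + i];
            }
            colors[g * 3 + c] = result;
        }
    }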
Figure (iteration 30): early stages; overhead leads to lower performance.
Figure (iteration 300): transition phase.
Figure (iteration 3000): reduced memory traffic yields a significant speedup.

Validation & Metrics

All optimizations are evaluated under identical training settings and validated using profiling tools.

  • Image Quality: NVIDIA FLIP
  • Profiling: NVIDIA Nsight Systems, Nsight Compute
  • Metrics: Iteration time, Throughput (it/s), Memory Traffic
  • Datasets: Playroom (baseline) & Truck (high-load)
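For context, a minimal sketch of how per-iteration time (and thus it/s) can be measured with CUDA events; the reported numbers were collected with the Nsight tools rather than this helper, and `launch_iteration` is a placeholder for one training iteration's kernel launches.

    #include <cuda_runtime.h>

    // Minimal CUDA-event timing sketch for iteration time / throughput.
    float timeIterationMs(void (*launch_iteration)())
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch_iteration();            // forward + backward rasterization, etc.
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);   // throughput (it/s) = 1000 / ms
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }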

Authors

Kaan Altıntaş

  • Image Chunk Size Allocation Fix
  • Shared-Memory Caching

Storm van Wassenaar

  • Atomic Operation Reduction

Ye Hua

  • Mixed-Precision Spherical Harmonics