3D Gaussian Splatting Experiments

Motivation

In the previous post, we looked at the history of Gaussian Splatting. Now, let's run some experiments with 3D Gaussian Splatting and validate the results. I also want to approach this top-down, so I will start by running the 3DGS pipeline on a custom dataset and then analyze the results. The goal is to understand how the pipeline works end-to-end and to verify that it produces reasonable results on a real-world scene.

Objective

Run the 3DGS implementation end-to-end on a custom dataset (a self-captured indoor scene) for the first time and validate the results.

  • Experience the full pipeline: COLMAP SfM → 3DGS training → point cloud output → viewer visualization (see the command sketch after this list)
  • Verify that training converges and produces visually reasonable results on a custom scene
  • Obtain baseline quantitative metrics (PSNR / SSIM / LPIPS) for future comparison
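
For reference, here is a minimal sketch of the end-to-end commands, assuming the official graphdeco-inria/gaussian-splatting repository layout (paths are illustrative):

# 1. COLMAP SfM: convert.py runs feature extraction, matching, and sparse reconstruction
python convert.py -s data/my_scene/playroom

# 2. Train 3DGS; point clouds are written under <model_path>/point_cloud/
python train.py -s data/my_scene/playroom -m output/playroom

# 3. Render views of the trained model for inspection
python render.py -m output/playroom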

First Run: Gaussian Splatting

Here is the setup for my first run of the 3DGS pipeline, on the playroom dataset (225 images). My GPU was an NVIDIA GeForce RTX 2070 Super (8GB VRAM), and training took about 2 days to complete. The viewer was used to visualize the output point cloud.

The config was as follows:

Namespace(
  data_device='cuda',
  eval=False,            # ← BUG: no train/test split → metrics are NaN
  images='images',
  resolution=-1,         # original resolution
  sh_degree=3,           # Spherical Harmonics degree
  source_path='C:\\Users\\skcjf\\project\\gaussian-splatting\\data\\my_scene\\playroom',
  model_path='./output/858ba1ea-e',
  train_test_exp=False,
  white_background=False
)

Results

  • Training time: ~2 days (RTX 2070 Super)
  • Iterations: 30,000 (checkpoints at 7,000 and 30,000)
  • Output: gaussian-splatting\output\858ba1ea-e\point_cloud\iteration_30000

I did not get the results I expected: because I did not add the --eval flag to the config, all the metrics came out as NaN. Still, the point cloud output looked reasonable, and the viewer visualization showed a decent reconstruction of the scene. Since PSNR, SSIM, and LPIPS were not computed due to the missing evaluation flag, I will need to rerun with --eval to get those quantitative results.


Iteration Comparison in Point Cloud Output:

| Iteration | 7K | 30K |
| --- | --- | --- |
| Render | (image) | (image) |

  • Front-facing view (looking into the room): 7k and 30k are nearly identical
  • Looking up at the ceiling: visible holes/gaps → likely insufficient training views from upward angles, or densification did not cover that region adequately

Problems Encountered:

  • Output folder names are random hashes (e.g. 858ba1ea-e), so locating the actual results was initially confusing
  • No build errors; used a separate Python virtual environment
  • Training took ~2 days on RTX 2070 Super, which saturated the GPU entirely (couldn’t even run YouTube simultaneously)
  • 4 total attempts, 3 failed/aborted before the successful run

Second Run with Evaluation

Objective

Run 3DGS on a standard benchmark dataset with --eval enabled to obtain real quantitative metrics (PSNR/SSIM/LPIPS) for the first time. This fixes the First Run problem, where eval=False produced NaN metrics.

  • Validate that the 3DGS pipeline produces results consistent with the original paper
  • Establish a quantitative baseline for future experiments (hyperparameter tuning, ablation)
  • Learn the full evaluation pipeline: train → render → metrics
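
As a sketch, the full evaluation loop with the official scripts looks roughly like this (model path is illustrative); with --eval, every 8th image is held out as the test set:

# Train with a train/test split
python train.py -s data/tandt/train -m output/train_eval --eval
# Render the held-out test views
python render.py -m output/train_eval
# Compute PSNR / SSIM / LPIPS over the rendered test set
python metrics.py -m output/train_eval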

I chose to use Google Colab for this run to leverage a more powerful GPU (A100 40GB with High RAM) and faster training times. I then ran multiple datasets (tandt_db / Mip-NeRF 360) as well, to validate the results from the paper.

The config was as follows:

Namespace(
  sh_degree=3,
  source_path='/content/gaussian-splatting/data/tandt/train',
  model_path='/content/drive/MyDrive/3dgs_output/train',  # NOTE: mislabeled — actual scene is "train"
  images='images',
  resolution=-1,         # original resolution
  white_background=False,
  train_test_exp=False,
  data_device='cuda',
  eval=True              # FIXED from First Run — train/test split enabled
)

Results

| Metric | Value |
| --- | --- |
| PSNR | 22.12 |
| SSIM | 0.822 |
| LPIPS | 0.196 |
| Gaussians | 1,095,714 |
| Iterations | 30,000 (checkpoints at 7,000 / 30,000) |
| GPU | A100 40GB (Google Colab) |

These are the metrics for the tandt_db (train) scene. The PSNR and SSIM values are consistent with the original 3DGS paper, which reported PSNR around 22-23 and SSIM around 0.8 for similar scenes. The LPIPS value of 0.196 also indicates reasonably good perceptual quality relative to the ground truth images. The number of Gaussians (1,095,714) is likewise in line with expectations for a scene of this complexity.
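
For reference, these are standard image-quality metrics rather than anything 3DGS-specific: PSNR (higher is better) is derived from the mean squared error against the ground-truth image,

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$

while SSIM (higher is better, max 1.0) measures structural similarity and LPIPS (lower is better) measures distance in a deep perceptual feature space.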

Problems Encountered

  • Mip-NeRF 360 dataset URL (storage.googleapis.com) returned 404 — dead link
  • Initial !unzip extracted to the wrong path → Could not recognize scene type! error. Fixed by ensuring sparse/0/ was in the correct location (expected layout below).
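
For context, the 3DGS data loader expects the COLMAP layout below (scene path is illustrative); the Could not recognize scene type! error is raised when it finds neither sparse/0/ nor a Blender-style transforms_train.json under the source path:

data/tandt/train/
├── images/       # input photos
└── sparse/0/     # COLMAP sparse reconstruction
    ├── cameras.bin
    ├── images.bin
    └── points3D.bin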

What I Learned

  1. --eval is mandatory for quantitative evaluation — without it, no train/test split occurs
  2. 3DGS with default config on standard benchmarks reproduces paper results — the pipeline works
  3. A100 vs RTX 2070 Super is a massive speed difference — Colab is the practical choice for experimentation
  4. Dataset URL availability is not guaranteed — always have backup sources

Third Run with Mip-NeRF 360 Dataset

Objective

Experimentally verify the impact of 3 key 3DGS hyperparameters on the Mip-NeRF 360 dataset:

  • sh_degree: Spherical Harmonics degree (controls angular detail), on the Kitchen scene
  • densify_grad_threshold: densification sensitivity (controls Gaussian count and thin structures), on the Bicycle scene
  • iterations: training length (controls convergence), on the Bonsai scene

The goal is to understand how these hyperparameters affect the final rendered quality (PSNR/SSIM/LPIPS) and visual appearance of the output point cloud. I will run multiple experiments varying one hyperparameter at a time while keeping others fixed, and then analyze the results.

Platform:

Google Colab with A100 High RAM for all experiments to ensure consistent training times and results.

Hyperparameter 1: Spherical Harmonics Degree (sh_degree)

  • Kitchen scene contains many reflective surfaces — stainless steel appliances, faucets, tiles
  • Higher SH degree enables more detailed view-dependent color changes (specular, highlights) as camera angle changes
  • Degree 0 = flat color (diffuse only), Degree 3 = highlights and reflections supported
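
Concretely, the number of SH coefficients per color channel grows quadratically with the degree $d$,

$$N_{\mathrm{coeff}} = (d+1)^2,$$

giving 1, 4, 9, and 16 coefficients for degrees 0 through 3 (the SH Coefficients column in the results table below), so per-Gaussian memory grows accordingly.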

Parameters:

# sh0
python train.py -s data/mipnerf360/kitchen --eval --sh_degree 0 -m .../kitchen/sh0
# sh1
python train.py -s data/mipnerf360/kitchen --eval --sh_degree 1 -m .../kitchen/sh1
# sh2
python train.py -s data/mipnerf360/kitchen --eval --sh_degree 2 -m .../kitchen/sh2
# sh3 (default)
python train.py -s data/mipnerf360/kitchen --eval --sh_degree 3 -m .../kitchen/sh3

Results

| Degree | SH Coefficients | SSIM | PSNR | LPIPS | Iterations |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.9244097 | 30.6005344 | 0.1243820 | 7k |
| 1 | 4 | 0.9281715 | 31.0590820 | 0.1205885 | 7k |
| 2 | 9 | 0.9311565 | 31.2686272 | 0.1176810 | 30k |
| 3 | 16 | 0.9326434 | 31.5447502 | 0.1158840 | 7k |

  • Note: sh0, sh1, sh3 trained at 7k iterations, sh2 at 30k → metrics for sh2 may be overestimated. Re-run with unified iteration count for fair comparison.

| Image Index | Description | Image |
| --- | --- | --- |
| 18 | SH degree 0 ~ 3 with ground truth comparison | (image) |
| 20 | SH degree 0 ~ 3 with ground truth comparison | (image) |
| 23 | SH degree 0 ~ 3 with ground truth comparison | (image) |

Hyperparameter 2: Densify Grad Threshold (densification)

  • Controls when Gaussians are split/cloned during training based on the gradient magnitude
  • Lower threshold → more aggressive densification → more Gaussians generated → better representation of thin structures (e.g. bicycle spokes, tree branches)
  • Higher threshold → fewer Gaussians → faster training but worse representation of fine details
  • Bicycle scene contains many thin structures (spokes, leaves, handlebars) that may benefit from aggressive densification and require 1–2 pixel level detail reconstruction
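
As a rough formalization of the criterion from the original paper: a Gaussian is densified (cloned if small, split if large) when its average view-space positional gradient exceeds the threshold,

$$\overline{\|\nabla_{p} L\|} > \tau_{\mathrm{pos}},$$

where $\tau_{\mathrm{pos}}$ is set by --densify_grad_threshold (default 0.0002).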

Parameters:

# Low (more Gaussians, detail ↑, VRAM ↑)
python train.py -s data/mipnerf360/bicycle --eval --densify_grad_threshold 0.0001 -m .../bicycle/dense_low

# Medium (default)
python train.py -s data/mipnerf360/bicycle --eval --densify_grad_threshold 0.0002 -m .../bicycle/dense_med

# High (fewer Gaussians, faster, detail ↓)
python train.py -s data/mipnerf360/bicycle --eval --densify_grad_threshold 0.0005 -m .../bicycle/dense_high

Results (30k iterations)

| Threshold | Config | SSIM | PSNR | LPIPS |
| --- | --- | --- | --- | --- |
| 0.0001 | Low | 0.7719774 | 25.3083401 | 0.1857106 |
| 0.0002 | Medium (Base) | 0.7473353 | 25.1360970 | 0.2421628 |
| 0.0005 | High | 0.6516691 | 24.1706982 | 0.3792148 |

| Image Index | Description | Image |
| --- | --- | --- |
| 0 | Densification threshold comparison (0.0001, 0.0002, 0.0005) with GT | (image) |
| 1 | Densification threshold comparison (0.0001, 0.0002, 0.0005) with GT | (image) |
| 2 | Densification threshold comparison (0.0001, 0.0002, 0.0005) with GT | (image) |

Observations (SH Degree)

  • SH degree 0: Reflective surfaces appear as flat, uniform color. No highlights
  • SH degree 3: Highlights on pots and faucets shift as the camera angle changes
  • Higher degree increases view-dependent effects but also increases VRAM usage and training time
  • Diminishing returns: 0→1 gives the largest PSNR gain (+0.46), subsequent steps are smaller
Observations (Densification)

  • Lower threshold (more Gaussians) = higher PSNR/SSIM and lower LPIPS → better across all metrics
  • LPIPS degrades more than 2x from Low → High (0.186 → 0.379) → thin structures deteriorate sharply
  • However, more Gaussians = higher VRAM usage → memory constraints must be considered in practice

Hyperparameter 3: Number of Training Iterations (iterations)

Objective

  • Does longer training improve quality, as with NeRF?
  • At what point does the model converge?

Parameters:

# 7k
python train.py -s data/mipnerf360/bonsai --eval --iterations 7000 --densify_until_iter 5000 -m .../bonsai/iter7k

# 30k (default)
python train.py -s data/mipnerf360/bonsai --eval --iterations 30000 -m .../bonsai/iter30k

# 50k
python train.py -s data/mipnerf360/bonsai --eval --iterations 50000 --densify_until_iter 25000 --test_iterations 7000 30000 50000 --save_iterations 7000 30000 50000 -m .../bonsai/iter50k

Results

| Iterations | SSIM | PSNR | LPIPS | PSNR Gain |
| --- | --- | --- | --- | --- |
| 7k | 0.9323797 | 30.5002346 | 0.2045038 | - |
| 30k | 0.9468836 | 32.3352585 | 0.1800254 | +1.84 |
| 50k | 0.9468449 | 32.6032143 | 0.1798774 | +0.26 |

| Image Index | Description | Image |
| --- | --- | --- |
| 5 | Iteration comparison (7k, 30k, 50k) with GT | (image) |
| 10 | Iteration comparison (7k, 30k, 50k) with GT | (image) |
| 15 | Iteration comparison (7k, 30k, 50k) with GT | (image) |

Observations

  • 7k → 30k: PSNR +1.84 improvement. Leaf and stem sharpness clearly improved
  • 30k → 50k: PSNR +0.26, SSIM nearly identical (0.9468) → convergence confirmed
  • Conclusion: 3DGS effectively converges at ~30k iterations; training out to 50k yields diminishing returns

Summary

| Experiment | Key Variable | Conclusion |
| --- | --- | --- |
| Spherical Harmonics Degree | View-dependent color | Higher degree improves PSNR/SSIM and reduces LPIPS. Re-run with unified iterations for fair comparison |
| Densify Threshold | Gaussian count / thin structures | Lower threshold = better quality. VRAM trade-off exists |
| Iterations | Training time vs quality | Converges at ~30k; training further is inefficient |

Resources

  • Repo History: https://github.com/sjang1594/3dgs-experiments/tree/main/experiments