Benchmark Results

Data: Qwen3 4B layer 0 — W (2560, 9728), X (244449, 9728)
Device: CUDA (gpu:1)
Loss metric: ||X(W_pruned - W0)^T||_F / ||X W0^T||_F (relative output Frobenius norm)
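
As a minimal sketch, the loss metric above can be computed as follows, using NumPy on toy-sized arrays. The function names and the Gram-matrix variant are illustrative assumptions, not the benchmark's own code. The second form only needs H = X^T X, which helps when X has hundreds of thousands of rows.

```python
import numpy as np

def normalized_output_loss(X, W_pruned, W0):
    """Relative output error ||X (W_pruned - W0)^T||_F / ||X W0^T||_F."""
    err = X @ (W_pruned - W0).T   # (N, M) change in the layer output
    ref = X @ W0.T                # (N, M) dense layer output
    return np.linalg.norm(err) / np.linalg.norm(ref)

def normalized_output_loss_gram(H, W_pruned, W0):
    """Same quantity from H = X^T X, using ||X D^T||_F^2 = trace(D H D^T)."""
    D = W_pruned - W0
    num = np.sum((D @ H) * D)
    den = np.sum((W0 @ H) * W0)
    return np.sqrt(num / den)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))   # toy stand-in for the (244449, 9728) activations
W0 = rng.standard_normal((8, 16))   # toy stand-in for the (2560, 9728) weights
W_half = W0.copy()
W_half[:, ::2] = 0.0                # crude 50% pruning, for illustration only

loss_dense = normalized_output_loss(X, W0, W0)                 # nothing pruned
loss_zero = normalized_output_loss(X, np.zeros_like(W0), W0)   # everything pruned
loss_half = normalized_output_loss(X, W_half, W0)
loss_half_gram = normalized_output_loss_gram(X.T @ X, W_half, W0)
```

An unpruned matrix scores 0 and an all-zero matrix scores 1, so the percentages in the tables read as the fraction of the dense output norm lost to pruning.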

1. 2:4 Structured Sparsity

block_shape=(1,1), scope_shape=(1,4), keep 2 of 4 contiguous columns. All 2560 rows.
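
To make the 2:4 pattern concrete, here is a small sketch that keeps the 2 largest-magnitude entries in every group of 4 contiguous columns. Magnitude selection is used only to illustrate the mask layout; it is not one of the benchmarked OBS methods, and the function name is hypothetical.

```python
import numpy as np

def magnitude_2of4_mask(W):
    """Boolean mask keeping the 2 largest |w| in each group of 4 contiguous columns."""
    M, K = W.shape
    assert K % 4 == 0, "K must be a multiple of the scope width 4"
    groups = np.abs(W).reshape(M, K // 4, 4)
    # argsort is ascending, so the last two indices of each group are the largest.
    keep = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(M, K)

W = np.array([[0.1, -3.0, 2.0, 0.5, 4.0, 0.0, -0.2, 1.0]])
mask = magnitude_2of4_mask(W)
W_pruned = np.where(mask, W, 0.0)
```

Every scope of 4 columns ends up with exactly 2 nonzeros, which gives the 50.0% sparsity reported for all methods below.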

Method                                   Norm. Loss   Sparsity   Time
--------------------------------------   ----------   --------   -----
OBS local                                20.66%       50.0%      0.1s
OBS full (frozen C)                      15.42%       50.0%      3.9s
OBS interleaved=8                        14.35%       50.0%      2.9s
OBS interleaved=16                       14.24%       50.0%      3.3s
OBS interleaved=64                       14.16%       50.0%      6.0s
SparseGPT                                14.12%       50.0%      1.1s
OBS-ord ng=256 (shared C, Schur, fp16)   13.39%       50.0%      14.8s
True OBS ng=256 L2R                      12.09%       50.0%      75s
True OBS ng=256 largest-first            11.87%       50.0%      105s

Key comparisons:

  • True OBS largest-first beats SparseGPT by 16.0% — per-row C with Schur updates, largest-cost blocks first

  • OBS-ord (shared C, Schur, fp16) beats SparseGPT by 5.2% in 14.8s — good speed/quality tradeoff

  • Largest-first ordering improves True OBS by 1.8% over left-to-right (11.87% vs 12.09%)

  • OBS interleaved=64 nearly matches SparseGPT (−0.3%) at 6s — practical fast alternative

  • Gap from OBS full to SparseGPT closed from −9.2% to −0.3% by interleaved mask re-selection

  • OBS split (not shown) is always worse than OBS full — it doesn’t re-select masks
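
The percentages in the comparisons above appear to be relative loss reductions with the baseline in the denominator. The sketch below reproduces them from the rounded table values under that assumption; the helper name is hypothetical. Recomputing from rounded inputs can differ from the quoted figures in the last digit (for example 15.9 rather than 16.0).

```python
def relative_gain(baseline_loss, method_loss):
    """Relative loss reduction versus a baseline, in percent (positive = better)."""
    return 100.0 * (baseline_loss - method_loss) / baseline_loss

sparsegpt = 14.12  # 2:4, all rows
gains = {
    "True OBS largest-first": relative_gain(sparsegpt, 11.87),
    "OBS-ord": relative_gain(sparsegpt, 13.39),
    "OBS interleaved=64": relative_gain(sparsegpt, 14.16),
    "OBS full": relative_gain(sparsegpt, 15.42),
}
# Largest-first versus left-to-right ordering for True OBS.
ordering_gain = relative_gain(12.09, 11.87)
```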

True OBS (first 32 rows only — O(B×K²) memory)

Method               Norm. Loss   Time
------------------   ----------   -----
OBS full (frozen C)  5.42%        0.3s
OBS interleaved=64   4.09%        3.9s
SparseGPT            4.06%        1.2s
True OBS ng=256      3.42%        1.0s
True OBS ng=1        3.39%        75.9s

  • True OBS ng=256 is within 1% of ng=1 (3.42% vs 3.39%) while taking 1.0s instead of 75.9s: Schur update frequency barely affects quality

  • Quality ranking, best to worst: True OBS, SparseGPT, OBS interleaved, OBS full
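
For readers unfamiliar with the Schur updates mentioned above, here is a minimal sketch of the textbook OBS step for a single row. It assumes C denotes the inverse of H = X^T X, which this report does not define, and it does not model the ng parameter or mask selection. After each removal the remaining weights are compensated and C is downdated by a rank-one Schur complement, so the final row is the exact least-squares optimum on the kept support.

```python
import numpy as np

def obs_prune_row(w, C, prune_idx):
    """Zero the entries in prune_idx one at a time, compensating the others.

    w: (K,) weight row. C: (K, K) inverse of X^T X (an assumed meaning of "C").
    """
    w = w.astype(np.float64).copy()
    C = C.astype(np.float64).copy()
    for q in prune_idx:
        col = C[:, q].copy()
        d = C[q, q]
        w -= (w[q] / d) * col            # OBS compensation, drives w[q] to zero
        C -= np.outer(col, C[q, :]) / d  # Schur downdate: removes index q from C
        w[q] = 0.0                       # remove floating-point residue
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 6))
w = rng.standard_normal(6)
pruned = [1, 4]
kept = [0, 2, 3, 5]

w_obs = obs_prune_row(w, np.linalg.inv(X.T @ X), pruned)
# Reference: best weights on the kept columns for reproducing the dense output X @ w.
w_ref = np.linalg.lstsq(X[:, kept], X @ w, rcond=None)[0]
```

Skipping or batching the downdate is cheaper but leaves C stale for later removals, which is the kind of trade-off the ng comparison above measures.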


2. Coupled 2:4 Sparsity

Pairs of elements 8 columns apart. View (M, K/16, 8, 2):(K, 16, 1, 8). block_shape=(1,1,1,2), scope_shape=(1,1,4,1), keep 2 of 4 pairs. All 2560 rows.
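
The view above can be checked on a toy matrix. This sketch reads the notation as shape:stride measured in elements (an assumption) and uses NumPy's as_strided to confirm that each block of 2 holds columns exactly 8 apart and that the view covers every column once.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

M, K = 2, 32
W = np.arange(M * K, dtype=np.int64).reshape(M, K)   # W[m, k] = m*K + k
s = W.itemsize                                        # NumPy strides are in bytes

# Shape (M, K/16, 8, 2) with element strides (K, 16, 1, 8).
view = as_strided(W, shape=(M, K // 16, 8, 2),
                  strides=(K * s, 16 * s, 1 * s, 8 * s), writeable=False)
cols = view % K        # column index of every element in the view

first_pair = cols[0, 0, 0].tolist()    # columns of the first block
later_pair = cols[1, 1, 3].tolist()    # group 1, pair 3 on the second row
# scope_shape=(1,1,4,1): four consecutive pairs compete, two are kept.
first_scope = cols[0, 0, :4].tolist()
```

Element (m, g, j, p) of the view maps to column 16g + j + 8p, so each 16-column group splits into two scopes of four pairs.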

Method               Norm. Loss   Sparsity   Time
------------------   ----------   --------   ----
OBS local            27.83%       50.0%      0.0s
OBS full             20.43%       50.0%      3.1s
OBS interleaved=16   19.14%       50.0%      2.7s
OBS interleaved=64   19.06%       50.0%      5.4s
SparseGPT            19.01%       50.0%      1.0s
True OBS ng=16       15.75%       50.0%      433s
True OBS ng=64       15.79%       50.0%      397s

Key comparisons:

  • True OBS ng=16 beats SparseGPT by 17.1% — per-row C with Schur updates, largest-first ordering

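  • OBS interleaved=64 nearly matches SparseGPT (−0.3%) in 5.4s. Against a 159.5s SparseGPT run that is about 30x faster, but the table above lists SparseGPT at 1.0s, so the two timings need reconciling

  • OBS interleaved=16 is within 0.7% in 2.7s (59x faster against the same 159.5s figure)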


3. 4:8 Structured Sparsity

block_shape=(1,2), scope_shape=(1,4). 4 blocks of 2 elements per scope, prune 2 blocks. All 2560 rows.

Method               Norm. Loss   Sparsity   Time
------------------   ----------   --------   ----
OBS local            27.76%       50.0%      0.0s
OBS full             20.34%       50.0%      3.1s
OBS interleaved=16   19.11%       50.0%      2.7s
OBS interleaved=64   19.04%       50.0%      5.5s
SparseGPT            19.00%       50.0%      1.1s
True OBS ng=256      16.12%       50.0%      487s
True OBS ng=64       15.97%       50.0%      463s
True OBS ng=16       15.92%       50.0%      558s

Key comparisons:

  • True OBS ng=16 beats SparseGPT by 16.2% — per-row C with Schur updates

  • ng=16 vs ng=256: only 0.2 percentage points apart (15.92% vs 16.12%, just over 1% relative), so ng barely matters

  • OBS interleaved=64 within 0.2% of SparseGPT (5.5s vs 1.1s)

True OBS (first 32 rows only)

Method            Norm. Loss   Time
---------------   ----------   -----
SparseGPT         5.64%        1.1s
True OBS ng=2     4.53%        20.3s
True OBS ng=16    4.54%        5.0s

  • True OBS ng=2 beats SparseGPT by 19.6%


4. 16-Column Block, 8-Row Coupled Sparsity

View(size=(8, 2, K), stride=(K, 8K, 1)) on 16-row chunks. block_shape=(1,1,16), scope_shape=(1,2,1), keep 1 of 2 blocks per scope. 160 chunks of 16 rows. All 2560 rows.
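
The layout above can be sketched on one toy 16-row chunk. The view pairs row i with row i+8, and each scope holds two 16-column blocks, one from each row of the pair. The block choice below uses the L2 norm purely as a placeholder score; the report does not say how its Magnitude baseline scores blocks.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

K = 32                                     # toy width; the benchmark uses K = 9728
rng = np.random.default_rng(0)
chunk = rng.standard_normal((16, K))       # one 16-row chunk of W
s = chunk.itemsize                         # NumPy strides are in bytes

# size=(8, 2, K), stride=(K, 8K, 1): view[i, p, k] == chunk[i + 8*p, k]
view = as_strided(chunk, shape=(8, 2, K),
                  strides=(K * s, 8 * K * s, s), writeable=False)

# block_shape=(1,1,16): split columns into 16-wide blocks and score each block.
blocks = view.reshape(8, 2, K // 16, 16)
scores = np.linalg.norm(blocks, axis=-1)   # (8, 2, K//16)
keep = scores.argmax(axis=1)               # scope_shape=(1,2,1): keep 1 of 2

mask = np.zeros(blocks.shape, dtype=bool)
np.put_along_axis(mask, keep[:, None, :, None], True, axis=1)
# Back to chunk layout: view index (i, p) corresponds to chunk row i + 8*p.
mask_chunk = mask.reshape(8, 2, K).transpose(1, 0, 2).reshape(16, K)
```

In every 16-column block position, exactly one of rows i and i+8 survives, so the chunk is 50% sparse by construction.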

Method            Norm. Loss   Sparsity   Time
---------------   ----------   --------   ------
Magnitude         48.83%       50.0%      0.05s
OBS full block    34.19%       50.0%      9.8s
SparseGPT block   33.46%       50.0%      146.3s
True OBS ng=16    26.91%       50.0%      215.5s

Key comparisons:

  • True OBS ng=16 beats SparseGPT by 19.6% — per-row Schur feasible here (only 608 scopes)

  • True OBS beats OBS full by 21.3%

  • OBS full is 14.9x faster than SparseGPT


Summary

Config                    Best Method                 Norm. Loss   vs SparseGPT   Time
-----------------------   -------------------------   ----------   ------------   ------
2:4 (all rows)            True OBS ng=256 largest     11.87%       +16.0%         105s
2:4 mid (all rows)        OBS-ord ng=256 (shared C)   13.39%       +5.2%          14.8s
2:4 fast (all rows)       OBS interleaved=64          14.16%       −0.3%          6.0s
Coupled 2:4 (all rows)    True OBS ng=16              15.75%       +17.1%         433s
Coupled 2:4 fast          OBS interleaved=64          19.06%       −0.3%          5.4s
4:8 (all rows)            True OBS ng=16              15.92%       +16.2%         558s
4:8 fast (all rows)       OBS interleaved=64          19.04%       −0.2%          5.5s
16-col block (all rows)   True OBS ng=16              26.91%       +19.6%         215.5s

Takeaways:

  • True OBS (per-row C with Schur) beats SparseGPT by 16–20% across all configs — 105s for 2:4, 558s for 4:8, 433s for coupled 2:4

  • OBS-ord (shared C, Schur, fp16, largest-first) beats SparseGPT by 5% in 15s — practical mid-tier option

  • OBS interleaved (shared C, re-select masks per split) matches SparseGPT within 0.2–0.3% at interleaved=64 (5.4–6.0s); interleaved=16 takes about 3s and lands within 0.6–0.9%. This is the practical fast method

  • Key insight: mask re-selection with updated C is what matters. OBS split (fixed masks) always loses to OBS full. OBS interleaved (updated masks) nearly matches SparseGPT.

  • Largest-first block ordering gives about +2% on True OBS (11.87% vs 12.09% on 2:4, at 105s instead of 75s) and +5% on shared-C OBS

  • For coupled 2:4, OBS interleaved matches SparseGPT quality in 5.4s. The 30x speedup figure rests on a 159.5s SparseGPT timing, whereas the coupled 2:4 table lists SparseGPT at 1.0s