Geometric Dilution • SMOTE in High Dimensions

Understand why SMOTE struggles as features grow. Explore the geometry, see the data, and get clear, evidence-based recommendations.

// SMOTE Playground

Move the sliders. We simulate how SMOTE creates synthetic points between neighbors. In high dimensions, those points cover less of the space—this is geometric dilution.

• Coverage: how much of the space synthetic points can realistically reach.
• Dilution: higher means less meaningful coverage.
• Neighbor Reach: average distance to nearest neighbors.
• Samples: minority / synthetic.
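To make the Neighbor Reach metric concrete, here is a minimal sketch that averages each point's distance to its k nearest neighbors. The exact definition used by the playground is not given, so the mean over k neighbors, the uniform test data, and the helper names `neighbor_reach` and `uniform_points` are assumptions for illustration.

```python
import math
import random


def uniform_points(n, d, seed=0):
    """Hypothetical test data: n points drawn uniformly from the unit cube in d dimensions."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(d)] for _ in range(n)]


def neighbor_reach(points, k=5):
    """Mean distance from each point to its k nearest neighbors (brute force)."""
    if len(points) <= k:
        raise ValueError("need more points than neighbors")
    total = 0.0
    for i, x in enumerate(points):
        # Distances from x to every other point, smallest first.
        dists = sorted(math.dist(x, p) for j, p in enumerate(points) if j != i)
        total += sum(dists[:k]) / k
    return total / len(points)


if __name__ == "__main__":
    for d in (2, 6, 20):
        reach = neighbor_reach(uniform_points(120, d), k=5)
        print(f"d={d:>2}  neighbor reach={reach:.3f}")
```

With the sample count held fixed, the reach grows as d increases: even the nearest neighbors end up far apart, so interpolating between them spans long, sparse segments.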
Coverage formula: CR(d) ≈ ε^(d − 1) with ε ≈ 0.842. As d grows, CR shrinks fast.
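The formula above can be evaluated directly. This is a small sketch that takes ε ≈ 0.842 from the text; the function name `coverage_ratio` and the chosen dimensions are illustrative.

```python
EPSILON = 0.842  # per-dimension factor quoted in the coverage formula


def coverage_ratio(d, epsilon=EPSILON):
    """CR(d) ≈ epsilon ** (d - 1); equals 1.0 at d = 1 and decays geometrically."""
    if d < 1:
        raise ValueError("d must be at least 1")
    return epsilon ** (d - 1)


if __name__ == "__main__":
    for d in (1, 2, 6, 10, 15, 30):
        print(f"d={d:>2}  CR={coverage_ratio(d):.4f}")
```

Because each added dimension multiplies coverage by the same factor, the decay is exponential in d rather than linear.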

// Coverage vs Dimensions

The curve drops quickly. That’s the core reason SMOTE can underperform in high-d data.

SMOTE: x_new = x + t(x_nbr − x), with t drawn uniformly from [0, 1].
For d > 3 we show only the first 2 features.
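The interpolation rule can be sketched in a few lines of standard-library Python. This is not the playground's implementation: the brute-force neighbor search, the function name `smote_sample`, and its parameters are assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=5, n_synthetic=10, seed=0):
    """Create synthetic points via x_new = x + t * (x_nbr - x)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, found by brute force.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_nbr = rng.choice(neighbors)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([xi + t * (ni - xi) for xi, ni in zip(x, x_nbr)])
    return synthetic


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Hypothetical minority class: 20 points in 6 dimensions.
    minority = [[data_rng.random() for _ in range(6)] for _ in range(20)]
    for point in smote_sample(minority, k=5, n_synthetic=3):
        print([round(v, 3) for v in point])
```

Every synthetic point lies on the line segment between two existing minority points, so it can never leave the region those points already span.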

// Data



// Score by Method

// Effect Size (SMOTE vs Baseline)

// Training Time

// Inputs

Tell us about your dataset. We’ll recommend when SMOTE helps—and when to skip it.

Inputs: number of features (d) and imbalance ratio (IR); the defaults shown are 10 and 10:1.
Outputs: Expected Coverage, Risk Level, Confidence.

// Recommendation


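    Heuristic, where d is the number of features and IR is the imbalance ratio (majority:minority): use SMOTE confidently when d ≤ 5 and IR ≥ 5; use it cautiously in any other case with d ≤ 15; avoid it when d > 15 (try RandomOverSampler or class weights instead).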
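The heuristic reads naturally as a small decision function. The sketch below assumes that a dataset with d ≤ 5 but IR below 5 falls into the "cautious" tier, which the rule implies but does not state outright; the function name `recommend` and the returned labels are hypothetical.

```python
def recommend(d, imbalance_ratio):
    """Map feature count d and imbalance ratio IR to a SMOTE recommendation tier."""
    if d > 15:
        # High-dimensional: prefer RandomOverSampler or class weights.
        return "avoid"
    if d <= 5 and imbalance_ratio >= 5:
        return "confident"
    # Assumption: everything else with d <= 15 is treated as the cautious tier.
    return "cautious"


if __name__ == "__main__":
    for d, ir in [(3, 10), (10, 10), (20, 10), (3, 2)]:
        print(f"d={d:>2}, IR={ir:>2}:1 -> {recommend(d, ir)}")
```

Checking the d > 15 case first keeps the remaining branches simple, since both lower tiers then only need to distinguish the d ≤ 5 and IR ≥ 5 condition.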

    // Summary

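    SMOTE makes new points between nearby minority samples. In low dimensions, that works well. As dimensions grow, the volume of the feature space grows exponentially, while every synthetic point still lies on a one-dimensional line segment between two existing samples, so the interpolations cover a shrinking fraction of the space. That's geometric dilution, and it explains why SMOTE often stops improving scores, or lowers them, on high-dimensional datasets.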

    • 6 oversampling methods • 5 datasets • 5+ classifiers
    • F1 primary; ROC-AUC, Precision, Recall, BalAcc secondary
    • Holm-Bonferroni corrections • Cohen’s d effect sizes
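The two statistical tools named in the list can be sketched with the standard library. This is a textbook version of Cohen's d (pooled standard deviation) and the Holm step-down adjustment, not the repository's code; the function names and all numbers in the usage section are hypothetical.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is multiplied by (m - rank), capped at 1,
        # and kept monotone so adjusted values never decrease with rank.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-fold F1 scores, for illustration only.
    smote_f1 = [0.71, 0.74, 0.69, 0.73, 0.72]
    baseline_f1 = [0.68, 0.70, 0.69, 0.71, 0.67]
    print("Cohen's d:", round(cohens_d(smote_f1, baseline_f1), 3))
    print("Holm-adjusted:", holm_adjust([0.01, 0.04, 0.03]))
```

A comparison counts as significant at level α when its adjusted p-value is at most α. Note that `cohens_d` divides by the pooled standard deviation, so it fails if both samples have zero variance.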

    Repo: smote-geometric-analysis

    Try the Explorer first

    Move the Dimensions slider and watch Coverage shrink. That drop is why SMOTE struggles as features grow.