
Auto-Tuner Anatomy ⑤: Statistical Validation — AUC, K-fold CV, and Prior Recommendation

Dissecting the tools that validate whether Auto-Tuner recommendations are overfitted or genuinely significant. ROC AUC (Mann-Whitney U), K-fold cross-validation, and Prior α/β estimation (method of moments, MLE) explained with formulas and code.

In the previous part: ④ Threshold · k Anatomy, we covered the optimization of T and k. This part dissects the three tools that answer: "Are the recommended values actually good?"



1. Why Validation Matters

1.1 The Risk of Overfitting

If you optimize Impact on 13 Won/Lost data points, you find a value that fits those 13 perfectly. But when the 14th project comes in, will that value still be optimal?

This is overfitting. The system has memorized the data rather than learned from it.

Auto-Tuner diagnoses this risk with three tools:

| Tool | Question | Answer |
| --- | --- | --- |
| ROC AUC | "How well are Won and Lost distinguished?" | Score from 0 to 1 |
| K-fold CV | "Will this model work on new data?" | Overfitting risk level |
| Prior Recommendation | "Are the initial α, β reasonable?" | Data-driven Prior |


2. ROC AUC — A Comprehensive Discriminative Metric

2.1 What is AUC?

ROC AUC = "If you randomly pick one Won project and one Lost project, the probability that Won's P(Win) is higher"

This is a more robust metric than Separation (= difference of means). Why?

  • Separation: Compares means only → Can be high even if distributions overlap
  • AUC: Compares all pairs → Must decrease when there is significant overlap

2.2 Calculation: Mann-Whitney U

\text{AUC} = \frac{U}{n_{\text{Won}} \times n_{\text{Lost}}}

where U is the Mann-Whitney U statistic:

U = (number of pairs where Won P(Win) > Lost P(Win)) + 0.5 × (number of tied pairs)

2.3 Code Implementation

def calculate_auc
  won_p = last_p_wins(@won_projects)    # Final P(Win) of each Won project
  lost_p = last_p_wins(@lost_projects)  # Final P(Win) of each Lost project
  return nil if won_p.empty? || lost_p.empty?

  # Mann-Whitney U test
  u = 0.0
  won_p.each do |w|
    lost_p.each do |l|
      if w > l
        u += 1.0
      elsif w == l
        u += 0.5
      end
    end
  end

  auc = u / (won_p.size * lost_p.size)
  # ...
end

Complexity: O(n_W \times n_L). With 50 Won and 50 Lost, that's 2,500 comparisons. Instantaneous.
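The method above reads class state (@won_projects, last_p_wins). The same pairwise AUC computation can be sketched as a self-contained function, using made-up P(Win) samples for illustration:

```ruby
# Standalone Mann-Whitney AUC sketch; the sample values are hypothetical.
def auc(won_p, lost_p)
  return nil if won_p.empty? || lost_p.empty?

  u = 0.0
  won_p.each do |w|
    lost_p.each do |l|
      if w > l
        u += 1.0          # correctly ordered pair
      elsif w == l
        u += 0.5          # tied pair counts half
      end
    end
  end
  u / (won_p.size * lost_p.size)
end

won  = [0.80, 0.70, 0.65, 0.40]  # final P(Win) of Won projects (hypothetical)
lost = [0.55, 0.35, 0.30, 0.20]  # final P(Win) of Lost projects (hypothetical)
puts auc(won, lost)  # 15 of 16 pairs correctly ordered → 0.9375
```

Note that one misordered pair (the Won project at 0.40 versus the Lost project at 0.55) is enough to pull the score below 1.0, which is exactly the overlap sensitivity that Separation misses.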

2.4 Interpretation Scale

| AUC | Grade | Interpretation |
| --- | --- | --- |
| ≥ 0.90 | excellent | Near-perfect discrimination |
| 0.80 ~ 0.89 | good | Strong discriminative power |
| 0.70 ~ 0.79 | fair | Acceptable |
| 0.60 ~ 0.69 | poor | Needs improvement |
| < 0.60 | fail | Close to random |

2.5 What AUC = 0.50 Means

AUC = 0.50 is like "flipping a coin." The P(Win) distributions of Won and Lost completely overlap. In this state, optimizing parameters is meaningless — the data itself contains no discriminative information.



3. K-fold Cross-Validation — Detecting Overfitting

3.1 Principle

"Divide the data into K pieces, train on K-1, and validate on 1. Repeat K times."

Fold 1: [Train][Train][Train][Train][Test]
Fold 2: [Train][Train][Train][Test][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Test][Train][Train][Train]
Fold 5: [Test][Train][Train][Train][Train]

3.2 Auto-Tuner Implementation

CV_FOLDS = 5

def cross_validate
  all_projects = @won_projects + @lost_projects
  return nil if all_projects.size < CV_FOLDS * 2

  # Shuffle (fixed seed for reproducibility)
  shuffled = all_projects.shuffle(random: Random.new(42))
  fold_size = all_projects.size / CV_FOLDS

  folds_results = []

  CV_FOLDS.times do |i|
    # Test set: i-th fold
    test_start = i * fold_size
    test_end = test_start + fold_size - 1
    test_set = shuffled[test_start..test_end]
    train_set = shuffled - test_set

    # Calculate separation on train set
    train_won = train_set.select { |p| p.project_status == 'won' }
    train_lost = train_set.select { |p| p.project_status == 'lost' }
    next if train_won.empty? || train_lost.empty?

    train_sep = calculate_separation_for(train_won, train_lost)

    # Calculate separation on test set
    test_won = test_set.select { |p| p.project_status == 'won' }
    test_lost = test_set.select { |p| p.project_status == 'lost' }
    next if test_won.empty? || test_lost.empty?

    test_sep = calculate_separation_for(test_won, test_lost)

    folds_results << { fold: i + 1, train_sep: train_sep, test_sep: test_sep }
  end
end
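The excerpt above collects folds_results but stops before aggregation. A minimal sketch of how the gap and its risk level could be derived from that array (the helper name overfitting_gap is hypothetical, not the library's actual method):

```ruby
# Aggregate per-fold separations into the train/test gap; thresholds follow
# the article's interpretation table.
def overfitting_gap(folds_results)
  return nil if folds_results.empty?

  train_mean = folds_results.sum { |f| f[:train_sep] } / folds_results.size
  test_mean  = folds_results.sum { |f| f[:test_sep] }  / folds_results.size
  gap = train_mean - test_mean

  risk =
    if gap < 0.05 then 'low'
    elsif gap <= 0.15 then 'medium'
    else 'high'
    end

  { gap: gap.round(3), risk: risk }
end

# Hypothetical fold results for illustration
folds = [
  { fold: 1, train_sep: 0.40, test_sep: 0.35 },
  { fold: 2, train_sep: 0.42, test_sep: 0.30 }
]
p overfitting_gap(folds)
```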

3.3 Overfitting Judgment: Gap

\text{gap} = \overline{\text{train\_sep}} - \overline{\text{test\_sep}}

| Gap Range | Risk | Interpretation |
| --- | --- | --- |
| < 0.05 | low | No overfitting ✅ |
| 0.05 ~ 0.15 | medium | Caution needed ⚠️ |
| > 0.15 | high | Severe overfitting 🚨 |

gap = 0 means identical performance on train and test sets. This model should work well on new data.

gap = 0.30 means separation of 0.40 on train but only 0.10 on test. That's memorization, not learning.

3.4 CV and Phase Relationship

In Phase 2 (10–13 projects), 5-fold CV gives each fold only 2–3 test data points. CV results at this scale are unstable. But it's better than nothing — at minimum, it can detect "if overfitting is severe."

In Phase 4+ (40+ projects), each fold has 8+ test data points, making CV results stable.



4. Prior α/β Recommendation — Empirical Bayes

4.1 The Problem

All projects start with α=1, β=1 (uniform Prior). But with accumulated data, you can set a more informative Prior.

Example: If 60 out of 100 past projects were Won, a new project's Prior could start at α=1.2, β=0.8 (reflecting the prior information that "win probability is 60%").

4.2 Method of Moments

Estimate α, β from the distribution of Won projects' final P(Win) values:

\bar{p} = \frac{\alpha}{\alpha + \beta}, \quad s^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

Solving inversely:

\alpha = \bar{p} \left( \frac{\bar{p}(1-\bar{p})}{s^2} - 1 \right), \quad \beta = (1-\bar{p}) \left( \frac{\bar{p}(1-\bar{p})}{s^2} - 1 \right)

def recommend_prior
  all_p = last_p_wins(@won_projects + @lost_projects)
  return nil if all_p.size < 10

  mean_p = all_p.sum / all_p.size
  var_p = all_p.map { |p| (p - mean_p) ** 2 }.sum / (all_p.size - 1)

  return nil if var_p <= 0 || mean_p * (1 - mean_p) <= var_p

  common = (mean_p * (1 - mean_p) / var_p) - 1
  alpha_rec = mean_p * common
  beta_rec = (1 - mean_p) * common

  # Safety range clamping
  alpha_rec = [[alpha_rec, 0.5].max, 5.0].min
  beta_rec = [[beta_rec, 0.5].max, 5.0].min

  { alpha: alpha_rec.round(2), beta: beta_rec.round(2),
    ci_95: calculate_prior_ci(alpha_rec, beta_rec) }
end
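A worked method-of-moments example helps make the inverse formulas concrete. The P(Win) values below are hypothetical; they are spread widely enough that the resulting α and β land inside the safety range without clamping:

```ruby
# Method-of-moments estimation on hypothetical final P(Win) values.
p_vals = [0.9, 0.3, 0.7, 0.2, 0.8, 0.5, 0.95, 0.35, 0.6, 0.4]

mean_p = p_vals.sum / p_vals.size                                  # sample mean p̄
var_p  = p_vals.map { |p| (p - mean_p)**2 }.sum / (p_vals.size - 1) # sample variance s²

common = (mean_p * (1 - mean_p) / var_p) - 1  # shared factor p̄(1-p̄)/s² - 1
alpha  = mean_p * common
beta   = (1 - mean_p) * common

puts alpha.round(2)  # ≈ 1.47
puts beta.round(2)   # ≈ 1.11
```

Sanity check: the implied Beta mean α/(α+β) equals the sample mean p̄ by construction, so the recommended Prior is centered on the observed win rate.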

4.3 Safety Range

Recommended Prior is clamped to the [0.5, 5.0] range:

  • α, β < 0.5: Prior is too extreme (nearly 0% or 100%)
  • α, β > 5.0: Prior is too strong, making it difficult for data to overcome it

A strong Prior slows learning. With α=5, β=5 (total 10), the first signal (Impact=1.0) barely changes P(Win). With a uniform Prior (α=1, β=1), a same-strength signal changes P(Win) from 0.50 → 0.67, but with α=5, β=5, it only changes from 0.50 → 0.55.
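The 0.50 → 0.67 and 0.50 → 0.55 numbers can be checked directly, assuming (as the example implies) that one full-strength signal (Impact = 1.0) adds a single pseudo-count to α:

```ruby
# Posterior mean of Beta(α, β) after one positive pseudo-observation.
# Assumption: Impact = 1.0 contributes exactly +1 to α.
def posterior_mean_after_win(alpha, beta)
  (alpha + 1.0) / (alpha + beta + 1.0)
end

puts posterior_mean_after_win(1, 1).round(2)  # uniform prior:  0.67
puts posterior_mean_after_win(5, 5).round(2)  # strong prior:   0.55
```

The denominator α + β acts as the Prior's effective sample size: at 10 pseudo-observations, one real signal moves the estimate only a tenth as far as it would against a uniform Prior.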



5. Three Tools Working Together

  • AUC = 0.85 (good) → "Discriminative power is good"
  • CV gap = 0.02 (low) → "No overfitting either"
  • Prior α=1.2, β=0.9 → "Starting point is reasonable"

Conclusion: This Auto-Tuner recommendation can be trusted ✅

  • AUC = 0.62 (poor) → "Discriminative power is low"
  • CV gap = 0.20 (high) → "Severe overfitting"
  • Prior α=1.0, β=1.0 → "Prior adjustment needed"

Conclusion: Do not apply these recommendations ❌ More data needs to accumulate

Combining these three metrics, the Auto-Tuner assigns an overall Grade (A–D). At Grade D, a message stating "We do not recommend making adjustments" is displayed.



Next: ⑥ MCMC Posterior Anatomy — Emcee Ensemble MCMC, probability model definition, HDI, convergence diagnostics.