Auto-Tuner Anatomy ⑤: Statistical Validation — AUC, K-fold CV, and Prior Recommendation
Dissecting the tools that validate whether Auto-Tuner recommendations are overfitted or genuinely significant. ROC AUC (Mann-Whitney U), K-fold cross-validation, and Prior α/β estimation (method of moments, MLE) explained with formulas and code.
In the previous part, ④ Threshold · k Anatomy, we covered the optimization of T and k. This part dissects the three tools that answer one question: "Are the recommended values actually good?"
1. Why Validation Matters
1.1 The Risk of Overfitting
If you optimize Impact on 13 Won/Lost data points, you find a value that fits those 13 perfectly. But when the 14th project comes in, will that value still be optimal?
This is overfitting. The system has memorized the data rather than learned from it.
Auto-Tuner diagnoses this risk with three tools:
| Tool | Question | Answer |
|---|---|---|
| ROC AUC | "How well are Won and Lost distinguished?" | Score from 0 to 1 |
| K-fold CV | "Will this model work on new data?" | Overfitting risk level |
| Prior Recommendation | "Are the initial α, β reasonable?" | Data-driven Prior |
2. ROC AUC — A Comprehensive Discriminative Metric
2.1 What is AUC?
ROC AUC = "If you randomly pick one Won project and one Lost project, the probability that Won's P(Win) is higher"
This is a more robust metric than Separation (= difference of means). Why?
- Separation: Compares means only → Can be high even if distributions overlap
- AUC: Compares all pairs → Must decrease when there is significant overlap
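This contrast can be sketched with toy numbers. Everything below is hypothetical: `separation` and `auc` are minimal stand-ins for the Auto-Tuner's internal helpers, and the P(Win) arrays are invented.

```ruby
# Two invented datasets with the SAME difference of means (≈ 0.40)
# but very different overlap between the Won and Lost distributions.

def separation(won, lost)
  won.sum / won.size - lost.sum / lost.size
end

def auc(won, lost)
  u = 0.0
  won.each do |w|
    lost.each do |l|
      if w > l
        u += 1.0
      elsif w == l
        u += 0.5
      end
    end
  end
  u / (won.size * lost.size)
end

won_a, lost_a = [0.70, 0.75, 0.80], [0.30, 0.35, 0.40] # clean split
won_b, lost_b = [0.35, 0.90, 1.00], [0.05, 0.40, 0.60] # heavy overlap

separation(won_a, lost_a).round(2) # => 0.4
separation(won_b, lost_b).round(2) # => 0.4, means can't tell these apart
auc(won_a, lost_a)                 # => 1.0, perfect discrimination
auc(won_b, lost_b).round(2)        # => 0.78, AUC exposes the overlap
```

Both datasets have the same Separation, but only the AUC reveals that the second one barely discriminates.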
2.2 Calculation: Mann-Whitney U
The AUC is computed from the Mann-Whitney U statistic:

AUC = U / (n_Won × n_Lost)

where

U = (number of pairs where Won P(Win) > Lost P(Win)) + 0.5 × (number of tied pairs)
2.3 Code Implementation
```ruby
def calculate_auc
  won_p  = last_p_wins(@won_projects)   # Final P(Win) of each Won project
  lost_p = last_p_wins(@lost_projects)  # Final P(Win) of each Lost project
  return nil if won_p.empty? || lost_p.empty?

  # Mann-Whitney U: count pairs where a Won project outranks a Lost one
  u = 0.0
  won_p.each do |w|
    lost_p.each do |l|
      if w > l
        u += 1.0
      elsif w == l
        u += 0.5
      end
    end
  end

  auc = u / (won_p.size * lost_p.size)
  # ...
end
```
Complexity: O(n_Won × n_Lost). With 50 Won and 50 Lost, that's 2,500 comparisons. Instantaneous.
2.4 Interpretation Scale
| AUC | Grade | Interpretation |
|---|---|---|
| ≥ 0.90 | excellent | Near-perfect discrimination |
| 0.80 ~ 0.89 | good | Strong discriminative power |
| 0.70 ~ 0.79 | fair | Acceptable |
| 0.60 ~ 0.69 | poor | Needs improvement |
| < 0.60 | fail | Close to random |
2.5 What AUC = 0.50 Means
AUC = 0.50 is like "flipping a coin." The P(Win) distributions of Won and Lost completely overlap. In this state, optimizing parameters is meaningless — the data itself contains no discriminative information.
3. K-fold Cross-Validation — Detecting Overfitting
3.1 Principle
"Divide the data into K pieces, train on K-1, and validate on 1. Repeat K times."
```
Fold 1: [Train][Train][Train][Train][Test]
Fold 2: [Train][Train][Train][Test][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Test][Train][Train][Train]
Fold 5: [Test][Train][Train][Train][Train]
```
3.2 Auto-Tuner Implementation
```ruby
CV_FOLDS = 5

def cross_validate
  all_projects = @won_projects + @lost_projects
  return nil if all_projects.size < CV_FOLDS * 2

  # Shuffle with a fixed seed for reproducibility
  shuffled = all_projects.shuffle(random: Random.new(42))
  fold_size = all_projects.size / CV_FOLDS
  folds_results = []

  CV_FOLDS.times do |i|
    # Test set: the i-th fold; train set: everything else
    test_start = i * fold_size
    test_end   = test_start + fold_size - 1
    test_set   = shuffled[test_start..test_end]
    train_set  = shuffled - test_set

    # Separation on the train set
    train_won  = train_set.select { |p| p.project_status == 'won' }
    train_lost = train_set.select { |p| p.project_status == 'lost' }
    next if train_won.empty? || train_lost.empty?
    train_sep = calculate_separation_for(train_won, train_lost)

    # Separation on the test set
    test_won  = test_set.select { |p| p.project_status == 'won' }
    test_lost = test_set.select { |p| p.project_status == 'lost' }
    next if test_won.empty? || test_lost.empty?
    test_sep = calculate_separation_for(test_won, test_lost)

    folds_results << { fold: i + 1, train_sep: train_sep, test_sep: test_sep }
  end

  folds_results
end
```
3.3 Overfitting Judgment: Gap
| Gap Range | Risk | Interpretation |
|---|---|---|
| < 0.05 | low | No overfitting ✅ |
| 0.05 ~ 0.15 | medium | Caution needed ⚠️ |
| > 0.15 | high | Severe overfitting 🚨 |
gap = 0 means identical performance on train and test sets. This model should work well on new data.
gap = 0.30 means separation of 0.40 on train but only 0.10 on test. That's memorization, not learning.
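The gap computation itself is a small aggregation over the fold results. A minimal sketch, assuming the `folds_results` hash shape built in `cross_validate` (`overfitting_risk` is a hypothetical helper name; the thresholds follow the table above):

```ruby
# Average the train/test separation gap across folds and map it
# to a risk level using the thresholds from the table above.
def overfitting_risk(folds_results)
  gaps = folds_results.map { |f| f[:train_sep] - f[:test_sep] }
  mean_gap = gaps.sum / gaps.size
  risk =
    if mean_gap < 0.05
      'low'
    elsif mean_gap <= 0.15
      'medium'
    else
      'high'
    end
  { gap: mean_gap.round(3), risk: risk }
end

folds = [
  { fold: 1, train_sep: 0.40, test_sep: 0.38 },
  { fold: 2, train_sep: 0.42, test_sep: 0.39 },
  { fold: 3, train_sep: 0.41, test_sep: 0.40 }
]
overfitting_risk(folds) # => { gap: 0.02, risk: "low" }
```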
3.4 CV and Phase Relationship
In Phase 2 (10–13 projects), 5-fold CV gives each fold only 2–3 test data points. CV results at this scale are unstable. But it's better than nothing — at minimum, it can detect "if overfitting is severe."
In Phase 4+ (40+ projects), each fold has 8+ test data points, making CV results stable.
4. Prior α/β Recommendation — Empirical Bayes
4.1 The Problem
All projects start with α=1, β=1 (uniform Prior). But with accumulated data, you can set a more informative Prior.
Example: If 60 out of 100 past projects were Won, a new project's Prior could start at α=1.2, β=0.8 (reflecting the prior information that "win probability is 60%").
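That conversion is just the historical win rate rescaled onto a chosen total pseudo-count. A minimal sketch (the helper name and the `strength:` parameter are illustrative; `strength: 2.0` keeps the Prior as weak as the uniform α=1, β=1):

```ruby
# Rescale a historical win rate into a weak Beta Prior.
# strength = total pseudo-count (α + β); 2.0 matches the weight
# of the uniform Prior, so incoming data can still dominate quickly.
def prior_from_win_rate(win_rate, strength: 2.0)
  { alpha: (win_rate * strength).round(2),
    beta:  ((1 - win_rate) * strength).round(2) }
end

prior_from_win_rate(0.6) # => { alpha: 1.2, beta: 0.8 }
```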
4.2 Method of Moments
Estimate α and β from the distribution of final P(Win) values (the code below pools Won and Lost projects). For a Beta(α, β) distribution:

mean = α / (α + β)
variance = α × β / ((α + β)² × (α + β + 1))

Solving inversely for α and β given the sample mean and variance:

common = mean × (1 − mean) / variance − 1
α = mean × common
β = (1 − mean) × common
```ruby
def recommend_prior
  all_p = last_p_wins(@won_projects + @lost_projects)
  return nil if all_p.size < 10

  mean_p = all_p.sum / all_p.size
  var_p  = all_p.map { |p| (p - mean_p) ** 2 }.sum / (all_p.size - 1)
  # Method of moments requires 0 < variance < mean × (1 − mean)
  return nil if var_p <= 0 || mean_p * (1 - mean_p) <= var_p

  common    = (mean_p * (1 - mean_p) / var_p) - 1
  alpha_rec = mean_p * common
  beta_rec  = (1 - mean_p) * common

  # Clamp to the safety range [0.5, 5.0]
  alpha_rec = alpha_rec.clamp(0.5, 5.0)
  beta_rec  = beta_rec.clamp(0.5, 5.0)

  { alpha: alpha_rec.round(2), beta: beta_rec.round(2),
    ci_95: calculate_prior_ci(alpha_rec, beta_rec) }
end
```
4.3 Safety Range
Recommended Prior is clamped to the [0.5, 5.0] range:
- α, β < 0.5: Prior is too extreme (nearly 0% or 100%)
- α, β > 5.0: Prior is too strong, making it difficult for data to overcome it
A strong Prior slows learning. With a uniform Prior (α=1, β=1), a single signal of Impact = 1.0 moves P(Win) from 0.50 to 0.67; with α=5, β=5 (10 pseudo-counts in total), the same signal only moves it from 0.50 to 0.55.
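Those numbers are easy to verify if the signal is modeled as one pseudo-observation added to α (a deliberate simplification for illustration; `posterior_mean` is not an Auto-Tuner function):

```ruby
# Posterior mean of a Beta Prior after one positive pseudo-observation.
def posterior_mean(alpha, beta, impact)
  (alpha + impact) / (alpha + beta + impact)
end

posterior_mean(1.0, 1.0, 1.0).round(2) # => 0.67 (uniform Prior reacts fast)
posterior_mean(5.0, 5.0, 1.0).round(2) # => 0.55 (strong Prior damps the signal)
```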
5. Three Tools Working Together
- AUC = 0.85 (good) → "Discriminative power is good"
- CV gap = 0.02 (low) → "No overfitting either"
- Prior α=1.2, β=0.9 → "Starting point is reasonable"

→ Conclusion: this Auto-Tuner recommendation can be trusted ✅

- AUC = 0.62 (poor) → "Discriminative power is low"
- CV gap = 0.20 (high) → "Severe overfitting"
- Prior α=1.0, β=1.0 → "Prior adjustment needed"

→ Conclusion: do not apply these recommendations ❌ More data needs to accumulate.
Combining these three metrics, the Auto-Tuner assigns an overall Grade (A–D). At Grade D, a message stating "We do not recommend making adjustments" is displayed.
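The exact grading rules are internal to the Auto-Tuner; the sketch below shows one illustrative way such a combination could look. The thresholds and grade boundaries are assumptions, not the actual implementation.

```ruby
# Map AUC and CV gap to a single Grade. Boundaries are invented
# to mirror the two scenarios above: (0.85, 0.02) → A, (0.62, 0.20) → D.
def overall_grade(auc:, cv_gap:)
  return 'D' if auc < 0.60 || cv_gap > 0.15 # fail / severe overfitting
  return 'C' if auc < 0.70 || cv_gap > 0.05
  return 'B' if auc < 0.80
  'A'
end

overall_grade(auc: 0.85, cv_gap: 0.02) # => "A"
overall_grade(auc: 0.62, cv_gap: 0.20) # => "D"
```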
Next: ⑥ MCMC Posterior Anatomy — Emcee Ensemble MCMC, probability model definition, HDI, convergence diagnostics.