Taming Uncertainty with Millions of Simulations — EXAWin Auto-Tuner: A Complete Anatomy
A comprehensive technical essay integrating the Auto-Tuner anatomy series. Covers everything from the philosophical foundations to the mathematical structure, from Grid Search to MCMC, from statistical validation to the human-in-the-loop decision structure — a complete anatomy of the system that converts decades of experience into evidence.
Prologue — One Simple Question
There is only one question a sales manager asks the system:
"Can this deal be won?"
Behind this seemingly simple question lies a system that integrates 6 interdependent components spanning 10 dimensions, running millions of numerical simulations. This article fully dissects that system.
Chapter 1. An Engine Without Likelihood: How Experience Becomes a Formula
1.1 The Most Honest Starting Point
Standard Bayesian inference uses likelihood P(D|θ) to update beliefs. However, sales signals — "the customer asked for a follow-up meeting," "the decision maker attended" — are not observations drawn from any mathematically defined probability distribution.
EXAWin resolves this through the pseudo-count approach: a positive signal adds SWV × Impact virtual successes to α, and a negative signal adds the same quantity of virtual failures to β.
This declares: "This signal possesses evidential weight equivalent to SWV × Impact virtual success observations." Since the Beta distribution is valid for all real numbers α, β > 0, pseudo-counts need not be integers.
This methodology originated in Expert Elicitation (O'Hagan et al., 2006). When an expert assesses "this evidence is equivalent to N direct observations," adding N as a pseudo-count to α or β is a justified methodology.
The key question, then: How do we determine that N? The answer is f-coupling.
1.2 f-Coupling: Aligning Signal and Prior Scales
If Signal Impact and Prior (α₀, β₀, strength S = α₀ + β₀) are set independently, a single signal can overwhelm the company's entire historical experience in one instant — a physically unrealistic scenario.
The solution: define Signal Impact as a fixed fraction of Prior strength — Impact = f × S. The coupling factor f represents: "What percentage of the company's total prior experience does one occurrence of this signal represent as evidence?"
| Signal Type | f | Impact (S=10) | Meaning |
|---|---|---|---|
| Game Changer | 0.50 | 5.0 | Half as strong as total prior experience |
| Strong Affirmation/Negation | 0.10 | 1.0 | 10% of prior experience |
| Moderate (Aff./Neg.) | 0.07 | 0.7 | 7% of prior experience |
| Weak (Aff./Neg.) | 0.04 | 0.4 | 4% of prior experience |
| No Signal | 0.01 | 0.1 | Virtually noise level |
The key property of this coupling: Scale invariance. As long as f is the same, the P(Win) trajectory is completely identical regardless of whether S = 10 or S = 100. This is the raison d'être of coupling — maintaining consistent learning dynamics across any Prior configuration.
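The invariance is easy to verify numerically. A minimal sketch (a hypothetical standalone illustration, not the production engine): starting from a Prior with mean p₀ and strength S, each positive signal adds Impact = f × S to α, and the resulting P(Win) depends only on f.

```ruby
# Scale invariance of f-coupling: with Impact = f * S, the P(Win)
# trajectory depends only on f, never on the Prior strength S.
def p_win_after_signals(prior_mean:, strength:, f_values:)
  alpha = prior_mean * strength          # α₀ = p₀ · S
  beta  = (1.0 - prior_mean) * strength  # β₀ = (1 − p₀) · S
  f_values.each do |f|
    alpha += f * strength                # positive signal: α += Impact = f · S
  end
  alpha / (alpha + beta)                 # posterior mean P(Win)
end

small = p_win_after_signals(prior_mean: 0.3, strength: 10,  f_values: [0.10, 0.04])
large = p_win_after_signals(prior_mean: 0.3, strength: 100, f_values: [0.10, 0.04])
(small - large).abs < 1e-12  # => true: identical trajectories
```

The same Strong Affirmation (f = 0.10) and Weak Affirmation (f = 0.04) move P(Win) by exactly the same amount whether the company configured S = 10 or S = 100.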
1.3 EPR Guardrails
EPR (Evidence-Prior Ratio) is a diagnostic metric measuring how much the maximum evidence from a single meeting can impact the Prior:
| Signal Type | EPR Cap | Max Impact (S=10) |
|---|---|---|
| Game Changer | 2.0 | 7.7 |
| Regular Signals | 0.5 | 1.9 |
If a user sets an Impact value in Signal Master that exceeds the cap, the save is rejected at the code level. This prevents inexperienced users from destabilizing the system with extreme parameters.
Chapter 2. The Auto-Tuner Takes the Stage
2.1 What Is It?
The Auto-Tuner is the system that answers: "Are the currently configured parameters truly optimal?" It analyzes completed Won/Lost project data, simulates by changing each parameter value, and recommends the settings that best separate Won from Lost.
Core metric — Separation:
Separation = P̄(Win)_won − P̄(Win)_lost
where P̄(Win)_won is the average final P(Win) of Won projects, and P̄(Win)_lost is the average final P(Win) of Lost projects. Excellent parameters should produce high P(Win) for Won projects and low P(Win) for Lost projects — naturally yielding large separation.
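As a sketch, the metric is a one-liner over the two groups' final probabilities:

```ruby
# Separation: mean final P(Win) of Won projects minus that of Lost projects.
def separation(won_p_wins, lost_p_wins)
  mean = ->(xs) { xs.sum(0.0) / xs.size }
  mean.call(won_p_wins) - mean.call(lost_p_wins)
end

separation([0.8, 0.7, 0.9], [0.3, 0.2, 0.4])  # => 0.5
```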
2.2 Data Maturity Phase
The Auto-Tuner's first question is straightforward: "How many completed projects exist?" The smaller of the Won and Lost counts — min(Won, Lost) — determines the confidence tier:
| Phase | min(Won, Lost) | Scope | Key |
|---|---|---|---|
| ❌ 1 | < 5 | No analysis | Insufficient |
| 🟠 2 | 5–9 | Display only (Apply locked) | Direction only |
| ✅ 3 | 10–19 | Impact, T, k + MCMC | CLT begins operating |
| 🟢 4 | 20–49 | Full (+ Dampening, Silence) + MCMC | Meaningful CV |
| 🔵 5 | 50+ | Full + MCMC stable | Maximum confidence |
2.3 The 6 Learning Targets
① Signal Lift → Is the signal genuine?
② Impact → Optimal weight?
③ Dampening → Compound Score attenuation rate?
④ Silence Penalty → Silence penalty magnitude?
⑤ Threshold T → Where's the cutoff?
⑥ Slope k → How sharply to discriminate?
These 6 targets are searched sequentially via coordinate descent Grid Search. Interaction effects are ignored, but each recommendation's rationale can be individually explained. MCMC compensates with a simultaneous (N+2)-dimensional joint estimation.
Chapter 3. Signal Lift — Is This Signal Really Meaningful?
3.1 The Question
Signal Master classifies signals as Positive or Negative based on domain expertise. But is the classification correct?
Lift is the ratio of the signal's appearance rates in the two outcome groups: Lift = P(s|Won) / P(s|Lost). This is mathematically equivalent to a Bayes Factor:
| Lift | Jeffreys' Scale | Interpretation |
|---|---|---|
| > 10 | Decisive | Overwhelmingly Won-associated |
| 3–10 | Strong | Strongly Won-associated |
| ≈ 1 | None | No discriminative power |
| < 1 | Reverse | More common in Lost |
3.2 Laplace Smoothing
If a signal appeared in 0 Lost projects, P(s|Lost) = 0 → division by zero. Adding +1 to the numerator and +2 to the denominator (from the Beta(1,1) uniform prior) resolves this:
def smoothed_rate(count, total)
(count + 1.0) / (total + 2.0)
end
With n=1000, the +1 and +2 cause only 0.1% difference — negligible with sufficient data.
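Putting the two pieces together, Lift is the ratio of smoothed rates (a minimal sketch using the `smoothed_rate` helper above):

```ruby
# Signal Lift = P(s|Won) / P(s|Lost), with Laplace smoothing on both rates
# so that a zero count in either group never divides by zero.
def smoothed_rate(count, total)
  (count + 1.0) / (total + 2.0)
end

def signal_lift(won_count, won_total, lost_count, lost_total)
  smoothed_rate(won_count, won_total) / smoothed_rate(lost_count, lost_total)
end

# Signal seen in 8 of 20 Won projects but only 1 of 20 Lost projects:
signal_lift(8, 20, 1, 20)  # => 4.5 — "Strong" on Jeffreys' scale
```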
3.3 Dynamic Minimum Appearances
A single observation doesn't guarantee validity. Phase-dependent minimums:
| Phase | Min Appearances |
|---|---|
| 2 | 3 |
| 3 | 5 |
| 4 | 8 |
| 5 | 10 |
Signals below the threshold get Lift = nil and are excluded from Grid Search.
3.4 Mismatch Detection
If a signal classified as Positive has Lift < 1 — the classification itself is wrong. Mismatches appear as warning alerts in the report, with the administrator deciding whether to reclassify, maintain, or remove the signal.
Chapter 4. Grid Search — The Heart of Optimization
4.1 Philosophy
"Try changing the parameter slightly, recommend whatever produces the best result."
No complex math required. Intuitively understandable, and the reasoning behind results can be clearly explained. This is why Grid Search is chosen over Gradient Descent or Bayesian Optimization.
4.2 Impact Grid Search
11 points are evenly distributed across the Phase-dependent search range:
| Phase | Range | Example (current=5.0) |
|---|---|---|
| 2 | ±20% | 4.0–6.0 |
| 3 | ±30% | 3.5–6.5 |
| 4 | ±40% | 3.0–7.0 |
| 5 | ±50% | 2.5–7.5 |
For each candidate, the system re-simulates all Won/Lost projects from scratch and selects the Impact that maximizes separation. If the improvement is less than 0.01, the current value is retained.
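One coordinate of that search can be sketched as follows. The `evaluator` lambda here is a hypothetical stand-in for the full re-simulation of all Won/Lost projects; in the real system each call recomputes separation from scratch.

```ruby
# One Grid Search coordinate: 11 evenly spaced candidates across the
# Phase-dependent range; keep the current value unless the best candidate
# improves separation by at least 0.01.
def grid_search_impact(current, range_pct, evaluator, min_improvement: 0.01)
  lo = current * (1.0 - range_pct)
  hi = current * (1.0 + range_pct)
  candidates = (0..10).map { |i| lo + (hi - lo) * i / 10.0 }

  best = candidates.max_by { |c| evaluator.call(c) }
  evaluator.call(best) - evaluator.call(current) >= min_improvement ? best : current
end

# Toy evaluator whose separation peaks at Impact = 6.0:
eval_sep = ->(impact) { 0.5 - (impact - 6.0)**2 / 10.0 }
grid_search_impact(5.0, 0.4, eval_sep)  # Phase 4: searches 3.0–7.0, => 5.8
```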
4.3 Compound Score: MAX + Remaining × Dampening
When three signals emerge from one meeting — Game Changer (5.0), Strong Affirmation (1.0), Moderate Affirmation (0.7) — adding everything gives 6.7. But this may be redundant information from the same context.
def compound_with_dampening(scores, dampening)
  return 0.0 if scores.empty?
  sorted = scores.sort.reverse
  sorted[0] + sorted[1..].sum * dampening
end
With dampening = 0.25:
5.0 + (1.0 + 0.7) × 0.25 = 5.425
Dampening is also searched across 11 points in the 0.0–1.0 range.
4.4 Silence Penalty
If the customer hasn't been contacted for 14+ days, β gradually increases: each full silence period adds silence_ratio × (Weak Negation impact) to β.
silence_ratio is searched across 11 points (0.0–1.0). Ratio = 0 means no penalty; 1.0 imposes a full Weak Negation per period.
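A minimal sketch of that rule, assuming (as the text describes) that each full 14-day silent period contributes silence_ratio × (Weak Negation impact) to β — the period length and the Weak Negation value below are illustrative:

```ruby
# Silence Penalty: full 14-day silent periods each add
# silence_ratio * weak_negation_impact to β.
def silence_penalty(days_silent, silence_ratio, weak_negation_impact, period: 14)
  periods = days_silent / period       # integer division: full periods only
  periods * silence_ratio * weak_negation_impact
end

# 30 silent days = two full periods; ratio 0.5, Weak Negation impact 0.4:
silence_penalty(30, 0.5, 0.4)  # => 0.4 added to β
```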
4.5 Performance
Total operations:
9 Impact types × 11 grid points × 100 projects × 20 activities = 198,000
+ Dampening (11 × 100 × 20 = 22,000) + Silence (22,000) simulations = 242,000 total
All data preloaded in memory, only multiplication/addition/comparison — under 1 second in Ruby.
Chapter 5. T · k — The Geometry of Decision-Making
5.1 The Impedance Function
- P(Win) > T → Impedance rises → "Go"
- P(Win) < T → Impedance drops → "No-Go"
- T = where to judge, k = how sharply to judge
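The exact formula is not given here, so the sketch below assumes the standard logistic form 1 / (1 + e^(−k·(P(Win) − T))) — consistent with "T = where, k = how sharply" and with the step-function behavior at large k described in 5.3:

```ruby
# Assumed logistic Impedance: centered at threshold T, sharpness k.
def impedance(p_win, t, k)
  1.0 / (1.0 + Math.exp(-k * (p_win - t)))
end

impedance(0.50, 0.50, 6)   # => 0.5 (exactly at the threshold)
impedance(0.70, 0.50, 6)   # ≈ 0.77 — above T, rises toward "Go"
impedance(0.70, 0.50, 12)  # ≈ 0.92 — higher k, sharper discrimination
```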
5.2 T: Youden J Statistic
100 candidate thresholds (0.01–0.99) are exhaustively tested against J = Sensitivity + Specificity − 1. The T with maximum J is recommended. If J* < 0.20, no recommendation is made — the data provides insufficient discrimination.
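The exhaustive sweep is a few lines (a sketch over final P(Win) lists, with sensitivity = Won projects above the threshold, specificity = Lost projects below it):

```ruby
# Youden J sweep: test thresholds 0.01–0.99, return [T, J] with maximal J,
# or nil when even the best J falls below 0.20.
def best_threshold(won_p_wins, lost_p_wins, min_j: 0.20)
  best_t, best_j = nil, -1.0
  (1..99).each do |i|
    t = i / 100.0
    sensitivity = won_p_wins.count { |p| p >= t }.fdiv(won_p_wins.size)
    specificity = lost_p_wins.count { |p| p <  t }.fdiv(lost_p_wins.size)
    j = sensitivity + specificity - 1.0
    best_t, best_j = t, j if j > best_j
  end
  best_j >= min_j ? [best_t, best_j] : nil
end

best_threshold([0.7, 0.8, 0.9], [0.2, 0.3, 0.6])  # => [0.61, 1.0]
```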
5.3 k: Grid Search Across 1–12
Upper bound = 12. Beyond this, the sigmoid becomes a step function — a 0.01 P(Win) difference causes extreme flips, which is no longer discrimination but binary chopping.
5.4 Per-Stage Independence
Each stage's T and k are optimized completely independently. Discovery's T has zero influence on Proposal's T.
| Stage | Characteristic | Expected T | Expected k |
|---|---|---|---|
| Discovery | Exploratory | Low (lenient) | Low (gentle) |
| Qualification | Basics verified | Moderate | Moderate |
| Proposal | Cost commitment | High (strict) | High (sharp) |
| Negotiation | Final gate | Highest | High |
Chapter 6. MCMC — The Expedition Team That Knows Uncertainty
6.1 What Grid Search Couldn't Do
Grid Search gives point estimates. But:
- Could 4.2 yield nearly the same results as 4.7?
- How confident can we be?
- What about Impact × Dampening interactions?
MCMC doesn't stick a single pin — it draws the entire contour map of probability.
6.2 Emcee — 32 Explorers Referencing Each Other
EXAWin uses Emcee (Affine-Invariant Ensemble Sampler) (Foreman-Mackey et al., 2013). While typical MCMC sends one explorer, Emcee simultaneously dispatches 32 walkers who reference each other's positions via "stretch moves."
Key advantages:
- No gradient required — Only the likelihood function value is needed
- Affine invariant — Automatically handles scale differences (Impact 5.0 vs dampening 0.25)
- Parallel exploration — Low risk of getting stuck in local optima
6.3 Probability Model
# Impact — always positive
impact_gc ~ LogNormal(log(5.0), 0.5)
impact_str_p ~ LogNormal(log(1.0), 0.5)
...
# Dampening — 0~1 range
dampening ~ Beta(5, 15) # mean ≈ 0.25
# Silence — 0~1 range
silence_ratio ~ Beta(3, 7) # mean ≈ 0.30
LogNormal: always positive, long right tail, symmetric in log scale. Beta: naturally bounded to [0,1]. σ=0.5 covers ±60% of the current value with 95% probability.
6.4 Likelihood Function
For each project, the same logic as Ruby's simulate_project runs in Python. The likelihood is Bernoulli: log L = Σ [ y · log(p_win) + (1 − y) · log(1 − p_win) ], where y = 1 for Won and y = 0 for Lost. Won projects increase the likelihood with higher p_win; Lost projects increase it with lower p_win.
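Expressed in Ruby for illustration (the production likelihood runs in Python), the Bernoulli log-likelihood over all completed projects is:

```ruby
# Bernoulli log-likelihood: log L = Σ [ y·log(p) + (1−y)·log(1−p) ],
# y = 1 for Won, 0 for Lost. An epsilon guards against log(0) when a
# simulation drives p_win to an extreme value.
def log_likelihood(outcomes, p_wins, eps: 1e-9)
  outcomes.zip(p_wins).sum do |y, p|
    p = p.clamp(eps, 1.0 - eps)
    y == 1 ? Math.log(p) : Math.log(1.0 - p)
  end
end

log_likelihood([1, 1, 0], [0.9, 0.8, 0.2])  # higher when Won↑ and Lost↓
```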
6.5 Why Emcee Instead of NUTS?
The earlier NUTS-based approach required reconstructing Ruby's simulation logic into PyTensor's tensor graph — incurring structural costs:
- MAX function is non-differentiable → LogSumExp approximation needed
- Thousands of tensor nodes → 3+ minute compilation
- Model modifications required rebuilding the entire graph
Emcee eliminates all of this. Simulations run as pure Python function calls. Project data is pre-compiled into numpy arrays for 5–10× speedup.
6.6 Architecture: Rails ↔ Python
BayesianAutoTuner.full_report
→ MCMCService.run(company, tuner_data)
→ Serialize to JSON
→ python3 lib/mcmc/mcmc_runner.py input.json output.json
→ (subprocess, max 300s timeout)
→ Parse result JSON → merge into report
subprocess provides process isolation: memory released on termination, multi-tenant safety, fault isolation. If MCMC fails, existing Grid Search results return normally — MCMC is additive, never blocking.
Chapter 7. Statistical Validation — Can These Results Be Trusted?
7.1 ROC AUC (Mann-Whitney U)
"Pick one random Won and one random Lost project — the probability that Won's P(Win) is higher."
| AUC | Grade |
|---|---|
| ≥ 0.90 | excellent |
| 0.80–0.89 | good |
| 0.70–0.79 | fair |
| 0.60–0.69 | poor |
| < 0.60 | fail |
AUC = 0.50 is a coin flip — the data contains no discriminative information.
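The Mann-Whitney reading makes AUC directly computable without plotting a ROC curve — a sketch that counts winning (Won, Lost) pairs, with ties worth half:

```ruby
# ROC AUC via Mann-Whitney: fraction of (Won, Lost) pairs where the Won
# project's P(Win) is strictly higher; ties count 0.5.
def roc_auc(won_p_wins, lost_p_wins)
  wins = won_p_wins.sum do |w|
    lost_p_wins.sum { |l| w > l ? 1.0 : (w == l ? 0.5 : 0.0) }
  end
  wins / (won_p_wins.size * lost_p_wins.size)
end

roc_auc([0.8, 0.7, 0.6], [0.3, 0.2, 0.7])  # ≈ 0.83 — "good"
```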
7.2 K-fold Cross-Validation
5-fold CV: train on 4, validate on 1, repeat 5 times.
| Gap | Risk |
|---|---|
| < 0.05 | low ✅ |
| 0.05–0.15 | medium ⚠️ |
| > 0.15 | high 🚨 |
gap = 0.30 means separation of 0.40 on training but only 0.10 on validation — that's memorization, not learning.
7.3 Prior α/β Recommendation
From the distribution of completed projects' final P(Win) values, the Method of Moments provides, with sample mean μ and variance σ²: ν = μ(1 − μ)/σ² − 1, then α = μ · ν and β = (1 − μ) · ν.
Both are clamped to [0.5, 5.0] for safety. A strong Prior slows learning — this is a deliberate system design choice.
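A sketch of the fit (assuming the unbiased sample variance; whether the production code uses n or n − 1 in the denominator is an implementation detail not specified here):

```ruby
# Beta Method of Moments: match sample mean and variance of final P(Win)
# values, then clamp both parameters to [0.5, 5.0] as described above.
def recommend_prior(p_wins)
  n    = p_wins.size.to_f
  mean = p_wins.sum / n
  var  = p_wins.sum { |p| (p - mean)**2 } / (n - 1)
  nu   = mean * (1.0 - mean) / var - 1.0   # ν = μ(1−μ)/σ² − 1
  alpha = (mean * nu).clamp(0.5, 5.0)
  beta  = ((1.0 - mean) * nu).clamp(0.5, 5.0)
  [alpha, beta]
end

recommend_prior([0.2, 0.4, 0.6, 0.8])  # => [1.375, 1.375]
```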
Chapter 8. Reading Posterior Results
8.1 HDI (Highest Density Interval)
95% HDI: "The probability that the parameter lies within this range is 95%."
Game Changer: HDI [3.5, 6.2]
→ Current 5.0 is within range → No change needed
Game Changer: HDI [2.0, 3.5]
→ Current 5.0 is outside HDI → Likely overestimated → Recommend ~3.0
Narrow HDI = precise, sufficient data. Wide HDI = uncertainty, more data needed.
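Computing an HDI from posterior samples is simple with the standard sliding-window method (a sketch; production libraries such as ArviZ implement the same idea):

```ruby
# HDI from samples: the narrowest interval containing `mass` of the
# sorted posterior draws.
def hdi(samples, mass: 0.95)
  sorted = samples.sort
  window = (mass * sorted.size).ceil
  best = (0..(sorted.size - window)).min_by { |i| sorted[i + window - 1] - sorted[i] }
  [sorted[best], sorted[best + window - 1]]
end

# One outlier draw at 100 is excluded by the 90% interval:
hdi([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], mass: 0.9)  # => [1, 9]
```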
8.2 R̂ (R-hat) — Convergence Proof
| R̂ | Interpretation |
|---|---|
| < 1.01 | Perfect ✅ |
| 1.01–1.05 | OK |
| 1.05–1.10 | Caution ⚠️ |
| > 1.10 | Non-convergence 🚨 |
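R̂ compares between-chain and within-chain variance; if the 32 walkers ended up exploring the same region, the ratio approaches 1. The sketch below implements the original Gelman-Rubin formula without split chains (modern samplers typically report the rank-normalized split-R̂ variant):

```ruby
# Gelman–Rubin R̂: B = between-chain variance, W = within-chain variance;
# R̂ = sqrt(V̂ / W) with V̂ = (n−1)/n · W + B/n. R̂ → 1 as chains converge.
def r_hat(chains)
  n = chains.first.size.to_f
  m = chains.size.to_f
  means = chains.map { |c| c.sum / n }
  grand = means.sum / m
  b = n / (m - 1) * means.sum { |mu| (mu - grand)**2 }
  w = chains.sum { |c| mu = c.sum / n; c.sum { |x| (x - mu)**2 } / (n - 1) } / m
  Math.sqrt(((n - 1) / n * w + b / n) / w)
end

r_hat([[1.0, 2, 3, 4], [11.0, 12, 13, 14]])  # ≈ 5.5 — chains never mixed
```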
8.3 Grid Search × MCMC Cross-Referencing
When both agree → recommend with high confidence:
Grid Search: 4.7 MCMC HDI: [3.8, 5.9], mean 4.8 → Strong evidence
When they disagree → prioritize MCMC's broader exploration:
Grid Search: 2.0 MCMC HDI: [3.5, 6.0], mean 4.7 → Grid may be locally trapped
Chapter 9. The Complete Map
9.1 Analysis Pipeline
1. Signal Lift → Genuine signal?
2. Impact Grid → Optimal weight?
3. Dampening Grid → Optimal attenuation?
4. Silence Grid → Optimal silence penalty?
5. T Youden J → Optimal threshold?
6. k Grid Search → Optimal slope?
7. ROC AUC → Overall discriminative power?
8. K-fold CV → Overfitting check?
9. Prior Recommendation → Reasonable starting values?
10. MCMC Posterior → Uncertainty-aware estimation?
These 10 analyses form a single report. Grid Search's intuitiveness, MCMC's precision, cross-validation's safety — three pillars that complement each other.
9.2 Single Simulation Flow
simulate_project(project, overrides)
├─ ① α, β = Prior initialization
├─ ② for each activity (chronological)
│ ├─ ③ SWV = stage weight
│ ├─ ④ Compound Score (MAX + dampening)
│ ├─ ⑤ Silence Penalty check
│ └─ ⑥ α += SWV × positive_compound
│ β += SWV × negative_compound
└─ ⑦ Return P(Win) = α / (α + β)
This is the complete lifecycle calculation for one project. The Grid Search runs this for every candidate × every project, and the MCMC runs Emcee's 32 walkers × 1,500 steps, each step invoking this same simulation.
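The flow above can be condensed into a runnable toy version. Signal values, stage weights, and the data shape are illustrative assumptions, and the Silence Penalty step (⑤) is omitted for brevity:

```ruby
# Toy simulate_project: Prior → chronological evidence updates → P(Win).
def simulate_project(activities, alpha0:, beta0:, dampening: 0.25)
  alpha, beta = alpha0, beta0
  activities.each do |act|                            # ② chronological order
    swv = act[:stage_weight]                          # ③ SWV
    pos = compound(act[:positive_scores], dampening)  # ④ MAX + dampening
    neg = compound(act[:negative_scores], dampening)
    alpha += swv * pos                                # ⑥ evidence updates
    beta  += swv * neg
  end
  alpha / (alpha + beta)                              # ⑦ P(Win)
end

def compound(scores, dampening)
  return 0.0 if scores.empty?
  sorted = scores.sort.reverse
  sorted[0] + sorted[1..].sum * dampening
end

activities = [
  { stage_weight: 1.0, positive_scores: [5.0, 1.0], negative_scores: [] },
  { stage_weight: 1.5, positive_scores: [0.7],      negative_scores: [0.4] },
]
simulate_project(activities, alpha0: 3.0, beta0: 7.0)  # ≈ 0.55
```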
9.3 Impedance Impact Simulation
Before pressing Apply, the administrator sees:
| Stage | P(Win) | Current Impedance | Recommended | Δ | Count |
|---|---|---|---|---|---|
| Discovery | 21.5% | 28.4% | 53.5% | ↑25.1%p | 15 |
| Qualification | 31.7% | 30.7% | 60.3% | ↑29.6%p | 15 |
| Proposal | 46.4% | 40.8% | 74.0% | ↑33.4%p | 15 |
"Current Discovery average impedance is 28%, and applying recommendations raises it to 54%." This means the recommended settings better separate Won from Lost deals.
9.4 Grading System
Combining AUC, CV gap, and separation improvement, the system assigns an overall Grade:
| Grade | Condition | Recommendation |
|---|---|---|
| A | AUC ≥ 0.80, gap < 0.05 | Strong recommendation |
| B | AUC ≥ 0.70, gap < 0.10 | Recommend with caution |
| C | AUC ≥ 0.60, gap < 0.15 | Directional reference only |
| D | AUC < 0.60 or gap ≥ 0.15 | Do not recommend |
Chapter 10. Human-in-the-Loop: Why Humans Decide
10.1 Why Not Auto-Apply?
The Auto-Tuner can find optimal parameters. But the final decision to apply always rests with humans. This is not an engineering oversight — it is a deliberate design philosophy.
- Context the system cannot know — Industry shifts, personnel changes, strategy pivots
- Responsibility — Parameter changes affect all active deals. Automated changes to hundreds of deals' probability scores without human oversight is organizational irresponsibility
- Trust building — Users who understand why values are recommended and choose to apply them build trust in the system. Opaque automation breeds distrust
10.2 The Workflow
| Phase | Scope | Apply | MCMC |
|---|---|---|---|
| 1 | No display | 🔒 | ❌ |
| 2 | Display only | 🔒 | ❌ |
| 3 | Impact, T, k | 🔓 | ✅ (may be unstable) |
| 4 | Full | 🔓 | ✅ |
| 5 | Full | 🔓 | ✅ (most stable) |
- System generates a report with recommendations
- Administrator reviews each parameter's current vs recommended values
- Administrator examines the evidence: Lift, separation improvement, AUC, CV, HDI
- Administrator decides: Apply all / Apply selectively / Reject all
- Applied values update Signal Master and Stage Master
The system provides evidence. Humans provide judgment. This division of labor is the most robust design for a decision-support system operating in an uncertain world.
Epilogue — Evidence, Not Commands
The Auto-Tuner does not say "do this." It says "here is the evidence, and this is what it suggests."
Behind the question "Can this deal be won?" lies a system that integrates 260 years of statistical heritage — from Bayes's posthumous paper through Robbins's empirical Bayes, Stein's paradox, Efron-Morris's shrinkage estimation, to modern MCMC methods — into a practical decision engine.
Every simulation run, every Lift calculated, every HDI interval drawn is an act of converting decades of a company's accumulated experience — intangible, unstructured, but undeniably real — into the language of mathematics.
And the moment the administrator reviews that evidence and makes a decision, human wisdom and machine computation shake hands.
That handshake is the Auto-Tuner's raison d'être.
References
- Bayes, T. (1763). "An Essay towards Solving a Problem in the Doctrine of Chances." Phil. Trans. R. Soc. London, 53, 370-418.
- Robbins, H. (1956). "An Empirical Bayes Approach to Statistics." Proc. 3rd Berkeley Symp., 1, 157-163.
- James, W. & Stein, C. (1961). "Estimation with Quadratic Loss." Proc. 4th Berkeley Symp., 1, 361-379.
- Efron, B. & Morris, C. (1975). "Data Analysis Using Stein's Estimator and its Generalizations." JASA, 70(350), 311-319.
- O'Hagan, A. et al. (2006). Uncertain Judgements: Eliciting Experts' Probabilities. Wiley.
- Youden, W.J. (1950). "Index for rating diagnostic tests." Cancer, 3(1), 32-35.
- Foreman-Mackey, D. et al. (2013). "emcee: The MCMC Hammer." PASP, 125(925), 306-312.
- Goodman, J. & Weare, J. (2010). "Ensemble samplers with affine invariance." CAMCS, 5(1), 65-80.
- Gelman, A. & Rubin, D.B. (1992). "Inference from Iterative Simulation Using Multiple Sequences." Stat. Sci., 7(4), 457-472.
- Cooper, R.G. (2008). "Perspective: The Stage-Gate Idea-to-Launch Process." JPIM, 25(3).
- Ghosh, J.K. & Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer.
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd Ed. CRC Press.