DOCUMENTATION

Taming Uncertainty with Millions of Simulations — EXAWin Auto-Tuner: A Complete Anatomy

A comprehensive technical essay integrating the Auto-Tuner anatomy series. Covers everything from the philosophical foundations to the mathematical structure, from Grid Search to MCMC, from statistical validation to the human-in-the-loop decision structure — a complete anatomy of the system that converts decades of experience into evidence.

Prologue — One Simple Question

There is only one question a sales manager asks the system:

"Can this deal be won?"

Behind this seemingly simple question lies a system that integrates 6 interdependent components spanning 10 dimensions, running millions of numerical simulations. This article fully dissects that system.



Chapter 1. An Engine Without Likelihood: How Experience Becomes a Formula

1.1 The Most Honest Starting Point

Standard Bayesian inference uses likelihood P(D|θ) to update beliefs. However, sales signals — "the customer asked for a follow-up meeting," "the decision maker attended" — are not observations drawn from any mathematically defined probability distribution.

EXAWin resolves this through the pseudo-count approach:

\alpha_{\text{new}} = \alpha_{\text{prev}} + \text{SWV} \times \text{Impact}

This declares: "This signal possesses evidential weight equivalent to SWV × Impact virtual success observations." Since the Beta distribution is valid for all real numbers α, β > 0, pseudo-counts need not be integers.

This methodology originated in Expert Elicitation (O'Hagan et al., 2006). When an expert assesses "this evidence is equivalent to N direct observations," adding N as a pseudo-count to α or β is a justified methodology.
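The update rule above is small enough to run by hand. A minimal Python sketch, with an illustrative Beta(6, 4) prior and SWV = 1.5 (neither is a system default):

```python
# Pseudo-count Beta update: one signal adds SWV * Impact virtual successes.
# Illustrative numbers only -- Beta(6, 4) prior and SWV = 1.5 are not defaults.

def apply_signal(alpha, beta, swv, impact, positive=True):
    """Return the updated (alpha, beta) after one signal observation."""
    if positive:
        alpha += swv * impact
    else:
        beta += swv * impact
    return alpha, beta

def p_win(alpha, beta):
    return alpha / (alpha + beta)

alpha, beta = 6.0, 4.0                       # prior: P(Win) = 0.60
alpha, beta = apply_signal(alpha, beta, swv=1.5, impact=1.0)
print(round(p_win(alpha, beta), 3))          # 7.5 / 11.5
```

Because pseudo-counts need not be integers, a fractional increment of 1.5 is just as valid as a whole observation.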

The key question, then: How do we determine that N? The answer is f-coupling.

1.2 f-Coupling: Aligning Signal and Prior Scales

If Signal Impact and Prior (α₀, β₀, strength S = α₀ + β₀) are set independently, a single signal can overwhelm the company's entire historical experience in one instant — a physically unrealistic scenario.

The solution: define Signal Impact as a fraction of Prior strength:

\text{Impact}_i = f_i \times S, \quad S = \alpha_0 + \beta_0

f_i represents: "What percentage of the company's total prior experience does one occurrence of this signal represent as evidence?"

Signal Type                   f      Impact (S=10)   Meaning
──────────────────────────────────────────────────────────────────────────────
Game Changer                  0.50   5.0             Half as strong as total prior experience
Strong Affirmation/Negation   0.10   1.0             10% of prior experience
Moderate (Aff./Neg.)          0.07   0.7             7% of prior experience
Weak (Aff./Neg.)              0.04   0.4             4% of prior experience
No Signal                     0.01   0.1             Virtually noise level

The key property of this coupling: Scale invariance. As long as f is the same, the P(Win) trajectory is completely identical regardless of whether S = 10 or S = 100. This is the raison d'être of coupling — maintaining consistent learning dynamics across any Prior configuration.
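The invariance can be checked numerically. A sketch, assuming an arbitrary signal sequence and a 0.5 prior mean (both illustrative):

```python
# Scale invariance of f-coupling: with Impact_i = f_i * S, the P(Win)
# trajectory depends only on the prior mean and the f values, not on S.
# The signal sequence and prior mean below are illustrative.

def trajectory(s, prior_mean, events):
    """events: (swv, f, positive) tuples; returns P(Win) after each signal."""
    alpha, beta = prior_mean * s, (1 - prior_mean) * s
    out = []
    for swv, f, positive in events:
        delta = swv * f * s              # Impact = f * S
        if positive:
            alpha += delta
        else:
            beta += delta
        out.append(alpha / (alpha + beta))
    return out

events = [(1.0, 0.10, True), (1.5, 0.07, False), (2.0, 0.50, True)]
t10  = trajectory(10,  0.5, events)
t100 = trajectory(100, 0.5, events)
assert all(abs(a - b) < 1e-12 for a, b in zip(t10, t100))   # identical paths
```

The S factors cancel in α / (α + β), which is exactly why the trajectory is invariant.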

1.3 EPR Guardrails

EPR (Evidence-Prior Ratio) is a diagnostic metric measuring how much the maximum evidence from a single meeting can impact the Prior:

\text{EPR} = \frac{\text{SWV}_{\max} \times \text{Impact}}{S}

Signal Type       EPR Cap   Max Impact (S=10)
─────────────────────────────────────────────
Game Changer      2.0       7.7
Regular Signals   0.5       1.9

If a user sets an Impact value in Signal Master that exceeds the cap, the save is rejected at the code level. This prevents inexperienced users from destabilizing the system with extreme parameters.
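A sketch of that save-time guardrail. SWV_max = 2.6 is an assumption here, back-solved from the table's max impacts at S = 10; the function names are hypothetical:

```python
# Save-time EPR guardrail: an Impact setting is rejected when its worst-case
# evidence (SWV_max * Impact) would exceed cap * S.  SWV_max = 2.6 is an
# assumption, back-solved from the table's max impacts at S = 10.

EPR_CAPS = {"game_changer": 2.0, "regular": 0.5}

def max_allowed_impact(signal_type, s, swv_max=2.6):
    return EPR_CAPS[signal_type] * s / swv_max

def impact_allowed(impact, signal_type, s, swv_max=2.6):
    """True if saving this Impact in Signal Master should succeed."""
    return impact <= max_allowed_impact(signal_type, s, swv_max)

print(round(max_allowed_impact("game_changer", 10), 1))  # 7.7
print(round(max_allowed_impact("regular", 10), 1))       # 1.9
```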



Chapter 2. The Auto-Tuner Takes the Stage

2.1 What Is It?

The Auto-Tuner is the system that answers: "Are the currently configured parameters truly optimal?" It analyzes completed Won/Lost project data, simulates by changing each parameter value, and recommends the settings that best separate Won from Lost.

Core metric — Separation:

\text{Separation} = \overline{P_{\text{Won}}} - \overline{P_{\text{Lost}}}

Where \overline{P_{\text{Won}}} is the average final P(Win) of Won projects, and \overline{P_{\text{Lost}}} is the average final P(Win) of Lost projects. Excellent parameters should produce high P(Win) for Won projects and low P(Win) for Lost projects — naturally yielding large separation.

2.2 Data Maturity Phase

The Auto-Tuner's first question is straightforward: "How many completed projects exist?" The lesser count of Won and Lost (min) determines the confidence tier:

Phase   min(Won, Lost)   Scope                                Key
─────────────────────────────────────────────────────────────────────────────
❌ 1    < 5              No analysis                          Insufficient
🟠 2    5–9              Display only (Apply locked)          Direction only
✅ 3    10–19            Impact, T, k + MCMC                  CLT begins operating
🟢 4    20–49            Full (+ Dampening, Silence) + MCMC   Meaningful CV
🔵 5    50+              Full + MCMC stable                   Maximum confidence

2.3 The 6 Learning Targets

  1. Signal Lift → Is the signal genuine?
  2. Impact → Optimal weight?
  3. Dampening → Compound Score attenuation rate?
  4. Silence Penalty → Silence penalty magnitude?
  5. Threshold T → Where's the cutoff?
  6. Slope k → How sharply to discriminate?

These 6 targets are searched sequentially via coordinate descent Grid Search. Interaction effects are ignored, but each recommendation's rationale can be individually explained. MCMC compensates with a simultaneous (N+2)-dimensional joint estimation.
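The coordinate-descent loop can be sketched as follows; `separation` here is a toy stand-in for the full per-project re-simulation, and the grids are illustrative:

```python
# Coordinate descent over the learning targets: optimize one parameter at a
# time while holding the rest fixed.  `separation` and the grids below are
# stand-ins for the real simulation; the loop structure is the point.

def coordinate_descent(params, grids, separation, min_gain=0.01):
    """Greedy one-at-a-time grid search; keeps a change only if it helps."""
    for name, grid in grids.items():
        base = separation(params)
        best_value, best_sep = params[name], base
        for candidate in grid:
            trial = dict(params, **{name: candidate})
            sep = separation(trial)
            if sep > best_sep:
                best_value, best_sep = candidate, sep
        if best_sep - base >= min_gain:       # otherwise retain current value
            params[name] = best_value
    return params

# Toy objective peaking at impact = 4.0, dampening = 0.25:
toy = lambda p: 1 - abs(p["impact"] - 4.0) / 10 - abs(p["dampening"] - 0.25)
result = coordinate_descent(
    {"impact": 5.0, "dampening": 0.5},
    {"impact": [3.0, 4.0, 5.0, 6.0], "dampening": [0.0, 0.25, 0.5, 0.75]},
    toy,
)
print(result)   # {'impact': 4.0, 'dampening': 0.25}
```

Because each coordinate is searched against a frozen context, every recommendation can be explained in isolation, at the cost of ignoring interactions.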



Chapter 3. Signal Lift — Is This Signal Really Meaningful?

3.1 The Question

Signal Master classifies signals as Positive or Negative based on domain expertise. But is the classification correct?

\text{Lift}(s) = \frac{P(s \mid \text{Won})}{P(s \mid \text{Lost})}

This is mathematically equivalent to a Bayes Factor:

Lift   Jeffreys' Scale   Interpretation
──────────────────────────────────────────────────
> 10   Decisive          Overwhelmingly Won-associated
3–10   Strong            Strongly Won-associated
≈ 1    None              No discriminative power
< 1    Reverse           More common in Lost

3.2 Laplace Smoothing

If a signal appeared in 0 Lost projects, P(s|Lost) = 0 → division by zero. Adding +1 to the numerator and +2 to the denominator (from the Beta(1,1) uniform prior) resolves this:

def smoothed_rate(count, total)
  (count + 1.0) / (total + 2.0)  # Beta(1,1) prior: +1 success, +2 trials
end

With n=1000, the +1 and +2 cause only 0.1% difference — negligible with sufficient data.
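A Python mirror of the Ruby helper, extended to the resulting Lift; the counts are illustrative:

```python
# Lift with Laplace smoothing -- a Python mirror of the Ruby smoothed_rate.
# Appearance counts below are illustrative.

def smoothed_rate(count, total):
    return (count + 1.0) / (total + 2.0)   # Beta(1,1) prior: +1 / +2

def lift(won_count, won_total, lost_count, lost_total):
    return smoothed_rate(won_count, won_total) / smoothed_rate(lost_count, lost_total)

# Signal seen in 8 of 20 Won projects but 0 of 15 Lost projects:
print(round(lift(8, 20, 0, 15), 2))   # finite Lift despite zero Lost hits
```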

3.3 Dynamic Minimum Appearances

A single observation doesn't guarantee validity. Phase-dependent minimums:

Phase   Min Appearances
───────────────────────
2       3
3       5
4       8
5       10

Signals below the threshold get Lift = nil and are excluded from Grid Search.

3.4 Mismatch Detection

If a signal classified as Positive has Lift < 1 — the classification itself is wrong. Mismatches appear as warning alerts in the report, with the administrator deciding whether to reclassify, maintain, or remove the signal.



Chapter 4. Grid Search — The Heart of Optimization

4.1 Philosophy

"Try changing the parameter slightly, recommend whatever produces the best result."

No complex math required. Intuitively understandable, and the reasoning behind results can be clearly explained. This is why Grid Search is chosen over Gradient Descent or Bayesian Optimization.

4.2 Impact Grid Search

11 points are evenly distributed across the Phase-dependent search range:

Phase   Range   Example (current = 5.0)
───────────────────────────────────────
2       ±20%    4.0–6.0
3       ±30%    3.5–6.5
4       ±40%    3.0–7.0
5       ±50%    2.5–7.5

For each candidate, the system re-simulates all Won/Lost projects from scratch and selects the Impact that maximizes separation. If the improvement is less than 0.01, the current value is retained.
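A sketch of the candidate generation and selection, with `simulate_separation` standing in for the full per-project re-simulation:

```python
# 11-point grid over the Phase-dependent range, plus the 0.01 improvement
# tolerance.  `simulate_separation` is a stand-in for re-simulating every
# Won/Lost project under one candidate value.

PHASE_RANGE = {2: 0.20, 3: 0.30, 4: 0.40, 5: 0.50}

def grid_points(current, phase, n=11):
    span = current * PHASE_RANGE[phase]
    lo, hi = current - span, current + span
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]

def best_candidate(current, phase, simulate_separation, min_gain=0.01):
    base = simulate_separation(current)
    best, best_sep = current, base
    for c in grid_points(current, phase):
        sep = simulate_separation(c)
        if sep > best_sep:
            best, best_sep = c, sep
    return best if best_sep - base >= min_gain else current

pts = grid_points(5.0, 2)
print(pts[0], pts[-1])   # 4.0 6.0
```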

4.3 Compound Score: MAX + Remaining × Dampening

When three signals emerge from one meeting — Game Changer (5.0), Strong Affirmation (1.0), Moderate Affirmation (0.7) — adding everything gives 6.7. But this may be redundant information from the same context.

def compound_with_dampening(scores, dampening)
  return 0.0 if scores.empty?
  sorted = scores.sort.reverse
  sorted[0] + sorted[1..].sum * dampening  # MAX + remaining × dampening
end

With dampening = 0.25:

5.0 + (1.0 + 0.7) × 0.25 = 5.425

Dampening is also searched across 11 points in the 0.0–1.0 range.

4.4 Silence Penalty

If the customer hasn't been contacted for 14+ days, β gradually increases:

\beta \mathrel{+}= \text{unit\_penalty} \times \text{count}

silence_ratio is searched across 11 points (0.0–1.0). Ratio = 0 means no penalty; 1.0 imposes full Weak Negation per period.
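A sketch under stated assumptions: the 14-day period, the Weak Negation unit (Impact 0.4 at S = 10), and silence_ratio come from the text, while the exact wiring (scaling the Weak Negation Impact by silence_ratio per full period) is an assumption:

```python
# Silence penalty sketch: every full 14-day contact gap adds silence_ratio
# times a Weak Negation's evidence to beta.  The wiring below is an
# assumption; only the period, ratio, and unit come from the text.

SILENCE_PERIOD_DAYS = 14

def silence_penalty(gap_days, silence_ratio, weak_negation_impact=0.4, swv=1.0):
    """Return the beta increment for a contact gap of `gap_days` days."""
    periods = gap_days // SILENCE_PERIOD_DAYS
    return silence_ratio * weak_negation_impact * swv * periods

print(silence_penalty(30, 0.5))   # two full periods at half strength
```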

4.5 Performance

Total operations:

9 Impact types × 11 grid points × 100 projects × 20 activities = 198,000
+ Dampening (11 points) and Silence (11 points): 22 × 100 × 20 = 44,000
≈ 240,000 simulated activity updates in total

All data preloaded in memory, only multiplication/addition/comparison — under 1 second in Ruby.



Chapter 5. T · k — The Geometry of Decision-Making

5.1 The Impedance Function

I(P) = \frac{1}{1 + e^{-k(P - T)}}
  • P(Win) > T → Impedance rises → "Go"
  • P(Win) < T → Impedance drops → "No-Go"
  • T = where to judge, k = how sharply to judge
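The impedance curve itself is one line of code; T = 0.45 and k = 8 below are illustrative stage settings, not system defaults:

```python
import math

# Impedance function I(P) = 1 / (1 + exp(-k * (P - T))).
# T = 0.45 and k = 8 are illustrative stage settings.

def impedance(p_win, t, k):
    return 1.0 / (1.0 + math.exp(-k * (p_win - t)))

t, k = 0.45, 8.0
print(round(impedance(0.45, t, k), 2))   # exactly at threshold -> 0.5
print(impedance(0.70, t, k) > 0.8)       # well above T -> clear "Go"
```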

5.2 T: Youden J Statistic

J(t) = \text{Sensitivity}(t) + \text{Specificity}(t) - 1

99 candidate thresholds (0.01–0.99, in steps of 0.01) are exhaustively tested. The T with maximum J is recommended. If J* < 0.20, no recommendation is made — the data provides insufficient discrimination.
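The threshold search can be sketched directly; the score lists are illustrative:

```python
# Youden J threshold search: test each candidate cutoff and keep the one
# maximizing Sensitivity + Specificity - 1.  Scores below are illustrative.

def youden_best_threshold(won_scores, lost_scores, min_j=0.20):
    best_t, best_j = None, -1.0
    for i in range(1, 100):                   # candidates 0.01 .. 0.99
        t = i / 100
        sens = sum(s >= t for s in won_scores) / len(won_scores)
        spec = sum(s < t for s in lost_scores) / len(lost_scores)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return (best_t, best_j) if best_j >= min_j else (None, best_j)

won  = [0.62, 0.71, 0.80, 0.55, 0.68]
lost = [0.30, 0.42, 0.25, 0.51, 0.38]
t, j = youden_best_threshold(won, lost)
print(t, round(j, 2))   # perfectly separable toy data -> J = 1.0
```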

5.3 k: Grid Search Across 1–12

Upper bound = 12. Beyond this, the sigmoid becomes a step function — a 0.01 P(Win) difference causes extreme flips, which is no longer discrimination but binary chopping.

5.4 Per-Stage Independence

Each stage's T and k are optimized completely independently. Discovery's T has zero influence on Proposal's T.

Stage           Characteristic    Expected T      Expected k
─────────────────────────────────────────────────────────────
Discovery       Exploratory       Low (lenient)   Low (gentle)
Qualification   Basics verified   Moderate        Moderate
Proposal        Cost commitment   High (strict)   High (sharp)
Negotiation     Final gate        Highest         High


Chapter 6. MCMC — The Expedition Team That Knows Uncertainty

6.1 What Grid Search Couldn't Do

Grid Search gives point estimates. But:

  • Could 4.2 yield nearly the same results as 4.7?
  • How confident can we be?
  • What about Impact × Dampening interactions?

MCMC doesn't stick a single pin — it draws the entire contour map of probability.

6.2 Emcee — 32 Explorers Referencing Each Other

EXAWin uses Emcee (Affine-Invariant Ensemble Sampler) (Foreman-Mackey et al., 2013). While typical MCMC sends one explorer, Emcee simultaneously dispatches 32 walkers who reference each other's positions via "stretch moves."

Key advantages:

  1. No gradient required — Only the likelihood function value is needed
  2. Affine invariant — Automatically handles scale differences (Impact 5.0 vs dampening 0.25)
  3. Parallel exploration — Low risk of getting stuck in local optima
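The stretch move at the heart of the sampler (Goodman & Weare, 2010) is compact enough to sketch in pure Python. This is a minimal illustration of the mechanism, not the emcee implementation; a = 2.0 is the conventional stretch scale:

```python
import math, random

# One Goodman-Weare stretch move: walker k is pulled toward (or past) a
# randomly chosen companion walker by a random factor z.  Minimal sketch of
# the mechanism, not the emcee implementation; a = 2.0 is conventional.

def stretch_move(walkers, k, log_prob, a=2.0, rng=random):
    """Propose and accept/reject a new position for walker k in place."""
    x_k = walkers[k]
    x_j = walkers[rng.choice([i for i in range(len(walkers)) if i != k])]
    z = ((a - 1) * rng.random() + 1) ** 2 / a        # z ~ g(z) on [1/a, a]
    proposal = [xj + z * (xk - xj) for xk, xj in zip(x_k, x_j)]
    log_accept = (len(x_k) - 1) * math.log(z) + log_prob(proposal) - log_prob(x_k)
    if math.log(rng.random()) < log_accept:
        walkers[k] = proposal
    return walkers[k]

# Toy 2-D Gaussian target centered at the origin:
log_gauss = lambda x: -0.5 * sum(v * v for v in x)
random.seed(0)
walkers = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(32)]
for step in range(200):
    for k in range(32):
        stretch_move(walkers, k, log_gauss)
```

Because the proposal is built from the relative positions of other walkers, rescaling any coordinate rescales the proposals identically, which is where the affine invariance comes from.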

6.3 Probability Model

# Impact — always positive
impact_gc     ~ LogNormal(log(5.0), 0.5)
impact_str_p  ~ LogNormal(log(1.0), 0.5)
...
# Dampening — 0~1 range
dampening     ~ Beta(5, 15)    # mean ≈ 0.25
# Silence — 0~1 range
silence_ratio ~ Beta(3, 7)     # mean ≈ 0.30

LogNormal: always positive, long right tail, symmetric in log scale. Beta: naturally bounded to [0,1]. With σ = 0.5 in log scale, roughly 95% of the prior mass falls within a factor of e^{±0.98} ≈ 0.37×–2.7× of the current value.

6.4 Likelihood Function

For each project, the same logic as Ruby's simulate_project runs in Python. Won projects increase likelihood with higher p_win; Lost projects increase likelihood with lower p_win:

P(\text{outcome} \mid \theta) = p_{\text{win}}^{y} \cdot (1 - p_{\text{win}})^{1-y}

Where y=1 for Won, y=0 for Lost.

6.5 Why Emcee Instead of NUTS?

The earlier NUTS-based approach required reconstructing Ruby's simulation logic into PyTensor's tensor graph — incurring structural costs:

  • MAX function is non-differentiable → LogSumExp approximation needed
  • Thousands of tensor nodes → 3+ minute compilation
  • Model modifications required rebuilding the entire graph

Emcee eliminates all of this. Simulations run as pure Python function calls. Project data is pre-compiled into numpy arrays for 5–10× speedup.

6.6 Architecture: Rails ↔ Python

BayesianAutoTuner.full_report
MCMCService.run(company, tuner_data)
Serialize to JSON
    → python3 lib/mcmc/mcmc_runner.py input.json output.json
     (subprocess, max 300s timeout)
Parse result JSON → merge into report

subprocess provides process isolation: memory released on termination, multi-tenant safety, fault isolation. If MCMC fails, existing Grid Search results return normally — MCMC is additive, never blocking.



Chapter 7. Statistical Validation — Can These Results Be Trusted?

7.1 ROC AUC (Mann-Whitney U)

"Pick one random Won and one random Lost project — the probability that Won's P(Win) is higher."

\text{AUC} = \frac{U}{n_W \times n_L}

AUC         Grade
─────────────────────
≥ 0.90      excellent
0.80–0.89   good
0.70–0.79   fair
0.60–0.69   poor
< 0.60      fail

AUC = 0.50 is a coin flip — the data contains no discriminative information.
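The pairwise formulation translates directly into code; ties count half, and the scores are illustrative:

```python
# ROC AUC via the Mann-Whitney U statistic: count Won/Lost pairs where the
# Won project scores higher (ties count half).  Scores are illustrative.

def roc_auc(won_scores, lost_scores):
    u = 0.0
    for w in won_scores:
        for l in lost_scores:
            if w > l:
                u += 1.0
            elif w == l:
                u += 0.5
    return u / (len(won_scores) * len(lost_scores))

print(roc_auc([0.9, 0.8, 0.6], [0.5, 0.4, 0.7]))   # 8 of 9 pairs ordered correctly
```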

7.2 K-fold Cross-Validation

5-fold CV: train on 4, validate on 1, repeat 5 times.

\text{gap} = \overline{\text{train\_sep}} - \overline{\text{test\_sep}}

Gap         Risk
───────────────────────
< 0.05      low ✅
0.05–0.15   medium ⚠️
> 0.15      high 🚨

gap = 0.30 means separation of 0.40 on training but only 0.10 on validation — that's memorization, not learning.

7.3 Prior α/β Recommendation

From the distribution of completed projects' final P(Win) values, Method of Moments provides:

\alpha = \bar{p}\left(\frac{\bar{p}(1-\bar{p})}{s^2} - 1\right), \quad \beta = (1-\bar{p})\left(\frac{\bar{p}(1-\bar{p})}{s^2} - 1\right)

Clamped to [0.5, 5.0] for safety. Strong Prior slows learning — this is a system design choice.
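A sketch of the recommendation, using Python's statistics module and the clamp from the text; the P(Win) values are illustrative:

```python
from statistics import mean, pvariance

# Method-of-Moments Beta fit from the final P(Win) values of completed
# projects, clamped to [0.5, 5.0] as in the text.  Data is illustrative.

def recommend_prior(p_values, lo=0.5, hi=5.0):
    p_bar, s2 = mean(p_values), pvariance(p_values)
    common = p_bar * (1 - p_bar) / s2 - 1
    alpha = min(max(p_bar * common, lo), hi)
    beta  = min(max((1 - p_bar) * common, lo), hi)
    return alpha, beta

a, b = recommend_prior([0.2, 0.4, 0.5, 0.6, 0.8])
print(round(a, 3), round(b, 3))
```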



Chapter 8. Reading Posterior Results

8.1 HDI (Highest Density Interval)

95% HDI: "The probability that the parameter lies within this range is 95%."

Game Changer: HDI [3.5, 6.2]
→ Current 5.0 is within range → No change needed

Game Changer: HDI [2.0, 3.5]
→ Current 5.0 is outside HDI → Likely overestimated → Recommend ~3.0

Narrow HDI = precise, sufficient data. Wide HDI = uncertainty, more data needed.
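One simple way to compute an HDI from posterior draws is the narrowest-window scan sketched below; the Gaussian draws are illustrative:

```python
import random

# 95% HDI from posterior samples: the narrowest window containing 95% of
# draws.  Sorted-window scan; the Gaussian draws below are illustrative.

def hdi(samples, mass=0.95):
    xs = sorted(samples)
    n = len(xs)
    window = max(1, int(round(mass * n)))
    spans = [(xs[i + window - 1] - xs[i], i) for i in range(n - window + 1)]
    _, i = min(spans)                 # narrowest interval holding `mass`
    return xs[i], xs[i + window - 1]

random.seed(1)
draws = [random.gauss(5.0, 0.6) for _ in range(4000)]
lo, hi = hdi(draws)
print(round(lo, 1), round(hi, 1))    # roughly 5.0 +/- 1.2
```

For a unimodal posterior this coincides with the usual central interval; for skewed posteriors the HDI shifts toward the mode, which is why it is the preferred summary here.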

8.2 R̂ (R-hat) — Convergence Proof

\hat{R} = \sqrt{\frac{\text{between-chain variance} + \text{within-chain variance}}{\text{within-chain variance}}}

R̂           Interpretation
───────────────────────────────
< 1.01      Perfect ✅
1.01–1.05   OK
1.05–1.10   Caution ⚠️
> 1.10      Non-convergence 🚨
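A sketch implementing the formula exactly as written above (a simplified form of the Gelman-Rubin diagnostic); the two-chain examples are illustrative:

```python
from statistics import mean, pvariance

# R-hat as written above: sqrt((between + within) / within), a simplified
# form of the Gelman-Rubin diagnostic.  Chains below are illustrative.

def r_hat(chains):
    chain_means = [mean(c) for c in chains]
    within  = mean(pvariance(c) for c in chains)
    between = pvariance(chain_means)
    return ((between + within) / within) ** 0.5

mixed = [[1.0, 1.2, 0.9, 1.1], [1.1, 0.9, 1.0, 1.2]]   # chains agree
stuck = [[1.0, 1.1, 0.9, 1.0], [3.0, 3.1, 2.9, 3.0]]   # chains disagree
print(round(r_hat(mixed), 2), round(r_hat(stuck), 2))
```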

8.3 Grid Search × MCMC Cross-Referencing

When both agree → recommend with high confidence:

Grid Search: 4.7    MCMC HDI: [3.8, 5.9], mean 4.8 → Strong evidence

When they disagree → prioritize MCMC's broader exploration:

Grid Search: 2.0    MCMC HDI: [3.5, 6.0], mean 4.7 → Grid may be locally trapped


Chapter 9. The Complete Map

9.1 Analysis Pipeline

1. Signal Lift → Genuine signal?
2. Impact Grid → Optimal weight?
3. Dampening Grid → Optimal attenuation?
4. Silence Grid → Optimal silence penalty?
5. T Youden J → Optimal threshold?
6. k Grid Search → Optimal slope?
7. ROC AUC → Overall discriminative power?
8. K-fold CV → Overfitting check?
9. Prior Recommendation → Reasonable starting values?
10. MCMC Posterior → Uncertainty-aware estimation?

These 10 analyses form a single report. Grid Search's intuitiveness, MCMC's precision, cross-validation's safety — three pillars that complement each other.

9.2 Single Simulation Flow

simulate_project(project, overrides)
  ├─ ① α, β = Prior initialization
  ├─ ② for each activity (chronological)
  │    ├─ ③ SWV = stage weight
  │    ├─ ④ Compound Score (MAX + dampening)
  │    ├─ ⑤ Silence Penalty check
  │    └─ ⑥ α += SWV × positive_compound
  │        β += SWV × negative_compound
  └─ ⑦ Return P(Win) = α / (α + β)

This is the complete lifecycle calculation for one project. The Grid Search runs this for every candidate × every project, and the MCMC runs Emcee's 32 walkers × 1,500 steps, each step invoking this same simulation.
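The steps above can be sketched in Python (mirroring, not reproducing, the Ruby simulate_project); the activity fields and the Beta(1, 1) default prior are assumptions for illustration:

```python
# A Python mirror of the simulate_project flow above.  Activity fields and
# the Beta(1, 1) default prior are assumptions for illustration.

def compound(scores, dampening):
    """Step 4: MAX + remaining * dampening (Chapter 4)."""
    if not scores:
        return 0.0
    s = sorted(scores, reverse=True)
    return s[0] + sum(s[1:]) * dampening

def simulate_project(activities, prior=(1.0, 1.0), dampening=0.25):
    alpha, beta = prior                                    # step 1: Prior
    for act in activities:                                 # step 2: chronological
        swv = act["swv"]                                   # step 3: stage weight
        pos = compound(act["positive_scores"], dampening)  # step 4
        neg = compound(act["negative_scores"], dampening)
        silence = act.get("silence_penalty", 0.0)          # step 5
        alpha += swv * pos                                 # step 6
        beta  += swv * neg + silence
    return alpha / (alpha + beta)                          # step 7: P(Win)

acts = [{"swv": 1.0, "positive_scores": [5.0, 1.0], "negative_scores": []},
        {"swv": 1.5, "positive_scores": [0.7], "negative_scores": [0.4]}]
print(round(simulate_project(acts), 3))   # 0.82
```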

9.3 Impedance Impact Simulation

Before pressing Apply, the administrator sees:

Stage           P(Win)   Current Impedance   Recommended   Δ         Count
──────────────────────────────────────────────────────────────────────────
Discovery       21.5%    28.4%               53.5%         ↑25.1%p   15
Qualification   31.7%    30.7%               60.3%         ↑29.6%p   15
Proposal        46.4%    40.8%               74.0%         ↑33.4%p   15

"Current Discovery average impedance is 28%, and applying recommendations raises it to 54%." This means the recommended settings better separate Won from Lost deals.

9.4 Grading System

Combining AUC, CV gap, and separation improvement, the system assigns an overall Grade:

Grade   Condition                  Recommendation
───────────────────────────────────────────────────────
A       AUC ≥ 0.80, gap < 0.05     Strong recommendation
B       AUC ≥ 0.70, gap < 0.10     Recommend with caution
C       AUC ≥ 0.60, gap < 0.15     Directional reference only
D       AUC < 0.60 or gap ≥ 0.15   Do not recommend


Chapter 10. Human-in-the-Loop: Why Humans Decide

10.1 Why Not Auto-Apply?

The Auto-Tuner can find optimal parameters. But the final decision to apply always rests with humans. This is not an engineering oversight — it is a deliberate design philosophy.

  1. Context the system cannot know — Industry shifts, personnel changes, strategy pivots
  2. Responsibility — Parameter changes affect all active deals. Automated changes to hundreds of deals' probability scores without human oversight is organizational irresponsibility
  3. Trust building — Users who understand why values are recommended and choose to apply them build trust in the system. Opaque automation breeds distrust

10.2 The Workflow

Phase     Scope            Apply    MCMC
───────────────────────────────────────────
Phase 1   No analysis      🔒       ❌
Phase 2   Display only     🔒       ❌
Phase 3   Impact, T, k     🔓       ⚠️ (may be unstable)
Phase 4   Full             🔓       ✅
Phase 5   Full             🔓       ✅ (most stable)
  1. System generates a report with recommendations
  2. Administrator reviews each parameter's current vs recommended values
  3. Administrator examines the evidence: Lift, separation improvement, AUC, CV, HDI
  4. Administrator decides: Apply all / Apply selectively / Reject all
  5. Applied values update Signal Master and Stage Master

The system provides evidence. Humans provide judgment. This division of labor is the most robust design for a decision-support system operating in an uncertain world.



Epilogue — Evidence, Not Commands

The Auto-Tuner does not say "do this." It says "here is the evidence, and this is what it suggests."

Behind the question "Can this deal be won?" lies a system that integrates 260 years of statistical heritage — from Bayes's posthumous paper through Robbins's empirical Bayes, Stein's paradox, Efron-Morris's shrinkage estimation, to modern MCMC methods — into a practical decision engine.

Every simulation run, every Lift calculated, every HDI interval drawn is an act of converting decades of a company's accumulated experience — intangible, unstructured, but undeniably real — into the language of mathematics.

And the moment the administrator reviews that evidence and makes a decision, human wisdom and machine computation shake hands.

That handshake is the Auto-Tuner's raison d'être.



References

  1. Bayes, T. (1763). "An Essay towards Solving a Problem in the Doctrine of Chances." Phil. Trans. R. Soc. London, 53, 370-418.
  2. Robbins, H. (1956). "An Empirical Bayes Approach to Statistics." Proc. 3rd Berkeley Symp., 1, 157-163.
  3. James, W. & Stein, C. (1961). "Estimation with Quadratic Loss." Proc. 4th Berkeley Symp., 1, 361-379.
  4. Efron, B. & Morris, C. (1975). "Data Analysis Using Stein's Estimator and its Generalizations." JASA, 70(350), 311-319.
  5. O'Hagan, A. et al. (2006). Uncertain Judgements: Eliciting Experts' Probabilities. Wiley.
  6. Youden, W.J. (1950). "Index for rating diagnostic tests." Cancer, 3(1), 32-35.
  7. Foreman-Mackey, D. et al. (2013). "emcee: The MCMC Hammer." PASP, 125(925), 306-312.
  8. Goodman, J. & Weare, J. (2010). "Ensemble samplers with affine invariance." CAMCS, 5(1), 65-80.
  9. Gelman, A. & Rubin, D.B. (1992). "Inference from Iterative Simulation Using Multiple Sequences." Stat. Sci., 7(4), 457-472.
  10. Cooper, R.G. (2008). "Perspective: The Stage-Gate Idea-to-Launch Process." JPIM, 25(3).
  11. Ghosh, J.K. & Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer.
  12. Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd Ed. CRC Press.