Auto-Tuner Anatomy ⑥: MCMC — The Engine That Knows Uncertainty
While Grid Search finds "the single best point," MCMC estimates "the distribution of all possible points." Dissecting posterior estimation using Emcee Ensemble MCMC, convergence diagnostics, and HDI interpretation.
In the previous part: ⑤ Statistical Validation Anatomy, we covered AUC, K-fold CV, and Prior recommendation. This final part dissects the Auto-Tuner's most advanced tool — MCMC posterior estimation.
1. The Gap Grid Search Left Behind
1.1 The Limitation of a Single Point
Grid Search gives answers like:
"Impact 4.7 improves the separation between Won and Lost by 0.02 over 5.0. Recommendation: 4.7."
This is a point estimate. Like sticking a single pin on a map saying "the treasure is buried here." Useful, but it fails to answer several important questions:
- Could 4.2 also yield nearly the same separation as 4.7?
- Even if 4.7 is best, how confident can we be?
- What about interactions when Impact and dampening are changed simultaneously?
1.2 What MCMC Fills In
MCMC doesn't give a single pin — it draws the entire contour map of probability:
"The ideal range for Impact is 3.8 to 5.9 (95% probability), and the most likely value is 4.7."
This is the posterior distribution. Not "one answer" but all possible answers and their respective probabilities. It is the difference between a doctor saying "your blood pressure is 130" and "your blood pressure is 130, and with 95% probability the true value lies between 125 and 135 — within the normal range."
2. What is MCMC?
2.1 Intuitive Explanation
MCMC (Markov Chain Monte Carlo) is "a method that explores the parameter space, spending more time in good regions." Imagine an explorer walking through fog-covered mountains:
Step 1: Start at the current position
Step 2: Propose a step to a neighboring position
Step 3: If the new position is higher (higher likelihood), move there
Step 4: Even if lower, there's a probability of moving — a safety mechanism against getting trapped in valleys
Step 5: After thousands of repetitions, the accumulated footprints form peaks (the posterior distribution)
Key insight: The density of footprints IS probability. Where the explorer lingers longest means those parameter values are most supported by data.
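The five steps above are the classic Metropolis rule. A minimal single-explorer sketch (pure Python; the 1-D target below is an invented unnormalized "peak" near Impact 4.7, not the production likelihood) shows how footprint density traces the target:

```python
import math
import random

def metropolis(log_target, x0, n_steps, step=0.5, seed=0):
    """Single-explorer MCMC: propose a nearby point, always accept uphill
    moves, and accept downhill moves with probability exp(delta_log_target)."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0, step)          # Step 2: neighboring position
        lp_prop = log_target(proposal)
        if math.log(rng.random()) < lp_prop - lp:  # Steps 3-4: acceptance rule
            x, lp = proposal, lp_prop
        samples.append(x)                          # Step 5: accumulate footprints
    return samples

# Invented target: unnormalized Normal(4.7, 0.6) — a "peak" near Impact 4.7
chain = metropolis(lambda x: -0.5 * ((x - 4.7) / 0.6) ** 2,
                   x0=1.0, n_steps=20000)
posterior_mean = sum(chain[2000:]) / len(chain[2000:])  # discard warm-up
```

The footprints concentrate near 4.7: the sample mean recovers the peak even though the target was never normalized — exactly the property MCMC exploits.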
2.2 Emcee — Collective Intelligence of an Expedition Team
EXAWin uses Emcee (Affine-Invariant Ensemble Sampler) (Foreman-Mackey et al., 2013). While typical MCMC sends one explorer, Emcee sends 32 expedition members simultaneously.
| Method | Analogy | Characteristics |
|---|---|---|
| Random Walk MCMC | Walking alone in fog | Slow, easily trapped in valleys |
| HMC / NUTS | One climber sensing gradients | Fast, but requires gradient computation |
| Emcee | 32 people referencing each other's positions | No gradients needed, collective search |
Each Emcee walker references other walkers' positions to determine its direction. Mathematically, this is called the "stretch move", with key properties:
- No gradient required — Only the value of the likelihood function needs to be computed
- Affine invariant — Automatically adapts to scale differences between parameters (Impact 5.0 vs dampening 0.25)
- Parallel exploration — 32 walkers simultaneously cover the space, reducing the risk of getting trapped in local optima
EXAWin's parameter space is (N+2)-dimensional — N Impact values plus one dampening and one silence parameter (in the standard configuration, 8 Impact values for 10 dimensions total). At this scale, Emcee converges stably within 1,500–3,000 steps.
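To make the stretch move concrete, here is a compact pure-Python ensemble sampler in the spirit of Goodman & Weare (2010) — a sketch, not Emcee itself — run on an invented 2-D target with deliberately mismatched scales: an Impact-like axis near 5.0 and a dampening-like axis near 0.25.

```python
import math
import random

def ensemble_sample(log_prob, ndim, nwalkers=32, nsteps=2000, a=2.0, seed=0):
    """Affine-invariant stretch move: each walker proposes a step along the
    line toward a randomly chosen partner, scaled by z ~ g(z) proportional
    to 1/sqrt(z) on [1/a, a]."""
    rng = random.Random(seed)
    walkers = [[rng.gauss(0.0, 0.1) for _ in range(ndim)] for _ in range(nwalkers)]
    lps = [log_prob(w) for w in walkers]
    chain = []
    for _ in range(nsteps):
        for k in range(nwalkers):
            j = rng.randrange(nwalkers - 1)
            if j >= k:
                j += 1                                     # partner walker != k
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a  # stretch factor
            prop = [walkers[j][d] + z * (walkers[k][d] - walkers[j][d])
                    for d in range(ndim)]
            lp = log_prob(prop)
            # Acceptance includes the z^(ndim-1) volume correction
            if math.log(rng.random()) < (ndim - 1) * math.log(z) + lp - lps[k]:
                walkers[k], lps[k] = prop, lp
        chain.extend(walkers)
    return chain

# Invented target: Normal(5.0, 1.0) x Normal(0.25, 0.05) — a 20x scale gap
lp = lambda x: (-0.5 * (x[0] - 5.0) ** 2
                - 0.5 * ((x[1] - 0.25) / 0.05) ** 2)
chain = ensemble_sample(lp, ndim=2)
tail = chain[len(chain) // 2:]                  # discard warm-up half
means = [sum(x[d] for x in tail) / len(tail) for d in range(2)]
```

Note that no per-axis step size was tuned anywhere — the proposal scale comes from the ensemble's own spread. That is affine invariance in practice.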
3. Probability Model Design
3.1 Parameter Space ((N+2)-dimensional)
# Impact values (8 — No Signal is fixed at 0.1, excluded)
impact_gc ~ LogNormal(log(5.0), 0.5) # Game Changer
impact_str_p ~ LogNormal(log(1.0), 0.5) # Strong Affirmation
impact_mod_p ~ LogNormal(log(0.7), 0.5) # Moderate Affirmation
impact_weak_p ~ LogNormal(log(0.4), 0.5) # Weak Affirmation
impact_str_n ~ LogNormal(log(1.0), 0.5) # Strong Negation
impact_mod_n ~ LogNormal(log(0.7), 0.5) # Moderate Negation
impact_weak_n ~ LogNormal(log(0.4), 0.5) # Weak Negation
impact_gc_n ~ LogNormal(log(5.0), 0.5) # Game Changer (Negative)
# Attenuation rate — 0~1 range
dampening ~ Beta(5, 15) # mean ≈ 0.25
# Silence ratio — 0~1 range
silence_ratio ~ Beta(3, 7) # mean ≈ 0.30
3.2 Why LogNormal?
Impact must always be positive — negative Impact has no physical meaning. LogNormal naturally satisfies this constraint:
- Always positive — never produces negative values
- Long right tail — adequately explores values larger than current
- Symmetric in log scale — multiplicative relationships of Impact are intuitive
σ=0.5 means: one standard deviation in log space is a factor of e^0.5 ≈ 1.65, so the prior's central 95% spans roughly 0.37× to 2.7× the current value. Wide enough for exploration while suppressing extreme values.
3.3 Why Beta?
dampening and silence_ratio are ratios between 0 and 1. Beta distribution naturally models this finite interval.
Beta(5, 15): mean = 5/(5+15) = 0.25
→ 95% of the prior mass lies in roughly [0.08, 0.47]
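A quick seeded Monte Carlo check (stdlib only) of what these two priors actually cover — the 95% ranges fall out directly from the samples:

```python
import math
import random

rng = random.Random(42)
N = 200_000

# LogNormal(log 5.0, 0.5): the Game Changer Impact prior
impact = sorted(rng.lognormvariate(math.log(5.0), 0.5) for _ in range(N))
# Beta(5, 15): the dampening prior
damp = sorted(rng.betavariate(5.0, 15.0) for _ in range(N))

def central_95(xs):
    """Empirical central 95% interval from sorted samples."""
    return xs[int(0.025 * len(xs))], xs[int(0.975 * len(xs))]

impact_lo, impact_hi = central_95(impact)  # ≈ (1.9, 13.3): 0.37x-2.7x of 5.0
damp_lo, damp_hi = central_95(damp)        # ≈ (0.09, 0.46)
```

Both priors are wide — they nudge the walkers toward the current configuration without pinning them there.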
The Prior configuration expresses "a weak belief that the currently used value is reasonable." If data contradicts this belief, the posterior overwhelms the Prior and shifts — this is the essence of Bayesian learning.
3.4 Likelihood Function
The engine of MCMC is the likelihood function. For each project, the same logic as Ruby's simulate_project is executed in Python:
for project in projects:
    alpha, beta = prior_alpha, prior_beta
    for activity in project.activities:
        # Compound Score (MAX + dampening): top signal at full weight, rest attenuated
        compound_pos = max(positive) + (sum(positive) - max(positive)) * dampening
        compound_neg = max(negative) + (sum(negative) - max(negative)) * dampening
        alpha += SWV * compound_pos
        beta  += SWV * compound_neg
        beta  += silence_penalty(gap, interval, silence_ratio)
    p_win = alpha / (alpha + beta)
    # Won → higher p_win raises likelihood; Lost → lower p_win raises likelihood
    outcome ~ Bernoulli(p_win)
Bernoulli likelihood meaning:
Won (y=1) with p_win = 0.8 → likelihood 0.8 (high — these parameters match reality). Lost (y=0) with p_win = 0.2 → likelihood 1 − 0.2 = 0.8 (also high).
The more the parameters align with reality, the higher the likelihood, and Emcee's walkers naturally congregate in these high-likelihood regions.
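As a runnable illustration of this likelihood — with toy data and a simplified silence penalty, both invented for the sketch, not the production logic:

```python
import math

# Toy projects: (activities, outcome), where each activity is
# (positive_scores, negative_scores, gap_days). All numbers are invented.
PROJECTS = [
    ([([2.0, 0.7], [0.4], 3)], 1),   # Won: strong positives, short gap
    ([([0.4], [2.0, 0.7], 10)], 0),  # Lost: strong negatives, long gap
]

def compound(scores, dampening):
    """MAX + dampening: strongest signal at full weight, the rest attenuated."""
    if not scores:
        return 0.0
    top = max(scores)
    return top + (sum(scores) - top) * dampening

def log_likelihood(dampening, silence_ratio, swv=1.0,
                   prior_alpha=1.0, prior_beta=1.0, interval=7):
    ll = 0.0
    for activities, outcome in PROJECTS:
        alpha, beta = prior_alpha, prior_beta
        for pos, neg, gap in activities:
            alpha += swv * compound(pos, dampening)
            beta += swv * compound(neg, dampening)
            # Simplified silence penalty: gaps beyond the expected
            # contact interval push beta up
            beta += silence_ratio * max(0.0, gap - interval) / interval
        p_win = alpha / (alpha + beta)
        # Bernoulli log-likelihood of the observed outcome
        ll += math.log(p_win if outcome == 1 else 1.0 - p_win)
    return ll

ll = log_likelihood(dampening=0.25, silence_ratio=0.3)
```

Walkers proposing parameter values that push p_win toward each project's actual outcome receive a higher ll and are accepted more often — that is the congregation in high-likelihood regions.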
4. Emcee's Strength: Gradient-Free
4.1 Why Gradients Were Problematic
The previous NUTS-based approach required automatic computation of the likelihood function's partial derivatives (gradients). This meant rebuilding Ruby's simulation logic into PyTensor's tensor graph, incurring structural costs:
- The MAX function is non-differentiable — a LogSumExp approximation was needed
- The for-loops across projects × activities × signals create thousands of tensor nodes
- C code compilation time grows with node count — 3+ minutes per run
4.2 Emcee's Solution
Emcee doesn't use gradients. Only the value of the likelihood needs to be computed. This brings decisive advantages:
- Ruby simulation logic can be directly ported — No need for tensor conversion or differentiability concerns
- The MAX function is used as-is — the original Compound Score, no LogSumExp approximation
- Compilation step eliminated — pure Python function calls execute immediately
- Project data pre-compiled into numpy arrays — Array indexing instead of dict lookups gives 5–10× speedup
# What Emcee needs: just this
def log_posterior(theta):
    return log_prior(theta) + log_likelihood(theta)

# log_likelihood calls simulate_project() and sums the Bernoulli log-probabilities
# No gradient computation, no tensor graph, no compilation
5. Execution Architecture
5.1 Rails ↔ Python Communication
BayesianAutoTuner.full_report
↓ If Phase ≥ 3
MCMCService.run(company, tuner_data)
↓
Serialize project data to JSON
↓
python3 lib/mcmc/mcmc_runner.py input.json output.json
↓ (subprocess, max 300s timeout)
Read result JSON → merge into report
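On the Python side, the entry point only needs to read one JSON file and write another. A minimal skeleton of what a runner like lib/mcmc/mcmc_runner.py could look like — the run_mcmc call is a hypothetical placeholder for the actual Emcee sampling, and the JSON keys shown are illustrative:

```python
import json
import sys

def main(argv):
    """Usage: python3 mcmc_runner.py input.json output.json"""
    in_path, out_path = argv[1], argv[2]
    with open(in_path) as f:
        payload = json.load(f)  # serialized project data from Rails
    # result = run_mcmc(payload)  # hypothetical: the Emcee sampling itself
    result = {"available": True,
              "ndim": len(payload.get("parameters", {}))}
    with open(out_path, "w") as f:
        json.dump({"mcmc": result}, f)

# Guarded so importing (or running without arguments) is harmless
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

Rails only ever sees two files and an exit code — which is exactly what makes the subprocess boundary easy to sandbox and to time out.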
5.2 Why Subprocess?
| Approach | Pros | Cons |
|---|---|---|
| API server (separate container) | Caching, scaling | Cost, auth, CORS |
| subprocess | Simple, isolated, stateless | Process creation overhead (negligible) |
| Direct Ruby implementation | No dependencies | Emcee implementation unrealistic |
subprocess provides process isolation:
- Memory safety — All memory released upon process termination, no leak concerns
- Multi-tenant safety — Each company runs in an independent Python process
- Fault isolation — MCMC failure does not affect the Rails server
5.3 Graceful Fallback
If Emcee is not installed or execution fails:
def run_mcmc_analysis
  MCMCService.run(@company, tuner_data)
rescue => e
  Rails.logger.warn("[AutoTuner] MCMC skipped: #{e.message}")
  nil # report[:mcmc] = nil → MCMC section not displayed in UI
end
Existing Grid Search results always return normally. MCMC is an "additional analysis tool" — the Auto-Tuner is fully functional without it.
6. Interpreting Results
6.1 Output Format
{
  "mcmc": {
    "available": true,
    "converged": true,
    "r_hat_max": 1.01,
    "samples": 1500,
    "warmup": 500,
    "runtime_seconds": 15.2,
    "sampler": "emcee",
    "nwalkers": 32,
    "ndim": 10,
    "parameters": {
      "Game Changer": {
        "type": "impact",
        "mean": 4.8,
        "sd": 0.7,
        "hdi_95": [3.5, 6.2],
        "r_hat": 1.002,
        "current": 5.0
      },
      "dampening": {
        "type": "dampening",
        "mean": 0.22,
        "sd": 0.08,
        "hdi_95": [0.09, 0.38],
        "r_hat": 1.001,
        "current": 0.25
      }
    }
  }
}
6.2 HDI — The Language of Uncertainty
HDI (Highest Density Interval) 95% means "the probability that the parameter lies within this range is 95%." Like a doctor saying "your blood pressure is 130, and with 95% probability the true value lies between 125 and 135."
Game Changer: HDI [3.5, 6.2]
→ "Anything between 3.5 and 6.2 is reasonable. Current value 5.0 is within range, so no change needed."
Game Changer: HDI [2.0, 3.5]
→ "Current value 5.0 is outside the HDI. Likely overestimated. Recommend adjusting to around 3.0."
Narrower HDI means precise estimation — enough data. Wider HDI means uncertainty — insufficient data or the parameter has little impact on outcomes.
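For intuition: the HDI is simply the narrowest window that holds 95% of the posterior samples. A small sliding-window sketch, using invented Gaussian samples as a stand-in posterior:

```python
import random

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    xs = sorted(samples)
    k = int(round(mass * len(xs)))                          # points the window must hold
    spans = [(xs[i + k - 1] - xs[i], i) for i in range(len(xs) - k + 1)]
    width, i = min(spans)                                   # keep the tightest window
    return xs[i], xs[i + k - 1]

# Stand-in posterior: Normal(4.7, 0.6) draws for an Impact-like parameter
rng = random.Random(7)
draws = [rng.gauss(4.7, 0.6) for _ in range(20000)]
lo, hi = hdi(draws)   # ≈ (3.5, 5.9): compare the current value against this range
```

For a symmetric posterior the HDI coincides with the central interval; for a skewed one it shifts toward the bulk of the mass, which is why it is the preferred summary for LogNormal-shaped parameters.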
6.3 R̂ (R-hat) — Proof of Convergence
R̂ measures "whether expedition teams starting from different origins reached the same conclusion."
| R̂ | Interpretation |
|---|---|
| < 1.01 | Perfect convergence ✅ |
| 1.01 ~ 1.05 | Convergence OK |
| 1.05 ~ 1.10 | Caution needed ⚠️ |
| > 1.10 | Non-convergence 🚨 Results untrustworthy |
If R̂ > 1.05, converged: false is displayed, and the administrator sees a warning: "Use MCMC results for reference only." Since Emcee's 32 walkers are split into two groups to compute R̂, convergence diagnostics are reliable.
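R̂ itself is straightforward to compute from the chains — pooled variance over mean within-chain variance. A compact Gelman–Rubin sketch (pure Python; the chains below are invented):

```python
import math
import random

def r_hat(chains):
    """Gelman-Rubin R-hat for a list of equal-length scalar chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    # Pooled variance estimate vs. within-chain variance
    return math.sqrt(((n - 1) / n * w + b / n) / w)

rng = random.Random(3)
# Four chains sampling the same distribution → R-hat ≈ 1.00
good = [[rng.gauss(4.7, 0.6) for _ in range(1000)] for _ in range(4)]
# Four chains stuck around different modes → R-hat well above 1.10
bad = [[rng.gauss(mu, 0.6) for _ in range(1000)] for mu in (2.0, 4.7, 4.7, 7.0)]
```

Splitting each walker's chain in half before this computation (the split-R̂ variant) additionally catches a single chain that drifts over time rather than mixing.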
7. Grid Search and MCMC — Two Perspectives Intersecting
Good diagnostics never depend on a single test. Grid Search and MCMC illuminate the same problem from different angles:
| Property | Grid Search | MCMC |
|---|---|---|
| Result form | Point estimate (single value) | Distribution (range + probability) |
| Interaction | 1-D independent | (N+2)-dimensional simultaneous |
| Explainability | ✅ "Maximum at this value" | ⚠️ "Shape of the distribution" |
| Speed | < 1 second | 15–30 seconds |
| Overfitting risk | Present (verified by K-fold) | Low (Prior regularizes) |
When both results agree — recommend with high confidence:
Grid Search: Impact 4.7 recommended
MCMC: HDI [3.8, 5.9], mean 4.8
→ Both methods point the same direction → Strong evidence
When results disagree — prioritize MCMC's HDI:
Grid Search: Impact 2.0 recommended (separation +0.03)
MCMC: HDI [3.5, 6.0], mean 4.7
→ Grid Search may have fallen into a local optimum
→ MCMC's broader exploration is more trustworthy
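The agree/disagree rule above reduces to a tiny decision function. A sketch of the policy only — the field names and return shape are hypothetical, not the Auto-Tuner's actual API:

```python
def reconcile(grid_best, mcmc_mean, hdi_lo, hdi_hi):
    """Cross-check a Grid Search point estimate against the MCMC HDI."""
    if hdi_lo <= grid_best <= hdi_hi:
        # Both methods point the same direction: strong evidence
        return {"recommend": grid_best, "confidence": "high"}
    # Grid Search may sit in a local optimum: fall back to the posterior mean
    return {"recommend": mcmc_mean, "confidence": "prefer_mcmc_hdi"}

agree = reconcile(4.7, 4.8, 3.8, 5.9)     # Grid value inside the HDI
disagree = reconcile(2.0, 4.7, 3.5, 6.0)  # Grid value outside the HDI
```

The key asymmetry: a point estimate can land in a local optimum, but an HDI that excludes it is evidence from the whole explored space.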
8. The Complete Auto-Tuner Map
① Signal Lift → Is the signal genuinely significant?
② Impact Grid → What's the optimal weight?
③ T (Youden's J) → What's the optimal threshold?
④ k Grid Search → What's the optimal slope?
⑤ Dampening Grid → What's the optimal attenuation rate?
⑥ Silence Grid → What's the optimal silence penalty?
⑦ AUC → What's the overall discriminative power?
⑧ K-fold CV → Is there overfitting?
⑨ Prior Recommendation → Are the initial values reasonable?
⑩ MCMC → Estimation including parameter uncertainty
These 10 analyses come together to form a single report.
Grid Search's intuitiveness, MCMC's precision, and cross-validation's safety — these three pillars complement each other, constituting a data-driven, trustworthy parameter recommendation system. And the final decision to apply always rests with humans — what the system provides is "evidence," not "commands."
References
- Foreman-Mackey, D., Hogg, D.W., Lang, D., & Goodman, J. (2013). "emcee: The MCMC Hammer." Publications of the Astronomical Society of the Pacific, 125(925), 306-312.
- Goodman, J. & Weare, J. (2010). "Ensemble samplers with affine invariance." Communications in Applied Mathematics and Computational Science, 5(1), 65-80.
- Gelman, A. & Rubin, D.B. (1992). "Inference from Iterative Simulation Using Multiple Sequences." Statistical Science, 7(4), 457-472.