DOCUMENTATION

Auto-Tuner Anatomy ⑥: MCMC — The Engine That Knows Uncertainty

While Grid Search finds "the single best point," MCMC estimates "the distribution of all possible points." Dissecting posterior estimation using Emcee Ensemble MCMC, convergence diagnostics, and HDI interpretation.

In the previous part: ⑤ Statistical Validation Anatomy, we covered AUC, K-fold CV, and Prior recommendation. This final part dissects the Auto-Tuner's most advanced tool — MCMC posterior estimation.



1. The Gap Grid Search Left Behind

1.1 The Limitation of a Single Point

Grid Search gives answers like:

"Impact 4.7 improves the separation between Won and Lost by 0.02 over 5.0. Recommendation: 4.7."

This is a point estimate. Like sticking a single pin on a map saying "the treasure is buried here." Useful, but it fails to answer several important questions:

  • Could 4.2 also yield nearly the same separation as 4.7?
  • Even if 4.7 is best, how confident can we be?
  • What about interactions when Impact and dampening are changed simultaneously?

1.2 What MCMC Fills In

MCMC doesn't give a single pin — it draws the entire contour map of probability:

"The ideal range for Impact is 3.8 to 5.9 (95% probability), and the most likely value is 4.7."

This is the posterior distribution. Not "one answer" but all possible answers and their respective probabilities. The difference between a doctor saying "your blood pressure is 130" and "your blood pressure is 130, with a 95% confidence interval of 125–135, which is within the normal range."



2. What is MCMC?

2.1 Intuitive Explanation

MCMC (Markov Chain Monte Carlo) is "a method that explores the parameter space, spending more time in good regions." Imagine an explorer walking through fog-covered mountains:

Step 1: Start at the current position
Step 2: Propose a step to a neighboring position
Step 3: If the new position is higher (higher likelihood), move there
Step 4: Even if lower, there's a probability of moving — a safety mechanism against getting trapped in valleys
Step 5: After thousands of repetitions, the accumulated footprints form peaks (the posterior distribution)

Key insight: The density of footprints IS probability. Where the explorer lingers longest means those parameter values are most supported by data.
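The five steps above can be sketched as a minimal Metropolis-Hastings walker — a simpler, single-explorer relative of the ensemble sampler used here. The target below is a stand-in standard normal log-density, not EXAWin's real likelihood:

```python
import math
import random

def log_posterior(x):
    # Stand-in target: standard normal log-density (up to a constant).
    return -0.5 * x * x

def metropolis(n_steps=20000, step_size=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0                      # Step 1: start at the current position
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)          # Step 2: propose a neighbor
        log_ratio = log_posterior(proposal) - log_posterior(x)
        # Step 3: uphill moves always accepted; Step 4: downhill moves
        # accepted with probability exp(log_ratio) to escape valleys
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)        # Step 5: footprints accumulate into the posterior
    return samples

samples = metropolis()
burn = samples[2000:]            # discard warm-up footprints
mean = sum(burn) / len(burn)
var = sum((s - mean) ** 2 for s in burn) / len(burn)
# the footprint density recovers the target: mean near 0, variance near 1
```

The footprint histogram of `burn` approximates the target density — exactly the "density of footprints IS probability" idea.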

2.2 Emcee — Collective Intelligence of an Expedition Team

EXAWin uses Emcee (Affine-Invariant Ensemble Sampler) (Foreman-Mackey et al., 2013). While typical MCMC sends one explorer, Emcee sends 32 expedition members simultaneously.

Method            Analogy                                        Characteristics
Random Walk MCMC  Walking alone in fog                           Slow, easily trapped in valleys
HMC / NUTS        One climber sensing gradients                  Fast, but requires gradient computation
Emcee             32 people referencing each other's positions   No gradients needed, collective search

Each Emcee walker references other walkers' positions to determine its direction. Mathematically, this is called the "stretch move", with key properties:

  1. No gradient required — Only the value of the likelihood function needs to be computed
  2. Affine invariant — Automatically adapts to scale differences between parameters (Impact 5.0 vs dampening 0.25)
  3. Parallel exploration — 32 walkers simultaneously cover the space, reducing the risk of getting trapped in local optima

EXAWin's parameter space is (N+2)-dimensional — N Impact values plus dampening and silence_ratio (standard configuration: N = 8, 10 dimensions total). At this scale, Emcee converges stably within 1,500–3,000 steps.
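A bare-bones version of the stretch move (after Goodman & Weare, 2010) shows why no gradient is needed: each walker evaluates only log-posterior values at a proposal interpolated through a companion walker. This toy targets a 1-D standard normal, not the real (N+2)-dimensional likelihood:

```python
import math
import random

def log_post(x):
    # stand-in log-posterior: standard normal (up to a constant)
    return -0.5 * x * x

def stretch_move_sample(n_walkers=32, n_steps=2000, a=2.0, seed=1):
    rng = random.Random(seed)
    walkers = [rng.uniform(-1.0, 1.0) for _ in range(n_walkers)]
    samples = []
    dim = 1
    for step in range(n_steps):
        for k in range(n_walkers):
            # pick a companion walker j != k
            j = rng.randrange(n_walkers - 1)
            if j >= k:
                j += 1
            # draw the stretch factor z from g(z) ∝ 1/sqrt(z) on [1/a, a]
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            proposal = walkers[j] + z * (walkers[k] - walkers[j])
            # acceptance uses only log-posterior *values* — no gradients anywhere
            log_accept = (dim - 1) * math.log(z) + log_post(proposal) - log_post(walkers[k])
            if rng.random() < math.exp(min(0.0, log_accept)):
                walkers[k] = proposal
        if step >= n_steps // 2:      # keep the second half as posterior draws
            samples.extend(walkers)
    return samples

draws = stretch_move_sample()
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

The real emcee library additionally splits walkers into two halves for parallel updates; `emcee.EnsembleSampler(nwalkers, ndim, log_prob_fn)` wraps all of this behind one call.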



3. Probability Model Design

3.1 Parameter Space ((N+2)-dimensional)

# Impact values (8 — No Signal is fixed at 0.1, excluded)
impact_gc     ~ LogNormal(log(5.0), 0.5)   # Game Changer
impact_str_p  ~ LogNormal(log(1.0), 0.5)   # Strong Affirmation
impact_mod_p  ~ LogNormal(log(0.7), 0.5)   # Moderate Affirmation
impact_weak_p ~ LogNormal(log(0.4), 0.5)   # Weak Affirmation
impact_str_n  ~ LogNormal(log(1.0), 0.5)   # Strong Negation
impact_mod_n  ~ LogNormal(log(0.7), 0.5)   # Moderate Negation
impact_weak_n ~ LogNormal(log(0.4), 0.5)   # Weak Negation
impact_gc_n   ~ LogNormal(log(5.0), 0.5)   # Game Changer (Negative)

# Attenuation rate — 0~1 range
dampening     ~ Beta(5, 15)                # mean ≈ 0.25

# Silence ratio — 0~1 range
silence_ratio ~ Beta(3, 7)                 # mean ≈ 0.30

3.2 Why LogNormal?

Impact must always be positive — negative Impact has no physical meaning. LogNormal naturally satisfies this constraint:

  • Always positive — never produces negative values
  • Long right tail — adequately explores values larger than current
  • Symmetric in log scale — multiplicative relationships of Impact are intuitive

σ = 0.5 in log space means: roughly ×0.6 to ×1.65 of the median at one standard deviation, and roughly ×0.38 to ×2.7 with 95% probability. Wide enough for exploration while suppressing extreme values.
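These properties are easy to verify by simulation. A quick check using the Game Changer prior LogNormal(log(5.0), 0.5); the quantile figures in the comments follow from the standard LogNormal formulas:

```python
import math
import random

rng = random.Random(42)
mu, sigma = math.log(5.0), 0.5          # Game Changer prior from above
draws = sorted(rng.lognormvariate(mu, sigma) for _ in range(50000))

assert min(draws) > 0                   # always positive, by construction

median = draws[len(draws) // 2]         # ≈ exp(mu) = 5.0
q025 = draws[int(0.025 * len(draws))]   # ≈ 5.0 * exp(-1.96 * 0.5) ≈ 1.9
q975 = draws[int(0.975 * len(draws))]   # ≈ 5.0 * exp(+1.96 * 0.5) ≈ 13.3
# the right tail is long: the 97.5% quantile sits much farther above the
# median (×2.7) than the 2.5% quantile sits below it (×0.38)
```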

3.3 Why Beta?

dampening and silence_ratio are ratios between 0 and 1. Beta distribution naturally models this finite interval.

Beta(5, 15): mean = 5/(5+15) = 0.25
95% of the probability mass lies in roughly [0.08, 0.47]

The Prior configuration expresses "a weak belief that the currently used value is reasonable." If data contradicts this belief, the posterior overwhelms the Prior and shifts — this is the essence of Bayesian learning.
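The same simulation check works for the Beta priors, here with the dampening prior Beta(5, 15):

```python
import random

rng = random.Random(7)
draws = sorted(rng.betavariate(5, 15) for _ in range(50000))

mean = sum(draws) / len(draws)          # ≈ 5 / (5 + 15) = 0.25
q025 = draws[int(0.025 * len(draws))]   # ≈ 0.09
q975 = draws[int(0.975 * len(draws))]   # ≈ 0.46

# every draw stays inside (0, 1): Beta cannot propose an invalid ratio
assert 0.0 < draws[0] and draws[-1] < 1.0
```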

3.4 Likelihood Function

The engine of MCMC is the likelihood function. For each project, the same logic as Ruby's simulate_project is executed in Python:

for project in projects:
    alpha, beta = prior_alpha, prior_beta

    for activity in project.activities:
        # Compound Score (MAX + dampening): strongest signal at full weight,
        # remaining signals attenuated by dampening
        compound_pos = max(positive) + sum(rest_positive) * dampening
        compound_neg = max(negative) + sum(rest_negative) * dampening

        alpha += SWV * compound_pos
        beta  += SWV * compound_neg
        beta  += silence_penalty(gap, interval, silence_ratio)

    p_win = alpha / (alpha + beta)

    # Won → higher p_win increases likelihood; Lost → lower p_win increases likelihood
    outcome ~ Bernoulli(p_win)

Bernoulli likelihood meaning:

P(\text{outcome} \mid \theta) = p_{\text{win}}^{y} \cdot (1 - p_{\text{win}})^{1-y}

Won (y = 1) with p_win = 0.8: 0.8^1 = 0.8 (high likelihood — these parameters match reality)
Lost (y = 0) with p_win = 0.2: (1 - 0.2)^1 = 0.8 (also high likelihood)

The more the parameters align with reality, the higher the likelihood, and Emcee's walkers naturally congregate in these high-likelihood regions.
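A toy calculation makes "parameters that match reality score higher" concrete. The p_win values below are made up for illustration, not EXAWin output:

```python
import math

def bernoulli_log_lik(p_win, outcome):
    # outcome: 1 = Won, 0 = Lost
    return outcome * math.log(p_win) + (1 - outcome) * math.log(1.0 - p_win)

outcomes = [1, 1, 0, 1, 0]            # actual project results
sharp    = [0.8, 0.7, 0.2, 0.9, 0.3]  # parameters whose predictions track reality
blurry   = [0.5, 0.5, 0.5, 0.5, 0.5]  # parameters that predict nothing

ll_sharp  = sum(bernoulli_log_lik(p, y) for p, y in zip(sharp, outcomes))
ll_blurry = sum(bernoulli_log_lik(p, y) for p, y in zip(blurry, outcomes))
# ll_sharp > ll_blurry — walkers drift toward the parameters behind `sharp`
```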



4. Emcee's Strength: Gradient-Free

4.1 Why Gradients Were Problematic

The previous NUTS-based approach required automatic computation of the likelihood function's partial derivatives (gradients). This meant rebuilding Ruby's simulation logic into PyTensor's tensor graph, incurring structural costs:

  • MAX function is non-differentiable — LogSumExp approximation needed
  • The for-loops across projects × activities × signals create thousands of tensor nodes
  • C-code compilation time grows with the node count, taking 3+ minutes

4.2 Emcee's Solution

Emcee doesn't use gradients. Only the value of the likelihood needs to be computed. This brings decisive advantages:

  1. Ruby simulation logic can be directly ported — No need for tensor conversion or differentiability concerns
  2. MAX function used as-is — Original Compound Score without LogSumExp approximation
  3. Compilation step eliminated — Pure Python function calls execute immediately
  4. Project data pre-compiled into numpy arrays — Array indexing instead of dict lookups gives 5–10× speedup

# What Emcee needs: just this
def log_posterior(theta):
    return log_prior(theta) + log_likelihood(theta)

# log_likelihood calls simulate_project() and sums Bernoulli probabilities
# No gradient computation, no tensor graph, no compilation


5. Execution Architecture

5.1 Rails ↔ Python Communication

BayesianAutoTuner.full_report
└─ if Phase 3:
   MCMCService.run(company, tuner_data)
   ├─ 1. Serialize project data to JSON
   ├─ 2. python3 lib/mcmc/mcmc_runner.py input.json output.json
   │      (subprocess, max 300 s timeout)
   └─ 3. Read result JSON → merge into report
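The subprocess contract is just "JSON in, JSON out". A minimal sketch of such a runner's entry point — the key names and the stubbed result here are illustrative assumptions, not the actual mcmc_runner.py:

```python
import json
import sys

def run(input_path, output_path):
    # Rails serializes projects + current parameters into input.json
    with open(input_path) as f:
        payload = json.load(f)

    # ... the real runner would build log_posterior and run emcee here;
    # stubbed result so the I/O contract is visible end to end
    result = {
        "mcmc": {
            "available": True,
            "sampler": "emcee",
            "n_projects": len(payload.get("projects", [])),
        }
    }

    # Rails reads output.json back after the subprocess exits
    with open(output_path, "w") as f:
        json.dump(result, f)

if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

Because all state flows through the two files, the process can be killed at the 300 s timeout without leaving anything behind in Rails.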

5.2 Why Subprocess?

Approach                         Pros                         Cons
API server (separate container)  Caching, scaling             Cost, auth, CORS
subprocess                       Simple, isolated, stateless  Process creation overhead (negligible)
Direct Ruby implementation       No dependencies              Emcee implementation unrealistic

subprocess provides process isolation:

  • Memory safety — All memory released upon process termination, no leak concerns
  • Multi-tenant safety — Each company runs in an independent Python process
  • Fault isolation — MCMC failure does not affect the Rails server

5.3 Graceful Fallback

If Emcee is not installed or execution fails:

def run_mcmc_analysis
  MCMCService.run(@company, tuner_data)
rescue => e
  Rails.logger.warn("[AutoTuner] MCMC skipped: #{e.message}")
  nil  # report[:mcmc] = nil → MCMC section not displayed in UI
end

Existing Grid Search results always return normally. MCMC is an "additional analysis tool" — the Auto-Tuner is fully functional without it.



6. Interpreting Results

6.1 Output Format

{
  "mcmc": {
    "available": true,
    "converged": true,
    "r_hat_max": 1.01,
    "samples": 1500,
    "warmup": 500,
    "runtime_seconds": 15.2,
    "sampler": "emcee",
    "nwalkers": 32,
    "ndim": 10,
    "parameters": {
      "Game Changer": {
        "type": "impact",
        "mean": 4.8,
        "sd": 0.7,
        "hdi_95": [3.5, 6.2],
        "r_hat": 1.002,
        "current": 5.0
      },
      "dampening": {
        "type": "dampening",
        "mean": 0.22,
        "sd": 0.08,
        "hdi_95": [0.09, 0.38],
        "r_hat": 1.001,
        "current": 0.25
      }
    }
  }
}

6.2 HDI — The Language of Uncertainty

HDI (Highest Density Interval) 95% means "the probability that the parameter lies within this range is 95%." Like a doctor saying "your blood pressure is 130, with a confidence interval of 125–135."

Game Changer: HDI [3.5, 6.2]
"Anything between 3.5 and 6.2 is reasonable. Current value 5.0 is within range, so no change needed."

Game Changer: HDI [2.0, 3.5]
"Current value 5.0 is outside the HDI. Likely overestimated. Recommend adjusting to around 3.0."

Narrower HDI means precise estimation — enough data. Wider HDI means uncertainty — insufficient data or the parameter has little impact on outcomes.
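Computing a 95% HDI from posterior draws is a small exercise: find the shortest window containing 95% of the sorted samples (for unimodal posteriors this matches the usual ArviZ-style estimate). A sketch, fed with synthetic draws shaped like the Game Changer posterior above:

```python
import math
import random

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (unimodal case)."""
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))       # number of points inside the interval
    # slide a window of k points and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]

rng = random.Random(0)
draws = [rng.gauss(4.8, 0.7) for _ in range(20000)]   # synthetic posterior draws
lo, hi = hdi(draws)
# for this symmetric posterior the HDI ≈ mean ± 1.96 sd ≈ [3.4, 6.2]
```

For skewed posteriors (LogNormal-shaped Impacts) the HDI deliberately shifts toward the dense side, which is why it is preferred over simple equal-tail quantiles here.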

6.3 R̂ (R-hat) — Proof of Convergence

R̂ measures "whether expedition teams starting from different origins reached the same conclusion."

\hat{R} = \sqrt{\frac{\text{between-chain variance} + \text{within-chain variance}}{\text{within-chain variance}}}

R̂ range      Interpretation
< 1.01       Perfect convergence
1.01 – 1.05  Convergence OK
1.05 – 1.10  Caution needed ⚠️
> 1.10       Non-convergence 🚨 Results untrustworthy

If R̂ > 1.05, converged: false is displayed, and the administrator sees a warning: "Use MCMC results for reference only." Since Emcee's 32 walkers are split into two groups to compute R̂, convergence diagnostics are reliable.
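A simplified R̂ in the spirit of the formula above can be written in a few lines (the production diagnostic follows Gelman & Rubin, 1992, with refinements omitted here). Two chains that explored the same region agree; a stuck chain blows the statistic up:

```python
import random
import statistics

def r_hat(chains):
    """Simplified R-hat over a list of chains (each a list of draws)."""
    within = statistics.mean(statistics.pvariance(c) for c in chains)
    between = statistics.pvariance([statistics.mean(c) for c in chains])
    return ((between + within) / within) ** 0.5

rng = random.Random(3)
# two chains sampling the same posterior → between-chain variance ≈ 0, R-hat ≈ 1.00
good = [[rng.gauss(0, 1) for _ in range(5000)] for _ in range(2)]
# one chain stuck in a different region → R-hat well past the 1.10 alarm line
bad = [[rng.gauss(0, 1) for _ in range(5000)],
       [rng.gauss(3, 1) for _ in range(5000)]]
```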



7. Grid Search and MCMC — Two Perspectives Intersecting

Good diagnostics never depend on a single test. Grid Search and MCMC illuminate the same problem from different angles:

Property          Grid Search                     MCMC
Result form       Point estimate (single value)   Distribution (range + probability)
Interaction       1-D independent                 (N+2)-dimensional simultaneous
Explainability    ✅ "Maximum at this value"       ⚠️ "Shape of the distribution"
Speed             < 1 second                      15–30 seconds
Overfitting risk  Present (verified by K-fold)    Low (Prior regularizes)

When both results agree — recommend with high confidence:

Grid Search: Impact 4.7 recommended
MCMC: HDI [3.8, 5.9], mean 4.8
Both methods point the same direction → Strong evidence

When results disagree — prioritize MCMC's HDI:

Grid Search: Impact 2.0 recommended (separation +0.03)
MCMC: HDI [3.5, 6.0], mean 4.7
Grid Search may have fallen into a local optimum
MCMC's broader exploration is more trustworthy


8. The Complete Auto-Tuner Map

① Signal Lift            Is the signal genuinely significant?
② Impact Grid            What's the optimal weight?
③ Threshold (Youden J)   What's the optimal threshold?
④ k Grid Search          What's the optimal slope?
⑤ Dampening Grid         What's the optimal attenuation rate?
⑥ Silence Grid           What's the optimal silence penalty?
⑦ AUC                    What's the overall discriminative power?
⑧ K-fold CV              Is there overfitting?
⑨ Prior Recommendation   Are the initial values reasonable?
⑩ MCMC                   Estimation including parameter uncertainty

These 10 analyses come together to form a single report.

Grid Search's intuitiveness, MCMC's precision, and cross-validation's safety — these three pillars complement each other, constituting a data-driven, trustworthy parameter recommendation system. And the final decision to apply always rests with humans — what the system provides is "evidence," not "commands."



Complete Series Table of Contents:
  1. Engine Overview and Design Philosophy
  2. Signal Lift Anatomy
  3. Grid Search Engine Anatomy
  4. Threshold · k Anatomy
  5. Statistical Validation Anatomy
  6. MCMC Posterior Anatomy [Current]

References

  1. Foreman-Mackey, D., Hogg, D.W., Lang, D., & Goodman, J. (2013). "emcee: The MCMC Hammer." Publications of the Astronomical Society of the Pacific, 125(925), 306-312.
  2. Goodman, J. & Weare, J. (2010). "Ensemble samplers with affine invariance." Communications in Applied Mathematics and Computational Science, 5(1), 65-80.
  3. Gelman, A. & Rubin, D.B. (1992). "Inference from Iterative Simulation Using Multiple Sequences." Statistical Science, 7(4), 457-472.