A New Paradigm for Penalized Regression Confidence Intervals and Supporting Methods

Logan Harris

Chapter 1:
Bootstrap Confidence Intervals for lasso

Lasso

lasso-penalized linear regression (Tibshirani 1996) adds an L_1 penalty to the squared error loss:

Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = \frac{1}{2n}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1

  • \mathbf{y} is a length n vector of independent outcomes
  • \mathbf{X} is an n \times p matrix of features
  • \boldsymbol{\beta} is a length p vector of regression coefficients
  • \lambda is a regularization parameter controlling the amount of penalization
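
As a concrete illustration, the objective above can be evaluated directly. The following is a minimal R sketch with simulated toy inputs (not data from this work):

```r
## Minimal sketch: evaluate the lasso objective Q(beta | X, y, lambda)
## at a candidate coefficient vector, using simulated toy data.
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1, rep(0, p - 2))
y <- drop(X %*% beta_true + rnorm(n))

lasso_objective <- function(beta, X, y, lambda) {
  rss <- sum((y - X %*% beta)^2)                  # squared-error loss
  rss / (2 * nrow(X)) + lambda * sum(abs(beta))   # plus the L1 penalty
}

lasso_objective(beta_true, X, y, lambda = 0.1)
```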

Inference for lasso

  • Penalization complicates the sampling distribution of lasso estimators.
  • This complexity has led to a wide variety of inferential approaches.
    • Most focus on significance testing (e.g., p-values, false discovery rate control).
  • Some methods have been proposed for constructing confidence intervals, although the effects of penalization pose a particular challenge here.
  • Popular methods for constructing intervals with the lasso focus on debiasing in one way or another (e.g., Zhang and Zhang 2014; Javanmard and Montanari 2014).

Bootstrapping the lasso?

  • The bootstrap is a natural tool when dealing with complex sampling distributions.
  • However, Chatterjee and Lahiri (2010) demonstrated that when applied to lasso estimators, the bootstrap is inconsistent.
  • Accordingly, bootstrapping efforts for the lasso have focused on bootstrapping a de-biased (or de-sparsified) version of the lasso (Dezeure, Bühlmann, and Zhang 2017).

In this work, we revisit the bootstrap and ask:

Does the bootstrap not work at all for the lasso, or does it just not work in the traditional sense?

Does the Bootstrap Work?

We break the issues with bootstrapping lasso into two distinct but related problems:

  1. intervals with endpoints at 0, and
  2. bias from the lasso penalty causing undercoverage for larger values of \beta.

To address these problems, we respectively introduce:

  1. a methodological fix to avoid intervals with endpoints at 0, and
  2. an alternative perspective for evaluating coverage for high dimensional confidence intervals.

1. Intervals with endpoints at 0

Epsilon Conundrum

What happens if we apply the bootstrap directly to the lasso?

Data Generation

  • n = p = 100
  • 10 signal features
  • 90 features small, but not 0

Simulation

  • Repeated 1000 times
  • Plot gives 1 instance for 30 features
  • Coverage is average across all features for all repetitions

Epsilon Conundrum: \beta-min Condition

  • The \beta-min condition, \min_{j \in S} |\beta_j^*| \ge C, is often invoked in lasso theory.
  • Here, S is the set of indices of the nonzero coefficients and C is some data-dependent constant.
  • The \beta-min condition ensures the smallest non-zero coefficient in the true model is sufficiently large for the lasso to consistently select non-zero coefficients.
  • The preceding slide provides an extreme example where the \beta-min condition is violated.
  • For any coefficient where this condition is violated, a traditional bootstrapping approach will produce intervals with pathological behavior.

The Hybrid Bootstrap

  • Let b \in \lbrace 1, \ldots, B \rbrace indicate the b^{th} bootstrap draw.
  • The traditional bootstrap simply takes the point estimate, \hat{\beta}_j^b, of each \beta_j from each respective bootstrap sample.
  • As shown, this leads to issues if a large majority of the \hat{\beta}_j^b are 0 for a given j.
  • We propose the Hybrid bootstrap, which implements a fix focused on the full conditional distribution of \beta_j, viewed as a Bayesian posterior.
  • The traditional bootstrap draw \hat{\beta}_j^b is the mode of this distribution.
  • We propose that when \hat{\beta}_j^b = 0, this draw is replaced by a draw from the full conditional distribution.

Full Conditional Distributions for lasso-Penalized Linear Regression

The lasso can be formulated as a Bayesian regression model by placing an appropriate prior on the coefficients (Tibshirani 1996; Park and Casella 2008). Specifically, for the lasso this is a Laplace prior:

\begin{align*}p(\boldsymbol{\beta}) = \prod_{j = 1}^{p} \frac{\gamma}{2}\exp(-\gamma \left\lvert\beta_j\right\rvert), \gamma > 0.\end{align*}

With this prior, the lasso estimate, \widehat{\boldsymbol{\beta}}(\lambda), is the posterior mode for \boldsymbol{\beta} when \gamma = n\lambda/\sigma^2.

Full Conditional Distributions for lasso-Penalized Linear Regression

The full conditional likelihood for \beta_j is defined as the likelihood for \beta_j conditional on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j}(\lambda):

L(\beta_j|\boldsymbol{\beta}_{-j}) \propto \exp(-\frac{n}{2\sigma^2}(\beta_{j} - z_{j})^2).

With this likelihood and the previously mentioned prior, the full conditional distribution, viewed as a Bayesian posterior, can be written as:

\begin{aligned} p(\beta_j | \boldsymbol{\beta}_{-j}) &\propto \begin{cases} C_{-} \exp\{-\frac{n}{2\sigma^2} (\beta_j - (z_j + \lambda))^2\}, \text{ if } \beta_j < 0, \\ C_{+} \exp\{-\frac{n}{2\sigma^2} (\beta_j - (z_j - \lambda))^2\}, \text{ if } \beta_j \geq 0 \\ \end{cases} \end{aligned}

Note: This formulation is attractive because it allows efficient computation of quantiles from the full conditional posterior.

Full Conditional Distribution

Note: Solely for visualization purposes.

Details: Full Conditional Distribution

  • Assumes that \mathbf{X} has been standardized s.t. \mathbf{x}_j^T\mathbf{x}_j = n.
  • The partial residuals, \mathbf{r}_{j}, are defined as \mathbf{y}- \mathbf{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}(\lambda).
  • C_{-} = \exp(z_j \lambda n/\sigma^2) and C_{+} = \exp(-z_j \lambda n/\sigma^2)
  • z_{j} = \frac{1}{n} \mathbf{x}_{j}^{T}\mathbf{r}_{j}
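
Using the definitions above, quantiles of this two-piece normal posterior can be computed in closed form. Below is a minimal R sketch of one way to do so (our own illustration, not the ncvreg implementation); the piece weights C_{-} and C_{+} are handled on the log scale for numerical stability:

```r
## Quantile of the full conditional p(beta_j | beta_{-j}) described above:
## a two-piece normal with means z_j + lambda (beta < 0) and z_j - lambda
## (beta >= 0), variance sigma2 / n, and piece weights proportional to
## C_- and C_+ times the corresponding truncated-normal masses.  Sketch only.
full_cond_quantile <- function(prob, z, lambda, sigma2, n) {
  s      <- sqrt(sigma2 / n)
  mu_neg <- z + lambda                 # mean of the beta < 0 piece
  mu_pos <- z - lambda                 # mean of the beta >= 0 piece
  ## log piece masses: log C_- + log P{N(mu_neg, s^2) < 0}, and similarly for C_+
  lw_neg <- ( z * lambda * n / sigma2) + pnorm(0, mu_neg, s, log.p = TRUE)
  lw_pos <- (-z * lambda * n / sigma2) +
    pnorm(0, mu_pos, s, lower.tail = FALSE, log.p = TRUE)
  pi_neg <- 1 / (1 + exp(lw_pos - lw_neg))           # P(beta_j < 0)
  if (prob <= pi_neg) {
    ## quantile within the negative piece (normal truncated to (-Inf, 0])
    qnorm((prob / pi_neg) * pnorm(0, mu_neg, s), mu_neg, s)
  } else {
    ## quantile within the positive piece (normal truncated to [0, Inf))
    q <- (prob - pi_neg) / (1 - pi_neg)
    qnorm(pnorm(0, mu_pos, s) + q * pnorm(0, mu_pos, s, lower.tail = FALSE),
          mu_pos, s)
  }
}

## Example: 95% equal-tailed interval when z_j is small relative to lambda
sapply(c(0.025, 0.975), full_cond_quantile,
       z = 0.05, lambda = 0.1, sigma2 = 1, n = 100)
```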

Details: Implementation

  • The full conditional above corresponds to particular values of \lambda and \hat{\sigma}^2.
  • We propose using:
    • Cross validation (CV) to select \lambda that minimizes CV error
    • The estimate of \sigma^2 recommended by Reid, Tibshirani, and Friedman (2016):

\hat{\sigma}^2 = \frac{1}{n - |\hat{S}_{\lambda_{CV}}|} ||\mathbf{y}- \mathbf{X}\widehat{\boldsymbol{\beta}}({\lambda_{CV}})||_2^2,

where |\hat{S}_{\lambda_{CV}}| = \sum_{j} I\left( \hat{\beta}_j(\lambda_{CV}) \neq 0 \right) is the number of features selected at \lambda_{CV}.
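
Putting the pieces together, a simplified pairs-bootstrap sketch of the Hybrid procedure is given below. It uses glmnet for the lasso fits purely for illustration (the released implementation is in ncvreg), fixes \lambda at the CV minimizer from the original data, plugs in the variance estimate above, and replaces zero draws using full_cond_quantile() from the previous sketch via inverse-CDF sampling. The resampling scheme and other details are simplifications and may differ from the paper:

```r
## Simplified sketch of the Hybrid bootstrap for a lasso fit.
## Uses full_cond_quantile() from the previous sketch.
library(glmnet)

set.seed(1)
n <- 100; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, 90))
y <- drop(X %*% beta + rnorm(n))

std <- function(X) scale(X) * sqrt(nrow(X) / (nrow(X) - 1))   # x_j' x_j = n
Xs <- std(X); yc <- y - mean(y)

cvfit  <- cv.glmnet(Xs, yc, standardize = FALSE, intercept = FALSE)
lam    <- cvfit$lambda.min                               # lambda_CV
bhat   <- as.matrix(coef(cvfit, s = lam))[-1, 1]         # drop intercept row
df     <- sum(bhat != 0)
sigma2 <- sum((yc - Xs %*% bhat)^2) / (n - df)           # Reid et al. (2016)

B <- 200
draws <- matrix(NA, B, p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)                       # pairs resampling
  Xb  <- std(X[idx, ]); yb <- y[idx] - mean(y[idx])
  fitb <- glmnet(Xb, yb, lambda = lam, standardize = FALSE, intercept = FALSE)
  bb   <- as.matrix(coef(fitb))[-1, 1]
  for (j in which(bb == 0)) {
    rj    <- yb - Xb[, -j] %*% bb[-j]                    # partial residuals
    zj    <- sum(Xb[, j] * rj) / n
    bb[j] <- full_cond_quantile(runif(1), zj, lam, sigma2, n)   # Hybrid draw
  }
  draws[b, ] <- bb
}
ci <- t(apply(draws, 2, quantile, probs = c(0.025, 0.975)))     # 95% intervals
```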

Epsilon Conundrum: The Hybrid Bootstrap

Data Generation

  • n = p = 100
  • 10 signal features
  • 90 features small, but not 0

Simulation

  • Repeated 1000 times
  • Plot gives 1 instance for 30 features
  • Coverage is average across all features for all repetitions

2. Bias Causing Undercoverage

Traditional Frequentist Intervals

  • For the epsilon conundrum results, the coverage reported was the average coverage across all p confidence intervals.
  • This, of course, is not the focus of classical frequentist coverage.
  • Frequentist coverage is focused on achieving proper coverage for each parameter individually.
  • Let \mathcal{A}(\mathbf{y}) denote a procedure that produces an interval based on data \mathbf{y}; the coverage is then defined as \textrm{Cover}(\theta) = \mathbb{P}\{\theta \in \mathcal{A}(\mathbf{y})\}.
  • Classical frequentist inference requires that valid intervals satisfy \textrm{Cover}(\theta) = 1 - \alpha for all values of \theta.

Bayesian Intervals - Ridge Regression Example

However, Bayesian credible intervals behave differently.

The ridge (L_2) penalty can be formulated as a Normal prior.

Consider the credible intervals for \theta in a \textrm{N}(\theta, \sigma^2) model with prior \theta \sim \textrm{N}(0, \tau^2) where \sigma = \tau = 1.

Average Coverage

Theorem. If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.

Instead of integrating with respect to the prior, a more interesting quantity in high dimensions may be the average with respect to the empirical distribution of the \theta_j values actually present.

Does \tfrac{1}{p} \sum_{j=1}^p \textrm{Cover}(\theta_j) \approx \int \textrm{Cover}(\theta)p(\theta) \, d\theta ?
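
For the ridge example above (\sigma = \tau = 1), the posterior is \textrm{N}(y/2, 1/2), so \textrm{Cover}(\theta) has a closed form and the theorem can be checked by averaging over draws of \theta from the prior. A small R sketch of this check:

```r
## Ridge / normal-means example: y ~ N(theta, 1), prior theta ~ N(0, 1).
## Posterior is N(y/2, 1/2), so the 95% credible interval is y/2 +/- z*sqrt(1/2)
## and Cover(theta) = P(|theta - y/2| <= z*sqrt(1/2)) has a closed form.
z <- qnorm(0.975)
cover <- function(theta) pnorm(theta + sqrt(2) * z) - pnorm(theta - sqrt(2) * z)

cover(0)    # ~0.994: over-coverage near the prior mode
cover(3)    # ~0.41 : under-coverage for large |theta|

## Averaging coverage over theta drawn from the prior recovers 1 - alpha:
mean(cover(rnorm(1e6)))   # ~0.95
```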

Ideal Conditions - Hybrid Bootstrap

Data Generation

  • \boldsymbol{\beta}\sim Laplace
  • \mathbf{X} generated under independence
  • n = p = 100

Simulation

  • 1000 iterations
  • Red curve: estimated coverage as a function of \beta
  • Dashed line: average coverage

Details: Ideal Conditions

  • Rubin (1981) originally pointed out that there is a quasi-equivalence between a bootstrap sampling distribution and a Bayesian posterior.
  • Recall that the coverage theorem and the surrounding discussion made a connection between average coverage and Bayesian credible intervals.
  • Together, these points suggest the Hybrid bootstrap should have approximately correct average coverage when the empirical distribution of \boldsymbol{\theta} matches the prior implied by the lasso penalty.

Is the Hybrid Bootstrap Robust to Assumptions?

Adding Autoregressive Correlation

Data Generation

  • \boldsymbol{\beta}\sim Laplace
  • \mathbf{X} generated with AR(\rho) correlation structure
  • p = 100, n varied

Simulation

  • 1000 iterations
  • Compute coverage in each simulation
  • Plot displays distributions of these coverages

Alternative Distributions of \boldsymbol{\beta}

Details: Alternative Distributions of \boldsymbol{\beta}

  • \mathbf{X} generated under independence.
  • Prior to normalization…
    • Sparse 1: \boldsymbol{\beta}_{1-10} = \pm(0.5, 0.5, 0.5, 1, 2) with the rest equal to zero
    • Sparse 2: \boldsymbol{\beta}_{1-30} \sim N(0, 1) with the rest equal to zero
    • Sparse 3: \boldsymbol{\beta}_{1-50} \sim N(0, 1) with the rest equal to zero
    • T distribution: df = 3
    • Beta distribution: \boldsymbol{\beta}\sim Beta(0.1, 0.1) - 0.5

Comparison to Other Methods

Other Methods

  • Selective Inference (Lee et al. 2016) accounts for uncertainty in model selection by conditioning on the selected model.
    • Only provides intervals for selected covariates.
  • Bootstrapped de-sparsified lasso (BDSL), as the name suggests, consists of bootstrapping a debiased form of the lasso.
  • Both methods can be viewed as performing some form of bias correction in attempts to satisfy a traditional frequentist definition of coverage.

Simulation Set Up

  • Data generation:
    • \boldsymbol{\beta}\sim Laplace
    • \mathbf{X} generated under independence
    • p = 100 and n is varied to be 50, 100, 400
  • Methods are used as they come “out of the box”
    • For BDSL (from hdi) \lambda is reselected for each bootstrap draw using the 1SE rule.
    • For Selective Inference and Hybrid, \lambda is fixed at the value that minimizes CV error on the original data.

Results

Data Generation

  • n = 100

Simulation

  • 1000 iterations
  • Curves: estimated coverage as a function of \beta
  • Dashed lines: average coverages

Results

Data Generation

  • n varied

Simulation

  • 1000 iterations
  • Top: distribution of coverages
  • Middle: median width across all simulations/intervals
  • Bottom: average runtime across all simulations

Data Analysis

Overview

Methods

  • Hybrid, Selective Inference, and BDSL
  • All methods are carried out at \lambda_{CV} from the original data set.

Scheetz2006

  • Scheetz et al. (2006) measured RNA levels from the eyes of 120 rats.
  • A gene with known association to the disease of interest (BBS) is used as the outcome.
  • The goal is to try to determine other genes whose expression is associated with BBS.
  • 120 observations
  • 18975 features

Results

Details: Data Analysis

  • 6 Hybrid intervals, 4035 BDSL intervals, and all 66 Selective Inference intervals do not contain their respective lasso point estimates.
  • Hybrid identifies 4 significant genes, BDSL identifies 1927, Selective Inference identifies all 66 as “significant”.
  • Selective Inference took about 1.1 seconds, Hybrid took about 4.5 minutes, and BDSL took about 7 hours.

Chapter 1 Conclusion

What Properties Should High Dim CIs have?

  • Interval construction methods such as Selective Inference and BDSL are frequentist in nature.
    • Attempt to provide nominal coverage for all values of \beta
    • Debiased intervals do not reflect assumptions that went into fitting the lasso.
  • The Hybrid intervals we developed have a Bayesian feel.
    • Require a shift in perspective to average coverage across the set of parameters.
    • If a biased estimation method is chosen, it seems reasonable that the resulting intervals reflect that bias.
    • Reflect the model that was actually fit, leading to point estimates and confidence intervals that are coherent.

The Hybrid Bootstrap

  • Paper submitted to Biometrics
  • Available in ncvreg (3.15.0)

Chapter 2:
A Non-bootstrap Alternative

Issues with the Bootstrap

Bootstrapping is slow

Not prohibitively so, but slow compared to non-bootstrap alternatives.

Could be more robust to correlation

The fix introduced for the Hybrid bootstrap doesn’t inherently do anything to account for correlation between features.

Hybrid intervals sometimes do not contain lasso point estimates

This is a side effect of bootstrapping.

Question

What if, instead of bootstrapping to capture uncertainty, intervals were computed directly from a suitably adjusted distribution?

  • The lasso penalty fully defines the prior
  • This leaves the task of finding a suitable Normal likelihood

Previous Likelihood: Full Conditional

Conditioning on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j} results in the likelihood:

L(\beta_j|\boldsymbol{\beta}_{-j}) \propto \exp(-\frac{n}{2\sigma^2}(\beta_{j} - z_{j})^2).

where z_{j} = \frac{1}{n} \mathbf{x}_{j}^{T}\mathbf{r}_{j} and \mathbf{r}_{j} = \mathbf{y}- \mathbf{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}(\lambda).

  • This is a somewhat naive likelihood.
  • However, Laplace-Normal sampling can be generalized to any Normal likelihood.
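
Since a Laplace prior combined with any Normal likelihood again yields a two-piece normal posterior, the Chapter 1 quantile computation carries over with the likelihood mean m and variance v as inputs. A sketch of this generalization (our own parameterization, with \gamma the Laplace rate; setting m = z_j, v = \sigma^2/n, and \gamma = n\lambda/\sigma^2 recovers the full conditional):

```r
## Quantile of the posterior proportional to exp(-(beta - m)^2/(2v) - gamma*|beta|):
## a Normal(m, v) likelihood combined with a Laplace(rate = gamma) prior.
## Sketch only; mirrors full_cond_quantile() from Chapter 1.
laplace_normal_quantile <- function(prob, m, v, gamma) {
  s      <- sqrt(v)
  mu_neg <- m + gamma * v              # mean of the beta < 0 piece
  mu_pos <- m - gamma * v              # mean of the beta >= 0 piece
  lw_neg <-  gamma * m + pnorm(0, mu_neg, s, log.p = TRUE)
  lw_pos <- -gamma * m + pnorm(0, mu_pos, s, lower.tail = FALSE, log.p = TRUE)
  pi_neg <- 1 / (1 + exp(lw_pos - lw_neg))
  if (prob <= pi_neg) {
    qnorm((prob / pi_neg) * pnorm(0, mu_neg, s), mu_neg, s)
  } else {
    q <- (prob - pi_neg) / (1 - pi_neg)
    qnorm(pnorm(0, mu_pos, s) + q * pnorm(0, mu_pos, s, lower.tail = FALSE),
          mu_pos, s)
  }
}
```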

Alternative Likelihood: Relaxed lasso

  • What if, instead of conditioning on the lasso estimates, we condition on the selected model?
  • Let \hat{S} = \lbrace k: \hat{\beta}_k \neq 0 \rbrace.
  • Then, \hat{S}_j = \hat{S} \text{ if } j \notin \hat{S} and \hat{S}_j = \hat{S} - \lbrace j \rbrace \text{ if } j \in \hat{S}
  • Define \mathbf{Q}_{\hat{S}_j} as \mathbf{I}- \mathbf{X}_{\hat{S}_j}(\mathbf{X}_{\hat{S}_j}^T \mathbf{X}_{\hat{S}_j})^{-1} \mathbf{X}_{\hat{S}_j}^T, the projection onto the orthogonal complement of the column space of the features selected by the lasso (excluding j).
  • The likelihood for \beta_j conditional on the selected model is:

L(\beta_j|\hat{S}_j) \propto \exp(-\frac{\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j}{2\sigma^2}(\beta_{j} - \tilde{\beta}_{j})^2),

where \tilde{\beta}_j = (\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j)^{-1} \mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{y}.
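
In R, these quantities amount to a few lines of linear algebra. The sketch below uses hypothetical toy data and a hypothetical selected set \hat{S} to compute \mathbf{Q}_{\hat{S}_j} and \tilde{\beta}_j for a single feature:

```r
## Relaxed-lasso likelihood ingredients for one feature j (sketch).
## Q_Sj projects onto the orthogonal complement of the selected columns
## (excluding j); tilde_beta_j is the resulting partial regression estimate.
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
X <- scale(X) * sqrt(n / (n - 1))          # standardize so x_j' x_j = n
y <- drop(X %*% c(1, -1, rep(0, p - 2)) + rnorm(n))
y <- y - mean(y)

S_hat <- c(1, 2, 5)                        # hypothetical selected set
j     <- 1
S_j   <- setdiff(S_hat, j)                 # S_hat minus j (if j is selected)

X_S  <- X[, S_j, drop = FALSE]
Q_Sj <- diag(n) - X_S %*% solve(crossprod(X_S), t(X_S))
xQx  <- drop(crossprod(X[, j], Q_Sj %*% X[, j]))            # x_j' Q x_j
tilde_beta_j <- drop(crossprod(X[, j], Q_Sj %*% y)) / xQx   # relaxed estimate
```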

Alternative Likelihood: PIPE

  • However, to maintain the modes at \widehat{\boldsymbol{\beta}}(\lambda), the normal likelihood needs to be centered at z_{j}.
  • That is, we are interested in finding an alternative likelihood with mean z_{j} but with a suitably adjusted variance.
  • Work in Biyue Dai's dissertation suggests the following as one reasonable option:

L(\beta_j|\hat{S}_j) \propto \exp(-\frac{\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j}{2\sigma^2}(\beta_{j} - z_{j})^2)
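
Putting the previous two sketches together, one plausible way to form a PIPE-style interval is to plug the likelihood mean z_j and variance \sigma^2/(\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j) into laplace_normal_quantile(), taking the Laplace rate to be \gamma = n\lambda/\sigma^2 as in Chapter 1. This is a hedged sketch (the lasso outputs below are hypothetical stand-ins), not necessarily the exact procedure:

```r
## PIPE-style 95% interval for beta_j (sketch), reusing Q_Sj and xQx from the
## previous sketch and laplace_normal_quantile() from above.
## The lasso outputs below are hypothetical stand-ins for an actual fit.
lambda   <- 0.1
sigma2   <- 1
beta_hat <- numeric(p); beta_hat[S_hat] <- c(0.9, -0.8, 0.1)

z_j   <- drop(crossprod(X[, j], y - X[, -j] %*% beta_hat[-j])) / n
gamma <- n * lambda / sigma2     # Laplace rate implied by the lasso penalty
v_j   <- sigma2 / xQx            # variance after conditioning on the selected model
sapply(c(0.025, 0.975), laplace_normal_quantile, m = z_j, v = v_j, gamma = gamma)
```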

Ideal Scenario

Data \boldsymbol{\beta}\sim Laplace | \mathbf{X} generated under independence | p = 100, n varied

Plot displays estimated coverage as a function of \beta (curves) and average coverage (dashed lines) across 1000 simulations

Adding Autoregressive Correlation

Data \boldsymbol{\beta}\sim Laplace | \mathbf{X} generated with AR(\rho) cor | p = 100, n varied

Plot displays distributions of coverages across 1000 simulations

Scheetz2006

Hybrid vs. PIPE Posterior

The two methods take fundamentally different approaches to construct intervals that reflect the lasso model fit.

Hybrid

Takes a model averaging approach with the bootstrap.

PIPE Posterior

Adjusts the uncertainty by conditioning on the selected features.

Chapter 3:
Generalizing Laplace-Normal Sampling

Extending Methods Proposed

  • Both the Hybrid Bootstrap and PIPE Posterior methods depend on Laplace-Normal sampling.
  • Extending Laplace-Normal sampling then directly results in the extension of both methods.
  • Alternative families:
    • Binomial
    • Poisson
  • Alternative penalties:
    • Elastic Net
    • Bias reducing penalties (SCAD and MCP)

Other Families

Binomial

Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = -\frac{1}{n} \sum_{i=1}^n \lbrace y_i \mathbf{x}_i^T\boldsymbol{\beta}- \log(1+\exp(\mathbf{x}_i^T \boldsymbol{\beta})) \rbrace + \lambda \sum_{j = 1}^p |\beta_j|

Poisson

Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = -\frac{1}{n} \sum_{i=1}^n \lbrace y_i \mathbf{x}_i^T\boldsymbol{\beta}- \exp(\mathbf{x}_i^T \boldsymbol{\beta}) \rbrace + \lambda \sum_{j = 1}^p |\beta_j|

Details: Other Families

\begin{aligned} \eta_i &= \mathbf{x}_i^T \boldsymbol{\beta}, \\ \pi_i &= \begin{cases} \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \\ \exp(\eta_i) \end{cases}, \quad \boldsymbol{W} = \begin{cases} diag(\pi_i (1 - \pi_i)) & \text{(Binomial)} \\ diag(\pi_i) & \text{(Poisson)} \end{cases} \\ \tilde{\mathbf{y}} &= \boldsymbol{W}^{-1} (\mathbf{y}- \boldsymbol{\pi}) + \boldsymbol{\eta}, \quad z_j = \frac{\mathbf{x}_j^T\boldsymbol{W} (\tilde{\mathbf{y}} - \mathbf{X}_{-j} \beta_{-j}) }{\mathbf{x}_j^T\boldsymbol{W} \mathbf{x}_j}, \\ \nu_j &= \frac{1}{n} \mathbf{x}^T_j \boldsymbol{W} \mathbf{x}_j, \quad \lambda_j = \frac{\lambda}{\nu_j} \end{aligned}
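
These are the usual IRLS working quantities. A short R sketch for the binomial case, evaluated at a hypothetical current estimate of \boldsymbol{\beta}:

```r
## IRLS working quantities for the binomial case at a current beta (sketch).
set.seed(1)
n <- 100; p <- 5
X    <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, rep(0, p - 2))                  # hypothetical current estimate
y    <- rbinom(n, 1, plogis(drop(X %*% beta)))

eta <- drop(X %*% beta)
pi  <- plogis(eta)                               # exp(eta) / (1 + exp(eta))
w   <- pi * (1 - pi)                             # diag of W (use pi for Poisson)
y_tilde <- (y - pi) / w + eta                    # working response

j    <- 1
z_j  <- sum(X[, j] * w * (y_tilde - X[, -j] %*% beta[-j])) / sum(X[, j]^2 * w)
nu_j <- sum(X[, j]^2 * w) / n
lambda_j <- 0.05 / nu_j                          # lambda / nu_j for lambda = 0.05
```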

Binomial lasso (PIPE Posterior)

Data p = 100, n = 400 | \beta_{1-10} contain signal, \beta_{11-100} = 0

Elastic Net

Elastic net

The elastic net penalty is

p(\boldsymbol{\beta}| \lambda, \alpha) = \alpha \lambda||\boldsymbol{\beta}||_1 + \frac{(1 - \alpha) \lambda}{2} ||\boldsymbol{\beta}||^2_2.

We leverage a data augmentation approach in which the elastic net can be written as a lasso model with:

  • \mathbf{y}^* = \begin{bmatrix} \mathbf{y}\\ \boldsymbol{0}_p \end{bmatrix}
  • \mathbf{X}^* = \begin{bmatrix} \mathbf{X}\\ I_p \sqrt{n(1-\alpha)\lambda} \end{bmatrix}
  • \lambda^* = \lambda\alpha
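
The augmentation can be verified numerically: evaluating the elastic net objective directly and the lasso objective on the augmented data (keeping the 1/(2n) scaling with the original n) gives the same value at any \boldsymbol{\beta}. A toy R check:

```r
## Verify the elastic net data augmentation numerically (toy data).
set.seed(1)
n <- 50; p <- 10; alpha <- 0.5; lambda <- 0.2
X    <- matrix(rnorm(n * p), n, p)
y    <- drop(X %*% c(1, -1, rep(0, p - 2)) + rnorm(n))
beta <- rnorm(p)                                  # arbitrary test point

## Elastic net objective: 1/(2n)||y - X b||^2 + alpha*lambda*||b||_1
##                        + (1 - alpha)*lambda/2*||b||_2^2
Q_enet <- sum((y - X %*% beta)^2) / (2 * n) +
  alpha * lambda * sum(abs(beta)) + (1 - alpha) * lambda / 2 * sum(beta^2)

## Augmented lasso: y* = (y, 0_p), X* = rbind(X, sqrt(n(1-alpha)lambda) I_p),
## lambda* = alpha * lambda, with the loss still scaled by 1/(2n).
y_star <- c(y, rep(0, p))
X_star <- rbind(X, sqrt(n * (1 - alpha) * lambda) * diag(p))
Q_aug  <- sum((y_star - X_star %*% beta)^2) / (2 * n) +
  alpha * lambda * sum(abs(beta))

all.equal(Q_enet, Q_aug)   # TRUE
```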

Elastic Net Example (Hybrid Bootstrap)

Data n = p = 100 \quad | \quad \beta_{A1} = 1, all other \beta_j = 0

\textrm{Cor}(\mathbf{x}_{A1}, \mathbf{x}_{B1}) = 0.99 \quad | \quad \textrm{Cor}(\mathbf{x}_i, \mathbf{x}_j) = 0 for all other pairs i \neq j

Bias Reducing Penalties

MCP

p_\lambda(\beta|\lambda, \gamma) = \begin{cases} \begin{aligned} &\lambda\beta - \frac{\beta^2}{2\gamma} & \text{ if } \beta \leq \gamma \lambda\\ &\frac{1}{2}\gamma \lambda^2 & \text{ if } \beta \geq \gamma \lambda \end{aligned} \end{cases}

SCAD

p_\lambda(\beta|\lambda, \gamma) = \begin{cases} \begin{aligned} &\lambda\beta & \text{ if } \beta \leq \lambda\\ &\frac{2\gamma \lambda\beta - \beta^2 - \lambda^2}{2(\gamma - 1)} & \text{ if } \lambda< \beta < \gamma \lambda\\ &\frac{\lambda^2(\gamma + 1)}{2} & \text{ if } \beta \geq \gamma \lambda \end{aligned} \end{cases}
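
Written as R functions of t = |\beta| (with conventional default values of \gamma, assumed here only for illustration), the two penalties are:

```r
## MCP and SCAD penalty functions, evaluated at t = |beta| (sketch).
mcp_penalty <- function(t, lambda, gamma = 3) {
  t <- abs(t)
  ifelse(t <= gamma * lambda,
         lambda * t - t^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

scad_penalty <- function(t, lambda, gamma = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t < gamma * lambda,
                (2 * gamma * lambda * t - t^2 - lambda^2) / (2 * (gamma - 1)),
                lambda^2 * (gamma + 1) / 2))
}

## Both flatten out for large |beta|, unlike the lasso penalty lambda * |beta|
curve(mcp_penalty(x, lambda = 1), -4, 4, ylab = "penalty")
curve(scad_penalty(x, lambda = 1), -4, 4, add = TRUE, lty = 2)
```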

Approximating Bias Reducing Penalties

Example: Poisson lasso (PIPE Posterior)

Data p = 100, n = 200 | \beta_{1-10} contain signal, \beta_{11-100} = 0

Example: Poisson MCP (PIPE Posterior)

Data p = 100, n = 200 | \beta_{1-10} contain signal, \beta_{11-100} = 0

Bias Reducing Penalties: Question

If one does want intervals with closer to nominal coverage for larger values of \beta, is it better to

  1. use a reduced bias penalty and construct intervals that reflect that penalty, or
  2. use a biased penalty and construct intervals using a de-biasing procedure?

References

Chatterjee, A., and S. N. Lahiri. 2010. “Asymptotic Properties of the Residual Bootstrap for Lasso Estimators.” Proceedings of the American Mathematical Society 138: 4497–4509.
Dezeure, Ruben, Peter Bühlmann, and Cun-Hui Zhang. 2017. “High-Dimensional Simultaneous Inference with the Bootstrap.” TEST 26 (4): 685–719. https://doi.org/10.1007/s11749-017-0554-2.
Javanmard, Adel, and Andrea Montanari. 2014. “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research (JMLR) 15 (1): 2869–909.
Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor. 2016. “Exact Post-Selection Inference, with Application to the Lasso.” The Annals of Statistics 44 (3): 907–27.
Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Reid, S., R. Tibshirani, and J. Friedman. 2016. “A Study of Error Variance Estimation in Lasso Regression.” Statistica Sinica 26: 35–67.
Rubin, Donald B. 1981. “The Bayesian Bootstrap.” The Annals of Statistics 9 (1): 130–34. https://doi.org/10.1214/aos/1176345338.
Scheetz, TE, K-YA Kim, RE Swiderski, AR Philp, TA Braun, KL Knudtson, AM Dorrance, et al. 2006. “Regulation of Gene Expression in the Mammalian Eye and Its Relevance to Eye Disease.” Proceedings of the National Academy of Sciences 103: 14429–34.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society Series B 58: 267–88.
Zhang, C. H., and S. S. Zhang. 2014. “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 217–42.

Appendix

Coverage Theorem

If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.

Proof. By definition, a 100(1-\alpha)\% credible region for \theta_j is any set \mathcal{A}_j(\mathbf{y}) such that \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}= 1 - \alpha. The coverage probability is defined as \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta})d\mathbf{y}.

Coverage Theorem

The average coverage, integrated with respect to the prior p(\boldsymbol{\theta}), is therefore \begin{aligned} \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\mathbf{y}d\boldsymbol{\theta} &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) p(\mathbf{y}) d\mathbf{y}d\boldsymbol{\theta}\\ &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) d\boldsymbol{\theta}\, p(\mathbf{y}) d\mathbf{y}\\ &= \int (1 - \alpha) p(\mathbf{y}) d\mathbf{y}\\ &= 1 - \alpha. \end{aligned} Thus, the average coverage equals the nominal coverage.

Sampling from Full Conditional