A New Perspective on Bootstrap Confidence Intervals for LASSO

Logan Harris, MS, PhD Candidate
Advised by Patrick Breheny, PhD

LASSO

LASSO-penalized linear regression adds an L_1 penalty to the squared error loss:

Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = \frac{1}{2n}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1

  • \mathbf{y} is a length n vector of independent outcomes
  • \mathbf{X} is an n \times p matrix of features
  • \boldsymbol{\beta} is a length p vector of regression coefficients
  • \lambda is a regularization parameter controlling the amount of penalization
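For concreteness, a minimal sketch of fitting this model in R with ncvreg (the package used throughout this work); the simulated data and the value \lambda = 0.1 below are placeholders:

```r
# Minimal sketch: fit a lasso path with ncvreg on simulated data
library(ncvreg)

set.seed(1)
n <- 100; p <- 100
X <- matrix(rnorm(n * p), n, p)            # n x p feature matrix
beta <- c(rep(1, 10), rep(0, 90))          # 10 true signals
y <- as.numeric(X %*% beta + rnorm(n))     # length n outcome vector

fit <- ncvreg(X, y, penalty = "lasso")     # minimizes Q(beta | X, y, lambda)
head(coef(fit, lambda = 0.1))              # coefficients at lambda = 0.1
```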

Reconsidering the Bootstrap

Chatterjee and Lahiri (2010) demonstrated that when applied to lasso estimators, the bootstrap is inconsistent.

In this work, we break the issues with bootstrapping the lasso into two distinct but related problems:

  1. intervals with endpoints at 0, and
  2. bias causing undercoverage for larger values of \beta.

To address these problems, we introduced

  1. a methodological fix to avoid intervals with endpoints at 0, and
  2. an alternative perspective for evaluating coverage for high dimensional confidence intervals.

1. Endpoints at 0

Epsilon Conundrum

What happens if we apply the bootstrap directly to the lasso? Because the lasso sets many coefficients exactly to zero, bootstrap draws of \widehat{\beta}_j are often exactly 0, so percentile intervals frequently have an endpoint at exactly 0. Such intervals routinely fail to cover coefficients that are small but not 0 (an "epsilon"); this is the Epsilon Conundrum.

Data Generation

  • n = p = 100
  • 10 features contain signal
  • 90 are small but not 0

Simulation

  • Repeated 1000 times
  • The plot shows one simulated instance for 30 of the features
  • Coverage is averaged across all 100 features over all 1000 repetitions
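The following minimal sketch mirrors this setup and exhibits the problem: many percentile endpoints land exactly at 0. The value 0.01 for the "small but not 0" coefficients and the number of bootstrap replicates are assumptions.

```r
# Sketch of the naive percentile bootstrap for the lasso
library(ncvreg)

set.seed(1)
n <- p <- 100
beta <- c(rep(1, 10), rep(0.01, 90))       # 10 signals, 90 "epsilons"
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% beta + rnorm(n))

lam <- cv.ncvreg(X, y, penalty = "lasso")$lambda.min
draws <- replicate(500, {
  idx <- sample(n, replace = TRUE)         # pairs bootstrap resample
  fit <- ncvreg(X[idx, ], y[idx], penalty = "lasso")
  coef(fit, lambda = lam)[-1]              # drop intercept
})
ci <- apply(draws, 1, quantile, probs = c(.025, .975))
mean(ci == 0)   # a large share of interval endpoints sit exactly at 0
```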

A Proposed Fix to the Epsilon Conundrum

The lasso can be formulated as a Bayesian regression model by placing an appropriate prior on \boldsymbol{\beta} (Tibshirani 1996; Park and Casella 2008). For the lasso specifically, this is a Laplace prior:

p(\boldsymbol{\beta}) = \prod_{j = 1}^{p} \frac{\gamma}{2}\exp(-\gamma \left\lvert\beta_j\right\rvert), \quad \gamma > 0.

With this prior, the lasso estimate, \widehat{\boldsymbol{\beta}}(\lambda), is the posterior mode for \boldsymbol{\beta}
when \gamma = n\lambda/\sigma^2.
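To see why, note that with a \textrm{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}) likelihood, the negative log posterior is, up to an additive constant,

\begin{aligned} -\log p(\boldsymbol{\beta}|\mathbf{y}) &= \frac{1}{2\sigma^2}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \gamma\lVert\boldsymbol{\beta}\rVert_1 \\ &= \frac{n}{\sigma^2}\left\{\frac{1}{2n}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \frac{\gamma\sigma^2}{n}\lVert\boldsymbol{\beta}\rVert_1\right\}, \end{aligned}

which is minimized by \widehat{\boldsymbol{\beta}}(\lambda) exactly when \lambda = \gamma\sigma^2/n, i.e., \gamma = n\lambda/\sigma^2.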

To address the Epsilon Conundrum, we propose drawing from the full conditional posterior, p(\beta_j | \boldsymbol{\beta}_{-j}), whenever the bootstrap draw is 0.

We call this the Hybrid Bootstrap.
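A minimal sketch of one Hybrid Bootstrap replicate follows. Here rfull_cond() is a hypothetical sampler for the full conditional posterior given in the appendix, the intercept is ignored by assuming centered data, and the released ncvreg implementation may differ:

```r
library(ncvreg)

# One hybrid-bootstrap replicate: refit the lasso on a resample, then
# replace each draw that is exactly 0 with a draw from the full
# conditional posterior p(beta_j | beta_{-j}).
hybrid_draw <- function(X, y, lambda, sigma, rfull_cond) {
  n <- nrow(X)
  idx <- sample(n, replace = TRUE)                 # pairs bootstrap
  Xb <- X[idx, , drop = FALSE]; yb <- y[idx]
  fit <- ncvreg(Xb, yb, penalty = "lasso")
  b <- coef(fit, lambda = lambda)[-1]              # drop intercept
  for (j in which(b == 0)) {
    r_j <- yb - Xb[, -j, drop = FALSE] %*% b[-j]   # partial residuals
    z_j <- sum(Xb[, j] * r_j) / n
    b[j] <- rfull_cond(z_j, lambda, sigma, n)      # hypothetical sampler
  }
  b
}
```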

Epsilon Conundrum: The Hybrid Bootstrap

Data Generation

  • n = p = 100
  • 10 features contain signal
  • 90 are small but not 0

Simulation

  • Repeated 1000 times
  • The plot shows one simulated instance for 30 of the features
  • Coverage is averaged across all 100 features over all 1000 repetitions

2. Bias Causing Undercoverage

Benefits of Biased Estimators

The addition of a penalty has desirable implications for the resulting lasso estimators:

Sparsity
Can force some coefficients to exactly zero, leading to sparse, more interpretable models

Bias-Variance Trade-off
Controlled increase in bias often results in a significant reduction in variance, improving prediction

Overfitting Prevention
Shrinking coefficient estimates reduces model complexity, mitigating overfitting

Improved Stability in Presence of Multicollinearity
Stabilizes coefficient estimates when predictors are highly correlated

Should intervals also be biased?

Traditional Frequentist Intervals

The statistical community seems to instinctively answer NO.

Popular methods for constructing intervals with the lasso focus on debiasing in one way or another; examples include the de-sparsified lasso (Zhang and Zhang 2014; Javanmard and Montanari 2014) and selective inference (Lee et al. 2016).

The goal of debiasing is to obtain nominal coverage for all values of \beta.

Bayesian Intervals - Ridge Regression Example

However, Bayesian credible intervals behave differently.

The ridge (L_2) penalty can be formulated as a Normal prior.

Consider the credible interval for \theta in the model y \sim \textrm{N}(\theta, \sigma^2) with prior \theta \sim \textrm{N}(0, \tau^2), where \sigma = \tau = 1.
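A quick numerical check of this example (self-contained; here the posterior is \textrm{N}(y/2, 1/2)):

```r
# Frequentist coverage of the 95% credible interval y/2 +/- 1.96 * sqrt(1/2)
cover <- function(theta) {
  half <- qnorm(.975) * sqrt(1/2)   # half-width of the credible interval
  # theta is covered iff y lands in [2(theta - half), 2(theta + half)]
  pnorm(theta + 2 * half) - pnorm(theta - 2 * half)
}
cover(0)    # ~0.99: overcoverage near the prior mode
cover(2)    # ~0.78: undercoverage for large |theta|

# Averaged over the prior, coverage is exactly nominal:
integrate(function(t) cover(t) * dnorm(t), -Inf, Inf)$value   # 0.95
```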

Average coverage

Theorem. Let \textrm{Cover}(\theta) = \mathbb{P}\{\theta \in \mathcal{A}(\mathbf{y})\}. If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.

For a high-dimensional problem, instead of integrating with respect to the prior, a more interesting quantity may be the average with respect to the distribution of \theta values actually present.

Does \tfrac{1}{p} \sum_{j=1}^p \textrm{Cover}(\theta_j) \approx \int \textrm{Cover}(\theta)p(\theta) \, d\theta ?
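As a quick sanity check under the prior itself, drawing many \theta_j from \textrm{N}(0, 1) and averaging cover() from the sketch above recovers the nominal level:

```r
set.seed(1)
thetas <- rnorm(100)     # p = 100 coefficients drawn from the N(0, 1) prior
mean(cover(thetas))      # close to 0.95, matching the prior average
```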

Ideal Conditions - Hybrid Bootstrap

Data Generation

  • \boldsymbol{\beta} \sim \textrm{Laplace}
  • \mathbf{X} generated under independence
  • n = p = 100

Simulation

  • 1000 iterations
  • Red curve: estimated coverage as a function of \beta
  • Dashed line: average coverage
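A sketch of this generative setup (the Laplace rate \gamma is an assumed value; Laplace draws use the exponential-with-random-sign representation):

```r
set.seed(1)
n <- p <- 100
gamma <- 2                                  # assumed Laplace rate
beta <- rexp(p, rate = gamma) * sample(c(-1, 1), p, replace = TRUE)
X <- matrix(rnorm(n * p), n, p)             # independent features
y <- as.numeric(X %*% beta + rnorm(n))
```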

Is the Hybrid Bootstrap Robust to Assumptions?

Adding Autoregressive Correlation

Data Generation

  • \boldsymbol{\beta} \sim \textrm{Laplace}
  • \mathbf{X} generated with an AR(\rho) correlation structure
  • p = 100, n varied

Simulation

  • 1000 iterations
  • Coverage computed in each simulation
  • The plot displays the distributions of these coverages
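A sketch of generating \mathbf{X} with AR(\rho) correlation, \Sigma_{jk} = \rho^{|j-k|}, via a Cholesky factor (\rho = 0.5 is an assumed value):

```r
# Rows of X are N(0, Sigma) with Sigma[j, k] = rho^|j - k|
ar_design <- function(n, p, rho) {
  Sigma <- rho ^ abs(outer(1:p, 1:p, "-"))
  matrix(rnorm(n * p), n, p) %*% chol(Sigma)
}
X <- ar_design(n = 100, p = 100, rho = 0.5)
```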

Alternative Distributions of \beta

Data Analysis

Overview

  • Compare Hybrid, Selective Inference, and Bootstrapped De-sparsified Lasso (BDSL)
  • Here, all methods are carried out at \lambda_{CV}, the value selected by cross-validation on the original data set (selection sketched below)
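A minimal sketch of that selection step with ncvreg (the simulated X and y are placeholders for the real features and outcome):

```r
library(ncvreg)

set.seed(1)
X <- matrix(rnorm(120 * 500), 120, 500)     # placeholder feature matrix
y <- rnorm(120)                             # placeholder outcome

# lambda_CV: the value minimizing cross-validated error on the original data
cvfit <- cv.ncvreg(X, y, penalty = "lasso")
lambda_cv <- cvfit$lambda.min
```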

Scheetz et al. (2006)

  • Scheetz et al. (2006) measured RNA levels from the eyes of 120 rats
  • A gene with known association to the disease of interest, Bardet-Biedl syndrome (BBS), is used as the outcome
  • The goal is to determine other genes whose expression is associated with BBS
  • 120 observations
  • 18,975 features

Results

Details

  • 5 Hybrid intervals do not contain their point estimate; 4,035 BDSL intervals and all 66 Selective Inference intervals do not
  • Hybrid identifies 3 significant genes, BDSL identifies 1,927, and Selective Inference “identifies” all 66 as significant
  • Selective Inference took about 0.3 seconds, Hybrid about 4.5 minutes, and BDSL about 7 hours

Conclusion

If you choose a biased estimation method, is it not reasonable that the resulting intervals should also reflect that bias?

We introduced

  1. a method for interval construction that fixes the Epsilon Conundrum, and
  2. an alternative perspective that allows for biased intervals.

Together, these proposals lead to intervals that clearly reflect the assumptions that went into fitting the original lasso model.

The Hybrid Bootstrap is available in the R package ncvreg.
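A hypothetical usage sketch: the function name boot_ncvreg() and its interface are assumptions based on recent versions of the package, so check the documentation of your installed version:

```r
library(ncvreg)
data(Prostate)                              # small example data set in ncvreg

# Assumed interface: hybrid bootstrap intervals for the lasso coefficients
res <- boot_ncvreg(Prostate$X, Prostate$y)  # function name assumed
res                                         # assumed to display the intervals
```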

References

Chatterjee, A., and S. N. Lahiri. 2010. “Asymptotic Properties of the Residual Bootstrap for Lasso Estimators.” Proceedings of the American Mathematical Society 138: 4497–4509.
Javanmard, A., and A. Montanari. 2014. “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research 15 (1): 2869–909.
Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor. 2016. “Exact Post-Selection Inference, with Application to the Lasso.” The Annals of Statistics 44 (3): 907–27.
Park, T., and G. Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Scheetz, T. E., K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson, A. M. Dorrance, et al. 2006. “Regulation of Gene Expression in the Mammalian Eye and Its Relevance to Eye Disease.” Proceedings of the National Academy of Sciences 103: 14429–34.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58: 267–88.
Zhang, C.-H., and S. S. Zhang. 2014. “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models.” Journal of the Royal Statistical Society: Series B 76 (1): 217–42.

Coverage Theorem

If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.

Proof. By definition, a 100(1-\alpha)\% credible region for \theta_j is any set \mathcal{A}_j(\mathbf{y}) such that \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}= 1 - \alpha. The coverage probability is defined as \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta})d\mathbf{y}.

The average coverage, integrated with respect to the prior p(\boldsymbol{\theta}), is therefore \begin{aligned} \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\mathbf{y}d\boldsymbol{\theta} &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) p(\mathbf{y}) d\mathbf{y}d\boldsymbol{\theta}\\ &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) d\boldsymbol{\theta}\, p(\mathbf{y}) d\mathbf{y}\\ &= \int (1 - \alpha) p(\mathbf{y}) d\mathbf{y}\\ &= 1 - \alpha. \end{aligned} Thus, the average coverage equals the nominal coverage.

MCP Example

Full Conditional Posterior

  • The full conditional posterior for \beta_j is defined as the distribution for \beta_j conditional on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j}(\lambda).

  • The partial residuals, \mathbf{r}_{j}, are then defined as \mathbf{r}_j = \mathbf{y}- \mathbf{X}_{-j}\widehat{\boldsymbol{\beta}}_{-j}(\lambda).

  • A normal likelihood and a Laplace prior are not conjugate, but the full conditional nonetheless has a closed form:

\begin{aligned} p(\beta_j | \boldsymbol{\beta}_{-j}) &\propto \begin{cases} C_{-} \exp\left\{-\frac{n}{2\sigma^2} \left(\beta_j - (z_j + \lambda)\right)^2\right\} & \text{if } \beta_j < 0, \\ C_{+} \exp\left\{-\frac{n}{2\sigma^2} \left(\beta_j - (z_j - \lambda)\right)^2\right\} & \text{if } \beta_j \geq 0, \end{cases} \end{aligned} where C_{-} = \exp(z_j \lambda n/\sigma^2), C_{+} = \exp(-z_j \lambda n/\sigma^2), z_{j} = \frac{1}{n} \mathbf{x}_{j}^{T}\mathbf{r}_{j}, and \mathbf{X} is assumed standardized so that \mathbf{x}_j^T\mathbf{x}_j = n.

  • This formulation is attractive because it allows efficient computation of quantiles from the full conditional posterior, as sketched below.
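A self-contained sketch of that computation (illustrative, not the ncvreg internals; names follow the text, with s = \sigma/\sqrt{n}, and the piece weights are handled on the log scale for numerical stability):

```r
# Quantile of the full conditional posterior p(beta_j | beta_{-j}).
# The density is a two-piece normal: N(z + lambda, s^2) truncated to
# beta_j < 0 and N(z - lambda, s^2) truncated to beta_j >= 0, with
# piece weights proportional to C_- and C_+ times each truncated mass.
full_cond_quantile <- function(q, z, lambda, sigma, n) {
  s <- sigma / sqrt(n)
  # log masses of the two pieces
  log_m_neg <-  z * lambda * n / sigma^2 +
    pnorm(0, z + lambda, s, log.p = TRUE)
  log_m_pos <- -z * lambda * n / sigma^2 +
    pnorm(0, z - lambda, s, lower.tail = FALSE, log.p = TRUE)
  w <- 1 / (1 + exp(log_m_pos - log_m_neg))   # P(beta_j < 0)
  if (q < w) {
    # invert the CDF of N(z + lambda, s^2) truncated to (-Inf, 0)
    qnorm(q / w * pnorm(0, z + lambda, s), z + lambda, s)
  } else {
    # invert the CDF of N(z - lambda, s^2) truncated to [0, Inf)
    p0 <- pnorm(0, z - lambda, s)
    qnorm(p0 + (q - w) / (1 - w) * (1 - p0), z - lambda, s)
  }
}

# e.g., a 95% equal-tailed interval from the full conditional:
sapply(c(.025, .975), full_cond_quantile,
       z = 0.05, lambda = 0.1, sigma = 1, n = 100)
```

Inverse-CDF sampling, i.e. full_cond_quantile(runif(1), ...), also yields the hypothetical rfull_cond() sampler assumed in the Hybrid Bootstrap sketch earlier.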