LASSO-penalized linear regression adds an L_1 penalty to the squared error loss:
\begin{align*}Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = \frac{1}{2n}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1.\end{align*}
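For reference, this is exactly the criterion minimized by scikit-learn's Lasso, whose alpha plays the role of \lambda; a minimal sketch, with the design and penalty level chosen purely for illustration:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0                              # illustrative coefficients
y = X @ beta + rng.standard_normal(n)

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||_2^2 + alpha*||b||_1,
# i.e. Q(beta | X, y, lambda) with alpha = lambda.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))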
Chatterjee and Lahiri (2010) demonstrated that when applied to lasso estimators, the bootstrap is inconsistent.
In this work, we break the issues with bootstrapping the lasso into two distinct but related problems:
To address these problems, we introduced
What happens if we apply the bootstrap directly to the lasso?
Data Generation
n = p = 100
10 features contain signal
90 are small but nonzero
Simulation
Repeated 1000 times
Plot shows one realization for 30 of the features
Coverage is averaged across all 100 features over all 1000 repetitions (a minimal sketch of this setup follows below)
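As a reference point, a minimal sketch of the naive approach in Python: resample rows with replacement (pairs bootstrap), refit the lasso on each resample, and form percentile intervals. The effect sizes, noise level, and penalty below are illustrative stand-ins, not the exact values used in this simulation.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, B = 100, 100, 500
# Assumed data generation: 10 signal features, 90 small-but-nonzero features.
beta = np.concatenate([np.full(10, 1.0), np.full(90, 0.05)])
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

lam = 0.1                                    # penalty level, fixed for illustration
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)         # pairs bootstrap: resample (x_i, y_i)
    boot[b] = Lasso(alpha=lam, fit_intercept=False).fit(X[idx], y[idx]).coef_

# Naive 95% percentile intervals; for weak signals most bootstrap draws are
# exactly 0, so many intervals collapse to a point at zero.
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print("average coverage across features:", np.mean((lo <= beta) & (beta <= hi)))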
The lasso can be formulated as a Bayesian regression model by placing an appropriate prior on \boldsymbol{\beta} (Tibshirani 1996; Park and Casella 2008). For the lasso specifically, this prior is a Laplace distribution:
\begin{align*}p(\boldsymbol{\beta}) = \prod_{j = 1}^{p} \frac{\gamma}{2}\exp(-\gamma \left\lvert\beta_j\right\rvert), \gamma > 0.\end{align*}
With this prior, the lasso estimate, \widehat{\boldsymbol{\beta}}(\lambda), is the posterior mode for \boldsymbol{\beta}
when \gamma = n\lambda/\sigma^2.
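To see this, note that with \mathbf{y}\,|\,\boldsymbol{\beta}\sim \textrm{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}), the negative log posterior is, up to an additive constant,
\begin{align*}
-\log p(\boldsymbol{\beta}\,|\,\mathbf{y}) = \frac{1}{2\sigma^2}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \gamma\lVert\boldsymbol{\beta}\rVert_1
= \frac{n}{\sigma^2}\left\{\frac{1}{2n}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \frac{\gamma\sigma^2}{n}\lVert\boldsymbol{\beta}\rVert_1\right\},
\end{align*}
so maximizing the posterior is equivalent to minimizing Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) with \lambda = \gamma\sigma^2/n.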
To address the Epsilon Conundrum, we propose drawing from the full conditional posterior, p(\beta_j | \boldsymbol{\beta}_{-j}), whenever the bootstrap draw is 0.
We call this the Hybrid Bootstrap.
Data Generation
n = p = 100
10 features contain signal
90 are small but nonzero
Simulation
Repeated 1000 times
Plot shows one realization for 30 of the features
Coverage is averaged across all 100 features over all 1000 repetitions
The addition of a penalty has desirable implications for the resulting lasso estimators:
Sparsity
Can force some coefficients to exactly zero, leading to sparse, more interpretable models
Bias-Variance Trade-off
Controlled increase in bias often results in a significant reduction in variance, improving prediction
Overfitting Prevention
Shrinking coefficient estimates reduces model complexity, mitigating overfitting
Improved Stability in Presence of Multicollinearity
Stabilizes coefficient estimates when predictors are highly correlated
The statistical community seems to instinctively answer NO.
Popular methods for constructing intervals using the lasso focus on debiasing in one way or another:
The goal of debiasing is to obtain nominal coverage for all values of \beta.
However, Bayesian credible intervals behave differently.
The ridge (L_2) penalty can be formulated as a Normal prior.
Consider the credible intervals for \theta in a \textrm{N}(\theta, \sigma^2) model with prior \theta \sim \textrm{N}(0, \tau^2) where \sigma = \tau = 1.
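Working this example out for a single observation y, the posterior is
\begin{align*}
\theta \,|\, y \sim \textrm{N}\left(\frac{\tau^2}{\sigma^2+\tau^2}\,y,\ \frac{\sigma^2\tau^2}{\sigma^2+\tau^2}\right) = \textrm{N}\left(\frac{y}{2}, \frac{1}{2}\right) \text{ when } \sigma = \tau = 1,
\end{align*}
so the 95% credible interval y/2 \pm 1.96/\sqrt{2} is both shrunk toward zero and narrower than the unpenalized interval y \pm 1.96; its frequentist coverage exceeds 95% for \theta near zero and falls below it for large \lvert\theta\rvert.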
Theorem. Let \textrm{Cover}(\theta) = \mathbb{P}\{\theta \in \mathcal{A}(\mathbf{y})\}. If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.
For a high-dimensional problem, instead of integrating with respect to the prior, a more interesting quantity may be the average coverage with respect to the distribution of \theta values actually present.
Does \tfrac{1}{p} \sum_{j=1}^p \textrm{Cover}(\theta_j) \approx \int \textrm{Cover}(\theta)p(\theta) \, d\theta ?
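A minimal Monte Carlo sketch of both quantities in the scalar normal example above (constants are illustrative): coverage at a fixed \theta moves far from 95%, while averaging \theta over the prior recovers the nominal level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
half = stats.norm.ppf(0.975) * np.sqrt(0.5)      # half-width of the N(y/2, 1/2) credible interval

def coverage(theta, reps=100_000):
    """Frequentist coverage of y/2 +/- half at a fixed theta, with y ~ N(theta, 1)."""
    y = rng.normal(theta, 1.0, size=reps)
    return np.mean(np.abs(theta - y / 2) <= half)

print("coverage at theta = 0:", coverage(0.0))   # well above 0.95
print("coverage at theta = 3:", coverage(3.0))   # well below 0.95

# Average coverage when theta is drawn from the prior N(0, 1): approximately 0.95.
theta = rng.normal(0.0, 1.0, size=100_000)
y = rng.normal(theta, 1.0)
print("average coverage over the prior:", np.mean(np.abs(theta - y / 2) <= half))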
Data Generation
\boldsymbol{\beta}\sim Laplace
\mathbf{X} generated under independence
n = p = 100
Simulation
1000 iterations
Red curve: estimated coverage as a function of \beta
Dashed line: average coverage
Data Generation
\boldsymbol{\beta}\sim Laplace
\mathbf{X} generated with AR(\rho) correlation structure
p = 100, n varied
Simulation
1000 iterations
Compute coverage in each simulation
Plot displays distributions of these coverages
Scheetz et al. (2006)
If you chose a biased estimation method, is it not reasonable that the resulting intervals should also reflect that bias?
We introduced:
Together, these proposals lead to intervals that clearly reflect the assumptions that went into fitting the original lasso model.
The Hybrid Bootstrap is available in the R package ncvreg.
Theorem. If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta})\, d\boldsymbol{\theta}= 1 - \alpha.
Proof. By definition, a 100(1-\alpha)\% credible region for \theta_j is any set \mathcal{A}_j(\mathbf{y}) such that \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}= 1 - \alpha. The coverage probability is defined as \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta})d\mathbf{y}.
The average coverage, integrated with respect to the prior p(\boldsymbol{\theta}), is therefore \begin{aligned} \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\mathbf{y}d\boldsymbol{\theta} &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) p(\mathbf{y}) d\mathbf{y}d\boldsymbol{\theta}\\ &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) d\boldsymbol{\theta}\, p(\mathbf{y}) d\mathbf{y}\\ &= \int (1 - \alpha) p(\mathbf{y}) d\mathbf{y}\\ &= 1 - \alpha. \end{aligned} Thus, the average coverage equals the nominal coverage.
The full conditional posterior for \beta_j is defined as the distribution for \beta_j conditional on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j}(\lambda).
The partial residuals, \mathbf{r}_{j}, are then defined as \mathbf{y}- \mathbf{X}_{-j}\widehat{\boldsymbol{\beta}}_{-j}(\lambda).
Although the normal likelihood and Laplace prior are not conjugate, the full conditional still has a closed form:
\begin{aligned} p(\beta_j | \boldsymbol{\beta}_{-j}) &\propto \begin{cases} C_{-} \exp\left\{-\frac{n}{2\sigma^2} \left(\beta_j - (z_j + \lambda)\right)^2\right\} & \text{if } \beta_j < 0, \\ C_{+} \exp\left\{-\frac{n}{2\sigma^2} \left(\beta_j - (z_j - \lambda)\right)^2\right\} & \text{if } \beta_j \geq 0, \end{cases} \end{aligned} where C_{-} = \exp(z_j \lambda n/\sigma^2), C_{+} = \exp(-z_j \lambda n/\sigma^2), z_{j} = \frac{1}{n} \mathbf{x}_{j}^{T}\mathbf{r}_{j}, and \mathbf{X} is assumed standardized so that \mathbf{x}_j^T\mathbf{x}_j = n.
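Concretely, this full conditional is a two-component mixture: a \textrm{N}(z_j+\lambda, \sigma^2/n) truncated to (-\infty, 0) and a \textrm{N}(z_j-\lambda, \sigma^2/n) truncated to [0, \infty), with weights proportional to C_{-}\Phi\left(-(z_j+\lambda)\sqrt{n}/\sigma\right) and C_{+}\Phi\left((z_j-\lambda)\sqrt{n}/\sigma\right). A minimal sketch of drawing from it in Python; the function and argument names are ours for illustration, not the ncvreg implementation:

import numpy as np
from scipy import stats

def draw_full_conditional(z_j, lam, sigma, n, rng):
    # Standard deviation of each (untruncated) normal piece.
    s = sigma / np.sqrt(n)
    mu_neg, mu_pos = z_j + lam, z_j - lam
    # Log mixture weights, kept in log space to avoid overflow in C_- and C_+.
    log_w_neg = (z_j * lam * n / sigma**2) + stats.norm.logcdf(0.0, loc=mu_neg, scale=s)
    log_w_pos = (-z_j * lam * n / sigma**2) + stats.norm.logsf(0.0, loc=mu_pos, scale=s)
    p_neg = 1.0 / (1.0 + np.exp(log_w_pos - log_w_neg))
    if rng.random() < p_neg:
        # N(z_j + lambda, sigma^2/n) truncated to (-inf, 0).
        return stats.truncnorm.rvs(-np.inf, (0.0 - mu_neg) / s,
                                   loc=mu_neg, scale=s, random_state=rng)
    # N(z_j - lambda, sigma^2/n) truncated to [0, inf).
    return stats.truncnorm.rvs((0.0 - mu_pos) / s, np.inf,
                               loc=mu_pos, scale=s, random_state=rng)

rng = np.random.default_rng(4)
print(draw_full_conditional(z_j=0.02, lam=0.1, sigma=1.0, n=100, rng=rng))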
ENAR 2025