Lasso-penalized linear regression (Tibshirani 1996) adds an L_1 penalty to the squared-error loss:
Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = \frac{1}{2n}\lVert\mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1
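As a point of reference, scikit-learn's lasso implementation minimizes exactly this objective (with \lambda called alpha). A minimal illustrative sketch in Python, not the R/ncvreg workflow used in this work:

```python
# Minimal sketch: scikit-learn's Lasso/LassoCV minimize
# (1/(2n)) * ||y - X b||_2^2 + alpha * ||b||_1, matching Q above.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0                                  # 10 signal features
y = X @ beta + rng.standard_normal(n)

fit = LassoCV(cv=10).fit(X, y)                   # lambda chosen by cross-validation
print(fit.alpha_, np.sum(fit.coef_ != 0))        # selected lambda, model size
```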
In this work, we revisit the bootstrap, asking:
Does the bootstrap not work at all for the lasso, or does it just not work in the traditional sense?
We break the issues with bootstrapping the lasso into two distinct but related problems:
To address these problems, we introduce, respectively:
What happens if we apply the bootstrap directly to the lasso?
Data Generation
n = p = 100
10 signal features
90 features with small, but nonzero, coefficients
Simulation
Repeated 1000 times
Plot shows a single simulated instance, for 30 of the features
Coverage is averaged across all features and all repetitions (a sketch of this bootstrap simulation follows below)
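A minimal sketch of the bootstrap step in this simulation (Python for illustration; \lambda is held fixed at the CV choice, and the nominal level and size of the small coefficients are assumptions here):

```python
# Naive percentile bootstrap for the lasso: resample rows, refit at a fixed
# lambda, take percentile intervals, and check coverage against the true beta.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(1)
n, p = 100, 100
beta = np.concatenate([np.full(10, 1.0),         # 10 signal features
                       np.full(90, 0.05)])       # 90 small-but-nonzero features (size assumed)
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

lam = LassoCV(cv=10).fit(X, y).alpha_            # lambda from CV on the original data
B = 200
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, n)                  # resample observations with replacement
    boot[b] = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_

lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)   # nominal 95% (level assumed)
print(np.mean((lower <= beta) & (beta <= upper)))          # average coverage over features
```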
The lasso can be formulated as a Bayesian regression model by setting an appropriate prior (Tibshirani 1996; Park and Casella 2008). Specifically for the lasso, a Laplace distribution:
\begin{align*}p(\boldsymbol{\beta}) = \prod_{j = 1}^{p} \frac{\gamma}{2}\exp(-\gamma \left\lvert\beta_j\right\rvert), \gamma > 0.\end{align*}
With this prior, the lasso estimate, \widehat{\boldsymbol{\beta}}(\lambda), is the posterior mode for \boldsymbol{\beta} when \gamma = n\lambda/\sigma^2.
The full conditional likelihood for \beta_j is defined as the likelihood for \beta_j conditional on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j}(\lambda):
L(\beta_j|\boldsymbol{\beta}_{-j}) \propto \exp(-\frac{n}{2\sigma^2}(\beta_{j} - z_{j})^2).
With this likelihood and the previously mentioned prior, the full conditional distribution, viewed as a Bayesian posterior, can be written as:
\begin{aligned} p(\beta_j | \boldsymbol{\beta}_{-j}) &\propto \begin{cases} C_{-} \exp\{-\frac{n}{2\sigma^2} (\beta_j - (z_j + \lambda))^2\}, \text{ if } \beta_j < 0, \\ C_{+} \exp\{-\frac{n}{2\sigma^2} (\beta_j - (z_j - \lambda))^2\}, \text{ if } \beta_j \geq 0 \\ \end{cases} \end{aligned}
Note: This formulation is attractive because it allows efficient computation of quantiles from the full conditional posterior.
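For illustration, a standalone sketch of how quantiles of this piecewise-normal full conditional posterior can be computed (notation as above; this is not the packaged implementation):

```python
# Quantile of p(beta_j | beta_{-j}): two normal pieces with means z_j + lambda
# (beta_j < 0) and z_j - lambda (beta_j >= 0), standard deviation sigma/sqrt(n),
# and piece weights C_-, C_+ obtained by completing the square.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def full_conditional_quantile(q, z, lam, sigma2, n):
    s = np.sqrt(sigma2 / n)
    mu_neg, mu_pos = z + lam, z - lam
    # log mass of each piece (the common sqrt(2*pi)*s factor cancels)
    logw_neg = (n / (2 * sigma2)) * (2 * z * lam + lam**2) + norm.logcdf(-mu_neg / s)
    logw_pos = (n / (2 * sigma2)) * (lam**2 - 2 * z * lam) + norm.logcdf(mu_pos / s)
    p_neg = np.exp(logw_neg - logsumexp([logw_neg, logw_pos]))
    if q <= p_neg:                               # quantile lies in the negative piece
        u = (q / p_neg) * norm.cdf(-mu_neg / s)
        return mu_neg + s * norm.ppf(u)
    u = norm.cdf(-mu_pos / s) + (q - p_neg) / (1 - p_neg) * norm.cdf(mu_pos / s)
    return mu_pos + s * norm.ppf(u)              # quantile lies in the non-negative piece

# e.g. an 80% credible interval for one coefficient
print(full_conditional_quantile(0.1, z=0.3, lam=0.2, sigma2=1.0, n=100),
      full_conditional_quantile(0.9, z=0.3, lam=0.2, sigma2=1.0, n=100))
```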
Note: Solely for visualization purposes.
\hat{\sigma}^2 = \frac{1}{n - |\hat{S}_{\lambda_{CV}}|} ||\mathbf{y}- \mathbf{X}\widehat{\boldsymbol{\beta}}({\lambda_{CV}})||_2^2,
where |\hat{S}_{\lambda_{CV}}| = \sum_{j = 1}^{p} I\left( \widehat{\beta}_j(\lambda_{CV}) \neq 0 \right) is the number of nonzero coefficients.
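A short sketch of this plug-in estimate (X and y assumed centered/standardized, as in the rest of the setup):

```python
# Plug-in variance estimate: residual sum of squares at lambda_CV divided by
# n minus the number of selected (nonzero) coefficients.
import numpy as np
from sklearn.linear_model import LassoCV

def sigma2_hat(X, y):
    fit = LassoCV(cv=10).fit(X, y)
    rss = np.sum((y - X @ fit.coef_ - fit.intercept_) ** 2)
    s_size = np.sum(fit.coef_ != 0)              # |S_hat_{lambda_CV}|
    return rss / (len(y) - s_size)
```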
Data Generation
n = p = 100
10 signal features
90 features with small, but nonzero, coefficients
Simulation
Repeated 1000 times
Plot shows a single simulated instance, for 30 of the features
Coverage is averaged across all features and all repetitions
However, Bayesian credible intervals behave differently.
The ridge (L_2) penalty corresponds to a Normal prior.
Consider the credible intervals for \theta in a \textrm{N}(\theta, \sigma^2) model with prior \theta \sim \textrm{N}(0, \tau^2) where \sigma = \tau = 1.
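For this toy example the posterior is available in closed form, which makes the shrinkage of the credible interval explicit:
\begin{align*}
\theta \mid y \sim \textrm{N}\!\left(\frac{\tau^2}{\sigma^2 + \tau^2}\, y,\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right) = \textrm{N}\!\left(\frac{y}{2},\, \frac{1}{2}\right),
\end{align*}
so the 95% credible interval is y/2 \pm 1.96/\sqrt{2} \approx y/2 \pm 1.39, pulled toward zero relative to the unpenalized interval y \pm 1.96.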
Theorem. If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.
Instead of integrating with respect to the prior, a more interesting quantity in high dimensions may be the average with respect to the distribution of \theta values present.
Does \tfrac{1}{p} \sum_{j=1}^p \textrm{Cover}(\theta_j) \approx \int \textrm{Cover}(\theta)p(\theta) \, d\theta ?
Data Generation
\boldsymbol{\beta}\sim Laplace
\mathbf{X} generated under independence (data-generation sketch below)
n = p = 100
Simulation
1000 iterations
Red curve: estimated coverage as a function of \beta
Dashed line: average coverage
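A sketch of this data-generation step (the Laplace rate is not given on the slide and is assumed here):

```python
# beta drawn from a Laplace prior, X with independent standard-normal columns.
import numpy as np

rng = np.random.default_rng(2)
n = p = 100
beta = rng.laplace(loc=0.0, scale=1.0, size=p)   # scale = 1/gamma (assumed value)
X = rng.standard_normal((n, p))                  # independent features
y = X @ beta + rng.standard_normal(n)
```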
Data Generation
\boldsymbol{\beta}\sim Laplace
\mathbf{X} generated with AR(\rho) correlation structure (sketch below)
p = 100, n varied
Simulation
1000 iterations
Compute coverage in each simulation
Plot displays distributions of these coverages
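A sketch of generating \mathbf{X} with the AR(\rho) structure, \Sigma_{ij} = \rho^{|i - j|} (the value of \rho is an assumption here):

```python
# Correlated design: rows of X are N(0, Sigma) with Sigma[i, j] = rho**|i - j|.
import numpy as np

rng = np.random.default_rng(3)
n, p, rho = 100, 100, 0.8
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
```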
hdi
\lambda is reselected for each bootstrap draw using the 1SE rule.
Data Generation
n = 100
Simulation
1000 iterations
Curves: estimated coverage as a function of \beta
Dashed lines: average coverages
Data Generation
n varied
Simulation
1000 iterations
Top: Distribution of coverages
Middle: Median width across all simulations and intervals
Bottom: Average runtime across all simulations
Methods
Scheetz et al. (2006)
ncvreg (3.15.0)
Bootstrapping is slow
Not prohibitively so, but slow relative to non-bootstrap alternatives.
Could be more robust to correlation
The fix introduced for the Hybrid bootstrap doesn’t inherently do anything to account for correlation between features.
Hybrid intervals sometimes do not contain lasso point estimates
This is a side effect of bootstrapping.
Question
What if, instead of bootstrapping to capture uncertainty, intervals are computed directly from a well-adjusted distribution?
Conditioning on \boldsymbol{\beta}_{-j} = \widehat{\boldsymbol{\beta}}_{-j} results in the likelihood:
L(\beta_j|\boldsymbol{\beta}_{-j}) \propto \exp(-\frac{n}{2\sigma^2}(\beta_{j} - z_{j})^2).
where z_{j} = \frac{1}{n} \mathbf{x}_{j}^{T}\mathbf{r}_{j} and \mathbf{r}_{j} = \mathbf{y}- \mathbf{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}(\lambda).
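A small sketch of these quantities, given a fitted lasso coefficient vector (illustrative Python, notation as above):

```python
# Partial residual r_j = y - X_{-j} beta_hat_{-j} and z_j = (1/n) x_j' r_j.
import numpy as np

def z_partial(X, y, beta_hat, j):
    n = X.shape[0]
    r_j = y - np.delete(X, j, axis=1) @ np.delete(beta_hat, j)
    return X[:, j] @ r_j / n
```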
L(\beta_j|\hat{S}_j) \propto \exp(-\frac{\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j}{2\sigma^2}(\beta_{j} - \tilde{\beta}_{j})^2),
where \tilde{\beta}_j = (\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j)^{-1} \mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{y}.
L(\beta_j|\hat{S}_j) \propto \exp(-\frac{\mathbf{x}_j^T \mathbf{Q}_{\hat{S}_j} \mathbf{x}_j}{2\sigma^2}(\beta_{j} - z_{j})^2)
Data \boldsymbol{\beta}\sim Laplace | \mathbf{X} generated under independence | p = 100, n varied
Plot displays estimated coverage as a function of \beta (curves) and average coverage (dashed lines) across 1000 simulations
Data \boldsymbol{\beta}\sim Laplace | \mathbf{X} generated with AR(\rho) cor | p = 100, n varied
Plot displays distributions of coverages across 1000 simulations
The two methods take fundamentally different approaches to construct intervals that reflect the lasso model fit.
Hybrid
Takes a model averaging approach with the bootstrap.
PIPE Posterior
Adjusts the uncertainty by conditioning on the selected features.
Binomial
Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = -\frac{1}{n} \sum\lbrace y_i X_i^T\boldsymbol{\beta}- \log(1+\exp(X_i^T \boldsymbol{\beta})) \rbrace + \lambda \sum_{j = 1}^p |\beta_j|
Poisson
Q(\boldsymbol{\beta}|\mathbf{X},\mathbf{y},\lambda) = -\frac{1}{n} \sum\lbrace y_i X_i^T\boldsymbol{\beta}- \exp(X_i^T \boldsymbol{\beta}) \rbrace + \lambda \sum_{j = 1}^p |\beta_j|
\begin{aligned} \eta_i &= \mathbf{x}_i^T \boldsymbol{\beta}, \\ \pi_i &= \begin{cases} \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \\ \exp(\eta_i) \end{cases}, \quad \boldsymbol{W} = \begin{cases} diag(\pi_i (1 - \pi_i)) & \text{(Binomial)} \\ diag(\pi_i) & \text{(Poisson)} \end{cases} \\ \tilde{\mathbf{y}} &= \boldsymbol{W}^{-1} (\mathbf{y}- \boldsymbol{\pi}) + \boldsymbol{\eta}, \quad z_j = \frac{\mathbf{x}_j^T\boldsymbol{W} (\tilde{\mathbf{y}} - \mathbf{X}_{-j} \beta_{-j}) }{\mathbf{x}_j^T\boldsymbol{W} \mathbf{x}_j}, \\ \nu_j &= \frac{1}{n} \mathbf{x}^T_j \boldsymbol{W} \mathbf{x}_j, \quad \lambda_j = \frac{\lambda}{\nu_j} \end{aligned}
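A sketch of these working quantities for the binomial case, given current coefficients (the Poisson case swaps in \pi_i = \exp(\eta_i) and W = diag(\pi_i)):

```python
# IRLS-style working response and per-feature quantities for the logistic lasso.
import numpy as np

def working_quantities_binomial(X, y, beta_hat, lam, j):
    eta = X @ beta_hat
    pi = 1.0 / (1.0 + np.exp(-eta))                    # inverse logit
    w = pi * (1.0 - pi)                                # diagonal of W
    y_tilde = (y - pi) / w + eta                       # working response
    eta_minus_j = eta - X[:, j] * beta_hat[j]          # X_{-j} beta_{-j}
    xwx = X[:, j] @ (w * X[:, j])
    z_j = X[:, j] @ (w * (y_tilde - eta_minus_j)) / xwx
    nu_j = xwx / len(y)
    return z_j, nu_j, lam / nu_j                       # z_j, nu_j, lambda_j
```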
Data p = 100, n = 400 | \beta_{1-10} contain signal, \beta_{11-100} = 0
Elastic net
p(\boldsymbol{\beta}| \lambda, \alpha) = \alpha \lambda||\boldsymbol{\beta}||_1 + \frac{(1 - \alpha) \lambda}{2} ||\boldsymbol{\beta}||^2_2
Leverage a data augmentation approach in which the elastic net can be written as a lasso model with:
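One standard augmentation consistent with the \frac{1}{2n} squared-error loss (written here as an assumption about the exact form used) is
\begin{align*}
\mathbf{X}^* = \begin{pmatrix} \mathbf{X} \\ \sqrt{n(1-\alpha)\lambda}\, \mathbf{I}_p \end{pmatrix}, \qquad
\mathbf{y}^* = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix},
\end{align*}
so that \frac{1}{2n}\lVert\mathbf{y}^* - \mathbf{X}^*\boldsymbol{\beta}\rVert_2^2 + \alpha\lambda\lVert\boldsymbol{\beta}\rVert_1 expands to the elastic net objective.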
Data n = p = 100 \quad | \quad \beta_{A1} = 1, all other \beta_j = 0
\textrm{Cor}(\mathbf{x}_{A1}, \mathbf{x}_{B1}) = 0.99; \quad \textrm{Cor}(\mathbf{x}_i, \mathbf{x}_j) = 0 for all other pairs (i, j)
MCP
p_\lambda(\beta|\lambda, \gamma) = \begin{cases} \begin{aligned} &\lambda\beta - \frac{\beta^2}{2\gamma} & \text{ if } \beta \leq \gamma \lambda\\ &\frac{1}{2}\gamma \lambda^2 & \text{ if } \beta \geq \gamma \lambda \end{aligned} \end{cases}
SCAD
p_\lambda(\beta|\lambda, \gamma) = \begin{cases} \begin{aligned} &\lambda\beta & \text{ if } \beta \leq \lambda\\ &\frac{2\gamma \lambda\beta - \beta^2 - \lambda^2}{2(\gamma - 1)} & \text{ if } \lambda< \beta < \gamma \lambda\\ &\frac{\lambda^2(\gamma + 1)}{2} & \text{ if } \beta \geq \gamma \lambda \end{aligned} \end{cases}
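For reference, a direct sketch of these two penalty functions (applied to |\beta|):

```python
# MCP and SCAD penalty values, following the piecewise definitions above.
import numpy as np

def mcp(beta, lam, gamma):
    b = np.abs(beta)
    return np.where(b <= gamma * lam, lam * b - b**2 / (2 * gamma), 0.5 * gamma * lam**2)

def scad(beta, lam, gamma):
    b = np.abs(beta)
    mid = (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    return np.where(b <= lam, lam * b,
                    np.where(b < gamma * lam, mid, lam**2 * (gamma + 1) / 2))
```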
Data p = 100, n = 200 | \beta_{1-10} contain signal, \beta_{11-100} = 0
Data p = 100, n = 200 | \beta_{1-10} contain signal, \beta_{11-100} = 0
If one does want intervals with closer-to-nominal coverage for larger values of \beta, is it better to:
If the likelihood is correctly specified according to the true data generating mechanism, p(\mathbf{y}| \boldsymbol{\theta}), then a 1-\alpha credible set for any parameter \theta_j will satisfy \int \textrm{Cover}(\theta_j)p(\boldsymbol{\theta}) d\boldsymbol{\theta}= 1 - \alpha.
Proof. By definition, a 100(1-\alpha)\% credible region for \theta_j is any set \mathcal{A}_j(\mathbf{y}) such that \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}= 1 - \alpha. The coverage probability is defined as \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta})d\mathbf{y}.
The average coverage, integrated with respect to the prior p(\boldsymbol{\theta}), is therefore \begin{aligned} \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\mathbf{y}| \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\mathbf{y}d\boldsymbol{\theta} &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) p(\mathbf{y}) d\mathbf{y}d\boldsymbol{\theta}\\ &= \int \int I\{\theta_j \in \mathcal{A}_j(\mathbf{y})\} p(\boldsymbol{\theta}| \mathbf{y}) d\boldsymbol{\theta}\, p(\mathbf{y}) d\mathbf{y}\\ &= \int (1 - \alpha) p(\mathbf{y}) d\mathbf{y}\\ &= 1 - \alpha. \end{aligned} Thus, the average coverage equals the nominal coverage.