Title: | Efficient Inference on High-Dimensional Linear Model with Missing Outcomes |
---|---|
Description: | A statistically and computationally efficient debiasing method for conducting valid inference on the high-dimensional linear regression function with missing outcomes. The reference paper is Zhang, Giessing, and Chen (2023) <arXiv:2309.06429>. |
Authors: | Yikun Zhang [aut, cre] |
Maintainer: | Yikun Zhang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2 |
Built: | 2025-03-02 04:19:52 UTC |
Source: | https://github.com/cran/DebiasInfer |
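A minimal sketch for installing and attaching the package from CRAN (the package name follows the Source URL above):

```r
# Install the released version from CRAN, then attach it.
install.packages("DebiasInfer")
library(DebiasInfer)
```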
This function implements our proposed debiasing (primal) program, which solves for the weights that correct the Lasso pilot estimate.
```r
DebiasProg(X, x, Pi, gamma_n = 0.1)
```
X | The input design n*d matrix.
x | The current query point, which is a 1*d array.
Pi | An n*n diagonal matrix with (estimated) propensity scores as its diagonal entries.
gamma_n | The regularization parameter γ/n. (Default: gamma_n = 0.1.)
The estimated weights from our debiasing program, an n-dimensional vector.
Yikun Zhang, [email protected]
Zhang, Y., Giessing, A. and Chen, Y.-C. (2023) Efficient Inference on High-Dimensional Linear Model with Missing Outcomes. https://arxiv.org/abs/2309.06429.
Fu, A., Narasimhan, B. and Boyd, S. (2017) CVXR: An R Package for Disciplined Convex Optimization. Journal of Statistical Software 94 (14): 1–34. doi:10.18637/jss.v094.i14.
```r
require(MASS)
require(glmnet)

d = 1000
n = 900

Sigma = array(0, dim = c(d, d)) + diag(d)
rho = 0.1
for (i in 1:(d - 1)) {
  for (j in (i + 1):d) {
    if ((j < i + 6) | (j > i + d - 6)) {
      Sigma[i, j] = rho
      Sigma[j, i] = rho
    }
  }
}
sig = 1

## Current query point
x_cur = rep(0, d)
x_cur[c(1, 2, 3, 7, 8)] = c(1, 1/2, 1/4, 1/2, 1/8)
x_cur = array(x_cur, dim = c(1, d))

## True regression coefficient
s_beta = 5
beta_0 = rep(0, d)
beta_0[1:s_beta] = sqrt(5)

## Generate the design matrix and outcomes
X_sim = mvrnorm(n, mu = rep(0, d), Sigma)
eps_err_sim = sig * rnorm(n)
Y_sim = drop(X_sim %*% beta_0) + eps_err_sim
obs_prob = 1 / (1 + exp(-1 + X_sim[, 7] - X_sim[, 8]))
R_sim = rep(1, n)
R_sim[runif(n) >= obs_prob] = 0

## Estimate the propensity scores via the Lasso-type generalized linear model
zeta = 5 * sqrt(log(d) / n) / n
lr1 = glmnet(X_sim, R_sim, family = "binomial", alpha = 1, lambda = zeta,
             standardize = TRUE, thresh = 1e-6)
prop_score = drop(predict(lr1, newx = X_sim, type = "response"))

## Estimate the debiasing weights
w_obs = DebiasProg(X_sim, x_cur, Pi = diag(prop_score), gamma_n = 0.1)
```
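The returned weights enter the final debiased estimate through a weighted residual correction of the Lasso pilot estimate. The following is a hedged sketch of that construction, based on the estimator's form in the reference paper; the pilot estimate `beta_pilot`, its tuning parameter, and the residual vector are illustrative choices rather than package outputs:

```r
## A sketch (not a package function): combine a Lasso pilot estimate with the
## weighted residual correction to obtain the debiased estimate of x^T beta_0,
## following the estimator's form in Zhang, Giessing, and Chen (2023).
lasso_fit = glmnet(X_sim[R_sim == 1, ], Y_sim[R_sim == 1], family = "gaussian",
                   alpha = 1, lambda = sqrt(log(d) / n),  # illustrative tuning choice
                   standardize = TRUE, intercept = FALSE)
beta_pilot = as.numeric(coef(lasso_fit))[-1]      # drop the (zero) intercept
resid_pilot = Y_sim - drop(X_sim %*% beta_pilot)  # pilot residuals on all rows
m_deb = drop(x_cur %*% beta_pilot) +
  sum(w_obs * R_sim * resid_pilot) / sqrt(n)      # only observed outcomes contribute
```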
This function implements our proposed debiasing program that selects the tuning parameter γ/n by cross-validation and returns the final debiasing weights.
```r
DebiasProgCV(X, x, prop_score, gamma_lst = NULL, cv_fold = 5, cv_rule = "1se")
```
X | The input design n*d matrix.
x | The current query point, which is a 1*d array.
prop_score | An n-dim numeric vector with (estimated) propensity scores as its entries.
gamma_lst | A numeric vector with candidate values for the regularization parameter γ/n. (Default: gamma_lst = NULL.)
cv_fold | The number of folds for cross-validation on the dual program. (Default: cv_fold = 5.)
cv_rule | The criterion for selecting the final value of the regularization parameter γ/n. (Default: cv_rule = "1se".)
A list that contains three elements:

w_obs | The final estimated weights from our debiasing program.
ll_obs | The final value of the solution to our debiasing dual program.
gamma_n_opt | The final value of the tuning parameter γ/n selected by cross-validation.
Yikun Zhang, [email protected]
Zhang, Y., Giessing, A. and Chen, Y.-C. (2023) Efficient Inference on High-Dimensional Linear Model with Missing Outcomes. https://arxiv.org/abs/2309.06429.
```r
require(MASS)
require(glmnet)

d = 1000
n = 900

Sigma = array(0, dim = c(d, d)) + diag(d)
rho = 0.1
for (i in 1:(d - 1)) {
  for (j in (i + 1):d) {
    if ((j < i + 6) | (j > i + d - 6)) {
      Sigma[i, j] = rho
      Sigma[j, i] = rho
    }
  }
}
sig = 1

## Current query point
x_cur = rep(0, d)
x_cur[c(1, 2, 3, 7, 8)] = c(1, 1/2, 1/4, 1/2, 1/8)
x_cur = array(x_cur, dim = c(1, d))

## True regression coefficient
s_beta = 5
beta_0 = rep(0, d)
beta_0[1:s_beta] = sqrt(5)

## Generate the design matrix and outcomes
X_sim = mvrnorm(n, mu = rep(0, d), Sigma)
eps_err_sim = sig * rnorm(n)
Y_sim = drop(X_sim %*% beta_0) + eps_err_sim
obs_prob = 1 / (1 + exp(-1 + X_sim[, 7] - X_sim[, 8]))
R_sim = rep(1, n)
R_sim[runif(n) >= obs_prob] = 0

## Estimate the propensity scores via the Lasso-type generalized linear model
zeta = 5 * sqrt(log(d) / n) / n
lr1 = glmnet(X_sim, R_sim, family = "binomial", alpha = 1, lambda = zeta,
             standardize = TRUE, thresh = 1e-6)
prop_score = drop(predict(lr1, newx = X_sim, type = "response"))

## Estimate the debiasing weights with the tuning parameter selected by cross-validation
deb_res = DebiasProgCV(X_sim, x_cur, prop_score, gamma_lst = c(0.1, 0.5, 1),
                       cv_fold = 5, cv_rule = '1se')
```
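Given the cross-validated weights, a confidence interval can be sketched from the asymptotic normality of the debiased estimator. The plug-in standard error below is an assumption based on the form of the weighted residual correction (with `sigma_hat` standing in for a noise-level estimate), not a documented package output:

```r
## A hedged sketch: point estimate and 95% CI from the cross-validated weights.
lasso_fit = glmnet(X_sim[R_sim == 1, ], Y_sim[R_sim == 1], family = "gaussian",
                   alpha = 1, lambda = sqrt(log(d) / n),  # illustrative tuning choice
                   standardize = TRUE, intercept = FALSE)
beta_pilot = as.numeric(coef(lasso_fit))[-1]      # hypothetical Lasso pilot estimate
resid_pilot = Y_sim - drop(X_sim %*% beta_pilot)
w_cv = deb_res$w_obs
m_deb = drop(x_cur %*% beta_pilot) + sum(w_cv * R_sim * resid_pilot) / sqrt(n)
sigma_hat = sig  # placeholder: the noise level is known in this simulation
se_deb = sigma_hat * sqrt(sum(w_cv^2 * prop_score)) / sqrt(n)  # assumed plug-in form
ci = m_deb + c(-1, 1) * qnorm(0.975) * se_deb
```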
This function implements the coordinate descent algorithm for the debiasing dual program. More details can be found in Appendix A of our paper.
```r
DualCD(X, x, Pi = NULL, gamma_n = 0.05, ll_init = NULL, eps = 1e-09, max_iter = 5000)
```
X | The input design n*d matrix.
x | The current query point, which is a 1*d array.
Pi | An n*n diagonal matrix with (estimated) propensity scores as its diagonal entries.
gamma_n | The regularization parameter γ/n. (Default: gamma_n = 0.05.)
ll_init | The initial value of the dual solution vector. (Default: ll_init = NULL; the all-ones vector is then used.)
eps | The tolerance value for convergence. (Default: eps = 1e-9.)
max_iter | The maximum number of coordinate descent iterations. (Default: max_iter = 5000.)
The solution vector to our dual debiasing program.
Yikun Zhang, [email protected]
Zhang, Y., Giessing, A. and Chen, Y.-C. (2023) Efficient Inference on High-Dimensional Linear Model with Missing Outcomes. https://arxiv.org/abs/2309.06429.
```r
require(MASS)
require(glmnet)

d = 1000
n = 900

Sigma = array(0, dim = c(d, d)) + diag(d)
rho = 0.1
for (i in 1:(d - 1)) {
  for (j in (i + 1):d) {
    if ((j < i + 6) | (j > i + d - 6)) {
      Sigma[i, j] = rho
      Sigma[j, i] = rho
    }
  }
}
sig = 1

## Current query point
x_cur = rep(0, d)
x_cur[c(1, 2, 3, 7, 8)] = c(1, 1/2, 1/4, 1/2, 1/8)
x_cur = array(x_cur, dim = c(1, d))

## True regression coefficient
s_beta = 5
beta_0 = rep(0, d)
beta_0[1:s_beta] = sqrt(5)

## Generate the design matrix and outcomes
X_sim = mvrnorm(n, mu = rep(0, d), Sigma)
eps_err_sim = sig * rnorm(n)
Y_sim = drop(X_sim %*% beta_0) + eps_err_sim
obs_prob = 1 / (1 + exp(-1 + X_sim[, 7] - X_sim[, 8]))
R_sim = rep(1, n)
R_sim[runif(n) >= obs_prob] = 0

## Estimate the propensity scores via the Lasso-type generalized linear model
zeta = 5 * sqrt(log(d) / n) / n
lr1 = glmnet(X_sim, R_sim, family = "binomial", alpha = 1, lambda = zeta,
             standardize = TRUE, thresh = 1e-6)
prop_score = drop(predict(lr1, newx = X_sim, type = "response"))

## Solve the debiasing dual program
ll_cur = DualCD(X_sim, x_cur, Pi = diag(prop_score), gamma_n = 0.1,
                ll_init = NULL, eps = 1e-9, max_iter = 5000)
```
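Each coordinate update in an algorithm of this type minimizes the objective along one coordinate and typically reduces to a soft-thresholding step (see SoftThres below). As a generic illustration of that pattern, and not the exact dual update used inside DualCD, here is coordinate descent for a small Lasso problem:

```r
## Generic coordinate descent for min_b ||y - A b||^2 / (2m) + lamb * ||b||_1,
## shown only to illustrate the soft-thresholding update pattern.
set.seed(123)
m = 50
p = 10
A = matrix(rnorm(m * p), m, p)
b_true = c(2, -1.5, rep(0, p - 2))
y = drop(A %*% b_true) + 0.1 * rnorm(m)
lamb = 0.1
b = rep(0, p)
for (it in 1:200) {
  for (j in 1:p) {
    r_j = y - drop(A[, -j, drop = FALSE] %*% b[-j])  # partial residual without coordinate j
    z_j = sum(A[, j] * r_j) / m
    b[j] = sign(z_j) * max(abs(z_j) - lamb, 0) / (sum(A[, j]^2) / m)  # soft-thresholding step
  }
}
round(b, 2)  # close to b_true, with the zero coefficients shrunk to zero
```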
This function computes the objective function value of the debiasing dual program.
```r
DualObj(X, x, Pi, ll_cur, gamma_n = 0.05)
```
X | The input design n*d matrix.
x | The current query point, which is a 1*d array.
Pi | An n*n diagonal matrix with (estimated) propensity scores as its diagonal entries.
ll_cur | The current value of the dual solution vector.
gamma_n | The regularization parameter γ/n. (Default: gamma_n = 0.05.)
The value of the objective function of our dual debiasing program.
Yikun Zhang, [email protected]
Zhang, Y., Giessing, A. and Chen, Y.-C. (2023) Efficient Inference on High-Dimensional Linear Model with Missing Outcomes. https://arxiv.org/abs/2309.06429.
```r
require(MASS)
require(glmnet)

d = 1000
n = 900

Sigma = array(0, dim = c(d, d)) + diag(d)
rho = 0.1
for (i in 1:(d - 1)) {
  for (j in (i + 1):d) {
    if ((j < i + 6) | (j > i + d - 6)) {
      Sigma[i, j] = rho
      Sigma[j, i] = rho
    }
  }
}
sig = 1

## Current query point
x_cur = rep(0, d)
x_cur[c(1, 2, 3, 7, 8)] = c(1, 1/2, 1/4, 1/2, 1/8)
x_cur = array(x_cur, dim = c(1, d))

## True regression coefficient
s_beta = 5
beta_0 = rep(0, d)
beta_0[1:s_beta] = sqrt(5)

## Generate the design matrix and outcomes
X_sim = mvrnorm(n, mu = rep(0, d), Sigma)
eps_err_sim = sig * rnorm(n)
Y_sim = drop(X_sim %*% beta_0) + eps_err_sim
obs_prob = 1 / (1 + exp(-1 + X_sim[, 7] - X_sim[, 8]))
R_sim = rep(1, n)
R_sim[runif(n) >= obs_prob] = 0

## Estimate the propensity scores via the Lasso-type generalized linear model
zeta = 5 * sqrt(log(d) / n) / n
lr1 = glmnet(X_sim, R_sim, family = "binomial", alpha = 1, lambda = zeta,
             standardize = TRUE, thresh = 1e-6)
prop_score = drop(predict(lr1, newx = X_sim, type = "response"))

## Solve the debiasing dual program and evaluate the dual objective function value
ll_cur = DualCD(X_sim, x_cur, Pi = diag(prop_score), gamma_n = 0.1,
                ll_init = NULL, eps = 1e-9, max_iter = 5000)
dual_val = DualObj(X_sim, x_cur, Pi = diag(prop_score), ll_cur = ll_cur, gamma_n = 0.1)
```
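To see how the dual objective responds to the tuning parameter, one can trace it over a few candidate values of γ/n using only the documented calls; DebiasProgCV automates this kind of comparison via cross-validation:

```r
## Trace the dual objective value over candidate tuning parameters.
gamma_cand = c(0.05, 0.1, 0.5, 1)
dual_vals = sapply(gamma_cand, function(g) {
  ll_g = DualCD(X_sim, x_cur, Pi = diag(prop_score), gamma_n = g,
                ll_init = NULL, eps = 1e-9, max_iter = 5000)
  DualObj(X_sim, x_cur, Pi = diag(prop_score), ll_cur = ll_g, gamma_n = g)
})
cbind(gamma_cand, dual_vals)
```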
This function implements the soft-thresholding operator S_λ(θ)_i = sign(θ_i) * max(|θ_i| − λ, 0), applied elementwise to the input vector.
```r
SoftThres(theta, lamb)
```
theta | The input numeric vector.
lamb | The thresholding parameter.
The resulting vector after soft-thresholding.
Yikun Zhang, [email protected]
```r
a = c(1, 2, 4, 6)
SoftThres(theta = a, lamb = 3)
```
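As a quick cross-check, the standard closed form reproduces the same output (a minimal sketch; `soft_thres_manual` is a hypothetical helper, not part of the package):

```r
## Cross-check against the standard closed form of soft-thresholding.
soft_thres_manual <- function(theta, lamb) sign(theta) * pmax(abs(theta) - lamb, 0)
soft_thres_manual(c(1, 2, 4, 6), 3)  # 0 0 1 3, matching SoftThres above
```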