This blog post describes the ForestIV
R package, an implementation of the ForestIV approach to correct for estimation biases caused by having machine-learning-generated covariates in regression analyses.
To install the package, simply run the following command in R:
devtools::install_github("mochenyang/ForestIV")
The package depends on three external libraries:
- hdm: to perform Lasso-based selection of instrumental variables;
- AER: to perform two-stage least squares estimation (for linear regressions);
- ivtools: to perform control function estimation (for GLMs).
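If these dependencies are not already installed, they can be obtained from CRAN first, for example:
# install the three dependencies from CRAN (if needed)
install.packages(c("hdm", "AER", "ivtools"))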
ForestIV Implementation
The core function of the package is ForestIV(), which takes the following inputs:
- data_test: A dataframe of the testing data (in a training-testing split) used when building the random forest. It must have a column named "actual" that contains the ground truth values, and separate columns named "X1", "X2", ... that contain each tree's predictions. Note that if you use the randomForest package to train the random forest model, the column names will be taken care of automatically.
- data_unlabel: A dataframe of the unlabeled data. It must have separate columns named "X1", "X2", ... that contain each tree's predictions.
- control: A character vector of control variable names. Pass an empty vector if there is no control variable.
- method: The method for IV selection. Supported values are "Lasso" for Lasso-based selection and "IIV" for the imperfect IV method (forthcoming in our next paper, stay tuned!).
- iterative: Whether or not to perform iterative IV selection to remove invalid and weak IVs; defaults to TRUE. This parameter is only relevant when method = "Lasso".
- ntree: Number of trees in the random forest.
- model_unbias: A lm or glm object that contains the unbiased regression estimates, typically obtained by running the regression on the labeled data.
- family: Model specification. Same as the family parameter in glm.
- diagnostic: Whether to output diagnostic correlations for instrument validity and strength. Defaults to TRUE, which produces four additional columns in the output, named "pp_abs_before", "pe_abs_before", "pp_abs_after", and "pe_abs_after". The "pp_abs_*" columns contain the average absolute correlation between the endogenous covariate and the IVs, before and after IV selection. Similarly, the "pe_abs_*" columns contain the average absolute correlation between the model error term and the IVs, before and after IV selection. If IV selection works properly, one should expect "pp_abs_after" to be higher than "pp_abs_before" (indicating that strong IVs are being selected) and "pe_abs_after" to be lower than "pe_abs_before" (indicating that invalid IVs are being removed).
The output is a dataframe with at most ntree rows, each corresponding to the ForestIV estimation results when a specific tree is used as the endogenous covariate (with the other trees serving as candidate IVs).
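To make the expected input format concrete, here is a toy sketch (purely illustrative, with made-up values, not taken from the package documentation) of what data_test would look like for a forest with only three trees:
# toy illustration of the expected column layout of data_test
# (three trees, two testing observations, made-up numbers)
data_test_toy = data.frame(
  X1 = c(2.1, 3.0),      # predictions from tree 1
  X2 = c(2.0, 3.2),      # predictions from tree 2
  X3 = c(2.3, 2.9),      # predictions from tree 3
  actual = c(2.2, 3.1)   # ground truth values
)
# data_unlabel follows the same convention, minus the "actual" column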
A Replicable Demonstration
For demonstration, I replicate a part of the simulations reported in our paper, using the Bike Sharing Dataset, which contains 17,379 records of hourly bike rental activities. The machine learning task is to predict (the log of) rental counts based on time, seasonal, and weather features. The following code prepares the dataset.
# Replicable Simulation Example with Bike Sharing Dataset
library(ForestIV)
library(dplyr)
library(randomForest)
# import Bike Sharing data
# removed "instant" (ID), "dteday" (date), and "registered"/"casual" (which simply add up to the outcome variable)
Bike = read.csv("hour.csv", stringsAsFactors = FALSE) %>%
  dplyr::mutate(lnCnt = log(cnt)) %>%
  dplyr::select(-instant, -dteday, -registered, -casual, -cnt)
Next, let’s train the random forest model. I use 1,000 data points for training, reserve 200 for testing, and treat the remaining 16,179 as unlabeled. After the model is trained, I obtain its predictions on both the testing data and the unlabeled data.
# parameters for random forest
ntree = 100
N = nrow(Bike)
N_train = 1000
N_test = 200
N_unlabel = N - N_train - N_test
# need to set the same seed here to fully replicate this demonstration
set.seed(123456)
# train random forest
train = sample(1:nrow(Bike), N_train)
test = sample((1:nrow(Bike))[-train], N_test)
unlabel = sample((1:nrow(Bike))[-c(train, test)], N_unlabel)
Bike.rf = randomForest(lnCnt ~ . , data = Bike,
                       mtry = 3, subset = train, ntree = ntree)
# retrieve ground truth and predictions on testing and unlabeled data
# setting "predict.all = TRUE" will produce predictions from each individual tree in the forest
actual = Bike$lnCnt
pred_unlabel = predict(Bike.rf, Bike[unlabel,], predict.all = TRUE)
indiv_pred_unlabel = pred_unlabel$individual
aggr_pred_unlabel = pred_unlabel$aggregate
pred_test = predict(Bike.rf, Bike[test,], predict.all = TRUE)
indiv_pred_test = pred_test$individual
aggr_pred_test = pred_test$aggregate
Next, let’s simulate the second-stage regression, where the ML predictions enter as an independent covariate with measurement error. As in the paper, I simulate a simple linear regression: \(Y=1+0.5\times lnCnt + 2 \times control_1 + control_2 + \varepsilon\), where \(lnCnt\) is the ground truth value of log rental count, and \(\{control_1, control_2\}\) are two exogenous control variables that follow \(Uniform[-10,10]\) and \(N(0,10^2)\), respectively. The regression model error term, \(\varepsilon\), follows \(N(0,2^2)\).
# simulate data for econometric model
control1 = runif(N, min = -10, max = 10)
control2 = rnorm(N, sd = 10)
epsilon = rnorm(N, sd = 2)
control = c("control1", "control2")
Y = 1.0 + 0.5*actual + 2.0*control1 + control2 + epsilon
# prepare various data partitions for estimations
data_train = data.frame(Y = Y[train], control1 = control1[train], control2 = control2[train], actual = actual[train])
data_test = data.frame(indiv_pred_test, aggr_pred_test, actual = actual[test])
data_label = data.frame(Y = Y[c(train, test)], control1 = control1[c(train, test)], control2 = control2[c(train, test)], actual = actual[c(train, test)])
data_unlabel = data.frame(Y = Y[unlabel], control1 = control1[unlabel], control2 = control2[unlabel], indiv_pred_unlabel, aggr_pred_unlabel)
Now, let’s estimate the biased regression on the unlabeled data (with predicted \(lnCnt\) as covariate) and the unbiased regression on the labeled data. In the biased regression, the coefficient on the ML-generated covariate is overestimated! In the unbiased regression, the coefficient estimates are fine, but the standard errors are, as expected, larger (due to the smaller sample size).
# biased regression
model_biased = lm(Y ~ aggr_pred_unlabel + control1 + control2, data = data_unlabel)
summary(model_biased)
#Results:
#                  Estimate Std. Error t value Pr(>|t|)
#(Intercept)       0.637370   0.062325   10.23   <2e-16 ***
#aggr_pred_unlabel 0.580574   0.013315   43.60   <2e-16 ***
#control1          1.998247   0.002752  726.19   <2e-16 ***
#control2          0.998151   0.001602  623.10   <2e-16 ***
# unbiased regression
model_unbias = lm(Y ~ actual + control1 + control2, data = data_label)
summary(model_unbias)
#Results:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.05690    0.18596   5.683 1.66e-08 ***
#actual       0.49197    0.03882  12.672  < 2e-16 ***
#control1     2.00336    0.01009 198.565  < 2e-16 ***
#control2     1.00356    0.00586 171.270  < 2e-16 ***
Finally, let’s run ForestIV estimation.
# ForestIV estimation
result = ForestIV(data_test = data_test, data_unlabel = data_unlabel, control = control,
                  method = "Lasso", iterative = TRUE, ntree = ntree, model_unbias = model_unbias,
                  family = gaussian(link = "identity"), diagnostic = TRUE)
The result is a dataframe with 99 rows (the procedure failed to find valid and strong IVs for one tree in the random forest). It can be further processed to produce the final estimates. As in the paper, I report the ForestIV estimates that are statistically closest to the unbiased estimates (in the mean-squared-error sense) and that also pass the Hotelling test at the 5% significance level.
H_critical = qchisq(0.95, df = 4)
coef_unbiased = coef(model_unbias)
result %>%
  mutate(bias2 = sum((c(beta_1, beta_2, beta_3, beta_4) - coef_unbiased)^2),
         variance = se_1^2 + se_2^2 + se_3^2 + se_4^2,
         mse = bias2 + variance) %>%
  arrange(mse) %>%
  filter(row_number() == 1 & Hotelling < H_critical)
#Results:
#beta_1 beta_2 beta_3 beta_4
#0.984 0.503 1.998 0.998
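Since the call above set diagnostic = TRUE, the result dataframe also carries the four diagnostic columns described earlier, which can be summarized as a quick check of IV selection (a small illustrative snippet, not part of the package):
# quick check of the IV diagnostics: correlation with the endogenous
# covariate should increase after selection, while correlation with
# the error term should decrease
colMeans(result[, c("pp_abs_before", "pp_abs_after", "pe_abs_before", "pe_abs_after")])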
To obtain valid standard error estimates, one can bootstrap the above procedure multiple times. As we show in the paper, the (bootstrapped) standard errors of the ForestIV estimates are smaller than those of the unbiased regression. In other words, ForestIV produces less biased point estimates (compared to the biased regression) with higher precision (compared to the unbiased regression).
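Below is a minimal sketch of what such a bootstrap could look like. It assumes a hypothetical helper, forest_iv_once(), that rebuilds the data partitions from resampled labeled and unlabeled row indices, re-runs the ForestIV() call and the MSE/Hotelling selection shown above, and returns the selected coefficient vector; the exact bootstrap used in the paper may differ.
# bootstrap sketch (illustrative only)
# forest_iv_once() is a hypothetical wrapper around the steps above that
# returns the selected c(beta_1, beta_2, beta_3, beta_4)
B = 100
boot_estimates = matrix(NA, nrow = B, ncol = 4)
for (b in 1:B) {
  label_b = sample(c(train, test), size = length(c(train, test)), replace = TRUE)
  unlabel_b = sample(unlabel, size = length(unlabel), replace = TRUE)
  boot_estimates[b, ] = forest_iv_once(label_b, unlabel_b)
}
# bootstrapped standard errors of the ForestIV point estimates
apply(boot_estimates, 2, sd)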