This blog post describes the ForestIV R package, an implementation of the ForestIV approach to correct for estimation biases caused by having machine-learning-generated covariates in regression analyses.

To install the package from GitHub, run the following command in R (this requires the devtools package):

devtools::install_github("mochenyang/ForestIV")

The package depends on three external libraries:

  1. hdm: to perform Lasso-based selection of instrumental variables;
  2. AER: to perform two-stage least squares (2SLS) estimations (for linear regressions);
  3. ivtools: to perform control function estimations (for GLMs).
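All three, as well as devtools (needed for the GitHub installation above), are available on CRAN and can be installed with a single command if they are not already on your machine:

install.packages(c("devtools", "hdm", "AER", "ivtools"))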

ForestIV Implementation

The core function of the package is ForestIV(), which takes the following inputs:

  1. data_test: A dataframe of the testing data (in a training-testing split) when building the random forest. It must have a column named “actual” that contains the ground truth values, and separate columns that contain each tree’s predictions named “X1”, “X2”, … Note that if you use the randomForest package to train the random forest model, then the column names will be taken care of automatically.
  2. data_unlabel: A dataframe of the unlabeled data. It must have separate columns that contain each tree’s predictions named “X1”, “X2”, …
  3. control: A character vector of control variable names. Pass an empty vector if there is no control variable.
  4. method: The method for IV selection. Supported values are “Lasso” for lasso-based selection and “IIV” for the imperfect IV method (forthcoming in our next paper, stay tuned!).
  5. iterative: Whether or not to perform iterative IV selection to remove invalid and weak IVs, defaults to TRUE. This parameter is only relevant when method = “Lasso”.
  6. ntree: Number of trees in the random forest.
  7. model_unbias: An lm or glm object that contains the unbiased regression estimates, typically obtained by running the regression on the labeled data.
  8. family: Model specification. Same as the family parameter in glm.
  9. diagnostic: Whether to output diagnostic correlations for instrument validity and strength. Defaults to TRUE, which will produce four additional columns in the output, respectively named “pp_abs_before”, “pe_abs_before”, “pp_abs_after” and “pe_abs_after”. The “pp_abs_*” columns contain the average absolute correlation between the endogenous covariate and the IVs, before and after IV selection. Similarly, the “pe_abs_*” columns contain the average absolute correlation between the model error term and the IVs, before and after IV selection. If IV selection works properly, one should expect “pp_abs_after” to be higher than “pp_abs_before” (indicating that strong IVs are being selected) and “pe_abs_after” to be lower than “pe_abs_before” (indicating that invalid IVs are being removed).

The output is a dataframe with at most ntree rows, each corresponding to the ForestIV estimation results when a specific tree is used as the endogenous covariate (with the remaining trees serving as candidate IVs). Each row contains the coefficient estimates (“beta_1”, “beta_2”, …), their standard errors (“se_1”, “se_2”, …), and a Hotelling test statistic (“Hotelling”), plus the diagnostic columns described above when diagnostic = TRUE.

A Replicable Demonstration

For demonstration, I replicate a part of the simulations reported in our paper, using the Bike Sharing Dataset, which contains 17,379 records of hourly bike rental activities. The machine learning task is to predict (the log of) rental counts based on time, seasonal, and weather features. The following code prepares the dataset.

# Replicable Simulation Example with Bike Sharing Dataset
library(ForestIV)
library(dplyr)
library(randomForest)

# import Bike Sharing data
# removed "instant" (ID), "dteday" (date), and "registered"/"casual" (which simply add up to the outcome variable)
Bike = read.csv("hour.csv", stringsAsFactors = FALSE) %>%
  dplyr::mutate(lnCnt = log(cnt)) %>%
  dplyr::select(-instant, -dteday, -registered, -casual, -cnt)

Next, let’s train the random forest model. I use 1,000 data points for training, reserve 200 for testing, and treat the remaining 16,179 as unlabeled. After the model is trained, I obtain its predictions on both the testing data and the unlabeled data.

# parameters for random forest
ntree = 100
N = nrow(Bike)
N_train = 1000
N_test = 200
N_unlabel = N - N_train - N_test

# need to set the same seed here to fully replicate this demonstration
set.seed(123456)
# train random forest
train = sample(1:nrow(Bike), N_train)
test = sample((1:nrow(Bike))[-train], N_test)
unlabel = sample((1:nrow(Bike))[-c(train, test)], N_unlabel)
Bike.rf = randomForest(lnCnt ~ ., data = Bike,
                       mtry = 3, subset = train, ntree = ntree)

# retrieve ground truth and predictions on testing and unlabeled data
# setting "predict.all = TRUE" will produce predictions from each individual tree in the forest
actual = Bike$lnCnt
pred_unlabel = predict(Bike.rf, Bike[unlabel,], predict.all = TRUE)
indiv_pred_unlabel = pred_unlabel$individual
aggr_pred_unlabel = pred_unlabel$aggregate
pred_test = predict(Bike.rf, Bike[test,], predict.all = TRUE)
indiv_pred_test = pred_test$individual
aggr_pred_test = pred_test$aggregate

Next, let’s simulate the second-stage regression, where the ML predictions enter as an independent covariate with measurement error. As in the paper, I simulate a simple linear regression: $Y = 1 + 0.5 \times lnCnt + 2 \times control_1 + control_2 + \varepsilon$, where $lnCnt$ is the ground truth value of the log rental count, and $\{control_1, control_2\}$ are two exogenous control variables that follow $Uniform[-10,10]$ and $N(0,10^2)$ distributions, respectively. The regression model error term, $\varepsilon$, follows $N(0,2^2)$.

# simulate data for econometric model
control1 = runif(N, min = -10, max = 10)
control2 = rnorm(N, sd = 10)
epsilon = rnorm(N, sd = 2)
control = c("control1", "control2")
Y = 1.0 + 0.5*actual + 2.0*control1 + control2 + epsilon

# prepare various data partitions for estimations
data_train = data.frame(Y = Y[train], control1 = control1[train], control2 = control2[train], actual = actual[train])
data_test = data.frame(indiv_pred_test, aggr_pred_test, actual = actual[test])
data_label = data.frame(Y = Y[c(train, test)], control1 = control1[c(train, test)], control2 = control2[c(train, test)], actual = actual[c(train, test)])
data_unlabel = data.frame(Y = Y[unlabel], control1 = control1[unlabel], control2 = control2[unlabel], indiv_pred_unlabel, aggr_pred_unlabel)

Now, let’s estimate the biased regression using the unlabeled data (with the predicted $lnCnt$ as covariate) and the unbiased regression using the labeled data. In the biased regression, the coefficient on the ML-generated covariate is overestimated (0.58 versus a true value of 0.5)! In the unbiased regression, the coefficient estimates are close to the true values, but the standard errors are larger, as expected, due to the smaller sample size.

# biased regression
model_biased = lm(Y ~ aggr_pred_unlabel + control1 + control2, data = data_unlabel)
summary(model_biased)
#Results:
#                  Estimate Std. Error t value Pr(>|t|)    
#(Intercept)       0.637370   0.062325   10.23   <2e-16 ***
#aggr_pred_unlabel 0.580574   0.013315   43.60   <2e-16 ***
#control1          1.998247   0.002752  726.19   <2e-16 ***
#control2          0.998151   0.001602  623.10   <2e-16 ***

# unbiased regression
model_unbias = lm(Y ~ actual + control1 + control2, data = data_label)
summary(model_unbias)
#Results:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  1.05690    0.18596   5.683 1.66e-08 ***
#actual       0.49197    0.03882  12.672  < 2e-16 ***
#control1     2.00336    0.01009 198.565  < 2e-16 ***
#control2     1.00356    0.00586 171.270  < 2e-16 ***

Finally, let’s run ForestIV estimation.

# ForestIV estimation
result = ForestIV(data_test = data_test, data_unlabel = data_unlabel, control = control,
                  method = "Lasso", iterative = TRUE, ntree = ntree, model_unbias = model_unbias,
                  family = gaussian(link = "identity"), diagnostic = TRUE)

The result is a dataframe with 99 rows (the procedure failed to find valid and strong IVs for one tree in the random forest). It can be further processed to produce the final estimates. As in the paper, I report the ForestIV estimates that are statistically closest to the unbiased estimates (in the mean-squared-error sense) and that also pass the Hotelling test at the 5% significance level.

H_critical = qchisq(0.95, df = 4)
coef_unbiased = coef(model_unbias)
result %>%
  # per-row squared bias (relative to the unbiased estimates) plus variance
  mutate(bias2 = (beta_1 - coef_unbiased[1])^2 + (beta_2 - coef_unbiased[2])^2 +
                 (beta_3 - coef_unbiased[3])^2 + (beta_4 - coef_unbiased[4])^2,
         variance = se_1^2 + se_2^2 + se_3^2 + se_4^2,
         mse = bias2 + variance) %>%
  arrange(mse) %>%
  filter(row_number() == 1 & Hotelling < H_critical)
#Results:
#  beta_1  beta_2  beta_3  beta_4
#   0.984   0.503   1.998   0.998

To obtain valid standard error estimates, one can bootstrap the above procedure multiple times. As we show in the paper, the (bootstrapped) standard errors of the ForestIV estimates are smaller than those in the unbiased regression. In other words, ForestIV produces less biased point estimates (compared to the biased regression) with higher precision (compared to the unbiased regression).
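As a rough sketch of what such a bootstrap could look like (the exact resampling scheme is described in the paper; here the number of replications B, the choice to resample only the labeled and unlabeled partitions, and the reuse of the selection rule from above are all illustrative assumptions):

# bootstrap the ForestIV procedure to obtain standard errors (illustrative sketch)
set.seed(123456)
B = 100   # number of bootstrap replications (illustrative choice)
boot_est = matrix(NA, nrow = B, ncol = 4)
for (b in 1:B) {
  # resample the labeled and unlabeled partitions with replacement
  label_b = data_label[sample(nrow(data_label), replace = TRUE), ]
  unlabel_b = data_unlabel[sample(nrow(data_unlabel), replace = TRUE), ]
  model_unbias_b = lm(Y ~ actual + control1 + control2, data = label_b)
  result_b = ForestIV(data_test = data_test, data_unlabel = unlabel_b, control = control,
                      method = "Lasso", iterative = TRUE, ntree = ntree,
                      model_unbias = model_unbias_b,
                      family = gaussian(link = "identity"), diagnostic = FALSE)
  # apply the same MSE + Hotelling selection rule as above
  coef_b = coef(model_unbias_b)
  pick = result_b %>%
    mutate(bias2 = (beta_1 - coef_b[1])^2 + (beta_2 - coef_b[2])^2 +
                   (beta_3 - coef_b[3])^2 + (beta_4 - coef_b[4])^2,
           variance = se_1^2 + se_2^2 + se_3^2 + se_4^2,
           mse = bias2 + variance) %>%
    arrange(mse) %>%
    filter(row_number() == 1 & Hotelling < H_critical)
  if (nrow(pick) == 1) boot_est[b, ] = c(pick$beta_1, pick$beta_2, pick$beta_3, pick$beta_4)
}
# bootstrapped standard errors of the four ForestIV coefficient estimates
apply(boot_est, 2, sd, na.rm = TRUE)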