Random Stuff Cancels Out, Right?

The Perils of Random Measurement Error in Regression Analysis
causal-inference
exposition
Published: December 27, 2021

Imagine you want to understand the impact of \(X\) on \(Y\) (potentially controlling for other factors \(\boldsymbol{Z}\)), and you proceed to collect data on \(\{X,\boldsymbol{Z},Y\}\) and run a linear regression of \(Y\) on \(\{X,\boldsymbol{Z}\}\). The problem, however, is that you cannot obtain a precise measurement of \(X\) (e.g., perhaps your measurement instrument is flawed), so your measurement \(\widehat{X}\) is only an imperfect representation of the true \(X\), in the following sense:

\[ \widehat{X} = X + e \]

In other words, the imperfect measurement differs from the true value by an amount \(e\), commonly known as the measurement error. Now, assume that this measurement error \(e\) is completely random, i.e., it is independent of everything in \(\{X,\boldsymbol{Z},Y\}\). The question is: does the presence of such purely random measurement error cause any problems for your regression estimates?

Through teaching and research, I have found that people (ranging from master’s students and Ph.D. students all the way to professors) hold a very strong belief that the answer to the above question is NO, i.e., that purely random measurement error doesn’t bias your regression estimates. The intuition is simply that “random stuff cancels out”: because the measurement error is completely random, it has an equal chance of being larger or smaller than the true value, and on average, voilà, things should be fine. This strong belief, in my opinion, is what makes random measurement error particularly dangerous in regression analysis.

Random Measurement Error Biases Regression Estimations

Random stuff doesn’t always cancel out, and yes, random measurement error does bias regression estimates. The proof can be found in most introductory econometrics textbooks, and is reproduced here.
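Before walking through the proof, here is a quick simulation check. This is a minimal sketch of my own; the normal distributions, sample size, and parameter values are arbitrary illustrative choices, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Arbitrary illustrative "true" parameters (my choices, not from the post).
beta0, beta1, beta2 = 1.0, 2.0, -1.0

X = rng.normal(0.0, 1.0, size=n)    # true regressor (unobserved in practice)
Z = rng.normal(0.0, 1.0, size=n)    # perfectly measured control
eps = rng.normal(0.0, 1.0, size=n)  # regression error
e = rng.normal(0.0, 1.0, size=n)    # purely random measurement error

Y = beta0 + beta1 * X + beta2 * Z + eps  # population regression equation
X_hat = X + e                            # the mismeasured regressor we observe

# OLS of Y on (1, X_hat, Z), i.e., the regression actually being estimated.
design = np.column_stack([np.ones(n), X_hat, Z])
coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"true beta1 = {beta1}, naive estimate = {coefs[1]:.3f}")
# Prints an estimate near 1.0, i.e., roughly half of the true beta1 = 2.0.
```

With \(Var(X) = Var(e) = 1\) here, the naive estimate lands near \(1.0\) rather than the true \(2.0\), which is exactly the halving predicted by the attenuation formula derived below.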

First, one needs to clearly understand the difference between the population regression equation and the regression that is actually being estimated (the difference is important in this case). While the population regression equation describes the true relationship between the independent and dependent variables (i.e., the data-generating process), the estimated regression only approximates that relationship based on the available measurements. In our setup, the population regression equation is

\[ Y = \beta_0 + \beta_1 X + \boldsymbol{\beta_2 Z} + \varepsilon \]

where \(\beta_1\) is the coefficient of interest. However, because only \(\widehat{X}\) is available, the regression actually being estimated is \(Y \sim \{\widehat{X}, \boldsymbol{Z}\}\). Substituting \(X = \widehat{X} - e\) into the population equation gives:

\[ Y = \beta_0 + \beta_1 (\widehat{X}-e) + \boldsymbol{\beta_2 Z} + \varepsilon = \beta_0 + \beta_1 \widehat{X} + \boldsymbol{\beta_2 Z} + (\varepsilon - \beta_1 e) \]

and because \(Cov(\widehat{X}, \varepsilon-\beta_1 e) = Cov(X + e, \varepsilon-\beta_1 e) = -\beta_1 Var(e) \neq 0\), the estimated regression essentially suffers from an “omitted variable” bias: the random measurement error \(e\) is left in the regression error term, where it is correlated with the independent variable \(\widehat{X}\). To put it another way, although the measurement error is purely random, it is nevertheless not independent of itself.
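For completeness, here is the covariance computation spelled out term by term (this expansion is implied by, though not written in, the derivation above; it uses \(Cov(X,\varepsilon)=0\) from the population regression and the independence of \(e\) from \(\{X,\boldsymbol{Z},Y\}\), which implies \(Cov(X,e) = Cov(e,\varepsilon) = 0\)):

\[ Cov(X + e,\ \varepsilon - \beta_1 e) = \underbrace{Cov(X,\varepsilon)}_{0} - \beta_1 \underbrace{Cov(X,e)}_{0} + \underbrace{Cov(e,\varepsilon)}_{0} - \beta_1 \underbrace{Cov(e,e)}_{Var(e)} = -\beta_1 Var(e) \]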

In the case of simple linear regression, one can even derive the degree of the (asymptotic) bias:

\[ \text{plim}\ \widehat{\beta_1} = \frac{Cov(Y,\widehat{X})}{Var(\widehat{X})} = \beta_1 \frac{Var(X)}{Var(X) + Var(e)} \]

And because \(\frac{Var(X)}{Var(X) + Var(e)} < 1\), the bias manifests as attenuation, i.e., the coefficient of interest is biased toward zero (its magnitude is underestimated).
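A convenient way to read this factor (simple algebra, added here for intuition) is in terms of the noise-to-signal ratio:

\[ \frac{Var(X)}{Var(X) + Var(e)} = \frac{1}{1 + Var(e)/Var(X)} \]

The noisier the measurement relative to the true variation, the stronger the attenuation; for instance, when \(Var(e) = Var(X)\), the estimated coefficient is cut in half on average.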

Underestimation is OK, Right?

Oftentimes, after I explain the above to someone, they may continue to insist that random measurement error is no big deal, because it only results in underestimation of the coefficient and hence makes your conclusions more “conservative” than they would otherwise be. This line of argument is equally dangerous, for at least two reasons:

  1. Beyond the simple linear regression model discussed here, attenuation is typically not guaranteed. With most generalized linear models (e.g., logit, probit, Poisson), overestimation is possible. Even with a simple linear regression, a slightly more complex measurement error structure can result in overestimation (see the sketch after this list). Our research paper demonstrates this point with both theoretical and empirical analyses.
  2. Even if the coefficient is underestimated (and therefore “conservative”), that can be a very bad outcome depending on the context. Suppose you are estimating the impact of COVID infection on mortality; underestimating the effect is arguably worse than overestimating it.
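To illustrate the first point in a linear setting, here is a sketch of my own construction (not from the original post or paper): the measurement error is made negatively correlated with the true \(X\), a small departure from the purely random structure, and the naive OLS slope is inflated rather than attenuated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta1 = 2.0

# Draw (X, e) jointly normal with Var(X)=1, Var(e)=0.25, Cov(X, e)=-0.4,
# i.e., the measurement error is NOT independent of the true X.
# (All numbers are arbitrary illustrative choices.)
cov = np.array([[1.0, -0.4],
                [-0.4, 0.25]])
X, e = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

eps = rng.normal(0.0, 1.0, size=n)
Y = beta1 * X + eps  # simple linear model, no controls
X_hat = X + e        # mismeasured regressor

# Naive OLS slope of Y on X_hat, computed via the covariance formula.
b1_hat = np.cov(Y, X_hat)[0, 1] / np.var(X_hat, ddof=1)
print(f"true beta1 = {beta1}, naive estimate = {b1_hat:.3f}")
# Prints roughly 2.67: the coefficient is OVERestimated, not attenuated.
```

In this construction, \(\text{plim}\ \widehat{\beta_1} = \beta_1 \frac{Var(X) + Cov(X,e)}{Var(X) + 2Cov(X,e) + Var(e)} = \beta_1 \cdot \frac{0.6}{0.45} \approx 1.33\,\beta_1\), so the naive estimate overshoots the truth rather than being conservative.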

Concluding Remarks

Intuitions can be powerful, but they are not always correct. Purely random measurement error is an example of how intuition alone can fail you in dangerous ways. Random stuff doesn’t always cancel out, and even purely random measurement error in the independent variables can bias your regression estimates.