Correlated Data and Random Effects
This week we learned about correlated data. Ordinary and generalized linear models assume that the observations are independent, but that is not always the case. Sometimes the rows of a dataset are not independent of one another: the basic observational units are related in a way that produces similar outcomes.
For example, say you are creating a model to predict the number of days of school a student missed during their junior year of high school. We are looking at students from various schools in the Minneapolis area. In addition to the number of days they missed, we know whether they attended a public or private institution, as well as the neighborhood they reside in. Here the data are likely to be correlated based on the neighborhood the student is from. In our model, we could choose to include a separate fixed-effect coefficient (an indicator) for each neighborhood to account for this, but that would likely result in too many parameters. This is where we could opt to include a random effect for each neighborhood instead. Random effects are treated as random variables: they are assumed to have mean 0 and a common variance that is estimated from the data.
Our model would look something like this:
\[\widehat{\text{Days Absent}}=\beta_0+\beta_1 I(\text{private})+u_{\text{neighborhood}} \]
Here the I(private) variable is 1 for a private school and 0 otherwise. The “u” term is the random effect; there is one for each neighborhood in the data.
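To make this concrete, here is a rough sketch of how a model like this could be fit in Python with statsmodels. The data are simulated, and the column names (days_absent, private, neighborhood) and all the numbers are made up for illustration.

```python
# A minimal sketch of fitting the random-intercept model above with
# statsmodels. Simulated data; column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

n_neighborhoods = 20
students_per_neighborhood = 15

# One random intercept per neighborhood (mean 0, shared variance).
u = rng.normal(0, 2.0, size=n_neighborhoods)

rows = []
for j in range(n_neighborhoods):
    for _ in range(students_per_neighborhood):
        private = rng.integers(0, 2)
        days = 5 - 1.5 * private + u[j] + rng.normal(0, 3.0)
        rows.append({"days_absent": days,
                     "private": private,
                     "neighborhood": j})
df = pd.DataFrame(rows)

# Random intercept for each neighborhood via `groups=`.
model = smf.mixedlm("days_absent ~ private", data=df,
                    groups=df["neighborhood"])
result = model.fit()
print(result.summary())
```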
There are two main types of random-effect structures: nested and crossed. The example above is nested: each student belongs to exactly one neighborhood, so students are nested within neighborhoods, and students from the same neighborhood tend to have similar outcomes. The other type is crossed random effects. An example is a speed dating experiment, where statisticians build a logistic model predicting the log odds of a speed date leading to another date. In this example, there would be a random effect for each participant in the speed dating event. The effects would be “crossed” since each date would include the random effects from both individuals participating.
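To show what the crossed structure looks like, here is a small simulation sketch; the participant count, effect sizes, and baseline log odds are all invented. The key point is that each date's log odds combine the random effects of both people, so every participant's effect shows up in many rows and the grouping factors cross rather than nest.

```python
# Simulated speed-dating data with crossed random effects: each date's
# log odds include the random effects of *both* participants.
import numpy as np

rng = np.random.default_rng(1)

n_participants = 30
u = rng.normal(0, 1.0, size=n_participants)  # one random effect per person
beta0 = -1.0                                 # baseline log odds of a match

dates = []
for i in range(n_participants):
    for j in range(i + 1, n_participants):
        logit = beta0 + u[i] + u[j]          # effects from both people cross
        p = 1 / (1 + np.exp(-logit))
        match = rng.random() < p
        dates.append((i, j, match))

print(f"{sum(m for _, _, m in dates)} matches out of {len(dates)} dates")
```

Actually fitting a crossed-effects logistic model requires a generalized linear mixed model (for example, glmer in R's lme4 package); the simulation above only illustrates the data structure.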
Failing to account for correlated data, for example by including random effects, leads to understated standard errors for the estimated coefficients, which makes the estimates look more precise than they really are.
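One way to see this is to simulate clustered data and compare an ordinary regression with a mixed model. The sketch below uses invented numbers and a made-up group-level covariate called treat (the understatement is largest for covariates that are constant within a group); the point is only the direction of the difference in standard errors.

```python
# Sketch: compare standard errors from OLS (ignores clustering) vs. a
# mixed model (random intercept per group) on simulated clustered data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

n_groups, per_group = 25, 20
u = rng.normal(0, 2.0, size=n_groups)        # group-level random effects

rows = []
for g in range(n_groups):
    treat = rng.integers(0, 2)               # covariate constant within group
    for _ in range(per_group):
        y = 1.0 + 0.5 * treat + u[g] + rng.normal(0, 1.0)
        rows.append({"y": y, "treat": treat, "group": g})
df = pd.DataFrame(rows)

ols_fit = smf.ols("y ~ treat", data=df).fit()
mixed_fit = smf.mixedlm("y ~ treat", data=df, groups=df["group"]).fit()

# OLS treats the correlated rows within a group as independent, so its
# standard error for `treat` is typically much smaller than it should be.
print("OLS SEs:\n", ols_fit.bse)
print("\nMixed-model SEs:\n", mixed_fit.bse)
```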