Correlated Data

This week we finished up our section on logistic regression and began learning about correlated data with Dr. McNamara.

In past courses, we usually assumed that the data were not correlated. In fact, one of the main assumptions of standard regression modeling is that the observations are independent.

In practice, observations are often not independent of one another.

Here are some examples of this violation of independence:

  1. In a survey about political views, the responses submitted by a certain friend group may be correlated as they might discuss issues together and agree on some topics.
  2. Grades in math classes across various schools might be correlated because students who share a teacher (or a school) tend to have more similar grades than students who do not.
  3. In scientific research, if the same animals are used in multiple trials, the repeated results from one animal are most likely correlated.

These are just a few examples, but it is clear that correlated data is a common phenomenon.

Tree Growth

In class, we went through an R example concerning tree growth and tree tubes.

Since the data isn’t available, I will go through the R code and comment on the results.

# Ordinary linear model: treats every tree as an independent observation
tube_linear <- lm(growth_yr1 ~ tubes, data = treetubes_yr1)

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.10585   0.005665  18.685 9.931e-56
tubes       -0.04013   0.041848  -0.959 3.382e-01

R squared =  0.002414 
Residual standard error =  0.1097

# Mixed model with a random intercept for each transect; lmer() comes from lme4
library(lme4)
tube_multi1 <- lmer(growth_yr1 ~ tubes + (1|transect), data = treetubes_yr1)

            Estimate Std. Error t value
(Intercept)  0.10636    0.01329   8.005
tubes       -0.04065    0.05165  -0.787

Groups   Name        Variance Std.Dev.
transect (Intercept) 0.00084  0.029   
Residual             0.01155  0.107

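The first model is a standard linear model that treats every tree as independent. The second is a mixed model with a random intercept for each transect, which accounts for the correlation among trees in the same transect.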
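To make the difference concrete, here is one way to write out the second model. The notation is my own assumption rather than something from class: i indexes trees, j indexes transects, and u_j is the transect-level random intercept.

```latex
\begin{align*}
\text{growth}_{ij} &= \beta_0 + \beta_1\,\text{tubes}_{ij} + u_j + \varepsilon_{ij},\\
u_j &\sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2),\\
\operatorname{Corr}(\text{growth}_{ij}, \text{growth}_{i'j}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma^2} \quad (i \neq i').
\end{align*}
```

Setting the transect variance \sigma_u^2 to zero removes u_j and gives back the standard linear model, in which two trees in the same transect are uncorrelated.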

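As you can see, the standard error for the ‘tubes’ coefficient increases (from about 0.042 to about 0.052) when the correlation is accounted for, while the estimate itself barely changes. This means the original model (tube_linear) is overly confident about how precisely it has estimated the effect of tubes on tree growth. In neither model is the tubes effect large relative to its standard error.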
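As a rough check on how much correlation the mixed model found, the variance components in the output above can be turned into an intraclass correlation (the share of total variance that sits between transects). This is only arithmetic on the numbers already reported; the variable names are my own.

```r
# Variance components copied from the lmer output above
var_transect <- 0.00084
var_residual <- 0.01155

# Intraclass correlation: transect variance as a share of total variance
icc <- var_transect / (var_transect + var_residual)
round(icc, 3)        # about 0.068

# How much the standard errors grew relative to the plain linear model
se_ratio_tubes     <- 0.05165 / 0.041848   # about 1.23
se_ratio_intercept <- 0.01329 / 0.005665   # about 2.35
```

So the estimated within-transect correlation is fairly small (roughly 7%), yet it is still enough to inflate the standard error of the tubes coefficient by about a quarter.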

Conceptual Exercise

The last thing I want to note is the conceptual exercise our team was tasked with evaluating.

These are the instructions:

Examples with correlated data. For each of the following studies:

  - Identify the most basic observational units
  - Identify the grouping units (could be multiple levels of grouping)
  - State the response(s) measured and variable type (normal, binary, Poisson, etc.)
  - Write a sentence describing the within-group correlation.
  - Identify fixed and random effects

Nurse stress study.

Four wards were randomly selected at each of 25 hospitals and randomly assigned to offer a stress reduction program for nurses on the ward or to serve as a control. At the conclusion of the study period, a random sample of 10 nurses from each ward completed a test to measure job-related stress. Factors assumed to be related include nurse experience, age, hospital size and type of ward.

The basic observational units are the nurses.

The grouping units are the wards and the hospitals, with wards nested within hospitals.

The response is the score on the job-related stress test, which is probably approximately normal.

There is within-group correlation because nurses in the same hospital might experience similar levels of stress. For example, a hospital in a large city might on average be more stressful than a hospital in a smaller city. There is also correlation within each ward: it is fair to assume that nurses working in the same ward of a hospital have similar stress levels.

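Lastly, the fixed effects are whether or not the nurses were on a ward offering the stress reduction program, together with the measured covariates: nurse experience, age, hospital size, and type of ward. The random effects are the grouping units, hospital and ward within hospital. These were randomly sampled from larger populations, so we want to account for the variation among them rather than estimate a separate effect for each one.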
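To see what this structure could look like in R, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data frame `nurses`, the column names, the effect sizes, and the choice to put the program on wards 1 and 2 of every hospital. It treats the measured covariates as fixed effects and the grouping units (hospital, and ward within hospital) as random intercepts, which is the usual lme4 encoding for a design like this. Only one covariate (experience) is simulated; age, hospital size, and ward type would enter the formula the same way.

```r
# Hypothetical simulated data; the real nurse study data are not available here.
library(lme4)

set.seed(42)
n_hosp <- 25; n_ward <- 4; n_nurse <- 10

# One row per nurse: 25 hospitals x 4 wards x 10 nurses = 1000 rows
nurses <- expand.grid(nurse    = seq_len(n_nurse),
                      ward     = seq_len(n_ward),
                      hospital = seq_len(n_hosp))

# Unique ward label, since "ward 1" in hospital 1 differs from "ward 1" in hospital 2
nurses$ward_id <- interaction(nurses$hospital, nurses$ward, drop = TRUE)

# Assumption: two wards per hospital offer the program (wards 1 and 2 for simplicity)
nurses$program    <- as.integer(nurses$ward <= 2)
nurses$experience <- runif(nrow(nurses), 0, 30)

# Made-up random intercepts for hospitals and for wards within hospitals
u_hosp <- rnorm(n_hosp, sd = 3)
u_ward <- rnorm(nlevels(nurses$ward_id), sd = 2)

nurses$stress <- 50 - 4 * nurses$program - 0.2 * nurses$experience +
  u_hosp[nurses$hospital] + u_ward[as.integer(nurses$ward_id)] +
  rnorm(nrow(nurses), sd = 5)

nurses$hospital <- factor(nurses$hospital)

# Fixed effects: program and experience; random intercepts: hospital and ward
fit <- lmer(stress ~ program + experience + (1 | hospital) + (1 | ward_id),
            data = nurses)

fixef(fit)    # estimated fixed effects
VarCorr(fit)  # estimated variance components for ward, hospital, and residual
```

The two `(1 | ...)` terms are what capture the within-hospital and within-ward correlation described above; dropping them would bring us back to a standard linear model that treats all 1000 nurses as independent.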