Linear models in R

This is part 1 of a two-part tutorial on mixed effects models. Part 2 is titled “Fixed and Random Effects Model (Part 2 of 2 Mixed Effect Model)”.

Introduction

In this tutorial we work through a simple example: testing whether the pitch of a voice can be partly explained by the speaker's sex, i.e.

\[pitch \sim sex\]

Since sex does not fully determine pitch, we summarize all other effects as an error term, \(\varepsilon\), i.e.

\[pitch \sim sex + \varepsilon\]
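
One way to make this concrete is to write the model per observation, assuming sex is coded as a dummy variable \(male_i\) that is 0 for female and 1 for male speakers. The symbols \(\beta_0\), \(\beta_1\) and \(male_i\) are our own notation for illustration, not part of the formula above:

\[pitch_i = \beta_0 + \beta_1 \cdot male_i + \varepsilon_i\]

Under this coding, \(\beta_0\) is the mean pitch of female speakers and \(\beta_1\) is the male-minus-female difference in mean pitch.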

Build Data

pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
pitch_sex_df = data.frame(sex,pitch)

Simple Linear Model

pitch_sex_lm = lm(pitch~sex,pitch_sex_df)
summary(pitch_sex_lm)
## 
## Call:
## lm(formula = pitch ~ sex, data = pitch_sex_df)
## 
## Residuals:
##       1       2       3       4       5       6 
##   6.667 -22.333  15.667   2.000 -16.000  14.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   226.33      10.18  22.224 2.43e-05 ***
## sexmale       -98.33      14.40  -6.827  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.64 on 4 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9012 
## F-statistic: 46.61 on 1 and 4 DF,  p-value: 0.002407

Explained Linear Model

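  • “Multiple R-Squared” refers to \(R^2\), the proportion of the variance in pitch that the model explains. Our \(R^2\) is 0.921, which is quite high.
  • “Adjusted R-Squared” refers to \(R^2\) adjusted for the number of fixed effects (predictors) in the model. In our case there is only one, but in principle you could also include things such as age, psychological traits, language, etc.
  • “P-Value” refers to the probability of obtaining data at least as extreme as ours if the null hypothesis (here, that sex has no effect on pitch) were true. It is not the probability that the null hypothesis is true.
  • “Intercept” in this case refers to the estimated mean pitch for “female”, the base case. (R uses “female” as the reference level because “f” comes before “m” alphabetically.)
  • “sexmale” refers to the difference in mean pitch going from female to male voices: the estimate of -98.33 means male voices are on average 98.33 lower. Because sex is categorical, this is the slope of the single step from one category to the other.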
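
The reading of the coefficients and of \(R^2\) given above can be checked by hand. The following is a minimal sketch that rebuilds the same data and model and compares the coefficients with the group means. The helper names `group_means`, `rss`, `tss` and `r_squared` are ours and chosen for illustration only.

```r
# Rebuild the data and model from this tutorial so the block runs on its own
pitch <- c(233, 204, 242, 130, 112, 142)
sex <- c(rep("female", 3), rep("male", 3))
pitch_sex_df <- data.frame(sex, pitch)
pitch_sex_lm <- lm(pitch ~ sex, data = pitch_sex_df)

# Mean pitch per sex, computed directly from the data
group_means <- tapply(pitch_sex_df$pitch, pitch_sex_df$sex, mean)
coefs <- coef(pitch_sex_lm)

# The intercept equals the mean of the reference level ("female")
coefs["(Intercept)"]
group_means["female"]

# The sexmale coefficient equals the male mean minus the female mean
coefs["sexmale"]
group_means["male"] - group_means["female"]

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(pitch_sex_lm)^2)
tss <- sum((pitch - mean(pitch))^2)
r_squared <- 1 - rss / tss
r_squared
```

The two pairs of values should agree with the “Estimate” column of the summary, and `r_squared` should agree with the “Multiple R-squared” line.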

Multiple Regression Model

A Multiple Regression Model is a linear model in which one response variable has many predictor variables (fixed effects), such as: \[pitch \sim sex + age + language + ... + \varepsilon\]
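
As a rough sketch of how a second predictor is added in R, the block below extends the pitch data with an `age` column. The age values are hypothetical and made up purely for illustration; they are not part of the original example, and the resulting estimates carry no meaning. The names `pitch_sex_age_df` and `pitch_multi_lm` are likewise ours.

```r
# Pitch and sex as in this tutorial; age is HYPOTHETICAL, invented for illustration
pitch_sex_age_df <- data.frame(
  sex   = c(rep("female", 3), rep("male", 3)),
  pitch = c(233, 204, 242, 130, 112, 142),
  age   = c(24, 31, 28, 45, 52, 37)  # made-up values, not real data
)

# Each additional fixed effect is added to the formula with "+"
pitch_multi_lm <- lm(pitch ~ sex + age, data = pitch_sex_age_df)
summary(pitch_multi_lm)
```

The summary then contains one coefficient row per predictor in addition to the intercept.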

Assumptions

  1. Linearity - the target variable must be a linear combination of the predictor variables (i.e. the predictors are NOT squared, cubed, etc.). If this holds, the residual plot should not be curved. If the plot is curved, people generally log-transform the response to reduce the problem.
  2. Absence of Collinearity - no two fixed effects (predictors) are strongly correlated with each other.
  3. Homoskedasticity - the variance of the residuals is roughly equal across the range of fitted values. If this is not true, the data are heteroskedastic.
  4. Normality of Residuals - the residuals resemble a normal distribution. Check with either hist() or qqnorm() on the residuals.
  5. Absence of Influential Data Points - no individual data points disproportionately change the model's estimates. Check with dfbeta().
  6. Independence - each data point came from a different subject.
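
Several of the checks listed above can be sketched with base R functions. The block below rebuilds the pitch model and draws a residual plot (linearity and homoskedasticity), a histogram and Q-Q plot (normality), and calls `dfbeta()` (influential points). With only six observations these diagnostics are not very informative, so treat this as a pattern to reuse rather than a verdict on this data. The variable name `influence_table` is ours.

```r
# Rebuild the data and model from this tutorial so the block runs on its own
pitch <- c(233, 204, 242, 130, 112, 142)
sex <- c(rep("female", 3), rep("male", 3))
pitch_sex_df <- data.frame(sex, pitch)
pitch_sex_lm <- lm(pitch ~ sex, data = pitch_sex_df)

# Linearity and homoskedasticity: residuals against fitted values.
# Look for no curve and a roughly even spread around the zero line.
plot(fitted(pitch_sex_lm), residuals(pitch_sex_lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of residuals: histogram and normal Q-Q plot
hist(residuals(pitch_sex_lm))
qqnorm(residuals(pitch_sex_lm))
qqline(residuals(pitch_sex_lm))

# Influential data points: how much each coefficient changes
# when the corresponding observation is left out
influence_table <- dfbeta(pitch_sex_lm)
influence_table
```

Each row of `influence_table` corresponds to one observation and each column to one coefficient; rows with values that are large relative to the coefficient itself deserve a closer look.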

References & Credits

Example adapted from Bodo Winter of the University of California:

Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [http://arxiv.org/pdf/1308.5499.pdf]