Module 4: Linear Regression Module

Learning Objectives:

Learn the difference between explanatory and response variables
Define correlation and regression analysis
Understand linear models and how to build them in R
How to make predictions using linear models
Understand the pros and cons of R²

1. Linear Regression

In most experiments, we want to learn the answer to the question: “How does X affect Y?” In these experiments, X is our independent variable, and Y is our dependent variable. Looking for any kind of relationship between the two variables is known as correlation analysis. A correlation exists if a change in the magnitude of one variable occurs simultaneous to a change in the magnitude of the other variable. However, correlation does not necessarily imply causation. For example, there certainly exists a correlation between a person's weight and height, but an increase in weight does not cause an increase in height.

To find the magnitude of these correlations, we use regression analysis. Regression analysis derives a linear equation between the variables of interest. It allows us to predict the value of the dependent variable (called the “response variable” in regression analysis) based on the value of the independent variable (called the “explanatory variable”) using a best fit line.

We can use the lm() function to perform regression analysis. The general format of these models is: model_name = lm(dependent_variable ~ independent_variable, data = data_name). Make sure you give your model a name!

In the example below, we create a linear model using the Galton data. We use the summary command to show the results of the regression analysis.

Galton = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdFlOcmt5bzY4VlFKRmtDdFJRMldTeEE&output=csv
")

## Error: Missing packages.  Please retry after installing the following:
## RCurl

mod1 = lm(height ~ father, data = Galton)
mod1

## 
## Call:
## lm(formula = height ~ father, data = Galton)
## 
## Coefficients:
## (Intercept)       father  
##      39.110        0.399

We can now see the correlation between the two variables with this linear model, and we can write the model in y = mx + b form. In the example above, we now have

height = 0.39938*father + 39.11039

How do we interpret this? We can say that for every inch the father's height rises, the child's height rises by 0.39938 inches.

For example, our model predicts that if the father's height is 60, then:

height = 0.39938*(60) + 39.11039, meaning the model predicts the child's height will be 63.07319

We can further examine the model using the summary() function, which provides descriptive statistics of the model. The summary() function takes one argument, which is a model name.

summary(mod1)

## 
## Call:
## lm(formula = height ~ father, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.268  -2.669  -0.209   2.634  11.933 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.1104     3.2271   12.12   <2e-16 ***
## father        0.3994     0.0466    8.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.45 on 896 degrees of freedom
## Multiple R-squared:  0.0758, Adjusted R-squared:  0.0748 
## F-statistic: 73.5 on 1 and 896 DF,  p-value: <2e-16

The summary output also shows the standard error on each variable. It also shows the t and p-values, but you can ignore those for the time being.

2. Linear Models and Categorical Variables

We frequently use categorical variables as explanatory variables in our models. In these situations, linear regression still applies, but we interpret the models differently.

For example, the Galton data contains one categorical variable: sex. Suppose we want to create a model examining the relationship between a child's height and their gender. We use the same syntax in R to build this type of model.

mod2 = lm(height ~ sex, data = Galton)
summary(mod2)

## 
## Call:
## lm(formula = height ~ sex, data = Galton)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9.23  -1.61  -0.11   1.77   9.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   64.110      0.121   531.7   <2e-16 ***
## sexM           5.119      0.168    30.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.51 on 896 degrees of freedom
## Multiple R-squared:  0.51,   Adjusted R-squared:  0.51 
## F-statistic:  933 on 1 and 896 DF,  p-value: <2e-16

Before attempting to interpret this model's coefficients, remember that 'sex' is a categorical variable. In the above output, we are given the model:

height = 5.1187sexM (yes/no) + 64.1102. Here, sexM stands for the child being a male. If this is the case, then R codes it as a “1”, whereas if the case is a female, then sexM is coded as “0”. Therefore, the model predicts that a male will have a height of 5.1187*(1) + 64.1102 = 69.22882, and a child that is not a male will have a height of 5.1187(0) + 64.1102 = 64.1102. In other words, our model predicts that males are 5.1187 inches taller than females.

3. Numerically predicting response variables

There is also a function in the Project Mosaic package that allows us to set the values of the explanatory variables in order to make predictions. This function is called makeFun(), and it takes the name of a linear model as its argument

First, we must create a function. Decide on a name for this function (we use 'f1' in the example below), and determine which model you will use (mod1 in this example).

f1 = makeFun(mod1)

Now, we have a function, 'f', that takes in values of the model's explanatory variables in order to predict the value of the response variable. To input these values, simply type:

f1(father = 60)

##     1 
## 63.07

Notice that the output to this function is the same number as the answer we got algebraically with our linear equation.

The makeFun() function also makes predictions when our explanatory variables are categorical. For example, we can create another function for mod2:

f2 = makeFun(mod2)

What should we set as our arguments for this function? We can set “sex” equal to any of the levels of gender. To figure out how gender is coded in this dataset, use the levels() function:

levels(Galton$sex)

## [1] "F" "M"

Since “M” and “F” are the levels of sex, set one of those to be the arguments for the function f2:

f2(sex = "M")

##     1 
## 69.23

f2(sex = "F")

##     1 
## 64.11

4. R²:

[This text will hopefully be replaced with a Shiny App] There is one more thing you need to know about the regression output for now. Note at the bottom of the output the Multiple R-squared value.

What is it?

Multiple R² tells us the proportion of the variability in the data that can be explained by our model. R² can be between 0 and 1, with 0 being a model that explains nothing, while 1 means the model explains everything.

What are the drawbacks to relying on R^2?

Adding more explanatory variables will always increase the Multiple R² value.
A Model based on only two data points will have an R² value of 1, because the best-fit line will simply connect the two points (so the model will explain all the variation in the data). But, this is misleading because we cannot make accurate statistical predictions based on two trials!

Questions to submit

The 'college' data contains data on 195 schools' admission rates, graduation rates, and public/private status. The data is already formatted in long format, and can be found at the link: 'https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdGJWS0RJWnEyQjFnazV2VGxlVVdxeXc&output=csv' Read the data into R using the fetchGoogle() function, and use it to answer the following questions.

Create a linear model exploring the relationship betwen graduation rate and admission rate. Show the summary output below.

# Insert code here

Using the above model, determine the graduation rate of a school with an admission rate of 10%

# Insert code here

Now, create a new model examining the relationship between graduation rate and the school's status as public or private. Show the summary output below.

# Insert code here

According to your model in question 3, do public schools have higher or lower graduation rates? Numerically justify your answer.
Interpret the R² value on your model from question 3.