In most experiments, we want to learn the answer to the question: “How does X affect Y?” In these experiments, X is our independent variable, and Y is our dependent variable. Looking for any kind of relationship between the two variables is known as correlation analysis. A correlation exists if a change in the magnitude of one variable occurs simultaneous to a change in the magnitude of the other variable. However, correlation does not necessarily imply causation. For example, there certainly exists a correlation between a person's weight and height, but an increase in weight does not cause an increase in height.
To find the magnitude of these correlations, we use regression analysis. Regression analysis derives a linear equation between the variables of interest. It allows us to predict the value of the dependent variable (called the “response variable” in regression analysis) based on the value of the independent variable (called the “explanatory variable”) using a best fit line.
We can use the lm() function to perform regression analysis. The general format of these models is: model_name = lm(dependent_variable ~ independent_variable, data = data_name). Make sure you give your model a name!
In the example below, we create a linear model using the Galton data. We use the summary command to show the results of the regression analysis.
Galton = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdFlOcmt5bzY4VlFKRmtDdFJRMldTeEE&output=csv
")
## Error: Missing packages. Please retry after installing the following:
## RCurl
mod1 = lm(height ~ father, data = Galton)
mod1
##
## Call:
## lm(formula = height ~ father, data = Galton)
##
## Coefficients:
## (Intercept) father
## 39.110 0.399
We can now see the correlation between the two variables with this linear model, and we can write the model in y = mx + b form. In the example above, we now have
height = 0.39938*father + 39.11039
How do we interpret this? We can say that for every inch the father's height rises, the child's height rises by 0.39938 inches.
For example, our model predicts that if the father's height is 60, then:
height = 0.39938*(60) + 39.11039, meaning the model predicts the child's height will be 63.07319
We can further examine the model using the summary() function, which provides descriptive statistics of the model. The summary() function takes one argument, which is a model name.
summary(mod1)
##
## Call:
## lm(formula = height ~ father, data = Galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.268 -2.669 -0.209 2.634 11.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.1104 3.2271 12.12 <2e-16 ***
## father 0.3994 0.0466 8.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.45 on 896 degrees of freedom
## Multiple R-squared: 0.0758, Adjusted R-squared: 0.0748
## F-statistic: 73.5 on 1 and 896 DF, p-value: <2e-16
The summary output also shows the standard error on each variable. It also shows the t and p-values, but you can ignore those for the time being.
We frequently use categorical variables as explanatory variables in our models. In these situations, linear regression still applies, but we interpret the models differently.
For example, the Galton data contains one categorical variable: sex. Suppose we want to create a model examining the relationship between a child's height and their gender. We use the same syntax in R to build this type of model.
mod2 = lm(height ~ sex, data = Galton)
summary(mod2)
##
## Call:
## lm(formula = height ~ sex, data = Galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.23 -1.61 -0.11 1.77 9.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.110 0.121 531.7 <2e-16 ***
## sexM 5.119 0.168 30.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 896 degrees of freedom
## Multiple R-squared: 0.51, Adjusted R-squared: 0.51
## F-statistic: 933 on 1 and 896 DF, p-value: <2e-16
Before attempting to interpret this model's coefficients, remember that 'sex' is a categorical variable. In the above output, we are given the model:
height = 5.1187sexM (yes/no) + 64.1102. Here, sexM stands for the child being a male. If this is the case, then R codes it as a “1”, whereas if the case is a female, then sexM is coded as “0”. Therefore, the model predicts that a male will have a height of 5.1187*(1) + 64.1102 = 69.22882, and a child that is not a male will have a height of 5.1187(0) + 64.1102 = 64.1102. In other words, our model predicts that males are 5.1187 inches taller than females.
There is also a function in the Project Mosaic package that allows us to set the values of the explanatory variables in order to make predictions. This function is called makeFun(), and it takes the name of a linear model as its argument
First, we must create a function. Decide on a name for this function (we use 'f1' in the example below), and determine which model you will use (mod1 in this example).
f1 = makeFun(mod1)
Now, we have a function, 'f', that takes in values of the model's explanatory variables in order to predict the value of the response variable. To input these values, simply type:
f1(father = 60)
## 1
## 63.07
Notice that the output to this function is the same number as the answer we got algebraically with our linear equation.
The makeFun() function also makes predictions when our explanatory variables are categorical. For example, we can create another function for mod2:
f2 = makeFun(mod2)
What should we set as our arguments for this function? We can set “sex” equal to any of the levels of gender. To figure out how gender is coded in this dataset, use the levels() function:
levels(Galton$sex)
## [1] "F" "M"
Since “M” and “F” are the levels of sex, set one of those to be the arguments for the function f2:
f2(sex = "M")
## 1
## 69.23
f2(sex = "F")
## 1
## 64.11
[This text will hopefully be replaced with a Shiny App] There is one more thing you need to know about the regression output for now. Note at the bottom of the output the Multiple R-squared value.
Multiple R2 tells us the proportion of the variability in the data that can be explained by our model. R2 can be between 0 and 1, with 0 being a model that explains nothing, while 1 means the model explains everything.
Adding more explanatory variables will always increase the Multiple R2 value.
A Model based on only two data points will have an R2 value of 1, because the best-fit line will simply connect the two points (so the model will explain all the variation in the data). But, this is misleading because we cannot make accurate statistical predictions based on two trials!
The 'college' data contains data on 195 schools' admission rates, graduation rates, and public/private status. The data is already formatted in long format, and can be found at the link: 'https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdGJWS0RJWnEyQjFnazV2VGxlVVdxeXc&output=csv' Read the data into R using the fetchGoogle() function, and use it to answer the following questions.
# Insert code here
# Insert code here
# Insert code here
According to your model in question 3, do public schools have higher or lower graduation rates? Numerically justify your answer.
Interpret the R2 value on your model from question 3.