For this document, we assume that we have two quantitative variables. We will learn how to plot the data and interpret the resulting scatterplot. We will also calculate correlation (used for measuring the strength and direction of the linear relationship). Then we will begin regression. We will fit the regression line, interpret it, use it for prediction, and calculate residuals.
For this document, we use the USStates dataset. For our ease, we download the data set from the website. To load the dataset, we tell R the URL and we tell it that we have a csv. Let’s read in the data and name the data set “states.”
states<-read.csv(url("http://www.lock5stat.com/datasets/USStates.csv"))
Before we go farther, I always look at the data to see what it looks like. Let’s look at the first few rows:
head(states)
## State HouseholdIncome Region Population EighthGradeMath HighSchool
## 1 Alabama 43.253 S 4.849 269.2 84.9
## 2 Alaska 70.760 W 0.737 281.6 92.8
## 3 Arizona 49.774 W 6.731 279.7 85.6
## 4 Arkansas 40.768 S 2.966 277.9 87.1
## 5 California 61.094 W 38.803 275.9 84.1
## 6 Colorado 58.433 W 5.356 289.7 89.5
## College IQ GSP Vegetables Fruit Smokers PhysicalActivity Obese
## 1 24.9 95.7 32.615 74.2 54.1 21.5 45.4 32.4
## 2 24.7 99.0 61.156 80.8 60.3 22.6 55.3 28.4
## 3 25.5 97.4 35.195 76.2 60.5 16.3 51.9 26.8
## 4 22.4 97.5 31.837 72.0 49.5 25.9 41.2 34.6
## 5 31.4 95.5 46.029 82.7 69.6 12.5 56.3 24.1
## 6 37.0 101.6 46.242 80.9 64.3 17.7 60.4 21.3
## NonWhite HeavyDrinkers Electoral ObamaVote ObamaRomney TwoParents
## 1 30.7 4.3 9 0.384 R 58.7
## 2 33.1 8.2 3 0.408 R 69.6
## 3 20.8 6.3 11 0.446 R 62.7
## 4 21.7 5.0 6 0.369 R 62.0
## 5 37.7 6.4 55 0.602 O 65.3
## 6 15.8 6.7 9 0.515 O 69.9
## StudentSpending Insured
## 1 8.755 78.8
## 2 18.175 79.8
## 3 7.208 74.7
## 4 9.394 71.7
## 5 9.220 79.7
## 6 8.647 80.0
Let’s get the names of the variables:
names(states)
## [1] "State" "HouseholdIncome" "Region"
## [4] "Population" "EighthGradeMath" "HighSchool"
## [7] "College" "IQ" "GSP"
## [10] "Vegetables" "Fruit" "Smokers"
## [13] "PhysicalActivity" "Obese" "NonWhite"
## [16] "HeavyDrinkers" "Electoral" "ObamaVote"
## [19] "ObamaRomney" "TwoParents" "StudentSpending"
## [22] "Insured"
You should check out the head and the summary every time you open a data set, but keep the results to yourself. In other words, do it for your own knowledge, but don’t turn that part in.
Let’s attach the data set so that we don’t have to refer to it by name every time.
attach(states)
We graphically display the relationship between 2 quantitative variables using scatterplots. In R, we do this with the command “plot.”" This command has two arguments: the two quantitative variables. We enter the quantitative variables separated by commas. The first quantitative variable is generally the explanatory variable, while the second quantitative variable is generally the response.
We are interested in the relationship between two variables: {HouseholdIncome} and Obese. Obese is the percentage of residents in a state who are classified as obese. {HouseholdIncome} is the state’s mean household income.
We could make an argument either way, but let’s say that {HouseholdIncome} is the explanatory variable and Obese is the response. My reasoning is that those who cannot afford healthy food are more likely to eat cheap junk food and become obese.
To get the scatterplot, type:
plot(HouseholdIncome,Obese)
You can also type plot(Obese~HouseholdIncome)
Do you like the axis labels? I don’t. R just uses the variable names by default, but the variable names are often vague or confusing. Let’s make them understandable to the general reader. First, check out how we made this scatterplot. The variables HouseholdIncome and Obese are “arguments” of the plot function. We can add more arguments to this function to customize the output. We use xlab to change the x-axis label. Similarly, ylab changes the y-axis label. We can also add a title through the main argument. I generally do not add titles to my plots, but I’ll show you in case this strikes your fancy. (If/When you write an academic paper, your plots will not have titles because each plot should instead have a figure caption.)
plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)",
xlab = "Household Income (USD)",
main = "US States' Obesity Rates and Mean Household Incomes")
What trends do we see? Be sure to describe the trend in the context of the problem. Is there a positive association or negative association between the two variables? Do the points follow the trend closely? That is, are the points tightly or loosely packed together? Does it look like there could be a linear relationship?
Correlation tells us the strength and direction of the linear relationship. In other words, we know there is a linear association. The next step is to actually find the line that best describes the linear association. In other words, we need to find the line of best fit or the regression line.
We fit a linear model with the {lm} command. Those are the letters “el” and “em”. The general arguments for the {lm} command are as follows:In our case, our response is the obesity rate and the explanatory variable is the mean household income. We can name our model anything we want, but I’ll name it mymod.
mymod <- lm(Obese ~ HouseholdIncome)
The lm command creates the linear model. Since we’ve saved it in mymod, we can now call it up.
mymod
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Coefficients:
## (Intercept) HouseholdIncome
## 42.1759 -0.2517
The call output first reminds us of the exact model we fit. This can be useful when we’ve fit many models. Next, we see our estimates for the intercept and slope. You can also think of slope as the coefficient for the variable HouseholdIncome. This is how R presents the slope.
Since we have the intercept and slope for our linear regression, please go ahead and write down the regression line.
Once you have your linear regression, it is very important to interpret it. A regression is worthless if you cannot communicate your results. Remember, communication is essential!
For our regression, let’s interpret the slope. Usually we interpret the slope by saying what we expect to happen to the response when we increase the explanatory variable by one unit. In our specific regression, we’d say: for every thousand dollar increase in a state’s mean household income, the obesity rate decreases on average by .2571 percentage points.
Can we interpret the intercept? Look back at the data and at the scatterplot to answer this question. (Hint: we can interpret the intercept if and only if we can reasonably set the explanatory variable to zero.)
You may want to plot the data in a scatterplot and include the regression line. This can help people reading your report see the linear trend. After all, not everyone is a natural statistician like you are!
To plot the data and the regression line, you first need to create the regression model. You did that earlier with the lm command. Next, you plot the data, and immediately follow up with the abline command to draw in the regression line. The arguments for the abline command are the intercept followed by the slope. That is, the command looks like abline(intercept, slope).
plot(HouseholdIncome,Obese,ylab="Obesity Rate (%)",xlab="Household Income (USD)")
abline(42.1759,-0.2571)
If your regression line doesn’t show up on your plot, it probably means you’re typing it in wrong. Check the order of your arguments (always intercept first, then slope) and check that you’ve copied and pasted your numbers correctly.
To predict, we simply plug a value of an explanatory variable into our regression equation. For example, if we want to predict a state’s obesity rate, we input its mean household income.
Let’s look at California. It’s on line 5 of the states data. We can look at California’s data using indexing. (You can read more about indexing in a section coming up.) Here are a couple ways to index: use the variable name (e.g. “Obese”) or use the column number.
states[5,"Obese"]
## [1] 24.1
states[5,2]
## [1] 61.094
Let’s see what our regression line predicts for California’s obesity rate. This is how you can do it if you just want to type in everything yourself:
42.1759 + -.2571 * 61.094
## [1] 26.46863
If you are more comfortable with R, you can use some additional commands. coef selects the coefficients of the model:
coef(mymod)
## (Intercept) HouseholdIncome
## 42.1758781 -0.2516667
We see the order of the coefficients is intercept followed by predictor. We also see the coefficients are contained in a vector of length 2. If we matrix multiply the coefficients by a vector (of length 2) containing 1 and by the household income, the result is the predicted obesity rate.
coef(mymod)%*%c(1, 61.094)
## [,1]
## [1,] 26.80055
There are other ways to do prediction in R. This is just two ways. Do whichever you prefer.
Using whatever method you prefer, predict the obesity rate for Minnesota, which is on line 23.
Recall a residual is observed - expected (or observed - predicted). Then the residual for California is its true obesity rate minus the predicted obesity rate according to the linear regression.
Subtracting two numbers is easy enough. Calculate and interpret the residuals for California and Minnesota, then let’s move onto indexing, which is much more interesting than subtraction.
Indexing means we use brackets to isolate pieces of the data set. Since states is a matrix, it has rows and columns. We can get certain rows of the data set by specifying the row numbers. Similarly, we can get certain columns by specifying the column numbers. California is the 5th row. We can get the 5th row and every column of the dataset by typing
states[5, ]
## State HouseholdIncome Region Population EighthGradeMath HighSchool
## 5 California 61.094 W 38.803 275.9 84.1
## College IQ GSP Vegetables Fruit Smokers PhysicalActivity Obese
## 5 31.4 95.5 46.029 82.7 69.6 12.5 56.3 24.1
## NonWhite HeavyDrinkers Electoral ObamaVote ObamaRomney TwoParents
## 5 37.7 6.4 55 0.602 O 65.3
## StudentSpending Insured
## 5 9.22 79.7
(Note there is nothing between the comma and the bracket.) Similarly, the obesity rate is in the 14th column. We can get every state’s obesity rate by typing
states[ ,14]
## [1] 32.4 28.4 26.8 34.6 24.1 21.3 25.0 31.1 26.4 30.3 21.8 29.6 29.4 31.8
## [15] 31.3 30.0 33.2 33.1 28.9 28.3 23.6 31.5 25.5 35.1 30.4 24.6 29.6 26.2
## [29] 26.7 26.3 26.4 25.4 29.4 31.0 30.4 32.5 26.5 30.0 27.3 31.7 29.9 33.7
## [43] 30.9 24.1 24.7 27.2 27.2 35.1 29.8 27.8
(Again, note there is nothing between the comma and the bracket.) To get California’s obesity rate, we need to tell R the row (5) and column (14)
states[5,14]
## [1] 24.1
Now we know how to use indexing to get specific data values from our data set. Let’s put it together with our previous calculation for predicting the obesity rate in California. You can replace the prediction calculation with your own prediction calculation.
CA.actual <- states[5 ,14]
CA.income <- states[5, "HouseholdIncome"]
CA.predicted <- coef(mymod)%*%c(1, CA.income)
CA.resid <- CA.actual - CA.predicted
CA.resid
## [,1]
## [1,] -2.700554
Our predicted obesity rate for California was about 2.7 percentage points too low.
We were able to figure out that California was the fifth state in our dataset based on the head of the dataset. In general, how can we figure out which row a state is in?
We know the state names are contained in the State variable. We can ask R, “Which state is named California?” through the following command:
which(State=="California")
## [1] 5
This says the fifth state in our data set has the name “California.” What happens if you misspell California or call it “CA”? R gets confused. No states in the data set are named “CA,” so R gives you a weird answer that is trying to say “I can’t find CA in my dataset.”
which(State=="CA")
## integer(0)
Notice that we posed our question using two equality signs (==). In R (and many other programming languages), a double equals is a way to check for equality. For example,
mango <- 3
mango
## [1] 3
sets mango to 3. Then we can ask R, “Is mango equal to 4?”
mango == 4
## [1] FALSE
and R tells us it is not.
What are the assumptions for regression models? Go check your notes to remind yourself of all of them. Which of these assumptions can be checked without a computer?
One of the assumptions is that the errors will have a normal distribution. Let’s check this. We can get all of the residuals for the data using the following:
myresids <- mymod$residuals
Let’s see if these are normally distributed. Your first thought is probably, “Let’s make a histogram of the residuals!”
hist(myresids)
It’s hard for humans to compare data against a curve. It’s easier for us to compare data against a straight line. Let’s plot the quantiles of the residuals against the quantiles of a normal distribution. If the residuals are normally distributed, they’ll follow a straight line.
qqnorm(myresids)
qqline(myresids)
This doesn’t look bad: the points are too far from the line.
Another one of the assumptions is homoscedasticity (i.e. we assume that the variance of the residuals will be equal for all predictor values). For simple linear regression, the laziest way to check this is to look at the original data again.
plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)",
xlab = "Household Income (USD)",
main = "US States' Obesity Rates and Mean Household Incomes")
Check out the vertical spread at various household incomes. The vertical spread should be approximately the same, regardless of the household income. If the data are more tightly packed at one place than another, then you have trouble.
Still, there’s another plot we can make to check this assumption of equal variance. We can plot the residuals against the household income values. It’s a bit easier to compare variances.
plot(mymod$residuals ~ HouseholdIncome)
abline(0,0)
I don’t see obvious evidence of heteroscedasticity.
Look at the book for the equation for mean square error. You can calculate it by hand:
sqrt(sum((mymod$residuals)^2)/48)
## [1] 2.589553
but really you should just let R do the work for you. You can get the MSE (and a lot more information) from the model summary.
summary(mymod)
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1702 -1.7152 0.4645 1.7809 4.6312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.17588 2.29764 18.356 < 2e-16 ***
## HouseholdIncome -0.25167 0.04257 -5.912 3.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 48 degrees of freedom
## Multiple R-squared: 0.4213, Adjusted R-squared: 0.4093
## F-statistic: 34.95 on 1 and 48 DF, p-value: 3.416e-07
The model’s slope and intercept are in the model summary. Find these two pieces of information. Skim the rest of the model summary. Why are there 28 degrees of freedom? (Hint: the US currently has 50 states and we are estimating 2 regression coefficients in our regression equation.)
It’s your turn to try this all out! Have fun practicing your new skills on at least 2 of the following data sets.
Use the women data (built into R); in particular, use the women’s weight as the response and the height as the predictor.
data(women)
Use petal length and petal width to keep practicing your new skills.
data(iris)
Use the Beerwings data from the resample package. (You might need to isntall the resample package first.) #{r} #library(resampledata) #data(Beerwings) The units of beer is fluid ounces.