Simple linear regression is a statistical method for analyzing the relationship between two quantitative variables: an explanatory (or predictor) variable and a response variable. In this R Guide, we will graph the two variables in a scatterplot and interpret the results, looking specifically for a trend. Then we will fit a simple linear regression and interpret the resulting correlation. Using this information, we will make predictions and calculate their residuals, as well as perform some hypothesis tests.
Below is an outline of our line of best fit, or in other words, our regression line.
\(\sf{y_{i}}\) denotes the response variable
\(\sf{x_{i}}\) denotes the predictor variable
\(\sf{\hat{y}_{i}}\) is the predicted response
Then, the equation for the best-fitting line is:
\(\sf{\hat{y}_{i} = b_{0} + b_{1}x_{i}}\)
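The intercept \(\sf{b_{0}}\) and slope \(\sf{b_{1}}\) are the least-squares estimates, which can be written in terms of the sample means \(\sf{\bar{x}}\) and \(\sf{\bar{y}}\):
\(\sf{b_{1} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\)
\(\sf{b_{0} = \bar{y} - b_{1}\bar{x}}\)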
We will use a dataset that records 1960 death rates along with 14 other variables measured for each location, including weather, age of the population, office workers, and pollution.
We’ll start by importing our data into R by reading the csv file you downloaded, titled “deathrateregression.csv”.
# adjust the file path to wherever you saved the csv
deathratedata <- read.csv(file="/Users/Cloutiers2/Documents/deathrateregression.csv")
To get a sense of the data, we can take a look at the first few rows using the head function.
head(deathratedata)
## Index Precipitation JanuaryTemp JulyTemp Ovr65Pop Members.Household
## 1 1 36 27 71 8.1 3.34
## 2 2 35 23 72 11.1 3.14
## 3 3 44 29 74 10.4 3.21
## 4 4 47 45 79 6.5 3.41
## 5 5 43 35 77 7.6 3.44
## 6 6 53 45 80 7.7 3.45
## Schooling Kitchens Pop.Sq.Mile OfficeWorkers PoorFamililes
## 1 11.4 81.5 8.8 42.6 11.7
## 2 11.0 78.8 3.6 50.7 14.4
## 3 9.8 81.6 0.8 39.4 12.4
## 4 11.1 77.5 27.1 50.2 20.6
## 5 9.6 84.6 24.4 43.7 14.3
## 6 10.2 66.8 38.5 43.1 25.5
## HydrocarbonPollution NitricOxidePollution SulfurDioxidePollution
## 1 21 15 59
## 2 8 10 39
## 3 6 6 33
## 4 18 8 24
## 5 43 38 206
## 6 30 32 72
## Humidity DeathRate
## 1 59 921.870
## 2 57 997.875
## 3 54 962.354
## 4 56 982.291
## 5 55 1071.289
## 6 54 1030.380
It is useful to attach our data so we can reference each column by name.
attach(deathratedata)
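One caution: attach() can cause confusion if another object in your workspace shares a column name. If you'd rather not attach, every column can be referenced explicitly with the $ operator instead; for example:
# equivalent column access without attach(), using the $ operator
mean(deathratedata$Schooling)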
Now we can plot a couple of the variables whose relationship we are most interested in. The plot function defaults to a scatterplot, which is a useful way to look at two quantitative variables anyway. We’re looking for a trend in the data that would suggest a linear relationship between them.
We’ll start by plotting the death rate against the percentage of office workers in each area. Intuitively, it would seem that a greater share of people working in sedentary office jobs would contribute to an increased death rate.
plot(OfficeWorkers, DeathRate, ylab = "Death Rate", xlab = "Office Workers", main = "Death Rate and Office Workers")
However, as you can see from the graph, no clear trend exists in this scatterplot. It may be better to move on to a different variable and see if we have more to work with.
plot(Schooling, DeathRate, xlab = "Years of School", ylab = "Death Rate", main = "Death Rate and Schooling")
The relationship between death rate and years of school completed for persons age 22 shows a clearer (negative) trend than the previous plot. So let’s run a simple linear regression to analyze the correlation and see exactly what is going on.
Below is the syntax for a simple linear regression in R. The summary function can give us more information about our model, which we will look into later in the guide.
deathmod <- lm(DeathRate ~ Schooling)
deathmod
##
## Call:
## lm(formula = DeathRate ~ Schooling)
##
## Coefficients:
## (Intercept) Schooling
## 1353.0 -37.6
Using the information R printed for the fitted model, we can write out the regression line we discussed earlier. In this context, the death rate is our response variable, and the average number of years of schooling is the explanatory variable. The line is as follows:
\(\sf{\widehat{DeathRate} = 1353.0 - 37.6 \times Schooling}\)
Let’s graph that line with our scatter plot so that we can visualize it.
plot(Schooling, DeathRate, xlab = "Years of School", ylab = "Death Rate", main = "Death Rate and Schooling")
# abline(deathmod) would use the exact fitted coefficients; here we use the rounded ones
abline(1353, -37.6)
As you can see, the regression line seems to fit the trend of the scatter plot fairly well.
Now we will use our regression line to make predictions for our data. Let’s try to predict the death rate from the average number of years of school completed for index number 5. Each index represents a location, which was not specified in the study from which we drew the data. We start by pulling the actual death rate for the index we want. Then we make a prediction and calculate the residual of that prediction.
# actual death rate and years of schooling for index 5 (columns 16 and 7)
actualrate <- deathratedata[5, 16]
schoolyrs <- deathratedata[5, 7]
# predicted rate: intercept + slope * schooling, via a matrix product with the coefficients
predictedrate <- coef(deathmod) %*% c(1, schoolyrs)
# residual = actual - predicted, expressed as a fraction of the actual rate
deathresid <- actualrate - predictedrate
deathresid / actualrate
## [,1]
## [1,] 0.0740119
Our predicted death rate (expressed as the number of deaths for that location) was around 80 deaths too low: the positive residual means the prediction fell roughly 7.4% below the actual rate.
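As a sanity check on the matrix arithmetic above, the same predicted value can be obtained with R’s predict() function, which takes a data frame of new predictor values:
# same prediction via predict(); should match coef(deathmod) %*% c(1, schoolyrs)
predict(deathmod, newdata = data.frame(Schooling = schoolyrs))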
One of the most important assumptions of a linear regression model is that the errors follow a normal distribution. We can check this by comparing the quantiles of the residuals to normal quantiles, which is a great way to visualize the residuals against normality.
# quantile-quantile plot of the residuals against a normal distribution
resids <- deathmod$residuals
qqnorm(resids)
qqline(resids)
The points are not too far from the line for the most part, which is a good sign for our model.
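Normality is not the only assumption worth a glance; a quick plot of the residuals against the fitted values can reveal non-constant variance or leftover structure. This is just a supplementary check, using the resids object we created above:
# residuals vs. fitted values; a patternless band around zero is what we hope to see
plot(deathmod$fitted.values, resids, xlab = "Fitted Death Rate", ylab = "Residual")
abline(h = 0, lty = 2)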
Another way to test how good our model is at making predictions from our data is to calculate the mean squared error. As the name suggests, this is calculated by taking the squared differences between the estimates and the observed values and averaging them. It’s much easier to let R do the work for you, however. The easiest place to start is the model summary.
summary(deathmod)
##
## Call:
## lm(formula = DeathRate ~ Schooling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151.708 -36.691 2.417 43.811 124.916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1352.996 91.412 14.801 < 2e-16 ***
## Schooling -37.604 8.306 -4.527 3.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.93 on 58 degrees of freedom
## Multiple R-squared: 0.2611, Adjusted R-squared: 0.2484
## F-statistic: 20.5 on 1 and 58 DF, p-value: 3.022e-05
We seem to have a fairly high residual standard error, which calls into question the effectiveness of the model given our data. Another important measure found in the summary is the R-squared value, a measure of how closely our regression line fits the data. An R-squared of about 0.26 is fairly low in this context, which again calls the accuracy of the regression line into question.
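Note that the summary reports the residual standard error rather than the mean squared error itself. If we want the MSE, we can compute it directly from the residuals; the mse name below is just an illustrative choice:
# mean squared error: the average of the squared residuals
mse <- mean(deathmod$residuals^2)
mse
# the residual standard error from the summary is the square root of the sum of
# squared residuals divided by the residual degrees of freedom (n - 2)
sqrt(sum(deathmod$residuals^2) / df.residual(deathmod))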
Confidence intervals show how well we have determined the mean response at a given value of the predictor. If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean.
Prediction intervals tell you where you can expect to see the next data point sampled. If we collect a sample of observations and calculate a 95% prediction interval based on that sample, there is a 95% probability that a future observation will fall within the prediction interval.
New Data Frame Creation
We need to create a new data frame with the years of schooling set to 11, so that we can make a prediction and confidence interval for that value. Following is the syntax for both the data frame and the intervals.
newdata <- data.frame(Schooling = 11.0)
Prediction and confidence intervals when the average number of years of schooling is 11:
confi <- predict(deathmod,newdata,interval = "confidence")
predi <- predict(deathmod,newdata,interval = "prediction")
confi
## fit lwr upr
## 1 939.3557 925.4118 953.2996
predi
## fit lwr upr
## 1 939.3557 830.5044 1048.207
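To see how the two kinds of intervals differ across the whole range of the data, we can compute them over a grid of Schooling values and draw them as bands around the regression line. The grid size and line types below are just illustrative choices:
# confidence (dashed) and prediction (dotted) bands over a grid of Schooling values
schoolgrid <- data.frame(Schooling = seq(min(Schooling), max(Schooling), length.out = 100))
cband <- predict(deathmod, schoolgrid, interval = "confidence")
pband <- predict(deathmod, schoolgrid, interval = "prediction")
plot(Schooling, DeathRate, xlab = "Years of School", ylab = "Death Rate", main = "Death Rate and Schooling")
abline(deathmod)
lines(schoolgrid$Schooling, cband[, "lwr"], lty = 2)
lines(schoolgrid$Schooling, cband[, "upr"], lty = 2)
lines(schoolgrid$Schooling, pband[, "lwr"], lty = 3)
lines(schoolgrid$Schooling, pband[, "upr"], lty = 3)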
To begin, we identify a hypothesis or claim that we feel should be tested, and select a criterion for deciding whether the claim is true. We compare what we observe in our sample to what we would expect to observe if the claim were true. If the discrepancy between what we observe and what the claim predicts is small, we will likely decide that the claim is true. If the discrepancy is too large, we will likely reject the claim.
As an example, our null hypothesis will be that the slope of the regression line is 0 (no linear relationship), and our alternative hypothesis that the slope is not equal to zero. We can test this hypothesis simply by once again looking at the summary of our linear regression model.
summary(deathmod)
##
## Call:
## lm(formula = DeathRate ~ Schooling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151.708 -36.691 2.417 43.811 124.916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1352.996 91.412 14.801 < 2e-16 ***
## Schooling -37.604 8.306 -4.527 3.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.93 on 58 degrees of freedom
## Multiple R-squared: 0.2611, Adjusted R-squared: 0.2484
## F-statistic: 20.5 on 1 and 58 DF, p-value: 3.022e-05
We can see here that our test statistic is -4.527 and our p-value is 3.02e-05. With such a small p-value, we have strong evidence to reject the null hypothesis and conclude that death rate and schooling have a linear relationship.
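To see where those numbers come from, we can reproduce the t statistic and two-sided p-value by hand from the coefficient table: the statistic is the slope estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom. The coefs and tstat names below are just illustrative:
# reproduce the slope's t statistic and two-sided p-value from the summary table
coefs <- coef(summary(deathmod))
tstat <- coefs["Schooling", "Estimate"] / coefs["Schooling", "Std. Error"]
tstat
2 * pt(abs(tstat), df = df.residual(deathmod), lower.tail = FALSE)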
We can also perform an F-test in R to look at the significance of the slope. In simple linear regression it gives the same result as the t-test above; in fact, the F statistic is just the square of the slope’s t statistic.
The F-test essentially looks at the ratio of explained to unexplained variation, where the explained variation is the variation accounted for by our model. The F statistic can be pulled from the model summary as before, and we are given a p-value of 3.022e-05, leading to the same conclusion of rejecting our null hypothesis.
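The same F test can also be pulled out on its own with the anova() function, which prints the explained and residual sums of squares alongside the F statistic and p-value:
anova(deathmod)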
The R-squared value, similar to the F statistic, is built from explained variation: it is the ratio of explained variation to total variation. It is a good measure of how effective a model is at explaining what’s going on in the data set. Being a ratio of a part to the whole, it is a value between 0 and 1, and the closer to 1 the better; that is, we want to explain as much variation as possible with our model.
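As a check on the summary output, R-squared can be computed directly as one minus the ratio of unexplained (residual) variation to total variation; sse and sst below are just illustrative names:
# R-squared = 1 - SSE/SST; should match the Multiple R-squared from summary()
sse <- sum(deathmod$residuals^2)
sst <- sum((DeathRate - mean(DeathRate))^2)
1 - sse / sst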
Throughout this guide, we have looked at simple linear regression, how to implement it in R, and how to interpret the various things you can do with it. I’d encourage you to try these methods on data sets that interest you and dive further into the wonderful world of applied statistics.