This R guide covers Simple Linear Regression and the statistics and tests that come with it. I will go through how to run a simple linear regression in R, what we can do with our model, interpretations, intervals, significance tests, and some other useful information.
There are a couple of preliminary things that are nice to look at to get an idea of where to go with a simple linear regression on our data set. One is to look at a summary of the data and explore the variables. Another is to make a plot to get a first glance at the relationship between the variables.
To demonstrate and explain what Simple Linear Regression is all about, I will walk through everything using a data set that contains people's heights, their parents' heights, and their gender. The first step is to get our data into R. We can do that with a line of code that downloads the data from a URL: name <- read.csv("URL").
Galton <- read.csv("http://cknudson.com/data/Galton.csv")
Now our data set is in the R environment so we can use it and manipulate it to run our analysis. The next thing to do is always to get to know your data. You need to know what your data contains and what it looks like in order to choose the best techniques for analyzing it. I like two commands that give a quick glance. The first is the head() command, which shows the first 6 rows of data. The second is the names() command, which shows all the variable names; this matters because every time we use a variable it must be spelled exactly the same way, including matching case.
head(Galton)
## FamilyID FatherHeight MotherHeight Gender Height NumKids
## 1 1 78.5 67.0 M 73.2 4
## 2 1 78.5 67.0 F 69.2 4
## 3 1 78.5 67.0 F 69.0 4
## 4 1 78.5 67.0 F 69.0 4
## 5 2 75.5 66.5 M 73.5 4
## 6 2 75.5 66.5 M 72.5 4
names(Galton)
## [1] "FamilyID" "FatherHeight" "MotherHeight" "Gender"
## [5] "Height" "NumKids"
Another handy command is summary(). This gives us some basic statistics about the observations in our data set, variable by variable.
summary(Galton)
## FamilyID FatherHeight MotherHeight Gender Height
## 185 : 15 Min. :62.00 Min. :58.00 F:433 Min. :56.00
## 166 : 11 1st Qu.:68.00 1st Qu.:63.00 M:465 1st Qu.:64.00
## 66 : 11 Median :69.00 Median :64.00 Median :66.50
## 130 : 10 Mean :69.23 Mean :64.08 Mean :66.76
## 136 : 10 3rd Qu.:71.00 3rd Qu.:65.50 3rd Qu.:69.70
## 140 : 10 Max. :78.50 Max. :70.50 Max. :79.00
## (Other):831
## NumKids
## Min. : 1.000
## 1st Qu.: 4.000
## Median : 6.000
## Mean : 6.136
## 3rd Qu.: 8.000
## Max. :15.000
##
Now that we have an idea of what our data contains, we have to decide what we want to look at. Simple linear regression is a tool for examining the relationship between two quantitative variables, such as father's height and height, and using one to predict the other. In this case we will use the father's height to predict the subject's height.
In order for R to know which data set we are referencing, there is a little trick: the attach() command. This lets us skip typing the data set name in every command.
attach(Galton)
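If you would rather not use attach(), you can always reach a column with the $ operator, and detach(Galton) undoes the attach when you are finished. For example:
Galton$FatherHeight[1:5]   # the first five father heights, no attach() needed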
After getting the data prepared, the best way to see the relationship between the variables is to plot them. To do this I will use the plot() command, which takes the x-axis variable first and the y-axis variable second, separated by a comma.
plot(FatherHeight, Height)
This plot looks kind of sloppy. It just uses the raw variable names as axis labels, even though they aren't very readable. Instead I want to label both the x and y axes and give the plot a title. All that is required is a few more arguments in the plot command: ylab sets the y-axis label, xlab sets the x-axis label, and main sets the title of the graph.
plot(FatherHeight, Height, ylab = "Subject's Height(in.)",
xlab = "Father's Height(in.)",
main = "Persons's Height Compared to their Father's Height")
That plot looks much better. What isn't so great is how scattered the points are. The things to consider when assessing a plot are: is there a trend, is it positive or negative, are the points loosely scattered or tightly packed, and does a linear relationship between the predictor and response variable seem plausible? The big problem here is that there doesn't seem to be an obvious trend in our plot. While this isn't a great sign, we will proceed anyway to demonstrate the concepts of the analysis.
The next step is to create our regression model to get some concrete numbers about the relationship. We are looking for numbers that describe the correlation and the error of our model. The correlation is the big one, because it tells us how well our predictor variable predicts our response variable.
The linear model command is a very simple one. It is set up like:
modelname <- lm(response ~ predictor)
As mentioned earlier, we want to use father's height to predict height. Then we just call the model's name and it prints its coefficients.
model <- lm(Height~FatherHeight)
model
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Coefficients:
## (Intercept) FatherHeight
## 39.1104 0.3994
As we can see, our model was created under the name "model", so just calling its name in a command shows us our coefficients. We can now write a regression equation from this output: \[\text{predicted height} = 39.1104 + 0.3994 \times \text{father's height}\]
Now to interpret our model and its equation. Interpretation is important so that other people understand the analysis you are doing. The key quantity to interpret is the slope: we state what we expect to happen to the response variable when the predictor variable increases by one unit. For our model this sounds like: for every one-inch increase in the father's height, the subject's height increases by 0.3994 inches on average.
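To see the slope interpretation in numbers, we can plug two father's heights that differ by one inch into the equation (the 70 and 71 below are just illustrative values, not observations from the data):
39.1104 + 0.3994 * 70   # predicted height for a 70 in. father
39.1104 + 0.3994 * 71   # one inch taller father: the prediction goes up by exactly 0.3994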
The other thing to interpret would be the intercept, but you can only interpret it if it makes sense for the predictor variable to be zero. In our case a father's height cannot be zero, so we do not interpret the intercept.
Now that we have coefficients, we can add the regression line to our plot. To do that we use the abline() command, with the intercept first and the slope second, separated by a comma.
plot(FatherHeight, Height, ylab = "Subject's Height(in.)",
xlab = "Father's Height(in.)",
main = "Persons's Height Compared to their Father's Height")
abline(39.1104,.3994)
This puts our regression line on the plot; it is what R considers the best-fit line for our scatter plot. Does it look like a good predictor? Not really: there is a lot of scatter, and the points are not packed closely around the line.
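As a side note, abline() can also take the fitted model object directly, so you don't have to type the coefficients by hand; this draws the same line on whatever scatterplot is currently open.
abline(model)   # equivalent to abline(39.1104, 0.3994); reads the coefficients from the model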
Another way to look at this is the smoothScatter() command, which shows how dense the points are around the fitted line.
smoothScatter(FatherHeight, Height, ylab = "Subject's Height(in.)",
xlab = "Father's Height(in.)",
main = "Persons's Height Compared to their Father's Height")
abline(39.1104,.3994)
This didn't improve the picture much: the area around the line is not a very dark blue, which would indicate a higher density of points there. This shows that there is quite a bit of variance in our data.
An important consideration for a simple linear regression model is whether our assumptions are met. Assumptions matter because if they are not met our model won't be as good as it could be. If there are patterns in our residuals, or our errors are not normally distributed, then we may not have a linear relationship at all (it could be closer to a polynomial relationship), or certain outlying points may be affecting the model more than others. At the end of the day, if you meet your assumptions your model has the best chance of predicting to the best of its ability.
One important assumption is that our errors (or residuals) have a normal distribution. A residual is the observed value minus the predicted value at any value of x. The easiest check is to extract the residuals for all of our points and make a histogram with the hist() command, pulling the residuals out with name <- modelname$residuals.
resid<- model$residuals
hist(resid)
This histogram looks pretty good. It has pretty even tails and looks fairly bell shaped.
Another good way to check this assumption is a quantile (Q-Q) plot, which plots our errors against a straight reference line. If the errors are normally distributed, they will follow that line.
qqnorm(resid)
qqline(resid)
As we can see from this plot, our errors follow the straight line reasonably well, so we will say this assumption is met, while noting the possible issues. The points off the line suggest that we might have skewed data or, most likely, some extreme values that don't fit a normal distribution well.
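If you want a formal check to go along with these plots, one optional extra (not something this guide relies on) is a Shapiro-Wilk test on the residuals; keep in mind that with nearly 900 observations it will flag even minor departures from normality.
shapiro.test(resid)   # null hypothesis: the residuals come from a normal distribution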
Another assumption is homoscedasticity, which means constant variance throughout our data. We can check this by making sure the spread is roughly the same across the range of the predictor, which our earlier scatterplot suggested, but let's look at that plot again.
plot(FatherHeight, Height, ylab = "Subject's Height",
xlab = "Father's Height",
main = "Perons's Height Compared to their Father's Height")
abline(39.1104,.3994)
This looks fine, but there is an easier way to check constant variance: plot the residuals against the predictor and add a horizontal line at 0.
plot(model$residuals ~ FatherHeight)
abline(0,0)
We will say this assumption is met as our plots give no evidence of different levels of variance.
One last thing to look for is a trend in the residual plot. A trend would be a bad sign, because it would suggest a hidden pattern that our linear model doesn't account for, meaning we might have to change the model. Looking at the plot we just made, there is no pattern: the scatter is random and roughly equal along the horizontal line at zero.
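As a shortcut, R's built-in plot method for lm objects produces a standard set of diagnostic plots that bundles the residual and Q-Q checks we just did by hand:
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting layout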
The nice thing about R is that when you create a simple linear regression, many pieces of information about the model are created along with it. This information is stored in the model's summary output.
The model's summary is a quick and easy way to see a lot of information. It is a great place to read off the coefficients for the intercept and slope in order to build and interpret our model. There are also quite a few diagnostics in the output, including correlation and the coefficient of determination (R and R-squared), t-tests, the F-test, coefficient confidence intervals, standard errors, and mean squared error.
The first two columns of the summary output have the coefficients for the slope and intercept as well as their standard errors. This is the easiest way to find this information as it is all in one spot.
summary(model)
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2683 -2.6689 -0.2092 2.6342 11.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.11039 3.22706 12.120 <2e-16 ***
## FatherHeight 0.39938 0.04658 8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
These coefficients let us write the linear equation, and the standard errors show how much uncertainty there is in each coefficient.
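If you only want the coefficient table, or a single number out of it, it can be pulled straight from the summary; the indexing below is just one way to grab an entry:
coef(summary(model))         # the Estimate / Std. Error / t value / Pr(>|t|) table as a matrix
coef(summary(model))[2, 2]   # for example, the standard error of the FatherHeight slope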
Arguably the most important measures of how good a model is are R and \(R^2\). These tell us how well our model predicts the response variable. \[R^2 = \frac{\text{explained variation}}{\text{total variation}}\] In other words, \(R^2\) is the share of the total variation that is explained by our model, i.e. how well it fits. \(R^2\) is always between zero and one, and the closer it is to one the better; a value of 1 would mean our model predicts perfectly. To find it we can just look in the model summary using the summary() command.
summary(model)
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2683 -2.6689 -0.2092 2.6342 11.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.11039 3.22706 12.120 <2e-16 ***
## FatherHeight 0.39938 0.04658 8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
As we can see, our \(R^2\) value is 0.07582, which is very low; our model does not predict very well. This value means that father's height explains only 7.582% of the variability in height. Another thing to look at is the value of R itself. R tells us the strength and direction of the linear relationship: its sign shows whether the variables are positively or negatively correlated, and its magnitude shows whether they are strongly or weakly correlated. It can be found with the cor() command, with either variable listed first, since the answer is the same either way.
cor(FatherHeight, Height)
## [1] 0.2753548
cor(Height, FatherHeight)
## [1] 0.2753548
This is not a great value of R either, and it would be a big indicator not to rely on this model: our predictor variable does not predict our response variable very well.
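As a quick consistency check (just a sketch using quantities we already have), squaring R reproduces the R-squared from the summary, and so does one minus the ratio of residual variation to total variation:
cor(FatherHeight, Height)^2                         # R squared
1 - sum(resid^2) / sum((Height - mean(Height))^2)   # 1 - SSE/SST, the "explained / total" idea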
Mean squared error is exactly what it sounds like: the mean of our squared errors. It is a good measure of model quality because we want as little error as possible. Taking the square root of the sum of squared residuals divided by the 896 residual degrees of freedom gives the residual standard error (the square root of the mean squared error, adjusted for degrees of freedom), which is the number R reports in the summary. We can compute it with the following formula:
sqrt(sum((model$residuals)^2)/896)
## [1] 3.446334
Or we can take the easy route and look in the summary:
summary(model)
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2683 -2.6689 -0.2092 2.6342 11.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.11039 3.22706 12.120 <2e-16 ***
## FatherHeight 0.39938 0.04658 8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
As you can see, these values are equal. There are other useful pieces of information in this summary as well, such as our coefficients and degrees of freedom.
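For a version that doesn't hard-code the 896 degrees of freedom, the model can supply them itself, and R also has a built-in extractor for the residual standard error; both should match the value above:
sqrt(sum(residuals(model)^2) / df.residual(model))   # same formula, df pulled from the model
sigma(model)                                         # R's extractor for the residual standard error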
An important thing to do with our model is to test the significance of a single predictor variable, to see whether it is related to our response variable. We use a hypothesis test for this. Our null hypothesis is that the FatherHeight coefficient equals 0, and our alternative hypothesis is that it does not equal 0. To find our p-value we run a t-test with n - (k + 1) degrees of freedom, which for simple linear regression (one predictor) is n - 2. Our test statistic is \[t = \frac{b_j}{s_{b_j}}\] which is just the point estimate of the slope divided by the standard error of that coefficient. The easiest way to get it is to pull those two numbers out of the model summary and divide them.
tstat <- coef(summary(model))[2,1]/coef(summary(model))[2,2]
tstat
## [1] 8.573704
Now we can use our test statistic and the 896 degrees of freedom from our model to finish the t-test. We double the result of the pt() command because we need both tails of the distribution, not just one. The command is pt(test statistic, degrees of freedom, lower.tail = TRUE/FALSE), where lower.tail = FALSE gives the area to the right of the test statistic.
2*pt(tstat, 896, lower.tail=FALSE)
## [1] 4.354588e-17
Here we can see our p-value is much smaller than the .05 level of significance, so we reject the null hypothesis and conclude that our predictor variable FatherHeight has a significant relationship with our response variable Height. This also supports interpreting the slope the way we did earlier.
We can do another one of these tests for our intercept.
tstat2 <- coef(summary(model))[1,1]/coef(summary(model))[1,2]
tstat2
## [1] 12.1195
2*pt(tstat2, 896, lower.tail=FALSE)
## [1] 2.055761e-31
We get another small p-value, so the intercept is also significant to the model, although we still do not interpret it.
An easier way to do this is to look in the model's summary at the t statistic and p-value reported for each coefficient.
summary(model)
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2683 -2.6689 -0.2092 2.6342 11.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.11039 3.22706 12.120 <2e-16 ***
## FatherHeight 0.39938 0.04658 8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
As you can see, these match the values we computed by hand (the summary rounds the t statistics and displays very small p-values as < 2e-16), so use whichever approach is easiest for you.
An additional way to check whether there is a linear relationship between the two variables is the F-test. This test asks whether there is any relationship between the predictor and response variables. It is set up like our t-test: the null hypothesis is that there is no relationship, and the alternative is that there is some linear relationship between the two variables. We need a p-value of less than .05 to reject the null hypothesis. The F-test p-value appears at the bottom of the model's summary.
summary(model)
##
## Call:
## lm(formula = Height ~ FatherHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2683 -2.6689 -0.2092 2.6342 11.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.11039 3.22706 12.120 <2e-16 ***
## FatherHeight 0.39938 0.04658 8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
Our p-value is very small, so we reject the null hypothesis and conclude that there is a relationship between the two variables. One thing you might notice is that the p-values for the F and t tests are identical. This is always the case in simple linear regression, because with a single predictor the F statistic is just the square of the slope's t statistic; once we add more predictors in multiple linear regression this will change, but for now either test gives the same result.
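We can verify that squared relationship with the test statistic we computed earlier:
tstat^2   # should reproduce the F-statistic of about 73.51 from the summary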
Another thing to look at is the confidence intervals for our coefficients. Our point estimates are not exact, so we want a range in which we are, say, 95% confident the true value lies. A 95% confidence interval is the most common, so I will make that next.
confint(model, level=.95)
## 2.5 % 97.5 %
## (Intercept) 32.7769050 45.443869
## FatherHeight 0.3079585 0.490804
We can interpret this output to mean: for every inch the father's height increases, we are 95% confident that the person's height increases by between 0.308 and 0.491 inches, on average.
Now the point of this model is to use it as a prediction tool. To predict a person’s height we would just plug their father’s height into our model.
We will look at one particular row of data by typing the data set's name followed by [row number, ].
Galton[52, ]
## FamilyID FatherHeight MotherHeight Gender Height NumKids
## 52 16 73 65 M 69.2 9
This gives us all the data from row 52. Two pieces matter to us here: the father's height and the subject's height. We use the father's height to predict the height, so we plug it into our equation:
39.1104+73*.3994
## [1] 68.2666
This gives us a predicted height of 68.2666 inches. The important measure now is how far our predicted value is from the observed value, which is called a residual. You can find it by simply taking the observed value minus the predicted value.
69.2-68.2666
## [1] 0.9334
As we can see, our residual is right about 1 inch, which means our model underpredicted by about 1 inch. Residuals are very important for calculating our error.
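Since the model already stores every residual, we could also pull this one out directly instead of computing it by hand; it differs slightly from 0.9334 only because we rounded the coefficients above.
resid[52]                      # residual for row 52, straight from the model
Height[52] - fitted(model)[52] # the same thing, written out as observed minus fitted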
We can also make confidence and prediction intervals for a certain point in our predictor variable’s range. For this experiment we will choose a father’s height of 73 inches.
In order to do this we need to create a data frame with father’s height equal to 73.
newdata <- data.frame(FatherHeight = 73)
After that we can actually make our intervals. The first one is a prediction interval. Use the command predict(model name, the new data frame, interval = "prediction"); the abbreviation "predict" used below also works because R matches it to "prediction".
predy <- predict(model, newdata, interval="predict")
predy
## fit lwr upr
## 1 68.26522 61.48887 75.04157
The output is our prediction interval, centered on 68.265, which is the point prediction for someone whose father is 73 inches tall. So, we are 95% confident that the height of an individual person whose father is 73 inches tall is between 61.489 and 75.042 inches.
The other interval is a confidence interval, which is the same code but with confidence instead of predict.
confy <- predict(model, newdata, interval="confidence")
confy
## fit lwr upr
## 1 68.26522 67.85344 68.677
The output is our confidence interval, also centered on 68.265, which is the point prediction for the mean. So, we are 95% confident that the mean height of people whose fathers are 73 inches tall is between 67.853 and 68.677 inches.
Thinking about the concepts of prediction versus confidence intervals, we would expect the prediction interval to be much wider than the confidence interval. A quick way to verify this is with the following commands.
confy %*% c(0, -1, 1) #conf interval width
## [,1]
## 1 0.8235518
predy %*% c(0, -1, 1) #pred interval width
## [,1]
## 1 13.5527
We can see that our expectation clearly holds. The other thing both intervals should share is the same center: the point prediction for height when the father's height is 73 inches. We can check that by simply testing whether the centers are equal.
confy[1] == predy[1]
## [1] TRUE
R gives us a nice output saying simply “TRUE” which means they are equal and have the same centers.
Simple linear regression has a lot of steps. We start by reading in the data, getting to know what it's like, plotting variables against each other, creating the model, and then checking the model: errors, correlation, coefficient significance, intervals, and so on. In the end we have many different ways to tell whether our model is good, how good it is, and how it could be better. It is a powerful tool for anyone who knows how to work through all that is involved.