This R guide covers the commands used to examine a simple linear regression model. Let's start by defining what a simple linear regression model is.
A simple linear regression model assumes that X and Y are linearly related; that is, Y = beta zero + beta one * X + error, where X is the predictor and Y is the response variable. A regression is used to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable.
I chose to look at the built-in dataset cars. In this example we will use speed as the predictor and distance as the response. We are interested in whether the speed at which a vehicle is traveling has any effect on its stopping distance. Let's start by loading the data and examining the different variables.
data("cars")
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
names(cars)
## [1] "speed" "dist"
attach(cars)
The set of commands above loads the dataset named cars that ships with R. The head command lets the user see the first six rows of the data. The names command gives the user the list of column names, which is useful for double-checking that all of the labels are listed. The attach command makes the columns available by name so the data frame doesn't have to be referenced repeatedly in the future. Now that we have the data, let's graph the two variables that we're interested in to see if there is any relationship between them.
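As an optional extra look at the data (not shown in the original output), str and dim summarize the structure of the data frame:
str(cars)   # 50 observations of 2 numeric variables: speed and dist
dim(cars)   # 50 rows, 2 columns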
plot(dist~speed,ylab = "Stopping Distance in ft",xlab = "Speed in mph",main = "Stopping Distance for cars")
The plot command gives a scatter plot of the points. The command takes two quantitative arguments: the explanatory variable, which in this case is the speed of the car, and the response variable, which is the stopping distance. For every speed of the car, there is a stopping distance. Some speeds have multiple stopping distances, which means there are factors besides speed that affect how quickly the cars come to a complete stop, such as the brakes, tire brand or condition, and road conditions.
cmod<-lm(dist~speed,data = cars)
cmod
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
Now that we have the scatter plot, let's find the linear regression line. R allows us to name our linear model so that the formula doesn't have to be typed repeatedly. Printing the model gives us the y-intercept and slope, which we will add to our plot.
plot(dist~speed,ylab = "Stopping Distance in ft",xlab = "Speed in mph",main = "Stopping Distance for cars")
abline(-17.579,3.932)
distance = -17.579 + 3.932 * speed

Beta zero corresponds to the y-intercept and beta one is the slope of the line. Beta zero is the mean value of the stopping distance when the speed is 0. We can't interpret beta zero in this instance because a car with zero speed can't have a negative stopping distance. The beta one of 3.932 means the stopping distance increases by 3.932 feet for every increase of one mph in speed.
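As a small optional variation (not the approach used above), the fitted coefficients can be pulled straight from the model object so the plotted line always matches the fit exactly:
coef(cmod)   # named vector with (Intercept) about -17.579 and speed about 3.932
# abline(cmod) draws the same fitted line without retyping the numbers by hand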
newdata <- data.frame(speed = 18)
(distpredict <- predict(cmod, newdata, interval="predict") )
## fit lwr upr
## 1 53.20426 21.89839 84.51014
(confpredict <- predict(cmod, newdata, interval="confidence") )
## fit lwr upr
## 1 53.20426 48.32138 58.08715
Once we have the linear regression model, we can use it to make predictions. There are two kinds of intervals we can report for a new speed of 18 mph. The prediction interval estimates, with 95% confidence, that the stopping distance of an individual car traveling 18 mph will be between about 21.9 and 84.5 feet. The confidence interval estimates, with 95% confidence, that the mean stopping distance of all cars traveling 18 mph is between about 48.3 and 58.1 feet.
distpredict %*% c(0, -1, 1)
## [,1]
## 1 62.61175
confpredict %*% c(0, -1, 1)
## [,1]
## 1 9.76577
distpredict[1] == confpredict[1]
## [1] TRUE
We can see that the prediction interval is much wider than the confidence interval. This is expected because a prediction for a single new observation carries the variability of that individual observation on top of the uncertainty in estimating the mean response. We also made R check whether the two intervals are centered at the same fitted value, and they are.
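As a side check (a sketch using the textbook standard error formulas, not part of the original commands), we can reproduce the two interval widths by hand and see why the extra term for an individual observation makes the prediction interval wider:
s    <- summary(cmod)$sigma                         # residual standard error
n    <- nrow(cars)
x0   <- 18
xbar <- mean(speed)
Sxx  <- sum((speed - xbar)^2)
se.mean <- s * sqrt(1/n + (x0 - xbar)^2 / Sxx)      # SE for the mean response
se.pred <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)  # SE for a new observation
tcrit <- qt(0.975, df = n - 2)
c(conf.width = 2 * tcrit * se.mean,                 # about 9.77, matching above
  pred.width = 2 * tcrit * se.pred)                 # about 62.61, matching above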
The normality assumption says that at any given value of x, the population of potential error terms has a normal distribution. In practice, we check this by pooling the residuals and asking whether they look approximately normally distributed.
cresid1<-cmod$residuals
cresid1
## 1 2 3 4 5 6
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584
## 7 8 9 10 11 12
## -3.744993 4.255007 12.255007 -8.677401 2.322599 -15.609810
## 13 14 15 16 17 18
## -9.609810 -5.609810 -1.609810 -7.542219 0.457781 0.457781
## 19 20 21 22 23 24
## 12.457781 -11.474628 -1.474628 22.525372 42.525372 -21.407036
## 25 26 27 28 29 30
## -15.407036 12.592964 -13.339445 -5.339445 -17.271854 -9.271854
## 31 32 33 34 35 36
## 0.728146 -11.204263 2.795737 22.795737 30.795737 -21.136672
## 37 38 39 40 41 42
## -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
## 43 44 45 46 47 48
## 2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285
## 49 50
## 43.201285 4.268876
hist(cresid1)
qqnorm(cresid1)
qqline(cresid1)
First I extracted the residuals from the fitted model using the $ operator. I graphed the residuals in a histogram to see if they are bell-shaped; based on the histogram, the distribution looks right-skewed. Since we're not convinced, we also plot the quantiles of the residuals against the quantiles of a normal distribution with qqnorm and qqline.
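As an optional formal check (not part of the original walkthrough), a Shapiro-Wilk test on the residuals supplements the graphical checks; a small p-value would indicate a departure from normality:
shapiro.test(cresid1)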
Another assumption we can look at is the constant variance assumption: the spread of the potential error terms is the same at every value of x (speed).
plot(cmod$residuals ~ speed)
abline(0,0)
We can plot the residuals against speed, which makes it a bit easier to compare variances. There doesn't seem to be strong evidence of heteroscedasticity, since the vertical spread of the points around the zero line stays roughly similar across speeds.
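R's built-in diagnostic plots for lm objects offer an optional alternative view of the same assumptions:
plot(cmod, which = 1)   # residuals vs fitted values
plot(cmod, which = 3)   # scale-location plot, useful for judging constant variance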
In order to determine the usefulness of this model, we need to test the significance of the regression relationship and the y-intercept.
summary(cmod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary command gives us a table of statistics. Once again we can see the y-intercept and slope under the Coefficients heading. The p-value for the slope is less than .05, which means we can reject the null hypothesis that beta one is equal to zero (the intercept's p-value is also below .05). Therefore we can say that the regression relationship is significant.
Another way to look at the significance is the F-test; in a simple linear regression, the F statistic is the square of the slope's t statistic, so the two tests are equivalent. We define our null hypothesis as beta one equal to zero against a two-sided alternative. If we can reject the null hypothesis, then we can conclude that there is a linear relationship between x and y. Based on our result, the p-value is extremely small and well below the .05 significance level.
On the same line as the F-statistic we can find the degrees of freedom. Degrees of freedom = number of observations - number of regression coefficients, so here it is 50 - 2 = 48.
We can also find the simple coefficient of determination, denoted r^2, which measures the usefulness of a simple linear regression: the proportion of the total variation in the response that the model explains. The value of r^2 says that the regression model explains 65.11% of the total variation in stopping distance.
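If you prefer to pull these numbers out of the fitted model programmatically rather than read them off the printed summary (an optional step), they are stored in the model and summary objects:
csum <- summary(cmod)
csum$r.squared       # about 0.6511
csum$fstatistic      # F value with its numerator and denominator degrees of freedom
df.residual(cmod)    # 48 = 50 observations - 2 estimated coefficients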
confint(cmod,level = .95)
## 2.5 % 97.5 %
## (Intercept) -31.167850 -3.990340
## speed 3.096964 4.767853
We're 95% confident that the intercept and the slope each lie within their respective ranges.
cor(speed,dist)
## [1] 0.8068949
We can see that the two variables are strongly positively correlated, which supports using a linear model.
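As a quick consistency check, squaring this correlation recovers the R-squared reported by summary():
cor(speed, dist)^2   # about 0.6511, matching Multiple R-squared above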
We can see that a linear regression model is useful when we're interested in how the predictor and response variables are related. With a useful model, we can predict the response relatively accurately. Once we've established a linear model, we used different methods to test and confirm that it is a useful one. In this particular example, a linear model is a great fit because the two variables are highly correlated.