Introduction

This R guide shows the reader how to use R functions related to the material covered in Chapter 3 of Forecasting, Time Series, and Regression, Fourth Edition, by Bowerman, O'Connell, and Koehler. The main topic of Chapter 3 is simple linear regression, so this guide focuses on what it is, how to do it in R, and what the results mean to the reader.

Section 3.1: The Simple Linear Regression Model

Simple Linear Regression

Simple linear regression is used when the relationship between two variables, a predictor variable and a response variable, is thought to follow a straight line.

A good way to explain simple linear regression is an example where the predictor variable of the model is the size of a house and the response variable is the price of the house. As the size of the house increases, the price of the house would most likely also increase in roughly a straight line. The increase in price for every additional square foot of house size is captured by the slope term of the simple linear regression model, B1*x, where B1 is the amount that the price increases for every 1 sq ft increase in size and x is the predictor variable, in this case the size of the house.

The price of a house with a size of 0 square feet would be the intercept, B0, but since there are no houses with 0 square feet this point is not in our data cloud and we cannot interpret what the intercept really means. There is also the error of the model to consider, which is given the symbol E. So the simple linear regression model works out to be y = B0 + B1*x + E.
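
A quick way to see what this model means is to simulate data that follows it. The numbers below are purely hypothetical and are not from the textbook; they only illustrate the roles of B0, B1, x, and E.

set.seed(1)                            # makes the random numbers reproducible
B0 <- 20000                            # hypothetical intercept
B1 <- 150                              # hypothetical price increase per square foot
x  <- runif(100, 500, 3000)            # hypothetical house sizes in square feet
E  <- rnorm(100, mean = 0, sd = 20000) # random error with mean 0
y  <- B0 + B1*x + E                    # simulated house prices
plot(y ~ x)                            # the points scatter around a straight line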

Using R to Make A Simple Linear Regression Model

Using R to find a simple linear regression model is straightforward. First pick the data that the reader wants to fit a simple linear regression to. Instead of loading a package, I chose to use a dataset that is already built into R, the dataset cars.

attach(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
?cars

If the reader is using a package, they would first have to use the R code install.packages(), which installs the package that the reader picks. Then they would use library(), which loads the package and makes it usable to the reader.
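
For example, if the reader wanted to use the ggplot2 package (used here purely as an illustration; any CRAN package works the same way), the two steps would look like this:

install.packages("ggplot2")   # run once to download and install the package
library(ggplot2)              # run in each new R session to load the package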

The command attach() puts the variables of the chosen dataset on the search path so there is no need to name the dataset every time. With attach(cars) I can look at a variable such as speed without typing the dataset cars every time. The R code head() shows the first six rows of the data and the column names. The columns in cars are speed and dist. ?cars is R code that lets us look up the meaning behind cars and its variables. It shows that the data come from research in the 1920s on how far it took a car to stop, in feet, depending on its speed in mph. Typing cars shows all of the data included in the dataset.

Now that the data are loaded, we can get on to making a simple linear regression with them. With cars, speed is the predictor variable for the response variable dist (stopping distance). So we can make a simple linear regression of these two variables by using the command lm(), and we can call it model1.

model1<-lm(dist~speed)
model1
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
plot(dist~speed)

The command lm() is used to find the intercept and slope between the variables that we have picked. In the lm() command the response variable goes first, followed by ~ and the predictor variables added together; since in simple linear regression there is only one predictor variable and one response variable, the formula for this command looks like this: lm(response variable ~ predictor variable).

model1 is the name we gave to the command lm(dist~speed), and from its results we can see a linear formula with intercept = -17.579 and slope of speed = 3.932. The intercept is the predicted stopping distance when the speed of the car is 0, but since in the dataset cars there is no car with a speed of 0 we cannot interpret the intercept. The slope means that for each 1 mph increase in speed the stopping distance needed increases by 3.932 feet.
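
If the reader wants to reuse these numbers in later calculations, one option (not required by the textbook) is to pull them out of model1 with the command coef():

coef(model1)            # a named vector holding the intercept and the slope
coef(model1)["speed"]   # just the slope, about 3.932 feet per mph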

The command plot(dist~speed) shows a good scatterplot representation of speed vs stopping distance and also shows that there seems to be a linear relationship between these two variables that is positively correlated (upward sloping).

How to Create an Appealing Scatterplot

The reader can make the scatterplot more appealing by using the following commands:

plot(dist~speed, ylab = "Stopping Distance of a Car in Feet",
     xlab = "The Speed of a Car in MPH", 
     main = "The Stopping Distance of a Car in the 1920s Depending on its Speed")
abline(model1)

The scatterplot command was expanded to show that by using ylab the reader can add a title for the y axis, by using xlab the reader can add a title for the x axis, and by using main the reader can add a title for the whole plot.

The command abline(model1) takes the intercept and slope of the simple linear regression found with the lm(dist~speed) command named model1 and adds the line created from model1 to the graph. We can see that the data points are clustered fairly tightly around the line, showing that model1 is a good linear regression model for cars. Using these two commands also makes the graph look more appealing.

Section 3.2: The Least Squares Point Estimates

Section 3.2 covers the least squares point estimates and residuals.

Residuals

A residual is found by taking the observed value of a point and subtracting the value predicted for that point by the linear regression model.

As an example, let's find a residual for the dataset cars by looking at the point where a car in the 1920s is going 25 miles per hour.

cargoing25prediction<--17.579+3.932*25
cargoing25prediction
## [1] 80.721
cargoing25observed<-dist[speed==25]
cargoing25observed
## [1] 85
rescargoing25<-cargoing25observed-cargoing25prediction
rescargoing25
## [1] 4.279

The code called cargoing25prediction takes the car speed of 25 miles per hour and puts it into the linear regression model found in Section 3.1. As seen from the output, the model predicts that a car going 25 miles per hour in the 1920s would have a stopping distance of 80.721 feet. The code called cargoing25observed takes the speed of 25 miles per hour, finds the car matching that description in the dataset, and gives its stopping distance. The car in the 1920s that was going 25 miles per hour had a stopping distance of 85 feet.

This code shows that there is a difference between the values predicted by the linear regression model and the values observed in the dataset. This difference is the residual. As seen from the code called rescargoing25, the residual for the car going 25 miles per hour is 4.279 feet.
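
As a quick check, the same residual can be found with the commands predict(), covered in Section 3.3, and the observed value; this is only a sketch, and the answer differs very slightly from the hand calculation because R uses the unrounded coefficients.

pred25 <- predict(model1, data.frame(speed = 25))  # model's prediction at 25 mph
obs25  <- dist[speed == 25]                        # observed stopping distance, 85 feet
obs25 - pred25                                     # the residual, roughly 4.27 feet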

We can find all of the residuals of the response variable dist against the predictor variable speed by using the command residuals() on the linear regression model. We can also plot these residuals to get a clearer picture of their values. We want the residuals, which are what the plot shows, to be centered around 0, so we add a straight line at 0 with the command abline(0,0).

res<-residuals(model1)
plot(res, ylab = "Residuals",
     xlab = "The Index of Residuals", 
     main = "The Plot of Residuals")
abline(0,0)

As we can see from the plot, the variance of the residuals of stopping distance is mostly constant but increases slightly as the speed of the car increases. We can see this because in the dataset cars the observations are ordered by speed, so the higher the index, the faster the car was going. The observations numbered between 30 and 50 have a higher variance than those at slower speeds, numbered roughly 1-20. This is somewhat concerning, because it tells us that predictions of stopping distance at higher speeds will not be as accurate as those at lower speeds.
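
Since the index only stands in for speed, the reader may find it clearer to plot the residuals directly against the predictor variable; this is just an alternative view of the same residuals.

plot(res~speed, ylab = "Residuals",
     xlab = "The Speed of a Car in MPH",
     main = "Residuals Against Speed")
abline(0,0)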

We can check whether the residuals are normally distributed, which they should be, by making a histogram with the command hist().

hist(res)

The command hist(res) shows a histogram of the residuals, and from the looks of it the residuals are roughly normally distributed, but slightly skewed to the right.

We can also double-check whether the residuals are normally distributed by plotting the quantiles of the residuals against the quantiles of a normal distribution with the command qqnorm() and adding the reference line with the command qqline(). The more closely the data follow the line formed by qqline(), the more normally distributed they are.

qqnorm(res)
qqline(res)

These two commands show basically the same thing as the histogram. The line hits or comes very close to the majority of the points, with just a few outliers. Since the residuals follow the straight line very well, they are approximately normally distributed.

Least Square Point Estimate

The reader learned in Section 3.1 that the simple linear regression formula is y = B0 + B1*x + E. As shown in the section above, we can find the values of the intercept (B0) and slope (B1) of a simple linear regression model. The reader can also find the point estimates b0 and b1 of B0 and B1 by using n observed values of the predictor and the response variable. These are the least squares point estimates, which can be used to calculate the point prediction yihat = b0 + b1*xi.

The least squares point estimate of the slope, B1, is b1 = SSxy/SSxx, where SSxy = sum(xi*yi) - (sum(xi)*sum(yi))/n and SSxx = sum(xi^2) - (sum(xi))^2/n.

The least squares point estimate of the intercept, B0, is b0 = sum(yi)/n - b1*(sum(xi)/n).

With the dataset cars, in R we can also find these estimates.

n<-length(speed)
n
## [1] 50
length(dist)
## [1] 50
SSxy<-sum(speed*dist)-(sum(speed)*sum(dist))/n
SSxx<-sum(speed^2)-(sum(speed)^2)/n
b1<-SSxy/SSxx
b1
## [1] 3.932409

The command length(speed) finds the number of observations in the predictor variable speed, 50, which is the same as the number of observations in the response variable dist in the cars data, as the reader can check with the command length(dist). The next few lines were already explained above as the way to find the least squares point estimate of the slope, B1. The result of the least squares point estimate b1 was 3.9324, the same result that the reader found in Section 3.1, which shows that both are good ways of finding the estimate of the slope, but that using the command lm(response variable ~ predictor variable) is slightly faster.

b0<-(sum(dist))/n-b1*(sum(speed)/n)
b0
## [1] -17.57909

The code above is the least squares point estimate, b0, of the intercept, B0. Once again, the reader can tell that the intercept is the same as the one given by the slightly faster command lm(response variable ~ predictor variable). Now the reader can also put these two point estimates into the formula yihat <- b0 + b1*xi to create a simple linear regression model.

To tell how well this linear regression model fits the data, we use the residuals ei, found by the equation ei = yi - yihat. If b0 and b1 are good point estimates of B0 and B1, then the ei will be small; if not, the ei will be large.

Now we can use the ith residual, ei = yi - yihat, to tell whether b0 and b1 are good estimates of B0 and B1. The least squares point estimates are the values of b0 and b1 that give a sum of squared residuals, SSE, smaller than any other values of b0 and b1 would; in other words, the SSE is minimized by the least squares point estimates. If the reader is not sure that the values they found are the least squares point estimates, they can double-check with the formula SSE = sum((yi - (b0 + b1*xi))^2) or with the command anova().

SSE<-sum((dist-(b0+b1*speed))^2)
SSE
## [1] 11353.52
SSE1<-anova(model1)
SSE1
## Analysis of Variance Table
## 
## Response: dist
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## speed      1  21186 21185.5  89.567 1.49e-12 ***
## Residuals 48  11354   236.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see from the code above, the SSE is 11353.52. By also using the command anova() and looking under Sum Sq in the Residuals row, the reader can see that the value matches the other estimate, which confirms that our math is correct.
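
A third check, not shown in the textbook, is that squaring and summing the residuals from the residuals subsection above gives the same number:

sum(res^2)                # sum of squared residuals from the residuals subsection
sum(residuals(model1)^2)  # the same quantity straight from the fitted model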

Section 3.3: Point Estimate and Point Predictions

This section mostly focuses on how to find the value of the response variable at a certain pre-determined value of the predictor variable. We already learned how to find the simple linear regression model in Sections 3.1 and 3.2, so it is just a matter of plugging in the point of interest.

For the cars data, a good example would be finding the stopping distance of a car going 20 miles per hour in the 1920s. There are two ways of writing code for this in R. The first is the easier option of using the information from model1, which is model1<-lm(dist~speed) from Section 3.1: we take the formula it produced, multiply the speed slope by 20, and add the intercept.

model1
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
distance20<--17.579+20*3.932
distance20 
## [1] 61.061

The second way is slightly more involved: writing a command that creates new data with a speed of 20.

newdata1<-data.frame(speed=20)
predict(model1, newdata1)
##        1 
## 61.06908

The command predict() takes the information from model1, which used the command lm() to create a simple linear regression, plugs in the new data (a speed of 20 mph), and predicts what the stopping distance in feet would be. Comparing the two ways, the reader will notice that the answers differ slightly because of rounding, but otherwise they are the same and say that we predict a car going 20 mph in the 1920s would have a stopping distance of about 61 feet.
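
The predict() approach also scales to several speeds at once. The speeds below are chosen arbitrarily for illustration:

newdata_several <- data.frame(speed = c(10, 15, 20, 25))  # hypothetical speeds of interest
predict(model1, newdata_several)                          # one predicted stopping distance per speed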

Section 3.4: Model Assumptions and the Standard Error

This section covers what assumptions the reader makes when using linear regression and what the error in the simple linear regression is, which if you remember is the E in y=B0+B1*x+E.

Model Assumptions

There are 4 common assumptions when using linear regression and they are:

  1. At any value of x, the potential error has a mean equal to 0.

  2. At any value of x, the potential error has a constant variance.

  3. At any value of x, the potential error has a normal distribution.

  4. Any one value of E is independent of any other E.

Knowing these assumptions, and knowing that they are mostly satisfied (as shown in Section 3.2, where we saw that the residuals were roughly normal and the variance was mostly constant but increased slightly the faster the 1920s car was going), and knowing that SSE is the sum of squared residuals found in Section 3.2, we can find the mean square error, s^2.
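
Base R also has a built-in set of diagnostic plots that gives a quick visual check of the first three assumptions. This is an extra tool, not something the textbook requires:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 by 2 grid
plot(model1)           # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # reset the plotting layout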

Standard Error

The mean square error is the point estimate of the variance and is equal to SSE/(n-2). We divide by n-2 because it makes s^2 an unbiased point estimate of variance. The n-2 is the number of degrees of freedom associated with SSE. In R, using the cars data, this would look like this:

s2<-SSE/(n-2)
s2
## [1] 236.5317

As the reader can see from the output, the mean square error of the cars data is 236.5317.

Now using the mean square error, the reader can find the standard error, s, by taking the square root of the mean square error. The standard error is the point estimate of sigma. In R, using the cars data, this would look like this:

s<-sqrt(s2)
s
## [1] 15.37959

As the reader can see from the output, the standard error of the cars data is 15.37959 feet of stopping distance.

There are two slightly faster ways to find the residual standard error and that is by using the command summary() or the command aov().

summary(model1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
aov(model1)
## Call:
##    aov(formula = model1)
## 
## Terms:
##                    speed Residuals
## Sum of Squares  21185.46  11353.52
## Deg. of Freedom        1        48
## 
## Residual standard error: 15.37959
## Estimated effects may be unbalanced

As the reader can see, the residual standard error given by both of these commands for the cars data is 15.38 feet, the same as the one given by the calculation above, so we know that the standard error is correct. Both outputs also give the degrees of freedom for the error, which is n-2 = 50-2 = 48, since there were 50 observations of the cars' speed and stopping distance in the 1920s.
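
If the reader only needs the number itself, it can be pulled straight out of the summary; this is just a convenience, assuming model1 from Section 3.1:

summary(model1)$sigma   # the residual standard error, about 15.38 feet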

Section 3.5: Testing the Significance of the Slope and y-Intercept

Section 3.5 covered the significance of a slope and y-Intercept of a simple linear regression. This is important since if there was a slope of 0 between the predictor and the response variables in a simple linear regression then there is no significant relationship between them.

Using the data cars, with the null hypothesis H0: B1 = 0, which means that the slope is 0, and the alternative hypothesis HA: B1 is not equal to 0, which means that the slope is not 0, we can test the significance of the simple linear regression.

The intercept for the data cars is the stopping distance of a car in the 1920s that was going 0 miles per hour. Since a car going 0 miles per hour does not have a stopping distance, this point is not in the data cloud and we cannot interpret the intercept; even so, the intercept is still needed to define the line of the simple linear regression. The slope is the number of feet of stopping distance added for every increase of one mile per hour. The intercept and the slope relating the predictor variable, the speed of a car in the 1920s, to its stopping distance can be found by using the command summary(model1).

summary(model1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The summary(model1) command shows the reader that the intercept for the data cars is -17.5791 feet of stopping distance for a car in the 1920s going 0 mph, which is impossible. The summary also shows that the slope for cars is an extra 3.9324 feet of stopping distance for every mph the car was going. The summary also shows the significance of speed in relation to the stopping distance of a 1920s car. This is found under Coefficients in the row for speed, and it is the p value 1.49*10^-12. This p value is so small, less than .05, that we reject H0 in favor of HA. This means that the slope is not 0 (not a flat line) and that there is a significant relationship between the speed of a car in the 1920s and its stopping distance. This topic is further discussed in Section 3.8.

Section 3.6: Confidence and Prediction Intervals

Section 3.6 covered confidence intervals, prediction intervals, and the "distance value". The "distance value" for a particular x0 is 1/n + (x0 - xbar)^2/SSxx, where n is the sample size, xbar is the mean of the x values, and SSxx = sum(xi^2) - (sum(xi))^2/n. This is important because the distance value is used to calculate both the confidence interval and the prediction interval. With s = sqrt(MSE), where MSE is the mean square error, the confidence interval is yhat +- t(a/2, n-2) * s * sqrt(distance value) and the prediction interval is yhat +- t(a/2, n-2) * s * sqrt(1 + distance value), which shows that the prediction interval is wider than the confidence interval.

Confidence Interval

The confidence interval is for the mean value of y when x=x0. The confidence interval's variation for a mean is smaller than the variation for a point, which could be found with a prediction interval. The variance of a mean is the variance over the sample size, var(xbar) = sigma^2/n, while the variance of a single point is just sigma^2. Even though the variances differ in size, both intervals are still centered at yhat = b0 + b1*x0.

The reader can find the confidence interval for cars by using the command confint(model1).

confint(model1)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

This finds the 95% confidence interval for the intercept of a 1920s car's stopping distance, which is the stopping distance when the car's speed is 0 mph. This interval is between -31.17 and -3.99 feet, both of which are impossible. The command also finds the 95% confidence interval for the slope of a 1920s car's stopping distance against its speed, which is between 3.096964 and 4.767853 feet for each mile per hour the car was going.

This means that we are 95% confident that a car's stopping distance increases by between 3.096964 and 4.767853 feet for each additional mile per hour of speed; in other words, we are 95% confident that the true slope is within this interval.

Intervals Using New Data

The reader can also find the prediction and confidence intervals for the cars data by creating new data and then using the predict() command to find the intervals for that new data.

newdata1<-data.frame(speed=40)
prediction_interval<-predict(model1,newdata1,interval="predict")
prediction_interval
##        fit      lwr      upr
## 1 139.7173 102.3311 177.1034
confidence_interval<-predict(model1,newdata1,interval="confidence")
confidence_interval
##        fit      lwr      upr
## 1 139.7173 118.7052 160.7293

The new data in the command data.frame(speed=40) was created to find the prediction and confidence intervals for the stopping distance of a car in the 1920s going 40 miles per hour. According to the results of the two predict commands above, the 95% prediction interval says that such a car would stop between 102.33 feet and 177.10 feet, and the 95% confidence interval says that the mean stopping distance at that speed is between 118.71 feet and 160.73 feet.

Using the results from the confidence interval, we are 95% confident that the mean stopping distance of a car going 40 mph in the 1920s is within the interval from 118.71 feet to 160.73 feet, and 95% confident that intervals made in this fashion will contain the true mean. Using the prediction interval, we are 95% confident that an individual car going 40 mph in the 1920s would have a stopping distance between 102.33 feet and 177.10 feet. As the reader can see, the confidence interval is narrower than the prediction interval.
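
As a check, the reader can rebuild both intervals by hand from the formulas at the start of this section. The sketch below reuses n, SSxx, b0, and b1 from Section 3.2 and s from Section 3.4, and it should match the predict() output up to rounding.

x0 <- 40                                           # the speed of interest
distance_value <- 1/n + (x0 - mean(speed))^2/SSxx  # the "distance value" for x0
yhat0 <- b0 + b1*x0                                # point prediction at 40 mph
tval <- qt(0.975, n - 2)                           # t value for a 95% interval
yhat0 + c(-1, 1)*tval*s*sqrt(distance_value)       # 95% confidence interval
yhat0 + c(-1, 1)*tval*s*sqrt(1 + distance_value)   # 95% prediction interval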

The reader can make sure the confidence interval is smaller by checking their widths by using the following code.

confidence_interval%*%c(0,-1,1)
##      [,1]
## 1 42.0241
prediction_interval%*%c(0,-1,1)
##       [,1]
## 1 74.77223

The width of the 95% confidence interval comes out to 42.0241 feet, and the 95% prediction interval is much wider with a width of 74.77223 feet. This is the same as taking the upper bound of an interval minus its lower bound: for example, the upper bound of the 95% prediction interval, 177.1034 feet, minus the lower bound, 102.33 feet, gives a width of about 74.77 feet.

It is also important to make sure that both of the intervals are centered at the same place. The reader can do this by seeing if they are equal to each other.

confidence_interval[1]==prediction_interval[1]
## [1] TRUE

The result was true, which means that they are centered at the same point and that the intervals make sense.

Section 3.7: Simple Coefficients of Determination and Correlation

This section talks about the simple coefficient of determination in a simple linear regression and the simple correlation coefficient of a simple linear regression.

Simple Coefficient of Determination

The simple coefficient of determination, often referred to as r^2, is a measure of how useful a simple linear regression model is, such as the model formed from lm(dist~speed). It is calculated using the total variation, the unexplained variation, and the explained variation. Total variation is the sum of squared prediction errors we would get if we did not use the predictor variable x; it is found by taking the sum of (yi - ybar)^2. Unexplained variation, also called SSE, is the amount of variation in y that is not explained by the predictor variable x. Explained variation is the amount of variation in y that is explained by the predictor variable x.

Total variation = explained variation + unexplained variation. The simple coefficient of determination is the proportion of the total variation in the n observed values of y that is explained by the simple linear regression model; in simple terms, r^2 = explained variation/total variation. The command to find this in R using the dataset cars is below.

summary(model1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The command is one that the reader has used quite often in this R guide, and it gives a summary of the simple linear regression model found using lm(dist~speed). The r^2 is located in this summary under Multiple R-squared and has the value 0.6511. This means that the predictor variable speed explains 65.11% of the variation in stopping distance, which is not excellent but is not bad either. It also means that there are other variables that could affect the stopping distance of a car in the 1920s, such as road conditions, tires, and the model of the car.
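
To see the definition at work, the reader can compute the three variations by hand; this sketch assumes model1 from Section 3.1 and should reproduce the 0.6511 from the summary.

total_variation <- sum((dist - mean(dist))^2)            # variation in dist ignoring speed
unexplained_variation <- sum((dist - fitted(model1))^2)  # the SSE
explained_variation <- total_variation - unexplained_variation
explained_variation/total_variation                      # about 0.6511, matching Multiple R-squared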

Simple Correlation Coefficient

The simple correlation coefficient of a simple linear regression tells whether the predictor and response variables are positively correlated (upward sloping) or negatively correlated (downward sloping). It also measures the strength of the linear relationship between the two by how close it is to -1 or +1; the closer it is to either, the stronger the linear relationship. The magnitude of the simple correlation coefficient is the square root of r^2, and its sign matches the sign of the slope. Let's use the dataset cars to explain.

r2<-summary(model1)$r.squared
r<-sqrt(abs(r2))
r
## [1] 0.8068949
cor(dist,speed)
## [1] 0.8068949

The command summary(model1)$r.squared pulls out only the r^2 value from the summary, which the reader can see is the same as what we saw previously. The command sqrt(abs(r2)) finds the square root of the absolute value of r^2. Taking the absolute value is only a safeguard; r^2 itself can never be negative, and the sign of r comes from the sign of the slope.

Since the slope is positive we know that r is positive too, which means that the model is positively correlated; this is also shown in the graph from Section 3.1 of this R guide. The result for r was 0.8068949, which means that the speed of a car in the 1920s is strongly correlated with the stopping distance of the car. The command cor(dist,speed) does the same thing as the previous formulas and confirms that the correlation, r, is 0.8068949.

Section 3.8: An F-Test for the Model

The F-test tests the significance of the simple linear regression model, with the null hypothesis H0: B1 = 0, which means that the relationship between the predictor variable and the response variable is not significant, and the alternative hypothesis HA: B1 is not equal to 0, which means that the relationship between the predictor variable and the response variable is significant.

Using the dataset cars, let's say that we want to test the significance of the relationship between the response variable dist, the stopping distance, and the predictor variable speed, the speed of a car in the 1920s. We can do this by using the very useful command summary().

summary(model1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

This code gives an F-statistic of 89.57 and a p-value of 1.49e-12. Since we test at the 5% significance level and the p-value is less than 5%, the reader can reject H0 in favor of HA and conclude that there is a significant relationship between the speed and stopping distance of cars in the 1920s.

There are two tests that we can use with the data that came from the summary(model1) command and they are the F-Test and the T-Test. 

F-Test

The results of the F-Test can be found at the bottom of the output given by summary(model1). It shows that the F-statistic is 89.57 on 1 and 48 degrees of freedom, which makes sense because the error degrees of freedom are n-2 and n equals 50 observations in this dataset. The p value for the F-Test is 1.49*10^-12, which is so small (smaller than .05) that we reject H0 in favor of HA. This means that the slope is not 0 (not a flat line) and that there is a significant relationship between the speed of a car in the 1920s and its stopping distance.
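
The F-statistic itself can be rebuilt from the sums of squares in the anova() table from Section 3.2; this is only a sketch to show where the 89.57 comes from.

MSR <- anova(model1)["speed", "Sum Sq"]/1  # explained variation divided by its 1 degree of freedom
MSE <- SSE/(n - 2)                         # mean square error from Section 3.4
MSR/MSE                                    # the F-statistic, about 89.57
pf(MSR/MSE, 1, n - 2, lower.tail = FALSE)  # its p value, about 1.49e-12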

T-Test

The T-Test is slightly different and is found under the Coefficients section of the summary(model1) output. In that section, the test statistic is listed under t value in the row for speed, since that is the predictor variable we are testing. The test statistic is 9.464 and the p value next to it is 1.49*10^-12. Since this p value is also far below .05, we reject H0 in favor of HA. So the slope is not 0 and there is a significant relationship between the speed of a car in the 1920s and its stopping distance; in simple terms, the stopping distance of a car in the 1920s did depend on its speed.
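
The t statistic can be checked in the same spirit using the rounded numbers from the Coefficients table; this is only a back-of-the-envelope sketch.

t_stat <- 3.9324/0.4155          # slope estimate divided by its standard error
t_stat                           # about 9.46, matching the t value in the summary
2*pt(-abs(t_stat), df = n - 2)   # two-sided p value, about 1.5e-12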

Conclusion

This concludes the Chapter 3 R guide; hopefully the reader came away knowing more about simple linear regression models and how to use them. As a reminder, this guide covered how to make a plot, residuals, predictions, confidence intervals, errors, correlation, coefficients, and F-tests.