Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. Linear regression models are a key part of the family of supervised learning models. In particular, linear regression models are a useful tool for predicting a quantitative response.
library(datasets)
data(cars)
head(cars, n=10)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
The cars dataset records the speed and stopping distance of cars. It is a data frame with 50 rows and 2 variables. Each row is one observation of a car, and the variables are speed (numeric, in mph) and dist (numeric stopping distance, in feet).
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
The structure of the cars dataset shows 50 observations (rows) and 2 variables (columns). The str function also reports the data type of each variable; both are numeric here.
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
The summary function prints basic summary statistics for each variable. As the output above shows, speed ranges from 4 mph to 25 mph and dist ranges from 2 feet to 120 feet.
table(is.na(cars))
##
## FALSE
## 100
There are no missing values in the cars dataset. The is.na function tests every value and returns TRUE where the value is missing (NA) and FALSE otherwise; here all 100 values (50 rows by 2 columns) are FALSE.
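The same check can be written more directly. This is a minimal sketch using base R only; the variable names has_missing and missing_per_column are illustrative and not part of the original analysis.

```r
# anyNA() gives a single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(cars)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the missing values in each column.
missing_per_column <- colSums(is.na(cars))

print(has_missing)
print(missing_per_column)
```

A per-column count is often more useful than a single table, because it shows which variable would need cleaning.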
plot(cars, col='blue', pch=20, cex=2, main="Relationship between Speed and Stopping Distance for 50 Cars",
xlab="Speed in mph", ylab="Stopping Distance in feet")
From the plot above, we can see a fairly strong relationship between a car’s speed and the distance it needs to stop: the faster the car goes, the longer the distance it takes to come to a stop.
cor(cars$speed, cars$dist)
## [1] 0.8068949
The correlation coefficient takes values between -1 and +1. A value close to +1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative linear relationship, and a value close to 0 indicates a weak linear relationship.
The correlation here is 0.81, which indicates a fairly strong positive correlation between speed and the distance required to stop.
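As a rough check on the value above, the Pearson correlation can be recomputed from its definition. This is a minimal sketch; the names x, y and r_manual are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Pearson correlation: sum of cross-deviations divided by the square root
# of the product of the two sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(r_manual)
# Compare against the built-in function.
print(all.equal(r_manual, cor(x, y)))
```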
mod1 = lm(formula = dist ~ speed, data = cars)
mod1
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
The intercept, in our example, is the expected stopping distance when the speed is 0 mph. Here it is -17.579 feet.
Taken literally, this would mean that a car going 0 mph takes -17.58 feet to stop. A negative stopping distance is impossible, and 0 mph lies outside the observed speeds (4 to 25 mph), so the y-intercept has no meaningful interpretation in this instance.
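To see where the two coefficients come from, the least-squares estimates for a single predictor can be computed by hand. This is a sketch under the usual simple-regression formulas; the names slope_manual, intercept_manual and fit are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Least-squares slope: covariance of x and y divided by the variance of x.
slope_manual <- cov(x, y) / var(x)

# The fitted line passes through the point (mean(x), mean(y)).
intercept_manual <- mean(y) - slope_manual * mean(x)

# Compare with the coefficients estimated by lm().
fit <- lm(dist ~ speed, data = cars)
print(c(intercept = intercept_manual, slope = slope_manual))
print(coef(fit))
```

Because the line passes through the point of means, the intercept is simply wherever the line happens to cross speed = 0, which is why it can land on an impossible value.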
summary(mod1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Residuals are essentially the difference between the actual observed response values (distance to stop dist in our case) and the response values that the model predicted.
When assessing how well the model fits the data, look for residuals that are distributed symmetrically around zero. In our example, the distribution of the residuals does not appear to be strongly symmetrical: the median is -2.272, and the largest positive residual (43.201) is farther from zero than the largest negative one (-29.069). That means that the model predicts certain points that fall far away from the actual observed points.
Theoretically, in simple linear regression, the coefficients are two unknown constants that represent the intercept and slope terms in the linear model. If we wanted to predict the distance required for a car to stop given its speed, we would take a training set, produce estimates of the coefficients, and then use them in the model formula.
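The residuals can be extracted and inspected directly. The sketch below refits the same model under the illustrative name fit and checks that the residuals equal observed minus fitted values.

```r
fit <- lm(dist ~ speed, data = cars)

# Residuals as stored by lm().
res <- residuals(fit)

# The same quantity computed by hand: observed minus fitted.
res_manual <- cars$dist - fitted(fit)
print(all.equal(unname(res), unname(res_manual)))

# Five-number summary, comparable to the Residuals block of summary().
print(quantile(res))

# With an intercept in the model, least-squares residuals average to zero.
print(mean(res))
```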
The first row in the Coefficients is the intercept: the expected stopping distance at a speed of 0 mph, estimated at -17.58 feet. A negative stopping distance is impossible, so the intercept cannot be interpreted literally in this instance.
The second row in the Coefficients is the slope, or in our example, the effect speed has on the distance required for a car to stop. The slope term in our model says that for every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324088 feet, on average.
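The slope interpretation can be illustrated by predicting stopping distances at a few speeds. This is a minimal sketch; the speeds 10, 15 and 20 mph and the names new_speeds, pred and pred_manual are chosen only for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) inside the observed range of 4 to 25 mph.
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predictions from the fitted model.
pred <- predict(fit, newdata = new_speeds)

# The same predictions from the model formula: intercept + slope * speed.
pred_manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

print(cbind(new_speeds, predicted_dist = pred))

# Each 5 mph step adds 5 times the slope to the predicted distance.
print(diff(pred))
```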
The coefficient Standard Error estimates how much a coefficient estimate would vary if we ran the model again and again on new samples. We want it to be small relative to the coefficient itself. In our example, the slope estimate of 3.9324 feet per mph has a standard error of 0.4155, so across repeated samples the estimated slope would typically vary by about 0.4155 feet per mph.
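One common use of the standard error is a confidence interval for the slope. The sketch below assumes the usual t-based 95% interval and compares it with confint(); the names est, se, t_crit and ci_manual are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

# Critical value of the t distribution with the model's residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit))

# 95% interval: estimate plus or minus critical value times standard error.
ci_manual <- c(lower = est - t_crit * se, upper = est + t_crit * se)

print(ci_manual)
print(confint(fit, "speed", level = 0.95))
```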
The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this would indicate we could reject the null hypothesis that the coefficient is zero; for the slope, that means we could declare that a relationship between speed and distance exists. In our example, the t-value for speed is 9.464, which is far from zero and indicates that a relationship exists.
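The t-value is simply the estimate divided by its standard error, which can be checked against the model output. This is a minimal sketch; the names coefs and t_manual are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(fit))

# t value = estimate / standard error, computed for both rows at once.
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

print(t_manual)
print(coefs[, "t value"])
```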
The Pr(>|t|) column in the model output is the probability of observing a t-value at least as large in absolute value as the one computed, assuming the true coefficient is zero. A small p-value indicates that the observed relationship between the predictor (speed) and response (dist) variables is unlikely to be due to chance. Typically, a p-value of 5% or less is used as a cut-off point. In our example, the p-value for speed is 1.49e-12, which is very close to zero, and the p-value for the intercept is 0.0123. Note the ‘Signif. codes’ associated with each estimate: three stars (or asterisks) mark a highly significant p-value (below 0.001), and one star marks a p-value between 0.01 and 0.05.
Consequently, the small p-value for the slope means that we can reject the null hypothesis of no relationship, which allows us to conclude that there is a relationship between speed and distance.
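The p-values in the table can be reproduced from the t-values and the t distribution. The sketch below assumes the standard two-sided test with the model's residual degrees of freedom; the names t_vals and p_manual are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))
t_vals <- coefs[, "t value"]

# Two-sided p-value: twice the upper-tail probability of |t|.
p_manual <- 2 * pt(abs(t_vals), df = df.residual(fit), lower.tail = FALSE)

print(p_manual)
print(coefs[, "Pr(>|t|)"])
```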
Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Because of this error term, we cannot perfectly predict our response variable (dist) from the predictor (speed).
The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line. In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.38 feet, on average.
The degrees of freedom are the number of data points minus the number of estimated parameters. Here, 50 observations minus 2 coefficients (the intercept and the slope) gives 48 degrees of freedom.
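Both the Residual Standard Error and its degrees of freedom can be recomputed from the residuals. This is a minimal sketch; the names n, p, df_manual, rss and rse_manual are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)          # number of observations
p <- length(coef(fit))   # number of estimated coefficients (intercept and slope)

# Residual degrees of freedom: observations minus estimated coefficients.
df_manual <- n - p

# Residual sum of squares, then the residual standard error.
rss <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / df_manual)

print(df_manual)
print(rse_manual)
# Value reported by summary() for comparison.
print(summary(fit)$sigma)
```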
The R-squared (R2) statistic provides a measure of how well the model is fitting the actual data. It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable).
In our example, the R2 is 0.65: roughly 65% of the variance in the response variable (dist) can be explained by the predictor variable (speed).
A side note: in multiple regression settings, the R2 never decreases as more variables are added to the model, even when they contribute little. That’s why the adjusted R2 is the preferred measure there, as it adjusts for the number of variables considered.
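Both statistics can be computed from sums of squares. The sketch below uses the usual definitions, with k standing for the number of predictors (1 here); all variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variance in dist explained by the model.
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors k.
n <- nrow(cars)
k <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - k - 1)

print(c(r2 = r2_manual, adj_r2 = adj_r2_manual))

# With a single predictor, R-squared equals the squared correlation.
print(all.equal(r2_manual, cor(cars$speed, cars$dist)^2))
```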
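The F-statistic is a good indicator of whether there is a relationship between the predictor and the response variable. The further the F-statistic is above 1, the stronger the evidence; how much larger than 1 it needs to be depends on the number of data points and predictors.
In our example, it is 89.57, far larger than 1, with a p-value of 1.49e-12, which indicates there is a relationship between the predictor and the response variable.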
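The F-statistic and its p-value can be pulled out of the model summary. The sketch below also checks a property of simple regression, namely that F equals the squared t-value of the slope; the names fstat, f_p_value and t_speed are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Named vector with elements "value", "numdf" and "dendf".
fstat <- summary(fit)$fstatistic
print(fstat)

# Upper-tail probability of the F distribution gives the overall p-value.
f_p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
                lower.tail = FALSE)
print(f_p_value)

# With one predictor, F is the square of the slope's t-value.
t_speed <- coef(summary(fit))["speed", "t value"]
print(all.equal(fstat[["value"]], t_speed^2))
```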
plot(cars, col='blue', pch=20, cex=2, main="Relationship between Speed and Stopping Distance for 50 Cars",
xlab="Speed in mph", ylab="Stopping Distance in feet")
abline(mod1, col = "red")
Overall, speed and the distance required to stop the car are positively correlated, and the relationship between the 2 variables is approximately linear.