You may be familiar with modeling the equation of a line as:
\[ y = mx + b \]
where \(y\) is our dependent variable, \(x\) is our independent variable, \(m\) is the slope of the line, and \(b\) represents the y-intercept (the value of \(y\) when \(x = 0\)).
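Linear regression takes the same form and can be modeled by the equation below:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where \(\beta_0\) is the intercept (playing the role of \(b\)), \(\beta_1\) is the slope (playing the role of \(m\)), and \(\varepsilon\) is an error term for the variation the line does not explain.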
We can use linear regression when we want to predict a numerical outcome variable from one or more other variables, which we call predictor variables.
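With more than one predictor, the same idea is usually written with one coefficient per predictor. As a sketch in conventional notation (the symbols \(\beta_j\), \(x_j\), \(p\), and \(\varepsilon\) are standard textbook choices, not taken from the original text):

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \]

Here \(\beta_0\) is the intercept, each \(\beta_j\) is the change in \(y\) associated with a one-unit increase in \(x_j\) while the other predictors are held fixed, and \(\varepsilon\) is the error term. With a single predictor (\(p = 1\)) this reduces to the equation of a line.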
Let’s take a look at an example from the ‘mtcars’ dataset.
# Load the necessary libraries
library(tidyverse)
library(caret)
# Load the mtcars dataset
data("mtcars")
# Summary statistics
summary(mtcars)
mpg cyl disp hp drat wt qsec
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Median :17.71
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
vs am gear carb
Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
# Convert relevant columns to factors
mtcars <- mtcars %>%
  mutate(cyl = as.factor(cyl),
         gear = as.factor(gear),
         vs = as.factor(vs),
         am = as.factor(am),
         carb = as.factor(carb))
# Rename columns for better readability
mtcars <- mtcars %>%
  rename(MilesPerGallon = mpg,
         Cylinders = cyl,
         Displacement = disp,
         HorsePower = hp,
         RearAxleRatio = drat,
         Weight = wt,
         QuarterMileTime = qsec,
         VEngine = vs,
         Transmission = am,
         Gears = gear,
         Carburetors = carb)
Let’s visualize some of the data.
# Boxplot of MilesPerGallon by Cylinders
ggplot(mtcars, aes(x = Cylinders, y = MilesPerGallon)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Miles Per Gallon by Number of Cylinders")
# Scatter plot of Weight vs. MilesPerGallon
ggplot(mtcars, aes(x = Weight, y = MilesPerGallon)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Weight vs. Miles Per Gallon")
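The strength of the straight-line trend in the scatter plot can also be summarised with a single number. The sketch below is one way to do that with the Pearson correlation; it reads the untouched built-in data through `datasets::mtcars` (so it uses the original column names `wt` and `mpg`) and stores it in a hypothetical variable `cars_raw`, so the renamed `mtcars` object is left alone.

```r
# Work on a fresh copy of the built-in data so the renamed `mtcars`
# data frame used in this tutorial is not overwritten.
cars_raw <- datasets::mtcars

# Pearson correlation between weight (in 1,000 lbs) and miles per gallon
r <- cor(cars_raw$wt, cars_raw$mpg)

round(r, 2)    # about -0.87: a strong negative linear relationship
round(r^2, 2)  # about 0.75: squared correlation for the full dataset
```

A value close to -1 agrees with the downward-sloping red line in the plot: heavier cars tend to get fewer miles per gallon.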
It appears that both the weight of a car and the number of cylinders it has are related to its miles per gallon. Let’s see if we can use one of these variables for our model.
Before we create a linear regression model, however, we need to split our data into “train” and “test” sets. We will build the model using about 80% of our data and then “test” it with the remaining 20%, which the model has not seen yet. This helps us determine how good our model is at making predictions on data it has not seen before.
# Split the data into training and testing sets
set.seed(123) # Fixing the seed makes the random split reproducible for anyone who runs this
trainIndex <- createDataPartition(mtcars$MilesPerGallon, p = 0.8,
                                  list = FALSE,
                                  times = 1)
mtcarsTrain <- mtcars[ trainIndex,]
mtcarsTest <- mtcars[-trainIndex,]
We can now build a simple linear regression model that tries to predict miles per gallon from weight alone. We use the lm() function in R to do this.
# Linear regression model to predict MilesPerGallon
model <- lm(MilesPerGallon ~ Weight, data = mtcarsTrain)
summary(model)
Call:
lm(formula = MilesPerGallon ~ Weight, data = mtcarsTrain)
Residuals:
Min 1Q Median 3Q Max
-3.890 -2.163 -0.091 1.361 7.140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.2505 1.7925 20.223 < 2e-16 ***
Weight -4.9957 0.5249 -9.516 5.89e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.75 on 26 degrees of freedom
Multiple R-squared: 0.7769, Adjusted R-squared: 0.7684
F-statistic: 90.56 on 1 and 26 DF, p-value: 5.889e-10
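Whoa! This is a lot of information. Fortunately, the two things we most need to look at are the Multiple R-squared and the p-values.
Our Multiple R-squared value is 0.7769. This means that the weight of a car accounts for about 78% of the variation in miles per gallon in the training data!
Furthermore, since the p-value for Weight (5.89e-10) is far below 0.05, we can conclude that weight is a statistically significant predictor of miles per gallon. The negative coefficient (-4.9957) tells us that each additional 1,000 lbs of weight is associated with roughly 5 fewer miles per gallon.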
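To connect the coefficient table back to the equation of a line, here is a small worked example. It plugs the printed estimates into intercept + slope × weight for a hypothetical car weighing 3,000 lbs (the car and the variable names `intercept`, `slope`, and `weight` are illustrative assumptions; the two numbers are copied from the output above).

```r
# Coefficients copied from the summary(model) output
intercept <- 36.2505   # predicted MPG at Weight = 0 (an extrapolation)
slope     <- -4.9957   # change in MPG per additional 1,000 lbs

# Hypothetical car: 3,000 lbs, i.e. Weight = 3.0 (Weight is in 1,000 lbs)
weight <- 3.0

predicted_mpg <- intercept + slope * weight
predicted_mpg  # 21.2634
```

With the fitted object, `predict(model, newdata = data.frame(Weight = 3.0))` should agree with this value up to the rounding of the printed coefficients.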
Let’s see if we can improve our model by accounting for an additional variable: we will use the weight AND the number of cylinders together to predict miles per gallon.
model <- lm(MilesPerGallon ~ Weight + Cylinders, data = mtcarsTrain)
summary(model)
Call:
lm(formula = MilesPerGallon ~ Weight + Cylinders, data = mtcarsTrain)
Residuals:
Min 1Q Median 3Q Max
-4.0100 -1.1761 -0.4717 1.2757 6.0409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.4080 1.8921 17.657 2.97e-15 ***
Weight -3.2040 0.7448 -4.302 0.000245 ***
Cylinders6 -3.7579 1.3745 -2.734 0.011564 *
Cylinders8 -5.1500 1.6681 -3.087 0.005038 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.38 on 24 degrees of freedom
Multiple R-squared: 0.8459, Adjusted R-squared: 0.8266
F-statistic: 43.91 on 3 and 24 DF, p-value: 6.699e-10
It appears that this information has improved our model! Our R-squared value (we now use the “Adjusted” value since we are using more than one predictor) is 0.8266, up from 0.7684 for the weight-only model. This means that weight and cylinders together account for about 83% of the variation in miles per gallon. The Pr(>|t|) value for every term is below 0.05 (marked with *, **, or ***), which shows that these are all statistically significant. Because Cylinders is a factor, the Cylinders6 and Cylinders8 estimates are differences relative to the baseline group of 4-cylinder cars.
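Factor coefficients can be harder to read than a single slope, so here is a sketch of how the printed estimates combine into a prediction. The helper `predict_mpg()` and the vector `coefs` are hypothetical names introduced for illustration; the numbers are copied from the output above, and 4 cylinders is the baseline level that R absorbs into the intercept.

```r
# Estimates copied from the summary(model) output
coefs <- c(intercept = 33.4080, weight = -3.2040,
           cyl6 = -3.7579, cyl8 = -5.1500)

# Hypothetical helper: combine the estimates by hand
predict_mpg <- function(weight, cylinders) {
  # 4 cylinders is the baseline, so it adds nothing to the intercept
  cyl_effect <- switch(as.character(cylinders),
                       "4" = 0,
                       "6" = coefs[["cyl6"]],
                       "8" = coefs[["cyl8"]],
                       stop("cylinders must be 4, 6, or 8"))
  coefs[["intercept"]] + coefs[["weight"]] * weight + cyl_effect
}

# Three hypothetical cars of equal weight (3,000 lbs)
predict_mpg(3.0, 4)  # 23.796
predict_mpg(3.0, 6)  # 20.0381
predict_mpg(3.0, 8)  # 18.646
```

At the same weight, the model predicts about 3.8 fewer miles per gallon for a 6-cylinder car and about 5.2 fewer for an 8-cylinder car than for a 4-cylinder car.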
We can now test our model on our “test” data and look at the accuracy.
mtcarsTest <- mtcarsTest %>%
  mutate(PredictedMPG = predict(model, newdata = mtcarsTest))
# Predicted vs Actual plot
ggplot(mtcarsTest, aes(x = MilesPerGallon, y = PredictedMPG)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Predicted vs Actual Miles Per Gallon")
# Create a summary table
summary_table <- mtcarsTest %>%
  select(MilesPerGallon, PredictedMPG) %>%
  summary()
print(summary_table)
MilesPerGallon PredictedMPG
Min. :14.30 Min. :16.82
1st Qu.:15.43 1st Qu.:17.78
Median :18.40 Median :19.27
Mean :21.25 Mean :20.72
3rd Qu.:24.23 3rd Qu.:22.21
Max. :33.90 Max. :27.53
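The summary table compares the overall spread of actual and predicted values, but it does not show how far off each individual prediction is. One common way to put a number on that is the root mean squared error (RMSE) or the mean absolute error (MAE). The sketch below is a self-contained illustration, not the tutorial's own evaluation: `rmse()` and `mae()` are hypothetical helpers, and the split is a plain random split from base R on `datasets::mtcars`, so its rows and results will differ from the createDataPartition() split used above.

```r
# Hypothetical helpers (not part of the original tutorial)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Fresh copy of the built-in data with cylinders as a factor,
# leaving the tutorial's renamed `mtcars` object untouched
cars_raw <- transform(datasets::mtcars, cyl = factor(cyl))

# Simple (non-stratified) 80/20 random split
set.seed(123)
train_rows <- sample(nrow(cars_raw), size = floor(0.8 * nrow(cars_raw)))
train_set  <- cars_raw[train_rows, ]
test_set   <- cars_raw[-train_rows, ]

# Same model form as above: weight plus cylinders
fit  <- lm(mpg ~ wt + cyl, data = train_set)
pred <- predict(fit, newdata = test_set)

# Typical size of the prediction error, in miles per gallon
test_rmse <- rmse(test_set$mpg, pred)
test_mae  <- mae(test_set$mpg, pred)
cat("RMSE:", round(test_rmse, 2), " MAE:", round(test_mae, 2), "\n")
```

With the tutorial's own objects, the equivalent call would be `rmse(mtcarsTest$MilesPerGallon, mtcarsTest$PredictedMPG)`. Both metrics are in miles per gallon, and smaller values mean the predictions sit closer to the red line in the plot.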