Title: Predicting Fuel Efficiency (MPG) in Cars: A Linear Regression
Analysis
Summary of Assignment: This assignment explores the application of linear
regression using the ‘mtcars’ dataset in R. The dataset contains
information about various car models, including their fuel efficiency
(measured in miles per gallon, MPG) and ten other attributes.
Linear regression models are fitted to predict “MPG” from selected
independent variables in the dataset. The goal is to determine which
independent variables have the most significant impact on predicting
MPG.
Load Libraries
library(readr)
library(caTools)
library(ggplot2)
Setting working directory
setwd("C:/Users/Nithi/OneDrive/Desktop/Data Mining")
Clear previous graphs
graphics.off()
Load the mtcars dataset
dataset <- mtcars
summary(dataset)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
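Before choosing predictors, it can help to see how strongly each attribute is linearly related to mpg. The following is a minimal sketch using base R's cor(); the names mpg_cor and ranked are illustrative and not part of the original analysis.

```r
# Pearson correlation of every column in mtcars with mpg
mpg_cor <- cor(mtcars)[, "mpg"]

# Drop mpg's correlation with itself, then rank by absolute strength
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]

print(round(ranked, 3))
```

These are pairwise correlations only, so they do not show how a variable behaves once other predictors are in the model.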
Split the dataset into training and testing sets
set.seed(123)
split <- sample.split(dataset$mpg, SplitRatio = 7/10)
training_data <- subset(dataset, split == TRUE)
test_data <- subset(dataset, split == FALSE)
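sample.split() returns a logical vector that marks each row as training (TRUE) or test (FALSE). For comparison, here is a rough base-R sketch of a 70/30 split by plain row sampling. It is an assumption-laden alternative, not the method used above, and it will generally select different rows than caTools does. The names train_idx, train_base and test_base are hypothetical.

```r
# Base-R sketch of a 70/30 random split
# (assumption: plain row sampling, not caTools::sample.split)
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_base <- mtcars[train_idx, ]   # rows drawn for training
test_base  <- mtcars[-train_idx, ]  # all remaining rows

# Every row lands in exactly one of the two sets
cat("training rows:", nrow(train_base), " test rows:", nrow(test_base), "\n")
```

With 32 cars this gives 22 training rows and 10 test rows. That matches the 20 residual degrees of freedom reported for the training-set model later in this report (22 rows minus 2 estimated coefficients).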
Fitting Simple Linear Regression on a single independent variable,
“Gear”
lin_reg <- lm(mpg ~ gear, data = mtcars)
summary(lin_reg)
##
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.240  -2.793  -0.205   2.126  12.583
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.623      4.916   1.144   0.2618
## gear           3.923      1.308   2.999   0.0054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.374 on 30 degrees of freedom
## Multiple R-squared: 0.2307, Adjusted R-squared: 0.205
## F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401
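The gear coefficient above reads as the estimated change in mpg for each additional forward gear. The sketch below refits the same model and pulls out the estimates, a 95% confidence interval and R-squared directly, rather than reading them off the printed summary. The name fit_gear is illustrative.

```r
# Refit the single-predictor model on the full dataset
fit_gear <- lm(mpg ~ gear, data = mtcars)

# Point estimates for the intercept and slope
print(coef(fit_gear))

# 95% confidence intervals for both coefficients
print(confint(fit_gear, level = 0.95))

# Share of the variance in mpg explained by gear alone
print(summary(fit_gear)$r.squared)
```

An R-squared of about 0.23 means gear alone leaves most of the variation in mpg unexplained.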
Fitting Multiple Linear Regression on multiple independent variables,
“Cylinders” and “Gear”
lin_reg1 <- lm(mpg ~ cyl + gear, data = mtcars)
summary(lin_reg1)
##
## Call:
## lm(formula = mpg ~ cyl + gear, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8949 -1.9353 -0.0725  1.3648  7.6051
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  34.6595     4.9369   7.020 1.01e-07 ***
## cyl          -2.7431     0.3735  -7.344 4.32e-08 ***
## gear          0.6519     0.9041   0.721    0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 29 degrees of freedom
## Multiple R-squared: 0.731, Adjusted R-squared: 0.7125
## F-statistic: 39.4 on 2 and 29 DF, p-value: 5.387e-09
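Gear is significant on its own but loses significance once cyl enters the model. One way to probe this is to compare the two nested models with anova() and to check how correlated the two predictors are. The sketch below does both; the names fit_small, fit_big, cmp and adj_r2 are hypothetical.

```r
# Nested models: gear only, then gear plus cyl
fit_small <- lm(mpg ~ gear, data = mtcars)
fit_big   <- lm(mpg ~ cyl + gear, data = mtcars)

# F-test for whether adding cyl reduces the residual sum of squares
cmp <- anova(fit_small, fit_big)
print(cmp)

# Adjusted R-squared penalises the extra parameter
adj_r2 <- c(gear_only = summary(fit_small)$adj.r.squared,
            cyl_gear  = summary(fit_big)$adj.r.squared)
print(round(adj_r2, 4))

# Correlation between the two predictors
print(cor(mtcars$cyl, mtcars$gear))
```

Because only one term is added, the F-test here is equivalent to the t-test on cyl in the larger model. A nonzero correlation between cyl and gear suggests that part of gear's apparent effect in the single-predictor model was shared with cyl.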
Fitting Simple Linear Regression to the Training set
regressor <- lm(formula = mpg ~ gear, data = training_data)
summary(regressor)
##
## Call:
## lm(formula = mpg ~ gear, data = training_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.2985  -2.0544  -0.4842   1.3110  10.9110
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.051      5.270   0.579  0.56913
## gear           4.610      1.386   3.325  0.00337 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.874 on 20 degrees of freedom
## Multiple R-squared: 0.3561, Adjusted R-squared: 0.3239
## F-statistic: 11.06 on 1 and 20 DF, p-value: 0.003373
Predicting the Test set results
y_pred <- predict(regressor, newdata = test_data)
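The predictions become more informative when compared with the observed test values. The sketch below defines two common error measures, RMSE and MAE, and applies them to a held-out set. To stay self-contained it rebuilds a split with base R sampling, which is an assumption and will not reproduce the caTools split used above. The helpers rmse and mae and the *_demo objects are hypothetical names.

```r
# Root mean squared error: penalises large misses more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of a miss, in mpg
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Assumption: a base-R 70/30 split standing in for training_data / test_data
set.seed(123)
idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train_demo <- mtcars[idx, ]
test_demo  <- mtcars[-idx, ]

# Same model form as the report: mpg explained by gear
fit_demo  <- lm(mpg ~ gear, data = train_demo)
pred_demo <- predict(fit_demo, newdata = test_demo)

cat("Test RMSE:", rmse(test_demo$mpg, pred_demo), "\n")
cat("Test MAE: ", mae(test_demo$mpg, pred_demo), "\n")
```

With the objects from this report, the equivalent calls would be rmse(test_data$mpg, y_pred) and mae(test_data$mpg, y_pred).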
Visualising the Training set results
ggplot() +
  # Observed training points
  geom_point(aes(x = training_data$gear, y = training_data$mpg),
             colour = as.factor(training_data$mpg)) +
  # Fitted regression line from the training-set model
  geom_line(aes(x = training_data$gear, y = predict(regressor, newdata = training_data)),
            colour = 'black') +
  ggtitle('Gear vs MPG (Training Data)') +
  xlab('Gear') +
  ylab('MPG')

Visualising the Test set results
ggplot() +
  # Observed test points
  geom_point(aes(x = test_data$gear, y = test_data$mpg),
             colour = as.factor(test_data$gear)) +
  # Predictions of the training-set model on the test data
  geom_line(aes(x = test_data$gear, y = predict(regressor, newdata = test_data)),
            colour = 'black') +
  ggtitle('Gear vs MPG (Test Data)') +
  xlab('Gear') +
  ylab('MPG')
