Miles per Gallon (mpg) of an Average Ford Car Model

Porshi Gupta-S3894438,
Dhruv Pathak-S3908797,
Unnimaya Stalin-S3861387

Last updated: 31 May, 2022

Introduction

Introduction Cont.

Problem Statement

Data

library(dplyr)  # provides %>% and summarise(), used in the summaries below
ford = read.csv("ford.csv")
str(ford)
## 'data.frame':    17966 obs. of  9 variables:
##  $ model       : chr  " Fiesta" " Focus" " Focus" " Fiesta" ...
##  $ year        : int  2017 2018 2017 2019 2019 2015 2019 2017 2019 2018 ...
##  $ price       : int  12000 14000 13000 17500 16500 10500 22500 9000 25500 10000 ...
##  $ transmission: chr  "Automatic" "Manual" "Manual" "Manual" ...
##  $ mileage     : int  15944 9083 12456 10460 1482 35432 2029 13054 6894 48141 ...
##  $ fuelType    : chr  "Petrol" "Petrol" "Petrol" "Petrol" ...
##  $ tax         : int  150 150 150 145 145 145 145 145 145 145 ...
##  $ mpg         : num  57.7 57.7 57.7 40.3 48.7 47.9 50.4 54.3 42.2 61.4 ...
##  $ engineSize  : num  1 1 1 1.5 1 1.6 1 1.2 2 1 ...
ford$model = factor(ford$model)
# factor() already sorts the unique values into levels, so ordered = TRUE is enough
ford$year = factor(ford$year, ordered = TRUE)
ford$price = factor(ford$price, ordered = TRUE)
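
As a small illustration of what the ordered-factor conversion above does, the sketch below uses a toy vector of years (`year_toy` is a hypothetical name, and the values are only illustrative). It assumes nothing beyond base R. One consequence worth keeping in mind is that numeric summaries such as `mean()` no longer apply directly to a column stored as a factor, so the values have to be converted back first.

```r
# Toy stand-in for ford$year (hypothetical values for illustration only)
year_toy <- c(2017, 2018, 2017, 2019, 2015)

# Same conversion as above: levels are the sorted unique values
year_f <- factor(year_toy, ordered = TRUE)

print(levels(year_f))
print(year_f[1] < year_f[2])  # ordered factors support comparisons

# Recover the numeric values when arithmetic is needed
year_num <- as.numeric(as.character(year_f))
print(mean(year_num))
```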

Data Cont.

  1. mpg- Miles per gallon is a numeric variable that gives how far a car can travel on one gallon of fuel.

  2. mileage- Mileage is an integer variable recording the total number of miles the car has travelled, which bears on its financial and economic affordability.

  3. engineSize- Engine size determines how much power the engine can produce, which in turn affects the car's fuel consumption.

Descriptive Statistics and Visualisation

ford <- ford[ford$tax != 0,]

mileage_summary <- ford %>% summarise(Parameter = "mileage",
                               Min = min(mileage,na.rm = TRUE),
                               Q1 = quantile(mileage,probs = .25,na.rm = TRUE),
                               Median = median(mileage, na.rm = TRUE),
                               Q3 = quantile(mileage,probs = .75,na.rm = TRUE), 
                               Max = max(mileage,na.rm = TRUE),
                               Mean = mean(mileage, na.rm = TRUE),
                               SD = sd(mileage, na.rm = TRUE), 
                               n = n(),Missing = sum(is.na(mileage)))

mpg_summary <- ford %>% summarise(Parameter = "mpg",
                               Min = min(mpg,na.rm = TRUE),
                               Q1 = quantile(mpg,probs = .25,na.rm = TRUE),
                               Median = median(mpg, na.rm = TRUE),
                               Q3 = quantile(mpg,probs = .75,na.rm = TRUE), 
                               Max = max(mpg,na.rm = TRUE),
                               Mean = mean(mpg, na.rm = TRUE),
                               SD = sd(mpg, na.rm = TRUE), 
                               n = n(),Missing = sum(is.na(mpg)))

engineSize_summary <- ford %>% summarise(Parameter = "engineSize",
                               Min = min(engineSize,na.rm = TRUE),
                               Q1 = quantile(engineSize,probs = .25,na.rm = TRUE),
                               Median = median(engineSize, na.rm = TRUE),
                               Q3 = quantile(engineSize,probs = .75,na.rm = TRUE), 
                               Max = max(engineSize,na.rm = TRUE),
                               Mean = mean(engineSize, na.rm = TRUE),
                               SD = sd(engineSize, na.rm = TRUE), 
                               n = n(),Missing = sum(is.na(engineSize)))

table1<-ford %>% group_by(fuelType) %>% summarise(Parameter = "mpg",
                               Min = min(mpg,na.rm = TRUE),
                               Q1 = quantile(mpg, probs = .25,na.rm = TRUE),
                               Median = median(mpg, na.rm = TRUE),
                               Q3 = quantile(mpg,probs = .75,na.rm = TRUE),
                               Max = max(mpg,na.rm = TRUE),
                               Mean = mean(mpg, na.rm = TRUE),
                               SD = sd(mpg, na.rm = TRUE),
                               range = max(mpg,na.rm = TRUE) - min(mpg,na.rm = TRUE),
                               n = n(),Missing = sum(is.na(mpg)))
knitr::kable(table1)
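| fuelType | Parameter | Min  | Q1   | Median | Q3    | Max   | Mean     | SD        | range | n     | Missing |
|:---------|:----------|-----:|-----:|-------:|------:|------:|---------:|----------:|------:|------:|--------:|
| Diesel   | mpg       | 28.3 | 54.3 | 60.10  | 67.3  | 88.3  | 59.72766 | 10.050223 | 60.0  | 4888  | 0       |
| Hybrid   | mpg       | 46.3 | 47.1 | 49.15  | 201.8 | 201.8 | 96.82500 | 73.161652 | 155.5 | 16    | 0       |
| Petrol   | mpg       | 20.8 | 49.6 | 55.40  | 60.1  | 85.6  | 54.73053 | 8.239173  | 64.8  | 10909 | 0       |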

Descriptive Statistics Cont.

The combined summary statistics for the Ford data set bring together the following:

  1. mileage_summary
  2. mpg_summary (Miles per Gallon)
  3. engineSize_summary
table2 <- rbind(mileage_summary, mpg_summary, engineSize_summary)
knitr::kable(table2)
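| Parameter  | Min  | Q1     | Median  | Q3      | Max      | Mean         | SD           | n     | Missing |
|:-----------|-----:|-------:|--------:|--------:|---------:|-------------:|-------------:|------:|--------:|
| mileage    | 1.0  | 9236.0 | 16816.0 | 29440.0 | 177644.0 | 22107.463037 | 19281.262605 | 15813 | 0       |
| mpg        | 20.8 | 51.4   | 57.7    | 62.8    | 201.8    | 56.317802    | 9.493097     | 15813 | 0       |
| engineSize | 0.0  | 1.0    | 1.2     | 1.6     | 5.0      | 1.370467     | 0.446978     | 15813 | 0       |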
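
The three summary blocks above compute the same statistics for different columns. As a sketch only, the repetition could be factored into a small helper. `summarise_var` and the `toy` data frame are hypothetical names introduced here, the toy values are invented for illustration, and base R is used in place of dplyr so the block runs without extra packages.

```r
# Hypothetical helper: one-row summary of a numeric column, mirroring the
# statistics used in the summaries above (base R only).
summarise_var <- function(data, var) {
  x <- data[[var]]
  data.frame(
    Parameter = var,
    Min       = min(x, na.rm = TRUE),
    Q1        = unname(quantile(x, probs = 0.25, na.rm = TRUE)),
    Median    = median(x, na.rm = TRUE),
    Q3        = unname(quantile(x, probs = 0.75, na.rm = TRUE)),
    Max       = max(x, na.rm = TRUE),
    Mean      = mean(x, na.rm = TRUE),
    SD        = sd(x, na.rm = TRUE),
    n         = length(x),          # row count, including any NA
    Missing   = sum(is.na(x))
  )
}

# Toy data standing in for the filtered ford data frame (invented values)
toy <- data.frame(
  mileage    = c(1000, 5000, 12000, 30000),
  mpg        = c(40.3, 48.7, 57.7, 61.4),
  engineSize = c(1.0, 1.0, 1.5, 2.0)
)

# Build the combined table by stacking one row per variable
vars <- c("mileage", "mpg", "engineSize")
toy_table <- do.call(rbind, lapply(vars, summarise_var, data = toy))
print(toy_table)
```

On the real data, the same call with `data = ford` should give a table of the same shape as the combined summary, assuming the columns are still numeric at that point.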

Descriptive Statistics Cont.

  1. Histogram of miles per gallon (mpg)
  2. Histogram of the number of miles travelled (mileage)
  3. Scatter plot of mpg against mileage
# Histograms
ford$mpg %>% hist(col="grey", ylim=c(0,7000), xlim=c(0,300), xlab="Miles per gallon (mpg)", main="Histogram of Miles per Gallon")

ford$mileage %>% hist(col="grey", ylim=c(0,5000), xlim=c(0,200000), xlab="Mileage (miles travelled)", main="Histogram of Number of Miles Travelled")

# Scatter plot of mpg against mileage
ford %>% plot(mpg ~ mileage, data = ., ylab="Miles per gallon", xlab="Mileage", col="blue", main="Miles per Gallon against Mileage")

Descriptive Statistics Cont.

Boxplot of Miles per Gallon(mpg) showcasing outliers:

# boxplot
plot1 <- boxplot(ford$mpg, main="Boxplot of Miles per Gallon (mpg) with outliers", ylab="Miles per gallon")
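
The points a boxplot draws as outliers are those lying more than 1.5 times the interquartile range beyond the quartiles. With the quartiles reported for mpg in the combined summary (Q1 = 51.4, Q3 = 62.8), the fences would be roughly 34.3 and 79.9, so the maximum of 201.8 falls well outside. The sketch below shows the rule on a toy vector (`mpg_toy` is a hypothetical name with illustrative values). Note that `boxplot()` works from hinges, which can differ slightly from `quantile()` on some sample sizes.

```r
# Toy stand-in for ford$mpg (illustrative values only)
mpg_toy <- c(40.3, 47.9, 48.7, 50.4, 54.3, 57.7, 57.7, 61.4, 201.8)

# 1.5 * IQR rule computed by hand
q     <- unname(quantile(mpg_toy, probs = c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers_manual <- mpg_toy[mpg_toy < lower | mpg_toy > upper]

# The same points as identified by the statistics behind boxplot()
outliers_boxplot <- boxplot.stats(mpg_toy)$out

print(c(lower = lower, upper = upper))
print(outliers_manual)
print(outliers_boxplot)
```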

Hypothesis Testing

model1 <- t.test(ford$mpg, mu = 50)

model1
## 
##  One Sample t-test
## 
## data:  ford$mpg
## t = 83.688, df = 15812, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  56.16983 56.46577
## sample estimates:
## mean of x 
##   56.3178

\[H_0: \mu_1 = 50 \]

\[H_A: \mu_1 \ne 50 \]
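
As a check on the printed output, the t statistic and confidence interval can be reproduced by hand from the mean, standard deviation, and sample size of mpg reported in the combined summary table. This is a minimal sketch using the standard one-sample formulas. The variable names are introduced here for illustration, and the inputs are the rounded values from the table, so the results agree with `t.test()` only up to rounding.

```r
# Summary values for mpg taken from the combined summary table
xbar <- 56.317802  # sample mean
s    <- 9.493097   # sample standard deviation
n    <- 15813      # sample size after filtering
mu0  <- 50         # hypothesised mean under H0

se     <- s / sqrt(n)        # standard error of the mean
t_stat <- (xbar - mu0) / se  # one-sample t statistic
df     <- n - 1              # degrees of freedom

# Two-sided 95% confidence interval for the mean
ci <- xbar + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

print(t_stat)
print(ci)
print(p_value)
```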

Hypothesis Testing Cont.

  1. t-value = 83.688 (df = 15812)
  2. p-value < 2.2e-16
  3. 95% confidence interval: 56.16983, 56.46577

Discussion

References