PROJECT - VEHICLE FEATURES AND SPECIFICATIONS

Abstract

The project uses the CARS data set, which contains data on various aspects of commonly manufactured and sold vehicles, including their place of origin, manufacturer, mileage, and component specifications. The objective of the project is to apply the concepts of statistical inference to the data set. Methods including the empirical CDF (ECDF), parametric and nonparametric bootstrap, maximum likelihood estimation (MLE), and Bayesian analysis were used to infer population quantities from the sample.

Introduction:

The dataset has 15 variables, of which 6 are categorical and 9 are numerical. There are 428 observations in the dataset. URL: https://www.kaggle.com/ljanjughazyan/cars1#CARS.csv

Variables in the Dataset

  • Make: Manufacturer, Categorical
  • Model: Car Model Name, Categorical
  • Type: Segment based on size, Categorical
  • Origin: Location of Production, Categorical
  • DriveTrain: Which wheels the engine drives (front, rear or all), Categorical
  • MSRP: Manufacturer's Suggested Retail Price, Numerical
  • Invoice: Dealer invoice price, Numerical
  • EngineSize: Size of the Engine based on a reference metric, Numerical
  • Cylinders: Number of cylinders in the Engine, Categorical
  • Horsepower: Output Power, Numerical
  • MPG_City: Fuel economy in city driving (miles per gallon), Numerical
  • MPG_Highway: Fuel economy in highway driving (miles per gallon), Numerical
  • Weight: Weight of the vehicle, Numerical
  • Wheelbase: the distance between the front and rear axles, Numerical
  • Length: Total length from hood to trunk, Numerical

Packages Used:

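  • readr: for importing data from CSV files
  • Ecdat: a collection of econometrics data sets, loaded in the original analysis; the ecdf() function used below for the empirical CDF is part of base R's stats package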
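# Load the packages and read the data
library(readr)
library(Ecdat)
CARS <- read_csv("C:/Users/Swagatam/Downloads/CARS.csv")
# attach() lets the later code refer to columns such as Length by name
attach(CARS)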

Summary Statistics of the Columns to be analysed:

  • We use the ‘Make’, ‘Horsepower’, ‘MPG_City’ and ‘Length’ columns for the statistical analyses.
  • The Make column has 38 levels, one for each vehicle manufacturer, e.g. Audi, Acura, BMW.
  • The project focuses on ‘Ford’ and ‘Chevrolet’ for the comparison between manufacturers.
  • The Horsepower column is treated as a continuous variable and gives the engine output (BHP) of each vehicle.
  • Here are the summary statistics:
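
As a minimal sketch of how such a summary could be produced, the snippet below summarises the columns of interest. The small data frame `cars_demo` is a hypothetical stand-in for CARS with made-up values, used only so that the snippet runs on its own.

```r
# Hypothetical stand-in for the CARS data (values are illustrative only).
cars_demo <- data.frame(
  Make       = c("Ford", "Ford", "Chevrolet", "Chevrolet", "Audi"),
  Horsepower = c(130, 210, 155, 295, 220),
  MPG_City   = c(26, 18, 24, 16, 20),
  Length     = c(168, 198, 183, 200, 179),
  stringsAsFactors = TRUE
)

# Min, quartiles, median, mean and max for each numeric column of interest.
num_summary <- sapply(cars_demo[, c("Horsepower", "MPG_City", "Length")], summary)
print(num_summary)

# Number of observations per manufacturer.
make_counts <- table(cars_demo$Make)
print(make_counts)
```

With the real data, the same calls would be applied to CARS instead of `cars_demo`.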

Histogram Plots

Methodology:

The following methods were used to analyze the data set:
  • Empirical CDF of Length
  • Non-parametric bootstrap for estimating the standard error and confidence intervals of the correlation between Length and MPG_City
  • Maximum likelihood estimator for the difference in mean Length between two car manufacturers, Ford and Chevrolet, with the standard error and confidence interval of that difference estimated by parametric bootstrap
  • Hypothesis testing using the Wald test for the difference in mean lengths: H0: µ1 - µ2 = 0 vs. H1: µ1 - µ2 ≠ 0
  • Bayesian analysis of the mean Horsepower µ, expressing our prior belief as µ ~ N(a, b) and assuming the Horsepower variable follows a normal distribution N(µ, σ²)
# Empirical CDF of Length
len.ecdf <- ecdf(x = Length)
plot(len.ecdf, col = "blue", main = 'Empirical CDF of Length')

# 95% confidence band from the DKW inequality, half-width Eps = sqrt(log(2/Alpha)/(2n))
Alpha <- 0.05
n <- length(Length)
Eps <- sqrt(log(2/Alpha)/(2*n))
grid <- seq(140, 250, length.out = 1000)
# Upper and lower bands, clipped to [0, 1]
lines(grid, pmin(len.ecdf(grid)+Eps, 1), col = "red")
lines(grid, pmax(len.ecdf(grid)-Eps, 0), col = "red")

The red curves show the 95% confidence band around the empirical CDF of Length.
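
The band computation can be wrapped in a small reusable function. The sketch below is one possible way to do this; the function name `dkw_band` and the simulated sample `x_demo` are assumptions introduced for illustration and are not part of the original analysis.

```r
# Evaluate the ECDF and a DKW confidence band on a grid of points.
# x: numeric sample; grid: evaluation points; alpha: 1 - coverage level.
dkw_band <- function(x, grid, alpha = 0.05) {
  n    <- length(x)
  eps  <- sqrt(log(2 / alpha) / (2 * n))   # DKW half-width
  Fhat <- ecdf(x)(grid)
  data.frame(
    grid  = grid,
    ecdf  = Fhat,
    lower = pmax(Fhat - eps, 0),           # clip to [0, 1]
    upper = pmin(Fhat + eps, 1)
  )
}

set.seed(1)
x_demo <- rnorm(428, mean = 186, sd = 14)  # simulated values, not the CARS data
band   <- dkw_band(x_demo, grid = seq(140, 250, length.out = 200))
head(band)
```

Because the half-width depends only on n and alpha, the band has the same width everywhere except where it is clipped at 0 or 1.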

Non-Parametric Bootstrap:

  • The sample correlation between Length and MPG_City is -0.5015.
  • Using the non-parametric bootstrap, the standard error is 0.033, and the 95% confidence intervals from three procedures (normal, pivotal and percentile) are as follows:
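# Sample correlation and the two columns to be resampled
cor.sample <- cor(MPG_City, Length)
CarsBootSample <- CARS[,c("MPG_City", "Length")]
N <- dim(CarsBootSample)[1]
# Resample rows with replacement 3000 times and recompute the correlation
cor.boot <- replicate(3000, cor(CarsBootSample[sample(1:N, size = N, replace = TRUE),])[1,2])
# Bootstrap standard error
sd.cor.boot <- sqrt(var(cor.boot))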
sd.cor.boot
## [1] 0.03271003
cor.sample
## [1] -0.5015264
hist(cor.boot,col="pink", main='Histogram of Non Parametric Bootstrap samples')

# Normal-approximation interval: estimate +/- 2 standard errors
normal.ci<-c(cor.sample-2*sd.cor.boot, cor.sample+2*sd.cor.boot)
normal.ci
## [1] -0.5669465 -0.4361064
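# Pivotal interval; the printed labels are inherited from quantile(), so the lower bound is labelled 97.5%
pivotal.ci<-c(2*cor.sample-quantile(cor.boot,0.975), 2*cor.sample-quantile(cor.boot,0.025))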
pivotal.ci
##      97.5%       2.5% 
## -0.5641556 -0.4349655
# Percentile interval: 2.5% and 97.5% quantiles of the bootstrap replicates
quantile.ci<-quantile(cor.boot, c(0.025, 0.975))
quantile.ci
##       2.5%      97.5% 
## -0.5680873 -0.4388972
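
The three interval constructions differ only in how they use the bootstrap replicates, which a small helper makes explicit. The sketch below is illustrative: the function name `boot_cis` and the simulated data are assumptions, and it uses the exact normal quantile (about 1.96) where the code above uses 2.

```r
# Three bootstrap confidence intervals from a point estimate and its replicates.
boot_cis <- function(estimate, replicates, level = 0.95) {
  a  <- (1 - level) / 2
  q  <- unname(quantile(replicates, c(a, 1 - a)))
  z  <- qnorm(1 - a)
  se <- sd(replicates)
  rbind(
    normal     = c(lower = estimate - z * se, upper = estimate + z * se),
    pivotal    = c(lower = 2 * estimate - q[2], upper = 2 * estimate - q[1]),
    percentile = c(lower = q[1], upper = q[2])
  )
}

set.seed(42)
# Simulated, negatively correlated pair (hypothetical data, not CARS).
x <- rnorm(200)
y <- -0.5 * x + rnorm(200, sd = sqrt(0.75))
cor_reps <- replicate(2000, {
  i <- sample.int(200, replace = TRUE)   # resample row indices with replacement
  cor(x[i], y[i])
})
cis <- boot_cis(cor(x, y), cor_reps)
print(cis)
```

The pivotal interval is the percentile interval reflected around the point estimate, which is why the two coincide only when the bootstrap distribution is symmetric about the estimate.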

MLE and Parametric Bootstrap:

  • We first find the MLE of the difference in mean Length between two car manufacturers, Ford and Chevrolet.
  • For this analysis, we split the dataset into two groups, one per manufacturer.
  • There are 23 observations for Ford and 27 for Chevrolet. The MLE is the difference between the two sample means, mu_hat = 0.2834138, with estimated standard error sigma_hat = 5.098177.
  • We define the ‘theta.hat’ function and estimate the standard error by parametric bootstrap using the replicate function.
# Lengths of Ford and Chevrolet vehicles
Ford=CARS$Length[CARS$Make=="Ford"]    
Chevrolet=CARS$Length[CARS$Make=="Chevrolet"]
n.Ford=length(Ford)
n.Ford
## [1] 23
mu.Ford=mean(Ford)
sigma.Ford<-sd(Ford)
n.Chevrolet=length(Chevrolet)
n.Chevrolet
## [1] 27
mu.Chevrolet=mean(Chevrolet)
sigma.Chevrolet<-sd(Chevrolet)
mu_hat=mu.Ford-mu.Chevrolet
mu_hat
## [1] 0.2834138
sigma_hat<-sqrt(var(Ford)/n.Ford+var(Chevrolet)/n.Chevrolet)  
sigma_hat
## [1] 5.098177
theta.hat<-function(s1, s2){
  mean(s1)-mean(s2)
}

boot.theta.hat<-replicate(3200, theta.hat(rnorm(n.Ford, mean = mu.Ford, sd = sigma.Ford), rnorm(n.Chevrolet, mean = mu.Chevrolet, sd = sigma.Chevrolet)))
se<-sd(boot.theta.hat)
hist(boot.theta.hat,col="orange",main = 'Histogram of Parametric Bootstrap samples')

CI <- c(mu_hat-2*se, mu_hat+2*se)
se
## [1] 5.044365
CI
## [1] -9.805316 10.372143
# Wald statistic for H0: difference in means = 0, with its two-sided p-value
z.stat <- mu_hat/se
p.value=2*(1-pnorm(abs(z.stat)))
z.stat
## [1] 0.05618425
p.value
## [1] 0.955195
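
The parametric bootstrap step can also be written as a standalone function. The following is a sketch under the same normal model for each group; the function name `param_boot_diff` and the simulated groups are assumptions made for illustration.

```r
# Parametric bootstrap SE for a difference in means under normal models.
# x, y: numeric samples; B: number of bootstrap replicates.
param_boot_diff <- function(x, y, B = 2000) {
  reps <- replicate(B, {
    xs <- rnorm(length(x), mean = mean(x), sd = sd(x))
    ys <- rnorm(length(y), mean = mean(y), sd = sd(y))
    mean(xs) - mean(ys)
  })
  list(estimate = mean(x) - mean(y), se = sd(reps), replicates = reps)
}

set.seed(7)
# Simulated groups with the same sizes as in the report (values are made up).
g1  <- rnorm(23, mean = 190, sd = 18)
g2  <- rnorm(27, mean = 190, sd = 17)
fit <- param_boot_diff(g1, g2)

# The bootstrap SE should be close to the plug-in formula for the SE.
analytic_se <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
c(bootstrap = fit$se, analytic = analytic_se)
```

Agreement between the two standard errors is expected here, because the simulated group means are exactly normal under the fitted model.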

Hypothesis Testing using Wald Test:

  • The Wald statistic, W = mu_hat/se = 0.0562, gives a two-sided p-value of 0.955, so we fail to reject H0: µ1 - µ2 = 0 at the 5% level.
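
For reference, the test can be expressed as a short function of the estimate and its standard error. This is a minimal sketch; the function name `wald_test` is an assumption, and the inputs are the rounded mu_hat and bootstrap se printed in the output above.

```r
# Two-sided Wald test of H0: theta = theta0, given an estimate and its SE.
wald_test <- function(estimate, se, theta0 = 0, alpha = 0.05) {
  w <- (estimate - theta0) / se
  p <- 2 * pnorm(-abs(w))
  list(statistic = w, p.value = p, reject = abs(w) > qnorm(1 - alpha / 2))
}

# Estimate and bootstrap SE as printed in the report.
res <- wald_test(estimate = 0.2834138, se = 5.044365)
res
```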

Bayesian Analysis:

Bayesian Analysis was performed to find the posterior distribution of the population mean of the Horsepower variable.
  1. Expressing one of our teammates’ beliefs, we took the prior to be N(mean=200, variance=1) and obtained a posterior of N(mean=201.21, variance=0.9234).
  2. Expressing another belief, we took the prior to be N(mean=250, variance=1500) and obtained a posterior of N(mean=216.15, variance=11.96).
  3. Conclusion: One of us had a firm idea of the vehicles’ HP and chose a small prior variance of 1, so his posterior stayed close to the prior. The other teammate had little prior knowledge of the HP and chose a much larger prior variance of 1500, which gave a posterior mean close to the mean of the observed dataset, 215.88.
mu.hp <- mean(Horsepower)
sd.hp <- sd(Horsepower)
# Prior 1: mean 200, variance 1
mu.prior1 <- 200
var.prior1 <- 1
Ib1 <- 1/var.prior1                    # prior precision
Ix <- length(Horsepower)/(sd.hp)^2     # precision of the sample mean
mu.posterior1 <- (mu.prior1*Ib1 + (mu.hp)*Ix)/(Ib1+Ix)
var.posterior1 <- 1/(Ib1+Ix)
mu.prior1
## [1] 200
# Prior 2: mean 250, variance 1500
mu.prior2 <- 250
var.prior2 <- 1500
Ib2 <- 1/var.prior2
mu.posterior2 <- (mu.prior2*Ib2 + (mu.hp)*Ix)/(Ib2+Ix)
var.posterior2 <- 1/(Ib2+Ix)
mu.prior2
## [1] 250
# Draw from each posterior; rnorm() expects a standard deviation, hence sqrt()
posterior1 <- rnorm(100,mean=mu.posterior1,sd=sqrt(var.posterior1))
posterior2 <- rnorm(100,mean=mu.posterior2,sd=sqrt(var.posterior2))
hist(posterior1, col=rgb(1,0,0,0.5),xlim=c(180,250), ylim=c(0,35),main="Overlapping Histogram for Posterior HP", xlab="Posterior HP")
hist(posterior2, col=rgb(0,0,1,0.5), add=T)
box()

  • Red colour ~ N(mean=201.21, variance=0.9234)
  • Blue colour ~ N(mean=216.15, variance=11.96)
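
The normal prior with normal data update can be sketched as a function in which precisions (inverse variances) add. The function name `normal_posterior`, its parameterisation by standard deviations, and all input values below are illustrative assumptions rather than quantities taken from the data.

```r
# Posterior for a normal mean with known data sd and a normal prior.
# Precisions (1 / variance) add; the posterior sd is sqrt(1 / total precision).
normal_posterior <- function(xbar, n, sigma, prior_mean, prior_sd) {
  prec_prior <- 1 / prior_sd^2
  prec_data  <- n / sigma^2
  prec_post  <- prec_prior + prec_data
  list(
    mean = (prior_mean * prec_prior + xbar * prec_data) / prec_post,
    sd   = sqrt(1 / prec_post)
  )
}

# Hypothetical inputs chosen only to contrast a tight prior with a vague one.
tight <- normal_posterior(xbar = 216, n = 428, sigma = 72, prior_mean = 200, prior_sd = 1)
vague <- normal_posterior(xbar = 216, n = 428, sigma = 72, prior_mean = 250, prior_sd = 40)
rbind(tight = unlist(tight), vague = unlist(vague))
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so a tight prior dominates the data while a vague prior leaves the posterior mean near the sample mean.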

Results and Conclusion:

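  • The Length variable approximately follows a normal distribution.
  • There is a moderate negative correlation (about -0.50) between Length and MPG_City, and all three bootstrap 95% confidence intervals exclude zero.
  • The Wald test finds no statistically significant difference between the mean lengths of Ford and Chevrolet cars (p-value 0.955).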