Probability Models - Cars Dataset

PROJECT - VEHICLE FEATURES AND SPECIFICATIONS

Abstract

The project uses the CARS data set, which contains data about various aspects of common vehicles being manufactured and sold, including their place of origin, manufacturer, mileage and component specifications. The objective of the project is to apply the concepts of statistical inference to the data set. Methods including the empirical CDF (ECDF), parametric and non-parametric bootstrap, maximum likelihood estimation (MLE) and Bayesian analysis were used to infer population parameters from the sample.

Introduction:

The dataset has 428 observations and 15 variables, of which 6 are categorical and 9 are numerical. URL: https://www.kaggle.com/ljanjughazyan/cars1#CARS.csv

Variables in the Dataset

Make: Manufacturer, Categorical
Model: Car Model Name, Categorical
Type: Segment based on size, Categorical
Origin: Location of Production, Categorical
DriveTrain: Drivetrain type, i.e. which wheels receive the engine's power, Categorical
MSRP: Manufacturer's Suggested Retail Price, Numerical
Invoice: Invoice price, Numerical
EngineSize: Size of the engine, Numerical
Cylinders: Number of cylinders in the Engine, Categorical
Horsepower: Output Power, Numerical
MPG_City: Mileage in City, Numerical
MPG_Highway: Mileage on highways, Numerical
Weight: Weight of the vehicle, Numerical
Wheelbase: the distance between the front and rear axles, Numerical
Length: Total length from hood to trunk, Numerical

Packages Used:

readr: for importing data from csv
Ecdat: loaded alongside readr; note that the ecdf() function used for the Empirical CDF comes from base R's stats package

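library(readr)   # provides read_csv()
library(Ecdat)   # loaded in the original analysis; ecdf() itself is in base R (stats)
# Read the data (the path is specific to the author's machine)
CARS <- read_csv("C:/Users/Swagatam/Downloads/CARS.csv")
# Make columns such as Length and MPG_City available by bare name
attach(CARS)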

Summary Statistics of the Columns to be analysed:

We use the ‘Make’, ‘Horsepower’, ‘MPG_City’ and ‘Length’ columns for the statistical analyses.
The Make column has 38 levels, one for each vehicle manufacturer, e.g. Audi, Acura, BMW.
The project focuses on ‘Ford’ and ‘Chevrolet’ for the categorical analysis.
The Horsepower column is a continuous variable and represents the brake horsepower (BHP) of each vehicle.
Here are the summary statistics:

The MPG_City column is a continuous variable and represents the fuel economy in miles per gallon (on city roads) of each vehicle. Here are the summary statistics:

The Length column is a continuous variable and represents the total length from the trunk to the hood of different vehicles. Here are the summary statistics:

Histogram Plots

From the above plots, we see that Length is approximately normally distributed. The MPG_City and Horsepower variables are also roughly bell-shaped, but are positively skewed (they have a longer right tail).
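The visual impression of skew can be backed by a number. The following is a minimal sketch, assuming the moment-based sample skewness (third central moment divided by the cubed standard deviation). The helper `sample_skewness` is hypothetical and not part of the original analysis, and the data are simulated rather than taken from CARS:

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
# `sample_skewness` is an illustrative helper, not from the original report.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # divide-by-n standard deviation
  mean((x - m)^3) / s^3
}

set.seed(1)
symmetric_sample    <- rnorm(5000)   # simulated, roughly symmetric
right_skewed_sample <- rexp(5000)    # simulated, strongly right-skewed

skew_symmetric <- sample_skewness(symmetric_sample)
skew_right     <- sample_skewness(right_skewed_sample)
```

Applied to the real columns, for example `sample_skewness(CARS$Horsepower)`, a value clearly above zero would agree with the positive skew seen in the histograms, while a value near zero would support the symmetry claimed for Length.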

Methodology:

The following methods were used to analyze the data set:

Empirical CDF of Length
Non-Parametric Bootstrap for estimating standard errors and confidence interval for correlation between Length and MPG_City
Maximum likelihood estimation (MLE) of the difference in mean Length between two manufacturers, Ford and Chevrolet, with standard errors and confidence intervals for that difference estimated by parametric bootstrap
Hypothesis testing using the Wald test for the difference in mean lengths: H0: µ1 - µ2 = 0 vs. H1: µ1 - µ2 ≠ 0
Bayesian analysis, expressing our prior belief about the mean of Horsepower as µ ~ N(a, b²) and assuming that the Horsepower variable follows a normal distribution N(µ, σ²)

# Empirical CDF of Length

# Estimate the ECDF and plot it; the confidence bands are overlaid further down
len.ecdf<-ecdf(x = Length)
plot(len.ecdf,col="blue",main='Empirical CDF of Length')

# 95% confidence band for the ECDF (half-width Eps from the DKW inequality)
Alpha=0.05
n=length(Length)
Eps=sqrt(log(2/Alpha)/(2*n))
grid<-seq(140,250, length.out = 1000)
lines(grid, pmin(len.ecdf(grid)+Eps,1),col="red")
lines(grid, pmax(len.ecdf(grid)-Eps,0),col="red")

The red curves show the 95% confidence band around the ECDF of Length.
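The half-width `Eps` used in the code follows from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. In the sketch below, $\hat F_n$ is the empirical CDF, $F$ the unknown true CDF, and $\varepsilon_n$ corresponds to `Eps`; these symbols are introduced here for illustration and do not appear in the original text:

```latex
\[
P\Big(\sup_x \big|\hat F_n(x) - F(x)\big| > \varepsilon\Big) \le 2e^{-2n\varepsilon^2}
\quad\Longrightarrow\quad
\varepsilon_n = \sqrt{\frac{1}{2n}\,\log\frac{2}{\alpha}},
\]
\[
L(x) = \max\{\hat F_n(x) - \varepsilon_n,\, 0\}, \qquad
U(x) = \min\{\hat F_n(x) + \varepsilon_n,\, 1\}.
\]
```

With $n = 428$ and $\alpha = 0.05$ this gives $\varepsilon_n \approx 0.066$, so the band extends roughly 6.6 percentage points above and below the ECDF, clipped to the interval [0, 1].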

Non Parametric Bootstrap:

The sample correlation between Length and MPG_City is -0.5015.
Using the non-parametric bootstrap with 3000 resamples, the standard error is about 0.033, and the 95% confidence intervals from the normal, pivotal and percentile methods are as follows:

cor.sample <- cor(MPG_City, Length)
CarsBootSample <- CARS[,c("MPG_City", "Length")]
N <- dim(CarsBootSample)[1]
cor.boot <- replicate(3000, cor(CarsBootSample[sample(1:N, size = N, replace = TRUE),])[1,2])
sd.cor.boot <- sqrt(var(cor.boot))
sd.cor.boot

## [1] 0.03271003

cor.sample

## [1] -0.5015264

hist(cor.boot,col="pink", main='Histogram of Non Parametric Bootstrap samples')

normal.ci<-c(cor.sample-2*sd.cor.boot, cor.sample+2*sd.cor.boot)
normal.ci

## [1] -0.5669465 -0.4361064

pivotal.ci<-c(2*cor.sample-quantile(cor.boot,0.975), 2*cor.sample-quantile(cor.boot,0.025))
pivotal.ci

##      97.5%       2.5% 
## -0.5641556 -0.4349655

quantile.ci<-quantile(cor.boot, c(0.025, 0.975))
quantile.ci

##       2.5%      97.5% 
## -0.5680873 -0.4388972
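The three intervals above can be collected in one reusable function. This is a sketch under stated assumptions, not the authors' code: `boot_cor_ci` is a hypothetical helper, the data are simulated stand-ins rather than the CARS columns, a seed is set so the result is reproducible, and `qnorm(0.975)` (about 1.96) replaces the rounded multiplier 2.

```r
# Illustrative helper (not from the original report): bootstrap the correlation
# of two paired numeric vectors and return three common 95% intervals.
boot_cor_ci <- function(x, y, B = 2000, alpha = 0.05) {
  n <- length(x)
  est <- cor(x, y)
  boot <- replicate(B, {
    idx <- sample.int(n, size = n, replace = TRUE)  # resample pairs together
    cor(x[idx], y[idx])
  })
  se <- sd(boot)
  z <- qnorm(1 - alpha / 2)
  # unname() drops the "2.5%"/"97.5%" labels so they cannot end up on the wrong bound
  q <- unname(quantile(boot, c(alpha / 2, 1 - alpha / 2)))
  list(
    estimate   = est,
    se         = se,
    normal     = c(est - z * se, est + z * se),
    pivotal    = c(2 * est - q[2], 2 * est - q[1]),
    percentile = q
  )
}

# Simulated stand-in data with a negative association (NOT the CARS columns).
set.seed(42)
x_sim <- rnorm(400)
y_sim <- -0.5 * x_sim + rnorm(400, sd = sqrt(0.75))
res <- boot_cor_ci(x_sim, y_sim)
```

In the output above, the lower bound of the pivotal interval is labelled 97.5% only because `quantile()` attaches its own names to the values; the bounds themselves are in the correct order.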

MLE and Parametric Bootstrap:

We first find the MLE of the difference in mean Length between two car manufacturers, Ford and Chevrolet.
For this analysis, we split the dataset into two groups, one per manufacturer.
There are 23 observations for Ford and 27 for Chevrolet. Our MLE is the difference between the two sample means, mu_hat = 0.2834138, with an estimated standard error of sigma_hat = 5.098177.
We define the ‘theta.hat’ function and then estimate the standard error by parametric bootstrap using the ‘replicate’ function.
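The estimator and its plug-in standard error can be written out as follows. This sketch assumes the lengths within each make are independent and normally distributed; the symbols ($\bar X_1, s_1, n_1 = 23$ for Ford and $\bar X_2, s_2, n_2 = 27$ for Chevrolet) are introduced here for illustration:

```latex
\[
\hat\theta = \hat\mu_1 - \hat\mu_2 = \bar X_1 - \bar X_2,
\qquad
\widehat{\mathrm{se}}(\hat\theta) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} .
\]
```

The difference of sample means is the MLE of $\mu_1 - \mu_2$ because the MLE of each mean is the sample mean and MLEs are equivariant under transformations. Note that `var()` and `sd()` in R divide by $n - 1$, whereas the strict MLE of a normal variance divides by $n$; with 23 and 27 observations the two versions differ only by a few percent.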

# Extract the Length values for Ford and Chevrolet
Ford=CARS$Length[CARS$Make=="Ford"]    
Chevrolet=CARS$Length[CARS$Make=="Chevrolet"]
n.Ford=length(Ford)
n.Ford

## [1] 23

mu.Ford=mean(Ford)
sigma.Ford<-sd(Ford)
n.Chevrolet=length(Chevrolet)
n.Chevrolet

## [1] 27

mu.Chevrolet=mean(Chevrolet)
sigma.Chevrolet<-sd(Chevrolet)
mu_hat=mu.Ford-mu.Chevrolet
mu_hat

## [1] 0.2834138

sigma_hat<-sqrt(var(Ford)/n.Ford+var(Chevrolet)/n.Chevrolet)  
sigma_hat

## [1] 5.098177

theta.hat<-function(s1, s2){
  mean(s1)-mean(s2)
}

boot.theta.hat<-replicate(3200, theta.hat(rnorm(n.Ford, mean = mu.Ford, sd = sigma.Ford), rnorm(n.Chevrolet, mean = mu.Chevrolet, sd = sigma.Chevrolet)))
se<-sd(boot.theta.hat)
hist(boot.theta.hat,col="orange",main = 'Histogram of Parametric Bootstrap samples')

CI <- c(mu_hat-2*se, mu_hat+2*se)
se

## [1] 5.044365

CI

## [1] -9.805316 10.372143

z.stat <- mu_hat/se
p.value=2*(1-pnorm(abs(z.stat)))
z.stat

## [1] 0.05618425

p.value

## [1] 0.955195

Hypothesis Testing using Wald Test:

The Wald test statistic was used for hypothesis testing.

H0 : There is no difference in the mean lengths of Ford and Chevrolet cars
H1 : The mean lengths of the two manufacturers are different
The W statistic is approximately 0.056
The p-value is approximately 0.955
(The exact digits change slightly from run to run because the standard error comes from a random bootstrap simulation.)
At α=0.05, we fail to reject the null hypothesis: the data provide no evidence of a difference in mean lengths between Ford and Chevrolet cars.
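For reference, the computation behind these figures can be sketched as follows, using the estimate and bootstrap standard error printed by the code above. Here $\Phi$ denotes the standard normal CDF; the notation is added for illustration:

```latex
\[
W = \frac{\hat\theta - 0}{\widehat{\mathrm{se}}(\hat\theta)}
  = \frac{0.2834138}{5.044365} \approx 0.0562,
\qquad
p = 2\,\big(1 - \Phi(|W|)\big) \approx 0.955 .
\]
```

The test rejects H0 at level $\alpha = 0.05$ only when $|W| > z_{\alpha/2} \approx 1.96$, and the observed value is far below that threshold. This agrees with the confidence interval for the difference, which comfortably contains zero.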

Bayesian Analysis:

Bayesian Analysis was performed to find the distribution of the population mean of the Horsepower variable.
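The update rule used in this section is the standard normal-normal conjugate result. The sketch below assumes $\sigma$ is treated as known and replaced by the sample standard deviation, and it writes the prior as $N(a, b^2)$ with $b$ the prior standard deviation; this notation is an assumption made for illustration:

```latex
\[
\mu \sim N(a, b^2), \qquad X_1,\dots,X_n \mid \mu \sim N(\mu, \sigma^2)
\]
\[
\mu \mid X_1,\dots,X_n \sim N\!\left(
  \frac{a/b^2 + n\bar X/\sigma^2}{1/b^2 + n/\sigma^2},\;
  \frac{1}{1/b^2 + n/\sigma^2}
\right)
\]
```

The posterior mean is a precision-weighted average of the prior mean $a$ and the sample mean $\bar X$, with weights $1/b^2$ and $n/\sigma^2$. The second argument is the posterior variance, so the posterior standard deviation is its square root.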

Expressing one teammate's belief, we took the prior to be N(mean=200, sd=1) and obtained a posterior of approximately N(mean=201.21, sd=0.96).
Expressing another belief, the prior is µ ~ N(mean=250, sd=1500), which gives a posterior of approximately N(mean=215.9, sd=3.47).
Conclusion: One of us had a firm idea of a typical vehicle's HP and therefore chose a small prior standard deviation of 1, so his posterior stayed close to the prior. The other teammate had little idea of the HP and chose a much larger standard deviation (1500), so his posterior mean is essentially the mean of the observed data, which is 215.88.

mu.hp <- mean(Horsepower)
sd.hp <- sd(Horsepower)
# Prior 1: confident belief, N(mean = 200, sd = 1)
mu.prior1 <- 200
sd.prior1 <- 1
Ib1 <- 1/sd.prior1^2                  # prior precision = 1 / prior variance
Ix <- length(Horsepower)/(sd.hp)^2    # precision contributed by the data
mu.posterior1 <- (mu.prior1*Ib1 + (mu.hp)*Ix)/(Ib1+Ix)
sd.posterior1 <- sqrt(1/(Ib1+Ix))     # posterior sd = sqrt(1 / total precision)
mu.prior1

## [1] 200

# Prior 2: vague belief, N(mean = 250, sd = 1500)
mu.prior2 <- 250
sd.prior2 <- 1500
Ib2 <- 1/sd.prior2^2
mu.posterior2 <- (mu.prior2*Ib2 + (mu.hp)*Ix)/(Ib2+Ix)
sd.posterior2 <- sqrt(1/(Ib2+Ix))
mu.prior2

## [1] 250

# Draw from each posterior and overlay the two histograms
posterior1 <- rnorm(100,mean=mu.posterior1,sd=sd.posterior1)
posterior2 <- rnorm(100,mean=mu.posterior2,sd=sd.posterior2)
hist(posterior1, col=rgb(1,0,0,0.5),xlim=c(180,250), ylim=c(0,35),main="Overlapping Histogram for Posterior HP", xlab="Posterior HP")
hist(posterior2, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", legend=c("Prior sd = 1", "Prior sd = 1500"), fill=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))
box()

Red colour ~ N(mean=201.21, sd≈0.96)
Blue colour ~ N(mean≈215.9, sd≈3.47)

Results and Conclusion:

The Length variable is approximately normally distributed
There is a moderate negative correlation (about -0.50) between Length and MPG_City
The Wald test finds no statistically significant difference in the mean lengths of Ford and Chevrolet cars