Bootstrap Resampling

Overview and summary

Apply bootstrap resampling to the auto price data as follows:

Compare the difference of the bootstrap resampled mean of the log price of autos grouped by 1) aspiration and 2) fuel type. Use both numerical and graphical methods for your comparison. Are these means different within a 95% confidence interval? How do your conclusions compare to the results you obtained using the t-test last week?
Compare the differences of the bootstrap resampled mean of the log price of the autos grouped by body style. You will need to do this pair wise; e.g. between each possible pairing of body styles. Use both numerical and graphical methods for your comparison. Which pairs of means are different within a 95% confidence interval? How do your conclusions compare to the results you obtained from the ANOVA and Tukey’s HSD analysis you performed last week?

The source data can be found at: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

Note: The following packages are required to run this report.

dplyr
simpleboot
repr
knitr

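rm(list = ls())  # clear the workspace before running the report
## library() stops with an error if a package is missing;
## require() would only return FALSE and let the script continue
library(dplyr)
library(simpleboot)
library(repr)
library(knitr)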

setwd("C:\\Tejo\\DataScience\\UW_Datascience_Course\\350\\DataScience350-master\\Lecture5\\Assignment")

Data loading and preparation

read.auto = function(file = 'Automobile price data _Raw_.csv'){
        ## Read the csv file
        auto.price <- read.csv(file, header = TRUE, 
                               stringsAsFactors = FALSE)
        
        ## Coerce some character columns to numeric
        numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm')
        auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
        
        ## Remove cases or rows with missing values. In this case we keep the 
        ## rows which do not have nas. 
        auto.price[complete.cases(auto.price), ]
}
auto.price = read.auto()

# Drop 4wd cars and keep only the columns needed for the bootstrap comparisons.
# Note: this reuses the name read.auto, so the loader function is replaced by the data subset.

read.auto <- auto.price[auto.price$drive.wheels != "4wd",c("fuel.type", 
                                                           "aspiration","drive.wheels" ,"price" ,
                                                           "num.of.doors", "body.style" )]

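## Drop rows where the number of doors is unknown ("?")
read.auto <- read.auto[read.auto$num.of.doors != "?", ]
read.auto$log.price <- log(read.auto$price)
## scale() returns a one-column matrix; as.numeric() stores a plain vector instead
read.auto$scaled.log.price <- as.numeric(scale(read.auto$log.price, center = TRUE, scale = TRUE))
read.auto$scaled.price <- as.numeric(scale(read.auto$price, center = TRUE, scale = TRUE))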

Let's start with Question 1:

Compare the difference of the bootstrap resampled mean of the log price of autos grouped by 1) aspiration and 2) fuel type.
Use both numerical and graphical methods for your comparison. Are these means different within a 95% confidence interval?
How do your conclusions compare to the results you obtained using the t-test last week?

Let's create reusable plotting functions.

plot.hist <- function(a, maxs, mins, cols = 'difference of means', nbins = 80, p = 0.05) {
        ## Equal-width bins spanning the full range of the bootstrap values
        breaks = seq(mins, maxs, length.out = (nbins + 1))
        hist(a, breaks = breaks, main = paste('Histogram of', cols), xlab = cols)
        abline(v = mean(a), lwd = 4, col = 'red')   # bootstrap mean
        abline(v = 0, lwd = 4, col = 'blue')        # null value: no difference
        ## Dotted lines mark the lower and upper bounds of the (1 - p) interval
        abline(v = quantile(a, probs = p/2), lty = 3, col = 'red', lwd = 3)  
        abline(v = quantile(a, probs = (1 - p/2)), lty = 3, col = 'red', lwd = 3)
}

plot.diff <- function(a, cols = 'difference of means', nbins = 80, p = 0.05){
        maxs = max(a)
        mins = min(a)
        ## Forward nbins and p so that callers can override the defaults
        plot.hist(a, maxs, mins, cols = cols[1], nbins = nbins, p = p)
}

1a) Significance of log price by aspiration. Null hypothesis: there is no difference in mean log price between aspiration types (std vs. turbo).

#options(repr.plot.width=6, repr.plot.height=4)

read.auto.aspiration.std = read.auto[read.auto$aspiration == 'std',]
read.auto.aspiration.turbo = read.auto[read.auto$aspiration == 'turbo',]

two.boot.mean = two.boot(read.auto.aspiration.std$scaled.log.price, read.auto.aspiration.turbo$scaled.log.price, mean, R = 100000)

Visualization - 1

plot.diff(two.boot.mean$t)

Conclusion: Judging from where zero (blue line) falls relative to the 95% bootstrap interval (dotted red lines) in the above visualization, we reject the null hypothesis.
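The assignment also asks for a numerical comparison. As a minimal sketch, assuming a percentile interval is acceptable, the helper below (boot.diff.ci, a hypothetical name that is not part of the original report) resamples two groups with replacement using base R only and reports the 2.5% and 97.5% quantiles of the difference of means. The data are synthetic stand-ins; with the report's own objects, the equivalent check would be quantile(two.boot.mean$t, probs = c(0.025, 0.975)).

```r
# Hypothetical helper: percentile bootstrap CI for a difference of means.
boot.diff.ci <- function(x, y, R = 2000, p = 0.05) {
  # Resample each group independently, with replacement, R times.
  diffs <- replicate(R, mean(sample(x, replace = TRUE)) -
                        mean(sample(y, replace = TRUE)))
  ci <- quantile(diffs, probs = c(p / 2, 1 - p / 2), names = FALSE)
  list(estimate = mean(diffs), lower = ci[1], upper = ci[2],
       excludes.zero = ci[1] > 0 || ci[2] < 0)
}

set.seed(42)                          # for reproducibility
g1 <- rnorm(150, mean = 0.0, sd = 1)  # synthetic stand-in for one group
g2 <- rnorm(40,  mean = 1.5, sd = 1)  # synthetic stand-in for the other group
res <- boot.diff.ci(g1, g2)
print(res)
```

If excludes.zero is TRUE, zero lies outside the 95% interval, which is the numerical counterpart of the blue line falling outside the dotted red lines in the histogram.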

1b) Significance of log price by fuel type. Null hypothesis: there is no difference in mean log price between fuel types (gas vs. diesel).

read.auto.fuel.type.gas = read.auto[read.auto$fuel.type == 'gas',]
read.auto.fuel.type.diesel = read.auto[read.auto$fuel.type == 'diesel',]

two.boot.mean = two.boot(read.auto.fuel.type.gas$scaled.log.price, read.auto.fuel.type.diesel$scaled.log.price, mean, R = 100000)

Visualization - 2

plot.diff(two.boot.mean$t)

Conclusion: Comparing zero (blue line) against the 95% bootstrap interval (dotted red lines) in the above visualization, we reject the null hypothesis.

How do your conclusions compare to the results you obtained using the t-test last week?

Overall conclusion: With the KS test, we failed to reject the null hypothesis for aspiration and rejected it for fuel type. With bootstrap resampling, we reject the null hypothesis for both groupings.
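For reference, the parametric counterpart of the bootstrap comparison is a two-sample t-test. The sketch below runs on synthetic data, not the auto price data, and shows how the 95% confidence interval of the difference of means can be read from t.test(), which uses Welch's unequal-variance form by default. The variable names are illustrative only.

```r
set.seed(1)
std   <- rnorm(150, mean = 0.0, sd = 1)  # synthetic stand-in, not the real data
turbo <- rnorm(40,  mean = 1.5, sd = 1)  # synthetic stand-in, not the real data

tt <- t.test(std, turbo, conf.level = 0.95)  # Welch two-sample t-test
ci <- tt$conf.int                            # 95% CI for mean(std) - mean(turbo)
reject <- ci[1] > 0 || ci[2] < 0             # TRUE when the CI excludes zero
print(ci)
print(tt$p.value)
```

Placing this interval next to the bootstrap percentile interval for the same two groups gives a like-for-like comparison of the two methods.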

Q2:

2a) Compare the differences of the bootstrap resampled mean of the log price of the autos grouped by body style. You will need to do this pair wise; e.g. between each possible pairing of body styles. Use both numerical and graphical methods for your comparison. Which pairs of means are different within a 95% confidence interval?

Null hypothesis: For each pair of body styles, there is no difference in mean log price.

body.style <- unique(read.auto$body.style)
body.style.pair.index <- NULL  # bootstrap result for the current pair
pairwise.columns <- NULL       # labels such as "sedan - wagon"
final.data <- NULL             # one column of bootstrap differences per pair

## Bootstrap the difference of means for every unordered pair of body styles
for ( i in 1:(length(body.style ) - 1))
{
        pair.1 <- read.auto[read.auto$body.style == body.style[i],]
        for ( j in (i+1):length(body.style))
        {
                pair.2 <- read.auto[read.auto$body.style == body.style[j],]
                body.style.pair.index <- two.boot(pair.1$scaled.log.price, 
                                pair.2$scaled.log.price, mean, R = 100000)        
                ## c() keeps the labels as a plain character vector
                pairwise.columns <- c(pairwise.columns, 
                                      paste(body.style[i], '-', body.style[j]))
                final.data <- cbind(final.data, body.style.pair.index$t)
        }
        
}

final.data <- data.frame(final.data)
names(final.data) <- pairwise.columns
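The nested loops above enumerate every unordered pair of body styles. As an aside, base R's combn() generates the same pairs directly. The sketch below uses a hard-coded vector of style names for illustration, rather than the values read from the data.

```r
# Illustrative style names; the report derives these from the data instead.
styles <- c("convertible", "hatchback", "sedan", "wagon", "hardtop")

pairs <- combn(styles, 2)  # 2 x choose(5, 2) matrix, one column per pair
pair.labels <- apply(pairs, 2, paste, collapse = " - ")
print(pair.labels)
```

Five styles yield choose(5, 2) = 10 pairs, which matches the ten histograms and the ten rows of the comparison table.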

Visualization - 3

## 5 x 2 grid: one histogram per body-style pair
par(mfrow = c(5, 2), mar = c(2, 2, 2, 2))

for (i in seq_len(ncol(final.data)))
{
  a <- names(final.data)[i]
  plot.diff(final.data[, a], a)
}

par(mfrow = c(1,1))
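The histograms cover the graphical half of the comparison. For the numerical half, one option is to tabulate the 95% percentile interval of each column of bootstrap differences. The sketch below uses a synthetic data frame (boot.diffs, a hypothetical stand-in for final.data) with made-up pair names; applying the same sapply() call to final.data would be the analogous step in the report.

```r
set.seed(7)
# Synthetic stand-in for final.data: one column of bootstrap differences per pair.
boot.diffs <- data.frame(
  "a - b" = rnorm(5000, mean = 0.8, sd = 0.2),
  "a - c" = rnorm(5000, mean = 0.0, sd = 0.2),
  check.names = FALSE  # keep the spaces and dashes in the column names
)

# One row per pair, with the 2.5% and 97.5% quantiles as columns.
ci <- t(sapply(boot.diffs, quantile, probs = c(0.025, 0.975)))
summary.table <- data.frame(pair = rownames(ci),
                            lower = ci[, 1],
                            upper = ci[, 2],
                            excludes.zero = ci[, 1] > 0 | ci[, 2] < 0,
                            row.names = NULL)
print(summary.table)
```

A pair whose interval excludes zero corresponds to a "Reject" entry for the bootstrap method.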

2b) How do your conclusions compare to the results you obtained from the ANOVA and Tukey’s HSD analysis you performed last week?

Pairwise combination	Bootstrap result (null hypothesis)	ANOVA & Tukey HSD result (null hypothesis)
convertible - hatchback	Reject	Accept
convertible - sedan	Reject	Accept
convertible - wagon	Reject	Accept
convertible - hardtop	Accept	Accept
hatchback - sedan	Reject	Accept
hatchback - wagon	Reject	Accept
hatchback - hardtop	Reject	Reject
sedan - wagon	Accept	Accept
sedan - hardtop	Accept	Accept
wagon - hardtop	Accept	Accept
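For context on the ANOVA and Tukey HSD column, the sketch below shows how such per-pair results can be obtained in base R. It runs on synthetic data with made-up group labels, so the numbers say nothing about the auto price data. TukeyHSD() reports a family-wise adjusted interval for each pair, whereas separate 95% bootstrap intervals apply no correction for multiple comparisons; that difference is one plausible reason the bootstrap column rejects more often.

```r
set.seed(3)
# Synthetic stand-in: three groups, with group "c" shifted upward.
d <- data.frame(
  style = factor(rep(c("a", "b", "c"), each = 30)),
  y = c(rnorm(30, mean = 0), rnorm(30, mean = 0), rnorm(30, mean = 2))
)

fit <- aov(y ~ style, data = d)
hsd <- TukeyHSD(fit, conf.level = 0.95)$style  # matrix: diff, lwr, upr, p adj
print(hsd)

# A pair differs at the family-wise 95% level when lwr and upr share a sign.
differs <- hsd[, "lwr"] > 0 | hsd[, "upr"] < 0
print(differs)
```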