Overview and summary

Apply bootstrap resampling to the auto price data as follows:

Source of the data can be found at : https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

Note: Following packages are required to run the below report.

rm(list = ls())
require(dplyr)
require(simpleboot)
require(repr)
require(knitr)
setwd("C:\\Tejo\\DataScience\\UW_Datascience_Course\\350\\DataScience350-master\\Lecture5\\Assignment")

Data loading and preparation

read.auto = function(file = 'Automobile price data _Raw_.csv'){
        ## Read the csv file
        auto.price <- read.csv(file, header = TRUE, 
                               stringsAsFactors = FALSE)
        
        ## Coerce some character columns to numeric
        numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm')
        auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
        
        ## Remove cases or rows with missing values. In this case we keep the 
        ## rows which do not have nas. 
        auto.price[complete.cases(auto.price), ]
}
auto.price = read.auto()

#Compare and test Normality the distributions of price and log price
#- Use both a graphical method and a formal test.

read.auto <- auto.price[auto.price$drive.wheels != "4wd",c("fuel.type", 
                                                           "aspiration","drive.wheels" ,"price" ,
                                                           "num.of.doors", "body.style" )]

read.auto <- read.auto[read.auto$num.of.doors != "?", ]
read.auto$log.price <- log(read.auto$price)
read.auto$scaled.log.price <- scale(read.auto$log.price, center = TRUE, scale = TRUE)
read.auto$scaled.price <- scale(read.auto$price, center = TRUE, scale = TRUE)

Lets start with the Question 1:

  1. Compare the difference of the bootstrap resampled mean of the log price of autos grouped by 1) aspiration and 2) fuel type.
  2. Use both numerical and graphical methods for your comparison. Are these means different within a 95% confidence interval?
  3. How do your conclusions compare to the results you obtained using the t-test last week?

Lets create a reusable functions to plot.

plot.hist <- function(a, maxs, mins, cols = 'difference of means', nbins = 80, p = 0.05) {
        breaks = seq(maxs, mins, length.out = (nbins + 1))
        hist(a, breaks = breaks, main = paste('Histogram of', cols), xlab = cols)
        abline(v = mean(a), lwd = 4, col = 'red')
        abline(v = 0, lwd = 4, col = 'blue')
        abline(v = quantile(a, probs = p/2), lty = 3, col = 'red', lwd = 3)  
        abline(v = quantile(a, probs = (1 - p/2)), lty = 3, col = 'red', lwd = 3)
}

plot.diff <- function(a, cols = 'difference of means', nbins = 80, p = 0.05){
        maxs = max(a)
        mins = min(a)
        plot.hist(a, maxs, mins, cols = cols[1])
}

1a) Null Hypothesis: Significance of log price by aspiration. There is no log price difference with the aspiration (std vs turbo).

#options(repr.plot.width=6, repr.plot.height=4)

read.auto.aspiration.std = read.auto[read.auto$aspiration == 'std',]
read.auto.aspiration.turbo = read.auto[read.auto$aspiration == 'turbo',]

two.boot.mean = two.boot(read.auto.aspiration.std$scaled.log.price, read.auto.aspiration.turbo$scaled.log.price, mean, R = 100000)

Visualization - 1

plot.diff(two.boot.mean$t)

Conclusion: Based on the above visualization, we reject the null hypothesis.

1b) Null Hypothesis: Significance of log price by fuel type There is no log price difference with the fuel type (gas vs diesel).

read.auto.fuel.type.gas = read.auto[read.auto$fuel.type == 'gas',]
read.auto.fuel.type.diesel = read.auto[read.auto$fuel.type == 'diesel',]

two.boot.mean = two.boot(read.auto.fuel.type.gas$scaled.log.price, read.auto.fuel.type.diesel$scaled.log.price, mean, R = 100000)

Visualization - 2

plot.diff(two.boot.mean$t)

Conclusion: Based on the above visualization(s), we reject the null hypothesis.

  1. How do your conclusions compare to the results you obtained using the t-test last week?

Overall Conclusion:: In KS-test we have Accepted the NULL Hypothesis for aspiration, and Fail to accept for fuel type. But when using bootstrapping sampling, we fail to ACCEPT the NULL Hypothesis.

Q2:

2a) Compare the differences of the bootstrap resampled mean of the log price of the autos grouped by body style. You will need to do this pair wise; e.g. between each possible pairing of body styles. Use both numerical and graphical methods for your comparison. Which pairs of means are different within a 95% confidence interval?

Null Hypothesis: There is no significant log price difference between pair wise combinations for body style.

body.style <- unique(read.auto$body.style)
body.style.pair.index <- NULL
pairwise.columns <- NULL
final.data <- NULL

for ( i in 1:(length(body.style ) - 1))
{
        pair.1 <- read.auto[read.auto$body.style == body.style[i],]
        for ( j in (i+1):length(body.style))
        {
                pair.2 <- read.auto[read.auto$body.style == body.style[j],]
                body.style.pair.index <- two.boot(pair.1$scaled.log.price, 
                                pair.2$scaled.log.price, mean, R = 100000)        
                pairwise.columns <- rbind(pairwise.columns, 
                                         paste(body.style[i], '-', body.style[j]))
                final.data <- cbind(final.data, body.style.pair.index$t)
        }
        
}

final.data <- data.frame(final.data)
names(final.data) <- pairwise.columns

Visualization - 3

par(mfrow=c(5,2), mar=c(2,2,2,2))

for (i in 1: ncol(final.data))
{
  a <- names(final.data[i])
  plot.diff(final.data[,a],a)
}

par(mfrow = c(1,1))

2b)How do your conclusions compare to the results you obtained from the ANOVA and Tukey’s HSD analysis you performed last week?

PairWise Combinations Bootstrap Result Anova & Tukey HSD Result
convertible - hatchback Reject Accept
convertible - sedan Reject Accept
convertible - wagon Reject Accept
convertible - hardtop Accept Accept
hatchback - sedan Reject Accept
hatchback - wagon Reject Accept
hatchback - hardtop Reject Reject
sedan - wagon Accept Accept
sedan - hardtop Accept Accept
wagon - hardtop Accept Accept

Conclusion:

Above table has provided the clear results for the sampling techniques Bootstrap vs Anova & TukeyHSD applied on the auto price. There is a significant difference in the results, we tested.