Apply bootstrap resampling to the auto price data as follows:
Compare the difference of the bootstrap resampled mean of the log price of autos grouped by 1) aspiration and 2) fuel type. Use both numerical and graphical methods for your comparison. Are these means different within a 95% confidence interval? How do your conclusions compare to the results you obtained using the t-test last week?
Compare the differences of the bootstrap resampled mean of the log price of the autos grouped by body style. You will need to do this pair wise; e.g. between each possible pairing of body styles. Use both numerical and graphical methods for your comparison. Which pairs of means are different within a 95% confidence interval? How do your conclusions compare to the results you obtained from the ANOVA and Tukey’s HSD analysis you performed last week?
Source of the data can be found at : https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Note: Following packages are required to run the below report.
rm(list = ls())
require(dplyr)
require(simpleboot)
require(repr)
require(knitr)
setwd("C:\\Tejo\\DataScience\\UW_Datascience_Course\\350\\DataScience350-master\\Lecture5\\Assignment")
read.auto = function(file = 'Automobile price data _Raw_.csv'){
## Read the csv file
auto.price <- read.csv(file, header = TRUE,
stringsAsFactors = FALSE)
## Coerce some character columns to numeric
numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm')
auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
## Remove cases or rows with missing values. In this case we keep the
## rows which do not have nas.
auto.price[complete.cases(auto.price), ]
}
auto.price = read.auto()
#Compare and test Normality the distributions of price and log price
#- Use both a graphical method and a formal test.
read.auto <- auto.price[auto.price$drive.wheels != "4wd",c("fuel.type",
"aspiration","drive.wheels" ,"price" ,
"num.of.doors", "body.style" )]
read.auto <- read.auto[read.auto$num.of.doors != "?", ]
read.auto$log.price <- log(read.auto$price)
read.auto$scaled.log.price <- scale(read.auto$log.price, center = TRUE, scale = TRUE)
read.auto$scaled.price <- scale(read.auto$price, center = TRUE, scale = TRUE)
Lets start with the Question 1:
Lets create a reusable functions to plot.
plot.hist <- function(a, maxs, mins, cols = 'difference of means', nbins = 80, p = 0.05) {
breaks = seq(maxs, mins, length.out = (nbins + 1))
hist(a, breaks = breaks, main = paste('Histogram of', cols), xlab = cols)
abline(v = mean(a), lwd = 4, col = 'red')
abline(v = 0, lwd = 4, col = 'blue')
abline(v = quantile(a, probs = p/2), lty = 3, col = 'red', lwd = 3)
abline(v = quantile(a, probs = (1 - p/2)), lty = 3, col = 'red', lwd = 3)
}
plot.diff <- function(a, cols = 'difference of means', nbins = 80, p = 0.05){
maxs = max(a)
mins = min(a)
plot.hist(a, maxs, mins, cols = cols[1])
}
1a) Null Hypothesis: Significance of log price by aspiration. There is no log price difference with the aspiration (std vs turbo).
#options(repr.plot.width=6, repr.plot.height=4)
read.auto.aspiration.std = read.auto[read.auto$aspiration == 'std',]
read.auto.aspiration.turbo = read.auto[read.auto$aspiration == 'turbo',]
two.boot.mean = two.boot(read.auto.aspiration.std$scaled.log.price, read.auto.aspiration.turbo$scaled.log.price, mean, R = 100000)
plot.diff(two.boot.mean$t)
Conclusion: Based on the above visualization, we reject the null hypothesis.
1b) Null Hypothesis: Significance of log price by fuel type There is no log price difference with the fuel type (gas vs diesel).
read.auto.fuel.type.gas = read.auto[read.auto$fuel.type == 'gas',]
read.auto.fuel.type.diesel = read.auto[read.auto$fuel.type == 'diesel',]
two.boot.mean = two.boot(read.auto.fuel.type.gas$scaled.log.price, read.auto.fuel.type.diesel$scaled.log.price, mean, R = 100000)
plot.diff(two.boot.mean$t)
Conclusion: Based on the above visualization(s), we reject the null hypothesis.
Overall Conclusion:: In KS-test we have Accepted the NULL Hypothesis for aspiration, and Fail to accept for fuel type. But when using bootstrapping sampling, we fail to ACCEPT the NULL Hypothesis.
Q2:
2a) Compare the differences of the bootstrap resampled mean of the log price of the autos grouped by body style. You will need to do this pair wise; e.g. between each possible pairing of body styles. Use both numerical and graphical methods for your comparison. Which pairs of means are different within a 95% confidence interval?
Null Hypothesis: There is no significant log price difference between pair wise combinations for body style.
body.style <- unique(read.auto$body.style)
body.style.pair.index <- NULL
pairwise.columns <- NULL
final.data <- NULL
for ( i in 1:(length(body.style ) - 1))
{
pair.1 <- read.auto[read.auto$body.style == body.style[i],]
for ( j in (i+1):length(body.style))
{
pair.2 <- read.auto[read.auto$body.style == body.style[j],]
body.style.pair.index <- two.boot(pair.1$scaled.log.price,
pair.2$scaled.log.price, mean, R = 100000)
pairwise.columns <- rbind(pairwise.columns,
paste(body.style[i], '-', body.style[j]))
final.data <- cbind(final.data, body.style.pair.index$t)
}
}
final.data <- data.frame(final.data)
names(final.data) <- pairwise.columns
par(mfrow=c(5,2), mar=c(2,2,2,2))
for (i in 1: ncol(final.data))
{
a <- names(final.data[i])
plot.diff(final.data[,a],a)
}
par(mfrow = c(1,1))
2b)How do your conclusions compare to the results you obtained from the ANOVA and Tukey’s HSD analysis you performed last week?
| PairWise Combinations | Bootstrap Result | Anova & Tukey HSD Result |
|---|---|---|
| convertible - hatchback | Reject | Accept |
| convertible - sedan | Reject | Accept |
| convertible - wagon | Reject | Accept |
| convertible - hardtop | Accept | Accept |
| hatchback - sedan | Reject | Accept |
| hatchback - wagon | Reject | Accept |
| hatchback - hardtop | Reject | Reject |
| sedan - wagon | Accept | Accept |
| sedan - hardtop | Accept | Accept |
| wagon - hardtop | Accept | Accept |
Above table has provided the clear results for the sampling techniques Bootstrap vs Anova & TukeyHSD applied on the auto price. There is a significant difference in the results, we tested.