This document is a concise report on the analysis of multi-valued categorical variables performed on a dataset containing information about cars (available here). The analysis was created for the class Methods for Data Analysis of the Data Science Certificate at the University of Washington.
The goal is to present findings on how the price of the cars relates to the values of the categorical variables body style, drive wheels, number of cylinders, number of doors and engine type. Some functions created for this purpose are included in the appendix.
The analysis shows that cars with rwd drive wheels have a higher mean price than those with fwd, and that body style sedan also has a higher mean price than hatchback. Cars with 4 cylinders have a lower mean price than those with 5 or 6 cylinders, whereas the number of doors does not seem to affect the price. Finally, engine type also influences the price: type ohcv has a higher mean price than ohc and ohcf, and type dohc has a higher mean price than ohc.
This report uses the data cleaned as described in “Auto Exploration Report”, with two additional steps. First, the NA values of the variable num.of.doors are replaced with 4, since these cars are of body style sedan. Second, the column lnprice is created with the log of the price; we use it instead of price because it is closer to normality, as described in “Hypothesis Testing of Auto Data”.
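These two adjustments are performed inside read.auto (see the Appendix); in isolation they amount to the following lines, assuming a data frame auto.price already cleaned as in the earlier report:
# The only NAs in num.of.doors belong to sedans, which have 4 doors
auto.price$num.of.doors[is.na(auto.price$num.of.doors)] <- 4
# Natural log of price, used throughout this report
auto.price$lnprice <- log(auto.price$price)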
# read.auto function loads and cleans the data
Auto.Price = read.auto(path = '.') # function read.auto is included in the appendix
It is worth noting that the function read.auto, which loads and cleans the data and can be seen in the Appendix, performs adjustments on two of the variables we will analyze: num.of.doors and num.of.cylinders.
The following five sections analyze the relationship between the log of price and body style, drive wheels, number of cylinders, number of doors and engine type.
The body style of a vehicle indicates its overall shape. In this dataset we have five body styles: convertible, hardtop, hatchback, sedan and wagon.
table(Auto.Price$body.style, useNA = 'always')
##
## convertible hardtop hatchback sedan wagon <NA>
## 6 8 70 96 25 0
From the counts we can see that two styles hold less than 5% of the data: convertible (6 out of 205, ~3%) and hardtop (8 out of 205, ~4%). We therefore exclude these two values, since such small groups do not provide enough data and could lead to unreliable conclusions.
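As a quick check of this 5% rule of thumb, the share of each level can be computed directly; a minimal sketch using base R (the same check applies to the other categorical variables analyzed below):
# Proportion of rows per body style; convertible and hardtop fall below the 5% cut-off
round(prop.table(table(Auto.Price$body.style)), 3)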
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
body.style <- redefine.data('body.style', c('hatchback', 'sedan','wagon')) # function included in the appendix
anova.tests(formula(lnprice ~ body.style), body.style, 8) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## body.style 2 3.85 1.9264 9.274 0.000145 ***
## Residuals 184 38.22 0.2077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $body.style
## diff lwr upr p adj
## sedan-hatchback 0.31088820 0.13945380 0.4823226 0.0000870
## wagon-hatchback 0.22291209 -0.02895758 0.4747818 0.0944466
## wagon-sedan -0.08797611 -0.33030122 0.1543490 0.6675193
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows a significant positive difference in mean lnprice between sedan and hatchback. Therefore we can conclude that the mean price of cars with body style sedan is higher than that of cars with body style hatchback.
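For convenience, the pairs flagged as significant can also be pulled out of the TukeyHSD result programmatically. The sketch below uses a hypothetical helper, significant.pairs, built on the same aov/TukeyHSD calls used by anova.tests:
# Hypothetical helper: keep only Tukey pairs with adjusted p-value below alpha
significant.pairs <- function(formula.to.anova, data.to.anova, alpha = 0.05){
  tukey <- TukeyHSD(aov(formula.to.anova, data = data.to.anova))[[1]]
  tukey[tukey[, 'p adj'] < alpha, , drop = FALSE]
}
significant.pairs(formula(lnprice ~ body.style), body.style)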
bootstrap.diff.in.means(body.style$lnprice, body.style$body.style) # function included in the appendix
## diff ci_lwr ci_upr
## sedan-hatchback 0.3109 0.1725 0.4496
## wagon-hatchback 0.2228 0.0528 0.3957
## wagon-sedan -0.0883 -0.2655 0.0915
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of body style. The results shown above are in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
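For a single pair, the same kind of interval can be reproduced directly with simpleboot’s two.boot; a minimal sketch for sedan versus hatchback (bootstrap.diff.in.means in the appendix loops this over all pairs):
# Bootstrap CI for the sedan-hatchback difference in mean lnprice (single pair)
library(simpleboot)
set.seed(42)  # arbitrary seed, chosen only for reproducibility
boot.pair <- two.boot(body.style$lnprice[body.style$body.style == 'sedan'],
                      body.style$lnprice[body.style$body.style == 'hatchback'],
                      mean, R = 100000)
quantile(boot.pair$t, probs = c(0.025, 0.975), na.rm = TRUE)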
Drive wheels essentially dictate which wheels receive traction: all four wheels (4wd), the two front wheels (fwd) or the two rear wheels (rwd). Observing the variable drive.wheels, we see it has exactly three levels: 4wd, fwd and rwd.
table(Auto.Price$drive.wheels, useNA = 'always')
##
## 4wd fwd rwd <NA>
## 9 120 76 0
We can also see that the level 4wd has very few observations (9 out of 205, ~4%), which is too little data to support reliable conclusions. Therefore, we will only compare rwd and fwd.
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
drive.wheels <- redefine.data('drive.wheels', c('rwd', 'fwd')) # function included in the appendix
sort(tapply(drive.wheels$lnprice, drive.wheels$drive.wheels, mean, na.rm=TRUE), decreasing=TRUE)
## rwd fwd
## 9.795195 9.077632
By analyzing the log of price for each level, we see that rwd has a higher mean log of price than fwd. To test this, we perform a Welch Two Sample t-test with the following null and alternative hypotheses: H0, the mean lnprice of rwd equals the mean lnprice of fwd; Ha, the mean lnprice of rwd is greater than the mean lnprice of fwd.
t.test(drive.wheels$lnprice[drive.wheels$drive.wheels == 'rwd'],
       drive.wheels$lnprice[drive.wheels$drive.wheels == 'fwd'],
       alternative = "greater", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: drive.wheels$lnprice[drive.wheels$drive.wheels == "rwd"] and drive.wheels$lnprice[drive.wheels$drive.wheels == "fwd"]
## t = 12.273, df = 123.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.6206705 Inf
## sample estimates:
## mean of x mean of y
## 9.795195 9.077632
With a very small p-value, we can reject the null hypothesis, confirming our finding that the mean price of cars with rwd drive wheels is greater than that of cars with fwd drive wheels.
bootstrap.diff.in.means(drive.wheels$lnprice, drive.wheels$drive.wheels) # function included in the appendix
## diff ci_lwr ci_upr
## fwd-rwd -0.7174 -0.8317 -0.6032
To assess a confidence interval for the difference in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice between fwd and rwd. The result shown above agrees with that of the Welch t-test, giving us a reliable confidence interval for the difference in means between the two levels of the categorical variable.
The number of cylinders of a car is related to how the engine works; generally, more cylinders mean more potential power. In this dataset the cylinder counts are 2 through 6, 8 and 12.
table(Auto.Price$num.of.cylinders, useNA = 'always')
##
## 2 3 4 5 6 8 12 <NA>
## 4 1 159 11 24 5 1 0
From the counts we can see that four cylinder counts hold less than 5% of the data: 2 (4 out of 205, ~2%), 3 (1 out of 205, ~0.5%), 8 (5 out of 205, ~2%) and 12 (1 out of 205, ~0.5%). We therefore exclude these four values, since such small groups do not provide enough data and could lead to unreliable conclusions.
Now we redefine the data to select only the column of interest and lnprice for the values that have a reasonable number of rows.
cylinders <- redefine.data('num.of.cylinders', c(4,5,6)) # function included in the appendix
anova.tests(formula(lnprice ~ num.of.cylinders), cylinders) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## num.of.cylinders 2 18.77 9.384 75.03 <2e-16 ***
## Residuals 188 23.51 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $num.of.cylinders
## diff lwr upr p adj
## 5-4 0.78619245 0.5137035 1.0586814 0.0000000
## 6-4 0.83286755 0.6497524 1.0159828 0.0000000
## 6-5 0.04667511 -0.2677914 0.3611416 0.9344985
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows a significant positive difference in mean lnprice between 5 and 4 cylinders, and also between 6 and 4. Therefore we can conclude that the mean price of cars with 4 cylinders is lower than that of cars with 5 or 6 cylinders.
bootstrap.diff.in.means(cylinders$lnprice, cylinders$num.of.cylinders) # function included in the appendix
## diff ci_lwr ci_upr
## 6-4 0.8328 0.6813 0.9877
## 5-4 0.7862 0.6046 0.9655
## 5-6 -0.0466 -0.2718 0.1755
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of number of cylinders. The results shown above are in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
The number of doors of a car may or may not count the trunk; in this dataset it does not. Observing the variable num.of.doors, we see it has exactly two levels: 2 and 4. Now we isolate the data to select only the column of interest and lnprice.
doors <- redefine.data('num.of.doors', c(2, 4)) # function included in the appendix
sort(tapply(doors$lnprice, doors$num.of.doors, mean, na.rm=TRUE), decreasing=TRUE)
## 4 2
## 9.392833 9.292991
By analyzing the log of price for each level, we see that the mean log of price for 4 doors is very close to that for 2 doors. To test whether the means are equal, we perform a Welch Two Sample t-test with the following null and alternative hypotheses: H0, the mean lnprice of 4-door cars equals the mean lnprice of 2-door cars; Ha, the two means differ.
t.test(doors$lnprice[doors$num.of.doors == '4'],
       doors$lnprice[doors$num.of.doors == '2'],
       alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: doors$lnprice[doors$num.of.doors == "4"] and doors$lnprice[doors$num.of.doors == "2"]
## t = 1.3672, df = 166.65, p-value = 0.1734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04433514 0.24401794
## sample estimates:
## mean of x mean of y
## 9.392833 9.292991
With a p-value considerably higher than 0.05, we cannot reject the null hypothesis. This does not mean we can accept the null hypothesis, since there are two possible reasons for failing to reject it: the alternative hypothesis was false to begin with, or we did not collect enough evidence for it.
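One way to probe the second possibility is to estimate the smallest difference in mean lnprice that our sample sizes could reliably detect. A minimal sketch using power.t.test, which is only an approximation here since it assumes equal group sizes and a common standard deviation:
# Approximate minimum detectable difference in mean lnprice at 80% power,
# using the smaller group size and the overall standard deviation of lnprice
n.small <- min(table(doors$num.of.doors))
power.t.test(n = n.small, sd = sd(doors$lnprice), sig.level = 0.05,
             power = 0.8, type = 'two.sample', alternative = 'two.sided')$delta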
bootstrap.diff.in.means(doors$lnprice, doors$num.of.doors) # function included in the appendix
## diff ci_lwr ci_upr
## 4-2 0.0998 -0.0436 0.2409
To assess a confidence interval for the difference in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice between 4-door and 2-door cars. The result shown above agrees with that of the Welch t-test: the interval contains zero, so we again find no significant difference in mean price between the two levels.
The engine type of a vehicle describes how the engine is designed in terms of the arrangement and operation of its valves and cylinders. In this dataset we have seven engine types: dohc (Dual OverHead Cam), dohcv (Dual OverHead Cam and Valve), l (L engine), ohc (OverHead Cam), ohcf (OverHead Cam and Valve F engine), ohcv (OverHead Cam and Valve) and rotor (Rotary engine).
table(Auto.Price$engine.type, useNA = 'always')
##
## dohc dohcv l ohc ohcf ohcv rotor <NA>
## 12 1 12 148 15 13 4 0
From the counts we can see that two types hold less than 5% of the data: dohcv (1 out of 205, ~0.5%) and rotor (4 out of 205, ~2%). We therefore exclude these two values, since such small groups do not provide enough data and could lead to unreliable conclusions.
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
engine.type <- redefine.data('engine.type', c('dohc','l','ohc','ohcf','ohcv')) # function included in the appendix
anova.tests(formula(lnprice ~ engine.type), engine.type, 6) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## engine.type 4 10.04 2.5103 11.96 1.06e-08 ***
## Residuals 192 40.28 0.2098
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $engine.type
## diff lwr upr p adj
## l-dohc -0.1814966 -0.69645611 0.33346289 0.8680941
## ohc-dohc -0.4866595 -0.86555891 -0.10776013 0.0045682
## ohcf-dohc -0.4162636 -0.90479710 0.07226986 0.1349546
## ohcv-dohc 0.3087090 -0.19625033 0.81366834 0.4465183
## ohc-l -0.3051629 -0.68406230 0.07373648 0.1774917
## ohcf-l -0.2347670 -0.72330049 0.25376647 0.6770050
## ohcv-l 0.4902056 -0.01475372 0.99516495 0.0617230
## ohcf-ohc 0.0703959 -0.27172457 0.41251637 0.9797063
## ohcv-ohc 0.7953685 0.43017627 1.16056079 0.0000001
## ohcv-ohcf 0.7249726 0.24699187 1.20295338 0.0004274
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows significant differences in mean lnprice between ohcv and ohc, between ohcv and ohcf, and between dohc and ohc (the ohc-dohc difference is significantly negative, i.e. dohc is higher). Therefore we can conclude that cars with engine type ohcv have a higher mean price than those with ohc or ohcf, and that type dohc has a higher mean price than ohc.
bootstrap.diff.in.means(engine.type$lnprice, engine.type$engine.type) # function included in the appendix
## diff ci_lwr ci_upr
## ohcv-dohc 0.3092 -0.0075 0.6271
## ohc-dohc -0.4866 -0.7210 -0.2655
## l-dohc -0.1820 -0.4849 0.0871
## ohcf-dohc -0.4162 -0.7761 -0.0324
## ohc-ohcv -0.7957 -1.0446 -0.5559
## l-ohcv -0.4887 -0.7999 -0.2040
## ohcf-ohcv -0.7250 -1.0972 -0.3316
## l-ohc 0.3057 0.0818 0.4778
## ohcf-ohc 0.0703 -0.2230 0.4049
## ohcf-l -0.2354 -0.5681 0.1445
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of engine type. The results shown above are broadly in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
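As a complementary visual check of these differences, the distributions of lnprice per engine type can be compared directly; a minimal sketch with base R graphics:
# Boxplot of lnprice for the engine types kept in the analysis
boxplot(lnprice ~ engine.type, data = engine.type,
        xlab = 'engine type', ylab = 'log of price')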
Thus, the data clearly shows a few important aspects of the price of cars. In other words, we saw that:
- cars with rwd drive wheels have a higher mean price than those with fwd;
- cars with body style sedan have a higher mean price than those with body style hatchback;
- cars with 4 cylinders have a lower mean price than those with 5 or 6 cylinders;
- cars with engine type ohcv have a higher mean price than those with ohc or ohcf, and type dohc has a higher mean price than ohc.
Finally, from the available data we could not see any influence of the number of doors on the price of cars.
The functions used in this report are shown below.
## Read the csv file into a data frame
read.auto <- function(path = 'SET-YOUR-PATH-HERE'){
  ## Function to read the csv file
  filePath <- file.path(path, 'Automobile price data _Raw_.csv')
  auto.price <- read.csv(filePath, header = TRUE,
                         stringsAsFactors = TRUE, na.strings = "?")
  ## Coerce some character columns to numeric
  numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm',
               'highway.mpg', 'city.mpg', 'compression.ratio',
               'engine.size', 'curb.weight', 'height', 'width',
               'length', 'wheel.base', 'normalized.losses',
               'symboling')
  auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
  ## Clean and tidy num.of.doors
  auto.price$num.of.doors <- as.character(auto.price$num.of.doors)
  auto.price$num.of.doors[auto.price$num.of.doors == 'four'] <- 4
  auto.price$num.of.doors[auto.price$num.of.doors == 'two'] <- 2
  auto.price$num.of.doors <- as.integer(auto.price$num.of.doors)
  ## Since the only NAs in num.of.doors are of cars with body.style sedan, they have 4 doors
  auto.price$num.of.doors[is.na(auto.price$num.of.doors)] <- 4
  ## Clean and tidy num.of.cylinders
  auto.price$num.of.cylinders <- as.character(auto.price$num.of.cylinders)
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'eight'] <- 8
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'five'] <- 5
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'four'] <- 4
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'six'] <- 6
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'three'] <- 3
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'twelve'] <- 12
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'two'] <- 2
  auto.price$num.of.cylinders <- as.integer(auto.price$num.of.cylinders)
  ## Create lnprice, the natural log of price
  auto.price$lnprice <- log(auto.price$price)
  auto.price
}
# Subsets the global Auto.Price to keep only the column name.of.column (restricted to the values in values.of.column) and lnprice
redefine.data <- function(name.of.column, values.of.column){
  # Subsets rows and cols
  new.Auto <- Auto.Price[Auto.Price[, name.of.column] %in% values.of.column, c(name.of.column, 'lnprice')]
  new.Auto <- new.Auto[complete.cases(new.Auto),] # Keep only non NA values
  new.Auto[, name.of.column] <- factor(new.Auto[, name.of.column]) # Transforms to factor for ANOVA tests
  new.Auto
}
# Runs ANOVA and Tukey's Range Test, printing both results and plotting the Tukey intervals
anova.tests <- function(formula.to.anova, data.to.anova, leftpar=4){
  aov.fit <- aov(formula.to.anova, data = data.to.anova) # Calculates ANOVA
  # Print ANOVA
  cat(" ANOVA\n")
  print(summary(aov.fit))
  cat("\n----------------------------------------------------------------\n\n")
  tukey_anova <- TukeyHSD(aov.fit) # Calculates Tukey's
  # Print Tukey's
  print(tukey_anova)
  # Widen the left margin (leftpar) so long level names fit in the Tukey plot
  par(mar=c(5,leftpar,4,2) + 0.1)
  plot(tukey_anova, las=1)
  par(mar=c(5,4,4,2) + 0.1) # Restore default margins
}
# Outputs mean, lower and upper values of confidence interval constructed using bootstrap samples
bootstrap.diff.in.means <- function(main.col, col.to.boot, p=0.05, nr.iter=100000){
  library(simpleboot)
  col.to.boot <- factor(col.to.boot) # Vector needs to be factor
  categories <- unique(col.to.boot) # Obtain the categories in which to iterate
  boot.Output <- data.frame(diff=numeric(0), ci_lwr=numeric(0), ci_upr=numeric(0)) # Data frame with answers
  for(i in 1:(length(categories)-1)){
    for(j in (i+1):length(categories)){
      two.boot.mean = two.boot(main.col[col.to.boot == categories[j]],
                               main.col[col.to.boot == categories[i]], mean, R=nr.iter) # Bootstrap samples
      boot.values <- two.boot.mean$t
      # Calculates confidence intervals
      ci_lwr = round(quantile(boot.values, probs = p/2, na.rm=TRUE), 4)
      mean_vl = round(mean(boot.values), 4)
      ci_upr = round(quantile(boot.values, probs = (1 - p/2), na.rm=TRUE), 4)
      # Adds to data frame with answers
      current.boot = data.frame(diff=mean_vl, ci_lwr=ci_lwr, ci_upr=ci_upr)
      rownames(current.boot) <- paste0(categories[j], "-", categories[i])
      boot.Output <- rbind(boot.Output, current.boot)
    }
  }
  boot.Output
}