This document is a concise report on the analysis of multi-valued categorical variables performed on a dataset containing information about cars (available here). The analysis was created for the class Methods for Data Analysis of the Data Science Certificate at the University of Washington.
The goal is to present findings on how the price of the cars relates to the values of the categorical variables body style, drive wheels, number of cylinders, number of doors and engine type. Some functions created for this purpose are included in the appendix.
The analysis shows that cars with rwd drive wheels have a higher mean price than those with fwd, and that body style sedan also has a higher mean price than hatchback. Cars with 4 cylinders have a lower mean price than those with 5 or 6 cylinders, whereas the number of doors does not seem to affect the price. Finally, engine type also influences the price: type ohcv has a higher mean price than ohc and ohcf, and type dohc has a higher mean price than ohc.
This report uses the data cleaned as described in “Auto Exploration Report”, with two additional steps. First, the NA values of the variable num.of.doors are replaced with 4, since these cars are of body style sedan. Second, the column lnprice is created with the log of the price; we use it instead of price because it is closer to normality, as described in “Hypothesis Testing of Auto Data”.
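These two adjustments are performed inside read.auto (see the Appendix); in isolation they amount to the following lines, assuming a data frame auto.price already cleaned as in the earlier report:
# The only NAs in num.of.doors belong to sedans, which have 4 doors
auto.price$num.of.doors[is.na(auto.price$num.of.doors)] <- 4
# Natural log of price, used throughout this report
auto.price$lnprice <- log(auto.price$price)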
# read.auto function loads and cleans the data
Auto.Price = read.auto(path = '.') # function read.auto is included in the appendix
It is worth noting that the function read.auto, which loads and cleans the data and can be seen in the Appendix, performs adjustments on two of the variables we will analyze: num.of.doors and num.of.cylinders.
The following five sections analyze the relationship between the log of price and body style, drive wheels, number of cylinders, number of doors and engine type.
The body style of a vehicle indicates its overall shape. In this dataset we have five body styles: convertible, hardtop, hatchback, sedan and wagon.
table(Auto.Price$body.style, useNA = 'always')
##
## convertible hardtop hatchback sedan wagon <NA>
## 6 8 70 96 25 0
From the counts we can see that two styles hold less than 5% of the data: convertible (6 out of 205, ~3%) and hardtop (8 out of 205, ~4%). We therefore exclude these two values, since such small groups do not provide enough data and could lead to unreliable conclusions.
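As a quick check of this 5% rule of thumb, the share of each level can be computed directly; a minimal sketch using base R (the same check applies to the other categorical variables analyzed below):
# Proportion of rows per body style; convertible and hardtop fall below the 5% cut-off
round(prop.table(table(Auto.Price$body.style)), 3)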
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
body.style <- redefine.data('body.style', c('hatchback', 'sedan','wagon')) # function included in the appendix
anova.tests(formula(lnprice ~ body.style), body.style, 8) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## body.style 2 3.85 1.9264 9.274 0.000145 ***
## Residuals 184 38.22 0.2077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $body.style
## diff lwr upr p adj
## sedan-hatchback 0.31088820 0.13945380 0.4823226 0.0000870
## wagon-hatchback 0.22291209 -0.02895758 0.4747818 0.0944466
## wagon-sedan -0.08797611 -0.33030122 0.1543490 0.6675193
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows a significant positive difference in mean lnprice between sedan and hatchback. Therefore we can conclude that the mean price of cars with body style sedan is higher than that of cars with body style hatchback.
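For convenience, the pairs flagged as significant can also be pulled out of the TukeyHSD result programmatically. The sketch below uses a hypothetical helper, significant.pairs, built on the same aov/TukeyHSD calls used by anova.tests:
# Hypothetical helper: keep only Tukey pairs with adjusted p-value below alpha
significant.pairs <- function(formula.to.anova, data.to.anova, alpha = 0.05){
  tukey <- TukeyHSD(aov(formula.to.anova, data = data.to.anova))[[1]]
  tukey[tukey[, 'p adj'] < alpha, , drop = FALSE]
}
significant.pairs(formula(lnprice ~ body.style), body.style)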
bootstrap.diff.in.means(body.style$lnprice, body.style$body.style) # function included in the appendix
## diff ci_lwr ci_upr
## sedan-hatchback 0.3109 0.1725 0.4496
## wagon-hatchback 0.2228 0.0528 0.3957
## wagon-sedan -0.0883 -0.2655 0.0915
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of body style. The results shown above are in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
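For a single pair, the same kind of interval can be reproduced directly with simpleboot’s two.boot; a minimal sketch for sedan versus hatchback (bootstrap.diff.in.means in the appendix loops this over all pairs):
# Bootstrap CI for the sedan-hatchback difference in mean lnprice (single pair)
library(simpleboot)
set.seed(42)  # arbitrary seed, chosen only for reproducibility
boot.pair <- two.boot(body.style$lnprice[body.style$body.style == 'sedan'],
                      body.style$lnprice[body.style$body.style == 'hatchback'],
                      mean, R = 100000)
quantile(boot.pair$t, probs = c(0.025, 0.975), na.rm = TRUE)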
Drive wheels essentially dictate which wheels receive traction: all four wheels (4wd), the two front wheels (fwd) or the two rear wheels (rwd). Observing the variable drive.wheels, we see it has exactly three levels: 4wd, fwd and rwd.
table(Auto.Price$drive.wheels, useNA = 'always')
##
## 4wd fwd rwd <NA>
## 9 120 76 0
We can also see that the level 4wd has very few observations (9 out of 205, ~4%), which is too little data to support reliable conclusions. Therefore, we will only compare rwd and fwd.
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
drive.wheels <- redefine.data('drive.wheels', c('rwd', 'fwd')) # function included in the appendix
sort(tapply(drive.wheels$lnprice, drive.wheels$drive.wheels, mean, na.rm=TRUE), decreasing=TRUE)
## rwd fwd
## 9.795195 9.077632
By analyzing the log of price for each level, we see that rwd has a higher mean log of price than fwd. To test this, we perform a Welch Two Sample t-test with the following null and alternative hypotheses: H0, the mean lnprice of rwd equals the mean lnprice of fwd; Ha, the mean lnprice of rwd is greater than the mean lnprice of fwd.
t.test(drive.wheels$lnprice[drive.wheels$drive.wheels == 'rwd'],
       drive.wheels$lnprice[drive.wheels$drive.wheels == 'fwd'],
       alternative = "greater", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: drive.wheels$lnprice[drive.wheels$drive.wheels == "rwd"] and drive.wheels$lnprice[drive.wheels$drive.wheels == "fwd"]
## t = 12.273, df = 123.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.6206705 Inf
## sample estimates:
## mean of x mean of y
## 9.795195 9.077632
With a very small p-value, we can reject the null hypothesis, confirming our finding that the mean price of cars with rwd drive wheels is greater than that of cars with fwd drive wheels.
bootstrap.diff.in.means(drive.wheels$lnprice, drive.wheels$drive.wheels) # function included in the appendix
## diff ci_lwr ci_upr
## fwd-rwd -0.7174 -0.8317 -0.6032
To assess a confidence interval for the difference in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice between fwd and rwd. The result shown above agrees with that of the Welch t-test, giving us a reliable confidence interval for the difference in means between the two levels of the categorical variable.
The number of cylinders of a car is related to how the engine works; generally, more cylinders mean more potential power. In this dataset the cylinder counts are 2 through 6, 8 and 12.
table(Auto.Price$num.of.cylinders, useNA = 'always')
##
## 2 3 4 5 6 8 12 <NA>
## 4 1 159 11 24 5 1 0
From the counts we can see that four cylinder counts hold less than 5% of the data: 2 (4 out of 205, ~2%), 3 (1 out of 205, ~0.5%), 8 (5 out of 205, ~2%) and 12 (1 out of 205, ~0.5%). We therefore exclude these four values, since such small groups do not provide enough data and could lead to unreliable conclusions.
Now we redefine the data to select only the column of interest and lnprice for the values that have a reasonable number of rows.
cylinders <- redefine.data('num.of.cylinders', c(4,5,6)) # function included in the appendix
anova.tests(formula(lnprice ~ num.of.cylinders), cylinders) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## num.of.cylinders 2 18.77 9.384 75.03 <2e-16 ***
## Residuals 188 23.51 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $num.of.cylinders
## diff lwr upr p adj
## 5-4 0.78619245 0.5137035 1.0586814 0.0000000
## 6-4 0.83286755 0.6497524 1.0159828 0.0000000
## 6-5 0.04667511 -0.2677914 0.3611416 0.9344985
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows a significant positive difference in mean lnprice between 5 and 4 cylinders, and also between 6 and 4. Therefore we can conclude that the mean price of cars with 4 cylinders is lower than that of cars with 5 or 6 cylinders.
bootstrap.diff.in.means(cylinders$lnprice, cylinders$num.of.cylinders) # function included in the appendix
## diff ci_lwr ci_upr
## 6-4 0.8328 0.6813 0.9877
## 5-4 0.7862 0.6046 0.9655
## 5-6 -0.0466 -0.2718 0.1755
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of number of cylinders. The results shown above are in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
The number of doors of a car may or may not count the trunk; in this dataset it does not. Observing the variable num.of.doors, we see it has exactly two levels: 2 and 4. Now we isolate the data to select only the column of interest and lnprice.
doors <- redefine.data('num.of.doors', c(2, 4)) # function included in the appendix
sort(tapply(doors$lnprice, doors$num.of.doors, mean, na.rm=TRUE), decreasing=TRUE)
## 4 2
## 9.392833 9.292991
By analyzing the log of price for each level, we see that the mean log of price for 4 doors is very close to that for 2 doors. To test whether the means are equal, we perform a Welch Two Sample t-test with the following null and alternative hypotheses: H0, the mean lnprice of 4-door cars equals the mean lnprice of 2-door cars; Ha, the two means differ.
t.test(doors$lnprice[doors$num.of.doors == '4'],
       doors$lnprice[doors$num.of.doors == '2'],
       alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: doors$lnprice[doors$num.of.doors == "4"] and doors$lnprice[doors$num.of.doors == "2"]
## t = 1.3672, df = 166.65, p-value = 0.1734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04433514 0.24401794
## sample estimates:
## mean of x mean of y
## 9.392833 9.292991
With a p-value considerably higher than 0.05, we cannot reject the null hypothesis. This does not mean we can accept the null hypothesis, since there are two possible reasons for failing to reject it: the alternative hypothesis was false to begin with, or we did not collect enough evidence for it.
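One way to probe the second possibility is to estimate the smallest difference in mean lnprice that our sample sizes could reliably detect. A minimal sketch using power.t.test, which is only an approximation here since it assumes equal group sizes and a common standard deviation:
# Approximate minimum detectable difference in mean lnprice at 80% power,
# using the smaller group size and the overall standard deviation of lnprice
n.small <- min(table(doors$num.of.doors))
power.t.test(n = n.small, sd = sd(doors$lnprice), sig.level = 0.05,
             power = 0.8, type = 'two.sample', alternative = 'two.sided')$delta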
bootstrap.diff.in.means(doors$lnprice, doors$num.of.doors) # function included in the appendix
## diff ci_lwr ci_upr
## 4-2 0.0998 -0.0436 0.2409
To assess a confidence interval for the difference in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice between 4-door and 2-door cars. The result shown above agrees with that of the Welch t-test: the interval contains zero, so we again find no significant difference in mean price between the two levels.
The engine type of a vehicle describes how the engine is designed in terms of the arrangement and operation of its valves and cylinders. In this dataset we have seven engine types: dohc (Dual OverHead Cam), dohcv (Dual OverHead Cam and Valve), l (L engine), ohc (OverHead Cam), ohcf (OverHead Cam and Valve F engine), ohcv (OverHead Cam and Valve) and rotor (Rotary engine).
table(Auto.Price$engine.type, useNA = 'always')
##
## dohc dohcv l ohc ohcf ohcv rotor <NA>
## 12 1 12 148 15 13 4 0
From the counts we can see that two types hold less than 5% of the data: dohcv (1 out of 205, ~0.5%) and rotor (4 out of 205, ~2%). We therefore exclude these two values, since such small groups do not provide enough data and could lead to unreliable conclusions.
Now we redefine the data to select only the column of interest and lnprice for the categorical values that have a reasonable number of rows.
engine.type <- redefine.data('engine.type', c('dohc','l','ohc','ohcf','ohcv')) # function included in the appendix
anova.tests(formula(lnprice ~ engine.type), engine.type, 6) # function included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## engine.type 4 10.04 2.5103 11.96 1.06e-08 ***
## Residuals 192 40.28 0.2098
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $engine.type
## diff lwr upr p adj
## l-dohc -0.1814966 -0.69645611 0.33346289 0.8680941
## ohc-dohc -0.4866595 -0.86555891 -0.10776013 0.0045682
## ohcf-dohc -0.4162636 -0.90479710 0.07226986 0.1349546
## ohcv-dohc 0.3087090 -0.19625033 0.81366834 0.4465183
## ohc-l -0.3051629 -0.68406230 0.07373648 0.1774917
## ohcf-l -0.2347670 -0.72330049 0.25376647 0.6770050
## ohcv-l 0.4902056 -0.01475372 0.99516495 0.0617230
## ohcf-ohc 0.0703959 -0.27172457 0.41251637 0.9797063
## ohcv-ohc 0.7953685 0.43017627 1.16056079 0.0000001
## ohcv-ohcf 0.7249726 0.24699187 1.20295338 0.0004274
The first test above is a standard ANOVA; its very small p-value indicates that there must be some difference in lnprice between the values of the categorical variable. To find out which values differ, we also performed Tukey’s Range Test, which shows significant differences in mean lnprice between ohcv and ohc, between ohcv and ohcf, and between dohc and ohc (the ohc-dohc difference is significantly negative, i.e. dohc is higher). Therefore we can conclude that cars with engine type ohcv have a higher mean price than those with ohc or ohcf, and that type dohc has a higher mean price than ohc.
bootstrap.diff.in.means(engine.type$lnprice, engine.type$engine.type) # function included in the appendix
## diff ci_lwr ci_upr
## ohcv-dohc 0.3092 -0.0075 0.6271
## ohc-dohc -0.4866 -0.7210 -0.2655
## l-dohc -0.1820 -0.4849 0.0871
## ohcf-dohc -0.4162 -0.7761 -0.0324
## ohc-ohcv -0.7957 -1.0446 -0.5559
## l-ohcv -0.4887 -0.7999 -0.2040
## ohcf-ohcv -0.7250 -1.0972 -0.3316
## l-ohc 0.3057 0.0818 0.4778
## ohcf-ohc 0.0703 -0.2230 0.4049
## ohcf-l -0.2354 -0.5681 0.1445
To assess confidence intervals for the differences in means, we created 100000 bootstrap samples and calculated the difference in mean lnprice for each pair of values of engine type. The results shown above are broadly in line with those of Tukey’s Range Test, giving us reliable confidence intervals for the difference in means of each pair of values of the categorical variable.
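As a complementary visual check of these differences, the distributions of lnprice per engine type can be compared directly; a minimal sketch with base R graphics:
# Boxplot of lnprice for the engine types kept in the analysis
boxplot(lnprice ~ engine.type, data = engine.type,
        xlab = 'engine type', ylab = 'log of price')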
Thus, the data clearly shows a few important aspects of the price of cars. In other words, we saw that:
- cars with rwd drive wheels have a higher mean price than those with fwd;
- cars with body style sedan have a higher mean price than those with body style hatchback;
- cars with 4 cylinders have a lower mean price than those with 5 or 6 cylinders;
- cars with engine type ohcv have a higher mean price than those with ohc or ohcf, and type dohc has a higher mean price than ohc.
Finally, from the available data we could not see any influence of the number of doors on the price of cars.
The functions used in this report are shown below.
## Read the csv file into a data frame
read.auto <- function(path = 'SET-YOUR-PATH-HERE'){
  ## Function to read the csv file
  filePath <- file.path(path, 'Automobile price data _Raw_.csv')
  auto.price <- read.csv(filePath, header = TRUE,
                         stringsAsFactors = TRUE, na.strings = "?")
  ## Coerce some character columns to numeric
  numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm',
               'highway.mpg', 'city.mpg', 'compression.ratio',
               'engine.size', 'curb.weight', 'height', 'width',
               'length', 'wheel.base', 'normalized.losses',
               'symboling')
  auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
  ## Clean and tidy num.of.doors
  auto.price$num.of.doors <- as.character(auto.price$num.of.doors)
  auto.price$num.of.doors[auto.price$num.of.doors == 'four'] <- 4
  auto.price$num.of.doors[auto.price$num.of.doors == 'two'] <- 2
  auto.price$num.of.doors <- as.integer(auto.price$num.of.doors)
  ## Since the only NAs in num.of.doors are of cars with body.style sedan, they have 4 doors
  auto.price$num.of.doors[is.na(auto.price$num.of.doors)] <- 4
  ## Clean and tidy num.of.cylinders
  auto.price$num.of.cylinders <- as.character(auto.price$num.of.cylinders)
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'eight'] <- 8
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'five'] <- 5
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'four'] <- 4
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'six'] <- 6
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'three'] <- 3
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'twelve'] <- 12
  auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'two'] <- 2
  auto.price$num.of.cylinders <- as.integer(auto.price$num.of.cylinders)
  ## Create lnprice, the natural log of price
  auto.price$lnprice <- log(auto.price$price)
  auto.price
}
# Subsets the global Auto.Price to keep only the column name.of.column (restricted to the values in values.of.column) and lnprice
redefine.data <- function(name.of.column, values.of.column){
  # Subsets rows and cols
  new.Auto <- Auto.Price[Auto.Price[, name.of.column] %in% values.of.column, c(name.of.column, 'lnprice')]
  new.Auto <- new.Auto[complete.cases(new.Auto),] # Keep only non NA values
  new.Auto[, name.of.column] <- factor(new.Auto[, name.of.column]) # Transforms to factor for ANOVA tests
  new.Auto
}
# Runs ANOVA and Tukey's Range Test, printing both results and plotting the Tukey intervals
anova.tests <- function(formula.to.anova, data.to.anova, leftpar=4){
  aov.fit <- aov(formula.to.anova, data = data.to.anova) # Calculates ANOVA
  # Print ANOVA
  cat(" ANOVA\n")
  print(summary(aov.fit))
  cat("\n----------------------------------------------------------------\n\n")
  tukey_anova <- TukeyHSD(aov.fit) # Calculates Tukey's
  # Print Tukey's
  print(tukey_anova)
  # Widen the left margin (leftpar) so long level names fit in the Tukey plot
  par(mar=c(5,leftpar,4,2) + 0.1)
  plot(tukey_anova, las=1)
  par(mar=c(5,4,4,2) + 0.1) # Restore default margins
}
# Outputs mean, lower and upper values of confidence interval constructed using bootstrap samples
bootstrap.diff.in.means <- function(main.col, col.to.boot, p=0.05, nr.iter=100000){
  library(simpleboot)
  col.to.boot <- factor(col.to.boot) # Vector needs to be factor
  categories <- unique(col.to.boot) # Obtain the categories in which to iterate
  boot.Output <- data.frame(diff=numeric(0), ci_lwr=numeric(0), ci_upr=numeric(0)) # Data frame with answers
  for(i in 1:(length(categories)-1)){
    for(j in (i+1):length(categories)){
      two.boot.mean = two.boot(main.col[col.to.boot == categories[j]],
                               main.col[col.to.boot == categories[i]], mean, R=nr.iter) # Bootstrap samples
      boot.values <- two.boot.mean$t
      # Calculates confidence intervals
      ci_lwr = round(quantile(boot.values, probs = p/2, na.rm=TRUE), 4)
      mean_vl = round(mean(boot.values), 4)
      ci_upr = round(quantile(boot.values, probs = (1 - p/2), na.rm=TRUE), 4)
      # Adds to data frame with answers
      current.boot = data.frame(diff=mean_vl, ci_lwr=ci_lwr, ci_upr=ci_upr)
      rownames(current.boot) <- paste0(categories[j], "-", categories[i])
      boot.Output <- rbind(boot.Output, current.boot)
    }
  }
  boot.Output
}