This document is intended to be a concise report to explain a few takeaways of a dataset containing information about cars (available here). The analysis was created as part of the Data Science Certificate in the class Methods for Data Analysis at University of Washington.
The idea is to show some findings regarding information related to the price of the cars. Some functions created for this purpose are included in the appendix.
The report starts with data loading/cleaning, followed by the exploratory analysis with three sections with takeaways for price related to weight, make and drive wheel of cars. Finally, some initial basic modeling for exploration is done to close the report.
First we load the file with the functions created for this report (included in the appendix). Then the data is loaded using the function read.auto, that also coerces some character columns to numeric and adjusts the columns num.of.doors and num.of.cylinders, that have numerical data in the form of text (see the function read.auto in the appendix at the end of this report for more info).
# load created functions
source('Functions.R')
# read.auto function loads and cleans the data
Auto.Price = read.auto(path = '.') # function read.auto is included in the appendix
summary(Auto.Price)
## symboling normalized.losses make fuel.type aspiration num.of.doors
## Min. :-2.0000 Min. : 65 toyota : 32 diesel: 20 std :168 Min. :2.000
## 1st Qu.: 0.0000 1st Qu.: 94 nissan : 18 gas :185 turbo: 37 1st Qu.:2.000
## Median : 1.0000 Median :115 mazda : 17 Median :4.000
## Mean : 0.8341 Mean :122 honda : 13 Mean :3.123
## 3rd Qu.: 2.0000 3rd Qu.:150 mitsubishi: 13 3rd Qu.:4.000
## Max. : 3.0000 Max. :256 subaru : 12 Max. :4.000
## NA's :41 (Other) :100 NA's :2
## body.style drive.wheels engine.location wheel.base length width
## convertible: 6 4wd: 9 front:202 Min. : 86.60 Min. :141.1 Min. :60.30
## hardtop : 8 fwd:120 rear : 3 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10
## hatchback :70 rwd: 76 Median : 97.00 Median :173.2 Median :65.50
## sedan :96 Mean : 98.76 Mean :174.0 Mean :65.91
## wagon :25 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90
## Max. :120.90 Max. :208.1 Max. :72.30
##
## height curb.weight engine.type num.of.cylinders engine.size fuel.system bore
## Min. :47.80 Min. :1488 dohc : 12 Min. : 2.00 Min. : 61.0 mpfi :94 Min. :2.54
## 1st Qu.:52.00 1st Qu.:2145 dohcv: 1 1st Qu.: 4.00 1st Qu.: 97.0 2bbl :66 1st Qu.:3.15
## Median :54.10 Median :2414 l : 12 Median : 4.00 Median :120.0 idi :20 Median :3.31
## Mean :53.72 Mean :2556 ohc :148 Mean : 4.38 Mean :126.9 1bbl :11 Mean :3.33
## 3rd Qu.:55.50 3rd Qu.:2935 ohcf : 15 3rd Qu.: 4.00 3rd Qu.:141.0 spdi : 9 3rd Qu.:3.59
## Max. :59.80 Max. :4066 ohcv : 13 Max. :12.00 Max. :326.0 4bbl : 3 Max. :3.94
## rotor: 4 (Other): 2 NA's :4
## stroke compression.ratio horsepower peak.rpm city.mpg highway.mpg
## Min. :2.070 Min. : 7.00 Min. : 48.0 Min. :4150 Min. :13.00 Min. :16.00
## 1st Qu.:3.110 1st Qu.: 8.60 1st Qu.: 70.0 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00
## Median :3.290 Median : 9.00 Median : 95.0 Median :5200 Median :24.00 Median :30.00
## Mean :3.255 Mean :10.14 Mean :104.3 Mean :5125 Mean :25.22 Mean :30.75
## 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00
## Max. :4.170 Max. :23.00 Max. :288.0 Max. :6600 Max. :49.00 Max. :54.00
## NA's :4 NA's :2 NA's :2
## price
## Min. : 5118
## 1st Qu.: 7775
## Median :10295
## Mean :13207
## 3rd Qu.:16500
## Max. :45400
## NA's :4
We can see that a few columns have NA values, and we have features like make and fuel.system that are categorical variables with a high number of levels.
Also, although not seen in the summary, some features have inter-relationships. For example, the diesel level of fuel.type is only found in the level idi of fuel.system and vice-versa, as we can see below.
table(Auto.Price$fuel.type, Auto.Price$fuel.system)
##
## 1bbl 2bbl 4bbl idi mfi mpfi spdi spfi
## diesel 0 0 0 20 0 0 0 0
## gas 11 66 3 0 1 94 9 1
These will only be treated later with modeling purposes, but it is interesting and vital to notice these cases early on.
Following we have three sections with the takeaways found for price related to drive wheels, weight and make of cars.
Drive wheels essentially dictates the traction of the cars, into 4 wheels (4wd), two forward wheels (fwd) or two rear wheels (rwd). Observing the variable drive.wheels, we see it has exactly three levels: 4wd, fwd and rwd.
table(Auto.Price$drive.wheels)
##
## 4wd fwd rwd
## 9 120 76
sort(tapply(Auto.Price$price, Auto.Price$drive.wheels, mean, na.rm=TRUE), decreasing=TRUE)
## rwd 4wd fwd
## 19757.61 10241.00 9244.78
We can also notice that the level 4wd has very few observations (9 out of 205, ~4%) and therefore may be hard to account while making sure to not overfit. Also, by analyzing the price for each level, we see that rwd has a higher mean price than fwd. The following plot also seems to confirm this hypothesis.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
ggplot(Auto.Price[Auto.Price$drive.wheels %in% c('fwd','rwd'),], aes(price)) +
geom_histogram(binwidth=1000, na.rm=TRUE) + facet_grid(. ~ drive.wheels) +
labs(title = "Histogram of Price by Drive Wheels fwd and rwd") + labs(x = "Price (US$)", y = "Frequency")
To assert this hypothesis, we do a Welch Two Sample t-test, defining the following null and alternative hypothesis:
t.test(Auto.Price$price[Auto.Price$drive.wheels == 'rwd'],
Auto.Price$price[Auto.Price$drive.wheels != 'rwd'], "greater", 0, FALSE, FALSE, 0.95)
##
## Welch Two Sample t-test
##
## data: Auto.Price$price[Auto.Price$drive.wheels == "rwd"] and Auto.Price$price[Auto.Price$drive.wheels != "rwd"]
## t = 9.5858, df = 86.049, p-value = 1.571e-15
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 8636.988 Inf
## sample estimates:
## mean of x mean of y
## 19757.613 9308.032
With a very small p-value, we can reject the null hypothesis, confirming our finding that the mean price of rwd drive wheel is greater than the mean price of the other two levels.
The Curb Weight is the total weight of a vehicle with all of its standard equipment. We found this feature by looking at the numerical variables that have the highest correlation with price, as we can see below.
# Obtain numeric-like columns of Auto.Price
numerical.cols <- getColsOfClass(Auto.Price, "num") # function getColsOfClass is included in the appendix
# Obtain three highest correlated (absolute correlation) features with Price
featCorPrice() # function featCorPrice is included in the appendix
## curb.weight width engine.size
## 0.8936391 0.8433705 0.8414956
# Obtain three highest correlated (absolute correlation) features with LOG of Price
featCorPrice(log) # function featCorPrice is included in the appendix
## curb.weight width engine.size
## 0.9174756 0.8524911 0.8351468
Basically, we looked at the correlation of the numerical features with price and then with log(price). First we see that curb.weight is the highest correlated variable with price (absolute correlation).
We can also notice that by applying the log in price, we get higher correlation. To observe this relationship we can see below a plot of the variable curb.weight by the log(price) with a linear regression line added.
ggplot(Auto.Price, aes(curb.weight, log(price))) + geom_point(aes(alpha = 0.2, size=2)) +
guides(alpha=F,size=F) + stat_smooth(method = "lm", level = 0.95, colour = "red") +
labs(title = "Scatterplot of Curb Weight by Log of Price with Linear Regression Line") +
labs(x = "Curb Weight (lbs)", y = "Log of Price")
The image confirms the linear relationship between curb.weight and log(price).
Intuitively, make should be a variable that influences the price of the cars. We already saw that the feature has many levels, so to try to avoid reaching conclusions based on data that is not representative, we will limit make for those that have more than 10 observations among the 205 rows (~5%).
make.ten.or.more <- names(sort(table(Auto.Price$make)[table(Auto.Price$make) > 10], decreasing = TRUE))
sort(with(Auto.Price[Auto.Price$make %in% make.ten.or.more,], tapply(price, make, mean, na.rm=TRUE)), decreasing = TRUE)
## volvo peugot mazda nissan volkswagen toyota mitsubishi subaru honda
## 18063.182 15489.091 10652.882 10415.667 10077.500 9885.812 9239.769 8541.250 8184.692
Sorting the price for each make with more than 10 observations we see that volvo and peugeot (misspelled as peugot) have higher mean prices. Now we can look at the density plots for these makes.
ggplot(Auto.Price[Auto.Price$make %in% make.ten.or.more,], aes(price,fill=make)) +
geom_density(na.rm = TRUE) + facet_grid(. ~ make) + guides(fill=F) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
labs(title = "Density Plot of Price by Make") + labs(x = "Price (US$)", y = "Density")
One make that has a high mean price and is considerably skewed is peugeot. Below we compare its price with the other makes to assess the difference.
library(gridExtra)
bp1 = ggplot(Auto.Price[Auto.Price$make == 'peugot',], aes(1:11, price)) + geom_boxplot() +
theme(axis.ticks=element_blank(), axis.text.x = element_blank()) + labs(x = "Peugeot", y = "Price (US$)")
bp2 = ggplot(Auto.Price[Auto.Price$make != 'peugot',], aes(1:194, price)) + geom_boxplot() +
theme(axis.ticks=element_blank(), axis.text.x = element_blank()) + labs(x = "Others", y = "Price (US$)")
grid.arrange(bp1, bp2, nrow = 1, top = "Boxplots of Price for Peugeot and other makes")
The plots confirm our hypothesis that Peugeot stands out amongst the makes by having a higher mean price.
As a final essay, still for exploration purposes, we create a model for the dataset and analyze the results. To control the features and each of the levels of the categorical variables individually, we first transform the factor columns into binary or dummy columns.
library(dummies)
cols_to_dummy <- getColsOfClass(Auto.Price, "char") # function getColsOfClass is included in the appendix
dummy.sel <- dummy.data.frame(Auto.Price[,cols_to_dummy], sep="_", drop=T)
auto.dummy <- cbind(Auto.Price[,-cols_to_dummy], dummy.sel)
After transforming the factor features, we build a simple model with the variables we found in our three takeaways.
# Simple Model
simple.model <- lm(log(price) ~ curb.weight + drive.wheels_rwd + make_peugot, data=auto.dummy)
summary(simple.model)
##
## Call:
## lm(formula = log(price) ~ curb.weight + drive.wheels_rwd + make_peugot,
## data = auto.dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65661 -0.11483 -0.00953 0.10152 0.82957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.270e+00 8.928e-02 81.428 < 2e-16 ***
## curb.weight 7.932e-04 3.818e-05 20.772 < 2e-16 ***
## drive.wheels_rwd 1.991e-01 4.076e-02 4.885 2.14e-06 ***
## make_peugot -3.860e-01 6.763e-02 -5.707 4.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2051 on 197 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8358, Adjusted R-squared: 0.8333
## F-statistic: 334.2 on 3 and 197 DF, p-value: < 2.2e-16
Above we see that with only the three findings we managed to obtain, the model is already giving quite satisfactory results with R-squared of approximatelly 0.84.
Next, we build the model with all of the available features and output the R-squared.
# Full Model
full.model <- lm(log(price) ~ ., data=auto.dummy)
summary(full.model)$r.squared
## [1] 0.9672693
The R-squared above shows us that there is still room to improve, and an even more in-depth analysis can lead to greater results.
It is worth noticing that for modeling purposes there are several other techniques that should be addressed and were skipped in this report (for example, the data should be split into train and test to start with).
Functions loaded using the file Functions.R are showed below.
## Read the csv file into a data frame
read.auto <- function(path = 'SET-YOUR-PATH-HERE'){
## Function to read the csv file
filePath <- file.path(path, 'Automobile price data _Raw_.csv')
auto.price <- read.csv(filePath, header = TRUE,
stringsAsFactors = TRUE, na.strings = "?")
## Coerce some character columns to numeric
numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm',
'highway.mpg', 'city.mpg', 'compression.ratio',
'engine.size', 'curb.weight', 'height', 'width',
'length', 'wheel.base', 'normalized.losses',
'symboling')
auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
## Clean and tidy num.of.doors
auto.price$num.of.doors <- as.character(auto.price$num.of.doors)
auto.price$num.of.doors[auto.price$num.of.doors == 'four'] <- 4
auto.price$num.of.doors[auto.price$num.of.doors == 'two'] <- 2
auto.price$num.of.doors <- as.integer(auto.price$num.of.doors)
## Clean and tidy num.of.cylinders
auto.price$num.of.cylinders <- as.character(auto.price$num.of.cylinders)
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'eight'] <- 8
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'five'] <- 5
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'four'] <- 4
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'six'] <- 6
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'three'] <- 3
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'twelve'] <- 12
auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'two'] <- 2
auto.price$num.of.cylinders <- as.integer(auto.price$num.of.cylinders)
auto.price
}
# Automatically obtain numeric-like or character-like columns of data
getColsOfClass <- function(full.data, class='num'){
if(class=='num'){
classesChosen <- c('integer','numeric','double','float')
}
else{
classesChosen <- c('factor','character')
}
return(sort(unique(sapply(1:ncol(full.data),
function(x){
if(class(full.data[,x]) %in% classesChosen){
x
}
else{
0
}
}
)
)
)[-1]
)
}
# Obtain three highest correlated (absolute correlation) features of Auto.Price
featCorPrice <- function(price.function=I, n=3){
cor.mat <- cor(Auto.Price[,numerical.cols], price.function(Auto.Price[,'price']), use = 'complete.obs')
feat.cor <- abs(cor.mat)
feat.cor <- feat.cor[order(feat.cor, decreasing = T),][-1]
feat.cor[1:n]
}