Description
Gas mileage, horsepower, and other information for 392 vehicles.
Source
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The dataset was used in the 1983 American Statistical Association Exposition.
References
This dataset is a part of the course material of the book : Introduction to Statistical Learning with R
Short description of variables
- mpg : miles per gallon
- cylinders : Number of cylinders between 4 and 8
- displacement : Engine displacement (cu. inches)
- horsepower : Engine horsepower
- weight : Vehicle weight (lbs.)
- acceleration : Time to accelerate from 0 to 60 mph (sec.)
- year : Model year (modulo 100)
- origin : Origin of car (1. American, 2. European, 3. Japanese)
- name : Vehicle name
===============================================================================================================
library(ggplot2)
library(RColorBrewer)
pacman::p_load(plotly, tidyverse, reshape2)
Some preliminary workings
# save default options and parameters
defop = options()
defpar = par(no.readonly=T)
# function to modify plot parameters
plot_pars = function(w=7,h=7) {options(repr.plot.width=w, repr.plot.height=h)}
===============================================================================================================
fdir = r"(E:\Data Science\Statistics\Intro to Statistical Learning with R\datasets)"
fpath = file.path(fdir,'Auto.csv')
auto = read.csv(fpath)
dim(auto)
## [1] 397 9
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
# check for missing values
anyNA(auto)
## [1] FALSE
===============================================================================================================
# structure of dataset
str(auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
The fact that a column containing numbers (horsepower) has been saved as ‘chr’ is a red flag. This may happen when all the elements in a column are not numerical. This column will have to be further examined.
# Rows with non-numeric values
auto[which(is.na(as.numeric(auto$horsepower))),]
## Warning in which(is.na(as.numeric(auto$horsepower))): NAs introduced by coercion
## mpg cylinders displacement horsepower weight acceleration year origin
## 33 25.0 4 98 ? 2046 19.0 71 1
## 127 21.0 6 200 ? 2875 17.0 74 1
## 331 40.9 4 85 ? 1835 17.3 80 2
## 337 23.6 4 140 ? 2905 14.3 80 1
## 355 34.5 4 100 ? 2320 15.8 81 2
## name
## 33 ford pinto
## 127 ford maverick
## 331 renault lecar deluxe
## 337 ford mustang cobra
## 355 renault 18i
Since the number of missing values is very small, those rows can just be deleted.
# Rows with ? in horsepower
which(auto$horsepower == '?')
## [1] 33 127 331 337 355
# Rows with ? in any column
auto[which(apply(auto, 1, function(x) any(x %in% c("?")))), ]
## mpg cylinders displacement horsepower weight acceleration year origin
## 33 25.0 4 98 ? 2046 19.0 71 1
## 127 21.0 6 200 ? 2875 17.0 74 1
## 331 40.9 4 85 ? 1835 17.3 80 2
## 337 23.6 4 140 ? 2905 14.3 80 1
## 355 34.5 4 100 ? 2320 15.8 81 2
## name
## 33 ford pinto
## 127 ford maverick
## 331 renault lecar deluxe
## 337 ford mustang cobra
## 355 renault 18i
# Deleting rows with ?
auto = auto[-which(auto$horsepower=='?'), ]
sum(auto=="?")
## [1] 0
dim(auto)
## [1] 392 9
class(auto$horsepower)
## [1] "character"
# Convert horsepower to numeric
auto$horsepower = as.numeric(auto$horsepower)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
===============================================================================================================
Quantitative → numerical values.
Qualitative → values in one of K different classes, or categories.
# No. of unique values in columns
sapply(auto, function(x) length(unique(x)))
## mpg cylinders displacement horsepower weight acceleration
## 127 5 81 93 346 95
## year origin name
## 13 3 301
| variable | description | variable type |
|---|---|---|
| mpg | miles per gallon | quantitative |
| cylinders | Number of cylinders between 4 and 8 | qualitative or categorical |
| displacement | Engine displacement (cu. inches) | quantitative |
| horsepower | Engine horsepower | quantitative |
| weight | Vehicle weight (lbs.) | quantitative |
| acceleration | Time to accelerate from 0 to 60 mph (sec.) | quantitative |
| year | Model year (modulo 100) | quantitative |
| origin | Origin of car (1. American, 2. European, 3. Japanese) | qualitative or categorical |
| name | Vehicle name | qualitative or categorical |
“year” can be considered to be quantitative in the sense that it could indirectly reflect the impact of technological abilities of the times, otherwise it can be considered qualitative (categorical).
===============================================================================================================
# Quantitative predictors
quant_data = auto[, setdiff(names(auto), c('cylinders','origin','name'))]
names(quant_data)
## [1] "mpg" "displacement" "horsepower" "weight" "acceleration"
## [6] "year"
summ = data.frame(sapply(quant_data, range), row.names=c('min','max'))
summ['range',] = summ['max',] - summ['min',]
summ
## mpg displacement horsepower weight acceleration year
## min 9.0 68 46 1613 8.0 70
## max 46.6 455 230 5140 24.8 82
## range 37.6 387 184 3527 16.8 12
===============================================================================================================
summ['mean',] = sapply(quant_data, mean)
summ['sd',] = sapply(quant_data, sd)
round(summ, 2)
## mpg displacement horsepower weight acceleration year
## min 9.00 68.00 46.00 1613.00 8.00 70.00
## max 46.60 455.00 230.00 5140.00 24.80 82.00
## range 37.60 387.00 184.00 3527.00 16.80 12.00
## mean 23.45 194.41 104.47 2977.58 15.54 75.98
## sd 7.81 104.64 38.49 849.40 2.76 3.68
===============================================================================================================
# Removing rows 10-85
df1 = quant_data[-seq(10,85), ]
dim(df1)
## [1] 316 6
# Check if the 10th row in filtered dataset matches 86th row of unfiltered dataset
all(df1[10,] == quant_data[86, ])
## [1] TRUE
summ = data.frame(sapply(df1, range), row.names=c('min','max'))
summ['range',] = summ['max',] - summ['min',]
summ['mean',] = sapply(df1, mean)
summ['sd',] = sapply(df1, sd)
round(summ, 3)
## mpg displacement horsepower weight acceleration year
## min 11.000 68.000 46.000 1649.000 8.500 70.000
## max 46.600 455.000 230.000 4997.000 24.800 82.000
## range 35.600 387.000 184.000 3348.000 16.300 12.000
## mean 24.404 187.241 100.722 2935.972 15.727 77.146
## sd 7.867 99.678 35.709 811.300 2.694 3.106
# Check sum
sapply(df1, sum)
## mpg displacement horsepower weight acceleration year
## 7711.8 59168.0 31828.0 927767.0 4969.7 24378.0
===============================================================================================================
# Pairs plot panels
panel.hist <- function(x, col="thistle", ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = 'whitesmoke', ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- cor(x, y)
abs_r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * abs_r)
}
# Scatter plot (hue - cylinder)
col_map = setNames(c('#F1DCD7','#E3BAC0','#CD9BB1','#AE80A3','#61566E'), sort(unique(auto$cylinders)))
cols = col_map[unlist(as.factor(auto$cylinders))]
pairs(quant_data, cex=2.5, cex.labels = 2, cex.axis=1.6, pch=21, bg=cols, col='whitesmoke', lwd=0.3,
diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,15,3))
par(xpd = TRUE)
legend(x=0.4, y=1, fill=unlist(col_map), legend=names(col_map), horiz=T, bty="n")
Observations:
- acceleration appears to be roughly normally distributed.
- Increase in no. of cylinders leads to
• lower : mpg, acceleration
• higher : displacement, horsepower, weight
- There has been a decline in the no. of new models coming out with 8 cylinders.
- Newer models are lighter and and have loss horsepower (presumably because of decreased weight).
- mpg appears to have strong (non-linear) relationships with displacement, horsepower, weight and is negatively correlated with the 3 variables.
- mpg of new models has imporoved over the years.
- displacement, horsepower and weight appear to have a strong positive correlation with each other.
- A moderate negative correlation may exist between horsepower and acceleration.
# Converting column cylinder to factor before using as 'color'
auto$cylinder = as.factor(auto$cylinder)
# Scatter plot - Cylinders as hue
pal = c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')
fig = plot_ly(data=auto, x=~year, y=~mpg, color=~as.factor(cylinders), colors=pal,
type="scatter", mode='markers', marker = list(size=8),
text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin,
'<br>cyl:',cylinders), width=700, height=400) %>%
layout(showlegend=T, legend = list(orientation="v", xanchor="center", x=1.05, y=0.9),
title = "Scatter Plot", hoverlabel=list(bgcolor=pal),
xaxis = list(title = "Year", zeroline=F, showgrid=F),
yaxis = list(title = "mpg", zeroline=F, gridcolor="white"))
fig
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Scatter plot - origin as hue
fig = plot_ly(auto, x=~year, y=~mpg, color=~as.factor(origin), width=700, height=400,
colors = brewer.pal(length(unique(auto$origin)),"Set1"),
symbol=~origin, symbols = c('triangle-up','circle','x'),
text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin,
'<br>cyl:',cylinder), type="scatter", mode='markers',
marker=list(size=8)) %>%
hide_colorbar() %>%
layout(showlegend = TRUE, legend = list(orientation="h", xanchor="center", x=0.45, y=1),
hoverlabel=list(bgcolor="white"), title = "Scatter Plot",
xaxis = list(title = "Year", zeroline=F, showgrid=F),
yaxis = list(title = "mpg", zeroline=F, gridcolor="white"))
fig
go to toc# Color map
cols = character(nrow(auto))
cols[] = '#61566E'
cols[auto$origin == 1] = '#EDD1CB'
cols[auto$origin == 2] = '#AD6E94'
# Hue - origin
pairs(quant_data, cex=2.2, cex.labels = 2, cex.axis=1.6, bg=cols, pch=21, col='whitesmoke',
diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,16,3))
par(xpd = TRUE)
legend(x=0.4, y=1,, fill=c('#EDD1CB','#AD6E94','#61566E'),
legend=c(levels(as.factor(auto$origin))), horiz=T, bty="n")
Observations:
- Cars of European (2) and Japenese (3) origin can be seen to be overlapping in many criteria whereas American cars (1) have a larger and distinct spread.
- Clear distinctions can be seen between American and the other 2 carmakers in displacement, horsepower and weight.
– Workings ————————————————————–
# Frequency distribution of cylinders
table(auto$cylinders)
##
## 3 4 5 6 8
## 4 199 3 83 103
# year-wise distribution of cars with cylinder count
cyl_year = auto %>% count(year,cylinders) %>% spread(cylinders, n, fill=0)
cyl_year
## year 3 4 5 6 8
## 1 70 0 7 0 4 18
## 2 71 0 12 0 8 7
## 3 72 1 14 0 0 13
## 4 73 1 11 0 8 20
## 5 74 0 15 0 6 5
## 6 75 0 12 0 12 6
## 7 76 0 15 0 10 9
## 8 77 1 14 0 5 8
## 9 78 0 17 1 12 6
## 10 79 0 12 1 6 10
## 11 80 1 23 1 2 0
## 12 81 0 20 0 7 1
## 13 82 0 27 0 3 0
ggplot with plotly
# year-wise distribution of cars with cylinder count
df <- melt(cyl_year , id.vars='year', variable.name='cylinders')
p = ggplot(data=df, aes(x=year, y=value)) +
geom_line(linetype='twodash', aes(color=cylinders)) +
geom_point(aes(color=cylinders)) +
theme_classic() +
scale_color_brewer(palette="Paired") +
theme(legend.title=element_text(size=10), legend.text=element_text(size=8))
fig = ggplotly(p, width=650, height=350)
fig = fig %>% layout(fig, plot_bgcolor='white',
xaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=1,
tickfont=list(size=10), zeroline=F, showgrid=F, color="black",
title=list(text="year", standoff=10, font=list(size=15))),
yaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=5,
title=list(text="mpg", standoff=15, font=list(size=15)),
gridcolor="white", showticklabels=T, tickfont=list(size=10)))
fig
ggplot
df <- melt(cyl_year , id.vars='year', variable.name='cylinders')
fig = ggplot(data=df, aes(x=year, y=value)) +
geom_line(linetype='twodash', aes(color=cylinders), size=0.8) +
geom_point(aes(color=cylinders), size=2) +
theme_classic() +
scale_color_manual(values=c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')) +
scale_x_continuous(breaks = seq(min(df$year), max(df$year), by=1)) +
scale_y_continuous(breaks = round(seq(min(df$value), max(df$value), by=5), 1)) +
theme(text=element_text(size=13), panel.background=element_rect(fill="white"),
panel.grid.major.x=element_blank(), panel.grid.major.y=element_blank(),
panel.grid.minor.x=element_blank(), panel.grid.minor.y=element_blank(),
axis.text.x=element_text(color="grey20", size=9, angle=0, hjust=.5, vjust=.5, face="plain"),
axis.text.y=element_text(color="grey20", size=9, angle=0, hjust=1, vjust=0, face="plain"),
axis.title.x=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=0, face="plain"),
axis.title.y=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=.5, face="plain"))
fig
————————————————————– Workings –
===============================================================================================================
Except for acceleration, all the varibles display some sort of relationship or trend with mpg, whether positive or negative.
Positive : year
Negative : cylinders, displacement, horsepower, weight
Non-directional : origin
They can be taken into account for predicting mpg, after adjusting for collinearity.