Auto dataset

1) Load packages
2) Import Data
3) Data preparation
a) Which of the predictors are quantitative, and which are qualitative?
b) What is the range of each quantitative predictor?
c) What is the mean and standard deviation of each quantitative predictor?
d) Range, mean and standard deviation after removing observations 10-85
e) Graphical examination of predictors
f) Variables useful in predicting mpg

Description
Gas mileage, horsepower, and other information for 392 vehicles.

Source
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The dataset was used in the 1983 American Statistical Association Exposition.

References
This dataset is a part of the course material of the book : Introduction to Statistical Learning with R

Short description of variables
- mpg : miles per gallon
- cylinders : Number of cylinders between 4 and 8
- displacement : Engine displacement (cu. inches)
- horsepower : Engine horsepower
- weight : Vehicle weight (lbs.)
- acceleration : Time to accelerate from 0 to 60 mph (sec.)
- year : Model year (modulo 100)
- origin : Origin of car (1. American, 2. European, 3. Japanese)
- name : Vehicle name

===============================================================================================================

1) Load packages

library(ggplot2)
library(RColorBrewer)
pacman::p_load(plotly, tidyverse, reshape2)

Some preliminary workings

# save default options and parameters
defop = options()
defpar = par(no.readonly=T)

# function to modify plot parameters
plot_pars = function(w=7,h=7) {options(repr.plot.width=w, repr.plot.height=h)}

===============================================================================================================

2) Import Data

fdir = r"(E:\Data Science\Statistics\Intro to Statistical Learning with R\datasets)"
fpath = file.path(fdir,'Auto.csv')
auto = read.csv(fpath)

dim(auto)

## [1] 397   9

head(auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

# check for missing values
anyNA(auto)

## [1] FALSE

go to toc

===============================================================================================================

3) Data preparation

# structure of dataset
str(auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130" "165" "150" "150" ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

The fact that a column containing numbers (horsepower) has been saved as ‘chr’ is a red flag. This may happen when all the elements in a column are not numerical. This column will have to be further examined.

# Rows with non-numeric values
auto[which(is.na(as.numeric(auto$horsepower))),]

## Warning in which(is.na(as.numeric(auto$horsepower))): NAs introduced by coercion

##      mpg cylinders displacement horsepower weight acceleration year origin
## 33  25.0         4           98          ?   2046         19.0   71      1
## 127 21.0         6          200          ?   2875         17.0   74      1
## 331 40.9         4           85          ?   1835         17.3   80      2
## 337 23.6         4          140          ?   2905         14.3   80      1
## 355 34.5         4          100          ?   2320         15.8   81      2
##                     name
## 33            ford pinto
## 127        ford maverick
## 331 renault lecar deluxe
## 337   ford mustang cobra
## 355          renault 18i

Since the number of missing values is very small, those rows can just be deleted.

# Rows with ? in horsepower
which(auto$horsepower == '?')

## [1]  33 127 331 337 355

# Rows with ? in any column
auto[which(apply(auto, 1, function(x) any(x %in% c("?")))), ]

##      mpg cylinders displacement horsepower weight acceleration year origin
## 33  25.0         4           98          ?   2046         19.0   71      1
## 127 21.0         6          200          ?   2875         17.0   74      1
## 331 40.9         4           85          ?   1835         17.3   80      2
## 337 23.6         4          140          ?   2905         14.3   80      1
## 355 34.5         4          100          ?   2320         15.8   81      2
##                     name
## 33            ford pinto
## 127        ford maverick
## 331 renault lecar deluxe
## 337   ford mustang cobra
## 355          renault 18i

# Deleting rows with ?
auto = auto[-which(auto$horsepower=='?'), ]
sum(auto=="?")

## [1] 0

dim(auto)

## [1] 392   9

class(auto$horsepower)

## [1] "character"

# Convert horsepower to numeric
auto$horsepower = as.numeric(auto$horsepower)
str(auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

go to toc

===============================================================================================================

a) Which of the predictors are quantitative, and which are qualitative?

Quantitative → numerical values.
Qualitative → values in one of K different classes, or categories.

# No. of unique values in columns
sapply(auto, function(x) length(unique(x)))

##          mpg    cylinders displacement   horsepower       weight acceleration 
##          127            5           81           93          346           95 
##         year       origin         name 
##           13            3          301

variable	description	variable type
mpg	miles per gallon	quantitative
cylinders	Number of cylinders between 4 and 8	qualitative or categorical
displacement	Engine displacement (cu. inches)	quantitative
horsepower	Engine horsepower	quantitative
weight	Vehicle weight (lbs.)	quantitative
acceleration	Time to accelerate from 0 to 60 mph (sec.)	quantitative
year	Model year (modulo 100)	quantitative
origin	Origin of car (1. American, 2. European, 3. Japanese)	qualitative or categorical
name	Vehicle name	qualitative or categorical

“year” can be considered to be quantitative in the sense that it could indirectly reflect the impact of technological abilities of the times, otherwise it can be considered qualitative (categorical).

go to toc

===============================================================================================================

b) What is the range of each quantitative predictor?

# Quantitative predictors
quant_data = auto[, setdiff(names(auto), c('cylinders','origin','name'))]
names(quant_data)

## [1] "mpg"          "displacement" "horsepower"   "weight"       "acceleration"
## [6] "year"

summ = data.frame(sapply(quant_data, range), row.names=c('min','max'))
summ['range',] = summ['max',] - summ['min',]
summ

##        mpg displacement horsepower weight acceleration year
## min    9.0           68         46   1613          8.0   70
## max   46.6          455        230   5140         24.8   82
## range 37.6          387        184   3527         16.8   12

go to toc

===============================================================================================================

c) What is the mean and standard deviation of each quantitative predictor?

summ['mean',] = sapply(quant_data, mean)
summ['sd',] = sapply(quant_data, sd)
round(summ, 2)

##         mpg displacement horsepower  weight acceleration  year
## min    9.00        68.00      46.00 1613.00         8.00 70.00
## max   46.60       455.00     230.00 5140.00        24.80 82.00
## range 37.60       387.00     184.00 3527.00        16.80 12.00
## mean  23.45       194.41     104.47 2977.58        15.54 75.98
## sd     7.81       104.64      38.49  849.40         2.76  3.68

go to toc

===============================================================================================================

d) Range, mean and standard deviation after removing observations 10-85

# Removing rows 10-85
df1 = quant_data[-seq(10,85), ]
dim(df1)

## [1] 316   6

# Check if the 10th row in filtered dataset matches 86th row of unfiltered dataset
all(df1[10,] == quant_data[86, ])

## [1] TRUE

summ = data.frame(sapply(df1, range), row.names=c('min','max'))
summ['range',] = summ['max',] - summ['min',]
summ['mean',] = sapply(df1, mean)
summ['sd',] = sapply(df1, sd)
round(summ, 3)

##          mpg displacement horsepower   weight acceleration   year
## min   11.000       68.000     46.000 1649.000        8.500 70.000
## max   46.600      455.000    230.000 4997.000       24.800 82.000
## range 35.600      387.000    184.000 3348.000       16.300 12.000
## mean  24.404      187.241    100.722 2935.972       15.727 77.146
## sd     7.867       99.678     35.709  811.300        2.694  3.106

# Check sum
sapply(df1, sum)

##          mpg displacement   horsepower       weight acceleration         year 
##       7711.8      59168.0      31828.0     927767.0       4969.7      24378.0

go to toc

===============================================================================================================

e) Graphical examination of predictors

# Pairs plot panels
panel.hist <- function(x, col="thistle", ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = 'whitesmoke', ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y)
    abs_r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * abs_r)
}

# Scatter plot (hue - cylinder)
col_map = setNames(c('#F1DCD7','#E3BAC0','#CD9BB1','#AE80A3','#61566E'), sort(unique(auto$cylinders)))
cols = col_map[unlist(as.factor(auto$cylinders))]
pairs(quant_data, cex=2.5, cex.labels = 2, cex.axis=1.6, pch=21, bg=cols, col='whitesmoke', lwd=0.3,
     diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,15,3))
par(xpd = TRUE)
legend(x=0.4, y=1, fill=unlist(col_map), legend=names(col_map), horiz=T, bty="n")

Observations:
- acceleration appears to be roughly normally distributed.
- Increase in no. of cylinders leads to
• lower : mpg, acceleration
• higher : displacement, horsepower, weight
- There has been a decline in the no. of new models coming out with 8 cylinders.
- Newer models are lighter and and have loss horsepower (presumably because of decreased weight).
- mpg appears to have strong (non-linear) relationships with displacement, horsepower, weight and is negatively correlated with the 3 variables.
- mpg of new models has imporoved over the years.
- displacement, horsepower and weight appear to have a strong positive correlation with each other.
- A moderate negative correlation may exist between horsepower and acceleration.

# Converting column cylinder to factor before using as 'color'
auto$cylinder = as.factor(auto$cylinder)

# Scatter plot - Cylinders as hue
pal = c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')
fig = plot_ly(data=auto, x=~year, y=~mpg, color=~as.factor(cylinders), colors=pal, 
              type="scatter", mode='markers', marker = list(size=8),
              text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin,
                            '<br>cyl:',cylinders), width=700, height=400) %>%
      layout(showlegend=T, legend = list(orientation="v", xanchor="center", x=1.05, y=0.9),
             title = "Scatter Plot", hoverlabel=list(bgcolor=pal),
             xaxis = list(title = "Year", zeroline=F, showgrid=F),
             yaxis = list(title = "mpg", zeroline=F, gridcolor="white"))
fig

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

# Scatter plot - origin as hue
fig = plot_ly(auto, x=~year, y=~mpg, color=~as.factor(origin), width=700, height=400,
              colors = brewer.pal(length(unique(auto$origin)),"Set1"),
              symbol=~origin, symbols = c('triangle-up','circle','x'),
              text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin,
                            '<br>cyl:',cylinder), type="scatter", mode='markers',
                            marker=list(size=8)) %>%
      hide_colorbar() %>%
      layout(showlegend = TRUE, legend = list(orientation="h", xanchor="center", x=0.45, y=1),
             hoverlabel=list(bgcolor="white"), title = "Scatter Plot",
             xaxis = list(title = "Year", zeroline=F, showgrid=F),
             yaxis = list(title = "mpg", zeroline=F, gridcolor="white"))
fig

go to toc

# Color map
cols = character(nrow(auto))
cols[] = '#61566E'
cols[auto$origin == 1] = '#EDD1CB'
cols[auto$origin == 2] = '#AD6E94'

# Hue - origin
pairs(quant_data, cex=2.2, cex.labels = 2, cex.axis=1.6, bg=cols, pch=21, col='whitesmoke',
     diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,16,3))
par(xpd = TRUE)
legend(x=0.4, y=1,, fill=c('#EDD1CB','#AD6E94','#61566E'),
       legend=c(levels(as.factor(auto$origin))), horiz=T, bty="n")

Observations:
- Cars of European (2) and Japenese (3) origin can be seen to be overlapping in many criteria whereas American cars (1) have a larger and distinct spread.
- Clear distinctions can be seen between American and the other 2 carmakers in displacement, horsepower and weight.

– Workings ————————————————————–

# Frequency distribution of cylinders
table(auto$cylinders)

## 
##   3   4   5   6   8 
##   4 199   3  83 103

# year-wise distribution of cars with cylinder count
cyl_year = auto %>% count(year,cylinders) %>% spread(cylinders, n, fill=0)
cyl_year

##    year 3  4 5  6  8
## 1    70 0  7 0  4 18
## 2    71 0 12 0  8  7
## 3    72 1 14 0  0 13
## 4    73 1 11 0  8 20
## 5    74 0 15 0  6  5
## 6    75 0 12 0 12  6
## 7    76 0 15 0 10  9
## 8    77 1 14 0  5  8
## 9    78 0 17 1 12  6
## 10   79 0 12 1  6 10
## 11   80 1 23 1  2  0
## 12   81 0 20 0  7  1
## 13   82 0 27 0  3  0

ggplot with plotly

# year-wise distribution of cars with cylinder count
df <- melt(cyl_year ,  id.vars='year', variable.name='cylinders')

p = ggplot(data=df, aes(x=year, y=value)) + 
        geom_line(linetype='twodash', aes(color=cylinders)) + 
        geom_point(aes(color=cylinders)) + 
        theme_classic() + 
        scale_color_brewer(palette="Paired") + 
        theme(legend.title=element_text(size=10), legend.text=element_text(size=8))

fig = ggplotly(p, width=650, height=350)
fig = fig %>% layout(fig, plot_bgcolor='white', 
       xaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=1, 
                  tickfont=list(size=10), zeroline=F, showgrid=F, color="black",
                  title=list(text="year", standoff=10, font=list(size=15))),
       yaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=5, 
                  title=list(text="mpg", standoff=15, font=list(size=15)),
                  gridcolor="white", showticklabels=T, tickfont=list(size=10)))
fig

ggplot

df <- melt(cyl_year ,  id.vars='year', variable.name='cylinders')

fig = ggplot(data=df, aes(x=year, y=value)) + 
    geom_line(linetype='twodash', aes(color=cylinders), size=0.8) + 
    geom_point(aes(color=cylinders), size=2) + 
    theme_classic() + 
    scale_color_manual(values=c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')) + 
    scale_x_continuous(breaks = seq(min(df$year), max(df$year), by=1)) + 
    scale_y_continuous(breaks = round(seq(min(df$value), max(df$value), by=5), 1)) + 
    theme(text=element_text(size=13), panel.background=element_rect(fill="white"),
          panel.grid.major.x=element_blank(), panel.grid.major.y=element_blank(),
          panel.grid.minor.x=element_blank(), panel.grid.minor.y=element_blank(),
          axis.text.x=element_text(color="grey20", size=9, angle=0, hjust=.5, vjust=.5, face="plain"),
          axis.text.y=element_text(color="grey20", size=9, angle=0, hjust=1, vjust=0, face="plain"),  
          axis.title.x=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=0, face="plain"),
          axis.title.y=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=.5, face="plain"))
fig

————————————————————– Workings –

go to toc

===============================================================================================================

f) Variables useful in predicting mpg

Except for acceleration, all the varibles display some sort of relationship or trend with mpg, whether positive or negative.
Positive : year
Negative : cylinders, displacement, horsepower, weight
Non-directional : origin
They can be taken into account for predicting mpg, after adjusting for collinearity.

——————————————– That’s All Folks! ——————————————–

Auto dataset

Ch 02 - Statistical Learning - Q9 (applied)

1) Load packages

2) Import Data

3) Data preparation

a) Which of the predictors are quantitative, and which are qualitative?

b) What is the range of each quantitative predictor?

c) What is the mean and standard deviation of each quantitative predictor?

d) Range, mean and standard deviation after removing observations 10-85

e) Graphical examination of predictors

f) Variables useful in predicting mpg