Auto MPG Data Analysis

Libraries Used:

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.5.3

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.5.3

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

library(visreg)

## Warning: package 'visreg' was built under R version 3.5.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.5.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Data is read into for processing

data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Pinpoint/auto-mpg.data.csv",header = T, col.names = c("mpg","cylinders","displacement","horsepower","weight","acceleration","model_year", "origin","car_name"))

Exploratory Analysis and Visualizations

The data consists of 397 observations and 9 attributes. “car_name” column contains unique values. “origin” column conatins values - 1,2 and 3. So, I have made an assumption here that these values represnt ‘Americas’, ‘European Union’ and ‘Asia’ origin regions. ‘model_year’ column says that the 80’s (1970 - 1979) cars’ data. The other columns give a brief specifications or values of cars.

Except Acceleration column, rest of the column values are right skewed.
There are more number of ‘4’ cylinder cars than others. And, they also prove to have a better miles per gallon value. ‘8’ cylinder cars have the worst MPG values. There are not many ‘3’ and ‘5’ cylinder car observations. But, we can make a safe assumption that ‘8’ cylinder cars give poor MPG.
Most of ‘4’ cylinder cars have better HP along with MPG values. Most of ‘8’ cylinder cars have less HP along with worse MPG values.
‘1’ origin produces more of ‘6’ and ‘8’ cylinder cars.
The most important plot which gives us how the attributes are correlated. This visualization says that cylinders, displacement and weight values negatively impact mpg value. The cars have got better with mpg value along with years. This is understandable as the scientific and manufacturing process got better. And, according to assumption made, origin ‘3’ - Asia cars have better mpg value.

Please do look into this corogram for further analysis of data.

data$horsepower <- as.numeric(data$horsepower)
data$model_year <- data$model_year %>% factor(labels = sort(unique(data$model_year)))
data$origin <- data$origin %>% factor(labels = sort(unique(data$origin)))
data$cylinders <- data$cylinders %>% factor(labels = sort(unique(data$cylinders)))

histData <- melt(data[,-c(9)])

## Using cylinders, model_year, origin as id variables

ggplot(data = histData,mapping = aes(value)) + facet_wrap(~variable,scales = "free_x",nrow = 3) + geom_histogram(colour="black", fill="blue") + theme(panel.background = element_blank()) + theme_bw() + labs(title = "Histogram (Distributed values) of Data per variable", subtitle = "Except Acceleration Data column which is normal distribution, rest are right skewed")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = data,mapping = aes(mpg, fill=cylinders)) + geom_histogram(color="black") + theme(panel.background = element_blank()) + theme_bw() + labs(x = "Miles Per Gallon", y = "Count of Cars", title = "MPG filtered by cylinders VS Count of Cars")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = data,mapping = aes(horsepower, fill=cylinders)) + geom_histogram(color="black") + theme(panel.background = element_blank()) + theme_bw() + labs(x = "Horsepower (HP)", y = "Count of Cars", title = "HP filtered by cylinders VS Count of Cars")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = data,mapping = aes(cylinders,fill=origin)) + geom_bar(position = "dodge") + theme(panel.background = element_blank()) + theme_bw() + labs(x = "Cylinders", y = "Count of Cars", title = "Cylinders filtered by origin VS Count of Cars")

data$model_year <- as.numeric(data$model_year)
data$origin <- as.numeric(data$origin)
data$cylinders <- as.numeric(data$cylinders)
cormat <- round(cor(data[,-c(9)]),2)

get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
heatData <- melt(upper_tri, na.rm = TRUE)
ggplot(data = heatData, mapping = aes(Var2, Var1, fill = value)) + geom_tile(color = "white") + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Auto MPG Correlation") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) + coord_fixed() + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), panel.grid.major = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.ticks = element_blank(), legend.justification = c(1, 0), legend.position = c(0.6, 0.7), legend.direction = "horizontal")+ guides(fill = guide_colorbar(barwidth = 7, barheight = 1, title.position = "top", title.hjust = 0.5))

ModelData - Split into TrainData and TestData in ratio of 80:20. The data is randomly sampled and put into train/test dataframes.

modelData <- data[,-c(7:9)]
set.seed(100)

indexes <- sample(nrow(modelData), (0.80*nrow(modelData)), replace = FALSE)

trainData <- modelData[indexes, ]
testData <- modelData[-indexes, ]

Model 1 - Linear Regression Model.

model1 <- lm(mpg~weight+horsepower+cylinders+displacement+acceleration,data = modelData)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ weight + horsepower + cylinders + displacement + 
##     acceleration, data = modelData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3953  -2.6232  -0.3332   2.2797  16.2448 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.7783404  2.0295129  20.585  < 2e-16 ***
## weight       -0.0061088  0.0007429  -8.223 2.97e-15 ***
## horsepower   -0.0022963  0.0087944  -0.261   0.7941    
## cylinders    -1.1143698  0.4883627  -2.282   0.0230 *  
## displacement  0.0018826  0.0085032   0.221   0.8249    
## acceleration  0.2053428  0.0983581   2.088   0.0375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.281 on 391 degrees of freedom
## Multiple R-squared:  0.7042, Adjusted R-squared:  0.7004 
## F-statistic: 186.2 on 5 and 391 DF,  p-value: < 2.2e-16

predictions1 <- predict(model1, newdata = testData)
sqrt(mean((predictions1 - testData$mpg)^2))

## [1] 4.081768

Model 2 - Random Forest Algorithm Model with 15 trees and mtry 4 variables at a time.

model2 <- randomForest(mpg ~ ., data = trainData, importance = TRUE, ntree = 15, mtry = 4, replace = T)
summary(model2)

##                 Length Class  Mode     
## call              7    -none- call     
## type              1    -none- character
## predicted       317    -none- numeric  
## mse              15    -none- numeric  
## rsq              15    -none- numeric  
## oob.times       317    -none- numeric  
## importance       10    -none- numeric  
## importanceSD      5    -none- numeric  
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y               317    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call

predictions2 <- predict(model2, newdata = testData)
sqrt(mean((predictions2 - testData$mpg)^2))

## [1] 3.396063

I have used the below code chunks as well.

plot(model1) visreg(model1) plot(model2)

These plots are included with PDF file containing all plots - Exploratory and Model Analysis.

I have performed model analysis taking mpg as target variable and weight, horsepower,cylinders,displacement and acceleration as independent variables. RF model proves to be a better regression model with reduction in RMSE value.

Yes, the customer can create a new attribute containing metric values which can tell if the car performs better. Here, in this scenario, as I have considered mpg as target variable - the new metric can conatin values as ‘Very Efficient’, ‘Efficient’, ‘Good’ and ‘Bad’.

Auto MPG Data Analysis

Sharathchandra BM