Libraries Used:
library(tidyr)          # data tidying; also provides the %>% pipe used below
library(reshape2)       # melt() for reshaping data frames into long format
library(ggplot2)        # plotting
library(visreg)         # visualising regression model fits
library(randomForest)   # random forest regression
The data is read in for processing:
data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Pinpoint/auto-mpg.data.csv",
                 header = TRUE,
                 col.names = c("mpg", "cylinders", "displacement", "horsepower", "weight",
                               "acceleration", "model_year", "origin", "car_name"))
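If the raw file marks missing horsepower values with "?" (as the UCI Auto MPG data does), an alternative is to declare that at read time so that horsepower arrives as a numeric column; the na.strings value here is my assumption about the file:
data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Pinpoint/auto-mpg.data.csv",
                 header = TRUE, na.strings = "?",  # assumption: "?" marks missing values
                 col.names = c("mpg", "cylinders", "displacement", "horsepower", "weight",
                               "acceleration", "model_year", "origin", "car_name"))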
Exploratory Analysis and Visualizations
The data consists of 397 observations and 9 attributes. The “car_name” column contains unique values. The “origin” column contains the values 1, 2, and 3; I have made an assumption here that these represent the ‘Americas’, ‘European Union’, and ‘Asia’ origin regions. The ‘model_year’ column indicates that the cars are from the 1970s (model years 1970 - 1979). The remaining columns give the specifications of each car.
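Under that assumption, the origin codes can be relabeled for readability; the 1/2/3-to-region mapping below is my assumption, not something documented in the data:
origin_region <- factor(data$origin, levels = c(1, 2, 3),
                        labels = c("Americas", "European Union", "Asia"))
table(origin_region)  # count of cars per assumed region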
Please refer to the correlogram further below for a deeper look at the relationships in the data.
# horsepower is read in as a factor because "?" marks missing values in the raw
# file; converting via character recovers the true values ("?" becomes NA)
data$horsepower <- as.numeric(as.character(data$horsepower))
data <- na.omit(data)  # drop the few rows with missing horsepower
data$model_year <- data$model_year %>% factor(labels = sort(unique(data$model_year)))
data$origin <- data$origin %>% factor(labels = sort(unique(data$origin)))
data$cylinders <- data$cylinders %>% factor(labels = sort(unique(data$cylinders)))
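A quick structure check confirms the type conversions:
str(data[, c("horsepower", "cylinders", "model_year", "origin")])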
histData <- melt(data[,-c(9)])  # drop car_name, melt the numeric columns to long format
## Using cylinders, model_year, origin as id variables
ggplot(data = histData, mapping = aes(value)) +
  facet_wrap(~variable, scales = "free_x", nrow = 3) +
  geom_histogram(colour = "black", fill = "blue") +
  theme_bw() +
  labs(title = "Histogram (distribution of values) of the data, per variable",
       subtitle = "Acceleration is approximately normally distributed; the remaining variables are right-skewed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, mapping = aes(mpg, fill = cylinders)) +
  geom_histogram(color = "black") +
  theme_bw() +
  labs(x = "Miles Per Gallon", y = "Count of Cars",
       title = "MPG, filled by cylinders, vs. count of cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, mapping = aes(horsepower, fill = cylinders)) +
  geom_histogram(color = "black") +
  theme_bw() +
  labs(x = "Horsepower (HP)", y = "Count of Cars",
       title = "Horsepower, filled by cylinders, vs. count of cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, mapping = aes(cylinders, fill = origin)) +
  geom_bar(position = "dodge") +
  theme_bw() +
  labs(x = "Cylinders", y = "Count of Cars",
       title = "Cylinders, filled by origin, vs. count of cars")
# convert the factors back to numeric codes so they can enter the correlation matrix
data$model_year <- as.numeric(data$model_year)
data$origin <- as.numeric(data$origin)
data$cylinders <- as.numeric(data$cylinders)
cormat <- round(cor(data[,-c(9)]), 2)
# helper: blank out the lower triangle so only the upper triangle is plotted
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
heatData <- melt(upper_tri, na.rm = TRUE)
ggplot(data = heatData, mapping = aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Auto MPG Correlation") +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.border = element_blank(),
        panel.background = element_blank(), axis.ticks = element_blank(),
        legend.justification = c(1, 0), legend.position = c(0.6, 0.7),
        legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top", title.hjust = 0.5))
modelData is split into trainData and testData in an 80:20 ratio. The rows are randomly sampled without replacement and placed into the train/test data frames.
modelData <- data[,-c(7:9)]  # keep mpg and the numeric predictors; drop model_year, origin, car_name
set.seed(100)
indexes <- sample(nrow(modelData), (0.80*nrow(modelData)), replace = FALSE)
trainData <- modelData[indexes, ]
testData <- modelData[-indexes, ]
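As a sanity check, the split proportions can be verified:
nrow(trainData) / nrow(modelData)  # ~0.80
nrow(testData) / nrow(modelData)   # ~0.20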
Model 1 - Linear Regression Model.
# fit on the training split only, so the held-out testData gives an honest RMSE
model1 <- lm(mpg ~ weight + horsepower + cylinders + displacement + acceleration,
             data = trainData)
summary(model1)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + cylinders + displacement +
## acceleration, data = modelData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3953 -2.6232 -0.3332 2.2797 16.2448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.7783404 2.0295129 20.585 < 2e-16 ***
## weight -0.0061088 0.0007429 -8.223 2.97e-15 ***
## horsepower -0.0022963 0.0087944 -0.261 0.7941
## cylinders -1.1143698 0.4883627 -2.282 0.0230 *
## displacement 0.0018826 0.0085032 0.221 0.8249
## acceleration 0.2053428 0.0983581 2.088 0.0375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.281 on 391 degrees of freedom
## Multiple R-squared: 0.7042, Adjusted R-squared: 0.7004
## F-statistic: 186.2 on 5 and 391 DF, p-value: < 2.2e-16
predictions1 <- predict(model1, newdata = testData)
sqrt(mean((predictions1 - testData$mpg)^2))  # RMSE on the held-out test set
## [1] 4.081768
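Since the same RMSE computation is repeated for each model, it can be wrapped in a small helper; rmse is a name I am introducing here, not a function from any of the loaded packages:
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))  # root mean squared error
}
rmse(testData$mpg, predictions1)  # matches the value computed above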
Model 2 - Random forest model with 15 trees, trying mtry = 4 variables at each split.
model2 <- randomForest(mpg ~ ., data = trainData, importance = TRUE,
                       ntree = 15, mtry = 4, replace = TRUE)
summary(model2)
## Length Class Mode
## call 7 -none- call
## type 1 -none- character
## predicted 317 -none- numeric
## mse 15 -none- numeric
## rsq 15 -none- numeric
## oob.times 317 -none- numeric
## importance 10 -none- numeric
## importanceSD 5 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 317 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
predictions2 <- predict(model2, newdata = testData)
sqrt(mean((predictions2 - testData$mpg)^2))  # test-set RMSE for the random forest
## [1] 3.396063
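Because the forest was grown with importance = TRUE, the per-variable importance can also be inspected; this output is not part of the original report:
importance(model2)   # %IncMSE and IncNodePurity for each predictor
varImpPlot(model2)   # dot chart of the same importance measures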
I have also used the code chunks below:
plot(model1)
visreg(model1)
plot(model2)
These plots are included in the accompanying PDF file, which contains all of the exploratory and model-analysis plots.
I have performed the model analysis taking mpg as the target variable and weight, horsepower, cylinders, displacement, and acceleration as the independent variables. The random forest proves to be the better regression model, reducing the test RMSE from about 4.08 (linear regression) to about 3.40.
Yes, the customer can create a new attribute containing metric values that indicate whether a car performs better. In this scenario, since I have taken mpg as the target variable, the new metric could contain values such as ‘Very Efficient’, ‘Efficient’, ‘Good’, and ‘Bad’.
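A minimal sketch of such a metric, assuming illustrative mpg cut-offs (the breakpoints below are my own choices, not values derived from the analysis):
data$efficiency <- cut(data$mpg,
                       breaks = c(-Inf, 15, 25, 35, Inf),
                       labels = c("Bad", "Good", "Efficient", "Very Efficient"))
table(data$efficiency)  # distribution of cars across the assumed categories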