Libraries Used
library(plyr)
## Warning: package 'plyr' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(treemap)
## Warning: package 'treemap' was built under R version 3.5.3
library(data.table)
## Warning: package 'data.table' was built under R version 3.5.3
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(h2o)
## Warning: package 'h2o' was built under R version 3.5.3
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
##
## hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init(nthreads = -1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\SHARAT~1\AppData\Local\Temp\Rtmp0ExEvK/h2o_Sharathchandra_started_from_r.out
## C:\Users\SHARAT~1\AppData\Local\Temp\Rtmp0ExEvK/h2o_Sharathchandra_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 4 seconds 559 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.22.1.1
## H2O cluster version age: 5 months and 28 days !!!
## H2O cluster name: H2O_started_from_R_Sharathchandra_urn847
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.76 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.2 (2018-12-20)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (5 months and 28 days)!
## Please download and install the latest version from http://h2o.ai/download/
library(tm)
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer
The PriceRatio.csv file provided was not cleanly tab-separated. I therefore first read it with the read.delim function and saved the result as PriceRatio2.csv, then read that file with read.csv. Blank values are substituted with NA while reading, which makes NA handling easier during analysis and prevents data loss.
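The conversion step itself is not shown in this report; a minimal sketch of what it could have looked like, assuming the original tab-delimited file sits in the same folder (the path, the sep = "\t" delimiter and the raw object name are assumptions, not taken from the report):
# Assumed one-off conversion: read the badly tab-separated original and re-save it as CSV
raw <- read.delim("C:/Users/Sharathchandra/Desktop/TrueCar/PriceRatio.csv",
                  header = TRUE, sep = "\t", stringsAsFactors = FALSE)
write.csv(raw, "C:/Users/Sharathchandra/Desktop/TrueCar/PriceRatio2.csv", row.names = FALSE)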
data <- read.csv("C:/Users/Sharathchandra/Desktop/TrueCar/PriceRatio2.csv", header = T, stringsAsFactors = F, na.strings=c("","NA"))
Data cleansing was needed while working on the dataset.
I dropped columns 1 to 7 - “ï..sale_date”, “chrome_style_id”, “country_code”, “customer_postal_code”, “days_in_inventory”, “dealership_id”, “dealership_postal_code” - and column 13 - “new_or_used”. The reasons are as follows: 1. Columns 1, 3 and 13 contain the same value for every record. 2. Column 2 was dropped because I focused on column 21 - “style_name” - for visualization; it carries more information than column 2, which only holds numeric IDs. 3. Columns 4, 6 and 7 are purely numeric and gave misleading results in the regression models; more importantly, the NA and blank values in these three columns could not be managed. 4. Column 5 deserved more thought, since it could provide crucial results, but it had NA values for 7595 records; here too the NAs could not be managed, as filling them with wrong values would yield wrong results.
NA management and removal: column 27 - “discarded” - had NA values for 1216 records, and the other crucial columns also contained NA and blank values, so those records had to be removed. Column 25 - “curve_rating” - had NA values for 2310 records and was removed for the same reason. Column 9 - “finance_type”, column 11 - “model_name” and column 19 - “sale_type” had blank values; here the NAs were managed by substituting the value “UNKNOWN”. In column 23 - “truecar_sale” - the values 2, 3, 4, 5, 6 and 7 were replaced with “FALSE”: for those records the “adjusted_sale_price” column contained the value “False”, so I assumed these cars might not have been sold through TrueCar.
Outliers were managed by substituting NAs with the mean of the respective column, to normalize the data and reduce skewness. There were outliers in every numeric column; they appear to stem from human oversight, mainly misplaced decimal points. The categorical columns had no outliers - they occurred only in the numeric columns.
data <- data[,-c(1:7,13)]                 # drop columns 1-7 and 13 (reasons above)
data <- data[!is.na(data$discarded),]     # drop records with NA in discarded
data <- data[!is.na(data$curve_rating),]  # drop records with NA in curve_rating
data$finance_type[is.na(data$finance_type)] <- "UNKNOWN"
data$model_name[is.na(data$model_name)] <- "UNKNOWN"
data$sale_type[is.na(data$sale_type)] <- "UNKNOWN"
# Recode the numeric codes 2-7 in truecar_sale as "FALSE" (assumed not sold through TrueCar)
data$truecar_sale[data$truecar_sale %in% c("2","3","4","5","6","7")] <- "FALSE"
# Coerce to numeric; non-numeric entries (e.g. "False") become NA and are filled in below
data$adjusted_sale_price <- as.numeric(data$adjusted_sale_price)
data$discarded <- as.character(data$discarded)
# Replace the remaining NAs in the numeric columns with their pre-computed column means
data$adjusted_sale_price[is.na(data$adjusted_sale_price)] <- 35791.97
data$odometer[is.na(data$odometer)] <- 536
data$price_invoice[is.na(data$price_invoice)] <- 36066.74
data$price_msrp[is.na(data$price_msrp)] <- 37080.58
data$price_retail[is.na(data$price_retail)] <- 36085.36
data$price_ratio <- as.numeric(data$price_ratio)
data$price_ratio[is.na(data$price_ratio)] <- 0.91151
rownames(data) <- NULL
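The mean values above are hard-coded constants. A sketch of an alternative that computes the column means from the data instead (the helper name impute_mean and the numeric_cols vector are mine, not from the original analysis; re-running it would give slightly different values than the rounded constants above):
# Hypothetical helper: replace NAs in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x <- as.numeric(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
numeric_cols <- c("adjusted_sale_price", "odometer", "price_invoice",
                  "price_msrp", "price_retail", "price_ratio")
data[numeric_cols] <- lapply(data[numeric_cols], impute_mean)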
The line of code below checks whether any NAs remain and guards against further data loss. After the previous cleansing steps I had 8583 observations, and after executing the line below I still had 8583 observations, so no records were lost at this step.
data <- na.omit(data)
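A quick way to confirm both claims - that no missing values remain and that na.omit() drops no rows - is sketched below (the n_before name is mine):
n_before <- nrow(data)           # 8583 observations before the check
sum(is.na(data))                 # should be 0: no missing values left
n_before == nrow(na.omit(data))  # should be TRUE: na.omit() removes no rows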
Final preparation of the data for modelling: the categorical, logical and character columns must be converted to factors.
columns.factor <- c("dma_id","finance_type","make_name","model_name","model_year","provider_name","sale_type","state","style_name","tc_region_id","truecar_sale","curve_rating","discarded")
# sapply() would simplify the result to a character matrix, so use lapply() to keep factor columns
data[columns.factor] <- lapply(data[columns.factor], as.factor)
columns.numeric <- c("odometer","price_invoice","price_msrp","price_retail","adjusted_sale_price","price_ratio")
data[columns.numeric] <- lapply(data[columns.numeric], as.numeric)
I used the H2O.ai library for the machine-learning models and for its instant visualizations, which help in understanding the statistics and the variables that most affect the “price_ratio” target variable. Please note the steps below for starting and stopping the H2O server.
1. Start the server: h2o.init(nthreads = -1).
2. Open any web browser and go to localhost:54321/ to access the model-related information.
3. Execute the code chunks below; the models and predictions can then be inspected on that local server.
4. Shut down the server with h2o.shutdown(), then enter Y.
In brief, the library also provides a user interface.
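The same lifecycle as a short R sketch (browseURL is base R; the address is the default one shown in the startup message above):
h2o.init(nthreads = -1)              # start, or connect to, a local H2O cluster
browseURL("http://localhost:54321")  # open the H2O Flow web UI in the default browser
# ... fit the models below, then inspect them in Flow ...
h2o.shutdown(prompt = TRUE)          # stop the cluster; answer Y when prompted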
The modelling code follows. I randomly split the data into 80% for training and 20% for testing.
h2o.data1 <- as.h2o(data, destination_frame = "data1")
targetVariable1 <- "curve_rating"
# Predictors for model 1; the response curve_rating is excluded
otherVariables1 <- c("dma_id","finance_type","make_name","model_name","model_year","provider_name","sale_type","state","style_name","tc_region_id","truecar_sale","discarded")
parts1 <- h2o.splitFrame(h2o.data1, 0.8)
train1 <- parts1[[1]]
test1 <- parts1[[2]]
model1 <- h2o.randomForest(otherVariables1, targetVariable1, train1, nfolds = 2, model_id = "RF-Model-1")
predictmodel1 <- h2o.predict(model1,test1)
h2o.data2 <- as.h2o(data, destination_frame = "data2")
targetVariable2 <- "price_ratio"
otherVariables2 <- c("odometer","price_invoice","price_msrp","price_retail","adjusted_sale_price","curve_average_market_price")
parts2 <- h2o.splitFrame(h2o.data2, 0.8)
train2 <- parts2[[1]]
test2 <- parts2[[2]]
model2 <- h2o.glm(otherVariables2, targetVariable2, train2, nfolds = 2,model_id = "GLM-Model-2")
predictmodel2 <- h2o.predict(model2,test2)
h2o.data3 <- as.h2o(data, destination_frame = "data3")
targetVariable3 <- "price_ratio"
otherVariables3 <- c("odometer","price_invoice","price_msrp","price_retail","adjusted_sale_price","curve_average_market_price","make_name","model_name","model_year","sale_type","state","truecar_sale","curve_rating")
parts3 <- h2o.splitFrame(h2o.data3, 0.8)
train3 <- parts3[[1]]
test3 <- parts3[[2]]
model3 <- h2o.randomForest(otherVariables3, targetVariable3, train3, nfolds = 2,model_id = "RF-Model-3")
predictmodel3 <- h2o.predict(model3,test3)
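Beyond the Flow UI, the fitted models can also be inspected directly from R; a minimal sketch using standard h2o functions (shown for model3, but the same calls work for the other two models):
h2o.performance(model3, newdata = test3)  # hold-out metrics (MSE, RMSE, R^2, ...)
h2o.varimp(model3)                        # variable-importance table for the random forest
h2o.varimp_plot(model3)                   # bar chart of the most important predictors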
Visualizations to better understand the data and to provide evidence for the data's story.
dmaCount <- as.data.frame(table(data$dma_id))
colnames(dmaCount) <- c("dma_id","counts")
dmaCount <- arrange(dmaCount,desc(counts))
dmaCountTop25 <- dmaCount[1:25,]
plot1 <- ggplot(data = dmaCountTop25) +
  geom_bar(mapping = aes(x = reorder(dma_id, counts), y = counts, fill = counts),
           stat = "identity", position = "dodge") +
  labs(x = "Top 25 DMA_Id (Cities) data records", y = "Number Of Data records",
       title = "Top 25 DMA_Id (Cities) data records",
       subtitle = "DMA IDs - Visualization to assist TV Broadcast campaigning team") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  geom_text(aes(x = dma_id, y = counts, label = counts), hjust = -0.5, size = 2, inherit.aes = TRUE) +
  theme_bw()
plot1
financeCount <- as.data.frame(table(data$finance_type))
colnames(financeCount) <- c("finance_type","counts")
plot2 <- ggplot(data = financeCount) +
  geom_bar(mapping = aes(x = reorder(finance_type, counts), y = counts, fill = counts),
           stat = "identity", position = "dodge") +
  labs(x = "Finance Types of the cars sold", y = "Count of Types of Finance",
       title = "Count of Finance Types of cars sold",
       subtitle = "UNKNOWN (arbitrary value as data is missing/blank)") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(aes(x = finance_type, y = counts, label = counts), hjust = 0.5, vjust = -1, size = 4, inherit.aes = TRUE) +
  theme_bw()
plot2
makeNameCount <- as.data.frame(table(data$make_name))
colnames(makeNameCount) <- c("make_name","counts")
makeNameCount <- arrange(makeNameCount,desc(counts))
plot3 <- ggplot(data = makeNameCount) +
  geom_bar(mapping = aes(x = reorder(make_name, counts), y = counts, fill = counts),
           stat = "identity", position = "dodge") +
  labs(x = "Car Manufacturers' List", y = "Number Of Cars Sold",
       title = "Number of Cars sold per Manufacturer") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  geom_text(aes(x = make_name, y = counts, label = counts), hjust = -0.5, size = 2, inherit.aes = TRUE) +
  theme_bw()
plot3
modelCount <- data
modelCount$make_model <- paste(modelCount$make_name, modelCount$model_name, sep = "\n")
modelCount <- as.data.frame(table(modelCount$make_model))
colnames(modelCount) <- c("make_model_name","counts")
modelCount <- arrange(modelCount,desc(counts))
modelCountTop25 <- modelCount[1:25,]
plot4 <- ggplot(data = modelCountTop25) +
  geom_bar(mapping = aes(x = reorder(make_model_name, counts), y = counts, fill = counts),
           stat = "identity", position = "dodge") +
  labs(x = "Top 25 Cars Sold", y = "Number of particular car sold", title = "Top 25 Cars Sold") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  geom_text(aes(x = make_model_name, y = counts, label = counts), hjust = -0.5, size = 2, inherit.aes = TRUE) +
  theme_bw()
plot4
saleTypeCount <- as.data.frame(table(data$sale_type))
colnames(saleTypeCount) <- c("sale_type","counts")
saleTypeCount$percent <- saleTypeCount$counts/sum(saleTypeCount$counts)*100
saleTypeCount$percent <- round(saleTypeCount$percent,2)
saleTypeCount <- arrange(saleTypeCount, desc(percent))
saleTypeCount$label <- paste(saleTypeCount$sale_type,saleTypeCount$percent,sep = "\n")
saleTypeCount$label <- paste(saleTypeCount$label,"%")
treeplot <- treemap(saleTypeCount, index = "label", vSize = "counts", type = "index", palette = "Dark2", title="Sale Types Percentage", fontsize.title=12,fontsize.labels = 10)
provNameCount <- as.data.frame(table(data$provider_name))
colnames(provNameCount) <- c("provider_name","counts")
provNameCount$percent <- provNameCount$counts/sum(provNameCount$counts)*100
provNameCount$percent <- round(provNameCount$percent,2)
provNameCount <- arrange(provNameCount, desc(percent))
provNameCount$label <- paste(provNameCount$provider_name,provNameCount$percent,sep = "\n")
provNameCount$label <- paste(provNameCount$label,"%")
treeplot <- treemap(provNameCount, index = "label", vSize = "counts", type = "index", palette = "Dark2", title="Data records Provider percentage", fontsize.title=12,fontsize.labels = 10)
datastyle_name <- fread("C:/Users/Sharathchandra/Desktop/TrueCar/PriceRatio2.csv", select = c(21), header = T, stringsAsFactors = F)
cleanData <- function(cleanData1){
  callText <- gsub(pattern = "\\W", replacement = " ", cleanData1)      # non-word characters to spaces
  callText <- tolower(callText)                                         # lower-case everything
  callText <- removeWords(callText, stopwords())                        # drop common English stop words
  callText <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", callText) # drop single-letter tokens
  callText <- gsub(pattern = "_", replacement = " ", callText)          # underscores to spaces
  callText <- stripWhitespace(callText)                                 # collapse repeated whitespace
  return(callText)
}
callText1 <- cleanData(datastyle_name)
wordsBreak <- function(cleanData2){
  callTextBag <- str_split(cleanData2, pattern = "\\s+")  # split on runs of whitespace
  callTextBag <- unlist(callTextBag)                      # flatten to a single vector of words
  return(callTextBag)
}
callTextBag1 <- wordsBreak(callText1)
styleNameCloud <- wordcloud(callTextBag1, min.freq = 50, random.order = F, color = "blue")
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
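To complement the word cloud, the most frequent style-name terms can also be tabulated directly; a small sketch:
# Top 20 most frequent terms in the cleaned style_name text
head(sort(table(callTextBag1), decreasing = TRUE), 20)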
I have also attached the MS Word document, the plots PDF and the cleaned CSV file used to consolidate my findings.