Exploratory Data Analysis on CarDash Data
Exploratory Data Analysis on CarDash Data
CarDash is a full service automative concierge provider that picks your car, oversee the services and basically advocates on your behalf.
Goal: Provide a set of recommendations on how to improve their business or product based on the given dataset.
| Column Names | Description |
|---|---|
| orderid | unique identifier of order |
| parentorderid | unique identifier of order that may contain associated children orders |
| contactcustomerid | unique identifier of customer |
| servicecenterid | unique identifier of service center |
| createdat | timestamp that customer placed the order |
| pickupdate | timestamp that driver picked up the order |
| closingdate | timestamp of vehicle return and order close |
| finalinvoice | final invoice amount to customer |
| tip | customer tip amount |
| promocodediscount | discount value of order |
| grossrevenue | finalinvoice + tip + promocodediscount |
| netrevenue | CarDash revenue after deducting: payment to service centers for work, parts costs, corporate discount promo codes, warranty and returns. |
1 Preparing Data
Loading Libraries
library(readr)
library(dplyr)
library(DT)
library(ggplot2)
library(gridExtra)
library(forcats)
library(lubridate)
library(corrplot)
library(randomForest)
library(scales)Loading Data
data <- read_csv("Data.csv")
#servicecenterid and parentorderid seems to be integer but we know it should be a categorical so converting them into factors.
data$parentorderid <- as.factor(data$parentorderid)
data$servicecenterid <- as.factor(data$servicecenterid)
data$contactcustomerid <- as.factor(data$contactcustomerid)
data$pickupdate <- as.Date(as.character(data$pickupdate), format="%Y-%m-%d")
data$createdat <- as.Date(as.character(data$createdat), format="%Y-%m-%d")
data$closingdate <- as.Date(as.character(data$closingdate), format="%Y-%m-%d")Lets have a glimpse at the first 10 records in the data.
data## # A tibble: 6,562 x 12
## orderid parentorderid contactcustomerid servicecenterid createdat
## <int> <fct> <fct> <fct> <date>
## 1 1169 <NA> 398 81 2017-07-31
## 2 1400 <NA> 1215 341 2017-08-08
## 3 1457 <NA> 1244 <NA> 2017-08-09
## 4 1630 <NA> 1394 180 2017-08-18
## 5 1630 <NA> 1394 153 2017-08-18
## 6 1689 <NA> 18 <NA> 2017-08-22
## 7 1745 <NA> 1027 24 2017-08-24
## 8 2494 <NA> 1982 38 2017-10-06
## 9 2494 <NA> 1982 714 2017-10-06
## 10 2494 <NA> 1982 700 2017-10-06
## # ... with 6,552 more rows, and 7 more variables: pickupdate <date>,
## # closingdate <date>, finalinvoice <dbl>, tip <dbl>,
## # promocodediscount <int>, grossrevenue <dbl>, netrevenue <dbl>
1.1 Number of Records and Variables
The dataset contains 6562 observations and 12 variables.
1.2 Data Type of Each Variable
Lets check the structure and missing values of each variable.
glimpse(data)## Observations: 6,562
## Variables: 12
## $ orderid <int> 1169, 1400, 1457, 1630, 1630, 1689, 1745, 24...
## $ parentorderid <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contactcustomerid <fct> 398, 1215, 1244, 1394, 1394, 18, 1027, 1982,...
## $ servicecenterid <fct> 81, 341, NA, 180, 153, NA, 24, 38, 714, 700,...
## $ createdat <date> 2017-07-31, 2017-08-08, 2017-08-09, 2017-08...
## $ pickupdate <date> 2017-08-02, 2017-08-14, NA, 2017-08-24, 201...
## $ closingdate <date> NA, NA, NA, 2017-08-28, 2017-08-28, NA, 201...
## $ finalinvoice <dbl> 39.00, 410.07, 0.00, 2160.08, 2160.08, 0.00,...
## $ tip <dbl> 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 20.0, 10.0, 10...
## $ promocodediscount <int> 5, 20, 0, 20, 20, 0, 20, 20, 20, 20, 0, 5, 0...
## $ grossrevenue <dbl> 50.00, 430.07, 0.00, 2180.08, 2180.08, 0.00,...
## $ netrevenue <dbl> 20.00, 58.57, 0.00, 363.79, 363.79, 0.00, 18...
1.3 Missing Values
To avoid discprencies in the data analysis, identification of null values is neccessary. The table below shows the attributes with the percentage of null values it contains.
d <- colMeans(is.na(data))
distribution<- d[d>0] #Variables having missing values only
p<-data.frame(distribution)
colnames(p) <- "Percentage"
p$Percentage <- p$Percentage *100 #Calculating percentages and renaming the column
df = data.frame(
p,
stringsAsFactors = TRUE
)
dt <- datatable(df, filter = 'bottom', options = list(pageLength = 6))
dtPercentage of missing service centers 31.11% servicecenter id’s are missing. As it is a categorical variable we can not predict them or perform imputation. So in our analysis we have to filter them out.
2 Questions
2.1 Most Experienced Service centers
The Count of Services - Service Centers
freq_servicecenterid <- data %>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(servicecenterid = reorder(servicecenterid,Count)) %>%
head(10) ggplot(freq_servicecenterid, aes(x = servicecenterid,y = Count)) +
geom_bar(stat='identity',colour="white",fill = " deepskyblue3 ") +
geom_text(aes(x = servicecenterid, y = 1, label = paste0("(",Count,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'black',fontface = 'bold') +
labs(x = 'Service Center', y = 'Count') +
theme_light() +
coord_flip()As a result of the analysis it seems the service center id with 24 is the most experienced service center with 798 services.
2.2 What are the most revenue generating service centers ?
avg_netrevenue <- data %>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid) %>%
summarise(avg_netrevenue = mean(netrevenue)) %>%
arrange(desc(avg_netrevenue)) %>%
ungroup() %>%
mutate(servicecenterid = reorder(servicecenterid,avg_netrevenue)) %>%
head(10) ggplot(avg_netrevenue, aes(x = servicecenterid,y = avg_netrevenue)) +
geom_bar(stat='identity',colour="white",fill = " forestgreen ") +
geom_text(aes(x = servicecenterid, y = 1, label = paste0("(",avg_netrevenue,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'black',fontface = 'bold') +
labs(x = 'Service Center', y = 'Average NetRevenue') +
theme_light() +
coord_flip()As a result of the analysis it seems that service center id with 758 has the highest average netrevenue i.e $1107
2.3 Distribution of Tip, Discount, Netrevenue and Grossrevenue
p1 <- ggplot(data, aes(x = tip)) +
geom_histogram()
p2 <- ggplot(data, aes(x = promocodediscount)) +
geom_histogram()
p3 <- ggplot(data, aes(x = netrevenue)) +
geom_histogram()
p4 <- ggplot(data, aes(x = grossrevenue)) +
geom_histogram()
grid.arrange(p1,p2,p3,p4,ncol = 2)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
2.4 What is an average tip for servicing?
mean(data$tip)## [1] 2.110483
tm <- data.frame(table(data$tip))
names(tm)[1] <- "Tip_Amount"
tm <- tm %>%
mutate(Range = case_when(
as.numeric(Tip_Amount) <= 5 ~ "0-5",
as.numeric(Tip_Amount) >= 6 & as.numeric(Tip_Amount) <= 10 ~ "6-10",
as.numeric(Tip_Amount) >= 11 & as.numeric(Tip_Amount) <= 20 ~ "11-20",
as.numeric(Tip_Amount) >= 21 & as.numeric(Tip_Amount) <= 50 ~ "21-50",
as.numeric(Tip_Amount) > 50 ~ "Above 50"
))
tm%>%
group_by(Range)%>%
summarise(Total_Count = sum(Freq))%>%
ungroup() %>%
ggplot(aes(x= Range, y= Total_Count, fill= -Total_Count))+
geom_bar(stat="identity", width = 0.7)+
xlab("Tip Range") + ylab("Count of Tips")High percentage of people do not tip and then there are 139 people who have given more than $50 as a tip, which tells how satisfied they are with the service.
2.5 What is the Maintainence cost?
data %>%
summarize(Total_netrevenue = sum(netrevenue),
Total_grossrevenue = sum(grossrevenue))%>%
mutate(Expenditure_Percentage = ((Total_grossrevenue - Total_netrevenue)/Total_grossrevenue)*100)## # A tibble: 1 x 3
## Total_netrevenue Total_grossrevenue Expenditure_Percentage
## <dbl> <dbl> <dbl>
## 1 271560. 1156375. 76.5
So 76.5 percent of the revenue has been put on maintaing services centers and buying required parts for the car.
3 Loyal Customers
3.1 How many customers does CarDash have?
data %>%
filter(!is.na(contactcustomerid))%>%
summarize(distinct_customers = n_distinct(contactcustomerid)) ##length(unique(data$contactcustomerid))## # A tibble: 1 x 1
## distinct_customers
## <int>
## 1 3977
There are 3977 customers who use carDash services
3.2 Who are the loyal customers?
Lc <- data%>% #Loyal Customers
filter(!is.na(contactcustomerid))%>%
group_by(contactcustomerid) %>%
summarise(unique_orders = n_distinct(orderid), count_orders = n()) %>% # you can use length(unique(orderid)) instead of n_distinct()
arrange(desc(unique_orders)) %>%
top_n(10, unique_orders)
Lc## # A tibble: 11 x 3
## contactcustomerid unique_orders count_orders
## <fct> <int> <int>
## 1 1466 81 86
## 2 483 14 14
## 3 13 12 12
## 4 18 12 13
## 5 219 10 13
## 6 3766 10 10
## 7 2 9 10
## 8 316 9 14
## 9 555 9 11
## 10 410 8 9
## 11 636 8 9
These are the customers who who have taken the carDash services most of the times.
3.3 How much do Repeative customers tip and bring revenue to the company?
inner_join(data, Lc, by = "contactcustomerid") %>%
filter(!is.na(servicecenterid))%>%
select(contactcustomerid, servicecenterid, tip, finalinvoice, grossrevenue)%>%
group_by(contactcustomerid)## # A tibble: 156 x 5
## # Groups: contactcustomerid [10]
## contactcustomerid servicecenterid tip finalinvoice grossrevenue
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 13 52 0 0 0
## 2 1466 677 0 433. 433.
## 3 1466 693 0 586. 586.
## 4 1466 48 0 45 45
## 5 1466 661 0 74 74
## 6 2 39 20 40 60
## 7 555 138 0 45 50
## 8 316 17 0 226. 246.
## 9 316 34 0 226. 246.
## 10 219 38 0 0 0
## # ... with 146 more rows
4 Creating New Variables
Creating new datetime variables for some more exploratory data analysis and model building.
data <- data%>%
mutate(create_to_pickup_duration = as.numeric(difftime(pickupdate, createdat, units = "days")),
pickup_to_closedat_duration = as.numeric(difftime(closingdate,pickupdate,units = "days")),
total_process_duration = as.numeric(difftime(closingdate, createdat, units = "days")))Note: All the 3 variables are in terms of number of days.
pickup_to_closedat_duration variable can be considered as the service duration
5 Service Duration Analysis
data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
filter(!is.na(avg_service_duration)) %>%
arrange(desc(avg_service_duration))## # A tibble: 120 x 3
## servicecenterid avg_service_duration count_service
## <fct> <dbl> <int>
## 1 72 123 1
## 2 266 82 1
## 3 101 45 2
## 4 770 35 1
## 5 776 24.5 2
## 6 251 22 1
## 7 822 15 2
## 8 758 10.2 4
## 9 695 5 1
## 10 814 5 1
## # ... with 110 more rows
As some of the datetime variables have missing values the newly created variables will also have missing values, so by neglecting those we are getting average service duration for each service center.
Service center 758 seems to take more than 10 days to service a car, possibly the worst. Can not consider others as they have serviced only once or twice and are basically outliers.
According to autobutler it only takes 3 hours to service a car. To be on safer side lets check service centers where service has been provided within 2 days.
5.1 What is the average service duration for service centers which has provided atleast 2 services ?
data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
ungroup()%>%
filter(!is.na(avg_service_duration)) %>%
filter(count_service >= 2) %>%
summarise(avg = mean(avg_service_duration)) ## # A tibble: 1 x 1
## avg
## <dbl>
## 1 2.76
Based on the analysis it seems that average service duration is close to 3 days where service centers have provided service more than once.
5.2 What is the average service duration for service centers which has provided atleast 5 services ?
Lets find average service duration across all service centers which have provided service more than 5 times.
d <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
filter(!is.na(avg_service_duration))%>%
filter(count_service >= 5) The average service duration is 1.17. Its getting bette, more the services better is the service duration. But how representative is it ? lets check how many service centers actually had service duration
sd <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())
missing_service_duration <- sum(is.na(sd$avg_service_duration))
percent_of_missing_serviceduration <- (missing_service_duration/nrow(sd))*100As 61% service duration records are missing. Records have either missing pick up or close date.
6 Lag Service Analysis
These are some of the worst picking service centers which takes days to pick the car and provide service.
6.1 Delay in car pick up
Average lag from the Service centers which have provided service more than once.
lag_services <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
filter(count >= 2) %>%
summarise(avg = mean(avg_create_to_pickup_duration))
lag_services## # A tibble: 1 x 1
## avg
## <dbl>
## 1 6.68
Based on the analysis it takes 6.68 days to pick a car for service center which has provided service more than once.
Average lag from the Service centers which have provided service more than 5 times.
lag_services <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
filter(count >= 5) %>%
summarise(avg = mean(avg_create_to_pickup_duration))
lag_services## # A tibble: 1 x 1
## avg
## <dbl>
## 1 7.61
Its getting worser, even after service more than 5 times the average delay is picking up a car is close to 8 days.
6.2 Problem with missing lag durations
The average lag duration for non missing service centers is close to 7.5 days. But how representative is it ? lets check how many service centers actually had service lag duration.
ld <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = mean(create_to_pickup_duration), na.rm = TRUE)
missing_lag_duration <- sum(is.na(ld$avg_create_to_pickup_duration))
percent_of_missing_totalduration <- (missing_lag_duration/nrow(ld))*100As 30.7443366% service duration records are missing. Records have either missing pick up or create date.
6.3 Most time taking service centers
Lets check the service centers with most lag (time taken pick the car)
data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
arrange(desc(avg_create_to_pickup_duration)) %>%
top_n(10) %>%
ggplot(aes(x= reorder(servicecenterid, avg_create_to_pickup_duration), y= avg_create_to_pickup_duration,fill = -avg_create_to_pickup_duration)) +
geom_bar(stat="identity", width = 0.5)+
theme_minimal()+
coord_flip() +
xlab("Service Center") +
ylab("Number of Days taken to start the service") ## Selecting by count
7 Revenue Analysis
We know that some service centers has provided services way more than any other service center and also earned revenue. We can neglect those outstanding records.
servicecenter_revenue <- data %>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(count = n(),
Avg_servicecenter_netrevenue = mean(netrevenue),
Avg_servicecenter_grossrevenue = mean(grossrevenue))
p5 <- servicecenter_revenue %>%
filter(Avg_servicecenter_netrevenue < 2000 & count < 400) %>%
arrange(desc(Avg_servicecenter_netrevenue)) %>%
ggplot(aes(x = count, y = Avg_servicecenter_netrevenue)) +
geom_point(color = "Blue") + xlab("Number of Services") + ylab("Average Netrevenue")
p6 <- servicecenter_revenue %>%
filter(Avg_servicecenter_grossrevenue < 2000 & count < 400) %>%
arrange(desc(Avg_servicecenter_grossrevenue)) %>%
ggplot(aes(x = count, y = Avg_servicecenter_grossrevenue)) +
geom_point(color = "red") + xlab("Number of Services") + ylab("Average Grossrevenue")
grid.arrange(p5,p6)8 Feature Selection
Hypo <- data %>%
filter(!is.na(servicecenterid)) %>%
group_by(servicecenterid)%>%
summarise(count = n(),
avg_service_duration = mean(pickup_to_closedat_duration, na.rm = TRUE),
avg_netrevenue = mean(netrevenue, na.rm = TRUE)) %>%
filter(!is.nan(avg_service_duration) & avg_netrevenue > 0)
Hypo$avg_service_duration <- as.numeric(Hypo$avg_service_duration)
cor(Hypo$avg_service_duration, Hypo$avg_netrevenue)## [1] -0.006366065
cor(Hypo$count, Hypo$avg_netrevenue)## [1] -0.05057159
#sapply(Hypo, function(x) sum(is.na(x)))Hypo1 <- data%>%
filter(finalinvoice != 0 & !is.na(closingdate) & !is.na(servicecenterid)) %>%
group_by(servicecenterid)%>%
summarise(count = n(),
avg_service_duration = as.numeric(mean(pickup_to_closedat_duration, na.rm = TRUE)),
avg_total_duration = as.numeric(mean(total_process_duration, na.rm = TRUE)),
avg_lag_duration = as.numeric(mean(create_to_pickup_duration, na.rm = TRUE)),
avg_grossrevenue = mean(grossrevenue),
avg_netrevenue = mean(netrevenue),
avg_final_invoice = mean(finalinvoice))%>%
filter(avg_service_duration < 25 & avg_grossrevenue < 1600)
#sapply(Hypo1, function(x) sum(is.na(x)))
cor(Hypo1$avg_grossrevenue, Hypo1$avg_service_duration) ## [1] 0.2908371
M <- cor(Hypo1[,-1])
correlation_plot <- corrplot(M, method = "number") correlation_plot## count avg_service_duration avg_total_duration
## count 1.000000000 0.01938288 0.01707789
## avg_service_duration 0.019382877 1.00000000 0.25585142
## avg_total_duration 0.017077887 0.25585142 1.00000000
## avg_lag_duration 0.006952528 -0.25683320 0.86837227
## avg_grossrevenue 0.033787119 0.29083710 0.13930892
## avg_netrevenue 0.016567990 0.27352454 0.17654545
## avg_final_invoice 0.033281527 0.29075916 0.13726052
## avg_lag_duration avg_grossrevenue avg_netrevenue
## count 0.006952528 0.03378712 0.01656799
## avg_service_duration -0.256833200 0.29083710 0.27352454
## avg_total_duration 0.868372271 0.13930892 0.17654545
## avg_lag_duration 1.000000000 -0.01024353 0.03601193
## avg_grossrevenue -0.010243531 1.00000000 0.61828813
## avg_netrevenue 0.036011931 0.61828813 1.00000000
## avg_final_invoice -0.012268597 0.99901482 0.62020484
## avg_final_invoice
## count 0.03328153
## avg_service_duration 0.29075916
## avg_total_duration 0.13726052
## avg_lag_duration -0.01226860
## avg_grossrevenue 0.99901482
## avg_netrevenue 0.62020484
## avg_final_invoice 1.00000000
#Hypo1$avg_service_duration <- is.numeric(Hypo1$avg_total_duration)
#Hypo1[which(is.na(Hypo1$avg_service_duration)),]
#Hypo1[which(Hypo1$servicecenterid == 86),]8.0.1 Checking the variable importance using random forest
fit=randomForest(avg_final_invoice ~ avg_service_duration + avg_total_duration + avg_lag_duration + count, data=Hypo1)
varImpPlot(fit)9 Regression Analysis
plot <- ggplot(Hypo1, aes(x= avg_service_duration , y = avg_netrevenue)) +
geom_point()
plotmod <- lm(avg_grossrevenue ~ avg_service_duration, data = Hypo1)
summary(mod)##
## Call:
## lm(formula = avg_grossrevenue ~ avg_service_duration, data = Hypo1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -537.25 -179.45 -76.69 105.15 1192.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 255.692 18.024 14.187 < 2e-16 ***
## avg_service_duration 24.767 5.609 4.416 1.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259 on 211 degrees of freedom
## Multiple R-squared: 0.08459, Adjusted R-squared: 0.08025
## F-statistic: 19.5 on 1 and 211 DF, p-value: 1.609e-05
max <- Hypo1%>%
filter(avg_grossrevenue == max(avg_grossrevenue))
max## # A tibble: 1 x 8
## servicecenterid count avg_service_dur~ avg_total_durat~ avg_lag_duration
## <fct> <int> <dbl> <dbl> <dbl>
## 1 529 5 5.6 11 5.4
## # ... with 3 more variables: avg_grossrevenue <dbl>, avg_netrevenue <dbl>,
## # avg_final_invoice <dbl>