Exploratory Data Analysis on CarDash Data

1 Preparing Data
2 Questions
3 Loyal Customers
4 Creating New Variables
5 Service Duration Analysis
- 5.1 What is the average service duration for service centers which has provided atleast 2 services ?
- 5.2 What is the average service duration for service centers which has provided atleast 5 services ?
6 Lag Service Analysis
7 Revenue Analysis
8 Feature Selection
9 Regression Analysis

CarDash is a full service automative concierge provider that picks your car, oversee the services and basically advocates on your behalf.

Goal: Provide a set of recommendations on how to improve their business or product based on the given dataset.

Column Names	Description
orderid	unique identifier of order
parentorderid	unique identifier of order that may contain associated children orders
contactcustomerid	unique identifier of customer
servicecenterid	unique identifier of service center
createdat	timestamp that customer placed the order
pickupdate	timestamp that driver picked up the order
closingdate	timestamp of vehicle return and order close
finalinvoice	final invoice amount to customer
tip	customer tip amount
promocodediscount	discount value of order
grossrevenue	finalinvoice + tip + promocodediscount
netrevenue	CarDash revenue after deducting: payment to service centers for work, parts costs, corporate discount promo codes, warranty and returns.

1 Preparing Data

Loading Libraries

library(readr)
library(dplyr)
library(DT)
library(ggplot2)
library(gridExtra)
library(forcats)
library(lubridate)
library(corrplot)
library(randomForest)
library(scales)

Loading Data

data <- read_csv("Data.csv")

#servicecenterid and parentorderid seems to be integer but we know it should be a categorical so converting them into factors.
data$parentorderid <- as.factor(data$parentorderid)
data$servicecenterid <- as.factor(data$servicecenterid)
data$contactcustomerid <- as.factor(data$contactcustomerid)
data$pickupdate <- as.Date(as.character(data$pickupdate), format="%Y-%m-%d")
data$createdat <- as.Date(as.character(data$createdat), format="%Y-%m-%d")
data$closingdate <- as.Date(as.character(data$closingdate), format="%Y-%m-%d")

Lets have a glimpse at the first 10 records in the data.

   data

## # A tibble: 6,562 x 12
##    orderid parentorderid contactcustomerid servicecenterid createdat 
##      <int> <fct>         <fct>             <fct>           <date>    
##  1    1169 <NA>          398               81              2017-07-31
##  2    1400 <NA>          1215              341             2017-08-08
##  3    1457 <NA>          1244              <NA>            2017-08-09
##  4    1630 <NA>          1394              180             2017-08-18
##  5    1630 <NA>          1394              153             2017-08-18
##  6    1689 <NA>          18                <NA>            2017-08-22
##  7    1745 <NA>          1027              24              2017-08-24
##  8    2494 <NA>          1982              38              2017-10-06
##  9    2494 <NA>          1982              714             2017-10-06
## 10    2494 <NA>          1982              700             2017-10-06
## # ... with 6,552 more rows, and 7 more variables: pickupdate <date>,
## #   closingdate <date>, finalinvoice <dbl>, tip <dbl>,
## #   promocodediscount <int>, grossrevenue <dbl>, netrevenue <dbl>

1.1 Number of Records and Variables

The dataset contains 6562 observations and 12 variables.

1.2 Data Type of Each Variable

Lets check the structure and missing values of each variable.

glimpse(data)

## Observations: 6,562
## Variables: 12
## $ orderid           <int> 1169, 1400, 1457, 1630, 1630, 1689, 1745, 24...
## $ parentorderid     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contactcustomerid <fct> 398, 1215, 1244, 1394, 1394, 18, 1027, 1982,...
## $ servicecenterid   <fct> 81, 341, NA, 180, 153, NA, 24, 38, 714, 700,...
## $ createdat         <date> 2017-07-31, 2017-08-08, 2017-08-09, 2017-08...
## $ pickupdate        <date> 2017-08-02, 2017-08-14, NA, 2017-08-24, 201...
## $ closingdate       <date> NA, NA, NA, 2017-08-28, 2017-08-28, NA, 201...
## $ finalinvoice      <dbl> 39.00, 410.07, 0.00, 2160.08, 2160.08, 0.00,...
## $ tip               <dbl> 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 20.0, 10.0, 10...
## $ promocodediscount <int> 5, 20, 0, 20, 20, 0, 20, 20, 20, 20, 0, 5, 0...
## $ grossrevenue      <dbl> 50.00, 430.07, 0.00, 2180.08, 2180.08, 0.00,...
## $ netrevenue        <dbl> 20.00, 58.57, 0.00, 363.79, 363.79, 0.00, 18...

1.3 Missing Values

To avoid discprencies in the data analysis, identification of null values is neccessary. The table below shows the attributes with the percentage of null values it contains.

d <- colMeans(is.na(data))

distribution<- d[d>0]       #Variables having missing values only
p<-data.frame(distribution)

colnames(p) <- "Percentage"     
p$Percentage <- p$Percentage *100     #Calculating percentages and renaming the column

df = data.frame(
      p,
      stringsAsFactors = TRUE
    )
    dt <- datatable(df, filter = 'bottom', options = list(pageLength = 6))
    dt

Percentage of missing service centers 31.11% servicecenter id’s are missing. As it is a categorical variable we can not predict them or perform imputation. So in our analysis we have to filter them out.

2 Questions

2.1 Most Experienced Service centers

The Count of Services - Service Centers

freq_servicecenterid <-  data %>%
                         filter(!is.na(servicecenterid))%>%
                         group_by(servicecenterid) %>%
                         summarise(Count = n()) %>%
                         arrange(desc(Count)) %>%
                         ungroup() %>% 
                         mutate(servicecenterid = reorder(servicecenterid,Count)) %>%
                         head(10)

  ggplot(freq_servicecenterid, aes(x = servicecenterid,y = Count)) +
  geom_bar(stat='identity',colour="white",fill = " deepskyblue3 ") +
  geom_text(aes(x = servicecenterid, y = 1, label = paste0("(",Count,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'black',fontface = 'bold') +
  labs(x = 'Service Center', y = 'Count') +
   theme_light() +
   coord_flip()

As a result of the analysis it seems the service center id with 24 is the most experienced service center with 798 services.

2.2 What are the most revenue generating service centers ?

avg_netrevenue <-  data %>%
                         filter(!is.na(servicecenterid))%>%
                         group_by(servicecenterid) %>%
                         summarise(avg_netrevenue = mean(netrevenue)) %>%
                         arrange(desc(avg_netrevenue)) %>%
                         ungroup() %>% 
                         mutate(servicecenterid = reorder(servicecenterid,avg_netrevenue)) %>%
                         head(10)

  ggplot(avg_netrevenue, aes(x = servicecenterid,y = avg_netrevenue)) +
  geom_bar(stat='identity',colour="white",fill = " forestgreen ") +
  geom_text(aes(x = servicecenterid, y = 1, label = paste0("(",avg_netrevenue,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'black',fontface = 'bold') +
  labs(x = 'Service Center', y = 'Average NetRevenue') +
   theme_light() +
   coord_flip()

As a result of the analysis it seems that service center id with 758 has the highest average netrevenue i.e $1107

2.3 Distribution of Tip, Discount, Netrevenue and Grossrevenue

p1 <- ggplot(data, aes(x = tip)) +
      geom_histogram()

p2 <- ggplot(data, aes(x = promocodediscount)) +
      geom_histogram()  

p3 <- ggplot(data, aes(x = netrevenue)) +
      geom_histogram()

p4 <- ggplot(data, aes(x = grossrevenue)) +
      geom_histogram()


grid.arrange(p1,p2,p3,p4,ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.4 What is an average tip for servicing?

mean(data$tip)

## [1] 2.110483

tm <- data.frame(table(data$tip))
names(tm)[1] <- "Tip_Amount"

tm <- tm %>%
  mutate(Range = case_when(
                             as.numeric(Tip_Amount) <= 5 ~ "0-5",
                             as.numeric(Tip_Amount) >= 6 & as.numeric(Tip_Amount) <= 10 ~ "6-10",
                             as.numeric(Tip_Amount) >= 11 & as.numeric(Tip_Amount) <= 20 ~ "11-20",
                             as.numeric(Tip_Amount) >= 21 & as.numeric(Tip_Amount) <= 50 ~ "21-50",
                             as.numeric(Tip_Amount) > 50 ~ "Above 50"
                          ))


tm%>%
         group_by(Range)%>% 
         summarise(Total_Count = sum(Freq))%>%
         ungroup() %>%
  ggplot(aes(x= Range, y= Total_Count, fill= -Total_Count))+
  geom_bar(stat="identity", width = 0.7)+ 
  xlab("Tip Range") + ylab("Count of Tips")

High percentage of people do not tip and then there are 139 people who have given more than $50 as a tip, which tells how satisfied they are with the service.

2.5 What is the Maintainence cost?

data %>%
summarize(Total_netrevenue = sum(netrevenue),
           Total_grossrevenue = sum(grossrevenue))%>%
mutate(Expenditure_Percentage = ((Total_grossrevenue - Total_netrevenue)/Total_grossrevenue)*100)

## # A tibble: 1 x 3
##   Total_netrevenue Total_grossrevenue Expenditure_Percentage
##              <dbl>              <dbl>                  <dbl>
## 1          271560.           1156375.                   76.5

So 76.5 percent of the revenue has been put on maintaing services centers and buying required parts for the car.

3 Loyal Customers

3.1 How many customers does CarDash have?

data %>%
  filter(!is.na(contactcustomerid))%>%
  summarize(distinct_customers = n_distinct(contactcustomerid)) ##length(unique(data$contactcustomerid))

## # A tibble: 1 x 1
##   distinct_customers
##                <int>
## 1               3977

There are 3977 customers who use carDash services

3.2 Who are the loyal customers?

Lc <-  data%>%                                      #Loyal Customers
  filter(!is.na(contactcustomerid))%>%
  group_by(contactcustomerid) %>%
  summarise(unique_orders = n_distinct(orderid), count_orders = n()) %>%   # you can use length(unique(orderid)) instead of n_distinct() 
  arrange(desc(unique_orders)) %>%
  top_n(10, unique_orders)
Lc

## # A tibble: 11 x 3
##    contactcustomerid unique_orders count_orders
##    <fct>                     <int>        <int>
##  1 1466                         81           86
##  2 483                          14           14
##  3 13                           12           12
##  4 18                           12           13
##  5 219                          10           13
##  6 3766                         10           10
##  7 2                             9           10
##  8 316                           9           14
##  9 555                           9           11
## 10 410                           8            9
## 11 636                           8            9

These are the customers who who have taken the carDash services most of the times.

3.3 How much do Repeative customers tip and bring revenue to the company?

    inner_join(data, Lc, by = "contactcustomerid") %>%
     filter(!is.na(servicecenterid))%>%
    select(contactcustomerid, servicecenterid, tip, finalinvoice, grossrevenue)%>%
    group_by(contactcustomerid)

## # A tibble: 156 x 5
## # Groups:   contactcustomerid [10]
##    contactcustomerid servicecenterid   tip finalinvoice grossrevenue
##    <fct>             <fct>           <dbl>        <dbl>        <dbl>
##  1 13                52                  0           0            0 
##  2 1466              677                 0         433.         433.
##  3 1466              693                 0         586.         586.
##  4 1466              48                  0          45           45 
##  5 1466              661                 0          74           74 
##  6 2                 39                 20          40           60 
##  7 555               138                 0          45           50 
##  8 316               17                  0         226.         246.
##  9 316               34                  0         226.         246.
## 10 219               38                  0           0            0 
## # ... with 146 more rows

4 Creating New Variables

Creating new datetime variables for some more exploratory data analysis and model building.

data <-       data%>%
              mutate(create_to_pickup_duration = as.numeric(difftime(pickupdate, createdat, units = "days")),
              pickup_to_closedat_duration = as.numeric(difftime(closingdate,pickupdate,units = "days")),
              total_process_duration = as.numeric(difftime(closingdate, createdat, units = "days")))

Note: All the 3 variables are in terms of number of days.

pickup_to_closedat_duration variable can be considered as the service duration

5 Service Duration Analysis

data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
filter(!is.na(avg_service_duration)) %>%
arrange(desc(avg_service_duration))

## # A tibble: 120 x 3
##    servicecenterid avg_service_duration count_service
##    <fct>                          <dbl>         <int>
##  1 72                             123               1
##  2 266                             82               1
##  3 101                             45               2
##  4 770                             35               1
##  5 776                             24.5             2
##  6 251                             22               1
##  7 822                             15               2
##  8 758                             10.2             4
##  9 695                              5               1
## 10 814                              5               1
## # ... with 110 more rows

As some of the datetime variables have missing values the newly created variables will also have missing values, so by neglecting those we are getting average service duration for each service center.

Service center 758 seems to take more than 10 days to service a car, possibly the worst. Can not consider others as they have serviced only once or twice and are basically outliers.

According to autobutler it only takes 3 hours to service a car. To be on safer side lets check service centers where service has been provided within 2 days.

5.1 What is the average service duration for service centers which has provided atleast 2 services ?

data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
ungroup()%>%  
filter(!is.na(avg_service_duration)) %>%
filter(count_service >= 2) %>%  
summarise(avg = mean(avg_service_duration))

## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  2.76

Based on the analysis it seems that average service duration is close to 3 days where service centers have provided service more than once.

5.2 What is the average service duration for service centers which has provided atleast 5 services ?

Lets find average service duration across all service centers which have provided service more than 5 times.

d <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())%>%
filter(!is.na(avg_service_duration))%>%
filter(count_service >= 5)

The average service duration is 1.17. Its getting bette, more the services better is the service duration. But how representative is it ? lets check how many service centers actually had service duration

sd <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_service_duration = as.numeric(mean(pickup_to_closedat_duration)), count_service = n())

missing_service_duration <-  sum(is.na(sd$avg_service_duration))
percent_of_missing_serviceduration <- (missing_service_duration/nrow(sd))*100

As 61% service duration records are missing. Records have either missing pick up or close date.

6 Lag Service Analysis

These are some of the worst picking service centers which takes days to pick the car and provide service.

6.1 Delay in car pick up

Average lag from the Service centers which have provided service more than once.

lag_services <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
filter(count >= 2) %>%
summarise(avg = mean(avg_create_to_pickup_duration))
lag_services

## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  6.68

Based on the analysis it takes 6.68 days to pick a car for service center which has provided service more than once.

Average lag from the Service centers which have provided service more than 5 times.

lag_services <- data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
filter(count >= 5) %>%
summarise(avg = mean(avg_create_to_pickup_duration))
lag_services

## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  7.61

Its getting worser, even after service more than 5 times the average delay is picking up a car is close to 8 days.

6.2 Problem with missing lag durations

The average lag duration for non missing service centers is close to 7.5 days. But how representative is it ? lets check how many service centers actually had service lag duration.

ld <- data%>%
  filter(!is.na(servicecenterid))%>%
  group_by(servicecenterid)%>%
  summarise(avg_create_to_pickup_duration = mean(create_to_pickup_duration), na.rm = TRUE)

missing_lag_duration <-  sum(is.na(ld$avg_create_to_pickup_duration))
percent_of_missing_totalduration <- (missing_lag_duration/nrow(ld))*100

As 30.7443366% service duration records are missing. Records have either missing pick up or create date.

6.3 Most time taking service centers

Lets check the service centers with most lag (time taken pick the car)

data%>%
filter(!is.na(servicecenterid))%>%
group_by(servicecenterid)%>%
summarise(avg_create_to_pickup_duration = as.numeric(mean(create_to_pickup_duration)), count = n())%>%
filter(!is.na(avg_create_to_pickup_duration))%>%
arrange(desc(avg_create_to_pickup_duration)) %>%  
top_n(10) %>%
ggplot(aes(x= reorder(servicecenterid, avg_create_to_pickup_duration), y= avg_create_to_pickup_duration,fill = -avg_create_to_pickup_duration)) +
geom_bar(stat="identity", width = 0.5)+ 
theme_minimal()+
coord_flip() +
xlab("Service Center") +
ylab("Number of Days taken to start the service")

## Selecting by count

7 Revenue Analysis

We know that some service centers has provided services way more than any other service center and also earned revenue. We can neglect those outstanding records.

servicecenter_revenue <- data %>%
                                 filter(!is.na(servicecenterid))%>%
                                 group_by(servicecenterid)%>%
                                 summarise(count = n(), 
                                 Avg_servicecenter_netrevenue = mean(netrevenue),
                                 Avg_servicecenter_grossrevenue = mean(grossrevenue)) 

                           p5 <- servicecenter_revenue %>%
                                 filter(Avg_servicecenter_netrevenue < 2000 & count < 400) %>%
                                 arrange(desc(Avg_servicecenter_netrevenue)) %>%
                                 ggplot(aes(x = count, y = Avg_servicecenter_netrevenue)) +
                                 geom_point(color = "Blue") + xlab("Number of Services") + ylab("Average Netrevenue")
                           
                           p6 <- servicecenter_revenue %>%
                                 filter(Avg_servicecenter_grossrevenue < 2000 & count < 400) %>%
                                 arrange(desc(Avg_servicecenter_grossrevenue)) %>%
                                 ggplot(aes(x = count, y = Avg_servicecenter_grossrevenue)) +
                                 geom_point(color = "red") + xlab("Number of Services") + ylab("Average Grossrevenue")

 grid.arrange(p5,p6)

8 Feature Selection

Hypo <- data %>%
        filter(!is.na(servicecenterid)) %>%
        group_by(servicecenterid)%>%
        summarise(count = n(),
                  avg_service_duration = mean(pickup_to_closedat_duration, na.rm = TRUE),
                  avg_netrevenue = mean(netrevenue, na.rm = TRUE)) %>%
        filter(!is.nan(avg_service_duration) & avg_netrevenue > 0)
 
Hypo$avg_service_duration <- as.numeric(Hypo$avg_service_duration)

cor(Hypo$avg_service_duration, Hypo$avg_netrevenue)

## [1] -0.006366065

cor(Hypo$count, Hypo$avg_netrevenue)

## [1] -0.05057159

#sapply(Hypo, function(x) sum(is.na(x)))

Hypo1 <- data%>%
        filter(finalinvoice != 0 & !is.na(closingdate) & !is.na(servicecenterid)) %>%
        group_by(servicecenterid)%>%
         summarise(count = n(),
          avg_service_duration = as.numeric(mean(pickup_to_closedat_duration, na.rm = TRUE)),
          avg_total_duration = as.numeric(mean(total_process_duration, na.rm = TRUE)),
          avg_lag_duration = as.numeric(mean(create_to_pickup_duration, na.rm = TRUE)),
          avg_grossrevenue = mean(grossrevenue),
          avg_netrevenue = mean(netrevenue),
          avg_final_invoice = mean(finalinvoice))%>%
          filter(avg_service_duration < 25 & avg_grossrevenue < 1600)
#sapply(Hypo1, function(x) sum(is.na(x)))
          cor(Hypo1$avg_grossrevenue, Hypo1$avg_service_duration)

## [1] 0.2908371

          M <- cor(Hypo1[,-1])
correlation_plot <- corrplot(M, method = "number")

correlation_plot

##                            count avg_service_duration avg_total_duration
## count                1.000000000           0.01938288         0.01707789
## avg_service_duration 0.019382877           1.00000000         0.25585142
## avg_total_duration   0.017077887           0.25585142         1.00000000
## avg_lag_duration     0.006952528          -0.25683320         0.86837227
## avg_grossrevenue     0.033787119           0.29083710         0.13930892
## avg_netrevenue       0.016567990           0.27352454         0.17654545
## avg_final_invoice    0.033281527           0.29075916         0.13726052
##                      avg_lag_duration avg_grossrevenue avg_netrevenue
## count                     0.006952528       0.03378712     0.01656799
## avg_service_duration     -0.256833200       0.29083710     0.27352454
## avg_total_duration        0.868372271       0.13930892     0.17654545
## avg_lag_duration          1.000000000      -0.01024353     0.03601193
## avg_grossrevenue         -0.010243531       1.00000000     0.61828813
## avg_netrevenue            0.036011931       0.61828813     1.00000000
## avg_final_invoice        -0.012268597       0.99901482     0.62020484
##                      avg_final_invoice
## count                       0.03328153
## avg_service_duration        0.29075916
## avg_total_duration          0.13726052
## avg_lag_duration           -0.01226860
## avg_grossrevenue            0.99901482
## avg_netrevenue              0.62020484
## avg_final_invoice           1.00000000

#Hypo1$avg_service_duration <- is.numeric(Hypo1$avg_total_duration)
#Hypo1[which(is.na(Hypo1$avg_service_duration)),]
#Hypo1[which(Hypo1$servicecenterid == 86),]

8.0.1 Checking the variable importance using random forest

fit=randomForest(avg_final_invoice ~ avg_service_duration + avg_total_duration + avg_lag_duration + count, data=Hypo1)
varImpPlot(fit)

9 Regression Analysis

plot <- ggplot(Hypo1, aes(x= avg_service_duration , y = avg_netrevenue)) +
         geom_point()
plot

mod <- lm(avg_grossrevenue ~ avg_service_duration, data = Hypo1)
summary(mod)

## 
## Call:
## lm(formula = avg_grossrevenue ~ avg_service_duration, data = Hypo1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.25 -179.45  -76.69  105.15 1192.89 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           255.692     18.024  14.187  < 2e-16 ***
## avg_service_duration   24.767      5.609   4.416 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259 on 211 degrees of freedom
## Multiple R-squared:  0.08459,    Adjusted R-squared:  0.08025 
## F-statistic:  19.5 on 1 and 211 DF,  p-value: 1.609e-05

max <- Hypo1%>%
  filter(avg_grossrevenue == max(avg_grossrevenue))
max

## # A tibble: 1 x 8
##   servicecenterid count avg_service_dur~ avg_total_durat~ avg_lag_duration
##   <fct>           <int>            <dbl>            <dbl>            <dbl>
## 1 529                 5              5.6               11              5.4
## # ... with 3 more variables: avg_grossrevenue <dbl>, avg_netrevenue <dbl>,
## #   avg_final_invoice <dbl>