library(readr) #for reading the csv file
library(tidyverse)  #for data manipulation and other needs
library(ggplot2) # for visualization
library(survival) #for survival analysis
library(survminer) #for survival analysis plots
library(DT) #for displaying table

Brief Description

Dell provides after-sales service to customers who purchase its products, which generates a large amount of data related to tech support, warranties, repairs, etc. The focus of this challenge is not only to explore the dataset and produce meaningful results but also to lay out thoughts on the future scope of work.


Data Description

Dataset

Variables and their description:


Data Importing and Cleaning

Before we begin the analysis, it is essential to clean the data so that the study is not biased by data quality issues. We will clean the data in a series of systematic steps.


Exploring the dataset

We will begin our analysis with data exploration: the variables and their types, the number of rows in the dataset, the number of unique values and missing values for each variable, duplicates, etc. It is also essential to remove or impute missing values, especially before any modeling. As no modeling has been performed for this analysis, not all missing values have been dealt with. Some algorithms, like linear regression, do not handle missing values, while others, like tree-based methods, do. Thus, to make an apples-to-apples comparison, it is essential to handle missing values before modeling.

data <- read_csv("data/DellGSPE_DSrole_dataset.csv") # Reading data


names(data) # Names of variables
##  [1] "asst_id"             "product_type"        "mnfture_wk"         
##  [4] "contract_st"         "contract_end"        "contact_wk"         
##  [7] "contact_type"        "issue_type"          "topic_category"     
## [10] "parts_sent"          "repair_type"         "repeat_ct"          
## [13] "parts_ct"            "agent_tenure_indays" "contact_manager_flg"
## [16] "diagnostics"         "repeat_parts_sent"   "region"             
## [19] "country"
nrow(data) # Number of rows in the dataset
## [1] 100000
glimpse(data) # summarising the data
## Observations: 100,000
## Variables: 19
## $ asst_id             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ product_type        <chr> "Laptops", "Laptops", "Laptops", "Laptops"...
## $ mnfture_wk          <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_st         <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_end        <dbl> 202125, 201949, 202222, 201908, 202032, 20...
## $ contact_wk          <dbl> 201840, 201840, 201840, 201840, 201840, 20...
## $ contact_type        <chr> "Voice", "Voice", "Voice", "VOICE", "Voice...
## $ issue_type          <chr> NA, NA, NA, "Hard Drive", NA, "Fee Based S...
## $ topic_category      <chr> NA, NA, NA, "Booting", NA, "General Querie...
## $ parts_sent          <chr> "Hard Drive, Documentation(2)", "Speaker, ...
## $ repair_type         <chr> "Hard", "Hard", "Hard", "Hard", "Hard", "S...
## $ repeat_ct           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ parts_ct            <dbl> 3, 2, 1, 1, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, ...
## $ agent_tenure_indays <dbl> 1018, 1018, 1018, 298, 1018, 802, 1757, 81...
## $ contact_manager_flg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ diagnostics         <chr> "Not_USED", "Not_USED", "Not_USED", "Effec...
## $ repeat_parts_sent   <chr> NA, NA, "Motherboard, Miscellaneous(2), Ha...
## $ region              <chr> "Hogwarts", "Hogwarts", "Hogwarts", "Hogwa...
## $ country             <chr> NA, "Zonko's Joke Shop", "Zonko's Joke Sho...
sum(complete.cases(data)) # Checking for rows with no missing values
## [1] 6109
sum(duplicated(data))  # Checking duplicates
## [1] 13464
data <- data[!duplicated(data), ] # Given we don't have a timestamp, it is fair to assume these rows are true duplicates. Removing them to avoid double counting
Unique values

Checking for variables that are constant (or all zero) so they can be removed, and looking for variables with missing values. We can see contact_wk contains only one week of data; on its own it carries no information, although it is used later together with mnfture_wk to compute the time to issue. Further analysis also reveals missing values in several variables. For example, product_type should contain only three unique values, but it shows four: “Laptops”, “Desktops”, “Other Electronics”, and “NA”. “NA” indicates missing values, which are dealt with in the next section.

The chart below shows the number of unique values for each column. In particular, the asst_id column has 82,442 unique IDs, implying that 82,442 distinct items reported issues in that particular week.

# Plotting the graph to check number of unique values
data %>% 
  map_dfr(n_distinct) %>% 
  gather() %>% 
  ggplot(aes(reorder(key, -value), value)) +
  geom_bar(stat = "identity", fill = "tomato") + 
  scale_y_log10(breaks = c(5, 50, 250, 500, 1000, 10000, 50000)) +
  geom_text(aes(label = value), vjust = 1.6, color = "white", size = 3.5) +
  theme_minimal() +
  labs(x = "features", y = "Number of unique values") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(panel.grid = element_blank())

Missing values

The table below shows the number of missing values for each column. Note that repeat_parts_sent and parts_sent have a large number of missing values, primarily because software issues do not require any transfer of parts.

For this exercise, the missing values in the topic_category column have been imputed based on the parts_sent column, as shown in the next section.

# Checking the percentage of missing values for all the variables

is_na_val <- function(x) x %in% c("not available", "<NA>",  "Unknown",
                                  "NA"," ","")

data <- data %>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))

missing_values <- data %>% summarise_all(funs(sum(is.na(.)))) %>% 
  gather(key = "Variable", value = "missing_count") %>%
  arrange(desc(missing_count))


#Displaying the missing values in a table for each variable
datatable(
  missing_values, rownames = FALSE,
  extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)
Data Cleaning and Data Transformation

Performing various data cleaning activities:

* Changing the case of the text under the contact_type field to remove redundancy
* Setting repeat_ct to 1 when the repeat_parts_sent field is not empty but repeat_ct is zero; the imputation is based on the observed distribution of repeat_ct
* Converting repair_type to ‘Soft’ when both the repeat_parts_sent and parts_sent fields are empty
* Imputing missing topic_category values based on parts_sent. The topic_category can differ even when parts_sent is the same, so each missing topic_category is attributed to the most frequent topic category for that combination of parts_sent. This reduces the missing count for topic_category by approximately 5,000

unique(data$contact_type)
## [1] "Voice" "VOICE" "CHAT"  "EMAIL" NA
# Changing Voice to VOICE
data$contact_type <- toupper(data$contact_type)
unique(data$contact_type)
## [1] "VOICE" "CHAT"  "EMAIL" NA
# Repeat count is zero and repeat part  is not empty 

sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0)
## [1] 5293
# Checking for distribution of repeat count  when it is not zero and repeat part is not empty

  sum(!is.na(data$repeat_parts_sent) & data$repeat_ct != 0)
## [1] 6270
  sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 1)
## [1] 6269
  sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 2)
## [1] 1
  sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 3)
## [1] 0
# Imputing repeat count with 1 when the repeat part is not blank, based on the above distribution: almost all non-zero repeat counts are 1
  
# Assuming an additional visit was made even if the repeat part sent is the same as the part originally sent


data <- data %>%
    mutate(repeat_ct = ifelse((!is.na(repeat_parts_sent) & 
                                 repeat_ct == 0),1,repeat_ct))


sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0) #Repeat count is no longer 0 when repeat part is not empty
## [1] 0
# When both parts_sent and repeat_parts_sent are empty, converting the repair type from Hard to Soft

  
  sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent)  & 
         data$repair_type == "Hard"))
## [1] 3769
  data <- data %>%
    mutate(repair_type = ifelse((is.na(data$repeat_parts_sent) & 
                                   is.na(data$parts_sent)  & 
                                   data$repair_type == "Hard"),
                                "Soft",repair_type))    
      
 sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent)  & 
         data$repair_type == "Hard"))
## [1] 0
#Imputing topic_category based on parts sent column

#If there are multiple topic categories for same part sent, then the missing topic category is imputed with the most frequent topic category for that part sent

#Function to extract most frequent topic category for each combination of parts sent


impute_func <- function(key){
  # Most frequent topic_category among rows with this exact parts_sent value
  data_subset <- data %>% filter(parts_sent == key) %>%
                  select(topic_category, parts_sent) %>%
                  group_by(topic_category) %>%
                  mutate(count = n())

  max_count    <- max(data_subset$count, na.rm = TRUE)
  impute_value <- data_subset %>% filter(count == max_count) %>%
    .$topic_category %>% unique()

  impute_value[1] # if several categories tie for most frequent, keep the first
}


#Calling the function to extract the most frequent topic category for each combination of parts sent, storing the result in the lookup_table variable

parts_sent_var <- na.omit(unique(data$parts_sent))
impute_topic <- vector()
parts_sent <- vector()
for (i in 1:length(parts_sent_var)) {
  impute_topic[i] <- impute_func(parts_sent_var[i])
  parts_sent[i] <- parts_sent_var[i]
}

lookup_table <- as.data.frame(cbind(parts_sent, impute_topic))

#Converting parts sent and impute_topic to character

lookup_table$parts_sent <- as.character(lookup_table$parts_sent)
lookup_table$impute_topic <- as.character(lookup_table$impute_topic)

lookup_table <- na.omit(lookup_table)

#Imputing the missing topic_category with the most frequent topic category for the given parts_sent (the row-wise loop below is left commented out; a vectorized alternative follows it)

#for (i in 1:nrow(data)) {
 # if (is.na(data[i,9]) & !is.na(data[i,10])) {
#    for (j in 1:nrow(lookup_table)) {
#      if (data[i,10] == lookup_table[j,1]) {
#        data[i,9] <- lookup_table[j,2]
        
 #     } else {
        
#      }
      
#    }
    
#  } else {
    
#    data[i,9] <- data[i,9]
#  }
  
#}
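
The row-wise loop above is left commented out; iterating over every row with a nested lookup would be slow. A vectorized alternative is sketched below, assuming the lookup_table built above has one row per parts_sent value; the name data_imputed is hypothetical and is not used downstream.

# Vectorized alternative to the commented-out loop (a sketch): join the lookup
# table on parts_sent and fill topic_category only where it is missing
data_imputed <- data %>%
  left_join(lookup_table, by = "parts_sent") %>%
  mutate(topic_category = ifelse(is.na(topic_category) & !is.na(impute_topic),
                                 impute_topic, topic_category)) %>%
  select(-impute_topic)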

Exploratory Data Analysis

Exploring the data visually helps us see what the data can tell us beyond formal modeling or hypothesis testing.


Issue by Topic Category - Hard Repair

The plot below depicts the issue count for each topic category across all the product types. This analysis helps in understanding the inventory that needs to be maintained, based on the number of issues within each topic category (a parts-level sketch follows the plot code below).

# Problematic Parts 

problematic_parts <- data %>%
  group_by(repair_type, product_type,topic_category) %>%
  dplyr::summarise(issue_count = n()) %>%
  arrange(repair_type, product_type, desc(issue_count)) %>%
  na.omit() 

#Function definition
problem_parts <- function(repair_typ) {
problematic_parts %>%
  filter(repair_type == repair_typ) %>%
  ggplot(aes(x = reorder(topic_category, issue_count), 
             y = issue_count, 
             fill = as.factor(product_type))) +
  geom_bar(stat = "identity") +
  facet_wrap(~as.factor(product_type), scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        axis.text.y = element_text(size = 6)) +
  scale_fill_manual(name = "Product Type", 
                    values = c("Laptops" = "#3039DC", 
                               "Desktops" = "#2D9A2D",
                               "Other Electronics" = "#FD0006")) +
    labs(title = "Issue Count by Topic Category", 
       y = "Issue Count", x = "Topic Category") +
  coord_flip()

  
}  

 problem_parts("Hard") #Calling the function

Issue by Topic Category - Soft Repair

The plot below depicts the issue count for each topic category across all the product types. As we are looking at the soft repair type, the focus is not on inventory but on the kind of training the agents require most so that they become highly skilled at solving a particular type of issue. If agents are qualified to handle only specific kinds of problems, then the number of agents can also be staffed proportionately to the number of issues within each topic category (a sketch follows the plot code below).

 problem_parts("Soft")  #Calling the function

Experience and the repeat visit count

The plot below depicts agent experience in years vs. the repeat-visit percentage, i.e. repeat visits divided by the total of repeat visits and services that did not require any additional visit. A repeat visit means the repeat count is one, two, or three additional visits. We can see that the repeat-visit percentage generally increases with experience. It may seem surprising that the percentage is higher, not lower, for agents with more experience; a likely explanation is that the complex issues requiring repeat visits are routed to more experienced people. Thus, it might also be beneficial to look at the topic categories that need additional visits, so that focused training can be provided (a sketch follows the plot code below).

\[RepeatVisitPercentage = \frac{\#\{RepeatCount \in \{1,2,3\}\}}{\#\{RepeatCount \in \{0,1,2,3\}\}} \times 100\]

#Creating new variable indicating agent experience as 0-1, 1-2 and so on

  data <- data %>% 
    mutate(agent_experience = case_when(
      
      is.na(agent_tenure_indays)  ~ NA_character_,
      
      (agent_tenure_indays > 0 & agent_tenure_indays <= 365) ~ "0-1",
      (agent_tenure_indays > 365 & agent_tenure_indays <= 730) ~ "1-2",
      (agent_tenure_indays > 730 & agent_tenure_indays <= 1095) ~ "2-3",
      (agent_tenure_indays > 1095 & agent_tenure_indays <= 1460) ~ "3-4",
      TRUE ~ "4+"
    ))
  
  data$agent_experience <- as.factor(data$agent_experience)
  
  agent_data <- data %>%
    filter(repair_type == "Hard") %>%
    group_by(agent_experience, repeat_ct) %>%
    dplyr::summarise(add_visit = n()) %>%
    spread(repeat_ct,add_visit)
    
  agent_data <- agent_data %>%
    mutate(non_zero = sum(`1`,`2`,`3`, na.rm = TRUE))
  
  agent_data <- agent_data %>% 
    mutate(repeat_ratio = (non_zero / (`0` + non_zero)) * 100)
  
  
  agent_data %>%
    ggplot() +
    geom_bar(aes(x = agent_experience,
                 y = repeat_ratio), width = 0.3, fill = "tomato", 
             stat = "identity") +
    theme_minimal() +
    labs(title = "Repeat to Non-Repeat Visits by Experience", 
         y = "Repeat Visit/ Non - Repeat percentage", x = "Agent Experience in years") +
    theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
          axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
          axis.title = element_text(size = 10, face = "bold", color = "#444444"),
          plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"), 
          legend.text = element_text(face = "bold", color = "#444444"), 
          legend.title = element_text(face = "bold", color = "#444444"),
          panel.grid.major = element_blank())  
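
As a possible extension of the training point above, a sketch of the repeat-visit rate by topic category for hard repairs; the name repeat_by_topic is hypothetical.

# Sketch: topic categories with the highest repeat-visit rate among hard
# repairs, to help target focused training
repeat_by_topic <- data %>%
  filter(repair_type == "Hard", !is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   repeat_pct = sum(repeat_ct > 0, na.rm = TRUE) / n() * 100) %>%
  arrange(desc(repeat_pct))

head(repeat_by_topic, 10)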

Call the Manager

The plot below shows the percentage of contacts in which the manager was called vs. the agent’s experience. We can see that the less experienced an agent is, the more often the manager was called. Another analysis that could be made is manager intervention vs. topic category, which would help identify the categories in which agents need more training (a sketch follows the plot code below).

# Contact Manager -----------------------------------------

escalation <- data %>% 
    group_by(agent_experience) %>%
    dplyr::summarise(manager_calls = sum(contact_manager_flg),
                     contacts = n(),
                     manager_call = sum(contact_manager_flg) / n() * 100) %>%
    na.omit()
  
  escalation %>%
    ggplot() +
    geom_bar(aes(x = agent_experience,
                 y = manager_call), width = 0.3, fill = "tomato", 
             stat = "identity") +
    theme_minimal() +
    labs(title = "Manager's involvment based on Agent's experience", 
         y = "Manager calling percentage", x = "Agent's Experience in years") +
    theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
          axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
          axis.title = element_text(size = 10, face = "bold", color = "#444444"),
          plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"), 
          legend.text = element_text(face = "bold", color = "#444444"), 
          legend.title = element_text(face = "bold", color = "#444444"),
          panel.grid.major = element_blank())  
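
As a possible extension of the manager-intervention-by-topic-category idea above, a minimal sketch; the name manager_by_topic is hypothetical.

# Sketch: manager intervention rate by topic category, to identify categories
# where agents may need additional training
manager_by_topic <- data %>%
  filter(!is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   manager_call_pct = sum(contact_manager_flg, na.rm = TRUE) / n() * 100) %>%
  arrange(desc(manager_call_pct))

head(manager_by_topic, 10)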

Ineffective Diagnostic Usage

The plot below depicts the ineffective diagnostics usage percentage vs. the agent’s experience. This percentage is the number of contacts where diagnostics were used ineffectively, divided by the total number of contacts (whether diagnostics were used effectively, used ineffectively, not used, or not considered). We can see that ineffective usage is higher for agents with less experience; thus, less experienced agents require more training.

\[IneffectiveUsagePercentage = \frac{\#\{InEffective\_Usage\}}{\#\{Effective\_Usage\} + \#\{InEffective\_Usage\} + \#\{Not\_Considered\} + \#\{Not\_USED\}} \times 100\]

# Diagnosis ------------------------------------------------

  diag_data <- data %>%
    group_by(agent_experience, diagnostics) %>%
    dplyr::summarise(diag_count = n()) %>%
    spread(diagnostics,diag_count) %>%
    na.omit()
  
  diag_data <- diag_data %>%
    mutate(ineff_ratio = (InEffective_Usage / 
    sum(Effective_Usage,InEffective_Usage,Not_Considered,Not_USED))*100)
  
  diag_data %>%
  ggplot() +
    geom_bar(aes(x = agent_experience,
                   y = ineff_ratio), width = 0.3, fill = "tomato", 
             stat = "identity") +
    theme_minimal() +
    labs(title = "Ineffective Diagnostic Usage", 
         y = "Ineffective Usage Percentage", x = "Agent Experience in years") +
    theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
          axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
          axis.title = element_text(size = 10, face = "bold", color = "#444444"),
          plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"), 
          legend.text = element_text(face = "bold", color = "#444444"), 
          legend.title = element_text(face = "bold", color = "#444444"),
          panel.grid.major = element_blank())  


Survival Analysis

How much time does it take before an issue related to a product occurs? Answering this requires an analysis of time-to-event data, the event in our case being an issue within a specific topic category for a given product. This kind of time-to-event analysis is called survival analysis.


# Time to issue  analysis for each combination of product type and topic category


# Considering only rows where the contact week is on or after the manufacturing week; ignoring rows where the contact week is earlier than the manufacturing week
  

data <- data %>%
    mutate(time_toissue = contact_wk - mnfture_wk) #Creating new variable to calculate the number of weeks for the issue to arise
# Function plot the survival analysis for different product type and topic category combination

survival_analysis <- function(product, top_cat) {
  
  temp_surv <- data %>% 
    filter(time_toissue >= 0 & product_type == product,
           topic_category %in%  top_cat)  
  
  
  if (nrow(temp_surv) != 0) {
    
    km_temp <- survfit(Surv(time_toissue) ~ topic_category, 
                       data = temp_surv)
    
    ggsurv <- ggsurvplot(km_temp, 
                         data = temp_surv,surv.median.line = "hv",
                         conf.int = TRUE)
    ggsurv$plot + theme_bw() + facet_wrap(~strata) + theme_minimal()
      
    
  } else {
    print("no records")
    
  }
  
}


#Creating temporary variables to store product type and topic categories

prod <- c("Laptops","Desktops","Other Electronics")
topic_cat <- na.omit(unique(data$topic_category))  
Survival Analysis for Laptop

The plots below show the survival analysis of the different topic categories; the same analysis is repeated for all three product types, starting with laptops. Each survival plot shows the survival probability vs. the number of weeks it takes for an issue to occur, for each topic category. A line in the plot highlights the median number of weeks until a problem occurs. This median can be used to plan inventory by incorporating it into a forecasting model. It can also give us a time window within which to proactively reach out to customers and ask whether they are facing an issue with a specific topic category, improving the customer experience (a sketch for extracting the medians follows the plot calls below).

# Laptops ----

survival_analysis(prod[1], topic_cat[1:12])

survival_analysis(prod[1], topic_cat[13:24])

survival_analysis(prod[1], topic_cat[25:36])

survival_analysis(prod[1], topic_cat[37:49])

survival_analysis(prod[1], topic_cat[50:63])
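
As a possible follow-up to the median lines shown above, the median number of weeks to issue per topic category could be extracted and fed into an inventory forecast or a proactive-outreach window. A minimal sketch, assuming the same filtering and survfit call as in survival_analysis; the helper name median_time_to_issue is hypothetical.

# Sketch: median weeks to issue per topic category for a given product type
median_time_to_issue <- function(product) {
  surv_data <- data %>%
    filter(time_toissue >= 0, product_type == product, !is.na(topic_category))
  fit <- survfit(Surv(time_toissue) ~ topic_category, data = surv_data)
  data.frame(
    topic_category = gsub("topic_category=", "", names(fit$strata)),
    median_weeks   = unname(summary(fit)$table[, "median"])
  ) %>% arrange(median_weeks)
}

# median_time_to_issue("Laptops")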

Survival Analysis for Desktop
# Desktop ---

survival_analysis(prod[2], topic_cat[1:12])

survival_analysis(prod[2], topic_cat[13:25])

survival_analysis(prod[2], topic_cat[26:37])

survival_analysis(prod[2], topic_cat[c(38:47,51,52,56,57,61)])

# Not considering topic categories with only one record, such as Miracast under Desktops
Survival Analysis for Other Electronics
# Other Electronics 

survival_analysis(prod[3], topic_cat[1:12])

survival_analysis(prod[3], topic_cat[c(13:26,29,30)])

survival_analysis(prod[3], topic_cat[c(32:33,35:36,39,41:49,51:52,56:63)])


Contact Preference by Region

The plot below shows the percentage breakdown of contact types for the three regions. We can see that ‘Voice’ dominates in all three regions, meaning customers are still more comfortable calling than typing. The plot also shows that Hogwarts accounts for the largest share of issues, as seen in its Voice and Chat percentages. This might simply be because more units were sold in that region; if not, further analysis is needed on why that region generates more issues (a first-step sketch follows the plot code below).

# Contact Preference by region-------

contact_preference <- data %>%
  group_by(region, contact_type) %>%
  dplyr::summarise(percent_cntct_medium = (n() / nrow(data)) * 100) %>%
  na.omit()

contact_preference %>%
  ggplot(aes(x = reorder(contact_type, percent_cntct_medium), 
             y = percent_cntct_medium, 
             fill = as.factor(region))) +
  geom_bar(stat = "identity", width = 0.3) +
  facet_wrap(~as.factor(region), scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        axis.text.y = element_text(size = 8)) +
  scale_fill_manual(name = "Region", 
                    values = c("Hogwarts" = "#3039DC", 
                               "Middle Earth" = "#2D9A2D",
                               "Milky Way" = "#FD0006")) +
  labs(title = "Contact Preference", 
         y = "Percentage", x = "Communication Mode") +
  coord_flip() 
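
As a possible first step toward the “why” question above, a sketch of issue volume and repeat-visit rate by region and product type; absolute volume would still need to be normalised by units sold, which is not in this dataset. The name region_issues is hypothetical.

# Sketch: issue volume and repeat-visit rate by region and product type
region_issues <- data %>%
  filter(!is.na(region), !is.na(product_type)) %>%
  group_by(region, product_type) %>%
  dplyr::summarise(issues = n(),
                   repeat_pct = sum(repeat_ct > 0, na.rm = TRUE) / n() * 100) %>%
  arrange(region, desc(issues))

region_issues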


Future Scope of Work - Machine Learning

Considering the time limit, it was not possible to dig deeper and perform any modelling. However, the following modelling work is suggested for the future: