library(readr) # for reading the csv file
library(tidyverse) # for data manipulation and other needs
library(ggplot2) # for visualization
library(survival) # for survival analysis
library(survminer) # for survival analysis plots
library(DT) # for displaying tables
Dell provides after-sales service to customers who purchase its products, which generates a large amount of data related to tech support, warranties, repairs, and so on. The focus of this challenge is not only to explore the dataset and produce meaningful results, but also to lay out ideas for the future scope of work.
Dataset
Variables and their description:
* Asst_id - an identifier for each individual machine sold
* Product_type - class of product that describes the asset
* Region - region where the asset is located
* Country - country where the asset is located
* Mnfcture_wk - week when the product was manufactured
* Contract_st - week when the warranty became active
* Contract_end - week when the warranty expires
* Contact_wk - week when the customer contacted Dell about a problem
* Contact_type - way that the customer contacted Dell
* Issue_type - type of problem identified by the customer
* Topic_category - type of problem as classified by the tech support agent
* Parts_sent - what parts were sent to fix the problem
* Repair_type - if a part was required, this is a hard repair; otherwise, a soft repair
* Repeat_ct - how many additional visits were required to fix the problem, past the first one
* Parts_ct - how many parts were sent to fix the problem
* Agent_tenure_indays - how long the tech support agent has worked in Dell tech support
* Contact_manager_flg - did the tech support agent have to bring in a manager to solve the problem
* Diagnostics - were agents compliant with diagnostic usage
* Repeat_parts_sent - which parts were sent on additional visits

Before we begin the analysis, it is essential to clean the data so that no bias is introduced into the study. We will perform the various cleaning steps in a systematic manner.
We will begin our analysis with data exploration. We will look at the variables and their types, the number of rows in the dataset, the number of unique values for each variable, the number of missing values for each variable, duplicates, and so on. It is also essential to remove or impute missing values, especially when we need to perform any modeling. As no modeling has been performed for this analysis, not all of the missing values have been dealt with. Some algorithms, like linear regression, do not handle missing values, while others, like tree-based methods, do. Thus, to make an apples-to-apples comparison, it is essential to handle the missing values before doing any modeling.
data <- read_csv("data/DellGSPE_DSrole_dataset.csv") # Reading data
names(data) # Names of variables
## [1] "asst_id" "product_type" "mnfture_wk"
## [4] "contract_st" "contract_end" "contact_wk"
## [7] "contact_type" "issue_type" "topic_category"
## [10] "parts_sent" "repair_type" "repeat_ct"
## [13] "parts_ct" "agent_tenure_indays" "contact_manager_flg"
## [16] "diagnostics" "repeat_parts_sent" "region"
## [19] "country"
nrow(data) # Number of rows in the dataset
## [1] 100000
glimpse(data) # summarising the data
## Observations: 100,000
## Variables: 19
## $ asst_id <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ product_type <chr> "Laptops", "Laptops", "Laptops", "Laptops"...
## $ mnfture_wk <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_st <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_end <dbl> 202125, 201949, 202222, 201908, 202032, 20...
## $ contact_wk <dbl> 201840, 201840, 201840, 201840, 201840, 20...
## $ contact_type <chr> "Voice", "Voice", "Voice", "VOICE", "Voice...
## $ issue_type <chr> NA, NA, NA, "Hard Drive", NA, "Fee Based S...
## $ topic_category <chr> NA, NA, NA, "Booting", NA, "General Querie...
## $ parts_sent <chr> "Hard Drive, Documentation(2)", "Speaker, ...
## $ repair_type <chr> "Hard", "Hard", "Hard", "Hard", "Hard", "S...
## $ repeat_ct <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ parts_ct <dbl> 3, 2, 1, 1, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, ...
## $ agent_tenure_indays <dbl> 1018, 1018, 1018, 298, 1018, 802, 1757, 81...
## $ contact_manager_flg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ diagnostics <chr> "Not_USED", "Not_USED", "Not_USED", "Effec...
## $ repeat_parts_sent <chr> NA, NA, "Motherboard, Miscellaneous(2), Ha...
## $ region <chr> "Hogwarts", "Hogwarts", "Hogwarts", "Hogwa...
## $ country <chr> NA, "Zonko's Joke Shop", "Zonko's Joke Sho...
sum(complete.cases(data)) # Checking for rows with no missing values
## [1] 6109
sum(duplicated(data)) # Checking duplicates
## [1] 13464
data <- data[-(which(duplicated(data))),] # Given we don't have a timestamp, it is fair to assume these rows are true duplicates. Removing them to avoid double counting
Checking for variables that are constant or zero so they can be removed, and looking for variables that have missing values. We can see that contact_wk contains only one week of data, so it can be removed as it serves no purpose. Further analysis shows missing values across several variables. For example, product_type should contain only three unique values, but it includes four - “Laptops”, “Desktops”, “Other Electronics”, and “NA”. “NA” implies missing values, and these are dealt with in the next section.
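Constant columns such as contact_wk can also be identified programmatically rather than by inspection; the snippet below is a minimal sketch of that check (the column is only reported here, not dropped).
# Sketch: list columns with a single unique value (e.g. contact_wk)
constant_cols <- data %>%
  summarise_all(n_distinct) %>%
  gather(key = "Variable", value = "n_unique") %>%
  filter(n_unique <= 1) %>%
  pull(Variable)
constant_cols # such columns could be dropped with select(-one_of(constant_cols))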
The chart below shows the number of unique values for each column. In particular, the asst_id column has 82,442 unique IDs, implying that 82,442 machines reported issues in the week in question.
# Plotting the graph to check number of unique values
data %>%
map_dfr(n_distinct) %>%
gather() %>%
ggplot(aes(reorder(key, -value), value)) +
geom_bar(stat = "identity", fill = "tomato") +
scale_y_log10(breaks = c(5, 50, 250, 500, 1000, 10000, 50000)) +
geom_text(aes(label = value), vjust = 1.6, color = "white", size = 3.5) +
theme_minimal() +
labs(x = "features", y = "Number of unique values") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(panel.grid = element_blank())
The table below shows the number of missing values for each column. Please note that repeat_parts_sent and parts_sent have a large number of missing values, primarily because software issues do not require any transfer of parts.
For this exercise, the missing values in the column topic_category have been imputed based on parts_sent column which can be seen in the next section.
# Checking the percentage of missing values for all the variables
is_na_val <- function(x) x %in% c("not available", "<NA>", "Unknown",
"NA"," ","")
data <- data %>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))
missing_values <- data %>% summarise_all(funs(sum(is.na(.)))) %>%
gather(key = "Variable", value = "missing_count") %>%
arrange(desc(missing_count))
#Displaying the missing values in a table for each variable
datatable(
missing_values, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Performing various data cleaning activities:
* Changing the case of the text under the contact_type field to remove redundancy
* Setting repeat_ct to 1 when the repeat_parts_sent field is not empty and repeat_ct is zero. The imputation has been done based on the distribution of repeat_ct.
* Converting repair_type to ‘Soft’ when both the repeat_parts_sent and parts_sent fields are empty
* Imputing missing topic_category values based on parts_sent. The topic_category can differ even when parts_sent is the same, so a missing topic_category is attributed to the most frequent topic category for each combination of parts_sent. This reduces the missing count for topic_category by approximately 5,000.
unique(data$contact_type)
## [1] "Voice" "VOICE" "CHAT" "EMAIL" NA
# Changing Voice to VOICE
data$contact_type <- toupper(data$contact_type)
unique(data$contact_type)
## [1] "VOICE" "CHAT" "EMAIL" NA
# Repeat count is zero and repeat part is not empty
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0)
## [1] 5293
# Checking for distribution of repeat count when it is not zero and repeat part is not empty
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct != 0)
## [1] 6270
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 1)
## [1] 6269
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 2)
## [1] 1
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 3)
## [1] 0
# Imputing repeat count with 1 when the repeat part is not blank. Based on the above distribution, almost all non-zero repeat counts are 1, so imputing with 1
# Assuming an additional visit was made even if the repeat part sent is the same as the original part sent
data <- data %>%
mutate(repeat_ct = ifelse((!is.na(repeat_parts_sent) &
repeat_ct == 0),1,repeat_ct))
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0) #Repeat count is no longer 0 when repeat part is not empty
## [1] 0
# As both parts_sent and repeat_parts_sent are empty, converting the repair type from Hard to Soft
sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent) &
data$repair_type == "Hard"))
## [1] 3769
data <- data %>%
mutate(repair_type = ifelse((is.na(data$repeat_parts_sent) &
is.na(data$parts_sent) &
data$repair_type == "Hard"),
"Soft",repair_type))
sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent) &
data$repair_type == "Hard"))
## [1] 0
#Imputing topic_category based on the parts_sent column
#If there are multiple topic categories for the same parts_sent, the missing topic category is imputed with the most frequent topic category for that parts_sent
#Function to extract the most frequent topic category for each combination of parts_sent
impute_func <- function(key){
  data_subset <- data %>% filter(parts_sent == key) %>%
    select(topic_category, parts_sent) %>%
    group_by(topic_category) %>%
    mutate(count = n())
  max_count <- max(data_subset$count, na.rm = TRUE)
  impute_value <- data_subset %>% filter(count == max_count) %>%
    .$topic_category %>% unique()
  impute_value[1] # return a single category; ties are broken by taking the first
}
#Calling the function to extract the most frequent topic category for each combination of parts_sent and storing it in the lookup_table variable
parts_sent_var <- na.omit(unique(data$parts_sent))
impute_topic <- vector()
parts_sent <- vector()
for (i in seq_along(parts_sent_var)) {
  impute_topic[i] <- impute_func(parts_sent_var[i])
  parts_sent[i] <- parts_sent_var[i]
}
lookup_table <- as.data.frame(cbind(parts_sent, impute_topic))
#Converting parts sent and impute_topic to character
lookup_table$parts_sent <- as.character(lookup_table$parts_sent)
lookup_table$impute_topic <- as.character(lookup_table$impute_topic)
lookup_table <- na.omit(lookup_table)
#Imputing missing values with the most frequent topic category depending on the parts sent
#for (i in 1:nrow(data)) {
# if (is.na(data[i,9]) & !is.na(data[i,10])) {
# for (j in 1:nrow(lookup_table)) {
# if (data[i,10] == lookup_table[j,1]) {
# data[i,9] <- lookup_table[j,2]
# } else {
# }
# }
# } else {
# data[i,9] <- data[i,9]
# }
#}
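The element-wise loop above was left commented out because it is slow on a dataset of this size. The same imputation can be expressed as a single join against lookup_table; the sketch below fills topic_category only where it is missing and a lookup value is available.
# Vectorized imputation: join the lookup table and fill missing topic_category
data <- data %>%
  left_join(lookup_table, by = "parts_sent") %>%
  mutate(topic_category = ifelse(is.na(topic_category) & !is.na(impute_topic),
                                 impute_topic, topic_category)) %>%
  select(-impute_topic)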
Exploring the data visually helps us see what the data can tell us beyond formal modeling or hypothesis testing.
The plot below depicts the issue count for each topic category across all product types. This analysis helps us understand the inventory that needs to be maintained, based on the number of issues within each topic category.
# Problematic Parts
problematic_parts <- data %>%
group_by(repair_type, product_type,topic_category) %>%
dplyr::summarise(issue_count = n()) %>%
arrange(repair_type, product_type, desc(issue_count)) %>%
na.omit()
#Function definition
problem_parts <- function(repair_typ) {
problematic_parts %>%
filter(repair_type == repair_typ) %>%
ggplot(aes(x = reorder(topic_category, issue_count),
y = issue_count,
fill = as.factor(product_type))) +
geom_bar(stat = "identity") +
facet_wrap(~as.factor(product_type), scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(size = 6)) +
scale_fill_manual(name = "Product Type",
values = c("Laptops" = "#3039DC",
"Desktops" = "#2D9A2D",
"Other Electronics" = "#FD0006")) +
labs(title = "Issue Count by Topic Category",
y = "Issue Count", x = "Topic Category") +
coord_flip()
}
problem_parts("Hard") #Calling the function
The plot below depicts the issue count for each topic category across all product types. As we are looking at the soft repair type, the focus is not on inventory but on the kind of training agents need most, so that they become highly skilled at solving a particular type of issue. If agents are qualified to handle only specific kinds of problems, the number of agents can also be scaled proportionately to the number of issues within each topic category.
problem_parts("Soft") #Calling the function
The plot below depicts agent experience in years vs. the share of repeat visits, i.e. the number of contacts requiring additional visits divided by all contacts (those with and without additional visits). A repeat visit means the repeat count is one, two, or three additional visits. We can see that the repeat-visit percentage generally increases with experience. It may be surprising that the percentage is higher, not lower, for more experienced agents; it might be that the complex issues requiring repeat visits are routed to more experienced people. It would also be useful to look at the topic categories that need additional visits (see the sketch after the plot below), so that focused training can be provided.
\[RepeatVisitPercentage = \frac{\sum_{i=1}^{n} \mathbf{1}(RepeatCount_i \in \{1, 2, 3\})}{\sum_{i=1}^{n} \mathbf{1}(RepeatCount_i \in \{0, 1, 2, 3\})} \times 100\]
#Creating new variable indicating agent experience as 0-1, 1-2 and so on
data <- data %>%
mutate(agent_experience = case_when(
is.na(agent_tenure_indays) ~ NA_character_,
(agent_tenure_indays > 0 & agent_tenure_indays <= 365) ~ "0-1",
(agent_tenure_indays > 365 & agent_tenure_indays <= 730) ~ "1-2",
(agent_tenure_indays > 730 & agent_tenure_indays <= 1095) ~ "2-3",
(agent_tenure_indays > 1095 & agent_tenure_indays <= 1460) ~ "3-4",
TRUE ~ "4+"
))
data$agent_experience <- as.factor(data$agent_experience)
agent_data <- data %>%
filter(repair_type == "Hard") %>%
group_by(agent_experience, repeat_ct) %>%
dplyr::summarise(add_visit = n()) %>%
spread(repeat_ct,add_visit)
agent_data <- agent_data %>%
mutate(non_zero = sum(`1`,`2`,`3`, na.rm = TRUE))
agent_data <- agent_data %>%
mutate(repeat_ratio = (non_zero / (`0` + non_zero)) * 100)
agent_data %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = repeat_ratio), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Repeat to Non-Repeat Visits by Experience",
y = "Repeat Visit/ Non - Repeat percentage", x = "Agent Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
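As suggested above, it is also worth looking at which topic categories drive the additional visits; the sketch below computes, for hard repairs, the share of contacts with at least one repeat visit per topic category, reusing columns already in the data.
# Sketch: share of hard-repair contacts needing at least one additional visit, by topic category
repeat_by_topic <- data %>%
  filter(repair_type == "Hard", !is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   repeat_share = mean(repeat_ct > 0, na.rm = TRUE) * 100) %>%
  arrange(desc(repeat_share))
head(repeat_by_topic, 10)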
The plot below shows the number of times a manager was called vs. the agent’s experience. We can see that the less experienced an agent is, the more often a manager was called. Another analysis that can be done is manager intervention vs. topic category (sketched after the plot below); this will help us identify the categories in which agents need more training.
# Contact Manager -----------------------------------------
escalation <- data %>%
  group_by(agent_experience) %>%
  dplyr::summarise(manager_contacts = sum(contact_manager_flg),
                   contacts = n(),
                   manager_call = mean(contact_manager_flg) * 100) %>%
  na.omit()
escalation %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = manager_call), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Manager's involvment based on Agent's experience",
y = "Manager calling percentage", x = "Agent's Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
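The manager-intervention-by-topic-category analysis mentioned above can be sketched in the same way; this is an outline only, again using columns already present in the data.
# Sketch: manager escalation rate by topic category
escalation_by_topic <- data %>%
  filter(!is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   escalation_rate = mean(contact_manager_flg, na.rm = TRUE) * 100) %>%
  arrange(desc(escalation_rate))
head(escalation_by_topic, 10)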
The plot below depicts the ineffective diagnostics usage percentage vs. the agent’s experience. This is the number of contacts where diagnostics were used ineffectively, divided by the total number of contacts (effective usage, ineffective usage, not used, and not considered). We can see that ineffective usage is higher for agents with less experience; thus, less experienced agents require more training.
\[IneffectiveUsagePercentage = \frac{\sum_{i=1}^{n} \mathbf{1}(Usage_i = IneffectiveUsage)}{\sum_{i=1}^{n} \mathbf{1}(Usage_i \in \{IneffectiveUsage, EffectiveUsage, NotUsed, NotConsidered\})} \times 100\]
# Diagnosis ------------------------------------------------
diag_data <- data %>%
group_by(agent_experience, diagnostics) %>%
dplyr::summarise(diag_count = n()) %>%
spread(diagnostics,diag_count) %>%
na.omit()
diag_data <- diag_data %>%
mutate(ineff_ratio = (InEffective_Usage /
sum(Effective_Usage,InEffective_Usage,Not_Considered,Not_USED))*100)
diag_data %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = ineff_ratio), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Ineffective Diagnostic Usage",
y = "Ineffective Usage Percentage", x = "Agent Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
How much time passes before an issue related to a product occurs? Answering this requires an analysis of time-to-event data, the event in our case being an issue related to a specific topic category for a given product. This kind of time-to-event analysis is called survival analysis.
# Time to issue analysis for each combination of product type and topic category
# Considering only rows where the contact week is not earlier than the manufacturing week; rows where the contact week precedes the manufacturing week are ignored
data <- data %>%
mutate(time_toissue = contact_wk - mnfture_wk) #Creating new variable to calculate the number of weeks for the issue to arise
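One caveat with the line above: contact_wk and mnfture_wk are YYYYWW codes (e.g. 201840), so direct subtraction overstates the gap whenever the two weeks fall in different years. A possible fix, sketched below under the simplifying assumption of 52 weeks per year, is to convert the codes to a running week index before differencing; the analysis here keeps the simple difference used above.
# Sketch, assuming 52 weeks per year: convert a YYYYWW code to a running week index
week_index <- function(yyyyww) (yyyyww %/% 100) * 52 + (yyyyww %% 100)
# time_toissue could then be computed as:
# data <- data %>% mutate(time_toissue = week_index(contact_wk) - week_index(mnfture_wk))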
# Function to plot the survival curves for a given product type and set of topic categories
survival_analysis <- function(product, top_cat) {
temp_surv <- data %>%
filter(time_toissue >= 0 & product_type == product,
topic_category %in% top_cat)
if (nrow(temp_surv) != 0) {
km_temp <- survfit(Surv(time_toissue) ~ topic_category,
data = temp_surv)
ggsurv <- ggsurvplot(km_temp,
data = temp_surv,surv.median.line = "hv",
conf.int = TRUE)
ggsurv$plot + theme_bw() + facet_wrap(~strata) + theme_minimal()
} else {
print("no records")
}
}
#Creating temporary variables to store product type and topic categories
prod <- c("Laptops","Desktops","Other Electronics")
topic_cat <- na.omit(unique(data$topic_category))
The plots below show the survival analysis for different topic categories across all three product types, starting with laptops. Each survival plot shows, for a topic category, the probability of not yet having an issue vs. the number of weeks since manufacture, with a line highlighting the median number of weeks for a problem to occur. This can be used to plan inventory by incorporating it into a forecasting model. It can also give us a time window before which we should reach out to the customer, improving the customer experience by asking whether they are facing an issue with a specific topic category.
# Laptops ----
survival_analysis(prod[1], topic_cat[1:12])
survival_analysis(prod[1], topic_cat[13:24])
survival_analysis(prod[1], topic_cat[25:36])
survival_analysis(prod[1], topic_cat[37:49])
survival_analysis(prod[1], topic_cat[50:63])
# Desktop ---
survival_analysis(prod[2], topic_cat[1:12])
survival_analysis(prod[2], topic_cat[13:25])
survival_analysis(prod[2], topic_cat[26:37])
survival_analysis(prod[2], topic_cat[c(38:47,51,52,56,57,61)])
# Not considering topic categories with only one record, like Miracast under Desktops
# Other Electronics
survival_analysis(prod[3], topic_cat[1:12])
survival_analysis(prod[3], topic_cat[c(13:26,29,30)])
survival_analysis(prod[3], topic_cat[c(32:33,35:36,39,41:49,51:52,56:63)])
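To carry the medians highlighted in the plots into an inventory forecast, they can also be extracted directly from the fitted survfit object rather than read off the charts; below is a minimal sketch for laptops using summary()$table from the survival package.
# Sketch: median number of weeks to first issue per topic category (Laptops)
laptop_surv <- data %>%
  filter(time_toissue >= 0, product_type == "Laptops", !is.na(topic_category))
km_laptop <- survfit(Surv(time_toissue) ~ topic_category, data = laptop_surv)
median_weeks <- summary(km_laptop)$table[, "median"]
head(sort(median_weeks)) # topic categories with the shortest median time to issue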
The plot below shows the percentage breakdown of contact types for all three regions. We can see that ‘Voice’ dominates in every region, meaning customers are still more comfortable using the phone than typing. The plot also shows that Hogwarts has the largest number of issues, as seen from its Voice and Chat percentages. It might simply be that more machines were sold in that region; if not, further analysis is needed into why that region generates more issues.
# Contact Preference by region-------
contact_preference <- data %>%
group_by(region, contact_type) %>%
dplyr::summarise(percent_cntct_medium = (n() / nrow(data)) * 100) %>%
na.omit()
contact_preference %>%
ggplot(aes(x = reorder(contact_type, percent_cntct_medium),
y = percent_cntct_medium,
fill = as.factor(region))) +
geom_bar(stat = "identity", width = 0.3) +
facet_wrap(~as.factor(region), scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(size = 8)) +
scale_fill_manual(name = "Region",
values = c("Hogwarts" = "#3039DC",
"Middle Earth" = "#2D9A2D",
"Milky Way" = "#FD0006")) +
labs(title = "Contact Preference",
y = "Percentage", x = "Communication Mode") +
coord_flip()
Considering the time limit, it was not possible to dig deeper and do any modelling. However, the following are suggestions for modelling that could be conducted in the future: