library(readr) # for reading the csv file
library(tidyverse) # for data manipulation and other needs
library(ggplot2) # for visualization
library(survival) # for survival analysis
library(survminer) # for survival analysis plots
library(DT) # for displaying tables
Dell provides after-sales service to customers who purchase its products, which generates a large amount of data related to tech support, warranties, repairs, and so on. The focus of this challenge is not only to explore the dataset and produce meaningful results, but also to lay out ideas for the future scope of work.
Dataset
Variables and their description:
* Asst_id - an identifier for each individual machine sold
* Product_type - class of product that describes the asset
* Region - region where the asset is located
* Country - country where the asset is located
* Mnfcture_wk - week when the product was manufactured
* Contract_st - week when the warranty became active
* Contract_end - week when the warranty expires
* Contact_wk - week when the customer contacted Dell about a problem
* Contact_type - way that the customer contacted Dell
* Issue_type - type of problem identified by the customer
* Topic_category - type of problem as classified by the tech support agent
* Parts_sent - what parts were sent to fix the problem
* Repair_type - if a part was required, this is a hard repair; otherwise, a soft repair
* Repeat_ct - how many additional visits were required to fix the problem, past the first one
* Parts_ct - how many parts were sent to fix the problem
* Agent_tenure_indays - how long the tech support agent has worked in Dell tech support
* Contact_manager_flg - did the tech support agent have to bring in a manager to solve the problem
* Diagnostics - were agents compliant with diagnostic usage
* Repeat_parts_sent - which parts were sent on additional visits

Before we begin the analysis, it is essential to clean the data so that no bias is introduced into the study. We will perform the various cleaning steps in a systematic manner.
We will begin our analysis with data exploration. We will look at the variables and their types, the number of rows in the dataset, the number of unique values for each variable, the number of missing values for each variable, duplicates, and so on. It is also essential to remove or impute missing values, especially when we need to perform any modeling. As no modeling has been performed for this analysis, not all of the missing values have been dealt with. Some algorithms, like linear regression, do not handle missing values, while others, like tree-based methods, do. Thus, to make an apples-to-apples comparison, it is essential to handle the missing values before doing any modeling.
data <- read_csv("data/DellGSPE_DSrole_dataset.csv") # Reading data
names(data) # Names of variables
## [1] "asst_id" "product_type" "mnfture_wk"
## [4] "contract_st" "contract_end" "contact_wk"
## [7] "contact_type" "issue_type" "topic_category"
## [10] "parts_sent" "repair_type" "repeat_ct"
## [13] "parts_ct" "agent_tenure_indays" "contact_manager_flg"
## [16] "diagnostics" "repeat_parts_sent" "region"
## [19] "country"
nrow(data) # Number of rows in the dataset
## [1] 100000
glimpse(data) # summarising the data
## Observations: 100,000
## Variables: 19
## $ asst_id <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ product_type <chr> "Laptops", "Laptops", "Laptops", "Laptops"...
## $ mnfture_wk <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_st <dbl> 201726, 201650, 201821, 201608, 201732, 20...
## $ contract_end <dbl> 202125, 201949, 202222, 201908, 202032, 20...
## $ contact_wk <dbl> 201840, 201840, 201840, 201840, 201840, 20...
## $ contact_type <chr> "Voice", "Voice", "Voice", "VOICE", "Voice...
## $ issue_type <chr> NA, NA, NA, "Hard Drive", NA, "Fee Based S...
## $ topic_category <chr> NA, NA, NA, "Booting", NA, "General Querie...
## $ parts_sent <chr> "Hard Drive, Documentation(2)", "Speaker, ...
## $ repair_type <chr> "Hard", "Hard", "Hard", "Hard", "Hard", "S...
## $ repeat_ct <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ parts_ct <dbl> 3, 2, 1, 1, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, ...
## $ agent_tenure_indays <dbl> 1018, 1018, 1018, 298, 1018, 802, 1757, 81...
## $ contact_manager_flg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ diagnostics <chr> "Not_USED", "Not_USED", "Not_USED", "Effec...
## $ repeat_parts_sent <chr> NA, NA, "Motherboard, Miscellaneous(2), Ha...
## $ region <chr> "Hogwarts", "Hogwarts", "Hogwarts", "Hogwa...
## $ country <chr> NA, "Zonko's Joke Shop", "Zonko's Joke Sho...
sum(complete.cases(data)) # Checking for rows with no missing values
## [1] 6109
sum(duplicated(data)) # Checking duplicates
## [1] 13464
data <- data[-(which(duplicated(data))),] # Given we don't have a timestamp, it is fair to assume these rows are true duplicates. Removing them to avoid double counting
Checking for variables that are constant or zero so they can be removed, and looking for variables that have missing values. We can see that contact_wk contains only one week of data, so it can be removed as it serves no purpose. Further analysis shows missing values across several variables. For example, product_type should contain only three unique values, but it includes four - “Laptops”, “Desktops”, “Other Electronics”, and “NA”. “NA” implies missing values, and these are dealt with in the next section.
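Constant columns such as contact_wk can also be identified programmatically rather than by inspection; the snippet below is a minimal sketch of that check (the column is only reported here, not dropped).
# Sketch: list columns with a single unique value (e.g. contact_wk)
constant_cols <- data %>%
  summarise_all(n_distinct) %>%
  gather(key = "Variable", value = "n_unique") %>%
  filter(n_unique <= 1) %>%
  pull(Variable)
constant_cols # such columns could be dropped with select(-one_of(constant_cols))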
The chart below shows the number of unique values for each column. In particular, the asst_id column has 82,442 unique IDs, implying that 82,442 machines reported issues in the week in question.
# Plotting the graph to check number of unique values
data %>%
map_dfr(n_distinct) %>%
gather() %>%
ggplot(aes(reorder(key, -value), value)) +
geom_bar(stat = "identity", fill = "tomato") +
scale_y_log10(breaks = c(5, 50, 250, 500, 1000, 10000, 50000)) +
geom_text(aes(label = value), vjust = 1.6, color = "white", size = 3.5) +
theme_minimal() +
labs(x = "features", y = "Number of unique values") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(panel.grid = element_blank())
The table below shows the number of missing values for each column. Please note that repeat_parts_sent and parts_sent have a large number of missing values, primarily because software issues do not require any transfer of parts.
For this exercise, the missing values in the column topic_category have been imputed based on parts_sent column which can be seen in the next section.
# Checking the percentage of missing values for all the variables
is_na_val <- function(x) x %in% c("not available", "<NA>", "Unknown",
"NA"," ","")
data <- data %>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))
missing_values <- data %>% summarise_all(funs(sum(is.na(.)))) %>%
gather(key = "Variable", value = "missing_count") %>%
arrange(desc(missing_count))
#Displaying the missing values in a table for each variable
datatable(
missing_values, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Performing various data cleaning activities:
* Changing the case of the text under the contact_type field to remove redundancy
* Setting repeat_ct to 1 when the repeat_parts_sent field is not empty and repeat_ct is zero. The imputation has been done based on the distribution of repeat_ct.
* Converting repair_type to ‘Soft’ when both the repeat_parts_sent and parts_sent fields are empty
* Imputing missing topic_category values based on parts_sent. The topic_category can differ even when parts_sent is the same, so a missing topic_category is attributed to the most frequent topic category for each combination of parts_sent. This reduces the missing count for topic_category by approximately 5,000.
unique(data$contact_type)
## [1] "Voice" "VOICE" "CHAT" "EMAIL" NA
# Changing Voice to VOICE
data$contact_type <- toupper(data$contact_type)
unique(data$contact_type)
## [1] "VOICE" "CHAT" "EMAIL" NA
# Repeat count is zero and repeat part is not empty
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0)
## [1] 5293
# Checking for distribution of repeat count when it is not zero and repeat part is not empty
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct != 0)
## [1] 6270
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 1)
## [1] 6269
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 2)
## [1] 1
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 3)
## [1] 0
# Imputing repeat count with 1 when the repeat part is not blank. Based on the above distribution, almost all non-zero repeat counts are 1, so imputing with 1
# Assuming an additional visit was made even if the repeat part sent is the same as the original part sent
data <- data %>%
mutate(repeat_ct = ifelse((!is.na(repeat_parts_sent) &
repeat_ct == 0),1,repeat_ct))
sum(!is.na(data$repeat_parts_sent) & data$repeat_ct == 0) #Repeat count is no longer 0 when repeat part is not empty
## [1] 0
# As both parts_sent and repeat_parts_sent are empty, converting the repair type from Hard to Soft
sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent) &
data$repair_type == "Hard"))
## [1] 3769
data <- data %>%
mutate(repair_type = ifelse((is.na(data$repeat_parts_sent) &
is.na(data$parts_sent) &
data$repair_type == "Hard"),
"Soft",repair_type))
sum((is.na(data$repeat_parts_sent) & is.na(data$parts_sent) &
data$repair_type == "Hard"))
## [1] 0
#Imputing topic_category based on the parts_sent column
#If there are multiple topic categories for the same parts_sent, the missing topic category is imputed with the most frequent topic category for that parts_sent
#Function to extract the most frequent topic category for each combination of parts_sent
impute_func <- function(key){
  data_subset <- data %>% filter(parts_sent == key) %>%
    select(topic_category, parts_sent) %>%
    group_by(topic_category) %>%
    mutate(count = n())
  max_count <- max(data_subset$count, na.rm = TRUE)
  impute_value <- data_subset %>% filter(count == max_count) %>%
    .$topic_category %>% unique()
  impute_value[1] # return a single category; ties are broken by taking the first
}
#Calling the function to extract the most frequent topic category for each combination of parts_sent and storing it in the lookup_table variable
parts_sent_var <- na.omit(unique(data$parts_sent))
impute_topic <- vector()
parts_sent <- vector()
for (i in seq_along(parts_sent_var)) {
  impute_topic[i] <- impute_func(parts_sent_var[i])
  parts_sent[i] <- parts_sent_var[i]
}
lookup_table <- as.data.frame(cbind(parts_sent, impute_topic))
#Converting parts sent and impute_topic to character
lookup_table$parts_sent <- as.character(lookup_table$parts_sent)
lookup_table$impute_topic <- as.character(lookup_table$impute_topic)
lookup_table <- na.omit(lookup_table)
#Imputing missing values with the most frequent topic category depending on the parts sent
#for (i in 1:nrow(data)) {
# if (is.na(data[i,9]) & !is.na(data[i,10])) {
# for (j in 1:nrow(lookup_table)) {
# if (data[i,10] == lookup_table[j,1]) {
# data[i,9] <- lookup_table[j,2]
# } else {
# }
# }
# } else {
# data[i,9] <- data[i,9]
# }
#}
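The element-wise loop above was left commented out because it is slow on a dataset of this size. The same imputation can be expressed as a single join against lookup_table; the sketch below fills topic_category only where it is missing and a lookup value is available.
# Vectorized imputation: join the lookup table and fill missing topic_category
data <- data %>%
  left_join(lookup_table, by = "parts_sent") %>%
  mutate(topic_category = ifelse(is.na(topic_category) & !is.na(impute_topic),
                                 impute_topic, topic_category)) %>%
  select(-impute_topic)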
Exploring the data visually helps us see what the data can tell us beyond formal modeling or hypothesis testing.
The plot below depicts the issue count for each topic category across all product types. This analysis helps us understand the inventory that needs to be maintained, based on the number of issues within each topic category.
# Problematic Parts
problematic_parts <- data %>%
group_by(repair_type, product_type,topic_category) %>%
dplyr::summarise(issue_count = n()) %>%
arrange(repair_type, product_type, desc(issue_count)) %>%
na.omit()
#Function definition
problem_parts <- function(repair_typ) {
problematic_parts %>%
filter(repair_type == repair_typ) %>%
ggplot(aes(x = reorder(topic_category, issue_count),
y = issue_count,
fill = as.factor(product_type))) +
geom_bar(stat = "identity") +
facet_wrap(~as.factor(product_type), scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(size = 6)) +
scale_fill_manual(name = "Product Type",
values = c("Laptops" = "#3039DC",
"Desktops" = "#2D9A2D",
"Other Electronics" = "#FD0006")) +
labs(title = "Issue Count by Topic Category",
y = "Issue Count", x = "Topic Category") +
coord_flip()
}
problem_parts("Hard") #Calling the function
The plot below depicts the issue count for each topic category across all product types. As we are looking at the soft repair type, the focus is not on inventory but on the kind of training agents need most, so that they become highly skilled at solving a particular type of issue. If agents are qualified to handle only specific kinds of problems, the number of agents can also be scaled proportionately to the number of issues within each topic category.
problem_parts("Soft") #Calling the function
The plot below depicts agent experience in years vs. the share of repeat visits, i.e. the number of contacts requiring additional visits divided by all contacts (those with and without additional visits). A repeat visit means the repeat count is one, two, or three additional visits. We can see that the repeat-visit percentage generally increases with experience. It may be surprising that the percentage is higher, not lower, for more experienced agents; it might be that the complex issues requiring repeat visits are routed to more experienced people. It would also be useful to look at the topic categories that need additional visits (see the sketch after the plot below), so that focused training can be provided.
\[RepeatVisitPercentage = \frac{\sum_{i=1}^{n} \mathbf{1}(RepeatCount_i \in \{1, 2, 3\})}{\sum_{i=1}^{n} \mathbf{1}(RepeatCount_i \in \{0, 1, 2, 3\})} \times 100\]
#Creating new variable indicating agent experience as 0-1, 1-2 and so on
data <- data %>%
mutate(agent_experience = case_when(
is.na(agent_tenure_indays) ~ NA_character_,
(agent_tenure_indays > 0 & agent_tenure_indays <= 365) ~ "0-1",
(agent_tenure_indays > 365 & agent_tenure_indays <= 730) ~ "1-2",
(agent_tenure_indays > 730 & agent_tenure_indays <= 1095) ~ "2-3",
(agent_tenure_indays > 1095 & agent_tenure_indays <= 1460) ~ "3-4",
TRUE ~ "4+"
))
data$agent_experience <- as.factor(data$agent_experience)
agent_data <- data %>%
filter(repair_type == "Hard") %>%
group_by(agent_experience, repeat_ct) %>%
dplyr::summarise(add_visit = n()) %>%
spread(repeat_ct,add_visit)
agent_data <- agent_data %>%
mutate(non_zero = sum(`1`,`2`,`3`, na.rm = TRUE))
agent_data <- agent_data %>%
mutate(repeat_ratio = (non_zero / (`0` + non_zero)) * 100)
agent_data %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = repeat_ratio), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Repeat to Non-Repeat Visits by Experience",
y = "Repeat Visit/ Non - Repeat percentage", x = "Agent Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
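As suggested above, it is also worth looking at which topic categories drive the additional visits; the sketch below computes, for hard repairs, the share of contacts with at least one repeat visit per topic category, reusing columns already in the data.
# Sketch: share of hard-repair contacts needing at least one additional visit, by topic category
repeat_by_topic <- data %>%
  filter(repair_type == "Hard", !is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   repeat_share = mean(repeat_ct > 0, na.rm = TRUE) * 100) %>%
  arrange(desc(repeat_share))
head(repeat_by_topic, 10)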
The plot below shows the number of times a manager was called vs. the agent’s experience. We can see that the less experienced an agent is, the more often a manager was called. Another analysis that can be done is manager intervention vs. topic category (sketched after the plot below); this will help us identify the categories in which agents need more training.
# Contact Manager -----------------------------------------
escalation <- data %>%
  group_by(agent_experience) %>%
  dplyr::summarise(manager_contacts = sum(contact_manager_flg),
                   contacts = n(),
                   manager_call = mean(contact_manager_flg) * 100) %>%
  na.omit()
escalation %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = manager_call), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Manager's involvment based on Agent's experience",
y = "Manager calling percentage", x = "Agent's Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
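The manager-intervention-by-topic-category analysis mentioned above can be sketched in the same way; this is an outline only, again using columns already present in the data.
# Sketch: manager escalation rate by topic category
escalation_by_topic <- data %>%
  filter(!is.na(topic_category)) %>%
  group_by(topic_category) %>%
  dplyr::summarise(contacts = n(),
                   escalation_rate = mean(contact_manager_flg, na.rm = TRUE) * 100) %>%
  arrange(desc(escalation_rate))
head(escalation_by_topic, 10)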
The plot below depicts the ineffective diagnostics usage percentage vs. the agent’s experience. This is the number of contacts where diagnostics were used ineffectively, divided by the total number of contacts (effective usage, ineffective usage, not used, and not considered). We can see that ineffective usage is higher for agents with less experience; thus, less experienced agents require more training.
\[IneffectiveUsagePercentage = \frac{\sum_{i=1}^{n} \mathbf{1}(Usage_i = IneffectiveUsage)}{\sum_{i=1}^{n} \mathbf{1}(Usage_i \in \{IneffectiveUsage, EffectiveUsage, NotUsed, NotConsidered\})} \times 100\]
# Diagnosis ------------------------------------------------
diag_data <- data %>%
group_by(agent_experience, diagnostics) %>%
dplyr::summarise(diag_count = n()) %>%
spread(diagnostics,diag_count) %>%
na.omit()
diag_data <- diag_data %>%
mutate(ineff_ratio = (InEffective_Usage /
sum(Effective_Usage,InEffective_Usage,Not_Considered,Not_USED))*100)
diag_data %>%
ggplot() +
geom_bar(aes(x = agent_experience,
y = ineff_ratio), width = 0.3, fill = "tomato",
stat = "identity") +
theme_minimal() +
labs(title = "Ineffective Diagnostic Usage",
y = "Ineffective Usage Percentage", x = "Agent Experience in years") +
theme(axis.text.x = element_text(size = 8, face = "bold", colour = "#656565"),
axis.text.y = element_text(size = 8, face = "bold", color = "#656565"),
axis.title = element_text(size = 10, face = "bold", color = "#444444"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "#444444"),
legend.text = element_text(face = "bold", color = "#444444"),
legend.title = element_text(face = "bold", color = "#444444"),
panel.grid.major = element_blank())
How much time passes before an issue related to a product occurs? Answering this requires an analysis of time-to-event data, the event in our case being an issue related to a specific topic category for a given product. This kind of time-to-event analysis is called survival analysis.
# Time to issue analysis for each combination of product type and topic category
# Considering only rows where the contact week is not earlier than the manufacturing week; rows where the contact week precedes the manufacturing week are ignored
data <- data %>%
mutate(time_toissue = contact_wk - mnfture_wk) #Creating new variable to calculate the number of weeks for the issue to arise
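One caveat with the line above: contact_wk and mnfture_wk are YYYYWW codes (e.g. 201840), so direct subtraction overstates the gap whenever the two weeks fall in different years. A possible fix, sketched below under the simplifying assumption of 52 weeks per year, is to convert the codes to a running week index before differencing; the analysis here keeps the simple difference used above.
# Sketch, assuming 52 weeks per year: convert a YYYYWW code to a running week index
week_index <- function(yyyyww) (yyyyww %/% 100) * 52 + (yyyyww %% 100)
# time_toissue could then be computed as:
# data <- data %>% mutate(time_toissue = week_index(contact_wk) - week_index(mnfture_wk))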
# Function to plot the survival curves for a given product type and set of topic categories
survival_analysis <- function(product, top_cat) {
temp_surv <- data %>%
filter(time_toissue >= 0 & product_type == product,
topic_category %in% top_cat)
if (nrow(temp_surv) != 0) {
km_temp <- survfit(Surv(time_toissue) ~ topic_category,
data = temp_surv)
ggsurv <- ggsurvplot(km_temp,
data = temp_surv,surv.median.line = "hv",
conf.int = TRUE)
ggsurv$plot + theme_bw() + facet_wrap(~strata) + theme_minimal()
} else {
print("no records")
}
}
#Creating temporary variables to store product type and topic categories
prod <- c("Laptops","Desktops","Other Electronics")
topic_cat <- na.omit(unique(data$topic_category))
The plots below show the survival analysis for different topic categories across all three product types, starting with laptops. Each survival plot shows, for a topic category, the probability of not yet having an issue vs. the number of weeks since manufacture, with a line highlighting the median number of weeks for a problem to occur. This can be used to plan inventory by incorporating it into a forecasting model. It can also give us a time window before which we should reach out to the customer, improving the customer experience by asking whether they are facing an issue with a specific topic category.
# Laptops ----
survival_analysis(prod[1], topic_cat[1:12])
survival_analysis(prod[1], topic_cat[13:24])
survival_analysis(prod[1], topic_cat[25:36])
survival_analysis(prod[1], topic_cat[37:49])
survival_analysis(prod[1], topic_cat[50:63])
# Desktop ---
survival_analysis(prod[2], topic_cat[1:12])
survival_analysis(prod[2], topic_cat[13:25])
survival_analysis(prod[2], topic_cat[26:37])
survival_analysis(prod[2], topic_cat[c(38:47,51,52,56,57,61)])
# Not considering topic categories with only one record, like Miracast under Desktops
# Other Electronics
survival_analysis(prod[3], topic_cat[1:12])
survival_analysis(prod[3], topic_cat[c(13:26,29,30)])
survival_analysis(prod[3], topic_cat[c(32:33,35:36,39,41:49,51:52,56:63)])
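To carry the medians highlighted in the plots into an inventory forecast, they can also be extracted directly from the fitted survfit object rather than read off the charts; below is a minimal sketch for laptops using summary()$table from the survival package.
# Sketch: median number of weeks to first issue per topic category (Laptops)
laptop_surv <- data %>%
  filter(time_toissue >= 0, product_type == "Laptops", !is.na(topic_category))
km_laptop <- survfit(Surv(time_toissue) ~ topic_category, data = laptop_surv)
median_weeks <- summary(km_laptop)$table[, "median"]
head(sort(median_weeks)) # topic categories with the shortest median time to issue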
The plot below shows the percentage breakdown of contact types for all three regions. We can see that ‘Voice’ dominates in every region, meaning customers are still more comfortable using the phone than typing. The plot also shows that Hogwarts has the largest number of issues, as seen from its Voice and Chat percentages. It might simply be that more machines were sold in that region; if not, further analysis is needed into why that region generates more issues.
# Contact Preference by region-------
contact_preference <- data %>%
group_by(region, contact_type) %>%
dplyr::summarise(percent_cntct_medium = (n() / nrow(data)) * 100) %>%
na.omit()
contact_preference %>%
ggplot(aes(x = reorder(contact_type, percent_cntct_medium),
y = percent_cntct_medium,
fill = as.factor(region))) +
geom_bar(stat = "identity", width = 0.3) +
facet_wrap(~as.factor(region), scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(size = 8)) +
scale_fill_manual(name = "Region",
values = c("Hogwarts" = "#3039DC",
"Middle Earth" = "#2D9A2D",
"Milky Way" = "#FD0006")) +
labs(title = "Contact Preference",
y = "Percentage", x = "Communication Mode") +
coord_flip()
Considering the time limit, it was not possible to dig deeper and do any modelling. However, the following are suggestions for modelling that could be conducted in the future: