The travel agency offers many products and services, and it has to promote them to reach customers. The traditional method is to call every customer and offer the new products or services; with a data-analytic method, we can help the marketing and sales department maximize the chance of winning a customer from each call.
Let us set up the environment and load the dataset.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(readr, skimr, tidyr, dplyr, ggplot2, gghalves, car,
scales, gridExtra, ggsci, stringr, caret, olsrr, MASS,
corrplot, RColorBrewer, ROSE, e1071, kableExtra, tibble,
ggridges, grid)
theme = theme_bw() +
theme(plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(size = 10),
axis.title = element_text(size = 10),
legend.position = "none")
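The pipeline below starts from df.0, the raw dataset, but the loading step itself is not shown. A minimal sketch, assuming the raw data sits in a CSV file named Travel.csv (a hypothetical file name):
df.0 = read_csv("Travel.csv")  # hypothetical file name; read_csv comes from readr, loaded above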
df.1 = df.0 %>%
  drop_na() %>%                       # remove rows with any missing value
  dplyr::select(-CustomerID) %>%      # drop the identifier column
  mutate(Gender = gsub("Fe Male",     # fix the mislabeled "Fe Male" level
                       "Female",
                       Gender)) %>%
  mutate_at(c("TypeofContact",        # convert categorical columns to factors
              "Occupation",
              "Gender",
              "ProductPitched",
              "MaritalStatus",
              "Designation",
              "Passport",
              "OwnCar",
              "ProdTaken"),
            factor) %>%
  mutate(Often = ifelse(NumberOfTrips >= 4, 1, 0),  # flag customers with 4+ trips
         Often = as.factor(Often))
We first perform some data manipulation to support the analysis that follows: dropping rows with missing values, removing the customer ID, fixing the mislabeled "Fe Male" level, converting the categorical columns to factors, and deriving a binary Often flag for customers who take four or more trips a year.
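The summary below is skimr output (skimr is loaded above); it can be reproduced with:
skimr::skim(df.1)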
Data summary | Value |
---|---|
Name | df.1 |
Number of rows | 4128 |
Number of columns | 20 |
Factor columns | 10 |
Numeric columns | 10 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ProdTaken | 0 | 1 | FALSE | 2 | 0: 3331, 1: 797 |
TypeofContact | 0 | 1 | FALSE | 2 | Sel: 2918, Com: 1210 |
Occupation | 0 | 1 | FALSE | 4 | Sal: 1999, Sma: 1746, Lar: 381, Fre: 2 |
Gender | 0 | 1 | FALSE | 2 | Mal: 2463, Fem: 1665 |
ProductPitched | 0 | 1 | FALSE | 5 | Bas: 1615, Del: 1422, Sta: 737, Sup: 250 |
MaritalStatus | 0 | 1 | FALSE | 4 | Mar: 1990, Div: 789, Unm: 682, Sin: 667 |
Passport | 0 | 1 | FALSE | 2 | 0: 2909, 1: 1219 |
OwnCar | 0 | 1 | FALSE | 2 | 1: 2527, 0: 1601 |
Designation | 0 | 1 | FALSE | 5 | Exe: 1615, Man: 1422, Sen: 737, AVP: 250 |
Often | 0 | 1 | FALSE | 2 | 0: 2727, 1: 1401 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1 | 37.23 | 9.17 | 18 | 31 | 36 | 43 | 61 |
CityTier | 0 | 1 | 1.66 | 0.92 | 1 | 1 | 1 | 3 | 3 |
DurationOfPitch | 0 | 1 | 15.58 | 8.40 | 5 | 9 | 14 | 20 | 127 |
NumberOfPersonVisiting | 0 | 1 | 2.95 | 0.72 | 1 | 2 | 3 | 3 | 5 |
NumberOfFollowups | 0 | 1 | 3.74 | 1.01 | 1 | 3 | 4 | 4 | 6 |
PreferredPropertyStar | 0 | 1 | 3.58 | 0.80 | 3 | 3 | 3 | 4 | 5 |
NumberOfTrips | 0 | 1 | 3.30 | 1.86 | 1 | 2 | 3 | 4 | 22 |
PitchSatisfactionScore | 0 | 1 | 3.06 | 1.36 | 1 | 2 | 3 | 4 | 5 |
NumberOfChildrenVisiting | 0 | 1 | 1.22 | 0.85 | 0 | 1 | 1 | 2 | 3 |
MonthlyIncome | 0 | 1 | 23178.46 | 4506.61 | 1000 | 20751 | 22418 | 25301 | 98678 |
p.1 = df.1 %>%
ggplot(aes(x = TypeofContact, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_fill_locuszoom() +
labs(x = "Contact Type",
y = "Rate",
fill = "Offer Taken")
p.2 = df.1 %>%
ggplot(aes(x = Occupation, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_fill_locuszoom() +
labs(x = "Occupation",
y = "Rate",
fill = "Offer Taken")
p.3 = df.1 %>%
ggplot(aes(x = Gender, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_fill_locuszoom() +
labs(x = "Gender",
y = "Rate",
fill = "Offer Taken")
p.4 = df.1 %>%
ggplot(aes(x = ProductPitched, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_fill_locuszoom() +
labs(x = "Product Type",
y = "Rate",
fill = "Offer Taken")
p.5 = df.1 %>%
ggplot(aes(x = MaritalStatus, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_fill_locuszoom() +
labs(x = "Marital Status",
y = "Rate",
fill = "Offer Taken")
p.6 = df.1 %>%
ggplot(aes(x = Passport, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_fill_locuszoom() +
labs(x = "Passport",
y = "Rate",
fill = "Offer Taken")
p.7 = df.1 %>%
ggplot(aes(x = OwnCar, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_fill_locuszoom() +
labs(x = "Own Car",
y = "Rate",
fill = "Offer Taken")
p.8 = df.1 %>%
ggplot(aes(x = Designation, y = 1, fill = ProdTaken)) +
geom_col(position = "fill") +
theme +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
theme(legend.position = c(0.83, 0.74)) +
scale_fill_locuszoom() +
labs(x = "Designation",
y = "Rate",
fill = "Offer Taken")
grid.arrange(p.1, p.2, p.3, p.4, p.5, p.6, p.7, p.8,
layout_matrix = rbind(c(6, 7, 3, 5),
c(1, 2, 4, 8)),
top = textGrob("Categorical Features",
gp = gpar(fontsize = 15,
font = 2)))
From these plots, we can already see that customers who have a passport, are single, work as freelancers, were pitched the Basic product, or hold an Executive designation show the highest acceptance rates; these groups can serve as the target audience for the next product promotion.
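As a quick numeric check on one of these claims, the acceptance rate can be computed per group; a sketch for the passport split:
# Acceptance rate by passport status, matching the stacked-bar plot above
df.1 %>%
  group_by(Passport) %>%
  summarise(rate = mean(ProdTaken == 1))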
p.1 = df.1 %>%
ggplot(aes(y = Age, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.2 = df.1 %>%
ggplot(aes(y = CityTier, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.3 = df.1 %>%
ggplot(aes(y = DurationOfPitch, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.4 = df.1 %>%
ggplot(aes(y = NumberOfPersonVisiting, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.5 = df.1 %>%
ggplot(aes(y = NumberOfFollowups, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.6 = df.1 %>%
ggplot(aes(y = PreferredPropertyStar, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.7 = df.1 %>%
ggplot(aes(y = NumberOfTrips, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.8 = df.1 %>%
ggplot(aes(y = PitchSatisfactionScore, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank(),
legend.position = c(0.80, 0.78)) +
scale_fill_locuszoom() +
labs(x = NULL)
p.9 = df.1 %>%
ggplot(aes(y = NumberOfChildrenVisiting, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
p.10 = df.1 %>%
ggplot(aes(y = MonthlyIncome, x = "", fill = ProdTaken)) +
geom_boxplot() +
theme +
scale_y_continuous(labels = function(x) paste0("$", {x/1000}, "k")) +
theme(axis.ticks.x = element_blank()) +
scale_fill_locuszoom() +
labs(x = NULL)
grid.arrange(p.1, p.2, p.3, p.4, p.5, p.6, p.7, p.8, p.9, p.10,
layout_matrix = rbind(c(1, 10, 7, 2, 6),
c(4, 9, 3, 5, 8)),
top = textGrob("Continuous Features",
gp = gpar(fontsize = 15,
font = 2)))
As for the continuous features, we find that customers with a higher preferred property star rating, a longer pitch duration, and more follow-ups accept the product at a higher rate. In the next promotion, lengthening the pitch and increasing the number of follow-ups may help boost the acceptance rate.
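These differences can also be checked numerically by comparing group means; a sketch, assuming a dplyr version with across():
# Mean pitch duration, follow-ups, and property-star preference per outcome
df.1 %>%
  group_by(ProdTaken) %>%
  summarise(across(c(DurationOfPitch, NumberOfFollowups, PreferredPropertyStar),
                   mean))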
p.1 = df.1 %>%
group_by(Often) %>%
summarise(mean = mean(MonthlyIncome)) %>%
ggplot(aes(x = Often,
y = mean,
fill = Often)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0("$",
prettyNum(round(mean, 0),
big.mark = ","))),
vjust = -0.5) +
scale_y_continuous(labels = dollar_format(),
breaks = seq(20000, 25000, 2500)) +
coord_cartesian(ylim = c(20000, 25000)) +
scale_fill_locuszoom() +
theme +
labs(title = "Often/Rarely Trip Customer",
x = "Often Trip",
y = "Monthly Income")
p.2 = df.1 %>%
group_by(ProdTaken) %>%
summarise(mean = mean(MonthlyIncome)) %>%
ggplot(aes(x = ProdTaken,
y = mean,
fill = ProdTaken)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0("$",
prettyNum(round(mean, 0),
big.mark = ","))),
vjust = -0.5) +
scale_y_continuous(labels = dollar_format(),
breaks = seq(20000, 25000, 2500)) +
coord_cartesian(ylim = c(20000, 25000)) +
scale_fill_locuszoom() +
theme +
labs(title = "Offer Taken Customer",
x = "Offer taken",
y = "")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,2)))
Customers who accept the product have a lower average monthly income than those who reject it, which contradicts the common assumption that customers with a higher monthly income travel more often.
df.1 %>%
dplyr::select(ProdTaken, ProductPitched) %>%
group_by(ProductPitched, ProdTaken) %>%
summarise(count = n()) %>%
mutate(per = count/sum(count)*100) %>%
ggplot(aes(x = reorder(ProductPitched, -count),
y = count,
fill = ProdTaken)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(round(per, 0), "%")),
position = position_dodge(width = 0.7),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 1400),
breaks = seq(0, 1400, 200)) +
scale_fill_locuszoom() +
theme +
theme(legend.position = "right") +
labs(title = "Customer Number by Package by Offer Taken Status",
x = "Package",
y = "Number of Customer",
fill = "Offer Taken")
The packages customers take most are Basic, Deluxe, and Standard, and the Basic package also has the highest acceptance probability. The overall probability that a customer accepts the product is about 20%, so the Basic package's acceptance rate sits above that overall average.
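The quoted overall rate can be verified directly; a sketch:
# Overall acceptance rate across all customers (roughly 0.19, i.e. about 20%)
mean(df.1$ProdTaken == 1)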
df.1 %>%
filter(Often == 1) %>%
dplyr::select(MonthlyIncome, Occupation, Designation, Often) %>%
mutate(Position = paste0(Designation, " at ", Occupation)) %>%
group_by(Position) %>%
summarise(mean = mean(MonthlyIncome)) %>%
ggplot(aes(y = reorder(Position, mean),
x = mean)) +
geom_col(position = "dodge",
width = 0.7,
fill = "#9632B8FF") +
scale_x_continuous(labels = dollar_format(),
breaks = seq(0, 35000, 5000)) +
coord_cartesian(xlim = c(15000, 35000)) +
scale_fill_locuszoom() +
theme +
labs(title = "Taken Offer Customer Number by Position",
x = "Monthly Income",
y = "Position")
Among frequent travelers, the highest average monthly income belongs to VPs in salaried occupations, while executives working as freelancers earn the lowest.
People who travel often are presumably more financially free. This is the hypothesis that we will test with a statistical procedure in this section.
temp.4.1 = df.1 %>%
group_by(Often) %>%
mutate(count = n(),
x.lab = paste0(Often,
"\n",
"(n=",
count,
")")) %>%
dplyr::select(MonthlyIncome, Often, count, x.lab)
title = "Monthly Income Distribution by Group of Rarely/Often Trip"
ggplot(data = temp.4.1,
aes(x = x.lab,
y = MonthlyIncome)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
stat_summary(fun = "mean",
shape = 4,
color = "black",
size = 0.5,
stroke = 2) +
scale_y_continuous(labels = dollar_format()) +
theme +
labs(title = title,
x = "Often Trip",
y = "Monthly Income")
Normality Test - Shapiro-Wilk Test
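The call that generated the output below is not shown; a sketch that produces output of this shape, testing MonthlyIncome for normality within each Often group:
# Shapiro-Wilk normality test, run separately for each Often group
by(temp.4.1$MonthlyIncome, temp.4.1$Often, shapiro.test)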
## temp.4.1$Often: 0
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.87322, p-value < 2.2e-16
##
## ------------------------------------------------------------
## temp.4.1$Often: 1
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.95793, p-value < 2.2e-16
Homogeneity of Variance Test - Levene’s Test
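The output below matches car::leveneTest (car is loaded above); a sketch of the call:
# Levene's test for equal variances across the Often groups (default center = median)
leveneTest(MonthlyIncome ~ Often, data = temp.4.1)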
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 7.3379 0.00678 **
## 4126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA Alternative - Mann-Whitney Test
Since the data are neither normally distributed nor equal in variance across groups, we use the non-parametric Mann-Whitney (Wilcoxon rank-sum) test instead of the usual ANOVA.
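A sketch of the call matching the output below:
# Mann-Whitney (Wilcoxon rank-sum) test of income between the two groups
wilcox.test(temp.4.1$MonthlyIncome ~ temp.4.1$Often)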
##
## Wilcoxon rank sum test with continuity correction
##
## data: temp.4.1$MonthlyIncome by temp.4.1$Often
## W = 1583437, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The result supports the statement that people with a higher monthly income travel more often, with a p-value < 0.05.
We use logistic regression in this section, so that we get not only predictions from the model but also insight into the drivers of the decision. We can therefore predict which customers are more likely to accept the travel package and identify which variables are most significant.
set.seed(123)
split = createDataPartition(df.1$ProdTaken, p = 0.8, list = F)
train.data = df.1[split, ]
test.data = df.1[-split, ]
mod.4.1 = glm(data = train.data,
ProdTaken ~
MonthlyIncome +
Designation +
OwnCar +
Passport +
PreferredPropertyStar +
DurationOfPitch +
CityTier +
Age +
NumberOfFollowups +
NumberOfChildrenVisiting +
PitchSatisfactionScore,
family = "binomial")
mod.4.2 = stepAIC(mod.4.1,
direction = "both",
trace = F)
pred = predict(mod.4.2,
type = "response",
newdata = test.data)
pred.cutoff = factor(ifelse(pred >= 0.5, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]
a = c("Accuracy",
"Sensitivity",
"Specificity",
"Total")
b = c(accuracy,
sensitivity,
specificity,
sum(accuracy, sensitivity, specificity)) %>%
unname() %>%
round(3)
data.frame(a, b) %>%
rename("Index" = "a",
"Value" = "b") %>%
kbl(align = "l",
caption = "Without Cross Validation / Cutoff in 0.500") %>%
kable_classic("hover")
Index | Value |
---|---|
Accuracy | 0.838 |
Sensitivity | 0.302 |
Specificity | 0.965 |
Total | 2.105 |
The first model is without cross validation, with the cutoff left at the default of 0.500. However, due to the class imbalance, this setup can yield a misleading model: accuracy looks high while sensitivity is poor.
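The imbalance itself is easy to confirm; a sketch:
# Class distribution of the target: roughly 80% "0" versus 20% "1"
prop.table(table(df.1$ProdTaken))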
perform.fn = function(cutoff){
  # Classify with the given cutoff, then return the three performance metrics
  pred.temp = factor(ifelse(pred >= cutoff, "Yes", "No"))
  actual.temp = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
  cm = confusionMatrix(pred.temp, actual.temp, positive = "Yes")
  accuracy = cm$overall[1]
  sensitivity = cm$byClass[1]
  specificity = cm$byClass[2]
  out = t(as.matrix(c(sensitivity, specificity, accuracy)))
  colnames(out) = c("sensitivity", "specificity", "accuracy")
  return(out)}
s = seq(0.01, 0.99, length = 100)
out = matrix(1, 100, 3)
for(i in 1:100){
out[i,] = perform.fn(s[i])}
# plot(s, out[, 1], col = 2, lwd = 2,
# type = "l", axes = T,
# xlab = "Cutoff", ylab = "Value")
# lines(s, out[,2], col = "darkgreen", lwd = 2)
# lines(s, out[,3], col = 4, lwd = 2)
# legend("right",
# col = c(2, "darkgreen", 4, "darkred"),
# text.font = 3,
# inset = 0.02,
# box.lty = 0,
# cex = 0.8,
# lwd = c(2, 2, 2, 2),
# c("Sensitivity", "Specificity", "Accuracy"))
cutoff = s[which(abs(out[,1]-out[,2]) < 0.01)][1]  # first cutoff where sensitivity and specificity meet
# abline(v = cutoff, col = "black", lwd = 1, lty = 3)
pred.cutoff = factor(ifelse(pred >= cutoff, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]
a = c("Accuracy",
"Sensitivity",
"Specificity",
"Total")
b = c(accuracy,
sensitivity,
specificity,
sum(accuracy, sensitivity, specificity)) %>%
unname() %>%
round(3)
data.frame(a, b) %>%
rename("Index" = "a",
"Value" = "b") %>%
kbl(align = "l",
caption = "Without Cross Validation / Cutoff in 0.188") %>%
kable_classic("hover")
Index | Value |
---|---|
Accuracy | 0.731 |
Sensitivity | 0.723 |
Specificity | 0.733 |
Total | 2.187 |
The second model is still without cross validation but tunes the cutoff to 0.188, the point where sensitivity and specificity meet. Although accuracy decreases, the total score increases, driven mainly by sensitivity. Sensitivity, also called recall, is the detection rate for the class of interest, here the "yes"/"1" status, so we should weigh sensitivity as well rather than chase accuracy blindly.
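For reference, these metrics follow directly from the confusion-matrix counts; a sketch using the cm object computed above:
# Recover sensitivity and specificity from raw counts
# (confusionMatrix puts predictions in rows and actual labels in columns)
tab = cm$table
TP = tab["Yes", "Yes"]; FN = tab["No", "Yes"]
TN = tab["No", "No"];  FP = tab["Yes", "No"]
c(sensitivity = TP / (TP + FN), specificity = TN / (TN + FP))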
df.2 = ROSE(ProdTaken ~ .,   # rebalance the target classes by resampling
            data = df.1,
            seed = 123)$data
set.seed(123)
split = createDataPartition(df.2$ProdTaken, p = 0.8, list = F)
train.data = df.2[split, ]
test.data = df.2[-split, ]
mod.4.3 = glm(data = train.data,
ProdTaken ~
MonthlyIncome +
Designation +
OwnCar +
Passport +
PreferredPropertyStar +
DurationOfPitch +
CityTier +
Age +
NumberOfFollowups +
NumberOfChildrenVisiting +
PitchSatisfactionScore,
family = "binomial")
mod.4.4 = stepAIC(mod.4.3,
direction = "both",
trace = F)
pred = predict(mod.4.4,
type = "response",
newdata = test.data)
pred.cutoff = factor(ifelse(pred >= 0.5, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]
a = c("Accuracy",
"Sensitivity",
"Specificity",
"Total")
b = c(accuracy,
sensitivity,
specificity,
sum(accuracy, sensitivity, specificity)) %>%
unname() %>%
round(3)
data.frame(a, b) %>%
rename("Index" = "a",
"Value" = "b") %>%
kbl(align = "l",
caption = "With Cross Validation / Cutoff in 0.500") %>%
kable_classic("hover")
Index | Value |
---|---|
Accuracy | 0.698 |
Sensitivity | 0.682 |
Specificity | 0.714 |
Total | 2.095 |
For this model, we balance the data with the ROSE (Random Over-Sampling Examples) function before splitting and fitting; this rebalanced variant is the one labelled "With Cross Validation" in the table captions. Without tuning the cutoff, neither the accuracy nor the total score improves on the previous model, so we proceed to the next model, which tunes the cutoff as well.
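That the classes are now roughly balanced can be checked directly; a sketch:
# Class distribution of the target after ROSE rebalancing (close to 50/50)
prop.table(table(df.2$ProdTaken))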
s = seq(0.01, 0.99, length = 100)
out = matrix(1, 100, 3)
for(i in 1:100){
out[i,] = perform.fn(s[i])}
# plot(s, out[, 1], col = 2, lwd = 2,
# type = "l", axes = T,
# xlab = "Cutoff", ylab = "Value")
# lines(s, out[,2], col = "darkgreen", lwd = 2)
# lines(s, out[,3], col = 4, lwd = 2)
# legend("right",
# col = c(2, "darkgreen", 4, "darkred"),
# text.font = 3,
# inset = 0.02,
# box.lty = 0,
# cex = 0.8,
# lwd = c(2, 2, 2, 2),
# c("Sensitivity", "Specificity", "Accuracy"))
cutoff = s[which(abs(out[,1]-out[,2]) < 0.025)][1]  # first cutoff where sensitivity and specificity meet
# abline(v = cutoff, col = "black", lwd = 1, lty = 3)
pred.cutoff = factor(ifelse(pred >= cutoff, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]
a = c("Accuracy",
"Sensitivity",
"Specificity",
"Total")
b = c(accuracy,
sensitivity,
specificity,
sum(accuracy, sensitivity, specificity)) %>%
unname() %>%
round(3)
data.frame(a, b) %>%
rename("Index" = "a",
"Value" = "b") %>%
kbl(align = "l",
caption = "With Cross Validation / Cutoff in 0.485") %>%
kable_classic("hover")
Index | Value |
---|---|
Accuracy | 0.693 |
Sensitivity | 0.694 |
Specificity | 0.692 |
Total | 2.080 |
The last model is even worse: both its accuracy and its total score drop.
mod.sum = summary(mod.4.2)
mod.sum.df = mod.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
mutate(Estimate = round(Estimate, 4)) %>%
rename("P.value" = `Pr(>|z|)`) %>%
filter(P.value < 0.05) %>%
rownames_to_column(var = "Variable")
Des = c("The customer has a passport",
"Executive designation of the customer in the current organization",
"City tier depends on the development of a city, population, facilities, and living standards",
"Total number of followups has been done by the salesperson after the sales pitch",
"Preferred hotel property rating by customer",
"Sales pitch satisfaction score",
"Duration of the pitch by a salesperson to the customer",
"Gross monthly income of the customer",
"Age of customer")
mod.sum.df %>%
arrange(desc(Estimate)) %>%
filter(Variable != "(Intercept)") %>%
dplyr::select(-P.value) %>%
add_column(Description = Des) %>%
kbl(align = "l",
caption = "Significantly Effect Factors") %>%
kable_classic("hover")
Variable | Estimate | Description |
---|---|---|
Passport1 | 1.5641 | The customer has a passport |
DesignationExecutive | 1.0088 | Executive designation of the customer in the current organization |
CityTier | 0.5546 | City tier depends on the development of a city, population, facilities, and living standards |
NumberOfFollowups | 0.4062 | Total number of followups has been done by the salesperson after the sales pitch |
PreferredPropertyStar | 0.3484 | Preferred hotel property rating by customer |
PitchSatisfactionScore | 0.1144 | Sales pitch satisfaction score |
DurationOfPitch | 0.0270 | Duration of the pitch by a salesperson to the customer |
MonthlyIncome | -0.0001 | Gross monthly income of the customer |
Age | -0.0208 | Age of customer |
Based on the total score, the second model, without cross validation and with a tuned cutoff, has the best performance, so we choose it. Which factors, then, are significantly related to the decision to take the product? The table above gives the answer. Take the most positive and the most negative coefficients as examples: customers who hold a passport are more likely to accept the offer, since the estimate is greater than zero and statistically significant at the 0.05 level; conversely, older customers are less likely to accept, since the estimate is below zero and likewise significant at 0.05. The analysis therefore yields both the best prediction model and insight into the factors that drive acceptance.
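To make the log-odds estimates easier to read, they can be converted to odds ratios; a sketch using the two coefficients discussed above:
# Odds ratios: a passport multiplies the odds of acceptance by about 4.8,
# while each additional year of age multiplies them by about 0.98
exp(c(Passport1 = 1.5641, Age = -0.0208))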
First, we looked at the acceptance rate within each feature, without significance tests, to get a rough picture. We also found that customers who take the offer have a lower monthly income, which goes against the common assumption that higher-income customers travel more often. The most frequently taken offer is the Basic package, possibly because of its low price. A customer's designation is related to salary, yet most customers who take the offer hold an Executive position. The agency should design packages for, and target, higher-income audiences to increase revenue. We then carried out the model analysis to obtain both prediction and interpretation. For its existing products, the agency can promote more precisely to the target audience, which should improve sales; for new products, it can research and develop offerings aimed at audiences that have not been interested in the agency's products before.