1 Objective

The travel agency offers many products and services, and reaching customers takes real promotional effort. Under the traditional approach, the company calls every customer to pitch new products or services. With a data-analytic approach, we can instead help the marketing and sales departments maximize the chance that a call turns into a sale.

2 Preparation

Let us set up the environment and load in the dataset.

Environment

if(!require("pacman")) install.packages("pacman")
pacman::p_load(readr, skimr, tidyr, dplyr, ggplot2, gghalves, car,
               scales, gridExtra, ggsci, stringr, caret, olsrr, MASS,
               corrplot, RColorBrewer, ROSE, e1071, kableExtra, tibble,
               ggridges, grid)

theme = theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 15),
        plot.subtitle = element_text(size = 10),
        axis.title = element_text(size = 10), 
        legend.position = "none")

Dataset

df.0 = read_csv("DATA.csv")

Cleaning & Manipulation

df.1 = df.0

df.1 = df.1 %>%
  drop_na() %>% 
  dplyr::select(-CustomerID) %>% 
  mutate_at(c("TypeofContact",
              "Occupation",
              "Gender",
              "ProductPitched",
              "MaritalStatus",
              "Designation", 
              "Passport",
              "OwnCar",
              "ProdTaken"), 
            factor) %>% 
  mutate(Gender = gsub("Fe Male",
                       "Female",
                       Gender),
         Gender = as.factor(Gender)) %>% 
  mutate(Often = ifelse(NumberOfTrips >= 4, 1, 0),
         Often = as.factor(Often))

We first perform a few data-manipulation steps to prepare for the analysis:

  • Deselect uninformative variables
  • Format variable types
  • Correct error in variables
  • Group values based on an elbow method (see the sketch below)
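As a minimal, illustrative sketch (not part of the pipeline above) of how the elbow inspection behind the Often variable might look, one can plot the distribution of trip counts; the dashed line marks the chosen threshold of 4 trips:

df.1 %>%
  count(NumberOfTrips) %>%                            # customers per trip count
  ggplot(aes(x = NumberOfTrips, y = n)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 4, linetype = "dashed") +   # elbow / chosen cutoff
  theme +
  labs(title = "Customers by Number of Trips",
       x = "Number of Trips",
       y = "Count")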
skim_without_charts(df.1)
Data summary
Name df.1
Number of rows 4128
Number of columns 20
_______________________
Column type frequency:
factor 10
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ProdTaken 0 1 FALSE 2 0: 3331, 1: 797
TypeofContact 0 1 FALSE 2 Sel: 2918, Com: 1210
Occupation 0 1 FALSE 4 Sal: 1999, Sma: 1746, Lar: 381, Fre: 2
Gender 0 1 FALSE 2 Mal: 2463, Fem: 1665
ProductPitched 0 1 FALSE 5 Bas: 1615, Del: 1422, Sta: 737, Sup: 250
MaritalStatus 0 1 FALSE 4 Mar: 1990, Div: 789, Unm: 682, Sin: 667
Passport 0 1 FALSE 2 0: 2909, 1: 1219
OwnCar 0 1 FALSE 2 1: 2527, 0: 1601
Designation 0 1 FALSE 5 Exe: 1615, Man: 1422, Sen: 737, AVP: 250
Often 0 1 FALSE 2 0: 2727, 1: 1401

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Age 0 1 37.23 9.17 18 31 36 43 61
CityTier 0 1 1.66 0.92 1 1 1 3 3
DurationOfPitch 0 1 15.58 8.40 5 9 14 20 127
NumberOfPersonVisiting 0 1 2.95 0.72 1 2 3 3 5
NumberOfFollowups 0 1 3.74 1.01 1 3 4 4 6
PreferredPropertyStar 0 1 3.58 0.80 3 3 3 4 5
NumberOfTrips 0 1 3.30 1.86 1 2 3 4 22
PitchSatisfactionScore 0 1 3.06 1.36 1 2 3 4 5
NumberOfChildrenVisiting 0 1 1.22 0.85 0 1 1 2 3
MonthlyIncome 0 1 23178.46 4506.61 1000 20751 22418 25301 98678

3 Exploratory Data Analysis

1) What is the acceptance rate for each feature?

p.1 = df.1 %>% 
  ggplot(aes(x = TypeofContact, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +  
  scale_fill_locuszoom() +
  labs(x = "Contact Type",
       y = "Rate",
       fill = "Offer Taken")
p.2 = df.1 %>% 
  ggplot(aes(x = Occupation, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_fill_locuszoom() +
  labs(x = "Occupation",
       y = "Rate",
       fill = "Offer Taken")
p.3 = df.1 %>% 
  ggplot(aes(x = Gender, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_fill_locuszoom() +
  labs(x = "Gender",
       y = "Rate",
       fill = "Offer Taken")
p.4 = df.1 %>% 
  ggplot(aes(x = ProductPitched, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_fill_locuszoom() +
  labs(x = "Product Type",
       y = "Rate",
       fill = "Offer Taken")
p.5 = df.1 %>% 
  ggplot(aes(x = MaritalStatus, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_fill_locuszoom() +
  labs(x = "Marital Status",
       y = "Rate",
       fill = "Offer Taken")
p.6 = df.1 %>% 
  ggplot(aes(x = Passport, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_fill_locuszoom() +
  labs(x = "Passport",
       y = "Rate",
       fill = "Offer Taken")
p.7 = df.1 %>% 
  ggplot(aes(x = OwnCar, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_fill_locuszoom() +
  labs(x = "Own Car",
       y = "Rate",
       fill = "Offer Taken")
p.8 = df.1 %>% 
  ggplot(aes(x = Designation, y = 1, fill = ProdTaken)) +
  geom_col(position = "fill") + 
  theme +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  theme(legend.position = c(0.83, 0.74)) +
  scale_fill_locuszoom() +
  labs(x = "Designation",
       y = "Rate",
       fill = "Offer Taken")

grid.arrange(p.1, p.2, p.3, p.4, p.5, p.6, p.7, p.8,
             layout_matrix = rbind(c(6, 7, 3, 5),
                                   c(1, 2, 4, 8)),
             top = textGrob("Categorical Features",
                            gp = gpar(fontsize = 15,
                                      font = 2)))

From these plots, we can see that customers who hold a passport, are single, work as freelancers, were pitched the Basic product type, or hold the Executive designation have the highest acceptance rates; they can be our target audience for the next product promotion.
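To back the visual impression with numbers, the acceptance rate for any of these features can be computed directly; a minimal example for Passport (the same pattern applies to the other factors):

df.1 %>%
  group_by(Passport) %>%
  summarise(AcceptanceRate = mean(ProdTaken == "1"))   # share of customers who took the offer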

p.1 = df.1 %>%
  ggplot(aes(y = Age, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.2 = df.1 %>%
  ggplot(aes(y = CityTier, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.3 = df.1 %>% 
  ggplot(aes(y = DurationOfPitch, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.4 = df.1 %>% 
  ggplot(aes(y = NumberOfPersonVisiting, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.5 = df.1 %>% 
  ggplot(aes(y = NumberOfFollowups, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.6 = df.1 %>% 
  ggplot(aes(y = PreferredPropertyStar, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.7 = df.1 %>% 
  ggplot(aes(y = NumberOfTrips, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.8 = df.1 %>% 
  ggplot(aes(y = PitchSatisfactionScore, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank(),
        legend.position = c(0.80, 0.78)) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.9 = df.1 %>% 
  ggplot(aes(y = NumberOfChildrenVisiting, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)
p.10 = df.1 %>% 
  ggplot(aes(y = MonthlyIncome, x = "", fill = ProdTaken)) +
  geom_boxplot() +
  theme +
  scale_y_continuous(labels = function(x) paste0("$", {x/1000}, "k")) +  
  theme(axis.ticks.x = element_blank()) +
  scale_fill_locuszoom() +
  labs(x = NULL)

grid.arrange(p.1, p.2, p.3, p.4, p.5, p.6, p.7, p.8, p.9, p.10,
             layout_matrix = rbind(c(1, 10, 7, 2, 6),
                                   c(4, 9, 3, 5, 8)),
             top = textGrob("Continuous Features",
                            gp = gpar(fontsize = 15,
                                      font = 2)))

As for the continuous features, we find that the higher the preferred property star, the pitch duration, and the number of follow-ups, the higher the customers' acceptance rate. In the next promotion, we should try to increase the pitch duration and the number of follow-ups to boost the acceptance rate.
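As a quick numerical check on this reading, the group means of these three features can be compared directly (a sketch using the same df.1 as above):

df.1 %>%
  group_by(ProdTaken) %>%
  summarise(across(c(PreferredPropertyStar,
                     DurationOfPitch,
                     NumberOfFollowups),
                   mean))   # mean of each feature by offer-taken status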

2) What kind of customer would accept the product?

p.1 = df.1 %>% 
  group_by(Often) %>% 
  summarise(mean = mean(MonthlyIncome)) %>% 
  ggplot(aes(x = Often,
             y = mean,
             fill = Often)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0("$",
                               prettyNum(round(mean, 0), 
                                         big.mark = ","))),
            vjust = -0.5) +
  scale_y_continuous(labels = dollar_format(),
                     breaks = seq(20000, 25000, 2500)) +
  coord_cartesian(ylim = c(20000, 25000)) +
  scale_fill_locuszoom() +
  theme +
  labs(title = "Often/Rarely Trip Customer",
       x = "Often Trip",
       y = "Monthly Income")

p.2 = df.1 %>% 
  group_by(ProdTaken) %>% 
  summarise(mean = mean(MonthlyIncome)) %>% 
  ggplot(aes(x = ProdTaken,
             y = mean,
             fill = ProdTaken)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0("$",
                               prettyNum(round(mean, 0), 
                                         big.mark = ","))),
            vjust = -0.5) +
  scale_y_continuous(labels = dollar_format(),
                     breaks = seq(20000, 25000, 2500)) +
  coord_cartesian(ylim = c(20000, 25000)) +
  scale_fill_locuszoom() +
  theme +
  labs(title = "Offer Taken Customer",
       x = "Offer taken",
       y = "")

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,2)))

Customers who accept the product have a lower monthly income than those who reject it, which goes against the common assumption that customers with a higher monthly income travel more often than those with a lower one.
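This difference can also be tested formally; for example, the same non-parametric test used in Section 4 could be applied here (an illustrative check, not part of the original flow):

wilcox.test(MonthlyIncome ~ ProdTaken, data = df.1)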

3) Which positions do the customers who accept the offer hold?

df.1 %>% 
  filter(Often == 1) %>% 
  dplyr::select(MonthlyIncome, Occupation, Designation, Often) %>% 
  mutate(Position = paste0(Designation, " at ", Occupation)) %>% 
  group_by(Position) %>% 
  summarise(mean = mean(MonthlyIncome)) %>% 
  ggplot(aes(y = reorder(Position, mean),
             x = mean)) +
  geom_col(position = "dodge",
           width = 0.7,
           fill = "#9632B8FF") +
  scale_x_continuous(labels = dollar_format(),
                     breaks = seq(0, 35000, 5000)) +
  coord_cartesian(xlim = c(15000, 35000)) +
  scale_fill_locuszoom() +
  theme +
  labs(title = "Taken Offer Customer Number by Position",
       x = "Monthly Income",
       y = "Position")

The highest mean monthly income belongs to the VP designation among salaried customers, while executives working as freelancers have the lowest.

4 Statistical Analysis

1) Do frequent travelers have a higher monthly income?

People who travel often are presumably more financially free. This is the hypothesis we will check with a statistical test in this section.

temp.4.1 = df.1 %>% 
  group_by(Often) %>% 
  mutate(count = n(),
         x.lab = paste0(Often,
                        "\n",
                        "(n=",
                        count,
                        ")")) %>% 
  dplyr::select(MonthlyIncome, Often, count, x.lab)

title = "Monthly Income Distribution by Group of Rarely/Often Trip"
ggplot(data = temp.4.1,
       aes(x = x.lab,
           y = MonthlyIncome)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  stat_summary(fun = "mean", 
               shape = 4, 
               color = "black",
               size = 0.5,
               stroke = 2) +
  scale_y_continuous(labels = dollar_format()) +  
  theme +
  labs(title = title,
       x = "Often Trip",
       y = "Monthly Income")

Normality Test - Shapiro-Wilk Test

by(temp.4.1$MonthlyIncome,
   temp.4.1$Often,
   shapiro.test)
## temp.4.1$Often: 0
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.87322, p-value < 2.2e-16
## 
## ------------------------------------------------------------ 
## temp.4.1$Often: 1
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.95793, p-value < 2.2e-16

Homogeneity of Variance Test - Levene’s Test

leveneTest(temp.4.1$MonthlyIncome,
           temp.4.1$Often)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)   
## group    1  7.3379 0.00678 **
##       4126                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Group Comparison - Mann-Whitney U Test

Since the data are non-normally distributed and have unequal variances, we use the Mann-Whitney (Wilcoxon rank-sum) test instead of the usual ANOVA.

wilcox.test(temp.4.1$MonthlyIncome ~
              temp.4.1$Often)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  temp.4.1$MonthlyIncome by temp.4.1$Often
## W = 1583437, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The result supports the statement that customers with a higher monthly income travel more often, with a p-value < 0.05.
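As a complement to the p-value, the size of the gap can be read off the group summaries directly (an illustrative sketch):

temp.4.1 %>%
  group_by(Often) %>%
  summarise(median = median(MonthlyIncome),   # robust to the skew seen above
            mean   = mean(MonthlyIncome))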

5 Machine Learning Analysis

1) Which customers are more likely to accept the travel package?

We use logistic regression in this section because it gives us not only predictions but also interpretable coefficient estimates. That way, we can predict which customers are more likely to accept the travel package and identify which variables matter most.

set.seed(123)
split = createDataPartition(df.1$ProdTaken, p = 0.8, list = F)
train.data = df.1[split, ]
test.data = df.1[-split, ]

mod.4.1 = glm(data = train.data,
              ProdTaken ~ 
                MonthlyIncome +
                Designation +
                OwnCar + 
                Passport +
                PreferredPropertyStar + 
                DurationOfPitch + 
                CityTier +
                Age +
                NumberOfFollowups + 
                NumberOfChildrenVisiting +
                PitchSatisfactionScore,
              family = "binomial")

mod.4.2 = stepAIC(mod.4.1,
                  direction = "both",
                  trace = F)

pred = predict(mod.4.2,
               type = "response",
               newdata = test.data)

pred.cutoff = factor(ifelse(pred >= 0.5, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]

a = c("Accuracy",
      "Sensitivity",
      "Specificity",
      "Total")
b = c(accuracy,
      sensitivity,
      specificity,
      sum(accuracy, sensitivity, specificity)) %>% 
  unname() %>% 
  round(3)
data.frame(a, b) %>% 
  rename("Index" = "a",
         "Value" = "b") %>% 
  kbl(align = "l",
      caption = "Without Cross Validation / Cutoff in 0.500") %>% 
  kable_classic("hover")
Without Resampling / Cutoff at 0.500
Index Value
Accuracy 0.838
Sensitivity 0.302
Specificity 0.965
Total 2.105

The first model uses no resampling and the default cutoff of 0.500. However, because of the class imbalance, this setup produces a poor model: the accuracy looks high while the sensitivity is very low.
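The imbalance itself is easy to quantify from the data summary above (797 takers out of 4128 rows); a one-liner to check the class proportions:

round(prop.table(table(df.1$ProdTaken)), 3)   # roughly 0.81 vs 0.19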

perform.fn = function(cutoff){
  pred.temp = factor(ifelse(pred >= cutoff, "Yes", "No"))
  actual.temp = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
  cm = confusionMatrix(pred.temp, actual.temp, positive = "Yes")
  accuracy = cm$overall[1]
  sensitivity = cm$byClass[1]
  specificity = cm$byClass[2]
  out = t(as.matrix(c(sensitivity, specificity, accuracy))) 
  colnames(out) = c("sensitivity", "specificity", "accuracy")
  return(out)}

s = seq(0.01, 0.99, length = 100)
out = matrix(1, 100, 3)
for(i in 1:100){
  out[i,] = perform.fn(s[i])} 

# plot(s, out[, 1], col = 2, lwd = 2, 
#      type = "l", axes = T, 
#      xlab = "Cutoff", ylab = "Value")
# lines(s, out[,2], col = "darkgreen", lwd = 2)
# lines(s, out[,3], col = 4, lwd = 2)
# legend("right", 
#        col = c(2, "darkgreen", 4, "darkred"), 
#        text.font = 3, 
#        inset = 0.02, 
#        box.lty = 0, 
#        cex = 0.8, 
#        lwd = c(2, 2, 2, 2), 
#        c("Sensitivity", "Specificity", "Accuracy"))
cutoff = s[which(abs(out[,1]-out[,2]) < 0.01)]
# abline(v = cutoff, col = "black", lwd = 1, lty = 3)

pred.cutoff = factor(ifelse(pred >= cutoff, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]

a = c("Accuracy",
      "Sensitivity",
      "Specificity",
      "Total")
b = c(accuracy,
      sensitivity,
      specificity,
      sum(accuracy, sensitivity, specificity)) %>% 
  unname() %>% 
  round(3)
data.frame(a, b) %>% 
  rename("Index" = "a",
         "Value" = "b") %>% 
  kbl(align = "l",
      caption = "Without Cross Validation / Cutoff in 0.188") %>% 
  kable_classic("hover")
Without Resampling / Cutoff at 0.188
Index Value
Accuracy 0.731
Sensitivity 0.723
Specificity 0.733
Total 2.187

The second model again uses no resampling but tunes the cutoff to 0.188. Although the accuracy decreases, the overall score increases, most notably the sensitivity. Sensitivity, also called recall, is the detection rate for the class of interest, the “yes”/“1” status in this case. So we should care about sensitivity as well, not just chase accuracy blindly.
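To make the definition concrete, sensitivity can be recomputed by hand from the caret confusion matrix, where rows are predictions and columns are actuals (a sketch; cm is the matrix from the chunk above):

tp = cm$table["Yes", "Yes"]   # true positives
fn = cm$table["No", "Yes"]    # false negatives: actual "Yes" predicted "No"
tp / (tp + fn)                # sensitivity (recall) = TP / (TP + FN)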

df.2 = ROSE(data = df.1,
            seed = 123,
            ProdTaken ~ .)$data

set.seed(123)
split = createDataPartition(df.2$ProdTaken, p = 0.8, list = F)
train.data = df.2[split, ]
test.data = df.2[-split, ]

mod.4.3 = glm(data = train.data,
              ProdTaken ~ 
                MonthlyIncome +
                Designation +
                OwnCar + 
                Passport +
                PreferredPropertyStar + 
                DurationOfPitch + 
                CityTier +
                Age +
                NumberOfFollowups + 
                NumberOfChildrenVisiting +
                PitchSatisfactionScore,
              family = "binomial")

mod.4.4 = stepAIC(mod.4.3,
                  direction = "both",
                  trace = F)

pred = predict(mod.4.4,
               type = "response",
               newdata = test.data)

pred.cutoff = factor(ifelse(pred >= 0.5, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]

a = c("Accuracy",
      "Sensitivity",
      "Specificity",
      "Total")
b = c(accuracy,
      sensitivity,
      specificity,
      sum(accuracy, sensitivity, specificity)) %>% 
  unname() %>% 
  round(3)
data.frame(a, b) %>% 
  rename("Index" = "a",
         "Value" = "b") %>% 
  kbl(align = "l",
      caption = "With Cross Validation / Cutoff in 0.500") %>% 
  kable_classic("hover")
With ROSE Resampling / Cutoff at 0.500
Index Value
Accuracy 0.698
Sensitivity 0.682
Specificity 0.714
Total 2.095

To address the imbalance, we use ROSE (Random Over-Sampling Examples) to generate a balanced version of the data; note that ROSE is a resampling technique, not cross-validation. Without cutoff tuning, neither the accuracy nor the overall score improves on the previous model, so we proceed to the next model, which tunes the cutoff as well.
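A quick before/after check of what ROSE did to the class balance (the resampled counts should come out roughly even):

table(df.1$ProdTaken)   # original, imbalanced
table(df.2$ProdTaken)   # after ROSE resampling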

s = seq(0.01, 0.99, length = 100)
out = matrix(1, 100, 3)
for(i in 1:100){
  out[i,] = perform.fn(s[i])} 

# plot(s, out[, 1], col = 2, lwd = 2, 
#      type = "l", axes = T, 
#      xlab = "Cutoff", ylab = "Value")
# lines(s, out[,2], col = "darkgreen", lwd = 2)
# lines(s, out[,3], col = 4, lwd = 2)
# legend("right", 
#        col = c(2, "darkgreen", 4, "darkred"), 
#        text.font = 3, 
#        inset = 0.02, 
#        box.lty = 0, 
#        cex = 0.8, 
#        lwd = c(2, 2, 2, 2), 
#        c("Sensitivity", "Specificity", "Accuracy"))
cutoff = s[which(abs(out[,1]-out[,2]) < 0.025)]
# abline(v = cutoff, col = "black", lwd = 1, lty = 3)

pred.cutoff = factor(ifelse(pred >= cutoff, "Yes", "No"))
actual = factor(ifelse(test.data$ProdTaken == 1, "Yes", "No"))
cm = confusionMatrix(pred.cutoff, actual, positive = "Yes")
accuracy = cm$overall[1]
sensitivity = cm$byClass[1]
specificity = cm$byClass[2]

a = c("Accuracy",
      "Sensitivity",
      "Specificity",
      "Total")
b = c(accuracy,
      sensitivity,
      specificity,
      sum(accuracy, sensitivity, specificity)) %>% 
  unname() %>% 
  round(3)
data.frame(a, b) %>% 
  rename("Index" = "a",
         "Value" = "b") %>% 
  kbl(align = "l",
      caption = "With Cross Validation / Cutoff in 0.485") %>% 
  kable_classic("hover")
With ROSE Resampling / Cutoff at 0.485
Index Value
Accuracy 0.693
Sensitivity 0.694
Specificity 0.692
Total 2.080

The last model is worse still: both the accuracy and the total score drop.

mod.sum = summary(mod.4.2)

mod.sum.df = mod.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  mutate(Estimate = round(Estimate, 4)) %>% 
  rename("P.value" = `Pr(>|z|)`) %>% 
  filter(P.value < 0.05) %>% 
  rownames_to_column(var = "Variable")

Des = c("The customer has a passport",
        "Executive designation of the customer in the current organization",
        "City tier depends on the development of a city, population, facilities, and living standards",
        "Total number of followups has been done by the salesperson after the sales pitch",
        "Preferred hotel property rating by customer",
        "Sales pitch satisfaction score",
        "Duration of the pitch by a salesperson to the customer",
        "Gross monthly income of the customer",
        "Age of customer")

mod.sum.df %>% 
  arrange(desc(Estimate)) %>% 
  filter(Variable != "(Intercept)") %>% 
  dplyr::select(-P.value) %>%
  add_column(Description = Des) %>%
  kbl(align = "l",
      caption = "Significantly Effect Factors") %>%
  kable_classic("hover")
Statistically Significant Factors
Variable Estimate Description
Passport1 1.5641 The customer has a passport
DesignationExecutive 1.0088 Executive designation of the customer in the current organization
CityTier 0.5546 City tier depends on the development of a city, population, facilities, and living standards
NumberOfFollowups 0.4062 Total number of followups has been done by the salesperson after the sales pitch
PreferredPropertyStar 0.3484 Preferred hotel property rating by customer
PitchSatisfactionScore 0.1144 Sales pitch satisfaction score
DurationOfPitch 0.0270 Duration of the pitch by a salesperson to the customer
MonthlyIncome -0.0001 Gross monthly income of the customer
Age -0.0208 Age of customer

Based on the total score, the second model, with no resampling and a tuned cutoff, performs best, so we choose it. Which factors, then, are significantly related to the decision to take the product? The table above gives the answer. Take the most positive and the most negative estimates as examples: customers who hold a passport are more likely to accept the offer (estimate greater than zero, statistically significant at the 0.05 level), while older customers are less likely to accept it (estimate less than zero, also significant at 0.05). The analysis therefore yields both the best prediction model and interpretable effect estimates.
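To read these estimates on a more intuitive scale, a logistic coefficient can be exponentiated into an odds ratio; a sketch for the two examples discussed, using the chosen model mod.4.2:

exp(coef(mod.4.2)[c("Passport1", "Age")])
# Passport1: exp(1.5641) ~ 4.78, i.e. holding a passport multiplies the odds
#            of taking the offer by roughly 4.8
# Age:       exp(-0.0208) ~ 0.98, i.e. each extra year of age lowers the odds
#            by about 2%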

6 Conclusion

Firstly, we examined the acceptance rate for each feature, without significance tests but as a rough picture. We also found that customers who took the offer have a lower monthly income, which goes against the common assumption that higher-income customers travel more frequently. The most frequently accepted product is the Basic package, possibly because of its low price. A customer's position is related to salary, yet most customers who took the offer hold the Executive designation. The agency should design packages aimed at higher-income audiences to increase revenue. We then built models for both prediction and interpretation. For existing products, the agency can promote and advertise to the target audience more precisely, which should improve sales. For new products, the agency can develop offerings that attract audiences who were not previously interested in its products.
