Travel agencies offer many products and services. To reach their customers, a travel agency has to put effort into promotion: handing out leaflets, calling customers, or running ads on various platforms. For some companies, calling customers works best, and our travel agency has taken that route: until now, it has called every customer to offer each new product or package. As data analysts, we want to help the marketing and sales departments optimize their chance of winning a customer from a call.
The data set was taken from Kaggle (https://www.kaggle.com/susant4learning/holiday-package-purchase-prediction). It describes the agency's customers, their purchases, and how the agency reached them. The important columns are:
Age : Age of the customer
CityTier : City tier, which depends on a city's development, population, facilities, and living standards; the categories are ordered
DurationOfPitch : Duration of the sales pitch to the customer
NumberOfFollowups : Total number of follow-ups done by the salesperson after the sales pitch
PreferredPropertyStar : Hotel property rating preferred by the customer
NumberOfTrips : Number of trips taken by the customer; this column will let us classify customers into two classes (often: yes if >= 4 trips, no if <= 3)
Passport : Whether the customer has a passport (0: No, 1: Yes)
PitchSatisfactionScore : Sales pitch satisfaction score
OwnCar : Whether the customer owns a car (0: No, 1: Yes)
NumberOfChildrenVisiting : Total number of children under age 5 planning to take the trip with the customer
Designation : Designation of the customer in their current organization
MonthlyIncome : Gross monthly income of the customer
ProdTaken : Whether the customer took our offer (0: No, 1: Yes)
These are the R packages we use in this case study.
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gsubfn)
library(rsample)
library(caret)
library(class)
library(ROSE)
library(cmplot)
Now we can prepare our data.
Load the data and take a first look.
travel <- read.csv("Travel.csv")
glimpse(travel)
## Rows: 4,888
## Columns: 20
## $ CustomerID <int> 200000, 200001, 200002, 200003, 200004, 20000~
## $ ProdTaken <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ Age <int> 41, 49, 37, 33, NA, 32, 59, 30, 38, 36, 35, N~
## $ TypeofContact <chr> "Self Enquiry", "Company Invited", "Self Enqu~
## $ CityTier <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ DurationOfPitch <int> 6, 14, 8, 9, 8, 8, 9, 30, 29, 33, 22, 21, 32,~
## $ Occupation <chr> "Salaried", "Salaried", "Free Lancer", "Salar~
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "~
## $ NumberOfPersonVisiting <int> 3, 3, 3, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, ~
## $ NumberOfFollowups <int> 3, 4, 4, 3, 3, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, ~
## $ ProductPitched <chr> "Deluxe", "Deluxe", "Basic", "Basic", "Basic"~
## $ PreferredPropertyStar <int> 3, 4, 3, 3, 4, 3, 5, 3, 3, 3, 4, 3, 3, 3, 3, ~
## $ MaritalStatus <chr> "Single", "Divorced", "Single", "Divorced", "~
## $ NumberOfTrips <int> 1, 2, 7, 2, 1, 1, 5, 2, 1, 7, 1, 1, 2, 1, 6, ~
## $ Passport <int> 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, ~
## $ PitchSatisfactionScore <int> 2, 3, 3, 5, 5, 5, 2, 2, 3, 3, 3, 3, 3, 3, 2, ~
## $ OwnCar <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, ~
## $ NumberOfChildrenVisiting <int> 0, 2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 2, 0, ~
## $ Designation <chr> "Manager", "Manager", "Executive", "Executive~
## $ MonthlyIncome <int> 20993, 20130, 17090, 17909, 18468, 18068, 176~
We have 4,888 rows and 20 columns. Our main goal is to classify which customers are likely to take a new product offered by the marketing team; the target is whether a customer travels often. But first we need to check the NA values in each column.
colSums(is.na(travel))
## CustomerID ProdTaken Age
## 0 0 226
## TypeofContact CityTier DurationOfPitch
## 0 0 251
## Occupation Gender NumberOfPersonVisiting
## 0 0 0
## NumberOfFollowups ProductPitched PreferredPropertyStar
## 45 0 26
## MaritalStatus NumberOfTrips Passport
## 0 140 0
## PitchSatisfactionScore OwnCar NumberOfChildrenVisiting
## 0 0 66
## Designation MonthlyIncome
## 0 233
new_DF <- travel %>% filter_all(any_vars(is.na(.)))
new_DF
About 15% (760/4888) of the rows contain NA values. We will drop them all, since 85% of the data remains.
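A quick check of that percentage (a minimal sketch using the objects defined above):
# Rows with at least one NA, and their share of the full data
nrow(new_DF)                # 760 (per the figures above)
nrow(new_DF) / nrow(travel) # 760/4888 ~ 0.155, i.e. about 15%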
Our target depends on many factors. First we convert the categorical columns to factors; then we remove the NA rows so the data is ready for modeling. We also drop CustomerID, which we do not need.
travel <- travel %>%
mutate_at(c("TypeofContact","Occupation","Gender","ProductPitched","MaritalStatus" ,'Designation', "Passport", "OwnCar", "ProdTaken"), as.factor) %>%
select(-CustomerID)
travel <- na.omit(travel)
The data no longer contains NA values, and CustomerID was dropped in the pipe above.
Before continuing with exploratory data analysis, we should inspect the levels of each factor column. This catches values that mean the same thing but are written differently. Check the levels:
levels(travel$TypeofContact)
## [1] "" "Company Invited" "Self Enquiry"
levels(travel$Occupation)
## [1] "Free Lancer" "Large Business" "Salaried" "Small Business"
levels(travel$Gender)
## [1] "Fe Male" "Female" "Male"
We found an incorrect level in the Gender column: "Fe Male" should be "Female". We will use gsub() to fix it.
travel <- travel %>%
mutate(Gender = gsub("Fe Male","Female", Gender),
Gender = as.factor(Gender))
levels(travel$Gender)
## [1] "Female" "Male"
Checking again, the levels have been fixed.
levels(travel$ProductPitched)
## [1] "Basic" "Deluxe" "King" "Standard" "Super Deluxe"
levels(travel$MaritalStatus)
## [1] "Divorced" "Married" "Single" "Unmarried"
levels(travel$Designation)
## [1] "AVP" "Executive" "Manager" "Senior Manager"
## [5] "VP"
Across TypeofContact, Occupation, Gender, ProductPitched, MaritalStatus, and Designation, the only erroneous levels were in Gender, and those are now fixed. To identify customers who like to use our product, we will create a new column, often, with values 1 = yes and 0 = no, by grouping the NumberOfTrips column.
Check the customers' number of trips.
table(travel$NumberOfTrips)
##
## 1 2 3 4 5 6 7 8 19 20 21 22
## 467 1267 993 422 396 283 193 103 1 1 1 1
The most popular numbers of trips are 2 and 3. In this case we will use 4 as the threshold: customers with 4 or more trips are labeled 1/yes in the often column, and customers with fewer than 4 are labeled 0/no.
travel <- travel %>%
mutate(often = ifelse(NumberOfTrips>=4, 1, 0),
often= as.factor(often))
# travel_a <- travel %>%
# mutate(often = ifelse(NumberOfTrips<5, 0, 1),
# often = as.factor(often))
We want to see the mean monthly income of our customers in the two groups (travels often vs. rarely).
data.frame(aggregate(MonthlyIncome ~ often, travel, mean)) %>%
ggplot(aes(x = often, y = MonthlyIncome)) +
geom_col()
data.frame(aggregate(MonthlyIncome ~ ProdTaken, travel, mean)) %>%
ggplot(aes(x = ProdTaken, y = MonthlyIncome)) +
geom_col()
Based on the historical data, people who travel often have a higher mean income. But in our most recent records, customers who rejected our product have a higher mean monthly income than those who took it.
trav_tab <- data.frame(table(travel$ProdTaken, travel$ProductPitched))
ggplot(trav_tab,
aes(x = Var2,
y = Freq,
fill = Var1)) +
geom_col(aes(fill = Var1)) +
labs(x = "Package",
y = "Number of Customer",
fill = "Product's Taken Status") The most package offer to customer are Basic, Deluxe and Standard. How the probability customer take our product? It can count by total customer who received product and total customer get offer.
table(travel$ProdTaken)
##
## 0 1
## 3331 797
797/3331*100
## [1] 23.92675
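Strictly, 797/3331 is the ratio of takers to decliners (the odds); the take rate over everyone offered is a bit lower. A minimal sketch of the complementary view:
# Share of each ProdTaken class among all offered customers
prop.table(table(travel$ProdTaken)) # take rate = 797/(3331 + 797) ~ 0.193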
With odds this low, having the marketing department call every single customer would waste a great deal of time. We will help them target better.
The bar chart of monthly income by often status tells us that people who travel often have a higher mean income than those who rarely travel, though the difference is not large. To target more specifically, we look at monthly income within the "often" group.
often_trav <- travel %>%
filter(often == 1)
income <- aggregate(MonthlyIncome ~ Occupation+Designation, data = often_trav, FUN = mean)
income <- income[order(income$MonthlyIncome, decreasing = T),]
income$newcol <- paste(income$Designation, income$Occupation, sep=" at ")
ggplot(income, aes(x = MonthlyIncome, y = reorder(newcol, MonthlyIncome))) +
geom_col() +
labs(x = "Monthly Income",
y = "Position")We know that VP at Salaried (Salaried means government employees) is the most high monthly income and freelancer is the lowest monthly income in “often” group status. We want to make model to optimize our marketing to promote a new holiday packet into the right niche.
Split the data into training and test sets.
set.seed(101)
index <- initial_split(data = travel, prop = 0.7, strata = "often")
travel_train <- training(index)
travel_test <- testing(index)
Check the proportion of our target (the often column).
prop.table(table(travel_train$often))
##
## 0 1
## 0.6606648 0.3393352
Our training data is imbalanced, but we will try building a model anyway.
We start with a first logistic regression model using our selected predictors.
model_1_1 <- glm(often ~ MonthlyIncome + Designation + OwnCar + Passport + PreferredPropertyStar + DurationOfPitch + CityTier + Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore, data = travel_train, family = binomial("logit"))
summary(model_1_1)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## Passport + PreferredPropertyStar + DurationOfPitch + CityTier +
## Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore,
## family = binomial("logit"), data = travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6439 -0.8947 -0.6853 1.2043 2.1138
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7103377 0.5856832 -6.335 2.37e-10 ***
## MonthlyIncome 0.0000145 0.0000158 0.917 0.35891
## DesignationExecutive 0.0509378 0.2564586 0.199 0.84256
## DesignationManager 0.3242970 0.2226897 1.456 0.14532
## DesignationSenior Manager 0.0339347 0.2012344 0.169 0.86609
## DesignationVP -0.6287245 0.3083987 -2.039 0.04148 *
## OwnCar1 -0.2541194 0.0848539 -2.995 0.00275 **
## Passport1 0.0393157 0.0904823 0.435 0.66392
## PreferredPropertyStar 0.0511770 0.0519095 0.986 0.32419
## DurationOfPitch -0.0039810 0.0050261 -0.792 0.42832
## CityTier -0.0835885 0.0471103 -1.774 0.07601 .
## Age 0.0667525 0.0053207 12.546 < 2e-16 ***
## NumberOfFollowups 0.0017476 0.0442917 0.039 0.96853
## NumberOfChildrenVisiting 0.2189515 0.0528240 4.145 3.40e-05 ***
## PitchSatisfactionScore -0.0219550 0.0302063 -0.727 0.46733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3700.1 on 2887 degrees of freedom
## Residual deviance: 3433.3 on 2873 degrees of freedom
## AIC: 3463.3
##
## Number of Fisher Scoring iterations: 4
travel_test$pred_back_1_1 <- predict(model_1_1, newdata = travel_test, type = "response")
travel_test$pred.Label_1_1 <- ifelse(travel_test$pred_back_1_1 > 0.5, "1", "0")
travel_test$pred.Label_1_1 <- as.factor(travel_test$pred.Label_1_1)
confusionMatrix(data = travel_test$pred.Label_1_1,
reference = travel_test$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 719 308
## 1 100 113
##
## Accuracy : 0.671
## 95% CI : (0.644, 0.6971)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.2272
##
## Kappa : 0.1663
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.26841
## Specificity : 0.87790
## Pos Pred Value : 0.53052
## Neg Pred Value : 0.70010
## Prevalence : 0.33952
## Detection Rate : 0.09113
## Detection Prevalence : 0.17177
## Balanced Accuracy : 0.57315
##
## 'Positive' Class : 1
##
The sensitivity score is poor, so we try stepwise backward selection.
model_back_1 <- step(object = model_1_1, direction = "backward", trace = F)
summary(model_back_1)
##
## Call:
## glm(formula = often ~ Designation + OwnCar + CityTier + Age +
## NumberOfChildrenVisiting, family = binomial("logit"), data = travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6496 -0.8926 -0.6844 1.2106 2.0843
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.245982 0.324044 -10.017 < 2e-16 ***
## DesignationExecutive -0.086103 0.187497 -0.459 0.64607
## DesignationManager 0.221097 0.178865 1.236 0.21642
## DesignationSenior Manager -0.012411 0.185827 -0.067 0.94675
## DesignationVP -0.557811 0.301761 -1.849 0.06453 .
## OwnCar1 -0.254627 0.084617 -3.009 0.00262 **
## CityTier -0.082719 0.046999 -1.760 0.07840 .
## Age 0.067198 0.005287 12.711 < 2e-16 ***
## NumberOfChildrenVisiting 0.238196 0.048743 4.887 1.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3700.1 on 2887 degrees of freedom
## Residual deviance: 3436.5 on 2879 degrees of freedom
## AIC: 3454.5
##
## Number of Fisher Scoring iterations: 4
travel_test$pred_back_1 <- predict(model_back_1, newdata = travel_test, type = "response")
travel_test$pred.Label_1 <- ifelse(travel_test$pred_back_1 > 0.5, "1", "0")
travel_test$pred.Label_1 <- as.factor(travel_test$pred.Label_1)
confusionMatrix(data = travel_test$pred.Label_1,
reference = travel_test$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 719 310
## 1 100 111
##
## Accuracy : 0.6694
## 95% CI : (0.6424, 0.6955)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.2651
##
## Kappa : 0.1611
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.26366
## Specificity : 0.87790
## Pos Pred Value : 0.52607
## Neg Pred Value : 0.69874
## Prevalence : 0.33952
## Detection Rate : 0.08952
## Detection Prevalence : 0.17016
## Balanced Accuracy : 0.57078
##
## 'Positive' Class : 1
##
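Stepwise selection barely moved the needle. One cheap lever before changing model families is the 0.5 cutoff itself: lowering it trades specificity for sensitivity. A minimal sketch reusing the stepwise predictions already computed above (the 0.35 cutoff is an arbitrary illustration, not a tuned value):
# Re-label the stepwise predictions with a lower cutoff to favor sensitivity
pred.Label_lowcut <- factor(ifelse(travel_test$pred_back_1 > 0.35, "1", "0"),
                            levels = c("0", "1"))
confusionMatrix(data = pred.Label_lowcut,
                reference = travel_test$often,
                positive = "1")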
Even so, our model still has a poor sensitivity score at the default cutoff. Let's try a KNN model.
KNN needs features on a common scale, so we normalize the numeric columns with min-max scaling. (Strictly, the test set should be scaled with the training set's minima and maxima; here each set is normalized on its own for simplicity.)
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}
travel_train_norm <- travel_train %>%
mutate_if(is.numeric, normalize)
travel_test_norm <- travel_test %>%
mutate_if(is.numeric, normalize)
travel_train_norm <- travel_train_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar")) %>%
mutate(often = as.factor(often))
travel_test_norm <- travel_test_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar", "pred_back_1_1", "pred.Label_1_1", "pred_back_1", "pred.Label_1")) %>%
mutate(often = as.factor(often))
anyNA(travel_test_norm)
## [1] FALSE
anyNA(travel_train_norm)
## [1] FALSE
travel_train_norm_x <- travel_train_norm %>% select(-often)
travel_test_norm_x <- travel_test_norm %>% select(-often)
travel_train_norm_y <- travel_train_norm$often
travel_test_norm_y <- travel_test_norm$often
sqrt(nrow(travel_train_norm))
## [1] 53.74012
Our target has two classes, so we pick an odd k; we use 53, roughly the square root of the number of training rows.
travel_pred_c <- knn(train = travel_train_norm_x, test = travel_test_norm_x, k = 53, cl = travel_train_norm_y)
confusionMatrix(travel_pred_c, reference = as.factor(travel_test_norm_y), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 725 96
## 1 94 325
##
## Accuracy : 0.8468
## 95% CI : (0.8255, 0.8664)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.658
##
## Mcnemar's Test P-Value : 0.9422
##
## Sensitivity : 0.7720
## Specificity : 0.8852
## Pos Pred Value : 0.7757
## Neg Pred Value : 0.8831
## Prevalence : 0.3395
## Detection Rate : 0.2621
## Detection Prevalence : 0.3379
## Balanced Accuracy : 0.8286
##
## 'Positive' Class : 1
##
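As an aside, k = 53 came from the sqrt(n) heuristic. k could instead be tuned by cross-validation; a minimal sketch with caret's train(), where the grid values are illustrative:
# 5-fold cross-validation over a small grid of odd k values
set.seed(101)
knn_cv <- train(x = travel_train_norm_x,
                y = travel_train_norm_y,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(5, 75, by = 10)))
knn_cv$bestTune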
The KNN model performs much better than logistic regression, but the training data is imbalanced, so it is worth trying to optimize further. Several resampling methods can help; a simple one is sketched below.
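For instance, random over-sampling of the minority class can be done with caret's upSample(), which simply duplicates minority rows until the classes balance (a sketch; travel_train is the training set created above):
# Randomly duplicate minority-class rows until both classes are equal in size
travel_train_up <- upSample(x = travel_train %>% select(-often),
                            y = travel_train$often,
                            yname = "often")
table(travel_train_up$often)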
Here we instead use ROSE to balance the data, since the imbalance may be the cause of the poor models. We apply ROSE only to the training set, so the unseen test data keeps its original, imbalanced distribution.
travel_train_rose <- training(index)
travel_test_rose <- testing(index)
new_travel_train <- ROSE(often ~ ., data = travel_train_rose, seed = 101)$data
table(new_travel_train$often)
##
## 0 1
## 1442 1446
model_1 <- glm(often ~ MonthlyIncome + Designation + OwnCar + Passport + PreferredPropertyStar + DurationOfPitch + CityTier + Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore + NumberOfTrips, data = new_travel_train, family = binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_1)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## Passport + PreferredPropertyStar + DurationOfPitch + CityTier +
## Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore +
## NumberOfTrips, family = binomial("logit"), data = new_travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5233 -0.1670 0.0000 0.0556 3.4223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.117e+01 1.206e+00 -9.265 < 2e-16 ***
## MonthlyIncome -4.211e-05 2.540e-05 -1.658 0.09727 .
## DesignationExecutive -9.591e-01 4.943e-01 -1.940 0.05236 .
## DesignationManager -2.395e-01 4.445e-01 -0.539 0.59001
## DesignationSenior Manager -1.781e-01 4.216e-01 -0.423 0.67266
## DesignationVP 6.786e-02 6.993e-01 0.097 0.92269
## OwnCar1 -5.584e-01 1.791e-01 -3.118 0.00182 **
## Passport1 2.075e-02 1.940e-01 0.107 0.91485
## PreferredPropertyStar 1.674e-01 9.790e-02 1.710 0.08729 .
## DurationOfPitch -1.287e-02 9.334e-03 -1.379 0.16797
## CityTier -9.858e-02 8.701e-02 -1.133 0.25725
## Age 5.821e-02 9.717e-03 5.990 2.09e-09 ***
## NumberOfFollowups -1.189e-01 7.969e-02 -1.491 0.13585
## NumberOfChildrenVisiting -7.607e-02 9.242e-02 -0.823 0.41049
## PitchSatisfactionScore -7.665e-02 5.665e-02 -1.353 0.17608
## NumberOfTrips 3.310e+00 1.609e-01 20.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4003.61 on 2887 degrees of freedom
## Residual deviance: 894.58 on 2872 degrees of freedom
## AIC: 926.58
##
## Number of Fisher Scoring iterations: 8
model_back <- step(object = model_1, direction = "backward", trace = F)
summary(model_back)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## PreferredPropertyStar + Age + NumberOfFollowups + NumberOfTrips,
## family = binomial("logit"), data = new_travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6039 -0.1671 0.0000 0.0560 3.5100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.167e+01 1.154e+00 -10.107 < 2e-16 ***
## MonthlyIncome -4.820e-05 2.460e-05 -1.959 0.050086 .
## DesignationExecutive -9.992e-01 4.769e-01 -2.095 0.036163 *
## DesignationManager -3.276e-01 4.308e-01 -0.760 0.446963
## DesignationSenior Manager -1.930e-01 4.140e-01 -0.466 0.641090
## DesignationVP 1.446e-01 6.963e-01 0.208 0.835465
## OwnCar1 -5.947e-01 1.773e-01 -3.353 0.000799 ***
## PreferredPropertyStar 1.668e-01 9.670e-02 1.725 0.084546 .
## Age 5.913e-02 9.593e-03 6.163 7.12e-10 ***
## NumberOfFollowups -1.227e-01 7.817e-02 -1.569 0.116627
## NumberOfTrips 3.311e+00 1.608e-01 20.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4003.61 on 2887 degrees of freedom
## Residual deviance: 900.32 on 2877 degrees of freedom
## AIC: 922.32
##
## Number of Fisher Scoring iterations: 8
travel_test_rose$pred_back <- predict(model_back, newdata = travel_test_rose, type = "response")
travel_test_rose$pred.Label <- ifelse(travel_test_rose$pred_back > 0.5, "1", "0")
travel_test_rose$pred.Label <- as.factor(travel_test_rose$pred.Label)
confusionMatrix(data = travel_test_rose$pred.Label,
reference = travel_test_rose$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 816 0
## 1 3 421
##
## Accuracy : 0.9976
## 95% CI : (0.9929, 0.9995)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9946
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9963
## Pos Pred Value : 0.9929
## Neg Pred Value : 1.0000
## Prevalence : 0.3395
## Detection Rate : 0.3395
## Detection Prevalence : 0.3419
## Balanced Accuracy : 0.9982
##
## 'Positive' Class : 1
##
This model's sensitivity score is far better than our previous models', which is what we want: recall matters here because we aim to capture as many actual "yes" customers in often as possible, so we can offer the new product to customers whose trip history shows loyalty to our packages. But we must recognize that the model is almost certainly overfitting through leakage: often was derived directly from NumberOfTrips, and NumberOfTrips is included as a predictor (note its dominant coefficient above), so the near-perfect scores largely reflect the model rediscovering our labeling rule.
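One way to test that suspicion is to refit the ROSE model without NumberOfTrips and see how far performance drops; a minimal sketch reusing the objects above (the predictor subset is illustrative):
# Refit on the ROSE-balanced data without the leaking predictor
model_noleak <- glm(often ~ MonthlyIncome + Designation + OwnCar + Age +
                      NumberOfChildrenVisiting,
                    data = new_travel_train, family = binomial("logit"))
pred_noleak <- factor(ifelse(predict(model_noleak, newdata = travel_test_rose,
                                     type = "response") > 0.5, "1", "0"),
                      levels = c("0", "1"))
confusionMatrix(pred_noleak, reference = travel_test_rose$often, positive = "1")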
new_travel_train_norm <- new_travel_train %>%
mutate_if(is.numeric, normalize)
new_travel_test_norm <- travel_test_rose %>%
mutate_if(is.numeric, normalize)
new_travel_train_norm <- new_travel_train_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar")) %>%
mutate(often = ifelse(often == 1, "Yes", "No"))
new_travel_test_norm <- new_travel_test_norm %>%
mutate(often = ifelse(often == 1, "Yes", "No")) %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar", "pred_back", "pred.Label"))new_travel_train_norm_x <- new_travel_train_norm %>% select(-often)
new_travel_test_norm_x <- new_travel_test_norm %>% select(-often)new_travel_train_norm_y <- new_travel_train_norm$often
new_travel_test_norm_y <- new_travel_test_norm$oftensqrt(nrow(new_travel_train_norm))## [1] 53.74012
We use k = 53 again.
new_travel_pred_c <- knn(train = new_travel_train_norm_x, test = new_travel_test_norm_x, k = 53, cl = new_travel_train_norm_y)
confusionMatrix(new_travel_pred_c, reference = as.factor(new_travel_test_norm_y), positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 463 7
## Yes 356 414
##
## Accuracy : 0.7073
## 95% CI : (0.6811, 0.7325)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.0002443
##
## Kappa : 0.4567
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9834
## Specificity : 0.5653
## Pos Pred Value : 0.5377
## Neg Pred Value : 0.9851
## Prevalence : 0.3395
## Detection Rate : 0.3339
## Detection Prevalence : 0.6210
## Balanced Accuracy : 0.7743
##
## 'Positive' Class : Yes
##
For the KNN model on ROSE-balanced data, the sensitivity score is lower than that of the stepwise logistic model, but this model does not appear to overfit the data. It still has good sensitivity, precision, and accuracy compared with every other model except the ROSE-tuned stepwise model.
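To recap the test-set results reported above:
Model                              Accuracy   Sensitivity   Specificity
Logistic regression (original)     0.6710     0.2684        0.8779
Stepwise logistic (original)       0.6694     0.2637        0.8779
KNN, k = 53 (original)             0.8468     0.7720        0.8852
Stepwise logistic (ROSE)           0.9976     1.0000        0.9963
KNN, k = 53 (ROSE)                 0.7073     0.9834        0.5653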
In summary, our data is imbalanced. On the original data, logistic regression gives poor sensitivity while KNN fares much better; sensitivity matters here because we want to capture as many customers as possible who often purchase our travel packages. We can improve the models with ROSE (Random Over-Sampling Examples), which balances the data by generating new synthetic examples via a smoothed bootstrap approach. Applying ROSE improved the sensitivity of logistic regression but decreased the sensitivity of KNN. The ROSE stepwise model has excellent accuracy, but as noted above it is likely overfitting through the NumberOfTrips leak. Still, the stepwise model on ROSE data suggests that the predictors influencing the target are age, number of trips, and car ownership.