Travel agencies offer many products and services. To reach their customers, a travel agency has to put effort into promotion: handing out leaflets, calling customers, or running ads on various platforms. For some companies, calling customers works best, and our travel agency has taken that route: until now, it has called every customer to offer each new product or package. As data analysts, we want to help the marketing and sales departments optimize their chance of winning a customer from a call.
The data set was taken from Kaggle (https://www.kaggle.com/susant4learning/holiday-package-purchase-prediction). It describes the agency's customers, their purchases, and how the agency reached them. The important columns are:
Age : Age of the customer
CityTier : City tier, which depends on a city's development, population, facilities, and living standards; the categories are ordered
DurationOfPitch : Duration of the sales pitch to the customer
NumberOfFollowups : Total number of follow-ups done by the salesperson after the sales pitch
PreferredPropertyStar : Hotel property rating preferred by the customer
NumberOfTrips : Number of trips taken by the customer; this column will let us classify customers into two classes (often: yes if >= 4 trips, no if <= 3)
Passport : Whether the customer has a passport (0: No, 1: Yes)
PitchSatisfactionScore : Sales pitch satisfaction score
OwnCar : Whether the customer owns a car (0: No, 1: Yes)
NumberOfChildrenVisiting : Total number of children under age 5 planning to take the trip with the customer
Designation : Designation of the customer in their current organization
MonthlyIncome : Gross monthly income of the customer
ProdTaken : Whether the customer took our offer (0: No, 1: Yes)
These are the R packages we use in this case study.
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gsubfn)
library(rsample)
library(caret)
library(class)
library(ROSE)
library(cmplot)
Now we can prepare our data.
Load the data and take a first look.
travel <- read.csv("Travel.csv")
glimpse(travel)
## Rows: 4,888
## Columns: 20
## $ CustomerID <int> 200000, 200001, 200002, 200003, 200004, 20000~
## $ ProdTaken <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ Age <int> 41, 49, 37, 33, NA, 32, 59, 30, 38, 36, 35, N~
## $ TypeofContact <chr> "Self Enquiry", "Company Invited", "Self Enqu~
## $ CityTier <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ DurationOfPitch <int> 6, 14, 8, 9, 8, 8, 9, 30, 29, 33, 22, 21, 32,~
## $ Occupation <chr> "Salaried", "Salaried", "Free Lancer", "Salar~
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "~
## $ NumberOfPersonVisiting <int> 3, 3, 3, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, ~
## $ NumberOfFollowups <int> 3, 4, 4, 3, 3, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, ~
## $ ProductPitched <chr> "Deluxe", "Deluxe", "Basic", "Basic", "Basic"~
## $ PreferredPropertyStar <int> 3, 4, 3, 3, 4, 3, 5, 3, 3, 3, 4, 3, 3, 3, 3, ~
## $ MaritalStatus <chr> "Single", "Divorced", "Single", "Divorced", "~
## $ NumberOfTrips <int> 1, 2, 7, 2, 1, 1, 5, 2, 1, 7, 1, 1, 2, 1, 6, ~
## $ Passport <int> 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, ~
## $ PitchSatisfactionScore <int> 2, 3, 3, 5, 5, 5, 2, 2, 3, 3, 3, 3, 3, 3, 2, ~
## $ OwnCar <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, ~
## $ NumberOfChildrenVisiting <int> 0, 2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 2, 0, ~
## $ Designation <chr> "Manager", "Manager", "Executive", "Executive~
## $ MonthlyIncome <int> 20993, 20130, 17090, 17909, 18468, 18068, 176~
We have 4,888 rows and 20 columns. Our main goal is to classify which customers are likely to take a new product offered by the marketing team; the target is whether a customer travels often. But first we need to check the NA values in each column.
colSums(is.na(travel))
## CustomerID ProdTaken Age
## 0 0 226
## TypeofContact CityTier DurationOfPitch
## 0 0 251
## Occupation Gender NumberOfPersonVisiting
## 0 0 0
## NumberOfFollowups ProductPitched PreferredPropertyStar
## 45 0 26
## MaritalStatus NumberOfTrips Passport
## 0 140 0
## PitchSatisfactionScore OwnCar NumberOfChildrenVisiting
## 0 0 66
## Designation MonthlyIncome
## 0 233
new_DF <- travel %>% filter_all(any_vars(is.na(.)))
new_DF
About 15% (760/4888) of the rows contain NA values. We will drop them all, since 85% of the data remains.
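A quick check of that percentage (a minimal sketch using the objects defined above):
# Rows with at least one NA, and their share of the full data
nrow(new_DF)                # 760 (per the figures above)
nrow(new_DF) / nrow(travel) # 760/4888 ~ 0.155, i.e. about 15%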
Our target depends on many factors. First we convert the categorical columns to factors; then we remove the NA rows so the data is ready for modeling. We also drop CustomerID, which we do not need.
travel <- travel %>%
mutate_at(c("TypeofContact","Occupation","Gender","ProductPitched","MaritalStatus" ,'Designation', "Passport", "OwnCar", "ProdTaken"), as.factor) %>%
select(-CustomerID)
travel <- na.omit(travel)
The data no longer contains NA values, and CustomerID was dropped in the pipe above.
Before continuing with exploratory data analysis, we should inspect the levels of each factor column. This catches values that mean the same thing but are written differently. Check the levels:
levels(travel$TypeofContact)
## [1] "" "Company Invited" "Self Enquiry"
levels(travel$Occupation)
## [1] "Free Lancer" "Large Business" "Salaried" "Small Business"
levels(travel$Gender)
## [1] "Fe Male" "Female" "Male"
We found an incorrect level in the Gender column: "Fe Male" should be "Female". We will use gsub() to fix it.
travel <- travel %>%
mutate(Gender = gsub("Fe Male","Female", Gender),
Gender = as.factor(Gender))
levels(travel$Gender)
## [1] "Female" "Male"
Checking again, the levels have been fixed.
levels(travel$ProductPitched)
## [1] "Basic" "Deluxe" "King" "Standard" "Super Deluxe"
levels(travel$MaritalStatus)
## [1] "Divorced" "Married" "Single" "Unmarried"
levels(travel$Designation)
## [1] "AVP" "Executive" "Manager" "Senior Manager"
## [5] "VP"
Across TypeofContact, Occupation, Gender, ProductPitched, MaritalStatus, and Designation, the only erroneous levels were in Gender, and those are now fixed. To identify customers who like to use our product, we will create a new column, often, with values 1 = yes and 0 = no, by grouping the NumberOfTrips column.
Check the customers' number of trips.
table(travel$NumberOfTrips)
##
## 1 2 3 4 5 6 7 8 19 20 21 22
## 467 1267 993 422 396 283 193 103 1 1 1 1
The most popular numbers of trips are 2 and 3. In this case we will use 4 as the threshold: customers with 4 or more trips are labeled 1/yes in the often column, and customers with fewer than 4 are labeled 0/no.
travel <- travel %>%
mutate(often = ifelse(NumberOfTrips>=4, 1, 0),
often= as.factor(often))
# travel_a <- travel %>%
# mutate(often = ifelse(NumberOfTrips<5, 0, 1),
# often = as.factor(often))
We want to see the mean monthly income of our customers in the two groups (travels often vs. rarely).
data.frame(aggregate(MonthlyIncome ~ often, travel, mean)) %>%
ggplot(aes(x = often, y = MonthlyIncome)) +
geom_col()
data.frame(aggregate(MonthlyIncome ~ ProdTaken, travel, mean)) %>%
ggplot(aes(x = ProdTaken, y = MonthlyIncome)) +
geom_col()
Based on the historical data, people who travel often have a higher mean income. But in our most recent records, customers who rejected our product have a higher mean monthly income than those who took it.
trav_tab <- data.frame(table(travel$ProdTaken, travel$ProductPitched))
ggplot(trav_tab,
aes(x = Var2,
y = Freq,
fill = Var1)) +
geom_col(aes(fill = Var1)) +
labs(x = "Package",
y = "Number of Customer",
fill = "Product's Taken Status") The most package offer to customer are Basic, Deluxe and Standard. How the probability customer take our product? It can count by total customer who received product and total customer get offer.
table(travel$ProdTaken)
##
## 0 1
## 3331 797
797/3331*100
## [1] 23.92675
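Strictly, 797/3331 is the ratio of takers to decliners (the odds); the take rate over everyone offered is a bit lower. A minimal sketch of the complementary view:
# Share of each ProdTaken class among all offered customers
prop.table(table(travel$ProdTaken)) # take rate = 797/(3331 + 797) ~ 0.193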
With odds this low, having the marketing department call every single customer would waste a great deal of time. We will help them target better.
The bar chart of monthly income by often status tells us that people who travel often have a higher mean income than those who rarely travel, though the difference is not large. To target more specifically, we look at monthly income within the "often" group.
often_trav <- travel %>%
filter(often == 1)
income <- aggregate(MonthlyIncome ~ Occupation+Designation, data = often_trav, FUN = mean)
income <- income[order(income$MonthlyIncome, decreasing = T),]
income$newcol <- paste(income$Designation, income$Occupation, sep=" at ")
ggplot(income, aes(x = MonthlyIncome, y = reorder(newcol, MonthlyIncome))) +
geom_col() +
labs(x = "Monthly Income",
y = "Position")We know that VP at Salaried (Salaried means government employees) is the most high monthly income and freelancer is the lowest monthly income in “often” group status. We want to make model to optimize our marketing to promote a new holiday packet into the right niche.
Split the data into training and test sets.
set.seed(101)
index <- initial_split(data = travel, prop = 0.7, strata = "often")
travel_train <- training(index)
travel_test <- testing(index)
Check the proportion of our target (the often column).
prop.table(table(travel_train$often))
##
## 0 1
## 0.6606648 0.3393352
Our training data is imbalanced, but we will try building a model anyway.
We start with a first logistic regression model using our selected predictors.
model_1_1 <- glm(often ~ MonthlyIncome + Designation + OwnCar + Passport + PreferredPropertyStar + DurationOfPitch + CityTier + Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore, data = travel_train, family = binomial("logit"))
summary(model_1_1)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## Passport + PreferredPropertyStar + DurationOfPitch + CityTier +
## Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore,
## family = binomial("logit"), data = travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6439 -0.8947 -0.6853 1.2043 2.1138
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7103377 0.5856832 -6.335 2.37e-10 ***
## MonthlyIncome 0.0000145 0.0000158 0.917 0.35891
## DesignationExecutive 0.0509378 0.2564586 0.199 0.84256
## DesignationManager 0.3242970 0.2226897 1.456 0.14532
## DesignationSenior Manager 0.0339347 0.2012344 0.169 0.86609
## DesignationVP -0.6287245 0.3083987 -2.039 0.04148 *
## OwnCar1 -0.2541194 0.0848539 -2.995 0.00275 **
## Passport1 0.0393157 0.0904823 0.435 0.66392
## PreferredPropertyStar 0.0511770 0.0519095 0.986 0.32419
## DurationOfPitch -0.0039810 0.0050261 -0.792 0.42832
## CityTier -0.0835885 0.0471103 -1.774 0.07601 .
## Age 0.0667525 0.0053207 12.546 < 2e-16 ***
## NumberOfFollowups 0.0017476 0.0442917 0.039 0.96853
## NumberOfChildrenVisiting 0.2189515 0.0528240 4.145 3.40e-05 ***
## PitchSatisfactionScore -0.0219550 0.0302063 -0.727 0.46733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3700.1 on 2887 degrees of freedom
## Residual deviance: 3433.3 on 2873 degrees of freedom
## AIC: 3463.3
##
## Number of Fisher Scoring iterations: 4
travel_test$pred_back_1_1 <- predict(model_1_1, newdata = travel_test, type = "response")
travel_test$pred.Label_1_1 <- ifelse(travel_test$pred_back_1_1 > 0.5, "1", "0")
travel_test$pred.Label_1_1 <- as.factor(travel_test$pred.Label_1_1)
confusionMatrix(data = travel_test$pred.Label_1_1,
reference = travel_test$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 719 308
## 1 100 113
##
## Accuracy : 0.671
## 95% CI : (0.644, 0.6971)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.2272
##
## Kappa : 0.1663
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.26841
## Specificity : 0.87790
## Pos Pred Value : 0.53052
## Neg Pred Value : 0.70010
## Prevalence : 0.33952
## Detection Rate : 0.09113
## Detection Prevalence : 0.17177
## Balanced Accuracy : 0.57315
##
## 'Positive' Class : 1
##
The sensitivity score is poor, so we try stepwise backward selection.
model_back_1 <- step(object = model_1_1, direction = "backward", trace = F)
summary(model_back_1)
##
## Call:
## glm(formula = often ~ Designation + OwnCar + CityTier + Age +
## NumberOfChildrenVisiting, family = binomial("logit"), data = travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6496 -0.8926 -0.6844 1.2106 2.0843
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.245982 0.324044 -10.017 < 2e-16 ***
## DesignationExecutive -0.086103 0.187497 -0.459 0.64607
## DesignationManager 0.221097 0.178865 1.236 0.21642
## DesignationSenior Manager -0.012411 0.185827 -0.067 0.94675
## DesignationVP -0.557811 0.301761 -1.849 0.06453 .
## OwnCar1 -0.254627 0.084617 -3.009 0.00262 **
## CityTier -0.082719 0.046999 -1.760 0.07840 .
## Age 0.067198 0.005287 12.711 < 2e-16 ***
## NumberOfChildrenVisiting 0.238196 0.048743 4.887 1.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3700.1 on 2887 degrees of freedom
## Residual deviance: 3436.5 on 2879 degrees of freedom
## AIC: 3454.5
##
## Number of Fisher Scoring iterations: 4
travel_test$pred_back_1 <- predict(model_back_1, newdata = travel_test, type = "response")
travel_test$pred.Label_1 <- ifelse(travel_test$pred_back_1 > 0.5, "1", "0")
travel_test$pred.Label_1 <- as.factor(travel_test$pred.Label_1)
confusionMatrix(data = travel_test$pred.Label_1,
reference = travel_test$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 719 310
## 1 100 111
##
## Accuracy : 0.6694
## 95% CI : (0.6424, 0.6955)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.2651
##
## Kappa : 0.1611
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.26366
## Specificity : 0.87790
## Pos Pred Value : 0.52607
## Neg Pred Value : 0.69874
## Prevalence : 0.33952
## Detection Rate : 0.08952
## Detection Prevalence : 0.17016
## Balanced Accuracy : 0.57078
##
## 'Positive' Class : 1
##
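Stepwise selection barely moved the needle. One cheap lever before changing model families is the 0.5 cutoff itself: lowering it trades specificity for sensitivity. A minimal sketch reusing the stepwise predictions already computed above (the 0.35 cutoff is an arbitrary illustration, not a tuned value):
# Re-label the stepwise predictions with a lower cutoff to favor sensitivity
pred.Label_lowcut <- factor(ifelse(travel_test$pred_back_1 > 0.35, "1", "0"),
                            levels = c("0", "1"))
confusionMatrix(data = pred.Label_lowcut,
                reference = travel_test$often,
                positive = "1")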
Even so, our model still has a poor sensitivity score at the default cutoff. Let's try a KNN model.
KNN needs features on a common scale, so we normalize the numeric columns with min-max scaling. (Strictly, the test set should be scaled with the training set's minima and maxima; here each set is normalized on its own for simplicity.)
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}
travel_train_norm <- travel_train %>%
mutate_if(is.numeric, normalize)
travel_test_norm <- travel_test %>%
mutate_if(is.numeric, normalize)
travel_train_norm <- travel_train_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar")) %>%
mutate(often = as.factor(often))
travel_test_norm <- travel_test_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar", "pred_back_1_1", "pred.Label_1_1", "pred_back_1", "pred.Label_1")) %>%
mutate(often = as.factor(often))
anyNA(travel_test_norm)
## [1] FALSE
anyNA(travel_train_norm)
## [1] FALSE
travel_train_norm_x <- travel_train_norm %>% select(-often)
travel_test_norm_x <- travel_test_norm %>% select(-often)
travel_train_norm_y <- travel_train_norm$often
travel_test_norm_y <- travel_test_norm$often
sqrt(nrow(travel_train_norm))
## [1] 53.74012
Our target has two classes, so we pick an odd k; we use 53, roughly the square root of the number of training rows.
travel_pred_c <- knn(train = travel_train_norm_x, test = travel_test_norm_x, k = 53, cl = travel_train_norm_y)
confusionMatrix(travel_pred_c, reference = as.factor(travel_test_norm_y), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 725 96
## 1 94 325
##
## Accuracy : 0.8468
## 95% CI : (0.8255, 0.8664)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.658
##
## Mcnemar's Test P-Value : 0.9422
##
## Sensitivity : 0.7720
## Specificity : 0.8852
## Pos Pred Value : 0.7757
## Neg Pred Value : 0.8831
## Prevalence : 0.3395
## Detection Rate : 0.2621
## Detection Prevalence : 0.3379
## Balanced Accuracy : 0.8286
##
## 'Positive' Class : 1
##
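As an aside, k = 53 came from the sqrt(n) heuristic. k could instead be tuned by cross-validation; a minimal sketch with caret's train(), where the grid values are illustrative:
# 5-fold cross-validation over a small grid of odd k values
set.seed(101)
knn_cv <- train(x = travel_train_norm_x,
                y = travel_train_norm_y,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(5, 75, by = 10)))
knn_cv$bestTune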
The KNN model performs much better than logistic regression, but the training data is imbalanced, so it is worth trying to optimize further. Several resampling methods can help; a simple one is sketched below.
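For instance, random over-sampling of the minority class can be done with caret's upSample(), which simply duplicates minority rows until the classes balance (a sketch; travel_train is the training set created above):
# Randomly duplicate minority-class rows until both classes are equal in size
travel_train_up <- upSample(x = travel_train %>% select(-often),
                            y = travel_train$often,
                            yname = "often")
table(travel_train_up$often)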
Here we instead use ROSE to balance the data, since the imbalance may be the cause of the poor models. We apply ROSE only to the training set, so the unseen test data keeps its original, imbalanced distribution.
travel_train_rose <- training(index)
travel_test_rose <- testing(index)
new_travel_train <- ROSE(often ~ ., data = travel_train_rose, seed = 101)$data
table(new_travel_train$often)
##
## 0 1
## 1442 1446
model_1 <- glm(often ~ MonthlyIncome + Designation + OwnCar + Passport + PreferredPropertyStar + DurationOfPitch + CityTier + Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore + NumberOfTrips, data = new_travel_train, family = binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_1)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## Passport + PreferredPropertyStar + DurationOfPitch + CityTier +
## Age + NumberOfFollowups + NumberOfChildrenVisiting + PitchSatisfactionScore +
## NumberOfTrips, family = binomial("logit"), data = new_travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5233 -0.1670 0.0000 0.0556 3.4223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.117e+01 1.206e+00 -9.265 < 2e-16 ***
## MonthlyIncome -4.211e-05 2.540e-05 -1.658 0.09727 .
## DesignationExecutive -9.591e-01 4.943e-01 -1.940 0.05236 .
## DesignationManager -2.395e-01 4.445e-01 -0.539 0.59001
## DesignationSenior Manager -1.781e-01 4.216e-01 -0.423 0.67266
## DesignationVP 6.786e-02 6.993e-01 0.097 0.92269
## OwnCar1 -5.584e-01 1.791e-01 -3.118 0.00182 **
## Passport1 2.075e-02 1.940e-01 0.107 0.91485
## PreferredPropertyStar 1.674e-01 9.790e-02 1.710 0.08729 .
## DurationOfPitch -1.287e-02 9.334e-03 -1.379 0.16797
## CityTier -9.858e-02 8.701e-02 -1.133 0.25725
## Age 5.821e-02 9.717e-03 5.990 2.09e-09 ***
## NumberOfFollowups -1.189e-01 7.969e-02 -1.491 0.13585
## NumberOfChildrenVisiting -7.607e-02 9.242e-02 -0.823 0.41049
## PitchSatisfactionScore -7.665e-02 5.665e-02 -1.353 0.17608
## NumberOfTrips 3.310e+00 1.609e-01 20.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4003.61 on 2887 degrees of freedom
## Residual deviance: 894.58 on 2872 degrees of freedom
## AIC: 926.58
##
## Number of Fisher Scoring iterations: 8
model_back <- step(object = model_1, direction = "backward", trace = F)
summary(model_back)
##
## Call:
## glm(formula = often ~ MonthlyIncome + Designation + OwnCar +
## PreferredPropertyStar + Age + NumberOfFollowups + NumberOfTrips,
## family = binomial("logit"), data = new_travel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6039 -0.1671 0.0000 0.0560 3.5100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.167e+01 1.154e+00 -10.107 < 2e-16 ***
## MonthlyIncome -4.820e-05 2.460e-05 -1.959 0.050086 .
## DesignationExecutive -9.992e-01 4.769e-01 -2.095 0.036163 *
## DesignationManager -3.276e-01 4.308e-01 -0.760 0.446963
## DesignationSenior Manager -1.930e-01 4.140e-01 -0.466 0.641090
## DesignationVP 1.446e-01 6.963e-01 0.208 0.835465
## OwnCar1 -5.947e-01 1.773e-01 -3.353 0.000799 ***
## PreferredPropertyStar 1.668e-01 9.670e-02 1.725 0.084546 .
## Age 5.913e-02 9.593e-03 6.163 7.12e-10 ***
## NumberOfFollowups -1.227e-01 7.817e-02 -1.569 0.116627
## NumberOfTrips 3.311e+00 1.608e-01 20.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4003.61 on 2887 degrees of freedom
## Residual deviance: 900.32 on 2877 degrees of freedom
## AIC: 922.32
##
## Number of Fisher Scoring iterations: 8
travel_test_rose$pred_back <- predict(model_back, newdata = travel_test_rose, type = "response")
travel_test_rose$pred.Label <- ifelse(travel_test_rose$pred_back > 0.5, "1", "0")
travel_test_rose$pred.Label <- as.factor(travel_test_rose$pred.Label)
confusionMatrix(data = travel_test_rose$pred.Label,
reference = travel_test_rose$often,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 816 0
## 1 3 421
##
## Accuracy : 0.9976
## 95% CI : (0.9929, 0.9995)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9946
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9963
## Pos Pred Value : 0.9929
## Neg Pred Value : 1.0000
## Prevalence : 0.3395
## Detection Rate : 0.3395
## Detection Prevalence : 0.3419
## Balanced Accuracy : 0.9982
##
## 'Positive' Class : 1
##
This model's sensitivity score is far better than our previous models', which is what we want: recall matters here because we aim to capture as many actual "yes" customers in often as possible, so we can offer the new product to customers whose trip history shows loyalty to our packages. But we must recognize that the model is almost certainly overfitting through leakage: often was derived directly from NumberOfTrips, and NumberOfTrips is included as a predictor (note its dominant coefficient above), so the near-perfect scores largely reflect the model rediscovering our labeling rule.
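One way to test that suspicion is to refit the ROSE model without NumberOfTrips and see how far performance drops; a minimal sketch reusing the objects above (the predictor subset is illustrative):
# Refit on the ROSE-balanced data without the leaking predictor
model_noleak <- glm(often ~ MonthlyIncome + Designation + OwnCar + Age +
                      NumberOfChildrenVisiting,
                    data = new_travel_train, family = binomial("logit"))
pred_noleak <- factor(ifelse(predict(model_noleak, newdata = travel_test_rose,
                                     type = "response") > 0.5, "1", "0"),
                      levels = c("0", "1"))
confusionMatrix(pred_noleak, reference = travel_test_rose$often, positive = "1")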
new_travel_train_norm <- new_travel_train %>%
mutate_if(is.numeric, normalize)
new_travel_test_norm <- travel_test_rose %>%
mutate_if(is.numeric, normalize)
new_travel_train_norm <- new_travel_train_norm %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar")) %>%
mutate(often = ifelse(often == 1, "Yes", "No"))
new_travel_test_norm <- new_travel_test_norm %>%
mutate(often = ifelse(often == 1, "Yes", "No")) %>%
select(-c("TypeofContact", "Occupation", "Gender", "ProductPitched", "MaritalStatus", "Designation", "Passport", "OwnCar", "pred_back", "pred.Label"))new_travel_train_norm_x <- new_travel_train_norm %>% select(-often)
new_travel_test_norm_x <- new_travel_test_norm %>% select(-often)new_travel_train_norm_y <- new_travel_train_norm$often
new_travel_test_norm_y <- new_travel_test_norm$oftensqrt(nrow(new_travel_train_norm))## [1] 53.74012
We use k = 53 again.
new_travel_pred_c <- knn(train = new_travel_train_norm_x, test = new_travel_test_norm_x, k = 53, cl = new_travel_train_norm_y)
confusionMatrix(new_travel_pred_c, reference = as.factor(new_travel_test_norm_y), positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 463 7
## Yes 356 414
##
## Accuracy : 0.7073
## 95% CI : (0.6811, 0.7325)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : 0.0002443
##
## Kappa : 0.4567
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9834
## Specificity : 0.5653
## Pos Pred Value : 0.5377
## Neg Pred Value : 0.9851
## Prevalence : 0.3395
## Detection Rate : 0.3339
## Detection Prevalence : 0.6210
## Balanced Accuracy : 0.7743
##
## 'Positive' Class : Yes
##
For the KNN model on ROSE-balanced data, the sensitivity score is lower than that of the stepwise logistic model, but this model does not appear to overfit the data. It still has good sensitivity, precision, and accuracy compared with every other model except the ROSE-tuned stepwise model.
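To recap the test-set results reported above:
Model                              Accuracy   Sensitivity   Specificity
Logistic regression (original)     0.6710     0.2684        0.8779
Stepwise logistic (original)       0.6694     0.2637        0.8779
KNN, k = 53 (original)             0.8468     0.7720        0.8852
Stepwise logistic (ROSE)           0.9976     1.0000        0.9963
KNN, k = 53 (ROSE)                 0.7073     0.9834        0.5653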
In summary, our data is imbalanced. On the original data, logistic regression gives poor sensitivity while KNN fares much better; sensitivity matters here because we want to capture as many customers as possible who often purchase our travel packages. We can improve the models with ROSE (Random Over-Sampling Examples), which balances the data by generating new synthetic examples via a smoothed bootstrap approach. Applying ROSE improved the sensitivity of logistic regression but decreased the sensitivity of KNN. The ROSE stepwise model has excellent accuracy, but as noted above it is likely overfitting through the NumberOfTrips leak. Still, the stepwise model on ROSE data suggests that the predictors influencing the target are age, number of trips, and car ownership.