WQD7004 - Programming for Data Science

Title: Travel Package Prediction

Group 3

Group Members:

Abdullah Bin Luqman (S2186792)
Navaneeta A/P P Shanmugam (S2192763)
Ruzana Binti Mohamed Aris (S2182858)
Intan Nor Qamarina Binti Yunus (S2180929)

Recording Link:

https://365umedumy-my.sharepoint.com/:v:/r/personal/s2186792_siswa365_um_edu_my/Documents/Full%20Video.mp4?csf=1&web=1&e=rGSdTq

1. Intro:

Background

As internet travel information has grown significantly nowadays, it has become more difficult for travellers to choose from the various travel packages accessible to meet their practical needs. On the other hand, in order to increase sales and profits, travel agencies must be aware of the interests of various travellers and offer more attractive packages. Hence, in this project, we are going to predict the travel package that will be purchased by customer so it will be easier to get customer’s travel preferences and to know the characteristics of potential customers who will buy travel packages.

This project was obtained from Kaggle dataset and based on following the scenario given that in order to increase its customer base, a company intends to enable and develop a workable business strategy for travel package options. The organisation currently offers 5 different sorts of packages: Basic, Standard, Deluxe, and Super Deluxe. Only 18% of clients purchased packages last year, according to the report. However, because clients were contacted at random and without considering the available data, the cost of marketing was quite expensive. To tackle this problem and to attract more customers, the company plans to launch a new package, namely the Wellness Tourism Package. This new package enables the traveller to sustain and improve a healthy lifestyle and well-being. To increase the effectiveness of its marketing spending, the company this time intends to make use of the existing data on both current and potential clients. Thus, our group project task has to analyse customer data and information in order to make recommendations to the marketing team and to develop a model that would forecast which customers will buy the newly presented ‘Wellness Tourism Package’.

Flowchart

Below shows the flowchart that summarize the pipeline of this project.

## Loading required package: DiagrammeR

Data Dictionary

The dataset used is a csv file originating from Kaggle. The data consists of 4888 rows and 20 columns (14 numeric columns and 6 categorical columns). Data description are as follows:

Profile Customer:
- CustomerID: Unique customer ID
- Age: Age of customer
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e.  Tier 1 > Tier 2 > Tier 3
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: The customer has a passport or not (0: No, 1: Yes)
- OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- MonthlyIncome: Gross monthly income of the customer
Information of Pitch:
- TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
- NumberOfFollowups: Total number of follow-ups has been done by the salesperson after the sales pitch
- DurationOfPitch: Duration of the pitch by a salesperson to the customer
- ProductPitched: Product pitched by the salesperson
- PitchSatisfactionScore: Sales pitch satisfaction score
Target:
- ProdTaken: Product taken or not (0: No, 1: Yes)

Problem Statement

Which classification model is suitable for this dataset?
Which regression model is suitable for this dataset?
What is the best performing travel package prediction model?

Objectives

To predict which customer will purchase the newly introduced package using classification and regression model.
To compare the performance of both models
To identify which features or variables of potential customers who will buy new travel packages

2. Data Understanding:

##To read the csv file
df<-read.csv("Travel.csv")

##To view it like excel
View(df)

##To see the summary
summary(df)

##    CustomerID       ProdTaken           Age        TypeofContact     
##  Min.   :200000   Min.   :0.0000   Min.   :18.00   Length:4888       
##  1st Qu.:201222   1st Qu.:0.0000   1st Qu.:31.00   Class :character  
##  Median :202444   Median :0.0000   Median :36.00   Mode  :character  
##  Mean   :202444   Mean   :0.1882   Mean   :37.62                     
##  3rd Qu.:203665   3rd Qu.:0.0000   3rd Qu.:44.00                     
##  Max.   :204887   Max.   :1.0000   Max.   :61.00                     
##                                    NA's   :226                       
##     CityTier     DurationOfPitch   Occupation           Gender         
##  Min.   :1.000   Min.   :  5.00   Length:4888        Length:4888       
##  1st Qu.:1.000   1st Qu.:  9.00   Class :character   Class :character  
##  Median :1.000   Median : 13.00   Mode  :character   Mode  :character  
##  Mean   :1.654   Mean   : 15.49                                        
##  3rd Qu.:3.000   3rd Qu.: 20.00                                        
##  Max.   :3.000   Max.   :127.00                                        
##                  NA's   :251                                           
##  NumberOfPersonVisiting NumberOfFollowups ProductPitched    
##  Min.   :1.000          Min.   :1.000     Length:4888       
##  1st Qu.:2.000          1st Qu.:3.000     Class :character  
##  Median :3.000          Median :4.000     Mode  :character  
##  Mean   :2.905          Mean   :3.708                       
##  3rd Qu.:3.000          3rd Qu.:4.000                       
##  Max.   :5.000          Max.   :6.000                       
##                         NA's   :45                          
##  PreferredPropertyStar MaritalStatus      NumberOfTrips       Passport     
##  Min.   :3.000         Length:4888        Min.   : 1.000   Min.   :0.0000  
##  1st Qu.:3.000         Class :character   1st Qu.: 2.000   1st Qu.:0.0000  
##  Median :3.000         Mode  :character   Median : 3.000   Median :0.0000  
##  Mean   :3.581                            Mean   : 3.237   Mean   :0.2909  
##  3rd Qu.:4.000                            3rd Qu.: 4.000   3rd Qu.:1.0000  
##  Max.   :5.000                            Max.   :22.000   Max.   :1.0000  
##  NA's   :26                               NA's   :140                      
##  PitchSatisfactionScore     OwnCar       NumberOfChildrenVisiting
##  Min.   :1.000          Min.   :0.0000   Min.   :0.000           
##  1st Qu.:2.000          1st Qu.:0.0000   1st Qu.:1.000           
##  Median :3.000          Median :1.0000   Median :1.000           
##  Mean   :3.078          Mean   :0.6203   Mean   :1.187           
##  3rd Qu.:4.000          3rd Qu.:1.0000   3rd Qu.:2.000           
##  Max.   :5.000          Max.   :1.0000   Max.   :3.000           
##                                          NA's   :66              
##  Designation        MonthlyIncome  
##  Length:4888        Min.   : 1000  
##  Class :character   1st Qu.:20346  
##  Mode  :character   Median :22347  
##                     Mean   :23620  
##                     3rd Qu.:25571  
##                     Max.   :98678  
##                     NA's   :233

##To check the duplication
#unique(df)
df[duplicated(df)]

## data frame with 0 columns and 4888 rows

3. Data Cleaning:

The data cleaning process that is done in this project including checking data duplication, checking the mean, imputing missing values, removing blanks value, replace inconsistent values, remove outlier, drops unnecessary columns, and transform the data type.

##To impute the missing values using mean/median
df$Age[is.na(df$Age)]<-median(df$Age, na.rm = TRUE)
df$DurationOfPitch[is.na(df$DurationOfPitch)]<-median(df$DurationOfPitch, na.rm = TRUE)
df$NumberOfFollowups[is.na(df$NumberOfFollowups)]<-median(df$NumberOfFollowups, na.rm = TRUE)
df$PreferredPropertyStar[is.na(df$PreferredPropertyStar)]<-median(df$PreferredPropertyStar, na.rm = TRUE)
df$NumberOfTrips[is.na(df$NumberOfTrips)]<-median(df$NumberOfTrips, na.rm = TRUE)
df$NumberOfChildrenVisiting[is.na(df$NumberOfChildrenVisiting)]<-median(df$NumberOfChildrenVisiting, na.rm = TRUE)
df$MonthlyIncome[is.na(df$MonthlyIncome)]<-mean(df$MonthlyIncome, na.rm = TRUE)

##To check the missing values of characters
sum(is.na(df$TypeofContact))

## [1] 0

sum(is.na(df$Occupation))

## [1] 0

sum(is.na(df$Gender))

## [1] 0

sum(is.na(df$ProductPitched))

## [1] 0

sum(is.na(df$MaritalStatus))

## [1] 0

sum(is.na(df$Designation))

## [1] 0

sum(is.na(df))

## [1] 0

#To classify the variables
unique(df['CityTier'])

##    CityTier
## 1         3
## 2         1
## 80        2

unique(df['Occupation'])

##        Occupation
## 1        Salaried
## 3     Free Lancer
## 5  Small Business
## 34 Large Business

unique(df['Gender'])

##     Gender
## 1   Female
## 2     Male
## 23 Fe Male

##To check for blanks
sum(df$TypeofContact=="")

## [1] 25

#Remove blanks in Type of Contact
df <- df[!(df$TypeofContact == ""), ]

#Replace "Fe Male" with "Female"
sum(df$Gender=="Fe Male")

## [1] 155

sum(df$Gender=="Female")

## [1] 1807

df$Gender[df$Gender == "Fe Male"] <- "Female"

#Replace "Unmarried" with "Single"
sum(df$MaritalStatus=="Unmarried")

## [1] 682

sum(df$MaritalStatus=="Single")

## [1] 912

df$MaritalStatus[df$MaritalStatus == "Unmarried"] <- "Single"

#Remove Monthly Income less than 16000
df <- df[!(df$MonthlyIncome < 16000), ]

#To drop the columns related to customer interactions
df <- subset(df, select = -c(DurationOfPitch,NumberOfFollowups,ProductPitched,PitchSatisfactionScore))

##To transform the character data to numeric
df$TypeofContact[df$TypeofContact == 'Self Enquiry'] <- '1'
df$TypeofContact[df$TypeofContact == 'Company Invited'] <- '2'
df$TypeofContact<-as.numeric(df$TypeofContact)

df$Occupation[df$Occupation == 'Free Lancer'] <- '1'
df$Occupation[df$Occupation == 'Large Business'] <- '2'
df$Occupation[df$Occupation == 'Salaried'] <- '3'
df$Occupation[df$Occupation == 'Small Business'] <- '4'
df$Occupation<-as.numeric(df$Occupation)

df$Gender[df$Gender == 'Female'] <- '1'
df$Gender[df$Gender == 'Male'] <- '2'
df$Gender<-as.numeric(df$Gender)

df$MaritalStatus[df$MaritalStatus == 'Single'] <- '1'
df$MaritalStatus[df$MaritalStatus == 'Divorced'] <- '2'
df$MaritalStatus[df$MaritalStatus == 'Married'] <- '3'
df$MaritalStatus<-as.numeric(df$MaritalStatus)

df$Designation[df$Designation == 'Executive'] <- '1'
df$Designation[df$Designation == 'Manager'] <- '2'
df$Designation[df$Designation == 'Senior Manager'] <- '3'
df$Designation[df$Designation == 'AVP'] <- '4'
df$Designation[df$Designation == 'VP'] <- '5'
df$Designation
df$Designation<-as.numeric(df$Designation)

4. Data Exploratory:

Univariate, Bivariate and Multivariate Analysis

For Data Exploratory Analysis the box plot and the histogram were being created to analyse deeper into the data

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: reshape

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:dplyr':
## 
##     rename

## Loading required package: ggplot2

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$Age , horizontal=TRUE , ylim=c(0,80), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Age")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$Age , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,80), ylim = c(0,500))

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$DurationOfPitch , horizontal=TRUE , ylim=c(0,50), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Duration Of Pitch")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$DurationOfPitch , breaks=100 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,50), ylim = c(0,600))

#Spread of Data for NumberOfPersonVisiting

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfPersonVisiting , horizontal=TRUE , ylim=c(0,10), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Person Visiting")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfPersonVisiting , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,10), ylim = c(0,600))

#Spread of Data for NumberOfFollowups

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfFollowups , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Followups")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfFollowups , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,600))

#Spread of Data for PreferredPropertyStar

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$PreferredPropertyStar , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Preferred Property Star")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$PreferredPropertyStar , breaks=3 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,3500))

#Spread of Data for NumberOfTrips

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfTrips , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Trips")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfTrips , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,1500))

#Spread of Data for PitchSatisfactionScore

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$PitchSatisfactionScore , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Pitch Satisfaction Score")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$PitchSatisfactionScore , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,1500))

#Spread of Data for NumberOfChildrenVisiting

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfChildrenVisiting , horizontal=TRUE , ylim=c(0,4), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Children Visiting")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfChildrenVisiting , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,4), ylim = c(0,1800))

#Spread of Data for MonthlyIncome

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$MonthlyIncome , horizontal=TRUE , ylim=c(0,100000), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Monthly Income")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$MonthlyIncome , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,100000), ylim = c(0,1300))

Numvar <- select(df,ProdTaken,Age,CityTier,NumberOfPersonVisiting,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,
                 PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation, MonthlyIncome)
res <- round(cor(Numvar),2)
head(res)

##                        ProdTaken   Age CityTier NumberOfPersonVisiting
## ProdTaken                   1.00 -0.14     0.09                   0.01
## Age                        -0.14  1.00    -0.01                   0.02
## CityTier                    0.09 -0.01     1.00                   0.00
## NumberOfPersonVisiting      0.01  0.02     0.00                   1.00
## NumberOfFollowups           0.11  0.00     0.02                   0.33
## PreferredPropertyStar       0.10 -0.02    -0.01                   0.03
##                        NumberOfFollowups PreferredPropertyStar NumberOfTrips
## ProdTaken                           0.11                  0.10          0.02
## Age                                 0.00                 -0.02          0.17
## CityTier                            0.02                 -0.01         -0.03
## NumberOfPersonVisiting              0.33                  0.03          0.19
## NumberOfFollowups                   1.00                 -0.03          0.14
## PreferredPropertyStar              -0.03                  1.00          0.01
##                        Passport PitchSatisfactionScore OwnCar
## ProdTaken                  0.26                   0.05  -0.01
## Age                        0.03                   0.02   0.05
## CityTier                   0.00                  -0.04   0.00
## NumberOfPersonVisiting     0.01                  -0.02   0.01
## NumberOfFollowups          0.00                   0.00   0.01
## PreferredPropertyStar      0.00                  -0.02   0.01
##                        NumberOfChildrenVisiting Designation MonthlyIncome
## ProdTaken                                  0.01       -0.18         -0.13
## Age                                        0.01        0.50          0.46
## CityTier                                   0.00        0.12          0.05
## NumberOfPersonVisiting                     0.61       -0.03          0.19
## NumberOfFollowups                          0.28        0.01          0.17
## PreferredPropertyStar                      0.03       -0.01          0.00

melted_corr_mat <- melt(res)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

head(melted_corr_mat)

##                       X1        X2 value
## 1              ProdTaken ProdTaken  1.00
## 2                    Age ProdTaken -0.14
## 3               CityTier ProdTaken  0.09
## 4 NumberOfPersonVisiting ProdTaken  0.01
## 5      NumberOfFollowups ProdTaken  0.11
## 6  PreferredPropertyStar ProdTaken  0.10

ggplot(data = melted_corr_mat, aes(x=X1, y=X2,fill=value)) + geom_tile() + 
  geom_text(aes(x = X2, y = X1, label = value),color = "black", size = 4)

EDA Observations Part 1:

The average number of visitors is usually 3, with one outlier at 5. The average number of follow-ups is 4, with an extreme value of 6. Age is distributed normally and does not contain any outliers. The duration of pitch is right-skewed and has some outliers greater than 120. These outliers should be investigated to determine if they need to be treated. The number of trips is right-skewed and has some outliers where the number of trips is greater than 17. Monthly income is right-skewed and has some outliers at the higher end.

Histogram Density Line & Boxplot for Taken and Not Taken Product

library(stats)
library(dplyr)
library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(reticulate)
library(readxl)

# Read data from excel #
df <- read_excel("travel_cleaned.xlsx")

## New names:
## • `` -> `...1`

# View the data set data #
View(df)
df = subset(df, select = -c(...1,CustomerID) )

# Histogram with density line--
List_col = list(2,4,7,8,10,11,12,13)
df$ProdTaken <- factor(df$ProdTaken)

Yes <- df %>% 
  filter(ProdTaken == 1) %>%
  select(Age,CityTier,
         NumberOfPersonVisiting,
         PreferredPropertyStar,
         NumberOfTrips,
         Passport,
         OwnCar,
         NumberOfChildrenVisiting,
         MonthlyIncome) 

No <- df %>% 
  filter(ProdTaken == 0) %>%
  select(Age,CityTier,
         NumberOfPersonVisiting,
         PreferredPropertyStar,
         NumberOfTrips,
         Passport,
         OwnCar,
         NumberOfChildrenVisiting,
         MonthlyIncome) 

Yes <- as.matrix(Yes)
No <- as.matrix(No)

list <- c("Age","CityTier",
          "NumberOfPersonVisiting",
          "PreferredPropertyStar",
          "NumberOfTrips",
          "Passport",
          "OwnCar",
          "NumberOfChildrenVisiting",
          "MonthlyIncome")
par(mfrow=c(2,2)) # Create two rows and two columns
par(bg="gray") # Manipulate background 
for(i in 1:9)
{
  hist(Yes[,i], prob = TRUE,
       main = paste("Distribution of ",(colnames(Yes[i])),
                    "\n of a customer who had Not taken Product"),
       xlab = paste(list[i]), cex.main=.6)
  lines(density(Yes[,i]), col = 1, lwd = 2)
  hist(No[,i], prob = TRUE,
       main = paste("Distribution of ",(colnames(No[i])),
                    "\n of a customer who had taken Product"),
       xlab = paste(list[i]), cex.main=.6)
  lines(density(No[,i]), col = 1, lwd = 2)
}

# create grouped boxplot 
List_col = list(2,4,7,8,10,11,12,13)
par(mfrow=c(2,2)) # Create two rows and two columns
par(bg="gray")
for(i in List_col)
{
  boxplot( df[[i]]~df[[1]],
           data=ToothGrowth,
           main=paste("Boxplot of", (colnames(df[i])), "\nw.r.t Product taken"),
           cex.main=.6,
           xlab="Product Taken",
           ylab=paste(colnames(df[i])),
           border="black"
  )
}

EDA Observations Part 2:

Customers who purchased the product are mostly in age range of 25 to 35
Number of person visiting distribution is slightly higher for customer who had not taken the product.
Number of trips feature has some outliers ranging of 17 to 20.
Customers who purchased product has monthly income in average of 18000 to 23000 is likely to purchase the travel package
Customers located in City Tier 1 and 3 are more interested in purchasing travel packages
The number of people visiting ranging of 2 to 4 are more likely to purchase travel packages
Customers who have a car are more likely to purchase travel packages

5. Data Modeling:

SVM Regression
Multiple Linear Regression
Decision Tree Classification
K-Nearest Neighbours

(A) SVM Regression

## 'data.frame':    4861 obs. of  10 variables:
##  $ ProdTaken               : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ Age                     : int  41 49 37 33 36 32 59 30 38 36 ...
##  $ CityTier                : int  3 1 1 1 1 1 1 1 1 1 ...
##  $ NumberOfPersonVisiting  : int  3 3 3 2 2 3 2 3 2 3 ...
##  $ PreferredPropertyStar   : int  3 4 3 3 4 3 5 3 3 3 ...
##  $ NumberOfTrips           : int  1 2 7 2 1 1 5 2 1 7 ...
##  $ Passport                : int  1 0 1 1 0 0 1 0 0 0 ...
##  $ OwnCar                  : int  1 1 0 1 1 1 1 0 0 1 ...
##  $ NumberOfChildrenVisiting: int  0 2 0 1 0 1 1 1 0 0 ...
##  $ MonthlyIncome           : num  20993 20130 17090 17909 18468 ...

## Loading required package: caret

## Loading required package: lattice

parts = createDataPartition(df$ProdTaken, p = .8, list= F)
train = df[parts,]
test = df[-parts,]

nrow(train)

## [1] 3889

head(train)

##    ProdTaken Age CityTier NumberOfPersonVisiting PreferredPropertyStar
## 1          1  41        3                      3                     3
## 2          0  49        1                      3                     4
## 3          1  37        1                      3                     3
## 7          0  59        1                      2                     5
## 9          0  38        1                      2                     3
## 10         0  36        1                      3                     3
##    NumberOfTrips Passport OwnCar NumberOfChildrenVisiting MonthlyIncome
## 1              1        1      1                        0         20993
## 2              2        0      1                        2         20130
## 3              7        1      0                        0         17090
## 7              5        1      1                        1         17670
## 9              1        0      0                        0         24526
## 10             7        0      1                        0         20237

nrow(test)

## [1] 972

# reverse the data frame
df <- df[nrow(df):1,]
# select the last 20% of ProdTaken from df
test1 <- df[1:floor(.2 * nrow(df)),]
# reverse the test data frame back to its original order
test1 <- test[nrow(test):1,]

train_control = trainControl(method = "cv", number = 5)
set.seed(50)

model = train(ProdTaken ~ ., data = train, method = "svmLinear", trControl = train_control)

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

print(model)

## Support Vector Machines with Linear Kernel 
## 
## 3889 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3111, 3111, 3112, 3111, 3111 
## Resampling results:
## 
##   RMSE       Rsquared    MAE      
##   0.4140002  0.03481166  0.2090933
## 
## Tuning parameter 'C' was held constant at a value of 1

#use model to make predictions on test data
pred_y = predict(model, test)
typeof(pred_y)

## [1] "double"

# performance metrics on the test data
test_y = test[, 1]
typeof(test_y)

## [1] "integer"

mean((test_y - pred_y)^2) #mse - Mean Squared Error

## [1] 0.1903521

caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error

## [1] 0.4362936

# load library ggplot2
library(ggplot2)

# create dataframe with actual and predicted values
plot_data <- data.frame(Predicted_value = pred_y,  
                        Observed_value = test_y)

# plot predicted values and actual values
ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "green")

Observation for SVM Regression:
- For the Support Vector Machine (SVM), the model’s performance was evaluated using the RMSE and MSE values. The RMSE value is 0.4172625 and the MSE value is 0.174108.

(B) Multiple Linear Regression

## 
## Attaching package: 'data.table'

## The following object is masked from 'package:reshape':
## 
##     melt

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

#To keep the related numeric values
df1<-df[ -c(1,2,5,7:8,11,16) ]

#To construct the initial model
model <- lm(ProdTaken ~ ., data = df1)
summary(model)

## 
## Call:
## lm(formula = ProdTaken ~ ., data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57634 -0.20747 -0.12224 -0.00655  1.08056 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.802e-01  4.258e-02   4.232 2.35e-05 ***
## Age                      -4.979e-03  6.668e-04  -7.467 9.70e-14 ***
## CityTier                  3.910e-02  5.778e-03   6.766 1.48e-11 ***
## NumberOfPersonVisiting    7.144e-03  9.285e-03   0.769  0.44167    
## PreferredPropertyStar     4.713e-02  6.634e-03   7.105 1.38e-12 ***
## NumberOfTrips             9.287e-03  3.006e-03   3.090  0.00201 ** 
## Passport                  2.276e-01  1.163e-02  19.572  < 2e-16 ***
## OwnCar                    3.755e-03  1.093e-02   0.344  0.73115    
## NumberOfChildrenVisiting  7.797e-04  7.853e-03   0.099  0.92092    
## MonthlyIncome            -6.702e-06  1.170e-06  -5.727 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3683 on 4851 degrees of freedom
## Multiple R-squared:  0.1158, Adjusted R-squared:  0.1142 
## F-statistic: 70.59 on 9 and 4851 DF,  p-value: < 2.2e-16

#To compute the Variance Inflation Factor of every predictor
library(caTools)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(model)

##                      Age                 CityTier   NumberOfPersonVisiting 
##                 1.321522                 1.005296                 1.623030 
##    PreferredPropertyStar            NumberOfTrips                 Passport 
##                 1.001892                 1.075632                 1.001989 
##                   OwnCar NumberOfChildrenVisiting            MonthlyIncome 
##                 1.007720                 1.605412                 1.351880

##A VIF of more than 10 is a concern

#Overview of the current model residuals
par(mfrow = c(2, 2))
plot(model)

#Required for RMSE and MAE commands
library(caret)

#To split the data randomly into a training set and a test set
set.seed(100)
n_train <- ceiling(0.8 * length(df1$ProdTaken))
train_sample <- sample(c(1:length(df1$ProdTaken)), n_train)
train_data <- df1[train_sample, ]
test_data <- df1[-train_sample, ]

#To fit the model on the training data
model3 <- lm(ProdTaken ~ ., data = train_data)
predictions <- predict(model3, test_data)

#To measure the performance by comparing the prediction with 
#the data using multiple criterion
RMSE <- RMSE(predictions, test_data$ProdTaken)
MAE <- MAE(predictions, test_data$ProdTaken)
print(c(RMSE, MAE))

## [1] 0.3814345 0.2825624

#To compute the prediction error
pred_error_rate <- RMSE / mean(test_data$ProdTaken)
pred_error_rate

## [1] 1.853772

R_sq <- 0
RMSE <- 0
MAE <- 0
#To run the validation 10 times
for(i in 1:10){
  
  n_train <- ceiling(0.8 * length(df1$ProdTaken))
  train_sample <- sample(c(1:length(df1$ProdTaken)), n_train)
  train_data <- df[train_sample, ]
  test_data <- df[-train_sample, ]
  
  model <- lm(ProdTaken ~ ., data = train_data)
  
  predictions <- predict(model, test_data,warning = FALSE)
  
  RMSE <- RMSE + RMSE(predictions, test_data$ProdTaken)
  MAE <- MAE + MAE(predictions, test_data$ProdTaken)
  
}

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading

RMSE = RMSE / 10
MAE = MAE / 10
print(c(RMSE, MAE))

## [1] 0.3562234 0.2663532

#To create create a data frame with actual and predicted values
plot_data <- data.frame(Predicted_value = predictions,  
                        Actual_value = test_data$ProdTaken)

#To plot predicted values and actual values
ggplot(plot_data, aes(x = Predicted_value, y = Actual_value)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "green")

Observation for Multiple Linear Regression:
- For the Multiple Linear Regression, the model’s performance was evaluated using the RMSE and MAE values. The RMSE value of the model is 0.3562234 and the MAE value is 0.2663532.

(C) Decision Tree Classification

## New names:
## • `` -> `...1`

library(rpart)
library(rpart.plot)
create_train_test <- function(df, size = 0.8, train = TRUE) {
  n_row = nrow(df)
  total_row = size * n_row
  #train_sample < - 1: total_row
  if (train == TRUE) {
    return (df[1:3500, ])
  } else {
    return (df[3500:n_row, ])
  }
}
data_train <- create_train_test(df, 0.8, train = TRUE)
data_test <- create_train_test(df, 0.8, train = FALSE)
dim(data_train)

## [1] 3500   15

dim(data_test)

## [1] 1362   15

prop.table(table(df$ProdTaken))

## 
##         0         1 
## 0.8113557 0.1886443

par(mfrow=c(1,1)) # Create two rows and two columns
par(bg="gray")
fit <- rpart(ProdTaken~., data = data_train, method = 'class') 
predict_unseen <-predict(fit, data_test, type = 'class')
table_mat <- table(data_test$ProdTaken, predict_unseen)
table_mat

##    predict_unseen
##        0    1
##   0 1045   40
##   1  184   93

accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))

## [1] "Accuracy for test 0.83553597650514"

Check Feature Importance Variable

#check importance variable #
importance <- as.data.frame(fit$variable.importance)
imp <- cbind(importance, row.names(importance))
colnames(imp) <- c("Importance","varnames")

library(ggplot2)

col = c("#B22222","#CD5C5C","#FF4500","#FFD700","#F0E68C",
                 "#556B2F","#ADFF2F","#98FB98","#008080")
                 ggplot(imp, aes(x = reorder(varnames,Importance), y=Importance)) +
                   geom_bar(stat="identity",color=col,fill=col) +
                   coord_flip() +
                   ggtitle("Importance Variable Feature") + 
                   ylab("Importance") +
                   xlab("Feature")

#Check for model accuracy again                  
model2 <- rpart(ProdTaken~Passport+
                  Designation+MonthlyIncome+
                  MaritalStatus+Age+CityTier, data=data_train,method="class")
predict_unseen2 <- predict(model2, data_test, type="class")
table_mat <- table(data_test$ProdTaken, predict_unseen2)
table_mat

##    predict_unseen2
##        0    1
##   0 1045   40
##   1  184   93

accuracy_Test <- sum(diag(table_mat)) / sum(table_mat) * 100
print(paste('Accuracy for test', accuracy_Test))

## [1] "Accuracy for test 83.553597650514"

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1045   40
##          1  184   93
##                                           
##                Accuracy : 0.8355          
##                  95% CI : (0.8148, 0.8548)
##     No Information Rate : 0.9023          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3706          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8503          
##             Specificity : 0.6992          
##          Pos Pred Value : 0.9631          
##          Neg Pred Value : 0.3357          
##              Prevalence : 0.9023          
##          Detection Rate : 0.7673          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.7748          
##                                           
##        'Positive' Class : 0               
##

Observation for Decision Tree Classification:
- Feature importance result shown that Passport, Designation, Monthly Income, Marital Status, Age and City Tier are the relevant variable.
- Based on the confusion matrix and statistics, The Sensitivity of the model is 0.8503, which means that the model correctly identifies 85% of the customers who actually took a travel package. The Specificity of the model is 0.6992, which means that the model correctly identifies 70% of the customers who actually did not take a travel package.
- The Positive Predictive Value (PPV) is 0.9631, which means that when the model predicts that a customer will take a travel package, 96% of the time it is correct. The Negative Predictive Value (NPV) is 0.3357, which means that when the model predicts that a customer will not take a travel package, 33.57% of the time it is correct.
- TN (True Negative) = 1045, which means the number of times the model correctly predicted that the customer did not take the product.
- TP (True Positive) = 93, which means the number of times the model correctly predicted that the customer took the product.
- FN (False Negative) = 184, which means the number of times the model incorrectly predicted that the customer did not take the product, when in fact they did.
- FP (False Positive) = 40, which means the number of times the model incorrectly predicted that the customer took the product, when in fact they did not.

(D) K-Nearest Neighbours (KNN)

KNN is a supervised leaning algorithm that works on classification problems. For this KNN, Euclidean distance is calculated and assigned as the K-value.

#load class package
library(class)

## 
## Attaching package: 'class'

## The following object is masked from 'package:reshape':
## 
##     condense

#Randomization with set.seed
set.seed(1000)
gp <- runif(nrow(df))
df<-df[order(gp),]

#str(df)
#head(df)
#head(df,10)
#str(df)
summary(df[,c(3:16)])

##       Age        TypeofContact      CityTier       Occupation   
##  Min.   :18.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:31.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :36.00   Median :1.000   Median :1.000   Median :3.000  
##  Mean   :37.59   Mean   :1.292   Mean   :1.655   Mean   :3.335  
##  3rd Qu.:43.00   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :61.00   Max.   :2.000   Max.   :3.000   Max.   :4.000  
##      Gender      NumberOfPersonVisiting PreferredPropertyStar MaritalStatus  
##  Min.   :1.000   Min.   :1.000          Min.   :3.000         Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000          1st Qu.:3.000         1st Qu.:1.000  
##  Median :2.000   Median :3.000          Median :3.000         Median :2.000  
##  Mean   :1.597   Mean   :2.908          Mean   :3.579         Mean   :2.151  
##  3rd Qu.:2.000   3rd Qu.:3.000          3rd Qu.:4.000         3rd Qu.:3.000  
##  Max.   :2.000   Max.   :5.000          Max.   :5.000         Max.   :3.000  
##  NumberOfTrips       Passport          OwnCar       NumberOfChildrenVisiting
##  Min.   : 1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000           
##  1st Qu.: 2.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000           
##  Median : 3.000   Median :0.0000   Median :1.0000   Median :1.000           
##  Mean   : 3.231   Mean   :0.2921   Mean   :0.6209   Mean   :1.188           
##  3rd Qu.: 4.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.000           
##  Max.   :22.000   Max.   :1.0000   Max.   :1.0000   Max.   :3.000           
##   Designation    MonthlyIncome  
##  Min.   :1.000   Min.   :16009  
##  1st Qu.:1.000   1st Qu.:20482  
##  Median :2.000   Median :22637  
##  Mean   :2.058   Mean   :23628  
##  3rd Qu.:3.000   3rd Qu.:25448  
##  Max.   :5.000   Max.   :98678

#normalize data set
normalize <- function(x){return((x-min(x))/(max(x)-min(x)))}
normalize(c(2:16))

##  [1] 0.00000000 0.07142857 0.14285714 0.21428571 0.28571429 0.35714286
##  [7] 0.42857143 0.50000000 0.57142857 0.64285714 0.71428571 0.78571429
## [13] 0.85714286 0.92857143 1.00000000

df_n<-as.data.frame(lapply(df[,c(3:16)], normalize))

#data splitting into 80% train data and 20% test data
df_train <- df_n[1:3889,]
df_test <- df_n[3890:4861,]
df_train_target <- df[1:3889,2]
df_test_target <- df[3890:4861,2]
require(class)

#using the Euclidean distance
sqrt(4861)

## [1] 69.72087

#train model using K-Nearest Neighbours
m1 <- knn(train=df_train, test=df_test, cl=df_train_target, k=70)
table(df_test_target,m1)

##               m1
## df_test_target   0   1
##              0 786   8
##              1 168  10

round(795/972,2)

## [1] 0.82

#Find the number of observation
NROW(df_train_target)

## [1] 3889

sqrt(NROW(df_train_target))

## [1] 62.36185

m1.62 <- knn(train=df_train, test=df_test, cl=df_train_target, k=62)
m1.63 <- knn(train=df_train, test=df_test, cl=df_train_target, k=63)

#Calculate the proportion of correct classification for k =62, 63
ACC.62 <- 100 * sum(df_test_target == m1.62)/NROW(df_test_target)
ACC.63 <- 100 * sum(df_test_target == m1.63)/NROW(df_test_target)

ACC.62

## [1] 82.20165

ACC.63

## [1] 82.30453

# Check prediction against actual value in tabular form for k=62
table(m1.62 ,df_test_target)

##      df_test_target
## m1.62   0   1
##     0 784 163
##     1  10  15

m1.62

##   [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##  [38] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [852] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [963] 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

round(795/972,2)

## [1] 0.82

table(m1.63 ,df_test_target)

##      df_test_target
## m1.63   0   1
##     0 785 163
##     1   9  15

m1.63

##   [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##  [38] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [852] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [963] 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

round(795/972,2)

## [1] 0.82

library(caret)

#Using Confusion Matrix as an evaluation technique
confusionMatrix(table(m1.62 ,df_test_target))

## Confusion Matrix and Statistics
## 
##      df_test_target
## m1.62   0   1
##     0 784 163
##     1  10  15
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.7965, 0.8456)
##     No Information Rate : 0.8169          
##     P-Value [Acc > NIR] : 0.3573          
##                                           
##                   Kappa : 0.1075          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98741         
##             Specificity : 0.08427         
##          Pos Pred Value : 0.82788         
##          Neg Pred Value : 0.60000         
##              Prevalence : 0.81687         
##          Detection Rate : 0.80658         
##    Detection Prevalence : 0.97428         
##       Balanced Accuracy : 0.53584         
##                                           
##        'Positive' Class : 0               
##

#Confusion Matrix
confusion_mat <- table(m1.62 ,df_test_target)
confusion_df <- data.frame(confusion_mat)
val<- c(784,163,10,15)
library(ggplot2)

ggplot(data = confusion_df, mapping = aes(x = TClass, y = PClass)) +
geom_tile(aes(fill = val), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", val)), vjust = 1, size=10, hjust=0.5) +
scale_fill_gradientn(values=c(1, .6, .5, .4, 0),colours=c("red", "#770000", "black", "#007777", "cyan")) +
theme_bw() + theme(legend.position = "none")+
ggtitle("Confusion Matrix") +
xlab("Predicted") +
ylab("Actual")

Observation for KNN Model:
- The K Nearest Neighbor model’s performance was also evaluated using the accuracy test which shows an accuracy of 82.20%.
- True Positive (TP) is 15: These are the observations that are predicted as taken (0) and are actually taken (1).
True Negative (TN) is 784 : These are the observations that are predicted as not taken (1) and are actually not taken (0).
False Positive (FP) is 10: These are the observations that are predicted as taken (1) but are actually not taken (1).
False Negative (FN) is 163: These are the observations that are predicted as not taken (0) but are actually taken (0).

Comparison for Regression Models:

Based on the RMSE values of both the models (SVM and Multiple Linear Regression), it was observed that the Support Vector Machine (SVM) model outperformed the Multiple Linear Regression Model (MLR). However, for this dataset the regression modelling is not suitable since the target data is in binary form.

Comparison for Classification Models:

Based on the accuracy test of both classification models (Decision Tree and KNN), it was observed that the Decision Tree performed well with this dataset and gives and accuracy of 83.55% and feature importance result shown that Passport, Designation, Monthly Income, Marital Status, Age and City Tier are the relevant variable, hence for this dataset the classification model suits well.

6. Conclusion:

Overall, the project has developed a predictive model using classification model that can accurately identify customers who are likely to be interested in purchasing the Travel package. By analyzing various factors through feature selection such as designation, passport status, marital status, age, monthly income and city tier the model is able to identify the most likely customers to take the package. Additionally, EDA revealed that customers in certain age and income ranges, with a preference for 5-star properties and those who have longer interactions with salesmen, higher pitch satisfaction scores and multiple follow ups are also more likely to take the package. To maximize success, the company should target its marketing efforts towards these identified customers, and also provide support for passport acquisition and child care services.