Recording Link:
As internet travel information has grown significantly nowadays, it has become more difficult for travellers to choose from the various travel packages accessible to meet their practical needs. On the other hand, in order to increase sales and profits, travel agencies must be aware of the interests of various travellers and offer more attractive packages. Hence, in this project, we are going to predict the travel package that will be purchased by customer so it will be easier to get customer’s travel preferences and to know the characteristics of potential customers who will buy travel packages.
This project was obtained from Kaggle dataset and based on following the scenario given that in order to increase its customer base, a company intends to enable and develop a workable business strategy for travel package options. The organisation currently offers 5 different sorts of packages: Basic, Standard, Deluxe, and Super Deluxe. Only 18% of clients purchased packages last year, according to the report. However, because clients were contacted at random and without considering the available data, the cost of marketing was quite expensive. To tackle this problem and to attract more customers, the company plans to launch a new package, namely the Wellness Tourism Package. This new package enables the traveller to sustain and improve a healthy lifestyle and well-being. To increase the effectiveness of its marketing spending, the company this time intends to make use of the existing data on both current and potential clients. Thus, our group project task has to analyse customer data and information in order to make recommendations to the marketing team and to develop a model that would forecast which customers will buy the newly presented ‘Wellness Tourism Package’.
Below shows the flowchart that summarize the pipeline of this project.
## Loading required package: DiagrammeR
The dataset used is a csv file originating from Kaggle. The data consists of 4888 rows and 20 columns (14 numeric columns and 6 categorical columns). Data description are as follows:
##To read the csv file
df<-read.csv("Travel.csv")
##To view it like excel
View(df)
##To see the summary
summary(df)
## CustomerID ProdTaken Age TypeofContact
## Min. :200000 Min. :0.0000 Min. :18.00 Length:4888
## 1st Qu.:201222 1st Qu.:0.0000 1st Qu.:31.00 Class :character
## Median :202444 Median :0.0000 Median :36.00 Mode :character
## Mean :202444 Mean :0.1882 Mean :37.62
## 3rd Qu.:203665 3rd Qu.:0.0000 3rd Qu.:44.00
## Max. :204887 Max. :1.0000 Max. :61.00
## NA's :226
## CityTier DurationOfPitch Occupation Gender
## Min. :1.000 Min. : 5.00 Length:4888 Length:4888
## 1st Qu.:1.000 1st Qu.: 9.00 Class :character Class :character
## Median :1.000 Median : 13.00 Mode :character Mode :character
## Mean :1.654 Mean : 15.49
## 3rd Qu.:3.000 3rd Qu.: 20.00
## Max. :3.000 Max. :127.00
## NA's :251
## NumberOfPersonVisiting NumberOfFollowups ProductPitched
## Min. :1.000 Min. :1.000 Length:4888
## 1st Qu.:2.000 1st Qu.:3.000 Class :character
## Median :3.000 Median :4.000 Mode :character
## Mean :2.905 Mean :3.708
## 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :6.000
## NA's :45
## PreferredPropertyStar MaritalStatus NumberOfTrips Passport
## Min. :3.000 Length:4888 Min. : 1.000 Min. :0.0000
## 1st Qu.:3.000 Class :character 1st Qu.: 2.000 1st Qu.:0.0000
## Median :3.000 Mode :character Median : 3.000 Median :0.0000
## Mean :3.581 Mean : 3.237 Mean :0.2909
## 3rd Qu.:4.000 3rd Qu.: 4.000 3rd Qu.:1.0000
## Max. :5.000 Max. :22.000 Max. :1.0000
## NA's :26 NA's :140
## PitchSatisfactionScore OwnCar NumberOfChildrenVisiting
## Min. :1.000 Min. :0.0000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :3.000 Median :1.0000 Median :1.000
## Mean :3.078 Mean :0.6203 Mean :1.187
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :5.000 Max. :1.0000 Max. :3.000
## NA's :66
## Designation MonthlyIncome
## Length:4888 Min. : 1000
## Class :character 1st Qu.:20346
## Mode :character Median :22347
## Mean :23620
## 3rd Qu.:25571
## Max. :98678
## NA's :233
##To check the duplication
#unique(df)
df[duplicated(df)]
## data frame with 0 columns and 4888 rows
The data cleaning process that is done in this project including checking data duplication, checking the mean, imputing missing values, removing blanks value, replace inconsistent values, remove outlier, drops unnecessary columns, and transform the data type.
##To impute the missing values using mean/median
df$Age[is.na(df$Age)]<-median(df$Age, na.rm = TRUE)
df$DurationOfPitch[is.na(df$DurationOfPitch)]<-median(df$DurationOfPitch, na.rm = TRUE)
df$NumberOfFollowups[is.na(df$NumberOfFollowups)]<-median(df$NumberOfFollowups, na.rm = TRUE)
df$PreferredPropertyStar[is.na(df$PreferredPropertyStar)]<-median(df$PreferredPropertyStar, na.rm = TRUE)
df$NumberOfTrips[is.na(df$NumberOfTrips)]<-median(df$NumberOfTrips, na.rm = TRUE)
df$NumberOfChildrenVisiting[is.na(df$NumberOfChildrenVisiting)]<-median(df$NumberOfChildrenVisiting, na.rm = TRUE)
df$MonthlyIncome[is.na(df$MonthlyIncome)]<-mean(df$MonthlyIncome, na.rm = TRUE)
##To check the missing values of characters
sum(is.na(df$TypeofContact))
## [1] 0
sum(is.na(df$Occupation))
## [1] 0
sum(is.na(df$Gender))
## [1] 0
sum(is.na(df$ProductPitched))
## [1] 0
sum(is.na(df$MaritalStatus))
## [1] 0
sum(is.na(df$Designation))
## [1] 0
sum(is.na(df))
## [1] 0
#To classify the variables
unique(df['CityTier'])
## CityTier
## 1 3
## 2 1
## 80 2
unique(df['Occupation'])
## Occupation
## 1 Salaried
## 3 Free Lancer
## 5 Small Business
## 34 Large Business
unique(df['Gender'])
## Gender
## 1 Female
## 2 Male
## 23 Fe Male
##To check for blanks
sum(df$TypeofContact=="")
## [1] 25
#Remove blanks in Type of Contact
df <- df[!(df$TypeofContact == ""), ]
#Replace "Fe Male" with "Female"
sum(df$Gender=="Fe Male")
## [1] 155
sum(df$Gender=="Female")
## [1] 1807
df$Gender[df$Gender == "Fe Male"] <- "Female"
#Replace "Unmarried" with "Single"
sum(df$MaritalStatus=="Unmarried")
## [1] 682
sum(df$MaritalStatus=="Single")
## [1] 912
df$MaritalStatus[df$MaritalStatus == "Unmarried"] <- "Single"
#Remove Monthly Income less than 16000
df <- df[!(df$MonthlyIncome < 16000), ]
#To drop the columns related to customer interactions
df <- subset(df, select = -c(DurationOfPitch,NumberOfFollowups,ProductPitched,PitchSatisfactionScore))
##To transform the character data to numeric
df$TypeofContact[df$TypeofContact == 'Self Enquiry'] <- '1'
df$TypeofContact[df$TypeofContact == 'Company Invited'] <- '2'
df$TypeofContact<-as.numeric(df$TypeofContact)
df$Occupation[df$Occupation == 'Free Lancer'] <- '1'
df$Occupation[df$Occupation == 'Large Business'] <- '2'
df$Occupation[df$Occupation == 'Salaried'] <- '3'
df$Occupation[df$Occupation == 'Small Business'] <- '4'
df$Occupation<-as.numeric(df$Occupation)
df$Gender[df$Gender == 'Female'] <- '1'
df$Gender[df$Gender == 'Male'] <- '2'
df$Gender<-as.numeric(df$Gender)
df$MaritalStatus[df$MaritalStatus == 'Single'] <- '1'
df$MaritalStatus[df$MaritalStatus == 'Divorced'] <- '2'
df$MaritalStatus[df$MaritalStatus == 'Married'] <- '3'
df$MaritalStatus<-as.numeric(df$MaritalStatus)
df$Designation[df$Designation == 'Executive'] <- '1'
df$Designation[df$Designation == 'Manager'] <- '2'
df$Designation[df$Designation == 'Senior Manager'] <- '3'
df$Designation[df$Designation == 'AVP'] <- '4'
df$Designation[df$Designation == 'VP'] <- '5'
df$Designation
df$Designation<-as.numeric(df$Designation)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## Loading required package: ggplot2
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$Age , horizontal=TRUE , ylim=c(0,80), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Age")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$Age , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,80), ylim = c(0,500))
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$DurationOfPitch , horizontal=TRUE , ylim=c(0,50), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Duration Of Pitch")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$DurationOfPitch , breaks=100 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,50), ylim = c(0,600))
#Spread of Data for NumberOfPersonVisiting
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfPersonVisiting , horizontal=TRUE , ylim=c(0,10), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Person Visiting")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfPersonVisiting , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,10), ylim = c(0,600))
#Spread of Data for NumberOfFollowups
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfFollowups , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Followups")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfFollowups , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,600))
#Spread of Data for PreferredPropertyStar
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$PreferredPropertyStar , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Preferred Property Star")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$PreferredPropertyStar , breaks=3 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,3500))
#Spread of Data for NumberOfTrips
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfTrips , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Trips")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfTrips , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,1500))
#Spread of Data for PitchSatisfactionScore
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$PitchSatisfactionScore , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Pitch Satisfaction Score")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$PitchSatisfactionScore , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,8), ylim = c(0,1500))
#Spread of Data for NumberOfChildrenVisiting
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$NumberOfChildrenVisiting , horizontal=TRUE , ylim=c(0,4), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Number Of Children Visiting")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$NumberOfChildrenVisiting , breaks=20 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,4), ylim = c(0,1800))
#Spread of Data for MonthlyIncome
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$MonthlyIncome , horizontal=TRUE , ylim=c(0,100000), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F, main = "Spread of Data for Monthly Income")
par(mar=c(4, 3.1, 1.1, 2.1))
hist(df$MonthlyIncome , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(0,100000), ylim = c(0,1300))
Numvar <- select(df,ProdTaken,Age,CityTier,NumberOfPersonVisiting,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,
PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation, MonthlyIncome)
res <- round(cor(Numvar),2)
head(res)
## ProdTaken Age CityTier NumberOfPersonVisiting
## ProdTaken 1.00 -0.14 0.09 0.01
## Age -0.14 1.00 -0.01 0.02
## CityTier 0.09 -0.01 1.00 0.00
## NumberOfPersonVisiting 0.01 0.02 0.00 1.00
## NumberOfFollowups 0.11 0.00 0.02 0.33
## PreferredPropertyStar 0.10 -0.02 -0.01 0.03
## NumberOfFollowups PreferredPropertyStar NumberOfTrips
## ProdTaken 0.11 0.10 0.02
## Age 0.00 -0.02 0.17
## CityTier 0.02 -0.01 -0.03
## NumberOfPersonVisiting 0.33 0.03 0.19
## NumberOfFollowups 1.00 -0.03 0.14
## PreferredPropertyStar -0.03 1.00 0.01
## Passport PitchSatisfactionScore OwnCar
## ProdTaken 0.26 0.05 -0.01
## Age 0.03 0.02 0.05
## CityTier 0.00 -0.04 0.00
## NumberOfPersonVisiting 0.01 -0.02 0.01
## NumberOfFollowups 0.00 0.00 0.01
## PreferredPropertyStar 0.00 -0.02 0.01
## NumberOfChildrenVisiting Designation MonthlyIncome
## ProdTaken 0.01 -0.18 -0.13
## Age 0.01 0.50 0.46
## CityTier 0.00 0.12 0.05
## NumberOfPersonVisiting 0.61 -0.03 0.19
## NumberOfFollowups 0.28 0.01 0.17
## PreferredPropertyStar 0.03 -0.01 0.00
melted_corr_mat <- melt(res)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
head(melted_corr_mat)
## X1 X2 value
## 1 ProdTaken ProdTaken 1.00
## 2 Age ProdTaken -0.14
## 3 CityTier ProdTaken 0.09
## 4 NumberOfPersonVisiting ProdTaken 0.01
## 5 NumberOfFollowups ProdTaken 0.11
## 6 PreferredPropertyStar ProdTaken 0.10
ggplot(data = melted_corr_mat, aes(x=X1, y=X2,fill=value)) + geom_tile() +
geom_text(aes(x = X2, y = X1, label = value),color = "black", size = 4)
EDA Observations Part 1:
The average number of visitors is usually 3, with one outlier at 5. The average number of follow-ups is 4, with an extreme value of 6. Age is distributed normally and does not contain any outliers. The duration of pitch is right-skewed and has some outliers greater than 120. These outliers should be investigated to determine if they need to be treated. The number of trips is right-skewed and has some outliers where the number of trips is greater than 17. Monthly income is right-skewed and has some outliers at the higher end.
library(stats)
library(dplyr)
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(reticulate)
library(readxl)
# Read data from excel #
df <- read_excel("travel_cleaned.xlsx")
## New names:
## • `` -> `...1`
# View the data set data #
View(df)
df = subset(df, select = -c(...1,CustomerID) )
# Histogram with density line--
List_col = list(2,4,7,8,10,11,12,13)
df$ProdTaken <- factor(df$ProdTaken)
Yes <- df %>%
filter(ProdTaken == 1) %>%
select(Age,CityTier,
NumberOfPersonVisiting,
PreferredPropertyStar,
NumberOfTrips,
Passport,
OwnCar,
NumberOfChildrenVisiting,
MonthlyIncome)
No <- df %>%
filter(ProdTaken == 0) %>%
select(Age,CityTier,
NumberOfPersonVisiting,
PreferredPropertyStar,
NumberOfTrips,
Passport,
OwnCar,
NumberOfChildrenVisiting,
MonthlyIncome)
Yes <- as.matrix(Yes)
No <- as.matrix(No)
list <- c("Age","CityTier",
"NumberOfPersonVisiting",
"PreferredPropertyStar",
"NumberOfTrips",
"Passport",
"OwnCar",
"NumberOfChildrenVisiting",
"MonthlyIncome")
par(mfrow=c(2,2)) # Create two rows and two columns
par(bg="gray") # Manipulate background
for(i in 1:9)
{
hist(Yes[,i], prob = TRUE,
main = paste("Distribution of ",(colnames(Yes[i])),
"\n of a customer who had Not taken Product"),
xlab = paste(list[i]), cex.main=.6)
lines(density(Yes[,i]), col = 1, lwd = 2)
hist(No[,i], prob = TRUE,
main = paste("Distribution of ",(colnames(No[i])),
"\n of a customer who had taken Product"),
xlab = paste(list[i]), cex.main=.6)
lines(density(No[,i]), col = 1, lwd = 2)
}
# create grouped boxplot
List_col = list(2,4,7,8,10,11,12,13)
par(mfrow=c(2,2)) # Create two rows and two columns
par(bg="gray")
for(i in List_col)
{
boxplot( df[[i]]~df[[1]],
data=ToothGrowth,
main=paste("Boxplot of", (colnames(df[i])), "\nw.r.t Product taken"),
cex.main=.6,
xlab="Product Taken",
ylab=paste(colnames(df[i])),
border="black"
)
}
EDA Observations Part 2:
## 'data.frame': 4861 obs. of 10 variables:
## $ ProdTaken : int 1 0 1 0 0 0 0 0 0 0 ...
## $ Age : int 41 49 37 33 36 32 59 30 38 36 ...
## $ CityTier : int 3 1 1 1 1 1 1 1 1 1 ...
## $ NumberOfPersonVisiting : int 3 3 3 2 2 3 2 3 2 3 ...
## $ PreferredPropertyStar : int 3 4 3 3 4 3 5 3 3 3 ...
## $ NumberOfTrips : int 1 2 7 2 1 1 5 2 1 7 ...
## $ Passport : int 1 0 1 1 0 0 1 0 0 0 ...
## $ OwnCar : int 1 1 0 1 1 1 1 0 0 1 ...
## $ NumberOfChildrenVisiting: int 0 2 0 1 0 1 1 1 0 0 ...
## $ MonthlyIncome : num 20993 20130 17090 17909 18468 ...
## Loading required package: caret
## Loading required package: lattice
parts = createDataPartition(df$ProdTaken, p = .8, list= F)
train = df[parts,]
test = df[-parts,]
nrow(train)
## [1] 3889
head(train)
## ProdTaken Age CityTier NumberOfPersonVisiting PreferredPropertyStar
## 1 1 41 3 3 3
## 2 0 49 1 3 4
## 3 1 37 1 3 3
## 7 0 59 1 2 5
## 9 0 38 1 2 3
## 10 0 36 1 3 3
## NumberOfTrips Passport OwnCar NumberOfChildrenVisiting MonthlyIncome
## 1 1 1 1 0 20993
## 2 2 0 1 2 20130
## 3 7 1 0 0 17090
## 7 5 1 1 1 17670
## 9 1 0 0 0 24526
## 10 7 0 1 0 20237
nrow(test)
## [1] 972
# reverse the data frame
df <- df[nrow(df):1,]
# select the last 20% of ProdTaken from df
test1 <- df[1:floor(.2 * nrow(df)),]
# reverse the test data frame back to its original order
test1 <- test[nrow(test):1,]
train_control = trainControl(method = "cv", number = 5)
set.seed(50)
model = train(ProdTaken ~ ., data = train, method = "svmLinear", trControl = train_control)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
print(model)
## Support Vector Machines with Linear Kernel
##
## 3889 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3111, 3111, 3112, 3111, 3111
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4140002 0.03481166 0.2090933
##
## Tuning parameter 'C' was held constant at a value of 1
#use model to make predictions on test data
pred_y = predict(model, test)
typeof(pred_y)
## [1] "double"
# performance metrics on the test data
test_y = test[, 1]
typeof(test_y)
## [1] "integer"
mean((test_y - pred_y)^2) #mse - Mean Squared Error
## [1] 0.1903521
caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error
## [1] 0.4362936
# load library ggplot2
library(ggplot2)
# create dataframe with actual and predicted values
plot_data <- data.frame(Predicted_value = pred_y,
Observed_value = test_y)
# plot predicted values and actual values
ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "green")
##
## Attaching package: 'data.table'
## The following object is masked from 'package:reshape':
##
## melt
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#To keep the related numeric values
df1<-df[ -c(1,2,5,7:8,11,16) ]
#To construct the initial model
model <- lm(ProdTaken ~ ., data = df1)
summary(model)
##
## Call:
## lm(formula = ProdTaken ~ ., data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57634 -0.20747 -0.12224 -0.00655 1.08056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.802e-01 4.258e-02 4.232 2.35e-05 ***
## Age -4.979e-03 6.668e-04 -7.467 9.70e-14 ***
## CityTier 3.910e-02 5.778e-03 6.766 1.48e-11 ***
## NumberOfPersonVisiting 7.144e-03 9.285e-03 0.769 0.44167
## PreferredPropertyStar 4.713e-02 6.634e-03 7.105 1.38e-12 ***
## NumberOfTrips 9.287e-03 3.006e-03 3.090 0.00201 **
## Passport 2.276e-01 1.163e-02 19.572 < 2e-16 ***
## OwnCar 3.755e-03 1.093e-02 0.344 0.73115
## NumberOfChildrenVisiting 7.797e-04 7.853e-03 0.099 0.92092
## MonthlyIncome -6.702e-06 1.170e-06 -5.727 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3683 on 4851 degrees of freedom
## Multiple R-squared: 0.1158, Adjusted R-squared: 0.1142
## F-statistic: 70.59 on 9 and 4851 DF, p-value: < 2.2e-16
#To compute the Variance Inflation Factor of every predictor
library(caTools)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model)
## Age CityTier NumberOfPersonVisiting
## 1.321522 1.005296 1.623030
## PreferredPropertyStar NumberOfTrips Passport
## 1.001892 1.075632 1.001989
## OwnCar NumberOfChildrenVisiting MonthlyIncome
## 1.007720 1.605412 1.351880
##A VIF of more than 10 is a concern
#Overview of the current model residuals
par(mfrow = c(2, 2))
plot(model)
#Required for RMSE and MAE commands
library(caret)
#To split the data randomly into a training set and a test set
set.seed(100)
n_train <- ceiling(0.8 * length(df1$ProdTaken))
train_sample <- sample(c(1:length(df1$ProdTaken)), n_train)
train_data <- df1[train_sample, ]
test_data <- df1[-train_sample, ]
#To fit the model on the training data
model3 <- lm(ProdTaken ~ ., data = train_data)
predictions <- predict(model3, test_data)
#To measure the performance by comparing the prediction with
#the data using multiple criterion
RMSE <- RMSE(predictions, test_data$ProdTaken)
MAE <- MAE(predictions, test_data$ProdTaken)
print(c(RMSE, MAE))
## [1] 0.3814345 0.2825624
#To compute the prediction error
pred_error_rate <- RMSE / mean(test_data$ProdTaken)
pred_error_rate
## [1] 1.853772
R_sq <- 0
RMSE <- 0
MAE <- 0
#To run the validation 10 times
for(i in 1:10){
n_train <- ceiling(0.8 * length(df1$ProdTaken))
train_sample <- sample(c(1:length(df1$ProdTaken)), n_train)
train_data <- df[train_sample, ]
test_data <- df[-train_sample, ]
model <- lm(ProdTaken ~ ., data = train_data)
predictions <- predict(model, test_data,warning = FALSE)
RMSE <- RMSE + RMSE(predictions, test_data$ProdTaken)
MAE <- MAE + MAE(predictions, test_data$ProdTaken)
}
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
## Warning in predict.lm(model, test_data, warning = FALSE): prediction from a
## rank-deficient fit may be misleading
RMSE = RMSE / 10
MAE = MAE / 10
print(c(RMSE, MAE))
## [1] 0.3562234 0.2663532
#To create create a data frame with actual and predicted values
plot_data <- data.frame(Predicted_value = predictions,
Actual_value = test_data$ProdTaken)
#To plot predicted values and actual values
ggplot(plot_data, aes(x = Predicted_value, y = Actual_value)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "green")
## New names:
## • `` -> `...1`
library(rpart)
library(rpart.plot)
create_train_test <- function(df, size = 0.8, train = TRUE) {
n_row = nrow(df)
total_row = size * n_row
#train_sample < - 1: total_row
if (train == TRUE) {
return (df[1:3500, ])
} else {
return (df[3500:n_row, ])
}
}
data_train <- create_train_test(df, 0.8, train = TRUE)
data_test <- create_train_test(df, 0.8, train = FALSE)
dim(data_train)
## [1] 3500 15
dim(data_test)
## [1] 1362 15
prop.table(table(df$ProdTaken))
##
## 0 1
## 0.8113557 0.1886443
par(mfrow=c(1,1)) # Create two rows and two columns
par(bg="gray")
fit <- rpart(ProdTaken~., data = data_train, method = 'class')
predict_unseen <-predict(fit, data_test, type = 'class')
table_mat <- table(data_test$ProdTaken, predict_unseen)
table_mat
## predict_unseen
## 0 1
## 0 1045 40
## 1 184 93
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 0.83553597650514"
#check importance variable #
importance <- as.data.frame(fit$variable.importance)
imp <- cbind(importance, row.names(importance))
colnames(imp) <- c("Importance","varnames")
library(ggplot2)
col = c("#B22222","#CD5C5C","#FF4500","#FFD700","#F0E68C",
"#556B2F","#ADFF2F","#98FB98","#008080")
ggplot(imp, aes(x = reorder(varnames,Importance), y=Importance)) +
geom_bar(stat="identity",color=col,fill=col) +
coord_flip() +
ggtitle("Importance Variable Feature") +
ylab("Importance") +
xlab("Feature")
#Check for model accuracy again
model2 <- rpart(ProdTaken~Passport+
Designation+MonthlyIncome+
MaritalStatus+Age+CityTier, data=data_train,method="class")
predict_unseen2 <- predict(model2, data_test, type="class")
table_mat <- table(data_test$ProdTaken, predict_unseen2)
table_mat
## predict_unseen2
## 0 1
## 0 1045 40
## 1 184 93
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat) * 100
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 83.553597650514"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1045 40
## 1 184 93
##
## Accuracy : 0.8355
## 95% CI : (0.8148, 0.8548)
## No Information Rate : 0.9023
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3706
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8503
## Specificity : 0.6992
## Pos Pred Value : 0.9631
## Neg Pred Value : 0.3357
## Prevalence : 0.9023
## Detection Rate : 0.7673
## Detection Prevalence : 0.7966
## Balanced Accuracy : 0.7748
##
## 'Positive' Class : 0
##
KNN is a supervised leaning algorithm that works on classification problems. For this KNN, Euclidean distance is calculated and assigned as the K-value.
#load class package
library(class)
##
## Attaching package: 'class'
## The following object is masked from 'package:reshape':
##
## condense
#Randomization with set.seed
set.seed(1000)
gp <- runif(nrow(df))
df<-df[order(gp),]
#str(df)
#head(df)
#head(df,10)
#str(df)
summary(df[,c(3:16)])
## Age TypeofContact CityTier Occupation
## Min. :18.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:31.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :36.00 Median :1.000 Median :1.000 Median :3.000
## Mean :37.59 Mean :1.292 Mean :1.655 Mean :3.335
## 3rd Qu.:43.00 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :61.00 Max. :2.000 Max. :3.000 Max. :4.000
## Gender NumberOfPersonVisiting PreferredPropertyStar MaritalStatus
## Min. :1.000 Min. :1.000 Min. :3.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:1.000
## Median :2.000 Median :3.000 Median :3.000 Median :2.000
## Mean :1.597 Mean :2.908 Mean :3.579 Mean :2.151
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :2.000 Max. :5.000 Max. :5.000 Max. :3.000
## NumberOfTrips Passport OwnCar NumberOfChildrenVisiting
## Min. : 1.000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
## Median : 3.000 Median :0.0000 Median :1.0000 Median :1.000
## Mean : 3.231 Mean :0.2921 Mean :0.6209 Mean :1.188
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :22.000 Max. :1.0000 Max. :1.0000 Max. :3.000
## Designation MonthlyIncome
## Min. :1.000 Min. :16009
## 1st Qu.:1.000 1st Qu.:20482
## Median :2.000 Median :22637
## Mean :2.058 Mean :23628
## 3rd Qu.:3.000 3rd Qu.:25448
## Max. :5.000 Max. :98678
#normalize data set
normalize <- function(x){return((x-min(x))/(max(x)-min(x)))}
normalize(c(2:16))
## [1] 0.00000000 0.07142857 0.14285714 0.21428571 0.28571429 0.35714286
## [7] 0.42857143 0.50000000 0.57142857 0.64285714 0.71428571 0.78571429
## [13] 0.85714286 0.92857143 1.00000000
df_n<-as.data.frame(lapply(df[,c(3:16)], normalize))
#data splitting into 80% train data and 20% test data
df_train <- df_n[1:3889,]
df_test <- df_n[3890:4861,]
df_train_target <- df[1:3889,2]
df_test_target <- df[3890:4861,2]
require(class)
#using the Euclidean distance
sqrt(4861)
## [1] 69.72087
#train model using K-Nearest Neighbours
m1 <- knn(train=df_train, test=df_test, cl=df_train_target, k=70)
table(df_test_target,m1)
## m1
## df_test_target 0 1
## 0 786 8
## 1 168 10
round(795/972,2)
## [1] 0.82
#Find the number of observation
NROW(df_train_target)
## [1] 3889
sqrt(NROW(df_train_target))
## [1] 62.36185
m1.62 <- knn(train=df_train, test=df_test, cl=df_train_target, k=62)
m1.63 <- knn(train=df_train, test=df_test, cl=df_train_target, k=63)
#Calculate the proportion of correct classification for k =62, 63
ACC.62 <- 100 * sum(df_test_target == m1.62)/NROW(df_test_target)
ACC.63 <- 100 * sum(df_test_target == m1.63)/NROW(df_test_target)
ACC.62
## [1] 82.20165
ACC.63
## [1] 82.30453
# Check prediction against actual value in tabular form for k=62
table(m1.62 ,df_test_target)
## df_test_target
## m1.62 0 1
## 0 784 163
## 1 10 15
m1.62
## [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [38] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [852] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [963] 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
round(795/972,2)
## [1] 0.82
table(m1.63 ,df_test_target)
## df_test_target
## m1.63 0 1
## 0 785 163
## 1 9 15
m1.63
## [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [38] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [852] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [963] 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
round(795/972,2)
## [1] 0.82
library(caret)
#Using Confusion Matrix as an evaluation technique
confusionMatrix(table(m1.62 ,df_test_target))
## Confusion Matrix and Statistics
##
## df_test_target
## m1.62 0 1
## 0 784 163
## 1 10 15
##
## Accuracy : 0.822
## 95% CI : (0.7965, 0.8456)
## No Information Rate : 0.8169
## P-Value [Acc > NIR] : 0.3573
##
## Kappa : 0.1075
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.98741
## Specificity : 0.08427
## Pos Pred Value : 0.82788
## Neg Pred Value : 0.60000
## Prevalence : 0.81687
## Detection Rate : 0.80658
## Detection Prevalence : 0.97428
## Balanced Accuracy : 0.53584
##
## 'Positive' Class : 0
##
#Confusion Matrix
confusion_mat <- table(m1.62 ,df_test_target)
confusion_df <- data.frame(confusion_mat)
val<- c(784,163,10,15)
library(ggplot2)
ggplot(data = confusion_df, mapping = aes(x = TClass, y = PClass)) +
geom_tile(aes(fill = val), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", val)), vjust = 1, size=10, hjust=0.5) +
scale_fill_gradientn(values=c(1, .6, .5, .4, 0),colours=c("red", "#770000", "black", "#007777", "cyan")) +
theme_bw() + theme(legend.position = "none")+
ggtitle("Confusion Matrix") +
xlab("Predicted") +
ylab("Actual")
Overall, the project has developed a predictive model using classification model that can accurately identify customers who are likely to be interested in purchasing the Travel package. By analyzing various factors through feature selection such as designation, passport status, marital status, age, monthly income and city tier the model is able to identify the most likely customers to take the package. Additionally, EDA revealed that customers in certain age and income ranges, with a preference for 5-star properties and those who have longer interactions with salesmen, higher pitch satisfaction scores and multiple follow ups are also more likely to take the package. To maximize success, the company should target its marketing efforts towards these identified customers, and also provide support for passport acquisition and child care services.