Project Title: Implementation of CRISP-DM Model on churn prediction dataset
NAME: ASWATHY GUNADEEP
EMAIL: aswathygunadeep@gmail.com
COLLEGE / COMPANY: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
According to CRISP-DM, a given datamining project has a life cycle consisting of six phases. The phase-sequence is adaptive.That is, the next phase in the sequence often depends on the outcomes associated with the previous phase.
CRISP-DM: The Six Phases:
Business/Research Understanding Phase
Data Understanding Phase
Data Preparation Phase
Modeling Phase
Evaluation Phase
Deployment Phase
A breakup of the main CRISP-DM activities in each phase is given below.
I PROBLEM UNDERTSANDING PHASE
setwd("C:/Users/user/Desktop/tarsha systems summer internship/CRISP-DM model")
ash.df <- read.csv(paste("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=""))
attach(ash.df)
dim(ash.df)
## [1] 7043 21
library(car)
## Loading required package: carData
some(ash.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 266 0123-CRBRT Female 0 Yes Yes 61
## 1154 8777-PVYGU Female 0 Yes No 52
## 3266 9722-UJOJR Male 0 Yes Yes 29
## 3424 2947-DOMLJ Male 0 No Yes 1
## 3518 0748-RDGGM Male 0 Yes No 70
## 4314 6288-LBEAR Female 0 No No 23
## 4598 3440-JPSCL Female 0 No No 6
## 6094 7110-BDTWG Female 0 Yes No 71
## 6372 1535-VTJOQ Female 0 No No 24
## 6554 4891-NLUBA Female 0 Yes Yes 61
## PhoneService MultipleLines InternetService OnlineSecurity
## 266 Yes Yes DSL Yes
## 1154 Yes No DSL Yes
## 3266 Yes No Fiber optic No
## 3424 Yes No DSL Yes
## 3518 Yes No Fiber optic Yes
## 4314 Yes No Fiber optic No
## 4598 Yes No Fiber optic No
## 6094 No No phone service DSL No
## 6372 Yes Yes No No internet service
## 6554 No No phone service DSL No
## OnlineBackup DeviceProtection TechSupport
## 266 Yes Yes Yes
## 1154 No Yes No
## 3266 No No No
## 3424 No No No
## 3518 Yes Yes Yes
## 4314 Yes No Yes
## 4598 No Yes Yes
## 6094 No Yes Yes
## 6372 No internet service No internet service No internet service
## 6554 Yes Yes Yes
## StreamingTV StreamingMovies Contract
## 266 Yes Yes Two year
## 1154 Yes No One year
## 3266 No No Month-to-month
## 3424 No No Month-to-month
## 3518 Yes Yes One year
## 4314 No No Month-to-month
## 4598 Yes Yes Month-to-month
## 6094 No Yes Two year
## 6372 No internet service No internet service Two year
## 6554 Yes Yes Two year
## PaperlessBilling PaymentMethod MonthlyCharges
## 266 No Mailed check 88.10
## 1154 Yes Mailed check 64.30
## 3266 Yes Electronic check 70.75
## 3424 No Mailed check 50.60
## 3518 Yes Bank transfer (automatic) 109.50
## 4314 Yes Mailed check 79.10
## 4598 Yes Mailed check 99.95
## 6094 Yes Electronic check 47.05
## 6372 No Credit card (automatic) 25.15
## 6554 Yes Bank transfer (automatic) 61.45
## TotalCharges Churn
## 266 5526.75 No
## 1154 3410.60 No
## 3266 1974.80 Yes
## 3424 50.60 Yes
## 3518 7534.65 Yes
## 4314 1783.75 No
## 4598 547.65 Yes
## 6094 3263.60 No
## 6372 553.00 No
## 6554 3751.15 No
str(ash.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Problem Statement:
pie(table(ash.df$Churn),main="churn", col=c("skyblue","yellow"))
mytable <- with(ash.df, table(Churn))
prop.table(mytable)*100
## Churn
## No Yes
## 73.46301 26.53699
Which customers are likely to leave?
The obvious guess is Total charges and Monthly charges and tenure. To make sure of this, we plot the corrgram, create correlation matrix and scatterplot matrix , excluding the the less important factor variables , having 2 levels-“Yes” and “No”.
II DATA UNDERSTANDING PHASE
CORRGRAM
library(corrgram)
par(mfrow=c(1,1))
corrgram(ash.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of dataset")
These are some of the possible conclusions that can be deduced from the above graph:
SCATTERPLOTMATRIX
library(car)
scatterplotMatrix(~TotalCharges+MonthlyCharges+tenure | Churn, data=ash.df,main="scatterplotmatrix", col=c("red","black"))
CORRELATION MATRIX
den.df <- subset(ash.df[,c(6,19,20)], ash.df$TotalCharges>=0 && ash.df$MonthlyCharges>=0)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
data <- cor(den.df,use = "complete.obs")
round(data,2)
## tenure MonthlyCharges TotalCharges
## tenure 1.00 0.25 0.83
## MonthlyCharges 0.25 1.00 0.65
## TotalCharges 0.83 0.65 1.00
The correlation matrix also suggests that tenure affects total charges greatly.
1.Demographic variables such as gender and no. of senior citizens:
par(mfrow=c(1,2))
mytable1 <- with(ash.df, table(gender))
barchart(mytable1, col="grey", main="genderwise breakup")
prop.table(mytable1)*100
## gender
## Female Male
## 49.52435 50.47565
mytable <- with(ash.df, table(SeniorCitizen))
barchart(mytable,main="no. of seniorcitizens",col="grey")
prop.table(mytable)*100
## SeniorCitizen
## 0 1
## 83.78532 16.21468
There are more males than females.
par(mfrow=c(1,1))
d <- density(ash.df$tenure)
plot(d, main="tenure", col="blue")
polygon(d, col="blue", border="turquoise")
Many customers use the company’s service for a shorter period.
par(mfrow=c(1,2))
mytable1 <- with(ash.df, table(PhoneService))
histogram(ash.df$PhoneService, main="phone service", col=c("red", "orange"), xlab="phone service")
prop.table(mytable1)*100
## PhoneService
## No Yes
## 9.683374 90.316626
mytable <- with(ash.df, table(MultipleLines))
histogram(ash.df$MultipleLines, main="multiple lines",col=c("red", "orange", "yellow"),xlab="multiple lines service")
prop.table(mytable)*100
## MultipleLines
## No No phone service Yes
## 48.132898 9.683374 42.183729
Most customers use the company’s phone service and do not have multiple lines.
pie(table(ash.df$InternetService),main="types of internet services", col=c("magenta","violet","pink"))
mytable <- with(ash.df, table(InternetService))
prop.table(mytable)*100
## InternetService
## DSL Fiber optic No
## 34.37456 43.95854 21.66690
Almost 44% customers have fiber optic type of internet service.
par(mfrow=c(1,2))
mytable <- with(ash.df, table(OnlineSecurity))
prop.table(mytable)*100
## OnlineSecurity
## No No internet service Yes
## 49.66634 21.66690 28.66676
barchart(mytable, main="online security split up", horizontal=FALSE, xlab="online security", col=c("burlywood", "darkolivegreen"))
mytable1 <- with(ash.df, table(OnlineBackup))
prop.table(mytable1)*100
## OnlineBackup
## No No internet service Yes
## 43.84495 21.66690 34.48814
barchart(mytable1, main="online backup split up", horizontal=FALSE, xlab="online backup", col=c("burlywood","darkolivegreen"))
par(mfrow=c(1,1))
mytable <- with(ash.df, table(Contract))
prop.table(mytable)*100
## Contract
## Month-to-month One year Two year
## 55.01917 20.91438 24.06645
barchart(mytable, main="contract", horizontal=FALSE, col=c("skyblue","orange","green"), xlab="type of contract")
55% customers have availed month-to-month contract basis.
par(mfrow=c(1,2))
den.df <- subset(ash.df, ash.df$TotalCharges>=0)
d <- density(den.df$TotalCharges)
plot(d, main="Total charges")
polygon(d, col="red", border="blue")
den1.df <- subset(ash.df, ash.df$MonthlyCharges>=0)
d1 <- density(den1.df$MonthlyCharges)
plot(d1, main="Monthly charges")
polygon(d1, col="red", border="blue")
Monthly charges are more than total charges, which could be reason for customers to leave.
The key drivers are tenure, contract, internet service, total charges and monthly charges.
III DATA PREPARATION PHASE
par(mfrow=c(1,2))
histogram(~ tenure | Churn,col="green", main="tenure with churn")
This shows that customers who tend to leave are people who have been customers for shorter periods.
xyplot(TotalCharges ~ MonthlyCharges, data=den.df, main="total charges v/s monthly charges", type = c("p", "g"),xlab="monthly charges", ylab="total charges")
par(mfrow=c(2,1))
library(lattice)
histogram(~ TotalCharges | Churn, data=ash.df)
histogram(~ MonthlyCharges | Churn, data=ash.df, col="orange")
As stated above, due to high monthly charges, churn percentage is more in that case.
library(vcd)
## Loading required package: grid
par(mfrow=c(1,1))
tab1 <- xtabs(~ Churn + Contract)
tab1
## Contract
## Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
mosaic(tab1, shade=TRUE, legend=TRUE, main="contract basis")
Customers who tend to leave are on a month-to-month contract basis.
par(mfrow=c(1,1))
tab2 <- xtabs(~ Churn + InternetService)
tab2
## InternetService
## Churn DSL Fiber optic No
## No 1962 1799 1413
## Yes 459 1297 113
mosaic(tab2, shade=TRUE, legend=TRUE, main="internet service effects")
Customers who tend to leave are havinf fiber optic type of internet connection.
boxplot(TotalCharges ~ InternetService, data=den.df,main="total charges with type of internet service", col=c("blue","red","yellow"),xlab="type of internet service",ylab="total charges")
Total charges are more for fiber optic type of internet service.
boxplot(tenure ~ Contract, data=ash.df,main="tenure v/s contract", col=c("green","orange","pink"),xlab="type of contract",ylab="tenure")
boxplot(tenure ~ OnlineSecurity, data=ash.df,main="tenure v/s online security", col=c("darkcyan","yellowgreen","darkseagreen1"),xlab="online security",ylab="tenure")
Evidently, customers prefer to use the company’s services for a longer time if they are assured online security.
plot(x=ash.df$tenure, y=ash.df$TotalCharges, main="total charges v/s tenure", col="red", cex=0.6, xlab="tenure", ylab="total charges")
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.
library(ggplot2)
par(mfrow=c(1,2))
qplot(x = TotalCharges, y = tenure, data = ash.df, color = factor(Churn))
## Warning: Removed 11 rows containing missing values (geom_point).
qplot(x = TotalCharges, y = MonthlyCharges, data = ash.df, color = factor(Churn))
## Warning: Removed 11 rows containing missing values (geom_point).
IV MODELING PHASE
CHI-SQUARED TESTS:
chisq.test(Churn,TotalCharges)
## Warning in chisq.test(Churn, TotalCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and TotalCharges
## X-squared = 6503.6, df = 6529, p-value = 0.5859
chisq.test(Churn,MonthlyCharges)
## Warning in chisq.test(Churn, MonthlyCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and MonthlyCharges
## X-squared = 2123.6, df = 1584, p-value < 2.2e-16
chisq.test(Churn,tenure)
## Warning in chisq.test(Churn, tenure): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and tenure
## X-squared = 1065.3, df = 72, p-value < 2.2e-16
chisq.test(Churn,Contract)
##
## Pearson's Chi-squared test
##
## data: Churn and Contract
## X-squared = 1184.6, df = 2, p-value < 2.2e-16
Hence, since all the p-values are very less (except for totalcharges), our original hypothesis is correct and NULL hypothesis of independence is rejected.
t-Tests NULL hypothesis: There is no difference in total charges for people abandoning and still using the company’s services.
t.test(TotalCharges ~ Churn, data=ash.df)
##
## Welch Two Sample t-test
##
## data: TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 916.8121 1130.2840
## sample estimates:
## mean in group No mean in group Yes
## 2555.344 1531.796
Based on the above output of the t-test, we can reject the hypothesis that the total charges of people leaving and still using the company’s services are the same. (p<0.001)
Regression models:
Simple linear regression (1)
sfit1 <- lm(TotalCharges ~ tenure, data=ash.df)
summary(sfit1)
##
## Call:
## lm(formula = TotalCharges ~ tenure, data = ash.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3997.2 -416.1 157.1 526.1 3382.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -189.529 25.254 -7.505 6.91e-14 ***
## tenure 76.271 0.621 122.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1278 on 7030 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.682
## F-statistic: 1.508e+04 on 1 and 7030 DF, p-value: < 2.2e-16
plot(y=TotalCharges, x=tenure, main="total charges v/s tenure", col="orange")
abline(sfit1)
The regression coefficient (76.27) is significantly dfferent from zero (p < 0.001). The multiple R-squared indciates that the model accounts for 68% of the variance in tenure. Although, the residual standard error is quite high for this model to predict total charges. This is not a good model.
Simple linear regression (2)
sfit2 <- lm(MonthlyCharges ~ tenure, data=ash.df)
summary(sfit2)
##
## Call:
## lm(formula = MonthlyCharges ~ tenure, data = ash.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.498 -27.251 6.245 24.943 54.376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.92978 0.57476 95.57 <2e-16 ***
## tenure 0.30372 0.01415 21.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 7041 degrees of freedom
## Multiple R-squared: 0.06145, Adjusted R-squared: 0.06132
## F-statistic: 461 on 1 and 7041 DF, p-value: < 2.2e-16
plot(y=MonthlyCharges, x=tenure, main="monthly charges v/s tenure", col="red")
abline(sfit2)
The regression coefficient (0.303) is dfferent from zero (p < 0.001). The multiple R-squared indciates that the model accounts for 6% of the variance in tenure. Although, the residual standard error is quite high for this model to predict monthly charges. This is not at all a good model.
Logistic regression
model <- glm(Churn ~ TotalCharges + OnlineSecurity + OnlineBackup + tenure + MonthlyCharges + gender + Contract + PhoneService + MultipleLines + InternetService, family=binomial(link='logit'),data=ash.df)
summary(model)
##
## Call:
## glm(formula = Churn ~ TotalCharges + OnlineSecurity + OnlineBackup +
## tenure + MonthlyCharges + gender + Contract + PhoneService +
## MultipleLines + InternetService, family = binomial(link = "logit"),
## data = ash.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7072 -0.7073 -0.2926 0.8059 3.4724
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.733e-01 2.027e-01 0.855 0.39250
## TotalCharges 3.057e-04 7.022e-05 4.353 1.34e-05
## OnlineSecurityNo internet service -5.870e-01 1.873e-01 -3.133 0.00173
## OnlineSecurityYes -5.550e-01 8.490e-02 -6.537 6.29e-11
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes -2.356e-01 7.834e-02 -3.007 0.00264
## tenure -5.915e-02 6.135e-03 -9.641 < 2e-16
## MonthlyCharges 1.075e-02 4.263e-03 2.521 0.01171
## genderMale -3.006e-02 6.397e-02 -0.470 0.63839
## ContractOne year -8.254e-01 1.056e-01 -7.817 5.39e-15
## ContractTwo year -1.682e+00 1.729e-01 -9.726 < 2e-16
## PhoneServiceYes -9.549e-01 1.463e-01 -6.527 6.71e-11
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.627e-01 8.099e-02 3.244 0.00118
## InternetServiceFiber optic 7.417e-01 1.390e-01 5.338 9.41e-08
## InternetServiceNo NA NA NA NA
##
## (Intercept)
## TotalCharges ***
## OnlineSecurityNo internet service **
## OnlineSecurityYes ***
## OnlineBackupNo internet service
## OnlineBackupYes **
## tenure ***
## MonthlyCharges *
## genderMale
## ContractOne year ***
## ContractTwo year ***
## PhoneServiceYes ***
## MultipleLinesNo phone service
## MultipleLinesYes **
## InternetServiceFiber optic ***
## InternetServiceNo
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 5938.0 on 7019 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5964
##
## Number of Fisher Scoring iterations: 6
V EVALUATION PHASE
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC(receiving operating characteristic) is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(gplots)
fitted.results <- predict(model,newdata=subset(ash.df,select=c(10,20,6,16,7,9)),type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(fitted.results, ash.df$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8418911
The results of the regression model states that it is evident that MonthlyCharges and gender is not signiificant at all. As for the statistically significant variables, tenure has the lowest p-value suggesting a strong association of the period for which the customers used the company’s services with the churn prediction.
Forecasting future values:
library(forecast)
ash1.df <- subset(ash.df[c(6,19,20)])
myts <- ts(ash1.df)
forecast(myts, 3)
## Warning in ets(object, lambda = lambda, allow.multiplicative.trend =
## allow.multiplicative.trend, : Missing values encountered. Using longest
## contiguous portion of time series
## tenure
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 7044 32.38652 0.9130221 63.86002 -15.74804 80.52109
## 7045 32.38652 0.9130219 63.86002 -15.74804 80.52109
## 7046 32.38652 0.9130218 63.86002 -15.74804 80.52109
##
## MonthlyCharges
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 7044 64.7517 26.24427 103.2591 5.859675 123.6437
## 7045 64.7517 26.24427 103.2591 5.859675 123.6437
## 7046 64.7517 26.24427 103.2591 5.859674 123.6437
##
## TotalCharges
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3332 2254.99 -651.9964 5161.977 -2190.862 6700.843
## 3333 2254.99 -651.9964 5161.977 -2190.862 6700.843
## 3334 2254.99 -651.9965 5161.977 -2190.862 6700.843
plot(forecast(myts, 3))
## Warning in ets(object, lambda = lambda, allow.multiplicative.trend =
## allow.multiplicative.trend, : Missing values encountered. Using longest
## contiguous portion of time series
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.
VI DEPLOYMENT PHASE