In this project, Richard Wall wanted to estimate the probability that a given customer would leave in the near future and to identify the drivers that contributed most to that decision, so that QWE could reach out to the customer, improve his or her experience with QWE services, and avert churn without giving up costly discounts. In particular, he wanted to understand the relationship between churn and customer longevity in months, CHI score, number of support cases, average support priority, and usage information (logins, blogs, views, and days since last login).
To analyze these relationships, his associate V. J. Aggrawal pulled data on both the value of each characteristic as of December 1, 2011, and its change from November to December, and added a column indicating whether the customer actually left in the two months following December 1. The final dataset includes the variables ID, Customer Age (in months), Churn (1 = Yes, 0 = No), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, Support Cases 0-1, SP Month 0, SP 0-1, Logins 0-1, Blog Articles 0-1, Views 0-1, and Days Since Last Login 0-1.
One thing to note: to simplify the problem, Aggrawal pulled data for only two months. Therefore, the regressions below may be less precise than regressions based on a longer period.
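# Packages used below: readxl (read_excel), dplyr (%>%, filter, group_by), ggplot2 (plots)
library(readxl); library(dplyr); library(ggplot2)
df_qwe = as.data.frame(read_excel('UV6696-XLS-ENG.xlsx', sheet = 2))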
str(df_qwe)
## 'data.frame': 6347 obs. of 13 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Customer Age (in months) : num 67 67 55 63 57 58 57 46 56 56 ...
## $ Churn (1 = Yes, 0 = No) : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CHI Score Month 0 : num 0 62 0 231 43 138 180 116 78 78 ...
## $ CHI Score 0-1 : num 0 4 0 1 -1 -10 -5 -11 -7 -37 ...
## $ Support Cases Month 0 : num 0 0 0 1 0 0 1 0 1 0 ...
## $ Support Cases 0-1 : num 0 0 0 -1 0 0 1 0 -2 0 ...
## $ SP Month 0 : num 0 0 0 3 0 0 3 0 3 0 ...
## $ SP 0-1 : num 0 0 0 0 0 0 3 0 0 0 ...
## $ Logins 0-1 : num 0 0 0 167 0 43 13 0 -9 -7 ...
## $ Blog Articles 0-1 : num 0 0 0 -8 0 0 -1 0 1 0 ...
## $ Views 0-1 : num 0 -16 0 21996 9 ...
## $ Days Since Last Login 0-1: num 31 31 31 0 31 0 0 6 7 14 ...
summary(df_qwe)
## ID Customer Age (in months) Churn (1 = Yes, 0 = No)
## Min. : 1 Min. : 0.0 Min. :0.00000
## 1st Qu.:1588 1st Qu.: 5.0 1st Qu.:0.00000
## Median :3174 Median :11.0 Median :0.00000
## Mean :3174 Mean :13.9 Mean :0.05089
## 3rd Qu.:4760 3rd Qu.:20.0 3rd Qu.:0.00000
## Max. :6347 Max. :67.0 Max. :1.00000
## CHI Score Month 0 CHI Score 0-1 Support Cases Month 0
## Min. : 0.00 Min. :-125.000 Min. : 0.0000
## 1st Qu.: 24.50 1st Qu.: -8.000 1st Qu.: 0.0000
## Median : 87.00 Median : 0.000 Median : 0.0000
## Mean : 87.32 Mean : 5.059 Mean : 0.7063
## 3rd Qu.:139.00 3rd Qu.: 15.000 3rd Qu.: 1.0000
## Max. :298.00 Max. : 208.000 Max. :32.0000
## Support Cases 0-1 SP Month 0 SP 0-1
## Min. :-29.000000 Min. :0.0000 Min. :-4.00000
## 1st Qu.: 0.000000 1st Qu.:0.0000 1st Qu.: 0.00000
## Median : 0.000000 Median :0.0000 Median : 0.00000
## Mean : -0.006932 Mean :0.8128 Mean : 0.03017
## 3rd Qu.: 0.000000 3rd Qu.:2.6667 3rd Qu.: 0.00000
## Max. : 31.000000 Max. :4.0000 Max. : 4.00000
## Logins 0-1 Blog Articles 0-1 Views 0-1
## Min. :-293.00 Min. :-75.0000 Min. :-28322.00
## 1st Qu.: -1.00 1st Qu.: 0.0000 1st Qu.: -11.00
## Median : 2.00 Median : 0.0000 Median : 0.00
## Mean : 15.73 Mean : 0.1572 Mean : 96.31
## 3rd Qu.: 23.00 3rd Qu.: 0.0000 3rd Qu.: 27.00
## Max. : 865.00 Max. :217.0000 Max. :230414.00
## Days Since Last Login 0-1
## Min. :-648.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 1.765
## 3rd Qu.: 3.000
## Max. : 61.000
In the dataset, we note the following:
Month 0 denotes the current point in time (December 2011).
0-1 denotes the change from November to December. Negative values indicate a decrease and positive values an increase.
Churn indicates whether a customer left, that is, called with a request to cancel his or her contract, in the two months following December 1.
Support Cases counts the requests a customer raises when he or she runs into issues while using the system. These requests are routed to the technical staff.
SP (support priority) shows how serious an issue or request is. More support cases and a higher priority both point to more serious problems.
Usage covers logins, blog articles, views, and days since last login, and shows how active a user is. We expect customers with higher usage to be less likely to churn.
df_qwe$ID = as.factor(df_qwe$ID)
df_qwe$`Churn (1 = Yes, 0 = No)` = as.factor(df_qwe$`Churn (1 = Yes, 0 = No)`)
str(df_qwe)
## 'data.frame': 6347 obs. of 13 variables:
## $ ID : Factor w/ 6347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Customer Age (in months) : num 67 67 55 63 57 58 57 46 56 56 ...
## $ Churn (1 = Yes, 0 = No) : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ CHI Score Month 0 : num 0 62 0 231 43 138 180 116 78 78 ...
## $ CHI Score 0-1 : num 0 4 0 1 -1 -10 -5 -11 -7 -37 ...
## $ Support Cases Month 0 : num 0 0 0 1 0 0 1 0 1 0 ...
## $ Support Cases 0-1 : num 0 0 0 -1 0 0 1 0 -2 0 ...
## $ SP Month 0 : num 0 0 0 3 0 0 3 0 3 0 ...
## $ SP 0-1 : num 0 0 0 0 0 0 3 0 0 0 ...
## $ Logins 0-1 : num 0 0 0 167 0 43 13 0 -9 -7 ...
## $ Blog Articles 0-1 : num 0 0 0 -8 0 0 -1 0 1 0 ...
## $ Views 0-1 : num 0 -16 0 21996 9 ...
## $ Days Since Last Login 0-1: num 31 31 31 0 31 0 0 6 7 14 ...
The dataset has 13 variables, as seen above. We note that:
“ID” and “Churn (1 = Yes, 0 = No)” are categorical variables (stored as factors after the conversion above).
“Customer Age (in months)”, “CHI Score Month 0”, “CHI Score 0-1”, “Support Cases Month 0”, “Support Cases 0-1”, “SP Month 0”, “SP 0-1”, “Logins 0-1”, “Blog Articles 0-1”, “Views 0-1” and “Days Since Last Login 0-1” are continuous variables.
p1 <- df_qwe %>%
  ggplot(aes(x = `CHI Score Month 0`)) +
  geom_histogram(aes(y=..density..), binwidth = 50, colour="black", fill="white") +
  geom_density(alpha=.2, fill="Orange") +
  facet_grid(.~`Churn (1 = Yes, 0 = No)`) +
  labs( # add title and annotations; the y axis shows density, not raw counts
    title = 'Distribution of CHI score for December 2011 by different churn outcomes',
    tag = '3.a)', x = 'CHI score', y = 'Density',
    caption = "'0' stands for customers who didn't churn.\n'1' stands for customers who churned."
  ) +
  theme( # format font and size
    plot.title = element_text(size=20, face="bold"),
    axis.title.x = element_text(size=12, face="bold"),
    axis.title.y = element_text(size=12, face="bold")
  )
p2 <- df_qwe %>%
  group_by(`Customer Age (in months)`) %>%
  summarise(`Average Churn Rate` = mean(`Churn (1 = Yes, 0 = No)` == 1)) %>%
  ggplot(aes(x = `Customer Age (in months)`, y = `Average Churn Rate`, fill = `Average Churn Rate`)) +
  geom_col(color = 'black') +
  scale_fill_gradient(low="orange", high="red") +
  labs( # add title and annotations
    title = 'Average churn rate by customer age', tag = '3.b)',
    x = 'Customer Age (in months)', y = 'Churn Rate'
  ) +
  theme( # format font and size
    plot.title = element_text(size=20, face="bold"),
    axis.title.x = element_text(size=12, face="bold"),
    axis.title.y = element_text(size=12, face="bold")
  )
p3 <- df_qwe %>%
  filter(`Churn (1 = Yes, 0 = No)` == 1) %>%
  ggplot() +
  geom_bar(aes(`Customer Age (in months)`, fill = ..count..), color = 'black') +
  scale_fill_gradient(low="orange", high="red") +
  labs( # add title and annotations
    title = 'Number of customers who churn \nby customer age', tag = '3.c)',
    x = 'Customer Age (in months)', y = 'Number of churned customers'
  ) +
  theme( # format font and size
    plot.title = element_text(size=20, face="bold"),
    axis.title.x = element_text(size=12, face="bold"),
    axis.title.y = element_text(size=12, face="bold")
  )
We can see from figure 3.a) that the distribution of CHI score is right-skewed in both panels, and more strongly so for customers who churned (Churn = 1) than for those who did not (Churn = 0). For customers who did not churn, most CHI scores lie between 0 and 150, whereas for customers who churned, most lie between 0 and 100. On average, therefore, customers who churned had a lower CHI score than those who did not.
Next, we compare the average churn rate across customer ages. Figure 3.b) shows no clear trend in churn rate with customer age. Customers who have been with QWE for around 12 to 18 months, 27 months, 41 months, and 47 months are more likely to churn than the rest of the customers.
When we compare the number of customers who churned by customer age (figure 3.c), we find that customers who have stayed for 12 months churn the most.
Combining these two figures, even though the average churn rate is fairly high for some customers who have stayed for more than 24 months, the total number of churned customers among them is not high. Customers who have stayed for 12 months have both a high average churn rate and a high number of churned customers.
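The distinction between churn rate and churn count can be made concrete by putting both in one table. The following is a minimal sketch on hypothetical toy data (the `toy` data frame and its values are assumptions for illustration, not the QWE dataset). It uses base R only.

```r
# Hypothetical toy data (NOT the QWE dataset): age in months and a 0/1 churn flag
toy <- data.frame(
  age   = c(6, 6, 12, 12, 12, 12, 30, 30),
  churn = c(0, 0, 1, 1, 0, 0, 1, 0)
)

# Per age: number of customers and number of churned customers
n       <- tapply(toy$churn, toy$age, length)
churned <- tapply(toy$churn, toy$age, sum)

by_age <- data.frame(
  age     = as.numeric(names(n)),
  n       = as.vector(n),
  churned = as.vector(churned)
)
# Churn rate = churned customers / all customers of that age
by_age$rate <- by_age$churned / by_age$n

print(by_age)
```

In this toy example, ages 12 and 30 have the same churn rate but different churn counts, which is the pattern the two figures are compared for.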
df_churn0 <- df_qwe %>% filter(`Churn (1 = Yes, 0 = No)` == 0)
df_churn1 <- df_qwe %>% filter(`Churn (1 = Yes, 0 = No)` == 1)
tests <- list()
tests[[1]] <- t.test(df_churn0$`Customer Age (in months)`, df_churn1$`Customer Age (in months)`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[2]] <- t.test(df_churn0$`CHI Score Month 0`, df_churn1$`CHI Score Month 0`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[3]] <- t.test(df_churn0$`CHI Score 0-1`, df_churn1$`CHI Score 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[4]] <- t.test(df_churn0$`Support Cases Month 0` , df_churn1$`Support Cases Month 0` , paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[5]] <- t.test(df_churn0$`Support Cases 0-1` , df_churn1$`Support Cases 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[6]] <- t.test(df_churn0$`SP Month 0` , df_churn1$`SP Month 0`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[7]] <- t.test(df_churn0$`SP 0-1` , df_churn1$`SP 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[8]] <- t.test(df_churn0$`Logins 0-1`, df_churn1$`Logins 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[9]] <- t.test(df_churn0$`Blog Articles 0-1`, df_churn1$`Blog Articles 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[10]] <- t.test(df_churn0$`Views 0-1`, df_churn1$`Views 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tests[[11]] <- t.test(df_churn0$`Days Since Last Login 0-1`, df_churn1$`Days Since Last Login 0-1`, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
Variables = c("Customer Age (in months)", "CHI Score Month 0", "CHI Score 0-1","Support Cases Month 0","Support Cases 0-1","SP Month 0","SP 0-1","Logins 0-1","Blog Articles 0-1","Views 0-1","Days Since Last Login 0-1")
x <- t(sapply(tests, function(x) {
c(
round(x$estimate[2], 5),
round(x$estimate[1], 5),
p.value = round(x$p.value, 5))
}))
colnames(x) <- c('Mean of churned customers', 'Mean of un-churned customers', 'p-value')
rownames(x) <- Variables
knitr::kable(as.table(x), caption = "Statistic tests summary")
| Variable | Mean of churned customers | Mean of un-churned customers | p-value |
|---|---|---|---|
| Customer Age (in months) | 15.35294 | 13.81873 | 0.00306 |
| CHI Score Month 0 | 63.27245 | 88.60591 | 0.00000 |
| CHI Score 0-1 | -3.73684 | 5.53021 | 0.00000 |
| Support Cases Month 0 | 0.37152 | 0.72427 | 0.00000 |
| Support Cases 0-1 | 0.03715 | -0.00930 | 0.52775 |
| SP Month 0 | 0.49956 | 0.82958 | 0.00000 |
| SP 0-1 | -0.01670 | 0.03268 | 0.52182 |
| Logins 0-1 | 8.06192 | 16.13894 | 0.00040 |
| Blog Articles 0-1 | -0.10217 | 0.17115 | 0.01158 |
| Views 0-1 | -95.76780 | 106.60956 | 0.05631 |
| Days Since Last Login 0-1 | 6.48607 | 1.51145 | 0.00005 |
Each t-test here is a two-sample (independent samples) t-test. Because var.equal = FALSE, it is Welch's version, which does not assume equal variances in the two groups.
For each test, the null hypothesis is that the mean of the variable is the same for customers who churned and customers who did not, and the alternative is that the means differ. That is:
H0: mean(churn=0) - mean(churn=1) = 0
HA: mean(churn=0) - mean(churn=1) ≠ 0
The means are significantly different at the 95% confidence level if the p-value of the test is less than 0.05, in which case we reject the null hypothesis. For Customer Age (in months), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, SP Month 0, Logins 0-1, Blog Articles 0-1, and Days Since Last Login 0-1, the p-values are less than 0.05, indicating that the means of the two groups are significantly different. For Support Cases 0-1, SP 0-1, and Views 0-1, we cannot reject the null hypothesis. In other words, the first eight variables are significantly associated with churn, although a difference in means alone does not show that they cause it. The management team should therefore focus on these variables and investigate the relationship between churn and each of them, so that they can retain more customers.
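To make the test statistic behind these p-values explicit, here is a minimal sketch that computes the Welch t statistic by hand and compares it with `t.test()`. The two samples `stayed` and `churned` are hypothetical values chosen for illustration, not QWE data.

```r
# Hypothetical toy samples (NOT the QWE data), e.g. CHI scores in two groups
stayed  <- c(88, 120, 95, 140, 60, 110, 75, 130)
churned <- c(40, 70, 55, 90, 30)

# Welch statistic: difference in means divided by the unpooled standard error
se     <- sqrt(var(stayed) / length(stayed) + var(churned) / length(churned))
t_stat <- (mean(stayed) - mean(churned)) / se

# The same test via t.test(); var.equal = FALSE selects the Welch variant
res <- t.test(stayed, churned, paired = FALSE,
              alternative = "two.sided", var.equal = FALSE)

print(c(manual = t_stat, t_test = unname(res$statistic)))
print(res$p.value)  # reject H0 at the 5% level if this is below 0.05
```

The manual statistic and the one reported by `t.test()` should agree, which confirms that the table above compares group means without pooling the variances.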
reg_5 = glm(df_qwe$`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, data = df_qwe, family = binomial)
summary(reg_5)
##
## Call:
## glm(formula = df_qwe$`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` +
## `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` +
## `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` +
## `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
## family = binomial, data = df_qwe)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0047 -0.3542 -0.2957 -0.2328 3.0660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.763e+00 1.069e-01 -25.841 < 2e-16 ***
## `Customer Age (in months)` 1.271e-02 5.370e-03 2.366 0.01799 *
## `CHI Score Month 0` -4.657e-03 1.223e-03 -3.808 0.00014 ***
## `CHI Score 0-1` -1.027e-02 2.474e-03 -4.153 3.29e-05 ***
## `Support Cases Month 0` -1.524e-01 1.049e-01 -1.452 0.14643
## `Support Cases 0-1` 1.703e-01 9.050e-02 1.881 0.05992 .
## `SP Month 0` 1.593e-02 1.022e-01 0.156 0.87611
## `SP 0-1` -5.194e-02 7.852e-02 -0.661 0.50830
## `Logins 0-1` 2.893e-04 2.092e-03 0.138 0.89002
## `Blog Articles 0-1` 2.905e-04 1.960e-02 0.015 0.98817
## `Views 0-1` -1.098e-04 4.071e-05 -2.697 0.00700 **
## `Days Since Last Login 0-1` 1.724e-02 4.289e-03 4.020 5.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2553.1 on 6346 degrees of freedom
## Residual deviance: 2440.3 on 6335 degrees of freedom
## AIC: 2464.3
##
## Number of Fisher Scoring iterations: 7
From the regression above, we can identify the factors associated with customers churning out of QWE Inc. The intercept, CHI Score Month 0, CHI Score 0-1, and Days Since Last Login 0-1 are significant at the 99.9% level. Views 0-1 is significant at the 99% level and Customer Age (in months) at the 95% level. Support Cases 0-1 is only marginally significant (p ≈ 0.06). The remaining variables are not significant.
To be specific, holding the other variables constant, each additional month of customer age increases the odds of churning versus not churning by 1.28%. A one-unit increase in Support Cases 0-1 increases the odds of churning by 18.6%, and one more day in Days Since Last Login 0-1 increases them by 1.74%. Since these factors increase the likelihood of churn, Aggrawal and Wall should work to reduce support cases by improving the quality of service and to increase user activity. In addition, the longer customers stay, the more likely they are to churn. This might be because customers' needs are not met in the long term, so the appeal of QWE fades over time.
For the other factors, holding the other variables constant, a one-point higher CHI score in December decreases the odds of churning by 0.46%. A one-unit increase in CHI Score 0-1 decreases the odds of churning by 1.02%, and a one-unit increase in Views 0-1 decreases them by 0.01%. Therefore, Aggrawal and Wall should pay more attention to improving customer happiness and increasing the number of views.
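The percentages quoted above follow from exponentiating the logistic regression coefficients: a coefficient b changes the odds by (exp(b) - 1) × 100 percent per unit. The sketch below redoes that arithmetic with the estimates copied from the `summary(reg_5)` output. The vector `b` is typed in by hand here for illustration. With the fitted model available, `coef(reg_5)` would supply the same values.

```r
# Estimates copied by hand from the summary(reg_5) output above
b <- c(
  "Customer Age (in months)"  =  1.271e-02,
  "CHI Score Month 0"         = -4.657e-03,
  "CHI Score 0-1"             = -1.027e-02,
  "Support Cases 0-1"         =  1.703e-01,
  "Views 0-1"                 = -1.098e-04,
  "Days Since Last Login 0-1" =  1.724e-02
)

# Percent change in the odds of churning for a one-unit increase in each variable
pct_change <- (exp(b) - 1) * 100

print(round(pct_change, 2))
```

A positive value means the odds of churning rise with the variable, and a negative value means they fall.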
When all regressors equal zero, the odds of churning versus not churning are exp(-2.763) ≈ 0.063, that is, roughly 6.3 churning customers for every 100 who stay.
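Odds and probability are easy to confuse, so the following minimal sketch converts the intercept into both. The value `b0` is copied from the `summary(reg_5)` output. Note that "all regressors equal zero" describes a brand-new customer with a CHI score of zero and no recorded changes, so this baseline is a reference point rather than a typical customer.

```r
b0 <- -2.763            # intercept copied from summary(reg_5)

odds <- exp(b0)         # baseline odds of churning vs. not churning
prob <- plogis(b0)      # baseline probability, equal to odds / (1 + odds)

print(round(c(odds = odds, probability = prob), 4))
```

The probability is slightly smaller than the odds because it divides by all customers rather than only by those who stay.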
df_qwe$`Age Segments` = ifelse(df_qwe$`Customer Age (in months)` >= 13, 'Old',
ifelse(df_qwe$`Customer Age (in months)` <= 6, 'New', 'Medium'))
df_new <- df_qwe %>% filter(`Age Segments` == "New")
df_medium <- df_qwe %>% filter(`Age Segments` == "Medium")
df_old <- df_qwe %>% filter(`Age Segments` == "Old")
churn_new <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_new, family=binomial)
churn_medium <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_medium, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
churn_old <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_old, family=binomial)
summary(churn_new)
##
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` +
## `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` +
## `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` +
## `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
## family = binomial, data = df_new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6704 -0.2157 -0.1405 -0.1158 3.4027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.390e+00 5.046e-01 -10.682 < 2e-16 ***
## `Customer Age (in months)` 3.883e-01 1.330e-01 2.919 0.00351 **
## `CHI Score Month 0` -5.195e-03 4.579e-03 -1.135 0.25655
## `CHI Score 0-1` -1.910e-02 6.280e-03 -3.041 0.00236 **
## `Support Cases Month 0` -1.869e-01 1.567e-01 -1.193 0.23303
## `Support Cases 0-1` 2.467e-01 1.455e-01 1.695 0.09001 .
## `SP Month 0` 3.607e-01 2.013e-01 1.791 0.07321 .
## `SP 0-1` -2.254e-01 1.571e-01 -1.435 0.15129
## `Logins 0-1` 7.239e-03 3.061e-03 2.365 0.01802 *
## `Blog Articles 0-1` -1.649e-03 2.789e-02 -0.059 0.95285
## `Views 0-1` 8.911e-05 1.522e-04 0.586 0.55813
## `Days Since Last Login 0-1` 3.552e-02 2.013e-02 1.765 0.07765 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 425.14 on 2050 degrees of freedom
## Residual deviance: 380.69 on 2039 degrees of freedom
## AIC: 404.69
##
## Number of Fisher Scoring iterations: 7
summary(churn_medium)
##
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` +
## `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` +
## `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` +
## `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
## family = binomial, data = df_medium)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1506 -0.4174 -0.2815 -0.1777 3.0893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.819e+00 6.897e-01 -6.987 2.80e-12 ***
## `Customer Age (in months)` 3.187e-01 6.495e-02 4.907 9.25e-07 ***
## `CHI Score Month 0` -9.842e-03 2.301e-03 -4.277 1.89e-05 ***
## `CHI Score 0-1` -3.383e-03 4.669e-03 -0.724 0.4688
## `Support Cases Month 0` -1.772e-01 2.521e-01 -0.703 0.4823
## `Support Cases 0-1` 1.339e-01 1.824e-01 0.734 0.4628
## `SP Month 0` -1.828e-01 2.095e-01 -0.872 0.3830
## `SP 0-1` 2.001e-02 1.458e-01 0.137 0.8909
## `Logins 0-1` -9.685e-05 4.092e-03 -0.024 0.9811
## `Blog Articles 0-1` -4.751e-02 6.162e-02 -0.771 0.4407
## `Views 0-1` -1.332e-04 5.743e-05 -2.319 0.0204 *
## `Days Since Last Login 0-1` 1.468e-02 6.552e-03 2.240 0.0251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 793.58 on 1512 degrees of freedom
## Residual deviance: 684.75 on 1501 degrees of freedom
## AIC: 708.75
##
## Number of Fisher Scoring iterations: 7
summary(churn_old)
##
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` +
## `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` +
## `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` +
## `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
## family = binomial, data = df_old)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7415 -0.3938 -0.2910 -0.2070 3.1056
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.271e-01 2.650e-01 -2.744 0.00607 **
## `Customer Age (in months)` -3.984e-02 1.000e-02 -3.982 6.82e-05 ***
## `CHI Score Month 0` -1.146e-02 1.704e-03 -6.728 1.72e-11 ***
## `CHI Score 0-1` 2.699e-05 3.583e-03 0.008 0.99399
## `Support Cases Month 0` -9.970e-02 1.925e-01 -0.518 0.60460
## `Support Cases 0-1` 8.440e-02 1.531e-01 0.551 0.58137
## `SP Month 0` -3.853e-02 1.648e-01 -0.234 0.81520
## `SP 0-1` 4.755e-02 1.237e-01 0.384 0.70067
## `Logins 0-1` -3.835e-03 4.227e-03 -0.907 0.36423
## `Blog Articles 0-1` 1.376e-02 2.898e-02 0.475 0.63492
## `Views 0-1` -1.154e-04 7.307e-05 -1.580 0.11421
## `Days Since Last Login 0-1` 5.331e-04 3.284e-03 0.162 0.87105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1268.9 on 2782 degrees of freedom
## Residual deviance: 1167.7 on 2771 degrees of freedom
## AIC: 1191.7
##
## Number of Fisher Scoring iterations: 6
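The three age segments behind these models (New: up to 6 months, Medium: 7 to 12 months, Old: 13 months or more) were built with nested `ifelse()` calls. As an alternative sketch, `cut()` expresses the same boundaries in one call. The check below assumes ages are recorded in whole months, as they appear to be in this dataset, and uses a hypothetical vector `age` rather than the QWE data.

```r
# Hypothetical ages in whole months (NOT the QWE data)
age <- 0:67

# Segmentation as written in the analysis above
seg_ifelse <- ifelse(age >= 13, "Old",
                     ifelse(age <= 6, "New", "Medium"))

# Same boundaries with cut(): intervals (-Inf, 6], (6, 12], (12, Inf)
seg_cut <- cut(age, breaks = c(-Inf, 6, 12, Inf),
               labels = c("New", "Medium", "Old"))

print(table(seg_cut))
print(all(as.character(seg_cut) == seg_ifelse))
```

For non-integer ages between 12 and 13 the two versions would disagree, so the `cut()` form is equivalent only under the whole-month assumption.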
For new customers: The intercept, Customer Age (in months), and CHI Score 0-1 are significant at the 1% level, and Logins 0-1 is significant at the 5% level. Support Cases 0-1, SP Month 0, and Days Since Last Login 0-1 are only marginally significant (10% level). The log odds of churn are -5.390e+00 when all regressors equal zero. Holding the other variables constant, a one-unit increase in CHI Score 0-1 decreases the log odds of churn by 1.910e-02. A one-unit increase increases the log odds of churn by 3.883e-01 for Customer Age (in months), by 2.467e-01 for Support Cases 0-1, by 3.607e-01 for SP Month 0, by 7.239e-03 for Logins 0-1, and by 3.552e-02 for Days Since Last Login 0-1.
For medium customers: The intercept, Customer Age (in months), and CHI Score Month 0 are significant at the 0.1% level, and Views 0-1 and Days Since Last Login 0-1 are significant at the 5% level. The log odds of churn are -4.819e+00 when all regressors equal zero. Holding the other variables constant, a one-unit increase decreases the log odds of churn by 9.842e-03 for CHI Score Month 0 and by 1.332e-04 for Views 0-1. A one-unit increase increases the log odds of churn by 3.187e-01 for Customer Age (in months) and by 1.468e-02 for Days Since Last Login 0-1.
For old customers: The intercept, Customer Age (in months), and CHI Score Month 0 are significant. The log odds of churn are -7.271e-01 when all regressors equal zero. Holding the other variables constant, a one-unit increase decreases the log odds of churn by 3.984e-02 for Customer Age (in months) and by 1.146e-02 for CHI Score Month 0.
Comparing the segments: CHI Score Month 0 is significant for medium and old customers, but not for new customers. Logins 0-1 and CHI Score 0-1 are significant for new customers, and Support Cases 0-1 and SP Month 0 are marginally significant for them, but none of the four is significant for medium or old customers. Views 0-1 is significant for medium customers, but not for new or old customers. Days Since Last Login 0-1 is significant for medium customers and marginally significant for new customers, but not for old customers.
Customer Age (in months) is the only variable that is significant in all three segments. Its coefficient is similar in the new and medium segments (3.883e-01 vs. 3.187e-01), but it is much smaller in magnitude and has the opposite sign in the old segment (-3.984e-02).
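The comparison of the Customer Age coefficients across segments can be checked informally. The sketch below is an added illustration, not part of the original analysis: it applies a Wald-type z statistic for the difference of two coefficients, z = (b1 - b2) / sqrt(se1^2 + se2^2), assuming the estimates are independent (plausible here because the three segments contain different customers). Estimates and standard errors are copied from the three summaries above. The helper `z_diff` is a hypothetical name.

```r
# Customer Age estimates and standard errors copied from the three summaries
est <- c(New = 3.883e-01, Medium = 3.187e-01, Old = -3.984e-02)
se  <- c(New = 1.330e-01, Medium = 6.495e-02, Old = 1.000e-02)

# Hypothetical helper: z statistic for the difference of two independent estimates
z_diff <- function(a, b) {
  (est[[a]] - est[[b]]) / sqrt(se[[a]]^2 + se[[b]]^2)
}

z <- c(
  "New vs Medium" = z_diff("New", "Medium"),
  "New vs Old"    = z_diff("New", "Old"),
  "Medium vs Old" = z_diff("Medium", "Old")
)
p <- 2 * pnorm(-abs(z))  # two-sided p-values under a normal approximation

print(round(cbind(z = z, p = p), 4))
```

Under these assumptions, an absolute z value above 1.96 indicates a difference at the 5% level, which gives a rough check on whether the old segment really differs from the other two.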
In this assignment, we looked at the likelihood of churn at QWE with respect to 11 variables. First, we found that customers with high CHI scores are less likely to churn and that customers who have stayed with the company for more than about 10 months are more likely to churn. We then ran a statistical test for each of the variables and found that the means of Customer Age (in months), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, SP Month 0, Logins 0-1, Blog Articles 0-1, and Days Since Last Login 0-1 differ significantly between customers who churned and those who did not. To look further into the issue, we ran a logistic regression on the 11 variables, first on all customers and then separately on three segments based on customer age. Interestingly, in the new and medium customer segments, the likelihood of churn increases as customer age increases, while for old customers it decreases as customer age increases.