Introduction

In this project, Richard Wall wanted to estimate the probability that a given customer would leave in the near future and to identify the drivers that contributed most to that decision. With this information, QWE could reach out to at-risk customers, improve their experience with QWE services, and avert churn without giving away costly discounts. In particular, he wanted to know how churn relates to customer longevity in months, CHI score, number of support cases, average support priority, and usage information (logins, blogs, views, and days since last login).

To analyze these relationships, his associate V. J. Aggrawal pulled data on both the value of each characteristic as of December 1, 2011, and its change from November to December, and added a column indicating whether the customer actually left in the two months following December 1. The final dataset includes the following variables: ID, Customer Age (in months), Churn (1 = Yes, 0 = No), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, Support Cases 0-1, SP Month 0, SP 0-1, Logins 0-1, Blog Articles 0-1, Views 0-1, Days Since Last Login 0-1.

One caveat: to simplify the problem, Aggrawal pulled data for only two months. Therefore, the regressions below may not be as precise as regressions based on a longer period.

Data Description & Cleaning

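library(readxl)   # read_excel()
library(dplyr)    # %>%, filter(), group_by(), summarise()
library(ggplot2)  # plots in the Visualizations section
df_qwe = as.data.frame(read_excel('UV6696-XLS-ENG.xlsx', sheet = 2))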
str(df_qwe)
## 'data.frame':    6347 obs. of  13 variables:
##  $ ID                       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Customer Age (in months) : num  67 67 55 63 57 58 57 46 56 56 ...
##  $ Churn (1 = Yes, 0 = No)  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CHI Score Month 0        : num  0 62 0 231 43 138 180 116 78 78 ...
##  $ CHI Score 0-1            : num  0 4 0 1 -1 -10 -5 -11 -7 -37 ...
##  $ Support Cases Month 0    : num  0 0 0 1 0 0 1 0 1 0 ...
##  $ Support Cases 0-1        : num  0 0 0 -1 0 0 1 0 -2 0 ...
##  $ SP Month 0               : num  0 0 0 3 0 0 3 0 3 0 ...
##  $ SP 0-1                   : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ Logins 0-1               : num  0 0 0 167 0 43 13 0 -9 -7 ...
##  $ Blog Articles 0-1        : num  0 0 0 -8 0 0 -1 0 1 0 ...
##  $ Views 0-1                : num  0 -16 0 21996 9 ...
##  $ Days Since Last Login 0-1: num  31 31 31 0 31 0 0 6 7 14 ...
summary(df_qwe)
##        ID       Customer Age (in months) Churn (1 = Yes, 0 = No)
##  Min.   :   1   Min.   : 0.0             Min.   :0.00000        
##  1st Qu.:1588   1st Qu.: 5.0             1st Qu.:0.00000        
##  Median :3174   Median :11.0             Median :0.00000        
##  Mean   :3174   Mean   :13.9             Mean   :0.05089        
##  3rd Qu.:4760   3rd Qu.:20.0             3rd Qu.:0.00000        
##  Max.   :6347   Max.   :67.0             Max.   :1.00000        
##  CHI Score Month 0 CHI Score 0-1      Support Cases Month 0
##  Min.   :  0.00    Min.   :-125.000   Min.   : 0.0000      
##  1st Qu.: 24.50    1st Qu.:  -8.000   1st Qu.: 0.0000      
##  Median : 87.00    Median :   0.000   Median : 0.0000      
##  Mean   : 87.32    Mean   :   5.059   Mean   : 0.7063      
##  3rd Qu.:139.00    3rd Qu.:  15.000   3rd Qu.: 1.0000      
##  Max.   :298.00    Max.   : 208.000   Max.   :32.0000      
##  Support Cases 0-1      SP Month 0         SP 0-1        
##  Min.   :-29.000000   Min.   :0.0000   Min.   :-4.00000  
##  1st Qu.:  0.000000   1st Qu.:0.0000   1st Qu.: 0.00000  
##  Median :  0.000000   Median :0.0000   Median : 0.00000  
##  Mean   : -0.006932   Mean   :0.8128   Mean   : 0.03017  
##  3rd Qu.:  0.000000   3rd Qu.:2.6667   3rd Qu.: 0.00000  
##  Max.   : 31.000000   Max.   :4.0000   Max.   : 4.00000  
##    Logins 0-1      Blog Articles 0-1    Views 0-1        
##  Min.   :-293.00   Min.   :-75.0000   Min.   :-28322.00  
##  1st Qu.:  -1.00   1st Qu.:  0.0000   1st Qu.:   -11.00  
##  Median :   2.00   Median :  0.0000   Median :     0.00  
##  Mean   :  15.73   Mean   :  0.1572   Mean   :    96.31  
##  3rd Qu.:  23.00   3rd Qu.:  0.0000   3rd Qu.:    27.00  
##  Max.   : 865.00   Max.   :217.0000   Max.   :230414.00  
##  Days Since Last Login 0-1
##  Min.   :-648.000         
##  1st Qu.:   0.000         
##  Median :   0.000         
##  Mean   :   1.765         
##  3rd Qu.:   3.000         
##  Max.   :  61.000

In the dataset, we note the following:

  1. Month 0 denotes the current point in time (December 1, 2011).

  2. 0-1 denotes the change from November to December. Negative values denote a decrease and positive values an increase.

  3. Churn indicates whether the customer left, that is, called with a request to cancel his or her contract, in the two months following December 1.

  4. Support Cases counts the requests a customer filed after running into issues while using the system. These requests are routed to the technical staff.

  5. SP (support priority) shows how serious the issues or requests are. The higher the priority, the more serious the issues.

  6. Usage covers logins, blog articles, views, and days since last login, and shows how active a user is. We expect higher usage to go with a lower likelihood of churn.

df_qwe$ID = as.factor(df_qwe$ID)
df_qwe$`Churn (1 = Yes, 0 = No)` = as.factor(df_qwe$`Churn (1 = Yes, 0 = No)`)
str(df_qwe)
## 'data.frame':    6347 obs. of  13 variables:
##  $ ID                       : Factor w/ 6347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Customer Age (in months) : num  67 67 55 63 57 58 57 46 56 56 ...
##  $ Churn (1 = Yes, 0 = No)  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CHI Score Month 0        : num  0 62 0 231 43 138 180 116 78 78 ...
##  $ CHI Score 0-1            : num  0 4 0 1 -1 -10 -5 -11 -7 -37 ...
##  $ Support Cases Month 0    : num  0 0 0 1 0 0 1 0 1 0 ...
##  $ Support Cases 0-1        : num  0 0 0 -1 0 0 1 0 -2 0 ...
##  $ SP Month 0               : num  0 0 0 3 0 0 3 0 3 0 ...
##  $ SP 0-1                   : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ Logins 0-1               : num  0 0 0 167 0 43 13 0 -9 -7 ...
##  $ Blog Articles 0-1        : num  0 0 0 -8 0 0 -1 0 1 0 ...
##  $ Views 0-1                : num  0 -16 0 21996 9 ...
##  $ Days Since Last Login 0-1: num  31 31 31 0 31 0 0 6 7 14 ...

The dataset has 13 variables, as seen above. We note the following:

  1. “ID” and “Churn (1 = Yes, 0 = No)” are categorical variables, which is why we converted them to factors above.

  2. “Customer Age (in months)”, “CHI Score Month 0”, “CHI Score 0-1”, “Support Cases Month 0”, “Support Cases 0-1”, “SP Month 0”, “SP 0-1”, “Logins 0-1”, “Blog Articles 0-1”, “Views 0-1” and “Days Since Last Login 0-1” are numeric variables.

Visualizations

p1 <- df_qwe %>%
  ggplot(aes(x = `CHI Score Month 0`)) +
  geom_histogram(aes(y=..density..), binwidth = 50, colour="black", fill="white") + 
  geom_density(alpha=.2, fill="Orange") +
  facet_grid(.~`Churn (1 = Yes, 0 = No)`) +
  labs(  # add title and annotations
      title = 'Distribution of CHI score for December 2011 by churn outcome',
      tag = '3.a)', x = 'CHI score', y = 'Density',
      caption = "'0' stands for customers who did not churn.\n'1' stands for customers who churned."
      ) +
  theme(  # format font and size
      plot.title = element_text(size=20, face="bold"),
      axis.title.x = element_text(size=12, face="bold"),
      axis.title.y = element_text(size=12, face="bold")
      )
p2 <- df_qwe %>%  
  group_by(`Customer Age (in months)`) %>% 
  summarise(`Average Churn Rate` = mean(`Churn (1 = Yes, 0 = No)` == 1)) %>%
  ggplot(aes(x = `Customer Age (in months)`, y = `Average Churn Rate`, fill = `Average Churn Rate`)) +
  geom_col(color = 'black') +
  scale_fill_gradient(low="orange", high="red") +
  labs(  # add title and annotations
      title = 'Average churn rate by customer age', tag = '3.b)',
      x = 'Customer Age (in months)', y = 'Churn Rate') +
  theme(  # format font and size
      plot.title = element_text(size=20, face="bold"),
      axis.title.x = element_text(size=12, face="bold"),
      axis.title.y = element_text(size=12, face="bold")
      )
p3 <- df_qwe %>%
  filter(`Churn (1 = Yes, 0 = No)` == 1) %>%
  ggplot() +
  geom_bar(aes(`Customer Age (in months)`, fill = ..count..), color = 'black') +
  scale_fill_gradient(low="orange", high="red") +
  labs(  # add title and annotations
      title = 'Number of customers who churn \nby customer age', tag = '3.c)',
      x = 'Customer Age (in months)', y = 'Number of churned customers') +
  theme(  # format font and size
      plot.title = element_text(size=20, face="bold"),
      axis.title.x = element_text(size=12, face="bold"),
      axis.title.y = element_text(size=12, face="bold")
      )

The plot shows that the distribution of CHI score is right-skewed in both panels, and that the right panel (Churn = 1) is more right-skewed than the left panel (Churn = 0). For customers who did not churn, most CHI scores lie between 0 and 150, while for customers who churned, most lie between 0 and 100. Therefore, on average, customers who churned had a lower CHI score than those who did not.

Next, we compare the average churn rate across customer ages. Figure 3.b) shows no clear trend in churn rate with customer age. Customers who have been with QWE for around 12 to 18 months, 27 months, 41 months, or 47 months are more likely to churn than the rest of the customers.

When we compare the number of churned customers by customer age (figure 3.c)), we find that customers who have stayed for 12 months churn the most.

Combining these two figures, even though the average churn rate is fairly high for some customer ages above 24 months, the total number of churned customers at those ages is not high. This suggests that few customers have been with QWE that long, which is consistent with the third quartile of customer age being 20 months. Customers who have stayed for about 12 months show both a high average churn rate and a high number of churned customers.
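The difference between a high churn rate and a high churn count can be made concrete with a small sketch. The data frame `toy` and its values are invented for illustration and are not taken from the QWE dataset; the point is only that a small age group can show a high rate while contributing few churned customers.

```r
# Hypothetical toy data (not the QWE data): customer age in months and churn flag
toy <- data.frame(
  age   = c(rep(12, 10), rep(40, 2)),
  churn = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0,   # ten 12-month customers, 3 churned
            1, 0)                           # two 40-month customers, 1 churned
)

# Churn rate = share of churned customers within each age group
rate_by_age  <- tapply(toy$churn, toy$age, mean)
# Churn count = number of churned customers within each age group
count_by_age <- tapply(toy$churn, toy$age, sum)

by_age <- data.frame(
  age   = as.numeric(names(rate_by_age)),
  rate  = as.vector(rate_by_age),
  count = as.vector(count_by_age)
)
print(by_age)
```

In this toy example the 40-month group has the higher rate but the 12-month group has the higher count, which mirrors the pattern read off figures 3.b) and 3.c).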

Statistical Analyses

df_churn0 <- df_qwe %>% filter(`Churn (1 = Yes, 0 = No)` == 0) 
df_churn1 <- df_qwe %>% filter(`Churn (1 = Yes, 0 = No)` == 1)

# Numeric variables to compare between un-churned and churned customers
Variables =  c("Customer Age (in months)", "CHI Score Month 0", "CHI Score 0-1","Support Cases Month 0","Support Cases 0-1","SP Month 0","SP 0-1","Logins 0-1","Blog Articles 0-1","Views 0-1","Days Since Last Login 0-1")

# Run the same two-sample t-test (unequal variances) for every variable
tests <- lapply(Variables, function(v) {
  t.test(df_churn0[[v]], df_churn1[[v]], paired = FALSE, alternative = "two.sided", var.equal = FALSE)
})

x <- t(sapply(tests, function(x) {
     c(
       round(x$estimate[2], 5),
       round(x$estimate[1], 5),
       p.value = round(x$p.value, 5))
}))

colnames(x) <- c('Mean of churned customers', 'Mean of un-churned customers', 'p-value')
rownames(x) <- Variables
knitr::kable(as.table(x), caption = "Statistic tests summary")
Statistic tests summary

| Variable                  | Mean of churned customers | Mean of un-churned customers | p-value |
|:--------------------------|--------------------------:|-----------------------------:|--------:|
| Customer Age (in months)  |                  15.35294 |                     13.81873 | 0.00306 |
| CHI Score Month 0         |                  63.27245 |                     88.60591 | 0.00000 |
| CHI Score 0-1             |                  -3.73684 |                      5.53021 | 0.00000 |
| Support Cases Month 0     |                   0.37152 |                      0.72427 | 0.00000 |
| Support Cases 0-1         |                   0.03715 |                     -0.00930 | 0.52775 |
| SP Month 0                |                   0.49956 |                      0.82958 | 0.00000 |
| SP 0-1                    |                  -0.01670 |                      0.03268 | 0.52182 |
| Logins 0-1                |                   8.06192 |                     16.13894 | 0.00040 |
| Blog Articles 0-1         |                  -0.10217 |                      0.17115 | 0.01158 |
| Views 0-1                 |                 -95.76780 |                    106.60956 | 0.05631 |
| Days Since Last Login 0-1 |                   6.48607 |                      1.51145 | 0.00005 |

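Each of the t-tests here is a two-sample t-test for independent samples. Because we set var.equal = FALSE, R runs the Welch version, which does not assume equal variances in the two groups.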

For each test, the null hypothesis is that the mean of the variable is the same for customers who churned and for those who did not; the alternative is that the means differ. That is:

H0: mean(churn=0) - mean(churn=1) = 0
HA: mean(churn=0) - mean(churn=1) ≠ 0

The means differ significantly at the 95% confidence level if the p-value of the test is less than 0.05, in which case we reject the null hypothesis. Among the tests above, the p-value is less than 0.05 for Customer Age (in months), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, SP Month 0, Logins 0-1, Blog Articles 0-1 and Days Since Last Login 0-1, indicating that the means of the two groups are significantly different. In other words, these variables are significantly associated with churn. The management team should therefore focus on these variables and examine how each one relates to churn, so that QWE can retain more customers.
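To make the test concrete, the sketch below computes the Welch t statistic, its degrees of freedom, and the two-sided p-value by hand and compares them with `t.test()`. The vectors `stayed` and `churned` are hypothetical values chosen for illustration, not QWE data.

```r
# Hypothetical toy samples (not the QWE data)
stayed  <- c(90, 120, 75, 140, 60, 110, 95, 130)
churned <- c(40, 85, 55, 70, 30, 65)

# Per-group variance of the sample mean
v1 <- var(stayed) / length(stayed)
v2 <- var(churned) / length(churned)

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (mean(stayed) - mean(churned)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(stayed) - 1) + v2^2 / (length(churned) - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_welch)

# Same test via t.test(), with the settings used in this section
res <- t.test(stayed, churned, paired = FALSE,
              alternative = "two.sided", var.equal = FALSE)
print(c(t = t_stat, df = df_welch, p.value = p_val))
print(res)
```

The hand-computed values should agree with the `t`, `df` and `p-value` fields printed by `t.test()`.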

Logistic Regressions

reg_5 = glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, data = df_qwe, family = binomial)
summary(reg_5)
## 
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + 
##     `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + 
##     `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + 
##     `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, 
##     family = binomial, data = df_qwe)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0047  -0.3542  -0.2957  -0.2328   3.0660  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.763e+00  1.069e-01 -25.841  < 2e-16 ***
## `Customer Age (in months)`   1.271e-02  5.370e-03   2.366  0.01799 *  
## `CHI Score Month 0`         -4.657e-03  1.223e-03  -3.808  0.00014 ***
## `CHI Score 0-1`             -1.027e-02  2.474e-03  -4.153 3.29e-05 ***
## `Support Cases Month 0`     -1.524e-01  1.049e-01  -1.452  0.14643    
## `Support Cases 0-1`          1.703e-01  9.050e-02   1.881  0.05992 .  
## `SP Month 0`                 1.593e-02  1.022e-01   0.156  0.87611    
## `SP 0-1`                    -5.194e-02  7.852e-02  -0.661  0.50830    
## `Logins 0-1`                 2.893e-04  2.092e-03   0.138  0.89002    
## `Blog Articles 0-1`          2.905e-04  1.960e-02   0.015  0.98817    
## `Views 0-1`                 -1.098e-04  4.071e-05  -2.697  0.00700 ** 
## `Days Since Last Login 0-1`  1.724e-02  4.289e-03   4.020 5.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2553.1  on 6346  degrees of freedom
## Residual deviance: 2440.3  on 6335  degrees of freedom
## AIC: 2464.3
## 
## Number of Fisher Scoring iterations: 7

From the regression above, we can identify the factors associated with customers churning out of QWE Inc. The intercept, CHI Score Month 0, CHI Score 0-1 and Days Since Last Login 0-1 are significant at the 99.9% level. Views 0-1 is significant at the 99% level and Customer Age (in months) at the 95% level. Support Cases 0-1 is only marginally significant (p ≈ 0.06). The remaining variables are not significant.

To be specific, holding the other variables constant, each additional month of customer age increases the odds of churning versus not churning by 1.28%. A one-unit increase in Support Cases 0-1 increases the odds of churning by 18.6%, and a one-unit increase in Days Since Last Login 0-1 increases them by 1.74%. Since these factors raise the likelihood of churn, Aggrawal and Wall should work to reduce the number of support cases by improving the quality of service, and to increase user activity. In addition, the longer customers stay, the more likely they are to churn. This might be because customers’ needs are not met in the long term, so the appeal of QWE fades over time.

For the other factors, holding the other variables constant, a one-point higher CHI score in December decreases the odds of churning by 0.46%. A one-unit increase in CHI Score 0-1 decreases the odds of churning by 1.02%, and a one-unit increase in Views 0-1 decreases them by 0.01%. Therefore, Aggrawal and Wall should pay more attention to improving customer happiness and increasing views.

The odds of churning when all regressors equal zero are about 0.063 (6.3%), which corresponds to a churn probability of about 5.9%.
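The percentages quoted in this section come from exponentiating the logit coefficients. The following sketch reproduces that arithmetic from the estimates printed in the `summary(reg_5)` output; the vector `coefs` and its element names are introduced here for illustration only.

```r
# Coefficients copied from the summary(reg_5) output above
# (the element names are illustrative labels, not column names)
coefs <- c(
  intercept               = -2.763,
  customer_age            =  1.271e-02,
  chi_month0              = -4.657e-03,
  chi_change              = -1.027e-02,
  support_cases_change    =  1.703e-01,
  views_change            = -1.098e-04,
  days_since_login_change =  1.724e-02
)

# Percentage change in the odds of churning for a one-unit increase:
# (exp(beta) - 1) * 100
pct_change <- (exp(coefs[-1]) - 1) * 100
print(round(pct_change, 2))

# Baseline odds when every regressor is zero, and the implied probability
base_odds <- exp(coefs[["intercept"]])
base_prob <- base_odds / (1 + base_odds)
print(round(c(odds = base_odds, probability = base_prob), 4))
```

A positive coefficient gives a percentage increase in the odds and a negative coefficient a decrease, so the sign of each entry in `pct_change` matches the direction described in the text.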

Customer Segmentation

df_qwe$`Age Segments` = ifelse(df_qwe$`Customer Age (in months)` >= 13, 'Old',
                     ifelse(df_qwe$`Customer Age (in months)` <= 6, 'New', 'Medium'))

df_new <- df_qwe %>% filter(`Age Segments` == "New")
df_medium <- df_qwe %>% filter(`Age Segments` == "Medium")
df_old <- df_qwe %>% filter(`Age Segments` == "Old")

churn_new <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_new, family=binomial)
churn_medium <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_medium, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
churn_old <- glm(`Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)`+`CHI Score Month 0`+`CHI Score 0-1`+`Support Cases Month 0`+`Support Cases 0-1`+`SP Month 0`+`SP 0-1`+`Logins 0-1`+`Blog Articles 0-1`+`Views 0-1`+`Days Since Last Login 0-1`, data=df_old, family=binomial)

summary(churn_new)
## 
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + 
##     `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + 
##     `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + 
##     `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, 
##     family = binomial, data = df_new)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6704  -0.2157  -0.1405  -0.1158   3.4027  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.390e+00  5.046e-01 -10.682  < 2e-16 ***
## `Customer Age (in months)`   3.883e-01  1.330e-01   2.919  0.00351 ** 
## `CHI Score Month 0`         -5.195e-03  4.579e-03  -1.135  0.25655    
## `CHI Score 0-1`             -1.910e-02  6.280e-03  -3.041  0.00236 ** 
## `Support Cases Month 0`     -1.869e-01  1.567e-01  -1.193  0.23303    
## `Support Cases 0-1`          2.467e-01  1.455e-01   1.695  0.09001 .  
## `SP Month 0`                 3.607e-01  2.013e-01   1.791  0.07321 .  
## `SP 0-1`                    -2.254e-01  1.571e-01  -1.435  0.15129    
## `Logins 0-1`                 7.239e-03  3.061e-03   2.365  0.01802 *  
## `Blog Articles 0-1`         -1.649e-03  2.789e-02  -0.059  0.95285    
## `Views 0-1`                  8.911e-05  1.522e-04   0.586  0.55813    
## `Days Since Last Login 0-1`  3.552e-02  2.013e-02   1.765  0.07765 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 425.14  on 2050  degrees of freedom
## Residual deviance: 380.69  on 2039  degrees of freedom
## AIC: 404.69
## 
## Number of Fisher Scoring iterations: 7
summary(churn_medium)
## 
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + 
##     `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + 
##     `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + 
##     `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, 
##     family = binomial, data = df_medium)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1506  -0.4174  -0.2815  -0.1777   3.0893  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -4.819e+00  6.897e-01  -6.987 2.80e-12 ***
## `Customer Age (in months)`   3.187e-01  6.495e-02   4.907 9.25e-07 ***
## `CHI Score Month 0`         -9.842e-03  2.301e-03  -4.277 1.89e-05 ***
## `CHI Score 0-1`             -3.383e-03  4.669e-03  -0.724   0.4688    
## `Support Cases Month 0`     -1.772e-01  2.521e-01  -0.703   0.4823    
## `Support Cases 0-1`          1.339e-01  1.824e-01   0.734   0.4628    
## `SP Month 0`                -1.828e-01  2.095e-01  -0.872   0.3830    
## `SP 0-1`                     2.001e-02  1.458e-01   0.137   0.8909    
## `Logins 0-1`                -9.685e-05  4.092e-03  -0.024   0.9811    
## `Blog Articles 0-1`         -4.751e-02  6.162e-02  -0.771   0.4407    
## `Views 0-1`                 -1.332e-04  5.743e-05  -2.319   0.0204 *  
## `Days Since Last Login 0-1`  1.468e-02  6.552e-03   2.240   0.0251 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 793.58  on 1512  degrees of freedom
## Residual deviance: 684.75  on 1501  degrees of freedom
## AIC: 708.75
## 
## Number of Fisher Scoring iterations: 7
summary(churn_old)
## 
## Call:
## glm(formula = `Churn (1 = Yes, 0 = No)` ~ `Customer Age (in months)` + 
##     `CHI Score Month 0` + `CHI Score 0-1` + `Support Cases Month 0` + 
##     `Support Cases 0-1` + `SP Month 0` + `SP 0-1` + `Logins 0-1` + 
##     `Blog Articles 0-1` + `Views 0-1` + `Days Since Last Login 0-1`, 
##     family = binomial, data = df_old)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7415  -0.3938  -0.2910  -0.2070   3.1056  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.271e-01  2.650e-01  -2.744  0.00607 ** 
## `Customer Age (in months)`  -3.984e-02  1.000e-02  -3.982 6.82e-05 ***
## `CHI Score Month 0`         -1.146e-02  1.704e-03  -6.728 1.72e-11 ***
## `CHI Score 0-1`              2.699e-05  3.583e-03   0.008  0.99399    
## `Support Cases Month 0`     -9.970e-02  1.925e-01  -0.518  0.60460    
## `Support Cases 0-1`          8.440e-02  1.531e-01   0.551  0.58137    
## `SP Month 0`                -3.853e-02  1.648e-01  -0.234  0.81520    
## `SP 0-1`                     4.755e-02  1.237e-01   0.384  0.70067    
## `Logins 0-1`                -3.835e-03  4.227e-03  -0.907  0.36423    
## `Blog Articles 0-1`          1.376e-02  2.898e-02   0.475  0.63492    
## `Views 0-1`                 -1.154e-04  7.307e-05  -1.580  0.11421    
## `Days Since Last Login 0-1`  5.331e-04  3.284e-03   0.162  0.87105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1268.9  on 2782  degrees of freedom
## Residual deviance: 1167.7  on 2771  degrees of freedom
## AIC: 1191.7
## 
## Number of Fisher Scoring iterations: 6

For new customers: The intercept, Customer Age (in months) and CHI Score 0-1 are significant at the 1% level, and Logins 0-1 at the 5% level; Support Cases 0-1, SP Month 0 and Days Since Last Login 0-1 are significant only at the 10% level. The log odds of churn are -5.390e+00 when all the regressors are equal to zero. Holding the other variables constant, the log odds of churn would decrease by 1.910e-02 if CHI Score 0-1 increases by one unit. The log odds of churn would increase by 3.883e-01 if Customer Age (in months) increases by one unit; by 2.467e-01 if Support Cases 0-1 increases by one unit; by 3.607e-01 if SP Month 0 increases by one unit; by 7.239e-03 if Logins 0-1 increases by one unit; and by 3.552e-02 if Days Since Last Login 0-1 increases by one unit.

For medium customers: The intercept, Customer Age (in months), CHI Score Month 0, Views 0-1 and Days Since Last Login 0-1 are significant at the 5% level or better. The log odds of churn are -4.819e+00 when all the regressors are equal to zero. Holding the other variables constant, the log odds of churn would decrease by 9.842e-03 if CHI Score Month 0 increases by one unit, and by 1.332e-04 if Views 0-1 increases by one unit. The log odds of churn would increase by 3.187e-01 if Customer Age (in months) increases by one unit, and by 1.468e-02 if Days Since Last Login 0-1 increases by one unit.

For old customers: The intercept, Customer Age (in months) and CHI Score Month 0 are significant. The log odds of churn are -7.271e-01 when all the regressors are equal to zero. Holding the other variables constant, the log odds of churn would decrease by 3.984e-02 if Customer Age (in months) increases by one unit, and by 1.146e-02 if CHI Score Month 0 increases by one unit.

Comparing the segments: CHI Score Month 0 is significant for medium and old customers, but not for new customers. Logins 0-1 and CHI Score 0-1 are significant for new customers, and Support Cases 0-1 and SP Month 0 are marginally significant for them (10% level), but none of these is significant for medium and old customers. Views 0-1 is significant for medium customers, but not for new and old customers. Days Since Last Login 0-1 is significant for medium customers and marginally significant for new customers, but not for old customers.

Customer Age (in months) is significant in all three segments. Its coefficient is of similar magnitude in the new and medium segments (3.883e-01 vs. 3.187e-01), but it is much smaller and of the opposite sign in the old segment (-3.984e-02).
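One way to check whether the Customer Age coefficients really differ between segments is an approximate z-test on the difference between two estimates. This test is not part of the original analysis; it is a sketch that assumes the segment estimates are independent (the segments contain different customers) and approximately normal. The helper `coef_diff_z` is a hypothetical name introduced here.

```r
# Customer Age coefficients and standard errors copied from the three
# segment summaries above
est <- c(new = 3.883e-01, medium = 3.187e-01, old = -3.984e-02)
se  <- c(new = 1.330e-01, medium = 6.495e-02, old =  1.000e-02)

# Hypothetical helper: z statistic and two-sided p-value for the difference
# between two independent coefficient estimates
coef_diff_z <- function(a, b) {
  z <- (est[[a]] - est[[b]]) / sqrt(se[[a]]^2 + se[[b]]^2)
  c(z = z, p.value = 2 * pnorm(-abs(z)))
}

comparisons <- rbind(
  new_vs_medium = coef_diff_z("new", "medium"),
  new_vs_old    = coef_diff_z("new", "old"),
  medium_vs_old = coef_diff_z("medium", "old")
)
print(round(comparisons, 4))
```

Under these assumptions, the new and medium coefficients cannot be distinguished at the 5% level, whereas each of them differs from the old-segment coefficient.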

Final Reflections

In this assignment, we examined the likelihood of churn at QWE with respect to 11 variables. First, we found that customers with high CHI scores are less likely to churn and that customers who have stayed with the company for more than about 10 months are more likely to churn. We then ran a statistical test on each variable and found that the means of Customer Age (in months), CHI Score Month 0, CHI Score 0-1, Support Cases Month 0, SP Month 0, Logins 0-1, Blog Articles 0-1 and Days Since Last Login 0-1 differ significantly between customers who churned and those who did not. To look further into the issue, we ran a logistic regression on the 11 variables, then divided customers into three segments based on customer age and ran the same regression within each segment. Interestingly, in the new and medium segments the likelihood of churn increases with customer age, while in the old segment it decreases with customer age.