Final Project

Data Mining Final Project

In order for Regork to maximize its customer base, I am going to analyze the past customer data in order to find a model that does a good job at predicting whether customers will stay with us, or leave in the future. This will in turn give us a better understanding of our customers, as well as giving us a head start in taking action, such as offering our customers incentives in order for them to stay with us.

Data and Packages Required

Data -

library(readr)
customer_retention <- read_csv("R/customer_retention.csv")

## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Packages -

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(purrr)
library(readr)
library(tibble)
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org

Data Analysis

In order for me to properly aanswer this question for Regork, I first will have to analyze the data in order to see any trends or relationships withing the data. I will begin by visualizing the data at hand in order to better understand it.

cr <- customer_retention

Tenure for Customers

ggplot(customer_retention, aes(x = Tenure)) +
     geom_bar()

My first data visualization shows the tenure for all customers in our data set. This shows that we have a very high volume of customers that are leaving right after joining us. This shows that we are attracting customers, but keeping them after a year of usage is the issue. I will now go into more detail on this customer group.

ggplot(customer_retention, aes(x = Contract)) +
       geom_bar()

This bar graph shows the contracts of our customers. Most customers are choosing the month to month contract. This could be one of the reasons that we are seeing a lot of customers ending their contract after one year, as they are only on a month to month contract.

Machine Learning

My first machine learning model is a simple linear regression model. This model is going to be used to show the correlation between monthly charges and customer tenure. I believe that pricing could be one of the most influential variables in influencing customer tenure.

set.seed(123)
split  <- initial_split(cr, prop = 0.7, strata = "Tenure")
cr_train  <- training(split)
cr_test   <- testing(split)

model1 <- ggplot(cr_train, aes(Tenure, MonthlyCharges)) +
     geom_point(size = 1.5, alpha = .25) +
    geom_smooth(method = "lm", se = FALSE)

model1

## `geom_smooth()` using formula = 'y ~ x'

This graph shows that these two variables have a positive relationship with one another. The regression line within the graph reflects this as well, This shows that as tenure increases, price generally increases as well.

cr_train

## # A tibble: 4,897 × 20
##    Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines   
##    <chr>          <dbl> <chr>   <chr>       <dbl> <chr>        <chr>           
##  1 Female             0 Yes     No              1 No           No phone service
##  2 Male               0 No      No              2 Yes          No              
##  3 Female             0 No      No              8 Yes          Yes             
##  4 Male               1 No      No              1 No           No phone service
##  5 Male               0 No      No              1 Yes          No              
##  6 Male               1 Yes     No              2 Yes          No              
##  7 Male               0 No      No              1 Yes          No              
##  8 Male               1 No      No              1 Yes          No              
##  9 Male               0 No      No              5 Yes          No              
## 10 Male               0 No      No              2 Yes          No              
## # ℹ 4,887 more rows
## # ℹ 13 more variables: InternetService <chr>, OnlineSecurity <chr>,
## #   OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## #   StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## #   PaperlessBilling <chr>, PaymentMethod <chr>, MonthlyCharges <dbl>,
## #   TotalCharges <dbl>, Status <chr>

After I create the dataframe and run the correlation between Tenure and Monthly Charges, I end up with a correlation value of .246. This shows that here is not too much of a strong correlation between the two variables with this model.

cr_lm <- linear_reg() %>%
   fit(Tenure ~ MonthlyCharges + TotalCharges, data = cr_train)

cr_lm

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Tenure ~ MonthlyCharges + TotalCharges, data = data)
## 
## Coefficients:
##    (Intercept)  MonthlyCharges    TotalCharges  
##       30.70211        -0.41551         0.01256

This linear regression model shows the relationship statistics within our training data,

Predictions

Below is the predicted variables for monthly charges from the model that I created. This shows that the predicted monthly charges are varied due to the average tenure of the customer.

This shows that monthly charges has a big time impacts on the tenure of a customer. If Regork can lower some of the monthly charges, then we can expect some of the customers to stay longer than less than a year.

cr_lm %>% predict(cr_train)

## # A tibble: 4,897 × 1
##     .pred
##     <dbl>
##  1 18.7  
##  2  9.69 
##  3 -0.400
##  4 14.7  
##  5 22.6  
##  6 -6.70 
##  7 22.6  
##  8 12.5  
##  9  5.72 
## 10 11.5  
## # ℹ 4,887 more rows

Business Findings

After I have went through my modeling, I have found that one of the most influential factors in influencing the tenure of a customer is the monthly charges variable. It seems that after a customer stays for around a year or two, they are more likely to stay a customer for a longer time. This is something that we can capitalize on by offering discounts for the first year of service. I think that Regork should focus on getting customers into the door by offering discounts during their first year of service, such as a free first 2 months, or a discounted first six months. This can crucial in giving our customers the ability to feel Regork out and make sure we are the correct ones for them, while also using our services at a lower price.