In this performance assessment, a telecom business has provided data on its customers and the services they subscribe to. Because customer churn is expensive for the business, they need to know if there are patterns in customer demographics making them more susceptible to churn so they can set a mitigation plan in motion to try to keep the customers.
Which groups of customers are most susceptible to churn?
The data available for the research consists of a single csv file to be read into either python or R.
To use the data in the csv file it needs to be read into python or R. Both programming languages would do an excellent job of cleaning. I chose R because I have a little previous experience with it and I like the way it can describe the statistical makeup of the variables and the way the visualizations render. R is a very good tool because of the statistical capabilities since it was created by scientists specifically for that purpose. Both programming languages would work well, but R is built more for the kind of statistical analysis that analyzing customer churn would entail. Also, when it is time for reporting results, R is superior at visualizations, especially with the Shiny applications and highcharter packages.
To read the data into R, I used RStudio and set my work space to where my data is in order to make it easier to retrieve any additional data needed (this step is not necessary for this project, but it is useful for projects that have multiple data files to retrieve).
Set Workspace:
setwd('C:/Users/localuser/Desktop/WGU/Data Cleaning/Data')
The R base packages can read the file in, but it is slow, particularly for large files. RStudio has an import dataset feature that can be used to import the file and there are several packages that can be used.
I chose the tidyverse package because it comes with several other packages that will be useful in this project, such as ggplot2, dplyr, and readr. The readr package, which reads the file into R faster than the base R packages and can handle a larger dataset, doesn’t require specifying the delimiter if it is “,” nor the header if header = “TRUE”. Readr uses the read.csv command to read the file into R and creates a dataframe for the data.
Load Tidyverse:
#To install tidyverse for the first time use:
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
After the tidyverse package is loaded, R is ready to read the csv file.
Read CSV File:
CustChurn <- read.csv("./churn_raw_data.csv")
To get an idea of what variables are in the file, we can take a look at the dataframe:
View Data:
#Provides the number of records, the number of variables, each variable, the data type of each variable, and the first few entries for each variable.
str(CustChurn)
## 'data.frame': 10000 obs. of 52 variables:
## $ ï.. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CaseOrder : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Customer_id : chr "K409198" "S120509" "K191035" "D90850" ...
## $ Interaction : chr "aa90260b-4141-4a24-8e36-b04ce1f4f77b" "fb76459f-c047-4a9d-8af9-e0f7d4ac2524" "344d114c-3736-4be5-98f7-c72c281e2d35" "abfa2b40-2d43-4994-b15a-989b8c79e311" ...
## $ City : chr "Point Baker" "West Branch" "Yamhill" "Del Mar" ...
## $ State : chr "AK" "MI" "OR" "CA" ...
## $ County : chr "Prince of Wales-Hyder" "Ogemaw" "Yamhill" "San Diego" ...
## $ Zip : int 99927 48661 97148 92014 77461 31030 37847 73109 34771 45237 ...
## $ Lat : num 56.3 44.3 45.4 33 29.4 ...
## $ Lng : num -133.4 -84.2 -123.2 -117.2 -95.8 ...
## $ Population : int 38 10446 3735 13863 11352 17701 2535 23144 17351 20193 ...
## $ Area : chr "Urban" "Urban" "Urban" "Suburban" ...
## $ Timezone : chr "America/Sitka" "America/Detroit" "America/Los_Angeles" "America/Los_Angeles" ...
## $ Job : chr "Environmental health practitioner" "Programmer, multimedia" "Chief Financial Officer" "Solicitor" ...
## $ Children : int NA 1 4 1 0 3 0 2 2 NA ...
## $ Age : int 68 27 50 48 83 83 NA NA 49 86 ...
## $ Education : chr "Master's Degree" "Regular High School Diploma" "Regular High School Diploma" "Doctorate Degree" ...
## $ Employment : chr "Part Time" "Retired" "Student" "Retired" ...
## $ Income : num 28562 21705 NA 18925 40074 ...
## $ Marital : chr "Widowed" "Married" "Widowed" "Married" ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ Churn : chr "No" "Yes" "No" "No" ...
## $ Outage_sec_perweek : num 6.97 12.01 10.25 15.21 8.96 ...
## $ Email : int 10 12 9 15 16 15 10 16 20 18 ...
## $ Contacts : int 0 0 0 2 2 3 0 0 2 1 ...
## $ Yearly_equip_failure: int 1 1 1 0 1 1 1 0 3 0 ...
## $ Techie : chr "No" "Yes" "Yes" "Yes" ...
## $ Contract : chr "One year" "Month-to-month" "Two Year" "Two Year" ...
## $ Port_modem : chr "Yes" "No" "Yes" "No" ...
## $ Tablet : chr "Yes" "Yes" "No" "No" ...
## $ InternetService : chr "Fiber Optic" "Fiber Optic" "DSL" "DSL" ...
## $ Phone : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Multiple : chr "No" "Yes" "Yes" "No" ...
## $ OnlineSecurity : chr "Yes" "Yes" "No" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "No" "No" ...
## $ DeviceProtection : chr "No" "No" "No" "No" ...
## $ TechSupport : chr "No" "No" "No" "No" ...
## $ StreamingTV : chr "No" "Yes" "No" "Yes" ...
## $ StreamingMovies : chr "Yes" "Yes" "Yes" "No" ...
## $ PaperlessBilling : chr "Yes" "Yes" "Yes" "Yes" ...
## $ PaymentMethod : chr "Credit Card (automatic)" "Bank Transfer(automatic)" "Credit Card (automatic)" "Mailed Check" ...
## $ Tenure : num 6.8 1.16 15.75 17.09 1.67 ...
## $ MonthlyCharge : num 171 243 159 120 151 ...
## $ Bandwidth_GB_Year : num 905 801 2055 2165 271 ...
## $ item1 : int 5 3 4 4 4 3 6 2 5 2 ...
## $ item2 : int 5 4 4 4 4 3 5 2 4 2 ...
## $ item3 : int 5 3 2 4 4 3 6 2 4 2 ...
## $ item4 : int 3 3 4 2 3 2 4 5 3 2 ...
## $ item5 : int 4 4 4 5 4 4 1 2 4 5 ...
## $ item6 : int 4 3 3 4 4 3 5 3 3 2 ...
## $ item7 : int 3 4 3 3 4 3 5 4 4 3 ...
## $ item8 : int 4 4 3 3 5 3 5 5 4 3 ...
Based on the results of loading the csv file, the dataset contains 10,000 records representing customers and 50 variables describing the customer demographics, such as geographical information, age, marital status, type of services purchased, and responses to a survey. There are also 2 index variables, one of which serves as a placeholder to preserve the original order.
The indexes are:
| Variable | Data Type | Description |
|---|---|---|
| Sequential Numbering (no variable name assigned) | integer, discrete numerical | An index starting at 1 |
| CaseOrder | integer, discrete numerical | A placeholder starting at 1 to preserve the original order |
The dependent (target) variable is:
| Variable | Data Type | Description |
|---|---|---|
| Churn | character, ordinal categorical | Whether or not the customer discontinued service in the last month (“Yes”, “No”) |
The independent variables are:
| Variable | Data Type | Description |
|---|---|---|
| Customer_id | character, ordinal categorical | The unique identification number assigned to each customer (These are character type because they contain both letters and numbers) |
| Interaction | character, ordinal categorical | The unique transaction identification numbers (These are character type because they contain both letters and numbers) |
| City | character, nominal categorical | The city the customer lives in |
| State | character, nominal categorical | The state the customer lives in |
| County | character, nominal categorical | The county the customer lives in |
| Zip | integer, nominal categorical | The zip code the customer lives in (the data type should likely be character instead of integer because no meaningful statistical calculations can be performed on the numbers themselves) |
| Lat | number, continuous numerical | The latitude where the customer lives |
| Lng | number, continuous numerical | The longitude where the customer lives |
| Population | integer, discrete numerical | The population within a mile of where the customer lives |
| Area | character, nominal categorical | The type of area the customer lives in (such as “urban”, “suburban”, etc.) |
| Timezone | character, ordinal categorical | The timezone customer lived in (such as “America/Sitka”, “America/Detroit”, etc.) at signup |
| Job | character, nominal categorical | The job title the customer or the invoice holder held (such as “Environmental health practitioner”, “Programmer, multimedia”, etc.) at signup |
| Children | integer, discrete numerical | The number of children the customer had at signup |
| Age | integer, continuous numerical | The customer’s age at signup |
| Education | character, ordinal categorical | The education level of the customer (such as “Master’s Degree”, “Regular High School Diploma”, etc.) at signup |
| Employment | character, nominal categorical | The employment status of the customer (such as “Part Time”, “Retired”, etc.) at signup |
| Income | number, discrete numerical | The customer’s income at signup |
| Marital | character, nominal categorical | The customer’s marital status (such as “Widowed”, “Married”, etc.) at signup |
| Gender | character, nominal categorical | The customer’s gender (such as “Male”, “Female”, etc.) at signup |
| Outage_sec_perweek | number, continuous numerical | The average number of seconds per week the customer’s neighborhood experiences an outage |
| integer, discrete numerical | Number of emails sent to the customer in the last year | |
| Contacts | integer, discrete numerical | The number of times the customer contacted tech support |
| Yearly_equip_failure | integer, discrete numerical | The number of equipment failures in the past year the customer experienced where the equipment needed to be repaired or replaced |
| Techie | character, ordinal categorical | Whether or not the customer considered themselves to have technical skills (“Yes”, “No”) at signup |
| Contract | character, nominal categorical | The length of the customer’s contract (such as “One year”, “Month-to-month”, etc.) |
| Port_modem | character, ordinal categorical | Whether the customer has a portable modem (“Yes”, “No”) |
| Tablet | character, ordinal categorical | Whether or not the customer has a tablet (“Yes”, “No”) |
| InternetService | character, nominal categorical | Type of internet service the customer has (such as “Fiber Optic”, “DSL”, etc.) |
| Phone | character, ordinal categorical | Whether or not the customer has a phone (“Yes”, “No”) |
| Multiple | character, ordinal categorical | Whether or not the customer has multiple lines (“Yes”, “No”) |
| OnlineSecurity | character, ordinal categorical | Whether or not the customer has subscribed to online security services (“Yes”, “No”) |
| OnlineBackup | character, ordinal categorical | Whether or not the customer has subscribed to online backup services (“Yes”, “No”) |
| DeviceProtection | character, ordinal categorical | Whether or not the customer has subscribed to device protection services (“Yes”, “No”) |
| TechSupport | character, ordinal categorical | Whether or not the customer has subscribed to technical support services (“Yes”, “No”) |
| StreamingTV | character, ordinal categorical | Whether or not the customer has streaming TV (“Yes”, “No”) |
| StreamingMovies | character, ordinal categorical | Whether or not the customer has streaming movies (“Yes”, “No”) |
| PaperlessBilling | character, ordinal categorical | Whether or not the customer has paperless billing (“Yes”, “No”) |
| PaymentMethod | character, nominal categorical | The payment method used by the customer (such as “Credit Card (automatic)”, “Bank Transfer(automatic)”, etc.) |
| Tenure | number, discrete numerical | The number of months the customer has been with the business |
| MonthlyCharge | number, continuous numerical | The average amount the customer pays per month |
| Bandwidth_GB_Year | number, continuous numerical | The average amount of GB the customer uses in a year |
| item1 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Timely Response” |
| item2 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Timely Fixes” |
| item3 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Timely Replacements” |
| item4 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Reliability” |
| item5 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Options” |
| item6 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Respectful Response” |
| item7 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Courteous Exchange” |
| item8 | integer, discrete numerical - but could also be ordinal categorical | Survey response to question “Evidence of Active Listening” |
My plan to clean the data involves multiple steps:
Many of the variable names are misleading or not clear so they will be changed. The misleading names, new names, and reason for change are:
| Old Name | New Name | Reason For Change |
|---|---|---|
| Timezone | Timezone_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Job | Job_CustInvHold_signup | To clarify the information was provided at the time of signup and may no longer be current and that the information may be for either the customer or the invoice holder |
| Children | Children_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Age | Age_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Education | Education_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Employment | Employment_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Income | Income_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Marital | Marital_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Gender | Gender_signup | To clarify the information was provided at the time of signup and may no longer be current (rare, but possible) |
| Techie | Techie_signup | To clarify the information was provided at the time of signup and may no longer be current |
| Outage_sec_perweek | Avg_outage_sec_perweek | To clarify the calculation is an average |
| MonthlyCharge | Avg_MonthlyCharge | To clarify the calculation is an average |
| Bandwidth_GB_Year | Avg_Bandwidth_GB_Year | To clarify the calculation is an average |
| item1 | TimelyResponse | Clarify what survey response is being answered |
| item2 | TimelyFixes | Clarify what survey response is being answered |
| item3 | TimelyReplacements | Clarify what survey response is being answered |
| item4 | Reliability | Clarify what survey response is being answered |
| item5 | Options | Clarify what survey response is being answered |
| item6 | RespectfulResponse | Clarify what survey response is being answered |
| item7 | CourteousExchange | Clarify what survey response is being answered |
| item8 | EvidenceofActiveListening | Clarify what survey response is being answered |
Rename Variables:
#(Zach, 2019)
clean_CustChurn <- CustChurn %>%
rename(
Timezone_signup = Timezone,
Job_CustInvHold_signup = Job,
Children_signup = Children,
Age_signup = Age,
Education_signup = Education,
Employment_signup = Employment,
Income_signup = Income,
Marital_signup = Marital,
Gender_signup = Gender,
Techie_signup = Techie,
Avg_outage_sec_perweek = Outage_sec_perweek,
Avg_MonthlyCharge = MonthlyCharge,
Avg_Bandwidth_GB_Year = Bandwidth_GB_Year,
TimelyResponse = item1,
TimelyFixes = item2,
TimelyReplacements = item3,
Reliability = item4,
Options = item5,
RespectfulResponse = item6,
CourteousExchange = item7,
EvidenceofActiveListening = item8
)
names(clean_CustChurn)
## [1] "ï.." "CaseOrder"
## [3] "Customer_id" "Interaction"
## [5] "City" "State"
## [7] "County" "Zip"
## [9] "Lat" "Lng"
## [11] "Population" "Area"
## [13] "Timezone_signup" "Job_CustInvHold_signup"
## [15] "Children_signup" "Age_signup"
## [17] "Education_signup" "Employment_signup"
## [19] "Income_signup" "Marital_signup"
## [21] "Gender_signup" "Churn"
## [23] "Avg_outage_sec_perweek" "Email"
## [25] "Contacts" "Yearly_equip_failure"
## [27] "Techie_signup" "Contract"
## [29] "Port_modem" "Tablet"
## [31] "InternetService" "Phone"
## [33] "Multiple" "OnlineSecurity"
## [35] "OnlineBackup" "DeviceProtection"
## [37] "TechSupport" "StreamingTV"
## [39] "StreamingMovies" "PaperlessBilling"
## [41] "PaymentMethod" "Tenure"
## [43] "Avg_MonthlyCharge" "Avg_Bandwidth_GB_Year"
## [45] "TimelyResponse" "TimelyFixes"
## [47] "TimelyReplacements" "Reliability"
## [49] "Options" "RespectfulResponse"
## [51] "CourteousExchange" "EvidenceofActiveListening"
Identify Duplicates (if any):
CustChurn[duplicated(clean_CustChurn)]
## data frame with 0 columns and 10000 rows
The results show that we have no duplicate variables, so there are no duplicates to remove.
Count Missing Data in Each Variable:
colSums(is.na(clean_CustChurn))
## ï.. CaseOrder Customer_id
## 0 0 0
## Interaction City State
## 0 0 0
## County Zip Lat
## 0 0 0
## Lng Population Area
## 0 0 0
## Timezone_signup Job_CustInvHold_signup Children_signup
## 0 0 2495
## Age_signup Education_signup Employment_signup
## 2475 0 0
## Income_signup Marital_signup Gender_signup
## 2490 0 0
## Churn Avg_outage_sec_perweek Email
## 0 0 0
## Contacts Yearly_equip_failure Techie_signup
## 0 0 2477
## Contract Port_modem Tablet
## 0 0 0
## InternetService Phone Multiple
## 0 1026 0
## OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0
## TechSupport StreamingTV StreamingMovies
## 991 0 0
## PaperlessBilling PaymentMethod Tenure
## 0 0 931
## Avg_MonthlyCharge Avg_Bandwidth_GB_Year TimelyResponse
## 0 1021 0
## TimelyFixes TimelyReplacements Reliability
## 0 0 0
## Options RespectfulResponse CourteousExchange
## 0 0 0
## EvidenceofActiveListening
## 0
This count shows that there are several variables with missing data:
Examine Distribution of Missing Data:
To visualize how the missing data is distributed across the dataset, the visdat package is loaded to produce a visdat chart. Histograms will also be produced using baseR and ggplot2 functionality.
#To install visdat for the first time use:
#install.packages("visdat")
library(visdat)
#To install RColorBrewer for the first time use:
#install.packages("RColorBrewer")
library(RColorBrewer)
vis_miss(clean_CustChurn) + scale_fill_brewer(palette = "Set3", direction = -1)
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
The resulting visdat chart shows the percentage of missing data within each variable and it gives a nice visual of where in the records the missing data is (green).
par(mfrow = c(3,2),cex.axis = 1.5, cex.lab = 1.5, mar = c(2,4.5,3,0))
hist(clean_CustChurn$Income_signup, main = "Income at Signup Distribution", xlab = "Income at Signup", xlim = c(0, 250000), col = "lavender")
hist(clean_CustChurn$Tenure, main = "Distribution of Tenure", xlab = "Tenure", xlim = c(0,110), col = "mistyrose")
hist(clean_CustChurn$Avg_Bandwidth_GB_Year, main = "Distribution of Avg Bandwidth GB Year", xlab = "Avg Bandwidth GB Year", xlim = c(0, 8000), col = "honeydew3")
hist(clean_CustChurn$Children_signup, main = "Distribution of Children at Signup", xlab = "Children at Signup", xlim = c(0,10), col = "lemonchiffon1")
hist(clean_CustChurn$Age_signup, main = "Distribution of Age at Signup", xlab = "Age at signup", xlim = c(0,100), col= "lightsteelblue")
mtext("HISTOGRAMS", side = 3, line = - 1.5, outer = TRUE, col= "cadetblue4", font = 2)
The histograms of the numerical variables help determine if the imputation method to replace the missing data should be to use the median or the mean. If the distribution is skewed, the median can be used and if the distribution is normal, then the mean can be used. Based on the histogram results, the imputation methods will be:
| Variable | Distribution | Imputation Method |
|---|---|---|
| Income_signup | Non-symmetrical bimodal | Replace with median |
| Tenure | Skewed | Replace with median |
| Ave_Bandwidth_GB_Year | Non-symmetrical bimodal | Replace with median |
| Children_signup | Skewed | Replace with median |
| Age_signup | Uniform | Replace with mean |
The median was applied to the Income_signup, Tenure, Avg_Bandwidth_GB_Year, Children_signup records. The mean was applied to the Age_signup records.
Impute Missing Data with Median and Mean
clean_CustChurn$Income_signup[is.na(clean_CustChurn$Income_signup)]<-
median(clean_CustChurn$Income_signup,na.rm=TRUE)
clean_CustChurn$Tenure[is.na(clean_CustChurn$Tenure)]<-
median(clean_CustChurn$Tenure,na.rm=TRUE)
clean_CustChurn$Avg_Bandwidth_GB_Year[is.na(clean_CustChurn$Avg_Bandwidth_GB_Year)]<-
median(clean_CustChurn$Avg_Bandwidth_GB_Year,na.rm=TRUE)
clean_CustChurn$Children_signup[is.na(clean_CustChurn$Children_signup)]<-
median(clean_CustChurn$Children_signup,na.rm=TRUE)
clean_CustChurn$Age_signup[is.na(clean_CustChurn$Age_signup)]<-
mean(clean_CustChurn$Age_signup,na.rm=TRUE)
The missing data has been cleaned up for the numerical fields. The categorical data still needs to be cleaned. The missing data will be imputed with the mode. The modeest package is a statistical package for R that allows the mode to be calculated. Otherwise, there is not a calculation in base R.
Load Modeest:
#To install modeest for the first time use:
#install.packages("modeest")
library(modeest)
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
Calculate Mode for Categorical Values:
#Phone:
mlv(clean_CustChurn$Phone, method="mfv")
## [1] "Yes"
#Techie_signup:
mlv(clean_CustChurn$Techie_signup, method="mfv")
## [1] "No"
#TechSupport:
mlv(clean_CustChurn$TechSupport, method="mfv")
## [1] "No"
Impute Missing Categorical Data with Mode:
PhoneMode <- mlv(clean_CustChurn$Phone, method="mfv")
clean_CustChurn$Phone[is.na(clean_CustChurn$Phone)] <- PhoneMode
TechieMode <- mlv(clean_CustChurn$Techie_signup, method="mfv")
clean_CustChurn$Techie_signup[is.na(clean_CustChurn$Techie_signup)] <- TechieMode
TechSupportMode <- mlv(clean_CustChurn$TechSupport, method="mfv")
clean_CustChurn$TechSupport[is.na(clean_CustChurn$TechSupport)] <- TechSupportMode
Count Missing Data in Each Variable to Verify Imputation:
colSums(is.na(clean_CustChurn))
## ï.. CaseOrder Customer_id
## 0 0 0
## Interaction City State
## 0 0 0
## County Zip Lat
## 0 0 0
## Lng Population Area
## 0 0 0
## Timezone_signup Job_CustInvHold_signup Children_signup
## 0 0 0
## Age_signup Education_signup Employment_signup
## 0 0 0
## Income_signup Marital_signup Gender_signup
## 0 0 0
## Churn Avg_outage_sec_perweek Email
## 0 0 0
## Contacts Yearly_equip_failure Techie_signup
## 0 0 0
## Contract Port_modem Tablet
## 0 0 0
## InternetService Phone Multiple
## 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0
## TechSupport StreamingTV StreamingMovies
## 0 0 0
## PaperlessBilling PaymentMethod Tenure
## 0 0 0
## Avg_MonthlyCharge Avg_Bandwidth_GB_Year TimelyResponse
## 0 0 0
## TimelyFixes TimelyReplacements Reliability
## 0 0 0
## Options RespectfulResponse CourteousExchange
## 0 0 0
## EvidenceofActiveListening
## 0
All of the missing data, numerical and categorical, has been imputed.
Boxplots are an easy way to visualize outliers. R has a simple boxplot function in base R. The gplots and ggtlot2 packages have boxplots with a little more functionality and customization options. To identify outliers with a boxplot, check to see if there are points outside of the quartiles. The points outside of the quartiles are outliers.
Plot Boxplot:
par(mfrow = c(3,4),cex.axis = 1, cex.lab = 1, cex = .9, mar = c(1,4.5,4,0))
boxplot(clean_CustChurn$Population, main = "Population", col = "violet",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_outage_sec_perweek, main = "Avg Outage Sec per Week", col = "darkseagreen1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Email, main = "Email", col = "plum3",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Contacts, main = "Contacts", col = "coral",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Yearly_equip_failure, main = "Yearly Equipment Failure", col = "slateblue1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_MonthlyCharge, main = "Avg Monthly Charge", col = "goldenrod2",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Income_signup, main = "Income at Signup", col = "lavender", outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Tenure, main = "Tenure", col = "mistyrose",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_Bandwidth_GB_Year, main = "Avg BW GB Year", col = "honeydew3",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Children_signup, main = "Children at Signup", col = "lemonchiffon1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Age_signup,main = "Age at Signup", col = "lightsteelblue",outpch = 17, outcol = "red4")
mtext("BOXPLOTS", side = 3, line = - 1.5, outer = TRUE, col= "cadetblue4", font = 2)
The outliers (in red) are easily seen in these boxplots. The variables containing outliers are:
Before treating any of the outliers, it must be determined if the outliers are errors or if they are legitimate entries. R makes it easy to do a quick check of the statistics associated with the numerical variables.
Check Statistics on Variables:
summary(clean_CustChurn)
## ï.. CaseOrder Customer_id Interaction
## Min. : 1 Min. : 1 Length:10000 Length:10000
## 1st Qu.: 2501 1st Qu.: 2501 Class :character Class :character
## Median : 5000 Median : 5000 Mode :character Mode :character
## Mean : 5000 Mean : 5000
## 3rd Qu.: 7500 3rd Qu.: 7500
## Max. :10000 Max. :10000
## City State County Zip
## Length:10000 Length:10000 Length:10000 Min. : 601
## Class :character Class :character Class :character 1st Qu.:26293
## Mode :character Mode :character Mode :character Median :48870
## Mean :49153
## 3rd Qu.:71867
## Max. :99929
## Lat Lng Population Area
## Min. :17.97 Min. :-171.69 Min. : 0 Length:10000
## 1st Qu.:35.34 1st Qu.: -97.08 1st Qu.: 738 Class :character
## Median :39.40 Median : -87.92 Median : 2910 Mode :character
## Mean :38.76 Mean : -90.78 Mean : 9757
## 3rd Qu.:42.11 3rd Qu.: -80.09 3rd Qu.: 13168
## Max. :70.64 Max. : -65.67 Max. :111850
## Timezone_signup Job_CustInvHold_signup Children_signup Age_signup
## Length:10000 Length:10000 Min. : 0.000 Min. :18.00
## Class :character Class :character 1st Qu.: 1.000 1st Qu.:41.00
## Mode :character Mode :character Median : 1.000 Median :53.28
## Mean : 1.823 Mean :53.28
## 3rd Qu.: 3.000 3rd Qu.:65.00
## Max. :10.000 Max. :89.00
## Education_signup Employment_signup Income_signup Marital_signup
## Length:10000 Length:10000 Min. : 740.7 Length:10000
## Class :character Class :character 1st Qu.: 23660.8 Class :character
## Mode :character Mode :character Median : 33186.8 Mode :character
## Mean : 38256.0
## 3rd Qu.: 45504.2
## Max. :258900.7
## Gender_signup Churn Avg_outage_sec_perweek Email
## Length:10000 Length:10000 Min. :-1.349 Min. : 1.00
## Class :character Class :character 1st Qu.: 8.054 1st Qu.:10.00
## Mode :character Mode :character Median :10.203 Median :12.00
## Mean :11.453 Mean :12.02
## 3rd Qu.:12.488 3rd Qu.:14.00
## Max. :47.049 Max. :23.00
## Contacts Yearly_equip_failure Techie_signup Contract
## Min. :0.0000 Min. :0.000 Length:10000 Length:10000
## 1st Qu.:0.0000 1st Qu.:0.000 Class :character Class :character
## Median :1.0000 Median :0.000 Mode :character Mode :character
## Mean :0.9942 Mean :0.398
## 3rd Qu.:2.0000 3rd Qu.:1.000
## Max. :7.0000 Max. :6.000
## Port_modem Tablet InternetService Phone
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Multiple OnlineSecurity OnlineBackup DeviceProtection
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## TechSupport StreamingTV StreamingMovies PaperlessBilling
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PaymentMethod Tenure Avg_MonthlyCharge Avg_Bandwidth_GB_Year
## Length:10000 Min. : 1.00 Min. : 77.51 Min. : 155.5
## Class :character 1st Qu.: 8.70 1st Qu.:141.07 1st Qu.:1312.1
## Mode :character Median :36.20 Median :169.92 Median :3382.4
## Mean :34.66 Mean :174.08 Mean :3397.2
## 3rd Qu.:60.15 3rd Qu.:203.78 3rd Qu.:5466.3
## Max. :72.00 Max. :315.88 Max. :7159.0
## TimelyResponse TimelyFixes TimelyReplacements Reliability
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :3.000 Median :3.000
## Mean :3.491 Mean :3.505 Mean :3.487 Mean :3.498
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :7.000 Max. :7.000 Max. :8.000 Max. :7.000
## Options RespectfulResponse CourteousExchange EvidenceofActiveListening
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.00 1st Qu.:3.000
## Median :3.000 Median :3.000 Median :4.00 Median :3.000
## Mean :3.493 Mean :3.497 Mean :3.51 Mean :3.496
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :7.000 Max. :8.000 Max. :7.00 Max. :8.000
At first glance, these statistics do not look unreasonable. For example, the number of children having a minimum of 0 children and a maximum of 10 children is certainly possible. The mean of 1.8 children also seems to fit within expectations. The minimum population at 0 and the minimum income at 740.70 are suspicious. To further investigate, the standard deviation can be calculated for each of these variables.
sd(clean_CustChurn$Population)
## [1] 14432.7
sd(clean_CustChurn$Contacts)
## [1] 0.9884655
sd(clean_CustChurn$Avg_outage_sec_perweek)
## [1] 7.025921
sd(clean_CustChurn$Email)
## [1] 3.025898
sd(clean_CustChurn$Income_signup)
## [1] 24747.87
sd(clean_CustChurn$Children_signup)
## [1] 1.925971
sd(clean_CustChurn$Avg_MonthlyCharge)
## [1] 43.33547
sd(clean_CustChurn$Yearly_equip_failure)
## [1] 0.6359532
Based on the results of the statistics and the standards deviations, it appears that the outliers for the following should not be treated because they are below three standard deviations from the mean.
Other outliers that, although they are higher than 3 standard deviations from the mean, still appear to be reasonable and should not be treated are:
Both of these make sense when compared to other variables. Email is the number of emails- marketing and other correspondence. That correspondence could be customer service issues. If the average outage per week is higher, those customers may have more correspondence as a result. In addition, there was a larger difference between the minimum and maximum for yearly equipment failures. That could also be related to the outages and the emails.
Income_signup is another variable with outliers that should not be treated without more information. Incomes vary significantly across the population so the outliers are not unusual. While the minimum income may be an error, there is not enough information to determine that at this point.
Population, like income, can vary significantly across the country. Some customers will live in very remote areas with few neighbors and others may live in densely populated cities. The large variations are expected. The minimum of 0 appears to be an error, however, without more information from the client, it should not be treated.
Finally, the Avg_MonthlyCharge has a high standard deviation. There is a very large spread between the minimum and the maximum. In this case, it could be explained by customers who subscribe to the very basic package at the low end and customers who subscribe to everything or have multiple accounts (landlords, caregivers) at the high end, with the majority of customers falling somewhere in the middle. The client should be able to check their billing records to confirm the outliers before treating them, so they should just be noted at this point.
The categorical variables can be re-expressed as numeric in order to make processing easier for future use. For small files, non-ordinal values, one-hot encoding can work well. In this case, there are many variables and many records within those variables, which would increase the file size significantly. For the categorical data, I used label encoding. Label encoding assigns a numeric value to each categorical value. It is ideal for yes/no values. It can imply an order where there is none, however, it is better for larger datasets. The superml package makes it easy to use the label encoding method.
Install superml Package:
#(Mulani, 2021)
#To install superml for the first time use:
#install.packages("superml")
library(superml)
## Loading required package: R6
label <- LabelEncoder$new()
clean_CustChurn$Customer_id <- label$fit_transform(clean_CustChurn$Customer_id)
head(clean_CustChurn$Customer_id)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$Interaction <- label$fit_transform(clean_CustChurn$Interaction)
head(clean_CustChurn$Interaction)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$City <- label$fit_transform(clean_CustChurn$City)
head(clean_CustChurn$City)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$State <- label$fit_transform(clean_CustChurn$State)
head(clean_CustChurn$State)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$County <- label$fit_transform(clean_CustChurn$County)
head(clean_CustChurn$County)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$Area <- label$fit_transform(clean_CustChurn$Area)
head(clean_CustChurn$Area)
## [1] 0 0 0 1 1 0
label <- LabelEncoder$new()
clean_CustChurn$Timezone_signup <- label$fit_transform(clean_CustChurn$Timezone_signup)
head(clean_CustChurn$Timezone_signup)
## [1] 0 1 2 2 3 4
label <- LabelEncoder$new()
clean_CustChurn$Job_CustInvHold_signup <- label$fit_transform(clean_CustChurn$Job_CustInvHold_signup)
head(clean_CustChurn$Job_CustInvHold_signup)
## [1] 0 1 2 3 4 5
label <- LabelEncoder$new()
clean_CustChurn$Education_signup <- label$fit_transform(clean_CustChurn$Education_signup)
head(clean_CustChurn$Education_signup)
## [1] 0 1 1 2 0 3
label <- LabelEncoder$new()
clean_CustChurn$Employment_signup <- label$fit_transform(clean_CustChurn$Employment_signup)
head(clean_CustChurn$Employment_signup)
## [1] 0 1 2 1 2 3
label <- LabelEncoder$new()
clean_CustChurn$Marital_signup <- label$fit_transform(clean_CustChurn$Marital_signup)
head(clean_CustChurn$Marital_signup)
## [1] 0 1 0 1 2 3
label <- LabelEncoder$new()
clean_CustChurn$Gender_signup <- label$fit_transform(clean_CustChurn$Gender_signup)
head(clean_CustChurn$Gender_signup)
## [1] 0 1 1 0 0 1
clean_CustChurn$Techie_signup <- label$fit_transform(clean_CustChurn$Techie_signup)
head(clean_CustChurn$Techie_signup)
## [1] 0 1 1 1 0 0
label <- LabelEncoder$new()
clean_CustChurn$Contract <- label$fit_transform(clean_CustChurn$Contract)
head(clean_CustChurn$Contract)
## [1] 0 1 2 2 1 0
clean_CustChurn$Port_modem <- label$fit_transform(clean_CustChurn$Port_modem)
head(clean_CustChurn$Port_modem)
## [1] 0 1 0 1 0 0
clean_CustChurn$Tablet <- label$fit_transform(clean_CustChurn$Tablet)
head(clean_CustChurn$Tablet)
## [1] 0 0 1 1 1 1
label <- LabelEncoder$new()
clean_CustChurn$InternetService <- label$fit_transform(clean_CustChurn$InternetService)
head(clean_CustChurn$InternetService)
## [1] 0 0 1 1 0 2
clean_CustChurn$Phone <- label$fit_transform(clean_CustChurn$Phone)
head(clean_CustChurn$Phone)
## [1] 0 0 0 0 1 0
clean_CustChurn$Multiple <- label$fit_transform(clean_CustChurn$Multiple)
head(clean_CustChurn$Multiple)
## [1] 0 1 1 0 0 1
clean_CustChurn$Churn <- label$fit_transform(clean_CustChurn$Churn)
head(clean_CustChurn$Churn)
## [1] 0 1 0 0 1 0
clean_CustChurn$OnlineSecurity <- label$fit_transform(clean_CustChurn$OnlineSecurity)
head(clean_CustChurn$OnlineSecurity)
## [1] 0 0 1 0 1 0
clean_CustChurn$OnlineBackup <- label$fit_transform(clean_CustChurn$OnlineBackup)
head(clean_CustChurn$OnlineBackup)
## [1] 0 1 1 1 1 0
clean_CustChurn$DeviceProtection <- label$fit_transform(clean_CustChurn$DeviceProtection)
head(clean_CustChurn$DeviceProtection)
## [1] 0 0 0 0 0 1
clean_CustChurn$TechSupport <- label$fit_transform(clean_CustChurn$TechSupport)
head(clean_CustChurn$TechSupport)
## [1] 0 0 0 0 1 0
clean_CustChurn$StreamingTV <- label$fit_transform(clean_CustChurn$StreamingTV)
head(clean_CustChurn$StreamingTV)
## [1] 0 1 0 1 1 0
clean_CustChurn$StreamingMovies <- label$fit_transform(clean_CustChurn$StreamingMovies)
head(clean_CustChurn$StreamingMovies)
## [1] 0 0 0 1 1 0
clean_CustChurn$PaperlessBilling <- label$fit_transform(clean_CustChurn$PaperlessBilling)
head(clean_CustChurn$PaperlessBilling)
## [1] 0 0 0 0 1 1
label <- LabelEncoder$new()
clean_CustChurn$PaymentMethod <- label$fit_transform(clean_CustChurn$PaymentMethod)
head(clean_CustChurn$PaymentMethod)
## [1] 0 1 0 2 2 3
Extract file:
write.csv(clean_CustChurn, "./churn_clean_data.csv")
In reviewing the variables, it is clear that much of the demographic data is limited in its usefulness as it may be outdated because it hasn’t been updated since the customer signed up. The variables with this type of limitation are:
Other variables with limitations are:
While finding patterns that rely on this data is still possible, it would be much more accurate to survey the customers often to get updates on a regular basis.
Replacing the missing values with median and mode is also limiting. It would be better to know why the values are missing and try to get the correct information entered, if there is any. Imputing the values may skew the results.
Outliers are another limiting issue. There were several variables with outliers, however, it could not be readily confirmed that the outliers were input errors. Therefore, the ouliers were noted, and not treated. The business should be consulted to determine how to proceed with the outliers.
These issues may impact the research question, particularly the issue of the stale demographic data. Any one of the stale categories could skew the results, although some variables are more likely to have updates than others (marital status would be more likely to change than gender, for example). This is the most pressing issue for the business to address if they want to be able to have a good answer for their question and to be able to continue to monitor the progress of any remediation efforts to reduce churn.
PCA is a method to reduce the dimensionality in a dataset. It helps to explain a significant percentage of the variance using just a few components, rather than all of them. It can reduce redundancy in the data and make machine learning faster. The variables need to be strongly correlated to perform the analysis. The dataset for PCA also needs to be cleaned first so that it contains no missing values, outliers that should have been removed, and non-numeric values.
Define Variables
Because PCA can only be performed on numeric values, variables like city, state, marital status, etc. will not be used. The following numeric variables will be used for PCA:
Store Variables in New DataFrame:
#To make the PCA coding easier, I'm adding the chosen variables to a new dataframe. I also used the column numbers from the output when renaming the columns rather than the column names to make the code shorter
CCdf <- clean_CustChurn[ , c(9:11, 15:16, 19, 23:26, 42:52)]
colnames(CCdf)
## [1] "Lat" "Lng"
## [3] "Population" "Children_signup"
## [5] "Age_signup" "Income_signup"
## [7] "Avg_outage_sec_perweek" "Email"
## [9] "Contacts" "Yearly_equip_failure"
## [11] "Tenure" "Avg_MonthlyCharge"
## [13] "Avg_Bandwidth_GB_Year" "TimelyResponse"
## [15] "TimelyFixes" "TimelyReplacements"
## [17] "Reliability" "Options"
## [19] "RespectfulResponse" "CourteousExchange"
## [21] "EvidenceofActiveListening"
Normalize Variables:
#(Nistrup, 2020)
pca.data <- prcomp(CCdf, center = TRUE, scale = TRUE)
#View PCA standard deviations, proportions of variance, and cumulative proportions of variance
summary(pca.data)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7170 1.37854 1.27957 1.11205 1.06339 1.02531 1.01838
## Proportion of Variance 0.1404 0.09049 0.07797 0.05889 0.05385 0.05006 0.04939
## Cumulative Proportion 0.1404 0.23088 0.30885 0.36774 0.42158 0.47164 0.52103
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00793 1.00244 0.99309 0.99093 0.98058 0.93040 0.88247
## Proportion of Variance 0.04838 0.04785 0.04696 0.04676 0.04579 0.04122 0.03708
## Cumulative Proportion 0.56941 0.61726 0.66422 0.71098 0.75677 0.79799 0.83507
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.85786 0.83045 0.76909 0.7332 0.69406 0.56938 0.3208
## Proportion of Variance 0.03504 0.03284 0.02817 0.0256 0.02294 0.01544 0.0049
## Cumulative Proportion 0.87012 0.90296 0.93112 0.9567 0.97966 0.99510 1.0000
#View correlations
pca.data$rotation
## PC1 PC2 PC3 PC4
## Lat -0.0012988768 0.023302535 -0.008737299 -0.7109294805
## Lng 0.0081156770 -0.009778395 0.023050144 0.1694971516
## Population -0.0021648497 -0.005644212 0.016693178 0.6483282604
## Children_signup 0.0004594240 0.001132978 0.010446123 -0.0378825680
## Age_signup 0.0050576345 0.013214185 -0.017046954 0.0233016879
## Income_signup -0.0008520682 -0.007709773 0.023715144 -0.0592077802
## Avg_outage_sec_perweek -0.0131743504 -0.017965733 -0.047804233 -0.0989855571
## Email 0.0086895953 0.020265552 -0.003392581 0.1448429005
## Contacts -0.0085391645 -0.003895242 -0.010323281 0.0280424337
## Yearly_equip_failure -0.0077296632 -0.016043096 0.006845499 -0.0137783361
## Tenure -0.0106657235 -0.700860365 -0.071780032 -0.0040584771
## Avg_MonthlyCharge -0.0004931141 -0.043355506 -0.024276911 -0.0706712652
## Avg_Bandwidth_GB_Year -0.0125683108 -0.702562040 -0.073999492 -0.0100915193
## TimelyResponse 0.4588348341 -0.032034605 0.280088475 -0.0163117494
## TimelyFixes 0.4340607128 -0.042776009 0.280895233 -0.0228070939
## TimelyReplacements 0.4007815228 -0.034729050 0.280701477 -0.0031318090
## Reliability 0.1456316918 0.050443834 -0.566922189 0.0008237461
## Options -0.1754870143 -0.066118921 0.585658369 -0.0137071149
## RespectfulResponse 0.4049914444 0.012186983 -0.183816521 0.0127144337
## CourteousExchange 0.3582397322 0.003824246 -0.180957728 -0.0159301032
## EvidenceofActiveListening 0.3087483611 0.016378170 -0.131939371 0.0420742080
## PC5 PC6 PC7
## Lat -0.0754548661 0.125157397 1.201509e-02
## Lng 0.0718088371 -0.697690812 -4.653143e-01
## Population 0.0599885591 0.272814589 2.607800e-01
## Children_signup 0.0203065093 -0.566165567 3.012516e-01
## Age_signup -0.0416418298 0.209472382 -4.509169e-01
## Income_signup -0.0151598811 -0.095565930 1.669441e-01
## Avg_outage_sec_perweek 0.6937445570 0.048720927 8.502775e-02
## Email 0.0914789449 -0.005126588 -2.115085e-01
## Contacts -0.0006395943 0.181757138 -5.080012e-01
## Yearly_equip_failure 0.0533217770 -0.090675692 2.738410e-01
## Tenure -0.0588330162 0.012167692 -1.292710e-02
## Avg_MonthlyCharge 0.6941250825 0.044792084 -8.524157e-02
## Avg_Bandwidth_GB_Year -0.0103951286 -0.004268561 3.918978e-03
## TimelyResponse 0.0300809461 -0.006513592 -2.529346e-04
## TimelyFixes 0.0143691364 0.022960470 -8.815904e-03
## TimelyReplacements -0.0110899282 0.003512057 -2.361884e-03
## Reliability -0.0310644507 -0.007453986 -7.913205e-03
## Options 0.0232759223 0.040114111 -2.003365e-02
## RespectfulResponse 0.0062402525 0.013523369 1.488216e-02
## CourteousExchange -0.0351877075 -0.021344776 4.238183e-05
## EvidenceofActiveListening 0.0358393805 0.029409180 1.208392e-02
## PC8 PC9 PC10
## Lat -0.0755803560 -0.0023098738 -0.0759715002
## Lng 0.0667557807 -0.0377899974 0.2104743461
## Population 0.0334337505 0.0919283564 -0.0036515457
## Children_signup 0.0157526422 0.0780022695 -0.3189486964
## Age_signup 0.4564897948 -0.0856443461 -0.4202640311
## Income_signup 0.2792343458 0.8297305702 -0.2252327243
## Avg_outage_sec_perweek 0.0204287347 0.0312633943 -0.0088069011
## Email -0.5272436019 -0.0643204375 -0.7334160461
## Contacts 0.2039986844 0.2583411923 0.0636474539
## Yearly_equip_failure 0.6153084826 -0.4597608628 -0.2637133820
## Tenure -0.0116073601 -0.0048543146 -0.0222556425
## Avg_MonthlyCharge 0.0121842062 0.0210074673 0.0705865061
## Avg_Bandwidth_GB_Year -0.0170571601 0.0006799138 -0.0075854017
## TimelyResponse 0.0095540293 -0.0162503921 0.0074544441
## TimelyFixes 0.0074353144 0.0022344086 -0.0004708455
## TimelyReplacements -0.0316154001 -0.0302274469 0.0048692156
## Reliability -0.0042202070 -0.0203604651 0.0240735339
## Options 0.0024812055 -0.0062062195 0.0057337507
## RespectfulResponse 0.0049911489 0.0078937857 -0.0212271981
## CourteousExchange -0.0009372125 0.0533046973 -0.0231981492
## EvidenceofActiveListening 0.0414419446 -0.0020036497 0.0452467705
## PC11 PC12 PC13 PC14
## Lat 0.038392961 0.029193630 -0.008435733 -0.0840239782
## Lng -0.115547254 -0.232244918 0.064759742 -0.0738585170
## Population 0.032340013 0.141295584 -0.005426483 -0.1596223363
## Children_signup 0.191073833 0.662255603 0.005694298 0.0309607662
## Age_signup -0.484380454 0.326994386 0.105211005 -0.0632318701
## Income_signup -0.133377325 -0.339254195 -0.066087238 0.0131052776
## Avg_outage_sec_perweek 0.036854876 -0.042764091 0.690129255 -0.1143995304
## Email 0.138113768 -0.278785657 -0.046423950 0.0635505096
## Contacts 0.751483623 0.162324589 0.012464761 0.0409153224
## Yearly_equip_failure 0.304132072 -0.380069970 -0.117946066 0.0218001512
## Tenure -0.012354135 -0.006055513 0.037777983 -0.0003023203
## Avg_MonthlyCharge -0.098490635 0.113446904 -0.683862986 0.0487590746
## Avg_Bandwidth_GB_Year 0.004165589 0.011855598 -0.012136274 0.0119795574
## TimelyResponse 0.019762678 -0.001088162 -0.006554008 -0.0742180726
## TimelyFixes 0.006856614 0.017016600 -0.002005392 -0.1101556217
## TimelyReplacements -0.008391554 -0.014952716 -0.010288717 -0.1771363080
## Reliability -0.002334510 -0.017647997 -0.021683794 -0.1864450826
## Options -0.006939835 0.009589566 0.041002981 0.1302463247
## RespectfulResponse 0.023539347 0.002343695 -0.007449126 -0.0541041014
## CourteousExchange 0.039169002 -0.006474351 -0.026557663 -0.1501884266
## EvidenceofActiveListening -0.070907293 0.016707008 0.124209235 0.8993615451
## PC15 PC16 PC17
## Lat -0.6629870430 -0.097948218 -0.0444566052
## Lng -0.3578777953 -0.067187627 -0.0386396580
## Population -0.6058237029 -0.099985603 -0.0117364744
## Children_signup 0.0012306259 0.037194136 0.0186275404
## Age_signup 0.0027839366 -0.010549463 0.0009172031
## Income_signup 0.0080161361 0.061188394 0.0060658071
## Avg_outage_sec_perweek 0.0433576746 -0.008893558 0.0128163912
## Email -0.0405545319 0.013305971 0.0148553118
## Contacts 0.0037455091 0.037913094 0.0046247484
## Yearly_equip_failure -0.0416869178 -0.009971406 0.0132965936
## Tenure -0.0097231631 0.008787332 -0.0080746753
## Avg_MonthlyCharge -0.0043105093 -0.008924304 0.0152696466
## Avg_Bandwidth_GB_Year -0.0087335179 -0.003313837 -0.0038227193
## TimelyResponse -0.0232415563 0.114271236 0.0459474895
## TimelyFixes 0.0007022538 0.170293003 0.0681366550
## TimelyReplacements -0.0009642702 0.247244773 0.1487966872
## Reliability -0.0693032276 0.468333268 0.4471573264
## Options -0.0109178537 -0.061321498 0.2108651173
## RespectfulResponse 0.0628733114 -0.050875212 -0.7551424353
## CourteousExchange 0.1377160478 -0.799827196 0.3735231765
## EvidenceofActiveListening -0.1772780546 0.009975454 0.1097472335
## PC18 PC19 PC20
## Lat -0.0041456851 0.0156741589 -0.0111056968
## Lng 0.0177542845 0.0004404725 -0.0246985512
## Population 0.0008659297 0.0016324442 -0.0077723112
## Children_signup 0.0113840191 0.0186316052 -0.0077147097
## Age_signup -0.0122828391 0.0085863229 0.0160254976
## Income_signup 0.0006217216 0.0138000785 0.0049397946
## Avg_outage_sec_perweek -0.0163580933 0.0100097533 -0.0043115885
## Email 0.0065883396 -0.0162277486 0.0010132409
## Contacts -0.0263230183 0.0207755968 -0.0005300704
## Yearly_equip_failure -0.0007583610 0.0072506350 -0.0213011075
## Tenure -0.0121585157 0.0065780270 0.0049847367
## Avg_MonthlyCharge 0.0003496937 0.0211408944 -0.0127283407
## Avg_Bandwidth_GB_Year -0.0016208746 -0.0077506850 0.0071436936
## TimelyResponse 0.0243576194 -0.2391680316 0.7930417038
## TimelyFixes 0.0680061250 -0.5923124854 -0.5726095223
## TimelyReplacements -0.3926001393 0.6756428578 -0.1771996148
## Reliability 0.4306992598 0.0853375511 0.0180427195
## Options 0.6944182200 0.2604452146 -0.0428154633
## RespectfulResponse 0.4057578374 0.2273641395 -0.0651350723
## CourteousExchange 0.0677631015 0.0647317441 -0.0412502694
## EvidenceofActiveListening -0.0439110025 0.0470594076 -0.0430650393
## PC21
## Lat 0.0009630604
## Lng 0.0008761430
## Population -0.0007493478
## Children_signup -0.0182853241
## Age_signup 0.0216573095
## Income_signup 0.0013977966
## Avg_outage_sec_perweek 0.0008521392
## Email 0.0054223039
## Contacts -0.0026509615
## Yearly_equip_failure -0.0023938232
## Tenure -0.7049911957
## Avg_MonthlyCharge -0.0482167849
## Avg_Bandwidth_GB_Year 0.7067644307
## TimelyResponse -0.0028995362
## TimelyFixes -0.0030164735
## TimelyReplacements 0.0145175277
## Reliability 0.0017026995
## Options -0.0029235647
## RespectfulResponse 0.0013754214
## CourteousExchange -0.0067500949
## EvidenceofActiveListening -0.0029899801
To determine which components to keep, they should have an eigenvalue greater than 1. A scree plot can determine which variables have an eigenvalue over 1. Base R can create a basic scree plot. The R package factoextra can also produce a nice scree plot. Factoextra contains many other tools including FactoMineR, plyr, caret, corrplot and many more. FactoMineR and plyr would likely be sufficient, but the other tools may be useful for future projects.
Create Scree Plot of Eigenvalues:
screeplot(pca.data, type = "l", npcs = 21, pch = 17, col = "red", main = "Screeplot for PCA")
abline(h = 1, col="blue", lty="dotted", lwd = 2)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty="dotted", cex=0.6)
#To install factoextra for the first time use:
#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pca.data, choice = "eigenvalue", addlabels = TRUE, bar_width = 0.5, barfill = "cadetblue", linecolor = "goldenrod2", ncp = 21)
The eigenvalue charts both had a few that looked like they were right at 1 (7 - 12) but because it could have been due to rounding issues, I took a closer look at the PCA values to see which ones might be rounding.
Obtain Eigenvalues:
get_eig(pca.data)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.9481201 14.0386671 14.03867
## Dim.2 1.9003754 9.0494067 23.08807
## Dim.3 1.6373045 7.7966881 30.88476
## Dim.4 1.2366499 5.8888089 36.77357
## Dim.5 1.1308029 5.3847758 42.15835
## Dim.6 1.0512624 5.0060115 47.16436
## Dim.7 1.0370929 4.9385376 52.10290
## Dim.8 1.0159209 4.8377184 56.94061
## Dim.9 1.0048894 4.7851876 61.72580
## Dim.10 0.9862229 4.6962997 66.42210
## Dim.11 0.9819472 4.6759391 71.09804
## Dim.12 0.9615399 4.5787615 75.67680
## Dim.13 0.8656462 4.1221247 79.79893
## Dim.14 0.7787583 3.7083729 83.50730
## Dim.15 0.7359178 3.5043703 87.01167
## Dim.16 0.6896399 3.2839994 90.29567
## Dim.17 0.5914997 2.8166651 93.11233
## Dim.18 0.5375789 2.5598995 95.67223
## Dim.19 0.4817194 2.2939020 97.96614
## Dim.20 0.3241938 1.5437798 99.50992
## Dim.21 0.1029178 0.4900845 100.00000
Principal components 10 -12 are just under 1 but round to 1. If rounding out to 0 decimal places, then PCA 9 is the last principal component to accept. If we round based on the third decimal place, then PCA 10 is acceptable as it will round to 1. If round by the second decimal place, then PCA 11 and 12 would also be acceptable because they would round to 1. Which one is chosen depends on how much of the variance the business is comfortable having explained. Using just the first nine explains over 60% of the variance. If 12 are used, then the explained variance goes up to over 75%. The next charts show the difference between using 9, 10, or 12 principal components.
Produce Charts Visualizing the Amount of Variance the Chosen Number of Principal Components Explains:
#9 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 9, col="red", lty="dotted", lwd = 2)
abline(h = 0.617258, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC9"),
col=c("red"), lty="dotted", lwd = 2, cex=0.6)
#10 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 10, col="red", lty="dotted", lwd = 2)
abline(h = 0.66422, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC10"),
col=c("red"), lty="dotted", lwd = 2, cex=0.6)
#12 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 12, col="red", lty="dotted", lwd = 2)
abline(h = 0.756768, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC12"),
col=c("red"), lty="dotted", lwd = 2, cex=0.6)
PCA Summary:
The Principal Component Analysis showed the over 60% of the variance could be explained by using only 9 components. More than 75% could be explained by using 12 components. Using Principal Component Analysis is useful for the business because it reduces the dimensionality of a large dataset that needs to be analyzed. It helps to observe patterns and trends in the data by showing the relationships between the variables and observations. A business can use this method to determine patterns in customer behavior to help them predict which groups of customers are more susceptible to churn. Knowing which customer groups are more likely to end their subscriptions, helps the business set a mitigation plan in motion to reduce the churn through marketing or education efforts, improved customer services, better product development, etc., depending on what the analysis shows. In this case, such an analysis may answer the research question if there are groups of customers more susceptible to churn because it could uncover some relationships between survey responses about equipment failure being rated poorly and low tenure, for example. This would indicate to the business that they need to work on product development. This is why it is so important for the business to keep the data fresh and not rely on stale data. The need to monitor the progress of any mitigation plan they put in place to ensure it is working.
Leek, J. (2017). Reading Local Flat Files [Video Course]. Coursera Data Science Specialization. www.coursera.com
Zach. (2020, October 27). How to Import CSV Files into R (Step-by-Step). Statology. https://www.statology.org/import-csv-into-r/#:~:text=1.%20Use%20read.csv%20from%20base%20R%20%28Slowest%20method%2C
Fast Reading of Data From TXT|CSV Files into R: readr package - Easy Guides - Wiki - STHDA. (n.d.). www.sthda.com. Retrieved December 28, 2021, from http://www.sthda.com/english/wiki/fast-reading-of-data-from-txt-csv-files-into-r-readr-package#:~:text=The%20readr%20package%20contains%20functions%20for%20reading%20i%29
Examine a Data Frame in R with 7 Basic Functions. (2016, November 29). (R)Very Day. https://rveryday.wordpress.com/2016/11/29/examine-a-data-frame-in-r-with-7-basic-functions/
Larose, C.D., & Larose, D.T. (2019). Data Science Using Python and R. John Wiley & Sons. ISBN-13:978-1-119-52684-1
Zach. (2019, March 12). How to Rename Data Frame Columns in R. Statology. https://www.statology.org/how-to-rename-data-frame-columns-in-r/
Makowski, D. (2018, March 29). Standardize (Z-score) a Dataframe. R-Bloggers. Retrieved December 31, 2021, from https://www.r-bloggers.com/2018/03/standardize-z-score-a-dataframe/
Mulani, S. (2021, March 4). Label Encoding in R programming – All you need to know! JournalDev. https://www.journaldev.com/48639/label-encoding-in-r
Nistrup, P. (2020, January 28). Principal Component Analysis (PCA) 101, using R. Medium. https://towardsdatascience.com/principal-component-analysis-pca-101-using-r-361f4c53a9ff