D206 Data Cleaning Performance Assessment

CUSTOMER CHURN

In this performance assessment, a telecom business has provided data on its customers and the services they subscribe to. Because customer churn is expensive for the business, they need to know if there are patterns in customer demographics making them more susceptible to churn so they can set a mitigation plan in motion to try to keep the customers.

Research Question

Which groups of customers are most susceptible to churn?

Available Data

The data available for the research consists of a single csv file to be read into either python or R.

Process the Data

To use the data in the csv file it needs to be read into python or R. Both programming languages would do an excellent job of cleaning. I chose R because I have a little previous experience with it and I like the way it can describe the statistical makeup of the variables and the way the visualizations render. R is a very good tool because of the statistical capabilities since it was created by scientists specifically for that purpose. Both programming languages would work well, but R is built more for the kind of statistical analysis that analyzing customer churn would entail. Also, when it is time for reporting results, R is superior at visualizations, especially with the Shiny applications and highcharter packages.

To read the data into R, I used RStudio and set my work space to where my data is in order to make it easier to retrieve any additional data needed (this step is not necessary for this project, but it is useful for projects that have multiple data files to retrieve).

Set Workspace:

   setwd('C:/Users/localuser/Desktop/WGU/Data Cleaning/Data')

The R base packages can read the file in, but it is slow, particularly for large files. RStudio has an import dataset feature that can be used to import the file and there are several packages that can be used.

I chose the tidyverse package because it comes with several other packages that will be useful in this project, such as ggplot2, dplyr, and readr. The readr package, which reads the file into R faster than the base R packages and can handle a larger dataset, doesn’t require specifying the delimiter if it is “,” nor the header if header = “TRUE”. Readr uses the read.csv command to read the file into R and creates a dataframe for the data.

Load Tidyverse:

#To install tidyverse for the first time use:
#install.packages("tidyverse")
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

After the tidyverse package is loaded, R is ready to read the csv file.

Read CSV File:

CustChurn <- read.csv("./churn_raw_data.csv")

To get an idea of what variables are in the file, we can take a look at the dataframe:

View Data:

#Provides the number of records, the number of variables, each variable, the data type of each variable, and the first few entries for each variable.  
 str(CustChurn)

## 'data.frame':    10000 obs. of  52 variables:
##  $ ï..                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CaseOrder           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Customer_id         : chr  "K409198" "S120509" "K191035" "D90850" ...
##  $ Interaction         : chr  "aa90260b-4141-4a24-8e36-b04ce1f4f77b" "fb76459f-c047-4a9d-8af9-e0f7d4ac2524" "344d114c-3736-4be5-98f7-c72c281e2d35" "abfa2b40-2d43-4994-b15a-989b8c79e311" ...
##  $ City                : chr  "Point Baker" "West Branch" "Yamhill" "Del Mar" ...
##  $ State               : chr  "AK" "MI" "OR" "CA" ...
##  $ County              : chr  "Prince of Wales-Hyder" "Ogemaw" "Yamhill" "San Diego" ...
##  $ Zip                 : int  99927 48661 97148 92014 77461 31030 37847 73109 34771 45237 ...
##  $ Lat                 : num  56.3 44.3 45.4 33 29.4 ...
##  $ Lng                 : num  -133.4 -84.2 -123.2 -117.2 -95.8 ...
##  $ Population          : int  38 10446 3735 13863 11352 17701 2535 23144 17351 20193 ...
##  $ Area                : chr  "Urban" "Urban" "Urban" "Suburban" ...
##  $ Timezone            : chr  "America/Sitka" "America/Detroit" "America/Los_Angeles" "America/Los_Angeles" ...
##  $ Job                 : chr  "Environmental health practitioner" "Programmer, multimedia" "Chief Financial Officer" "Solicitor" ...
##  $ Children            : int  NA 1 4 1 0 3 0 2 2 NA ...
##  $ Age                 : int  68 27 50 48 83 83 NA NA 49 86 ...
##  $ Education           : chr  "Master's Degree" "Regular High School Diploma" "Regular High School Diploma" "Doctorate Degree" ...
##  $ Employment          : chr  "Part Time" "Retired" "Student" "Retired" ...
##  $ Income              : num  28562 21705 NA 18925 40074 ...
##  $ Marital             : chr  "Widowed" "Married" "Widowed" "Married" ...
##  $ Gender              : chr  "Male" "Female" "Female" "Male" ...
##  $ Churn               : chr  "No" "Yes" "No" "No" ...
##  $ Outage_sec_perweek  : num  6.97 12.01 10.25 15.21 8.96 ...
##  $ Email               : int  10 12 9 15 16 15 10 16 20 18 ...
##  $ Contacts            : int  0 0 0 2 2 3 0 0 2 1 ...
##  $ Yearly_equip_failure: int  1 1 1 0 1 1 1 0 3 0 ...
##  $ Techie              : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Contract            : chr  "One year" "Month-to-month" "Two Year" "Two Year" ...
##  $ Port_modem          : chr  "Yes" "No" "Yes" "No" ...
##  $ Tablet              : chr  "Yes" "Yes" "No" "No" ...
##  $ InternetService     : chr  "Fiber Optic" "Fiber Optic" "DSL" "DSL" ...
##  $ Phone               : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Multiple            : chr  "No" "Yes" "Yes" "No" ...
##  $ OnlineSecurity      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ OnlineBackup        : chr  "Yes" "No" "No" "No" ...
##  $ DeviceProtection    : chr  "No" "No" "No" "No" ...
##  $ TechSupport         : chr  "No" "No" "No" "No" ...
##  $ StreamingTV         : chr  "No" "Yes" "No" "Yes" ...
##  $ StreamingMovies     : chr  "Yes" "Yes" "Yes" "No" ...
##  $ PaperlessBilling    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ PaymentMethod       : chr  "Credit Card (automatic)" "Bank Transfer(automatic)" "Credit Card (automatic)" "Mailed Check" ...
##  $ Tenure              : num  6.8 1.16 15.75 17.09 1.67 ...
##  $ MonthlyCharge       : num  171 243 159 120 151 ...
##  $ Bandwidth_GB_Year   : num  905 801 2055 2165 271 ...
##  $ item1               : int  5 3 4 4 4 3 6 2 5 2 ...
##  $ item2               : int  5 4 4 4 4 3 5 2 4 2 ...
##  $ item3               : int  5 3 2 4 4 3 6 2 4 2 ...
##  $ item4               : int  3 3 4 2 3 2 4 5 3 2 ...
##  $ item5               : int  4 4 4 5 4 4 1 2 4 5 ...
##  $ item6               : int  4 3 3 4 4 3 5 3 3 2 ...
##  $ item7               : int  3 4 3 3 4 3 5 4 4 3 ...
##  $ item8               : int  4 4 3 3 5 3 5 5 4 3 ...

Characteristics of the Data

Based on the results of loading the csv file, the dataset contains 10,000 records representing customers and 50 variables describing the customer demographics, such as geographical information, age, marital status, type of services purchased, and responses to a survey. There are also 2 index variables, one of which serves as a placeholder to preserve the original order.

The indexes are:

Variable	Data Type	Description
Sequential Numbering (no variable name assigned)	integer, discrete numerical	An index starting at 1
CaseOrder	integer, discrete numerical	A placeholder starting at 1 to preserve the original order

The dependent (target) variable is:

Variable	Data Type	Description
Churn	character, ordinal categorical	Whether or not the customer discontinued service in the last month (“Yes”, “No”)

The independent variables are:

Variable	Data Type	Description
Customer_id	character, ordinal categorical	The unique identification number assigned to each customer (These are character type because they contain both letters and numbers)
Interaction	character, ordinal categorical	The unique transaction identification numbers (These are character type because they contain both letters and numbers)
City	character, nominal categorical	The city the customer lives in
State	character, nominal categorical	The state the customer lives in
County	character, nominal categorical	The county the customer lives in
Zip	integer, nominal categorical	The zip code the customer lives in (the data type should likely be character instead of integer because no meaningful statistical calculations can be performed on the numbers themselves)
Lat	number, continuous numerical	The latitude where the customer lives
Lng	number, continuous numerical	The longitude where the customer lives
Population	integer, discrete numerical	The population within a mile of where the customer lives
Area	character, nominal categorical	The type of area the customer lives in (such as “urban”, “suburban”, etc.)
Timezone	character, ordinal categorical	The timezone customer lived in (such as “America/Sitka”, “America/Detroit”, etc.) at signup
Job	character, nominal categorical	The job title the customer or the invoice holder held (such as “Environmental health practitioner”, “Programmer, multimedia”, etc.) at signup
Children	integer, discrete numerical	The number of children the customer had at signup
Age	integer, continuous numerical	The customer’s age at signup
Education	character, ordinal categorical	The education level of the customer (such as “Master’s Degree”, “Regular High School Diploma”, etc.) at signup
Employment	character, nominal categorical	The employment status of the customer (such as “Part Time”, “Retired”, etc.) at signup
Income	number, discrete numerical	The customer’s income at signup
Marital	character, nominal categorical	The customer’s marital status (such as “Widowed”, “Married”, etc.) at signup
Gender	character, nominal categorical	The customer’s gender (such as “Male”, “Female”, etc.) at signup
Outage_sec_perweek	number, continuous numerical	The average number of seconds per week the customer’s neighborhood experiences an outage
Email	integer, discrete numerical	Number of emails sent to the customer in the last year
Contacts	integer, discrete numerical	The number of times the customer contacted tech support
Yearly_equip_failure	integer, discrete numerical	The number of equipment failures in the past year the customer experienced where the equipment needed to be repaired or replaced
Techie	character, ordinal categorical	Whether or not the customer considered themselves to have technical skills (“Yes”, “No”) at signup
Contract	character, nominal categorical	The length of the customer’s contract (such as “One year”, “Month-to-month”, etc.)
Port_modem	character, ordinal categorical	Whether the customer has a portable modem (“Yes”, “No”)
Tablet	character, ordinal categorical	Whether or not the customer has a tablet (“Yes”, “No”)
InternetService	character, nominal categorical	Type of internet service the customer has (such as “Fiber Optic”, “DSL”, etc.)
Phone	character, ordinal categorical	Whether or not the customer has a phone (“Yes”, “No”)
Multiple	character, ordinal categorical	Whether or not the customer has multiple lines (“Yes”, “No”)
OnlineSecurity	character, ordinal categorical	Whether or not the customer has subscribed to online security services (“Yes”, “No”)
OnlineBackup	character, ordinal categorical	Whether or not the customer has subscribed to online backup services (“Yes”, “No”)
DeviceProtection	character, ordinal categorical	Whether or not the customer has subscribed to device protection services (“Yes”, “No”)
TechSupport	character, ordinal categorical	Whether or not the customer has subscribed to technical support services (“Yes”, “No”)
StreamingTV	character, ordinal categorical	Whether or not the customer has streaming TV (“Yes”, “No”)
StreamingMovies	character, ordinal categorical	Whether or not the customer has streaming movies (“Yes”, “No”)
PaperlessBilling	character, ordinal categorical	Whether or not the customer has paperless billing (“Yes”, “No”)
PaymentMethod	character, nominal categorical	The payment method used by the customer (such as “Credit Card (automatic)”, “Bank Transfer(automatic)”, etc.)
Tenure	number, discrete numerical	The number of months the customer has been with the business
MonthlyCharge	number, continuous numerical	The average amount the customer pays per month
Bandwidth_GB_Year	number, continuous numerical	The average amount of GB the customer uses in a year
item1	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Timely Response”
item2	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Timely Fixes”
item3	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Timely Replacements”
item4	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Reliability”
item5	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Options”
item6	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Respectful Response”
item7	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Courteous Exchange”
item8	integer, discrete numerical - but could also be ordinal categorical	Survey response to question “Evidence of Active Listening”

Data Cleaning Plan

My plan to clean the data involves multiple steps:

Rename the misleading variables.
Identify and remove any duplicates.
Identify missing data by counting the number of missing entries in the columns to identify which columns have missing data.
Plot a histogram on the columns with missing data to examine the distribution and correct the missing data using one of the following methods:
- If the distribution is skewed and the data is numerical: replace the missing values with the median.
- If the distribution is normal and the data is numerical, replace the missing values with the mean.
- If the data is categorical, replace the missing values with the mode.
Identify outliers in the data using the following methods:
- Plot a boxplot to show any outliers that are outside of the quartiles.
- Calculate minimum, maximum, quartiles, median, and mean to find any obvious outliers.
- Calculate standard deviations to determine which variables have outliers more than three standard deviations away from the mean.
Treat outliers: Examine the outlier data to determine if it makes sense or if the outliers may be reasonable. Replace the outliers with the median if they can be confirmed to be errors. Otherwise, keep the outliers and make a note.
Re-express categorical variables using one of the following methods:
- One-hot encoding for nominal data if there are few values
- Label encoding for nominal data unless it can be arranged as ordinal
- Ordinal encoding for ordinal data
Extract clean data file

Step One - Rename Misleading or Unclear Variables

Many of the variable names are misleading or not clear so they will be changed. The misleading names, new names, and reason for change are:

Old Name	New Name	Reason For Change
Timezone	Timezone_signup	To clarify the information was provided at the time of signup and may no longer be current
Job	Job_CustInvHold_signup	To clarify the information was provided at the time of signup and may no longer be current and that the information may be for either the customer or the invoice holder
Children	Children_signup	To clarify the information was provided at the time of signup and may no longer be current
Age	Age_signup	To clarify the information was provided at the time of signup and may no longer be current
Education	Education_signup	To clarify the information was provided at the time of signup and may no longer be current
Employment	Employment_signup	To clarify the information was provided at the time of signup and may no longer be current
Income	Income_signup	To clarify the information was provided at the time of signup and may no longer be current
Marital	Marital_signup	To clarify the information was provided at the time of signup and may no longer be current
Gender	Gender_signup	To clarify the information was provided at the time of signup and may no longer be current (rare, but possible)
Techie	Techie_signup	To clarify the information was provided at the time of signup and may no longer be current
Outage_sec_perweek	Avg_outage_sec_perweek	To clarify the calculation is an average
MonthlyCharge	Avg_MonthlyCharge	To clarify the calculation is an average
Bandwidth_GB_Year	Avg_Bandwidth_GB_Year	To clarify the calculation is an average
item1	TimelyResponse	Clarify what survey response is being answered
item2	TimelyFixes	Clarify what survey response is being answered
item3	TimelyReplacements	Clarify what survey response is being answered
item4	Reliability	Clarify what survey response is being answered
item5	Options	Clarify what survey response is being answered
item6	RespectfulResponse	Clarify what survey response is being answered
item7	CourteousExchange	Clarify what survey response is being answered
item8	EvidenceofActiveListening	Clarify what survey response is being answered

Rename Variables:

#(Zach, 2019)
clean_CustChurn <- CustChurn %>%
      rename(
        Timezone_signup = Timezone,
        Job_CustInvHold_signup = Job,
        Children_signup = Children,
        Age_signup = Age,
        Education_signup = Education,
        Employment_signup = Employment,
        Income_signup = Income,
        Marital_signup = Marital,
        Gender_signup = Gender,
        Techie_signup = Techie,
        Avg_outage_sec_perweek = Outage_sec_perweek,
        Avg_MonthlyCharge = MonthlyCharge,
        Avg_Bandwidth_GB_Year = Bandwidth_GB_Year,
        TimelyResponse = item1,
        TimelyFixes = item2,
        TimelyReplacements = item3,
        Reliability = item4,
        Options = item5,
        RespectfulResponse = item6,
        CourteousExchange = item7,
        EvidenceofActiveListening = item8
      )
names(clean_CustChurn)

##  [1] "ï.."                       "CaseOrder"                
##  [3] "Customer_id"               "Interaction"              
##  [5] "City"                      "State"                    
##  [7] "County"                    "Zip"                      
##  [9] "Lat"                       "Lng"                      
## [11] "Population"                "Area"                     
## [13] "Timezone_signup"           "Job_CustInvHold_signup"   
## [15] "Children_signup"           "Age_signup"               
## [17] "Education_signup"          "Employment_signup"        
## [19] "Income_signup"             "Marital_signup"           
## [21] "Gender_signup"             "Churn"                    
## [23] "Avg_outage_sec_perweek"    "Email"                    
## [25] "Contacts"                  "Yearly_equip_failure"     
## [27] "Techie_signup"             "Contract"                 
## [29] "Port_modem"                "Tablet"                   
## [31] "InternetService"           "Phone"                    
## [33] "Multiple"                  "OnlineSecurity"           
## [35] "OnlineBackup"              "DeviceProtection"         
## [37] "TechSupport"               "StreamingTV"              
## [39] "StreamingMovies"           "PaperlessBilling"         
## [41] "PaymentMethod"             "Tenure"                   
## [43] "Avg_MonthlyCharge"         "Avg_Bandwidth_GB_Year"    
## [45] "TimelyResponse"            "TimelyFixes"              
## [47] "TimelyReplacements"        "Reliability"              
## [49] "Options"                   "RespectfulResponse"       
## [51] "CourteousExchange"         "EvidenceofActiveListening"

Step Two - Identify and Remove Any Duplicates

Identify Duplicates (if any):

CustChurn[duplicated(clean_CustChurn)]

## data frame with 0 columns and 10000 rows

The results show that we have no duplicate variables, so there are no duplicates to remove.

Step Three - Identify Missing Data

Count Missing Data in Each Variable:

colSums(is.na(clean_CustChurn))

##                       ï..                 CaseOrder               Customer_id 
##                         0                         0                         0 
##               Interaction                      City                     State 
##                         0                         0                         0 
##                    County                       Zip                       Lat 
##                         0                         0                         0 
##                       Lng                Population                      Area 
##                         0                         0                         0 
##           Timezone_signup    Job_CustInvHold_signup           Children_signup 
##                         0                         0                      2495 
##                Age_signup          Education_signup         Employment_signup 
##                      2475                         0                         0 
##             Income_signup            Marital_signup             Gender_signup 
##                      2490                         0                         0 
##                     Churn    Avg_outage_sec_perweek                     Email 
##                         0                         0                         0 
##                  Contacts      Yearly_equip_failure             Techie_signup 
##                         0                         0                      2477 
##                  Contract                Port_modem                    Tablet 
##                         0                         0                         0 
##           InternetService                     Phone                  Multiple 
##                         0                      1026                         0 
##            OnlineSecurity              OnlineBackup          DeviceProtection 
##                         0                         0                         0 
##               TechSupport               StreamingTV           StreamingMovies 
##                       991                         0                         0 
##          PaperlessBilling             PaymentMethod                    Tenure 
##                         0                         0                       931 
##         Avg_MonthlyCharge     Avg_Bandwidth_GB_Year            TimelyResponse 
##                         0                      1021                         0 
##               TimelyFixes        TimelyReplacements               Reliability 
##                         0                         0                         0 
##                   Options        RespectfulResponse         CourteousExchange 
##                         0                         0                         0 
## EvidenceofActiveListening 
##                         0

This count shows that there are several variables with missing data:

Income_signup 2490 records missing data
Tenure 931 records missing data
Avg_Bandwidth_GB_Year 1021 records missing data
Children_signup 2495 records missing data
Age_signup 2475 records missing data
Phone 1026 records missing data
Techie_signup 2477 records missing data
TechSupport 991 records missing data

Examine Distribution of Missing Data:

To visualize how the missing data is distributed across the dataset, the visdat package is loaded to produce a visdat chart. Histograms will also be produced using baseR and ggplot2 functionality.

Produce a Visdat Chart:

#To install visdat for the first time use:
#install.packages("visdat")
library(visdat)
#To install RColorBrewer for the first time use:
#install.packages("RColorBrewer")
library(RColorBrewer)
vis_miss(clean_CustChurn) + scale_fill_brewer(palette = "Set3", direction = -1)

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

The resulting visdat chart shows the percentage of missing data within each variable and it gives a nice visual of where in the records the missing data is (green).

Produce Histograms:

par(mfrow = c(3,2),cex.axis = 1.5, cex.lab = 1.5, mar = c(2,4.5,3,0))
hist(clean_CustChurn$Income_signup, main = "Income at Signup Distribution", xlab = "Income at Signup", xlim = c(0, 250000), col = "lavender")
hist(clean_CustChurn$Tenure, main = "Distribution of Tenure", xlab = "Tenure", xlim = c(0,110), col = "mistyrose")
hist(clean_CustChurn$Avg_Bandwidth_GB_Year, main = "Distribution of Avg Bandwidth GB Year", xlab = "Avg Bandwidth GB Year", xlim = c(0, 8000), col = "honeydew3")
hist(clean_CustChurn$Children_signup, main = "Distribution of Children at Signup", xlab = "Children at Signup", xlim = c(0,10), col = "lemonchiffon1")
hist(clean_CustChurn$Age_signup, main = "Distribution of Age at Signup", xlab = "Age at signup", xlim = c(0,100), col= "lightsteelblue")
mtext("HISTOGRAMS", side = 3, line = - 1.5, outer = TRUE, col= "cadetblue4", font = 2)

The histograms of the numerical variables help determine if the imputation method to replace the missing data should be to use the median or the mean. If the distribution is skewed, the median can be used and if the distribution is normal, then the mean can be used. Based on the histogram results, the imputation methods will be:

Variable	Distribution	Imputation Method
Income_signup	Non-symmetrical bimodal	Replace with median
Tenure	Skewed	Replace with median
Ave_Bandwidth_GB_Year	Non-symmetrical bimodal	Replace with median
Children_signup	Skewed	Replace with median
Age_signup	Uniform	Replace with mean

Step Four - Correct Missing Data

The median was applied to the Income_signup, Tenure, Avg_Bandwidth_GB_Year, Children_signup records. The mean was applied to the Age_signup records.

Impute Missing Data with Median and Mean

clean_CustChurn$Income_signup[is.na(clean_CustChurn$Income_signup)]<-
median(clean_CustChurn$Income_signup,na.rm=TRUE)
clean_CustChurn$Tenure[is.na(clean_CustChurn$Tenure)]<-
median(clean_CustChurn$Tenure,na.rm=TRUE)
clean_CustChurn$Avg_Bandwidth_GB_Year[is.na(clean_CustChurn$Avg_Bandwidth_GB_Year)]<-
median(clean_CustChurn$Avg_Bandwidth_GB_Year,na.rm=TRUE)
clean_CustChurn$Children_signup[is.na(clean_CustChurn$Children_signup)]<-
median(clean_CustChurn$Children_signup,na.rm=TRUE)
clean_CustChurn$Age_signup[is.na(clean_CustChurn$Age_signup)]<-
mean(clean_CustChurn$Age_signup,na.rm=TRUE)

The missing data has been cleaned up for the numerical fields. The categorical data still needs to be cleaned. The missing data will be imputed with the mode. The modeest package is a statistical package for R that allows the mode to be calculated. Otherwise, there is not a calculation in base R.

Load Modeest:

#To install modeest for the first time use:
#install.packages("modeest")
library(modeest)

## Registered S3 method overwritten by 'rmutil':
##   method         from
##   print.response httr

Calculate Mode for Categorical Values:

#Phone:
mlv(clean_CustChurn$Phone, method="mfv")

## [1] "Yes"

#Techie_signup:
mlv(clean_CustChurn$Techie_signup, method="mfv")

## [1] "No"

#TechSupport:
mlv(clean_CustChurn$TechSupport, method="mfv")

## [1] "No"

Impute Missing Categorical Data with Mode:

PhoneMode <- mlv(clean_CustChurn$Phone, method="mfv")
clean_CustChurn$Phone[is.na(clean_CustChurn$Phone)] <- PhoneMode
TechieMode <- mlv(clean_CustChurn$Techie_signup, method="mfv")
clean_CustChurn$Techie_signup[is.na(clean_CustChurn$Techie_signup)] <- TechieMode
TechSupportMode <- mlv(clean_CustChurn$TechSupport, method="mfv")
clean_CustChurn$TechSupport[is.na(clean_CustChurn$TechSupport)] <- TechSupportMode

Count Missing Data in Each Variable to Verify Imputation:

colSums(is.na(clean_CustChurn))

##                       ï..                 CaseOrder               Customer_id 
##                         0                         0                         0 
##               Interaction                      City                     State 
##                         0                         0                         0 
##                    County                       Zip                       Lat 
##                         0                         0                         0 
##                       Lng                Population                      Area 
##                         0                         0                         0 
##           Timezone_signup    Job_CustInvHold_signup           Children_signup 
##                         0                         0                         0 
##                Age_signup          Education_signup         Employment_signup 
##                         0                         0                         0 
##             Income_signup            Marital_signup             Gender_signup 
##                         0                         0                         0 
##                     Churn    Avg_outage_sec_perweek                     Email 
##                         0                         0                         0 
##                  Contacts      Yearly_equip_failure             Techie_signup 
##                         0                         0                         0 
##                  Contract                Port_modem                    Tablet 
##                         0                         0                         0 
##           InternetService                     Phone                  Multiple 
##                         0                         0                         0 
##            OnlineSecurity              OnlineBackup          DeviceProtection 
##                         0                         0                         0 
##               TechSupport               StreamingTV           StreamingMovies 
##                         0                         0                         0 
##          PaperlessBilling             PaymentMethod                    Tenure 
##                         0                         0                         0 
##         Avg_MonthlyCharge     Avg_Bandwidth_GB_Year            TimelyResponse 
##                         0                         0                         0 
##               TimelyFixes        TimelyReplacements               Reliability 
##                         0                         0                         0 
##                   Options        RespectfulResponse         CourteousExchange 
##                         0                         0                         0 
## EvidenceofActiveListening 
##                         0

All of the missing data, numerical and categorical, has been imputed.

Step Five - Identify Outliers

Boxplots are an easy way to visualize outliers. R has a simple boxplot function in base R. The gplots and ggtlot2 packages have boxplots with a little more functionality and customization options. To identify outliers with a boxplot, check to see if there are points outside of the quartiles. The points outside of the quartiles are outliers.

Plot Boxplot:

par(mfrow = c(3,4),cex.axis = 1, cex.lab = 1, cex = .9, mar = c(1,4.5,4,0))
boxplot(clean_CustChurn$Population, main = "Population", col = "violet",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_outage_sec_perweek, main = "Avg Outage Sec per Week", col = "darkseagreen1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Email, main = "Email", col = "plum3",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Contacts, main = "Contacts", col = "coral",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Yearly_equip_failure, main = "Yearly Equipment Failure", col = "slateblue1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_MonthlyCharge, main = "Avg Monthly Charge", col = "goldenrod2",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Income_signup, main = "Income at Signup", col = "lavender", outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Tenure, main = "Tenure", col = "mistyrose",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Avg_Bandwidth_GB_Year, main = "Avg BW GB Year", col = "honeydew3",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Children_signup, main = "Children at Signup", col = "lemonchiffon1",outpch = 17, outcol = "red4")
boxplot(clean_CustChurn$Age_signup,main = "Age at Signup", col = "lightsteelblue",outpch = 17, outcol = "red4")
mtext("BOXPLOTS", side = 3, line = - 1.5, outer = TRUE, col= "cadetblue4", font = 2)

The outliers (in red) are easily seen in these boxplots. The variables containing outliers are:

Population
Avg_outage_sec_perweek
Email
Contacts
Yearly_equip_failure
Avg_MonthlyCharge
Income_signup
Children_signup

Before treating any of the outliers, it must be determined if the outliers are errors or if they are legitimate entries. R makes it easy to do a quick check of the statistics associated with the numerical variables.

Check Statistics on Variables:

summary(clean_CustChurn)

##       ï..          CaseOrder     Customer_id        Interaction       
##  Min.   :    1   Min.   :    1   Length:10000       Length:10000      
##  1st Qu.: 2501   1st Qu.: 2501   Class :character   Class :character  
##  Median : 5000   Median : 5000   Mode  :character   Mode  :character  
##  Mean   : 5000   Mean   : 5000                                        
##  3rd Qu.: 7500   3rd Qu.: 7500                                        
##  Max.   :10000   Max.   :10000                                        
##      City              State              County               Zip       
##  Length:10000       Length:10000       Length:10000       Min.   :  601  
##  Class :character   Class :character   Class :character   1st Qu.:26293  
##  Mode  :character   Mode  :character   Mode  :character   Median :48870  
##                                                           Mean   :49153  
##                                                           3rd Qu.:71867  
##                                                           Max.   :99929  
##       Lat             Lng            Population         Area          
##  Min.   :17.97   Min.   :-171.69   Min.   :     0   Length:10000      
##  1st Qu.:35.34   1st Qu.: -97.08   1st Qu.:   738   Class :character  
##  Median :39.40   Median : -87.92   Median :  2910   Mode  :character  
##  Mean   :38.76   Mean   : -90.78   Mean   :  9757                     
##  3rd Qu.:42.11   3rd Qu.: -80.09   3rd Qu.: 13168                     
##  Max.   :70.64   Max.   : -65.67   Max.   :111850                     
##  Timezone_signup    Job_CustInvHold_signup Children_signup    Age_signup   
##  Length:10000       Length:10000           Min.   : 0.000   Min.   :18.00  
##  Class :character   Class :character       1st Qu.: 1.000   1st Qu.:41.00  
##  Mode  :character   Mode  :character       Median : 1.000   Median :53.28  
##                                            Mean   : 1.823   Mean   :53.28  
##                                            3rd Qu.: 3.000   3rd Qu.:65.00  
##                                            Max.   :10.000   Max.   :89.00  
##  Education_signup   Employment_signup  Income_signup      Marital_signup    
##  Length:10000       Length:10000       Min.   :   740.7   Length:10000      
##  Class :character   Class :character   1st Qu.: 23660.8   Class :character  
##  Mode  :character   Mode  :character   Median : 33186.8   Mode  :character  
##                                        Mean   : 38256.0                     
##                                        3rd Qu.: 45504.2                     
##                                        Max.   :258900.7                     
##  Gender_signup         Churn           Avg_outage_sec_perweek     Email      
##  Length:10000       Length:10000       Min.   :-1.349         Min.   : 1.00  
##  Class :character   Class :character   1st Qu.: 8.054         1st Qu.:10.00  
##  Mode  :character   Mode  :character   Median :10.203         Median :12.00  
##                                        Mean   :11.453         Mean   :12.02  
##                                        3rd Qu.:12.488         3rd Qu.:14.00  
##                                        Max.   :47.049         Max.   :23.00  
##     Contacts      Yearly_equip_failure Techie_signup        Contract        
##  Min.   :0.0000   Min.   :0.000        Length:10000       Length:10000      
##  1st Qu.:0.0000   1st Qu.:0.000        Class :character   Class :character  
##  Median :1.0000   Median :0.000        Mode  :character   Mode  :character  
##  Mean   :0.9942   Mean   :0.398                                             
##  3rd Qu.:2.0000   3rd Qu.:1.000                                             
##  Max.   :7.0000   Max.   :6.000                                             
##   Port_modem           Tablet          InternetService       Phone          
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Multiple         OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies    PaperlessBilling  
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  PaymentMethod          Tenure      Avg_MonthlyCharge Avg_Bandwidth_GB_Year
##  Length:10000       Min.   : 1.00   Min.   : 77.51    Min.   : 155.5       
##  Class :character   1st Qu.: 8.70   1st Qu.:141.07    1st Qu.:1312.1       
##  Mode  :character   Median :36.20   Median :169.92    Median :3382.4       
##                     Mean   :34.66   Mean   :174.08    Mean   :3397.2       
##                     3rd Qu.:60.15   3rd Qu.:203.78    3rd Qu.:5466.3       
##                     Max.   :72.00   Max.   :315.88    Max.   :7159.0       
##  TimelyResponse   TimelyFixes    TimelyReplacements  Reliability   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000      Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000      1st Qu.:3.000  
##  Median :3.000   Median :4.000   Median :3.000      Median :3.000  
##  Mean   :3.491   Mean   :3.505   Mean   :3.487      Mean   :3.498  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000      3rd Qu.:4.000  
##  Max.   :7.000   Max.   :7.000   Max.   :8.000      Max.   :7.000  
##     Options      RespectfulResponse CourteousExchange EvidenceofActiveListening
##  Min.   :1.000   Min.   :1.000      Min.   :1.00      Min.   :1.000            
##  1st Qu.:3.000   1st Qu.:3.000      1st Qu.:3.00      1st Qu.:3.000            
##  Median :3.000   Median :3.000      Median :4.00      Median :3.000            
##  Mean   :3.493   Mean   :3.497      Mean   :3.51      Mean   :3.496            
##  3rd Qu.:4.000   3rd Qu.:4.000      3rd Qu.:4.00      3rd Qu.:4.000            
##  Max.   :7.000   Max.   :8.000      Max.   :7.00      Max.   :8.000

At first glance, these statistics do not look unreasonable. For example, the number of children having a minimum of 0 children and a maximum of 10 children is certainly possible. The mean of 1.8 children also seems to fit within expectations. The minimum population at 0 and the minimum income at 740.70 are suspicious. To further investigate, the standard deviation can be calculated for each of these variables.

sd(clean_CustChurn$Population)

## [1] 14432.7

sd(clean_CustChurn$Contacts)

## [1] 0.9884655

sd(clean_CustChurn$Avg_outage_sec_perweek)

## [1] 7.025921

sd(clean_CustChurn$Email)

## [1] 3.025898

sd(clean_CustChurn$Income_signup)

## [1] 24747.87

sd(clean_CustChurn$Children_signup)

## [1] 1.925971

sd(clean_CustChurn$Avg_MonthlyCharge)

## [1] 43.33547

sd(clean_CustChurn$Yearly_equip_failure)

## [1] 0.6359532

Step Six - Treat Outliers

Based on the results of the statistics and the standards deviations, it appears that the outliers for the following should not be treated because they are below three standard deviations from the mean.

Contacts
Yearly_equip_failure
Children_signup

Other outliers that, although they are higher than 3 standard deviations from the mean, still appear to be reasonable and should not be treated are:

Avg_outage_sec_perweek
Email

Both of these make sense when compared to other variables. Email is the number of emails- marketing and other correspondence. That correspondence could be customer service issues. If the average outage per week is higher, those customers may have more correspondence as a result. In addition, there was a larger difference between the minimum and maximum for yearly equipment failures. That could also be related to the outages and the emails.

Income_signup is another variable with outliers that should not be treated without more information. Incomes vary significantly across the population so the outliers are not unusual. While the minimum income may be an error, there is not enough information to determine that at this point.

Population, like income, can vary significantly across the country. Some customers will live in very remote areas with few neighbors and others may live in densely populated cities. The large variations are expected. The minimum of 0 appears to be an error, however, without more information from the client, it should not be treated.

Finally, the Avg_MonthlyCharge has a high standard deviation. There is a very large spread between the minimum and the maximum. In this case, it could be explained by customers who subscribe to the very basic package at the low end and customers who subscribe to everything or have multiple accounts (landlords, caregivers) at the high end, with the majority of customers falling somewhere in the middle. The client should be able to check their billing records to confirm the outliers before treating them, so they should just be noted at this point.

Step Seven - Re-express Categorical Variables

The categorical variables can be re-expressed as numeric in order to make processing easier for future use. For small files, non-ordinal values, one-hot encoding can work well. In this case, there are many variables and many records within those variables, which would increase the file size significantly. For the categorical data, I used label encoding. Label encoding assigns a numeric value to each categorical value. It is ideal for yes/no values. It can imply an order where there is none, however, it is better for larger datasets. The superml package makes it easy to use the label encoding method.

Install superml Package:

#(Mulani, 2021)
#To install superml for the first time use:
#install.packages("superml")
library(superml)

## Loading required package: R6

label <- LabelEncoder$new()
clean_CustChurn$Customer_id <- label$fit_transform(clean_CustChurn$Customer_id)
head(clean_CustChurn$Customer_id)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$Interaction <- label$fit_transform(clean_CustChurn$Interaction)
head(clean_CustChurn$Interaction)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$City <- label$fit_transform(clean_CustChurn$City)
head(clean_CustChurn$City)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$State <- label$fit_transform(clean_CustChurn$State)
head(clean_CustChurn$State)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$County <- label$fit_transform(clean_CustChurn$County)
head(clean_CustChurn$County)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$Area <- label$fit_transform(clean_CustChurn$Area)
head(clean_CustChurn$Area)

## [1] 0 0 0 1 1 0

label <- LabelEncoder$new()
clean_CustChurn$Timezone_signup <- label$fit_transform(clean_CustChurn$Timezone_signup)
head(clean_CustChurn$Timezone_signup)

## [1] 0 1 2 2 3 4

label <- LabelEncoder$new()
clean_CustChurn$Job_CustInvHold_signup <- label$fit_transform(clean_CustChurn$Job_CustInvHold_signup)
head(clean_CustChurn$Job_CustInvHold_signup)

## [1] 0 1 2 3 4 5

label <- LabelEncoder$new()
clean_CustChurn$Education_signup <- label$fit_transform(clean_CustChurn$Education_signup)
head(clean_CustChurn$Education_signup)

## [1] 0 1 1 2 0 3

label <- LabelEncoder$new()
clean_CustChurn$Employment_signup <- label$fit_transform(clean_CustChurn$Employment_signup)
head(clean_CustChurn$Employment_signup)

## [1] 0 1 2 1 2 3

label <- LabelEncoder$new()
clean_CustChurn$Marital_signup <- label$fit_transform(clean_CustChurn$Marital_signup)
head(clean_CustChurn$Marital_signup)

## [1] 0 1 0 1 2 3

label <- LabelEncoder$new()
clean_CustChurn$Gender_signup <- label$fit_transform(clean_CustChurn$Gender_signup)
head(clean_CustChurn$Gender_signup)

## [1] 0 1 1 0 0 1

clean_CustChurn$Techie_signup <- label$fit_transform(clean_CustChurn$Techie_signup)
head(clean_CustChurn$Techie_signup)

## [1] 0 1 1 1 0 0

label <- LabelEncoder$new()
clean_CustChurn$Contract <- label$fit_transform(clean_CustChurn$Contract)
head(clean_CustChurn$Contract)

## [1] 0 1 2 2 1 0

clean_CustChurn$Port_modem <- label$fit_transform(clean_CustChurn$Port_modem)
head(clean_CustChurn$Port_modem)

## [1] 0 1 0 1 0 0

clean_CustChurn$Tablet <- label$fit_transform(clean_CustChurn$Tablet)
head(clean_CustChurn$Tablet)

## [1] 0 0 1 1 1 1

label <- LabelEncoder$new()
clean_CustChurn$InternetService <- label$fit_transform(clean_CustChurn$InternetService)
head(clean_CustChurn$InternetService)

## [1] 0 0 1 1 0 2

clean_CustChurn$Phone <- label$fit_transform(clean_CustChurn$Phone)
head(clean_CustChurn$Phone)

## [1] 0 0 0 0 1 0

clean_CustChurn$Multiple <- label$fit_transform(clean_CustChurn$Multiple)
head(clean_CustChurn$Multiple)

## [1] 0 1 1 0 0 1

clean_CustChurn$Churn <- label$fit_transform(clean_CustChurn$Churn)
head(clean_CustChurn$Churn)

## [1] 0 1 0 0 1 0

clean_CustChurn$OnlineSecurity <- label$fit_transform(clean_CustChurn$OnlineSecurity)
head(clean_CustChurn$OnlineSecurity)

## [1] 0 0 1 0 1 0

clean_CustChurn$OnlineBackup <- label$fit_transform(clean_CustChurn$OnlineBackup)
head(clean_CustChurn$OnlineBackup)

## [1] 0 1 1 1 1 0

clean_CustChurn$DeviceProtection <- label$fit_transform(clean_CustChurn$DeviceProtection)
head(clean_CustChurn$DeviceProtection)

## [1] 0 0 0 0 0 1

clean_CustChurn$TechSupport <- label$fit_transform(clean_CustChurn$TechSupport)
head(clean_CustChurn$TechSupport)

## [1] 0 0 0 0 1 0

clean_CustChurn$StreamingTV <- label$fit_transform(clean_CustChurn$StreamingTV)
head(clean_CustChurn$StreamingTV)

## [1] 0 1 0 1 1 0

clean_CustChurn$StreamingMovies <- label$fit_transform(clean_CustChurn$StreamingMovies)
head(clean_CustChurn$StreamingMovies)

## [1] 0 0 0 1 1 0

clean_CustChurn$PaperlessBilling <- label$fit_transform(clean_CustChurn$PaperlessBilling)
head(clean_CustChurn$PaperlessBilling)

## [1] 0 0 0 0 1 1

label <- LabelEncoder$new()
clean_CustChurn$PaymentMethod <- label$fit_transform(clean_CustChurn$PaymentMethod)
head(clean_CustChurn$PaymentMethod)

## [1] 0 1 0 2 2 3

Step Eight - Extract Clean Data File

Extract file:

write.csv(clean_CustChurn, "./churn_clean_data.csv")

Limitations of the Data Cleaning Process

In reviewing the variables, it is clear that much of the demographic data is limited in its usefulness as it may be outdated because it hasn’t been updated since the customer signed up. The variables with this type of limitation are:

Area
Job (This is also limited due to the possibility that the variable may represent either the invoice holder or the customer)
Children
Age (Age can be somewhat estimated using months of tenure, but it will be off as we do not have the customer’s birth-date. This may be acceptable.)
Education
Employment
Income
Marital
Gender (Not likely to have changed but still a possibility)
Techie (The customer may have gained some technical skills since signing up and now considers themselves to be a techie)

Other variables with limitations are:

Emails (The number of emails don’t specify if the email was for marketing purposes or for some other purpose)
Contacts (The number of contacts with tech support don’t give a time period over which the contacts happened. It may be possible to assume it is a running total over the customer’s tenure.)

While finding patterns that rely on this data is still possible, it would be much more accurate to survey the customers often to get updates on a regular basis.

Replacing the missing values with median and mode is also limiting. It would be better to know why the values are missing and try to get the correct information entered, if there is any. Imputing the values may skew the results.

Outliers are another limiting issue. There were several variables with outliers, however, it could not be readily confirmed that the outliers were input errors. Therefore, the ouliers were noted, and not treated. The business should be consulted to determine how to proceed with the outliers.

These issues may impact the research question, particularly the issue of the stale demographic data. Any one of the stale categories could skew the results, although some variables are more likely to have updates than others (marital status would be more likely to change than gender, for example). This is the most pressing issue for the business to address if they want to be able to have a good answer for their question and to be able to continue to monitor the progress of any remediation efforts to reduce churn.

Principal Component Analysis (PCA)

PCA is a method to reduce the dimensionality in a dataset. It helps to explain a significant percentage of the variance using just a few components, rather than all of them. It can reduce redundancy in the data and make machine learning faster. The variables need to be strongly correlated to perform the analysis. The dataset for PCA also needs to be cleaned first so that it contains no missing values, outliers that should have been removed, and non-numeric values.

Define Variables

Because PCA can only be performed on numeric values, variables like city, state, marital status, etc. will not be used. The following numeric variables will be used for PCA:

Lat
Lng
Population
Children_signup
Age_ signup
Income_signup
Avg_outage_sec_perweek
Email
Contacts
Yearly_equip_failure
Tenure
Avg_MonthlyCharge
Avg_Bandwidth_GB_Year
TimelyResponse
TimelyFixes
TimelyReplacements
Reliability
Options
RespectfulResponse
CourteousExchange
EvidenceofActiveListening

Store Variables in New DataFrame:

#To make the PCA coding easier, I'm adding the chosen variables to a new dataframe. I also used the column numbers from the output when renaming the columns rather than the column names to make the code shorter
CCdf <- clean_CustChurn[ , c(9:11, 15:16, 19, 23:26, 42:52)]
colnames(CCdf)

##  [1] "Lat"                       "Lng"                      
##  [3] "Population"                "Children_signup"          
##  [5] "Age_signup"                "Income_signup"            
##  [7] "Avg_outage_sec_perweek"    "Email"                    
##  [9] "Contacts"                  "Yearly_equip_failure"     
## [11] "Tenure"                    "Avg_MonthlyCharge"        
## [13] "Avg_Bandwidth_GB_Year"     "TimelyResponse"           
## [15] "TimelyFixes"               "TimelyReplacements"       
## [17] "Reliability"               "Options"                  
## [19] "RespectfulResponse"        "CourteousExchange"        
## [21] "EvidenceofActiveListening"

Normalize Variables:

#(Nistrup, 2020)
pca.data <- prcomp(CCdf, center = TRUE, scale = TRUE)
#View PCA standard deviations, proportions of variance, and cumulative proportions of variance
summary(pca.data)

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7170 1.37854 1.27957 1.11205 1.06339 1.02531 1.01838
## Proportion of Variance 0.1404 0.09049 0.07797 0.05889 0.05385 0.05006 0.04939
## Cumulative Proportion  0.1404 0.23088 0.30885 0.36774 0.42158 0.47164 0.52103
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.00793 1.00244 0.99309 0.99093 0.98058 0.93040 0.88247
## Proportion of Variance 0.04838 0.04785 0.04696 0.04676 0.04579 0.04122 0.03708
## Cumulative Proportion  0.56941 0.61726 0.66422 0.71098 0.75677 0.79799 0.83507
##                           PC15    PC16    PC17   PC18    PC19    PC20   PC21
## Standard deviation     0.85786 0.83045 0.76909 0.7332 0.69406 0.56938 0.3208
## Proportion of Variance 0.03504 0.03284 0.02817 0.0256 0.02294 0.01544 0.0049
## Cumulative Proportion  0.87012 0.90296 0.93112 0.9567 0.97966 0.99510 1.0000

#View correlations
pca.data$rotation

##                                     PC1          PC2          PC3           PC4
## Lat                       -0.0012988768  0.023302535 -0.008737299 -0.7109294805
## Lng                        0.0081156770 -0.009778395  0.023050144  0.1694971516
## Population                -0.0021648497 -0.005644212  0.016693178  0.6483282604
## Children_signup            0.0004594240  0.001132978  0.010446123 -0.0378825680
## Age_signup                 0.0050576345  0.013214185 -0.017046954  0.0233016879
## Income_signup             -0.0008520682 -0.007709773  0.023715144 -0.0592077802
## Avg_outage_sec_perweek    -0.0131743504 -0.017965733 -0.047804233 -0.0989855571
## Email                      0.0086895953  0.020265552 -0.003392581  0.1448429005
## Contacts                  -0.0085391645 -0.003895242 -0.010323281  0.0280424337
## Yearly_equip_failure      -0.0077296632 -0.016043096  0.006845499 -0.0137783361
## Tenure                    -0.0106657235 -0.700860365 -0.071780032 -0.0040584771
## Avg_MonthlyCharge         -0.0004931141 -0.043355506 -0.024276911 -0.0706712652
## Avg_Bandwidth_GB_Year     -0.0125683108 -0.702562040 -0.073999492 -0.0100915193
## TimelyResponse             0.4588348341 -0.032034605  0.280088475 -0.0163117494
## TimelyFixes                0.4340607128 -0.042776009  0.280895233 -0.0228070939
## TimelyReplacements         0.4007815228 -0.034729050  0.280701477 -0.0031318090
## Reliability                0.1456316918  0.050443834 -0.566922189  0.0008237461
## Options                   -0.1754870143 -0.066118921  0.585658369 -0.0137071149
## RespectfulResponse         0.4049914444  0.012186983 -0.183816521  0.0127144337
## CourteousExchange          0.3582397322  0.003824246 -0.180957728 -0.0159301032
## EvidenceofActiveListening  0.3087483611  0.016378170 -0.131939371  0.0420742080
##                                     PC5          PC6           PC7
## Lat                       -0.0754548661  0.125157397  1.201509e-02
## Lng                        0.0718088371 -0.697690812 -4.653143e-01
## Population                 0.0599885591  0.272814589  2.607800e-01
## Children_signup            0.0203065093 -0.566165567  3.012516e-01
## Age_signup                -0.0416418298  0.209472382 -4.509169e-01
## Income_signup             -0.0151598811 -0.095565930  1.669441e-01
## Avg_outage_sec_perweek     0.6937445570  0.048720927  8.502775e-02
## Email                      0.0914789449 -0.005126588 -2.115085e-01
## Contacts                  -0.0006395943  0.181757138 -5.080012e-01
## Yearly_equip_failure       0.0533217770 -0.090675692  2.738410e-01
## Tenure                    -0.0588330162  0.012167692 -1.292710e-02
## Avg_MonthlyCharge          0.6941250825  0.044792084 -8.524157e-02
## Avg_Bandwidth_GB_Year     -0.0103951286 -0.004268561  3.918978e-03
## TimelyResponse             0.0300809461 -0.006513592 -2.529346e-04
## TimelyFixes                0.0143691364  0.022960470 -8.815904e-03
## TimelyReplacements        -0.0110899282  0.003512057 -2.361884e-03
## Reliability               -0.0310644507 -0.007453986 -7.913205e-03
## Options                    0.0232759223  0.040114111 -2.003365e-02
## RespectfulResponse         0.0062402525  0.013523369  1.488216e-02
## CourteousExchange         -0.0351877075 -0.021344776  4.238183e-05
## EvidenceofActiveListening  0.0358393805  0.029409180  1.208392e-02
##                                     PC8           PC9          PC10
## Lat                       -0.0755803560 -0.0023098738 -0.0759715002
## Lng                        0.0667557807 -0.0377899974  0.2104743461
## Population                 0.0334337505  0.0919283564 -0.0036515457
## Children_signup            0.0157526422  0.0780022695 -0.3189486964
## Age_signup                 0.4564897948 -0.0856443461 -0.4202640311
## Income_signup              0.2792343458  0.8297305702 -0.2252327243
## Avg_outage_sec_perweek     0.0204287347  0.0312633943 -0.0088069011
## Email                     -0.5272436019 -0.0643204375 -0.7334160461
## Contacts                   0.2039986844  0.2583411923  0.0636474539
## Yearly_equip_failure       0.6153084826 -0.4597608628 -0.2637133820
## Tenure                    -0.0116073601 -0.0048543146 -0.0222556425
## Avg_MonthlyCharge          0.0121842062  0.0210074673  0.0705865061
## Avg_Bandwidth_GB_Year     -0.0170571601  0.0006799138 -0.0075854017
## TimelyResponse             0.0095540293 -0.0162503921  0.0074544441
## TimelyFixes                0.0074353144  0.0022344086 -0.0004708455
## TimelyReplacements        -0.0316154001 -0.0302274469  0.0048692156
## Reliability               -0.0042202070 -0.0203604651  0.0240735339
## Options                    0.0024812055 -0.0062062195  0.0057337507
## RespectfulResponse         0.0049911489  0.0078937857 -0.0212271981
## CourteousExchange         -0.0009372125  0.0533046973 -0.0231981492
## EvidenceofActiveListening  0.0414419446 -0.0020036497  0.0452467705
##                                   PC11         PC12         PC13          PC14
## Lat                        0.038392961  0.029193630 -0.008435733 -0.0840239782
## Lng                       -0.115547254 -0.232244918  0.064759742 -0.0738585170
## Population                 0.032340013  0.141295584 -0.005426483 -0.1596223363
## Children_signup            0.191073833  0.662255603  0.005694298  0.0309607662
## Age_signup                -0.484380454  0.326994386  0.105211005 -0.0632318701
## Income_signup             -0.133377325 -0.339254195 -0.066087238  0.0131052776
## Avg_outage_sec_perweek     0.036854876 -0.042764091  0.690129255 -0.1143995304
## Email                      0.138113768 -0.278785657 -0.046423950  0.0635505096
## Contacts                   0.751483623  0.162324589  0.012464761  0.0409153224
## Yearly_equip_failure       0.304132072 -0.380069970 -0.117946066  0.0218001512
## Tenure                    -0.012354135 -0.006055513  0.037777983 -0.0003023203
## Avg_MonthlyCharge         -0.098490635  0.113446904 -0.683862986  0.0487590746
## Avg_Bandwidth_GB_Year      0.004165589  0.011855598 -0.012136274  0.0119795574
## TimelyResponse             0.019762678 -0.001088162 -0.006554008 -0.0742180726
## TimelyFixes                0.006856614  0.017016600 -0.002005392 -0.1101556217
## TimelyReplacements        -0.008391554 -0.014952716 -0.010288717 -0.1771363080
## Reliability               -0.002334510 -0.017647997 -0.021683794 -0.1864450826
## Options                   -0.006939835  0.009589566  0.041002981  0.1302463247
## RespectfulResponse         0.023539347  0.002343695 -0.007449126 -0.0541041014
## CourteousExchange          0.039169002 -0.006474351 -0.026557663 -0.1501884266
## EvidenceofActiveListening -0.070907293  0.016707008  0.124209235  0.8993615451
##                                    PC15         PC16          PC17
## Lat                       -0.6629870430 -0.097948218 -0.0444566052
## Lng                       -0.3578777953 -0.067187627 -0.0386396580
## Population                -0.6058237029 -0.099985603 -0.0117364744
## Children_signup            0.0012306259  0.037194136  0.0186275404
## Age_signup                 0.0027839366 -0.010549463  0.0009172031
## Income_signup              0.0080161361  0.061188394  0.0060658071
## Avg_outage_sec_perweek     0.0433576746 -0.008893558  0.0128163912
## Email                     -0.0405545319  0.013305971  0.0148553118
## Contacts                   0.0037455091  0.037913094  0.0046247484
## Yearly_equip_failure      -0.0416869178 -0.009971406  0.0132965936
## Tenure                    -0.0097231631  0.008787332 -0.0080746753
## Avg_MonthlyCharge         -0.0043105093 -0.008924304  0.0152696466
## Avg_Bandwidth_GB_Year     -0.0087335179 -0.003313837 -0.0038227193
## TimelyResponse            -0.0232415563  0.114271236  0.0459474895
## TimelyFixes                0.0007022538  0.170293003  0.0681366550
## TimelyReplacements        -0.0009642702  0.247244773  0.1487966872
## Reliability               -0.0693032276  0.468333268  0.4471573264
## Options                   -0.0109178537 -0.061321498  0.2108651173
## RespectfulResponse         0.0628733114 -0.050875212 -0.7551424353
## CourteousExchange          0.1377160478 -0.799827196  0.3735231765
## EvidenceofActiveListening -0.1772780546  0.009975454  0.1097472335
##                                    PC18          PC19          PC20
## Lat                       -0.0041456851  0.0156741589 -0.0111056968
## Lng                        0.0177542845  0.0004404725 -0.0246985512
## Population                 0.0008659297  0.0016324442 -0.0077723112
## Children_signup            0.0113840191  0.0186316052 -0.0077147097
## Age_signup                -0.0122828391  0.0085863229  0.0160254976
## Income_signup              0.0006217216  0.0138000785  0.0049397946
## Avg_outage_sec_perweek    -0.0163580933  0.0100097533 -0.0043115885
## Email                      0.0065883396 -0.0162277486  0.0010132409
## Contacts                  -0.0263230183  0.0207755968 -0.0005300704
## Yearly_equip_failure      -0.0007583610  0.0072506350 -0.0213011075
## Tenure                    -0.0121585157  0.0065780270  0.0049847367
## Avg_MonthlyCharge          0.0003496937  0.0211408944 -0.0127283407
## Avg_Bandwidth_GB_Year     -0.0016208746 -0.0077506850  0.0071436936
## TimelyResponse             0.0243576194 -0.2391680316  0.7930417038
## TimelyFixes                0.0680061250 -0.5923124854 -0.5726095223
## TimelyReplacements        -0.3926001393  0.6756428578 -0.1771996148
## Reliability                0.4306992598  0.0853375511  0.0180427195
## Options                    0.6944182200  0.2604452146 -0.0428154633
## RespectfulResponse         0.4057578374  0.2273641395 -0.0651350723
## CourteousExchange          0.0677631015  0.0647317441 -0.0412502694
## EvidenceofActiveListening -0.0439110025  0.0470594076 -0.0430650393
##                                    PC21
## Lat                        0.0009630604
## Lng                        0.0008761430
## Population                -0.0007493478
## Children_signup           -0.0182853241
## Age_signup                 0.0216573095
## Income_signup              0.0013977966
## Avg_outage_sec_perweek     0.0008521392
## Email                      0.0054223039
## Contacts                  -0.0026509615
## Yearly_equip_failure      -0.0023938232
## Tenure                    -0.7049911957
## Avg_MonthlyCharge         -0.0482167849
## Avg_Bandwidth_GB_Year      0.7067644307
## TimelyResponse            -0.0028995362
## TimelyFixes               -0.0030164735
## TimelyReplacements         0.0145175277
## Reliability                0.0017026995
## Options                   -0.0029235647
## RespectfulResponse         0.0013754214
## CourteousExchange         -0.0067500949
## EvidenceofActiveListening -0.0029899801

To determine which components to keep, they should have an eigenvalue greater than 1. A scree plot can determine which variables have an eigenvalue over 1. Base R can create a basic scree plot. The R package factoextra can also produce a nice scree plot. Factoextra contains many other tools including FactoMineR, plyr, caret, corrplot and many more. FactoMineR and plyr would likely be sufficient, but the other tools may be useful for future projects.

Create Scree Plot of Eigenvalues:

screeplot(pca.data, type = "l", npcs = 21, pch = 17, col = "red", main = "Screeplot for PCA")
abline(h = 1, col="blue", lty="dotted", lwd = 2)
legend("topright", legend=c("Eigenvalue = 1"),
       col=c("red"), lty="dotted", cex=0.6)

#To install factoextra for the first time use:
#install.packages("factoextra")
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

fviz_eig(pca.data, choice = "eigenvalue", addlabels = TRUE, bar_width = 0.5, barfill = "cadetblue", linecolor = "goldenrod2", ncp = 21)

The eigenvalue charts both had a few that looked like they were right at 1 (7 - 12) but because it could have been due to rounding issues, I took a closer look at the PCA values to see which ones might be rounding.

Obtain Eigenvalues:

get_eig(pca.data)

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.9481201       14.0386671                    14.03867
## Dim.2   1.9003754        9.0494067                    23.08807
## Dim.3   1.6373045        7.7966881                    30.88476
## Dim.4   1.2366499        5.8888089                    36.77357
## Dim.5   1.1308029        5.3847758                    42.15835
## Dim.6   1.0512624        5.0060115                    47.16436
## Dim.7   1.0370929        4.9385376                    52.10290
## Dim.8   1.0159209        4.8377184                    56.94061
## Dim.9   1.0048894        4.7851876                    61.72580
## Dim.10  0.9862229        4.6962997                    66.42210
## Dim.11  0.9819472        4.6759391                    71.09804
## Dim.12  0.9615399        4.5787615                    75.67680
## Dim.13  0.8656462        4.1221247                    79.79893
## Dim.14  0.7787583        3.7083729                    83.50730
## Dim.15  0.7359178        3.5043703                    87.01167
## Dim.16  0.6896399        3.2839994                    90.29567
## Dim.17  0.5914997        2.8166651                    93.11233
## Dim.18  0.5375789        2.5598995                    95.67223
## Dim.19  0.4817194        2.2939020                    97.96614
## Dim.20  0.3241938        1.5437798                    99.50992
## Dim.21  0.1029178        0.4900845                   100.00000

Principal components 10 -12 are just under 1 but round to 1. If rounding out to 0 decimal places, then PCA 9 is the last principal component to accept. If we round based on the third decimal place, then PCA 10 is acceptable as it will round to 1. If round by the second decimal place, then PCA 11 and 12 would also be acceptable because they would round to 1. Which one is chosen depends on how much of the variance the business is comfortable having explained. Using just the first nine explains over 60% of the variance. If 12 are used, then the explained variance goes up to over 75%. The next charts show the difference between using 9, 10, or 12 principal components.

Produce Charts Visualizing the Amount of Variance the Chosen Number of Principal Components Explains:

#9 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 9, col="red", lty="dotted", lwd = 2)
abline(h = 0.617258, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC9"),
       col=c("red"), lty="dotted", lwd = 2, cex=0.6)

#10 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 10, col="red", lty="dotted", lwd = 2)
abline(h = 0.66422, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC10"),
       col=c("red"), lty="dotted", lwd = 2, cex=0.6)

#12 PCs
cum.pca <- cumsum(pca.data$sdev^2 / sum(pca.data$sdev^2))
plot(cum.pca[0:21], xlab = "Principal Component Number", ylab = "Amount of Explained Variance", main = "Cumulative Explained Variance")
abline(v = 12, col="red", lty="dotted", lwd = 2)
abline(h = 0.756768, col="red", lty="dotted", lwd = 2)
legend("center", legend=c("Intersection at PC12"),
       col=c("red"), lty="dotted", lwd = 2, cex=0.6)

PCA Summary:

The Principal Component Analysis showed the over 60% of the variance could be explained by using only 9 components. More than 75% could be explained by using 12 components. Using Principal Component Analysis is useful for the business because it reduces the dimensionality of a large dataset that needs to be analyzed. It helps to observe patterns and trends in the data by showing the relationships between the variables and observations. A business can use this method to determine patterns in customer behavior to help them predict which groups of customers are more susceptible to churn. Knowing which customer groups are more likely to end their subscriptions, helps the business set a mitigation plan in motion to reduce the churn through marketing or education efforts, improved customer services, better product development, etc., depending on what the analysis shows. In this case, such an analysis may answer the research question if there are groups of customers more susceptible to churn because it could uncover some relationships between survey responses about equipment failure being rated poorly and low tenure, for example. This would indicate to the business that they need to work on product development. This is why it is so important for the business to keep the data fresh and not rely on stale data. The need to monitor the progress of any mitigation plan they put in place to ensure it is working.

References

Leek, J. (2017). Reading Local Flat Files [Video Course]. Coursera Data Science Specialization. www.coursera.com

Zach. (2020, October 27). How to Import CSV Files into R (Step-by-Step). Statology. https://www.statology.org/import-csv-into-r/#:~:text=1.%20Use%20read.csv%20from%20base%20R%20%28Slowest%20method%2C

Fast Reading of Data From TXT|CSV Files into R: readr package - Easy Guides - Wiki - STHDA. (n.d.). www.sthda.com. Retrieved December 28, 2021, from http://www.sthda.com/english/wiki/fast-reading-of-data-from-txt-csv-files-into-r-readr-package#:~:text=The%20readr%20package%20contains%20functions%20for%20reading%20i%29

Examine a Data Frame in R with 7 Basic Functions. (2016, November 29). (R)Very Day. https://rveryday.wordpress.com/2016/11/29/examine-a-data-frame-in-r-with-7-basic-functions/

Larose, C.D., & Larose, D.T. (2019). Data Science Using Python and R. John Wiley & Sons. ISBN-13:978-1-119-52684-1

Zach. (2019, March 12). How to Rename Data Frame Columns in R. Statology. https://www.statology.org/how-to-rename-data-frame-columns-in-r/

Makowski, D. (2018, March 29). Standardize (Z-score) a Dataframe. R-Bloggers. Retrieved December 31, 2021, from https://www.r-bloggers.com/2018/03/standardize-z-score-a-dataframe/

Mulani, S. (2021, March 4). Label Encoding in R programming – All you need to know! JournalDev. https://www.journaldev.com/48639/label-encoding-in-r

Nistrup, P. (2020, January 28). Principal Component Analysis (PCA) 101, using R. Medium. https://towardsdatascience.com/principal-component-analysis-pca-101-using-r-361f4c53a9ff