LAB 1 — Data Understanding, Preparation & Exploratory Analysis Dataset: week2_churn_data ________________________________________

Tasks

  1. Data Familiarization Load the provided dataset and:
library(readr)
churn.df <- read_csv("R working directory/week2_churn_data.csv")
#View(churn.df)

• Identify the business outcome variable

dim(churn.df)  # find the dimension of data frame
#> [1] 400   8
head(churn.df)
#> # A tibble: 6 × 8
#>   customer_id tenure_months monthly_charges contract_type  online_security
#>         <dbl>         <dbl>           <dbl> <chr>          <chr>          
#> 1           1            49            68.6 One year       Yes            
#> 2           2            29            77.4 One year       Yes            
#> 3           3            35            80.2 Two year       Yes            
#> 4           4             1            58.4 One year       Yes            
#> 5           5            23            50.8 Month-to-month Yes            
#> 6           6            31            61.0 Month-to-month No             
#> # ℹ 3 more variables: tech_support <chr>, churn <chr>, total_charges <dbl>

For this data set, it appears likely that manager want to identify which customers are most likely to discontinue services or churn.Decision makers may want to offer incentives from customers likely to stay or they may choose to stop providing perks to customers more likely to end their contracts.

Based on the variable names decision makers might also want to explore insights about different variables like total_charges or contract_type.

• Identify predictor variables and their types (numeric, categorical, binary)

summary(churn.df)
#>   customer_id    tenure_months   monthly_charges  contract_type     
#>  Min.   :  1.0   Min.   : 1.00   Min.   : 13.08   Length:400        
#>  1st Qu.:100.8   1st Qu.:16.00   1st Qu.: 54.31   Class :character  
#>  Median :200.5   Median :29.00   Median : 66.91   Mode  :character  
#>  Mean   :200.5   Mean   :29.79   Mean   : 68.01                     
#>  3rd Qu.:300.2   3rd Qu.:44.00   3rd Qu.: 81.89                     
#>  Max.   :400.0   Max.   :59.00   Max.   :137.16                     
#>  online_security    tech_support          churn           total_charges    
#>  Length:400         Length:400         Length:400         Min.   :  57.15  
#>  Class :character   Class :character   Class :character   1st Qu.: 962.00  
#>  Mode  :character   Mode  :character   Mode  :character   Median :1841.82  
#>                                                           Mean   :2021.82  
#>                                                           3rd Qu.:2848.20  
#>                                                           Max.   :7310.18

Using the summary function we see variable that could be used as predictor listed as: customer_id
tenure_months
monthly_charges
contract_type
online_security
tech_support
total_charges

This is given that we may want to use a model to predict if a customer will churn.

str(churn.df)
#> spc_tbl_ [400 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ customer_id    : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ tenure_months  : num [1:400] 49 29 35 1 23 31 24 9 3 11 ...
#>  $ monthly_charges: num [1:400] 68.6 77.4 80.2 58.4 50.8 ...
#>  $ contract_type  : chr [1:400] "One year" "One year" "Two year" "One year" ...
#>  $ online_security: chr [1:400] "Yes" "Yes" "Yes" "Yes" ...
#>  $ tech_support   : chr [1:400] "Yes" "Yes" "No" "Yes" ...
#>  $ churn          : chr [1:400] "Yes" "Yes" "No" "No" ...
#>  $ total_charges  : num [1:400] 3691 2114.3 2915.4 57.1 1259.1 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   customer_id = col_double(),
#>   ..   tenure_months = col_double(),
#>   ..   monthly_charges = col_double(),
#>   ..   contract_type = col_character(),
#>   ..   online_security = col_character(),
#>   ..   tech_support = col_character(),
#>   ..   churn = col_character(),
#>   ..   total_charges = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr>

When we inspect the structure of the data frame the corresponding variable types as: customer_id: number, tenure_months: number, monthly_charges:number, contract_type: character, online_security:character, tech_support: character, churn: character, total_charges: number.


  1. Data Quality Assessment Evaluate the dataset for: • Missing values
#churn.df(miss.val=sapply(churn.df,
 #                        function(x)sum(length(which(is.ns(X))))))
library(ggplot2)
library(reshape) 
melt.missing <- melt (missing)
ggplot(melt.missing, aes(x=X1, y=X2, fill=value)) +
  geom_tile(show.legend=FALSE) +
  scale_fill_gradient(low="white", high="#666666") +
  scale_x_continuous(expand=c(0,0))
#> Error in `geom_tile()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error:
#> ! object 'X1' not found
missing = churn.df
missing[missing == ""]= NA
missing = 1 * is.na(missing)
heatmap(missing, Rowv = NA, Colv = NA)

There appears to be no missing values in our data set. We can move forward with analysis.

• Outliers or extreme values

#order(churn.df$customer_id)
#order(churn.df$tenure_months)
#order(churn.df$monthly_charges)
#order(churn.df$contract_type)
#order(churn.df$online_security)
#order(churn.df$tech_support)
#order(churn.df$churn)
order(churn.df$total_charges)
#>   [1]   4  29  93 211 374 353 153 194 338 136 270 288 149 182  60 186 132 103
#>  [19] 246 158 162 278 134 293 247 368   9 383 389 252 144  50  44  42 367  79
#>  [37]  83  96  32 312 295 142 240 285  17 203  62 364 108 355 184 395 376 309
#>  [55] 152 200   8  48  10  64 326 174 275 140 256 385 125 221 215 352 304  13
#>  [73] 336 176  24 169  37 242  27 349 333 109 315 146 225  72 110 369 335 286
#>  [91] 262 365  31 120 341  43  38 163 307 239 394 190 280 171 392 237 202 121
#> [109] 197 195 324 386 292 337   7 331 321 170  92 212  40 266 208 167 328 366
#> [127] 249 348 147  59  71 227 357   5  87 317 277  63 378 363  19 318  95  46
#> [145] 379  26 272 116 345 251 228 219 155 372 180 330 135 141  88 298 137 107
#> [163] 148 301 102  90  89 248 114 106 362 241 126 218 196 316 305 181 351  45
#> [181]  58 271  22  28 340 189 223 243 384 122 236  99 284 111 100 332 373 231
#> [199]  66 342  23   6  41 314 346 234 347  52 113  74 143 358  15  35 245 131
#> [217] 308 296 371 210 118  75  11   2 235 381 139  12  84 302 117 129 350  70
#> [235]  80 232 303 398  49 213 306 310 290 313 325  65  16 391 273 382  82  25
#> [253] 359  39  54 150 261 334 191 161 344 281 123 396  20 233  21  77 220 151
#> [271] 255 168 322 119 159  67 206 274 380 282 138 254 375  94 377 327  30 253
#> [289] 192 360  69 250 124 222 287 393 294 259 265 397 216 127  53 205 238   3
#> [307] 199 390 179 356 329 343  98 387 130  78 311 279 133 198 258 300  57 178
#> [325] 166 319  56  33 207 260  91  34 165 175 112 230  86 204 299 128 145  14
#> [343] 388 172  76 105  81 101  73 244 323 291 173  47  68 226 104 399   1 183
#> [361]  61 217 193 156 257 157 224 289 188  51 320 201 269 154 297 268 164 229
#> [379]  36  18  97 115 214 177 361 209 185 160  55 370  85 267 263 283 187 354
#> [397] 264 400 276 339

One way to identify outliers is to inspect the data for extreme values.When visually inspecting the data for very large or small values no outliers stand out.However this metod is tedious and may lead to human error.

boxplot(churn.df$total_charges)

For numerical variables we can generate boxplots which show the range of the data and where the data appears in relation to other data points. With boxplots we can see extreme values that may warrant further investigation to determine if they are true outliers.

• Potential data coding issues (e.g., inconsistent categories)

library(fastDummies)
library(tidyverse)

When reviewing the

# make contract_type a factor variable
churn.df$contract_type <- factor(churn.df$contract_type)
str(churn.df$contract_type)
#>  Factor w/ 3 levels "Month-to-month",..: 2 2 3 2 1 1 3 2 1 2 ...
levels(churn.df$contract_type) # show factor's categories (levels)
#> [1] "Month-to-month" "One year"       "Two year"
# use tidyverse to load and preprocess data in one statement
# the %>% operator inserts the result of the expression on the left
# as the first argument into the function on the right
churn.df <- churn.df %>%
  mutate(contract_type=factor(contract_type))
churn.df <- dummy_cols(churn.df,
                 remove_selected_columns=TRUE,  # remove the original column
                 remove_first_dummy=TRUE)  # removes the first created dummy variable
churn.df %>% head(2)
#> # A tibble: 2 × 9
#>   customer_id tenure_months monthly_charges total_charges contract_type_One ye…¹
#>         <dbl>         <dbl>           <dbl>         <dbl>                  <int>
#> 1           1            49            68.6         3691.                      1
#> 2           2            29            77.4         2114.                      1
#> # ℹ abbreviated name: ¹​`contract_type_One year`
#> # ℹ 4 more variables: `contract_type_Two year` <int>,
#> #   online_security_Yes <int>, tech_support_Yes <int>, churn_Yes <int>

We can see that the customer_id is listed as a number but measuring the distance between to customer ids would likely yield no value. Additionally, the data for online_security, tech_support and churn are listed characters which represent binary outcome that could be coded differently.

Expected output: • Summary statistics – median, standard deviation, correlation analysis, mean

summary(churn.df)
#>   customer_id    tenure_months   monthly_charges  total_charges    
#>  Min.   :  1.0   Min.   : 1.00   Min.   : 13.08   Min.   :  57.15  
#>  1st Qu.:100.8   1st Qu.:16.00   1st Qu.: 54.31   1st Qu.: 962.00  
#>  Median :200.5   Median :29.00   Median : 66.91   Median :1841.82  
#>  Mean   :200.5   Mean   :29.79   Mean   : 68.01   Mean   :2021.82  
#>  3rd Qu.:300.2   3rd Qu.:44.00   3rd Qu.: 81.89   3rd Qu.:2848.20  
#>  Max.   :400.0   Max.   :59.00   Max.   :137.16   Max.   :7310.18  
#>  contract_type_One year contract_type_Two year online_security_Yes
#>  Min.   :0.00           Min.   :0.000          Min.   :0.00       
#>  1st Qu.:0.00           1st Qu.:0.000          1st Qu.:0.00       
#>  Median :0.00           Median :0.000          Median :0.00       
#>  Mean   :0.31           Mean   :0.355          Mean   :0.46       
#>  3rd Qu.:1.00           3rd Qu.:1.000          3rd Qu.:1.00       
#>  Max.   :1.00           Max.   :1.000          Max.   :1.00       
#>  tech_support_Yes   churn_Yes    
#>  Min.   :0.0000   Min.   :0.000  
#>  1st Qu.:0.0000   1st Qu.:0.000  
#>  Median :1.0000   Median :0.000  
#>  Mean   :0.5125   Mean   :0.275  
#>  3rd Qu.:1.0000   3rd Qu.:1.000  
#>  Max.   :1.0000   Max.   :1.000
#view(churn.df)

• At least two visualizations (e.g., boxplot, histogram, missingness heatmap)

boxplot(churn.df$tenure_months)

#boxplot(churn.df$monthly_charges ` churn.df$tenure_months,
 #       xlab="Monthly Charges", ylab="Tenure Months")
plot(churn.df[, c(1:9)])

cor.mat <-round(cor(churn.df), 2)
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x=X1, y=X2, fill=value)) +
  geom_tile() + xlab("") + ylab("") +
  scale_fill_distiller(palette="RdBu", limits=c(-1,1))

We can use the correlation matrix to identify variable that my have a greater effect of the churn rate than others.This heatmap shows us that there is some positive correlation between churn_Yes and Monthly_charges, with Total_charges and with tenure-months.We can explore these further by comparing side-by-side boxplots of churn_Yes.

boxplot(churn.df$monthly_charges ~ churn.df$churn_Yes,
     xlab="churn", ylab="Monthly charge")

When we compare values for those who churn and those that do not with regard to Monthly_charges we see little difference between the groups.

boxplot(total_charges ~ churn_Yes,
     xlab="churn", ylab="Total charge", data = churn.df)

With Total_charges we see a greater difference than with Monthly_charges.

boxplot(churn.df$tenure_months ~ churn.df$churn_Yes,
     xlab="churn", ylab="tenure months")

A more significant differnce between the groups appears when examining the relation between churn_Yes and tenture_months.The churn rate seems to be slightly higher for customers with a greater value for tenture months.

Total_charges and tenure_months are likely correlated since the longer tenure with results in accumulated monthly charges and higher Total charges.We should be mindful of this as we move forward with data modeling.


  1. Exploratory Data Analysis (EDA) Conduct EDA to explore relationships between predictors and the outcome. Important clarification — what counts as an “insight”:
plot(churn.df$total_charges ~ churn.df$tenure_months,
     xlab="Tenure Months", ylab="Total charge")

hist(churn.df$total_charges, xlab="Total Charges")

hist(churn.df$tenure_months)

plot(churn.df$total_charges ~ churn.df$tenure_months,
     xlab="Tenure Months", ylab="Total charge", log="y")

A substantive insight is: • A pattern supported by evidence (numeric summary or visualization)

We can see that the churn rate is greater for Total_charges and for tenure-months that for Monthly charges.The correlation matrix and results of differences between groups support the finding that Total_charges and tenture-months are correlated.The scatter plot indicates that this relationship is a non-linear. We also see that differences appear between those that have churned and those that have not. We would build a model that explores these differences while addressing isues of correlation between predictor variables.

, and • A short interpretation explaining why it matters for the business problem.

Business decision makers would be interested in these findings because they likely want to reduce the sales cost by retaining customers and increase revenue. Retaining customers is likely less expensive that identifying an obtaining new customers.Conversely a Business decision maker may want to reduce costs by offering few perks to customers that are mostly likely to end contracts.

Requirement: Provide at least three distinct insights, each supported by a plot or statistic.

Based on what we see from in the boxplots above we can conclude that there is a difference between customers that churn and customers that do not. The churn rate is influenced by both Total-charges and tenure months and theses variables are likely correlated.


  1. Business Framing Briefly explain how your findings would influence:

• Feature selection With Monthly charges and Total charges moderately correlated we would likely choose to use one feature and not the one when we build a model to evaluate likelihood of a customer churning.

• Modeling choices We can likely use a regression model to predict which customers are likely to churn, however we do want to be careful about using variables that are correlated, like Total_charges and tenure_months.

• Data preprocessing decisions We would likely test different models excluding the redundant variable from each to find which performs better.We would choose predictors that have the greatest variable explained by the predictor we keep in the model.

Expected output: 1–2 paragraphs. The churn data set can be used to build a machine learning model that predicts which customers are most likely to churn or end contracts.The insights gained from identifying customers would allow business leader to take appropriate steps to retain customers or reduce the costs associated with retaining customers that are not likely to stay with the company.

Independant models would be generated to included different variables so performance could be evaluated. Our goal is to finding the model with the best performance without overfitting the data.


Deliverables • R script (RMD file) • RMD file knitted output - notebook • Short written report (2 pages) summarizing findings ________________________________________