LAB 1 — Data Understanding, Preparation & Exploratory Analysis Dataset: week2_churn_data ________________________________________
Tasks
library(readr)
churn.df <- read_csv("R working directory/week2_churn_data.csv")
#View(churn.df)
• Identify the business outcome variable
dim(churn.df) # find the dimension of data frame
#> [1] 400 8
head(churn.df)
#> # A tibble: 6 × 8
#> customer_id tenure_months monthly_charges contract_type online_security
#> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 49 68.6 One year Yes
#> 2 2 29 77.4 One year Yes
#> 3 3 35 80.2 Two year Yes
#> 4 4 1 58.4 One year Yes
#> 5 5 23 50.8 Month-to-month Yes
#> 6 6 31 61.0 Month-to-month No
#> # ℹ 3 more variables: tech_support <chr>, churn <chr>, total_charges <dbl>
For this data set, it appears likely that manager want to identify which customers are most likely to discontinue services or churn.Decision makers may want to offer incentives from customers likely to stay or they may choose to stop providing perks to customers more likely to end their contracts.
Based on the variable names decision makers might also want to explore insights about different variables like total_charges or contract_type.
• Identify predictor variables and their types (numeric, categorical, binary)
summary(churn.df)
#> customer_id tenure_months monthly_charges contract_type
#> Min. : 1.0 Min. : 1.00 Min. : 13.08 Length:400
#> 1st Qu.:100.8 1st Qu.:16.00 1st Qu.: 54.31 Class :character
#> Median :200.5 Median :29.00 Median : 66.91 Mode :character
#> Mean :200.5 Mean :29.79 Mean : 68.01
#> 3rd Qu.:300.2 3rd Qu.:44.00 3rd Qu.: 81.89
#> Max. :400.0 Max. :59.00 Max. :137.16
#> online_security tech_support churn total_charges
#> Length:400 Length:400 Length:400 Min. : 57.15
#> Class :character Class :character Class :character 1st Qu.: 962.00
#> Mode :character Mode :character Mode :character Median :1841.82
#> Mean :2021.82
#> 3rd Qu.:2848.20
#> Max. :7310.18
Using the summary function we see variable that could be used as
predictor listed as: customer_id
tenure_months
monthly_charges
contract_type
online_security
tech_support
total_charges
This is given that we may want to use a model to predict if a customer will churn.
str(churn.df)
#> spc_tbl_ [400 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ customer_id : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
#> $ tenure_months : num [1:400] 49 29 35 1 23 31 24 9 3 11 ...
#> $ monthly_charges: num [1:400] 68.6 77.4 80.2 58.4 50.8 ...
#> $ contract_type : chr [1:400] "One year" "One year" "Two year" "One year" ...
#> $ online_security: chr [1:400] "Yes" "Yes" "Yes" "Yes" ...
#> $ tech_support : chr [1:400] "Yes" "Yes" "No" "Yes" ...
#> $ churn : chr [1:400] "Yes" "Yes" "No" "No" ...
#> $ total_charges : num [1:400] 3691 2114.3 2915.4 57.1 1259.1 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. customer_id = col_double(),
#> .. tenure_months = col_double(),
#> .. monthly_charges = col_double(),
#> .. contract_type = col_character(),
#> .. online_security = col_character(),
#> .. tech_support = col_character(),
#> .. churn = col_character(),
#> .. total_charges = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
When we inspect the structure of the data frame the corresponding variable types as: customer_id: number, tenure_months: number, monthly_charges:number, contract_type: character, online_security:character, tech_support: character, churn: character, total_charges: number.
#churn.df(miss.val=sapply(churn.df,
# function(x)sum(length(which(is.ns(X))))))
library(ggplot2)
library(reshape)
melt.missing <- melt (missing)
ggplot(melt.missing, aes(x=X1, y=X2, fill=value)) +
geom_tile(show.legend=FALSE) +
scale_fill_gradient(low="white", high="#666666") +
scale_x_continuous(expand=c(0,0))
#> Error in `geom_tile()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error:
#> ! object 'X1' not found
missing = churn.df
missing[missing == ""]= NA
missing = 1 * is.na(missing)
heatmap(missing, Rowv = NA, Colv = NA)
There appears to be no missing values in our data set. We can move
forward with analysis.
• Outliers or extreme values
#order(churn.df$customer_id)
#order(churn.df$tenure_months)
#order(churn.df$monthly_charges)
#order(churn.df$contract_type)
#order(churn.df$online_security)
#order(churn.df$tech_support)
#order(churn.df$churn)
order(churn.df$total_charges)
#> [1] 4 29 93 211 374 353 153 194 338 136 270 288 149 182 60 186 132 103
#> [19] 246 158 162 278 134 293 247 368 9 383 389 252 144 50 44 42 367 79
#> [37] 83 96 32 312 295 142 240 285 17 203 62 364 108 355 184 395 376 309
#> [55] 152 200 8 48 10 64 326 174 275 140 256 385 125 221 215 352 304 13
#> [73] 336 176 24 169 37 242 27 349 333 109 315 146 225 72 110 369 335 286
#> [91] 262 365 31 120 341 43 38 163 307 239 394 190 280 171 392 237 202 121
#> [109] 197 195 324 386 292 337 7 331 321 170 92 212 40 266 208 167 328 366
#> [127] 249 348 147 59 71 227 357 5 87 317 277 63 378 363 19 318 95 46
#> [145] 379 26 272 116 345 251 228 219 155 372 180 330 135 141 88 298 137 107
#> [163] 148 301 102 90 89 248 114 106 362 241 126 218 196 316 305 181 351 45
#> [181] 58 271 22 28 340 189 223 243 384 122 236 99 284 111 100 332 373 231
#> [199] 66 342 23 6 41 314 346 234 347 52 113 74 143 358 15 35 245 131
#> [217] 308 296 371 210 118 75 11 2 235 381 139 12 84 302 117 129 350 70
#> [235] 80 232 303 398 49 213 306 310 290 313 325 65 16 391 273 382 82 25
#> [253] 359 39 54 150 261 334 191 161 344 281 123 396 20 233 21 77 220 151
#> [271] 255 168 322 119 159 67 206 274 380 282 138 254 375 94 377 327 30 253
#> [289] 192 360 69 250 124 222 287 393 294 259 265 397 216 127 53 205 238 3
#> [307] 199 390 179 356 329 343 98 387 130 78 311 279 133 198 258 300 57 178
#> [325] 166 319 56 33 207 260 91 34 165 175 112 230 86 204 299 128 145 14
#> [343] 388 172 76 105 81 101 73 244 323 291 173 47 68 226 104 399 1 183
#> [361] 61 217 193 156 257 157 224 289 188 51 320 201 269 154 297 268 164 229
#> [379] 36 18 97 115 214 177 361 209 185 160 55 370 85 267 263 283 187 354
#> [397] 264 400 276 339
One way to identify outliers is to inspect the data for extreme values.When visually inspecting the data for very large or small values no outliers stand out.However this metod is tedious and may lead to human error.
boxplot(churn.df$total_charges)
For numerical variables we can generate boxplots which show the range of
the data and where the data appears in relation to other data points.
With boxplots we can see extreme values that may warrant further
investigation to determine if they are true outliers.
• Potential data coding issues (e.g., inconsistent categories)
library(fastDummies)
library(tidyverse)
When reviewing the
# make contract_type a factor variable
churn.df$contract_type <- factor(churn.df$contract_type)
str(churn.df$contract_type)
#> Factor w/ 3 levels "Month-to-month",..: 2 2 3 2 1 1 3 2 1 2 ...
levels(churn.df$contract_type) # show factor's categories (levels)
#> [1] "Month-to-month" "One year" "Two year"
# use tidyverse to load and preprocess data in one statement
# the %>% operator inserts the result of the expression on the left
# as the first argument into the function on the right
churn.df <- churn.df %>%
mutate(contract_type=factor(contract_type))
churn.df <- dummy_cols(churn.df,
remove_selected_columns=TRUE, # remove the original column
remove_first_dummy=TRUE) # removes the first created dummy variable
churn.df %>% head(2)
#> # A tibble: 2 × 9
#> customer_id tenure_months monthly_charges total_charges contract_type_One ye…¹
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 49 68.6 3691. 1
#> 2 2 29 77.4 2114. 1
#> # ℹ abbreviated name: ¹`contract_type_One year`
#> # ℹ 4 more variables: `contract_type_Two year` <int>,
#> # online_security_Yes <int>, tech_support_Yes <int>, churn_Yes <int>
We can see that the customer_id is listed as a number but measuring the distance between to customer ids would likely yield no value. Additionally, the data for online_security, tech_support and churn are listed characters which represent binary outcome that could be coded differently.
Expected output: • Summary statistics – median, standard deviation, correlation analysis, mean
summary(churn.df)
#> customer_id tenure_months monthly_charges total_charges
#> Min. : 1.0 Min. : 1.00 Min. : 13.08 Min. : 57.15
#> 1st Qu.:100.8 1st Qu.:16.00 1st Qu.: 54.31 1st Qu.: 962.00
#> Median :200.5 Median :29.00 Median : 66.91 Median :1841.82
#> Mean :200.5 Mean :29.79 Mean : 68.01 Mean :2021.82
#> 3rd Qu.:300.2 3rd Qu.:44.00 3rd Qu.: 81.89 3rd Qu.:2848.20
#> Max. :400.0 Max. :59.00 Max. :137.16 Max. :7310.18
#> contract_type_One year contract_type_Two year online_security_Yes
#> Min. :0.00 Min. :0.000 Min. :0.00
#> 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.00
#> Median :0.00 Median :0.000 Median :0.00
#> Mean :0.31 Mean :0.355 Mean :0.46
#> 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.00
#> Max. :1.00 Max. :1.000 Max. :1.00
#> tech_support_Yes churn_Yes
#> Min. :0.0000 Min. :0.000
#> 1st Qu.:0.0000 1st Qu.:0.000
#> Median :1.0000 Median :0.000
#> Mean :0.5125 Mean :0.275
#> 3rd Qu.:1.0000 3rd Qu.:1.000
#> Max. :1.0000 Max. :1.000
#view(churn.df)
• At least two visualizations (e.g., boxplot, histogram, missingness heatmap)
boxplot(churn.df$tenure_months)
#boxplot(churn.df$monthly_charges ` churn.df$tenure_months,
# xlab="Monthly Charges", ylab="Tenure Months")
plot(churn.df[, c(1:9)])
cor.mat <-round(cor(churn.df), 2)
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x=X1, y=X2, fill=value)) +
geom_tile() + xlab("") + ylab("") +
scale_fill_distiller(palette="RdBu", limits=c(-1,1))
We can use the correlation matrix to identify variable that my have a
greater effect of the churn rate than others.This heatmap shows us that
there is some positive correlation between churn_Yes and
Monthly_charges, with Total_charges and with tenure-months.We can
explore these further by comparing side-by-side boxplots of
churn_Yes.
boxplot(churn.df$monthly_charges ~ churn.df$churn_Yes,
xlab="churn", ylab="Monthly charge")
When we compare values for those who churn and those that do not with regard to Monthly_charges we see little difference between the groups.
boxplot(total_charges ~ churn_Yes,
xlab="churn", ylab="Total charge", data = churn.df)
With Total_charges we see a greater difference than with
Monthly_charges.
boxplot(churn.df$tenure_months ~ churn.df$churn_Yes,
xlab="churn", ylab="tenure months")
A more significant differnce between the groups appears when examining
the relation between churn_Yes and tenture_months.The churn rate seems
to be slightly higher for customers with a greater value for tenture
months.
Total_charges and tenure_months are likely correlated since the longer tenure with results in accumulated monthly charges and higher Total charges.We should be mindful of this as we move forward with data modeling.
plot(churn.df$total_charges ~ churn.df$tenure_months,
xlab="Tenure Months", ylab="Total charge")
hist(churn.df$total_charges, xlab="Total Charges")
hist(churn.df$tenure_months)
plot(churn.df$total_charges ~ churn.df$tenure_months,
xlab="Tenure Months", ylab="Total charge", log="y")
A substantive insight is: • A pattern supported by evidence (numeric summary or visualization)
We can see that the churn rate is greater for Total_charges and for tenure-months that for Monthly charges.The correlation matrix and results of differences between groups support the finding that Total_charges and tenture-months are correlated.The scatter plot indicates that this relationship is a non-linear. We also see that differences appear between those that have churned and those that have not. We would build a model that explores these differences while addressing isues of correlation between predictor variables.
, and • A short interpretation explaining why it matters for the business problem.
Business decision makers would be interested in these findings because they likely want to reduce the sales cost by retaining customers and increase revenue. Retaining customers is likely less expensive that identifying an obtaining new customers.Conversely a Business decision maker may want to reduce costs by offering few perks to customers that are mostly likely to end contracts.
Requirement: Provide at least three distinct insights, each supported by a plot or statistic.
Based on what we see from in the boxplots above we can conclude that there is a difference between customers that churn and customers that do not. The churn rate is influenced by both Total-charges and tenure months and theses variables are likely correlated.
• Feature selection With Monthly charges and Total charges moderately correlated we would likely choose to use one feature and not the one when we build a model to evaluate likelihood of a customer churning.
• Modeling choices We can likely use a regression model to predict which customers are likely to churn, however we do want to be careful about using variables that are correlated, like Total_charges and tenure_months.
• Data preprocessing decisions We would likely test different models excluding the redundant variable from each to find which performs better.We would choose predictors that have the greatest variable explained by the predictor we keep in the model.
Expected output: 1–2 paragraphs. The churn data set can be used to build a machine learning model that predicts which customers are most likely to churn or end contracts.The insights gained from identifying customers would allow business leader to take appropriate steps to retain customers or reduce the costs associated with retaining customers that are not likely to stay with the company.
Independant models would be generated to included different variables so performance could be evaluated. Our goal is to finding the model with the best performance without overfitting the data.
Deliverables • R script (RMD file) • RMD file knitted output - notebook • Short written report (2 pages) summarizing findings ________________________________________