Goal: Understand the dataset well enough to justify
later modeling choices.
This lab is not about building predictive
models.
Dataset: week2_churn_data.csv
An insight must include:
1. Evidence (a plot or statistic)
2. A clear pattern (direction and/or magnitude)
3. Business meaning (why it matters)
Use this sentence frame:
> “Customers with ___ show ___ compared to ___, which suggests ___.”
In this dataset, the outcome variable is: churn (values: Yes/No).
# Confirm churn values
table(churn$churn)
##
## No Yes
## 290 110
# Create a 0/1 version for calculations (Yes=1, No=0)
churn01 <- ifelse(churn$churn == "Yes", 1, 0)
table(churn01)
## churn01
## 0 1
## 290 110
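One reason to keep a 0/1 version of the outcome is that its mean is the churn rate. The sketch below rebuilds a stand-in outcome vector from the counts printed above (290 No, 110 Yes) rather than reading the CSV; the names `churn_flag`, `churn01_demo`, and `overall_churn_rate` are illustrative and not part of the lab.

```r
# Stand-in outcome vector built from the counts reported above (290 No, 110 Yes).
churn_flag <- rep(c("No", "Yes"), times = c(290, 110))

# Recode to 0/1 the same way the lab does (Yes = 1, No = 0).
churn01_demo <- ifelse(churn_flag == "Yes", 1, 0)

# The mean of a 0/1 vector is the proportion of 1s, i.e. the churn rate.
overall_churn_rate <- mean(churn01_demo)
overall_churn_rate  # 110 / 400 = 0.275
```

The same idea underlies the later `tapply(churn01, group, mean)` calls, which return a churn rate per group.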
Task: Create a short table (or bullet list) showing each variable’s:
- role (Outcome / Predictor)
- type (numeric / categorical / binary)
Hint: sapply(churn, class) helps.
var_types <- sapply(churn, class)
var_types
## customer_id tenure_months monthly_charges contract_type online_security
## "integer" "integer" "numeric" "character" "character"
## tech_support churn total_charges
## "character" "character" "numeric"
Your table/list (fill in):
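One possible way to start the table is to derive it from the column classes. This is only a sketch: `demo` is a hypothetical three-row stand-in that reuses the lab's column names, the "Identifier" role for `customer_id` is an added assumption, and the rule "two distinct values means binary" is a heuristic that should be checked against the real data.

```r
# Hypothetical stand-in with the same column names as the lab data.
demo <- data.frame(
  customer_id     = 1:3,
  tenure_months   = c(5L, 24L, 48L),
  monthly_charges = c(45.5, 70.2, 89.9),
  contract_type   = c("Month-to-month", "One year", "Two year"),
  online_security = c("No", "Yes", "No"),
  tech_support    = c("Yes", "No", "No"),
  churn           = c("No", "Yes", "No"),
  total_charges   = c(227.5, 1684.8, 4315.2),
  stringsAsFactors = FALSE
)

var_table <- data.frame(
  variable = names(demo),
  class    = unname(sapply(demo, class)),
  stringsAsFactors = FALSE
)

# Role: churn is the outcome; customer_id is treated as an identifier (assumption).
var_table$role <- ifelse(var_table$variable == "churn", "Outcome",
                  ifelse(var_table$variable == "customer_id", "Identifier",
                         "Predictor"))

# Type: numeric classes stay numeric; text columns with two values count as binary.
n_distinct <- unname(sapply(demo, function(col) length(unique(col))))
var_table$type <- ifelse(var_table$class %in% c("integer", "numeric"), "numeric",
                  ifelse(n_distinct == 2, "binary", "categorical"))
var_table
```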
“The original dataset does not contain missing values. To practice realistic data preparation, you will intentionally introduce a small amount of missingness into one numeric variable and then assess and address it.”
# Create a working copy (do NOT overwrite original)
churn_work <- churn
# Introduce ~5% missingness in monthly_charges
set.seed(123)
n <- nrow(churn_work)
missing_index <- sample(1:n, size = round(0.05 * n))
churn_work$monthly_charges[missing_index] <- NA
# Also introduce ~5% missingness in tenure_months (a separate random draw of rows)
missing_index <- sample(1:n, size = round(0.05 * n))
churn_work$tenure_months[missing_index] <- NA
missing_counts <- colSums(is.na(churn_work))
missing_counts
## customer_id tenure_months monthly_charges contract_type online_security
## 0 20 20 0 0
## tech_support churn total_charges
## 0 0 0
missing_counts[missing_counts > 0]
## tenure_months monthly_charges
## 20 20
summary(churn_work)
## customer_id tenure_months monthly_charges contract_type
## Min. : 1.0 Min. : 1.00 Min. : 13.08 Length:400
## 1st Qu.:100.8 1st Qu.:16.00 1st Qu.: 54.09 Class :character
## Median :200.5 Median :30.00 Median : 66.92 Mode :character
## Mean :200.5 Mean :30.03 Mean : 67.97
## 3rd Qu.:300.2 3rd Qu.:44.00 3rd Qu.: 81.89
## Max. :400.0 Max. :59.00 Max. :137.16
## NA's :20 NA's :20
## online_security tech_support churn total_charges
## Length:400 Length:400 Length:400 Min. : 57.15
## Class :character Class :character Class :character 1st Qu.: 962.00
## Mode :character Mode :character Mode :character Median :1841.82
## Mean :2021.82
## 3rd Qu.:2848.20
## Max. :7310.18
##
# Create an example vector with NAs
x <- churn_work$tenure_months
# Attempt to calculate the mean without removing NAs (result is NA)
mean(x)
## [1] NA
# Calculate the mean after removing NAs (result is 30.02895)
mean(x, na.rm = TRUE)
## [1] 30.02895
Interpretation (fill in):
- Which variables have missing values? tenure_months and
monthly_charges
- Do the missing values appear minor or substantial? Minor. Each
variable has 20 missing values out of 400 observations, which is about
5%.
- What would you do about them (omit / impute / investigate)? Since the
missing values are minimal (about 5%), I would use mean or median
imputation to retain the full dataset.
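The imputation idea in the answer above can be sketched as follows. The vector `charges` is hypothetical, and median imputation is only one option; it is shown here because the median is less sensitive to extreme values than the mean.

```r
# Hypothetical numeric column with two missing values.
charges <- c(50, 60, NA, 80, 200, NA, 70, 65)

# Share of missing values, in percent.
pct_missing <- mean(is.na(charges)) * 100

# Impute on a copy so the original values stay available for comparison.
charges_imputed <- charges
fill_value <- median(charges, na.rm = TRUE)
charges_imputed[is.na(charges_imputed)] <- fill_value

pct_missing
fill_value
```

Comparing `summary(charges)` with `summary(charges_imputed)` afterwards is a quick way to see how much the imputation shifts the distribution.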
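An outlier is defined by:
- Distance from the central mass of the data (judged with the standard deviation, quartiles, or mean)
- Support from a plot or statistic (for example, a boxplot with observations above the upper whisker suggests potential high-end outliers, and observations below the lower whisker suggest potential low-end outliers)
- Interpretation in business context
- No automatic action (flagging a value is not, by itself, a reason to remove it)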
# Choose numeric variables (excluding customer_id, which is likely an ID)
numeric_vars <- c("tenure_months", "monthly_charges", "total_charges")
# Summary stats
summary(churn[numeric_vars])
## tenure_months monthly_charges total_charges
## Min. : 1.00 Min. : 13.08 Min. : 57.15
## 1st Qu.:16.00 1st Qu.: 54.31 1st Qu.: 962.00
## Median :29.00 Median : 66.91 Median :1841.82
## Mean :29.79 Mean : 68.01 Mean :2021.82
## 3rd Qu.:44.00 3rd Qu.: 81.89 3rd Qu.:2848.20
## Max. :59.00 Max. :137.16 Max. :7310.18
summary(churn[c(1,2,8)])
## customer_id tenure_months total_charges
## Min. : 1.0 Min. : 1.00 Min. : 57.15
## 1st Qu.:100.8 1st Qu.:16.00 1st Qu.: 962.00
## Median :200.5 Median :29.00 Median :1841.82
## Mean :200.5 Mean :29.79 Mean :2021.82
## 3rd Qu.:300.2 3rd Qu.:44.00 3rd Qu.:2848.20
## Max. :400.0 Max. :59.00 Max. :7310.18
# Boxplots
par(mfrow=c(1,3))
boxplot(churn$tenure_months, main="tenure_months", ylab="months")
boxplot(churn$monthly_charges, main="monthly_charges", ylab="charges")
boxplot(churn$total_charges, main="total_charges", ylab="charges")
par(mfrow=c(1,1))
summary(churn$monthly_charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.08 54.31 66.91 68.01 81.89 137.16
sd(churn$monthly_charges, na.rm = TRUE)
## [1] 21.01485
summary(churn$total_charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.15 962.00 1841.82 2021.82 2848.20 7310.18
sd(churn$total_charges, na.rm = TRUE)
## [1] 1343.435
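The whisker rule behind the boxplots can be approximated by hand. The sketch below assumes the common 1.5 × IQR convention and plugs in the `monthly_charges` quartiles, minimum, and maximum printed by `summary()` above. `boxplot()` works from hinges rather than these quantiles, so its exact cutoffs may differ slightly.

```r
# Values copied from summary(churn$monthly_charges) above.
q1 <- 54.31
q3 <- 81.89
obs_min <- 13.08
obs_max <- 137.16

# Fences under the 1.5 * IQR convention (an assumption, not a lab requirement).
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# TRUE means at least one observation lies beyond that fence.
obs_max > upper_fence
obs_min < lower_fence
```

With these inputs the maximum lies above the upper fence while the minimum stays inside the lower fence, which points to potential high-end outliers only.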
Interpretation (fill in):
- Any extreme values or unusual distributions? total_charges is
right-skewed: the mean (2021.82) exceeds the median (1841.82) and the
maximum reaches 7310.18.
- Would you transform or cap anything before modeling? Consider
transforming total_charges, but keep all values rather than capping.
If keeping, you can state: “Potential outliers were identified using boxplots and summary statistics; while a small number of extreme values appear, they are plausible given the business context and were therefore retained.”
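Before committing to a transformation for total_charges, it may help to check how each candidate changes the balance between the upper and lower tails. The sketch below uses only the five-number summary printed above; `tail_ratio` is a hypothetical helper, and a ratio near 1 is taken as a rough sign of symmetry.

```r
# Five-number summary of total_charges, copied from the output above.
total_summary <- c(min = 57.15, q1 = 962.00, median = 1841.82,
                   q3 = 2848.20, max = 7310.18)

# Hypothetical helper: length of the upper tail relative to the lower tail.
# Values above 1 suggest right skew, values below 1 suggest left skew.
tail_ratio <- function(v) {
  unname((v["max"] - v["median"]) / (v["median"] - v["min"]))
}

ratios <- c(
  raw  = tail_ratio(total_summary),
  sqrt = tail_ratio(sqrt(total_summary)),
  log  = tail_ratio(log(total_summary))
)
round(ratios, 2)
```

On these summary values the log ratio drops below 1, so a log transform may over-correct, while the square root lands closer to 1. A histogram of the transformed column on the full data would be the proper check.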
cor_mat <- cor(churn[numeric_vars], use="pairwise.complete.obs")
round(cor_mat, 3)
## tenure_months monthly_charges total_charges
## tenure_months 1.000 0.011 0.842
## monthly_charges 0.011 1.000 0.469
## total_charges 0.842 0.469 1.000
Interpretation (fill in):
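- Are any numeric predictors strongly correlated? Yes: tenure_months and
total_charges (r = 0.842).
- What might that imply for modeling later? The two variables carry
overlapping information, so including both may cause multicollinearity
in linear models; consider dropping one or using regularization.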
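Scanning a correlation matrix by eye gets harder as the number of predictors grows. A minimal sketch of flagging strongly correlated pairs is shown below, using the rounded matrix printed above. The 0.8 cutoff is an assumed rule of thumb, not a threshold set by the lab.

```r
# Rebuild the rounded correlation matrix printed above.
vars <- c("tenure_months", "monthly_charges", "total_charges")
cor_demo <- matrix(
  c(1.000, 0.011, 0.842,
    0.011, 1.000, 0.469,
    0.842, 0.469, 1.000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

threshold <- 0.8  # assumed cutoff for "strong" correlation

# Look only above the diagonal so each pair is reported once.
high <- which(abs(cor_demo) > threshold & upper.tri(cor_demo), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_demo)[high[, "row"]],
  var2 = colnames(cor_demo)[high[, "col"]],
  r    = cor_demo[high],
  stringsAsFactors = FALSE
)
flagged
```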
Below are sample plots. Add/replace with your own if you prefer.
# Histogram
hist(churn$monthly_charges, breaks=20,
main="Histogram: monthly_charges",
xlab="monthly_charges")
# Churn rate by contract type
churn_rate_by_contract <- tapply(churn01, churn$contract_type, mean, na.rm=TRUE)
barplot(churn_rate_by_contract,
main="Churn Rate by Contract Type",
ylab="Churn Rate", las=2)
Interpretation (fill in):
- Plot 1 shows distribution of monthly_charges and suggests customers
have a wide range of pricing levels
- Plot 2 shows churn rates by contract type and suggests churn is
similar across contract categories
You must provide at least three distinct insights, each supported by a plot or statistic.
Insight statement (fill in):
> Customers with one-year contracts show slightly higher churn (29.0%)
compared to month-to-month (27.6%) and two-year (26.1%) contracts, which
suggests contract length alone is not a strong driver of churn.
# Example idea: churn rate by contract_type (edit or replace)
ins1_tbl <- tapply(churn01, churn$contract_type, mean, na.rm=TRUE)
ins1_tbl
## Month-to-month One year Two year
## 0.2761194 0.2903226 0.2605634
barplot(ins1_tbl, main="Churn Rate by Contract Type", ylab="Churn Rate", las=2)
Insight statement (fill in):
> Customers who churned show higher average tenure (32.4 months) compared
to customers who did not churn (28.8 months), which suggests churn is not
limited to new customers and can occur after longer customer relationships.
# Example idea: compare tenure_months by churn status
mean_no <- mean(churn$tenure_months[churn01==0], na.rm=TRUE)
mean_yes <- mean(churn$tenure_months[churn01==1], na.rm=TRUE)
mean_no; mean_yes
## [1] 28.80345
## [1] 32.37273
boxplot(churn$tenure_months ~ churn01,
main="tenure_months by churn (0=No, 1=Yes)",
xlab="churn01", ylab="tenure_months")
Insight statement (fill in):
> Customers with online security show higher churn (31.0%) compared to
customers without online security (24.5%), which suggests the online
security add-on by itself is not associated with better retention.
# Example idea: churn rate by online_security (Yes/No)
ins3_tbl <- tapply(churn01, churn$online_security, mean, na.rm=TRUE)
ins3_tbl
## No Yes
## 0.2453704 0.3097826
barplot(ins3_tbl, main="Churn Rate by Online Security", ylab="Churn Rate", las=2)
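To state the magnitude of this pattern, the two rates printed above can be turned into an absolute gap (percentage points) and a relative ratio. This is a small sketch using only those two numbers; the variable names are illustrative.

```r
# Churn rates copied from the ins3_tbl output above.
rate_no_security  <- 0.2453704
rate_yes_security <- 0.3097826

# Absolute gap in percentage points and relative ratio of the two rates.
abs_diff_pp <- (rate_yes_security - rate_no_security) * 100
rel_ratio   <- rate_yes_security / rate_no_security

round(abs_diff_pp, 1)  # 6.4
round(rel_ratio, 2)    # 1.26
```

Reporting both figures helps a reader judge whether a gap of this size matters for the business, which a bar chart alone does not settle.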
## 8) Three “substantive insights” scaffold (simple and repeatable) ----
# Insight Example A: churn rate by a categorical variable
# Replace contract_type with another category if needed.
if ("contract_type" %in% names(churn)) {
tbl <- table(churn$contract_type, churn01)
print(tbl)
# Churn rate by category (again, but now you can show counts too)
churn_rate <- tapply(churn01, churn$contract_type, mean, na.rm = TRUE)
print(churn_rate)
# You can verbalize: "Category X is about ___ compared to category Y"
}
## churn01
## 0 1
## Month-to-month 97 37
## One year 88 36
## Two year 105 37
## Month-to-month One year Two year
## 0.2761194 0.2903226 0.2605634
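Because the counts are now visible, the claim that churn is "similar" across contract types can be checked more formally. The sketch below rebuilds the count table printed above and runs a chi-squared test of independence with `chisq.test()`. The test is an addition for illustration and is not required by the lab.

```r
# Counts copied from the contract_type x churn01 table above.
counts <- matrix(
  c( 97, 37,
     88, 36,
    105, 37),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    contract_type = c("Month-to-month", "One year", "Two year"),
    churn01       = c("0", "1")
  )
)

# Row-wise proportions reproduce the churn rates shown above.
churn_rate <- prop.table(counts, margin = 1)[, "1"]
round(churn_rate, 3)

# Chi-squared test of independence between contract type and churn.
contract_test <- chisq.test(counts)
contract_test$p.value
```

A large p-value here would be consistent with the observation that contract type alone does not separate churners from non-churners in this sample.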
# Insight Example B: numeric variable difference by churn status
# Use tenure_months (your note) if it exists.
if ("tenure_months" %in% names(churn)) {
# Compare group means
mean_churn0 <- mean(churn$tenure_months[churn01 == 0], na.rm = TRUE)
mean_churn1 <- mean(churn$tenure_months[churn01 == 1], na.rm = TRUE)
cat("\nMean tenure_months (no churn):", round(mean_churn0, 2), "\n")
cat("Mean tenure_months (churn): ", round(mean_churn1, 2), "\n")
# Simple boxplot by churn status
boxplot(churn$tenure_months ~ churn01,
main = "tenure_months by Churn (0=no, 1=yes)",
xlab = "churn01", ylab = "tenure_months")
}
##
## Mean tenure_months (no churn): 28.8
## Mean tenure_months (churn): 32.37
# Insight Example C: Create simple tenure groups WITHOUT cut() (very explicit)
if ("tenure_months" %in% names(churn)) {
tenure_group <- rep(NA, nrow(churn))
tenure_group[churn$tenure_months <= 12] <- "0-12"
tenure_group[churn$tenure_months > 12 & churn$tenure_months <= 36] <- "13-36"
tenure_group[churn$tenure_months > 36 & churn$tenure_months <= 72] <- "37-72"
tenure_group[churn$tenure_months > 72] <- "73+"
# Churn rate by tenure group
churn_rate_tenure <- tapply(churn01, tenure_group, mean, na.rm = TRUE)
print(churn_rate_tenure)
barplot(churn_rate_tenure, main = "Churn Rate by Tenure Group", ylab = "Churn Rate", las = 2)
}
## 0-12 13-36 37-72
## 0.2465753 0.2429379 0.3266667
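For comparison, the same tenure groups can be built with `cut()`. This sketch uses a hypothetical vector `tenure_demo` and mirrors the group boundaries in the explicit version above. By default `cut()` uses right-closed intervals, so 12 falls in "0-12" and 36 falls in "13-36".

```r
# Hypothetical tenure values chosen to sit on the group boundaries.
tenure_demo <- c(1, 12, 13, 36, 37, 59)

# Right-closed intervals: (0,12], (12,36], (36,72], (72,Inf]
tenure_group_cut <- cut(
  tenure_demo,
  breaks = c(0, 12, 36, 72, Inf),
  labels = c("0-12", "13-36", "37-72", "73+")
)

# Unlike the character version, the factor keeps the empty "73+" level.
table(tenure_group_cut)
```

The explicit version is easier to read line by line, while `cut()` is shorter and keeps the groups in a defined order for plotting.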
Write 1–2 paragraphs answering: - How would your findings influence feature selection? - How would your findings influence modeling choices? - How would your findings influence preprocessing decisions?
Your response (fill in):
The findings suggest selecting tenure_months, monthly_charges, and total_charges as key numeric features, while excluding customer_id as an identifier. Categorical variables such as contract_type, online_security, and tech_support should be included, though contract_type shows limited standalone impact. The strong correlation between tenure_months and total_charges indicates possible redundancy.
For modeling, this correlation suggests caution with linear models and supports using regularized or tree-based approaches. Preprocessing should include encoding categorical variables, imputing minor missing values, and considering a transformation for total_charges due to right skew.
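The encoding step mentioned above could look like the following. This is a sketch on a hypothetical four-row data frame `demo`; `model.matrix()` is one base-R way to produce dummy variables, and many modeling functions in R do this automatically when given factors.

```r
# Hypothetical stand-in with two of the lab's categorical predictors.
demo <- data.frame(
  contract_type   = factor(c("Month-to-month", "One year", "Two year", "One year")),
  online_security = factor(c("No", "Yes", "No", "Yes"))
)

# Treatment (dummy) coding: the first level of each factor is the baseline.
design <- model.matrix(~ contract_type + online_security, data = demo)
colnames(design)
```

Each factor contributes one column per non-baseline level, so the three contract types become two indicator columns.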
install.packages(c("fastmap", "htmltools", "bslib", "rmarkdown"))