1 Overview (Read This First)

Goal: Understand the dataset well enough to justify later modeling choices.
This lab is not about building predictive models.

Dataset: week2_churn_data.csv

1.1 What counts as a “substantive insight” (required)

An insight must include:
1. Evidence (a plot or statistic)
2. A clear pattern (direction and/or magnitude)
3. Business meaning (why it matters)

Use this sentence frame:
> “Customers with ___ show ___ compared to ___, which suggests ___.”


2 Setup
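
The setup chunk is not shown in this rendering. A minimal sketch of what it plausibly needs to do, assuming week2_churn_data.csv sits in the working directory and the data frame is named churn, as in the code that follows. The helper name load_churn is hypothetical.

```r
# Hypothetical helper: read the lab CSV and keep text columns as character,
# matching the classes printed by sapply(churn, class) later in the lab.
load_churn <- function(path = "week2_churn_data.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (check getwd())")
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Assumption: the file is in the working directory.
if (file.exists("week2_churn_data.csv")) {
  churn <- load_churn()
  str(churn)   # quick structural check: 400 rows and 8 columns expected
}
```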


3 1) Data Familiarization

3.1 1.1 Identify the outcome variable

In this dataset, the outcome variable is: churn (values: Yes/No).

# Confirm churn values
table(churn$churn)
## 
##  No Yes 
## 290 110
# Create a 0/1 version for calculations (Yes=1, No=0)
churn01 <- ifelse(churn$churn == "Yes", 1, 0)
table(churn01)
## churn01
##   0   1 
## 290 110

3.2 1.2 Identify predictors and their types

Task: Create a short table (or bullet list) showing each variable’s:
  • role (Outcome / Predictor)
  • type (numeric / categorical / binary)

Hint: sapply(churn, class) helps.

var_types <- sapply(churn, class)
var_types
##     customer_id   tenure_months monthly_charges   contract_type online_security 
##       "integer"       "integer"       "numeric"     "character"     "character" 
##    tech_support           churn   total_charges 
##     "character"     "character"       "numeric"

Your table/list (fill in):

  • Outcome:
    • churn — (binary categorical; Yes/No)
  • Predictors:
    • customer_id — (integer; likely an identifier, so not a meaningful predictor)
    • tenure_months — (numeric; stored as integer)
    • monthly_charges — (numeric)
    • total_charges — (numeric)
    • contract_type — (categorical)
    • online_security — (binary categorical; Yes/No)
    • tech_support — (binary categorical; Yes/No)
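
One way to draft this list programmatically is sketched below. It runs on a small made-up data frame with the same column names (the values are illustrative, not taken from the dataset), and the helper describe_type is hypothetical. The rule "two distinct non-numeric values means binary" is an assumption that fits the Yes/No columns here.

```r
# Toy stand-in for churn: same column names, made-up values.
toy <- data.frame(
  customer_id     = 1:4,
  tenure_months   = c(3L, 24L, 48L, 12L),
  monthly_charges = c(55.2, 80.1, 66.9, 43.0),
  contract_type   = c("Month-to-month", "One year", "Two year", "One year"),
  online_security = c("Yes", "No", "No", "Yes"),
  tech_support    = c("No", "No", "Yes", "Yes"),
  churn           = c("Yes", "No", "No", "Yes"),
  total_charges   = c(165.6, 1922.4, 3211.2, 516.0),
  stringsAsFactors = FALSE
)

# Classify one column as numeric, binary, or categorical.
describe_type <- function(x) {
  if (is.numeric(x)) {
    "numeric"
  } else if (length(unique(na.omit(x))) == 2) {
    "binary"
  } else {
    "categorical"
  }
}

var_table <- data.frame(
  variable = names(toy),
  role     = ifelse(names(toy) == "churn", "Outcome", "Predictor"),
  type     = vapply(toy, describe_type, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# customer_id is numeric by class but acts as an identifier, so flag it by hand.
var_table$role[var_table$variable == "customer_id"] <- "Identifier"
print(var_table)
```

The class of a column does not reveal its role, which is why the identifier is marked manually rather than inferred.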

4 2) Data Quality Assessment

4.1 2.1 Missing values

“The original dataset does not contain missing values. To practice realistic data preparation, you will intentionally introduce a small amount of missingness into numeric variables (here, monthly_charges and tenure_months) and then assess and address it.”

# Create a working copy (do NOT overwrite original)
churn_work <- churn

# Introduce ~5% missingness in monthly_charges
set.seed(123)
n <- nrow(churn_work)
missing_index <- sample(1:n, size = round(0.05 * n))

churn_work$monthly_charges[missing_index] <- NA

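# Also introduce ~5% missingness in tenure_months (a fresh random draw of rows)
missing_index <- sample(1:n, size = round(0.05 * n))

churn_work$tenure_months[missing_index] <- NA

# Count missing values per column
missing_counts <- colSums(is.na(churn_work))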

missing_counts
##     customer_id   tenure_months monthly_charges   contract_type online_security 
##               0              20              20               0               0 
##    tech_support           churn   total_charges 
##               0               0               0
missing_counts[missing_counts > 0]
##   tenure_months monthly_charges 
##              20              20
summary(churn_work)
##   customer_id    tenure_months   monthly_charges  contract_type     
##  Min.   :  1.0   Min.   : 1.00   Min.   : 13.08   Length:400        
##  1st Qu.:100.8   1st Qu.:16.00   1st Qu.: 54.09   Class :character  
##  Median :200.5   Median :30.00   Median : 66.92   Mode  :character  
##  Mean   :200.5   Mean   :30.03   Mean   : 67.97                     
##  3rd Qu.:300.2   3rd Qu.:44.00   3rd Qu.: 81.89                     
##  Max.   :400.0   Max.   :59.00   Max.   :137.16                     
##                  NA's   :20      NA's   :20                         
##  online_security    tech_support          churn           total_charges    
##  Length:400         Length:400         Length:400         Min.   :  57.15  
##  Class :character   Class :character   Class :character   1st Qu.: 962.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :1841.82  
##                                                           Mean   :2021.82  
##                                                           3rd Qu.:2848.20  
##                                                           Max.   :7310.18  
## 
# Create an example vector with NAs
x <- churn_work$tenure_months

# Attempt to calculate the mean without removing NAs (result is NA)
mean(x)
## [1] NA
# Calculate the mean after removing NAs (result is 30.02895)
mean(x, na.rm = TRUE)
## [1] 30.02895

Interpretation (fill in):
- Which variables have missing values? tenure_months and monthly_charges
- Do the missing values appear minor or substantial? Minor. Each variable has 20 missing values out of 400 observations, which is about 5%.
- What would you do about them (omit / impute / investigate)? Since the missing values are minimal (about 5%), I would use mean or median imputation to retain the full dataset.
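
The imputation option mentioned above can be sketched as follows. This is a minimal illustration on a made-up vector, not the lab's required method, and the helper name impute_median is hypothetical.

```r
# Hypothetical helper: replace NAs in a numeric vector with the median
# of the observed (non-missing) values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Made-up example values (not from the dataset).
x <- c(10, NA, 30, 50, NA)
x_imputed <- impute_median(x)
x_imputed              # 10 30 30 50 30
sum(is.na(x_imputed))  # 0
```

Applied to the working copy, this would look like churn_work$tenure_months <- impute_median(churn_work$tenure_months), and likewise for monthly_charges. The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed variables.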

4.2 2.2 Outliers / extreme values (numeric variables)

An outlier is:
  • Defined by distance from the central mass of the data (standard deviation, quartiles, mean)
  • Supported by a plot or statistic: in a boxplot, observations above the upper whisker suggest potential high-end outliers, and observations below the lower whisker suggest potential low-end outliers
  • Interpreted in business context
  • Not grounds for automatic action
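
A statistic-based check to accompany the boxplots can be sketched with the common 1.5 × IQR rule. The helper flag_outliers_iqr is hypothetical and the example values are made up. Note that boxplot() computes its whiskers from hinges rather than from quantile(), so the two can disagree slightly on borderline points.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the common default.
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)
}

# Made-up example values (not from the dataset).
charges <- c(50, 55, 60, 62, 65, 70, 75, 200)
flags <- flag_outliers_iqr(charges)
charges[flags]   # 200
```

Flagging only identifies candidates. Whether a flagged value is kept, capped, or removed remains a judgment made in business context.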

# Choose numeric variables (excluding customer_id, which is likely an ID)
numeric_vars <- c("tenure_months", "monthly_charges", "total_charges")

# Summary stats
summary(churn[numeric_vars])
##  tenure_months   monthly_charges  total_charges    
##  Min.   : 1.00   Min.   : 13.08   Min.   :  57.15  
##  1st Qu.:16.00   1st Qu.: 54.31   1st Qu.: 962.00  
##  Median :29.00   Median : 66.91   Median :1841.82  
##  Mean   :29.79   Mean   : 68.01   Mean   :2021.82  
##  3rd Qu.:44.00   3rd Qu.: 81.89   3rd Qu.:2848.20  
##  Max.   :59.00   Max.   :137.16   Max.   :7310.18
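# Same summary selected by column position; note that column 1 is customer_id (an ID)
# and monthly_charges (column 3) is omitted
summary(churn[c(1,2,8)])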
##   customer_id    tenure_months   total_charges    
##  Min.   :  1.0   Min.   : 1.00   Min.   :  57.15  
##  1st Qu.:100.8   1st Qu.:16.00   1st Qu.: 962.00  
##  Median :200.5   Median :29.00   Median :1841.82  
##  Mean   :200.5   Mean   :29.79   Mean   :2021.82  
##  3rd Qu.:300.2   3rd Qu.:44.00   3rd Qu.:2848.20  
##  Max.   :400.0   Max.   :59.00   Max.   :7310.18
# Boxplots
par(mfrow=c(1,3))
boxplot(churn$tenure_months, main="tenure_months", ylab="months")
boxplot(churn$monthly_charges, main="monthly_charges", ylab="charges")
boxplot(churn$total_charges, main="total_charges", ylab="charges")

par(mfrow=c(1,1))


summary(churn$monthly_charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.08   54.31   66.91   68.01   81.89  137.16
sd(churn$monthly_charges, na.rm = TRUE)
## [1] 21.01485
summary(churn$total_charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.15  962.00 1841.82 2021.82 2848.20 7310.18
sd(churn$total_charges, na.rm = TRUE)
## [1] 1343.435

Interpretation (fill in):
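- Any extreme values or unusual distributions? total_charges is right-skewed: its mean (2021.82) exceeds its median (1841.82), and its maximum (7310.18) lies far above the third quartile (2848.20). monthly_charges has a maximum of 137.16, roughly 3.3 standard deviations above its mean of 68.01.
- Would you transform or cap anything before modeling? Consider a transformation of total_charges (for example, a log) to reduce the right skew; retain all values rather than capping them.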

If keeping, you can state: “Potential outliers were identified using boxplots and summary statistics; while a small number of extreme values appear, they are plausible given the business context and were therefore retained.”

4.3 2.3 Basic correlations (numeric variables)

cor_mat <- cor(churn[numeric_vars], use="pairwise.complete.obs")
round(cor_mat, 3)
##                 tenure_months monthly_charges total_charges
## tenure_months           1.000           0.011         0.842
## monthly_charges         0.011           1.000         0.469
## total_charges           0.842           0.469         1.000

Interpretation (fill in):
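- Are any numeric predictors strongly correlated? Yes: tenure_months and total_charges (r = 0.842). monthly_charges and total_charges are moderately correlated (r = 0.469).
- What might that imply for modeling later? tenure_months and total_charges carry largely redundant information, so including both in a linear model risks multicollinearity; consider dropping one of them or using a regularized or tree-based model.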
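
Scanning a correlation matrix for strongly related pairs can be automated, as in the sketch below. The matrix is copied from the rounded output above; the helper high_cor_pairs is hypothetical, and the 0.8 cutoff is an illustrative assumption rather than a fixed rule.

```r
# Correlation matrix copied from the output above (rounded to 3 decimals).
vars <- c("tenure_months", "monthly_charges", "total_charges")
cor_mat <- matrix(
  c(1.000, 0.011, 0.842,
    0.011, 1.000, 0.469,
    0.842, 0.469, 1.000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# List each pair once (upper triangle only) whose |r| exceeds the cutoff.
high_cor_pairs <- function(m, cutoff = 0.8) {
  idx <- which(abs(m) > cutoff & upper.tri(m), arr.ind = TRUE)
  data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
}

high_cor_pairs(cor_mat)
```

With these values the only pair returned is tenure_months and total_charges, which matches the interpretation above.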

4.4 2.4 Required visualizations (at least 2)

Below are sample plots. Add/replace with your own if you prefer.

# Histogram
hist(churn$monthly_charges, breaks=20,
     main="Histogram: monthly_charges",
     xlab="monthly_charges")

# Churn rate by contract type
churn_rate_by_contract <- tapply(churn01, churn$contract_type, mean, na.rm=TRUE)
barplot(churn_rate_by_contract,
        main="Churn Rate by Contract Type",
        ylab="Churn Rate", las=2)

Interpretation (fill in):
- Plot 1 shows the distribution of monthly_charges and suggests that customers span a wide range of pricing levels (13.08 to 137.16, with a median of 66.91).
- Plot 2 shows churn rates by contract type and suggests that churn is similar across contract categories (about 26% to 29%).


5 3) Exploratory Data Analysis (EDA)

You must provide at least three distinct insights, each supported by a plot or statistic.

5.1 Insight 1 (required)

Insight statement (fill in):
> Customers with one-year contracts show slightly higher churn (29.0%) compared to two-year (26.1%) and month-to-month (27.6%) contracts, which suggests contract length alone is not a strong driver of churn.

# Example idea: churn rate by contract_type (edit or replace)
ins1_tbl <- tapply(churn01, churn$contract_type, mean, na.rm=TRUE)
ins1_tbl
## Month-to-month       One year       Two year 
##      0.2761194      0.2903226      0.2605634
barplot(ins1_tbl, main="Churn Rate by Contract Type", ylab="Churn Rate", las=2)
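
To back the claim that these rates are similar with a statistic, one option is a chi-squared test of independence. The sketch below is an optional addition, not part of the lab's required steps. The counts are the ones printed by table(churn$contract_type, churn01) in the scaffold later in this section.

```r
# Counts of non-churners (0) and churners (1) by contract type,
# copied from the table(contract_type, churn01) output in this lab.
counts <- matrix(
  c( 97, 37,
     88, 36,
    105, 37),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    contract_type = c("Month-to-month", "One year", "Two year"),
    churn01 = c("0", "1")
  )
)

# Churn rate per contract type (matches the tapply() result above).
rates <- round(counts[, "1"] / rowSums(counts), 3)
print(rates)

# Chi-squared test of independence between contract type and churn.
test <- chisq.test(counts)
test$p.value
```

A p-value well above 0.05, as these counts produce, is consistent with the reading that contract type alone does not separate churners from non-churners in this sample.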

5.2 Insight 2 (required)

Insight statement (fill in):
> Customers who churned show higher average tenure (32.4 months) compared to customers who did not churn (28.8 months), which suggests churn is not limited to new customers and also occurs after longer customer relationships.

# Example idea: compare tenure_months by churn status
mean_no <- mean(churn$tenure_months[churn01==0], na.rm=TRUE)
mean_yes <- mean(churn$tenure_months[churn01==1], na.rm=TRUE)
mean_no; mean_yes
## [1] 28.80345
## [1] 32.37273
boxplot(churn$tenure_months ~ churn01,
        main="tenure_months by churn (0=No, 1=Yes)",
        xlab="churn01", ylab="tenure_months")

5.3 Insight 3 (required)

Insight statement (fill in):
> Customers with online security show higher churn (31.0%) compared to customers without online security (24.5%), which suggests that online security alone does not improve retention in this dataset.

# Example idea: churn rate by online_security (Yes/No)
ins3_tbl <- tapply(churn01, churn$online_security, mean, na.rm=TRUE)
ins3_tbl
##        No       Yes 
## 0.2453704 0.3097826
barplot(ins3_tbl, main="Churn Rate by Online Security", ylab="Churn Rate", las=2)

# Three “substantive insights” scaffold (simple and repeatable) ----
# Insight Example A: churn rate by a categorical variable
# Replace contract_type with another category if needed.
if ("contract_type" %in% names(churn)) {
  tbl <- table(churn$contract_type, churn01)
  print(tbl)
  # Churn rate by category (again, but now you can show counts too)
  churn_rate <- tapply(churn01, churn$contract_type, mean, na.rm = TRUE)
  print(churn_rate)

  # You can verbalize: "Category X is about ___ compared to category Y"
}
##                 churn01
##                    0   1
##   Month-to-month  97  37
##   One year        88  36
##   Two year       105  37
## Month-to-month       One year       Two year 
##      0.2761194      0.2903226      0.2605634
# Insight Example B: numeric variable difference by churn status
# Uses tenure_months if that column exists.
if ("tenure_months" %in% names(churn)) {
  # Compare group means
  mean_churn0 <- mean(churn$tenure_months[churn01 == 0], na.rm = TRUE)
  mean_churn1 <- mean(churn$tenure_months[churn01 == 1], na.rm = TRUE)
  cat("\nMean tenure_months (no churn):", round(mean_churn0, 2), "\n")
  cat("Mean tenure_months (churn):   ", round(mean_churn1, 2), "\n")

  # Simple boxplot by churn status
  boxplot(churn$tenure_months ~ churn01,
          main = "tenure_months by Churn (0=no, 1=yes)",
          xlab = "churn01", ylab = "tenure_months")
}
## 
## Mean tenure_months (no churn): 28.8 
## Mean tenure_months (churn):    32.37

# Insight Example C: Create simple tenure groups WITHOUT cut() (very explicit)
if ("tenure_months" %in% names(churn)) {
  tenure_group <- rep(NA, nrow(churn))
  tenure_group[churn$tenure_months <= 12] <- "0-12"
  tenure_group[churn$tenure_months > 12 & churn$tenure_months <= 36] <- "13-36"
  tenure_group[churn$tenure_months > 36 & churn$tenure_months <= 72] <- "37-72"
  tenure_group[churn$tenure_months > 72] <- "73+"  # no rows here: max tenure is 59 months

  # Churn rate by tenure group
  churn_rate_tenure <- tapply(churn01, tenure_group, mean, na.rm = TRUE)
  print(churn_rate_tenure)
  barplot(churn_rate_tenure, main = "Churn Rate by Tenure Group", ylab = "Churn Rate", las = 2)
}
##      0-12     13-36     37-72 
## 0.2465753 0.2429379 0.3266667
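
For comparison, the same grouping can be written with cut(). This is a sketch on made-up tenure values, not a replacement for the explicit version above.

```r
# Made-up tenure values (not from the dataset).
tenure <- c(1, 12, 13, 36, 37, 59)

# right = TRUE (the default) makes each interval closed on the right,
# so 12 falls in "0-12" and 36 in "13-36", as in the explicit version.
tenure_group <- cut(
  tenure,
  breaks = c(0, 12, 36, 72, Inf),
  labels = c("0-12", "13-36", "37-72", "73+")
)
table(tenure_group)
```

Two differences from the explicit version are worth knowing. cut() returns a factor that keeps the empty "73+" level, so tapply() would report NA for that group instead of dropping it. Also, with a lower break of 0, a tenure of exactly 0 would become NA; that does not arise here because the minimum tenure is 1.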


6 4) Business Framing (1–2 paragraphs)

Write 1–2 paragraphs answering:
  • How would your findings influence feature selection?
  • How would your findings influence modeling choices?
  • How would your findings influence preprocessing decisions?

Your response (fill in):

The findings suggest selecting tenure_months, monthly_charges, and total_charges as key numeric features, while excluding customer_id as an identifier. Categorical variables such as contract_type, online_security, and tech_support should be included, though contract_type shows limited standalone impact. The strong correlation between tenure_months and total_charges indicates possible redundancy.

For modeling, this correlation suggests caution with linear models and supports using regularized or tree-based approaches. Preprocessing should include encoding categorical variables, imputing minor missing values, and considering a transformation for total_charges due to right skew.


install.packages(c("fastmap", "htmltools", "bslib", "rmarkdown"))

7 Checklist Before Submitting