Part I: Data Description

Dataset of Choice

For this assignment, I will use the Telco Customer Churn dataset, which contains information about telecom customers and whether they churned (left the service provider).

  • Who collected it? This dataset is based on IBM Sample Data Sets and is widely available on platforms like Kaggle.
  • Format: The dataset is a single, publicly available CSV file with 7,043 rows and 21 columns.
  • Key Variables of Interest:
    • Churn: Whether the customer churned (Yes/No). This is our target variable.
    • tenure: The number of months a customer has been with the company. A shorter tenure may indicate a higher risk of churn.
    • MonthlyCharges: The amount the customer pays per month for services. Higher charges may make customers more likely to churn.
    • Contract: Type of contract (Month-to-month, One year, Two year). Customers with shorter contracts may be more likely to churn.
    • PaymentMethod: How the customer pays (Electronic check, Mailed check, etc.). Some payment methods may correlate with higher churn rates.
  • Research Questions:
    • What factors influence customer churn?
    • Does contract type affect churn probability?
    • How do monthly charges impact churn?

I will fit a logistic regression by maximum likelihood estimation (MLE) to estimate how tenure, monthly charges, and contract type relate to the probability of churn.
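
For reference, the model can be written in standard notation as below. The symbols are my own labels rather than anything taken from the dataset: x_i is a customer's predictor vector (with a leading 1 for the intercept), β is the coefficient vector, and y_i is the 0/1 churn indicator.

```latex
\[
\begin{aligned}
p_i &= \Pr(\text{Churn}_i = 1 \mid x_i) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}, \\
\ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr], \\
\hat{\beta} &= \arg\max_{\beta} \, \ell(\beta).
\end{aligned}
\]
```

The second line is the Bernoulli log-likelihood, and MLE picks the coefficients that maximize it.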

Part II: Pivoting Data

Demonstrating pivot_longer() and pivot_wider()

# Load the tidyverse (provides tibble, tidyr, dplyr, readr, and the %>% pipe)
library(tidyverse)

# Create a small dataset that simulates monthly and total charges for customers
set.seed(123)
pivot_data <- tibble(
  CustomerID = 1:5,
  Monthly_Charges = runif(5, 50, 100),  # Generate random values for monthly charges
  Total_Charges = runif(5, 500, 1000)   # Generate random values for total charges
)

print("Original Wide Format:")
## [1] "Original Wide Format:"
print(pivot_data)
## # A tibble: 5 × 3
##   CustomerID Monthly_Charges Total_Charges
##        <int>           <dbl>         <dbl>
## 1          1            64.4          523.
## 2          2            89.4          764.
## 3          3            70.4          946.
## 4          4            94.2          776.
## 5          5            97.0          728.
# Convert wide format to long format
long_data <- pivot_data %>%
  pivot_longer(cols = c("Monthly_Charges", "Total_Charges"), 
               names_to = "Charge_Type", 
               values_to = "Amount")

print("Long Format:")
## [1] "Long Format:"
print(long_data)
## # A tibble: 10 × 3
##    CustomerID Charge_Type     Amount
##         <int> <chr>            <dbl>
##  1          1 Monthly_Charges   64.4
##  2          1 Total_Charges    523. 
##  3          2 Monthly_Charges   89.4
##  4          2 Total_Charges    764. 
##  5          3 Monthly_Charges   70.4
##  6          3 Total_Charges    946. 
##  7          4 Monthly_Charges   94.2
##  8          4 Total_Charges    776. 
##  9          5 Monthly_Charges   97.0
## 10          5 Total_Charges    728.
# Convert back from long format to wide format
wide_data <- long_data %>%
  pivot_wider(names_from = "Charge_Type", values_from = "Amount")

print("Converted Back to Wide Format:")
## [1] "Converted Back to Wide Format:"
print(wide_data)
## # A tibble: 5 × 3
##   CustomerID Monthly_Charges Total_Charges
##        <int>           <dbl>         <dbl>
## 1          1            64.4          523.
## 2          2            89.4          764.
## 3          3            70.4          946.
## 4          4            94.2          776.
## 5          5            97.0          728.

Explanation

  • pivot_longer(): This function transforms wide-format data into long-format data by stacking multiple columns into two new ones: Charge_Type (categorical variable indicating type of charge) and Amount (the actual value of the charge). This is useful for reshaping data to make it easier to analyze and visualize.
  • pivot_wider(): This function converts long-format data back to wide-format by spreading Charge_Type categories into separate columns again. This is useful when summarizing or displaying data in an easy-to-read format.

Reshaping like this is a routine data-cleaning step. Long format suits grouped summaries and plots that treat Charge_Type as a category, while wide format suits side-by-side comparison and models that expect one column per variable.
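
As a small illustration of why the long format helps, the sketch below computes one grouped summary that covers every charge type at once. The toy values and the names toy_wide, toy_long, and charge_summary are hypothetical and chosen only for this example; they are not taken from the Telco data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy values for illustration only (not from the Telco data)
toy_wide <- tibble(
  CustomerID = 1:3,
  Monthly_Charges = c(60, 80, 100),
  Total_Charges = c(600, 900, 1200)
)

# Stack the two charge columns into Charge_Type / Amount pairs
toy_long <- toy_wide %>%
  pivot_longer(cols = c(Monthly_Charges, Total_Charges),
               names_to = "Charge_Type",
               values_to = "Amount")

# One grouped summary handles every charge type without naming each column
charge_summary <- toy_long %>%
  group_by(Charge_Type) %>%
  summarise(mean_amount = mean(Amount),
            n_customers = n(),
            .groups = "drop")

print(charge_summary)
```

In wide format, the same summary would need one mean() call per column, so the long layout scales better as more charge types are added.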

Part III: Adapting MLE Analysis

Using the Telco Dataset for MLE Logistic Regression

# Load maxLik for maximum likelihood estimation
library(maxLik)

# Load Telco dataset and preprocess the data
telco_data <- read_csv("telco-data.csv.csv") %>% 
  mutate(Churn = if_else(Churn == "Yes", 1, 0),  # Convert Churn to binary (1 = churn, 0 = no churn)
         Contract = as.numeric(factor(Contract)),  # Integer codes in alphabetical level order: 1 = Month-to-month, 2 = One year, 3 = Two year
         PaymentMethod = as.numeric(factor(PaymentMethod)))  # Arbitrary integer codes; not used in the model below
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Define Log-Likelihood Function for Logistic Regression
logit_lf <- function(param) {
  # param = c(intercept, beta_tenure, beta_monthly, beta_contract)
  
  # Response variable (binary: 1 = churn, 0 = no churn)
  y <- as.vector(telco_data$Churn)
  
  # Design matrix with intercept, tenure, monthly charges, and contract type
  x <- cbind(1, telco_data$tenure, telco_data$MonthlyCharges, telco_data$Contract)
  
  # Linear predictor (log-odds), computed once instead of twice
  eta <- as.vector(x %*% param)
  
  # Predicted probabilities via the logistic function: plogis(eta) = exp(eta) / (1 + exp(eta))
  prob <- plogis(eta)
  
  # Compute log-likelihood by summing over all observations
  sum(dbinom(y, size = 1, prob = prob, log = TRUE))
}

# Running MLE for Logistic Regression
mle_logit <- maxLik(logLik = logit_lf, start = c(0, 0, 0, 0))  # Initial estimates for parameters
summary(mle_logit)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -3073.039 
## 4  free parameters
## Estimates:
##       Estimate Std. error t value  Pr(> t)    
## [1,] -0.573358   0.118513  -4.838 1.31e-06 ***
## [2,] -0.035787   0.002036 -17.574  < 2e-16 ***
## [3,]  0.028621   0.001348  21.228  < 2e-16 ***
## [4,] -1.030966   0.071392 -14.441  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
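
A hand-written likelihood like the one above can be sanity-checked against R's built-in glm(), which fits the same model. The sketch below does this on simulated stand-in data, not the Telco file, and swaps in base R's optim() for maxLik so it runs without extra packages. The sample size, the predictors x1 and x2, and the true_beta values are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in data (NOT the Telco file): two predictors, binary outcome
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
true_beta <- c(-0.5, 1.0, -0.8)   # assumed values, used only to simulate y
X <- cbind(1, x1, x2)
y <- rbinom(n, size = 1, prob = as.vector(plogis(X %*% true_beta)))

# Negative Bernoulli log-likelihood; optim() minimises, so the sign is flipped
neg_loglik <- function(beta) {
  prob <- as.vector(plogis(X %*% beta))
  -sum(dbinom(y, size = 1, prob = prob, log = TRUE))
}

# Maximise the likelihood numerically, starting from zeros
fit_optim <- optim(par = c(0, 0, 0), fn = neg_loglik, method = "BFGS")

# Reference fit using R's built-in logistic regression
fit_glm <- glm(y ~ x1 + x2, family = binomial)

# The two columns should agree closely
comparison <- cbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
print(round(comparison, 3))
```

If the two columns differ noticeably, the usual suspects are a sign error in the likelihood or an optimizer that stopped early.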

Explanation

  • This adapts the ToothGrowth MLE approach to predict customer churn.
  • Binary logistic regression: Models Churn (1 = churned, 0 = retained) as a function of three predictor variables:
    • tenure: Measures how long the customer has been subscribed.
    • MonthlyCharges: Measures the cost per month paid by the customer.
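    • Contract: Contract length, encoded as a numeric score (1 = month-to-month, 2 = one year, 3 = two year), so the model assumes each step up in contract length shifts the log-odds of churn by the same amount.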
  • MLE estimates coefficients by maximizing the log-likelihood function using maxLik.
  • The logistic function transforms a linear combination of predictors into a probability (bounded between 0 and 1), representing the likelihood of customer churn.
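  • The MLE output reports four estimates on the log-odds scale, in the same order as the start vector (rows [1,] to [4,]):
    • Intercept (β0 = -0.573): The log-odds of churn when all predictors equal zero. It is not a probability, and because Contract is coded 1 to 3, zero lies outside the observed range, so the intercept has no direct interpretation.
    • Effect of tenure (β1 = -0.036): Negative, so each additional month of tenure lowers the log-odds of churn, holding the other predictors fixed.
    • Effect of monthly charges (β2 = 0.029): Positive, so higher monthly charges are associated with higher odds of churn.
    • Effect of contract type (β3 = -1.031): Negative, so each step to a longer contract lowers the log-odds of churn.
  • All four estimates are statistically significant (p < 0.001).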
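
To make the log-odds estimates easier to read, the sketch below converts them to odds ratios and to a predicted churn probability. The coefficients are copied from the maxLik summary above. The customer profile (12 months of tenure, monthly charges of 70, month-to-month contract) is hypothetical and chosen only for illustration.

```r
# Estimates copied from the maxLik summary above (log-odds scale)
beta <- c(intercept = -0.573358, tenure = -0.035787,
          monthly = 0.028621, contract = -1.030966)

# Odds ratios: multiplicative change in the odds of churn per one-unit increase
odds_ratios <- exp(beta[-1])
print(round(odds_ratios, 3))

# Hypothetical profile: intercept term, 12 months tenure, monthly charges of 70,
# Contract code 1 (month-to-month under the alphabetical factor coding)
profile <- c(1, 12, 70, 1)

# Linear predictor, then the logistic function to get a probability
p_churn <- plogis(sum(profile * beta))
print(round(p_churn, 3))
```

For this profile the predicted churn probability comes out close to one half, and the contract odds ratio of roughly 0.36 means each step to a longer contract cuts the odds of churn by about two thirds under this model.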

Conclusion

This completes Assignment 3: I described the Telco Customer Churn dataset, demonstrated pivot_longer() and pivot_wider(), and fit a logistic regression by MLE in R.