Assignment 3: Maximum Likelihood Estimation and Data Transformation

Part I: Data Description

Dataset of Choice

For this assignment, I will use the Telco Customer Churn dataset, which contains information about telecom customers and whether they churned (left the service provider).

Who collected it? This dataset is based on IBM Sample Data Sets and is widely available on platforms like Kaggle.
Format: The dataset is in CSV format and is publicly available.
Key Variables of Interest:
- Churn: Whether the customer churned (Yes/No). This is our target variable.
- tenure: The number of months a customer has been with the company. A shorter tenure may indicate a higher risk of churn.
- MonthlyCharges: The amount the customer pays per month for services. Higher charges may make churn more likely.
- Contract: Type of contract (Month-to-month, One year, Two year). Customers with shorter contracts may be more likely to churn.
- PaymentMethod: How the customer pays (Electronic check, Mailed check, etc.). Some payment methods may correlate with higher churn rates.
Research Questions:
- What factors influence customer churn?
- Does contract type affect churn probability?
- How do monthly charges impact churn?

I plan to fit a logistic regression by maximum likelihood estimation (MLE) to estimate the probability of churn as a function of tenure, monthly charges, and contract type.

Part II: Pivoting Data

Demonstrating `pivot_longer()` and `pivot_wider()`

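# Load the tidyverse, which provides tibble(), the %>% pipe, and the pivot functions
library(tidyverse)

# Create a small dataset that simulates monthly and total charges for customers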
set.seed(123)
pivot_data <- tibble(
  CustomerID = 1:5,
  Monthly_Charges = runif(5, 50, 100),  # Generate random values for monthly charges
  Total_Charges = runif(5, 500, 1000)   # Generate random values for total charges
)

print("Original Wide Format:")

## [1] "Original Wide Format:"

print(pivot_data)

## # A tibble: 5 × 3
##   CustomerID Monthly_Charges Total_Charges
##        <int>           <dbl>         <dbl>
## 1          1            64.4          523.
## 2          2            89.4          764.
## 3          3            70.4          946.
## 4          4            94.2          776.
## 5          5            97.0          728.

# Convert wide format to long format
long_data <- pivot_data %>%
  pivot_longer(cols = c("Monthly_Charges", "Total_Charges"), 
               names_to = "Charge_Type", 
               values_to = "Amount")

print("Long Format:")

## [1] "Long Format:"

print(long_data)

## # A tibble: 10 × 3
##    CustomerID Charge_Type     Amount
##         <int> <chr>            <dbl>
##  1          1 Monthly_Charges   64.4
##  2          1 Total_Charges    523. 
##  3          2 Monthly_Charges   89.4
##  4          2 Total_Charges    764. 
##  5          3 Monthly_Charges   70.4
##  6          3 Total_Charges    946. 
##  7          4 Monthly_Charges   94.2
##  8          4 Total_Charges    776. 
##  9          5 Monthly_Charges   97.0
## 10          5 Total_Charges    728.

# Convert back from long format to wide format
wide_data <- long_data %>%
  pivot_wider(names_from = "Charge_Type", values_from = "Amount")

print("Converted Back to Wide Format:")

## [1] "Converted Back to Wide Format:"

print(wide_data)

## # A tibble: 5 × 3
##   CustomerID Monthly_Charges Total_Charges
##        <int>           <dbl>         <dbl>
## 1          1            64.4          523.
## 2          2            89.4          764.
## 3          3            70.4          946.
## 4          4            94.2          776.
## 5          5            97.0          728.

Explanation

pivot_longer(): This function reshapes wide-format data into long format by stacking the Monthly_Charges and Total_Charges columns into two new columns: Charge_Type (a categorical variable holding the original column name) and Amount (the corresponding value). Each customer then occupies one row per charge type, which suits grouped summaries and plotting.
pivot_wider(): This function reverses the operation by spreading the Charge_Type categories back into separate columns, so that each customer occupies a single row again. This layout is convenient for side-by-side comparison and for display tables.

Reshaping in both directions is a routine step in data cleaning and preprocessing. The round trip above returns the original five-row table unchanged, so no information is lost when switching formats.
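As a minimal sketch of why the long format helps, the example below reshapes a small table and then summarises every charge type with a single grouped call. The table `toy_wide` and its values are hypothetical and chosen for illustration; they are not taken from the Telco dataset.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical toy data with fixed values (not from the Telco dataset)
toy_wide <- tibble(
  CustomerID = 1:3,
  Monthly_Charges = c(60, 80, 100),
  Total_Charges = c(600, 900, 1200)
)

# Wide to long: one row per customer and charge type
toy_long <- toy_wide %>%
  pivot_longer(cols = c(Monthly_Charges, Total_Charges),
               names_to = "Charge_Type",
               values_to = "Amount")

# In long format, one grouped summary covers every charge type at once
charge_summary <- toy_long %>%
  group_by(Charge_Type) %>%
  summarise(Mean_Amount = mean(Amount), .groups = "drop")

# Long back to wide: recovers the original layout
toy_round_trip <- toy_long %>%
  pivot_wider(names_from = Charge_Type, values_from = Amount)

print(charge_summary)
```

In wide format the same summary would need one `mean()` call per column, so the long layout scales better as more charge types are added.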

Part III: Adapting MLE Analysis

Using the Telco Dataset for MLE Logistic Regression

# Load Telco dataset and preprocess the data
telco_data <- read_csv("telco-data.csv.csv") %>% 
  mutate(Churn = if_else(Churn == "Yes", 1, 0),  # Convert Churn to binary (1 = churn, 0 = no churn)
         Contract = as.numeric(factor(Contract)),  # Encode contract as a numeric score: 1 = Month-to-month, 2 = One year, 3 = Two year (alphabetical factor levels)
         PaymentMethod = as.numeric(factor(PaymentMethod)))  # Encode payment method as integer codes (not used in the model below)

## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
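The encoding step above relies on how factor() orders its levels. The sketch below shows the mapping on a few hypothetical contract labels, using the three categories described in Part I. The dummy-variable alternative at the end is not what this analysis uses and is shown only for comparison.

```r
# Hypothetical contract labels (illustrative, not rows from the dataset)
contract <- c("Two year", "Month-to-month", "One year", "Month-to-month")

# factor() sorts its levels alphabetically by default
contract_factor <- factor(contract)
print(levels(contract_factor))

# as.numeric() returns the position of each value's level (1, 2, or 3),
# so the model treats contract type as a single numeric score
contract_code <- as.numeric(contract_factor)
print(contract_code)

# Alternative (an assumption, not used in this assignment): keep the factor
# and let model.matrix() build one dummy column per non-baseline level
dummies <- model.matrix(~ contract_factor)
print(colnames(dummies))
```

With the numeric score, a single coefficient describes every one-step change in contract length. Dummy columns would instead estimate a separate effect for one-year and two-year contracts.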

# Define the log-likelihood function for logistic regression
logit_lf <- function(param) {
  # param = c(intercept, beta_tenure, beta_monthly, beta_contract)
  
  # Response variable (binary: 1 = churn, 0 = no churn)
  y <- as.vector(telco_data$Churn)
  
  # Design matrix with intercept, tenure, monthly charges, and contract type
  x <- cbind(1, telco_data$tenure, telco_data$MonthlyCharges, telco_data$Contract)
  
  # Linear predictor, then predicted probabilities via the logistic function
  # (plogis(eta) equals exp(eta) / (1 + exp(eta)) but avoids overflow for large eta)
  eta <- as.vector(x %*% param)
  prob <- plogis(eta)
  
  # Compute log-likelihood by summing over all observations
  sum(dbinom(y, size = 1, prob = prob, log = TRUE))
}

# Run MLE for logistic regression (maxLik() comes from the maxLik package)
library(maxLik)
mle_logit <- maxLik(logLik = logit_lf, start = c(0, 0, 0, 0))  # Starting values for the four parameters
summary(mle_logit)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -3073.039 
## 4  free parameters
## Estimates:
##       Estimate Std. error t value  Pr(> t)    
## [1,] -0.573358   0.118513  -4.838 1.31e-06 ***
## [2,] -0.035787   0.002036 -17.574  < 2e-16 ***
## [3,]  0.028621   0.001348  21.228  < 2e-16 ***
## [4,] -1.030966   0.071392 -14.441  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
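To make the estimates concrete, the sketch below plugs the coefficients printed above into the logistic function for two hypothetical customers who differ only in contract type. The customer profiles (12 months of tenure, 70 per month) are illustrative assumptions, not rows from the dataset.

```r
# Estimates copied from the summary output above
beta_hat <- c(intercept = -0.573358, tenure = -0.035787,
              monthly = 0.028621, contract = -1.030966)

# Hypothetical profiles: columns are 1 (intercept), tenure, MonthlyCharges, Contract code
profiles <- rbind(
  month_to_month = c(1, 12, 70, 1),
  two_year       = c(1, 12, 70, 3)
)

# Linear predictor, then probability via the logistic function
churn_prob <- setNames(plogis(as.vector(profiles %*% beta_hat)), rownames(profiles))

# Odds ratios: multiplicative change in the odds of churn per one-unit increase
odds_ratio <- exp(beta_hat[-1])

print(round(churn_prob, 3))
print(round(odds_ratio, 3))
```

Under these assumptions the predicted churn probability is about 0.49 for the month-to-month customer and about 0.11 for the two-year customer, which matches the negative sign of the contract coefficient.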

Explanation

This adapts the ToothGrowth MLE approach to predict customer churn.
Binary logistic regression: Models Churn (1 = churned, 0 = retained) as a function of three predictor variables:
- tenure: Measures how long the customer has been subscribed.
- MonthlyCharges: Measures the cost per month paid by the customer.
- Contract: The contract type, encoded as a numeric score (1 = month-to-month, 2 = one-year, 3 = two-year), so the model assumes the same change in log-odds for each step up in contract length.
MLE estimates coefficients by maximizing the log-likelihood function using maxLik.
The logistic function transforms the linear combination of predictors into a value between 0 and 1, which is interpreted as the probability that a customer churns.
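Written out, the model and the objective that logit_lf() evaluates take the form below. The symbols (the linear predictor \eta_i, the churn probability p_i, and the log-likelihood \ell) are notation introduced here for illustration and do not appear in the original code.

```latex
\begin{align*}
\eta_i &= \beta_0 + \beta_1\,\mathrm{tenure}_i + \beta_2\,\mathrm{MonthlyCharges}_i + \beta_3\,\mathrm{Contract}_i \\
p_i &= \Pr(\mathrm{Churn}_i = 1) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \\
\ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr]
\end{align*}
```

Each term of the sum is the log of the Bernoulli probability of the observed outcome y_i, which is what dbinom(y, size = 1, prob = prob, log = TRUE) returns.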
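The MLE output provides estimates for the following parameters (rows [1,] to [4,] of the summary, in order):
- Intercept (β0 = -0.573): The log-odds of churn when all predictors equal zero. Because Contract is coded from 1 to 3, this anchors the linear predictor rather than describing the churn probability of a real customer.
- Effect of tenure (β1 = -0.036): Negative, so each additional month of tenure lowers the log-odds of churn, holding the other predictors fixed.
- Effect of monthly charges (β2 = 0.029): Positive, so higher monthly charges raise the log-odds of churn.
- Effect of contract type (β3 = -1.031): Negative, so each step toward a longer contract lowers the log-odds of churn.
All four estimates are statistically significant at the 0.001 level.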
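One way to check a hand-written log-likelihood is to compare its maximiser with the coefficients from glm(), which fits the same model. The sketch below does this on simulated data, so it runs without the Telco file. It uses base R's optim() in place of maxLik() to avoid the extra package, and the variable names (`x1`, `x2`, `true_beta`) are hypothetical.

```r
# Simulated data (hypothetical; not the Telco dataset)
set.seed(42)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)
true_beta <- c(-0.5, 1.0, -0.8)
y <- rbinom(n, size = 1, prob = plogis(as.vector(X %*% true_beta)))

# Negative Bernoulli log-likelihood (optim() minimises by default).
# plogis(..., log.p = TRUE) gives log(p) and log(1 - p) without overflow.
neg_loglik <- function(beta) {
  eta <- as.vector(X %*% beta)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

# Maximise the likelihood numerically, starting from zeros
fit_optim <- optim(par = c(0, 0, 0), fn = neg_loglik, method = "BFGS")

# Reference fit of the same model with glm()
fit_glm <- glm(y ~ x1 + x2, family = binomial())

# The two sets of coefficients should agree closely
comparison <- cbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
max_abs_diff <- max(abs(comparison[, "optim"] - comparison[, "glm"]))
print(round(comparison, 3))
```

The same comparison could be run on the Telco fit by calling glm(Churn ~ tenure + MonthlyCharges + Contract, family = binomial(), data = telco_data) and setting its coefficients beside the maxLik estimates.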

Conclusion

Part I: Described the Telco dataset and research objectives.
Part II: Demonstrated pivot_longer() and pivot_wider().
Part III: Used MLE for logistic regression on churn prediction, adapting the ToothGrowth example.

This completes Assignment 3 with an in-depth application of MLE, logistic regression, and data transformation techniques in R.


Marc Brian Ventura

03/10/2025
