For this assignment, I will use the Telco Customer Churn dataset, which contains information about telecom customers and whether they churned (left the service provider).
Churn: Whether the customer churned (Yes/No). This is our target variable.
tenure: The number of months a customer has been with the company. A shorter tenure may indicate a higher risk of churn.
MonthlyCharges: The amount the customer pays per month for services. Higher charges may impact churn behavior.
Contract: Type of contract (Month-to-month, One year, Two year). Customers with shorter contracts may be more likely to churn.
PaymentMethod: How the customer pays (Electronic check, Mailed check, etc.). Some payment methods may correlate with higher churn rates.

I plan to conduct logistic regression using MLE to estimate churn probabilities based on these factors.
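For reference, the model behind this plan can be written out as below. This is the standard logistic-regression likelihood; the symbols (x_i for a customer's predictor vector including a leading 1, β for the coefficients, p_i for the churn probability, y_i for the 0/1 churn indicator) are notation introduced here for illustration.

```latex
p_i = \Pr(y_i = 1 \mid x_i) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)},
\qquad
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```

MLE chooses the β that maximizes ℓ(β), the sum of Bernoulli log-probabilities over all customers.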
pivot_longer() and pivot_wider()
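# Load the tidyverse, which provides tibble(), %>%, read_csv(), and the pivot functions
library(tidyverse)
# Create a small dataset that simulates monthly and total charges for customers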
set.seed(123)
pivot_data <- tibble(
CustomerID = 1:5,
Monthly_Charges = runif(5, 50, 100), # Generate random values for monthly charges
Total_Charges = runif(5, 500, 1000) # Generate random values for total charges
)
print("Original Wide Format:")
## [1] "Original Wide Format:"
print(pivot_data)
## # A tibble: 5 × 3
## CustomerID Monthly_Charges Total_Charges
## <int> <dbl> <dbl>
## 1 1 64.4 523.
## 2 2 89.4 764.
## 3 3 70.4 946.
## 4 4 94.2 776.
## 5 5 97.0 728.
# Convert wide format to long format
long_data <- pivot_data %>%
pivot_longer(cols = c("Monthly_Charges", "Total_Charges"),
names_to = "Charge_Type",
values_to = "Amount")
print("Long Format:")
## [1] "Long Format:"
print(long_data)
## # A tibble: 10 × 3
## CustomerID Charge_Type Amount
## <int> <chr> <dbl>
## 1 1 Monthly_Charges 64.4
## 2 1 Total_Charges 523.
## 3 2 Monthly_Charges 89.4
## 4 2 Total_Charges 764.
## 5 3 Monthly_Charges 70.4
## 6 3 Total_Charges 946.
## 7 4 Monthly_Charges 94.2
## 8 4 Total_Charges 776.
## 9 5 Monthly_Charges 97.0
## 10 5 Total_Charges 728.
# Convert back from long format to wide format
wide_data <- long_data %>%
pivot_wider(names_from = "Charge_Type", values_from = "Amount")
print("Converted Back to Wide Format:")
## [1] "Converted Back to Wide Format:"
print(wide_data)
## # A tibble: 5 × 3
## CustomerID Monthly_Charges Total_Charges
## <int> <dbl> <dbl>
## 1 1 64.4 523.
## 2 2 89.4 764.
## 3 3 70.4 946.
## 4 4 94.2 776.
## 5 5 97.0 728.
pivot_longer(): This function transforms wide-format data into long-format data by stacking multiple columns into two new ones: Charge_Type (a categorical variable indicating the type of charge) and Amount (the actual value of the charge). This is useful for reshaping data to make it easier to analyze and visualize.
pivot_wider(): This function converts long-format data back to wide format by spreading the Charge_Type categories into separate columns again. This is useful when summarizing or displaying data in an easy-to-read format.

This transformation is important in data cleaning and preprocessing, making it easier to perform statistical analyses and visualizations.
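As a small illustration of why the long format helps with analysis, the sketch below computes a per-type average with a single grouped summary and then checks that pivoting back recovers the original table. The toy values and the names `charges`, `charge_means`, and `round_trip` are hypothetical and are not taken from the Telco data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy values chosen so the averages are easy to verify by hand
charges <- tibble(
  CustomerID = 1:3,
  Monthly_Charges = c(60, 80, 100),
  Total_Charges = c(600, 700, 800)
)

# Wide -> long: one row per customer and charge type
long <- charges %>%
  pivot_longer(cols = c(Monthly_Charges, Total_Charges),
               names_to = "Charge_Type",
               values_to = "Amount")

# In long format, one grouped summary covers every charge type at once
charge_means <- long %>%
  group_by(Charge_Type) %>%
  summarise(Mean_Amount = mean(Amount), .groups = "drop")
print(charge_means)

# Long -> wide: the round trip should reproduce the original table
round_trip <- long %>%
  pivot_wider(names_from = Charge_Type, values_from = Amount)
print(all.equal(as.data.frame(charges), as.data.frame(round_trip)))
```

With more charge columns, the grouped summary would stay the same; only the `cols` argument of pivot_longer() would change.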
# Load Telco dataset and preprocess the data
telco_data <- read_csv("telco-data.csv.csv") %>%
mutate(Churn = if_else(Churn == "Yes", 1, 0), # Convert Churn to binary (1 = churn, 0 = no churn)
Contract = as.numeric(factor(Contract)), # Convert contract type to numeric (categorical variable encoding)
PaymentMethod = as.numeric(factor(PaymentMethod))) # Convert payment method to numeric for regression
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
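The as.numeric(factor(...)) step above deserves a closer look, because it fixes how Contract enters the model. The sketch below, on a hypothetical four-element vector, shows that the integer codes follow the factor's level order (set explicitly here) and contrasts this with dummy-variable coding from model.matrix(), which does not assume equal spacing between contract types. The dummy coding is shown as an alternative, not as what the analysis above does.

```r
# Hypothetical example values using the three contract types from the dataset
contract <- c("Two year", "Month-to-month", "One year", "Month-to-month")

# Setting the levels explicitly makes the ordering independent of locale sorting
contract_f <- factor(contract,
                     levels = c("Month-to-month", "One year", "Two year"))

# Integer coding: 1 = Month-to-month, 2 = One year, 3 = Two year
codes <- as.numeric(contract_f)
print(codes)

# Alternative: dummy variables, one column per non-reference level
dummies <- model.matrix(~ contract_f)
print(colnames(dummies))
```

With the integer coding, a single coefficient describes the change in log-odds per step from one contract type to the next, so the move from month-to-month to one year is assumed to have the same effect as the move from one year to two years.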
# Define Log-Likelihood Function for Logistic Regression
logit_lf <- function(param) {
  intercept <- param[1] # Intercept term
  beta_tenure <- param[2] # Coefficient for tenure (length of stay)
  beta_monthly <- param[3] # Coefficient for monthly charges (cost)
  beta_contract <- param[4] # Coefficient for contract type
  # Response variable (binary: 1 = churn, 0 = no churn)
  y <- as.vector(telco_data$Churn)
  # Design matrix with intercept, tenure, monthly charges, and contract type
  x <- cbind(1, telco_data$tenure, telco_data$MonthlyCharges, telco_data$Contract)
  # Linear predictor, computed once and reused below
  eta <- x %*% c(intercept, beta_tenure, beta_monthly, beta_contract)
  # Compute predicted probabilities using the logistic function
  prob <- exp(eta) / (1 + exp(eta))
  # Compute log-likelihood by summing over all observations
  sum(dbinom(y, size = 1, prob = prob, log = TRUE))
}
# Load the maxLik package, which provides maxLik()
library(maxLik)
# Running MLE for Logistic Regression
mle_logit <- maxLik(logLik = logit_lf, start = c(0, 0, 0, 0)) # Initial estimates for parameters
summary(mle_logit)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -3073.039
## 4 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] -0.573358 0.118513 -4.838 1.31e-06 ***
## [2,] -0.035787 0.002036 -17.574 < 2e-16 ***
## [3,] 0.028621 0.001348 21.228 < 2e-16 ***
## [4,] -1.030966 0.071392 -14.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
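One way to sanity-check a hand-written likelihood like logit_lf is to compare it with R's built-in glm() on data where the answer is known. The sketch below makes two substitutions that should be read as assumptions: it uses simulated stand-in data (the variable names and true coefficients are made up, not the Telco file), and it replaces maxLik with a hand-coded Newton-Raphson loop in base R so that it runs without extra packages.

```r
# Simulated stand-in data with hypothetical true coefficients
set.seed(42)
n <- 2000
tenure <- runif(n, 0, 72)
monthly <- runif(n, 20, 120)
churn <- rbinom(n, size = 1, prob = plogis(-0.5 - 0.04 * tenure + 0.03 * monthly))

# Design matrix with an intercept column
x <- cbind(1, tenure, monthly)

# Gradient of the logistic log-likelihood: X'(y - p)
score <- function(beta) {
  p <- plogis(as.vector(x %*% beta))
  as.vector(t(x) %*% (churn - p))
}

# Hessian of the logistic log-likelihood: -X'WX with W = diag(p * (1 - p))
hessian <- function(beta) {
  p <- plogis(as.vector(x %*% beta))
  -t(x) %*% (x * (p * (1 - p)))
}

# Newton-Raphson iterations starting from zero
beta <- c(0, 0, 0)
for (iter in 1:25) {
  step <- solve(hessian(beta), score(beta))
  beta <- beta - as.vector(step)
  if (max(abs(step)) < 1e-10) break
}

# Reference fit from glm(), which maximizes the same likelihood
fit_glm <- glm(churn ~ tenure + monthly, family = binomial)

comparison <- cbind(newton = beta, glm = unname(coef(fit_glm)))
rownames(comparison) <- c("intercept", "tenure", "monthly")
print(round(comparison, 4))
```

If the two columns agree closely, the hand-coded likelihood and its derivatives are consistent with the standard fit. The same comparison could be run on telco_data with the formula Churn ~ tenure + MonthlyCharges + Contract.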
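The model estimates Churn (1 = churned, 0 = retained) as a function of three predictor variables:
tenure: Measures how long the customer has been subscribed.
MonthlyCharges: Measures the cost per month paid by the customer.
Contract: Indicates whether the customer has a month-to-month, one-year, or two-year contract.

The coefficients were estimated by maximum likelihood with maxLik.
Intercept (β0): Baseline log-odds of churn when all predictors are zero. The estimate is -0.573.
Tenure (β1): The estimate is negative (-0.036), so longer tenure is associated with a lower likelihood of churn.
Monthly charges (β2): The estimate is positive (0.029), so higher monthly charges are associated with a higher likelihood of churn.
Contract (β3): The estimate is negative (-1.031), so longer contracts are associated with a lower churn probability.

The data reshaping example demonstrated pivot_longer() and pivot_wider().

This completes Assignment 3 with an in-depth application of MLE, logistic regression, and data transformation techniques in R.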
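To make the coefficient signs easier to read, the sketch below converts the estimates printed in the maxLik summary into odds ratios and a predicted probability. The coefficient values are copied from that output. The example customer (12 months of tenure, 70 in monthly charges, contract code 1 for month-to-month) is hypothetical.

```r
# Estimates copied from the maxLik summary output above
est <- c(intercept = -0.573358,
         tenure    = -0.035787,
         monthly   =  0.028621,
         contract  = -1.030966)

# Odds ratios: multiplicative change in the odds of churn per one-unit increase
odds_ratios <- exp(est[-1])
print(round(odds_ratios, 3))

# Hypothetical customer: 12 months tenure, 70 monthly charges, contract code 1
customer <- c(1, 12, 70, 1)
eta <- sum(est * customer)

# Logistic function maps the linear predictor to a probability
p_churn <- plogis(eta)
print(round(p_churn, 3))
```

An odds ratio below 1 (tenure, contract) means the odds of churn fall as the predictor rises, and a ratio above 1 (monthly charges) means they rise.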