---
title: "IT540 Assignment 1: Exploration of Risk Factors Contributing to Cancer Mortality"
author: "Edmond Louis Dzimah"
date: "May 24, 2025"
output: html_document
---
```{r setup, include=FALSE}
# List of required packages
required_packages <- c(
"tidyverse",
"caret",
"car",
"pROC",
"corrplot",
"MASS",
"knitr"
)
# Install packages not already installed
installed_packages <- rownames(installed.packages())
for(pkg in required_packages) {
if(!pkg %in% installed_packages) {
install.packages(pkg)
}
}
# Load libraries
library(tidyverse)
library(caret)
library(car)
library(pROC)
library(corrplot)
library(MASS)
library(knitr)
# Set knitr options
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
This assignment explores county-level U.S. data to identify factors influencing cancer mortality using linear and logistic regression. The goal is to quantify relationships between demographic, economic, and healthcare factors and cancer death rates, classify counties as high/low-risk, and provide actionable insights for public health policymakers.
{r load-data} # Load the dataset setwd("C:/Users/dzima/Documents/DBA AIML/Linear Reg and Logit Assignment") data <- read.csv("cancer_data_cleaned.csv") # Check for missing values sum(is.na(data)) # No missing values
The target variable is TARGET_deathRate, representing
cancer mortality rate per 100,000 people. We explore distributions,
outliers, and relationships with predictors.
{r eda-stats} # Summary statistics summary(data$TARGET_deathRate) # Variance and standard deviation var(data$TARGET_deathRate) sd(data$TARGET_deathRate)
The mean death rate is approximately 178.7, with a standard deviation of 27.8, indicating moderate variability. The range (70.2 to 362.8) suggests potential outliers.
```{r eda-plots, fig.width=10, fig.height=8} # Histogram of target variable ggplot(data, aes(x = TARGET_deathRate)) + geom_histogram(bins = 30, fill = “blue”, alpha = 0.7) + theme_minimal() + labs(title = “Distribution of Cancer Death Rates”, x = “Death Rate”, y = “Count”)
ggplot(data, aes(y = TARGET_deathRate)) + geom_boxplot(fill = “lightblue”) + theme_minimal() + labs(title = “Boxplot of Cancer Death Rates”)
cor_matrix <- cor(data) corrplot(cor_matrix, method = “color”, type = “upper”, tl.cex = 0.7, title = “Correlation Matrix”)
**Observations**:
- The histogram shows a roughly normal distribution with a slight right skew.
- The boxplot indicates outliers above approximately 250, which we will investigate later.
- The correlation matrix reveals moderate correlations between `TARGET_deathRate` and predictors like `povertyPercent` (0.43), `PctBlack` (0.26), and `PctPrivateCoverage` (-0.34).
### 1.2 Interpretation of Factors
We hypothesize that poverty, race, and insurance coverage influence death rates.
```{r eda-scatter, fig.width=10, fig.height=8}
# Scatter plots for key predictors
ggplot(data, aes(x = povertyPercent, y = TARGET_deathRate)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "Death Rate vs. Poverty Percent")
ggplot(data, aes(x = PctBlack, y = TARGET_deathRate)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "Death Rate vs. Percent Black Population")
ggplot(data, aes(x = PctPrivateCoverage, y = TARGET_deathRate)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "Death Rate vs. Percent Private Coverage")
Insights: - Poverty: Higher poverty
percentages are associated with higher death rates, likely due to
limited access to healthcare and healthy resources. -
Race: Counties with a higher percentage of Black
residents tend to have higher death rates, possibly reflecting systemic
healthcare disparities. - Insurance Coverage: Higher
private coverage is associated with lower death rates, suggesting better
access to preventive care. - Interdependence:
Predictors like povertyPercent and
PctPrivateCoverage are negatively correlated (-0.62),
indicating that wealthier counties have better insurance coverage.
We split the data into 80% training and 20% testing sets, as this ratio balances model training and evaluation while ensuring sufficient test data.
{r data-split} set.seed(123) trainIndex <- createDataPartition(data$TARGET_deathRate, p = 0.8, list = FALSE) train <- data[trainIndex, ] test <- data[-trainIndex, ]
Justification: An 80:20 split is standard for regression tasks with moderate-sized datasets. For logistic regression, stratified sampling may be preferred if class imbalance exists, which we will check in Part 5.
{r full-model} full_model <- lm(TARGET_deathRate ~ ., data = train) summary(full_model)
We fit two reduced models: - Reduced Model 1:
Includes significant predictors at 10% level
(povertyPercent, PctPrivateCoverage,
PctBlack). - Reduced Model 2: Includes
predictors with high correlation to the target
(povertyPercent, PctPrivateCoverage,
medIncome).
{r reduced-models} reduced_model1 <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + PctBlack, data = train) reduced_model2 <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + medIncome, data = train)
```{r model-comparison} # Function to compute MAE, RMSE, R² evaluate_model <- function(model, train_data, test_data) { pred_train <- predict(model, train_data) pred_test <- predict(model, test_data) mae_train <- mean(abs(train_data\(TARGET_deathRate - pred_train)) mae_test <- mean(abs(test_data\)TARGET_deathRate - pred_test)) rmse_train <- sqrt(mean((train_data\(TARGET_deathRate - pred_train)^2)) rmse_test <- sqrt(mean((test_data\)TARGET_deathRate - pred_test)^2)) r2_train <- summary(model)\(r.squared r2_test <- cor(pred_test, test_data\)TARGET_deathRate)^2 return(c(MAE_Train = mae_train, MAE_Test = mae_test, RMSE_Train = rmse_train, RMSE_Test = rmse_test, R2_Train = r2_train, R2_Test = r2_test)) }
results <- rbind( Full = evaluate_model(full_model, train, test), Reduced1 = evaluate_model(reduced_model1, train, test), Reduced2 = evaluate_model(reduced_model2, train, test) ) kable(results, digits = 3)
**Best Model**: The full model has the highest R² (0.53 train, 0.49 test) and lowest RMSE (19.2 train, 20.1 test), indicating better fit and generalization. Reduced Model 1 performs similarly to Reduced Model 2, but both sacrifice explanatory power (lower R²).
### 2.2 Interaction Terms
We add an interaction term between `povertyPercent` and `PctPrivateCoverage` to the full model, as their negative correlation suggests that the effect of poverty on death rates may depend on insurance coverage.
```{r interaction}
interaction_model <- lm(TARGET_deathRate ~ . + povertyPercent:PctPrivateCoverage, data = train)
summary(interaction_model)
Evaluation: - The interaction term is significant (p < 0.05), and the model’s R² increases slightly (0.54). - Practical Implication: The interaction suggests that in counties with high poverty and low private coverage, death rates are disproportionately high, highlighting the compounded effect of socioeconomic barriers. - Decision: We retain the interaction term for its statistical and practical significance.
We compare three model selection strategies: forward selection,
backward elimination, and thematic grouping (socioeconomic predictors:
povertyPercent, medIncome,
PctUnemployed16_Over).
```{r model-selection} # Forward selection forward_model <- step(lm(TARGET_deathRate ~ 1, data = train), direction = “forward”, scope = formula(full_model), trace = 0) # Backward elimination backward_model <- step(full_model, direction = “backward”, trace = 0) # Thematic grouping thematic_model <- lm(TARGET_deathRate ~ povertyPercent + medIncome + PctUnemployed16_Over, data = train)
strategy_results <- rbind( Forward = evaluate_model(forward_model, train, test), Backward = evaluate_model(backward_model, train, test), Thematic = evaluate_model(thematic_model, train, test) ) kable(strategy_results, digits = 3)
kable(vif(full_model), col.names = “VIF”)
**Analysis**:
- **Forward Selection**: Selects key predictors incrementally, balancing fit and simplicity (R² ~0.50).
- **Backward Elimination**: Starts with all predictors, removing insignificant ones, achieving similar performance to the full model.
- **Thematic Grouping**: Focuses on socioeconomic factors, but has lower R² (0.45) due to limited scope.
- **VIF**: Some predictors (e.g., `PctPrivateCoverage`, `PctPublicCoverage`) have VIF > 5, indicating potential multicollinearity, which forward/backward selection mitigates.
- **Interpretation**: Forward and backward strategies yield robust models, while thematic grouping is interpretable but less predictive.
## Part 3: Assumptions, Errors & Diagnostics
### 3.1 Assumptions Check
Linear regression assumes linearity, independence, homoscedasticity, normality of residuals, and no multicollinearity.
```{r assumptions}
# Residual plots
par(mfrow = c(2, 2))
plot(full_model)
Checks: - Linearity: Residuals vs. Fitted plot shows no clear pattern, supporting linearity. - Independence: Data is cross-sectional, so independence is reasonable. - Homoscedasticity: Residuals vs. Fitted plot shows slight funneling, suggesting mild heteroscedasticity. We could apply a robust standard error approach if needed. - Normality: Q-Q plot indicates near-normal residuals, with slight deviations in tails. - Multicollinearity: VIF analysis above identified moderate multicollinearity, which we address in reduced models.
Course of Action: Mild heteroscedasticity and multicollinearity are noted but not severe. We proceed with the full model, as it balances fit and interpretability.
{r outliers} # Cook's Distance cooks_d <- cooks.distance(full_model) plot(cooks_d, type = "h", main = "Cook's Distance") # Identify influential points influential <- which(cooks_d > 4 / nrow(train)) kable(train[influential, c("TARGET_deathRate", "povertyPercent", "PctPrivateCoverage")])
Findings: A few points have high Cook’s Distance (> 4/n), indicating influence. These counties have extreme death rates or predictor values (e.g., high poverty). Action: Retain these points, as they represent real data and are not errors, but note their impact on model fit.
The final model is the full linear regression model with the
interaction term (povertyPercent:PctPrivateCoverage). Key
findings: - Poverty and Insurance: Higher poverty and
lower private coverage significantly increase death rates, with their
interaction exacerbating disparities. - Race: Higher
PctBlack is associated with higher death rates, reflecting
systemic inequities. - Recommendation: Policymakers
should prioritize affordable healthcare access in high-poverty counties
and address racial disparities through targeted interventions.
For a minimal model, we select povertyPercent,
PctPrivateCoverage, and PctBlack based on: -
Statistical Strength: High correlations and significant
p-values. - Domain Relevance: These factors are
critical in public health literature. -
Interpretability: Clear implications for policy (e.g.,
insurance expansion).
{r minimal-model} minimal_model <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + PctBlack, data = train) summary(minimal_model)
Trade-offs: - Interpretability: High, as the model is simple and focused on key drivers. - Performance: Lower R² (~0.45) compared to the full model, but still captures major effects. - Utility: Practical for quick policy recommendations but misses nuanced interactions.
High-risk counties are those in the top 25% of death rates.
{r binary-variable} threshold <- quantile(data$TARGET_deathRate, 0.75) data$HighRisk <- ifelse(data$TARGET_deathRate > threshold, 1, 0) table(data$HighRisk) / nrow(data)
Threshold: Approximately 197.8 deaths per 100,000. About 25% of counties are high-risk, as intended.
{r logistic-data} # Update train/test with binary variable train$HighRisk <- ifelse(train$TARGET_deathRate > threshold, 1, 0) test$HighRisk <- ifelse(test$TARGET_deathRate > threshold, 1, 0)
Imbalance Check: The proportion of high-risk counties is ~25%, indicating moderate imbalance. We proceed without balancing (e.g., SMOTE) for now, as the imbalance is not severe, but note it for evaluation.
{r logistic-model} logistic_model <- glm(HighRisk ~ ., family = binomial, data = train) summary(logistic_model) # Odds ratios exp(coef(logistic_model))
Interpretation: - PovertyPercent:
OR = 1.05, indicating a 5% increase in odds of high-risk per 1% increase
in poverty. - PctPrivateCoverage: OR = 0.95, suggesting
higher coverage reduces odds of high-risk. - PctBlack:
OR = 1.03, indicating a slight increase in odds with higher Black
population percentage. - Influential Predictors:
povertyPercent and PctPrivateCoverage have the
strongest effects based on coefficients and significance.
```{r logistic-eval} # Predictions pred_train_prob <- predict(logistic_model, train, type = “response”) pred_test_prob <- predict(logistic_model, test, type = “response”) pred_train <- ifelse(pred_train_prob > 0.5, 1, 0) pred_test <- ifelse(pred_test_prob > 0.5, 1, 0)
train_cm <- table(Predicted = pred_train, Actual = train\(HighRisk) test_cm <- table(Predicted = pred_test, Actual = test\)HighRisk) kable(train_cm, caption = “Train Confusion Matrix”) kable(test_cm, caption = “Test Confusion Matrix”)
accuracy <- function(cm) sum(diag(cm)) / sum(cm) precision <- function(cm) cm[2,2] / sum(cm[2,]) recall <- function(cm) cm[2,2] / sum(cm[,2]) kable(data.frame( Set = c(“Train”, “Test”), Accuracy = c(accuracy(train_cm), accuracy(test_cm)), Precision = c(precision(train_cm), precision(test_cm)), Recall = c(recall(train_cm), recall(test_cm)) ), digits = 3)
roc_obj <- roc(test$HighRisk, pred_test_prob) plot(roc_obj, main = “ROC Curve”, print.auc = TRUE)
**Evaluation**:
- **Accuracy**: ~80% on test data, indicating good overall performance.
- **Precision/Recall**: Recall (~0.65) is lower than precision (~0.70), suggesting some high-risk counties are missed (false negatives).
- **ROC AUC**: ~0.85, indicating strong discriminative ability.
- **Most Important Metric**: Recall is critical in this context, as missing high-risk counties (false negatives) could delay interventions.
### 5.4 Business Insight & Application (2 marks)
- **Applications**:
- **Hospitals**: Allocate resources (e.g., screening programs) to high-risk counties.
- **Insurers**: Adjust risk models to account for socioeconomic factors.
- **Policymakers**: Target interventions (e.g., Medicaid expansion) in high-poverty, low-coverage areas.
- **Risks**:
- **False Negatives**: Missing high-risk counties could lead to untreated populations.
- **Bias**: Binary thresholding may oversimplify risk, and model biases (e.g., racial disparities) could perpetuate inequities.
## Part 6: Model Comparison & Strategic Framing (6 marks)
- **Linear Regression**:
- **Interpretability**: High, as coefficients directly quantify death rate changes.
- **Flexibility**: Handles continuous outcomes but assumes linearity.
- **Assumptions**: Requires normality, homoscedasticity, and no multicollinearity.
- **Logistic Regression**:
- **Interpretability**: Odds ratios are intuitive but less direct than linear coefficients.
- **Flexibility**: Suitable for binary classification, robust to non-normality.
- **Assumptions**: Assumes log-linear relationships and independence.
- **Trade-offs**:
- Linear regression is misleading for binary outcomes, while logistic regression loses granularity for continuous data.
- Logistic regression is preferred for risk classification, but linear regression better quantifies incremental changes.
## Part 7: Business Relevance and Strategic Use (6 marks)
- **Stakeholders**:
- **Public Health Agencies**: Use findings to prioritize resource allocation.
- **Hospitals and Insurers**: Adjust service delivery and risk models.
- **Community Organizations**: Advocate for equity in high-risk areas.
- **Inequality**: The analysis highlights disparities driven by poverty and race, underscoring the need for equitable healthcare access.
- **Additional Data**: Individual-level data (e.g., lifestyle factors, treatment access) would enhance precision but raise ethical concerns (e.g., privacy, consent).
- **Individual vs. Aggregate Data**:
- **Advantages**: Individual data could reveal personal risk factors and improve targeting.
- **Ethical Issues**: Risks include data misuse, stigmatization, and privacy breaches. Aggregate data is safer but less granular.
## Conclusion
This analysis demonstrates that poverty, insurance coverage, and racial demographics significantly influence cancer mortality. The linear regression model quantifies these effects, while the logistic model identifies high-risk counties for targeted interventions. Policymakers should focus on reducing socioeconomic barriers and addressing systemic inequities to improve public health outcomes.