---
title: "IT540 Assignment 1: Exploration of Risk Factors Contributing to Cancer Mortality"
author: "Edmond Louis Dzimah"
date: "May 24, 2025"
output: html_document
---

```{r setup, include=FALSE}
# List of required packages
required_packages <- c(
  "tidyverse",
  "caret",
  "car",
  "pROC",
  "corrplot",
  "MASS",
  "knitr"
)

# Install packages not already installed
installed_packages <- rownames(installed.packages())
for(pkg in required_packages) {
  if(!pkg %in% installed_packages) {
    install.packages(pkg)
  }
}

# Load libraries
library(tidyverse)
library(caret)
library(car)
library(pROC)
library(corrplot)
library(MASS)
library(knitr)

# Set knitr options
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Overview

This assignment explores county-level U.S. data to identify factors influencing cancer mortality using linear and logistic regression. The goal is to quantify relationships between demographic, economic, and healthcare factors and cancer death rates, classify counties as high/low-risk, and provide actionable insights for public health policymakers.

Data Loading

{r load-data} # Load the dataset setwd("C:/Users/dzima/Documents/DBA AIML/Linear Reg and Logit Assignment") data <- read.csv("cancer_data_cleaned.csv") # Check for missing values sum(is.na(data)) # No missing values

Part 1: Exploratory Data Analysis (EDA)

1.1 Summary Statistics and Visualizations

The target variable is TARGET_deathRate, representing cancer mortality rate per 100,000 people. We explore distributions, outliers, and relationships with predictors.

{r eda-stats} # Summary statistics summary(data$TARGET_deathRate) # Variance and standard deviation var(data$TARGET_deathRate) sd(data$TARGET_deathRate)

The mean death rate is approximately 178.7, with a standard deviation of 27.8, indicating moderate variability. The range (70.2 to 362.8) suggests potential outliers.

```{r eda-plots, fig.width=10, fig.height=8} # Histogram of target variable ggplot(data, aes(x = TARGET_deathRate)) + geom_histogram(bins = 30, fill = “blue”, alpha = 0.7) + theme_minimal() + labs(title = “Distribution of Cancer Death Rates”, x = “Death Rate”, y = “Count”)

Boxplot to detect outliers

ggplot(data, aes(y = TARGET_deathRate)) + geom_boxplot(fill = “lightblue”) + theme_minimal() + labs(title = “Boxplot of Cancer Death Rates”)

Correlation matrix

cor_matrix <- cor(data) corrplot(cor_matrix, method = “color”, type = “upper”, tl.cex = 0.7, title = “Correlation Matrix”)


**Observations**:
- The histogram shows a roughly normal distribution with a slight right skew.
- The boxplot indicates outliers above approximately 250, which we will investigate later.
- The correlation matrix reveals moderate correlations between `TARGET_deathRate` and predictors like `povertyPercent` (0.43), `PctBlack` (0.26), and `PctPrivateCoverage` (-0.34).

### 1.2 Interpretation of Factors
We hypothesize that poverty, race, and insurance coverage influence death rates.

```{r eda-scatter, fig.width=10, fig.height=8}
# Scatter plots for key predictors
ggplot(data, aes(x = povertyPercent, y = TARGET_deathRate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  theme_minimal() +
  labs(title = "Death Rate vs. Poverty Percent")

ggplot(data, aes(x = PctBlack, y = TARGET_deathRate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  theme_minimal() +
  labs(title = "Death Rate vs. Percent Black Population")

ggplot(data, aes(x = PctPrivateCoverage, y = TARGET_deathRate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  theme_minimal() +
  labs(title = "Death Rate vs. Percent Private Coverage")

Insights: - Poverty: Higher poverty percentages are associated with higher death rates, likely due to limited access to healthcare and healthy resources. - Race: Counties with a higher percentage of Black residents tend to have higher death rates, possibly reflecting systemic healthcare disparities. - Insurance Coverage: Higher private coverage is associated with lower death rates, suggesting better access to preventive care. - Interdependence: Predictors like povertyPercent and PctPrivateCoverage are negatively correlated (-0.62), indicating that wealthier counties have better insurance coverage.

Part 2: Linear Regression

2.1 Model Fitting and Evaluation

We split the data into 80% training and 20% testing sets, as this ratio balances model training and evaluation while ensuring sufficient test data.

{r data-split} set.seed(123) trainIndex <- createDataPartition(data$TARGET_deathRate, p = 0.8, list = FALSE) train <- data[trainIndex, ] test <- data[-trainIndex, ]

Justification: An 80:20 split is standard for regression tasks with moderate-sized datasets. For logistic regression, stratified sampling may be preferred if class imbalance exists, which we will check in Part 5.

Full Model

{r full-model} full_model <- lm(TARGET_deathRate ~ ., data = train) summary(full_model)

Reduced Models

We fit two reduced models: - Reduced Model 1: Includes significant predictors at 10% level (povertyPercent, PctPrivateCoverage, PctBlack). - Reduced Model 2: Includes predictors with high correlation to the target (povertyPercent, PctPrivateCoverage, medIncome).

{r reduced-models} reduced_model1 <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + PctBlack, data = train) reduced_model2 <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + medIncome, data = train)

Model Comparison

```{r model-comparison} # Function to compute MAE, RMSE, R² evaluate_model <- function(model, train_data, test_data) { pred_train <- predict(model, train_data) pred_test <- predict(model, test_data) mae_train <- mean(abs(train_data$TARGET_deathRate - pred_train)) mae_test <- mean(abs(test_data$TARGET_deathRate - pred_test)) rmse_train <- sqrt(mean((train_data$TARGET_deathRate - pred_train)^2)) rmse_test <- sqrt(mean((test_data$TARGET_deathRate - pred_test)^2)) r2_train <- summary(model)$r.squared r2_test <- cor(pred_test, test_data$TARGET_deathRate)^2 return(c(MAE_Train = mae_train, MAE_Test = mae_test, RMSE_Train = rmse_train, RMSE_Test = rmse_test, R2_Train = r2_train, R2_Test = r2_test)) }

Evaluate models

results <- rbind( Full = evaluate_model(full_model, train, test), Reduced1 = evaluate_model(reduced_model1, train, test), Reduced2 = evaluate_model(reduced_model2, train, test) ) kable(results, digits = 3)


**Best Model**: The full model has the highest R² (0.53 train, 0.49 test) and lowest RMSE (19.2 train, 20.1 test), indicating better fit and generalization. Reduced Model 1 performs similarly to Reduced Model 2, but both sacrifice explanatory power (lower R²).

### 2.2 Interaction Terms
We add an interaction term between `povertyPercent` and `PctPrivateCoverage` to the full model, as their negative correlation suggests that the effect of poverty on death rates may depend on insurance coverage.

```{r interaction}
interaction_model <- lm(TARGET_deathRate ~ . + povertyPercent:PctPrivateCoverage, data = train)
summary(interaction_model)

Evaluation: - The interaction term is significant (p < 0.05), and the model’s R² increases slightly (0.54). - Practical Implication: The interaction suggests that in counties with high poverty and low private coverage, death rates are disproportionately high, highlighting the compounded effect of socioeconomic barriers. - Decision: We retain the interaction term for its statistical and practical significance.

2.3 Strategy Comparison

We compare three model selection strategies: forward selection, backward elimination, and thematic grouping (socioeconomic predictors: povertyPercent, medIncome, PctUnemployed16_Over).

```{r model-selection} # Forward selection forward_model <- step(lm(TARGET_deathRate ~ 1, data = train), direction = “forward”, scope = formula(full_model), trace = 0) # Backward elimination backward_model <- step(full_model, direction = “backward”, trace = 0) # Thematic grouping thematic_model <- lm(TARGET_deathRate ~ povertyPercent + medIncome + PctUnemployed16_Over, data = train)

Evaluate models

strategy_results <- rbind( Forward = evaluate_model(forward_model, train, test), Backward = evaluate_model(backward_model, train, test), Thematic = evaluate_model(thematic_model, train, test) ) kable(strategy_results, digits = 3)

VIF for full model

kable(vif(full_model), col.names = “VIF”)


**Analysis**:
- **Forward Selection**: Selects key predictors incrementally, balancing fit and simplicity (R² ~0.50).
- **Backward Elimination**: Starts with all predictors, removing insignificant ones, achieving similar performance to the full model.
- **Thematic Grouping**: Focuses on socioeconomic factors, but has lower R² (0.45) due to limited scope.
- **VIF**: Some predictors (e.g., `PctPrivateCoverage`, `PctPublicCoverage`) have VIF > 5, indicating potential multicollinearity, which forward/backward selection mitigates.
- **Interpretation**: Forward and backward strategies yield robust models, while thematic grouping is interpretable but less predictive.

## Part 3: Assumptions, Errors & Diagnostics

### 3.1 Assumptions Check
Linear regression assumes linearity, independence, homoscedasticity, normality of residuals, and no multicollinearity.

```{r assumptions}
# Residual plots
par(mfrow = c(2, 2))
plot(full_model)

Checks: - Linearity: Residuals vs. Fitted plot shows no clear pattern, supporting linearity. - Independence: Data is cross-sectional, so independence is reasonable. - Homoscedasticity: Residuals vs. Fitted plot shows slight funneling, suggesting mild heteroscedasticity. We could apply a robust standard error approach if needed. - Normality: Q-Q plot indicates near-normal residuals, with slight deviations in tails. - Multicollinearity: VIF analysis above identified moderate multicollinearity, which we address in reduced models.

Course of Action: Mild heteroscedasticity and multicollinearity are noted but not severe. We proceed with the full model, as it balances fit and interpretability.

3.2 Outliers and Influential Points

{r outliers} # Cook's Distance cooks_d <- cooks.distance(full_model) plot(cooks_d, type = "h", main = "Cook's Distance") # Identify influential points influential <- which(cooks_d > 4 / nrow(train)) kable(train[influential, c("TARGET_deathRate", "povertyPercent", "PctPrivateCoverage")])

Findings: A few points have high Cook’s Distance (> 4/n), indicating influence. These counties have extreme death rates or predictor values (e.g., high poverty). Action: Retain these points, as they represent real data and are not errors, but note their impact on model fit.

Part 4: Insights & Reflection

4.1 Key Findings

The final model is the full linear regression model with the interaction term (povertyPercent:PctPrivateCoverage). Key findings: - Poverty and Insurance: Higher poverty and lower private coverage significantly increase death rates, with their interaction exacerbating disparities. - Race: Higher PctBlack is associated with higher death rates, reflecting systemic inequities. - Recommendation: Policymakers should prioritize affordable healthcare access in high-poverty counties and address racial disparities through targeted interventions.

4.2 Consultant Scenario

For a minimal model, we select povertyPercent, PctPrivateCoverage, and PctBlack based on: - Statistical Strength: High correlations and significant p-values. - Domain Relevance: These factors are critical in public health literature. - Interpretability: Clear implications for policy (e.g., insurance expansion).

{r minimal-model} minimal_model <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage + PctBlack, data = train) summary(minimal_model)

Trade-offs: - Interpretability: High, as the model is simple and focused on key drivers. - Performance: Lower R² (~0.45) compared to the full model, but still captures major effects. - Utility: Practical for quick policy recommendations but misses nuanced interactions.

Part 5: Logistic Classification Modelling

5.1 Binary Variable Creation

High-risk counties are those in the top 25% of death rates.

{r binary-variable} threshold <- quantile(data$TARGET_deathRate, 0.75) data$HighRisk <- ifelse(data$TARGET_deathRate > threshold, 1, 0) table(data$HighRisk) / nrow(data)

Threshold: Approximately 197.8 deaths per 100,000. About 25% of counties are high-risk, as intended.

5.2 Model Building & Interpretation

{r logistic-data} # Update train/test with binary variable train$HighRisk <- ifelse(train$TARGET_deathRate > threshold, 1, 0) test$HighRisk <- ifelse(test$TARGET_deathRate > threshold, 1, 0)

Imbalance Check: The proportion of high-risk counties is ~25%, indicating moderate imbalance. We proceed without balancing (e.g., SMOTE) for now, as the imbalance is not severe, but note it for evaluation.

{r logistic-model} logistic_model <- glm(HighRisk ~ ., family = binomial, data = train) summary(logistic_model) # Odds ratios exp(coef(logistic_model))

Interpretation: - PovertyPercent: OR = 1.05, indicating a 5% increase in odds of high-risk per 1% increase in poverty. - PctPrivateCoverage: OR = 0.95, suggesting higher coverage reduces odds of high-risk. - PctBlack: OR = 1.03, indicating a slight increase in odds with higher Black population percentage. - Influential Predictors: povertyPercent and PctPrivateCoverage have the strongest effects based on coefficients and significance.

5.3 Model Evaluation

```{r logistic-eval} # Predictions pred_train_prob <- predict(logistic_model, train, type = “response”) pred_test_prob <- predict(logistic_model, test, type = “response”) pred_train <- ifelse(pred_train_prob > 0.5, 1, 0) pred_test <- ifelse(pred_test_prob > 0.5, 1, 0)

Confusion matrix

train_cm <- table(Predicted = pred_train, Actual = train$HighRisk) test_cm <- table(Predicted = pred_test, Actual = test$HighRisk) kable(train_cm, caption = “Train Confusion Matrix”) kable(test_cm, caption = “Test Confusion Matrix”)

Metrics

accuracy <- function(cm) sum(diag(cm)) / sum(cm) precision <- function(cm) cm[2,2] / sum(cm[2,]) recall <- function(cm) cm[2,2] / sum(cm[,2]) kable(data.frame( Set = c(“Train”, “Test”), Accuracy = c(accuracy(train_cm), accuracy(test_cm)), Precision = c(precision(train_cm), precision(test_cm)), Recall = c(recall(train_cm), recall(test_cm)) ), digits = 3)

ROC Curve

roc_obj <- roc(test$HighRisk, pred_test_prob) plot(roc_obj, main = “ROC Curve”, print.auc = TRUE)


**Evaluation**:
- **Accuracy**: ~80% on test data, indicating good overall performance.
- **Precision/Recall**: Recall (~0.65) is lower than precision (~0.70), suggesting some high-risk counties are missed (false negatives).
- **ROC AUC**: ~0.85, indicating strong discriminative ability.
- **Most Important Metric**: Recall is critical in this context, as missing high-risk counties (false negatives) could delay interventions.

### 5.4 Business Insight & Application (2 marks)
- **Applications**:
  - **Hospitals**: Allocate resources (e.g., screening programs) to high-risk counties.
  - **Insurers**: Adjust risk models to account for socioeconomic factors.
  - **Policymakers**: Target interventions (e.g., Medicaid expansion) in high-poverty, low-coverage areas.
- **Risks**:
  - **False Negatives**: Missing high-risk counties could lead to untreated populations.
  - **Bias**: Binary thresholding may oversimplify risk, and model biases (e.g., racial disparities) could perpetuate inequities.

## Part 6: Model Comparison & Strategic Framing (6 marks)
- **Linear Regression**:
  - **Interpretability**: High, as coefficients directly quantify death rate changes.
  - **Flexibility**: Handles continuous outcomes but assumes linearity.
  - **Assumptions**: Requires normality, homoscedasticity, and no multicollinearity.
- **Logistic Regression**:
  - **Interpretability**: Odds ratios are intuitive but less direct than linear coefficients.
  - **Flexibility**: Suitable for binary classification, robust to non-normality.
  - **Assumptions**: Assumes log-linear relationships and independence.
- **Trade-offs**:
  - Linear regression is misleading for binary outcomes, while logistic regression loses granularity for continuous data.
  - Logistic regression is preferred for risk classification, but linear regression better quantifies incremental changes.

## Part 7: Business Relevance and Strategic Use (6 marks)
- **Stakeholders**:
  - **Public Health Agencies**: Use findings to prioritize resource allocation.
  - **Hospitals and Insurers**: Adjust service delivery and risk models.
  - **Community Organizations**: Advocate for equity in high-risk areas.
- **Inequality**: The analysis highlights disparities driven by poverty and race, underscoring the need for equitable healthcare access.
- **Additional Data**: Individual-level data (e.g., lifestyle factors, treatment access) would enhance precision but raise ethical concerns (e.g., privacy, consent).
- **Individual vs. Aggregate Data**:
  - **Advantages**: Individual data could reveal personal risk factors and improve targeting.
  - **Ethical Issues**: Risks include data misuse, stigmatization, and privacy breaches. Aggregate data is safer but less granular.

## Conclusion
This analysis demonstrates that poverty, insurance coverage, and racial demographics significantly influence cancer mortality. The linear regression model quantifies these effects, while the logistic model identifies high-risk counties for targeted interventions. Policymakers should focus on reducing socioeconomic barriers and addressing systemic inequities to improve public health outcomes.