Data Selection and Preparation

Data source and origin

Dataset name: Breast Cancer Wisconsin (Diagnostic)

Release date: 1993

Authors: W. Street, W. Wolberg, O. Mangasarian

Source: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Data Import
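The code in this report assumes the following packages are attached; a minimal setup block (the exact set follows from the functions used below):

library(dplyr)      # %>%, select(), where()
library(tidyr)      # pivot_longer()
library(ggplot2)    # histograms, correlation heatmap, PCA scatterplot
library(car)        # leveneTest(), vif()
library(pROC)       # roc(), auc(), coords()
library(rpart)      # decision tree fitting
library(rpart.plot) # rpart.plot()
library(caret)      # varImp()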

df <- read.csv("wdbc.data", header = FALSE)

colnames(df) <- c(
  "id", "diagnosis",
  "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
  "smoothness_mean", "compactness_mean", "concavity_mean",
  "concave.points_mean", "symmetry_mean", "fractal_dimension_mean",
  "radius_se", "texture_se", "perimeter_se", "area_se",
  "smoothness_se", "compactness_se", "concavity_se",
  "concave.points_se", "symmetry_se", "fractal_dimension_se",
  "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
  "smoothness_worst", "compactness_worst", "concavity_worst",
  "concave.points_worst", "symmetry_worst", "fractal_dimension_worst"
)

Clean the dataset: drop the id column and convert diagnosis into a factor.

# Drop the non-informative id column
df <- df %>%
  select(-id)

df$diagnosis <- factor(df$diagnosis, levels = c("B", "M"))

table(df$diagnosis)
## 
##   B   M 
## 357 212

Histograms of some key variables.

numeric_cols <- df %>%
  select(where(is.numeric)) %>%
  colnames()

df %>%
  select(radius_mean, texture_mean, perimeter_mean, area_mean,
         smoothness_mean, compactness_mean) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 25) +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()

Correlations

mean_vars <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean",
               "smoothness_mean", "compactness_mean", "concavity_mean",
               "concave.points_mean", "symmetry_mean", "fractal_dimension_mean")

corr_mat <- cor(df[, mean_vars])

corr_df <- as.data.frame(as.table(corr_mat))

ggplot(corr_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "", fill = "Correlation",
       title = "Correlation matrix of mean features")

Parametric Statistics

Use logistic regression to model the probability that a tumor is malignant (M) based on its morphological characteristics.

For the parametric analysis, I use logistic regression, since the outcome of interest (the diagnosis) is binary: each tumor is either benign (B) or malignant (M). Logistic regression models the log-odds of malignancy as a linear combination of the predictors and is the standard parametric method for binary classification problems. In this context, the model lets me estimate how changes in morphological characteristics are associated with a higher or lower probability of a tumor being malignant.
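Concretely, writing p = P(diagnosis = M), the model fitted below is

logit(p) = log(p / (1 - p)) = beta_0 + beta_1 * radius_mean + beta_2 * texture_mean + ... + beta_10 * fractal_dimension_mean,

so each coefficient beta_j is the change in the log-odds of malignancy per one-unit increase in the corresponding predictor.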

Outlier detection

iqr_outliers <- function(x, k = 1.5) {
  # Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  lower <- q1 - k * iqr
  upper <- q3 + k * iqr
  (x < lower) | (x > upper)
}

out_radius <- iqr_outliers(df$radius_mean)
out_area <- iqr_outliers(df$area_mean)

sum(out_radius | out_area)
## [1] 25
df_clean <- df[!(out_radius | out_area), ]
dim(df)
## [1] 569  31
dim(df_clean)
## [1] 544  31

Applying the interquartile range (IQR) rule to radius_mean and area_mean flags 25 observations, approximately 4.4% of the dataset, as potential outliers. These cases correspond to tumors with an extremely large radius and area compared to the rest of the sample. To reduce the influence of these extreme values on the parameter estimates, I removed the 25 observations and performed all subsequent analyses on the cleaned dataset (df_clean), which contains 544 tumors and 31 variables.

Normality and homoscedasticity of relevant variables

shapiro.test(df_clean$radius_mean)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_clean$radius_mean
## W = 0.96064, p-value = 6.97e-11
car::leveneTest(radius_mean ~ diagnosis, data = df_clean)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   1  49.373 6.382e-12 ***
##       542                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Shapiro-Wilk test for radius_mean yields a very small p-value (about 7e-11), indicating that this predictor does not follow a normal distribution. Levene's test is likewise highly significant, showing that the variance of radius_mean differs between the benign and malignant groups.

However, in logistic regression, predictors are not required to be normally distributed or to have equal variances across outcome categories. What is important is that the logit model is well-specified and that there is no perfect separation. Therefore, these deviations from normality and homoscedasticity are not critical, although they should be considered when interpreting the results.
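If a direct group comparison were still wanted despite these violations, a rank-based test avoids the normality assumption altogether; a minimal sketch:

# Wilcoxon rank-sum test: compares the radius_mean distributions of
# benign vs. malignant tumors without assuming normality
wilcox.test(radius_mean ~ diagnosis, data = df_clean)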

Collinearity (VIF)

features_logit <- mean_vars 

logit_data <- df_clean %>%
  select(diagnosis, all_of(features_logit))

# Auxiliary linear model with a 0/1 outcome, used only to compute VIFs
# (VIF depends on the predictors, not on the response)
lm_aux <- lm(as.numeric(diagnosis) - 1 ~ ., data = logit_data)
vif(lm_aux)
##            radius_mean           texture_mean         perimeter_mean 
##            1367.833052               1.189109            1538.956093 
##              area_mean        smoothness_mean       compactness_mean 
##              85.126357               3.034723              21.239903 
##         concavity_mean    concave.points_mean          symmetry_mean 
##               9.670277              17.415204               1.754517 
## fractal_dimension_mean 
##               6.671479

The VIF values reveal very strong multicollinearity among several of the mean variables, especially radius_mean, perimeter_mean, and area_mean, whose VIFs are far above 10. This makes sense because they all describe tumor size. Although high multicollinearity does not necessarily worsen predictive ability, it can destabilize the coefficient estimates and prevent a clear interpretation of each individual effect.

In this report, I retain all of the mean variables in the logistic regression and focus on overall predictive performance, interpreting the coefficients cautiously. I also use nonparametric methods (a decision tree and PCA) to summarize and visualize the main patterns without relying on independent linear predictors.

Train/test split

set.seed(123)
n <- nrow(logit_data)
train_idx <- sample(seq_len(n), size = 0.8 * n)

train_logit <- logit_data[train_idx, ]
test_logit <- logit_data[-train_idx, ]

prop.table(table(train_logit$diagnosis))
## 
##         B         M 
## 0.6413793 0.3586207
prop.table(table(test_logit$diagnosis))
## 
##         B         M 
## 0.7155963 0.2844037
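The benign/malignant proportions differ between the two subsets because sample() draws rows without stratification. If matched class proportions were preferred, a stratified split is a standard alternative; a sketch using caret::createDataPartition() (the strat_* names are mine):

set.seed(123)
# createDataPartition() samples within each diagnosis level, so the
# B/M ratio is (approximately) preserved in both subsets
strat_idx <- caret::createDataPartition(logit_data$diagnosis, p = 0.8, list = FALSE)
train_strat <- logit_data[strat_idx, ]
test_strat <- logit_data[-strat_idx, ]
prop.table(table(train_strat$diagnosis))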

Fit logistic regression model

logit_fit <- glm(diagnosis ~ ., data = train_logit, family = binomial)

summary(logit_fit)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_logit)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -18.89195   13.85452  -1.364   0.1727    
## radius_mean              3.17289    3.97726   0.798   0.4250    
## texture_mean             0.35747    0.06924   5.163 2.43e-07 ***
## perimeter_mean          -0.80022    0.53899  -1.485   0.1376    
## area_mean                0.03600    0.01600   2.251   0.0244 *  
## smoothness_mean         76.03380   34.71683   2.190   0.0285 *  
## compactness_mean       -26.28354   22.02873  -1.193   0.2328    
## concavity_mean          33.69279   13.18361   2.556   0.0106 *  
## concave.points_mean     69.05215   34.24559   2.016   0.0438 *  
## symmetry_mean           21.68266   11.43845   1.896   0.0580 .  
## fractal_dimension_mean  48.76549  101.02003   0.483   0.6293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 567.78  on 434  degrees of freedom
## Residual deviance: 122.60  on 424  degrees of freedom
## AIC: 144.6
## 
## Number of Fisher Scoring iterations: 8
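Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, i.e. the multiplicative change in the odds of malignancy per one-unit increase in a predictor (the predictors have very different scales, so the magnitudes are not directly comparable across variables):

# Odds ratios implied by the fitted coefficients
round(exp(coef(logit_fit)), 3)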

Probabilities of being malignant (M)

Classification with a threshold of 0.5

prob_test <- predict(logit_fit, newdata = test_logit, type = "response")

pred_test <- ifelse(prob_test > 0.5, "M", "B")
pred_test <- factor(pred_test, levels = levels(test_logit$diagnosis))

conf_mat <- table(Actual = test_logit$diagnosis, Predicted = pred_test)
conf_mat
##       Predicted
## Actual  B  M
##      B 73  5
##      M  1 30
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
## [1] 0.9449541
roc_obj <- roc(test_logit$diagnosis, prob_test, levels = c("B", "M"))
## Setting direction: controls < cases
auc(roc_obj)
## Area under the curve: 0.983
# McFadden's pseudo-R-squared
pseudo_R2 <- 1 - logit_fit$deviance / logit_fit$null.deviance
pseudo_R2
## [1] 0.7840678
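The 0.5 cutoff is conventional rather than optimal; since false negatives are clinically costlier than false positives here, it can be worth inspecting the curve and the threshold that maximizes sensitivity + specificity (Youden's index). A sketch with pROC:

plot(roc_obj, main = "ROC curve - logistic regression")
# Threshold maximizing Youden's J, with the resulting sensitivity/specificity
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))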

The logistic regression model performs very well, correctly predicting tumor type in approximately 94.5% of the test cases, with few errors according to the confusion matrix. The AUC of 0.983 and McFadden's pseudo-R² of nearly 0.78 indicate excellent ability to distinguish between benign and malignant tumors and a good overall fit. Higher values of texture_mean, area_mean, smoothness_mean, concavity_mean, and concave.points_mean are associated with higher odds of malignancy, consistent with clinical intuition about larger and more irregular tumors.

Non-Parametric Statistics

Decision Tree analysis

tree_data <- df_clean %>%
  select(diagnosis, all_of(features_logit))

set.seed(123)
n2 <- nrow(tree_data)
train_idx2 <- sample(seq_len(n2), size = 0.8 * n2)

train_tree <- tree_data[train_idx2, ]
test_tree <- tree_data[-train_idx2, ]

tree_fit <- rpart(diagnosis ~ ., data = train_tree,
                  method = "class",
                  control = rpart.control(maxdepth = 4, cp = 0.01))

rpart.plot(tree_fit, main = "Decision Tree for Breast Cancer Diagnosis")

tree_pred <- predict(tree_fit, newdata = test_tree, type = "class")

conf_mat_tree <- table(Actual = test_tree$diagnosis, Predicted = tree_pred)
conf_mat_tree
##       Predicted
## Actual  B  M
##      B 71  7
##      M  4 27
tree_acc <- sum(diag(conf_mat_tree)) / sum(conf_mat_tree)
tree_acc
## [1] 0.8990826

The decision tree achieves an accuracy of approximately 89.9% on the test set, slightly lower than logistic regression, but it offers highly interpretable rules. The confusion matrix shows that most malignant tumors are correctly classified, with few false negatives and false positives. The most important variables are area_mean, concave.points_mean, concavity_mean, perimeter_mean, and radius_mean, all related to tumor size and irregularity. The first splits of the tree use thresholds on area_mean and the concavity measures, reinforcing the conclusion that larger tumors with more pronounced concavities are more likely to be malignant.

varImp(tree_fit)
##                          Overall
## area_mean              138.52217
## concave.points_mean    135.97780
## concavity_mean         115.32931
## perimeter_mean         130.47532
## radius_mean            133.95028
## texture_mean            10.02694
## smoothness_mean          0.00000
## compactness_mean         0.00000
## symmetry_mean            0.00000
## fractal_dimension_mean   0.00000

Principal Component Analysis (PCA)

pca_data <- df_clean[, mean_vars]

# prcomp() centers and scales the data itself, so no separate scale() step is needed
pca_fit <- prcomp(pca_data, center = TRUE, scale. = TRUE)

summary(pca_fit)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2931 1.6343 0.94370 0.71790 0.62974 0.37333 0.28044
## Proportion of Variance 0.5258 0.2671 0.08906 0.05154 0.03966 0.01394 0.00786
## Cumulative Proportion  0.5258 0.7929 0.88197 0.93351 0.97316 0.98710 0.99497
##                            PC8     PC9    PC10
## Standard deviation     0.20665 0.08531 0.01868
## Proportion of Variance 0.00427 0.00073 0.00003
## Cumulative Proportion  0.99924 0.99997 1.00000
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cum_var <- cumsum(var_explained)

pca_var_df <- data.frame(
  PC = seq_along(var_explained),
  VarExplained = var_explained,
  CumVar = cum_var
)

pca_var_df
##    PC VarExplained    CumVar
## 1   1 5.258272e-01 0.5258272
## 2   2 2.670850e-01 0.7929122
## 3   3 8.905705e-02 0.8819692
## 4   4 5.153808e-02 0.9335073
## 5   5 3.965744e-02 0.9731647
## 6   6 1.393760e-02 0.9871023
## 7   7 7.864414e-03 0.9949668
## 8   8 4.270530e-03 0.9992373
## 9   9 7.278190e-04 0.9999651
## 10 10 3.489137e-05 1.0000000
plot(cum_var, type = "b",
     xlab = "Number of principal components",
     ylab = "Cumulative explained variance",
     main = "PCA – Cumulative explained variance")
abline(h = 0.8, col = "red", lty = 2)  # 80% reference line

loadings <- pca_fit$rotation[, 1:2]
loadings
##                                PC1         PC2
## radius_mean            -0.35631655  0.33780335
## texture_mean           -0.15189037  0.14501832
## perimeter_mean         -0.37186173  0.30611386
## area_mean              -0.35814323  0.33093525
## smoothness_mean        -0.22857339 -0.39246414
## compactness_mean       -0.36939636 -0.26545809
## concavity_mean         -0.39455302 -0.12644702
## concave.points_mean    -0.42407619 -0.02405019
## symmetry_mean          -0.22556081 -0.33880112
## fractal_dimension_mean -0.09127033 -0.55297847
pca_scores <- as.data.frame(pca_fit$x[, 1:2])
pca_scores$diagnosis <- df_clean$diagnosis

ggplot(pca_scores, aes(x = PC1, y = PC2, color = diagnosis)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Patients projected on the first two principal components")

The PCA shows that PC1 explains approximately 52.6% of the variance and that two to three components capture roughly 80–90% of the total variability; that is, the 10 variables are well summarized in a low-dimensional space. PC1 loads most heavily on the size and irregularity measures (radius, perimeter, area, concavity, concave points), while PC2 contrasts the size measures with fractal dimension, smoothness, and symmetry, so it mainly reflects border complexity relative to size. When the observations are projected onto these two components, benign and malignant tumors separate clearly, indicating that a few latent dimensions are sufficient to distinguish them.
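Because principal components are uncorrelated by construction, the scores also provide a collinearity-free input for the logistic model, complementing the VIF discussion above; a minimal sketch using the first two scores (the pcr_fit name is mine):

# Logistic regression on orthogonal predictors: no multicollinearity by construction
pcr_fit <- glm(diagnosis ~ PC1 + PC2, family = binomial, data = pca_scores)
summary(pcr_fit)$coefficients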

Conclusion

Both the parametric and non-parametric methods paint a very consistent picture. Logistic regression demonstrates excellent performance (94.5% accuracy, AUC = 0.983, and a high pseudo-R²) and identifies tumor size and irregularity variables (area, radius, concavity, and concave points) as the main predictors. The decision tree, although somewhat less accurate (89.9%), generates easily interpretable rules based on these same characteristics, and PCA shows that 2–3 principal components, dominated by size and concavity, capture most of the variability and separate benign from malignant cases effectively.

The main limitations are the strong multicollinearity among the size measures and the reliance on a single dataset and imaging protocol. Despite this, all approaches agree that the morphological characteristics of the masses, especially those related to size and border irregularity, are very powerful predictors of malignancy in this context.