Exercise 1: Conceptual Questions

  1. True or False? Principal component analysis is a predictive modeling technique like linear regression, LDA, or KNN.

  2. True or False? Technically speaking, PCA should not be applied to categorical variables.

  3. An analyst conducts PCA on continuous variables 1 through 20 and settles on reducing the data down to 4 PCs. The analyst then proceeds to conduct linear regression using the 4 PCs as predictors and variable 1 as the response variable. Why is this a horrible idea?

  4. Why is it important to conduct PCA on standardized variables (aka using the correlation matrix)?

ANSWERS EXERCISE 1:

Exercise 2: Data Reduction and Interpretation

Consider the baseball data set below. The data set is quite comprehensive; for this analysis we are just going to examine the data from the year 2016. A quick summary is below:

library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.3
data(Batting)
index<-which(Batting$yearID==2016)
Bat16<-Batting[index,]
summary(Bat16)
##    playerID             yearID         stint           teamID     lgID    
##  Length:1483        Min.   :2016   Min.   :1.000   ATL    :  60   AA:  0  
##  Class :character   1st Qu.:2016   1st Qu.:1.000   SDN    :  58   AL:734  
##  Mode  :character   Median :2016   Median :1.000   LAN    :  55   FL:  0  
##                     Mean   :2016   Mean   :1.092   PIT    :  55   NA:  0  
##                     3rd Qu.:2016   3rd Qu.:1.000   SEA    :  54   NL:749  
##                     Max.   :2016   Max.   :4.000   LAA    :  53   PL:  0  
##                                                    (Other):1148   UA:  0  
##        G                AB              R                H         
##  Min.   :  1.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 13.00   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 31.00   Median : 11.0   Median :  1.00   Median :  1.00  
##  Mean   : 47.51   Mean   :111.6   Mean   : 14.66   Mean   : 28.51  
##  3rd Qu.: 69.00   3rd Qu.:155.0   3rd Qu.: 18.00   3rd Qu.: 36.00  
##  Max.   :162.00   Max.   :672.0   Max.   :123.00   Max.   :216.00  
##                                                                    
##       X2B              X3B                HR              RBI        
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.:  0.00  
##  Median : 0.000   Median : 0.0000   Median : 0.000   Median :  0.00  
##  Mean   : 5.566   Mean   : 0.5887   Mean   : 3.783   Mean   : 13.99  
##  3rd Qu.: 7.000   3rd Qu.: 0.0000   3rd Qu.: 3.000   3rd Qu.: 15.50  
##  Max.   :48.000   Max.   :11.0000   Max.   :47.000   Max.   :133.00  
##                                                                      
##        SB               CS               BB               SO        
##  Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 0.000   Median : 0.000   Median :  0.00   Median :  4.00  
##  Mean   : 1.711   Mean   : 0.675   Mean   : 10.17   Mean   : 26.29  
##  3rd Qu.: 1.000   3rd Qu.: 0.000   3rd Qu.: 13.00   3rd Qu.: 38.00  
##  Max.   :62.000   Max.   :18.000   Max.   :116.00   Max.   :219.00  
##                                                                     
##       IBB               HBP               SH                SF         
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.6285   Mean   : 1.113   Mean   : 0.6912   Mean   : 0.8186  
##  3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.: 1.0000   3rd Qu.: 1.0000  
##  Max.   :20.0000   Max.   :24.000   Max.   :13.0000   Max.   :15.0000  
##                                                                        
##       GIDP       
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 0.000  
##  Mean   : 2.508  
##  3rd Qu.: 3.000  
##  Max.   :26.000  
## 
  1. Conduct a PCA on the entire data set with the exception of the first 5 columns. Provide a scree plot and determine the number of PCs needed to retain approximately 90% of the total variation in the data set.

  2. Provide the eigenvector matrix and examine the coefficients. Verify that PC1 is essentially an average of all the variables (with the exception of SH, which is sacrifice hits).

  3. Verify that PC2 has large negative loadings on triples (X3B), stolen bases (SB), caught stealing (CS), and sacrifice hits (SH). This component could be interpreted as a general indication of a player's speed or general utility, since all of these variables require situational awareness and running ability.

For the last two questions all you need to do is provide the eigenvector matrix and note, for example, the low coefficient on SH for part B and the large coefficients in part C.

Exercise 3: Exploring a classification problem

The following data set is a breast cancer data set that has numerous measurements taken from tumor biopsies. The goal is to predict, using the metrics alone, whether a biopsy is cancerous or not. When continuous variables are available it is often helpful to create a pairs plot of the data color coded by the response status (diagnosis). The first variable is an id number and is not needed. There are 357 benign samples and 212 malignant samples.

bc<-read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=F,sep=",")
names(bc)<-c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
             'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean',
             'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
             'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se',
             'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
             'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst',
             'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')

# Getting a look at the distribution of the response
table(bc$diagnosis)

  1. This is a good example of how a scatterplot matrix would not be helpful, as there are quite a few potential predictors. Make a scatterplot of just the first 3-4 predictors (radius_mean, texture_mean, perimeter_mean, area_mean) in the data set just to get a quick look. Do these 4 predictors look like they could help build a classification model such as LDA or KNN?

  2. Conduct a PCA on the entire data set (removing the first 2 columns). Plot PC1 and PC2 color coded by the diagnosis variable. Is there separation between the benign and malignant samples?

Note the power of doing this is that you can see what's going on at a high level by only examining a few graphs. You could also plot the first few PCs in a scatterplot matrix and explore.

  3. Build an LDA model using the original set of variables. Compute a confusion matrix on the training data set. Quickly verify that the prediction performance is consistent with what the EDA suggested. Note: Of course, technically we should do a train/validate split here. You can do that if you want, but the point here is simply to see that PCA can tell you the general story of how a predictive model is going to go, in particular in logistic regression and LDA settings.

  4. To bring it home, consider the following fake data set where I scramble the diagnosis variable. Doing this effectively breaks up the relationship between the response and the predictors, so there should be no predictive power. Repeat parts 2 and 3 with the scrambled data set. Again, verify consistency between the graph and the performance metrics.

fake<-bc
fake$diagnosis<-sample(fake$diagnosis,569,replace=F)

NOTE: This little trick is extremely helpful when you are learning new algorithms, debugging code, or just looking for a sanity check. If you are concerned that you might be doing something that is biasing your analysis or “cheating”, scrambling the data and seeing what happens lets you check your workflow on a case where you know the truth: there should be nothing going on.

ANSWERS:

Exercise 1: Understanding PCA

A. Is principal component analysis (PCA) a predictive modeling technique like linear regression, LDA, or KNN?
False - PCA is primarily a dimensionality reduction technique, not designed for predictive modeling like the mentioned techniques. It’s more about capturing the essence of data, reducing its complexity while retaining its variability.

B. True or False? Technically speaking, PCA should not be applied to categorical variables.
True - PCA can mechanically be run on numerically coded categorical data, but it should not be. PCA seeks directions of maximum variance, which makes sense for continuous data; the "variance" of an arbitrarily coded categorical variable is not conceptually the same thing, so the resulting components are hard to interpret.

C. Why is it a horrible idea to regress variable 1 on principal components that were constructed from variables 1 through 20?
Because variable 1 was included when the PCs were built, the response is literally baked into the predictors: each PC is a linear combination of all 20 variables, including variable 1. Any apparent relationship between the PCs and variable 1 is therefore partly circular, a form of leakage that makes the regression look far more informative than it really is. If variable 1 is going to be the response, it should be left out of the PCA and the components computed from variables 2 through 20 only.
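To see the leakage effect, here is a minimal sketch on pure simulated noise (the exact numbers will vary from run to run):

# 20 completely unrelated noise variables
set.seed(123)
X <- matrix(rnorm(200 * 20), ncol = 20)

pcs_leaky <- prcomp(X, scale. = TRUE)$x[, 1:4]       # PCs built INCLUDING variable 1
pcs_clean <- prcomp(X[, -1], scale. = TRUE)$x[, 1:4] # PCs built WITHOUT variable 1

summary(lm(X[, 1] ~ pcs_leaky))$r.squared  # inflated, even though everything is noise
summary(lm(X[, 1] ~ pcs_clean))$r.squared  # near zero, as it should be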

D. Why is standardizing variables important in PCA?
Standardizing variables is crucial in PCA because it puts all variables on a comparable scale. Without standardization, variables with larger ranges and variances could dominate the principal components, skewing the analysis. Standardizing ensures each variable contributes equally to the analysis, allowing PCA to truly identify the directions of maximum variance across all variables.
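A tiny illustration of the scaling point (a minimal sketch with made-up variables; x3 is given a deliberately huge spread):

# One high-variance variable dominates an unscaled (covariance) PCA
set.seed(1)
toy <- data.frame(x1 = rnorm(100),
                  x2 = rnorm(100),
                  x3 = rnorm(100, sd = 100))
round(prcomp(toy)$rotation[, 1], 3)                 # PC1 is essentially just x3
round(prcomp(toy, scale. = TRUE)$rotation[, 1], 3)  # after standardizing, x3 no longer dominates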

EXERCISE 2: PCA ON (MORE) BASEBALL DATA FOR THE BOYS:

# Load necessary library
library(Lahman)

# Load and filter the dataset for 2016
data(Batting)
Bat16 <- Batting[Batting$yearID == 2016, ]

# Perform PCA excluding the first 5 columns and scale the data
pca_results <- prcomp(Bat16[,-c(1:5)], scale = TRUE)

# Scree plot to visualize variance explained by each PC
plot(pca_results, type="l", main="Scree Plot")

# Calculate the number of PCs needed to retain ~90% of the variance
cumprop <- cumsum(pca_results$sdev^2 / sum(pca_results$sdev^2))
num_pcs <- which(cumprop >= 0.9)[1]

# Output the number of PCs needed
print(paste("Number of PCs needed:", num_pcs))
## [1] "Number of PCs needed: 5"
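As a cross-check, summary() reports the same cumulative proportions (reusing pca_results from the chunk above):

# Cumulative proportion of variance explained by the first six PCs
summary(pca_results)$importance["Cumulative Proportion", 1:6]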
# Examine the eigenvector matrix for interpretation of PC1 and PC2
print(pca_results$rotation)
##             PC1          PC2          PC3         PC4          PC5         PC6
## G    0.26526148  0.021770250 -0.054884788  0.13736205  0.029850688 -0.04033034
## AB   0.28403467  0.015189885 -0.048385728  0.11526443  0.053898977 -0.04256630
## R    0.28362074  0.009989579  0.031727607  0.05677102  0.019216507 -0.02999207
## H    0.28346698  0.010122685 -0.011306056  0.06769882  0.063974749 -0.05767763
## X2B  0.27613998  0.040367646 -0.011927670  0.07499949  0.063397069 -0.07529078
## X3B  0.20264179 -0.312586183  0.127451099 -0.23772508 -0.270564928 -0.81651068
## HR   0.25739539  0.235473025 -0.004201105  0.07633853  0.025588307  0.08261244
## RBI  0.27770019  0.154663085 -0.032126190  0.03122736  0.073361044 -0.00576984
## SB   0.17731201 -0.508850847  0.357789870 -0.07310341  0.189007925  0.28494680
## CS   0.19584760 -0.462243576  0.344673121  0.03267583  0.106706715  0.24539472
## BB   0.26874728  0.086392146  0.007901220 -0.09402877  0.002782314  0.03672874
## SO   0.26948727  0.043497008 -0.056784394  0.19663660 -0.019987458  0.04573821
## IBB  0.19990013  0.186220863 -0.043016631 -0.88000288 -0.091100787  0.26040852
## HBP  0.21768552  0.032797709  0.043801573  0.20134138 -0.854312883  0.27931216
## SH   0.05629511 -0.519160027 -0.829476741 -0.04358336 -0.040757746  0.11564707
## SF   0.24128914  0.116030475 -0.084588333 -0.08599389  0.198047721 -0.09438423
## GIDP 0.25317717  0.140310988 -0.164378566  0.11376197  0.286232622 -0.01241341
##               PC7         PC8          PC9        PC10        PC11         PC12
## G     0.087391447  0.25590683 -0.028685093 -0.76208146 -0.42392064 -0.134501907
## AB    0.040438875  0.09125284 -0.029749469  0.02778418 -0.02877010  0.115790120
## R     0.090066108 -0.06911543 -0.016654166  0.10679100  0.02349680  0.173625337
## H     0.005162917  0.14638680 -0.050851185  0.14714441 -0.10695212  0.230592804
## X2B   0.019585855  0.17715969 -0.030345273  0.21727222 -0.20528647  0.623287473
## X3B   0.050987443 -0.02241365 -0.040898147  0.05549968  0.04016690 -0.182859002
## HR    0.273800578 -0.45490731 -0.008619213  0.27273628 -0.22660846 -0.383970214
## RBI   0.050653349 -0.13431500 -0.017310858  0.21444806 -0.16866952 -0.040661926
## SB   -0.129937102 -0.19165743 -0.626252737 -0.06460900  0.04152522  0.002548115
## CS    0.057021553  0.18427975  0.685349026  0.13025773 -0.05824603 -0.163496783
## BB    0.176678610 -0.14253643  0.170676269 -0.29934190  0.71440959  0.248040075
## SO    0.268319652 -0.29415010  0.010839538 -0.19459816  0.14652047 -0.116164797
## IBB   0.092508684  0.14206900 -0.010901842 -0.01327805 -0.11703610 -0.037927806
## HBP  -0.268566524  0.10780472 -0.067977873  0.06242478  0.05391364 -0.057538189
## SH    0.041619142 -0.09237319  0.029146730  0.05692153 -0.01966635  0.007063658
## SF   -0.823929446 -0.32910964  0.228372874 -0.11592868 -0.03149919 -0.023304427
## GIDP -0.132096311  0.56296140 -0.208744518  0.21301531  0.35911929 -0.455053729
##             PC13         PC14         PC15         PC16          PC17
## G    -0.20881143 -0.081502087  0.029911693  0.025227544 -0.0217294127
## AB    0.17444126  0.405341385 -0.200794226 -0.407770841  0.6867083067
## R    -0.24471251  0.445100772  0.446723620  0.617843136  0.1392898791
## H    -0.01293977  0.459512725 -0.130717112 -0.298117389 -0.6926964529
## X2B   0.12595209 -0.568288446  0.228124603 -0.046346709  0.0559510785
## X3B   0.01315540 -0.071705214  0.001491105 -0.018735868 -0.0006268224
## HR   -0.28505808 -0.144558244  0.281627801 -0.358213725 -0.0061219566
## RBI  -0.14794807 -0.150132515 -0.758839278  0.416711054  0.0250856533
## SB   -0.03424401 -0.066689001 -0.027792627 -0.034511271  0.0075818454
## CS    0.02060296 -0.044841374 -0.005091103  0.006052195 -0.0001019715
## BB   -0.33880385 -0.135698053 -0.103202094 -0.151549225 -0.0325269574
## SO    0.77036668 -0.023670366  0.048798968  0.173448092 -0.1521310744
## IBB   0.15612911  0.045757812  0.042446427  0.018112011  0.0148955338
## HBP  -0.03243677 -0.039534841 -0.008003021 -0.006152526 -0.0035433926
## SH   -0.08163806 -0.028107281  0.001885485 -0.002047952 -0.0091302038
## SF    0.06463069 -0.003490175  0.080987400 -0.022164614 -0.0096804917
## GIDP  0.02474459 -0.135526661  0.112627991  0.053644574 -0.0161535375
pca_results$rotation[,1]  # Extract coefficients of PC1
##          G         AB          R          H        X2B        X3B         HR 
## 0.26526148 0.28403467 0.28362074 0.28346698 0.27613998 0.20264179 0.25739539 
##        RBI         SB         CS         BB         SO        IBB        HBP 
## 0.27770019 0.17731201 0.19584760 0.26874728 0.26948727 0.19990013 0.21768552 
##         SH         SF       GIDP 
## 0.05629511 0.24128914 0.25317717

Looking at the provided eigenvector matrix for PC1:

          PC1
G   0.2652615
AB  0.2840347
R   0.2836207
H   0.2834670
X2B 0.2761400
X3B 0.2026418
...
SH  0.05629511
...

  • PCA is Not a Literal Average: PCA works with the correlations and variances of the data, not by simply summing and dividing numbers.


Result (PC1 and PC2 loadings for the speed/situational variables, taken from the rotation matrix above):

         PC1        PC2
X3B  0.2026418 -0.3125862
SB   0.1773120 -0.5088508
CS   0.1958476 -0.4622436
SH   0.0562951 -0.5191600
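These values can be pulled directly from the fitted object (a one-liner reusing pca_results):

# PC1 and PC2 loadings for the speed/situational variables
round(pca_results$rotation[c("X3B", "SB", "CS", "SH"), 1:2], 3)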


The eigenvector matrix shows the loadings of each variable on each principal component (PC). Each coefficient is the weight a variable receives in the linear combination that defines that PC, so variables with similar coefficients contribute similarly to the component.

For PC1, the loadings are:

  • G (games): 0.26526148
  • AB (at bats): 0.28403467
  • R (runs): 0.28362074
  • H (hits): 0.28346698
  • X2B (doubles): 0.27613998
  • X3B (triples): 0.20264179
  • HR (home runs): 0.25739539
  • RBI (runs batted in): 0.27770019
  • SB (stolen bases): 0.17731201
  • CS (caught stealing): 0.19584760
  • BB (walks): 0.26874728
  • SO (strikeouts): 0.26948727
  • IBB (intentional walks): 0.19990013
  • HBP (hit by pitch): 0.21768552
  • SH (sacrifice hits): 0.05629511
  • SF (sacrifice flies): 0.24128914
  • GIDP (ground into double play): 0.25317717

Findings:

  • PC1 as an “Average”: The similarity of most coefficients in the PC1 column supports the idea that PC1 represents a kind of average of the original performance metrics included in this analysis. It likely captures a general measure of a player's offensive performance (a quick numerical check appears after this list).

  • Triples (X3B): The PC2 loading is about -0.31, a sizable negative value. Triples depend heavily on running speed, so a strong negative weight on this component makes sense.

  • Stolen Bases (SB): The PC2 loading is about -0.51, one of the largest in magnitude. Stolen bases are almost purely a speed and base-running statistic, so they anchor this component.

  • Caught Stealing (CS): The PC2 loading is about -0.46. Being caught stealing only happens to players who attempt steals, so it moves together with SB and reinforces the base-running interpretation.

  • Sacrifice Hits (SH): The PC2 loading is about -0.52, the other dominant weight. Sacrifice bunts are situational plays that rely on contact, speed, and situational awareness rather than power, which fits the speed/utility reading of PC2. All four of these loadings are large and negative, exactly as the question suggests, so PC2 contrasts speed and situational play against the overall volume of offense captured by PC1.

  • The SH Difference: The much smaller coefficient for “SH” indicates that sacrifice hits do not contribute strongly to this “average” component. There are a few reasons for this:

    • Strategic vs. Power: Sacrifice hits are often strategic plays and don’t directly reflect a player’s power hitting or run-scoring abilities.
    • Lower Frequency: Sacrifice hits are likely less frequent than hits, home runs, etc. They might have a lower overall variance.
    • Different Skillset: The ability to execute a sacrifice hit well might be a slightly different skillset compared to the power and contact hitting that other metrics in PC1 tend to capture.
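One quick sanity check of the "PC1 is roughly an average" reading (a minimal sketch reusing the objects above; the sign of a PC is arbitrary, so the correlation may come out near -1 instead of +1):

# PC1 scores vs. a plain row average of the standardized stats
cor(pca_results$x[, 1], rowMeans(scale(Bat16[, -c(1:5)])))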

A. After conducting PCA on the baseball data from 2016 and excluding the first 5 columns, I determined that to capture around 90% of the total variation in the dataset, we need 5 principal components. This insight came from analyzing the scree plot and calculating the cumulative proportion of variance explained by the components.

B. Examining the eigenvector matrix, I found that PC1 seems to be an average of all variables except for SH (sacrifice hits), which had a significantly lower coefficient compared to others. This suggests that while most variables contribute equally to PC1, reflecting a general aspect of performance, SH does not.

C. For PC2, the big negative loadings on triples (X3B), stolen bases (SB), caught stealing (CS), and sacrifice hits (SH) indicate a contrast, possibly related to speed or base-running skills. The negative loadings suggest that as these aspects increase, they contribute to a lower score on PC2, highlighting a dimension of performance distinct from PC1.

EXERCISE 3: Exploring a Classification Problem - Breast Cancer, For the Ladies. Cause that’s all we get. :()

A: Quick Look with Scatterplot

# Load necessary libraries
library(ggplot2)
library(MASS)
bc <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE, sep = ",")
names(bc) <- c('id_number', 'diagnosis', 'radius_mean', 
               'texture_mean', 'perimeter_mean', 'area_mean', 
               'smoothness_mean', 'compactness_mean', 
               'concavity_mean', 'concave_points_mean', 
               'symmetry_mean', 'fractal_dimension_mean',
               'radius_se', 'texture_se', 'perimeter_se', 
               'area_se', 'smoothness_se', 'compactness_se', 
               'concavity_se', 'concave_points_se', 
               'symmetry_se', 'fractal_dimension_se', 
               'radius_worst', 'texture_worst', 
               'perimeter_worst', 'area_worst', 
               'smoothness_worst', 'compactness_worst', 
               'concavity_worst', 'concave_points_worst', 
               'symmetry_worst', 'fractal_dimension_worst')

# Converting diagnosis to a factor
bc$diagnosis <- factor(bc$diagnosis)

# Scatterplot for the first few predictors
ggplot(bc, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
  geom_point() +
  labs(title = "Scatterplot of Radius Mean vs. Texture Mean", x = "Radius Mean", y = "Texture Mean")
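The question mentions the first 3-4 predictors; a base-R pairs plot is one way to look at all four at once (a quick sketch; the red/blue coloring is an arbitrary choice):

# Scatterplot matrix of the four named predictors, color coded by diagnosis
pairs(bc[, c("radius_mean", "texture_mean", "perimeter_mean", "area_mean")],
      col = ifelse(bc$diagnosis == "M", "red", "blue"),
      main = "First Four Predictors by Diagnosis")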

  • What does it show?

  • The scatterplot provided appears to show the relationship between two variables: the mean radius and the mean texture of tumor biopsies, with the data points color-coded by diagnosis—benign (B) or malignant (M).

    1. It seems there is a discernible trend where the malignant samples (M) tend to have a higher radius mean compared to the benign samples (B). The malignant samples are largely concentrated in the area with larger mean radius values, while the benign samples are more clustered in the area with smaller mean radius values.
    2. Additionally, for the texture mean, while there is some overlap between benign and malignant samples, the malignant samples do often seem to have higher texture mean values. However, the separation based on texture mean is not as clear as the separation based on radius mean.
    3. This pattern suggests that both radius and texture mean could be significant predictors in distinguishing between benign and malignant tumor biopsies. A model that includes these variables, such as LDA or KNN, could potentially leverage this difference to classify the biopsy results effectively.
    4. From the distribution of points, there seems to be enough separation to suggest that a classification algorithm could find a statistical boundary to distinguish between the two diagnoses, with a caveat of some potential misclassifications due to the overlap.

B. PCA on Dataset

# Perform PCA on the numeric columns, excluding ID number and diagnosis
bc_numeric <- bc[, -c(1, 2)]
pca_res <- prcomp(bc_numeric, scale. = TRUE)  # scale. = TRUE standardizes the variables

# Creating a dataframe for ggplot
pca_data <- data.frame(PC1 = pca_res$x[, "PC1"], PC2 = pca_res$x[, "PC2"], Diagnosis = bc$diagnosis)

# Plotting PC1 vs. PC2 color-coded by diagnosis
ggplot(pca_data, aes(x = PC1, y = PC2, color = Diagnosis)) + 
  geom_point() + 
  labs(title = "PCA: PC1 vs PC2 by Diagnosis", x = "PC1", y = "PC2")

library(ggplot2)

# Assuming pca_data is already created with PC1, PC2, and Diagnosis
# And assuming pca_res is the result of your PCA with variance explained

# Improved PCA plot with density
ggplot(pca_data, aes(x = PC1, y = PC2)) + 
  geom_point(aes(color = Diagnosis), alpha = 0.6, size = 3) + 
  stat_density_2d(aes(fill = Diagnosis), geom = "polygon", alpha = 0.3) +
  scale_color_brewer(type = 'qual', palette = 'Set1') + 
  scale_fill_brewer(type = 'qual', palette = 'Set1') + 
  labs(title = "PCA: PC1 vs PC2 by Diagnosis",
       x = paste0("PC1 - ", round(pca_res$sdev[1]^2 / sum(pca_res$sdev^2) * 100, 1), "% Variance"),
       y = paste0("PC2 - ", round(pca_res$sdev[2]^2 / sum(pca_res$sdev^2) * 100, 1), "% Variance"),
       color = "Diagnosis") +
  theme_minimal() +
  theme(legend.position = "right")

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Convert the ggplot to a plotly object
ggplotly(
  ggplot(pca_data, aes(x = PC1, y = PC2)) + 
    geom_point(aes(color = Diagnosis), alpha = 0.6, size = 3) + 
    labs(title = "Interactive PCA: PC1 vs PC2 by Diagnosis",
         x = paste0("PC1 - ", round(pca_res$sdev[1]^2 / sum(pca_res$sdev^2) * 100, 1), "% Variance"),
         y = paste0("PC2 - ", round(pca_res$sdev[2]^2 / sum(pca_res$sdev^2) * 100, 1), "% Variance"),
         color = "Diagnosis") +
    theme_minimal() +
    theme(legend.position = "right")
)
  • Overview: The plot affirms the potential of PCA as a tool for initial exploration in classification tasks. It highlights areas where a classification model might perform well and where it might struggle due to the overlap between classes. Moving forward, it would be prudent to employ a classification algorithm and validate its performance, keeping in mind the overlaps and distributions observed in this PCA plot.
  1. Clusters and Density: The density contours around the benign (B) and malignant (M) samples highlight the areas where data points are most concentrated. The malignant samples, marked in red, have a broader distribution, suggesting greater variability in those measures contributing to the first two principal components.

  2. Separation Between Diagnoses: There’s an observable separation between the two diagnoses, with benign samples clustering towards the top right and malignant samples towards the bottom left. However, the overlapping regions of the contours suggest that there is still some ambiguity in the data, and not all cases are clearly distinguishable based solely on PC1 and PC2.

  3. Outliers: A few data points for both diagnoses are distant from the main clusters. Investigating these outliers could provide additional insights or help identify anomalies or data issues.

  4. Variance Explained: The plot accurately reports the amount of variance explained by PC1 and PC2, providing context to how much information from the original dataset is captured in this two-dimensional representation.

C: LDA Model and Confusion Matrix

# LDA model using the original set of variables
lda_model <- lda(diagnosis ~ ., data = bc[, -1])
lda_pred <- predict(lda_model)
confusion_matrix <- table(predicted = lda_pred$class, actual = bc$diagnosis)
confusion_matrix
##          actual
## predicted   B   M
##         B 355  18
##         M   2 194

The table shown is a confusion matrix resulting from an LDA (Linear Discriminant Analysis) model on the breast cancer dataset. This confusion matrix compares the predicted classification of the biopsies (benign or malignant) against the actual classifications. Here’s how to interpret the matrix:

  • True Negatives (B predicted as B): The model correctly predicted 355 cases as benign, which is very good.
  • True Positives (M predicted as M): The model correctly predicted 194 cases as malignant, which is also very good.
  • False Negatives (M predicted as B): There are 18 cases that were actually malignant but were predicted to be benign. This type of error is often considered more serious in medical diagnostics, as it means missing out on detecting a cancer.
  • False Positives (B predicted as M): There are only 2 cases that were actually benign but were predicted to be malignant. This error is usually considered less serious as it would lead to further testing, but not to missed treatment.
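These rates can be computed directly from the table (a quick sketch reusing confusion_matrix; rows are predicted, columns are actual):

# Accuracy, sensitivity (malignant correctly flagged), and specificity (benign correctly cleared)
acc  <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
sens <- confusion_matrix["M", "M"] / sum(confusion_matrix[, "M"])
spec <- confusion_matrix["B", "B"] / sum(confusion_matrix[, "B"])
round(c(accuracy = acc, sensitivity = sens, specificity = spec), 3)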

Based on the confusion matrix, the LDA model shows high predictive accuracy for this dataset. The model is very good at identifying benign samples (high true negative rate) and also performs well at identifying malignant samples (high true positive rate), with a low rate of false positives and false negatives.

However, it’s worth noting that in a medical context, the cost of false negatives (failing to detect a malignant tumor) can be very high, so even though the number is relatively small (18 cases), each of these instances can be critical. This might necessitate a review of the model threshold or the pursuit of additional diagnostic strategies to minimize the risk of false negatives.
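Since the paragraph above mentions revisiting the model threshold, here is a minimal sketch of one way to do that (the 0.3 cutoff is an arbitrary illustrative choice, not a recommendation):

# Flag a biopsy as malignant whenever the LDA posterior probability of M exceeds 0.3
pred_lower_threshold <- ifelse(lda_pred$posterior[, "M"] > 0.3, "M", "B")
table(predicted = pred_lower_threshold, actual = bc$diagnosis)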

In summation, we’re all getting breast cancer. Maybe not literally (well, there is a high likelihood..), but we’re all getting datasets on breast cancer every week for the rest of our lives.

D: Analysis with Scrambled Diagnosis

# Scrambling the diagnosis variable
bc$diagnosis <- sample(bc$diagnosis)

# Repeat PCA with scrambled diagnosis (PCA ignores the response, so only the color coding changes)
pca_result_scrambled <- prcomp(bc[, -c(1, 2)], scale. = TRUE)
plot(pca_result_scrambled$x[, 1:2], col = bc$diagnosis,
     xlab = "PC1", ylab = "PC2", main = "PCA with Scrambled Diagnosis")

# Repeating LDA
lda_model_scrambled <- lda(diagnosis ~ ., data = bc[, -1])
lda_pred_scrambled <- predict(lda_model_scrambled)

# Creating confusion matrix for scrambled data
confusion_matrix_scrambled <- table(Predicted = lda_pred_scrambled$class, Actual = bc$diagnosis)
confusion_matrix_scrambled
##          Actual
## Predicted   B   M
##         B 328 174
##         M  29  38
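A quick comparison of the scrambled model's accuracy to the no-information rate (reusing confusion_matrix_scrambled; 357 of the 569 samples are benign, so always guessing "B" gets about 63% right):

# Accuracy after scrambling vs. simply predicting benign every time
acc_scrambled <- sum(diag(confusion_matrix_scrambled)) / sum(confusion_matrix_scrambled)
round(c(scrambled_accuracy = acc_scrambled, no_information_rate = 357 / 569), 3)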
  • What this demonstrates:

    1. This part demonstrates how the relationship between the predictors and the response drives the model's ability to classify correctly, as evidenced by the performance metrics and the visual separation (or lack thereof) in the PCA plots.
    2. After scrambling the ‘diagnosis’ variable and conducting PCA, the plot of the first two principal components (PC1 and PC2) does not show clear separation between the categories. The data points are intermingled, suggesting that the structure distinguishing benign from malignant cases has been lost due to the randomization of the diagnoses.
    3. Reapplying the LDA model to the scrambled data set results in a confusion matrix with a large number of misclassifications: 174 malignant cases are predicted as benign, and only 38 of the 212 malignant cases are identified correctly, so overall accuracy falls to roughly the no-information rate. This outcome underscores how much the model's predictive power depends on the genuine relationship between the predictors and the diagnosis labels.
  • Purpose of Scrambling: This sanity check shows:

    • Original Patterns Matter: PCA and LDA should perform better with the real data due to the true relationship between predictors and diagnosis.
    • Spotting Coding Errors: If the scrambled results are surprisingly good, it’s a red flag that we need to check our analysis for mistakes.

My Final Homework Question for you:

How come all the female-related datasets are about breast cancer, diabetes, and dying when “the guys” get sports over and over? Y’all need to get a dataset about mining diamonds, fashion, cooking, and determining authenticity in luxury goods or something. There is more to analyze than basketball, cricket, terminal illness, dying babies, golf, and baseball.

IT’S ENOUGH WITH THE f&^& BASEBALL.