True or False? Principal component analysis is a predictive modeling technique such as linear regression, LDA, or KNN.
True or False? Technically speaking, PCA should not be applied to categorical variables.
An analyst conducts PCA on continuous variables 1 through 20 and settles on reducing the data down to 4 PCs. The analyst then proceeds to conduct linear regression using variable 1 as the response variable. Why is this a horrible idea?
Why is it important to conduct PCA on standardized variables (aka using the correlation matrix)?
A. False - PCA is a dimensionality reduction technique, not a predictive modeling technique like the ones listed.
B. True - While PCA can technically be run on categorical data, it’s primarily intended for continuous numerical variables. PCA works by finding directions of highest variance, and categorical variables don’t have meaningful variance on their own.
C.
Because variable 1 was one of the 20 variables used to build the principal components, the PCs already contain the response. Regressing variable 1 on components constructed partly from variable 1 is circular: the response is effectively being used to predict itself, so any apparent fit is misleading rather than evidence of a real relationship. If the PCs are meant to serve as predictors, the intended response variable should be left out of the PCA.
D.
Standardizing variables before PCA is important because it ensures that
all variables are on a similar scale. This prevents variables with
larger ranges from unduly dominating the analysis and skewing
results.
## Exercise 2: Data Reduction and Interpretation

Consider the baseball data set below. The data set is quite comprehensive. For this analysis we are just going to examine the data in the year 2016. A quick summary is below:
library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.3
data(Batting)
index<-which(Batting$yearID==2016)
Bat16<-Batting[index,]
summary(Bat16)
## playerID yearID stint teamID lgID
## Length:1483 Min. :2016 Min. :1.000 ATL : 60 AA: 0
## Class :character 1st Qu.:2016 1st Qu.:1.000 SDN : 58 AL:734
## Mode :character Median :2016 Median :1.000 LAN : 55 FL: 0
## Mean :2016 Mean :1.092 PIT : 55 NA: 0
## 3rd Qu.:2016 3rd Qu.:1.000 SEA : 54 NL:749
## Max. :2016 Max. :4.000 LAA : 53 PL: 0
## (Other):1148 UA: 0
## G AB R H
## Min. : 1.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 13.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 31.00 Median : 11.0 Median : 1.00 Median : 1.00
## Mean : 47.51 Mean :111.6 Mean : 14.66 Mean : 28.51
## 3rd Qu.: 69.00 3rd Qu.:155.0 3rd Qu.: 18.00 3rd Qu.: 36.00
## Max. :162.00 Max. :672.0 Max. :123.00 Max. :216.00
##
## X2B X3B HR RBI
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 0.000 Median : 0.0000 Median : 0.000 Median : 0.00
## Mean : 5.566 Mean : 0.5887 Mean : 3.783 Mean : 13.99
## 3rd Qu.: 7.000 3rd Qu.: 0.0000 3rd Qu.: 3.000 3rd Qu.: 15.50
## Max. :48.000 Max. :11.0000 Max. :47.000 Max. :133.00
##
## SB CS BB SO
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.000 Median : 0.000 Median : 0.00 Median : 4.00
## Mean : 1.711 Mean : 0.675 Mean : 10.17 Mean : 26.29
## 3rd Qu.: 1.000 3rd Qu.: 0.000 3rd Qu.: 13.00 3rd Qu.: 38.00
## Max. :62.000 Max. :18.000 Max. :116.00 Max. :219.00
##
## IBB HBP SH SF
## Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.000 Median : 0.0000 Median : 0.0000
## Mean : 0.6285 Mean : 1.113 Mean : 0.6912 Mean : 0.8186
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 1.0000 3rd Qu.: 1.0000
## Max. :20.0000 Max. :24.000 Max. :13.0000 Max. :15.0000
##
## GIDP
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 2.508
## 3rd Qu.: 3.000
## Max. :26.000
##
Conduct a PCA on the entire data set with the exception of the first 5 columns. Provide a scree plot and determine the number of PCs that would be needed to retain approximately 90% of the total variation in the data set.
Provide the eigenvector matrix and examine the coefficients. Verify that PC1 is essentially an average of all the variables, with the exception of SH (sacrifice hits).
Verify that PC2 has big negative loadings on triples (X3B), stolen bases (SB), caught stealing (CS), and sacrifice hits (SH). This component could be interpreted as a general indication of a player’s speed or general utility, since all of these variables require situational awareness and running ability.
For the last two questions all you need to do is provide the eigenvector matrix and note, for example, the low coefficient on SH for part B and the large coefficients in part C.
The following data set is a breast cancer data set that has numerous measurements taken from tumor biopsies. The goal of using this data set is to predict, from the measurements alone, whether a biopsy is cancerous or not. When continuous variables are available it is often helpful to create a pairs plot of the data color coded by the response status (diagnosis). The first variable is an id number and is not needed. There are 357 benign samples and 212 malignant samples.
bc<-read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=F,sep=",")
names(bc)<- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
              'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean',
              'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
              'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se',
              'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
              'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst',
              'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
# Getting a look at the distribution of the response
table(bc$diagnosis)
This is a good example of how a full scatterplot matrix would not be helpful, as there are quite a few potential predictors. Make a scatterplot of just the first 3-4 predictors (radius_mean, texture_mean, perimeter_mean, area_mean) in the data set to get a quick look. Do these 4 predictors appear as though they could help build a classification model such as LDA or KNN?
Conduct a PCA on the entire data set (removing the first 2 columns). Plot PC1 and PC2 color coded by the diagnosis variable. Is there separation between the benign and malignant samples?
Note the power of doing this is that you can see what’s going on at a high level by only examining a few graphs. You could also plot the first few PCs in a scatterplot matrix and explore.
Build an LDA model using the original set of variables. Compute a confusion matrix on the training data set. Quickly verify that the prediction performance is consistent with what the EDA suggested. Note: Of course technically we should do a train/validate split here. You can do that if you want, but the point here is to simply see that PCA can tell you the general story of how a predictive model is going to go, particularly in logistic regression and LDA settings.
To bring it home, consider the following fake data set where I scramble the diagnosis variable. Doing this effectively breaks up the relationship between the response and predictors so there should be no predictive power. Repeat parts B and C with the scrambled data set. Again verify consistency between the graph and the performance metrics.
fake<-bc
fake$diagnosis<-sample(fake$diagnosis,569,replace=F)
NOTE: This little trick is extremely helpful when you are learning new algorithms, debugging code, or just looking for a sanity check. If you are concerned that you might be doing something that is biasing your analysis or “cheating”, scrambling the data and seeing what happens is a good way to check: when you know there is nothing to find, your analysis should find nothing.
A. Is principal component analysis (PCA) a
predictive modeling technique like linear regression, LDA, or KNN?
False - PCA is primarily a dimensionality reduction
technique, not designed for predictive modeling like the mentioned
techniques. It’s more about capturing the essence of data, reducing its
complexity while retaining its variability.
B. Can PCA technically be applied to categorical
variables?
True - Technically, PCA can be applied to categorical
data, but it’s not ideal. PCA seeks directions of maximum variance,
which makes more sense with continuous data. Categorical variables don’t
naturally fit into this paradigm since their variance isn’t conceptually
the same as that of continuous variables.
C. Why is it a horrible idea to run PCA on variables 1
through 20 and then use variable 1 as the response in a linear regression on the resulting PCs?
Because variable 1 was part of the data used to construct the principal components, the PCs already contain the response variable. Regressing variable 1 on those components means the response is, in part, predicting itself, so any apparent relationship is an artifact of this circularity rather than genuine predictive structure. The analysis also loses a clean interpretation, since every PC mixes the response together with the other 19 variables. The proper approach is to leave the intended response out of the PCA and build the components from the remaining variables only.
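To make the circularity concrete, here is a small simulated sketch (made-up data, not part of the exercise): twenty unrelated noise variables, with variable 1 regressed on four PCs that either do or do not include it.
set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 20), 200, 20))   # 20 unrelated continuous variables (pure noise)
pc_all  <- prcomp(X, scale. = TRUE)                    # variable 1 helps build these PCs...
leaky   <- lm(X$V1 ~ pc_all$x[, 1:4])                  # ...and is then also used as the response
pc_rest <- prcomp(X[, -1], scale. = TRUE)              # honest version: leave variable 1 out of the PCA
honest  <- lm(X$V1 ~ pc_rest$x[, 1:4])
c(leaky = summary(leaky)$r.squared, honest = summary(honest)$r.squared)
# The leaky fit typically reports noticeably more "signal" even though nothing is related to anything,
# purely because the response is baked into its own predictors; the honest fit stays near zero.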
D. Why is standardizing variables important in
PCA?
Standardizing variables is crucial in PCA because it puts all variables
on a comparable scale. Without standardization, variables with larger
ranges and variances could dominate the principal components, skewing
the analysis. Standardizing ensures each variable contributes equally to
the analysis, allowing PCA to truly identify the directions of maximum
variance across all variables.
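A minimal sketch with made-up variables (not part of the exercise) showing how one variable measured in dollars swamps a covariance-based PCA but not a correlation-based one:
set.seed(2)
n <- 100
f <- rnorm(n)                                            # a shared latent factor
dat <- data.frame(income  = 50000 + 15000 * f + rnorm(n, 0, 5000),   # huge variance (dollars)
                  rating1 = 5 + 1.5 * f + rnorm(n),                  # small variance (1-10 scale)
                  rating2 = 5 + 1.5 * f + rnorm(n))
round(prcomp(dat, scale. = FALSE)$rotation[, 1], 3)      # covariance PCA: PC1 is essentially just income
round(prcomp(dat, scale. = TRUE)$rotation[, 1], 3)       # correlation PCA: all three variables share the weight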
library(Lahman)
data(Batting)
index<-which(Batting$yearID==2016)
Bat16<-Batting[index,]
# PCA (exclude first 5 columns, which are likely identifiers/non-numerical)
pca_results <- prcomp(Bat16[,-c(1:5)], scale = TRUE)
# Scree plot
plot(pca_results)
# Determine PCs needed for 90% variance
cumprop <- cumsum(pca_results$sdev^2 / sum(pca_results$sdev^2))
num_pcs <- which(cumprop >= 0.9)[1]  # first component at which the cumulative proportion reaches 90%
# Load necessary library
library(Lahman)
# Load and filter the dataset for 2016
data(Batting)
Bat16 <- Batting[Batting$yearID == 2016, ]
# Perform PCA excluding the first 5 columns and scale the data
pca_results <- prcomp(Bat16[,-c(1:5)], scale = TRUE)
# Scree plot to visualize variance explained by each PC
plot(pca_results, type="l", main="Scree Plot")
# Calculate the number of PCs needed to retain ~90% of the variance
cumprop <- cumsum(pca_results$sdev^2 / sum(pca_results$sdev^2))
num_pcs <- which(cumprop >= 0.9)[1]
# Output the number of PCs needed
print(paste("Number of PCs needed:", num_pcs))
## [1] "Number of PCs needed: 5"
# Examine the eigenvector matrix for interpretation of PC1 and PC2
print(pca_results$rotation)
## PC1 PC2 PC3 PC4 PC5 PC6
## G 0.26526148 0.021770250 -0.054884788 0.13736205 0.029850688 -0.04033034
## AB 0.28403467 0.015189885 -0.048385728 0.11526443 0.053898977 -0.04256630
## R 0.28362074 0.009989579 0.031727607 0.05677102 0.019216507 -0.02999207
## H 0.28346698 0.010122685 -0.011306056 0.06769882 0.063974749 -0.05767763
## X2B 0.27613998 0.040367646 -0.011927670 0.07499949 0.063397069 -0.07529078
## X3B 0.20264179 -0.312586183 0.127451099 -0.23772508 -0.270564928 -0.81651068
## HR 0.25739539 0.235473025 -0.004201105 0.07633853 0.025588307 0.08261244
## RBI 0.27770019 0.154663085 -0.032126190 0.03122736 0.073361044 -0.00576984
## SB 0.17731201 -0.508850847 0.357789870 -0.07310341 0.189007925 0.28494680
## CS 0.19584760 -0.462243576 0.344673121 0.03267583 0.106706715 0.24539472
## BB 0.26874728 0.086392146 0.007901220 -0.09402877 0.002782314 0.03672874
## SO 0.26948727 0.043497008 -0.056784394 0.19663660 -0.019987458 0.04573821
## IBB 0.19990013 0.186220863 -0.043016631 -0.88000288 -0.091100787 0.26040852
## HBP 0.21768552 0.032797709 0.043801573 0.20134138 -0.854312883 0.27931216
## SH 0.05629511 -0.519160027 -0.829476741 -0.04358336 -0.040757746 0.11564707
## SF 0.24128914 0.116030475 -0.084588333 -0.08599389 0.198047721 -0.09438423
## GIDP 0.25317717 0.140310988 -0.164378566 0.11376197 0.286232622 -0.01241341
## PC7 PC8 PC9 PC10 PC11 PC12
## G 0.087391447 0.25590683 -0.028685093 -0.76208146 -0.42392064 -0.134501907
## AB 0.040438875 0.09125284 -0.029749469 0.02778418 -0.02877010 0.115790120
## R 0.090066108 -0.06911543 -0.016654166 0.10679100 0.02349680 0.173625337
## H 0.005162917 0.14638680 -0.050851185 0.14714441 -0.10695212 0.230592804
## X2B 0.019585855 0.17715969 -0.030345273 0.21727222 -0.20528647 0.623287473
## X3B 0.050987443 -0.02241365 -0.040898147 0.05549968 0.04016690 -0.182859002
## HR 0.273800578 -0.45490731 -0.008619213 0.27273628 -0.22660846 -0.383970214
## RBI 0.050653349 -0.13431500 -0.017310858 0.21444806 -0.16866952 -0.040661926
## SB -0.129937102 -0.19165743 -0.626252737 -0.06460900 0.04152522 0.002548115
## CS 0.057021553 0.18427975 0.685349026 0.13025773 -0.05824603 -0.163496783
## BB 0.176678610 -0.14253643 0.170676269 -0.29934190 0.71440959 0.248040075
## SO 0.268319652 -0.29415010 0.010839538 -0.19459816 0.14652047 -0.116164797
## IBB 0.092508684 0.14206900 -0.010901842 -0.01327805 -0.11703610 -0.037927806
## HBP -0.268566524 0.10780472 -0.067977873 0.06242478 0.05391364 -0.057538189
## SH 0.041619142 -0.09237319 0.029146730 0.05692153 -0.01966635 0.007063658
## SF -0.823929446 -0.32910964 0.228372874 -0.11592868 -0.03149919 -0.023304427
## GIDP -0.132096311 0.56296140 -0.208744518 0.21301531 0.35911929 -0.455053729
## PC13 PC14 PC15 PC16 PC17
## G -0.20881143 -0.081502087 0.029911693 0.025227544 -0.0217294127
## AB 0.17444126 0.405341385 -0.200794226 -0.407770841 0.6867083067
## R -0.24471251 0.445100772 0.446723620 0.617843136 0.1392898791
## H -0.01293977 0.459512725 -0.130717112 -0.298117389 -0.6926964529
## X2B 0.12595209 -0.568288446 0.228124603 -0.046346709 0.0559510785
## X3B 0.01315540 -0.071705214 0.001491105 -0.018735868 -0.0006268224
## HR -0.28505808 -0.144558244 0.281627801 -0.358213725 -0.0061219566
## RBI -0.14794807 -0.150132515 -0.758839278 0.416711054 0.0250856533
## SB -0.03424401 -0.066689001 -0.027792627 -0.034511271 0.0075818454
## CS 0.02060296 -0.044841374 -0.005091103 0.006052195 -0.0001019715
## BB -0.33880385 -0.135698053 -0.103202094 -0.151549225 -0.0325269574
## SO 0.77036668 -0.023670366 0.048798968 0.173448092 -0.1521310744
## IBB 0.15612911 0.045757812 0.042446427 0.018112011 0.0148955338
## HBP -0.03243677 -0.039534841 -0.008003021 -0.006152526 -0.0035433926
## SH -0.08163806 -0.028107281 0.001885485 -0.002047952 -0.0091302038
## SF 0.06463069 -0.003490175 0.080987400 -0.022164614 -0.0096804917
## GIDP 0.02474459 -0.135526661 0.112627991 0.053644574 -0.0161535375
pca_results$rotation[,1] # Extract coefficients of PC1
## G AB R H X2B X3B HR
## 0.26526148 0.28403467 0.28362074 0.28346698 0.27613998 0.20264179 0.25739539
## RBI SB CS BB SO IBB HBP
## 0.27770019 0.17731201 0.19584760 0.26874728 0.26948727 0.19990013 0.21768552
## SH SF GIDP
## 0.05629511 0.24128914 0.25317717
Looking at the provided eigenvector matrix for PC1:
PC1
G 0.2652615
AB 0.2840347
R 0.2836207
H 0.2834670
X2B 0.2761400
X3B 0.2026418
...
SH 0.05629511
...
* **PCA is not a literal average:** PCA chooses its weights to maximize variance, so PC1 is a weighted combination of the standardized variables rather than a simple sum-and-divide; here the weights just happen to be nearly equal for everything except SH.
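That said, a quick optional check (assuming Bat16 and pca_results are still in the workspace) is to correlate the PC1 scores with a plain row mean of the standardized statistics; given the nearly equal positive loadings above, the correlation should be very close to 1.
# Compare PC1 scores to a simple average of the standardized counting stats
std_stats <- scale(Bat16[, -c(1:5)])
cor(pca_results$x[, 1], rowMeans(std_stats))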
For part C, the relevant coefficients are in the PC2 column of the rotation matrix above:
X3B -0.313, SB -0.509, CS -0.462, SH -0.519
Coefficients in PC1 column:
G AB R H X2B X3B HR
0.26526148 0.28403467 0.28362074 0.28346698 0.27613998 0.20264179 0.25739539
RBI SB CS BB SO IBB HBP
0.27770019 0.17731201 0.19584760 0.26874728 0.26948727 0.19990013 0.21768552
SH SF GIDP
0.05629511 0.24128914 0.25317717
The eigenvector matrix shows the loadings of each variable on each principal component (PC); each coefficient is the weight that variable receives in that component. The PC1 loadings relevant to part B are listed above.
Findings:
- PC1 as an “Average”: The similarity of most coefficients in the PC1 column supports the idea that PC1 represents a kind of average of the original performance metrics included in this analysis. It likely captures a general measure of a player’s offensive performance.
Triples (X3B): The PC1 loading is 0.2026, positive but noticeably lower than core hitting statistics like at-bats (AB), runs (R), or hits (H). The large negative loading the exercise refers to shows up on PC2 (-0.313), consistent with the speed interpretation.
Stolen Bases (SB): The PC1 loading is 0.1773, also among the lower values, while PC2 carries a large negative loading (-0.509).
Caught Stealing (CS): The PC1 loading is 0.1958, again on the lower end, with a large negative PC2 loading (-0.462).
Sacrifice Hits (SH): The PC1 loading is 0.0563, far lower than every other variable, so SH contributes almost nothing to the “average” that PC1 represents. Its PC2 loading (-0.519) is strongly negative, in line with the speed/situational-play interpretation.
The SH Difference: The much smaller coefficient for “SH” indicates that sacrifice hits do not contribute strongly to this “average” component, most likely because sacrifices are rare events dictated by game situation and managerial strategy rather than by overall playing time or offensive production, so they track the other counting statistics only weakly.
A. After conducting PCA on the baseball data from 2016 and excluding the first 5 columns, I determined that to capture around 90% of the total variation in the dataset, we need 5 principal components. This insight came from analyzing the scree plot and calculating the cumulative proportion of variance explained by the components.
B. Examining the eigenvector matrix, I found that PC1 seems to be an average of all variables except for SH (sacrifice hits), which had a significantly lower coefficient compared to others. This suggests that while most variables contribute equally to PC1, reflecting a general aspect of performance, SH does not.
C. For PC2, the big negative loadings on triples (X3B), stolen bases (SB), caught stealing (CS), and sacrifice hits (SH) indicate a contrast, possibly related to speed or base-running skills. The negative loadings suggest that as these aspects increase, they contribute to a lower score on PC2, highlighting a dimension of performance distinct from PC1.
# Load necessary libraries
library(ggplot2)
library(MASS)
bc <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE, sep = ",")
names(bc) <- c('id_number', 'diagnosis', 'radius_mean',
'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean',
'concavity_mean', 'concave_points_mean',
'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se',
'concavity_se', 'concave_points_se',
'symmetry_se', 'fractal_dimension_se',
'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst',
'concavity_worst', 'concave_points_worst',
'symmetry_worst', 'fractal_dimension_worst')
# Converting diagnosis to a factor
bc$diagnosis <- factor(bc$diagnosis)
# Scatterplot for the first few predictors
ggplot(bc, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
geom_point() +
labs(title = "Scatterplot of Radius Mean vs. Texture Mean", x = "Radius Mean", y = "Texture Mean")
What does it show?
The scatterplot provided appears to show the relationship between two variables: the mean radius and the mean texture of tumor biopsies, with the data points color-coded by diagnosis—benign (B) or malignant (M).
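Since the exercise mentions the first 3-4 predictors, a scatterplot matrix of the four named variables, color coded by diagnosis, extends the single plot above (a minimal base-R sketch using the bc data frame):
# Scatterplot matrix of the four named predictors, color coded by diagnosis
pairs(bc[, c("radius_mean", "texture_mean", "perimeter_mean", "area_mean")],
      col = ifelse(bc$diagnosis == "M", "red", "blue"),
      pch = 16, cex = 0.6,
      main = "First Four Predictors by Diagnosis")
If the malignant points shift toward larger radius, perimeter, and area values (as the LDA results later suggest they do), these four predictors alone could already support a classifier such as LDA or KNN.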
# Perform PCA on the scaled data excluding ID number and diagnosis
bc_numeric <- bc[, -c(1, 2)]
bc_scaled <- scale(bc_numeric)               # center and scale the measurements
pca_res <- prcomp(bc_scaled, scale. = TRUE)  # scale. = TRUE is redundant on already-scaled data, but harmless
# Creating a dataframe for ggplot
pca_data <- data.frame(PC1 = pca_res$x[, "PC1"], PC2 = pca_res$x[, "PC2"], Diagnosis = bc$diagnosis)
# Plotting PC1 vs. PC2 color-coded by diagnosis
ggplot(pca_data, aes(x = PC1, y = PC2, color = Diagnosis)) +
geom_point() +
labs(title = "PCA: PC1 vs PC2 by Diagnosis", x = "PC1", y = "PC2")
library(ggplot2)
# Assuming pca_data is already created with PC1, PC2, and Diagnosis
# And assuming pca_res is the result of your PCA with variance explained
# Improved PCA plot with density
ggplot(pca_data, aes(x = PC1, y = PC2)) +
geom_point(aes(color = Diagnosis), alpha = 0.6, size = 3) +
stat_density_2d(aes(fill = Diagnosis), geom = "polygon", alpha = 0.3) +
scale_color_brewer(type = 'qual', palette = 'Set1') +
scale_fill_brewer(type = 'qual', palette = 'Set1') +
labs(title = "PCA: PC1 vs PC2 by Diagnosis",
x = paste("PC1 - ", pca_res$sdev[1]^2 / sum(pca_res$sdev^2) * 100, "% Variance"),
y = paste("PC2 - ", pca_res$sdev[2]^2 / sum(pca_res$sdev^2) * 100, "% Variance"),
color = "Diagnosis") +
theme_minimal() +
theme(legend.position = "right")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Convert the ggplot to a plotly object
ggplotly(
ggplot(pca_data, aes(x = PC1, y = PC2)) +
geom_point(aes(color = Diagnosis), alpha = 0.6, size = 3) +
labs(title = "Interactive PCA: PC1 vs PC2 by Diagnosis",
x = paste("PC1 - ", pca_res$sdev[1]^2 / sum(pca_res$sdev^2) * 100, "% Variance"),
y = paste("PC2 - ", pca_res$sdev[2]^2 / sum(pca_res$sdev^2) * 100, "% Variance"),
color = "Diagnosis") +
theme_minimal() +
theme(legend.position = "right")
)
Clusters and Density: The density contours around the benign (B) and malignant (M) samples highlight the areas where data points are most concentrated. The malignant samples, marked in red, have a broader distribution, suggesting greater variability in those measures contributing to the first two principal components.
Separation Between Diagnoses: There’s an observable separation between the two diagnoses, with benign samples clustering towards the top right and malignant samples towards the bottom left. However, the overlapping regions of the contours suggest that there is still some ambiguity in the data, and not all cases are clearly distinguishable based solely on PC1 and PC2.
Outliers: A few data points for both diagnoses are distant from the main clusters. Investigating these outliers could provide additional insights or help identify anomalies or data issues.
Variance Explained: The plot accurately reports the amount of variance explained by PC1 and PC2, providing context to how much information from the original dataset is captured in this two-dimensional representation.
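As the exercise notes, the first few PCs can also be explored in a scatterplot matrix; a minimal sketch using the pca_res object from above:
# Scatterplot matrix of the first four PCs, color coded by diagnosis
pairs(pca_res$x[, 1:4],
      col = ifelse(bc$diagnosis == "M", "red", "blue"),
      pch = 16, cex = 0.6,
      main = "First Four PCs by Diagnosis")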
# LDA model using the original set of variables
lda_model <- lda(diagnosis ~ ., data = bc[, -1])
lda_pred <- predict(lda_model)
confusion_matrix <- table(predicted = lda_pred$class, actual = bc$diagnosis)
confusion_matrix
## actual
## predicted B M
## B 355 18
## M 2 194
The table shown is a confusion matrix resulting from an LDA (Linear Discriminant Analysis) model on the breast cancer dataset. It compares the predicted classification of the biopsies (benign or malignant) against the actual classifications: rows are the predicted classes and columns are the actual classes, so the off-diagonal cells are the 2 benign samples called malignant and the 18 malignant samples called benign.
Based on the confusion matrix, the LDA model shows high predictive accuracy for this dataset. The model is very good at identifying benign samples (high true negative rate) and also performs well at identifying malignant samples (high true positive rate), with a low rate of false positives and false negatives.
However, it’s worth noting that in a medical context, the cost of false negatives (failing to detect a malignant tumor) can be very high, so even though the number is relatively small (18 cases), each of these instances can be critical. This might necessitate a review of the model threshold or the pursuit of additional diagnostic strategies to minimize the risk of false negatives.
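To put numbers on that, the usual summary statistics can be pulled straight from the confusion_matrix object above (a quick sketch; with the matrix as printed these come out to roughly 96.5% accuracy, 91.5% sensitivity, and 99.4% specificity):
# Accuracy, sensitivity, and specificity from the confusion matrix (rows = predicted, columns = actual)
accuracy    <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
sensitivity <- confusion_matrix["M", "M"] / sum(confusion_matrix[, "M"])  # malignant correctly flagged
specificity <- confusion_matrix["B", "B"] / sum(confusion_matrix[, "B"])  # benign correctly cleared
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)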
In summation, we’re all getting breast cancer. Maybe not literally (well, there is a high likelihood...), but we’re all getting data sets on breast cancer every week for the rest of our lives.
# Scrambling the diagnosis variable
bc$diagnosis <- sample(bc$diagnosis)
# Repeat PCA with scrambled diagnosis
pca_result_scrambled <- prcomp(scale(bc[, -c(1, 2)]), scale. = TRUE)
# Repeating LDA
lda_model_scrambled <- lda(diagnosis ~ ., data = bc[, -1])
lda_pred_scrambled <- predict(lda_model_scrambled)
# Creating confusion matrix for scrambled data
confusion_matrix_scrambled <- table(Predicted = lda_pred_scrambled$class, Actual = bc$diagnosis)
confusion_matrix_scrambled
## Actual
## Predicted B M
## B 328 174
## M 29 38
# Scrambling the diagnosis variable
bc$diagnosis <- sample(bc$diagnosis)
# Repeat PCA
pca_result_scrambled <- prcomp(scale(bc[, -c(1,2)]))
plot(pca_result_scrambled$x[, 1:2], col = bc$diagnosis, xlab = "PC1", ylab = "PC2", main = "PCA with Scrambled Diagnosis")
# Repeat LDA
lda_model_scrambled <- lda(diagnosis ~ ., data = bc[, -1])
lda_pred_scrambled <- predict(lda_model_scrambled)
table(predicted = lda_pred_scrambled$class, actual = bc$diagnosis)
## actual
## predicted B M
## B 343 191
## M 14 21
What this demonstrates: once the diagnosis labels are scrambled, the PC1 vs PC2 plot shows the two colors completely intermixed, and the LDA confusion matrix is barely better than always guessing the majority class. The purpose of scrambling is exactly this sanity check: when the relationship between the response and the predictors is destroyed, the EDA picture and the model’s performance should collapse together, and they do.
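One simple way to quantify “nothing going on” is to compare the scrambled model’s training accuracy to the no-information baseline of always guessing the majority class (a sketch using the objects from the last scrambled run; exact numbers vary with each random scramble):
# Scrambled-model accuracy vs. always guessing the majority class
cm_scrambled <- table(predicted = lda_pred_scrambled$class, actual = bc$diagnosis)
scrambled_accuracy <- sum(diag(cm_scrambled)) / sum(cm_scrambled)
baseline_accuracy  <- max(table(bc$diagnosis)) / nrow(bc)
round(c(scrambled = scrambled_accuracy, baseline = baseline_accuracy), 3)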
How come all the female-related datasets are about breast cancer, diabetes and dying when “the guys” get sports over and over? Y’all need to get a dataset about mining diamonds, fashion, cooking and determining authenticity in luxury goods or something. There is more to analyze than basketball, cricket, terminal illness, dying babies, golf and baseball.
IT’S ENOUGH WITH THE f&^& BASEBALL.