library(glmnet)
library(ROCR)
library(MASS)

Logistics

Please check Canvas for the due date of this assignment. Please be sure to answer all parts of each question and submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.

Background

In class, we discussed two broad subfields of machine learning: supervised learning and unsupervised learning. In a typical supervised learning setup, we use a set of labeled input data points to try and learn a robust function that maps future inputs to labels. By contrast, in unsupervised learning, we do not have labels for our training data, and instead we seek to understand interesting things about the data such as the correlation structure between variables or whether high-dimensional data points can be projected into a lower-dimensional space without much loss of information. In this assignment, we will explore both approaches to machine learning using the same dataset.

Part 1: Supervised Learning

Your clinician collaborator has collected a dataset of blood laboratory values, basic demographic information (age, gender), and coronary artery disease status from a population of ~1K individuals. The data is available in hw2_2022.rds.

library(tidyverse)
hw2.data <- readRDS('hw2_2022.rds')

Question 1

This dataset derives from CDC NHANES and consists of blood lab values (prefixed with ‘LBX’) as well as coronary artery disease (CAD) status (variable “cad” with 0 = absent, 1 = present), age, and gender. Write a code snippet to count the number of cases and controls in the dataset, and report these two numbers here:

# your code here
# Cases: CAD==1
table(hw2.data$cad)
## 
##   0   1 
## 684 395
  • your answer here Cases = 395 (individuals with CAD), Controls = 684 (individuals without CAD).

Question 2

Your clinician collaborator asks you to build a new lab-based “score” that can tell whether a new individual has CAD or not given their laboratory values, gender, and age. Is this a prediction or inference task?

  • your answer here Prediction. We have available variables (set of inputs X) and we want to predict the output (Y).

Question 3

Your collaborator suggests you use the whole dataset, try as many complex models as you can run, and select the best one to publish. You inform your collaborator that this could be done, but it might be a bad idea because of a trade-off called:

  • your answer here bias-variance trade-off.

Question 4

Write down the equation for the trade-off from Q3.

  • your answer here \[E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\epsilon).\]

Question 5

You suggest to your collaborator that you should divide the sample into a training and test dataset. Write a function called “train_test_split” that accepts “hw2.data” as an input and returns an R list object containing two “tibbles”: “train” and “test” in a 70:30 random split.

# Add identification column to the dataset: PID

hw2.data <-hw2.data %>%  mutate(PID= row_number()) 
# your code here
# The function "train_test_split" has 2 arguments: 1) df: the dataframe of interest, 
# 2) IDcol: an identifier column that the dataframe df should have, which has a unique value per row. 
# The function returns a list with 2 dataframes: train, test.
# train: has a random sample of 70% of the rows of the original dataframe df.
# test: contains the remaining 30% of the original dataframe df. 

train_test_split <- function(df, IDcol) {
  IDcol_str <- deparse(substitute(IDcol))
  train <- df[sample(nrow(df), floor(0.7 * nrow(df)), replace = FALSE),]
  test <- anti_join(df, train, by = IDcol_str)
  result <- list(train = train, test = test)
  return(result)
}
# Creating a train and a test dataset using the previously created function
results <- train_test_split(hw2.data, PID)

# Extracting the created datasets 
train<- results$train
test<- results$test
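A quick sanity check on the split (a minimal sketch, not part of the original run); calling set.seed with an arbitrary value before train_test_split would also make the split reproducible:

nrow(train); nrow(test)                  # should be roughly 70% and 30% of nrow(hw2.data)
length(intersect(train$PID, test$PID))   # should be 0: no individual appears in both sets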

Question 6

What type of regression is a good first statistical model to use for binary classification (such as 0/1 status of coronary artery disease)? [Hint: It’s not linear regression.] How is this model different than linear regression?

  • your answer here Logistic regression. The outcome is binary, so logistic regression models the probability that cad = 1 through the logit (log-odds) link, keeping predictions between 0 and 1; it is a classifier with a qualitative response, whereas linear regression models a quantitative response directly and can produce predictions outside [0, 1].
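For reference, the logistic model writes the log-odds of CAD as a linear function of the predictors (standard textbook form, not specific to this assignment):

\[\log\frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p, \qquad p(X) = \Pr(\text{cad} = 1 \mid X).\]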

Question 7

Write R code to fit the model you named in Q6 to predict “cad” from labs and demographics. You should use “train_test_split” from Q5 to fit the model on the “train” data with R’s “glm” function. Evaluate the model’s performance on the “test” data and report the accuracy and plot an ROC curve. Report the AUC. Hint: use the “ROCR” package.

# your code here

# Fitting the model: using the "train" dataset. 
# Predictors: all blood work and basic demographics (age, gender)

# Building a logistic regression model with all the variables in hand (removing the previously added identifier column PID)

CAD_pred <- glm(cad~ . -PID, data= train, family= binomial)
# Making prediction using the previous model on the test dataset 
# The type='response' is to output probabilities of the form P(Y=1|X)

cad.predict <- predict(CAD_pred, type = "response", newdata=test)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Creating a prediction object
cad.prediction <- prediction(cad.predict, test$cad)

# Create ROC curve
cad.performance <- performance(cad.prediction, "tpr", "fpr")
plot(cad.performance, col='blue',main= "ROC of CAD prediction using the test dataset")

  • your answer here Accuracy depends on the probability cutoff used to call a prediction positive: ROCR returns one accuracy value per cutoff (each a function of TP, TN, P, N). Once a threshold is chosen, a single accuracy can be reported; the best achievable accuracy on the test set is about 0.83.
accuracy_ROCR <- performance(cad.prediction, measure = "acc")
accuracy_ROCR <- accuracy_ROCR@y.values[[1]]
# accuracy_ROCR contains one accuracy value per cutoff; report the maximum
max(accuracy_ROCR)
## [1] 0.8333333
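For a single headline number, a common convention is the 0.5 probability cutoff; a minimal sketch, assuming cad.predict holds the predicted probabilities from above and test$cad the true labels (not run here):

pred_class <- as.integer(cad.predict > 0.5)  # classify as CAD if predicted probability > 0.5
mean(pred_class == test$cad)                 # proportion of correct predictions on the test set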

Area under the curve (AUC):

auc_ROCR <- performance(cad.prediction, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
## [1] 0.8964427

Question 8

Is this a good predictor? What would a random CAD predictor yield? Write R code to verify the accuracy of a random prediction.

# your code here
n <- nrow(test)
random_predictions <- sample(c(0,1), n, replace = TRUE)

# Calculate AUC
cad.prediction.random <- prediction(random_predictions, test$cad)
auc_ROCR.random <- performance(cad.prediction.random, measure = "auc")
auc_ROCR.random <- auc_ROCR.random@y.values[[1]]
auc_ROCR.random
## [1] 0.5244643
  • your answer here The area under the ROC curve (AUC) summarizes model performance across all possible threshold values; it ranges from 0.5 (random guessing) to 1 (perfect classification). A random predictor therefore yields an AUC of about 0.5, because it is equivalent to randomly guessing whether a given individual has CAD. This model's AUC is about 0.90, meaning a randomly chosen case is ranked above a randomly chosen control roughly 90% of the time, so it is a reasonably good predictor. Whether it is good enough depends on the clinical question; given the seriousness of a CAD diagnosis, the misclassifications that remain at any chosen threshold may still be considered too many.

The code above reports an AUC of about 0.52 for the random predictor, consistent with the expected value of 0.5 for random guessing.

Question 9

You suggest that a model that has some regularization may be helpful. Explain this idea to your clinical colleague. Also indicate which of the pictures below depicts L1 and which depicts L2 regularization.

Question 9

  • your answer here Shrinkage (regularization) methods fit all predictors but shrink the estimated coefficients toward zero relative to the least-squares estimates; this reduces variance at the cost of a little bias, which often improves prediction on new data. A. is the lasso (L1): the RSS ellipse tends to touch the diamond-shaped constraint region at a corner on an axis, so some coefficients are set exactly to zero (a sparse model). B. is ridge regression (L2): the constraint region is a disk, so coefficients are shrunk toward zero but essentially never equal zero exactly. A sketch of the two penalized objectives follows below.
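For reference, a minimal sketch of the two penalized least-squares objectives (standard textbook forms, not from the assignment):

\[\text{lasso (L1):}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\]

\[\text{ridge (L2):}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\]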

Question 10

Use the “glmnet” package to regularize your model [Hint: see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html; R package already installed on the server] with the same “train” “test” datasets. Report the same performance measures as in Q7 and plot an ROC curve. Compare model performance for the non-regularized (Q7) and the regularized model and speculate on why performance did/did not improve.

# your code here
# Reference: 6.5 Lab: Linear Models and Regularization Methods, An introduction to statistical learning with applications in R (2nd edition)
x <- model.matrix(cad ~. -PID, train)[,-1] # to remove the intercept column 
y <- train$cad
grid <- 10^seq(10,-2,length=100) # grid of candidate lambda values spanning a wide range

lasso.mod <- cv.glmnet(x, y, alpha = 1, lambda = grid) # alpha = 1 selects the lasso penalty
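# Note (assumption about the setup, not part of the original run): cv.glmnet defaults to
# family = "gaussian", so this fits a penalized linear model to the 0/1 outcome (a linear
# probability model). For a penalized logistic fit one would pass family = "binomial" and
# predict with type = "response"; the linear scores used below still give a valid ROC ranking.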

# Extract best lambda
best_lambda <- lasso.mod$lambda.min
best_lambda
## [1] 0.01
# Using the best lambda 
xtest <- model.matrix(cad ~. -PID, test)[,-1] 
lasso.predict <- predict(lasso.mod, s=best_lambda, newx=xtest)
# Create a prediction object
lasso.prediction <- prediction(lasso.predict, test$cad)

# Create ROC curve
lasso.performance <- performance(lasso.prediction, "tpr", "fpr")

# Plot ROC curve
plot(lasso.performance, col = "blue", main = "ROC of Lasso model for CAD prediction")

# Calculate AUC
lasso.auc <- performance(lasso.prediction, measure = "auc")
lasso.auc <- lasso.auc@y.values[[1]]
lasso.auc
## [1] 0.8902018
  • your answer here The lasso model’s AUC is about 0.89, essentially the same as the unregularized model from Q7 (about 0.90).
# Get coefficients for best lambda
lasso_coef <- predict(lasso.mod, s=best_lambda, type="coefficients")

# Print coefficients
print(lasso_coef)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -0.5511737165
## LBXWBCSI     0.0013133145
## LBXLYPCT    -0.0013935774
## LBXMOPCT     .           
## LBXNEPCT     .           
## LBXEOPCT     0.0038969653
## LBXBAPCT     .           
## LBXRBCSI     .           
## LBXHGB       .           
## LBXHCT       .           
## LBXMCVSI     .           
## LBXMCHSI     .           
## LBXMC        0.0234669675
## LBXRDW       0.0066315713
## LBXPLTSI    -0.0003249680
## LBXMPSI      .           
## LBXSAL       .           
## LBXSATSI    -0.0008750031
## LBXSASSI    -0.0002727434
## LBXSAPSI     0.0001228363
## LBXSBU       .           
## LBXSCA       .           
## LBXSCH      -0.0013152377
## LBXSC3SI     .           
## LBXSGTSI     .           
## LBXSGL       .           
## LBXSIR       .           
## LBXSLDSI     0.0005888535
## LBXSPH       .           
## LBXSTB       0.0184442759
## LBXSTP       .           
## LBXSTR       0.0001229009
## LBXSUA       0.0147508369
## LBXSCR       0.0191388881
## LBXSNASI     .           
## LBXSKSI      0.0612453579
## LBXSCLSI    -0.0067452380
## LBXSOSSI     .           
## LBXSGB       .           
## LBXTC        .           
## LBXGH        0.0226604543
## LBXBPB       0.0043129060
## LBXBCD       0.0190560586
## LBXTHG       .           
## LBXIHG       .           
## LBXCOT       .           
## LBXRBFSI     .           
## LBXFOLSI     .           
## age          0.0099525690
## gender      -0.0574739128

The lasso removed a number of variables by shrinking their coefficients exactly to zero (counted in the quick check below). A lasso model can drop many coefficients while achieving predictive performance similar to the full model, especially when many predictors are weak or redundant: the retained predictors carry most of the signal, so the overall fit is about as good. (Or it could be that I completely messed up the code :) )
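A quick check of that claim (a minimal sketch, using the lasso_coef object printed above):

# Count how many predictors the lasso kept vs. shrank exactly to zero (excluding the intercept)
sum(lasso_coef[-1, 1] != 0)
sum(lasso_coef[-1, 1] == 0)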

Part 2: Unsupervised Learning

Now let’s switch gears to unsupervised learning. First we’ll simulate a dataset with known correlation structure and then we’ll explore the same coronary artery disease dataset from Part 1 using techniques from unsupervised learning.

Question 11

Consider a toy dataset consisting of four variables, x1, x2, x3, and x4. Write an R function called “simulate” that simulates 10,000 observations of such a dataset with only two constraints: that the correlation = 0.8 between x1 and x2 and the correlation = 0.55 between x3 and x4 [Hint: use the mvrnorm function from the R package “MASS” already installed on the server].

# your code here
# I used chatgpt for help.
library(MASS)


simulate <- function() {
  # Set the means and variances for each variable
  mu <- c(0, 0, 0, 0)
  sigma <- matrix(c(1, 0.8, 0, 0,
                     0.8, 1, 0, 0,
                     0, 0, 1, 0.55,
                     0, 0, 0.55, 1), nrow = 4)
  
  # Generate 10,000 observations from a multivariate normal distribution with the specified means and variances
  data <- mvrnorm(n = 10000, mu = mu, Sigma = sigma)
  
  # Convert the matrix to a data frame with variable names
  colnames(data) <- c("x1", "x2", "x3", "x4")
  data <- as.data.frame(data)
  
  return(data)
}
simulated_data <- simulate()
cor(simulated_data$x1,simulated_data$x2)
## [1] 0.8068266
cor(simulated_data$x3,simulated_data$x4)
## [1] 0.5549626
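As a further check (a minimal sketch, not part of the original run), the full correlation matrix should be close to the identity except for the two specified pairs:

round(cor(simulated_data), 2)  # expect ~0.8 for (x1, x2), ~0.55 for (x3, x4), ~0 elsewhere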

Question 12

Regress x2 ~ x1 and x1 ~ x2, report the intercepts and slopes and R^2 values, and compare the results with what you expected based on writing “simulate”. What are the two separate regression equations optimizing?

# your code here
first_reg <- lm(x2~x1, data=simulated_data)
summary(first_reg)
## 
## Call:
## lm(formula = x2 ~ x1, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08910 -0.40130  0.00427  0.39697  2.20416 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005324   0.005987  -0.889    0.374    
## x1           0.814417   0.005964 136.554   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5987 on 9998 degrees of freedom
## Multiple R-squared:  0.651,  Adjusted R-squared:  0.6509 
## F-statistic: 1.865e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
Second_reg <- lm(x3~x4, data=simulated_data)
summary(Second_reg)
## 
## Call:
## lm(formula = x3 ~ x4, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.10200 -0.55785  0.00059  0.54618  3.05650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0006799  0.0082670  -0.082    0.934    
## x4           0.5535626  0.0082986  66.706   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8267 on 9998 degrees of freedom
## Multiple R-squared:  0.308,  Adjusted R-squared:  0.3079 
## F-statistic:  4450 on 1 and 9998 DF,  p-value: < 2.2e-16
  • your answer here For x2 ~ x1, the intercept is close to 0 and the slope is about 0.81, essentially the sample correlation (both variables were simulated with unit variance), and R² ≈ 0.651 ≈ 0.8068², all as expected from the simulation. For x3 ~ x4, the intercept is again near 0, the slope is about 0.55, and R² ≈ 0.308 ≈ 0.5550², also as expected. (The reverse regression x1 ~ x2 asked for in the question is sketched after this answer.)

Each regression fits a line by ordinary least squares: it minimizes the sum of squared residuals in its own response variable, so the two regressions optimize different objectives. Regressing x2 on x1 minimizes squared errors in x2 (vertical distances), whereas regressing x1 on x2 would minimize squared errors in x1.

The first regression models how well x1 predicts x2: for every one-unit increase in x1, x2 increases by about 0.81 units on average, and about 65% of the variance in x2 is explained by x1.

The second regression models how well x4 predicts x3: for every one-unit increase in x4, x3 increases by about 0.55 units on average, and about 31% of the variance in x3 is explained by x4.
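A minimal sketch of the reverse regression the question also asks for (not run here); since both variables have unit variance, its slope should again be close to the correlation, about 0.81:

reverse_reg <- lm(x1 ~ x2, data = simulated_data)
summary(reverse_reg)  # expected slope ≈ 0.81, R^2 ≈ 0.65; this fit minimizes squared errors in x1, not x2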

Question 13

Now run principal components analysis (PCA) using the “prcomp” function on (x1,x2,x3,x4). Plot PC1 and PC2. Does PCA identify interesting structure? What is PCA optimizing?

# your code here
pca <- prcomp(simulated_data, scale=TRUE)
# plot the first two principal components 
biplot(pca, scale=0)

  • your answers here to both parts Points that plot close together have similar patterns across the original variables; as expected from the simulation, x1 and x2 group together (they are correlated), as do x3 and x4, so PCA does recover the structure we built in (see the loading check below). PCA is optimizing variance: each successive principal component is the orthogonal linear combination of the variables with the largest possible variance, which is equivalent to minimizing the squared reconstruction error of the data.
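A numeric check of that structure (a minimal sketch, not part of the original run): the loading matrix should show x1/x2 weighting one leading component and x3/x4 another.

round(pca$rotation, 2)  # loadings of x1..x4 on PC1-PC4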

Question 14

Plot the variance explained by successive principal components in this dataset.

# Calculate the proportion of variance explained by each principal component
pr.var <- pca$sdev^2
# Proportion of variance 
pve <- pr.var/sum(pr.var)
pve
## [1] 0.45175010 0.38870823 0.11127313 0.04826854
# Create a data frame with the PVE for each PC
pve_df <- data.frame(PC = 1:length(pve), PVE = pve)

# Plot the PVE for each PC
ggplot(pve_df, aes(x = PC, y = PVE)) +
  geom_line(color='blue') +
  labs(x = "Principal Component", y = "Proportion of Variance Explained") +
  scale_x_continuous(breaks = 1:length(pve)) +
  theme_bw()

# The cumulative proportion of variance explained by successive PCs
# Build a dataframe from cumsum(pve) rather than hard-coded values
 df_sum <- data.frame(
  x = c("PC1", "PC1+PC2", "PC1+PC2+PC3", "PC1+PC2+PC3+PC4"),
  y = round(cumsum(pve) * 100, 1)
)

# Plot the cumulative PVE
ggplot(df_sum, aes(x = x, y = y)) +
   geom_bar(stat = "identity", fill = "#6495ED", color = "black", alpha = 0.3, width = 0.4) +
  labs(x = "Principal Component", y = "Cumulative Proportion of Variance Explained (%)") +
  theme_bw()

Question 15

How many principal components summarize these data well? Explain your reasoning.

  • your answer here Two principal components summarize these data well. As seen above, the first principal component explains about 45% of the variance and the second about 39%; together they explain roughly 84% of the total variability, so retaining these two PCs captures most of the structure in the data.

Question 16

Run PCA on all the blood labs in the full “hw2.data” dataset (make sure to select only variables that start with ‘LBX’). Make sure to center and scale PCA. Why is it important to center and scale? Plot the variance explained by principal component.

# your code here
library(dplyr)
features <- hw2.data %>%   select(starts_with("LBX"))

# running pca analysis
pca2 <- prcomp(features, scale=TRUE, center=TRUE)

# Calculate the proportion of variance explained by each principal component
pr.var2 <- pca2$sdev^2
# Proportion of variance 
pve2 <- round(pr.var2/sum(pr.var2)*100,1)

# Plotting the variance explained by each PC.
# Create a data frame with the PVE for each PC
barplot(pve2, main='Scree Plot', xlab='Principal Component', ylab='Percent Variation')
axis(side = 1, at = seq_along(pve2), labels = seq_along(pve2))

pve2
##  [1] 9.6 7.6 7.3 5.5 5.3 4.4 4.2 4.1 3.6 3.3 3.1 3.0 2.7 2.6 2.4 2.3 2.2 2.1 1.9
## [20] 1.8 1.8 1.7 1.6 1.4 1.4 1.3 1.3 1.2 1.2 1.1 1.1 1.0 0.9 0.9 0.8 0.7 0.5 0.4
## [39] 0.4 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  • your answer here Scaling the variables ensures that each variable contributes equally to the PCA and that the results are not dominated by variables with large values. Centering the data (i.e., subtracting the mean) ensures that the first principal component represents the direction of maximum variance in the data. If the data are not centered, the first principal component may represent the mean of the data rather than the direction of maximum variance.
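To make the "how many components" question in Q17 concrete (a minimal sketch, not part of the original run), one can count the components needed to reach a chosen variance threshold, e.g. 80%:

cum_pve2 <- cumsum(pve2)      # pve2 is already expressed in percent
which(cum_pve2 >= 80)[1]      # number of PCs needed to explain at least 80% of the variance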

Question 17

How many principal components summarize this data well? Plot PC1 and PC2 and color the projected points by CAD status. Is there structure visible in this plot by CAD status? Do these results align with your expectations from the supervised analysis conducted earlier?

# your code here


# Transforming the principal components into a dataframe. 
pca2.data <- data.frame(pca2$x)
pca2.data$plotx <- pca2.data[,1]
pca2.data$ploty <- pca2.data[,2]

# Adding CAD variable to the dataframe
cad <- as.factor(hw2.data$cad)
pca2.data$CAD <- cad
pca2.plot <- ggplot(pca2.data, aes(x = plotx, y = ploty, color = CAD)) +
  geom_point() +
  ggtitle('PCA with CAD status') +
  xlab('PC1') + ylab('PC2') +
  coord_fixed(ratio = 1) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank()) +
  stat_ellipse(geom = 'polygon', aes(fill = after_scale(alpha(colour, 0.3))))


pca2.plot

  • your answer here The number of principal components to retain depends on the threshold chosen for the proportion of variance that must be explained; for an 80% threshold, the scree plot and cumulative PVE suggest roughly the first 21 components. In the PC1 vs PC2 plot the structure by CAD status is not very visible: there is substantial overlap between the two groups. This does not fully align with the supervised results, where the logistic and lasso models reached an AUC of about 0.9, indicating the labs and demographics can distinguish the two groups reasonably well. The discrepancy is not surprising: the two methods optimize different things, and PCA is unsupervised, finding directions of maximum overall variance in the labs, which need not be the directions most predictive of CAD, so features important for prediction may not be strongly represented in the first two components. (Or it could be that my analysis is completely flawed :) )

Done, but never truly finished.