library(glmnet)
library(ROCR)
library(MASS)
Please check Canvas for the due date of this assignment. Please be sure to answer all parts of each question and submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.
In class, we discussed two broad subfields of machine learning: supervised learning and unsupervised learning. In a typical supervised learning setup, we use a set of labeled input data points to try to learn a robust function that maps future inputs to labels. By contrast, in unsupervised learning, we do not have labels for our training data; instead, we seek to understand interesting properties of the data, such as the correlation structure between variables or whether high-dimensional data points can be projected into a lower-dimensional space without much loss of information. In this assignment, we will explore both approaches to machine learning using the same dataset.
Your clinician collaborator has collected a dataset of blood laboratory values, basic demographic information (age, gender), and coronary artery disease status from a population of ~1K individuals. The data is available in hw2_2022.rds.
library(tidyverse)
hw2.data <- readRDS('hw2_2022.rds')
This dataset derives from CDC NHANES and consists of blood lab values (prefixed with ‘LBX’) as well as coronary artery disease (CAD) status (variable “cad” with 0 = absent, 1 = present), age, and gender. Write a code snippet to count the number of cases and controls in the dataset, and report these two numbers here:
# your code here
# Count controls (cad == 0) and cases (cad == 1)
table(hw2.data$cad)
##
## 0 1
## 684 395
There are 684 controls (cad = 0) and 395 cases (cad = 1).
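Class balance is worth keeping in mind for later accuracy comparisons; a quick sketch (684/1079 ≈ 63% controls, so a majority-class predictor would already reach about 63% accuracy):
# Proportion of controls vs. cases in the full dataset
prop.table(table(hw2.data$cad))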
Your clinician collaborator asks you to build a new lab-based “score” that can tell whether a new individual has CAD or not given their laboratory values, gender, and age. Is this a prediction or inference task?
This is a prediction task: the goal is to accurately classify new individuals, rather than to interpret how individual predictors relate to CAD.
Your collaborator suggests you use the whole dataset, try as many complex models as you can run, and select the best one to publish. You inform your collaborator that we could do this but it might be a bad idea because of a trade-off called: the bias-variance trade-off.
Write down the equation for the trade-off from Q3.
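Assuming Q3 refers to the bias-variance trade-off, the standard decomposition of expected test error at a point $x_0$ is:

$$
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
$$

where the last term is the irreducible error: more flexible models tend to lower bias but raise variance, and vice versa.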
You suggest to your collaborator that you should divide the sample into a training and test dataset. Write a function called “train_test_split” that accepts “hw2.data” as an input and returns an R list object containing two “tibbles”: “train” and “test” in a 70:30 random split.
# Add an identification column to the dataset: PID
hw2.data <- hw2.data %>% mutate(PID = row_number())
# your code here
# The function "train_test_split" has 2 arguments: 1) df: the dataframe of interest,
# 2) IDcol: an identifier column of df that has a unique value per row.
# It returns a list with 2 dataframes:
#   train: a random sample of 70% of the rows of df.
#   test:  the remaining 30% of the rows of df.
train_test_split <- function(df, IDcol) {
  IDcol_str <- deparse(substitute(IDcol))
  train <- df[sample(nrow(df), floor(0.7 * nrow(df)), replace = FALSE), ]
  test <- anti_join(df, train, by = IDcol_str)
  list(train = train, test = test)
}
# Create the train and test datasets using the function above
results <- train_test_split(hw2.data, PID)
# Extract the two datasets
train <- results$train
test <- results$test
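One caveat: sample() draws a different split on every knit. A minimal sketch for making the split reproducible (the seed value 42 is arbitrary):
# Fixing the RNG seed before splitting makes the train/test split reproducible
set.seed(42)
results <- train_test_split(hw2.data, PID)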
What type of regression is a good first statistical model to use for binary classification (such as 0/1 status of coronary artery disease)? [Hint: It’s not linear regression.] How is this model different from linear regression?
Logistic regression. It models the log-odds of the binary outcome as a linear function of the predictors, so predictions pass through the logistic (sigmoid) function and are probabilities in [0, 1], whereas linear regression predicts an unbounded continuous response.
Write R code to fit the model you named in Q6 to predict “cad” from labs and demographics. You should use “train_test_split” from Q5 to fit the model on the “train” data with R’s “glm” function. Evaluate the model’s performance on the “test” data and report the accuracy and plot an ROC curve. Report the AUC. Hint: use the “ROCR” package.
# your code here
# Fit the model on the "train" dataset.
# Predictors: all blood labs and basic demographics (age, gender),
# excluding the previously added identifier column PID.
CAD_pred <- glm(cad ~ . - PID, data = train, family = binomial)
# Predict on the test dataset; type = "response" returns probabilities P(Y = 1 | X)
cad.predict <- predict(CAD_pred, type = "response", newdata = test)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
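The rank-deficiency warning plausibly reflects collinearity among the labs (some values are near-linear combinations of others; hematocrit and hemoglobin, for example, are tightly coupled). A quick diagnostic sketch to list any coefficients glm dropped as exactly collinear:
# Coefficients reported as NA were dropped by glm due to collinearity
names(which(is.na(coef(CAD_pred))))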
# Create a prediction object
cad.prediction <- prediction(cad.predict, test$cad)
# Create the ROC curve
cad.performance <- performance(cad.prediction, "tpr", "fpr")
plot(cad.performance, col = 'blue', main = "ROC of CAD prediction using the test dataset")
# "acc" gives accuracy at every probability cutoff; report the best achievable value
accuracy_ROCR <- performance(cad.prediction, measure = "acc")
max(accuracy_ROCR@y.values[[1]])
## [1] 0.8333333
The model reaches a best accuracy of about 0.83 across probability cutoffs.
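If a single accuracy number at the conventional 0.5 probability cutoff is preferred over the cutoff-dependent maximum above, a minimal sketch (output omitted; the value depends on the random split):
# Classify as CAD when the predicted probability exceeds 0.5, then score against truth
pred.class <- as.integer(cad.predict > 0.5)
mean(pred.class == test$cad)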
Area under the curve:
auc_ROCR <- performance(cad.prediction, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
## [1] 0.8964427
Is this a good predictor? What would a random CAD predictor yield? Write R code to verify the accuracy of a random prediction.
# your code here
n <- nrow(test)
random_predictions <- sample(c(0,1), n, replace = TRUE)
# Calculate AUC
cad.prediction.random <- prediction(random_predictions, test$cad)
auc_ROCR.random <- performance(cad.prediction.random, measure = "auc")
auc_ROCR.random <- auc_ROCR.random@y.values[[1]]
auc_ROCR.random
## [1] 0.5244643
The logistic model, with an AUC of about 0.90, is a good predictor. A random predictor yields an AUC near 0.5 (here 0.52): it ranks a randomly chosen case above a randomly chosen control only about half the time, no better than chance.
You suggest that a model that has some regularization may be helpful. Explain this idea to your clinical colleague. Also indicate which of the pictures below depicts L1 and which depicts L2 regularization.
Regularization adds a penalty on coefficient size to the model's loss function, trading a little bias for less variance so the model generalizes better to new patients. L1 (lasso) regularization can shrink coefficients exactly to zero, effectively selecting variables, while L2 (ridge) shrinks them smoothly toward zero without eliminating them.
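For concreteness, a minimal sketch contrasting the two penalties with glmnet, where alpha selects the penalty type (x.demo and y.demo are hypothetical names; the model matrix is built the same way as in Q9 below):
# Build the design matrix from the training data (drop intercept column and PID)
x.demo <- model.matrix(cad ~ . - PID, train)[, -1]
y.demo <- train$cad
ridge.demo <- cv.glmnet(x.demo, y.demo, alpha = 0) # L2/ridge: shrinks smoothly, rarely to zero
lasso.demo <- cv.glmnet(x.demo, y.demo, alpha = 1) # L1/lasso: can zero out coefficients entirely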
Question 9
Use the “glmnet” package to regularize your model [Hint: see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html; R package already installed on the server] with the same “train” and “test” datasets. Report the same performance measures as in Q7 and plot an ROC curve. Compare model performance for the non-regularized (Q7) and the regularized model and speculate on why performance did/did not improve.
# your code here
# Reference: 6.5 Lab: Linear Models and Regularization Methods,
# An Introduction to Statistical Learning with Applications in R (2nd edition)
x <- model.matrix(cad ~ . - PID, train)[, -1] # drop the intercept column
y <- train$cad
grid <- 10^seq(10, -2, length = 100) # grid spanning a wide range of candidate lambdas
# alpha = 1 selects the lasso penalty. (Note: as written this fits a gaussian
# model to the 0/1 outcome; adding family = "binomial" would give a logistic
# lasso matching Q7.)
lasso.mod <- cv.glmnet(x, y, alpha = 1, lambda = grid)
# Extract the lambda that minimizes cross-validated error
best_lambda <- lasso.mod$lambda.min
best_lambda
## [1] 0.01
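Note that 0.01 is the lower bound of the lambda grid, so the cross-validated optimum sits on the boundary and the true minimum may be smaller. A sketch extending the grid downward to check (cross-validation folds are random, so the exact value will vary):
# Re-fit with a grid that extends below 1e-2
grid2 <- 10^seq(0, -4, length = 100)
lasso.check <- cv.glmnet(x, y, alpha = 1, lambda = grid2)
lasso.check$lambda.min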
# Use the best lambda on the test set
xtest <- model.matrix(cad ~ . - PID, test)[, -1]
lasso.predict <- predict(lasso.mod, s = best_lambda, newx = xtest)
# Create a prediction object
lasso.prediction <- prediction(lasso.predict, test$cad)
# Create the ROC curve
lasso.performance <- performance(lasso.prediction, "tpr", "fpr")
# Plot the ROC curve
plot(lasso.performance, col = "blue", main = "ROC of lasso model for CAD prediction")
# Calculate AUC
lasso.auc <- performance(lasso.prediction, measure = "auc")
lasso.auc <- lasso.auc@y.values[[1]]
lasso.auc
## [1] 0.8902018
# Get coefficients at the best lambda
lasso_coef <- predict(lasso.mod, s = best_lambda, type = "coefficients")
# Print coefficients
print(lasso_coef)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.5511737165
## LBXWBCSI 0.0013133145
## LBXLYPCT -0.0013935774
## LBXMOPCT .
## LBXNEPCT .
## LBXEOPCT 0.0038969653
## LBXBAPCT .
## LBXRBCSI .
## LBXHGB .
## LBXHCT .
## LBXMCVSI .
## LBXMCHSI .
## LBXMC 0.0234669675
## LBXRDW 0.0066315713
## LBXPLTSI -0.0003249680
## LBXMPSI .
## LBXSAL .
## LBXSATSI -0.0008750031
## LBXSASSI -0.0002727434
## LBXSAPSI 0.0001228363
## LBXSBU .
## LBXSCA .
## LBXSCH -0.0013152377
## LBXSC3SI .
## LBXSGTSI .
## LBXSGL .
## LBXSIR .
## LBXSLDSI 0.0005888535
## LBXSPH .
## LBXSTB 0.0184442759
## LBXSTP .
## LBXSTR 0.0001229009
## LBXSUA 0.0147508369
## LBXSCR 0.0191388881
## LBXSNASI .
## LBXSKSI 0.0612453579
## LBXSCLSI -0.0067452380
## LBXSOSSI .
## LBXSGB .
## LBXTC .
## LBXGH 0.0226604543
## LBXBPB 0.0043129060
## LBXBCD 0.0190560586
## LBXTHG .
## LBXIHG .
## LBXCOT .
## LBXRBFSI .
## LBXFOLSI .
## age 0.0099525690
## gender -0.0574739128
The lasso removed a number of variables by shrinking their coefficients exactly to zero, yet its test AUC (0.890) is essentially the same as the unregularized model's (0.896). This is expected when many predictors are weak or redundant: the retained predictors carry most of the signal, so the sparser model predicts nearly as well as the full one.
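A quick sparsity check on the coefficient matrix printed above (by inspection, 22 of the 49 predictors are retained):
# Count nonzero coefficients at the best lambda, excluding the intercept
sum(lasso_coef != 0) - 1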
Now let’s switch gears to unsupervised learning. First we’ll simulate a dataset with known correlation structure and then we’ll explore the same coronary artery disease dataset from Part 1 using techniques from unsupervised learning.
Consider a toy dataset consisting of four variables, x1, x2, x3, and x4. Write an R function called “simulate” that simulates 10,000 observations of such a dataset with only two constraints: that the correlation = 0.8 between x1 and x2 and the correlation = 0.55 between x3 and x4 [Hint: use the mvrnorm function from the R package “MASS” already installed on the server].
# your code here
# I used ChatGPT for help.
simulate <- function() {
  # Means (all zero) and a covariance matrix encoding the two constraints:
  # cor(x1, x2) = 0.8 and cor(x3, x4) = 0.55 (unit variances, so cov = cor)
  mu <- c(0, 0, 0, 0)
  sigma <- matrix(c(1,   0.8, 0,    0,
                    0.8, 1,   0,    0,
                    0,   0,   1,    0.55,
                    0,   0,   0.55, 1), nrow = 4)
  # Generate 10,000 observations from a multivariate normal distribution
  # with the specified means and covariances (mvrnorm is from MASS, loaded above)
  data <- mvrnorm(n = 10000, mu = mu, Sigma = sigma)
  # Convert the matrix to a data frame with variable names
  colnames(data) <- c("x1", "x2", "x3", "x4")
  as.data.frame(data)
}
simulated_data <- simulate()
cor(simulated_data$x1,simulated_data$x2)
## [1] 0.8068266
cor(simulated_data$x3,simulated_data$x4)
## [1] 0.5549626
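As a fuller check, the entire empirical correlation matrix should reproduce the two specified blocks, with near-zero entries everywhere else:
# Full 4x4 correlation matrix of the simulated data
round(cor(simulated_data), 2)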
Regress x2 ~ x1 and x1 ~ x2, report the intercepts and slopes and R^2 values, and compare the results with what you expected based on writing “simulate”. What are the two separate regression equations optimizing?
# your code here
first_reg <- lm(x2~x1, data=simulated_data)
summary(first_reg)
##
## Call:
## lm(formula = x2 ~ x1, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08910 -0.40130 0.00427 0.39697 2.20416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005324 0.005987 -0.889 0.374
## x1 0.814417 0.005964 136.554 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5987 on 9998 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.6509
## F-statistic: 1.865e+04 on 1 and 9998 DF, p-value: < 2.2e-16
Second_reg <- lm(x3~x4, data=simulated_data)
summary(Second_reg)
##
## Call:
## lm(formula = x3 ~ x4, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10200 -0.55785 0.00059 0.54618 3.05650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0006799 0.0082670 -0.082 0.934
## x4 0.5535626 0.0082986 66.706 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8267 on 9998 degrees of freedom
## Multiple R-squared: 0.308, Adjusted R-squared: 0.3079
## F-statistic: 4450 on 1 and 9998 DF, p-value: < 2.2e-16
Each regression is fit by ordinary least squares, which minimizes the sum of squared residuals in its own response variable: x2 ~ x1 minimizes prediction errors in x2, while x3 ~ x4 minimizes prediction errors in x3. This is why the two directions of a regression are not mirror images of each other.
The first regression (x2 ~ x1): intercept ≈ 0, slope ≈ 0.81, R^2 ≈ 0.65. Because the variables were simulated with unit variance, the expected slope equals the correlation (0.8) and the expected R^2 equals its square (0.64), so the fit matches what "simulate" built in.
The second regression (x3 ~ x4): intercept ≈ 0, slope ≈ 0.55, R^2 ≈ 0.31, again matching the specified correlation (0.55) and its square (0.3025).
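A sanity-check sketch for the unit-variance relationships cited above (OLS slope ≈ correlation, R^2 ≈ squared correlation):
# With unit variances, R^2 should be close to the squared target correlations
cor(simulated_data$x1, simulated_data$x2)^2 # expected near 0.8^2 = 0.64
cor(simulated_data$x3, simulated_data$x4)^2 # expected near 0.55^2 = 0.30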
Now run principal components analysis (PCA) using the “prcomp” function on (x1,x2,x3,x4). Plot PC1 and PC2. Does PCA identify interesting structure? What is PCA optimizing?
# your code here
# Run PCA on the standardized simulated variables
pca <- prcomp(simulated_data, scale = TRUE)
# Plot the first two principal components
biplot(pca, scale = 0)
PCA is optimizing for the orthogonal linear combinations of the variables with maximal variance, each successive component capturing the most variance remaining. It does identify the simulated structure here: the leading components align with the two correlated pairs, (x1, x2) and (x3, x4), which is why PC1 and PC2 dominate the variance explained.
Plot the variance explained by successive principal components in this dataset.
# Calculate the proportion of variance explained by each principal component
pr.var <- pca$sdev^2
# Proportion of variance
pve <- pr.var/sum(pr.var)
pve
## [1] 0.45175010 0.38870823 0.11127313 0.04826854
# Create a data frame with the PVE for each PC
pve_df <- data.frame(PC = 1:length(pve), PVE = pve)
# Plot the PVE for each PC
ggplot(pve_df, aes(x = PC, y = PVE)) +
geom_line(color='blue') +
labs(x = "Principal Component", y = "Proportion of Variance Explained") +
scale_x_continuous(breaks = 1:length(pve)) +
theme_bw()
# Cumulative proportion of variance explained by successive PCs
df_sum <- data.frame(
  x = c("PC1", "PC1+PC2", "PC1+PC2+PC3", "PC1+PC2+PC3+PC4"),
  y = cumsum(pve) * 100
)
# Plot the cumulative PVE
ggplot(df_sum, aes(x = x, y = y)) +
  geom_bar(stat = "identity", fill = "#6495ED", color = "black", alpha = 0.3, width = 0.4) +
  labs(x = "Principal Component", y = "Cumulative Proportion of Variance Explained (%)") +
  theme_bw()
How many principal components summarize these data well? Explain your reasoning.
Two. PC1 and PC2 together explain roughly 84% of the variance, mirroring the two correlated pairs built into the simulation, while PC3 and PC4 mostly capture the residual, uncorrelated variation.
Run PCA on all the blood labs in the full “hw2.data” dataset (make sure to select only variables that start with ‘LBX’). Make sure to center and scale PCA. Why is it important to center and scale? Plot the variance explained by principal component.
Centering and scaling matter because the labs are measured in different units with very different variances; without standardization, the variables with the largest raw variances would dominate the leading components regardless of how informative they actually are.
# your code here
# Select the lab variables (prefixed with "LBX"); dplyr::select is called
# explicitly in case another loaded package masks select()
features <- hw2.data %>% dplyr::select(starts_with("LBX"))
# Run PCA with centering and scaling
pca2 <- prcomp(features, scale = TRUE, center = TRUE)
# Proportion of variance explained by each principal component
pr.var2 <- pca2$sdev^2
pve2 <- round(pr.var2 / sum(pr.var2) * 100, 1)
# Scree plot; barplot() returns the bar midpoints, which the x-axis labels need
bp <- barplot(pve2, main = 'Scree Plot', xlab = 'Principal Component', ylab = 'Percent Variation')
axis(side = 1, at = bp, labels = seq_along(pve2))
pve2
## [1] 9.6 7.6 7.3 5.5 5.3 4.4 4.2 4.1 3.6 3.3 3.1 3.0 2.7 2.6 2.4 2.3 2.2 2.1 1.9
## [20] 1.8 1.8 1.7 1.6 1.4 1.4 1.3 1.3 1.2 1.2 1.1 1.1 1.0 0.9 0.9 0.8 0.7 0.5 0.4
## [39] 0.4 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
How many principal components summarize this data well? Plot PC1 and PC2 and color the projected points by CAD status. Is there structure visible in this plot by CAD status? Do these results align with your expectations from the supervised analysis conducted earlier?
# your code here
# Put the principal component scores into a dataframe
pca2.data <- data.frame(pca2$x)
pca2.data$plotx <- pca2.data[, 1]
pca2.data$ploty <- pca2.data[, 2]
# Add CAD status as a factor for coloring
pca2.data$CAD <- as.factor(hw2.data$cad)
# Scatter plot of PC1 vs PC2 colored by CAD status, with per-group ellipses
pca2.plot <- ggplot(pca2.data, aes(x = plotx, y = ploty, color = CAD)) +
  geom_point() +
  stat_ellipse(geom = 'polygon', aes(fill = after_scale(alpha(colour, 0.3)))) +
  ggtitle('PCA with CAD status') +
  xlab('PC1') + ylab('PC2') +
  coord_fixed(ratio = 1) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank())
pca2.plot
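There is no single right number of components; a common convention is to keep enough to pass a cumulative variance threshold. A sketch using 80% (with the rounded percentages printed above, this lands at roughly the first 21 components):
# Index of the first PC at which cumulative variance explained reaches 80%
which(cumsum(pve2) >= 80)[1]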
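The first two PCs maximize overall variance, not CAD separation, so strong visual structure by CAD status is not guaranteed even though the supervised models predicted well. A hedged sketch to quantify how much CAD signal PC1 and PC2 carry (pc.df and pc.glm are hypothetical names):
# Logistic regression of CAD on the first two principal component scores
pc.df <- data.frame(PC1 = pca2$x[, 1], PC2 = pca2$x[, 2], cad = hw2.data$cad)
pc.glm <- glm(cad ~ PC1 + PC2, data = pc.df, family = binomial)
summary(pc.glm)$coefficients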
Done, but not finished.