HTP Data Analysis

Q(1)

Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and Y = receive an A. We fit a logistic regression and obtain the following estimated coefficients: \(\hat\beta_0\)=-6, \(\hat\beta_1\)=0.05,\(\hat\beta_2\)=1.

(a)

Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

Solution:

Here Y has two values: receives an A = 1 and does not receive an A = 0. So, Y ~ Bernoulli(p)

The logistic regression model is

\(logit(p) = log(\frac{p}{1-p}) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\)

or \(p = P(Y = 1|X) = \frac{1}{1 + exp(-(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2))}\)

where \(X = (x_1,x_2)\) and \(p\) is the probability of success given the predictors.Let’s calculate it:

beta_0 <- -6
beta_1 <- 0.05
beta_2 <- 1
x_1 <- 40
x_2 <- 3.5

#Probability of success or probability of getting A
p <- 1/(1+exp(-(beta_0 + beta_1 * x_1 +beta_2 * x_2)))
p

## [1] 0.3775407

## The probability that the student will get an A given that they study 40 hours and have undergrad GPA 3.5 is =  0.3775407

(b)

How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

Solution

The logistic regression model is

\(logit(p) = log(\frac{p}{1-p}) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\)

or \(x_1 = \frac{log(\frac{p}{1-p}) - \hat\beta_0 - \hat\beta_2 x_2}{\hat\beta_1}\)

This has been calculated below:

p <- 0.5
x_1 <- (log(p/(1-p)) - (beta_0) - (beta_2 * x_2))/beta_1
x_1

## [1] 50

## To have a 50% chance of getting an A,they need to study 50 hrs.

Q(2)

For this problem, use the heartdisease dataset, which is part of the MLDataR package. This data contains some demographic and clinical information on 918 patients.Divide ~80% of the data for training and the remaining ~20% for testing. The HeartDisease status variable is coded 1 = heart disease and 0 = no heart disease.

(a)

Fit a logistic regression model to predict the HeartDisease status of the patients using four predictors: Age, RestingBP, Cholesterol, and FastingBS. Here, the model should predict P(HeartDisease = 1 | X). Create the confusion matrix and compute the sensitivity, specificity, and overall classification accuracy for the test data.

library(MLDataR)
library(DT)
data <- heartdisease
datatable(head(data), options = list(scrollX = TRUE))

Data Cleaning

#Check if there are any missing values
sum(is.na(data))

## [1] 0

# Check for duplicate rows
sum(duplicated(data))

## [1] 0

The result shows that there are no missing values and no duplicated rows. So, the data is clean and ready for analysis.

#Extract only 5 columns
data_subset <- data[, c("Age", "RestingBP", "Cholesterol", "FastingBS", "HeartDisease")]
datatable(head(data_subset))

str(data_subset)

## Classes 'tbl_df', 'tbl' and 'data.frame':    918 obs. of  5 variables:
##  $ Age         : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ RestingBP   : num  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol : num  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HeartDisease: num  0 1 0 1 0 0 0 0 1 0 ...

Let’s change HeartDisease and FastingBS into categorical variable:

data_subset$FastingBS <- as.factor(data_subset$FastingBS)
data_subset$HeartDisease <- as.factor(data_subset$HeartDisease)
str(data_subset)

## Classes 'tbl_df', 'tbl' and 'data.frame':    918 obs. of  5 variables:
##  $ Age         : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ RestingBP   : num  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol : num  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HeartDisease: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...

Training Data

set.seed(1)

#sample 80% indices
training_indices <- sample(1:nrow(data_subset),size = 0.8*nrow(data_subset))

#Get the training data
training_data <- data_subset[training_indices,]
datatable(head(training_data))

Test Data

#Get the testing data
test_data <-data_subset[-training_indices,]
datatable(head(test_data))

logistic_reg <- glm(HeartDisease ~ Age + RestingBP + Cholesterol + FastingBS, data = training_data,family = binomial)

# Make predictions on the test set (probabilities)
test_probs <- predict(logistic_reg, test_data, type = "response")

# Convert probabilities to class labels (threshold = 0.5)
test_pred <- ifelse(test_probs > 0.5, "1", "0")
test_pred <- factor(test_pred, levels = c("0", "1"))

library(caret)
# Compute confusion matrix 
confusion_matrix <- confusionMatrix(test_pred, test_data$HeartDisease)
confusion_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 56 28
##          1 28 72
##                                           
##                Accuracy : 0.6957          
##                  95% CI : (0.6237, 0.7612)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 1.769e-05       
##                                           
##                   Kappa : 0.3867          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.7200          
##              Prevalence : 0.4565          
##          Detection Rate : 0.3043          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.6933          
##                                           
##        'Positive' Class : 0               
##

From logistic regression

sensitivity = 0.6667 specificity = 0.7200 Overall prediction accuracy = 0.6933

(b)

Repeat (a) using LDA.

Solution

library(MASS)
lda_model <- lda(HeartDisease ~ Age + RestingBP + Cholesterol + FastingBS, data = training_data)
lda_prediction <- predict(lda_model,test_data)

#See the components of the list lda_prediction
names(lda_prediction)

## [1] "class"     "posterior" "x"

The lda_prediction is a list of class,posterior and x where class is a collection of categorical values 1 and 0,posterior is the collection of posterior probabilities and x is discriminant function values.

#Extracting only the class from lda_prediction
predicted_classes <-lda_prediction$class
predicted_classes

##   [1] 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0
##  [38] 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
## [112] 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0
## [149] 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 1 1 0
## Levels: 0 1

actual_classes <- test_data$HeartDisease

# create a confusion matrix
confuse_matrix<- confusionMatrix(predicted_classes, actual_classes)
confuse_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 56 28
##          1 28 72
##                                           
##                Accuracy : 0.6957          
##                  95% CI : (0.6237, 0.7612)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 1.769e-05       
##                                           
##                   Kappa : 0.3867          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.7200          
##              Prevalence : 0.4565          
##          Detection Rate : 0.3043          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.6933          
##                                           
##        'Positive' Class : 0               
##

From LDA model,

sensitivity=0.6667 specificity= 0.7200 prediction accuracy=0.6933

(c)

Repeat (a) using QDA.

library(MASS)
qda_model <- qda(HeartDisease ~ Age + RestingBP + Cholesterol + FastingBS, data = training_data)

qda_prediction <- predict(qda_model,test_data)

# see the elements in the list qda_prediction
names(qda_prediction)

## [1] "class"     "posterior"

Extract only the class from the list qda_prediction

qda_predicted_classes <- qda_prediction$class
qda_predicted_classes

##   [1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0
##  [38] 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 0 1 0
## [112] 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0
## [149] 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0
## Levels: 0 1

actual_classes <- test_data$HeartDisease

#Create the confusion matrix object and then extract the confusion matrix from it
Confusion_Matrix_Object <- confusionMatrix(qda_predicted_classes,actual_classes)
Confusion_Matrix_Object

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 66 43
##          1 18 57
##                                          
##                Accuracy : 0.6685         
##                  95% CI : (0.5954, 0.736)
##     No Information Rate : 0.5435         
##     P-Value [Acc > NIR] : 0.000378       
##                                          
##                   Kappa : 0.3474         
##                                          
##  Mcnemar's Test P-Value : 0.002120       
##                                          
##             Sensitivity : 0.7857         
##             Specificity : 0.5700         
##          Pos Pred Value : 0.6055         
##          Neg Pred Value : 0.7600         
##              Prevalence : 0.4565         
##          Detection Rate : 0.3587         
##    Detection Prevalence : 0.5924         
##       Balanced Accuracy : 0.6779         
##                                          
##        'Positive' Class : 0              
##

From QDA model, Sensitivity = 0.7857
Specificity = 0.5700 Accuracy = 0.6779

(d)

Compute the ROC and plot ROC curve for the logistic regression, LDA, and QDA models using the test data.

library(pROC)
ROC_Logistic_reg <-roc(test_data$HeartDisease,test_probs)
ROC_Logistic_reg

## 
## Call:
## roc.default(response = test_data$HeartDisease, predictor = test_probs)
## 
## Data: test_probs in 84 controls (test_data$HeartDisease 0) < 100 cases (test_data$HeartDisease 1).
## Area under the curve: 0.7365

#From lda_prediction list, just extract the prediction and extract the probabilities of prediction of 1 only
lda_test_probs <-lda_prediction$posterior[,2]
ROC_lda <- roc(test_data$HeartDisease,lda_test_probs)
ROC_lda

## 
## Call:
## roc.default(response = test_data$HeartDisease, predictor = lda_test_probs)
## 
## Data: lda_test_probs in 84 controls (test_data$HeartDisease 0) < 100 cases (test_data$HeartDisease 1).
## Area under the curve: 0.7367

qda_test_probs <-qda_prediction$posterior[,2]
ROC_qda <-roc(test_data$HeartDisease,qda_test_probs)
ROC_qda

## 
## Call:
## roc.default(response = test_data$HeartDisease, predictor = qda_test_probs)
## 
## Data: qda_test_probs in 84 controls (test_data$HeartDisease 0) < 100 cases (test_data$HeartDisease 1).
## Area under the curve: 0.7468

Plotting:

par(mfrow=c(2,2))
plot.roc(ROC_Logistic_reg,print.auc = TRUE,col="blue",main="ROC curve for logistic regression",lwd=3)
plot.roc(ROC_lda,print.auc = TRUE,main="ROC curve for LDA model",col="red",lwd=3)
plot.roc(ROC_qda,print.auc = TRUE,main="ROC curve for QDA model",col="green",lwd=3)

Plotting all three in the same plot

plot.roc(ROC_Logistic_reg,col="blue",main="ROC curve comparison",lwd=5)
plot.roc(ROC_lda,add=T,col="red",lwd=3)
plot.roc(ROC_qda,add=T,col="green",lwd=3)

# Add a legend
legend("bottomright", legend = c("Logistic Regression", "LDA", "QDA"),
       col = c("blue", "red", "green"), lwd = 2, lty = c(1, 1, 1))

Based on the plot, the three models have similar performance.Logistic regression and LDA model have almost the same performance.QDA is slightly better.

(e)

Repeat (a) using KNN with K = 1, K = 5, and K = 9.

Fit a KNN regression model to predict the HeartDisease status of the patients using four predictors: Age, RestingBP, Cholesterol, and FastingBS. Here, the model should predict P(HeartDisease = 1 | X). Create the confusion matrix and compute the sensitivity, specificity, and overall classification accuracy for the test data.

Solution:

# Select predictor variables
predictors <- c("Age", "RestingBP", "Cholesterol", "FastingBS")
target <- "HeartDisease"

# Convert predictor variables to numeric
X_train <- as.data.frame(lapply(training_data[predictors], as.numeric))
X_test <- as.data.frame(lapply(test_data[predictors], as.numeric))

# Convert target variable to factor (KNN requires factor for classification)
y_train <- as.factor(training_data$HeartDisease)
y_test <- as.factor(test_data$HeartDisease)

K=1

library(class)
library(caret)
# Fit KNN model for K = 1
knn_pred_1 <- knn(train = X_train, test = X_test, cl = y_train, k = 1)

# Confusion matrix
confusion_matrix_1 <- confusionMatrix(knn_pred_1, y_test)

# Print results
print("K = 1 Confusion Matrix and Performance Metrics:")

## [1] "K = 1 Confusion Matrix and Performance Metrics:"

print(confusion_matrix_1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 33
##          1 35 67
##                                           
##                Accuracy : 0.6304          
##                  95% CI : (0.5563, 0.7003)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 0.0105          
##                                           
##                   Kappa : 0.2538          
##                                           
##  Mcnemar's Test P-Value : 0.9035          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.6700          
##          Pos Pred Value : 0.5976          
##          Neg Pred Value : 0.6569          
##              Prevalence : 0.4565          
##          Detection Rate : 0.2663          
##    Detection Prevalence : 0.4457          
##       Balanced Accuracy : 0.6267          
##                                           
##        'Positive' Class : 0               
##

For K = 1,

Sensitivity = 0.5833
Specificity = 0.6800 Accuracy = 0.6317

K=5

# Fit KNN model for K = 5
knn_pred_5 <- knn(train = X_train, test = X_test, cl = y_train, k = 5)

# Confusion matrix 
confusion_matrix_5 <- confusionMatrix(knn_pred_5, y_test)

# Print results
print("K = 5 Confusion Matrix and Performance Metrics:")

## [1] "K = 5 Confusion Matrix and Performance Metrics:"

print(confusion_matrix_5)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50 30
##          1 34 70
##                                           
##                Accuracy : 0.6522          
##                  95% CI : (0.5786, 0.7207)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 0.001795        
##                                           
##                   Kappa : 0.2964          
##                                           
##  Mcnemar's Test P-Value : 0.707660        
##                                           
##             Sensitivity : 0.5952          
##             Specificity : 0.7000          
##          Pos Pred Value : 0.6250          
##          Neg Pred Value : 0.6731          
##              Prevalence : 0.4565          
##          Detection Rate : 0.2717          
##    Detection Prevalence : 0.4348          
##       Balanced Accuracy : 0.6476          
##                                           
##        'Positive' Class : 0               
##

For K = 5, Sensitivity = 0.5952
Specificity = 0.7000 Accuracy = 0.6476

k=9

# Fit KNN model for K = 9
knn_pred_9 <- knn(train = X_train, test = X_test, cl = y_train, k = 9)

# Confusion matrix 
confusion_matrix_9 <- confusionMatrix(knn_pred_9, y_test)

# Print results
print("K = 9 Confusion Matrix and Performance Metrics:")

## [1] "K = 9 Confusion Matrix and Performance Metrics:"

print(confusion_matrix_9)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 29
##          1 41 71
##                                         
##                Accuracy : 0.6196        
##                  95% CI : (0.5452, 0.69)
##     No Information Rate : 0.5435        
##     P-Value [Acc > NIR] : 0.02236       
##                                         
##                   Kappa : 0.2245        
##                                         
##  Mcnemar's Test P-Value : 0.18859       
##                                         
##             Sensitivity : 0.5119        
##             Specificity : 0.7100        
##          Pos Pred Value : 0.5972        
##          Neg Pred Value : 0.6339        
##              Prevalence : 0.4565        
##          Detection Rate : 0.2337        
##    Detection Prevalence : 0.3913        
##       Balanced Accuracy : 0.6110        
##                                         
##        'Positive' Class : 0             
##

For k = 9,

Sensitivity = 0.5119
Specificity = 0.7100 Accuracy = 0.6110

KNN Performance Metrics for Different K Values
K	Sensitivity	Specificity	Accuracy
1	0.5833	0.68	0.6317
5	0.5952	0.70	0.6476
9	0.5119	0.71	0.6110

Result:

The KNN model with K = 5 is better as compared to K = 1 and K = 9.

(f)

Write a brief summary comparing the results of the considered methods (logistic regression, LDA, QDA, KNN). Which model do you think is the “best” and why?

Solution:

Below we create a table to compare the results:

# Load required package
library(knitr)

# Create a data frame with the performance metrics
model_results <- data.frame(
  Model = c("Logistic Regression", "LDA", "QDA", "KNN (K=1)", "KNN (K=5)", "KNN (K=9)"),
  Sensitivity = c(0.6667, 0.6667, 0.7857, 0.5833, 0.5952, 0.5119),
  Specificity = c(0.7200, 0.7200, 0.5700, 0.6800, 0.7000, 0.7100),
  Accuracy = c(0.6933, 0.6933, 0.6779, 0.6317, 0.6476, 0.6110)
)

# Print the table using knitr::kable()
kable(model_results, caption = "Comparison of Classification Models Performance")

Comparison of Classification Models Performance
Model	Sensitivity	Specificity	Accuracy
Logistic Regression	0.6667	0.72	0.6933
LDA	0.6667	0.72	0.6933
QDA	0.7857	0.57	0.6779
KNN (K=1)	0.5833	0.68	0.6317
KNN (K=5)	0.5952	0.70	0.6476
KNN (K=9)	0.5119	0.71	0.6110

Comparison:

Logistic regression and LDA model have same performance for the given data. If we are looking for a balanced model(that is overall higher sensitivity,specificity and accuracy), then logistic regression or LDA look better than other models. But, if our goal is just the sensitivity, then QDA looks better.

Q(3)

Generate a simulated data set as follows: set .seed (1) x <- rnorm(100) y <- x - 2* x^2 + rnorm(100)

(a) Write out the model used to generate the data in equation form. In this dataset, what is n and what is p?

Solution

Model:

\(y = x - 2x^2 +\varepsilon\) where \(\varepsilon\) ~ N(0,1).

n = number of observations = 100 p = number of predictors = 1 because there is only one predictor x.

set.seed (1)
x <- rnorm(100)
y <- x - 2* x^2 + rnorm(100)

(b)

Create a scatterplot of X against Y. Comment on what you observe.

plot(x,y,main="Scatter plot: x vs y")
#curve(x - 2*x^2, add = TRUE, col = "red", lwd = 2)
curve(expr = x - 2*x^2, from = min(x), to = max(x), add = TRUE, col = "red", lwd = 2)

There is quadratic relationship between x and y.A lot of points are clustered around the peak with high variability there.However, overall the curve captures most of the data showing that the relationship between y and x is quadratic.

(c)

Set a random seed and use the validation set approach with a 50/50 split to calculate the test MSE that results from fitting the four models below using least squares (linear model). Use the poly() function, when appropriate.

(i)

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

Solution

set.seed(2)
x<- rnorm(100)
y <- x - 2* x^2 + rnorm(100)

#sample 50% indices
train_indices <- sample(1:100,size = 0.5*100)
train_data <- data.frame(x = x[train_indices], y = y[train_indices])
test_data <- data.frame(x = x[-train_indices], y = y[-train_indices])

# Model i: Linear
model_i <- glm(y ~ x, data = train_data)
coef(model_i)

## (Intercept)           x 
## -2.52087486 -0.05895862

# Predictions on the test set
pred_i <- predict(model_i, newdata = test_data)

MSE_i <- mean((test_data$y - pred_i)^2)
MSE_i

## [1] 13.07129

(ii)

\(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon\)

Solution:

# Model ii: Quadratic
model_ii <- glm(y ~ poly(x, 2), data = train_data)

#Prediction on the test data
pred_ii <- predict(model_ii,newdata = test_data)

MSE_ii <- mean((test_data$y - pred_ii)^2)
MSE_ii

## [1] 1.258174

(iii)

\(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon\)

Solution

#model(cubic)
model_iii <- glm(y ~ poly(x,3),data=train_data)

#Prediction
pred_iii <- predict(model_iii,newdata = test_data)

#MSE
MSE_iii <- mean((test_data$y - pred_iii)^2)
MSE_iii

## [1] 1.269602

(iv)

\(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \varepsilon\)

Solution

#Model(bi-quadratic)
model_iv <- glm(y ~ poly(x,4),data = train_data)

#Prediction
pred_iv <- predict(model_iv,newdata = test_data)

#MSE
MSE_iv <- mean((test_data$y - pred_iv)^2)
MSE_iv

## [1] 1.276

Comparing the MSEs:

# Print the test MSEs
print(paste("Test MSE for Model_i:", MSE_i))

## [1] "Test MSE for Model_i: 13.0712854334733"

print(paste("Test MSE for Model_ii:", MSE_ii))

## [1] "Test MSE for Model_ii: 1.25817361454273"

print(paste("Test MSE for Model_iii:", MSE_iii))

## [1] "Test MSE for Model_iii: 1.26960242280675"

print(paste("Test MSE for Model_iv:", MSE_iv))

## [1] "Test MSE for Model_iv: 1.27600000817104"

Conclusion

The quadratic model_ii has the smallest MSE.So, it is the best out of four.

(d)

Set a random seed and use the 5-fold cross-validation (CV) approach to calculate the test errors that result from fitting the same four models from part (c).

Solution

set.seed(3)
library(boot)

#5-fold CV for each model
cv_error_i <- cv.glm(train_data, model_i, K = 5)$delta[1]
cv_error_ii <- cv.glm(train_data, model_ii, K = 5)$delta[1]
cv_error_iii <- cv.glm(train_data, model_iii, K = 5)$delta[1]
cv_error_iv <- cv.glm(train_data, model_iv, K = 5)$delta[1]

# Print the CV errors
print(paste("CV Error for Model i (Linear):", cv_error_i))

## [1] "CV Error for Model i (Linear): 8.55433933853172"

print(paste("CV Error for Model ii (Quadratic):", cv_error_ii))

## [1] "CV Error for Model ii (Quadratic): 0.917276500261124"

print(paste("CV Error for Model iii (Cubic):", cv_error_iii))

## [1] "CV Error for Model iii (Cubic): 0.895352752112618"

print(paste("CV Error for Model iv (Quartic):", cv_error_iv))

## [1] "CV Error for Model iv (Quartic): 1.0158824663129"

Result:

For 5 fold, the cubic model has the smallest error.So, it looks better for K = 5 compared to other models.

(e)

Set a random seed and use the 10-fold cross-validation (CV) approach to calculate the test errors that result from fitting the same four models from part (c).

Solution:

set.seed(4)
library(boot)

#10-fold CV for each model
cv_error_i <- cv.glm(train_data, model_i, K = 10)$delta[1]
cv_error_ii <- cv.glm(train_data, model_ii, K = 10)$delta[1]
cv_error_iii <- cv.glm(train_data, model_iii, K = 10)$delta[1]
cv_error_iv <- cv.glm(train_data, model_iv, K = 10)$delta[1]

# Print the CV errors
print(paste("CV Error for Model i (Linear):", cv_error_i))

## [1] "CV Error for Model i (Linear): 8.38480410862023"

print(paste("CV Error for Model ii (Quadratic):", cv_error_ii))

## [1] "CV Error for Model ii (Quadratic): 0.896709776820211"

print(paste("CV Error for Model iii (Cubic):", cv_error_iii))

## [1] "CV Error for Model iii (Cubic): 0.961522234193992"

print(paste("CV Error for Model iv (Quartic):", cv_error_iv))

## [1] "CV Error for Model iv (Quartic): 1.00624716758666"

Result

For 10 fold, the quadratic model has smallest error.So, for this case quadratic model is better than others.

(f)

Set a random seed and use the leave-one-out cross-validation (LOOCV) approach to calculate the test errors that result from fitting the same four models from part(c).

# Initialize vectors to store LOOCV errors for each model
loocv_errors_i <- numeric(nrow(train_data))
loocv_errors_ii <- numeric(nrow(train_data))
loocv_errors_iii <- numeric(nrow(train_data))
loocv_errors_iv <- numeric(nrow(train_data))

# Perform LOOCV
for (i in 1:nrow(train_data)) {
  # Split data into training and validation sets
  train_data_loocv <- train_data[-i, ]  # All observations except the i-th one
  val_data_loocv <- train_data[i, ]     # The i-th observation

  # Fit models on the training data 
  model_i_loocv <- lm(y ~ x, data = train_data_loocv) 
  model_ii_loocv <- lm(y ~ poly(x, 2), data = train_data_loocv) 
  model_iii_loocv <- lm(y ~ poly(x, 3), data = train_data_loocv)
  model_iv_loocv <- lm(y ~ poly(x, 4), data = train_data_loocv)  

  # Make predictions on the validation data
  pred_i <- predict(model_i_loocv, newdata = val_data_loocv)
  pred_ii <- predict(model_ii_loocv, newdata = val_data_loocv)
  pred_iii <- predict(model_iii_loocv, newdata = val_data_loocv)
  pred_iv <- predict(model_iv_loocv, newdata = val_data_loocv)

  # Calculate squared error for the validation set
  loocv_errors_i[i] <- (val_data_loocv$y - pred_i)^2
  loocv_errors_ii[i] <- (val_data_loocv$y - pred_ii)^2
  loocv_errors_iii[i] <- (val_data_loocv$y - pred_iii)^2
  loocv_errors_iv[i] <- (val_data_loocv$y - pred_iv)^2
}

# Calculate the average LOOCV error for each model
loocv_error_i <- mean(loocv_errors_i)
loocv_error_ii <- mean(loocv_errors_ii)
loocv_error_iii <- mean(loocv_errors_iii)
loocv_error_iv <- mean(loocv_errors_iv)

# Print the LOOCV errors
print(paste("LOOCV Error for Model i (Linear):", loocv_error_i))

## [1] "LOOCV Error for Model i (Linear): 8.50041927838054"

print(paste("LOOCV Error for Model ii (Quadratic):", loocv_error_ii))

## [1] "LOOCV Error for Model ii (Quadratic): 0.913862439360599"

print(paste("LOOCV Error for Model iii (Cubic):", loocv_error_iii))

## [1] "LOOCV Error for Model iii (Cubic): 0.974932206914268"

print(paste("LOOCV Error for Model iv (Quartic):", loocv_error_iv))

## [1] "LOOCV Error for Model iv (Quartic): 1.01633825658973"

In LOOCV method, quadratic model has smallest error.So, it is better than other models for this case.

(g)

Which of the four considered models appears to provide the best results on this data based on the validation set approach? The 5-fold CV approach? The 10-fold CV approach? The LOOCV approach? Justify your answers.

Solution

library(DT)

# Create a data frame to store the results
results <- data.frame(
  Model = c("Linear", "Quadratic", "Cubic", "Quartic"),
  Test_Error_5_Fold_CV = c(8.55433933853172, 0.917276500261124, 0.895352752112618, 1.0158824663129),
  Test_Error_10_Fold_CV = c(8.38480410862023, 0.896709776820211, 0.961522234193992, 1.00624716758666),
  Test_Error_LOOCV = c(8.50041927838054, 0.913862439360599, 0.974932206914268, 1.01633825658973)
)

# Print the table
datatable(results)

Conclusion:

For 5 Fold, Cubic model has smallest error, for 10 Fold,Quadratic model has the lowest error, and for LOOCV, the quadratic model has the lowest error.Overall, The quadratic model has the lowest error. So, it is the best.

```

HTP Data Analysis

NAME: RAM CHANDRA DHUNGANA

2025-02-23