The purpose of writing graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is just 1% of the total course mark. Please do not cheat - it is not worth it!
Solve the practical questions, knit your document into a PDF and submit it to NTULearn before the deadline. The deadline is tight because the task is simple: we are sure that everyone is capable of doing it by themselves, and we want to discourage taking someone else's report and rewriting it in your own words.
You will get “excellent”, or 100%, for this lab report if everything is perfect. You will get “good”, or 75%, if there are minor issues. For example, you will get “good” rather than “excellent” if you transform the data by manually subsetting and changing variable names instead of using a tidyverse pipeline, or if you implement cross-validation from scratch instead of using the library caret. You will get “average”, or 50%, if there are serious issues in your report, such as failing to do cross-validation or data normalization when it is needed. You will get “poor”, or 25%, if you barely attempt this report. You will get “not done”, or 0%, if you do not attempt this report.
Deadline: 30 Jul 2021, midnight.
Here, we load the libraries and the data, and set the random seed. Replace the number “1729” with the numeric part of your matric number.
Diabetes dataset: http://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.
We will load the data, at the same time renaming class to Y and changing its type from character to factor.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fastDummies) # for creating dummy variables
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
D <- read_csv("diabetes_data_upload.csv") %>%
rename(Y = class) %>%
mutate(Y = as.factor(Y))
## Rows: 520 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Polyuria, Polydipsia, sudden weight loss, weakness, Polyph...
## dbl (1): Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1729) # replace the number '1729' with your matric no
dim(D)
## [1] 520 17
Plot a coloured violin chart of Age by Y.
ggplot(data = D, aes(x = Y, y = Age, fill = Y)) +
geom_violin()
We want to predict Y with a logistic regression that uses the two variables most highly correlated with Y. Below we convert all non-numeric variables to numeric, compute their correlation coefficients with Y and identify the variables with the highest correlation:
X <- dummy_cols(D, remove_first_dummy = TRUE) %>% # one dummy per non-reference level
  select(-all_of(names(D))) %>% # drop the original columns, keeping only the dummies
  mutate(Age = D$Age) # add back the one numeric variable
cor_coefs <- cor(X)["Y_Positive" , ]
cor_coefs <- cor_coefs[order(abs(cor_coefs), decreasing = TRUE)]
cor_coefs
## Y_Positive Polyuria_Yes Polydipsia_Yes
## 1.00000000 0.66592240 0.64873373
## Gender_Male sudden weight loss_Yes partial paresis_Yes
## -0.44923336 0.43656818 0.43228762
## Polyphagia_Yes Irritability_Yes Alopecia_Yes
## 0.34250386 0.29946707 -0.26751158
## visual blurring_Yes weakness_Yes muscle stiffness_Yes
## 0.25130025 0.24327477 0.12247449
## Genital thrush_Yes Age Obesity_Yes
## 0.11028775 0.10867900 0.07217334
## delayed healing_Yes Itching_Yes
## 0.04697952 -0.01338372
Thus, the two variables with the highest correlation with the response variable are:
cor_coefs[2:3]
## Polyuria_Yes Polydipsia_Yes
## 0.6659224 0.6487337
Now we will just use these two variables as predictors.
Explain what is wrong with this method.
Answer: Variable selection is part of training, so we should not use test data for it. The selection should be done after splitting the data into training and test sets, using the training set only.
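For illustration, here is a minimal sketch of the leakage-free ordering, written in the same style as the chunks below; the names train_idx, X_train and cors are our own:
# Hypothetical sketch: split first, then select variables on the training rows only
train_idx <- which(runif(nrow(D)) < 0.7)
X_train <- dummy_cols(D[train_idx, ], remove_first_dummy = TRUE) %>%
  select(-all_of(names(D))) %>%
  mutate(Age = D$Age[train_idx])
cors <- cor(X_train)["Y_Positive", ]
cors[order(abs(cors), decreasing = TRUE)][2:3] # top two predictors, skipping Y itself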
Split the data into 70% training and 30% test sets. Train a logistic regression that uses Polyuria and Polydipsia as predictors. Report its 5-fold cross-validation error and its test error.
ind <- which(runif(nrow(D)) < 0.7) # each row goes to training with probability 0.7
test_data <- D %>%
  slice(-ind) # rows not selected for training
train_data <- D %>%
  slice(ind)
cat("Training data dim =", dim(train_data), "\n")
## Training data dim = 382 17
cat("Test data dim =", dim(test_data), "\n")
## Test data dim = 138 17
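Note that this runif split is only approximately 70/30 and does not stratify by class. If an exact, class-stratified split is preferred, caret's createDataPartition can be used instead; a sketch (the names part_idx, train_strat and test_strat are our own):
# Alternative: exact, class-stratified 70/30 split via caret
part_idx <- createDataPartition(D$Y, p = 0.7, list = FALSE)
train_strat <- D[part_idx, ]
test_strat <- D[-part_idx, ]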
First, we will train the logistic regression:
mod_log <- train(Y ~ Polyuria + Polydipsia,
                 data = train_data,
                 method = 'glm',
                 family = binomial,
                 trControl = trainControl(method = 'cv', number = 5))
mod_log
## Generalized Linear Model
##
## 382 samples
## 2 predictor
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 306, 306, 305, 305, 306
## Resampling results:
##
## Accuracy Kappa
## 0.8560834 0.7007867
The cross-validation error rate is
1 - mod_log$results$Accuracy
## [1] 0.1439166
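The accuracy reported above is the average over the five folds; the per-fold accuracies are stored in the resample component of the fitted object:
mod_log$resample # Accuracy and Kappa for each of the 5 folds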
The test error rate is
CM <- mod_log %>% predict(test_data) %>%
confusionMatrix(as.factor(test_data$Y))
unname(1 - CM$overall['Accuracy'])
## [1] 0.0942029
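To inspect the raw counts behind this error rate, the confusion matrix itself is stored in the table component of the confusionMatrix object:
CM$table # cross-tabulation of predicted vs. reference classes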
Here, create 16 logistic regressions by using more and more predictors. Specifically, regression \(t\) will use variables \(X_1, X_2, \dots, X_t\) as predictors (in the same order as they appear in the dataset). For instance, regression 6 will use the following variables as predictors:
names(D)[1:6]
## [1] "Age" "Gender" "Polyuria"
## [4] "Polydipsia" "sudden weight loss" "weakness"
Report cross-validation errors and test errors for each of the models. Which of the models has the lowest cross-validation error rate?
model_with_t_predictors <- function(t) {
  df <- train_data[ , 1:t] %>% # first t predictor columns
    mutate(Y = train_data$Y) # attach the response
  mod_log <- train(Y ~ . , data = df,
                   method = 'glm',
                   family = binomial,
                   trControl = trainControl(method = 'cv', number = 5))
  cv_error <- 1 - mod_log$results$Accuracy
  CM <- mod_log %>% predict(test_data) %>%
    confusionMatrix(as.factor(test_data$Y))
  test_error <- unname(1 - CM$overall['Accuracy'])
  c(t = t, cv_error = cv_error, test_error = test_error)
}
table_of_models <- 1:16 %>%
  sapply(model_with_t_predictors) %>%
  t() %>%
  as_tibble()
table_of_models
The lowest cross-validation error rate is attained at
table_of_models %>%
slice(which.min(cv_error))
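To see how the two error rates evolve as predictors are added, one could plot both columns of the table against \(t\); a minimal sketch:
# Plot CV and test error against the number of predictors t
table_of_models %>%
  pivot_longer(c(cv_error, test_error), names_to = "type", values_to = "error") %>%
  ggplot(aes(x = t, y = error, colour = type)) +
  geom_line() +
  geom_point()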