Info

Objective

The purpose of graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is worth just 1% of the total course mark. Please do not cheat; it is not worth it!

Your task

Solve the practical questions, knit your document into a PDF and submit it to NTULearn before the deadline. The deadline is tight because the task is simple: we are sure that everyone is capable of doing it by themselves, and we want to discourage taking someone else’s report and rewriting it in your own words.

Marking scheme

You will get “excellent”, or 100%, for this lab report if everything is perfect. You will get “good”, or 75%, if there are minor issues. For example, you will get “good” rather than “excellent” if you transform the data by manually subsetting and renaming variables instead of using a tidyverse pipeline, or if you implement cross-validation from scratch instead of using the caret library. You will get “average”, or 50%, if there are serious issues in your report, such as failing to do cross-validation or data normalization when it is needed. You will get “poor”, or 25%, if you barely attempt this report. You will get “not done”, or 0%, if you do not attempt this report.

Deadline

30 Jul 2021, midnight

Libraries

Here, we load the libraries and the data, and set the random seed. Replace the number “1729” with the numeric part of your matric number.

Diabetes dataset:

http://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.

We will load the data, at the same time renaming class to Y and changing its type from character to factor.

library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fastDummies) # for creating dummy variables
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
D <- read_csv("diabetes_data_upload.csv") %>%
  rename(Y = class) %>%
  mutate(Y = as.factor(Y))
## Rows: 520 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Polyuria, Polydipsia, sudden weight loss, weakness, Polyph...
## dbl  (1): Age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1729) # replace '1729' with the numeric part of your matric number

dim(D)
## [1] 520  17

Question 1

Plot a coloured violin chart of Age by Y

ggplot(data = D, aes(x = Y, y = Age, fill = Y)) +
  geom_violin()
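
If you also want the quartiles visible, a narrow boxplot can be layered on top of the violins (an optional embellishment, not required by the question; the width of 0.1 is an arbitrary choice):

ggplot(data = D, aes(x = Y, y = Age, fill = Y)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white") # narrow boxplot showing the quartiles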

Question 2

We want to predict Y with a logistic regression that uses the two variables most highly correlated with Y. Below we convert all non-numeric variables to numeric dummies, compute their correlation coefficients with Y and identify the variables with the highest correlation:

X <- dummy_cols(D, remove_first_dummy = TRUE) %>% # one dummy per binary variable
  select(-all_of(names(D))) %>%                   # drop the original columns
  mutate(Age = D$Age)                             # keep Age, the only numeric variable

cor_coefs <- cor(X)["Y_Positive" , ]
cor_coefs <- cor_coefs[order(abs(cor_coefs), decreasing = TRUE)]
cor_coefs
##             Y_Positive           Polyuria_Yes         Polydipsia_Yes 
##             1.00000000             0.66592240             0.64873373 
##            Gender_Male sudden weight loss_Yes    partial paresis_Yes 
##            -0.44923336             0.43656818             0.43228762 
##         Polyphagia_Yes       Irritability_Yes           Alopecia_Yes 
##             0.34250386             0.29946707            -0.26751158 
##    visual blurring_Yes           weakness_Yes   muscle stiffness_Yes 
##             0.25130025             0.24327477             0.12247449 
##     Genital thrush_Yes                    Age            Obesity_Yes 
##             0.11028775             0.10867900             0.07217334 
##    delayed healing_Yes            Itching_Yes 
##             0.04697952            -0.01338372

Thus the two variables with the highest correlation with the response variable are

cor_coefs[2:3]
##   Polyuria_Yes Polydipsia_Yes 
##      0.6659224      0.6487337

Now we will just use these two variables as predictors.

Explain what is wrong with this method.

Answer: Variable selection is part of model training, so it must not use the test data. The correlations should therefore be computed only on the training set, after the data have been split into training and test sets; otherwise information from the test set leaks into the model.
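
A minimal sketch of a leakage-free version, assuming the training/test split of Question 3 has already been made (so that train_data exists): the selection is repeated on the training rows only.

X_train <- dummy_cols(train_data, remove_first_dummy = TRUE) %>% # dummies from training rows only
  select(-all_of(names(train_data))) %>%                         # drop the original columns
  mutate(Age = train_data$Age)                                   # keep the numeric Age

cor(X_train)["Y_Positive", ] %>%
  abs() %>%
  sort(decreasing = TRUE) %>%
  head(3) # the response itself, then the top two candidate predictors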

Question 3

Split the data into 70% training and 30% test sets. Train a logistic regression that uses Polyuria and Polydipsia as predictors. Report its 5-fold cross-validation error and its test error.

ind <- which(runif(nrow(D)) < 0.7) # each row joins the training set with probability 0.7

test_data <- D %>%
  slice(-ind)

train_data <- D %>%
  slice(ind)

cat("Training data dim =", dim(train_data), "\n")
## Training data dim = 382 17
cat("Test data dim =", dim(test_data), "\n")
## Test data dim = 138 17
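
Note that the runif split above is random and only approximately 70/30. If an exact, class-stratified split is preferred, caret offers createDataPartition (an optional alternative, not used in the code above):

idx <- createDataPartition(D$Y, p = 0.7, list = FALSE) # stratified by Y
train_strat <- D %>% slice(as.vector(idx))
test_strat  <- D %>% slice(-as.vector(idx))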

First, we train the logistic regression:

mod_log <- train(Y ~ Polyuria + Polydipsia,
                 data = train_data,
                 method = 'glm',
                 family = binomial,
                 trControl = trainControl(method = 'cv', number = 5))

mod_log
## Generalized Linear Model 
## 
## 382 samples
##   2 predictor
##   2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 306, 306, 305, 305, 306 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8560834  0.7007867

The cross-validation error rate is

1 - mod_log$results$Accuracy
## [1] 0.1439166
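
This accuracy is an average over the five folds. To see the fold-to-fold variability behind it, caret stores the per-fold results in the resample element of the fitted object:

mod_log$resample # per-fold Accuracy and Kappa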

The test error rate is

CM <- mod_log %>% predict(test_data) %>%
  confusionMatrix(as.factor(test_data$Y))

unname(1 - CM$overall['Accuracy'])
## [1] 0.0942029
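
Beyond the scalar error rate, the full confusion matrix is available from the same object:

CM$table # predictions (rows) cross-tabulated against reference labels (columns)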

Question 4

Here, we create 16 logistic regressions, using more and more predictors. Specifically, regression \(t\) will use variables \(X_1, X_2, \dots, X_t\) as predictors (in the same order as they appear in the dataset). For instance, regression 6 will use the following variables as predictors:

names(D)[1:6]
## [1] "Age"                "Gender"             "Polyuria"          
## [4] "Polydipsia"         "sudden weight loss" "weakness"

Report the cross-validation error and the test error for each of the models. Which model has the lowest cross-validation error rate?

model_with_t_predictors <- function(t) {
  # Keep the first t columns as predictors and attach the response
  df <- train_data[ , 1:t] %>%
    mutate(Y = train_data$Y)
  
  # Logistic regression with 5-fold cross-validation, as in Question 3
  mod_log <- train(Y ~ . , data = df,
                   method = 'glm',
                   family = binomial,
                   trControl = trainControl(method = 'cv', number = 5))
  
  cv_error <- 1 - mod_log$results$Accuracy
  
  # Test error from the confusion matrix on the held-out test set
  CM <- mod_log %>% predict(test_data) %>%
    confusionMatrix(as.factor(test_data$Y))
  
  test_error <- unname(1 - CM$overall['Accuracy'])
  
  c(t = t, cv_error = cv_error, test_error = test_error)
}

table_of_models <- 1:16 %>% sapply(model_with_t_predictors) %>%
  t %>% as_tibble
table_of_models
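
A quick way to compare the two error curves is to plot them against \(t\) (an optional visual aid; it assumes table_of_models has the columns t, cv_error and test_error computed above):

table_of_models %>%
  pivot_longer(c(cv_error, test_error), names_to = "metric", values_to = "error") %>%
  ggplot(aes(x = t, y = error, colour = metric)) +
  geom_line() +
  geom_point()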

The lowest cross-validation error rate is attained at

table_of_models %>%
  slice(which.min(cv_error))