Week 10 Lab Practice

Load the dataset and appropriate packages

library(caret)

## Warning: package 'caret' was built under R version 4.1.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.1.2

## Loading required package: lattice

library(readxl)

## Warning: package 'readxl' was built under R version 4.1.2

df <- read_excel("labW9.xlsx", sheet = 1)

Conduct data exploration and checking and cleaning if necessary

str(df)

## tibble [768 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Pregnancies             : num [1:768] 6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : num [1:768] 148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : num [1:768] 72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : num [1:768] 35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : num [1:768] 0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num [1:768] 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : num [1:768] 50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : num [1:768] 1 0 1 0 1 0 1 0 1 1 ...

summary(df)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

colnames(df)

## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

Check for missing values

colSums(is.na(df))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
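`is.na()` reports no missing values, but `summary(df)` shows minima of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, which are not plausible measurements and may be placeholders for missing data. The sketch below counts such zeros and recodes them as `NA`. It uses a small stand-in data frame `d` built from the first ten rows printed by `str(df)`; the names `d` and `zero_cols` are illustrative, and whether zeros really mean "missing" in this dataset is an assumption to confirm before cleaning.

```r
# Stand-in for df: the first ten rows shown in the str(df) output above
d <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Columns where a zero is assumed to be a placeholder, not a real reading
# (Pregnancies is excluded because zero is a valid count there)
zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the zeros per column
zero_counts <- colSums(d[zero_cols] == 0)
zero_counts

# Recode those zeros as NA so that is.na() and colSums(is.na()) can see them
d[zero_cols] <- lapply(d[zero_cols], function(v) replace(v, v == 0, NA))
na_counts <- colSums(is.na(d))
na_counts
```

The same two steps could be applied to `df` itself; how to treat the resulting `NA` values (dropping rows, imputing) is a separate modelling decision.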

Change outcome to factor

df$Outcome <- as.factor(df$Outcome)

Partition data 70/30 using any method you feel comfortable with

set.seed(3333)  # seed the random split so the partition is reproducible
split <- 0.7
trainIndex <- createDataPartition(df$Outcome, p = split, list = FALSE)
df_train <- df[trainIndex, ]
df_test <- df[-trainIndex, ]

Check both your training and test subsets

nrow(df_train); nrow(df_test)

## [1] 538

## [1] 230
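Row counts alone do not show whether both subsets keep the same class balance. The sketch below imitates a stratified 70/30 split in base R and compares class proportions. It is not caret's implementation; the vector `outcome` is a synthetic stand-in (500 zeros and 268 ones, chosen to be consistent with the 768 rows and the Outcome mean of 0.349 reported above), and all names in it are illustrative.

```r
# Synthetic stand-in for df$Outcome: 500 zeros and 268 ones (768 rows)
outcome <- factor(rep(c(0, 1), times = c(500, 268)))

set.seed(3333)  # fix the random draw so the split is reproducible
p <- 0.7

# Sample ceiling(p * n) row indices separately within each class
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = ceiling(p * length(idx)))
}), use.names = FALSE))

train_outcome <- outcome[train_idx]
test_outcome  <- outcome[-train_idx]

# Subset sizes, then class proportions in each subset
length(train_outcome); length(test_outcome)
prop.table(table(train_outcome))
prop.table(table(test_outcome))
```

On the real data, the equivalent check would be `prop.table(table(df_train$Outcome))` and `prop.table(table(df_test$Outcome))`.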

Check for cross validation if the model allows for it

Initialize cross validation train control

train_control = trainControl(method = "cv", number = 5)
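To make the idea behind `number = 5` concrete, the sketch below assigns 538 training rows to five folds in base R. Each fold is held out once while the other four are used for fitting. This only illustrates the principle; caret builds its own folds internally and may assign rows differently, and the names `n`, `k` and `folds` are illustrative.

```r
set.seed(3333)  # reproducible fold assignment

n <- 538  # number of training rows reported above
k <- 5    # number of folds

# Give every row a fold label 1..k, then shuffle the labels
folds <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(folds)
fold_sizes

# Rows held out for validation in the first round, and rows used for fitting
holdout_rows <- which(folds == 1)
fitting_rows <- which(folds != 1)
length(holdout_rows); length(fitting_rows)
```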

Train a model on your training data using any model you feel is appropriate

Train the model using a KNN classifier

set.seed(3333)
model <- train(Outcome~., data = df_train, trControl = train_control, method = "knn")
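KNN classifies by distance, so predictors on large scales (Insulin ranges up to 846) can outweigh predictors on small scales (DiabetesPedigreeFunction stays below 2.5). The sketch below shows the effect in base R on the first five rows printed by `str(df)`; the data frame `x` is an illustrative stand-in. In caret, the comparable step would be passing `preProcess = c("center", "scale")` to `train()`, which the model above does not do.

```r
# First five rows of two predictors, taken from the str(df) output above
x <- data.frame(
  Insulin = c(0, 0, 0, 94, 168),
  DiabetesPedigreeFunction = c(0.627, 0.351, 0.672, 0.167, 2.288)
)

# Euclidean distances on the raw values: Insulin dominates
raw_dist <- as.matrix(dist(x))
raw_dist[4, 5]

# Center each column to mean 0 and scale it to standard deviation 1
x_scaled <- scale(x)

# Distances after scaling: both predictors now contribute comparably
scaled_dist <- as.matrix(dist(x_scaled))
scaled_dist[4, 5]
```

Whether scaling improves this particular model is something to check by refitting and comparing the cross-validated accuracy.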

Plot your model

plot(model)

Predict on your test data using your model

predictions <- predict(model, newdata = df_test)
predictions

##   [1] 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 1 1 0 0
##  [38] 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1
##  [75] 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1
## [112] 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 1 0 0
## [149] 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1
## [186] 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1
## [223] 0 1 0 1 0 1 0 1
## Levels: 0 1

Evaluate your outcome using any suitable method

confusionMatrix(predictions, df_test$Outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124  32
##          1  26  48
##                                           
##                Accuracy : 0.7478          
##                  95% CI : (0.6865, 0.8026)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 0.001163        
##                                           
##                   Kappa : 0.4343          
##                                           
##  Mcnemar's Test P-Value : 0.511482        
##                                           
##             Sensitivity : 0.8267          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.6486          
##              Prevalence : 0.6522          
##          Detection Rate : 0.5391          
##    Detection Prevalence : 0.6783          
##       Balanced Accuracy : 0.7133          
##                                           
##        'Positive' Class : 0               
##
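Note that the output above treats class 0 (no diabetes) as the 'Positive' class, so the reported Sensitivity of 0.8267 is the rate at which non-diabetic cases are identified. The sketch below recomputes the main metrics by hand from the four counts in the table, this time taking class 1 as the positive class. The names `cm`, `sens_1`, `spec_1` and `prec_1` are illustrative. With caret, the same view should be obtainable by passing `positive = "1"` to `confusionMatrix()`.

```r
# Counts copied from the confusion matrix above (rows: Prediction, cols: Reference)
cm <- matrix(c(124, 26, 32, 48), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Treating class 1 (diabetes) as the positive class:
sens_1 <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
spec_1 <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / actual negatives
prec_1 <- cm["1", "1"] / sum(cm["1", ])  # true positives / predicted positives

round(c(accuracy = accuracy, sensitivity = sens_1,
        specificity = spec_1, precision = prec_1), 4)
```

Read this way, the model finds 48 of the 80 diabetic cases in the test set (sensitivity 0.6), which is the figure that usually matters most for a screening task.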

Tay Shi Hui (17170153)

12/31/2021
