Dinesh Ramachandran

Load the dataset and appropriate packages

library(readxl)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

dataset<-read_excel('labW9.xlsx', 1)

Explore the data, then check and clean it if necessary

str(dataset)

## tibble [768 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Pregnancies             : num [1:768] 6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : num [1:768] 148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : num [1:768] 72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : num [1:768] 35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : num [1:768] 0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num [1:768] 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : num [1:768] 50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : num [1:768] 1 0 1 0 1 0 1 0 1 1 ...

summary(dataset)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

dim(dataset)

## [1] 768   9

colSums(is.na(dataset))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

dataset$Outcome <- as.factor(dataset$Outcome)
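colSums(is.na(dataset)) reports no missing values, yet summary() shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A reading of 0 is not plausible for these measurements, so the zeros may be unrecorded values rather than real ones. That interpretation is an assumption; the data file itself does not say so. A minimal sketch of the check, using only the first ten rows printed by str() above (first_rows, zero_counts, cleaned and na_counts are names introduced here for illustration):

```r
# First ten rows of five columns, copied from the str() output above.
first_rows <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count the zeros in each column.
zero_counts <- colSums(first_rows == 0)
print(zero_counts)

# Assumption: a zero in these columns means "not recorded".
# Recode zeros as NA so that is.na() can see them.
cleaned <- first_rows
cleaned[cleaned == 0] <- NA
na_counts <- colSums(is.na(cleaned))
print(na_counts)
```

The same colSums(... == 0) call could be run on the corresponding columns of dataset. Whether the flagged rows should be imputed or dropped is left open here.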

Partition the data 70/30 and check both the training and test subsets

# No seed is set, so the rows drawn (and every result below) change between runs.
split <- 0.70
trainIndex <- createDataPartition(dataset$Outcome, p = split, list = FALSE)
data_train <- dataset[trainIndex,]
data_test <- dataset[-trainIndex,]
dim(data_train)

## [1] 538   9

dim(data_test)

## [1] 230   9
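The two dim() calls confirm the 538/230 row split but not the class balance in each subset. createDataPartition samples within each level of Outcome, so both subsets should keep roughly the same share of each class. The sketch below imitates that in base R under stated assumptions: the class counts of 500 and 268 are inferred from the Outcome mean of 0.349 in summary(), the seed 123 is arbitrary, and rounding each class up is a guess chosen because it reproduces the 538 training rows above.

```r
set.seed(123)  # arbitrary seed, only so this sketch is repeatable

# Assumed class counts: 500 zeros and 268 ones (268 / 768 is about 0.349).
outcome <- factor(rep(c("0", "1"), times = c(500, 268)))

# Draw 70% of the row numbers within each class, rounding up.
rows_by_class <- split(seq_along(outcome), outcome)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, ceiling(0.70 * length(rows)))
}))

train_outcome <- outcome[train_rows]
test_outcome  <- outcome[-train_rows]

# Subset sizes and class proportions in each subset.
print(length(train_outcome))
print(length(test_outcome))
print(round(prop.table(table(train_outcome)), 3))
print(round(prop.table(table(test_outcome)), 3))
```

On the real data the equivalent check would be prop.table(table(data_train$Outcome)) and prop.table(table(data_test$Outcome)).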

Set up cross-validation if the model allows for it

train_cont <- trainControl(method = "cv", number = 7)
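With method = "cv" and number = 7, the training rows are divided into seven folds, and each fold is held out once while the model is fitted on the other six. A minimal base-R sketch of the fold sizes for the 538 training rows follows. It assigns folds at random without stratifying by Outcome, which is a simplification, and the seed and the names fold_id, held_out_sizes and fit_sizes are illustrative.

```r
set.seed(123)  # arbitrary seed, only so this sketch is repeatable

n_train <- 538  # rows in data_train
k_folds <- 7    # number of folds passed to trainControl()

# Give every training row a fold label from 1 to 7, in random order.
fold_id <- sample(rep(seq_len(k_folds), length.out = n_train))

# Rows held out in each fold, and rows left for fitting.
held_out_sizes <- as.vector(table(fold_id))
fit_sizes <- n_train - held_out_sizes

print(held_out_sizes)
print(fit_sizes)
```

Six folds hold out 77 rows and one holds out 76, so the model is fitted on 461 or 462 rows each time.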

Train the model

model <- train(Outcome~., data = data_train, trControl = train_cont, method = "knn")
model

## k-Nearest Neighbors 
## 
## 538 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (7 fold) 
## Summary of sample sizes: 461, 461, 461, 462, 461, 461, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7174836  0.3575802
##   7  0.7434577  0.4172214
##   9  0.7546138  0.4404962
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
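The output notes "No pre-processing", so the neighbours are found on the raw measurement scales. kNN relies on distances between rows, and a wide-ranging column such as Insulin (0 to 846) can outweigh a narrow one such as DiabetesPedigreeFunction (0.078 to 2.42). The sketch below illustrates this with the first five rows shown by str(). The helper insulin_share is hypothetical and written only for this example.

```r
# First five rows of two columns, copied from the str() output.
two_cols <- data.frame(
  Insulin = c(0, 0, 0, 94, 168),
  DiabetesPedigreeFunction = c(0.627, 0.351, 0.672, 0.167, 2.288)
)

# Share of the squared Euclidean distance between rows 4 and 5
# that comes from the Insulin column.
insulin_share <- function(values) {
  squared_gap <- (values[4, ] - values[5, ])^2
  unname(squared_gap["Insulin"] / sum(squared_gap))
}

raw_share    <- insulin_share(as.matrix(two_cols))  # raw units
scaled_share <- insulin_share(scale(two_cols))      # centred and scaled columns

print(raw_share)
print(scaled_share)
```

On the raw scale nearly all of the distance comes from Insulin. After scaling, DiabetesPedigreeFunction contributes the larger part. In caret, centring and scaling can be requested by passing preProcess = c("center", "scale") to train(); whether that improves accuracy on this data is not tested here.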

Plot model

plot(model)

Predict on the test data using your model

pred <- predict(model, data_test)
pred

##   [1] 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [38] 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 0
##  [75] 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0
## [112] 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## [149] 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1
## [186] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0
## [223] 0 0 0 0 1 0 0 0
## Levels: 0 1

Evaluate the outcome

confusionMatrix(pred, data_test$Outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129  40
##          1  21  40
##                                           
##                Accuracy : 0.7348          
##                  95% CI : (0.6728, 0.7906)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 0.004551        
##                                           
##                   Kappa : 0.3811          
##                                           
##  Mcnemar's Test P-Value : 0.021185        
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.7633          
##          Neg Pred Value : 0.6557          
##              Prevalence : 0.6522          
##          Detection Rate : 0.5609          
##    Detection Prevalence : 0.7348          
##       Balanced Accuracy : 0.6800          
##                                           
##        'Positive' Class : 0               
##
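The statistics follow directly from the four cell counts. Because the positive class is 0, Sensitivity here is the share of non-diabetic cases correctly identified; the share of diabetic cases (class 1) correctly identified is the Specificity, 0.50. A short base-R check of the arithmetic, with the matrix typed in from the output above (cm and the metric names are introduced for this sketch):

```r
# Confusion matrix from the output above: rows are predictions,
# columns are the reference (true) classes.
cm <- matrix(
  c(129, 21, 40, 40),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Metrics with class "0" treated as positive, as in the output.
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1
pos_pred    <- cm["0", "0"] / sum(cm["0", ])  # predicted 0s that are truly 0
neg_pred    <- cm["1", "1"] / sum(cm["1", ])  # predicted 1s that are truly 1
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, pos_pred = pos_pred,
              neg_pred = neg_pred, balanced = balanced), 4))
```

confusionMatrix() accepts a positive argument, so passing positive = "1" would report the same metrics with diabetes as the positive class.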

Lab Week 10
