Activity Overview

In this activity, a data set of diabetes symptoms is analyzed. The KNN algorithm is applied to the data set to predict, from a patient's symptoms, whether a diagnosis of diabetes is likely.

Reading Data Into R

Data is read in from the CSV file. Age is removed from the data set because this model looks strictly at health complications that affect a diagnosis of diabetes.

data <- read.csv("diabetes_data_upload.csv", header = TRUE)
head(data, 10)
##    Age Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1   40   Male       No        Yes                 No      Yes         No
## 2   58   Male       No         No                 No      Yes         No
## 3   41   Male      Yes         No                 No      Yes        Yes
## 4   45   Male       No         No                Yes      Yes        Yes
## 5   60   Male      Yes        Yes                Yes      Yes        Yes
## 6   55   Male      Yes        Yes                 No      Yes        Yes
## 7   57   Male      Yes        Yes                 No      Yes        Yes
## 8   66   Male      Yes        Yes                Yes      Yes         No
## 9   67   Male      Yes        Yes                 No      Yes        Yes
## 10  70   Male       No        Yes                Yes      Yes        Yes
##    Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1              No              No     Yes           No             Yes
## 2              No             Yes      No           No              No
## 3              No              No     Yes           No             Yes
## 4             Yes              No     Yes           No             Yes
## 5              No             Yes     Yes          Yes             Yes
## 6              No             Yes     Yes           No             Yes
## 7             Yes              No      No           No             Yes
## 8              No             Yes     Yes          Yes              No
## 9             Yes              No     Yes          Yes              No
## 10             No             Yes     Yes          Yes              No
##    partial.paresis muscle.stiffness Alopecia Obesity    class
## 1               No              Yes      Yes     Yes Positive
## 2              Yes               No      Yes      No Positive
## 3               No              Yes      Yes      No Positive
## 4               No               No       No      No Positive
## 5              Yes              Yes      Yes     Yes Positive
## 6               No              Yes      Yes     Yes Positive
## 7              Yes               No       No      No Positive
## 8              Yes              Yes       No      No Positive
## 9              Yes              Yes       No     Yes Positive
## 10              No               No      Yes      No Positive
data <- data[-1] # remove age from data set
head(data, 3)
##   Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1   Male       No        Yes                 No      Yes         No
## 2   Male       No         No                 No      Yes         No
## 3   Male      Yes         No                 No      Yes        Yes
##   Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1             No              No     Yes           No             Yes
## 2             No             Yes      No           No              No
## 3             No              No     Yes           No             Yes
##   partial.paresis muscle.stiffness Alopecia Obesity    class
## 1              No              Yes      Yes     Yes Positive
## 2             Yes               No      Yes      No Positive
## 3              No              Yes      Yes      No Positive

Convert Categorical Data to Quantitative Data

my_df <- data.frame(data) # save data in new data frame
str(my_df)
## 'data.frame':    520 obs. of  16 variables:
##  $ Gender            : chr  "Male" "Male" "Male" "Male" ...
##  $ Polyuria          : chr  "No" "No" "Yes" "No" ...
##  $ Polydipsia        : chr  "Yes" "No" "No" "No" ...
##  $ sudden.weight.loss: chr  "No" "No" "No" "Yes" ...
##  $ weakness          : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Polyphagia        : chr  "No" "No" "Yes" "Yes" ...
##  $ Genital.thrush    : chr  "No" "No" "No" "Yes" ...
##  $ visual.blurring   : chr  "No" "Yes" "No" "No" ...
##  $ Itching           : chr  "Yes" "No" "Yes" "Yes" ...
##  $ Irritability      : chr  "No" "No" "No" "No" ...
##  $ delayed.healing   : chr  "Yes" "No" "Yes" "Yes" ...
##  $ partial.paresis   : chr  "No" "Yes" "No" "No" ...
##  $ muscle.stiffness  : chr  "Yes" "No" "Yes" "No" ...
##  $ Alopecia          : chr  "Yes" "Yes" "Yes" "No" ...
##  $ Obesity           : chr  "Yes" "No" "No" "No" ...
##  $ class             : chr  "Positive" "Positive" "Positive" "Positive" ...
# data.matrix() converts each character column to a factor and then to integer codes
my_df <- data.matrix(my_df)
head(my_df, 5)
##      Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## [1,]      2        1          2                  1        2          1
## [2,]      2        1          1                  1        2          1
## [3,]      2        2          1                  1        2          2
## [4,]      2        1          1                  2        2          2
## [5,]      2        2          2                  2        2          2
##      Genital.thrush visual.blurring Itching Irritability delayed.healing
## [1,]              1               1       2            1               2
## [2,]              1               2       1            1               1
## [3,]              1               1       2            1               2
## [4,]              2               1       2            1               2
## [5,]              1               2       2            2               2
##      partial.paresis muscle.stiffness Alopecia Obesity class
## [1,]               1                2        2       2     2
## [2,]               2                1        2       1     2
## [3,]               1                2        2       1     2
## [4,]               1                1        1       1     2
## [5,]               2                2        2       2     2
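The integer codes in the output above follow from how data.matrix() treats character columns: each one is converted to a factor, and the factor's levels are numbered in sorted order. The sketch below illustrates this on a small made-up data frame (the toy values are assumptions for illustration and are not rows from the diabetes file), then writes out the same mapping explicitly for one column.

```r
# Toy data frame (hypothetical values, not taken from the diabetes file)
toy <- data.frame(
  Gender   = c("Male", "Female", "Male"),
  Polyuria = c("No", "Yes", "No"),
  class    = c("Positive", "Negative", "Positive"),
  stringsAsFactors = FALSE
)

# Character columns become factors, then integer level codes:
# Female = 1, Male = 2; No = 1, Yes = 2; Negative = 1, Positive = 2
toy_mat <- data.matrix(toy)
toy_mat

# The same mapping written out explicitly for a single column
explicit <- as.integer(factor(toy$Polyuria, levels = c("No", "Yes")))
explicit
```

Spelling out the levels, as in the last step, is one way to make the coding visible to the reader instead of relying on sort order.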

Data Splitting

Now that the categorical data has been converted to numeric codes, it needs to be split into training and testing subsets. A 70/30 split is used: 70% of the rows are drawn at random, without replacement, for training, and the remaining 30% form the testing set.

set.seed(123)
trn_dat <- sample(1:nrow(my_df), size = nrow(my_df)*.70, replace = FALSE) # randomly grab 70% of the data
train.data <- my_df[trn_dat,]
test.data <- my_df[-trn_dat,]

# Converting to data frame to print first 10 rows
train_df <- as.data.frame(train.data)
test_df <- as.data.frame(test.data)
head(train_df, 10)
##    Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1       1        2          2                  2        2          1
## 2       2        1          1                  1        1          2
## 3       1        1          2                  1        2          2
## 4       2        1          1                  2        2          2
## 5       1        2          2                  1        2          2
## 6       1        2          2                  2        2          2
## 7       2        1          1                  1        2          1
## 8       2        1          1                  1        2          1
## 9       2        2          2                  2        2          2
## 10      2        1          1                  1        1          1
##    Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1               1               2       1            1               1
## 2               1               2       1            1               1
## 3               1               2       2            2               2
## 4               2               2       2            2               2
## 5               1               2       2            2               2
## 6               1               1       2            2               2
## 7               1               1       2            2               2
## 8               1               1       1            1               1
## 9               2               2       2            1               1
## 10              1               1       1            1               1
##    partial.paresis muscle.stiffness Alopecia Obesity class
## 1                2                1        1       2     2
## 2                1                2        1       1     1
## 3                2                2        1       1     2
## 4                1                1        1       1     2
## 5                2                2        1       1     2
## 6                2                1        1       1     2
## 7                2                1        1       1     1
## 8                1                1        1       1     1
## 9                1                1        2       2     2
## 10               1                1        1       1     1
head(test_df, 10)
##    Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1       2        1          2                  1        2          1
## 2       2        2          1                  1        2          2
## 3       2        2          2                  1        2          2
## 4       2        2          2                  2        2          1
## 5       2        2          2                  1        2          2
## 6       2        2          2                  1        1          2
## 7       2        2          2                  1        2          2
## 8       2        2          2                  2        2          1
## 9       2        1          2                  1        2          2
## 10      2        2          2                  1        2          2
##    Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1               1               1       2            1               2
## 2               1               1       2            1               2
## 3               1               2       2            1               2
## 4               1               2       2            2               1
## 5               2               1       2            2               1
## 6               2               1       2            1               2
## 7               1               2       2            1               2
## 8               2               1       1            1               2
## 9               1               2       1            2               2
## 10              1               2       1            1               1
##    partial.paresis muscle.stiffness Alopecia Obesity class
## 1                1                2        2       2     2
## 2                1                2        2       1     2
## 3                1                2        2       2     2
## 4                2                2        1       1     2
## 5                2                2        1       2     2
## 6                1                2        1       1     2
## 7                2                1        1       1     2
## 8                1                2        1       1     2
## 9                2                2        2       2     2
## 10               2                2        1       1     2
# Store the class labels from column 16 (1 = Negative, 2 = Positive)
train.classification <- my_df[trn_dat, 16]
head(train.classification, 5)
## [1] 2 1 2 2 2
test.classification <- my_df[-trn_dat, 16]
head(test.classification, 5)
## [1] 2 2 2 2 2
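A quick sanity check on a split like this is to confirm that the training and testing row indices do not overlap and together cover every row. The sketch below does this on row indices alone, assuming the 520 rows reported by str(); the names n, train_rows, and test_rows are illustrative and do not appear in the original code.

```r
set.seed(123)
n <- 520  # number of observations reported by str() above

# Draw 70% of the row indices at random, without replacement
train_rows <- sample(seq_len(n), size = floor(0.70 * n), replace = FALSE)

# Everything not drawn for training goes to the testing set
test_rows <- setdiff(seq_len(n), train_rows)

length(train_rows)                        # size of the training set
length(test_rows)                         # size of the testing set
length(intersect(train_rows, test_rows))  # 0 means no row is in both sets
```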

Determine the K Value

K is the number of nearest neighbors that vote on the class of each new observation. A common rule of thumb for a starting value of K is the square root of the number of observations in the training set.

# Get the number of rows in the training data set
num_trn_data <- NROW(train.data)
num_trn_data
## [1] 364
# Determine the K value
calc_k_val <- sqrt(num_trn_data)
calc_k_val
## [1] 19.07878

The calculated value is 19.07878, which is rounded to K = 19. The model can now be fit with this K value.

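library(class) # provides knn()
# Note: train.data and test.data still include the class column (column 16)
knn.19 <- knn(train = train.data, test = test.data, cl = train.classification, k = 19)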
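Because train.data and test.data are row subsets of my_df, they still contain the class column, so the label itself is one of the predictors in the call above. A variant that keeps the label out of the predictor matrices might look like the sketch below. It runs on synthetic stand-in data (the matrix toy and every name in it are assumptions for illustration), so it says nothing about how the accuracy on the diabetes data would change. It also nudges K to an odd number, which makes tied two-class votes less likely.

```r
library(class)  # provides knn()

set.seed(123)
# Synthetic stand-in for my_df: 100 rows, three features coded 1/2,
# with the label in the last column (hypothetical data)
n <- 100
label <- sample(1:2, n, replace = TRUE)
toy <- cbind(
  f1 = ifelse(runif(n) < 0.9, label, 3 - label),  # agrees with the label about 90% of the time
  f2 = sample(1:2, n, replace = TRUE),
  f3 = sample(1:2, n, replace = TRUE),
  class = label
)

idx <- sample(seq_len(n), size = 70)
label_col <- ncol(toy)

# Predictors only: drop the label column from both matrices
train_x <- toy[idx, -label_col]
test_x  <- toy[-idx, -label_col]
train_y <- factor(toy[idx, label_col])
test_y  <- toy[-idx, label_col]

# Square-root rule of thumb, moved to the next odd number if it lands on an even one
k <- round(sqrt(nrow(train_x)))
if (k %% 2 == 0) k <- k + 1

pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
head(pred)
```

Applied to the diabetes data, the same idea would amount to passing my_df[trn_dat, -16] and my_df[-trn_dat, -16] as the train and test arguments.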

Evaluate Model

Now that the model has been created, its accuracy on the testing set is evaluated.

model.acc <- 100 * sum(test.classification==knn.19)/NROW(test.classification)
model.acc
## [1] 91.02564

The model classifies about 91.03% of the testing observations correctly.
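A single accuracy figure does not show which kind of error the model makes. One common way to break it down is a confusion matrix built with table(). The sketch below uses short made-up label vectors (hypothetical values, coded 1 = Negative and 2 = Positive as in the matrix above) instead of the real predictions, so the numbers it prints are illustrative only.

```r
# Hypothetical actual and predicted labels, coded 1 = Negative, 2 = Positive
actual    <- c(2, 2, 1, 1, 2, 1, 2, 2)
predicted <- c(2, 1, 1, 1, 2, 2, 2, 2)

# Rows are the true classes, columns are the predicted classes
conf_mat <- table(Actual = actual, Predicted = predicted)
conf_mat

# Share of all observations on the diagonal (correct predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of true positives found, and share of true negatives found
sensitivity <- conf_mat["2", "2"] / sum(conf_mat["2", ])
specificity <- conf_mat["1", "1"] / sum(conf_mat["1", ])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With the real results, table(Actual = test.classification, Predicted = knn.19) would give the corresponding matrix for the model above.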