In this activity, a data set of symptoms associated with diabetes is analyzed. The KNN (k-nearest neighbors) algorithm is applied to the data set to predict a diabetes diagnosis from a patient's reported symptoms.
Data is read in from the CSV file. Age is removed from the data set because this model looks strictly at health complications that affect a diagnosis of diabetes.
data <- read.csv("diabetes_data_upload.csv", header = TRUE)
head(data, 10)
## Age Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1 40 Male No Yes No Yes No
## 2 58 Male No No No Yes No
## 3 41 Male Yes No No Yes Yes
## 4 45 Male No No Yes Yes Yes
## 5 60 Male Yes Yes Yes Yes Yes
## 6 55 Male Yes Yes No Yes Yes
## 7 57 Male Yes Yes No Yes Yes
## 8 66 Male Yes Yes Yes Yes No
## 9 67 Male Yes Yes No Yes Yes
## 10 70 Male No Yes Yes Yes Yes
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1 No No Yes No Yes
## 2 No Yes No No No
## 3 No No Yes No Yes
## 4 Yes No Yes No Yes
## 5 No Yes Yes Yes Yes
## 6 No Yes Yes No Yes
## 7 Yes No No No Yes
## 8 No Yes Yes Yes No
## 9 Yes No Yes Yes No
## 10 No Yes Yes Yes No
## partial.paresis muscle.stiffness Alopecia Obesity class
## 1 No Yes Yes Yes Positive
## 2 Yes No Yes No Positive
## 3 No Yes Yes No Positive
## 4 No No No No Positive
## 5 Yes Yes Yes Yes Positive
## 6 No Yes Yes Yes Positive
## 7 Yes No No No Positive
## 8 Yes Yes No No Positive
## 9 Yes Yes No Yes Positive
## 10 No No Yes No Positive
data <- data[-1] # remove age from data set
head(data, 3)
## Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1 Male No Yes No Yes No
## 2 Male No No No Yes No
## 3 Male Yes No No Yes Yes
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1 No No Yes No Yes
## 2 No Yes No No No
## 3 No No Yes No Yes
## partial.paresis muscle.stiffness Alopecia Obesity class
## 1 No Yes Yes Yes Positive
## 2 Yes No Yes No Positive
## 3 No Yes Yes No Positive
table(data$class)
##
## Negative Positive
## 200 320
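The class counts also give a baseline for judging any model: a classifier that always predicts the majority class would already be right for every `Positive` row. A minimal sketch of that calculation, using the counts printed above (`class_counts` and `baseline_acc` are illustrative names, not objects from the analysis):

```r
# Counts reported by table(data$class)
class_counts <- c(Negative = 200, Positive = 320)

# Accuracy of always predicting the majority class, in percent
baseline_acc <- 100 * max(class_counts) / sum(class_counts)
print(round(baseline_acc, 2))
```

Any accuracy reported later is best read against this baseline rather than against 50%.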
my_df <- data.frame(data) # save data in new data frame
str(my_df)
## 'data.frame': 520 obs. of 16 variables:
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Polyuria : chr "No" "No" "Yes" "No" ...
## $ Polydipsia : chr "Yes" "No" "No" "No" ...
## $ sudden.weight.loss: chr "No" "No" "No" "Yes" ...
## $ weakness : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Polyphagia : chr "No" "No" "Yes" "Yes" ...
## $ Genital.thrush : chr "No" "No" "No" "Yes" ...
## $ visual.blurring : chr "No" "Yes" "No" "No" ...
## $ Itching : chr "Yes" "No" "Yes" "Yes" ...
## $ Irritability : chr "No" "No" "No" "No" ...
## $ delayed.healing : chr "Yes" "No" "Yes" "Yes" ...
## $ partial.paresis : chr "No" "Yes" "No" "No" ...
## $ muscle.stiffness : chr "Yes" "No" "Yes" "No" ...
## $ Alopecia : chr "Yes" "Yes" "Yes" "No" ...
## $ Obesity : chr "Yes" "No" "No" "No" ...
## $ class : chr "Positive" "Positive" "Positive" "Positive" ...
my_df <- data.matrix(my_df)
head(my_df, 5)
## Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## [1,] 2 1 2 1 2 1
## [2,] 2 1 1 1 2 1
## [3,] 2 2 1 1 2 2
## [4,] 2 1 1 2 2 2
## [5,] 2 2 2 2 2 2
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## [1,] 1 1 2 1 2
## [2,] 1 2 1 1 1
## [3,] 1 1 2 1 2
## [4,] 2 1 2 1 2
## [5,] 1 2 2 2 2
## partial.paresis muscle.stiffness Alopecia Obesity class
## [1,] 1 2 2 2 2
## [2,] 2 1 2 1 2
## [3,] 1 2 2 1 2
## [4,] 1 1 1 1 2
## [5,] 2 2 2 2 2
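`data.matrix()` turns each character column into a factor and then into that factor's integer codes, so the levels are numbered in sorted order: `No` becomes 1 and `Yes` becomes 2, `Negative` becomes 1 and `Positive` becomes 2. A minimal sketch on a hypothetical three-row data frame (`toy` is made up for illustration and is not part of the diabetes data) shows the mapping:

```r
# Hypothetical toy data; column names mirror the real data set
toy <- data.frame(
  Polyuria = c("No", "Yes", "Yes"),
  class    = c("Positive", "Negative", "Positive"),
  stringsAsFactors = FALSE
)

# Character columns become factors, then integer codes (levels in sorted order)
toy_mat <- data.matrix(toy)
print(toy_mat)

# Recover the label behind each code from the sorted unique values
polyuria_levels <- sort(unique(toy$Polyuria))
print(polyuria_levels[toy_mat[, "Polyuria"]])
```

Keeping this mapping in mind matters later, because the predictions returned by the model are the codes 1 and 2 rather than the original labels.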
Now that the categorical values have been converted to integer codes, the data needs to be split into training and testing subsets. A 70/30 split is used: 70% of the rows are chosen at random for the training set, and the remaining 30% form the testing set.
set.seed(123)
trn_dat <- sample(1:nrow(my_df), size = nrow(my_df)*.70, replace = FALSE) # randomly grab 70% of the data
train.data <- my_df[trn_dat,]
test.data <- my_df[-trn_dat,]
# Converting to data frame to print first 10 rows
train_df <- as.data.frame(train.data)
test_df <- as.data.frame(test.data)
head(train_df, 10)
## Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1 1 2 2 2 2 1
## 2 2 1 1 1 1 2
## 3 1 1 2 1 2 2
## 4 2 1 1 2 2 2
## 5 1 2 2 1 2 2
## 6 1 2 2 2 2 2
## 7 2 1 1 1 2 1
## 8 2 1 1 1 2 1
## 9 2 2 2 2 2 2
## 10 2 1 1 1 1 1
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1 1 2 1 1 1
## 2 1 2 1 1 1
## 3 1 2 2 2 2
## 4 2 2 2 2 2
## 5 1 2 2 2 2
## 6 1 1 2 2 2
## 7 1 1 2 2 2
## 8 1 1 1 1 1
## 9 2 2 2 1 1
## 10 1 1 1 1 1
## partial.paresis muscle.stiffness Alopecia Obesity class
## 1 2 1 1 2 2
## 2 1 2 1 1 1
## 3 2 2 1 1 2
## 4 1 1 1 1 2
## 5 2 2 1 1 2
## 6 2 1 1 1 2
## 7 2 1 1 1 1
## 8 1 1 1 1 1
## 9 1 1 2 2 2
## 10 1 1 1 1 1
head(test_df, 10)
## Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1 2 1 2 1 2 1
## 2 2 2 1 1 2 2
## 3 2 2 2 1 2 2
## 4 2 2 2 2 2 1
## 5 2 2 2 1 2 2
## 6 2 2 2 1 1 2
## 7 2 2 2 1 2 2
## 8 2 2 2 2 2 1
## 9 2 1 2 1 2 2
## 10 2 2 2 1 2 2
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1 1 1 2 1 2
## 2 1 1 2 1 2
## 3 1 2 2 1 2
## 4 1 2 2 2 1
## 5 2 1 2 2 1
## 6 2 1 2 1 2
## 7 1 2 2 1 2
## 8 2 1 1 1 2
## 9 1 2 1 2 2
## 10 1 2 1 1 1
## partial.paresis muscle.stiffness Alopecia Obesity class
## 1 1 2 2 2 2
## 2 1 2 2 1 2
## 3 1 2 2 2 2
## 4 2 2 1 1 2
## 5 2 2 1 2 2
## 6 1 2 1 1 2
## 7 2 1 1 1 2
## 8 1 2 1 1 2
## 9 2 2 2 2 2
## 10 2 2 1 1 2
# Create a variable to hold the classification value (1 = Negative, 2 = Positive)
train.classification <- my_df[trn_dat, 16]
head(train.classification, 5)
## [1] 2 1 2 2 2
test.classification <- my_df[-trn_dat, 16]
head(test.classification, 5)
## [1] 2 2 2 2 2
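A quick sanity check on a split like this is to confirm that the two index sets do not overlap, that together they cover every row, and that the class balance is similar in both subsets. The sketch below uses a hypothetical label vector `labels` with the same 200/320 class counts as the data set; `train_idx` and `test_idx` are illustrative names, not objects from the analysis above.

```r
set.seed(123)

# Hypothetical labels: 200 negatives (1) and 320 positives (2)
labels <- c(rep(1, 200), rep(2, 320))
n <- length(labels)

# 70/30 split on row indices
train_idx <- sample(seq_len(n), size = round(0.70 * n), replace = FALSE)
test_idx  <- setdiff(seq_len(n), train_idx)

# The two sets should not overlap and should cover every row
stopifnot(length(intersect(train_idx, test_idx)) == 0)
stopifnot(length(train_idx) + length(test_idx) == n)

# Compare class proportions in each subset
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[test_idx])))
```

If the proportions differ noticeably between the subsets, a stratified split (sampling within each class separately) is one possible alternative.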
K is the number of nearest neighbors that vote on the class of each test observation. One common rule of thumb for choosing K is to take the square root of the number of observations in the training set.
# Get the number of rows in the training data set
num_trn_data <- NROW(train.data)
num_trn_data
## [1] 364
# Determine the K value
calc_k_val <- sqrt(num_trn_data)
calc_k_val
## [1] 19.07878
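The square-root rule is only a starting point. With two classes, an odd K makes tied votes less likely, so one option is to take the floor of the square root and move to the next odd number when the result is even. `suggest_k()` below is a hypothetical helper written for illustration; it is not part of the `class` package.

```r
# Hypothetical helper: square-root rule of thumb, nudged to an odd number
suggest_k <- function(n_train) {
  k <- floor(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1  # an odd k reduces tied votes with two classes
  max(k, 1)
}

# 364 training rows, as in the split above
print(suggest_k(364))
```

A more thorough approach would compare test or cross-validated accuracy across a range of K values instead of relying on the rule alone.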
The calculated K value is 19.07878, which is rounded to 19. The model can now be fit with this K value.
library(class) # provides knn()
knn.19 <- knn(train = train.data, test = test.data, cl = train.classification, k = 19)
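Conceptually, `knn()` classifies each test row by finding the `k` training rows closest to it and taking a majority vote of their labels. Note that `train.data` and `test.data` as built above still contain the `class` column (column 16), so the label itself enters the distance calculation; passing only the 15 feature columns, for example `train.data[, -16]` and `test.data[, -16]`, would avoid that. The sketch below illustrates both ideas on made-up data. `simple_knn()` and `toy` are hypothetical and are not how `class::knn()` is implemented.

```r
# Minimal KNN sketch in base R: Euclidean distance plus majority vote
simple_knn <- function(train_x, test_x, train_labels, k) {
  apply(test_x, 1, function(row) {
    # Euclidean distance from this test row to every training row
    dists <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Labels of the k closest training rows
    nearest <- train_labels[order(dists)[seq_len(k)]]
    # Majority vote; ties resolve to the first label in sorted order
    votes <- table(nearest)
    names(votes)[which.max(votes)]
  })
}

# Hypothetical toy data: two symptom columns coded 1/2, label in the last column
toy <- cbind(
  s1    = c(1, 1, 1, 2, 2, 2),
  s2    = c(1, 1, 2, 2, 2, 1),
  class = c(1, 1, 1, 2, 2, 2)
)
label_col <- ncol(toy)

# Drop the label column so it cannot influence the distances
train_x <- toy[, -label_col, drop = FALSE]
test_x  <- rbind(c(s1 = 1, s2 = 1), c(s1 = 2, s2 = 2))

pred <- simple_knn(train_x, test_x, toy[, label_col], k = 3)
print(pred)
```

The predictions come back as character codes because they are read from the names of the vote table; `as.integer(pred)` converts them back to 1 and 2.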
Now that the model has been fit, its accuracy on the test set is evaluated.
model.acc <- 100 * sum(test.classification==knn.19)/NROW(test.classification)
model.acc
## [1] 91.02564
The accuracy of the model on the test set is approximately 91.03%, which corresponds to 142 of the 156 test observations being classified correctly.
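Overall accuracy does not show which kind of error the model makes. A confusion matrix built with `table()` separates false positives from false negatives. The sketch below uses short hypothetical vectors `actual` and `predicted` (coded 1 = Negative, 2 = Positive, as in the data above) rather than the real model output; in the analysis itself the equivalent call would be `table(Predicted = knn.19, Actual = test.classification)`.

```r
# Hypothetical labels: 1 = Negative, 2 = Positive
actual    <- c(2, 2, 1, 1, 2, 1, 2, 2)
predicted <- c(2, 1, 1, 1, 2, 2, 2, 2)

# Rows are predictions, columns are the true classes
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy is the share of observations on the diagonal, in percent
accuracy <- 100 * sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)

# Sensitivity: share of true positives that were predicted positive
sensitivity <- conf_mat["2", "2"] / sum(conf_mat[, "2"])
print(sensitivity)
```

For a screening task such as this one, sensitivity is often worth reporting alongside accuracy, since a missed positive case is usually the more costly error.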