Classification 1

Harold Nelson

11/12/2020

KNN and Naive Bayes

The task is to predict the gender of a person based on other characteristics.

This document works through two model types, K Nearest Neighbors and Naive Bayes.

The Data

I’ll use a sample of records from the Behavioral Risk Factor Surveillance System (BRFSS) conducted by the Centers for Disease Control and Prevention (CDC). The data is available from Openintro.org; see https://www.openintro.org/book/statdata/?data=cdc2. It is in Moodle as cdc2.Rdata. This version has already been through the cleaning and augmentation process we worked through earlier.

Load the data.

load("cdc2.Rdata")

Packages

Make a few packages available.

library(class)
library(naivebayes)
## naivebayes 0.9.7 loaded
library(broom)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(formula.tools)

Task 1

Scaling the data. Height and weight need to be scaled so that both are on a 0 - 1 scale. Use the procedure from the Datacamp course to do this. Check the range of both before and after the rescaling. This task needs to be done before splitting the data.

Answer

mean(cdc2$weight)
## [1] 169.676
print(range(cdc2$weight))
## [1]  68 500
print(range(cdc2$height))
## [1] 48 84
cdc2$weight = (cdc2$weight - min(cdc2$weight))/(max(cdc2$weight) - min(cdc2$weight))
cdc2$height = (cdc2$height - min(cdc2$height))/(max(cdc2$height) - min(cdc2$height))
print(range(cdc2$weight))
## [1] 0 1
print(range(cdc2$height))
## [1] 0 1
mean(cdc2$weight)
## [1] 0.235361
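The two rescaling lines apply the same min-max formula, so it can be wrapped in a small helper. The sketch below uses a hypothetical function name, normalize, and a handful of made-up weights rather than cdc2. It also checks that the rescaled mean reported above is what the formula gives when applied to the original mean, using the observed range of 68 to 500.

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1.
# `normalize` is a hypothetical helper name, not taken from the code above.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up weights for illustration only (68 and 500 match the observed range).
w <- c(68, 150, 170, 284, 500)
w_scaled <- normalize(w)
print(range(w_scaled))

# The transformation is linear, so the mean is rescaled the same way.
# 169.676 is the mean weight printed before rescaling.
scaled_mean <- (169.676 - 68) / (500 - 68)
print(round(scaled_mean, 6))
```

Because the rescaling uses the minimum and maximum of the full data set, doing it before the split keeps train and test on the same scale.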

Task 2

Splitting the data

We want to split our data into train and test. We will build the models on train and evaluate their performance on test. Randomly select about 75% of the data for train and 25% for test.

Follow the process in the Mount-Zumel course in Datacamp to create train and test subsets.

Answer

print(N <- nrow(cdc2))
## [1] 19997
print(target <- round(.75 * N))
## [1] 14998
set.seed(123)
gp <- runif(N)
train = cdc2[gp < .75,]
test = cdc2[gp >= .75,]

nrow(train)
## [1] 15064
nrow(test)
## [1] 4933
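The train set has 15064 rows rather than the target of 14998 because each row is assigned independently with probability 0.75, so the split is only approximately 75/25. The sketch below contrasts that approach with an exact-size split on a made-up data frame. The name toy and the sample() variant are assumptions for illustration, not part of the course procedure.

```r
# Illustration on a made-up data frame; `toy` stands in for cdc2.
set.seed(123)
N <- 1000
toy <- data.frame(id = seq_len(N))

# Approach used above: each row lands in train with probability 0.75,
# so the train size is only approximately 75% of N.
gp <- runif(N)
train_approx <- toy[gp < 0.75, , drop = FALSE]
test_approx  <- toy[gp >= 0.75, , drop = FALSE]

# Alternative (an assumption, not the course's procedure): draw exactly
# round(0.75 * N) row indices without replacement.
target <- round(0.75 * N)
idx <- sample(N, target)
train_exact <- toy[idx, , drop = FALSE]
test_exact  <- toy[-idx, , drop = FALSE]

print(c(approx = nrow(train_approx), exact = nrow(train_exact)))
```

Either way, every row ends up in exactly one of the two subsets, which is what matters for an honest test-set evaluation.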

Task 3

KNN Model 1

Use select to create train_a and test_a. These dataframes contain only gender, height and weight in that order.

Follow the procedure in the Datacamp course to create a knn model for gender with k = 1. Build the model using train and measure its performance on test. Note that the variable gender, which has column index 1, must be removed from the train dataframe and placed in a separate variable called gender. Note the way sign-types was used in the Datacamp course.

Compute the accuracy of this model using the procedure in the Datacamp course.

Answer

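train_a = train[c("gender","height","weight")]
test_a = test[c("gender","height","weight")]
# Keep the training labels separately; knn() takes them through cl
gender = train_a$gender
# Drop column 1 (gender) from the predictors; k = 1 is the knn() default
k_1 <- knn(train = train_a[-1], test = test_a[-1], cl = gender, k = 1)
# Accuracy: share of test rows whose predicted label matches the true one
mean(test_a$gender == k_1)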
## [1] 0.8489763

Task 4

Repeat the process to create and measure the performance of a model with k = 5.

Answer

k_5 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=5)
mean(test_a$gender == k_5)
## [1] 0.8548551

Task 5

Try k = 50

Answer

k_50 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=50)
mean(test_a$gender == k_50)
## [1] 0.8583012

Task 6

Try k = 100

Answer

k_100 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=100)
mean(test_a$gender == k_100)
## [1] 0.8552605

Accuracy rose from k = 1 through k = 50 and then fell slightly between k = 50 and k = 100 (from 0.8583 to 0.8553).
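Tasks 3 through 6 repeat the same two lines with a different k each time, so the comparison can also be written as a loop. The sketch below is one way to do that, run on simulated data because cdc2 is not bundled here. The data frame toy, the helper rescale, and the simulated means and standard deviations are all assumptions made for illustration.

```r
library(class)  # provides knn(); ships with standard R installations

# Simulated stand-in for cdc2: two numeric predictors and a two-level label.
set.seed(123)
n <- 400
gender <- factor(rep(c("m", "f"), each = n / 2), levels = c("m", "f"))
height <- rnorm(n, mean = ifelse(gender == "m", 70, 64), sd = 3)
weight <- rnorm(n, mean = ifelse(gender == "m", 190, 150), sd = 30)

# Min-max rescale both predictors, as in Task 1 (`rescale` is hypothetical).
rescale <- function(x) (x - min(x)) / (max(x) - min(x))
toy <- data.frame(gender, height = rescale(height), weight = rescale(weight))

# Random split into roughly 75% train and 25% test, as in Task 2.
gp <- runif(n)
toy_train <- toy[gp < 0.75, ]
toy_test  <- toy[gp >= 0.75, ]

# Fit knn for each k and record the test accuracy.
k_values <- c(1, 5, 50, 100)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = toy_train[-1], test = toy_test[-1],
              cl = toy_train$gender, k = k)
  mean(pred == toy_test$gender)
})
names(acc) <- paste0("k=", k_values)
print(acc)
```

The accuracies printed here describe the simulated data only and should not be compared with the cdc2 results.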

Task 7

Let’s stick with knn but add some variables to the mix. Create dataframes train_b and test_b. Add smoke100 and exerany (the answer below leaves out genhlth). Try k values of 1, 5, 50, and 100.

Answer

train_b = train[c("gender","height","weight","smoke100","exerany")]
test_b = test[c("gender","height","weight","smoke100","exerany")]
gender = train_b$gender
k_1 <- knn(train = train_b[-1], test = test_b[-1], cl = gender, k = 1)
mean(test_b$gender == k_1)
## [1] 0.8343807
k_5 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=5)
mean(test_b$gender == k_5)
## [1] 0.8530306
k_50 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=50)
mean(test_b$gender == k_50)
## [1] 0.8544496
k_100 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=100)
mean(test_b$gender == k_100)
## [1] 0.8532333

Again, the performance deteriorated between 50 and 100.

The most successful knn model uses just height and weight with k = 50 (accuracy 0.8583).

Task 8

The Baseline

Our data is not evenly divided by gender. Use the simple table command on train and test to see the true distribution of gender. Divide by nrow() to get proportions.

Answer

table(train$gender)/nrow(train)
## 
##         m         f 
## 0.4791556 0.5208444
table(test$gender)/nrow(test)
## 
##         m         f 
## 0.4759781 0.5240219

If you were forced to guess a person’s gender with no information on characteristics, you should pick “f” since you’d be right more often than wrong. Your accuracy would be about 52%. Any other reported accuracy should be compared with the accuracy of this totally uninformed guess.
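The baseline can be read off the proportions directly. The short sketch below copies the test-set proportions from the table above into a vector (the variable names are hypothetical) and compares the best knn accuracy reported earlier against that baseline.

```r
# Test-set proportions copied from the table above.
test_props <- c(m = 0.4759781, f = 0.5240219)

# The uninformed guess is the most common class; its share is the baseline.
baseline_class <- names(which.max(test_props))
baseline_acc <- max(test_props)
print(baseline_class)
print(round(baseline_acc, 3))

# Gain of the knn model with height and weight at k = 50 over the baseline.
gain <- 0.8583012 - baseline_acc
print(round(gain, 3))
```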

Task 9

Let’s consider using one of the categorical variables, smoke100, as a predictor of gender.

Use naive_bayes with this variable and measure the accuracy on the test data.

Answer

NB1 = naive_bayes(gender ~ smoke100, data = train)
NB1_predict = predict(NB1,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB1_predict == test$gender)
accuracy
## [1] 0.5552402
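With a single categorical predictor, the naive Bayes calculation is small enough to do by hand: multiply the class prior by the conditional probability of the observed predictor value, then pick the class with the larger product. The sketch below works this through in base R on ten made-up rows. It illustrates the idea and is not a reproduction of what naive_bayes does internally.

```r
# Made-up training data: a gender label and a 0/1 predictor.
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
)

# P(gender) and P(smoke100 | gender), estimated from counts.
prior <- prop.table(table(toy$gender))
lik   <- prop.table(table(toy$gender, toy$smoke100), margin = 1)

# Bayes' rule: prior * likelihood, renormalised within each smoke100 value.
joint     <- lik * as.vector(prior)
posterior <- sweep(joint, 2, colSums(joint), "/")
print(round(posterior, 3))

# Predicted class for each value of smoke100.
pred <- rownames(posterior)[apply(posterior, 2, which.max)]
names(pred) <- colnames(posterior)
print(pred)
```

A two-level predictor allows only two distinct predictions, one per level, which may help explain why the accuracy of about 0.555 sits so close to the 0.52 baseline.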

Task 10

Add a second categorical variable, genhlth. Does this improve the accuracy?

Answer

NB2 = naive_bayes(gender ~ smoke100 + genhlth, data = train)
NB2_predict = predict(NB2,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB2_predict == test$gender)
accuracy
## [1] 0.5507805

The accuracy actually declined slightly. This might seem impossible, but it illustrates the importance of using test data to measure accuracy. More complex models frequently perform worse on test data than simpler models. The phenomenon is known as overfitting.

Task 11

Let’s use a quantitative variable, height. Do this with naive_bayes and measure the accuracy.

Answer

NB3 = naive_bayes(gender ~ height , data = train)
NB3_predict = predict(NB3,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB3_predict == test$gender)
accuracy
## [1] 0.8394486

This is a substantial improvement over the categorical variable models we tried.
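For a numeric predictor, naive_bayes by default (that is, without kernel density estimation) describes each class with a normal distribution, so the prediction comes down to comparing prior times normal density across classes. The sketch below does that by hand on simulated heights in inches. The helper predict_gender and the simulated means and standard deviation are assumptions for illustration only.

```r
# Simulated heights (inches) for illustration; not the cdc2 data.
set.seed(123)
n <- 500
gender <- factor(rep(c("m", "f"), each = n / 2), levels = c("m", "f"))
height <- rnorm(n, mean = ifelse(gender == "m", 70, 64), sd = 3)

# Per-class prior, mean and standard deviation estimated from the data.
prior <- prop.table(table(gender))
mu    <- tapply(height, gender, mean)
sigma <- tapply(height, gender, sd)

# Hypothetical helper: pick the class with the larger prior * normal density.
predict_gender <- function(h) {
  score_m <- prior[["m"]] * dnorm(h, mu[["m"]], sigma[["m"]])
  score_f <- prior[["f"]] * dnorm(h, mu[["f"]], sigma[["f"]])
  factor(ifelse(score_m > score_f, "m", "f"), levels = c("m", "f"))
}

print(predict_gender(c(60, 67, 75)))
print(mean(predict_gender(height) == gender))  # accuracy on the simulated data
```

Unlike a 0/1 predictor, a numeric predictor lets the model place a cut point anywhere along the scale, which is one plausible reason height does so much better here than smoke100.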

Task 12

Add a second quantitative variable, weight; then measure the accuracy.

Answer

NB4 = naive_bayes(gender ~ height + weight, data = train)
NB4_predict = predict(NB4,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB4_predict == test$gender)
accuracy
## [1] 0.8378269

Again, we see a slight loss of accuracy with the more complex model.

Task 13

Try weight alone. Compare the accuracy with the previous results.

Answer

NB5 = naive_bayes(gender ~ weight, data = train)
NB5_predict = predict(NB5,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB5_predict == test$gender)
accuracy
## [1] 0.713359

The best naive Bayes model in this set is clearly height alone, with a test accuracy of about 0.839.

The Final Question

We did several variations on two types of models, knn and naive bayes. Which model was the best of all of these?
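One way to approach this question is to gather every test accuracy reported above into a single named vector and sort it. The labels in the sketch below are hypothetical shorthand for the models; the numbers are copied from the outputs in this document.

```r
# Test accuracies copied from the outputs reported in this document.
acc <- c(
  "knn height+weight k=1"                    = 0.8489763,
  "knn height+weight k=5"                    = 0.8548551,
  "knn height+weight k=50"                   = 0.8583012,
  "knn height+weight k=100"                  = 0.8552605,
  "knn height+weight+smoke100+exerany k=1"   = 0.8343807,
  "knn height+weight+smoke100+exerany k=5"   = 0.8530306,
  "knn height+weight+smoke100+exerany k=50"  = 0.8544496,
  "knn height+weight+smoke100+exerany k=100" = 0.8532333,
  "nb smoke100"                              = 0.5552402,
  "nb smoke100+genhlth"                      = 0.5507805,
  "nb height"                                = 0.8394486,
  "nb height+weight"                         = 0.8378269,
  "nb weight"                                = 0.713359
)

# Rank the models from most to least accurate on the test data.
print(sort(acc, decreasing = TRUE))
best <- names(which.max(acc))
print(best)
```

The differences among the top few models are small, so the ranking might change with a different random split.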