Classification 1

Harold Nelson

11/12/2020

KNN and Naive Bayes

The task is to predict the gender of a person based on other characteristics.

This document works through two model types, K Nearest Neighbors and Naive Bayes.

The Data

This section repeats material from other documents so that this one can stand alone. It also does some data cleaning.

If you have recently read this material, you should skip down to Task 1.

I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC). The data I used is available from Openintro.org. See https://www.openintro.org/book/statdata/?data=cdc. It is in Moodle.

Load the data.

load("cdc.Rdata")

Packages

Make a few packages available.

library(class)
library(naivebayes)
## naivebayes 0.9.7 loaded
library(broom)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(formula.tools)

Structure

Determine the structure of the object cdc.

str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

BMI

Before cleaning, create a new variable BMI, the body mass index. The BMI incorporates both height and weight, correcting for the influence of height on weight when looking for obesity. With weight in pounds and height in inches, BMI = 703 * weight / height^2.

The standard interpretation of this measure, using the same category breaks as the code below, is: Underweight (below 18.5), Normal (18.5 to 24.9), Overweight (24.9 to 29.9), Obese (29.9 to 39.9), and Morbidly Obese (above 39.9).

New Variables

# Actual and desired BMI (weight in pounds, height in inches)
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
# Ratio of desired BMI to actual BMI
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
# Categorical versions of actual and desired BMI
cdc$BMICat = cut(cdc$BMI,c(0,18.5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(0,18.5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
# Age quartiles as a categorical variable
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))

Examine the data and look for anomalies

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00            
##       BMI            BMIDes         DesActRatio                BMICat    
##  Min.   :12.40   Min.   :  8.128   Min.   :0.2667   Underweight   : 411  
##  1st Qu.:22.71   1st Qu.: 21.727   1st Qu.:0.8710   Normal        :8321  
##  Median :25.60   Median : 23.746   Median :0.9444   Overweight    :7296  
##  Mean   :26.31   Mean   : 23.971   Mean   :0.9268   Obese         :3541  
##  3rd Qu.:28.89   3rd Qu.: 25.799   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :100.407   Max.   :3.7778                        
##           BMIDesCat       ageCat    
##  Underweight   :  404   18-31:5087  
##  Normal        :12325   32-43:5263  
##  Overweight    : 6451   44-57:4787  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   22               
## 

Look at the records which have unusual values.

cdc[cdc$height==93,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good       1        0        0     93    179      100  31      m
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534 14.54931 8.128107   0.5586592 Underweight Underweight  18-31
cdc[cdc$weight==68,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743    good       1        1        1     52     68       68  44      f
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 18743 17.67899 17.67899           1 Underweight Underweight  44-57
cdc[cdc$wtdesire==680,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874    good       0        1        0     69    180      680  24      m
##            BMI   BMIDes DesActRatio     BMICat      BMIDesCat ageCat
## 16874 26.57845 100.4075    3.777778 Overweight Morbidly Obese  18-31
cdc[cdc$BMIDes > 50,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 10034 very good       1        1        1     73    290      601  56      m
## 13086      good       0        1        1     62    300      300  48      f
## 13607 very good       0        1        0     69    350      350  33      f
## 16874      good       0        1        0     69    180      680  24      m
##            BMI    BMIDes DesActRatio         BMICat      BMIDesCat ageCat
## 10034 38.25671  79.28373    2.072414          Obese Morbidly Obese  44-57
## 13086 54.86472  54.86472    1.000000 Morbidly Obese Morbidly Obese  44-57
## 13607 51.68032  51.68032    1.000000 Morbidly Obese Morbidly Obese  32-43
## 16874 26.57845 100.40748    3.777778     Overweight Morbidly Obese  18-31
cdc[cdc$BMIDes < 10,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good       1        0        0     93    179      100  31      m
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534 14.54931 8.128107   0.5586592 Underweight Underweight  18-31

Remove Anomalies

# Reject records with an implausible desired BMI: below 10, or above 50 when it also exceeds the actual BMI
Rejects = cdc$BMIDes < 10 | 
   (cdc$BMIDes > 50 & cdc$DesActRatio > 1.0) 
table(Rejects)
## Rejects
## FALSE  TRUE 
## 19997     3
Keepers = !Rejects
cdc = cdc[Keepers,]
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6970   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5674   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire        age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68   Min.   :18.00   m: 9566  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175   3rd Qu.:57.00            
##  Max.   :84.00   Max.   :500.0   Max.   :350   Max.   :99.00            
##       BMI            BMIDes       DesActRatio                BMICat    
##  Min.   :12.40   Min.   :10.44   Min.   :0.2667   Underweight   : 410  
##  1st Qu.:22.71   1st Qu.:21.73   1st Qu.:0.8710   Normal        :8321  
##  Median :25.60   Median :23.75   Median :0.9444   Overweight    :7295  
##  Mean   :26.31   Mean   :23.96   Mean   :0.9266   Obese         :3540  
##  3rd Qu.:28.89   3rd Qu.:25.80   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :54.86   Max.   :1.9681                        
##           BMIDesCat       ageCat    
##  Underweight   :  403   18-31:5085  
##  Normal        :12325   32-43:5263  
##  Overweight    : 6451   44-57:4786  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   20               
## 

Task 1

Scaling the data. Height and weight need to be scaled so that both are on a 0 - 1 scale. Use the procedure from the Datacamp course to do this. Check the range of both before and after the rescaling. This task needs to be done before splitting the data.

Answer

mean(cdc$weight)
## [1] 169.676
print(range(cdc$weight))
## [1]  68 500
print(range(cdc$height))
## [1] 48 84
cdc$weight = (cdc$weight - min(cdc$weight))/(max(cdc$weight) - min(cdc$weight))
cdc$height = (cdc$height - min(cdc$height))/(max(cdc$height) - min(cdc$height))
print(range(cdc$weight))
## [1] 0 1
print(range(cdc$height))
## [1] 0 1
mean(cdc$weight)
## [1] 0.235361
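
The min-max rescaling done inline above can also be wrapped in a small helper function. This is just a sketch of the same computation; the inline code above is what was actually run.

# A reusable min-max rescaling helper (a sketch, equivalent to the inline code above)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# e.g. cdc$weight <- normalize(cdc$weight)
#      cdc$height <- normalize(cdc$height)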

Task 2

Splitting the data

We want to split our data into train and test. We will build the model on train and evaluate its performance on test. Randomly select about 75% of the data for train and 25% for test.

Follow the process in the Mount-Zumel course in Datacamp to create train and test subsets.

Answer

mean(cdc$weight)
## [1] 0.235361
print(N <- nrow(cdc))
## [1] 19997
print(target <- round(.75 * N))
## [1] 14998
set.seed(123)
gp <- runif(N)
train = cdc[gp < .75,]
test = cdc[gp >= .75,]

nrow(train)
## [1] 15064
nrow(test)
## [1] 4933
mean(train$weight)
## [1] 0.2353177
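
Because the split uses a runif() threshold, train ends up with approximately, not exactly, 75% of the rows (15064 rather than the target of 14998). If an exact 75/25 split were wanted, one alternative is to sample row indices directly. This is a sketch only; the runif() split above is the one used in the rest of the document, and train_exact and test_exact are illustrative names.

# Sketch of an exact-size split using sample()
set.seed(123)
idx <- sample(N, size = target)
train_exact <- cdc[idx, ]
test_exact <- cdc[-idx, ]
nrow(train_exact)   # exactly target = 14998 rows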

Task 3

KNN Model 1

Use select to create train_a and test_a. These dataframes contain only gender, height and weight in that order.

Follow the procedure in the Datacamp course to create a KNN model for gender with k = 1. Build the model using train and measure its performance on test. Note that the variable gender, which has column index 1, must be removed from the train dataframe and placed in a separate variable called gender. Note the way sign_types was used in the Datacamp course.

Compute the accuracy of this model using the procedure in the Datacamp course.

Answer

train_a = train[c("gender","height","weight")]
test_a = test[c("gender","height","weight")]
gender = train_a$gender
k_1 <- knn(train = train_a[-1], test = test_a[-1], cl = gender)
mean(test_a$gender == k_1)
## [1] 0.8489763
mean(train_a$weight)
## [1] 0.2353177
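
Accuracy is a single summary number. To see where the k = 1 model goes wrong, a confusion matrix from table() breaks the test predictions down by actual and predicted gender. This is an optional sketch; the task only asked for accuracy.

# Confusion matrix for the k = 1 model: rows are actual gender, columns are predictions
table(actual = test_a$gender, predicted = k_1)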

Task 4

Repeat the process to create and measure the performance of a model with k = 5.

Answer

k_5 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=5)
mean(test_a$gender == k_5)
## [1] 0.8548551

Task 5

Try k = 50

k_50 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=50)
mean(test_a$gender == k_50)
## [1] 0.8583012

Task 6

Try k = 100

k_100 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=100)
mean(test_a$gender == k_100)
## [1] 0.8552605

The performance deteriorated between k values of 50 and 100.
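
Rather than trying one k at a time, a short loop can sweep several candidate values and record the test accuracy of each. This is a sketch of that idea; the individual runs above are what the tasks required, and because knn() resolves ties at random, repeated runs can differ slightly.

# Sketch: sweep several k values for the height/weight model and collect test accuracy
k_values <- c(1, 5, 10, 25, 50, 75, 100)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_a[-1], test = test_a[-1], cl = gender, k = k)
  mean(test_a$gender == pred)
})
data.frame(k = k_values, accuracy = acc)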

Task 7

Let’s stick with knn but add some variables to the mix. Create dataframes train_b and test_b. Add genhlth, smoke100 and exerany. Try k values of 1, 5, 50, and 100.

Answer

train_b = train[c("gender","height","weight","smoke100","exerany")]
test_b = test[c("gender","height","weight","smoke100","exerany")]
gender = train_b$gender
k_1 <- knn(train = train_b[-1], test = test_b[-1], cl = gender)
mean(test_b$gender == k_1)
## [1] 0.8343807
mean(train_b$weight)
## [1] 0.2353177
k_5 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=5)
mean(test_b$gender == k_5)
## [1] 0.8530306
k_50 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=50)
mean(test_b$gender == k_50)
## [1] 0.8544496
k_100 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=100)
mean(test_b$gender == k_100)
## [1] 0.8532333

Again, the performance deteriorated between k values of 50 and 100.

The most successful KNN model used just height and weight with k = 50.
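
Note that genhlth, although mentioned in the task, is a factor and was left out of train_b, since knn() from the class package expects numeric predictors. If we did want to include it, one option is to expand the factor into 0/1 dummy columns with model.matrix(). The sketch below illustrates the idea; train_c, test_c and k_50c are hypothetical names, and this model is not part of the analysis above.

# Sketch: dummy-encode genhlth so it can be fed to knn(), which needs numeric inputs
train_c <- cbind(train_b[-1], model.matrix(~ genhlth - 1, data = train))
test_c <- cbind(test_b[-1], model.matrix(~ genhlth - 1, data = test))
k_50c <- knn(train = train_c, test = test_c, cl = gender, k = 50)
mean(test_b$gender == k_50c)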

Task 8

The Baseline

Our data is not evenly divided by gender. Use the simple table command on train and test to see the true distribution of gender. Divide by nrow() to get proportions.

Answer

table(train$gender)/nrow(train)
## 
##         m         f 
## 0.4791556 0.5208444
table(test$gender)/nrow(test)
## 
##         m         f 
## 0.4759781 0.5240219

If you were forced to guess a person’s gender with no information on other characteristics, you should pick “f”, since you’d be right more often than wrong. Your accuracy would be about 52%. Any other reported accuracy should be compared with the accuracy of this totally uninformed guess.
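
The same baseline accuracy can be computed directly as the proportion of “f” in the test set.

# Baseline accuracy on test: always guess the majority class "f"
mean(test$gender == "f")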

Task 9

Let’s consider using one of the categorical variables, smoke100, as a predictor of gender.

Use naive_bayes with this variable and measure the accuracy on the test data.

Answer

NB1 = naive_bayes(gender ~ smoke100, data = train)
NB1_predict = predict(NB1,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB1_predict == test$gender)
accuracy
## [1] 0.5552402
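
The warning from predict.naive_bayes() appears because the full test dataframe contains columns the model never saw; it is harmless, since prediction only uses the features in the model’s tables. One way to silence it is to pass only the model’s feature column(s) as newdata, as in this sketch (NB1_predict2 is just an illustrative name).

# Passing only the feature used by the model avoids the "more features in the newdata" warning
NB1_predict2 = predict(NB1, test["smoke100"])
mean(NB1_predict2 == test$gender)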

Task 10

Add a second categorical variable, genhlth. Does this improve the accuracy?

Answer

NB2 = naive_bayes(gender ~ smoke100 + genhlth, data = train)
NB2_predict = predict(NB2,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB2_predict == test$gender)
accuracy
## [1] 0.5507805

The accuracy actually declined slightly. This might seem impossible, but it illustrates the importance of measuring accuracy on test data. More complex models frequently perform worse on test data than simpler models do; the phenomenon is known as overfitting.

Task 11

Let’s use a quantitative variable, height. Do this with naive_bayes and measure the accuracy.

Answer

NB3 = naive_bayes(gender ~ height , data = train)
NB3_predict = predict(NB3,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB3_predict == test$gender)
accuracy
## [1] 0.8394486

This is a substantial improvement over the categorical variable models we tried.
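
To see why height alone does so well, printing the fitted model shows the estimated priors and the class-conditional mean and standard deviation of height for each gender. This is an optional look, not required by the task; remember that height was rescaled to the 0 - 1 range earlier, so the parameters are on that scale.

# Inspect the priors and the Gaussian parameters for height within each gender
print(NB3)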

Task 12

Add a second quantitative variable, weight; then measure the accuracy.

Answer

NB4 = naive_bayes(gender ~ height + weight, data = train)
NB4_predict = predict(NB4,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB4_predict == test$gender)
accuracy
## [1] 0.8378269

Again, we see a slight loss of accuracy with the more complex model.

Task 13

Try weight alone. Compare the accuracy with the previous results.

Answer

NB5 = naive_bayes(gender ~ weight, data = train)
NB5_predict = predict(NB5,test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
accuracy = mean(NB5_predict == test$gender)
accuracy
## [1] 0.713359

The best model of this set is clearly height alone.

The Final Question

We tried several variations on two types of models, KNN and Naive Bayes. Which model was the best of all of these?