Harold Nelson
11/12/2020
The task is to predict the gender of a person based on other characteristics?
This document works through two model types, K Nearest Neighbors and Naive Bayes.
This section of the document repeats material from other documents so that it will stand alone by itself. It also does some data cleaning.
If you have recently read this material, you should skip down to Task 1.
I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC). The data I used is available from Openintro.org. See https://www.openintro.org/book/statdata/?data=cdc. It is in Moodle.
Make a few packages available.
## naivebayes 0.9.7 loaded
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Determine the structure of the object CDC.
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
Before cleaning, create a new variable BMI, the body mass index. This corrects for the influence of height on weight in looking for obesity.
The body mass index (BMI) is a measure which incorprates both height and weight.
The standard interpetation of this measure is as follows:
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(0,18.5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(0,18.5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. : 8.128 Min. :0.2667 Underweight : 411
## 1st Qu.:22.71 1st Qu.: 21.727 1st Qu.:0.8710 Normal :8321
## Median :25.60 Median : 23.746 Median :0.9444 Overweight :7296
## Mean :26.31 Mean : 23.971 Mean :0.9268 Obese :3541
## 3rd Qu.:28.89 3rd Qu.: 25.799 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :100.407 Max. :3.7778
## BMIDesCat ageCat
## Underweight : 404 18-31:5087
## Normal :12325 32-43:5263
## Overweight : 6451 44-57:4787
## Obese : 798 58-99:4863
## Morbidly Obese: 22
##
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good 1 0 0 93 179 100 31 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743 good 1 1 1 52 68 68 44 f
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 18743 17.67899 17.67899 1 Underweight Underweight 44-57
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874 good 0 1 0 69 180 680 24 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 16874 26.57845 100.4075 3.777778 Overweight Morbidly Obese 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 10034 very good 1 1 1 73 290 601 56 m
## 13086 good 0 1 1 62 300 300 48 f
## 13607 very good 0 1 0 69 350 350 33 f
## 16874 good 0 1 0 69 180 680 24 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 10034 38.25671 79.28373 2.072414 Obese Morbidly Obese 44-57
## 13086 54.86472 54.86472 1.000000 Morbidly Obese Morbidly Obese 44-57
## 13607 51.68032 51.68032 1.000000 Morbidly Obese Morbidly Obese 32-43
## 16874 26.57845 100.40748 3.777778 Overweight Morbidly Obese 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good 1 0 0 93 179 100 31 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
## Rejects
## FALSE TRUE
## 19997 3
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6970 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5674 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68 Min. :18.00 m: 9566
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175 3rd Qu.:57.00
## Max. :84.00 Max. :500.0 Max. :350 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. :10.44 Min. :0.2667 Underweight : 410
## 1st Qu.:22.71 1st Qu.:21.73 1st Qu.:0.8710 Normal :8321
## Median :25.60 Median :23.75 Median :0.9444 Overweight :7295
## Mean :26.31 Mean :23.96 Mean :0.9266 Obese :3540
## 3rd Qu.:28.89 3rd Qu.:25.80 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :54.86 Max. :1.9681
## BMIDesCat ageCat
## Underweight : 403 18-31:5085
## Normal :12325 32-43:5263
## Overweight : 6451 44-57:4786
## Obese : 798 58-99:4863
## Morbidly Obese: 20
##
Scaling the data. Height and weight need to be scaled so that both are on a 0 - 1 scale. Use the procedure from the Datacamp course to do this. Check the range of both before and after the rescaling. This task needs to be done before splitting the data.
## [1] 169.676
## [1] 68 500
## [1] 48 84
cdc$weight = (cdc$weight - min(cdc$weight))/(max(cdc$weight) - min(cdc$weight))
cdc$height = (cdc$height - min(cdc$height))/(max(cdc$height) - min(cdc$height))
print(range(cdc$weight))
## [1] 0 1
## [1] 0 1
## [1] 0.235361
Splitting the data
We want to split our data into train and test. We will build the model on train and evaluate its performance on test. Randomly select about 75% of the data for train and 25% for test.
Follow the process in the Mount-Zumel course in Datacamp to create train and test subsets.
## [1] 0.235361
## [1] 19997
## [1] 14998
## [1] 15064
## [1] 4933
## [1] 0.2353177
KNN Model 1
Use select to create train_a and test_a. These dataframes contain only gender, height and weight in that order.
Follow the procedure in the Datacamp course to create a knn model for gender with k = 1. Build the model using train and measure its performace on test. Note that the variable gender, which has column index 1, must be removed from the train dataframe and placed in a separate variable called gender. Note the way sign-types was used in the Datacamp course.
Compute the accuracy of this model using the procedure in the Datacamp course.
train_a = train[c("gender","height","weight")]
test_a = test[c("gender","height","weight")]
gender = train_a$gender
k_1 <- knn(train = train_a[-1], test = test_a[-1], cl = gender)
mean(test_a$gender == k_1)
## [1] 0.8489763
## [1] 0.2353177
Repeat the process to create and measure the performance of a model with k = 5.
## [1] 0.8548551
Try k = 50
## [1] 0.8583012
Try k = 100
k_100 <- knn(train = train_a[-1], test = test_a[-1], cl = gender,k=100)
mean(test_a$gender == k_100)
## [1] 0.8552605
The performance deteriorated between k values of 50 and 100.
Let’s stick with knn but add some variables to the mix. Create dataframes train_b and test_b. Add genhlth, smoke100 and exerany. Try k values of 1, 5, 50, and 100.
train_b = train[c("gender","height","weight","smoke100","exerany")]
test_b = test[c("gender","height","weight","smoke100","exerany")]
gender = train_b$gender
k_1 <- knn(train = train_b[-1], test = test_b[-1], cl = gender)
mean(test_b$gender == k_1)
## [1] 0.8343807
## [1] 0.2353177
## [1] 0.8530306
## [1] 0.8544496
k_100 <- knn(train = train_b[-1], test = test_b[-1], cl = gender,k=100)
mean(test_b$gender == k_100)
## [1] 0.8532333
Again, the performance deteriorated between 50 and 100.
The most successful model was with just height and weight with k = 50.
The Baseline
Our data is not evenly divided by gender. Use the simple table command on train and test to see the true distribution of gender. Divide by nrow() to get proportions.
##
## m f
## 0.4791556 0.5208444
##
## m f
## 0.4759781 0.5240219
If you were forced to guess a person’s gender with no information on characteristics, you should pick “f” since you’d be write more often than wrong. Your accuracy would be about 52%. Any other reported accuracy should be compared with the accuracy of this totally uninformed guess.
Let’s consider using one of the categorical variables, smoke100 as a predictor of gender.
Use naive_bayes with this variable and measure the accuracy on the test data.
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] 0.5552402
Add a second categorical variable, genhlth. Does this improve the accuracy?
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] 0.5507805
The accuracy actually had a slight decline. This might seem to be impossible, but it illustrates the importance of using test data to measure accuracy. More complex models frequently do this relative to simpler models. The phenomenon is known as overfitting.
Let’s use a quantitative variable, height. Do this with naive_bayes and measure the accuracy.
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] 0.8394486
This is a substantial improvement over the categorical variable models we tried.
Add a second quantitative variable, weight; then Measure the accuracy.
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] 0.8378269
Again, we see a slight loss of accuracy with the more complex model.
Try weight alone. Compare the accuracy with the previous results.
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] 0.713359
The best model of this set is clearly height alone.
We did several variations on two types of models, knn and naive bayes. Which model was the best of all of these?