This assignment will make use of a published tutorial to build a knn model of gender classification for the cdc data using the caret package.

Load required libraries and the data.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)

Problem 1

Load the cdc data, do the cleaning and transformations. Do not do the splitting. Run str() and summary to verify your results.

# Place your code here. 
load("/cloud/project/cdc.Rdata")

cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))

str(cdc)
## 'data.frame':    20000 obs. of  15 variables:
##  $ genhlth    : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany    : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100   : num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height     : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight     : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire   : int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age        : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender     : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
##  $ BMI        : num  25.1 21.5 20.5 21.3 28.3 ...
##  $ BMIDes     : num  25.1 19.7 20.5 20 24.6 ...
##  $ DesActRatio: num  1 0.92 1 0.939 0.867 ...
##  $ BMICat     : Factor w/ 5 levels "Underweight",..: 3 2 2 2 3 2 3 3 3 3 ...
##  $ BMIDesCat  : Factor w/ 5 levels "Underweight",..: 3 2 2 2 2 2 3 3 2 2 ...
##  $ ageCat     : Factor w/ 4 levels "18-31","32-43",..: 4 2 3 2 3 3 1 3 1 3 ...

Problem 2

Use caret to split the data into training and testing dataframes with a 70/30 split. Use set.seed(2343) to facilitate replicability. Run dim() on both to verify your results.

nrows = nrow(cdc)

rows <- sample(nrows)

cdc = cdc[rows,]


split <- round(nrow(cdc) * .7)

traincdc = cdc[1:split,]
testcdc = cdc[(split+1):nrows,]

dim(traincdc)
## [1] 14000    15
dim(testcdc)
## [1] 6000   15

Problem 3

Build a knn model, knn_fit, using height and weight to predict gender using the train function in caret. in the call to trainControl reduce number to 5 and repeats to 2. Then use the remainder of the code in the example to compute the confusion matrix and accuracy for this model.

trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

knn_fit <- train(gender ~ height+weight, data = traincdc, method = "knn",
 trControl=trctrl,
 preProcess = c("center", "scale"),
 tuneLength = 10)

predcdc <- predict(knn_fit, newdata = testcdc)
matrix <- table(testcdc$gender,predcdc)
matrix
##    predcdc
##        m    f
##   m 2398  462
##   f  379 2761
accuracy <- sum(matrix[1,1]+matrix[2,2])/sum(matrix[1,1]+matrix[1,2]+matrix[2,1]+matrix[2,2])
accuracy
## [1] 0.8598333

Problem 4

Repeat the process in Problem 3, adding exerany and smoke100 to the model. Did this improve the accuracy of the model?

This did improve the accuracy slighty.

# Place your code here.
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

knn_fit <- train(gender ~ height+weight+exerany+smoke100, data = traincdc, method = "knn",
 trControl=trctrl,
 preProcess = c("center", "scale"),
 tuneLength = 10)

predcdc <- predict(knn_fit, newdata = testcdc)
matrix <- table(testcdc$gender,predcdc)
matrix
##    predcdc
##        m    f
##   m 2413  447
##   f  400 2740
accuracy <- sum(matrix[1,1]+matrix[2,2])/sum(matrix[1,1]+matrix[1,2]+matrix[2,1]+matrix[2,2])
accuracy
## [1] 0.8588333

Problem 5.

Repeat Problem 4, adding a variable of your choice which captures the extra sensitivity of women to weight. Does this improve the accuracy?

This improved the accuracy of the model, more than adding smoke100 and exerany.

# Place your code here.
traincdc$wtdesiredif <- traincdc$wtdesire - traincdc$weight
testcdc$wtdesiredif <- testcdc$wtdesire - testcdc$weight

trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

knn_fit <- train(gender ~ height+weight+exerany+smoke100+wtdesiredif, data = traincdc, method = "knn",
 trControl=trctrl,
 preProcess = c("center", "scale"),
 tuneLength = 10)

predcdc <- predict(knn_fit, newdata = testcdc)
matrix <- table(testcdc$gender,predcdc)
matrix
##    predcdc
##        m    f
##   m 2556  304
##   f  297 2843
accuracy <- sum(matrix[1,1]+matrix[2,2])/sum(matrix[1,1]+matrix[1,2]+matrix[2,1]+matrix[2,2])
accuracy
## [1] 0.8998333