Starting here, I renamed the dataset and a few variables. For some reason, my arm.circ and age columns imported in a weird format, so I had to rename those along with gender.

#loading the packages used in this post
library(class)        #knn
library(gmodels)      #CrossTable
library(randomForest)
library(caret)        #createMultiFolds, trainControl, train
library(doSNOW)       #parallel backend (also attaches snow for makeCluster)

#renaming the dataset
#changing column names to better formats
#converting GENDER to a factor with FEMALE/MALE labels
tbd <- TBD
names(tbd)[1] <- "AGE"
names(tbd)[2] <- "GENDER"
names(tbd)[14] <- "ARM.CIRC"
#the label order assumes the lower gender code corresponds to female
tbd$GENDER <- factor(tbd$GENDER, labels = c("FEMALE", "MALE"))
colnames(tbd)
##  [1] "AGE"       "GENDER"    "PULSE"     "SYSTOLIC"  "DIASTOLIC" "HDL"      
##  [7] "LDL"       "WHITE"     "RED"       "PLATE"     "WEIGHT"    "HEIGHT"   
## [13] "WAIST"     "ARM.CIRC"  "BMI"
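
A quick sanity check I'd add here (output omitted): tabulating the factor confirms the FEMALE/MALE counts survived the conversion.

#confirming the gender counts after the factor conversion
table(tbd$GENDER)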

Here I used a z-score function to standardize all of the numeric features. Z-score normalization works better in this case than the percentile approach.

#normalizing all fourteen numeric features with z-scores
#(column 2, GENDER, is dropped since it is the class label)
tbdz <- as.data.frame(scale(tbd[-2]))
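
To convince yourself scale() is doing a plain z-score, here is a minimal check against the manual formula (my own sketch, using AGE as the example column):

#scale() should match (x - mean(x)) / sd(x) for every column
manual_z <- (tbd$AGE - mean(tbd$AGE)) / sd(tbd$AGE)
all.equal(tbdz$AGE, manual_z)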

This is where I used random sampling to create a training set with 200 observations and a test set with the remaining 100. I also created vectors of gender to use with the training and test sets as labels.

#creating a training set with 200 random observations and a test set with the remaining 100
#set seed so the sample can be reproduced
set.seed(100)
trainidx <- sample.int(n = nrow(tbdz), size = 200, replace = FALSE)
tbtrain <- tbdz[trainidx, ]
tbtest  <- tbdz[-trainidx, ]
#keeping gender aside so we can use it as the class labels
trainlabels <- tbd[trainidx, 2]
testlabels  <- tbd[-trainidx, 2]
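
Since the split is random, it is worth a quick look at the gender balance of the two label vectors (a sanity check I'd run here; output omitted):

#comparing the FEMALE/MALE balance of the training and test labels
table(trainlabels)
table(testlabels)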

Using the knn function from “class” to classify the test set by gender. knn defaults to a single nearest neighbor, so I set k = 3 explicitly, and I tested 5 and 9 nearest neighbors too.

#producing a prediction model and confusion matrix for 3 nearest neighbors
testpred3 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 3)
testpred3
##   [1] MALE   MALE   FEMALE MALE   MALE   FEMALE FEMALE MALE   FEMALE FEMALE
##  [11] FEMALE MALE   MALE   MALE   FEMALE FEMALE FEMALE MALE   FEMALE FEMALE
##  [21] FEMALE MALE   FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE
##  [31] FEMALE MALE   FEMALE FEMALE MALE   MALE   MALE   MALE   FEMALE MALE  
##  [41] MALE   MALE   FEMALE MALE   MALE   MALE   MALE   MALE   MALE   FEMALE
##  [51] MALE   FEMALE MALE   MALE   MALE   FEMALE MALE   FEMALE MALE   FEMALE
##  [61] MALE   FEMALE MALE   FEMALE MALE   FEMALE FEMALE MALE   MALE   FEMALE
##  [71] MALE   FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE MALE   FEMALE
##  [81] MALE   FEMALE MALE   MALE   FEMALE FEMALE MALE   FEMALE MALE   FEMALE
##  [91] FEMALE FEMALE MALE   MALE   FEMALE MALE   MALE   MALE   FEMALE FEMALE
## Levels: FEMALE MALE
CrossTable(x=testlabels, y=testpred3, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | testpred3 
##   testlabels |    FEMALE |      MALE | Row Total | 
## -------------|-----------|-----------|-----------|
##       FEMALE |        31 |        25 |        56 | 
##              |     0.554 |     0.446 |     0.560 | 
##              |     0.585 |     0.532 |           | 
##              |     0.310 |     0.250 |           | 
## -------------|-----------|-----------|-----------|
##         MALE |        22 |        22 |        44 | 
##              |     0.500 |     0.500 |     0.440 | 
##              |     0.415 |     0.468 |           | 
##              |     0.220 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        53 |        47 |       100 | 
##              |     0.530 |     0.470 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

We were wrong 47% of the time. In 25% of the test cases this model predicted male when the person was actually female, and in 22% it predicted female when the person was actually male.
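
The overall error rate can also be read straight off the predictions without the full table (a one-line sketch that should agree with the 47% above):

#fraction of test cases where the k = 3 prediction disagrees with the truth
mean(testpred3 != testlabels)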

#producing a prediction model and confusion matrix for 5 nearest neighbors
testpred5 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 5)
testpred5
##   [1] MALE   MALE   FEMALE MALE   FEMALE FEMALE FEMALE MALE   FEMALE FEMALE
##  [11] FEMALE MALE   MALE   MALE   FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE
##  [21] FEMALE MALE   FEMALE FEMALE FEMALE FEMALE FEMALE MALE   MALE   FEMALE
##  [31] FEMALE MALE   MALE   FEMALE FEMALE MALE   MALE   MALE   FEMALE MALE  
##  [41] MALE   MALE   MALE   MALE   MALE   MALE   MALE   MALE   MALE   FEMALE
##  [51] MALE   FEMALE MALE   MALE   MALE   FEMALE MALE   MALE   MALE   FEMALE
##  [61] MALE   FEMALE MALE   MALE   MALE   FEMALE FEMALE MALE   MALE   FEMALE
##  [71] MALE   MALE   MALE   FEMALE FEMALE FEMALE FEMALE FEMALE MALE   FEMALE
##  [81] FEMALE MALE   MALE   FEMALE FEMALE FEMALE MALE   FEMALE MALE   FEMALE
##  [91] MALE   FEMALE MALE   MALE   FEMALE MALE   FEMALE MALE   FEMALE MALE  
## Levels: FEMALE MALE
CrossTable(x=testlabels,y=testpred5)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | testpred5 
##   testlabels |    FEMALE |      MALE | Row Total | 
## -------------|-----------|-----------|-----------|
##       FEMALE |        31 |        25 |        56 | 
##              |     0.631 |     0.583 |           | 
##              |     0.554 |     0.446 |     0.560 | 
##              |     0.646 |     0.481 |           | 
##              |     0.310 |     0.250 |           | 
## -------------|-----------|-----------|-----------|
##         MALE |        17 |        27 |        44 | 
##              |     0.804 |     0.742 |           | 
##              |     0.386 |     0.614 |     0.440 | 
##              |     0.354 |     0.519 |           | 
##              |     0.170 |     0.270 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        48 |        52 |       100 | 
##              |     0.480 |     0.520 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

This model, using 5 nearest neighbors, was wrong 42% of the time. In 25% of the test cases it predicted male when the person was actually female, and in 17% it predicted female when the person was actually male.

#producing a prediction model and confusion matrix for 9 nearest neighbors
testpred9 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 9)
testpred9
##   [1] MALE   MALE   FEMALE MALE   MALE   FEMALE FEMALE MALE   FEMALE FEMALE
##  [11] FEMALE MALE   MALE   FEMALE FEMALE MALE   FEMALE FEMALE FEMALE FEMALE
##  [21] FEMALE MALE   FEMALE FEMALE FEMALE FEMALE FEMALE MALE   FEMALE MALE  
##  [31] MALE   MALE   FEMALE FEMALE FEMALE FEMALE MALE   FEMALE FEMALE MALE  
##  [41] FEMALE MALE   MALE   MALE   MALE   MALE   MALE   MALE   MALE   FEMALE
##  [51] MALE   FEMALE FEMALE MALE   FEMALE MALE   FEMALE MALE   MALE   FEMALE
##  [61] MALE   FEMALE MALE   MALE   MALE   FEMALE MALE   FEMALE MALE   FEMALE
##  [71] MALE   FEMALE MALE   FEMALE FEMALE MALE   FEMALE FEMALE FEMALE FEMALE
##  [81] FEMALE MALE   MALE   FEMALE FEMALE MALE   MALE   FEMALE MALE   FEMALE
##  [91] FEMALE MALE   MALE   MALE   MALE   MALE   FEMALE MALE   MALE   FEMALE
## Levels: FEMALE MALE
CrossTable(x=testlabels,y=testpred9)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | testpred9 
##   testlabels |    FEMALE |      MALE | Row Total | 
## -------------|-----------|-----------|-----------|
##       FEMALE |        34 |        22 |        56 | 
##              |     1.036 |     1.078 |           | 
##              |     0.607 |     0.393 |     0.560 | 
##              |     0.667 |     0.449 |           | 
##              |     0.340 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
##         MALE |        17 |        27 |        44 | 
##              |     1.319 |     1.373 |           | 
##              |     0.386 |     0.614 |     0.440 | 
##              |     0.333 |     0.551 |           | 
##              |     0.170 |     0.270 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        51 |        49 |       100 | 
##              |     0.510 |     0.490 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

With 9 nearest neighbors, the model was incorrect 39% of the time. In 22% of the test cases it predicted male when the person was actually female, and in 17% it predicted female when the person was actually male.
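
Instead of rerunning knn by hand for each k, a small loop could sweep a range of values at once. This is just a sketch of how I'd automate the comparison above; odd k values avoid ties, and the exact errors depend on the split:

#sweeping several odd k values and collecting the test error for each
ks <- c(1, 3, 5, 7, 9, 11, 15)
errs <- sapply(ks, function(k) {
  pred <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = k)
  mean(pred != testlabels)
})
data.frame(k = ks, error = errs)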

Using the randomForest package to predict gender with the default number of trees.

#random forest with the default 500 trees
#fit on the 100-observation test set; accuracy is judged by the out-of-bag
#(OOB) error, since each tree is grown on a bootstrap sample that leaves out
#roughly a third of the rows
rftest <- tbtest
rflabel <- testlabels
set.seed(1234)
rf1 <- randomForest(x = rftest, y = rflabel, importance = TRUE, ntree = 500)
rf1
## 
## Call:
##  randomForest(x = rftest, y = rflabel, ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 44%
## Confusion matrix:
##        FEMALE MALE class.error
## FEMALE     38   18   0.3214286
## MALE       26   18   0.5909091

There is an OOB estimated error rate of 44%.
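
Since the forest was grown with importance = TRUE, we could also peek at which features drive the gender predictions (sketch; output and plot not shown):

#mean decrease in accuracy and Gini for each feature
importance(rf1)
varImpPlot(rf1)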

Starting cross-validation.

#using the caret package to run 5-fold cross-validation, repeated 10 times
set.seed(1234)
folds5 <- createMultiFolds(rflabel, k = 5, times = 10)
#overall class balance
table(rflabel)
## rflabel
## FEMALE   MALE 
##     56     44
#spot-checking one fold to confirm the class balance is roughly preserved
table(rflabel[folds5[[33]]])
## 
## FEMALE   MALE 
##     45     35
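
For reference, createMultiFolds returns one index vector per fold per repeat, so with k = 5 and times = 10 there should be 50 of them (quick check, output omitted):

#5 folds x 10 repeats = 50 resampling index vectors
length(folds5)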

#setting up trainControl
control <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                        index = folds5)


#using doSNOW to parallelize train, because it grows lots and lots of trees
#(kinda like a forest, LOL... I wonder if it ever rains in there)
clust <- makeCluster(3, type = "SOCK")
registerDoSNOW(clust)
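
I hard-coded 3 workers; on a different machine you could size the cluster to the hardware instead (an alternative I did not run):

#alternative: leave one core free and use the rest
#clust <- makeCluster(parallel::detectCores() - 1, type = "SOCK")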

#set seed for reproducibility and train
set.seed(45678)
rf5 <- train(x = rftest, y = rflabel, method = "rf", tuneLength = 3,
                   ntree = 500, trControl = control)
#stop cluster
stopCluster(clust)
#Check out results
rf5
## Random Forest 
## 
## 100 samples
##  14 predictor
##   2 classes: 'FEMALE', 'MALE' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 81, 80, 80, 79, 80, 80, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.5838622  0.1352300
##    8    0.5808095  0.1353332
##   14    0.5779574  0.1321624
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The accuracy of the model is between 57.8% and 58.4% for mtry = 2, 8, and 14.
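
caret can also plot the tuning profile, which makes the small differences between mtry values easier to see (sketch; plot not shown):

#accuracy vs. mtry across the repeated-CV resamples
plot(rf5)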

#refitting a plain random forest with mtry = 8 to compare its OOB error
set.seed(1000)
formtry <- randomForest(x = rftest, y = rflabel, importance = TRUE, ntree = 500, mtry = 8)
formtry
## 
## Call:
##  randomForest(x = rftest, y = rflabel, ntree = 500, mtry = 8,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 39%
## Confusion matrix:
##        FEMALE MALE class.error
## FEMALE     40   16   0.2857143
## MALE       23   21   0.5227273

The OOB estimated error rate is 39%, which ties the 9-nearest-neighbors model for the lowest error rate.

Overall, the random forest OOB estimate with mtry = 8 and the 9 nearest neighbors gave us the least error. In my opinion, 39% is still a lot of error. I wish we could get lower, but for now, 39% will have to do.
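
To wrap up, here are the error rates reported above collected into one small table (hand-assembled from the numbers in this post):

#recap of the error rates from the confusion matrices and OOB estimates
data.frame(model = c("kNN k=3", "kNN k=5", "kNN k=9", "RF default", "RF mtry=8"),
           error = c(0.47, 0.42, 0.39, 0.44, 0.39))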

P.S. I have enjoyed this class, our Zoom meetings, and getting to meet everyone’s pets. I hope you have a great summer and that this all ends soon.