I began by copying the dataset under a shorter name and renaming a few variables. For some reason, my arm.circ and age columns imported with oddly formatted names, so I had to rename those along with gender.
#copying the dataset under a shorter name
#changing column names to better name formats
#changing GENDER to factor with levels
tbd <- TBD
names(tbd)[2] <- "GENDER"
names(tbd)[14] <- "ARM.CIRC"
names(tbd)[1] <- "AGE"
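#converting the existing GENDER column to a factor instead of overwriting it:
#factor(c("FEMALE","MALE")) recycles an alternating FEMALE/MALE pattern down the rows
#ASSUMPTION: GENDER holds two codes and the lower one means FEMALE; check the data dictionary
#(the printed results further down were produced with the original assignment)
tbd$GENDER <- factor(tbd$GENDER, labels = c("FEMALE", "MALE"))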
colnames(tbd)
## [1] "AGE" "GENDER" "PULSE" "SYSTOLIC" "DIASTOLIC" "HDL"
## [7] "LDL" "WHITE" "RED" "PLATE" "WEIGHT" "HEIGHT"
## [13] "WAIST" "ARM.CIRC" "BMI"
Here I used z-score standardization to normalize all of the numeric features. Z-score normalization works better in this case than the percentile approach.
#using scale() to z-score all fourteen numeric features (every column except GENDER)
tbdz <- as.data.frame(scale(tbd[-2]))
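As a quick check of what scale() is doing, here is a small sketch on a made-up vector (the numbers in x are hypothetical, not from the dataset). It shows that scale() subtracts the mean and divides by the sample standard deviation, so the result has mean 0 and standard deviation 1.

```r
# Hypothetical values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for comparison
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # the two approaches agree
mean(z_scale)                 # approximately 0
sd(z_scale)                   # 1
```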
This is where I used random sampling to create a training set with 200 observations and a test set with the remaining 100 observations. I also created vectors of gender to go with the test and training sets as labels.
#creating train with 200 random observations and test with the remaining 100
#set seed so sample can be reproduced
set.seed(100)
#named train_idx so it does not mask the base sample() function
train_idx <- sample.int(n = nrow(tbdz), size = 200, replace = FALSE)
tbtrain <- tbdz[train_idx, ]
tbtest <- tbdz[-train_idx, ]
#adding gender back in so we can use it
trainlabels <- tbd[train_idx, 2]
testlabels <- tbd[-train_idx, 2]
Using the knn function from the “class” package to classify the test set by gender. The default in knn is k = 1 nearest neighbor; I tested 3, 5, and 9 nearest neighbors.
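To make the idea concrete, here is a minimal base R sketch of a nearest-neighbor vote. It is not how class::knn is implemented; knn_sketch and the toy data are hypothetical names made up for this illustration. Each test row gets the majority label among its k closest training rows by Euclidean distance.

```r
# Illustrative only: a bare-bones k-nearest-neighbor classifier
knn_sketch <- function(train, test, cl, k = 3) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  cl <- factor(cl)
  pred <- vapply(seq_len(nrow(test)), function(i) {
    # Euclidean distance from test row i to every training row
    d <- sqrt(rowSums(sweep(train, 2, test[i, ])^2))
    # Labels of the k closest training rows
    nearest <- cl[order(d)[seq_len(k)]]
    # Majority vote; ties go to the first level here,
    # whereas class::knn breaks ties at random
    names(which.max(table(nearest)))
  }, character(1))
  factor(pred, levels = levels(cl))
}

# Hypothetical toy data: two well-separated clusters
toy_train <- data.frame(x = c(0, 0.1, 0.2, 5, 5.1, 5.2),
                        y = c(0, 0.2, 0.1, 5, 5.2, 5.1))
toy_labels <- c("A", "A", "A", "B", "B", "B")
toy_test <- data.frame(x = c(0.05, 5.05), y = c(0.1, 5.1))

toy_pred <- knn_sketch(toy_train, toy_test, toy_labels, k = 3)
toy_pred
```

Because the vote depends on distances, features on larger scales would dominate without the z-score step above.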
#loading class for knn() and gmodels for CrossTable()
library(class)
library(gmodels)
#producing a prediction model and confusion matrix for 3 nearest neighbors
testpred3 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 3)
testpred3
## [1] MALE MALE FEMALE MALE MALE FEMALE FEMALE MALE FEMALE FEMALE
## [11] FEMALE MALE MALE MALE FEMALE FEMALE FEMALE MALE FEMALE FEMALE
## [21] FEMALE MALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE
## [31] FEMALE MALE FEMALE FEMALE MALE MALE MALE MALE FEMALE MALE
## [41] MALE MALE FEMALE MALE MALE MALE MALE MALE MALE FEMALE
## [51] MALE FEMALE MALE MALE MALE FEMALE MALE FEMALE MALE FEMALE
## [61] MALE FEMALE MALE FEMALE MALE FEMALE FEMALE MALE MALE FEMALE
## [71] MALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE MALE FEMALE
## [81] MALE FEMALE MALE MALE FEMALE FEMALE MALE FEMALE MALE FEMALE
## [91] FEMALE FEMALE MALE MALE FEMALE MALE MALE MALE FEMALE FEMALE
## Levels: FEMALE MALE
CrossTable(x=testlabels, y=testpred3, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | testpred3
## testlabels | FEMALE | MALE | Row Total |
## -------------|-----------|-----------|-----------|
## FEMALE | 31 | 25 | 56 |
## | 0.554 | 0.446 | 0.560 |
## | 0.585 | 0.532 | |
## | 0.310 | 0.250 | |
## -------------|-----------|-----------|-----------|
## MALE | 22 | 22 | 44 |
## | 0.500 | 0.500 | 0.440 |
## | 0.415 | 0.468 | |
## | 0.220 | 0.220 | |
## -------------|-----------|-----------|-----------|
## Column Total | 53 | 47 | 100 |
## | 0.530 | 0.470 | |
## -------------|-----------|-----------|-----------|
##
##
#producing a prediction model and confusion matrix for 5 nearest neighbors
testpred5 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 5)
testpred5
## [1] MALE MALE FEMALE MALE FEMALE FEMALE FEMALE MALE FEMALE FEMALE
## [11] FEMALE MALE MALE MALE FEMALE FEMALE FEMALE FEMALE FEMALE FEMALE
## [21] FEMALE MALE FEMALE FEMALE FEMALE FEMALE FEMALE MALE MALE FEMALE
## [31] FEMALE MALE MALE FEMALE FEMALE MALE MALE MALE FEMALE MALE
## [41] MALE MALE MALE MALE MALE MALE MALE MALE MALE FEMALE
## [51] MALE FEMALE MALE MALE MALE FEMALE MALE MALE MALE FEMALE
## [61] MALE FEMALE MALE MALE MALE FEMALE FEMALE MALE MALE FEMALE
## [71] MALE MALE MALE FEMALE FEMALE FEMALE FEMALE FEMALE MALE FEMALE
## [81] FEMALE MALE MALE FEMALE FEMALE FEMALE MALE FEMALE MALE FEMALE
## [91] MALE FEMALE MALE MALE FEMALE MALE FEMALE MALE FEMALE MALE
## Levels: FEMALE MALE
CrossTable(x=testlabels,y=testpred5)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | testpred5
## testlabels | FEMALE | MALE | Row Total |
## -------------|-----------|-----------|-----------|
## FEMALE | 31 | 25 | 56 |
## | 0.631 | 0.583 | |
## | 0.554 | 0.446 | 0.560 |
## | 0.646 | 0.481 | |
## | 0.310 | 0.250 | |
## -------------|-----------|-----------|-----------|
## MALE | 17 | 27 | 44 |
## | 0.804 | 0.742 | |
## | 0.386 | 0.614 | 0.440 |
## | 0.354 | 0.519 | |
## | 0.170 | 0.270 | |
## -------------|-----------|-----------|-----------|
## Column Total | 48 | 52 | 100 |
## | 0.480 | 0.520 | |
## -------------|-----------|-----------|-----------|
##
##
#producing a prediction model and confusion matrix for 9 nearest neighbors
testpred9 <- knn(train = tbtrain, test = tbtest, cl = trainlabels, k = 9)
testpred9
## [1] MALE MALE FEMALE MALE MALE FEMALE FEMALE MALE FEMALE FEMALE
## [11] FEMALE MALE MALE FEMALE FEMALE MALE FEMALE FEMALE FEMALE FEMALE
## [21] FEMALE MALE FEMALE FEMALE FEMALE FEMALE FEMALE MALE FEMALE MALE
## [31] MALE MALE FEMALE FEMALE FEMALE FEMALE MALE FEMALE FEMALE MALE
## [41] FEMALE MALE MALE MALE MALE MALE MALE MALE MALE FEMALE
## [51] MALE FEMALE FEMALE MALE FEMALE MALE FEMALE MALE MALE FEMALE
## [61] MALE FEMALE MALE MALE MALE FEMALE MALE FEMALE MALE FEMALE
## [71] MALE FEMALE MALE FEMALE FEMALE MALE FEMALE FEMALE FEMALE FEMALE
## [81] FEMALE MALE MALE FEMALE FEMALE MALE MALE FEMALE MALE FEMALE
## [91] FEMALE MALE MALE MALE MALE MALE FEMALE MALE MALE FEMALE
## Levels: FEMALE MALE
CrossTable(x=testlabels,y=testpred9)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | testpred9
## testlabels | FEMALE | MALE | Row Total |
## -------------|-----------|-----------|-----------|
## FEMALE | 34 | 22 | 56 |
## | 1.036 | 1.078 | |
## | 0.607 | 0.393 | 0.560 |
## | 0.667 | 0.449 | |
## | 0.340 | 0.220 | |
## -------------|-----------|-----------|-----------|
## MALE | 17 | 27 | 44 |
## | 1.319 | 1.373 | |
## | 0.386 | 0.614 | 0.440 |
## | 0.333 | 0.551 | |
## | 0.170 | 0.270 | |
## -------------|-----------|-----------|-----------|
## Column Total | 51 | 49 | 100 |
## | 0.510 | 0.490 | |
## -------------|-----------|-----------|-----------|
##
##
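To compare the three values of k, accuracy can be read straight off the tables above as the diagonal counts divided by the total. The sketch below retypes the counts from the three CrossTable outputs by hand (the object names cm3, cm5, cm9 and the accuracy helper are my own, not from a package).

```r
# Counts copied from the CrossTable outputs above (rows = actual, columns = predicted)
dn <- list(actual = c("FEMALE", "MALE"), predicted = c("FEMALE", "MALE"))
cm3 <- matrix(c(31, 22, 25, 22), nrow = 2, dimnames = dn)
cm5 <- matrix(c(31, 17, 25, 27), nrow = 2, dimnames = dn)
cm9 <- matrix(c(34, 17, 22, 27), nrow = 2, dimnames = dn)

# Accuracy = correctly classified (diagonal) / all observations
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc <- sapply(list(k3 = cm3, k5 = cm5, k9 = cm9), accuracy)
acc  # k3 = 0.53, k5 = 0.58, k9 = 0.61
```

So 9 nearest neighbors does best of the three, with a 39% error rate.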
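Using the randomForest package to predict gender with the default number of trees. Note that the forest is fit on the 100 observations in the test set, so the error reported is the out-of-bag (OOB) estimate.
#random forest with the default of 500 trees
library(randomForest)
rftest <- tbtest
rflabel <- testlabels
set.seed(1234)
rf1 <- randomForest(x = rftest, y = rflabel, importance = TRUE, ntree = 500)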
rf1
##
## Call:
## randomForest(x = rftest, y = rflabel, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 44%
## Confusion matrix:
## FEMALE MALE class.error
## FEMALE 38 18 0.3214286
## MALE 26 18 0.5909091
Starting cross-validation.
#using caret package to run 5-fold cross-validation, repeated 10 times
library(caret)
set.seed(1234)
folds5 <- createMultiFolds(rflabel, k = 5, times = 10)
table(rflabel)
## rflabel
## FEMALE MALE
## 56 44
table(rflabel[folds5[[33]]])
##
## FEMALE MALE
## 45 35
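The two tables above show that the folds keep roughly the same FEMALE/MALE mix as the full set (45/80 versus 56/100). Here is a base R sketch of how a stratified split can produce that. It is only an illustration of the idea, not caret's actual code; the seed, labels, and fold_id are made up for the example.

```r
set.seed(1)  # arbitrary seed, for illustration only

# Same class counts as rflabel: 56 FEMALE and 44 MALE
labels <- factor(rep(c("FEMALE", "MALE"), times = c(56, 44)))
k <- 5

# Assign fold numbers separately within each class so every fold
# gets about one fifth of the FEMALE rows and one fifth of the MALE rows
fold_id <- integer(length(labels))
for (lev in levels(labels)) {
  idx <- which(labels == lev)
  fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Proportion of FEMALE in each training set (all rows not held out)
train_prop <- sapply(seq_len(k), function(f) mean(labels[fold_id != f] == "FEMALE"))
train_prop  # each value is close to the overall 0.56
```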
#setting up trainControl
control <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
index = folds5)
#using doSNOW for train bc lots of trees, kinda like a forest LOL. I wonder if there is ever any rain...
#maybe it could be a random rain forest... LOL I NEED SLEEP
library(doSNOW)
clust <- makeCluster(3, type = "SOCK")
registerDoSNOW(clust)
#set seed for reproducibility and train
set.seed(45678)
rf5 <- train(x = rftest, y = rflabel, method = "rf", tuneLength = 3,
ntree = 500, trControl = control)
#stop cluster
stopCluster(clust)
#Check out results
rf5
## Random Forest
##
## 100 samples
## 14 predictor
## 2 classes: 'FEMALE', 'MALE'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 81, 80, 80, 79, 80, 80, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5838622 0.1352300
## 8 0.5808095 0.1353332
## 14 0.5779574 0.1321624
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
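The cross-validated accuracy is between 57.8% and 58.4% for mtry = 2, 8, and 14. caret picked mtry = 2, though the three values are nearly tied. Below I refit the forest with mtry = 8.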
set.seed(1000)
formtry <- randomForest(x = rftest, y = rflabel, importance = TRUE, ntree = 500, mtry = 8)
formtry
##
## Call:
## randomForest(x = rftest, y = rflabel, ntree = 500, mtry = 8, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 39%
## Confusion matrix:
## FEMALE MALE class.error
## FEMALE 40 16 0.2857143
## MALE 23 21 0.5227273
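To put the 39% in context, here is a small sketch that recomputes the OOB error from the confusion matrix above and compares it with the error from always guessing the larger class. The counts are retyped by hand from the output; oob and the other object names are my own.

```r
# Counts copied from the mtry = 8 confusion matrix (rows = actual, columns = predicted)
oob <- matrix(c(40, 23, 16, 21), nrow = 2,
              dimnames = list(actual = c("FEMALE", "MALE"),
                              predicted = c("FEMALE", "MALE")))

# Overall OOB error: off-diagonal counts / total
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error, matching the class.error column
class_error <- 1 - diag(oob) / rowSums(oob)

# Baseline: always predict the majority class (FEMALE, 56 of 100)
baseline_error <- 1 - max(rowSums(oob)) / sum(oob)

oob_error       # 0.39
class_error     # FEMALE about 0.286, MALE about 0.523
baseline_error  # 0.44
```

The forest beats the always-FEMALE baseline by only five percentage points, and most of its mistakes are on MALE.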
Overall, the OOB estimate for the random forest with mtry = 8 and the 9-nearest-neighbor model gave the least error, 39% each. In my opinion, 39% is still a lot of error. I wish we could get less, but for now, 39% will have to do.
PS. I have enjoyed this class and enjoyed our zoom meetings and being able to meet everyone’s pets. I hope you have a great summer and this all ends soon.