Background

This study have 2 objectives:

The first objective is to validate the hypothesis: “Height is different between Male and Female”.
The second objective is to build a model to predict Sex from Height.

The dataset is available in the package dslabs. There are two variables Sex and Height.

Content

Import required package for this study

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0       v purrr   0.3.2  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.3

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

Import the dataset

library(dslabs)

## Warning: package 'dslabs' was built under R version 3.5.3

data("heights")

Take an overlook on the dataset:

Create the train_set and test_set data.

y <- heights$sex
x <- heights$height
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
set.seed(2)
train_set <- heights[-test_index,]
test_set <- heights[test_index,]

Create a first simple model by sample function

y_hat <- sample(c("Male","Female"), length(test_index), replace = T) %>% factor(levels = levels(test_set$sex))

Calculate the accuracy as below code

mean(y_hat == test_set$sex)

## [1] 0.5009524

The accuracy of our model as 0.52(52%) is low. But we can do better. Let’s look at our data again.

heights %>% group_by(sex) %>% summarize(mean(height), sd(height))

## # A tibble: 2 x 3
##   sex    `mean(height)` `sd(height)`
##   <fct>           <dbl>        <dbl>
## 1 Female           64.9         3.76
## 2 Male             69.3         3.61

The Male is higher than the Female. Now we will create a 2nd model to predict the result within 2 standard deviation.

y_hat2 <- ifelse(x>62,"Male","Female") %>% factor(levels = levels(test_set$sex))

mean(y == y_hat2)

## [1] 0.7933333

Now the accuracy is better, it’s way up from 50% to 80% which we can see above.

But if we refer to the figure 1, we can see that from height > 62inches, there still have female in the sample. This can cause an overfitting by wrong prediction on Female with condition height > 62 inches.

We can do better by build up a set of model with condition height is over from 61 to 70 inches.

cutoff <- seq(61,70)
accuracy <- map_dbl(cutoff, function(x){
  y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
    factor(levels = levels(test_set$sex))
  mean(y_hat == train_set$sex)
})
data.frame(cutoff,accuracy) %>% ggplot(aes(x = cutoff, y = accuracy)) +geom_line() + geom_label(aes(label = round(accuracy, digits = 2)))+labs(x = "height condition", y = "accuracy", title = "Figure 3 - Accuracy by height minimum condition") + theme_minimal()

max(accuracy)

## [1] 0.832381

The maximum accuracy is 83% at the cutoff point 64 inches.

best_cutoff <- cutoff[which.max(accuracy)]
best_cutoff

## [1] 64

Now we can test at cutoff point 64 inches with our test_set data.

y_hat3 <- ifelse(test_set$height > best_cutoff, "Male","Female") %>%
  factor(levels = levels(test_set$sex))
y_hat3 <- factor(y_hat3)
mean(y_hat3 == test_set$sex)

## [1] 0.8209524

CONFUSION MATRIX

table(predicted = y_hat, actual = test_set$sex)

##          actual
## predicted Female Male
##    Female     57  200
##    Male       62  206

test_set %>% 
  mutate(y_hat3 = y_hat3) %>%
  group_by(sex) %>%
  summarize(accuracy = mean(y_hat3 == sex))

## # A tibble: 2 x 2
##   sex    accuracy
##   <fct>     <dbl>
## 1 Female    0.429
## 2 Male      0.936

We can see the accuracy for Male is 94%, but Female accuracy is only 43%. This is due to prevalance.

prev <- mean(y == "Male")
prev

## [1] 0.7733333

Let’s continue to the definition in model validation, sensitivity and specificity. Actually Positive: +TP: True Possitive and Predicted Positive +FN: False Negative and Predicted Negative Actually Negative: +FP: False Positives and Predicted Positive + TN: True Negatives and Predicted Negative Sensitivity = TP/(TP + FN) Specificity = TN/(TN + FP)

Picture 1 - Specificity and Sensitivity

confusionMatrix(data = y_hat3, reference = test_set$sex)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     51   26
##     Male       68  380
##                                           
##                Accuracy : 0.821           
##                  95% CI : (0.7854, 0.8528)
##     No Information Rate : 0.7733          
##     P-Value [Acc > NIR] : 0.004489        
##                                           
##                   Kappa : 0.4165          
##                                           
##  Mcnemar's Test P-Value : 2.349e-05       
##                                           
##             Sensitivity : 0.42857         
##             Specificity : 0.93596         
##          Pos Pred Value : 0.66234         
##          Neg Pred Value : 0.84821         
##              Prevalence : 0.22667         
##          Detection Rate : 0.09714         
##    Detection Prevalence : 0.14667         
##       Balanced Accuracy : 0.68227         
##                                           
##        'Positive' Class : Female          
##

Another metric to evaluate overall accuracy, which can include both Sensitivity and Specificity. In fact, the F1 score, is widely used one number summary. F1-Score = 2x(precision x recall)/(precision + recall)

F_1 <- map_dbl(cutoff, function(x){
  y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
    factor(levels = levels(test_set$sex))
  F_meas(data = y_hat, reference = factor(train_set$sex))
})

data.frame(cutoff,F_1) %>% ggplot(aes(x = cutoff, y = F_1)) + geom_line() + geom_label(aes(label = round(F_1, digits = 2))) + labs(x = "cutoff",y = "F_1 score", title = "Figure 4 - F_1 score by cutoff") + theme_minimal()

max(F_1)

## [1] 0.6373626

best_cutoff2 <- cutoff[which.max(F_1)]
best_cutoff2

## [1] 66

p <- 0.9
y_hat4 <- sample(c("Male","Female"), length(test_index), replace = T, prob = c(p, 1 - p)) %>% factor(levels = levels(test_set$sex))
mean(y_hat4 == test_set$sex)

## [1] 0.6990476

Receiver Operating Characteristic (ROC) curve sensitivity (TPR) versus 1 - specificity (FPR)

cutoffs <- c(50, seq(60,75),80)
height_cutoff <- map_df(cutoffs, function(x){
  y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
    factor(levels = c("Female","Male"))
  list(method = "Height cutoff",
       FPR = 1-specificity(y_hat, test_set$sex),
       TPR = sensitivity(y_hat, test_set$sex))
})

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.