Content
Import required package for this study
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.2
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Import the dataset
library(dslabs)
## Warning: package 'dslabs' was built under R version 3.5.3
data("heights")
Take an overlook on the dataset:
Create the train_set and test_set data.
y <- heights$sex
x <- heights$height
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
set.seed(2)
train_set <- heights[-test_index,]
test_set <- heights[test_index,]
Create a first simple model by sample function
y_hat <- sample(c("Male","Female"), length(test_index), replace = T) %>% factor(levels = levels(test_set$sex))
Calculate the accuracy as below code
mean(y_hat == test_set$sex)
## [1] 0.5009524
The accuracy of our model as 0.52(52%) is low. But we can do better. Let’s look at our data again.
heights %>% group_by(sex) %>% summarize(mean(height), sd(height))
## # A tibble: 2 x 3
## sex `mean(height)` `sd(height)`
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
The Male is higher than the Female. Now we will create a 2nd model to predict the result within 2 standard deviation.
y_hat2 <- ifelse(x>62,"Male","Female") %>% factor(levels = levels(test_set$sex))
mean(y == y_hat2)
## [1] 0.7933333
Now the accuracy is better, it’s way up from 50% to 80% which we can see above.
But if we refer to the figure 1, we can see that from height > 62inches, there still have female in the sample. This can cause an overfitting by wrong prediction on Female with condition height > 62 inches.
We can do better by build up a set of model with condition height is over from 61 to 70 inches.
cutoff <- seq(61,70)
accuracy <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat == train_set$sex)
})
data.frame(cutoff,accuracy) %>% ggplot(aes(x = cutoff, y = accuracy)) +geom_line() + geom_label(aes(label = round(accuracy, digits = 2)))+labs(x = "height condition", y = "accuracy", title = "Figure 3 - Accuracy by height minimum condition") + theme_minimal()
max(accuracy)
## [1] 0.832381
The maximum accuracy is 83% at the cutoff point 64 inches.
best_cutoff <- cutoff[which.max(accuracy)]
best_cutoff
## [1] 64
Now we can test at cutoff point 64 inches with our test_set data.
y_hat3 <- ifelse(test_set$height > best_cutoff, "Male","Female") %>%
factor(levels = levels(test_set$sex))
y_hat3 <- factor(y_hat3)
mean(y_hat3 == test_set$sex)
## [1] 0.8209524