Suppose we are working for Target as a data scientist and we are tasked with predicting which customers are pregnant based mainly on demographic and physical patterns observed in the publically available NHANES data. (A real data scientist for Target would also have buying profiles!) For this problem we want to predict PregnantNow using the characteristics: Age, Education, HHIncomeMid, MaritalStatus, Bmi, and Height.
Preliminary steps:
forcats::fct_recode to recode PregnantNow to have two levels (yes, no) and make the unknown level NAs.forcats::fct_collapse to recode MaritalStatus to just be married or other for all other levels.NHANESf that only contains complete cases (i.e. no NAs) for the variables described above that we will use in this problem. Use NHANESf in to answer the questions below!> NHANESf <-NHANES %>%
+ select('PregnantNow', 'Age', 'Education', 'HHIncomeMid', 'MaritalStatus', 'BMI', 'Height', 'Gender') %>%
+ mutate(PregnantNow = fct_recode(PregnantNow, NULL = 'Unknown', Yes = 'Yes', No = 'No'),
+ MaritalStatus = fct_recode(MaritalStatus, Married = 'Married', Other = "Divorced", Other = "LivePartner", Other = "NeverMarried", Other = "Separated", Other = "Widowed")) %>%
+ drop_na() %>%
+ filter(Gender == 'female')
Make a training data set of 1000 random cases and a test data set of the remaining cases (should be 505 cases). Set a seed so that we can reproduce our random training set.
> set.seed(20200219)
> train_index <- sample(nrow(NHANESf), 1000)
>
> NHANESf_train <- NHANESf %>% slice(train_index)
> NHANESf_test <- NHANESf %>% slice(-train_index)
Using NHANESf data, fit a logistic model to model pregnancies using the set of specificied explanatory variables. Draw the ROC curve and double density curves for our model.
> train_results <- glm(PregnantNow ~ Age + Education + HHIncomeMid + MaritalStatus + BMI + Height, data = NHANESf_train, family = binomial)
>
> NHANESf_test <- NHANESf_test %>%
+ mutate(prob = predict(train_results, newdata = NHANESf_test, type = "response"))
>
> eval_fun <- function(t, test) {
+ data <- test %>%
+ mutate(prediction = ifelse(prob > t, "Yes", "No")) %>%
+ summarize(threshold=t,
+ accuracy = mean(PregnantNow == prediction),
+ precision = sum(PregnantNow == "Yes" & prediction == "No")/sum(prediction == "Yes"),
+ sensitivity = sum(PregnantNow == "Yes" & prediction == "Yes")/sum(PregnantNow == "Yes"),
+ specificity = sum(PregnantNow == "No" & prediction == "No")/sum(PregnantNow == "No"))
+ return(data)
+ }
> t_vals <- seq(0.1, 0.9, by=0.1)
> eval_df <- map_df(t_vals, eval_fun, test=NHANESf_test)
ROC curve
> ggplot(eval_df, aes(x=1-specificity, y=sensitivity, color = threshold)) +
+ geom_point() +
+ geom_path() +
+ geom_text(aes(label = threshold), color = "black", vjust = "inward", hjust = "inward", check_overlap = TRUE ) +
+ labs(x="false positive rate", y="true positive rate", title="ROC curve for logistic") +
+ geom_abline(slope=1,intercept=0, linetype=3)
Double density curve
> ggplot(NHANESf_test, aes(x=prob, fill=PregnantNow)) +
+ geom_density(alpha = .3) +
+ labs(x="predicted CKD probability", fill = "truth", title="Forecasted Pregnant probabilities using t=0.5 decision threshold")
Use a threshold of 0.5 to predict pregnancies. Compute the confusion matrix, accuracy, sensitivity and specificity. Then repeat these calculations using a threshold of 0.05.
t = 0.5
> conf_mat
truth
predict Yes No
Yes 22 483
When t = 0.5, accuracy = 0.0435, sensitivity = 1, specificity = 0
t = 0.05
> conf_mat
truth
predict Yes No
Yes 22 483
When t = 0.5, accuracy = 0.0435, sensitivity = 1, specificity = 0
Correctly identifying women who were pregnant at a high rate: high sensitivity, we can assign either 0.5 or 0.05 since the sensitivity of both thresholds are 1.
Low chance of misclassifying a woman who was not pregnant: high specificity. In this case both 0.5 threshold and 0.05 threshold do not perform well. Using a threshold of 0.9 will actually get us to reach high specificity.
The overall rate of pregnancies in our data constructed in part a should be around 5%. Suppose Barb, the lazy data scientist, decided simply to classify woman as “pregnant” based on this 5% rate (since, hey, it will result in about 5% of her predictions being pregnant which matchs the rate in the data!). Compute the confusion matrix, accuracy, sensitivity and specificity. Make sure to explain/show our work for these calculations.
Accuracy = 0.9202658, Sensitvity = 0.1066667, specificity = 0.9629371
Fit a knn classifier to this data using k=3 using the four quantitative predictors age, bmi, income and height. Compute the accuracy, sensitivity and specificity of the predictions from the test set.
# A tibble: 1 x 5
k accuracy sensitivity specificity precision
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.949 0.227 0.981 0.227
For the values of k below. Plot accuracy, sensitivity and specificity of the predictions from the test set against k and comment on which k might be “best”.
Yes No
61 1444
If we prefer high prcision or high sensitivity, k = 1 is the best
If we prefer high accuracy or high specificity, every k from (1, 3, 5, 7, … 21) except k = 3 are preferrable values.