Suppose we are working for Target as a data scientist and we are tasked with predicting which customers are pregnant based mainly on demographic and physical patterns observed in the publically available NHANES data. (A real data scientist for Target would also have buying profiles!) For this problem we want to predict PregnantNow using the characteristics: Age, Education, HHIncomeMid, MaritalStatus, Bmi, and Height.

Preliminary steps:

> NHANESf <-NHANES %>% 
+   select('PregnantNow', 'Age', 'Education', 'HHIncomeMid',  'MaritalStatus', 'BMI', 'Height', 'Gender') %>%
+   mutate(PregnantNow = fct_recode(PregnantNow, NULL = 'Unknown', Yes = 'Yes', No = 'No'), 
+         MaritalStatus = fct_recode(MaritalStatus, Married = 'Married', Other = "Divorced", Other = "LivePartner", Other = "NeverMarried", Other = "Separated", Other = "Widowed")) %>% 
+   drop_na() %>%
+   filter(Gender == 'female')

Make a training data set of 1000 random cases and a test data set of the remaining cases (should be 505 cases). Set a seed so that we can reproduce our random training set.

> set.seed(20200219)
> train_index <- sample(nrow(NHANESf), 1000)
> 
> NHANESf_train <- NHANESf %>% slice(train_index)
> NHANESf_test <- NHANESf %>% slice(-train_index)

Using NHANESf data, fit a logistic model to model pregnancies using the set of specificied explanatory variables. Draw the ROC curve and double density curves for our model.

> train_results <- glm(PregnantNow ~ Age + Education + HHIncomeMid + MaritalStatus + BMI + Height, data = NHANESf_train, family = binomial)
> 
> NHANESf_test <- NHANESf_test %>%
+   mutate(prob = predict(train_results, newdata = NHANESf_test, type = "response")) 
> 
> eval_fun <- function(t, test) {
+   data <- test %>%
+   mutate(prediction = ifelse(prob > t, "Yes", "No")) %>%
+   summarize(threshold=t, 
+             accuracy = mean(PregnantNow == prediction), 
+             precision = sum(PregnantNow == "Yes" &  prediction == "No")/sum(prediction == "Yes"),
+             sensitivity = sum(PregnantNow == "Yes" & prediction == "Yes")/sum(PregnantNow == "Yes"),
+             specificity = sum(PregnantNow == "No" & prediction == "No")/sum(PregnantNow == "No"))
+ return(data)
+ }
> t_vals <- seq(0.1, 0.9, by=0.1)
> eval_df <- map_df(t_vals, eval_fun, test=NHANESf_test) 

ROC curve

> ggplot(eval_df, aes(x=1-specificity, y=sensitivity, color = threshold)) + 
+   geom_point() + 
+   geom_path() + 
+   geom_text(aes(label = threshold), color = "black", vjust = "inward", hjust = "inward", check_overlap = TRUE ) + 
+   labs(x="false positive rate", y="true positive rate", title="ROC curve for logistic") + 
+   geom_abline(slope=1,intercept=0, linetype=3)

Double density curve

> ggplot(NHANESf_test, aes(x=prob, fill=PregnantNow)) + 
+   geom_density(alpha = .3) + 
+   labs(x="predicted CKD probability", fill = "truth", title="Forecasted Pregnant probabilities using t=0.5 decision threshold")

Use a threshold of 0.5 to predict pregnancies. Compute the confusion matrix, accuracy, sensitivity and specificity. Then repeat these calculations using a threshold of 0.05.

t = 0.5

> conf_mat
       truth
predict Yes  No
    Yes  22 483

When t = 0.5, accuracy = 0.0435, sensitivity = 1, specificity = 0

t = 0.05

> conf_mat
       truth
predict Yes  No
    Yes  22 483

When t = 0.5, accuracy = 0.0435, sensitivity = 1, specificity = 0

Correctly identifying women who were pregnant at a high rate: high sensitivity, we can assign either 0.5 or 0.05 since the sensitivity of both thresholds are 1.

Low chance of misclassifying a woman who was not pregnant: high specificity. In this case both 0.5 threshold and 0.05 threshold do not perform well. Using a threshold of 0.9 will actually get us to reach high specificity.

The overall rate of pregnancies in our data constructed in part a should be around 5%. Suppose Barb, the lazy data scientist, decided simply to classify woman as “pregnant” based on this 5% rate (since, hey, it will result in about 5% of her predictions being pregnant which matchs the rate in the data!). Compute the confusion matrix, accuracy, sensitivity and specificity. Make sure to explain/show our work for these calculations.

Accuracy = 0.9202658, Sensitvity = 0.1066667, specificity = 0.9629371

Fit a knn classifier to this data using k=3 using the four quantitative predictors age, bmi, income and height. Compute the accuracy, sensitivity and specificity of the predictions from the test set.

# A tibble: 1 x 5
      k accuracy sensitivity specificity precision
  <dbl>    <dbl>       <dbl>       <dbl>     <dbl>
1     3    0.949       0.227       0.981     0.227

For the values of k below. Plot accuracy, sensitivity and specificity of the predictions from the test set against k and comment on which k might be “best”.

 Yes   No 
  61 1444 

If we prefer high prcision or high sensitivity, k = 1 is the best

If we prefer high accuracy or high specificity, every k from (1, 3, 5, 7, … 21) except k = 3 are preferrable values.