Load Packages and Data

library(ggplot2)
library(dplyr)
library(broom)

profiles <- readr::read_csv(file="profiles.csv") %>% 
  # Drop users with missing height values
  filter(!is.na(height)) %>% 
  # Recode sex as a binary variable: 1 if female, 0 otherwise
  mutate(is_female=ifelse(sex=='f', 1, 0)) %>% 
  # Add an id variable identifying each user (row)
  tibble::rownames_to_column(var="id")

We are going to use age and drugs as the predictor variables.

Missing Data

At the very least, for each of the predictor variables we use, we need to address rows (i.e. individuals) with missing (NA) values. Otherwise, no predictions will be returned for those rows (see below). For example, a proportion of 0.235 of the drugs values are missing; we can't simply not return a prediction for that many people! So we change the NAs to an actual level of the categorical variable drugs. Note this fix is clumsy, but it does preserve whatever discriminative information about the sexes is contained in the responses of those users that did respond!

profiles <- profiles %>% 
  mutate(drugs = ifelse(is.na(drugs), "did not report", drugs))
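
As a quick check (a sketch, not part of the original pipeline), you can verify that no NAs remain in drugs and see what share of users fall into the new level:

# Quick check (sketch): no NAs should remain, and "did not report"
# should account for roughly 0.235 of the drugs values
profiles %>% 
  summarize(
    prop_na = mean(is.na(drugs)),
    prop_did_not_report = mean(drugs == "did not report")
  )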

Split into Training and Test Data

You need to split the data into two disjoint data sets: training (2997 rows) and test (56946 rows). For example, here is one not-so-great way to do it. Ask yourself: why is this approach suspect?

training <- profiles %>% 
  slice(1:2997)
test <- profiles %>% 
  slice(2998:59943)
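
If the rows are not in random order, the slice-based split above could be systematically biased. One arguably better approach is sketched here (the names training_random and test_random are illustrative only, and the rest of this document continues to use the slice-based split above): sample the training rows at random and let the remaining rows form the test set, matching on the id column created earlier.

# Randomized split (sketch only; not used in the rest of this document)
set.seed(76)
training_random <- profiles %>% 
  sample_n(2997)
test_random <- profiles %>% 
  anti_join(training_random, by = "id")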

Conceptual Leap Necessary

For this exercise to have been completely realistic, for both HW-2 (EDA) and HW-3 (predictive modeling), I should have:

  • For training, given you both
    • The predictor variables: age and drugs
    • The “truth” variable: is_female
  • For test, given you only the predictor variables: age and drugs

Then the exercise would proceed as follows:

  • You would train your model by fitting a logistic regression model predict_sex_model to training
  • Then, using test
    • Apply predict_sex_model to test, which only has the predictor variables, not the truth variable (as we do below using the predict() function)
    • Which then returns a fitted probability p_hat of being female for all 56946 test users
    • Which you would then convert to a predicted sex in a variable predicted_female (a sketch of this conversion follows this list)
    • Which you would then send to me
  • Since only I would know the truth in is_female, only I could tell you what proportion of the 56946 users you guessed correctly. This is how Kaggle competitions work!
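
For concreteness, here is a minimal sketch of the conversion step above. It assumes a cutoff of 0.5 (the cutoff is an assumption, not something specified in the exercise) and assumes p_hat has already been added to test, as is done in the "Test the Model" section below:

# Sketch only: convert fitted probabilities to a predicted sex,
# using an assumed 0.5 cutoff
test <- test %>% 
  mutate(predicted_female = ifelse(p_hat > 0.5, 1, 0))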

Alas, this only occurred to me when it was already too late. So to salvage what I could from this exercise, I have set up the following workaround:

  • You need to pretend to hide the truth variable from yourself in test, i.e. pretend I never gave you the true sex, as reflected in is_female, in test.
  • Hence you, instead of me behind the scenes, will evaluate how well your predictions fared after pretending to unveil the truth to yourself.

Train the Model

When training the model, we have access to both:

  • The predictor variables: age and drugs
  • The truth variable: is_female

i.e. We have this:

training %>% 
  select(age, drugs, is_female) %>% 
  View()
age  drugs           is_female
22   never           0
35   sometimes       0
38   did not report  0
23   did not report  0
29   never           0

We then fit a logistic regression using training:

predict_sex_model <- glm(is_female ~ age + drugs, data=training, family="binomial")
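
Since broom is loaded above, one way to inspect the fitted model (a sketch; not required) is to view its coefficients as a tidy data frame:

# Inspect the fitted coefficients as a data frame (sketch, using broom)
broom::tidy(predict_sex_model)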

Test the Model

As mentioned above, we at first pretend to hide the truth variable is_female, i.e. pretend I never gave it to you:

test %>% 
  select(age, drugs) %>% 
  View()
age  drugs
32   did not report
24   did not report
28   never
25   did not report
25   did not report

We take the trained model predict_sex_model, apply it to the predictor variables age and drugs in test, and obtain 56946 fitted probabilities p_hat of being female using the predict() function. Note: if either age or drugs were missing (NA) for a user, NO value would be returned for that user, and hence p_hat would be of length less than 56946.

p_hat <- predict(predict_sex_model, newdata=test, type="response")
# Visualize the distribution of the fitted probabilities of being female
ggplot(data=NULL, aes(x=p_hat)) + 
  geom_histogram()
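
As a quick check of the note above (a sketch), you can confirm that a fitted probability was returned for every test user:

# Quick check (sketch): if age and drugs have no NAs, this should be TRUE
length(p_hat) == nrow(test)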

Following tidy data principles (variables describing the same observational units should be kept in the same data frame) we add these predictions to our test data set via a mutate().

test <- test %>% 
  mutate(p_hat = predict(predict_sex_model, newdata=test, type="response"))

So what we now have is:

test %>% 
  select(age, drugs, p_hat) %>% 
  View()
age  drugs           p_hat
32   did not report  0.3900222
24   did not report  0.3799643
28   never           0.4029798
25   did not report  0.3812161
25   did not report  0.3812161

Now you pretend to unveil the truth variable is_female to yourself, allowing you to evaluate for yourself (and not me from behind the curtains) how well your prediction mechanism fared:

test %>% 
  select(age, drugs, p_hat, is_female) %>% 
  View()
age  drugs           p_hat      is_female
32   did not report  0.3900222  1
24   did not report  0.3799643  1
28   never           0.4029798  0
25   did not report  0.3812161  1
25   did not report  0.3812161  1
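
Finally, to score the predictions yourself, here is a minimal sketch that converts p_hat to a predicted sex using the same assumed 0.5 cutoff as earlier and computes the proportion of the 56946 test users guessed correctly:

# Sketch only: score predictions against the now-unveiled truth,
# using an assumed 0.5 cutoff
test %>% 
  mutate(predicted_female = ifelse(p_hat > 0.5, 1, 0)) %>% 
  summarize(prop_correct = mean(predicted_female == is_female))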