library(ggplot2)
library(dplyr)
library(broom)
profiles <- readr::read_csv(file="profiles.csv") %>%
filter(!is.na(height)) %>%
mutate(is_female=ifelse(sex=='f', 1, 0)) %>%
tibble::rownames_to_column(var="id")
We going to use age and drugs as the predictor variables.
At the very least, for each of the predictor variables we use, we need to address rows (i.e. individuals) with missing values NA. Otherwise, no predictions will be returned (see below). For example a proportion 0.235 of drugs values are NA missing; we can’t not return a prediction for this many people! So we change the NAs to an actual level of the categorical variable drugs. Note this is clumsy is we are losing any discriminative information about the sexes contained for those users that did respond!
profiles <- profiles %>%
mutate(drugs = ifelse(is.na(drugs), "did not report", drugs))
You need to split the training (2997 rows) and test (56946 rows) data into two disjoint data sets. For example here is one not-so-great way. Ask yourself, why is this approach suspect?
training <- profiles %>%
slice(1:2997)
test <- profiles %>%
slice(2998:59943)
For this exercise to have been completely realistic, for both HW-2 (EDA) and HW-3 (predictive modeling) I should have:
training, given you both
age and drugsis_femaletest only given you the predictor variables: age and drugsThen the exercise would proceed as follows:
predict_sex_model to trainingtest
predict_sex_model on test which only has the predictor variables, not the truth variable (as we do below using the predict() function)p_hat of being female for all 56946 test userspredicted_femaleis_female, only I could tell you what proportion of the 56946 users you guessed correct! This is how Kaggle competitions work!Alas, this only occurred to me after it was too late. So to salvage what I could from this exercise, I have set up this work around where:
test. i.e. pretend like I never gave you the true sex, as reflected in is_female, in test.When training the model, we have access to both:
age and drugsis_femalei.e. We have this:
training %>%
select(age, drugs, is_female) %>%
View()
| age | drugs | is_female |
|---|---|---|
| 22 | never | 0 |
| 35 | sometimes | 0 |
| 38 | did not report | 0 |
| 23 | did not report | 0 |
| 29 | never | 0 |
We then fit a logistic regression using training:
predict_sex_model <- glm(is_female ~ age + drugs, data=training, family="binomial")
As mentioned above, we at first pretend hide the truth variable is_female i.e. pretend I never gave it to you:
test %>%
select(age, drugs) %>%
View()
| age | drugs |
|---|---|
| 32 | did not report |
| 24 | did not report |
| 28 | never |
| 25 | did not report |
| 25 | did not report |
We take the trained model predict_sex_model, apply it to predictor variables age and drugs in test, and obtain 56946 fitted probabilities p_hat of being female using the predict() function. Note: if any of age and drugs were NA missing, NO value would be returned for that user, and hence p_hat would be of length less than 56946.
p_hat <- predict(predict_sex_model, newdata=test, type="response")
ggplot(data=NULL, aes(x=p_hat)) +
geom_histogram()
Following tidy data principles (variables describing the same observational units should be kept in the same data frame) we add these predictions to our test data set via a mutate().
test <- test %>%
mutate(p_hat = predict(predict_sex_model, newdata=test, type="response"))
So what we now have is:
test %>%
select(age, drugs, p_hat) %>%
View()
| age | drugs | p_hat |
|---|---|---|
| 32 | did not report | 0.3900222 |
| 24 | did not report | 0.3799643 |
| 28 | never | 0.4029798 |
| 25 | did not report | 0.3812161 |
| 25 | did not report | 0.3812161 |
Now you pretend unveil the truth variable is_female to yourself, allowing you to evaluate for yourself (and not me from behind the curtains) how well your prediction mechanism fared:
test %>%
select(age, drugs, p_hat, is_female) %>%
View()
| age | drugs | p_hat | is_female |
|---|---|---|---|
| 32 | did not report | 0.3900222 | 1 |
| 24 | did not report | 0.3799643 | 1 |
| 28 | never | 0.4029798 | 0 |
| 25 | did not report | 0.3812161 | 1 |
| 25 | did not report | 0.3812161 | 1 |