Please indicate
We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:
\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]
where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, 0 otherwise. So
So what the above formula is reporting is the proportion of users’ sex we correctly predicted.
Define:
training of 2997 users (5% of users). We will train the logistic regression model to predict gender using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable is_female.test of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data. So at first, we will pretend we don’t know the outcome variable is_female. We use the above model to make a prediction of sex for all 56,946 test users, then we use the is_female outcome to rate how well we performed.cleanprofiles <- profiles %>%
select(id, sex, income, job, pets, is_female) %>%
mutate(petowners = fct_recode(pets,
# New name = old name
"dogs" = "has dogs",
"dogs" = "has dogs and dislikes cats",
"dogs" = "has dogs and likes cats",
"both" = "has dogs and has cats",
"cats" = "has cats",
"cats" = "dislikes dogs and has cats",
"cats" = "likes dogs and has cats",
"neither" = "dislikes cats",
"neither" = "dislikes dogs",
"neither" = "dislikes dogs and dislikes cats",
"neither" = "dislikes dogs and likes cats",
"neither" = "likes cats",
"neither" = "likes dogs",
"neither" = "likes dogs and dislikes cats",
"neither" = "likes dogs and likes cats"
))
cleaner <- cleanprofiles %>%
mutate(job = fct_recode(job,
"other" = "military",
"other" = "unemployed",
"other" = "transportation",
"other" = "retired",
"other" = "rather not say",
"other" = "political / government",
"other" = "clerical / administrative",
"other" = "hospitality / travel",
"other" = "law / legal services",
"other" = "construction / craftsmanship"
))
newclean <- cleaner %>%
select(id, income, job, is_female, petowners) %>%
mutate(petowners = ifelse(is.na(petowners), "N/A", as.character(petowners))) %>%
mutate(job = ifelse(is.na(job), "N/A", as.character(job)))
set.seed(21)
training <- sample_n(newclean, 2997)
test <- newclean %>%
filter(!(id %in% training$id))
#graph to show proportion by sex
gender <- training %>%
group_by(is_female) %>%
tally()%>%
rename(gender = n) %>%
mutate(prop = gender/sum(gender))
ggplot(gender, aes(x = as.factor(is_female), y = prop)) +
geom_bar(stat = "identity", position = "dodge", fill = "violetred4", color = "darkslateblue", width = .65) +
theme_minimal() +
labs(title = "Proportion by Sex of Sample", x = "Female", y = "Proportion")
Train the logistic regression model to predict sex. i.e. fit a logistic regression model to the training data. Assign this model to an R object called predict_sex_model, then rate how well the model performs on the training data.
#regression
predictsexmodel <- glm(is_female ~ income + job + petowners, data=training, family="binomial")
predictions <- training %>%
select(income, job, petowners, is_female) %>%
mutate(p_hat1 = fitted(predictsexmodel))
#How did the model do?
prophecy <- predictions %>%
mutate(guess = ifelse(p_hat1>=.5, 1, 0))
errors <- prophecy %>%
mutate(correct = ifelse(guess==is_female, 1, 0))
propwrong <- errors %>%
summarize(prop_wrong=1 - mean(correct))
kable(propwrong)
0.3540207
sexpropwrong <- errors %>%
group_by(is_female) %>%
summarize(prop_wrong=1-mean(correct))
kable(sexpropwrong)
| is_female | prop_wrong |
|---|---|
| 0 | 0.1890348 |
| 1 | 0.5858748 |
The first table looks at the proportion of wrong answers generated by the model for the training data. 35.40% of the time, the model guest the wrong sex for a person based on their OkCupid profile. The second table broke down the incorrect answers by sex, with female=1. The model was only wrong 18.90% of the time when the respondent was male, compared to being wrong 58.59% of the time when the respondent was female.
Take predict_sex_model and apply it to the test data and make a prediction for each users’ sex, then rate how well the model performs on the test data.
Hint: What do you think predict(predict_sex_model, newdata=test, type="response") does? The help file is located in ?predict.glm
test <- test %>%
select(is_female, income, job, petowners) %>%
mutate(phat=predict(predictsexmodel, newdata = test, type = "response"))
#How did the model do for the test?
testprophecy <- test %>%
mutate(guess = ifelse(phat>=.5, 1, 0))
testerrors <- testprophecy %>%
mutate(correct = ifelse(guess==is_female, 1, 0))
testpropwrong <- testerrors %>%
summarize(prop_wrong=1 - mean(correct))
kable(testpropwrong)
0.3538616
testsexpropwrong <- testerrors %>%
group_by(is_female) %>%
summarize(prop_wrong=1-mean(correct))
kable(testsexpropwrong)
| is_female | prop_wrong |
|---|---|
| 0 | 0.1854091 |
| 1 | 0.6048535 |
The first table looks at the proportion of wrong answers generated by the model for the test data. 35.39% of the time, the model guest the wrong sex for a person based on their OkCupid profile. The second table broke down the incorrect answers by sex, with female=1. The model was only wrong 18.54% of the time when the respondent was male, compared to being wrong 60.49% of the time when the respondent was female.
Did the model perform better on the training data or the test data? Why do you think that is?
The model performed marginally better in the test data than in the training data. The model was wrong 35.40% of the time in the training compared to 35.39% in the test. Surprisingly, the model performed better for men in the test and better for women in the training.
We want to compare the volatility of
Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
bitcoin <- Quandl("BAVERAGE/USD") %>%
tbl_df() %>%
rename(
Avg = `24h Average`,
Total_Volume = `Total Volume`
)
gold <- Quandl("BUNDESBANK/BBK01_WT5511")
bitcoin$lagged <- lead(bitcoin$Avg, 1)
golddates <- interval(ymd("2010-07-17"), ymd("2016-04-15"))
bitcoin <-bitcoin %>%
filter(!is.na(lagged)) %>%
filter(Date %within% golddates) %>%
mutate(lagvalue = (Avg - lagged)/lagged)
gold$lagged <- lead(gold$Value, 1)
bitcoindates <- interval(ymd("2010-07-17"), ymd("2016-10-22"))
gold <-gold %>%
filter(!is.na(lagged)) %>%
filter(Date %within% bitcoindates) %>%
mutate(lagvalue = (Value - lagged)/lagged)
ggplot(bitcoin, aes(x = Date, y = lagvalue)) +
geom_line(color = "darkslateblue") +
geom_line(data = gold, color = "cadetblue2", alpha = .75) +
theme_minimal() +
labs(title = "Daily Volatility of Bitcoin and Gold", y = "Percent Daily Change")
Bitcoin is substantially more volatile than gold when looking at percent daily changes.
Using the Reed College jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define
graveyard <- jukebox %>%
mutate(date_time = parse_date_time(date_time, "a b d HMS Y")) %>%
separate(date_time, c("year", "month", "day", "hour", "minute", "seconds")) %>%
select(year, month, day, hour, minute, artist, track)
schoolyard <- graveyard %>%
transform(month=as.numeric(month)) %>%
transform(hour=as.numeric(hour)) %>%
filter(!month %in% c(6, 7, 8)) %>%
filter(hour<8) %>%
group_by(artist) %>%
tally() %>%
rename(playcount = n) %>%
top_n(10, playcount) %>%
arrange(desc(playcount))
ggplot(schoolyard, aes(x=fct_reorder(artist, playcount), y=playcount)) +
geom_bar(stat = "identity", fill = "violetred4") +
coord_flip()+
labs(title="Most Popular Artists during the Graveyard Shift", x='Artist', y='Play Count') +
theme_minimal()