## Warning: package 'ggplot2' was built under R version 3.2.4
## Warning: package 'Quandl' was built under R version 3.2.5
## Warning: package 'zoo' was built under R version 3.2.5
## Warning: package 'tidyr' was built under R version 3.2.5
## Warning: package 'quantmod' was built under R version 3.2.5
## Warning: package 'TTR' was built under R version 3.2.4

Admistrative:

Please indicate

Question 1:

We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:

\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]

where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, 0 otherwise. So

So what the above formula is reporting is the proportion of users’ sex we correctly predicted.

a)

Define:

  • A training set training of 2997 users (5% of users). We will train the logistic regression model to predict gender using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable is_female.
  • A test set test of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data. So at first, we will pretend we don’t know the outcome variable is_female. We use the above model to make a prediction of sex for all 56,946 test users, then we use the is_female outcome to rate how well we performed.
  • Be sure to incorporate all the insight your garnered in your EDA in HW-2.
set.seed(76)
training <- sample_n(profiles, 2997)
test <- anti_join(profiles, training, by = "id")

b)

Train the logistic regression model to predict sex. i.e. fit a logistic regression model to the training data. Assign this model to an R object called predict_sex_model, then rate how well the model performs on the training data.

predict_sex_model <- glm(is_female ~ orientation + income_bracket + job_new, data=training, family="binomial")

training_phats <- predict(predict_sex_model, data=training, type="response")

training_predictions <- training %>% 
  select(income_bracket, job_new, orientation, is_female) %>%
  mutate(phat= training_phats) %>% mutate(predicted_is_female = ifelse(phat < .5, 0, 1))


training_predictions %>% summarise(sum(is.na(phat)), sum(is.na(predicted_is_female)))
## # A tibble: 1 × 2
##   `sum(is.na(phat))` `sum(is.na(predicted_is_female))`
##                <int>                             <int>
## 1                  0                                 0
# I am not seeing the error you are describing... training_phats has length 2997

summarise(training_predictions, percent_correct = mean(is_female == predicted_is_female)) %>%
  knitr::kable(digits = 4)

percent_correct

      0.6456

The prediction model worked on 64.56% of predictions.

c)

Take predict_sex_model and apply it to the test data and make a prediction for each users’ sex, then rate how well the model performs on the test data.

Hint: What do you think predict(predict_sex_model, newdata=test, type="response") does? The help file is located in ?predict.glm

predictions <- predict(predict_sex_model, newdata=test, type="response")
test <- test %>% select(income_bracket, job_new, orientation, is_female) %>% 
  mutate(phat= predictions) %>% mutate(predicted_is_female = ifelse(phat < .5, 0, 1))

summarise(test, percent_correct = mean(is_female == predicted_is_female)) %>%
  knitr::kable(digits = 4)

percent_correct

      0.6523

The prediction model worked on 65.23% of predictions.

d)

Did the model perform better on the training data or the test data? Why do you think that is?

The model performed better on the test set by less than 1% of predictions. This shows that the model is not very affected by the variability between the sets. If I were to reselect training/test sets, I expect that on most runs the test set does better, but usually they are pretty close in percent correct.

Question 2:

We want to compare the volatility of

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

bitcoin <- Quandl("BAVERAGE/USD")
gold <- Quandl("BUNDESBANK/BBK01_WT5511")

gold_clipped <- filter(gold, as.numeric(year(Date)) >= 2010)
ggplot() + 
  geom_line(data =bitcoin, aes(x = Date, y = `24h Average`, col = "bicoin"))  + 
  geom_line(data = gold_clipped, aes(x =  Date, y = Value, col = "gold")) +
  scale_y_log10() +
  labs(y = "log(US Dollars)", title = "Relative change in price of Bitcoin and Gold")

bitcoin1 <- bitcoin %>% mutate(percent_change = Delt(`24h Average`)) 
gold1 <- gold_clipped %>% mutate(percent_change = Delt(Value))

ggplot() +
  geom_line(data = bitcoin1, aes(x = Date, y = percent_change, col = "bitcoin")) +
  geom_line(data = gold1, aes(x = Date, y=percent_change, col ="gold")) +
  labs(y = "relative change (USD)", title = "Relative volatility of bitcoin and gold")
## Warning: Removed 1 rows containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_path).

In the past couple of years, bitcoin has had more day-to-day price variation. The price of gold hasn’t changed all that much, relatively speaking. A foreign currency exchanger might be more interested in bitcoin, based off of this analysis.

Question 3:

Using the Reed College jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

jukebox <- mutate(jukebox, date_time1 = parse_date_time(date_time, "a b d T y"))

academic_year <- c(01, 02, 03, 04, 05, 09, 10, 11, 12)
graveyard_shift <- c(00, 01, 02, 03, 04, 05, 06, 07, 08)

top_10 <- jukebox %>% 
  filter(hour(date_time1) %in% graveyard_shift & 
           as.numeric(month(date_time1)) %in% academic_year) %>%
  group_by(artist) %>%
  summarise( count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10) 
## Selecting by count
top_10 %>% knitr::kable()
artist count
OutKast 2880
Beatles, The 2481
Led Zeppelin 1838
Radiohead 1757
Rolling Stones, The 1681
Notorious B.I.G. 1611
Eminem 1503
Red Hot Chili Peppers, The 1424
Bob Dylan 1281
Talking Heads 1276
ggplot(top_10, aes(x = reorder(artist, -count), y = count)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top artists in the graveyard shift", x = "", y = "Plays")