## Warning: package 'ggplot2' was built under R version 3.2.4
## Warning: package 'readr' was built under R version 3.2.5
## Warning: package 'lubridate' was built under R version 3.2.4
## Warning: package 'Quandl' was built under R version 3.2.5
## Warning: package 'zoo' was built under R version 3.2.5
## Warning: package 'tidyverse' was built under R version 3.2.5
## Warning: package 'tibble' was built under R version 3.2.5
## Warning: package 'tidyr' was built under R version 3.2.5
## Warning: package 'purrr' was built under R version 3.2.5

Admistrative:

Please indicate

  • Who you collaborated with: Connor
  • Roughly how much time you spent on this HW so far: 5 hours
  • The URL of the RPubs published URL here.
  • What gave you the most trouble: Creating the Shiny app for question 1, I still couldn’t figure out how to output a nice-looking table.
  • Any comments you have:

Question 1:

We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:

\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]

where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, 0 otherwise. So

  • Say user \(i\) is female, then \(y_i=1\)
  • Say we predict user \(i\) is female, then \(\widehat{y}_i=1\)
  • In this case \(I(y_i =\widehat{y}_i)=1\).

So what the above formula is reporting is the proportion of users’ sex we correctly predicted.

a)

Define:

  • A training set training of 2997 users (5% of users). We will train the logistic regression model to predict gender using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable is_female.
  • A test set test of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data. So at first, we will pretend we don’t know the outcome variable is_female. We use the above model to make a prediction of sex for all 56,946 test users, then we use the is_female outcome to rate how well we performed.
  • Be sure to incorporate all the insight your garnered in your EDA in HW-2.

Look at Answer 1 in the Shiny App

set.seed(7)

training <- sample_n(profiles, 2997)
test <- profiles %>% 
  filter(!(id %in% training$id))

b)

Train the logistic regression model to predict sex. i.e. fit a logistic regression model to the training data. Assign this model to an R object called predict_sex_model, then rate how well the model performs on the training data.

predict_sex_model <- glm(is_female ~ height + orientation + body_type
                         , data=training, family="binomial")

broom::tidy(predict_sex_model)
##                       term   estimate  std.error   statistic       p.value
## 1              (Intercept) 48.9303364 1.91772782  25.5147450 1.352539e-143
## 2                   height -0.7060013 0.02713088 -26.0220586 2.787667e-149
## 3           orientationgay -2.1294598 0.36967186  -5.7604055  8.391212e-09
## 4      orientationstraight -1.2704516 0.31913487  -3.9809238  6.864795e-05
## 5        body_typeathletic -1.0901907 0.30479653  -3.5767819  3.478502e-04
## 6         body_typeaverage -0.6981459 0.29469271  -2.3690641  1.783316e-02
## 7           body_typecurvy  3.9043064 0.54748347   7.1313685  9.937585e-13
## 8             body_typefit -0.5018644 0.29795652  -1.6843544  9.211319e-02
## 9    body_typefull figured  1.7712796 0.56908235   3.1125189  1.854981e-03
## 10         body_typejacked -1.1674314 0.84615523  -1.3796895  1.676823e-01
## 11             body_typena  0.3077050 0.33003009   0.9323545  3.511534e-01
## 12     body_typeoverweight  0.1036753 0.81463270   0.1272663  8.987296e-01
## 13 body_typerather not say -0.5045597 1.15722128  -0.4360097  6.628297e-01
## 14         body_typeskinny -0.5045652 0.43865795  -1.1502474  2.500420e-01
## 15           body_typethin  0.6304969 0.33480817   1.8831587  5.967886e-02
## 16        body_typeused up -0.1275101 0.82336787  -0.1548640  8.769285e-01
training %>% 
  mutate(p_hat = fitted(predict_sex_model)) %>%
  mutate(fitted_binary = ifelse(p_hat > .5, 1, 0)) %>% 
  mutate(prediction_true = ifelse(fitted_binary==is_female, 1, 0)) %>% 
  mutate(type1_error = ifelse(fitted_binary==1 & is_female==0, 1, 0)) %>% 
  mutate(type2_error = ifelse(fitted_binary==0 & is_female==1, 1, 0)) %>% 
  summarise("Training Rating" = mean(prediction_true), 
            "Type 1 Error" = mean(type1_error),
            "Type 2 Error" = mean(type2_error)) %>% 
  kable(digits=3)
Training Rating Type 1 Error Type 2 Error
0.863 0.054 0.083

c)

Take predict_sex_model and apply it to the test data and make a prediction for each users’ sex, then rate how well the model performs on the test data.

Hint: What do you think predict(predict_sex_model, newdata=test, type="response") does? The help file is located in ?predict.glm

test %>% 
  mutate(p_hat = predict(predict_sex_model, newdata=test, type="response")) %>% 
  mutate(fitted_binary = ifelse(p_hat > .5, 1, 0)) %>% 
  mutate(prediction_true = ifelse(fitted_binary==is_female, 1, 0)) %>% 
  mutate(type1_error = ifelse(fitted_binary==1 & is_female==0, 1, 0)) %>% 
  mutate(type2_error = ifelse(fitted_binary==0 & is_female==1, 1, 0)) %>% 
  summarise("Test Rating" = mean(prediction_true),
            "Type 1 Error" = mean(type1_error),
            "Type 2 Error" = mean(type2_error)) %>% 
  kable(digits=3)
Test Rating Type 1 Error Type 2 Error
0.858 0.054 0.088

d)

Did the model perform better on the training data or the test data? Why do you think that is?

The model performs marginally better on the training data than the test data when the decision threshold is 0.5. This could be because the model is a bit narrower when the dataset is smaller.

Also, if you control for too many variables you may overfit the model to make it fit the training data, but it may not do as well on the test data.

Question 2:

We want to compare the volatility of

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

gold <- Quandl("BUNDESBANK/BBK01_WT5511") %>% 
  tbl_df()

bitcoin <- Quandl("BAVERAGE/USD") %>% 
  tbl_df() %>% 
  mutate(currency="bitcoin") %>% 
  filter(!is.na(Ask)) %>% 
  mutate(value_change=(Ask/lag(Ask) - 1)*100) %>% 
  filter(Date %within% interval(min(gold$Date), max(gold$Date)))
  

gold <- gold %>% 
  mutate(currency="gold") %>% 
  mutate(value_change=(Value/lag(Value) - 1)*100) %>% 
  filter(Date %within% interval(min(bitcoin$Date), max(bitcoin$Date)))

combined_currencies <- bind_rows(bitcoin, gold)

ggplot(combined_currencies, aes(x=Date, y=value_change, colour=currency)) + geom_line(alpha=0.8)
## Warning: Removed 1 rows containing missing values (geom_path).

Bitcoin is very volatile as compared to gold.

Question 3:

Using the Reed College jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

  • the “graveyard shift” as midnight to 8am
  • the academic year as September through May (inclusive)
jukebox <- jukebox %>% 
  mutate(date_time = parse_date_time(date_time, "a b d HMS Y")) %>% 
  filter(hour(date_time) < 8) %>% 
  filter(month(date_time) <= 5 | month(date_time) >= 9) %>% 
  group_by(artist) %>% 
  summarise(frequency = n()) %>% 
  arrange(., desc(frequency))

kable(head(jukebox, n=10))
artist frequency
OutKast 2533
Beatles, The 2219
Led Zeppelin 1617
Radiohead 1589
Rolling Stones, The 1522
Notorious B.I.G. 1382
Red Hot Chili Peppers, The 1318
Eminem 1311
Bob Dylan 1163
Talking Heads 1113