Administrative:

Please indicate

  • Who you collaborated with: Amanda Hotvedt, Katherine Hobbs
  • Roughly how much time you spent on this HW so far: 7 hours
  • The URL of your published RPubs document:
  • What gave you the most trouble:
  • Any comments you have:

Question 1:

We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:

\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]

where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, 0 otherwise. So

  • Say user \(i\) is female, then \(y_i=1\)
  • Say we predict user \(i\) is female, then \(\widehat{y}_i=1\)
  • In this case \(I(y_i =\widehat{y}_i)=1\).

So the formula above reports the proportion of users whose sex we correctly predicted.
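In R, this metric is just the average of an equality check between observed and predicted values. A minimal sketch, assuming y and y_hat are 0/1 vectors of observed and predicted sexes (these names are mine, not from the assignment):

# Proportion of correct predictions: the mean of the indicator I(y_i = y_hat_i)
accuracy <- function(y, y_hat) {
  mean(y == y_hat)
}

# Example: 3 of 4 predictions correct gives 0.75
accuracy(c(1, 0, 1, 1), c(1, 0, 0, 1))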

# Recode income, job, and age into coarser categories for modeling
profiles <- profiles %>% 
  mutate(income_level = 
    ifelse(income %in% -2:0, "No Response",
    ifelse(income %in% 1:20000, "Low Income",
    ifelse(income %in% 20001:70000, "Middle Income",
    ifelse(income %in% 70001:1000000, "High Income", " "))))) %>% 
  mutate(job_new = ifelse(is.na(job), "NA", job)) %>% 
  mutate(job_new = fct_recode(job_new, 
                              "not currently working" = "unemployed", 
                              "not currently working" = "student", 
                              "not currently working" = "retired", 
                              "no response" = "rather not say", 
                              "no response" = "NA",   
                              "entertainment" = "entertainment / media", 
                              "entertainment" = "artistic / musical / writer", 
                              "miscellaneous" = "military",
                              "miscellaneous" = "political / government", 
                              "miscellaneous" = "clerical / administrative", 
                              "miscellaneous" = "law / legal services", 
                              "miscellaneous" = "construction / craftsmanship", 
                              "miscellaneous" = "transportation", 
                              "miscellaneous" = "hospitality / travel")) %>% 
  mutate(age_level = 
    ifelse(age %in% 18:19, "Under 20",
    ifelse(age %in% 20:29, "Twenties",
    ifelse(age %in% 30:39, "Thirties",
    ifelse(age %in% 40:49, "Forties", 
    ifelse(age %in% 50:59, "Fifties", 
    ifelse(age %in% 60:111, "60 and Over", " ")))))))

a)

Define:

  • A training set training of 2997 users (5% of users). We will train the logistic regression model to predict sex using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable is_female.
  • A test set test of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data. So at first, we will pretend we don’t know the outcome variable is_female. We use the above model to make a prediction of sex for all 56,946 test users, then we use the is_female outcome to rate how well we performed.
  • Be sure to incorporate all the insight you garnered in your EDA in HW-2.
set.seed(9)
# Randomly sample 2,997 users for the training set
training <- sample_n(profiles, 2997)
# The remaining users form the test set
test <- profiles %>% 
  filter(!(id %in% training$id))

b)

Train the logistic regression model to predict sex, i.e., fit a logistic regression model to the training data. Assign this model to an R object called predict_sex_model, then rate how well the model performs on the training data.

# Fit the logistic regression of is_female on the income, job, and age categories
predict_sex_model <- glm(is_female ~ income_level + job_new + age_level, 
                         data = training, family = "binomial")

# Fitted probabilities for the training data
prediction_training <- predict(predict_sex_model, newdata = training, type = "response")

training <- training %>% 
  mutate(phat = predict(predict_sex_model, newdata = training, type = "response")) %>% 
  select(income_level, job_new, age_level, phat, is_female) %>% 
  # Predict female when phat is at or above the 0.4023 cutoff, then flag correct predictions
  mutate(predicted_correct =
           ifelse((phat >= .4023 & is_female == 1) | (phat < .4023 & is_female == 0), 1, 0)) 

training_correct <- training %>% 
  group_by(predicted_correct) %>% 
  summarise(n=n()) %>% 
  mutate(prop = n/sum(n)) 
knitr::kable(training_correct)
predicted_correct      n       prop
                0   1182  0.3943944
                1   1815  0.6056056
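The 0.4023 cutoff used above is not the default 0.5; it appears to be the overall proportion of female users in profiles, so that a user is predicted female whenever their fitted probability exceeds the base rate. A minimal sketch of how such a cutoff could be computed (this is an assumption about where the number came from, not something stated above):

# Assumed: cutoff taken as the overall proportion of females in the data
cutoff <- mean(profiles$is_female)
cutoff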

c)

Take predict_sex_model and apply it to the test data to make a prediction for each user’s sex, then rate how well the model performs on the test data.

Hint: What do you think predict(predict_sex_model, newdata=test, type="response") does? The help file is located in ?predict.glm

# Fitted probabilities for the test data
prediction_test <- predict(predict_sex_model, newdata = test, type = "response")

test <- test %>% 
  mutate(phat = predict(predict_sex_model, newdata = test, type = "response")) %>% 
  select(income_level, job_new, age_level, phat, is_female) %>% 
  # Apply the same 0.4023 cutoff used for the training data
  mutate(predicted_correct =
           ifelse((phat >= .4023 & is_female == 1) | (phat < .4023 & is_female == 0), 1, 0))  

test_correct <- test %>% 
  group_by(predicted_correct) %>% 
  summarise(n=n()) %>% 
  mutate(prop = n/sum(n)) 
knitr::kable(test_correct)  
predicted_correct       n       prop
                0   23455  0.4118814
                1   33491  0.5881186

d)

Did the model perform better on the training data or the test data? Why do you think that is?

Considering that the model was built on the training data, I hypothesized that it would perform better on the training data. The results support the hypothesis, but not as strongly as I would have guessed: the model predicts sex correctly about 60% of the time on the training data and about 58% of the time on the test data.
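A quick way to double-check these two proportions directly, using the predicted_correct columns created above:

# Proportion of correct predictions in each set
mean(training$predicted_correct)  # roughly 0.606
mean(test$predicted_correct)      # roughly 0.588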

Question 2:

We want to compare the volatility of Bitcoin and gold prices.

Let our measure of volatility be the relative change in price from day to day. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
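In symbols, writing \(p_t\) for the day-\(t\) price in US dollars (notation mine, not from the assignment), the daily relative change is

\[ v_t = \frac{p_t - p_{t-1}}{p_{t-1}} \]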

# Pull the Bitcoin (USD) and gold price series from Quandl
bitcoin <- Quandl("BAVERAGE/USD", api_key="HT-hvAhCykRB7qMy-oHD") %>% 
  tbl_df()
gold <- Quandl("BUNDESBANK/BBK01_WT5511", api_key="HT-hvAhCykRB7qMy-oHD") %>% 
  tbl_df()

bitcoin <- bitcoin %>% 
  rename(Avg = `24h Average`, Total_Volume = `Total Volume`) 

# The series is ordered newest-first here, so lead() picks out the previous day's price
bitcoin$Avg_previous_day <- lead(bitcoin$Avg, 1)

bitcoin <- bitcoin %>% 
  mutate(daily_relative_change = (Avg - Avg_previous_day) / Avg_previous_day)

# Restrict gold prices to 2010 onward and compute the same daily relative change
gold_recent <- gold %>% 
  filter(Date %within% interval(ymd("2010-01-01"), ymd("2016-12-31")))

gold_recent$Value_previous_day <- lead(gold_recent$Value, 1)

gold_recent <- gold_recent %>% 
  mutate(daily_relative_change = (Value - Value_previous_day) / Value_previous_day)


p <- ggplot() +
  geom_line(data=bitcoin, aes(x=Date, y=daily_relative_change, color="Bitcoin")) +
  geom_line(data=gold_recent, aes(x=Date, y=daily_relative_change, color="Gold")) +
  labs(title="Relative Daily Volatility, Bitcoin Compared to Gold", x="Date", y="Daily Relative Volatility")
p
## Warning: Removed 1 rows containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_path).
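To give a foreign currency exchanger a single number per asset to compare, the volatility of each series can also be summarized numerically; a minimal sketch, assuming the bitcoin and gold_recent data frames built above:

# Standard deviation of the daily relative change as a rough volatility summary
sd(bitcoin$daily_relative_change, na.rm = TRUE)
sd(gold_recent$daily_relative_change, na.rm = TRUE)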

Question 3:

Using the Reed College jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

  • the “graveyard shift” as midnight to 8am
  • the academic year as September through May (inclusive)
# Parse the timestamp and pull out the month and hour of each play
jukebox <- jukebox %>% 
  mutate(Date = parse_date_time(date_time, "a b d HMS Y")) %>% 
  mutate(Month = month(Date)) %>% 
  mutate(Hour = hour(Date))

# Academic year: September through May; graveyard shift: hours 0-7, i.e. midnight up to 8am
graveyard <- jukebox %>% 
  filter(Month <= 5 | Month >= 9) %>% 
  filter(Hour < 8)

# Count plays per artist and sort from most to least played
graveyard_artist <- graveyard %>% 
  group_by(artist) %>% 
  tally() %>% 
  ungroup() %>% 
  arrange(desc(n))

# Keep the 10 most-played artists
graveyard_artist_top10 <- graveyard_artist %>%
  slice(1:10)
  
knitr::kable(graveyard_artist_top10)
artist                          n
OutKast                      2880
Beatles, The                 2481
Led Zeppelin                 1838
Radiohead                    1757
Rolling Stones, The          1681
Notorious B.I.G.             1611
Eminem                       1503
Red Hot Chili Peppers, The   1424
Bob Dylan                    1281
Talking Heads                1276