Please indicate:

* Who you collaborated with: Brenda, Shannia, Albert Kim.
* Any comments you have: Thanks!
We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:
\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]
where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, and 0 otherwise. So what the above formula reports is the proportion of users whose sex we correctly predicted.
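As a quick illustration of this metric, here is a minimal sketch in R; the vectors `y` and `y_hat` are hypothetical stand-ins for the observed and predicted values of `is_female`.

# Hypothetical observed and predicted values of is_female (1 = female, 0 = not female)
y     <- c(1, 0, 1, 1, 0)
y_hat <- c(1, 0, 0, 1, 1)

# Proportion of correct predictions: the mean of the indicator y == y_hat
mean(y == y_hat)  # = 0.6, since three of the five predictions match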
Define:

* A training set `training` of 2,997 users (5% of users). We will train the logistic regression model to predict gender using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable `is_female`.
* A test set `test` of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data.

So at first, we will pretend we don't know the outcome variable `is_female`. We use the above model to make a prediction of sex for all 56,946 test users, then we use the `is_female` outcome to rate how well we performed.

* Be sure to incorporate all the insight you garnered in your EDA in HW-2.
profiles <- profiles %>%
  # Recode the outcome: 1 if female, 0 otherwise
  mutate(is_female = ifelse(sex == "f", 1, 0)) %>%
  # Keep only the date portion of last_online and parse it as a date
  mutate(
    last_online = stringr::str_sub(last_online, 1, 10),
    last_online = lubridate::ymd(last_online)
  )

# A training set of 5% of users:
train5 <- profiles %>%
  sample_frac(0.05)

# Testing set with the remaining 95% of users
# (the anti_join assumes profiles has a unique id column)
test95 <- anti_join(profiles, train5, by = "id")
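Since `sample_frac()` draws a random sample, the exact rows in `train5` will differ from run to run. A small optional addition, not part of the original code, is to set a seed before sampling and sanity-check the split sizes against the counts stated above:

set.seed(76)  # arbitrary fixed seed so the random 5% sample is reproducible
              # (run this before the sample_frac() call above)

nrow(train5)  # should match the 2,997 training users stated above
nrow(test95)  # should match the 56,946 test users stated above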
Train the logistic regression model to predict sex, i.e. fit a logistic regression model to the `training` data. Assign this model to an R object called `predict_sex_model`, then rate how well the model performs on the `training` data.

Take `predict_sex_model` and apply it to the `test` data to make a prediction for each user's sex, then rate how well the model performs on the `test` data.

Hint: What do you think `predict(predict_sex_model, newdata=test, type="response")` does? The help file is located in `?predict.glm`.
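The fitted model itself is not shown above, so here is a minimal sketch of how `predict_sex_model` could be fit and applied using the `train5`/`test95` split from earlier. The predictors `height` and `age` are illustrative assumptions, not necessarily the variables actually used; they should be replaced with whatever the HW-2 EDA suggested.

# Fit the logistic regression on the training data
# (height and age are placeholder predictors -- substitute those suggested by the EDA)
predict_sex_model <- glm(is_female ~ height + age, data = train5, family = "binomial")

# Predicted probabilities of being female for the test users
p_hat <- predict(predict_sex_model, newdata = test95, type = "response")

# Convert probabilities to 0/1 predictions using a 0.5 cutoff
y_hat <- ifelse(p_hat > 0.5, 1, 0)

# Proportion of test users whose sex was predicted correctly
mean(y_hat == test95$is_female, na.rm = TRUE)  # na.rm in case any predictors are missing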
The first output below is the accuracy on the `training` data that the model was built on: about 63%. Applying the same model to the `test` data gives about 61% accuracy.
## [1] 0.6329663
## [1] 0.6120008
Did the model perform better on the `training` data or the `test` data? Why do you think that is?
The model was about 2 percentage points more accurate on the training data than on the test data. I think this is because the model was fit to the training data, so it captures that data's particular patterns and generalizes slightly less well to users it has not seen.
We want to compare the volatility of a set of foreign currencies. Let our measure of volatility be the relative day-to-day change in price, and let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
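Here is a minimal sketch of how this volatility measure could be computed. It assumes a data frame `exchange_rates` with columns `date`, `currency`, and `price` (price in US dollars), which is a hypothetical stand-in for whatever Quandl series were actually downloaded.

# Relative day-to-day change in price, computed separately for each currency
# (exchange_rates is a hypothetical data frame with columns date, currency, price)
volatility <- exchange_rates %>%
  group_by(currency) %>%
  arrange(date) %>%
  mutate(rel_change = (price - lag(price)) / lag(price)) %>%
  ungroup()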
(Figure: day-to-day relative change in price for each currency, priced in US dollars.)
Using the Reed College jukebox data, what are the top 10 artists played during the "graveyard shift" of the academic year? Define:
| artist              | n (plays) |
|---------------------|-----------|
| OutKast             | 1233      |
| Beatles, The        | 807       |
| Notorious B.I.G.    | 730       |
| Led Zeppelin        | 672       |
| Eminem              | 627       |
| 2Pac                | 581       |
| Rolling Stones, The | 542       |
| Radiohead           | 505       |
| Talking Heads       | 449       |
| Tenacious D         | 438       |
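A table like this could be produced with dplyr along the lines of the sketch below. The data frame `jukebox` with columns `artist` and `date_time`, the midnight-to-8am window, and the September-through-May academic year are all hypothetical assumptions, since the exact definitions used above are not shown.

# jukebox is a hypothetical data frame with columns artist and date_time (a POSIXct)
top_graveyard <- jukebox %>%
  mutate(
    hour  = lubridate::hour(date_time),
    month = lubridate::month(date_time)
  ) %>%
  # Graveyard shift: midnight up to 8am (assumed definition)
  filter(hour >= 0, hour < 8) %>%
  # Academic year: September through May (assumed definition)
  filter(month >= 9 | month <= 5) %>%
  count(artist) %>%
  arrange(desc(n)) %>%
  slice(1:10)

top_graveyard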