Please indicate
I still need to figure out how to make the testing and training data sets disjoint and whether I need to set a seed for those models (I have noticed that I get a different proportion for correct predictions each time I run the model).
We will use a logistic regression model to predict sex. Our metric to rate how well our model performs will be:
\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i) \]
where \(I(A)\) is the indicator function that is equal to 1 if condition \(A\) holds, 0 otherwise. So
So what the above formula is reporting is the proportion of users’ sex we correctly predicted.
Define:
training of 2997 users (5% of users). We will train the logistic regression model to predict gender using this data. Since we want to train the model to tell who is female and who is not, we use the outcome variable is_female.test of the remaining 56,946 users (95% of users). We will test how good our trained model is using this data. So at first, we will pretend we don’t know the outcome variable is_female. We use the above model to make a prediction of sex for all 56,946 test users, then we use the is_female outcome to rate how well we performed.set.seed(76)
profiles <- profiles %>%
mutate(religion= ifelse(is.na(religion), "Not Reported", religion))
profiles <- profiles %>%
mutate(`Grouped Religion` = fct_recode(religion,
"agnosticism" = "agnosticism",
"agnosticism" = "agnosticism and laughing about it",
"agnosticism" = "agnosticism and somewhat serious about it",
"agnosticism" = "agnosticism and very serious about it",
"agnosticism" = "agnosticism but not too serious about it",
"atheism" = "atheism",
"atheism" = "atheism and laughing about it",
"atheism" = "atheism and somewhat serious about it",
"atheism" = "atheism and very serious about it",
"atheism" = "atheism but not too serious about it",
"buddhism" = "buddhism",
"buddhism" = "buddhism and laughing about it",
"buddhism" = "buddhism and somewhat serious about it",
"buddhism" = "buddhism and very serious about it",
"buddhism" = "buddhism but not too serious about it",
"catholism" = "catholicism",
"catholism" = "catholicism and laughing about it",
"catholism" = "catholicism and somewhat serious about it",
"catholism" = "catholicism and very serious about it",
"catholism" = "catholicism but not too serious about it",
"christianity" = "christianity",
"christianity" = "christianity and laughing about it",
"christianity" ="christianity and somewhat serious about it",
"christianity" = "christianity and very serious about it",
"christianity" = "christianity but not too serious about it",
"hinduism" = "hinduism",
"hinduism" = "hinduism and laughing about it",
"hinduism" = "hinduism and somewhat serious about it",
"hinduism" = "hinduism and very serious about it",
"hinduism" = "hinduism but not too serious about it",
"islam" = "islam",
"islam" = "islam and laughing about it",
"islam" = "islam and somewhat serious about it",
"islam" = "islam and very serious about it",
"islam" = "islam but not too serious about it",
"judaism" = "judaism",
"judaism" = "judaism and laughing about it",
"judaism" = "judaism and somewhat serious about it",
"judaism" = "judaism and very serious about it",
"judaism" = "judaism but not too serious about it",
"other" = "other",
"other" = "other and laughing about it",
"other" = "other and somewhat serious about it",
"other" = "other and very serious about it",
"other" = "other but not too serious about it",
"Not Reported" = "Not Reported"))
profiles <- profiles %>%
mutate(income_levels =
ifelse(income %in% -2:0, "Not Reported",
ifelse(income %in% 0:20000, "Low Income",
ifelse(income %in% 20001:70000, "Middle Income",
ifelse(income %in% 70000:1000000, "High Income", " ")))))
profiles <- profiles %>%
mutate(job =ifelse(is.na(job),"Not Reported", job))
training <- profiles[sample(nrow(profiles), 2997), ]
test <- anti_join(profiles, training, by="id")
#I do not yet know how to make these disjoint data sets!
Train the logistic regression model to predict sex. i.e. fit a logistic regression model to the training data. Assign this model to an R object called predict_sex_model, then rate how well the model performs on the training data.
predict_sex_model <- glm(is_female ~`Grouped Religion` + income_levels + job, data = training, family = "binomial")
broom::tidy(predict_sex_model)
## term estimate std.error
## 1 (Intercept) -1.39539644 0.2553761
## 2 `Grouped Religion`atheism -0.31546549 0.1663912
## 3 `Grouped Religion`buddhism -0.10898631 0.2524213
## 4 `Grouped Religion`catholism 0.24074311 0.1721834
## 5 `Grouped Religion`christianity 0.45171327 0.1600477
## 6 `Grouped Religion`hinduism -0.94383892 0.6654929
## 7 `Grouped Religion`islam -0.29508392 0.8797441
## 8 `Grouped Religion`judaism 0.25026019 0.1940980
## 9 `Grouped Religion`Not Reported 0.16592334 0.1265662
## 10 `Grouped Religion`other 0.37846950 0.1507658
## 11 income_levelsLow Income 0.74812531 0.2622546
## 12 income_levelsMiddle Income 0.45129772 0.2482032
## 13 income_levelsNot Reported 1.16619760 0.1935261
## 14 jobbanking / financial / real estate -0.24655088 0.2329682
## 15 jobclerical / administrative 1.74287376 0.4039856
## 16 jobcomputer / hardware / software -1.46288075 0.2412595
## 17 jobconstruction / craftsmanship -2.96370217 0.7384700
## 18 jobeducation / academia 0.83385451 0.2119484
## 19 jobentertainment / media -0.63831151 0.2465477
## 20 jobexecutive / management -0.49214055 0.2437952
## 21 jobhospitality / travel -0.01804216 0.2830577
## 22 joblaw / legal services 0.37933147 0.2727542
## 23 jobmedicine / health 0.59244184 0.2078673
## 24 jobmilitary -14.11048590 229.9846017
## 25 jobNot Reported 0.02088039 0.1772729
## 26 jobother 0.16280265 0.1781592
## 27 jobpolitical / government -0.53503490 0.4015112
## 28 jobrather not say -0.65931953 0.4534215
## 29 jobretired 0.27310059 0.5694868
## 30 jobsales / marketing / biz dev 0.05431871 0.2081820
## 31 jobscience / tech / engineering -0.69174916 0.2079726
## 32 jobstudent 0.06005210 0.1993069
## 33 jobtransportation -0.67430934 0.4768726
## 34 jobunemployed 0.34709406 0.5696720
## statistic p.value
## 1 -5.46408462 4.653015e-08
## 2 -1.89592675 5.796973e-02
## 3 -0.43176356 6.659133e-01
## 4 1.39817848 1.620595e-01
## 5 2.82236631 4.767069e-03
## 6 -1.41825539 1.561162e-01
## 7 -0.33542017 7.373082e-01
## 8 1.28934938 1.972767e-01
## 9 1.31096077 1.898710e-01
## 10 2.51031437 1.206237e-02
## 11 2.85266852 4.335382e-03
## 12 1.81825896 6.902456e-02
## 13 6.02604815 1.680172e-09
## 14 -1.05830283 2.899174e-01
## 15 4.31419822 1.601832e-05
## 16 -6.06351661 1.331770e-09
## 17 -4.01330053 5.987560e-05
## 18 3.93423416 8.346239e-05
## 19 -2.58899854 9.625550e-03
## 20 -2.01866411 4.352214e-02
## 21 -0.06374020 9.491771e-01
## 22 1.39074493 1.643028e-01
## 23 2.85009585 4.370606e-03
## 24 -0.06135405 9.510772e-01
## 25 0.11778669 9.062367e-01
## 26 0.91380423 3.608197e-01
## 27 -1.33255285 1.826786e-01
## 28 -1.45409852 1.459190e-01
## 29 0.47955558 6.315434e-01
## 30 0.26091929 7.941547e-01
## 31 -3.32615475 8.805303e-04
## 32 0.30130466 7.631822e-01
## 33 -1.41402405 1.573548e-01
## 34 0.60928752 5.423339e-01
predictions <- training %>%
select(income_levels, job, `Grouped Religion`, is_female) %>%
mutate(p_hat = fitted(predict_sex_model))
predictions <- predictions %>%
mutate(correct = ifelse(p_hat >.5 & is_female==1, 1,
ifelse(p_hat<0.5 &is_female==0,1,0)))
predictions %>%
summarise(Training_prop_correct=mean(correct)) %>%
knitr::kable(digits = 4)
0.6373
Take predict_sex_model and apply it to the test data and make a prediction for each users’ sex, then rate how well the model performs on the test data.
Hint: What do you think predict(predict_sex_model, newdata=test, type="response") does? The help file is located in ?predict.glm
test_predictions <- test %>%
select(income_levels, job, `Grouped Religion`, is_female) %>%
mutate(p_hat = predict(predict_sex_model, newdata=test, type="response"))
test_predictions <- test_predictions %>%
mutate(correct = ifelse(p_hat >.5 & is_female==1, 1,
ifelse(p_hat<0.5 &is_female==0,1,0)))
test_predictions %>%
summarise(Test_Prop_correct=mean(correct)) %>%
knitr::kable(digits = 4)
0.6367
Did the model perform better on the training data or the test data? Why do you think that is?
The model performed better on the test data, but only marginally! The results were almost the same. 63.73% of predictions were correct for the training data while 63.67% of predictions were correct for the testing data. I would expect the model to perform better on the training data because that was the data used to build the model, while the test data is a new dataset to which the model is being applied.
We want to compare the volatility of
Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
bitcoin <- Quandl("BAVERAGE/USD") %>%
tbl_df()
gold <- Quandl("BUNDESBANK/BBK01_WT5511") %>%
tbl_df()
bitcoin <- bitcoin %>%
rename(
Avg = `24h Average`,
Total_Volume = `Total Volume`)
bitcoin <- bitcoin %>%
mutate(Price_lag=lag(Avg)) %>%
mutate(Daily_Price_Change=(Avg - Price_lag)) %>%
mutate(Daily_Rel_Price_Change=(Daily_Price_Change/Price_lag))
gold_recent <- gold %>%
filter(Date %within% interval(ymd("2010-01-01"), ymd("2016-12-31")))
gold_recent <- gold_recent %>%
mutate(Value_lag=lag(Value)) %>%
mutate(Daily_Value_Change=(Value - Value_lag)) %>%
mutate(Daily_Rel_Value_Change=(Daily_Value_Change/Value_lag))
p <- ggplot() +
geom_line(data=bitcoin, aes(x=Date, y=Daily_Rel_Price_Change, color="Bitcoin")) +
geom_line(data=gold_recent, aes(x=Date, y=Daily_Rel_Value_Change, color="Gold")) +
labs(x='Date', y='USD') +
labs(title="Daily Relative Change of Bitcoin and Gold Values 2010-2016")
p
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
Using the Reed College jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define
jukebox <- jukebox %>%
mutate(Date = parse_date_time(date_time, "a b d HMS Y")) %>%
mutate(Month=month(Date)) %>%
mutate(hour=hour(Date))
graveyard <- jukebox %>%
filter(Month<=5 | Month>=9) %>%
filter(hour>=24 | hour <=8)
graveyard_artist <- graveyard %>%
group_by(artist) %>%
tally() %>%
ungroup() %>%
arrange(desc(n))
graveyard_top10 <- graveyard_artist %>%
top_n(10)
## Selecting by n
graveyard_top10 %>%
kable()
| artist | n |
|---|---|
| OutKast | 2880 |
| Beatles, The | 2481 |
| Led Zeppelin | 1838 |
| Radiohead | 1757 |
| Rolling Stones, The | 1681 |
| Notorious B.I.G. | 1611 |
| Eminem | 1503 |
| Red Hot Chili Peppers, The | 1424 |
| Bob Dylan | 1281 |
| Talking Heads | 1276 |