Please indicate
profiles.csv from HW-2 to the data folder in the HW-3 directory (HW-3 folder).
For this question we will be picking up from where we left off in HW-2, specifically the OkCupid dataset.
Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Age | 0.0212 | 0.00354 | 5.97 | 2.32e-09 |
| Some College | -0.22 | 0.0899 | -2.45 | 0.0144 |
| Graduate/ Professional Degree | 0.362 | 0.0885 | 4.09 | 4.24e-05 |
| Space Camp | -0.0379 | 0.202 | -0.187 | 0.852 |
| Athletic | -0.769 | 0.0997 | -7.71 | 1.26e-14 |
| Curvy | 3.99 | 0.225 | 17.8 | 9.86e-71 |
| Fit | -0.166 | 0.0885 | -1.88 | 0.0605 |
| Full Figured | 2.91 | 0.227 | 12.8 | 1.92e-37 |
| Smokes | -0.0231 | 0.0896 | -0.257 | 0.797 |
| Drinks | 0.262 | 0.0876 | 2.99 | 0.00278 |
| Does Drugs | -0.02 | 0.0906 | -0.221 | 0.825 |
| Height | -0.567 | 0.0139 | -40.8 | 0 |
| Logged Income | -0.325 | 0.0464 | -7 | 2.51e-12 |
| (Intercept) | 40 | 1.05 | 38.1 | 0 |
The results of our regression model show that age is a statistically significant predictor of sex. Holding the other predictors fixed, a one-year increase in age multiplies the odds of being female by about 1.021 (\(e^{0.0212}\)), i.e. an increase of roughly 2.1%. Conversely, having only some college education, compared to having a Bachelor’s, multiplies the odds of being female by about 0.80 (\(e^{-0.22}\)), a decrease of roughly 20%. Ideally, we would be able to see the confidence intervals for these estimates, but the code breaks down.
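Approximate confidence intervals can still be recovered by hand from the estimates and standard errors in the table. The sketch below is a minimal illustration, assuming Wald-type 95% intervals (estimate ± 1.96 standard errors on the log-odds scale, then exponentiated); the function name `odds_ratio_ci` is made up for this example, and intervals computed by other methods may differ slightly.

```python
import math

# Estimates and standard errors copied from the regression table above.
coefs = {
    "Age": (0.0212, 0.00354),
    "Some College": (-0.22, 0.0899),
}

Z_95 = 1.96  # normal quantile for a two-sided 95% Wald interval (assumption)


def odds_ratio_ci(estimate, std_error, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) from a log-odds estimate."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


results = {name: odds_ratio_ci(b, se) for name, (b, se) in coefs.items()}
for name, (ratio, lower, upper) in results.items():
    print(f"{name}: odds ratio = {ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that lies entirely above or below 1 agrees with the corresponding small p-value in the table.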
Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.
Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\)
- If \(\widehat{p}_i > p^*\), set predicted_sex = 1, i.e. they are female
- If \(\widehat{p}_i < p^*\), set predicted_sex = 0, i.e. they are male

Display a 2 x 2 contingency table of sex and predicted_sex, i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.
The contingency table below displays predicted and actual sexes of OkCupid users in the Bay Area. Overall, our model seems to have fared quite well. Of the 6,742 users we predicted to be male, 6,043 are in fact male, so our male predictions were correct about 90% of the time. Of the 2,137 users we predicted to be female, 1,790 are in fact female, so our female predictions were correct 83.8% of the time. Combined, 7,833 of the 8,879 users in the table were classified correctly, i.e. our model made correct predictions 88.2% of the time regarding an OkCupid user’s sex.
| | Predicted Male | Predicted Female |
|---|---|---|
| Male | 6043 | 347 |
| Female | 699 | 1790 |
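The summary percentages for this table can be recomputed from its four cells. The sketch below, with variable names chosen only for this example, reads the table both by column (how often a given prediction is right) and by row (what share of each actual group the model recovers), since the two readings can give quite different pictures.

```python
# Counts copied from the contingency table above (rows: actual, columns: predicted).
male_pred_male, male_pred_female = 6043, 347
female_pred_male, female_pred_female = 699, 1790

total = male_pred_male + male_pred_female + female_pred_male + female_pred_female
accuracy = (male_pred_male + female_pred_female) / total

# Column-wise: of the users given a certain prediction, how many match it?
male_prediction_correct = male_pred_male / (male_pred_male + female_pred_male)
female_prediction_correct = female_pred_female / (male_pred_female + female_pred_female)

# Row-wise: of the users of a certain actual sex, how many were recovered?
male_recovered = male_pred_male / (male_pred_male + male_pred_female)
female_recovered = female_pred_female / (female_pred_male + female_pred_female)

print(f"users in table:             {total}")
print(f"overall accuracy:           {accuracy:.1%}")
print(f"predicted male, correct:    {male_prediction_correct:.1%}")
print(f"predicted female, correct:  {female_prediction_correct:.1%}")
print(f"actual males recovered:     {male_recovered:.1%}")
print(f"actual females recovered:   {female_recovered:.1%}")
```

The row-wise figures are worth reporting alongside the column-wise ones: the model recovers a noticeably smaller share of actual females than of actual males.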
Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?
A false positive is an instance of falsely rejecting the null hypothesis. In this case, it refers to predicting that a user is female when in fact they are male. As it currently stands, 347 of the 2,137 users we predicted to be female are actually male, a false positive rate of only about 16.2%. In general, however, if we wanted to adjust the threshold to get such a rate, I have no idea what we should do … (need to think more about this)
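One possible way to approach the open question is to scan a grid of candidate thresholds and, for each, compute the share of users predicted female who are actually male. The sketch below is only an illustration: the fitted probabilities are not reproduced in this write-up, so `probs` and `is_female` are small made-up lists, and the helper names are hypothetical. Because the share at \(p^*=0.5\) is already below 20%, the threshold that brings it up to about 20% would normally lie somewhat below 0.5.

```python
def wrong_share(probs, is_female, threshold):
    """Share of users predicted female (p > threshold) who are actually male."""
    predicted_female = [f for p, f in zip(probs, is_female) if p > threshold]
    if not predicted_female:
        return None  # nobody is predicted female at this threshold
    return sum(1 for f in predicted_female if not f) / len(predicted_female)


def lowest_threshold(probs, is_female, target=0.20, steps=100):
    """Scan thresholds 0.00, 0.01, ... and return the first one meeting the target."""
    for k in range(steps):
        t = k / steps
        share = wrong_share(probs, is_female, t)
        if share is not None and share <= target:
            return t, share
    return None


# Made-up fitted probabilities and true labels (True = female), for illustration only.
probs = [0.052, 0.118, 0.213, 0.334, 0.405, 0.465,
         0.556, 0.621, 0.713, 0.834, 0.912, 0.958]
is_female = [False, False, False, True, False, False,
             True, False, True, True, True, True]

found = lowest_threshold(probs, is_female)
print(found)
```

On the real data, the same scan would be run on the model's fitted probabilities, and the chosen threshold could then be checked by rebuilding the contingency table.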
Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.
What seasonal (i.e. cyclical) patterns do you observe?
Looking at the graph, it seems that the beginning and end of every year see a considerable increase in the total number of songs played per week at Reed College. This pattern is expected, since there are presumably fewer students on campus in the summer and during winter break, when the number of songs played is lowest, compared to the academic year, which spans from September to May.
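The weekly totals behind such a plot come from grouping each play by the week it falls in. The sketch below shows one way to do that grouping, assuming weeks start on Monday; the timestamps are made up for illustration and the helper `week_start` is hypothetical, not taken from the jukebox data or the original analysis.

```python
from collections import Counter
from datetime import datetime, timedelta


def week_start(ts):
    """Return the date of the Monday that starts the week containing ts."""
    return (ts - timedelta(days=ts.weekday())).date()


# Made-up play timestamps standing in for the jukebox log.
plays = [
    datetime(2004, 9, 6, 21, 15),   # a Monday
    datetime(2004, 9, 8, 23, 40),   # Wednesday of the same week
    datetime(2004, 9, 12, 1, 5),    # Sunday of the same week
    datetime(2004, 9, 13, 20, 0),   # Monday of the following week
]

# One count per week; plotting these in date order gives the time series.
weekly_counts = Counter(week_start(ts) for ts in plays)
for start, n in sorted(weekly_counts.items()):
    print(start, n)
```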
Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define
| artist | n |
|---|---|
| OutKast | 847 |
| Beatles, The | 698 |
| Radiohead | 566 |
| Led Zeppelin | 556 |
| Notorious B.I.G. | 519 |
| Rolling Stones, The | 480 |
| Talking Heads | 462 |
| Red Hot Chili Peppers, The | 426 |
| Michael Jackson | 407 |
| Eminem | 384 |
The top artist played on the jukebox at Reed College’s pool hall during the graveyard shift between 2004 and 2009 is OutKast. The hip-hop duo beats its closest contender, The Beatles, by a considerable margin of 149 plays. The Beatles, in turn, were played 132 times more than Radiohead, who came in third place, ahead of Led Zeppelin by a mere 10 plays. Eminem, who took tenth place, can perhaps learn a lesson or two from OutKast, who were played 463 times more than the contemporary rapper.
We want to compare the volatility of
Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
A visual comparison between the prices of Bitcoin and gold relative to the US dollar shows that, in the long term, gold is characterized by mild fluctuations. More specifically, the amplitude of gold’s fluctuations is around 417 USD. Bitcoin, on the other hand, seems to have a maximum displacement of 600 USD, making it more volatile than gold overall. This large difference of almost 200 USD is driven in part by a drastic spike in the value of Bitcoin in the week of 29-11-2013, when its media presence was strong, as evidenced by references in articles in CNN Money, the Wall Street Journal, and Bloomberg. From mid-2014 onwards, it seems that the fluctuations of Bitcoin have become milder, and they continue to look this way up to the present.
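The comparison above is expressed in USD swings, whereas the measure defined in the question is the relative change from day to day, \((p_t - p_{t-1}) / p_{t-1}\), which does not depend on how high each price level is. The sketch below illustrates that measure on made-up price series; the numbers are not the actual gold or Bitcoin prices, and summarizing the daily changes by their standard deviation is an assumption of this example rather than something prescribed by the question.

```python
import statistics


def relative_changes(prices):
    """Day-to-day relative change: (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


# Made-up daily prices in USD, for illustration only.
asset_a = [1300.0, 1310.0, 1295.0, 1305.0, 1300.0]   # high price level, small moves
asset_b = [400.0, 460.0, 390.0, 450.0, 380.0]        # lower price level, large moves

# Spread of the daily relative changes as a single volatility number.
vol_a = statistics.stdev(relative_changes(asset_a))
vol_b = statistics.stdev(relative_changes(asset_b))

print(f"asset_a volatility: {vol_a:.4f}")
print(f"asset_b volatility: {vol_b:.4f}")
```

Measured this way, an asset with a lower price can still turn out to be the more volatile one, which is the comparison a currency exchanger would care about.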