Administrative:

Please indicate

Data

Question 1:

For this question we will be picking up from where we left off in HW-2, specifically the OkCupid dataset.

a)

Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.

Fitting generalized (binomial/logit) linear model: is_female ~ age + smc + grad + spc + ath + cur + fit + ff + smokes + drinks + drugs + height + log_income
                                Estimate  Std. Error  z value  Pr(>|z|)
Age                             0.0212    0.00354     5.97     2.32e-09
Some College                    -0.22     0.0899      -2.45    0.0144
Graduate/Professional Degree    0.362     0.0885      4.09     4.24e-05
Space Camp                      -0.0379   0.202       -0.187   0.852
Athletic                        -0.769    0.0997      -7.71    1.26e-14
Curvy                           3.99      0.225       17.8     9.86e-71
Fit                             -0.166    0.0885      -1.88    0.0605
Full Figured                    2.91      0.227       12.8     1.92e-37
Smokes                          -0.0231   0.0896      -0.257   0.797
Drinks                          0.262     0.0876      2.99     0.00278
Does Drugs                      -0.02     0.0906      -0.221   0.825
Height                          -0.567    0.0139      -40.8    0
Logged Income                   -0.325    0.0464      -7       2.51e-12
(Intercept)                     40        1.05        38.1     0

The results of our regression model show that age is a statistically significant predictor of sex. A one-year increase in age is associated, on average, with the odds of being female being multiplied by 1.021 (about a 2.1% increase), ceteris paribus. The log-odds estimate for age (0.0212) has a 95% confidence interval that stretches from 0.0142 to 0.0281. Conversely, having only some college education, compared to having a Bachelor’s, multiplies the odds of being female by 0.802 (a decrease of about 20%); its log-odds coefficient has a confidence interval of [-0.396, -0.0438]. Surprisingly, smoking does not look significant (p = 0.797), while drinking is associated with an increase in the odds of being female (!).

                                2.5 %     97.5 %
Intercept                       37.9      42
Age                             0.0142    0.0281
Some College                    -0.396    -0.0438
Graduate/Professional Degree    0.189     0.536
Space Camp                      -0.434    0.359
Athletic                        -0.964    -0.573
Curvy                           3.55      4.43
Fit                             -0.34     0.00735
Full Figured                    2.46      3.35
Smokes                          -0.199    0.153
Drinks                          0.0904    0.434
Does Drugs                      -0.198    0.158
Height                          -0.594    -0.54
Logged Income                   -0.416    -0.234
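The step from a log-odds coefficient to the odds multipliers quoted in the interpretation can be checked with a short Python sketch. The estimates and standard errors are copied from the regression table; the function names are illustrative, and the interval uses a Wald approximation (estimate ± 1.96 standard errors), which is an assumption here since the reported intervals may have been computed by a different method.

```python
import math

def odds_ratio(log_odds):
    """Convert a logistic-regression coefficient (log-odds) to an odds multiplier."""
    return math.exp(log_odds)

def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% interval on the log-odds scale (Wald approximation, an assumption)."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Estimates and standard errors copied from the regression table.
age_or = odds_ratio(0.0212)
some_college_or = odds_ratio(-0.22)
age_ci = wald_interval(0.0212, 0.00354)
some_college_ci = wald_interval(-0.22, 0.0899)

print(f"Age: odds multiplied by {age_or:.3f} per year")
print(f"Some College: odds multiplied by {some_college_or:.3f}")
print(f"Age interval (log-odds): [{age_ci[0]:.4f}, {age_ci[1]:.4f}]")
print(f"Some College interval (log-odds): [{some_college_ci[0]:.4f}, {some_college_ci[1]:.4f}]")
```

A multiplier below 1, as for Some College, means the odds shrink; subtracting it from 1 gives the percentage decrease in the odds.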

b)

Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.

This histogram shows the distribution of the fitted probabilities \(\widehat{p}_i\) for users in the OkCupid dataset. Given the model specification chosen above, notice that the total number of observations drops from 59,946 to 8,879.

c)

Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\)

  • If \(\widehat{p}_i > p^*\), set predicted_sex = 1 i.e. they are female
  • If \(\widehat{p}_i < p^*\), set predicted_sex = 0 i.e. they are male

Display a 2 x 2 contingency table of sex and predicted_sex i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.

The contingency table below displays the predicted and actual sexes of OkCupid users in the Bay Area, with actual sex in the rows and predicted sex in the columns. Overall, our model seems to have fared well. Of the 6,390 males in our sample, we accurately predicted the sex of 6,043, that is, 94.6% of the time. For the 2,489 females, the model predicted their sex correctly 71.9% of the time. Read the other way, 89.6% of the 6,742 users predicted to be male are male, and 83.8% of the 2,137 users predicted to be female are female. Combined, our model made correct predictions about an OkCupid user’s sex 88.2% of the time. As mentioned above, the total number of users (8,879) is less than the expected \(n=59946\), since regression models drop users for whom information on any model variable is missing.

          Predicted Male    Predicted Female
Male      6043              347
Female    699               1790
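The rates that can be read off this table are easy to mix up, so here is a small Python sketch that computes them from the four counts. It assumes the table is read as labelled, with actual sex in the rows and predicted sex in the columns; the variable names are illustrative.

```python
# Counts copied from the contingency table (rows = actual, columns = predicted).
male_pred_male, male_pred_female = 6043, 347
female_pred_male, female_pred_female = 699, 1790

total = male_pred_male + male_pred_female + female_pred_male + female_pred_female

# Share of all users whose sex was predicted correctly.
accuracy = (male_pred_male + female_pred_female) / total

# Row-wise rates: of the actual males / females, how many were predicted correctly?
male_recall = male_pred_male / (male_pred_male + male_pred_female)
female_recall = female_pred_female / (female_pred_male + female_pred_female)

# Column-wise rates: of those predicted male / female, how many really are?
male_precision = male_pred_male / (male_pred_male + female_pred_male)
female_precision = female_pred_female / (male_pred_female + female_pred_female)

print(f"n = {total}, accuracy = {accuracy:.1%}")
print(f"actual males predicted male:     {male_recall:.1%}")
print(f"actual females predicted female: {female_recall:.1%}")
print(f"predicted males who are male:     {male_precision:.1%}")
print(f"predicted females who are female: {female_precision:.1%}")
```

The row-wise and column-wise rates answer different questions, so it helps to state which one is being quoted.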

d)

Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?

A false positive is, in hypothesis-testing terms, a false rejection of the null hypothesis. In this case, it refers to predicting that a user is female when in fact they are male. As it currently stands, with \(p^*=0.5\), 347 of the 2,137 users we predicted to be female are male, a false positive rate (as defined in the question) of about 16.2%, already below the 20% target. In any case, to determine the best threshold for our model, we draw a Receiver Operating Characteristic (ROC) curve. Professor, I am unclear on how to interpret this graph. It seems too far away from the diagonal line (\(x=y\)). What are your thoughts?
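One direct way to approach the threshold question, separate from the ROC curve, is to scan candidate thresholds and compute, for each, the share of predicted females who are actually male. The Python sketch below does this on a small made-up dataset; the data, the function names, and the 0.01 grid are all assumptions for illustration, not values from the OkCupid model.

```python
def false_positive_share(p_hat, is_female, threshold):
    """Share of users predicted female (p_hat > threshold) who are actually male."""
    predicted_female = [y for p, y in zip(p_hat, is_female) if p > threshold]
    if not predicted_female:
        return None  # nobody is predicted female at this threshold
    return 1 - sum(predicted_female) / len(predicted_female)

def lowest_threshold_meeting_target(p_hat, is_female, target=0.20):
    """Scan thresholds 0.00, 0.01, ..., 0.99 and return the first that meets the target."""
    for step in range(100):
        threshold = step / 100
        share = false_positive_share(p_hat, is_female, threshold)
        if share is not None and share <= target:
            return threshold
    return None

# Hypothetical fitted probabilities and true labels (1 = female), for illustration only.
toy_p_hat = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
toy_is_female = [0, 0, 1, 0, 0, 1, 1, 1]

print(lowest_threshold_meeting_target(toy_p_hat, toy_is_female))
```

This share does not have to fall steadily as the threshold rises, so the sketch returns the lowest threshold on the grid that meets the target rather than assuming every higher threshold does too.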

Question 2:

Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.

What seasonal (i.e. cyclical) patterns do you observe?

This graph displays the total number of songs played on the jukebox for every week of the years 2004-2009 at Reed College’s pool hall. On average, as the red line shows, 827 songs were played per week. Looking at the graph, it seems that the beginning and end of every year witness a considerable increase in the total number of songs played per week. This pattern is expected, since there are presumably fewer students in the summer and during winter break, when the number of songs played is lowest, compared to the academic year, which spans from September to May. It is unclear why the number of songs shoots up to 1,641, which is two standard deviations away from the mean of 827, in the week of 27 January 2007.
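The weekly aggregation and the "two standard deviations above the mean" check described above can be sketched in Python as follows. The play dates are toy data invented for the example, and the function names and the choice of Monday as the start of the week are assumptions, not details of the original analysis.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev

def plays_per_week(play_dates):
    """Count plays per week, keyed by the Monday that starts each week."""
    return Counter(d - timedelta(days=d.weekday()) for d in play_dates)

def unusual_weeks(weekly_counts, n_sd=2):
    """Return weeks whose count is at least n_sd standard deviations above the mean."""
    counts = list(weekly_counts.values())
    cutoff = mean(counts) + n_sd * pstdev(counts)
    return [week for week, n in weekly_counts.items() if n >= cutoff]

# Toy data: six weeks with 2 plays each, except one week with 10 plays.
start = date(2007, 1, 1)  # a Monday
toy_dates = []
for week_index in range(6):
    n_plays = 10 if week_index == 4 else 2
    for k in range(n_plays):
        toy_dates.append(start + timedelta(weeks=week_index, days=k % 7))

weekly = plays_per_week(toy_dates)
print(unusual_weeks(weekly))
```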

Question 3:

Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

artist n
OutKast 2880
Beatles, The 2481
Led Zeppelin 1838
Radiohead 1757
Rolling Stones, The 1681
Notorious B.I.G. 1611
Eminem 1503
Red Hot Chili Peppers, The 1424
Bob Dylan 1281
Talking Heads 1276

The top artist played on the jukebox at Reed College’s pool hall between 2004 and 2009 during the graveyard shift in the academic year is OutKast. The hip-hop duo beats its closest contender, The Beatles, by a considerable margin of 399 plays. The Beatles, in turn, were played 643 times more than Led Zeppelin, who came in third place, ahead of Radiohead by a mere 81 plays. The Talking Heads, who took tenth place, can perhaps learn a lesson or two from OutKast, who were played 1,604 times more than this 70s and 80s rock band.

Let’s look at what the graveyard shift looks like during the summer for comparison purposes:

artist n
OutKast 847
Beatles, The 698
Radiohead 566
Led Zeppelin 556
Notorious B.I.G. 519
Rolling Stones, The 480
Talking Heads 462
Red Hot Chili Peppers, The 426
Michael Jackson 407
Eminem 384

We see that the pattern is largely the same, with a few shifts in the ranking. The top artist remains OutKast, although it is played less often in absolute terms. The hip-hop duo beats its closest contender, The Beatles, by a margin of 149 plays. The Beatles, in turn, were played 132 times more than Radiohead, who moved up to third place, overtaking Led Zeppelin by a mere 10 plays. Eminem lost ground to the Talking Heads, The Red Hot Chili Peppers, and Michael Jackson, coming in tenth, 463 plays behind OutKast, the overall winner.
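The rank shifts between the two tables can be listed mechanically. This Python sketch takes the artist order from the academic-year and summer tables as given; the function name is illustrative.

```python
# Artist order copied from the two top-10 tables.
academic_year = [
    "OutKast", "Beatles, The", "Led Zeppelin", "Radiohead", "Rolling Stones, The",
    "Notorious B.I.G.", "Eminem", "Red Hot Chili Peppers, The", "Bob Dylan", "Talking Heads",
]
summer = [
    "OutKast", "Beatles, The", "Radiohead", "Led Zeppelin", "Notorious B.I.G.",
    "Rolling Stones, The", "Talking Heads", "Red Hot Chili Peppers, The",
    "Michael Jackson", "Eminem",
]

def rank_shifts(before, after):
    """Map each artist present in both lists to (rank before, rank after), ranks from 1."""
    after_rank = {artist: i for i, artist in enumerate(after, start=1)}
    return {
        artist: (i, after_rank[artist])
        for i, artist in enumerate(before, start=1)
        if artist in after_rank
    }

shifts = rank_shifts(academic_year, summer)
for artist, (old, new) in shifts.items():
    if old != new:
        print(f"{artist}: {old} -> {new}")
```

Artists that appear in only one list (Bob Dylan, Michael Jackson) are left out of the comparison.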

Question 4:

We want to compare the volatility of

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

A visual comparison between the prices of bitcoin and gold relative to the US dollar shows that, in the long term, gold is characterized by mild fluctuations. More specifically, the amplitude of gold’s price swings is around 417 USD. Bitcoin, on the other hand, seems to have a maximum displacement of 600 USD, making it more volatile than gold overall. This large difference of almost 200 USD is driven in part by a drastic spike in the value of Bitcoin in the week of 29-11-2013, when its media presence was strong, as evidenced by references in articles in CNN Money, the Wall Street Journal, and Bloomberg. From mid-2014 onwards, the fluctuations of Bitcoin seem to have become milder, and they continue to look this way up to the present.

We can do better by defining volatility as the relative change in price between one day and the previous day. Plotting the volatility for Bitcoin and gold, we get:

This graph shows gold to be a consistent currency in its volatility, restricted to a band of about 50% in either direction from the day before. Bitcoin, however, seems to have faced much greater volatility, especially in 2014, when there were swings of more than 300% in the negative direction and 150% in the positive direction in the percentage change in price from the day before. It does seem that the relative change is converging to a narrower band from March 2014 onwards.
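The volatility measure used here, the relative change in price from one day to the next, can be written as a short Python function. The prices below are made up for illustration, and the function name is an assumption.

```python
def relative_changes(prices):
    """Return (p_t - p_{t-1}) / p_{t-1} for each consecutive pair of daily prices."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]

# Hypothetical daily prices in USD, for illustration only.
toy_prices = [100.0, 110.0, 99.0]
toy_changes = relative_changes(toy_prices)
print([f"{change:+.0%}" for change in toy_changes])
```

Dividing by the previous day's price puts both assets on the same scale, so the resulting series can be compared directly even though the price levels differ.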

Even better would be to rescale the magnitude of the difference.

This graph drives the point home. Gold is considerably less volatile than bitcoin across the board. It also confirms our earlier suspicion that the volatility of Bitcoin has been decreasing over time.

Question 5:

Using the data loaded from Quandl below, plot a time series using geom_line() comparing cheese and milk production in the US from 1930 to today. Comment on this.

It seems that, while milk production has soared since the 1980s, cheese production has witnessed only a very mild rise. The steep rise in milk production may be explained by the fact that the demand for milk is driven by more than direct consumer consumption. Cheese, for instance, depends on milk. Hence, the smaller upward trend in cheese may be driving part of the increase in milk production. After all, to produce one pound of a cheese like Cheddar, one needs at least 10 pounds of milk (pounds are a weird unit for measuring milk, but they guarantee that our axes are the same for two distinct variables).

To make the comparison fairer, let’s consider the percentage change in milk and cheese production, relative to the first year we have information on both industries, which is 1924. Effectively, we are creating an index:

The trends switch. While the naive graph made it seem as if milk was growing at a faster rate than cheese, it is now evident that cheese is growing quite rapidly. This trend was masked by the sheer size of milk production. It is more difficult to sustain high levels of growth when an industry is already so big, whereas cheese seems to be taking off, especially in the 90s and 2000s. Note: the y-axis should be read as how many times larger the industry is compared to its level in 1924. In the case of cheese in 2000, for instance, we would say that the production level is 15 times its 1924 level. For milk, it would be 3 times as large.
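The index described above divides each year's production by production in the base year. A minimal Python sketch follows; the production figures are invented so that the ratios are easy to follow and are not the Quandl data, and the function name is an assumption.

```python
def index_to_base(series, base_year):
    """Express each year's value as a multiple of the value in base_year."""
    base = series[base_year]
    return {year: value / base for year, value in series.items()}

# Made-up production figures (same unit for both), for illustration only.
toy_cheese = {1924: 2.0, 1960: 8.0, 2000: 30.0}
toy_milk = {1924: 100.0, 1960: 150.0, 2000: 300.0}

cheese_index = index_to_base(toy_cheese, 1924)
milk_index = index_to_base(toy_milk, 1924)

print(cheese_index)
print(milk_index)
```

Both indexed series start at 1 in the base year, which is what lets a small industry's growth be compared with a much larger one's on the same axis.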