Administrative:

Please indicate

Data

Question 1:

For this question we will be picking up from where we left off in HW-2, specifically the OkCupid dataset.

a)

Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.

Fitting generalized (binomial/logit) linear model: is_female ~ age + smc + grad + spc + ath + cur + fit + ff + smokes + drinks + drugs + height + log_income
Term                            Estimate  Std. Error  z value  Pr(>|z|)
Age                               0.0212     0.00354     5.97  2.32e-09
Some College                      -0.22      0.0899     -2.45  0.0144
Graduate/Professional Degree       0.362     0.0885      4.09  4.24e-05
Space Camp                        -0.0379    0.202      -0.187 0.852
Athletic                          -0.769     0.0997     -7.71  1.26e-14
Curvy                              3.99      0.225      17.8   9.86e-71
Fit                               -0.166     0.0885     -1.88  0.0605
Full Figured                       2.91      0.227      12.8   1.92e-37
Smokes                            -0.0231    0.0896     -0.257 0.797
Drinks                             0.262     0.0876      2.99  0.00278
Does Drugs                        -0.02      0.0906     -0.221 0.825
Height                            -0.567     0.0139    -40.8   0
Logged Income                     -0.325     0.0464     -7     2.51e-12
(Intercept)                       40         1.05       38.1   0

The results of our regression model show that age is a statistically significant predictor of sex. Holding the other predictors fixed, a one-year increase in age multiplies the odds of being female by about 1.021, i.e. an increase of roughly 2.1% in the odds. Conversely, having only some college education, compared to having a Bachelor’s, multiplies the odds of being female by about 0.80, i.e. a decrease of roughly 20% in the odds. Ideally, we would also report confidence intervals for these odds ratios, but the code that computes them currently breaks down.
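The conversion from coefficients to odds ratios can be checked by hand. The sketch below is illustrative Python, not the code behind this write-up. It exponentiates two estimates from the table above and builds approximate 95% Wald intervals as exp(estimate ± 1.96 × standard error). The Wald approximation and the helper name `odds_ratio_with_ci` are assumptions made for this example; a profile-likelihood interval from the fitted model could differ slightly.

```python
import math

# Estimates and standard errors copied from the regression table above.
coefs = {
    "Age": (0.0212, 0.00354),
    "Some College": (-0.22, 0.0899),
}

Z_95 = 1.96  # normal quantile for an approximate 95% Wald interval (assumption)


def odds_ratio_with_ci(estimate, std_error, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) on the odds scale."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in coefs.items()}

for name, (ratio, lower, upper) in results.items():
    print(f"{name}: odds ratio = {ratio:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that lies entirely above or entirely below 1 agrees with the corresponding p-value being under 0.05.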

b)

Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.

c)

Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\)

  • If \(\widehat{p}_i > p^*\), set predicted_sex = 1 i.e. they are female
  • If \(\widehat{p}_i < p^*\), set predicted_sex = 0 i.e. they are male

Display a 2 x 2 contingency table of sex and predicted_sex i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.

The contingency table below displays the predicted and actual sexes of OkCupid users in the Bay Area. Note that it covers only 8,879 users rather than all 59,946, presumably because users with a missing value in any predictor were dropped from the fit. Overall, our model seems to have fared quite well. Of the 6,742 users we predicted to be male, 6,043 (89.6%) were actually male, and of the 2,137 users we predicted to be female, 1,790 (83.8%) were actually female. Viewed the other way, the model correctly classified 94.6% of the 6,390 actual males but only 71.9% of the 2,489 actual females. Combined, our model predicted an OkCupid user’s sex correctly 88.2% of the time.

                Predicted Male  Predicted Female
Actual Male               6043               347
Actual Female              699              1790
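The rates quoted for this table depend on which total is used as the denominator, so it helps to compute them explicitly. The following is a minimal Python sketch using only the four counts in the table; the function names `share_correct_by_actual` and `share_correct_by_predicted` are made up for this illustration.

```python
# Counts copied from the contingency table above, keyed as (actual, predicted).
table = {
    ("male", "male"): 6043,
    ("male", "female"): 347,
    ("female", "male"): 699,
    ("female", "female"): 1790,
}

total = sum(table.values())
accuracy = (table[("male", "male")] + table[("female", "female")]) / total


def share_correct_by_actual(sex):
    """Of the users who really are `sex`, the share predicted correctly (row rate)."""
    row_total = sum(n for (actual, _), n in table.items() if actual == sex)
    return table[(sex, sex)] / row_total


def share_correct_by_predicted(sex):
    """Of the users predicted to be `sex`, the share who really are (column rate)."""
    col_total = sum(n for (_, predicted), n in table.items() if predicted == sex)
    return table[(sex, sex)] / col_total


print(f"users in table: {total}, overall accuracy: {accuracy:.3f}")
for sex in ("male", "female"):
    print(
        f"{sex}: correct among actual = {share_correct_by_actual(sex):.3f}, "
        f"correct among predicted = {share_correct_by_predicted(sex):.3f}"
    )
```

The row rates answer "how often do we recognize a true male or female?", while the column rates answer "how often is a given prediction right?".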

d)

Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?

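A false positive is the analogue of falsely rejecting the null hypothesis. In this case, it refers to predicting that a user is female when in fact they are male. As it currently stands, with \(p^*=0.5\), 347 of the 2,137 users we predicted to be female are actually male, so by the definition in the question the model is wrong only about 16.2% of the time. Since this is already under 20%, the threshold could presumably be lowered somewhat. In general, however, if we wanted to adjust the threshold to get exactly such a rate, I have no idea what we should do … (need to think more about this)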
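One possible approach, sketched here in Python under stated assumptions, is to scan a grid of candidate thresholds and report the lowest one at which the share of wrong "female" predictions is at most the target. The fitted probabilities and labels below are invented toy values, not the OkCupid data, and the function names and grid step are arbitrary choices for this illustration.

```python
def share_wrong_among_predicted_female(probs, is_female, threshold):
    """Share of users with p > threshold who are actually male (None if nobody is predicted female)."""
    predicted_female = [y for p, y in zip(probs, is_female) if p > threshold]
    if not predicted_female:
        return None
    wrong = sum(1 for y in predicted_female if y == 0)
    return wrong / len(predicted_female)


def lowest_threshold_within(probs, is_female, target=0.20, steps=100):
    """Scan thresholds 0, 1/steps, ..., 1 and return the first (threshold, rate) with rate <= target."""
    for k in range(steps + 1):
        threshold = k / steps
        rate = share_wrong_among_predicted_female(probs, is_female, threshold)
        if rate is not None and rate <= target:
            return threshold, rate
    return None  # no threshold on the grid meets the target


# Hypothetical fitted probabilities and true labels (1 = female, 0 = male).
toy_probs = [0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90]
toy_is_female = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

found = lowest_threshold_within(toy_probs, toy_is_female, target=0.20)
print(found)
```

The rate need not decrease smoothly as the threshold rises, which is why the sketch checks every grid point instead of solving for the threshold directly.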

Question 2:

Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.

What seasonal (i.e. cyclical) patterns do you observe?

Looking at the graph, the total number of songs played per week at Reed College rises considerably at the beginning and end of every calendar year and is lowest in the summer and during winter break. This pattern is expected, since there are presumably fewer students on campus during those breaks than during the academic year, which spans from September to May.
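The weekly totals behind such a plot come from grouping each play by the week it falls in. The following is a small Python sketch of that aggregation step only. The timestamps are invented, and treating Monday as the start of the week is an assumption; the actual jukebox data and plotting code are not reproduced here.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def week_start(timestamp):
    """Return the date of the Monday that begins the week containing `timestamp`."""
    return (timestamp - timedelta(days=timestamp.weekday())).date()


def songs_per_week(timestamps):
    """Count plays per week, returned as a chronologically sorted list of (week start, count)."""
    counts = Counter(week_start(ts) for ts in timestamps)
    return sorted(counts.items())


# Hypothetical play times, not taken from the jukebox data.
plays = [
    datetime(2004, 9, 6, 22, 15),   # Monday
    datetime(2004, 9, 8, 1, 5),     # Wednesday, same week
    datetime(2004, 9, 12, 23, 59),  # Sunday, same week
    datetime(2004, 9, 13, 0, 30),   # Monday of the following week
]

weekly = songs_per_week(plays)
for start, n in weekly:
    print(start, n)
```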

Question 3:

Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

Artist                          Plays (n)
OutKast                         847
Beatles, The                    698
Radiohead                       566
Led Zeppelin                    556
Notorious B.I.G.                519
Rolling Stones, The             480
Talking Heads                   462
Red Hot Chili Peppers, The      426
Michael Jackson                 407
Eminem                          384

The top artist played on the jukebox at Reed College’s pool hall during the graveyard shift between 2004 and 2009 is OutKast. The hip-hop duo beats its closest contender, The Beatles, by a considerable margin of 149 plays. The Beatles, in turn, were played 132 more times than Radiohead, who came in third place, ahead of Led Zeppelin by a mere 10 plays. Eminem, who took tenth place, can perhaps learn a lesson or two from OutKast, who were played 463 more times than the contemporary rapper.

Question 4:

We want to compare the volatility of

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

A visual comparison between the prices of Bitcoin and gold relative to the US dollar shows that, in the long term, gold is characterized by mild fluctuations. More specifically, the amplitude of the volatility of gold is around 417 USD. Bitcoin, on the other hand, seems to possess a maximum displacement of 600 USD, making it more volatile than gold overall. This large difference of almost 200 USD is driven in part by a drastic spike in the value of Bitcoin in the week of 29-11-2013, when its media presence was strong, as evidenced by references in articles in CNN Money, the Wall Street Journal, and Bloomberg. From mid-2014 onwards, the fluctuations of Bitcoin seem to have become milder, and they continue to look this way up to the present.
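The volatility measure named in the question, the relative day-to-day change in price, can be written as (price today − price yesterday) / price yesterday. Below is a minimal Python sketch of that calculation. The two price series are invented for illustration and are not the gold or Bitcoin data, and summarizing volatility by the mean absolute change is an assumption chosen for this example.

```python
def daily_relative_changes(prices):
    """Relative change between consecutive days: (today - yesterday) / yesterday."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]


def mean_absolute_change(prices):
    """Average size of the day-to-day relative change, ignoring direction."""
    changes = daily_relative_changes(prices)
    return sum(abs(c) for c in changes) / len(changes)


# Hypothetical USD prices over three days for two assets.
steady_asset = [100.0, 101.0, 100.0]
jumpy_asset = [100.0, 150.0, 75.0]

print(daily_relative_changes(jumpy_asset))
print(mean_absolute_change(steady_asset), mean_absolute_change(jumpy_asset))
```

Because the changes are relative, the two assets are compared on the same scale even when their dollar prices differ greatly, which a comparison of raw USD swings does not guarantee.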

Question 5:

Using the data loaded from Quandl below, plot a time series using geom_line() comparing cheese and milk production in the US from 1930 to today. Comment on this.