MATH 216 Homework 3

Administrative:

Please indicate

Who you collaborated with: None
Roughly how much time you spent on this HW: 5 Hours
What gave you the most trouble: Question 1
Any comments you have: Question 1 is in a very crude state as it currently stands. I would love to discuss the concepts related to machine learning in class. In addition, what are the statistical implications of losing a considerable chunk of the dataset as a result of the variables chosen and of how thresholds are chosen?

Data

You must first copy the file profiles.csv from HW-2 to the data folder in the HW-3 directory
We also consider all 222,540 songs played in the Reed College pool hall jukebox from Nov 30, 2003 to Jan 22, 2009 (included in HW-3 folder).

Question 1:

For this question we will be picking up from where we left off in HW-2, specifically the OkCupid dataset.

a)

Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.

Fitting generalized (binomial/logit) linear model: is_female ~ age + smc + grad + spc + ath + cur + fit + ff + smokes + drinks + drugs + height + log_income
	Estimate	Std. Error	z value	Pr(>|z|)
Age	0.0212	0.00354	5.97	2.32e-09
Some College	-0.22	0.0899	-2.45	0.0144
Graduate/Professional Degree	0.362	0.0885	4.09	4.24e-05
Space Camp	-0.0379	0.202	-0.187	0.852
Athletic	-0.769	0.0997	-7.71	1.26e-14
Curvy	3.99	0.225	17.8	9.86e-71
Fit	-0.166	0.0885	-1.88	0.0605
Full Figured	2.91	0.227	12.8	1.92e-37
Smokes	-0.0231	0.0896	-0.257	0.797
Drinks	0.262	0.0876	2.99	0.00278
Does Drugs	-0.02	0.0906	-0.221	0.825
Height	-0.567	0.0139	-40.8	0
Logged Income	-0.325	0.0464	-7	2.51e-12
(Intercept)	40	1.05	38.1	0

The results of our regression model show that age is a statistically significant predictor of sex. Holding the other variables fixed, a one-year increase in age multiplies the odds of being female by \(e^{0.0212} \approx 1.021\), i.e. an increase of about 2.1%. Conversely, having only some college education, compared to having a Bachelor’s, multiplies the odds of being female by \(e^{-0.22} \approx 0.802\), a decrease of about 20%. Ideally, we would be able to see the confidence intervals for these estimates, but the code breaks down.
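Since the confidence intervals are missing, one rough way to recover them is from the printed estimates and standard errors alone. The sketch below is in Python rather than the R used for the fit, and it assumes approximate 95% Wald intervals (estimate ± 1.96 standard errors, exponentiated to the odds-ratio scale); intervals computed by other methods may differ slightly. The function name `odds_ratio_ci` is hypothetical.

```python
import math

# Estimates and standard errors copied from the regression table above.
coefs = {
    "Age": (0.0212, 0.00354),
    "Some College": (-0.22, 0.0899),
}

def odds_ratio_ci(estimate, std_error, z=1.96):
    """Return (odds ratio, lower, upper) using an approximate 95% Wald interval."""
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return math.exp(estimate), lower, upper

results = {name: odds_ratio_ci(b, se) for name, (b, se) in coefs.items()}
for name, (ratio, lower, upper) in results.items():
    print(f"{name}: OR = {ratio:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

An interval that lies entirely above 1 (or entirely below 1) is consistent with the coefficient being significant at the 5% level.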

b)

Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.

c)

Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\)

If \(\widehat{p}_i > p^*\), set predicted_sex = 1 i.e. they are female
If \(\widehat{p}_i < p^*\), set predicted_sex = 0 i.e. they are male

Display a 2 x 2 contingency table of sex and predicted_sex i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.

The contingency table below displays the predicted and actual sexes of OkCupid users in the Bay Area, with actual sex in the rows and predicted sex in the columns. It sums to 8,879 users rather than the full 59,946. Overall, our model seems to have fared quite well. Of the 6,742 users we predicted to be male, 6,043 (89.6%) actually were male; of the 2,137 users we predicted to be female, 1,790 (83.8%) actually were female. Read by rows instead, the model correctly identified 94.6% of the 6,390 actual males but only 71.9% of the 2,489 actual females. Combined, our model made correct predictions 88.2% of the time regarding an OkCupid user’s sex.

	Predicted Male	Predicted Female
Male	6043	347
Female	699	1790
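The percentages that can be read off this table depend on whether one divides by row totals (actual sex) or column totals (predicted sex). A minimal Python sketch, using only the four counts above; the variable names are hypothetical:

```python
# Counts copied from the contingency table above
# (rows = actual sex, columns = predicted sex).
male_pred_male, male_pred_female = 6043, 347      # actual males
female_pred_male, female_pred_female = 699, 1790  # actual females

total = male_pred_male + male_pred_female + female_pred_male + female_pred_female
accuracy = (male_pred_male + female_pred_female) / total

# Row-wise: share of each actual sex that was predicted correctly.
male_recall = male_pred_male / (male_pred_male + male_pred_female)
female_recall = female_pred_female / (female_pred_male + female_pred_female)

# Column-wise: share of each predicted sex that was actually correct.
male_precision = male_pred_male / (male_pred_male + female_pred_male)
female_precision = female_pred_female / (female_pred_female + male_pred_female)

print(f"n = {total}, accuracy = {accuracy:.1%}")
print(f"recall: male {male_recall:.1%}, female {female_recall:.1%}")
print(f"precision: male {male_precision:.1%}, female {female_precision:.1%}")
```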

d)

Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?

A false positive is analogous to falsely rejecting the null hypothesis. In this case, it refers to predicting that a user is female when in fact they are male. As it currently stands, at \(p^*=0.5\), 347 of the 2,137 users predicted to be female are actually male, a rate of only 16.2%, which is already under 20%. In general, however, if we wanted to adjust the threshold to get such a rate, I have no idea what we should do … (need to think more about this)
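One possible approach, offered only as a sketch, is to scan a grid of candidate thresholds, compute for each the share of users predicted female who are actually male, and keep the threshold whose share is closest to the 20% target. The Python below uses small hypothetical vectors of fitted probabilities and labels, not the OkCupid data, and the function names are made up for illustration.

```python
def wrong_share(probs, is_female, threshold):
    """Share of users predicted female (p > threshold) who are actually male."""
    predicted_female = [y for p, y in zip(probs, is_female) if p > threshold]
    if not predicted_female:
        return None  # nobody is predicted female at this threshold
    return predicted_female.count(0) / len(predicted_female)

def closest_threshold(probs, is_female, target=0.20):
    """Scan thresholds 0.00, 0.01, ..., 0.99 and keep the one nearest the target."""
    best = None
    for step in range(100):
        threshold = step / 100
        rate = wrong_share(probs, is_female, threshold)
        if rate is None:
            continue
        # Strict comparison keeps the smallest threshold among ties.
        if best is None or abs(rate - target) < abs(best[1] - target):
            best = (threshold, rate)
    return best

# Hypothetical fitted probabilities and true labels (1 = female, 0 = male).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
is_female = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

best_threshold, best_rate = closest_threshold(probs, is_female)
print(best_threshold, round(best_rate, 3))
```

The share does not have to change smoothly as the threshold moves, which is why the sketch scans the whole grid instead of stepping in one direction from 0.5.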

Question 2:

Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.

On the x-axis present actual dates (not something like Week 93, which doesn’t mean anything to most people).
On the y-axis present the total number of songs.

What seasonal (i.e. cyclical) patterns do you observe?

Looking at the graph, it seems that the beginning and end of every year see a considerable increase in the total number of songs played per week at Reed College. This pattern is expected, since there are presumably fewer students on campus in the summer and during winter break, when the number of songs played is lowest, compared to the academic year, which spans from September to May.
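The weekly totals behind such a plot can be obtained by mapping each play to the date on which its week starts, so that the x-axis shows real dates. A minimal Python sketch, assuming weeks start on Monday and using a few hypothetical timestamps in place of the jukebox file:

```python
from collections import Counter
from datetime import datetime, timedelta

def week_start(timestamp):
    """Return the date of the Monday that begins the timestamp's week."""
    day = timestamp.date()
    return day - timedelta(days=day.weekday())

# Hypothetical play timestamps; the real file has one row per song played.
plays = [
    datetime(2003, 12, 1, 14, 5),   # Monday
    datetime(2003, 12, 3, 23, 40),  # Wednesday of the same week
    datetime(2003, 12, 8, 0, 15),   # the following Monday
]

weekly_counts = Counter(week_start(t) for t in plays)
for start, n in sorted(weekly_counts.items()):
    print(start.isoformat(), n)
```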

Question 3:

Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

the “graveyard shift” as midnight to 8am
the academic year as September through May (inclusive)

artist	n
OutKast	847
Beatles, The	698
Radiohead	566
Led Zeppelin	556
Notorious B.I.G.	519
Rolling Stones, The	480
Talking Heads	462
Red Hot Chili Peppers, The	426
Michael Jackson	407
Eminem	384

The top artist played on the jukebox in Reed College’s pool hall during the graveyard shift of the academic year, between late 2003 and early 2009, is OutKast. The hip-hop duo beats its closest contender, The Beatles, by a considerable margin of 149 plays. The Beatles, in turn, were played 132 times more than Radiohead, who came in third place, ahead of Led Zeppelin by a mere 10 plays. Eminem, who took tenth place, can perhaps learn a lesson or two from OutKast, who were played 463 times more than the contemporary rapper.
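The two definitions in the prompt translate into a simple filter on each play's hour and month. The Python sketch below uses a handful of hypothetical plays; treating the shift as hours 0 through 7 (so that 8:00 itself is excluded) is an assumption about the boundary.

```python
from collections import Counter
from datetime import datetime

# September through May, inclusive.
ACADEMIC_MONTHS = {9, 10, 11, 12, 1, 2, 3, 4, 5}

def is_graveyard_academic(timestamp):
    """True for plays from midnight up to (not including) 8am in Sep-May."""
    return timestamp.hour < 8 and timestamp.month in ACADEMIC_MONTHS

# Hypothetical (timestamp, artist) records standing in for the jukebox data.
plays = [
    (datetime(2004, 10, 2, 1, 30), "OutKast"),
    (datetime(2004, 10, 2, 2, 10), "OutKast"),
    (datetime(2004, 10, 2, 3, 0), "Radiohead"),
    (datetime(2004, 7, 15, 1, 0), "Eminem"),      # summer: excluded
    (datetime(2004, 11, 5, 14, 0), "Radiohead"),  # afternoon: excluded
]

top_artists = Counter(
    artist for timestamp, artist in plays if is_graveyard_academic(timestamp)
).most_common(10)
print(top_artists)
```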

Question 4:

We want to compare the volatility of

bitcoin prices
gold prices

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

A visual comparison between the prices of bitcoin and gold relative to the US dollar shows that, in the long term, gold is characterized by mild fluctuations. More specifically, the amplitude of the volatility of gold is around 417 USD. Bitcoin, on the other hand, seems to possess a maximum displacement of 600 USD, making it more volatile than gold overall. This large difference of almost 200 USD is driven in part by a drastic spike in the value of Bitcoin in the week of 29-11-2013, when its media presence was strong, as evidenced by references in articles in CNN Money, the Wall Street Journal, and Bloomberg. From mid-2014 onwards, it seems that the fluctuations of Bitcoin have become milder and continue to look this way up to the present.
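The prompt defines volatility as the relative day-to-day change, which, unlike a dollar amplitude, does not depend on the price level of each asset. A minimal Python sketch of that measure, using two hypothetical price series (not the actual bitcoin or gold data) and summarizing each by the standard deviation of its relative changes, which is one common choice among several:

```python
import statistics

def relative_changes(prices):
    """Day-to-day relative change: (today - yesterday) / yesterday."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]

# Hypothetical daily prices in USD, for illustration only.
steady_series = [1200.0, 1206.0, 1200.0, 1212.0]
jumpy_series = [500.0, 600.0, 480.0, 600.0]

steady_volatility = statistics.stdev(relative_changes(steady_series))
jumpy_volatility = statistics.stdev(relative_changes(jumpy_series))
print(f"steady: {steady_volatility:.4f}, jumpy: {jumpy_volatility:.4f}")
```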

Question 5:

Using the data loaded from Quandl below, plot a time series using geom_line() comparing cheese and milk production in the US from 1930 to today. Comment on this.

Cheese page
Milk page