Administrative:

Please indicate

Data

Question 1:

For this question we will be picking up from where we left off in HW-2, specifically the OkCupid dataset.

a)

Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.

## Warning: Removed 3 rows containing missing values (geom_point).

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

                     Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)             42.86        0.35   123.27         0
orientationbisexual      1.38        0.06    23.13         0
orientationgay          -0.68        0.04   -16.14         0
height                  -0.64        0.01  -124.07         0

Exponentiated coefficients:

orientationbisexual  orientationgay  height
              3.985          0.5076  0.5281

This model describes the relationship between gender, height, and sexual orientation. Each coefficient is the change in the log-odds of being female associated with that predictor, holding the other predictor fixed. Orientation is categorical, with straight as the reference level (the level that does not appear in the output), so the bisexual and gay coefficients compare those users to straight users. The bisexual coefficient is positive: being bisexual rather than straight raises the log-odds of being female by 1.382. The gay coefficient is negative: being gay rather than straight lowers the log-odds of being female by 0.678. Height also has a negative relationship with being female: each one-unit increase in height lowers the log-odds of being female by 0.638. Exponentiating the coefficients gives the multiplicative change in the odds of being female for each predictor. For example, if the user is bisexual, the odds of the user being female are multiplied by 3.98 relative to a straight user. If the user is gay, the odds of the user being female are multiplied by 0.507, i.e. roughly halved. Equivalently, the odds that the user is male are multiplied by 1/0.507, which is equal to 1.97. A one-unit increase in height multiplies the odds of the user being female by 0.5281, which is a decrease. In other words, the odds of the user being male are multiplied by 1.89 for every one-unit increase in height.
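The step from coefficients to odds ratios can be checked with a few lines of R. This is only a sketch: it re-enters the unrounded coefficients quoted in the paragraph above by hand (the vector name `coefs` is illustrative) and does not refit the model.

```r
# Coefficients quoted in the interpretation above (log-odds scale),
# typed in by hand rather than extracted from the fitted model object.
coefs <- c(orientationbisexual = 1.382,
           orientationgay      = -0.678,
           height              = -0.638)

# Exponentiating gives the multiplicative change in the odds of being female.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Reciprocals give the corresponding change in the odds of being male.
print(round(1 / odds_ratios, 2))
```

Values below 1 mean the odds of being female shrink, which is why the gay and height effects are easier to read as reciprocals on the male side.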

b)

Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

Above is a histogram showing the distribution of the fitted probabilities \(\widehat{p}_i\) for OkCupid users. Each bin on the x-axis covers a range of fitted probabilities, and the most frequent range is \(\widehat{p}_i\) between 0.00 and 0.02.
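For readers who want to reproduce this kind of plot, here is a minimal base R sketch. It uses simulated data as a stand-in for the OkCupid profiles, so the variable names (`sim`, `is_female`), the simulated heights, and the bin width of 0.02 are all assumptions rather than the code behind the figure.

```r
# Simulated stand-in data; NOT the OkCupid profiles.
set.seed(1)
n <- 1000
is_female <- rbinom(n, size = 1, prob = 0.4)
# Hypothetical heights: shorter on average when is_female == 1.
height <- rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3)
sim <- data.frame(is_female, height)

# Logistic regression of the binary outcome on height.
model <- glm(is_female ~ height, data = sim, family = binomial)

# fitted() returns the estimated probabilities on the response scale.
p_hat <- fitted(model)

# Histogram with bins of width 0.02 covering [0, 1].
hist(p_hat, breaks = seq(0, 1, by = 0.02),
     xlab = "Fitted probability of being female",
     main = "Distribution of fitted probabilities")
```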

c)

Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\)

  • If \(\widehat{p}_i > p^*\), set predicted_sex = 1 i.e. they are female
  • If \(\widehat{p}_i < p^*\), set predicted_sex = 0 i.e. they are male

Display a 2 x 2 contingency table of sex and predicted_sex i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.

Counts:

                    Observed Male  Observed Female
Predicted Male              30521             4526
Predicted Female             5308            19591

Column percentages (within each observed sex):

                    Observed Male  Observed Female
% Predicted Male            85.19            18.77
% Predicted Female          14.81            81.23

This model worked well. As you can see from the 2 x 2 table, it correctly classified 85.19% of males and 81.23% of females, or 50,112 of the 59,946 users (83.6%) overall. The error rates were 14.81% for males (predicted to be female) and 18.77% for females (predicted to be male).
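The percentages can be recomputed directly from the counts. The sketch below types the four counts from the table into a matrix (the object names are illustrative) and uses `prop.table()` to get column percentages.

```r
# Counts copied from the 2 x 2 table above; matrix() fills column by column.
counts <- matrix(c(30521, 5308, 4526, 19591), nrow = 2,
                 dimnames = list(predicted = c("Male", "Female"),
                                 observed  = c("Male", "Female")))

# The cells should add up to the number of users.
total <- sum(counts)

# Percentage of each observed sex that received each prediction.
col_pct <- 100 * prop.table(counts, margin = 2)
print(round(col_pct, 2))

# Overall share of correct predictions (the diagonal of the table).
accuracy <- sum(diag(counts)) / total
print(round(accuracy, 3))
```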

d)

Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?
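At \(p^*=0.5\), the counts from part c) give 5308 / (5308 + 19591), or about 21.3%, of predicted females being male, which suggests a threshold somewhat above 0.5 is needed. One way to search for it is to sweep over candidate thresholds. The sketch below does this on simulated data, not the OkCupid profiles; the helper `wrong_share()` and all variable names are hypothetical, and the threshold it finds says nothing about the real answer.

```r
# Simulated stand-in data; NOT the OkCupid profiles.
set.seed(2)
n <- 2000
is_female <- rbinom(n, size = 1, prob = 0.4)
height <- rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3)
p_hat <- fitted(glm(is_female ~ height, family = binomial))

# Share of users predicted female at threshold p_star who are actually male.
wrong_share <- function(p_star) {
  predicted_female <- p_hat > p_star
  mean(is_female[predicted_female] == 0)
}

# Evaluate a grid of candidate thresholds.
thresholds <- seq(0.05, 0.95, by = 0.01)
shares <- sapply(thresholds, wrong_share)

# Smallest threshold whose error share is at most 20%.
# which() drops any NaN produced when nobody is predicted female.
p_star <- min(thresholds[which(shares <= 0.20)])
print(p_star)
```

Choosing the smallest qualifying threshold keeps as many predicted females as possible while still meeting the 20% target.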

Question 2:

Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.

What seasonal (i.e. cyclical) patterns do you observe?

## Warning: Removed 6 rows containing missing values (position_stack).

You can see from the plot that there are two distinct times of the year when there is a drop in songs played. The broader valley is summer and the very narrow but deep dip is winter break. The spring also shows a downward trend: the closer it gets to the end of the semester (and to summer), the fewer songs are played.
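The weekly aggregation behind a plot like this can be sketched in base R. The timestamps below are made up (random plays over an arbitrary year), so the names `plays` and `songs_per_week` and the dates are assumptions, not the jukebox data or the code used for the figure.

```r
# Hypothetical play timestamps: 5000 plays spread at random over one year.
set.seed(3)
start <- as.POSIXct("2004-01-01", tz = "UTC")
plays <- start + runif(5000, min = 0, max = 365 * 24 * 3600)

# Assign each play to the week it falls in (weeks start on Monday by default),
# then count the number of plays per week.
week <- cut(as.Date(plays), breaks = "week")
songs_per_week <- table(week)

# Time series of weekly counts; each level name is the week's start date.
plot(as.Date(names(songs_per_week)), as.vector(songs_per_week), type = "l",
     xlab = "Week", ylab = "Songs played")
```

With real data, seasonal dips would show up as valleys in this line; the simulated plays are uniform, so the sketch only illustrates the mechanics.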

Question 3:

Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define

## Selecting by n
## Selecting by n

From the bar plot above, one can see that OutKast is the most played artist during the graveyard shift in the academic year, and Talking Heads is the least played of the top 10. It is interesting that the majority of these artists are rock and roll acts. Reed College must not like ‘mainstream’ music played during the wee hours of the night. Comparing the graveyard shift’s most popular artists to those for all hours of the day, you can see that there isn’t much variation. Perhaps this means the music is actually on some sort of loop. The only difference in most popular

Question 4:

We want to compare the volatility of

Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.

## Warning: Removed 4 rows containing missing values (geom_path).

From plot #1 above, you can see that gold is consistently worth more than bitcoin. It is also visually apparent that gold is more stable than bitcoin. Bitcoin has had a change of $1000 in the last 5 years, while gold has stayed between $1000 and $2000. Plot #2 shows the volatility of the two assets. The daily % change shows how strikingly different bitcoin and gold are in the variability of their daily price changes. Relative to bitcoin, gold straddles 0% change. The highest daily % change for gold was 4.96%, while the highest for bitcoin was 29.8%. Therefore, you can conclude that bitcoin is more volatile than gold. It is important to mention that gold changes price less often, since the price of gold does not change on weekends, whereas the price of bitcoin does. If I were giving advice to a foreign currency exchanger, I would let them know that bitcoin offers the greatest chance to make a profit, but it is much more volatile than gold. If you want to be safe, stick with gold. It has consistently gone up since 2010, while bitcoin has been extremely variable. A volatile asset, although more dangerous, also has the potential to make an exchanger more money, faster.
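The volatility measure used here, the relative day-to-day change in price, is simple to compute. The sketch below uses a short vector of made-up prices (not real gold or bitcoin quotes); `price` and `pct_change` are illustrative names.

```r
# Hypothetical closing prices for five consecutive days (not real quotes).
price <- c(100, 104, 101, 101, 110)

# Relative change from one day to the next, in percent:
# (price[t] - price[t - 1]) / price[t - 1] * 100
pct_change <- 100 * diff(price) / head(price, -1)
print(round(pct_change, 2))

# Two simple summaries of volatility: the largest absolute daily move
# and the standard deviation of the daily changes.
largest_move <- max(abs(pct_change))
spread <- sd(pct_change)
```

For a series with no weekend quotes, such as gold, consecutive rows can be three calendar days apart, so a "daily" change computed this way may span a weekend.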

Question 5:

Using the data loaded from Quandl below, plot a time series using geom_line() comparing cheese and milk production in the US from 1930 to today. Comment on this.