Please indicate
profiles.csv from HW-2 to the data folder in the HW-3 directory (HW-3 folder).
For this question we will pick up where we left off in HW-2, specifically the OkCupid dataset.
Using your exploratory data analysis from HW-2, fit a logistic regression to predict an individual’s gender and interpret your results.
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 42.86 | 0.35 | 123.27 | 0 |
| orientationbisexual | 1.38 | 0.06 | 23.13 | 0 |
| orientationgay | -0.68 | 0.04 | -16.14 | 0 |
| height | -0.64 | 0.01 | -124.07 | 0 |
| orientationbisexual | orientationgay | height |
|---|---|---|
| 3.985 | 0.5076 | 0.5281 |
This model relates gender to height and sexual orientation. The coefficients are on the log-odds scale: each one is the change in the log-odds of being female associated with that predictor, holding the other predictor fixed. Orientation is categorical, so its coefficients compare bisexual and gay users with the reference category, straight, which does not appear in the table.
From the regression output, being bisexual has a positive association with being female: the log-odds of being female are 1.382 higher for a bisexual user than for a straight user of the same height. The opposite holds for gay users, whose log-odds of being female are 0.678 lower than those of a straight user of the same height. Height also has a negative relationship with being female: each one unit increase in height lowers the log-odds of being female by 0.638.
The exponentials of these coefficients are odds ratios, which give the multiplicative change in the odds of being female. If the user is bisexual, the odds of being female are multiplied by 3.98, roughly a fourfold increase. If the user is gay, the odds of being female are multiplied by 0.507, so the odds of being male are multiplied by 1/0.507, about 1.97. A one unit increase in height multiplies the odds of being female by 0.5281, a decrease of about 47%; equivalently, the odds of being male are multiplied by about 1.89 for every one unit increase in height.
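The odds ratios discussed above can be reproduced by exponentiating the coefficients. The following is a minimal sketch, assuming the rounded estimates printed in the regression table, so the results differ slightly from the exponentiated values reported from the unrounded fit. The function names and the example heights are hypothetical and only illustrate how the log-odds translate into a fitted probability.

```python
import math

# Coefficients as rounded in the regression table (log-odds of being female).
coef = {
    "(Intercept)": 42.86,
    "orientationbisexual": 1.38,
    "orientationgay": -0.68,
    "height": -0.64,
}

def odds_ratio(beta):
    """Multiplicative change in the odds for a one unit change in the predictor."""
    return math.exp(beta)

def fitted_probability(height, orientation="straight"):
    """Fitted probability of being female; 'straight' is the reference level."""
    eta = coef["(Intercept)"] + coef["height"] * height
    if orientation != "straight":
        eta += coef["orientation" + orientation]
    return 1.0 / (1.0 + math.exp(-eta))

odds_ratios = {
    name: odds_ratio(beta) for name, beta in coef.items() if name != "(Intercept)"
}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}, inverse {1 / ratio:.3f}")

# Hypothetical heights, in the same units as the height column.
for h in (62, 67, 72):
    print(h, round(fitted_probability(h), 3))
```

The inverse of each odds ratio is the corresponding multiplicative change in the odds of being male, which is how the 1.97 and 1.89 figures arise.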
Plot a histogram of the fitted probabilities \(\widehat{p}_i\) for all users \(i=1, \ldots, n=59946\) in your dataset.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Above is a histogram of the fitted probabilities \(\widehat{p}_i\) for OkCupid users. Each bin on the x-axis covers a range of fitted probabilities, and the most frequent range is \(\widehat{p}_i\) between 0.00 and 0.02.
Use a decision threshold of \(p^*=0.5\) to make an explicit prediction for each user \(i\)’s sex and save this in a variable predicted_sex. In other words, for user \(i\):

- predicted_sex = 1, i.e. they are female, if \(\widehat{p}_i\) is above \(p^*\)
- predicted_sex = 0, i.e. they are male, otherwise

Display a 2 x 2 contingency table of sex and predicted_sex, i.e. compare the predicted sex to the actual sex of all users. The sum of all the elements in your table should be \(n=59946\). Comment on how well our predictions fared.
| | Observed Male | Observed Female |
|---|---|---|
| Predicted Male | 30521 | 4526 |
| Predicted Female | 5308 | 19591 |
| | Observed Male | Observed Female |
|---|---|---|
| % Predicted Male | 85.19 | 18.77 |
| % Predicted Female | 14.81 | 81.23 |
The model performed well. As the 2x2 tables show, it correctly classified 85.19% of observed males and 81.23% of observed females, for an overall accuracy of 83.6% (50112 of 59946 users). The error rate was 14.81% among observed males and 18.77% among observed females.
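The percentages quoted above can be recomputed directly from the counts in the contingency table. The sketch below uses only those four counts; the helper names are hypothetical. It also reports the share of wrong predictions within each predicted group, a different denominator from the column percentages in the second table.

```python
# Counts copied from the 2 x 2 contingency table; keys are (predicted, observed).
counts = {
    ("male", "male"): 30521,
    ("male", "female"): 4526,
    ("female", "male"): 5308,
    ("female", "female"): 19591,
}

total = sum(counts.values())
accuracy = (counts[("male", "male")] + counts[("female", "female")]) / total

def share_correct_given_observed(sex):
    """Of the users whose observed sex is `sex`, the share predicted correctly."""
    observed_total = sum(n for (_, obs), n in counts.items() if obs == sex)
    return counts[(sex, sex)] / observed_total

def share_wrong_given_predicted(sex):
    """Of the users predicted to be `sex`, the share whose observed sex differs."""
    predicted_total = sum(n for (pred, _), n in counts.items() if pred == sex)
    return (predicted_total - counts[(sex, sex)]) / predicted_total

print(f"n = {total}, accuracy = {accuracy:.1%}")
for sex in ("male", "female"):
    print(
        f"{sex}: correct among observed = {share_correct_given_observed(sex):.2%}, "
        f"wrong among predicted = {share_wrong_given_predicted(sex):.2%}"
    )
```

At the 0.5 threshold, about 21.3% of the users predicted to be female are observed males (5308 of 24899).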
Say we wanted to have a false positive rate of about 20%, i.e. of the people we predicted to be female, we want to be wrong no more than 20% of the time. What decision threshold \(p^*\) should we use?
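One way to approach this question is to scan candidate thresholds and, for each, compute the share of users predicted to be female who are actually male. The sketch below illustrates that scan under stated assumptions: the probabilities and labels are made-up toy values, not the OkCupid fitted values, and the function names are hypothetical.

```python
def wrong_share_among_predicted_female(probs, is_female, threshold):
    """Share of users predicted female (p > threshold) who are actually male."""
    predicted_female = [f for p, f in zip(probs, is_female) if p > threshold]
    if not predicted_female:
        return None  # nobody is predicted female at this threshold
    return sum(1 for f in predicted_female if not f) / len(predicted_female)

def smallest_threshold_meeting_target(probs, is_female, target=0.20, steps=100):
    """Scan thresholds 0, 1/steps, 2/steps, ... and return the first one
    whose error share among predicted females is at or below the target."""
    for k in range(steps):
        threshold = k / steps
        share = wrong_share_among_predicted_female(probs, is_female, threshold)
        if share is not None and share <= target:
            return threshold, share
    return None

# Hypothetical toy data, NOT the fitted probabilities from the model.
toy_probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
toy_is_female = [False, False, False, True, False, True, False, True, True, True]

result = smallest_threshold_meeting_target(toy_probs, toy_is_female)
print(result)
```

The error share need not fall steadily as the threshold rises, so returning the first threshold that meets the target is a design choice of this sketch; plotting the share against the threshold would show whether it holds for nearby values as well.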
Using the jukebox data, plot a time series of the number of songs played each week over the entire time period. i.e.
What seasonal (i.e. cyclical) patterns do you observe?
## Warning: Removed 6 rows containing missing values (position_stack).
The plot shows two distinct times of the year when the number of songs played drops. The broader valley is summer, and the very narrow but deep dip is winter break. Spring also shows a downward trend: the closer to the end of the semester (and to summer), the fewer songs are played.
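The weekly counts behind a plot like this come from grouping each play by the week it falls in. Here is a minimal sketch of that aggregation, assuming each play has a timestamp; the toy timestamps and the function name are hypothetical and are not taken from the jukebox data.

```python
from collections import Counter
from datetime import datetime

def plays_per_week(timestamps):
    """Count plays per ISO (year, week), returned in chronological order."""
    weekly = Counter()
    for ts in timestamps:
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        weekly[(iso[0], iso[1])] += 1
    return sorted(weekly.items())

# Hypothetical play times: two in one week and one in the following week.
toy_plays = [
    datetime(2004, 3, 1, 14, 5),
    datetime(2004, 3, 3, 23, 40),
    datetime(2004, 3, 8, 9, 15),
]

weekly_counts = plays_per_week(toy_plays)
for (year, week), n in weekly_counts:
    print(year, week, n)
```

Grouping by ISO year and week keeps weeks that straddle a calendar year boundary together, which matters when looking for a dip around winter break.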
Using the jukebox data, what are the top 10 artists played during the “graveyard shift” during the academic year? Define
## Selecting by n
## Selecting by n
From the bar plot above, one can see that OutKast is the most played artist during the graveyard shift in the academic year, and Talking Heads is the least played of the top 10. It is interesting that the majority of these music groups are Rock and Roll. Reed College must not like ‘mainstream’ music played during the wee hours of the night. Comparing the graveyard shift’s most popular tracks to all hours of the day, you can see that there isn’t much variation. Perhaps this means the music is actually on some sort of loop. The only difference in most popular
We want to compare the volatility of
Let our measure of volatility be the relative change from day-to-day in price. Let the reference currency be US dollars. Analyze these results and provide insight to a foreign currency exchanger.
## Warning: Removed 4 rows containing missing values (geom_path).
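The volatility measure described above, the relative change in price from one day to the next, can be written as \((p_t - p_{t-1}) / p_{t-1}\). The sketch below computes it for a single price series, assuming prices are already expressed in US dollars. The toy prices, the function names, and the two summary statistics are illustrative assumptions, not the analysis behind the plot.

```python
import statistics

def relative_changes(prices):
    """Day-to-day relative change (p_t - p_{t-1}) / p_{t-1} for each day after the first."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]

def volatility_summary(prices):
    """Two simple summaries of volatility; needs at least three prices."""
    changes = relative_changes(prices)
    return {
        "mean_abs_change": statistics.mean(abs(c) for c in changes),
        "sd_change": statistics.stdev(changes),
    }

# Hypothetical daily prices in US dollars.
toy_usd_prices = [100.0, 110.0, 99.0, 99.0]

changes = relative_changes(toy_usd_prices)
summary = volatility_summary(toy_usd_prices)
print(changes)
print(summary)
```

Because the first day has no previous price, a series of n prices yields n - 1 relative changes. Comparing the summaries across series is one way to rank them by volatility.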