Please indicate
Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here. I’ve included R code blocks for each question, but use them only if you feel it necessary.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| nox | -0.104 | 0.176 | -0.591 | 0.557 |
Table: Fitting linear model: mort ~ nox
## (Intercept) nox
## 942.711 -0.104
The linear model does not fit these data well. Most observations are bunched at the left side of the x-axis, with nitric oxide values between 0 and 100. The first plot shows that the fitted regression line does not track the data, and the residual plot shows a lot of variation, with the points clustered on one side rather than spread evenly. The estimated slope (-0.104) is also not distinguishable from zero (p = 0.557).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| log_nox | 35.31 | 15.19 | 2.325 | 0.02359 |
Table: Fitting linear model: mort ~ log_nox
## (Intercept) log_nox
## 904.724 35.311
After log-transforming the nitric oxide pollution, the linear regression model fits the data much better. In the residual plot, the points are more randomly dispersed rather than clumped together, so the residuals are spread more evenly than before the transformation. The log-transformed regression has a positive slope (35.311, p = 0.024), whereas the untransformed regression had a small negative slope (-0.104) that was not statistically significant. This suggests that an increase in nitric oxide may be associated with increased mortality. We do not know enough to draw a causal conclusion, however.
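The comparison described above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `sim`, its columns, and the generating parameters are invented here and are not the pollution data; only `lm()`, `resid()`, and base plotting are used.

```r
# Simulated, right-skewed predictor standing in for nox (NOT the homework data)
set.seed(1)
x <- rlnorm(60, meanlog = 2.5, sdlog = 1)        # most values small, a few very large
y <- 900 + 15 * log(x) + rnorm(60, sd = 50)      # response is linear in log(x)
sim <- data.frame(x = x, log_x = log(x), y = y)

fit_raw <- lm(y ~ x, data = sim)       # untransformed predictor
fit_log <- lm(y ~ log_x, data = sim)   # log-transformed predictor

# Side-by-side residual plots: residuals against each version of the predictor
op <- par(mfrow = c(1, 2))
plot(sim$x, resid(fit_raw), xlab = "x", ylab = "Residual", main = "y ~ x")
abline(h = 0, lty = 2)
plot(sim$log_x, resid(fit_log), xlab = "log(x)", ylab = "Residual", main = "y ~ log(x)")
abline(h = 0, lty = 2)
par(op)
```

In the left panel the points should pile up near the left edge, while in the right panel they should be spread more evenly along the x-axis, which is the pattern the paragraph above describes.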
##
## Call:
## lm(formula = mort ~ log_nox, data = no_mr2)
##
## Coefficients:
## (Intercept) log_nox
## 904.72 35.31
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 870.350 | 939.099 |
| log_nox | 4.911 | 65.712 |
The slope coefficient in this model is 35.311. Assuming log_nox was computed with a base-10 logarithm, this is the expected increase in age-adjusted mortality for each tenfold increase in nitric oxide; with a natural logarithm it would instead be the change per factor of e (about 2.72). We are 95% confident that this effect lies between 4.911 and 65.712 additional deaths per 100,000 people.
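How the slope translates into a change in mortality depends on which logarithm was used to build `log_nox`, which the output does not show. The sketch below works through both cases using only the reported slope; the variable names are illustrative.

```r
slope <- 35.311  # coefficient on log_nox reported above

# Case 1 (assumption): log_nox = log10(nox)
effect_10x_log10 <- slope * log10(10)  # tenfold increase in nox
effect_2x_log10  <- slope * log10(2)   # doubling of nox

# Case 2 (assumption): log_nox = log(nox), the natural log
effect_10x_ln <- slope * log(10)       # tenfold increase in nox
effect_2x_ln  <- slope * log(2)        # doubling of nox

round(c(tenfold_log10 = effect_10x_log10,
        double_log10  = effect_2x_log10,
        tenfold_ln    = effect_10x_ln,
        double_ln     = effect_2x_ln), 2)
```

Under the base-10 assumption a tenfold increase corresponds to the slope itself; under the natural-log assumption the same tenfold increase corresponds to the slope multiplied by `log(10)`, roughly 2.3 times larger.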
##
## Call:
## lm(formula = mort ~ log_nox + log_hc + log_so2, data = misc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.793 -34.728 -3.118 34.148 194.567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 924.97 21.45 43.125 < 2e-16 ***
## log_nox 134.32 50.08 2.682 0.00960 **
## log_hc -131.94 44.71 -2.951 0.00462 **
## log_so2 27.08 16.50 1.642 0.10629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.36 on 56 degrees of freedom
## Multiple R-squared: 0.2752, Adjusted R-squared: 0.2363
## F-statistic: 7.086 on 3 and 56 DF, p-value: 0.0004044
## (Intercept) log_nox log_hc log_so2
## 924.96522 134.32452 -131.93846 27.08262
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 881.999 | 967.932 |
| log_nox | 33.997 | 234.652 |
| log_hc | -221.510 | -42.367 |
| log_so2 | -5.968 | 60.133 |
Holding the other two pollutants constant, each tenfold increase in NOX, HC, and SO2 (assuming base-10 logs) is associated with a change in age-adjusted mortality of 134.32, -131.94, and 27.08, respectively. The 95% confidence intervals exclude 0 for nitric oxide and hydrocarbons but include 0 for SO2. Therefore, SO2 is not a statistically significant predictor in this model, while nitric oxide and hydrocarbons may be related to mortality rate: nitric oxide positively and hydrocarbons negatively.
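The confidence intervals in the table can be reconstructed from the estimates and standard errors in the summary output, which makes the "excludes 0" check explicit. This is a sketch that re-enters the printed (rounded) numbers by hand, so the bounds agree with the table only to about two decimal places; with the fitted model object available, `confint()` returns the same intervals directly.

```r
# Estimates and standard errors copied from the summary output above
coefs <- data.frame(
  term      = c("(Intercept)", "log_nox", "log_hc", "log_so2"),
  estimate  = c(924.97, 134.32, -131.94, 27.08),
  std_error = c(21.45, 50.08, 44.71, 16.50)
)

df_resid <- 56                    # residual degrees of freedom from the summary
t_crit   <- qt(0.975, df_resid)   # two-sided 95% critical value

coefs$lower <- coefs$estimate - t_crit * coefs$std_error
coefs$upper <- coefs$estimate + t_crit * coefs$std_error

# An interval excludes zero when both bounds share the same sign
coefs$excludes_zero <- coefs$lower > 0 | coefs$upper < 0

print(coefs)
```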
##
## Call:
## lm(formula = mort ~ log_nox + log_hc + log_so2, data = half1_misc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.391 -23.558 0.663 23.218 96.290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 870.76 27.13 32.096 <2e-16 ***
## log_nox 52.76 68.22 0.773 0.4462
## log_hc -85.41 57.15 -1.494 0.1471
## log_so2 80.16 22.24 3.604 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.91 on 26 degrees of freedom
## Multiple R-squared: 0.571, Adjusted R-squared: 0.5215
## F-statistic: 11.53 on 3 and 26 DF, p-value: 5.411e-05
## 2.5 % 97.5 %
## (Intercept) 814.99146 926.52480
## log_nox -87.45590 192.98008
## log_hc -202.88518 32.07021
## log_so2 34.44458 125.87879
The plot above shows the predicted mortality rate against the observed mortality rate, which indicates how well the linear regression model adapts to new data. The line y = x is overlaid on the plot. The points lie close to this line, which can be seen more clearly in the residual plot: the majority of the points are within 50 units of the y = x line.
This is a test of how the model would perform when dealing with “new” data, and it checks for overfitting. A model can become so specific and customized to one dataset that it underperforms when it encounters new data.
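The split-half check described above can be sketched as follows. The data here are simulated, and the names `sim`, `train`, and `holdout` are hypothetical (the output above only shows that the first half was called `half1_misc`); the point is the workflow of fitting on one half and predicting the other.

```r
# Simulated stand-in for the 60-row pollution data (NOT the real values)
set.seed(42)
n <- 60
sim <- data.frame(log_nox = rnorm(n), log_hc = rnorm(n), log_so2 = rnorm(n))
sim$mort <- 900 + 30 * sim$log_nox - 20 * sim$log_hc + 40 * sim$log_so2 +
  rnorm(n, sd = 40)

train   <- sim[1:30, ]    # first half: used to fit the model
holdout <- sim[31:60, ]   # second half: treated as "new" data

fit  <- lm(mort ~ log_nox + log_hc + log_so2, data = train)
pred <- predict(fit, newdata = holdout)

# Predicted vs. observed for the held-out half, with the y = x reference line
plot(pred, holdout$mort, xlab = "Predicted mortality", ylab = "Observed mortality")
abline(0, 1)

# Root mean squared error in-sample and out-of-sample
rmse_train   <- sqrt(mean(resid(fit)^2))
rmse_holdout <- sqrt(mean((holdout$mort - pred)^2))
c(train = rmse_train, holdout = rmse_holdout)
```

A held-out error much larger than the training error would be a sign of overfitting.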
Perform an Exploratory Data Analysis (EDA) of the OkCupid data, keeping in mind that in HW-3 you will be fitting a logistic regression to predict gender. What do I mean by EDA?
For the R Markdown to work, you must first copy the file profiles.csv from Lec09 to the project directory HW-2.
| likes_cats | prop_female |
|---|---|
| 0 | 0.379 |
| 1 | 0.433 |
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| likes_cats | 0.225 | 0.01681 | 13.39 | 7.314e-41 |
| (Intercept) | -0.4936 | 0.01114 | -44.3 | 0 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 80800 on 59945 degrees of freedom |
| Residual deviance: | 80621 on 59944 degrees of freedom |
## (Intercept) likes_cats
## -0.4935637 0.2249725
## (Intercept)
## 0.379
## (Intercept)
## 0.433
This shows a very weak relationship between liking cats and gender. The horizontal lines show the group averages: among people who do not like cats, the proportion female is 0.379, which means there are more men than women. Among people who do like cats, the proportion female is 0.433, so there are still more men than women. The proportion is higher for people who like cats, meaning that liking cats is more associated with being female than not liking cats is. All things considered, though, the relationship is weak. It is important to note that some users did not list a cat-dog preference, so there were some users not included.
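The two fitted proportions come from applying the inverse logit to the coefficients printed above. A minimal sketch using base R's `plogis()`; the variable names are illustrative.

```r
b0 <- -0.4935637   # intercept: log-odds of being female when likes_cats = 0
b1 <-  0.2249725   # change in log-odds when likes_cats = 1

p_no_cats <- plogis(b0)        # inverse logit of the intercept
p_cats    <- plogis(b0 + b1)   # inverse logit of intercept + slope

round(c(no_cats = p_no_cats, cats = p_cats), 3)
```

These values should agree with the `prop_female` column of the table, since a logistic regression with a single binary predictor reproduces the two group proportions.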
| is_veg | prop_female |
|---|---|
| FALSE | 0.3885917 |
| TRUE | 0.5535499 |
##
## Call:
## glm(formula = is.female ~ is_veg, family = binomial, data = gender)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.270 -0.992 -0.992 1.375 1.375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.453236 0.008751 -51.79 <2e-16 ***
## is_vegTRUE 0.668260 0.029802 22.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 80800 on 59945 degrees of freedom
## Residual deviance: 80294 on 59944 degrees of freedom
## AIC: 80298
##
## Number of Fisher Scoring iterations: 4
## (Intercept) is_vegTRUE
## -0.4532358 0.6682602
## (Intercept)
## 0.389
## (Intercept)
## 0.554
The data above show that women tend to be vegetarian more often than men; the slope of the regression is 0.6682 on the log-odds scale. In the plot above, the proportion female is 0.3886 for non-vegetarians and 0.5535 for vegetarians, a difference of about 16 percentage points, so the gap between the two groups is modest. Diet on its own is not the greatest predictor of gender. It is important to note that some users did not list a diet, so there were some users not included.
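Because the slope is on the log-odds scale, exponentiating it gives an odds ratio, which can be cross-checked against the two proportions in the table. This is a sketch using only the numbers printed above; the variable names are illustrative.

```r
b1 <- 0.6682602                  # coefficient on is_vegTRUE
odds_ratio_from_coef <- exp(b1)  # odds of being female, vegetarians vs. non-vegetarians

# Same quantity computed directly from the group proportions
p_nonveg <- 0.3885917
p_veg    <- 0.5535499
odds <- function(p) p / (1 - p)
odds_ratio_from_props <- odds(p_veg) / odds(p_nonveg)

round(c(from_coef = odds_ratio_from_coef, from_props = odds_ratio_from_props), 3)
```

Both routes should give an odds ratio a little under 2: the odds that a vegetarian user is female are roughly twice the odds for a non-vegetarian user.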
| orientation | prop_female |
|---|---|
| bisexual | 0.721 |
| straight | 0.398 |
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| orientationstraight | -1.37 | 0.0433 | -31.5 | 8.02e-218 |
| (Intercept) | 0.951 | 0.0424 | 22.4 | 1.89e-111 |
## (Intercept) orientationstraight
## 0.9512121 -1.3655180
## (Intercept)
## 0.721
## (Intercept)
## 0.398
From the plot above, you can see that women are more likely than men to be bisexual on OkCupid. If you were to choose a bisexual person at random, there is a 0.721 chance that they would be female. If you chose a straight person at random, there is a 0.398 chance that they would be female. It is important to note that some users did not list sexual orientation, so there were some users not included.
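The link between the table of proportions and the fitted coefficients can be checked on simulated data. Everything in this sketch is an assumption made for illustration (the data frame `sim`, the group sizes, and the probabilities 0.7 and 0.4); it is not the OkCupid data.

```r
# Simulated profiles (NOT the OkCupid data)
set.seed(7)
n <- 2000
orientation <- factor(sample(c("bisexual", "straight"), n,
                             replace = TRUE, prob = c(0.1, 0.9)))
p_true    <- ifelse(orientation == "bisexual", 0.7, 0.4)
is_female <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(orientation = orientation, is_female = is_female)

# Proportion female within each group
prop_by_group <- tapply(sim$is_female, sim$orientation, mean)

# Logistic regression with the same single categorical predictor;
# "bisexual" is the reference level because factor levels sort alphabetically
fit <- glm(is_female ~ orientation, family = binomial, data = sim)
b   <- coef(fit)

p_from_model <- c(
  bisexual = plogis(b[["(Intercept)"]]),
  straight = plogis(b[["(Intercept)"]] + b[["orientationstraight"]])
)

rbind(prop_by_group, p_from_model)
```

The two rows should match, which is why the inverse-logit values printed above equal the entries in the `prop_female` table.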