Please indicate
Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here. I’ve included R code blocks for each question, but use them only if you feel it necessary.
The fitted line is strongly influenced by the wide vertical spread of y values at low values of x and by the few data points at large values of x.
## (Intercept) log_nox
## 904.7245 15.3355
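Because the predictor is log_nox rather than NOx, the slope applies per unit of log_nox: a one-unit increase in log_nox (NOx multiplied by about 2.72, assuming a natural log) is associated with a 15 point increase in mortality.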
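To make the log-scale slope easier to read, the sketch below converts it into the expected change in mortality for a multiplicative change in NOx. It assumes log_nox was created with R's natural-log `log()`. The helper name `effect_of_multiplying` is only illustrative, and the slope is copied from the output above.

```r
# Slope on log_nox, copied from the regression output above.
b_log_nox <- 15.3355

# Expected change in mortality when NOx is multiplied by `mult`,
# assuming log_nox is the natural log of NOx (an assumption).
effect_of_multiplying <- function(mult, slope = b_log_nox) {
  slope * log(mult)
}

effect_of_multiplying(exp(1))  # one-unit increase in log_nox
effect_of_multiplying(2)       # doubling NOx
effect_of_multiplying(1.01)    # 1% increase in NOx
```

Doubling NOx is often a more natural comparison than multiplying it by e, which is why the helper takes the multiplier as its argument.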
##
## Call:
## lm(formula = mort ~ log_nox + log_so2 + log_hc, data = pollution2)
##
## Coefficients:
## (Intercept) log_nox log_so2 log_hc
## 924.97 58.34 11.76 -57.30
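Because the predictors are on the log scale, each coefficient is the expected difference in mortality per one-unit increase in the logged pollutant, holding the other two constant. A one-unit increase in log_so2 corresponds to an increase of 11.76 in mortality. Similarly, a one-unit increase in log_nox is associated with an increase of 58.34 in mortality. Surprisingly, hydrocarbons are negatively associated with mortality, such that a one-unit increase in log_hc is associated with a decrease in mortality of 57.30.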
##
## Call:
## lm(formula = mort ~ log_nox + log_so2 + log_hc, data = pollutiontrain)
##
## Coefficients:
## (Intercept) log_nox log_so2 log_hc
## 917.210 63.908 2.965 -49.024
##
## Call:
## lm(formula = mort ~ predictions, data = pollutiontest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.859 -32.570 8.585 38.104 92.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.8653 486.9831 -0.067 0.9470
## predictions 1.0193 0.5214 1.955 0.0683 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.39 on 16 degrees of freedom
## Multiple R-squared: 0.1928, Adjusted R-squared: 0.1423
## F-statistic: 3.821 on 1 and 16 DF, p-value: 0.06831
##
## Call:
## lm(formula = mort ~ predictions, data = pollutiontest)
##
## Coefficients:
## (Intercept) predictions
## -32.865 1.019
## 2.5 % 97.5 %
## (Intercept) -1.065223e+03 999.492814
## predictions -8.607896e-02 2.124728
If the model worked perfectly, we would expect to see the line y = x when plotting predicted vs observed rates, i.e. an intercept of 0 and a slope of 1. In the run shown above the slope is 1.02, but its 95% interval (-0.09 to 2.12) is wide. How well the model fits depends on the fraction of the data used for training. I was finding very variable slopes when I used only half the dataset to train the model, so I switched to 0.7. The model still changes with each run because it is fit to a random sample of the data.
The reason to cross-validate is to make sure the model isn’t overfit to the data. With a small data set, it is possible to create a model that predicts every data point very closely. However, it is hard to know whether such a model generalizes. One way to test this is to fit the model on only a portion of the data and then evaluate it on the remaining portion.
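The hold-out procedure described above can be sketched as follows. This is a minimal illustration on simulated data, not the original analysis: the data frame `sim`, its coefficients, and the seed are all made up, and only the model formula and the 0.7 training fraction come from the text.

```r
set.seed(1)  # fix the random split so the run is reproducible

# Simulated stand-in for the pollution data (hypothetical values).
n <- 60
sim <- data.frame(log_nox = rnorm(n), log_so2 = rnorm(n), log_hc = rnorm(n))
sim$mort <- 900 + 50 * sim$log_nox + 10 * sim$log_so2 - 40 * sim$log_hc +
  rnorm(n, sd = 50)

# Use 70% of the rows for training and hold out the rest for testing.
train_rows <- sample(n, size = round(0.7 * n))
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- lm(mort ~ log_nox + log_so2 + log_hc, data = train)
test$predictions <- predict(fit, newdata = test)

# Regress observed on predicted: a well-calibrated model gives
# an intercept near 0 and a slope near 1.
check <- lm(mort ~ predictions, data = test)
coef(check)
confint(check)

# Out-of-sample root mean squared error.
rmse <- sqrt(mean((test$mort - test$predictions)^2))
rmse
```

Calling `set.seed()` before `sample()` removes the run-to-run variation. Repeating the split over several seeds would show how much the slope depends on which rows land in the test set.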
Perform an Exploratory Data Analysis (EDA) of the OkCupid data, keeping in mind that in HW-3 you will be fitting a logistic regression to predict gender. What do I mean by EDA?
For the R Markdown to work, you must first copy the file profiles.csv from Lec09 to the project directory HW-2.
## Warning: Removed 116 rows containing missing values (geom_point).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 569 rows containing non-finite values (stat_boxplot).
Note that height and income differ by gender, while age and sexual orientation are pretty similar.
Note that drug and alcohol use do not appear to differ by gender, though body type descriptors do.
| Sex | Proportion of Respondents Whose Essay Contains ‘Read’ |
|---|---|
| f | 0.526 |
| m | 0.462 |
| Sex | Proportion of Respondents Whose Essay Contains ‘Cook’ |
|---|---|
| f | 0.355 |
| m | 0.284 |
| Sex | Proportion of Respondents Whose Essay Contains ‘Cars’, ‘Truck’, or ‘Motorcycle’ |
|---|---|
| f | 0.067 |
| m | 0.111 |
| Sex | Proportion of Respondents Whose Essay Contains ‘Child’ or ‘Kid’ |
|---|---|
| f | 0.300 |
| m | 0.239 |
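Proportions like those in the tables above can be computed with a pattern match on the essay text followed by a group-wise mean. The sketch below uses a tiny made-up data frame in place of profiles.csv. The column name `essay0`, the regular expressions, and the choice to count missing essays as non-matches are assumptions, not necessarily what produced the tables.

```r
# Toy stand-in for profiles.csv (hypothetical rows, not real data).
profiles <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  essay0 = c("I love to read and cook", "my kids come first",
             "I fix cars and trucks", "Reading is fun", NA),
  stringsAsFactors = FALSE
)

# Proportion of respondents, by sex, whose essay matches `pattern`.
# grepl() returns FALSE for missing essays, so they count as non-matches.
prop_by_sex <- function(pattern, data = profiles) {
  matched <- grepl(pattern, data$essay0, ignore.case = TRUE)
  tapply(matched, data$sex, mean)
}

prop_by_sex("read")
prop_by_sex("cook")
prop_by_sex("car|truck|motorcycle")
prop_by_sex("child|kid")
```

Patterns this loose also match inside longer words (for example, "read" matches "already"), so the proportions are best read as rough signals for choosing predictors.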