MATH 216 Homework 2

Question 1:

Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here. I’ve included R code blocks for each question, but use them only if you feel it necessary.

a)

Create a scatterplot of mortality rate vs. level of nitric oxides. Do you think linear regression will fit these data well? Fit the regression and evaluate a residual plot from the regression.

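A linear regression on the raw scale is unlikely to fit these data well. The fit is dominated by two features: the wide vertical spread of mortality values at low levels of NOx, and the few high-leverage data points at very large values of NOx.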
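One way to put a number on how much the few large-x points pull on the line is to look at leverage. The sketch below uses simulated data as a stand-in for the pollution data (the values, the names `sim`, `fit_raw`, and `diag_df`, and the shape of the simulation are all assumptions for illustration), fits the raw-scale regression, and collects the fitted values and residuals that a residual plot would show.

```r
# Simulated stand-in for the pollution data (NOT the real values):
# most x values are small, a few are very large, as in the scatterplot.
set.seed(1)
nox  <- c(rexp(55, rate = 1 / 10), 150, 200, 320)   # right-skewed predictor
mort <- 900 + 15 * log(nox) + rnorm(length(nox), sd = 50)
sim  <- data.frame(nox = nox, mort = mort)

fit_raw <- lm(mort ~ nox, data = sim)

# Leverage measures how strongly each point pulls on the fitted line.
lev <- hatvalues(fit_raw)

# Share of total leverage held by the three largest-x points.
top3 <- order(sim$nox, decreasing = TRUE)[1:3]
sum(lev[top3]) / sum(lev)

# Fitted values and residuals: the ingredients of the residual plot.
diag_df <- data.frame(fitted = fitted(fit_raw), resid = resid(fit_raw))
head(diag_df)
```

Calling `plot(diag_df$fitted, diag_df$resid)` followed by `abline(h = 0)` would draw the residual plot itself.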

b)

Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate a new residual plot.

c)

Interpret the slope coefficient from the model you chose in b.

## (Intercept)     log_nox 
##    904.7245     15.3355

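For every increase of 1 in log(NOx), which corresponds to multiplying NOx by about 2.72 if the log is natural, there is an increase of about 15 points in predicted mortality.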
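Because the predictor is log_nox, the slope is easiest to read in terms of multiplicative changes in NOx. The arithmetic can be sketched as follows, assuming log_nox was created with R's natural-log function `log()`; the helper name `effect_of_multiplying` is made up for this illustration.

```r
# Slope on log_nox, copied from the printed output above.
b_log_nox <- 15.3355

# Assumption: log_nox was built with the natural log, log().
# Multiplying NOx by k adds b * log(k) to predicted mortality.
effect_of_multiplying <- function(b, k) b * log(k)

effect_of_multiplying(b_log_nox, 2)       # doubling NOx
effect_of_multiplying(b_log_nox, 1.01)    # a 1% increase in NOx
effect_of_multiplying(b_log_nox, exp(1))  # one unit on the log scale
```

Under that assumption, doubling NOx corresponds to roughly a 10.6 point increase in predicted mortality, and a 1% increase to roughly 0.15 points.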

d)

Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use appropriate transformations when helpful. Plot the fitted regression model and interpret the coefficients.

## 
## Call:
## lm(formula = mort ~ log_nox + log_so2 + log_hc, data = pollution2)
## 
## Coefficients:
## (Intercept)      log_nox      log_so2       log_hc  
##      924.97        58.34        11.76       -57.30

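All three predictors are on the log scale, so each coefficient is the change in mortality associated with an increase of 1 in the log of that pollutant (a factor of about 2.72, assuming natural logs), holding the other two fixed. An increase of 1 in log(SO2) corresponds to an increase of 11.76 in mortality. Similarly, an increase of 1 in log(NOx) is associated with an increase of 58.34 in mortality. Surprisingly, hydrocarbons are negatively associated with mortality: an increase of 1 in log(hc) is associated with a decrease in mortality of 57.30.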
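To see what these coefficients imply on the original pollutant scale, the printed fit can be turned into a small prediction function. This is only a sketch: it assumes the predictors are natural logs of the raw levels, the function name `predict_mort` is made up, and the input values for the example city are hypothetical rather than taken from the data.

```r
# Coefficients copied from the printed three-predictor fit above.
b <- c(intercept = 924.97, log_nox = 58.34, log_so2 = 11.76, log_hc = -57.30)

# Assumption: each predictor is the natural log of the raw pollutant level.
predict_mort <- function(nox, so2, hc) {
  unname(b["intercept"] + b["log_nox"] * log(nox) +
         b["log_so2"] * log(so2) + b["log_hc"] * log(hc))
}

# Hypothetical city; these input values are made up for illustration.
base <- predict_mort(nox = 10, so2 = 50, hc = 20)

# Doubling one pollutant while holding the other two fixed.
c(nox = predict_mort(20, 50, 20) - base,
  so2 = predict_mort(10, 100, 20) - base,
  hc  = predict_mort(10, 50, 40) - base)
```

Each difference equals the matching coefficient times log(2), whatever baseline city is chosen, which is what "holding the others fixed" means in a model that is additive on the log scale.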

e)

Cross-validate: fit the model you chose above to the first half of the data and then predict for the second half. (You used all the data to construct the model in d, so this is not really cross-validation, but it gives a sense of how the steps of cross-validation can be implemented.)

## 
## Call:
## lm(formula = mort ~ log_nox + log_so2 + log_hc, data = pollutiontrain)
## 
## Coefficients:
## (Intercept)      log_nox      log_so2       log_hc  
##     917.210       63.908        2.965      -49.024

## 
## Call:
## lm(formula = mort ~ predictions, data = pollutiontest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.859  -32.570    8.585   38.104   92.774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -32.8653   486.9831  -0.067   0.9470  
## predictions   1.0193     0.5214   1.955   0.0683 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.39 on 16 degrees of freedom
## Multiple R-squared:  0.1928, Adjusted R-squared:  0.1423 
## F-statistic: 3.821 on 1 and 16 DF,  p-value: 0.06831

## 
## Call:
## lm(formula = mort ~ predictions, data = pollutiontest)
## 
## Coefficients:
## (Intercept)  predictions  
##     -32.865        1.019

##                     2.5 %     97.5 %
## (Intercept) -1.065223e+03 999.492814
## predictions -8.607896e-02   2.124728

If the model worked perfectly, we would expect to see the line y = x when plotting observed vs. predicted rates, so the regression of observed on predicted values would have an intercept of 0 and a slope of 1. Here the estimated slope is 1.019, but its 95% interval (-0.09 to 2.12) is wide. How well the model fits depends on the fraction of the data used for training: I was finding very variable slopes when I used only half the dataset to train the model, so I switched to a training fraction of 0.7. The model still changes with each run because it is fit to a random sample of the data.
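The split, fit, predict, and check steps can be sketched as below. The data are simulated, not the real pollution data, and the row count of 60 is an assumption inferred from the 16 residual degrees of freedom in the output above (18 test rows at a 0.3 test fraction). The names `sim`, `train`, `test`, and `calib` are placeholders, and fixing the seed is one possible way to make the random split repeatable from run to run.

```r
# Simulated stand-in for the data set (NOT the real pollution data).
set.seed(216)  # fixing the seed makes the random split repeatable
n <- 60
sim <- data.frame(log_nox = rnorm(n, 2.5, 1),
                  log_so2 = rnorm(n, 3.5, 1),
                  log_hc  = rnorm(n, 3.0, 1))
sim$mort <- 925 + 58 * sim$log_nox + 12 * sim$log_so2 -
  57 * sim$log_hc + rnorm(n, sd = 50)

# 70/30 split: sample row indices for training, hold out the rest.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit_train <- lm(mort ~ log_nox + log_so2 + log_hc, data = train)
test$predictions <- predict(fit_train, newdata = test)

# Calibration check: ideally the intercept is near 0 and the slope near 1.
calib <- lm(mort ~ predictions, data = test)
coef(calib)
confint(calib)

# Out-of-sample root mean squared error.
sqrt(mean((test$mort - test$predictions)^2))
```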

f)

What do you think are the reasons for using cross-validation?

The reason to cross-validate is to make sure the model isn’t overfit, that is, too specific to the data used to build it. With a small data set, it is possible to create a model that predicts every data point very closely. However, it is hard to know from those same data points whether such a model generalizes. One way to test the model is to fit it on only a portion of the data and then evaluate its predictions on the remaining portion.
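A small simulation can illustrate the point. In this sketch the data are simulated from a truly linear relationship (an assumption made purely for illustration), and a simple model is compared with a deliberately flexible polynomial on both the training half and the held-out half. The helper `rmse` and the model names are made up for the example.

```r
# Simulated data (an assumption for illustration): a truly linear signal.
set.seed(2)
n <- 24
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 4)
d <- data.frame(x = x, y = y)

# Hold out half of the rows for testing.
train_idx <- sample(seq_len(n), size = n / 2)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Root mean squared prediction error of a fitted model on a data frame.
rmse <- function(fit, data) {
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}

fit_simple   <- lm(y ~ x, data = train)            # 2 coefficients
fit_flexible <- lm(y ~ poly(x, 8), data = train)   # 9 coefficients, 12 points

# The flexible model always fits the training rows at least as closely...
c(simple = rmse(fit_simple, train), flexible = rmse(fit_flexible, train))
# ...but only the held-out rows show whether that closeness generalizes.
c(simple = rmse(fit_simple, test), flexible = rmse(fit_flexible, test))
```

The training comparison is guaranteed by least squares because the simple model is nested inside the flexible one; the held-out comparison is the part that cross-validation adds.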

Question 2:

Perform an Exploratory Data Analysis (EDA) of the OkCupid data, keeping in mind that in HW-3 you will be fitting a logistic regression to predict gender. What do I mean by EDA?

Visualizations
Tables
Numerical summaries

For the R Markdown to work, you must first copy the file profiles.csv from Lec09 to the project directory HW-2.

Some Demographics

## Warning: Removed 116 rows containing missing values (geom_point).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 569 rows containing non-finite values (stat_boxplot).

Note that height and income differ by gender, while the distributions of age and sexual orientation are similar for both.
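The warnings above show that rows with missing values were dropped from the plots, so a numerical summary by sex should handle missing values explicitly. A minimal sketch, using a tiny made-up data frame in place of profiles.csv (the rows and values are hypothetical; only the column names `sex`, `height`, and `income` are meant to mirror the variables discussed here):

```r
# Tiny hypothetical stand-in for profiles.csv (values are made up).
profiles <- data.frame(
  sex    = c("f", "m", "f", "m", "m", "f"),
  height = c(64, 70, NA, 72, 69, 66),             # inches
  income = c(40000, NA, 60000, 80000, NA, 50000)
)

# Count missing values per column before summarizing.
colSums(is.na(profiles))

# Median by sex, dropping NAs explicitly rather than silently.
med_height <- tapply(profiles$height, profiles$sex, median, na.rm = TRUE)
med_income <- tapply(profiles$income, profiles$sex, median, na.rm = TRUE)

data.frame(sex = names(med_height),
           median_height = as.vector(med_height),
           median_income = as.vector(med_income))
```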

More Personal:

Note that drug and alcohol use do not appear to differ by gender, though body type descriptors do.

Sex	Proportion of Respondents Whose Essay Contains ‘Read’
f	0.526
m	0.462

Sex	Proportion of Respondents Whose Essay Contains ‘Cook’
f	0.355
m	0.284

Sex	Proportion of Respondents Whose Essay Contains ‘Cars’, ‘Truck’, or ‘Motorcycle’
f	0.067
m	0.111

Sex	Proportion of Respondents Whose Essay Contains ‘Child’ or ‘Kid’
f	0.300
m	0.239
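Tables like the ones above can be produced by flagging each essay that matches a pattern and averaging the flags within each sex. The sketch below is one possible way to do it, not necessarily how the tables here were computed; the data frame is a tiny made-up stand-in, and the column name `essay` and the helper `prop_by_sex` are assumptions.

```r
# Tiny hypothetical stand-in for the essay text (made-up rows).
profiles <- data.frame(
  sex   = c("f", "f", "m", "m", "m"),
  essay = c("I love to read and cook", "Reading all day",
            "My truck and my motorcycle", "I cook for my kids", NA),
  stringsAsFactors = FALSE
)

# Proportion of respondents, by sex, whose essay matches a pattern.
prop_by_sex <- function(pattern, data) {
  # grepl() returns FALSE for NA essays, so they count as non-matches.
  hit <- grepl(pattern, data$essay, ignore.case = TRUE)
  round(tapply(hit, data$sex, mean), 3)
}

prop_by_sex("read", profiles)
prop_by_sex("cars|truck|motorcycle", profiles)
prop_by_sex("child|kid", profiles)
```

Note that this is substring matching, so a pattern such as "read" also matches words like "already" or "bread"; adding word boundaries (`\\bread`) would be one way to tighten it.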