MATH 216 Homework 2

Administrative:

Please indicate

Who you collaborated with: Ali Cook, Jacob Dixon
Roughly how much time you spent on this HW: 12 hours
What gave you the most trouble: Deciding what was worth making a regression for.
Any comments you have: I am really enjoying working with the OkCupid data. I hope it is okay that I used the regression outline we followed with the wine data in class (Lecture 11).

Question 1:

Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here. I’ve included R code blocks for each question, but use them only if you feel it necessary.

a)

	Estimate	Std. Error	t value	Pr(>\|t\|)
nox	-0.104	0.176	-0.591	0.557
(Intercept)	943	9	105	8.59e-68

Table: Fitting linear model: mort ~ nox

## (Intercept)         nox 
##     942.711      -0.104

The linear regression does not fit the data well. The nitric oxide values are concentrated on the left side of the x-axis, between 0 and 100. The first plot shows that the regression line does not follow the data, and the residual plot shows a lot of variation, with points clustered on one side rather than evenly distributed. The slope (-0.104) is also not statistically significant (p = 0.557).

b)

	Estimate	Std. Error	t value	Pr(>\|t\|)
log_nox	35.31	15.19	2.325	0.02359
(Intercept)	904.7	17.17	52.68	1.108e-50

Table: Fitting linear model: mort ~ log_nox

## (Intercept)     log_nox 
##     904.724      35.311

After log-transforming the nitric oxide pollution, the linear regression model fits the data much better. In the residual plot, the points are more randomly dispersed around zero rather than clumped together, and the residuals look less spread out than for the untransformed data. The log-transformed regression shows a positive relationship (slope 35.311, p = 0.024), while the untransformed regression showed a negative, non-significant relationship between nitric oxide pollution and mortality. This suggests that an increase in nitric oxide could be associated with increased mortality. We do not know enough to conclude that the relationship is causal, however.

c)

## 
## Call:
## lm(formula = mort ~ log_nox, data = no_mr2)
## 
## Coefficients:
## (Intercept)      log_nox  
##      904.72        35.31

	2.5 %	97.5 %
(Intercept)	870.350	939.099
log_nox	4.911	65.712

The slope coefficient in this model is 35.311. This can be interpreted as the expected increase in age-adjusted mortality for each one-unit increase in log_nox, that is, for each tenfold increase in nitric oxide if the logarithm is base 10. We are 95% confident that this slope lies between 4.911 and 65.712 deaths per 100,000 people. (The interval from 870 to 939 is for the intercept, not for the effect of nitric oxide.)
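The interpretation of the slope depends on which logarithm produced log_nox, and the output does not show this. The sketch below is an arithmetic check written in Python (not the R code used for the homework). It takes the fitted slope and its confidence limits from the tables above and shows the predicted change in mortality for a tenfold increase in nitric oxide under both assumptions. The variable names are illustrative only.

```python
import math

# Values copied from the coefficient and confidence-interval tables above.
slope = 35.311
ci_low, ci_high = 4.911, 65.712

# Assumption A: log_nox = log10(nox), so a tenfold increase adds 1 to log_nox.
tenfold_if_log10 = slope * 1.0

# Assumption B: log_nox = ln(nox), so a tenfold increase adds ln(10) to log_nox.
tenfold_if_ln = slope * math.log(10)

print(f"tenfold increase, base-10 log: {tenfold_if_log10:.1f}")
print(f"tenfold increase, natural log: {tenfold_if_ln:.1f}")
print(f"95% CI for the slope: ({ci_low}, {ci_high})")
```

Under either assumption the sign of the effect is the same; only the size of the change attributed to a tenfold increase differs.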

d)

## 
## Call:
## lm(formula = mort ~ log_nox + log_hc + log_so2, data = misc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -97.793 -34.728  -3.118  34.148 194.567 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   924.97      21.45  43.125  < 2e-16 ***
## log_nox       134.32      50.08   2.682  0.00960 ** 
## log_hc       -131.94      44.71  -2.951  0.00462 ** 
## log_so2        27.08      16.50   1.642  0.10629    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.36 on 56 degrees of freedom
## Multiple R-squared:  0.2752, Adjusted R-squared:  0.2363 
## F-statistic: 7.086 on 3 and 56 DF,  p-value: 0.0004044

## (Intercept)     log_nox      log_hc     log_so2 
##   924.96522   134.32452  -131.93846    27.08262

	2.5 %	97.5 %
(Intercept)	881.999	967.932
log_nox	33.997	234.652
log_hc	-221.510	-42.367
log_so2	-5.968	60.133

For each multiplicative increase of a factor of 10 in NOX, HC, and SO2, holding the other two predictors constant, the adjusted mortality rate changes by 134.32, -131.94, and 27.08, respectively. The 95% confidence interval excludes 0 for nitric oxide and hydrocarbons, but includes 0 for SO2. Therefore, SO2 is not a significant predictor in this model, but nitric oxide and hydrocarbons could each have a relationship to mortality rate: nitric oxide a positive one and hydrocarbons a negative one.
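As a rough consistency check on the intervals above, each 95% interval should equal the estimate plus or minus a t multiplier times the standard error. The Python sketch below (an arithmetic check, not the original R code) recovers that multiplier from the printed numbers and flags which intervals exclude zero. With 56 residual degrees of freedom the multiplier should be close to 2.003.

```python
# Estimates, standard errors, and 95% limits copied from the output above.
rows = {
    "log_nox": (134.32, 50.08, 33.997, 234.652),
    "log_hc": (-131.94, 44.71, -221.510, -42.367),
    "log_so2": (27.08, 16.50, -5.968, 60.133),
}

summary = {}
for name, (est, se, low, high) in rows.items():
    # Half the interval width divided by the standard error gives the multiplier.
    multiplier = (high - low) / (2 * se)
    excludes_zero = low > 0 or high < 0
    summary[name] = (multiplier, excludes_zero)
    print(f"{name}: multiplier = {multiplier:.3f}, excludes 0: {excludes_zero}")
```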

e)

## 
## Call:
## lm(formula = mort ~ log_nox + log_hc + log_so2, data = half1_misc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.391 -23.558   0.663  23.218  96.290 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   870.76      27.13  32.096   <2e-16 ***
## log_nox        52.76      68.22   0.773   0.4462    
## log_hc        -85.41      57.15  -1.494   0.1471    
## log_so2        80.16      22.24   3.604   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.91 on 26 degrees of freedom
## Multiple R-squared:  0.571,  Adjusted R-squared:  0.5215 
## F-statistic: 11.53 on 3 and 26 DF,  p-value: 5.411e-05

##                  2.5 %    97.5 %
## (Intercept)  814.99146 926.52480
## log_nox      -87.45590 192.98008
## log_hc      -202.88518  32.07021
## log_so2       34.44458 125.87879

The plot above shows the predicted mortality rate against the observed mortality rate, which indicates how well the linear regression model fitted to half1_misc handles new data. The line y = x is overlaid on the plot. The points lie close to this line, which can be seen more clearly in the residual plot: the majority of the points are within 50 units of the y = x line.

f) What do you think are the reasons for using cross-validation?

It tests how the model would perform on “new” data and checks for overfitting. A model can become so specific to one dataset that it underperforms when it encounters new data.
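The idea can be illustrated with a minimal hold-out sketch in Python. The data here are made-up toy values, not the pollution data, and the helper names `fit_line` and `rmse` are hypothetical. The model is fitted on one half of the points and its error is then measured on the half it has not seen.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5

# Toy data (illustrative only): a line plus a small deterministic wobble.
xs = list(range(1, 21))
ys = [3.0 + 2.0 * x + ((x * 7) % 5 - 2) for x in xs]

# Split into two halves: fit on one, evaluate on the other.
train_x, test_x = xs[0::2], xs[1::2]
train_y, test_y = ys[0::2], ys[1::2]

intercept, slope = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, intercept, slope)
test_rmse = rmse(test_x, test_y, intercept, slope)
print(f"slope = {slope:.3f}, train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

A test error much larger than the training error would be a sign of overfitting.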

Question 2:

Perform an Exploratory Data Analysis (EDA) of the OkCupid data, keeping in mind that in HW-3 you will be fitting a logistic regression to predict gender. What do I mean by EDA?

Visualizations
Tables
Numerical summaries

For the R Markdown to work, you must first copy the file profiles.csv from Lec09 to the project directory HW-2.

1.) Can Pet preferences predict gender? Cat vs. Dog?

likes_cats	prop_female
0	0.379
1	0.433

	Estimate	Std. Error	z value	Pr(>\|z\|)
likes_cats	0.225	0.01681	13.39	7.314e-41
(Intercept)	-0.4936	0.01114	-44.3	0

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	80800 on 59945 degrees of freedom
Residual deviance:	80621 on 59944 degrees of freedom

## (Intercept)  likes_cats 
##  -0.4935637   0.2249725

## (Intercept) 
##       0.379

## (Intercept) 
##       0.433

This shows a very weak relationship between liking cats and gender. The horizontal lines show the group averages: among people who do not like cats, the proportion female is 0.379, meaning there are more men than women. Among people who do like cats, the proportion female is 0.433, so men are still the majority. Liking cats is associated with a higher proportion female than not liking cats, but all things considered the relationship is weak. It is important to note that some users did not list a pet preference, so those users were not included.
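The two proportions can be recovered from the logistic regression coefficients by applying the inverse logit. The Python sketch below is an arithmetic check on the printed coefficients, not the original R code, and `inv_logit` is a helper defined here for illustration.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the output above.
intercept = -0.4935637
coef_likes_cats = 0.2249725

p_no_cats = inv_logit(intercept)                 # likes_cats = 0
p_cats = inv_logit(intercept + coef_likes_cats)  # likes_cats = 1
print(round(p_no_cats, 3), round(p_cats, 3))
```

The results agree with the prop_female table, as expected for a model with a single binary predictor.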

2.) Can diet preferences predict gender? Vegetarians?

is_veg	prop_female
FALSE	0.3885917
TRUE	0.5535499

## 
## Call:
## glm(formula = is.female ~ is_veg, family = binomial, data = gender)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.270  -0.992  -0.992   1.375   1.375  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.453236   0.008751  -51.79   <2e-16 ***
## is_vegTRUE   0.668260   0.029802   22.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 80800  on 59945  degrees of freedom
## Residual deviance: 80294  on 59944  degrees of freedom
## AIC: 80298
## 
## Number of Fisher Scoring iterations: 4

## (Intercept)  is_vegTRUE 
##  -0.4532358   0.6682602

## (Intercept) 
##       0.389

## (Intercept) 
##       0.554

The data above show that women tend to be vegetarian more often than men: the slope of the regression is 0.6682 on the log-odds scale. In the plot above, the proportion female is higher among vegetarians than among non-vegetarians (0.5535 vs. 0.3886), but the difference is not large, so diet alone is not the strongest predictor of gender. It is important to note that some users did not list a diet, so those users were not included.
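Because the slope is on the log-odds scale, exponentiating it gives an odds ratio, which may be easier to read. The Python sketch below (an arithmetic check, not the original R code) compares that odds ratio with the one computed directly from the prop_female table.

```python
import math

# Coefficient and proportions copied from the output above.
coef_is_veg = 0.6682602
p_nonveg, p_veg = 0.3885917, 0.5535499

# Odds ratio implied by the logistic regression slope.
odds_ratio = math.exp(coef_is_veg)

# The same odds ratio computed from the two group proportions.
odds_nonveg = p_nonveg / (1 - p_nonveg)
odds_veg = p_veg / (1 - p_veg)
odds_ratio_from_table = odds_veg / odds_nonveg

print(f"{odds_ratio:.3f} {odds_ratio_from_table:.3f}")
```

Both routes give an odds ratio of about 1.95: the odds of being female are roughly twice as high for vegetarians as for non-vegetarians.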

3.) Can bisexuality predict gender?

orientation	prop_female
bisexual	0.721
straight	0.398

Fitting generalized (binomial/logit) linear model: is.female ~ orientation
	Estimate	Std. Error	z value	Pr(>\|z\|)
orientationstraight	-1.37	0.0433	-31.5	8.02e-218
(Intercept)	0.951	0.0424	22.4	1.89e-111

##         (Intercept) orientationstraight 
##           0.9512121          -1.3655180

## (Intercept) 
##       0.721

## (Intercept) 
##       0.398

From the plot above, you can see that women on OkCupid are more likely than men to identify as bisexual. If you were to choose a bisexual person at random, there is a 0.721 chance that they would be female. If you chose a straight person at random, there is a 0.398 chance that they would be female (and so a 0.602 chance that they would be male). It is important to note that some users did not list sexual orientation, so those users were not included.