MATH 216 Homework 2

Administrative:

Please indicate

Who you collaborated with: Christian L.
Roughly how much time you spent on this HW: 8 Hrs
What gave you the most trouble:
Any comments you have:

Question 1:

Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here. I’ve included R code blocks for each question, but use them only if you feel it necessary.

a)

*Create a scatterplot of mortality rate versus level of nitric oxides. Do you think linear regression will fit these data well? Fit the regression and evaluate the residual plot from the regression.

At a cursory glance, it appears that a linear model would not fit these data well, since the majority of points are constrained below 100 and display a non-linear relationship in how they increase. Adding a linear regression line, we get:

Indeed, adding a linear regression line through the data shows how poorly a linear model predicts the mortality rates at different Nitric Oxide pollution levels. To assess more rigorously our intuition about how poorly a linear model explains the data, we plot the residuals:

This residuals graph confirms our intuition. Despite showing residuals centered around zero, the graph depicts a non-constant variance (demonstrated by seeing a pattern in the dispersion of the points), thereby indicating that a linear model is a poor fit for the data at hand.
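The fit-then-inspect-residuals step described above can be sketched as follows. The original analysis was done in R; this is a minimal plain-Python illustration, and the names `fit_line`, `residuals`, `toy_nox`, and `toy_mort` are hypothetical. The toy values are made up for illustration and are not the pollution data.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y, intercept, slope):
    """Observed minus fitted values, in the order of the inputs."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up illustrative values, NOT the pollution data.
toy_nox = [1, 2, 4, 8, 16, 32, 64, 128]
toy_mort = [900, 912, 921, 930, 941, 949, 962, 970]

a, b = fit_line(toy_nox, toy_mort)
res = residuals(toy_nox, toy_mort, a, b)

# Plotting these pairs (fitted value, residual) gives a residual plot.
for xi, ri in zip(toy_nox, res):
    print(round(a + b * xi, 1), round(ri, 1))
```

With an intercept in the model the residuals always sum to zero, so being centered around zero says little on its own; what to look for in the plot is a pattern in their spread or shape.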

b)

*Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot:

The transformation we consider is the log of the Nitric Oxide levels.

The resulting residual graph shows an improvement over the previous one. Not only are the residuals centered around zero, as was the case in the previous graph, but we also see a random dispersion of residuals, suggesting a constant variance. This implies that our second model fits the data better than the linear model on the untransformed levels.
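As a rough numeric companion to the residual plots, one could compare the R^2 of the straight-line fit before and after logging the predictor. This is a cruder check than the residual plot (it says nothing about constant variance), and the sketch below is plain Python rather than the R used for the homework. The function `r_squared` and the toy values are hypothetical and made up for illustration.

```python
import math
from statistics import mean

def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x (squared correlation)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Made-up illustrative values, NOT the pollution data.
toy_nox = [1, 2, 4, 8, 16, 32, 64, 128]
toy_mort = [900, 912, 921, 930, 941, 949, 962, 970]
toy_log_nox = [math.log(v) for v in toy_nox]  # natural log

r2_raw = r_squared(toy_nox, toy_mort)
r2_log = r_squared(toy_log_nox, toy_mort)
print(round(r2_raw, 3), round(r2_log, 3))
```

On this toy example the logged predictor gives the higher R^2, because the toy outcome was constructed to grow roughly linearly in the log of the predictor.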

c)

*Interpret the slope coefficient from the model you chose in (b).

The first table presents the regression results, while the second table shows the confidence intervals.

Fitting linear model: mort ~ log_nox
	Estimate	Std. Error	t value	Pr(>|t|)
Logged Nitrogen Oxide	15.3	6.6	2.33	0.0236
(Intercept)	905	17.2	52.7	1.11e-50

	2.5 %	97.5 %
Intercept	870	939
Logged Nitrogen Oxide	2.13	28.5

An intuitive way of interpreting the results of this model is to differentiate the model with respect to Nitric Oxide. Because the predictor is logged, a one percent increase in Nitric Oxide levels is associated with an increase of about 0.15 (15.3/100) in the age-adjusted total mortality rate per 100,000. We are 95% confident that this effect lies between 0.0213 and 0.285 additional deaths per 100,000, adjusted for age. (Of course, for a one UNIT change in the logged Nitric Oxide level, the associated change in the mortality rate is about 15.3.)
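The differentiation argument can be written out explicitly. This is a sketch that assumes the natural logarithm was used (the default of R's `log()`); the symbols \beta_0 and \beta_1 are introduced here for the intercept and slope reported in the table.

```latex
\begin{align*}
\text{mort} &= \beta_0 + \beta_1 \log(\text{NOx}) \\
\frac{d\,\text{mort}}{d\,\text{NOx}} &= \frac{\beta_1}{\text{NOx}}
\quad\Longrightarrow\quad
\Delta\text{mort} \approx \beta_1 \,\frac{\Delta\text{NOx}}{\text{NOx}} \\
\text{1\% increase:}\quad
\Delta\text{mort} &= \beta_1 \log(1.01) \approx \frac{\beta_1}{100}
= \frac{15.3}{100} \approx 0.15
\end{align*}
```

The endpoints of the confidence interval scale the same way: 2.13/100 and 28.5/100, or roughly 0.021 and 0.285.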

d)

*Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use appropriate transformations when helpful. Plot the fitted regression model and interpret the coefficients.

First, let’s examine the relationship between the two new covariates and the mortality rate:

A quick look at the graphs, which display the relationship between the two additional covariates and our outcome variable, suggests a non-linear relationship between each of the two new predictor variables and the outcome, one that is best approximated by a log transformation.

Next, we construct a new model that takes into account the logged values of Sulfur Dioxide and Hydrocarbons. From this new model we see that a one percent increase in Nitric Oxide levels is associated with an increase of about 0.58 in the age-adjusted total mortality rate per 100,000, controlling for levels of sulfur dioxide and hydrocarbons. (Again, if we changed the logged Nitric Oxide level by one unit, the associated increase in mortality would be about 58.)

Fitting linear model: mort ~ log_nox + log_hc + log_so2
	Estimate	Std. Error	t value	Pr(>|t|)
Logged Nitrogen Oxide	58.3	21.8	2.68	0.0096
Logged Hydrocarbons	-57.3	19.4	-2.95	0.00462
Logged Sulfur Dioxide	11.8	7.17	1.64	0.106
(Intercept)	925	21.4	43.1	1.19e-44

	2.5 %	97.5 %
Intercept	882	968
Logged Nitrogen Oxide	14.8	102
Logged Hydrocarbons	-96.2	-18.4
Logged Sulfur Dioxide	-2.59	26.1
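For readers curious what fitting `mort ~ log_nox + log_hc + log_so2` amounts to, here is a minimal plain-Python sketch of least squares with several logged predictors, solved through the normal equations. It is not the R code used for the tables above. The helpers `solve` and `fit_ols`, the toy values, and `true_coefs` are all hypothetical and made up for illustration.

```python
import math

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]  # design matrix with intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up illustrative values, NOT the pollution data.
log_nox = [math.log(v) for v in [1, 2, 4, 8, 16, 3]]
log_hc = [math.log(v) for v in [2, 1, 8, 4, 3, 16]]
log_so2 = [math.log(v) for v in [5, 10, 1, 20, 2, 40]]
true_coefs = [10.0, 2.0, -3.0, 0.5]  # arbitrary: intercept, nox, hc, so2
toy_y = [true_coefs[0] + true_coefs[1] * a + true_coefs[2] * b + true_coefs[3] * c
         for a, b, c in zip(log_nox, log_hc, log_so2)]

coefs = fit_ols([log_nox, log_hc, log_so2], toy_y)
print([round(c, 6) for c in coefs])
```

Because the toy outcome is generated without noise, the fit recovers the arbitrary coefficients; with real data each coefficient is read as the change in the outcome per unit change in that logged predictor, holding the others fixed.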

e)

*Cross-validate: fit the model you chose above to the first half of the data and then predict for the second half. (You used all the data to construct the model in (d), so this is not really cross-validation, but it gives a sense of how the steps of cross-validation can be implemented.)

To answer this question, let’s follow the Monte Carlo cross-validation technique. In a nutshell, we imagine that our ‘pollution’ dataset is the population: all the data in the world that relates pollutants to mortality rates. We then partition the sample into two complementary and mutually exclusive subsamples, a training set and a test set. We fit the model on the training set, use it to predict mortality for the test set, and compare those predictions to the actual mortality values in the test set. If our model is accurate, we would expect the points, on average, to lie on a 45-degree line.

From this graph, it seems that our model is not good at predicting the actual mortality rates. To quantify how ‘bad’ our model is at making predictions, we could calculate the R^2 value.
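The first-half/second-half procedure and the suggested R^2 calculation can be sketched as below. This is plain Python rather than the R used for the homework, it uses a single predictor for brevity (the model in (d) has three), and the names `fit_line`, `holdout_scores`, `toy_x`, and `toy_y` are hypothetical. The toy values are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def holdout_scores(x, y):
    """Fit on the first half, predict the second half; return (rmse, r2) on the held-out half."""
    half = len(x) // 2
    a, b = fit_line(x[:half], y[:half])
    y_test = y[half:]
    preds = [a + b * xi for xi in x[half:]]
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y_test, preds))
    sst = sum((yi - mean(y_test)) ** 2 for yi in y_test)
    rmse = (sse / len(y_test)) ** 0.5
    return rmse, 1 - sse / sst

# Made-up illustrative values, NOT the pollution data.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

rmse, r2 = holdout_scores(toy_x, toy_y)
print(round(rmse, 3), round(r2, 3))
```

Unlike the in-sample R^2, this out-of-sample version can be negative, which happens when the model predicts the held-out half worse than simply using that half's mean.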

f)

*What do you think are the reasons for using cross-validation?

Cross-validation tests the efficacy of a model by using a subset of the available dataset to fit the model, then testing its ability to predict using the remainder of the dataset. This technique is preferable to evaluating residuals, as it demonstrates how well a model will do when facing ‘new data.’ In short, a model is not very useful if we do not know how ‘accurately’ it can predict new data; cross-validation gives us an idea of how well (or not) our model will do so.
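A single split can be lucky or unlucky, so one common variant repeats the random split many times and averages the held-out error, which is what Monte Carlo cross-validation refers to. The sketch below is a plain-Python illustration with one predictor; `monte_carlo_cv`, its parameters `n_splits`, `test_frac`, and `seed`, and the toy values are all hypothetical and made up for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def monte_carlo_cv(x, y, n_splits=100, test_frac=0.5, seed=0):
    """Average held-out RMSE over repeated random train/test splits."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    idx = list(range(len(x)))
    n_test = max(1, int(len(x) * test_frac))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_err = [(y[i] - (a + b * x[i])) ** 2 for i in test]
        scores.append(mean(sq_err) ** 0.5)
    return mean(scores)

# Made-up illustrative values, NOT the pollution data.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 17.9, 20.2]

avg_rmse = monte_carlo_cv(toy_x, toy_y)
print(round(avg_rmse, 3))
```

The training subset must contain at least two distinct predictor values for the line to be defined, which the toy data guarantees here.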

Question 2:

Perform an Exploratory Data Analysis (EDA) of the OkCupid data, keeping in mind that in HW-3 you will be fitting a logistic regression to predict gender. What do I mean by EDA?

Visualizations
Tables
Numerical summaries

The big question we are trying to answer in this section is: Who is in our sample? First, let’s explore our sample’s basic demographics:

Sex	Statistic	Age	Height	Income
f	Mean	32.82	65.10	86633.47
f	Sd	10.03	2.93	189917.38
m	Mean	32.02	70.44	110984.39
m	Sd	9.03	3.08	205162.13

This table reports the mean and standard deviation of OkCupid users’ age, height, and income by sex. In our sample, the average age of male OkCupid users in the Bay Area is 32, while the average age of female users is 33. As expected, male users are taller on average by about five inches. Men also tend to report higher incomes: about $111,000, compared to about $87,000 reported by females. It is important to stress that we are unable to verify the accuracy of these reported incomes. Thus, it may very well be that males, on average, tend to inflate their incomes.
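A by-sex summary like the one in the table can be computed along these lines. This is a minimal plain-Python sketch rather than the R used for the homework; `summarize_by_group`, the field names `sex` and `height`, and the toy records are hypothetical and made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_group(records, group_key, value_key):
    """Return {group: (mean, sample sd)} for one numeric field, skipping missing values."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(value_key)
        if value is not None:
            groups[rec[group_key]].append(value)
    return {g: (mean(v), stdev(v)) for g, v in groups.items()}

# Made-up illustrative records, NOT the OkCupid data.
toy_profiles = [
    {"sex": "f", "height": 64},
    {"sex": "f", "height": 66},
    {"sex": "m", "height": 69},
    {"sex": "m", "height": 71},
    {"sex": "m", "height": None},  # missing value is skipped
]

summary = summarize_by_group(toy_profiles, "sex", "height")
print(summary)
```

Self-reported fields such as income often have large standard deviations relative to their means, as in the table above, so a median or a trimmed summary may be worth reporting alongside the mean.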

Another important variable to consider is education. However, there appear to be too many categories available for OkCupid users to choose from. This may potentially kill any variation in our subsequent regression analysis. An easy fix is to consolidate some of the categories, in an arbitrary but reasonable way. We use the US Census education categories as a guiding principle to produce four categories: Some College, Bachelor’s, Graduate or Professional Degrees, and Space Camp. While space camp may not be a serious measure of education, we choose to include it as a variable of interest, since this choice may reveal important characteristics about who the users are (e.g. embarrassment about a low level of education, or humor), which may help us predict their sex later on.

At each level of education, there are more male than female users of OkCupid. Our sample contains, for example, almost 1500 male users with a bachelor’s compared to only 1000 female users. This could be a function of the mere number of males, which is more than the number of females in our sample in absolute terms. Hence, we look at the proportion of users with a given level of education relative to the overall number of their sex.

Interestingly, this does not seem to be the case. Male users are on average more educated than female users of OkCupid, and not merely because there are more of them.
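The move from raw counts to within-sex proportions can be sketched as follows. This is plain Python rather than the R used for the homework; `within_group_shares`, the field names, and the toy records are hypothetical and made up for illustration.

```python
from collections import Counter

def within_group_shares(records, group_key, category_key):
    """Return {group: {category: share of that group's records}}."""
    totals = Counter(rec[group_key] for rec in records)
    pairs = Counter((rec[group_key], rec[category_key]) for rec in records)
    shares = {}
    for (group, category), n in pairs.items():
        shares.setdefault(group, {})[category] = n / totals[group]
    return shares

# Made-up illustrative records, NOT the OkCupid data:
# more men than women hold a bachelor's in absolute terms (3 vs 2),
# yet the share within each sex is identical.
toy_profiles = (
    [{"sex": "m", "education": "bachelors"}] * 3
    + [{"sex": "m", "education": "some college"}] * 3
    + [{"sex": "f", "education": "bachelors"}] * 2
    + [{"sex": "f", "education": "graduate"}] * 2
)

shares = within_group_shares(toy_profiles, "sex", "education")
print(shares["m"]["bachelors"], shares["f"]["bachelors"])
```

In the toy records the gap in counts disappears once shares are used, which is the possibility the proportions are meant to rule out in the actual sample.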

Next, we consider the body types and marital status of Bay Area Okcupid users:

As expected, certain adjectives are used more frequently by women than men, for instance, curvy and full-figured. Conversely, men tend to use the adjectives athletic and fit more than women.

Unsurprisingly, the majority of users on the website indicate that they are single or available. However, some users who are married or currently seeing someone are still active on OkCupid. While there tend to be more single men, the proportion of men and women who are active users of OkCupid but are in a relationship appears to be about the same.

Now that we know a bit about users’ basic demographics and love-life-related variables, we turn to their social habits:

Finally, we see that there is a higher proportion of men engaging in ‘risky’ social behavior such as drinking, smoking, or doing drugs.

Mohamed Hussein
