Lab 6: Multiple Regression Exercises

Objectives

Practice choosing between different functional specifications (log vs. linear) and justifying your choice
Practice interpreting coefficients numerically
Practice interpreting coefficients scientifically
Better understand the differences between simple linear regression and multiple linear regression.

1. Comparing Incomes

The NLSY data used in the last lab were a subset of a dataset that is available on the Blackboard site. It contains many variables, but in this lab you will focus only on explaining income using gender, education, and IQ (measured by AFQT).

Decide on a suitable model for the regression using either the original variables or log-transformed variables. Explain your reasoning for choosing your model specification.
Is there any evidence that the mean income for males exceeds the mean income for females with the same years of education and AFQT scores? By how many dollars (or by what percentage) is the mean income for males larger? What is a 90% confidence interval for this difference?
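As a starting point, the sketch below shows one way to extract a 90% confidence interval for a gender coefficient with confint(). It runs on simulated data: the data frame nlsy_sim, the variable names (income, male, educ, afqt), and every numeric value are hypothetical stand-ins, not the Blackboard dataset, and the log specification is used only for illustration.

```r
# Simulated stand-in for the NLSY data; all names and values here are hypothetical.
set.seed(1)
n <- 500
nlsy_sim <- data.frame(
  male = rbinom(n, 1, 0.5),                 # 1 = male, 0 = female
  educ = sample(8:20, n, replace = TRUE),   # years of education
  afqt = runif(n, 0, 100)                   # AFQT percentile score
)
nlsy_sim$income <- exp(9 + 0.2 * nlsy_sim$male + 0.08 * nlsy_sim$educ +
                       0.005 * nlsy_sim$afqt + rnorm(n, sd = 0.6))

# Log-income specification (one candidate; the linear one is fit the same way)
fit <- lm(log(income) ~ male + educ + afqt, data = nlsy_sim)

# 90% confidence interval for the male coefficient, on the log scale
ci_log <- confint(fit, "male", level = 0.90)

# Back-transform to a multiplicative difference (males relative to females)
ratio_hat <- exp(coef(fit)["male"])
ratio_ci  <- exp(ci_log)
```

With a log outcome, the back-transformed coefficient is a ratio rather than a dollar difference, so (ratio_hat - 1) * 100 reads as a percent difference. If you choose the untransformed model instead, the coefficient and its interval are already in dollars.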

2. Effect of air pollution on house prices

For this question, you will analyze the Boston housing price data from 1970 that I used in class. You can get these data (minus the shapefile I used for plotting) from the spdep package.

library(spdep)
data(boston)

This will load two objects: boston.c, which is a data frame with the observations, and boston.utm, which is a data frame with X and Y spatial coordinates (in case you want to produce a map!).

1) Imagine regressing log(Median House Price) (the log of the CMEDV variable) on

NOX (nitric oxides concentration, in parts per 10 million)
log(DIS) (distance from downtown Boston and other employment centers)
CRIM (crime per capita)
log(RM) (average number of rooms per dwelling)
AGE (proportion of houses built before 1940)
TAX (property tax rate per $10,000)
PTRATIO (pupil-teacher ratio per town)
log(LSTAT) (percentage of the population that is of “lower status”)
B (1000 * ((proportion of Black residents) - 0.63)²)
log(RAD) (accessibility to radial highways, a proxy for ease of access in and out of Boston)
INDUS (proportion of non-retail business acres per town, primarily industry)

For the first 8 variables, briefly describe (1-2 sentences each) whether you expect the regression coefficient to be positive or negative, and why. Remember that the regression coefficient in a multiple regression measures the effect of X on Y when all the other X variables are held fixed.

2) Run the multiple regression of log(CMEDV) on all of the above variables. What is the approximate effect (in dollars or percent) of a 1% increase in distance (DIS)?
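When both the outcome and a predictor are logged, the coefficient can be read as an elasticity. The short sketch below compares the usual approximation with the exact figure implied by the log-log form. The value of b_logx is an arbitrary placeholder chosen for illustration, not an estimate from the Boston data; substitute the coefficient from your own fit.

```r
# Arbitrary placeholder coefficient on a logged predictor (NOT a Boston estimate)
b_logx <- 0.5

# Approximation: a 1% increase in x changes y by about b_logx percent
approx_pct <- b_logx * 1

# Exact change implied by log(y) = a + b_logx * log(x): y scales by 1.01^b_logx
exact_pct <- (1.01^b_logx - 1) * 100

c(approx = approx_pct, exact = exact_pct)
```

For small percentage changes in x the two numbers are nearly identical, which is why the approximate reading is usually good enough.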

3) The primary interest in the original study was the effect of NOX, that is, whether air pollution had an effect on house prices.
Using the same regression as above, report the coefficient for NOX. Using either the summary command or a histogram, look at the range of NOX to get an idea of typical values for NOX in parts per 10 million. By about how many dollars (or by what percentage) does the median house price change when NOX increases by 1 part per 100 million (i.e. 0.1 parts per 10 million)?

4) Repeat the regression as a simple linear regression, regressing log house price on NOX. Compare the coefficient with the same coefficient in the multiple regression. Explain why the difference arises. Be specific. (Hint: a simple scatterplot of NOX against log(DIS) may be helpful.)
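The mechanics behind questions 3 and 4 can be seen on simulated data. In the sketch below, x1, x2, and y are hypothetical stand-ins (think of y as a logged outcome and x1 as a predictor correlated with an omitted x2); none of the numbers come from the Boston data.

```r
# Simulated data only: x1, x2, and y are hypothetical, not the Boston variables.
set.seed(42)
n  <- 1000
x2 <- rnorm(n)
x1 <- -0.8 * x2 + rnorm(n, sd = 0.6)        # x1 is negatively correlated with x2
y  <- 1 - 0.5 * x1 + 0.7 * x2 + rnorm(n, sd = 0.5)

simple   <- lm(y ~ x1)        # omits x2
multiple <- lm(y ~ x1 + x2)   # holds x2 fixed

b_simple   <- coef(simple)["x1"]
b_multiple <- coef(multiple)["x1"]

# In-sample identity: the simple slope equals the multiple slope plus
# (coefficient on x2) times (slope from regressing x2 on x1)
delta   <- coef(lm(x2 ~ x1))["x1"]
b_check <- b_multiple + coef(multiple)["x2"] * delta

# With a logged outcome, a 0.1-unit increase in x1 changes the outcome by this percent
pct_effect <- (exp(0.1 * b_multiple) - 1) * 100
```

Here the simple regression slope is more negative than the multiple regression slope because x1 also picks up the effect of the omitted x2. Comparing b_simple with b_check shows that the gap is fully accounted for by the relationship between the two predictors.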

5) Using this example, a) explain to a peer the difficulties that might be encountered when trying to scientifically interpret a statistical regression coefficient, and b) offer some general guidance about how to approach an analysis in which you are trying to measure the scientific impact of one variable on another.