The NLSY data used in the last lab were a subset of a dataset that is available on the Blackboard site. That dataset contains many variables, but in this lab you will focus only on explaining income using gender, education, and IQ (measured by AFQT).
For this question, you will analyze the Boston housing price data from 1970 that I used in class. You can get these data (minus the shapefile I used for plotting) from the spdep package.
library(spdep)
data(boston)
This will load two objects: boston.c, a data frame with the observations, and boston.utm, which holds the X and Y spatial coordinates (in case you want to produce a map!).
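Before modeling, it can help to look at what was loaded. The following is a minimal sketch, assuming the spdep package (and the data it provides) is installed and that the objects are named as described above; the specific inspection calls are my suggestion, not a required step.

```r
library(spdep)
data(boston)

# Variable names and types in the observation table
str(boston.c)

# Distribution of the response before taking logs
summary(boston.c$CMEDV)

# First few rows of the spatial coordinates
head(boston.utm)
```

Checking the units of each variable in the help page (?boston) before interpreting any coefficient is usually worth the minute it takes.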
1) Imagine regressing log(Median House Price) (the log of the CMEDV variable) on
For the first 8 variables, briefly describe (1-2 sentences each) whether you expect the regression coefficient to be positive or negative, and why. Remember that the regression coefficient in a multivariate regression measures the effect of X on Y when all the other X variables are held fixed.
2) Run the multivariate regression of log(CMEDV) on all of the above variables. What is the approximate effect (in dollars or percentage points) of an increase in Distance by 1%?
3) The primary interest of the original study was the effect of NOX: whether air pollution had an effect on house prices.
Using the same regression as above, report the coefficient for NOX. Using either the summary command or a histogram, look at the range of NOX to get an idea of typical values for NOX in parts per 10 million. By about how many dollars (or percentage points) does the median house price change when NOX increases by 1 part per 100 million (i.e., 0.1 parts per 10 million)?
4) Repeat the regression as a simple linear regression, regressing log house price on NOX. Compare the NOX coefficient with the corresponding coefficient in the multiple regression. Explain why the difference arises. Be specific. (Hint: a simple scatterplot of NOX against log(DIS) may be helpful.)
5) Using this example, a) explain to a peer the difficulties that might be encountered when trying to scientifically interpret a statistical regression coefficient, and b) offer some general guidance about how to approach an analysis in which you are trying to measure the scientific impact of one variable on another.
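Questions 2 and 3 both ask you to translate a coefficient from a model with a logged response into percentage or dollar terms. The sketch below shows the arithmetic on simulated data only. The variables x1 and x2, the sample size, and all of the numbers are hypothetical and are not taken from the Boston data. It assumes one predictor enters in logs and one enters in levels.

```r
# Simulated data only; x1, x2 and every number here are hypothetical.
set.seed(1)
n  <- 500
x1 <- runif(n, 1, 12)   # predictor that enters the model in logs
x2 <- runif(n, 0, 1)    # predictor that enters the model in levels
y  <- exp(3 + 0.3 * log(x1) - 0.8 * x2 + rnorm(n, sd = 0.1))

fit <- lm(log(y) ~ log(x1) + x2)
b1  <- coef(fit)[["log(x1)"]]
b2  <- coef(fit)[["x2"]]

# Log-log term: a 1% increase in x1 multiplies y by 1.01^b1,
# which is close to a change of b1 percent.
pct_x1 <- 100 * (1.01^b1 - 1)

# Log-level term: raising x2 by delta multiplies y by exp(b2 * delta),
# which is close to a change of 100 * b2 * delta percent when b2 * delta is small.
delta  <- 0.1
pct_x2 <- 100 * (exp(b2 * delta) - 1)

# Convert the percentage change into response units at a reference value.
ref_value  <- median(y)
dollars_x2 <- ref_value * pct_x2 / 100

round(c(b1 = b1, pct_x1 = pct_x1, b2 = b2, pct_x2 = pct_x2), 3)
```

Note that a dollar figure depends on the reference value you pick (here the median of the simulated response), whereas the percentage change does not.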
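For questions 4 and 5, it may help to see how a coefficient can shift when a correlated predictor is left out. The sketch below uses simulated data with hypothetical variables x and z, and every number in it is made up for illustration. It checks the standard least-squares identity that the simple-regression slope equals the multiple-regression slope plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one.

```r
# Simulated data only; x, z and all coefficients are hypothetical.
set.seed(3)
n <- 1000
z <- rnorm(n)                        # predictor that will be omitted
x <- -0.7 * z + rnorm(n, sd = 0.5)   # predictor of interest, correlated with z
y <- 1 - 0.5 * x - 0.4 * z + rnorm(n, sd = 0.3)

simple <- coef(lm(y ~ x))[["x"]]     # slope when z is left out
full   <- coef(lm(y ~ x + z))        # slopes when both are included
aux    <- coef(lm(z ~ x))[["x"]]     # how z moves with x

# The identity holds exactly in the sample for least squares fits.
all.equal(simple, full[["x"]] + full[["z"]] * aux)

round(c(simple = simple, full_x = full[["x"]], full_z = full[["z"]], aux = aux), 3)
```

The design choice here is to simulate rather than use the real data, so that the true coefficients are known and the size of the shift can be compared against them.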