Lab 5: More Regression Practice

There are two datasets for this lab, which you may download from the Blackboard site: election2000.csv and NLSY.csv. The CSV (comma-separated values) format is a good way to distribute data because it can be read by nearly every statistical software package, as well as by Excel. As usual, you should open the data in a text editor first and check for column headers, row numbers, or any other oddities.
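The check described above can also be done from within R. The following is a minimal sketch, assuming a small made-up CSV written to a temporary file; the column names county, x, and y are hypothetical and are not taken from either lab dataset.

```r
# Hypothetical example: write a tiny CSV to a temporary file so that the
# sketch is self-contained. Replace tmp with the path to a downloaded file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("county,x,y", "A,1,10", "B,2,20", "C,3,30"), tmp)

# Look at the raw text first to check for headers, row numbers, or oddities.
first_lines <- readLines(tmp, n = 3)
print(first_lines)

# Then read the file. header = TRUE is the default for read.csv and is
# written out here only to make the choice explicit.
dat <- read.csv(tmp, header = TRUE)
str(dat)
```

If the raw text shows no header row, header = FALSE would be the appropriate choice instead.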

1. Florida Election Results from the 2000 US Presidential Election

Download the election2000.csv data. These are election results data, including results from the now-infamous “butterfly” ballot in Palm Beach County. Here is a link to an image of the ballot: http://en.wikipedia.org/wiki/File:Butterfly_large.jpg It is suspected that many voters who intended to vote for Gore accidentally voted for Buchanan. This claim was based primarily on two pieces of evidence: 1) Buchanan had an unusually high number of votes in that county, and 2) an unusually high number of ballots in that county were discarded because voters had marked two circles.

a) Draw a scatterplot of the number of votes cast for Bush vs. the number of votes cast for Buchanan. What evidence is there that Buchanan received more votes than expected?

b) Excluding the Palm Beach County results, estimate a regression predicting the number of Buchanan votes from the number of Bush votes. Does a linear regression seem like a reasonable model to fit the data? Provide evidence. Estimate a 95% confidence interval for the regression coefficient on Bush votes.

c) Estimate a 95% confidence interval for the number of Buchanan votes in Palm Beach from this model. To help with this, type help(predict), and then help(predict.lm). Do you want the 95% confidence interval for the regression line, or the 95% interval for an individual data point (a prediction interval)? Explain why. How many estimated standard deviations is the Palm Beach County result from the estimated regression line? Compare this answer to a t-distribution (with degrees of freedom equal to the residual degrees of freedom in the regression) to obtain a p-value. Describe the null hypothesis that this p-value is testing.
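The mechanics behind parts b) and c) can be sketched on simulated data. This is a minimal illustration and not the lab solution: the data frame sim, its columns unit, x, and y, and the artificial outlier in the last row are all assumptions made for the example. The sketch uses the prediction standard error as one possible way to scale the distance from the line; deciding which interval and which standard deviation are appropriate remains part of the question.

```r
set.seed(1)
# Simulated stand-in data; unit, x, and y are hypothetical names,
# not columns from election2000.csv.
sim <- data.frame(unit = paste0("u", 1:50), x = runif(50, 0, 100))
sim$y <- 5 + 0.3 * sim$x + rnorm(50, sd = 4)
sim$y[50] <- sim$y[50] + 30   # make the last row an artificial outlier

# Fit the regression without the held-out row.
held_out <- subset(sim, unit == "u50")
fit <- lm(y ~ x, data = subset(sim, unit != "u50"))

# 95% confidence intervals for the intercept and slope.
ci_coef <- confint(fit, level = 0.95)
print(ci_coef)

# Residuals against fitted values: curvature or a funnel shape would
# suggest that a straight line with constant variance fits poorly.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Two kinds of 95% interval at the held-out x value:
# "confidence" is for the regression line, "prediction" is for a new point.
ci_line <- predict(fit, newdata = held_out, interval = "confidence", level = 0.95)
pi_obs  <- predict(fit, newdata = held_out, interval = "prediction", level = 0.95)
print(ci_line)
print(pi_obs)

# Distance of the held-out point from the fitted line, scaled by the
# standard error for a new observation (line uncertainty plus residual sd).
pred   <- predict(fit, newdata = held_out, se.fit = TRUE)
se_new <- sqrt(pred$se.fit^2 + pred$residual.scale^2)
t_stat <- (held_out$y - pred$fit) / se_new

# Two-sided p-value from a t-distribution with the residual degrees of freedom.
p_val <- 2 * pt(abs(t_stat), df = pred$df, lower.tail = FALSE)
print(c(t = unname(t_stat), p = unname(p_val)))
```

The prediction interval is always wider than the confidence interval at the same x, because it adds the residual variance to the uncertainty in the fitted line.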

2. IQ, Education, and Future Income

Download the dataset NLSY.csv. These data contain observations from the National Longitudinal Survey of Youth 1979 (NLSY79). The NLSY79 is a panel of young people selected in 1979 and then followed in subsequent years. This extract contains IQ scores as measured by the Armed Forces Qualification Test (AFQT, taken in 1981), years of education completed by 2006, and annual income in 2005.

a) Describe the distribution of 2005 income as a function of IQ. What percentage of the variation in income is explained by a linear regression on IQ?

b) Describe the distribution of 2005 income as a function of education. What percentage of the variation in income is explained by a linear regression on education?

c) Calculate the expected 2005 income for a person with a score of 80 on the AFQT. Calculate a 95% confidence interval for this expected income.
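The quantities asked for in this section can be read off a fitted lm object. The sketch below uses simulated data only: the column names score and income, and the numbers used to generate them, are arbitrary assumptions and do not reflect NLSY.csv.

```r
set.seed(3)
# Simulated stand-in data; score and income are hypothetical names and the
# generating values are arbitrary, not taken from NLSY.csv.
sim <- data.frame(score = rnorm(200, mean = 100, sd = 15))
sim$income <- 20000 + 300 * sim$score + rnorm(200, sd = 15000)

fit <- lm(income ~ score, data = sim)

# Share of the variation in the response explained by the regression.
r2 <- summary(fit)$r.squared
cat(sprintf("Variation explained: %.1f%%\n", 100 * r2))

# Expected response at score = 80, with a 95% confidence interval for
# that expected value (the regression line, not an individual outcome).
exp_80 <- predict(fit, newdata = data.frame(score = 80),
                  interval = "confidence", level = 0.95)
print(exp_80)
```

In a regression with a single predictor, the R-squared value equals the squared correlation between the predictor and the response, which offers a quick consistency check.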