Lectures: 10-16
Topics: one and two sample t tests, p-values, Wilcoxon (Mann-Whitney) test, paired test, best-fit line, sum of squared residuals (errors), linear regression parameters, interpretation, lm() function, parameter uncertainty, bootstrap samples, confidence intervals, prediction intervals, linear regression assumptions, checks for model adequacy, multiple regression, pairwise scatter plots, collinearity, simplifying a model, AIC, likelihood, stepwise regression
Sample problems:
The variable sat.m in the data set stud.recs (UsingR) contains math SAT scores for a group of students. Use a t test to test the null hypothesis that the mean score is 500 against the two-sided alternative. What do you conclude?
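A minimal sketch of how this test could be set up. The helper name sat_t_test is hypothetical and only wraps t.test(); the guarded block assumes the UsingR package is installed and is skipped otherwise.

```r
# Hypothetical helper: two-sided one-sample t test of H0: mean = mu0
sat_t_test = function(scores, mu0 = 500) {
  # t.test() drops missing values before computing the statistic
  t.test(scores, mu = mu0, alternative = "two.sided")
}

# Run on the course data only if UsingR is available (assumption)
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(stud.recs, package = "UsingR")
  print(sat_t_test(stud.recs$sat.m))
}
```

Reject the null hypothesis at the 5% level if the reported p-value is below 0.05; otherwise the data are consistent with a mean score of 500.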
Consider the airquality data frame containing daily air quality measurements in New York.
a. Create a scatter plot matrix of the four variables (solar radiation, wind speed, temperature, and ozone). Which variable appears to have the strongest relationship to ozone? Note: If the data frame contains missing values use pairs() instead of ggpairs().
pairs(airquality[, 1:4])
cor(airquality[, 1:4], use = "complete")
##           Ozone Solar.R    Wind    Temp
## Ozone    1.0000  0.3483 -0.6125  0.6985
## Solar.R  0.3483  1.0000 -0.1272  0.2941
## Wind    -0.6125 -0.1272  1.0000 -0.4972
## Temp     0.6985  0.2941 -0.4972  1.0000
b. Regress ozone on air temperature, wind speed, and solar radiation. What is the R squared value for this model? Is the model statistically significant?
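One way to approach part b, as a minimal sketch: fit the model with lm() and read the R squared value and the overall F test from summary(). The object names model, model.sum, r.squared, and p.value are choices made here for illustration.

```r
# Fit ozone on temperature, wind speed, and solar radiation.
# lm() drops rows with missing values by default.
model = lm(Ozone ~ Temp + Wind + Solar.R, data = airquality)
model.sum = summary(model)

# Proportion of the variation in ozone explained by the model
r.squared = model.sum$r.squared

# Overall F test of H0: all slope coefficients are zero
f = model.sum$fstatistic
p.value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(r.squared)
print(p.value)
```

A small p-value for the F test indicates that at least one explanatory variable is useful for predicting ozone.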
c. Can the model be simplified by removing a variable? If so, which one?
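A sketch of one way to examine part c, assuming single-term deletions and AIC-based backward elimination are acceptable criteria. Missing values are removed first so that every candidate model is fit to the same rows; the names aq, dropped, and model.step are illustrative.

```r
# Keep only complete rows so all candidate models use the same data
aq = na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
model = lm(Ozone ~ Temp + Wind + Solar.R, data = aq)

# Single-term deletions: AIC and F test for dropping each variable in turn
dropped = drop1(model, test = "F")
print(dropped)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
model.step = step(model, direction = "backward", trace = 0)
print(formula(model.step))
```

A variable is a candidate for removal if dropping it lowers the AIC, or if its F test is not significant.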
d. Check the model assumptions of linearity, constant variance, and normality and comment on your findings.
require(ggplot2)
# Model from part b; lm() drops rows with missing values
model = lm(Ozone ~ Temp + Wind + Solar.R, data = airquality)
# Linearity: ozone by binned solar radiation
ggplot(airquality, aes(x = cut(Solar.R, breaks = 5), y = Ozone)) + geom_boxplot() +
  xlab("Solar radiation (langleys)") + ylab("Ozone (ppb)")
# Constant variance: standardized residuals against fitted values
model2.df = fortify(model)
ggplot(model2.df, aes(x = .fitted, y = .stdresid)) + geom_point() + geom_smooth() +
  geom_hline(yintercept = 0)
# Normality: residual density with a normal reference band, then a Q-Q plot
require(sm)
res = residuals(model)
sm.density(res, xlab = "Model Residuals", model = "Normal")
qqnorm(model$residuals)
qqline(model$residuals)
e. Is collinearity a problem?
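A sketch of one way to look at part e without extra packages: inspect the correlations among the explanatory variables and compute variance inflation factors by hand. The names aq, predictors, and vif are illustrative, and the usual cutoffs (values above roughly 5 or 10) are rules of thumb rather than formal tests.

```r
# Complete rows only, so every auxiliary regression uses the same data
aq = na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
predictors = c("Temp", "Wind", "Solar.R")

# Pairwise correlations among the explanatory variables only
print(cor(aq[, predictors]))

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# explanatory variable j on the remaining explanatory variables
vif = sapply(predictors, function(p) {
  others = setdiff(predictors, p)
  r2 = summary(lm(reformulate(others, response = p), data = aq))$r.squared
  1 / (1 - r2)
})
print(vif)
```

Values close to 1 suggest that the explanatory variables carry largely separate information.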
Download the data file from Blackboard and import it into R using the read.dbf() function from the foreign package.
require(foreign)
setwd("~/Desktop")
GE = read.dbf("GeorgiaEduc.dbf")
names(GE)
Let the response variable be PctBach (percentage of the population with a bachelor's degree) and the explanatory variables be TotPop90, PctRural, PctEld, PctFB (foreign born), PctPov, and PctBlack. Perform a stepwise regression to determine the final model.
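A minimal sketch of the stepwise procedure, assuming GE is the data frame returned by read.dbf() and contains the columns named above. The helper stepwise_bach is hypothetical; it starts from the full model and lets step() remove terms according to AIC.

```r
# Hypothetical helper: GE is assumed to hold the Georgia education data
stepwise_bach = function(GE) {
  vars = c("PctBach", "TotPop90", "PctRural", "PctEld",
           "PctFB", "PctPov", "PctBlack")
  # Drop incomplete rows so all candidate models are fit to the same data
  dat = na.omit(GE[, vars])
  full = lm(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov + PctBlack,
            data = dat)
  # Stepwise selection by AIC; trace = 0 suppresses the step-by-step log
  step(full, trace = 0)
}

# Usage once GE has been read in:
# final = stepwise_bach(GE)
# summary(final)
```

The returned object is an ordinary lm fit, so summary() and the diagnostic checks used for the airquality model apply to it as well.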