QUESTION 1: a) education
b) Simpson's paradox occurs when adding a new explanatory variable changes the sign of the coefficient on another variable. According to the textbook, the role of an explanatory variable can depend on the context set by the other explanatory variables. The explanation of this Simpson's paradox lies in the number of workers in each educational category (the group sizes). There are now many more college graduates, who hold higher-paying jobs, than in 2000, despite the fact that their individual wages have decreased. In essence, the large growth in the number of college graduates pulls the overall wage trend upward, overpowering the wage decline within the individual education groups.
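The mechanism can be sketched with a few made-up numbers. The group sizes and wages below are hypothetical, chosen only so that each education group's wage falls while the overall average rises; they are not taken from the textbook data.

```r
# Hypothetical group sizes and mean wages (illustration only, not real data).
wages <- data.frame(
  period    = c("2000", "2000", "later", "later"),
  education = c("no college", "college", "no college", "college"),
  n         = c(80, 20, 50, 50),
  mean_wage = c(10, 20, 9, 19)
)

# Overall mean wage in each period, weighting each group by its size.
overall <- sapply(split(wages, wages$period),
                  function(d) weighted.mean(d$mean_wage, d$n))

# Change in mean wage within each education group (later minus 2000).
early <- wages[wages$period == "2000", ]
late  <- wages[wages$period == "later", ]
within_change <- setNames(late$mean_wage - early$mean_wage, early$education)

overall        # 12 in 2000, 14 later: the overall average rises
within_change  # -1 for both groups: each group's wage falls
```

The sign flips only because the mix of workers shifts toward the higher-paid group; holding education fixed, the trend is downward.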
QUESTION 2: a)
Management wages = 7.01 + 0.031educ - 12.68 + 1.23educ*sectormanag
Management wages = -5.67 + 1.26educ
Construction wages = 7.01 + 7.80 + 0.031educ - 0.51educ*sectorconst
Construction wages = 14.82 - 0.48educ
The management sector benefits the most from more years of education because its educ coefficient is 1.26, meaning that for every additional year of education the predicted wage increases by 1.26, while for construction workers the predicted wage decreases by 0.48 for every additional year of education.
b) Sector and education interact in this case because the relationship between years of education and a worker's wages depends on the sector they work in. While the y-intercept for management wages is lower than that for construction wages, the management slope is positive and larger, so the lines for construction and management wages are not parallel, which indicates the presence of an interaction term.
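As a check on the arithmetic, the sector-specific intercepts and slopes can be assembled directly from the fitted coefficients. This is a minimal sketch that hard-codes the estimates printed by summary(mod1) in this write-up; the variable names are illustrative.

```r
# Coefficients copied from the summary(mod1) output in this write-up.
b0       <- 7.0153                                 # (Intercept), reference sector
b_educ   <- 0.0315                                 # educ slope in the reference sector
shift    <- c(const = 7.8002, manag = -12.6805)    # sectorconst, sectormanag
interact <- c(const = -0.5080, manag = 1.2283)     # educ:sectorconst, educ:sectormanag

# Each sector's line: intercept = b0 + shift, slope = b_educ + interaction.
sector_lines <- data.frame(
  sector    = names(shift),
  intercept = unname(b0 + shift),
  slope     = unname(b_educ + interact)
)
sector_lines
# const: intercept 14.8155, slope -0.4765
# manag: intercept -5.6652, slope  1.2598

# Years of education at which the two fitted lines cross (about 11.8).
cross_educ <- diff(sector_lines$intercept) / -diff(sector_lines$slope)
cross_educ
```

Because the slopes differ, the lines cross rather than run parallel, which is what the interaction terms encode.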
library(mosaic)
## Loading required package: grid
## Loading required package: lattice
##
## Attaching package: 'mosaic'
##
## The following objects are masked from 'package:stats':
##
## D, IQR, binom.test, cor, cov, fivenum, median, prop.test, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, print, prod, range, sample, sum
hw3 = read.csv("http://dl.dropbox.com/u/7315092/Data/cps.csv")
mod1 = lm(wage ~ educ * sector, hw3)
summary(mod1)
##
## Call:
## lm(formula = wage ~ educ * sector, data = hw3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.14 -2.86 -0.89 2.15 32.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0153 3.5434 1.98 0.0483 *
## educ 0.0315 0.2716 0.12 0.9078
## sectorconst 7.8002 7.7397 1.01 0.3140
## sectormanag -12.6805 5.4636 -2.32 0.0207 *
## sectormanuf -6.5493 4.3419 -1.51 0.1321
## sectorother -1.6115 5.1503 -0.31 0.7545
## sectorprof -7.3209 4.9085 -1.49 0.1364
## sectorsales -8.8876 6.8630 -1.30 0.1959
## sectorservice -3.1976 4.3185 -0.74 0.4594
## educ:sectorconst -0.5080 0.6681 -0.76 0.4474
## educ:sectormanag 1.2283 0.3916 3.14 0.0018 **
## educ:sectormanuf 0.6449 0.3488 1.85 0.0650 .
## educ:sectorother 0.2304 0.4142 0.56 0.5782
## educ:sectorprof 0.7521 0.3466 2.17 0.0305 *
## educ:sectorsales 0.6850 0.5183 1.32 0.1869
## educ:sectorservice 0.2029 0.3424 0.59 0.5536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.51 on 518 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.23
## F-statistic: 11.6 on 15 and 518 DF, p-value: <2e-16
QUESTION 3:
| Model | Var(residuals) | Var(fitted values) | Total | Var(response) | R2 |
|---|---|---|---|---|---|
| mod1 | 42.58 | 26.21 | 68.79 | 68.79 | 0.38 |
| mod2 | 68.73 | 0.059 | 68.79 | 68.79 | 0.0009 |
| mod3 | 31.40 | 37.39 | 68.79 | 68.79 | 0.54 |
b) mod3 accounts for about 54% of the variance in BodyFat using a person's weight and height.
c) R2 is the proportion of the variance in the response that the model accounts for, measured on a scale from 0 to 1. A value near 0 means the model explains little of the variation, while a value of 1 means the fitted values match the response exactly. The value for mod3 makes intuitive sense because including both a person's weight and height gives a more accurate picture of their body fat percentage, so the model captures more of the variance in the response. A model that uses only height does not account for as much variance: without also knowing a person's weight we cannot tell how their body mass is distributed, so the model of body fat percentage is less accurate.
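The variance bookkeeping behind the table can be checked on any least-squares fit that includes an intercept. The sketch below uses simulated data (hypothetical variables x1, x2 and y, not the body fat data) to show that the fitted and residual variances add up to the response variance, and that their ratio reproduces R2.

```r
# Simulated data (hypothetical; not the bodyfat data set).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# For least squares with an intercept, these two variances sum to var(y).
v_fit   <- var(fitted(fit))
v_resid <- var(resid(fit))
all.equal(v_fit + v_resid, var(y))

# R^2 as the share of response variance captured by the fitted values.
r2 <- v_fit / var(y)
all.equal(r2, summary(fit)$r.squared)
```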
bf = read.csv("http://dl.dropbox.com/u/7315092/Data/bodyfatsub.csv")
mod1 = lm(BodyFat ~ Weight, bf)
mean(mod1$resid^2)
## [1] 42.41
var(mod1$resid)
## [1] 42.58
var(mod1$fitted)
## [1] 26.21
var(bf$BodyFat)
## [1] 68.79
mod2 = lm(BodyFat ~ Height, bf)
var(mod2$resid)
## [1] 68.73
var(mod2$fitted)
## [1] 0.05942
var(mod2$resid) + var(mod2$fitted)
## [1] 68.79
mod3 = lm(BodyFat ~ Weight + Height, bf)
var(mod3$resid)
## [1] 31.4
var(mod3$fitted)
## [1] 37.39
# R^2 = var(fitted values) / var(response)
26.21/68.79    # mod1: about 0.381
0.05942/68.79  # mod2: about 0.00086
37.39/68.79    # mod3: about 0.544
QUESTION 4: From the summary of hh we see that the mean hdd is 627. The summary also shows the scale of the hdd measurements: the minimum is 0, the 3rd quartile is 1130, and the maximum is 1747. A simple xyplot shows how hdd varies throughout the year: hdd is higher in the winter months and lowest between months 6 and 8, that is, during the summer.
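The summary and plot described above might be produced along the following lines. The data frame here is a hypothetical stand-in with invented hdd values, and the column name month is an assumption, since the code for this question is not shown.

```r
library(lattice)

# Hypothetical stand-in for hh: one invented hdd value per calendar month.
toy <- data.frame(
  month = 1:12,
  hdd   = c(1500, 1200, 900, 500, 200, 20, 0, 10, 150, 500, 900, 1400)
)

# Minimum, quartiles, mean and maximum, to see the scale of hdd.
hdd_summary <- summary(toy$hdd)
hdd_summary

# Scatterplot of hdd against month; print() draws the lattice object.
p <- xyplot(hdd ~ month, data = toy, type = c("p", "l"))
print(p)

# The three months with the lowest invented hdd values.
low_months <- toy$month[order(toy$hdd)][1:3]
low_months
```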
QUESTION 5: a) Based on the boxplot, we see that homes have higher energy use on average after renovation than they did before renovation.
b) In this model, houses before renovation are the reference level. From the coefficients we see that for each additional month, energy use decreases by 13.44 therms. If the house is renovated, the intercept increases by 56.76.
hh = read.csv("http://www.macalester.edu/~ajohns24/data/MacNaturalGas.csv")
boxplot(therms ~ renovated + address, hh)
QUESTION 6: a) This model does not account for actual temperatures. It assumes that the temperatures in the winters before and after renovation were the same. We should not yet use the above model because it does not account for hdd, the monthly heating degree days. The density plot shows a higher density of low-hdd months for the non-renovated homes and a higher density of high-hdd months (the winter months) for the renovated homes.
b) mod4 = lm(therms ~ hdd + renovated, hh)
This model holds hdd constant so we can explore the effect of renovation (insulation) on energy use regardless of the time of year. We can obtain the following equations from the coefficients:
Energy use before renovation = 23.51 + 0.21hdd
Energy use after renovation = 21.40 + 0.21hdd
We see that the intercept for energy use after renovation is smaller than before renovation. Because the two lines have the same slope, predicted energy use after renovation is lower than before renovation at every value of hdd, by about 2.11 therms. The summary output shows, however, that this difference is not statistically significant (p = 0.869).
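To see how small the renovation shift is relative to its uncertainty, the two fitted lines and an approximate 95% confidence interval can be computed from the reported estimates. This is a sketch that hard-codes the numbers from the summary(mod4) output; the hdd values in hdd_grid are hypothetical and chosen only for illustration.

```r
# Coefficients and standard error copied from the summary(mod4) output.
b0       <- 23.5143   # (Intercept): before renovation, hdd = 0
b_hdd    <- 0.2130    # therms per additional heating degree day
b_renov  <- -2.1115   # renovatedyes: shift in intercept after renovation
se_renov <- 12.7961   # standard error of renovatedyes
df_resid <- 96        # residual degrees of freedom

# Predicted therms for a given hdd and renovation status.
predict_therms <- function(hdd, renovated) {
  b0 + b_hdd * hdd + b_renov * as.numeric(renovated)
}

hdd_grid <- c(0, 500, 1000)   # hypothetical hdd values for illustration
before <- predict_therms(hdd_grid, FALSE)
after  <- predict_therms(hdd_grid, TRUE)
after - before   # constant gap of -2.1115 at every hdd

# Approximate 95% confidence interval for the renovation effect.
ci_renov <- b_renov + c(-1, 1) * qt(0.975, df_resid) * se_renov
ci_renov
```

The interval spans zero, so the data are compatible with renovation either lowering or raising energy use once hdd is held constant.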
hh = read.csv("http://www.macalester.edu/~ajohns24/data/MacNaturalGas.csv")
# Density of hdd, drawn separately for renovated and non-renovated homes
densityplot(~hdd, groups = renovated, data = hh, auto.key = TRUE)
mod4 = lm(therms ~ hdd + renovated, hh)
summary(mod4)
##
## Call:
## lm(formula = therms ~ hdd + renovated, data = hh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.53 -21.32 -5.53 29.11 187.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.5143 9.1934 2.56 0.012 *
## hdd 0.2130 0.0102 20.93 <2e-16 ***
## renovatedyes -2.1115 12.7961 -0.17 0.869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59 on 96 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.825, Adjusted R-squared: 0.821
## F-statistic: 226 on 2 and 96 DF, p-value: <2e-16