Question 1: A) The confounding variable is the educational makeup of the workforce: a larger share of workers now hold a college degree. Because college-educated workers earn more, this shift raises the overall median wage even though the median wage within each educational subgroup has fallen. B) The shift in composition makes the data look contradictory. The median wage is lower within every educational subgroup, yet the median for the workforce as a whole is higher, because more workers now fall into the higher-paid subgroups. Thus, workers at any given education level are earning less, but the workforce as a whole is earning slightly more than it did in 2000.
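The composition effect described in Question 1 can be reproduced with a small sketch. The wage figures and group sizes below are made up for illustration and are not the data from the question; only base R is assumed.

```r
# Hypothetical wages (NOT the data from the question): in the later period
# each education group earns less, but more workers are in the college group.
wages_2000 <- data.frame(
  educ = c(rep("no_college", 8), rep("college", 2)),
  wage = c(rep(20, 8), rep(40, 2))
)
wages_now <- data.frame(
  educ = c(rep("no_college", 3), rep("college", 7)),
  wage = c(rep(18, 3), rep(36, 7))
)

# Median wage within each education group: falls in both groups.
by_group_2000 <- tapply(wages_2000$wage, wages_2000$educ, median)
by_group_now  <- tapply(wages_now$wage, wages_now$educ, median)

# Median wage for the whole workforce: rises, because the mix shifted.
overall_2000 <- median(wages_2000$wage)
overall_now  <- median(wages_now$wage)

print(by_group_2000)
print(by_group_now)
print(c(overall_2000 = overall_2000, overall_now = overall_now))
```

In this toy example both group medians drop while the overall median rises, which is the pattern the answer attributes to the larger college-educated workforce.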
Question 2: A) educ: This is a quantitative variable (years of education), not a categorical one. Its coefficient, 0.0315, is the increase in wage for each additional year of education in the reference sector (the sector not listed in the output). The intercept, 7.0153, is the predicted wage in that sector at zero years of education; it is not added to the slope. educ:sectorconst: This coefficient is the interaction between education and the const sector. It is negative (-0.5080), so the education slope in const is 0.0315 - 0.5080 = -0.4765, lower than in the reference sector. educ:sectormanag: This coefficient is the interaction between education and the manag sector. It is positive (1.2283), so each additional year of education raises wages more for managers than for the reference sector. B) The manag sector benefits the most from education: its education slope is 1.2283 larger than the reference sector's, for a total of 0.0315 + 1.2283 = 1.2598 per year of education. C) In this model, the increase in wage with education depends on the sector. Even if individuals in all sectors have the same years of education, they are paid differently depending on their sector or occupation. Thus, you cannot state a single relationship between education and wages, because it varies across jobs.
library(mosaic)
## Loading required package: grid
## Loading required package: lattice
##
## Attaching package: 'mosaic'
##
## The following objects are masked from 'package:stats':
##
## D, IQR, binom.test, cor, cov, fivenum, median, prop.test, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, print, prod, range, sample, sum
hw1 = read.csv("http://dl.dropbox.com/u/7315092/Data/cps.csv")
mod1 = lm(wage ~ educ * sector, hw1)
summary(mod1)
##
## Call:
## lm(formula = wage ~ educ * sector, data = hw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.14 -2.86 -0.89 2.15 32.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0153 3.5434 1.98 0.0483 *
## educ 0.0315 0.2716 0.12 0.9078
## sectorconst 7.8002 7.7397 1.01 0.3140
## sectormanag -12.6805 5.4636 -2.32 0.0207 *
## sectormanuf -6.5493 4.3419 -1.51 0.1321
## sectorother -1.6115 5.1503 -0.31 0.7545
## sectorprof -7.3209 4.9085 -1.49 0.1364
## sectorsales -8.8876 6.8630 -1.30 0.1959
## sectorservice -3.1976 4.3185 -0.74 0.4594
## educ:sectorconst -0.5080 0.6681 -0.76 0.4474
## educ:sectormanag 1.2283 0.3916 3.14 0.0018 **
## educ:sectormanuf 0.6449 0.3488 1.85 0.0650 .
## educ:sectorother 0.2304 0.4142 0.56 0.5782
## educ:sectorprof 0.7521 0.3466 2.17 0.0305 *
## educ:sectorsales 0.6850 0.5183 1.32 0.1869
## educ:sectorservice 0.2029 0.3424 0.59 0.5536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.51 on 518 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.23
## F-statistic: 11.6 on 15 and 518 DF, p-value: <2e-16
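As a rough check on the interaction terms, the education slope for each sector can be computed by adding the `educ` coefficient to that sector's `educ:sector...` coefficient. The sketch below copies the estimates from the `summary(mod1)` output rather than refitting the model; the variable names are illustrative, and the label `reference` stands for whichever sector R used as the baseline.

```r
# Estimates copied from the summary(mod1) output above.
educ_slope_reference <- 0.0315
interaction <- c(const = -0.5080, manag = 1.2283, manuf = 0.6449,
                 other = 0.2304, prof = 0.7521, sales = 0.6850,
                 service = 0.2029)

# Slope of wage on educ within each sector = reference slope + interaction.
sector_slopes <- c(reference = educ_slope_reference,
                   educ_slope_reference + interaction)

print(round(sector_slopes, 4))

# Sector with the largest wage gain per extra year of education.
best_sector <- names(which.max(sector_slopes))
print(best_sector)
```

With the fitted model available, the same values could be read from `coef(mod1)` instead of being typed in by hand.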
Question 3: A) mod1= 42.5772 (VarResiduals) 26.21319 (VarFitted) 68.79039 (Total) 68.79044 (VarResponse) 0.3811 (R2)
mod2= 68.73103 (VarResiduals) 0.0594176 (VarFitted) 68.79045 (Total) 68.79044 (VarResponse) 0.0008637 (R2)
mod3= 31.40126 (VarResiduals) 37.38919 (VarFitted) 68.79045 (Total) 68.79044 (VarResponse) 0.5435 (R2)
B) The coefficient of determination, R2, is VarFitted divided by VarResponse, which gives 0.5435 for mod3. Dividing VarFitted by VarResiduals instead gives 1.190691, but that ratio is not R2 and is not bounded by 1. An R2 of 1 would mean the model fits perfectly, and because VarFitted and VarResiduals add up to VarResponse, R2 cannot exceed 1.
C) An R2 value above 1 therefore indicates a calculation error, not nonsense or irrelevant variables in the model. Adding Height to Weight raises R2 from 0.3811 (mod1) to 0.5435 (mod3), which is more than the 0.0009 that Height explains on its own in mod2; this can happen when the explanatory variables are correlated with each other.
bf = read.csv("http://dl.dropbox.com/u/7315092/Data/bodyfatsub.csv")
mod1 = lm(BodyFat ~ Weight, bf)
mod2 = lm(BodyFat ~ Height, bf)
mod3 = lm(BodyFat ~ Weight + Height, bf)
var(mod1$resid)
## [1] 42.58
var(mod1$fitted)
## [1] 26.21
var(mod2$resid)
## [1] 68.73
var(mod2$fitted)
## [1] 0.05942
var(mod3$resid)
## [1] 31.4
var(mod3$fitted)
## [1] 37.39
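R2 can be computed from these variances as the variance of the fitted values divided by the variance of the response. The sketch below uses the variances reported in the Question 3 answer rather than refitting the models; the vector names are illustrative.

```r
# Variances taken from the Question 3 answer (response = BodyFat).
var_response <- 68.79044
var_fitted <- c(mod1 = 26.21319, mod2 = 0.0594176, mod3 = 37.38919)
var_resid  <- c(mod1 = 42.5772,  mod2 = 68.73103,  mod3 = 31.40126)

# R2 = variance of fitted values / variance of the response.
r_squared <- var_fitted / var_response
print(round(r_squared, 4))

# Fitted and residual variances should add up to the response variance
# (up to rounding in the reported figures).
print(var_fitted + var_resid)

# Dividing by the residual variance instead is a different ratio
# and is not bounded by 1.
print(var_fitted / var_resid)
```

With the fitted models in memory, `summary(mod3)$r.squared` reports the same quantity directly.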
Question 4: Hdd refers to monthly heating degree days; in other words, it measures how much heating the houses needed in response to colder outside temperatures. The study covers January (month 1) through December (month 12). The xyplot, density plot, and boxplot below illustrate how therms vary with hdd. As the graphs show, hdd increased when temperatures were colder and decreased during the summer months when the temperature increased. In the model of energy expended (therms) against hdd, the residuals range from -132.59 to 188.21 therms; these are prediction errors, not the minimum and maximum therms used. The intercept, about 23 therms, is the predicted usage in a month with zero hdd, and each additional unit of hdd adds about 0.213 therms.
Question 5:
A) As the xyplot (therms ~ month) and the boxplot demonstrate, the renovated houses showed an increase in therms expended compared to the houses that were not renovated. Thus, it appears as if the renovated houses used energy less efficiently than before the renovation.
B) In the model, the intercept of 228.96 therms is the predicted monthly usage for a house that was not renovated at month 0; it is not an average. Holding month constant, the renovated houses expended 56.76 more therms per month on average.
Question 6:
A) The density plot shows the distribution of hdd before and after the renovation. From the plot, we can see that there were more warm months (low hdd) before the houses were renovated. After they were renovated, the temperatures were colder, so hdd was higher. Thus, the renovated houses were not less efficient; it was merely a colder year.
B) The model supports this argument: once hdd is accounted for, renovation is associated with a decrease of 2.1115 therms compared with an unrenovated house at the same hdd. The intercept, 23.514, is the predicted therms for an unrenovated house at zero hdd, not the average. The decrease is small relative to its standard error (p = 0.869), so the model shows no clear difference in efficiency. As hdd increased, therms increased by about 0.213 per unit of hdd.
gas = read.csv("http://www.macalester.edu/~ajohns24/data/MacNaturalGas.csv")
xyplot(therms ~ hdd, gas)
densityplot(therms ~ hdd, gas)
boxplot(therms ~ hdd, gas)
mod = lm(therms ~ hdd, gas)
summary(mod)
##
## Call:
## lm(formula = therms ~ hdd, data = gas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.59 -20.42 -6.01 28.04 188.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.00743 8.62147 2.67 0.0089 **
## hdd 0.21271 0.00995 21.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.7 on 97 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.825, Adjusted R-squared: 0.823
## F-statistic: 457 on 1 and 97 DF, p-value: <2e-16
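To see what the intercept and slope mean in practice, the fitted line can be evaluated at a few hdd values. The sketch below copies the two coefficients from the `summary(mod)` output; `predict_therms` is an illustrative helper, and the hdd values 0, 500, and 1000 are arbitrary examples rather than values from the data.

```r
# Coefficients copied from the summary(mod) output above.
intercept <- 23.00743
slope_hdd <- 0.21271

# Predicted therms for a given number of heating degree days.
predict_therms <- function(hdd) {
  intercept + slope_hdd * hdd
}

# Example hdd values (hypothetical, chosen only for illustration).
example_hdd <- c(0, 500, 1000)
predicted <- predict_therms(example_hdd)
print(data.frame(hdd = example_hdd, therms = predicted))
```

At zero hdd the prediction equals the intercept, and each additional 100 hdd adds about 21 therms.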
xyplot(hdd ~ month, gas)
xyplot(therms ~ month, groups = renovated, gas, auto.key = TRUE)
boxplot(therms ~ renovated + address, gas)
mod3 = lm(therms ~ month + renovated, gas)
summary(mod3)
##
## Call:
## lm(formula = therms ~ month + renovated, data = gas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.1 -113.4 -27.5 93.4 375.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 228.96 29.25 7.83 6.5e-12 ***
## month -13.44 3.76 -3.58 0.00055 ***
## renovatedyes 56.76 27.98 2.03 0.04528 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131 on 96 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.141, Adjusted R-squared: 0.123
## F-statistic: 7.86 on 2 and 96 DF, p-value: 0.000692
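The coefficients of this model can be read as predictions for a given month and renovation status. The sketch below copies the estimates from the `summary(mod3)` output; `predict_month_model` is an illustrative helper, and month 1 is used only as an example.

```r
# Coefficients copied from the summary(mod3) output above.
b_intercept <- 228.96
b_month     <- -13.44
b_renovated <- 56.76

# Predicted therms for a month (1-12) and renovation status (TRUE/FALSE).
predict_month_model <- function(month, renovated) {
  b_intercept + b_month * month + b_renovated * as.numeric(renovated)
}

not_renovated <- predict_month_model(1, FALSE)
renovated     <- predict_month_model(1, TRUE)

print(c(not_renovated = not_renovated, renovated = renovated))

# The gap between the two is the renovatedyes coefficient, in any month.
print(renovated - not_renovated)
```

The intercept on its own corresponds to month 0, which lies outside the 1 to 12 range, so it is not a typical monthly usage.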
densityplot(gas$hdd, groups = gas$renovated, auto.key = TRUE)
mod6 = lm(therms ~ hdd + renovated, gas)
summary(mod6)
##
## Call:
## lm(formula = therms ~ hdd + renovated, data = gas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.53 -21.32 -5.53 29.11 187.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.5143 9.1934 2.56 0.012 *
## hdd 0.2130 0.0102 20.93 <2e-16 ***
## renovatedyes -2.1115 12.7961 -0.17 0.869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59 on 96 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.825, Adjusted R-squared: 0.821
## F-statistic: 226 on 2 and 96 DF, p-value: <2e-16
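One way to judge the `renovatedyes` estimate is an approximate 95% confidence interval built from its standard error. The sketch below copies the estimate, standard error, and residual degrees of freedom from the `summary(mod6)` output; it is a hand calculation and not part of the original analysis.

```r
# Values copied from the summary(mod6) output above.
estimate  <- -2.1115   # renovatedyes coefficient
std_error <- 12.7961   # its standard error
df_resid  <- 96        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 2))

# TRUE if zero lies inside the interval.
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```

With the fitted model in memory, `confint(mod6)` gives the same kind of interval for every coefficient.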