The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex, status, income and verbal score as predictors. Present the output.
data(teengamb, package = "faraway") # load the data
lmod <- lm(gamble ~ sex + status + income + verbal, data = teengamb)
summary(lmod) # the summary of the fitted model
##
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sex -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
lmod$residual
## 1 2 3 4 5 6
## 10.6507430 9.3711318 5.4630298 -17.4957487 29.5194692 -2.9846919
## 7 8 9 10 11 12
## -7.0242994 -12.3060734 6.8496267 -10.3329505 1.5934936 -3.0958161
## 13 14 15 16 17 18
## 0.1172839 9.5331344 2.8488167 17.2107726 -25.2627227 -27.7998544
## 19 20 21 22 23 24
## 13.1446553 -15.9510624 -16.0041386 -9.5801478 -27.2711657 94.2522174
## 25 26 27 28 29 30
## 0.6993361 -9.1670510 -25.8747696 -8.7455549 -6.8803097 -19.8090866
## 31 32 33 34 35 36
## 10.8793766 15.0599340 11.7462296 -3.5932770 -14.4016736 45.6051264
## 37 38 39 40 41 42
## 20.5472529 11.2429290 -51.0824078 8.8669438 -1.4513921 -3.8361619
## 43 44 45 46 47
## -4.3831786 -14.8940753 5.4506347 1.4092321 7.1662399
The percentage of variation in the response explained by the predictors is given by \(R^2\), the Multiple R-squared in the summary: 0.5267, i.e. about 52.7%.
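As a quick check, the same value can be pulled straight from the summary object (a minimal sketch reusing the lmod fit above):
summary(lmod)$r.squared # proportion of variation explained, about 0.5267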
max(lmod$residual) # (b)
## [1] 94.25222
The largest residual is 94.2522, and it belongs to case number 24, as can be seen in the residual listing above.
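The case can also be identified programmatically rather than read off the printout (a minimal sketch):
which.max(lmod$residual) # index of the largest residual, case 24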
mean(lmod$residual) # the mean of the residuals
## [1] -3.065293e-17
median(lmod$residual) # the median of the residuals
## [1] -1.451392
The mean of the residuals is -3.0653e-17, which is effectively zero (as it must be for a least squares fit with an intercept), and the median is -1.4514.
cor(lmod$residual, lmod$fitted.values) # (d)
## [1] -1.070659e-16
The correlation of the residuals with the fitted values is -1.070659e-16, i.e. zero up to floating-point error, as expected since least squares residuals are orthogonal to the fitted values.
cor(lmod$residual, teengamb$income) # the correlation of the residuals with income
## [1] -7.242382e-17
The correlation of the residuals with income is -7.242382e-17, again effectively zero, because income is one of the predictors in the model.
teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c("male", "female")
linearmodel <- lm(gamble ~ sex + status + income + verbal, teengamb)
linearmodel$coefficients
## (Intercept) sexfemale status income verbal
## 22.55565063 -22.11833009 0.05223384 4.96197922 -2.95949350
We know that 1 denotes female and 0 denotes male in the sex column (from the description of faraway::teengamb). The coefficient for sexfemale is -22.11833, so, holding status, income and verbal score fixed, the predicted expenditure on gambling for a female is about 22.12 pounds per year lower than for a male.
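The same difference can be illustrated by predicting for two hypothetical teenagers who differ only in sex; the status, income and verbal values below are made up purely for illustration:
newdat <- data.frame(sex = factor(c("male", "female"), levels = c("male", "female")),
                     status = 50, income = 5, verbal = 7) # hypothetical covariate values
diff(predict(linearmodel, newdat)) # about -22.12, matching the sexfemale coefficient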
The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Fit a model with weekly wages as the response and years of education and experience as predictors. Report and give a simple interpretation to the regression coefficient for years of education. Now fit the same model but with logged weekly wages. Give an interpretation to the regression coefficient for years of education. Which interpretation is more natural?
data(uswages, package="faraway") # import data
lmod2 <- lm(wage ~ educ + exper, data = uswages) # wage as the response (y) and years of education and experience as predictors (x)
summary(lmod2)
##
## Call:
## lm(formula = wage ~ educ + exper, data = uswages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1018.2 -237.9 -50.9 149.9 7228.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -242.7994 50.6816 -4.791 1.78e-06 ***
## educ 51.1753 3.3419 15.313 < 2e-16 ***
## exper 9.7748 0.7506 13.023 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 427.9 on 1997 degrees of freedom
## Multiple R-squared: 0.1351, Adjusted R-squared: 0.1343
## F-statistic: 156 on 2 and 1997 DF, p-value: < 2.2e-16
Give a simple interpretation to the regression coefficient for years of education:
lmod2$coefficients
## (Intercept) educ exper
## -242.799412 51.175268 9.774767
As we can see above, the coefficients are (Intercept) -242.7994, educ 51.1753 and exper 9.7748. The interpretation is that, holding experience fixed, each additional year of education is associated with an increase of about 51.18 dollars in weekly wages.
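A quick illustration of that interpretation, using a hypothetical worker with ten years of experience (the covariate values are made up only to show the effect of one extra year of education):
newwages <- data.frame(educ = c(12, 13), exper = 10) # hypothetical values
diff(predict(lmod2, newwages)) # about 51.18 dollars more per week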
lmod2logged <- lm(log(wage) ~ educ + exper, data = uswages) # the second model, with logged wages
summary(lmod2logged)
##
## Call:
## lm(formula = log(wage) ~ educ + exper, data = uswages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7533 -0.3495 0.1068 0.4381 3.5699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.650319 0.078354 59.35 <2e-16 ***
## educ 0.090506 0.005167 17.52 <2e-16 ***
## exper 0.018079 0.001160 15.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6615 on 1997 degrees of freedom
## Multiple R-squared: 0.1749, Adjusted R-squared: 0.174
## F-statistic: 211.6 on 2 and 1997 DF, p-value: < 2.2e-16
Give an interpretation to the regression coefficient for years of education.
lmod2logged$coefficients
## (Intercept) educ exper
## 4.65031905 0.09050628 0.01807855
As we can see above, the second model's coefficients are (Intercept) 4.650319, educ 0.090506 and exper 0.018079. The interpretation is that, with experience held fixed, a one-year increase in education multiplies the weekly wage by \(e^{0.090506} \approx 1.0947\), i.e. wages increase by about 9.47%.
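This multiplicative effect can be computed directly from the fitted model (a minimal sketch):
exp(coef(lmod2logged)["educ"]) # about 1.0947, roughly a 9.47% increase per extra year of education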
For the first model, extreme values of the covariates can produce a negative predicted wage, which is not meaningful; the problem is avoided in the second model, since exponentiating the predicted log wage always gives a positive value. So the second (log-scale) interpretation is more natural.
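For instance, the untransformed model already predicts a negative wage for a hypothetical worker with zero years of education and experience (the prediction is just the intercept), while the logged model still returns a positive wage:
predict(lmod2, data.frame(educ = 0, exper = 0)) # about -242.80, a negative weekly wage
exp(predict(lmod2logged, data.frame(educ = 0, exper = 0))) # about 105, still positive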
The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the \(R^2\). Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a time. For each model record the residual standard error and the \(R^2\). Plot the trends in these two statistics.
data(prostate, package="faraway") # load the data
lmod3 <- lm(lpsa ~ lcavol, data = prostate)
lmod3_sum <- summary(lmod3) # for plotting the trends
summary(lmod3)
##
## Call:
## lm(formula = lpsa ~ lcavol, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67625 -0.41648 0.09859 0.50709 1.89673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.50730 0.12194 12.36 <2e-16 ***
## lcavol 0.71932 0.06819 10.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7875 on 95 degrees of freedom
## Multiple R-squared: 0.5394, Adjusted R-squared: 0.5346
## F-statistic: 111.3 on 1 and 95 DF, p-value: < 2.2e-16
As we can see in the summary of lmod3, the residual standard error is 0.7875 on 95 degrees of freedom, and the \(R^2\) is 0.5394.
lmod3_2 <- lm(lpsa ~ lcavol + lweight + svi + lbph + age + lcp + pgg45 + gleason, data = prostate)
summary(lmod3_2)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + lbph + age + lcp +
## pgg45 + gleason, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## svi 0.766157 0.244309 3.136 0.00233 **
## lbph 0.107054 0.058449 1.832 0.07040 .
## age -0.019637 0.011173 -1.758 0.08229 .
## lcp -0.105474 0.091013 -1.159 0.24964
## pgg45 0.004525 0.004421 1.024 0.30886
## gleason 0.045142 0.157465 0.287 0.77503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
As we can see in the summary of the full model, the residual standard error is 0.7084 on 88 degrees of freedom, and the \(R^2\) is 0.6548.
# record the residual standard error and R^2 as the remaining predictors are added
# one at a time (in the order given in the question), then plot the trends
R2 <- lmod3_sum$r.squared # start with the lcavol-only model
Sgma <- lmod3_sum$sigma
preds <- c("lweight", "svi", "lbph", "age", "lcp", "pgg45", "gleason")
for (i in 1:7) {
  model_temp <- lm(reformulate(c("lcavol", preds[1:i]), response = "lpsa"), data = prostate)
  lmod_temp_sum <- summary(model_temp)
  R2[i + 1] <- lmod_temp_sum$r.squared
  Sgma[i + 1] <- lmod_temp_sum$sigma
}
a <- 1:8 # number of predictors in each model
plot(a, Sgma, type = "l", main = "Residual standard error and R^2", xlab = "Number of predictors", ylab = "Value", ylim = c(.40, .90))
points(a, R2, type = "l", lty = 2)
legend("topright", legend = c("Residual standard error", "R^2"), lty = c(1, 2))
Thirty samples of cheddar cheese were analyzed for their content of acetic acid, hydrogen sulfide and lactic acid. Each sample was tasted and scored by a panel of judges and the average taste score produced. Use the cheddar data to answer the following:
data("cheddar", package="faraway")
lmod4 <- lm(taste ~ Acetic + H2S + Lactic, data = cheddar)
summary(lmod4)
##
## Call:
## lm(formula = taste ~ Acetic + H2S + Lactic, data = cheddar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.390 -6.612 -1.009 4.908 25.449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.8768 19.7354 -1.463 0.15540
## Acetic 0.3277 4.4598 0.073 0.94198
## H2S 3.9118 1.2484 3.133 0.00425 **
## Lactic 19.6705 8.6291 2.280 0.03108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 26 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
## F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
The regression coefficients are:
lmod4$coefficients
## (Intercept) Acetic H2S Lactic
## -28.8767696 0.3277413 3.9118411 19.6705434
cor_value <- cor(lmod4$fitted.values, cheddar$taste) # the correlation between the fitted values and the response
squared_value <- cor_value^2 # squared value
The correlation between the fitted values and the response is 0.8073256. If we square it, we get \(0.8073256^2 = 0.6518\), which is the multiple R-squared.
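As a sanity check, the squared correlation can be compared directly with the \(R^2\) reported by summary (a minimal sketch):
all.equal(squared_value, summary(lmod4)$r.squared) # should be TRUE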
# we could get the regression model without an intercept term by adding -1 or 0. I tried both ways just in case.
lmod4_2 <- lm(taste ~ Acetic + H2S + Lactic - 1, data = cheddar) # method 1
summary(lmod4_2)
##
## Call:
## lm(formula = taste ~ Acetic + H2S + Lactic - 1, data = cheddar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4521 -6.5262 -0.6388 4.6811 28.4744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Acetic -5.454 2.111 -2.583 0.01553 *
## H2S 4.576 1.187 3.854 0.00065 ***
## Lactic 19.127 8.801 2.173 0.03871 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.34 on 27 degrees of freedom
## Multiple R-squared: 0.8877, Adjusted R-squared: 0.8752
## F-statistic: 71.15 on 3 and 27 DF, p-value: 6.099e-13
lmod4_3 <- lm(taste ~ 0 + Acetic + H2S + Lactic, data = cheddar) # method 2
summary(lmod4_3)
##
## Call:
## lm(formula = taste ~ 0 + Acetic + H2S + Lactic, data = cheddar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4521 -6.5262 -0.6388 4.6811 28.4744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Acetic -5.454 2.111 -2.583 0.01553 *
## H2S 4.576 1.187 3.854 0.00065 ***
## Lactic 19.127 8.801 2.173 0.03871 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.34 on 27 degrees of freedom
## Multiple R-squared: 0.8877, Adjusted R-squared: 0.8752
## F-statistic: 71.15 on 3 and 27 DF, p-value: 6.099e-13
cor_value_2 <- cor(lmod4_2$fitted.values, cheddar$taste) # the correlation between the fitted values and the response
squared_value_2 <- cor_value_2^2 # squared value
squared_value_2
## [1] 0.6244075
To fit the same regression model without an intercept term, we add 0 or -1 to the formula; the output then reports a Multiple R-squared of 0.8877. But, as shown above, the squared correlation between the fitted values and the response for the no-intercept model is only 0.6244, so the two quantities no longer agree. As the textbook notes, the usual \(R^2\) definition compares against a null model that contains an intercept; when the intercept is removed, R still applies a sum-of-squares definition, which gives a misleadingly high \(R^2\). When there is no intercept, \(R^2\) should instead be computed with the \(\mathrm{cor}^2(\hat{y}, y)\) definition, so the more reasonable measure of goodness of fit for this model is 0.6244.
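A compact side-by-side comparison of the two measures for the no-intercept fit (a minimal sketch reusing the objects above):
c(reported_R2 = summary(lmod4_2)$r.squared, cor_squared = squared_value_2) # 0.8877 vs 0.6244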
Prove if \(\sum\limits_{i=1}^{n} e_{i} = 0\) then \(\sum\limits_{i=1}^{n} y_{i} = \sum\limits_{i=1}^{n} \hat{y}_{i}\).
Let’s assume \(\sum\limits_{i=1}^{n} e_{i} = 0\) first.
We know that \(y_i = \hat{y}_i + e_i\), where \(\hat{y}_i\) is the fitted value and \(e_i\) is the residual.
Summing both sides over \(i = 1, \dots, n\), we get
\[\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i + \sum_{i=1}^{n} e_i\] And since we assumed that \(\sum\limits_{i=1}^{n} e_{i} = 0\), the last term vanishes.
Thus we can conclude that \(\sum\limits_{i=1}^{n} y_{i} = \sum\limits_{i=1}^{n} \hat{y}_{i}\).
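This can also be verified numerically on one of the fitted models above, e.g. the teengamb fit (a minimal sketch; any tiny discrepancy is floating-point error):
c(sum(residuals(lmod)), sum(teengamb$gamble) - sum(fitted(lmod))) # both effectively zero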
Prove if \(\sum\limits_{i=1}^{n} x_{i}e_{i} = 0\) then \(\sum\limits_{i=1}^{n} \hat{y}_ie_{i} = 0\)
We assume that \(\sum\limits_{i=1}^{n} x_{i}e_{i} = 0\), and from the previous part we also know that \(\sum\limits_{i=1}^{n} e_{i} = 0\).
Each fitted value can be written as \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), so
\[\sum_{i=1}^{n} \hat{y}_i e_i = \sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 x_i) e_i = \hat{\beta}_0 \sum_{i=1}^{n} e_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i e_i = \hat{\beta}_0 \cdot 0 + \hat{\beta}_1 \cdot 0 = 0.\]
Thus \(\sum\limits_{i=1}^{n} \hat{y}_ie_{i} = 0\). (The same argument works with more predictors: \(\hat{y}_i\) is a linear combination of the columns of the design matrix, each of which has zero sum of cross-products with the residuals.)
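Again, a quick numerical check on the teengamb fit (a minimal sketch):
sum(fitted(lmod) * residuals(lmod)) # effectively zero, up to floating-point error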