$RSS(B_0, B_1) = _{n=1}^{n} [yi - (B_0 + B_1 x_i)]^2 $
$(SSE(B_0, B_1)) = -2 _{n=1}^{n} [yi - (B_0 + B_1 x_i)] = 0 $
$_{n=1}^{n} [y_i - (B_0 + B_1 x_i)] = 0 $
${n=1}^{n} y_i = {n=1}^{n} B_0 + B_1 _{n=1}^{n} x_i $
${n=1}^{n} y_i = n B_0 + B_1 {n=1}^{n} x_i $
\(y = B_0 + B_1 x_i\)
$min Sum _{n=1}^{n} [y_i - (y-(y-B_1 x_i + B_1 x_i)]^2 $
$min Sum _{n=1}^{n} [(y_i -y) - B_1(x_i - x)]^2 $
$ RSS(B_0, B_1) = -2 _{n=1}^{n} [(y_i -y) - B_1(x_i - x)] $
${n=1}^{n} (y_i -y) (x_i - x) - B_1 {n=1}^{n} (x_i - x)^2 = 0 $
$B_1 = _{n=1}^{n} } = $
$Y_i = B_0 + B_1 x_i + e_i $
\(e_i ~ N(0, \sigma^2)\)
\(Y_i ~ N(B_0 + B_1 x_1, \sigma^2)\)
$L(B_0, B_1) = _{n=1}^{n} e^- $
$log(L(B_0, B_1)) = _{n=1}^{n}[log() -] $
Maximizing this function is the same as minimizing the function in Question 1
\(\sum_{n=1}^{n} (y_i - B_1 x_1)^2\) $ -2 {n=1}^{n} (y_i - B_1 x_i) x_i = 0 $ $ {n=1}^{n} y_i x_i = B_1 _{n=1}^{n} x_i $ $ $
$ ^2 (X^T X)^{-1} = ^2 (_{n=1}^{n} (x_i)2){-1} $
library(faraway)
fit1 = lm(teengamb$gamble ~ teengamb$sex + teengamb$status + teengamb$income + teengamb$verbal, data=teengamb)
summary(fit1)
##
## Call:
## lm(formula = teengamb$gamble ~ teengamb$sex + teengamb$status +
## teengamb$income + teengamb$verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## teengamb$sex -22.11833 8.21111 -2.694 0.0101 *
## teengamb$status 0.05223 0.28111 0.186 0.8535
## teengamb$income 4.96198 1.02539 4.839 1.79e-05 ***
## teengamb$verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
52.67% of the variance is explained by the predictors.
fit1$residuals
## 1 2 3 4 5 6
## 10.6507430 9.3711318 5.4630298 -17.4957487 29.5194692 -2.9846919
## 7 8 9 10 11 12
## -7.0242994 -12.3060734 6.8496267 -10.3329505 1.5934936 -3.0958161
## 13 14 15 16 17 18
## 0.1172839 9.5331344 2.8488167 17.2107726 -25.2627227 -27.7998544
## 19 20 21 22 23 24
## 13.1446553 -15.9510624 -16.0041386 -9.5801478 -27.2711657 94.2522174
## 25 26 27 28 29 30
## 0.6993361 -9.1670510 -25.8747696 -8.7455549 -6.8803097 -19.8090866
## 31 32 33 34 35 36
## 10.8793766 15.0599340 11.7462296 -3.5932770 -14.4016736 45.6051264
## 37 38 39 40 41 42
## 20.5472529 11.2429290 -51.0824078 8.8669438 -1.4513921 -3.8361619
## 43 44 45 46 47
## -4.3831786 -14.8940753 5.4506347 1.4092321 7.1662399
max(fit1$residuals)
## [1] 94.25222
Point 24 seems to be the maximum value.
mean(fit1$residuals)
## [1] -3.065293e-17
median(fit1$residuals)
## [1] -1.451392
The mean is pretty much 0 and the median is -1.451392.
cor(fit1$residuals, fit1$fitted)
## [1] -1.070659e-16
cor(fit1$residuals, teengamb$income)
## [1] -7.242382e-17
Keeping all the predictors constant, gambling expenditure decreases by approximately 22 units for females.
summary(fit1)
##
## Call:
## lm(formula = teengamb$gamble ~ teengamb$sex + teengamb$status +
## teengamb$income + teengamb$verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## teengamb$sex -22.11833 8.21111 -2.694 0.0101 *
## teengamb$status 0.05223 0.28111 0.186 0.8535
## teengamb$income 4.96198 1.02539 4.839 1.79e-05 ***
## teengamb$verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
Sex and income are the only significant variables.
teengamb_avg <- data.frame(sex=0, income = mean(teengamb$income), status = mean(teengamb$status), verbal = mean(teengamb$verbal))
predict.lm(fit1, teengamb_avg, interval="prediction")
## Warning: 'newdata' had 1 row but variables found have 47 rows
## fit lwr upr
## 1 -10.6507430 -58.235514 36.93403
## 2 -9.3711318 -57.582578 38.84031
## 3 -5.4630298 -52.684957 41.75890
## 4 24.7957487 -23.289986 72.88148
## 5 -9.9194692 -58.782299 38.94336
## 6 3.0846919 -46.314094 52.48348
## 7 8.4742994 -38.868622 55.81722
## 8 18.9060734 -28.484901 66.29705
## 9 -5.1496267 -52.527591 42.22834
## 10 10.4329505 -37.782785 58.64869
## 11 -1.4934936 -49.540423 46.55344
## 12 8.4958161 -38.871827 55.86346
## 13 1.0827161 -47.058657 49.22409
## 14 -5.9331344 -53.380123 41.51385
## 15 -0.4488167 -47.554821 46.65719
## 16 -13.8107726 -61.480049 33.85850
## 17 25.3627227 -23.528720 74.25417
## 18 36.1998544 -12.259611 84.65932
## 19 -1.1446553 -48.490357 46.20105
## 20 15.9510624 -31.807109 63.70923
## 21 17.0041386 -29.920025 63.92830
## 22 10.7801478 -37.571437 59.13173
## 23 27.3711657 -19.927427 74.66976
## 24 61.7477826 13.204942 110.29062
## 25 37.8006639 -10.284912 85.88624
## 26 11.2670510 -37.026554 59.56066
## 27 40.3747696 -7.778386 88.52793
## 28 11.7455549 -36.500529 59.99164
## 29 7.4803097 -40.047511 55.00813
## 30 29.4090866 -17.545286 76.36346
## 31 77.1206234 26.140182 128.10107
## 32 38.1400660 -10.033590 86.31372
## 33 78.2537704 27.648150 128.85939
## 34 6.5932770 -41.586927 54.77348
## 35 28.5016736 -23.944532 80.94788
## 36 24.3948736 -22.450663 71.24041
## 37 17.9527471 -29.681396 65.58689
## 38 45.9570710 -2.016773 93.93091
## 39 57.0824078 9.241214 104.92360
## 40 16.1330562 -31.617014 63.88313
## 41 8.3513921 -40.032997 56.73578
## 42 73.5361619 21.294136 125.77819
## 43 17.6831786 -30.061617 65.42797
## 44 15.4940753 -31.849756 62.83791
## 45 32.5493653 -14.759029 79.85776
## 46 12.9907679 -34.754592 60.73613
## 47 12.0337601 -35.528795 59.59632
teengamb_max <- data.frame(sex=0, income = max(teengamb$income), status = max(teengamb$status), verbal = max(teengamb$verbal))
predict.lm(fit1, teengamb_max, interval="prediction")
## Warning: 'newdata' had 1 row but variables found have 47 rows
## fit lwr upr
## 1 -10.6507430 -58.235514 36.93403
## 2 -9.3711318 -57.582578 38.84031
## 3 -5.4630298 -52.684957 41.75890
## 4 24.7957487 -23.289986 72.88148
## 5 -9.9194692 -58.782299 38.94336
## 6 3.0846919 -46.314094 52.48348
## 7 8.4742994 -38.868622 55.81722
## 8 18.9060734 -28.484901 66.29705
## 9 -5.1496267 -52.527591 42.22834
## 10 10.4329505 -37.782785 58.64869
## 11 -1.4934936 -49.540423 46.55344
## 12 8.4958161 -38.871827 55.86346
## 13 1.0827161 -47.058657 49.22409
## 14 -5.9331344 -53.380123 41.51385
## 15 -0.4488167 -47.554821 46.65719
## 16 -13.8107726 -61.480049 33.85850
## 17 25.3627227 -23.528720 74.25417
## 18 36.1998544 -12.259611 84.65932
## 19 -1.1446553 -48.490357 46.20105
## 20 15.9510624 -31.807109 63.70923
## 21 17.0041386 -29.920025 63.92830
## 22 10.7801478 -37.571437 59.13173
## 23 27.3711657 -19.927427 74.66976
## 24 61.7477826 13.204942 110.29062
## 25 37.8006639 -10.284912 85.88624
## 26 11.2670510 -37.026554 59.56066
## 27 40.3747696 -7.778386 88.52793
## 28 11.7455549 -36.500529 59.99164
## 29 7.4803097 -40.047511 55.00813
## 30 29.4090866 -17.545286 76.36346
## 31 77.1206234 26.140182 128.10107
## 32 38.1400660 -10.033590 86.31372
## 33 78.2537704 27.648150 128.85939
## 34 6.5932770 -41.586927 54.77348
## 35 28.5016736 -23.944532 80.94788
## 36 24.3948736 -22.450663 71.24041
## 37 17.9527471 -29.681396 65.58689
## 38 45.9570710 -2.016773 93.93091
## 39 57.0824078 9.241214 104.92360
## 40 16.1330562 -31.617014 63.88313
## 41 8.3513921 -40.032997 56.73578
## 42 73.5361619 21.294136 125.77819
## 43 17.6831786 -30.061617 65.42797
## 44 15.4940753 -31.849756 62.83791
## 45 32.5493653 -14.759029 79.85776
## 46 12.9907679 -34.754592 60.73613
## 47 12.0337601 -35.528795 59.59632
The maximum prediction interval is wider as expected since there are not as many observations around the maximum as there are around the mean. On top of this being a prediction interval, we are also extrapolating at this point since we are going beyond the maximum on the higher tail of the interval.
income_gamb <- lm(gamble ~ income , data=teengamb)
anova(fit1, income_gamb)
## Warning in anova.lmlist(object, ...): models with response '"gamble"'
## removed because response differs from model 1
## Analysis of Variance Table
##
## Response: teengamb$gamble
## Df Sum Sq Mean Sq F value Pr(>F)
## teengamb$sex 1 7598.4 7598.4 14.7584 0.0004066 ***
## teengamb$status 1 3613.0 3613.0 7.0175 0.0113254 *
## teengamb$income 1 11898.6 11898.6 23.1108 1.985e-05 ***
## teengamb$verbal 1 955.7 955.7 1.8563 0.1803109
## Residuals 42 21623.8 514.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-statistics is 4.1338, and the p-value is significant. Thus, we would reject the null hypothesis.
salary_model <- lm(total ~ expend + ratio + salary, data=sat)
non_salary_model <-lm(total ~ expend + ratio, data=sat)
anova(salary_model, non_salary_model)
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 47 233443 -1 -16631 3.5285 0.06667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(salary_model)
##
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140.911 -46.740 -7.535 47.966 123.329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1069.234 110.925 9.639 1.29e-12 ***
## expend 16.469 22.050 0.747 0.4589
## ratio 6.330 6.542 0.968 0.3383
## salary -8.823 4.697 -1.878 0.0667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.65 on 46 degrees of freedom
## Multiple R-squared: 0.2096, Adjusted R-squared: 0.1581
## F-statistic: 4.066 on 3 and 46 DF, p-value: 0.01209
The F-statistic is 3.528 and the p-value is not significant at an alpha = 0.05. Thus \(B_salary\) does equal 0.
The p-value is 0.01209, thus we would reject the null hypothesis and at least one of the betas is not equal to 0.
satmodel2 <- lm(total ~ expend + ratio + salary + takers, data=sat)
summary(satmodel2)
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
anova(salary_model, satmodel2)
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 45 48124 1 168688 157.74 2.607e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The t-statistic for takers is -12.559 and the p-value is 2.61e-16. The F-statistic is 157.74, if you square the t-statistic for takers you get 157.72 which is the same, and the p-value is also equivalent at 2.607e-16.