STAT 425 Homework 1

Question 1

$RSS(B_0, B_1) = _{n=1}^{n} [y_i - (B_0 + B_1 x_i)]^2 $

$(SSE(B_0, B_1)) = -2 _{n=1}^{n} [y_i - (B_0 + B_1 x_i)] = 0 $

$_{n=1}^{n} [y_i - (B_0 + B_1 x_i)] = 0 $

${n=1}^{n} y_i = {n=1}^{n} B_0 + B_1 _{n=1}^{n} x_i $

${n=1}^{n} y_i = n B_0 + B_1 {n=1}^{n} x_i $

$y = B_0 + B_1 x_i$

$min Sum _{n=1}^{n} [y_i - (y-(y-B_1 x_i + B_1 x_i)]^2 $

$min Sum _{n=1}^{n} [(y_i -y) - B_1(x_i - x)]^2 $

$ RSS(B_0, B_1) = -2 _{n=1}^{n} [(y_i -y) - B_1(x_i - x)] $

${n=1}^{n} (y_i -y) (x_i - x) - B_1 {n=1}^{n} (x_i - x)^2 = 0 $

$B_1 = _{n=1}^{n} } = $

Question 2

$Y_i = B_0 + B_1 x_i + e_i $

$e_i ~ N(0, \sigma^2)$

$Y_i ~ N(B_0 + B_1 x_1, \sigma^2)$

$L(B_0, B_1) = _{n=1}^{n} e^- $

$log(L(B_0, B_1)) = _{n=1}^{n}[log() -] $

Maximizing this function is the same as minimizing the function in Question 1

Question 3

$\sum_{n=1}^{n} (y_i - B_1 x_1)^2$ $ -2 {n=1}^{n} (y_i - B_1 x_i) x_i = 0 $ $ {n=1}^{n} y_i x_i = B_1 _{n=1}^{n} x_i $ $ $

$ ^2 (X^T X)^{-1} = ^2 (_{n=1}^{n} (x_i)²⁾{-1} $

Question 4

Part 4a

library(faraway)
fit1 = lm(teengamb$gamble ~ teengamb$sex + teengamb$status + teengamb$income + teengamb$verbal, data=teengamb)
summary(fit1)

## 
## Call:
## lm(formula = teengamb$gamble ~ teengamb$sex + teengamb$status + 
##     teengamb$income + teengamb$verbal, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.082 -11.320  -1.451   9.452  94.252 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.55565   17.19680   1.312   0.1968    
## teengamb$sex    -22.11833    8.21111  -2.694   0.0101 *  
## teengamb$status   0.05223    0.28111   0.186   0.8535    
## teengamb$income   4.96198    1.02539   4.839 1.79e-05 ***
## teengamb$verbal  -2.95949    2.17215  -1.362   0.1803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared:  0.5267, Adjusted R-squared:  0.4816 
## F-statistic: 11.69 on 4 and 42 DF,  p-value: 1.815e-06

52.67% of the variance is explained by the predictors.

Part 4b

fit1$residuals

##           1           2           3           4           5           6 
##  10.6507430   9.3711318   5.4630298 -17.4957487  29.5194692  -2.9846919 
##           7           8           9          10          11          12 
##  -7.0242994 -12.3060734   6.8496267 -10.3329505   1.5934936  -3.0958161 
##          13          14          15          16          17          18 
##   0.1172839   9.5331344   2.8488167  17.2107726 -25.2627227 -27.7998544 
##          19          20          21          22          23          24 
##  13.1446553 -15.9510624 -16.0041386  -9.5801478 -27.2711657  94.2522174 
##          25          26          27          28          29          30 
##   0.6993361  -9.1670510 -25.8747696  -8.7455549  -6.8803097 -19.8090866 
##          31          32          33          34          35          36 
##  10.8793766  15.0599340  11.7462296  -3.5932770 -14.4016736  45.6051264 
##          37          38          39          40          41          42 
##  20.5472529  11.2429290 -51.0824078   8.8669438  -1.4513921  -3.8361619 
##          43          44          45          46          47 
##  -4.3831786 -14.8940753   5.4506347   1.4092321   7.1662399

max(fit1$residuals)

## [1] 94.25222

Point 24 seems to be the maximum value.

Part 4c

mean(fit1$residuals)

## [1] -3.065293e-17

median(fit1$residuals)

## [1] -1.451392

The mean is pretty much 0 and the median is -1.451392.

Part 4d

cor(fit1$residuals, fit1$fitted)

## [1] -1.070659e-16

Part 4e

cor(fit1$residuals, teengamb$income)

## [1] -7.242382e-17

Part 4f

Keeping all the predictors constant, gambling expenditure decreases by approximately 22 units for females.

Part 4g

summary(fit1)

## 
## Call:
## lm(formula = teengamb$gamble ~ teengamb$sex + teengamb$status + 
##     teengamb$income + teengamb$verbal, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.082 -11.320  -1.451   9.452  94.252 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.55565   17.19680   1.312   0.1968    
## teengamb$sex    -22.11833    8.21111  -2.694   0.0101 *  
## teengamb$status   0.05223    0.28111   0.186   0.8535    
## teengamb$income   4.96198    1.02539   4.839 1.79e-05 ***
## teengamb$verbal  -2.95949    2.17215  -1.362   0.1803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared:  0.5267, Adjusted R-squared:  0.4816 
## F-statistic: 11.69 on 4 and 42 DF,  p-value: 1.815e-06

Sex and income are the only significant variables.

Part 4h

teengamb_avg <- data.frame(sex=0, income = mean(teengamb$income), status = mean(teengamb$status), verbal = mean(teengamb$verbal))

predict.lm(fit1, teengamb_avg, interval="prediction")

## Warning: 'newdata' had 1 row but variables found have 47 rows

##            fit        lwr       upr
## 1  -10.6507430 -58.235514  36.93403
## 2   -9.3711318 -57.582578  38.84031
## 3   -5.4630298 -52.684957  41.75890
## 4   24.7957487 -23.289986  72.88148
## 5   -9.9194692 -58.782299  38.94336
## 6    3.0846919 -46.314094  52.48348
## 7    8.4742994 -38.868622  55.81722
## 8   18.9060734 -28.484901  66.29705
## 9   -5.1496267 -52.527591  42.22834
## 10  10.4329505 -37.782785  58.64869
## 11  -1.4934936 -49.540423  46.55344
## 12   8.4958161 -38.871827  55.86346
## 13   1.0827161 -47.058657  49.22409
## 14  -5.9331344 -53.380123  41.51385
## 15  -0.4488167 -47.554821  46.65719
## 16 -13.8107726 -61.480049  33.85850
## 17  25.3627227 -23.528720  74.25417
## 18  36.1998544 -12.259611  84.65932
## 19  -1.1446553 -48.490357  46.20105
## 20  15.9510624 -31.807109  63.70923
## 21  17.0041386 -29.920025  63.92830
## 22  10.7801478 -37.571437  59.13173
## 23  27.3711657 -19.927427  74.66976
## 24  61.7477826  13.204942 110.29062
## 25  37.8006639 -10.284912  85.88624
## 26  11.2670510 -37.026554  59.56066
## 27  40.3747696  -7.778386  88.52793
## 28  11.7455549 -36.500529  59.99164
## 29   7.4803097 -40.047511  55.00813
## 30  29.4090866 -17.545286  76.36346
## 31  77.1206234  26.140182 128.10107
## 32  38.1400660 -10.033590  86.31372
## 33  78.2537704  27.648150 128.85939
## 34   6.5932770 -41.586927  54.77348
## 35  28.5016736 -23.944532  80.94788
## 36  24.3948736 -22.450663  71.24041
## 37  17.9527471 -29.681396  65.58689
## 38  45.9570710  -2.016773  93.93091
## 39  57.0824078   9.241214 104.92360
## 40  16.1330562 -31.617014  63.88313
## 41   8.3513921 -40.032997  56.73578
## 42  73.5361619  21.294136 125.77819
## 43  17.6831786 -30.061617  65.42797
## 44  15.4940753 -31.849756  62.83791
## 45  32.5493653 -14.759029  79.85776
## 46  12.9907679 -34.754592  60.73613
## 47  12.0337601 -35.528795  59.59632

teengamb_max <- data.frame(sex=0, income = max(teengamb$income), status = max(teengamb$status), verbal = max(teengamb$verbal))

predict.lm(fit1, teengamb_max, interval="prediction")

## Warning: 'newdata' had 1 row but variables found have 47 rows

##            fit        lwr       upr
## 1  -10.6507430 -58.235514  36.93403
## 2   -9.3711318 -57.582578  38.84031
## 3   -5.4630298 -52.684957  41.75890
## 4   24.7957487 -23.289986  72.88148
## 5   -9.9194692 -58.782299  38.94336
## 6    3.0846919 -46.314094  52.48348
## 7    8.4742994 -38.868622  55.81722
## 8   18.9060734 -28.484901  66.29705
## 9   -5.1496267 -52.527591  42.22834
## 10  10.4329505 -37.782785  58.64869
## 11  -1.4934936 -49.540423  46.55344
## 12   8.4958161 -38.871827  55.86346
## 13   1.0827161 -47.058657  49.22409
## 14  -5.9331344 -53.380123  41.51385
## 15  -0.4488167 -47.554821  46.65719
## 16 -13.8107726 -61.480049  33.85850
## 17  25.3627227 -23.528720  74.25417
## 18  36.1998544 -12.259611  84.65932
## 19  -1.1446553 -48.490357  46.20105
## 20  15.9510624 -31.807109  63.70923
## 21  17.0041386 -29.920025  63.92830
## 22  10.7801478 -37.571437  59.13173
## 23  27.3711657 -19.927427  74.66976
## 24  61.7477826  13.204942 110.29062
## 25  37.8006639 -10.284912  85.88624
## 26  11.2670510 -37.026554  59.56066
## 27  40.3747696  -7.778386  88.52793
## 28  11.7455549 -36.500529  59.99164
## 29   7.4803097 -40.047511  55.00813
## 30  29.4090866 -17.545286  76.36346
## 31  77.1206234  26.140182 128.10107
## 32  38.1400660 -10.033590  86.31372
## 33  78.2537704  27.648150 128.85939
## 34   6.5932770 -41.586927  54.77348
## 35  28.5016736 -23.944532  80.94788
## 36  24.3948736 -22.450663  71.24041
## 37  17.9527471 -29.681396  65.58689
## 38  45.9570710  -2.016773  93.93091
## 39  57.0824078   9.241214 104.92360
## 40  16.1330562 -31.617014  63.88313
## 41   8.3513921 -40.032997  56.73578
## 42  73.5361619  21.294136 125.77819
## 43  17.6831786 -30.061617  65.42797
## 44  15.4940753 -31.849756  62.83791
## 45  32.5493653 -14.759029  79.85776
## 46  12.9907679 -34.754592  60.73613
## 47  12.0337601 -35.528795  59.59632

The maximum prediction interval is wider as expected since there are not as many observations around the maximum as there are around the mean. On top of this being a prediction interval, we are also extrapolating at this point since we are going beyond the maximum on the higher tail of the interval.

Part 4i

income_gamb <- lm(gamble ~ income , data=teengamb)
anova(fit1, income_gamb)

## Warning in anova.lmlist(object, ...): models with response '"gamble"'
## removed because response differs from model 1

## Analysis of Variance Table
## 
## Response: teengamb$gamble
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## teengamb$sex     1  7598.4  7598.4 14.7584 0.0004066 ***
## teengamb$status  1  3613.0  3613.0  7.0175 0.0113254 *  
## teengamb$income  1 11898.6 11898.6 23.1108 1.985e-05 ***
## teengamb$verbal  1   955.7   955.7  1.8563 0.1803109    
## Residuals       42 21623.8   514.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistics is 4.1338, and the p-value is significant. Thus, we would reject the null hypothesis.

Question 5

Part 5a

salary_model <- lm(total ~ expend + ratio + salary, data=sat)
non_salary_model <-lm(total ~ expend + ratio, data=sat)


anova(salary_model, non_salary_model)

## Analysis of Variance Table
## 
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     46 216812                              
## 2     47 233443 -1    -16631 3.5285 0.06667 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(salary_model)

## 
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -140.911  -46.740   -7.535   47.966  123.329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1069.234    110.925   9.639 1.29e-12 ***
## expend        16.469     22.050   0.747   0.4589    
## ratio          6.330      6.542   0.968   0.3383    
## salary        -8.823      4.697  -1.878   0.0667 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.65 on 46 degrees of freedom
## Multiple R-squared:  0.2096, Adjusted R-squared:  0.1581 
## F-statistic: 4.066 on 3 and 46 DF,  p-value: 0.01209

The F-statistic is 3.528 and the p-value is not significant at an alpha = 0.05. Thus $B_salary$ does equal 0.

The p-value is 0.01209, thus we would reject the null hypothesis and at least one of the betas is not equal to 0.

Part 5b

satmodel2 <- lm(total ~ expend + ratio + salary + takers, data=sat)

summary(satmodel2)

## 
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1045.9715    52.8698  19.784  < 2e-16 ***
## expend         4.4626    10.5465   0.423    0.674    
## ratio         -3.6242     3.2154  -1.127    0.266    
## salary         1.6379     2.3872   0.686    0.496    
## takers        -2.9045     0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16

anova(salary_model, satmodel2)

## Analysis of Variance Table
## 
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     46 216812                                  
## 2     45  48124  1    168688 157.74 2.607e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The t-statistic for takers is -12.559 and the p-value is 2.61e-16. The F-statistic is 157.74, if you square the t-statistic for takers you get 157.72 which is the same, and the p-value is also equivalent at 2.607e-16.