##Once you have finished, submit the generated Word or HTML file to WebCampus. All datasets are from the package ‘faraway’.

library(faraway)

1. Teenage Gambling in Britain.

The data teengamb has 47 observations and 5 columns. Refer to the help page to understand the meaning of the variable names.

data("teengamb")
head(teengamb)
##   sex status income verbal gamble
## 1   1     51   2.00      8    0.0
## 2   1     28   2.50      8    0.0
## 3   1     37   2.00      6    0.0
## 4   1     28   7.00      4    7.3
## 5   1     65   2.00      8   19.6
## 6   1     61   3.47      6    0.1
attach(teengamb)
  1. Fit a linear model with gamble as the response and all other variables as predictors. Write out the fitted model.
gambmodel1 = lm(gamble ~ sex + status + income + verbal, data = teengamb)
summary(gambmodel1)
## 
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.082 -11.320  -1.451   9.452  94.252 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.55565   17.19680   1.312   0.1968    
## sex         -22.11833    8.21111  -2.694   0.0101 *  
## status        0.05223    0.28111   0.186   0.8535    
## income        4.96198    1.02539   4.839 1.79e-05 ***
## verbal       -2.95949    2.17215  -1.362   0.1803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared:  0.5267, Adjusted R-squared:  0.4816 
## F-statistic: 11.69 on 4 and 42 DF,  p-value: 1.815e-06

\(\hat{gamble} = 22.56 -22.12x_{sex} + 0.05x_{status} + 4.96x_{income} -2.96x_{verbal}\)

  2. With all other predictors held constant, what would be the expected difference in predicted expenditure on gambling for a male compared to a female?

Sex is a binary variable coded 0 for males and 1 for females, so its coefficient is the expected difference in gambling expenditure for a female relative to a male with the same status, income, and verbal score. According to the model above, a female is expected to spend 22.12 British pounds per year less than a comparable male. Equivalently, a male is expected to spend 22.12 pounds per year more than a female.
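As a quick check on this interpretation, the sketch below plugs the coefficients printed by summary(gambmodel1) into the fitted equation for a male and a female who share the same other covariates. The helper predict_gamble and the covariate values (status 45, income 4, verbal 7) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(gambmodel1) output above.
b <- c(intercept = 22.55565, sex = -22.11833, status = 0.05223,
       income = 4.96198, verbal = -2.95949)

# Fitted value for one teenager; sex is coded 0 = male, 1 = female.
# predict_gamble is a hypothetical helper, not part of faraway.
predict_gamble <- function(sex, status, income, verbal) {
  unname(b["intercept"] + b["sex"] * sex + b["status"] * status +
           b["income"] * income + b["verbal"] * verbal)
}

# Hypothetical covariate values, identical for both teenagers.
male   <- predict_gamble(sex = 0, status = 45, income = 4, verbal = 7)
female <- predict_gamble(sex = 1, status = 45, income = 4, verbal = 7)

# The gap equals -b["sex"] regardless of the other covariate values.
male - female
```

Because the model has no interaction terms, changing the shared covariate values shifts both predictions by the same amount and leaves the difference unchanged.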

  3. Which variables are statistically significant at level \(\alpha=0.05\)?

At a significance level of \(\alpha=0.05\), sex and income are the significant variables.

  4. Test the significance of the regression. What is your conclusion?

Reviewing how to calculate F:

SST=sum((gamble-mean(gamble))^2)
SST
## [1] 45689.49
SSReg = sum((gambmodel1$fitted.values-mean(gamble))^2)
SSReg
## [1] 24065.73
SSE = sum(gambmodel1$residuals^2)
SSE
## [1] 21623.77
(SSReg/4)/(SSE/42)
## [1] 11.68576
pf(11.68576, df1 = 4, df2 = 42, lower.tail = FALSE)
## [1] 1.814701e-06

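Since the p-value (about \(1.8 \times 10^{-6}\)) is far below \(\alpha = 0.05\), we reject the null hypothesis that all four slope coefficients are zero. At least one of the predictors is linearly related to gamble, so the regression is significant.

  5. Test whether verbal and status should be included in the model using the partial F-test (reduced model vs. full model), and state your conclusion.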
anova(gambmodel1)
## Analysis of Variance Table
## 
## Response: gamble
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## sex        1  7598.4  7598.4 14.7584 0.0004066 ***
## status     1  3613.0  3613.0  7.0175 0.0113254 *  
## income     1 11898.6 11898.6 23.1108 1.985e-05 ***
## verbal     1   955.7   955.7  1.8563 0.1803109    
## Residuals 42 21623.8   514.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gambmodel2 = lm(gamble ~ sex + status + income, data = teengamb)
gambmodel3 = lm(gamble ~ sex + income + verbal, data = teengamb)
gambmodel4 = lm(gamble ~ sex + income, data = teengamb)

anova(gambmodel2, gambmodel1)
## Analysis of Variance Table
## 
## Model 1: gamble ~ sex + status + income
## Model 2: gamble ~ sex + status + income + verbal
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     43 22580                           
## 2     42 21624  1    955.73 1.8563 0.1803
anova(gambmodel3, gambmodel1)
## Analysis of Variance Table
## 
## Model 1: gamble ~ sex + income + verbal
## Model 2: gamble ~ sex + status + income + verbal
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     43 21642                           
## 2     42 21624  1    17.776 0.0345 0.8535
anova(gambmodel4, gambmodel1)
## Analysis of Variance Table
## 
## Model 1: gamble ~ sex + income
## Model 2: gamble ~ sex + status + income + verbal
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     44 22781                           
## 2     42 21624  2    1157.5 1.1242 0.3345
summary(gambmodel4)
## 
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.757 -11.649   0.844   8.659 100.243 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.041      6.394   0.632  0.53070    
## sex          -21.634      6.809  -3.177  0.00272 ** 
## income         5.172      0.951   5.438 2.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.75 on 44 degrees of freedom
## Multiple R-squared:  0.5014, Adjusted R-squared:  0.4787 
## F-statistic: 22.12 on 2 and 44 DF,  p-value: 2.243e-07

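The partial F-test comparing the full model with the reduced model that excludes both status and verbal gives F = 1.1242 on 2 and 42 degrees of freedom, with a p-value of 0.3345. Since this is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis \(\beta_2 = \beta_4 = 0\), where \(\beta_2\) and \(\beta_4\) are the coefficients of status and verbal. This means that the reduced model, \(\hat{gamble} = 4.041 - 21.634x_{sex} + 5.172x_{income}\), is acceptable for our analysis.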
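The partial F statistic reported by anova(gambmodel4, gambmodel1) can be reproduced by hand from the sums of squares in that table. This is a minimal sketch that uses the rounded values printed above, so the result agrees with the table only up to rounding; the variable names are illustrative.

```r
ss_extra <- 1157.5    # "Sum of Sq": drop in RSS when status and verbal are added
rss_full <- 21623.77  # RSS of the full model (the SSE computed earlier)
q        <- 2         # number of coefficients set to zero under H0
df_full  <- 42        # residual degrees of freedom of the full model

# Partial F: extra sum of squares per restricted coefficient,
# divided by the residual mean square of the full model.
partial_F <- (ss_extra / q) / (rss_full / df_full)
p_value   <- pf(partial_F, df1 = q, df2 = df_full, lower.tail = FALSE)

partial_F
p_value
```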

2. School expenditure and test scores

The data sat has 50 observations and 7 columns. Refer to the help page to understand the meaning of the variable names.

data("sat")
head(sat)
##            expend ratio salary takers verbal math total
## Alabama     4.405  17.2 31.144      8    491  538  1029
## Alaska      8.963  17.6 47.951     47    445  489   934
## Arizona     4.778  19.3 32.175     27    448  496   944
## Arkansas    4.459  17.1 28.934      6    482  523  1005
## California  4.992  24.0 41.078     45    417  485   902
## Colorado    5.443  18.4 34.571     29    462  518   980
attach(sat)
## The following object is masked from teengamb:
## 
##     verbal
  1. Fit a linear model with total score as the response and expend, ratio, and salary as predictors. Write out the fitted model.

\(\hat{total} = 1069.23 + 16.47x_{expend} + 6.33x_{ratio} - 8.82x_{salary}\)

totalmodel1 = lm(total ~ expend + ratio + salary, data = sat)
totalmodel1
## 
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
## 
## Coefficients:
## (Intercept)       expend        ratio       salary  
##    1069.234       16.469        6.330       -8.823
summary(totalmodel1)
## 
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -140.911  -46.740   -7.535   47.966  123.329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1069.234    110.925   9.639 1.29e-12 ***
## expend        16.469     22.050   0.747   0.4589    
## ratio          6.330      6.542   0.968   0.3383    
## salary        -8.823      4.697  -1.878   0.0667 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.65 on 46 degrees of freedom
## Multiple R-squared:  0.2096, Adjusted R-squared:  0.1581 
## F-statistic: 4.066 on 3 and 46 DF,  p-value: 0.01209
confint(totalmodel1, level = 0.95)
##                 2.5 %       97.5 %
## (Intercept) 845.95385 1292.5144911
## expend      -27.91528   60.8530112
## ratio        -6.83820   19.4987339
## salary      -18.27679    0.6315232
  2. Which of those 3 variables are statistically significant at level \(\alpha=0.05\)?

According to the summary above, none of these variables is significant at the \(\alpha=0.05\) level. The only significant coefficient is the intercept. The confidence intervals tell the same story: the 95% intervals for expend, ratio, and salary all include zero. So for each of these coefficients taken individually, we fail to reject the null hypothesis \(\beta_j = 0\). Note that this is not a test of the joint hypothesis \(\beta_{expend} = \beta_{ratio} = \beta_{salary} = 0\), which requires the F-test.
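The link between the t-test and the confidence interval can be made concrete by rebuilding one interval by hand. The sketch below uses the rounded estimate and standard error for salary from summary(totalmodel1), so it matches the confint output only up to rounding; the variable names are illustrative.

```r
est      <- -8.823  # estimated coefficient of salary
se       <- 4.697   # its standard error
df_resid <- 46      # residual degrees of freedom (50 - 3 - 1)

# 95% interval: estimate plus or minus the t critical value times the SE.
t_crit <- qt(0.975, df_resid)
ci     <- est + c(-1, 1) * t_crit * se

ci
# The interval contains zero exactly when |t value| < t_crit,
# i.e. when the two-sided p-value exceeds 0.05.
abs(est / se) < t_crit
```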

  3. Test the significance of the regression. What is your conclusion?
SST = sum((total - mean(total))^2)
SST
## [1] 274307.7
SSReg = sum((totalmodel1$fitted.values - mean(total))^2)
SSReg
## [1] 57495.74
SSE = sum(totalmodel1$residuals^2)
SSE
## [1] 216811.9
(SSReg/3)/(SSE/46)
## [1] 4.066203
pf(4.066203, df1 = 3, df2 = 46, lower.tail = FALSE)
## [1] 0.01208607

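Since the p-value of the F-test for this regression is 0.012, the regression is significant at the \(\alpha = 0.05\) level. We reject the null hypothesis that all three slope coefficients are zero, even though none of the individual t-tests is significant.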
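The same overall F statistic can also be recovered from the multiple R-squared reported in summary(totalmodel1), since R-squared is SSReg/SST. This is a small sketch using the rounded value 0.2096, so it agrees with the output above only up to rounding; the variable names are illustrative.

```r
r2 <- 0.2096  # Multiple R-squared from summary(totalmodel1)
p  <- 3       # number of predictors
n  <- 50      # number of observations (states)

# Overall F: explained variation per predictor over
# unexplained variation per residual degree of freedom.
F_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

F_stat
p_value
```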

  4. Now add takers to the above model. Test \(H_0: \beta_{takers}=0\) using a t-test, and state your conclusion.
totalmodel2 = lm(total ~ expend + ratio + salary + takers, data = sat)
totalmodel2
## 
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
## 
## Coefficients:
## (Intercept)       expend        ratio       salary       takers  
##    1045.972        4.463       -3.624        1.638       -2.904
summary(totalmodel2)
## 
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1045.9715    52.8698  19.784  < 2e-16 ***
## expend         4.4626    10.5465   0.423    0.674    
## ratio         -3.6242     3.2154  -1.127    0.266    
## salary         1.6379     2.3872   0.686    0.496    
## takers        -2.9045     0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16
confint(totalmodel2,parm = c(5), level = 0.95)
##            2.5 %    97.5 %
## takers -3.370262 -2.438699

The estimated coefficient for takers is significant (t = -12.559, p-value = 2.61e-16), so we reject the null hypothesis \(\beta_{takers} = 0\). The confidence interval agrees: the 95% interval for takers does not include zero.

  5. Compare the model in 4 (expend, ratio, salary and takers as predictors) with the model in 1 (expend, ratio, and salary as predictors) using F-test, and state your conclusion.

anova(totalmodel1, totalmodel2)
## Analysis of Variance Table
## 
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     46 216812                                  
## 2     45  48124  1    168688 157.74 2.607e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

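Based on the ANOVA table comparing the two models for SAT scores (F = 157.74 on 1 and 45 degrees of freedom, p-value = 2.607e-16), we reject the null hypothesis \(\beta_{takers} = 0\). This means the full model, which includes takers, fits significantly better than the model without it.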
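When the two nested models differ by a single coefficient, the partial F-test and the t-test for that coefficient are the same test: the F statistic is the square of the t statistic. The sketch below checks this with the rounded t value for takers from summary(totalmodel2), so the match with the ANOVA table is approximate; the variable names are illustrative.

```r
t_takers <- -12.559  # t value for takers in summary(totalmodel2)
df_resid <- 45       # residual degrees of freedom of the full model

# Squaring the t statistic gives the partial F statistic on 1 and 45 df.
F_from_t <- t_takers^2

# Two-sided t-test p-value and the corresponding F-test p-value.
p_t <- 2 * pt(-abs(t_takers), df_resid)
p_F <- pf(F_from_t, df1 = 1, df2 = df_resid, lower.tail = FALSE)

F_from_t
c(p_t = p_t, p_F = p_F)
```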