Once you have finished, submit the generated Word or HTML file to WebCampus. All datasets are from the package ‘faraway’.
library(faraway)
The data teengamb has 47 observations and 5 columns. Refer to the help page to understand the meaning of the variable names.
data("teengamb")
head(teengamb)
## sex status income verbal gamble
## 1 1 51 2.00 8 0.0
## 2 1 28 2.50 8 0.0
## 3 1 37 2.00 6 0.0
## 4 1 28 7.00 4 7.3
## 5 1 65 2.00 8 19.6
## 6 1 61 3.47 6 0.1
attach(teengamb)
Fit a linear model with gamble as the response and all other variables as predictors. Write out the fitted model.
gambmodel1 = lm(gamble ~ sex + status + income + verbal, data = teengamb)
summary(gambmodel1)
##
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sex -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
\(\hat{gamble} = 22.56 -22.12x_{sex} + 0.05x_{status} + 4.96x_{income} -2.96x_{verbal}\)
Because sex is a binary variable (0 = male, 1 = female), its coefficient is the expected difference in gambling expenditure between a female and a male who have the same status, income, and verbal score. According to the model above, a female is expected to spend 22.12 British pounds less per year on gambling than a comparable male.
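One way to check this interpretation is to predict for two teenagers who differ only in sex. The sketch below refits the same model so that it runs on its own; the status, income, and verbal values are hypothetical and chosen only for illustration.

```r
library(faraway)
data("teengamb")

# Same full model as above, refit here so the block is self-contained
fit <- lm(gamble ~ sex + status + income + verbal, data = teengamb)

# Two hypothetical teenagers, identical except for sex (0 = male, 1 = female)
newdata <- data.frame(sex = c(0, 1), status = 45, income = 4, verbal = 7)
pred <- predict(fit, newdata)

diff(pred)        # predicted female expenditure minus predicted male expenditure
coef(fit)["sex"]  # the sex coefficient, for comparison
```

Whatever values are held fixed for the other predictors, the difference between the two predictions should equal the sex coefficient, because the model is additive.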
Which variables are statistically significant at level \(\alpha=0.05\)? At the \(\alpha=0.05\) significance level, sex (p = 0.0101) and income (p = 1.79e-05) are significant; status (p = 0.8535) and verbal (p = 0.1803) are not.
Test the significance of the regression. What is your conclusion?
Reviewing how to calculate F:
SST=sum((gamble-mean(gamble))^2)
SST
## [1] 45689.49
SSReg = sum((gambmodel1$fitted.values-mean(gamble))^2)
SSReg
## [1] 24065.73
SSE = sum(gambmodel1$residuals^2)
SSE
## [1] 21623.77
(SSReg/4)/(SSE/42)
## [1] 11.68576
pf(11.68576, df1 = 4, df2 = 42, lower.tail = FALSE)
## [1] 1.814701e-06
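Since the p-value (1.8e-06) is far below 0.05, we reject the null hypothesis that all four slope coefficients are zero. At least one of sex, status, income, and verbal is useful for predicting gamble, so the regression is significant.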
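The hand computation above is the usual overall F statistic for a model with \(p = 4\) predictors fitted to \(n = 47\) observations. Written out with the sums of squares printed above (a worked restatement of the same arithmetic, not additional output):

\[
F = \frac{SS_{Reg}/p}{SS_{E}/(n-p-1)} = \frac{24065.73/4}{21623.77/42} \approx \frac{6016.43}{514.85} \approx 11.69
\]

Under \(H_0\) this statistic follows an \(F_{4,\,42}\) distribution, and the same value appears on the last line of summary(gambmodel1).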
Test whether verbal and status should be included in the model using the partial F-test (reduced model vs. full model), and state your conclusion.
anova(gambmodel1)
## Analysis of Variance Table
##
## Response: gamble
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 7598.4 7598.4 14.7584 0.0004066 ***
## status 1 3613.0 3613.0 7.0175 0.0113254 *
## income 1 11898.6 11898.6 23.1108 1.985e-05 ***
## verbal 1 955.7 955.7 1.8563 0.1803109
## Residuals 42 21623.8 514.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gambmodel2 = lm(gamble ~ sex + status + income, data = teengamb)
gambmodel3 = lm(gamble ~ sex + income + verbal, data = teengamb)
gambmodel4 = lm(gamble ~ sex + income, data = teengamb)
anova(gambmodel2, gambmodel1)
## Analysis of Variance Table
##
## Model 1: gamble ~ sex + status + income
## Model 2: gamble ~ sex + status + income + verbal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 22580
## 2 42 21624 1 955.73 1.8563 0.1803
anova(gambmodel3, gambmodel1)
## Analysis of Variance Table
##
## Model 1: gamble ~ sex + income + verbal
## Model 2: gamble ~ sex + status + income + verbal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 21642
## 2 42 21624 1 17.776 0.0345 0.8535
anova(gambmodel4, gambmodel1)
## Analysis of Variance Table
##
## Model 1: gamble ~ sex + income
## Model 2: gamble ~ sex + status + income + verbal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 22781
## 2 42 21624 2 1157.5 1.1242 0.3345
summary(gambmodel4)
##
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.757 -11.649 0.844 8.659 100.243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.041 6.394 0.632 0.53070
## sex -21.634 6.809 -3.177 0.00272 **
## income 5.172 0.951 5.438 2.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.75 on 44 degrees of freedom
## Multiple R-squared: 0.5014, Adjusted R-squared: 0.4787
## F-statistic: 22.12 on 2 and 44 DF, p-value: 2.243e-07
Since the p-value for the partial F-test comparing the full model with the reduced model that excludes both status and verbal is 0.3345, we fail to reject the null hypothesis \(\beta_{status} = \beta_{verbal} = 0\). This means that the reduced model, \(\hat{gamble} = 4.041 - 21.634x_{sex} + 5.172x_{income}\), is acceptable for our analysis.
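The partial F statistic reported by anova(gambmodel4, gambmodel1) can also be reproduced by hand. The sketch below uses the rounded values printed in the ANOVA tables above, so it should agree with the reported F and p-value only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA output above (rounded as printed)
ss_diff  <- 1157.5    # Sum of Sq: RSS(reduced) - RSS(full)
rss_full <- 21623.8   # residual sum of squares of the full model
df_diff  <- 2         # two coefficients set to zero (status, verbal)
df_full  <- 42        # residual degrees of freedom of the full model

# Partial F statistic and its upper-tail p-value
f_stat  <- (ss_diff / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = df_diff, df2 = df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```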
The data sat has 50 observations and 7 columns. Refer to the help page to understand the meaning of the variable names.
data("sat")
head(sat)
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
attach(sat)
## The following object is masked from teengamb:
##
## verbal
Fit a model with total score as the response and expend, ratio, and salary as predictors. Write out the fitted model.
\(\hat{total} = 1069.23 + 16.47x_{expend} + 6.33x_{ratio} - 8.82x_{salary}\)
totalmodel1 = lm(total ~ expend + ratio + salary, data = sat)
totalmodel1
##
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
##
## Coefficients:
## (Intercept) expend ratio salary
## 1069.234 16.469 6.330 -8.823
summary(totalmodel1)
##
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140.911 -46.740 -7.535 47.966 123.329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1069.234 110.925 9.639 1.29e-12 ***
## expend 16.469 22.050 0.747 0.4589
## ratio 6.330 6.542 0.968 0.3383
## salary -8.823 4.697 -1.878 0.0667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.65 on 46 degrees of freedom
## Multiple R-squared: 0.2096, Adjusted R-squared: 0.1581
## F-statistic: 4.066 on 3 and 46 DF, p-value: 0.01209
confint(totalmodel1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 845.95385 1292.5144911
## expend -27.91528 60.8530112
## ratio -6.83820 19.4987339
## salary -18.27679 0.6315232
According to the summary above, none of expend, ratio, and salary is significant at the \(\alpha=0.05\) level; only the intercept is. Each predictor's 95% confidence interval contains zero, so for each coefficient individually we fail to reject \(H_0: \beta_j = 0\) given the other two predictors in the model. This does not test the joint hypothesis \(\beta_{expend} = \beta_{ratio} = \beta_{salary} = 0\), which requires the overall F-test.
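Each interval from confint() is the estimate plus or minus a t critical value times the standard error. As a sketch, the salary interval can be rebuilt from the rounded numbers in the summary table, so the result should match the confint() output only up to rounding.

```r
# Values copied from summary(totalmodel1) for the salary coefficient
est      <- -8.823
se       <- 4.697
df_resid <- 46   # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper confidence limits
ci <- est + c(-1, 1) * t_crit * se
ci
```

The upper limit falls just above zero, which is why salary narrowly misses significance at the 0.05 level (p = 0.0667).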
SST = sum((total - mean(total))^2)
SST
## [1] 274307.7
SSReg = sum((totalmodel1$fitted.values - mean(total))^2)
SSReg
## [1] 57495.74
SSE = sum(totalmodel1$residuals^2)
SSE
## [1] 216811.9
(SSReg/3)/(SSE/46)
## [1] 4.066203
pf(4.066203, df1 = 3, df2 = 46, lower.tail = FALSE)
## [1] 0.01208607
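Since the p-value of the F-test for this regression is 0.012, the regression is significant at the \(\alpha = 0.05\) level: we reject \(H_0: \beta_{expend} = \beta_{ratio} = \beta_{salary} = 0\), even though no single predictor is significant on its own.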
Add takers to the above model. Test \(H_0: \beta_{takers}=0\) using a t-test, and state your conclusion.
totalmodel2 = lm(total ~ expend + ratio + salary + takers, data = sat)
totalmodel2
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Coefficients:
## (Intercept) expend ratio salary takers
## 1045.972 4.463 -3.624 1.638 -2.904
summary(totalmodel2)
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
confint(totalmodel2,parm = c(5), level = 0.95)
## 2.5 % 97.5 %
## takers -3.370262 -2.438699
The estimated coefficient for takers is significant (t = -12.559, p = 2.61e-16), and its 95% confidence interval, (-3.37, -2.44), does not include zero. So we reject the null hypothesis that \(\beta_{takers} = 0\).
5. Compare the model in 4 (expend, ratio, salary and takers as predictors) with the model in 1 (expend, ratio, and salary as predictors) using an F-test, and state your conclusion.
anova(totalmodel1, totalmodel2)
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 45 48124 1 168688 157.74 2.607e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table comparing the two models for SAT scores, we reject the null hypothesis \(\beta_{takers} = 0\) (F = 157.74, p = 2.607e-16). Adding takers significantly improves the fit, so the fuller model is preferred.
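When the two models differ by a single predictor, the partial F-test and the t-test on that coefficient are the same test, because the F statistic is the square of the t statistic. Checking with the rounded values printed in the output:

\[
t^2 = (-12.559)^2 \approx 157.73 \approx F = 157.74
\]

The two p-values agree as well (2.61e-16 from the t-test and 2.607e-16 from the F-test), so both tests lead to the same conclusion about takers.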