library(tidyverse)
url <- "https://raw.githubusercontent.com/curdferguson/data621/main/datasets/sat.txt"
sat <- read_tsv(url, skip = 1, col_names = c("state", "expend", "ratio", "salary", "takers", "verbal", "math", "total"), show_col_types=FALSE)
sat <- column_to_rownames(sat, var="state")
sat %>% head(5)
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
sat[,1:4] %>% summary()
## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
cat("\n")
sat[,5:7] %>% summary()
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
We construct a linear model with expend, ratio, and salary as predictors of the response variable total.
Is the effect of these predictors on the response statistically significant? We use an F-test for Analysis of Variance to test whether any of the predictors’ coefficients is statistically different from zero.
Our F-statistic of 4.0662 is sufficiently greater than that of the null model, and our p-value of 0.01209 indicates this result would be the result of chance in only 0.12% of hypothetical samples.
We reject the null hypothesis that the coefficients of our predictors are statistically equivalent to zero, and take the effect of this model on the response as statistically significant at the 0.95 level.
sat_lm1 <- lm(total ~ expend + ratio + salary, data=sat)
lm1_sum <- summary(sat_lm1)
lm1_sum
##
## Call:
## lm(formula = total ~ expend + ratio + salary, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140.911 -46.740 -7.535 47.966 123.329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1069.234 110.925 9.639 1.29e-12 ***
## expend 16.469 22.050 0.747 0.4589
## ratio 6.330 6.542 0.968 0.3383
## salary -8.823 4.697 -1.878 0.0667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.65 on 46 degrees of freedom
## Multiple R-squared: 0.2096, Adjusted R-squared: 0.1581
## F-statistic: 4.066 on 3 and 46 DF, p-value: 0.01209
sat_nullmod <- lm(total ~ 1, data=sat)
lm1_anova <- anova(sat_nullmod, sat_lm1)
lm1_anova
## Analysis of Variance Table
##
## Model 1: total ~ 1
## Model 2: total ~ expend + ratio + salary
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 274308
## 2 46 216812 3 57496 4.0662 0.01209 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We add the predictor takers to the model.
Is the additon of this predictor on the response statistically significant? We can test this in two ways; using a t-test for the specific variable and using an F-test to compare the effect of the first and second models as a whole. Then we can show that the results of these two methods are actually the same.
first, we can output the regresion summary of the new model and observe the value of the t-statistic and p-value for takers.
Our regression summary output gives a t-value of -12.559 and a p-value of 2.61e-16 for takers. This indicates the coefficient is about 12.5 times the size of its standard error, and that we’d expect this to be the result of chance in well fewer than 0.01% of hypothetical samples.
We conclude by this result that we can reject the null hypothesis at the 0.95% level of statistical significance, and assume the impact of takers to be significant.
Second, we can use an F-test for Analysis of Variance between the new model and previous model to test whether the additional impact of the coefficient for takers is statistically different from zero.
Our F-statistic of 157.74 is sufficiently greater than that of the model without takers, and our p-value of 2.607e-16 indicates this result would be the result of chance in well fewer than 0.01% of hypothetical samples.
We reject the null hypothesis that the difference in the coefficients of our predictors is statistically equivalent to zero, and take the effect of this model on the response as statistically significant at the 0.95 level.
Finally, we can verify that our results from these two tests are the same. We expect that our ANOVA F statistic should be approximately the square of our t-value for the added variable takers, and that their p-values would be equal.
As we see in our output below, the difference between the t-value squared and the F-statistic, as well as between the p-values, are each so small as to be functionally equivalent to zero.
# method 1 - regression summary output t-test
sat_lm2 <- lm(total ~ expend + ratio + salary + takers, data=sat)
lm2_sum <- summary(sat_lm2)
lm2_sum
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
# method 2 - ANOVAb
lm2_anova <- anova(sat_lm1, sat_lm2)
lm2_anova
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 45 48124 1 168688 157.74 2.607e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# verification
(lm2_sum$coefficients["takers", "t value"])^2 - lm2_anova$`F`
## [1] NA 2.273737e-13
(lm2_sum$coefficients["takers", "Pr(>|t|)"]) - lm2_anova$`Pr(>F)`
## [1] NA -7.494179e-30
We conduct diagnostics of our model sat_lm2. First, we’ll check the Constant Variance assumption by plotting the residuals versus the fitted y-values.
While the range over which the residuals vary is about equal on the left and right hand sides, the distribution of points in the middle is skewed to the negative side of the range. The smoothed curve suggest adding a quadratic term to the model may be an appropriate transformation.
There are also 3 outliers noted on the plot - the values for North Dakota, New Hampshire, and West Virginia need to be reviewed for validity and may factor in additional transformations to the model.
Viewing the Scale-Location plot backs up our understanding of the model - the residuals have a constant spread but we have a problem in the shape of the model and thus the linearity of the relationship between predictors and response. We also see Utah surface as another outlier.
plot(sat_lm2, 1)
plot(sat_lm2, 3)
#ggplot(sat_lm2, aes(x=fitted(sat_lm2), y=residuals(sat_lm2))) + geom_jitter() + geom_hline(yintercept=0, color="blue") + xlab("Fitted") + ylab("Residuals")
Next, we’ll check for the normality of our distributed residuals. Using a quantile - quantile plot and a histogram of residuals, we can see that the shape of the distribution is approximately normal with some slight deviation from normality at the tails, accounted for by our known outliers plus another identified in the plot - Utah.
Notably, our data point for West Virginia seems to skew the distribution of residuals slightly leftward in the histogram visualization. It’s time we take a look at the influence of each of these outliers on the model as a whole.
plot(sat_lm2, 2)
ggplot(data=sat_lm2, aes(x=residuals(sat_lm2))) + geom_histogram(bins=15) + xlab("Residual") + ylab("Frequency")
Looking at the residuals vs. leverage plot below, we see that for the most part, our data including the points of high leverage cluster within the area between -2 and 2 standardized residuals.
The point of most concern is Utah, which exerts the most influence over the dataset and falls just within the acceptable range of Cook’s Distance (0.5). This is not enough of an anomaly to impact our assumptions of the model’s validity.
plot(sat_lm2, 5)
#ggplot(data=sat_lm2, aes(x=residuals(sat_lm2))) + geom_histogram(bins=15) + xlab("Residual") + ylab("Frequency")