Model Building & Evaluation | Towards My Final Project
First I’ll load the data and recreate both the correlation matrix and regression model. I find correlation plots useful for visualizing correlations, so I’ll create one using the ggcorr() function from the GGally package. (In the hSP2 data, P is price, S is size, Be is the number of bedrooms, Ba is the number of bathrooms, and New indicates new construction.)
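A minimal sketch of the loading step, assuming the housing data was saved as hSP2.rds (a hypothetical file name; the data frame name hSP2 matches the calls below):
# load package and data (hypothetical file name)
library(GGally)
hSP2 <- readRDS("hSP2.rds")
# correlation matrix, rounded to three digits
round(cor(hSP2), 3)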
P S Be Ba New
P 1.000 0.899 0.590 0.714 0.357
S 0.899 1.000 0.669 0.662 0.176
Be 0.590 0.669 1.000 0.334 0.267
Ba 0.714 0.662 0.334 1.000 0.182
New 0.357 0.176 0.267 0.182 1.000
# create cor plot
ggcorr(hSP2, label = TRUE, label_round = 3)
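The summary below is of the full model. A minimal sketch of the fit, stored as fithSP2 (the object name used with step(), PRESS(), and AIC() later):
# fit the full model regressing price on all other variables
fithSP2 <- lm(P ~ ., data = hSP2)
summary(fithSP2)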
Call:
lm(formula = P ~ ., data = hSP2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 0.0000000000000002 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 0.0000043 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 0.00000000000000022
Backward elimination is one of the three primary automated variable selection methods. It works as follows: fit a model including all possible predictor variables, then remove the least statistically significant variable (according to a predetermined \(\alpha\)-level), refit the model, and repeat until every remaining predictor is statistically significant.
We’ve already fit the model using all possible predictor variables (above). Let’s set the \(\alpha\)-level at .05. Every variable except bed (the number of bedrooms in the house) is statistically significant at that level, so backward elimination would lead us to remove bed first.
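A sketch of the refit, stored as fitBack (the object name used with PRESS() and AIC() below):
# refit the model without Be (bedrooms)
fitBack <- lm(P ~ . - Be, data = hSP2)
summary(fitBack)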
Call:
lm(formula = P ~ . - Be, data = hSP2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 0.0000000815 ***
S 62.263 4.335 14.363 < 0.0000000000000002 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 0.0000045396 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 0.00000000000000022
Now every variable in the model is statistically significant. In fact, removing bed has made bath (the number of bathrooms) more significant: its \(P\)-value dropped from .001019 to .000438.
Backward elimination would thus lead us to select the model with three predictor variables: size, bath, and new.
I can confirm this using the step() function.
# backward elimination
step(object = fithSP2, # specify full model
direction = "backward") # specify backward
Start: AIC=524.7
P ~ S + Be + Ba + New
Df Sum of Sq RSS AIC
- Be 1 131 23684 523.21
<none> 23553 524.70
- Ba 1 3092 26645 534.17
- New 1 6432 29985 545.15
- S 1 35419 58972 608.06
Step: AIC=523.21
P ~ S + Ba + New
Df Sum of Sq RSS AIC
<none> 23684 523.21
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
Call:
lm(formula = P ~ S + Ba + New, data = hSP2)
Coefficients:
(Intercept) S Ba New
-47.99 62.26 20.07 18.37
Forward selection is another primary method of automated variable selection. It works as follows: begin with the intercept-only model and add predictors one at a time, at each step adding the variable that improves the model the most, until no remaining variable meets the entry criterion.
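The table below reports the four single-predictor models side by side. A sketch of how such a table might be produced with stargazer(); the four fit object names are hypothetical:
# fit each single-predictor model (hypothetical object names)
library(stargazer)
fitS <- lm(P ~ S, data = hSP2)
fitBe <- lm(P ~ Be, data = hSP2)
fitBa <- lm(P ~ Ba, data = hSP2)
fitNew <- lm(P ~ New, data = hSP2)
# text table reporting coefficients ("c"), t statistics ("t"), and p-values ("p")
stargazer(fitS, fitBe, fitBa, fitNew, type = "text", report = "vctp")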
===================================================================================
Dependent variable:
-----------------------------------------------------
P
(1) (2) (3) (4)
-----------------------------------------------------------------------------------
S 75.607
t = 19.561
p = 0.000***
Be 42.969
t = 6.976
p = 0.000***
Ba 76.026
t = 9.720
p = 0.000***
New 34.158
t = 3.641
p = 0.0005***
Constant -25.194 -37.229 -49.248 89.249
t = -3.767 t = -1.866 t = -3.148 t = 17.336
p = 0.0003*** p = 0.066* p = 0.003*** p = 0.000***
-----------------------------------------------------------------------------------
Observations 93 93 93 93
R2 0.808 0.348 0.509 0.127
Adjusted R2 0.806 0.341 0.504 0.118
Residual Std. Error (df = 91) 19.473 35.861 31.119 41.506
F Statistic (df = 1; 91) 382.628*** 48.660*** 94.473*** 13.254***
===================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
We can see that all of the variables are statistically significant and, while the stargazer() function doesn’t display the full \(P\)-values, we can see that size has the largest \(t\) test statistic, so that’s the variable forward selection would add first. If we look at our correlation matrix/plot, we can see that size is also the variable most highly correlated with price.
Once again, I can confirm this using the step() function.
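A sketch of the forward run, with fitNull as the intercept-only model (the same object passed to the stepwise call further down):
# fit the intercept-only (null) model
fitNull <- lm(P ~ 1, data = hSP2)
# forward selection from the null model up to the full model
step(object = fitNull, # start from the null model
direction = "forward", # specify forward
scope = P ~ S + Be + Ba + New) # specify maximum model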
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7274.7 27234 534.20
+ Ba 1 4475.6 30033 543.30
<none> 34508 554.22
+ Be 1 40.4 34468 556.11
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550.1 23684 523.21
+ Be 1 588.8 26645 534.17
<none> 27234 534.20
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 130.55 23553 524.70
Call:
lm(formula = P ~ S + New + Ba, data = hSP2)
Coefficients:
(Intercept) S New Ba
-47.99 62.26 18.37 20.07
In this case, forward selection gives us the same model as backward elimination:
\[\hat{P} = -47.99 + 62.26 \cdot \text{size} + 18.37 \cdot \text{new} + 20.07 \cdot \text{bath}\]
The correlation between bed and price is .59, indicating a fairly strong relationship between the two variables. Yet in our multiple regression model, the \(P\)-value for bed is .486763, meaning we can’t conclude that its coefficient/slope differs from zero once the other predictors are accounted for. While these two statistics appear to nudge us towards two different conclusions about the relationship between these two variables, I believe what we’re actually seeing is the effect of bed being related to the other predictor variables in the model.
Bed and size are even more highly correlated (.669) than bed and price (.59), and size and price are very highly correlated with one another (.899). In other words, the predictors are multicollinear: it’s hard to increase size without also increasing bed, so any increase in bed is likely already captured by size, and bed adds little explanatory power to the model. A similar, albeit weaker, effect can be seen with bed and bath.
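One way to quantify this multicollinearity is with variance inflation factors; a sketch using vif() from the car package (the package used for outlierTest() later on):
# variance inflation factors for the full model; values well above 1
# indicate a predictor that is largely explained by the other predictors
library(car)
vif(fithSP2)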
We have two models already: our initial model with all possible predictor variables and the model we arrived at using backward elimination & forward selection. I’ll use stepwise selection to see if perhaps that leads us to a third model.
step(object = fitNull, # pass null model to step function
direction = "both", # specify both for stepwise selection
scope = P ~ S + Be + Ba + New) # specify maximum model
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7275 27234 534.20
+ Ba 1 4476 30033 543.30
<none> 34508 554.22
+ Be 1 40 34468 556.11
- S 1 145097 179606 705.63
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550 23684 523.21
+ Be 1 589 26645 534.17
<none> 27234 534.20
- New 1 7275 34508 554.22
- S 1 129539 156772 694.99
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 131 23553 524.70
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
Call:
lm(formula = P ~ S + New + Ba, data = hSP2)
Coefficients:
(Intercept) S New Ba
-47.99 62.26 18.37 20.07
Stepwise selection leads us to the same model as backward elimination & forward selection, so I’ll be comparing two models: 1) the full model, P ~ S + Be + Ba + New (fithSP2), and 2) the reduced model, P ~ S + New + Ba (fitBack).
Let’s look at \(R^2\) first.
Based on the \(R^2\) values of these two models (.8689 for the full model vs. .8681 for the reduced model), model P ~ S + Be + Ba + New is the better model.
Next we can look at the \(R^2_{adj}\) values for each model.
Based on the \(R^2_{adj}\) values (.8629 vs. .8637), model P ~ S + New + Ba is the better model.
Next we can look at the PRESS statistic for each model.
# calculate press
PRESS(fithSP2) # for P ~ S + Be + Ba + New
[1] 28390.22
PRESS(fitBack) # for P ~ S + New + Ba
[1] 27860.05
Based on the PRESS statistic for each model, model P ~ S + New + Ba is the better model.
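As a sanity check, PRESS can also be computed by hand: it is the sum of squared leave-one-out residuals, which for a linear model are the ordinary residuals divided by one minus the hat (leverage) values. A sketch, using a small helper defined here:
# PRESS = sum of squared leave-one-out residuals e_i / (1 - h_ii)
press <- function(fit) {
sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}
press(fithSP2) # should match PRESS(fithSP2) above
press(fitBack) # should match PRESS(fitBack) above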
Finally we can look at the AIC and BIC statistics for both models.
# aic
AIC(fithSP2, fitBack)
df AIC
fithSP2 6 790.6225
fitBack 5 789.1366
# bic
BIC(fithSP2, fitBack)
df BIC
fithSP2 6 805.8181
fitBack 5 801.7996
Based on both the AIC and BIC statistics, model P ~ S + New + Ba is the better model.
It seems clear in this case that model P ~ S + New + Ba is the better model. Every criterion except \(R^2\) points us to that model. Given that \(R^2\) never decreases as predictor variables are added, this was to be expected.
Ultimately we know that for our second model 1) all of the predictor variables are statistically significant, which is not true of the first model, 2) it scores better on every criterion but one, and 3) there is a logical reason for removing bed from the model.
“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
First I’ll load the data. I’ll also rename the variable girth to diameter.
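A sketch of that step, using the built-in trees data set:
# load the built-in data and inspect it
data(trees)
str(trees)
# rename Girth to Diameter (per the note in the documentation quoted above)
names(trees)[names(trees) == "Girth"] <- "Diameter"
# preview the first five rows
head(trees, 5)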
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
Diameter Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
Now I’ll fit a model regressing tree volume onto diameter and height.
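A minimal sketch of the fit, stored as fitTrees (the object name used with autoplot() and bptest() below):
# regress volume on all other variables (diameter and height)
fitTrees <- lm(Volume ~ ., data = trees)
summary(fitTrees)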
Call:
lm(formula = Volume ~ ., data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 0.000000275 ***
Diameter 4.7082 0.2643 17.816 < 0.0000000000000002 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 0.00000000000000022
This appears to be quite a good model. That makes sense: we’re modeling a physical quantity (volume) that is related to diameter and height by a geometric formula.
Both diameter and height are statistically significant predictors of volume and both the \(R^2\) and \(R^2_{adj}\) values are quite high.
To further evaluate this model, I’ll use the autoplot() function from the ggfortify package to produce a series of diagnostic plots to test some specific assumptions upon which the model is based.
# create diagnostic plots
autoplot(fitTrees, which = 1:6, ncol = 3) +
theme_minimal()
Moving through the grid we have six diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook’s Distance, Residuals vs Leverage, and Cook’s Distance vs Leverage.
Based on the plots alone, I believe the assumptions of linearity and constant error variance have been violated, and that there are influential observations.
Based on the Normal Q-Q plot, I do not believe the assumption of normality has been violated. That being said, the Shapiro-Wilk test allows us to formally test that assumption.
# shapiro wilk test
shapiro.test(trees$Diameter) # diameter
Shapiro-Wilk normality test
data: trees$Diameter
W = 0.94117, p-value = 0.08893
shapiro.test(trees$Height) # height
Shapiro-Wilk normality test
data: trees$Height
W = 0.96545, p-value = 0.4034
For this test, \(H_0\): the observations are normally distributed and \(H_a\): the observations are not normally distributed.
Thus the \(P\)-values of .08893 (diameter) and .4034 (height) indicate that the observations for each variable are normally distributed (more precisely, we cannot reject the null hypothesis of normal distribution).
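It’s worth noting that the normality assumption in linear regression strictly concerns the errors rather than the predictors, so a more direct check is to test the model residuals; a sketch:
# test the residuals of the fitted model for normality
shapiro.test(residuals(fitTrees))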
Based on the Scale-Location plot, I believe that the assumption of homoscedasticity (i.e. constant variance) has been violated but I will use the Breusch-Pagan test to formally test whether that is true.
For this test, \(H_0\): homoscedasticity and \(H_a\): heteroscedasticity.
# bp test
bptest(fitTrees)
studentized Breusch-Pagan test
data: fitTrees
BP = 2.4681, df = 2, p-value = 0.2911
Somewhat surprisingly, the \(P\)-value is .2911, meaning that we cannot reject the null hypothesis of homoscedasticity at this time.
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
First I’ll load and inspect the data.
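A sketch of the loading step, assuming the county-level results are stored in a CSV with counties as row names (florida.csv is a hypothetical file name; the data frame name florida matches the calls below):
# read the data, using the county column as row names (hypothetical file name)
florida <- read.csv("florida.csv", row.names = 1)
# inspect
str(florida)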
'data.frame': 67 obs. of 3 variables:
$ Gore : int 47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
$ Bush : int 34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
$ Buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
# preview
head(florida, 5)
Gore Bush Buchanan
ALACHUA 47300 34062 262
BAKER 2392 5610 73
BAY 18850 38637 248
BRADFORD 3072 5413 65
BREVARD 97318 115185 570
Before I do anything else, I’m going to visualize the data.
# pivot longer
floridaLong <- florida %>%
rownames_to_column("County") %>%
pivot_longer(cols = c(`Gore`, `Bush`, `Buchanan`),
names_to = "Candidate",
values_to = "Votes")
# create bar plot
ggplot(floridaLong, aes(fill = Candidate, y = Votes, x = County)) +
geom_bar(position = "stack", stat = "identity") +
labs(title = "Votes by County") +
theme_minimal() +
coord_flip() +
theme(axis.text.y = element_text(hjust = .75, size = 4))
This plot shows the number of votes each candidate received in each county. It’s somewhat ridiculous because of the scale, but we can see that Buchanan received more votes in Palm Beach County than in any other county.
This perhaps alerts us to the atypical distribution of votes in Palm Beach County, but it can’t tell us much more than that.
To better answer the question of whether Palm Beach County is an outlier, I’ll fit a model regressing the Buchanan vote count onto the Bush vote count and then produce some diagnostic plots to help evaluate the model.
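A minimal sketch of the fit, stored as fitVote (the object name used with autoplot() and outlierTest() below):
# regress the Buchanan vote count on the Bush vote count
fitVote <- lm(Buchanan ~ Bush, data = florida)
summary(fitVote)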
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.2898613 54.4794230 0.831 0.409
Bush 0.0049168 0.0007644 6.432 0.0000000173 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 0.00000001727
The first thing to note is that the Bush vote count is a statistically significant predictor of the Buchanan vote count. That is, we can be confident that there is a relationship between the two variables.
That being said, both the \(R^2\) and \(R^2_{adj}\) values are quite low, indicating that this model explains relatively little of the variation in the Buchanan vote.
Now we can take a look at the diagnostic plots for this model.
# create diagnostic plots
autoplot(fitVote, which = 1:6, ncol = 3) +
theme_minimal()
Moving through the grid we again have the six diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook’s Distance, Residuals vs Leverage, and Cook’s Distance vs Leverage.
Based on the above diagnostic plots, we can say that, yes, Palm Beach County does appear to be an outlier.
We can also pass the model to a formal test for outliers, the Bonferroni outlier test provided by outlierTest() from the car package.
# bonferroni test
outlierTest(fitVote)
rstudent unadjusted p-value
PALM BEACH 24.08014 0.00000000000000000000000000000000086246
Bonferroni p
PALM BEACH 0.000000000000000000000000000000057785
This confirms our conclusion based on the diagnostic plots: Palm Beach County is an outlier. None of these tests explain why it is an outlier (that is, what happened on election night) but they do tell us that it warrants further attention.
This question isn’t ultimately concerned with fitting a good model of the Buchanan vote, but if it were, these plots suggest that we might consider refitting the model without the two outliers/influential points.
I am interested in exploring predictors of happiness. I am specifically interested in this right now because of the impact of the COVID-19 pandemic. So much advice was given about how to maintain some degree of contentedness in the midst of the sea change wrought by the pandemic and I am curious whether people who spent more or less time engaged in different activities experienced different levels of happiness.
It seems as though people generally fell into one of two camps and either leaned into the “Netflix and chill” mode of survival or hit the trails—many for the first time—to pass the time. This leads me to the two variables I’ll be looking at.
My research question is “How do the amount of TV watched and the frequency of leisure time in nature affect happiness?” I’ll be looking at data from the General Social Survey (GSS) from 2021.
The outcome variable is happiness. The question asked is “Taken all together, how would you say things are these days–would you say that you are very happy, pretty happy, or not too happy?”
My predictor variables are tvhours (the number of hours of TV watched per day) and activnat (the frequency of leisure time spent in nature).
My control variables are health (the respondent’s assessment of their own health) and satfin (the respondent’s satisfaction with their financial situation).
My null and alternative hypotheses are:
\(H_0:\) number of hours of TV watched per day has no effect on level of happiness
\(H_a:\) number of hours of TV watched per day has some effect on level of happiness
\(H_0:\) frequency of leisure time in nature has no effect on level of happiness
\(H_a:\) frequency of leisure time in nature has some effect on level of happiness
# read data
gSS <- read_excel("GSS.xlsx", sheet = "Data")
# convert to dataframe
gSS <- as.data.frame(unclass(gSS), stringsAsFactors = TRUE)
# create subset
gSS <- gSS %>%
subset(select = c("happy", "health", "satfin", "tvhours", "activnat"))
# recode the factor label "0 hours" to "0" so tvhours can be parsed as a number
gSS$tvhours <- sub("0 hours", "0", gSS$tvhours)
# convert tvhours from factor to numeric
gSS$tvhours <- as.numeric(as.character(gSS$tvhours))
# get summary
summary(gSS)
happy health satfin
Not too happy: 331 Excellent: 337 More or less satisfied:769
Pretty happy :1087 Fair : 338 Not satisfied at all :420
Very happy : 347 Good :1011 Pretty well satisfied :576
Poor : 79
tvhours activnat
Min. : 0.000 Daily :328
1st Qu.: 2.000 Never : 85
Median : 3.000 Several times a month:473
Mean : 3.378 Several times a week :528
3rd Qu.: 4.000 Several times a year :351
Max. :23.000
NA's :12
All of the variables are categorical except hours of TV watched per day, which is a discrete numerical variable (the respondent could select from a specified set of values). The most basic summary, then, is simply a count of the frequency of each value.
# create table health and happy
table(gSS$health, gSS$happy)
Not too happy Pretty happy Very happy
Excellent 32 173 132
Fair 98 209 31
Good 157 676 178
Poor 44 29 6
# create table financial satisfaction and happy
table(gSS$satfin, gSS$happy)
Not too happy Pretty happy Very happy
More or less satisfied 117 519 133
Not satisfied at all 152 226 42
Pretty well satisfied 62 342 172
# create bar plot
ggplot(gSS, aes(x = happy)) +
geom_bar(fill = "#830042") +
labs(title = "Self-Reported Happiness Level", x = NULL, y = "Count") +
coord_flip() +
theme_minimal()
# plot health and happy
ggplot(gSS, aes(x = health,
fill = happy)) +
geom_bar(position = "stack") +
labs(title = "Assessment of Health and Level of Happiness", x = "Assessment of health", y = NULL) +
coord_flip() +
theme_minimal()
# plot financial satisfaction and happy
ggplot(gSS, aes(x = satfin,
fill = happy)) +
geom_bar(position = "stack") +
labs(title = "Financial Satisfcation and Level of Happiness", x = "Satisfaction with financial situation", y = NULL) +
coord_flip() +
theme_minimal()
# plot tv and happy
ggplot(gSS, aes(x = tvhours,
fill = happy)) +
geom_bar(position = "stack") +
labs(title = "TV Watched and Level of Happiness", x = "Hours watched per day", y = NULL) +
coord_flip() +
theme_minimal()
# plot nature and happy
ggplot(gSS, aes(x = activnat,
fill = happy)) +
geom_bar(position = "stack") +
labs(title = "Leisure Time in Nature and Level of Happiness", x = "Frequency of leisure time in nature", y = NULL) +
coord_flip() +
theme_minimal()
I don’t think I’ve hit on the best way to visualize the relationship between each set of variables yet, but I’ll keep working on it as I continue my analysis.
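One option I plan to try is a proportional stacked bar chart, which makes the mix of happiness levels comparable across categories even when the group sizes differ; a sketch for the TV variable:
# proportional stacked bars: each bar sums to 1, so the mix of happiness
# levels can be compared across hours of TV watched
ggplot(gSS, aes(x = tvhours, fill = happy)) +
geom_bar(position = "fill") +
labs(title = "TV Watched and Level of Happiness",
x = "Hours watched per day", y = "Proportion") +
coord_flip() +
theme_minimal()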