Homework 4

Model Building & Evaluation | Towards My Final Project

Claire Battaglia https://rpubs.com/clairebattaglia (DACSS 603 Introduction to Quantitative Analysis)
April 14, 2022

Part 1

Question 1

  1. For backward elimination, which variable would be deleted first? Why?
  2. For forward selection, which variable would be added first? Why?
  3. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
  4. Using software with these four predictors, find the model that would be selected using each criterion:
    1. R2
    2. Adjusted R2
    3. PRESS
    4. AIC
    5. BIC
  5. Explain which model you prefer and why.

Solution

First I’ll load the data and recreate both the correlation matrix and the regression model. I find correlation plots helpful for visualizing these relationships, so I’ll also create one using the ggcorr() function from the GGally package.

# load data
data("house.selling.price.2")

# create object
hSP2 <- house.selling.price.2

# create cor matrix
round(cor(hSP2), 3)
        P     S    Be    Ba   New
P   1.000 0.899 0.590 0.714 0.357
S   0.899 1.000 0.669 0.662 0.176
Be  0.590 0.669 1.000 0.334 0.267
Ba  0.714 0.662 0.334 1.000 0.182
New 0.357 0.176 0.267 0.182 1.000
# create cor plot
ggcorr(hSP2, label = TRUE, label_round = 3)
# fit model
fithSP2 <- lm(P ~ . , hSP2)

# get summary
summary(fithSP2)

Call:
lm(formula = P ~ ., data = hSP2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453             0.000855 ***
S             64.761      5.630  11.504 < 0.0000000000000002 ***
Be            -2.766      3.960  -0.698             0.486763    
Ba            19.203      5.650   3.399             0.001019 ** 
New           18.984      3.873   4.902            0.0000043 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 0.00000000000000022

A

Backward elimination is one of the three primary automated variable selection methods. It works as follows: fit a model including all candidate predictor variables, then remove, one at a time, the least significant predictor (the one with the largest \(P\)-value above a predetermined \(\alpha\)-level), refitting the model after each removal and continuing until every remaining predictor is statistically significant.

We’ve already fit the model using all possible predictor variables (above). Let’s set the \(\alpha\)-level at .05. Every variable except bed (the number of bedrooms in the house) is statistically significant at that level, and bed has by far the largest \(P\)-value (.487), so backward elimination would lead us to remove bed first.
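
A single drop1() call makes the same comparison without refitting each candidate model by hand; it reports an F test for dropping each term from the full model. This is a quick cross-check rather than part of the original output:

# F test for dropping each term from the full model;
# Be should show the largest P-value (and the smallest increase in RSS)
drop1(fithSP2, test = "F")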

# refit model w/o be
fitBack <- lm(P ~ . - Be, hSP2)

# get summary
summary(fitBack)

Call:
lm(formula = P ~ . - Be, data = hSP2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847         0.0000000815 ***
S             62.263      4.335  14.363 < 0.0000000000000002 ***
Ba            20.072      5.495   3.653             0.000438 ***
New           18.371      3.761   4.885         0.0000045396 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 0.00000000000000022

Now every variable in the model is statistically significant. In fact, removing bed has made bath (the number of bathrooms) appear even more significant: its \(P\)-value dropped from .00102 to .00044.

Backward elimination would thus lead us to select the model with three predictor variables: size, bath, and new.

I can confirm this using the step() function.

# backward elimination
step(object = fithSP2, # specify full model
     direction = "backward") # specify backward
Start:  AIC=524.7
P ~ S + Be + Ba + New

       Df Sum of Sq   RSS    AIC
- Be    1       131 23684 523.21
<none>              23553 524.70
- Ba    1      3092 26645 534.17
- New   1      6432 29985 545.15
- S     1     35419 58972 608.06

Step:  AIC=523.21
P ~ S + Ba + New

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75

Call:
lm(formula = P ~ S + Ba + New, data = hSP2)

Coefficients:
(Intercept)            S           Ba          New  
     -47.99        62.26        20.07        18.37  

B

Forward selection is another primary method of automated variable selection. It works as follows: begin with the intercept-only model and, at each step, add the most statistically significant candidate variable (the one with the smallest \(P\)-value below the chosen \(\alpha\)-level), stopping when no remaining candidate meets that threshold.
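
Before fitting each single-predictor model by hand, add1() can rank the candidate first additions in one call. This is a sketch of that shortcut; the manual comparison below reaches the same conclusion:

# F test for adding each candidate to the intercept-only model;
# S should have the largest F statistic
fitNull <- lm(P ~ 1, data = hSP2) # intercept-only model (also refit below)
add1(fitNull, scope = ~ S + Be + Ba + New, test = "F")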

# fit model s
fitS <- lm(P ~ S, data = hSP2)

# fit model be
fitBe <- lm(P ~ Be, data = hSP2)

# fit model ba
fitBa <- lm(P ~ Ba, data = hSP2)

# fit model new
fitN <- lm(P ~ New, data = hSP2)

# view all to compare
stargazer(fitS, fitBe, fitBa, fitN,
          type = "text",
          report = ("vctp*"))

===================================================================================
                                               Dependent variable:                 
                              -----------------------------------------------------
                                                        P                          
                                   (1)          (2)          (3)           (4)     
-----------------------------------------------------------------------------------
S                                75.607                                            
                               t = 19.561                                          
                              p = 0.000***                                         
                                                                                   
Be                                             42.969                              
                                             t = 6.976                             
                                            p = 0.000***                           
                                                                                   
Ba                                                          76.026                 
                                                          t = 9.720                
                                                         p = 0.000***              
                                                                                   
New                                                                      34.158    
                                                                        t = 3.641  
                                                                      p = 0.0005***
                                                                                   
Constant                         -25.194      -37.229      -49.248       89.249    
                               t = -3.767    t = -1.866   t = -3.148   t = 17.336  
                              p = 0.0003***  p = 0.066*  p = 0.003*** p = 0.000*** 
                                                                                   
-----------------------------------------------------------------------------------
Observations                       93            93           93           93      
R2                                0.808        0.348        0.509         0.127    
Adjusted R2                       0.806        0.341        0.504         0.118    
Residual Std. Error (df = 91)    19.473        35.861       31.119       41.506    
F Statistic (df = 1; 91)       382.628***    48.660***    94.473***     13.254***  
===================================================================================
Note:                                                   *p<0.1; **p<0.05; ***p<0.01

We can see that each variable, on its own, is a statistically significant predictor of price. While stargazer() doesn’t display the full \(P\)-values, size has by far the largest \(t\) test statistic (and the highest \(R^2\)), so that’s the variable forward selection would add first. The correlation matrix/plot tells the same story: size is the variable most highly correlated with price.
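
If we want the exact \(P\)-values rather than stargazer’s rounded display, we can pull them from each fit directly. A quick sketch using the objects created above:

# slope P-value from each single-predictor fit
sapply(list(S = fitS, Be = fitBe, Ba = fitBa, New = fitN),
       function(m) summary(m)$coefficients[2, "Pr(>|t|)"])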

Once again, I can confirm this using the step() function.

fitNull <- lm(P ~ 1, data = hSP2) # fit null model
step(object = fitNull, # pass null model to step function
     direction = "forward", # specify forward
     scope = P ~ S + Be + Ba + New) # specify maximum model
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq   RSS    AIC
+ New   1    7274.7 27234 534.20
+ Ba    1    4475.6 30033 543.30
<none>              34508 554.22
+ Be    1      40.4 34468 556.11

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq   RSS    AIC
+ Ba    1    3550.1 23684 523.21
+ Be    1     588.8 26645 534.17
<none>              27234 534.20

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1    130.55 23553 524.70

Call:
lm(formula = P ~ S + New + Ba, data = hSP2)

Coefficients:
(Intercept)            S          New           Ba  
     -47.99        62.26        18.37        20.07  

In this case, forward selection gives us the same model as backward elimination:

\[\widehat{P}=-47.99+62.26 \cdot \text{size}+18.37 \cdot \text{new}+20.07 \cdot \text{bath}\]

C

The correlation between bed and price is .590, indicating a fairly strong relationship between the two variables. Yet in the multiple regression model, the \(P\)-value for bed is .487, meaning that once the other predictors are included we cannot conclude that its coefficient differs from zero. While these two statistics appear to nudge us towards different conclusions about the relationship, I believe what we’re actually seeing is the effect of bed being correlated with the other predictor variables in the model.

Bed and size are even more highly correlated (.669) than bed and price (.590), and size and price are very highly correlated (.899). Houses with more bedrooms tend to be larger, so most of the information bed carries about price is already captured by size; once size (and bath) are in the model, bed adds little additional explanatory power. A similar, albeit weaker, overlap exists between bed and bath.
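
One way to quantify this overlap is with variance inflation factors. Assuming the car package is loaded (it also supplies outlierTest(), used in Question 3), vif() reports how much each coefficient’s variance is inflated by correlation with the other predictors; this is a supplementary check, not part of the original output:

# variance inflation factors for the full model;
# values well above 1 signal overlap with the other predictors
vif(fithSP2)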

D

We already have two models: our initial model with all possible predictor variables and the model we arrived at using backward elimination and forward selection. I’ll use stepwise selection to see whether it leads us to a third model.

step(object = fitNull, # pass null model to step function
     direction = "both", # specify both for stepwise selection
     scope = P ~ S + Be + Ba + New) # specify maximum model
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq    RSS    AIC
+ New   1      7275  27234 534.20
+ Ba    1      4476  30033 543.30
<none>               34508 554.22
+ Be    1        40  34468 556.11
- S     1    145097 179606 705.63

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq    RSS    AIC
+ Ba    1      3550  23684 523.21
+ Be    1       589  26645 534.17
<none>               27234 534.20
- New   1      7275  34508 554.22
- S     1    129539 156772 694.99

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1       131 23553 524.70
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75

Call:
lm(formula = P ~ S + New + Ba, data = hSP2)

Coefficients:
(Intercept)            S          New           Ba  
     -47.99        62.26        18.37        20.07  

Stepwise selection leads us to the same model as backward elimination and forward selection, so I’ll be comparing two models:

  1. P ~ S + Be + Ba + New
  2. P ~ S + New + Ba

Let’s look at \(R^2\) first.

  1. For model P ~ S + Be + Ba + New, \(R^2\)=.8689
  2. For model P ~ S + New + Ba, \(R^2\)=.8681

Based on the \(R^2\) values of these two models, model P ~ S + Be + Ba + New is the better model.

Next we can look at the \(R^2_{adj}\) values for each model.

  1. For model P ~ S + Be + Ba + New, \(R^2_{adj}\)=.8629
  2. For model P ~ S + New + Ba, \(R^2_{adj}\)=.8637

Based on the \(R^2_{adj}\) values, model P ~ S + New + Ba is the better model.

Next we can look at the PRESS (predicted residual sum of squares) statistic for each model. PRESS sums the squared leave-one-out prediction errors, so smaller values are better.

# calculate press
PRESS(fithSP2) # for P ~ S + Be + Ba + New
[1] 28390.22
PRESS(fitBack) # for P ~ S + New + Ba
[1] 27860.05
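
The PRESS() helper used above comes from an add-on package. As a cross-check, the same statistic can be computed directly from each model’s residuals and hat values; a minimal sketch:

# PRESS = sum of squared leave-one-out prediction errors
press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)
press(fithSP2) # P ~ S + Be + Ba + New
press(fitBack) # P ~ S + New + Ba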

Based on the PRESS statistic for each model, model P ~ S + New + Ba is the better model.

Finally we can look at the AIC and BIC statistics for both models. For both criteria, lower values are better; BIC penalizes additional parameters more heavily than AIC.

# aic
AIC(fithSP2, fitBack)
        df      AIC
fithSP2  6 790.6225
fitBack  5 789.1366
# bic
BIC(fithSP2, fitBack)
        df      BIC
fithSP2  6 805.8181
fitBack  5 801.7996

Based on both the AIC and BIC statistics, model P ~ S + New + Ba is the better model.

E

It seems clear in this case that model P ~ S + New + Ba is the better model. Every criterion except \(R^2\) points us to that model. Given that \(R^2\) can never decrease when a predictor is added, this was to be expected.

Ultimately, I prefer the second model because 1) all of its predictor variables are statistically significant, which is not true of the first model, 2) it scores better on every criterion except \(R^2\), and 3) there is a logical (multicollinearity) reason for removing bed from the model.

Question 2

“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

  1. fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
  2. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Solution

First I’ll load the data. I’ll also rename the variable girth to diameter.

# load data
data("trees")

# inspect the structure
str(trees)
'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# rename girth to diameter
trees <- trees %>%
  rename(Diameter = Girth)

# preview
head(trees, 5)
  Diameter Height Volume
1      8.3     70   10.3
2      8.6     65   10.3
3      8.8     63   10.2
4     10.5     72   16.4
5     10.7     81   18.8

A

Now I’ll fit a model regressing tree volume onto diameter and height.

# fit model
fitTrees <- lm(Volume ~ . , data = trees)

# get summary
summary(fitTrees)

Call:
lm(formula = Volume ~ ., data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713          0.000000275 ***
Diameter      4.7082     0.2643  17.816 < 0.0000000000000002 ***
Height        0.3393     0.1302   2.607               0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 0.00000000000000022

This appears to be quite a good model, which makes sense: tree volume is a physical quantity that can be computed (approximately) from a formula involving exactly the measurements we have.

Both diameter and height are statistically significant predictors of volume and both the \(R^2\) and \(R^2_{adj}\) values are quite high.

B

To further evaluate this model, I’ll use the autoplot() function from the ggfortify package to produce a series of diagnostic plots to test some specific assumptions upon which the model is based.

# create diagnostic plots
autoplot(fitTrees, which = 1:6, ncol = 3) +
  theme_minimal()

Taking the six plots in order, we have:

  1. Residuals vs Fitted: This plot allows us to assess whether the assumptions of linearity and constant error variance have been violated. We would like to see the residuals scattered randomly around a horizontal line, with no discernible pattern. We do not see that here: the curvature suggests that a nonlinear model may be more appropriate, and the spread of the residuals does not look constant.
  2. Normal Q-Q: This plot helps us get a sense of whether the assumption of normality has been violated. We would like to see the points fall more or less along the diagonal line. That holds mostly true for this model, suggesting that the residuals are approximately normally distributed.
  3. Scale-Location: This plot diagnoses whether the assumption of constant variance has been violated. As with the Residuals vs Fitted plot, we would like to see the points scattered randomly around a horizontal line. We do not see that here.
  4. Cook’s distance: This plot helps us identify highly influential observations, i.e. observations whose removal would noticeably change the fitted model. Using the common \(4/n\) rule of thumb with \(n\)=31, we’re concerned about Cook’s distance values greater than \(4/31\), or about 0.129. Several observations exceed that cutoff (the sketch after this list identifies them by index), indicating that the dataset contains some highly influential observations.
  5. Residuals vs Leverage: This plot allows us to identify observations with high leverage (unusual \(x\) values). A leverage value greater than \(2(p + 1)/n\) is considered high, so for this dataset the cutoff is \(2(2 + 1)/31\), or about 0.194. This dataset appears to contain at least one high-leverage observation.
  6. Cook’s dist vs Leverage: This plot shows observations that have a high Cook’s distance and/or high leverage. Again, we have at least one observation with both a high Cook’s distance and high leverage, and a couple with high leverage.
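
To see exactly which trees those panels flag, we can compute the rule-of-thumb cutoffs directly. A quick sketch using the thresholds described above:

# flag observations that exceed the rule-of-thumb cutoffs
n <- nrow(trees) # 31 observations
p <- 2           # two predictors
which(cooks.distance(fitTrees) > 4 / n)      # high Cook's distance
which(hatvalues(fitTrees) > 2 * (p + 1) / n) # high leverage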

Based on the plots alone, I believe the assumptions of linearity and constant error variance are in question, and the dataset contains several influential observations.

Based on the Normal Q-Q plot, I do not believe the assumption of normality has been violated. That being said, the Shapiro-Wilk test allows us to test that assumption formally.

# shapiro wilk test
shapiro.test(trees$Diameter) # diameter

    Shapiro-Wilk normality test

data:  trees$Diameter
W = 0.94117, p-value = 0.08893
shapiro.test(trees$Height) # height

    Shapiro-Wilk normality test

data:  trees$Height
W = 0.96545, p-value = 0.4034

For this test, \(H_0\): the observations are normally distributed and \(H_a\): the observations are not normally distributed.

Thus the \(P\)-values of .08893 (diameter) and .4034 (height) indicate that the observations for each variable are normally distributed (more precisely, we cannot reject the null hypothesis of normal distribution).
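
Strictly speaking, the normality assumption in regression concerns the errors rather than the raw variables, so a more direct check (output not shown here) would be to apply the same test to the model residuals:

# Shapiro-Wilk test on the residuals of the fitted model
shapiro.test(residuals(fitTrees))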

Based on the Scale-Location plot, I believe that the assumption of homoscedasticity (i.e. constant variance) has been violated, but I will use the Breusch-Pagan test (bptest() from the lmtest package) to test formally whether that is true.

For this test, \(H_0\): homoscedasticity and \(H_a\): heteroscedasticity.

# bp test
bptest(fitTrees)

    studentized Breusch-Pagan test

data:  fitTrees
BP = 2.4681, df = 2, p-value = 0.2911

Somewhat surprisingly, the \(P\)-value is .2911, so we cannot reject the null hypothesis of homoscedasticity.
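
Given the curvature in the Residuals vs Fitted plot and the geometry of the problem (volume scales roughly with diameter squared times height), one natural refinement to try would be a log-log specification. A sketch; results not shown:

# log-log model implied by volume ~ c * diameter^2 * height
fitTreesLog <- lm(log(Volume) ~ log(Diameter) + log(Height), data = trees)
summary(fitTreesLog)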

Question 3

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Solution

First I’ll load and inspect the data.

# load data
data("florida")

# inspect the structure
str(florida)
'data.frame':   67 obs. of  3 variables:
 $ Gore    : int  47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
 $ Bush    : int  34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
 $ Buchanan: int  262 73 248 65 570 789 90 182 270 186 ...
# preview
head(florida, 5)
          Gore   Bush Buchanan
ALACHUA  47300  34062      262
BAKER     2392   5610       73
BAY      18850  38637      248
BRADFORD  3072   5413       65
BREVARD  97318 115185      570

Before I do anything else, I’m going to visualize the data.

# pivot longer
floridaLong <- florida %>%
  rownames_to_column("County") %>%
  pivot_longer(cols = c(`Gore`, `Bush`, `Buchanan`),
               names_to = "Candidate",
               values_to = "Votes")

# create bar plot
ggplot(floridaLong, aes(fill = Candidate, y = Votes, x = County)) +
  geom_bar(position = "stack", stat = "identity") +
  labs(title = "Votes by County") + 
  theme_minimal() +
  coord_flip() +
  theme(axis.text.y = element_text(hjust = .75, size = 4))

This plot shows the number of votes each candidate received in each county. The scale makes it hard to read in detail, but we can see that Buchanan received more votes in Palm Beach County than in any other county.

This alerts us that the distribution of votes in Palm Beach County may be atypical, but it can’t tell us much more than that.

To better answer the question of whether Palm Beach County is an outlier, I’ll fit a model regressing the Buchanan vote count on the Bush vote count and then produce some diagnostic plots to help evaluate the model.

# fit model
fitVote <- lm(Buchanan ~ Bush, data = florida)

# get summary
summary(fitVote)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
              Estimate Std. Error t value     Pr(>|t|)    
(Intercept) 45.2898613 54.4794230   0.831        0.409    
Bush         0.0049168  0.0007644   6.432 0.0000000173 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 0.00000001727

The first thing to note is that the Bush vote count is a statistically significant predictor of the Buchanan vote count. That is, we can be confident that there is a relationship between the two variables.

That being said, both the \(R^2\) and \(R^2_{adj}\) values are quite low, indicating that this model explains relatively little of the variation in the Buchanan vote.

Now we can take a look at the diagnostic plots for this model.

# create diagnostic plots
autoplot(fitVote, which = 1:6, ncol = 3) +
  theme_minimal()

Taking the six plots in order, we have:

  1. Residuals vs Fitted: The residual for Palm Beach County is extremely large, indicating that the observed Buchanan vote was far greater than the value the model predicts. It falls well outside the pattern of the other residuals, suggesting that it is an outlier.
  2. Normal Q-Q: The residuals appear to be roughly normally distributed with the exception of Palm Beach County, which falls quite far outside the distribution.
  3. Scale-Location: We see both an increasing trend and increasing variance, indicating that the assumption of homoscedasticity has been violated. And, again, Palm Beach County appears to be an outlier.
  4. Cook’s distance: This plot indicates that Miami-Dade and Palm Beach counties are both influential observations. That is, removing either of them would change the fitted model in a meaningful way.
  5. Residuals vs Leverage: This plot indicates that the Palm Beach County observation is influential.
  6. Cook’s dist vs Leverage: Again, we can see that Palm Beach County has a high Cook’s distance and that Miami-Dade County has both a high Cook’s distance and high leverage.

Based on the above diagnostic plots, we can say that, yes, Palm Beach County does appear to be an outlier.

We can also run a formal test for outliers on the model: the Bonferroni outlier test, via outlierTest() from the car package.

# bonferroni test
outlierTest(fitVote)
           rstudent                       unadjusted p-value
PALM BEACH 24.08014 0.00000000000000000000000000000000086246
                                     Bonferroni p
PALM BEACH 0.000000000000000000000000000000057785

This confirms our conclusion based on the diagnostic plots: Palm Beach County is an outlier. None of these tests explain why it is an outlier (that is, what happened on election night) but they do tell us that it warrants further attention.

This question isn’t ultimately concerned with fitting a good model to predict the Buchanan vote, but if it were, these plots suggest that we might consider refitting the model without the two influential counties.
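
A quick way to do that without hard-coding county names is to drop the observations whose Cook’s distance exceeds the \(4/n\) rule of thumb and refit. A sketch; results not shown:

# refit after excluding the counties flagged by Cook's distance
keep <- cooks.distance(fitVote) <= 4 / nrow(florida)
fitVoteTrim <- lm(Buchanan ~ Bush, data = florida[keep, ])
summary(fitVoteTrim)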

Part 2

  1. What is your research question for the final project?
  2. What is your hypothesis (i.e. an answer to the research question) that you want to test?
  3. Present some exploratory analysis. In particular:
    1. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).
    2. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

I am interested in exploring predictors of happiness, specifically in light of the COVID-19 pandemic. A great deal of advice was given about how to maintain some degree of contentment amid the upheaval the pandemic brought, and I am curious whether people who spent more or less time on different activities reported different levels of happiness.

It seems as though people generally fell into one of two camps and either leaned into the “Netflix and chill” mode of survival or hit the trails—many for the first time—to pass the time. This leads me to the two variables I’ll be looking at.

My research question is “How do the amount of TV watched and the frequency of leisure time in nature affect happiness?” I’ll be looking at data from the 2021 General Social Survey (GSS).

The outcome variable is happiness. The question asked is “Taken all together, how would you say things are these days–would you say that you are very happy, pretty happy, or not too happy?”

My predictor variables are:

  1. Hours per day watching TV: “On the average day, about how many hours do you personally watch television?”
  2. Outside leisure activity in the past 12 months: “In the last twelve months how often, if at all, have you engaged in any leisure activities outside in nature, such as hiking, bird watching, swimming, skiing, other outdoor activities, or just relaxing?”

My control variables are:

  1. Condition of health: “Would you say your own health, in general, is excellent, good, fair, or poor?”
  2. Satisfaction with financial situation: “So far as you and your family are concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied at all?”

My null and alternative hypotheses are:

\(H_0:\) number of hours of TV watched per day has no effect on level of happiness

\(H_a:\) number of hours of TV watched per day has some effect on level of happiness

\(H_0:\) frequency of leisure time in nature has no effect on level of happiness

\(H_a:\) frequency of leisure time in nature has some effect on level of happiness

# read data
gSS <- read_excel("GSS.xlsx", sheet = "Data")

# convert to dataframe
gSS <- as.data.frame(unclass(gSS), stringsAsFactors = TRUE)
# create subset
gSS <- gSS %>%
  subset(select = c("happy", "health", "satfin", "tvhours", "activnat"))

# convert 0 hours to 0
gSS$tvhours <- sub("0 hours", "0", gSS$tvhours)

# convert tvhours from factor to numeric
gSS$tvhours <- as.numeric(as.character(gSS$tvhours))

# get summary
summary(gSS)
           happy            health                        satfin   
 Not too happy: 331   Excellent: 337   More or less satisfied:769  
 Pretty happy :1087   Fair     : 338   Not satisfied at all  :420  
 Very happy   : 347   Good     :1011   Pretty well satisfied :576  
                      Poor     :  79                               
                                                                   
                                                                   
                                                                   
    tvhours                        activnat  
 Min.   : 0.000   Daily                :328  
 1st Qu.: 2.000   Never                : 85  
 Median : 3.000   Several times a month:473  
 Mean   : 3.378   Several times a week :528  
 3rd Qu.: 4.000   Several times a year :351  
 Max.   :23.000                              
 NA's   :12                                  

All of the variables are categorical except hours of TV watched per day, which is a discrete numerical variable; for the categorical variables, the respondent selects from a specified set of response options. The most basic summary, then, is simply a count of the frequency of each value.

# create table health and happy
table(gSS$health, gSS$happy)
           
            Not too happy Pretty happy Very happy
  Excellent            32          173        132
  Fair                 98          209         31
  Good                157          676        178
  Poor                 44           29          6
# create table financial satisfaction and happy
table(gSS$satfin, gSS$happy)
                        
                         Not too happy Pretty happy Very happy
  More or less satisfied           117          519        133
  Not satisfied at all             152          226         42
  Pretty well satisfied             62          342        172
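
Because the group sizes differ quite a bit, row proportions can make these tables easier to compare. A quick sketch; output not shown:

# share of each health / financial-satisfaction group at each happiness level
round(prop.table(table(gSS$health, gSS$happy), margin = 1), 2)
round(prop.table(table(gSS$satfin, gSS$happy), margin = 1), 2)
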
# create bar plot
ggplot(gSS, aes(x = happy)) +
  geom_bar(fill = "#830042") +
  labs(title = "Self-Reported Happiness Level", x = NULL, y = "Count") +
  coord_flip() +
  theme_minimal()

# plot health and happy
ggplot(gSS, aes(x = health,
                fill = happy)) +
  geom_bar(position = "stack") +
  labs(title = "Assessment of Health and Level of Happiness", x = "Assessment of health", y = NULL) +
  coord_flip() +
  theme_minimal()
# plot financial satisfaction and happy
ggplot(gSS, aes(x = satfin,
                fill = happy)) +
  geom_bar(position = "stack") +
  labs(title = "Financial Satisfcation and Level of Happiness", x = "Satisfaction with financial situation", y = NULL) +
  coord_flip() +
  theme_minimal()
# plot tv and happy
ggplot(gSS, aes(x = tvhours,
                fill = happy)) +
  geom_bar(position = "stack") +
  labs(title = "TV Watched and Level of Happiness", x = "Hours watched per day", y = NULL) +
  coord_flip() +
  theme_minimal()
# plot nature and happy
ggplot(gSS, aes(x = activnat,
                fill = happy)) +
  geom_bar(position = "stack") +
  labs(title = "Leisure Time in Nature and Level of Happiness", x = "Frequency of leisure time in nature", y = NULL) +
  coord_flip() +
  theme_minimal()

I don’t think I’ve hit on the best way to visualize the relationship between each set of variables yet but I’ll keep working on it as I continue my analysis.
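
One option I plan to try is a proportional stacked bar, which shows the share of each happiness level within each category rather than raw counts. A sketch, reusing the variables above:

# proportional stacked bars: happiness share within each activnat category
ggplot(gSS, aes(x = activnat, fill = happy)) +
  geom_bar(position = "fill") +
  labs(title = "Leisure Time in Nature and Level of Happiness",
       x = "Frequency of leisure time in nature",
       y = "Proportion of respondents") +
  coord_flip() +
  theme_minimal()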