Activity03 - Multiple Linear Regression

Part 1

Load the necessary packages

We will continue using the two packages from Posit (formerly RStudio): {tidyverse} and {tidymodels}.
Let’s load these packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

theme_set(ggthemes::theme_few())

Since we will be looking at many relationships graphically, it will be nice to not have to code each of these individually. {GGally} is an extension to {ggplot2} that reduces some of the complexities when combining multiple plots. For example, GGally::ggpairs is very handy for pairwise comparisons of multiple variables.

Load the data

We are going to download the data first and then load it into the R session.
The data we are working with is from the OpenIntro site (its “about” page: https://www.openintro.org/data/index.php?data=hfi).
We can access the raw data from their tab-delimited text file link: https://www.openintro.org/data/tab-delimited/hfi.txt.

Create a new R code chunk below that is titled load-data and reads in the above linked TSV (tab-separated values) file by doing the following:

Start by downloading this file, uploading to RStudio, then reading it in.
Assign this data set into a data frame named hfi (short for “Human Freedom Index”).
Filter the data hfi data frame for year 2016 and assigns the result to an R data object named hfi_2016. We will use hfi_2016 for the remainder of this activity.

# Load the data
hfi <- read_csv("hfi.csv")

## Rows: 1458 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): ISO_code, countries, region
## dbl (120): year, pf_rol_procedural, pf_rol_civil, pf_rol_criminal, pf_rol, p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filter the data for the year 2016
hfi_2016 <- hfi %>% filter(year == 2016)

Get the characteristics of the dataset

# Getting the characteristics of the dataset
glimpse(hfi_2016)

## Rows: 162
## Columns: 123
## $ year                               <dbl> 2016, 2016, 2016, 2016, 2016, 2016,…
## $ ISO_code                           <chr> "ALB", "DZA", "AGO", "ARG", "ARM", …
## $ countries                          <chr> "Albania", "Algeria", "Angola", "Ar…
## $ region                             <chr> "Eastern Europe", "Middle East & No…
## $ pf_rol_procedural                  <dbl> 6.661503, NA, NA, 7.098483, NA, 8.4…
## $ pf_rol_civil                       <dbl> 4.547244, NA, NA, 5.791960, NA, 7.5…
## $ pf_rol_criminal                    <dbl> 4.666508, NA, NA, 4.343930, NA, 7.3…
## $ pf_rol                             <dbl> 5.291752, 3.819566, 3.451814, 5.744…
## $ pf_ss_homicide                     <dbl> 8.920429, 9.456254, 8.060260, 7.622…
## $ pf_ss_disappearances_disap         <dbl> 10, 10, 5, 10, 10, 10, 10, 10, 10, …
## $ pf_ss_disappearances_violent       <dbl> 10.000000, 9.294030, 10.000000, 10.…
## $ pf_ss_disappearances_organized     <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_ss_disappearances_fatalities    <dbl> 10.000000, 9.926119, 10.000000, 10.…
## $ pf_ss_disappearances_injuries      <dbl> 10.000000, 9.990149, 10.000000, 9.9…
## $ pf_ss_disappearances               <dbl> 10.000000, 8.842060, 8.500000, 9.49…
## $ pf_ss_women_fgm                    <dbl> 10.0, 10.0, 10.0, 10.0, 10.0, 10.0,…
## $ pf_ss_women_missing                <dbl> 7.5, 7.5, 10.0, 10.0, 5.0, 10.0, 10…
## $ pf_ss_women_inheritance_widows     <dbl> 5, 0, 5, 10, 10, 10, 10, 5, NA, 0, …
## $ pf_ss_women_inheritance_daughters  <dbl> 5, 0, 5, 10, 10, 10, 10, 10, NA, 0,…
## $ pf_ss_women_inheritance            <dbl> 5.0, 0.0, 5.0, 10.0, 10.0, 10.0, 10…
## $ pf_ss_women                        <dbl> 7.500000, 5.833333, 8.333333, 10.00…
## $ pf_ss                              <dbl> 8.806810, 8.043882, 8.297865, 9.040…
## $ pf_movement_domestic               <dbl> 5, 5, 0, 10, 5, 10, 10, 5, 10, 10, …
## $ pf_movement_foreign                <dbl> 10, 5, 5, 10, 5, 10, 10, 5, 10, 5, …
## $ pf_movement_women                  <dbl> 5, 5, 10, 10, 10, 10, 10, 5, NA, 5,…
## $ pf_movement                        <dbl> 6.666667, 5.000000, 5.000000, 10.00…
## $ pf_religion_estop_establish        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop_operate          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop                  <dbl> 10.0, 5.0, 10.0, 7.5, 5.0, 10.0, 10…
## $ pf_religion_harassment             <dbl> 9.566667, 6.873333, 8.904444, 9.037…
## $ pf_religion_restrictions           <dbl> 8.011111, 2.961111, 7.455556, 6.850…
## $ pf_religion                        <dbl> 9.192593, 4.944815, 8.786667, 7.795…
## $ pf_association_association         <dbl> 10.0, 5.0, 2.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association_assembly            <dbl> 10.0, 5.0, 2.5, 10.0, 7.5, 10.0, 10…
## $ pf_association_political_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political_operate   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political           <dbl> 10.0, 5.0, 2.5, 5.0, 5.0, 10.0, 10.…
## $ pf_association_prof_establish      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof_operate        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof                <dbl> 10.0, 5.0, 5.0, 7.5, 5.0, 10.0, 10.…
## $ pf_association_sport_establish     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport_operate       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport               <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association                     <dbl> 10.0, 5.0, 4.0, 7.5, 6.5, 10.0, 10.…
## $ pf_expression_killed               <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_jailed               <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_influence            <dbl> 5.0000000, 2.6666667, 2.6666667, 5.…
## $ pf_expression_control              <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75,…
## $ pf_expression_cable                <dbl> 10.0, 10.0, 7.5, 10.0, 7.5, 10.0, 1…
## $ pf_expression_newspapers           <dbl> 10.0, 7.5, 5.0, 10.0, 7.5, 10.0, 10…
## $ pf_expression_internet             <dbl> 10.0, 7.5, 7.5, 10.0, 7.5, 10.0, 10…
## $ pf_expression                      <dbl> 8.607143, 7.380952, 6.452381, 8.738…
## $ pf_identity_legal                  <dbl> 0, NA, 10, 10, 7, 7, 10, 0, NA, NA,…
## $ pf_identity_parental_marriage      <dbl> 10, 0, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental_divorce       <dbl> 10, 5, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental               <dbl> 10.0, 2.5, 10.0, 10.0, 10.0, 10.0, …
## $ pf_identity_sex_male               <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex_female             <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex                    <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_divorce                <dbl> 5, 0, 10, 10, 5, 10, 10, 5, NA, 0, …
## $ pf_identity                        <dbl> 6.2500000, 0.8333333, 7.5000000, 10…
## $ pf_score                           <dbl> 7.596281, 5.281772, 6.111324, 8.099…
## $ pf_rank                            <dbl> 57, 147, 117, 42, 84, 11, 8, 131, 6…
## $ ef_government_consumption          <dbl> 8.232353, 2.150000, 7.600000, 5.335…
## $ ef_government_transfers            <dbl> 7.509902, 7.817129, 8.886739, 6.048…
## $ ef_government_enterprises          <dbl> 8, 0, 0, 6, 8, 10, 10, 0, 7, 10, 7,…
## $ ef_government_tax_income           <dbl> 9, 7, 10, 7, 5, 5, 4, 9, 10, 10, 8,…
## $ ef_government_tax_payroll          <dbl> 7, 2, 9, 1, 5, 5, 3, 4, 10, 10, 8, …
## $ ef_government_tax                  <dbl> 8.0, 4.5, 9.5, 4.0, 5.0, 5.0, 3.5, …
## $ ef_government                      <dbl> 7.935564, 3.616782, 6.496685, 5.346…
## $ ef_legal_judicial                  <dbl> 2.6682218, 4.1867042, 1.8431292, 3.…
## $ ef_legal_courts                    <dbl> 3.145462, 4.327113, 1.974566, 2.930…
## $ ef_legal_protection                <dbl> 4.512228, 4.689952, 2.512364, 4.255…
## $ ef_legal_military                  <dbl> 8.333333, 4.166667, 3.333333, 7.500…
## $ ef_legal_integrity                 <dbl> 4.166667, 5.000000, 4.166667, 3.333…
## $ ef_legal_enforcement               <dbl> 4.3874441, 4.5075380, 2.3022004, 3.…
## $ ef_legal_restrictions              <dbl> 6.485287, 6.626692, 5.455882, 6.857…
## $ ef_legal_police                    <dbl> 6.933500, 6.136845, 3.016104, 3.385…
## $ ef_legal_crime                     <dbl> 6.215401, 6.737383, 4.291197, 4.133…
## $ ef_legal_gender                    <dbl> 0.9487179, 0.8205128, 0.8461538, 0.…
## $ ef_legal                           <dbl> 5.071814, 4.690743, 2.963635, 3.904…
## $ ef_money_growth                    <dbl> 8.986454, 6.955962, 9.385679, 5.233…
## $ ef_money_sd                        <dbl> 9.484575, 8.339152, 4.986742, 5.224…
## $ ef_money_inflation                 <dbl> 9.743600, 8.720460, 3.054000, 2.000…
## $ ef_money_currency                  <dbl> 10, 5, 5, 10, 10, 10, 10, 5, 0, 10,…
## $ ef_money                           <dbl> 9.553657, 7.253894, 5.606605, 5.614…
## $ ef_trade_tariffs_revenue           <dbl> 9.626667, 8.480000, 8.993333, 6.060…
## $ ef_trade_tariffs_mean              <dbl> 9.24, 6.22, 7.72, 7.26, 8.76, 9.50,…
## $ ef_trade_tariffs_sd                <dbl> 8.0240, 5.9176, 4.2544, 5.9448, 8.0…
## $ ef_trade_tariffs                   <dbl> 8.963556, 6.872533, 6.989244, 6.421…
## $ ef_trade_regulatory_nontariff      <dbl> 5.574481, 4.962589, 3.132738, 4.466…
## $ ef_trade_regulatory_compliance     <dbl> 9.4053278, 0.0000000, 0.9171598, 5.…
## $ ef_trade_regulatory                <dbl> 7.489905, 2.481294, 2.024949, 4.811…
## $ ef_trade_black                     <dbl> 10.00000, 5.56391, 10.00000, 0.0000…
## $ ef_trade_movement_foreign          <dbl> 6.306106, 3.664829, 2.946919, 5.358…
## $ ef_trade_movement_capital          <dbl> 4.6153846, 0.0000000, 3.0769231, 0.…
## $ ef_trade_movement_visit            <dbl> 8.2969231, 1.1062564, 0.1106256, 7.…
## $ ef_trade_movement                  <dbl> 6.406138, 1.590362, 2.044823, 4.697…
## $ ef_trade                           <dbl> 8.214900, 4.127025, 5.264754, 3.982…
## $ ef_regulation_credit_ownership     <dbl> 5, 0, 8, 5, 10, 10, 8, 5, 10, 10, 5…
## $ ef_regulation_credit_private       <dbl> 7.295687, 5.301526, 9.194715, 4.259…
## $ ef_regulation_credit_interest      <dbl> 9, 10, 4, 7, 10, 10, 10, 9, 10, 10,…
## $ ef_regulation_credit               <dbl> 7.098562, 5.100509, 7.064905, 5.419…
## $ ef_regulation_labor_minwage        <dbl> 5.566667, 5.566667, 8.900000, 2.766…
## $ ef_regulation_labor_firing         <dbl> 5.396399, 3.896912, 2.656198, 2.191…
## $ ef_regulation_labor_bargain        <dbl> 6.234861, 5.958321, 5.172987, 3.432…
## $ ef_regulation_labor_hours          <dbl> 8, 6, 4, 10, 10, 10, 6, 6, 8, 8, 10…
## $ ef_regulation_labor_dismissal      <dbl> 6.299741, 7.755176, 6.632764, 2.517…
## $ ef_regulation_labor_conscription   <dbl> 10, 1, 0, 10, 0, 10, 3, 1, 10, 10, …
## $ ef_regulation_labor                <dbl> 6.916278, 5.029513, 4.560325, 5.151…
## $ ef_regulation_business_adm         <dbl> 6.072172, 3.722341, 2.758428, 2.404…
## $ ef_regulation_business_bureaucracy <dbl> 6.000000, 1.777778, 1.333333, 6.666…
## $ ef_regulation_business_start       <dbl> 9.713864, 9.243070, 8.664627, 9.122…
## $ ef_regulation_business_bribes      <dbl> 4.050196, 3.765515, 1.945540, 3.260…
## $ ef_regulation_business_licensing   <dbl> 7.324582, 8.523503, 8.096776, 5.253…
## $ ef_regulation_business_compliance  <dbl> 7.074366, 7.029528, 6.782923, 6.508…
## $ ef_regulation_business             <dbl> 6.705863, 5.676956, 4.930271, 5.535…
## $ ef_regulation                      <dbl> 6.906901, 5.268992, 5.518500, 5.369…
## $ ef_score                           <dbl> 7.54, 4.99, 5.17, 4.84, 7.57, 7.98,…
## $ ef_rank                            <dbl> 34, 159, 155, 160, 29, 10, 27, 106,…
## $ hf_score                           <dbl> 7.568140, 5.135886, 5.640662, 6.469…
## $ hf_rank                            <dbl> 48, 155, 142, 107, 57, 4, 16, 130, …
## $ hf_quartile                        <dbl> 2, 4, 4, 3, 2, 1, 1, 4, 2, 2, 4, 2,…

We will continue using personal freedom scores, pf_score, as the response variable and build on our model that had pf_expression_control as the explanatory variable.

We will use pf_movement and pf_religion as additional numeric variables (They have <dbl> or <int> designations)

Pairwise relationships

In Activity 2 we explored simple linear regression models.
Specifically, we fit and assessed this relationship:

\[ y = \beta_0 + \beta_1 \times x + \varepsilon \]

# I have identified 'pf_movement_control' and 'pf_religion_control' as additional numeric variables
selected_variables <- hfi_2016 %>%
  select(pf_score, pf_expression_control, pf_movement, pf_religion)

# Summary statistics and distributions of the selected numeric variables
summary(selected_variables)

##     pf_score     pf_expression_control  pf_movement      pf_religion   
##  Min.   :2.167   Min.   :0.250         Min.   : 0.000   Min.   :0.000  
##  1st Qu.:6.025   1st Qu.:3.312         1st Qu.: 6.667   1st Qu.:6.490  
##  Median :6.932   Median :5.000         Median : 8.333   Median :7.957  
##  Mean   :6.985   Mean   :4.985         Mean   : 7.665   Mean   :7.476  
##  3rd Qu.:8.142   3rd Qu.:6.750         3rd Qu.:10.000   3rd Qu.:8.585  
##  Max.   :9.399   Max.   :9.250         Max.   :10.000   Max.   :9.853  
##                                                         NA's   :1

check-in Check in

Let’s remind ourselves about our SLR model in Activity 2. - What were the parameter estimates (i.e., the \(\beta\)s)? How did we interpret these and what did they imply for this scenario? - How good of a fit was this model? What did we use to assess this?

For this activity, we will begin using the two other quantitative variables to describe the patterns in the response variable.

What does this mean from a statistical point of view?
What does this mean from a “real world” point of view (i.e., for our data’s situation)?

Now, we will obtain graphical and numerical summaries to describe the pairwise relationships.

hfi_2016 %>% 
  select(pf_score, pf_expression_control, pf_religion) %>% 
  ggpairs()

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).

After running the pairs-plot code, we will now answer the following questions:

For each pair of variables, how can the relationship be described graphically?
Do any of the relationships look linear?
Are there any interesting/odd features (outliers, non-linear patterns, etc.)?

The relationship between pf_score and pf_expression_control is linear. The points are forming a straight line. No outliers.

The relationship between pf_score and pf_religion is linear with one outliear at the bottom. The points are forming a straight line.

The relationship between pf_expression_control and pf_religion is linear with one outlier at the bottom. The points are forming a straight line.

For each pair of variables, how can the relationship be described numerically?

pf_score vs pf_expression_control have a corr of 0.845. There is a very strong positive linear relationship between personal freedom score and expression control. As expression control increases, the personal freedom score tends to increase significantly. This strong correlation suggests that expression control is a key factor influencing personal freedom scores.

pf_score vs pf_religion have a corr of 0.576. There is a moderate positive linear relationship between personal freedom score and religion control. Higher values of religion control are associated with higher personal freedom scores, but the relationship is not as strong as with expression control.

pf_expression_control vs pf_religion have a corr of 0.524. There is a moderate positive linear relationship between expression control and religion control. Countries with higher expression control tend to also have higher religion control, but the relationship is less pronounced compared to the relationship between personal freedom score and expression control.

Are the two additional explanatory variables collinear (correlated)? Essentially, this means that adding more than one of these variables to the model would not add much value to the model. We will talk more on this issue in Activity 4 (other considerations in regression models).

Yes, the two explanatory variables, pf_expression_control and pf_religion, have a correlation coefficient of 0.524. This indicates a moderate positive linear relationship between the two variables, meaning that as one variable increases, the other tends to increase as well. It might be useful to check for multicollinearity using additional diagnostics, such as the Variance Inflation Factor (VIF), to ensure it does not unduly influence the model.

The multiple linear regression model

We will now fit the following model:

\[ y = \beta_0 + \beta_1 \times x_1 + \beta_2 \times x_2 + \varepsilon \]

#fit the mlr model
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

lm_spec

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

mlr_mod <- lm_spec %>% 
fit(pf_score ~ pf_expression_control + pf_religion, data = hfi_2016)

# model output
tidy(mlr_mod)

## # A tibble: 3 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              3.46     0.273      12.7  8.80e-26
## 2 pf_expression_control    0.469    0.0298     15.8  3.25e-34
## 3 pf_religion              0.161    0.0419      3.85 1.72e- 4

After doing this, we will answer the following questions:

First we will write the complete estimated equation for this model using your output.

\[ \hat{y} = 3.46 + 0.469 \times x_1+ 0.161 \times x_2 \]

For each of the estimated parameters (the y-intercept and the slopes associated with each explanatory variable - three total), we will interpret these values in the context of this problem.

Intercept = 3.46. The intercept represents the estimated personal freedom (pf_score) when there is no freedom of expression (pf_expression_control) or freedom of religion (pf_religion) in a country. It is statistically significant (p < 0.001).

pf_expression_control estimate = 0.469 indicates that for each unit increase in pf_expression_control, the pf_score increases by approximately 0.469, holding pf_religion constant. This variable is highly significant (p < 0.001).

pf_religion estimate = 0.16 indicates that for each unit increase in pf_religion, the pf_score increases by approximately 0.161, holding pf_expression_control constant. This variable is also statistically significant (p < 0.001).

Challenge: 3-D plots

Let’s try to produce a 3-D scatterplot (you do not need to include the plane).

# Load necessary libraries
library(plotly)
library(ggplot2)

# Create 3D scatter plot
plot_ly(hfi_2016, 
        x = ~pf_expression_control, 
        y = ~pf_religion, 
        z = ~pf_score,
        type = 'scatter3d',
        mode = 'markers') %>%
  layout(scene = list(
    xaxis = list(title = 'pf_expression_control'),
    yaxis = list(title = 'pf_religion'),
    zaxis = list(title = 'pf_score')
  ))

Let’s compare our 3-D scatterplot and the GGally::ggpairs output.
What are the strengths and weaknesses of these two visualizations.
Do both display on GitHub when you push your work there?

The 3-D plot is more interactive in terms of zooming, rotating and it also visualizes all the variables simultaneously allowing for a holistic view of the data.

On the downside the 3-D plot might be complex to interpret sometimes and it my also suffer from overplotting making points to overlap hence difficult to distinguish individual points.

The ggpairs provides detailed pairwise comparison, it is easy to interpret and it provides plots for distribution and correlation which is a comprehensive overview of the data.

On the downside the ggpairs is static which limits interactive exploration of the data and it is limited to pairwise relationships.

GGally::ggpairs is displayed correctly in Github while 3-D plot is not displayed directly due to its interactive nature

Part 2

In Part 1, we fit a model with one quantitative response variable and two quantitative explanatory variables. Now we look at a model with one quantitative explanatory variable and one qualitative explanatory variable. We will use the full 2016 dataset for this entire activity.

Fitting the overall model

This is similar to what we have already been doing - fitting our desired model.
For today’s activity, we will fit something like:

\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \varepsilon \]

where \(y\), \(\text{qualitative\\_variable}\), and \(\text{quantitative\\_variable}\) are from hfi_2016. Note that the two explanatory variables can be entered in whatever order.

To help with interpretability, we will focus on qualitative predictor variables with only two levels. Unfortunately, none of the current chr variables have only two levels. Fortunately, we can create our own.

hfi_2016 <- hfi_2016 %>%
  mutate(west_atlantic = if_else(
    region %in% c("North America", "Latin America & the Caribbean"),
    "No",
    "Yes"
  ))

What is happening in the above code?
What new variable did we create?
How do you know it is new?
What values does it take when?

We are creating a new binary qualitative variable called west_atlantic. If the region is “North America” or “Latin America & the Caribbean”, then west_atlantic is assigned the value “No”. Otherwise, west_atlantic is assigned the value “Yes”.

We now fit the MLR model

# review any visual patterns
hfi_2016 %>% 
  select(pf_score, west_atlantic, pf_expression_control) %>% 
  ggpairs()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#fit the mlr model
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

qual_mod <- lm_spec %>% 
  fit(pf_score ~ west_atlantic + pf_expression_control, data = hfi_2016)

# model output
tidy(qual_mod)

## # A tibble: 3 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              4.38     0.213     20.5   1.57e-46
## 2 west_atlanticYes        -0.102    0.167     -0.612 5.41e- 1
## 3 pf_expression_control    0.540    0.0273    19.8   1.01e-44

When looking at the ggpairs output, we should ask ourselves, “does it make sense to include all of these variables?”
Specifically, if we notice that the response variables are highly correlated (collinear), including both does not necessarily add much value as they are essentially saying the same thing.
However, there are often times when we choose to include variables in our model because it is important to us - for various reasons.
Also, when looking at our model (tidy) output, the term label for the qualitative explanatory variable look odd.
Let’s answer the following questions:

What is the label that R assigned to this explanatory variable term?

west_atlanticYes

What information is represented here?

Intercept (4.38):

This represents the estimated average pf_score for the baseline category of west_atlantic, which is “No”. So, for countries not in the Western Atlantic region, the average pf_score is approximately 4.38.

west_atlanticYes (-0.102):

This coefficient represents the difference in the estimated average pf_score between the “Yes” and “No” categories of west_atlantic, holding pf_expression_control constant. Since the p-value (0.5413382) is greater than 0.05, this difference is not statistically significant. This means there is no significant difference in pf_score between Western Atlantic and non-Western Atlantic countries.

pf_expression_control (0.540):

This coefficient represents the estimated change in pf_score for a one-unit increase in pf_expression_control, holding west_atlantic constant. The positive value (0.5401164) indicates that higher expression control is associated with higher personal freedom scores. The p-value (1.005691e-44) is highly significant, indicating a strong relationship.

What information is missing here?

Degrees of Freedom: These values indicate the degrees of freedom associated with the t-statistics and p-values. Degrees of freedom are typically calculated as the difference between the total number of observations and the number of parameters estimated in the model.

Confidence Intervals: Confidence intervals provide a range of values within which the true population parameter is likely to lie with a certain level of confidence. Including confidence intervals along with coefficient estimates gives a sense of the precision of the estimates.

Residual Standard Error (RSE): The residual standard error represents the average amount that the response variable (pf_score) deviates from the fitted regression line. It provides a measure of the spread of the residuals around the regression line.

R-squared (or Adjusted R-squared): R-squared measures the proportion of variability in the response variable that is explained by the predictor variables in the model. It provides an overall assessment of the model’s goodness of fit.

F-statistic and associated p-value: These statistics assess the overall significance of the model by comparing the fit of the full model to a model with no predictors. They indicate whether the predictor variables, taken together, have a significant effect on the response variable.

We are essentially fitting two models (or \(k\) models, where \(k\) is the number of levels in your qualitative variable).
If we have 3 levels in our qualitative variable, we would have 2 (3 - 1) indicator variables.
If we have \(k\) levels in our qualitative variable, we would have \(k - 1\) indicator variables.

The decision for R to call the indicator variable by one of our levels instead of the other has no deeper meaning.
R simply codes the level that comes first alphabetically with a \(0\) for our indicator variable.
We can change this reference level of a categorical variable, which is the level that is coded as a 0, using the relevel function.

Next we write the estimated equation for your MLR model with a qualitative explanatory variable.

\[ \hat{y} = \beta_0 + \beta_1 \times \text{west_atlanticYes} + \beta_2 \times \text{pf_expression_control} + \varepsilon \]

Now, for each level of our qualitative variable, we will write the simplified equation of the estimated line for that level. Note that if our qualitative variable has two levels, you should have two simplified equations.

When west_atlantic is “Yes”:

\[ \hat{y}{Yes} = \beta_0 - \beta_1 + \beta_2 \times \text{pf_expression_control} \]

When west_atlantic is “No”:

\[ \hat{y}{No} = \beta_0 + \beta_1 + \beta_2 \times \text{pf_expression_control} \]

Where:

\[\hat{y}{Yes}\] represents the predicted value of the response variable (pf_score) when west_atlantic is “Yes”. \[\hat{y}{No}\] represents the predicted value of the response variable (pf_score) when west_atlantic is “No”. \[\beta_0, \beta_1 and, \beta_2\] are the estimated coefficients from the MLR model.
pf_expression_control is the quantitative explanatory variable representing the level of expression and control of personal freedom in a country.

These equations show how the predicted personal freedom score varies with changes in the level of expression and control of personal freedom (pf_expression_control) for each level of the qualitative variable west_atlantic.
The interpretation of the coefficients (parameter estimates) in multiple regression is slightly different from that of simple regression.
The estimate for the indicator variable reflects how much more a group is expected to be if something has that quality, while holding all other variables constant.
The estimate for the quantitative variable reflects how much change in the response variable occurs due to a 1-unit increase in the quantitative variable, while holding all other variables constant.

Next, we interpret the parameter estimate for the reference level of our categorical variable in the context of our problem.

In the context of our problem, the reference level of the categorical variable west_atlantic is “No”. The parameter estimate for the reference level reflects the expected change in the response variable (pf_score) when the categorical variable is at its reference level, while holding all other variables constant.
For the reference level “No” of west_atlantic, the parameter estimate (-0.102) indicates that, on average, countries within the “No” category are expected to have a lower personal freedom score compared to the reference level, “Yes”, of the west_atlantic variable, while controlling for the level of expression and control of personal freedom (pf_expression_control).
In other words, countries within the “No” category, which includes regions outside the West Atlantic area, are expected to have a lower personal freedom score by approximately 0.102 units compared to countries within the West Atlantic region, all else being equal.

Next we interpret the parameter estimate for our quantitative variable in the context of our problem.

For each one-unit increase in the pf_expression_control score, the personal freedom score (pf_score) is expected to increase by approximately 0.540 units on average, controlling for the effect of the categorical variable west_atlantic.

Challenge: Multiple levels

Below, we will create a new R code chunk (with a descriptive name) that fits a new model with the same response (pf_score) and quantitative explanatory variable (pf_expression_control), but now we will use a qualitative variable with more than two levels (say, region) and obtain the tidy model output. How does R appear to handle categorical variables with more than two levels?

# Fitting the model with a qualitative variable with more than two levels
lm_spec_multiple_levels <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Fit the model
qual_mod_multiple_levels <- lm_spec_multiple_levels %>% 
  fit(pf_score ~ region + pf_expression_control, data = hfi_2016)

# Display the model output
tidy(qual_mod_multiple_levels)

## # A tibble: 11 × 5
##    term                                estimate std.error statistic  p.value
##    <chr>                                  <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                            5.39     0.272     19.8   7.47e-44
##  2 regionEast Asia                        0.496    0.380      1.31  1.93e- 1
##  3 regionEastern Europe                   0.326    0.309      1.06  2.93e- 1
##  4 regionLatin America & the Caribbean   -0.229    0.300     -0.762 4.47e- 1
##  5 regionMiddle East & North Africa      -1.39     0.299     -4.64  7.40e- 6
##  6 regionNorth America                    0.610    0.542      1.13  2.62e- 1
##  7 regionOceania                          0.233    0.433      0.537 5.92e- 1
##  8 regionSouth Asia                      -0.716    0.305     -2.35  2.02e- 2
##  9 regionSub-Saharan Africa              -0.746    0.283     -2.64  9.24e- 3
## 10 regionWestern Europe                   0.522    0.345      1.52  1.32e- 1
## 11 pf_expression_control                  0.387    0.0299    12.9   2.99e-26

R appears to handle categorical variables with more than two levels by automatically converting them into dummy variables, where each level becomes its own binary variable. The output provides parameter estimates for each level of the categorical variable, except for one reference level, which serves as the baseline for comparison.

Part 3

We will explore a MLR model with an interaction between quantitative and qualitative explanatory variables as well as see some other methods to assess the fit of our model.
From the modeling process we came up with, we will now address the “series of important questions that we should consider when performing multiple linear regression” (ISL Section 3.2.2, p. 75):

Is at least one of the \(p\) predictors \(X_1\), \(X_2\), \(\ldots\), \(X_p\) useful in predicting the response \(Y\)?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict and how accurate is our prediction?

By including an interaction term in our model, it may seem like we are relaxing the “additive assumption” a little.
However, the additive assumption is about the coefficients (the \(\beta\)s) and not the variables.

Fitting the overall model with \(qualitative \times quantitative\) interaction

In Part 2 that we explored the model:

\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \varepsilon \]

Today we will explore a similar model, except that also includes the interaction between our qualitative and quantitative explanatory variables.
That is,

\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \beta_3 \times ( \text{qualitative\\_variable} \times \text{quantitative\\_variable}) + \varepsilon \]

Fitting an MLR model with an interaction term

# review any visual patterns
hfi_2016 %>% 
  select(pf_score, west_atlantic, pf_expression_control) %>% 
  ggpairs()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#fit the mlr model
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

int_mod <- lm_spec %>% 
  fit(pf_score ~ west_atlantic * pf_expression_control, data = hfi_2016)

# model output
tidy(int_mod)

## # A tibble: 4 × 5
##   term                                   estimate std.error statistic  p.value
##   <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                               5.72     0.459      12.5  2.76e-25
## 2 west_atlanticYes                         -1.60     0.484      -3.30 1.18e- 3
## 3 pf_expression_control                     0.296    0.0789      3.75 2.45e- 4
## 4 west_atlanticYes:pf_expression_control    0.275    0.0838      3.28 1.26e- 3

Note that we shortened the model statement using qualitative * quantitative, but this can sometimes be confusing to read.
Another way to write the right-hand side of the equation is: qualitative + quantitative + qualitative * quantitative.

After doing this, we will answer the following question:

When viewing the tidy output, notice that the interaction term is listed as qualitativelevel:quantitative. Referring back to Part 2 with how R displays qualitative variables, interpret what this syntax means.

\[\beta_3\]

Represents the additional change in pf_score for a one-unit increase in pf_expression_control when west_atlantic is “Yes” compared to “No”.
The qualitativelevel:quantitative term indicates that the effect of the quantitative variable (pf_expression_control) on the response variable (pf_score) is modified by the qualitative variable (west_atlantic).
In simpler terms, it shows that the relationship between pf_expression_control and pf_score is different depending on whether a region is classified as “Yes” or “No” in the west_atlantic variable.

Let’s now write the simplified equation of the line corresponding to each level of our qualitative explanatory variable.

To write the simplified equations of the line corresponding to each level of the qualitative explanatory variable, we’ll break down the model into its components based on the levels of west_atlantic. Here’s the model again:

\[\hat{y} = \beta_0 + \beta_1 \times \text{west_atlantic} + \beta_2 \times \text{pf_expression_control} + \beta_3 \times (\text{west_atlantic} \times \text{pf_expression_control})\]

Simplified Equations for Each Level

For west_atlantic = “No” (reference level):
When west_atlantic is “No”, the term west_atlantic and the interaction term west_atlanticYes:pf_expression_control both drop out because they are zero. Thus, the equation simplifies to:

\[\hat{y}_{\text{No}} = \beta_0 + \beta_2 \times \text{pf_expression_control}\]

Substituting in the values:

\[\hat{y}_{\text{No}} = 5.7213860 + 0.2961044 \times \text{pf_expression_control}\]

For west_atlantic = “Yes”:
When west_atlantic is “Yes”, both the west_atlantic term and the interaction term west_atlanticYes:pf_expression_control are included. Thus, the equation is:

\[\hat{y}_{\text{Yes}} = \beta_0 + \beta_1 + \beta_2 \times \text{pf_expression_control} + \beta_3 \times \text{pf_expression_control}\]

Combining the coefficients:

\[\hat{y}_{\text{Yes}} = 5.7213860 + (-1.5979076) + (0.2961044 + 0.2750385) \times \text{pf_expression_control}\]

Simplifying further:

\[\hat{y}_{\text{Yes}} = 4.1234784 + 0.5711429 \times \text{pf_expression_control}\]

Summary of Simplified Equations

For west_atlantic = “No”: \[\hat{y}_{\text{No}} = 5.7213860 + 0.2961044 \times \text{pf_expression_control}\]

For west_atlantic = “Yes”: \[\hat{y}_{\text{Yes}} = 4.1234784 + 0.5711429 \times \text{pf_expression_control}\]

For two observations with similar values of the quantitative , which level tends to have higher values of the response variable?

The intercept for the “No” group is higher than the intercept for the “Yes” group by 5.7213860−4.1234784=1.5979076. This means that when pf_expression_control is 0, the “No” group has a higher pf_score.
The slope for the “Yes” group is higher than the slope for the “No” group by 0.5711429−0.2961044=0.2750385. This means that as pf_expression_control increases, the pf_score increases more rapidly for the “Yes” group compared to the “No” group.

Conclusion

For low values of pf_expression_control (close to 0), the “No” group tends to have higher pf_score values because of the higher intercept.
For higher values of pf_expression_control, the “Yes” group will eventually have higher pf_score values because of the higher slope. The exact point where this crossover happens can be calculated, but generally speaking, as pf_expression_control increases, the Yes group benefits more due to the larger coefficient.

Like we did in Part 1, let’s assess the fit of this model (no need to do any formal hypothesis testing - we will explore this next). How does int_mod’s fit compare to mlr_mod?
What did we use to compare these?
Why?

# I will refit the Day 1 model here to have them both in one place for comparison
# Fit the models
#mlr_mod <- tidymodels::lm(pf_score ~ west_atlantic + pf_expression_control, data = hfi_2016)
#int_mod <- tidymodels::lm(pf_score ~ west_atlantic * pf_expression_control, data = hfi_2016)

# Extract model summaries
#mlr_mod_summary <- summary(mlr_mod)
#int_mod_summary <- summary(int_mod)

# Extract R-squared and Adjusted R-squared
#mlr_mod_r2 <- mlr_mod_summary$r.squared
#mlr_mod_adj_r2 <- mlr_mod_summary$adj.r.squared
#int_mod_r2 <- int_mod_summary$r.squared
#int_mod_adj_r2 <- int_mod_summary$adj.r.squared

# Display the results
#data.frame(
  #Model = c("MLR", "Interaction MLR"),
  #R2 = c(mlr_mod_r2, int_mod_r2),
  #Adjusted_R2 = c(mlr_mod_adj_r2, int_mod_adj_r2))

We are using R^2 and Adjusted R^2

Why These Metrics?

R^2: Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. It is a measure of how well the model explains the data.
Adjusted R^2: Adjusted for the number of predictors in the model. It is a more accurate measure for comparing models with different numbers of predictors, as it accounts for the complexity of the model.
Using these metrics allows us to quantitatively assess and compare the fit of different models. The interaction model appears to fit the data better because it has a higher R^2 and higher Adjusted R^2, suggesting that the interaction between the qualitative and quantitative variables provides additional explanatory power.*

Many disciplines are moving away from \(p\)-values in favor of other methods.
We will explore \(p\)-values these other methods later this semester, but we will practice our classical methods here.
This is known as an “overall \(F\) test” and the hypotheses are:

That (the null) no predictors are useful for the model (i.e., all slopes are equal to zero) versus the alternative that at least one predictor is useful for the model (i.e., at least one slope is not zero).
One way to check this is to build our null model (no predictors) and then compare this to our candidate model (int_mod).

# null model
null_mod <- lm_spec %>% 
fit(pf_score  ~ 1, data = hfi_2016)

anova(
  extract_fit_engine(int_mod),
  extract_fit_engine(null_mod)
)

## Analysis of Variance Table
## 
## Model 1: pf_score ~ west_atlantic * pf_expression_control
## Model 2: pf_score ~ 1
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    158  95.46                                 
## 2    161 357.56 -3    -262.1 144.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the \(F\) test statistic and \(p\)-value for this test?
Based on an \(\alpha = 0.05\) significant level, what should we conclude?

The \(F\) test statistic for this test is 144.6, and the associated \(p\)-value is nearly zero (given as < 2.2e-16).

At a significance level of \(\alpha = 0.05\), since the \(p\)-value is much smaller than 0.05, we reject the null hypothesis. This indicates that at least one predictor in the interaction model (west_atlantic * pf_expression_control) is useful for predicting the response variable (pf_score).

Partial slope test - do all predictors help explain \(y\)?

Since our overall model is significant (at least one predictor is useful), we will continue on.
We could do a similar process to fit a new model while removing one explanatory variable at at time, and using anova to compare these models.
However, the tidy output also helps here (the statistic and p.value columns).

For each slope, we are testing if that slope is zero (when including the other variables, the null) or if it is not zero (when including the other variables, the alternative).
Because the interaction term is a combination of the other two variables, we should assess the first.

What is the \(t\) test statistic and \(p\)-value associated with this test?
Based on an \(\alpha = 0.05\) significant level, what should we conclude?

The \(t\) test statistic and \(p\)-value associated with the partial slope test for the interaction term west_atlanticYes:pf_expression_control are as follows:
\(t\) test statistic: 3.283544
\(p\)-value:1.262236×10 −3

Based on an \(\alpha = 0.05\) significance level, since the \(p\)-value (1.262236×10 −3) is less than \(\alpha\), we reject the null hypothesis that the coefficient of the interaction term is zero. Therefore, we conclude that the interaction between west_atlanticYes and pf_expression_control significantly contributes to explaining the response variable.

If our interaction term was not significant, we could consider removing it.
Now let’s look at our two non-interaction terms…

What are the \(t\) test statistic and \(p\)-value associated with these tests?
Based on an \(\alpha = 0.05\) significant level, what should we conclude about these two predictors?

We would not need to do (21) if the interaction was significant.
We also should not remove a main variable (non-interaction variable) if the interaction variable remains in our model.

Residual assessment - how well does the model fit the data?

We have already done this step in past activities by exploring your residuals (Activity 2).
Using our final model from Part 3, let’s assess how well our model fits the data.

To assess how well the model fits the data, we can examine the residuals from the final model obtained in Part 3. We’ll look for patterns or trends in the residuals that might indicate inadequacies in the model fit. This can include checking for:

Randomness: Residuals should be randomly scattered around zero without any discernible pattern.
Homoscedasticity: The spread of residuals should be roughly constant across different levels of the predictors.
Normality: Residuals should follow a roughly normal distribution.

Let’s extract residuals

# Extract residuals from the final model
residuals <- residuals(int_mod)

Now that we have extracted the residuals, we can create several plots to assess the model fit:

Scatterplot of Residuals vs. Predicted Values: This plot helps us check for randomness and homoscedasticity.
QQ Plot of Residuals: This plot helps us assess the normality assumption of residuals.
Residuals vs. Predictor Variables: These plots help us detect any patterns or trends in the residuals with respect to the predictor variables.

Let’s create these plots:
Or maybe not…