Activity03 - Multiple Linear Regression
Part 1
Load the necessary packages
- We will continue using the two packages from Posit (formerly RStudio):
{tidyverse}
and{tidymodels}
. - Let’s load these packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Since we will be looking at many relationships graphically, it will
be nice to not have to code each of these individually.
{GGally}
is an extension to {ggplot2}
that
reduces some of the complexities when combining multiple plots. For
example, GGally::ggpairs
is very handy for pairwise comparisons of multiple variables.
Load the data
We are going to download the data first and then load it into the R
session.
The data we are working with is from the OpenIntro site (its “about”
page: https://www.openintro.org/data/index.php?data=hfi).
We can access the raw data from their tab-delimited text file link: https://www.openintro.org/data/tab-delimited/hfi.txt.
Create a new R code chunk below that is titled load-data
and reads in the above linked TSV (tab-separated values) file by doing
the following:
- Start by downloading this file, uploading to RStudio, then reading it in.
- Assign this data set into a data frame named
hfi
(short for “Human Freedom Index”). - Filter the data
hfi
data frame for year 2016 and assigns the result to an R data object namedhfi_2016
. We will usehfi_2016
for the remainder of this activity.
## Rows: 1458 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ISO_code, countries, region
## dbl (120): year, pf_rol_procedural, pf_rol_civil, pf_rol_criminal, pf_rol, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Get the characteristics of the dataset
## Rows: 162
## Columns: 123
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016,…
## $ ISO_code <chr> "ALB", "DZA", "AGO", "ARG", "ARM", …
## $ countries <chr> "Albania", "Algeria", "Angola", "Ar…
## $ region <chr> "Eastern Europe", "Middle East & No…
## $ pf_rol_procedural <dbl> 6.661503, NA, NA, 7.098483, NA, 8.4…
## $ pf_rol_civil <dbl> 4.547244, NA, NA, 5.791960, NA, 7.5…
## $ pf_rol_criminal <dbl> 4.666508, NA, NA, 4.343930, NA, 7.3…
## $ pf_rol <dbl> 5.291752, 3.819566, 3.451814, 5.744…
## $ pf_ss_homicide <dbl> 8.920429, 9.456254, 8.060260, 7.622…
## $ pf_ss_disappearances_disap <dbl> 10, 10, 5, 10, 10, 10, 10, 10, 10, …
## $ pf_ss_disappearances_violent <dbl> 10.000000, 9.294030, 10.000000, 10.…
## $ pf_ss_disappearances_organized <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_ss_disappearances_fatalities <dbl> 10.000000, 9.926119, 10.000000, 10.…
## $ pf_ss_disappearances_injuries <dbl> 10.000000, 9.990149, 10.000000, 9.9…
## $ pf_ss_disappearances <dbl> 10.000000, 8.842060, 8.500000, 9.49…
## $ pf_ss_women_fgm <dbl> 10.0, 10.0, 10.0, 10.0, 10.0, 10.0,…
## $ pf_ss_women_missing <dbl> 7.5, 7.5, 10.0, 10.0, 5.0, 10.0, 10…
## $ pf_ss_women_inheritance_widows <dbl> 5, 0, 5, 10, 10, 10, 10, 5, NA, 0, …
## $ pf_ss_women_inheritance_daughters <dbl> 5, 0, 5, 10, 10, 10, 10, 10, NA, 0,…
## $ pf_ss_women_inheritance <dbl> 5.0, 0.0, 5.0, 10.0, 10.0, 10.0, 10…
## $ pf_ss_women <dbl> 7.500000, 5.833333, 8.333333, 10.00…
## $ pf_ss <dbl> 8.806810, 8.043882, 8.297865, 9.040…
## $ pf_movement_domestic <dbl> 5, 5, 0, 10, 5, 10, 10, 5, 10, 10, …
## $ pf_movement_foreign <dbl> 10, 5, 5, 10, 5, 10, 10, 5, 10, 5, …
## $ pf_movement_women <dbl> 5, 5, 10, 10, 10, 10, 10, 5, NA, 5,…
## $ pf_movement <dbl> 6.666667, 5.000000, 5.000000, 10.00…
## $ pf_religion_estop_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop <dbl> 10.0, 5.0, 10.0, 7.5, 5.0, 10.0, 10…
## $ pf_religion_harassment <dbl> 9.566667, 6.873333, 8.904444, 9.037…
## $ pf_religion_restrictions <dbl> 8.011111, 2.961111, 7.455556, 6.850…
## $ pf_religion <dbl> 9.192593, 4.944815, 8.786667, 7.795…
## $ pf_association_association <dbl> 10.0, 5.0, 2.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association_assembly <dbl> 10.0, 5.0, 2.5, 10.0, 7.5, 10.0, 10…
## $ pf_association_political_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political <dbl> 10.0, 5.0, 2.5, 5.0, 5.0, 10.0, 10.…
## $ pf_association_prof_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof <dbl> 10.0, 5.0, 5.0, 7.5, 5.0, 10.0, 10.…
## $ pf_association_sport_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association <dbl> 10.0, 5.0, 4.0, 7.5, 6.5, 10.0, 10.…
## $ pf_expression_killed <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_jailed <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_influence <dbl> 5.0000000, 2.6666667, 2.6666667, 5.…
## $ pf_expression_control <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75,…
## $ pf_expression_cable <dbl> 10.0, 10.0, 7.5, 10.0, 7.5, 10.0, 1…
## $ pf_expression_newspapers <dbl> 10.0, 7.5, 5.0, 10.0, 7.5, 10.0, 10…
## $ pf_expression_internet <dbl> 10.0, 7.5, 7.5, 10.0, 7.5, 10.0, 10…
## $ pf_expression <dbl> 8.607143, 7.380952, 6.452381, 8.738…
## $ pf_identity_legal <dbl> 0, NA, 10, 10, 7, 7, 10, 0, NA, NA,…
## $ pf_identity_parental_marriage <dbl> 10, 0, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental_divorce <dbl> 10, 5, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental <dbl> 10.0, 2.5, 10.0, 10.0, 10.0, 10.0, …
## $ pf_identity_sex_male <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex_female <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_divorce <dbl> 5, 0, 10, 10, 5, 10, 10, 5, NA, 0, …
## $ pf_identity <dbl> 6.2500000, 0.8333333, 7.5000000, 10…
## $ pf_score <dbl> 7.596281, 5.281772, 6.111324, 8.099…
## $ pf_rank <dbl> 57, 147, 117, 42, 84, 11, 8, 131, 6…
## $ ef_government_consumption <dbl> 8.232353, 2.150000, 7.600000, 5.335…
## $ ef_government_transfers <dbl> 7.509902, 7.817129, 8.886739, 6.048…
## $ ef_government_enterprises <dbl> 8, 0, 0, 6, 8, 10, 10, 0, 7, 10, 7,…
## $ ef_government_tax_income <dbl> 9, 7, 10, 7, 5, 5, 4, 9, 10, 10, 8,…
## $ ef_government_tax_payroll <dbl> 7, 2, 9, 1, 5, 5, 3, 4, 10, 10, 8, …
## $ ef_government_tax <dbl> 8.0, 4.5, 9.5, 4.0, 5.0, 5.0, 3.5, …
## $ ef_government <dbl> 7.935564, 3.616782, 6.496685, 5.346…
## $ ef_legal_judicial <dbl> 2.6682218, 4.1867042, 1.8431292, 3.…
## $ ef_legal_courts <dbl> 3.145462, 4.327113, 1.974566, 2.930…
## $ ef_legal_protection <dbl> 4.512228, 4.689952, 2.512364, 4.255…
## $ ef_legal_military <dbl> 8.333333, 4.166667, 3.333333, 7.500…
## $ ef_legal_integrity <dbl> 4.166667, 5.000000, 4.166667, 3.333…
## $ ef_legal_enforcement <dbl> 4.3874441, 4.5075380, 2.3022004, 3.…
## $ ef_legal_restrictions <dbl> 6.485287, 6.626692, 5.455882, 6.857…
## $ ef_legal_police <dbl> 6.933500, 6.136845, 3.016104, 3.385…
## $ ef_legal_crime <dbl> 6.215401, 6.737383, 4.291197, 4.133…
## $ ef_legal_gender <dbl> 0.9487179, 0.8205128, 0.8461538, 0.…
## $ ef_legal <dbl> 5.071814, 4.690743, 2.963635, 3.904…
## $ ef_money_growth <dbl> 8.986454, 6.955962, 9.385679, 5.233…
## $ ef_money_sd <dbl> 9.484575, 8.339152, 4.986742, 5.224…
## $ ef_money_inflation <dbl> 9.743600, 8.720460, 3.054000, 2.000…
## $ ef_money_currency <dbl> 10, 5, 5, 10, 10, 10, 10, 5, 0, 10,…
## $ ef_money <dbl> 9.553657, 7.253894, 5.606605, 5.614…
## $ ef_trade_tariffs_revenue <dbl> 9.626667, 8.480000, 8.993333, 6.060…
## $ ef_trade_tariffs_mean <dbl> 9.24, 6.22, 7.72, 7.26, 8.76, 9.50,…
## $ ef_trade_tariffs_sd <dbl> 8.0240, 5.9176, 4.2544, 5.9448, 8.0…
## $ ef_trade_tariffs <dbl> 8.963556, 6.872533, 6.989244, 6.421…
## $ ef_trade_regulatory_nontariff <dbl> 5.574481, 4.962589, 3.132738, 4.466…
## $ ef_trade_regulatory_compliance <dbl> 9.4053278, 0.0000000, 0.9171598, 5.…
## $ ef_trade_regulatory <dbl> 7.489905, 2.481294, 2.024949, 4.811…
## $ ef_trade_black <dbl> 10.00000, 5.56391, 10.00000, 0.0000…
## $ ef_trade_movement_foreign <dbl> 6.306106, 3.664829, 2.946919, 5.358…
## $ ef_trade_movement_capital <dbl> 4.6153846, 0.0000000, 3.0769231, 0.…
## $ ef_trade_movement_visit <dbl> 8.2969231, 1.1062564, 0.1106256, 7.…
## $ ef_trade_movement <dbl> 6.406138, 1.590362, 2.044823, 4.697…
## $ ef_trade <dbl> 8.214900, 4.127025, 5.264754, 3.982…
## $ ef_regulation_credit_ownership <dbl> 5, 0, 8, 5, 10, 10, 8, 5, 10, 10, 5…
## $ ef_regulation_credit_private <dbl> 7.295687, 5.301526, 9.194715, 4.259…
## $ ef_regulation_credit_interest <dbl> 9, 10, 4, 7, 10, 10, 10, 9, 10, 10,…
## $ ef_regulation_credit <dbl> 7.098562, 5.100509, 7.064905, 5.419…
## $ ef_regulation_labor_minwage <dbl> 5.566667, 5.566667, 8.900000, 2.766…
## $ ef_regulation_labor_firing <dbl> 5.396399, 3.896912, 2.656198, 2.191…
## $ ef_regulation_labor_bargain <dbl> 6.234861, 5.958321, 5.172987, 3.432…
## $ ef_regulation_labor_hours <dbl> 8, 6, 4, 10, 10, 10, 6, 6, 8, 8, 10…
## $ ef_regulation_labor_dismissal <dbl> 6.299741, 7.755176, 6.632764, 2.517…
## $ ef_regulation_labor_conscription <dbl> 10, 1, 0, 10, 0, 10, 3, 1, 10, 10, …
## $ ef_regulation_labor <dbl> 6.916278, 5.029513, 4.560325, 5.151…
## $ ef_regulation_business_adm <dbl> 6.072172, 3.722341, 2.758428, 2.404…
## $ ef_regulation_business_bureaucracy <dbl> 6.000000, 1.777778, 1.333333, 6.666…
## $ ef_regulation_business_start <dbl> 9.713864, 9.243070, 8.664627, 9.122…
## $ ef_regulation_business_bribes <dbl> 4.050196, 3.765515, 1.945540, 3.260…
## $ ef_regulation_business_licensing <dbl> 7.324582, 8.523503, 8.096776, 5.253…
## $ ef_regulation_business_compliance <dbl> 7.074366, 7.029528, 6.782923, 6.508…
## $ ef_regulation_business <dbl> 6.705863, 5.676956, 4.930271, 5.535…
## $ ef_regulation <dbl> 6.906901, 5.268992, 5.518500, 5.369…
## $ ef_score <dbl> 7.54, 4.99, 5.17, 4.84, 7.57, 7.98,…
## $ ef_rank <dbl> 34, 159, 155, 160, 29, 10, 27, 106,…
## $ hf_score <dbl> 7.568140, 5.135886, 5.640662, 6.469…
## $ hf_rank <dbl> 48, 155, 142, 107, 57, 4, 16, 130, …
## $ hf_quartile <dbl> 2, 4, 4, 3, 2, 1, 1, 4, 2, 2, 4, 2,…
We will continue using personal freedom scores,
pf_score
, as the response variable and build on our model
that had pf_expression_control
as the explanatory
variable.
- We will use
pf_movement
andpf_religion
as additional numeric variables (They have<dbl>
or<int>
designations)
Pairwise relationships
In Activity 2 we explored simple linear regression models.
Specifically, we fit and assessed this relationship:
\[ y = \beta_0 + \beta_1 \times x + \varepsilon \]
# I have identified 'pf_movement_control' and 'pf_religion_control' as additional numeric variables
selected_variables <- hfi_2016 %>%
select(pf_score, pf_expression_control, pf_movement, pf_religion)
# Summary statistics and distributions of the selected numeric variables
summary(selected_variables)
## pf_score pf_expression_control pf_movement pf_religion
## Min. :2.167 Min. :0.250 Min. : 0.000 Min. :0.000
## 1st Qu.:6.025 1st Qu.:3.312 1st Qu.: 6.667 1st Qu.:6.490
## Median :6.932 Median :5.000 Median : 8.333 Median :7.957
## Mean :6.985 Mean :4.985 Mean : 7.665 Mean :7.476
## 3rd Qu.:8.142 3rd Qu.:6.750 3rd Qu.:10.000 3rd Qu.:8.585
## Max. :9.399 Max. :9.250 Max. :10.000 Max. :9.853
## NA's :1
Check in
Let’s remind ourselves about our SLR model in Activity 2. - What were the parameter estimates (i.e., the \(\beta\)s)? How did we interpret these and what did they imply for this scenario? - How good of a fit was this model? What did we use to assess this?
For this activity, we will begin using the two other quantitative variables to describe the patterns in the response variable.
- What does this mean from a statistical point of view?
- What does this mean from a “real world” point of view (i.e., for our data’s situation)?
Now, we will obtain graphical and numerical summaries to describe the pairwise relationships.
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
After running the pairs-plot
code, we will now answer
the following questions:
- For each pair of variables, how can the relationship be described
graphically?
Do any of the relationships look linear?
Are there any interesting/odd features (outliers, non-linear patterns, etc.)?
The relationship between
pf_score
andpf_expression_control
is linear. The points are forming a straight line. No outliers.
The relationship between
pf_score
andpf_religion
is linear with one outliear at the bottom. The points are forming a straight line.
The relationship between
pf_expression_control
andpf_religion
is linear with one outlier at the bottom. The points are forming a straight line.
- For each pair of variables, how can the relationship be described numerically?
pf_score
vspf_expression_control
have a corr of 0.845. There is a very strong positive linear relationship between personal freedom score and expression control. As expression control increases, the personal freedom score tends to increase significantly. This strong correlation suggests that expression control is a key factor influencing personal freedom scores.
pf_score
vspf_religion
have a corr of 0.576. There is a moderate positive linear relationship between personal freedom score and religion control. Higher values of religion control are associated with higher personal freedom scores, but the relationship is not as strong as with expression control.
pf_expression_control
vspf_religion
have a corr of 0.524. There is a moderate positive linear relationship between expression control and religion control. Countries with higher expression control tend to also have higher religion control, but the relationship is less pronounced compared to the relationship between personal freedom score and expression control.
- Are the two additional explanatory variables collinear (correlated)? Essentially, this means that adding more than one of these variables to the model would not add much value to the model. We will talk more on this issue in Activity 4 (other considerations in regression models).
Yes, the two explanatory variables, pf_expression_control and pf_religion, have a correlation coefficient of 0.524. This indicates a moderate positive linear relationship between the two variables, meaning that as one variable increases, the other tends to increase as well. It might be useful to check for multicollinearity using additional diagnostics, such as the Variance Inflation Factor (VIF), to ensure it does not unduly influence the model.
The multiple linear regression model
We will now fit the following model:
\[ y = \beta_0 + \beta_1 \times x_1 + \beta_2 \times x_2 + \varepsilon \]
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
mlr_mod <- lm_spec %>%
fit(pf_score ~ pf_expression_control + pf_religion, data = hfi_2016)
# model output
tidy(mlr_mod)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.46 0.273 12.7 8.80e-26
## 2 pf_expression_control 0.469 0.0298 15.8 3.25e-34
## 3 pf_religion 0.161 0.0419 3.85 1.72e- 4
After doing this, we will answer the following questions:
- First we will write the complete estimated equation for this model using your output.
\[ \hat{y} = 3.46 + 0.469 \times x_1+ 0.161 \times x_2 \]
- For each of the estimated parameters (the y-intercept and the slopes associated with each explanatory variable - three total), we will interpret these values in the context of this problem.
Intercept = 3.46. The intercept represents the estimated personal freedom (
pf_score
) when there is no freedom of expression (pf_expression_control
) or freedom of religion (pf_religion
) in a country. It is statistically significant (p < 0.001).
pf_expression_control
estimate = 0.469 indicates that for each unit increase in pf_expression_control, the pf_score increases by approximately 0.469, holding pf_religion constant. This variable is highly significant (p < 0.001).
pf_religion
estimate = 0.16 indicates that for each unit increase in pf_religion, the pf_score increases by approximately 0.161, holding pf_expression_control constant. This variable is also statistically significant (p < 0.001).
Challenge: 3-D plots
Let’s try to produce a 3-D scatterplot (you do not need to include the plane).
# Load necessary libraries
library(plotly)
library(ggplot2)
# Create 3D scatter plot
plot_ly(hfi_2016,
x = ~pf_expression_control,
y = ~pf_religion,
z = ~pf_score,
type = 'scatter3d',
mode = 'markers') %>%
layout(scene = list(
xaxis = list(title = 'pf_expression_control'),
yaxis = list(title = 'pf_religion'),
zaxis = list(title = 'pf_score')
))
- Let’s compare our 3-D scatterplot and the
GGally::ggpairs
output.
What are the strengths and weaknesses of these two visualizations.
Do both display on GitHub when you push your work there?
The 3-D plot is more interactive in terms of zooming, rotating and it also visualizes all the variables simultaneously allowing for a holistic view of the data.
On the downside the 3-D plot might be complex to interpret sometimes and it my also suffer from overplotting making points to overlap hence difficult to distinguish individual points.
The
ggpairs
provides detailed pairwise comparison, it is easy to interpret and it provides plots for distribution and correlation which is a comprehensive overview of the data.
On the downside the
ggpairs
is static which limits interactive exploration of the data and it is limited to pairwise relationships.
GGally::ggpairs
is displayed correctly in Github while 3-D plot is not displayed directly due to its interactive nature
Part 2
In Part 1, we fit a model with one quantitative response variable and two quantitative explanatory variables. Now we look at a model with one quantitative explanatory variable and one qualitative explanatory variable. We will use the full 2016 dataset for this entire activity.
Fitting the overall model
This is similar to what we have already been doing - fitting our
desired model.
For today’s activity, we will fit something like:
\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \varepsilon \]
where \(y\), \(\text{qualitative\\_variable}\), and \(\text{quantitative\\_variable}\) are from
hfi_2016
. Note that the two explanatory variables can be
entered in whatever order.
To help with interpretability, we will focus on qualitative predictor
variables with only two levels. Unfortunately, none of the current
chr
variables have only two levels. Fortunately, we can
create our own.
hfi_2016 <- hfi_2016 %>%
mutate(west_atlantic = if_else(
region %in% c("North America", "Latin America & the Caribbean"),
"No",
"Yes"
))
- What is happening in the above code?
What new variable did we create?
How do you know it is new?
What values does it take when?
We are creating a new binary qualitative variable called
west_atlantic.
If the region is “North America” or “Latin America & the Caribbean”, then west_atlantic is assigned the value “No”. Otherwise, west_atlantic is assigned the value “Yes”.
We now fit the MLR model
# review any visual patterns
hfi_2016 %>%
select(pf_score, west_atlantic, pf_expression_control) %>%
ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#fit the mlr model
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
qual_mod <- lm_spec %>%
fit(pf_score ~ west_atlantic + pf_expression_control, data = hfi_2016)
# model output
tidy(qual_mod)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.38 0.213 20.5 1.57e-46
## 2 west_atlanticYes -0.102 0.167 -0.612 5.41e- 1
## 3 pf_expression_control 0.540 0.0273 19.8 1.01e-44
When looking at the ggpairs
output, we should ask
ourselves, “does it make sense to include all of these variables?”
Specifically, if we notice that the response variables are highly
correlated (collinear), including both does not necessarily add much
value as they are essentially saying the same thing.
However, there are often times when we choose to include variables in
our model because it is important to us - for various reasons.
Also, when looking at our model (tidy
) output, the
term
label for the qualitative explanatory variable look
odd.
Let’s answer the following questions:
- What is the label that R assigned to this explanatory variable
term
?
west_atlanticYes
- What information is represented here?
Intercept (4.38):
This represents the estimated average
pf_score
for the baseline category ofwest_atlantic
, which is “No”. So, for countries not in the Western Atlantic region, the averagepf_score
is approximately 4.38.
west_atlanticYes
(-0.102):
This coefficient represents the difference in the estimated average pf_score between the “Yes” and “No” categories of
west_atlantic
, holdingpf_expression_control
constant. Since the p-value (0.5413382) is greater than 0.05, this difference is not statistically significant. This means there is no significant difference inpf_score
between Western Atlantic and non-Western Atlantic countries.
pf_expression_control
(0.540):
This coefficient represents the estimated change in pf_score for a one-unit increase in
pf_expression_control
, holdingwest_atlantic
constant. The positive value (0.5401164) indicates that higher expression control is associated with higher personal freedom scores. The p-value (1.005691e-44) is highly significant, indicating a strong relationship.
- What information is missing here?
Degrees of Freedom: These values indicate the degrees of freedom associated with the t-statistics and p-values. Degrees of freedom are typically calculated as the difference between the total number of observations and the number of parameters estimated in the model.
Confidence Intervals: Confidence intervals provide a range of values within which the true population parameter is likely to lie with a certain level of confidence. Including confidence intervals along with coefficient estimates gives a sense of the precision of the estimates.
Residual Standard Error (RSE): The residual standard error represents the average amount that the response variable (pf_score) deviates from the fitted regression line. It provides a measure of the spread of the residuals around the regression line.
R-squared (or Adjusted R-squared): R-squared measures the proportion of variability in the response variable that is explained by the predictor variables in the model. It provides an overall assessment of the model’s goodness of fit.
F-statistic and associated p-value: These statistics assess the overall significance of the model by comparing the fit of the full model to a model with no predictors. They indicate whether the predictor variables, taken together, have a significant effect on the response variable.
We are essentially fitting two models (or \(k\) models, where \(k\) is the number of levels in your
qualitative variable).
If we have 3 levels in our qualitative variable, we would have 2 (3 - 1)
indicator variables.
If we have \(k\) levels in our
qualitative variable, we would have \(k -
1\) indicator variables.
The decision for R to call the indicator variable by one of our
levels instead of the other has no deeper meaning.
R simply codes the level that comes first alphabetically with a \(0\) for our indicator variable.
We can change this reference level of a categorical variable, which is
the level that is coded as a 0, using the relevel
function.
- Next we write the estimated equation for your MLR model with a qualitative explanatory variable.
\[ \hat{y} = \beta_0 + \beta_1 \times \text{west_atlanticYes} + \beta_2 \times \text{pf_expression_control} + \varepsilon \]
- Now, for each level of our qualitative variable, we will write the simplified equation of the estimated line for that level. Note that if our qualitative variable has two levels, you should have two simplified equations.
When
west_atlantic
is “Yes”:
\[ \hat{y}{Yes} = \beta_0 - \beta_1 + \beta_2 \times \text{pf_expression_control} \]
When
west_atlantic
is “No”:
\[ \hat{y}{No} = \beta_0 + \beta_1 + \beta_2 \times \text{pf_expression_control} \]
Where:
\[\hat{y}{Yes}\] represents the
predicted value of the response variable (pf_score) when west_atlantic
is “Yes”. \[\hat{y}{No}\] represents
the predicted value of the response variable (pf_score) when
west_atlantic is “No”. \[\beta_0, \beta_1
and, \beta_2\] are the estimated coefficients from the MLR
model.
pf_expression_control
is the quantitative explanatory
variable representing the level of expression and control of personal
freedom in a country.
These equations show how the predicted personal freedom score varies with changes in the level of expression and control of personal freedom (
pf_expression_control
) for each level of the qualitative variablewest_atlantic
.
The interpretation of the coefficients (parameter estimates) in multiple regression is slightly different from that of simple regression.
The estimate for the indicator variable reflects how much more a group is expected to be if something has that quality, while holding all other variables constant.
The estimate for the quantitative variable reflects how much change in the response variable occurs due to a 1-unit increase in the quantitative variable, while holding all other variables constant.
- Next, we interpret the parameter estimate for the reference level of our categorical variable in the context of our problem.
In the context of our problem, the reference level of the categorical variable
west_atlantic
is “No”. The parameter estimate for the reference level reflects the expected change in the response variable (pf_score
) when the categorical variable is at its reference level, while holding all other variables constant.
For the reference level “No” ofwest_atlantic
, the parameter estimate (-0.102) indicates that, on average, countries within the “No” category are expected to have a lower personal freedom score compared to the reference level, “Yes”, of thewest_atlantic
variable, while controlling for the level of expression and control of personal freedom (pf_expression_control
).
In other words, countries within the “No” category, which includes regions outside the West Atlantic area, are expected to have a lower personal freedom score by approximately 0.102 units compared to countries within the West Atlantic region, all else being equal.
- Next we interpret the parameter estimate for our quantitative variable in the context of our problem.
For each one-unit increase in the
pf_expression_control
score, the personal freedom score (pf_score
) is expected to increase by approximately 0.540 units on average, controlling for the effect of the categorical variablewest_atlantic
.
Challenge: Multiple levels
Below, we will create a new R code chunk (with a descriptive name)
that fits a new model with the same response (pf_score
) and
quantitative explanatory variable (pf_expression_control
),
but now we will use a qualitative variable with more than two levels
(say, region
) and obtain the tidy
model
output. How does R appear to handle categorical variables with more than
two levels?
# Fitting the model with a qualitative variable with more than two levels
lm_spec_multiple_levels <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
# Fit the model
qual_mod_multiple_levels <- lm_spec_multiple_levels %>%
fit(pf_score ~ region + pf_expression_control, data = hfi_2016)
# Display the model output
tidy(qual_mod_multiple_levels)
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.39 0.272 19.8 7.47e-44
## 2 regionEast Asia 0.496 0.380 1.31 1.93e- 1
## 3 regionEastern Europe 0.326 0.309 1.06 2.93e- 1
## 4 regionLatin America & the Caribbean -0.229 0.300 -0.762 4.47e- 1
## 5 regionMiddle East & North Africa -1.39 0.299 -4.64 7.40e- 6
## 6 regionNorth America 0.610 0.542 1.13 2.62e- 1
## 7 regionOceania 0.233 0.433 0.537 5.92e- 1
## 8 regionSouth Asia -0.716 0.305 -2.35 2.02e- 2
## 9 regionSub-Saharan Africa -0.746 0.283 -2.64 9.24e- 3
## 10 regionWestern Europe 0.522 0.345 1.52 1.32e- 1
## 11 pf_expression_control 0.387 0.0299 12.9 2.99e-26
R appears to handle categorical variables with more than two levels by automatically converting them into dummy variables, where each level becomes its own binary variable. The output provides parameter estimates for each level of the categorical variable, except for one reference level, which serves as the baseline for comparison.
Part 3
We will explore a MLR model with an interaction between quantitative
and qualitative explanatory variables as well as see some other methods
to assess the fit of our model.
From the modeling process we came up with, we will now address the
“series of important questions that we should consider when performing
multiple linear regression” (ISL Section
3.2.2, p. 75):
- Is at least one of the \(p\) predictors \(X_1\), \(X_2\), \(\ldots\), \(X_p\) useful in predicting the response \(Y\)?
- Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict and how accurate is our prediction?
By including an interaction term in our model, it may seem like we
are relaxing the “additive assumption” a little.
However, the additive assumption is about the coefficients (the \(\beta\)s) and not the variables.
Fitting the overall model with \(qualitative \times quantitative\) interaction
In Part 2 that we explored the model:
\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \varepsilon \]
Today we will explore a similar model, except that also includes the
interaction between our qualitative and quantitative explanatory
variables.
That is,
\[ y = \beta_0 + \beta_1 \times \text{qualitative\\_variable} + \beta_2 \times \text{quantitative\\_variable} + \beta_3 \times ( \text{qualitative\\_variable} \times \text{quantitative\\_variable}) + \varepsilon \]
Fitting an MLR model with an interaction term
# review any visual patterns
hfi_2016 %>%
select(pf_score, west_atlantic, pf_expression_control) %>%
ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#fit the mlr model
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
int_mod <- lm_spec %>%
fit(pf_score ~ west_atlantic * pf_expression_control, data = hfi_2016)
# model output
tidy(int_mod)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.72 0.459 12.5 2.76e-25
## 2 west_atlanticYes -1.60 0.484 -3.30 1.18e- 3
## 3 pf_expression_control 0.296 0.0789 3.75 2.45e- 4
## 4 west_atlanticYes:pf_expression_control 0.275 0.0838 3.28 1.26e- 3
Note that we shortened the model statement using
qualitative * quantitative
, but this can sometimes be
confusing to read.
Another way to write the right-hand side of the equation is:
qualitative + quantitative + qualitative * quantitative
.
After doing this, we will answer the following question:
- When viewing the
tidy
output, notice that the interaction term is listed asqualitativelevel:quantitative
. Referring back to Part 2 with how R displays qualitative variables, interpret what this syntax means.
- \[\beta_3\]
- Represents the additional change in
pf_score
for a one-unit increase inpf_expression_control
whenwest_atlantic
is “Yes” compared to “No”.
The qualitativelevel:quantitative term indicates that the effect of the quantitative variable (pf_expression_control
) on the response variable (pf_score
) is modified by the qualitative variable (west_atlantic
).
In simpler terms, it shows that the relationship betweenpf_expression_control
andpf_score
is different depending on whether a region is classified as “Yes” or “No” in thewest_atlantic
variable.
- Let’s now write the simplified equation of the line corresponding to each level of our qualitative explanatory variable.
To write the simplified equations of the line corresponding to each level of the qualitative explanatory variable, we’ll break down the model into its components based on the levels of
west_atlantic.
Here’s the model again:
\[\hat{y} = \beta_0 + \beta_1 \times \text{west_atlantic} + \beta_2 \times \text{pf_expression_control} + \beta_3 \times (\text{west_atlantic} \times \text{pf_expression_control})\]
Simplified Equations for Each Level
For west_atlantic
= “No” (reference level):
When west_atlantic
is “No”, the term
west_atlantic
and the interaction term
west_atlanticYes
:pf_expression_control
both
drop out because they are zero. Thus, the equation simplifies to:
\[\hat{y}_{\text{No}} = \beta_0 + \beta_2 \times \text{pf_expression_control}\]
Substituting in the values:
\[\hat{y}_{\text{No}} = 5.7213860 + 0.2961044 \times \text{pf_expression_control}\]
For west_atlantic
= “Yes”:
When west_atlantic
is “Yes”, both the
west_atlantic
term and the interaction term
west_atlanticYes
:pf_expression_control
are
included. Thus, the equation is:
\[\hat{y}_{\text{Yes}} = \beta_0 + \beta_1 + \beta_2 \times \text{pf_expression_control} + \beta_3 \times \text{pf_expression_control}\]
Combining the coefficients:
\[\hat{y}_{\text{Yes}} = 5.7213860 + (-1.5979076) + (0.2961044 + 0.2750385) \times \text{pf_expression_control}\]
Simplifying further:
\[\hat{y}_{\text{Yes}} = 4.1234784 + 0.5711429 \times \text{pf_expression_control}\]
Summary of Simplified Equations
For west_atlantic
= “No”: \[\hat{y}_{\text{No}} = 5.7213860 + 0.2961044
\times \text{pf_expression_control}\]
For west_atlantic
= “Yes”: \[\hat{y}_{\text{Yes}} = 4.1234784 + 0.5711429
\times \text{pf_expression_control}\]
- For two observations with similar values of the quantitative , which level tends to have higher values of the response variable?
The intercept for the “No” group is higher than the intercept for the “Yes” group by 5.7213860−4.1234784=1.5979076. This means that when
pf_expression_control
is 0, the “No” group has a higherpf_score
.
The slope for the “Yes” group is higher than the slope for the “No” group by 0.5711429−0.2961044=0.2750385. This means that aspf_expression_control
increases, thepf_score
increases more rapidly for the “Yes” group compared to the “No” group.
Conclusion
For low values of
pf_expression_control
(close to 0), the “No” group tends to have higherpf_score
values because of the higher intercept.
For higher values ofpf_expression_control
, the “Yes” group will eventually have higher pf_score values because of the higher slope. The exact point where this crossover happens can be calculated, but generally speaking, aspf_expression_control
increases, the Yes group benefits more due to the larger coefficient.
- Like we did in Part 1, let’s assess the fit of this model (no need
to do any formal hypothesis testing - we will explore this next). How
does
int_mod
’s fit compare tomlr_mod
?
What did we use to compare these?
Why?
# I will refit the Day 1 model here to have them both in one place for comparison
# Fit the models
#mlr_mod <- tidymodels::lm(pf_score ~ west_atlantic + pf_expression_control, data = hfi_2016)
#int_mod <- tidymodels::lm(pf_score ~ west_atlantic * pf_expression_control, data = hfi_2016)
# Extract model summaries
#mlr_mod_summary <- summary(mlr_mod)
#int_mod_summary <- summary(int_mod)
# Extract R-squared and Adjusted R-squared
#mlr_mod_r2 <- mlr_mod_summary$r.squared
#mlr_mod_adj_r2 <- mlr_mod_summary$adj.r.squared
#int_mod_r2 <- int_mod_summary$r.squared
#int_mod_adj_r2 <- int_mod_summary$adj.r.squared
# Display the results
#data.frame(
#Model = c("MLR", "Interaction MLR"),
#R2 = c(mlr_mod_r2, int_mod_r2),
#Adjusted_R2 = c(mlr_mod_adj_r2, int_mod_adj_r2))
We are using R^2 and Adjusted R^2
Why These Metrics?
R^2: Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. It is a measure of how well the model explains the data.
Adjusted R^2: Adjusted for the number of predictors in the model. It is a more accurate measure for comparing models with different numbers of predictors, as it accounts for the complexity of the model.
Using these metrics allows us to quantitatively assess and compare the fit of different models. The interaction model appears to fit the data better because it has a higher R^2 and higher Adjusted R^2, suggesting that the interaction between the qualitative and quantitative variables provides additional explanatory power.*
Many disciplines are moving away from \(p\)-values in favor of other methods.
We will explore \(p\)-values these
other methods later this semester, but we will practice our classical
methods here.
This is known as an “overall \(F\)
test” and the hypotheses are:
That (the null) no predictors are useful for the model (i.e., all
slopes are equal to zero) versus the alternative that at least one
predictor is useful for the model (i.e., at least one slope is not
zero).
One way to check this is to build our null model (no predictors) and
then compare this to our candidate model (int_mod
).
# null model
null_mod <- lm_spec %>%
fit(pf_score ~ 1, data = hfi_2016)
anova(
extract_fit_engine(int_mod),
extract_fit_engine(null_mod)
)
## Analysis of Variance Table
##
## Model 1: pf_score ~ west_atlantic * pf_expression_control
## Model 2: pf_score ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 158 95.46
## 2 161 357.56 -3 -262.1 144.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- What is the \(F\) test statistic
and \(p\)-value for this test?
Based on an \(\alpha = 0.05\) significant level, what should we conclude?
The \(F\) test statistic for this test is 144.6, and the associated \(p\)-value is nearly zero (given as < 2.2e-16).
At a significance level of \(\alpha = 0.05\), since the \(p\)-value is much smaller than 0.05, we reject the null hypothesis. This indicates that at least one predictor in the interaction model (
west_atlantic
*pf_expression_control
) is useful for predicting the response variable (pf_score
).
Partial slope test - do all predictors help explain \(y\)?
Since our overall model is significant (at least one predictor is
useful), we will continue on.
We could do a similar process to fit a new model while removing one
explanatory variable at at time, and using anova to compare these
models.
However, the tidy
output also helps here (the
statistic
and p.value
columns).
For each slope, we are testing if that slope is zero (when including
the other variables, the null) or if it is not zero (when including the
other variables, the alternative).
Because the interaction term is a combination of the other two
variables, we should assess the first.
- What is the \(t\) test statistic
and \(p\)-value associated with this
test?
Based on an \(\alpha = 0.05\) significant level, what should we conclude?
The \(t\) test statistic and \(p\)-value associated with the partial slope test for the interaction term
west_atlanticYes
:pf_expression_control are as follows:
\(t\) test statistic: 3.283544
\(p\)-value:1.262236×10 −3
Based on an \(\alpha = 0.05\) significance level, since the \(p\)-value (1.262236×10 −3) is less than \(\alpha\), we reject the null hypothesis that the coefficient of the interaction term is zero. Therefore, we conclude that the interaction between
west_atlanticYes
andpf_expression_control
significantly contributes to explaining the response variable.
If our interaction term was not significant, we could consider
removing it.
Now let’s look at our two non-interaction terms…
- What are the \(t\) test statistic
and \(p\)-value associated with these
tests?
Based on an \(\alpha = 0.05\) significant level, what should we conclude about these two predictors?
We would not need to do (21) if the interaction was
significant.
We also should not remove a main variable (non-interaction variable) if
the interaction variable remains in our model.
Residual assessment - how well does the model fit the data?
We have already done this step in past activities by exploring your
residuals (Activity 2).
Using our final model from Part 3, let’s assess how well our model fits
the data.
To assess how well the model fits the data, we can examine the residuals from the final model obtained in Part 3. We’ll look for patterns or trends in the residuals that might indicate inadequacies in the model fit. This can include checking for:
Randomness: Residuals should be randomly scattered around zero without any discernible pattern.
Homoscedasticity: The spread of residuals should be roughly constant across different levels of the predictors.
Normality: Residuals should follow a roughly normal distribution.
Let’s extract residuals
Now that we have extracted the residuals, we can create several plots to assess the model fit:
Scatterplot of Residuals vs. Predicted Values: This plot helps us check for randomness and homoscedasticity.
QQ Plot of Residuals: This plot helps us assess the normality assumption of residuals.
Residuals vs. Predictor Variables: These plots help us detect any patterns or trends in the residuals with respect to the predictor variables.
Let’s create these plots:
Or maybe not…