Activity02 - Simple Linear Regression

A typical modeling process

The process that we will use for this activity is:

  1. Identify our research question(s),
  2. Explore (graphically and with numerical summaries) the variables of interest - both individually and in relationship to one another,
  3. Fit a simple linear regression model to obtain and describe model estimates,
  4. Assess how “good” our model is, and
  5. Predict new values.

We will continue to update/tweak/adapt this process.
Before we begin, we set up our R session and introduce this activity’s data.

Part 1

The setup

We will be using two packages from Posit (formerly RStudio): {tidyverse} and {tidymodels}.
Let’s load these packages:

# Loading necessary packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
# Set the default ggplot2 theme (requires the {ggthemes} package to be installed)
theme_set(ggthemes::theme_few())


The data

The data we’re working with is from the OpenIntro site: https://www.openintro.org/data/csv/hfi.csv. Here is the “about” page: https://www.openintro.org/data/index.php?data=hfi.

  • Let’s load the data by downloading this file, uploading to RStudio, then reading it in.
  • We will assign this data set into a data frame named hfi (short for “Human Freedom Index”).
# Loading the dataset
hfi <- readr::read_csv("hfi.csv")
## Rows: 1458 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): ISO_code, countries, region
## dbl (120): year, pf_rol_procedural, pf_rol_civil, pf_rol_criminal, pf_rol, p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Next, we check the characteristics of the dataframe.
# Checking the characteristics of the dataset
glimpse(hfi)
## Rows: 1,458
## Columns: 123
## $ year                               <dbl> 2016, 2016, 2016, 2016, 2016, 2016,…
## $ ISO_code                           <chr> "ALB", "DZA", "AGO", "ARG", "ARM", …
## $ countries                          <chr> "Albania", "Algeria", "Angola", "Ar…
## $ region                             <chr> "Eastern Europe", "Middle East & No…
## $ pf_rol_procedural                  <dbl> 6.661503, NA, NA, 7.098483, NA, 8.4…
## $ pf_rol_civil                       <dbl> 4.547244, NA, NA, 5.791960, NA, 7.5…
## $ pf_rol_criminal                    <dbl> 4.666508, NA, NA, 4.343930, NA, 7.3…
## $ pf_rol                             <dbl> 5.291752, 3.819566, 3.451814, 5.744…
## $ pf_ss_homicide                     <dbl> 8.920429, 9.456254, 8.060260, 7.622…
## $ pf_ss_disappearances_disap         <dbl> 10, 10, 5, 10, 10, 10, 10, 10, 10, …
## $ pf_ss_disappearances_violent       <dbl> 10.000000, 9.294030, 10.000000, 10.…
## $ pf_ss_disappearances_organized     <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_ss_disappearances_fatalities    <dbl> 10.000000, 9.926119, 10.000000, 10.…
## $ pf_ss_disappearances_injuries      <dbl> 10.000000, 9.990149, 10.000000, 9.9…
## $ pf_ss_disappearances               <dbl> 10.000000, 8.842060, 8.500000, 9.49…
## $ pf_ss_women_fgm                    <dbl> 10.0, 10.0, 10.0, 10.0, 10.0, 10.0,…
## $ pf_ss_women_missing                <dbl> 7.5, 7.5, 10.0, 10.0, 5.0, 10.0, 10…
## $ pf_ss_women_inheritance_widows     <dbl> 5, 0, 5, 10, 10, 10, 10, 5, NA, 0, …
## $ pf_ss_women_inheritance_daughters  <dbl> 5, 0, 5, 10, 10, 10, 10, 10, NA, 0,…
## $ pf_ss_women_inheritance            <dbl> 5.0, 0.0, 5.0, 10.0, 10.0, 10.0, 10…
## $ pf_ss_women                        <dbl> 7.500000, 5.833333, 8.333333, 10.00…
## $ pf_ss                              <dbl> 8.806810, 8.043882, 8.297865, 9.040…
## $ pf_movement_domestic               <dbl> 5, 5, 0, 10, 5, 10, 10, 5, 10, 10, …
## $ pf_movement_foreign                <dbl> 10, 5, 5, 10, 5, 10, 10, 5, 10, 5, …
## $ pf_movement_women                  <dbl> 5, 5, 10, 10, 10, 10, 10, 5, NA, 5,…
## $ pf_movement                        <dbl> 6.666667, 5.000000, 5.000000, 10.00…
## $ pf_religion_estop_establish        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop_operate          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop                  <dbl> 10.0, 5.0, 10.0, 7.5, 5.0, 10.0, 10…
## $ pf_religion_harassment             <dbl> 9.566667, 6.873333, 8.904444, 9.037…
## $ pf_religion_restrictions           <dbl> 8.011111, 2.961111, 7.455556, 6.850…
## $ pf_religion                        <dbl> 9.192593, 4.944815, 8.786667, 7.795…
## $ pf_association_association         <dbl> 10.0, 5.0, 2.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association_assembly            <dbl> 10.0, 5.0, 2.5, 10.0, 7.5, 10.0, 10…
## $ pf_association_political_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political_operate   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political           <dbl> 10.0, 5.0, 2.5, 5.0, 5.0, 10.0, 10.…
## $ pf_association_prof_establish      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof_operate        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof                <dbl> 10.0, 5.0, 5.0, 7.5, 5.0, 10.0, 10.…
## $ pf_association_sport_establish     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport_operate       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport               <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association                     <dbl> 10.0, 5.0, 4.0, 7.5, 6.5, 10.0, 10.…
## $ pf_expression_killed               <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_jailed               <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_influence            <dbl> 5.0000000, 2.6666667, 2.6666667, 5.…
## $ pf_expression_control              <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75,…
## $ pf_expression_cable                <dbl> 10.0, 10.0, 7.5, 10.0, 7.5, 10.0, 1…
## $ pf_expression_newspapers           <dbl> 10.0, 7.5, 5.0, 10.0, 7.5, 10.0, 10…
## $ pf_expression_internet             <dbl> 10.0, 7.5, 7.5, 10.0, 7.5, 10.0, 10…
## $ pf_expression                      <dbl> 8.607143, 7.380952, 6.452381, 8.738…
## $ pf_identity_legal                  <dbl> 0, NA, 10, 10, 7, 7, 10, 0, NA, NA,…
## $ pf_identity_parental_marriage      <dbl> 10, 0, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental_divorce       <dbl> 10, 5, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental               <dbl> 10.0, 2.5, 10.0, 10.0, 10.0, 10.0, …
## $ pf_identity_sex_male               <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex_female             <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex                    <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_divorce                <dbl> 5, 0, 10, 10, 5, 10, 10, 5, NA, 0, …
## $ pf_identity                        <dbl> 6.2500000, 0.8333333, 7.5000000, 10…
## $ pf_score                           <dbl> 7.596281, 5.281772, 6.111324, 8.099…
## $ pf_rank                            <dbl> 57, 147, 117, 42, 84, 11, 8, 131, 6…
## $ ef_government_consumption          <dbl> 8.232353, 2.150000, 7.600000, 5.335…
## $ ef_government_transfers            <dbl> 7.509902, 7.817129, 8.886739, 6.048…
## $ ef_government_enterprises          <dbl> 8, 0, 0, 6, 8, 10, 10, 0, 7, 10, 7,…
## $ ef_government_tax_income           <dbl> 9, 7, 10, 7, 5, 5, 4, 9, 10, 10, 8,…
## $ ef_government_tax_payroll          <dbl> 7, 2, 9, 1, 5, 5, 3, 4, 10, 10, 8, …
## $ ef_government_tax                  <dbl> 8.0, 4.5, 9.5, 4.0, 5.0, 5.0, 3.5, …
## $ ef_government                      <dbl> 7.935564, 3.616782, 6.496685, 5.346…
## $ ef_legal_judicial                  <dbl> 2.6682218, 4.1867042, 1.8431292, 3.…
## $ ef_legal_courts                    <dbl> 3.145462, 4.327113, 1.974566, 2.930…
## $ ef_legal_protection                <dbl> 4.512228, 4.689952, 2.512364, 4.255…
## $ ef_legal_military                  <dbl> 8.333333, 4.166667, 3.333333, 7.500…
## $ ef_legal_integrity                 <dbl> 4.166667, 5.000000, 4.166667, 3.333…
## $ ef_legal_enforcement               <dbl> 4.3874441, 4.5075380, 2.3022004, 3.…
## $ ef_legal_restrictions              <dbl> 6.485287, 6.626692, 5.455882, 6.857…
## $ ef_legal_police                    <dbl> 6.933500, 6.136845, 3.016104, 3.385…
## $ ef_legal_crime                     <dbl> 6.215401, 6.737383, 4.291197, 4.133…
## $ ef_legal_gender                    <dbl> 0.9487179, 0.8205128, 0.8461538, 0.…
## $ ef_legal                           <dbl> 5.071814, 4.690743, 2.963635, 3.904…
## $ ef_money_growth                    <dbl> 8.986454, 6.955962, 9.385679, 5.233…
## $ ef_money_sd                        <dbl> 9.484575, 8.339152, 4.986742, 5.224…
## $ ef_money_inflation                 <dbl> 9.743600, 8.720460, 3.054000, 2.000…
## $ ef_money_currency                  <dbl> 10, 5, 5, 10, 10, 10, 10, 5, 0, 10,…
## $ ef_money                           <dbl> 9.553657, 7.253894, 5.606605, 5.614…
## $ ef_trade_tariffs_revenue           <dbl> 9.626667, 8.480000, 8.993333, 6.060…
## $ ef_trade_tariffs_mean              <dbl> 9.24, 6.22, 7.72, 7.26, 8.76, 9.50,…
## $ ef_trade_tariffs_sd                <dbl> 8.0240, 5.9176, 4.2544, 5.9448, 8.0…
## $ ef_trade_tariffs                   <dbl> 8.963556, 6.872533, 6.989244, 6.421…
## $ ef_trade_regulatory_nontariff      <dbl> 5.574481, 4.962589, 3.132738, 4.466…
## $ ef_trade_regulatory_compliance     <dbl> 9.4053278, 0.0000000, 0.9171598, 5.…
## $ ef_trade_regulatory                <dbl> 7.489905, 2.481294, 2.024949, 4.811…
## $ ef_trade_black                     <dbl> 10.00000, 5.56391, 10.00000, 0.0000…
## $ ef_trade_movement_foreign          <dbl> 6.306106, 3.664829, 2.946919, 5.358…
## $ ef_trade_movement_capital          <dbl> 4.6153846, 0.0000000, 3.0769231, 0.…
## $ ef_trade_movement_visit            <dbl> 8.2969231, 1.1062564, 0.1106256, 7.…
## $ ef_trade_movement                  <dbl> 6.406138, 1.590362, 2.044823, 4.697…
## $ ef_trade                           <dbl> 8.214900, 4.127025, 5.264754, 3.982…
## $ ef_regulation_credit_ownership     <dbl> 5, 0, 8, 5, 10, 10, 8, 5, 10, 10, 5…
## $ ef_regulation_credit_private       <dbl> 7.295687, 5.301526, 9.194715, 4.259…
## $ ef_regulation_credit_interest      <dbl> 9, 10, 4, 7, 10, 10, 10, 9, 10, 10,…
## $ ef_regulation_credit               <dbl> 7.098562, 5.100509, 7.064905, 5.419…
## $ ef_regulation_labor_minwage        <dbl> 5.566667, 5.566667, 8.900000, 2.766…
## $ ef_regulation_labor_firing         <dbl> 5.396399, 3.896912, 2.656198, 2.191…
## $ ef_regulation_labor_bargain        <dbl> 6.234861, 5.958321, 5.172987, 3.432…
## $ ef_regulation_labor_hours          <dbl> 8, 6, 4, 10, 10, 10, 6, 6, 8, 8, 10…
## $ ef_regulation_labor_dismissal      <dbl> 6.299741, 7.755176, 6.632764, 2.517…
## $ ef_regulation_labor_conscription   <dbl> 10, 1, 0, 10, 0, 10, 3, 1, 10, 10, …
## $ ef_regulation_labor                <dbl> 6.916278, 5.029513, 4.560325, 5.151…
## $ ef_regulation_business_adm         <dbl> 6.072172, 3.722341, 2.758428, 2.404…
## $ ef_regulation_business_bureaucracy <dbl> 6.000000, 1.777778, 1.333333, 6.666…
## $ ef_regulation_business_start       <dbl> 9.713864, 9.243070, 8.664627, 9.122…
## $ ef_regulation_business_bribes      <dbl> 4.050196, 3.765515, 1.945540, 3.260…
## $ ef_regulation_business_licensing   <dbl> 7.324582, 8.523503, 8.096776, 5.253…
## $ ef_regulation_business_compliance  <dbl> 7.074366, 7.029528, 6.782923, 6.508…
## $ ef_regulation_business             <dbl> 6.705863, 5.676956, 4.930271, 5.535…
## $ ef_regulation                      <dbl> 6.906901, 5.268992, 5.518500, 5.369…
## $ ef_score                           <dbl> 7.54, 4.99, 5.17, 4.84, 7.57, 7.98,…
## $ ef_rank                            <dbl> 34, 159, 155, 160, 29, 10, 27, 106,…
## $ hf_score                           <dbl> 7.568140, 5.135886, 5.640662, 6.469…
## $ hf_rank                            <dbl> 48, 155, 142, 107, 57, 4, 16, 130, …
## $ hf_quartile                        <dbl> 2, 4, 4, 3, 2, 1, 1, 4, 2, 2, 4, 2,…

After doing this and viewing the loaded data, let’s answer the following questions:
1. What are the dimensions of the dataset?

Rows: 1,458

Columns: 123

What does each row represent?

Each row represents a unique country-year observation, detailing various aspects of human freedom for that country in that year.
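
To double-check this, a quick {dplyr} sketch (not part of the original activity output) counts country-year combinations and looks for duplicates; an empty result would confirm that each row is a unique country-year observation.

# Sanity check (sketch): are any country-year combinations duplicated?
hfi %>%
  count(countries, year) %>%
  filter(n > 1)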

The dataset spans multiple years.
We are only interested in data from year 2016.
We will now do the following:

  • Filter the hfi data frame for the year 2016, and
  • Assign the result to a data frame named hfi_2016.
# Selecting data from only 2016
hfi_2016 <- hfi %>% 
  filter(year == 2016)

1. Identify our research question(s)

The research question is often defined by us (or the company, boss, etc.).
Today’s research question/goal is to predict a country’s personal freedom score in 2016.
For this activity we want to explore the relationship between the personal freedom score, pf_score, and the political pressures and controls on media content index, pf_expression_control. Specifically, we are going to use the political pressures and controls on media content index to predict a country’s personal freedom score in 2016.

2. Explore the variables of interest

Let’s answer the following questions.
2. What type of plot can be used to display the distribution of the personal freedom scores, pf_score?

A histogram, because this is a continuous variable.

Would the same type of plot work to display the distribution of the political pressures and controls on media content index, pf_expression_control?

Yes

Let’s now:

  • display this plot for pf_score, and
  • display this plot for pf_expression_control.

# Histogram for pf_score
hist(hfi_2016$pf_score, main = "Distribution of Personal Freedom Scores", xlab = "Personal Freedom Score")

# Histogram for pf_expression_control
hist(hfi_2016$pf_expression_control, main = "Distribution of Political Pressures and Controls on Media Content", xlab = "Political Pressures and Controls on Media Content Index")
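
Since {ggplot2} is already loaded, equivalent histograms could also be drawn with it. A minimal sketch (the binwidth of 0.5 is an arbitrary choice):

# ggplot2 versions of the same histograms (sketch)
ggplot(hfi_2016, aes(x = pf_score)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Personal Freedom Score",
       title = "Distribution of Personal Freedom Scores")

ggplot(hfi_2016, aes(x = pf_expression_control)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Political Pressures and Controls on Media Content Index",
       title = "Distribution of pf_expression_control")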

Let’s comment on each of these two distributions by describing their centers, spread, shape, and any potential outliers.

  • Personal Freedom Scores (pf_score) Distribution:

  • Center: The center appears to be around 7 (the median is about 6.9).

  • Spread: The scores are moderately spread out (SD ≈ 1.5, IQR ≈ 2.1; see the summary statistics in the Challenge below).

  • Shape: Left-skewed.

  • Potential Outliers: No potential outliers.

  • Political Pressures and Controls on Media Content (pf_expression_control) Distribution:

  • Center: The center appears to be around 5 (the median is 5).

  • Spread: The index is widely spread across its 0 to 10 scale (SD ≈ 2.3, IQR ≈ 3.4; see the summary statistics in the Challenge below).

  • Shape: Roughly symmetric.

  • Potential Outliers: No potential outliers.

  3. Plot to display the relationship between pf_score and pf_expression_control

Since these are two continuous variables, we will use a scatter plot to display the relationship between the personal freedom score, pf_score, and the political pressures and controls on media content index, pf_expression_control.

  • We will plot this relationship using the variable pf_expression_control as the predictor/explanatory variable.
# Scatter plot
ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  labs(x = "Political Expression and Control", y = "Personal Freedom Score",
       title = "Personal Freedom Score vs Political Expression and Control")

  4. Does the relationship look linear?

The relationship looks linear.

Since the relationship looks linear and the histograms show reasonably well-behaved distributions with no extreme outliers, we would be comfortable using a linear model to predict the personal freedom score if we knew a country’s pf_expression_control, i.e., its score out of 10, with 0 indicating the most political pressure and control on media content.
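
To preview that linear fit visually, we could overlay a least-squares line on the scatterplot from above. A minimal sketch:

# Scatter plot with a least-squares line overlaid (sketch)
ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Political Expression and Control", y = "Personal Freedom Score",
       title = "Personal Freedom Score vs Political Expression and Control")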

Challenge

Using {dplyr} skills, let’s obtain the appropriate numerical summary statistics for each plot and provide more detailed descriptions of the plots.

# Summary statistics for pf_score
# (skewness and kurtosis use the {moments} package; install it if not available)
pf_score_summary <- hfi_2016 %>%
  summarize(
    Mean = mean(pf_score),
    Median = median(pf_score),
    Mode = {
      tbl <- table(pf_score)
      as.numeric(names(tbl)[which.max(tbl)])
    },
    Range = max(pf_score) - min(pf_score),
    IQR = IQR(pf_score),
    SD = sd(pf_score),
    Variance = var(pf_score),
    Skewness = moments::skewness(pf_score),
    Kurtosis = moments::kurtosis(pf_score)
  )

# Summary statistics for pf_expression_control
pf_expression_control_summary <- hfi_2016 %>%
  summarize(
    Mean = mean(pf_expression_control),
    Median = median(pf_expression_control),
    Mode = {
      tbl <- table(pf_expression_control)
      as.numeric(names(tbl)[which.max(tbl)])
    },
    Range = max(pf_expression_control) - min(pf_expression_control),
    IQR = IQR(pf_expression_control),
    SD = sd(pf_expression_control),
    Variance = var(pf_expression_control),
    Skewness = moments::skewness(pf_expression_control),
    Kurtosis = moments::kurtosis(pf_expression_control)
  )

# Display the summary statistics
pf_score_summary
## # A tibble: 1 × 9
##    Mean Median  Mode Range   IQR    SD Variance Skewness Kurtosis
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1  6.98   6.93  2.17  7.23  2.12  1.49     2.22   -0.393     3.01
pf_expression_control_summary
## # A tibble: 1 × 9
##    Mean Median  Mode Range   IQR    SD Variance Skewness Kurtosis
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1  4.98      5  3.75     9  3.44  2.32     5.40  -0.0663     2.08

Let’s explore how we can use the cor() function from base R to describe the relationship between two numerical variables.

The correlation coefficient is a numerical summary that indicates the strength and direction of the linear relationship between two numerical variables.

# The correlation coefficient
correlation <- cor(hfi_2016$pf_score, hfi_2016$pf_expression_control)

correlation
## [1] 0.8450646
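
The value of roughly 0.85 indicates a strong, positive linear relationship. These two columns appear to contain no missing values for 2016 (otherwise cor() would have returned NA); if they did, a sketch like the following would restrict the calculation to complete cases:

# Correlation computed on complete cases only (sketch, in case of missing values)
cor(hfi_2016$pf_score, hfi_2016$pf_expression_control, use = "complete.obs")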

3. Fit a simple linear regression model

We will now move on to fitting a simple linear regression (SLR) model to these data.
The code that we use to fit statistical models in this activity relies on {tidymodels}, an opinionated framework for fitting models in R.
To begin, we will create a {parsnip} specification for a linear model.

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

The set_mode("regression") call is technically redundant here because linear models ("lm") can only be regression models.
You can type ?function_name in the R Console to explore a function’s help documentation.
The code above also prints lm_spec.
This code does not do any calculations by itself, but rather specifies what we plan to do.
Using this specification, we can now fit our model:

\(\texttt{pf\_score} = \beta_0 + \beta_1 \times \texttt{pf\_expression\_control} + \varepsilon\).

Note: the equation in the previous sentence is written in LaTeX syntax, a markup language for typesetting mathematics (among other things).
Let’s fit the linear model.

# Fitting a linear model
slr_mod <- lm_spec %>% 
  fit(pf_score ~ pf_expression_control, data = hfi_2016)

tidy(slr_mod)
## # A tibble: 2 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              4.28     0.149       28.8 4.23e-65
## 2 pf_expression_control    0.542    0.0271      20.0 2.31e-45

The above code fits our SLR model, then provides a tidy parameter estimates table.
5. Let’s update the formula with the estimated parameters using the tidy output. That is, replace “intercept” and “slope” with the appropriate values.

\(\widehat{\texttt{pf\_score}} = 4.28 + 0.542 \times \texttt{pf\_expression\_control}\)

  6. Let’s interpret each of the estimated parameters from (5) in the context of this research question. That is, explain what these values represent.

Intercept (β₀): The intercept (4.28) represents the predicted personal freedom score for a country with a pf_expression_control of 0, i.e., the most political pressure and control on media content.

pf_expression_control (β₁): The coefficient for pf_expression_control represents the change in the estimated personal freedom score for a one-unit increase in the political pressures and controls on media content index (pf_expression_control). In this case, for every one-unit increase in pf_expression_control, we expect the estimated personal freedom score to increase by approximately 0.542 units.
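
As a quick check of the fitted equation, we could predict the personal freedom score for a hypothetical country; the value pf_expression_control = 5 below is chosen purely for illustration. By hand, this is 4.28 + 0.542 × 5 ≈ 6.99.

# Predicted pf_score for a hypothetical pf_expression_control of 5 (sketch)
predict(slr_mod, new_data = tibble(pf_expression_control = 5))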

Part 2

In Part 1, we were able to interpret the SLR model parameter estimates (i.e., the y-intercept and slope) as follows:

For countries with a pf_expression_control of 0 (those with the largest amount of political pressure on media content), we expect their mean personal freedom score to be 4.28.

For every 1 unit increase in pf_expression_control (political pressure on media content index), we expect a country’s mean personal freedom score to increase 0.542 units.

4. Assessing

4.a: Assessing with the part 1 model

To assess our model fit, we can use \(R^2\) (the coefficient of determination), the proportion of variability in the response variable that is explained by the explanatory variable.
We use glance from {broom} (which is automatically loaded with {tidymodels} - {broom} is also where tidy is from) to access this information.

# Getting R-squared
glance(slr_mod)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.714         0.712 0.799      400. 2.31e-45     1  -193.  391.  400.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Next, we will answer the following questions:
7. What is the value of \(R^2\) for this model?

0.714

  8. What does this value mean in the context of this model?
    What would a “good” value of \(R^2\) be?
    Can/should this value be “perfect”?

In the context of this model, the \(R^2\) value of 0.714 means that 71.4% of the variability in a country’s personal freedom score (pf_score) in 2016 can be explained by the political pressures and controls on media content index (pf_expression_control).

This indicates a reasonably good fit, because most of the variability in the response variable is explained by the predictor variable.

\(R^2\) should not be expected to be “perfect” (equal to 1); a perfect value would more likely suggest overfitting than a genuinely useful model.
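
As a sanity check, in simple linear regression \(R^2\) is the square of the correlation coefficient we computed earlier, so the following should return approximately 0.845^2 ≈ 0.714, matching glance():

# R^2 for SLR equals the squared correlation coefficient
correlation^2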

4.b: Assessing with test/train

We previously fit a model and evaluated it using the exact same data.
This is a bit of circular reasoning and does not tell us much about how the model will perform on data it has not seen.
Now we will work through the test/train process of fitting and assessing a simple linear regression model.
To create a test/train split, we do the following:

  • Create a new R code chunk and provide it with a descriptive title (e.g., train-test).
  • Set a seed.
  • Create an initial 80-20 split of the hfi_2016 dataset, and
  • Using your initial split R object, assign the two splits into a training R object and a testing R object.
# Set seed for reproducibility
set.seed(123)

# Create an 80-20 split of the hfi_2016 dataset
hfi_split <- initial_split(hfi_2016, prop = 0.8)

# Assign the splits into training and testing datasets
hfi_train <- training(hfi_split)
hfi_test <- testing(hfi_split)
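
As a quick check (a sketch, not required by the activity), we could confirm the sizes of the two splits:

# Confirm the split sizes (roughly 80% / 20% of hfi_2016)
nrow(hfi_train)
nrow(hfi_test)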

Now, we will use the training dataset to fit a SLR model.

slr_train <- lm_spec %>% 
  fit(pf_score ~ pf_expression_control, data = hfi_train)

# Display the tidy summary of the model
tidy(slr_train)
## # A tibble: 2 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              4.32     0.166       26.1 7.85e-53
## 2 pf_expression_control    0.536    0.0299      17.9 1.41e-36

We can reuse the lm_spec specification because we are still fitting a linear model.
9. Let’s update the formula below with the estimated parameters using the tidy output.
That is, replace “intercept” and “slope” with the appropriate values.

\(\widehat{\texttt{pf\_score}} = 4.32 + 0.536 \times \texttt{pf\_expression\_control}\)

  10. Next, we interpret each of the estimated parameters from (9) in the context of this research question. That is, what do these values represent?

For countries with a pf_expression_control of 0 (those with the largest amount of political pressure on media content), we expect their mean personal freedom score to be 4.32.

For every 1 unit increase in pf_expression_control (political pressure on media content index), we expect a country’s mean personal freedom score to increase 0.536 units.

Now we will assess the model using the testing data set.

test_aug <- augment(slr_train, new_data = hfi_test)
test_aug
## # A tibble: 33 × 125
##    .pred .resid  year ISO_code countries  region  pf_rol_procedural pf_rol_civil
##    <dbl>  <dbl> <dbl> <chr>    <chr>      <chr>               <dbl>        <dbl>
##  1  6.46 -1.18   2016 DZA      Algeria    Middle…             NA           NA   
##  2  5.66  0.451  2016 AGO      Angola     Sub-Sa…             NA           NA   
##  3  4.72  1.41   2016 BHR      Bahrain    Middle…             NA           NA   
##  4  7.94 -0.506  2016 BLZ      Belize     Latin …              4.75         4.74
##  5  6.60  0.609  2016 BOL      Bolivia    Latin …              3.70         3.36
##  6  7.13 -0.257  2016 BWA      Botswana   Sub-Sa…              5.33         6.06
##  7  8.74  0.412  2016 CAN      Canada     North …              8.62         7.18
##  8  8.47 -0.485  2016 CPV      Cape Verde Sub-Sa…             NA           NA   
##  9  5.12  0.226  2016 CHN      China      East A…              3.95         5.38
## 10  5.12 -1.23   2016 EGY      Egypt      Middle…              2.95         3.76
## # ℹ 23 more rows
## # ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
## #   pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
## #   pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
## #   pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
## #   pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
## #   pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …

This takes the SLR model and applies it to the testing data.
Check in:
Look at the various information produced by this code.
What does each column represent?
The .pred column in this output can also be obtained by using predict (i.e., predict(slr_train, new_data = hfi_test)).
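
For example, a minimal sketch that produces the predictions with predict() and keeps only a few columns of interest (the column selection is an arbitrary choice):

# Predictions via predict(), combined with a few columns of interest (sketch)
predict(slr_train, new_data = hfi_test) %>%
  bind_cols(hfi_test %>% select(countries, pf_expression_control, pf_score))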
11. Now, using the responses to (7) and (8) as an example, let’s assess how well this model fits the testing data.
We will compare these results here to Part 1 results of this activity.
Did this model perform any differently?

The test-set \(R^2\) (0.705, computed below) is essentially unchanged from the Part 1 value of 0.714, so the model performs about the same on the testing data.

# {yardstick} is already loaded with {tidymodels}; attaching it explicitly for clarity
library(yardstick)

# Calculate performance metrics for the testing dataset
test_metrics <- test_aug %>%
  metrics(truth = pf_score, estimate = .pred)

# Print the metrics
test_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.867
## 2 rsq     standard       0.705
## 3 mae     standard       0.701
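
For comparison (a sketch, not part of the original output), the same metrics could be computed on the training data to see how much performance drops on data the model has not seen:

# The same metrics on the training data, for comparison with the test set (sketch)
augment(slr_train, new_data = hfi_train) %>%
  metrics(truth = pf_score, estimate = .pred)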

Model diagnostics

To assess whether the linear model is reliable, we should check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Note that the nearly normal residuals condition is not strictly necessary for all models (sometimes we simply want to describe a relationship for the data that we have, or for population-level data, where statistical inference is not appropriate/necessary).
In order to do these checks we need access to the fitted (predicted) values and the residuals. We can use broom::augment to calculate these.

train_aug <- augment(slr_train, new_data = hfi_train)
train_aug
## # A tibble: 129 × 125
##    .pred  .resid  year ISO_code countries  region pf_rol_procedural pf_rol_civil
##    <dbl>   <dbl> <dbl> <chr>    <chr>      <chr>              <dbl>        <dbl>
##  1  5.26  0.710   2016 VNM      Vietnam    South…              5.38         4.45
##  2  9.28 -0.288   2016 BEL      Belgium    Weste…              8.67         7.34
##  3  9.14  0.153   2016 FIN      Finland    Weste…              9.50         7.96
##  4  7.13  0.588   2016 PER      Peru       Latin…              6.17         4.41
##  5  6.87  0.0774  2016 DOM      Dominican… Latin…              5.24         4.54
##  6  6.46 -0.456   2016 ZMB      Zambia     Sub-S…              3.30         4.89
##  7  6.33 -1.16    2016 ZWE      Zimbabwe   Sub-S…              2.43         4.33
##  8  6.33  0.257   2016 UKR      Ukraine    Easte…              4.77         5.14
##  9  6.33  1.08    2016 MKD      Macedonia  Easte…              4.68         5.63
## 10  6.60  0.238   2016 MDG      Madagascar Sub-S…              3.33         3.93
## # ℹ 119 more rows
## # ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
## #   pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
## #   pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
## #   pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
## #   pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
## #   pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …

Linearity: We already checked if the relationship between pf_score and pf_expression_control is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.

ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

Notice here that train_aug can also serve as a data set because stored within it are the fitted values (\(\hat{y}\)) and the residuals.
Also note that we are getting fancy with the code here.
After creating the scatterplot on the first layer (first line of code), we overlay a red horizontal dashed line at \(y = 0\) (to help us check whether the residuals are distributed around 0), and we also rename the axis labels to be more informative.
Let’s answer the following question:
12. Is there any apparent pattern in the residuals plot?
What does this indicate about the linearity of the relationship between the two variables?

There is no apparent pattern; the residuals appear randomly scattered around 0, which indicates that a linear relationship between the two variables is reasonable.

Nearly normal residuals: To check this condition, we can look at a histogram of the residuals.

# residual histogram
ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Residuals")

Let’s answer the following question:
13. Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?

No. The histogram of the residuals is unimodal and roughly centered at 0, although it is somewhat left-skewed, so the condition does not appear to be seriously violated.
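
Another common way to check this condition is a normal quantile-quantile (Q-Q) plot of the residuals; points that stay close to the reference line support the nearly normal residuals condition. A minimal sketch:

# Normal Q-Q plot of the residuals (sketch)
ggplot(train_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()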

Constant variability:

  14. Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?

No. The points are spread relatively evenly around the red dashed line across the full range of fitted values, so the constant variability condition does not appear to be violated.
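
A plot of the absolute residuals against the fitted values is another way to eyeball constant variability; a roughly flat band of points with no trend supports the condition. A minimal sketch:

# Absolute residuals vs. fitted values (sketch)
ggplot(train_aug, aes(x = .pred, y = abs(.resid))) +
  geom_point() +
  xlab("Fitted values") +
  ylab("Absolute residuals")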

Attribution

This document is based on labs from OpenIntro.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This class activity was assigned to me by Prof. Bradford Dykes at GVSU.