Activity02 - Simple Linear Regression
A typical modeling process
The process that we will use for this activity is:
- Identify our research question(s),
- Explore (graphically and with numerical summaries) the variables of interest - both individually and in relationship to one another,
- Fit a simple linear regression model to obtain and describe model estimates,
- Assess how “good” our model is, and
- Predict new values.
We will continue to update/tweak/adapt this process.
Before we begin, we set up our R session and introduce this activity’s
data.
Part 1
The setup
We will be using two packages from Posit (formerly RStudio): {tidyverse}
and
{tidymodels}
.
Let’s load these packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
You see that!
The data
The data we’re working with is from the OpenIntro site:
https://www.openintro.org/data/csv/hfi.csv
. Here is the
“about” page: https://www.openintro.org/data/index.php?data=hfi.
- Let’s load the data by downloading this file, uploading to RStudio, then reading it in.
- We will assign this data set into a data frame named
hfi
(short for “Human Freedom Index”).
## Rows: 1458 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ISO_code, countries, region
## dbl (120): year, pf_rol_procedural, pf_rol_civil, pf_rol_criminal, pf_rol, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Next, we check the characteristics of the dataframe.
## Rows: 1,458
## Columns: 123
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016,…
## $ ISO_code <chr> "ALB", "DZA", "AGO", "ARG", "ARM", …
## $ countries <chr> "Albania", "Algeria", "Angola", "Ar…
## $ region <chr> "Eastern Europe", "Middle East & No…
## $ pf_rol_procedural <dbl> 6.661503, NA, NA, 7.098483, NA, 8.4…
## $ pf_rol_civil <dbl> 4.547244, NA, NA, 5.791960, NA, 7.5…
## $ pf_rol_criminal <dbl> 4.666508, NA, NA, 4.343930, NA, 7.3…
## $ pf_rol <dbl> 5.291752, 3.819566, 3.451814, 5.744…
## $ pf_ss_homicide <dbl> 8.920429, 9.456254, 8.060260, 7.622…
## $ pf_ss_disappearances_disap <dbl> 10, 10, 5, 10, 10, 10, 10, 10, 10, …
## $ pf_ss_disappearances_violent <dbl> 10.000000, 9.294030, 10.000000, 10.…
## $ pf_ss_disappearances_organized <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_ss_disappearances_fatalities <dbl> 10.000000, 9.926119, 10.000000, 10.…
## $ pf_ss_disappearances_injuries <dbl> 10.000000, 9.990149, 10.000000, 9.9…
## $ pf_ss_disappearances <dbl> 10.000000, 8.842060, 8.500000, 9.49…
## $ pf_ss_women_fgm <dbl> 10.0, 10.0, 10.0, 10.0, 10.0, 10.0,…
## $ pf_ss_women_missing <dbl> 7.5, 7.5, 10.0, 10.0, 5.0, 10.0, 10…
## $ pf_ss_women_inheritance_widows <dbl> 5, 0, 5, 10, 10, 10, 10, 5, NA, 0, …
## $ pf_ss_women_inheritance_daughters <dbl> 5, 0, 5, 10, 10, 10, 10, 10, NA, 0,…
## $ pf_ss_women_inheritance <dbl> 5.0, 0.0, 5.0, 10.0, 10.0, 10.0, 10…
## $ pf_ss_women <dbl> 7.500000, 5.833333, 8.333333, 10.00…
## $ pf_ss <dbl> 8.806810, 8.043882, 8.297865, 9.040…
## $ pf_movement_domestic <dbl> 5, 5, 0, 10, 5, 10, 10, 5, 10, 10, …
## $ pf_movement_foreign <dbl> 10, 5, 5, 10, 5, 10, 10, 5, 10, 5, …
## $ pf_movement_women <dbl> 5, 5, 10, 10, 10, 10, 10, 5, NA, 5,…
## $ pf_movement <dbl> 6.666667, 5.000000, 5.000000, 10.00…
## $ pf_religion_estop_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_religion_estop <dbl> 10.0, 5.0, 10.0, 7.5, 5.0, 10.0, 10…
## $ pf_religion_harassment <dbl> 9.566667, 6.873333, 8.904444, 9.037…
## $ pf_religion_restrictions <dbl> 8.011111, 2.961111, 7.455556, 6.850…
## $ pf_religion <dbl> 9.192593, 4.944815, 8.786667, 7.795…
## $ pf_association_association <dbl> 10.0, 5.0, 2.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association_assembly <dbl> 10.0, 5.0, 2.5, 10.0, 7.5, 10.0, 10…
## $ pf_association_political_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_political <dbl> 10.0, 5.0, 2.5, 5.0, 5.0, 10.0, 10.…
## $ pf_association_prof_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_prof <dbl> 10.0, 5.0, 5.0, 7.5, 5.0, 10.0, 10.…
## $ pf_association_sport_establish <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport_operate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pf_association_sport <dbl> 10.0, 5.0, 7.5, 7.5, 7.5, 10.0, 10.…
## $ pf_association <dbl> 10.0, 5.0, 4.0, 7.5, 6.5, 10.0, 10.…
## $ pf_expression_killed <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_jailed <dbl> 10.000000, 10.000000, 10.000000, 10…
## $ pf_expression_influence <dbl> 5.0000000, 2.6666667, 2.6666667, 5.…
## $ pf_expression_control <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75,…
## $ pf_expression_cable <dbl> 10.0, 10.0, 7.5, 10.0, 7.5, 10.0, 1…
## $ pf_expression_newspapers <dbl> 10.0, 7.5, 5.0, 10.0, 7.5, 10.0, 10…
## $ pf_expression_internet <dbl> 10.0, 7.5, 7.5, 10.0, 7.5, 10.0, 10…
## $ pf_expression <dbl> 8.607143, 7.380952, 6.452381, 8.738…
## $ pf_identity_legal <dbl> 0, NA, 10, 10, 7, 7, 10, 0, NA, NA,…
## $ pf_identity_parental_marriage <dbl> 10, 0, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental_divorce <dbl> 10, 5, 10, 10, 10, 10, 10, 10, 10, …
## $ pf_identity_parental <dbl> 10.0, 2.5, 10.0, 10.0, 10.0, 10.0, …
## $ pf_identity_sex_male <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex_female <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_sex <dbl> 10, 0, 0, 10, 10, 10, 10, 10, 10, 1…
## $ pf_identity_divorce <dbl> 5, 0, 10, 10, 5, 10, 10, 5, NA, 0, …
## $ pf_identity <dbl> 6.2500000, 0.8333333, 7.5000000, 10…
## $ pf_score <dbl> 7.596281, 5.281772, 6.111324, 8.099…
## $ pf_rank <dbl> 57, 147, 117, 42, 84, 11, 8, 131, 6…
## $ ef_government_consumption <dbl> 8.232353, 2.150000, 7.600000, 5.335…
## $ ef_government_transfers <dbl> 7.509902, 7.817129, 8.886739, 6.048…
## $ ef_government_enterprises <dbl> 8, 0, 0, 6, 8, 10, 10, 0, 7, 10, 7,…
## $ ef_government_tax_income <dbl> 9, 7, 10, 7, 5, 5, 4, 9, 10, 10, 8,…
## $ ef_government_tax_payroll <dbl> 7, 2, 9, 1, 5, 5, 3, 4, 10, 10, 8, …
## $ ef_government_tax <dbl> 8.0, 4.5, 9.5, 4.0, 5.0, 5.0, 3.5, …
## $ ef_government <dbl> 7.935564, 3.616782, 6.496685, 5.346…
## $ ef_legal_judicial <dbl> 2.6682218, 4.1867042, 1.8431292, 3.…
## $ ef_legal_courts <dbl> 3.145462, 4.327113, 1.974566, 2.930…
## $ ef_legal_protection <dbl> 4.512228, 4.689952, 2.512364, 4.255…
## $ ef_legal_military <dbl> 8.333333, 4.166667, 3.333333, 7.500…
## $ ef_legal_integrity <dbl> 4.166667, 5.000000, 4.166667, 3.333…
## $ ef_legal_enforcement <dbl> 4.3874441, 4.5075380, 2.3022004, 3.…
## $ ef_legal_restrictions <dbl> 6.485287, 6.626692, 5.455882, 6.857…
## $ ef_legal_police <dbl> 6.933500, 6.136845, 3.016104, 3.385…
## $ ef_legal_crime <dbl> 6.215401, 6.737383, 4.291197, 4.133…
## $ ef_legal_gender <dbl> 0.9487179, 0.8205128, 0.8461538, 0.…
## $ ef_legal <dbl> 5.071814, 4.690743, 2.963635, 3.904…
## $ ef_money_growth <dbl> 8.986454, 6.955962, 9.385679, 5.233…
## $ ef_money_sd <dbl> 9.484575, 8.339152, 4.986742, 5.224…
## $ ef_money_inflation <dbl> 9.743600, 8.720460, 3.054000, 2.000…
## $ ef_money_currency <dbl> 10, 5, 5, 10, 10, 10, 10, 5, 0, 10,…
## $ ef_money <dbl> 9.553657, 7.253894, 5.606605, 5.614…
## $ ef_trade_tariffs_revenue <dbl> 9.626667, 8.480000, 8.993333, 6.060…
## $ ef_trade_tariffs_mean <dbl> 9.24, 6.22, 7.72, 7.26, 8.76, 9.50,…
## $ ef_trade_tariffs_sd <dbl> 8.0240, 5.9176, 4.2544, 5.9448, 8.0…
## $ ef_trade_tariffs <dbl> 8.963556, 6.872533, 6.989244, 6.421…
## $ ef_trade_regulatory_nontariff <dbl> 5.574481, 4.962589, 3.132738, 4.466…
## $ ef_trade_regulatory_compliance <dbl> 9.4053278, 0.0000000, 0.9171598, 5.…
## $ ef_trade_regulatory <dbl> 7.489905, 2.481294, 2.024949, 4.811…
## $ ef_trade_black <dbl> 10.00000, 5.56391, 10.00000, 0.0000…
## $ ef_trade_movement_foreign <dbl> 6.306106, 3.664829, 2.946919, 5.358…
## $ ef_trade_movement_capital <dbl> 4.6153846, 0.0000000, 3.0769231, 0.…
## $ ef_trade_movement_visit <dbl> 8.2969231, 1.1062564, 0.1106256, 7.…
## $ ef_trade_movement <dbl> 6.406138, 1.590362, 2.044823, 4.697…
## $ ef_trade <dbl> 8.214900, 4.127025, 5.264754, 3.982…
## $ ef_regulation_credit_ownership <dbl> 5, 0, 8, 5, 10, 10, 8, 5, 10, 10, 5…
## $ ef_regulation_credit_private <dbl> 7.295687, 5.301526, 9.194715, 4.259…
## $ ef_regulation_credit_interest <dbl> 9, 10, 4, 7, 10, 10, 10, 9, 10, 10,…
## $ ef_regulation_credit <dbl> 7.098562, 5.100509, 7.064905, 5.419…
## $ ef_regulation_labor_minwage <dbl> 5.566667, 5.566667, 8.900000, 2.766…
## $ ef_regulation_labor_firing <dbl> 5.396399, 3.896912, 2.656198, 2.191…
## $ ef_regulation_labor_bargain <dbl> 6.234861, 5.958321, 5.172987, 3.432…
## $ ef_regulation_labor_hours <dbl> 8, 6, 4, 10, 10, 10, 6, 6, 8, 8, 10…
## $ ef_regulation_labor_dismissal <dbl> 6.299741, 7.755176, 6.632764, 2.517…
## $ ef_regulation_labor_conscription <dbl> 10, 1, 0, 10, 0, 10, 3, 1, 10, 10, …
## $ ef_regulation_labor <dbl> 6.916278, 5.029513, 4.560325, 5.151…
## $ ef_regulation_business_adm <dbl> 6.072172, 3.722341, 2.758428, 2.404…
## $ ef_regulation_business_bureaucracy <dbl> 6.000000, 1.777778, 1.333333, 6.666…
## $ ef_regulation_business_start <dbl> 9.713864, 9.243070, 8.664627, 9.122…
## $ ef_regulation_business_bribes <dbl> 4.050196, 3.765515, 1.945540, 3.260…
## $ ef_regulation_business_licensing <dbl> 7.324582, 8.523503, 8.096776, 5.253…
## $ ef_regulation_business_compliance <dbl> 7.074366, 7.029528, 6.782923, 6.508…
## $ ef_regulation_business <dbl> 6.705863, 5.676956, 4.930271, 5.535…
## $ ef_regulation <dbl> 6.906901, 5.268992, 5.518500, 5.369…
## $ ef_score <dbl> 7.54, 4.99, 5.17, 4.84, 7.57, 7.98,…
## $ ef_rank <dbl> 34, 159, 155, 160, 29, 10, 27, 106,…
## $ hf_score <dbl> 7.568140, 5.135886, 5.640662, 6.469…
## $ hf_rank <dbl> 48, 155, 142, 107, 57, 4, 16, 130, …
## $ hf_quartile <dbl> 2, 4, 4, 3, 2, 1, 1, 4, 2, 2, 4, 2,…
After doing this and viewing the loaded data, let’s answer the
following questions:
1. What are the dimensions of the dataset?
Rows: 1,458
Columns: 123
What does each row represent?
Each row represents unique observations of country and year detailing various aspects of human freedom.
The dataset spans a lot of years.
We are only interested in data from year 2016.
We will now do the following:
- Filter the data
hfi
data frame for year 2016, and - Assign the result to a data frame named
hfi_2016
.
1. Identify our research question(s)
The research question is often defined by us (or the company, boss,
etc.).
Today’s research question/goal is to predict a country’s personal
freedom score in 2016.
For this activity we want to explore the relationship between the
personal freedom score, pf_score
, and the political
pressures and controls on media content
index,pf_expression_control
. Specifically, we are going to
use the political pressures and controls on media content index to
predict a country’s personal freedom score in 2016.
2. Explore the variables of interest
Let’s answer the following questions.
2. What type of plot can be used to display the distribution of the
personal freedom scores, pf_score
?
A histogram because this is a continuous variable*
Would this be the same type of plot to display the distribution of
the political pressures and controls on media content index,
pf_expression_control
?
Yes
Let’s now: - display this plot for pf_score
. - display
this plot for pf_expression_control
.
# Histogram for pf_score
hist(hfi_2016$pf_score, main = "Distribution of Personal Freedom Scores", xlab = "Personal Freedom Score")
# Histogram for pf_expression_control
hist(hfi_2016$pf_expression_control, main = "Distribution of Political Pressures and Controls on Media Content", xlab = "Political Pressures and Controls on Media Content Index")
Let’s comment on each of these two distributions by describing their centers, spread, shape, and any potential outliers.
Personal Freedom Scores (pf_score) Distribution:
Center: The center appears to be around 6.5
Spread: The spread varies from 5 to 10.
Shape: Left-skewed.
Potential Outliers: No potential outliers.
Political Pressures and Controls on Media Content (pf_expression_control) Distribution:
Center: The center appears to be around 4.5.
Spread: The spread varies.
Shape: Symmetric.
Potential Outliers: No potential outliers.
- Plot to display relationship between
pf_score
, andpf_expression_control
Since these are two continuous variables, we will use a scatter plot to to display the relationship between the personal freedom score,
pf_score
, and the political pressures and controls on media content index,pf_expression_control
- We will plot this relationship using the variable
pf_expression_control
as the predictor/explanatory variable.
# Scatter plot
ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
geom_point() +
labs(x = "Political Expression and Control", y = "Personal Freedom Score",
title = "Personal Freedom Score vs Political Expression and Control")
- Does the relationship look linear?
The relationship looks linear.
Since the relationship is linear and the data has a normal distribution as seen in the histograms, we would be comfortable using a linear model to predict the personal freedom score if we knew a country’s
pf_expression_control
, or its score out of 10, with 0 being the most, of political pressures and controls on media content.
Challenge
Using {dplyr}
skills, let’s obtain the appropriate
numerical summary statistics for each plot and provide more detailed
descriptions of the plots.
# Summary statistics for pf_score
pf_score_summary <- hfi_2016 %>%
summarize(
Mean = mean(pf_score),
Median = median(pf_score),
Mode = {
tbl <- table(pf_score)
as.numeric(names(tbl)[which.max(tbl)])
},
Range = max(pf_score) - min(pf_score),
IQR = IQR(pf_score),
SD = sd(pf_score),
Variance = var(pf_score),
Skewness = moments::skewness(pf_score),
Kurtosis = moments::kurtosis(pf_score)
)
# Summary statistics for pf_expression_control
pf_expression_control_summary <- hfi_2016 %>%
summarize(
Mean = mean(pf_expression_control),
Median = median(pf_expression_control),
Mode = {
tbl <- table(pf_expression_control)
as.numeric(names(tbl)[which.max(tbl)])
},
Range = max(pf_expression_control) - min(pf_expression_control),
IQR = IQR(pf_expression_control),
SD = sd(pf_expression_control),
Variance = var(pf_expression_control),
Skewness = moments::skewness(pf_expression_control),
Kurtosis = moments::kurtosis(pf_expression_control)
)
# Display the summary statistics
pf_score_summary
## # A tibble: 1 × 9
## Mean Median Mode Range IQR SD Variance Skewness Kurtosis
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.98 6.93 2.17 7.23 2.12 1.49 2.22 -0.393 3.01
## # A tibble: 1 × 9
## Mean Median Mode Range IQR SD Variance Skewness Kurtosis
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.98 5 3.75 9 3.44 2.32 5.40 -0.0663 2.08
Let’s explore how we can use the cor
function from Base
R to describe the relationship between two numerical variables?
The correlation coefficient is a numerical summary that indicates the strength and direction of the linear relationship between two numerical variables.
# The correlation coefficient
correlation <- cor(hfi_2016$pf_score, hfi_2016$pf_expression_control)
correlation
## [1] 0.8450646
3. Fit a simple linear regression model
We will now continue to fitting a simple linear regression (SLR)
model to these data.
The code that we will be using to fit statistical models in this
activity use {tidymodels}
- an opinionated way to fit
models in R
To begin, we will create a {parsnip}
specification for a
linear model.
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
The set_mode("regression")
is really
unnecessary/redundant as linear models ("lm"
) can only be
regression models.
You can type ?function_name
in the R
Console to explore a function’s help
documentation.
The above code also outputs the lm_spec
output.
This code does not do any calculations by itself, but rather specifies
what we plan to do.
Using this specification, we can now fit our model:
\(\texttt{pf\\_score} = \beta_0 + \beta_1 \times \texttt{pf\\_expression\\_control} + \varepsilon\).
Note, the “$” portion in the previous sentence is LaTeX snytax which
is a math scripting (and other scripting) language.
Let’s fit a linear model
# Fitting a linear model
slr_mod <- lm_spec %>%
fit(pf_score ~ pf_expression_control, data = hfi_2016)
tidy(slr_mod)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.28 0.149 28.8 4.23e-65
## 2 pf_expression_control 0.542 0.0271 20.0 2.31e-45
The above code fits our SLR model, then provides a tidy
parameter estimates table.
5. Let’s update the formula with the estimated parameters using the
tidy
output. That is, replace “intercept” and “slope” with
the appropriate values
\(\widehat{\texttt{pf\\_score}} = 4.28 + 0.542 \times \texttt{pf\\_expression\\_control}\)
- Let’s interpret each of the estimated parameters from (5) in the context of this research question. That is, explaining what these values represent.
The intercept (4.28) represents the baseline personal freedom score for a country with the most political pressures and controls on media content
pf_expression_control
(β₁): The coefficient for pf_expression_control represents the change in the estimated personal freedom score for a one-unit increase in the political pressures and controls on media content index (pf_expression_control
). In this case, for every one-unit increase inpf_expression_control
, we expect the estimated personal freedom score to increase by approximately 0.542 units.
Part 2
In Part 1, we were able to interpret the SLR model parameter estimates (i.e., the y-intercept and slope) as follows:
For countries with a
pf_expression_control
of 0 (those with the largest amount of political pressure on media content), we expect their mean personal freedom score to be 4.28.
For every 1 unit increase in
pf_expression_control
(political pressure on media content index), we expect a country’s mean personal freedom score to increase 0.542 units.
4. Assessing
4.a: Assessing with the part 1 model
To assess our model fit, we can use \(R^2\) (the coefficient of determination),
the proportion of variability in the response variable that is explained
by the explanatory variable.
We use glance
from {broom}
(which is
automatically loaded with {tidymodels}
-
{broom}
is also where tidy
is from) to access
this information.
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.714 0.712 0.799 400. 2.31e-45 1 -193. 391. 400.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Next, we will answer the following questions:
7. What is the value of \(R^2\) for
this model?
0.714
- What does this value mean in the context of this model?
What would a “good” value of \(R^2\) would be?
Can/should this value be “perfect”?
In the context of this model, the \(R^2\) value of 0.714 means that 71.4% of the variability in a country’s personal freedom score (pf_score) in 2016 can be explained by the political pressures and controls on media content index (pf_expression_control)
This is a good fit model because most of the variability in the response variable is explained by the predictor variable
\(R^2\) should not necessarily be perfect because this may be overfitting
4.b: Assessing with test/train
We previously fit a model and evaluated it using the exact same
data.
This is a bit of circular reasoning and does not provide much
information about the model’s performance.
Now we will work through the test/train process of fitting and assessing
a simple linear regression model.
To create a test/train split, we do the following:
- Create a new R code chunk and provide it with a descriptive tile
(e.g.,
train-test
). - Set a seed.
- Create an initial 80-20 split of the
hfi_2016
dataset - Using your initial split R object, assign the two splits into a training R object and a testing R object.
# Set seed for reproducibility
set.seed(123)
# Create an 80-20 split of the hfi_2016 dataset
hfi_split <- initial_split(hfi_2016, prop = 0.8)
# Assign the splits into training and testing datasets
hfi_train <- training(hfi_split)
hfi_test <- testing(hfi_split)
Now, we will use the training dataset to fit a SLR model.
slr_train <- lm_spec %>%
fit(pf_score ~ pf_expression_control, data = hfi_train)
# Display the tidy summary of the model
tidy(slr_train)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.32 0.166 26.1 7.85e-53
## 2 pf_expression_control 0.536 0.0299 17.9 1.41e-36
We can reuse the lm_spec
specification because we are
still doing a linear model.
9. Let’s update the below formula with the estimated parameters using
the tidy
output.
That is, replace “intercept” and “slope” with the appropriate values
\(\widehat{\texttt{pf\\_score}} = 4.32 + 0.536 \times \texttt{pf\\_expression\\_control}\)
- Next we interpret each of the estimated parameters from (10) in the context of this research question. That is, what do these values represent?
For countries with a
pf_expression_control
of 0 (those with the largest amount of political pressure on media content), we expect their mean personal freedom score to be 4.32.
For every 1 unit increase in
pf_expression_control
(political pressure on media content index), we expect a country’s mean personal freedom score to increase 0.536 units.
Now we will assess using the testing data set.
## # A tibble: 33 × 125
## .pred .resid year ISO_code countries region pf_rol_procedural pf_rol_civil
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 6.46 -1.18 2016 DZA Algeria Middle… NA NA
## 2 5.66 0.451 2016 AGO Angola Sub-Sa… NA NA
## 3 4.72 1.41 2016 BHR Bahrain Middle… NA NA
## 4 7.94 -0.506 2016 BLZ Belize Latin … 4.75 4.74
## 5 6.60 0.609 2016 BOL Bolivia Latin … 3.70 3.36
## 6 7.13 -0.257 2016 BWA Botswana Sub-Sa… 5.33 6.06
## 7 8.74 0.412 2016 CAN Canada North … 8.62 7.18
## 8 8.47 -0.485 2016 CPV Cape Verde Sub-Sa… NA NA
## 9 5.12 0.226 2016 CHN China East A… 3.95 5.38
## 10 5.12 -1.23 2016 EGY Egypt Middle… 2.95 3.76
## # ℹ 23 more rows
## # ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
## # pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
## # pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
## # pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
## # pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
## # pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
This takes the SLR model and applies it to the testing data.
Check in
Look at the various information produced by this code.
What does each column represent?
The .pred
column in this output can also be obtained by
using predict
(i.e.,
predict(slr_fit, new_data = data_test)
)
11. Now, using the responses to (7) and (8) as an example, let’s assess
how well this model fits the testing data.
We will compare these results here to Part 1 results of this
activity.
Did this model perform any differently?
\(R^2\) did not change
library(yardstick)
# Calculate performance metrics for the testing dataset
test_metrics <- test_aug %>%
metrics(truth = pf_score, estimate = .pred)
# Print the metrics
test_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.867
## 2 rsq standard 0.705
## 3 mae standard 0.701
Model diagnostics
To assess whether the linear model is reliable, we should check for
(1) linearity, (2) nearly normal residuals, and (3) constant
variability.
Note that the normal residuals is not really necessary for all models
(sometimes we simply want to describe a relationship for the data that
we have or population-level data, where statistical inference is not
appropriate/necessary).
In order to do these checks we need access to the fitted (predicted)
values and the residuals. We can use broom::augment
to
calculate these.
## # A tibble: 129 × 125
## .pred .resid year ISO_code countries region pf_rol_procedural pf_rol_civil
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 5.26 0.710 2016 VNM Vietnam South… 5.38 4.45
## 2 9.28 -0.288 2016 BEL Belgium Weste… 8.67 7.34
## 3 9.14 0.153 2016 FIN Finland Weste… 9.50 7.96
## 4 7.13 0.588 2016 PER Peru Latin… 6.17 4.41
## 5 6.87 0.0774 2016 DOM Dominican… Latin… 5.24 4.54
## 6 6.46 -0.456 2016 ZMB Zambia Sub-S… 3.30 4.89
## 7 6.33 -1.16 2016 ZWE Zimbabwe Sub-S… 2.43 4.33
## 8 6.33 0.257 2016 UKR Ukraine Easte… 4.77 5.14
## 9 6.33 1.08 2016 MKD Macedonia Easte… 4.68 5.63
## 10 6.60 0.238 2016 MDG Madagascar Sub-S… 3.33 3.93
## # ℹ 119 more rows
## # ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
## # pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
## # pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
## # pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
## # pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
## # pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
Linearity: We already checked if the relationship
between pf_score
and pf_expression_control
is
linear using a scatterplot. We should also verify this condition with a
plot of the residuals vs. fitted (predicted) values.
ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
Notice here that train_aug
can also serve as a data set
because stored within it are the fitted values (\(\hat{y}\)) and the residuals.
Also note that we are getting fancy with the code here.
After creating the scatterplot on the first layer (first line of code),
we overlay a red horizontal dashed line at \(y
= 0\) (to help us check whether the residuals are distributed
around 0), and we also rename the axis labels to be more
informative.
Let’s answer the following question:
11. Is there any apparent pattern in the residuals plot?
There is no apparent pattern
What does this indicate about the linearity of the relationship
between the two variables?
Nearly normal residuals: To check this condition, we
can look at a histogram of the residuals.
# residual histogram
ggplot(data = train_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.25) +
xlab("Residuals")
Let’s answer the following question:
12. Based on the histogram, does the nearly normal residuals condition
appear to be violated? Why or why not?
No. Because the histogram has only one peak. Although it is left-skewed.
Constant variability:
- Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?
No. The points are relatively equally distributed from the red dotted line across the plot.
Attribution
This document is based on labs from OpenIntro.
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.
This class activity was assigned to me by Prof. Bradford Dykes at
GVSU