library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
data('hfi', package='openintro')
# Get the dimensions of the evals dataset
dim(evals)
## [1] 463 23
As we can see, we have 463 rows and 23 columns in this dataset.
# Scatterplot of pf_score vs pf_expression_control using the HFI dataset
ggplot(hfi, aes(x = pf_expression_control, y = pf_score)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Personal Freedom Score vs Expression Control",
x = "Expression Control (0 = most control, 10 = least)",
y = "Personal Freedom Score")
The scatterplot illustrates a strong positive linear relationship
between pf_expression_control and pf_score. As expression control
becomes less strict, personal freedom scores tend to rise. The
relationship appears fairly linear, with a consistent upward trend.
While there is some spread, especially at mid-levels of control, the
linear model seems appropriate for predicting personal freedom from
expression control.
hfi %>%
summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))
## # A tibble: 1 × 1
## `cor(pf_expression_control, pf_score, use = "complete.obs")`
## <dbl>
## 1 0.796
The relationship between pf_expression_control and pf_score appears to be linear in form, meaning the data points roughly follow a straight-line pattern.
The direction of the relationship is positive: as pf_expression_control increases (i.e., media and expression are less controlled), the pf_score also increases, indicating higher levels of personal freedom.
The strength of the relationship is moderate to strong, as evidenced by the .79 correlation coefficient, the clear upward trend and relatively narrow band of points around the regression line, especially in the mid-to-high range of expression control.
There are a few unusual observations (potential outliers). For instance, some countries with high expression control (values near 0 or 1) still have moderately high personal freedom scores (around 6–7), which could indicate countries that restrict media but are otherwise free in personal domains.
Overall, the pattern suggests that greater freedom of expression is strongly associated with higher overall personal freedom, and a linear model is likely appropriate for modeling this relationship.
# This will only work interactively (i.e. will not show in the knitted document)
hfi <- hfi %>% filter(complete.cases(pf_expression_control, pf_score))
DATA606::plot_ss(x = hfi$pf_expression_control, y = hfi$pf_score)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 4.6171 0.4914
##
## Sum of Squares: 952.153
DATA606::plot_ss(x = hfi$pf_expression_control, y = hfi$pf_score, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 4.6171 0.4914
##
## Sum of Squares: 952.153
The smallest I can get is 952.153
m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(m1)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
# Fit model to predict hf_score from pf_expression_control
m2 <- lm(hf_score ~ pf_expression_control, data = hfi)
summary(m2)
##
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6198 -0.4908 0.1031 0.4703 2.2933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153687 0.046070 111.87 <2e-16 ***
## pf_expression_control 0.349862 0.008067 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.667 on 1376 degrees of freedom
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1881 on 1 and 1376 DF, p-value: < 2.2e-16
ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
Ran into an issue with this. Initially I looked for a score of exactly 6.7 and there wasn’t one, so ended up using slice_min to find the countries that most closely matched 6.7.
predicted <- predict(m1, newdata = data.frame(pf_expression_control = 6.7))
# Find the actual observation closest to 6.7
closest <- hfi %>%
filter(!is.na(pf_expression_control), !is.na(pf_score)) %>%
mutate(diff = abs(pf_expression_control - 6.7)) %>%
slice_min(order_by = diff, n = 1)
actual <- closest$pf_score
residual <- actual - predicted
data.frame(
pf_expression_control = closest$pf_expression_control,
actual = actual,
predicted = predicted,
residual = residual
)
## pf_expression_control actual predicted residual
## 1 6.75 7.430864 7.909663 -0.4787990
## 2 6.75 8.216035 7.909663 0.3063725
## 3 6.75 8.766803 7.909663 0.8571401
## 4 6.75 7.872178 7.909663 -0.0374850
## 5 6.75 7.394916 7.909663 -0.5147473
## 6 6.75 7.254949 7.909663 -0.6547140
## 7 6.75 7.788563 7.909663 -0.1210998
## 8 6.75 8.265002 7.909663 0.3553392
## 9 6.75 7.194145 7.909663 -0.7155183
## 10 6.75 7.750407 7.909663 -0.1592554
## 11 6.75 7.314257 7.909663 -0.5954059
## 12 6.75 7.633296 7.909663 -0.2763669
## 13 6.75 8.557804 7.909663 0.6481411
## 14 6.75 6.895944 7.909663 -1.0137186
## 15 6.75 7.285546 7.909663 -0.6241173
## 16 6.75 8.528692 7.909663 0.6190295
## 17 6.75 8.663341 7.909663 0.7536782
## 18 6.75 9.065782 7.909663 1.1561189
## 19 6.75 8.715524 7.909663 0.8058613
## 20 6.75 7.317269 7.909663 -0.5923935
## 21 6.75 8.652734 7.909663 0.7430714
## 22 6.75 7.194589 7.909663 -0.7150734
## 23 6.75 8.762426 7.909663 0.8527631
## 24 6.75 8.789731 7.909663 0.8800682
## 25 6.75 7.579833 7.909663 -0.3298301
## 26 6.75 9.172400 7.909663 1.2627372
## 27 6.75 8.406205 7.909663 0.4965419
## 28 6.75 8.628272 7.909663 0.7186091
## 29 6.75 7.335984 7.909663 -0.5736787
## 30 6.75 8.862743 7.909663 0.9530806
## 31 6.75 6.502495 7.909663 -1.4071682
## 32 6.75 8.632398 7.909663 0.7227354
## 33 6.75 6.954978 7.909663 -0.9546847
## 34 6.75 8.828608 7.909663 0.9189450
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
##Exercise 7 - Is there any apparent pattern in the residuals plot? What
does this indicate about the linearity of the relationship between the
two variables?
The residuals plot does not show any clear curved or funnel-shaped pattern, and the residuals appear to be randomly scattered around the horizontal line at zero.
This suggests that the linearity assumption is reasonable. The relationship between pf_expression_control and pf_score appears to be appropriately modeled with a linear function.
If the residuals had shown a systematic curve or increasing/decreasing spread, that would indicate non-linearity. But in this case, we don’t see those issues, so a linear model is justified.
ggplot(data = m1, aes(x = .resid)) +
geom_histogram(binwidth = 25) +
xlab("Residuals")
ggplot(data = m1, aes(sample = .resid)) +
stat_qq()
## Exercise 8 - Based on the histogram and the normal probability plot,
does the nearly normal residuals condition appear to be met?
Yes, the nearly normal residuals condition appears to be met.
While the histogram isn’t informative due to bin size (the entire thing is one big block), the Q-Q plot shows that the residuals follow a roughly straight line, indicating approximate normality. There is some deviation at the extremes, but the center of the distribution aligns well with the expected normal distribution. Using a linear model with these residuals is appropriate.
The residuals vs. fitted values plot shows that the spread of the residuals remains fairly constant across the range of predicted values. There is no clear fan shape or increasing/decreasing pattern in the variance.
So, the constant variability condition appears to be met, supporting the use of a linear regression model for this relationship.
For this, we are going to look at the religion freedom variable: pf_religion and the predictor variable will be the total human freedom score, hf_score
hfi_clean <- hfi %>%
filter(!is.na(pf_religion), !is.na(hf_score))
ggplot(hfi_clean, aes(x = hf_score, y = pf_religion)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
labs(
title = "Freedom of Religion vs Human Freedom Score",
x = "Human Freedom Score",
y = "Freedom of Religion"
)
m3 <- lm(pf_religion ~ hf_score, data = hfi)
summary(m3)
##
## Call:
## lm(formula = pf_religion ~ hf_score, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2548 -0.6808 0.2069 0.8671 2.8577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2787 0.2343 18.26 <2e-16 ***
## hf_score 0.5134 0.0331 15.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.244 on 1366 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.1497, Adjusted R-squared: 0.1491
## F-statistic: 240.6 on 1 and 1366 DF, p-value: < 2.2e-16
This suggests that pf_expression_control is a much stronger predictor of pf_score than hf_score (overall human freedom) is of pf_religion (freedom of religion). This makes sense conceptually — pf_expression_control is a direct measure of media/political restrictions, which likely impacts personal freedoms more directly than broader freedom indices influence specific rights like religious freedom.
I was most surprised that the relationship between hf_score (overall human freedom) and pf_religion (freedom of religion) was relatively weak. I expected a much stronger correlation between overall freedom and religious freedom.
Below are the model diagnostics for this regression model:
Overall, the model fits acceptably, but the low explanatory power was unexpected.
par(mfrow = c(2, 2))
plot(m3)
par(mfrow = c(1, 1))