Lab 8

Load Libraries/Packages

library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
data('hfi', package='openintro')

Exercise 1 - What are the dimensions of the dataset?

# Get the dimensions of the evals dataset
dim(evals)

## [1] 463  23

As we can see, we have 463 rows and 23 columns in this dataset.

Exercise 2 - What type of plot would you use?

# Scatterplot of pf_score vs pf_expression_control using the HFI dataset
ggplot(hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Personal Freedom Score vs Expression Control",
       x = "Expression Control (0 = most control, 10 = least)",
       y = "Personal Freedom Score")

The scatterplot illustrates a strong positive linear relationship between pf_expression_control and pf_score. As expression control becomes less strict, personal freedom scores tend to rise. The relationship appears fairly linear, with a consistent upward trend. While there is some spread, especially at mid-levels of control, the linear model seems appropriate for predicting personal freedom from expression control.

hfi %>%
  summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))

## # A tibble: 1 × 1
##   `cor(pf_expression_control, pf_score, use = "complete.obs")`
##                                                          <dbl>
## 1                                                        0.796

Exercise 3 - Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

The relationship between pf_expression_control and pf_score appears to be linear in form, meaning the data points roughly follow a straight-line pattern.

The direction of the relationship is positive: as pf_expression_control increases (i.e., media and expression are less controlled), the pf_score also increases, indicating higher levels of personal freedom.

The strength of the relationship is moderate to strong, as evidenced by the .79 correlation coefficient, the clear upward trend and relatively narrow band of points around the regression line, especially in the mid-to-high range of expression control.

There are a few unusual observations (potential outliers). For instance, some countries with high expression control (values near 0 or 1) still have moderately high personal freedom scores (around 6–7), which could indicate countries that restrict media but are otherwise free in personal domains.

Overall, the pattern suggests that greater freedom of expression is strongly associated with higher overall personal freedom, and a linear model is likely appropriate for modeling this relationship.

# This will only work interactively (i.e. will not show in the knitted document)
hfi <- hfi %>% filter(complete.cases(pf_expression_control, pf_score))
DATA606::plot_ss(x = hfi$pf_expression_control, y = hfi$pf_score)

## Click two points to make a line.                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.6171       0.4914  
## 
## Sum of Squares:  952.153

DATA606::plot_ss(x = hfi$pf_expression_control, y = hfi$pf_score, showSquares = TRUE)

## Click two points to make a line.                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.6171       0.4914  
## 
## Sum of Squares:  952.153

Exercise 4 - Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

The smallest I can get is 952.153

m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(m1)

## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16

Exercise 5 - Fit a new model that uses pf_expression_control to predict hf_score, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?

# Fit model to predict hf_score from pf_expression_control
m2 <- lm(hf_score ~ pf_expression_control, data = hfi)
summary(m2)

## 
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6198 -0.4908  0.1031  0.4703  2.2933 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.153687   0.046070  111.87   <2e-16 ***
## pf_expression_control 0.349862   0.008067   43.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.667 on 1376 degrees of freedom
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5772 
## F-statistic:  1881 on 1 and 1376 DF,  p-value: < 2.2e-16

ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

Exercise 6 - If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom score for one with a 6.7 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Ran into an issue with this. Initially I looked for a score of exactly 6.7 and there wasn’t one, so ended up using slice_min to find the countries that most closely matched 6.7.

predicted <- predict(m1, newdata = data.frame(pf_expression_control = 6.7))

# Find the actual observation closest to 6.7
closest <- hfi %>%
  filter(!is.na(pf_expression_control), !is.na(pf_score)) %>%
  mutate(diff = abs(pf_expression_control - 6.7)) %>%
  slice_min(order_by = diff, n = 1)

actual <- closest$pf_score
residual <- actual - predicted

data.frame(
  pf_expression_control = closest$pf_expression_control,
  actual = actual,
  predicted = predicted,
  residual = residual
)

##    pf_expression_control   actual predicted   residual
## 1                   6.75 7.430864  7.909663 -0.4787990
## 2                   6.75 8.216035  7.909663  0.3063725
## 3                   6.75 8.766803  7.909663  0.8571401
## 4                   6.75 7.872178  7.909663 -0.0374850
## 5                   6.75 7.394916  7.909663 -0.5147473
## 6                   6.75 7.254949  7.909663 -0.6547140
## 7                   6.75 7.788563  7.909663 -0.1210998
## 8                   6.75 8.265002  7.909663  0.3553392
## 9                   6.75 7.194145  7.909663 -0.7155183
## 10                  6.75 7.750407  7.909663 -0.1592554
## 11                  6.75 7.314257  7.909663 -0.5954059
## 12                  6.75 7.633296  7.909663 -0.2763669
## 13                  6.75 8.557804  7.909663  0.6481411
## 14                  6.75 6.895944  7.909663 -1.0137186
## 15                  6.75 7.285546  7.909663 -0.6241173
## 16                  6.75 8.528692  7.909663  0.6190295
## 17                  6.75 8.663341  7.909663  0.7536782
## 18                  6.75 9.065782  7.909663  1.1561189
## 19                  6.75 8.715524  7.909663  0.8058613
## 20                  6.75 7.317269  7.909663 -0.5923935
## 21                  6.75 8.652734  7.909663  0.7430714
## 22                  6.75 7.194589  7.909663 -0.7150734
## 23                  6.75 8.762426  7.909663  0.8527631
## 24                  6.75 8.789731  7.909663  0.8800682
## 25                  6.75 7.579833  7.909663 -0.3298301
## 26                  6.75 9.172400  7.909663  1.2627372
## 27                  6.75 8.406205  7.909663  0.4965419
## 28                  6.75 8.628272  7.909663  0.7186091
## 29                  6.75 7.335984  7.909663 -0.5736787
## 30                  6.75 8.862743  7.909663  0.9530806
## 31                  6.75 6.502495  7.909663 -1.4071682
## 32                  6.75 8.632398  7.909663  0.7227354
## 33                  6.75 6.954978  7.909663 -0.9546847
## 34                  6.75 8.828608  7.909663  0.9189450

ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

##Exercise 7 - Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?

The residuals plot does not show any clear curved or funnel-shaped pattern, and the residuals appear to be randomly scattered around the horizontal line at zero.

This suggests that the linearity assumption is reasonable. The relationship between pf_expression_control and pf_score appears to be appropriately modeled with a linear function.

If the residuals had shown a systematic curve or increasing/decreasing spread, that would indicate non-linearity. But in this case, we don’t see those issues, so a linear model is justified.

ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 25) +
  xlab("Residuals")

ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()

## Exercise 8 - Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Yes, the nearly normal residuals condition appears to be met.

While the histogram isn’t informative due to bin size (the entire thing is one big block), the Q-Q plot shows that the residuals follow a roughly straight line, indicating approximate normality. There is some deviation at the extremes, but the center of the distribution aligns well with the expected normal distribution. Using a linear model with these residuals is appropriate.

Exercise 9 - Based on the residuals vs. fitted plot, does the constant variability condition appear to be met?

The residuals vs. fitted values plot shows that the spread of the residuals remains fairly constant across the range of predicted values. There is no clear fan shape or increasing/decreasing pattern in the variance.

So, the constant variability condition appears to be met, supporting the use of a linear regression model for this relationship.

MORE PRACTICE

Choose another freedom variable and a variable you think would strongly correlate with it. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

For this, we are going to look at the religion freedom variable: pf_religion and the predictor variable will be the total human freedom score, hf_score

hfi_clean <- hfi %>%
  filter(!is.na(pf_religion), !is.na(hf_score))

ggplot(hfi_clean, aes(x = hf_score, y = pf_religion)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  labs(
    title = "Freedom of Religion vs Human Freedom Score",
    x = "Human Freedom Score",
    y = "Freedom of Religion"
  )

How does this relationship compare to the relationship between pf_expression_control and pf_score? Use the R2 values from the two model summaries to compare. Does your independent variable seem to predict your dependent one better? Why or why not?

m3 <- lm(pf_religion ~ hf_score, data = hfi)
summary(m3)

## 
## Call:
## lm(formula = pf_religion ~ hf_score, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2548 -0.6808  0.2069  0.8671  2.8577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2787     0.2343   18.26   <2e-16 ***
## hf_score      0.5134     0.0331   15.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.244 on 1366 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.1497, Adjusted R-squared:  0.1491 
## F-statistic: 240.6 on 1 and 1366 DF,  p-value: < 2.2e-16

This suggests that pf_expression_control is a much stronger predictor of pf_score than hf_score (overall human freedom) is of pf_religion (freedom of religion). This makes sense conceptually — pf_expression_control is a direct measure of media/political restrictions, which likely impacts personal freedoms more directly than broader freedom indices influence specific rights like religious freedom.

What’s one freedom relationship you were most surprised about and why? Display the model diagnostics for the regression model analyzing this relationship.

I was most surprised that the relationship between hf_score (overall human freedom) and pf_religion (freedom of religion) was relatively weak. I expected a much stronger correlation between overall freedom and religious freedom.

Below are the model diagnostics for this regression model:

The residuals vs fitted plot shows a fairly random spread, supporting linearity.
The Q-Q plot shows some deviation from normality at the tails.
The Scale-Location plot indicates a relatively consistent spread, with mild non-linearity
The Residuals vs Leverage plot doesn’t indicate strong influence from any individual observation.

Overall, the model fits acceptably, but the low explanatory power was unexpected.

par(mfrow = c(2, 2))
plot(m3)

par(mfrow = c(1, 1))