Introduction to Linear Regression

The Human Freedom Index is an annual report co-published by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom. It summarizes “freedom” across many countries and gauges the relationships between different forms of freedom (political, religious, economic, and personal) and other social and economic circumstances.

This lab examines data from the 2008 to 2016 reports to identify variables that help explain differences in freedom scores.

Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data('hfi', package='openintro')

Exploring the Dataset’s Dimensions

dim(hfi)
## [1] 1458  123

Visualization

A scatter plot is used to explore the relationship between the personal freedom score (pf_score) and political pressure on expression (pf_expression_control).

ggplot(hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 80 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 80 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot helps show whether a linear relationship exists. A linear model may be appropriate if the points fall approximately along a straight line.

Quantifying Relationships with Correlation

hfi %>%
  summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))
## # A tibble: 1 × 1
##   `cor(pf_expression_control, pf_score, use = "complete.obs")`
##                                                          <dbl>
## 1                                                        0.796

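The correlation of 0.796 indicates a fairly strong positive linear relationship. In general, a correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one.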
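As a rough sketch of what cor() computes, the Pearson correlation can be reproduced from its definition. The vectors x and y below are made-up illustrative values, not hfi data.

```r
# Hypothetical toy data (not taken from hfi), used only to illustrate the formula.
x <- c(2, 4, 5, 7, 9)
y <- c(5.1, 6.0, 6.8, 8.1, 8.9)

# Pearson correlation: the sum of cross-products of deviations,
# scaled by the square root of the product of the sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function should agree with the manual calculation.
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)
```

The same calculation applies to pf_expression_control and pf_score once rows with missing values are dropped, which is what use = "complete.obs" does.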

Visual Inspection

ggplot(hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point(alpha = 0.5) +
  labs(x = "Political Pressure on Expression",
       y = "Personal Freedom Score",
       title = "Relationship Between Political Pressure and Personal Freedom")
## Warning: Removed 80 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot shows a clear upward trend: as political pressure decreases (higher values on the pf_expression_control scale), the personal freedom score increases.

Numerical Inspection

Sum of Squared Residuals

# Install and load the DATA606 package, which provides plot_ss()
devtools::install_github("jbryer/DATA606")
## Skipping install of 'DATA606' from a github remote, the SHA1 (96507a85) has not changed since last install.
##   Use `force = TRUE` to force installation
library(DATA606)
## Loading required package: shiny
## Loading required package: markdown
## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 4th Edition. You can read this by typing 
## vignette('os4') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
## 
## Attaching package: 'DATA606'
## The following objects are masked from 'package:openintro':
## 
##     calc_streak, present, qqnormsim
## The following object is masked from 'package:utils':
## 
##     demo
# Keep only rows where both variables are observed, then open the interactive tool
hfi <- hfi %>% filter(complete.cases(pf_expression_control, pf_score))
DATA606::plot_ss(x = hfi$pf_expression_control, y = hfi$pf_score)

## Click two points to make a line.                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.6171       0.4914  
## 
## Sum of Squares:  952.153

The smallest sum of squared residuals obtained with the interactive plot_ss tool was 952.153, from the line with intercept 4.6171 and slope 0.4914. For a fixed dataset, the closer the sum of squares is to 0, the better the line fits the points, so this is the best-fitting line among those tried. The value itself depends on the scale of the response and the number of observations, so it is useful for comparing candidate lines on the same data rather than as an absolute measure of fit.
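The idea behind plot_ss can be sketched in a few lines of base R. The helper sum_sq_resid() and the toy vectors are hypothetical and only for illustration; they are not part of DATA606 or hfi.

```r
# Hypothetical toy data (not taken from hfi).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(4.9, 5.8, 6.1, 6.9, 7.2, 7.8)

# Sum of squared residuals for a candidate line y = b0 + b1 * x.
sum_sq_resid <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Least squares coefficients from lm().
fit <- lm(y ~ x)
b <- unname(coef(fit))
ss_best <- sum_sq_resid(b[1], b[2], x, y)

# A line with a slightly different slope gives a larger sum of squares.
ss_other <- sum_sq_resid(b[1], b[2] + 0.1, x, y)
c(least_squares = ss_best, perturbed = ss_other)
```

Any change to the intercept or slope away from the lm() estimates increases the sum of squares, which is why the least squares line is the target when choosing a line by eye.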

Fitting Linear Model

Fitting the least squares regression model

m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(m1)
## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16
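Two quantities in this summary can be tied back to earlier results. The sketch below uses only numbers printed in this document and assumes the standard relationships for simple linear regression: the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom, and R-squared equals the squared correlation.

```r
# Values copied from the outputs shown in this document.
rse      <- 0.8318   # residual standard error from summary(m1)
df_resid <- 1376     # residual degrees of freedom
r        <- 0.796    # correlation reported earlier
r2       <- 0.6342   # Multiple R-squared from summary(m1)

# Residual standard error is sqrt(RSS / df), so RSS = rse^2 * df.
# This should be close to the 952.153 reported by plot_ss.
rss_from_summary <- rse^2 * df_resid

# In simple linear regression, R-squared equals the squared correlation.
r_squared_from_cor <- r^2

c(rss = rss_from_summary, r_squared = r_squared_from_cor)
```

Small differences from the printed values are expected because the inputs are rounded.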

Visualize

Plotting with the Regression Line

ggplot(hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

This plot shows the observed data together with the regression line from the model.

Prediction and Residual Calculation

predict(m1, newdata = data.frame(pf_expression_control = 6.7))
##        1 
## 7.909663
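The prediction can be reproduced by hand from the fitted coefficients, and a residual is then the observed value minus the predicted value. In the sketch below the coefficients are copied from summary(m1); the observed score is a hypothetical number chosen only to show the arithmetic, not a row from hfi.

```r
# Coefficients copied from summary(m1) above.
b0 <- 4.61707
b1 <- 0.49143

# Predicted personal freedom score at pf_expression_control = 6.7.
x_new <- 6.7
y_hat <- b0 + b1 * x_new   # agrees with predict() up to rounding of the coefficients

# Residual = observed - predicted. The observed value is hypothetical.
y_observed_hypothetical <- 8.5
residual <- y_observed_hypothetical - y_hat
c(predicted = y_hat, residual = residual)
```

A positive residual means the line underestimates the observed score; a negative residual means it overestimates.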

Model Diagnostics

Checking Linearity

ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
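When reading a residuals-versus-fitted plot, it may help to know that least squares residuals from a model with an intercept always average zero and are uncorrelated with the fitted values. The plot therefore cannot show an overall linear trend; what it can reveal is curvature or changing spread. The sketch below checks these two properties on simulated data (not hfi).

```r
# Simulated data for illustration only.
set.seed(1)
x <- runif(100, 0, 10)
y <- 4 + 0.5 * x + rnorm(100, sd = 0.8)

fit <- lm(y ~ x)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Both quantities are zero up to floating-point error for any least squares fit
# that includes an intercept.
mean_resid <- mean(res)
cor_resid_fitted <- cor(res, fitted_vals)
c(mean_resid = mean_resid, cor_resid_fitted = cor_resid_fitted)
```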

Normality of Residuals Confirmation

ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25)

ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()

More Practice

Visualizing the relationship between payroll tax rate and personal freedom score

ggplot(data = hfi, aes(x = ef_government_tax_payroll , y = pf_score)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Scatterplot of Payroll Tax vs Personal Freedom Score",
    x = "Payroll Tax Rate",
    y = "Personal Freedom Score"
  )
## Warning: Removed 113 rows containing missing values or values outside the scale range
## (`geom_point()`).

Each plotted point represents one country in one year.

Fitting a Linear Model

Fitting a linear model to quantify the relationship

m_tax <- lm(pf_score ~ ef_government_tax_payroll , data = hfi)
summary(m_tax)
## 
## Call:
## lm(formula = pf_score ~ ef_government_tax_payroll, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8324 -0.8937  0.0953  1.0816  2.5947 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.21696    0.08518   96.47   <2e-16 ***
## ef_government_tax_payroll -0.17458    0.01417  -12.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.317 on 1263 degrees of freedom
##   (113 observations deleted due to missingness)
## Multiple R-squared:  0.1073, Adjusted R-squared:  0.1066 
## F-statistic: 151.9 on 1 and 1263 DF,  p-value: < 2.2e-16
lm(pf_score ~ ef_government_tax_payroll, data = hfi)
## 
## Call:
## lm(formula = pf_score ~ ef_government_tax_payroll, data = hfi)
## 
## Coefficients:
##               (Intercept)  ef_government_tax_payroll  
##                    8.2170                    -0.1746
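The t values in both summary tables are the estimate divided by its standard error. As a quick check, the sketch below recomputes them for the two slopes, using numbers copied from the outputs in this document.

```r
# Estimates and standard errors copied from the two summary() tables.
slope_expr <- 0.49143
se_expr    <- 0.01006
slope_tax  <- -0.17458
se_tax     <- 0.01417

# t value = estimate / standard error.
t_expr <- slope_expr / se_expr
t_tax  <- slope_tax / se_tax

round(c(t_expr = t_expr, t_tax = t_tax), 2)
```

Both slopes are many standard errors away from zero, which is why both p-values are tiny even though the payroll tax model explains far less of the variation in pf_score.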

Conclusion

Using data from the Human Freedom Index, this lab showed how to apply linear regression to investigate the link between various policy elements and personal freedom. The findings indicate:

With an R² of 0.63 (correlation 0.80), there is a strong positive linear association between pf_expression_control and the personal freedom score.

Payroll tax (ef_government_tax_payroll) has a much weaker, though statistically significant, negative association with personal freedom (R² = 0.11).

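These results suggest that personal freedom is more strongly associated with political pressure on expression than with payroll taxes. Because the data are observational, the models describe association rather than causation. The residuals-versus-fitted plot, histogram, and normal Q-Q plot for the first model provide visual checks of the linearity and normality assumptions.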