2025-03-16

Mathematical Overview and Goal

The simple linear regression model, written in slope-intercept form with an error term, is:

\(y = \beta_0 + \beta_1\cdot x + \varepsilon\)

The residual for observation \(i\) is the observed value minus the fitted value:

\(e_{i} = y_{i} - \hat{y}_{i}\)
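As a minimal sketch of how the two equations above relate, the residuals can be rebuilt by hand from the estimated coefficients and compared with R's own extractor. It assumes the built-in `swiss` data set and uses Education as the single predictor; the names `fit`, `b`, `y_hat`, and `e` are illustrative only.

```r
# Fit a one-predictor model on the built-in swiss data
fit <- lm(Fertility ~ Education, data = swiss)

# Estimated intercept (beta_0) and slope (beta_1)
b <- coef(fit)

# Fitted values from the model equation: y_hat = beta_0 + beta_1 * x
y_hat <- b[["(Intercept)"]] + b[["Education"]] * swiss$Education

# Residuals from the definition: e_i = y_i - y_hat_i
e <- swiss$Fertility - y_hat

# Should agree with residuals() up to floating-point tolerance
matches <- isTRUE(all.equal(e, unname(residuals(fit))))
print(matches)
```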

Goal: Explore simple linear regression in R and determine which of two predictors in the “swiss” data set is the better indicator of Fertility: Education (% of draftees with education beyond primary school) or Agriculture (% of males involved in agriculture as occupation).

Computational Overview

#Setting up Linear Model
education_model = lm(Fertility ~ Education, data=swiss)
agriculture_model = lm(Fertility ~  Agriculture, data=swiss)

#Determining Residuals
education_residuals = residuals(education_model)
agriculture_residuals = residuals(agriculture_model)

#Determining adjusted R^2 values
edj_adj = summary(education_model)$adj.r.squared
agr_adj = summary(agriculture_model)$adj.r.squared

#Building data frames for the Observed v. Predicted plots
educational_ovp = data.frame(observed = swiss$Fertility,
                             predicted = fitted(education_model))
agricultural_ovp = data.frame(agr_observed = swiss$Fertility,
                              agr_predicted = fitted(agriculture_model))

Dataset

“Standardized fertility measure and socioeconomic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.” - R docs

##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

Educational Prediction Model

Residuals of Educational Model

## [1] "Mean of Residuals: -7.17658792912618e-16 & stdev: 9.34279035686924"

Agricultural Prediction Model

Residuals of Agricultural Model

## [1] "Mean of Residuals: 7.09022084542861e-16 & stdev: 11.6871499422087"
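Both residual means are zero up to floating-point error, which is expected for any least-squares fit that includes an intercept, so the standard deviation is the informative number here. The summary lines above could be produced along these lines; this is a sketch assuming the built-in `swiss` data set, and the helper `residual_summary` is a hypothetical name rather than code from the original slides.

```r
# Hypothetical helper: report the mean and standard deviation of a model's residuals
residual_summary <- function(model) {
  r <- residuals(model)
  paste("Mean of Residuals:", mean(r), "& stdev:", sd(r))
}

education_model <- lm(Fertility ~ Education, data = swiss)
agriculture_model <- lm(Fertility ~ Agriculture, data = swiss)

print(residual_summary(education_model))
print(residual_summary(agriculture_model))
```

The smaller spread of the Education residuals means its fitted line sits closer to the observed Fertility values.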

Observed v. Predicted for Educational and Agricultural Models

Code for Previous Slide

# plotly attaches ggplot2, which provides ggplot() and the geoms used below
suppressPackageStartupMessages(library(plotly))
educational_ovp = data.frame(observed = swiss$Fertility,
                             predicted = fitted(education_model))
agricultural_ovp = data.frame(agr_observed = swiss$Fertility,
                              agr_predicted = fitted(agriculture_model))
# Each "+" must end a line so that R continues the expression
plot1 = ggplot(educational_ovp, aes(x = predicted, y = observed)) +
  geom_point(color = "blue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Observed v. Predicted for Educational")
plot2 = ggplot(agricultural_ovp, aes(x = agr_predicted, y = agr_observed)) +
  geom_point(color = "blue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Observed v. Predicted for Agricultural")
# Convert to interactive plots and show them side by side
plot1 = ggplotly(plot1)
plot2 = ggplotly(plot2)
htmltools::browsable(subplot(plot1, plot2, nrows = 1, titleX = TRUE, titleY = TRUE))

Comparison

We can compare the two models using adjusted \(R^2\), where \(n\) is the number of observations (47 here) and \(k\) is the number of predictors (1 in each model): \(\overline{R}^2 = 1 - \left( \frac{n-1}{n-(k+1)} \right) (1 - R^2)\)

We find the following two values for the Educational and Agricultural Models respectively.

## [1] "Adjusted r^2 of educational model: 0.428184883317788"
## [1] "Adjusted r^2 of agricultural model: 0.105213019014765"
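The adjusted \(R^2\) formula can be checked by hand against the value that `summary()` reports. The sketch below assumes the built-in `swiss` data set; the helper `adj_r2_manual` is a hypothetical name introduced only for illustration.

```r
# Hypothetical helper: adjusted R^2 from the formula 1 - ((n-1)/(n-(k+1))) * (1 - R^2)
adj_r2_manual <- function(model) {
  n <- nobs(model)                 # number of observations
  k <- length(coef(model)) - 1     # number of predictors (excluding the intercept)
  r2 <- summary(model)$r.squared
  1 - ((n - 1) / (n - (k + 1))) * (1 - r2)
}

education_model <- lm(Fertility ~ Education, data = swiss)
agriculture_model <- lm(Fertility ~ Agriculture, data = swiss)

# Each comparison should be TRUE up to floating-point tolerance
edu_ok <- isTRUE(all.equal(adj_r2_manual(education_model),
                           summary(education_model)$adj.r.squared))
agr_ok <- isTRUE(all.equal(adj_r2_manual(agriculture_model),
                           summary(agriculture_model)$adj.r.squared))
print(c(edu_ok, agr_ok))
```

Because both models have the same \(n\) and \(k\), the adjustment shrinks each \(R^2\) by the same factor, so the ranking of the two models is the same as it would be under plain \(R^2\).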

From these values and the Observed v. Predicted plots, we conclude that Education is the stronger single predictor of Fertility in this data set: its model has an adjusted \(R^2\) of about 0.43, compared with about 0.11 for Agriculture.