2025-02-04

Linear Regression

  • Simple Linear Regression is a statistical model that can be used to find linear relationships between two variables in a given data set.

  • Linear Regression models can be a powerful analytical tool, allowing researchers to predict potential future outcomes.

  • A typical simple linear regression is modeled by the equation \[ Y = \beta_0 + \beta_1 x + \varepsilon \], where \(\beta_0\) is the intercept and \(\beta_1\) is the slope. In terms of the dataset variables, this becomes \[ \text{Life Expectancy} = \beta_0 + \beta_1(\text{Year}) + \varepsilon \]

  • The linear regression model finds the line of best fit through the provided data points, where \(\varepsilon\) is the error term: the positive or negative vertical deviation of each observation from the regression line.

  • Using the “Life Expectancy Data” dataset, a linear regression model of life expectancy vs. year can be generated.
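
As a minimal sketch of how such a model is fit in R, the example below uses `lm()` on a small toy data frame. The values in `toy_data` are made up for illustration and are not taken from the “Life Expectancy Data” dataset; only the column names `Year` and `Life_expectancy` follow the model output shown in this deck.

```r
# Toy data for illustration only; these values are invented and are NOT
# taken from the "Life Expectancy Data" dataset.
toy_data <- data.frame(
  Year = c(2000, 2005, 2010, 2015),
  Life_expectancy = c(66, 69, 70, 73)
)

# Fit Life_expectancy = beta_0 + beta_1 * Year + error by least squares.
toy_model <- lm(Life_expectancy ~ Year, data = toy_data)

# Estimated intercept (beta_0) and slope (beta_1).
toy_coef <- coef(toy_model)
print(toy_coef)

# Residuals estimate the error term: observed minus fitted values.
toy_resid <- residuals(toy_model)
print(toy_resid)
```

Because the model includes an intercept, the least-squares residuals sum to zero (up to floating-point error), which is one quick sanity check on a fitted model.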

Linear Regression Life Expectancy vs. Year

  • The derived linear regression model can be seen below:
 
Call:
lm(formula = Life_expectancy ~ Year, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.803  -6.504   2.852   6.295  21.004 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -635.87153   75.54552  -8.417   <2e-16 ***
Year           0.35123    0.03763   9.333   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.387 on 2926 degrees of freedom
Multiple R-squared:  0.02891,   Adjusted R-squared:  0.02858 
F-statistic: 87.11 on 1 and 2926 DF,  p-value: < 2.2e-16
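
To make the coefficients concrete, the sketch below plugs the rounded estimates from the summary above into the fitted line. The helper name `predict_life_expectancy` and the years 2000 and 2015 are chosen here for illustration only; because the coefficients are rounded, the results are approximate.

```r
# Coefficients copied from the summary output above (rounded as printed).
intercept <- -635.87153
slope_year <- 0.35123

# Fitted line: predicted life expectancy (in years) for a calendar year.
# `predict_life_expectancy` is a hypothetical helper for this example.
predict_life_expectancy <- function(year) intercept + slope_year * year

pred_2000 <- predict_life_expectancy(2000)  # about 66.59
pred_2015 <- predict_life_expectancy(2015)  # about 71.86

# Expected change in life expectancy over one decade.
change_per_decade <- slope_year * 10        # about 3.51

print(c(pred_2000, pred_2015, change_per_decade))
```

The slope is clearly different from zero (p < 2e-16), but the Multiple R-squared of 0.02891 means that year alone accounts for only about 3% of the variation in life expectancy.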
 

Linear Regression Plot

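  • Life expectancy varies widely between countries in any given year, but the fitted line trends slightly upward: the Year coefficient is about 0.35, meaning roughly 0.35 additional years of life expectancy per calendar year. The low R-squared (about 0.029) confirms that year alone explains little of the country-to-country variation.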

Modified Linear Regression Plot

  • An alternate perspective on this data can be obtained by averaging life expectancy across all included countries for each year and plotting the trend.
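
The yearly averaging step can be sketched as follows. This example uses base R's `aggregate()` on an invented toy data frame; the deck's actual code is not shown, so the function choice, the `Country` column, and all values are assumptions. The final trend fit is likewise an optional illustration rather than a step the deck confirms.

```r
# Toy data for illustration only; values are invented and do not come
# from the "Life Expectancy Data" dataset.
toy_data <- data.frame(
  Country = c("A", "B", "A", "B", "A", "B"),
  Year = c(2000, 2000, 2001, 2001, 2002, 2002),
  Life_expectancy = c(60, 80, 62, 82, 66, 84)
)

# Mean life expectancy across countries for each year.
# The formula interface drops rows with missing values by default.
yearly_mean <- aggregate(Life_expectancy ~ Year, data = toy_data, FUN = mean)
print(yearly_mean)

# Optional: fit a trend line through the yearly means.
trend_model <- lm(Life_expectancy ~ Year, data = yearly_mean)
print(coef(trend_model))
```

Averaging removes the between-country spread, so a line through the yearly means typically fits far more tightly than a line through the raw points, even though the underlying trend is the same.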

Multiple Linear Regression

  • The Multiple Linear Regression Model extends simple linear regression to more than one predictor, allowing researchers to study how several variables jointly relate to a response.
  • Each coefficient estimates the effect of its predictor while the other predictors are held fixed.
  • Adding the “GDP” variable into the Life Expectancy vs. Year regression model can provide more insights into expected Life Expectancy.
  • The updated formula for the Multiple Linear Regression Model with respect to the dataset variables is \[ \text{Life Expectancy} = \beta_0 + \beta_1(\text{GDP}) + \beta_2(\text{Year}) + \varepsilon \]

3 Variable Linear Regression Model

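library(dplyr)  # provides %>% and filter()

# Keep only rows where all three model variables are present
expectancy_data <- Life_Expectancy_Data %>%
  filter(!is.na(GDP), !is.na(Life_expectancy), !is.na(Year))

# Fit life expectancy on GDP and year
multiple_model <- lm(Life_expectancy ~ GDP + Year, data = expectancy_data)

# Capture the printed summary and emit it as styled HTML for the slide
regression2 <- capture.output(summary(multiple_model))
cat("<pre style='font-size:12px; position:relative; top: -50px;'>", paste(regression2, collapse = "\n"), "</pre>")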
 
Call:
lm(formula = Life_expectancy ~ GDP + Year, data = expectancy_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.606  -5.643   2.045   5.845  21.899 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.217e+02  7.468e+01  -5.647 1.82e-08 ***
GDP          3.036e-04  1.199e-05  25.326  < 2e-16 ***
Year         2.435e-01  3.720e-02   6.544 7.23e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.488 on 2482 degrees of freedom
Multiple R-squared:  0.2263,    Adjusted R-squared:  0.2257 
F-statistic:   363 on 2 and 2482 DF,  p-value: < 2.2e-16
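
As a rough guide to reading these coefficients, the sketch below computes the change in predicted life expectancy implied by each predictor while the other is held fixed. The coefficients are copied from the summary above as printed (four significant figures), and the increments of 10,000 GDP units and 10 years are arbitrary choices for illustration. GDP is in whatever units the dataset records.

```r
# Coefficients copied from the summary output above (rounded as printed).
b_gdp <- 3.036e-04
b_year <- 2.435e-01

# Change in predicted life expectancy for a 10,000-unit increase in GDP,
# holding Year fixed.
gdp_effect <- b_gdp * 10000   # about 3.04 years

# Change in predicted life expectancy over 10 calendar years,
# holding GDP fixed.
year_effect <- b_year * 10    # about 2.44 years

print(c(gdp_effect, year_effect))
```

Adding GDP raises the Multiple R-squared from 0.02891 to 0.2263, and the Year coefficient falls from about 0.35 to about 0.24 once GDP is accounted for. The two models are fit on different numbers of rows (2926 vs. 2482 residual degrees of freedom), presumably because rows with missing GDP are filtered out, so the comparison is not strictly like-for-like.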
 

3 Variable Plot

  • The 3D plot allows Year, GDP, and Life Expectancy to be compared simultaneously. The blue data points are from “Developed” countries, and the green data points are from “Undeveloped” countries. Coloring by development status makes it possible to compare the two groups directly across all three variables.