June 10, 2024

Introduction

Simple linear regression is a powerful tool for predicting a dependent variable (Y) from a single independent variable (X).

The simple linear regression model can be represented as:

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

Where:
\(Y\) is the dependent variable (the observed response).
\(\beta_0\) is the intercept of the regression line.
\(\beta_1\) is the slope of the regression line.
\(X\) is the independent variable.
\(\varepsilon\) is the error term.
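
To make the roles of these terms concrete, the sketch below simulates data from the model and recovers the coefficients two ways: with the closed-form least-squares formulas and with lm(). The data are simulated, and the true values 2 and 0.5 are arbitrary choices for illustration, not estimates from the happiness dataset.

```r
# Simulated data: the "true" intercept (2) and slope (0.5) are assumptions
# chosen only for this illustration.
set.seed(42)
n   <- 500
x   <- runif(n, min = 0, max = 10)   # independent variable X
eps <- rnorm(n, mean = 0, sd = 1)    # error term epsilon
y   <- 2 + 0.5 * x + eps             # Y = beta_0 + beta_1 * X + epsilon

# Closed-form least-squares estimates
beta1_hat <- cov(x, y) / var(x)               # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)    # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)
```

Both approaches should give the same numbers up to floating-point error, and with this much data the estimates should land close to the values used in the simulation.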

Example: the relationship between average income and happiness for 111 countries. The dataset can be found on Kaggle.


Plotly Scatter plot with fitted line

Code that produces the scatter plot in the next slide

The code below will render a ggplot2 scatter plot with a regression line showing the correlation between the two variables.

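# Scatter plot with regression line
# Assumes `data` is the data frame loaded from the Kaggle file,
# with columns avg_income and happyScore.
library(ggplot2)

g <- ggplot(data, aes(x = avg_income, y = happyScore)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "gold") +  # least-squares line, no confidence band
  labs(title = "Scatter Plot with Regression Line",
       x = "Average Income",
       y = "Happy Score")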

Scatter plot with regression line


From the chart, we observe that once the monthly income exceeds $15,000, 99% of the data points have a happyScore above 6.5.

Code that produces the following ggplot2 residual plot

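# Residual plot
# Assumed model fit (not shown in the original slides):
model <- lm(happyScore ~ avg_income, data = data)

# ggplot2 converts the lm object to a data frame with .fitted and .resid columns
r <- ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # zero-residual reference line
  labs(title = "Residual Plot",
       x = "Fitted Values",
       y = "Residuals")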

Residual Plot

By examining the residuals, we can evaluate the model’s fit quality and detect any potential outliers or trends in the data that the model may have missed.
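
As a rough numeric companion to the residual plot, the sketch below fits a model and flags points whose standardized residual exceeds 2 in absolute value. The data frame `sim` is simulated for illustration (it only borrows the column names avg_income and happyScore), and the cutoff of 2 is a common rule of thumb rather than a fixed rule.

```r
# Simulated stand-in data; values are NOT from the Kaggle dataset.
set.seed(1)
sim <- data.frame(avg_income = runif(100, min = 1000, max = 25000))
sim$happyScore <- 3 + 0.00015 * sim$avg_income + rnorm(100, sd = 0.7)

sim_model <- lm(happyScore ~ avg_income, data = sim)

# Raw residuals: observed minus fitted values.
# With an intercept in the model they sum to zero (up to rounding).
res <- resid(sim_model)
sum(res)

# Standardized residuals put every point on a comparable scale.
std_res  <- rstandard(sim_model)
outliers <- which(abs(std_res) > 2)   # rule-of-thumb cutoff (assumption)
sim[outliers, ]
```

Points flagged this way are candidates for a closer look, not automatic exclusions. A curved or fanning pattern in the residual plot is a separate warning sign that this check does not detect.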

Interpretation of Results

The fitted regression model can be used to predict happiness scores based on average income. The summary of the model provides coefficients for the intercept and slope, which can be interpreted as follows:

Intercept (\(\beta_0\)): The expected happiness score when average income is zero.
Slope (\(\beta_1\)): The expected change in happiness score for each additional unit of average income.
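
The interpretation above can be checked directly in R. The sketch below is a minimal illustration on simulated data (the data frame `toy` and its values are assumptions, not the Kaggle dataset): it extracts both coefficients and shows that raising income by 1,000 units changes the predicted score by 1,000 times the slope.

```r
# Simulated stand-in data; the numbers are arbitrary and for illustration only.
set.seed(7)
toy <- data.frame(avg_income = seq(1000, 20000, length.out = 50))
toy$happyScore <- 3.5 + 0.0002 * toy$avg_income + rnorm(50, sd = 0.5)

toy_fit <- lm(happyScore ~ avg_income, data = toy)

b0 <- coef(toy_fit)[["(Intercept)"]]   # expected score at avg_income = 0
b1 <- coef(toy_fit)[["avg_income"]]    # expected change per unit of income

# Predictions at two income levels that differ by 1000 units
new_income <- data.frame(avg_income = c(5000, 6000))
pred <- predict(toy_fit, newdata = new_income)

diff(pred)    # difference between the two predictions
1000 * b1     # the same quantity computed from the slope
```

Note that the intercept describes an income of zero, which may lie outside the observed range, so it is often better read as an anchor for the line than as a realistic prediction.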

Understanding the model’s coefficients and assessing its fit are crucial steps in the analysis.