2025-01-27

Introduction

Simple linear regression fits a straight line that describes how a dependent variable changes with a single independent variable.

The regression line can then be used to predict missing or future values.

The linear regression model assumes a linear relationship of: \(y = \beta_0 + \beta_1 x + \epsilon\)

Linear Regression

  • Dependent Variable: This is what we aim to predict from the model.
  • Independent Variable: This is the variable we use for prediction.
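  • Formula terms:
    • \(\beta_0\): This is our constant or intercept, the expected value of \(y\) when \(x = 0\).
    • \(\beta_1\): This is our slope, the expected change in \(y\) for a one-unit increase in \(x\).
    • \(x\): This is our independent variable.
    • \(\epsilon\): This is the error term, the deviation of an observed \(y\) from the regression line.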

Creating a Random Dataset

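We can create a random dataset with a known intercept (3), a known slope (2), and normally distributed noise:

set.seed(123)                                  # make the simulation reproducible
x <- rnorm(100, mean = 5, sd = 2)              # 100 draws of the independent variable
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)  # true line plus random error
data <- data.frame(x, y)
head(data)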
##          x         y
## 1 3.879049  9.337284
## 2 4.539645 12.593057
## 3 8.117417 18.741449
## 4 5.141017 12.586948
## 5 5.258575 11.613914
## 6 8.430130 19.770204

Plotly 3D Plot

We can plot our dataset in a 3D plotly plot:
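The code and figure for this slide are not reproduced in the text. The following is a minimal sketch of how such a plot might be built with the plotly package. Since the dataset has only two variables, the third axis here is the observation index, which is an assumption made for illustration and not necessarily what the original slide showed.

```r
library(plotly)

# Recreate the simulated dataset so the block runs on its own
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x, y)

# Assumption: use the observation index as the third axis
data$index <- seq_len(nrow(data))

fig <- plot_ly(
  data,
  x = ~x, y = ~y, z = ~index,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
) %>%
  layout(scene = list(
    xaxis = list(title = "x"),
    yaxis = list(title = "y"),
    zaxis = list(title = "Observation index")
  ))

fig
```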

Scatter Plot with Regression Line

We can create a 2D scatter plot using ggplot2, with a fitted regression line:
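The plotting code is not shown in the text. A minimal sketch that would produce this kind of figure is given here, assuming the ggplot2 package and a `geom_smooth()` layer with `method = "lm"`. The object name `scatter_plot` and the plot title are illustrative choices.

```r
library(ggplot2)

# Recreate the simulated dataset so the block runs on its own
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x, y)

scatter_plot <- ggplot(data, aes(x = x, y = y)) +
  geom_point() +                 # raw observations
  geom_smooth(method = "lm") +   # least-squares line with a confidence band
  labs(title = "Scatter plot with regression line", x = "x", y = "y")

scatter_plot
```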

## `geom_smooth()` using formula = 'y ~ x'

Residuals Plot
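The figure for this slide is not reproduced in the text. As a rough sketch, a residuals-versus-fitted plot can be drawn with base R graphics. The choice of base graphics and the axis labels are assumptions. If the linear model is adequate, the points should scatter around zero with no clear pattern.

```r
# Recreate the simulated dataset and fit the model so the block runs on its own
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x, y)
model <- lm(y ~ x, data = data)

# Residuals against fitted values, with a reference line at zero
plot(
  fitted(model), resid(model),
  xlab = "Fitted values", ylab = "Residuals",
  main = "Residuals vs fitted values"
)
abline(h = 0, lty = 2)
```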

Mathematical Formulation

Model Equation

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Estimation of Parameters

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
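To make the estimation formulas concrete, the following sketch computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) directly from the simulated data and compares them with the coefficients returned by `lm()`. The variable names `beta0_hat` and `beta1_hat` are illustrative.

```r
# Recreate the simulated dataset so the block runs on its own
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# The closed-form estimates should agree with lm() up to floating-point error
coef(lm(y ~ x))
```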

Model Fitting for Linear Regression in R

R Code for Model Fitting

# Fitting the model
model <- lm(y ~ x, data = data)
summary(model)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.815 -1.367 -0.175  1.161  6.581 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0568     0.5868   5.209 1.05e-06 ***
## x             1.9475     0.1069  18.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.941 on 98 degrees of freedom
## Multiple R-squared:  0.7721, Adjusted R-squared:  0.7698 
## F-statistic:   332 on 1 and 98 DF,  p-value: < 2.2e-16
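
The estimated intercept (3.0568) and slope (1.9475) in the summary are close to the values used to simulate the data (3 and 2). As a sketch of how the fitted model might then be used for prediction, the block below extracts the coefficients, computes confidence intervals, and predicts \(y\) for a few new values of \(x\). The data frame `new_data` and its values are hypothetical and chosen only for illustration.

```r
# Recreate the simulated dataset and fit the model so the block runs on its own
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x, y)
model <- lm(y ~ x, data = data)

# Point estimates and 95% confidence intervals for the intercept and slope
coef(model)
confint(model, level = 0.95)

# Hypothetical new x values for which y is unknown
new_data <- data.frame(x = c(2, 5, 8))

# Predicted y with 95% prediction intervals (columns fit, lwr, upr)
predictions <- predict(model, newdata = new_data,
                       interval = "prediction", level = 0.95)
predictions
```

A prediction interval is wider than a confidence interval for the mean response because it also accounts for the error term \(\epsilon\) of an individual observation.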