Simple Linear Regression

2023-10-18

Slide 2: Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship between a single independent variable (\(X\)) and a dependent variable (\(Y\)). It assumes a linear relationship between \(X\) and \(Y\) and aims to find the best-fitting line that describes this relationship. The model can be represented as \(Y = \beta_0 + \beta_1 X + \epsilon\), where \(\epsilon\) represents the error term.
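
To make the model concrete, the sketch below simulates data from it and then fits a line to the simulated points. The parameter values (intercept 40, slope 4, error standard deviation 3), the sample size, the range of \(X\), and the choice of normally distributed errors are all illustrative assumptions, not values taken from these slides.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon.
# Every numeric value below is an assumption chosen for illustration.
set.seed(1)                             # make the random draws reproducible
n       <- 50                           # assumed sample size
beta_0  <- 40                           # assumed true intercept
beta_1  <- 4                            # assumed true slope
x       <- runif(n, min = 0, max = 20)  # independent variable
epsilon <- rnorm(n, mean = 0, sd = 3)   # error term (assumed normal)
y       <- beta_0 + beta_1 * x + epsilon

# The estimates from lm() should land close to the assumed true values.
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because of the error term, the estimated coefficients will be near, but not exactly equal to, the values used to generate the data.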

Slide 3: Scatter Plot

Before fitting a regression line, it’s essential to visualize the data with a scatter plot. A scatter plot helps to understand the relationship between the variables and to identify potential outliers or influential points.
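
As a minimal sketch, a scatter plot of the hours-studied and exam-score data from the R code slide could be drawn as follows. The object names `study_data` and `scatter` are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Same data as on the R code slide: hours studied and exam scores.
study_data <- data.frame(
  hours_studied = c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18),
  exam_scores   = c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
)

# Points only, no fitted line yet: the aim is to inspect the raw relationship.
scatter <- ggplot(study_data, aes(x = hours_studied, y = exam_scores)) +
  geom_point() +
  labs(x = "Hours Studied", y = "Exam Scores", title = "Scatter Plot")

print(scatter)

# A numeric companion to the plot: the sample correlation between the variables.
cor(study_data$hours_studied, study_data$exam_scores)
```

If the points fall roughly along a straight line, as they do here, a linear model is a reasonable starting point.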

Slide 4: Scatter Plot

Slide 5: Fitting the Regression Line

The goal is to find the best-fitting line that minimizes the sum of squared residuals.
The regression line is determined by estimating the coefficients \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the observed and predicted values.
The formula for the fitted regression line is \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\), where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated coefficients and \(\hat{Y}\) is the predicted value of \(Y\).
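
For simple linear regression, the least-squares estimates have a closed form: \(\hat{\beta}_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). The sketch below applies these formulas to the data from the R code slide and compares the result with `lm()`. The variable names `x_bar`, `y_bar`, `beta_0_hat`, and `beta_1_hat` are introduced only for this example.

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

x_bar <- mean(hours_studied)
y_bar <- mean(exam_scores)

# Slope: sum of cross-deviations divided by sum of squared deviations of X.
beta_1_hat <- sum((hours_studied - x_bar) * (exam_scores - y_bar)) /
  sum((hours_studied - x_bar)^2)

# Intercept: chosen so the line passes through the point (x_bar, y_bar).
beta_0_hat <- y_bar - beta_1_hat * x_bar

c(intercept = beta_0_hat, slope = beta_1_hat)

# Cross-check: lm() solves the same least-squares problem.
coef(lm(exam_scores ~ hours_studied))
```

Both approaches should give an intercept of about 40.89 and a slope of about 3.88, matching the `summary()` output on the R code slide.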

Slide 6: Residuals

Residuals are the differences between the observed (\(Y\)) and predicted (\(\hat{Y}\)) values.
They represent the error in the model's predictions.
A well-fitting model has residuals that are small and scattered randomly around zero, with no systematic pattern.
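
The sketch below computes the residuals for the data from the R code slide and draws a basic residual plot. The names `fit`, `y_hat`, and `res` are hypothetical and used only here.

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

fit   <- lm(exam_scores ~ hours_studied)
y_hat <- fitted(fit)            # predicted values
res   <- exam_scores - y_hat    # observed minus predicted; same as residuals(fit)

# For a least-squares fit with an intercept, the residuals sum to
# (numerically) zero.
sum(res)

# Residuals against fitted values: look for curvature or a funnel shape,
# either of which would suggest the linear model is inadequate.
plot(y_hat, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With only ten observations, any pattern in this plot should be read cautiously.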

Slide 7: Coefficients and Interpretation

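\(\hat{\beta}_0\) (Intercept): The estimated mean value of \(Y\) when \(X\) is zero. It is only meaningful if \(X = 0\) lies within or near the range of the observed data.
\(\hat{\beta}_1\) (Slope): The estimated change in the mean of \(Y\) for a one-unit increase in \(X\).
Interpretation of \(\hat{\beta}_1\): For every one-unit increase in \(X\), we expect \(Y\) to change by \(\hat{\beta}_1\) units on average: an increase if \(\hat{\beta}_1\) is positive, a decrease if it is negative.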
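
As an illustration of the slope interpretation, the sketch below uses the data from the R code slide and checks that the predicted score rises by exactly \(\hat{\beta}_1\) when study time goes from 10 to 11 hours. The values 10 and 11 and the names `fit`, `b`, and `pred` are arbitrary choices for this example.

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

fit <- lm(exam_scores ~ hours_studied)
b   <- coef(fit)

# Intercept: the predicted score at zero hours of study. Zero is below the
# smallest observed value (2 hours), so this is an extrapolation.
b[["(Intercept)"]]

# Slope: the difference between predictions one unit of X apart.
pred <- predict(fit, newdata = data.frame(hours_studied = c(10, 11)))
diff(pred)            # equals the slope
b[["hours_studied"]]
```

The same difference is obtained for any pair of values one hour apart, which is what a linear relationship means.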

Slide 8: R code

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

ggplot(data = data.frame(hours_studied = hours_studied, exam_scores = exam_scores), 
       aes(x = hours_studied, y = exam_scores)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(x = "Hours Studied", y = "Exam Scores") +
  ggtitle("Regression Line")

## `geom_smooth()` using formula = 'y ~ x'

reg_model <- lm(exam_scores ~ hours_studied)
summary(reg_model)

## 
## Call:
## lm(formula = exam_scores ~ hours_studied)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.049 -2.825 -0.479  2.429  5.337 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.8910     2.3699   17.25 1.30e-07 ***
## hours_studied   3.8772     0.2148   18.05 9.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.57 on 8 degrees of freedom
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.973 
## F-statistic: 325.7 on 1 and 8 DF,  p-value: 9.116e-08
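
The quantities printed by `summary()` can also be extracted programmatically. The sketch below pulls out the coefficient table, the residual standard error, and R-squared, and recomputes R-squared by hand as \(1 - \text{RSS}/\text{TSS}\). The names `s`, `rss`, and `tss` are hypothetical.

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

reg_model <- lm(exam_scores ~ hours_studied)
s <- summary(reg_model)

s$coefficients   # matrix of estimates, standard errors, t values, p-values
s$sigma          # residual standard error
s$r.squared      # multiple R-squared

# R-squared by hand: share of the variation in Y explained by the line.
rss <- sum(residuals(reg_model)^2)                 # residual sum of squares
tss <- sum((exam_scores - mean(exam_scores))^2)    # total sum of squares
1 - rss / tss

# 95% confidence intervals for the intercept and slope.
confint(reg_model, level = 0.95)
```

An R-squared of about 0.976 means the fitted line accounts for roughly 97.6% of the variation in exam scores in this small sample.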

Slide 9: Conclusion

Simple linear regression is a valuable tool for modeling the relationship between two variables.
It describes how the dependent variable changes, on average, as the independent variable changes.
The estimated coefficients can be interpreted directly and used to make predictions from the model.
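
As a closing sketch, the fitted model from the R code slide can be used to predict the exam score of a new student. The value of 8 hours and the name `new_student` are hypothetical, chosen to fall inside the observed range of study times.

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

reg_model <- lm(exam_scores ~ hours_studied)

# Hypothetical new observation: a student who studies 8 hours.
new_student <- data.frame(hours_studied = 8)

# Point prediction with a 95% prediction interval for an individual score.
pred <- predict(reg_model, newdata = new_student,
                interval = "prediction", level = 0.95)
pred
```

The point prediction is about 71.9, and the interval around it reflects both the uncertainty in the estimated line and the scatter of individual scores about that line.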