2023-10-18

Slide 2: Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship between a single independent variable (\(X\)) and a dependent variable (\(Y\)). It assumes that the relationship between \(X\) and \(Y\) is linear and aims to find the line that best describes it. The model can be written as \(Y = \beta_0 + \beta_1 X + \epsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.
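
To make the model concrete, here is a minimal sketch that simulates data from \(Y = \beta_0 + \beta_1 X + \epsilon\) and recovers the coefficients with lm(). The parameter values (\(\beta_0 = 40\), \(\beta_1 = 4\), error standard deviation 3), the range of \(X\), and the sample size are hypothetical, chosen only for illustration:

```r
set.seed(1)                            # make the random draws reproducible

n     <- 200                           # hypothetical sample size
beta0 <- 40                            # hypothetical true intercept
beta1 <- 4                             # hypothetical true slope

x       <- runif(n, min = 0, max = 20) # independent variable
epsilon <- rnorm(n, mean = 0, sd = 3)  # error term with mean zero
y       <- beta0 + beta1 * x + epsilon # dependent variable from the model

sim_fit   <- lm(y ~ x)                 # least-squares fit
estimates <- coef(sim_fit)             # named vector: "(Intercept)" and "x"
print(estimates)
```

The estimates should land close to the chosen values, but not exactly on them, because of the random error term.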

Slide 3: Scatter Plot

Before fitting a regression line, it’s essential to visualize the data with a scatter plot. A scatter plot shows whether the relationship between the variables looks roughly linear and helps identify potential outliers or influential points.

Slide 4: Scatter Plot
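
A minimal sketch of such a scatter plot, assuming the hours-studied and exam-score data used in the R code on Slide 8. The data frame name study_data and the plot title are illustrative choices, not part of the original deck:

```r
library(ggplot2)

# Data taken from the R code slide of this deck
study_data <- data.frame(
  hours_studied = c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18),
  exam_scores   = c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
)

# Points only, no fitted line yet: check linearity and look for outliers
p <- ggplot(study_data, aes(x = hours_studied, y = exam_scores)) +
  geom_point() +
  labs(x = "Hours Studied", y = "Exam Scores") +
  ggtitle("Scatter Plot")

print(p)
```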

Slide 5: Fitting the Regression Line

  • The goal is to find the line that minimizes the sum of squared residuals (ordinary least squares).
  • This means choosing the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize \(\sum_i (Y_i - \hat{Y}_i)^2\), the sum of squared differences between the observed and predicted values.
  • The fitted regression line is \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\), where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated coefficients.
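
For simple linear regression the least-squares estimates have a closed form: \(\hat{\beta}_1 = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_i (X_i - \bar{X})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). A minimal sketch that computes them by hand and compares the result with lm(), using the data from Slide 8 (the variable names sxx, sxy, slope, and intercept are illustrative):

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)

x_bar <- mean(hours_studied)
y_bar <- mean(exam_scores)

# Sums of squares and cross-products around the means
sxx <- sum((hours_studied - x_bar)^2)
sxy <- sum((hours_studied - x_bar) * (exam_scores - y_bar))

slope     <- sxy / sxx               # estimate of beta_1
intercept <- y_bar - slope * x_bar   # estimate of beta_0

# Compare with the coefficients that lm() reports
lm_coefs <- coef(lm(exam_scores ~ hours_studied))
print(c(intercept = intercept, slope = slope))
print(lm_coefs)
```

Both approaches should agree up to floating-point rounding.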

Slide 6: Residuals

  • Residuals are the differences between the observed (\(Y\)) and predicted (\(\hat{Y}\)) values: \(e = Y - \hat{Y}\).
  • They represent the error in the model's predictions.
  • A well-fitting model has residuals that are small and scattered randomly around zero, with no systematic pattern.
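
The points above can be checked directly in R. A minimal sketch, assuming the data from Slide 8; the residual plot uses base graphics and is an illustrative choice rather than part of the original deck:

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
reg_model     <- lm(exam_scores ~ hours_studied)

# Residual = observed value minus fitted (predicted) value
fitted_scores <- fitted(reg_model)
res           <- exam_scores - fitted_scores  # same values as residuals(reg_model)

# With an intercept in the model, least-squares residuals sum to zero
# (up to floating-point rounding)
res_sum <- sum(res)

# Residual sum of squares: the quantity that least squares minimizes
rss <- sum(res^2)
print(round(res, 3))
print(rss)

# Residuals against X: look for curvature or a funnel shape
plot(hours_studied, res, xlab = "Hours Studied", ylab = "Residual")
abline(h = 0, lty = 2)
```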

Slide 7: Coefficients and Interpretation

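  • \(\hat{\beta}_0\) (Intercept): Represents the estimated mean value of \(Y\) when \(X\) is zero. It is only meaningful when \(X = 0\) lies within or near the observed range of \(X\).
  • \(\hat{\beta}_1\) (Slope): Represents the estimated change in the mean of \(Y\) for a one-unit increase in \(X\).
  • Interpretation of \(\hat{\beta}_1\): For every one-unit increase in \(X\), we expect \(Y\) to change by \(\hat{\beta}_1\) units on average (an increase if \(\hat{\beta}_1\) is positive, a decrease if it is negative).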
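
A minimal sketch of this interpretation, assuming the data and model from Slide 8. The two input values (8 and 9 hours) are hypothetical and chosen only to show that predictions one unit apart differ by exactly the slope:

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
reg_model     <- lm(exam_scores ~ hours_studied)

b <- coef(reg_model)  # named vector: "(Intercept)" and "hours_studied"

# Predicted exam scores at two hypothetical values of X, one unit apart
new_hours <- data.frame(hours_studied = c(8, 9))
pred      <- predict(reg_model, newdata = new_hours)

# The gap between the two predictions is the estimated slope
slope_check <- unname(pred[2] - pred[1])

print(b)
print(pred)
print(slope_check)
```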

Slide 8: R code

library(ggplot2)
library(plotly)
# Example data: ten observations of hours studied and exam score
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
study_data <- data.frame(hours_studied = hours_studied, exam_scores = exam_scores)

# Scatter plot with the least-squares line and its confidence band
ggplot(study_data, aes(x = hours_studied, y = exam_scores)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  labs(x = "Hours Studied", y = "Exam Scores") +
  ggtitle("Regression Line")

# Fit the simple linear regression and print the coefficient table
reg_model <- lm(exam_scores ~ hours_studied)
summary(reg_model)
## 
## Call:
## lm(formula = exam_scores ~ hours_studied)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.049 -2.825 -0.479  2.429  5.337 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.8910     2.3699   17.25 1.30e-07 ***
## hours_studied   3.8772     0.2148   18.05 9.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.57 on 8 degrees of freedom
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.973 
## F-statistic: 325.7 on 1 and 8 DF,  p-value: 9.116e-08
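
The key numbers in this summary can also be pulled out programmatically. A minimal sketch, assuming the same data and model; the 95% confidence level is an illustrative choice:

```r
hours_studied <- c(2, 3, 5, 7, 9, 10, 12, 15, 16, 18)
exam_scores   <- c(45, 50, 60, 70, 80, 85, 90, 95, 100, 110)
reg_model     <- lm(exam_scores ~ hours_studied)

s <- summary(reg_model)

# Coefficient table: estimates, standard errors, t values, p-values
coef_table <- coef(s)

# R-squared: share of the variance in exam_scores explained by the model
r2 <- s$r.squared

# In simple linear regression, R-squared equals the squared correlation
r <- cor(hours_studied, exam_scores)

# Confidence intervals for the intercept and slope
ci <- confint(reg_model, level = 0.95)

print(coef_table)
print(c(r_squared = r2, cor_squared = r^2))
print(ci)
```

Here the slope estimate of 3.8772 means that each additional hour studied is associated with an expected increase of about 3.9 points in exam score, and an R-squared of 0.976 means the fitted line accounts for about 97.6% of the variance in the scores.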

Slide 9: Conclusion

  • Simple linear regression is a valuable tool for modeling the linear relationship between two variables.
  • It quantifies how changes in the independent variable are associated with changes in the dependent variable.
  • The estimated coefficients can be used to predict \(Y\) for new values of \(X\), ideally within the range of the observed data.