2025-11-16

Simple Linear Regression

We will learn how to predict a value from previous data points: we plot them, find the best-fit line through them, and use that line to make future predictions.

What are we doing?

The scenario we have chosen for this analysis is described below:

  • We will study the relationship between study hours and the grades students get in an educational environment.
  • The X variable, plotted on the x-axis of our graph, represents the Hours Studied.
  • The Y variable, plotted on the y-axis of our graph, represents the Exam Score.
  • We will find a rule that predicts: if someone studies X hours, they will score about Y marks.

Our example data

  • We make two vectors: study_hours, and scores, which holds the exam score for each value in study_hours.
study_hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 60, 65, 67, 72, 78, 85)
  • Then we combine them into a data frame called data.
data <- data.frame(study_hours, scores)

Scatter Plot

  • A scatter plot is a simple graph that shows dots on a coordinate grid. Each dot represents one pair of corresponding values, here one student's study hours and exam score.
  • We plot the graph with study_hours on the x-axis and scores on the y-axis.
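library(ggplot2)  # provides ggplot(), aes(), geom_point(), and labs()
ggplot(data, aes(x = study_hours, y = scores)) +
  geom_point(size = 3) +  # one dot per (study_hours, scores) pair
  labs(
    title = "Study Hours vs Exam Score",
    x = "Study Hours",
    y = "Exam Score"
  )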

[Figure: Scatter Plot]

Scatter Plot With Linear Regression

ggplot(data, aes(x = study_hours, y = scores)) +
  geom_point(size = 3, color = "blue") +
  # method = "lm" draws the least-squares line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Study Hours vs Exam Score (with Regression Line)",
    x = "Study Hours",
    y = "Exam Score"
  )

[Figure: Scatter Plot With Linear Regression]

More interactive Scatter Plot with Linear Regression
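
The code for this slide is not shown. The following is only a sketch of one common way to get an interactive version, assuming the ggplot2 and plotly packages are installed and that the static plot is converted with plotly's ggplotly(). The names p and interactive_plot are illustrative.

```r
library(ggplot2)
library(plotly)  # assumption: the plotly package is installed

# Recreate the example data so this block runs on its own
study_hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 60, 65, 67, 72, 78, 85)
data <- data.frame(study_hours, scores)

# Build the same static plot as before and keep it in a variable
p <- ggplot(data, aes(x = study_hours, y = scores)) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(
    title = "Study Hours vs Exam Score (with Regression Line)",
    x = "Study Hours",
    y = "Exam Score"
  )

# Convert the ggplot object into an interactive plotly widget
interactive_plot <- ggplotly(p)
interactive_plot
```

Hovering over a point in the resulting widget shows its study_hours and scores values. Passing formula = y ~ x explicitly is optional and only makes the default formula visible in the code.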

Our Linear Regression Expression

\[ \hat{y} = a + bx \]

Where:

  • \(\hat{y}\) = predicted exam score
  • \(x\) = hours studied
  • \(b\) = slope (how much the predicted score changes per additional hour studied)
  • \(a\) = intercept (the predicted score when hours studied = 0)
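
A minimal sketch of how this expression can be fitted and used in R, assuming base R's lm() is an acceptable way to obtain \(a\) and \(b\) (it fits the same least-squares line that geom_smooth(method = "lm") draws). The names model, new_hours and predicted_score are illustrative, and 9 hours is an arbitrary input chosen for the example.

```r
# Recreate the example data so this block runs on its own
study_hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 60, 65, 67, 72, 78, 85)
data <- data.frame(study_hours, scores)

# Fit y-hat = a + b * x by least squares
model <- lm(scores ~ study_hours, data = data)

# Extract the intercept (a) and the slope (b)
a <- unname(coef(model)["(Intercept)"])
b <- unname(coef(model)["study_hours"])

# Predict the score for a hypothetical student who studies 9 hours
new_hours <- data.frame(study_hours = 9)
predicted_score <- unname(predict(model, newdata = new_hours))

cat("a =", round(a, 3), "b =", round(b, 3), "\n")
cat("Predicted score for 9 hours:", round(predicted_score, 2), "\n")
```

For this data the fit gives a of about 45.18 and b of about 4.74, so 9 hours maps to a predicted score of about 87.8. Note that 9 hours lies just outside the observed range of 1 to 8 hours, so the prediction relies on the line continuing to hold there.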

How a and b are Calculated

  • \(b\) = slope (how much the score changes per hour)
    \[b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\]

  • \(a\) = intercept (the predicted score when hours studied = 0)
    \[a = \bar{y} - b \bar{x}\]
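
The two formulas above can be checked by hand in R. This is a sketch that assumes the same example data. The helper names x_bar, y_bar, s_xy, s_xx and matches_lm are made up for illustration, and lm() is used only as a cross-check.

```r
study_hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 60, 65, 67, 72, 78, 85)

# Means of x and y
x_bar <- mean(study_hours)  # 4.5
y_bar <- mean(scores)       # 66.5

# Numerator and denominator of the slope formula
s_xy <- sum((study_hours - x_bar) * (scores - y_bar))  # 199
s_xx <- sum((study_hours - x_bar)^2)                   # 42

# Slope and intercept from the formulas
b <- s_xy / s_xx        # about 4.738
a <- y_bar - b * x_bar  # about 45.179

# Cross-check against lm(), whose coefficients are ordered (intercept, slope)
fit <- lm(scores ~ study_hours)
matches_lm <- isTRUE(all.equal(unname(coef(fit)), c(a, b)))
matches_lm
```

Here the slope is 199 / 42, so each extra hour of study adds roughly 4.7 marks to the predicted score.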

Summary

In this presentation, we:

  • Used a simple example of study hours vs exam scores
  • Drew a scatter plot with ggplot
  • Added a regression line with ggplot
  • Made an interactive plot with plotly
  • Saw the basic math behind the line

Simple linear regression is just:

  • “Draw the best straight line through the dots so we can predict y from x.”