10/26/2025

Introduction to Simple Linear Regression

What is simple linear regression?

Simple linear regression models how a quantitative independent variable influences a quantitative dependent variable by fitting the straight line that most closely follows the data, known as the “line of best fit”.

What are its purposes?

  • Interpret how one variable influences the other, as described above
  • Form predictions from the model’s line of best fit

Mathematical Representation

The line of best fit is represented by the equation \(y = mx + b\), where y is the dependent variable and x is the independent variable. In simple linear regression, this is rewritten as \(y = \beta_1x + \beta_0 + \epsilon\) where:

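\(\beta_0\) = y-intercept, the value of y when x = 0, so the line passes through the point (0, \(\beta_0\))

\(\beta_1\) = regression coefficient, which is equivalent to the slope of the line

\(\epsilon\) = error term, the vertical difference between each actual point and the line of best fit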
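To make the three terms concrete, the sketch below simulates data from the equation and then recovers the coefficients. The values of beta_0 and beta_1, the noise level, and the seed are arbitrary assumptions chosen for illustration; they are not taken from either dataset in this presentation.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Assumed "true" values, chosen only for this illustration
beta_0 <- 2
beta_1 <- 0.5

x <- 1:50
epsilon <- rnorm(length(x), mean = 0, sd = 1)  # random error term
y <- beta_1 * x + beta_0 + epsilon             # y = beta_1 * x + beta_0 + epsilon

# Fit the line of best fit; the estimates should land near 2 and 0.5
fit <- lm(y ~ x)
coef(fit)

# The residuals are the estimated errors: actual points minus the fitted line
residuals_match <- isTRUE(all.equal(unname(residuals(fit)),
                                    unname(y - fitted(fit))))
residuals_match
```

The estimates will not equal the assumed values exactly, because the error term shifts each point off the true line.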

Example 1: Salary Dataset

Background of Dataset

In this first case, a small dataset was pulled from Kaggle. It has two main columns: total years of experience and the resulting salary. The independent variable is the individual’s number of years of experience, and the dependent variable is their annual salary.

The following graph shows the simple linear regression model for the salary dataset described above. A simple linear regression model is visualized as a scatter plot that includes a line of best fit.

Salary Dataset: Coding SLR Model in R

R Code for ‘Simple Linear Regression Model: Salary Dataset’

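library(ggplot2)

# Assumes `salary` is a data frame with columns YearsExperience and Salary
ggplot(salary, aes(x = YearsExperience, y = Salary)) +
  geom_point(color = "blue") +                 # scatter plot of the raw data
  geom_smooth(method = "lm", formula = y ~ x,  # straight line of best fit
              color = "red", se = FALSE) +     # se = FALSE hides the confidence band
  labs(title = "Simple Linear Regression Model: Salary Dataset",
       x = "Number of Years Experience", y = "Annual Salary ($)")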

Key Takeaways: Coding SLR Models in R

  • Use ggplot() and geom_point() to create a scatter plot of the independent variable (x) against the dependent variable (y)
  • Add a line of best fit to the scatter plot with geom_smooth(), specifying method = "lm" (short for linear model). Without a method argument, geom_smooth() fits a flexible smoothed curve (LOESS for small datasets) that bends to follow the points; method = "lm" instead draws the single straight line that minimizes the squared vertical distances to the points
  • Give the graph a title and axis labels with the labs() function
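To see what method = "lm" changes, the sketch below overlays a LOESS curve and a linear fit on the same scatter plot. It uses R's built-in cars dataset as a stand-in (an assumption made so the example runs on its own; it is not the salary data), and the colors and labels are illustrative choices.

```r
library(ggplot2)

# `cars` ships with R (speed vs. stopping distance); it stands in for the salary data
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue") +
  # LOESS is what geom_smooth() picks by default for small datasets
  geom_smooth(method = "loess", formula = y ~ x, color = "gray40", se = FALSE) +
  # method = "lm" forces a straight least-squares line
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  labs(title = "Default smoother vs. linear fit",
       x = "Speed (mph)", y = "Stopping distance (ft)")

print(p)
```

The gray curve bends with the data, while the red line stays straight; only the red line corresponds to a simple linear regression model.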

Mathematical Breakdown: Salary Dataset

How do we determine the equation for the line of best fit mathematically?

The equation below gives the regression coefficient (slope), where n is the number of observations. It is one of several equivalent ways to compute the slope.

\(\beta_1 = \frac{n(\sum x y) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}\)

Once the slope is known, \(\beta_0\) can be computed as \(\beta_0 = \frac{(\sum y) - \beta_1(\sum x)}{n}\).
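As a worked illustration of both formulas, take four hypothetical points (1, 3), (2, 5), (3, 6), and (4, 9). These values are made up for the arithmetic and do not come from the salary dataset.

```latex
\[
n = 4,\quad \sum x = 10,\quad \sum y = 23,\quad \sum xy = 67,\quad \sum x^2 = 30
\]
\[
\beta_1 = \frac{4(67) - (10)(23)}{4(30) - (10)^2} = \frac{268 - 230}{120 - 100} = \frac{38}{20} = 1.9
\]
\[
\beta_0 = \frac{23 - 1.9(10)}{4} = \frac{4}{4} = 1
\]
```

The line of best fit for these four points is therefore \(y = 1.9x + 1\).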

In R, these sums can be built step by step: tibble() or an existing data frame holds the data, mutate() adds helper columns such as the product of x and y for each observation, and sum() totals them for use in the equations above. Alternatively, the lm() function does the whole calculation and outputs the y-intercept (\(\beta_0\)) and the slope (\(\beta_1\)).
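A minimal sketch of that manual approach is shown here, checked against lm(). The four data points and the names toy, xy, and x_sq are hypothetical; with the real data, the salary columns would take the place of x and y.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data, not the Kaggle salary dataset
toy <- tibble(x = c(1, 2, 3, 4), y = c(3, 5, 6, 9)) %>%
  mutate(xy = x * y,   # x times y for each observation
         x_sq = x^2)   # x squared for each observation

n <- nrow(toy)

# Slope and intercept from the summation formulas
beta_1 <- (n * sum(toy$xy) - sum(toy$x) * sum(toy$y)) /
  (n * sum(toy$x_sq) - sum(toy$x)^2)
beta_0 <- (sum(toy$y) - beta_1 * sum(toy$x)) / n

c(beta_0 = beta_0, beta_1 = beta_1)

# lm() should report the same intercept and slope
fit <- lm(y ~ x, data = toy)
coef(fit)
```

Both approaches should agree up to floating-point rounding, which is a quick way to confirm the helper columns were built correctly.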

Mathematical Breakdown Contd.

lm() Output

lm(Salary ~ YearsExperience, data = salary)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = salary)
## 
## Coefficients:
##     (Intercept)  YearsExperience  
##           25792             9450

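Given this output, the equation of the simple linear regression line is \(y = 9450x + 25792\). In other words, the model predicts a salary of about 25,792 dollars with no experience, and each additional year of experience adds about 9,450 dollars.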
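One use of this equation is prediction. The sketch below plugs a few values of x into the fitted line, using the coefficients as rounded in the lm() output; predict_salary is a hypothetical helper name introduced for this illustration.

```r
# Coefficients as rounded in the lm() output
beta_0 <- 25792
beta_1 <- 9450

# Hypothetical helper: predicted annual salary for a given number of years
predict_salary <- function(years_experience) {
  beta_1 * years_experience + beta_0
}

predict_salary(c(0, 5, 10))
# 25792, 73042, 120292
```

Predictions are most trustworthy for experience values inside the range covered by the dataset, since the line is only fitted to those points.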

Example 2: Student Performance Dataset

Background of Dataset

For the second case, the dataset was also pulled from Kaggle. It consists of 8 columns, including students’ math, reading, and writing scores.

Above, the simple linear regression model has been visualized to better display the relationship between students’ reading and writing scores.

Student Performance Dataset: Plotly Plot

Besides ggplot(), the plot_ly() function from the plotly package can be used to build 3D representations of a dataset when more than two variables are of interest, as shown below.

The 3D display (shown on the next slide) plots three variables against each other: the reading, writing, and math scores. In the top-right corner, a legend gives the color code for the points, which are grouped by the individual’s race and/or ethnicity.

Closer Look: Plotly Plot

Student Performance Dataset: Coding Plotly

R Code for Plotly Plot

library(plotly)

# Assumes `performance` is a data frame with the score and race.ethnicity columns
# One color per race/ethnicity group
group_colors <- c("blue", "red", "gray", "yellow", "darkgreen")
plot_ly(performance, x = ~reading.score,
        y = ~writing.score, z = ~math.score,
        color = ~race.ethnicity, colors = group_colors,
        type = "scatter3d",
        mode = "markers") %>%
  layout(title = "Reading vs. Writing vs. Math Scores",
         # scene holds the axis settings for 3D plots
         scene = list(xaxis = list(title = "Reading Score"),
                      yaxis = list(title = "Writing Score"),
                      zaxis = list(title = "Math Score")))

Key Takeaways: Coding for Plotly Plot

  • Build a three-dimensional scatter plot with three quantitative variables on the x, y, and z axes using the plot_ly() function
  • Title the plot with layout() and label the axes through its scene argument, a list that holds the 3D axis settings
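The same pattern can be tried without the Kaggle file. The sketch below applies it to R's built-in iris dataset, which is an assumption made so the example is self-contained; the variables, colors, and titles are illustrative and unrelated to the student performance data.

```r
library(plotly)

# `iris` ships with R; it stands in for the student performance data
p <- plot_ly(iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
             color = ~Species, colors = c("blue", "red", "darkgreen"),
             type = "scatter3d", mode = "markers") %>%
  layout(title = "Iris measurements",
         # scene is a named list, one entry per 3D axis
         scene = list(xaxis = list(title = "Sepal Length"),
                      yaxis = list(title = "Sepal Width"),
                      zaxis = list(title = "Petal Length")))

p  # in an interactive session this displays the rotatable 3D plot
```

Storing the plot in p before displaying it makes it easy to adjust the layout later without rebuilding the markers.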

References

Informational Resources

  • Bevans, Rebecca. “Simple Linear Regression | An Easy Introduction & Examples.” Scribbr, Scribbr, 19 Feb. 2020, www.scribbr.com/statistics/simple-linear-regression/.
  • “Linear Regression Formula.” GeeksforGeeks, GeeksforGeeks, 23 July 2025, www.geeksforgeeks.org/maths/linear-regression-formula/.

Kaggle Dataset Links

The links to the Kaggle datasets will be provided in the PDF file.