What is Simple Linear Regression?

Simple linear regression is a statistical method that models the relationship between two continuous variables by fitting a straight line: one variable (the predictor) is used to explain or predict the other (the response).


Regression Equation

The equation of a simple linear regression model:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

- \(y\) is the dependent variable
- \(x\) is the independent variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
- \(\epsilon\) is the error term
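To make the roles of the terms concrete, here is a minimal sketch that simulates data from this equation and checks that a fitted line recovers the parameters. The values \(\beta_0 = 2\), \(\beta_1 = 3\), the sample size, the seed, and the noise level are all assumptions chosen for illustration; they do not come from the exam-score example.

```r
# Simulate y = beta_0 + beta_1 * x + epsilon with assumed parameter values.
set.seed(42)                              # arbitrary seed, for reproducibility only
beta_0 <- 2                               # assumed intercept
beta_1 <- 3                               # assumed slope
x <- runif(200, min = 0, max = 10)        # independent variable
epsilon <- rnorm(200, mean = 0, sd = 1)   # error term
y <- beta_0 + beta_1 * x + epsilon        # dependent variable

# Fit a straight line and inspect the estimated intercept and slope.
sim_fit <- lm(y ~ x)
coef(sim_fit)                             # estimates should land near 2 and 3
```

The estimates will not equal 2 and 3 exactly, because the error term adds noise to every observation.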


Real-Life Example

Let’s say we want to predict a student’s final exam score (\(y\)) based on the number of hours studied (\(x\)).


ggplot: Visualizing the Data

library(ggplot2)

# Toy data set: hours studied and final exam score for seven students
data <- data.frame(hours = c(1, 2, 3, 4, 5, 6, 7),
                   score = c(50, 55, 65, 70, 75, 78, 85))

# Scatter plot of exam score against hours studied
ggplot(data, aes(x = hours, y = score)) +
  geom_point(color = "blue", size = 3) +
  ggtitle("Hours Studied vs. Exam Score") +
  theme_minimal()


ggplot: Adding Regression Line

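# Fit the model once; it is reused later by summary(model)
model <- lm(score ~ hours, data = data)

# geom_smooth(method = "lm") fits the same least-squares line for display;
# se = FALSE hides the confidence band around it
ggplot(data, aes(x = hours, y = score)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Regression Line") +
  theme_minimal()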
## `geom_smooth()` using formula = 'y ~ x'


Math Behind the Model

Ordinary least squares picks the line that minimizes the sum of squared residuals. The resulting estimate of the slope (\(\beta_1\)) is:

\[ \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

And the intercept (\(\beta_0\)):

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]
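As a check on the two formulas, the sketch below applies them by hand to the hours/score data from the plotting example and compares the result with `lm()`. The variable names `x_bar`, `y_bar`, `beta_1_hat`, and `beta_0_hat` are introduced here for illustration only.

```r
# Same toy data as in the plotting example
hours <- c(1, 2, 3, 4, 5, 6, 7)
score <- c(50, 55, 65, 70, 75, 78, 85)

x_bar <- mean(hours)   # 4
y_bar <- mean(score)   # 478 / 7, about 68.29

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta_1_hat <- sum((hours - x_bar) * (score - y_bar)) / sum((hours - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
beta_0_hat <- y_bar - beta_1_hat * x_bar

c(intercept = beta_0_hat, slope = beta_1_hat)   # about 45.2857 and 5.75
coef(lm(score ~ hours))                         # lm() gives the same values
```

Here the numerator is 161 and the denominator is 28, so the slope is 161 / 28 = 5.75 and the intercept is 68.2857 - 5.75 × 4 = 45.2857.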


R Code for Linear Regression

summary(model)
## 
## Call:
## lm(formula = score ~ hours, data = data)
## 
## Residuals:
##       1       2       3       4       5       6       7 
## -1.0357 -1.7857  2.4643  1.7143  0.9643 -1.7857 -0.5357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2857     1.5892   28.50 9.97e-07 ***
## hours         5.7500     0.3554   16.18 1.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.88 on 5 degrees of freedom
## Multiple R-squared:  0.9813, Adjusted R-squared:  0.9775 
## F-statistic: 261.8 on 1 and 5 DF,  p-value: 1.643e-05

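The output reports the estimated coefficients with their standard errors and p-values, the residual standard error, and R-squared. Here the intercept is 45.29 and the slope is 5.75, so each additional hour of study is associated with about 5.75 more points on the exam. An R-squared of 0.9813 means the line accounts for about 98% of the variation in scores in this small sample of seven students.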
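Individual quantities can also be pulled out of the fitted model rather than read off the printed summary. The following is a minimal sketch using base R accessors; the data frame `new_student` and its value of 4.5 hours are hypothetical and were chosen to stay inside the observed range of 1 to 7 hours.

```r
# Rebuild the data and model so this block runs on its own
data <- data.frame(hours = c(1, 2, 3, 4, 5, 6, 7),
                   score = c(50, 55, 65, 70, 75, 78, 85))
model <- lm(score ~ hours, data = data)

coef(model)                    # intercept and slope as a named vector
summary(model)$r.squared       # proportion of variance explained
confint(model, level = 0.95)   # 95% confidence intervals for both coefficients

# Predicted score for a hypothetical student who studied 4.5 hours
new_student <- data.frame(hours = 4.5)
predicted <- predict(model, newdata = new_student)
predicted                      # 45.2857 + 5.75 * 4.5, about 71.16
```

Predictions far outside the observed hours would be extrapolation, and the straight-line assumption may not hold there.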


Plotly: 3D Linear Regression (simulated)

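library(plotly)

# Simulate a response z that depends linearly on two predictors, x and y
# (true intercept 2, true slopes 3 and 4, plus standard normal noise)
set.seed(123)
x <- rnorm(100)
y <- rnorm(100)
z <- 2 + 3*x + 4*y + rnorm(100)

# Interactive 3D scatter plot of the simulated points, colored by z
plot_ly(x = ~x, y = ~y, z = ~z, type = "scatter3d", mode = "markers",
        marker = list(size = 3, color = z, colorscale = 'Viridis'))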
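
The plot displays the simulated points but does not fit a model to them. A minimal sketch of the corresponding fit is given here, assuming the same seed and simulation as in the plot. Because `z` depends on two predictors, this is a multiple regression rather than a simple one; the data frame name `sim` and the object `fit_3d` are introduced only for this example.

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
x <- rnorm(100)
y <- rnorm(100)
z <- 2 + 3*x + 4*y + rnorm(100)
sim <- data.frame(x = x, y = y, z = z)

# Regress z on both predictors; the fitted surface is a plane, not a line
fit_3d <- lm(z ~ x + y, data = sim)
coef(fit_3d)   # estimates should be close to the true values 2, 3 and 4
```

With 100 observations and unit-variance noise, the estimates should fall near the values used in the simulation, though not exactly on them.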