2025-03-16
Simple linear regression is a statistical method for modeling the relationship between a dependent variable and a single independent variable.
The formula for simple linear regression: \[ y = \beta_0 + \beta_1 x + \epsilon \]
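Linear regression finds the line of best fit through the data by choosing the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals, \(\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\). The error term \(\epsilon\) captures the variation in \(y\) that the line does not explain.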
The least squares estimates for \(\beta_0\) and \(\beta_1\) are:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]
Here, \(\beta_0\) is the intercept and \(\beta_1\) the slope.
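The two estimators above can be checked numerically. The following is a minimal sketch that assumes the same simulated data as the example in this post; the names `beta0_hat`, `beta1_hat`, `manual`, and `fit` are illustrative only. It computes the closed-form estimates and compares them with the coefficients returned by `lm()`.

```r
# Simulated data (assumed setup: true intercept 7, true slope 3, noise sd 5)
set.seed(141)
x <- 1:50
y <- 7 + 3 * x + rnorm(50, mean = 0, sd = 5)

# Closed-form least squares estimates
x_bar <- mean(x)
y_bar <- mean(y)
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
beta0_hat <- y_bar - beta1_hat * x_bar

# Compare the manual estimates with the ones lm() reports
fit <- lm(y ~ x)
manual <- c(beta0_hat, beta1_hat)
print(manual)
print(all.equal(manual, unname(coef(fit))))
```

If the formulas are implemented correctly, the two sets of estimates should agree up to floating-point tolerance.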
# Simulate data with true intercept 7, true slope 3, and normal noise (sd = 5)
set.seed(141)
x <- 1:50
y <- 7 + 3 * x + rnorm(50, mean = 0, sd = 5)
data <- data.frame(x = x, y = y)
# Fit linear regression model
mod <- lm(y ~ x, data = data)
# Create a ggplot2 scatter plot with the least squares line overlaid
library(ggplot2)
plot1 <- ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "orangered") +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot with Best-Fit Line")
# Display the plot
plot1
`geom_smooth()` using formula = 'y ~ x'
In a plot of residuals against fitted values, the residuals should be randomly scattered around 0 with no obvious pattern. A U-shape, for example, suggests that the relationship is non-linear. There should also be no increasing or decreasing trend.
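One way to run this check is a residuals-versus-fitted plot. The sketch below assumes the same simulated data and model as above and uses base R graphics to stay dependency-free; the names `res` and `fit_vals` are illustrative. The dashed line marks 0, and the `lowess()` curve is an optional visual aid for spotting curvature or trend.

```r
# Refit the model on the simulated data (assumed setup from the example above)
set.seed(141)
x <- 1:50
y <- 7 + 3 * x + rnorm(50, mean = 0, sd = 5)
mod <- lm(y ~ x)

# Extract residuals and fitted values
res <- resid(mod)
fit_vals <- fitted(mod)

# Residuals vs fitted: look for random scatter around the zero line
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Smoothed trend of the residuals; ideally it stays close to the zero line
lines(lowess(fit_vals, res), col = "red")
```

Because the data here are simulated from a straight line with constant-variance noise, the points would be expected to show no systematic pattern; on real data, a curved or sloped smoother is a hint that the linear model is missing structure.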