Definitions

  • linear regression: a technique for modeling the linear relationship between a dependent variable and an independent variable
  • \(y = \alpha + \beta x\)
    • y: the dependent variable you want to predict
    • x: the independent variable you use as an input
    • \(\alpha\): y-intercept of the regression line
    • \(\beta\): slope of the regression line

Plotting points

Find the value of \(\beta\)

  • \(\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\)
  • \(\bar{x}\) is the mean of the x values
  • \(\bar{y}\) is the mean of the y values

Find the value of \(\alpha\)

  • \(\alpha = \bar{y} - \beta \bar{x}\)
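
Where the two formulas come from (a sketch of the standard least-squares argument; the symbol \(S\) is introduced here for illustration and does not appear elsewhere in these notes):

  • \(S(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2\) is the sum of squared residuals to minimize
  • \(\frac{\partial S}{\partial \alpha} = -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0\) gives \(\alpha = \bar{y} - \beta \bar{x}\)
  • \(\frac{\partial S}{\partial \beta} = -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0\); substituting \(\alpha\) gives \(\sum_{i=1}^{n} x_i \left[(y_i - \bar{y}) - \beta (x_i - \bar{x})\right] = 0\)
  • hence \(\beta = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\), because \(\sum_{i=1}^{n} (y_i - \bar{y}) = 0\) and \(\sum_{i=1}^{n} (x_i - \bar{x}) = 0\)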

Apply the formula to the data points

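# Estimate the intercept (alpha) and slope (beta) of the regression line
linear_regression <- function(x, y) {
  # Means of the x and y values
  x_bar <- mean(x)
  y_bar <- mean(y)

  # Slope: sum of cross-deviations divided by sum of squared x-deviations
  beta <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

  # Intercept: the fitted line passes through (x_bar, y_bar)
  alpha <- y_bar - beta * x_bar

  return(c(alpha = alpha, beta = beta))
}

linear_regression(data$x, data$y)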
##     alpha      beta 
##  4.900000 -0.290625
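
One way to sanity-check the hand-written function is to compare it with R's built-in `lm()`. The sketch below assumes a small made-up data set (the vectors `x` and `y` here are hypothetical and are not the `data` used above) and repeats the function definition so that it runs on its own.

```r
# Same closed-form estimator as in the notes, repeated so the sketch is self-contained
linear_regression <- function(x, y) {
  x_bar <- mean(x)
  y_bar <- mean(y)
  beta <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
  alpha <- y_bar - beta * x_bar
  c(alpha = alpha, beta = beta)
}

# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

manual <- linear_regression(x, y)   # alpha = 2.2, beta = 0.6
builtin <- coef(lm(y ~ x))          # named "(Intercept)" and "x"

# Compare the values only, since the two results carry different names
all.equal(unname(manual), unname(builtin))
```

If the two agree up to floating-point tolerance, `all.equal()` returns `TRUE`. In practice `lm()` is usually the more convenient choice, since it also reports standard errors and handles more than one predictor.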

Plot the regression line

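library(ggplot2)
# Reuse the fitted coefficients instead of hard-coding them
fit <- linear_regression(data$x, data$y)
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 3) +
  geom_abline(slope = fit[["beta"]], intercept = fit[["alpha"]], color = "blue")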
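
Since y is the variable to predict, the fitted line can also be evaluated at a new x. The sketch below is a minimal illustration: `predict_y` is a hypothetical helper name, the coefficients are the ones printed above, and the input `x = 4` is an arbitrary value chosen for the example.

```r
# Hypothetical helper: evaluate the regression line y = alpha + beta * x
predict_y <- function(x_new, alpha, beta) {
  alpha + beta * x_new
}

# Coefficients taken from the output above; x = 4 is an arbitrary input
y_hat <- predict_y(4, alpha = 4.9, beta = -0.290625)
y_hat   # 4.9 - 0.290625 * 4 = 3.7375
```

Because the arithmetic is vectorized, passing a vector such as `c(1, 2, 3)` as `x_new` would return one prediction per element.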

Another example

##     alpha      beta 
## 2.5500000 0.3796875