What is simple linear regression?

Simple Linear Regression describes the relationship between two variables:

  • A predictor variable X
  • A response variable Y

The goal is to fit a straight line through the data that best explains how Y changes as X changes.

The Equation

The regression line takes the form:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

  • \(\beta_0\) = intercept (where the line crosses the Y axis)
  • \(\beta_1\) = slope (how much \(Y\) changes per unit increase in \(X\))
  • \(\varepsilon\) = error term (variation in \(Y\) that the line does not explain)
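
The slides do not say how the line is chosen. The usual method, and the one R's lm() uses, is ordinary least squares, which picks the intercept and slope that minimise the sum of squared vertical distances between the points and the line. For one predictor the estimates have a closed form (the hats and the \(x_i\), \(y_i\), \(\bar{x}\), \(\bar{y}\) notation are introduced here for illustration):

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]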

An NBA Example

Can we predict an NBA player’s points per game from their minutes per game?

  • 80 simulated NBA players
  • More minutes played -> more opportunities to score
  • We expect a positive relationship

Exploring the data
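
The simulation code and the exploratory plot are not reproduced here. As a rough sketch only, comparable data could be generated and inspected as below. The seed, the range of minutes, the true coefficients, and the noise level are all assumptions, so this will not reproduce the exact numbers in the model output.

```r
# Hypothetical stand-in for the deck's data: every numeric setting here is assumed.
set.seed(1)
n <- 80
nba <- data.frame(MinutesPerGame = runif(n, min = 10, max = 38))
nba$PointsPerGame <- 2 + 0.6 * nba$MinutesPerGame + rnorm(n, mean = 0, sd = 3)

# Numeric summary and strength of the linear association
summary(nba)
cor(nba$MinutesPerGame, nba$PointsPerGame)

# Scatter plot: a roughly straight upward band supports fitting a line
plot(nba$MinutesPerGame, nba$PointsPerGame,
     xlab = "Minutes per game", ylab = "Points per game")
```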

Fitting the model in R

# fit the regression model
model <- lm(PointsPerGame ~ MinutesPerGame, data = nba)

# view the results
summary(model)
## 
## Call:
## lm(formula = PointsPerGame ~ MinutesPerGame, data = nba)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1653 -1.6110  0.3389  1.8049  8.0497 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.02597    0.97949   2.068   0.0419 *  
## MinutesPerGame  0.57680    0.03704  15.574   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.822 on 78 degrees of freedom
## Multiple R-squared:  0.7567, Adjusted R-squared:  0.7535 
## F-statistic: 242.5 on 1 and 78 DF,  p-value: < 2.2e-16
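
Two of the numbers in this output can be checked by hand, which helps when reading the table. The t value is the estimate divided by its standard error, and in a regression with a single predictor the F-statistic equals the square of the slope's t value:

\[t = \frac{0.57680}{0.03704} \approx 15.57, \qquad F = t^2 \approx 15.574^2 \approx 242.5\]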

The regression line

Checking the residuals
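
The residual plots are not reproduced here. A minimal sketch of two common checks follows, run on stand-in data so that it works on its own. The simulated values are assumptions; with the deck's own `nba` data frame the first three lines would be dropped.

```r
# Stand-in data (assumed values) so the example is self-contained.
set.seed(1)
nba <- data.frame(MinutesPerGame = runif(80, min = 10, max = 38))
nba$PointsPerGame <- 2 + 0.6 * nba$MinutesPerGame + rnorm(80, sd = 3)

model <- lm(PointsPerGame ~ MinutesPerGame, data = nba)

# Residuals vs fitted values: look for a patternless band around zero.
plot(fitted(model), resid(model),
     xlab = "Fitted points per game", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals.
qqnorm(resid(model))
qqline(resid(model))
```

A curve or a funnel shape in the first plot would suggest that a straight line, or the assumption of constant error variance, does not fit the data well.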

Interactive 3D view

What do the results tell us?

The fitted model is:

\[\text{Points} = 2.03 + 0.58 \times \text{Minutes}\]

  • For every extra minute played per game, predicted scoring rises by about 0.58 points per game
  • The model explains about 76% of the variation in points per game (\(R^2\) = 0.76)
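
As a worked example, take a player averaging 30 minutes per game (an illustrative value, not one taken from the data). Plugging it into the fitted line gives:

\[\text{Points} = 2.03 + 0.58 \times 30 = 2.03 + 17.40 = 19.43\]

So the model predicts roughly 19.4 points per game for that player.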

Conclusion

  • Simple Linear Regression is a straightforward way to model relationships between two variables
  • In the NBA example, minutes played is a strong predictor of points scored
  • The model has limits: other factors, such as shot attempts and player skill, also matter
  • Adding more variables leads to multiple regression