March 19th, 2025

1. What is Linear Regression?

  • Linear regression is used to predict the value of a variable based on another variable.
  • The variable being predicted: Dependent Variable.
  • The predictor variable: Independent Variable.
  • It uses the least squares method to fit a best-fit line.
  • The least squares method chooses the line that minimizes the sum of the squared vertical distances between the data points and the line.
  • Among all possible straight lines, the best-fit line is therefore the one with the smallest total squared error.
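
As a minimal sketch of the least squares idea, the slope and intercept for a single predictor can be computed from the closed-form formulas and compared with `lm()`. The simulated data, the seed, and the helper name `sse` are assumptions made for illustration; they are not the data used elsewhere in this document.

```r
# Simulated data (assumption: illustrative only)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Closed-form least squares estimates for one predictor
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# Hypothetical helper: sum of squared vertical distances for a given line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))                      # agrees with the hand-computed values
print(sse(intercept, slope))          # error of the least squares line
print(sse(intercept, slope + 0.1))    # a different slope gives a larger error
```

Shifting the slope or intercept away from the least squares estimates can only increase the sum of squared errors, which is what "best fit" means here.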

2. Importance of Linear Regression

Linear regression provides valuable insights for prediction and data analysis. The model’s equation offers clear coefficients that illustrate the influence of each independent variable on the dependent variable, enhancing our understanding of the underlying relationships. Linear regression is transparent, easy to implement, and serves as a foundational concept for more advanced algorithms.

3. Linear Regression Formula

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

  • \(Y\): Dependent Variable
  • \(X\): Independent Variable
  • \(\beta_0\): Intercept
  • \(\beta_1\): Slope
  • \(\epsilon\): Error term

Multiple linear regression uses two or more independent variables to predict a single dependent variable. With two predictors:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
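
A two-predictor model of this form could be fitted in R roughly as follows. This is only a sketch: the data are simulated, and the names `x1`, `x2`, `df2`, and `fit2` are assumptions chosen for the example.

```r
# Simulated data with two predictors (assumption: illustrative only)
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
df2 <- data.frame(y, x1, x2)

# Each term on the right-hand side of the formula gets its own coefficient
fit2 <- lm(y ~ x1 + x2, data = df2)
print(coef(fit2))   # estimates of beta_0, beta_1, beta_2
```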

4. Assumptions of Linear Regression

  1. Linearity: Relationship between X and Y is linear.
  2. Independence: Observations are independent.
  3. Homoscedasticity: Constant variance of residuals.
  4. Normality: Residuals are normally distributed.
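
Some of these assumptions can be checked informally from the fitted model. The sketch below uses simulated data (an assumption) with base R diagnostics. Independence usually has to be judged from how the data were collected, not from a plot.

```r
# Simulated data (assumption: illustrative only)
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
res <- resid(fit)

# Linearity and homoscedasticity: residuals vs fitted should show
# no trend and no funnel shape. Normality: points on the Q-Q plot
# should lie close to the reference line.
par(mfrow = c(1, 2))
plot(fit, which = 1:2)

# Shapiro-Wilk test of normality on the residuals;
# a large p-value gives no evidence against normality
sw <- shapiro.test(res)
print(sw$p.value)
```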

5. Plotly 2D Visualization

Interpretation: The scatter of points around the line reflects variation in y that the model does not explain. A narrow confidence interval means the fitted line is estimated precisely; a wide one means more uncertainty. Here the confidence interval looks narrow, so the fit appears reasonably precise.
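
The confidence band itself can be computed with `predict()` before it is handed to any plotting library. The following is a sketch on simulated data (an assumption); the names `grid`, `band`, and `width` are illustrative.

```r
# Simulated data (assumption: illustrative only)
set.seed(4)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

# 95% confidence interval for the fitted line on a grid of x values
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
print(head(band))   # columns: fit, lwr, upr

# Width of the band at each grid point
width <- band[, "upr"] - band[, "lwr"]
print(range(width))
```

The band is narrowest near the mean of x and widens toward the edges of the data, so precision depends on where along the line you look.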

6. Scatterplot with Regression Line

Each dot represents one observation (data point) and shows the actual values of x and y in the dataset. The regression line (blue) is the best-fit line computed by linear regression; it represents the predicted relationship between x and y.
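
A plot of this kind could be drawn with base R graphics as sketched below. The data are simulated, and the use of base graphics instead of the plotting package behind the original figure is an assumption.

```r
# Simulated data (assumption: illustrative only)
set.seed(5)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

# Observations as points, fitted regression line in blue
plot(df$x, df$y, pch = 19, xlab = "x", ylab = "y")
abline(fit, col = "blue", lwd = 2)
```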

7. Residuals Plot

Residual = actual y value − predicted y value. Residuals scattered randomly around zero are a good sign: they are consistent with the linearity and constant-variance assumptions. A clear pattern or trend in the residuals suggests the relationship is not truly linear, and you may need to add more variables to your model.
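
To make the definition concrete, the sketch below computes residuals by hand and contrasts a well-specified fit with a straight line fitted to curved data. All data are simulated, and the names `y_curved` and `res_curved` are assumptions for the example.

```r
# Simulated linear data (assumption: illustrative only)
set.seed(6)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

# Residual = actual y minus predicted y
res <- y - fitted(fit)
plot(x, res, xlab = "x", ylab = "Residual")
abline(h = 0, lty = 2)   # residuals should scatter randomly around this line

# Contrast: a straight line fitted to a curved relationship
y_curved <- x^2 + rnorm(100, sd = 0.2)
res_curved <- resid(lm(y_curved ~ x))
print(cor(res_curved, x^2))   # strong leftover structure in the residuals
```

In the second fit the residuals still track x squared, which is the kind of pattern a residuals plot is meant to reveal.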

8. Fit a linear regression model

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9073 -0.6835 -0.0875  0.5806  3.2904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.10280    0.09755  -1.054    0.295    
## x            1.94753    0.10688  18.222   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9707 on 98 degrees of freedom
## Multiple R-squared:  0.7721, Adjusted R-squared:  0.7698 
## F-statistic:   332 on 1 and 98 DF,  p-value: < 2.2e-16
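
Output of this shape comes from calling `summary()` on an `lm` fit. The code that generated the data is not shown in this document, so the sketch below is an assumption: it simulates 100 observations (consistent with the 98 residual degrees of freedom) with a true slope of 2. The seed is arbitrary, so the numbers will not match the output above.

```r
# Assumed data-generating process; the seed is arbitrary
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
df <- data.frame(x, y)

# Fit the simple linear regression and print the full summary
model <- lm(y ~ x, data = df)
print(summary(model))
```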

9. Interpreting the Output

Intercept = -0.1028 → When x = 0, the predicted y is about -0.10. Its p-value (0.295) is large, so the intercept is not significantly different from 0.
Slope = 1.9475 → For every 1-unit increase in x, y increases by about 1.95 on average.
p-value (< 2e-16) is very small, meaning x is a significant predictor of y.
R² = 0.7721 → 77.2% of the variance in y is explained by x, showing a strong relationship.
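
Several figures in the summary output of section 8 are related by simple arithmetic, and checking them is a useful sanity test when reading such output. The sketch below copies the printed values by hand; the variable names are illustrative.

```r
# Values copied from the printed summary output in section 8
est <- 1.94753   # slope estimate
se  <- 0.10688   # standard error of the slope
r2  <- 0.7721    # multiple R-squared
n   <- 100       # observations (98 residual df + 2 coefficients)
p   <- 1         # number of predictors

t_val  <- est / se                                # t value for the slope
f_val  <- t_val^2                                 # with one predictor, F equals t squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # adjusted R-squared
p_val  <- 2 * pt(-abs(t_val), df = n - p - 1)     # two-sided p-value

print(round(c(t = t_val, F = f_val, adj_r2 = adj_r2), 4))
print(p_val)   # far below 2e-16
```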