2024-01-28

The Simple Linear Regression Model

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

This equation represents a simple linear regression model which includes the following elements:

  • dependent variable Y
  • independent variable X
  • the constant (intercept), \(\beta_0\)
  • the coefficient on X (slope), \(\beta_1\)
  • the random error term, \(\varepsilon\)

\(\beta_0\) and \(\beta_1\) are estimated by \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively, using the least squares method, which chooses the estimates that minimize the sum of squared residuals.
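As a minimal sketch of what the least squares method computes, the estimates can be obtained from the standard closed-form formulas \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The example assumes R's built-in cars data; the variable names b0, b1 and fit are illustrative only.

```r
# Closed-form least squares estimates for simple linear regression,
# using R's built-in cars data (speed in mph, stopping distance in ft).
x = cars$speed
y = cars$dist

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point (mean(x), mean(y))
b0 = mean(y) - b1 * mean(x)

# The hand-computed values should agree with the estimates from lm()
fit = lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

In practice lm() would be used directly; the hand computation is only meant to show what it is doing.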

Least Squares Regression Equation

Each observation can be written as the fitted or estimated line plus a residual: \[ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i \] Here (\(x_i\), \(y_i\)) are the values of the i-th data point/observation.

\(e_i\) is the residual for observation i: \(e_i = y_i - \hat{y}_i\).

\(\hat{y}_i\) is the fitted (estimated) value of \(y_i\): \[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
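The following sketch illustrates these definitions in R, again assuming the built-in cars data. It extracts the fitted values and residuals from a fitted model and then rebuilds them by hand from the estimated coefficients; the names y_hat_manual and e_manual are made up for the example.

```r
# Fitted values and residuals for the built-in cars data
fit = lm(dist ~ speed, data = cars)

y_hat = fitted(fit)   # y-hat_i = b0 + b1 * x_i
e = resid(fit)        # e_i = y_i - y-hat_i

# Rebuild both quantities by hand from the estimated coefficients
b = coef(fit)
y_hat_manual = b[1] + b[2] * cars$speed
e_manual = cars$dist - y_hat_manual

# Both comparisons should print TRUE
print(all.equal(unname(y_hat), y_hat_manual))
print(all.equal(unname(e), e_manual))
```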

Example of R code for Plotting a Linear Regression Line using Car Data

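library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), annotate()

# Built-in cars data: speed (mph) and stopping distance (ft)
x = cars$speed
y = cars$dist
df = data.frame(x = x, y = y)

# Fit the least squares line and extract the estimates
lm_model = lm(y ~ x, data = df)
intercept = coef(lm_model)[1]
slope = coef(lm_model)[2]

# Scatterplot with the fitted line; annotate() draws the equation once
# in the top-right corner (geom_text() with aes() would draw it once per row)
p = ggplot(df, aes(x = x, y = y)) +
  geom_point() +  # Add points
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  annotate("text", x = max(df$x), y = max(df$y),
           label = paste("y =", round(intercept, 2),
                         "+", round(slope, 2), "* x"),
           hjust = 1, vjust = 1)
p  # Display the plot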

Cars Speed vs. Distance

\[ \hat{y}_i = -17.58 + 3.93 x_i \]

Significance of Regression Test

This is a hypothesis test of whether there is evidence that X and Y have a linear relationship. The null and alternative hypotheses can be stated in terms of either the correlation \(\rho\) or the slope \(\beta_1\): \(H_0: \beta_1 = 0\) (equivalently \(\rho = 0\)) versus \(H_a: \beta_1 \neq 0\) (equivalently \(\rho \neq 0\)).

\(\rho\) is a measure of how strong or weak the linear relationship is between X and Y. \[ -1 \leq \rho \leq 1 \]
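One way this test might be carried out in R is sketched below, assuming the built-in cars data. It reads the t-test for the slope from the coefficient table of summary() and runs the corresponding correlation test with cor.test(); the names slope_test and rho_test are illustrative.

```r
fit = lm(dist ~ speed, data = cars)

# t-test of H0: beta_1 = 0, read from the coefficient table
slope_test = summary(fit)$coefficients["speed", ]
print(slope_test)   # Estimate, Std. Error, t value, Pr(>|t|)

# Test of H0: rho = 0 using the Pearson correlation
rho_test = cor.test(cars$speed, cars$dist)
print(rho_test$estimate)  # sample correlation r
print(rho_test$p.value)

# In simple linear regression the two tests share the same
# t statistic and p-value, so either formulation can be used.
```

A small p-value is evidence against \(H_0\), that is, evidence of a linear relationship between X and Y.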

Coefficient of Determination: \(R^2\)

\(R^2\) is the fraction of the variability in Y that can be accounted for by the regression model.

\[ 0 \leq R^2 \leq 1 \] We want \(R^2\) to be high. As a common rule of thumb, an \(R^2\) value > 0.7 indicates a strong linear relationship. Let’s use our car data:
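The code that produced the value below is not shown in the source, so the following is only a sketch of how it could be computed. It takes \(R^2\) from summary(), recomputes it as 1 - SSE/SST, and compares it with the squared sample correlation; the variable names are illustrative.

```r
fit = lm(dist ~ speed, data = cars)

# R^2 as reported by summary()
r2 = summary(fit)$r.squared

# Same quantity by hand: 1 - (residual sum of squares / total sum of squares)
sse = sum(resid(fit)^2)
sst = sum((cars$dist - mean(cars$dist))^2)
r2_manual = 1 - sse / sst

# In simple linear regression, R^2 also equals the squared correlation
r2_cor = cor(cars$speed, cars$dist)^2

print(c(r2, r2_manual, r2_cor))
```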

## [1] 0.6510794

\(R^2\) is about 0.65, which falls just short of 0.7 but is fairly close to it, suggesting a reasonably strong linear relationship.

Residual Analysis

We want the residuals \(e_i\) to:

  • be normally distributed
  • have constant variance
  • have a mean equal to 0.

Mean of the residuals for the car data:
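A minimal sketch of this computation, assuming the model is fit to the built-in cars data as before (the name mean_resid is illustrative):

```r
fit = lm(dist ~ speed, data = cars)

# Average of the residuals e_i = y_i - y-hat_i
mean_resid = mean(resid(fit))
print(mean_resid)  # zero up to floating-point rounding; exact digits vary by platform
```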

## [1] -4.440892e-16

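This is zero up to floating-point rounding error, as expected: least squares residuals always sum to zero when the model includes an intercept, so this condition holds by construction. Let’s check the other two conditions using the same car data from before.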

Are the \(e_i\) Normally Distributed?

The distribution of the residuals is somewhat skewed but roughly follows the normal distribution.
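The plot behind this assessment is not reproduced here. A sketch of how normality of the residuals might be checked in base R follows, using a histogram, a normal Q-Q plot, and (as an optional addition not in the original) a Shapiro-Wilk test. The number of histogram bins is an arbitrary choice.

```r
fit = lm(dist ~ speed, data = cars)
e = resid(fit)

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(e, breaks = 10, main = "Histogram of residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(e)
qqline(e)

# Shapiro-Wilk test of H0: the residuals come from a normal distribution
sw = shapiro.test(e)
print(sw)
```

With only 50 observations, the plots and the test should be read together rather than relying on either alone.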

Does \(e_i\) have constant variance?

Yes, the residuals look randomly scattered and do not seem to follow a pattern.
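A residuals versus fitted values plot is one common way to make this check. The sketch below uses base R graphics and the built-in cars data; the axis labels and title are illustrative. A funnel shape or a curve in the plot would suggest non-constant variance or a non-linear relationship.

```r
fit = lm(dist ~ speed, data = cars)

# Residuals against fitted values: look for an even band around zero
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)  # dashed reference line at zero
```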

Summary and Conclusion of Car data

  • Scatterplot appears to show a linear relationship between car speed and distance.
  • \(R^2\) is about 0.65, showing that the linear relationship is reasonably strong.
  • Residuals appear to be normally distributed, have constant variance, and have a mean close to 0.

Inference: A simple linear regression model seems to be a good model of stopping distance as a function of car speed.