2024-10-20

Introduction to Simple Linear Regression

Simple linear regression models the relationship between two variables by fitting a linear equation to observed data.

Formula:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

  • \(y\) is the dependent variable.
  • \(x\) is the independent variable.
  • \(\beta_0\) is the intercept.
  • \(\beta_1\) is the slope.
  • \(\epsilon\) is the error term.
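As a minimal sketch of how the intercept and slope can be estimated, the least-squares estimates have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is chosen so the line passes through the point of means. The vectors `x` and `y` below are the six TV and sales values printed by `head(data)` in the Advertising example; they are used only for illustration, and six points are far too few to say anything about the full dataset.

```r
# Toy data: the six TV / sales pairs printed by head(data) in the Advertising example
x <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7)   # TV budget
y <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)       # sales

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# lm() should reproduce the same two numbers
fit <- lm(y ~ x)
print(c(intercept = b0, slope = b1))
print(coef(fit))
```

The two printed vectors should agree up to floating-point error, since `lm()` solves the same least-squares problem.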

Assumptions of Linear Regression

  • Linearity: The relationship between \(x\) and \(y\) is linear.
  • Independence: The observations are independent of one another.
  • Homoscedasticity: The errors have constant variance across all values of \(x\).
  • Normality: The errors are normally distributed.
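Two of these assumptions can be probed numerically once a model is fitted. The sketch below is one possible approach, not a complete diagnostic procedure: it simulates data that satisfies the assumptions by construction (the coefficients, sample size, and seed are hypothetical choices), then applies a Shapiro-Wilk test to the residuals and a rough check of whether their spread changes with the fitted values. Linearity and independence are not tested here; they are usually judged from plots and from how the data were collected.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data that satisfies the assumptions by construction (hypothetical values)
n <- 100
x <- runif(n, 0, 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(res)

# Homoscedasticity (rough check): correlation between |residuals| and fitted values;
# values far from 0 hint that the spread changes with the fitted value
spread_cor <- cor(abs(res), fitted(fit))

print(sw$p.value)
print(spread_cor)
```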

Example: Advertising Dataset

As an example, consider the Advertising dataset. The response variable is sales (the sales column), and the predictor is the TV advertising budget (the TV column).

data <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/Advertising.csv")
head(data)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2

Scatter Plot of Sales vs TV Advertising Budget

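library(ggplot2)  # provides ggplot(), aes(), geom_point()

# Scatter plot of sales against the TV advertising budget
ggplot(data, aes(x = TV, y = sales)) +
  geom_point() +
  ggtitle("Sales vs TV Advertising Budget") +
  xlab("TV Advertising Budget (in thousands)") +
  ylab("Sales (in thousands)")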

Fitting the Linear Model

We fit a linear regression model to predict sales based on the TV advertising budget.

Fitted equation:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the intercept and slope estimated from the data, and \(\hat{y}\) is the predicted value of sales for a TV budget of \(x\).

model <- lm(sales ~ TV, data=data)
summary(model)
## 
## Call:
## lm(formula = sales ~ TV, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
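To make the summary easier to read, the sketch below recomputes a few of its entries from the reported numbers alone. All inputs are copied from the output above; the variable names and the choice of a TV budget of 100 for the prediction are illustrative assumptions. Because the inputs are rounded, the results match the printed output only approximately.

```r
# Values copied from the summary(model) output above
b0    <- 7.032594   # intercept estimate
b1    <- 0.047537   # TV slope estimate
se_b1 <- 0.002691   # standard error of the slope
df    <- 198        # residual degrees of freedom
r2    <- 0.6119     # multiple R-squared

# t value for the slope: estimate divided by its standard error
t_b1 <- b1 / se_b1

# 95% confidence interval for the slope
ci_b1 <- b1 + c(-1, 1) * qt(0.975, df) * se_b1

# With a single predictor, F = R^2 / (1 - R^2) * residual df
f_stat <- r2 / (1 - r2) * df

# Predicted sales for a TV budget of 100 (same units as the TV column)
pred_100 <- b0 + b1 * 100

print(round(c(t = t_b1, F = f_stat, pred = pred_100), 3))
print(round(ci_b1, 4))
```

The slope says that each additional unit of TV budget is associated with an increase of about 0.0475 units of sales, and the R-squared says that TV budget alone accounts for roughly 61% of the variance in sales.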

3D Plot of Fitted Values and Residuals

library(plotly)  # provides plot_ly(), layout(), and the %>% pipe

# 3D scatter: TV budget (x), fitted sales (y), residuals (z)
plot_ly(x = data$TV, y = fitted(model), z = residuals(model),
        type = "scatter3d", mode = "markers") %>%
  layout(title = "3D Plot of Fitted Values and Residuals",
         scene = list(xaxis = list(title = "TV Advertising"),
                      yaxis = list(title = "Fitted Sales"),
                      zaxis = list(title = "Residuals")))

Residuals Plot

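We plot the residuals against the fitted values to check the model assumptions. If the assumptions hold, the points should scatter evenly around zero, with no visible trend or funnel shape.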

ggplot(data, aes(x=fitted(model), y=residuals(model))) + 
  geom_point() + 
  ggtitle("Residuals vs Fitted Values") +
  xlab("Fitted Sales") +
  ylab("Residuals")
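For comparison, here is a sketch of what a problematic residual plot can look like. It uses base R graphics rather than ggplot2 so that it runs without extra packages, and the data are simulated with a deliberately curved relationship (the functional form, noise level, and seed are hypothetical choices). The residuals-vs-fitted plot should then show a systematic arch around the zero line, and a normal Q-Q plot is added as a visual check of the normality assumption.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical data with a curved relationship, so a straight-line fit is misspecified
x <- seq(1, 100, length.out = 80)
y <- 5 + 2 * sqrt(x) + rnorm(80, sd = 0.5)

fit <- lm(y ~ x)

# Residuals vs fitted: a reference line at zero makes systematic patterns easier to see
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line are consistent with normal residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
```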

Conclusion

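Simple linear regression is a useful statistical technique for modeling the relationship between two continuous variables. Using the Advertising dataset, we fitted a model that predicts sales from the TV advertising budget: the estimated slope is 0.0475, and the model explains about 61% of the variance in sales (R-squared = 0.6119). We then plotted the residuals against the fitted values to check the model assumptions.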
