Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship between two continuous variables:

  • Dependent variable (Y): The outcome we want to predict
  • Independent variable (X): The predictor variable

The goal is to find the straight line through the data points that minimizes the sum of squared prediction errors.

The Linear Model

The mathematical representation of simple linear regression is:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

Where:

  • \(Y_i\) is the observed value of the dependent variable
  • \(X_i\) is the value of the independent variable
  • \(\beta_0\) is the y-intercept (constant term)
  • \(\beta_1\) is the slope coefficient
  • \(\epsilon_i\) is the random error term, assumed independent across observations with \(\epsilon_i \sim N(0, \sigma^2)\)
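To make the roles of the terms concrete, the model can be used to simulate data. The following is a minimal sketch; the parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are illustrative assumptions and are not taken from the example analyzed later.

```r
# Simulate observations from Y_i = beta0 + beta1 * X_i + eps_i
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(1)                    # make the random draws reproducible

n     <- 50                    # number of observations (assumed)
beta0 <- 2                     # intercept (assumed)
beta1 <- 0.5                   # slope (assumed)
sigma <- 1                     # error standard deviation (assumed)

x   <- runif(n, min = 0, max = 10)        # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)     # errors drawn from N(0, sigma^2)
y   <- beta0 + beta1 * x + eps            # responses implied by the model

sim <- data.frame(x = x, y = y)
head(sim)
```

Each simulated response is the point on the line `beta0 + beta1 * x` plus its own random error, which is exactly what the model equation states.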

Least Squares Estimation

The coefficients are estimated by choosing the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals (SSR):

\[SSR = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2\]
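A brief sketch of the standard calculus argument: SSR is minimized where its partial derivatives with respect to \(\beta_0\) and \(\beta_1\) are both zero, which gives the two normal equations:

\[\frac{\partial SSR}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i) = 0\]

\[\frac{\partial SSR}{\partial \beta_1} = -2\sum_{i=1}^{n}X_i(Y_i - \beta_0 - \beta_1 X_i) = 0\]

The first equation gives \(\beta_0 = \bar{Y} - \beta_1\bar{X}\). Substituting this into the second, and using \(Y_i - \beta_0 - \beta_1 X_i = (Y_i - \bar{Y}) - \beta_1(X_i - \bar{X})\), leaves:

\[\sum_{i=1}^{n}X_i(Y_i - \bar{Y}) - \beta_1\sum_{i=1}^{n}X_i(X_i - \bar{X}) = 0\]

Because \(\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0\) and \(\sum_{i=1}^{n}(X_i - \bar{X}) = 0\), each \(X_i\) outside the parentheses can be replaced by \(X_i - \bar{X}\) without changing the sums, and solving for \(\beta_1\) yields the ratio of the centered cross-products to the centered sum of squares.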

The values that minimize SSR, the least squares estimates, are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

Example Dataset: Advertising and Sales

Let’s analyze the relationship between advertising spending (in thousands of dollars) and product sales (in thousands of units).

##   advertising    sales
## 1    35.88198 31.62384
## 2    80.94746 83.36552
## 3    46.80792 53.99253
## 4    89.47157 73.52500
## 5    94.64206 96.27996
## 6    14.10008 33.28177
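As an illustration, the closed-form estimates from the Least Squares Estimation section can be computed by hand on the six rows printed above and compared with `lm()`. This is a sketch on those six rows only (the data frame name `d` is an assumption), so the resulting coefficients will not match the full-data model fitted later.

```r
# The six rows printed above, entered by hand (NOT the full dataset)
d <- data.frame(
  advertising = c(35.88198, 80.94746, 46.80792, 89.47157, 94.64206, 14.10008),
  sales       = c(31.62384, 83.36552, 53.99253, 73.52500, 96.27996, 33.28177)
)

x_bar <- mean(d$advertising)
y_bar <- mean(d$sales)

# Slope: centered cross-products divided by centered sum of squares
b1 <- sum((d$advertising - x_bar) * (d$sales - y_bar)) /
      sum((d$advertising - x_bar)^2)

# Intercept: forces the fitted line through (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

# Compare with R's built-in least squares fit on the same six rows
fit <- lm(sales ~ advertising, data = d)
all.equal(unname(coef(fit)), c(b0, b1))
```

The hand-computed values and the `lm()` coefficients should agree up to floating-point tolerance, since `lm()` solves the same least squares problem.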

Scatter Plot with Regression Line

Model Results

# Fit the linear regression model
model <- lm(sales ~ advertising, data = data)

# Display summary
summary(model)
## 
## Call:
## lm(formula = sales ~ advertising, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0462  -4.4629  -0.5254   4.3941  17.4890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.52185    2.53874   7.296 2.59e-09 ***
## advertising  0.73393    0.04058  18.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.523 on 48 degrees of freedom
## Multiple R-squared:  0.8721, Adjusted R-squared:  0.8694 
## F-statistic: 327.2 on 1 and 48 DF,  p-value: < 2.2e-16
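Several numbers in this summary are tied together by simple arithmetic, which is a useful sanity check when reading regression output. The sketch below copies the rounded values from the output above (the variable names are assumptions), so the results agree with the printed figures only up to rounding.

```r
# Values copied from the summary() output above (rounded as printed)
est_slope <- 0.73393   # slope estimate
se_slope  <- 0.04058   # standard error of the slope
f_stat    <- 327.2     # F-statistic
r2        <- 0.8721    # multiple R-squared
df_resid  <- 48        # residual degrees of freedom
n         <- df_resid + 2   # two coefficients were estimated, so n = 50

# t value = estimate / standard error
t_slope <- est_slope / se_slope

# With a single predictor, the F-statistic equals the squared slope t value
f_from_t <- t_slope^2

# R-squared can be recovered from F and the residual degrees of freedom
r2_from_f <- f_stat / (f_stat + df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

round(c(t = t_slope, F = f_from_t, R2 = r2_from_f, adjR2 = adj_r2), 4)
```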

Interactive 3D Visualization

Residual Analysis

Model Interpretation

From our analysis:

  • Intercept (\(\hat{\beta}_0\)): 18.52 - Expected sales of about 18,520 units (18.52 thousand) when advertising spending is zero
  • Slope (\(\hat{\beta}_1\)): 0.734 - For each $1,000 increase in advertising, expected sales increase by approximately 734 units (0.734 thousand)
  • R-squared: 0.872 - About 87.2% of the variation in sales is explained by advertising spending
  • p-value: < 0.001 (reported as < 2e-16) - The slope differs significantly from zero, so the linear relationship is statistically significant
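To make these interpretations concrete, the fitted line can be evaluated directly from the reported coefficients. This is a sketch: the helper `predict_sales` and the spending level of 50 (that is, $50,000) are hypothetical choices for illustration, not part of the original analysis.

```r
# Coefficients as reported in the model summary
b0 <- 18.52185   # intercept, in thousands of units
b1 <- 0.73393    # slope, thousands of units per $1,000 of advertising

# Hypothetical helper: predicted sales (thousands of units) for a given
# advertising spend (thousands of dollars)
predict_sales <- function(advertising) {
  b0 + b1 * advertising
}

# Predicted sales at a hypothetical spend of $50,000
predict_sales(50)

# Raising spend by $1,000 changes the prediction by exactly the slope
predict_sales(51) - predict_sales(50)
```

With the fitted model object available, `predict(model, newdata = data.frame(advertising = 50))` would give the same point prediction from the unrounded coefficients.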

The model suggests a strong positive relationship between advertising expenditure and sales.