2025-04-13

Simple Linear Regression

Attempts to predict the outcome of a dependent variable from a single independent variable

Linear Regression equation \(y = \beta_0 + \beta_1 x + \epsilon\)

  • \(y\) is the dependent variable
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the regression coefficient (slope)
  • \(x\) is the independent variable
  • \(\epsilon\) is the error term

Using the Least Squares Method to minimize the sum of squared residuals

  • \(\beta_1 = \frac{n\sum(xy) - {\sum}x{\sum}y}{n{\sum}x^2 - ({\sum}x)^2}\)

  • \(\beta_0 = \frac{{\sum}y - \beta_1{\sum}x}{n}\)
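
As a quick check, these formulas can be computed by hand in R. Below is a minimal sketch using base R's built-in faithful dataset (the Old Faithful data fit later in these notes); the variable names x, y, n, beta0, and beta1 are illustrative.

x = faithful$waiting
y = faithful$eruptions
n = length(x)
beta1 = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)  # slope
beta0 = (sum(y) - beta1 * sum(x)) / n                                   # intercept
c(beta0, beta1)                                  # manual least squares estimates
coef(lm(eruptions ~ waiting, data = faithful))   # same values from lm()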

Best Fit Line

The best fit line is the equation that best represents the relationship between the dependent and independent variables; it is used to identify trends and correlations and to make predictions.

[Plot: scatter of the data with its best fit line, drawn by `geom_smooth()` using formula = 'y ~ x']
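
The message above suggests the plot was made with ggplot2; a minimal sketch that would produce such a plot for the faithful data (an assumption, since the original plotting code is not shown) is:

library(ggplot2)
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # adds the best fit line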

Residuals

A residual is the difference between an observed value and its predicted value:

\(\epsilon = y - \hat{y}\)

The residuals help us determine how well the model fits the data. For the Old Faithful eruption data, our best fit line was defined by

\(y = -1.87 + 0.08x + \epsilon\)

with the error term \(\epsilon\) assumed to have mean 0. Using the built-in `lm()` function in R, we can find the residuals and, more importantly, the sum of squared residuals (SSE), which quantifies the unexplained variability in the model. A lower SSE means the model fits the data more closely. Below is the SSE for the best fit line of the Old Faithful data.

faithful_model = lm(eruptions ~ waiting, data = faithful)  # coefficients match the line above
epsilon = resid(faithful_model)            # residuals: observed - predicted
sum_residuals = sum(epsilon)               # ~0 by construction of least squares
sum_squared_residuals = sum(epsilon^2)     # SSE

\(\sum{\epsilon} = 6.9735884 \times 10^{-16} \approx 0\)

\(SSE = \sum{\epsilon^2} = 66.5617757\)
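
Equivalently, for lm fits R exposes the residual sum of squares directly through deviance(); a one-line check against the sum above, assuming the faithful_model fit from earlier:

deviance(faithful_model)  # residual sum of squares (SSE) for an lm fit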

Homoscedasticity

Homoscedasticity describes constant variance of residuals, a key assumption of linear regression, which ensures that statistical inferences are valid and unbiased. Visually, homoscedasticity shows up as a uniform cloud of points evenly distributed around zero with no discernible pattern.
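
A quick visual check is a residuals-vs-fitted plot; below is a minimal sketch reusing the faithful_model fit from earlier.

# Homoscedastic residuals form an even band around the dashed zero line
plot(fitted(faithful_model), resid(faithful_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)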

Heteroscedasticity

Heteroscedasticity describes non-constant variance of the residuals that can frequently be visually identified from the data points fanning out in a cone shape (as demonstrated in the plot of simulated income vs consumption data below). Non-constant variance violates a key assumption of linear regression.
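
The simulated sim_data itself is not reproduced in these notes; a hypothetical simulation that yields this cone shape (with variable names matching the model fit below) might look like:

set.seed(1)  # hypothetical data; the notes' actual sim_data is not shown
consumption = runif(200, 1, 10)
income = 2 + 3 * consumption + rnorm(200, sd = consumption)  # noise grows with consumption
sim_data = data.frame(income, consumption)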

Detecting Homoscedasticity

Usually, homoscedasticity can be assessed visually (checking whether the residuals show any discernible pattern). You can also use formal tools such as the Breusch-Pagan test on the variance of the residuals. Assuming normally distributed residuals, a p-value \(< 0.05\) rejects the null hypothesis of constant variance, indicating heteroscedasticity. The lmtest package in R helps us do this calculation.

library(lmtest)  # provides bptest()
income_model = lm(income ~ consumption, data = sim_data)
income_result = bptest(income_model)  # Breusch-Pagan test for heteroscedasticity
print(income_result)
## 
##  studentized Breusch-Pagan test
## 
## data:  income_model
## BP = 11.487, df = 1, p-value = 0.0007008

With \(p =\) 0.0007 \(< 0.05\), we reject the null hypothesis of constant variance, confirming the heteroscedasticity visible in the income vs consumption plot above.

Correlation

Regression alone cannot prove causation; it can only establish correlation. If the variables are normally distributed, one way to evaluate the correlation from a linear regression is the Pearson Correlation Coefficient

\(r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\cdot\sum(y_i-\bar{y})^2}}\)

\(r\) ranges from \(-1\) to \(1\): \(0\) indicates no linear relationship, while \(+1\) and \(-1\) indicate a perfect positive or perfect negative linear relationship, respectively. The closer \(r\) is to \(0\), the weaker the linear relationship between the independent and dependent variables.

For our Old Faithful dataset, we have the following Pearson Correlation Coefficient

  • Old Faithful Eruption vs Waiting: \(r =\) 0.9008112
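
In R, this value comes from cor(), which computes the Pearson coefficient by default; a one-line sketch reproducing the value above:

cor(faithful$eruptions, faithful$waiting)  # Pearson by default; ~0.9008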