---
title: "Simple Linear Regression: From Visualization to Inference"
output: 
  html_notebook:
    toc: true
    toc_float: true
  
---

```{r}
# Global options for R chunks
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE)

# Load essential libraries
library(ggplot2)
library(dplyr)
```

## Introduction: Modeling Relationships
Welcome to this module on Simple Linear Regression (SLR). Our goal in regression is to formally model the linear relationship between two quantitative variables:  

1. Response Variable ($y$): The variable we are trying to predict or explain.  

2. Explanatory Variable ($x$): The variable used to predict the response. 

We start with the built-in `mtcars` dataset, examining the relationship between vehicle weight (`wt`, in 1000s of lbs) and fuel efficiency (`mpg`, miles per gallon). The correlation and modeling sections then switch to the built-in `cars` dataset.

```{r}
# Display the first few rows of the relevant variables
mtcars %>% 
  select(wt, mpg) %>%
  head()
```

## 1. Visualization: Scatterplots  
Before modeling, we must always visualize the relationship. A scatterplot is the foundational tool for this, allowing us to assess the form (linear, curved), direction (positive, negative), and strength of the association, as well as identify potential outliers.

We use ggplot2 to create an elegant and informative scatterplot.

```{r}
# Create the scatterplot of mpg against wt
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#0D47A1") +
  labs(
    title = "Fuel Efficiency vs. Vehicle Weight",
    x = "Vehicle Weight (1000s of lbs)",
    y = "Miles Per Gallon (MPG)"
  ) +
  theme_minimal(base_size = 14)
```

__Observation:__ The plot suggests a strong, negative, and reasonably linear relationship. As weight increases, MPG tends to decrease.

## 2. Quantifying Association: Correlation
The Pearson product-moment correlation coefficient, *r*, quantifies the __strength__ and __direction__ of the linear association between two quantitative variables.


 - __*r*__ is always between −1 and +1.

 - __*r*__ > 0 indicates a positive association.

 - __*r*__ < 0 indicates a negative association.

 - __*|r|*__ close to 1 indicates a strong linear relationship.

 - __*|r|*__ close to 0 indicates a weak linear relationship.

Formula:

$$ r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$


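Equivalently, the correlation is the average product of the standardized values (z-scores) of $x$ and $y$, where the average divides by $n - 1$ rather than $n$: $r = \frac{1}{n-1}\sum z_{x_i} z_{y_i}$.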

Correlation does not imply [causation](https://www.youtube.com/watch?v=t8ADnyw5ou8)!

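Let's calculate *r*. From this point on, the examples use R's built-in `cars` dataset, which records speed (`speed`, in mph) and stopping distance (`dist`, in ft) for 50 cars.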

```{r}
# Calculate the correlation coefficient
correlation_r <- cor(cars$speed, cars$dist)

cat("The correlation coefficient (r) is:", round(correlation_r, 4), "\n")
```

Interpretation: An *r* of about 0.81 indicates a strong, positive linear relationship: cars traveling faster tend to need longer stopping distances.
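
As a check on the z-score description of *r* given earlier, the chunk below rebuilds the coefficient by hand. It is a minimal sketch: the names `z_speed`, `z_dist`, and `r_by_hand` are illustrative, and it assumes the sample standard deviation (the $n - 1$ divisor that `sd()` uses).

```{r}
# Standardize each variable; sd() uses the n - 1 divisor
z_speed <- (cars$speed - mean(cars$speed)) / sd(cars$speed)
z_dist  <- (cars$dist  - mean(cars$dist))  / sd(cars$dist)

# Sum the products of the paired z-scores and divide by n - 1
n <- nrow(cars)
r_by_hand <- sum(z_speed * z_dist) / (n - 1)

# Compare with the built-in function; the two should agree
all.equal(r_by_hand, cor(cars$speed, cars$dist))
```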

## 3. Simple Linear Regression (SLR): Modeling the Relationship

A Simple Linear Regression model fits a straight line to the data to predict a response variable (Y) from an explanatory variable (X).

The model is defined by the Least-Squares Regression Line (LSRL):


$$ \hat{y} = b_0 + b_1 x $$


 - $\hat{y}$  is the predicted value of the response variable.

 - $b_0$ is the y-intercept (the predicted value of Y when X=0).
 
 - $b_1$ is the slope (the change in $\hat{y}$ for a one-unit increase in $X$).

The line is found by minimizing the sum of the squared vertical distances from the points to the line, that is, the sum of the squared residuals $e_i = y_i - \hat{y}_i$.
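
To make the least-squares criterion concrete, the sketch below compares the sum of squared residuals at the fitted line with the same quantity for two nearby lines. The helper `sse()`, the object `fit_cars`, and the two perturbations (a slope steeper by 0.5 and an intercept shifted by 5) are arbitrary choices made for illustration.

```{r}
# Sum of squared residuals for a candidate line y-hat = b0 + b1 * x
sse <- function(b0, b1, x = cars$speed, y = cars$dist) {
  sum((y - (b0 + b1 * x))^2)
}

# Least-squares coefficients from lm() (fit_cars is an illustrative name)
fit_cars <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit_cars))

# SSE at the fitted line versus two arbitrarily perturbed lines
sse_fit     <- sse(b[1], b[2])
sse_steeper <- sse(b[1], b[2] + 0.5)
sse_shifted <- sse(b[1] + 5, b[2])

c(fitted = sse_fit, steeper = sse_steeper, shifted = sse_shifted)
```

Any other line tried here should give a larger value than the fitted one, since the fitted line is the minimizer.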

### Calculating the Coefficients

The least-squares estimates for the slope ($b_1$) and intercept ($b_0$) are:

 - $b_1 = r\left(\frac{s_y}{s_x}\right)$, where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$

 - $b_0 = \bar{y} - b_1\bar{x}$
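
The coefficient formulas can be checked numerically. The sketch below computes the slope as $r \cdot s_y / s_x$ and the intercept as $\bar{y} - b_1\bar{x}$ for the `cars` data, then compares them with what `lm()` reports. The names `b1_by_hand` and `b0_by_hand` are illustrative.

```{r}
x <- cars$speed
y <- cars$dist

# Slope: correlation times the ratio of the response SD to the explanatory SD
b1_by_hand <- cor(x, y) * sd(y) / sd(x)

# Intercept: forces the line through the point (x-bar, y-bar)
b0_by_hand <- mean(y) - b1_by_hand * mean(x)

# Compare with the coefficients reported by lm()
rbind(
  by_hand = c(intercept = b0_by_hand, slope = b1_by_hand),
  lm      = unname(coef(lm(dist ~ speed, data = cars)))
)
```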
 
### Running the Model in R

In R, we use the `lm()` (linear model) function.

```{r}
# Fit the linear model: dist is predicted by speed
slr_model <- lm(dist ~ speed, data = cars)

# View the model summary
summary(slr_model)
```

### Focus on the Coefficients:

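 - The Intercept ($b_0 \approx -17.58$) is the predicted stopping distance at a speed of 0 mph. A negative distance is not physically meaningful; 0 mph lies outside the observed speeds, so the intercept mainly positions the line rather than describing a real baseline.

 - The Slope ($b_1 \approx 3.93$) is the estimated increase in mean stopping distance, in feet, for every 1 mph increase in speed.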

### Prediction Example 
What is the predicted stopping distance for a car traveling at 15 mph?

$$ \hat{y} = b_0 + b_1(15) $$

```{r}
# Use the predict function for a 15 mph car
new_data <- data.frame(speed = 15)
predicted_dist <- predict(slr_model, new_data)

cat("Predicted stopping distance at 15 mph:", round(predicted_dist, 2), "ft\n")
```
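
The same prediction can be reproduced directly from the fitted coefficients, which shows that `predict()` is evaluating $b_0 + b_1(15)$. This is a small sketch; `fit_cars`, `manual_pred`, and `from_predict` are illustrative names, and the model is refit so the chunk stands alone.

```{r}
fit_cars <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit_cars))

# Plug x = 15 into y-hat = b0 + b1 * x
manual_pred <- b[1] + b[2] * 15

# The value returned by predict() for the same speed
from_predict <- unname(predict(fit_cars, data.frame(speed = 15)))

all.equal(manual_pred, from_predict)

# 15 mph lies inside the observed speed range, so this is not an extrapolation
range(cars$speed)
```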

### Visualizing the Model
It's good practice to add the regression line to the scatterplot.

```{r}
# Scatterplot with the fitted regression line
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "darkblue", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "red") + # Add the LSRL
  labs(
    title = "SLR of Stopping Distance on Speed",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  ) +
  theme_minimal()
```

## 4. Inference for Regression: Making Generalizations
The goal of inference is to move beyond describing the sample data and draw conclusions about the larger population.

### Assumptions (The "LINE" Conditions)

For valid inference, the simple linear regression model assumes:

1. **L**inearity: The relationship between X and Y is linear.

2. **I**ndependence: The observations are independent.

3. **N**ormality: For any fixed X, the distribution of Y (and thus the residuals) is normal.

4. **E**qual Variance (Homoscedasticity): The variability of Y (and thus the residuals) is the same across all X values.

We check these assumptions primarily through residual plots and Normal Q-Q plots.
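
One way to draw these diagnostic plots is with base R's `plot()` method for `lm` objects, sketched below. The model is refit under the illustrative name `fit_cars` so the chunk stands alone. The side-by-side layout is a presentation choice, not a requirement.

```{r}
# Refit the model so this chunk is self-contained
fit_cars <- lm(dist ~ speed, data = cars)

# Show the two diagnostic plots side by side, restoring settings afterwards
old_par <- par(mfrow = c(1, 2))

# which = 1: residuals vs. fitted values
# (curvature suggests non-linearity; a fan shape suggests unequal variance)
plot(fit_cars, which = 1)

# which = 2: Normal Q-Q plot of the standardized residuals
# (points close to the reference line support the normality condition)
plot(fit_cars, which = 2)

par(old_par)
```

Independence is usually judged from how the data were collected rather than from a plot.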

### Hypothesis Testing for the Slope
The most common inference test is whether the slope in the population, $\beta_1$, is zero. If  $\beta_1 = 0$, there is no linear relationship between $X$ and $Y$.

 - Null Hypothesis ($H_0$): $\beta_1 = 0$ (No linear relationship)

 - Alternative Hypothesis ($H_a$): $\beta_1 \ne 0$ (A linear relationship exists)

The `summary(slr_model)` output provides the *t*-statistic and *p*-value for this test.

### T-statistic for $\beta_1$ 

$$ t = \frac{(b_1 - \beta_{1,0})}{SE_{b_1}}$$

 
 

where $\beta_{1,0}$ is the hypothesized value (usually 0).

In the `summary()` output, look at the row for speed:

```text
# Coefficients table from summary(slr_model)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -17.5791     6.7584  -2.601   0.0123 *
# speed         3.9324     0.4155   9.464 1.49e-12 ***
```

 - **_t_-value (9.464)**: The number of standard errors the sample slope ($b_1$) lies from zero. A large absolute *t* is evidence against $H_0$.

 - **Pr(>|_t_|) ($1.49 \times 10^{-12}$)**: The *p*-value. Since this is far below $\alpha = 0.05$, we reject $H_0$.

 - __Conclusion__: We have strong evidence that the true population slope ($\beta_1$) is not zero, meaning there is a statistically significant linear relationship between speed and stopping distance.
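
The *t*-statistic and *p*-value in the `speed` row can be rebuilt from the estimate and its standard error. The sketch below assumes a two-sided test of $\beta_1 = 0$ with $n - 2$ degrees of freedom. `fit_cars`, `t_stat`, `df_resid`, and `p_value` are illustrative names.

```{r}
fit_cars <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit_cars))

b1    <- coef_table["speed", "Estimate"]
se_b1 <- coef_table["speed", "Std. Error"]

# Test statistic under H0: beta_1 = 0
t_stat <- (b1 - 0) / se_b1

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
df_resid <- df.residual(fit_cars)
p_value  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, df = df_resid, p = p_value)
```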

### Confidence Intervals for the Slope

A Confidence Interval (CI) provides a range of plausible values for the true population slope $\beta_1$.

```{r}
# Calculate the 95% Confidence Interval for the coefficients
confint(slr_model, level = 0.95)
```

__Interpretation:__ We are 95% confident that for every 1 mph increase in speed, the true mean stopping distance increases by between about 3.10 and 4.77 ft, the bounds of the interval for the `speed` coefficient. Because the interval excludes zero, it agrees with rejecting $H_0$ at $\alpha = 0.05$.
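
The interval for the slope has the form $b_1 \pm t^* \cdot SE_{b_1}$, where $t^*$ is the critical value from a *t* distribution with $n - 2$ degrees of freedom. The sketch below rebuilds it by hand and compares the result with `confint()`. `fit_cars`, `t_star`, and `ci_by_hand` are illustrative names.

```{r}
fit_cars <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit_cars))

b1    <- coef_table["speed", "Estimate"]
se_b1 <- coef_table["speed", "Std. Error"]

# Critical value for a 95% interval: the 97.5th percentile of t with n - 2 df
t_star <- qt(0.975, df = df.residual(fit_cars))

# Estimate plus or minus the margin of error
ci_by_hand <- b1 + c(-1, 1) * t_star * se_b1
ci_by_hand

# Should match the speed row from confint()
confint(fit_cars, "speed", level = 0.95)
```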

## Conclusion

We've covered the full framework of Simple Linear Regression:

1. Visualize with a scatterplot.

2. Quantify the linear relationship with correlation (*r*).

3. Model the relationship with the LSRL using `lm()`.

4. Infer about the population slope ($\beta_1$) using t-tests and confidence intervals.

The next step is to apply this workflow to a new dataset and practice interpreting the `summary()` output.