MAS 261 - Lecture 24

Introduction to Simple Linear Regression

Author

Penelope Pooler Eisenbies

Published

November 12, 2025

Housekeeping

Today’s plan
- Comments about Quiz 2 and R 🪄
- Introduction to Simple Linear Regression
  - Function vs. Model
  - Examining Real Data
  - Creating a Model
  - Interpreting a Regression Model

Upcoming Dates

I will check and recheck solutions and post grades on Monday or Tuesday.
After tests and solutions are posted:
- Please go through your test carefully
  - If you missed a question due to a typo, please let me know.
  - I would be happy to go through any questions you missed with you.
HW 8 is now posted but is not due until after Thanksgiving.
In-person Final Exam is on Friday, 12/12/25 at 5:15 PM

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- For those who want to go further with R/RStudio:
  - If you are interested in downloading R and RStudio to your own computer, I can guide you through the process.
  - The software is completely free, but it does have to be updated a couple of times each year.

Lecture 24 In-class Exercises - Q1-Q2

Poll Everywhere - My User Name: penelopepoolereisenbies685

Import the data and find the average rate of return (expected value) and volatility for a portfolio that invests 75% in Starbucks (SBUX) and 25% in Nestle (NSRGY).

Use stock adjusted close data from 1/1/25 to 11/1/25.

```{r echo=T, eval=F}
# getSymbols() comes from the quantmod package (also attached by tidyquant)
library(quantmod)
getSymbols("SBUX", from = "2025-01-01", to = "2025-11-01")
getSymbols("NSRGY", from = "2025-01-01", to = "2025-11-01")
```

Question 1: What is the average rate of return or expected value of this coffee portfolio? Round answer to two decimal places.

Question 2: What is the volatility of this coffee portfolio? Round answer to two decimal places.

NOTE: The final exam and HW 8 will include questions like this.

Average Rate of Return questions ask for a weighted average and could include three or more stocks.
Volatility questions require calculating covariances and variances and will only include two stocks, at most.
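The weighted-average and two-stock volatility calculations described above can be sketched in R as follows. The return values are made up for illustration (they are not SBUX or NSRGY data) and the variable names are hypothetical. This is one common way to set up the calculation, not necessarily the exact workflow used in the course files.

```r
# Hypothetical daily returns (in %) for two stocks; NOT real SBUX/NSRGY data
r1 <- c(0.5, -1.2, 0.8, 1.1, -0.4, 0.3)
r2 <- c(0.2, -0.3, 0.1, 0.4, -0.1, 0.2)

# Portfolio weights: 75% in stock 1, 25% in stock 2
w1 <- 0.75
w2 <- 0.25

# Expected value: weighted average of the two mean returns
port_mean <- w1 * mean(r1) + w2 * mean(r2)

# Variance of a two-stock portfolio uses both variances and the covariance
port_var <- w1^2 * var(r1) + w2^2 * var(r2) + 2 * w1 * w2 * cov(r1, r2)

# Volatility is the standard deviation (square root of the variance)
port_sd <- sqrt(port_var)

round(c(mean = port_mean, volatility = port_sd), 2)
```

With three or more stocks, the expected value is still a weighted average of the individual means, which is why only the volatility calculation is limited to two stocks.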

Models vs. Functions

In high school algebra, the concept of a function, $y=f(x)$, is covered.

For example, a function that most people recall from high school is

\[y=x^2\]

How does this function appear?

Functions are Mathematical relationships

Every point is exactly on the line
No points are above or below the line
BOTH the points and the line were generated with the same function

Function of a LINE

While covering functions, a common topic is the function of a line

\[y = mx + b\]

m is the slope of the line
b is the y-intercept

Examples:
- Positive slope: $y = 2x + 3$
- Negative slope: $y = -3x + 7$
- Y axis range is the same on both plots.
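As a quick check of what slope and y-intercept mean for an exact function, here is a minimal sketch using the positive-slope example above. The name `line_fun` is a hypothetical helper introduced only for this illustration.

```r
# The positive-slope example from above: y = 2x + 3
line_fun <- function(x) 2 * x + 3

x <- seq(-2, 10, by = 0.5)
y <- line_fun(x)

# For a function of a line, rise over run is identical between every pair
# of neighboring points, and it equals the slope m
slopes <- diff(y) / diff(x)
unique(slopes)

# The y-intercept b is the value of y when x = 0
line_fun(0)
```

Because every point is generated by the same rule, there is only one distinct slope value, no matter which pair of points is used.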

Models ARE NOT Functions

Favorite Quote attributed to George Box:

“All models are wrong, but some are useful.”

Common student query:

If all models are wrong, why do we bother modeling?

Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.

Models can’t (and shouldn’t) include all the noise of real world data

BUT models are still useful in understanding how variables are related to each other.
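To see the difference between a function and noisy data, the sketch below simulates both from the same underlying line. The seed, the line $y = 2x + 3$, and the noise level are arbitrary choices made for illustration, not values taken from any data set in this lecture.

```r
set.seed(261)  # arbitrary seed so the simulated noise is reproducible

x <- seq(-2, 10, by = 0.5)

# Function: every y value is produced exactly by the rule
y_function <- 2 * x + 3

# "Real-world" data: same underlying relationship plus random noise
# (the noise standard deviation of 4 is an arbitrary choice)
y_noisy <- 2 * x + 3 + rnorm(length(x), mean = 0, sd = 4)

# Distances between the noisy points and the line are not zero
summary(y_noisy - y_function)

# The relationship is still clearly visible despite the noise
cor(x, y_noisy)
```

The noisy points fall above and below the line, yet the correlation shows that x still tells us a lot about y.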

Examples of Models of Noisy Data

No. of Bedrooms helps explain selling price
MANY other factors affect selling price
- Location
- Size
- Age

Mileage helps explain resale price
MANY other factors affect resale price
- Model
- Maintenance and Climate

One More Example

Years of Education helps explain income
Many other factors do too:
- Major
- College
- Employer
So what do we do about all this noise?
- As Box would say, we “worry selectively”.
- A strong relationship is still useful and informative.
- In a later lecture, we will talk about adding more variables to a model.

Lecture 24 In-class Exercises - Q3

To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.

Here is the full recipe.

Here is the equation (y-intercept = 0), where $x$ is tablespoons of powdered sugar and $y$ is the number of cookies:

$y = 6x$

Is this a function or a model?

Lecture 24 In-class Exercises - Q4

The scatterplot and line show the relationship between height and mass for all Star Wars characters for whom data were available.

Question 4: Is the relationship shown here a model or a function?

Follow up Question (not on Point Solutions):

What is a good way to determine this?

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

$\beta_{0}$ is the y-intercept
$\beta_{1}$ is the slope
$e$ is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

$\hat{y}$ is model estimate of y from x
$b_{0}$ is model estimate of y-intercept
$b_{1}$ is model estimate of slope

Each $e_{i}$ is a residual.
- observed y minus the regression estimate of y
- $e_{i} = y_{i} - \hat{y}_{i}$
Software estimates model with smallest sum of all squared residuals
- minimizes $\sum_{i=1}^{n} e_{i}^2$
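The least-squares idea above can be sketched by hand and compared with `lm()`. The five data points are made up for illustration, and the closed-form formulas used for `b1` and `b0` are the standard simple linear regression estimates, shown here as an assumption about what the software computes rather than something stated in the lecture.

```r
# Small made-up data set, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)        # estimated slope
b0 <- mean(y) - b1 * mean(x)    # estimated y-intercept

# Residuals: observed y minus the regression estimate of y
y_hat <- b0 + b1 * x
e <- y - y_hat
sse <- sum(e^2)                 # the quantity that is minimized

# lm() should give the same estimates
fit <- lm(y ~ x)
coef(fit)

# Moving the slope away from b1 makes the sum of squared residuals larger
sse_other <- sum((y - (b0 + (b1 + 0.1) * x))^2)
c(sse = sse, sse_other = sse_other)
```

Any other choice of intercept and slope gives a larger sum of squared residuals than the least-squares line.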

Function of a Line vs. Regression Model

Function of a Line

\[y = mx + b\]

Exact precise mathematical relationship with NO NOISE:

Regression Model Equation

\[\hat{y} = b_{0} + b_{1}x\]

Estimated line that is simultaneously as close as possible to all observations.

Interpreting a Regression Model

\[\hat{y} = b_{0} + b_{1}x\]

$\hat{y}$ is regression est. of y
$b_{0}$ is the estimated value of y when x = 0
- NOT always meaningful
$b_{1}$ is the estimated change in y associated with a 1-unit change in x.
- unit depends on data
NOTE:
- Model is only valid for the range of X values used to estimate it.
- Using a model to estimate a value outside of this range is referred to as extrapolation and this estimate is invalid.
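The range restriction in the note above can be checked in code before trusting an estimate. This sketch uses R's built-in `mtcars` data as a stand-in (it is not the lecture's data), and `in_range()` is a hypothetical helper written only for this example.

```r
# Stand-in data: R's built-in mtcars (NOT the lecture's gt_cars data)
fit <- lm(mpg ~ hp, data = mtcars)

# Range of x values that were used to estimate the model
hp_range <- range(mtcars$hp)

# Hypothetical helper: TRUE if a new x value falls inside the observed range
in_range <- function(x_new, x_range) {
  x_new >= x_range[1] & x_new <= x_range[2]
}

new_hp <- c(100, 200, 500)

# predict() will return a number for any hp, so the range check is up to us
data.frame(
  hp       = new_hp,
  estimate = predict(fit, newdata = data.frame(hp = new_hp)),
  in_range = in_range(new_hp, hp_range)
)
```

R does not warn about extrapolation on its own, so an estimate flagged `FALSE` here should not be reported as a valid model estimate.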

Specifying the Model in R

```{r echo=T}
hp_mod <- lm(mpg_h ~ hp, data=gt_cars)
hp_mod$coefficients
```

(Intercept)          hp 
33.86410831 -0.02241685

\[\hat{y} = 33.8641 - 0.022417x\]
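Once the coefficients are known, the equation can be used directly for estimates, changes, and residuals. The sketch below hard-codes the rounded coefficients reported above. The 400 hp car and its observed 26 highway MPG are hypothetical values chosen for illustration, not rows from `gt_cars`.

```r
# Coefficients as reported in the lecture output above (rounded)
b0 <- 33.8641     # intercept
b1 <- -0.022417   # slope: change in estimated highway MPG per 1 hp

# Estimated highway MPG for a hypothetical 400 hp car
hp_new <- 400
mpg_hat <- b0 + b1 * hp_new
mpg_hat            # 24.8973

# Estimated change in highway MPG for a 50 hp increase
b1 * 50            # -1.12085

# Residual for a hypothetical car with 400 hp and an observed 26 highway MPG
mpg_obs <- 26
mpg_obs - mpg_hat  # 1.1027
```

A positive residual means the observed value sits above the regression line, and a negative residual means it sits below.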

Lecture 24 In-class Exercises - Q5-Q6

Regression Model:

\[\hat{y} = 33.8641 - 0.022417x\]

Question 5. Based on this model, if Horsepower (x) is increased by 1, what is the change in Highway MPG?

Round answer to six decimal places

Question 6. Based on this model, if Horsepower (x) is increased by 20 (which is more realistic), what is the change in Highway MPG?

Round answer to 3 decimal places.

Lecture 24 In-class Exercises - Q7-Q8

Regression Model:

\[\hat{y} = 33.8641 - 0.022417x\]

Question 7. If HP is 600, what is the estimated Highway MPG?

Question 8. What is the residual for the 2016 Aston Martin Vantage?

Follow up Question (not on Point Solutions): Does the intercept have a real-world interpretation in this model?

Key Points from Today

Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is very different because SLR models are a simplification of the real world.
Box said “All models are wrong, but some are useful.”
This refers to the inherent simplification of modeling that leaves out the noise of the real world.
Despite this simplification, models provide valuable insight.
A model is only valid for the range of data used to create it.
- Outside of that range, we are extrapolating, which is invalid.

To submit an Engagement Question or Comment about material from Lecture 24: Submit it by midnight today (day of lecture).
