---
title: "MAS 261 - Lecture 24"
subtitle: "Introduction to Simple Linear Regression"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# set default options for all R chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt,
mosaicData, epiDisplay, vistributions, psych, tidyquant, dygraphs)
# verify packages
# p_loaded()
```
- Today's plan
- Comments about Quiz 2 and R 🪄
- Introduction to Simple Linear Regression
- Function vs. Model
- Examining Real Data
- Creating a Model
- Interpreting a Regression Model
## Upcoming Dates
- I will check and recheck solutions and post grades on Monday or Tuesday.
- After tests and solutions are posted:
- Please go through your test carefully
- If you missed a question due to a typo, please let me know.
- I would be happy to go through any questions you missed with you.
- HW 8 is now posted but is not due until after Thanksgiving.
- **In-person Final Exam is on Friday, 12/12/25 at 5:15 PM**
## R and RStudio
- In this course we will use R and RStudio to understand statistical concepts.
- You will access R and RStudio through **Posit Cloud**.
- Sign up for a [Free Posit Cloud Account](https://posit.cloud/plans/free){target="_blank"}
- I will post R/RStudio files on Posit Cloud that you can access in provided links.
- I will also provide demo videos that show how to access files and complete exercises.
- NOTE: The free Posit Cloud account is limited to 25 hours per month.
- For those who want to go further with R/RStudio:
- If you are interested in downloading R and RStudio to your own computer, I can guide you through the process.
- The software is completely free, but it does have to be updated a couple of times each year.
##
### Lecture 24 In-class Exercises - Q1-Q2
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Import the data and find the average rate of return (expected value) and volatility for a portfolio that invests 75% in Starbucks (SBUX) and 25% in Nestle (NSRGY).
Use stock adjusted close data from 1/1/25 to 11/1/25.
```{r echo=T, eval=F}
getSymbols("SBUX", from = "2025-01-01", to = "2025-11-01")
getSymbols("NSRGY", from = "2025-01-01", to = "2025-11-01")
```
**Question 1:** What is the average rate of return or expected value of this coffee portfolio? Round answer to two decimal places.
**Question 2:** What is the volatility of this coffee portfolio? Round answer to two decimal places.
**NOTE: The final exam and HW 8 will include questions like this.**
- Average Rate of Return questions ask for a weighted average and could include three or more stocks.
- Volatility questions require calculating covariances and variances and will only include two stocks, at most.
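The two calculations can be sketched with made-up summary statistics. Every number below is hypothetical (these are not SBUX or NSRGY values); only the 75%/25% weights come from the exercise. The sketch assumes that volatility means the standard deviation of the portfolio's returns, computed from the two variances and the covariance.

```{r echo=T}
# Hypothetical inputs (NOT real stock data)
w      <- c(0.75, 0.25)       # portfolio weights, must sum to 1
mean_r <- c(0.0010, 0.0005)   # average return of each stock
var_r  <- c(0.0004, 0.0001)   # variance of each stock's returns
cov_12 <- 0.00005             # covariance between the two return series

# Expected value: weighted average of the average returns
port_mean <- sum(w * mean_r)

# Portfolio variance: w1^2*var1 + w2^2*var2 + 2*w1*w2*cov
port_var <- sum(w^2 * var_r) + 2 * w[1] * w[2] * cov_12

# Volatility: square root of the portfolio variance
port_vol <- sqrt(port_var)

c(mean = port_mean, volatility = port_vol)
```

With real data, the means, variances, and covariance would be computed from the two return series instead of typed in.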
## Models vs. Functions
:::::: columns
::: {.column width="50%"}
In high school algebra, the concept of a function, $y=f(x)$, is covered.
For example, a function that most people recall from high school is
$$y=x^2$$
How does this function appear?
:::
:::: {.column width="50%"}
::: fragment
```{r}
X <- seq(-1,1,.1)
Y <- X^2
xy <- tibble(X,Y)
(parabola <- xy |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == X^2)) +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
:::
::::
::::::
## Functions are Mathematical relationships
::::: columns
::: {.column width="50%"}
- Every point is exactly on the line
- No points are above or below the line
- BOTH the points and the line were generated with the same function
:::
::: {.column width="50%"}
```{r}
parabola
```
:::
:::::
## Function of a LINE
:::::::: columns
:::: {.column width="50%"}
- While covering functions, a common topic is the function of a line
::: fragment
$$y = mx + b$$
:::
- m is the slope of the line
- b is the y-intercept
<br>
- Examples:
- Positive slope: $y = 2x + 3$
- Negative slope: $y = -3x + 7$
- Y axis range is the same on both plots.
::::
::::: {.column width="50%"}
::: fragment
```{r}
X <- seq(-2,10,.5)
Y <- 2*X+3
line_pos <- tibble(X,Y)
(pos_slope <- line_pos |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
lims(y=c(-25,25)) +
#theme_classic() +
labs(title=expression(Y == 2*X + 3)) +
theme(plot.title = element_text(size = 30),
axis.title = element_text(size=20),
axis.text = element_text(size=18)))
```
:::
::: fragment
```{r}
X <- seq(-2,10,.5)
Y <- -3*X + 7
line_neg <- tibble(X,Y)
(neg_slope <- line_neg |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
lims(y=c(-25,25)) +
#theme_classic() +
labs(title=expression(Y == -3*X + 7)) +
theme(plot.title = element_text(size = 30),
axis.title = element_text(size=20),
axis.text = element_text(size=18)))
```
:::
:::::
::::::::
## Models ARE NOT Functions
::::: columns
::: {.column width="50%"}
[Favorite Quote](https://en.wikipedia.org/wiki/All_models_are_wrong) attributed to George Box:
"All models are wrong, but some are useful."
<br>
Common student query:
If all models are wrong, why do we bother modeling?
:::
::: {.column width="50%"}
```{r}
knitr::include_graphics("img/george_box.png")
```
:::
:::::
::: fragment
Models are considered 'wrong' because they simplify the 'messiness' of the real world to a mathematical relationship.
Models can't (and shouldn't) include all the **noise** of real world data
- BUT models are still useful in understanding how variables are related to each other.
:::
## Examples of Models of Noisy Data
::::::: columns
:::: {.column width="50%"}
::: fragment
```{r}
knitr::include_graphics("img/House_Selling_Price.png")
```
:::
- No. of Bedrooms helps explain selling price
- MANY other factors affect selling price
- Location
- Size
- Age
::::
:::: {.column width="50%"}
::: fragment
```{r}
knitr::include_graphics("img/Car_Selling_Price.png", dpi=200)
```
:::
- Mileage helps explain resale price
- MANY other factors affect resale price
- Model
- Maintenance and Climate
::::
:::::::
## One More Example
::::: columns
::: {.column width="50%"}


:::
::: {.column width="50%"}
- Years of Education helps explain income
- Many other factors do too:
- Major
- College
- Employer
- So what do we do about all this noise?
- As Box would say, we "worry selectively".
- A strong relationship is still useful and informative.
- In a later lecture we will talk about adding more variables to a model.
:::
:::::
##
### Lecture 24 In-class Exercises - Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::: columns
::: {.column width="50%"}
To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.
Here is the full [recipe](https://www.allrecipes.com/recipe/10192/russian-tea-cakes-i/).
<br>
Here is the equation (y-intercept = 0):
$y = 6x$
<br>
**Is this a function or a model?**
:::
::: {.column width="50%"}
```{r}
X <- seq(2,18,2)
Y <- 6*X
cookie_pos <- tibble(X,Y)
(cookie_slope <- cookie_pos |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == 6*X)) +
scale_y_continuous(breaks=seq(12,108,12)) +
scale_x_continuous(breaks=seq(2,18,2)) +
theme(plot.title = element_text(size = 30),
axis.title = element_text(size=20),
axis.text = element_text(size=18)))
```
:::
:::::
##
### Lecture 24 In-class Exercises - Q4
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::: columns
::: {.column width="50%"}
The scatterplot and line show the relationship between height and mass for all Star Wars characters for whom data were available.
<br>
**Question 4: Is the relationship shown here a model or a function?**
<br>
Follow up Question (not on Point Solutions):
What is a good way to determine this?
:::
::: {.column width="50%"}
```{r message=F}
sw <- starwars |>
filter(mass <= 1000)
(sw_plot <- sw |>
ggplot(aes(x=height, y=mass)) +
geom_point(color="blue", size=3) +
geom_smooth(method='lm', se=FALSE, color="red", linewidth=1) +
#theme_classic() +
labs(title="Height vs. Mass of Star Wars Characters", x="Height (cm)", y="Mass (kg)", caption="Jabba excluded") +
theme(plot.title = element_text(size = 24),
axis.title = element_text(size=20),
axis.text = element_text(size=18),
plot.caption = element_text(size=15)))
```
:::
:::::
## Simple Linear Regression Model
::::::::: columns
::::::: {.column width="50%"}
::: fragment
**True Population Model**
:::
::: fragment
$$y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}$$
:::
- $\beta_{0}$ is the y-intercept
- $\beta_{1}$ is the slope
- $e$ is the unexplained variability in Y
::: fragment
**Estimated Sample Data Model**
:::
::: fragment
$$\hat{y} = b_{0} + b_{1}x$$
:::
- $\hat{y}$ is model estimate of y from x
- $b_{0}$ is model estimate of y-intercept
- $b_{1}$ is model estimate of slope
:::::::
::: {.column width="50%"}
```{r}
knitr::include_graphics("img/Regression_Line_and_Residuals.png", dpi=50)
```
- Each $e_{i}$ is a residual.
- observed y minus regression estimate of y
- $e_{i} = y_{i} - \hat{y}_{i}$
- Software estimates model with smallest sum of all squared residuals
- minimizes $\sum_{i=1}^ne_{i}^2$
:::
:::::::::
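As a rough illustration of what the software does, the sketch below estimates $b_{0}$ and $b_{1}$ by hand and compares them with `lm()`. The data are made up, and the formulas used are the standard least squares formulas for one predictor, which are not shown on the slide.

```{r echo=T}
# Hypothetical data (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates of the slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residuals: observed y minus estimated y
y_hat <- b0 + b1 * x
e     <- y - y_hat
sse   <- sum(e^2)   # the sum of squared residuals that least squares minimizes

# lm() should return the same intercept and slope
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
sse
```

Any other intercept and slope would give a larger sum of squared residuals for these data.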
## Function of a Line vs. Regression Model
::::::: columns
:::: {.column width="50%"}
**Function of a Line**
$$y = mx + b$$
::: fragment
Exact precise mathematical relationship with NO NOISE:
```{r}
pos_slope
```
:::
::::
:::: {.column width="50%"}
**Regression Model Equation**
$$\hat{y} = b_{0} + b_{1}x$$
::: fragment
Estimated line that is simultaneously as close as possible to all observations.
```{r}
knitr::include_graphics("img/Regression_Line_and_Residuals.png", dpi=50)
```
:::
::::
:::::::
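One way to see the difference is to fit a line to exact values and to noisy values and then compare the residuals. This is only a sketch: the noise is simulated with an arbitrary seed and an assumed standard deviation of 2, neither of which comes from the lecture.

```{r echo=T}
set.seed(261)                # arbitrary seed so the simulated noise is reproducible
x <- seq(-2, 10, 0.5)

y_function <- 2 * x + 3      # exact line with no noise
y_model    <- 2 * x + 3 + rnorm(length(x), mean = 0, sd = 2)  # same line plus hypothetical noise

fit_function <- lm(y_function ~ x)
fit_model    <- lm(y_model ~ x)

# Largest residual (in absolute value) for each fit
max(abs(resid(fit_function)))   # essentially 0: every point is on the line
max(abs(resid(fit_model)))      # above 0: points scatter around the line
```

If every residual is zero (up to rounding), the relationship behaves like a function. If the points scatter around the line, the line is a model.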
## Interpreting a Regression Model
::::::: columns
:::: {.column width="55%"}
::: fragment
$$\hat{y} = b_{0} + b_{1}x$$
:::
- $\hat{y}$ is regression est. of y
- $b_{0}$ is the estimated value of y when x = 0
- **NOT always meaningful**
- $b_{1}$ is the estimated change in y for a 1-unit increase in x.
- unit depends on data
- **NOTE:**
- Model is only valid for the range of X values used to estimate it.
- Using a model to estimate a value outside of this range is called extrapolation, and such an estimate is invalid.
::::
:::: {.column width="45%"}
```{r message=FALSE}
gt_cars <- gtcars |>
filter(!is.na(mpg_h))
(hp_plot <- gt_cars |>
ggplot(aes(x=hp, y=mpg_h)) +
geom_point(color="blue", size=3) +
geom_smooth(method='lm', se=FALSE, color="red", linewidth=1) +
#theme_classic() +
labs(title="Horsepower vs Highway MPG", x="Horsepower", y="Highway MPG") +
theme(plot.title = element_text(size = 30),
axis.title = element_text(size=20),
axis.text = element_text(size=18)))
```
<br>
::: fragment
**Specifying the Model in R**
```{r echo=T}
hp_mod <- lm(mpg_h ~ hp, data=gt_cars)
hp_mod$coefficients
```
$$\hat{y} = 33.8641 - 0.022417x$$
:::
::::
:::::::
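The interpretation steps above can be sketched in code. To keep the example self-contained, it uses R's built-in `mtcars` data as a stand-in, not the `gtcars` data on this slide, so the coefficients will differ from the equation shown. The values 150 and 10 are hypothetical choices for illustration.

```{r echo=T}
# Substitute data: built-in mtcars (NOT the gtcars data used on this slide)
cars_mod <- lm(mpg ~ hp, data = mtcars)
b <- unname(coef(cars_mod))   # b[1] = intercept, b[2] = slope

# Check that a new x value is inside the observed range before estimating y
new_hp   <- 150               # hypothetical horsepower value
hp_range <- range(mtcars$hp)
in_range <- new_hp >= hp_range[1] & new_hp <= hp_range[2]
est_mpg  <- predict(cars_mod, newdata = data.frame(hp = new_hp))

# Estimated change in y for a 10-unit increase in x is 10 * slope
change_10 <- 10 * b[2]

# Residual for the first car: observed mpg minus estimated mpg
resid_1 <- mtcars$mpg[1] - predict(cars_mod, newdata = mtcars[1, ])

in_range
est_mpg
change_10
resid_1
```

If `in_range` were `FALSE`, the estimate would be an extrapolation and should not be used.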
##
### Lecture 24 In-class Exercises - Q5-Q6
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::::: columns
:::: {.column width="50%"}
Regression Model:
$$\hat{y} = 33.8641 - 0.022417x$$ <br>
::: fragment
**Question 5. Based on this model, if Horsepower (x) is increased by 1, what is the change in Highway MPG?**
:::
- Round answer to six decimal places
<br>
::::
:::: {.column width="50%"}
```{r message=F}
hp_plot
```
<br>
::: fragment
**Question 6. Based on this model, if Horsepower (x) is increased by 20 (which is more realistic), what is the change in Highway MPG?**
:::
- Round answer to 3 decimal places.
::::
:::::::
##
### Lecture 24 In-class Exercises - Q7-Q8
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::: columns
::: {.column width="50%"}
Regression Model:
$$\hat{y} = 33.8641 - 0.022417x$$ <br>
<br>
**Question 7. If HP is 600, what is the estimated Highway MPG?**
<br>
**Question 8. What is the residual for the 2016 Aston Martin Vantage?**
<br>
:::
::: {.column width="50%"}
```{r message=F}
hp_plot
```
- **Follow up Question (not on Point Solutions):** Does the intercept have a real-world interpretation in this model?
:::
:::::
##
### Key Points from Today
- Simple linear regression (SLR) models are similar in format to the function of a line.
- The interpretation is very different because SLR models are a simplification of the real world.
- Box said "All models are wrong, but some are useful."
- This refers to the inherent simplification of modeling that leaves out the noise of the real world.
- Despite this simplification, models provide valuable insight.
- A model is only valid for the range of data used to create it.
- Outside of that range we are extrapolating, which is invalid.
::: fragment
**To submit an Engagement Question or Comment about material from Lecture 24:** Submit it by midnight today (day of lecture).
:::