---
title: "BUA 345 - Lecture 8"
subtitle: "Introduction to Regression Modeling in R"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# set default options for all R chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt,
mosaicData, epiDisplay, vistributions, psych, tidyquant, dygraphs)
# verify packages
# p_loaded()
```
**HW 4 is due 2/11/2026**
**HW R project UPDATED** - Now includes files for assignments HW 4 through HW 7
[**FREE Posit Cloud Account**](https://posit.cloud/plans/free){target="_blank"}
### Today's plan
- Review of Simple Linear Regression
- Function vs. Model
- Examining Real Data
- Creating a Model
- Interpreting a Regression Model
##
### Lecture 8 In-class Exercise - Q1
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
```{r eval=F, message=F, warning=F}
bom2025 <- read_csv("data/bom2025.csv", skip = 9, show_col_types = F) |>
dplyr::select(Date, `Top 10 Gross`) |>
rename("Top_10_Gross" = `Top 10 Gross`) |>
filter(!is.na(Top_10_Gross)) |>
mutate(Date = dmy(paste(Date, 2025)),
Top_10_Gross = gsub("$","", Top_10_Gross, fixed=T),
Top_10_Gross = gsub(",","", Top_10_Gross, fixed=T) |> as.numeric(),
Top_10_Gross_M = (Top_10_Gross/1000000) |> round(2)) |>
glimpse()
(bom_plot <- bom2025 |>
ggplot(aes(x=Date, y=Top_10_Gross_M)) +
geom_point(color="blue", size=2) +
#theme_classic() +
labs(title="Date vs. Gross of Top 10 Movies for 2025", x="Date", y="Gross ($US Mil)", caption="Data Source: https://www.boxofficemojo.com/") +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
ggsave("img/BOM_Top_10_Gross_Scatterplot.png",
width=10, height=3, units="in")
```
Many people think that the best movies come at the end of the year, but there are always summer blockbuster movies too.
Based on this scatterplot created from 2025 data, do you think there is a linear correlation between time of year and the daily gross from top 10 movies?
{fig-align="center"}
## R and RStudio
- We began using R and RStudio for the predictive analytics lectures.
- We access R and RStudio through **Posit Cloud**.
- Sign up for a [Free Posit Cloud Account](https://posit.cloud/plans/free){target="_blank"}
- I post R/RStudio files on Posit Cloud that you can access in provided links.
- I also provide demo videos that show how to access files and complete exercises.
- NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I demo how to download completed work so that you can use this allotment efficiently.
- We will also use Posit Cloud for quiz questions on predictive analytics skills.
- For those who want to download R and RStudio (not required):
- There is an information page on my course website, [Installing R and RStudio](https://peneloopy.github.io/bua_345_sem/#installing-r-and-rstudio){target="_blank"}
## Models vs. Functions
:::::: columns
::: {.column width="50%"}
In high school algebra, the concept of a function is covered.
`f(x)` is a calculation involving a variable `x` that results in a new value, `y`.
$$ y = f(x) $$
For example, a function that most people recall from high school is
$$y=x^2$$
How does this function appear graphically?
:::
:::: {.column width="50%"}
::: fragment
```{r}
X <- seq(-1,1,.1)
Y <- X^2
xy <- tibble(X,Y)
(parabola <- xy |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == X^2)) +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
:::
::::
::::::
## Functions are Mathematical relationships
::::::: columns
::::: {.column width="50%"}
- Every point is exactly on the line
<br>
- No points are above or below the line
<br>
- BOTH the points and the line were generated with the same function
:::: fragment
::: {.fragment .grow}
$$ y = x^2 $$
:::
::::
:::::
::: {.column width="50%"}
```{r}
parabola
```
:::
:::::::
##
:::::: columns
:::: {.column width="50%"}
### Function of a LINE
- While covering functions, a common topic is the function of a line
::: fragment
$$y = mx + b$$
:::
- m is the slope of the line
- b is the y-intercept
- Examples:
- Positive slope: $y = 2x + 3$
- Negative slope: $y = -3x + 7$
- Notice the Y axis in each plot.
::::
::: {.column width="50%"}
Positive slope: $y = 2x + 3$
```{r}
X <- seq(-2,10,.5)
Y <- 2*X+3
line_pos <- tibble(X,Y)
(pos_slope <- line_pos |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == 2*X + 3)) +
theme(plot.title = element_text(size = 25),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
Negative slope: $y = -3x + 7$
```{r}
X <- seq(-2,10,.5)
Y <- -3*X + 7
line_neg <- tibble(X,Y)
(neg_slope <- line_neg |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == -3*X + 7)) +
theme(plot.title = element_text(size = 25),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
:::
::::::
## Models ARE NOT Functions
::::: columns
::: {.column width="50%"}
[Favorite Quote](https://en.wikipedia.org/wiki/All_models_are_wrong) attributed to George Box:
"All models are wrong, but some are useful."
<br>
Common student query:
If all models are wrong, why do we bother modeling?
:::
::: {.column width="50%"}
{fig-align="center" height="3in"}
:::
:::::
::: fragment
Models are considered 'wrong' because they simplify the 'messiness' of the real world to a mathematical relationship.
Models can't (and shouldn't) include all the **noise** of real world data
- BUT models are still useful in understanding how variables are related to each other.
:::
## Examples of Models of Noisy Data
::::::: columns
:::: {.column width="50%"}
::: fragment
{fig-align="center"}
:::
- No. of Bedrooms helps explain selling price
- MANY other factors affect selling price
- Location
- Size
- Age
::::
:::: {.column width="50%"}
::: fragment
{fig-align="center"}
:::
- Mileage helps explain resale price
- MANY other factors affect resale price
- Model
- Maintenance and Climate
::::
:::::::
## One More Example
::::: columns
::: {.column width="50%"}
{fig-align="center"}
{fig-align="center"}
:::
::: {.column width="50%"}
- Years of Education helps explain income
- Many other factors do too:
- Major
- College
- Employer
- So what do we do about all this noise?
- As Box would say, we "worry selectively"
- A strong relationship is still useful and informative
- In a later lecture we will talk about adding more variables to a model.
:::
:::::
##
### Lecture 8 In-class Exercises - Q2
The following is an example of a recipe for Russian Tea Cakes
{fig-align="center"}
##
### Lecture 8 In-class Exercises - Q2 Cont'd
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
:::::: columns
::: {.column width="50%"}
To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.
Here is the full [recipe](https://www.allrecipes.com/recipe/10192/russian-tea-cakes-i/).
<br>
Here is the equation (y-intercept = 0):
$y = 6x$
<br>
**Is this a function or a model?**
:::
:::: {.column width="50%"}
::: fragment
```{r}
X <- seq(2,18,2)
Y <- 6*X
cookie_pos <- tibble(X,Y)
(cookie_slope <- cookie_pos |> ggplot(aes(x=X,y=Y)) +
geom_point(color="blue", size=3) +
geom_line(color="red", linewidth=1) +
#theme_classic() +
labs(title=expression(Y == 6*X)) +
scale_y_continuous(breaks=seq(12,108,12)) +
scale_x_continuous(breaks=seq(2,18,2)) +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
:::
::::
::::::
## Lecture 8 In-class Exercises - Q3
Star Wars Character Data Example
{fig-align="center"}
##
### Lecture 8 In-class Exercises - Q3 Cont'd
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::: columns
::: {.column width="50%"}
The plot shows the relationship between height and mass for all Star Wars characters for whom data were available.
<br>
**Question 3: Is the relationship shown here a model or a function?**
<br>
Follow up Question (not on Point Solutions): What is a good way to determine this?
:::
::: {.column width="50%"}
```{r message=F}
sw <- starwars |>
filter(mass <= 1000)
(sw_plot <- sw |>
ggplot(aes(x=height, y=mass)) +
geom_point(color="blue", size=3) +
geom_smooth(method='lm', se=FALSE, color="red", linewidth=1) +
#theme_classic() +
labs(title="Height vs. Mass of Star Wars Characters", x="Height (cm)", y="Mass (kg)", caption="Jabba excluded") +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
:::
:::::
## Simple Linear Regression Model
::::::::: columns
::::::: {.column width="50%"}
::: fragment
**True Population Model**
:::
::: fragment
$$y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}$$
:::
- $\beta_{0}$ is the y-intercept
- $\beta_{1}$ is the slope
- $e$ is the unexplained variability in Y
::: fragment
**Estimated Sample Data Model**
:::
::: fragment
$$\hat{y} = b_{0} + b_{1}x$$
:::
- $\hat{y}$ is model estimate of y from x
- $b_{0}$ is model estimate of y-intercept
- $b_{1}$ is model estimate of slope
:::::::
::: {.column width="50%"}
{fig-align="center"}
- Each $e_{i}$ is a residual.
- y obs. - reg. estimate of y
- $e_{i} = y_{i} - \hat{y}_{i}$
- Software estimates model with smallest sum of all squared residuals
- minimizes $\sum_{i=1}^ne_{i}^2$
:::
:::::::::
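The least-squares idea above can be checked by hand. The following is a minimal sketch, assuming a small made-up data set (`x_toy` and `y_toy` are hypothetical values, not lecture data). It computes $b_{1} = \sum (x_{i} - \bar{x})(y_{i} - \bar{y}) / \sum (x_{i} - \bar{x})^2$ and $b_{0} = \bar{y} - b_{1}\bar{x}$ directly, compares them with the estimates from `lm()`, and shows that nudging the slope away from the least-squares value makes the sum of squared residuals larger.

```{r echo=T}
# Hypothetical toy data (made up for illustration, not from the lecture data sets)
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Least-squares estimates computed directly from the formulas
b1_hand <- sum((x_toy - mean(x_toy)) * (y_toy - mean(y_toy))) /
  sum((x_toy - mean(x_toy))^2)
b0_hand <- mean(y_toy) - b1_hand * mean(x_toy)

# The same estimates from lm()
toy_mod <- lm(y_toy ~ x_toy)

# Residuals: observed y minus estimated y, and their sum of squares
e_hand <- y_toy - (b0_hand + b1_hand * x_toy)
sse_hand <- sum(e_hand^2)

# Sum of squared residuals for a line with a slightly different slope
sse_shifted <- sum((y_toy - (b0_hand + (b1_hand + 0.1) * x_toy))^2)

c(b0_hand = b0_hand, b1_hand = b1_hand)  # estimates by hand
coef(toy_mod)                            # estimates from lm()
c(sse_hand = sse_hand, sse_shifted = sse_shifted)
```

The hand-computed and `lm()` estimates should agree, and `sse_shifted` should be larger than `sse_hand`, since any other line fits these points worse in the least-squares sense.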
## Function of a Line vs. Regression Model
::::: columns
::: {.column width="50%"}
**Function of a Line**
$$y = mx + b$$
Exact, precise mathematical relationship with NO NOISE
```{r}
pos_slope
```
:::
::: {.column width="50%"}
**Regression Model Equation**
$$\hat{y} = b_{0} + b_{1}x$$
Estimated line that is simultaneously as close as possible to all observations.
{fig-align="center"}
:::
:::::
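One way to see the difference in R is to fit a line to both kinds of data and look at the residuals. The sketch below is an illustration under assumptions: the noise is simulated with `rnorm()`, and the seed and the noise standard deviation of 2 are arbitrary choices. Points generated exactly from $y = 2x + 3$ leave residuals that are zero up to rounding error, while the noisy version of the same line does not.

```{r echo=T}
set.seed(345)  # arbitrary seed so the simulated noise is reproducible

x_demo  <- seq(-2, 10, 0.5)
y_exact <- 2 * x_demo + 3                                   # function: every point is on the line
y_noisy <- y_exact + rnorm(length(x_demo), mean = 0, sd = 2) # same line plus simulated noise

# Fit a straight line to each set of points
exact_fit <- lm(y_exact ~ x_demo)
noisy_fit <- lm(y_noisy ~ x_demo)

# Largest residual (in absolute value) for each fit
max(abs(residuals(exact_fit)))  # essentially zero: a function
max(abs(residuals(noisy_fit)))  # clearly nonzero: a model

# Estimated intercept and slope from the noisy data
coef(noisy_fit)
```

The coefficients estimated from the noisy data are close to, but not exactly, 3 and 2. They are estimates of the underlying line rather than the line itself.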
## Interpreting a Regression Model
::::::::: columns
:::: {.column width="55%"}
::: fragment
$$\hat{y} = b_{0} + b_{1}x$$
:::
- $\hat{y}$ is regression est. of y
- $b_{0}$ is the estimated value of y when x = 0
- **NOT always meaningful**
- $b_{1}$ is change in y due to 1 unit change in x.
- unit depends on data
- **NOTE:**
- Model is only valid for the range of X values used to estimate it.
- Using a model outside of this range is extrapolation.
- Extrapolated estimates are invalid
::::
:::::: {.column width="45%"}
```{r message=FALSE}
gt_cars <- gtcars |>
filter(!is.na(mpg_h))
(hp_plot <- gt_cars |>
ggplot(aes(x=hp, y=mpg_h)) +
geom_point(color="blue", size=3) +
geom_smooth(method='lm', se=FALSE, color="red", linewidth=1) +
#theme_classic() +
labs(title="Horsepower vs Highway MPG", x="Horsepower", y="Highway MPG") +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15)))
```
<br>
::::: fragment
:::: {.fragment .grow}
::: {.fragment .shrink}
**Specifying the Model in R**
```{r echo=T}
hp_mod <- lm(mpg_h ~ hp, data=gt_cars)
hp_mod$coefficients
```
$$\hat{y} = 33.8641 - 0.022417x$$
:::
::::
:::::
::::::
:::::::::
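As a rough sketch of how the estimated equation is used, the code below plugs a horsepower value into the equation shown above and computes a residual. The coefficients are copied from the slide. The car itself (`hp_example`, `mpg_example`) is hypothetical and is not a row from `gtcars`, and the helper `est_mpg()` is a name introduced here for illustration.

```{r echo=T}
# Coefficients copied from the estimated equation on this slide
b0_slide <- 33.8641
b1_slide <- -0.022417

# Hypothetical helper: estimated highway MPG for a given horsepower
est_mpg <- function(hp) b0_slide + b1_slide * hp

# Hypothetical car (values made up for illustration)
hp_example  <- 500
mpg_example <- 24

# Regression estimate and residual (observed minus estimated)
est_example <- est_mpg(hp_example)
res_example <- mpg_example - est_example

est_example
res_example
```

A positive residual means the observed value sits above the regression line, and a negative residual means it sits below. Before trusting an estimate like this, it is worth confirming that the horsepower value lies inside the range of the data used to fit the model, for example with `range(gt_cars$hp)`.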
##
### Lecture 8 In-class Exercises - Q4-Q5
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
:::::::: columns
:::: {.column width="48%"}
Regression Model:
$$\hat{y} = 33.8641 - 0.022417x$$ <br>
::: fragment
**Question 4. Based on this model, if Horsepower (x) is increased by 1, what is the change in Highway MPG?**
:::
- Round answer to six decimal places
<br>
::::
::: {.column width="2%"}
:::
:::: {.column width="50%"}
```{r message=F}
hp_plot
```
<br>
::: fragment
**Question 5. Based on this model, if Horsepower (x) is increased by 20 (which is more realistic), what is the change in Highway MPG?**
:::
- Round answer to 3 decimal places.
::::
::::::::
##
### Lecture 8 In-class Exercises - Q6-Q7
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
::::: columns
::: {.column width="50%"}
Regression Model:
$$\hat{y} = 33.8641 - 0.022417x$$ <br>
<br>
**Question 6. If HP is 600, what is the estimated Highway MPG?**
Round answer to closest whole number.
<br>
**Question 7. What is the residual for the 2016 Aston Martin Vantage?**
Round answer to two decimal places.
<br>
:::
::: {.column width="50%"}
```{r}
hp_plot
```
- Follow up Question (not on Point Solutions): Does the intercept have a real-world interpretation in this model?
:::
:::::
##
### Key Points from Today
- Simple linear regression (SLR) models are similar in format to the function of a line.
- The interpretation is very different because SLR models are simplifications of the real world.
- Box said "All models are wrong, but some are useful"
- This refers to the inherent simplification of modeling that leaves out the noise of the real world.
- Despite this simplification, models provide valuable insight.
- A model is only valid for the range of data used to create it.
- Outside of that range we are extrapolating, which is invalid.
::: fragment
**HW 4 is due 2/11/2026**
**To submit an Engagement Question or Comment about material from Lecture 8:** Submit it by midnight today (day of lecture).
:::