Lectures 7 and 8 - Introduction to Correlation and Regression in R

Penelope Pooler Eisenbies
BUA 345

2024-02-08

Housekeeping

This week’s plan 📋
- Introduction to using R and RStudio
- Review of correlation, $R_{XY}$
- Review of Simple Linear Regression
  - Function vs. Model
  - Examining Real Data
  - Creating a Model
  - Interpreting a Regression Model

💥 Lecture 7 In-class Exercises - Q1 - Review 💥

Introduction to R and RStudio 🪄

Two options to facilitate your introduction to R and RStudio:
- Option 1: BOTH Versions NOW
  - Create Posit Cloud account
  - Download and install R and RStudio on your laptop.
- Option 2: Start with Posit Cloud only
  - Create Posit Cloud account and later transition to using R/RStudio on your laptop.
  - Using Posit Cloud only for the whole course may require $5 per month if you use it more than 25 hours per month (likely).
We will use Posit Cloud for Quizzes.
For both options: I can help with download/install issues during office hours.
I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.

Review of Linear Correlations

In your prerequisite course for BUA 345, you covered linear relationships between two or more quantitative variables.

We will review this material this week while introducing R and RStudio.

Often if we have two quantitative variables we want to understand the extent to which they are associated.
- The first step is often to plot the data using a scatterplot.
- We can also use quantitative measures of association to understand these relationships.

Grocery Sales per Sq. Ft. and Planned Store Openings

Understanding Linear Relationships

chain	sales_sq_ft	openings
Roundy’s	393	2
Weis Markets	325	3
Natural Grocers	419	5
Ingles	325	10
Kroger	496	15
Harris Teeter’s	442	20
Fresh Market	490	20
Sprouts Farmer’s Market	490	20
Publix	552	30
Whole Foods	937	38
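As a minimal sketch, the table above could be typed into R by hand and plotted with base R. The data frame name `grocery` and its column names follow the table header; entering the values directly (rather than opening a provided file) is an assumption made only for illustration.

```r
# Grocery chain data typed in from the table above
grocery <- data.frame(
  chain = c("Roundy's", "Weis Markets", "Natural Grocers", "Ingles", "Kroger",
            "Harris Teeter's", "Fresh Market", "Sprouts Farmer's Market",
            "Publix", "Whole Foods"),
  sales_sq_ft = c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937),
  openings = c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)
)

# Scatterplot: X = sales per sq. ft., Y = planned store openings
plot(grocery$sales_sq_ft, grocery$openings,
     xlab = "Sales per Sq. Ft.", ylab = "Planned Store Openings",
     main = "Grocery Sales per Sq. Ft. and Planned Store Openings")
```

The points should drift upward from left to right, which is the pattern to look for before computing any numeric measure of association.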

Direction of the Relationship

As X (sales per square foot) increases, Y (planned store openings) also increases.

When Y increases with X in an approximately linear fashion, that is a

POSITIVE LINEAR RELATIONSHIP
- The trend has a positive slope.

Strength of the Linear Relationship

In addition to determining whether there is a positive or negative relationship, we also want to quantify how strong the relationship is.

To quantify the strength of a linear relationship, we calculate:

Pearson’s correlation coefficient, $R_{xy}$.
$R_{xy} = 0.85$
How do we interpret this value?
- …Spoiler: This is a strong positive correlation!

cor(grocery$sales_sq_ft, grocery$openings)

[1] 0.8517842
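To see what `cor()` is doing, here is a rough sketch that computes Pearson's correlation by hand from the same ten grocery values. The vector names `x`, `y`, `dx`, `dy`, and `r_manual` are chosen here for illustration and are not part of the course files.

```r
# Same values as the grocery table
x <- c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937)  # sales_sq_ft
y <- c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)               # openings

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's correlation: shared variation scaled by the variation in each variable
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual    # should agree with the cor() result
cor(x, y)
```

In practice `cor()` is all you need; the manual version is only meant to show that the coefficient compares how X and Y vary together with how much each varies on its own.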

Interpreting $R_{xy}$, the correlation coefficient

$R_{xy}$ ranges from -1 to 1.
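The two endpoints of this range can be checked with a small sketch. The points below are generated exactly from two lines (the example lines used later in these slides), so no real data is involved.

```r
x <- 1:10

# Points generated exactly from a line with positive slope
cor(x, 2 * x + 3)    # 1

# Points generated exactly from a line with negative slope
cor(x, -3 * x + 7)   # -1
```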

The most extreme $R_{xy}$ values represent ‘perfectly correlated data’:

Very Strongly Correlated Data

$R_{xy} = 1$ or $R_{xy} = -1$ is unrealistic. These correlations are both strong and realistic:

Range of $R_{xy}$ Guidelines for Interpretation

Example of Negative Correlation

💥 In-class Exercises - Q2 💥

What is the correlation between Year and Rural_Pct in the urban_rural dataset?

Hint: This Correlation is almost perfect.

Round answer to three decimal places.

When NOT to use $R_{xy}$

$R_{xy}$ is only valid when examining linear relationships.

If the data have a curvilinear relationship, there are other tools that will be covered in other courses.
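A small made-up example shows why. Here `y` is completely determined by `x`, but the relationship is curved and symmetric, so the linear correlation misses it.

```r
x <- -5:5
y <- x^2      # exact curved (quadratic) relationship

cor(x, y)     # essentially 0, even though y is fully determined by x
```

This is why plotting the data first matters: a correlation near zero means there is no linear trend, not that there is no relationship.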

Key Points from Lecture 7

This short lecture is an introduction to linear associations between variables.
We will continue this discussion in Lecture 8 on Thursday.
For now, you are expected to understand
- How to open provided files in R and RStudio
- How to interpret a scatterplot
- How to calculate $R_{xy}$ using the cor command in R
- How to interpret $R_{xy}$
- When NOT to use $R_{xy}$ to examine data associations

To submit an Engagement Question or Comment about material from Lecture 7: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 7.

Review of Housekeeping for Lectures 7 and 8

This week’s plan 📋
- Introduction to using R and RStudio
- Review of correlation, $R_{XY}$
- Review of Simple Linear Regression
  - Function vs. Model
  - Examining Real Data
  - Creating a Model
  - Interpreting a Regression Model

💥 In-class Exercises - Q1 - Review 💥

Many people think that the best movies come at the end of the year, but there are always summer blockbuster movies too.

Based on this scatterplot created from 2023 data, do you think there is a linear correlation between time of year and the daily gross from top 10 movies?

Models vs. Functions

In high school algebra, the concept of a function is covered.

f(x) is a calculation involving a variable x that results in a new value, y.

\[ y = f(x) \]

For example, a function that most people recall from high school is

\[y=x^2\]

How does this function appear graphically?

Functions are Mathematical relationships

Every point is exactly on the line

No points are above or below the line

BOTH the points and the line were generated with the same function

\[ y = x^2 \]

Function of a LINE

While covering functions, a common topic is the function of a line

\[y = mx + b\]

m is the slope of the line
b is the y-intercept
Examples:
- Positive slope: $y = 2x + 3$
- Negative slope: $y = -3x + 7$
Notice the Y axis in each plot.

Positive slope: $y = 2x + 3$

Negative slope: $y = -3x + 7$

Models ARE NOT Functions

Favorite Quote attributed to George Box:

“All models are wrong, but some are useful.”

Common student query:

If all models are wrong, why do we bother modeling?

Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.

Models can’t (and shouldn’t) include all the noise of real world data

BUT models are still useful in understanding how variables are related to each other.
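The difference between a function and a model can be sketched with simulated data. Everything below is made up for illustration: the line $y = 2x + 3$, the noise level, and the seed are arbitrary choices, not course data.

```r
set.seed(345)                    # arbitrary seed so the sketch is reproducible

x <- 1:20
y_function <- 2 * x + 3          # exact function: no noise
y_observed <- y_function + rnorm(length(x), mean = 0, sd = 4)  # simulated noise

# Every point from the function sits exactly on the line
all(y_function - (2 * x + 3) == 0)

# Noisy observations do not, but a fitted model still recovers a similar line
noisy_mod <- lm(y_observed ~ x)
noisy_mod$coefficients
```

The fitted intercept and slope will not be exactly 3 and 2, but they should land close to them, which is the sense in which a "wrong" model is still useful.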

Examples of Models of Noisy Data

No. of Bedrooms helps explain selling price
MANY other factors affect selling price
- Location
- Size
- Age

Mileage helps explain resale price
MANY other factors affect resale price
- Model
- Maintenance and Climate

One More Example

Years of Education helps explain income
Many other factors do too:
- Major
- College
- Employer
So what do we do about all this noise?
- As Box would say, we “worry selectively”
- A strong relationship is still useful and informative
- In a later lecture we will talk about adding more variables to a model.

💥 Lecture 8 In-class Exercises - Q2 💥

The following is an example of a recipe for Russian Tea Cakes

💥 Lecture 8 In-class Exercises - Q2 💥

To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.

Here is the full recipe.

Here is the equation (y-intercept = 0):

$y = 6x$

Is this a function or a model?

💥 Lecture 8 In-class Exercises - Q3 💥

Star Wars Character Data Example

💥 Lecture 8 In-class Exercises - Q3 💥

The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.

Question 3: Is the relationship shown here a model or a function?

Follow up Question (not on Point Solutions): What is a good way to determine this?

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

$\beta_{0}$ is the y-intercept
$\beta_{1}$ is the slope
$e$ is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

$\hat{y}$ is model estimate of y from x
$b_{0}$ is model estimate of y-intercept
$b_{1}$ is model estimate of slope

Each $e_{i}$ is a residual.
- y obs. - reg. estimate of y
- $e_{i} = y_{i} - \hat{y}_{i}$
Software estimates model with smallest sum of all squared residuals
- minimizes $\sum_{i=1}^ne_{i}^2$
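As a sketch of what the software is doing, the least-squares estimates can be computed from the standard closed-form formulas and compared with `lm()`. The grocery values from Lecture 7 are reused here as an assumption for illustration; names such as `sse` and `sse_other` are made up for this example.

```r
x <- c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937)  # sales_sq_ft
y <- c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)               # openings

# Least-squares estimates from the standard closed-form formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residuals: observed y minus the regression estimate of y
y_hat <- b0 + b1 * x
e <- y - y_hat
sse <- sum(e^2)

# lm() should report the same intercept and slope
fit <- lm(y ~ x)
fit$coefficients

# Any other line, e.g. a slightly steeper one, has a larger sum of squared residuals
sse_other <- sum((y - (b0 + (b1 + 0.01) * x))^2)
sse_other > sse
```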

Function of a Line vs. Regression Model

Function of a Line

\[y = mx + b\]

Exact, precise mathematical relationship with NO NOISE

Regression Model Equation

\[\hat{y} = b_{0} + b_{1}x\]

Estimated line that is simultaneously as close as possible to all observations.

Interpreting a Regression Model

\[\hat{y} = b_{0} + b_{1}x\]

$\hat{y}$ is regression est. of y
$b_{0}$ is value of y when X = 0
- NOT always meaningful
$b_{1}$ is change in y due to 1 unit change in x.
- unit depends on data
NOTE:
- Model is only valid for the range of X values used to estimate it.
- Using a model outside of this range is extrapolation.
  - Extrapolated estimates are invalid
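One way to keep the range restriction in mind is to check it before estimating. The sketch below reuses the grocery data from Lecture 7; the helper `estimate_openings()` is hypothetical and written only for this illustration.

```r
grocery <- data.frame(
  sales_sq_ft = c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937),
  openings = c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)
)
grocery_mod <- lm(openings ~ sales_sq_ft, data = grocery)

# Slope: estimated change in openings per 1-unit increase in sales per sq. ft.
b1 <- grocery_mod$coefficients[["sales_sq_ft"]]

# Range of X values used to estimate the model
x_range <- range(grocery$sales_sq_ft)

# Hypothetical helper: estimate only when the new X is inside the observed range
estimate_openings <- function(new_sales) {
  if (new_sales < x_range[1] || new_sales > x_range[2]) {
    warning("X is outside the observed range: this would be extrapolation")
    return(NA_real_)
  }
  unname(predict(grocery_mod, newdata = data.frame(sales_sq_ft = new_sales)))
}

estimate_openings(500)    # inside the range, so an estimate is returned
estimate_openings(1500)   # outside the range, so NA with a warning
```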

Specifying the Model in R

hp_mod <- lm(mpg_h ~ hp, data=gt_cars)
hp_mod$coefficients

(Intercept)          hp 
33.86410831 -0.02241685

\[\hat{y} = 33.8641 - 0.022417x\]
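Once the coefficients are known, the model can be used like a small calculator. This sketch copies the intercept and slope from the `hp_mod$coefficients` output; the function name `est_mpg()` and the 300-hp value are hypothetical, and it assumes 300 falls within the range of horsepower in the data.

```r
# Coefficients copied from the hp_mod$coefficients output
b0 <- 33.86410831    # intercept
b1 <- -0.02241685    # slope for hp

# Regression estimate of highway MPG for a given horsepower
est_mpg <- function(hp) b0 + b1 * hp

est_mpg(300)                  # estimate for a hypothetical 300-hp car
est_mpg(310) - est_mpg(300)   # change for a 10-hp increase equals 10 * b1
```

With the fitted model object available, `predict(hp_mod, newdata = data.frame(hp = 300))` should return the same estimate.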

💥 Lecture 8 In-class Exercises - Q4 & Q5 💥

Regression Model:

\[\hat{y} = 33.8641 - 0.022417x\]

Question 4. Based on this model, if Horsepower (x) is increased by 1, what is the change in Highway MPG?

Round answer to six decimal places

Question 5. Based on this model, if Horsepower (x) is increased by 20 (which is more realistic), what is the change in Highway MPG?

Round answer to 3 decimal places.

💥 Lecture 8 In-class Exercises - Q6 & Q7 💥

Regression Model:

\[\hat{y} = 33.8641 - 0.022417x\]

Question 6. If HP is 600, what is the estimated Highway MPG?

Question 7. What is the residual for the 2016 Aston Martin Vantage?

Follow up Question (not on Point Solutions): Does the intercept have a real-world interpretation in this model?

Key Points from Lecture 8

Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is very different because SLR models are simplifications of the real world.
Box said “All models are wrong, but some are useful.”
This refers to the inherent simplification of modeling that leaves out the noise of the real world.
Despite this simplification, models provide valuable insight.
A model is only valid for the range of data used to create it.
- Outside of that range we are extrapolating, which is invalid.

To submit an Engagement Question or Comment about material from Lecture 8: Submit by midnight on the day of Lecture 8. Click on Link next to the ❓ under Lecture 8.
