MAS 261 - Lecture 20

Introduction to Correlation and Covariance

Author

Penelope Pooler Eisenbies

Published

October 30, 2024

Housekeeping

Today’s plan
- Corrections to Syllabus
R Help on your laptop
Upcoming Dates
Review and New Questions
- Two Sided Test of Proportions and Contingency Tables
- Row Percentages and Column Percentages
Understanding Correlation
- Examining correlations visually and quantitatively
- Estimating correlation quantitatively
Converting Correlation to Covariance
- Why and How
- Conversion Formulas

Upcoming Dates

HW 6 was due 10/30 (Grace period ends 10/31)
- Demo videos are posted on Blackboard
- This assignment seems long but it’s not.
- It consists of just three hypothesis tests with questions about each test.
- Most questions are multiple choice, but do not just guess and keep trying.
HW 7 is now posted and is due Wed. 11/6 at midnight.
- Videos will be posted ASAP.
Test 2 is on November 12th and will include material up through Lecture 20
Lecture 21 - Intro to Portfolio Management will be on Final Exam, not on Test 2.

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - I have added a new page to the MAS 261 website, Installing R and RStudio

Review and NEW

Do Gen-Zs and Millennials differ from Gen-Xers with respect to daylight savings?

Should the USA Eliminate Daylight Savings Clock Changes
Age	Yes	No/Not Sure	Row Totals
18-44	228	205	433
45-64	201	118	319
Col. Totals	429	323	752

Column and Row Percentages

Original Data

Should the USA Eliminate Daylight Savings Clock Changes
Age	Yes	No/Not Sure	Row Totals
18-44	228	205	433
45-64	201	118	319
Col. Totals	429	323	752

Row %: Percentages of each age group that said ‘Yes’ or ‘No’.

Row Percentages
	Yes	No/Not Sure
18-44	52.66	47.34
45-64	63.01	36.99

Column %: Percentages of each opinion ('Yes' or 'No/Not Sure') that come from each age group.

Column Percentages
	Yes	No/Not Sure
18-44	53.15	63.47
45-64	46.85	36.53
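The two percentage tables above can be reproduced from the raw counts. The following is a minimal sketch using base R's `prop.table()`; the object names `dl_poll`, `row_pct`, and `col_pct` are chosen here for illustration.

```r
# Counts from the poll table above (rows: age group, columns: opinion)
dl_poll <- matrix(c(228, 205,
                    201, 118),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(Age = c("18-44", "45-64"),
                                  Opinion = c("Yes", "No/Not Sure")))

# margin = 1: each ROW sums to 100
# (share of each age group that gave each opinion)
row_pct <- round(prop.table(dl_poll, margin = 1) * 100, 2)

# margin = 2: each COLUMN sums to 100
# (share of each opinion that came from each age group)
col_pct <- round(prop.table(dl_poll, margin = 2) * 100, 2)

row_pct
col_pct
```

A quick check when reading these tables: find which direction sums to 100. Row percentages answer questions that start with "of all people in this age group", and column percentages answer questions that start with "of all people with this opinion".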

Lecture 20 In-class Exercises - Q1-Q2

Review data with new concepts - Use tables on previous slide

Question 1. What percentage of all the ‘Yes, let’s end daylight savings!’ votes are in the 45-64 age group?

Round percentage to one decimal place.

Question 2. What percentage of all 18-44 year olds said ‘No’ or ‘Not sure’ when asked if they want to eliminate daylight savings?

Round percentage to one decimal place.

Note: There will be homework questions providing more practice on relating questions to these percentage tables.

Row and column percentages can be calculated from raw data, but I provide them.
These questions focus on interpretation instead of arithmetic.

Linear Correlations

The last part of the course will focus on understanding linear relationships between two or more quantitative variables.

We will introduce the first part of this topic today and then build on these concepts after Quiz 2.

Often if we have two quantitative variables we want to understand the extent to which they are associated.
- The first step is often to plot the data using a scatterplot.
- We can also use quantitative measures of association to understand these relationships.

Grocery Sales per Sq. Ft. and Planned Store Openings

Understanding Linear Relationships

chain	sales_sq_ft	openings
Roundy’s	393	2
Weis Markets	325	3
Natural Grocers	419	5
Ingles	325	10
Kroger	496	15
Harris Teeter’s	442	20
Fresh Market	490	20
Sprouts Farmer’s Market	490	20
Publix	552	30
Whole Foods	937	38
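If the data file is not at hand, the table above is small enough to type in directly. This is a sketch that rebuilds it as a data frame named `grocery` (the name used in the lecture code) and draws a basic scatterplot with base R graphics. The plot styling is an assumption and will not match the slide image exactly.

```r
# Rebuild the grocery data from the table above
grocery <- data.frame(
  chain = c("Roundy's", "Weis Markets", "Natural Grocers", "Ingles",
            "Kroger", "Harris Teeter's", "Fresh Market",
            "Sprouts Farmer's Market", "Publix", "Whole Foods"),
  sales_sq_ft = c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937),
  openings    = c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)
)

# Scatterplot: X = sales per sq. foot, Y = planned store openings
plot(grocery$sales_sq_ft, grocery$openings,
     pch = 19,
     xlab = "Sales per sq. foot",
     ylab = "Planned Store Openings",
     main = "Relationship between Grocery Sales and Expansion")
```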

Direction of the Relationship

As X (sales per square foot) increases, Y (planned store openings) also increases.

When Y increases with X in an approximately linear fashion, that is a

POSITIVE LINEAR RELATIONSHIP
- The trend has a positive slope.

Strength of the Linear Relationship

In addition to determining whether the relationship is positive or negative, we also want to quantify how strong it is.

To quantify the strength of a linear relationship, we calculate:

Pearson’s correlation coefficient, $R_{xy}$.
$R_{xy} = 0.85$
How do we interpret this value?
- …Spoiler: This is a strong positive correlation!

```{r echo=T}
cor(grocery$sales_sq_ft, grocery$openings) |> round(2)
```

[1] 0.85
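To see what `cor()` is doing, the same value can be computed by hand from the standard sample formula for Pearson's correlation, $R_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1) S_x S_y}$. The sketch below types the grocery values in as plain vectors so it runs on its own; the names `x`, `y`, and `r_by_hand` are for illustration only.

```r
# Grocery data from the table earlier in the lecture
x <- c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937)  # sales per sq. ft.
y <- c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)               # planned openings

n <- length(x)

# Sum of products of deviations from the means,
# divided by (n - 1) times both standard deviations
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

round(r_by_hand, 2)              # same value as cor() above: 0.85
all.equal(r_by_hand, cor(x, y))  # confirms the two calculations agree
```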

Interpreting $R_{xy}$, the correlation coefficient

The most extreme $R_{xy}$ values represent ‘perfectly correlated data’:

Very Strongly Correlated Data

$R_{xy} = 1$ or $R_{xy} = -1$ is unrealistic. These correlations are both strong and realistic:

Range of $R_{xy}$ Guidelines for Interpretation

Example of Negative Correlation

Lecture 20 In-class Exercises - Q3

What is the correlation between Year and Rural_Pct in the urban_rural dataset?

Hint: This correlation is almost perfect.

Round answer to three decimal places.

When NOT to use $R_{xy}$

$R_{xy}$ is only valid when examining linear relationships.

If the data have a curvilinear relationship, there are other tools that will be covered in other courses.
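A small made-up example shows why this matters. In the sketch below, `y` is completely determined by `x`, but the relationship is a curve rather than a line. The data are hypothetical and chosen only to make the point.

```r
# Hypothetical data: a perfect U-shaped (curvilinear) relationship
x <- -5:5
y <- x^2

# Pearson's correlation only measures LINEAR association,
# so it is essentially zero here even though y depends entirely on x
r_curve <- cor(x, y)
r_curve
```

A correlation near zero therefore does not mean "no relationship". It means "no linear relationship", which is why plotting the data first is so important.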

Calculating Covariance from Correlation

$R_{xy}$, the correlation, is straightforward to interpret because it is unitless.
$R_{xy}$ is ALWAYS between -1 and 1 and is always interpreted the same way.
Another measure, covariance, is also useful for calculations.
In Lecture 21, we will cover how to create and examine a linear combination of multiple variables.
- Example: Mutual funds and stock portfolios are linear combinations of stocks.
- In order to examine linear combinations of variables we first calculate their covariance:
Covariance of two variables, X and Y:
- $COV_{xy} = R_{xy} \times S_{x} \times S_{y}$
- $R_{xy} = \frac{COV_{xy}}{S_{x} \times S_{y}}$
  - $S_{x}$ is the standard deviation of x and $S_{y}$ is the standard deviation of y.
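The difference between the unitless correlation and the unit-dependent covariance can be checked directly. This sketch rescales the grocery sales values by a factor of 100 (a hypothetical change of units, used only for illustration) and compares both measures before and after.

```r
# Grocery data from the table earlier in the lecture
x <- c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937)  # sales per sq. ft.
y <- c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)               # planned openings

# Hypothetical unit change: express x on a scale 100 times larger
x_scaled <- x * 100

# Correlation is unitless, so it does not change
cor(x, y)
cor(x_scaled, y)

# Covariance carries the units of x and y, so it grows by the same factor of 100
cov(x, y)
cov(x_scaled, y)
```

This follows from the formula $COV_{xy} = R_{xy} \times S_{x} \times S_{y}$: rescaling x multiplies $S_{x}$ by 100 while $R_{xy}$ and $S_{y}$ stay the same.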

Calculating $COV_{xy}$ from the Data or $R_{xy}$

Below I show Covariance/Correlation calculations using the Grocery Data.

In HW 7 you will use these formulas because you don’t have the data.

ALSO: Remember that if you are given variance (which you are),

Standard Deviation is the Square Root of Variance
R command to find Square Root is sqrt()

```{r echo=T}
Rxy <- cor(grocery$sales_sq_ft, grocery$openings) # correlation  
Sx <- sd(grocery$sales_sq_ft)                     # sd of x
Sy <- sd(grocery$openings)                        # sd of y

Rxy*Sx*Sy                                         # calculate cov from Rxy and SD

cov(grocery$sales_sq_ft, grocery$openings)        # calculate cov from the data

cov(grocery$sales_sq_ft, grocery$openings)/(Sx * Sy) # calculate Rxy from cov
```

[1] 1754.144
[1] 1754.144
[1] 0.8517842
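When only summary statistics are available, the same formulas apply once each variance is converted to a standard deviation. The numbers in this sketch are hypothetical (they are not taken from HW 7 or from the grocery data), and the variable names are for illustration.

```r
# Hypothetical summary statistics (not from the homework)
var_x <- 25    # variance of x
var_y <- 4     # variance of y
Rxy   <- 0.6   # correlation between x and y

# Standard deviation is the square root of variance
Sx <- sqrt(var_x)
Sy <- sqrt(var_y)

# Convert correlation to covariance
cov_xy <- Rxy * Sx * Sy
cov_xy

# Convert back from covariance to correlation as a check
cov_xy / (Sx * Sy)
```

A common mistake is to multiply the correlation by the variances instead of the standard deviations, so taking the square roots first is a useful habit.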

Key Points from Today

This short lecture is an introduction to linear associations between variables.
We will continue this discussion in Lecture 21 when we examine linear combinations of variables.
- This topic will provide insight into Portfolio Management.
For now, you are expected to understand
- How to interpret a scatterplot
- Calculating $R_{xy}$ in R using the cor command
- Interpreting $R_{xy}$
- When NOT to use $R_{xy}$ to examine data associations
- How to convert $R_{xy}$ to $COV_{xy}$ and vice versa

To submit an Engagement Question or Comment about material from Lecture 20: Submit it by midnight today (day of lecture).

