Lecture 20 - Introduction to Correlation and Covariance

Penelope Pooler Eisenbies
MAS 261

2023-11-02

Housekeeping

  • Today’s plan 📋

    • Comments and Questions from Engagement Questions or about R

    • Upcoming Dates

    • Review and New Questions

      • Two Sided Test of Proportions and Contingency Tables

        • Row Percentages and Column Percentages
    • Understanding Correlation

      • Examining correlations visually and quantitatively

        • Slope and Strength of Relationship
      • Estimating correlation quantitatively

    • Converting Correlation to Covariance

      • Why and How

        • Conversion Formulas

Review: R and RStudio 🪄

  • Review: You have two options to facilitate your introduction to R and RStudio:

  • If you are comfortable with coding: Start with Option 1, but still sign up for Posit Cloud account.

    • We will use Posit Cloud for Quizzes.
  • If you are nervous about coding: Choose Option 2.

  • For both options: I can help with download/install issues during office hours.

  • What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.

  • NOTE: We will use R and RStudio in class during MOST lectures

    • You can use either Posit Cloud or your laptop.

Upcoming Dates

  • HW 6 is due 11/1 (Grace period ends 11/3)

    • Demo videos are posted on Blackboard

    • This assignment seems long but it’s not.

    • It consists of just three hypothesis tests with questions about each test.

    • Most questions are multiple choice, but do not just guess and keep trying.


  • HW 7 is now posted and is due Wed. 11/8 at midnight.

  • Test 2 is on November 14th and will include material up through Lecture 20

  • Lecture 21 - Intro to Portfolio Management will be on Final Exam, not on Test 2.

Review and NEW: Do Gen-Zs and Millennials differ from Gen-Xers with respect to daylight savings?

Should the USA Eliminate Daylight Savings Clock Changes?

Age            Yes    No/Not Sure    Row Totals
18-44          228    205            433
45-64          201    118            319
Col. Totals    429    323            752

Column and Row Percentages

Original Data

Should the USA Eliminate Daylight Savings Clock Changes?

Age            Yes    No/Not Sure    Row Totals
18-44          228    205            433
45-64          201    118            319
Col. Totals    429    323            752

Row Percentages: Percentage of each age group that said ‘Yes’ or ‘No/Not Sure’. Each row sums to 100%.

Row Percentages

Age      Yes (%)    No/Not Sure (%)
18-44    52.66      47.34
45-64    63.01      36.99

Column Percentages: Percentage of each response (‘Yes’ or ‘No/Not Sure’) that came from each age group. Each column sums to 100%.

Column Percentages

Age      Yes (%)    No/Not Sure (%)
18-44    53.15      63.47
45-64    46.85      36.53
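
For anyone curious how the percentage tables are produced, here is a minimal sketch in R. It assumes the counts are typed into a matrix by hand; the object names (dst, row_pct, col_pct) are my own choices for illustration and are not from the course files.

```r
# Counts from the survey table: rows are age groups, columns are responses
dst <- matrix(c(228, 205,
                201, 118),
              nrow = 2, byrow = TRUE,
              dimnames = list(Age = c("18-44", "45-64"),
                              Response = c("Yes", "No/Not Sure")))

# Row percentages: divide each count by its row total (each row sums to 100)
row_pct <- round(100 * prop.table(dst, margin = 1), 2)

# Column percentages: divide each count by its column total (each column sums to 100)
col_pct <- round(100 * prop.table(dst, margin = 2), 2)

row_pct
col_pct
```

In prop.table(), margin = 1 works across rows and margin = 2 works down columns, which is the only difference between the two tables.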

💥 Lecture 20 In-class Exercises - Q1 and Q2 💥

Review data with new concepts - Use tables on previous slide


Question 1. What percentage of all the ‘Yes, let’s end daylight savings’ votes are in the 45-64 age group?

Round percentage to one decimal place.


Question 2. What percentage of all 18-44 year olds said ‘No’ or ‘Not sure’ when asked if they want to eliminate daylight savings?

Round percentage to one decimal place.


Note: There will be homework questions providing more practice on relating questions to these percentage tables.

  • Row and column percentages can be calculated from raw data, but I provide them.

  • These questions focus on interpretation instead of arithmetic.

Linear Correlations

  • The last part of the course will focus on understanding linear relationships between two or more quantitative variables.


  • We will introduce the first part of this topic today and continue with this next week.


  • Often if we have two quantitative variables we want to understand the extent to which they are associated.

    • The first step is often to plot the data using a scatterplot.

    • We can also use quantitative measures of association to understand these relationships.

Grocery Sales per Sq. Ft. and Planned Store Openings

Understanding Linear Relationships

chain                      sales_sq_ft    openings
Roundy's                   393            2
Weis Markets               325            3
Natural Grocers            419            5
Ingles                     325            10
Kroger                     496            15
Harris Teeter's            442            20
Fresh Market               490            20
Sprouts Farmer's Market    490            20
Publix                     552            30
Whole Foods                937            38
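
Here is one way the table above could be entered and plotted in R. This is a sketch that assumes the data are typed in by hand into a data frame named grocery (the name used in the R commands later in this lecture); the plot labels are my own wording.

```r
# Grocery data typed in from the table above
grocery <- data.frame(
  chain = c("Roundy's", "Weis Markets", "Natural Grocers", "Ingles", "Kroger",
            "Harris Teeter's", "Fresh Market", "Sprouts Farmer's Market",
            "Publix", "Whole Foods"),
  sales_sq_ft = c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937),
  openings = c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)
)

# Scatterplot: X = sales per square foot, Y = planned store openings
plot(grocery$sales_sq_ft, grocery$openings,
     xlab = "Sales per Sq. Ft.",
     ylab = "Planned Store Openings",
     main = "Grocery Sales per Sq. Ft. and Planned Store Openings")
```

Each point is one grocery chain. Looking at the plot first makes it easier to judge whether a straight-line summary is reasonable.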

Direction of the Relationship

As X (sales per square foot) increases, Y (planned store openings) also increases.


When Y increases with X in an approximately linear fashion, that is a

  • POSITIVE LINEAR RELATIONSHIP

    • The trend has a positive slope.

Strength of the Linear Relationship

In addition to determining if there is a positive or negative relationship,

  • We also want to quantify how strong the relationship is.


To quantify the strength of a linear relationship, we calculate:

  • Pearson’s correlation coefficient, \(R_{xy}\).

  • \(R_{xy} = 0.85\)

  • How do we interpret this value?

    • …Spoiler: This is a strong positive correlation!
[1] 0.8517842
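
To see what goes into this number, here is a small sketch that computes the correlation for the grocery data two ways: step by step from standardized values, and with the built-in cor command. The vector names (x, y, zx, zy, r_by_hand) are my own and are used only for illustration.

```r
x <- c(393, 325, 419, 325, 496, 442, 490, 490, 552, 937)  # sales_sq_ft
y <- c(2, 3, 5, 10, 15, 20, 20, 20, 30, 38)               # openings

n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Pearson's correlation: sum of the products of standardized values, over n - 1
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand    # step-by-step version
cor(x, y)    # built-in version; the two should agree
```

Because both variables are standardized before multiplying, the units cancel, which is why the result does not depend on how sales or openings are measured.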

Interpreting \(R_{xy}\), the correlation coefficient

\(R_{xy}\) ranges from -1 to 1.

  • The most extreme \(R_{xy}\) values represent ‘perfectly correlated data’:

Very Strongly Correlated Data

\(R_{xy} = 1\) or \(R_{xy} = -1\) is unrealistic. These correlations are both strong and realistic:

Range of \(R_{xy}\) Guidelines for Interpretation

Example of Negative Correlation

💥 Lecture 20 In-class Exercises - Q3 💥

What is the correlation between Year and Rural_Pct in the urban_rural dataset?

Hint: This Correlation is almost perfect.

Round answer to three decimal places.

When NOT to use \(R_{xy}\)

\(R_{xy}\) is only valid when examining linear relationships.

If the data have a curvilinear relationship, there are other tools that will be covered in other courses.
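
A small made-up example shows why this matters. In the sketch below, y is completely determined by x, but the relationship is a curve, not a line. The data are invented for illustration and are not from the course datasets.

```r
# Made-up data: y depends perfectly on x, but along a curve (a parabola)
x <- -5:5
y <- x^2

# The correlation is essentially zero even though x and y are strongly related
cor(x, y)
```

A correlation near zero means there is no linear trend. It does not mean there is no relationship, which is one more reason to look at the scatterplot first.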

Calculating Covariance from Correlation

  • \(R_{xy}\), the correlation, is straightforward to interpret because it is unitless.

  • \(R_{xy}\) is ALWAYS between -1 and 1 and interpreted the same way.

  • Another measure, Covariance, is also useful for calculations.

  • In Lecture 21, we will cover how to create and examine a linear combination of multiple variables.

    • Example: Mutual funds and stock portfolios are linear combinations of stocks.

    • In order to examine linear combinations of variables we first calculate their covariance:

  • Covariance of two variables, x and y:

    • \(COV_{xy} = R_{xy} \times S_{x} \times S_{y}\)

    • \(R_{xy} = \frac{COV_{xy}}{S_{x} \times S_{y}}\)

      • \(S_{x}\) is the standard deviation of x
      • \(S_{y}\) is the standard deviation of y.

Calculating \(COV_{xy}\) from the Data or \(R_{xy}\)

Below I show Covariance/Correlation calculations using the Grocery Data

In HW 7 you will use these formulas because you don’t have the data.

ALSO: Remember that if you are given variance (which you are),

  • Standard Deviation is the Square Root of Variance

  • R command to find Square Root is sqrt()

Rxy <- cor(grocery$sales_sq_ft, grocery$openings) # correlation  
Sx <- sd(grocery$sales_sq_ft)                     # sd of x
Sy <- sd(grocery$openings)                        # sd of y

Rxy*Sx*Sy                                         # calculate cov from Rxy and SD
[1] 1754.144
cov(grocery$sales_sq_ft, grocery$openings)        # calculate cov from the data
[1] 1754.144
cov(grocery$sales_sq_ft, grocery$openings)/(Sx * Sy) # calculate Rxy from cov
[1] 0.8517842
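
When only summary statistics are available, the same formulas apply, starting from the variances. Below is a minimal sketch with made-up numbers; the variances and correlation are hypothetical values chosen so the arithmetic is easy to check, and they are not from HW 7 or the grocery data.

```r
# Hypothetical summary statistics (made up for illustration)
var_x <- 16     # variance of x
var_y <- 25     # variance of y
Rxy   <- 0.5    # correlation between x and y

# Standard deviation is the square root of variance
Sx <- sqrt(var_x)    # 4
Sy <- sqrt(var_y)    # 5

# Covariance from correlation: 0.5 * 4 * 5
cov_xy <- Rxy * Sx * Sy
cov_xy

# Correlation from covariance: divides back out to the original Rxy
cov_xy / (Sx * Sy)
```

The most common slip is plugging the variances directly into the formula; take the square roots first.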

Key Points from Today

  • This short lecture is an introduction to linear associations between variables.

  • We will continue this discussion in Lecture 21 when we examine linear combinations of variables

    • This topic will provide insight into Portfolio Management.
  • For now, you are expected to understand

    • How to interpret a scatterplot
    • Calculating \(R_{xy}\) in R using the cor command
    • Interpreting \(R_{xy}\)
    • When NOT to use \(R_{xy}\) to examine data associations
    • How to convert \(R_{xy}\) to \(COV_{xy}\) and vice versa

To submit an Engagement Question or Comment about material from Lecture 20: Submit by midnight today (day of lecture). Click on the link under Lecture 20.