Interpreting \(R_{xy}\), the correlation coefficient
\(R_{xy}\) ranges from -1 to 1.
The most extreme \(R_{xy}\) values represent ‘perfectly correlated data’:
Very Strongly Correlated Data
\(R_{xy} = 1\) or \(R_{xy} = -1\) is unrealistic. These correlations are both strong and realistic:
Range of \(R_{xy}\) Guidelines for Interpretation
Example of Negative Correlation
Lecture 7 In-class Exercises - Q2
What is the correlation between Year and Rural_Pct in the urban_rural dataset?
Hint: This correlation is almost perfect.
Round answer to three decimal places.
Correlation between Height and Mass in Starwars
What is the correlation between height and mass in the starwars data?
Lecture 7 In-class Exercise - Q3-Q4
Question 3. What is the correlation between height and mass in the Star Wars dataset, my_starwars?
Question 4. How strong is this correlation based on the provided guidelines:
When NOT to use \(R_{xy}\)
\(R_{xy}\) is only valid when examining linear relationships.
If the data have a curvilinear relationship, there are other tools that will be covered in other courses.
Key Points from Today
An introduction to R and RStudio in Posit Cloud.
A review of linear associations between variables.
We will continue this discussion in Lecture 8 on Thursday
For now, you are expected to understand
How to open provided files in Posit Cloud
How to interpret a scatterplot
Calculating \(R_{xy}\) in R using the cor command in R
Interpreting \(R_{xy}\)
When NOT to use \(R_{xy}\) to examine data associations
HW 3 was due 2/3/2025
HW 4 is due 2/12/2025
To submit an Engagement Question or Comment about material from Lecture 7: Submit it by midnight today (day of lecture).
Source Code
---title: "BUA 345 - Lecture 7"subtitle: "Introduction to R/Rstudio in Posit Cloud and Review of Correlation"author: "Penelope Pooler Eisenbies"date: last-modifiedlightbox: truetoc: truetoc-depth: 3toc-location: lefttoc-title: "Table of Contents"toc-expand: 1format: html: code-line-numbers: true code-fold: true code-tools: trueexecute: echo: fenced---## Housekeeping```{r setup, echo=FALSE, warning=F, message=F, include=F}#| include: false# this line specifies options for default options for all R Chunksknitr::opts_chunk$set(echo=F)# suppress scientific notationoptions(scipen=100)# install helper package that loads and installs other packages, if neededif (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")# install and load required packagespacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra, countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt, mosaicData, epiDisplay, vistributions, psych, tidyquant, dygraphs)# verify packages# p_loaded()```**HW 3 is due 2/3/2025****Sign up for a [FREE Posit Cloud Account](https://posit.cloud/plans/free){target="_blank"}**### Today's plan- Introduction to using R and RStudio- Review of correlation, $R_{XY}$#### Lecture 8 plan (Preview) :clipboard:- Review of Simple Linear Regression - Function vs. Model - Examining Real Data - Creating a Model - Interpreting a Regression Model::: fragment**In-class Polling (Session ID: bua345s25)**:::## ### Lecture 8 In-class Exercise - Q1Recall the Lecture 6 ‘Weather’ worksheet which is the ‘Lecture 7 Review Worksheet’.The first and second inputs for the `VLOOKUP` command in cell H4, are:::: nonincremental- Where the input reference value is located- Where the data to be searched are:::**Which choice below contains the correct first and second inputs?**- **HINT: You may use `=FORMULATEXT(H4)` to check your answer.**::: nonincremental=VLOOKUP(H2, A2:E91,…=VLOOKUP(H3, A1:E91,…=VLOOKUP(H4, A2:E91,…=VLOOKUP(H2, B1:E91,…=VLOOKUP(H3, B2:E91,…=VLOOKUP(,H4, B2:E9,…:::## R and RStudio- In this course we will use R and RStudio for the predictive analytics lectures.- You will access R and RStudio through **Posit Cloud**. - Sign up for a [Free Posit Cloud Account](https://posit.cloud/plans/free){target="_blank"}- I will post R/RStudio files on Posit Cloud that you can access in provided links.- I will also provide demo videos that show how to access files and complete exercises.- NOTE: The free Posit Cloud account is limited to 25 hours per month. - I demo how to download completed work so that you can use this allotment efficiently.- We will also use Posit cloud for quiz questions of predictive analytics skills.- For those who want to download R and RStudio (not required): - There is an information page on my course website, [Installing R and RStudio](https://penelope2040.quarto.pub/bua-345/#installing-r-and-rstudio){target="_blank"}## ### Opening a Posit Cloud Link**Always click 'Save a Permanent Copy'** so you don't lose your work.{fig-align="center"}## ### Helpful Global Options - GeneralClick `Tools`\>`Global Options`. The next few slides are helpful reference but are not required.:::::: columns::: {.column width="38%"}- On the `Save workspace...` line choose `Never`.- Your work can still be saved by clicking - `Ctrl + S` or `Cmd + S`.:::::: {.column width="4%"}:::::: {.column width="58%"}{fig-align="center"}:::::::::## :::::: columns::: {.column width="38%"}### Helpful Global Options - Code- On the `Editing` tab, select the `Use native pipe operator` option.- On the `Display` tab, select all 3 options under `Syntax`.:::::: {.column width="4%"}:::::: {.column width="58%"}{fig-align="center" height="2in"}{fig-align="center" height="4.5in"}:::::::::## :::::: columns::: {.column width="38%"}### Helpful Global Options - Appearance- The default white appearance can cause eye strain more quickly.- You can choose a different `Editor Theme`. - I prefer `Tomorrow Night Blue`.:::::: {.column width="4%"}:::::: {.column width="58%"}{fig-align="center"}:::::::::## :::::: columns::: {.column width="48%"}### Helpful Global Options - R Markdown- On the Basic tab, next to `Show in document outline` select `Sections and All Chunks`.- On the `Visual` tab: - check box next to `Show line numbers in code blocks`. - next to `Editor content width (px)`, change the value to `900`.- When you're done selecting all options, click `OK` at the bottom.:::::: {.column width="4%"}:::::: {.column width="48%"}{fig-align="center"}{fig-align="center"}{fig-align="center"}:::::::::## ### A brief Tour of the Screen and Panels:::::: columns::: {.column width="28%"}When you open a provided project link you see- the `Console` in the left panel- the `Global Environment` in the upper right panel- `Files` (and other options) in the lower right panel:::::: {.column width="4%"}:::::: {.column width="68%"}{fig-align="center"}:::::::::## ### Appearance with Quarto (`.qmd`) File OpenProvided `.qmd files` appear in the upper left panel above the `Console`.{fig-align="center"}## ### Running the `Setup` Code ChunkWhenever you begin working with a provided code file, click the `green triangle` in the `Setup` chunk to setup options and load and install packages.{fig-align="center"}{fig-align="center"}## Review of Linear Correlations- In your prerequisite course for BUA 345, you covered linear relationships between two or more quantitative variables.<br>- We will introduce the review this material this week while introducing R and RStudio.<br>- Often if we have two quantitative variables we want to understand the extent to which they are associated. - The first step is often to plot the data using a scatterplot. - We can also use quantitative measures of association to understand these relationships.## #### Grocery Sales per Sq. Ft. and Planned Store Openings```{r}grocery <-read_csv("data/grocery.csv", show_col_types = F)(grocery_plot <- grocery |>ggplot(aes(x=sales_sq_ft, y=openings, color=openings)) +geom_point(size=4, show.legend = F) +#geom_smooth(method = lm, color="red", se=F, linetype="dashed") +labs(x="Sales per sq. foot", y="Planned Store Openings", title="Relationship between Grocery Sales and Expansion") +theme(legend.position ="none",plot.title =element_text(size =20),axis.title =element_text(size=18),axis.text =element_text(size=15),legend.text =element_text(size =12),legend.title =element_text(size =15)) +theme_classic()) #ggsave("img/grocery_scatterplot.png", width=6, height=4)```## Understanding Linear Relationships:::::: columns::: {.column width="48%"}```{r}kable(grocery)```:::::: {.column width="2%"}:::::: {.column width="50%"}{fig-align="center"}:::::::::## Direction of the Relationship:::::: columns::: {.column width="48%"}<br>As X (sales per square feet) increases, Y (planned store openings) also increases.<br>When Y increases with X in an approximately linear fashion, that is a- POSITIVE LINEAR RELATIONSHIP - **The trend has a positive slope.**:::::: {.column width="2%"}:::::: {.column width="50%"}```{r message=F}(grocery_plot <- grocery |> ggplot(aes(x=sales_sq_ft, y=openings, color=openings)) + geom_point(size=4, show.legend = F) + geom_smooth(method = lm, color="red", se=F, linetype="dashed") + labs(x="Sales per sq. foot", y="Planned Store Openings", title="Relationship between Grocery Sales and Expansion") + theme(legend.position = "none", plot.title = element_text(size = 20), axis.title = element_text(size=18), axis.text = element_text(size=15), legend.text = element_text(size = 12), legend.title = element_text(size = 15)) + theme_classic()) #ggsave("img/grocery_scatterplot_w_line.png", width=6, height=4)```:::::::::## Strength of the Linear Relationship:::::::::: columns:::: {.column width="48%"}In addition to determining if there is a positive or negative relationship,- We also want to quantify, how strong the relationship is.<br>::: fragmentTo quantify the strength a linear relationship, we calculate::::- Pearson's correlation coefficient, $R_{xy}$.- $R_{xy} = 0.85$- How do we interpret this value? - ...Spoiler: This a strong positive correlation!::::::: {.column width="2%"}::::::::: {.column width="50%"}{fig-align="center"}<br>::::: {.fragment .fade-in}:::: {.fragment .grow}::: {.fragment .shrink}```{r echo=T}cor(grocery$sales_sq_ft, grocery$openings)```::::::::::::::::::::::::::::## ### Interpreting $R_{xy}$, the correlation coefficient$R_{xy}$ ranges from -1 to 1.- The most extreme $R_{xy}$ values represent 'perfectly correlated data':{fig-align="center"}## Very Strongly Correlated Data$R_{xy} = 1$ or $R_{xy} = -1$ is unrealistic. These correlations are both strong and realistic:{fig-align="center"}## ### Range of $R_{xy}$ Guidelines for Interpretation{fig-align="center"}## Example of Negative Correlation```{r}urban_rural <-read_csv("data/Urban_Rural.csv", show_col_types = F) |>filter(Year >=1830)(rural_plot <- urban_rural |>ggplot(aes(x=Year, y=Rural_Pct, color=Rural_Pct)) +geom_point(size=4, show.legend = F) +labs(x="Year", y="Percent of People in Rural Areas", title ="Transition Away from Rural Living in USA") +theme(legend.position ="none",plot.title =element_text(size =20),axis.title =element_text(size=18),axis.text =element_text(size=15),legend.text =element_text(size =12),legend.title =element_text(size =15)) +scale_x_continuous(breaks=seq(1830, 2010,20)) +theme_classic()) # ggsave("img/rural_pct_usa.png")```## ### Lecture 7 In-class Exercises - Q2:::::: columns::: {.column width="48%"}**What is the correlation between `Year` and `Rural_Pct` in the `urban_rural` dataset?**Hint: This correlation is almost perfect.Round answer to three decimal places.:::::: {.column width="2%"}:::::: {.column width="50%"}{fig-align="center"}:::::::::## ### Correlation between Height and Mass in StarwarsWhat is the correlation between height and mass in the starwars data?```{r, message=F}my_starwars <- starwars |> filter(mass <= 1000) # removes missing values and Jabba(sw_plot <- my_starwars |> ggplot(aes(x=height, y=mass)) + geom_point(color="blue", size=3) + geom_smooth(method = lm, color="red", se=F, linetype="dashed") + labs(x="Height (cm)", y="Mass (kg)", title="Relationship between Height and Mass - Star Wars", caption="Extreme Outlier, Jabba the Hut, excluded.") + theme(legend.position = "none", plot.title = element_text(size = 20), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size=12)) + theme_classic()) #ggsave("img/sw_height_mass.png")```## ### Lecture 7 In-class Exercise - Q3-Q4:::::: columns::: {.column width="40%"}**Question 3.** What is the correlation between height and mass in the Star Wars dataset, `my_starwars`?<br>**Question 4.** How strong is this correlation based on the provided guidelines::::::: {.column width="2%"}:::::: {.column width="58%"}{fig-align="center"}:::::::::## When NOT to use $R_{xy}$$R_{xy}$ is only valid when examining linear relationships.If the data have a curvilinear relationship, there are other tools that will be covered in other courses.{fig-align="center"}## ### Key Points from Today- An introduction to R and RStudio in Posit Cloud.- A review of linear associations between variables.- We will continue this discussion in Lecture 8 on Thursday- For now, you are expected to understand - How to open provided files in Posit Cloud - How to interpret a scatterplot - Calculating $R_{xy}$ in R using the `cor` command in R - Interpreting $R_{xy}$ - When NOT to use $R_{xy}$ to examine data associations::: fragment**HW 3 was due 2/3/2025****HW 4 is due 2/12/2025****To submit an Engagement Question or Comment about material from Lecture 7:** Submit it by midnight today (day of lecture).:::