MAS 261 - Lecture 3

Measures of Variability

Author

Penelope Pooler Eisenbies

Published

September 2, 2024

Housekeeping

Today’s plan
- Review Question about Measures of Central Tendency
- A few minutes for R Questions 🪄
- Measures of Variability
  - How do we determine variability (spread)
  - Sample and Population measures
- Examining Data Variability Visually
  - Boxplots
- In-class Exercises

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 3 In-class Exercises - Q1

Session ID: MAS261f24

In Lecture 2 we discussed the measures of central tendency, the mean, median, and mode.

Which of these measures can ALWAYS be found and is a good measure of central tendency regardless of whether or not the data has extreme values (high or low extremes), referred to as outliers?

A. Mode

B. Mean

C. Median

Measures of Variability and Deviations

Measures of Central Tendency such as MEAN and MEDIAN indicate where the center of the data is located.
Measures of Variability indicate how spread out the data observations are.
The building block used to calculate many measures of variability is called a DEVIATION.
- DEVIATION: How far an observation is above or below the data mean.
- DEVIATION (of an observation) = Observation - mean
Demonstration of calculating deviations (2 options shown):
- Notice that deviations are both positive (above the mean) and negative (below the mean)

R Demo of Calculating Deviations

Notice that some deviations are positive and others are negative.
A positive deviation indicates an observation is above the mean.
A negative deviation indicates an observation is below the mean.

	mpg	mean_mpg	mpg_dev
Mazda RX4	21.0	20.09062	0.909375
Mazda RX4 Wag	21.0	20.09062	0.909375
Datsun 710	22.8	20.09062	2.709375
Hornet 4 Drive	21.4	20.09062	1.309375
Hornet Sportabout	18.7	20.09062	-1.390625
Valiant	18.1	20.09062	-1.990625
Duster 360	14.3	20.09062	-5.790625
Merc 240D	24.4	20.09062	4.309375
Merc 230	22.8	20.09062	2.709375
Merc 280	19.2	20.09062	-0.890625
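
The table above can be reproduced with a few lines of base R. This is a minimal sketch rather than the code behind the slide: it assumes only the built-in `mtcars` dataset, and the names `mpg`, `mean_mpg`, and `mpg_dev` are chosen here for illustration.

```r
# built-in mtcars data; mpg is miles per gallon for 32 cars
mpg <- mtcars$mpg
names(mpg) <- rownames(mtcars)   # label each value with its car name

mean_mpg <- mean(mpg)            # sample mean of mpg
mpg_dev  <- mpg - mean_mpg       # deviation = observation - mean

head(mpg_dev, 10)                # first 10 deviations, some positive, some negative
```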

Total Sum of Squares (TSS)

We want a measure of the overall spread of the data.

If we summed all of these deviations, we would get zero, because the positive and negative deviations cancel out.
We could sum the absolute values of the deviations, but the underlying math (Calculus) wouldn’t work well.
INSTEAD we sum the SQUARED deviations and take the square root at the end of our calculations.

TSS (Total Sum of Squares) is the total variability of a variable. It’s the sum of the squared deviations.
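
A short sketch of why the deviations are squared, assuming the built-in `mtcars` data (the variable names are illustrative): the raw deviations cancel out, while the squared deviations add up to the TSS.

```r
mpg     <- mtcars$mpg
mpg_dev <- mpg - mean(mpg)    # deviations from the mean

sum(mpg_dev)                  # essentially zero (up to rounding error)

tss <- sum(mpg_dev^2)         # Total Sum of Squares: sum of squared deviations
tss
```

The value of `tss` computed this way should agree with the variance multiplied by n - 1.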

Variance (Var)

Sample Variance (Var): Typical value of squared deviation in the data.

R command for Variance: var
$Var = \frac{TSS}{n-1}$ where n is the number of observations in the variable.
$TSS = Var \times (n-1)$
In HW 2 you will calculate TSS from Variance using this simple equation.

```{r variance and TSS of mpg, echo=T}
var(my_cars$mpg)               # variance of mpg
var(my_cars$mpg)*31            # tss of mpg
```

[1] 36.3241
[1] 1126.047

Standard Deviation

Sample Standard Deviation (SD): Typical value of deviation in the data

$SD = \sqrt{Var} = \sqrt{\frac{TSS}{n-1}}$
$Var = SD^2$
R command: sd
Sample Standard Deviation is the measure we use most often in MAS 261.

```{r sd and var calculations, echo=T}
sd(my_cars$mpg)             # std deviation calculation
(sd(my_cars$mpg))^2         # calculating variance from std. dev
```

[1] 6.026948
[1] 36.3241

Coefficient of Variation (CV) and Range

Coefficient of Variation (CV): provides a measure of variability in the data with scale (units) factored out.

Ideal for directly comparing variability in data with different units, e.g. US $ and European €.

$CV = \frac{SD}{\overline{X}}$, SD divided by sample mean, xbar
- In R: sd()/mean(), where the dataset and variable are specified in the parentheses.

Range: is NOT calculated based on deviations

Range = data maximum minus data minimum
In R, the range command outputs the data minimum and maximum

```{r cv and range, echo=T}
sd(my_cars$mpg)/mean(my_cars$mpg)       # cv calculation
range(my_cars$mpg)
```

[1] 0.2999881
[1] 10.4 33.9

Relationships between Measures of Variability

For TSS, Variance, Standard Deviation, and CV, deviations are the building blocks
$Var = \frac{TSS}{n-1}$ and $TSS = Var \times (n-1)$
$SD = \sqrt{Var}$ and $Var = SD^2$
$CV = \frac{SD}{\overline{X}}$ and $SD = CV \times \overline{X}$
- Recall that $\overline{X}$ is the symbol for the sample mean.

The two measures we will use most often are Standard Deviation (SD) and CV.
Later in the course we will also use Variance.
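
The relationships above can be checked by hand in R. This is a sketch using the built-in `mtcars` data; the names `x`, `n`, `tss`, `var_x`, `sd_x`, and `cv_x` are illustrative and not part of the lecture code.

```r
x <- mtcars$mpg
n <- length(x)                  # number of observations

tss   <- sum((x - mean(x))^2)   # sum of squared deviations
var_x <- tss / (n - 1)          # Var = TSS / (n - 1)
sd_x  <- sqrt(var_x)            # SD  = sqrt(Var)
cv_x  <- sd_x / mean(x)         # CV  = SD / sample mean

# each hand calculation should agree with the built-in R command
all.equal(var_x, var(x))
all.equal(sd_x, sd(x))
all.equal(cv_x * mean(x), sd(x))
```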

Variability Calculations in Excel

All of the calculations shown today (and in Lecture 2) can be done in Excel.
For large datasets, however, it makes sense to use R or another coding language.
Starting with Lecture 4, Excel will become less practical to use.
The R dataset, mtcars has been exported to Excel and can be accessed here.

Sample and Population measures

In R, it is assumed that you are calculating measures of variability from a sample.
- These are sample STATISTICS.
In Excel, you have to specify sample calculations by using =stdev.s and =var.s.
We will USUALLY not use the population calculations in this course.
- If you use Excel, DO NOT USE =stdev.p and =var.p unless specified.
Be aware that THEY ARE SLIGHTLY DIFFERENT: the sample formulas divide by n-1, while the population formulas divide by n.
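
To see how much the two versions differ, here is a hedged sketch with the built-in `mtcars` data. Base R's `var` and `sd` use the sample formulas, so the population versions are computed by hand; the names `var_pop` and `sd_pop` are made up for this example.

```r
x <- mtcars$mpg
n <- length(x)

var_sample <- var(x)                     # TSS / (n - 1), like =var.s in Excel
var_pop    <- var_sample * (n - 1) / n   # TSS / n,       like =var.p in Excel

sd_sample  <- sd(x)                      # like =stdev.s in Excel
sd_pop     <- sqrt(var_pop)              # like =stdev.p in Excel

c(var_sample, var_pop)                   # population value is slightly smaller
c(sd_sample, sd_pop)
```

The gap between the two shrinks as n grows, since (n - 1)/n gets closer to 1.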

Lecture 3 In-class Exercises - Q2

Session ID: MAS261f24

For these exercises we will use the R starwars dataset. You can use Excel or R, but I recommend R.

The R dataset, mtcars has been exported to Excel and can be accessed here.
- This file shows the Excel formulas for all the measures in Lectures 2 and 3.
The R dataset starwars has been exported to Excel and can be accessed here.

If using R, run the following code to save the starwars dataset to the Global Environment and filter out missing values in the height data.

```{r starwars exercise data, echo=T}
my_starwars <- starwars |> 
  filter(!is.na(height)) |> 
  select(name, height, species)
```

How many observations (rows) are in this filtered dataset with three variables?

Lecture 3 In-class Exercises - Q3

Session ID: MAS261f24

Recall that the deviation for an individual observation is observation minus mean.

What is the deviation from the mean for the height of R2-D2? Round answer to closest whole number and include negative sign if needed.

Height for R2-D2 is 96 cm. Use this code to answer this question.

96 - mean(my_starwars$height)

Lecture 3 In-class Exercises - Q4

Session ID: MAS261f24

Recall that $TSS = Variance\times(n-1)$ and $Variance = \frac{TSS}{n-1}$

What is the TSS, the total sum of the squared deviations for the height variable in the StarWars dataset? Round answer to closest whole number.

To find this answer, find the variance of height and multiply it by the sample size minus 1.

var(my_starwars$height)*80

Lecture 3 In-class Exercises - Q5

Session ID: MAS261f24

Recall that $CV = \frac{SD}{\overline{X}}$, Standard Deviation divided by sample mean, $\overline{X}$.

CV is useful for comparing data with different units, e.g. centimeter data to inches, or US dollars to British pounds because the units are factored out.

What is the CV, the coefficient of variation for the height variable in the StarWars dataset? Round answer to two decimal places.

sd(my_starwars$height)/mean(my_starwars$height)

Min, Max, Median, Quartiles, and Percentiles

There are values in the data that are used to understand its variability and distribution.

We’ve already discussed these values:

Minimum: lowest value in the data
Maximum: highest value in the data
Range: Maximum minus Minimum
Median: The middle value or average of two middle values
- Also referred to as the 50th percentile or the 2nd Quartile
- 50% (two quarters) of the observations are below this value
- 50% (two quarters) of the observations are above this value

Two Additional Informative Values:

25th percentile, also called the 1st Quartile: 25% of the data is below this value
75th percentile, also called the 3rd Quartile: 75% of the data is below this value
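
As a sketch, base R's `quantile` command can find these percentiles directly. This command is not shown in the lecture demos, so treat it as an optional extra; the example assumes the built-in `mtcars` data and R's default quantile method.

```r
mpg <- mtcars$mpg

# 25th, 50th, and 75th percentiles (Q1, median, Q3)
q <- quantile(mpg, probs = c(0.25, 0.50, 0.75))
q

min(mpg)                      # minimum
max(mpg)                      # maximum
median(mpg)                   # same value as the 50th percentile
```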

Five Number Summary

Five Number Summary:

Minimum, 25th Percentile, Median, 75th Percentile, Maximum
In R, we use the command summary to calculate these values and also get the mean as a bonus.

```{r summary demo, echo=T}
summary(my_cars$mpg)                                                                            
```

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90

NOTE: The summary command ONLY returns these values if R recognizes the data as numeric.

Visualizing Data

In Lecture 4, we will talk more about visualizing data to understand:

Measures of Central Tendency
Measures of Variability
Quartiles and Percentiles (Percentiles are also called Quantiles)
Extreme Values also referred to as Outliers
Comparing different categories

For today, let’s examine how the five number summary can be visualized.
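
A minimal sketch of the idea, using base R's `boxplot` command on the built-in `mtcars` data. This is not the code behind the lecture's real estate figure, and the axis label and title are placeholders chosen for this example.

```r
mpg <- mtcars$mpg

summary(mpg)                  # Min., Q1, Median, Mean, Q3, Max.

# horizontal boxplot: the box spans roughly Q1 to Q3,
# and the line inside the box marks the median
boxplot(mpg, horizontal = TRUE,
        xlab = "Miles per gallon",
        main = "Boxplot of mpg (mtcars)")
```

Comparing the printed summary values to the plot shows where each of the five numbers falls along the axis.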

Five Number Summary and Boxplots

Recall the large real estate dataset from Lecture 2.

Today, we filter this dataset to create realtor1 with:
- two states, Maine and Vermont.
- houses with prices of $1.2 Million or less.
We also make 2 separate datasets:
- realtor_VT includes data for Vermont only
- realtor_ME includes data for Maine only

MAS 261 students are not responsible for R code to import and filter data.

Five Number Summary for Each State

We use the summary command in R on the price data for each state to find summary values:

Minimum, Q1, Median (Q2), Q3, Maximum and Mean

summary(realtor_VT$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10000   89000  200000  250091  343750 1200000

summary(realtor_ME$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14999  174500  349900  391567  569000 1200000

In each state summary, the mean is substantially higher than the median.
Summary values for Vermont are shown on a boxplot (next slide).
In HW 2, you will annotate a boxplot of the data used for that assignment about commute times.
Excel is NOT RECOMMENDED for finding these values!

Boxplot Annotated with Five Number Summary

Five Number summary for Vermont:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10000   89000  200000  250091  343750 1200000

Key Points from Today

Measures of Variability
- Deviation is ‘building block’
  - Deviation = Observation - Mean
  - TSS, Variance, Standard Deviation and CV are all related
  - Range is Maximum minus Minimum
  - Quartiles, Q1, Q2, Q3 also help describe variability
  - summary command shows Min., Q1, Q2, Q3, Max. and Mean
  - Boxplot shows Min., Q1, Q2, Q3, Max.
  - Boxplot shows additional information (Lecture 4)

To submit an Engagement Question or Comment about material from Lecture 3: Submit it by midnight today (day of lecture).
