Lecture 3 - Measures of Variability

Penelope Pooler Eisenbies
MAS 261

2023-09-05

Housekeeping

Today’s plan 📋
- Review Question about Measures of Central Tendency
- A few minutes for R Questions 🪄
- Measures of Variability
  - How do we determine variability (spread)
  - Sample and Population measures
- Examining Data Variability Visually
  - Boxplots
- In-class Exercises

Review: R and RStudio 🪄

Review: You have two options to facilitate your introduction to R and RStudio:
- Option 1: Create a Posit Cloud account and download and install R and RStudio on your laptop.
- Option 2: Start with a free Posit Cloud account and later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
- We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures
- You can use either Posit Cloud or your laptop.

💥 Lecture 3 In-class Exercises - Q1 (Review) 💥

Session ID: mas261f23

In lecture 2 we discussed the measures of central tendency, the mean, median, and mode.

Which of these measures can ALWAYS be found and is a good measure of central tendency regardless of whether or not the data has extreme values (high or low extremes), referred to as outliers?

A. Mode

B. Mean

C. Median

Measures of variability - Deviations are Building Blocks

Measures of Central Tendency such as MEAN and MEDIAN indicate where the center of the data is located.
Measures of Variability indicate how spread out the data observations are.
The building block used to calculate these measures is called a DEVIATION.
- DEVIATION: How far an observation is above or below the data mean.
- DEVIATION (of an observation) = Observation - mean
Demonstration of calculating deviations (2 options shown):
- Notice that deviations are both positive (above the mean) and negative (below the mean)

🤯 R Demo of Calculating Deviations

Notice that some deviations are positive and others are negative.
A positive deviation indicates an observation is above the mean.
A negative deviation indicates an observation is below the mean.

	mpg	mean_mpg	mpg_dev
Mazda RX4	21.0	20.09062	0.909375
Mazda RX4 Wag	21.0	20.09062	0.909375
Datsun 710	22.8	20.09062	2.709375
Hornet 4 Drive	21.4	20.09062	1.309375
Hornet Sportabout	18.7	20.09062	-1.390625
Valiant	18.1	20.09062	-1.990625
Duster 360	14.3	20.09062	-5.790625
Merc 240D	24.4	20.09062	4.309375
Merc 230	22.8	20.09062	2.709375
Merc 280	19.2	20.09062	-0.890625
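
A table like the one above could be built with a few lines of base R. This is only a sketch: it assumes my_cars is a copy of R's built-in mtcars dataset (the values shown match it), and it reuses the column names mean_mpg and mpg_dev from the table.

```r
# Assumption: my_cars is a copy of the built-in mtcars dataset
my_cars <- mtcars

# One row per car: the observation, the overall mean, and the deviation
mpg_table <- data.frame(
  mpg       = my_cars$mpg,
  mean_mpg  = mean(my_cars$mpg),                 # same value repeated in every row
  mpg_dev   = my_cars$mpg - mean(my_cars$mpg),   # observation minus mean
  row.names = rownames(my_cars)
)

head(mpg_table, 10)   # first 10 cars, as in the table above
```

Cars with mpg above the mean get a positive mpg_dev, and cars below the mean get a negative one.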

Total Sums of Squares (TSS)

We want a measure of the overall spread of the data.

If we summed all of these deviations, we would get zero because the positive and negative deviations cancel out.
We could sum the absolute values of the deviations, but the underlying math (Calculus) wouldn’t work well.
INSTEAD we sum the SQUARED deviations and take the square root at the end of our calculations.

TSS (Total Sum of Squares) is the total variability of a variable. It’s the sum of the squared deviations.
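
The reasoning on this slide can be checked directly in R. A minimal sketch, assuming my_cars is a copy of the built-in mtcars dataset; the names mpg_dev and tss are only for illustration.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

mpg_dev <- my_cars$mpg - mean(my_cars$mpg)   # one deviation per car

sum(mpg_dev)            # essentially zero (a tiny rounding error is possible)

tss <- sum(mpg_dev^2)   # squaring removes the signs, so nothing cancels
tss                     # Total Sum of Squares for mpg
```

The raw deviations cancel out, while the squared deviations add up to a positive total that grows as the data become more spread out.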

Variance (Var)

Sample Variance (Var): Typical value of squared deviation in the data.

R command for Variance: var
$Var = \frac{TSS}{n-1}$ where n is the number of observations in the variable.
$TSS = Var \times (n-1)$
In HW 2 you will calculate TSS from Variance using this simple equation.

var(my_cars$mpg)               # variance of mpg

[1] 36.3241

var(my_cars$mpg)*31            # tss of mpg

[1] 1126.047

Standard Deviation

Sample Standard Deviation (SD): Typical value of deviation in the data

$SD = \sqrt{Var} = \sqrt{\frac{TSS}{n-1}}$
$Var = SD^2$
R command: sd
Sample Standard Deviation is the measure we use most often in MAS 261.

sd(my_cars$mpg)             # std deviation calculation

[1] 6.026948

(sd(my_cars$mpg))^2         # calculating variance from std. dev

[1] 36.3241
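
To see how the formula and the sd command line up, the calculation can be followed step by step. A minimal sketch, assuming my_cars is a copy of the built-in mtcars dataset; the names n, tss, var_by_hand and sd_by_hand are only for illustration.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

n   <- length(my_cars$mpg)                        # number of observations
tss <- sum((my_cars$mpg - mean(my_cars$mpg))^2)   # sum of squared deviations

var_by_hand <- tss / (n - 1)       # Var = TSS / (n - 1)
sd_by_hand  <- sqrt(var_by_hand)   # SD = square root of Var

all.equal(var_by_hand, var(my_cars$mpg))   # TRUE
all.equal(sd_by_hand,  sd(my_cars$mpg))    # TRUE
```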

Coefficient of Variation (CV) and Range

Coefficient of Variation (CV): provides a measure of variability in the data with scale (units) factored out.

Ideal for directly comparing variability in data with different units, e.g. US $ and European €.

$CV = \frac{SD}{\overline{X}}$, SD divided by sample mean, xbar
- In R sd()/mean() where the dataset and variable are specified in parentheses.
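
One way to see that units are factored out is to convert a variable to different units and compare. A small sketch, assuming my_cars is a copy of the built-in mtcars dataset; the helper cv() and the variable kml are hypothetical names, and 0.425144 is the conversion factor from miles per US gallon to kilometers per liter.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

# Hypothetical helper: coefficient of variation = SD / mean
cv <- function(x) sd(x) / mean(x)

mpg <- my_cars$mpg       # miles per (US) gallon
kml <- mpg * 0.425144    # the same data in kilometers per liter

sd(mpg); sd(kml)         # the SDs differ because the units differ
cv(mpg); cv(kml)         # the CVs are the same
```

Multiplying every observation by a constant scales the SD and the mean by that same constant, so their ratio does not change.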

Range is NOT calculated based on deviations.

Range = data maximum minus data minimum
In R, the range command outputs the data minimum and maximum, not their difference.

sd(my_cars$mpg)/mean(my_cars$mpg)       # cv calculation

[1] 0.2999881

range(my_cars$mpg)

[1] 10.4 33.9
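
Because range returns two values, one more step is needed to get the range as a single number. A minimal sketch, assuming my_cars is a copy of the built-in mtcars dataset; mpg_range is just an illustrative name.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

mpg_range <- range(my_cars$mpg)   # minimum and maximum, in that order

diff(mpg_range)                        # maximum minus minimum
max(my_cars$mpg) - min(my_cars$mpg)    # same calculation written out
```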

Summary of Relationships between measures of variability

For TSS, Variance, Standard Deviation, and CV, deviations are the building blocks.
$Var = \frac{TSS}{n-1}$ and $TSS = Var \times (n-1)$
$SD = \sqrt{Var}$ and $Var = SD^2$
$CV = \frac{SD}{\overline{X}}$ and $SD = CV \times \overline{X}$
- Recall that $\overline{X}$ is the symbol for the sample mean.

The two measures we will use most often are Standard Deviation (SD) and CV.
Later in the course we will also use Variance.

Variability Calculations in Excel

All of the calculations shown today (and in lecture 2) can be done in Excel.
For large datasets, however, it makes sense to use R or another coding language.
Starting with Lecture 4, Excel will become less practical to use.
The R dataset, mtcars has been exported to Excel and can be accessed here.

A Note About Sample and Population measures

In R, it is assumed that you are calculating measures of variability from a sample.
- These are sample STATISTICS.
In Excel, you have to specify by using =stdev.s and =var.s for sample data.
We will USUALLY not use the population calculations in this course.
- If using Excel, DO NOT USE =stdev.p and =var.p unless specified.
Be aware that the sample and population results ARE SLIGHTLY DIFFERENT: the sample calculations divide by n-1 and the population calculations divide by n.
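
For anyone curious how far apart the two versions are, the population calculations can be done by hand in R. This is only a sketch for comparison, assuming my_cars is a copy of the built-in mtcars dataset; pop_var and pop_sd are illustrative names, not built-in R commands.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

x <- my_cars$mpg
n <- length(x)

var(x)   # sample variance: divides TSS by n - 1 (like =var.s in Excel)
sd(x)    # sample standard deviation (like =stdev.s in Excel)

pop_var <- sum((x - mean(x))^2) / n   # population variance: divides TSS by n (like =var.p)
pop_sd  <- sqrt(pop_var)              # population standard deviation (like =stdev.p)

pop_var
pop_sd
```

The population values are always a little smaller than the sample values, and the gap shrinks as n gets larger.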

💥 Lecture 3 In-class Exercises - Q2 💥

For these exercises we will use the R starwars dataset. You can use Excel or R, but I recommend R.

The R dataset, mtcars has been exported to Excel and can be accessed here.
- This file shows the Excel formulas for all the measures in Lectures 2 and 3.
The R dataset starwars has been exported to Excel and can be accessed here.

If using R, run the following code to save the starwars dataset to the Global Environment and filter out missing values in the height data.

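library(dplyr)   # provides the starwars dataset, filter(), and select()
my_starwars <- starwars |>
  filter(!is.na(height)) |>       # keep only rows where height is not missing
  select(name, height, species)   # keep three variables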

How many observations (rows) are in this filtered dataset with three variables?

💥 Lecture 3 In-class Exercises - Q3 💥

Recall that the deviation for an individual observation is observation minus mean.

What is the deviation from the mean for the height of R2-D2? Round answer to closest whole number and include negative sign if needed.

Height for R2-D2 is 96 cm. Use this code to answer this question.

96 - mean(my_starwars$height)

💥 Lecture 3 In-class Exercises - Q4 💥

Recall that $TSS = Variance\times(n-1)$ and $Variance = \frac{TSS}{n-1}$

What is the TSS, the total sum of the squared deviations, for the height variable in the StarWars dataset? Round answer to closest whole number.

To find this answer:

Find the variance of height and multiply it by the sample size minus 1.

var(my_starwars$height)*80

💥 Lecture 3 In-class Exercises - Q5 💥

Recall that $CV = \frac{SD}{\overline{X}}$, Standard Deviation divided by sample mean, $\overline{X}$.

CV is useful for comparing data with different units, e.g. centimeter data to inches, or US dollars to British pounds because the units are factored out.

What is the CV, the coefficient of variation for the height variable in the StarWars dataset? Round answer to two decimal places

sd(my_starwars$height)/mean(my_starwars$height)

Quartiles, Percentiles, and Five Number Summary

There are other key values in the data that are used to understand its variability and distribution.

We’ve already discussed these values:

Minimum: lowest value in the data
Maximum: highest value in the data
Range: Maximum minus Minimum
Median: The middle value or average of two middle values
- Also referred to as the 50th percentile or the 2nd Quartile
- 50% (two quarters) of the observations are below this value
- 50% (two quarters) of the observations are above this value

Two Additional Informative Values:

25th percentile, also called the 1st Quartile: 25% of the data is below this value
75th percentile, also called the 3rd Quartile: 75% of the data is below this value

Five Number Summary:

Minimum, 25th Percentile, Median, 75th Percentile, Maximum

Five Number Summary in R (with a Bonus Mean)

In R, you can use the command summary to find these values:

summary(my_cars$mpg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90

NOTE: The summary command ONLY shows these values if R recognizes the data as numeric.
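
Individual quartiles and percentiles can also be requested with R's quantile command. A minimal sketch, assuming my_cars is a copy of the built-in mtcars dataset; five_num is just an illustrative name.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

# 25th, 50th, and 75th percentiles (Q1, median, Q3)
quantile(my_cars$mpg, probs = c(0.25, 0.50, 0.75))

# All five values: 0% is the minimum and 100% is the maximum
five_num <- quantile(my_cars$mpg, probs = c(0, 0.25, 0.50, 0.75, 1))
five_num
```

These agree with the summary output above once rounded, and quantile does not include the mean.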

Visualizing Data

In Lecture 4, we will talk more about visualizing data to understand

Measures of Central Tendency
Measures of Variability
Quartiles and Percentiles (both are types of Quantiles)
Extreme Values also referred to as Outliers
Comparing different categories

For today, let’s examine how the five number summary can be visualized.

Five Number Summary and Boxplots

Recall the large real estate dataset from Lecture 2.

Today, we filter this dataset to create realtor1 with:
- two states, Maine and Vermont.
- houses with prices of $1.2 Million or less.
We also make 2 separate datasets:
- realtor_VT includes data for Vermont only
- realtor_ME includes data for Maine only

MAS 261 students are not responsible for R code to import and filter data.

Five Number Summary for Each State

We use the summary command in R on the price data for each state to find summary values:

Minimum, Q1, Median (Q2), Q3, Maximum and Mean

summary(realtor_VT$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10000   89000  200000  250091  343750 1200000

summary(realtor_ME$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14999  174500  349900  391567  569000 1200000

Notice that in each summary, the mean is substantially higher than the median.
Summary values for Vermont are shown on the following box plot.
In HW 1, you will annotate a boxplot of the data used for that assignment about commute times.

Excel is NOT RECOMMENDED for finding these values!

Boxplot Annotated with Five Number Summary

Five Number Summary for Vermont:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10000   89000  200000  250091  343750 1200000
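
The realtor data is not included here, so as a stand-in the sketch below draws and labels a boxplot for mpg, assuming my_cars is a copy of the built-in mtcars dataset. The names bp and five are only for illustration. Note that base R's boxplot draws the box using "hinges", which can differ slightly from the quartiles reported by summary.

```r
my_cars <- mtcars   # assumption: my_cars is a copy of mtcars

# Draw a horizontal boxplot and keep the values R used to draw it
bp <- boxplot(my_cars$mpg, horizontal = TRUE,
              main = "Miles per gallon (mtcars)", xlab = "mpg")

five <- as.vector(bp$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
five

# Label the five values just above the box
text(x = five, y = 1.45,
     labels = c("Min", "Q1", "Median", "Q3", "Max"), cex = 0.8)
```

When a dataset has extreme values, the whiskers stop short of the minimum or maximum and the extreme points are plotted separately, which is part of the additional information covered in Lecture 4.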

Key Points from Today

Continuation of summarizing QUANTITATIVE DATA
- Measures of Variability
  - Deviation is ‘building block’
    - Deviation = Observation - Mean
  - TSS, Variance, Standard Deviation and CV are all related
  - Range is Maximum minus Minimum
- Quartiles, Q1, Q2, Q3 also help describe variability
  - summary command shows Min., Q1, Q2, Q3, Max. and Mean
- Boxplot shows Min., Q1, Q2, Q3, Max.
  - Boxplot shows additional information (Lecture 4)

To submit an Engagement Question or Comment about material from Lecture 3: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 3