Car | mpg | mean_mpg | mpg_dev |
---|---|---|---|
Mazda RX4 | 21.0 | 20.09062 | 0.909375 |
Mazda RX4 Wag | 21.0 | 20.09062 | 0.909375 |
Datsun 710 | 22.8 | 20.09062 | 2.709375 |
Hornet 4 Drive | 21.4 | 20.09062 | 1.309375 |
Hornet Sportabout | 18.7 | 20.09062 | -1.390625 |
Valiant | 18.1 | 20.09062 | -1.990625 |
Duster 360 | 14.3 | 20.09062 | -5.790625 |
Merc 240D | 24.4 | 20.09062 | 4.309375 |
Merc 230 | 22.8 | 20.09062 | 2.709375 |
Merc 280 | 19.2 | 20.09062 | -0.890625 |
MAS 261 - Lecture 3
Measures of Variability
Housekeeping
Today’s plan
Review Question about Measures of Central Tendency
A few minutes for R Questions 🪄
Measures of Variability
How do we determine variability (spread)
Sample and Population measures
Examining Data Variability Visually
- Boxplots
In-class Exercises
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.
Lecture 3 In-class Exercises - Q1
Session ID: MAS261f24
In Lecture 2 we discussed the measures of central tendency, the mean, median, and mode.
Which of these measures can ALWAYS be found and is a good measure of central tendency regardless of whether the data has extreme values (high or low), referred to as outliers?
A. Mode
B. Mean
C. Median
Measures of Variability and Deviations
Measures of Central Tendency such as MEAN and MEDIAN indicate where the center of the data is located.
Measures of Variability indicate how spread out the data observations are.
The building block used to calculate many measures of variability is called a DEVIATION.
- DEVIATION: How far an observation is above or below the data mean.
- DEVIATION (of an observation) = Observation - Mean
Demonstration of calculating deviations (2 options shown):
- Notice that deviations are both positive (above the mean) and negative (below the mean)
R Demo of Calculating Deviations
- Notice that some deviations are positive and others are negative.
- A positive deviation indicates an observation is above the mean.
- A negative deviation indicates an observation is below the mean.
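The deviation calculation can be sketched in base R as follows. This is a minimal illustration, assuming the built-in mtcars dataset (its mpg values match the table of deviations shown in this lecture). The object names mean_mpg, mpg_dev, and dev_table are chosen here for illustration and are not necessarily the ones used in the lecture demo.

```r
# Mean of the mpg variable in the built-in mtcars dataset
mean_mpg <- mean(mtcars$mpg)

# Deviation for each car: observation minus mean
mpg_dev <- mtcars$mpg - mean_mpg

# Collect the observation, the mean, and the deviation in one table
dev_table <- data.frame(
  mpg = mtcars$mpg,
  mean_mpg = mean_mpg,
  mpg_dev = mpg_dev,
  row.names = rownames(mtcars)
)

# Show the first 10 cars
head(dev_table, 10)
```

Cars with mpg above the mean get a positive mpg_dev, and cars below the mean get a negative one.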
Total Sum of Squares (TSS)
We want a measure of the overall spread of the data.
If we sum all of these deviations, we get zero, because the positive and negative deviations cancel.
We could sum the absolute values of the deviations, but the underlying math (calculus) would not work as well.
INSTEAD we sum the SQUARED deviations and take the square root at the end of our calculations.
TSS (Total Sum of Squares) is the total variability of a variable. It’s the sum of the squared deviations: \(TSS = \sum (X_i - \overline{X})^2\)
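To see why squaring is needed, here is a short sketch in base R, again assuming the built-in mtcars dataset. The names mpg_dev and tss are illustrative.

```r
# Deviations of mpg from its mean
mpg_dev <- mtcars$mpg - mean(mtcars$mpg)

# The raw deviations cancel out: their sum is zero
# (up to tiny floating-point rounding error)
sum(mpg_dev)

# Squaring makes every term non-negative, so the sum measures spread
tss <- sum(mpg_dev^2)
tss
```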
Variance (Var)
Sample Variance (Var): Typical value of a squared deviation in the data.
R command for Variance: var()
\(Var = \frac{TSS}{n-1}\) where n is the number of observations in the variable.
\(TSS = Var \times (n-1)\)
In HW 2 you will calculate TSS from Variance using this simple equation.
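The two equations above can be checked against each other in R. This is a sketch using the built-in mtcars dataset as an assumed example. The variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)                  # number of observations

# TSS computed directly from the squared deviations
tss <- sum((x - mean(x))^2)

# Variance "by hand" and with the built-in command
var_by_hand <- tss / (n - 1)
var_builtin <- var(x)
all.equal(var_by_hand, var_builtin)

# Going the other way: recover TSS from the variance
tss_from_var <- var_builtin * (n - 1)
all.equal(tss_from_var, tss)
```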
Standard Deviation
Sample Standard Deviation (SD): Typical value of deviation in the data
\(SD = \sqrt{Var} = \sqrt{\frac{TSS}{n-1}}\)
\(Var = SD^2\)
R command: sd()
Sample Standard Deviation is the measure we use most often in MAS 261.
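As a quick check of the relationship between SD and Var, the sketch below compares the built-in sd command with the square root of the variance, assuming the built-in mtcars dataset.

```r
# Standard deviation from the built-in command
sd_builtin <- sd(mtcars$mpg)

# Standard deviation as the square root of the sample variance
sd_from_var <- sqrt(var(mtcars$mpg))

# Both approaches agree
all.equal(sd_builtin, sd_from_var)

# Squaring the SD returns the variance
all.equal(sd_builtin^2, var(mtcars$mpg))
```

Because SD is in the same units as the data (here, miles per gallon), it is easier to interpret than the variance, which is in squared units.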
Coefficient of Variation (CV) and Range
Coefficient of Variation (CV): provides a measure of variability in the data with scale (units) factored out.
- Ideal for directly comparing variability in data with different units, e.g. US $ and European €.
\(CV = \frac{SD}{\overline{X}}\): SD divided by the sample mean, \(\overline{X}\)
- In R: sd()/mean(), where the dataset and variable are specified in parentheses.
Range: is NOT calculated based on deviations
Range = data maximum minus data minimum
In R, the range command outputs the data minimum and maximum
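A minimal sketch of both calculations, assuming the built-in mtcars dataset. Comparing mpg (miles per gallon) with hp (horsepower) is an illustrative choice to show two variables with different units.

```r
# CV: standard deviation divided by the mean (units cancel out)
cv_mpg <- sd(mtcars$mpg) / mean(mtcars$mpg)
cv_hp  <- sd(mtcars$hp) / mean(mtcars$hp)
c(mpg = cv_mpg, hp = cv_hp)

# range() returns two values: the minimum and the maximum
mpg_min_max <- range(mtcars$mpg)
mpg_min_max

# The Range itself is maximum minus minimum
mpg_range <- diff(mpg_min_max)
mpg_range
```

Because the CV has no units, the two CV values can be compared directly even though mpg and hp are measured on very different scales.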
Relationships between Measures of Variability
For TSS, Variance, Standard Deviation, and CV, deviations are the building blocks
\(Var = \frac{TSS}{n-1}\) and \(TSS = Var \times (n-1)\)
\(SD = \sqrt{Var}\) and \(Var = SD^2\)
\(CV = \frac{SD}{\overline{X}}\) and \(SD = CV \times \overline{X}\)
- Recall that \(\overline{X}\) is the symbol for the sample mean.
The two measures we will use most often are Standard Deviation (SD) and CV.
Later in the course we will also use Variance.
Variability Calculations in Excel
All of the calculations shown today (and in lecture 2) can be done in Excel
For large datasets, however, it makes sense to use R or another coding language.
Starting with Lecture 4, Excel will become less practical to use.
The R dataset, mtcars, has been exported to Excel and can be accessed here.
Sample and Population measures
In R, it is assumed that you are calculating measures of variability from a sample
Sample STATISTICS
In Excel, you have to specify sample data by using =stdev.s and =var.s.
We will USUALLY not use the population calculations in this course.
- If you use Excel, DO NOT USE =stdev.p and =var.p unless specified.
Be aware that the sample and population results ARE SLIGHTLY DIFFERENT.
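The size of the difference can be illustrated in R. Base R has no built-in population variance command, so the population versions below are computed by hand, dividing TSS by n instead of n - 1 (the same divisor Excel's =var.p uses). This is a sketch assuming the built-in mtcars dataset. The variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)
tss <- sum((x - mean(x))^2)

# Sample versions: divide by n - 1 (what var() and sd() do)
var_sample <- tss / (n - 1)
sd_sample  <- sqrt(var_sample)

# Population versions: divide by n
var_pop <- tss / n
sd_pop  <- sqrt(var_pop)

# Side-by-side comparison
c(sd_sample = sd_sample, sd_pop = sd_pop)
```

The population values are always slightly smaller than the sample values, and the gap shrinks as n grows.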
Lecture 3 In-class Exercises - Q2
For these exercises we will use the R starwars dataset. You can use Excel or R, but I recommend R.
- The R dataset, mtcars, has been exported to Excel and can be accessed here.
- This file shows the Excel formulas for all the measures in Lectures 2 and 3.
- The R dataset, starwars, has been exported to Excel and can be accessed here.
If using R, run the following code to save the starwars dataset to the Global Environment and filter out missing values in the height data.
How many observations (rows) are in this filtered dataset with three variables?
Lecture 3 In-class Exercises - Q3
Recall that the deviation for an individual observation is observation minus mean.
What is the deviation from the mean for the height of R2-D2? Round answer to closest whole number and include negative sign if needed.
Height for R2-D2 is 96 cm. Use this code to answer this question.
96 - mean(my_starwars$height)
Lecture 3 In-class Exercises - Q4
Recall that \(TSS = Var \times (n-1)\) and \(Var = \frac{TSS}{n-1}\)
What is the TSS, the total sum of the squared deviations, for the height variable in the starwars dataset? Round answer to closest whole number.
To find this answer, find the variance of height and multiply it by the sample size minus 1.
var(my_starwars$height)*80
Lecture 3 In-class Exercises - Q5
Recall that \(CV = \frac{SD}{\overline{X}}\), Standard Deviation divided by sample mean, \(\overline{X}\).
CV is useful for comparing data with different units, e.g. centimeter data to inches, or US dollars to British pounds because the units are factored out.
What is the CV, the coefficient of variation for the height variable in the StarWars dataset? Round answer to two decimal places
sd(my_starwars$height)/mean(my_starwars$height)
Min, Max, Median, Quartiles, and Percentiles
There are values in the data that are used to understand its variability and distribution.
We’ve already discussed these values:
Minimum: lowest value in the data
Maximum: highest value in the data
Range: Maximum minus Minimum
Median: The middle value or average of two middle values
- Also referred to as the 50th percentile or the 2nd Quartile
- 50% (two quarters) of the observations are below this value
- 50% (two quarters) of the observations are above this value
Two Additional Informative Values:
- 25th percentile, also called the 1st Quartile: 25% of the data is below this value
- 75th percentile, also called the 3rd Quartile: 75% of the data is below this value
Five Number Summary
Five Number Summary:
Minimum, 25th Percentile, Median, 75th Percentile, Maximum
In R, we use the command summary to calculate these values and also get the mean as a bonus.
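A short sketch of these commands, assuming the built-in mtcars dataset as an example. The quantile command is shown as one way to request specific percentiles. The object names are illustrative.

```r
# Min, Q1, Median, Mean, Q3, Max in one command
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# Specific percentiles can be requested with quantile()
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))
mpg_quartiles
```

The 50% value from quantile matches the Median reported by summary.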
Visualizing Data
In Lecture 4, we will talk more about visualizing data to understand
- Measures of Central Tendency
- Measures of Variability
- Quartiles and Percentiles (both are types of Quantiles)
- Extreme Values also referred to as Outliers
- Comparing different categories
For today, let’s examine how the five number summary can be visualized.
Five Number Summary and Boxplots
Recall the large real estate dataset from Lecture 2.
Today, we filter this dataset to create realtor1 with:
- two states, Maine and Vermont.
- houses with prices of $1.2 Million or less.
We also make 2 separate datasets:
- realtor_VT includes data for Vermont only
- realtor_ME includes data for Maine only
MAS 261 students are not responsible for R code to import and filter data.
Five Number Summary for Each State
We use the summary command in R on the price data for each state to find summary values:
- Minimum, Q1, Median (Q2), Q3, Maximum and Mean
In each state summary, the mean is substantially higher than the median.
Summary values for Vermont are shown on a boxplot (next slide).
In HW 2, you will annotate a boxplot of the commute time data used for that assignment.
Excel is NOT RECOMMENDED for finding these values!
Boxplot Annotated with Five Number Summary
Five Number summary for Vermont:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10000 89000 200000 250091 343750 1200000
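The real estate data is not reproduced here, so the sketch below uses the built-in mtcars dataset as a stand-in to show how a boxplot can be labeled with the five number summary in base R. The label position (y = 1.45) and object names are illustrative choices, not the code used to make the lecture figure.

```r
mpg <- mtcars$mpg

# Five number summary: Min, Q1, Median, Q3, Max
five_num <- quantile(mpg, probs = c(0, 0.25, 0.50, 0.75, 1))

# Horizontal boxplot of the data
boxplot(mpg, horizontal = TRUE,
        main = "mtcars: Miles per Gallon",
        xlab = "mpg")

# Place a label above the plot at each of the five values
text(x = five_num, y = 1.45,
     labels = c("Min", "Q1", "Median", "Q3", "Max"),
     cex = 0.8)
```

Note that R's boxplot command draws the edges of the box using a slightly different quartile calculation than summary, so the box edges can differ a little from the Q1 and Q3 values printed by summary.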
Key Points from Today
Measures of Variability
Deviation is ‘building block’
Deviation = Observation - Mean
TSS, Variance, Standard Deviation and CV are all related
Range is Maximum minus Minimum
Quartiles, Q1, Q2, Q3 also help describe variability
summary command shows Min., Q1, Q2, Q3, Max. and Mean
Boxplot shows Min., Q1, Q2, Q3, Max.
Boxplot shows additional information (Lecture 4)
To submit an Engagement Question or Comment about material from Lecture 3: Submit it by midnight today (day of lecture).