In our previous lesson, we discussed Central Tendency (Mean, Median, Mode). However, the center does not tell the whole story.
Imagine two climate zones where the average temperature is \(25^\circ C\).
- Zone A: Temperatures range from \(24^\circ C\) to \(26^\circ C\).
- Zone B: Temperatures range from \(-10^\circ C\) to \(60^\circ C\).
Both zones have the same mean, but their spread is vastly different. Measures of dispersion (often loosely called measures of variance) tell us how spread out or “scattered” the data points are around the center.
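To make the contrast concrete, here is a minimal sketch with made-up readings for the two zones. The vectors `zone_a` and `zone_b` are hypothetical values chosen only so that both average exactly 25 and match the ranges given above.

```r
# Hypothetical temperature readings (degrees C); values are illustrative only
zone_a <- c(24, 24.5, 25, 25.5, 26)
zone_b <- c(-10, 5, 25, 45, 60)

# The centers are identical
mean(zone_a)  # 25
mean(zone_b)  # 25

# The five-number summaries show how differently the values are scattered
summary(zone_a)
summary(zone_b)
```

The mean alone cannot distinguish these two vectors, which is why a separate measure of spread is needed.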
The range is the simplest measure of dispersion: the difference between the maximum and minimum values. Because it depends only on the two extremes, a single outlier can change it drastically.
\[Range = X_{max} - X_{min}\]
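A short sketch of the formula in R, using a hypothetical vector `temps`. One detail worth knowing: base R's `range()` returns the minimum and maximum as a pair rather than their difference, so the subtraction has to be done explicitly.

```r
# Hypothetical temperature readings (degrees C)
temps <- c(24, 25, 26, 25, 24.5)

# range() returns c(min, max), not a single number
range(temps)  # 24 26

# Range as defined above: maximum minus minimum
data_range <- max(temps) - min(temps)
data_range  # 2

# Equivalent one-liner
diff(range(temps))  # 2
```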
The interquartile range (IQR) measures the spread of the middle 50% of the data. Because it ignores the lowest and highest quarters of the values, it is highly resistant to outliers.
\[IQR = Q_3 - Q_1\]
Where:
- \(Q_1\) (First Quartile): 25th percentile.
- \(Q_3\) (Third Quartile): 75th percentile.
When looking at company salaries, the IQR helps identify the pay scale for the “typical” employee, ignoring the extremely high salaries of CEOs or the low wages of temporary interns.
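The salary scenario can be sketched as follows. The `salaries` vector is hypothetical (in thousands), with one very low and one very high value standing in for an intern and a CEO. Note that R's `quantile()` supports several interpolation rules; this sketch relies on the default (type 7), so other software may report slightly different quartiles.

```r
# Hypothetical annual salaries in thousands; 18 and 950 are the extremes
salaries <- c(18, 42, 45, 48, 50, 52, 55, 58, 60, 950)

# Quartiles using R's default interpolation (type 7)
q <- quantile(salaries, probs = c(0.25, 0.75))

# IQR by the formula Q3 - Q1, and with the built-in helper
iqr_manual <- unname(q[2] - q[1])
iqr_builtin <- IQR(salaries)

iqr_builtin            # 11.5
diff(range(salaries))  # 932, dominated by the two extreme values
```

The range reflects almost nothing but the two extremes, while the IQR describes the band in which the middle half of the employees sit.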
Variance measures the average squared deviation of each data point from the mean.
Population Variance (\(\sigma^2\)): \[\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\]
Sample Variance (\(s^2\)): \[s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}\]
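Note: For samples we divide by \(n-1\) rather than \(n\) (Bessel’s Correction). Deviations are measured from the sample mean \(\bar{x}\), which sits closer to the sample than the true mean \(\mu\) does, so dividing by \(n\) would systematically underestimate the population variance. Dividing by \(n-1\) provides an unbiased estimate.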
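The two formulas can be checked by hand in R. This is a minimal sketch on a small illustrative vector `x`; base R's `var()` always uses the \(n-1\) denominator, so the population version has to be computed manually.

```r
# Small illustrative data set
x <- c(1.2, -0.5, 2.3, -1.8, 0.4)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

# Sample variance (n - 1) and population variance (n)
sample_var <- ss / (n - 1)
pop_var <- ss / n

sample_var  # 2.467
pop_var     # 1.9736

# var() matches the sample formula, not the population one
all.equal(sample_var, var(x))  # TRUE
```

Treating these five values as an entire population gives the smaller figure; treating them as a sample from a larger population gives the larger, bias-corrected one.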
The standard deviation is the square root of the variance. It is the most widely used measure of spread because it is expressed in the same units as the original data, whereas the variance is in squared units.
\[s = \sqrt{s^2}\]
```r
# Data: Daily stock price changes (%)
stock_changes <- c(1.2, -0.5, 2.3, -1.8, 0.4)

variance_val <- var(stock_changes)  # sample variance (divides by n - 1)
sd_val <- sd(stock_changes)         # square root of the sample variance

cat("Variance:", variance_val, "\nStandard Deviation:", sd_val)
```

```
## Variance: 2.467
## Standard Deviation: 1.570669
```
Visualizing variance is crucial for understanding the “consistency” of data.
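One common way to see spread at a glance is a side-by-side boxplot. The sketch below uses simulated data, not real measurements: the seed, sample size, and the names `consistent` and `volatile` are arbitrary choices made for illustration.

```r
# Simulated data: same center, very different spread
set.seed(42)  # arbitrary seed so the simulation is reproducible
consistent <- rnorm(200, mean = 25, sd = 1)
volatile <- rnorm(200, mean = 25, sd = 15)

# Box height (the IQR) and whisker length show the spread of each group
boxplot(
  list(Consistent = consistent, Volatile = volatile),
  ylab = "Value",
  main = "Same center, different spread"
)
```

Both boxes are centered near 25, but the second is far taller, which is the visual signature of low consistency.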
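In finance, the standard deviation of an asset’s returns is commonly used as a proxy for risk: the more widely the returns scatter around their average, the less predictable the investment.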
An investor uses measures of variance to decide if the potential reward is worth the uncertainty.
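As a rough illustration of that trade-off, consider two hypothetical assets with the same average monthly return. The figures in `asset_a` and `asset_b` are invented for this sketch and do not describe any real security.

```r
# Hypothetical monthly returns (%) for two assets; numbers are made up
asset_a <- c(1.0, 1.2, 0.8, 1.1, 0.9)
asset_b <- c(6.0, -4.0, 5.0, -3.0, 1.0)

# Same expected reward...
mean(asset_a)  # 1
mean(asset_b)  # 1

# ...but very different uncertainty around it
sd_a <- sd(asset_a)
sd_b <- sd(asset_b)
c(asset_a = sd_a, asset_b = sd_b)
```

With identical average returns, the asset with the smaller standard deviation delivers the same reward with less uncertainty.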
| Measure | Formula | Use Case |
|---|---|---|
| Range | \(X_{max} - X_{min}\) | Quick, rough estimate of spread. |
| IQR | \(Q_3 - Q_1\) | Best for skewed data with outliers. |
| Variance | \(\frac{\sum (x_i - \bar{x})^2}{n - 1}\) | Mathematical modeling and ANOVA. |
| Standard Deviation | \(\sqrt{s^2}\) | Reporting “typical” deviation in original units. |
Using the built-in R dataset `iris`, perform the following:
- `Sepal.Length`
- `Petal.Width`, grouped by `Species`