1. Introduction: Why Variance Matters

In our previous lesson, we discussed Central Tendency (Mean, Median, Mode). However, the center does not tell the whole story.

Imagine two climate zones where the average temperature is \(25^\circ C\).

  • Zone A: Temperatures range from \(24^\circ C\) to \(26^\circ C\).
  • Zone B: Temperatures range from \(-10^\circ C\) to \(60^\circ C\).

Both have the same mean, but their Variance is vastly different. Measures of variance (or dispersion) tell us how spread out or “scattered” the data points are around the center.


2. The Range

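The Range is the simplest measure of variance: the difference between the maximum and minimum values. Because it depends only on the two most extreme observations, a single outlier can change it dramatically.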

Formula

\[Range = X_{max} - X_{min}\]

R Example

scores <- c(55, 67, 89, 92, 45, 77)
range_val <- max(scores) - min(scores)
print(paste("The Range is:", range_val))
## [1] "The Range is: 47"
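
One point worth knowing when reproducing the calculation above: base R's `range()` does not return the difference itself, it returns the minimum and maximum as a two-element vector. A minimal sketch of the same calculation using `diff(range(...))`, reusing the `scores` vector from the example:

```r
scores <- c(55, 67, 89, 92, 45, 77)

# range() returns c(min, max), not the spread
min_max <- range(scores)
print(min_max)
## [1] 45 92

# diff() subtracts the first element from the second: max - min
range_val <- diff(min_max)
print(range_val)
## [1] 47
```

Both approaches give the same result; `diff(range(x))` is simply a more compact spelling of `max(x) - min(x)`.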

3. Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data. It is highly resistant to outliers.

Formula

\[IQR = Q_3 - Q_1\]

Where:

  • \(Q_1\) (First Quartile): 25th percentile.
  • \(Q_3\) (Third Quartile): 75th percentile.

Real-Life Example: Salary Distribution

When looking at company salaries, the IQR helps identify the pay scale for the “typical” employee, ignoring the extremely high salaries of CEOs or the low wages of temporary interns.
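
The resistance to outliers can be checked directly in R. The sketch below reuses the `scores` vector from the Range section and then appends one extreme value; the value 500 is a made-up outlier added purely for illustration. Quartiles are computed with R's default method (`type = 7`), so other software may report slightly different values.

```r
scores <- c(55, 67, 89, 92, 45, 77)

# Q1 and Q3 using R's default quantile method (type 7)
quartiles <- quantile(scores, c(0.25, 0.75))
print(quartiles)
## 25% 75% 
##  58  86

# Hypothetical outlier appended to the data (illustration only)
scores_out <- c(scores, 500)

cat("IQR:", IQR(scores), "-> with outlier:", IQR(scores_out), "\n")
## IQR: 28 -> with outlier: 29.5
cat("Range:", diff(range(scores)), "-> with outlier:", diff(range(scores_out)), "\n")
## Range: 47 -> with outlier: 455
```

The single extreme value multiplies the Range almost tenfold, while the IQR moves by only 1.5 points.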


4. Variance (\(s^2\) or \(\sigma^2\))

Variance measures the average squared deviation of each data point from the mean.

Mathematical Formulas

Population Variance (\(\sigma^2\)): \[\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\]

Sample Variance (\(s^2\)): \[s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}\]

Note: We use \(n-1\) for samples (Bessel’s Correction) to provide an unbiased estimate of the population variance.
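
To see the two formulas side by side, the sketch below computes both by hand and compares the result with R's built-in `var()`, which uses the \(n-1\) denominator. The vector `x` is a made-up sample chosen so the arithmetic is easy to follow (its mean is exactly 5).

```r
# Hypothetical sample (illustration only); mean(x) is 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Squared deviation of each point from the mean
squared_dev <- (x - mean(x))^2

pop_var    <- sum(squared_dev) / n        # population formula: divide by N
sample_var <- sum(squared_dev) / (n - 1)  # sample formula: divide by n - 1

cat("Population variance:", pop_var, "\n")
## Population variance: 4
cat("Sample variance:", sample_var, "\n")
## Sample variance: 4.571429
cat("var(x):", var(x), "\n")
## var(x): 4.571429
```

Base R has no separate population-variance function; if the data really are the whole population, one option is to rescale: `var(x) * (n - 1) / n`.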


5. Standard Deviation (\(s\) or \(\sigma\))

Standard Deviation is the square root of the variance. It is the most commonly reported measure of spread because it is expressed in the same units as the original data, whereas the variance is in squared units.

Formula

\[s = \sqrt{s^2}\]

R Example: Calculating Variance and SD

# Data: Daily stock price changes (%)
stock_changes <- c(1.2, -0.5, 2.3, -1.8, 0.4)

variance_val <- var(stock_changes)
sd_val <- sd(stock_changes)

cat("Variance:", variance_val, "\nStandard Deviation:", sd_val)
## Variance: 2.467 
## Standard Deviation: 1.570669

6. Visualizing Variance

Visualizing variance is crucial for understanding the “consistency” of data.

Interpretation of the Graph:

  • Blue Curve (Low Variance): Data points are tightly clustered around the mean. The process is consistent.
  • Red Curve (High Variance): Data points are widely spread. The process is volatile or unpredictable.
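
A graph matching this description can be reproduced with base R graphics. The sketch below is not necessarily the original figure; it assumes two normal distributions with the same mean (0) and standard deviations of 1 and 3, values picked only to make the contrast visible.

```r
# Assumed parameters (illustration only): same mean, different spread
x <- seq(-10, 10, length.out = 400)
low_var  <- dnorm(x, mean = 0, sd = 1)  # tightly clustered
high_var <- dnorm(x, mean = 0, sd = 3)  # widely spread

# Blue curve: low variance
plot(x, low_var, type = "l", col = "blue", lwd = 2,
     xlab = "Value", ylab = "Density",
     main = "Same mean, different variance")

# Red curve: high variance, drawn on the same axes
lines(x, high_var, col = "red", lwd = 2)

legend("topright",
       legend = c("Low variance (sd = 1)", "High variance (sd = 3)"),
       col = c("blue", "red"), lwd = 2)
```

The blue curve is tall and narrow, the red curve low and wide; both enclose the same total area of 1.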

7. Real-Life Example: Investment Risk

In finance, the Standard Deviation of returns is commonly used as a proxy for Risk.

  • Investment A (Bond): Mean return 5%, Standard Deviation 1%. (Predictable, low risk).
  • Investment B (Crypto): Mean return 5%, Standard Deviation 50%. (Highly volatile, high risk).

An investor uses measures of variance to decide if the potential reward is worth the uncertainty.
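
One rough way to make this comparison concrete is to look at the band of mean ± 2 standard deviations for each investment. The sketch below assumes returns are approximately normally distributed, in which case about 95% of outcomes fall inside that band; real returns often violate this assumption, so treat the result as an illustration rather than a risk model. The column names are hypothetical.

```r
# Figures taken from the two investments described above (in %)
investments <- data.frame(
  name        = c("A (Bond)", "B (Crypto)"),
  mean_return = c(5, 5),
  sd_return   = c(1, 50)
)

# Approximate 95% band under an assumed normal distribution: mean +/- 2 SD
investments$lower <- investments$mean_return - 2 * investments$sd_return
investments$upper <- investments$mean_return + 2 * investments$sd_return

print(investments)
```

Under that assumption, Investment A would typically land between 3% and 7%, while Investment B could plausibly land anywhere from -95% to 105%, even though both have the same mean return.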


8. Summary Table

| Measure | Formula | Use Case |
|---|---|---|
| Range | \(X_{max} - X_{min}\) | Quick, rough estimate of spread. |
| IQR | \(Q_3 - Q_1\) | Best for skewed data with outliers. |
| Variance | \(\frac{\sum (x_i - \bar{x})^2}{n - 1}\) | Mathematical modeling and ANOVA. |
| Standard Dev | \(\sqrt{s^2}\) | Reporting “typical” deviation in original units. |

9. Class Exercise

Using the built-in R dataset iris, perform the following:

  1. Calculate the Range and Standard Deviation of Sepal.Length.
  2. Create a Boxplot of Petal.Width grouped by Species.
  3. Which species shows the highest variance in petal width?

# Hint:
sd(iris$Sepal.Length)
boxplot(Petal.Width ~ Species, data = iris)
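
The hint covers the Standard Deviation and the Boxplot. For the remaining parts, one possible approach (a sketch, not the only valid solution) is `diff(range(...))` for the Range and `tapply()` to compute the variance within each species. The variable names below are hypothetical.

```r
data(iris)

# Part 1: Range of Sepal.Length (max - min)
sepal_range <- diff(range(iris$Sepal.Length))
print(sepal_range)

# Part 3: sample variance of Petal.Width within each Species
petal_var_by_species <- tapply(iris$Petal.Width, iris$Species, var)
print(petal_var_by_species)

# Name of the species with the largest variance
names(which.max(petal_var_by_species))
```

Compare the numeric answer with the Boxplot: the species with the largest variance should also tend to show the tallest box and whiskers.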
