Content you should have understood before watching this video:
- Video 1, ‘Variables’
- Video 2, ‘Variation’
The mean, the mode, and the median
- Mean
- The arithmetic average: the sum of all scores divided by the number of scores. In R: mean()
- Mode
- The most frequent score or category (The peak of the histogram!) (see the sketch after this list)
- Median
- The middle score when scores are ordered. In R: median()
- Example: number of friends of 11 Facebook users:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
What is the median?
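In R (fb_friends is just an illustrative name for the friend counts above):

fb_friends = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
median(fb_friends) #98: the 6th of the 11 ordered scores

NB: R’s mode() returns an object’s storage mode, not the statistical mode; a common sketch for finding the most frequent value:

x = c(1, 2, 3, 3, 4)       #small example vector
names(which.max(table(x))) #"3", the most frequent value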
The dispersion: range
- The Range
- The smallest score subtracted from the largest. In R: range() (NB: range() returns the smallest and largest values; subtract them, e.g. with diff(), to get the range itself)
- Example
- Number of friends of 11 Facebook users.
- 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
- Range = 252 – 22 = 230
- What could be a problem when you indicate the range of a variable? What could we be missing?
=> this metric is prone to extreme values! Often not a very informative metric!
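A quick sketch in R, reusing fb_friends from above:

range(fb_friends)       #22 252: the smallest and largest value
diff(range(fb_friends)) #230: the range as a single number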
The dispersion: the interquartile range
- Quartiles
- The three values that split the sorted data into four equal parts.
- Second quartile = median.
- Lower quartile (25\(^{th}\) percentile) = median of lower half of the data.
- Upper quartile (75\(^{th}\) percentile) = median of upper half of the data.
- And of course you can have any ‘percentile’ (e.g. the 5\(^{th}\))
=> quartiles/medians are not prone to extreme values!
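In R, quantile() and IQR() compute these directly (NB: quantile() interpolates between values by default, so its output can differ slightly from the median-of-halves definition above):

quantile(fb_friends) #0%, 25%, 50%, 75% and 100% percentiles
IQR(fb_friends)      #upper quartile minus lower quartile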
Creating some variables to play with
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3) #gives 1 1 1 2 2 2
d1 = data.frame(name, alcohol, income, sex)
d1
name alcohol income sex
1 Ben 0.75 58000 1
2 Martin 1.20 38000 1
3 Andy 2.40 28000 1
4 Pauline 0.23 63000 2
5 Eva 0.90 90500 2
6 Carina 1.36 17000 2
Graphically measuring the variation of a (continuous) variable
- The histogram! Always THE first thing to look at:
- A histogram is used to show the distribution of a single (mostly continuous) variable
- The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
- The y-axis is the frequency, i.e. how many values fall in a given bin.
What’s a histogram, what is it used for?
[Figure: two example histograms, produced with hist(x) and hist(y)]
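A minimal sketch with simulated data (x here is invented for illustration, not a variable from the slides):

set.seed(42)         #make the simulated data reproducible
x = rnorm(200)       #200 random values
hist(x, breaks = 15) #'breaks' suggests roughly 15 bins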
Graphically measuring the variation of a (continuous) variable
- The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show? (look it up!)
boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
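To see the same grouped information numerically, one option (a sketch, not the only way) is tapply():

tapply(d1$income, d1$sex, quantile) #five-number summary of income per sex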
Measuring variation with numbers
- A perfect fit (rare!):
[Figure: all data points lie exactly on the mean]
- More often it looks like this:
[Figure: data points scattered around the mean]
Calculating ‘error’
- A deviation (or error) is the difference between the mean and an actual data point.
- Deviations can be calculated by taking each score and subtracting the mean from it:
\[deviation = x_i - \bar{x}\]
- NB: ‘Deviation’ is called ‘residual’ in linear models
What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?
Use the total error?
- We could just sum up the errors between the mean and the data
| score | mean | deviation |
|---|---|---|
| 1 | 2.6 | -1.6 |
| 2 | 2.6 | -0.6 |
| 3 | 2.6 | 0.4 |
| 3 | 2.6 | 0.4 |
| 4 | 2.6 | 1.4 |
| Total | | 0 |
\[\sum(x_i - \bar{x}) = 0\]
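Checking this in R with the five scores from the table:

friends = c(1, 2, 3, 3, 4)
friends - mean(friends)      #-1.6 -0.6  0.4  0.4  1.4
sum(friends - mean(friends)) #0 (up to floating-point rounding)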
The sum of squared errors
- The problem with summing up deviations is that they cancel out because some are positive and others negative
- Therefore, we square each deviation.
- If we add these squared deviations we get the sum of squared errors (SS).
\[SS = \sum(x_i - \bar{x})^2\]
Sum of squared errors
| score | mean | deviation | squared deviation |
|---|---|---|---|
| 1 | 2.6 | -1.6 | 2.56 |
| 2 | 2.6 | -0.6 | 0.36 |
| 3 | 2.6 | 0.4 | 0.16 |
| 3 | 2.6 | 0.4 | 0.16 |
| 4 | 2.6 | 1.4 | 1.96 |
| Total | | 0 | 5.2 |
\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
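The same sum of squares in R:

sum((friends - mean(friends))^2) #5.2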
Variance
- The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
- We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
- This value is called the variance (\(s^2\)).
\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
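In R, var() performs exactly this division by \(n-1\):

var(friends)                                              #1.3
sum((friends - mean(friends))^2) / (length(friends) - 1)  #the same, by hand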
Standard deviation
- The variance has one problem: it is measured in units squared
- This isn’t a very meaningful metric, so we take the square root
- This is the standard deviation (\(s\), sometimes \(sd\)):
\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]
NB: conventionally, the sample standard deviation is called \(s\), while the population standard deviation is called \(\sigma\)
In R:
friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
Same mean, different standard deviation
[Figure: two distributions with the same mean but different standard deviations]
The most important in a nutshell
- Be sure you know how to compute the mean, the median, and the interquartile range, and know the corresponding R functions!
- Know which metrics are prone to outliers, and which ones are robust
- The histogram is the most important tool for graphically characterising variation
- Characterising the variation in a variable numerically takes us from the sum of squares (SS) to the variance and the standard deviation