Department of Environmental Science, AUT

Measuring Variation: Prerequisites

Measuring Variation

Content you should have understood before watching this video:

  • Number 1, ‘Variables’
  • Number 2, ‘Variation’

The mean, the mode, and the median

  • Mean: the sum of all scores divided by the number of scores (you should already know this one). In R: mean()
  • Mode
    • The most frequent score or category (The peak of the histogram!)
  • Median
    • The middle score when scores are ordered. In R: median()
    • Example: number of friends of 11 Facebook users:

22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252

What is the median?
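
The question above can be checked directly in R. This is a minimal sketch; the vector name `fb_friends` is chosen here purely for illustration. Note that base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so the frequency table below is just one possible workaround.

```r
# Number of friends of the 11 Facebook users from the example
fb_friends = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

mean(fb_friends)    # arithmetic mean, pulled upwards by the user with 252 friends
median(fb_friends)  # middle (6th) of the 11 ordered scores: 98

# Mode via a frequency table: here every value occurs exactly once,
# so this sample has no single most frequent score
counts = table(fb_friends)
names(counts)[counts == max(counts)]
```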

The dispersion: range

  • The Range
    • The smallest score subtracted from the largest. In R: range() returns the minimum and the maximum; diff(range()) gives the difference between them
  • Example
    • Number of friends of 11 Facebook users.
    • 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
    • Range = 252 – 22 = 230
  • What could be a problem when you indicate the range of a variable? What could we be missing?

=> the range is very sensitive to extreme values (here: the single user with 252 friends), so it is often not a very informative metric!
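
A short sketch of the range computation for the Facebook example. The vector name `fb_friends` is an assumption made for illustration.

```r
fb_friends = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

range(fb_friends)        # returns two numbers, the minimum and the maximum: 22 252
diff(range(fb_friends))  # largest minus smallest: 230

# Dropping the single most extreme user shrinks the range considerably
diff(range(fb_friends[fb_friends < 252]))  # 121 - 22 = 99
```

One user out of eleven accounts for more than half of the range, which is why the range alone says little about the bulk of the data.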

The dispersion: the interquartile range

  • Quartiles
    • The three values that split the sorted data into four equal parts.
    • Second quartile = median.
    • Lower quartile (25\(^{th}\) percentile) = median of lower half of the data.
    • Upper quartile (75\(^{th}\) percentile) = median of upper half of the data.
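    • And of course you can have any ‘percentile’ (e.g. the 5\(^{th}\)). In R: quantile()
    • Interquartile range (IQR) = upper quartile minus lower quartile, i.e. the spread of the middle 50% of the data. In R: IQR()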

=> quartiles and the median are robust to extreme values!
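
Quartiles, other percentiles and the interquartile range (upper quartile minus lower quartile) can be sketched in R as follows. The names `fb_friends` and `extreme` are illustrative. Be aware that `quantile()` interpolates between neighbouring scores by default, so its quartiles can differ slightly from the ‘median of each half’ rule.

```r
fb_friends = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

quantile(fb_friends, probs = c(0.25, 0.5, 0.75))  # lower quartile, median, upper quartile
quantile(fb_friends, probs = 0.05)                # any other percentile, e.g. the 5th
IQR(fb_friends)                                   # upper quartile minus lower quartile

# Making the largest value ten times larger changes neither median nor IQR
extreme = replace(fb_friends, fb_friends == 252, 2520)
c(median(extreme), IQR(extreme))
```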

Creating some variables to play with

name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)
d1 = data.frame(name, alcohol, income, sex)
d1
     name alcohol income sex
1     Ben    0.75  58000   1
2  Martin    1.20  38000   1
3    Andy    2.40  28000   1
4 Pauline    0.23  63000   2
5     Eva    0.90  90500   2
6  Carina    1.36  17000   2

Graphically measuring the variation of a (continuous) variable

  • The histogram! Always THE first thing to look at:
  • A histogram is used to show the distribution of a single (mostly continuous) variable
  • The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
  • The y-axis is the frequency, i.e. how many values fall in a given bin.

What’s a histogram, what is it used for?

hist(x)
hist(y)
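
To try this out, any continuous variable will do. The sketch below simulates one; the variable `x`, its mean and its standard deviation are made up for illustration and are not course data.

```r
set.seed(1)                          # make the simulated data reproducible
x = rnorm(200, mean = 50, sd = 10)   # hypothetical continuous variable

# 'breaks' is only a suggestion; R picks 'pretty' bin boundaries close to it
hist(x, breaks = 15, xlab = "x", main = "Histogram of x")

# The same binning without drawing: bin edges and counts per bin
h = hist(x, breaks = 15, plot = FALSE)
h$breaks                     # bin boundaries along the x-axis
h$counts                     # frequency in each bin (the y-axis)
sum(h$counts) == length(x)   # every value falls into exactly one bin
```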

Graphically measuring the variation of a (continuous) variable

  • The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show? (look it up!)
boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
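
The numbers behind the plot can also be printed per group. A minimal sketch, rebuilding only the two columns of `d1` that are needed: the thick line in each box is the group median, and the box spans the lower to the upper hinge (roughly the quartiles), which `fivenum()` reports.

```r
# Same values as in d1, reduced to the two columns used here
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)
d1 = data.frame(income, sex)

# Equivalent call using the 'data' argument instead of '$'
boxplot(income ~ sex, data = d1)

# Per group: minimum, lower hinge, median, upper hinge, maximum
tapply(d1$income, d1$sex, fivenum)

# Median income per group
group_median = tapply(d1$income, d1$sex, median)
group_median
```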

Measuring variation with numbers

  • A perfect fit (rare!):

Measuring variation with numbers

  • More often it looks like this:

Calculating ‘error’

  • A deviation (or error) is the difference between the mean and an actual data point.
  • Deviations can be calculated by taking each score and subtracting the mean from it:

\[deviation = x_i - \bar{x}\]

  • NB: ‘Deviation’ is called ‘residual’ in linear models

Calculating ‘error’


What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

  • We could just sum up the errors between the mean and the data
score  mean  deviation
    1   2.6       -1.6
    2   2.6       -0.6
    3   2.6        0.4
    3   2.6        0.4
    4   2.6        1.4
Total                0

\[\sum(x_i - \bar{x}) = 0\]
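
The same five scores in R, as a minimal sketch (the variable name `deviation` is chosen for illustration). Because of floating-point rounding, the computed total may be a tiny number instead of exactly 0.

```r
friends = c(1, 2, 3, 3, 4)

deviation = friends - mean(friends)   # each score minus the mean: -1.6 -0.6 0.4 0.4 1.4
sum(deviation)                        # zero, up to floating-point rounding

# Compare with a tolerance rather than with '==' because of that rounding
abs(sum(deviation)) < 1e-12
```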

The sum of squared errors

  • The problem with summing up deviations is that they cancel out because some are positive and others negative
  • Therefore, we square each deviation.
  • If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

score  mean  deviation  squared deviation
    1   2.6       -1.6               2.56
    2   2.6       -0.6               0.36
    3   2.6        0.4               0.16
    3   2.6        0.4               0.16
    4   2.6        1.4               1.96
Total                0               5.20

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
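
A short sketch reproducing the sum of squared errors for these five scores. The names `squared_deviation` and `SS` are illustrative.

```r
friends = c(1, 2, 3, 3, 4)

squared_deviation = (friends - mean(friends))^2   # 2.56 0.36 0.16 0.16 1.96
SS = sum(squared_deviation)                       # add up the squared deviations
SS                                                # 5.2
```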

Variance

  • The sum of squares is a good measure of overall variability, but it depends on the number of scores/values: more data points give a larger SS even when the spread is the same
  • We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
  • This value is called the variance (\(s^2\)).

\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
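
The calculation by hand can be compared with the built-in `var()`, which also divides by \(n-1\). A minimal sketch with illustrative variable names:

```r
friends = c(1, 2, 3, 3, 4)
n = length(friends)

SS = sum((friends - mean(friends))^2)   # sum of squared errors: 5.2
SS / (n - 1)                            # variance by hand: 5.2 / 4 = 1.3
var(friends)                            # built-in function, same result
```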

Standard deviation

  • The variance has one problem: it is measured in units squared
  • Squared units are hard to interpret, so we take the square root to get back to the original units of the variable
  • This is the standard deviation (\(s\), sometimes \(sd\)):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

  • NB: by convention, the sample standard deviation is called \(s\), while the population standard deviation is called \(\sigma\)

In R:

friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
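
As a check, the standard deviation is simply the square root of the variance. The sketch below also shows the version that divides by \(n\) instead of \(n-1\); base R has no dedicated function for it, and the name `pop_sd` is made up here.

```r
friends = c(1, 2, 3, 3, 4)
n = length(friends)

sqrt(var(friends))   # identical to sd(friends): 1.140175

# sd() always divides by n - 1 (sample formula); dividing by n instead
# treats the five scores as the whole population and gives a smaller value
pop_sd = sqrt(sum((friends - mean(friends))^2) / n)
pop_sd
```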

Same mean, different standard deviation
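
A small made-up illustration of this point: the two vectors below (`narrow` and `wide`, both hypothetical) share the same mean but differ clearly in their spread.

```r
# Two hypothetical sets of scores with the same mean
narrow = c(2, 3, 3, 3, 4)
wide   = c(0, 1, 3, 5, 6)

c(mean(narrow), mean(wide))   # both means are 3
c(sd(narrow), sd(wide))       # the wider spread gives the larger standard deviation
```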


The most important in a nutshell

  • Be sure you know how to compute the mean, the median, and the interquartile range, and know the corresponding R functions!
  • Know which metrics are prone to outliers, and which ones are robust
  • The histogram is the most important tool when characterising variation
  • Numerically characterising the variation in a variable brings us from the SS to the variance and the standard deviation