Content you should have understood before watching this video:
- Video 1, ‘Variables’
- Video 2, ‘Variation’
The mean, the mode, and the median
- Mean
- The arithmetic average: the sum of all scores divided by the number of scores. In R: mean()
- Mode
- The most frequent score or category (The peak of the histogram!) (see the sketch after this list)
- Median
- The middle score when scores are ordered. In R: median()
- Example: number of friends of 11 Facebook users:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
What is the median?
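In R (fb_friends is just an illustrative name for the friend counts above):

fb_friends = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
median(fb_friends) #98: the 6th of the 11 ordered scores

NB: R’s mode() returns an object’s storage mode, not the statistical mode; a common sketch for finding the most frequent value:

x = c(1, 2, 3, 3, 4)       #small example vector
names(which.max(table(x))) #"3", the most frequent value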
The dispersion: range
- The Range
- The smallest score subtracted from the largest. In R: range() (NB: range() returns the smallest and largest values; subtract them, e.g. with diff(), to get the range itself)
- Example
- Number of friends of 11 Facebook users.
- 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
- Range = 252 – 22 = 230
- What could be a problem when you indicate the range of a variable? What could we be missing?
=> this metric is prone to extreme values! Often not a very informative metric!
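A quick sketch in R, reusing fb_friends from above:

range(fb_friends)       #22 252: the smallest and largest value
diff(range(fb_friends)) #230: the range as a single number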
The dispersion: the interquartile range
- Quartiles
- The three values that split the sorted data into four equal parts.
- Second quartile = median.
- Lower quartile (25\(^{th}\) percentile) = median of lower half of the data.
- Upper quartile (75\(^{th}\) percentile) = median of upper half of the data.
- And of course you can have any ‘percentile’ (e.g. the 5\(^{th}\))
=> quartiles/medians are not prone to extreme values!
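In R, quantile() and IQR() compute these directly (NB: quantile() interpolates between values by default, so its output can differ slightly from the median-of-halves definition above):

quantile(fb_friends) #0%, 25%, 50%, 75% and 100% percentiles
IQR(fb_friends)      #upper quartile minus lower quartile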
Creating some variables to play with
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3) #gives 1 1 1 2 2 2
d1 = data.frame(name, alcohol, income, sex)
d1
name alcohol income sex
1 Ben 0.75 58000 1
2 Martin 1.20 38000 1
3 Andy 2.40 28000 1
4 Pauline 0.23 63000 2
5 Eva 0.90 90500 2
6 Carina 1.36 17000 2
Graphically measuring the variation of a (continuous) variable
- The histogram! Always THE first thing to look at:
- A histogram is used to show the distribution of a single (mostly continuous) variable
- The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
- The y-axis is the frequency, i.e. how many values fall in a given bin.
What’s a histogram, what is it used for?
[Figure: two example histograms, produced with hist(x) and hist(y)]
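A minimal sketch with simulated data (x here is invented for illustration, not a variable from the slides):

set.seed(42)         #make the simulated data reproducible
x = rnorm(200)       #200 random values
hist(x, breaks = 15) #'breaks' suggests roughly 15 bins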
Graphically measuring the variation of a (continuous) variable
- The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show? (look it up!)
boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
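To see the same grouped information numerically, one option (a sketch, not the only way) is tapply():

tapply(d1$income, d1$sex, quantile) #five-number summary of income per sex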
Measuring variation with numbers
- A perfect fit (rare!):
[Figure: all data points lie exactly on the mean]
- More often it looks like this:
[Figure: data points scattered around the mean]
Calculating ‘error’
- A deviation (or error) is the difference between the mean and an actual data point.
- Deviations can be calculated by taking each score and subtracting the mean from it:
\[deviation = x_i - \bar{x}\]
- NB: ‘Deviation’ is called ‘residual’ in linear models
What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?
Use the total error?
- We could just sum up the errors between the mean and the data
| score | mean | deviation |
|---|---|---|
| 1 | 2.6 | -1.6 |
| 2 | 2.6 | -0.6 |
| 3 | 2.6 | 0.4 |
| 3 | 2.6 | 0.4 |
| 4 | 2.6 | 1.4 |
| Total | | 0 |
\[\sum(x_i - \bar{x}) = 0\]
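Checking this in R with the five scores from the table:

friends = c(1, 2, 3, 3, 4)
friends - mean(friends)      #-1.6 -0.6  0.4  0.4  1.4
sum(friends - mean(friends)) #0 (up to floating-point rounding)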
The sum of squared errors
- The problem with summing up deviations is that they cancel out because some are positive and others negative
- Therefore, we square each deviation.
- If we add these squared deviations we get the sum of squared errors (SS).
\[SS = \sum(x_i - \bar{x})^2\]
Sum of squared errors
| score | mean | deviation | squared deviation |
|---|---|---|---|
| 1 | 2.6 | -1.6 | 2.56 |
| 2 | 2.6 | -0.6 | 0.36 |
| 3 | 2.6 | 0.4 | 0.16 |
| 3 | 2.6 | 0.4 | 0.16 |
| 4 | 2.6 | 1.4 | 1.96 |
| Total | | 0 | 5.2 |
\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
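The same sum of squares in R:

sum((friends - mean(friends))^2) #5.2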
Variance
- The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
- We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
- This value is called the variance (\(s^2\)).
\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
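In R, var() performs exactly this division by \(n-1\):

var(friends)                                              #1.3
sum((friends - mean(friends))^2) / (length(friends) - 1)  #the same, by hand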
Standard deviation
- The variance has one problem: it is measured in units squared
- This isn’t a very meaningful metric, so we take the square root
- This is the standard deviation (\(s\), sometimes \(sd\)):
\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]
NB: conventionally, the sample standard deviation is called \(s\), while the population standard deviation is called \(\sigma\)
In R:
friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
Same mean, different standard deviation
[Figure: two distributions with the same mean but different standard deviations]
The most important in a nutshell
- Be sure you know how to compute the mean, the median, and the interquartile range, and know the corresponding R functions!
- Know which metrics are prone to outliers, and which ones are robust
- The histogram is the most important tool for graphically characterising variation
- Characterising the variation in a variable numerically takes us from the sum of squares (SS) to the variance and the standard deviation