The range is the difference between the largest and smallest observations. It is easy to compute, but it is not a very good measure of spread: because it depends only on the two most extreme observations, two datasets with very different spreads can have the same range.
For example, consider two datasets 1, 1, 1, 1, 1, 9, 9, 9, 9, 9 and 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9.
Both have a range of 8, but their spreads are different. The first dataset has a larger spread because its values tend to be farther away from each other, while in the second dataset most values are clustered together at the mean.
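As a quick check of this example, the following sketch computes the range of both datasets in R. The variable names `a`, `b`, `range_a`, and `range_b` are chosen here for illustration only.

```r
# The two datasets from the example above
a <- c(1, 1, 1, 1, 1, 9, 9, 9, 9, 9)
b <- c(1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9)

# range() returns c(min, max); diff() turns that pair into max - min
range_a <- diff(range(a))
range_b <- diff(range(b))
range_a
range_b

# table() counts how often each value occurs, which shows where the values sit
table(a)
table(b)
```

Both ranges come out as 8, while the counts show that the values of `a` sit at the two extremes and most values of `b` sit at 5.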
Before we can calculate the standard deviation, we must first calculate the variance. So our discussion below will start with the variance but end with the standard deviation.
The variance is based on how far each individual observation is from the mean. These signed distances are called the deviations from the mean, and we denote the \(i^{th}\) deviation from the mean by \[x_i - \bar x\]
Example 1: Find the deviations from the mean for the following data: 5.3, 4.1, 4.2, 6.6, 2.9
The sample mean is \(\bar x = \frac{23.1}{5} = 4.62\), which we computed in Example 1 in lesson 2.4.
| | \(x_i\) | \(x_i-\bar x\) |
|---|---|---|
| | \(5.3\) | \(5.3-4.62=0.68\) |
| | \(4.1\) | \(4.1-4.62=-0.52\) |
| | \(4.2\) | \(4.2-4.62=-0.42\) |
| | \(6.6\) | \(6.6-4.62=1.98\) |
| | \(2.9\) | \(2.9-4.62=-1.72\) |
| Total | \(23.1\) | \(0\) |
In R, we can easily find the deviations from the mean:
> ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)
> ex1.data - mean(ex1.data)
[1] 0.68 -0.52 -0.42 1.98 -1.72
Calculating the deviations is a good start, but how do we summarize multiple deviations from the mean into one value? One idea is to average them: add up the deviations and divide by the number of deviations. However, the sum of the deviations from the mean is always 0, because the positive and negative deviations cancel exactly: \(\sum_{i=1}^{n}(x_i-\bar x)=\sum_{i=1}^{n}x_i-n\bar x=0\), since \(\sum_{i=1}^{n}x_i=n\bar x\). This average would therefore be 0 for every dataset.
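The zero-sum property can be checked numerically on the data from Example 1. This is a minimal sketch; the names `deviations` and `total` and the tolerance `1e-12` are choices made here, not part of the lesson.

```r
ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)

# Deviations from the mean, as in Example 1
deviations <- ex1.data - mean(ex1.data)

# Their sum is 0 in exact arithmetic; floating-point rounding may leave
# a tiny nonzero remainder, so compare against a small tolerance
total <- sum(deviations)
abs(total) < 1e-12
```

Comparing against a tolerance rather than testing `total == 0` avoids a false alarm caused by rounding error.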
One solution is to square each deviation so that no term is negative. We then (almost) average the squared deviations from the mean by taking their sum and dividing by \(n-1\). It may seem odd to divide by \(n-1\) rather than \(n\), but there is a good reason for it that we will address in detail later in the course.
The sample variance is \[s^2 = \sum_{i=1}^{n}\frac{(x_i - \bar x)^2}{n-1}\]
Example 2: Find the sample variance of the following set of data: 5.3, 4.1, 4.2, 6.6, 2.9
The sample mean is \(\bar x = \frac{23.1}{5} = 4.62\), which we computed in Example 1 in lesson 2.4.
| | \(x_i\) | \(x_i-\bar x\) | \((x_i-\bar x)^2\) |
|---|---|---|---|
| | \(5.3\) | \(5.3-4.62=0.68\) | \((0.68)^2=0.4624\) |
| | \(4.1\) | \(4.1-4.62=-0.52\) | \((-0.52)^2=0.2704\) |
| | \(4.2\) | \(4.2-4.62=-0.42\) | \((-0.42)^2=0.1764\) |
| | \(6.6\) | \(6.6-4.62=1.98\) | \((1.98)^2=3.9204\) |
| | \(2.9\) | \(2.9-4.62=-1.72\) | \((-1.72)^2=2.9584\) |
| Total | \(23.1\) | \(0\) | \(7.788\) |
And we find that the sample variance is \(s^2 = \frac{7.788}{5-1} = 1.947\).
In R, we can use the var function to compute the variance:
> ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)
> var(ex1.data)
[1] 1.947
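To connect the table in Example 2 with the `var` function, the sketch below repeats the calculation step by step. The names `n`, `squared_dev`, and `manual_var` are illustrative.

```r
ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)
n <- length(ex1.data)

# Squared deviations from the mean (the last column of the table)
squared_dev <- (ex1.data - mean(ex1.data))^2

# Sum of the squared deviations divided by n - 1
manual_var <- sum(squared_dev) / (n - 1)
manual_var

# all.equal() compares numbers up to a small numerical tolerance
all.equal(manual_var, var(ex1.data))
```

The manual result should agree with `var(ex1.data)`, which indicates that `var` divides by \(n-1\) rather than \(n\).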
The sample standard deviation is the square root of the sample variance \[s = \sqrt{\sum_{i=1}^{n}\frac{(x_i-\bar x)^2}{n-1}}\]
The sample standard deviation is usually preferred for describing spread because it is measured in the same units as the data, whereas the sample variance is measured in squared units.
Example 3: Find the sample standard deviation of the following set of data: 5.3, 4.1, 4.2, 6.6, 2.9.
We already found that \(s^2 = 1.947\), so the sample standard deviation is \(s = \sqrt{1.947} \approx 1.395\).
In R, we can use the sd function to compute the standard deviation:
> ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)
> sd(ex1.data)
[1] 1.395349
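As a small check of the definition, the standard deviation can also be obtained by taking the square root of the variance. The name `manual_sd` is illustrative.

```r
ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)

# Square root of the sample variance
manual_sd <- sqrt(var(ex1.data))
manual_sd

# Should agree with the built-in sd() up to numerical tolerance
all.equal(manual_sd, sd(ex1.data))
```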
The interquartile range (IQR) measures the spread of the middle 50% of the data; it is the difference between the third and first quartiles:
\[IQR = Q_3-Q_1\]
Recall that we first encountered quartiles when we learned about boxplots. We are now going to consider the quartiles in more detail.
The first quartile separates the smallest 25% of the data from the remainder of the data. It is denoted by \(Q_1\). It is the same as the 25th percentile. To compute \(Q_1\), you simply find the median of the lower half of the data where the lower half INCLUDES the median when \(n\) is odd.
The third quartile separates the smallest 75% of the data from the remainder of the data. It is denoted by \(Q_3\). It is the same as the 75th percentile. To compute \(Q_3\), you simply find the median of the upper half of the data where the upper half INCLUDES the median when \(n\) is odd.
It’s important to note that there are several different methods of computing quartiles that result in slightly different values. You may have learned a different method of computing quartiles in another class but this is the definition that we will be using in this class.
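The definition above can be written as a short R function. This is only a sketch of the rule as stated in this lesson; the function name `lesson_quartiles` and the small test vectors are made up for illustration.

```r
# Quartiles as defined in this lesson: the median of each half of the
# sorted data, where both halves include the median when n is odd
lesson_quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- ceiling(n / 2)          # size of each half
  lower <- x[1:half]              # smallest 'half' observations
  upper <- x[(n - half + 1):n]    # largest 'half' observations
  c(Q1 = median(lower), Q3 = median(upper))
}

# n odd: halves are 2, 4, 6 and 6, 8, 10
odd_result <- lesson_quartiles(c(2, 4, 6, 8, 10))
odd_result

# n even: halves are 2, 4 and 6, 8
even_result <- lesson_quartiles(c(2, 4, 6, 8))
even_result
```

For the odd-length vector the quartiles are 4 and 8, and for the even-length vector they are 3 and 7.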
First, let’s consider an example where \(n\) is even.
Example 4: Find the interquartile range of the following data: 47, 18, 25, 3, 14, 9, 29, 54
First we must put the data in order: 3, 9, 14, 18, 25, 29, 47, 54
The median is the observation in the \(\frac{n+1}{2}=\frac{8+1}{2}=4.5^{th}\) ordered position, so the median is the average of the observations in the 4th and 5th positions of the ordered data: \[m=\frac{18+25}{2}=21.5\]
The first quartile is the median of the lower half of the data, i.e. the median of 3, 9, 14, 18 which is \[Q_1=\frac{9+14}{2}=11.5\]
The third quartile is the median of the upper half of the data, i.e. the median of 25, 29, 47, 54 which is \[Q_3=\frac{29+47}{2}=38\]
So, now we can determine the interquartile range \[IQR=38-11.5=26.5\]
We saw in lesson 2.4 how to use R to find \(Q_1\) and \(Q_3\) with the fivenum command:
> data <- c(47, 18, 25, 3, 14, 9, 29, 54)
> data5numsum <- fivenum(data)
> data5numsum[4] - data5numsum[2]
[1] 26.5
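Because different methods of computing quartiles give different values, it is worth seeing how this plays out in R. The sketch below compares the `fivenum` approach with R's built-in `IQR` function, which by default is based on `quantile` with a different (interpolating) quartile method. The names `hinges`, `iqr_lesson`, and `iqr_default` are illustrative.

```r
data <- c(47, 18, 25, 3, 14, 9, 29, 54)

# Quartiles from fivenum(), matching the definition used in this lesson
hinges <- fivenum(data)
iqr_lesson <- hinges[4] - hinges[2]
iqr_lesson    # 26.5

# Built-in IQR(), which uses quantile()'s default method
iqr_default <- IQR(data)
iqr_default   # 20.75
```

For this class, the `fivenum` result is the one that matches the hand calculation in Example 4.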
Next, let’s consider an example where \(n\) is odd.
Example 5: Find the interquartile range of the following data: 47, 18, 25, 3, 14, 9, 29
First we must put the data in order: 3, 9, 14, 18, 25, 29, 47
The median is the observation in the \(\frac{n+1}{2}=\frac{7+1}{2}=4^{th}\) ordered position, so the median is 18.
In this example, the lower half of the data is 3, 9, 14, 18 and the upper half of the data is 18, 25, 29, 47. Since \(n\) is odd we include the median in both halves of the data.
The first quartile is the median of the lower half of the data, i.e. the median of 3, 9, 14, 18 which is \[Q_1=\frac{9+14}{2}=11.5\]
The third quartile is the median of the upper half of the data, i.e. the median of 18, 25, 29, 47 which is \[Q_3=\frac{25+29}{2}=27\]
So, now we can determine the interquartile range \[IQR=27-11.5=15.5\]
Can you write the R code to find the IQR in this example?
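One possible answer, following the same `fivenum` approach used in Example 4, is sketched below. The variable names `ex5.data`, `ex5.fivenum`, and `ex5.iqr` are chosen here for illustration.

```r
ex5.data <- c(47, 18, 25, 3, 14, 9, 29)

# fivenum() returns minimum, Q1, median, Q3, maximum
ex5.fivenum <- fivenum(ex5.data)

# IQR = Q3 - Q1
ex5.iqr <- ex5.fivenum[4] - ex5.fivenum[2]
ex5.iqr
```

The result should match the hand calculation of 15.5.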
The standard deviation is easily influenced by extreme observations, while the IQR is not. For this reason, we prefer the IQR as a measure of spread when the data are skewed. Putting this together with what we know about measures of center, we conclude that the mean and standard deviation are recommended when the data are symmetric. If the data are skewed, then the median and IQR are recommended.
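To illustrate this sensitivity, the sketch below takes the data from Example 4 and replaces the largest value, 54, with a made-up extreme value of 540. The modified dataset and the helper name `hinge_iqr` are assumptions introduced here for illustration only.

```r
original     <- c(47, 18, 25, 3, 14, 9, 29, 54)
with_outlier <- c(47, 18, 25, 3, 14, 9, 29, 540)  # hypothetical: 54 replaced by 540

# IQR computed from fivenum(), as in the examples above
hinge_iqr <- function(x) {
  h <- fivenum(x)
  h[4] - h[2]
}

# The standard deviation grows substantially with the extreme value
sd(original)
sd(with_outlier)

# The IQR is unchanged, since the middle of the data is the same
hinge_iqr(original)
hinge_iqr(with_outlier)
```

Only the largest observation changed, so the quartiles and the IQR stay the same while the standard deviation increases.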