The Sample Mean

The sample mean, \(\bar x\), of a set of data \(x_1, x_2, \ldots, x_n\) is the sum of the data values divided by the number of observations:

\[\bar x = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i\]

Example 1: Find the sample mean of the following set of data: 5.3, 4.1, 4.2, 6.6, 2.9

In this case, we have \(x_1=5.3\), \(x_2=4.1\), \(x_3=4.2\), \(x_4=6.6\), \(x_5=2.9\) and \(n=5\) so \[\bar x = \frac{5.3+4.1+4.2+6.6+2.9}{5}=4.62\]

In R, we can use the mean function:

> ex1.data <-c(5.3,4.1,4.2,6.6,2.9)
> mean(ex1.data)

[1] 4.62
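As a quick check, the same value can be computed directly from the definition. This is a minimal sketch; the variable names n and xbar are illustrative and are not part of the original example.

```r
# Data from Example 1
ex1.data <- c(5.3, 4.1, 4.2, 6.6, 2.9)

# Apply the definition: sum of the values divided by the number of observations
n <- length(ex1.data)
xbar <- sum(ex1.data) / n

# Compare with the built-in function (all.equal allows for floating-point rounding)
all.equal(xbar, mean(ex1.data))
```

The comparison should report TRUE, since mean carries out the same calculation.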

The Sample Median

The sample median, \(m\), is the middle observation of a set of observations arranged in increasing order. It is located at position \(\frac{n+1}{2}\) in the ordered list. If the sample size is odd, this position is a whole number and the median is the middle observation. If the sample size is even, the position falls halfway between two whole numbers and the median is the average of the two middle observations.

Example 2: Find the sample median of the following set of data: 5.3, 4.1, 4.2, 6.6, 2.9

First we must order the data: 2.9, 4.1, 4.2, 5.3, 6.6. Next we must compute \(\frac{n+1}{2}=\frac{5+1}{2}=3\), which means that the median is the third observation in the ordered list, i.e. \(m = 4.2\).

In R, we can use the median function:

> median(ex1.data)

[1] 4.2
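The steps of the definition (order the data, find position \(\frac{n+1}{2}\), and average the two neighbouring observations when that position is not a whole number) can also be written out by hand. The function name manual_median below is hypothetical and used only for illustration; the sketch assumes a non-empty numeric vector with no missing values.

```r
# A hand-written median that follows the definition step by step.
# manual_median is an illustrative name, not a built-in R function.
manual_median <- function(x) {
  sorted <- sort(x)        # arrange the observations in increasing order
  n <- length(sorted)
  pos <- (n + 1) / 2       # position of the median in the ordered list
  if (n %% 2 == 1) {
    sorted[pos]            # odd n: the middle observation
  } else {
    # even n: average the two observations on either side of pos
    mean(sorted[c(floor(pos), ceiling(pos))])
  }
}

manual_median(c(5.3, 4.1, 4.2, 6.6, 2.9))  # 4.2, as in Example 2
```

In practice the built-in median function is the better choice; this version only makes the two cases of the definition explicit.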

Comparing the mean and the median

Often, the mean and the median take similar values, as in the examples above. However, if one of the values in our data is extremely small or large, the mean and the median can differ considerably. This is because the mean is affected by extreme observations while the median is resistant to them.

Example 3: Find the sample mean and the sample median of the following data: 8, 10, 4, 56, 2, 18

The sample mean is \(\bar x = \frac{8+10+4+56+2+18}{6} \approx 16.33\)

To find the sample median we must first order the data: 2, 4, 8, 10, 18, 56

Next, we must compute \(\frac{n+1}{2}=\frac{6+1}{2}=3.5\) which means that the median is the average of the third and fourth observations in the ordered list, i.e. \(m = \frac{8+10}{2} = 9\). In this case, the median was the average of the two middle observations since \(n\) was even.

Using R,

> ex3.data <-c(8, 10, 4, 56, 2, 18)
> mean(ex3.data)

[1] 16.33333

> median(ex3.data)

[1] 9

The sample mean is much larger than the sample median due to the one large observation in our data. For this reason, it is recommended that the median be used to describe skewed data and the mean be used to describe symmetric data.
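To see how much of the gap is due to the single large observation, we can recompute both measures with the value 56 left out. This is only an illustrative sketch; the variable names no.outlier, mean.no.outlier and median.no.outlier are not part of the original example.

```r
# Data from Example 3
ex3.data <- c(8, 10, 4, 56, 2, 18)

# Drop the one extreme observation
no.outlier <- ex3.data[ex3.data != 56]

mean.no.outlier <- mean(no.outlier)      # 8.4 (was about 16.33)
median.no.outlier <- median(no.outlier)  # 8   (was 9)

mean.no.outlier
median.no.outlier
```

Removing one observation cuts the mean roughly in half, while the median moves only from 9 to 8.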

The mean and the median and the shape of a distribution

In fact, the mean and the median can be useful in determining the shape of a distribution. If a distribution is symmetric then the mean is roughly equal to the median. If the distribution is right skewed then the mean will typically be larger than the median, because the large data values in the right tail inflate the mean. Likewise, distributions that are left skewed will typically have a mean that is smaller than the median.
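The relationship between skewness and the two measures can be checked on small data sets. The three vectors below are made-up values chosen only for illustration (they do not come from the examples above); left.skewed is simply the mirror image of right.skewed.

```r
# Made-up illustrative data, not taken from the examples above
right.skewed <- c(1, 2, 2, 3, 3, 3, 4, 10, 25)  # long tail of large values
left.skewed  <- 26 - right.skewed               # mirror image: long tail of small values
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)

# A positive difference means the mean exceeds the median
mean(right.skewed) - median(right.skewed)  # positive
mean(left.skewed) - median(left.skewed)    # negative
mean(symmetric) - median(symmetric)        # zero
```

For real data, a histogram (for example with hist) is a more reliable guide to shape than the sign of this difference alone.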

Example 4: For the following descriptions, would you recommend using the mean or the median as a measure of center?

  • Grades on a difficult exam
  • Students’ guesses of their professor’s age on the first day of class
  • IQ scores of randomly selected adults

  • Grades on a difficult exam would be right skewed since most students get low grades and just a few students get high grades (i.e. they “pull” the mean up). In this case, I would report the median.
  • Students’ guesses of their professor’s age on the first day of class would be left skewed since most professors tend to be older and only a few are younger. In this case, I would report the median.
  • IQ scores of randomly selected adults would be symmetric since most people have around average IQ and only a few have either very low or very high IQs. In this case, I would report the mean.