In this series of notes, we will learn about some of the fundamental concepts in statistics. We will demonstrate the application of these concepts using the R programming language. The content of this series is largely taken from the edX course on Statistics: Unlocking the World of Data, which is offered by EdinburghX.
Module 1 looks at techniques for describing, presenting and summarising data. Throughout this module we will make use of two data sets: the performance data set, which shows primary school performance in Surrey and Hampshire, and the offers data set, which shows school admissions offer data by local authority in England.
To get started we will need to load the tidyverse, janitor and knitr packages (knitr provides the kable() function used below to format tables).
# Load packages
library(tidyverse)
library(janitor)
library(knitr)   # for kable()
Let’s read in the two data sets we will use for this module and look at the first 6 rows of data in each. The performance data set describes Key Stage 2 performance for primary schools in Surrey and Hampshire across the 2016-2017 to 2018-2019 academic years. It was published by the UK government.
# Load the performance data
load("data-in/performance.rda")
# Preview the dataframe
performance %>%
  head() %>%
  kable(digits = 2,
        align = "l",
        caption = "First 6 rows of the performance dataset")
| lea | urn | estab | dfe_number | laestab | schname | admin_district | academic_year | estab_type | education_phase | relchar | percentage_fsm | ofsted_rating | totpups | readprog | writprog | matprog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 850 | 115851 | 2001 | 850/2001 | 8502001 | Anstey Junior School | East Hampshire | 2016-2017 | Local authority maintained schools | Primary | Does not apply | 16.5 | Good | 248 | -1.8 | -0.6 | -1.1 |
| 850 | 115851 | 2001 | 850/2001 | 8502001 | Anstey Junior School | East Hampshire | 2017-2018 | Local authority maintained schools | Primary | Does not apply | 16.5 | Good | 249 | -2.0 | -2.9 | -2.0 |
| 850 | 115851 | 2001 | 850/2001 | 8502001 | Anstey Junior School | East Hampshire | 2018-2019 | Local authority maintained schools | Primary | Does not apply | 16.5 | Good | 240 | -1.9 | -2.8 | -2.6 |
| 850 | 115852 | 2002 | 850/2002 | 8502002 | Balksbury Junior School | Test Valley | 2016-2017 | Local authority maintained schools | Primary | Does not apply | 9.9 | Good | 349 | -2.0 | -1.9 | -2.2 |
| 850 | 115852 | 2002 | 850/2002 | 8502002 | Balksbury Junior School | Test Valley | 2017-2018 | Local authority maintained schools | Primary | Does not apply | 9.9 | Good | 351 | -2.6 | -1.0 | -1.6 |
| 850 | 115852 | 2002 | 850/2002 | 8502002 | Balksbury Junior School | Test Valley | 2018-2019 | Local authority maintained schools | Primary | Does not apply | 9.9 | Good | 357 | -3.6 | -1.9 | -3.2 |
The offers data set shows school admissions offers by local authority in England for all academic years from 2014-2015 to 2020-2021. It was also published by the UK government.
# Load in the offers data
load("data-in/offers.rda")
# Preview the dataframe
offers %>%
  head() %>%
  kable(digits = 2,
        align = "l",
        caption = "First 6 rows of the offers dataset")
| academic_year | lea | la_name | education_phase | nc_year_admission | admission_numbers | applications_received | online_applications | online_apps_percent | no_of_preferences | first_preference_offers | first_preference_offers_percent | second_preference_offers | second_preference_offers_percent | third_preference_offers | third_preference_offers_percent | one_of_the_three_preference_offers | one_of_the_three_preference_offers_percent | preferred_school_offer | preferred_school_offer_percent | non_preferred_offer | non_preferred_offer_percent | no_offer | no_offer_percent | schools_in_la_offer | schools_in_la_offer_percent | schools_in_another_la_offer | schools_in_another_la_offer_percent | offers_to_nonapplicants |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-2021 | 201 | City of London | Primary | R | 30 | 27 | 25 | 92.59 | 6 | 24 | 88.89 | 0 | 0.00 | 0 | 0.00 | 24 | 88.89 | 25 | 92.59 | 2 | 7.41 | 0 | 0.00 | 11 | 40.74 | 16 | 59.26 | 0 |
| 2020-2021 | 202 | Camden | Primary | R | 1688 | 1591 | 1590 | 99.94 | 6 | 1191 | 74.86 | 142 | 8.93 | 55 | 3.46 | 1388 | 87.24 | 1466 | 92.14 | 118 | 7.42 | 7 | 0.44 | 1411 | 89.08 | 173 | 10.92 | 0 |
| 2020-2021 | 203 | Greenwich | Primary | R | 3845 | 3554 | 3524 | 99.16 | 6 | 3100 | 87.23 | 230 | 6.47 | 65 | 1.83 | 3395 | 95.53 | 3452 | 97.13 | 102 | 2.87 | 0 | 0.00 | 3286 | 92.46 | 268 | 7.54 | 0 |
| 2020-2021 | 204 | Hackney | Primary | R | 2959 | 2440 | 2411 | 98.81 | 6 | 2055 | 84.22 | 206 | 8.44 | 61 | 2.50 | 2322 | 95.16 | 2368 | 97.05 | 72 | 2.95 | 0 | 0.00 | 2296 | 94.10 | 144 | 5.90 | 0 |
| 2020-2021 | 205 | Hammersmith and Fulham | Primary | R | 1536 | 1392 | 1360 | 97.70 | 6 | 1020 | 73.28 | 143 | 10.27 | 84 | 6.03 | 1247 | 89.58 | 1311 | 94.18 | 81 | 5.82 | 0 | 0.00 | 1295 | 93.03 | 97 | 6.97 | 0 |
| 2020-2021 | 206 | Islington | Primary | R | 2310 | 2027 | 2027 | 100.00 | 6 | 1554 | 76.67 | 188 | 9.27 | 77 | 3.80 | 1819 | 89.74 | 1914 | 94.43 | 113 | 5.57 | 0 | 0.00 | 1809 | 89.25 | 218 | 10.75 | 0 |
Data is information and comes in different types. Qualitative data describes a quality or category: it is nominal if the categories have no natural order, and ordinal if they do. Quantitative data records a numerical quantity: it is discrete if it can only take particular values (such as whole-number counts) and continuous if it can take any value within a range.
Usually we are interested in data that comes from recording some quantity or quality of interest. We may know in advance what the possible values of the data could be, but until we collect the data we do not know the actual values. For example, we know that if we roll a dice we will get a number between 1 and 6, but we don’t know which one until we roll it. We might know that the speed of a moving car at a particular location is between 0 and 100 miles per hour, but we don’t know the exact speed until we measure it. When a quantity or quality of interest can take on a range of values like this, before we observe the actual value we refer to it as a random variable, or variable for short. The table below lists the variables in the performance data set along with their types.
| Column Name | Data Type | Description |
|---|---|---|
| lea | Qualitative Nominal | Unique code of the Local Education Authority |
| urn | Qualitative Nominal | Unique Reference Number of the school |
| estab | Qualitative Nominal | School establishment number |
| dfe_number | Qualitative Nominal | School DfE number |
| laestab | Qualitative Nominal | School LA Establishment number |
| schname | Qualitative Nominal | School name |
| admin_district | Qualitative Nominal | Administrative district |
| academic_year | Qualitative Nominal | Academic year |
| estab_type | Qualitative Nominal | School type |
| education_phase | Qualitative Nominal | Phase of education |
| relchar | Qualitative Nominal | Religious character |
| percentage_fsm | Quantitative Continuous | Percentage of pupils eligible for free school meals |
| ofsted_rating | Qualitative Nominal | Latest Ofsted rating |
| totpups | Quantitative Discrete | Total number of pupils |
| readprog | Quantitative Continuous | Reading progress measure |
| writprog | Quantitative Continuous | Writing progress measure |
| matprog | Quantitative Continuous | Maths progress measure |
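To check how R has actually stored each of these variables, we can inspect the data frame, for example with dplyr's glimpse() function:
# Inspect the column types R has assigned to the performance data
glimpse(performance)
Note that R's storage types (character, double, integer) are not the same thing as the statistical types above; a nominal variable such as lea, for example, may well be stored as a number.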
Statistics is a branch of maths that deals with analysing data and interpreting the results. Scientific research often advances by posing questions and then answering them, and statistics provides the tools to answer such questions using data. For example, if you are looking at primary school performance, you might want to know whether schools with a religious character achieve better maths results than those with no religious character. To answer this, you might first collect some data: for example, a random sample of 100 schools, 50 with no religious character and 50 with a religious character. You would then collect data on the maths performance of each school and use that data to determine whether having a religious character improves school performance in maths.
In order to answer the question, you would first look at the data, which might look something like this:
# We'll create a new df called test_relchar to hold our sample data
# for religious character and maths performance.
test_relchar <- performance %>%
  filter(!is.na(matprog)) %>%               # remove NAs
  filter(academic_year == "2018-2019") %>%  # most recent year
  select(urn, relchar, matprog) %>%         # required cols
  mutate(is_religious = if_else(            # bool for relchar
    relchar %in% c("None", "Does not apply"),
    FALSE,
    TRUE)) %>%
  group_by(is_religious) %>%                # sample 50 of each
  slice_sample(n = 50) %>%
  ungroup() %>%
  select(urn, is_religious, matprog)

# Preview the result
test_relchar %>%
  head() %>%
  kable(digits = 1,
        align = "l",
        caption = "Religious character and maths performance")
| urn | is_religious | matprog |
|---|---|---|
| 125008 | FALSE | -0.4 |
| 142415 | FALSE | 0.1 |
| 116242 | FALSE | 0.3 |
| 136888 | FALSE | 5.1 |
| 140350 | FALSE | -1.4 |
| 116166 | FALSE | -0.1 |
The table shows 100 schools and the corresponding maths progress score for each. You can then divide the schools into two groups and draw a chart with one box plot for the group with a religious character and one for the group without, as shown below.
# Plot a boxplot of maths performance by religious character
ggplot(test_relchar, aes(is_religious, matprog)) +
  geom_boxplot() +
  labs(x = "Is religious?",
       y = "Maths progress score",
       title = "Boxplot of maths progress scores in religious and non-religious schools")
Looking at this chart might suggest that having a religious character has, if anything, slightly decreased the maths progress score, but we would need to carry out a formal statistical analysis, which takes into account the variability in each group, before drawing any firm conclusion.
Often we start out with an initial question that we want to answer with data. We collect the data and analyse it to answer that question, but in doing so we often uncover further questions that we may want to investigate. This process of analysis, refinement and reanalysis is generally how scientific research develops, and it can be represented by the diagram below, which we call the statistics cycle.
Figure 1: The statistics cycle
It is a useful tool for designing, conducting and interpreting a statistical experiment. We typically start with a question that we wish to answer. Given the question we collect relevant data that allows us to investigate this question. Using the data we conduct an appropriate statistical analysis and interpret the results accordingly. Given our updated understanding of the world from the results obtained we may pose a further follow-up or in-depth question that we wish to investigate, and so on…
The climate of the Earth has changed throughout history. There is evidence to show that we are currently in a state of rapid warming, and there is a lot of debate about whether or not this will continue, and what impact it might have on our planet. Scientific studies looking into global warming make use of statistics in order to try to answer these difficult questions. A huge amount of data is needed to say anything concrete about long-term weather patterns, but closer to home we can investigate the temperature of our local area and how this has changed over the past few years.
First, we need to design a study that will allow us to answer our question. We begin by defining the specific question that we are interested in, before looking at what data we might collect, and how we might record it.
We are interested in whether or not there has been an overall change in temperature over the past three years, and if so what this change has been.
Here is how we might start to design our study: we will record the monthly average maximum temperature for our nearest city, for every month over the past three years, using an online weather archive.
Now that we have designed a study, we are ready to collect data. Here are the monthly average maximum temperatures for Edinburgh over the three years from January 2014 to December 2016 (source: Weather Underground).
# Define the variables
year <- c(rep(2014, 12), rep(2015, 12), rep(2016, 12))
month <- rep(c("Jan", "Feb", "Mar", "Apr",
               "May", "Jun", "Jul", "Aug",
               "Sep", "Oct", "Nov", "Dec"), 3)
temp <- c(7, 8, 10, 12, 15, 18, 21, 18, 17, 14, 10, 7,
          7, 7, 9, 13, 13, 17, 17, 19, 17, 14, 11, 10,
          7, 7, 9, 10, 15, 16, 19, 18, 18, 13, 8, 9)

# Build a dataframe
edinburgh <- tibble(year, month, temp) %>%
  mutate(year = as.character(year),
         month = factor(month, levels = month.abb))

# View the data
kable(edinburgh %>%
        pivot_wider(names_from = month, values_from = temp))
| year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 7 | 8 | 10 | 12 | 15 | 18 | 21 | 18 | 17 | 14 | 10 | 7 |
| 2015 | 7 | 7 | 9 | 13 | 13 | 17 | 17 | 19 | 17 | 14 | 11 | 10 |
| 2016 | 7 | 7 | 9 | 10 | 15 | 16 | 19 | 18 | 18 | 13 | 8 | 9 |
It is hard to distinguish any firm patterns just by looking at this table of data. We can see that May to September is typically warmer than October to April, but we can't really tell whether temperatures are increasing or decreasing from year to year.
We just looked at weather data in the form of a table. We found that it was not simple to extract useful information from the data in this form, even though it was a relatively small data set. Modern data science often involves analysing data sets with thousands, if not millions of pieces of data in them. This section focuses on ways of visualising data in order to make it easier to analyse and interpret. This idea dates back to the work of William Playfair in the late 1700s.
In 1786, the Scottish engineer and economist William Playfair (1759-1823) published a book of economic data called The Commercial and Political Atlas, in which the tables of data were also presented in the form of line graphs. Today, the line graph is still largely the favoured way of presenting data over time.
The offers data set shows the percentage of pupils that were offered their first-choice school place over time. The line graph below shows the first preference percentage for primary school offers in Surrey from the 2014-2015 academic year to 2020-2021.
line_offers <- offers %>%
  # Filter to show only primary schools in Surrey
  filter(education_phase == "Primary",
         la_name == "Surrey")

ggplot(data = line_offers,
       mapping = aes(x = academic_year,
                     y = first_preference_offers_percent,
                     group = la_name)) +
  geom_line() +   # join the yearly values with a line
  geom_point() +  # one dot per academic year
  labs(x = "Academic year",
       y = "First preference offers (%)")
The graph has a horizontal scale showing the years, and a vertical scale showing a range of first preference offer percentages. Each dot on the graph represents one piece of data - the percentage of first preference offers that were made for a given academic year. Reading vertically downwards off the horizontal scale tells you the date, while reading horizontally to the left off the vertical scale tells you the percentage of first preference offers that were made for that academic year. The dot furthest to the left, for example, tells you that the percentage of first preference offers for 2014-2015 was 81.5%. The dots have been joined by a line to help provide a visualisation of how the first preference offer rate has changed, producing the line graph.
Line graphs can be used whenever we want to show how a variable changes over time. The graph below plots the monthly average daily maximum temperature in Edinburgh as three lines, one for each of the years 2014, 2015 and 2016.
# Plot a line graph for Edinburgh weather data
ggplot(data = edinburgh) +
  geom_line(mapping = aes(x = month,
                          y = temp,
                          group = year,
                          color = year))
Playfair’s Statistical Breviary, published in 1801, used entirely different data visualisation methods to the Commercial and Political Atlas. One of these is the bar chart, which Playfair used to show and compare the populations and revenues of different countries. The bar chart is one of the most commonly seen types of graph or chart today, and is a useful way of showing and comparing amounts or values across different categories.
The performance data set includes data on school types in Surrey. The bar chart below shows the total number of pupils by school type for the academic year 2018-2019.
bar_performance <- performance %>%
  # Filter to show only schools in Surrey for 2018-2019
  filter(lea == 936,
         academic_year == "2018-2019")

ggplot(data = bar_performance) +
  geom_bar(mapping = aes(x = reorder(estab_type, -totpups), y = totpups),
           stat = "identity",
           na.rm = TRUE) +
  labs(x = "Type of school",
       y = "Total pupils")
Bar charts allow us to present data that vary across different categories or over discrete time intervals. They are also a good way of showing the number of times different values appear in a data set when the data is a qualitative (or discrete quantitative) variable, providing a general picture of what the data set looks like. But what if the data are from a continuous quantitative variable?
In the performance data set, we have data on the maths progress scores of 492 primary schools in Surrey and Hampshire in the 2018-2019 academic year. The maths progress score is a continuous variable, and if the scores were measured to a high degree of accuracy, it is unlikely that any two schools would have exactly the same score. It follows that a bar chart showing the frequency of the different scores would consist of 492 bars of height 1, and would therefore be quite meaningless. To get a better idea of the shape of the data set we can group together similar values and produce a bar chart for the grouped values. This type of chart is referred to as a histogram.
We can group the maths progress scores into categories that are 1 point wide, counting the number of schools whose score was between, for example, 0.5 and 1.5. These categories are called classes, and we call the width of a category the class width. It is likely that some of the scores lie on the boundary between two classes, for example if a school had a score of exactly 0.5. In such cases, we must decide whether boundary values are included in the higher or the lower class. Each class is represented by the value in the middle of the class, so the class containing scores between 0.5 and 1.5 would be labelled 1. A histogram shows the number of values in the data set lying in each class.
hist_performance <- performance %>%
  # Filter to include only the 2018-2019 academic year and where matprog is
  # not null
  filter(academic_year == "2018-2019",
         !is.na(matprog))

ggplot(data = hist_performance) +
  geom_histogram(mapping = aes(matprog),
                 binwidth = 1,
                 center = 1,  # centre each class on a whole number, as described above
                 na.rm = TRUE)
The horizontal axis lists the score classes, labelled by their representative value. Each class has a corresponding bar, and the height of this bar tells us how many schools there were whose maths progress score lay within that class. Note that the vertical axis is sometimes rescaled to record the proportion, rather than the count, of values that lie within each class.
Histograms and bar charts allow us to see the “shape” of a data set. There are several different features to look out for in these types of charts.
The shape of a histogram or bar chart is often described as the distribution of the data (we will see a more formal definition of the word distribution later in the course). Sometimes, a histogram or bar chart might look completely flat, with all bars the same height. In this case we say that it is uniform. Usually though, it will feature one or more peaks. If it has just one peak, we say it is unimodal; if it has two peaks we call it bimodal; and if there are multiple peaks we say it is multimodal. If it has a reflection line down the centre, we say it is symmetric.
Figure 2: Data shapes
In the middle and right-hand diagrams above, the heights of the bars become smaller and smaller towards the far left and right. We call these regions the tails. The tail might be longer on one side than the other. If this is the case, we say that the distribution of the data is skewed. A chart with a longer tail to the left is left-skewed, while a chart with a longer tail to the right is right-skewed.
Figure 3: Skewed data shapes
Playfair’s Statistical Breviary also introduced the pie chart as a way of showing the relative sizes of the African, European and Asiatic regions of the Turkish Empire. Pie charts, like bar charts, allow us to show and compare amounts or values across different categories. They were popularised by Florence Nightingale, a nurse during the Crimean War, who used a form of pie chart to document causes of patient mortality.
The pie chart below shows the total number of offers made by preference type for primary schools across England in the 2020-2021 academic year. You can see that the overwhelming majority of applicants received an offer at their first preference school.
pie_offers <- offers %>%
  # Include only data for primary schools for 2020-2021
  filter(education_phase == "Primary",
         academic_year == "2020-2021") %>%
  # Select only the preference type columns
  select(first_preference_offers,
         second_preference_offers,
         third_preference_offers,
         preferred_school_offer,
         non_preferred_offer,
         no_offer) %>%
  # Currently preferred_school_offer overlaps with the first, second and third
  # preference offer columns. Let's make a separate column for
  # other_preference_offer and remove preferred_school_offer
  mutate(other_preference_offer = preferred_school_offer -
           (first_preference_offers +
              second_preference_offers +
              third_preference_offers)) %>%
  select(-preferred_school_offer) %>%
  # Pivot this data to long format, and summarise by sum
  pivot_longer(everything(),
               names_to = "preference_type",
               values_to = "offers",
               values_drop_na = TRUE) %>%
  group_by(preference_type) %>%
  summarise(total_offers = sum(offers)) %>%
  ungroup() %>%
  # Rename the preference_type values to a chart-friendly format
  mutate(preference_type = case_when(
    grepl("first_pref", preference_type) ~ "First Preference",
    grepl("second_pref", preference_type) ~ "Second Preference",
    grepl("third_pref", preference_type) ~ "Third Preference",
    grepl("other_pref", preference_type) ~ "Other Preference",
    grepl("non_pref", preference_type) ~ "Non-Preferred",
    grepl("no_off", preference_type) ~ "No Offer",
    TRUE ~ preference_type))
ggplot(pie_offers, aes(x = "", y = total_offers, fill = preference_type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank()) +
  scale_fill_brewer(palette = "Blues")
The area of each ‘wedge’ in the pie chart is proportional to the number of offers made for a particular preference type. Presenting the data in a pie chart rather than a bar chart makes it harder to compare the offer values accurately without looking at the figures. We do, however, gain the ability to compare the number of offers for a particular preference type with the total number of offers overall.
The type of graph you should use depends not only on the type of your data, but also on what you want to do with your data. The table below depicts the most useful chart types depending on what you want to achieve.
| Objective | Table | Line Graph | Bar Chart | Histogram | Pie Chart |
|---|---|---|---|---|---|
| Compare values over time | • | • | | | |
| Compare values between categories | • | | • | | • |
| See the distribution of data | | | • | • | |
| Look for trends over time | | • | | | |
| Look at the composition of data | • | | • | | • |
In statistics, we often want to communicate as much information about our data as possible, in as simple a way as possible. Sometimes, presenting data in the form of a graph or chart is not the best way to do this. In this section we look at other key ways of presenting data by providing summaries of the information contained within the data.
Ways of summarising data can be divided into two categories: those that aim to give a ‘typical value’ (or average) of a data set, and those that give a measure of how different the values of the data set generally are from this typical value. Such measures are called summary statistics.
Given a set of data, an average gives an idea of what a typical value of that data set is. There are many different ways to define an average value of a data set, and in this course we will focus on two of them.
The most widely used average is called the arithmetic mean, or, more commonly, just the mean. We calculate the mean of a data set by adding up all the values in the data set and dividing the total by the number of pieces of data in the set. What we are doing here is ‘sharing’ the data out evenly. For example, if over the past three nights you have slept 7, 6 and 9 hours respectively, then the mean amount of sleep is
\[ \frac{7+6+9}{3}=7\frac{1}{3}=\text{7 hours and 20 minutes} \]
per night, which is the number of hours you would have slept each night if the same total amount of sleep had been spread evenly over the three nights.
Another important average is the median. The median is found by ordering the data from smallest to largest and choosing the middle value. If you have an even number of pieces of data, then there are two middle values, and we choose the median to be the value half way between those (which is the mean of the two values). In our above example, the median amount of sleep for those three nights is 7 hours.
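As a quick check, both of these averages can be computed in R using the built-in mean() and median() functions:
# Hours slept over the past three nights
sleep_hours <- c(7, 6, 9)

mean(sleep_hours)    # 7.33 hours, i.e. 7 hours and 20 minutes
median(sleep_hours)  # 7 hours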
Note that when someone quotes an “average” value for a given data set, they have most likely calculated the mean. However, this need not be the case, and in general the type of average used should always be specified.
If you have ever watched an artistic gymnastics competition, you may have wondered how the scores are calculated. Following a change of rules in 2008, a contestant’s score is calculated by adding together two scores, one for the difficulty of the contestant’s routine, and one for the execution of the routine - how well they performed it. The difficulty score is simply agreed between two judges, but the execution score is calculated from the separate scores of five judges: the highest and the lowest of the five scores are discarded, and the execution score is the mean of the remaining three.
Sometimes the score of one of the judges is much higher or lower than the scores of the other judges. In statistics, an unusually low or high data value is called an outlier.
The reason why the smallest and largest scores are ignored in gymnastics scoring is to avoid the effect of outliers. Changing the value of an outlying score, making it even lower or higher, does not affect the gymnast’s execution score, but it does change the mean of all five scores. The mean is sensitive to outliers: particularly large and small values can change it considerably.
The idea of calculating the mean after removing outliers is used more generally in statistics, outside of the calculation of gymnastics scores. A trimmed mean is a mean that is calculated after discarding, for example, the smallest 10% and the largest 10% of the values in the data set. Discarding the lower and upper 10% values and calculating the mean of the remaining values would be called the 10% trimmed mean. The precise percentage of values discarded varies, but is usually between 5% and 25%. The gymnastics execution score is the 20% trimmed mean of the five gymnastics scores.
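R's mean() function has a trim argument that implements this idea directly. Here is a small sketch using made-up execution scores for a panel of five judges:
# Five hypothetical judges' scores, including one unusually low value (an outlier)
scores <- c(8.9, 9.1, 9.0, 9.2, 7.4)

mean(scores)              # the ordinary mean is pulled down by the outlier
mean(scores, trim = 0.2)  # 20% trimmed mean: drops the lowest and highest score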
You might wonder what would happen if the gymnastics execution score was calculated using the median rather than the mean. The median of all five scores is the same as the median of the three middle values, and so there is no longer any need to remove the smallest and largest scores. The median can be thought of as a trimmed mean where all but the middle value is discarded. In particular, any outliers are certainly discarded, and so do not affect the median. In other words, the median is resistant to outliers.
How big is the gap between the earnings of men and women in the United Kingdom? In 2009, it was calculated by the Office for National Statistics (ONS) that female full-time workers were paid on average 12.8% less than male full-time workers. In that same year, it was reported by the Equality and Human Rights Commission (EHRC) that this gap is actually 17.1%. Both organisations used the same dataset, so how have they calculated different figures? The key is in the word ‘average’.
The two graphs below show simulated pay data for 1000 males and 1000 females. Take a look at the mean and median in each graph. How do they compare? Can you see why the two figures quoted above could be different?
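As a rough illustration of what such graphs show, we can simulate right-skewed pay data ourselves (the figures below are made up purely for illustration and are not the ONS data):
# Simulate annual pay for 1000 males and 1000 females using a right-skewed
# (log-normal) distribution with made-up parameters
set.seed(42)
male_pay   <- rlnorm(1000, meanlog = 10.20, sdlog = 0.50)
female_pay <- rlnorm(1000, meanlog = 10.05, sdlog = 0.45)

# A few very high earners pull the mean above the median in each group
c(mean = mean(male_pay),   median = median(male_pay))
c(mean = mean(female_pay), median = median(female_pay))
Because pay data are right-skewed, the mean sits above the median in both groups, so a gap measured using means need not equal a gap measured using medians.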
The figures for the difference between the average pay of males and females are different because the type of average used in each case is different. As you can see from the graphs above, the difference in median pay between males and females is much less than the difference between mean pay. The ONS compared median pay, while the EHRC compared mean pay, and so the ONS got a smaller figure for the difference than the EHRC did.
So why are means and medians so different? We have seen that the mean is affected by outliers while the median is not, but the overall distribution of the data has an impact too.
Suppose you are in charge of a housing development project. You might look at the size of families in the area to get an idea of how many bedrooms to build in each house. You might find that the mean family size is 3.2, and build a large number of three bedroom houses. But what the average here doesn’t tell us is that family size has a bimodal distribution, with many households consisting of just one or two people, and many consisting of four or more people. Three-bedroom houses may not sell well.
Averages are useful ways of summarising data, but we must be careful when we come across them. We have seen that we must take into account what type of average is being used, but we should also think about whether or not we need further information that the average does not tell us.
In October 2010, the Office for National Statistics released figures describing the average British man and woman. These figures were the focus of a BBC News article, which can be found on the BBC News website. You might find it interesting to read the article and think about what information is lost when such averages are taken.
In general, what information might be lost when we take an average? What problems can averaging cause for decision- and policy-making? Can you think of any other situations in which taking an average might give a completely wrong impression?
When we take an average, we will always lose information about our data - we are summarising the full data set with just a single “typical” value. How much information is lost, and how critical this is, depends on how much the data values vary from the average. Fortunately, there are ways of quantifying the variability of a data set that can be presented in addition to an average, and this provides further information about our data that would otherwise be lost.
There are many ways to measure the variability of a data set.
The most straightforward way of measuring the variability of a data set is to use the range, which is simply the difference between the smallest and the largest values in a data set.
Looking back at our weather data for Edinburgh, we can see that the average (mean) maximum daily temperature each month in 2015, in °C, was as follows.
# View the data
edinburgh %>%
  filter(year == "2015") %>%
  pivot_wider(names_from = month, values_from = temp) %>%
  select(-year) %>%
  kable()
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 7 | 9 | 13 | 13 | 17 | 17 | 19 | 17 | 14 | 11 | 10 |
The lowest monthly mean maximum temperature was 7°C, whilst the highest was 19°C. The range is therefore 12°C.
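As a quick check, we can calculate the same figures in R from the edinburgh data frame built earlier:
# Extract the 2015 monthly temperatures as a vector
temp_2015 <- edinburgh %>%
  filter(year == "2015") %>%
  pull(temp)

range(temp_2015)        # the smallest and largest values: 7 and 19
diff(range(temp_2015))  # the range: 19 - 7 = 12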
Most months of the year it is not as cold as 7°C or as warm as 19°C, and so the range gives a rather extreme picture of how the weather varies over the year. We might instead use the interquartile range to measure the variability of our data. The interquartile range is the difference between the upper quartile and the lower quartile. The lower quartile is the value such that one quarter of the data lies below it, and the upper quartile is the value such that one quarter of the data lies above it. Note that half the data lies between the lower quartile and the upper quartile, and so the interquartile range measures the range of the ‘middle half’ of our data.
To calculate the interquartile range, we begin by putting our data in order, from smallest to largest: \[ 7, 7, 9, 10, 11, 13, 13, 14, 17, 17, 17, 19. \]
Next we divide the data set into two halves. We have an even number of pieces of data, and so we can do this exactly.
If we had an odd number of pieces of data, then we would include the middle value (the median) in both halves. We can now calculate the lower quartile as the median of the first of the two halves of the data set, which in our case is 9.5°C. The upper quartile is the median of the second half of the data, which is 17°C. The interquartile range, the difference between the upper quartile and the lower quartile, is therefore 7.5°C.
Note that there are other methods for calculating the lower and upper quartile values that differ slightly from the one described here. All methods are essentially the same for large data sets, and so this is nothing to worry about.
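In R, the fivenum() function uses essentially the 'median of each half' method described above, so it reproduces our hand calculation, while quantile() and IQR() use a slightly different default method and may differ a little on small data sets:
# Five-number summary of the 2015 temperatures (using temp_2015 from above):
# minimum, lower quartile, median, upper quartile, maximum
fivenum(temp_2015)  # 7.0  9.5  13.0  17.0  19.0

# quantile() and IQR() use a different default calculation method
quantile(temp_2015, probs = c(0.25, 0.75))
IQR(temp_2015)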
There are two other key measurements of the variability of a data set, namely the variance and the standard deviation. These work by giving a measure of how far away the different data values are from the mean. The two are closely related, with the standard deviation being the square root of the variance. The variance and standard deviation are the measurements most commonly used by statisticians, and we will introduce them formally later in the course.
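In R these are available as var() and sd(); a quick look at the 2015 temperatures (again using temp_2015 from above):
# Variance and standard deviation of the 2015 temperatures
var(temp_2015)
sd(temp_2015)  # the standard deviation is the square root of the variance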
Note that, because of similarities in the calculation method, the range and interquartile range are usually used alongside the median. Similarly, the variance or standard deviation is generally used alongside the mean.
When looking at the calculation of gymnastics scores, we saw that the mean is sensitive to outliers while the median is resistant to outliers. More generally, we say that any measure such as an average or a range is sensitive to outliers if it is strongly affected by them, and resistant to outliers if it is not. The range, which depends directly on the smallest and largest values, is sensitive to outliers, while the interquartile range is resistant to outliers.
In the previous section, we looked at ways of presenting data graphically. There is one further visual representation of data that we will introduce this week, and this is the box plot. The box plot is different from many other visual representations of data because rather than displaying the entire data set, it shows only certain summary statistics: the minimum, lower quartile, median, upper quartile and maximum values. The box plot below shows the summary statistics that we calculated previously for the temperature of Edinburgh in 2015.
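A minimal sketch of how such a box plot can be drawn with ggplot2 is shown here; note that ggplot2 computes the quartiles using its own default method, so the hinges it draws may differ very slightly from our hand calculation on a small data set like this.
# Box plot of the 2015 monthly temperatures
edinburgh %>%
  filter(year == "2015") %>%
  ggplot(aes(x = temp, y = "")) +
  geom_boxplot() +
  labs(x = "Temperature (°C)", y = NULL)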
The horizontal scale along the bottom gives the temperature in °C. Above the scale is a rectangular box with the left-hand side aligned at the lower quartile value, 9.5, and the right-hand side aligned at the upper quartile value, 17. The box is subdivided by a vertical line aligned at the median, which is 13. The width of the rectangle is the interquartile range. Two further vertical lines are aligned at the minimum and maximum values, which are 7 and 19 respectively, joined to the rectangle by horizontal lines. The distance between these two outer vertical lines is the range.
Box plots provide a neat way of summarising a data set visually, but since they are based on summary statistics rather than the entire data set, information about the distribution of the data set and its exact values is naturally lost. That said, there are benefits to using a box plot over another chart, such as the ability to easily see the summary statistics. Furthermore, it is possible to use a box plot to get an idea of what the distribution of the data set might be. For example, we can determine the smallest and largest values of the data set, and we know the proportions of the data set that lie between the various vertical lines on the box plot.
Earlier in the course we designed a simple study to investigate weather patterns, collecting temperature data from our local area and presenting it in the form of a line graph. Now we will see if we can make better sense of our data by looking at the summary statistics.
The data we collected was the mean daily maximum temperature in each month from 2014 to 2016. There are two ways we can make use of summary statistics here. Firstly, we can take the summary statistics of the three values for a given month, and secondly we can take the summary statistics of all the values for a given year.
We are aiming to use our data to see how the temperature of our nearest city has varied over the past three years. The summary statistics of all the temperatures from a given year will be most useful to help answer the question.
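A short sketch of how these per-year summaries can be computed with dplyr:
# Summary statistics of the monthly temperatures, by year
edinburgh %>%
  group_by(year) %>%
  summarise(min_temp    = min(temp),
            median_temp = median(temp),
            iqr_temp    = IQR(temp),
            max_temp    = max(temp)) %>%
  kable(digits = 1)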
Summary statistics allow us to describe large and complex data sets in just a few key numbers. This is very convenient if we want to quickly compare data sets, or describe our data in a simple way, but we must always keep in mind that these figures may not tell us everything that our data has to say.
Take a look at the graphs below. They are examples of scatter plots, which are graphs allowing us to visualise data with two variables, one represented by the horizontal axis and one represented by the vertical axis. We will introduce scatter plots more formally next week in the course, but for now, take a look at the general location and spread of the points in the graphs. What do you think a “typical” (or average) point in each of the graphs might be? What about a measure of the spread of the data?
Image 7: Anscombe’s Quartet
The graphs shown above are known as Anscombe’s Quartet, named after the statistician Francis Anscombe (1918-2001). They are interesting because, despite the fact that the four graphs look very different, many of their summary statistics are very similar or identical. For example, the mean of the horizontal values is identical for each graph, and the means of the vertical values are equal to two decimal places. Similarly, the variance of the horizontal values is identical for each graph, and the variances of the vertical values differ only in the third decimal place. This emphasises the point that summary statistics do not capture everything we might want to, or should, know about a data set, and hence that it is important to visualise data as well as look at the summary statistics. Next week, we will look at how visualising data allows us to see patterns that we would not spot from summary statistics alone.
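Anscombe's data set is built into base R as anscombe, so you can check these summary statistics for yourself:
# Means and variances of the horizontal (x) and vertical (y) values
# in each of the four data sets
sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)  # identical
sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)  # equal to 2 decimal places
sapply(anscombe[, c("x1", "x2", "x3", "x4")], var)   # identical
sapply(anscombe[, c("y1", "y2", "y3", "y4")], var)   # very nearly equal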