Misrepresentation in Statistics
Example 1:
During lectures we saw several examples where choosing the incorrect baseline could lead to inaccurate visual representations of data. One example we encountered was the case of circulation data of the daily newspapers The Times and the Daily Telegraph.
The daily sales figures for each newspaper are given in the following table:
| Newspaper | Daily circulation |
|---|---|
| The Times | 485,729 |
| Daily Telegraph | 446,954 |
In the code block below we create a data frame to represent these data. We use a data frame so that the data can be passed to the plotting functions of the ggplot2 package, which offers many more features than the base barplot() function.
library(ggplot2) # needed for the ggplot() calls used below
Circulation <- c(485729, 446954)
Papers <- c("The Times", "Daily Telegraph")
Sales <- data.frame(Papers, Circulation)
Sales
- We recall the basic syntax of the ggplot() function here:
  - The first argument is the data frame to plot, here Sales
  - The call aes(Papers, Circulation) indicates which variable appears on the horizontal axis and which on the vertical axis
- The geom_bar() function controls most features of the bars appearing in the plot:
  - stat="identity" indicates that the bar heights are taken directly from the \(y\)-values, rather than from counts of observations
  - fill allows us to control the color of individual bars in the bar chart
- The function coord_cartesian() is where the representation of the data is really controlled. By choosing our \(y\)-limits we can make the difference in sales look as large or as small as we please
- In the plot below, we choose a baseline \(y=420000\), which makes the circulation of the Daily Telegraph look unrealistically small compared to the circulation of The Times. This baseline was used in a similar graph by The Times, in a news story reporting these differences in circulation.
ggplot(Sales, aes(Papers,Circulation))+geom_bar(stat="identity",fill=c("azure4","cornflowerblue"))+coord_cartesian(ylim=c(420000,500000))
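To see how strongly this baseline distorts the comparison, one can compare the ratio of the bar heights as drawn with the ratio of the underlying figures. The following is a minimal sketch in base R; the names `baseline`, `visible_height`, `visible_ratio` and `true_ratio` are introduced here purely for illustration.

```r
# Circulation figures taken from the Sales data frame above
circulation <- c("The Times" = 485729, "Daily Telegraph" = 446954)

# Baseline used in the truncated plot
baseline <- 420000

# Height of each bar as drawn, measured from the truncated baseline
visible_height <- circulation - baseline

# Ratio of the drawn bar heights versus ratio of the actual circulations
visible_ratio <- unname(visible_height["The Times"] / visible_height["Daily Telegraph"])
true_ratio <- unname(circulation["The Times"] / circulation["Daily Telegraph"])

round(c(visible = visible_ratio, true = true_ratio), 2)
```

The bar for The Times is drawn roughly 2.4 times as tall as the bar for the Daily Telegraph, although its circulation is only about 8.7% higher.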

- Alternatively, if we make the \(y\)-range of the plot too large, then the difference in circulations will appear negligible. In the plot below, we use an appropriate baseline of \(y=0\); however, the upper limit of the bar plot is set to \(y=4,000,000\), so both bars appear to be essentially the same height.
ggplot(Sales, aes(Papers,Circulation))+geom_bar(stat="identity",fill=c("azure4","cornflowerblue"))+coord_cartesian(ylim=c(0,4000000))
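The flattening effect of the oversized \(y\)-range can also be checked numerically. The sketch below, again in base R with illustrative names (`upper_limit`, `share_of_axis`, `gap_share`), computes the fraction of the vertical axis that each bar occupies.

```r
# Circulation figures taken from the Sales data frame above
circulation <- c("The Times" = 485729, "Daily Telegraph" = 446954)

# Upper y-limit used in the stretched plot (the baseline is 0)
upper_limit <- 4000000

# Fraction of the vertical axis occupied by each bar
share_of_axis <- circulation / upper_limit

# Gap between the two bar tops, as a fraction of the plot height
gap_share <- (max(circulation) - min(circulation)) / upper_limit

round(100 * share_of_axis, 1)  # percentages of the axis height
round(100 * gap_share, 2)      # gap as a percentage of the axis height
```

The two bars fill about 12.1% and 11.2% of the axis, so the gap between their tops is just under 1% of the plot height.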

- In the final version of this bar plot, we choose the baseline to be \(y=0\), while the upper limit of the bar plot is set to \(y=500,000\):
ggplot(Sales, aes(Papers,Circulation))+geom_bar(stat="identity",fill=c("azure4","cornflowerblue"))+coord_cartesian(ylim=c(0,500000))

- In this plot, we see that the difference in circulations is small, though not completely negligible, when compared to the overall circulation of each newspaper.
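Since the three plots above differ only in their \(y\)-limits, the comparison can be wrapped in a small helper. This is only a sketch: `plot_sales()` and `y_limit_choices` are hypothetical names introduced here, and the plotting calls are the same ones used above.

```r
library(ggplot2)

# Same data frame as in the example above
Sales <- data.frame(
  Papers = c("The Times", "Daily Telegraph"),
  Circulation = c(485729, 446954)
)

# Hypothetical helper: draw the same bar chart for a given pair of y-limits
plot_sales <- function(y_limits) {
  ggplot(Sales, aes(Papers, Circulation)) +
    geom_bar(stat = "identity", fill = c("azure4", "cornflowerblue")) +
    coord_cartesian(ylim = y_limits)
}

# The three choices of y-limits discussed above
y_limit_choices <- list(
  truncated     = c(420000, 500000),
  stretched     = c(0, 4000000),
  zero_baseline = c(0, 500000)
)
plots <- lapply(y_limit_choices, plot_sales)

# Display one of them, for example the version with a zero baseline
print(plots$zero_baseline)
```

Keeping the plotting code in one place makes it clear that only the \(y\)-limits change between the three versions; the data are identical.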
Exercise 1:
On 1 January 2013, tax cuts from the era of President George W Bush were set to expire. With the tax cuts in place, the tax rate was 35%, while the tax rate once the tax cuts expired would climb to 39.6%. In a Fox News story reporting this issue, a bar plot was used to highlight the difference between these tax rates. The baseline used for the plot was a tax rate of 34%.
Given this, answer the following:
Create a data frame to represent the tax rates before and after the expiry date.
Use the baseline 34% to plot a bar chart representing the change in tax rates.
Does the tax rate increase look significant or not, when compared to the actual tax rate?
Choose the \(y\)-limits of the graph to make the difference in tax rates look insignificant.
Use appropriate \(y\)-limits to represent the change in tax rate more accurately.
Example 2: Cumulative Revenue vs. Actual Revenue
As an example of how data may be misrepresented by selecting an inappropriate chart, we consider the following quarterly revenue data for a company:
| Quarter | Revenue (€ millions) | Cumulative revenue (€ millions) |
|---|---|---|
| 1 | 1.22 | 1.22 |
| 2 | 1.81 | 3.03 |
| 3 | 2.14 | 5.17 |
| 4 | 1.65 | 6.82 |
| 5 | 1.50 | 8.32 |
| 6 | 1.36 | 9.68 |
| 7 | 1.17 | 10.85 |
| 8 | 1.08 | 11.93 |
| 9 | 1.07 | 13.00 |
| 10 | 0.98 | 13.98 |
We first create a data frame to represent these data:
Q<-c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10")
R<-c(1.22,1.81,2.14,1.65,1.50,1.36,1.17,1.08,1.07, 0.98)
CR<-c(1.22,3.03,5.17,6.82,8.32,9.68,10.85,11.93,13.00,13.98)
Revenue<-data.frame(Q,R,CR)
Revenue$Q <- factor(Revenue$Q, levels = Revenue$Q[order(Revenue$CR)])# This will ensure ggplot will plot the data in the correct order
Revenue
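The cumulative column does not have to be typed in by hand: it is the running total of the quarterly revenue. The sketch below uses base R's `cumsum()` to check this, and builds the quarter labels with their levels already in the right order. The names `CR_typed` and `CR_computed` are illustrative only.

```r
# Quarterly revenue, as in the data frame above (€ millions)
R <- c(1.22, 1.81, 2.14, 1.65, 1.50, 1.36, 1.17, 1.08, 1.07, 0.98)

# Cumulative revenue as typed in above
CR_typed <- c(1.22, 3.03, 5.17, 6.82, 8.32, 9.68, 10.85, 11.93, 13.00, 13.98)

# Cumulative revenue computed as a running total
CR_computed <- cumsum(R)

# Compare with a numerical tolerance, since these are floating-point sums
all.equal(CR_computed, CR_typed)

# Quarter labels as a factor whose levels follow the order Q1, Q2, ..., Q10
Q <- paste0("Q", seq_along(R))
Q <- factor(Q, levels = Q)
levels(Q)
```

Setting the factor levels explicitly matters because ggplot would otherwise sort the labels alphabetically, which places Q10 directly after Q1.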
The Cumulative Revenue
- We now plot the cumulative revenue of the company
ggplot(Revenue, aes(Q, CR, group=1))+geom_line(color="gray", size=1.5)+geom_point(size=4,color="red")+labs(x="Quarter", y="Cumulative Quarterly Revenue(€ Millions)")

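The cumulative revenue plot seems to suggest the company is doing well, with revenue constantly increasing.
However, this impression results from plotting the wrong quantity: a cumulative total cannot decrease as long as each quarter's revenue is positive, so it keeps rising even when quarterly revenue is falling. When we plot the quarterly revenue directly, we find a very different picture: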
ggplot(Revenue, aes(Q, R, group=1))+geom_line(color="gray", size=1.5)+geom_point(size=4,color="red")+labs(x="Quarter", y="Quarterly Revenue(€ Millions)")

- When plotting the quarterly revenues directly, we see there is actually a steep decline in the company revenue between Q3 and Q4, and a steady decline in revenue after that.
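The decline can also be read off numerically from the quarter-on-quarter changes. A minimal sketch in base R, where `change` is a name introduced here for illustration:

```r
# Quarterly revenue, as in the data frame above (€ millions)
R <- c(1.22, 1.81, 2.14, 1.65, 1.50, 1.36, 1.17, 1.08, 1.07, 0.98)
Q <- paste0("Q", seq_along(R))

# Change in revenue from each quarter to the next
change <- diff(R)
names(change) <- paste(Q[-length(Q)], "to", Q[-1])

round(change, 2)

# Quarter with the highest revenue, and the largest single drop
Q[which.max(R)]
names(change)[which.min(change)]
```

Revenue peaks in Q3, the largest drop is from Q3 to Q4, and every change after that is negative, even though the cumulative total keeps rising.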
Exercise 2
The data file RegionalGDPGrowth(2000-2016).csv contains GDP growth rate data for various economic regions around the world. The file is available at:
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Workbook Files \(\rightarrow\) RegionalGDPGrowth(2000-2015).csv
Import this data file into the workbook folder using Data<-read.csv(file.choose()) as usual. Then create a data frame from this file using DF<-data.frame(Data)
Using this data frame answer the following:
- Obtain the mean GDP growth rate for each region using mean(DF$region), where region is replaced by the name of the corresponding column in DF
- Create a bar plot comparing the average economic growth rate of each region, and choose a baseline such that the economic growth rate of China is exaggerated compared to all the others.
- Choose \(y\)-limits so that the average economic growth rates of all regions appear to be approximately the same.
- Choose \(y\)-limits so that the average growth rates are displayed accurately.
Omitting Data
Another common way data may be misrepresented in statistical charts is by omitting data.
In Example 2 we saw that the revenue of a company could be falsely represented by plotting the cumulative revenue as opposed to the quarterly revenue.
Another way we may misrepresent this data is by not displaying all data available to us:
ggplot(Revenue, aes(Q, R, group=1))+geom_line(color="gray", size=1.5)+geom_point(size=4,color="red")+labs(x="Quarter", y="Quarterly Revenue")+scale_x_discrete(limits=c("Q1","Q2","Q3" ))

- The function scale_x_discrete() allows us to choose which quarters we actually show in the line graph. By only showing data for the first 3 quarters, we manage to give a false impression of the company revenue growth.
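The effect of the omission can be quantified by comparing the change in revenue over the quarters shown with the change over all ten quarters. A minimal sketch in base R, with illustrative names (`shown`, `growth_shown`, `change_full`):

```r
# Quarterly revenue, as in the data frame above (€ millions)
R <- c(1.22, 1.81, 2.14, 1.65, 1.50, 1.36, 1.17, 1.08, 1.07, 0.98)

# Only the first three quarters are displayed in the truncated plot
shown <- R[1:3]

# Within the displayed quarters, revenue rises every quarter
all(diff(shown) > 0)

# Relative change over the displayed quarters (Q1 to Q3)
growth_shown <- (shown[3] - shown[1]) / shown[1]

# Relative change over the full data set (Q1 to Q10)
change_full <- (R[length(R)] - R[1]) / R[1]

round(100 * c(shown = growth_shown, full = change_full), 1)
```

Over the three quarters shown, revenue rises by about 75%, whereas over all ten quarters it falls by about 20%.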
Exercise 3:
Using the data frame constructed in Exercise 2 answer the following:
- Plot the GDP growth rate for each region between the years 2000-2007.
- From these plots, what conclusion might be made about economic growth in each region?
- Plot the GDP growth rate for each region between the years 2007-2012.
- What trend might be deduced about the economic growth rate in each region, from these new plots?
- Plot the GDP growth rate for each region between 2000 and 2016.
- What would be a more accurate assessment of the economic growth rate of each region, based on these plots?
