This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

List of R colours at: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Required Libraries

The library we need for this workbook is ggplot2, which is required for plotting clustered and stacked bar-plots. This library must first be installed as follows:

  1. Go to the Tools tab and select Install Packages
  2. In the Packages box type ggplot2 and press Install
  3. Once installed, load the ggplot2 library as follows
library(ggplot2)

Clustered Bar Charts

In the case of clustered bar-charts it is often helpful to create a separate data file for our table. In the example from lectures we drew a clustered bar chart to represent the results of a customer survey carried out by a mobile phone company regarding five areas of customer experience:

Experience   Satisfied   Dissatisfied
Support      551         449
Pricing      684         316
Contracts    329         671
Coverage     848         152
Rewards      215         785

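We create this table in a standard database system and save the file with a .csv (comma separated values) extension. The plotting code in this workbook refers to the columns Experience, Satisfaction and Frequency, so the file must hold one row for each combination of experience area and satisfaction level. The file for this particular example, CustomerSurvey.csv, is available on Moodle in the section Data Files. Download this file into the same directory as this R workbook.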
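
As a rough sketch of what such a file could contain, the chunk below rebuilds the table in a long layout and writes it to a .csv file. The column names Experience, Satisfaction and Frequency are taken from the ggplot() calls in this workbook. The file name CustomerSurvey_long.csv and the use of a temporary directory are assumptions made here so that the downloaded CustomerSurvey.csv is not overwritten. The actual file on Moodle may be laid out differently.

```r
# Wide table as printed above: one row per experience area.
wide <- data.frame(
  Experience   = c("Support", "Pricing", "Contracts", "Coverage", "Rewards"),
  Satisfied    = c(551, 684, 329, 848, 215),
  Dissatisfied = c(449, 316, 671, 152, 785),
  stringsAsFactors = FALSE
)

# Long layout: one row per (Experience, Satisfaction) pair.
survey_long <- data.frame(
  Experience   = rep(wide$Experience, times = 2),
  Satisfaction = rep(c("Satisfied", "Dissatisfied"), each = nrow(wide)),
  Frequency    = c(wide$Satisfied, wide$Dissatisfied),
  stringsAsFactors = FALSE
)

# Write to a temporary file (hypothetical name) and read it back as a check.
csv_path <- file.path(tempdir(), "CustomerSurvey_long.csv")
write.csv(survey_long, csv_path, row.names = FALSE)
check <- read.csv(csv_path)
check
```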

We now import the data from the .csv file into a data frame, which we call Survey. Run the chunk below as usual and, when the pop-up window appears, select the file CustomerSurvey.csv and press Open.

Survey <- read.csv(file.choose())
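
The file.choose() dialog needs someone to pick the file each time the chunk runs. A possible alternative, sketched below, reads the file by name. It assumes CustomerSurvey.csv sits in the current working directory; survey_path is a hypothetical variable name, and the check with file.exists() is only there so the chunk reports a missing file instead of stopping with an error.

```r
# Assumes CustomerSurvey.csv is in the current working directory.
survey_path <- "CustomerSurvey.csv"

if (file.exists(survey_path)) {
  Survey <- read.csv(survey_path)
  str(Survey)  # inspect the column names and types
} else {
  message("File not found in ", getwd(), ": ", survey_path)
}
```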

We display this data frame by running the following chunk

Survey

We now plot this data frame using the functions ggplot(), geom_bar() and scale_fill_manual():

ggplot(data=Survey, aes(x=Experience, y=Frequency)) +   
  geom_bar(aes(fill = Satisfaction), position = "dodge", stat = "identity")+
  scale_fill_manual(values=c("blue", "red"))
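
It may help to know which colour ends up on which category. According to the ggplot2 documentation, scale_fill_manual() matches an unnamed values vector to the categories in order (usually alphabetical), so "blue" would go to Dissatisfied and "red" to Satisfied. The base R sketch below works this out, assuming the Satisfaction column holds exactly the labels Satisfied and Dissatisfied; colour_map is a hypothetical name.

```r
# Assumed labels in the Satisfaction column of Survey.
satisfaction <- c("Satisfied", "Dissatisfied")

# factor() sorts the distinct labels alphabetically.
fill_levels <- levels(factor(satisfaction))

# Pair each label with the colour in the same position.
colour_map <- setNames(c("blue", "red"), fill_levels)
colour_map
```

Passing a named vector such as values = c(Satisfied = "blue", Dissatisfied = "red") to scale_fill_manual() makes the assignment explicit instead of relying on this ordering.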

Ordering the data

  • We saw during lectures that rearranging the order of the clusters may improve the overall appearance of the bar chart, and in particular it may make the content of the data more apparent.

  • We will organise our clusters so they are arranged from highest to lowest levels of customer satisfaction. This means the clusters should be arranged as

    1. Coverage
    2. Pricing
    3. Support
    4. Contracts
    5. Rewards

This is done in R in the chunk below:

Survey$Experience <- factor(Survey$Experience, levels = c('Coverage', 'Pricing', 'Support', 'Contracts', 'Rewards'))
Survey
ggplot(data=Survey, aes(x=Experience,y=Frequency)) +   
  geom_bar(aes(fill = Satisfaction), position = "dodge", stat = "identity")+
  scale_fill_manual(values=c("blue", "red"))
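
The level order can also be derived from the data rather than typed by hand. The following is a minimal sketch, assuming Survey has the columns Experience, Satisfaction and Frequency used above and that the satisfied rows are labelled Satisfied. The data frame is rebuilt here only to keep the chunk self-contained; ordered_levels is a hypothetical name.

```r
# Stand-in for the imported data, using the figures from the table above.
Survey <- data.frame(
  Experience   = rep(c("Support", "Pricing", "Contracts", "Coverage", "Rewards"), times = 2),
  Satisfaction = rep(c("Satisfied", "Dissatisfied"), each = 5),
  Frequency    = c(551, 684, 329, 848, 215, 449, 316, 671, 152, 785),
  stringsAsFactors = FALSE
)

# Keep only the satisfied counts and sort the areas from highest to lowest.
satisfied <- Survey[Survey$Satisfaction == "Satisfied", ]
ordered_levels <- satisfied$Experience[order(satisfied$Frequency, decreasing = TRUE)]

# Use that order as the factor levels, as in the chunk above.
Survey$Experience <- factor(Survey$Experience, levels = ordered_levels)
levels(Survey$Experience)
```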

  • It is clear, at least in this particular example, that re-ordering the clusters has improved the presentation of the data, and in particular it should be more apparent that levels of customer satisfaction are slightly higher than levels of customer dissatisfaction.
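
A quick arithmetic check of this statement, using the figures from the table above (the vector names are chosen here for illustration):

```r
satisfied    <- c(Support = 551, Pricing = 684, Contracts = 329, Coverage = 848, Rewards = 215)
dissatisfied <- c(Support = 449, Pricing = 316, Contracts = 671, Coverage = 152, Rewards = 785)

# Overall totals across the five areas.
sum(satisfied)     # 2627
sum(dissatisfied)  # 2373

# Areas where satisfied responses outnumber dissatisfied ones.
more_satisfied <- names(satisfied)[satisfied > dissatisfied]
more_satisfied
```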

Colouring

  • So far, the bars representing satisfaction/dissatisfaction are coloured in a similar way. This can make it difficult to discern which bar represents which sub-category.

  • A list of colours in R can be found at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

  • To highlight the difference between satisfaction/dissatisfaction we could choose different colours that bear no resemblance to one another.

  • In addition, colours that occur in nature are found to be more appealing to the viewer. With this in mind, we make the same plot with alternative colours:

ggplot(data=Survey, aes(x=Experience,y=Frequency)) +   
  geom_bar(aes(fill = Satisfaction), position = "dodge", stat = "identity")+
  scale_fill_manual(values=c("cadetblue4", "goldenrod"))
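
Before plotting, it can be useful to confirm that a colour name is one R recognises. A small sketch using the base R functions colors() and col2rgb(); chosen is a hypothetical variable name.

```r
# Colour names used in the plot above.
chosen <- c("cadetblue4", "goldenrod")

# TRUE for each name that appears in R's built-in colour list.
chosen %in% colors()

# Red, green and blue components of each colour (0 to 255).
col2rgb(chosen)
```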

Exercise 1

A computer retailer collected data on laptop sales and organised it according to make and chip type, with the following data obtained:

Make     Intel i3   Intel i5   Intel i7
Apple    5          25         31
Dell     15         21         16
HP       21         28         26
Lenovo   18         32         31

Given this data, answer the following:

  1. Identify the data type given.

  2. Create a .csv file to tabulate this data.

  3. Create two clustered bar plots from this data file, using different colours in each plot.

  4. Reorder these plots in order of decreasing Intel i3 sales.

  5. Which of these plots conveys the data in the clearest way?

Exercise 2

A company survey asked a sample of 500 employees a series of satisfied/dissatisfied questions in relation to their work in the following areas:

  1. Work/Life Balance
  2. Remuneration
  3. Career Opportunities
  4. Job Satisfaction
  5. Vacation Time
  6. Up-skilling Opportunities

with the following data obtained:

Work Feature                Satisfied   Dissatisfied
Work/Life Balance           398         102
Remuneration                302         198
Career Opportunities        274         226
Job Satisfaction            405         95
Vacation Time               277         233
Up-skilling Opportunities   321         179

Using this data answer the following

  1. Create a .csv file for this data and import it into this workbook

  2. Create a clustered bar-plot for this data

  3. Identify the data type given

  4. Is there any trend obvious from the chart?

Stacked Bar Charts

A stacked bar-chart is another type of bar-chart where we compare data within a given class, and in particular it illustrates how the overall frequency in a given class is decomposed into further sub-categories.

Example 2

A Toyota dealership is taking an inventory of all models present on the lot, and organises its data according to model and age. The ages of the cars are categorised as pre-2008 or post-2008, with the data given as follows:

Model     Pre-2008   Post-2008
Auris     8          15
Avensis   11         21
Camry     4          2
Corolla   23         18
Prius     1          4
Yaris     4          12
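
The plotting code in this example refers to the columns Model, Inventory and Year, which suggests one row per model and age category. The sketch below builds such a data frame from the table above. The labels Pre-2008 and Post-2008 in the Year column, and the names wide and cars_long, are assumptions; the actual CarInventory.csv may use different labels.

```r
# Wide table as printed above: one row per model.
wide <- data.frame(
  Model    = c("Auris", "Avensis", "Camry", "Corolla", "Prius", "Yaris"),
  Pre2008  = c(8, 11, 4, 23, 1, 4),
  Post2008 = c(15, 21, 2, 18, 4, 12),
  stringsAsFactors = FALSE
)

# Long layout: one row per (Model, Year) pair.
cars_long <- data.frame(
  Model     = rep(wide$Model, times = 2),
  Year      = rep(c("Pre-2008", "Post-2008"), each = nrow(wide)),
  Inventory = c(wide$Pre2008, wide$Post2008),
  stringsAsFactors = FALSE
)
cars_long

# Alphabetical level order of the assumed Year labels.
year_levels <- levels(factor(cars_long$Year))
year_levels
```

With these labels, Post-2008 sorts before Pre-2008, so the first colour passed to scale_fill_manual() would be matched to Post-2008.
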
  1. Import the data from CarInventory.csv available in Workbook Files on Moodle and display the data.

Cars <- read.csv(file.choose())

We display the data frame Cars to ensure all is correct

Cars

  2. Plot the data with high-contrast colours using ggplot.

ggplot(Cars, aes(x=Model,y=Inventory, fill=Year))+
  geom_bar(stat="identity")+
  scale_fill_manual(values=c("dodgerblue","goldenrod1"))

  3. Reorder the plots in order of decreasing Pre-2008 inventory.

Cars$Model <- factor(Cars$Model, levels = c('Corolla', 'Avensis', 'Auris', 'Camry', 'Yaris', 'Prius'))
Cars

  4. Plot the reordered data using the same colouring.

ggplot(Cars, aes(x=Model,y=Inventory, fill=Year))+
  geom_bar(stat="identity")+
  scale_fill_manual(values=c("dodgerblue","goldenrod1"))

  5. Re-plot the ordered bar-chart using different colours. Which of the bar-charts is easier to interpret?

ggplot(Cars, aes(x=Model,y=Inventory, fill=Year))+
  geom_bar(stat="identity")+
  scale_fill_manual(values=c("slategrey","skyblue"))
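
In a stacked bar chart the full height of each bar is the sum of its segments, so the bar heights can be checked against the totals per model. A minimal base R sketch, with Cars rebuilt from the table above so the chunk is self-contained; the Year labels are assumed, and totals and by_total are hypothetical names.

```r
# Stand-in for the imported data, using the figures from the table above.
Cars <- data.frame(
  Model     = rep(c("Auris", "Avensis", "Camry", "Corolla", "Prius", "Yaris"), times = 2),
  Year      = rep(c("Pre-2008", "Post-2008"), each = 6),
  Inventory = c(8, 11, 4, 23, 1, 4, 15, 21, 2, 18, 4, 12),
  stringsAsFactors = FALSE
)

# Total inventory per model: the height of each stacked bar.
totals <- sapply(split(Cars$Inventory, Cars$Model), sum)
totals

# Models from largest to smallest total, an alternative ordering for the bars.
by_total <- names(sort(totals, decreasing = TRUE))
by_total
```

Using by_total as the levels argument of factor() would order the bars by overall inventory rather than by Pre-2008 inventory.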

Exercise 3

A computer retailer summarises its quarterly sales by computer make and sales point, with the following data collected:

Make      Online Sales   Store Sales   Corporate Sales
Apple     210            155           53
Asus      335            278           55
Fujitsu   188            205           75
HP        336            451           125
Lenovo    225            321           144

Using this data answer the following:

  1. Identify the data type given
  2. Construct a .csv file to store this data
  3. Generate a stacked bar chart to represent this data
  4. Re-plot the data in order of decreasing Online Sales
  5. Use a different set of colours to re-plot the ordered bar-chart

Exercise 4

Using the data given in Exercise 1, generate ordered and unordered stacked bar charts to represent this data set.

Exercise 5

Using the data given in Exercise 2, generate ordered and unordered clustered bar charts to represent this data.

Exercise 6

A food producer collects data on its global revenue from one year of sales. It categorises the regions of sales activity as

  1. North America
  2. Central & South America
  3. Europe, Middle East and Africa (EMEA)
  4. Asia
  5. Australasia

It categorises its food products according to

  1. Wheat and Dairy
  2. Beverages
  3. Confectionery
  4. Weight Reduction

with the following sales data in millions of euro

Region                    Wheat and Dairy   Beverages   Confectionery   Weight Reduction
North America             51                152         95              125
Central & South America   71                32          122             75
EMEA                      241               111         84              119
Asia                      188               94          88              92
Australasia               44                29          74              57

Using this data set answer the following:

  1. Identify the data types given

  2. Generate a .csv file to store this data

  3. Generate a clustered barplot to represent this data

  4. Generate a stacked barplot to represent this data

  5. Which of the bar charts represents this data better?

  6. Re-plot these charts with a different ordering and with different colouring, of your own choice.

