List of R colors:

http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Stem and Leaf Plots

Given a data set, a stem and leaf plot is a very effective way of visually representing the data directly.
The shape of the plot may indicate whether the data set is skewed-left, skewed-right or centered.
The appearance of tails in the plot may also indicate the presence of outliers in the data set, located in the tail region.
In R we can generate a stem and leaf plot for a data set using the stem() function.

Example 1

The distances traveled by a sales representative over the course of a month are given by the following data set:

Daily Distance Travelled (km)

91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93

Use stem() to create a stem and leaf plot for this data.

Solution

Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93)
stem(Distances, scale = 2)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    5 | 899
##    6 | 2337
##    7 | 1378
##    8 | 455789
##    9 | 1367
##   10 | 023
##   11 | 
##   12 | 
##   13 | 4

The message

The decimal point is 1 digit(s) to the right of the |

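indicates that each leaf (a digit to the right of the |) is a units digit, while each stem (the number to the left of the |) counts tens. For example, the row 5 | 899 represents the values 58, 59 and 59.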

From the shape of the plot it appears that the data is skewed right, with a possible outlier at 134km.
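To see where each row of the plot comes from, the stems and leaves can be rebuilt by hand. The following is only a sketch of the idea and not a description of how stem() works internally: integer division by 10 gives the stem and the remainder gives the leaf. The names sorted_distances, stems, leaves and leaves_by_stem are our own choices.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

sorted_distances <- sort(Distances)        # leaves are listed in increasing order
stems  <- sorted_distances %/% 10          # tens (and hundreds) part of each value
leaves <- sorted_distances %% 10           # units digit of each value

# Paste together the leaves that share a stem
leaves_by_stem <- tapply(leaves, stems, paste, collapse = "")
leaves_by_stem
```

Unlike stem(), this sketch lists only the stems that actually occur, so the empty rows for 11 and 12 do not appear.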

Quartiles and the Median

In lectures we used Tukey's method to determine whether a data set has mild outliers, using the first and third quartiles $Q_{1}$ and $Q_{3}$ and the interquartile range $R_{IQ}$.
In R we can use the function quantile() to determine the quartiles and median of a data set automatically.

Example 2

Using the data given in Example 1, use the quantile() function to determine the quartiles and median of the data set Distances.

Solution

quantile(Distances)

##   0%  25%  50%  75% 100% 
##   58   67   85   93  134

We see that in the data set Distances the quartiles and median values are \[ Q_{1} = 67,\quad M = 85,\quad Q_{3} = 93. \] This means that
- About one quarter of the distances are less than 67km
- About one quarter of the distances are greater than 93km
- About one half of the distances are less than 85km. Equivalently, about one half of the distances are greater than 85km.
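If only some of the quantiles are needed, quantile() accepts a probs argument, and median() returns the median directly. A brief sketch follows; the name q is our own choice.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

# Request only the quartiles and the median
q <- quantile(Distances, probs = c(0.25, 0.5, 0.75))
q

q[["25%"]]         # first quartile Q1
q[["75%"]]         # third quartile Q3
median(Distances)  # same value as q[["50%"]]
```

Note that quantile() supports several definitions of a sample quantile through its type argument, so for other data sets the default result may differ slightly from a calculation done by hand.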

Interquartile Range

The R function IQR() also allows us to obtain the interquartile range from a data set.

Example 3

Use the IQR() function to obtain the interquartile range of the data set Distances

Solution

IQR(Distances)

## [1] 26

This agrees with what we saw in lectures, where the interquartile range was defined as \[ R_{IQ}=Q_3-Q_1 \]
Using the quantiles found in Example 2 we have \[ Q_1=67,\quad Q_3=93,\quad R_{IQ}=93-67=26. \]
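The same subtraction can be sketched in R by taking the quartiles from quantile() and comparing the result with IQR(). The names Q1, Q3 and R_IQ are ours, chosen to mirror the notation used in lectures.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

Q1 <- quantile(Distances, 0.25, names = FALSE)  # first quartile
Q3 <- quantile(Distances, 0.75, names = FALSE)  # third quartile
R_IQ <- Q3 - Q1                                 # interquartile range by hand
R_IQ

# Check that the manual value agrees with the built-in function
all.equal(R_IQ, IQR(Distances))
```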

Outliers

In lectures we introduced Tukey’s Criteria to determine if a data set had mild or extreme outliers.

Mild Outliers

A data value $x$ is a mild outlier if \[ x< Q_1-1.5R_{IQ} \] or if \[ x> Q_3+1.5R_{IQ}. \]

We can test for these mild outliers using the boxplot.stats() function, whose $out component lists the outlying values.

Example 4

Use boxplot.stats()$out to identify any outliers in the data set Distances.

Solution

boxplot.stats(Distances)$out

## [1] 134
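Tukey's criteria for mild outliers can also be applied by hand, which makes the two fences explicit. This is a sketch only; the names lower_fence, upper_fence and mild_outliers are our own.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

Q1 <- quantile(Distances, 0.25, names = FALSE)
Q3 <- quantile(Distances, 0.75, names = FALSE)
R_IQ <- Q3 - Q1

lower_fence <- Q1 - 1.5 * R_IQ   # values below this are mild outliers
upper_fence <- Q3 + 1.5 * R_IQ   # values above this are mild outliers
c(lower_fence, upper_fence)

# Keep only the values lying outside the fences
mild_outliers <- Distances[Distances < lower_fence | Distances > upper_fence]
mild_outliers
```

For this data set the result agrees with boxplot.stats(). In general boxplot.stats() computes the quartiles as hinges (via fivenum()) rather than with the default method of quantile(), so for some sample sizes the two approaches can give slightly different fences.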

——————————————————————————————————————————————

Boxplots

Having found the median, the quartiles and the interquartile range, we can now use all of these parameters to construct a box plot with the boxplot() function:

boxplot(Distances, range = 1.5, col = "darkseagreen", horizontal = TRUE, plot = TRUE, xlab = "Daily Distances (km)", main = "A boxplot of daily distance traveled")

* In this example the range=1.5 argument in the function sets the distance between the quartiles and the fences to be $1.5R_{IQ}$.
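The numbers behind the drawing can be inspected by calling boxplot() with plot = FALSE, which returns the summary without producing a chart. A short sketch, where the name b is our own:

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

# Compute the box plot summary without drawing it
b <- boxplot(Distances, range = 1.5, plot = FALSE)

b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # points drawn individually as outliers
```

Note that the whiskers end at the most extreme data values lying inside the fences, not at the fences themselves, so the upper whisker stops at 103 rather than at the upper fence.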

Exercise 1

The 2016 GDP of the top 50 economies of the world is given in the data set GDP(50)2016.csv, which was sourced from the World Bank, and is available on Moodle.
The data may be imported into the data structure which we choose to call Data2016 using the following command

Data2016<-read.csv('GDP(50)2016.csv')

Data2016

##                 Country GDP..Millions.US...
## 1         United States            18569100
## 2                 China            11199145
## 3                 Japan             4939384
## 4               Germany             3466757
## 5        United Kingdom             2618886
## 6                France             2465454
## 7                 India             2263523
## 8                 Italy             1849970
## 9                Brazil             1796187
## 10               Canada             1529760
## 11          Korea, Rep.             1411246
## 12   Russian Federation             1283162
## 13                Spain             1232088
## 14            Australia             1204616
## 15               Mexico             1045998
## 16            Indonesia              932259
## 17               Turkey              857749
## 18          Netherlands              770845
## 19          Switzerland              659827
## 20         Saudi Arabia              646438
## 21            Argentina              545866
## 22               Sweden              511000
## 23               Poland              469509
## 24              Belgium              466366
## 25             Thailand              406840
## 26              Nigeria              405083
## 27   Iran, Islamic Rep.              393436
## 28              Austria              386428
## 29               Norway              370557
## 30 United Arab Emirates              348743
## 31     Egypt, Arab Rep.              336297
## 32 Hong Kong SAR, China              320912
## 33               Israel              318744
## 34              Denmark              306143
## 35          Philippines              304905
## 36            Singapore              296966
## 37             Malaysia              296359
## 38         South Africa              294841
## 39              Ireland              294054
## 40             Pakistan              283660
## 41             Colombia              282463
## 42                Chile              247028
## 43              Finland              236785
## 44           Bangladesh              221415
## 45             Portugal              204565
## 46              Vietnam              202616
## 47               Greece              194559
## 48       Czech Republic              192925
## 49                 Peru              192094

To extract the numerical data from this set we use Data2016$GDP..Millions.US... and store the result in a variable which we choose to call GDPData

GDPData<-Data2016$GDP..Millions.US...
GDPData

##  [1] 18569100 11199145  4939384  3466757  2618886  2465454  2263523
##  [8]  1849970  1796187  1529760  1411246  1283162  1232088  1204616
## [15]  1045998   932259   857749   770845   659827   646438   545866
## [22]   511000   469509   466366   406840   405083   393436   386428
## [29]   370557   348743   336297   320912   318744   306143   304905
## [36]   296966   296359   294841   294054   283660   282463   247028
## [43]   236785   221415   204565   202616   194559   192925   192094
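The unusual column name arises because read.csv() by default converts the header text into syntactically valid R names, replacing spaces and symbols with dots. The sketch below assumes, purely for illustration, that the header in the file reads GDP (Millions US $); the actual header in the Moodle file is not shown in this workbook. The three rows are copied from the table above, and the names csv_text, Demo and Raw are our own.

```r
# A small stand-in for the CSV file (header text is an assumption)
csv_text <- 'Country,"GDP (Millions US $)"
United States,18569100
China,11199145
Japan,4939384'

# Default behaviour: the header is converted to a valid R name
Demo <- read.csv(text = csv_text)
names(Demo)

# With check.names = FALSE the header is kept exactly as written
Raw <- read.csv(text = csv_text, check.names = FALSE)
names(Raw)

# A name containing spaces must then be accessed with [[ ]] or backticks
Raw[["GDP (Millions US $)"]]
```

If in doubt about what a column has been called after import, names(Data2016) lists the names that R is actually using.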

Using the data given in this table, answer the following:

Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles ($Q_1$ and $Q_3$)
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set

——————————————————————————————————————————————

Box Plots with ggplot2

Exercise 2

The net Foreign Direct Investment (FDI) into Ireland between 2005 and 2017 is given in the data file IrelandFDI, obtained from the World Bank, and available on Moodle.
Extract the numerical data from this data set as in Exercise 1 and use ggplot2 to draw a box plot to represent this data set.

Solution:

To begin, we must load the ggplot2 package using

library(ggplot2)

Next we import the data in the usual way:

FDI<-read.csv('IrelandFDI.csv')

FDI

##    Year  Net.FDI.US.
## 1  2005  44824253570
## 2  2006  20816925829
## 3  2007  -4014366973
## 4  2008  34905390728
## 5  2009    545425984
## 6  2010 -20931725589
## 7  2011 -25201352356
## 8  2012 -19621467336
## 9  2013 -12371927262
## 10 2015  11685685364
## 11 2016 -20249872618
## 12 2017  23024931240

and we extract the relevant data in the usual way

FDIData<-FDI$Net.FDI.US.
FDIData

##  [1]  44824253570  20816925829  -4014366973  34905390728    545425984
##  [6] -20931725589 -25201352356 -19621467336 -12371927262  11685685364
## [11] -20249872618  23024931240

ggplot(FDI, aes(x = '', y = Net.FDI.US.)) +                                  # Plot the data frame FDI using its column Net.FDI.US. as y
       coord_flip() +                                                        # Flip the chart so the box is horizontal
       stat_boxplot(geom = 'errorbar', width = 0.25, coef = 1.5) +           # Add end caps to the whiskers, using 1.5 as the coefficient for outliers as in Tukey's criteria
       geom_boxplot(stat = 'boxplot', outlier.size = 3, width = 0.25, fill = 'steelblue4') + # Draw the box and select its fill colour
       ylab('FDI ($US)') +                                                   # Set the y label (actually the horizontal label because of coord_flip())
       xlab('') +                                                            # Do not include an x (vertical) label
       ggtitle('A box plot of Foreign Direct Investment into Ireland from 2005 to 2017 in $US') + # Create a main title
       theme(plot.title = element_text(hjust = 0.5))                         # Centre the main title
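Before reading outliers off the chart it can be useful to check for them numerically. The sketch below re-enters the values from the table above, so it does not depend on the CSV file, and rescales them to billions of US dollars, which is our own choice for readability. The names Net_FDI, FDI_billions and fdi_stats are ours.

```r
# Net FDI values copied from the table above (in $US)
Net_FDI <- c(44824253570, 20816925829, -4014366973, 34905390728, 545425984,
             -20931725589, -25201352356, -19621467336, -12371927262,
             11685685364, -20249872618, 23024931240)

FDI_billions <- Net_FDI / 1e9   # easier to read than eleven-digit numbers

# Five-number summary and outliers using Tukey's coefficient of 1.5
fdi_stats <- boxplot.stats(FDI_billions, coef = 1.5)
fdi_stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
fdi_stats$out     # values outside the fences, if any
```

For this data the $out component is empty, so no individual outlier points are expected on the box plot.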

Exercise 3

The CO2 emissions of Ireland (in metric tons per capita) between 1992 and 2017 are given in the data file Ireland Emissions, obtained from the World Bank, and available on Moodle.
Extract the numerical data from this data set as in Exercise 1 and answer the following

Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles ($Q_1$ and $Q_3$)
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set

Exercise 4

The average fuel consumption of Volkswagen Golf cars (measured in liters per 100 km), between the years 1990 - 2017 is given by the following table

Year	Fuel Consumption (L/100km)
1990	9.3
1991	6.7
1992	8.5
1993	8.1
1994	8.0
1995	7.1
1996	8.2
1997	8.3
1998	9.0
1999	6.8
2000	6.4
2001	6.5
2002	6.5
2003	6.4
2004	7.5
2005	6.6
2006	6.6
2007	6.7
2008	6.9
2009	6.2
2010	6.2
2011	6.0
2012	6.5
2013	6.8
2014	6.4
2015	6.4
2016	9.2
2017	7.8

Source: Fuelly.com

Create a .csv file to store this data and import it into this workbook.
Using this data, answer the following:

Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles ($Q_1$ and $Q_3$)
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set
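One possible way to create the .csv file is from within R itself, using write.csv(). The sketch below uses only the first three rows of the table and writes to a temporary folder; the file name GolfFuel.csv and the column names Year and Consumption are our own choices, not part of the exercise.

```r
# First three rows of the table above (the remaining rows follow the same pattern)
Fuel <- data.frame(Year = c(1990, 1991, 1992),
                   Consumption = c(9.3, 6.7, 8.5))

# Write the data frame to a .csv file; row.names = FALSE avoids an extra index column
csv_path <- file.path(tempdir(), "GolfFuel.csv")
write.csv(Fuel, csv_path, row.names = FALSE)

# Import the file and extract the numerical column as in Exercise 1
FuelData <- read.csv(csv_path)
FuelData$Consumption
```

Keeping the units out of the cells and in the column description means that R reads the values as numbers rather than as text.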

——————————————————————————————————————————————

Comparing Data Sets

We can also use box plots to plot more than one data set on the same chart.
This is a very useful way of comparing the median and spread of the two data sets.

Example:

A class of students is divided into two groups for the purposes of sitting an exam. The results of each group are shown in the data file Grades.csv available on Moodle.

This data is imported in the usual way

Grades<-read.csv('Grades.csv')

Grades<-data.frame(Grades)
Grades

##    Group Grade
## 1      1    23
## 2      1    56
## 3      1    43
## 4      1    45
## 5      1    78
## 6      1    61
## 7      1    41
## 8      1    36
## 9      1    51
## 10     1     4
## 11     2    45
## 12     2    47
## 13     2    39
## 14     2    51
## 15     2    52
## 16     2    43
## 17     2    28
## 18     2    40
## 19     2    32
## 20     2    38

Factoring the Data:

We use the command as.factor() to convert the Group column into a factor, so that R treats the group number as a category label rather than as a numerical value:

Grades$Group<- as.factor(Grades$Group)
Grades

##    Group Grade
## 1      1    23
## 2      1    56
## 3      1    43
## 4      1    45
## 5      1    78
## 6      1    61
## 7      1    41
## 8      1    36
## 9      1    51
## 10     1     4
## 11     2    45
## 12     2    47
## 13     2    39
## 14     2    51
## 15     2    52
## 16     2    43
## 17     2    28
## 18     2    40
## 19     2    32
## 20     2    38
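The printed table looks the same before and after factoring, so the change is easier to see by inspecting the column itself. The sketch below rebuilds the data frame from the table above, so that it does not depend on Grades.csv, and then uses the factor to summarise each group; the name group_medians is our own.

```r
# Rebuild the Grades data frame from the table above
Grades <- data.frame(
  Group = rep(1:2, each = 10),
  Grade = c(23, 56, 43, 45, 78, 61, 41, 36, 51, 4,
            45, 47, 39, 51, 52, 43, 28, 40, 32, 38)
)

class(Grades$Group)                       # a numerical type before factoring
Grades$Group <- as.factor(Grades$Group)
class(Grades$Group)                       # now a factor
levels(Grades$Group)                      # the distinct group labels

# Apply median() separately to the grades of each group
group_medians <- tapply(Grades$Grade, Grades$Group, median)
group_medians
```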

Now that the data set is appropriately factorised, we can plot it using ggplot

ggplot(Grades,aes(x=Group,y=Grade,fill=Group))+                  # Plot the data frame Grades with Group on the x-axis and Grade on the y-axis
  geom_boxplot(stat='boxplot')+                                  # Plot the data in the form of a box plot
  coord_flip()+                                                  # Plot the boxes horizontally
  stat_boxplot(geom='errorbar',width=0.25,coef=1.5)+             # Add end caps to the whiskers; coef=1.5 corresponds to the 1.5 appearing in Tukey's criteria for mild outliers
  scale_fill_manual(values=c('darkseagreen','lightskyblue3'))+   # Colour the individual box plots
  theme(legend.position='none')+                                 # Hide the legend: the other options are 'top', 'bottom', 'left', 'right'
  ggtitle('Box plots comparing exam grades of two different class groups')+ # Main title for the box plot
  theme(plot.title=element_text(hjust=0.5))+                     # Centre the title
  xlab('Group Number')+                                          # Label for the x variable (Group), drawn on the vertical axis because of coord_flip()
  ylab('Exam Grade')                                             # Label for the y variable (Grade), drawn on the horizontal axis because of coord_flip()
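For comparison, a similar chart can be sketched with the base R boxplot() function and a formula, where Grade ~ Group means "Grade split by Group". This is an alternative to the ggplot2 approach and not a replacement for it. The data frame is rebuilt from the table above so that the sketch stands alone, and the name b is our own.

```r
# Rebuild the Grades data frame from the table above, with Group as a factor
Grades <- data.frame(
  Group = factor(rep(1:2, each = 10)),
  Grade = c(23, 56, 43, 45, 78, 61, 41, 36, 51, 4,
            45, 47, 39, 51, 52, 43, 28, 40, 32, 38)
)

# One horizontal box per group; the returned list holds the plotted statistics
b <- boxplot(Grade ~ Group, data = Grades, range = 1.5, horizontal = TRUE,
             col = c("darkseagreen", "lightskyblue3"),
             xlab = "Exam Grade", ylab = "Group Number",
             main = "Box plots comparing exam grades of two different class groups")

b$stats[3, ]  # median of each group
b$out         # grades drawn as outliers
b$group       # the group each outlier belongs to
```

Here Group 1 has the higher median and the wider box, and its lowest grade is flagged as an outlier.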

Exercise 5:

The time it takes a computer to execute 10 tasks is recorded in milliseconds, and compared for two computers with clock speeds of 2.4GHz and 4.2GHz:

Clock Speed	Task 1	Task 2	Task 3	Task 4	Task 5	Task 6	Task 7	Task 8	Task 9	Task 10
2.4GHz	1.2ms	2.4ms	3.2ms	3.1ms	7.2ms	7.5ms	8.3ms	8.7ms	9.4ms	9.1ms
4.2GHz	1.1ms	2.4ms	2.9ms	2.7ms	4.3ms	4.9ms	4.9ms	5.1ms	5.2ms	6.1ms

Create an appropriate .csv file to represent this data.
Import this data and factor it appropriately.
Create a chart with two box plots to represent this data.
Using the box plots, which clock speed had the highest median and the largest spread in execution times?
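As a hint for the layout of the file, one option is to follow the pattern of Grades.csv, with one row per measurement and a column identifying the computer. The sketch below shows this layout for the first three tasks only; the column names ClockSpeed and Time_ms and the data frame name Times are our own choices.

```r
# One row per measurement: first three tasks for each computer
Times <- data.frame(
  ClockSpeed = rep(c("2.4GHz", "4.2GHz"), each = 3),
  Time_ms    = c(1.2, 2.4, 3.2,    # 2.4GHz: Task 1 to Task 3
                 1.1, 2.4, 2.9)    # 4.2GHz: Task 1 to Task 3
)

# Treat the clock speed as a category label
Times$ClockSpeed <- as.factor(Times$ClockSpeed)
levels(Times$ClockSpeed)
Times
```

Recording the unit in the column name rather than in each cell (1.2 instead of 1.2ms) lets R read the times as numbers.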

Data Visualisation 2019 - R Workbook 3
