This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

List of R colors:

http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Stem and Leaf Plots

Example 1

The distances traveled by a sales representative over the course of a month are given by the following data set:

Daily Distance Travelled (km)

91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93

Use stem() to create a stem and leaf plot for this data.

Solution

Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93)
stem(Distances, scale = 2)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    5 | 899
##    6 | 2337
##    7 | 1378
##    8 | 455789
##    9 | 1367
##   10 | 023
##   11 | 
##   12 | 
##   13 | 4
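  • The message
The decimal point is 1 digit(s) to the right of the |

indicates that the decimal point sits one digit to the right of the bar. Each leaf is therefore a units digit and each stem counts tens, so the row 5 | 899 represents the values 58, 59 and 59.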

  • From the shape of the plot, the data appear to be skewed right, with a possible outlier at 134 km.
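
To see how the stems and leaves relate to the raw values, the display can be rebuilt by hand. The following is a minimal sketch in base R; the names sorted, stems, leaves and rows are illustrative and do not appear in the original notebook.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

sorted <- sort(Distances)
stems  <- sorted %/% 10   # tens part of each value (the stem)
leaves <- sorted %% 10    # units digit of each value (the leaf)

# Group the leaves by stem; each list element is one row of the plot
rows <- split(leaves, stems)
rows[["5"]]    # 8 9 9, i.e. the values 58, 59, 59
rows[["13"]]   # 4, i.e. the value 134
```

Unlike stem(), this sketch produces no entries for the empty stems 11 and 12, so the gap before 134 is less visible.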

Quartiles and the Median

Example 2

Using the data given in Example 1, use the quantile() function to determine the quartiles and median of the data set Distances.

Solution

quantile(Distances)  
##   0%  25%  50%  75% 100% 
##   58   67   85   93  134
  • We see that in the data set Distances the quartiles and median values are \[ Q_{1} = 67\ \ Q_{3} = 93\ \ M= 85 \] This means that
    • One quarter of the distances are less than 67km
    • One quarter of the distances are greater than 93km
    • One half of the distances are less than 85km. Equivalently, one half of the distances are greater than 85km.
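
These fractions are approximate for a small data set with repeated values. As a rough check, assuming the default method of quantile(), the actual proportions can be computed directly (the name q is illustrative):

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

q <- quantile(Distances)        # default method (type 7)

# mean() of a logical vector gives the proportion of TRUE values
below_q1 <- mean(Distances < q["25%"])   # 0.24
above_q3 <- mean(Distances > q["75%"])   # 0.24
below_m  <- mean(Distances < q["50%"])   # 0.48
above_m  <- mean(Distances > q["50%"])   # 0.44
```

The proportions are close to, but not exactly, one quarter and one half because the data set has 25 values and the value 85 occurs twice.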

Interquartile Range

Example 3

Use the IQR() function to obtain the interquartile range of the data set Distances

Solution

IQR(Distances)
## [1] 26
  • This agrees with what we saw in lectures, where the interquartile range was defined as \[ R_{IQ}=Q_3-Q_1 \]

  • Using the quantiles found in Example 2 we have \[ Q_1=67\ Q_3=93 \ \ R_{IQ}=93-67=26 \]
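
The same subtraction can be carried out in R. This is a small sketch that assumes the default quantile() method; the names q1, q3 and r_iq are illustrative.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

q1 <- quantile(Distances, 0.25, names = FALSE)   # 67
q3 <- quantile(Distances, 0.75, names = FALSE)   # 93

r_iq <- q3 - q1                                  # 26
all.equal(r_iq, IQR(Distances))                  # TRUE
```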

Outliers

Mild Outliers

A data value \(x\) is a mild outlier if \[ x< Q_1-1.5R_{IQ} \] or if \[ x> Q_3+1.5R_{IQ}. \]

  • We can test for these mild outliers using boxplot.stats()$out.

Example 4

Use boxplot.stats()$out to identify any outliers in the data set Distances.

Solution

boxplot.stats(Distances)$out
## [1] 134
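
This result can be checked against the definition of a mild outlier by computing the two fences explicitly. The sketch below uses quantile() for the quartiles; boxplot.stats() itself works from the hinges returned by fivenum(), which coincide with the quartiles for this data set but can differ slightly for other sample sizes. The names lower_fence, upper_fence and mild are illustrative.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

q1   <- quantile(Distances, 0.25, names = FALSE)   # 67
q3   <- quantile(Distances, 0.75, names = FALSE)   # 93
r_iq <- q3 - q1                                    # 26

lower_fence <- q1 - 1.5 * r_iq                     # 28
upper_fence <- q3 + 1.5 * r_iq                     # 132

# Values outside the fences are the mild outliers
mild <- Distances[Distances < lower_fence | Distances > upper_fence]
mild                                               # 134
```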

——————————————————————————————————————————————


Boxplots

Having found the median, the quartiles and the interquartile range, we can now use all of these parameters to construct a boxplot:

boxplot(Distances, range = 1.5, col = "darkseagreen", horizontal = TRUE, plot = TRUE, xlab = "Daily Distances (km)", main = "A boxplot of daily distance traveled")

  • In this example the range=1.5 argument in the function sets the distance between the quartiles and the fences to be \(1.5R_{IQ}\). The whiskers are drawn only as far as the most extreme data values that lie inside the fences.
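
The numbers behind the plot can be inspected without drawing it by setting plot = FALSE. This is a brief sketch; the name b is illustrative.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

# plot = FALSE returns the summary statistics instead of drawing the chart
b <- boxplot(Distances, range = 1.5, plot = FALSE)

b$stats   # lower whisker, Q1, median, Q3, upper whisker: 58, 67, 85, 93, 103
b$out     # points beyond the fences: 134
```

The upper whisker ends at 103, the largest value below the upper fence, rather than at the fence itself.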

Exercise 1

Data2016<-read.csv('GDP(50)2016.csv')
Data2016
##                 Country GDP..Millions.US...
## 1         United States            18569100
## 2                 China            11199145
## 3                 Japan             4939384
## 4               Germany             3466757
## 5        United Kingdom             2618886
## 6                France             2465454
## 7                 India             2263523
## 8                 Italy             1849970
## 9                Brazil             1796187
## 10               Canada             1529760
## 11          Korea, Rep.             1411246
## 12   Russian Federation             1283162
## 13                Spain             1232088
## 14            Australia             1204616
## 15               Mexico             1045998
## 16            Indonesia              932259
## 17               Turkey              857749
## 18          Netherlands              770845
## 19          Switzerland              659827
## 20         Saudi Arabia              646438
## 21            Argentina              545866
## 22               Sweden              511000
## 23               Poland              469509
## 24              Belgium              466366
## 25             Thailand              406840
## 26              Nigeria              405083
## 27   Iran, Islamic Rep.              393436
## 28              Austria              386428
## 29               Norway              370557
## 30 United Arab Emirates              348743
## 31     Egypt, Arab Rep.              336297
## 32 Hong Kong SAR, China              320912
## 33               Israel              318744
## 34              Denmark              306143
## 35          Philippines              304905
## 36            Singapore              296966
## 37             Malaysia              296359
## 38         South Africa              294841
## 39              Ireland              294054
## 40             Pakistan              283660
## 41             Colombia              282463
## 42                Chile              247028
## 43              Finland              236785
## 44           Bangladesh              221415
## 45             Portugal              204565
## 46              Vietnam              202616
## 47               Greece              194559
## 48       Czech Republic              192925
## 49                 Peru              192094
GDPData<-Data2016$GDP..Millions.US...
GDPData
##  [1] 18569100 11199145  4939384  3466757  2618886  2465454  2263523
##  [8]  1849970  1796187  1529760  1411246  1283162  1232088  1204616
## [15]  1045998   932259   857749   770845   659827   646438   545866
## [22]   511000   469509   466366   406840   405083   393436   386428
## [29]   370557   348743   336297   320912   318744   306143   304905
## [36]   296966   296359   294841   294054   283660   282463   247028
## [43]   236785   221415   204565   202616   194559   192925   192094

Using the data given in this table, answer the following:

  1. Identify the data source type

  2. Use a stem and leaf plot to represent the data

  3. From the stem and leaf plot, determine if the data is skewed or centered

  4. Find the median of the data set

  5. Find the first and third quartiles (\(Q_1\) and \(Q_3\))

  6. Determine if the data set has any mild outliers

  7. Identify the fences for the data set

  8. Use a box plot to represent this data set

——————————————————————————————————————————————


Box Plots with ggplot2

Exercise 2

Solution:

  • To begin we must load the ggplot2 package using
library(ggplot2)
  • Next we import the data in the usual way:
FDI<-read.csv('IrelandFDI.csv')
FDI
##    Year  Net.FDI.US.
## 1  2005  44824253570
## 2  2006  20816925829
## 3  2007  -4014366973
## 4  2008  34905390728
## 5  2009    545425984
## 6  2010 -20931725589
## 7  2011 -25201352356
## 8  2012 -19621467336
## 9  2013 -12371927262
## 10 2015  11685685364
## 11 2016 -20249872618
## 12 2017  23024931240

and we extract the relevant data in the usual way

FDIData<-FDI$Net.FDI.US.
FDIData
##  [1]  44824253570  20816925829  -4014366973  34905390728    545425984
##  [6] -20931725589 -25201352356 -19621467336 -12371927262  11685685364
## [11] -20249872618  23024931240
ggplot(FDI, aes(x = "", y = Net.FDI.US.)) +                                   # Plot the data frame FDI using its column Net.FDI.US. as y
       coord_flip() +                                                          # Flip the chart so the box is horizontal
       stat_boxplot(geom = "errorbar", width = 0.25, coef = 1.5) +             # Draw the whiskers, using 1.5 as the coefficient for outliers as in Tukey's criterion
       geom_boxplot(stat = "boxplot", outlier.size = 3, width = 0.25, fill = "steelblue4") + # Draw the box and select its fill colour
       ylab("FDI ($US)") +                                                     # Set the y-label (actually the horizontal label because of coord_flip())
       xlab("") +                                                              # Do not include an x (vertical) label
       ggtitle("A box plot of Foreign Direct Investment into Ireland from 2005 to 2017 in $US") + # Create a main title
       theme(plot.title = element_text(hjust = 0.5))                           # Centre the main title
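
Whether any points should appear beyond the whiskers can be checked with base R. The sketch below assumes that the box is built from the quartiles given by quantile() and applies the same 1.5 coefficient; the names q, r_iq, fences and outliers are illustrative.

```r
FDIData <- c(44824253570, 20816925829, -4014366973, 34905390728, 545425984,
             -20931725589, -25201352356, -19621467336, -12371927262,
             11685685364, -20249872618, 23024931240)

q    <- quantile(FDIData, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
r_iq <- q[2] - q[1]                                       # interquartile range

# Tukey fences with coefficient 1.5
fences <- c(lower = q[1] - 1.5 * r_iq, upper = q[2] + 1.5 * r_iq)

outliers <- FDIData[FDIData < fences["lower"] | FDIData > fences["upper"]]
length(outliers)   # 0: every value lies inside the fences
```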

Exercise 3

  1. Identify the data source type

  2. Use a stem and leaf plot to represent the data

  3. From the stem and leaf plot, determine if the data is skewed or centered

  4. Find the median of the data set

  5. Find the first and third quartiles (\(Q_1\) and \(Q_3\))

  6. Determine if the data set has any mild outliers

  7. Identify the fences for the data set

  8. Use a box plot to represent this data set

Exercise 4

The average fuel consumption of Volkswagen Golf cars (measured in liters per 100 km) for the years 1990 to 2017 is given in the following table:

Year Fuel Consumption (L/100km)
1990 9.3
1991 6.7
1992 8.5
1993 8.1
1994 8.0
1995 7.1
1996 8.2
1997 8.3
1998 9.0
1999 6.8
2000 6.4
2001 6.5
2002 6.5
2003 6.4
2004 7.5
2005 6.6
2006 6.6
2007 6.7
2008 6.9
2009 6.2
2010 6.2
2011 6.0
2012 6.5
2013 6.8
2014 6.4
2015 6.4
2016 9.2
2017 7.8

Source: Fuelly.com

  • Create a .csv file to store this data and import it into this notebook.

  • Using this data, answer the following:

  1. Identify the data source type

  2. Use a stem and leaf plot to represent the data

  3. From the stem and leaf plot, determine if the data is skewed or centered

  4. Find the median of the data set

  5. Find the first and third quartiles (\(Q_1\) and \(Q_3\))

  6. Determine if the data set has any mild outliers

  7. Identify the fences for the data set

  8. Use a box plot to represent this data set
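
One possible way to create the .csv file is from within R itself. The sketch below enters only the first three rows of the table; the file name GolfFuel.csv, the column names Year and Consumption, and the use of a temporary directory are all assumptions made for illustration.

```r
# First three rows of the table only; the remaining years follow the same pattern
Fuel <- data.frame(Year        = c(1990, 1991, 1992),
                   Consumption = c(9.3, 6.7, 8.5))

csv_path <- file.path(tempdir(), "GolfFuel.csv")   # hypothetical file name
write.csv(Fuel, csv_path, row.names = FALSE)       # write without row numbers

FuelIn <- read.csv(csv_path)                       # import in the usual way
FuelIn$Consumption                                 # 9.3 6.7 8.5
```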

——————————————————————————————————————————————

Comparing Data Sets

Example:

A class of students is divided into two groups for the purposes of sitting an exam. The results of each group are shown in the data file Grades.csv available on Moodle.

  • This data is imported in the usual way
Grades<-read.csv('Grades.csv')
Grades<-data.frame(Grades)
Grades
##    Group Grade
## 1      1    23
## 2      1    56
## 3      1    43
## 4      1    45
## 5      1    78
## 6      1    61
## 7      1    41
## 8      1    36
## 9      1    51
## 10     1     4
## 11     2    45
## 12     2    47
## 13     2    39
## 14     2    51
## 15     2    52
## 16     2    43
## 17     2    28
## 18     2    40
## 19     2    32
## 20     2    38

Factoring the Data:

  • We use the command as.factor() to indicate that the label Group in the data set is used to factorise the data set:
Grades$Group<- as.factor(Grades$Group)
Grades
##    Group Grade
## 1      1    23
## 2      1    56
## 3      1    43
## 4      1    45
## 5      1    78
## 6      1    61
## 7      1    41
## 8      1    36
## 9      1    51
## 10     1     4
## 11     2    45
## 12     2    47
## 13     2    39
## 14     2    51
## 15     2    52
## 16     2    43
## 17     2    28
## 18     2    40
## 19     2    32
## 20     2    38
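
The printed data frame looks the same before and after this step; what changes is the class of the Group column. The sketch below rebuilds the data inline rather than reading Grades.csv, so Group starts out as numeric here (read.csv() would typically give integer).

```r
Grades <- data.frame(Group = rep(c(1, 2), each = 10),
                     Grade = c(23, 56, 43, 45, 78, 61, 41, 36, 51, 4,
                               45, 47, 39, 51, 52, 43, 28, 40, 32, 38))

class(Grades$Group)                       # "numeric"
Grades$Group <- as.factor(Grades$Group)   # treat Group as a category label
class(Grades$Group)                       # "factor"
levels(Grades$Group)                      # "1" "2"
```

With Group stored as a factor, plotting functions treat it as two categories rather than as a continuous number.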
  • Now that the data set is appropriately factorised we can plot it using ggplot
ggplot(Grades,aes(x=Group,y=Grade,fill=Group))+                  # Plot the data frame Grades with Group on the x-axis and Grade on the y-axis
  geom_boxplot(stat='boxplot')+                                  # Plot the data in the form of a box plot
  coord_flip()+                                                  # Plot the boxes horizontally
  stat_boxplot(geom='errorbar',width=0.25,coef=1.5)+             # Plot whiskers; coef=1.5 corresponds to the 1.5 appearing in Tukey's criterion for mild outliers
  scale_fill_manual(values=c('darkseagreen','lightskyblue3'))+   # Colour the individual boxplots
  theme(legend.position='none')+                                 # Set the position of the legend: options are 'top', 'bottom', 'left', 'right', 'none'
  ggtitle('Box plots comparing exam grades of two different class groups')+ # Main title for the box plot
  theme(plot.title=element_text(hjust=0.5))+                     # Centre the title
  xlab('Group Number')+                                          # Set the x-label (drawn on the vertical axis because of coord_flip())
  ylab('Exam Grade')                                             # Set the y-label (drawn on the horizontal axis because of coord_flip())
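
The comparison suggested by the two boxes can be backed up numerically by computing the median and interquartile range of each group. This is a minimal sketch using tapply() on an inline copy of the data; the names med and spread are illustrative.

```r
Grades <- data.frame(Group = factor(rep(c(1, 2), each = 10)),
                     Grade = c(23, 56, 43, 45, 78, 61, 41, 36, 51, 4,
                               45, 47, 39, 51, 52, 43, 28, 40, 32, 38))

# Apply a summary function to Grade separately for each level of Group
med    <- tapply(Grades$Grade, Grades$Group, median)   # group 1: 44, group 2: 41.5
spread <- tapply(Grades$Grade, Grades$Group, IQR)      # group 1: 17.5, group 2: 8.25
```

Group 1 has the slightly higher median and roughly twice the interquartile range of Group 2.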

Exercise 5:

The time it takes a computer to execute 10 tasks is recorded in milliseconds and compared for two computers with clock speeds of 2.4GHz and 4.2GHz:

Clock Speed Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Task 8 Task 9 Task 10
2.4GHz 1.2ms 2.4ms 3.2ms 3.1ms 7.2ms 7.5ms 8.3ms 8.7ms 9.4ms 9.1ms
4.2GHz 1.1ms 2.4ms 2.9ms 2.7ms 4.3ms 4.9ms 4.9ms 5.1ms 5.2ms 6.1ms
  1. Create an appropriate .csv file to represent this data.

  2. Import this data and factor it appropriately.

  3. Create a chart with two boxplots to represent this data.

  4. Using the boxplot, which clock speed had the highest median and the largest spread in execution times?