This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The distances traveled by a sales representative over the course of a month are given by the following data set:
Daily Distance Travelled (km)
91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93
Use stem() to create a stem and leaf plot for this data.
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59, 78, 85, 96, 85, 134, 71, 77, 62, 87, 93)
stem(Distances, scale = 2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 899
## 6 | 2337
## 7 | 1378
## 8 | 455789
## 9 | 1367
## 10 | 023
## 11 |
## 12 |
## 13 | 4
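The line "The decimal point is 1 digit(s) to the right of the |" indicates that the decimal point lies one digit to the right of the bar, so each leaf is a units digit: for example, 5 | 8 represents 58 km.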
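To see where the stems and leaves come from, the following minimal sketch rebuilds them by hand, assuming a leaf unit of 1 km as in the plot above. The names sorted and leaves are illustrative only.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

sorted <- sort(Distances)

# Integer division by 10 gives the stem (tens and above),
# the remainder after division by 10 gives the leaf (units digit)
leaves <- split(sorted %% 10, sorted %/% 10)

leaves[["5"]]  # the leaves on the stem 5
names(leaves)  # the stems that contain at least one data value
```

Unlike stem(), this sketch does not list empty stems such as 11 and 12.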
In lectures we used the Tukey method to determine whether a data set has mild outliers, using the first and third quartiles \(Q_{1}\) and \(Q_{3}\) and the interquartile range \(R_{IQ}\).
In R we can use the function quantile() to determine the quartiles and median of a data set automatically.
Using the data given in Example 1, use the quantile() function to determine the quartiles and median of the data set Distances.
quantile(Distances)
## 0% 25% 50% 75% 100%
## 58 67 85 93 134
Use the IQR() function to obtain the interquartile range of the data set Distances
IQR(Distances)
## [1] 26
This agrees with what we saw in lectures, where the interquartile range was defined as \[ R_{IQ}=Q_3-Q_1 \]
Using the quantiles found in Example 2 we have \[ Q_1=67, \quad Q_3=93, \quad R_{IQ}=93-67=26. \]
A data value \(x\) is a mild outlier if \[ x< Q_1-1.5R_{IQ} \] or if \[ x> Q_3+1.5R_{IQ}. \]
Use boxplot.stats()$out to identify any outliers in the data set Distances.
boxplot.stats(Distances)$out
## [1] 134
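The same result can be checked directly from the definition of a mild outlier. The sketch below computes the two fences from the quartiles and picks out the values beyond them. The names Q1, Q3, RIQ, lower_fence, upper_fence and outliers are illustrative only.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

Q  <- quantile(Distances, c(0.25, 0.75))
Q1 <- Q[[1]]   # first quartile
Q3 <- Q[[2]]   # third quartile
RIQ <- Q3 - Q1 # interquartile range

lower_fence <- Q1 - 1.5 * RIQ  # 67 - 1.5 * 26 = 28
upper_fence <- Q3 + 1.5 * RIQ  # 93 + 1.5 * 26 = 132

# Values outside the fences are mild outliers
outliers <- Distances[Distances < lower_fence | Distances > upper_fence]
outliers
```

Note that boxplot.stats() works from hinges rather than from the quartiles returned by quantile(). The two agree for this data set, but they can differ slightly for others.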
Having found the median, the quartiles and the interquartile range, we can now use all of these parameters to construct a box plot.
boxplot(Distances, range = 1.5, col = "darkseagreen", horizontal = TRUE, plot = TRUE, xlab = "Daily Distances (km)", main = "A boxplot of daily distance traveled")
* In this example the range=1.5 argument in the function sets the distance between the quartiles and the fences to be \(1.5R_{IQ}\).
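The numbers behind the plot can also be inspected without drawing it. As a small sketch, calling boxplot() with plot = FALSE returns the statistics the plot would be built from. The name bp is illustrative only.

```r
Distances <- c(91, 67, 58, 73, 63, 84, 88, 103, 102, 100, 59, 89, 63, 97, 59,
               78, 85, 96, 85, 134, 71, 77, 62, 87, 93)

# plot = FALSE suppresses the drawing and returns the summary statistics
bp <- boxplot(Distances, range = 1.5, plot = FALSE)

bp$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$out    # points drawn individually as outliers
```

The whiskers stop at the most extreme data values that lie inside the fences, so the upper whisker ends at 103 rather than at the upper fence itself.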
The 2016 GDP of the top 50 economies of the world is given in the data set GDP(50)2016.csv, which was sourced from the World Bank, and is available on Moodle.
The data may be imported into the data structure which we choose to call Data2016 using the following command
Data2016<-read.csv('GDP(50)2016.csv')
Data2016
## Country GDP..Millions.US...
## 1 United States 18569100
## 2 China 11199145
## 3 Japan 4939384
## 4 Germany 3466757
## 5 United Kingdom 2618886
## 6 France 2465454
## 7 India 2263523
## 8 Italy 1849970
## 9 Brazil 1796187
## 10 Canada 1529760
## 11 Korea, Rep. 1411246
## 12 Russian Federation 1283162
## 13 Spain 1232088
## 14 Australia 1204616
## 15 Mexico 1045998
## 16 Indonesia 932259
## 17 Turkey 857749
## 18 Netherlands 770845
## 19 Switzerland 659827
## 20 Saudi Arabia 646438
## 21 Argentina 545866
## 22 Sweden 511000
## 23 Poland 469509
## 24 Belgium 466366
## 25 Thailand 406840
## 26 Nigeria 405083
## 27 Iran, Islamic Rep. 393436
## 28 Austria 386428
## 29 Norway 370557
## 30 United Arab Emirates 348743
## 31 Egypt, Arab Rep. 336297
## 32 Hong Kong SAR, China 320912
## 33 Israel 318744
## 34 Denmark 306143
## 35 Philippines 304905
## 36 Singapore 296966
## 37 Malaysia 296359
## 38 South Africa 294841
## 39 Ireland 294054
## 40 Pakistan 283660
## 41 Colombia 282463
## 42 Chile 247028
## 43 Finland 236785
## 44 Bangladesh 221415
## 45 Portugal 204565
## 46 Vietnam 202616
## 47 Greece 194559
## 48 Czech Republic 192925
## 49 Peru 192094
GDPData<-Data2016$GDP..Millions.US...
GDPData
## [1] 18569100 11199145 4939384 3466757 2618886 2465454 2263523
## [8] 1849970 1796187 1529760 1411246 1283162 1232088 1204616
## [15] 1045998 932259 857749 770845 659827 646438 545866
## [22] 511000 469509 466366 406840 405083 393436 386428
## [29] 370557 348743 336297 320912 318744 306143 304905
## [36] 296966 296359 294841 294054 283660 282463 247028
## [43] 236785 221415 204565 202616 194559 192925 192094
Using the data given in this table, answer the following:
Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles (\(Q_1\) and \(Q_3\))
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set
The net Foreign Direct Investment (FDI) into Ireland between 2005 and 2017 is given in the data file IrelandFDI, obtained from the World Bank, and available on Moodle.
Extract the numerical data from this data set as in Exercise 1, then use ggplot2 to draw a box plot to represent this data set.
library(ggplot2)
FDI<-read.csv('IrelandFDI.csv')
FDI
## Year Net.FDI.US.
## 1 2005 44824253570
## 2 2006 20816925829
## 3 2007 -4014366973
## 4 2008 34905390728
## 5 2009 545425984
## 6 2010 -20931725589
## 7 2011 -25201352356
## 8 2012 -19621467336
## 9 2013 -12371927262
## 10 2015 11685685364
## 11 2016 -20249872618
## 12 2017 23024931240
We extract the relevant data in the usual way:
FDIData<-FDI$Net.FDI.US.
FDIData
## [1] 44824253570 20816925829 -4014366973 34905390728 545425984
## [6] -20931725589 -25201352356 -19621467336 -12371927262 11685685364
## [11] -20249872618 23024931240
ggplot(FDI, aes(x = '', y = Net.FDI.US.)) + # Plot the data frame FDI with the column Net.FDI.US. on the y-axis
  coord_flip() + # Flip the chart so the box is horizontal
  stat_boxplot(geom = 'errorbar', width = 0.25, coef = 1.5) + # Draw the whiskers with end caps, using 1.5 as the coefficient for outliers as in Tukey's criterion
  geom_boxplot(stat = 'boxplot', outlier.size = 3, width = 0.25, fill = 'steelblue4') + # Draw the box and select its fill colour
  ylab('FDI ($US)') + # Set the y-label (actually the horizontal label because of coord_flip())
  xlab('') + # Do not include an x (vertical) label
  ggtitle('A box plot of Foreign Direct Investment into Ireland from 2005 to 2017 in $US') + # Create a main title
  theme(plot.title = element_text(hjust = 0.5)) # Centre the main title
The CO2 Emissions of Ireland (in Metric Tons per Capita) between 1992 and 2017 is given in the data file Ireland Emissions, obtained from the World Bank, and available on Moodle.
Extract the numerical data from this data set as in Exercise 1 and answer the following:
Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles (\(Q_1\) and \(Q_3\))
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set
The average fuel consumption of Volkswagen Golf cars (measured in liters per 100 km) between 1990 and 2017 is given in the following table:
Year | Fuel Consumption (L/100km) |
---|---|
1990 | 9.3 |
1991 | 6.7 |
1992 | 8.5 |
1993 | 8.1 |
1994 | 8.0 |
1995 | 7.1 |
1996 | 8.2 |
1997 | 8.3 |
1998 | 9.0 |
1999 | 6.8 |
2000 | 6.4 |
2001 | 6.5 |
2002 | 6.5 |
2003 | 6.4 |
2004 | 7.5 |
2005 | 6.6 |
2006 | 6.6 |
2007 | 6.7 |
2008 | 6.9 |
2009 | 6.2 |
2010 | 6.2 |
2011 | 6.0 |
2012 | 6.5 |
2013 | 6.8 |
2014 | 6.4 |
2015 | 6.4 |
2016 | 9.2 |
2017 | 7.8 |
Create a .csv file to store this data and import it into this notebook.
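One possible way to do this from within R is sketched below, using only the first three rows of the table. The column names Year and Consumption, the file name GolfFuel.csv and the use of a temporary directory are all assumptions made for illustration. The file could equally be created in a spreadsheet or text editor.

```r
# First three rows of the table only; the remaining years follow the same pattern
Fuel <- data.frame(Year = c(1990, 1991, 1992),
                   Consumption = c(9.3, 6.7, 8.5))

# Hypothetical file name, written to a temporary directory for this sketch
csv_path <- file.path(tempdir(), "GolfFuel.csv")
write.csv(Fuel, csv_path, row.names = FALSE)

# Import the file and extract the numerical column as in the earlier exercises
FuelData <- read.csv(csv_path)
FuelData$Consumption
```

Setting row.names = FALSE stops write.csv() from adding an extra column of row numbers to the file.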
Using this data, answer the following:
Identify the data source type
Use a stem and leaf plot to represent the data
From the stem and leaf plot, determine if the data is skewed or centered
Find the median of the data set
Find the first and third quartiles (\(Q_1\) and \(Q_3\))
Determine if the data set has any mild outliers
Identify the fences for the data set
Use a box plot to represent this data set
We can also use box plots to plot more than one data set on the same chart.
This is a very useful way of comparing the median and spread of the two data sets.
A class of students is divided into two groups for the purposes of sitting an exam. The results of each group are shown in the data file Grades.csv, available on Moodle.
Grades<-read.csv('Grades.csv')
Grades<-data.frame(Grades)
Grades
## Group Grade
## 1 1 23
## 2 1 56
## 3 1 43
## 4 1 45
## 5 1 78
## 6 1 61
## 7 1 41
## 8 1 36
## 9 1 51
## 10 1 4
## 11 2 45
## 12 2 47
## 13 2 39
## 14 2 51
## 15 2 52
## 16 2 43
## 17 2 28
## 18 2 40
## 19 2 32
## 20 2 38
Grades$Group<- as.factor(Grades$Group)
Grades
## Group Grade
## 1 1 23
## 2 1 56
## 3 1 43
## 4 1 45
## 5 1 78
## 6 1 61
## 7 1 41
## 8 1 36
## 9 1 51
## 10 1 4
## 11 2 45
## 12 2 47
## 13 2 39
## 14 2 51
## 15 2 52
## 16 2 43
## 17 2 28
## 18 2 40
## 19 2 32
## 20 2 38
ggplot(Grades,aes(x=Group,y=Grade,fill=Group))+ # Plot the data frame Grades with Group on the x-axis and Grade on the y-axis
geom_boxplot(stat='boxplot')+ # Plot the data in the form of a box plot
coord_flip()+ # Plot the boxes horizontally
stat_boxplot(geom='errorbar',width=0.25,coef=1.5)+ # Add end caps to the whiskers; coef=1.5 corresponds to the 1.5 appearing in Tukey's criterion for mild outliers
scale_fill_manual(values=c('darkseagreen','lightskyblue3'))+ # Colour the individual boxplots
theme(legend.position='none')+ # Set the position of the legend: options are 'top', 'bottom', 'left', 'right', 'none'
ggtitle('Box plots comparing exam grades of two different class groups')+ # Main title for the box plot
theme(plot.title=element_text(hjust=0.5))+ # Centre the title
xlab('Group Number')+ # Label the Group (x) axis, which is drawn vertically because of coord_flip()
ylab('Exam Grade') # Label the Grade (y) axis, which is drawn horizontally because of coord_flip()
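The comparison read from the box plots can be backed up numerically. The sketch below rebuilds the Grades data frame from the values printed above, so that it runs on its own, and then computes the median and interquartile range of each group with tapply(). The names medians and spreads are illustrative only.

```r
# Rebuild the data frame from the printed values of Grades.csv
Grades <- data.frame(
  Group = factor(rep(c(1, 2), each = 10)),
  Grade = c(23, 56, 43, 45, 78, 61, 41, 36, 51, 4,
            45, 47, 39, 51, 52, 43, 28, 40, 32, 38)
)

# Apply a summary function to the grades of each group separately
medians <- tapply(Grades$Grade, Grades$Group, median)
spreads <- tapply(Grades$Grade, Grades$Group, IQR)

medians  # median grade of each group
spreads  # interquartile range of each group
```

Group 1 has the slightly higher median (44 against 41.5) and roughly twice the interquartile range (17.5 against 8.25), which matches the wider box for Group 1 in the chart.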
The time it takes a computer to execute 10 tasks is recorded in milliseconds and compared for two computers with clock speeds of 2.4GHz and 4.2GHz:
Clock Speed | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 |
---|---|---|---|---|---|---|---|---|---|---|
2.4GHz | 1.2ms | 2.4ms | 3.2ms | 3.1ms | 7.2ms | 7.5ms | 8.3ms | 8.7ms | 9.4ms | 9.1ms |
4.2GHz | 1.1ms | 2.4ms | 2.9ms | 2.7ms | 4.3ms | 4.9ms | 4.9ms | 5.1ms | 5.2ms | 6.1ms |
Create an appropriate .csv file to represent this data.
Import this data and factor it appropriately.
Create a chart with two boxplots to represent this data.
Using the boxplots, which clock speed had the highest median and the largest spread in execution times?