In this lesson, you upload the sample airline data set to the InfoSphere® BigInsightsT server, and then you access it as a bigr.frame object, which is a Big R data.frame, and it serves as a proxy for the underlying data set.
Load the Big R package, connect to the InfoSphere BigInsights server, and then confirm that it is connected. Update the following example for your environment settings, then run the code in your R environment.
library(bigr)
## Loading required package: rJava
bigr.connect(host="lath3842",
port=7052, database="default",
user="biadmin", password="password")
is.bigr.connected()
## [1] TRUE
lath3842 is the host name of the node where your InfoSphere BigInsights Console is installed.
biadmin is the name of the InfoSphere BigInsights administrator.
For this tutorial, you use R to extract an airline data set from the Big R package and import it into your distributed file system for further analysis. For typical examples of importing data, see the tutorial on Importing data for analysis.
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
air <- as.bigr.frame(airR)
bigr.persist(air, dataSource="DEL",
dataPath="airline_demo.csv", header=T,
delimiter=",", useMapReduce=F)
## Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2004 2 12 4 633 635 935 930
## 2 2004 2 16 1 2115 2120 2340 2350
## 3 2004 2 18 3 700 700 817 820
## 4 2004 2 19 4 1140 1145 1427 1420
## 5 2004 2 21 6 936 935 1036 1035
## 6 2004 2 24 2 1117 1120 1922 1930
## 7 2004 2 28 6 2029 2030 2209 2214
## 8 2004 2 16 1 1155 1155 1309 1256
## 9 2004 2 20 5 1459 1500 1608 1615
## 10 2004 2 21 6 734 740 911 926
## 11 2004 2 5 4 1521 1530 1837 1838
## 12 2004 2 27 5 1544 1535 1852 1837
## 13 2004 2 16 1 1340 1345 1701 1611
## 14 2004 2 10 2 1740 1740 2028 1953
## 15 2004 2 13 5 1049 1100 1318 1347
## 16 2004 2 17 2 1319 1325 1757 1807
## 17 2004 2 25 3 1827 1700 2231 2019
## 18 2004 2 24 2 754 800 940 928
## 19 2004 2 4 3 1803 1735 2125 1947
## 20 2004 2 27 5 1540 1540 1905 1901
## 21 2004 2 26 4 640 650 1216 1227
## 22 2004 2 5 4 808 815 1636 1628
## 23 2004 2 22 7 1845 1845 2217 2233
## 24 2004 2 13 5 129 130 628 630
## 25 2004 2 13 5 1318 1319 1647 1655
## 26 2004 2 4 3 1721 1725 2015 2029
## 27 2004 2 10 2 1607 1541 2116 2000
## 28 2004 2 28 6 738 740 1020 1038
## 29 2004 2 27 5 1717 1708 1808 1810
## 30 2004 2 4 3 1533 1535 1850 1844
## 31 2004 2 24 2 2000 1910 2303 2213
## 32 2004 2 29 7 856 900 1150 1145
## 33 2004 2 8 7 1047 1050 1429 1447
## 34 2004 2 1 7 1241 1240 1345 1342
## 35 2004 2 3 2 1451 1459 1630 1625
## 36 2004 2 4 3 2219 2100 35 2336
## 37 2004 2 4 3 730 740 917 910
## 38 2004 2 5 4 1203 1210 1447 1402
## 39 2004 2 6 5 1215 1200 1415 1357
## 40 2004 2 9 1 1857 1735 1936 1805
## 41 2004 2 9 1 1708 1710 1822 1829
## 42 2004 2 10 2 1405 1340 1659 1645
## 43 2004 2 11 3 610 615 737 745
## 44 2004 2 15 7 2145 2125 2247 2240
## 45 2004 2 16 1 837 840 953 957
## 46 2004 2 17 2 1447 1500 1718 1655
## 47 2004 2 18 3 837 830 1006 1009
## 48 2004 2 19 4 1104 1055 1358 1352
## 49 2004 2 22 7 1706 1700 1815 1800
## 50 2004 2 22 7 1749 1750 1943 1945
## UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime
## 1 B6 165 N553JB 182 175
## 2 B6 199 N570JB 325 330
## 3 B6 2 N544JB 77 80
## 4 B6 67 N570JB 167 155
## 5 B6 68 N544JB 60 60
## 6 B6 206 N548JB 305 310
## 7 CO 780 N17644 100 104
## 8 CO 683 N36247 74 61
## 9 CO 1707 N35260 129 135
## 10 CO 191 N14325 157 166
## 11 CO 65 N67158 376 368
## 12 CO 1601 N75851 188 182
## 13 CO 1161 N14605 201 146
## 14 CO 1122 N27213 288 253
## 15 CO 1719 N14653 149 167
## 16 CO 482 N24729 218 222
## 17 CO 348 N14102 424 379
## 18 CO 1144 N14601 166 148
## 19 CO 1498 N16339 202 132
## 20 CO 365 N73406 145 141
## 21 CO 354 N35204 216 217
## 22 CO 1049 N36247 328 313
## 23 CO 717 N14840 152 168
## 24 CO 586 N34137 179 180
## 25 CO 1790 N10834 149 156
## 26 CO 1426 N37277 354 364
## 27 CO 632 N14662 249 199
## 28 CO 1581 N12225 342 358
## 29 CO 616 N76254 51 62
## 30 CO 1601 N13113 197 189
## 31 CO 638 N14831 183 183
## 32 CO 600 N12109 174 165
## 33 CO 226 N14230 162 177
## 34 DH 7841 N644BR 64 62
## 35 DH 7827 N313UE 99 86
## 36 DH 7462 N657BR 76 96
## 37 DH 7483 N693BR 167 150
## 38 DH 7627 N703BR 224 172
## 39 DH 7397 N690BR 120 117
## 40 DH 7410 N669BR 99 90
## 41 DH 7765 N307UE 74 79
## 42 DH 7472 N662BR 114 125
## 43 DH 6188 N408FJ 87 90
## 44 DH 7325 N678BR 62 75
## 45 DH 7299 N640BR 76 77
## 46 DH 7389 N697BR 151 115
## 47 DH 6229 N428FJ 89 99
## 48 DH 7588 N705BR 114 117
## 49 DH 7461 N699BR 129 120
## 50 DH 7524 N697BR 54 55
## AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1 162 5 -2 JFK TPA 1005 3 17 0
## 2 114 -10 -5 JFK LAS 2248 8 23 0
## 3 49 -3 0 JFK BUF 301 2 26 0
## 4 141 7 -5 RSW JFK 1074 7 19 0
## 5 41 1 1 JFK SYR 209 3 16 0
## 6 468 -8 -3 LGB JFK 2465 7 10 0
## 7 83 -5 -1 IAH MCI 643 4 13 0
## 8 41 13 0 IAH MSY 305 5 28 0
## 9 108 -7 -1 TPA IAH 787 7 14 0
## 10 135 -15 -6 MIA IAH 964 5 17 0
## 11 326 -1 -9 EWR LAX 2454 13 37 0
## 12 156 15 9 EWR FLL 1065 8 24 0
## 13 142 50 -5 EWR ATL 745 10 49 0
## 14 216 35 0 IAH SFO 1635 9 63 0
## 15 128 -29 -11 TPA EWR 998 7 14 0
## 16 182 -10 -6 IAH BOS 1597 6 30 0
## 17 361 132 87 EWR SFO 2565 7 56 0
## 18 135 12 -6 EWR MDW 711 3 28 0
## 19 98 98 28 IAH OMA 781 4 100 0
## 20 117 4 0 IAH FLL 965 8 20 0
## 21 192 -11 -10 SMF IAH 1609 14 10 0
## 22 296 8 -7 SFO EWR 2565 6 26 0
## 23 131 -16 0 MSY EWR 1167 8 13 0
## 24 157 -2 -1 LAX IAH 1379 6 16 0
## 25 124 -8 -1 IAH MIA 964 4 21 0
## 26 332 -14 -4 EWR SAN 2425 2 20 0
## 27 158 76 26 IAH LGA 1416 7 84 0
## 28 320 -18 -2 EWR SEA 2401 3 19 0
## 29 35 -2 9 SAT IAH 191 4 12 0
## 30 152 6 -2 EWR FLL 1065 4 41 0
## 31 155 50 50 EWR MIA 1086 4 24 0
## 32 147 5 -4 FLL EWR 1065 8 19 0
## 33 140 -18 -3 IAH BWI 1235 6 16 0
## 34 39 3 1 IAD ORF 157 3 22 0
## 35 74 5 -8 HPN IAD 247 3 22 0
## 36 64 59 79 ORD TYS 475 3 9 0
## 37 108 7 -10 CHS ORD 760 28 30 0
## 38 155 45 -7 PWM ORD 900 46 22 0
## 39 90 18 15 ORD TUL 585 8 22 0
## 40 71 91 82 LEX ORD 323 6 22 0
## 41 63 -7 -2 IAD GSO 239 3 7 0
## 42 86 14 25 ORD CAE 666 4 24 0
## 43 69 -8 -5 BOS PHL 280 6 12 0
## 44 50 7 20 IAD CLT 321 3 8 0
## 45 40 -4 -3 IAD EWR 213 14 22 0
## 46 98 23 -13 ICT ORD 588 36 17 0
## 47 79 -3 7 CAE CVG 404 4 5 0
## 48 93 6 9 ORD ALB 723 9 12 0
## 49 82 15 6 TYS ORD 475 20 27 0
## 50 27 -2 -1 ORD FWA 157 3 24 0
## CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 9 0 6
## 13 0 0 0 50
## 14 0 0 0 35
## 15 0 0 0 0
## 16 0 0 0 0
## 17 0 0 25 45
## 18 0 0 0 0
## 19 0 3 0 70
## 20 0 0 0 0
## 21 0 0 0 0
## 22 0 0 0 0
## 23 0 0 0 0
## 24 0 0 0 0
## 25 0 0 0 0
## 26 0 0 0 0
## 27 0 0 0 50
## 28 0 0 0 0
## 29 0 0 0 0
## 30 0 0 0 0
## 31 0 0 0 0
## 32 0 0 0 0
## 33 0 0 0 0
## 34 0 0 0 0
## 35 0 0 0 0
## 36 0 0 0 0
## 37 0 0 0 0
## 38 0 0 0 45
## 39 0 0 0 3
## 40 0 0 0 91
## 41 0 0 0 0
## 42 0 0 0 0
## 43 0 0 0 0
## 44 0 0 0 0
## 45 0 0 0 0
## 46 0 0 0 23
## 47 0 0 0 0
## 48 0 0 0 0
## 49 0 6 0 9
## 50 0 0 0 0
## SecurityDelay LateAircraftDelay
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 62
## 18 0 0
## 19 0 25
## 20 0 0
## 21 0 0
## 22 0 0
## 23 0 0
## 24 0 0
## 25 0 0
## 26 0 0
## 27 0 26
## 28 0 0
## 29 0 0
## 30 0 0
## 31 0 50
## 32 0 0
## 33 0 0
## 34 0 0
## 35 0 0
## 36 0 59
## 37 0 0
## 38 0 0
## 39 0 15
## 40 0 0
## 41 0 0
## 42 0 0
## 43 0 0
## 44 0 0
## 45 0 0
## 46 0 0
## 47 0 0
## 48 0 0
## 49 0 0
## 50 0 0
## ... showing first 50 rows only.
Important: Moving the file to the InfoSphere BigInsights server can take a few minutes.
The parameter useMapReduce by default is true. The sample airline data set is not large, so setting the parameter to false will run the data faster.
air <- bigr.frame(dataSource="DEL",
dataPath="airline_demo.csv", delimiter=",",
header=T, useMapReduce=F)
## Warning: WARN[bigr.frame]: Operations on this bigr.frame will not use
## MapReduce since argument 'useMapReduce' was set to FALSE. This
## configuration is only suitable for small datasets.
After uploading the sample data set, the file exists in the following location in the InfoSphere BigInsights Console: user/bigsql/airline_demo.csv.
In this lesson, you learn how to review the structure of the data set.
colnames(air)
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "CRSDepTime"
## [7] "ArrTime" "CRSArrTime" "UniqueCarrier"
## [10] "FlightNum" "TailNum" "ActualElapsedTime"
## [13] "CRSElapsedTime" "AirTime" "ArrDelay"
## [16] "DepDelay" "Origin" "Dest"
## [19] "Distance" "TaxiIn" "TaxiOut"
## [22] "Cancelled" "CancellationCode" "Diverted"
## [25] "CarrierDelay" "WeatherDelay" "NASDelay"
## [28] "SecurityDelay" "LateAircraftDelay"
coltypes(air)
## [1] "character" "character" "character" "character" "character"
## [6] "character" "character" "character" "character" "character"
## [11] "character" "character" "character" "character" "character"
## [16] "character" "character" "character" "character" "character"
## [21] "character" "character" "character" "character" "character"
## [26] "character" "character" "character" "character"
By default, Big R sets all column types to character. However, for this data set, all column types need to be integers, except for: UniqueCarrier, TailNum, Origin, Dest, and CancellationCode.
coltypes(air) <- ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer")
coltypes(air)
## [1] "integer" "integer" "integer" "integer" "integer"
## [6] "integer" "integer" "integer" "character" "integer"
## [11] "character" "integer" "integer" "integer" "integer"
## [16] "integer" "character" "character" "integer" "integer"
## [21] "integer" "integer" "character" "integer" "integer"
## [26] "integer" "integer" "integer" "integer"
Important: To avoid long processing times, limit the number of observations to display, for example, by using the bigr.setRowLimit function.
nrow(air)
## [1] 128790
ncol(air)
## [1] 29
dim(air)
## [1] 128790 29
str(air)
## 'bigr.frame': 128790 obs. of 29 variables:
## $ Year : int 2004 2004 2004 2004 2004 2004 ...
## $ Month : int 2 2 2 2 2 2 ...
## $ DayofMonth : int 12 16 18 19 21 24 ...
## $ DayOfWeek : int 4 1 3 4 6 2 ...
## $ DepTime : int 633 2115 700 1140 936 1117 ...
## $ CRSDepTime : int 635 2120 700 1145 935 1120 ...
## $ ArrTime : int 935 2340 817 1427 1036 1922 ...
## $ CRSArrTime : int 930 2350 820 1420 1035 1930 ...
## $ UniqueCarrier : chr "B6" "B6" "B6" "B6" "B6" "B6" ...
## $ FlightNum : int 165 199 2 67 68 206 ...
## $ TailNum : chr "N553JB" "N570JB" "N544JB" "N570JB" "N544JB" "N548JB" ...
## $ ActualElapsedTime: int 182 325 77 167 60 305 ...
## $ CRSElapsedTime : int 175 330 80 155 60 310 ...
## $ AirTime : int 162 114 49 141 41 468 ...
## $ ArrDelay : int 5 -10 -3 7 1 -8 ...
## $ DepDelay : int -2 -5 0 -5 1 -3 ...
## $ Origin : chr "JFK" "JFK" "JFK" "RSW" "JFK" "LGB" ...
## $ Dest : chr "TPA" "LAS" "BUF" "JFK" "SYR" "JFK" ...
## $ Distance : int 1005 2248 301 1074 209 2465 ...
## $ TaxiIn : int 3 8 2 7 3 7 ...
## $ TaxiOut : int 17 23 26 19 16 10 ...
## $ Cancelled : int 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 ...
## $ CarrierDelay : int 0 0 0 0 0 0 ...
## $ WeatherDelay : int 0 0 0 0 0 0 ...
## $ NASDelay : int 0 0 0 0 0 0 ...
## $ SecurityDelay : int 0 0 0 0 0 0 ...
## $ LateAircraftDelay: int 0 0 0 0 0 0 ...
head(air, 5)
## Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2004 2 12 4 633 635 935 930
## 2 2004 2 16 1 2115 2120 2340 2350
## 3 2004 2 18 3 700 700 817 820
## 4 2004 2 19 4 1140 1145 1427 1420
## 5 2004 2 21 6 936 935 1036 1035
## UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1 B6 165 N553JB 182 175 162
## 2 B6 199 N570JB 325 330 114
## 3 B6 2 N544JB 77 80 49
## 4 B6 67 N570JB 167 155 141
## 5 B6 68 N544JB 60 60 41
## ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1 5 -2 JFK TPA 1005 3 17 0
## 2 -10 -5 JFK LAS 2248 8 23 0
## 3 -3 0 JFK BUF 301 2 26 0
## 4 7 -5 RSW JFK 1074 7 19 0
## 5 1 1 JFK SYR 209 3 16 0
## CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## SecurityDelay LateAircraftDelay
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
tail(air, 7)
## Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2008 12 11 4 734 735 1010 1022
## 2 2008 12 12 5 1325 1310 1759 1759
## 3 2008 12 12 5 1425 1345 1938 1928
## 4 2008 12 12 5 1119 1101 1419 1405
## 5 2008 12 12 5 2127 2129 2258 2311
## 6 2008 12 13 6 719 710 938 949
## 7 2008 12 13 6 1250 1220 1617 1552
## UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1 DL 1209 N915DN 216 227 200
## 2 DL 662 N3746H 154 169 131
## 3 DL 1032 N697DL 193 223 169
## 4 DL 1410 N913DL 180 184 149
## 5 DL 1792 N969DL 91 102 62
## 6 DL 1491 N935DL 199 219 179
## 7 DL 1621 N938DL 147 152 120
## ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1 -12 -1 MSY SLC 1428 6 10 0
## 2 0 15 COS ATL 1185 8 15 0
## 3 10 40 PHX ATL 1587 11 13 0
## 4 14 18 PBI BOS 1197 6 25 0
## 5 -13 -2 ATL ORF 516 4 25 0
## 6 -11 9 LGA MSY 1183 5 15 0
## 7 25 30 MSP ATL 906 9 18 0
## CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1 0 NA NA NA
## 2 0 NA NA NA
## 3 0 NA NA NA
## 4 0 NA NA NA
## 5 0 NA NA NA
## 6 0 NA NA NA
## 7 0 3 0 0
## SecurityDelay LateAircraftDelay
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## 7 0 22
print(air$UniqueCarrier)
## [1] "B6" "B6" "B6" "B6" "B6" "B6" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [15] "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [29] "CO" "CO" "CO" "CO" "CO" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## [43] "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## ... 128740 more observations.
print(air$UniqueCarrier)
## [1] "B6" "B6" "B6" "B6" "B6" "B6" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [15] "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [29] "CO" "CO" "CO" "CO" "CO" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## [43] "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## ... 128740 more observations.
In this lesson, you analyze flight delay information.
The fictional Sample Outdoor Company wants to partner with an airline that has few delays.
attach(air)
The attach() function makes referencing variables easier. For example, length(air$UniqueCarrier) computes the number of flights. The attach(air) function allows you to drop the air$ prefix. Now length(UniqueCarrier) can compute the number of flights in the air data set.
airSubset <- air[Cancelled == 0 & (DepDelay >= 15 | ArrDelay >= 15),
c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]
nrow(airSubset) / nrow(air)
## [1] 0.227
nrow(airSubset[airSubset$UniqueCarrier == c("HA"),]) /
nrow(air[UniqueCarrier == "HA",])
## [1] 0.07435
cor(air[, c("Distance",
"DepDelay", "ArrDelay", "DepTime",
"ArrTime", "ActualElapsedTime")])
## Distance DepDelay ArrDelay DepTime ArrTime
## Distance 1.000000 0.02884 0.008679 -0.03649 0.03721
## DepDelay 0.028840 1.00000 0.871991 0.16668 0.07343
## ArrDelay 0.008679 0.87199 1.000000 0.15191 0.07216
## DepTime -0.036489 0.16668 0.151907 1.00000 0.73996
## ArrTime 0.037212 0.07343 0.072163 0.73996 1.00000
## ActualElapsedTime 0.968554 0.04201 0.094298 -0.03314 0.04807
## ActualElapsedTime
## Distance 0.96855
## DepDelay 0.04201
## ArrDelay 0.09430
## DepTime -0.03314
## ArrTime 0.04807
## ActualElapsedTime 1.00000
max(air$DepDelay)
## [1] 1435
summary(mean(ArrDelay) + max(ArrDelay) ~ .,
object = air)
## mean(ArrDelay) max(ArrDelay)
## 1 7.108 1016
Because the delay information for the fictional HA airlines is better than others as our data analysis has shown, the fictional HA airline is a strong candidate for the Sample Outdoor Company.
In this lesson, you create visualizations by using the data set to find trends in flight delays.
Install the ggplot2 data visualization package in your R environment with the following code:
#install.packages("ggplot2")
Install the makeR package:
#install.packages("makeR_1.0.2.tar.gz")
library (makeR)
## Loading required package: tools
## Loading required package: XML
##
## Attaching package: 'XML'
##
## The following object is masked from 'package:tools':
##
## toHTML
## Warning: replacing previous import by 'tools::toHTML' when loading 'makeR'
## Auto saving of PROJECT.xml is enabled. This can be changed using the setAutoSave() function.
##
## Auto opening of built and released files is enabled. This can be changed using the setAutoOpen() function.
The fictional Sample Outdoor Company wants to find the times and days with the least amount of delays.
library(ggplot2)
With the ggplot2 library you can create bar charts and other types of plots.
bf <- air
bf$FlightDelayed <- ifelse(bf$DepDelay >= 15, 1, 0)
bf$DepHour <- bf$DepTime / 100
df <- summary(sum(FlightDelayed) ~ DepHour, object = bf)
df <- na.omit(df)
df <- df[order(df[,1]),]
colnames(df) <- c("DepHour", "FlightCount")
deps <- df[,2]
names(deps) <- as.vector(df[,1])
barplot(deps, main="Airline Delays By Hour",
xlab="Hour", ylab="# of Flights Delayed")
The chart shows delays increasing as the day progresses. This is probably due to an increase in flights as the day progresses.
bigr.histogram(air$ArrTime, nbins=24) +
labs(title = "Flight Volume (Arrival) by Hour")
## Loading required package: plyr
When you compare the delay chart with this flight volume chart you see that flights between 09:00-11:00 a.m. have a similar flight frequency with flights between 12:00-06:00 p.m., however the morning flights appear to experience fewer delays.
There seems to be a correlation between the number flights in a day and the amount of delays. If the proportion of delays decreased with respect to the number of flights, finding days with fewer flights might be useful.
library(makeR)
bf <- air[air$Cancelled == 0,]
df <- summary(count(Month) ~ Year + Month + DayofMonth, object = bf)
df$DateStr <- paste(df$Year, df$Month, df$DayofMonth, sep="-")
df2011 <- df[df$Year == 2000 | df$Year == 2001 | df$Year == 2002,]
calendarHeat(df2011$DateStr, df2011[,4] * 100, varname="Flight Volume")
## Loading required package: lattice
## Loading required package: grid
## Loading required package: chron
## Warning: there is no package called 'chron'
In the month of February, Saturdays seem to have fewer flights on average. The Sample Outdoor Company might want to schedule February flights between 09:00-11:00 a.m. on Saturday.
In this lesson, you create a decision-tree model and make predictions.
Install the DMwR package in your R environment with the following code:
#install.packages("DMwR")
Install the rpart package in your R environment with the following code:
#install.packages("rpart")
Important: Make sure that R is installed on each node of the cluster (NameNode and DataNode), and that rpart is installed on each node.
To communicate expected flight delay information to customers, you will create a decision-tree then try to predict flight arrival delays for the fictional HA and UA airlines.
bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]
splits <- bigr.sample(bf, c(0.7, 0.3))
class(splits)
## [1] "list"
train <- splits[[1]]
test <- splits[[2]]
nrow(train) / nrow(bf)
## [1] 0.7053
nrow(test) / nrow(bf)
## [1] 0.2947
When building models, you can use a subset of the data to train the model, and use the remaining data to test and validate your model.
models <- groupApply(data = train,
groupingColumns=list(train$UniqueCarrier),
rfunction = function(df) {
library(rpart)
predcols <- c('ArrDelay', 'DepDelay', 'DepTime',
'CRSArrTime', 'Distance')
m <- rpart(ArrDelay ~ ., df[,predcols])
m
})
## groupApply ran for 37.28secs
print(models)
## bigr.list
## group1 status
## 1 HA OK
## 2 UA OK
The models will predict flight arrival delay using departure delay, departure time, air travel time, and distance as predicting variables.
modelHA <- bigr.pull(models$HA)
print(modelHA)
## n=197 (1 observation deleted due to missingness)
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 197 29740.0 -1.7410
## 2) DepDelay< 5 178 9954.0 -4.3030
## 4) DepDelay< -2.5 128 4665.0 -6.1560
## 8) Distance>=2580 9 1420.0 -14.5600 *
## 9) Distance< 2580 119 2562.0 -5.5210
## 18) DepDelay< -6.5 26 404.2 -8.6150 *
## 19) DepDelay>=-6.5 93 1839.0 -4.6560 *
## 5) DepDelay>=-2.5 50 3724.0 0.4400
## 10) Distance< 2646 43 1188.0 -0.6047 *
## 11) Distance>=2646 7 2201.0 6.8570 *
## 3) DepDelay>=5 19 7674.0 22.2600 *
library(DMwR)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'DMwR'
##
## The following object is masked from 'package:plyr':
##
## join
prettyTree(modelHA)
The decision-tree uses the column DepDelay more than other columns showing that there is a strong relationship between the decision-tree and DepDelay.
#preds <- groupApply(test,
# list(test$UniqueCarrier),
# function(df, models) {
# library(rpart)
# carrier <- df$UniqueCarrier[1]
# m <- bigr.pull(models[carrier])
# data.frame(carrier, df$DepDelay, df$ArrDelay, predict(m, df))
# },
# signature=data.frame(carrier='Carrier', DepDelay=1.0,
# ArrDelay=1.0, ArrDelayPred=1.0, stringsAsFactors=F),
# models)
#head(preds, 20)
Examining row six in the output, the predicted arrival delay is 20.429878; however, the actual arrival delay is 15 minutes. As expected, there are discrepancies between the predicted and actual results. It is important to test the quality of your model to see where predictions are wrong or different from the actual results.
#rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
#print(rmsd)
The RMSD shows that the model has a high error ratio. To improve the model, you can add more predictors like departure and arrival cities.
#preds$error <- abs(preds$ArrDelay - preds$ArrDelayPred)
#head(sort(preds, by=preds$error, decreasing=T))
The error is very high for the model’s worst predictions. The long delays are probably due to plane maintenance and repair. And the top errors might be outliers, because the range from the largest error to the sixth largest error is over 60 minutes.
This tutorial demonstrated some IBM® InfoSphere® BigInsightsT Big R functions and some techniques for analyzing data.
You now have a good understanding of the following tasks: