Lesson 1: Uploading the airline data set to InfoSphere BigInsights server with Big R

In this lesson, you upload the sample airline data set to the InfoSphere® BigInsightsT server, and then you access it as a bigr.frame object, which is a Big R data.frame, and it serves as a proxy for the underlying data set.

Before you begin:

Load the Big R package, connect to the InfoSphere BigInsights server, and then confirm that it is connected. Update the following example for your environment settings, then run the code in your R environment.

library(bigr)
## Loading required package: rJava
bigr.connect(host="lath3842", 
  port=7052, database="default", 
  user="biadmin", password="password")
is.bigr.connected()
## [1] TRUE

lath3842 is the host name of the node where your InfoSphere BigInsights Console is installed.

biadmin is the name of the InfoSphere BigInsights administrator.

About this lesson:

For this tutorial, you use R to extract an airline data set from the Big R package and import it into your distributed file system for further analysis. For typical examples of importing data, see the tutorial on Importing data for analysis.

Procedure:

  1. Run the following code to extract the airline data set from the bigr package directory, and then create a data frame named airR:
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
  1. Convert airR to a bigr.frame named air, then move the data set to the InfoSphere BigInsights server, and then show the sample airline data set from the InfoSphere BigInsights server:
air <- as.bigr.frame(airR)
bigr.persist(air, dataSource="DEL", 
  dataPath="airline_demo.csv", header=T, 
  delimiter=",", useMapReduce=F)
##    Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1  2004     2         12         4     633        635     935        930
## 2  2004     2         16         1    2115       2120    2340       2350
## 3  2004     2         18         3     700        700     817        820
## 4  2004     2         19         4    1140       1145    1427       1420
## 5  2004     2         21         6     936        935    1036       1035
## 6  2004     2         24         2    1117       1120    1922       1930
## 7  2004     2         28         6    2029       2030    2209       2214
## 8  2004     2         16         1    1155       1155    1309       1256
## 9  2004     2         20         5    1459       1500    1608       1615
## 10 2004     2         21         6     734        740     911        926
## 11 2004     2          5         4    1521       1530    1837       1838
## 12 2004     2         27         5    1544       1535    1852       1837
## 13 2004     2         16         1    1340       1345    1701       1611
## 14 2004     2         10         2    1740       1740    2028       1953
## 15 2004     2         13         5    1049       1100    1318       1347
## 16 2004     2         17         2    1319       1325    1757       1807
## 17 2004     2         25         3    1827       1700    2231       2019
## 18 2004     2         24         2     754        800     940        928
## 19 2004     2          4         3    1803       1735    2125       1947
## 20 2004     2         27         5    1540       1540    1905       1901
## 21 2004     2         26         4     640        650    1216       1227
## 22 2004     2          5         4     808        815    1636       1628
## 23 2004     2         22         7    1845       1845    2217       2233
## 24 2004     2         13         5     129        130     628        630
## 25 2004     2         13         5    1318       1319    1647       1655
## 26 2004     2          4         3    1721       1725    2015       2029
## 27 2004     2         10         2    1607       1541    2116       2000
## 28 2004     2         28         6     738        740    1020       1038
## 29 2004     2         27         5    1717       1708    1808       1810
## 30 2004     2          4         3    1533       1535    1850       1844
## 31 2004     2         24         2    2000       1910    2303       2213
## 32 2004     2         29         7     856        900    1150       1145
## 33 2004     2          8         7    1047       1050    1429       1447
## 34 2004     2          1         7    1241       1240    1345       1342
## 35 2004     2          3         2    1451       1459    1630       1625
## 36 2004     2          4         3    2219       2100      35       2336
## 37 2004     2          4         3     730        740     917        910
## 38 2004     2          5         4    1203       1210    1447       1402
## 39 2004     2          6         5    1215       1200    1415       1357
## 40 2004     2          9         1    1857       1735    1936       1805
## 41 2004     2          9         1    1708       1710    1822       1829
## 42 2004     2         10         2    1405       1340    1659       1645
## 43 2004     2         11         3     610        615     737        745
## 44 2004     2         15         7    2145       2125    2247       2240
## 45 2004     2         16         1     837        840     953        957
## 46 2004     2         17         2    1447       1500    1718       1655
## 47 2004     2         18         3     837        830    1006       1009
## 48 2004     2         19         4    1104       1055    1358       1352
## 49 2004     2         22         7    1706       1700    1815       1800
## 50 2004     2         22         7    1749       1750    1943       1945
##    UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime
## 1             B6       165  N553JB               182            175
## 2             B6       199  N570JB               325            330
## 3             B6         2  N544JB                77             80
## 4             B6        67  N570JB               167            155
## 5             B6        68  N544JB                60             60
## 6             B6       206  N548JB               305            310
## 7             CO       780  N17644               100            104
## 8             CO       683  N36247                74             61
## 9             CO      1707  N35260               129            135
## 10            CO       191  N14325               157            166
## 11            CO        65  N67158               376            368
## 12            CO      1601  N75851               188            182
## 13            CO      1161  N14605               201            146
## 14            CO      1122  N27213               288            253
## 15            CO      1719  N14653               149            167
## 16            CO       482  N24729               218            222
## 17            CO       348  N14102               424            379
## 18            CO      1144  N14601               166            148
## 19            CO      1498  N16339               202            132
## 20            CO       365  N73406               145            141
## 21            CO       354  N35204               216            217
## 22            CO      1049  N36247               328            313
## 23            CO       717  N14840               152            168
## 24            CO       586  N34137               179            180
## 25            CO      1790  N10834               149            156
## 26            CO      1426  N37277               354            364
## 27            CO       632  N14662               249            199
## 28            CO      1581  N12225               342            358
## 29            CO       616  N76254                51             62
## 30            CO      1601  N13113               197            189
## 31            CO       638  N14831               183            183
## 32            CO       600  N12109               174            165
## 33            CO       226  N14230               162            177
## 34            DH      7841  N644BR                64             62
## 35            DH      7827  N313UE                99             86
## 36            DH      7462  N657BR                76             96
## 37            DH      7483  N693BR               167            150
## 38            DH      7627  N703BR               224            172
## 39            DH      7397  N690BR               120            117
## 40            DH      7410  N669BR                99             90
## 41            DH      7765  N307UE                74             79
## 42            DH      7472  N662BR               114            125
## 43            DH      6188  N408FJ                87             90
## 44            DH      7325  N678BR                62             75
## 45            DH      7299  N640BR                76             77
## 46            DH      7389  N697BR               151            115
## 47            DH      6229  N428FJ                89             99
## 48            DH      7588  N705BR               114            117
## 49            DH      7461  N699BR               129            120
## 50            DH      7524  N697BR                54             55
##    AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1      162        5       -2    JFK  TPA     1005      3      17         0
## 2      114      -10       -5    JFK  LAS     2248      8      23         0
## 3       49       -3        0    JFK  BUF      301      2      26         0
## 4      141        7       -5    RSW  JFK     1074      7      19         0
## 5       41        1        1    JFK  SYR      209      3      16         0
## 6      468       -8       -3    LGB  JFK     2465      7      10         0
## 7       83       -5       -1    IAH  MCI      643      4      13         0
## 8       41       13        0    IAH  MSY      305      5      28         0
## 9      108       -7       -1    TPA  IAH      787      7      14         0
## 10     135      -15       -6    MIA  IAH      964      5      17         0
## 11     326       -1       -9    EWR  LAX     2454     13      37         0
## 12     156       15        9    EWR  FLL     1065      8      24         0
## 13     142       50       -5    EWR  ATL      745     10      49         0
## 14     216       35        0    IAH  SFO     1635      9      63         0
## 15     128      -29      -11    TPA  EWR      998      7      14         0
## 16     182      -10       -6    IAH  BOS     1597      6      30         0
## 17     361      132       87    EWR  SFO     2565      7      56         0
## 18     135       12       -6    EWR  MDW      711      3      28         0
## 19      98       98       28    IAH  OMA      781      4     100         0
## 20     117        4        0    IAH  FLL      965      8      20         0
## 21     192      -11      -10    SMF  IAH     1609     14      10         0
## 22     296        8       -7    SFO  EWR     2565      6      26         0
## 23     131      -16        0    MSY  EWR     1167      8      13         0
## 24     157       -2       -1    LAX  IAH     1379      6      16         0
## 25     124       -8       -1    IAH  MIA      964      4      21         0
## 26     332      -14       -4    EWR  SAN     2425      2      20         0
## 27     158       76       26    IAH  LGA     1416      7      84         0
## 28     320      -18       -2    EWR  SEA     2401      3      19         0
## 29      35       -2        9    SAT  IAH      191      4      12         0
## 30     152        6       -2    EWR  FLL     1065      4      41         0
## 31     155       50       50    EWR  MIA     1086      4      24         0
## 32     147        5       -4    FLL  EWR     1065      8      19         0
## 33     140      -18       -3    IAH  BWI     1235      6      16         0
## 34      39        3        1    IAD  ORF      157      3      22         0
## 35      74        5       -8    HPN  IAD      247      3      22         0
## 36      64       59       79    ORD  TYS      475      3       9         0
## 37     108        7      -10    CHS  ORD      760     28      30         0
## 38     155       45       -7    PWM  ORD      900     46      22         0
## 39      90       18       15    ORD  TUL      585      8      22         0
## 40      71       91       82    LEX  ORD      323      6      22         0
## 41      63       -7       -2    IAD  GSO      239      3       7         0
## 42      86       14       25    ORD  CAE      666      4      24         0
## 43      69       -8       -5    BOS  PHL      280      6      12         0
## 44      50        7       20    IAD  CLT      321      3       8         0
## 45      40       -4       -3    IAD  EWR      213     14      22         0
## 46      98       23      -13    ICT  ORD      588     36      17         0
## 47      79       -3        7    CAE  CVG      404      4       5         0
## 48      93        6        9    ORD  ALB      723      9      12         0
## 49      82       15        6    TYS  ORD      475     20      27         0
## 50      27       -2       -1    ORD  FWA      157      3      24         0
##    CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1                          0            0            0        0
## 2                          0            0            0        0
## 3                          0            0            0        0
## 4                          0            0            0        0
## 5                          0            0            0        0
## 6                          0            0            0        0
## 7                          0            0            0        0
## 8                          0            0            0        0
## 9                          0            0            0        0
## 10                         0            0            0        0
## 11                         0            0            0        0
## 12                         0            9            0        6
## 13                         0            0            0       50
## 14                         0            0            0       35
## 15                         0            0            0        0
## 16                         0            0            0        0
## 17                         0            0           25       45
## 18                         0            0            0        0
## 19                         0            3            0       70
## 20                         0            0            0        0
## 21                         0            0            0        0
## 22                         0            0            0        0
## 23                         0            0            0        0
## 24                         0            0            0        0
## 25                         0            0            0        0
## 26                         0            0            0        0
## 27                         0            0            0       50
## 28                         0            0            0        0
## 29                         0            0            0        0
## 30                         0            0            0        0
## 31                         0            0            0        0
## 32                         0            0            0        0
## 33                         0            0            0        0
## 34                         0            0            0        0
## 35                         0            0            0        0
## 36                         0            0            0        0
## 37                         0            0            0        0
## 38                         0            0            0       45
## 39                         0            0            0        3
## 40                         0            0            0       91
## 41                         0            0            0        0
## 42                         0            0            0        0
## 43                         0            0            0        0
## 44                         0            0            0        0
## 45                         0            0            0        0
## 46                         0            0            0       23
## 47                         0            0            0        0
## 48                         0            0            0        0
## 49                         0            6            0        9
## 50                         0            0            0        0
##    SecurityDelay LateAircraftDelay
## 1              0                 0
## 2              0                 0
## 3              0                 0
## 4              0                 0
## 5              0                 0
## 6              0                 0
## 7              0                 0
## 8              0                 0
## 9              0                 0
## 10             0                 0
## 11             0                 0
## 12             0                 0
## 13             0                 0
## 14             0                 0
## 15             0                 0
## 16             0                 0
## 17             0                62
## 18             0                 0
## 19             0                25
## 20             0                 0
## 21             0                 0
## 22             0                 0
## 23             0                 0
## 24             0                 0
## 25             0                 0
## 26             0                 0
## 27             0                26
## 28             0                 0
## 29             0                 0
## 30             0                 0
## 31             0                50
## 32             0                 0
## 33             0                 0
## 34             0                 0
## 35             0                 0
## 36             0                59
## 37             0                 0
## 38             0                 0
## 39             0                15
## 40             0                 0
## 41             0                 0
## 42             0                 0
## 43             0                 0
## 44             0                 0
## 45             0                 0
## 46             0                 0
## 47             0                 0
## 48             0                 0
## 49             0                 0
## 50             0                 0
## ...  showing first 50 rows only.

Important: Moving the file to the InfoSphere BigInsights server can take a few minutes.

The parameter useMapReduce by default is true. The sample airline data set is not large, so setting the parameter to false will run the data faster.

  1. Run the following code to access the same data set as a bigr.frame object named air directly from the InfoSphere BigInsights Console:
air <- bigr.frame(dataSource="DEL", 
  dataPath="airline_demo.csv", delimiter=",", 
  header=T, useMapReduce=F)
## Warning: WARN[bigr.frame]: Operations on this bigr.frame will not use
## MapReduce since argument 'useMapReduce' was set to FALSE. This
## configuration is only suitable for small datasets.

Results:

After uploading the sample data set, the file exists in the following location in the InfoSphere BigInsights Console: user/bigsql/airline_demo.csv.

Lesson 2: Exploring the structure of the data set with IBM InfoSphere BigInsights Big R

In this lesson, you learn how to review the structure of the data set.

Procedure:

  1. In your R environment, become familiar with the bigr.frame object named air created in the last lesson, by exploring the column names and column types by running the following code:
colnames(air)
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "CRSDepTime"       
##  [7] "ArrTime"           "CRSArrTime"        "UniqueCarrier"    
## [10] "FlightNum"         "TailNum"           "ActualElapsedTime"
## [13] "CRSElapsedTime"    "AirTime"           "ArrDelay"         
## [16] "DepDelay"          "Origin"            "Dest"             
## [19] "Distance"          "TaxiIn"            "TaxiOut"          
## [22] "Cancelled"         "CancellationCode"  "Diverted"         
## [25] "CarrierDelay"      "WeatherDelay"      "NASDelay"         
## [28] "SecurityDelay"     "LateAircraftDelay"
coltypes(air)
##  [1] "character" "character" "character" "character" "character"
##  [6] "character" "character" "character" "character" "character"
## [11] "character" "character" "character" "character" "character"
## [16] "character" "character" "character" "character" "character"
## [21] "character" "character" "character" "character" "character"
## [26] "character" "character" "character" "character"

By default, Big R sets all column types to character. However, for this data set, all column types need to be integers, except for: UniqueCarrier, TailNum, Origin, Dest, and CancellationCode.

  1. Run the following code to assign type integer to all column types except for the columns listed in the previous step, and then display the updated column types:
coltypes(air) <- ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer")
coltypes(air)
##  [1] "integer"   "integer"   "integer"   "integer"   "integer"  
##  [6] "integer"   "integer"   "integer"   "character" "integer"  
## [11] "character" "integer"   "integer"   "integer"   "integer"  
## [16] "integer"   "character" "character" "integer"   "integer"  
## [21] "integer"   "integer"   "character" "integer"   "integer"  
## [26] "integer"   "integer"   "integer"   "integer"
  1. Optional: Continue to explore the data set by running some of the following functions.

Important: To avoid long processing times, limit the number of observations to display, for example, by using the bigr.setRowLimit function.

Number of flights (number of rows)

nrow(air)
## [1] 128790

Number of flight attributes (number of columns)

ncol(air)
## [1] 29

Data dimensions (rows x columns)

dim(air)
## [1] 128790     29

Structure of the data set, including sample data

str(air)
## 'bigr.frame': 128790 obs. of 29 variables:
##  $ Year             : int 2004 2004 2004 2004 2004 2004 ...
##  $ Month            : int 2 2 2 2 2 2 ...
##  $ DayofMonth       : int 12 16 18 19 21 24 ...
##  $ DayOfWeek        : int 4 1 3 4 6 2 ...
##  $ DepTime          : int 633 2115 700 1140 936 1117 ...
##  $ CRSDepTime       : int 635 2120 700 1145 935 1120 ...
##  $ ArrTime          : int 935 2340 817 1427 1036 1922 ...
##  $ CRSArrTime       : int 930 2350 820 1420 1035 1930 ...
##  $ UniqueCarrier    : chr "B6" "B6" "B6" "B6" "B6" "B6" ...
##  $ FlightNum        : int 165 199 2 67 68 206 ...
##  $ TailNum          : chr "N553JB" "N570JB" "N544JB" "N570JB" "N544JB" "N548JB" ...
##  $ ActualElapsedTime: int 182 325 77 167 60 305 ...
##  $ CRSElapsedTime   : int 175 330 80 155 60 310 ...
##  $ AirTime          : int 162 114 49 141 41 468 ...
##  $ ArrDelay         : int 5 -10 -3 7 1 -8 ...
##  $ DepDelay         : int -2 -5 0 -5 1 -3 ...
##  $ Origin           : chr "JFK" "JFK" "JFK" "RSW" "JFK" "LGB" ...
##  $ Dest             : chr "TPA" "LAS" "BUF" "JFK" "SYR" "JFK" ...
##  $ Distance         : int 1005 2248 301 1074 209 2465 ...
##  $ TaxiIn           : int 3 8 2 7 3 7 ...
##  $ TaxiOut          : int 17 23 26 19 16 10 ...
##  $ Cancelled        : int 0 0 0 0 0 0 ...
##  $ CancellationCode : chr "" "" "" "" "" "" ...
##  $ Diverted         : int 0 0 0 0 0 0 ...
##  $ CarrierDelay     : int 0 0 0 0 0 0 ...
##  $ WeatherDelay     : int 0 0 0 0 0 0 ...
##  $ NASDelay         : int 0 0 0 0 0 0 ...
##  $ SecurityDelay    : int 0 0 0 0 0 0 ...
##  $ LateAircraftDelay: int 0 0 0 0 0 0 ...

First five rows

head(air, 5)
##   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2004     2         12         4     633        635     935        930
## 2 2004     2         16         1    2115       2120    2340       2350
## 3 2004     2         18         3     700        700     817        820
## 4 2004     2         19         4    1140       1145    1427       1420
## 5 2004     2         21         6     936        935    1036       1035
##   UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1            B6       165  N553JB               182            175     162
## 2            B6       199  N570JB               325            330     114
## 3            B6         2  N544JB                77             80      49
## 4            B6        67  N570JB               167            155     141
## 5            B6        68  N544JB                60             60      41
##   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1        5       -2    JFK  TPA     1005      3      17         0
## 2      -10       -5    JFK  LAS     2248      8      23         0
## 3       -3        0    JFK  BUF      301      2      26         0
## 4        7       -5    RSW  JFK     1074      7      19         0
## 5        1        1    JFK  SYR      209      3      16         0
##   CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1                         0            0            0        0
## 2                         0            0            0        0
## 3                         0            0            0        0
## 4                         0            0            0        0
## 5                         0            0            0        0
##   SecurityDelay LateAircraftDelay
## 1             0                 0
## 2             0                 0
## 3             0                 0
## 4             0                 0
## 5             0                 0

Last seven rows

tail(air, 7)
##   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2008    12         11         4     734        735    1010       1022
## 2 2008    12         12         5    1325       1310    1759       1759
## 3 2008    12         12         5    1425       1345    1938       1928
## 4 2008    12         12         5    1119       1101    1419       1405
## 5 2008    12         12         5    2127       2129    2258       2311
## 6 2008    12         13         6     719        710     938        949
## 7 2008    12         13         6    1250       1220    1617       1552
##   UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1            DL      1209  N915DN               216            227     200
## 2            DL       662  N3746H               154            169     131
## 3            DL      1032  N697DL               193            223     169
## 4            DL      1410  N913DL               180            184     149
## 5            DL      1792  N969DL                91            102      62
## 6            DL      1491  N935DL               199            219     179
## 7            DL      1621  N938DL               147            152     120
##   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1      -12       -1    MSY  SLC     1428      6      10         0
## 2        0       15    COS  ATL     1185      8      15         0
## 3       10       40    PHX  ATL     1587     11      13         0
## 4       14       18    PBI  BOS     1197      6      25         0
## 5      -13       -2    ATL  ORF      516      4      25         0
## 6      -11        9    LGA  MSY     1183      5      15         0
## 7       25       30    MSP  ATL      906      9      18         0
##   CancellationCode Diverted CarrierDelay WeatherDelay NASDelay
## 1                         0           NA           NA       NA
## 2                         0           NA           NA       NA
## 3                         0           NA           NA       NA
## 4                         0           NA           NA       NA
## 5                         0           NA           NA       NA
## 6                         0           NA           NA       NA
## 7                         0            3            0        0
##   SecurityDelay LateAircraftDelay
## 1            NA                NA
## 2            NA                NA
## 3            NA                NA
## 4            NA                NA
## 5            NA                NA
## 6            NA                NA
## 7             0                22

Carrier codes of all flights

print(air$UniqueCarrier)
##  [1] "B6" "B6" "B6" "B6" "B6" "B6" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [15] "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [29] "CO" "CO" "CO" "CO" "CO" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## [43] "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## ... 128740 more observations.

Destination cities of all flights

print(air$UniqueCarrier)
##  [1] "B6" "B6" "B6" "B6" "B6" "B6" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [15] "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO" "CO"
## [29] "CO" "CO" "CO" "CO" "CO" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## [43] "DH" "DH" "DH" "DH" "DH" "DH" "DH" "DH"
## ... 128740 more observations.

Lesson 3: Analyzing data with IBM InfoSphere BigInsights Big R

In this lesson, you analyze flight delay information.

About this lesson:

The fictional Sample Outdoor Company wants to partner with an airline that has few delays.

Procedure:

  1. Run the following code to attach the air data set to the R search path.
attach(air)

The attach() function makes referencing variables easier. For example, length(air$UniqueCarrier) computes the number of flights. The attach(air) function allows you to drop the air$ prefix. Now length(UniqueCarrier) can compute the number of flights in the air data set.

  1. To create a subset of the data set named airSubset that contains flight information for flights that are delayed by 15 minutes or more, run the following code:
airSubset <- air[Cancelled == 0 & (DepDelay >= 15 | ArrDelay >= 15), 
    c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]
  1. To find the overall ratio of flights that were delayed, run the following code:
nrow(airSubset) / nrow(air)
## [1] 0.227
  1. To see how a fictional HA airline compares to flight delays overall, run the following code:
nrow(airSubset[airSubset$UniqueCarrier == c("HA"),]) / 
  nrow(air[UniqueCarrier == "HA",])
## [1] 0.07435
  1. Continue to analyze flight delay information by running some of the following functions.

Correlation between columns

cor(air[, c("Distance", 
  "DepDelay", "ArrDelay", "DepTime", 
  "ArrTime", "ActualElapsedTime")])
##                    Distance DepDelay ArrDelay  DepTime ArrTime
## Distance           1.000000  0.02884 0.008679 -0.03649 0.03721
## DepDelay           0.028840  1.00000 0.871991  0.16668 0.07343
## ArrDelay           0.008679  0.87199 1.000000  0.15191 0.07216
## DepTime           -0.036489  0.16668 0.151907  1.00000 0.73996
## ArrTime            0.037212  0.07343 0.072163  0.73996 1.00000
## ActualElapsedTime  0.968554  0.04201 0.094298 -0.03314 0.04807
##                   ActualElapsedTime
## Distance                    0.96855
## DepDelay                    0.04201
## ArrDelay                    0.09430
## DepTime                    -0.03314
## ArrTime                     0.04807
## ActualElapsedTime           1.00000

Maximum departure delay (in minutes)

max(air$DepDelay)
## [1] 1435

Average and maximum arrival delay (in minutes)

summary(mean(ArrDelay) + max(ArrDelay) ~ ., 
  object = air)
##   mean(ArrDelay) max(ArrDelay)
## 1          7.108          1016

Results:

Because the delay information for the fictional HA airlines is better than others as our data analysis has shown, the fictional HA airline is a strong candidate for the Sample Outdoor Company.

Lesson 4: Visualizing big data with IBM InfoSphere BigInsights Big R

In this lesson, you create visualizations by using the data set to find trends in flight delays.

Before you begin:

Install the ggplot2 data visualization package in your R environment with the following code:

#install.packages("ggplot2")

Install the makeR package:

  1. Download the makeR data visualization package from the cran archives.
  2. Install the package using the following R command:
#install.packages("makeR_1.0.2.tar.gz")
  1. Load the library using the following R command:
library (makeR)
## Loading required package: tools
## Loading required package: XML
## 
## Attaching package: 'XML'
## 
## The following object is masked from 'package:tools':
## 
##     toHTML
## Warning: replacing previous import by 'tools::toHTML' when loading 'makeR'
## Auto saving of PROJECT.xml is enabled. This can be changed using the setAutoSave() function.
## 
## Auto opening of built and released files is enabled. This can be changed using the setAutoOpen() function.

About this lesson:

The fictional Sample Outdoor Company wants to find the times and days with the least amount of delays.

Procedure

  1. Connect to the ggplot2 library by running the following command:
library(ggplot2)

With the ggplot2 library you can create bar charts and other types of plots.

  1. Create a bar chart that shows the number of delays for each hour, by running the following code:
  1. Create a working copy of the air data frame by running the following code.
bf <- air
  1. Add columns for the number of flights delayed and the departure times by hour to the bf data frame by running the following code.
bf$FlightDelayed <- ifelse(bf$DepDelay >= 15, 1, 0)
bf$DepHour <- bf$DepTime / 100
  1. Compute the distribution of the number of flights in each hour.
df <- summary(sum(FlightDelayed) ~ DepHour, object = bf)
  1. Remove flight data that is missing information, and then calculate the number of flights for each time interval. “NA” in the data frame represent values that are missing or not available.
df <- na.omit(df)
df <- df[order(df[,1]),]
colnames(df) <- c("DepHour", "FlightCount")
deps <- df[,2]
names(deps) <- as.vector(df[,1])
  1. Label and create the bar chart.
barplot(deps, main="Airline Delays By Hour", 
  xlab="Hour", ylab="# of Flights Delayed")

plot of chunk unnamed-chunk-30

The chart shows delays increasing as the day progresses. This is probably due to an increase in flights as the day progresses.

  1. Create a bar chart that shows the number of flights for each hour, by running the following script:
bigr.histogram(air$ArrTime, nbins=24) + 
   labs(title = "Flight Volume (Arrival) by Hour")
## Loading required package: plyr

plot of chunk unnamed-chunk-31

When you compare the delay chart with this flight volume chart you see that flights between 09:00-11:00 a.m. have a similar flight frequency with flights between 12:00-06:00 p.m., however the morning flights appear to experience fewer delays.

There seems to be a correlation between the number flights in a day and the amount of delays. If the proportion of delays decreased with respect to the number of flights, finding days with fewer flights might be useful.

  1. Create a calendar heat map that shows the number of flights on each day, by running the following code:
library(makeR)
bf <- air[air$Cancelled == 0,]
df <- summary(count(Month) ~ Year + Month + DayofMonth, object = bf)
df$DateStr <- paste(df$Year, df$Month, df$DayofMonth, sep="-")
df2011 <- df[df$Year == 2000 | df$Year == 2001 | df$Year == 2002,]
calendarHeat(df2011$DateStr, df2011[,4] * 100, varname="Flight Volume")
## Loading required package: lattice
## Loading required package: grid
## Loading required package: chron
## Warning: there is no package called 'chron'

plot of chunk unnamed-chunk-32

In the month of February, Saturdays seem to have fewer flights on average. The Sample Outdoor Company might want to schedule February flights between 09:00-11:00 a.m. on Saturday.

Lesson 5: Creating a predictive model with IBM InfoSphere BigInsights Big R

In this lesson, you create a decision-tree model and make predictions.

Before you begin:

Install the DMwR package in your R environment with the following code:

#install.packages("DMwR")

Install the rpart package in your R environment with the following code:

#install.packages("rpart")

Important: Make sure that R is installed on each node of the cluster (NameNode and DataNode), and that rpart is installed on each node.

About this lesson:

To communicate expected flight delay information to customers, you will create a decision-tree then try to predict flight arrival delays for the fictional HA and UA airlines.

Procedure

  1. Create a decision-tree by running the following code.
  1. Create a subset of the data set named bf that contains flight information for only HA and UA airlines, by running the following code:
bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]
  1. Split the data into a train and a test data set, and then test to see if the split percentages are accurate, by running the following code:
splits <- bigr.sample(bf, c(0.7, 0.3))
class(splits)
## [1] "list"
train <- splits[[1]]
test <- splits[[2]]
nrow(train) / nrow(bf)
## [1] 0.7053
nrow(test) / nrow(bf)
## [1] 0.2947

When building models, you can use a subset of the data to train the model, and use the remaining data to test and validate your model.

  1. Build a decision-tree model for the HA and UA airlines, and then view the two models.
models <- groupApply(data = train, 
  groupingColumns=list(train$UniqueCarrier),
  rfunction = function(df) {
    library(rpart)
    predcols <- c('ArrDelay', 'DepDelay', 'DepTime', 
      'CRSArrTime', 'Distance')
    m <- rpart(ArrDelay ~ ., df[,predcols])
    m
})
## groupApply ran for 37.28secs
print(models)
## bigr.list
##   group1 status
## 1     HA     OK
## 2     UA     OK

The models will predict flight arrival delay using departure delay, departure time, air travel time, and distance as predicting variables.

  1. Get the contents of the HA model and name it modelHA, then print the contents of modelHA.
modelHA <- bigr.pull(models$HA)
print(modelHA)
## n=197 (1 observation deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 197 29740.0  -1.7410  
##    2) DepDelay< 5 178  9954.0  -4.3030  
##      4) DepDelay< -2.5 128  4665.0  -6.1560  
##        8) Distance>=2580 9  1420.0 -14.5600 *
##        9) Distance< 2580 119  2562.0  -5.5210  
##         18) DepDelay< -6.5 26   404.2  -8.6150 *
##         19) DepDelay>=-6.5 93  1839.0  -4.6560 *
##      5) DepDelay>=-2.5 50  3724.0   0.4400  
##       10) Distance< 2646 43  1188.0  -0.6047 *
##       11) Distance>=2646 7  2201.0   6.8570 *
##    3) DepDelay>=5 19  7674.0  22.2600 *
  1. Create a decision-tree for HA airline.
library(DMwR)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'DMwR'
## 
## The following object is masked from 'package:plyr':
## 
##     join
prettyTree(modelHA)

plot of chunk unnamed-chunk-39

The decision-tree uses the column DepDelay more than other columns showing that there is a strong relationship between the decision-tree and DepDelay.

  1. Use the models that you created in step 1 to make arrival delay predictions.
  1. Score the data set. Create columns to compare departure delay, actual arrival delay, and predicted arrival delay.
#preds <- groupApply(test,
#  list(test$UniqueCarrier),
#  function(df, models) {
#    library(rpart)
#    carrier <- df$UniqueCarrier[1]
#    m <- bigr.pull(models[carrier])
#    data.frame(carrier, df$DepDelay, df$ArrDelay, predict(m, df))
#  },
#  signature=data.frame(carrier='Carrier', DepDelay=1.0, 
#    ArrDelay=1.0, ArrDelayPred=1.0, stringsAsFactors=F),
#    models)
  1. Show 20 predictions.
#head(preds, 20)

Examining row six in the output, the predicted arrival delay is 20.429878; however, the actual arrival delay is 15 minutes. As expected, there are discrepancies between the predicted and actual results. It is important to test the quality of your model to see where predictions are wrong or different from the actual results.

  1. Check the quality of your model.
  1. Use the root mean squared deviation (RMSD) error metric.
#rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
#print(rmsd)

The RMSD shows that the model has a high error ratio. To improve the model, you can add more predictors like departure and arrival cities.

  1. Examine the rows where your model gave the worst predictions.
#preds$error <- abs(preds$ArrDelay - preds$ArrDelayPred)
#head(sort(preds, by=preds$error, decreasing=T))

The error is very high for the model’s worst predictions. The long delays are probably due to plane maintenance and repair. And the top errors might be outliers, because the range from the largest error to the sixth largest error is over 60 minutes.

Summary of analyzing data with IBM InfoSphere BigInsights Big R tutorial

This tutorial demonstrated some IBM® InfoSphere® BigInsightsT Big R functions and some techniques for analyzing data.

Lessons learned

You now have a good understanding of the following tasks:

  • How to use basic Big R functions
  • How to format data
  • How to create visualizations
  • How to use create predictive models
  • How to test predictive models