Session 1 BUS211 Demo of R Scripting

This document presents selected commands and results from the first class demonstration.

Like most computer languages, an R script consists of a logical series of commands, each obeying specific rules of syntax and performing particular actions.

This script analyzes a small portion (10%) of the flight quality data from US domestic flights in October 2014. The raw data, obtained from the US Department of Transportation, are stored in a comma-separated-values (CSV) file on our LATTE site. The first task in the script is to read in the CSV file and store it in an object that we will name “ontime”.

Before reading the data, we issue a command to set the Working Directory (wd), the folder on this particular computer where R looks for files.

# the "#" symbol indicates a comment, not to be executed
setwd("~/Dropbox/MyRWork/Data")
# Create object by reading in a csv file.  The "read.csv" command accomplishes this
# the operator symbol "<-" assigns a value to an object
ontime <- read.csv("On_Time.csv")
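
As a minimal, self-contained sketch of the same read step, the lines below write a tiny made-up CSV to a temporary folder and read it back. The file name "toy_flights.csv" and its values are hypothetical, not the course data. The path is built with file.path() so the working directory does not need to change.

```r
# Write a tiny, made-up CSV to a temporary folder so the example runs anywhere
toy_dir  <- tempdir()
toy_path <- file.path(toy_dir, "toy_flights.csv")
writeLines(c("Carrier,ArrDelay",
             "AA,5",
             "DL,-3",
             "WN,NA"), toy_path)

getwd()                     # report the current working directory
toy <- read.csv(toy_path)   # read the file into a data frame
dim(toy)                    # 3 rows, 2 columns
```

Note that read.csv treats the text NA as a missing value by default, so the third ArrDelay entry is read as missing.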

Now we have a data frame containing all of the flight data. The data frame is too large to view in full, so we’ll give some commands to examine its structure and some of its contents.

dim(ontime) # show the dimensions of the data table -- can add a comment anywhere in a line!

## [1] 49101   109

str(ontime[1:9]) # show the structure of the data table (just first 9 columns)

## 'data.frame':    49101 obs. of  9 variables:
##  $ Year         : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Quarter      : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Month        : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ DayofMonth   : int  23 23 21 21 21 21 21 21 21 21 ...
##  $ DayOfWeek    : int  4 4 2 2 2 2 2 2 2 2 ...
##  $ FlightDate   : Factor w/ 31 levels "10/1/2014","10/10/2014",..: 16 16 14 14 14 14 14 14 14 14 ...
##  $ UniqueCarrier: Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ AirlineID    : int  19790 19790 19790 19790 19790 19790 19790 19790 19790 19790 ...
##  $ Carrier      : Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...

names(ontime[1:9]) # list the variable names (just first 9 columns)

## [1] "Year"          "Quarter"       "Month"         "DayofMonth"   
## [5] "DayOfWeek"     "FlightDate"    "UniqueCarrier" "AirlineID"    
## [9] "Carrier"

head(ontime[1:6])  # display the first few (head) rows for columns 1 through 6

##   Year Quarter Month DayofMonth DayOfWeek FlightDate
## 1 2014       4    10         23         4 10/23/2014
## 2 2014       4    10         23         4 10/23/2014
## 3 2014       4    10         21         2 10/21/2014
## 4 2014       4    10         21         2 10/21/2014
## 5 2014       4    10         21         2 10/21/2014
## 6 2014       4    10         21         2 10/21/2014

Note that the columns shown here belong to two classes of variable: integers and factors. There are numerous other object classes available. In our course, we’ll primarily use the following object classes:

Data Frame: essentially a matrix-like set of equal-length vectors, one per column. Think of the structure of an Excel worksheet.
Vector: an ordered set of k values, all of the same type; often a column of values, analogous to a variable.

Classes referring to values in a vector or data frame column:

+ Numeric -- quantitative values; can perform arithmetic operations on a numeric object
+ Integer -- special case of Numeric: whole numbers
+ Character -- any alphanumeric values (e.g. customer names, movie titles)
+ Factor -- a variable intended to be used as a categorical variable (say, for Chi-Square tabulation)
+ Logical -- TRUE/FALSE binary variable
+ Date -- a variable representing a calendar date (date-times use related classes such as POSIXct)
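
To make the classes listed above concrete, here is a small sketch that creates one object of each kind and asks R to report its class. All names and values are made up for illustration.

```r
# One small, made-up object of each class (values are illustrative only)
delay     <- c(4.5, -3, 12.25)        # numeric
seats     <- c(150L, 180L, 76L)       # integer (note the L suffix)
carrier   <- c("AA", "DL", "AA")      # character
carrier_f <- factor(carrier)          # factor with levels "AA" and "DL"
late      <- delay > 0                # logical
day       <- as.Date("2014-10-23")    # Date

# Equal-length vectors become the columns of a data frame
toy <- data.frame(delay, seats, carrier_f, late)

sapply(toy, class)   # class of each column
levels(carrier_f)    # the distinct categories of the factor
class(day)           # "Date"
```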

Next, let’s compute some summary statistics for one INTEGER variable: Arrival Delay per flight, expressed in minutes. To refer to a variable by name, we specify both the names of the data frame and the variable, separated by a “$”:

summary(ontime$ArrDelay) # five-number summary, plus the mean and a count of NAs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -65.00  -11.00   -3.00    4.87    8.00 1308.00     606

mean(ontime$ArrDelay) # compute just the mean

## [1] NA

# note issue of missing data
mean(ontime$ArrDelay, na.rm = TRUE)  # "na.rm" says "remove NA values"

## [1] 4.869554

sd(ontime$ArrDelay, na.rm = TRUE)

## [1] 34.53292
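
The effect of missing values is easier to see on a short vector. The sketch below uses five made-up delays, one of them missing; the numbers are illustrative and are not drawn from the flight data.

```r
# Made-up delays in minutes; NA marks a missing value
toy_delay <- c(-11, -3, 8, NA, 40)

sum(is.na(toy_delay))             # count the missing values: 1
mean(toy_delay)                   # NA, because one value is missing
mean(toy_delay, na.rm = TRUE)     # 8.5, the mean of the four observed values
median(toy_delay, na.rm = TRUE)   # 2.5, the median of the four observed values
```

Here the mean exceeds the median because one large delay pulls the mean upward, the same pattern we see in the full ArrDelay variable.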

The five-number summary suggests that arrival delays are positively skewed: the mean exceeds the median, and the maximum lies far above the third quartile. We explore further with some simple graphs. Also, to reduce typing, we’ll first attach the data frame. After doing so, we can refer to variables by name without having to reference “ontime”.

# reduce typing by ATTACHING the dataframe
attach(ontime)
# some graphs, using the base graphing commands
hist(ArrDelay, breaks=50)

table(Carrier)  # frequency table of Carriers

## Carrier
##   AA   AS   B6   DL   EV   F9   FL   HA   MQ   OO   UA   US   VX   WN 
## 4428 1378 2048 7014 5708  799  465  613 3245 5130 4364 3470  466 9973

boxplot(ArrDelay ~ Carrier) # common syntax: variable vs factor

# some scatterplots for integer data
plot(DepDelay, ArrDelay, main="Arrival Delay vs Departure Delay")  # scatterplot of X vs Y

cols <- c(43, 51, 52, 53, 55)  # concatenate values into a vector called "cols"
pairs(ontime[cols])   # scatterplot matrix for group of columns
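
One caution about attach(): if another object in the session has the same name as a column, it can be unclear which one R is using. As a sketch of one alternative, the with() function evaluates a single expression using the columns of a data frame, with no attach() needed. The data frame "toy" below is a made-up stand-in for "ontime".

```r
# Made-up data frame standing in for "ontime"
toy <- data.frame(Carrier  = c("AA", "AA", "DL", "WN"),
                  ArrDelay = c(5, -2, NA, 12))

# with() looks up ArrDelay and Carrier inside toy for this one expression
avg_delay      <- with(toy, mean(ArrDelay, na.rm = TRUE))
carrier_counts <- with(toy, table(Carrier))

avg_delay        # mean of the three observed delays
carrier_counts   # number of rows per carrier
```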

Although the “base” R system has great functionality, it is sometimes necessary to extend it by adding packages to a session. R comes with a number of packages already installed; we load an installed package with the library command. In the remainder of this document, we use a few packages to facilitate creation of some presentation-ready graphs.

In your R script, for a package that has not previously been installed on your computer, the command is:

install.packages("doBy")  # add the package name within the parentheses
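
A package only needs to be installed once per computer, so re-running install.packages every time the script runs is wasteful. One possible pattern, sketched below, checks first and installs only when the package is missing. The helper name "ensure_package" is hypothetical, not part of R.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # runs only when the package is missing
  }
  # Return (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("stats")    # "stats" ships with R, so nothing is installed here
# ensure_package("doBy")   # would install doBy only if it is missing
```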

Now create two additional data frames containing median arrival delays, one by carrier and one by state.

library(doBy)  # load the package with library() in each script that uses it

## Loading required package: survival

medscar <- summaryBy(ArrDelay ~ Carrier, data=ontime, FUN=median, na.rm=TRUE)
medscar

##    Carrier ArrDelay.median
## 1       AA              -1
## 2       AS              -5
## 3       B6              -7
## 4       DL              -6
## 5       EV              -3
## 6       F9              -4
## 7       FL              -9
## 8       HA              -1
## 9       MQ               1
## 10      OO              -3
## 11      UA              -3
## 12      US              -5
## 13      VX              -5
## 14      WN              -3

medsstate <- summaryBy(ArrDelay ~ OriginStateName, data=ontime, FUN=median, na.rm=TRUE)
medsstate$State <- tolower(medsstate$OriginStateName)  # lowercase state names, to match the map data later
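
For comparison, a grouped median can also be computed without an add-on package. The sketch below uses the base R function aggregate() on a made-up data frame; it is an alternative technique, not the summaryBy call used in class. Two differences are worth noting: with a formula, aggregate() drops rows containing NA by default, and the result column keeps the name ArrDelay rather than ArrDelay.median.

```r
# Made-up delays for two carriers, including one missing value
toy <- data.frame(Carrier  = c("AA", "AA", "AA", "DL", "DL", "DL"),
                  ArrDelay = c(-4, 2, 10, -6, NA, 0))

# Median ArrDelay within each Carrier; the row with NA is omitted
toy_meds <- aggregate(ArrDelay ~ Carrier, data = toy, FUN = median)
toy_meds
```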

And now for some more attractive graphs, created with the ggplot2 package:

library(ggplot2)
qplot(ArrDelay, data=ontime, binwidth=10, fill=I("red")) # one syntax for quick plot; I() sets a fixed fill color

ggplot(medscar, aes(x=Carrier, y=ArrDelay.median)) +
        geom_bar(fill="blue", stat="identity") +
        geom_text(aes(label=ArrDelay.median), vjust=1.5)

Finally, we construct a choropleth map of arrival delays by state. This requires the package “maps” as well as a fair amount of data rearranging. Don’t worry about the details here; the point is that this is typical in analysis: the data preparation phase is often complex and time-consuming.

library(maps)

## 
##  # ATTENTION: maps v3.0 has an updated 'world' map.        #
##  # Many country borders and names have changed since 1990. #
##  # Type '?world' or 'news(package="maps")'. See README_v3. #

state_map <- map_data("state")
# Create a choropleth map
# first merge the medsstate and state_map dataframes

delay_map <- merge(state_map, medsstate, by.x="region", by.y = "State")
head(delay_map)

##    region      long      lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968     1     1      <NA>         Alabama
## 2 alabama -87.48493 30.37249     1     2      <NA>         Alabama
## 3 alabama -87.95475 30.24644     1    13      <NA>         Alabama
## 4 alabama -88.00632 30.24071     1    14      <NA>         Alabama
## 5 alabama -88.01778 30.25217     1    15      <NA>         Alabama
## 6 alabama -87.52503 30.37249     1     3      <NA>         Alabama
##   ArrDelay.median
## 1              -4
## 2              -4
## 3              -4
## 4              -4
## 5              -4
## 6              -4
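
The merge step may be easier to follow on a toy example. In the sketch below, both tables are made up: "toy_map" has several rows per region, and "toy_meds" has one row per state. As in the script, the key columns have different names, so we give them with by.x and by.y.

```r
# Made-up stand-ins: a "map" table with several points per region,
# and a summary table with one row per state
toy_map  <- data.frame(region = c("ohio", "ohio", "iowa", "iowa"),
                       order  = c(2, 1, 4, 3))
toy_meds <- data.frame(OriginStateName = c("Iowa", "Ohio", "Utah"),
                       ArrDelay.median = c(-2, -5, 1))

# Lowercase the state names so they match the lowercase region keys
toy_meds$State <- tolower(toy_meds$OriginStateName)

# Keep only rows whose key appears in both tables (Utah has no map rows)
toy_merged <- merge(toy_map, toy_meds, by.x = "region", by.y = "State")
toy_merged
```

Each map row picks up the median for its state, which is why the same value repeats down the ArrDelay.median column. The merged key column takes its name from the first table ("region").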

library(plyr) # provides the arrange() function, used to reorder the rows of the map data

## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:maps':
## 
##     ozone

delay_map <- arrange(delay_map, group, order)
head(delay_map)

##    region      long      lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968     1     1      <NA>         Alabama
## 2 alabama -87.48493 30.37249     1     2      <NA>         Alabama
## 3 alabama -87.52503 30.37249     1     3      <NA>         Alabama
## 4 alabama -87.53076 30.33239     1     4      <NA>         Alabama
## 5 alabama -87.57087 30.32665     1     5      <NA>         Alabama
## 6 alabama -87.58806 30.32665     1     6      <NA>         Alabama
##   ArrDelay.median
## 1              -4
## 2              -4
## 3              -4
## 4              -4
## 5              -4
## 6              -4
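
The reordering matters because the points of each polygon need to be in drawing order, and the merge can leave them shuffled. As a sketch of the same idea in base R, the order() function can sort rows by one column and then by another. The data frame below is made up, and this is an alternative to arrange(), not the command used in class.

```r
# Made-up map rows, deliberately out of order
toy_map <- data.frame(group = c(2, 1, 1, 2),
                      order = c(4, 2, 1, 3),
                      long  = c(-93.1, -84.2, -84.8, -93.6))

# Sort by group first, then by order within each group
toy_sorted <- toy_map[order(toy_map$group, toy_map$order), ]
toy_sorted
```

Unlike arrange(), this approach keeps the original row names, so the printed row labels appear out of sequence even though the rows are sorted.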

ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay.median)) +
        geom_polygon(color="black")

# map looks distorted -- alter the projection
library(mapproj)
ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay.median)) +
        geom_polygon(color="black")  +
        coord_map("polyconic")

Prof. Robert Carver