Session 2 BUS211 R Demo Spring 2017

This document presents selected commands and results from the first class demonstration. The goal is to demonstrate a few of the basic capabilities of R and RStudio.

Like most computer languages, an R script consists of a logical series of commands, each obeying specific rules of syntax and performing particular actions.

In R, we often add packages to the base set of commands and functions. We have already installed most of the packages we will need this term, but you can always install others. You only need to install a package once. In the example below, I want to use a package called mapproj; I installed it manually before running this script with the command install.packages("mapproj").
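Because a package needs installing only once, a script can check whether it is already present before calling install.packages. The following is a minimal sketch of that idea; the names `needed`, `installed` and `missing_pkgs` are illustrative, and the install line is left commented out.

```r
# Packages this script relies on
needed <- c("dplyr", "ggplot2", "maps", "mapproj")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Report what is missing; uncomment the last line to install them
print(missing_pkgs)
# install.packages(missing_pkgs)
```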

Within a script, before we can use the functionality of a package, we need to invoke the package using the command library. Though you can insert library commands anywhere, it’s considered good practice to load packages right at the start of your script.

# Invoke several packages at the outset
library("dplyr")   # dplyr helps to reorganize a data table
library("ggplot2") # a graphing package
library("maps")    # map outlines, used for the state map later
library("mapproj") # map projections

This script analyzes a small portion (10%) of the flight quality data from US domestic flights for the month of October 2014. The raw data, obtained from the US Department of Transportation, is in a comma-separated-values (CSV) file on our LATTE site. The first task in the script is to read in the CSV file and store it in an object that we will name “ontime”.

Before reading the data, we issue a command to set the Working Directory (wd) on the particular computer, so that R knows where to look for the file.

# the "#" symbol indicates a comment, not to be executed
setwd("C:/Users/Rob/Box Sync/My R Work/BUS211")
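The path above is specific to one computer, so it will differ on yours. As a small sketch, `getwd()` reports the current working directory, and `file.exists()` checks whether a relative path such as data/On_Time.csv can be found from there. The names `wd` and `csv_path` are illustrative.

```r
# Show the current working directory
wd <- getwd()
print(wd)

# Build a path relative to the working directory and check that it exists
csv_path <- file.path("data", "On_Time.csv")
print(csv_path)
print(file.exists(csv_path))  # FALSE means the wd or the path needs fixing
```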

# Create object by reading in a csv file.  The "read.csv" command accomplishes this
# the operator symbol "<-" assigns values to an object

ontime <- read.csv("data/On_Time.csv")
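To see what `read.csv` does without needing the flight file, here is a self-contained sketch that writes a tiny made-up CSV to a temporary file and reads it back. The values and the name `toy` are invented for illustration. One assumption worth flagging: the output shown later in this document lists text columns as factors, which was the default before R 4.0.0; in newer versions of R you would need `stringsAsFactors = TRUE` to get the same result.

```r
# Write a small, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Carrier,ArrDelay",
             "AA,-3",
             "DL,12",
             "AA,"), tmp)

# Read it back; text columns become factors, as in this document's output
toy <- read.csv(tmp, stringsAsFactors = TRUE)
print(toy)
print(class(toy$Carrier))   # "factor"
print(class(toy$ArrDelay))  # "integer"; the blank entry is read as NA
```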

Now we have a data frame containing all of the flight data. A data frame is essentially a table of data: each column represents a variable, and each row is an observation. Unlike a matrix, its columns can hold different types of data.

Data frames have metadata, such as the names of columns, or the dimensions (# of rows, # of columns).

The data frame is too large to inspect by eye, so we’ll give some commands to examine the structure and some of the contents of the data frame.

dim(ontime) # show the dimensions of the data table -- can add a comment anywhere in a line!

## [1] 49101   109

So, the full data frame contains 49,101 rows and 109 columns.
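`dim` is one of several functions that report a data frame's metadata. A brief sketch with a made-up data frame (named `toy` here purely for illustration):

```r
# A made-up data frame with 3 rows and 2 columns
toy <- data.frame(Carrier = c("AA", "DL", "WN"),
                  ArrDelay = c(-3, 12, 0))

dim(toy)    # rows, then columns
nrow(toy)   # number of rows only
ncol(toy)   # number of columns only
names(toy)  # column (variable) names
```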

str(ontime[1:9]) # show the structure of the data table (just first 9 columns)

## 'data.frame':    49101 obs. of  9 variables:
##  $ Year         : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Quarter      : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Month        : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ DayofMonth   : int  23 23 21 21 21 21 21 21 21 21 ...
##  $ DayOfWeek    : int  4 4 2 2 2 2 2 2 2 2 ...
##  $ FlightDate   : Factor w/ 31 levels "10/1/2014","10/10/2014",..: 16 16 14 14 14 14 14 14 14 14 ...
##  $ UniqueCarrier: Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ AirlineID    : int  19790 19790 19790 19790 19790 19790 19790 19790 19790 19790 ...
##  $ Carrier      : Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...

Note that this data frame contains two classes of variable: integers and factors. There are numerous object classes available. In our course, we’ll primarily use the following object classes:

Data Frame: a table-like set of equal-length vectors, one per column. Think of the structure of an Excel worksheet.
Vector: a one-dimensional sequence of k values of the same type; often a column of values, analogous to a variable

Classes referring to values in a vector or data frame column:

+ Numeric -- quantitative values; can perform arithmetic operations on a numeric object
+ Integer -- special case of Numeric: whole numbers
+ Character -- any alphanumeric values (e.g. customer names, movie titles)
+ Factor -- a variable intended to be used as a categorical variable (say, for Chi-Square tabulation)
+ Logical -- TRUE/FALSE binary variable
+ Date -- a variable representing date or time
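
As a quick illustration of these classes, the sketch below builds one small vector of each kind and asks R for its class. All values and object names are invented.

```r
delay   <- c(4.5, -2.0, 13.25)          # numeric
flights <- c(12L, 7L, 30L)              # integer (the L suffix marks whole numbers)
name    <- c("Alice", "Bob", "Chen")    # character
carrier <- factor(c("AA", "DL", "AA"))  # factor with two levels
late    <- delay > 0                    # logical
day     <- as.Date(c("2014-10-01", "2014-10-02", "2014-10-03"))  # Date

# Collect the class of each vector
classes <- sapply(list(delay = delay, flights = flights, name = name,
                       carrier = carrier, late = late, day = day), class)
print(classes)

# Vectors of equal length can be combined into a data frame
toy <- data.frame(delay, flights, name, carrier, late, day)
str(toy)
```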

names(ontime[1:9]) # list the variable names

## [1] "Year"          "Quarter"       "Month"         "DayofMonth"   
## [5] "DayOfWeek"     "FlightDate"    "UniqueCarrier" "AirlineID"    
## [9] "Carrier"

head(ontime[1:6])  # display the first few (head) rows for columns 1 through 6

##   Year Quarter Month DayofMonth DayOfWeek FlightDate
## 1 2014       4    10         23         4 10/23/2014
## 2 2014       4    10         23         4 10/23/2014
## 3 2014       4    10         21         2 10/21/2014
## 4 2014       4    10         21         2 10/21/2014
## 5 2014       4    10         21         2 10/21/2014
## 6 2014       4    10         21         2 10/21/2014

Next, let’s compute some summary statistics for one INTEGER variable: Arrival Delay per flight, expressed in minutes. To refer to a variable by name, we specify both the names of the data frame and the variable, separated by a “$”:

summary(ontime$ArrDelay) # five-number summary plus the mean and a count of NAs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -65.00  -11.00   -3.00    4.87    8.00 1308.00     606
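
Note that `summary` reports six statistics, since it adds the mean to the five-number summary. Base R also has `fivenum`, which returns exactly five numbers. The sketch below compares the two on a small made-up vector; a mean well above the median is one sign of positive skew.

```r
# Made-up delays in minutes, with one very late flight
x <- c(-5, -3, 0, 4, 60)

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
summary(x)   # the same five values here, plus the mean

# Compare the mean with the median
mean(x)
median(x)
```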

mean(ontime$ArrDelay) # compute just the mean

## [1] NA

The ‘mean’ function returns NA rather than a number because some of the rows in the ArrDelay column are missing values (NA). We can modify the command by telling R to remove NA values.

mean(ontime$ArrDelay, na.rm = TRUE)  # "na.rm" says "remove NA values"

## [1] 4.869554

sd(ontime$ArrDelay, na.rm = TRUE)

## [1] 34.53292
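
The same behavior is easy to reproduce on a small vector. This sketch uses invented values to show how a single NA affects `mean`, and how `is.na` can count the missing entries.

```r
x <- c(10, NA, -4, 6)   # made-up delays; one is missing

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 4: average of 10, -4 and 6
sum(is.na(x))           # 1: number of missing values
```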

The summary suggests that arrival delays are positively skewed: the mean (4.87) is well above the median (-3), and the maximum is 1,308 minutes. We explore further with some simple graphs. To reduce typing, we’ll first attach the data frame; after doing so, we can refer to variables by name without having to reference “ontime”.

In this next section, we also create some simple exploratory graphs using R’s Base Plotting System, whose commands are built into base R.
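
Attaching is convenient for a short demonstration. Another way to avoid retyping the data frame name is `with`, which evaluates a single expression using the columns of a data frame. A minimal sketch on a made-up data frame (the name `toy` and its values are invented):

```r
toy <- data.frame(DepDelay = c(0, 15, -2, 40),
                  ArrDelay = c(-5, 20, -8, 35))

# Inside with(), column names can be used directly
avg <- with(toy, mean(ArrDelay))
print(avg)

# Equivalent, spelling out the data frame name
print(mean(toy$ArrDelay))
```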

# reduce typing by ATTACHING the dataframe
attach(ontime)
# some graphs, using the base graphing commands
hist(ArrDelay, breaks=50)

hist(ArrDelay, breaks=25)

table(Carrier)  # frequency table of Carriers

## Carrier
##   AA   AS   B6   DL   EV   F9   FL   HA   MQ   OO   UA   US   VX   WN 
## 4428 1378 2048 7014 5708  799  465  613 3245 5130 4364 3470  466 9973
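
A frequency table is itself an object that can be sorted or converted to proportions. A small sketch with made-up carrier codes (the object names are illustrative):

```r
carrier <- c("WN", "DL", "WN", "AA", "WN", "DL")  # made-up carrier codes

counts <- table(carrier)
print(counts)                           # AA 1, DL 2, WN 3
print(sort(counts, decreasing = TRUE))  # most frequent first
print(prop.table(counts))               # counts as proportions of the total
```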

boxplot(ArrDelay ~ Carrier) # common syntax: variable vs factor (or Y vs X)

# some scatterplots for integer data
plot(DepDelay, ArrDelay, main="Arrival Delay vs Departure Delay")  # scatterplot: X (DepDelay) first, then Y (ArrDelay)

cols <- c(43, 51, 52, 53, 55)  # concatenate values into a vector called "cols"
pairs(ontime[cols])   # scatterplot matrix for group of columns

Now we create two additional data frames containing median arrival delays, one by Carrier and one by State. We’ll do this with the command ‘aggregate’, which splits the data into groups according to a factor variable, applies the function we specify to each group, and returns the results as a data frame:

medscar <- aggregate(ArrDelay ~ Carrier, data=ontime, FUN = median)
medscar

##    Carrier ArrDelay
## 1       AA       -1
## 2       AS       -5
## 3       B6       -7
## 4       DL       -6
## 5       EV       -3
## 6       F9       -4
## 7       FL       -9
## 8       HA       -1
## 9       MQ        1
## 10      OO       -3
## 11      UA       -3
## 12      US       -5
## 13      VX       -5
## 14      WN       -3
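
To see what `aggregate` does on a scale small enough to check by hand, consider this sketch with invented data. With the formula interface, rows with a missing value are dropped by default, which is why no `na.rm` argument is needed.

```r
toy <- data.frame(Carrier  = c("AA", "AA", "AA", "DL", "DL", "DL"),
                  ArrDelay = c(-2, 10, NA, -6, 0, 4))

# Median arrival delay for each carrier; the NA row is dropped first
meds <- aggregate(ArrDelay ~ Carrier, data = toy, FUN = median)
print(meds)
#   Carrier ArrDelay
# 1      AA        4
# 2      DL        0
```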

medsstate <- aggregate(ArrDelay ~ OriginStateName, data=ontime, FUN=median)
medsstate$State <- tolower(medsstate$OriginStateName)  # lowercase state names, to match the "region" names in the map data

And now for some more attractive graphs created by the package ggplot2.

qplot(ArrDelay, data=ontime, binwidth=10, fill=I("red")) # qplot is one syntax for quick plot; I() sets a fixed color instead of mapping it

# ggplot command has more options, allowing more control.

ggplot(medscar, aes(x=Carrier, y=ArrDelay)) +
        geom_bar(fill="blue", stat="identity") +
        geom_text(aes(label=ArrDelay), vjust=1.5)

Finally, we construct a choropleth map of arrival delays by state. This requires the package “maps” as well as a fair amount of data rearranging. Don’t worry about the details here; the point is that this is typical in analysis: the data preparation phase is often complex and time-consuming.

state_map <- map_data("state")
# Create a choropleth map
# first merge the medsstate and state_map dataframes

# merge command is in base R, and arrange is in dplyr

delay_map <- merge(state_map, medsstate, by.x="region", by.y = "State")
head(delay_map)

##    region      long      lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968     1     1      <NA>         Alabama
## 2 alabama -87.48493 30.37249     1     2      <NA>         Alabama
## 3 alabama -87.95475 30.24644     1    13      <NA>         Alabama
## 4 alabama -88.00632 30.24071     1    14      <NA>         Alabama
## 5 alabama -88.01778 30.25217     1    15      <NA>         Alabama
## 6 alabama -87.52503 30.37249     1     3      <NA>         Alabama
##   ArrDelay
## 1       -4
## 2       -4
## 3       -4
## 4       -4
## 5       -4
## 6       -4

delay_map <- arrange(delay_map, group, order)
head(delay_map)

##    region      long      lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968     1     1      <NA>         Alabama
## 2 alabama -87.48493 30.37249     1     2      <NA>         Alabama
## 3 alabama -87.52503 30.37249     1     3      <NA>         Alabama
## 4 alabama -87.53076 30.33239     1     4      <NA>         Alabama
## 5 alabama -87.57087 30.32665     1     5      <NA>         Alabama
## 6 alabama -87.58806 30.32665     1     6      <NA>         Alabama
##   ArrDelay
## 1       -4
## 2       -4
## 3       -4
## 4       -4
## 5       -4
## 6       -4
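
The two steps above, merging and then re-sorting, can be tried on a tiny example. This sketch uses made-up coordinates and state values, and it uses base R's `order` in place of dplyr's `arrange` so that it runs without extra packages. The re-sort matters because a polygon is drawn by connecting points in row order.

```r
# Made-up outline points for two states; "order" gives the drawing sequence
outline <- data.frame(region = c("ohio", "ohio", "ohio", "iowa", "iowa", "iowa"),
                      group  = c(2, 2, 2, 1, 1, 1),
                      order  = c(4, 5, 6, 1, 2, 3),
                      long   = c(-84, -81, -82, -96, -91, -93),
                      lat    = c(39, 40, 41, 41, 42, 43))

# Made-up statistic per state, with capitalized names as in the flight data
stats <- data.frame(OriginStateName = c("Iowa", "Ohio"),
                    ArrDelay = c(-4, 2))
stats$State <- tolower(stats$OriginStateName)  # match the lowercase region names

# Attach the statistic to every outline point, then restore drawing order
merged <- merge(outline, stats, by.x = "region", by.y = "State")
merged <- merged[order(merged$group, merged$order), ]
print(merged)
```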

ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay)) +
        geom_polygon(color="black")

# map looks funny -- alter projection


ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay)) +
        geom_polygon(color="black")  +
        coord_map("polyconic")

Rob Carver

January 17, 2017