This document presents selected commands and results from the first class demonstration.
As in most programming languages, an R script consists of a logical series of commands, each obeying specific rules of syntax and performing a particular action.
This script analyzes a small portion (10%) of the flight quality data from US domestic flights for October 2014. The raw data, obtained from the US Department of Transportation, is in a comma-separated-values (CSV) file on our LATTE site. The first task in the script is to read in the CSV file and store it in an object that we will name “ontime”.
Before reading the data, we issue a command to set the Working Directory (wd), the folder on the particular computer where R looks for files.
# the "#" symbol indicates a comment, not to be executed
setwd("~/Dropbox/MyRWork/Data")
# Create object by reading in a csv file. The "read.csv" command accomplishes this
# the operator symbol "<-" assigns a value to an object
ontime <- read.csv("On_Time.csv")
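To try read.csv without the course file, the following minimal sketch reads a few made-up rows from a text string instead of a file. The values and the object names `csv_text` and `toy` are illustrative assumptions, not part of the On_Time data. Since R 4.0.0, read.csv no longer converts text columns to factors by default, so `stringsAsFactors = TRUE` is set explicitly here to reproduce the Factor columns that appear in the output in this document.

```r
# Hypothetical three-row CSV, supplied as text rather than as a file
csv_text <- "Year,Month,Carrier,ArrDelay
2014,10,AA,-3
2014,10,DL,12
2014,10,AA,NA"

# text= reads from a character string; stringsAsFactors = TRUE
# stores the Carrier column as a factor
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

dim(toy)             # number of rows, then number of columns
class(toy$Carrier)   # "factor"
class(toy$ArrDelay)  # "integer"; the NA entry is treated as missing
```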
Now we have a data frame containing all of the flight data. The data frame is too large to simply look at, so we’ll give some commands to examine the structure and some of the contents of the data frame.
dim(ontime) # show the dimensions of the data table -- can add a comment anywhere in a line!
## [1] 49101 109
str(ontime [1:9]) # show the structure of the data table (just first 9 columns)
## 'data.frame': 49101 obs. of 9 variables:
## $ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ Quarter : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Month : int 10 10 10 10 10 10 10 10 10 10 ...
## $ DayofMonth : int 23 23 21 21 21 21 21 21 21 21 ...
## $ DayOfWeek : int 4 4 2 2 2 2 2 2 2 2 ...
## $ FlightDate : Factor w/ 31 levels "10/1/2014","10/10/2014",..: 16 16 14 14 14 14 14 14 14 14 ...
## $ UniqueCarrier: Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ AirlineID : int 19790 19790 19790 19790 19790 19790 19790 19790 19790 19790 ...
## $ Carrier : Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
names(ontime [1:9]) # list the variable names
## [1] "Year" "Quarter" "Month" "DayofMonth"
## [5] "DayOfWeek" "FlightDate" "UniqueCarrier" "AirlineID"
## [9] "Carrier"
head(ontime[1:6]) # display the first few top (head) rows for columns 1 through 6
## Year Quarter Month DayofMonth DayOfWeek FlightDate
## 1 2014 4 10 23 4 10/23/2014
## 2 2014 4 10 23 4 10/23/2014
## 3 2014 4 10 21 2 10/21/2014
## 4 2014 4 10 21 2 10/21/2014
## 5 2014 4 10 21 2 10/21/2014
## 6 2014 4 10 21 2 10/21/2014
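In the str() output above, each column is labeled with its class, and the numbers printed after a Factor's levels are integer codes that point into the list of levels. The sketch below shows this on a small made-up example; the vector `carrier` and the data frame `toy` are hypothetical stand-ins for the real columns.

```r
# Hypothetical carrier codes, stored as a factor
carrier <- factor(c("DL", "AA", "DL", "WN"))

class(carrier)       # "factor"
levels(carrier)      # distinct categories in sorted order: "AA" "DL" "WN"
as.integer(carrier)  # underlying level codes: 2 1 2 3

# sapply applies class() to each column of a data frame
toy <- data.frame(DayOfWeek = c(4L, 4L, 2L, 2L), Carrier = carrier)
sapply(toy, class)   # DayOfWeek is "integer", Carrier is "factor"
```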
Note that this data frame contains two classes of variable: integers and factors. There are numerous object classes available. In our course, we’ll primarily use the following object classes:
Classes referring to values in a vector or data frame column:
+ Numeric -- quantitative values; can perform arithmetic operations on a numeric object
+ Integer -- special case of Numeric: whole numbers
+ Character -- any alphanumeric values (e.g. customer names, movie title)
+ Factor -- a variable intended to be used as a categorical variable (say, for Chi-Square tabulation)
+ Logical -- TRUE/FALSE binary variable
+ Date -- a variable representing date or time
Next, let’s compute some summary statistics for one INTEGER variable: Arrival Delay per flight, expressed in minutes. To refer to a variable by name, we specify both the names of the data frame and the variable, separated by a “$”:
summary(ontime$ArrDelay) # compute Tukey's 5-Number Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -65.00 -11.00 -3.00 4.87 8.00 1308.00 606
mean(ontime$ArrDelay) # compute just the mean
## [1] NA
# note issue of missing data
mean(ontime$ArrDelay, na.rm = TRUE) # "na.rm" says "remove NA values"
## [1] 4.869554
sd(ontime$ArrDelay, na.rm = TRUE)
## [1] 34.53292
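The missing-data behavior above can be reproduced on a small vector. This is a sketch with made-up delay values (the name `delay` is hypothetical); it shows how to count the NA values and how `na.rm = TRUE` changes the result.

```r
# Hypothetical arrival delays in minutes, with two missing values
delay <- c(-11, -3, 8, NA, 25, NA)

sum(is.na(delay))            # number of missing values: 2
mean(delay)                  # NA, because any NA propagates through the mean
mean(delay, na.rm = TRUE)    # mean of the four observed values: 4.75
median(delay, na.rm = TRUE)  # 2.5
```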
The five-number summary suggests that arrival delays are positively skewed: the mean exceeds the median, and the maximum lies far above the third quartile. We explore further with some simple graphs. To reduce typing, we’ll first attach the data frame; after doing so, we can refer to variables by name without having to reference “ontime”.
# reduce typing by ATTACHING the data frame
attach(ontime)
# some graphs, using the base graphing commands
hist(ArrDelay, breaks=50)
table(Carrier) # frequency table of Carriers
## Carrier
## AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN
## 4428 1378 2048 7014 5708 799 465 613 3245 5130 4364 3470 466 9973
boxplot(ArrDelay ~ Carrier) # common syntax: variable vs factor
# some scatterplots for integer data
plot(DepDelay, ArrDelay, main="Arrival Delay vs Departure Delay") # scatterplot of X vs Y
cols <- c(43, 51, 52, 53, 55) # concatenate values into a vector called "cols"
pairs(ontime[cols]) # scatterplot matrix for group of columns
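Attaching a data frame is convenient, but an object of the same name in the workspace will hide an attached column. As a hedged alternative, not the approach this script uses, with() evaluates a single expression inside a data frame, and columns can be picked by name rather than by position. The data frame `toy` and the vector `delay_cols` below are made up for illustration.

```r
# Hypothetical miniature version of the flight data
toy <- data.frame(
  Carrier  = c("AA", "AA", "DL", "DL", "WN"),
  DepDelay = c(5, -2, 30, 0, 12),
  ArrDelay = c(3, -8, 41, -5, 10)
)

# with() evaluates an expression using the data frame's columns
with(toy, mean(ArrDelay))   # 8.2
with(toy, table(Carrier))   # frequency table of carriers

# Columns can also be selected by name rather than by position
delay_cols <- c("DepDelay", "ArrDelay")
cor(toy[delay_cols])        # correlation matrix for the chosen columns
```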
Although the “base” R system has great functionality, sometimes it is necessary to add capabilities by adding packages to a session. RStudio comes with a large number of packages installed; we invoke a package with the library command. In the remainder of this document, we use a few packages to facilitate creation of some presentation-ready graphs.
In your R script, for a package not previously installed with RStudio or by you, the installation command is:
install.packages("doBy") # add the package name within the parentheses
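Because a package only needs to be installed once per computer, some scripts check for it before installing. The sketch below is one way to do that; `ensure_package` is a hypothetical helper name, not a function from R or from any package.

```r
# Install a package only if it is not already available, then load it.
# "ensure_package" is a hypothetical helper name used for illustration.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  # character.only = TRUE because pkg holds the package name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything
ensure_package("stats")
```

In a course script the call would name the package actually needed, for example ensure_package("doBy").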
Now we create two additional data frames containing median arrival delays, by Carrier and by State:
library(doBy) # load the package with library() in each script that uses it
## Loading required package: survival
medscar <- summaryBy(ArrDelay ~ Carrier, data=ontime, FUN=median, na.rm=TRUE)
medscar
## Carrier ArrDelay.median
## 1 AA -1
## 2 AS -5
## 3 B6 -7
## 4 DL -6
## 5 EV -3
## 6 F9 -4
## 7 FL -9
## 8 HA -1
## 9 MQ 1
## 10 OO -3
## 11 UA -3
## 12 US -5
## 13 VX -5
## 14 WN -3
medsstate <- summaryBy(ArrDelay ~ OriginStateName, data=ontime, FUN=median, na.rm=TRUE)
medsstate$State <- tolower(medsstate$OriginStateName) # lowercase state names, to match the region names in the map data used later
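For comparison, grouped medians like the two summaryBy results above can also be computed with aggregate() from base R. This is a sketch of an alternative on made-up data, not the method used in this script; note that aggregate() names the result column ArrDelay rather than ArrDelay.median.

```r
# Hypothetical delays for two carriers, including one missing value
toy <- data.frame(
  Carrier  = c("AA", "AA", "AA", "DL", "DL", "DL"),
  ArrDelay = c(-4, 2, NA, -6, -3, 10)
)

# The formula reads "ArrDelay grouped by Carrier".
# The formula interface drops rows containing NA by default (na.action = na.omit).
meds <- aggregate(ArrDelay ~ Carrier, data = toy, FUN = median)
meds   # AA: -1, DL: -3
```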
And now for some more attractive graphs created with the ggplot2 package:
library(ggplot2)
qplot(ArrDelay, data=ontime, binwidth=10, fill=I("red")) # one syntax for quick plot; I() sets a fixed color instead of mapping "red" as a variable
ggplot(medscar, aes(x=Carrier, y=ArrDelay.median)) +
  geom_bar(fill="blue", stat="identity") +
  geom_text(aes(label=ArrDelay.median), vjust=1.5)
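The quick-plot histogram can also be written in the full ggplot() syntax used for the bar chart. The sketch below assumes ggplot2 is installed and uses a made-up data frame `toy`. Newer ggplot2 releases mark qplot as deprecated, so this form may be the safer one to learn.

```r
library(ggplot2)

# Hypothetical arrival delays in minutes
toy <- data.frame(ArrDelay = c(-20, -11, -5, -3, 0, 4, 8, 15, 42, 130))

# fill is set outside aes(), so it is a fixed color rather than a mapped variable
p <- ggplot(toy, aes(x = ArrDelay)) +
  geom_histogram(binwidth = 10, fill = "red", color = "black")

print(p)  # draws the histogram
```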
Finally, we construct a choropleth map of arrival delays by state. This requires the package “maps” as well as a fair amount of data re-arranging. Don’t worry about the details here; this is typical in analysis, where the data preparation phase is often complex and time-consuming.
library(maps)
##
## # ATTENTION: maps v3.0 has an updated 'world' map. #
## # Many country borders and names have changed since 1990. #
## # Type '?world' or 'news(package="maps")'. See README_v3. #
state_map <- map_data("state")
# Create a choropleth map
# first merge the medsstate and state_map dataframes
delay_map <- merge(state_map, medsstate, by.x="region", by.y = "State")
head(delay_map)
## region long lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968 1 1 <NA> Alabama
## 2 alabama -87.48493 30.37249 1 2 <NA> Alabama
## 3 alabama -87.95475 30.24644 1 13 <NA> Alabama
## 4 alabama -88.00632 30.24071 1 14 <NA> Alabama
## 5 alabama -88.01778 30.25217 1 15 <NA> Alabama
## 6 alabama -87.52503 30.37249 1 3 <NA> Alabama
## ArrDelay.median
## 1 -4
## 2 -4
## 3 -4
## 4 -4
## 5 -4
## 6 -4
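The merge step can be seen more easily on a tiny example. In this sketch the data frames `outline` and `delays` and all of their values are made up; only the merge() call mirrors the one above.

```r
# Hypothetical outline points for two regions
outline <- data.frame(
  region = c("ohio", "ohio", "ohio", "iowa", "iowa"),
  order  = 1:5
)

# Hypothetical per-state summary; "guam" has no outline points
delays <- data.frame(
  State           = c("iowa", "ohio", "guam"),
  ArrDelay.median = c(-2, -4, 1)
)

# by.x and by.y name the key column in the first and second data frame.
# By default only keys present in both are kept, so "guam" is dropped,
# and the result is sorted by key rather than kept in the original row order.
merged <- merge(outline, delays, by.x = "region", by.y = "State")
merged
```

Each outline point receives the median delay of its state, which is why the ArrDelay.median column repeats the same value down the rows of one state.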
library(plyr) # to use ARRANGE command to reorder map file
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:maps':
##
## ozone
delay_map <- arrange(delay_map, group, order)
head(delay_map)
## region long lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968 1 1 <NA> Alabama
## 2 alabama -87.48493 30.37249 1 2 <NA> Alabama
## 3 alabama -87.52503 30.37249 1 3 <NA> Alabama
## 4 alabama -87.53076 30.33239 1 4 <NA> Alabama
## 5 alabama -87.57087 30.32665 1 5 <NA> Alabama
## 6 alabama -87.58806 30.32665 1 6 <NA> Alabama
## ArrDelay.median
## 1 -4
## 2 -4
## 3 -4
## 4 -4
## 5 -4
## 6 -4
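The reordering matters because geom_polygon connects the points of each group in row order. The same sort can be done in base R with order(); this is a sketch of an equivalent on made-up data (the data frame `pts` is hypothetical), not a replacement the script requires.

```r
# Hypothetical merged outline whose points are out of drawing order
pts <- data.frame(
  group = c(1, 1, 2, 1, 2),
  order = c(1, 3, 5, 2, 4),
  long  = c(-87.5, -87.9, -91.2, -87.6, -91.0)
)

# order() returns the row permutation that sorts by group, then by order
sorted <- pts[order(pts$group, pts$order), ]
sorted$order   # 1 2 3 4 5
```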
ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay.median)) +
  geom_polygon(color="black")
# map looks funny -- alter projection
library(mapproj)
ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay.median)) +
  geom_polygon(color="black") +
  coord_map("polyconic")