This document presents selected commands and results from the first class demonstration. The goal is to demonstrate a few of the basic capabilities of R and RStudio.
Like most computer languages, an R script consists of a logical series of commands, each obeying specific rules of syntax and performing particular actions.
In R, we often add packages to the base set of commands and functions. We have already installed most of the packages we will need this term, but one can always install others as well. You only need to install a package once. In the example below, I want to use a package called mapproj; before running this script, I installed it manually with the command install.packages("mapproj").
Within a script, before we can use the functionality of a package, we need to invoke the package using the command library. Though you can insert library commands anywhere, it’s considered good practice to load packages right at the start of your script.
# Invoke several packages at the outset
library("dplyr") # dplyr helps to reorganize a data table
library("ggplot2") # a graphing package
library("maps") # map outlines, used for the state map at the end
library("mapproj") # map projections, used when we alter the map's projection
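Because a package only needs to be installed once, a script can check which packages are already present before installing anything. The following is a minimal sketch of that idea, not part of the original demonstration; the object names `needed`, `installed` and `missing_pkgs` are made up for illustration, and the install line is left commented out because it needs internet access.

```r
# Packages this script relies on (the same four loaded with library() above)
needed <- c("dplyr", "ggplot2", "maps", "mapproj")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Report anything that still has to be installed
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install; needs internet access
}
```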
This script analyzes a small portion (10%) of the flight quality data from US domestic flights for the month of October in 2014. The raw data, obtained from the US Department of Transportation, is in a comma-separated-values (CSV) file on our LATTE site. The first task in the script is to read in the CSV file and store it in an object that we will name “ontime”.
Before reading the data, we issue a command to set the Working Directory (wd) on the particular computer; relative file paths are then read from that folder.
# the "#" symbol indicates a comment, not to be executed
setwd("C:/Users/Rob/Box Sync/My R Work/BUS211")
# Create object by reading in a csv file. The "read.csv" command accomplishes this
# the operator symbol "<-" assigns values to an object
ontime <- read.csv("data/On_Time.csv")
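The structure output shown later in this document lists text columns as factors, which matches the default behaviour of read.csv before R 4.0.0. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so text columns arrive as character unless you ask otherwise. The sketch below illustrates this with a tiny temporary file; the rows are made up for illustration and are not taken from the flight data.

```r
# Write a tiny CSV to a temporary file (made-up rows, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Carrier,ArrDelay", "AA,5", "DL,-3", "AA,NA"), tmp)

# Default read: whether Carrier is a factor or character depends on the R version
demo_default <- read.csv(tmp)

# Ask explicitly for factors, which reproduces the older behaviour
demo_factor <- read.csv(tmp, stringsAsFactors = TRUE)

class(demo_factor$Carrier)    # the text column is now a factor
is.na(demo_factor$ArrDelay)   # the literal text NA in the file is read as a missing value
```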
Now we have a data frame containing all of the flight data. A data frame is essentially a table of data: each column represents a variable, and each row is an observation. Unlike a matrix, its columns can hold different types of values.
Data frames have metadata, such as the names of columns, or the dimensions (# of rows, # of columns).
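To make the idea of columns, rows and metadata concrete, here is a small hand-built data frame. It is only a sketch: the object name `flights` and all of its values are invented for illustration.

```r
# A three-row data frame with one column of each of three types (made-up values)
flights <- data.frame(
  Carrier  = c("AA", "DL", "WN"),
  ArrDelay = c(12L, -4L, 0L),
  Late     = c(TRUE, FALSE, FALSE)
)

dim(flights)            # number of rows, then number of columns
names(flights)          # column names
sapply(flights, class)  # class of each column
```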
The data frame is too large to simply look at, so we’ll give some commands to examine the structure and some of the contents of the data frame.
dim(ontime) # show the dimensions of the data table -- can add a comment anywhere in a line!
## [1] 49101 109
So, the full data frame contains 49,101 rows and 109 columns.
str(ontime [1:9]) # show the structure of the data table (just first 9 columns)
## 'data.frame': 49101 obs. of 9 variables:
## $ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ Quarter : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Month : int 10 10 10 10 10 10 10 10 10 10 ...
## $ DayofMonth : int 23 23 21 21 21 21 21 21 21 21 ...
## $ DayOfWeek : int 4 4 2 2 2 2 2 2 2 2 ...
## $ FlightDate : Factor w/ 31 levels "10/1/2014","10/10/2014",..: 16 16 14 14 14 14 14 14 14 14 ...
## $ UniqueCarrier: Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ AirlineID : int 19790 19790 19790 19790 19790 19790 19790 19790 19790 19790 ...
## $ Carrier : Factor w/ 14 levels "AA","AS","B6",..: 4 4 4 4 4 4 4 4 4 4 ...
Note that this data frame contains two classes of variable: integers and factors. There are numerous object classes available. In our course, we’ll primarily use the following object classes:
Classes referring to values in a vector or data frame column:
+ Numeric -- quantitative values; can perform arithmetic operations on a numeric object
+ Integer -- special case of Numeric: whole numbers
+ Character -- any alphanumeric values (e.g. customer names, movie title)
+ Factor -- a variable intended to be used as a categorical variable (say, for Chi-Square tabulation)
+ Logical -- TRUE/FALSE binary variable
+ Date -- a variable representing date or time
names(ontime [1:9]) # list the variable names
## [1] "Year" "Quarter" "Month" "DayofMonth"
## [5] "DayOfWeek" "FlightDate" "UniqueCarrier" "AirlineID"
## [9] "Carrier"
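The object classes listed earlier can be checked with the class function. The sketch below builds one small vector of each class; the object names and values are made up for illustration. The last lines show one possible way to turn a date stored as text in month/day/year form, like the FlightDate values, into a Date.

```r
x_num  <- c(1.5, 2.25)                  # Numeric
x_int  <- c(1L, 2L)                     # Integer (the L suffix marks a whole number)
x_chr  <- c("Alice", "Bob")             # Character
x_fac  <- factor(c("AA", "DL", "AA"))   # Factor with two levels
x_lgl  <- c(TRUE, FALSE)                # Logical
x_date <- as.Date("2014-10-23")         # Date

class(x_num); class(x_int); class(x_chr)
class(x_fac); class(x_lgl); class(x_date)
levels(x_fac)                           # the distinct categories of the factor

# Convert text in month/day/year form to a Date
d <- as.Date("10/23/2014", format = "%m/%d/%Y")
format(d, "%Y-%m-%d")
```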
head(ontime[1:6]) # display the first few top (head) rows for columns 1 through 6
## Year Quarter Month DayofMonth DayOfWeek FlightDate
## 1 2014 4 10 23 4 10/23/2014
## 2 2014 4 10 23 4 10/23/2014
## 3 2014 4 10 21 2 10/21/2014
## 4 2014 4 10 21 2 10/21/2014
## 5 2014 4 10 21 2 10/21/2014
## 6 2014 4 10 21 2 10/21/2014
Next, let’s compute some summary statistics for one INTEGER variable: Arrival Delay per flight, expressed in minutes. To refer to a variable by name, we specify both the names of the data frame and the variable, separated by a “$”:
summary(ontime$ArrDelay) # compute Tukey's 5-Number Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -65.00 -11.00 -3.00 4.87 8.00 1308.00 606
mean(ontime$ArrDelay) # compute just the mean
## [1] NA
The ‘mean’ function returns NA, not an error: some of the rows in the ArrDelay column are missing values (NA), and by default the mean of a vector containing NA is itself NA. We can modify the command by telling R to remove NA values.
mean(ontime$ArrDelay, na.rm = TRUE) # "na.rm" says "remove NA values"
## [1] 4.869554
sd(ontime$ArrDelay, na.rm = TRUE)
## [1] 34.53292
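The effect of na.rm is easy to see on a short vector. This is a minimal sketch with made-up values; `delays` is a hypothetical name, not a column from the flight data.

```r
delays <- c(-5, 10, NA, 20)   # one missing value among four (made-up numbers)

mean(delays)                  # NA, because one value is missing
mean(delays, na.rm = TRUE)    # mean of the three values that are present
sum(is.na(delays))            # how many values are missing
```

Counting the missing values with is.na is a useful habit before dropping them, since it shows how much data the statistic is silently ignoring.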
The summary suggests that arrival delays are positively skewed: the mean (4.87) is well above the median (-3), and the maximum (1308) lies much farther from the median than the minimum (-65) does. We explore further with some simple graphs. Also, to reduce typing, we’ll first attach the data frame. After doing so, we can then refer to variables by name without having to reference “ontime”.
In this next section, we also create some simple exploratory graphs using R’s Base Plotting System, with commands built into base R.
# reduce typing by ATTACHING the dataframe
attach(ontime)
# some graphs, using the base graphing commands
hist(ArrDelay, breaks=50)
hist(ArrDelay, breaks=25)
table(Carrier) # frequency table of Carriers
## Carrier
## AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN
## 4428 1378 2048 7014 5708 799 465 613 3245 5130 4364 3470 466 9973
boxplot(ArrDelay ~ Carrier) # common syntax: variable vs factor (or Y vs X)
# some scatterplots for integer data
plot(DepDelay, ArrDelay, main="Arrival Delay vs Departure Delay") # scatterplot: first argument is X, second is Y
cols <- c(43, 51, 52, 53, 55) # concatenate values into a vector called "cols"
pairs(ontime[cols]) # scatterplot matrix for group of columns
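Attaching a data frame saves typing, but an attached column can be confused with another object of the same name. One alternative, shown here only as a sketch and not used in the class demonstration, is the base R function with, which evaluates a single expression inside a data frame. The data frame `toy` and its values are made up for illustration.

```r
# Made-up departure and arrival delays, in minutes
toy <- data.frame(DepDelay = c(0, 15, 40), ArrDelay = c(-5, 12, 38))

# Inside with(), columns can be named directly, with no attach() and no toy$ prefix
avg_gap <- with(toy, mean(ArrDelay - DepDelay))
avg_gap
```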
Now we create two additional data frames containing median arrival delays, one by Carrier and one by State. We’ll do this with the command ‘aggregate’, which splits the data into groups defined by a factor variable and applies the function we specify (here, median) to each group:
medscar <- aggregate(ArrDelay ~ Carrier, data=ontime, FUN = median)
medscar
## Carrier ArrDelay
## 1 AA -1
## 2 AS -5
## 3 B6 -7
## 4 DL -6
## 5 EV -3
## 6 F9 -4
## 7 FL -9
## 8 HA -1
## 9 MQ 1
## 10 OO -3
## 11 UA -3
## 12 US -5
## 13 VX -5
## 14 WN -3
medsstate <- aggregate(ArrDelay ~ OriginStateName, data=ontime, FUN=median)
medsstate$State <- tolower(medsstate$OriginStateName) # lowercase state names, to match the map data used later
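The two steps above can be tried on a toy data frame, where the result is easy to check by hand. This is a sketch with made-up values; `toy` and `meds` are hypothetical names. Note that the formula form of aggregate drops rows with missing values before computing each group's statistic.

```r
# Five made-up flights for two carriers, one with a missing delay
toy <- data.frame(
  Carrier  = c("AA", "AA", "AA", "DL", "DL"),
  ArrDelay = c(-1, 4, 10, -6, NA)
)

# One row per carrier, holding the median of that carrier's delays
meds <- aggregate(ArrDelay ~ Carrier, data = toy, FUN = median)
meds

# tolower() converts text to lowercase
tolower("New York")
```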
And now for some more attractive graphs, created with the package ggplot2:
qplot(ArrDelay, data=ontime, binwidth=10, fill=I("red")) # qplot is one syntax for quick plot; I() sets a fixed fill color
# ggplot command has more options, allowing more control.
ggplot(medscar, aes(x=Carrier, y=ArrDelay)) +
  geom_bar(fill="blue", stat="identity") +
  geom_text(aes(label=ArrDelay), vjust=1.5)
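A ggplot command builds a graph in layers joined by "+". Arguments inside aes map columns of the data to features of the graph, while arguments outside aes, such as fill="blue", set a fixed value. The sketch below repeats the bar chart on a made-up data frame (`toy_meds` is a hypothetical name) and stores the plot in an object, assuming ggplot2 is installed. It uses geom_col, which is ggplot2's shorthand for geom_bar with stat="identity".

```r
library(ggplot2)

# Made-up median delays for three carriers
toy_meds <- data.frame(Carrier = c("AA", "DL", "WN"), ArrDelay = c(2, -4, -1))

p <- ggplot(toy_meds, aes(x = Carrier, y = ArrDelay)) +  # map columns to the axes
  geom_col(fill = "blue") +                              # bars, with a fixed fill color
  geom_text(aes(label = ArrDelay), vjust = 1.5)          # label each bar with its value

p  # typing the object's name draws the plot
```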
Finally, we construct a choropleth map of arrival delays by state. This requires the package “maps” as well as a fair amount of data rearranging. Don’t worry about the details here; this is typical in analysis, where the data preparation phase is often complex and time-consuming.
state_map <- map_data("state")
# Create a choropleth map
# first merge the medsstate and state_map dataframes
# merge command is in base R, and arrange is in dplyr
delay_map <- merge(state_map, medsstate, by.x="region", by.y = "State")
head(delay_map)
## region long lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968 1 1 <NA> Alabama
## 2 alabama -87.48493 30.37249 1 2 <NA> Alabama
## 3 alabama -87.95475 30.24644 1 13 <NA> Alabama
## 4 alabama -88.00632 30.24071 1 14 <NA> Alabama
## 5 alabama -88.01778 30.25217 1 15 <NA> Alabama
## 6 alabama -87.52503 30.37249 1 3 <NA> Alabama
## ArrDelay
## 1 -4
## 2 -4
## 3 -4
## 4 -4
## 5 -4
## 6 -4
delay_map <- arrange(delay_map, group, order)
head(delay_map)
## region long lat group order subregion OriginStateName
## 1 alabama -87.46201 30.38968 1 1 <NA> Alabama
## 2 alabama -87.48493 30.37249 1 2 <NA> Alabama
## 3 alabama -87.52503 30.37249 1 3 <NA> Alabama
## 4 alabama -87.53076 30.33239 1 4 <NA> Alabama
## 5 alabama -87.57087 30.32665 1 5 <NA> Alabama
## 6 alabama -87.58806 30.32665 1 6 <NA> Alabama
## ArrDelay
## 1 -4
## 2 -4
## 3 -4
## 4 -4
## 5 -4
## 6 -4
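The merge and sort steps above can be tried on two tiny data frames. This is a sketch with made-up values; `outline`, `delays` and `merged` are hypothetical names, and the sort uses base R's order function in place of dplyr's arrange so that the example needs no packages. Rows are matched where region equals State, and rows without a match in both data frames are dropped by default.

```r
# Made-up outline points: two for alabama (listed out of sequence), one for texas
outline <- data.frame(
  region = c("alabama", "alabama", "texas"),
  order  = c(2, 1, 3)
)

# Made-up delays: alabama matches an outline region, hawaii does not
delays <- data.frame(
  State    = c("alabama", "hawaii"),
  ArrDelay = c(-4, 0)
)

# Keep only rows whose region appears in both data frames
merged <- merge(outline, delays, by.x = "region", by.y = "State")

# Restore the drawing sequence of the outline points
merged <- merged[order(merged$order), ]
merged
```

The sort matters for maps because the outline of each state is drawn by connecting its points in row order.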
ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay)) +
  geom_polygon(color="black")
# map looks funny -- alter projection
ggplot(delay_map, aes(x=long, y=lat, group=group, fill=ArrDelay)) +
  geom_polygon(color="black") +
  coord_map("polyconic")