This script illustrates some common tasks and “best practices”. It also illustrates the value and utility of using R Markdown when creating reports in R.
Before trying to run this code, visit our class GitHub repository and grab the r script bigdatapackages.R and the Word document called “Making your class directory the default start.docx”. Run the script once (it will take a while), and follow the defaulty starting directory instructions.
I like to start with a code block that invokes packages that will be used in the code.
library(readr) # provide additional data-reading functions
library(corrplot) # attractive correlation graphs
Each user must set a working directory for her/his own computer. I urge you to set up a directory for all of your R work. Within that, create a folder for BUS212, and then within that have 2 folders: one for Data, one for Scripts.
See the document titled “How to set your default working directory” on LATTE under the R Resources section.
In this example, my main R directory is located in my Brandeis Box account. We read the data using read_csv from package readr, rather than the usualread.csv. We’ll work with the data as a “tibble” rather than a data frame.
setwd("C:/Users/Rob/Box Sync/My R Work/BUS212")
mydata <- read_csv("Data/On_Time_50.csv")
In this example, we’ll eventually want to do some modeling on the variable called ArrDelay. First let’s get familiar with the data.
Helpful exploratory commands:
dim(mydata)
## [1] 49101 50
head(mydata)
## # A tibble: 6 × 50
## Year Quarter Month DayofMonth DayOfWeek FlightDate UniqueCarrier
## <int> <int> <int> <int> <int> <chr> <chr>
## 1 2014 4 10 23 4 10/23/2014 DL
## 2 2014 4 10 23 4 10/23/2014 DL
## 3 2014 4 10 21 2 10/21/2014 DL
## 4 2014 4 10 21 2 10/21/2014 DL
## 5 2014 4 10 21 2 10/21/2014 DL
## 6 2014 4 10 21 2 10/21/2014 DL
## # ... with 43 more variables: TailNum <chr>, FlightNum <int>,
## # Origin <chr>, OriginCityName <chr>, OriginState <chr>,
## # OriginStateFips <int>, OriginStateName <chr>, Dest <chr>,
## # DestCityName <chr>, DestState <chr>, DestStateFips <int>,
## # DestStateName <chr>, CRSDepTime <int>, DepTime <int>, DepDelay <int>,
## # DepDelayMinutes <int>, DepDel15 <int>, DepartureDelayGroups <int>,
## # DepTimeBlk <chr>, TaxiOut <int>, WheelsOff <int>, WheelsOn <int>,
## # TaxiIn <int>, CRSArrTime <int>, ArrTime <int>, ArrDelay <int>,
## # ArrDelayMinutes <int>, ArrDel15 <int>, ArrivalDelayGroups <int>,
## # ArrTimeBlk <chr>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, CRSElapsedTime <int>, Distan <int>, AirTime <int>,
## # Distance <int>, DistanceGroup <int>, CarrierDelay <int>,
## # WeatherDelay <int>, NASDelay <int>, SecurityDelay <int>,
## # LateAircraftDelay <int>
summary(mydata$ArrDelay) # compute stats for one variable
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -65.00 -11.00 -3.00 4.87 8.00 1308.00 606
Suppose that we want to create a model to account for arrival delays longer than 15 minutes. This table has 50 variables, and surely many of the variables are not relevant to our purpose. Somewhat arbitrarily, let’s select some columns that might be useful in such a model.
Next we create a new object containing 4 of the 50 variables (as specified), and only use rows where ArrDelay > 15 minutes. This is simliar to Select… Where in SQL.
# to simplify the subset command, first set up a vector of column names
varlist <- c("DepDelay", "ArrDelay", "AirTime", "Distance")
myvars <-subset (mydata, ArrDelay > 15, select=varlist)
str(myvars)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 4 variables:
## $ DepDelay: int -4 11 73 39 10 85 68 139 25 47 ...
## $ ArrDelay: int 28 28 64 42 18 67 67 134 68 31 ...
## $ AirTime : int 100 206 130 125 272 332 103 67 79 76 ...
## $ Distance: int 746 1399 1010 980 1990 2586 746 440 488 488 ...
This creates a more compact data table with fewer than 9000 rows. Suppose we want to build a regression model to estimate arrival delays. As standard practice in model-building with “big data” we first split the large data frame into two partitions:
# Demonstrate concept of partitioning a dataframe into Training
# and Testing samples
set.seed(1234) # initialize the random number generator for reproducibilty.
# Use an integer of your choosing.
# Below, "ind" will be a vectors of randomly generated 1s and 2s. There will be as
# many values as there are rows of data in the myvars df. 60% of the ind values
# will equal 1 nad 40% will equal 2.
ind <- sample(2,nrow(myvars),replace=TRUE, prob=c(0.6,0.4))
table(ind)
## ind
## 1 2
## 5274 3485
train <- myvars[ind==1,] # new df "train" will consists of randomly
# chosen rows from myvars, corresponding to the 1's in ind.
test <- myvars[ind==2,]
dim(train)
## [1] 5274 4
dim(test)
## [1] 3485 4
Before creating models, lets look at the correlations among the numeric variables. To help visualize the correlations, we’ll use the package called “corrplot”.
tr <- subset(train, select=c("ArrDelay","DepDelay","AirTime", "Distance"))
cm <- cor(tr, method="pearson")
cm
## ArrDelay DepDelay AirTime Distance
## ArrDelay 1.00000000 0.95447643 -0.01176882 -0.01592223
## DepDelay 0.95447643 1.00000000 -0.05361880 -0.02675997
## AirTime -0.01176882 -0.05361880 1.00000000 0.98126974
## Distance -0.01592223 -0.02675997 0.98126974 1.00000000
corrplot(cm, method= "ellipse", type="lower" )
Now we’ll estimate 2 “candidate” linear regression models to estimate arrival. We’ll choose the better model to re-run with the training data.
The important R function is lm (for linear model). We create the two models and then use summary to report results.
lm1 <- lm(ArrDelay ~ DepDelay, data = tr)
summary(lm1)
##
## Call:
## lm(formula = ArrDelay ~ DepDelay, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.263 -11.307 -3.147 8.465 148.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.287218 0.308209 33.38 <2e-16 ***
## DepDelay 0.917482 0.003949 232.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.24 on 5272 degrees of freedom
## Multiple R-squared: 0.911, Adjusted R-squared: 0.911
## F-statistic: 5.398e+04 on 1 and 5272 DF, p-value: < 2.2e-16
lm2 <- lm(ArrDelay ~ DepDelay + Distance, data = tr)
summary(lm2)
##
## Call:
## lm(formula = ArrDelay ~ DepDelay + Distance, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.038 -11.203 -3.122 8.246 147.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5325641 0.4456644 21.390 <2e-16 ***
## DepDelay 0.9177294 0.0039486 232.416 <2e-16 ***
## Distance 0.0009414 0.0004017 2.343 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.23 on 5271 degrees of freedom
## Multiple R-squared: 0.9111, Adjusted R-squared: 0.9111
## F-statistic: 2.702e+04 on 2 and 5271 DF, p-value: < 2.2e-16
The second model appears to be barely better than the first. Now re-estimate it with the test data set and see if the quality of the model persists.
lm1t <- lm(ArrDelay ~ DepDelay, data = test)
summary(lm1t)
##
## Call:
## lm(formula = ArrDelay ~ DepDelay, data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.453 -11.223 -3.352 7.940 127.761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.121911 0.404184 27.52 <2e-16 ***
## DepDelay 0.901678 0.005684 158.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.61 on 3483 degrees of freedom
## Multiple R-squared: 0.8784, Adjusted R-squared: 0.8784
## F-statistic: 2.516e+04 on 1 and 3483 DF, p-value: < 2.2e-16
lm2t <- lm(ArrDelay ~ DepDelay + Distance, data = test)
summary(lm2t)
##
## Call:
## lm(formula = ArrDelay ~ DepDelay + Distance, data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.266 -11.231 -3.391 7.956 127.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.013e+01 5.786e-01 17.513 <2e-16 ***
## DepDelay 9.024e-01 5.689e-03 158.635 <2e-16 ***
## Distance 1.199e-03 5.029e-04 2.384 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.6 on 3482 degrees of freedom
## Multiple R-squared: 0.8786, Adjusted R-squared: 0.8785
## F-statistic: 1.26e+04 on 2 and 3482 DF, p-value: < 2.2e-16