RStart

Getting started with RStudio using tidyverse and dplyr

Start up RStudio and look for the window labeled Console.

Type commands after the prompt > and then press the ENTER key.

Note: anything after the pound symbol # is a comment, that is, explanatory text that is not executed.

4*9   #Simple Arithmetic

## [1] 36

If you strike the ENTER key before typing a complete expression, you will see the continuation prompt, the plus sign (+). For example, suppose you wish to calculate 3+2*(8-4), but you accidentally strike the ENTER key after typing the 8:

> 3+2*(8
+

Finish the expression by typing -4) after the + and pressing ENTER:

+ -4)

## [1] 11

Create a sequence incrementing by 1:

20:30

##  [1] 20 21 22 23 24 25 26 27 28 29 30

We will create an object called dog and assign it the values 1, 2, 3, 4, 5. The symbol <- is the assignment operator.

dog <- 1:5
dog

## [1] 1 2 3 4 5

dog + 10

## [1] 11 12 13 14 15

3*dog

## [1]  3  6  9 12 15

sum(dog)

## [1] 15

The object dog is called a vector.

If you need to abort a command, press the escape key ESC. The up arrow key (↑) can be used to recall previous entries. To obtain help on any command, type ? followed by the name of the command:

?sum

Install Packages

We are now going to install two packages, dplyr and tidyverse. These packages provide an easier way to look at specific variables and perform other operations in R. (dplyr is one of the core tidyverse packages, so installing tidyverse also installs dplyr.) To install a package, call install.packages with the package name in quotes:

install.packages("tidyverse")

install.packages("dplyr")

Installation is needed only once. In each new R session, load the packages with the library function:

library(dplyr)
library(tidyverse)

Importing data

Data for the second edition of the textbook can be downloaded from the web site https://sites.google.com/site/ChiharaHesterberg/data2

For instance, let's start with the Flight Delays Case Study (see Chapter 2, Exploratory Data Analysis, of the text for a description of this data set). We use the read.csv command to import the data into our R workspace:

FlightDelays <- read.csv("https://sites.google.com/site/chiharahesterberg/data2/FlightDelays.csv")

To view the names of the variables in FlightDelays:

names(FlightDelays)

##  [1] "ID"           "Carrier"      "FlightNo"     "Destination"  "DepartTime"  
##  [6] "Day"          "Month"        "FlightLength" "Delay"        "Delayed30"

To view the first part of the data, use the head command:

 head(FlightDelays)

##   ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1  1      UA      403         DEN      4-8am Fri   May          281    -1
## 2  2      UA      405         DEN     8-Noon Fri   May          277   102
## 3  3      UA      409         DEN      4-8pm Fri   May          279     4
## 4  4      UA      511         ORD     8-Noon Fri   May          158    -2
## 5  5      UA      667         ORD      4-8am Fri   May          143    -3
## 6  6      UA      669         ORD      4-8am Fri   May          150     0
##   Delayed30
## 1        No
## 2       Yes
## 3        No
## 4        No
## 5        No
## 6        No

(What do you suppose the tail command does?)

The columns are the variables. There are two types of variables: numeric (for example, FlightLength and Delay) and factor, also called categorical (for example, Carrier and DepartTime). The rows are called observations or cases. To check the size (number of rows and columns) of the data frame, type

dim(FlightDelays)  #dim= dimension

## [1] 4029   10
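As a small sketch of head, tail, and dim working together, the following uses R's built-in mtcars data frame. Using mtcars is an assumption made only so that the example runs without downloading FlightDelays; the object names first_rows and last_rows are illustrative.

```r
# mtcars ships with R, so nothing needs to be imported for this illustration.
first_rows <- head(mtcars, 3)   # first 3 rows
last_rows  <- tail(mtcars, 3)   # last 3 rows

size <- dim(mtcars)   # number of rows, then number of columns
size
nrow(mtcars)          # rows only
ncol(mtcars)          # columns only
```

The same calls apply unchanged to FlightDelays once it has been imported.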

This tells us that there are 4029 observations and 10 columns.

Tables, bar charts and histograms

The factor variable Carrier in the FlightDelays data set assigns each flight to one of two levels: UA or AA. To obtain a summary of this variable:

FlightDelays %>% count(Carrier)

##   Carrier    n
## 1      AA 2906
## 2      UA 1123

Remark: R is case-sensitive! Carrier and carrier are considered different.

To create a contingency table summarizing the relationship between carrier and whether or not a flight was delayed by more than 30 minutes, type:

FlightDelays %>% count(Carrier, Delayed30)

##   Carrier Delayed30    n
## 1      AA        No 2513
## 2      AA       Yes  393
## 3      UA        No  919
## 4      UA       Yes  204

Adding a prop column with mutate gives joint or conditional distributions:

FlightDelays %>% count(Carrier, Delayed30) %>% mutate(prop = n/sum(n))  #joint distribution (sum of all cells = 1)

##   Carrier Delayed30    n       prop
## 1      AA        No 2513 0.62372797
## 2      AA       Yes  393 0.09754281
## 3      UA        No  919 0.22809630
## 4      UA       Yes  204 0.05063291

FlightDelays %>% group_by(Carrier) %>% count(Carrier, Delayed30) %>% mutate(prop = n/sum(n))

## # A tibble: 4 x 4
## # Groups:   Carrier [2]
##   Carrier Delayed30     n  prop
##   <fct>   <fct>     <int> <dbl>
## 1 AA      No         2513 0.865
## 2 AA      Yes         393 0.135
## 3 UA      No          919 0.818
## 4 UA      Yes         204 0.182

The second table gives the conditional distributions: within each carrier, the proportions sum to 1.

Thus, 9.8% of all flights were American Airlines flights delayed by more than 30 minutes, whereas of all American Airlines flights, 13.5% were delayed by more than 30 minutes.

To visualize the distribution of a factor variable, create a bar chart:

barplot(table(select(FlightDelays, Carrier)))

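To visualize the distribution of a numeric variable, create a histogram. Pass the numeric values themselves to hist; wrapping them in table would plot a histogram of the frequency counts instead.

hist(FlightDelays$Delay)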
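The joint and conditional proportions discussed in this section can be cross-checked in base R with table and prop.table. The sketch below uses a small made-up data frame (the object flights and its values are hypothetical, not the FlightDelays data).

```r
# Hypothetical toy data, not the FlightDelays values.
flights <- data.frame(
  Carrier   = c("AA", "AA", "AA", "UA", "UA"),
  Delayed30 = c("No", "No", "Yes", "No", "Yes")
)

counts <- table(flights$Carrier, flights$Delayed30)

joint       <- prop.table(counts)              # all cells together sum to 1
conditional <- prop.table(counts, margin = 1)  # each row (carrier) sums to 1

joint
conditional
```

With margin = 1 the proportions are computed within rows, which corresponds to grouping by Carrier before dividing by the group total.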

Numeric Summaries

Because it is a bit cumbersome to refer to the Delay column of FlightDelays each time we want to work with it, we will create a new object delay. The $ syntax extracts the column as a numeric vector (pull(FlightDelays, Delay) is the dplyr equivalent). Note that select(FlightDelays, Delay) returns a one-column data frame rather than a vector, and mean applied to a data frame returns NA with a warning.

delay <- FlightDelays$Delay

mean(delay)

To compute the trimmed mean by removing the lowest and highest 25% of values:

mean(delay, trim = .25)

Other basic statistics:

max(delay)

## [1] 693

min(delay)

## [1] -19

range(delay)

## [1] -19 693

var(delay) #variance

## [1] 1733.098

sd(delay) #standard deviation

quantile(delay) #quartiles

If you need the population variance (that is, denominator of 1/n instead of 1/(n-1)):

n <- length(delay)
(n-1)/n*var(delay)
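To make the trimmed mean and the population variance concrete, here is a minimal sketch on a short made-up vector (v and its values are hypothetical, chosen so the arithmetic can be checked by hand).

```r
# Hypothetical values, chosen so the results are easy to verify by hand.
v <- c(1, 2, 3, 4, 100)

plain_mean   <- mean(v)              # (1 + 2 + 3 + 4 + 100) / 5 = 22
trimmed_mean <- mean(v, trim = 0.2)  # drops 1 and 100, averages 2, 3, 4 to get 3

n <- length(v)
sample_var     <- var(v)                # denominator n - 1
population_var <- (n - 1) / n * var(v)  # denominator n

# Direct computation of the population variance for comparison.
direct_population_var <- mean((v - mean(v))^2)
```

The single extreme value 100 pulls the ordinary mean far above the trimmed mean, which is why the trimmed mean is often reported for skewed data such as flight delays.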

The group_by and summarise commands

The group_by and summarise commands allow you to compute numeric summaries of a variable within each level of a factor variable (the base R counterpart is tapply). The mean must be computed from the Delay column inside each group; dividing the overall sum of delays by the group count does not give the group mean. For instance, to find the mean flight delay length by carrier:

FlightDelays %>% group_by(Carrier) %>% summarise(n = n(), mean = mean(Delay))

and by departure time:

FlightDelays %>% group_by(DepartTime) %>% summarise(n = n(), mean = mean(Delay))
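A minimal sketch of a per-group mean, using group_by and summarise on a small made-up data frame (toy and its values are hypothetical, not the FlightDelays data). The base R function tapply is included as a cross-check.

```r
library(dplyr)

# Hypothetical toy data, not the FlightDelays values.
toy <- data.frame(
  Carrier = c("AA", "AA", "UA", "UA", "UA"),
  Delay   = c(10, 20, 0, 30, 60)
)

by_carrier <- toy %>%
  group_by(Carrier) %>%
  summarise(n = n(), mean = mean(Delay))  # mean computed within each carrier

by_carrier

# Base R cross-check: tapply applies mean within each level of Carrier.
base_means <- tapply(toy$Delay, toy$Carrier, mean)
base_means
```

Here AA averages (10 + 20)/2 = 15 and UA averages (0 + 30 + 60)/3 = 30.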

Boxplots

Boxplots give a visualization of the 5-number summary of a variable.

summary(select(FlightDelays, Delay)) #numeric summary

##      Delay       
##  Min.   :-19.00  
##  1st Qu.: -6.00  
##  Median : -3.00  
##  Mean   : 11.74  
##  3rd Qu.:  5.00  
##  Max.   :693.00

ggplot(FlightDelays, aes(Delay)) + geom_boxplot()

To compare the distribution of a numeric variable across the levels of a factor variable:

ggplot(FlightDelays, aes(x=Day, y=Delay)) + geom_boxplot()

ggplot(FlightDelays, aes(x=DepartTime, y=Delay)) + geom_boxplot()

In the aes() mapping, x is the factor variable and y is the numeric variable. Note that you also specify the relevant data set as the first argument to ggplot, so that the variables named in aes() are taken from that data set.
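The numbers behind a boxplot can be inspected directly in base R. The sketch below uses a short made-up vector (v is hypothetical) with fivenum and boxplot.stats; note that fivenum reports hinges, which can differ slightly from the quartiles printed by summary.

```r
v <- c(2, 4, 4, 5, 7, 9, 30)   # hypothetical values

five  <- fivenum(v)        # min, lower hinge, median, upper hinge, max
stats <- boxplot.stats(v)  # the statistics a boxplot is drawn from

five
stats$out   # values beyond 1.5 times the box length, drawn as individual points
```

For these values the hinges are 4 and 8, so the upper fence is 8 + 1.5 * 4 = 14 and the value 30 is flagged as an outlier.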

Misc. Remarks

Functions in R are called by typing their name followed by arguments surrounded by parentheses, for example hist(delay). Typing a function name without parentheses will give the code for the function.

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x0000000030e1dab8>
## <environment: namespace:stats>

We saw earlier that we can assign names to data (we created a vector called dog). Names can be any length, must start with a letter, and may contain letters or numbers:

 fish25 <- 10:35
 fish25

##  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [26] 35

Certain names are already used by R, so be careful not to reuse them: cat, c, t, T, F,... To be safe, before making an assignment, type the name:

whale

## Error: object 'whale' not found

Since no object called whale exists, it is safe to use the name:

whale <- 200
objects()

## [1] "delay"        "dog"          "fish25"       "FlightDelays" "whale"

rm(whale)
objects()

## [1] "delay"        "dog"          "fish25"       "FlightDelays"
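The same check can be made without triggering an error by using exists. This is a small sketch; the object name whale_example is hypothetical.

```r
before <- exists("whale_example")   # FALSE: nothing by that name is defined yet

whale_example <- 200
during <- exists("whale_example")   # TRUE: the object now exists

rm(whale_example)
after <- exists("whale_example")    # FALSE again after removal

builtin <- exists("t")              # TRUE: t is already a built-in function
```

A TRUE result for a name you have not defined yourself, as with t here, is a sign that the name is already in use by R.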

In general, R is space-insensitive.

 3 +4

## [1] 7

 3+ 4

## [1] 7

 mean(3+5)

## [1] 8

 mean ( 3 + 5 )

## [1] 8

BUT, the assignment operator must not have spaces! <- is different from < - .

To quit, type q(). You will be given an option to save the workspace image. Select Yes: all objects created in this session are saved to your working directory so that the next time you start up R, if you load this working directory, these objects will still be available. You will not have to re-import FlightDelays, for instance, nor recreate delay.

You can, for back-up purposes, save data to an external file by using, for instance, the write.csv command. See the help file for more information.
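As a rough sketch of backing up data with write.csv, the following writes a small made-up data frame to a temporary file and reads it back. The data frame toy and the temporary path are assumptions for illustration; in practice you would supply your own file name.

```r
# Hypothetical data standing in for a data frame you want to back up.
toy <- data.frame(id = 1:3, delay = c(-1.5, 102.25, 4))

path <- tempfile(fileext = ".csv")       # temporary file for this illustration
write.csv(toy, path, row.names = FALSE)  # row.names = FALSE omits the row labels

toy_back <- read.csv(path)               # read the file back in
toy_back
```

Setting row.names = FALSE keeps the file from gaining an extra unnamed column of row numbers when it is read back.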

Vectors in R

The basic data object in R is the vector. Even scalars are vectors of length 1. There are several ways to create vectors. The : operator creates sequences incrementing/decrementing by 1.

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

5:-3

## [1]  5  4  3  2  1  0 -1 -2 -3

The seq() function creates sequences also.

seq(0, 3, by = .2)
seq(0, 3, length.out = 15)
quantile(FlightDelays$Delay, seq(0, 1, by = .1)) #deciles of delay
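A short sketch contrasting the by and length.out arguments of seq; the object names are illustrative.

```r
by_step   <- seq(0, 1, by = 0.25)       # fixed step size: 0, 0.25, 0.5, 0.75, 1
by_length <- seq(0, 1, length.out = 5)  # fixed number of points: the same five values
downward  <- seq(10, 1, by = -3)        # a negative step counts down: 10, 7, 4, 1
```

Use by when the spacing matters and length.out when the number of points matters.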

To create vectors with no particular pattern, use the c() function (c for combine).

 c(1, 4, 8, 2, 9)

## [1] 1 4 8 2 9

 x <- c(2, 0, -4)
 x

## [1]  2  0 -4

 c(x, 0:5, x)

##  [1]  2  0 -4  0  1  2  3  4  5  2  0 -4

For vectors of characters,

c("a", "b", "c", "d")

## [1] "a" "b" "c" "d"

or logical values (note that there are no double quotes)

c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

The rep() command for repeating values:

rep("a", 5)

## [1] "a" "a" "a" "a" "a"

rep(c("a", "b"), 5)

##  [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"

rep(c("a", "b"), c(5, 2))

## [1] "a" "a" "a" "a" "a" "b" "b"
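The rep function also accepts the named arguments each and times, sketched below; the object names are illustrative.

```r
r_each  <- rep(c("a", "b"), each = 2)             # "a" "a" "b" "b"
r_times <- rep(c("a", "b"), times = 2)            # "a" "b" "a" "b"
r_both  <- rep(c("a", "b"), each = 2, times = 2)  # each is applied first, then times
```

When both are given, each element is repeated first and the resulting block is then repeated as a whole.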

The "class" attribute

Use data.class to determine the class attribute of an object.

state.name
data.class(state.name)
state.name == "Idaho"
data.class(state.name == "Idaho")
head(pull(FlightDelays, Carrier)) # pull extracts the Carrier column as a vector
data.class(pull(FlightDelays, Carrier))

Basic Arithmetic

 x <- 1:5
 x - 3

## [1] -2 -1  0  1  2

 x*10

## [1] 10 20 30 40 50

 x/10

## [1] 0.1 0.2 0.3 0.4 0.5

x^2

## [1]  1  4  9 16 25

2^x

## [1]  2  4  8 16 32

 log(x)

## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379

 w <- 6:10
 w

## [1]  6  7  8  9 10

 x*w #coordinate-wise multiplication

## [1]  6 14 24 36 50
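When the two vectors have different lengths, R recycles the shorter one. A minimal sketch, with illustrative object names:

```r
long  <- 1:6
short <- c(10, 20)

recycled <- long + short   # short is reused as 10, 20, 10, 20, 10, 20
recycled                   # 11 22 13 24 15 26

scaled <- long * 2         # a single number is recycled across every element
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which usually signals a mistake.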

Logical expressions

x < 3

## [1]  TRUE  TRUE FALSE FALSE FALSE

x == 4

## [1] FALSE FALSE FALSE  TRUE FALSE
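Because TRUE counts as 1 and FALSE as 0 in arithmetic, logical vectors can be summed or averaged to count cases. A short sketch, with illustrative object names:

```r
x <- 1:5

how_many <- sum(x < 3)    # number of elements less than 3: 2
fraction <- mean(x < 3)   # proportion of elements less than 3: 0.4
any_four <- any(x == 4)   # TRUE if at least one element equals 4
all_pos  <- all(x > 0)    # TRUE only if every element is positive
```

The same idea gives, for example, the proportion of flights delayed by more than 30 minutes as the mean of a logical comparison.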

Subsetting

In many cases, we will want only a portion of a data set. For subsetting a vector, the basic syntax is vector[index]. In particular, note the use of brackets to indicate that we are subsetting.

 z <- c(8, 3, 0, 9, 9, 2, 1, 3)

The fourth element of z:

z[4]

## [1] 9

The first, third and fourth elements:

z[c(1, 3, 4)]

## [1] 8 0 9

To return the values of z less than 4, we first introduce the which command:

which(z < 4) # which positions are z values < 4?

## [1] 2 3 6 7 8

index <- which(z < 4) # store in index
z[index] # return z[c(2, 3, 6, 7, 8)]

## [1] 3 0 2 1 3
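The which step can be skipped by placing the logical expression directly inside the brackets. A brief sketch using the same vector z; the other object names are illustrative.

```r
z <- c(8, 3, 0, 9, 9, 2, 1, 3)

small      <- z[z < 4]          # keep elements where the condition is TRUE: 3 0 2 1 3
small_pos  <- z[z < 4 & z > 0]  # combine conditions with &: 3 2 1 3
drop_first <- z[-1]             # a negative index excludes that position
```

Logical indexing and which give the same elements here; which is useful when the positions themselves are needed.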

Suppose you want to find those observations when the delay length was greater than the mean delay length. We'll store this in a vector called index. The comparison needs a numeric vector, so we extract the Delay column with the $ syntax:

index <- which(FlightDelays$Delay > mean(FlightDelays$Delay))

head(index)

## [1]  2 10 12 14 15 16

Thus, observations in rows 2, 10, 12, 14, 15, 16 are the first six that correspond to flights that had delays that were larger than the average delay length.

The subset command

The subset command is used to extract rows of the data that satisfy certain conditions. If necessary, re-import the FlightDelays data set. The $ syntax extracts the variable Delay as a vector:

delay <- FlightDelays$Delay

The subset command can also be used:

delay <- subset(FlightDelays, select = Delay, drop = TRUE)

The select = argument indicates which column to choose. The drop = TRUE argument is needed to create a vector. Compare

delay <- subset(FlightDelays, select = Delay, drop = TRUE)
head(delay)

## [1]  -1 102   4  -2  -3   0

data.class(delay)

## [1] "numeric"

delay2 <- subset(FlightDelays, select = Delay)
head(delay2)

##   Delay
## 1    -1
## 2   102
## 3     4
## 4    -2
## 5    -3
## 6     0

data.class(delay2)

## [1] "data.frame"

The second output (omitting the drop = TRUE argument) is a data.frame object (more on data frames later).

Suppose you wish to extract the flight delay lengths for all Monday flights and then find the mean delay length:

delay3 <- subset(FlightDelays, select = Delay, subset = Day == "Mon",
drop = TRUE)
mean(delay3)

## [1] 5.868254

The subset = argument is a logical expression indicating which rows to keep. Here we keep the rows for which Day is equal to "Mon". To extract the delay lengths for Saturday and Sunday only:

delay3 <- subset(FlightDelays, select = Delay,
subset = (Day == "Sat" | Day == "Sun"), drop = TRUE)
mean(delay3)

## [1] 5.325697

length(delay3) # number of weekend flights

The vertical bar | stands for "or."
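In the tidyverse style, the same kind of extraction can be sketched with filter and pull. The data frame toy below is hypothetical and stands in for FlightDelays.

```r
library(dplyr)

# Hypothetical toy data, not the FlightDelays values.
toy <- data.frame(
  Day   = c("Mon", "Sat", "Sun", "Mon", "Sat"),
  Delay = c(5, -2, 12, 7, 0)
)

weekend <- toy %>%
  filter(Day %in% c("Sat", "Sun")) %>%  # keep rows where Day is Sat or Sun
  pull(Delay)                           # extract Delay as a plain vector

weekend           # -2 12 0
mean(weekend)
length(weekend)   # 3
```

The %in% operator is a compact alternative to chaining several == comparisons with |.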

Vectorized operators

== equal to

!= not equal to

> greater than

>= greater than or equal to

< less than

<= less than or equal to

& vectorized AND

| vectorized OR

! not

Programming Note: The vectorized AND and OR are for use with vectors (when you are extracting subsets of vectors). For control in programming (ex. when writing for or if statements), the operators are && and ||.
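A small sketch of the difference between the vectorized and the control-flow operators; the vectors and the label text are made up for illustration.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

both   <- a & b   # element by element: TRUE FALSE FALSE
either <- a | b   # element by element: TRUE TRUE TRUE

z <- c(8, 3, 0)
# && expects single TRUE/FALSE values and stops at the first FALSE,
# so z[1] is only examined when z is non-empty.
if (length(z) > 0 && z[1] > 5) {
  label <- "first element exceeds 5"
} else {
  label <- "first element is 5 or less"
}
label
```

In recent versions of R, giving && or || a vector longer than one element is an error, which is another reason to keep the two families of operators apart.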

Misc. commands on vectors

length(z) # number of elements in z

## [1] 8

sum(z) # add elements in z

## [1] 35

sort(z) # sort in increasing order

## [1] 0 1 2 3 3 8 9 9

The sample command is used to obtain random samples from a set. For instance, to permute the numbers 1, 2, ..., 10:

sample(10)

##  [1] 10  8  7  1  6  4  3  2  9  5

To obtain a random sample from 1, 2, ..., 10 of size 4 (without replacement):

sample(10, 4)

## [1] 8 7 1 3

Without replacement is the default; if you want to sample with replacement:

sample(10, 4, replace = TRUE)

## [1] 2 4 7 5
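Because sample is random, your output will differ from what is printed here. If you need results that can be reproduced, one common approach is to set the random seed first, as in this sketch (the seed value 123 is arbitrary).

```r
set.seed(123)                 # 123 is an arbitrary seed value
first_draw <- sample(10, 4)

set.seed(123)                 # resetting the seed repeats the same draw
second_draw <- sample(10, 4)

identical(first_draw, second_draw)   # TRUE
```

Without the second call to set.seed, the second draw would generally differ from the first.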

The built-in vector state.name contains the names of the 50 states.

state.name

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Draw a random sample of 20 states:

 sample(state.name, 20)

##  [1] "North Dakota"   "Alaska"         "Mississippi"    "Rhode Island"  
##  [5] "Oregon"         "California"     "Connecticut"    "Colorado"      
##  [9] "Idaho"          "Texas"          "Montana"        "Arkansas"      
## [13] "Illinois"       "South Carolina" "Pennsylvania"   "Kansas"        
## [17] "South Dakota"   "Washington"     "Missouri"       "Wisconsin"

If you want to sample with replacement,

 sample(state.name, 20, replace=TRUE)

##  [1] "Tennessee"    "Oklahoma"     "Michigan"     "Hawaii"       "Wisconsin"   
##  [6] "Missouri"     "Washington"   "Louisiana"    "Illinois"     "Montana"     
## [11] "Wisconsin"    "Michigan"     "Michigan"     "Indiana"      "Alaska"      
## [16] "Virginia"     "Wyoming"      "Louisiana"    "Montana"      "Rhode Island"

Suppose you wish to create two vectors, one containing a random sample of 20 of the states, the other vector containing the remaining 30 states.

We obtain a random sample of size 20 from 1, 2, . . . , 50 and store in the vector index:

index <- sample(50, 20)

See what this sample looks like:

index

##  [1] 46 43 32 44 24 14 31 21 30 15 33 10 29  2  6 13 50 28 49 25

tempA will contain the random sample of 20 states:

tempA <- state.name[index]

tempB will contain the remaining 30 states. In subsetting, a negative index means to exclude that observation:

tempB <- state.name[-index]
tempA

##  [1] "Virginia"       "Texas"          "New York"       "Utah"          
##  [5] "Mississippi"    "Indiana"        "New Mexico"     "Massachusetts" 
##  [9] "New Jersey"     "Iowa"           "North Carolina" "Georgia"       
## [13] "New Hampshire"  "Alaska"         "Colorado"       "Illinois"      
## [17] "Wyoming"        "Nevada"         "Wisconsin"      "Missouri"

tempB

##  [1] "Alabama"        "Arizona"        "Arkansas"       "California"    
##  [5] "Connecticut"    "Delaware"       "Florida"        "Hawaii"        
##  [9] "Idaho"          "Kansas"         "Kentucky"       "Louisiana"     
## [13] "Maine"          "Maryland"       "Michigan"       "Minnesota"     
## [17] "Montana"        "Nebraska"       "North Dakota"   "Ohio"          
## [21] "Oklahoma"       "Oregon"         "Pennsylvania"   "Rhode Island"  
## [25] "South Carolina" "South Dakota"   "Tennessee"      "Vermont"       
## [29] "Washington"     "West Virginia"
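To confirm that the two vectors really split the 50 states with no overlap, a quick check along the following lines can be used. The seed is arbitrary and is set only so the sketch is repeatable; the names overlap and covers_all are illustrative.

```r
set.seed(1)                  # arbitrary seed so the split is repeatable
index <- sample(50, 20)
tempA <- state.name[index]   # the 20 sampled states
tempB <- state.name[-index]  # the remaining 30 states

length(tempA)                                       # 20
length(tempB)                                       # 30
overlap    <- intersect(tempA, tempB)               # no state appears in both
covers_all <- setequal(c(tempA, tempB), state.name) # together they give all 50
```

An empty intersection together with covers_all equal to TRUE shows that the two vectors partition state.name.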

Data Frames in R

Most data will be stored in data frames, rectangular arrays which usually are formed by combining columns of vectors. FlightDelays is an example of a data frame.

data.class(FlightDelays)

## [1] "data.frame"

Subsetting data frames

For subsetting a data frame, use the syntax data[row.index, column.index].

For instance, row 5, column 3:

FlightDelays[5, 3]

## [1] 667

or rows 1 through 10, columns 1 and 3:

FlightDelays[1:10, c(1, 3)]

##    ID FlightNo
## 1   1      403
## 2   2      405
## 3   3      409
## 4   4      511
## 5   5      667
## 6   6      669
## 7   7      673
## 8   8      677
## 9   9      679
## 10 10      681

or all rows except 1 through 10, and keep columns 1 and 3:

 #FlightDelays[-(1:10), c(1, 3)]

To extract all rows, but columns 1 and 3

#FlightDelays[, c(1, 3)]

and rows 1:100 and all columns

#FlightDelays[1:100, ]

Use the subset command to extract rows based on some logical condition.

Create a subset of just the Tuesday data:

DelaysTue <- subset(FlightDelays, subset = Day == "Tue")
head(DelaysTue)

##    ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 70 70      UA      403         DEN      4-8am Tue   May          281    -4
## 71 71      UA      405         DEN     8-Noon Tue   May          277   117
## 72 72      UA      409         DEN      4-8pm Tue   May          279   115
## 73 73      UA      511         ORD     8-Noon Tue   May          158     0
## 74 74      UA      667         ORD      4-8am Tue   May          143    -2
## 75 75      UA      669         ORD      4-8am Tue   May          150    -2
##    Delayed30
## 70        No
## 71       Yes
## 72       Yes
## 73        No
## 74        No
## 75        No

Create a subset of just the Tuesday data and columns 1, 6, 7:

DelaysTue <- subset(FlightDelays, Day == "Tue", select = c(1, 6, 7))
head(DelaysTue)

##    ID Day Month
## 70 70 Tue   May
## 71 71 Tue   May
## 72 72 Tue   May
## 73 73 Tue   May
## 74 74 Tue   May
## 75 75 Tue   May
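With dplyr loaded, the same row and column selection can be sketched with filter and select, which refer to columns by name rather than by position. The data frame toy is hypothetical and stands in for FlightDelays.

```r
library(dplyr)

# Hypothetical toy data, not the FlightDelays values.
toy <- data.frame(
  ID    = 1:4,
  Day   = c("Mon", "Tue", "Tue", "Wed"),
  Month = c("May", "May", "June", "June"),
  Delay = c(3, -4, 117, 0)
)

tue <- toy %>%
  filter(Day == "Tue") %>%   # keep only the Tuesday rows
  select(ID, Day, Month)     # keep only these three columns

tue
```

Selecting by name is less fragile than selecting by position, since it keeps working if the column order of the data changes.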