Start up RStudio and look for the window labeled Console.
Type commands after the prompt > and then press the ENTER key.
Note: anything after the pound symbol # is a comment, that is, explanatory text that is not executed.
4*9 #Simple Arithmetic
## [1] 36
If you strike the ENTER key before typing a complete expression, you will see the continuation prompt, the plus sign (+). For example, suppose you wish to calculate 3+2*(8-4), but you accidentally strike the ENTER key after typing the 8:
3+2*(8
+
Finish the expression by typing -4) after the plus sign and striking ENTER:
+ -4)
## [1] 11
Create a sequence incrementing by 1:
20:30
## [1] 20 21 22 23 24 25 26 27 28 29 30
We will create an object called dog and assign it the values 1, 2, 3, 4, 5. The symbol <- is the assignment operator.
dog <- 1:5
dog
## [1] 1 2 3 4 5
dog + 10
## [1] 11 12 13 14 15
3*dog
## [1] 3 6 9 12 15
sum(dog)
## [1] 15
The object dog is called a vector. If you need to abort a command, press the escape key ESC. The up arrow key (↑) can be used to recall previous entries. To obtain help on any command, type ? followed by the name of the command:
?sum
We are now going to install two new packages, dplyr and tidyverse (the tidyverse collection already includes dplyr, so installing both is harmless but redundant). These packages provide an easier way to look at specific variables and perform other operations in R. To install a package, call the install.packages command with the package name in quotes:
install.packages("tidyverse")
install.packages("dplyr")
A package needs to be installed only once. In each new R session, load the packages with the library function:
library(dplyr)
library(tidyverse)
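A common pattern, sketched here as an assumption rather than something this tutorial requires, is to install a package only when it is missing. The example uses the stats package (which ships with R) as a stand-in so that nothing is downloaded; in practice you would substitute "dplyr" or "tidyverse".

```r
# Sketch: install a package only if it is not already available.
pkg <- "stats"  # stand-in name; replace with "dplyr" or "tidyverse"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)
is_loaded <- pkg %in% loadedNamespaces()
is_loaded
```

Because the check runs every time, the same lines can sit at the top of a script without triggering a fresh download on each run.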
Data for the second edition of the textbook can be downloaded from the web site https://sites.google.com/site/ChiharaHesterberg/data2
For instance, let's start with the Flight Delays Case Study (see Chapter 2, Exploratory Data Analysis, of the text for a description of this data set). We use the read.csv command to import the data into our R workspace:
FlightDelays <- read.csv("https://sites.google.com/site/chiharahesterberg/data2/FlightDelays.csv")
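The output in this tutorial lists text columns such as Carrier as factors. Since R 4.0.0, read.csv imports text columns as character vectors unless stringsAsFactors = TRUE is given, so results may look different on a newer installation. The sketch below uses a small made-up CSV string (not the real FlightDelays file) to show the difference.

```r
# Toy CSV text standing in for a downloaded file (values are made up)
csv_text <- "Carrier,Delay\nUA,-1\nAA,12\nUA,4"

# Ask explicitly for factors so the result does not depend on the R version
flights_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
flights_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

data.class(flights_fct$Carrier)  # "factor"
data.class(flights_chr$Carrier)  # "character"
levels(flights_fct$Carrier)      # "AA" "UA"
```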
To view the names of the variables in FlightDelays:
names(FlightDelays)
## [1] "ID" "Carrier" "FlightNo" "Destination" "DepartTime"
## [6] "Day" "Month" "FlightLength" "Delay" "Delayed30"
To view the first part of the data, use the head command:
head(FlightDelays)
## ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1 1 UA 403 DEN 4-8am Fri May 281 -1
## 2 2 UA 405 DEN 8-Noon Fri May 277 102
## 3 3 UA 409 DEN 4-8pm Fri May 279 4
## 4 4 UA 511 ORD 8-Noon Fri May 158 -2
## 5 5 UA 667 ORD 4-8am Fri May 143 -3
## 6 6 UA 669 ORD 4-8am Fri May 150 0
## Delayed30
## 1 No
## 2 Yes
## 3 No
## 4 No
## 5 No
## 6 No
(What do you suppose the tail command does?)
The columns are the variables. There are two types of variables: numeric (for example, FlightLength and Delay) and factor, also called categorical (for example, Carrier and DepartTime). The rows are called observations or cases. To check the size (number of rows and columns) of the data frame, type:
dim(FlightDelays) #dim= dimension
## [1] 4029 10
This tells us that there are 4029 observations and 10 columns.
Tables, bar charts and histograms
The factor variable Carrier in the FlightDelays data set assigns each flight to one of two levels: UA or AA. To obtain a summary of this variable:
FlightDelays %>% count(Carrier)
## Carrier n
## 1 AA 2906
## 2 UA 1123
Remark: R is case-sensitive! Carrier and carrier are considered different. To create a contingency table summarizing the relationship between carrier and whether or not a flight was delayed by more than 30 minutes, type:
FlightDelays %>% count(Carrier, Delayed30)
## Carrier Delayed30 n
## 1 AA No 2513
## 2 AA Yes 393
## 3 UA No 919
## 4 UA Yes 204
Adding a proportion column with the mutate command gives joint or conditional distributions:
FlightDelays %>% count(Carrier, Delayed30) %>% mutate(prop = n/sum(n)) #joint distribution (sum of all cells = 1)
## Carrier Delayed30 n prop
## 1 AA No 2513 0.62372797
## 2 AA Yes 393 0.09754281
## 3 UA No 919 0.22809630
## 4 UA Yes 204 0.05063291
FlightDelays %>% group_by(Carrier) %>% count(Carrier, Delayed30) %>% mutate(prop = n/sum(n))
## # A tibble: 4 x 4
## # Groups: Carrier [2]
## Carrier Delayed30 n prop
## <fct> <fct> <int> <dbl>
## 1 AA No 2513 0.865
## 2 AA Yes 393 0.135
## 3 UA No 919 0.818
## 4 UA Yes 204 0.182
These are conditional distributions: within each carrier, the proportions sum to 1.
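Joint and conditional distributions can also be cross-checked in base R with table and prop.table. The following is a minimal sketch on made-up data; the toy data frame and its values are assumptions for illustration, not the FlightDelays data.

```r
# Toy data: 6 flights, two carriers (made-up values)
toy <- data.frame(
  Carrier   = c("AA", "AA", "AA", "AA", "UA", "UA"),
  Delayed30 = c("No", "No", "No", "Yes", "No", "Yes")
)
tab <- table(toy$Carrier, toy$Delayed30)   # contingency table of counts

joint <- prop.table(tab)               # joint distribution: all cells sum to 1
cond  <- prop.table(tab, margin = 1)   # conditional: each row (carrier) sums to 1

joint["AA", "Yes"]   # 1/6: share of all flights that are AA and delayed
cond["AA", "Yes"]    # 1/4: share of AA flights that are delayed
```

The two numbers answer different questions, which is why the divisor (all flights versus flights of one carrier) matters.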
Thus, 9.8% of all flights were American Airlines flights that were delayed by more than 30 minutes, whereas of all American Airlines flights, 13.5% were delayed by more than 30 minutes.
To visualize the distribution of a factor variable, create a bar chart:
barplot(table(select(FlightDelays, Carrier)))
To visualize the distribution of a numeric variable, create a histogram:
hist(FlightDelays$Delay)
Because it is a bit cumbersome to use the syntax select(FlightDelays, Delay) each time we want to work with the Delay variable, we will create a new object delay. Note that select returns a one-column data frame, so functions that require a numeric vector, such as mean, need the column itself, delay$Delay:
delay <- select(FlightDelays, Delay)
mean(delay$Delay)
The result, about 11.74, matches the Mean entry reported by summary(delay).
To compute the trimmed mean by removing the lowest and highest 25% of values:
mean(delay$Delay, trim = .25)
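To see what trimming does, here is a small worked example on a made-up vector (the values are assumptions chosen so the arithmetic is easy to follow). With five values and trim = 0.2, one value is dropped from each end before averaging.

```r
x <- c(1, 2, 3, 4, 100)   # made-up values with one extreme observation
plain <- mean(x)          # 22: pulled up by the value 100
# trim = 0.2 drops the lowest 20% and the highest 20% (one value from each end)
trimmed <- mean(x, trim = 0.2)
plain
trimmed                   # 3: the mean of 2, 3, 4
```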
Other basic statistics:
max(delay)
## [1] 693
min(delay)
## [1] -19
range(delay)
## [1] -19 693
var(delay) #variance
## Delay
## Delay 1733.098
sd(delay$Delay) #standard deviation
quantile(delay$Delay) #quartiles
If you need the population variance (that is, a divisor of n instead of n-1):
n <- nrow(delay) #number of observations
(n-1)/n*var(delay)
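The rescaling from sample variance to population variance can be checked by hand on a small made-up vector (the values below are an assumption for illustration). The squared deviations from the mean sum to 32, so the sample variance is 32/7 and the population variance is 32/8.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up values; their mean is 5
n <- length(x)                    # for a vector, length() is the number of observations
sample_var <- var(x)              # divides the sum of squared deviations by n - 1
pop_var <- (n - 1) / n * var(x)   # rescales so that the divisor is n
sample_var   # 32/7, about 4.571
pop_var      # 4
```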
The group_by and summarize commands
Grouping with group_by and then calling summarize allows you to compute numeric summaries of a variable for each level of a factor variable (the tapply command does the same in base R). For instance, to find the number of flights and the mean flight delay length by carrier:
FlightDelays %>% group_by(Carrier) %>% summarize(n = n(), mean = mean(Delay))
FlightDelays %>% group_by(DepartTime) %>% summarize(n = n(), mean = mean(Delay))
Each row of the result gives a departure-time window, its number of flights n (699 for 4-8am, 972 for 4-8pm, 257 for 8-Mid, 1053 for 8-Noon, and 1048 for Noon-4pm), and the mean delay within that window.
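Base R's tapply command computes the same kind of per-group summary. The sketch below uses a made-up data frame (names and values are assumptions for illustration) so that the group means can be verified by hand, and it shows that a group mean must divide the group's own total by the group's own count.

```r
toy <- data.frame(
  Carrier = c("AA", "AA", "UA", "UA", "UA"),
  Delay   = c(10, 20, 0, 30, 60)
)
# tapply(values, groups, function): apply mean to Delay within each Carrier
group_means <- tapply(toy$Delay, toy$Carrier, mean)
group_means                    # AA 15, UA 30
overall <- mean(toy$Delay)     # 24: a different quantity from either group mean
overall
```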
Boxplots give a visualization of the 5-number summary of a variable.
summary(delay) #numeric summary
## Delay
## Min. :-19.00
## 1st Qu.: -6.00
## Median : -3.00
## Mean : 11.74
## 3rd Qu.: 5.00
## Max. :693.00
ggplot(FlightDelays, aes(Delay)) + geom_boxplot()
To compare the distribution of a numeric variable across the levels of a factor variable:
ggplot(FlightDelays, aes(x=Day, y=Delay)) + geom_boxplot()
ggplot(FlightDelays, aes(x=DepartTime, y=Delay)) + geom_boxplot()
With geom_boxplot, the factor is mapped to x and the numeric variable to y inside aes(). Note that you also specify the relevant data set, as the first argument of ggplot, so that the variables in aes() are assumed to come from that data set.
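For comparison, base R's boxplot command accepts a formula of the form Numeric ~ Factor together with a data argument. The sketch below uses made-up data (an assumption for illustration) and plot = FALSE, so it only returns the statistics that would be drawn instead of opening a graphics window.

```r
toy <- data.frame(
  Day   = factor(rep(c("Mon", "Tue"), each = 5)),
  Delay = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)
# Formula syntax: numeric variable on the left, factor on the right
b <- boxplot(Delay ~ Day, data = toy, plot = FALSE)
b$names        # "Mon" "Tue"
b$stats[3, ]   # medians per group: 3 30
```

Dropping plot = FALSE draws the side-by-side boxplots.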
Misc. Remarks
Functions in R are called by typing their name followed by arguments surrounded by parentheses, for example, hist(FlightDelays$Delay). Typing a function name without parentheses will give the code for the function.
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x0000000030e1dab8>
## <environment: namespace:stats>
We saw earlier that we can assign names to data (we created a vector called dog). Names can be any length, should start with a letter, and may contain letters, numbers, periods, and underscores:
fish25 <- 10:35
fish25
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [26] 35
Certain names are already used by R for built-in functions and constants, so be careful not to reuse them: cat, c, t, T, F, ... To be safe, before making an assignment, type the name:
whale
## Error: object 'whale' not found
Since no such object exists, it is safe to use whale:
whale <- 200
objects()
## [1] "delay" "dog" "fish25" "FlightDelays" "whale"
rm(whale)
objects()
## [1] "delay" "dog" "fish25" "FlightDelays"
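The exists function offers a way to make the same check without triggering an error. This is a minimal sketch; the name whale_demo is hypothetical and chosen only for this example.

```r
# "whale_demo" is a hypothetical name chosen for this sketch
before <- exists("whale_demo")   # FALSE as long as nothing by that name exists
whale_demo <- 200
after <- exists("whale_demo")    # TRUE once the assignment has been made
taken <- exists("cat")           # TRUE: cat is already a built-in function
rm(whale_demo)                   # clean up
c(before = before, after = after, taken = taken)
```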
In general, R is space-insensitive.
3 +4
## [1] 7
3+ 4
## [1] 7
mean(3+5)
## [1] 8
mean ( 3 + 5 )
## [1] 8
However, the assignment operator must not contain spaces: <- is different from < - (the latter is read as "less than minus").
To quit, type q(). You will be given an option to save the workspace image. Select Yes: all objects created in this session are saved to your working directory so that the next time you start up R, if you load this working directory, these objects will still be available. You will not have to re-import FlightDelays, for instance, nor recreate delay.
You can, for back-up purposes, save data to an external file/disk by using, for instance, the write.csv command. See the help file for more information.
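As a rough illustration of saving data with write.csv, the sketch below writes a made-up data frame to a temporary file and reads it back. The file location (tempfile) and the toy values are assumptions for this example; for a real back-up you would supply your own file name.

```r
toy <- data.frame(ID = 1:3, Delay = c(-1.5, 102, 4.25))   # made-up values
path <- tempfile(fileext = ".csv")                        # temporary file for illustration
# row.names = FALSE avoids writing an extra column of row numbers
write.csv(toy, path, row.names = FALSE)
restored <- read.csv(path)
same <- isTRUE(all.equal(toy, restored))   # did the round trip preserve the data?
same
```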
Vectors in R
The basic data object in R is the vector. Even scalars are vectors of length 1. There are several ways to create vectors. The : operator creates sequences incrementing/decrementing by 1.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
5:-3
## [1] 5 4 3 2 1 0 -1 -2 -3
The seq() function creates sequences also.
seq(0, 3, by = .2)
seq(0, 3, length.out = 15)
quantile(delay$Delay, seq(0, 1, by = .1)) #deciles of delay
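The two ways of calling seq differ in what is held fixed: by sets the step size, while length.out sets the number of points. A short sketch with values chosen for illustration:

```r
a <- seq(0, 1, by = 0.25)        # fix the step size
b <- seq(0, 1, length.out = 5)   # fix the number of points instead
a                                # 0.00 0.25 0.50 0.75 1.00
same_seq <- isTRUE(all.equal(a, b))   # TRUE: both give the same five values
n_steps <- length(seq(0, 3, by = 0.2))   # 16 values: 0, 0.2, ..., 3
```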
To create vectors with no particular pattern, use the c() function (c for combine).
c(1, 4, 8, 2, 9)
## [1] 1 4 8 2 9
x <- c(2, 0, -4)
x
## [1] 2 0 -4
c(x, 0:5, x)
## [1] 2 0 -4 0 1 2 3 4 5 2 0 -4
For vectors of characters,
c("a", "b", "c", "d")
## [1] "a" "b" "c" "d"
or logical values (note that there are no double quotes)
c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
## [1] TRUE FALSE FALSE TRUE TRUE FALSE
The rep() command for repeating values:
rep("a", 5)
## [1] "a" "a" "a" "a" "a"
rep(c("a", "b"), 5)
## [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"
rep(c("a", "b"), c(5, 2))
## [1] "a" "a" "a" "a" "a" "b" "b"
The "class" attribute
Use data.class to determine the class attribute of an object.
state.name
data.class(state.name)
state.name == "Idaho"
data.class(state.name == "Idaho")
head(FlightDelays$Carrier)
data.class(FlightDelays$Carrier)
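For reference, here is what data.class reports for a few small objects built on the spot. The objects are made up for this sketch; the results are collected in a named vector so they can be read at a glance.

```r
classes <- c(
  number    = data.class(c(1.5, 2)),               # "numeric"
  text      = data.class(c("a", "b")),             # "character"
  logical   = data.class(c(TRUE, FALSE)),          # "logical"
  factor    = data.class(factor(c("AA", "UA"))),   # "factor"
  dataframe = data.class(data.frame(x = 1:2))      # "data.frame"
)
classes
```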
Basic Arithmetic
x <- 1:5
x - 3
## [1] -2 -1 0 1 2
x*10
## [1] 10 20 30 40 50
x/10
## [1] 0.1 0.2 0.3 0.4 0.5
x^2
## [1] 1 4 9 16 25
2^x
## [1] 2 4 8 16 32
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
w <- 6:10
w
## [1] 6 7 8 9 10
x*w #coordinate-wise multiplication
## [1] 6 14 24 36 50
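When the two vectors have different lengths, R recycles the shorter one to match the longer. A small illustration with made-up vectors (the names v and k are used only here):

```r
v <- 1:6
k <- c(1, 2)
recycled <- v * k   # k is reused as 1, 2, 1, 2, 1, 2
recycled            # 1 4 3 8 5 12
```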
Logical expressions
x < 3
## [1] TRUE TRUE FALSE FALSE FALSE
x == 4
## [1] FALSE FALSE FALSE TRUE FALSE
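Logical vectors can be summed or averaged, because TRUE counts as 1 and FALSE as 0. The following sketch shows how this yields counts and proportions; the variable names are chosen for this example.

```r
x <- 1:5
count_small <- sum(x < 3)    # number of elements less than 3
prop_small  <- mean(x < 3)   # proportion of elements less than 3
count_small   # 2
prop_small    # 0.4
```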
Subsetting
In many cases, we will want only a portion of a data set. For subsetting a vector, the basic syntax is vector[index]. In particular, note the use of brackets to indicate that we are subsetting.
z <- c(8, 3, 0, 9, 9, 2, 1, 3)
The fourth element of z:
z[4]
## [1] 9
The first, third and fourth elements:
z[c(1, 3, 4)]
## [1] 8 0 9
To return the values of z less than 4, we first introduce the which command:
which(z < 4) # which positions are z values < 4?
## [1] 2 3 6 7 8
index <- which(z < 4) # store in index
z[index] # return z[c(2, 3, 6, 7, 8)]
## [1] 3 0 2 1 3
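A logical condition can also be placed directly inside the brackets, which gives the same values without the intermediate which step. A brief sketch using the same vector:

```r
z <- c(8, 3, 0, 9, 9, 2, 1, 3)
small <- z[z < 4]     # keep the elements where the condition is TRUE
small                 # 3 0 2 1 3
same_result <- identical(small, z[which(z < 4)])   # TRUE
```

The which form is still useful when the positions themselves are needed, as in the next example.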
Suppose you want to find those observations when the delay length was greater than the mean delay length. We'll store this in a vector called index.
index <- which(delay$Delay > mean(delay$Delay))
head(index)
## [1] 2 10 12 14 15 16
Thus, observations in rows 2, 10, 12, 14, 15, 16 are the first six that correspond to flights that had delays that were larger than the average delay length.
The subset command
The subset command is used to extract rows of the data that satisfy certain conditions. If necessary, re-import the FlightDelays data set. The variable Delay can be extracted as a vector using the $ syntax:
delay <- FlightDelays$Delay
The subset command can also be used:
delay <- subset(FlightDelays, select = Delay, drop = TRUE)
The select = argument indicates which column to choose. The drop = TRUE argument is needed to return a vector rather than a data frame. Compare:
delay <- subset(FlightDelays, select = Delay, drop = TRUE)
head(delay)
## [1] -1 102 4 -2 -3 0
data.class(delay)
## [1] "numeric"
delay2 <- subset(FlightDelays, select = Delay)
head(delay2)
## Delay
## 1 -1
## 2 102
## 3 4
## 4 -2
## 5 -3
## 6 0
data.class(delay2)
## [1] "data.frame"
The second output (omitting the drop = TRUE argument) is a data.frame object (more on data frames later).
Suppose you wish to extract the flight delay lengths for all Mondays and then find the mean delay length:
delay3 <- subset(FlightDelays, select = Delay, subset = Day == "Mon",
drop = TRUE)
mean(delay3)
## [1] 5.868254
The subset = argument is a logical expression indicating which rows to keep; here we keep the rows where Day equals "Mon". To extract the delay lengths for Saturday and Sunday only:
delay3 <- subset(FlightDelays, select = Delay,
subset = (Day == "Sat" | Day == "Sun"), drop = TRUE)
mean(delay3)
## [1] 5.325697
length(delay3) #number of weekend flights
The vertical bar | stands for "or."
Vectorized operators
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
& vectorized AND
| vectorized OR
! not
Programming Note: The vectorized AND and OR are for use with vectors (when you are extracting subsets of vectors). For control in programming (for example, when writing if or while statements), the operators are && and ||.
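The difference between the vectorized and the control-flow operators can be sketched as follows. The vectors a and b and the value of x are made up for this example.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
both   <- a & b    # element by element: TRUE FALSE FALSE
either <- a | b    # element by element: TRUE TRUE TRUE

x <- 7
# && and || expect single TRUE/FALSE values, as in an if statement
if (x > 0 && x < 10) {
  msg <- "x is between 0 and 10"
} else {
  msg <- "x is outside the range"
}
msg
```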
Misc. commands on vectors
length(z) # number of elements in z
## [1] 8
sum(z) # add elements in z
## [1] 35
sort(z) # sort in increasing order
## [1] 0 1 2 3 3 8 9 9
The sample command is used to obtain random samples from a set. For instance, to permute the numbers 1, 2, ..., 10:
sample(10)
## [1] 10 8 7 1 6 4 3 2 9 5
To obtain a random sample from 1, 2, ..., 10 of size 4 (without replacement):
sample(10, 4)
## [1] 8 7 1 3
Without replacement is the default; if you want to sample with replacement:
sample(10, 4, replace = TRUE)
## [1] 2 4 7 5
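Because sample is random, your output will generally differ from what is printed here. If you need repeatable results, one option is to set the random seed first. A minimal sketch, where the seed value 123 is arbitrary:

```r
set.seed(123)             # any fixed integer works; 123 is arbitrary
first <- sample(10, 4)
set.seed(123)             # reset to the same seed
second <- sample(10, 4)
repeatable <- identical(first, second)   # TRUE: same seed, same sample
repeatable
```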
The built-in vector state.name contains the names of the 50 states.
state.name
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
Draw a random sample of 20 states:
sample(state.name, 20)
## [1] "North Dakota" "Alaska" "Mississippi" "Rhode Island"
## [5] "Oregon" "California" "Connecticut" "Colorado"
## [9] "Idaho" "Texas" "Montana" "Arkansas"
## [13] "Illinois" "South Carolina" "Pennsylvania" "Kansas"
## [17] "South Dakota" "Washington" "Missouri" "Wisconsin"
If you want to sample with replacement,
sample(state.name, 20, replace=TRUE)
## [1] "Tennessee" "Oklahoma" "Michigan" "Hawaii" "Wisconsin"
## [6] "Missouri" "Washington" "Louisiana" "Illinois" "Montana"
## [11] "Wisconsin" "Michigan" "Michigan" "Indiana" "Alaska"
## [16] "Virginia" "Wyoming" "Louisiana" "Montana" "Rhode Island"
Suppose you wish to create two vectors, one containing a random sample of 20 of the states, the other vector containing the remaining 30 states.
We obtain a random sample of size 20 from 1, 2, . . . , 50 and store in the vector index:
index <- sample(50, 20)
See what this sample looks like:
index
## [1] 46 43 32 44 24 14 31 21 30 15 33 10 29 2 6 13 50 28 49 25
tempA will contain the random sample of 20 states:
tempA <- state.name[index]
tempB will contain the remaining 30 states. In subsetting, a negative index means to exclude that observation:
tempB <- state.name[-index]
tempA
## [1] "Virginia" "Texas" "New York" "Utah"
## [5] "Mississippi" "Indiana" "New Mexico" "Massachusetts"
## [9] "New Jersey" "Iowa" "North Carolina" "Georgia"
## [13] "New Hampshire" "Alaska" "Colorado" "Illinois"
## [17] "Wyoming" "Nevada" "Wisconsin" "Missouri"
tempB
## [1] "Alabama" "Arizona" "Arkansas" "California"
## [5] "Connecticut" "Delaware" "Florida" "Hawaii"
## [9] "Idaho" "Kansas" "Kentucky" "Louisiana"
## [13] "Maine" "Maryland" "Michigan" "Minnesota"
## [17] "Montana" "Nebraska" "North Dakota" "Ohio"
## [21] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
## [25] "South Carolina" "South Dakota" "Tennessee" "Vermont"
## [29] "Washington" "West Virginia"
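It can be reassuring to confirm that the two vectors really partition the 50 states. The checks below are a sketch of one way to do so; the seed is an arbitrary assumption added only to make the example repeatable, so the sampled states will differ from those printed above.

```r
set.seed(1)                       # arbitrary seed for repeatability
index <- sample(50, 20)
tempA <- state.name[index]        # the 20 sampled states
tempB <- state.name[-index]       # everything except the sampled positions
length(tempA)                     # 20
length(tempB)                     # 30
overlap <- length(intersect(tempA, tempB))            # 0: no state is in both
covers_all <- setequal(c(tempA, tempB), state.name)   # TRUE: all 50 are covered
```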
Data Frames in R
Most data will be stored in data frames, rectangular arrays which usually are formed by combining columns of vectors. FlightDelays is an example of a data frame.
data.class(FlightDelays)
## [1] "data.frame"
Subsetting data frames
For subsetting a data frame, use the syntax data[row.index, column.index].
For instance, row 5, column 3:
FlightDelays[5, 3]
## [1] 667
or rows 1 through 10, columns 1 and 3:
FlightDelays[1:10, c(1, 3)]
## ID FlightNo
## 1 1 403
## 2 2 405
## 3 3 409
## 4 4 511
## 5 5 667
## 6 6 669
## 7 7 673
## 8 8 677
## 9 9 679
## 10 10 681
or all rows except 1 through 10, and keep columns 1 and 3 (output omitted):
FlightDelays[-(1:10), c(1, 3)]
To extract all rows, but only columns 1 and 3:
FlightDelays[, c(1, 3)]
and rows 1 through 100 and all columns:
FlightDelays[1:100, ]
Use the subset command to extract rows based on some logical condition.
Create a subset of just the Tuesday data:
DelaysTue <- subset(FlightDelays, subset = Day == "Tue")
head(DelaysTue)
## ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 70 70 UA 403 DEN 4-8am Tue May 281 -4
## 71 71 UA 405 DEN 8-Noon Tue May 277 117
## 72 72 UA 409 DEN 4-8pm Tue May 279 115
## 73 73 UA 511 ORD 8-Noon Tue May 158 0
## 74 74 UA 667 ORD 4-8am Tue May 143 -2
## 75 75 UA 669 ORD 4-8am Tue May 150 -2
## Delayed30
## 70 No
## 71 Yes
## 72 Yes
## 73 No
## 74 No
## 75 No
Create a subset of just the Tuesday data and columns 1, 6, 7:
DelaysTue <- subset(FlightDelays, Day == "Tue", select = c(1, 6, 7))
head(DelaysTue)
## ID Day Month
## 70 70 Tue May
## 71 71 Tue May
## 72 72 Tue May
## 73 73 Tue May
## 74 74 Tue May
## 75 75 Tue May
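The same extraction can be written with the bracket syntax by putting a logical condition in the row position. The sketch below compares the two forms on a made-up data frame (names and values are assumptions for illustration).

```r
toy <- data.frame(
  ID    = 1:4,
  Day   = c("Mon", "Tue", "Tue", "Wed"),
  Delay = c(5, -2, 30, 0)
)
by_subset  <- subset(toy, Day == "Tue", select = c(1, 3))
by_bracket <- toy[toy$Day == "Tue", c(1, 3)]   # rows where Day is "Tue", columns 1 and 3
by_bracket
same_rows <- isTRUE(all.equal(by_subset, by_bracket))   # TRUE for this toy data
```

Note that with brackets the column must be referred to as toy$Day, whereas subset looks up Day inside the data frame for you.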