After completing this worksheet, you should have a basic familiarity with the most useful types of data structures in R (lists, data frames, and matrices). You should also be able to subset any of these data structures using the common subsetting operators ([, [[, and $).
You may find the chapters on data structures and subsetting from Hadley Wickham’s Advanced R book to be helpful.
In the previous worksheet you learned about vectors. Vectors hold a one-dimensional set of homogenous values. By homogenous we mean that that the values all have to be the same type (integer, numeric, logical, and so on) and you can’t have both, say, numeric and logical values in the same vector. By one-dimensional we mean that there can be more than one element accessed by a single index. For example, R includes a character vector of the letters in the English alphabet, helpfully called, letters.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
The letters object is a character vector because everything inside of it is a character vector.
class(letters)
## [1] "character"
And it has twenty-six elements:
length(letters)
## [1] 26
We can subset a vector using the [ operator (actually, it’s a function). For instance, we can get the first element like this:
letters[1]
## [1] "a"
letters[25]
## [1] "y"
We can get a range of values like so. Notice that we can get all the numbers one to five like this:
1:5
## [1] 1 2 3 4 5
So, we can get the first five letters like this:
letters[1:5]
## [1] "a" "b" "c" "d" "e"
letters[10:12]
## [1] "j" "k" "l"
You can get an arbitrary subset by creating a numeric (or integer) vector. Here we get the first, tenth, and twelfth letters.
letters[c(1, 10, 12)]
## [1] "a" "j" "l"
We can also do this by creating a variable and using it to do the subsetting.
what_i_want <- c(1, 3, 5, 7)
letters[what_i_want]
## [1] "a" "c" "e" "g"
letters variable to get the even letters (e.g., the second, fourth, etc.)variable <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26)
letters[variable]
## [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
seq() function (look it up with ?seq) to get the even letters in a more clever way?letters[seq(2, 26, by = 2)]
## [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
In addition to values, vectors can also have names. For example, let’s create a variable with the numbers 1 to 5, then give those values some names.
myvar <- 1:5
names(myvar) <- letters[1:5]
myvar
## a b c d e
## 1 2 3 4 5
Now we can also subset the vector based on the names:
myvar["c"]
## c
## 3
song_rankings <- c(10, 8.4, 6, 8.2, 4)
song_titles<- c("1979", "Revolution", "Baby", "Green Back Dollar", "Oops, I did it again")
names(song_rankings) <- song_titles
song_rankings["1979"]
## 1979
## 10
Vectors are one-dimensional and homogeneous. Matrices are two-dimensional and homogenous, so they have the same kind of value, but have rows and columns as well. A matrix can be used for all kinds of problems in digital history. For now, let’s imagine we have four cites, A, B, C, and D, and have measured the distances between them. For instance, the distance from A to B is 2. We can represent those distances as a matrix with 4 columns and 4 rows, where the names of the rows and columns are the cities.
city_distances <- matrix(c(0, 2, 8, 3, 2, 0, 6, 1, 8, 6, 0, 4, 3, 1, 4, 0),
nrow = 4, ncol = 4)
rownames(city_distances) <- LETTERS[1:4]
colnames(city_distances) <- LETTERS[1:4]
city_distances
## A B C D
## A 0 2 8 3
## B 2 0 6 1
## C 8 6 0 4
## D 3 1 4 0
A matrix can be subsetted in the same way that a vector can be subsetted. (Because it is a vector—just a vector with two dimensions.) For instance, we can get the third element of the matrix.
city_distances[3]
## [1] 8
city_distances[5]
## [1] 2
But matrices are more useful when we subset them by row and column. For instance, here is the value contained in the cell for the first row and third column.
city_distances[1, 3]
## [1] 8
city_distances[3, 1]
## [1] 8
The distance from city C to city A.
If a matrix has row and column names, we can subset the vector by that. For instance, here is the distance between cities B and D.
city_distances["B", "D"]
## [1] 1
city_distances["D", "C"]
## [1] 4
city_distances[1,]
## A B C D
## 0 2 8 3
A to B: 2 A to C: 8 A to D: 3
The most useful data structure in R is the data frame. Think of a data frame like a spreadsheet that holds any kind of tabular data. It is two dimensional, like a matrix; unlike a matrix, it is homogenous in that the columns can hold different kinds of data. While other langauges have add on libraries that allow for data structures like this, in R data frames are a first class citizen.
Let’s get a data frame from the historydata package. (If we also load dplyr, we will get nicer printing.)
library(historydata)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(naval_promotions)
We can use str() to get a different look at the data.
str(naval_promotions)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5705 obs. of 5 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Abbot, Joel" "Abbot, Trevett" "Abbott, Isaac" "Abbott, J. Francis" ...
## $ generation: int 3 4 4 4 4 3 3 4 4 4 ...
## $ rank : Factor w/ 5 levels "midshipman","lieutenant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "1812-06-18" "1848-10-13" "1820-05-10" "1837-12-27" ...
glimpse(naval_promotions)
## Observations: 5,705
## Variables: 5
## $ id (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ name (chr) "Abbot, Joel", "Abbot, Trevett", "Abbott, Isaac", "...
## $ generation (int) 3, 4, 4, 4, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, ...
## $ rank (fctr) midshipman, midshipman, midshipman, midshipman, mi...
## $ date (chr) "1812-06-18", "1848-10-13", "1820-05-10", "1837-12-...
?naval_promotions
length(naval_promotions)
## [1] 5
Columns (variables): 5
Rows (cases): 5705
integer, character, integer, factor, character
We can use the [ subset function to get access to rows and columns, just like we did with matrices. For instance, here we ask for just the first row and the columned named "name":
naval_promotions[1, "name"]
## Source: local data frame [1 x 1]
##
## name
## (chr)
## 1 Abbot, Joel
We can also ask for the entire first row (note the comma):
naval_promotions[1, ]
## Source: local data frame [1 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 1 Abbot, Joel 3 midshipman 1812-06-18
naval_promotions[10, "name"]
## Source: local data frame [1 x 1]
##
## name
## (chr)
## 1 Abercrombie, J.B.
naval_promotions[10, "date"]
## Source: local data frame [1 x 1]
##
## date
## (chr)
## 1 1817-01-01
Because data frames are organized by column, it is possible to extract an entire column using the $ function. Here, let’s get just the names of the people. (We’ll limit it using head() so we don’t print out all of them.)
head(naval_promotions$name, 10)
## [1] "Abbot, Joel" "Abbot, Trevett"
## [3] "Abbott, Isaac" "Abbott, J. Francis"
## [5] "Abbott, James W." "Abbott, Thomas C."
## [7] "Abbott, Walter" "Abbott, William A."
## [9] "Abercrombie, Alexander R." "Abercrombie, J.B."
The function unique() gives you, well, all the unique values in a vector. For instance:
unique(c(1, 1, 1, 2, 3))
## [1] 1 2 3
naval_promotions dataset? (Hint: use unique().)unique(naval_promotions$rank)
## [1] midshipman lieutenant master_commandant captain
## [5] left_service
## 5 Levels: midshipman lieutenant master_commandant ... left_service
naval_promotions dataset?length(unique(naval_promotions$name))
## [1] 3354
The number of rows (cases) indicate the number of promotions in total. The unique number is the number of individuals who were promoted within the Navy. Some individuals received multiple promotions over time.
sort(), head(), tail(), range(), and as.Date() to be useful. Don’t forget about na.rm = TRUE as appropriate.)range(as.Date(naval_promotions$date), na.rm = TRUE)
## [1] "1794-06-04" "1905-02-12"
We will work much, much more with data frames.
Another very useful kind of data structure is the list. A list can hold values of any type, including vectors, data frames, and even other lists. For instance, we can create a list that holds several different kinds of information:
our_class <- list(
title = "Intro to R",
year = 2016,
books = c("Basics of R", "Get Awesome at R"),
students = c("Adam", "Betsy", "Cynthia", "David")
)
str(our_class)
## List of 4
## $ title : chr "Intro to R"
## $ year : num 2016
## $ books : chr [1:2] "Basics of R" "Get Awesome at R"
## $ students: chr [1:4] "Adam" "Betsy" "Cynthia" "David"
We can get just part of the list using the [ function that we’ve become used to. For instance, to get just the title:
our_class["title"]
## $title
## [1] "Intro to R"
But notice that the returned value is a list, not a character vector
is.list(our_class["title"])
## [1] TRUE
is.character(our_class["title"])
## [1] FALSE
R has another subset operator, [[. The single bracket ([) gives us what we asked for inside a list; the double bracket ([[) simplifies the list to give us the vector (or whatever) we asked for.
our_class[["title"]]
## [1] "Intro to R"
is.list(our_class[["title"]])
## [1] FALSE
is.character(our_class[["title"]])
## [1] TRUE
our_class list, get just the year (as a list). Get it as a numeric vector.our_class["year"]
## $year
## [1] 2016
is.list(our_class["year"])
## [1] TRUE
our_class[["year"]]
## [1] 2016
is.numeric(our_class[["year"]])
## [1] TRUE
our_class list, get the class title, the book list, and the year (as a list). (Hint: remember the different kinds of subsetting we did with [.)our_class[c("title","books","year")]
## $title
## [1] "Intro to R"
##
## $books
## [1] "Basics of R" "Get Awesome at R"
##
## $year
## [1] 2016
R also lets us use the $ operator we used to get columns of a data frame.
our_class$title
## [1] "Intro to R"
our_class:str(our_class$students)
## chr [1:4] "Adam" "Betsy" "Cynthia" "David"
?str
$ equivalent to [ or [[?The
$is equivalent to the double bracket[[in that it produces a vector of characters containing all four students names. The$is not equavalaent to the single bracket[as the output is a list and not a character vector.
mountain_meadows_massacre <- list(
groups = c("Paiutes", "Mormons", "Baker-Fancher Party"),
year = 1857,
location= "Mountain Meadows, Utah",
num_killed = 120,
survivors = 17
)
mountain_meadows_massacre$groups
## [1] "Paiutes" "Mormons" "Baker-Fancher Party"
mountain_meadows_massacre["year"]
## $year
## [1] 1857
mountain_meadows_massacre[["survivors"]]
## [1] 17
is.list(mountain_meadows_massacre)
## [1] TRUE
is.data.frame(mountain_meadows_massacre)
## [1] FALSE
is.list(naval_promotions)
## [1] TRUE
is.data.frame(naval_promotions)
## [1] TRUE
$, [, and [[ on both data frames and lists. What is the relationship between a data frame and a list? (Hint: use is.list() and is.data.frame() on a list and a data frame.)It would appear that a data.frame is a specific type of list. Thus all data frames are lists but not all lists are data frames.
We have done subsetting above using numeric and character vectors. We can also do subsetting using logical vectors. Let’s create a sample dataset of the heights of soldiers.
set.seed(3929)
heights <- rnorm(20, mean = 69)
names(heights) <- letters[1:20]
heights
## a b c d e f g h
## 67.82345 69.04801 69.59076 68.86947 68.54834 68.13089 68.82323 67.90888
## i j k l m n o p
## 71.88773 66.51703 68.58429 68.23702 70.36570 68.33046 69.48776 69.86949
## q r s t
## 69.78694 68.90278 68.82545 69.47962
We can find all the soldiers who are taller than average like this. First let’s compare all the heights to the mean height.
heights > mean(heights)
## a b c d e f g h i j k l
## FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## m n o p q r s t
## TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
Notice that we get a logical vector as a result. We can use that within the [ operator to get just the soldiers who are taller than average.
heights[heights > mean(heights)]
## b c i m o p q t
## 69.04801 69.59076 71.88773 70.36570 69.48776 69.86949 69.78694 69.47962
heights[heights > 70]
## i m
## 71.88773 70.36570
This kind of subsetting also works for data frames. Here we get all the officers from the first generation. (Notice the comma.)
first_gen <- naval_promotions[naval_promotions$generation == 1, ]
head(first_gen, 10)
## Source: local data frame [10 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 76 Archer, John 1 lieutenant 1798-11-08
## 2 117 Bainbridge, William 1 lieutenant 1798-08-03
## 3 127 Baker, Thomas 1 lieutenant 1798-05-25
## 4 141 Ballard, John 1 lieutenant 1798-10-02
## 5 168 Barron, James 1 lieutenant 1798-03-09
## 6 180 Barton, Jeremiah 1 lieutenant 1798-06-08
## 7 259 Blair, George 1 lieutenant 1799-03-13
## 8 434 Burns, James 1 lieutenant 1798-10-29
## 9 453 Byrne, Gerald 1 lieutenant 1799-06-17
## 10 474 Campbell, James 1 lieutenant 1799-09-20
#Wasn't sure if you meant the captian promotions for the first_gen data frame or the naval_promotions data frame. So I did both.
first_gen[first_gen$rank == "captain", ]
## Source: local data frame [34 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 117 Bainbridge, William 1 captain 1800-05-20
## 2 127 Baker, Thomas 1 captain 1799-07-13
## 3 168 Barron, James 1 captain 1799-05-22
## 4 170 Barron, Samuel 1 captain 1798-09-13
## 5 173 Barry, John 1 captain 1794-06-06
## 6 387 Brown, Moses 1 captain 1798-09-15
## 7 472 Campbell, Hugh G. 1 captain 1800-10-16
## 8 525 Cassin, John 1 captain 1812-07-03
## 9 542 Chapman, Jonathan 1 captain 1798-09-10
## 10 703 Cross, George 1 captain 1798-09-10
## .. ... ... ... ... ...
naval_promotions[naval_promotions$rank == "captain", ]
## Source: local data frame [413 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 1 Abbot, Joel 3 captain 1850-10-03
## 2 14 Adams, Henry A. 3 captain 1855-09-14
## 3 15 Adams, Henry A., Jr. 4 captain 1877-03-28
## 4 29 Alden, James 4 captain 1863-01-02
## 5 56 Almy, John J. 4 captain 1865-03-03
## 6 58 Ammen, Daniel 4 captain 1866-07-25
## 7 73 Angus, Samuel 2 captain 1816-04-27
## 8 82 Armstrong, James 3 captain 1841-09-08
## 9 84 Armstrong, James F. 4 captain 1866-09-27
## 10 85 Armstrong, William M. 3 captain 1855-03-24
## .. ... ... ... ... ...
naval_promotions$date <- as.Date(naval_promotions$date)
naval_promotions[naval_promotions$date >= as.Date("1800-01-01") & naval_promotions$date <= as.Date("1800-12-31"), ]
## Source: local data frame [190 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (date)
## 1 31 Aldrick, Samuel 2 midshipman 1800-07-07
## 2 45 Allen, Samuel 2 midshipman 1800-12-13
## 3 47 Allen, William H. 2 midshipman 1800-04-28
## 4 66 Anderson, Thomas O. 2 midshipman 1800-04-14
## 5 72 Angier, Charles 2 midshipman 1800-09-03
## 6 154 Barker, William W. 2 midshipman 1800-03-26
## 7 195 Beale, Thomas T. 2 midshipman 1800-02-17
## 8 203 Beck, Daniel 2 midshipman 1800-11-24
## 9 227 Bennett, Edward 2 midshipman 1800-01-03
## 10 233 Bentley, Peter E. 2 midshipman 1800-04-12
## .. ... ... ... ... ...