After completing this worksheet, you should feel comfortable typing commands into the R console (or REPL) and into an R Markdown document. In particular, you should know how to use values, variables, and functions, how to install and load packages, and how to use the built-in help for R and its packages.
R lets you store several different kinds of values. These values are the information that we actually want to do something with.
One kind of value is a number. Notice that typing this number, either in an R Markdown document or at the console, produces identical output:
42
## [1] 42
Create a numeric value that has a decimal point:
4.4
## [1] 4.4
Of course numbers can be added together (with +), subtracted (with -), multiplied (with *), and divided (with /), along with other arithmetical operations. Let’s add two numbers, which will produce a new number.
2 + 2
## [1] 4
2 * 2
## [1] 4
7 - 3
## [1] 4
Add two lines, one that multiplies two numbers, and another that subtracts two numbers.
Another important kind of value is a character vector. (Most other programming languages would call these strings.) These contain text. To create a string, put some characters between quotation marks "". (Single quotation marks work too, but as a matter of style, prefer double quotation marks.) For instance:
"Hello, beginning R programmer"
## [1] "Hello, beginning R programmer"
Create a string with a message to your instructor.
"Hello R Master"
## [1] "Hello R Master"
Character vectors can’t be added together with +. But they can be joined together with the paste() function.
paste("Hello", "everybody")
## [1] "Hello everybody"
Mimic the example above and paste three strings together.
paste("Hello", "Comment", "Hello")
## [1] "Hello Comment Hello"
Now explain in a sentence what happened.
It worked: paste() joined the three strings into a single string, with a space between each one.
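As a small aside, here is a sketch of how the separator can be controlled, using only base R. By default paste() puts a single space between its arguments; the sep argument changes that, and paste0() uses no separator at all. The variable names are made up for illustration.

```r
# Default behaviour: arguments are joined with a single space
spaced <- paste("Hello", "R", "programmer")

# The sep argument replaces the space with another separator
dashed <- paste("Hello", "R", "programmer", sep = "-")

# paste0() joins its arguments with no separator
squashed <- paste0("Hello", "R", "programmer")

spaced
dashed
squashed
```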
Another very important kind of value is the logical value. There are only two of them: TRUE and FALSE.
# This is true
TRUE
## [1] TRUE
# This is false
FALSE
## [1] FALSE
Notice that in the block above, the # character starts a comment. That means that from that point on, R will ignore whatever is on that line until a new line begins.
Logical values aren’t very exciting, but they are useful when we compare other values to one another. For instance, we can compare two numbers to one another.
2 < 3
## [1] TRUE
2 > 3
## [1] FALSE
2 == 3
## [1] FALSE
What does each of those comparison operators do? (Note the double equal sign: ==.)
Each operator compares the value on its left with the value on its right and returns a logical value. < asks whether the left value is less than the right one, > asks whether it is greater, and == asks whether the two are equal. 2 is less than 3, so the first comparison is TRUE. 2 is neither greater than 3 nor equal to 3, so the last two comparisons are FALSE.
Create your own comparisons between numeric values. See if you can create a comparison between character vectors.
4 < 2
## [1] FALSE
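The second half of the exercise, comparing character vectors, might look like the following sketch. It assumes nothing beyond base R, and the strings and variable names are invented for illustration. Ordering with < depends on the alphabetical rules of your locale, so treat the last comparison as typical rather than guaranteed.

```r
# == and != test whether two strings are exactly the same
same <- "apple" == "apple"
different <- "apple" != "Apple"   # comparisons are case-sensitive

# < and > compare strings in alphabetical order
earlier <- "apple" < "banana"

same
different
earlier
```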
R has a special kind of value: the missing value. This is represented by NA.
NA
## [1] NA
Try adding 2 + NA.
2 + NA
## [1] NA
Does that answer make sense? Why or why not?
The answer makes sense once you read NA as “a value we don’t know.” Adding 2 to an unknown number gives another unknown number, so R returns NA instead of guessing.
We will come back to missing values.
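In the meantime, here is a brief sketch of how NA behaves, using the base R function is.na(). The variable name unknown is made up. Note that even asking whether NA equals NA gives NA, which is why is.na() is the usual way to test for missing values.

```r
unknown <- NA

# is.na() asks whether a value is missing
is.na(unknown)

# Arithmetic with a missing value is itself missing
is.na(2 + unknown)

# Comparing with == does not work: the result is NA, not TRUE
unknown == NA
```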
We wouldn’t be able to get very far if we only used values. We also need a place to store them, a way of writing them to the computer’s memory. We can do that by assignment to a variable. Assignment has three parts: it has the name of a variable (which cannot contain spaces), an assignment operator, and a value that will be assigned. Most programming languages use a rinky-dink = for assignment, which works in R too. But R is awesome because the assignment operator is <-, a lovely little arrow which tells you that the value goes into the variable. For example:
number <- 42
Notice that nothing was printed as output when we did that. But now we can just type number and get the value which is stored in the variable.
number
## [1] 42
It works with character vectors too.
computer_name <- "HAL 9000"
No output, but this works.
computer_name
## [1] "HAL 9000"
In the assignment above, what is the name of the variable? What is the assignment operator? What is the value assigned to the variable?
The variable is computer_name. The assignment operator is <-. The value assigned to the variable is "HAL 9000".
Notice that we can use variables any place that we used to use values. For example:
x <- 2
y <- 5
x * y
## [1] 10
x + 9
## [1] 11
Explain in your own words what just happened.
By assigning the number 2 to x and the number 5 to y, we told R to treat x * y the same as 2 * 5, which is 10. Similarly, when x is added to 9, R uses the 2 stored in x, so the answer is 11.
Now create two assignments. Assign a number to a variable and a character vector to a different variable.
s <- 4
w <- 10
s * w
## [1] 40
w - s
## [1] 6
Now create a third variable with a numeric value and, using the variable with a numeric value from earlier, add them together.
f <- 20
f + s
## [1] 24
Can you predict what the result of running this code will be? (That is, what value is stored in a?)
a <- 10
b <- 20
a <- b
a
Predict your answer, then run the code. What is the value stored in a by the end? Explain why you were right or wrong.
I predicted that a would still be 10, but I was wrong: a is 20. The line a <- b takes the value currently stored in b (20) and stores it in a, replacing the 10 that was there before. b itself is unchanged.
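The exercise can be extended with one more line to show that assignment copies a value rather than linking the two variables. This is a minimal sketch; the extra assignment to b is an addition for illustration.

```r
a <- 10
b <- 20
a <- b    # copy the value currently in b (20) into a
b <- 30   # changing b afterwards does not change a

a
b
```

After this runs, a still holds 20 and b holds 30.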
Variables are better than just values, but we still need to be able to store multiple values. If we have to store each value in its own variable, then we are going to need a lot of variables. R is a beautiful language because every value is actually a vector. That means it can store more than one value.
Notice the funny output here:
"Some words"
## [1] "Some words"
What does the [1] in the output mean? It is a position marker: the item printed right after it is the first item in the value. Here the value has exactly one item inside it, which we can test with the length() function.
length("Some words")
## [1] 1
Sure enough, the length is 1: R is counting the number of items, not the number of words or characters.
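To make the difference between items and characters concrete, here is a short sketch using base R's nchar(), which counts characters. The variable names are made up for illustration.

```r
phrase <- "Some words"

# One item in the vector...
length(phrase)

# ...containing ten characters (the space counts)
nchar(phrase)

# Two separate items, each with its own character count
two_items <- c("Some", "words")
length(two_items)
nchar(two_items)
```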
That would seem to imply that we can add multiple items (or values) inside a vector. R lets us do that with the c() (for “combine”) function.
many <- c(1, 5, 2, 3, 7)
many
## [1] 1 5 2 3 7
What is the length of the vector stored in many?
length(many)
## [1] 5
Let’s try multiplying many by 2:
many * 2
## [1] 2 10 4 6 14
What happened?
R recognizes each number in the series individually and multiplies each by 2.
What happens when you add 2 to many?
many + 2
## [1] 3 7 4 5 9
Each number goes up by 2.
Can you create a variable containing several names as a character vector?
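One possible answer, as a sketch: c() works on strings just as it does on numbers. The names and the variable historians are chosen only for illustration. Like the arithmetic above, paste() is applied to every item in the vector.

```r
# A character vector with three items
historians <- c("Herodotus", "Thucydides", "Sima Qian")

length(historians)

# paste() is applied to each item in turn
greetings <- paste("Dear", historians)
greetings
```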
Hard question: what is happening here? Why does R give you a warning message?
c(1, 2, 3, 4, 5) + c(10, 20)
## Warning in c(1, 2, 3, 4, 5) + c(10, 20): longer object length is not a
## multiple of shorter object length
## [1] 11 22 13 24 15
R adds the two vectors item by item. Because the second vector has only two items, R recycles it, repeating it as 10, 20, 10, 20, 10 until it is as long as the first vector. The warning appears because 5 is not a multiple of 2, so the last repetition of the shorter vector is cut off partway through.
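The recycling behind that warning can be spelled out by hand. This sketch uses base R's rep_len() to stretch the shorter vector explicitly and checks that the result matches what R did on its own; the variable names are made up, and suppressWarnings() is used only to keep the output quiet.

```r
long_vec <- c(1, 2, 3, 4, 5)
short_vec <- c(10, 20)

# Repeat the short vector until it has as many items as the long one
recycled <- rep_len(short_vec, length(long_vec))
recycled

# Adding the stretched vector by hand...
manual <- long_vec + recycled

# ...gives the same result as letting R recycle automatically
automatic <- suppressWarnings(long_vec + short_vec)
identical(manual, automatic)
```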
Wouldn’t it be nice to be able to do something with data? Let’s take some made up data: the price of books that you or I have bought recently.
book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)
We can find the total amount that I spent using the sum() function.
sum(book_prices)
## [1] 139.78
Try finding the average price of the books (using mean()) and the median price of the books (using median()).
mean(book_prices)
## [1] 23.29667
median(book_prices)
## [1] 21.84
Can you figure out how to find the most expensive book (hint: the book with the maximum price) and the least expensive book (hint: the book with the minimum price)?
max(book_prices)
## [1] 45
min(book_prices)
## [1] 5.33
Hard question: what is happening here?
book_prices >= mean(book_prices)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE
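R compares each book price to the mean price (about 23.30) and returns a logical vector with one answer per book. Only the second and fourth books (25.78 and 45.00) cost at least as much as the mean, so only those two positions are TRUE.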
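A logical vector like this one is useful for more than reading by eye. The sketch below previews two common uses in base R: counting the TRUE values with sum(), and using square brackets to keep only the matching prices. The variable name above_average is made up for illustration.

```r
book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)

# One TRUE or FALSE per book
above_average <- book_prices >= mean(book_prices)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(above_average)

# Square brackets keep only the items where the logical vector is TRUE
book_prices[above_average]
```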
Let’s try a different set of book prices. This time, we have a vector of book prices, but there are some books for which we don’t know how much we paid. Those are the NA values.
more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)
How many books did we buy? (Hint: what is the length of the vector?)
length(more_books)
## [1] 9
Let’s try finding the total using sum().
sum(more_books)
## [1] NA
That wasn’t very helpful. Why did R give us an answer of NA?
Because NA stands for a value we don’t know. R cannot add up all of the prices when some of them are unknown, so the total is unknown (not available) too.
We need to find a way to get the value of the books that we know about. This is an option to the sum() function. If you know the name of a function, you can search for it by typing a question mark followed without a space by the name of the function. For example, ?sum. Look up the sum() function’s documentation. Read at least the “Arguments” and the “Examples” section. How can you get the sum for the values which aren’t missing?
sum(more_books, na.rm = TRUE)
## [1] 139.78
Look up the documentation for ?mean, ?max, ?min and see if you can use those functions on a vector with missing values.
max(more_books, na.rm=TRUE)
## [1] 45
mean(more_books, na.rm=TRUE)
## [1] 23.29667
min(more_books, na.rm=TRUE)
## [1] 5.33
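Before dropping missing values with na.rm, it can help to know how many there are. This is a sketch using base R's is.na(); the variable name is_missing is made up for illustration.

```r
more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)

# TRUE wherever a price is missing
is_missing <- is.na(more_books)

# How many prices are missing, and how many are known
sum(is_missing)
sum(!is_missing)

# The total of the known prices only
sum(more_books, na.rm = TRUE)
```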
We are historians, and we want to work with complex data. Another reason R is awesome is that it includes a kind of data structure called data frames. Think of a data frame as basically a spreadsheet. It is tabular data, and the columns can contain any kind of data available in R, such as character vectors, numeric vectors, or logical vectors. R has some sample data built in, but let’s use some historical data from the historydata package.
You can load a package like this:
library(historydata)
The dplyr package is very helpful. Try loading it as well.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
You might get an error message if you don’t have either package installed. If you need to install it, run install.packages("historydata") at the R console.
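If you would rather check before installing, one approach is sketched below. It relies on base R's requireNamespace(), which returns FALSE when a package is not available; the helper name is_installed is hypothetical, and the install line is left commented out so that nothing is downloaded until you choose to run it.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this should be TRUE
is_installed("stats")

# Install only when the package is missing (uncomment to run)
# if (!is_installed("historydata")) install.packages("historydata")
```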
We don’t know what is in the historydata package, so let’s look at its help. Run this command: help(package = "historydata").
Let’s use the data in the paulist_missions data frame. According to the package documentation, what is in this data frame?
Records of missions held by the Paulist Fathers, 1851-1893
We can print the whole data frame by typing the name of the variable. To print just the first 10 rows, use head():
head(paulist_missions, 10)
## Source: local data frame [10 x 11]
##
##    mission_number                             church          city state
##             (int)                              (chr)         (chr) (chr)
## 1               1                St. Joseph's Church      New York    NY
## 2               2               St. Michael's Church       Loretto    PA
## 3               3                  St. Mary's Church Hollidaysburg    PA
## 4               4      Church of St. John Evangelist     Johnstown    PA
## 5               5                 St. Peter's Church      New York    NY
## 6               6            St. Patrick's Cathedral      New York    NY
## 7               7               St. Patrick's Church          Erie    PA
## 8               8           St. Philip Benizi Church     Cussewago    PA
## 9               9 St. Vincent's Church (Benedictine)    Youngstown    PA
## 10             10                 St. Peter's Church      Saratoga    NY
## Variables not shown: start_date (chr), end_date (chr), confessions (int),
## converts (int), order (chr), lat (dbl), long (dbl)
(The head() function returns just the first few items of a vector or the first few rows of a data frame. The second argument says how many.)
That showed us some of the data but not all. The str() function is helpful. Look up the documentation for it, and then run it on paulist_missions.
str(paulist_missions)
## Classes 'tbl_df', 'tbl' and 'data.frame': 841 obs. of 11 variables:
## $ mission_number: int 1 2 3 4 5 6 7 8 9 10 ...
## $ church : chr "St. Joseph's Church" "St. Michael's Church" "St. Mary's Church" "Church of St. John Evangelist" ...
## $ city : chr "New York" "Loretto" "Hollidaysburg" "Johnstown" ...
## $ state : chr "NY" "PA" "PA" "PA" ...
## $ start_date : chr "4/6/1851" "4/27/1851" "5/18/1851" "5/31/1851" ...
## $ end_date : chr "4/20/1851" "5/11/1851" "5/28/1851" "6/8/1851" ...
## $ confessions : int 6000 1700 1000 1000 4000 7000 1000 270 1000 600 ...
## $ converts : int 0 0 0 0 0 0 0 0 0 3 ...
## $ order : chr "Redemptorist" "Redemptorist" "Redemptorist" "Redemptorist" ...
## $ lat : num 40.7 40.5 40.4 40.3 40.7 ...
## $ long : num -74 -78.6 -78.4 -78.9 -74 ...
Also try the glimpse() function.
glimpse(paulist_missions)
## Observations: 841
## Variables: 11
## $ mission_number (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ church (chr) "St. Joseph's Church", "St. Michael's Church", ...
## $ city (chr) "New York", "Loretto", "Hollidaysburg", "Johnst...
## $ state (chr) "NY", "PA", "PA", "PA", "NY", "NY", "PA", "PA",...
## $ start_date (chr) "4/6/1851", "4/27/1851", "5/18/1851", "5/31/185...
## $ end_date (chr) "4/20/1851", "5/11/1851", "5/28/1851", "6/8/185...
## $ confessions (int) 6000, 1700, 1000, 1000, 4000, 7000, 1000, 270, ...
## $ converts (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0,...
## $ order (chr) "Redemptorist", "Redemptorist", "Redemptorist",...
## $ lat (dbl) 40.71435, 40.50313, 40.42729, 40.32674, 40.7143...
## $ long (dbl) -74.00597, -78.63030, -78.38890, -78.92197, -74...
Bonus: where does the glimpse() function come from?
We will get into subsetting data in more detail later. But for now, notice that we can get just one of the columns using the $ operator. For example:
head(paulist_missions$city, 20)
## [1] "New York" "Loretto" "Hollidaysburg" "Johnstown"
## [5] "New York" "New York" "Erie" "Cussewago"
## [9] "Youngstown" "Saratoga" "Troy" "Albany"
## [13] "Detroit" "Philadelphia" "Philadelphia" "Cohoes"
## [17] "Wheeling" "Cincinnati" "Louisville" "Albany"
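To see how $ relates to the vectors from earlier, here is a sketch that builds a tiny data frame by hand with base R's data.frame(). The name toy_missions is made up; the three rows copy the first three missions printed above. Each column is an ordinary vector, and $ pulls that vector back out.

```r
# A hand-built data frame: each argument becomes a column
toy_missions <- data.frame(
  city = c("New York", "Loretto", "Hollidaysburg"),
  state = c("NY", "PA", "PA"),
  confessions = c(6000, 1700, 1000),
  stringsAsFactors = FALSE
)

# $ returns one column as a plain vector
toy_missions$city

# So the vector functions from earlier work on it
length(toy_missions$city)
sum(toy_missions$confessions)
```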
Can you print the first 20 numbers of converts? of confessions?
head(paulist_missions$converts, 20)
## [1] 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 0 3 3
head(paulist_missions$confessions, 20)
## [1] 6000 1700 1000 1000 4000 7000 1000 270 1000 600 3100 4000 3000 3000
## [15] 1700 1400 1250 5000 2000 4500
What was the mean number of converts? the maximum? How about for confessions?
mean(paulist_missions$converts)
## [1] 2.507729
max(paulist_missions$converts)
## [1] 50
mean(paulist_missions$confessions)
## [1] NA
max(paulist_missions$confessions)
## [1] NA
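The NA results for confessions have the same cause as the NA total for book prices: some missions have no recorded number of confessions. The sketch below shows the fix on a small invented data frame called toy, so the numbers are easy to check by hand. The same na.rm = TRUE argument should work on paulist_missions$confessions.

```r
# Invented sample with one missing confessions value
toy <- data.frame(
  confessions = c(6000, 1700, NA, 1000),
  converts = c(0, 0, 3, 0)
)

# A single NA makes the whole result NA
mean(toy$confessions)

# na.rm = TRUE drops the missing value before calculating
mean(toy$confessions, na.rm = TRUE)
max(toy$confessions, na.rm = TRUE)
```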
Bonus: what was the ratio between confessions and conversions?
And for fun, let’s make a scatter plot of the number of confessions versus the number of conversions.
plot(paulist_missions$confessions, paulist_missions$converts)
title("Confessions versus conversions")
What might you be able to learn from this plot?
More people confessed than converted. That is probably because most people in nineteenth-century America already belonged to a religious institution, so a mission would draw far more confessions than conversions.
There are other datasets in historydata. Can you make a plot from one or more of them?
plot(paulist_missions$long, paulist_missions$lat)