Aim of this worksheet

After completing this worksheet, you should feel comfortable typing commands into the R console (or REPL) and into an R Markdown document. In particular, you should know how to use values, variables, and functions, how to install and load packages, and how to use the built-in help for R and its packages.

Values

R lets you store several different kinds of values. These values are the information that we actually want to do something with.

One kind of value is a number. Notice that typing this number, either in an R Markdown document or at the console, produces identical output.

42
## [1] 42

Create a numeric value that has a decimal point:

4.4
## [1] 4.4

Of course numbers can be added together (with +), subtracted (with -), multiplied (with *), and divided (with /), along with other arithmetical operations. Let’s add two numbers, which will produce a new number.

2 + 2
## [1] 4
2 * 2
## [1] 4
7 - 3
## [1] 4

Add two lines, one that multiplies two numbers, and another that subtracts two numbers.

Another important kind of value is a character vector. (Most other programming languages would call these strings.) These contain text. To create a string, include some characters in between quotation marks "". (Single quotation marks work too, but in general use double quotation marks as a matter of style.) For instance:

"Hello, beginning R programmer"
## [1] "Hello, beginning R programmer"

Create a string with a message to your instructor.

"Hello R Master"
## [1] "Hello R Master"

Character vectors can’t be added together with +. But they can be joined together with the paste() function.

paste("Hello", "everybody")
## [1] "Hello everybody"

Mimic the example above and paste three strings together.

paste("Hello", "Comment", "Hello")
## [1] "Hello Comment Hello"

Now explain in a sentence what happened.

It worked: paste() joined the three strings into a single string, with a space between each one.

Another very important kind of value is the logical value. There are only two of them: TRUE and FALSE.

# This is true
TRUE
## [1] TRUE
# This is false
FALSE
## [1] FALSE

Notice that in the block above, the # character starts a comment. That means that from that point on, R will ignore whatever is on that line until a new line begins.

Logical values aren’t very exciting, but they are useful when we compare other values to one another. For instance, we can compare two numbers to one another.

2 < 3
## [1] TRUE
2 > 3
## [1] FALSE
2 == 3
## [1] FALSE

What does each of those comparison operators do? (Note the double equal sign: ==.)

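Each operator compares the value on its left with the value on its right and returns TRUE or FALSE. < asks whether the left value is less than the right, > asks whether it is greater, and == asks whether the two are equal. 2 is less than 3, so the first comparison is TRUE. 2 is not greater than 3, and 2 is not equal to 3, so the last two comparisons are FALSE.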

Create your own comparisons between numeric values. See if you can create a comparison between character vectors.

4 < 2
## [1] FALSE
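
Comparisons also work on character vectors. A small sketch, using made-up strings, of the equality operators == and !=, which compare the text exactly, including capitalization:

```r
# Two identical strings are equal
"apple" == "apple"
## [1] TRUE
# Capitalization matters
"apple" == "Apple"
## [1] FALSE
# != asks whether two values are different
"apple" != "pear"
## [1] TRUE
```

Operators such as < also work on character vectors, but the ordering depends on your computer's locale settings, so those results can vary from machine to machine.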

R has a special kind of value: the missing value. This is represented by NA.

NA
## [1] NA

Try adding 2 + NA.

2 + NA
## [1] NA

Does that answer make sense? Why or why not?

It does make sense. NA stands for a value that is missing or unknown, so R cannot know what 2 plus an unknown value is. The result is therefore also NA.

We will come back to missing values.

Variables

We wouldn’t be able to get very far if we only used values. We also need a place to store them, a way of writing them to the computer’s memory. We can do that by assignment to a variable. Assignment has three parts: it has the name of a variable (which cannot contain spaces), an assignment operator, and a value that will be assigned. Most programming languages use a rinky-dink = for assignment, which works in R too. But R is awesome because the assignment operator is <-, a lovely little arrow which tells you that the value goes into the variable. For example:

number <- 42

Notice that nothing was printed as output when we did that. But now we can just type number and get the value which is stored in the variable.

number
## [1] 42

It works with character vectors too.

computer_name <- "HAL 9000"

No output, but this works.

computer_name
## [1] "HAL 9000"

In the assignment above, what is the name of the variable? What is the assignment operator? What is the value assigned to the variable?

The variable is computer_name. The value assigned to the variable is "HAL 9000". The assignment operator is <-.

Notice that we can use variables any place that we used to use values. For example:

x <- 2
y <- 5
x * y
## [1] 10
x + 9
## [1] 11

Explain in your own words what just happened.

By assigning the number 2 to x and the number 5 to y, R treats x * y the same as 2 times 5, which gives 10. Similarly, when x is added to 9, R uses 2 in place of x, so the answer is 11.

Now create two assignments. Assign a number to a variable and a character vector to a different variable.

s <- 4
w <- 10
s * w
## [1] 40
w - s
## [1] 6
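
The exercise also asks for a character vector. One possible answer, with hypothetical variable names (page_count and book_title are made up for illustration); class() is a built-in function that reports what kind of value a variable holds:

```r
# A number stored in one variable
page_count <- 320
# A character vector stored in a different variable
book_title <- "Moby-Dick"
class(page_count)
## [1] "numeric"
class(book_title)
## [1] "character"
```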

Now create a third variable with a numeric value and, using the variable with a numeric value from earlier, add them together.

f <- 20

f + s
## [1] 24

Can you predict what the result of running this code will be? (That is, what value is stored in a?)

a <- 10
b <- 20
a <- b
a

Predict your answer, then run the code. What is the value stored in a by the end? Explain why you were right or wrong.

I predicted that a would be 10, but I was wrong: a is 20. The line a <- b takes the value currently stored in b (20) and assigns it to a, replacing the 10 that was there before.
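
Running the code step by step shows what happens. This short sketch also adds one extra step, not part of the exercise, to show that a keeps its own copy of the value even if b changes later:

```r
a <- 10
b <- 20
a <- b   # a now holds a copy of the value stored in b
a
## [1] 20
b <- 99  # changing b afterwards leaves a untouched
a
## [1] 20
```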

Vectors

Variables are better than just values, but we still need to be able to store multiple values. If we have to store each value in its own variable, then we are going to need a lot of variables. R is a beautiful language because every value is actually a vector. That means it can store more than one value.

Notice the funny output here:

"Some words"
## [1] "Some words"

What does the [1] in the output mean? It is the position (index) of the first item printed on that line of output. Here the value has just one item inside it, which we can test with the length() function.

length("Some words")
## [1] 1

Sure enough, the length is 1: R is counting the number of items, not the number of words or characters.
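
If you do want the number of characters rather than the number of items, the built-in nchar() function counts them (spaces included). A quick comparison:

```r
length("Some words")  # number of items in the vector
## [1] 1
nchar("Some words")   # number of characters in the string
## [1] 10
```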

That would seem to imply that we can add multiple items (or values) inside a vector. R lets us do that with the c() (for “combine”) function.

many <- c(1, 5, 2, 3, 7)
many
## [1] 1 5 2 3 7

What is the length of the vector stored in many?

length(many)
## [1] 5

Let’s try multiplying many by 2:

many * 2
## [1]  2 10  4  6 14

What happened?

R recognizes each number in the series individually and multiplies each by 2.

What happens when you add 2 to many?

many + 2
## [1] 3 7 4 5 9

Each number goes up by 2.

Can you create a variable containing several names as a character vector?
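
One possible answer, using made-up names (the variable name historians is also just an illustration):

```r
# c() combines several strings into one character vector
historians <- c("Herodotus", "Thucydides", "Tacitus")
historians
## [1] "Herodotus"  "Thucydides" "Tacitus"
length(historians)
## [1] 3
```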

Hard question: what is happening here? Why does R give you a warning message?

c(1, 2, 3, 4, 5) + c(10, 20)
## Warning in c(1, 2, 3, 4, 5) + c(10, 20): longer object length is not a
## multiple of shorter object length
## [1] 11 22 13 24 15

R adds the two vectors item by item. Because the second vector is shorter, R recycles it, repeating 10, 20 until it is as long as the first: 10, 20, 10, 20, 10. The warning appears because the longer length (5) is not a multiple of the shorter length (2), so the last repetition is cut off partway.
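
A minimal sketch of recycling, the rule by which R repeats a shorter vector to match a longer one, using the built-in rep_len() function to make the repetition visible:

```r
# rep_len() repeats a vector until it reaches the requested length
rep_len(c(10, 20), 5)
## [1] 10 20 10 20 10
# Adding the repeated vector explicitly gives the same result, without the warning
c(1, 2, 3, 4, 5) + rep_len(c(10, 20), 5)
## [1] 11 22 13 24 15
# When the longer length is a multiple of the shorter, R recycles silently
c(1, 2, 3, 4) + c(10, 20)
## [1] 11 22 13 24
```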

Built-in functions

Wouldn’t it be nice to be able to do something with data? Let’s take some made up data: the price of books that you or I have bought recently.

book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)

We can find the total amount that I spent using the sum function.

sum(book_prices)
## [1] 139.78

Try finding the average price of the books (using mean()) and the median price of the books (using median()).

mean(book_prices)
## [1] 23.29667
median(book_prices)
## [1] 21.84

Can you figure out how to find the most expensive book (hint: the book with the maximum price) and the least expensive book (hint: the book with the minimum price)?

max(book_prices)
## [1] 45
min(book_prices)
## [1] 5.33

Hard question: what is happening here?

book_prices >= mean(book_prices)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

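R compares each book price with the mean price (about 23.30) and returns a logical vector with one TRUE or FALSE per book. Only the second and fourth books (25.78 and 45.00) cost at least as much as the mean, so only those positions are TRUE.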
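
A logical vector like this one can be put to work. As a sketch (repeating the book_prices vector so the block runs on its own; above_average is a name made up for this example), square brackets keep only the items where the comparison is TRUE, and sum() counts the TRUE values because R treats TRUE as 1 and FALSE as 0:

```r
book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)
# One TRUE or FALSE per book
above_average <- book_prices >= mean(book_prices)
# Keep only the prices at or above the mean
book_prices[above_average]
## [1] 25.78 45.00
# Count how many books are at or above the mean
sum(above_average)
## [1] 2
```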

Using the documentation

Let’s try a different set of book prices. This time, we have a vector of book prices, but there are some books for which we don’t know how much we paid. Those are the NA values.

more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)

How many books did we buy? (Hint: what is the length of the vector?)

length(more_books)
## [1] 9

Let’s try finding the total using sum().

sum(more_books)
## [1] NA

That wasn’t very helpful. Why did R give us an answer of NA?

Because NA stands for a value we do not know. You cannot add all of those prices together if you do not know all of them, so the total is not available either.

We need to find a way to get the value of the books that we know about. This is an option to the sum() function. If you know the name of a function, you can search for it by typing a question mark followed without a space by the name of the function. For example, ?sum. Look up the sum() function’s documentation. Read at least the “Arguments” and the “Examples” section. How can you get the sum for the values which aren’t missing?

sum(more_books, na.rm = TRUE)
## [1] 139.78

Look up the documentation for ?mean, ?max, ?min and see if you can use those functions on a vector with missing values.

max(more_books, na.rm=TRUE)
## [1] 45
mean(more_books, na.rm=TRUE)
## [1] 23.29667
min(more_books, na.rm=TRUE)
## [1] 5.33
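
It can also help to know how many values are missing. A short sketch with the built-in is.na() function, which returns TRUE for each missing item (more_books is repeated here so the block runs on its own):

```r
more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)
sum(is.na(more_books))   # how many prices are missing
## [1] 3
sum(!is.na(more_books))  # how many prices we actually know
## [1] 6
```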

Data frames and loading packages

We are historians, and we want to work with complex data. Another reason R is awesome is that it includes a kind of data structure called data frames. Think of a data frame as basically a spreadsheet. It is tabular data, and the columns can contain any kind of data available in R, such as character vectors, numeric vectors, or logical vectors. R has some sample data built in, but let’s use some historical data from the historydata package.

You can load a package like this:

library(historydata)

The dplyr package is very helpful. Try loading it as well.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

You might get an error message if you don’t have either package installed. If you need to install it, run install.packages("historydata") at the R console.
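
If you would rather not reinstall a package you already have, one option is to check first. A minimal sketch, assuming you have an internet connection and a CRAN mirror configured; the helper name install_if_missing is made up for this example, while requireNamespace() and install.packages() are standard R functions:

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# Example call (left commented out so nothing is installed by accident):
# install_if_missing("historydata")
```

The function returns TRUE when the package is available, so you can follow it with library() as usual.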

We don’t know what is in the historydata package, so let’s look at its help. Run this command: help(package = "historydata").

Let’s use the data in the paulist_missions data frame. According to the package documentation, what is in this data frame?

Records of missions held by the Paulist Fathers, 1851-1893

We can print it by using the name of the variable. Here we wrap the name in head() to print only the first 10 rows:

head(paulist_missions, 10)
## Source: local data frame [10 x 11]
## 
##    mission_number                             church          city state
##             (int)                              (chr)         (chr) (chr)
## 1               1                St. Joseph's Church      New York    NY
## 2               2               St. Michael's Church       Loretto    PA
## 3               3                  St. Mary's Church Hollidaysburg    PA
## 4               4      Church of St. John Evangelist     Johnstown    PA
## 5               5                 St. Peter's Church      New York    NY
## 6               6            St. Patrick's Cathedral      New York    NY
## 7               7               St. Patrick's Church          Erie    PA
## 8               8           St. Philip Benizi Church     Cussewago    PA
## 9               9 St. Vincent's Church (Benedictine)    Youngstown    PA
## 10             10                 St. Peter's Church      Saratoga    NY
## Variables not shown: start_date (chr), end_date (chr), confessions (int),
##   converts (int), order (chr), lat (dbl), long (dbl)

(The head() function just gives us the first few items of a vector or, as here, the first few rows of a data frame. The second argument sets how many.)

That showed us some of the data but not all. The str() function is helpful. Look up the documentation for it, and then run it on paulist_missions.

str(paulist_missions)
## Classes 'tbl_df', 'tbl' and 'data.frame':    841 obs. of  11 variables:
##  $ mission_number: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ church        : chr  "St. Joseph's Church" "St. Michael's Church" "St. Mary's Church" "Church of St. John Evangelist" ...
##  $ city          : chr  "New York" "Loretto" "Hollidaysburg" "Johnstown" ...
##  $ state         : chr  "NY" "PA" "PA" "PA" ...
##  $ start_date    : chr  "4/6/1851" "4/27/1851" "5/18/1851" "5/31/1851" ...
##  $ end_date      : chr  "4/20/1851" "5/11/1851" "5/28/1851" "6/8/1851" ...
##  $ confessions   : int  6000 1700 1000 1000 4000 7000 1000 270 1000 600 ...
##  $ converts      : int  0 0 0 0 0 0 0 0 0 3 ...
##  $ order         : chr  "Redemptorist" "Redemptorist" "Redemptorist" "Redemptorist" ...
##  $ lat           : num  40.7 40.5 40.4 40.3 40.7 ...
##  $ long          : num  -74 -78.6 -78.4 -78.9 -74 ...

Also try the glimpse() function.

glimpse(paulist_missions)
## Observations: 841
## Variables: 11
## $ mission_number (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ church         (chr) "St. Joseph's Church", "St. Michael's Church", ...
## $ city           (chr) "New York", "Loretto", "Hollidaysburg", "Johnst...
## $ state          (chr) "NY", "PA", "PA", "PA", "NY", "NY", "PA", "PA",...
## $ start_date     (chr) "4/6/1851", "4/27/1851", "5/18/1851", "5/31/185...
## $ end_date       (chr) "4/20/1851", "5/11/1851", "5/28/1851", "6/8/185...
## $ confessions    (int) 6000, 1700, 1000, 1000, 4000, 7000, 1000, 270, ...
## $ converts       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0,...
## $ order          (chr) "Redemptorist", "Redemptorist", "Redemptorist",...
## $ lat            (dbl) 40.71435, 40.50313, 40.42729, 40.32674, 40.7143...
## $ long           (dbl) -74.00597, -78.63030, -78.38890, -78.92197, -74...

Bonus: where does the glimpse() function come from?

We will get into subsetting data in more detail later. But for now, notice that we can get just one of the columns using the $ operator. For example:

head(paulist_missions$city, 20)
##  [1] "New York"      "Loretto"       "Hollidaysburg" "Johnstown"    
##  [5] "New York"      "New York"      "Erie"          "Cussewago"    
##  [9] "Youngstown"    "Saratoga"      "Troy"          "Albany"       
## [13] "Detroit"       "Philadelphia"  "Philadelphia"  "Cohoes"       
## [17] "Wheeling"      "Cincinnati"    "Louisville"    "Albany"
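
The $ operator works the same way on any data frame. A small sketch with a made-up data frame (the name missions_demo and its values are invented for illustration, not taken from historydata):

```r
# A tiny data frame with two columns
missions_demo <- data.frame(
  city = c("New York", "Loretto", "Erie"),
  converts = c(0, 3, 2),
  stringsAsFactors = FALSE  # keep city as a character vector in older versions of R
)
# Pull out one column as a vector
missions_demo$city
## [1] "New York" "Loretto"  "Erie"
# The column can be passed to any function that accepts a vector
mean(missions_demo$converts)
## [1] 1.666667
```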

Can you print the first 20 numbers of converts? of confessions?

head(paulist_missions$converts, 20)
## [1] 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 0 3 3
head(paulist_missions$confessions, 20)
##  [1] 6000 1700 1000 1000 4000 7000 1000  270 1000  600 3100 4000 3000 3000
## [15] 1700 1400 1250 5000 2000 4500

What was the mean number of converts? the maximum? How about for confessions?

mean(paulist_missions$converts)
## [1] 2.507729
max(paulist_missions$converts)
## [1] 50
mean(paulist_missions$confessions)
## [1] NA
max(paulist_missions$confessions)
## [1] NA
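
The NA results for confessions have the same cause as before: the column contains missing values. A sketch of the fix on a short made-up vector (confessions_demo is invented for illustration); the same na.rm = TRUE argument can be passed when working with paulist_missions$confessions:

```r
confessions_demo <- c(6000, 1700, NA, 1000)
# One missing value is enough to make the result NA
mean(confessions_demo)
## [1] NA
# na.rm = TRUE drops the missing values before calculating
mean(confessions_demo, na.rm = TRUE)
## [1] 2900
max(confessions_demo, na.rm = TRUE)
## [1] 6000
```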

Bonus: what was the ratio between confessions and conversions?

Plots

And for fun, let’s make a scatter plot of the number of confessions versus the number of conversions.

plot(paulist_missions$confessions, paulist_missions$converts)
title("Confessions versus conversions")

What might you be able to learn from this plot?

There are more people confessing than converting, probably because in 19th-century America most people attending a mission already belonged to a religious institution.

There are other datasets in historydata. Can you make a plot from one or more of them?

# Longitude on the x-axis and latitude on the y-axis, so the points are laid out like a map
plot(paulist_missions$long, paulist_missions$lat)