Getting familiar with R

Aim of this worksheet

After completing this worksheet, you should feel comfortable typing commands into the R console (or REPL) and into an R Markdown document. In particular, you should know how to use values, variables, and functions, how to install and load packages, and how to use the built-in help for R and its packages.

Values

R lets you store several different kinds of values. These values are the information that we actually want to do something with.

One kind of value is a number. Notice that typing this number, either in an R Markdown document or at the console, produces identical output:

42

## [1] 42

Create a numeric value that has a decimal point:

42.77

## [1] 42.77

Of course numbers can be added together (with +), subtracted (with -), multiplied (with *), and divided (with /), along with other arithmetical operations. Let’s add two numbers, which will produce a new number.

2 + 2

## [1] 4

2 * 2

## [1] 4

7 - 3

## [1] 4
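The other arithmetical operations mentioned above include exponentiation, remainder, and integer division. Here is a minimal sketch using R’s built-in operators; the numbers are arbitrary examples.

```r
# Exponentiation: 2 raised to the power of 3
2^3
# Remainder after dividing 7 by 3
7 %% 3
# Integer division: how many whole times 3 goes into 7
7 %/% 3
```

These should print 8, 1, and 2 respectively.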

Add two lines, one that multiplies two numbers, and another that subtracts two numbers.

Another important kind of value is a character vector. (Most other programming languages would call these strings.) These contain text. To create a string, put some characters between quotation marks "". (Single quotation marks work too, but in general use double quotation marks as a matter of style.) For instance:

"Hello, beginning R programmer"

## [1] "Hello, beginning R programmer"

Create a string with a message to your instructor.

"Hello Arrgh Master"

## [1] "Hello Arrgh Master"

Character vectors can’t be added together with +. But they can be joined together with the paste() function.

paste("Hello", "everybody")

## [1] "Hello everybody"

Mimic the example above and paste three strings together.

paste("control", "alt", "delete")

## [1] "control alt delete"

Now explain in a sentence what happened.

It pasted the strings together and inserted a space between them.
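The space is only the default separator. As a small sketch of how to change it, paste() takes a sep argument, and the related paste0() function uses no separator at all. The strings here are just the ones from the example above.

```r
# The default separator is a single space
paste("control", "alt", "delete")
# The sep argument changes what goes between the strings
paste("control", "alt", "delete", sep = "+")
# paste0() joins strings with no separator at all
paste0("control", "alt", "delete")
```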

Another very important kind of value is the logical value. There are only two of them: TRUE and FALSE.

# This is true
TRUE

## [1] TRUE

# This is false
FALSE

## [1] FALSE

Notice that in the block above, the # character starts a comment. That means that from that point on, R will ignore whatever is on that line until a new line begins.

Logical values aren’t very exciting, but they are useful when we compare other values to one another. For instance, we can compare two numbers to one another.

2 < 3

## [1] TRUE

2 > 3

## [1] FALSE

2 == 3

## [1] FALSE

What does each of those comparison operators do? (Note the double equal sign: ==.)

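The < operator asks whether the value on the left is less than the value on the right: is 2 less than 3?

The > operator asks whether the value on the left is greater than the value on the right: is 2 greater than 3?

The == operator asks whether the two values are equal: does 2 equal 3?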

Create your own comparisons between numeric values. See if you can create a comparison between character vectors.

7 < 11

## [1] TRUE

9 > 1

## [1] TRUE

17 == 17

## [1] TRUE

c(2, 3, 5) > c(1, 2, 7)

## [1]  TRUE  TRUE FALSE

c("hot","cross","buns") > c("rock","paper","scissors")

## [1] FALSE FALSE FALSE
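Comparisons between character vectors go by sort order, not by length or meaning. Here is a small sketch with made-up words. The ordering of mixed-case or accented text can depend on your locale, so this example sticks to simple cases.

```r
# "apple" sorts before "banana", so this is TRUE
"apple" < "banana"
# == checks whether two strings are exactly the same
"buns" == "buns"
# Equality is case-sensitive, so this is FALSE
"Buns" == "buns"
```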

R has a special kind of value: the missing value. This is represented by NA.

NA

## [1] NA

Try adding 2 + NA.

2 + NA

## [1] NA

Does that answer make sense? Why or why not?

It makes sense because NA stands for a missing, and therefore unknown, value. Adding 2 to an unknown number gives an unknown result, so R returns NA. This is different from 2 + 0, because 0 is an actual value.
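Because NA means “unknown,” even comparing it with == gives NA. A minimal sketch of how to test for missing values instead, using the built-in is.na() function:

```r
# Comparing with == does not work for missing values: the result is itself NA
NA == NA
# is.na() is the reliable test for a missing value
is.na(NA)
is.na(2 + NA)
# An ordinary number is not missing
is.na(2 + 0)
```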

We will come back to missing values.

Variables

We wouldn’t be able to get very far if we only used values. We also need a place to store them, a way of writing them to the computer’s memory. We can do that by assignment to a variable. Assignment has three parts: it has the name of a variable (which cannot contain spaces), an assignment operator, and a value that will be assigned. Most programming languages use a rinky-dink = for assignment, which works in R too. But R is awesome because the assignment operator is <-, a lovely little arrow which tells you that the value goes into the variable. For example:

number <- 42

Notice that nothing was printed as output when we did that. But now we can just type the name of the variable and get the value which is stored in it.

number

## [1] 42

It works with character vectors too.

computer_name <- "HAL 9000"

No output, but this works.

computer_name

## [1] "HAL 9000"

In the assignment above, what is the name of the variable? What is the assignment operator? What is the value assigned to the variable?

Name: computer_name; assignment operator: <-; value: "HAL 9000"

Notice that we can use variables any place that we used to use values. For example:

x <- 2
y <- 5
x * y

## [1] 10

x + 9

## [1] 11

Explain in your own words what just happened.

Variable x was assigned a value of 2 while variable y was assigned a value of 5. Next, variable x was multiplied by variable y (2 * 5) producing the answer 10. Finally, x was added to 9 (2 + 9) which produced the answer 11.

Now create two assignments. Assign a number to a variable and a character vector to a different variable.

test <- 7
char_vec <- c("a", "q", "z")

Now create a third variable with a numeric value and, using the variable with a numeric value from earlier, add them together.

third <- 17
test + third

## [1] 24

Can you predict what the result of running this code will be? (That is, what value is stored in a?)

a <- 10
b <- 20
a <- b
a

Predict your answer, then run the code. What is the value stored in a by the end? Explain why you were right or wrong.

Prediction: a = 20. Answer: a = 20. Originally the variable a was assigned the value 10 and b was assigned the value 20. Then the third line, a <- b, assigned the value of b, 20, to the variable a. This means that the original value of 10 was overwritten with the new value of 20.
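A follow-up worth trying: does a stay linked to b after the assignment? The sketch below suggests not. The assignment copies the value, so changing b later leaves a alone. (The value 99 is an arbitrary choice.)

```r
a <- 10
b <- 20
a <- b    # a now holds the value 20
b <- 99   # changing b afterwards does not change a
a
b
```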

Vectors

Variables are better than just values, but we still need to be able to store multiple values. If we have to store each value in its own variable, then we are going to need a lot of variables. R is a beautiful language because every value is actually a vector. That means it can store more than one value.

Notice the funny output here:

"Some words"

## [1] "Some words"

What does the [1] in the output mean? It is the position of the first item printed on that line of output. Here the value has only one item inside it. We can test that with the length() function:

length("Some words")

## [1] 1

Sure enough, the length is 1: R is counting the number of items, not the number of words or characters.
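If you do want the number of characters, the built-in nchar() function counts them. A minimal sketch; words is a variable name made up for this example.

```r
words <- "Some words"
# One item in the vector
length(words)
# Ten characters in that item (the space counts too)
nchar(words)
```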

That would seem to imply that we can add multiple items (or values) inside a vector. R lets us do that with the c() (for “combine”) function.

many <- c(1, 5, 2, 3, 7)
many

## [1] 1 5 2 3 7

What is the length of the vector stored in many?

length(many)

## [1] 5

Let’s try multiplying many by 2:

many * 2

## [1]  2 10  4  6 14

What happened?

Multiplying the vector many by two means that each item within the vector is multiplied by two. So the first item, 1, is now 2. The second item, 5, is now 10 and so on.

What happens when you add 2 to many?

many + 2

## [1] 3 7 4 5 9

The value of 2 is added to every value within the vector many.

Can you create a variable containing several names as a character vector?

three_Stooges <- c("Moe", "Larry", "Curly")

Hard question: what is happening here? Why does R give you a warning message?

c(1, 2, 3, 4, 5) + c(10, 20)

## Warning in c(1, 2, 3, 4, 5) + c(10, 20): longer object length is not a
## multiple of shorter object length

## [1] 11 22 13 24 15

The code adds a vector with five items to a vector with two items. R recycles the shorter vector, repeating its items (10, 20, 10, 20, 10) until it is as long as the longer one. The result (11, 22, 13, 24, 15) shows that the first, third, and fifth items had 10 added to them, while the second and fourth items had 20 added to them. R gives a warning because the length of the longer vector is not a multiple of the length of the shorter one, so the shorter vector cannot be recycled evenly. It is a warning rather than an error because R can still compute a result, but the mismatch may signal a mistake.
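This behavior is usually called recycling. As a rough sketch, rep() with the length.out argument shows what the repeated shorter vector looks like. The second line uses a four-item vector (an arbitrary example) to show that no warning appears when the lengths divide evenly.

```r
# The two-item vector stretched to five items
rep(c(10, 20), length.out = 5)
# When the longer length is a multiple of the shorter, R recycles without a warning
c(1, 2, 3, 4) + c(10, 20)
```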

Built-in functions

Wouldn’t it be nice to be able to do something with data? Let’s take some made up data: the price of books that you or I have bought recently.

book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)

We can find the total amount spent using the sum() function.

sum(book_prices)

## [1] 139.78

Try finding the average price of the books (using mean()) and the median price of the books (using median()).

mean(book_prices)

## [1] 23.29667
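The exercise also asks for the median. A minimal sketch, repeating the book_prices vector from above so that the block runs on its own:

```r
book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)
# With six prices, the median is the average of the two middle values
median(book_prices)
```

The two middle prices are 21.23 and 22.45, so the median should come out to 21.84.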

Can you figure out how to find the most expensive book (hint: the book with the maximum price) and the least expensive book (hint: the book with the minimum price)?

max(book_prices)

## [1] 45

min(book_prices)

## [1] 5.33

Hard question: what is happening here?

book_prices >= mean(book_prices)

## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

The code identifies the books whose price is greater than or equal to the average price of all the books. Because the comparison is applied to every item in the vector, the result is a vector of TRUE and FALSE values showing which books have a price at or above the average.
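A logical vector like this one can be put to work. The sketch below is only a preview, since subsetting comes later. It counts the TRUE values and then picks out the matching prices; above_average is a variable name made up for this example.

```r
book_prices <- c(19.99, 25.78, 5.33, 45.00, 22.45, 21.23)
above_average <- book_prices >= mean(book_prices)
# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(above_average)
# A logical vector inside square brackets keeps only the items marked TRUE
book_prices[above_average]
```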

Using the documentation

Let’s try a different set of book prices. This time, we have a vector of book prices, but there are some books for which we don’t know how much we paid. Those are the NA values.

more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)

How many books did we buy? (Hint: what is the length of the vector?)

length(more_books)

## [1] 9

You bought nine books.

Let’s try finding the total using sum().

sum(more_books)

## [1] NA

That wasn’t very helpful. Why did R give us an answer of NA?

Three of the items in the vector are NA, meaning that the price of those books is unknown or missing. Because the total depends on those unknown prices, the sum is also unknown, so R returns NA.

We need to find a way to get the value of the books that we know about. This is an option to the sum() function. If you know the name of a function, you can search for it by typing a question mark followed without a space by the name of the function. For example, ?sum. Look up the sum() function’s documentation. Read at least the “Arguments” and the “Examples” section. How can you get the sum for the values which aren’t missing?

?sum
sum(more_books, na.rm = TRUE)

## [1] 139.78

You can exclude the missing values explicitly by passing an extra argument to the function: sum(more_books, na.rm = TRUE). This tells R to drop every NA and add up the remaining values in the vector.

Look up the documentation for ?mean, ?max, ?min and see if you can use those functions on a vector with missing values.

?mean
?max
?min

They all take the same na.rm argument, which lets the user exclude all the NA entries when running the function.
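To see na.rm at work in those functions, here is a short sketch that repeats the more_books vector so the block runs on its own:

```r
more_books <- c(19.99, NA, 25.78, 5.33, NA, 45.00, 22.45, NA, 21.23)
# Without na.rm, a single NA makes the whole result NA
mean(more_books)
# With na.rm = TRUE, the missing values are dropped before calculating
mean(more_books, na.rm = TRUE)
max(more_books, na.rm = TRUE)
min(more_books, na.rm = TRUE)
```

With the three missing prices dropped, the mean, maximum, and minimum should match the results for book_prices earlier in the worksheet.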

Data frames and loading packages

We are historians, and we want to work with complex data. Another reason R is awesome is that it includes a kind of data structure called data frames. Think of a data frame as basically a spreadsheet. It is tabular data, and the columns can contain any kind of data available in R, such as character vectors, numeric vectors, or logical vectors. R has some sample data built in, but let’s use some historical data from the historydata package.

You can load a package like this:

library(historydata)

The dplyr package is very helpful. Try loading it as well.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

You might get an error message if you don’t have either package installed. If you need to install it, run install.packages("historydata") at the R console.
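One way to avoid that error is to check whether a package is installed before loading it. The sketch below uses requireNamespace() from base R. Note that is_installed is a hypothetical helper name, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this should be TRUE
is_installed("stats")

# Install historydata only if it is missing (uncomment to run)
# if (!is_installed("historydata")) install.packages("historydata")
```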

We don’t know what is in the historydata package, so let’s look at its help. Run this command: help(package = "historydata").

Let’s use the data in the paulist_missions data frame. According to the package documentation, what is in this data frame?

841 observations of 11 variables

We can print a data frame by typing the name of the variable. Since this one is long, let’s use head() to print just the first 10 rows:

head(paulist_missions, 10)

## Source: local data frame [10 x 11]
## 
##    mission_number                             church          city state
##             (int)                              (chr)         (chr) (chr)
## 1               1                St. Joseph's Church      New York    NY
## 2               2               St. Michael's Church       Loretto    PA
## 3               3                  St. Mary's Church Hollidaysburg    PA
## 4               4      Church of St. John Evangelist     Johnstown    PA
## 5               5                 St. Peter's Church      New York    NY
## 6               6            St. Patrick's Cathedral      New York    NY
## 7               7               St. Patrick's Church          Erie    PA
## 8               8           St. Philip Benizi Church     Cussewago    PA
## 9               9 St. Vincent's Church (Benedictine)    Youngstown    PA
## 10             10                 St. Peter's Church      Saratoga    NY
## Variables not shown: start_date (chr), end_date (chr), confessions (int),
##   converts (int), order (chr), lat (dbl), long (dbl)

(The head() function just gives us the first few items of a vector, or the first few rows of a data frame. Here we asked for 10.)

That showed us some of the data but not all. The str() function is helpful. Look up the documentation for it, and then run it on paulist_missions.

?str
str(paulist_missions)

## Classes 'tbl_df', 'tbl' and 'data.frame':    841 obs. of  11 variables:
##  $ mission_number: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ church        : chr  "St. Joseph's Church" "St. Michael's Church" "St. Mary's Church" "Church of St. John Evangelist" ...
##  $ city          : chr  "New York" "Loretto" "Hollidaysburg" "Johnstown" ...
##  $ state         : chr  "NY" "PA" "PA" "PA" ...
##  $ start_date    : chr  "4/6/1851" "4/27/1851" "5/18/1851" "5/31/1851" ...
##  $ end_date      : chr  "4/20/1851" "5/11/1851" "5/28/1851" "6/8/1851" ...
##  $ confessions   : int  6000 1700 1000 1000 4000 7000 1000 270 1000 600 ...
##  $ converts      : int  0 0 0 0 0 0 0 0 0 3 ...
##  $ order         : chr  "Redemptorist" "Redemptorist" "Redemptorist" "Redemptorist" ...
##  $ lat           : num  40.7 40.5 40.4 40.3 40.7 ...
##  $ long          : num  -74 -78.6 -78.4 -78.9 -74 ...

Also try the glimpse() function.

?glimpse
glimpse(paulist_missions)

## Observations: 841
## Variables: 11
## $ mission_number (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ church         (chr) "St. Joseph's Church", "St. Michael's Church", ...
## $ city           (chr) "New York", "Loretto", "Hollidaysburg", "Johnst...
## $ state          (chr) "NY", "PA", "PA", "PA", "NY", "NY", "PA", "PA",...
## $ start_date     (chr) "4/6/1851", "4/27/1851", "5/18/1851", "5/31/185...
## $ end_date       (chr) "4/20/1851", "5/11/1851", "5/28/1851", "6/8/185...
## $ confessions    (int) 6000, 1700, 1000, 1000, 4000, 7000, 1000, 270, ...
## $ converts       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0,...
## $ order          (chr) "Redemptorist", "Redemptorist", "Redemptorist",...
## $ lat            (dbl) 40.71435, 40.50313, 40.42729, 40.32674, 40.7143...
## $ long           (dbl) -74.00597, -78.63030, -78.38890, -78.92197, -74...

Bonus: where does the glimpse() function come from?

The glimpse() function comes from the dplyr package, which we loaded after the historydata package.
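If you are ever unsure where a function comes from, the find() function in the utils package lists the attached packages that provide a given name. This sketch uses two functions that ship with R so that it runs without any extra packages. With dplyr loaded, calling find("glimpse") the same way should point you to the package.

```r
# find() lists the attached packages that provide an object with this name
utils::find("median")
utils::find("head")
```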

We will get into subsetting data in more detail later. But for now, notice that we can get just one of the columns using the $ operator. For example:

head(paulist_missions$city, 20)

##  [1] "New York"      "Loretto"       "Hollidaysburg" "Johnstown"    
##  [5] "New York"      "New York"      "Erie"          "Cussewago"    
##  [9] "Youngstown"    "Saratoga"      "Troy"          "Albany"       
## [13] "Detroit"       "Philadelphia"  "Philadelphia"  "Cohoes"       
## [17] "Wheeling"      "Cincinnati"    "Louisville"    "Albany"
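What $ gives back is an ordinary vector, so everything from the earlier sections still applies. A minimal sketch on a tiny data frame; toy and its values are made up for illustration and are not taken from historydata.

```r
# Hypothetical toy data, not taken from historydata
toy <- data.frame(
  city = c("New York", "Loretto", "Erie"),
  converts = c(0, 3, 1),
  stringsAsFactors = FALSE
)
# $ pulls out one column as an ordinary vector
toy$city
# so vector functions such as length() and max() work on it
length(toy$city)
max(toy$converts)
```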

Can you print the first 20 numbers of converts? of confessions?

head(paulist_missions$converts, 20)

##  [1] 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 0 3 3

head(paulist_missions$confessions, 20)

##  [1] 6000 1700 1000 1000 4000 7000 1000  270 1000  600 3100 4000 3000 3000
## [15] 1700 1400 1250 5000 2000 4500

What was the mean number of converts? the maximum? How about for confessions?

mean(paulist_missions$converts)

## [1] 2.507729

max(paulist_missions$converts)

## [1] 50

mean(paulist_missions$confessions, na.rm = TRUE)

## [1] 1760.832

max(paulist_missions$confessions, na.rm = TRUE)

## [1] 11650

Mean number of converts: 2.507729

Maximum number of converts: 50

Mean number of confessions: NA, or 1760.832 with the missing values excluded

Maximum number of confessions: NA, or 11650 with the missing values excluded

Bonus: what was the ratio between confessions and conversions?

max(paulist_missions$confessions, na.rm = TRUE) / max(paulist_missions$converts)

## [1] 233

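The ratio of the maximum number of confessions to the maximum number of converts is 233. Note that this compares only the two largest values, which may come from different missions. It does not mean that there were 233 confessions for every convert overall.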
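Another way to read the question is as total confessions divided by total converts. The sketch below shows that approach on a tiny made-up data frame (toy and its numbers are hypothetical, not taken from historydata). It assumes you want to leave out any mission whose confession count is missing.

```r
# Hypothetical toy data, not taken from historydata
toy <- data.frame(
  confessions = c(6000, 1700, NA, 1000),
  converts = c(3, 0, 2, 1)
)
# Keep only the rows where the number of confessions is known
known <- !is.na(toy$confessions)
# Total confessions divided by total converts for those rows
sum(toy$confessions[known]) / sum(toy$converts[known])
```

For the toy numbers this gives 8700 / 4, or 2175 confessions per convert. The same pattern could be tried on paulist_missions.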

Plots

And for fun, let’s make a scatter plot of the number of confessions versus the number of conversions.

plot(paulist_missions$confessions, paulist_missions$converts)
title("Confessions versus Conversions")

What might you be able to learn from this plot?

This plot suggests that there is no clear direct or inverse relationship between confessions and conversions. Most missions had between 0 and 20 converts and between 0 and 6,000 confessions. There is also a notable outlier with around 7,800 confessions but roughly 50 converts. Overall, missions with more confessions did not consistently have more, or fewer, conversions.

There are other datasets in historydata. Can you make a plot from one or more of them?

help(package = "historydata")
plot(us_national_population$year, us_national_population$population)
title("U.S. Population Growth")