In Lesson 1, we learned about objects. Vectors are objects with multiple values. For example, you could have one vector that holds the population of a particular area and another that holds the matching years:

pop <- c(100, 347, 500)
year <- c(2019, 2020, 2021)

You can type the name of the object to see these values and also use functions on them just like before:

pop
## [1] 100 347 500
year
## [1] 2019 2020 2021
mean(pop)
## [1] 315.6667

However, what if you want information on population and year in a single object? This is what datasets are for.
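As a small preview, here is one way the two vectors could be combined into a single object. This is only a sketch: it uses data.frame() from base R, and the name area is made up for this illustration.

```r
# The two vectors from above
pop <- c(100, 347, 500)
year <- c(2019, 2020, 2021)

# data.frame() is part of base R; each vector becomes one column
# ("area" is a hypothetical name chosen for this example)
area <- data.frame(year = year, pop = pop)

area        # print the combined object
nrow(area)  # one row per year
```

Each row of area now pairs a year with its population, which is the idea behind the datasets described next.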

1. Datasets

A dataset stores multiple pieces of information for each observation. An observation can be any individual thing you want to study: a country, a person, a book, etc. For example, let’s use one of the datasets built into the tidyverse package. Load the tidyverse package with library(tidyverse). You will need to do this at the start of any script where you want to use tidyverse.

library(tidyverse)

The tidyverse package comes with a built-in dataset called population. This object became available when you ran library(tidyverse). Go to the console in RStudio (the panel called “Console” with the > character at the bottom) and type population. It will print out a dataset that looks like this:

population
## # A tibble: 4,060 x 3
##    country      year population
##    <chr>       <int>      <int>
##  1 Afghanistan  1995   17586073
##  2 Afghanistan  1996   18415307
##  3 Afghanistan  1997   19021226
##  4 Afghanistan  1998   19496836
##  5 Afghanistan  1999   19987071
##  6 Afghanistan  2000   20595360
##  7 Afghanistan  2001   21347782
##  8 Afghanistan  2002   22202806
##  9 Afghanistan  2003   23116142
## 10 Afghanistan  2004   24018682
## # … with 4,050 more rows

Notice that this dataset has both rows and columns. 4,060 x 3 tells you that there are 4,060 rows (up to down) and 3 columns (left to right). We can see the first ten rows in the output above. Each row in the dataset tells you about the population of a country in a particular year. So, the population of Afghanistan in 2003 was 23,116,142.
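If you want R to report the size of a dataset directly, a few base R functions can help. This is a minimal sketch that assumes the tidyverse is installed and that population has the same shape as in the printout above.

```r
library(tidyverse)

nrow(population)   # number of rows
ncol(population)   # number of columns
dim(population)    # both at once: rows first, then columns
names(population)  # the column names
```

The numbers should match the 4,060 x 3 shown at the top of the printed dataset.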

In the console, type View(population) to look at your dataset. This will open up a new window where you can look at the values. It is a good idea to always look through your datasets:

View(population)

You can do many things with datasets. Often, we will want to use functions on datasets. For example, the slice() function takes two arguments, an object and the row numbers you want, and returns only those rows. Writing 1:5 means “rows 1 through 5”; a single number such as 5 returns only the fifth row.

For example, to get the first 5 rows of population, type:

slice(population, 1:5)
## # A tibble: 5 x 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073
## 2 Afghanistan  1996   18415307
## 3 Afghanistan  1997   19021226
## 4 Afghanistan  1998   19496836
## 5 Afghanistan  1999   19987071
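Because slice() picks rows by position, it can help to compare a few calls side by side. The sketch below assumes dplyr 1.0.0 or later (for slice_head()); the object names by_position, with_head, and single_row are made up for this illustration.

```r
library(tidyverse)

by_position <- slice(population, 1:5)          # rows 1 through 5
with_head   <- slice_head(population, n = 5)   # also the first 5 rows
single_row  <- slice(population, 5)            # only the fifth row

nrow(by_position)  # 5
nrow(with_head)    # 5
nrow(single_row)   # 1
```

slice_head() is handy when you only care about "the first n rows" and do not want to write out a range.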

2. The %>% Operator

For various reasons, we will often want to chain multiple functions together. The %>% operator (called “pipe”) makes this very easy. Loading the tidyverse package with library(tidyverse) gives us access to the %>% operator.

The %>% operator passes the object on the left as the first argument to the function on the right. For example, both of these lines do the exact same thing (they return the fifth row of population):

slice(population, 5)
## # A tibble: 1 x 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1999   19987071
population %>% slice(5)
## # A tibble: 1 x 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1999   19987071

You can use this operator anywhere we have used functions before, like:

numbers <- c(3, 4, 5)
numbers %>% mean()
## [1] 4
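The pipe becomes more useful once there is more than one step. Here is a small sketch that compares nested function calls with a piped chain; the names nested and piped are made up for this example.

```r
library(tidyverse)

numbers <- c(3, 4, 5)

# Nested calls: you have to read from the inside out
nested <- sqrt(sum(numbers))

# Piped calls: read left to right, one step at a time
piped <- numbers %>% sum() %>% sqrt()

identical(nested, piped)  # TRUE: both compute the square root of the sum
```

Both versions give the same answer. The piped version just lists the steps in the order they happen.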

Exercises

  1. Write code to find the first 10 rows of population.

  2. Try to run population + 1. What happens? That’s an error message - read it! Most importantly, why isn’t this working? What is the message trying to tell you?

3. Choosing particular rows: filter()

You can use many functions to interact with datasets too. You will often want to work with only observations that meet a certain condition. For example, only countries in a particular year, or only people over the age of 45, or only Democratic political candidates.

filter() makes a smaller dataset by keeping only the rows that meet a certain condition. For example, the population dataset has many countries in it. What if you only want the values from France? You can use filter() to select only the observations that have “France” as the value for country.

population %>% filter(country == "France")
## # A tibble: 19 x 3
##    country  year population
##    <chr>   <int>      <int>
##  1 France   1995   58008958
##  2 France   1996   58216225
##  3 France   1997   58418324
##  4 France   1998   58635564
##  5 France   1999   58894671
##  6 France   2000   59213096
##  7 France   2001   59600714
##  8 France   2002   60047743
##  9 France   2003   60527640
## 10 France   2004   61002537
## 11 France   2005   61444972
## 12 France   2006   61845239
## 13 France   2007   62210877
## 14 France   2008   62552614
## 15 France   2009   62888318
## 16 France   2010   63230866
## 17 France   2011   63582112
## 18 France   2012   63936575
## 19 France   2013   64291280

filter() will evaluate the condition in the parentheses to TRUE or FALSE for every single row in the dataset. The function will return only rows which evaluate to TRUE.

To see why this works, imagine a vector like:

countries <- c("France", "Bangladesh", "Burundi")
countries == "France"
## [1]  TRUE FALSE FALSE

Using the == operator on a vector creates one TRUE or FALSE value for each entry of the vector. When you use it on a column in a dataset like in the filter() example above, it will return one TRUE or FALSE for each row in the dataset. Then, filter() only returns the rows with a TRUE value.
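To make the TRUE/FALSE idea concrete, here is a small sketch. It uses square brackets and the $ operator, which this lesson has not covered, so treat them as a peek ahead; the name is_france is made up for this example.

```r
library(tidyverse)

countries <- c("France", "Bangladesh", "Burundi")
is_france <- countries == "France"   # one TRUE or FALSE per entry

# Square brackets keep only the entries marked TRUE
countries[is_france]

# sum() counts each TRUE as 1, so this counts the matching rows.
# population$country pulls out the country column as a vector.
sum(population$country == "France")

# filter() keeps exactly those rows, so the counts should agree
nrow(filter(population, country == "France"))
```

The last two numbers should both equal 19, the number of rows in the France output above.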

Exercises

  1. Use filter() to access the rows in population in the year 1999.

  2. Use filter() to access the rows in population with fewer than 5000 people. Assign the result to an object called small_countries.

  3. Use filter() to access the rows in population for every country other than Greece.

4. Creating new columns: mutate()

The mutate() function is used to make new columns in your data. For example, we might want a new column in our dataset that records whether a country has more than a million people:

population %>%
  mutate(million = population > 1000000)
## # A tibble: 4,060 x 4
##    country      year population million
##    <chr>       <int>      <int> <lgl>  
##  1 Afghanistan  1995   17586073 TRUE   
##  2 Afghanistan  1996   18415307 TRUE   
##  3 Afghanistan  1997   19021226 TRUE   
##  4 Afghanistan  1998   19496836 TRUE   
##  5 Afghanistan  1999   19987071 TRUE   
##  6 Afghanistan  2000   20595360 TRUE   
##  7 Afghanistan  2001   21347782 TRUE   
##  8 Afghanistan  2002   22202806 TRUE   
##  9 Afghanistan  2003   23116142 TRUE   
## 10 Afghanistan  2004   24018682 TRUE   
## # … with 4,050 more rows

Or maybe a column that counts population in millions:

population %>%
  mutate(pop_million = population / 1000000)
## # A tibble: 4,060 x 4
##    country      year population pop_million
##    <chr>       <int>      <int>       <dbl>
##  1 Afghanistan  1995   17586073        17.6
##  2 Afghanistan  1996   18415307        18.4
##  3 Afghanistan  1997   19021226        19.0
##  4 Afghanistan  1998   19496836        19.5
##  5 Afghanistan  1999   19987071        20.0
##  6 Afghanistan  2000   20595360        20.6
##  7 Afghanistan  2001   21347782        21.3
##  8 Afghanistan  2002   22202806        22.2
##  9 Afghanistan  2003   23116142        23.1
## 10 Afghanistan  2004   24018682        24.0
## # … with 4,050 more rows
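Two things are worth trying here: mutate() can be chained with filter() using the pipe, and the result only sticks around if you assign it to an object. The sketch below shows both; the name france_millions is made up for this example.

```r
library(tidyverse)

# Keep only the France rows, then add a column in millions
france_millions <- population %>%
  filter(country == "France") %>%
  mutate(pop_million = population / 1000000)

ncol(population)        # still 3: the original dataset is unchanged
ncol(france_millions)   # 4: the new object has the extra column
```

If you run mutate() without assigning the result, R prints the new dataset but does not save it anywhere.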

5. Using your own datasets

You can also download and use your own datasets. R can work with a ton of formats you may have seen in the past (.csv, .pdf, .xlsx, etc.). We will download a CSV from the internet and read it into R. Here is a link to a Google Sheet that I made. This contains presidential election results for every state from 1932 to 2016 (see note 1 at the end of this lesson).

Download the spreadsheet to your computer with File –> Download –> .csv. Save the dataset in the same folder where you stored this lesson_2.Rmd document.

Then, we are going to use the read_csv() function to read this dataset. We will store it in an object called elections. The argument to this function is the path to the file you want to read. Since you placed it in the same folder, the path is simply the name of the file. After typing the quotes, you can press the tab key on your keyboard to show your list of options. If you press tab again, R will autocomplete.

# Data from class, link is above in the document
elections <- read_csv("pres_elections.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   state = col_character(),
##   abb = col_character(),
##   democrat = col_double(),
##   year = col_double(),
##   region = col_character()
## )
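If read_csv() complains that it cannot find the file, the path is usually the problem. The sketch below practices the same write-then-read steps on a throwaway file so you can see each piece working. Everything in it is an assumption for illustration: the rows are made-up placeholders (only the column names follow the file above), and the names toy, path, and toy_back are hypothetical.

```r
library(tidyverse)

# Made-up placeholder rows, NOT real election results
toy <- tibble(
  state = c("Example State", "Another State"),
  abb = c("EX", "AN"),
  democrat = c(50.5, 45.2),
  year = c(2000, 2000),
  region = c("Example Region", "Example Region")
)

path <- tempfile(fileext = ".csv")  # a throwaway file location
write_csv(toy, path)                # save the toy data as a CSV

getwd()            # the folder R is currently working in
file.exists(path)  # TRUE if a file exists at this path

toy_back <- read_csv(path)  # read the CSV back into R
names(toy_back)
```

For the real file, file.exists("pres_elections.csv") is a quick way to check that the CSV is in the folder R is working from.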

Exercises

  1. Look at the dataset with View(). What do the rows and columns represent?

  2. Use filter() to only select results for New Jersey and store the answer in an object called nj.

  3. Use filter() to make an object called south that only contains results from states in the south.

  4. Explain why the code below returns no rows (it may be helpful to look at the dataset with View(elections)):

elections %>% filter(state == "NJ")
## # A tibble: 0 x 5
## # … with 5 variables: state <chr>, abb <chr>, democrat <dbl>, year <dbl>,
## #   region <chr>

  1. This dataset comes from the pscl R package.