In Lesson 1, we learned about objects. Vectors are objects with multiple values. For example, you could have vectors that tell you the population and year for a particular area. Like this:
You can type the name of the object to see these values and also use functions on them just like before:
## [1] 100 347 500
## [1] 2019 2020 2021
## [1] 315.6667
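The code that produced the three outputs above is not shown. A minimal sketch that would reproduce them, assuming the vectors are named area_population and area_year (both names are hypothetical) and that the function applied is mean():

```r
# Hypothetical names; the values are taken from the output above.
area_population <- c(100, 347, 500)
area_year <- c(2019, 2020, 2021)

# Typing an object's name prints its values.
area_population
area_year

# Functions work on vectors just as they do on single values.
mean(area_population)
```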
However, what if you want information on population and year in a single object? This is what datasets are for.
A dataset stores multiple pieces of information for each observation. An observation can be any individual thing you want to study - a country, a person, a book, etc. For example, let’s use one of the datasets built into the tidyverse package. Load the tidyverse package with library(tidyverse). You will need to do this at the start of any script where you want to use tidyverse.
The tidyverse package comes with a built-in dataset called population. This object became available when you ran library(tidyverse). Go to the console in RStudio (the panel called “Console” with the > character at the bottom) and type population. It will print out a dataset that looks like this:
## # A tibble: 4,060 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # … with 4,050 more rows
Notice that this dataset has both rows and columns. 4,060 x 3 tells you that there are 4,060 rows (up to down) and 3 columns (left to right). We can see the first ten rows in the output above. Each row in the dataset tells you about the population of a country in a particular year. So, the population of Afghanistan in 2003 was 23,116,142.
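If you want to check the number of rows and columns without reading the printed header, base R has functions for that. A short sketch, assuming the tidyverse is installed:

```r
library(tidyverse)

# Number of rows and number of columns, reported separately.
nrow(population)
ncol(population)

# Both at once: rows first, then columns.
dim(population)

# The column names.
names(population)
```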
In the console, type View(population) to look at your dataset. This will open up a new window where you can look at the values. It is a good idea to always look through your datasets.
You can do many things with datasets. Often, we will want to use functions on datasets. For example, the slice() function takes two arguments, a dataset and a row number n, and returns the row at position n.
For example, to get the 5th row of population, type:
## # A tibble: 1 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1999 19987071
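The output above is a single row: the fifth row of population. The call that produced it is not shown, but slice(population, 5) would return that row. If you want the first five rows instead, one option is to pass the range 1:5. The object names fifth_row and first_five are only for illustration:

```r
library(tidyverse)

# Row number 5 only (a 1-row dataset).
fifth_row <- slice(population, 5)
fifth_row

# Rows 1 through 5 (a 5-row dataset).
first_five <- slice(population, 1:5)
first_five
```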
%>% Operator
For various reasons, we will often want to chain multiple functions together. The %>% operator (called “pipe”) makes this very easy. Remember the tidyverse library? This gives us access to the %>% operator.
The %>% operator passes the object on the left as the first argument to the function on the right. For example, both of these lines do the exact same thing:
## # A tibble: 1 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1999 19987071
## # A tibble: 1 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1999 19987071
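The two lines that produced the identical outputs above are not shown. A sketch of what such a pair could look like, assuming the function is slice() with row number 5 (the names direct and piped are hypothetical):

```r
library(tidyverse)

# Ordinary function call: the dataset is the first argument.
direct <- slice(population, 5)

# Piped call: the dataset on the left is passed in as the first argument.
piped <- population %>% slice(5)

direct
piped

# Compare the two results.
all.equal(direct, piped)
```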
You can use this operator anywhere we have used functions before, like:
## [1] 4
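The expression that produced the 4 above is not shown, so the following is only one possibility. It pipes a made-up vector into mean(), which is an assumption chosen for illustration:

```r
library(tidyverse)

# Made-up vector; the pipe passes it to mean() as the first argument.
c(2, 4, 6) %>% mean()

# The same computation written as an ordinary function call.
mean(c(2, 4, 6))
```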
Write code to find the first 10 rows of population.
Try to run population + 1. What happens? That’s an error message - read it! Most importantly, why isn’t this working? What is the message trying to tell you?
filter()
You can use many functions to interact with datasets too. You will often want to work with only observations that meet a certain condition. For example, only countries in a particular year, or only people over the age of 45, or only Democratic political candidates.
filter() makes a smaller dataset by returning only the rows that meet a certain condition. For example, the population dataset has many countries in it. What if you only want the values from France? You can use filter() to select only the observations that have “France” as the value for country.
## # A tibble: 19 x 3
## country year population
## <chr> <int> <int>
## 1 France 1995 58008958
## 2 France 1996 58216225
## 3 France 1997 58418324
## 4 France 1998 58635564
## 5 France 1999 58894671
## 6 France 2000 59213096
## 7 France 2001 59600714
## 8 France 2002 60047743
## 9 France 2003 60527640
## 10 France 2004 61002537
## 11 France 2005 61444972
## 12 France 2006 61845239
## 13 France 2007 62210877
## 14 France 2008 62552614
## 15 France 2009 62888318
## 16 France 2010 63230866
## 17 France 2011 63582112
## 18 France 2012 63936575
## 19 France 2013 64291280
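The call that produced the France output is not shown. A sketch that would return those rows, assuming the condition compares the country column to "France" with == (the name france is hypothetical):

```r
library(tidyverse)

# Keep only the rows where the country column equals "France".
france <- population %>% filter(country == "France")
france
```

Note the double equals sign: == asks whether two values are equal, while a single = is used for naming arguments.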
filter() will evaluate the condition in the parentheses to TRUE or FALSE for every single row in the dataset. The function will return only rows which evaluate to TRUE.
To see why this works, imagine a vector like:
## [1] TRUE FALSE FALSE
Using the == operator on a vector creates one TRUE or FALSE value for each entry of the vector. When you use it on a column in a dataset like in the filter() example above, it will return one TRUE or FALSE for each row in the dataset. Then, filter() only returns the rows with a TRUE value.
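The vector behind the TRUE FALSE FALSE output is not shown. Here is a small sketch of the idea using a made-up vector called countries (a hypothetical name and hypothetical values):

```r
# Made-up vector of three country names.
countries <- c("France", "Greece", "Spain")

# == compares every entry to "France": one TRUE or FALSE per entry.
is_france <- countries == "France"
is_france

# Keeping only the entries marked TRUE mirrors what filter() does with rows.
countries[is_france]
```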
Use filter() to access the rows in population in the year 1999.
Use filter() to access the rows in population with fewer than 5,000 people. Assign the result to an object called small_countries.
Use filter() to access the rows in population for every country other than Greece.
mutate()
The mutate() function is used to make new columns in your data. For example, suppose we wanted a new column in our dataset that indicates whether a country had at least a million people in a given year:
## # A tibble: 4,060 x 4
## country year population million
## <chr> <int> <int> <lgl>
## 1 Afghanistan 1995 17586073 TRUE
## 2 Afghanistan 1996 18415307 TRUE
## 3 Afghanistan 1997 19021226 TRUE
## 4 Afghanistan 1998 19496836 TRUE
## 5 Afghanistan 1999 19987071 TRUE
## 6 Afghanistan 2000 20595360 TRUE
## 7 Afghanistan 2001 21347782 TRUE
## 8 Afghanistan 2002 22202806 TRUE
## 9 Afghanistan 2003 23116142 TRUE
## 10 Afghanistan 2004 24018682 TRUE
## # … with 4,050 more rows
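The call that produced the million column is not shown. A sketch that would produce a logical column with that name, assuming the cutoff is written as 1000000 and compared with >= (the name with_million is hypothetical):

```r
library(tidyverse)

# Add a TRUE/FALSE column: is the population at least one million?
with_million <- population %>% mutate(million = population >= 1000000)
with_million
```

Inside mutate(), the name population refers to the column of the dataset, not to the dataset itself.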
Or maybe a column that counts population in millions (the new column is named pop_thousand in the output below, but its values are in millions):
## # A tibble: 4,060 x 4
## country year population pop_thousand
## <chr> <int> <int> <dbl>
## 1 Afghanistan 1995 17586073 17.6
## 2 Afghanistan 1996 18415307 18.4
## 3 Afghanistan 1997 19021226 19.0
## 4 Afghanistan 1998 19496836 19.5
## 5 Afghanistan 1999 19987071 20.0
## 6 Afghanistan 2000 20595360 20.6
## 7 Afghanistan 2001 21347782 21.3
## 8 Afghanistan 2002 22202806 22.2
## 9 Afghanistan 2003 23116142 23.1
## 10 Afghanistan 2004 24018682 24.0
## # … with 4,050 more rows
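The call behind this output is not shown either. A sketch that would give these values, assuming the column is computed by dividing population by 1000000 (the name in_millions is hypothetical; the column name pop_thousand is copied from the output):

```r
library(tidyverse)

# Divide by one million so that 17586073 becomes roughly 17.6.
in_millions <- population %>% mutate(pop_thousand = population / 1000000)
in_millions
```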
You can also download and use your own datasets. R can work with a ton of formats you may have seen in the past (.csv, .pdf, .xlsx, etc.). We will download a CSV from the internet and read it into R. Here is a link to a Google Sheet that I made. This contains presidential election results for every state from 1932-2016.1
Download the spreadsheet to your computer with File –> Download –> .csv. Save the dataset in the same folder where you stored this lesson_2.Rmd document.
Then, we are going to use the read_csv() function to read this dataset. We will store it in an object called elections. The argument to this function is the path to the file you want to read. Since you placed it in the same folder, the path is simply the name of the file. After typing the quotes, you can press the tab key on your keyboard to show your list of options. If you press tab again, R will autocomplete.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## state = col_character(),
## abb = col_character(),
## democrat = col_double(),
## year = col_double(),
## region = col_character()
## )
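The read_csv() call that printed the column specification above is not shown, and the name of the downloaded file is not given here. The sketch below therefore writes a tiny placeholder CSV with two made-up rows and reads it back, so that it runs without the real file. Only the column names come from the output above. The file name, the object name elections_example and all values are invented for illustration:

```r
library(tidyverse)

# Write a placeholder CSV (made-up rows, real column names) to a temporary folder.
csv_path <- file.path(tempdir(), "elections_example.csv")
writeLines(c(
  "state,abb,democrat,year,region",
  "Example State,EX,50.5,2000,Example Region",
  "Another State,AN,45.2,2000,Example Region"
), csv_path)

# read_csv() takes the path to the file, in quotes, as its argument.
elections_example <- read_csv(csv_path)
elections_example
```

With the real download saved next to your .Rmd file, the path would simply be the file's name in quotes, and you would assign the result to elections.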
Look at the dataset with View(). What do the rows and columns represent?
Use filter() to only select results for New Jersey and store the answer in an object called nj.
Use filter() to make an object called south that only contains results from states in the south.
Explain why the code below returns no rows (it may be helpful to look at the dataset with View(elections)):
## # A tibble: 0 x 5
## # … with 5 variables: state <chr>, abb <chr>, democrat <dbl>, year <dbl>,
## # region <chr>