Setup

Verify that you have a folder that you saved the data into. This will also be where you save your R or Rmd files from lab.
getwd() tells you the current working directory

To choose the working directory in RStudio: Session > Set Working Directory > Choose Directory..., then pick the folder you want.

Or use full folder path: setwd("C:/Users/yourname/Desktop/PPOL105")

getwd() # tells you your current working directory

# setwd() # To change your working directory to the folder you want

You only need to install each package once.

# Reminder: put a # at the beginning of the line and R will ignore the line

# install.packages("tidyverse")

However, installing a package only puts the files on your computer. In order to actually use the features of a package you need to load it into memory during your current R session using the library() function. You can then use the Tidyverse functions and other tools. Load the tidyverse package:

library("tidyverse")
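If you do not want to remember whether a package is already installed, one possible approach is to check first and install only when it is missing. This is a minimal sketch using base R's requireNamespace(); the name needs_install is just an illustrative variable, not something from the lab.

```r
# Check whether tidyverse is already installed (this does not load it)
needs_install <- !requireNamespace("tidyverse", quietly = TRUE)

if (needs_install) {
  # Only runs when the package is missing; needs an internet connection
  install.packages("tidyverse")
}

library(tidyverse)  # load the package for the current R session
```

The install step runs at most once per computer, while library() still has to run in every new session.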

Our lab example will use historical employment data for Canada from January 1976 through January 2021. Download the file from Blackboard’s Week 10 Folder. Save it in a folder for this course.

Our main data set for analysis is the worksheet “Data for Analysis”, which includes the following variables:

Data: Canadian employment data, 1976-2021

  • MonthYr: the month and year of the observation.
  • Population: the civilian, non-institutionalized working-age population of Canada at that time, in thousands.
  • Employed: the total number employed in the population, in thousands.
  • Unemployed: the total number unemployed in the population, in thousands.
  • LabourForce: the sum of Employed and Unemployed.
  • NotInLabourForce: the difference between Population and LabourForce.
  • UnempRate: the percentage of the labor force that is unemployed. It is calculated and stored as a decimal (ranging from 0.0 to 1.0).
  • LFPRate: the percentage of the population that is in the labor force. It is calculated and stored as a decimal (ranging from 0.0 to 1.0).
  • Party: the political party in control of the Federal government. If the party in control changed during the month, it is listed as “Transfer.”
  • PrimeMinister: the name of the Prime Minister. If the prime minister changed during the month, it is listed as “Transfer.”
  • AnnPopGrowth: the rate of population growth over the previous 12 months, calculated as a proportion and displayed as a percentage. Note that this variable is blank for the first 12 months of the data set.
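The definitions above can be checked with a little arithmetic. The sketch below uses hypothetical round numbers for a single month (they are not taken from the data file) to show how the calculated variables relate to each other.

```r
# Hypothetical values for one month, in thousands (not from the data file)
Population <- 30000
Employed   <- 18000
Unemployed <- 2000

LabourForce      <- Employed + Unemployed      # 20000
NotInLabourForce <- Population - LabourForce   # 10000
UnempRate        <- Unemployed / LabourForce   # 0.10, i.e. 10%
LFPRate          <- LabourForce / Population   # about 0.667, i.e. 66.7%
```

Notice that UnempRate divides by the labor force, while LFPRate divides by the whole population.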

Opening Data in R

EmpData <- read_csv("EmploymentData.csv")  # read_csv() comes from the readr package, which is part of the tidyverse

What information can be gained from the output above?
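If the output scrolls by too quickly, a few commands can show the same information again. This is a small sketch on a made-up three-column file written to a temporary location; the object names toy_file and toy and the values in the file are hypothetical.

```r
library(tidyverse)

# Write a tiny made-up CSV to a temporary file (values are hypothetical)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("MonthYr,Employed,Party",
             "1/1/1976,9600,Liberal",
             "2/1/1976,9650,Liberal"),
           toy_file)

toy <- read_csv(toy_file)  # prints the column types that readr guessed

dim(toy)      # number of rows, then number of columns
glimpse(toy)  # one line per variable: name, type, and first few values
```

In this sketch MonthYr is read as character (text) because dates written as month/day/year are not recognized automatically.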

Tidyverse Practice

The Tidyverse includes multiple core functions for modifying data:

  • mutate() allows us to add or change variables.
  • filter() allows us to select particular observations, much like Excel’s filter tool.
  • arrange() allows us to sort observations, much like Excel’s sort tool.
  • select() allows us to select particular variables.
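  • summarize() allows us to collapse many observations into summary statistics.
  • group_by() allows us to split observations into groups, so that later steps are applied to each group separately.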
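Since summarize() and group_by() are not used until later, here is a rough sketch of how they usually work together. The tibble toy and its values are made up for illustration and are not part of the lab data.

```r
library(tidyverse)

# Made-up toy data, not the lab file
toy <- tibble(
  Party     = c("A", "A", "B", "B", "B"),
  UnempRate = c(0.06, 0.08, 0.05, 0.07, 0.09)
)

party_summary <- toy %>%
  group_by(Party) %>%                     # one group per party
  summarize(months   = n(),               # number of rows in each group
            avg_rate = mean(UnempRate))   # average rate within each group

party_summary  # one row per party
```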

These functions follow a common syntax that is designed to work with a convenient Tidyverse tool called the “pipe” operator.

The pipe operator is part of the Tidyverse and is written %>%. Recall that an operator is just a symbol like + or * that performs some function on whatever comes before it and whatever comes after it.

To see how it works, I’ll show you a few examples:

# This is equivalent to names(EmpData)
EmpData %>%
    names()
# This is equivalent to sqrt(2)
2 %>%
    sqrt()
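Pipes become more useful when several steps are chained. The sketch below, using a made-up vector x, compares a nested call with the piped version of the same calculation.

```r
library(tidyverse)

x <- c(1, 4, 9)

# Nested version: read from the inside out
nested <- sum(sqrt(x))

# Piped version: read from top to bottom
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)  # TRUE: both are 6
```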

Renaming Variables

Rename the variable LabourForce to laborforce. What syntax do you need for this command? Use ?rename() to see how you should type the command.

  • Using ?function_name() or help(function_name) opens up the documentation that goes with the function or package.
?help # opens up the documentation for using the help() function. 

# same result as:
help(help)
# if you knit a markdown document with these commands in the code, 
# then the documentation will open in a web browser
?rename()
help("rename")

It appears the syntax for the rename() command is: rename(dataset_name, variable_new_name = variable_old_name).
This command (and almost all other tidyverse commands) also works with pipe syntax (%>%). When using pipes, you pass your data through the pipe and then tell R which commands you want to use on the variables within that dataset: dataset_name %>% rename(variable_new_name = variable_old_name)

Rename the variable LabourForce to laborforce:

# rename(new_name = old_name)
EmpData <- EmpData %>%
  rename(laborforce = LabourForce)

Rename NotInLabourForce to not_laborforce:

EmpData <- EmpData %>% 
  rename(_____ = _______)

Univariate Tables

How many observations are there for each Prime Minister? The table() command is a great way to quickly know how many observations there are for categories of a variable.

table(EmpData$PrimeMinister)

The table above tells us how many observations there are in the dataset for each Prime Minister.

Based on the table, how many observations are there for each party? What does one observation represent in the context of our data? What level of measurement is appropriate for “party”? What type of variable is party stored as in R? How can we find this out?

table(EmpData$Party)

If the table() command shows you the number of observations for EACH category of a variable, why should we be hesitant to use it with a continuous variable? Feel free to try it below:

table(EmpData$_______ )

Subsetting

If I wanted to work with only a few of the variables, I could make a smaller dataframe by selecting the variables and saving them as a new object.

subset <- EmpData %>% 
  select(MonthYr, Population, Employed, Unemployed)

Is it in your environment? How many variables and observations are there in your new subset?

If you knew that you wanted the first four variables, you could also select them like this:

subset <- EmpData %>%
  select(MonthYr:Unemployed) # selects all variables MonthYr through Unemployed
subset    # view the subset you created

Make another subset dataframe and name it subset2. Select the variables MonthYr, PrimeMinister, and Party and save them as subset2. Don’t forget to view your new subset to check your work!

subset2 <- EmpData %>% 
  select(MonthYr, PrimeMinister, Party)
subset2

Joining Data

We want to join observations in one tibble with observations in another tibble.

joined <- left_join(subset, subset2)
# if you run this line, you will get a message saying: Joining, by = "MonthYr" 
# R sees that both tibbles have a column with the same name and uses it as the unique identifier.

joined      # add dimensions of object

How many observations are there? Add a comment in the code to indicate the dimensions of the new object.

If you had differently named variables that you were going to use to merge data, you would need to identify the variable names in the command. Run ?left_join to see the syntax for this command.

?__

Scroll through the Arguments of the mutating-joins commands and look at what the argument by = does.

Now use the full_join command to combine subset and subset2. Include the by = argument in your command.
full_join(item1, item2, by = c("item1_VariableName" = "item2_VariableName"))

joined2 <- full_join(____, _____, by = c())

How many observations are there?

What is the difference between the results from the left_join and full_join in this situation? Given how we created and then rejoined the subsets, should we expect a difference or should there be the same number of observations as when we originally opened the data file?

You can also join using multiple variables! (Hint, hint.)

If you wanted to join two datasets by using two variables like (HYPOTHETICALLY) the state and year, you could use the join commands to do so. You do not need to create a unique identifier like we did in Excel. You can instead tell R which columns to use during the joining process and it will look to see if there is a match for both columns before joining the data. If both items that you are joining have variables with the same names, then you can use a line of code similar to this: inner_join(item1, item2, by = c("VariableName1", "VariableName2"))

If both items that you are joining do NOT have variables with the same names, then you can use a line of code similar to this: inner_join(item1, item2, by = c("item1_VariableName1" = "item2_VariableName1", "item1_VariableName2" = "item2_VariableName2")) Note: this uses inner_join, which may not be what you need for what you are trying to do.
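Here is a small sketch of both cases. The tibbles pop_data, jobs, and jobs2, and all of the values in them, are made up for illustration.

```r
library(tidyverse)

# Made-up state-by-year data, for illustration only
pop_data <- tibble(state = c("VA", "VA", "MD"),
                   year  = c(2019, 2020, 2020),
                   pop   = c(8.5, 8.6, 6.2))
jobs <- tibble(state = c("VA", "MD", "MD"),
               year  = c(2020, 2019, 2020),
               jobs  = c(4.0, 2.9, 3.0))

# Same key names in both tibbles: list them as quoted strings
same_names <- inner_join(pop_data, jobs, by = c("state", "year"))

# Different key names: "name in first tibble" = "name in second tibble"
jobs2 <- jobs %>% rename(st = state, yr = year)
diff_names <- inner_join(pop_data, jobs2, by = c("state" = "st", "year" = "yr"))

same_names  # only VA 2020 and MD 2020 match on BOTH state and year
```

Both versions keep 2 rows, because a row is only joined when the state and the year match at the same time.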

Visualizing different join functions

There are multiple ways to join data.

[Figure: joined data options]

inner_join(): returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join(): returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the one missing.

left_join(): returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(): returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
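To see the four joins side by side, you could try them on two tiny tibbles that only partly overlap. This is a sketch with made-up tibbles x and y; only ids 2 and 3 appear in both.

```r
library(tidyverse)

# Made-up tibbles: ids 2 and 3 are in both, id 1 only in x, id 4 only in y
x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), y_val = c("B", "C", "D"))

nrow(inner_join(x, y, by = "id"))  # 2 rows: ids 2 and 3
nrow(left_join(x, y, by = "id"))   # 3 rows: ids 1, 2, 3
nrow(right_join(x, y, by = "id"))  # 3 rows: ids 2, 3, 4
nrow(full_join(x, y, by = "id"))   # 4 rows: ids 1, 2, 3, 4

left_join(x, y, by = "id")         # y_val is NA for id 1, which has no match in y
```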

Another way to visualize which data is kept:

[Figure: Venn diagram of joined data]

A third way to visualize which data is kept:

[Figure: table cells of joined data]


Mutate

Another data transformation function is mutate(), which allows us to change or add variables. We will discuss working with dates in R more in the future, but for now, start by changing the MonthYr variable from character (text) to date, using the as.Date() function.
If you have extra time at the end of lab, look at the different arguments that exist for as.Date(). There are also entire packages for working with and cleaning dates, such as lubridate.

Check the class of EmpData$MonthYr. Add a comment next to the command to indicate what the output was.

class()

When the data was originally read in from the CSV file, MonthYr was stored as a character variable. Like Excel, R has an internal representation of dates that allows for correct ordering and calculations, but displays dates in a standard human-readable format. Run the line of code below to turn the MonthYr type from a character to date.

# Change MonthYr to date format using as.Date()
EmpData %>%
    mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"))

As you can see, the MonthYr column is now labeled as a date rather than text. Now R knows it is a date.
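To get a feel for what "internal representation" means, you could try as.Date() on a couple of single values. This is a small base R sketch; d1 and d2 are illustrative names.

```r
# Parse text in month/day/year form: %m = month, %d = day, %Y = 4-digit year
d1 <- as.Date("1/1/1976", "%m/%d/%Y")
d2 <- as.Date("2/1/1976", "%m/%d/%Y")

class(d1)        # "Date"
d1               # displayed as "1976-01-01"
as.numeric(d1)   # 2191: stored internally as days since 1970-01-01
d2 - d1          # date arithmetic works: a difference of 31 days
d2 > d1          # ordering works too: TRUE
```

None of these calculations would work correctly on the original character version of the variable.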

How else does the MonthYr data change?

If I wanted to keep the original variable and create a new variable for the date with the new formatting, I could make a minor change to the code (below). Remember, these commands follow the new_variable = old_variable syntax that is common in tidyverse.

# creates a new variable called "date" that formats EmpData$MonthYr
EmpData %>%
    mutate(date = as.Date(MonthYr, "%m/%d/%Y")) 

How many columns are there now? Where did our new variable go?

Mutate can be used to perform calculations on existing variables and create new variables. For example, suppose we also want to create versions of UnempRate and LFPRate that are expressed in percentages rather than decimal units:

# Create UnempPct and LFPPct
EmpData %>%
  select(MonthYr, UnempRate, LFPRate) %>%  # keeps these 3 variables 
    mutate(UnempPct = 100 * UnempRate) %>% # creates a new variable
    mutate(LFPPct = 100 * LFPRate)        # creates a new variable

If the number of decimals is bothering you, you can use round() and indicate the number of digits after the decimal point. This matters more when making our final summary tables in future lectures.

# round the decimals to two digits
EmpData %>%
  select(MonthYr, UnempRate, LFPRate) %>%  # keeps these 3 variables
  mutate(UnempPct = round(100 * UnempRate, digits = 2),  # creates a new variable
         LFPPct = round(100 * LFPRate, digits = 2 ))   # creates a new variable 

So far we have not made any permanent changes to EmpData. Each command above created a new object based on EmpData and displayed it on the screen, but did not store it anywhere. In order to change EmpData itself, we need to assign that new object back to EmpData with the <- operator. Also notice in the code below that I used the command mutate() once and made three variables inside that one command.

# Make permanent changes to EmpData
EmpData <- EmpData %>%
    mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"),
           UnempPct = 100 * UnempRate,
           LFPPct = 100 * LFPRate)

EmpData # check your work. Look at the data
# 541 rows, 13 columns
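If the difference between displaying and saving is still fuzzy, this sketch on a made-up two-row tibble called toy may help.

```r
library(tidyverse)

toy <- tibble(rate = c(0.05, 0.10))      # made-up values

toy %>% mutate(pct = 100 * rate)         # result is printed, but toy is unchanged
ncol(toy)                                # still 1 column

toy <- toy %>% mutate(pct = 100 * rate)  # result is saved back into toy
ncol(toy)                                # now 2 columns
```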

Filter and Arrange

Now let’s suppose we want to know more about the months in our data set with the highest unemployment rates. We can use filter() for this purpose:

# This will give all of the observations with unemployment rates over 12.5%
EmpData %>%
    filter(UnempPct > 12.5)

Interpretation of output: only 8 of the 541 months in our data have unemployment rates over 12.5% - the worst months of the 1982-83 recession, and April and May of 2020.

Now suppose that we only want to see a few pieces of information about those months. We can use select() again to choose variables to display in the output:

# This will leave out all variables except the ones mentioned in the select() command
EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempRate, LFPPct, PrimeMinister)

What happens if you put the select() command before the filter() command in the code below? Why?

EmpData %>%
  select(MonthYr, UnempRate, LFPPct, PrimeMinister) %>%
  filter(UnempPct > 12.5)

Finally, suppose that we want to show months in descending order by unemployment rate (i.e., the highest unemployment rate first). We can use arrange() to sort rows in this way:

EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
    arrange(desc(UnempPct)) # arrange() sorts in ascending order by default; desc() reverses it

Hopefully you can see why the pipe operator is useful in making our code clear and readable. Compare:

EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
    arrange(desc(UnempPct))

# This is what the same code looks like without the pipe
arrange(select(filter(EmpData, UnempPct > 12.5), MonthYr, UnempPct, LFPPct, PrimeMinister), desc(UnempPct))
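Keep in mind that arrange() sorts from smallest to largest unless you wrap the variable in desc(). The sketch below uses a made-up three-row tibble called toy to show desc() and to confirm that the piped and nested styles give the same result.

```r
library(tidyverse)

toy <- tibble(month    = c("Jan", "Feb", "Mar"),   # made-up values
              UnempPct = c(12.6, 13.4, 12.9))

piped <- toy %>%
  filter(UnempPct > 12.5) %>%
  arrange(desc(UnempPct))        # desc() puts the largest value first

nested <- arrange(filter(toy, UnempPct > 12.5), desc(UnempPct))

all.equal(piped, nested)         # TRUE: same result, different style
piped$month                      # "Feb" "Mar" "Jan"
```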

Some other helpful commands when you are renaming columns are tolower() and toupper(), which transform a string into all lower case or all upper case. If you want to change all the column names, you can use them with the colnames command.

This might be helpful if the columns mix upper and lower case and you cannot remember how you named your variables.

colnames(EmpData) # view list of all column names in dataframe

colnames(EmpData) <- tolower(colnames(EmpData)) # makes all column names in EmpData lowercase. 

colnames(EmpData) # Double check if you achieved your task

Note: If you change all of the columns to lower case and then try to go back and rerun some of the lines of code above again, they will not work anymore. This is because R is case sensitive and your code no longer matches the name of the variables.

# This code worked above, but will not work after you change the capitalization of the variables
EmpData %>% 
  select(MonthYr, PrimeMinister, Party)

Usually, if I am going to change the capitalization of the variables, I do so right in the beginning and then am consistent throughout the rest of my code.
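If you prefer to stay inside a pipe, dplyr's rename_with() can apply a function such as tolower() to every column name. This is a sketch on a made-up one-row tibble; toy and toy_lower are illustrative names.

```r
library(tidyverse)

toy <- tibble(MonthYr = "1/1/1976", PrimeMinister = "Pierre Trudeau")

toy_lower <- toy %>%
  rename_with(tolower)   # applies tolower() to every column name

colnames(toy_lower)      # "monthyr" "primeminister"
```

Saving the result under a new name, as done here, leaves the original capitalization available if you need it.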

Saving code

Because it is command-based, R enables an entirely different and much more reproducible model for data cleaning and analysis. In Excel, the original data, the data cleaning, the data analysis, and the results are all mixed together in a single file. This is usually okay for simple things, but it can be a disaster in complex projects. In contrast, R allows you to have three separate files or groups of files:

  1. The original data, which you do not change.
  2. The code to clean and analyze the data, which you maintain carefully (including version control). Your code can be saved in an R script, or as part of an R Markdown document. You can have major tasks separated into multiple R files (ex. cleaning code vs. analysis code).
  3. The results of your code, which you treat as temporary files that can be deleted or replaced at any time.

The key is to make sure that all of your cleaned data and results can be regenerated from the original data at any time by running your code.
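As a rough sketch of that three-part layout, the code below builds a hypothetical project folder inside a temporary directory. The folder names data and results, the file names, and the values are all assumptions for illustration; your own project can be organized differently.

```r
library(tidyverse)

# Hypothetical folder layout, created in a temporary directory for this sketch
project <- file.path(tempdir(), "lab_project")
dir.create(file.path(project, "data"),    recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "results"), recursive = TRUE, showWarnings = FALSE)

# 1. Original data: written here only so the sketch can run on its own
raw_path <- file.path(project, "data", "raw.csv")
write_csv(tibble(UnempRate = c(0.06, 0.08)), raw_path)

# 2. Code: read the raw file and clean it, without ever overwriting it
clean <- read_csv(raw_path) %>%
  mutate(UnempPct = 100 * UnempRate)

# 3. Results: safe to delete, because rerunning the code recreates them
write_csv(clean, file.path(project, "results", "clean.csv"))
```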

You can stop here unless you have a very inquisitive mind. These commands will be discussed in future lectures and in the next lab.

If you got through this during lab, feel free to work on the homework together.



Bonus Content

Analyzing Data in R

The summary function

The summary() function will give a basic summary of any object. Exactly what that summary looks like depends on the object. For tibbles, summary() produces a set of summary statistics for each variable. Hopefully these statistics look familiar.

summary(EmpData)

Refreshing the data

Because all variable names were changed to lowercase in the earlier code, the code below will not work. To solve this, we can read the data in again to have a “fresh start”:

EmpData <- read_csv("EmploymentData.csv")  %>%
  mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"),
         UnempPct = 100 * UnempRate,
         LFPPct = 100 * LFPRate)

Univariate Statistics

The R function mean() calculates the sample average of any numeric vector:

# Mean of a single variable
mean(EmpData$UnempPct)
# Var calculates the sample variance
var(EmpData$UnempPct) #2.923

Remember that the square root of the variance is equal to the standard deviation:

#square root of variance
sqrt(2.923)
#  calculates the standard deviation
sd(EmpData$UnempPct) # 1.71
# median() calculates the sample median
median(EmpData$UnempPct) # 7.69% was median unemployment rate

In real-world data, some variables have missing values for one or more observations. For example, the AnnPopGrowth variable in our data set is missing for the first year of data (1976), since calculating the growth rate for 1976 would require data from 1975. In R, missing values are given the special value NA which stands for “not available”.

EmpData %>%
  select(MonthYr, Population, AnnPopGrowth)%>%
  head()

When we try to take the mean of this variable we also get NA:

mean(EmpData$AnnPopGrowth)

This is because missing values propagate in R: almost any calculation involving NA also results in NA, much like NaN in the IEEE-754 standard for floating-point arithmetic. Some other applications silently drop missing data from the calculation.

Whenever you have missing values, you should investigate before proceeding. Sometimes (as in our case here), missing values are for a good reason, other times they are the result of a mistake or problem that needs to be fixed. Once we have investigated the missing values, we can tell R explicitly to exclude them from the calculation by adding the na.rm = TRUE option:

mean(EmpData$AnnPopGrowth, na.rm = TRUE)
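Before dropping missing values, it can help to count them. The sketch below uses a made-up vector called growth with two missing entries.

```r
growth <- c(NA, NA, 0.012, 0.014)  # made-up values with two missing entries

mean(growth)                 # NA, because the missing values propagate
sum(is.na(growth))           # 2 missing values to investigate
mean(growth, na.rm = TRUE)   # 0.013, the mean of the non-missing values
```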

Suppose we want to calculate the sample average for each column in our tibble. We could just call mean() for each of them, but there should be a quicker way. Here is the code to do that:

# Mean of each column
EmpData %>%
    select(where(is.numeric)) %>%
    lapply(mean, na.rm = TRUE)

I would not expect you to come up with this code, but maybe it kind of makes sense.

  • The select(where(is.numeric)) step selects only the columns in EmpData that are numeric rather than characters or dates
  • The lapply(mean, na.rm=TRUE) step calculates mean(x,na.rm=TRUE) for each (numeric) column x in EmpData.

We can use this method with any function that calculates a summary statistic:

# Standard deviation of each column that was numeric
EmpData %>%
    select(where(is.numeric)) %>%
    lapply(sd, na.rm = TRUE)
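If you would rather get a tibble back than a list, one alternative (not the approach used above) is summarize() together with across(). This is a sketch on a made-up tibble; toy and col_means are illustrative names.

```r
library(tidyverse)

toy <- tibble(name = c("a", "b", "c"),   # made-up values
              x    = c(1, 2, NA),
              y    = c(10, 20, 30))

col_means <- toy %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

col_means   # a one-row tibble: x = 1.5, y = 20
```

The character column name is skipped because where(is.numeric) only picks the numeric columns.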

Frequency Tables

We can also construct frequency tables for both discrete and continuous variables:

# COUNT creates a frequency table for discrete variables
EmpData %>% count(PrimeMinister)

This is similar to the output from using table(EmpData$PrimeMinister)

table(EmpData$PrimeMinister)

Maybe you want to know how many months had unemployment rates within particular ranges. The code below tells R to split the range of UnempPct into 6 equal-width bins and count the observations in each one. R created these bins automatically, but they can be customized.

# COUNT and CUT_INTERVAL create a binned frequency table
EmpData %>%
    count(cut_interval(UnempPct, 6))
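One way to customize the bins is to supply your own break points with base R's cut(). This is a sketch with made-up unemployment values; the break points 4, 6, 8, 10, and 14 are arbitrary choices for illustration.

```r
library(tidyverse)

toy <- tibble(UnempPct = c(5.4, 6.1, 7.8, 8.2, 9.9, 12.7))  # made-up values

binned <- toy %>%
  count(bin = cut(UnempPct, breaks = c(4, 6, 8, 10, 14)))   # bins chosen by hand

binned  # counts for (4,6], (6,8], (8,10], (10,14]
```

By default each bin includes its upper limit but not its lower limit, so a value of exactly 6 would land in (4,6].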

Graphs with ggplot

ggplot() comes from the ggplot2 package, which is part of the tidyverse.

Histogram of unemployment rate:

ggplot(data = EmpData, mapping = aes(x = UnempPct)) + geom_histogram()

The code below creates the exact same output as the code chunk above.

EmpData %>% 
  ggplot(aes(x = UnempPct)) + 
  geom_histogram()

Time series (line) graph:

ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) + 
  geom_line()

The ggplot() function has a non-standard syntax:

  • The first line sets up the basic characteristics of the graph:
    • The data argument tells R which data set (tibble) will be used
    • the mapping argument describes the basic aesthetics of the graph, i.e., the relationship in the data we will be graphing.
      • For the histogram, our aesthetic includes only one variable
      • For the line graph, our aesthetic includes two variables
  • The rest of the command is one or more statements separated by a + sign. These are called geometries and are geometric elements to be included in the plot.
    • The geom_histogram() geometry produces a histogram
    • The geom_line() geometry produces a line

A single graph can include multiple geometries, as we will see shortly.
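To make the "set up, then add geometries" structure concrete, here is a sketch on a made-up six-row tibble. Saving the graph as an object called p is optional and only done here so that it can be inspected.

```r
library(tidyverse)

toy <- tibble(month    = 1:6,                          # made-up values
              UnempPct = c(7.1, 7.4, 7.9, 8.3, 8.0, 7.6))

p <- ggplot(data = toy, mapping = aes(x = month, y = UnempPct)) +  # data and aesthetics
  geom_line() +    # first geometry: a line
  geom_point()     # second geometry: points drawn on the same axes

p                  # printing the object draws the graph
length(p$layers)   # 2, one layer per geometry
```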

Title and labels

ggplot(data = EmpData, 
       aes(x = MonthYr, 
           y = UnempPct)) + 
  geom_line() + 
  labs(title = "Unemployment rate",
    subtitle = "January 1976 - January 2021", 
    caption = "Source: Statistics Canada, Labour Force Survey",
    tag = "Canada") + 
  xlab("") + 
  ylab("Unemployment rate, %")

Color

You can change the color of any geometric element using the col= argument:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + 
  geom_line(col = "blue")

Colors can be given in ordinary words such as "blue", or as hexadecimal RGB codes such as "#0000FF". There are entire packages with more color theme options. There is even a package that has color themes based on National Parks.

Some geometric elements, such as the bars in a histogram, also have a fill color:

ggplot(data = EmpData,  aes(x = UnempPct)) + 
  geom_histogram(col = "red", fill = "blue")

As you can see, the col= argument sets the color for the exterior of each bar, and the fill= argument sets the color for the interior.

A graph can also layer several geometries. Here the unemployment rate (blue) and the labor force participation rate (red) are drawn on the same axes:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + 
  geom_line(col = "blue") +
  geom_line(aes(y = LFPPct), col = "red")

The same approach combines a histogram and a density curve with a title and labels:

ggplot(data = EmpData, 
       aes(x = UnempPct)) + 
  geom_histogram(binwidth = 0.5, fill = "blue") +
  geom_density() + 
  labs(title = "Unemployment rate", 
       subtitle = paste("January 1976 - January 2021 (",
                        nrow(EmpData), " months)", 
                        sep = "", collapse = ""), 
       caption = "Source: Statistics Canada, Labour Force Survey",
       tag = "Canada") + 
  xlab("Unemployment rate, %") + 
  ylab("Count")

Side note: This is the same as the code above, but look how hard it is to read:

ggplot(data = EmpData, aes(x = UnempPct)) + geom_histogram(binwidth = 0.5, fill = "blue") + geom_density() + labs(title = "Unemployment rate", subtitle = paste("January 1976 - January 2021 (", nrow(EmpData), " months)", sep = "", collapse = ""), caption = "Source: Statistics Canada, Labour Force Survey", tag = "Canada") + xlab("Unemployment rate, %") + ylab("Count")

Saving graphs

If you want to save the graph, you can use the ggsave() function. By default it saves the most recently displayed plot. Specify the name of the file that you want to use. You must include the file extension because it tells ggplot what format to save the file in. Common image formats are .pdf, .png, .jpeg, .tiff, and .bmp. The other arguments in the function specify that I want to save a file that is 7 wide by 5 high, and that the unit of measurement is inches. This will produce a 7 x 5 inch image and it will be saved in the current working directory with the name given in the function.

ggsave("practice image.png", width = 7, height = 5, units = "in")
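If you have made several graphs, it may be safer to tell ggsave() exactly which one to save with its plot = argument. This sketch uses a made-up tibble and writes to a temporary folder; p and out_file are illustrative names.

```r
library(tidyverse)

toy <- tibble(x = 1:3, y = c(2, 4, 3))            # made-up values
p <- ggplot(toy, aes(x = x, y = y)) + geom_line()

out_file <- file.path(tempdir(), "toy_plot.png")  # temporary location for this sketch
ggsave(out_file, plot = p, width = 7, height = 5, units = "in")

file.exists(out_file)   # TRUE once the file has been written
```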