Instructions

Your participation points come from emailing me the knitted Word document.

Lab Goals

  • data manipulation
  • make graphs
  • make summary table
  • recode numeric variable into a categorical variable
  • knit an html document
  • knit a Word document

Setup Chunk

Set the working directory (make sure the necessary files are in the correct folder) and load the packages.

When you load packages, many lines of output show up both in your console and in your final document when you knit it. If you want to hide that output but still show that you loaded the packages, add the chunk options warning = FALSE, message = FALSE to the chunk header.

library(tidyverse)
library(kableExtra)
library(lubridate)

Our lab example will use historical employment data for Canada from January 1976 through January 2021. Download the file from Blackboard’s Week 13 Folder. Save the data file and markdown file in a folder for this course.

  • MonthYr: the month and year of the observation.
  • Population: the civilian, non-institutionalized working-age population of Canada at that time, in thousands.
  • Employed: the total number employed in the population, in thousands.
  • Unemployed: the total number unemployed in the population, in thousands.
  • LabourForce: the sum of Employed and Unemployed.
  • NotInLabourForce: the difference between Population and LabourForce.
  • UnempRate: the percentage of the labor force that is unemployed. It is calculated and stored as a decimal (ranging from 0.0 to 1.0).
  • LFPRate: the percentage of the population that is in the labor force. It is calculated and stored as a decimal (ranging from 0.0 to 1.0).
  • Party: the political party in control of the Federal government. If the party in control changed during the month, it is listed as “Transfer.”
  • PrimeMinister: the name of the Prime Minister. If the prime minister changed during the month, it is listed as “Transfer.”
  • AnnPopGrowth: the rate of population growth over the previous 12 months, calculated as a proportion and displayed as a percentage. Note that this variable is blank for the first 12 months of the data set.

Opening Data in R

EmpData <- read_csv("EmploymentData.csv")  # from readr package included in tidyverse package
## 
## -- Column specification --------------------------------------------------------
## cols(
##   MonthYr = col_character(),
##   Population = col_double(),
##   Employed = col_double(),
##   Unemployed = col_double(),
##   LabourForce = col_double(),
##   NotInLabourForce = col_double(),
##   UnempRate = col_double(),
##   LFPRate = col_double(),
##   Party = col_character(),
##   PrimeMinister = col_character(),
##   AnnPopGrowth = col_double()
## )

What information can be gained from the output above? How is the data stored? Is it numeric? Text? Dates? Logical? etc. How do you know?
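
One way to check your answer is to ask R for the storage type of every column. The sketch below uses a tiny made-up tibble (toy is hypothetical and is not the lab data) that mimics three of the columns; tibble() is available because dplyr, which tidyverse loads, re-exports it.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of the employment data
toy <- tibble(
  MonthYr    = c("1/1/1976", "2/1/1976"),  # dates typed as text
  Population = c(16852, 16892),            # numbers
  Party      = c("Liberal", "Liberal")     # text
)

# class() of each column: "character" means text, "numeric" means numbers
col_types <- sapply(toy, class)
col_types
```

With the real data, the same idea would be sapply(EmpData, class), or glimpse(EmpData) for a compact overview.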

Tidyverse Practice

Renaming Variables

Rename the variable LabourForce to laborforce.

Remember: If you forget the syntax or different options that exist for a command, you can use ?_____ or highlight the command or package name and push F1. Both options open documentation on commands or packages in the Help panel.

The syntax for the rename() command is: rename(dataset_name, variable_new_name = variable_old_name).
This command (and almost all other tidyverse commands) also works with pipe syntax (%>%). When using pipes, you pass your data frame through the pipe and then tell R which commands you want to use on the variables within that data frame:
dataset_name %>% rename(variable_new_name = variable_old_name)

Rename the variable LabourForce to laborforce:

# rename(new_name = old_name)
EmpData <- EmpData %>%
  rename(laborforce = LabourForce)

Rename NotInLabourForce to not_laborforce:

EmpData <- EmpData %>% 
  rename(not_laborforce = NotInLabourForce)

Univariate Tables

The table() command is a great way to quickly know how many observations there are for categories of a variable.

How many observations are there for each Prime Minister?

Using the table command, how many observations are there for each party? What does one observation represent in the context of our data? What level of measurement is appropriate for “party”? What type of variable is party stored as in R? How can we find this out?
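
As a hint for the pattern (not the answer for the lab data), here is a minimal sketch on a short made-up vector. The vector party is hypothetical; with the real data you would pass a column such as EmpData$Party.

```r
# Made-up party labels, one per "month"
party <- c("Liberal", "Liberal", "Conservative", "Transfer", "Liberal")

counts <- table(party)  # number of observations in each category
counts

class(party)            # how R stores the variable
```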

What happens if you try to graph a categorical variable as a histogram?

hist(EmpData$Party) # doesn't work because the command needs a numeric variable
## Error in hist.default(EmpData$Party): 'x' must be numeric

If the table() command shows you the number of observations for EACH category of a variable, why should we be hesitant to use it with a continuous variable? Feel free to try it below:

table(EmpData$Unemployed)

Instead of a table, maybe try a histogram:

hist(EmpData$Unemployed)

# hist() is the basic way to get a histogram in R

Why is using a histogram to display this information better or worse than a frequency table?

What does the y-axis represent? (i.e. How many of what?) What does the x-axis represent? What is a better name you could give this histogram? (You do not actually have to add the title to the image). How wide is each bin?
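
If you want to control the bin width and labels yourself, hist() accepts breaks, main, and xlab arguments. The sketch below uses made-up numbers (unemployed is hypothetical, not the lab data), and the break points are an arbitrary choice for illustration.

```r
set.seed(1)
unemployed <- runif(50, min = 700, max = 2600)  # 50 made-up monthly values

h <- hist(unemployed,
          breaks = seq(600, 2700, by = 300),    # bin edges every 300 units
          main = "Monthly unemployed persons (thousands)",
          xlab = "Unemployed (thousands)")

bin_width <- unique(diff(h$breaks))  # distance between neighboring bin edges
bin_width
sum(h$counts)                        # bar heights add up to the number of observations
```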


Select

If I wanted to work with only a few of the variables, I could make a smaller data frame by selecting the specific variables I want and saving them as a new object.

subset <- EmpData %>% 
  select(MonthYr, Population, Employed, Unemployed)

Is it in your environment? How many variables and observations are there in your new subset?

In the code below, note that I didn’t save the output as anything. I just ran the code, so the output shows up but no new object is created.

EmpData %>% 
  select(MonthYr, Population, Employed, Unemployed)
## # A tibble: 541 x 4
##    MonthYr   Population Employed Unemployed
##    <chr>          <dbl>    <dbl>      <dbl>
##  1 1/1/1976      16852.    9637.       733 
##  2 2/1/1976      16892     9660.       730 
##  3 3/1/1976      16931.    9704.       692.
##  4 4/1/1976      16969.    9738.       713.
##  5 5/1/1976      17008.    9726.       720 
##  6 6/1/1976      17047.    9748.       721.
##  7 7/1/1976      17086.    9760.       780.
##  8 8/1/1976      17124.    9780.       744.
##  9 9/1/1976      17154.    9795.       737.
## 10 10/1/1976     17183.    9782.       783.
## # ... with 531 more rows

If I wanted to save the table in my environment, I would just add table_name <- to my code. That way it selects the variables I want, makes a table, and then stores it in the object that the arrow points to.

table_name <- EmpData %>% 
  select(MonthYr, Population, Employed, Unemployed)

Mutate

Dates

Check the class of EmpData$MonthYr. Add a comment next to the command to indicate what the output was.

class(EmpData$MonthYr) # originally stored as a character variable.  

When the data was originally read in from the CSV file, MonthYr was stored as a character. This needs to be changed if we want to graph change over time. (Although the values look like dates to humans, they are stored as text, so there isn’t a chronological order that R can see, graph, or analyze.)

Also notice that the information was formatted as Month/Day/Year. You have to include that information in your code so R knows how to read the numbers.
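
To see why text dates are a problem, here is a small sketch on three made-up date strings (raw is hypothetical). Sorting the text is alphabetical, while sorting after as.Date() is chronological.

```r
raw <- c("2/1/1976", "10/1/1976", "1/1/1977")  # month/day/year stored as text

sorted_text <- sort(raw)   # alphabetical order: February 1976 ends up last
sorted_text

parsed <- as.Date(raw, format = "%m/%d/%Y")  # tell R the order is month/day/year
sorted_dates <- sort(parsed)                 # chronological order
sorted_dates
class(parsed)
```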

Run the line of code below to turn the MonthYr type from a character to date.

# Change MonthYr to date format using as.Date() from base R
library(lubridate)
EmpData %>%
    mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"))
## # A tibble: 541 x 11
##    MonthYr    Population Employed Unemployed laborforce not_laborforce UnempRate
##    <date>          <dbl>    <dbl>      <dbl>      <dbl>          <dbl>     <dbl>
##  1 1976-01-01     16852.    9637.       733      10370.          6483.    0.0707
##  2 1976-02-01     16892     9660.       730      10390.          6502.    0.0703
##  3 1976-03-01     16931.    9704.       692.     10396.          6535     0.0665
##  4 1976-04-01     16969.    9738.       713.     10451.          6518.    0.0682
##  5 1976-05-01     17008.    9726.       720      10446.          6562     0.0689
##  6 1976-06-01     17047.    9748.       721.     10470.          6577.    0.0689
##  7 1976-07-01     17086.    9760.       780.     10539.          6546.    0.0740
##  8 1976-08-01     17124.    9780.       744.     10524.          6600.    0.0707
##  9 1976-09-01     17154.    9795.       737.     10532.          6622.    0.0699
## 10 1976-10-01     17183.    9782.       783.     10565.          6618.    0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## #   PrimeMinister <chr>, AnnPopGrowth <dbl>

Now MonthYr is stored as a date in R, but only temporarily in the output, because we didn’t save the data frame this way. The information displayed also changed: the date is now displayed as year-month-day.

Keep the original variable and create a new variable for the date with the new formatting in EmpData:

#using dollar sign syntax:
EmpData$date <- as.Date(EmpData$MonthYr, "%m/%d/%Y")
# using tidyverse syntax:
# creates a new variable called "date2" that converts MonthYr to a date
# then saves dataset with new variable over its previous version
EmpData <- EmpData %>%
    mutate(date2 = as.Date(MonthYr, "%m/%d/%Y")) 

How many columns are there now? Where did the new variables go?

Calculations

Mutate can be used to perform calculations on existing variables and create new variables.

Create versions of UnempRate and LFPRate that are expressed in percentages rather than decimal units. Remember that the lines of code below only create output in the document and code chunk; they do not save the changes to the original data.

# Create UnempPct and LFPPct
EmpData %>%
  select(MonthYr, UnempRate, LFPRate) %>%  # keeps these 3 variables 
    mutate(UnempPct = 100 * UnempRate) %>% # creates a new variable named UnempPct
    mutate(LFPPct = 100 * LFPRate)        # creates a new variable named LFPPct
## # A tibble: 541 x 5
##    MonthYr   UnempRate LFPRate UnempPct LFPPct
##    <chr>         <dbl>   <dbl>    <dbl>  <dbl>
##  1 1/1/1976     0.0707   0.615     7.07   61.5
##  2 2/1/1976     0.0703   0.615     7.03   61.5
##  3 3/1/1976     0.0665   0.614     6.65   61.4
##  4 4/1/1976     0.0682   0.616     6.82   61.6
##  5 5/1/1976     0.0689   0.614     6.89   61.4
##  6 6/1/1976     0.0689   0.614     6.89   61.4
##  7 7/1/1976     0.0740   0.617     7.40   61.7
##  8 8/1/1976     0.0707   0.615     7.07   61.5
##  9 9/1/1976     0.0699   0.614     6.99   61.4
## 10 10/1/1976    0.0741   0.615     7.41   61.5
## # ... with 531 more rows

Rounding

If the number of decimals is bothering you, you can use round() and indicate the number of digits after the decimal point.

# round the decimals to two digits
EmpData %>%
  select(MonthYr, UnempRate, LFPRate) %>%  # keeps these 3 variables
  mutate(UnempPct = round(100 * UnempRate, digits = 2),  
         LFPPct = round(100 * LFPRate, digits = 2 ),
         UnempRate = round(UnempRate, digits = 2),
         LFPRate = round(LFPRate, digits = 2))    
## # A tibble: 541 x 5
##    MonthYr   UnempRate LFPRate UnempPct LFPPct
##    <chr>         <dbl>   <dbl>    <dbl>  <dbl>
##  1 1/1/1976       0.07    0.62     7.07   61.5
##  2 2/1/1976       0.07    0.62     7.03   61.5
##  3 3/1/1976       0.07    0.61     6.65   61.4
##  4 4/1/1976       0.07    0.62     6.82   61.6
##  5 5/1/1976       0.07    0.61     6.89   61.4
##  6 6/1/1976       0.07    0.61     6.89   61.4
##  7 7/1/1976       0.07    0.62     7.4    61.7
##  8 8/1/1976       0.07    0.61     7.07   61.5
##  9 9/1/1976       0.07    0.61     6.99   61.4
## 10 10/1/1976      0.07    0.61     7.41   61.5
## # ... with 531 more rows

The percentage and rounding calculations above were only displayed in the output; they were not stored anywhere. Each command created a new object based on EmpData and printed it to the screen. To change EmpData itself, we need to assign that new object back to EmpData with <-. Also notice in the code below that mutate() is used once and sets three variables inside that one command.

# Make permanent changes to EmpData
EmpData <- EmpData %>%
  select(-date2) %>% 
  mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"),
           UnempPct = 100 * UnempRate,
           LFPPct = 100 * LFPRate)

EmpData # check your work. Look at the data
## # A tibble: 541 x 14
##    MonthYr    Population Employed Unemployed laborforce not_laborforce UnempRate
##    <date>          <dbl>    <dbl>      <dbl>      <dbl>          <dbl>     <dbl>
##  1 1976-01-01     16852.    9637.       733      10370.          6483.    0.0707
##  2 1976-02-01     16892     9660.       730      10390.          6502.    0.0703
##  3 1976-03-01     16931.    9704.       692.     10396.          6535     0.0665
##  4 1976-04-01     16969.    9738.       713.     10451.          6518.    0.0682
##  5 1976-05-01     17008.    9726.       720      10446.          6562     0.0689
##  6 1976-06-01     17047.    9748.       721.     10470.          6577.    0.0689
##  7 1976-07-01     17086.    9760.       780.     10539.          6546.    0.0740
##  8 1976-08-01     17124.    9780.       744.     10524.          6600.    0.0707
##  9 1976-09-01     17154.    9795.       737.     10532.          6622.    0.0699
## 10 1976-10-01     17183.    9782.       783.     10565.          6618.    0.0741
## # ... with 531 more rows, and 7 more variables: LFPRate <dbl>, Party <chr>,
## #   PrimeMinister <chr>, AnnPopGrowth <dbl>, date <date>, UnempPct <dbl>,
## #   LFPPct <dbl>

Recoding Variables

You can also create a binary variable from a continuous variable, for example if you just want to label unemployment levels as low or high.

Note: I picked 8% as my cut off for high and low unemployment rates. You would want to have a good reason behind your choices in your own assignments/research. I just wanted to make an example.

if_else() is a great tool for creating or recoding variables.

EmpData$highlow <- if_else(EmpData$UnempPct < 8, "low", "high")
table(EmpData$highlow)
## 
## high  low 
##  229  312
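
If you ever need more than two categories, dplyr's case_when() extends the same idea. The sketch below uses made-up percentages (unemp_pct is hypothetical), and the cutoffs of 7 and 10 are arbitrary choices for illustration only.

```r
library(dplyr)

unemp_pct <- c(5.4, 7.9, 8.0, 10.2, 13.7)  # made-up unemployment percentages

# Two categories, same rule as above
two_level <- if_else(unemp_pct < 8, "low", "high")

# Three categories: conditions are checked in order, first match wins
three_level <- case_when(
  unemp_pct < 7  ~ "low",
  unemp_pct < 10 ~ "medium",
  TRUE           ~ "high"     # everything else
)

table(three_level)
```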

Filter and Arrange

Now let’s suppose we want to know more about the months in our data set with the highest unemployment rates. We can use filter() for this purpose:

# This will give all of the observations with unemployment rates over 12.5%
EmpData %>%
    filter(UnempPct > 12.5)
## # A tibble: 8 x 15
##   MonthYr    Population Employed Unemployed laborforce not_laborforce UnempRate
##   <date>          <dbl>    <dbl>      <dbl>      <dbl>          <dbl>     <dbl>
## 1 1982-10-01     19183.   10787.      1602.     12389.          6794.     0.129
## 2 1982-11-01     19203.   10764.      1600.     12364.          6839.     0.129
## 3 1982-12-01     19223.   10774.      1624.     12398.          6824.     0.131
## 4 1983-01-01     19244.   10801.      1573.     12374           6870.     0.127
## 5 1983-02-01     19266.   10818.      1574.     12392.          6875.     0.127
## 6 1983-03-01     19285.   10875.      1555.     12430.          6856.     0.125
## 7 2020-04-01     30994.   16142.      2444.     18586.         12409.     0.131
## 8 2020-05-01     31009.   16444       2610.     19054.         11955.     0.137
## # ... with 8 more variables: LFPRate <dbl>, Party <chr>, PrimeMinister <chr>,
## #   AnnPopGrowth <dbl>, date <date>, UnempPct <dbl>, LFPPct <dbl>,
## #   highlow <chr>

Interpretation of output: only 8 of the 541 months in our data have unemployment rates over 12.5% - the worst months of the 1982-83 recession, and April and May of 2020.

Now suppose that we only want to see a few pieces of information about those months. We can use select() again to choose variables to display in the output:

# This will leave out all variables except the ones mentioned in the select() command
EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempRate, LFPPct, PrimeMinister)
## # A tibble: 8 x 4
##   MonthYr    UnempRate LFPPct PrimeMinister 
##   <date>         <dbl>  <dbl> <chr>         
## 1 1982-10-01     0.129   64.6 Pierre Trudeau
## 2 1982-11-01     0.129   64.4 Pierre Trudeau
## 3 1982-12-01     0.131   64.5 Pierre Trudeau
## 4 1983-01-01     0.127   64.3 Pierre Trudeau
## 5 1983-02-01     0.127   64.3 Pierre Trudeau
## 6 1983-03-01     0.125   64.5 Pierre Trudeau
## 7 2020-04-01     0.131   60.0 Justin Trudeau
## 8 2020-05-01     0.137   61.4 Justin Trudeau

What happens if you put the select() command before the filter() command in the code below? Why?

EmpData %>%
  select(MonthYr, UnempRate, LFPPct, PrimeMinister) %>%
  filter(UnempPct > 12.5)
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `UnempPct > 12.5`.
## x object 'UnempPct' not found

Show the months sorted by unemployment rate. We can use arrange() to sort rows in this way. By default arrange() sorts in ascending order (lowest unemployment rate first); wrap the variable in desc() to put the highest unemployment rate first:

EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
    arrange(UnempPct) # Sorts the rows by unemployment rate, default is ascending order
## # A tibble: 8 x 4
##   MonthYr    UnempPct LFPPct PrimeMinister 
##   <date>        <dbl>  <dbl> <chr>         
## 1 1983-03-01     12.5   64.5 Pierre Trudeau
## 2 1983-02-01     12.7   64.3 Pierre Trudeau
## 3 1983-01-01     12.7   64.3 Pierre Trudeau
## 4 1982-10-01     12.9   64.6 Pierre Trudeau
## 5 1982-11-01     12.9   64.4 Pierre Trudeau
## 6 1982-12-01     13.1   64.5 Pierre Trudeau
## 7 2020-04-01     13.1   60.0 Justin Trudeau
## 8 2020-05-01     13.7   61.4 Justin Trudeau
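
For descending order, desc() goes inside arrange(). A minimal sketch on a made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

toy <- tibble(
  month    = c("A", "B", "C"),
  UnempPct = c(12.7, 13.7, 12.9)
)

# desc() reverses the sort, so the highest unemployment rate comes first
toy_desc <- toy %>%
  arrange(desc(UnempPct))

toy_desc
```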

arrange() is very useful when creating graphs! If you want bar graphs to show categories in increasing or decreasing order, arrange() will help you do that.

Hopefully you can see why the pipe operator is useful in making our code clear and readable. Compare these two versions of the code. Both do the same thing but one is easier to read.

# with pipes
EmpData %>%
    filter(UnempPct > 12.5) %>%
    select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
    arrange(UnempPct)

# without pipes
arrange(select(filter(EmpData, UnempPct > 12.5), MonthYr, UnempPct, LFPPct, PrimeMinister), UnempPct)

Analyzing Data in R

The summary function

The summary() function will give a basic summary of any object. Exactly what that summary looks like depends on the object. For tibbles, summary() produces a set of summary statistics for each variable. Hopefully these statistics look familiar.

summary(EmpData) # entire data frame
##     MonthYr             Population       Employed       Unemployed    
##  Min.   :1976-01-01   Min.   :16852   Min.   : 9637   Min.   : 691.5  
##  1st Qu.:1987-04-01   1st Qu.:20290   1st Qu.:12230   1st Qu.:1102.5  
##  Median :1998-07-01   Median :23529   Median :14064   Median :1265.5  
##  Mean   :1998-07-01   Mean   :23795   Mean   :14383   Mean   :1261.0  
##  3rd Qu.:2009-10-01   3rd Qu.:27327   3rd Qu.:16926   3rd Qu.:1404.6  
##  Max.   :2021-01-01   Max.   :31191   Max.   :19130   Max.   :2609.8  
##                                                                       
##    laborforce    not_laborforce    UnempRate          LFPRate      
##  Min.   :10370   Min.   : 6483   Min.   :0.05446   Min.   :0.5996  
##  1st Qu.:13467   1st Qu.: 6842   1st Qu.:0.07032   1st Qu.:0.6501  
##  Median :15333   Median : 8162   Median :0.07691   Median :0.6573  
##  Mean   :15644   Mean   : 8151   Mean   :0.08207   Mean   :0.6564  
##  3rd Qu.:18230   3rd Qu.: 9099   3rd Qu.:0.09369   3rd Qu.:0.6674  
##  Max.   :20316   Max.   :12409   Max.   :0.13697   Max.   :0.6766  
##                                                                    
##     Party           PrimeMinister       AnnPopGrowth           date           
##  Length:541         Length:541         Min.   :0.007522   Min.   :1976-01-01  
##  Class :character   Class :character   1st Qu.:0.012390   1st Qu.:1987-04-01  
##  Mode  :character   Mode  :character   Median :0.013156   Median :1998-07-01  
##                                        Mean   :0.013703   Mean   :1998-07-01  
##                                        3rd Qu.:0.014286   3rd Qu.:2009-10-01  
##                                        Max.   :0.024815   Max.   :2021-01-01  
##                                        NA's   :12                             
##     UnempPct          LFPPct        highlow         
##  Min.   : 5.446   Min.   :59.96   Length:541        
##  1st Qu.: 7.032   1st Qu.:65.01   Class :character  
##  Median : 7.691   Median :65.73   Mode  :character  
##  Mean   : 8.207   Mean   :65.64                     
##  3rd Qu.: 9.369   3rd Qu.:66.74                     
##  Max.   :13.697   Max.   :67.66                     
## 
summary(EmpData$Unemployed) # for one variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   691.5  1102.5  1265.5  1261.0  1404.6  2609.8

Univariate Statistics

The R function mean() calculates the sample average of any numeric vector:

# Mean of a single variable
mean(EmpData$UnempPct)
## [1] 8.207112

In a full sentence, interpret the mean in the context of the variable.

# sd() calculates the sample standard deviation
sd(EmpData$UnempPct) # 1.71
## [1] 1.709704
# median() calculates the sample median
median(EmpData$UnempPct) # 7.69% was median unemployment rate
## [1] 7.691411

In real-world data, some variables have missing values for one or more observations. For example, the AnnPopGrowth variable in our data set is missing for the first year of data (1976), since calculating the growth rate for 1976 would require data from 1975. In R, missing values are given the special value NA which stands for “not available”:

EmpData %>%
  select(MonthYr, Population, AnnPopGrowth)
## # A tibble: 541 x 3
##    MonthYr    Population AnnPopGrowth
##    <date>          <dbl>        <dbl>
##  1 1976-01-01     16852.           NA
##  2 1976-02-01     16892            NA
##  3 1976-03-01     16931.           NA
##  4 1976-04-01     16969.           NA
##  5 1976-05-01     17008.           NA
##  6 1976-06-01     17047.           NA
##  7 1976-07-01     17086.           NA
##  8 1976-08-01     17124.           NA
##  9 1976-09-01     17154.           NA
## 10 1976-10-01     17183.           NA
## # ... with 531 more rows

When we try to take the mean of this variable we also get NA:

mean(EmpData$AnnPopGrowth) # returns NA as result
## [1] NA

This is because math in R follows rules that say that any calculation involving NA should also result in NA. Some other applications drop missing data from the calculation.

Whenever you have missing values, you should investigate before proceeding. Sometimes (as in our case here), missing values are for a good reason, other times they are the result of a mistake or problem that needs to be fixed.

Once we have investigated the missing values, we can tell R explicitly to exclude them from the calculation by adding the na.rm = TRUE option:

mean(EmpData$AnnPopGrowth, na.rm = TRUE)
## [1] 0.01370259
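
A quick way to investigate missing values is to count them and find where they are. The sketch below uses a short made-up vector (growth is hypothetical) rather than the lab data.

```r
growth <- c(NA, NA, 0.012, 0.014, 0.013)  # made-up growth rates with two gaps

n_missing <- sum(is.na(growth))   # how many values are missing
n_missing
which(is.na(growth))              # which positions are missing

mean(growth)                      # NA, because the calculation involves NA
mean_growth <- mean(growth, na.rm = TRUE)  # drops the NAs first
mean_growth
```

With the real data, sum(is.na(EmpData$AnnPopGrowth)) would do the same count.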

If we wanted to calculate the sample average for each column in our tibble, we could apply the command mean() to all the columns using the command lapply().

# Mean of each column
EmpData %>%
    select(where(is.numeric)) %>%
    lapply(mean, na.rm = TRUE)
## $Population
## [1] 23795.46
## 
## $Employed
## [1] 14383.15
## 
## $Unemployed
## [1] 1260.953
## 
## $laborforce
## [1] 15644.1
## 
## $not_laborforce
## [1] 8151.352
## 
## $UnempRate
## [1] 0.08207112
## 
## $LFPRate
## [1] 0.6563653
## 
## $AnnPopGrowth
## [1] 0.01370259
## 
## $UnempPct
## [1] 8.207112
## 
## $LFPPct
## [1] 65.63653

I would not expect you to come up with this code, but maybe it kind of makes sense.

  • The select(where(is.numeric)) step selects only the columns in EmpData that are numeric.
  • The lapply(mean,na.rm=TRUE) step calculates mean(x,na.rm=TRUE) for each (numeric) column x in EmpData.
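
If you would rather get a tibble back instead of a list, one alternative is summarize() with across(). This is a sketch on a made-up tibble (toy is hypothetical); it is offered as an option, not as the method this lab requires.

```r
library(dplyr)

toy <- tibble(
  name = c("a", "b", "c"),   # text column, skipped by where(is.numeric)
  x    = c(1, 2, 3),
  y    = c(10, NA, 30)       # contains a missing value
)

# Apply mean(..., na.rm = TRUE) to every numeric column
means <- toy %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

means
```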

We can use this method with any function that calculates a summary statistic:

# Standard deviation of each column
EmpData %>%
    select(where(is.numeric)) %>%
    lapply(sd, na.rm = TRUE)
## $Population
## [1] 4034.558
## 
## $Employed
## [1] 2704.267
## 
## $Unemployed
## [1] 243.8356
## 
## $laborforce
## [1] 2783.985
## 
## $not_laborforce
## [1] 1294.117
## 
## $UnempRate
## [1] 0.01709704
## 
## $LFPRate
## [1] 0.01401074
## 
## $AnnPopGrowth
## [1] 0.00269365
## 
## $UnempPct
## [1] 1.709704
## 
## $LFPPct
## [1] 1.401074

Frequency Tables

We can also construct frequency tables for both categorical and continuous variables:

# count() creates a frequency table for categorical variables
EmpData %>% count(PrimeMinister)
## # A tibble: 10 x 2
##    PrimeMinister      n
##    <chr>          <int>
##  1 Brian Mulroney   104
##  2 Jean Chretien    120
##  3 Joe Clark          8
##  4 John Turner        2
##  5 Justin Trudeau    62
##  6 Kim Campbell       4
##  7 Paul Martin       25
##  8 Pierre Trudeau    91
##  9 Stephen Harper   116
## 10 Transfer           9

Bivariate Statistics

Cross tabulations

table(EmpData$PrimeMinister, EmpData$Party)
##                 
##                  Conservative Liberal Transfer
##   Brian Mulroney          104       0        0
##   Jean Chretien             0     120        0
##   Joe Clark                 8       0        0
##   John Turner               0       2        0
##   Justin Trudeau            0      62        0
##   Kim Campbell              4       0        0
##   Paul Martin               0      25        0
##   Pierre Trudeau            0      91        0
##   Stephen Harper          116       0        0
##   Transfer                  1       2        6
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
#(EmpData$PrimeMinister, EmpData$Party)

Correlation

cor(EmpData$UnempPct, EmpData$LFPPct)
## [1] -0.2557409

What is the relationship between these two variables? Indicate direction and strength of the relationship.

Unemployment and labor force participation are weakly negatively correlated (r is about -0.26): when unemployment is high, LFP tends to be somewhat lower. This makes sense given the economics: if it is hard to find a job, people will move into other activities that take them out of the labor force, such as education, childcare, or retirement.
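
To see what cor() is doing, the sketch below computes the Pearson correlation by hand on two short made-up vectors (x and y are hypothetical) and compares it with the built-in function.

```r
x <- c(6.5, 7.0, 8.2, 9.1, 12.9)       # made-up unemployment percentages
y <- c(66.1, 65.8, 65.9, 65.0, 64.4)   # made-up participation percentages

# Pearson correlation: co-movement of x and y divided by their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

r_manual
r_builtin
```

The sign gives the direction of the relationship, and the distance from 0 (toward -1 or 1) gives its strength.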

Correlation Matrix

Correlation matrix for the whole data set (at least the numerical parts).

In most applications, pairwise deletion makes the most sense because it avoids throwing out data. But it is occasionally important to use the same data for all calculations, in which case we would use listwise deletion.

  • Pairwise deletion: when calculating the covariance or correlation of two variables, exclude observations with a missing value for either of those two variables.

  • Casewise or listwise deletion: when calculating the covariance or correlation of two variables, exclude observations with a missing value for any variable.

EmpData %>%
    select(Population, Employed, laborforce, UnempPct) %>%
    cor(use = "pairwise.complete.obs")
##            Population   Employed laborforce   UnempPct
## Population  1.0000000  0.9905010  0.9950675 -0.4721230
## Employed    0.9905010  1.0000000  0.9964734 -0.5542043
## laborforce  0.9950675  0.9964734  1.0000000 -0.4836022
## UnempPct   -0.4721230 -0.5542043 -0.4836022  1.0000000
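
The difference between the two deletion rules only shows up when some values are missing. The sketch below uses a small made-up data frame (toy is hypothetical) where column c has two missing values, and compares use = "pairwise.complete.obs" with use = "complete.obs" (listwise).

```r
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(NA, NA, 1, 3, 2, 5)   # missing in the first two rows
)

# Pairwise: the a-b correlation uses all 6 rows, because a and b are complete
pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise: every correlation uses only rows 3 to 6, where nothing is missing
listwise <- cor(toy, use = "complete.obs")

pairwise["a", "b"]
listwise["a", "b"]
```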

Notice which variables are EXTREMELY correlated with each other. Think about why that is. While the variables are strongly correlated, does it have meaningful implications?

Conditional Averages

Calculate the average unemployment rate for each prime minister.

EmpData %>%
  group_by(PrimeMinister) %>%
  summarize(avg_unemp_rate = mean(UnempRate))
## # A tibble: 10 x 2
##    PrimeMinister  avg_unemp_rate
##    <chr>                   <dbl>
##  1 Brian Mulroney         0.0939
##  2 Jean Chretien          0.0841
##  3 Joe Clark              0.0726
##  4 John Turner            0.112 
##  5 Justin Trudeau         0.0698
##  6 Kim Campbell           0.114 
##  7 Paul Martin            0.0694
##  8 Pierre Trudeau         0.0896
##  9 Stephen Harper         0.0711
## 10 Transfer               0.0911
EmpData %>%
  group_by(PrimeMinister) %>%
  summarize(avg_unemp_rate = round(mean(UnempRate), digits = 3))
## # A tibble: 10 x 2
##    PrimeMinister  avg_unemp_rate
##    <chr>                   <dbl>
##  1 Brian Mulroney          0.094
##  2 Jean Chretien           0.084
##  3 Joe Clark               0.073
##  4 John Turner             0.112
##  5 Justin Trudeau          0.07 
##  6 Kim Campbell            0.114
##  7 Paul Martin             0.069
##  8 Pierre Trudeau          0.09 
##  9 Stephen Harper          0.071
## 10 Transfer                0.091
EmpData %>%
  group_by(PrimeMinister) %>%
  summarize(avg_unemp_rate = round(mean(UnempRate), digits = 3)) %>%
  ggplot(aes(avg_unemp_rate, PrimeMinister)) + 
  geom_col()
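
The same group_by() and summarize() pattern can return several statistics at once. A minimal sketch on a made-up tibble (toy, pm, and the values are hypothetical):

```r
library(dplyr)

toy <- tibble(
  pm        = c("X", "X", "Y", "Y", "Y"),
  UnempRate = c(0.06, 0.08, 0.10, 0.12, 0.11)
)

by_pm <- toy %>%
  group_by(pm) %>%
  summarize(
    months   = n(),              # number of observations in each group
    avg_rate = mean(UnempRate),  # conditional average
    max_rate = max(UnempRate)    # highest rate in each group
  )

by_pm
```

Including n() is a useful habit: an average based on 2 months deserves less weight than one based on 120.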

Graphs with ggplot

More graphing practice!

Histogram of unemployment rate:

ggplot(data = EmpData, 
       mapping = aes(x = UnempPct)) + 
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Time series (line) graph:

ggplot(data = EmpData, 
       mapping = aes(x = MonthYr, y = UnempPct)) + 
  geom_line()

The ggplot() function has a non-standard syntax:

  • The first line sets up the basic characteristics of the graph:
    • The data argument tells R which data set (tibble) will be used
    • the mapping argument describes the basic aesthetics of the graph, i.e., the relationship in the data we will be graphing.
      • For the histogram, our aesthetic includes only one variable
      • For the line graph, our aesthetic includes two variables
  • The rest of the command is one or more statements separated by a + sign. These are called geometries and are geometric elements to be included in the plot.
    • The geom_histogram() geometry produces a histogram
    • The geom_line() geometry produces a line
    • A graph can include multiple geometries (ex. lines and histograms)
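
The structure described above can be sketched on a tiny made-up data frame (toy is hypothetical): one ggplot() call that sets the data and aesthetics, followed by two geometries joined with +.

```r
library(ggplot2)

toy <- data.frame(
  month = 1:6,
  rate  = c(7.1, 7.0, 6.7, 6.8, 6.9, 7.4)
)

p <- ggplot(data = toy, mapping = aes(x = month, y = rate)) +  # data and aesthetics
  geom_line() +                                                # first geometry: a line
  geom_point()                                                 # second geometry: points

p  # printing the object draws the graph
```

Both geometries inherit x and y from the first line, which is why neither needs its own aes().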

Title and labels

ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) + 
  geom_line() + 
  labs(title = "Unemployment rate",
    subtitle = "January 1976 - January 2021", 
    caption = "Source: Statistics Canada, Labour Force Survey",
    tag = "Canada") + 
  xlab("") + 
  ylab("Unemployment rate, %")

Color

You can change the color of any geometric element using the col= argument:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + 
  geom_line(col = "blue")

Colors can be given as English color names (run colors() to see the list R knows) or as hexadecimal RGB codes such as "#0000FF".
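
A short sketch of how color names and hexadecimal codes relate, using base R functions only:

```r
head(colors())           # first few of the color names R recognizes

blue_rgb <- col2rgb("blue")  # red, green, blue components on a 0-255 scale
blue_rgb

blue_hex <- rgb(0, 0, 1)     # build a hex code from components on a 0-1 scale
blue_hex
```

Either form can be passed to col= or fill=, for example col = "blue" or col = "#0000FF".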

Some geometric elements, such as the bars in a histogram, also have a fill color:

ggplot(data = EmpData, aes(x = UnempPct)) + 
  geom_histogram(col = "red", fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the col= argument sets the color for the exterior of each bar, and the fill= argument sets the color for the interior.

Line graph with 2 colors

Remember, time is on the x-axis and the percentage of the labor force that is unemployed is on the y-axis.

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + 
  geom_line(col = "blue") +
  geom_line(aes(y = LFPPct), col = "red")

A few things to note for the image above:

  • The third line gives geom_line() an aesthetics argument aes(y=LFPPct). This overrides the y aesthetic from the first line for that layer only.
  • We have used color to differentiate the two lines, but there is no legend to tell the reader which line is which. We will need to fix that.
  • The vertical axis is labeled UnempPct. We will need to fix that.

We could add a legend here, but it is better (and friendlier to the color-blind) to just label the lines. We can use the geom_text geometry to do this:

ggplot(data = EmpData, aes(x = MonthYr)) +
  geom_line(aes(y = UnempPct), col = "blue") +
  geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 15, label = "Unemployment",
        col = "blue") + 
  geom_line(aes(y = LFPPct), col = "red") + 
  geom_text(x = as.Date("1/1/2000",
    "%m/%d/%Y"), y = 60, label = "% in Labor Force", col = "red")

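One thing to watch for: because geom_text() inherits the full data table, it draws the label once for every row, with all the copies stacked on top of each other. A possible alternative is annotate(), which draws the label a single time. The sketch below uses a small made-up data table (toy_emp is hypothetical, not the lab data) so that it runs on its own, and it also relabels the vertical axis:

```r
library(ggplot2)

# Hypothetical toy data, standing in for EmpData
toy_emp <- data.frame(
  MonthYr  = seq(as.Date("1999-01-01"), by = "month", length.out = 24),
  UnempPct = seq(8, 7, length.out = 24),
  LFPPct   = seq(65, 66, length.out = 24)
)

p <- ggplot(data = toy_emp, aes(x = MonthYr)) +
  geom_line(aes(y = UnempPct), col = "blue") +
  # annotate() ignores the data table and draws the label only once
  annotate("text", x = as.Date("2000-01-01"), y = 12,
           label = "Unemployment", colour = "blue") +
  geom_line(aes(y = LFPPct), col = "red") +
  annotate("text", x = as.Date("2000-01-01"), y = 60,
           label = "% in Labor Force", colour = "red") +
  # Replace the default axis label (UnempPct) with something readable
  ylab("Percent")

print(p)
```

The label positions (y = 12 and y = 60) are just guesses that suit the toy data; adjust them to fit your own graph.
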
Adding labels

Add comments to each line of code indicating what is being done

ggplot(data = EmpData, aes(x = UnempPct)) + 
  geom_histogram(binwidth = 0.5, fill = "blue") +
  geom_density() + 
  labs(title = "Unemployment rate", 
       subtitle = "January 1976 - January 2021", 
       caption = "Source: Statistics Canada, Labour Force Survey",
       tag = "Canada") + 
  xlab("Unemployment rate, %") + 
  ylab("Count")

Lab Practice Questions

Use filter, arrange and select to modify a data table

Starting with the data table EmpData:

  1. Add the variable Year based on the existing variable MonthYr. The formula for Year should be format(MonthYr, "%Y"). Note that format() returns a character string; wrap it in as.numeric() if you need Year to be numeric.
  2. Add the numeric variable EmpRate, which is the proportion of the population (Population) that is employed (Employed), also called the employment rate or employment-to-population ratio.
  3. Drop all observations from years before 2010.
  4. Drop all variables except MonthYr, Year, EmpRate, UnempRate, and AnnPopGrowth
  5. Sort observations by EmpRate.
  6. Give the resulting data table the name PPData.
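class(EmpData$MonthYr) # Check the class of MonthYr
## [1] "Date"
PPData <- EmpData %>%
    # MonthYr is already a Date, so this conversion changes nothing here
    mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y")) %>%
    # format() returns the year as a character string, e.g. "2010"
    mutate(Year = format(MonthYr, "%Y")) %>%
    mutate(EmpRate = Employed/Population) %>%
    # Year is character, so this is a string comparison; it works for four-digit years
    filter(Year >= 2010) %>%
    select(MonthYr, Year, EmpRate, UnempRate, AnnPopGrowth) %>%
    arrange(EmpRate)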

print(PPData)
## # A tibble: 133 x 5
##    MonthYr    Year  EmpRate UnempRate AnnPopGrowth
##    <date>     <chr>   <dbl>     <dbl>        <dbl>
##  1 2020-04-01 2020    0.521    0.131       0.0132 
##  2 2020-05-01 2020    0.530    0.137       0.0124 
##  3 2020-06-01 2020    0.560    0.125       0.0118 
##  4 2020-07-01 2020    0.573    0.109       0.0109 
##  5 2020-08-01 2020    0.580    0.102       0.0104 
##  6 2020-03-01 2020    0.585    0.0789      0.0143 
##  7 2021-01-01 2021    0.586    0.0941      0.00863
##  8 2020-09-01 2020    0.591    0.0918      0.00992
##  9 2020-12-01 2020    0.593    0.0876      0.00905
## 10 2020-10-01 2020    0.594    0.0902      0.00961
## # ... with 123 more rows
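
Note that Year shows up as <chr> (character) in the output above, because format() returns text. If you want a numeric Year, as the question asks, one option is to wrap the formula in as.numeric(). A minimal sketch using a few made-up dates (toy_dates is hypothetical, not the lab data):

```r
toy_dates <- as.Date(c("2009-12-01", "2010-01-01", "2021-01-01"))

year_chr <- format(toy_dates, "%Y")              # character
year_num <- as.numeric(format(toy_dates, "%Y"))  # numeric

class(year_chr)
class(year_num)

# With a numeric Year, the filter is a true numeric comparison
keep <- year_num >= 2010
print(keep)
```

Since the lab already loads lubridate, year(MonthYr) is another way to get a numeric year.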

Recognize and handle missing data problems

Starting with the PPData data table you created in question (1) above:

  1. Calculate and report the mean employment rate since 2010.
  2. Create a table reporting the median of every numeric variable in PPData.
  3. Did any variables in PPData have missing data? If so, how did you decide to address it in your answer to (2), and why?
### 1: Mean employment rate
mean(PPData$EmpRate)
## [1] 0.610809
### 2: Table of medians
PPData %>%
    select(where(is.numeric)) %>%
    lapply(median)
## $EmpRate
## [1] 0.6139724
## 
## $UnempRate
## [1] 0.07082771
## 
## $AnnPopGrowth
## [1] 0.01171851
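
The last part of this question asks about missing data. One way to check, sketched here on a small made-up data table (toy_pp is hypothetical, not the lab data), is to count the NA values in each column. The sketch also shows what median() does when a value is missing:

```r
# Hypothetical toy data with one missing value
toy_pp <- data.frame(
  EmpRate      = c(0.60, 0.61, 0.62, 0.59),
  AnnPopGrowth = c(NA, 0.011, 0.012, 0.013)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_pp))
print(na_counts)

# median() returns NA if any value is missing...
med_with_na <- median(toy_pp$AnnPopGrowth)
print(med_with_na)

# ...unless you ask it to drop missing values first
med_growth <- median(toy_pp$AnnPopGrowth, na.rm = TRUE)
print(med_growth)
```

In the lab output above, none of the medians is NA, which suggests these columns of PPData have no missing values: AnnPopGrowth is only blank for the first 12 months of the full data set, and those rows were dropped by the filter. Running colSums(is.na(PPData)) would confirm this.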

Construct a simple or binned frequency table

Using the PPData data set, construct a frequency table of the employment rate.

PPData %>%
    count(cut_interval(EmpRate, 6))
## # A tibble: 5 x 2
##   `cut_interval(EmpRate, 6)`     n
##   <fct>                      <int>
## 1 [0.521,0.537]                  2
## 2 (0.554,0.57]                   1
## 3 (0.57,0.587]                   4
## 4 (0.587,0.603]                  4
## 5 (0.603,0.62]                 122
# The employment rate is a continuous variable, so the appropriate kind of frequency table here is a binned frequency table.
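
Notice that the table above has only 5 rows even though we asked for 6 bins: count() leaves out bins with no observations. If you want to see the empty bin too, one option is base R's cut() and table(). This is only a sketch on made-up numbers (toy_rate is hypothetical), and the bin edges are built by hand rather than with cut_interval():

```r
# Hypothetical employment rates, not the lab data
toy_rate <- c(0.52, 0.53, 0.56, 0.58, 0.60, 0.61, 0.615, 0.62)

# Six equal-width bins spanning the range of the data
bin_edges <- seq(min(toy_rate), max(toy_rate), length.out = 7)
bins <- cut(toy_rate, breaks = bin_edges, include.lowest = TRUE)

# table() reports every bin, including those with a count of zero
freq <- table(bins)
print(freq)
```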

Create a histogram with ggplot

Using the PPData data set, create a histogram of the employment rate.

ggplot(data = PPData, 
       mapping = aes(x = EmpRate)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Create a line graph with ggplot

Using the PPData data set, create a time series graph of the employment rate.

ggplot(data = PPData, 
       mapping = aes(x = MonthYr, y = EmpRate)) + 
  geom_line()

Calculate and interpret correlation

Using the EmpData data set, calculate the covariance and correlation of UnempPct and AnnPopGrowth. Based on these results, are periods of high population growth typically periods of high unemployment?

cor(EmpData$UnempPct, EmpData$AnnPopGrowth, use = "complete.obs")
## [1] -0.06513125
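
The question also asks for the covariance, which the code above does not compute. The cov() function takes the same arguments as cor(). A minimal sketch on made-up numbers (the toy_ vectors are hypothetical, not the lab data):

```r
# Hypothetical values; the first growth value is missing on purpose
toy_unemp  <- c(7.1, 8.3, 6.5, 9.0, 7.7)
toy_growth <- c(NA, 0.012, 0.014, 0.010, 0.011)

# use = "complete.obs" drops any pair where either value is missing
cov_xy <- cov(toy_unemp, toy_growth, use = "complete.obs")
cor_xy <- cor(toy_unemp, toy_growth, use = "complete.obs")

# The correlation is the covariance rescaled by both standard deviations
ok <- complete.cases(toy_unemp, toy_growth)
rescaled <- cov_xy / (sd(toy_unemp[ok]) * sd(toy_growth[ok]))
print(all.equal(cor_xy, rescaled))
```

For the lab data, the call would be cov(EmpData$UnempPct, EmpData$AnnPopGrowth, use = "complete.obs"). The correlation reported above is close to zero and slightly negative, so periods of high population growth are not typically periods of high unemployment.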

Construct and interpret a scatter plot in R

Using the EmpData data set, construct a scatter plot with annual population growth on the horizontal axis and unemployment rate on the vertical axis.

ggplot(data = EmpData, mapping = aes(x = AnnPopGrowth, 
                                     y = UnempRate)) + 
  geom_point()
## Warning: Removed 12 rows containing missing values (geom_point).

Construct and interpret a linear or smoothed average plot in R

Using the EmpData data set, construct the same scatter plot as in the problem above, but add a smooth fit and a linear fit.

ggplot(data = EmpData, 
       mapping = aes(x = AnnPopGrowth, y = UnempRate)) + 
  geom_point() +
  geom_smooth(col = "green") + 
  geom_smooth(method = "lm", col = "blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).

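Note: If you want to hide the warnings and messages, add warning = FALSE, message = FALSE to the chunk header (the {r} braces at the top of the code chunk), for example {r, warning = FALSE, message = FALSE}. You can only see this difference in the code of the actual .Rmd file.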

ggplot(data = EmpData, 
       mapping = aes(x = AnnPopGrowth, y = UnempRate)) + 
  geom_point() +
  geom_smooth(col = "green") + 
  geom_smooth(method = "lm", col = "blue")