Your participation points come from emailing me the knitted Word document.
Lab Goals
Set the working directory (make sure the necessary files are in the correct folder) and load the packages.
When you load packages, many lines of output show up both in your console and in your final document when you knit it. If you want to hide that output but still show that you loaded the packages, add warning = FALSE, message = FALSE to the options in the chunk header.
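As a rough sketch of where those options go: in the .Rmd file they sit inside the curly braces of the chunk header, between the opening backticks and the code. The chunk label load-packages is a made-up name for illustration, and the header is written as a comment here so that the block runs as plain R.

```r
# In the .Rmd file, the chunk header (placed right after the opening backticks) would read:
#   {r load-packages, warning = FALSE, message = FALSE}
# The code inside the chunk is still shown in the knitted document;
# only the warnings and start-up messages are hidden.
library(tidyverse)
```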
library(tidyverse)
library(kableExtra)
library(lubridate)
Our lab example will use historical employment data for Canada from January 1976 through January 2021. Download the file from Blackboard’s Week 13 Folder. Save the data file and markdown file in a folder for this course.
EmpData <- read_csv("EmploymentData.csv") # from readr package included in tidyverse package
##
## -- Column specification --------------------------------------------------------
## cols(
## MonthYr = col_character(),
## Population = col_double(),
## Employed = col_double(),
## Unemployed = col_double(),
## LabourForce = col_double(),
## NotInLabourForce = col_double(),
## UnempRate = col_double(),
## LFPRate = col_double(),
## Party = col_character(),
## PrimeMinister = col_character(),
## AnnPopGrowth = col_double()
## )
What information can be gained from the output above? How is the data stored? Is it numeric? Text? Dates? Logical? etc. How do you know?
Rename the variable LabourForce to laborforce.
Remember: If you forget the syntax or different options that exist for a command, you can use
?_____ or highlight the command or package name and press F1. Both options open the documentation for the command or package in the Help panel.
The syntax for the rename() command is: rename(dataset_name, variable_new_name = variable_old_name).
This command (and almost all other tidyverse commands) also works with pipe syntax (%>%). When using pipes, you pass your data frame through the pipe and then tell R which commands you want to apply to the variables within that data frame:
dataset_name %>% rename(variable_new_name = variable_old_name)
Rename the variable LabourForce to laborforce:
# rename(new_name = old_name)
EmpData <- EmpData %>%
rename(laborforce = LabourForce)
Rename NotInLabourForce to not_laborforce:
EmpData <- EmpData %>%
rename(not_laborforce = NotInLabourForce)
The table() command is a great way to quickly know how many observations there are for categories of a variable.
How many observations are there for each Prime Minister?
Using the table command, how many observations are there for each party? What does one observation represent in the context of our data? What level of measurement is appropriate for “party”? What type of variable is party stored as in R? How can we find this out?
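Here is a small sketch of table() and class() on a made-up vector (not the lab data), in case it helps to see what each command returns before you try them on EmpData. The object names party and party_counts are just for this example.

```r
# Toy data, made up for illustration (not the lab data)
party <- c("Liberal", "Conservative", "Liberal", "Liberal")

# table() counts the number of observations in each category
party_counts <- table(party)
party_counts

# class() reports how R stores the variable
class(party)
```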
What happens if you try to graph a categorical variable as a histogram?
hist(EmpData$Party) # doesn't work because the command needs a numeric variable
## Error in hist.default(EmpData$Party): 'x' must be numeric
If the table() command shows you the number of observations for EACH category of a variable, why should we be hesitant to use it with a continuous variable? Feel free to try it below:
table(EmpData$Unemployed)
Instead of a table, maybe try a histogram:
hist(EmpData$Unemployed)
# hist() is the basic way to get a histogram in R
Why is using a histogram to display this information better or worse than a frequency table?
What does the y-axis represent? (i.e. How many of what?) What does the x-axis represent? What is a better name you could give this histogram? (You do not actually have to add the title to the image). How wide is each bin?
If I wanted to work with only a few of the variables, I could make a smaller data frame by selecting the specific variables I want and saving them as a new object.
subset <- EmpData %>%
select(MonthYr, Population, Employed, Unemployed)
Is it in your environment? How many variables and observations are there in your new subset?
Note that I didn’t save this output as anything. I just ran the code and the output shows up but it does not create a new object.
EmpData %>%
select(MonthYr, Population, Employed, Unemployed)
## # A tibble: 541 x 4
## MonthYr Population Employed Unemployed
## <chr> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733
## 2 2/1/1976 16892 9660. 730
## 3 3/1/1976 16931. 9704. 692.
## 4 4/1/1976 16969. 9738. 713.
## 5 5/1/1976 17008. 9726. 720
## 6 6/1/1976 17047. 9748. 721.
## 7 7/1/1976 17086. 9760. 780.
## 8 8/1/1976 17124. 9780. 744.
## 9 9/1/1976 17154. 9795. 737.
## 10 10/1/1976 17183. 9782. 783.
## # ... with 531 more rows
If I wanted to save the table in my environment, I would just add table_name <- to my code. That way it selects the variables I want, makes a table, and then stores it in the object that the arrow points to.
table_name <- EmpData %>%
select(MonthYr, Population, Employed, Unemployed)
Check the class of EmpData$MonthYr. Add a comment next to the command to indicate what the output was.
class(EmpData$MonthYr) # originally stored as a character variable.
When the data was originally read in from the CSV file, MonthYr was stored as a character variable. This needs to be changed if we want to graph change over time. (Although the values look like dates to humans, they are stored as text, so R cannot see, graph, or analyze any chronological order in them.)
Also notice that the information was formatted as Month/Day/Year. You have to include that information in your code so R knows how to read the numbers.
Run the line of code below to turn the MonthYr type from a character to date.
# Change MonthYr to date format using as.Date() (a base R function)
library(lubridate)
EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"))
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed laborforce not_laborforce UnempRate
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1976-01-01 16852. 9637. 733 10370. 6483. 0.0707
## 2 1976-02-01 16892 9660. 730 10390. 6502. 0.0703
## 3 1976-03-01 16931. 9704. 692. 10396. 6535 0.0665
## 4 1976-04-01 16969. 9738. 713. 10451. 6518. 0.0682
## 5 1976-05-01 17008. 9726. 720 10446. 6562 0.0689
## 6 1976-06-01 17047. 9748. 721. 10470. 6577. 0.0689
## 7 1976-07-01 17086. 9760. 780. 10539. 6546. 0.0740
## 8 1976-08-01 17124. 9780. 744. 10524. 6600. 0.0707
## 9 1976-09-01 17154. 9795. 737. 10532. 6622. 0.0699
## 10 1976-10-01 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>
Now MonthYr is stored as a date in R, but only temporarily, in the output: we did not save the data frame this way. The display also changed: the date is now shown as year-month-day.
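To see why the conversion matters, here is a minimal sketch with three made-up dates written in the same Month/Day/Year style. Sorted as text they come out in the wrong order; sorted as dates they come out chronologically. I use method = "radix" only so the text sort gives the same result on every computer.

```r
# Three made-up dates stored as text, in Month/Day/Year format
dates_chr <- c("10/1/1976", "2/1/1976", "1/1/1977")

# Sorting text compares character by character, not by calendar order
sort(dates_chr, method = "radix")

# Convert to Date, telling R how to read the numbers
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
class(dates)

# Now the sort is chronological
sort(dates)
```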
Keep the original variable and create a new variable for the date with the new formatting in EmpData:
#using dollar sign syntax:
EmpData$date <- as.Date(EmpData$MonthYr, "%m/%d/%Y")
# using tidyverse syntax:
# creates a new variable called "date" that formats EmpData$MonthYr to a date
# then saves dataset with new variable over its previous version
EmpData <- EmpData %>%
mutate(date2 = as.Date(MonthYr, "%m/%d/%Y"))
How many columns are there now? Where did the new variables go?
Mutate can be used to perform calculations on existing variables and create new variables.
Create versions of UnempRate and LFPRate that are expressed in percentages rather than decimal units. Remember, the lines of code below only create the output in the document and code chunk, they do not save the changes to the original data.
# Create UnempPct and LFPPct
EmpData %>%
select(MonthYr, UnempRate, LFPRate) %>% # keeps these 3 variables
mutate(UnempPct = 100 * UnempRate) %>% # creates a new variable named UnempPct
mutate(LFPPct = 100 * LFPRate) # creates a new variable named LFPPct
## # A tibble: 541 x 5
## MonthYr UnempRate LFPRate UnempPct LFPPct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 0.0707 0.615 7.07 61.5
## 2 2/1/1976 0.0703 0.615 7.03 61.5
## 3 3/1/1976 0.0665 0.614 6.65 61.4
## 4 4/1/1976 0.0682 0.616 6.82 61.6
## 5 5/1/1976 0.0689 0.614 6.89 61.4
## 6 6/1/1976 0.0689 0.614 6.89 61.4
## 7 7/1/1976 0.0740 0.617 7.40 61.7
## 8 8/1/1976 0.0707 0.615 7.07 61.5
## 9 9/1/1976 0.0699 0.614 6.99 61.4
## 10 10/1/1976 0.0741 0.615 7.41 61.5
## # ... with 531 more rows
If the number of decimals is bothering you, you can use round() and indicate the number of digits to keep after the decimal point.
# round the decimals to two digits
EmpData %>%
select(MonthYr, UnempRate, LFPRate) %>% # keeps these 3 variables
mutate(UnempPct = round(100 * UnempRate, digits = 2),
LFPPct = round(100 * LFPRate, digits = 2 ),
UnempRate = round(UnempRate, digits = 2),
LFPRate = round(LFPRate, digits = 2))
## # A tibble: 541 x 5
## MonthYr UnempRate LFPRate UnempPct LFPPct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 0.07 0.62 7.07 61.5
## 2 2/1/1976 0.07 0.62 7.03 61.5
## 3 3/1/1976 0.07 0.61 6.65 61.4
## 4 4/1/1976 0.07 0.62 6.82 61.6
## 5 5/1/1976 0.07 0.61 6.89 61.4
## 6 6/1/1976 0.07 0.61 6.89 61.4
## 7 7/1/1976 0.07 0.62 7.4 61.7
## 8 8/1/1976 0.07 0.61 7.07 61.5
## 9 9/1/1976 0.07 0.61 6.99 61.4
## 10 10/1/1976 0.07 0.61 7.41 61.5
## # ... with 531 more rows
So far, none of the percentage calculations above have been saved to EmpData: R calculated the new variables and showed them in the output, but did not store them anywhere. Those commands simply created a new object based on EmpData that was then displayed on the screen. To change EmpData itself, we need to assign that new object back to EmpData with <-. Also notice in the code below that I used mutate() only once and created (or overwrote) three variables inside that one command.
# Make permanent changes to EmpData
EmpData <- EmpData %>%
select(-date2) %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"),
UnempPct = 100 * UnempRate,
LFPPct = 100 * LFPRate)
EmpData # check your work. Look at the data
## # A tibble: 541 x 14
## MonthYr Population Employed Unemployed laborforce not_laborforce UnempRate
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1976-01-01 16852. 9637. 733 10370. 6483. 0.0707
## 2 1976-02-01 16892 9660. 730 10390. 6502. 0.0703
## 3 1976-03-01 16931. 9704. 692. 10396. 6535 0.0665
## 4 1976-04-01 16969. 9738. 713. 10451. 6518. 0.0682
## 5 1976-05-01 17008. 9726. 720 10446. 6562 0.0689
## 6 1976-06-01 17047. 9748. 721. 10470. 6577. 0.0689
## 7 1976-07-01 17086. 9760. 780. 10539. 6546. 0.0740
## 8 1976-08-01 17124. 9780. 744. 10524. 6600. 0.0707
## 9 1976-09-01 17154. 9795. 737. 10532. 6622. 0.0699
## 10 1976-10-01 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 7 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>, date <date>, UnempPct <dbl>,
## # LFPPct <dbl>
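If the difference between printing and saving is still fuzzy, this small sketch on a made-up tibble (the names toy, rate and pct are only for illustration) shows the same idea on a scale where you can check the object yourself.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(rate = c(0.05, 0.10))

# Without an assignment, the result is only printed
toy %>% mutate(pct = 100 * rate)
names(toy)   # toy itself has not changed

# With an assignment, the new column is stored in the object
toy <- toy %>% mutate(pct = 100 * rate)
names(toy)   # toy now contains pct as well
```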
You can also create a binary variable from a continuous variable, for example if you just want to mark unemployment levels as low or high relative to a cut-off.
Note: I picked 8% as my cut off for high and low unemployment rates. You would want to have a good reason behind your choices in your own assignments/research. I just wanted to make an example.
if_else() is a great tool for creating or recoding variables.
EmpData$highlow <- if_else(EmpData$UnempPct < 8, "low", "high")
table(EmpData$highlow)
##
## high low
## 229 312
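A quick sketch of how if_else() treats each value, using three made-up unemployment percentages. Note what happens to a value that sits exactly on the cut-off: 8 is not less than 8, so it lands in the "high" group.

```r
library(dplyr)

# Made-up unemployment percentages, including one exactly at the cut-off
unemp_pct <- c(5.4, 8, 13.7)

# if_else(condition, value if TRUE, value if FALSE)
highlow <- if_else(unemp_pct < 8, "low", "high")
highlow

table(highlow)
```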
Now let’s suppose we want to know more about the months in our data set with the highest unemployment rates. We can use filter() for this purpose:
# This will give all of the observations with unemployment rates over 12.5%
EmpData %>%
filter(UnempPct > 12.5)
## # A tibble: 8 x 15
## MonthYr Population Employed Unemployed laborforce not_laborforce UnempRate
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-10-01 19183. 10787. 1602. 12389. 6794. 0.129
## 2 1982-11-01 19203. 10764. 1600. 12364. 6839. 0.129
## 3 1982-12-01 19223. 10774. 1624. 12398. 6824. 0.131
## 4 1983-01-01 19244. 10801. 1573. 12374 6870. 0.127
## 5 1983-02-01 19266. 10818. 1574. 12392. 6875. 0.127
## 6 1983-03-01 19285. 10875. 1555. 12430. 6856. 0.125
## 7 2020-04-01 30994. 16142. 2444. 18586. 12409. 0.131
## 8 2020-05-01 31009. 16444 2610. 19054. 11955. 0.137
## # ... with 8 more variables: LFPRate <dbl>, Party <chr>, PrimeMinister <chr>,
## # AnnPopGrowth <dbl>, date <date>, UnempPct <dbl>, LFPPct <dbl>,
## # highlow <chr>
Interpretation of output: only 8 of the 541 months in our data have unemployment rates over 12.5%: the worst months of the 1982-83 recession, and April and May 2020.
Now suppose that we only want to see a few pieces of information about those months. We can use select() again to choose variables to display in the output:
# This will leave out all variables except the ones mentioned in the select() command
EmpData %>%
filter(UnempPct > 12.5) %>%
select(MonthYr, UnempRate, LFPPct, PrimeMinister)
## # A tibble: 8 x 4
## MonthYr UnempRate LFPPct PrimeMinister
## <date> <dbl> <dbl> <chr>
## 1 1982-10-01 0.129 64.6 Pierre Trudeau
## 2 1982-11-01 0.129 64.4 Pierre Trudeau
## 3 1982-12-01 0.131 64.5 Pierre Trudeau
## 4 1983-01-01 0.127 64.3 Pierre Trudeau
## 5 1983-02-01 0.127 64.3 Pierre Trudeau
## 6 1983-03-01 0.125 64.5 Pierre Trudeau
## 7 2020-04-01 0.131 60.0 Justin Trudeau
## 8 2020-05-01 0.137 61.4 Justin Trudeau
What happens if you put the select() command before the filter() command in the code below? Why?
EmpData %>%
select(MonthYr, UnempRate, LFPPct, PrimeMinister) %>%
filter(UnempPct > 12.5)
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `UnempPct > 12.5`.
## x object 'UnempPct' not found
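Show the months in order by unemployment rate. We can use arrange() to sort rows in this way. By default arrange() sorts in ascending order (lowest unemployment rate first), which is what the code below does; to put the highest unemployment rate first, wrap the variable in desc(), as in arrange(desc(UnempPct)).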
EmpData %>%
filter(UnempPct > 12.5) %>%
select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
arrange(UnempPct) # Sorts the rows by unemployment rate, default is ascending order
## # A tibble: 8 x 4
## MonthYr UnempPct LFPPct PrimeMinister
## <date> <dbl> <dbl> <chr>
## 1 1983-03-01 12.5 64.5 Pierre Trudeau
## 2 1983-02-01 12.7 64.3 Pierre Trudeau
## 3 1983-01-01 12.7 64.3 Pierre Trudeau
## 4 1982-10-01 12.9 64.6 Pierre Trudeau
## 5 1982-11-01 12.9 64.4 Pierre Trudeau
## 6 1982-12-01 13.1 64.5 Pierre Trudeau
## 7 2020-04-01 13.1 60.0 Justin Trudeau
## 8 2020-05-01 13.7 61.4 Justin Trudeau
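For the descending version, a minimal sketch on a made-up tibble (the month labels and values are invented) looks like this:

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(month = c("A", "B", "C"),
              unemp_pct = c(12.9, 13.7, 12.5))

# desc() reverses the sort, so the highest value comes first
sorted_desc <- toy %>%
  arrange(desc(unemp_pct))
sorted_desc
```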
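arrange() is very useful for putting tables in increasing or decreasing order. Keep in mind that ggplot orders the categories in a bar graph by their factor levels (alphabetically for character variables), not by the order of the rows, so to sort the bars themselves you would typically also use reorder() inside aes().
Hopefully you can see why the pipe operator is useful in making our code clear and readable. Compare these two versions of the code. Both do the same thing, but one is easier to read.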
# with pipes
EmpData %>%
filter(UnempPct > 12.5) %>%
select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
arrange(UnempPct)
# without pipes
arrange(select(filter(EmpData, UnempPct > 12.5), MonthYr, UnempPct, LFPPct, PrimeMinister), UnempPct)
The summary() function will give a basic summary of any object. Exactly what that summary looks like depends on the object. For tibbles, summary() produces a set of summary statistics for each variable. Hopefully these statistics look familiar.
summary(EmpData) # entire data frame
## MonthYr Population Employed Unemployed
## Min. :1976-01-01 Min. :16852 Min. : 9637 Min. : 691.5
## 1st Qu.:1987-04-01 1st Qu.:20290 1st Qu.:12230 1st Qu.:1102.5
## Median :1998-07-01 Median :23529 Median :14064 Median :1265.5
## Mean :1998-07-01 Mean :23795 Mean :14383 Mean :1261.0
## 3rd Qu.:2009-10-01 3rd Qu.:27327 3rd Qu.:16926 3rd Qu.:1404.6
## Max. :2021-01-01 Max. :31191 Max. :19130 Max. :2609.8
##
## laborforce not_laborforce UnempRate LFPRate
## Min. :10370 Min. : 6483 Min. :0.05446 Min. :0.5996
## 1st Qu.:13467 1st Qu.: 6842 1st Qu.:0.07032 1st Qu.:0.6501
## Median :15333 Median : 8162 Median :0.07691 Median :0.6573
## Mean :15644 Mean : 8151 Mean :0.08207 Mean :0.6564
## 3rd Qu.:18230 3rd Qu.: 9099 3rd Qu.:0.09369 3rd Qu.:0.6674
## Max. :20316 Max. :12409 Max. :0.13697 Max. :0.6766
##
## Party PrimeMinister AnnPopGrowth date
## Length:541 Length:541 Min. :0.007522 Min. :1976-01-01
## Class :character Class :character 1st Qu.:0.012390 1st Qu.:1987-04-01
## Mode :character Mode :character Median :0.013156 Median :1998-07-01
## Mean :0.013703 Mean :1998-07-01
## 3rd Qu.:0.014286 3rd Qu.:2009-10-01
## Max. :0.024815 Max. :2021-01-01
## NA's :12
## UnempPct LFPPct highlow
## Min. : 5.446 Min. :59.96 Length:541
## 1st Qu.: 7.032 1st Qu.:65.01 Class :character
## Median : 7.691 Median :65.73 Mode :character
## Mean : 8.207 Mean :65.64
## 3rd Qu.: 9.369 3rd Qu.:66.74
## Max. :13.697 Max. :67.66
##
summary(EmpData$Unemployed) # for one variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 691.5 1102.5 1265.5 1261.0 1404.6 2609.8
The R function mean() calculates the sample average of any numeric vector:
# Mean of a single variable
mean(EmpData$UnempPct)
## [1] 8.207112
In a full sentence, interpret the mean in the context of the variable.
# calculates the standard deviation
sd(EmpData$UnempPct) # 1.71
## [1] 1.709704
# median() calculates the sample median
median(EmpData$UnempPct) # 7.69% was median unemployment rate
## [1] 7.691411
In real-world data, some variables have missing values for one or more observations. For example, the AnnPopGrowth variable in our data set is missing for the first year of data (1976), since calculating the growth rate for 1976 would require data from 1975. In R, missing values are given the special value NA which stands for “not available”:
EmpData %>%
select(MonthYr, Population, AnnPopGrowth)
## # A tibble: 541 x 3
## MonthYr Population AnnPopGrowth
## <date> <dbl> <dbl>
## 1 1976-01-01 16852. NA
## 2 1976-02-01 16892 NA
## 3 1976-03-01 16931. NA
## 4 1976-04-01 16969. NA
## 5 1976-05-01 17008. NA
## 6 1976-06-01 17047. NA
## 7 1976-07-01 17086. NA
## 8 1976-08-01 17124. NA
## 9 1976-09-01 17154. NA
## 10 1976-10-01 17183. NA
## # ... with 531 more rows
When we try to take the mean of this variable we also get NA:
mean(EmpData$AnnPopGrowth) # returns NA as result
## [1] NA
This is because math in R follows rules that say that any calculation involving NA should also result in NA. Some other applications drop missing data from the calculation.
Whenever you have missing values, you should investigate before proceeding. Sometimes (as in our case here) values are missing for a good reason; other times they are the result of a mistake or problem that needs to be fixed.
Once we have investigated the missing values, we can tell R explicitly to exclude them from the calculation by adding the na.rm = TRUE option:
mean(EmpData$AnnPopGrowth, na.rm = TRUE)
## [1] 0.01370259
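The same behavior on a tiny made-up vector, including one way to count the missing values before deciding what to do with them:

```r
# Made-up growth rates with one missing value
growth <- c(NA, 0.012, 0.014)

mean(growth)                 # any calculation involving NA returns NA

sum(is.na(growth))           # how many values are missing?

mean(growth, na.rm = TRUE)   # drop the NA, then average the rest
```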
If we wanted to calculate the sample average for each numeric column in our tibble, we could apply the command mean() to all of those columns using the command lapply().
# Mean of each column
EmpData %>%
select(where(is.numeric)) %>%
lapply(mean, na.rm = TRUE)
## $Population
## [1] 23795.46
##
## $Employed
## [1] 14383.15
##
## $Unemployed
## [1] 1260.953
##
## $laborforce
## [1] 15644.1
##
## $not_laborforce
## [1] 8151.352
##
## $UnempRate
## [1] 0.08207112
##
## $LFPRate
## [1] 0.6563653
##
## $AnnPopGrowth
## [1] 0.01370259
##
## $UnempPct
## [1] 8.207112
##
## $LFPPct
## [1] 65.63653
I would not expect you to come up with this code, but maybe it kind of makes sense.
We can use this method with any function that calculates a summary statistic:
# Standard deviation of each column
EmpData %>%
select(where(is.numeric)) %>%
lapply(sd, na.rm = TRUE)
## $Population
## [1] 4034.558
##
## $Employed
## [1] 2704.267
##
## $Unemployed
## [1] 243.8356
##
## $laborforce
## [1] 2783.985
##
## $not_laborforce
## [1] 1294.117
##
## $UnempRate
## [1] 0.01709704
##
## $LFPRate
## [1] 0.01401074
##
## $AnnPopGrowth
## [1] 0.00269365
##
## $UnempPct
## [1] 1.709704
##
## $LFPPct
## [1] 1.401074
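If you prefer the result as a one-row tibble rather than a list, one alternative (not the approach used above, just another option in dplyr) is summarize() with across(). This sketch uses a made-up tibble; the names toy and col_means are only for the example.

```r
library(dplyr)

# Toy data, made up for illustration: one text column, two numeric columns
toy <- tibble(name = c("a", "b", "c"),
              x = c(1, 2, 3),
              y = c(10, NA, 30))

# Apply mean() to every numeric column, ignoring missing values
col_means <- toy %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
col_means
```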
We can also construct frequency tables for both categorical and continuous variables:
# count() creates a frequency table for categorical variables
EmpData %>% count(PrimeMinister)
## # A tibble: 10 x 2
## PrimeMinister n
## <chr> <int>
## 1 Brian Mulroney 104
## 2 Jean Chretien 120
## 3 Joe Clark 8
## 4 John Turner 2
## 5 Justin Trudeau 62
## 6 Kim Campbell 4
## 7 Paul Martin 25
## 8 Pierre Trudeau 91
## 9 Stephen Harper 116
## 10 Transfer 9
table(EmpData$PrimeMinister, EmpData$Party)
##
## Conservative Liberal Transfer
## Brian Mulroney 104 0 0
## Jean Chretien 0 120 0
## Joe Clark 8 0 0
## John Turner 0 2 0
## Justin Trudeau 0 62 0
## Kim Campbell 4 0 0
## Paul Martin 0 25 0
## Pierre Trudeau 0 91 0
## Stephen Harper 116 0 0
## Transfer 1 2 6
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
#(EmpData$PrimeMinister, EmpData$Party)
cor(EmpData$UnempPct, EmpData$LFPPct)
## [1] -0.2557409
What is the relationship between these two variables? Indicate direction and strength of the relationship.
Unemployment and labor force participation are negatively correlated: when unemployment is high, LFP tends to be low. This makes sense given the economics: if it is hard to find a job, people will move into other activities that take them out of the labor force, such as education, childcare, or retirement.
Correlation matrix for the whole data set (at least the numerical parts).
Pairwise deletion: when calculating the covariance or correlation of two variables, exclude observations with a missing value for either of those two variables.
Casewise or listwise deletion: when calculating the covariance or correlation of two variables, exclude observations with a missing value for any variable in the data.
In most applications, pairwise deletion makes the most sense because it avoids throwing out data. But it is occasionally important to use the same observations for all calculations, in which case we would use listwise deletion.
EmpData %>%
select(Population, Employed, laborforce, UnempPct) %>%
cor(use = "pairwise.complete.obs")
## Population Employed laborforce UnempPct
## Population 1.0000000 0.9905010 0.9950675 -0.4721230
## Employed 0.9905010 1.0000000 0.9964734 -0.5542043
## laborforce 0.9950675 0.9964734 1.0000000 -0.4836022
## UnempPct -0.4721230 -0.5542043 -0.4836022 1.0000000
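The four variables above have no missing values, so the two deletion rules would give the same answer there. To see them differ, here is a sketch on a made-up data frame in which only column c has a missing value. In cor(), use = "pairwise.complete.obs" is pairwise deletion and use = "complete.obs" is listwise deletion.

```r
# Toy data, made up for illustration: only column c has a missing value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  c = c(NA, 1, 2, 3, 5)
)

# Pairwise deletion: the a-b correlation uses all five rows
pairwise <- cor(toy, use = "pairwise.complete.obs")
pairwise["a", "b"]

# Listwise deletion: row 1 is dropped for every pair, including a-b
listwise <- cor(toy, use = "complete.obs")
listwise["a", "b"]
```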
Notice which variables are EXTREMELY correlated with each other. Think about why that is. Even though the variables are strongly correlated, does that correlation have meaningful implications?
Calculate the average unemployment rate for each prime minister.
EmpData %>%
group_by(PrimeMinister) %>%
summarize(avg_unemp_rate = mean(UnempRate))
## # A tibble: 10 x 2
## PrimeMinister avg_unemp_rate
## <chr> <dbl>
## 1 Brian Mulroney 0.0939
## 2 Jean Chretien 0.0841
## 3 Joe Clark 0.0726
## 4 John Turner 0.112
## 5 Justin Trudeau 0.0698
## 6 Kim Campbell 0.114
## 7 Paul Martin 0.0694
## 8 Pierre Trudeau 0.0896
## 9 Stephen Harper 0.0711
## 10 Transfer 0.0911
EmpData %>%
group_by(PrimeMinister) %>%
summarize(avg_unemp_rate = round(mean(UnempRate), digits = 3))
## # A tibble: 10 x 2
## PrimeMinister avg_unemp_rate
## <chr> <dbl>
## 1 Brian Mulroney 0.094
## 2 Jean Chretien 0.084
## 3 Joe Clark 0.073
## 4 John Turner 0.112
## 5 Justin Trudeau 0.07
## 6 Kim Campbell 0.114
## 7 Paul Martin 0.069
## 8 Pierre Trudeau 0.09
## 9 Stephen Harper 0.071
## 10 Transfer 0.091
EmpData %>%
group_by(PrimeMinister) %>%
summarize(avg_unemp_rate = round(mean(UnempRate), digits = 3)) %>%
ggplot(aes(avg_unemp_rate, PrimeMinister)) +
geom_col()
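If you want the bars sorted by value instead of alphabetically, one option is reorder() inside aes(). This is only a sketch on made-up data: the names "A", "B", "C" are placeholders and the rates are invented.

```r
library(ggplot2)

# Toy data, made up for illustration
pm <- data.frame(
  PrimeMinister = c("A", "B", "C"),
  avg_unemp_rate = c(0.09, 0.07, 0.11)
)

# reorder() turns the names into a factor whose levels are sorted by the rate
bar_order <- levels(reorder(pm$PrimeMinister, pm$avg_unemp_rate))
bar_order

# ggplot draws the bars in the order of those factor levels
ggplot(pm, aes(x = avg_unemp_rate,
               y = reorder(PrimeMinister, avg_unemp_rate))) +
  geom_col() +
  ylab("PrimeMinister")
```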
More graphing practice!
Histogram of unemployment rate:
ggplot(data = EmpData,
mapping = aes(x = UnempPct)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Time series (line) graph:
ggplot(data = EmpData,
mapping = aes(x = MonthYr, y = UnempPct)) +
geom_line()
The ggplot() function has a non-standard syntax: you start with ggplot() and then add geometries, labels, and other elements to the graph with +:
ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) +
geom_line() +
labs(title = "Unemployment rate",
subtitle = "January 1976 - January 2021",
caption = "Source: Statistics Canada, Labour Force Survey",
tag = "Canada") +
xlab("") +
ylab("Unemployment rate, %")
You can change the color of any geometric element using the col= argument:
ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) +
geom_line(col = "blue")
Colors can be given as named colors (run colors() to see the list of names R recognizes) or as hexadecimal RGB color codes such as "#0000FF".
Some geometric elements, such as the bars in a histogram, also have a fill color:
ggplot(data = EmpData, aes(x = UnempPct)) +
geom_histogram(col = "red", fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the col= argument sets the color for the exterior of each bar, and the fill= argument sets the color for the interior.
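Remember, time is on the x-axis, and the percent of the labor force that is unemployed is on the y-axis. The code below adds the labor force participation rate as a second line on the same axes.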
ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) +
geom_line(col = "blue") +
geom_line(aes(y = LFPPct), col = "red")
A few things to note for the image above:
We could add a legend here, but it is better (and friendlier to the color-blind) to just label the lines. We can use the geom_text geometry to do this:
ggplot(data = EmpData, aes(x = MonthYr)) +
geom_line(aes(y = UnempPct), col = "blue") +
geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 15, label = "Unemployment",
col = "blue") +
geom_line(aes(y = LFPPct), col = "red") +
geom_text(x = as.Date("1/1/2000",
"%m/%d/%Y"), y = 60, label = "% in Labor Force", col = "red")
Add comments to each line of code indicating what is being done
ggplot(data = EmpData, aes(x = UnempPct)) +
geom_histogram(binwidth = 0.5, fill = "blue") +
geom_density() +
labs(title = "Unemployment rate",
subtitle = "January 1976 - January 2021",
caption = "Source: Statistics Canada, Labour Force Survey",
tag = "Canada") +
xlab("Unemployment rate, %") +
ylab("Count")
Use filter, arrange and select to modify a data table
Starting with the data table EmpData:
class(EmpData$MonthYr) # Date (it was converted from character earlier in the lab)
## [1] "Date"
PPData <- EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y")) %>%
mutate(Year = format(MonthYr, "%Y")) %>%
mutate(EmpRate = Employed/Population) %>%
filter(Year >= 2010) %>%
select(MonthYr, Year, EmpRate, UnempRate, AnnPopGrowth) %>%
arrange(EmpRate)
print(PPData)
## # A tibble: 133 x 5
## MonthYr Year EmpRate UnempRate AnnPopGrowth
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2020-04-01 2020 0.521 0.131 0.0132
## 2 2020-05-01 2020 0.530 0.137 0.0124
## 3 2020-06-01 2020 0.560 0.125 0.0118
## 4 2020-07-01 2020 0.573 0.109 0.0109
## 5 2020-08-01 2020 0.580 0.102 0.0104
## 6 2020-03-01 2020 0.585 0.0789 0.0143
## 7 2021-01-01 2021 0.586 0.0941 0.00863
## 8 2020-09-01 2020 0.591 0.0918 0.00992
## 9 2020-12-01 2020 0.593 0.0876 0.00905
## 10 2020-10-01 2020 0.594 0.0902 0.00961
## # ... with 123 more rows
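Notice that Year is stored as <chr> in the output, because format() returns text. Comparing text to 2010 still works for four-digit years, but if you would rather have a numeric year, one option is to wrap the result in as.integer(). A small sketch with made-up dates:

```r
# Made-up dates for illustration
d <- as.Date(c("1999-06-01", "2010-01-01", "2021-01-01"))

year_chr <- format(d, "%Y")        # format() returns text
class(year_chr)

year_int <- as.integer(year_chr)   # convert the text to whole numbers
class(year_int)

year_int >= 2010                   # now a numeric comparison
```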
Recognize and handle missing data problems
Starting with the PPData data table you created in question (1) above:
### A: Mean employment rate
mean(PPData$EmpRate)
## [1] 0.610809
### B: Table of medians
PPData %>%
select(where(is.numeric)) %>%
lapply(median)
## $EmpRate
## [1] 0.6139724
##
## $UnempRate
## [1] 0.07082771
##
## $AnnPopGrowth
## [1] 0.01171851
Construct a simple or binned frequency table
Using the PPData data set, construct a frequency table of the employment rate.
PPData %>%
count(cut_interval(EmpRate, 6))
## # A tibble: 5 x 2
## `cut_interval(EmpRate, 6)` n
## <fct> <int>
## 1 [0.521,0.537] 2
## 2 (0.554,0.57] 1
## 3 (0.57,0.587] 4
## 4 (0.587,0.603] 4
## 5 (0.603,0.62] 122
# The employment rate is a continuous variable, so the appropriate kind of frequency table here is a binned frequency table.
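You may have noticed that the table has five rows even though we asked for six intervals: count() leaves out bins that contain no observations. If you want the empty bins listed with a count of zero, one option is the .drop = FALSE argument of count(). A sketch on made-up numbers (the column name bin is my own choice):

```r
library(dplyr)
library(ggplot2)   # cut_interval() comes from ggplot2

# Toy data, made up for illustration: nothing falls in the middle interval
toy <- tibble(x = c(1, 2, 3, 10))

# Default: the empty bin is left out of the table
default_tab <- toy %>%
  count(bin = cut_interval(x, 3))
default_tab

# .drop = FALSE keeps the empty bin and reports n = 0 for it
full_tab <- toy %>%
  count(bin = cut_interval(x, 3), .drop = FALSE)
full_tab
```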
Create a histogram with ggplot
Using the PPData data set, create a histogram of the employment rate.
ggplot(data = PPData,
mapping = aes(x = EmpRate)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Create a line graph with ggplot
Using the PPData data set, create a time series graph of the employment rate.
ggplot(data = PPData,
mapping = aes(x = MonthYr, y = EmpRate)) +
geom_line()
Calculate and interpret correlation
Using the EmpData data set, calculate the covariance and correlation of UnempPct and AnnPopGrowth. Based on these results, are periods of high population growth typically periods of high unemployment?
cor(EmpData$UnempPct, EmpData$AnnPopGrowth, use = "complete.obs")
## [1] -0.06513125
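The covariance works the same way as the correlation: cov() takes the same use = argument as cor(). Here is a sketch on made-up vectors (not the lab data) that also shows how the two are related: the correlation is the covariance divided by the product of the two standard deviations.

```r
# Made-up data with one missing value
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 7)

cov_xy <- cov(x, y, use = "complete.obs")   # covariance, dropping the NA row
cor_xy <- cor(x, y, use = "complete.obs")   # correlation, dropping the NA row
cov_xy
cor_xy

# Rescale the covariance by hand, using only the rows with no missing values
keep <- complete.cases(x, y)
cov_xy / (sd(x[keep]) * sd(y[keep]))
```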
Construct and interpret a scatter plot in R
Using the EmpData data set, construct a scatter plot with annual population growth on the horizontal axis and unemployment rate on the vertical axis.
ggplot(data = EmpData, mapping = aes(x = AnnPopGrowth,
y = UnempRate)) +
geom_point()
## Warning: Removed 12 rows containing missing values (geom_point).
Construct and interpret a linear or smoothed average plot in R
Using the EmpData data set, construct the same scatter plot as in the problem above, but add a smooth fit and a linear fit.
ggplot(data = EmpData,
mapping = aes(x = AnnPopGrowth, y = UnempRate)) +
geom_point() +
geom_smooth(col = "green") +
geom_smooth(method = "lm", col = "blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).
Note: If you want to hide the warnings, add warning = FALSE, message = FALSE to the options in the chunk header (inside the {r} brackets). You can only see this difference in the code of the actual .Rmd file.
ggplot(data = EmpData,
mapping = aes(x = AnnPopGrowth, y = UnempRate)) +
geom_point() +
geom_smooth(col = "green") +
geom_smooth(method = "lm", col = "blue")