Grading is in progress, but can not be discussed until all grades are posted.
Checking the submitted work of all students takes time but should be completed this week.
Comments about HWs 1 - 3
HW 1 and HW 2 grades are posted
HW 3 will be graded this week.
HW 4 is posted and is due next week on Wednesday, 2/28/24.
We will work through part of it together on Thursday
Demo Videos are available.
Recall from Lecture 9:
R functions:
Useful for automating repetitive tasks
Best Practices for Writing Functions
Plan what you want to do
Write out task steps
Develop and refine the code to complete tasks
Subdivide multi-step tasks into short sets of tasks.
Convert each set of tasks to a function
Make the function more general so it is more versatile.
Common Function Structure Options
Anatomy of a Function (Option 1):
Commands within function are not saved to an object.
Function results are automatically ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
command 1 to do "stuff" to inputs |>
command 2 to do "stuff" to inputs |>
command 3 to do "stuff" to inputs |> etc. # output is autmatically "kicked out"
}
Anatomy of a Function (Option 2 - Shown in Week 5 Lecture):
Commands within function are saved to an object.
Function ends with name of object so that results are ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
output <- command 1 to do "stuff" to inputs |>
command 2 to do "stuff" to inputs |>
command 3 to do "stuff" to inputs |> etc.
output # end with name of output so that it is "kicked out" of function
}
Box Office Mojo Data Cleaning Steps
Examine .csv file to determine number of rows to skip
Import raw .csv file and skip header rows above variable names
Select useful columns (Columns 1, 4, 7, and 9):
1: Date
4: Top 10 Gross- <chr> variable with nuisance characters to be removed ($, ,)
7: Releases
9: Gross - <chr> variable with nuisance characters to be removed ($, ,)
Rename these four variables to easier names to work with in R
New names should not have spaces.
Lower case names with underscores (_) work well in code.
Variable names and labels can be reformatted for plots and tables.
Box Office Mojo Data Cleaning Steps Cont’d
5 . Remove non-data rows (Holidays, etc.) with a filter command.
Use mutate to convert variables to use-able formats:
Use paste to add year text to date character variable and convert to a date variable.
Convert Releases (num_releases) to an integer variable
Use gsub and as.numeric to convert each gross variable to numeric:
gsub is used to remove nuisance characters, $ and ,
NOTE: multiple adjacent columns with same nuisance characters can be cleaned at the same time with mutate(across...)
as.numeric is used to convert character to numeric decimal value (<dbl>) once nusaince characters are removed.
```{r}#|label: bom_cln_functionbom_cln <- function(data_file, yr, skip_num){ # inputs: data_file is file name # yr is year of data # skip_num is number of header rows to skip read_csv(data_file, show_col_types = F, skip=skip_num) |> # data_file and skip_num used here select(1,4,7,9) |> # columns needed are always the same for bom rename("date" = "Date", # column renaming always the same "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filter out non-data rows mutate(date = dmy(paste(date,yr)), # paste yr input to date text and convert to date num_releases = as.integer(num_releases), top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) }```
Working with Dates using lubridate
lubridate can convert a wide variety of text information to a date.
User must specify order of information, e.g. ymd indicates year, then month, the day
Text must include year, but day is not required (See HW 4)
Code
```{r}#|label: examples of lubridate"31st of October, 2040" # text datedmy("31st of October, 2040") # example of lubridate command dmypaste("Hello", "Goodbye") # example of paste concatenating two text strings"March 15" # example of month and day only # lubridate comands require a year```
```{r}#|label: more lubridate examplespaste("February", 2040) # using paste to add year to month and daymy(paste("February", "2040")) # using lubridate my command with pastepaste("February", "2040") |> my() # same command using piping"March 15th, 2039" # different form of text date (Q1 next slide)ymd(20390315) # example of lubridate command ymd Sys.Date() # demo of Sys.date command ```
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste(date, yr))`.
Caused by warning:
! 1 failed to parse.
Code
```{r}bom2021<- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11, out_file = "data/box_office_mojo_2021_tidy.csv")bom2022<- bom_cln_out(data_file = "data/box_office_mojo_2022.csv", yr = 2022, skip_num=11, out_file = "data/box_office_mojo_2022_tidy.csv")# fix NA due to leap day manually# not easily fixed within functionbom2020$date[is.na(bom2020$date)] <- dmy(29022020)write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv")```
Join datasets vertically (stacking)
If you have multiple identical datasets with the same columns, you can ‘stack’ them using the bind_rows command.
Recommended: Import and format datasets to all have same column names so that this is seamless.
Functions are useful for creating identically formatted datasets.
Code
```{r}#|label: create one 5 year BOMdatasetbom_2018_2022 <- bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets# Modify gross variables (could also be done in function)bom_2018_2022 <- bom_2018_2022 |> mutate(top10grossM = (top10gross/1000000) |> round(2), num1grossM = (num1gross/1000000) |> round(2)) |> filter(date <= "2022-12-30") |> glimpse()```
Unemployment Rate, the percentage of people unemployed (bls_unemp_rate.csv)
Import Price Index (bls_import_index.csv)
Export Price Index (bls_export_index.csv)
The Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world.
Join datasets horizontally by matching variables and values (HW 4:full_join)
Working with date variables using lubridate
Command(s) to create date variables
using paste to add year to text
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.
Source Code
---title: "Week 6"subtitle: "Reshaping. Cleaning, and Joining Data"author: "Penelope Pooler Eisenbies"date: last-modifiedlightbox: truetoc: truetoc-depth: 3toc-location: lefttoc-title: "Table of Contents"toc-expand: 1format: html: code-line-numbers: true code-fold: true code-tools: trueexecute: echo: fenced---## Housekeeping```{r include=F}#|label: setupknitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunksoptions(scipen=100) # suppress scientific notation # install pacman if neededif (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")pacman::p_load(pacman, tidyverse, gridExtra, magrittr, kableExtra) # install and load required packages p_loaded() # verify loaded packages```**Comments About Quiz 1**- Grading is in progress, but can not be discussed until all grades are posted.- Checking the submitted work of all students takes time but should be completed this week.::: fragment**Comments about HWs 1 - 3**:::- HW 1 and HW 2 grades are posted- HW 3 will be graded this week.::: fragment**HW 4 is posted and is due next week on Wednesday, 2/28/24.**:::- We will work through part of it together on Thursday- Demo Videos are available.## Recall from Lecture 9::::::::: columns::::: {.column width="48%"}::: fragment#### R functions::::- Useful for automating repetitive tasks::: fragment#### Best Practices for Writing Functions:::- Plan what you want to do - Write out task steps- Develop and refine the code to complete tasks- Subdivide multi-step tasks into short sets of tasks.- Convert each set of tasks to a function - Make the function more general so it is more versatile.:::::::: {.column width="4%"}:::::: {.column width="48%"}:::::::::::## ### Common Function Structure Options::: fragment#### Anatomy of a Function (Option 1)::::- Commands within function **are not saved** to an object.- Function results are automatically 'kicked out' as output.::: fragment``` Function_Name <- function(input_1, input_2, etc){ command 1 to do "stuff" to inputs |> command 2 to do "stuff" to inputs |> command 3 to do "stuff" to inputs |> etc. # output is autmatically "kicked out"}```:::::: fragment#### Anatomy of a Function (Option 2 - Shown in Week 5 Lecture)::::- Commands within function **are saved** to an object.- Function ends with name of object so that results are 'kicked out' as output.::: fragment``` Function_Name <- function(input_1, input_2, etc){ output <- command 1 to do "stuff" to inputs |> command 2 to do "stuff" to inputs |> command 3 to do "stuff" to inputs |> etc. output # end with name of output so that it is "kicked out" of function}```:::## ### Box Office Mojo Data Cleaning Steps1. Examine .csv file to determine number of rows to skip2. Import raw .csv file and skip header rows above variable names3. Select useful columns (Columns 1, 4, 7, and 9): - 1: `Date` - 4: `Top 10 Gross`- `<chr>` variable with nuisance characters to be removed (`$`, `,`) - 7: `Releases` - 9: `Gross` - `<chr>` variable with nuisance characters to be removed (`$`, `,`)4. Rename these four variables to easier names to work with in R - New names should not have spaces. - Lower case names with underscores (`_`) work well in code. - Variable names and labels can be reformatted for plots and tables.## ### Box Office Mojo Data Cleaning Steps Cont'd5 . Remove non-data rows (Holidays, etc.) with a `filter` command.6. Use `mutate` to convert variables to use-able formats: - Use `paste` to add year text to `date` character variable and convert to a date variable. - Convert `Releases` (`num_releases`) to an integer variable - Use `gsub` and `as.numeric` to convert each gross variable to numeric: - `gsub` is used to remove nuisance characters, `$` and `,` - NOTE: multiple adjacent columns with same nuisance characters can be cleaned at the same time with `mutate(across...)` - `as.numeric` is used to convert character to numeric decimal value (`<dbl>`) once nusaince characters are removed.## Cleaning One Data Set```{r}#|label: cleaning 1 box office mojo datasetbom2022 <-read_csv("data/box_office_mojo_2022.csv", skip=11, show_col_types = F) |>select(1,4,7,9) |># select columns by number (use with care)rename("date"="Date", "top10gross"="Top 10 Gross", "num_releases"="Releases","num1gross"="Gross") |>filter(!is.na(top10gross)) |># filters out empty holiday rowsmutate(date =dmy(paste(date,2022)),num_releases =as.integer(num_releases),# gross variables cleaned one at a timetop10gross =gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross =gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>as.numeric(),num1gross =gsub(pattern="$", replacement="", x=num1gross, fixed=T),num1gross =gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>as.numeric()) |>glimpse()```## ### Converting B.O.M. Cleaning Code to a Function```{r}#|label: bom_cln_functionbom_cln <-function(data_file, yr, skip_num){ # inputs: data_file is file name# yr is year of data# skip_num is number of header rows to skipread_csv(data_file, show_col_types = F, skip=skip_num) |># data_file and skip_num used here select(1,4,7,9) |># columns needed are always the same for bomrename("date"="Date", # column renaming always the same"top10gross"="Top 10 Gross", "num_releases"="Releases","num1gross"="Gross") |>filter(!is.na(top10gross)) |># filter out non-data rowsmutate(date =dmy(paste(date,yr)), # paste yr input to date text and convert to datenum_releases =as.integer(num_releases), top10gross =gsub(pattern="$", replacement="", x=top10gross, fixed=T),top10gross =gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>as.numeric(),num1gross =gsub(pattern="$", replacement="", x=num1gross, fixed=T),num1gross =gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>as.numeric()) }```## Working with Dates using `lubridate`- `lubridate` can convert a wide variety of text information to a date.- User must specify order of information, e.g. `ymd` indicates year, then month, the day- Text must include year, but day is not required (See HW 4):::: fragment::: r-fit-text```{r}#|label: examples of lubridate"31st of October, 2040"# text datedmy("31st of October, 2040") # example of lubridate command dmypaste("Hello", "Goodbye") # example of paste concatenating two text strings"March 15"# example of month and day only# lubridate comands require a year```:::::::## Lubridate Examples Continued::: r-fit-text```{r}#|label: more lubridate examplespaste("February", 2040) # using paste to add year to month and daymy(paste("February", "2040")) # using lubridate my command with pastepaste("February", "2040") |>my() # same command using piping"March 15th, 2039"# different form of text date (Q1 next slide)ymd(20390315) # example of lubridate command ymd Sys.Date() # demo of Sys.date command ```:::## Week 6 In-class Exercises - Q1:::::: columns::: {.column width="48%"}***Session ID: bua455f24***What is the correct `lubridate` command to convert the following date text to a date value in R.**"March 15th, 2039"**<br>**Hint:** Examine the [Lubridate Cheat Sheet](https://rstudio.github.io/cheatsheets/lubridate.pdf) and test out commands in console to see what the output is.**NOTE:** `lubridate` commands will only work if you have run the setup for this lecture to load the `tidyverse` suite of packages.:::::: {.column width="4%"}:::::: {.column width="48%"}:::::::::## Function Format Options::: panel-tabset### [Option 1]{style="color:blue;"}No objects are saved within function and result is `kicked out`Ideal for straightforward functions and plot functions```{r}#|label: bom_cln function option 1bom_cln <-function(data_file, yr, skip_num){ read_csv(data_file, show_col_types = F, skip=skip_num) |># data_file and skip_num used here select(1,4,7,9) |># columns needed are always the same for these datasetsrename("date"="Date", # column renaming always the same"top10gross"="Top 10 Gross", "num_releases"="Releases","num1gross"="Gross") |>filter(!is.na(top10gross)) |># filter out non-data rowsmutate(date =dmy(paste(date,yr)), # paste yr input to date text and convert to datenum_releases =as.integer(num_releases),top10gross =gsub(pattern="$", replacement="", x=top10gross, fixed=T),top10gross =gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>as.numeric(),num1gross =gsub(pattern="$", replacement="", x=num1gross, fixed=T),num1gross =gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>as.numeric()) }bom2021_Op1 <-bom_cln(data_file ="data/box_office_mojo_2021.csv", yr =2021, skip_num=11) # use function```### [Option 2]{style="color:blue;"}Commands within function are saved to an object, `d_out`Function ends with `d_out` so that result gets `kicked-out` of function```{r}#|label: bom_cln function option 2bom_cln <-function(data_file, yr, skip_num){ d_out <-read_csv(data_file, show_col_types = F, skip=skip_num) |># data_file and skip_num used hereselect(1,4,7,9) |># columns needed are always the same for these datasetsrename("date"="Date", # column renaming always the same"top10gross"="Top 10 Gross", "num_releases"="Releases","num1gross"="Gross") |>filter(!is.na(top10gross)) |># filter out non-data rowsmutate(date =dmy(paste(date,yr)), # paste yr input to date text and convert to datenum_releases =as.integer(num_releases),top10gross =gsub(pattern="$", replacement="", x=top10gross, fixed=T),top10gross =gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>as.numeric(),num1gross =gsub(pattern="$", replacement="", x=num1gross, fixed=T),num1gross =gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>as.numeric()) d_out}bom2021_Op2 <-bom_cln(data_file ="data/box_office_mojo_2021.csv", yr =2021, skip_num=11) # use function```### [Option 3]{style="color:blue;"}This option has 1 more input, `out_file`, to export clean data using `write_csv`Useful for providing `clean` data to a client or colleague who doesn't use R```{r}#|label: bom_cln function option 3bom_cln_out <-function(data_file, yr, skip_num, out_file){ read_csv(data_file, show_col_types = F, skip=skip_num) |># data_file and skip_num used here select(1,4,7,9) |># columns needed are always the same for these datasetsrename("date"="Date", # column renaming always the same"top10gross"="Top 10 Gross", "num_releases"="Releases","num1gross"="Gross") |>filter(!is.na(top10gross)) |># filter out non-data rowsmutate(date =dmy(paste(date,yr)), # paste yr input to date text and convert to datenum_releases =as.integer(num_releases),top10gross =gsub(pattern="$", replacement="", x=top10gross, fixed=T),top10gross =gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>as.numeric(),num1gross =gsub(pattern="$", replacement="", x=num1gross, fixed=T),num1gross =gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>as.numeric()) |>write_csv(out_file)}bom2021_Op3 <-bom_cln_out(data_file ="data/box_office_mojo_2021.csv", yr =2021, skip_num=11, out_file ="data/box_office_mojo_2021_tidy.csv") # use function```:::## Week 6 In-class Exercises - Q2:::::: columns::: {.column width="48%"}***Session ID: bua455f24***When using the 'Option 3' Function, what output(s) are created?A. An exported .csv fileB. A clean tibble dataset saved to the Global EnvironmentC. Both an exported .csv file and a clean tibble dataset in R:::::: {.column width="4%"}:::::: {.column width="48%"}:::::::::## More Comments on Functions- Even if a function works for one situation, it may not work for every situation.- Published R functions are very long because they are tested and edited and retested to account for most situations.- For example, you can examine the text for a function we have used - In the console, type `read_csv` without the parentheses and push `Enter`. - You can see the text of the function which is complex.## Using our FunctionWe will run the Option 3 version, which cleans the data AND exports a cleaned (tidied) dataset.Note that there will be one error for 2020 due to the leap day (2/29/2020).```{r}#|label: testing our functionbom2018<-bom_cln_out(data_file ="data/box_office_mojo_2018.csv", yr =2018, skip_num=11, out_file ="data/box_office_mojo_2018_tidy.csv")bom2019<-bom_cln_out(data_file ="data/box_office_mojo_2019.csv", yr =2019, skip_num=11, out_file ="data/box_office_mojo_2019_tidy.csv")bom2020<-bom_cln_out(data_file ="data/box_office_mojo_2020.csv", yr =2020, skip_num=11, out_file ="data/box_office_mojo_2020_tidy.csv")bom2021<-bom_cln_out(data_file ="data/box_office_mojo_2021.csv", yr =2021, skip_num=11, out_file ="data/box_office_mojo_2021_tidy.csv")bom2022<-bom_cln_out(data_file ="data/box_office_mojo_2022.csv", yr =2022, skip_num=11, out_file ="data/box_office_mojo_2022_tidy.csv")# fix NA due to leap day manually# not easily fixed within functionbom2020$date[is.na(bom2020$date)] <-dmy(29022020)write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv")```## Join datasets vertically (stacking)- If you have multiple identical datasets with the same columns, you can 'stack' them using the `bind_rows` command.- **Recommended:** Import and format datasets to all have same column names so that this is seamless. - Functions are useful for creating identically formatted datasets.::: fragment```{r}#|label: create one 5 year BOMdatasetbom_2018_2022 <-bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets# Modify gross variables (could also be done in function)bom_2018_2022 <- bom_2018_2022 |>mutate(top10grossM = (top10gross/1000000) |>round(2), num1grossM = (num1gross/1000000) |>round(2)) |>filter(date <="2022-12-30") |>glimpse()```:::## HW 4 PreviewWe will create a time series line plot from multiple years of data in classIn HW 4 you will use data from the [Bureau of Labor Statistics](https://www.bls.gov/) that is easier to clean- Much of this code is review- Useful plot formatting code: - Move legend to bottom: `theme(legend.position="bottom")` - Format scale of dates: `scale_x_date(date_breaks = "3 months", date_labels = "%b")` - HW 4 date format: `scale_x_date(date_breaks = "year", date_labels = "%Y)"` - Format all plot labels using a `labs` command## :::: panel-tabset### [Modify Data]{style="color:blue;"}Reshape data and create a factor variable with plot label text (Similar to Hw 4 - Part 4)::: r-fit-text```{r modify data for lineplot}bom_plot_data <- bom_2018_2022 |> select(date, top10grossM, num1grossM) |> pivot_longer(cols=top10grossM:num1grossM, names_to = "type", values_to="grossM") |> mutate(type=factor(type, levels=c("top10grossM", "num1grossM"), labels=c("Top 10", "No. 1"))) head(bom_plot_data)```:::### [Basic Plot]{style="color:blue;"}```{r fig.dim=c(15,5)}#|label: basic line plot(bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type)) + theme_classic())```### [Legend & Lines]{style="color:blue;"}```{r fig.dim=c(15,5)}#|label: modify legend and lines(bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type), linewidth=0.75) + theme_classic() + theme(legend.position="bottom") + scale_color_manual(values=c("lightblue","blue")))```### [X-axis & Labels]{style="color:blue;"}```{r fig.dim=c(15,5)}#|label: modify labels and axis breaks(bom_ln_plot <- bom_ln_plot + scale_x_date(date_breaks = "3 months", # OG code for date_labels date_labels = "%b") + labs(x="Date", y = "Gross ($Mill)", color="", # modify plot labels title="Top 10 and No. 1 Movie Gross by Date", subtitle="Jan. 1, 2018 - Dec. 30, 2022", caption="Data Source:www.boxoffice.mojo.com"))```### [Text & Border]{style="color:blue;"}```{r fig.dim=c(15,5)}#|label: modify text and borders(bom_ln_plot <- bom_ln_plot + theme(plot.title = element_text(size = 20), plot.subtitle = element_text(size = 15), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), panel.border = element_rect(colour = "lightgrey", fill=NA, size=2), plot.background = element_rect(colour = "darkgrey", fill=NA, size=2)))```::::## ::: panel-tabset### [Complete Plot Code]{style="color:blue;"}Similar but more complex than complete plot code for HW 4.Additional formatting is required for slides that is not needed in a HTML display.HW 4 plot code will end with `labs` command`theme` options may be useful for group project.```{r complete code}bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type), size=0.75) + theme_classic() + theme(legend.position="bottom") + # position legend at bootom scale_color_manual(values=c("lightblue","blue")) + # modify line colors scale_x_date(date_breaks = "3 months", date_labels = "%b") + # OG code for date_labels labs(x="Date", y = "Gross ($Mill)", color="", # modify plot labels title="Top 10 and No. 1 Movie Gross by Date", subtitle="Jan. 1, 2018 - Dec. 30, 2022", caption="Data Source:www.boxoffice.mojo.com") + theme(plot.title = element_text(size = 20), # text size and borders plot.subtitle = element_text(size = 15), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), panel.border = element_rect(colour = "lightgrey", fill=NA, size=2), plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))```### [Final Formatted Plot]{style="color:blue;"}```{r final plot, echo = F, fig.dim=c(16,6), fig.align='center'}bom_ln_plot```:::## Bureau of Labor Statistics (BLS) Data- Bureau of Labor Statistics is an excellent resource for economic data.- Many (not all) BLS datasets have a similar format.- In HW 4 you will: - create a function from code used to clean the monthly unemployment Rate data. - use this function to import and clean two data sets: - **Export Price Index** and **Import Price Index**. - join these two data-sets using a `full_join` and create a `date` variable. - modify the dataset for a plot and follow instructions to format plot.- Next lecture(s): We will adapt this function to examine and plot labor force variables.- Recall: Before importing ANY data, **ALWAYS** examine the `.csv` file.## HW 4 - Bureau of Labor Statistics Data[Bureau of Labor Statistics](https://www.bls.gov/) is a great source for economic data.In HW 4, we will look at three data sets:- Unemployment Rate, the percentage of people unemployed (`bls_unemp_rate.csv`)- Import Price Index (`bls_import_index.csv`)- Export Price Index (`bls_export_index.csv`)::: fragmentThe Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world.:::- [Documentation on Import and Export Price Indixes](https://www.bls.gov/mxp/)- [Additional Documention provided by Investopedia](https://www.investopedia.com/terms/i/import-export-prices.asp)- Before going further, examine these three .csv files to determine number of header rows to skip when cleaning the data.## Introduction to Joins::::::: columns:::: {.column width="48%"}Earlier we discussed stacking datasets vertically with `bind_rows`.We used `date` to stack the data in the correct order.Alternatively, we can join datasets horizontally (side-by-side)- This in only possible if the data sets have variables in common with matching values.- We will cover four types of common joins in this course.::: fragmentThe [`dplyr` Cheat Sheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf) is a comprehensive reference.:::::::::: {.column width="4%"}:::::: {.column width="48%"}::::::::::## Types of Joins- Datasets are joined two at a time, then a third can be joined to first two.- In BUA 455, we will cover four types of joins (but others exist): - `left_join`: include all observations in **FIRST** dataset in join command (dataset on left) - `right_join`: include all observations in **SECOND** dataset in join command (dataset on right) - `inner_join`: only include observations that are in **BOTH** datasets - `full_join`: include all observations that are in **EITHER** dataset- All of these join commands are useful, but `left_join` and `right_join` are redundant.- I usually (not always) prefer using `full_join` to help reveal data errors or discrepancies between multiple datasets.## A note about R colors- There are [657 Colors in R](https://docs.google.com/spreadsheets/d/1m-UfG96DB4ZAOntVyzcjuYrX564AdLGB2f-MA6rvc2s/edit?usp=sharing)- We'll talk more about color options and palettes, but it's good to know they exist.- We don't emphasize data visualization theory in this course, BUT managing data entails visualizations - The goal is to manage and make sense of the data for a client, colleague, or supervisor.- Choosing colors, symbols, size of text, size of lines, etc. wisely is an important part of data management and data curation.- As we transition to discussing the group projects (next Week), we will talk more about how to communicate with data effectively. - Color choices will be part of that discussion.- In HW 4 and today's demo, I manually choose colors to provide examples.## Outline of HW 4- First Steps - Create R project for HW 4. - Create `data` and `img` folders. - Save HW4 Quarto (`.qmd`) Template to project folder. - Save data files to `data` folder within project.- Part 1 - Create `bls_tidy` function from code used to import and clean `bls_unemp_rate.csv` - Run the function and then use it to import and clean `bls_unemp_rate.csv`- Part 2 - Use `bls_tidy` function to - import and clean `bls_import_index.csv` and save it as `import_index` - import and clean `bls_export_index.csv` and save it as `export_index`## Outline of HW 4 Cont'd- Part 3 - Use full `full_join` to join `import_index` and `export_index` into one dataset, `export_import`. - Use correct `lubridate` command and `paste` command to create a date variable.- Part 4 - Reshape `export_import` dataset using `pivot_longer` to create `exp_imp_plt` - `exp_imp_plt` will be used to plot the data. - Type of index, `type` will be converted to a `factor` with levels and labels for the plot.## Outline of HW 4 Cont'd- Part 5 - Create a formatted line plot with: - one line for the Export Price Index (light blue line). - one line for Import Price Index (blue line)- Final Steps - Answer all Blackboard Questions. - Knit Quarto (.qmd) file. - Create README FOR HW 4. - Save all files correctly within HW 4 R Project folder. - Zip and submit HW 4 R project folder.## Example of HW 4 Final Plot**NOTES:** - Final HW 4 plot should be almost identical to the one below **EXCEPT** for second line of caption.- Slides version has more formatting. See version in `.html` lecture notes and HW 4 Instructions.::: fragment{fig-align="center" height="4.5in"}:::## Next Lectures:::::: columns::: {.column width="48%"}- More Examples from [BLS](https://www.bls.gov/)- line and area plots,- Interactive time series plots - hcharts and dygraphs- Converting between data formats (`xts` and `tibble`)- Importing stock data from Yahoo Finance to R- **Introducing the Group Project**:::::: {.column width="4%"}:::::: {.column width="48%"}{fig-align="center"}:::::::::## ### Key Points from This Week::: fragment**Developing Functions:**:::- Develop code, convert to a function, test it::: fragment**Vertical and Horizontal Data Joins**:::- Stack datasets vertically using `bind_rows`- Join datasets horizontally by matching variables and values (**HW 4:** `full_join`)::: fragment**Working with date variables using `lubridate`**:::- Command(s) to create date variables- using `paste` to add year to text::: fragmentYou may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions are required during the semester.**:::