Week 6

Reshaping. Cleaning, and Joining Data

Author

Penelope Pooler Eisenbies

Published

February 17, 2025

Housekeeping

Comments About Quiz 1

Grading is in progress, but can not be discussed until all grades are posted.
Checking the submitted work of all students takes time but should be completed this week.

Comments about HWs 1 - 3

HW 1 and HW 2 grades are posted
HW 3 will be graded this week.

HW 4 is posted and is due next week on Wednesday, 2/28/24.

We will work through part of it together on Thursday
Demo Videos are available.

Recall from Lecture 9:

R functions:

Useful for automating repetitive tasks

Best Practices for Writing Functions

Plan what you want to do
- Write out task steps
Develop and refine the code to complete tasks
Subdivide multi-step tasks into short sets of tasks.
Convert each set of tasks to a function
- Make the function more general so it is more versatile.

Common Function Structure Options

Anatomy of a Function (Option 1):

Commands within function are not saved to an object.
Function results are automatically ‘kicked out’ as output.

Function_Name <- function(input_1, input_2, etc){
   command 1 to do "stuff" to inputs |>
   command 2 to do "stuff" to inputs |>
   command 3 to do "stuff" to inputs |> etc. # output is autmatically "kicked out"
}

Anatomy of a Function (Option 2 - Shown in Week 5 Lecture):

Commands within function are saved to an object.
Function ends with name of object so that results are ‘kicked out’ as output.

Function_Name <- function(input_1, input_2, etc){
   output <- command 1 to do "stuff" to inputs |>
             command 2 to do "stuff" to inputs |>
             command 3 to do "stuff" to inputs |> etc.
   output  # end with name of output so that it is "kicked out" of function
}

Box Office Mojo Data Cleaning Steps

Examine .csv file to determine number of rows to skip
Import raw .csv file and skip header rows above variable names
Select useful columns (Columns 1, 4, 7, and 9):
- 1: Date
- 4: Top 10 Gross- <chr> variable with nuisance characters to be removed ($, ,)
- 7: Releases
- 9: Gross - <chr> variable with nuisance characters to be removed ($, ,)
Rename these four variables to easier names to work with in R
- New names should not have spaces.
- Lower case names with underscores (_) work well in code.
- Variable names and labels can be reformatted for plots and tables.

Box Office Mojo Data Cleaning Steps Cont’d

5 . Remove non-data rows (Holidays, etc.) with a filter command.

Use mutate to convert variables to use-able formats:
- Use paste to add year text to date character variable and convert to a date variable.
- Convert Releases (num_releases) to an integer variable
- Use gsub and as.numeric to convert each gross variable to numeric:
  - gsub is used to remove nuisance characters, $ and ,
  - NOTE: multiple adjacent columns with same nuisance characters can be cleaned at the same time with mutate(across...)
  - as.numeric is used to convert character to numeric decimal value (<dbl>) once nusaince characters are removed.

Cleaning One Data Set

Code

```{r}
#|label: cleaning 1 box office mojo dataset

bom2022 <- read_csv("data/box_office_mojo_2022.csv", skip=11, show_col_types = F) |>
  select(1,4,7,9) |>                                                  # select columns by number (use with care)
  rename("date" = "Date", 
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                                       # filters out empty holiday rows
  mutate(date = dmy(paste(date,2022)),
         num_releases = as.integer(num_releases),
                                                                      # gross variables cleaned one at a time
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),  
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),

         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) |>
  glimpse()
```

Rows: 365
Columns: 4
$ date         <date> 2022-12-31, 2022-12-30, 2022-12-29, 2022-12-28, 2022-12-…
$ top10gross   <dbl> 27962493, 37900613, 31204428, 31866473, 37343124, 4696327…
$ num_releases <int> 31, 32, 28, 28, 28, 29, 28, 28, 29, 30, 31, 31, 30, 31, 3…
$ num1gross    <dbl> 18053159, 24836835, 20117061, 20582014, 24128503, 3227043…

Converting B.O.M. Cleaning Code to a Function

Code

```{r}
#|label: bom_cln_function
bom_cln <- function(data_file, yr, skip_num){   # inputs: data_file is file name
                                                #         yr is year of data
                                                #         skip_num is number of header rows to skip
  
  read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
  select(1,4,7,9) |>                            # columns needed are always the same for bom
  rename("date" = "Date",                       # column renaming always the same
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                # filter out non-data rows
  mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
         num_releases = as.integer(num_releases),                                       
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),
         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) 
}
```

Working with Dates using `lubridate`

lubridate can convert a wide variety of text information to a date.
User must specify order of information, e.g. ymd indicates year, then month, the day
Text must include year, but day is not required (See HW 4)

Code

```{r}
#|label: examples of lubridate

"31st of October, 2040"                                              # text date

dmy("31st of October, 2040")                                         # example of lubridate command dmy

paste("Hello", "Goodbye")                                            # example of paste concatenating two text strings

"March 15"                                                           # example of month and day only
                                                                     # lubridate comands require a year
```

[1] "31st of October, 2040"
[1] "2040-10-31"
[1] "Hello Goodbye"
[1] "March 15"

Lubridate Examples Continued

Code

```{r}
#|label: more lubridate examples

paste("February", 2040)                                        # using paste to add year to month and day

my(paste("February", "2040"))                                  # using lubridate my command with paste

paste("February", "2040") |> my()                              # same command using piping

"March 15th, 2039"                                             # different form of text date (Q1 next slide)

ymd(20390315)                                                  # example of lubridate command ymd 

Sys.Date()                                                     # demo of Sys.date command 
```

[1] "February 2040"
[1] "2040-02-01"
[1] "2040-02-01"
[1] "March 15th, 2039"
[1] "2039-03-15"
[1] "2025-02-17"

Week 6 In-class Exercises - Q1

Session ID: bua455f24

What is the correct lubridate command to convert the following date text to a date value in R.

“March 15th, 2039”

Hint: Examine the Lubridate Cheat Sheet and test out commands in console to see what the output is.

NOTE: lubridate commands will only work if you have run the setup for this lecture to load the tidyverse suite of packages.

Function Format Options

No objects are saved within function and result is kicked out

Ideal for straightforward functions and plot functions

Code

```{r}
#|label: bom_cln function option 1
bom_cln <- function(data_file, yr, skip_num){   
  read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
  select(1,4,7,9) |>                            # columns needed are always the same for these datasets
  rename("date" = "Date",                       # column renaming always the same
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                # filter out non-data rows
  mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
         num_releases = as.integer(num_releases),
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),
         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) 
}

bom2021_Op1 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", 
                       yr = 2021, skip_num=11) # use function
```

Commands within function are saved to an object, d_out

Function ends with d_out so that result gets kicked-out of function

Code

```{r}
#|label: bom_cln function option 2
bom_cln <- function(data_file, yr, skip_num){   
  d_out <- read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here
    select(1,4,7,9) |>                            # columns needed are always the same for these datasets
    rename("date" = "Date",                       # column renaming always the same
           "top10gross" = "Top 10 Gross", 
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |>                # filter out non-data rows
    mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
           top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
             as.numeric(),
           num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
           num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
             as.numeric()) 
  d_out
}
bom2021_Op2 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", 
                       yr = 2021, skip_num=11) # use function
```

This option has 1 more input, out_file, to export clean data using write_csv

Useful for providing clean data to a client or colleague who doesn’t use R

Code

```{r}
#|label: bom_cln function option 3
bom_cln_out <- function(data_file, yr, skip_num, out_file){   
    read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
    select(1,4,7,9) |>                            # columns needed are always the same for these datasets
    rename("date" = "Date",                       # column renaming always the same
           "top10gross" = "Top 10 Gross", 
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |>                # filter out non-data rows
    mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
           top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
             as.numeric(),
           num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
           num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
             as.numeric()) |>
    write_csv(out_file)
}
bom2021_Op3 <- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", 
                           yr = 2021, skip_num=11, 
                           out_file = "data/box_office_mojo_2021_tidy.csv") # use function
```

Week 6 In-class Exercises - Q2

Session ID: bua455f24

When using the ‘Option 3’ Function, what output(s) are created?

A. An exported .csv file

B. A clean tibble dataset saved to the Global Environment

C. Both an exported .csv file and a clean tibble dataset in R

More Comments on Functions

Even if a function works for one situation, it may not work for every situation.
Published R functions are very long because they are tested and edited and retested to account for most situations.
For example, you can examine the text for a function we have used
- In the console, type read_csv without the parentheses and push Enter.
- You can see the text of the function which is complex.

Using our Function

We will run the Option 3 version, which cleans the data AND exports a cleaned (tidied) dataset.

Note that there will be one error for 2020 due to the leap day (2/29/2020).

Code

```{r}
#|label: testing our function

bom2018<- bom_cln_out(data_file = "data/box_office_mojo_2018.csv", yr = 2018, skip_num=11, 
                       out_file = "data/box_office_mojo_2018_tidy.csv")

bom2019<- bom_cln_out(data_file = "data/box_office_mojo_2019.csv", yr = 2019, skip_num=11, 
                       out_file = "data/box_office_mojo_2019_tidy.csv")

bom2020<- bom_cln_out(data_file = "data/box_office_mojo_2020.csv", yr = 2020, skip_num=11, 
                       out_file = "data/box_office_mojo_2020_tidy.csv")
```

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste(date, yr))`.
Caused by warning:
!  1 failed to parse.

Code

```{r}
bom2021<- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11, 
                       out_file = "data/box_office_mojo_2021_tidy.csv")

bom2022<- bom_cln_out(data_file = "data/box_office_mojo_2022.csv", yr = 2022, skip_num=11, 
                       out_file = "data/box_office_mojo_2022_tidy.csv")

# fix NA due to leap day manually
# not easily fixed within function
bom2020$date[is.na(bom2020$date)] <- dmy(29022020)
write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv")
```

Join datasets vertically (stacking)

If you have multiple identical datasets with the same columns, you can ‘stack’ them using the bind_rows command.
Recommended: Import and format datasets to all have same column names so that this is seamless.
- Functions are useful for creating identically formatted datasets.

Code

```{r}
#|label: create one 5 year BOMdataset

bom_2018_2022 <- bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets

# Modify gross variables (could also be done in function)
bom_2018_2022 <- bom_2018_2022 |>
  mutate(top10grossM = (top10gross/1000000) |> round(2), 
         num1grossM = (num1gross/1000000) |> round(2)) |>
  filter(date <= "2022-12-30") |>
  glimpse()
```

Rows: 1,819
Columns: 6
$ date         <date> 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, 2018-12-…
$ top10gross   <dbl> 36240441, 50932176, 58118460, 56667767, 51671321, 5557976…
$ num_releases <int> 53, 51, 51, 51, 53, 52, 53, 49, 53, 52, 53, 60, 63, 59, 6…
$ num1gross    <dbl> 10011638, 16440551, 18632907, 17041113, 14622228, 1690351…
$ top10grossM  <dbl> 36.24, 50.93, 58.12, 56.67, 51.67, 55.58, 70.90, 31.75, 4…
$ num1grossM   <dbl> 10.01, 16.44, 18.63, 17.04, 14.62, 16.90, 21.98, 10.85, 1…

HW 4 Preview

We will create a time series line plot from multiple years of data in class

In HW 4 you will use data from the Bureau of Labor Statistics that is easier to clean

Much of this code is review
Useful plot formatting code:
- Move legend to bottom: theme(legend.position="bottom")
- Format scale of dates: scale_x_date(date_breaks = "3 months", date_labels = "%b")
  - HW 4 date format: scale_x_date(date_breaks = "year", date_labels = "%Y)"
- Format all plot labels using a labs command

Reshape data and create a factor variable with plot label text (Similar to Hw 4 - Part 4)

Code

```{r modify data for lineplot}
bom_plot_data <- bom_2018_2022 |>
  select(date, top10grossM, num1grossM) |>
  pivot_longer(cols=top10grossM:num1grossM, names_to = "type", values_to="grossM") |>
  mutate(type=factor(type, 
                     levels=c("top10grossM", "num1grossM"), 
                     labels=c("Top 10", "No. 1"))) 
head(bom_plot_data)
```

# A tibble: 6 × 3
  date       type   grossM
  <date>     <fct>   <dbl>
1 2018-12-31 Top 10   36.2
2 2018-12-31 No. 1    10.0
3 2018-12-30 Top 10   50.9
4 2018-12-30 No. 1    16.4
5 2018-12-29 Top 10   58.1
6 2018-12-29 No. 1    18.6

Code

```{r fig.dim=c(15,5)}
#|label: basic line plot
(bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type)) + 
  theme_classic())
```

Code

```{r fig.dim=c(15,5)}
#|label: modify legend and lines
(bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type), linewidth=0.75) + 
  theme_classic() +
  theme(legend.position="bottom") + 
  scale_color_manual(values=c("lightblue","blue")))
```

Code

```{r fig.dim=c(15,5)}
#|label: modify labels and axis breaks
(bom_ln_plot  <- bom_ln_plot +
  scale_x_date(date_breaks = "3 months",         # OG code for date_labels
                 date_labels = "%b") +
  labs(x="Date", y = "Gross ($Mill)", color="",  # modify plot labels
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2018 - Dec. 30, 2022",
       caption="Data Source:www.boxoffice.mojo.com"))
```

Code

```{r fig.dim=c(15,5)}
#|label: modify text and borders
(bom_ln_plot  <- bom_ln_plot + 
     theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        panel.border = element_rect(colour = "lightgrey", fill=NA, size=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, size=2)))
```

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Similar but more complex than complete plot code for HW 4.

Additional formatting is required for slides that is not needed in a HTML display.

HW 4 plot code will end with labs command

theme options may be useful for group project.

Code

```{r complete code}
bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type), size=0.75) + 
  theme_classic() +
  theme(legend.position="bottom") +                               # position legend at bootom
  scale_color_manual(values=c("lightblue","blue")) +              # modify line colors
  scale_x_date(date_breaks = "3 months",  date_labels = "%b") +   # OG code for date_labels
  labs(x="Date", y = "Gross ($Mill)", color="",                   # modify plot labels
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2018 - Dec. 30, 2022",
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(plot.title = element_text(size = 20),                     # text size and borders
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        panel.border = element_rect(colour = "lightgrey", fill=NA, size=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))
```

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Bureau of Labor Statistics (BLS) Data

Bureau of Labor Statistics is an excellent resource for economic data.
Many (not all) BLS datasets have a similar format.
In HW 4 you will:
- create a function from code used to clean the monthly unemployment Rate data.
- use this function to import and clean two data sets:
  - Export Price Index and Import Price Index.
- join these two data-sets using a full_join and create a date variable.
- modify the dataset for a plot and follow instructions to format plot.
Next lecture(s): We will adapt this function to examine and plot labor force variables.
Recall: Before importing ANY data, ALWAYS examine the .csv file.

HW 4 - Bureau of Labor Statistics Data

Bureau of Labor Statistics is a great source for economic data.

In HW 4, we will look at three data sets:

Unemployment Rate, the percentage of people unemployed (bls_unemp_rate.csv)
Import Price Index (bls_import_index.csv)
Export Price Index (bls_export_index.csv)

The Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world.

Documentation on Import and Export Price Indixes
Additional Documention provided by Investopedia
Before going further, examine these three .csv files to determine number of header rows to skip when cleaning the data.

Introduction to Joins

Earlier we discussed stacking datasets vertically with bind_rows.

We used date to stack the data in the correct order.

Alternatively, we can join datasets horizontally (side-by-side)

This in only possible if the data sets have variables in common with matching values.
We will cover four types of common joins in this course.

The dplyr Cheat Sheet is a comprehensive reference.

Types of Joins

Datasets are joined two at a time, then a third can be joined to first two.
In BUA 455, we will cover four types of joins (but others exist):
- left_join: include all observations in FIRST dataset in join command (dataset on left)
- right_join: include all observations in SECOND dataset in join command (dataset on right)
- inner_join: only include observations that are in BOTH datasets
- full_join: include all observations that are in EITHER dataset
All of these join commands are useful, but left_join and right_join are redundant.
I usually (not always) prefer using full_join to help reveal data errors or discrepancies between multiple datasets.

A note about R colors

There are 657 Colors in R
We’ll talk more about color options and palettes, but it’s good to know they exist.
We don’t emphasize data visualization theory in this course, BUT managing data entails visualizations
- The goal is to manage and make sense of the data for a client, colleague, or supervisor.
Choosing colors, symbols, size of text, size of lines, etc. wisely is an important part of data management and data curation.
As we transition to discussing the group projects (next Week), we will talk more about how to communicate with data effectively.
- Color choices will be part of that discussion.
In HW 4 and today’s demo, I manually choose colors to provide examples.

Outline of HW 4

First Steps
- Create R project for HW 4.
- Create data and img folders.
- Save HW4 Quarto (.qmd) Template to project folder.
- Save data files to data folder within project.
Part 1
- Create bls_tidy function from code used to import and clean bls_unemp_rate.csv
- Run the function and then use it to import and clean bls_unemp_rate.csv
Part 2
- Use bls_tidy function to
  - import and clean bls_import_index.csv and save it as import_index
  - import and clean bls_export_index.csv and save it as export_index

Outline of HW 4 Cont’d

Part 3
- Use full full_join to join import_index and export_index into one dataset, export_import.
- Use correct lubridate command and paste command to create a date variable.
Part 4
- Reshape export_import dataset using pivot_longer to create exp_imp_plt
  - exp_imp_plt will be used to plot the data.
- Type of index, type will be converted to a factor with levels and labels for the plot.

Outline of HW 4 Cont’d

Part 5
- Create a formatted line plot with:
  - one line for the Export Price Index (light blue line).
  - one line for Import Price Index (blue line)
Final Steps
- Answer all Blackboard Questions.
- Knit Quarto (.qmd) file.
- Create README FOR HW 4.
- Save all files correctly within HW 4 R Project folder.
- Zip and submit HW 4 R project folder.

Example of HW 4 Final Plot

NOTES: - Final HW 4 plot should be almost identical to the one below EXCEPT for second line of caption.

Slides version has more formatting. See version in .html lecture notes and HW 4 Instructions.

Next Lectures

More Examples from BLS
line and area plots,
Interactive time series plots
- hcharts and dygraphs
Converting between data formats (xts and tibble)
Importing stock data from Yahoo Finance to R
Introducing the Group Project

Key Points from This Week

Developing Functions:

Develop code, convert to a function, test it

Vertical and Horizontal Data Joins

Stack datasets vertically using bind_rows
Join datasets horizontally by matching variables and values (HW 4: full_join)

Working with date variables using lubridate

Command(s) to create date variables
using paste to add year to text

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.

--- title: "Week 6" subtitle: "Reshaping. Cleaning, and Joining Data" author: "Penelope Pooler Eisenbies" date: last-modified lightbox: true toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r include=F} #|label: setup knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks options(scipen=100) # suppress scientific notation # install pacman if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") pacman::p_load(pacman, tidyverse, gridExtra, magrittr, kableExtra) # install and load required packages p_loaded() # verify loaded packages ``` **Comments About Quiz 1** - Grading is in progress, but can not be discussed until all grades are posted. - Checking the submitted work of all students takes time but should be completed this week. ::: fragment **Comments about HWs 1 - 3** ::: - HW 1 and HW 2 grades are posted - HW 3 will be graded this week. ::: fragment **HW 4 is posted and is due next week on Wednesday, 2/28/24.** ::: - We will work through part of it together on Thursday - Demo Videos are available. ## Recall from Lecture 9: :::::::: columns ::::: {.column width="48%"} ::: fragment #### R functions: ::: - Useful for automating repetitive tasks ::: fragment #### Best Practices for Writing Functions ::: - Plan what you want to do - Write out task steps - Develop and refine the code to complete tasks - Subdivide multi-step tasks into short sets of tasks. - Convert each set of tasks to a function - Make the function more general so it is more versatile. ::::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/owl.png) ::: :::::::: ## ### Common Function Structure Options ::: fragment #### Anatomy of a Function (Option 1): ::: - Commands within function **are not saved** to an object. - Function results are automatically 'kicked out' as output. ::: fragment ``` Function_Name <- function(input_1, input_2, etc){ command 1 to do "stuff" to inputs |> command 2 to do "stuff" to inputs |> command 3 to do "stuff" to inputs |> etc. # output is autmatically "kicked out" } ``` ::: ::: fragment #### Anatomy of a Function (Option 2 - Shown in Week 5 Lecture): ::: - Commands within function **are saved** to an object. - Function ends with name of object so that results are 'kicked out' as output. ::: fragment ``` Function_Name <- function(input_1, input_2, etc){ output <- command 1 to do "stuff" to inputs |> command 2 to do "stuff" to inputs |> command 3 to do "stuff" to inputs |> etc. output # end with name of output so that it is "kicked out" of function } ``` ::: ## ### Box Office Mojo Data Cleaning Steps 1. Examine .csv file to determine number of rows to skip 2. Import raw .csv file and skip header rows above variable names 3. Select useful columns (Columns 1, 4, 7, and 9): - 1: `Date` - 4: `Top 10 Gross`- `<chr>` variable with nuisance characters to be removed (`$`, `,`) - 7: `Releases` - 9: `Gross` - `<chr>` variable with nuisance characters to be removed (`$`, `,`) 4. Rename these four variables to easier names to work with in R - New names should not have spaces. - Lower case names with underscores (`_`) work well in code. - Variable names and labels can be reformatted for plots and tables. ## ### Box Office Mojo Data Cleaning Steps Cont'd 5 . Remove non-data rows (Holidays, etc.) with a `filter` command. 6. Use `mutate` to convert variables to use-able formats: - Use `paste` to add year text to `date` character variable and convert to a date variable. - Convert `Releases` (`num_releases`) to an integer variable - Use `gsub` and `as.numeric` to convert each gross variable to numeric: - `gsub` is used to remove nuisance characters, `$` and `,` - NOTE: multiple adjacent columns with same nuisance characters can be cleaned at the same time with `mutate(across...)` - `as.numeric` is used to convert character to numeric decimal value (`<dbl>`) once nusaince characters are removed. ## Cleaning One Data Set ```{r} #|label: cleaning 1 box office mojo dataset bom2022 <- read_csv("data/box_office_mojo_2022.csv", skip=11, show_col_types = F) |> select(1,4,7,9) |> # select columns by number (use with care) rename("date" = "Date", "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filters out empty holiday rows mutate(date = dmy(paste(date,2022)), num_releases = as.integer(num_releases), # gross variables cleaned one at a time top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) |> glimpse() ``` ## ### Converting B.O.M. Cleaning Code to a Function ```{r} #|label: bom_cln_function bom_cln <- function(data_file, yr, skip_num){ # inputs: data_file is file name # yr is year of data # skip_num is number of header rows to skip read_csv(data_file, show_col_types = F, skip=skip_num) |> # data_file and skip_num used here select(1,4,7,9) |> # columns needed are always the same for bom rename("date" = "Date", # column renaming always the same "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filter out non-data rows mutate(date = dmy(paste(date,yr)), # paste yr input to date text and convert to date num_releases = as.integer(num_releases), top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) } ``` ## Working with Dates using `lubridate` - `lubridate` can convert a wide variety of text information to a date. - User must specify order of information, e.g. `ymd` indicates year, then month, the day - Text must include year, but day is not required (See HW 4) :::: fragment ::: r-fit-text ```{r} #|label: examples of lubridate "31st of October, 2040" # text date dmy("31st of October, 2040") # example of lubridate command dmy paste("Hello", "Goodbye") # example of paste concatenating two text strings "March 15" # example of month and day only # lubridate comands require a year ``` ::: :::: ## Lubridate Examples Continued ::: r-fit-text ```{r} #|label: more lubridate examples paste("February", 2040) # using paste to add year to month and day my(paste("February", "2040")) # using lubridate my command with paste paste("February", "2040") |> my() # same command using piping "March 15th, 2039" # different form of text date (Q1 next slide) ymd(20390315) # example of lubridate command ymd Sys.Date() # demo of Sys.date command ``` ::: ## Week 6 In-class Exercises - Q1 :::::: columns ::: {.column width="48%"} ***Session ID: bua455f24*** What is the correct `lubridate` command to convert the following date text to a date value in R. **"March 15th, 2039"** <br> **Hint:** Examine the [Lubridate Cheat Sheet](https://rstudio.github.io/cheatsheets/lubridate.pdf) and test out commands in console to see what the output is. **NOTE:** `lubridate` commands will only work if you have run the setup for this lecture to load the `tidyverse` suite of packages. ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![Lubridate Cheat Sheet Screenshot](img/lubridate_examples.png) ::: :::::: ## Function Format Options ::: panel-tabset ### [Option 1]{style="color:blue;"} No objects are saved within function and result is `kicked out` Ideal for straightforward functions and plot functions ```{r} #|label: bom_cln function option 1 bom_cln <- function(data_file, yr, skip_num){ read_csv(data_file, show_col_types = F, skip=skip_num) |> # data_file and skip_num used here select(1,4,7,9) |> # columns needed are always the same for these datasets rename("date" = "Date", # column renaming always the same "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filter out non-data rows mutate(date = dmy(paste(date,yr)), # paste yr input to date text and convert to date num_releases = as.integer(num_releases), top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) } bom2021_Op1 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11) # use function ``` ### [Option 2]{style="color:blue;"} Commands within function are saved to an object, `d_out` Function ends with `d_out` so that result gets `kicked-out` of function ```{r} #|label: bom_cln function option 2 bom_cln <- function(data_file, yr, skip_num){ d_out <- read_csv(data_file, show_col_types = F, skip=skip_num) |> # data_file and skip_num used here select(1,4,7,9) |> # columns needed are always the same for these datasets rename("date" = "Date", # column renaming always the same "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filter out non-data rows mutate(date = dmy(paste(date,yr)), # paste yr input to date text and convert to date num_releases = as.integer(num_releases), top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) d_out } bom2021_Op2 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11) # use function ``` ### [Option 3]{style="color:blue;"} This option has 1 more input, `out_file`, to export clean data using `write_csv` Useful for providing `clean` data to a client or colleague who doesn't use R ```{r} #|label: bom_cln function option 3 bom_cln_out <- function(data_file, yr, skip_num, out_file){ read_csv(data_file, show_col_types = F, skip=skip_num) |> # data_file and skip_num used here select(1,4,7,9) |> # columns needed are always the same for these datasets rename("date" = "Date", # column renaming always the same "top10gross" = "Top 10 Gross", "num_releases" = "Releases", "num1gross" = "Gross") |> filter(!is.na(top10gross)) |> # filter out non-data rows mutate(date = dmy(paste(date,yr)), # paste yr input to date text and convert to date num_releases = as.integer(num_releases), top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T), top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |> as.numeric(), num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T), num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |> as.numeric()) |> write_csv(out_file) } bom2021_Op3 <- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11, out_file = "data/box_office_mojo_2021_tidy.csv") # use function ``` ::: ## Week 6 In-class Exercises - Q2 :::::: columns ::: {.column width="48%"} ***Session ID: bua455f24*** When using the 'Option 3' Function, what output(s) are created? A. An exported .csv file B. A clean tibble dataset saved to the Global Environment C. Both an exported .csv file and a clean tibble dataset in R ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/beaver.png) ::: :::::: ## More Comments on Functions - Even if a function works for one situation, it may not work for every situation. - Published R functions are very long because they are tested and edited and retested to account for most situations. - For example, you can examine the text for a function we have used - In the console, type `read_csv` without the parentheses and push `Enter`. - You can see the text of the function which is complex. ## Using our Function We will run the Option 3 version, which cleans the data AND exports a cleaned (tidied) dataset. Note that there will be one error for 2020 due to the leap day (2/29/2020). ```{r} #|label: testing our function bom2018<- bom_cln_out(data_file = "data/box_office_mojo_2018.csv", yr = 2018, skip_num=11, out_file = "data/box_office_mojo_2018_tidy.csv") bom2019<- bom_cln_out(data_file = "data/box_office_mojo_2019.csv", yr = 2019, skip_num=11, out_file = "data/box_office_mojo_2019_tidy.csv") bom2020<- bom_cln_out(data_file = "data/box_office_mojo_2020.csv", yr = 2020, skip_num=11, out_file = "data/box_office_mojo_2020_tidy.csv") bom2021<- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11, out_file = "data/box_office_mojo_2021_tidy.csv") bom2022<- bom_cln_out(data_file = "data/box_office_mojo_2022.csv", yr = 2022, skip_num=11, out_file = "data/box_office_mojo_2022_tidy.csv") # fix NA due to leap day manually # not easily fixed within function bom2020$date[is.na(bom2020$date)] <- dmy(29022020) write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv") ``` ## Join datasets vertically (stacking) - If you have multiple identical datasets with the same columns, you can 'stack' them using the `bind_rows` command. - **Recommended:** Import and format datasets to all have same column names so that this is seamless. - Functions are useful for creating identically formatted datasets. ::: fragment ```{r} #|label: create one 5 year BOMdataset bom_2018_2022 <- bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets # Modify gross variables (could also be done in function) bom_2018_2022 <- bom_2018_2022 |> mutate(top10grossM = (top10gross/1000000) |> round(2), num1grossM = (num1gross/1000000) |> round(2)) |> filter(date <= "2022-12-30") |> glimpse() ``` ::: ## HW 4 Preview We will create a time series line plot from multiple years of data in class In HW 4 you will use data from the [Bureau of Labor Statistics](https://www.bls.gov/) that is easier to clean - Much of this code is review - Useful plot formatting code: - Move legend to bottom: `theme(legend.position="bottom")` - Format scale of dates: `scale_x_date(date_breaks = "3 months", date_labels = "%b")` - HW 4 date format: `scale_x_date(date_breaks = "year", date_labels = "%Y)"` - Format all plot labels using a `labs` command ## :::: panel-tabset ### [Modify Data]{style="color:blue;"} Reshape data and create a factor variable with plot label text (Similar to Hw 4 - Part 4) ::: r-fit-text ```{r modify data for lineplot} bom_plot_data <- bom_2018_2022 |> select(date, top10grossM, num1grossM) |> pivot_longer(cols=top10grossM:num1grossM, names_to = "type", values_to="grossM") |> mutate(type=factor(type, levels=c("top10grossM", "num1grossM"), labels=c("Top 10", "No. 1"))) head(bom_plot_data) ``` ::: ### [Basic Plot]{style="color:blue;"} ```{r fig.dim=c(15,5)} #|label: basic line plot (bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type)) + theme_classic()) ``` ### [Legend & Lines]{style="color:blue;"} ```{r fig.dim=c(15,5)} #|label: modify legend and lines (bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type), linewidth=0.75) + theme_classic() + theme(legend.position="bottom") + scale_color_manual(values=c("lightblue","blue"))) ``` ### [X-axis & Labels]{style="color:blue;"} ```{r fig.dim=c(15,5)} #|label: modify labels and axis breaks (bom_ln_plot <- bom_ln_plot + scale_x_date(date_breaks = "3 months", # OG code for date_labels date_labels = "%b") + labs(x="Date", y = "Gross ($Mill)", color="", # modify plot labels title="Top 10 and No. 1 Movie Gross by Date", subtitle="Jan. 1, 2018 - Dec. 30, 2022", caption="Data Source:www.boxoffice.mojo.com")) ``` ### [Text & Border]{style="color:blue;"} ```{r fig.dim=c(15,5)} #|label: modify text and borders (bom_ln_plot <- bom_ln_plot + theme(plot.title = element_text(size = 20), plot.subtitle = element_text(size = 15), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), panel.border = element_rect(colour = "lightgrey", fill=NA, size=2), plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))) ``` :::: ## ::: panel-tabset ### [Complete Plot Code]{style="color:blue;"} Similar but more complex than complete plot code for HW 4. Additional formatting is required for slides that is not needed in a HTML display. HW 4 plot code will end with `labs` command `theme` options may be useful for group project. ```{r complete code} bom_ln_plot <- bom_plot_data |> ggplot() + geom_line(aes(x=date, y=grossM, color=type), size=0.75) + theme_classic() + theme(legend.position="bottom") + # position legend at bootom scale_color_manual(values=c("lightblue","blue")) + # modify line colors scale_x_date(date_breaks = "3 months", date_labels = "%b") + # OG code for date_labels labs(x="Date", y = "Gross ($Mill)", color="", # modify plot labels title="Top 10 and No. 1 Movie Gross by Date", subtitle="Jan. 1, 2018 - Dec. 30, 2022", caption="Data Source:www.boxoffice.mojo.com") + theme(plot.title = element_text(size = 20), # text size and borders plot.subtitle = element_text(size = 15), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), panel.border = element_rect(colour = "lightgrey", fill=NA, size=2), plot.background = element_rect(colour = "darkgrey", fill=NA, size=2)) ``` ### [Final Formatted Plot]{style="color:blue;"} ```{r final plot, echo = F, fig.dim=c(16,6), fig.align='center'} bom_ln_plot ``` ::: ## Bureau of Labor Statistics (BLS) Data - Bureau of Labor Statistics is an excellent resource for economic data. - Many (not all) BLS datasets have a similar format. - In HW 4 you will: - create a function from code used to clean the monthly unemployment Rate data. - use this function to import and clean two data sets: - **Export Price Index** and **Import Price Index**. - join these two data-sets using a `full_join` and create a `date` variable. - modify the dataset for a plot and follow instructions to format plot. - Next lecture(s): We will adapt this function to examine and plot labor force variables. - Recall: Before importing ANY data, **ALWAYS** examine the `.csv` file. ## HW 4 - Bureau of Labor Statistics Data [Bureau of Labor Statistics](https://www.bls.gov/) is a great source for economic data. In HW 4, we will look at three data sets: - Unemployment Rate, the percentage of people unemployed (`bls_unemp_rate.csv`) - Import Price Index (`bls_import_index.csv`) - Export Price Index (`bls_export_index.csv`) ::: fragment The Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world. ::: - [Documentation on Import and Export Price Indixes](https://www.bls.gov/mxp/) - [Additional Documention provided by Investopedia](https://www.investopedia.com/terms/i/import-export-prices.asp) - Before going further, examine these three .csv files to determine number of header rows to skip when cleaning the data. ## Introduction to Joins ::::::: columns :::: {.column width="48%"} Earlier we discussed stacking datasets vertically with `bind_rows`. We used `date` to stack the data in the correct order. Alternatively, we can join datasets horizontally (side-by-side) - This in only possible if the data sets have variables in common with matching values. - We will cover four types of common joins in this course. ::: fragment The [`dplyr` Cheat Sheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf) is a comprehensive reference. ::: :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/join_cheat_sheet_screenshot.png) ::: ::::::: ## Types of Joins - Datasets are joined two at a time, then a third can be joined to first two. - In BUA 455, we will cover four types of joins (but others exist): - `left_join`: include all observations in **FIRST** dataset in join command (dataset on left) - `right_join`: include all observations in **SECOND** dataset in join command (dataset on right) - `inner_join`: only include observations that are in **BOTH** datasets - `full_join`: include all observations that are in **EITHER** dataset - All of these join commands are useful, but `left_join` and `right_join` are redundant. - I usually (not always) prefer using `full_join` to help reveal data errors or discrepancies between multiple datasets. ## A note about R colors - There are [657 Colors in R](https://docs.google.com/spreadsheets/d/1m-UfG96DB4ZAOntVyzcjuYrX564AdLGB2f-MA6rvc2s/edit?usp=sharing) - We'll talk more about color options and palettes, but it's good to know they exist. - We don't emphasize data visualization theory in this course, BUT managing data entails visualizations - The goal is to manage and make sense of the data for a client, colleague, or supervisor. - Choosing colors, symbols, size of text, size of lines, etc. wisely is an important part of data management and data curation. - As we transition to discussing the group projects (next Week), we will talk more about how to communicate with data effectively. - Color choices will be part of that discussion. - In HW 4 and today's demo, I manually choose colors to provide examples. ## Outline of HW 4 - First Steps - Create R project for HW 4. - Create `data` and `img` folders. - Save HW4 Quarto (`.qmd`) Template to project folder. - Save data files to `data` folder within project. - Part 1 - Create `bls_tidy` function from code used to import and clean `bls_unemp_rate.csv` - Run the function and then use it to import and clean `bls_unemp_rate.csv` - Part 2 - Use `bls_tidy` function to - import and clean `bls_import_index.csv` and save it as `import_index` - import and clean `bls_export_index.csv` and save it as `export_index` ## Outline of HW 4 Cont'd - Part 3 - Use full `full_join` to join `import_index` and `export_index` into one dataset, `export_import`. - Use correct `lubridate` command and `paste` command to create a date variable. - Part 4 - Reshape `export_import` dataset using `pivot_longer` to create `exp_imp_plt` - `exp_imp_plt` will be used to plot the data. - Type of index, `type` will be converted to a `factor` with levels and labels for the plot. ## Outline of HW 4 Cont'd - Part 5 - Create a formatted line plot with: - one line for the Export Price Index (light blue line). - one line for Import Price Index (blue line) - Final Steps - Answer all Blackboard Questions. - Knit Quarto (.qmd) file. - Create README FOR HW 4. - Save all files correctly within HW 4 R Project folder. - Zip and submit HW 4 R project folder. ## Example of HW 4 Final Plot **NOTES:** - Final HW 4 plot should be almost identical to the one below **EXCEPT** for second line of caption. - Slides version has more formatting. See version in `.html` lecture notes and HW 4 Instructions. ::: fragment ![](img/LinePlot_HW4_Penelope_Pooler_slides.png){fig-align="center" height="4.5in"} ::: ## Next Lectures :::::: columns ::: {.column width="48%"} - More Examples from [BLS](https://www.bls.gov/) - line and area plots, - Interactive time series plots - hcharts and dygraphs - Converting between data formats (`xts` and `tibble`) - Importing stock data from Yahoo Finance to R - **Introducing the Group Project** ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/owl.png){fig-align="center"} ::: :::::: ## ### Key Points from This Week ::: fragment **Developing Functions:** ::: - Develop code, convert to a function, test it ::: fragment **Vertical and Horizontal Data Joins** ::: - Stack datasets vertically using `bind_rows` - Join datasets horizontally by matching variables and values (**HW 4:** `full_join`) ::: fragment **Working with date variables using `lubridate`** ::: - Command(s) to create date variables - using `paste` to add year to text ::: fragment You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions are required during the semester.** :::

Housekeeping

Recall from Lecture 9:

R functions:

Best Practices for Writing Functions

Common Function Structure Options

Anatomy of a Function (Option 1):

Anatomy of a Function (Option 2 - Shown in Week 5 Lecture):

Box Office Mojo Data Cleaning Steps

Box Office Mojo Data Cleaning Steps Cont’d

Cleaning One Data Set

Converting B.O.M. Cleaning Code to a Function

Working with Dates using lubridate

Lubridate Examples Continued

Week 6 In-class Exercises - Q1

Function Format Options

Week 6 In-class Exercises - Q2

More Comments on Functions

Using our Function

Join datasets vertically (stacking)

HW 4 Preview

Bureau of Labor Statistics (BLS) Data

HW 4 - Bureau of Labor Statistics Data

Introduction to Joins

Types of Joins

A note about R colors

Outline of HW 4

Outline of HW 4 Cont’d

Outline of HW 4 Cont’d

Example of HW 4 Final Plot

Next Lectures

Key Points from This Week

Working with Dates using `lubridate`