Reshaping. Cleaning, and Joining Data

Author

Penelope Pooler Eisenbies

Published

February 17, 2025

Housekeeping

Comments About Quiz 1

  • Grading is in progress, but can not be discussed until all grades are posted.

  • Checking the submitted work of all students takes time but should be completed this week.

Comments about HWs 1 - 3

  • HW 1 and HW 2 grades are posted

  • HW 3 will be graded this week.

HW 4 is posted and is due next week on Wednesday, 2/28/24.

  • We will work through part of it together on Thursday

  • Demo Videos are available.

Recall from Lecture 9:

R functions:

  • Useful for automating repetitive tasks

Best Practices for Writing Functions

  • Plan what you want to do

    • Write out task steps
  • Develop and refine the code to complete tasks

  • Subdivide multi-step tasks into short sets of tasks.

  • Convert each set of tasks to a function

    • Make the function more general so it is more versatile.

Common Function Structure Options

Anatomy of a Function (Option 1):

  • Commands within function are not saved to an object.
  • Function results are automatically ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
   command 1 to do "stuff" to inputs |>
   command 2 to do "stuff" to inputs |>
   command 3 to do "stuff" to inputs |> etc. # output is autmatically "kicked out"
}

Anatomy of a Function (Option 2 - Shown in Week 5 Lecture):

  • Commands within function are saved to an object.
  • Function ends with name of object so that results are ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
   output <- command 1 to do "stuff" to inputs |>
             command 2 to do "stuff" to inputs |>
             command 3 to do "stuff" to inputs |> etc.
   output  # end with name of output so that it is "kicked out" of function
}

Box Office Mojo Data Cleaning Steps

  1. Examine .csv file to determine number of rows to skip

  2. Import raw .csv file and skip header rows above variable names

  3. Select useful columns (Columns 1, 4, 7, and 9):

    • 1: Date
    • 4: Top 10 Gross- <chr> variable with nuisance characters to be removed ($, ,)
    • 7: Releases
    • 9: Gross - <chr> variable with nuisance characters to be removed ($, ,)
  4. Rename these four variables to easier names to work with in R

    • New names should not have spaces.
    • Lower case names with underscores (_) work well in code.
    • Variable names and labels can be reformatted for plots and tables.

Box Office Mojo Data Cleaning Steps Cont’d

5 . Remove non-data rows (Holidays, etc.) with a filter command.

  1. Use mutate to convert variables to use-able formats:

    • Use paste to add year text to date character variable and convert to a date variable.

    • Convert Releases (num_releases) to an integer variable

    • Use gsub and as.numeric to convert each gross variable to numeric:

      • gsub is used to remove nuisance characters, $ and ,

      • NOTE: multiple adjacent columns with same nuisance characters can be cleaned at the same time with mutate(across...)

      • as.numeric is used to convert character to numeric decimal value (<dbl>) once nusaince characters are removed.

Cleaning One Data Set

Code
```{r}
#|label: cleaning 1 box office mojo dataset

bom2022 <- read_csv("data/box_office_mojo_2022.csv", skip=11, show_col_types = F) |>
  select(1,4,7,9) |>                                                  # select columns by number (use with care)
  rename("date" = "Date", 
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                                       # filters out empty holiday rows
  mutate(date = dmy(paste(date,2022)),
         num_releases = as.integer(num_releases),
                                                                      # gross variables cleaned one at a time
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),  
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),

         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) |>
  glimpse()
```
Rows: 365
Columns: 4
$ date         <date> 2022-12-31, 2022-12-30, 2022-12-29, 2022-12-28, 2022-12-…
$ top10gross   <dbl> 27962493, 37900613, 31204428, 31866473, 37343124, 4696327…
$ num_releases <int> 31, 32, 28, 28, 28, 29, 28, 28, 29, 30, 31, 31, 30, 31, 3…
$ num1gross    <dbl> 18053159, 24836835, 20117061, 20582014, 24128503, 3227043…

Converting B.O.M. Cleaning Code to a Function

Code
```{r}
#|label: bom_cln_function
bom_cln <- function(data_file, yr, skip_num){   # inputs: data_file is file name
                                                #         yr is year of data
                                                #         skip_num is number of header rows to skip
  
  read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
  select(1,4,7,9) |>                            # columns needed are always the same for bom
  rename("date" = "Date",                       # column renaming always the same
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                # filter out non-data rows
  mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
         num_releases = as.integer(num_releases),                                       
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),
         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) 
}
```

Working with Dates using lubridate

  • lubridate can convert a wide variety of text information to a date.

  • User must specify order of information, e.g. ymd indicates year, then month, the day

  • Text must include year, but day is not required (See HW 4)

Code
```{r}
#|label: examples of lubridate

"31st of October, 2040"                                              # text date

dmy("31st of October, 2040")                                         # example of lubridate command dmy

paste("Hello", "Goodbye")                                            # example of paste concatenating two text strings

"March 15"                                                           # example of month and day only
                                                                     # lubridate comands require a year
```
[1] "31st of October, 2040"
[1] "2040-10-31"
[1] "Hello Goodbye"
[1] "March 15"

Lubridate Examples Continued

Code
```{r}
#|label: more lubridate examples

paste("February", 2040)                                        # using paste to add year to month and day

my(paste("February", "2040"))                                  # using lubridate my command with paste

paste("February", "2040") |> my()                              # same command using piping

"March 15th, 2039"                                             # different form of text date (Q1 next slide)

ymd(20390315)                                                  # example of lubridate command ymd 

Sys.Date()                                                     # demo of Sys.date command 
```
[1] "February 2040"
[1] "2040-02-01"
[1] "2040-02-01"
[1] "March 15th, 2039"
[1] "2039-03-15"
[1] "2025-02-17"

Week 6 In-class Exercises - Q1

Session ID: bua455f24

What is the correct lubridate command to convert the following date text to a date value in R.

“March 15th, 2039”


Hint: Examine the Lubridate Cheat Sheet and test out commands in console to see what the output is.

NOTE: lubridate commands will only work if you have run the setup for this lecture to load the tidyverse suite of packages.

Lubridate Cheat Sheet Screenshot

Lubridate Cheat Sheet Screenshot

Function Format Options

No objects are saved within function and result is kicked out

Ideal for straightforward functions and plot functions

Code
```{r}
#|label: bom_cln function option 1
bom_cln <- function(data_file, yr, skip_num){   
  read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
  select(1,4,7,9) |>                            # columns needed are always the same for these datasets
  rename("date" = "Date",                       # column renaming always the same
         "top10gross" = "Top 10 Gross", 
         "num_releases" = "Releases",
         "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |>                # filter out non-data rows
  mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
         num_releases = as.integer(num_releases),
         top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
         top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
           as.numeric(),
         num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
         num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
           as.numeric()) 
}

bom2021_Op1 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", 
                       yr = 2021, skip_num=11) # use function
```

Commands within function are saved to an object, d_out

Function ends with d_out so that result gets kicked-out of function

Code
```{r}
#|label: bom_cln function option 2
bom_cln <- function(data_file, yr, skip_num){   
  d_out <- read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here
    select(1,4,7,9) |>                            # columns needed are always the same for these datasets
    rename("date" = "Date",                       # column renaming always the same
           "top10gross" = "Top 10 Gross", 
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |>                # filter out non-data rows
    mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
           top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
             as.numeric(),
           num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
           num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
             as.numeric()) 
  d_out
}
bom2021_Op2 <- bom_cln(data_file = "data/box_office_mojo_2021.csv", 
                       yr = 2021, skip_num=11) # use function
```

This option has 1 more input, out_file, to export clean data using write_csv

Useful for providing clean data to a client or colleague who doesn’t use R

Code
```{r}
#|label: bom_cln function option 3
bom_cln_out <- function(data_file, yr, skip_num, out_file){   
    read_csv(data_file, show_col_types = F, skip=skip_num) |>   # data_file and skip_num used here 
    select(1,4,7,9) |>                            # columns needed are always the same for these datasets
    rename("date" = "Date",                       # column renaming always the same
           "top10gross" = "Top 10 Gross", 
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |>                # filter out non-data rows
    mutate(date = dmy(paste(date,yr)),           # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
           top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
             as.numeric(),
           num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
           num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
             as.numeric()) |>
    write_csv(out_file)
}
bom2021_Op3 <- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", 
                           yr = 2021, skip_num=11, 
                           out_file = "data/box_office_mojo_2021_tidy.csv") # use function
```

Week 6 In-class Exercises - Q2

Session ID: bua455f24

When using the ‘Option 3’ Function, what output(s) are created?

A. An exported .csv file

B. A clean tibble dataset saved to the Global Environment

C. Both an exported .csv file and a clean tibble dataset in R

More Comments on Functions

  • Even if a function works for one situation, it may not work for every situation.

  • Published R functions are very long because they are tested and edited and retested to account for most situations.

  • For example, you can examine the text for a function we have used

    • In the console, type read_csv without the parentheses and push Enter.

    • You can see the text of the function which is complex.

Using our Function

We will run the Option 3 version, which cleans the data AND exports a cleaned (tidied) dataset.

Note that there will be one error for 2020 due to the leap day (2/29/2020).

Code
```{r}
#|label: testing our function

bom2018<- bom_cln_out(data_file = "data/box_office_mojo_2018.csv", yr = 2018, skip_num=11, 
                       out_file = "data/box_office_mojo_2018_tidy.csv")

bom2019<- bom_cln_out(data_file = "data/box_office_mojo_2019.csv", yr = 2019, skip_num=11, 
                       out_file = "data/box_office_mojo_2019_tidy.csv")

bom2020<- bom_cln_out(data_file = "data/box_office_mojo_2020.csv", yr = 2020, skip_num=11, 
                       out_file = "data/box_office_mojo_2020_tidy.csv")
```
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste(date, yr))`.
Caused by warning:
!  1 failed to parse.
Code
```{r}
bom2021<- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num=11, 
                       out_file = "data/box_office_mojo_2021_tidy.csv")

bom2022<- bom_cln_out(data_file = "data/box_office_mojo_2022.csv", yr = 2022, skip_num=11, 
                       out_file = "data/box_office_mojo_2022_tidy.csv")

# fix NA due to leap day manually
# not easily fixed within function
bom2020$date[is.na(bom2020$date)] <- dmy(29022020)
write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv")
```

Join datasets vertically (stacking)

  • If you have multiple identical datasets with the same columns, you can ‘stack’ them using the bind_rows command.

  • Recommended: Import and format datasets to all have same column names so that this is seamless.

    • Functions are useful for creating identically formatted datasets.
Code
```{r}
#|label: create one 5 year BOMdataset

bom_2018_2022 <- bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets

# Modify gross variables (could also be done in function)
bom_2018_2022 <- bom_2018_2022 |>
  mutate(top10grossM = (top10gross/1000000) |> round(2), 
         num1grossM = (num1gross/1000000) |> round(2)) |>
  filter(date <= "2022-12-30") |>
  glimpse()
```
Rows: 1,819
Columns: 6
$ date         <date> 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, 2018-12-…
$ top10gross   <dbl> 36240441, 50932176, 58118460, 56667767, 51671321, 5557976…
$ num_releases <int> 53, 51, 51, 51, 53, 52, 53, 49, 53, 52, 53, 60, 63, 59, 6…
$ num1gross    <dbl> 10011638, 16440551, 18632907, 17041113, 14622228, 1690351…
$ top10grossM  <dbl> 36.24, 50.93, 58.12, 56.67, 51.67, 55.58, 70.90, 31.75, 4…
$ num1grossM   <dbl> 10.01, 16.44, 18.63, 17.04, 14.62, 16.90, 21.98, 10.85, 1…

HW 4 Preview

We will create a time series line plot from multiple years of data in class

In HW 4 you will use data from the Bureau of Labor Statistics that is easier to clean

  • Much of this code is review

  • Useful plot formatting code:

    • Move legend to bottom: theme(legend.position="bottom")

    • Format scale of dates: scale_x_date(date_breaks = "3 months", date_labels = "%b")

      • HW 4 date format: scale_x_date(date_breaks = "year", date_labels = "%Y)"
    • Format all plot labels using a labs command

Reshape data and create a factor variable with plot label text (Similar to Hw 4 - Part 4)

Code
```{r modify data for lineplot}
bom_plot_data <- bom_2018_2022 |>
  select(date, top10grossM, num1grossM) |>
  pivot_longer(cols=top10grossM:num1grossM, names_to = "type", values_to="grossM") |>
  mutate(type=factor(type, 
                     levels=c("top10grossM", "num1grossM"), 
                     labels=c("Top 10", "No. 1"))) 
head(bom_plot_data)
```
# A tibble: 6 × 3
  date       type   grossM
  <date>     <fct>   <dbl>
1 2018-12-31 Top 10   36.2
2 2018-12-31 No. 1    10.0
3 2018-12-30 Top 10   50.9
4 2018-12-30 No. 1    16.4
5 2018-12-29 Top 10   58.1
6 2018-12-29 No. 1    18.6
Code
```{r fig.dim=c(15,5)}
#|label: basic line plot
(bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type)) + 
  theme_classic())
```

Code
```{r fig.dim=c(15,5)}
#|label: modify legend and lines
(bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type), linewidth=0.75) + 
  theme_classic() +
  theme(legend.position="bottom") + 
  scale_color_manual(values=c("lightblue","blue")))
```

Code
```{r fig.dim=c(15,5)}
#|label: modify labels and axis breaks
(bom_ln_plot  <- bom_ln_plot +
  scale_x_date(date_breaks = "3 months",         # OG code for date_labels
                 date_labels = "%b") +
  labs(x="Date", y = "Gross ($Mill)", color="",  # modify plot labels
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2018 - Dec. 30, 2022",
       caption="Data Source:www.boxoffice.mojo.com"))
```

Code
```{r fig.dim=c(15,5)}
#|label: modify text and borders
(bom_ln_plot  <- bom_ln_plot + 
     theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        panel.border = element_rect(colour = "lightgrey", fill=NA, size=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, size=2)))
```
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Similar but more complex than complete plot code for HW 4.

Additional formatting is required for slides that is not needed in a HTML display.

HW 4 plot code will end with labs command

theme options may be useful for group project.

Code
```{r complete code}
bom_ln_plot <- bom_plot_data |>
  ggplot() +
  geom_line(aes(x=date, y=grossM, color=type), size=0.75) + 
  theme_classic() +
  theme(legend.position="bottom") +                               # position legend at bootom
  scale_color_manual(values=c("lightblue","blue")) +              # modify line colors
  scale_x_date(date_breaks = "3 months",  date_labels = "%b") +   # OG code for date_labels
  labs(x="Date", y = "Gross ($Mill)", color="",                   # modify plot labels
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2018 - Dec. 30, 2022",
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(plot.title = element_text(size = 20),                     # text size and borders
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        panel.border = element_rect(colour = "lightgrey", fill=NA, size=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))
```
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Bureau of Labor Statistics (BLS) Data

  • Bureau of Labor Statistics is an excellent resource for economic data.

  • Many (not all) BLS datasets have a similar format.

  • In HW 4 you will:

    • create a function from code used to clean the monthly unemployment Rate data.

    • use this function to import and clean two data sets:

      • Export Price Index and Import Price Index.
    • join these two data-sets using a full_join and create a date variable.

    • modify the dataset for a plot and follow instructions to format plot.

  • Next lecture(s): We will adapt this function to examine and plot labor force variables.

  • Recall: Before importing ANY data, ALWAYS examine the .csv file.

HW 4 - Bureau of Labor Statistics Data

Bureau of Labor Statistics is a great source for economic data.

In HW 4, we will look at three data sets:

  • Unemployment Rate, the percentage of people unemployed (bls_unemp_rate.csv)

  • Import Price Index (bls_import_index.csv)

  • Export Price Index (bls_export_index.csv)

The Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world.

Introduction to Joins

Earlier we discussed stacking datasets vertically with bind_rows.

We used date to stack the data in the correct order.

Alternatively, we can join datasets horizontally (side-by-side)

  • This in only possible if the data sets have variables in common with matching values.

  • We will cover four types of common joins in this course.

The dplyr Cheat Sheet is a comprehensive reference.

Types of Joins

  • Datasets are joined two at a time, then a third can be joined to first two.

  • In BUA 455, we will cover four types of joins (but others exist):

    • left_join: include all observations in FIRST dataset in join command (dataset on left)

    • right_join: include all observations in SECOND dataset in join command (dataset on right)

    • inner_join: only include observations that are in BOTH datasets

    • full_join: include all observations that are in EITHER dataset

  • All of these join commands are useful, but left_join and right_join are redundant.

  • I usually (not always) prefer using full_join to help reveal data errors or discrepancies between multiple datasets.

A note about R colors

  • There are 657 Colors in R

  • We’ll talk more about color options and palettes, but it’s good to know they exist.

  • We don’t emphasize data visualization theory in this course, BUT managing data entails visualizations

    • The goal is to manage and make sense of the data for a client, colleague, or supervisor.
  • Choosing colors, symbols, size of text, size of lines, etc. wisely is an important part of data management and data curation.

  • As we transition to discussing the group projects (next Week), we will talk more about how to communicate with data effectively.

    • Color choices will be part of that discussion.
  • In HW 4 and today’s demo, I manually choose colors to provide examples.

Outline of HW 4

  • First Steps

    • Create R project for HW 4.
    • Create data and img folders.
    • Save HW4 Quarto (.qmd) Template to project folder.
    • Save data files to data folder within project.
  • Part 1

    • Create bls_tidy function from code used to import and clean bls_unemp_rate.csv
    • Run the function and then use it to import and clean bls_unemp_rate.csv
  • Part 2

    • Use bls_tidy function to

      • import and clean bls_import_index.csv and save it as import_index
      • import and clean bls_export_index.csv and save it as export_index

Outline of HW 4 Cont’d

  • Part 3

    • Use full full_join to join import_index and export_index into one dataset, export_import.
    • Use correct lubridate command and paste command to create a date variable.
  • Part 4

    • Reshape export_import dataset using pivot_longer to create exp_imp_plt
      • exp_imp_plt will be used to plot the data.
    • Type of index, type will be converted to a factor with levels and labels for the plot.

Outline of HW 4 Cont’d

  • Part 5

    • Create a formatted line plot with:

      • one line for the Export Price Index (light blue line).

      • one line for Import Price Index (blue line)

  • Final Steps

    • Answer all Blackboard Questions.
    • Knit Quarto (.qmd) file.
    • Create README FOR HW 4.
    • Save all files correctly within HW 4 R Project folder.
    • Zip and submit HW 4 R project folder.

Example of HW 4 Final Plot

NOTES: - Final HW 4 plot should be almost identical to the one below EXCEPT for second line of caption.

  • Slides version has more formatting. See version in .html lecture notes and HW 4 Instructions.

Next Lectures

  • More Examples from BLS

  • line and area plots,

  • Interactive time series plots

    • hcharts and dygraphs
  • Converting between data formats (xts and tibble)

  • Importing stock data from Yahoo Finance to R

  • Introducing the Group Project

Key Points from This Week

Developing Functions:

  • Develop code, convert to a function, test it

Vertical and Horizontal Data Joins

  • Stack datasets vertically using bind_rows

  • Join datasets horizontally by matching variables and values (HW 4: full_join)

Working with date variables using lubridate

  • Command(s) to create date variables

  • using paste to add year to text

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.