Grading is in progress, but grades cannot be discussed until all grades are posted.
Checking the submitted work of all students takes time but should be completed this week.
Comments about HWs 1 - 3
HW 1 and HW 2 grades are posted
HW 3 will be graded this week.
HW 4 is posted and is due next week on Wednesday, 2/28/24.
We will work through part of it together on Thursday
Demo Videos are available.
Recall from Lecture 9:
R functions:
Useful for automating repetitive tasks
Best Practices for Writing Functions
Plan what you want to do
Write out task steps
Develop and refine the code to complete tasks
Subdivide multi-step tasks into short sets of tasks.
Convert each set of tasks to a function
Make the function more general so it is more versatile.
Common Function Structure Options
Anatomy of a Function (Option 1):
Commands within function are not saved to an object.
Function results are automatically ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
command 1 to do "stuff" to inputs |>
command 2 to do "stuff" to inputs |>
command 3 to do "stuff" to inputs |> etc. # output is automatically "kicked out"
}
Anatomy of a Function (Option 2 - Shown in Week 5 Lecture):
Commands within function are saved to an object.
Function ends with name of object so that results are ‘kicked out’ as output.
Function_Name <- function(input_1, input_2, etc){
output <- command 1 to do "stuff" to inputs |>
command 2 to do "stuff" to inputs |>
command 3 to do "stuff" to inputs |> etc.
output # end with name of output so that it is "kicked out" of function
}
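As a minimal sketch of the two structures, the functions below both convert dollar amounts to millions. The names to_millions_v1, to_millions_v2, and the digits input are hypothetical and chosen only for illustration. The sketch assumes R 4.1 or later so that the native pipe is available.

```r
# Option 1: nothing is saved inside the function;
# the value of the last expression is returned automatically
to_millions_v1 <- function(x, digits = 2) {
  (x / 1000000) |> round(digits)
}

# Option 2: the result is saved to an object, and the function
# ends with the object name so that the result is returned
to_millions_v2 <- function(x, digits = 2) {
  output <- (x / 1000000) |> round(digits)
  output
}

gross <- c(1234567, 98765432)   # hypothetical dollar amounts
to_millions_v1(gross)
to_millions_v2(gross)           # same result as Option 1
```

Both versions return the same values. Option 2 is mostly a matter of readability when the pipeline is long.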
Box Office Mojo Data Cleaning Steps
Examine .csv file to determine number of rows to skip
Import raw .csv file and skip header rows above variable names
Select useful columns (Columns 1, 4, 7, and 9):
1: Date
4: Top 10 Gross - <chr> variable with nuisance characters to be removed ($ and ,)
7: Releases
9: Gross - <chr> variable with nuisance characters to be removed ($ and ,)
Rename these four variables to easier names to work with in R
New names should not have spaces.
Lower case names with underscores (_) work well in code.
Variable names and labels can be reformatted for plots and tables.
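The select and rename steps can be tried on a small stand-in dataset before working with the full file. This is only a sketch: the tibble below is hypothetical, with made-up values and fewer columns than the real Box Office Mojo file, so the column positions differ from the 1, 4, 7, 9 used in the lecture.

```r
library(dplyr)

# Hypothetical stand-in for the raw data (values are made up)
raw <- tibble(
  Date           = c("1-Jan", "2-Jan"),
  Day            = c("Saturday", "Sunday"),
  `Top 10 Gross` = c("$1,000", "$2,000"),
  Releases       = c("40", "41")
)

small <- raw |>
  select(1, 3, 4) |>                      # select columns by position
  rename(date         = Date,             # new_name = old_name
         top10gross   = `Top 10 Gross`,   # backticks handle the spaces
         num_releases = Releases)

names(small)
```

Selecting by position is compact but fragile: if the source file gains or loses a column, the numbers silently point at different variables.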
Box Office Mojo Data Cleaning Steps Cont’d
Remove non-data rows (holidays, etc.) with a filter command.
Use mutate to convert variables to use-able formats:
Use paste to add year text to date character variable and convert to a date variable.
Convert Releases (num_releases) to an integer variable
Use gsub and as.numeric to convert each gross variable to numeric:
gsub is used to remove nuisance characters, $ and ,
NOTE: multiple columns with the same nuisance characters can be cleaned at the same time with mutate(across(...))
as.numeric is used to convert character to numeric decimal value (<dbl>) once nuisance characters are removed.
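The filter and mutate steps above, including the mutate(across(...)) shortcut from the NOTE, can be sketched on a small hypothetical tibble. The values are made up, and the pattern "[$,]" is an assumption of this sketch: it is a regular expression that removes both nuisance characters in one pass, rather than the two fixed-pattern gsub calls used in the lecture code.

```r
library(dplyr)

# Hypothetical raw rows; the middle row mimics an empty holiday row
raw <- tibble(
  date         = c("1-Jan", "New Year's Day", "2-Jan"),
  top10gross   = c("$1,234,567", NA, "$987,654"),
  num_releases = c("40", NA, "41"),
  num1gross    = c("$456,789", NA, "$321,000")
)

clean <- raw |>
  filter(!is.na(top10gross)) |>                      # drop non-data rows
  mutate(
    num_releases = as.integer(num_releases),         # text to integer
    across(c(top10gross, num1gross),                 # clean both gross columns at once
           \(x) as.numeric(gsub("[$,]", "", x)))     # strip $ and , then convert
  )

glimpse(clean)
```

Inside square brackets the $ is treated as a literal character, so no fixed = TRUE argument is needed here.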
Cleaning One Data Set
#| label: cleaning 1 box office mojo dataset
bom2022 <- read_csv("data/box_office_mojo_2022.csv", skip = 11, show_col_types = FALSE) |>
  select(1, 4, 7, 9) |> # select columns by number (use with care)
  rename("date" = "Date", "top10gross" = "Top 10 Gross",
         "num_releases" = "Releases", "num1gross" = "Gross") |>
  filter(!is.na(top10gross)) |> # filters out empty holiday rows
  mutate(date = dmy(paste(date, 2022)),
         num_releases = as.integer(num_releases),
         # gross variables cleaned one at a time
         top10gross = gsub(pattern = "$", replacement = "", x = top10gross, fixed = TRUE),
         top10gross = gsub(pattern = ",", replacement = "", x = top10gross, fixed = TRUE) |> as.numeric(),
         num1gross = gsub(pattern = "$", replacement = "", x = num1gross, fixed = TRUE),
         num1gross = gsub(pattern = ",", replacement = "", x = num1gross, fixed = TRUE) |> as.numeric()) |>
  glimpse()
#| label: bom_cln_function
bom_cln <- function(data_file, yr, skip_num){
  # inputs: data_file is file name
  #         yr is year of data
  #         skip_num is number of header rows to skip
  read_csv(data_file, show_col_types = FALSE, skip = skip_num) |> # data_file and skip_num used here
    select(1, 4, 7, 9) |> # columns needed are always the same for bom
    rename("date" = "Date", # column renaming always the same
           "top10gross" = "Top 10 Gross",
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |> # filter out non-data rows
    mutate(date = dmy(paste(date, yr)), # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern = "$", replacement = "", x = top10gross, fixed = TRUE),
           top10gross = gsub(pattern = ",", replacement = "", x = top10gross, fixed = TRUE) |> as.numeric(),
           num1gross = gsub(pattern = "$", replacement = "", x = num1gross, fixed = TRUE),
           num1gross = gsub(pattern = ",", replacement = "", x = num1gross, fixed = TRUE) |> as.numeric())
}
Working with Dates using lubridate
lubridate can convert a wide variety of text information to a date.
User must specify the order of information, e.g. ymd indicates year, then month, then day
Text must include year, but day is not required (See HW 4)
#| label: examples of lubridate
"31st of October, 2040" # text date
[1] "31st of October, 2040"
dmy("31st of October, 2040") # example of lubridate command dmy
[1] "2040-10-31"
paste("Hello", "Goodbye") # example of paste concatenating two text strings
[1] "Hello Goodbye"
"March 15" # example of month and day only
[1] "March 15"
# lubridate commands require a year
Lubridate Examples Continued
#| label: more lubridate examples
paste("February", 2040) # using paste to add year to month text
[1] "February 2040"
my(paste("February", "2040")) # using lubridate my command with paste
[1] "2040-02-01"
paste("February", "2040") |> my() # same command using piping
[1] "2040-02-01"
"March 15th, 2039" # different form of text date (Q1 next slide)
[1] "March 15th, 2039"
ymd(20390315) # example of lubridate command ymd
[1] "2039-03-15"
Sys.Date() # demo of Sys.Date command
[1] "2025-02-17"
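The need for a year, and the paste workaround, can be checked with a short sketch. The text "31 Dec" and the object names below are hypothetical and used only for illustration.

```r
library(lubridate)

day_month <- "31 Dec"                  # hypothetical text with no year

# Without a year, dmy() cannot build a date and returns NA (with a warning)
no_year <- suppressWarnings(dmy(day_month))

# Adding the year with paste() gives dmy() all three pieces it needs
with_year <- dmy(paste(day_month, 2022))

is.na(no_year)
with_year
```

The warning is suppressed here only to keep the output short. When cleaning real data, that warning is a useful signal that some rows did not parse.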
💥 Week 6 In-class Exercises - Q1 💥
Session ID: bua455f24
What is the correct lubridate command to convert the following date text to a date value in R?
“March 15th, 2039”
Hint: Examine the Lubridate Cheat Sheet and test out commands in console to see what the output is.
NOTE: lubridate commands will only work if you have run the setup for this lecture to load the tidyverse suite of packages.
Option 1: No objects are saved within the function and the result is kicked out automatically
Ideal for straightforward functions and plot functions
#| label: bom_cln function option 1
bom_cln <- function(data_file, yr, skip_num){
  read_csv(data_file, show_col_types = FALSE, skip = skip_num) |> # data_file and skip_num used here
    select(1, 4, 7, 9) |> # columns needed are always the same for these datasets
    rename("date" = "Date", # column renaming always the same
           "top10gross" = "Top 10 Gross",
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |> # filter out non-data rows
    mutate(date = dmy(paste(date, yr)), # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern = "$", replacement = "", x = top10gross, fixed = TRUE),
           top10gross = gsub(pattern = ",", replacement = "", x = top10gross, fixed = TRUE) |> as.numeric(),
           num1gross = gsub(pattern = "$", replacement = "", x = num1gross, fixed = TRUE),
           num1gross = gsub(pattern = ",", replacement = "", x = num1gross, fixed = TRUE) |> as.numeric())
}

bom2021_Op1 <- bom_cln(data_file = "data/box_office_mojo_2021.csv",
                       yr = 2021, skip_num = 11) # use function
Option 2: Commands within the function are saved to an object, d_out
Function ends with d_out so that result gets kicked-out of function
#| label: bom_cln function option 2
bom_cln <- function(data_file, yr, skip_num){
  d_out <- read_csv(data_file, show_col_types = FALSE, skip = skip_num) |> # data_file and skip_num used here
    select(1, 4, 7, 9) |> # columns needed are always the same for these datasets
    rename("date" = "Date", # column renaming always the same
           "top10gross" = "Top 10 Gross",
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |> # filter out non-data rows
    mutate(date = dmy(paste(date, yr)), # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern = "$", replacement = "", x = top10gross, fixed = TRUE),
           top10gross = gsub(pattern = ",", replacement = "", x = top10gross, fixed = TRUE) |> as.numeric(),
           num1gross = gsub(pattern = "$", replacement = "", x = num1gross, fixed = TRUE),
           num1gross = gsub(pattern = ",", replacement = "", x = num1gross, fixed = TRUE) |> as.numeric())
  d_out
}

bom2021_Op2 <- bom_cln(data_file = "data/box_office_mojo_2021.csv",
                       yr = 2021, skip_num = 11) # use function
Option 3 has one more input, out_file, to export the clean data using write_csv
Useful for providing clean data to a client or colleague who doesn’t use R
#| label: bom_cln function option 3
bom_cln_out <- function(data_file, yr, skip_num, out_file){
  read_csv(data_file, show_col_types = FALSE, skip = skip_num) |> # data_file and skip_num used here
    select(1, 4, 7, 9) |> # columns needed are always the same for these datasets
    rename("date" = "Date", # column renaming always the same
           "top10gross" = "Top 10 Gross",
           "num_releases" = "Releases",
           "num1gross" = "Gross") |>
    filter(!is.na(top10gross)) |> # filter out non-data rows
    mutate(date = dmy(paste(date, yr)), # paste yr input to date text and convert to date
           num_releases = as.integer(num_releases),
           top10gross = gsub(pattern = "$", replacement = "", x = top10gross, fixed = TRUE),
           top10gross = gsub(pattern = ",", replacement = "", x = top10gross, fixed = TRUE) |> as.numeric(),
           num1gross = gsub(pattern = "$", replacement = "", x = num1gross, fixed = TRUE),
           num1gross = gsub(pattern = ",", replacement = "", x = num1gross, fixed = TRUE) |> as.numeric()) |>
    write_csv(out_file)
}

bom2021_Op3 <- bom_cln_out(data_file = "data/box_office_mojo_2021.csv",
                           yr = 2021, skip_num = 11,
                           out_file = "data/box_office_mojo_2021_tidy.csv") # use function
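To explore what a function ending in write_csv produces, a small sketch can be run in the console. It relies on the documented behavior that readr's write_csv returns its input invisibly. The function export_demo, the column names, and the temporary file are hypothetical stand-ins, not the lecture's function or data.

```r
library(readr)
library(dplyr)

# Hypothetical function whose last step is write_csv()
export_demo <- function(data, out_file) {
  data |>
    mutate(doubled = value * 2) |>
    write_csv(out_file)            # writes the file, then passes the data along
}

tmp <- tempfile(fileext = ".csv")  # temporary path so no project files are touched
result <- export_demo(tibble(value = 1:3), tmp)

file.exists(tmp)                   # was a file written?
result                             # what was assigned to the object?
```

Because the value is returned invisibly, nothing prints when the function is called without an assignment, which can make the function look as if it returns nothing.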
💥 Week 6 In-class Exercises - Q2 💥
Session ID: bua455f24
When using the ‘Option 3’ Function, what output(s) are created?
A. An exported .csv file
B. A clean tibble dataset saved to the Global Environment
C. Both an exported .csv file and a clean tibble dataset in R
More Comments on Functions
Even if a function works for one situation, it may not work for every situation.
Published R functions are often very long because they are tested, edited, and retested to account for most situations.
For example, you can examine the text for a function we have used
In the console, type read_csv without the parentheses and push Enter.
You can see the text of the function which is complex.
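The same idea works for any function, including ones you write. A minimal sketch, where add_year is a hypothetical helper defined only for this illustration:

```r
library(readr)

# A short hypothetical function
add_year <- function(text, yr) {
  paste(text, yr)
}

add_year                    # typing the name without parentheses prints the code
names(formals(add_year))    # argument names of our short function
names(formals(read_csv))    # read_csv accepts many more arguments
```

Comparing the two argument lists gives a quick sense of how many situations a published function is written to handle.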
Using our Function
We will run the Option 3 version, which cleans the data AND exports a cleaned (tidied) dataset.
Note that one date in the 2020 data will fail to parse and become NA because of the leap day (2/29/2020).
#| label: testing our function
bom2018 <- bom_cln_out(data_file = "data/box_office_mojo_2018.csv", yr = 2018, skip_num = 11,
                       out_file = "data/box_office_mojo_2018_tidy.csv")
bom2019 <- bom_cln_out(data_file = "data/box_office_mojo_2019.csv", yr = 2019, skip_num = 11,
                       out_file = "data/box_office_mojo_2019_tidy.csv")
bom2020 <- bom_cln_out(data_file = "data/box_office_mojo_2020.csv", yr = 2020, skip_num = 11,
                       out_file = "data/box_office_mojo_2020_tidy.csv")
bom2021 <- bom_cln_out(data_file = "data/box_office_mojo_2021.csv", yr = 2021, skip_num = 11,
                       out_file = "data/box_office_mojo_2021_tidy.csv")
bom2022 <- bom_cln_out(data_file = "data/box_office_mojo_2022.csv", yr = 2022, skip_num = 11,
                       out_file = "data/box_office_mojo_2022_tidy.csv")

# fix NA due to leap day manually
# not easily fixed within function
bom2020$date[is.na(bom2020$date)] <- dmy(29022020)
write_csv(bom2020, "data/box_office_mojo_2020_tidy.csv")
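The manual fix can be rehearsed on a small stand-in before applying it to the real data. This is a sketch under stated assumptions: the data frame demo and its values are hypothetical, and the NA is entered directly rather than produced by a failed parse.

```r
library(lubridate)

# Hypothetical data with one missing date where the leap day should be
demo <- data.frame(
  date         = c(dmy("28-02-2020"), NA, dmy("01-03-2020")),
  num_releases = c(40L, 41L, 39L)
)

sum(is.na(demo$date))                              # confirm exactly one NA first
demo$date[is.na(demo$date)] <- dmy("29-02-2020")   # fill in the leap day
demo
```

Counting the NA values first matters: this assignment sets every missing date to the same value, so it is only safe when there is exactly one.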
Join datasets vertically (stacking)
If you have multiple identically structured datasets (same column names and types), you can ‘stack’ them using the bind_rows command.
Recommended: Import and format datasets to all have same column names so that this is seamless.
Functions are useful for creating identically formatted datasets.
#| label: create one 5 year BOMdataset
bom_2018_2022 <- bind_rows(bom2018, bom2019, bom2020, bom2021, bom2022) # join datasets

# Modify gross variables (could also be done in function)
bom_2018_2022 <- bom_2018_2022 |>
  mutate(top10grossM = (top10gross/1000000) |> round(2),
         num1grossM = (num1gross/1000000) |> round(2)) |>
  filter(date <= "2022-12-30") |>
  glimpse()
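A small sketch shows why matching column names matter when stacking. The tibbles yr1, yr2, and yr3 are hypothetical, with made-up values.

```r
library(dplyr)

# Two hypothetical datasets with identical column names
yr1 <- tibble(date = as.Date(c("2021-12-30", "2021-12-31")), top10gross = c(100, 200))
yr2 <- tibble(date = as.Date(c("2022-01-01", "2022-01-02")), top10gross = c(300, 400))

both <- bind_rows(yr1, yr2)          # stacks cleanly: 4 rows, 2 columns

# A dataset whose gross column is named differently
yr3 <- tibble(date = as.Date("2023-01-01"), Top10Gross = 500)

mismatch <- bind_rows(yr1, yr3)      # extra column is created and padded with NA
mismatch
```

bind_rows does not stop when names differ. It keeps both columns and fills the gaps with NA, so a naming slip shows up as unexpected missing values rather than as an error message.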
Unemployment Rate, the percentage of people unemployed (bls_unemp_rate.csv)
Import Price Index (bls_import_index.csv)
Export Price Index (bls_export_index.csv)
The Import and Export Price Indices contain data on changes in the prices of nonmilitary goods and services traded between the U.S. and the rest of the world.
Join datasets horizontally by matching variables and values (HW 4:full_join)
Working with date variables using lubridate
Command(s) to create date variables
using paste to add year to text
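As a preview of joining horizontally, the sketch below matches two small datasets on a shared date variable with full_join. The tibbles, column names, and values are hypothetical and made up for illustration; they are not the BLS data.

```r
library(dplyr)

# Hypothetical monthly series that only partly overlap in time
unemp <- tibble(date       = as.Date(c("2023-01-01", "2023-02-01")),
                unemp_rate = c(4.0, 4.1))
import_idx <- tibble(date         = as.Date(c("2023-02-01", "2023-03-01")),
                     import_index = c(100.0, 101.0))

# full_join keeps every date found in either dataset
joined <- full_join(unemp, import_idx, by = "date")
joined
```

Dates present in only one dataset are kept, with NA in the columns that came from the other dataset.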
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.