Housekeeping

Quiz 1 on Thursday 2/13

  • Weeks 1 - 4 (Lectures 1 - 8)

    • Quiz questions will be similar (but not identical) to Practice Questions

    • Mix of R datasets and imported datasets

      • I will provide R code to import data

      • Quiz Template and data files will be provided in a zipped project

      • Review Practice Questions, HW assignments, and Demo Videos


  • You will be required to download, unzip, and save a project to your computer (not in Downloads) as part of Quiz 1.

R Online Resources

  • Some of what we have covered (Week 4 has a more complete review):

    • R projects, file structure and Quarto files

    • Working with ‘clean’ data using the tidyverse (mainly the dplyr package)

      • common commands: read_csv, filter, select, slice, factor

      • Augmenting these commands with operators such as !, %in%, ==

      • Using pipes, |> to make data management more efficient

  • Reference links for R operators:

  • For R Markdown and dplyr commands there are R Cheat Sheets
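
As a quick refresher on the commands and operators listed above, here is a minimal sketch. It assumes the built-in mtcars dataset as a stand-in for any 'clean' dataset (it is not one of the course datasets), and the object name cars_small is made up for illustration.

```r
library(dplyr)

# mtcars is a built-in R dataset used here only for illustration
cars_small <- mtcars |>
  filter(cyl %in% c(4, 6)) |>     # %in%: keep rows where cyl is 4 or 6
  filter(!(gear == 3)) |>         # ! and ==: drop rows where gear equals 3
  select(mpg, cyl, gear) |>       # keep only three columns
  mutate(cyl = factor(cyl)) |>    # convert cyl to a factor variable
  slice(1:5)                      # keep the first five remaining rows

cars_small
```

Each pipe, |>, passes the result of one step to the next, so the whole sequence reads top to bottom.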

Using AI to help you write R code

  • AI tools became usable in the classroom in 2023.

  • My current AI of choice is Copilot for Windows.

  • ChatGPT and Gemini (on the Google platform) are also good.

  • Note that in this example I had to:

    • paraphrase the question to use the R dataset name.

    • provide additional information after the first AI response to specify using select in the tidyverse package.

Recommendations for using AI

  • DO: use AI as a search engine to find code or correct code when you are stuck.

  • DO: use AI iteratively to build code by asking it one question at a time

    • Add suggested code to your file, test the code and then either modify question or ask a subsequent question.
  • DON’T: use AI in place of studying for the exam. Plugging exam questions into an AI application will not work unless you understand the question.

    • AI can be used during the tests, but it won’t help you if you don’t know what you are looking for or how to phrase the queries correctly.
  • I use AI to test my quiz questions to ensure that it will not provide fully correct code.

  • AI can be helpful, but only if you understand the code provided and can modify it correctly.

Creating a Function

  • Any task in R can be converted to a function.

  • If you are only doing something once or twice, this is not needed.

  • If you are doing the same tasks 4 or more times, this is very useful.

  • Best Practice:

    • Develop and refine the code to complete your tasks

    • Subdivide the larger tasks into smaller shorter tasks

Anatomy of a Function:

Function_Name <- function(input_1, input_2, etc){
   output <- command 1 to do "stuff" to inputs |>
             command 2 to do "stuff" to inputs |>
             command 3 to do "stuff" to inputs |> etc.
   output  # end with name of output so that it is "kicked out" of function
}
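
To make the template above concrete, here is a minimal sketch of a small working function. The name to_millions and its digits input are hypothetical, chosen only for illustration; the two dollar values are the first top10gross and num1gross values shown later in these slides.

```r
# Hypothetical example: convert dollar amounts to millions, rounded
to_millions <- function(dollars, digits = 2) {
  output <- (dollars / 1000000) |>   # command 1: rescale to millions
    round(digits)                    # command 2: round the result
  output  # end with name of output so that it is returned by the function
}

to_millions(c(27601787, 15407695))   # returns 27.60 15.41
```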

Example and Review:

  • Code below includes preview of lubridate functions to create date, month, day, and quarter variables.
#|label: bom_import
library(tidyverse)   # read_csv, mutate, and other data management commands
library(lubridate)   # ymd, month, wday, quarter
bom21_orig <- read_csv("data/box_office_mojo_2021_tidy.csv", show_col_types = FALSE) |>
  mutate(date = ymd(date),                              # converts ymd date text to date var
         month = month(date, label = TRUE, abbr = TRUE),  # creates month var from date var
         day = wday(date, label = TRUE, abbr = TRUE),     # creates wkday var from date var
         qtr = quarter(date),                           # creates quarter var from date var
         num_releases = as.integer(num_releases),       # converts decimal to integer
         top10grossM = (top10gross/1000000) |> round(2),  # top 10 gross in $ millions
         num1grossM = (num1gross/1000000) |> round(2))    # no. 1 gross in $ millions
  • Below, bom_basic is a function that completes the tasks above:
bom_basic <- function(data_file) {
  d_out <- read_csv(data_file, show_col_types = FALSE) |>
    mutate(date = ymd(date),
           month = month(date, label = TRUE, abbr = TRUE),
           day = wday(date, label = TRUE, abbr = TRUE),
           qtr = quarter(date),
           num_releases = as.integer(num_releases),
           top10grossM = (top10gross/1000000) |> round(2),
           num1grossM = (num1gross/1000000) |> round(2))
  d_out # outputs function results to screen or saved object name
}
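
To see what the lubridate commands inside the mutate statement do on their own, here is a minimal sketch applied to a single date. The object name d is made up for illustration, and the month and weekday labels shown in the comments assume an English locale.

```r
library(lubridate)

d <- ymd("2021-12-31")                # year-month-day text -> date variable
class(d)                              # "Date"
month(d, label = TRUE, abbr = TRUE)   # Dec (abbreviated month label)
wday(d, label = TRUE, abbr = TRUE)    # Fri (abbreviated weekday label)
quarter(d)                            # 4
```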

What does bom_basic function do?

#|label: import with read_csv

b21 <- read_csv("data/box_office_mojo_2021_tidy.csv",
                show_col_types = F) |>
  glimpse(width=40)
Rows: 365
Columns: 4
$ date         <date> 2021-12-31, 2021…
$ top10gross   <dbl> 27601787, 3502147…
$ num_releases <dbl> 25, 26, 26, 25, 2…
$ num1gross    <dbl> 15407695, 2071790…
#|label: import with bom_basic function

bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv") |>
  glimpse(width=40)
Rows: 365
Columns: 9
$ date         <date> 2021-12-31, 2021…
$ top10gross   <dbl> 27601787, 3502147…
$ num_releases <int> 25, 26, 26, 25, 2…
$ num1gross    <dbl> 15407695, 2071790…
$ month        <ord> Dec, Dec, Dec, De…
$ day          <ord> Fri, Thu, Wed, Tu…
$ qtr          <int> 4, 4, 4, 4, 4, 4,…
$ top10grossM  <dbl> 27.60, 35.02, 34.…
$ num1grossM   <dbl> 15.41, 20.72, 20.…

💥 Week 5 In-class Exercises - Q1 💥

Session ID: bua455s25

Using lubridate commands, we converted date to date format (if needed) and created the month, day, and qtr variables from date.

  • By default, month and day are ordered factor variables (<ord>).

  • What is the default data type for qtr (quarter)?

A. character <chr>

B. decimal (double precision) <dbl>

C. factor <fct>

D. integer <int>

💥 Week 5 In-class Exercises - Q2 💥

Session ID: bua455s25

Here is the line that creates qtr within the mutate statement.

The quarter command is part of the lubridate package:

  • qtr = quarter(date)

Fill in the blank to convert this variable to a factor variable as you create it:

  • qtr = _____(quarter(date))

Function Demonstration - Multiple Years

  • Once function code is developed and tested, we can import 2, or 5, or even 10 data sets very efficiently.
#|label: import all 5 datasets
bom22 <- bom_basic("data/box_office_mojo_2022_tidy.csv")
bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv")
bom20 <- bom_basic("data/box_office_mojo_2020_tidy.csv")
bom19 <- bom_basic("data/box_office_mojo_2019_tidy.csv")
bom18 <- bom_basic("data/box_office_mojo_2018_tidy.csv") |> glimpse(width=60)
Rows: 365
Columns: 9
$ date         <date> 2018-12-31, 2018-12-30, 2018-12-29, …
$ top10gross   <dbl> 36240441, 50932176, 58118460, 5666776…
$ num_releases <int> 53, 51, 51, 51, 53, 52, 53, 49, 53, 5…
$ num1gross    <dbl> 10011638, 16440551, 18632907, 1704111…
$ month        <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, De…
$ day          <ord> Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mo…
$ qtr          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ top10grossM  <dbl> 36.24, 50.93, 58.12, 56.67, 51.67, 55…
$ num1grossM   <dbl> 10.01, 16.44, 18.63, 17.04, 14.62, 16…
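
The slides call the function once per file. As a possible alternative when there are many files, the sketch below applies one import function to a vector of file paths with lapply. This is not the lecture's approach: import_one is a hypothetical stand-in for bom_basic, and the two tiny CSV files are created in a temporary folder only so that the example runs on its own.

```r
library(readr)

# Hypothetical stand-in for bom_basic: any function that takes a file path
import_one <- function(data_file) {
  read_csv(data_file, show_col_types = FALSE)
}

# Create two tiny example files in a temporary folder (illustration only)
years <- c(2021, 2022)
paths <- file.path(tempdir(), paste0("example_", years, ".csv"))
for (i in seq_along(years)) {
  write_csv(data.frame(year = years[i], gross = c(1, 2, 3)), paths[i])
}

# Apply the import function to every path and name each result by year
all_years <- lapply(paths, import_one) |>
  setNames(paste0("y", years))

names(all_years)        # "y2021" "y2022"
nrow(all_years$y2022)   # 3
```

The result is a named list, so each year's data can be pulled out with $, for example all_years$y2021.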

Function to Make Repeatable Plots

  • A good practice is to subdivide tasks to make short functions

  • Recall the area plot we discussed in Week 3

  • The code below modifies the data for the plot, first as a standalone pipeline and then as a function:

#|label: data mgmt for area plot
bom22_line_area_orig <- bom22 |>
  select(date, top10grossM, num1grossM) |>                  # select variables
  rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>   # rename for plot
  pivot_longer(cols=`Top 10`:`No. 1`,                       # reshape data  
               names_to = "type", values_to = "grossM") |>
  mutate(type=factor(type, levels=c("Top 10", "No. 1")))    # convert type of gross to a factor


#|label: data mgmt function for area plot
bom_line_area <- function(data_in){
  d_out <- data_in |>
    select(date, top10grossM, num1grossM) |>
    rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>
    pivot_longer(cols=`Top 10`:`No. 1`,
                 names_to = "type", values_to = "grossM") |>
    mutate(type=factor(type, levels=c("Top 10", "No. 1")))
  d_out
}

bom22_line_area <- bom_line_area(bom22)   # creates plot dataset for 2022
bom21_line_area <- bom_line_area(bom21)   # creates plot dataset for 2021
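
To see what the reshape step does, here is a minimal sketch of pivot_longer on a two-row dataset. The values and the object names wide and long are made up for illustration; they are not from the Box Office Mojo data.

```r
library(dplyr)
library(tidyr)

# Made-up values for two dates (illustration only)
wide <- tibble(
  date = as.Date(c("2022-01-01", "2022-01-02")),
  `Top 10` = c(30.5, 28.1),
  `No. 1`  = c(12.2, 11.0)
)

long <- wide |>
  pivot_longer(cols = `Top 10`:`No. 1`,                 # columns to stack
               names_to = "type", values_to = "grossM") |>
  mutate(type = factor(type, levels = c("Top 10", "No. 1")))

long                # 4 rows: one row per date and type
levels(long$type)   # "Top 10" "No. 1"
```

Two columns of gross values become one grossM column plus a type column, which is the layout the fill and color aesthetics in the plots expect.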

Function for Area Plot

  • Functions are very useful for plots because you don’t have to rewrite the plot code for each new dataset.

  • The only text that changes from year to year is the subtitle.

area_plt22_orig <- bom22_line_area |>                                  
  ggplot() +                                                           
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +
  theme_classic() + 
  scale_fill_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", fill="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2022 - Dec. 31, 2022",
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))

Display of saved plot, area_plt22_orig

Area Plot Function

#|label: area plot function

area_plt <- function(data_in, yr){
  data_in |>
  ggplot() +
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +
  theme_classic() + 
  scale_fill_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", fill="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}
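
The yr input is what lets one function serve every year: paste joins its pieces with single spaces to build the subtitle. A minimal sketch, using a hypothetical helper named make_subtitle that is not part of the lecture code:

```r
# Hypothetical helper that isolates the subtitle step of the plot function
make_subtitle <- function(yr) {
  paste("Jan. 1,", yr, "- Dec. 31,", yr)   # pieces are joined with spaces
}

make_subtitle("2022")   # "Jan. 1, 2022 - Dec. 31, 2022"
make_subtitle(2021)     # numbers also work: "Jan. 1, 2021 - Dec. 31, 2021"
```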

Line Plot Function

Almost identical to Area Plot Function

#|label: line plot function

line_plt <- function(data_in, yr){
  data_in |>                                                    
  ggplot() +                                                    
  geom_line(aes(x=date, y=grossM, color=type), linewidth=1) +   
  theme_classic() + 
  scale_color_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", color="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}

Box Office Mojo 2022 - Area Plot

#|label: data and area plot 2022
bom22_line_area <- bom_line_area(bom22) # data formatting function
area_plt(bom22_line_area, "2022")       # area plot function

Box Office Mojo 2022 - Line Plot

#|label: line plot 2022
line_plt(bom22_line_area, "2022") # line plot function (data formatted in chunk above)

Box Office Mojo 2021 - Line Plot

#|label: data and line plot 2021
bom21_line_area <- bom_line_area(bom21)  # data formatting function
line_plt(bom21_line_area, "2021")        # line plot function

Box Office Mojo 2021 - Area Plot

#|label: area plot 2021
area_plt(bom21_line_area, "2021") # area plot function (data formatted in previous chunk)

Preview of Next week after Quiz 1

  • Cleaning Messy Data from Box Office Mojo Website

  • Examining/Cleaning Bureau of Labor Statistics data

  • Writing functions to automate data cleaning

  • Joining data from multiple datasets

  • HW 4 will be introduced