Housekeeping

Quiz 1 on Thursday 2/12

  • Weeks 1 - 4 (Lectures 1 - 8)

    • Quiz questions will be similar (but not identical) to Practice Questions

    • Mix of R datasets and imported datasets

      • I will provide R code to import data

      • Quiz Template and data files will be provided in a zipped project

      • Review Practice Questions, HW assignments, and Demo Videos


  • You will be required to download, unzip, and save a project to your computer (not in Downloads) as part of Quiz 1.

R Online Resources

  • Some of what we have covered (Week 4 has a more complete review):

    • R projects, file structure and Quarto files

    • Working with ‘clean’ data using the dplyr package

      • common commands: read_csv, filter, select, slice, factor

      • Augmenting these commands with operators such as !, %in%, ==

      • Using pipes, |> to make data management more efficient

  • Reference links for R operators:

  • For R Markdown and dplyr commands there are R Cheat Sheets
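
The commands and operators reviewed above can be sketched in one short pipeline. This is only an illustration: it assumes the built-in mtcars dataset as a stand-in for course data, and the object name cars_small is hypothetical.

```r
library(dplyr)

# mtcars is a built-in R dataset; it stands in here for course data
cars_small <- mtcars |>
  filter(cyl %in% c(4, 6), am == 1) |>   # %in% and ==: 4- or 6-cylinder cars with am equal to 1
  filter(!(gear == 5)) |>                # ! negates a condition: drop cars with 5 gears
  select(mpg, cyl, gear, am) |>          # keep only these four columns
  mutate(cyl = factor(cyl)) |>           # convert cyl to a factor variable
  slice(1:3)                             # keep the first three remaining rows

cars_small
```

Each step takes the result of the previous step as its input, which is what the pipe |> does.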

Using AI to help you write R code

  • AI tools became usable in the classroom in 2023.

  • My current AI of choice is Copilot for Windows.

  • ChatGPT and Gemini on the Google platform are also good.

  • On the next slide I show the result of using Copilot for Question 12.

  • Note that in this example I had to:

    • Let Copilot know what R dataset this is.

AI Prompt for Practice Question 12

  • Note that I added the second line of the prompt, which names the R dataset.

  • In Quizzes I will let you know the R dataset if that information is needed.

  • Students should also know which R datasets are being used from doing the practice questions

2025 AI Response for Practice Question 12

2026 AI Response for Practice Question 12

Similar to previous response but not identical and still would need editing.

Recommendations for using AI

  • DO: use AI as a search engine to find code or correct code when you are stuck.

  • DO: use AI iteratively to build code by asking it one question at a time

    • Add suggested code to your file, test the code, and then either modify the question or ask a follow-up question.

  • DON’T: use AI in place of studying for the exam, or plug exam questions into an AI application and expect it to work without understanding the question yourself.

    • AI can be used during the tests, but it won’t help you if you don’t know what you are looking for or how to phrase the queries correctly.

  • I use AI to test my quiz questions to ensure that AI tools will not provide fully correct code.

  • AI can be helpful, but only if you understand the code provided and can modify it correctly.

Creating a Function

  • Any task in R can be converted to a function.

  • If you are only doing something once or twice, this is not needed.

  • If you are doing the same tasks 4 or more times, this is very useful.

  • Best Practice:

    • Develop and refine the code to complete your tasks

    • Subdivide the larger tasks into smaller shorter tasks

Anatomy of a Function:

Function_Name <- function(input_1, input_2, etc){
   output <- command 1 to do "stuff" to inputs |>
             command 2 to do "stuff" to inputs |>
             command 3 to do "stuff" to inputs |> etc.
   output  # end with name of output so that it is "kicked out" of function
}
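
A minimal runnable sketch of the anatomy above. The helper name to_millions and its digits argument are hypothetical (not part of the course code); the function mirrors the divide-and-round step used to create top10grossM.

```r
# Hypothetical helper: convert a dollar amount to millions, rounded
to_millions <- function(dollars, digits = 2) {
  output <- (dollars / 1000000) |>   # command 1: rescale dollars to millions
    round(digits)                    # command 2: round to the requested digits
  output                             # end with the output name so it is returned
}

to_millions(27601787)   # 27.6
```

The inputs go inside the parentheses after function, the work happens inside the braces, and the last line names the object to return.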

Example and Review:

  • Code below includes preview of lubridate functions to create date, month, day, and quarter variables.
#|label: bom_import
bom21_orig <- read_csv("data/box_office_mojo_2021_tidy.csv", show_col_types = F) |>
  mutate(date = ymd(date),                              # converts ymd date text to date var
         month = month(date, label = T, abbr = T),      # creates month var from date var
         day = wday(date, label=T, abbr = T),           # creates wkday var from date var
         qtr = quarter(date),                           # creates quarter var from date var
         num_releases = as.integer(num_releases),
         top10grossM = (top10gross/1000000) |> round(2),
         num1grossM = (num1gross/1000000) |> round(2))
  • Below, bom_basic is a function that completes the tasks above:
bom_basic <- function(data_file) {
  d_out <- read_csv(data_file, show_col_types = F) |>
  mutate(date = ymd(date),
         month = month(date, label = T, abbr = T),
         day = wday(date, label=T, abbr = T),
         qtr = quarter(date),
         num_releases = as.integer(num_releases),
         top10grossM = (top10gross/1000000) |> round(2),
         num1grossM = (num1gross/1000000) |> round(2))
  d_out # outputs function results to screen or saved object name
}
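
To see what each lubridate command in bom_basic does, it can help to try it on a single date. A small sketch, assuming the lubridate package is installed; the object name d is hypothetical.

```r
library(lubridate)

d <- ymd("2021-12-31")                # text in year-month-day order becomes a Date

class(d)                              # "Date"
month(d, label = TRUE, abbr = TRUE)   # ordered factor with abbreviated month names
wday(d, label = TRUE, abbr = TRUE)    # ordered factor with abbreviated weekday names
quarter(d)                            # quarter of the year as a number, here 4
```

Running these lines one at a time in the Console shows the value and type each command returns before they are combined inside mutate.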

What does bom_basic function do?

#|label: import with read_csv

b21 <- read_csv("data/box_office_mojo_2021_tidy.csv",
                show_col_types = F) |>
  glimpse(width=40)
Rows: 365
Columns: 4
$ date         <date> 2021-12-31, 2021…
$ top10gross   <dbl> 27601787, 3502147…
$ num_releases <dbl> 25, 26, 26, 25, 2…
$ num1gross    <dbl> 15407695, 2071790…
#|label: import with bom_basic function

bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv") |>
  glimpse(width=40)
Rows: 365
Columns: 9
$ date         <date> 2021-12-31, 2021…
$ top10gross   <dbl> 27601787, 3502147…
$ num_releases <int> 25, 26, 26, 25, 2…
$ num1gross    <dbl> 15407695, 2071790…
$ month        <ord> Dec, Dec, Dec, De…
$ day          <ord> Fri, Thu, Wed, Tu…
$ qtr          <int> 4, 4, 4, 4, 4, 4,…
$ top10grossM  <dbl> 27.60, 35.02, 34.…
$ num1grossM   <dbl> 15.41, 20.72, 20.…

💥 Week 5 In-class Exercises - Q1 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Using lubridate commands, we converted date to date format (if needed) and created month, day, and qtr variables from date.

  • Because label = T is used, month and day are ordered factor variables (<ord>).

  • What is the default data type for qtr (quarter)?

A. character <chr>

B. decimal (double precision) <dbl>

C. factor <fct>

D. integer <int>

💥 Week 5 In-class Exercises - Q2 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Here is the line that creates qtr within the mutate statement.

The quarter command is part of the lubridate package:

  • qtr = quarter(date)

Fill in the blank to convert this variable to a factor variable as you create it:

  • qtr = _____(quarter(date))

Function Demonstration - Multiple Years

  • Once function code is developed and tested, we can import 2, or 5, or even 10 data sets very efficiently.
#|label: import all 5 datasets
bom22 <- bom_basic("data/box_office_mojo_2022_tidy.csv")
bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv")
bom20 <- bom_basic("data/box_office_mojo_2020_tidy.csv")
bom19 <- bom_basic("data/box_office_mojo_2019_tidy.csv")
bom18 <- bom_basic("data/box_office_mojo_2018_tidy.csv") |> glimpse( width=60)
Rows: 365
Columns: 9
$ date         <date> 2018-12-31, 2018-12-30, 2018-12-29, …
$ top10gross   <dbl> 36240441, 50932176, 58118460, 5666776…
$ num_releases <int> 53, 51, 51, 51, 53, 52, 53, 49, 53, 5…
$ num1gross    <dbl> 10011638, 16440551, 18632907, 1704111…
$ month        <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, De…
$ day          <ord> Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mo…
$ qtr          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ top10grossM  <dbl> 36.24, 50.93, 58.12, 56.67, 51.67, 55…
$ num1grossM   <dbl> 10.01, 16.44, 18.63, 17.04, 14.62, 16…
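
When the file names follow a pattern, the same import function can also be applied to a whole vector of file paths. The sketch below is an optional alternative, not the approach used in the slides: import_one is a hypothetical stand-in for bom_basic, and the tiny CSV files are made-up examples written to a temporary folder so the code runs on its own.

```r
library(readr)

# Hypothetical stand-in for bom_basic(): any function that takes a file path works
import_one <- function(data_file) {
  read_csv(data_file, show_col_types = FALSE)
}

# Write two tiny made-up example files to a temporary folder
years <- c(2021, 2022)
files <- file.path(tempdir(), paste0("example_", years, ".csv"))
for (f in files) write_csv(data.frame(x = 1:3), f)

# Apply the import function to every file and name each result by year
all_years <- lapply(files, import_one) |>
  setNames(paste0("y", years))

names(all_years)   # "y2021" "y2022"
```

The result is a named list with one dataset per year, so all_years$y2022 plays the role that bom22 plays above.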

Function to Make Repeatable Plots

  • A good practice is to subdivide tasks to make short functions

  • Recall the area plot we discussed in Week 3

  • The code below modifies the data for the plot, first as a standalone pipeline and then as a function:

#|label: data mgmt for area plot
bom22_line_area_orig <- bom22 |>
  select(date, top10grossM, num1grossM) |>                  # select variables
  rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>   # rename for plot
  pivot_longer(cols=`Top 10`:`No. 1`,                       # reshape data  
               names_to = "type", values_to = "grossM") |>
  mutate(type=factor(type, levels=c("Top 10", "No. 1")))    # convert type of gross to a factor


#|label: data mgmt function for area plot
bom_line_area <- function(data_in){
  d_out <- data_in |>
  select(date, top10grossM, num1grossM) |>                  
  rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>   
  pivot_longer(cols=`Top 10`:`No. 1`,                       
               names_to = "type", values_to = "grossM") |>
  mutate(type=factor(type, levels=c("Top 10", "No. 1"))) 
  d_out
}

bom22_line_area <- bom_line_area(bom22)   # creates plot dataset for 2022
bom21_line_area <- bom_line_area(bom21)   # creates plot dataset for 2021
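
The reshaping step inside bom_line_area is easier to follow on a tiny example. The sketch below uses a made-up two-day dataset (the values are invented and the names toy and toy_long are hypothetical) with the same column names as the box office data.

```r
library(dplyr)
library(tidyr)

# Made-up two-day example with the same column names as the bom data
toy <- tibble(date = as.Date(c("2022-01-01", "2022-01-02")),
              top10grossM = c(30.5, 25.1),
              num1grossM = c(12.2, 9.8))

toy_long <- toy |>
  rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>       # rename for plot
  pivot_longer(cols = `Top 10`:`No. 1`,                         # two columns become two rows per date
               names_to = "type", values_to = "grossM") |>
  mutate(type = factor(type, levels = c("Top 10", "No. 1")))    # fix the legend order

toy_long
```

The wide data has one row per date; the long data has one row per date and type of gross, which is the layout ggplot needs to color or fill by type.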

Function for Area Plot

  • Functions are very useful for plots so that you don’t have to keep recreating the same code for each dataset.

  • The only text that changes from year to year is the subtitle.

area_plt22_orig <- bom22_line_area |>                                  
  ggplot() +                                                           
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +           
  theme_classic() + 
  scale_fill_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", fill="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2022 - Dec. 31, 2022",
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))

Display of saved plot, area_plt22_orig

Area Plot Function

#|label: area plot function

area_plt<- function(data_in, yr){
  data_in |>                                                
  ggplot() +                                                
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +
  theme_classic() + 
  scale_fill_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", fill="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}
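
The only part of the plot function that depends on yr is the subtitle, which paste builds by joining its pieces with single spaces. A small sketch of that step on its own; the object name subtitle_text is hypothetical.

```r
yr <- "2022"

# paste() joins its arguments with a single space between each piece
subtitle_text <- paste("Jan. 1,", yr, "- Dec. 31,", yr)
subtitle_text                               # "Jan. 1, 2022 - Dec. 31, 2022"

# a number works too, because paste() converts it to text
paste("Jan. 1,", 2021, "- Dec. 31,", 2021)  # "Jan. 1, 2021 - Dec. 31, 2021"
```

This is why the same function can be reused for any year by changing only the second argument.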

Line Plot Function

Almost identical to Area Plot Function: geom_line replaces geom_area, and color replaces fill.

#|label: line plot function

line_plt<- function(data_in, yr){
  data_in |>                                                    
  ggplot() +                                                    
  geom_line(aes(x=date, y=grossM, color=type), linewidth=1) +   
  theme_classic() + 
  scale_color_manual(values=c("blue", "lightblue")) +   
  labs(x="Date", y = "Gross ($Mill)", color="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
       caption="Data Source:www.boxoffice.mojo.com") + 
  theme(legend.position="bottom",
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}

Box Office Mojo 2022 - Area Plot

#|label: data and area plot 2022
bom22_line_area <- bom_line_area(bom22) # data formatting function
area_plt(bom22_line_area, "2022")       # area plot function

Box Office Mojo 2022 - Line Plot

#|label: line plot 2022
line_plt(bom22_line_area, "2022") # line plot function (data formatted in chunk above)

Box Office Mojo 2021 - Line Plot

#|label: data and line plot 2021
bom21_line_area <- bom_line_area(bom21)  # data formatting function
line_plt(bom21_line_area, "2021")        # line plot function

Box Office Mojo 2021 - Area Plot

#|label: area plot 2021
area_plt(bom21_line_area, "2021") # area plot function (data formatted in previous chunk)

Preview of Next week after Quiz 1

  • Cleaning Messy Data from Box Office Mojo Website

  • Examining/Cleaning Bureau of Labor Statistics data

  • Writing functions to automate data cleaning

  • Joining data from multiple datasets

  • HW 4 will be introduced