---
title: "Week 5"
subtitle: "Cheat Sheets, AI, Functions, and Review"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---
## Housekeeping
```{r include=F}
#|label: setup
knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks
options(scipen=100) # suppress scientific notation
# install pacman if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
pacman::p_load(pacman, tidyverse, gridExtra, magrittr,
kableExtra) # install and load required packages
p_loaded() # verify loaded packages
```
***Quiz 1 on Thursday 9/25***
- Weeks 1 - 4 (Lectures 1 - 8)
- Quiz questions will be similar (but not identical)
to Practice Questions
- Mix of R datasets and imported datasets
- I will provide R code to import data
- Quiz Template and data files will be provided in a
zipped project
- Review Practice Questions, HW assignments, and
Demo Videos
<br>
- **You will be required to download, unzip, and save
a project to your computer (not in Downloads), as part
of Quiz 1.**
## R Online Resources
- Some of what we have covered so far (Week 4 has a more
complete review):
- R projects, file structure and Quarto files
- Working with 'clean' data using the `dplyr` package
- common commands: `read_csv`, `filter`, `select`,
`slice`, `factor`
- Augmenting these commands with operators such as
`!`, `%in%`, `==`
- Using the pipe, `|>`, to make data management more
efficient
- Reference links for R operators:
- [**tutorialspoint**](https://www.tutorialspoint.com/r/r_operators.htm)\
- [**Quick-R**](https://www.statmethods.net/management/operators.html)
- Or google `R Operators`
- For R Markdown and `dplyr` commands there are R Cheat
Sheets
- [**Curated List of Text Resources for BUA
455**](https://docs.google.com/document/d/1qdqO7MTq7scYhFydkJuhA7JIUVQNldNXqMBOspXlNZk/edit?usp=sharing)
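
As a quick refresher on the operators listed above, here is a minimal sketch using the built-in `mtcars` dataset. The dataset is chosen only for illustration and the object names are hypothetical. Each `filter` call uses one operator:

```{r}
library(dplyr)

# `==` keeps rows where a value matches exactly (manual transmission, am = 1)
manual_cars <- mtcars |> filter(am == 1)

# `%in%` keeps rows where a value matches any element of a vector
cyl_4_or_6 <- mtcars |> filter(cyl %in% c(4, 6))

# `!` negates a condition (everything except 8-cylinder cars)
not_cyl_8 <- mtcars |> filter(!(cyl == 8))

# number of rows kept by each filter
nrow(manual_cars)
nrow(cyl_4_or_6)
nrow(not_cyl_8)
```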
## Using AI to help you write R code
- AI tools became usable in the classroom in 2023.
- My current AI of choice is **Copilot** for Windows.
- **ChatGPT** and **Gemini** on the Google platform are
also good.
- On the next slide I show the result of using Copilot for
Practice Question 12.
- Note that in this example I had to:
- Let Copilot know what R dataset this is.
##
### AI Prompt for Practice Question 12
- Note that I added the second line to tell Copilot which R
dataset is used.
- In Quizzes I will let you know the R dataset if that
information is needed.
- Students should also know which R datasets are being
used from doing the practice questions.
{height="4in"
fig-align="center"}
##
### AI Response for Practice Question 12
{height="6in"
fig-align="center"}
## Recommendations for using AI
- DO: use AI as a search engine to find code or correct
code when you are stuck.
- DO: use AI iteratively to build code by asking it one
question at a time.
- Add suggested code to your file, test the code, and
then either modify your question or ask a follow-up
question.
- DON'T: use AI in place of studying for the exam. Plugging
exam questions into an AI application will not work
unless you understand the question.
- AI can be used during the tests, but it won't help
you if you don't know what you are looking for or
how to phrase the queries correctly.
- I use AI to test my quiz questions to ensure that AI
tools will not provide fully correct code.
- AI can be helpful, but only if you understand the code
provided and can modify it correctly.
## Creating a Function
- Any task in R can be converted to a function.
- If you are only doing something once or twice, a function
is not needed.
- If you are doing the same task 4 or more times, a function
is very useful.
- Best Practice:
- Develop and refine the code to complete your tasks
- Subdivide the larger tasks into smaller shorter
tasks
::: fragment
#### Anatomy of a Function:
:::
::: fragment
```
Function_Name <- function(input_1, input_2, etc){
output <- command 1 to do "stuff" to inputs |>
command 2 to do "stuff" to inputs |>
command 3 to do "stuff" to inputs |> etc.
output # end with name of output so that it is "kicked out" of function
}
```
:::
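
To make the template concrete, here is a minimal sketch of a small function that follows the same anatomy. The function name `to_millions` and its `digits` argument are hypothetical. It performs the same conversion to millions that appears in this week's import code:

```{r}
# name <- function(inputs) { ... }
to_millions <- function(x, digits = 2) {
  x_mill <- (x / 1000000) |>   # command 1: convert dollars to millions
    round(digits)              # command 2: round to the requested digits
  x_mill                       # end with the output name so it is returned
}

to_millions(c(1234567, 98765432))
```

Ending the function with `x_mill` is what lets you save the result, e.g. `gross_m <- to_millions(gross)`.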
```{r echo=F, eval=F, include=F}
#|label: bom_cln_function
bom_cln <- function(data_file, yr, out_file){
d <- read_csv(data_file, show_col_types = F, skip=11) |>
select(1,4,7,9) |>
rename("date" = "Date",
"top10gross" = "Top 10 Gross",
"num_releases" = "Releases",
"num1gross" = "Gross") |>
filter(!is.na(top10gross)) |>
mutate(date = dmy(paste(date,yr)),
top10gross = gsub(pattern="$", replacement="", x=top10gross, fixed=T),
top10gross = gsub(pattern=",", replacement="", x=top10gross, fixed=T) |>
as.numeric(),
num1gross = gsub(pattern="$", replacement="", x=num1gross, fixed=T),
num1gross = gsub(pattern=",", replacement="", x=num1gross, fixed=T) |>
as.numeric()) |>
write_csv(out_file)
}
bom_cln("data/box_office_mojo_2022.csv", 2022, "data/box_office_mojo_2022_tidy.csv")
bom_cln("data/box_office_mojo_2021.csv", 2021, "data/box_office_mojo_2021_tidy.csv")
bom_cln("data/box_office_mojo_2020.csv", 2020, "data/box_office_mojo_2020_tidy.csv")
bom_cln("data/box_office_mojo_2019.csv", 2019, "data/box_office_mojo_2019_tidy.csv")
bom_cln("data/box_office_mojo_2018.csv", 2018, "data/box_office_mojo_2018_tidy.csv")
```
## Example and Review:
- Code below includes preview of `lubridate` functions to
create date, month, day, and quarter variables.
::: fragment
```{r}
#|label: bom_import
bom21_orig <- read_csv("data/box_office_mojo_2021_tidy.csv", show_col_types = F) |>
mutate(date = ymd(date), # converts ymd date text to date var
month = month(date, label = T, abbr = T), # creates month var from date var
day = wday(date, label=T, abbr = T), # creates wkday var from date var
qtr = quarter(date), # creates quarter var from date var
num_releases = as.integer(num_releases),
top10grossM = (top10gross/1000000) |> round(2),
num1grossM = (num1gross/1000000) |> round(2))
```
:::
- Below, `bom_basic` is a function that completes the
tasks above:
::: fragment
```{r bom_import basic function}
bom_basic <- function(data_file) {
d_out <- read_csv(data_file, show_col_types = F) |>
mutate(date = ymd(date),
month = month(date, label = T, abbr = T),
day = wday(date, label=T, abbr = T),
qtr = quarter(date),
num_releases = as.integer(num_releases),
top10grossM = (top10gross/1000000) |> round(2),
num1grossM = (num1gross/1000000) |> round(2))
d_out # outputs function results to screen or saved object name
}
```
:::
## What does `bom_basic` function do?
:::::: columns
::: {.column width="48%"}
```{r}
#|label: import with read_csv
b21 <- read_csv("data/box_office_mojo_2021_tidy.csv",
show_col_types = F) |>
glimpse(width=40)
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
```{r}
#|label: import with bom_basic function
bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv") |>
glimpse(width=40)
```
:::
::::::
## Week 5 In-class Exercises - Q1
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Using `lubridate` commands, we converted `date` to date
format (if needed) and created `month`, `day`, and `qtr`
variables from `date`.
- By default, `month` and `day` are ordinal factor
variables (`<ord>`).
- What is the default data type for `qtr` (quarter)?
::: fragment
A. character `<chr>`
B. decimal (double precision) `<dbl>`
C. factor `<fct>`
D. integer `<int>`
:::
## Week 5 In-class Exercises - Q2
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Here is the line that creates `qtr` within the mutate
statement.
The `quarter` command is part of the `lubridate` package:
- `qtr = quarter(date)`
::: fragment
Fill in the blank to convert this variable to a factor
variable as you create it:
:::
- `qtr = _____(quarter(date))`
## Function Demonstration - Multiple Years
- Once function code is developed and tested, we can
import 2, or 5, or even 10 data sets very efficiently.
::: fragment
```{r}
#|label: import all 5 datasets
bom22 <- bom_basic("data/box_office_mojo_2022_tidy.csv")
bom21 <- bom_basic("data/box_office_mojo_2021_tidy.csv")
bom20 <- bom_basic("data/box_office_mojo_2020_tidy.csv")
bom19 <- bom_basic("data/box_office_mojo_2019_tidy.csv")
bom18 <- bom_basic("data/box_office_mojo_2018_tidy.csv") |> glimpse( width=60)
```
:::
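
One possible extension, not part of the lecture code, is to apply a tested import function to a vector of file names with `purrr::map()`. The sketch below is self-contained: `import_demo` is a hypothetical stand-in for an import function, and the two small CSV files are made up and written to a temporary folder only so the example runs on its own.

```{r}
library(readr)
library(purrr)

# hypothetical stand-in for a tested import function
import_demo <- function(data_file) {
  read_csv(data_file, show_col_types = FALSE)
}

# create two tiny made-up files in a temporary folder
demo_dir <- tempfile("demo_data_")
dir.create(demo_dir)
demo_years <- 2021:2022
demo_files <- file.path(demo_dir, paste0("demo_", demo_years, ".csv"))
walk2(demo_files, demo_years,
      \(f, y) write_csv(data.frame(yr = y, gross = c(1, 2)), f))

# apply the same function to every file and name the results
demo_list <- map(demo_files, import_demo) |>
  set_names(paste0("demo", demo_years))
names(demo_list)
```

Each element of `demo_list` is one imported dataset, e.g. `demo_list$demo2021`. Calling the function once per file, as in the chunk above, is equally valid and easier to read when there are only a few files.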
## Function to Make Repeatable Plots
- A good practice is to subdivide tasks to make short
functions
- Recall the area plot we discussed in Week 3
- The code below modifies the data for the plot, first as a
standalone pipeline and then as a function:
::: fragment
```{r}
#|label: data mgmt for area plot
bom22_line_area_orig <- bom22 |>
select(date, top10grossM, num1grossM) |> # select variables
rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |> # rename for plot
pivot_longer(cols=`Top 10`:`No. 1`, # reshape data
names_to = "type", values_to = "grossM") |>
mutate(type=factor(type, levels=c("Top 10", "No. 1"))) # convert type of gross to a factor
```
<br>
```{r}
#|label: data mgmt function for area plot
bom_line_area <- function(data_in){
d_out <- data_in |>
select(date, top10grossM, num1grossM) |>
rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>
pivot_longer(cols=`Top 10`:`No. 1`,
names_to = "type", values_to = "grossM") |>
mutate(type=factor(type, levels=c("Top 10", "No. 1")))
d_out
}
bom22_line_area <- bom_line_area(bom22) # creates plot dataset for 2022
bom21_line_area <- bom_line_area(bom21) # creates plot dataset for 2021
```
:::
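
To see what the reshaping step does, here is a minimal sketch on a tiny made-up dataset. The values and the names `toy_wide` and `toy_long` are hypothetical. `pivot_longer` stacks the two gross columns into one `grossM` column, with `type` recording which column each value came from:

```{r}
library(dplyr)
library(tidyr)

# two days of made-up data in "wide" format: one column per type of gross
toy_wide <- tibble(date = as.Date(c("2022-01-01", "2022-01-02")),
                   `Top 10` = c(10.5, 12.25),
                   `No. 1` = c(4, 5.5))

# reshape to "long" format: one row per date and type
toy_long <- toy_wide |>
  pivot_longer(cols = `Top 10`:`No. 1`,
               names_to = "type", values_to = "grossM") |>
  mutate(type = factor(type, levels = c("Top 10", "No. 1")))

toy_long
```

The long format has one row per date and type, which is what `aes(fill = type)` or `aes(color = type)` needs in order to draw one area or line per type.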
## Function for Area Plot
- Functions are very useful for plots so that you don't
have to rewrite the same plot code for each dataset.
- The only text that changes from year to year is the
subtitle.
::: fragment
```{r bom area plot code}
area_plt22_orig <- bom22_line_area |>
ggplot() +
geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +
theme_classic() +
scale_fill_manual(values=c("blue", "lightblue")) +
labs(x="Date", y = "Gross ($Mill)", fill="",
title="Top 10 and No. 1 Movie Gross by Date",
subtitle="Jan. 1, 2022 - Dec. 31, 2022",
caption="Data Source: www.boxofficemojo.com") +
theme(legend.position="bottom",
legend.text = element_text(size = 12),
plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15),
plot.caption = element_text(size = 10),
plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
##
### Display of saved plot, `area_plt22_orig`
```{r display of area plot, echo=F, fig.dim=c(14,8), fig.align='center'}
area_plt22_orig
```
## Area Plot Function
```{r}
#|label: area plot function
area_plt<- function(data_in, yr){
data_in |>
ggplot() +
geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +
theme_classic() +
scale_fill_manual(values=c("blue", "lightblue")) +
labs(x="Date", y = "Gross ($Mill)", fill="",
title="Top 10 and No. 1 Movie Gross by Date",
subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
caption="Data Source: www.boxofficemojo.com") +
theme(legend.position="bottom",
legend.text = element_text(size = 12),
plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15),
plot.caption = element_text(size = 10),
plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}
```
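
The `yr` input is used only to build the subtitle. As a minimal sketch (the object names `yr_demo` and `subtitle_demo` are hypothetical), this is what the `paste` call inside `area_plt` produces for a given year:

```{r}
yr_demo <- "2022"

# paste() joins its arguments with a single space by default
subtitle_demo <- paste("Jan. 1,", yr_demo, "- Dec. 31,", yr_demo)
subtitle_demo
```

Because `paste` converts numbers to text, passing `2022` without quotes gives the same subtitle.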
## Line Plot Function
This is almost identical to the area plot function: `geom_line` replaces `geom_area`, and `color` replaces `fill`.
```{r}
#|label: line plot function
line_plt<- function(data_in, yr){
data_in |>
ggplot() +
geom_line(aes(x=date, y=grossM, color=type), linewidth=1) +
theme_classic() +
scale_color_manual(values=c("blue", "lightblue")) +
labs(x="Date", y = "Gross ($Mill)", color="",
title="Top 10 and No. 1 Movie Gross by Date",
subtitle=paste("Jan. 1,", yr,"- Dec. 31,", yr),
caption="Data Source: www.boxofficemojo.com") +
theme(legend.position="bottom",
legend.text = element_text(size = 12),
plot.title = element_text(size = 20),
axis.title = element_text(size=18),
axis.text = element_text(size=15),
plot.caption = element_text(size = 10),
plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
}
```
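
Because the two plot functions share the same theme settings, one optional refinement is to move those settings into their own small function. This is only a sketch of that idea, not part of the lecture code: `bom_theme_demo` and the toy data are hypothetical. It relies on the fact that a list of ggplot2 components can be added to a plot with `+`.

```{r}
library(ggplot2)

# hypothetical helper: returns the theme layers shared by both plot functions
bom_theme_demo <- function() {
  list(
    theme_classic(),
    theme(legend.position = "bottom",
          legend.text = element_text(size = 12),
          plot.title = element_text(size = 20),
          axis.title = element_text(size = 18),
          axis.text = element_text(size = 15),
          plot.caption = element_text(size = 10),
          plot.background = element_rect(colour = "darkgrey", fill = NA,
                                         linewidth = 2))
  )
}

# toy data, made up for illustration only
toy_plot_data <- data.frame(
  date = as.Date(c("2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02")),
  grossM = c(10, 12, 4, 5),
  type = factor(c("Top 10", "Top 10", "No. 1", "No. 1"),
                levels = c("Top 10", "No. 1"))
)

# the shared theme is added like any other layer
toy_line_plt <- ggplot(toy_plot_data) +
  geom_line(aes(x = date, y = grossM, color = type), linewidth = 1) +
  scale_color_manual(values = c("blue", "lightblue")) +
  bom_theme_demo()
toy_line_plt
```

With a helper like this, a change to the shared formatting only has to be made in one place.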
## Box Office Mojo 2022 - Area Plot
```{r, fig.dim=c(14,8), fig.align='center'}
#|label: data and area plot 2022
bom22_line_area <- bom_line_area(bom22) # data formatting function
area_plt(bom22_line_area, "2022") # area plot function
```
## Box Office Mojo 2022 - Line Plot
```{r, fig.dim=c(14,8), fig.align='center'}
#|label: line plot 2022
line_plt(bom22_line_area, "2022") # line plot function (data formatted in chunk above)
```
## Box Office Mojo 2021 - Line Plot
```{r, fig.dim=c(14,8), fig.align='center'}
#|label: data and line plot 2021
bom21_line_area <- bom_line_area(bom21) # data formatting function
line_plt(bom21_line_area, "2021") # line plot function
```
## Box Office Mojo 2021 - Area Plot
```{r, fig.dim=c(14,8), fig.align='center'}
#|label: area plot 2021
area_plt(bom21_line_area, "2021") # area plot function (data formatted in previous chunk)
```
## Preview of Next week after Quiz 1
:::::: columns
::: {.column width="48%"}
- Cleaning Messy Data from Box Office Mojo Website
- Examining/Cleaning Bureau of Labor Statistics data
- Writing functions to automate data cleaning
- Joining data from multiple datasets
- HW 4 will be introduced
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
::::::
##
### Key Points from This Week
::: fragment
**Review for Quiz 1**
:::
- Review Practice Questions
- Drop into Office Hours if you have additional questions.
::: fragment
**Automating Data Management and Plots with Functions**
:::
- Anatomy of a Function is always consistent
- Functions are useful for repetitive tasks, e.g., data from
the same source for multiple years
- Divide task into smaller tasks and create a function for
each task
- Fully develop and check code to complete tasks, then
convert to function.
::: fragment
You may submit an 'Engagement Question' about each lecture
until midnight on the day of the lecture. **A minimum of
four submissions is required during the semester.**
:::