HW Assignment 3

Instructions for HW 3

Purpose:

HW Assignment 3 (and Lecture 5) will guide you through importing, wrangling, and plotting data in an R Markdown file.

Steps to Follow

This template will be saved as an R markdown (.Rmd) file with the raw data in a zipped code_data_output folder.

First Steps:

1. Create a HW3_YourName project on your laptop in your BUA 455 folder
   - My project would be called HW3_Penelope_Pooler
2. Unzip and save this code_data_output folder to your HW 3 project.
3. Open R Markdown file (.Rmd) in R with project open.
4. In Markdown file: Change author above to your name.
5. In Markdown file: Change date above to due date 9/22/21
6. In Markdown file: Change title from HW 3 Template to HW Assignment 3
7. Rename R Markdown file as HW3_YourName.Rmd
   - e.g. my Markdown file would be HW3_Penelope_Pooler.Rmd

Once Steps 1-7 are completed, please do the following:

Run Chunk 1, the setup chunk to load packages and suppress scientific notation

In Chunks 2, 3, and 4, you will add comments to R Chunks above R commands as specified.

In Chunk 5 you will:

Write the R code to create a mutate(...) command
Convert 2 variables to integers
Create three new variables as specified
Add comments describing what was done as specified

In Chunk 6 you will:

Use select command to reorder the variables in the dataset
Add a comment describing what was done

In Chunk 7 you will:

Complete the geom_line(...) statement for each line in the plot
Add labels to the line plot using labs(...) command
Add comments describing these steps

In Chunk 8 you will:

Export the line plot as a .png file to your code_data_output folder
Export the tidy dataset, movies, as a .csv to your code_data_output folder
- We will use this tidy dataset for some additional In-class Exercises next week.
Add comments describing these export commands

Final Steps:

Knit completed Markdown file as an HTML file.
code_data_output should contain:
- completed R Markdown file (.Rmd)
- HTML file from knitting R Markdown file (.html)
- PNG file of final edited plot (.png)
- CSV file of tidy data (.csv)
Save project with completed code_data_output folder
Create README text file listing files in HW 3 project.
Zip this project and submit it.

Grading Criteria:

First Steps (2 pts.)
Adding complete comments to Chunks 2, 3, and 4 (2 pts. per chunk = 6 pts.)
Completing Chunk 5 correctly with comments (4 pts.)
Completing Chunk 6 correctly with comments (2 pts.)
Completing Chunk 7 (plot) correctly with comments (4 pts.)
Completing Chunk 8 correctly with comments (2 pts.)
Final Steps (3 pts.)

NOTES:

There are NO Blackboard questions for HW 3.
Next week (9/21 - 9/23) we will use these tidy data to cover:
- Creating summary tables using group_by(...) and summarize(...) (Review)
- Reshaping data
- Creating plots with reshaped data
There will be a short Blackboard assignment next week to help you practice these skills to review for Quiz 1

R Markdown Steps:

Setup

Chunk 1: setup (always)

All R Markdown files should start with a setup chunk.

This chunk with comments has been provided in the HW 3 template.

NOTES: One additional package, lubridate has been added to the p_load(...) statement.

include = F was replaced with message = F in the Chunk 1 header so you can examine the setup code in the HTML file.

#Set up and load function to ensure we have the tools we need when looking at the data

# this line specifies default options for all R chunks
knitr::opts_chunk$set(echo=T, highlight=T)

## Setup ====

# install and load packages we'll need
if (!require("pacman")) install.packages("pacman", repos = "http://cran.us.r-project.org")

# pacman:: prefix lets p_load run even if pacman was only just installed above
pacman::p_load(tidyverse, ggthemes, magrittr, lubridate)

# tidyverse - a large suite of packages that work together
# ggthemes - smaller add-on for tidyverse graphics package, ggplot2
# magrittr - needed for piping
# lubridate - needed for dealing with dates

# verify packages 
# remove # in front of library if needed
# library()

# suppress scientific notation
options(scipen=100)

Import/Tidy Data

Best practices in R Markdown suggest breaking up data wrangling tasks into multiple chunks.
Below, tasks for this HW are subdivided into 7 Chunks (Chunks 2 - 8).

Import/Examine Data

Chunk 2: import and examine data

For full credit you are required to add comments in R Chunk 2 below (using #) before each command in your own words to show that you understand what the command is doing.

OPTIONAL: Add your own text here describing what Chunk 2 does.

Steps:

Add comments describing what read_csv(...) does
- comment on what the skip = 12 option does #Tells R to skip the first 12 lines of the file before reading the data
- comment on what show_col_types = FALSE option does #Tells R that we don't want to show column types
- comment on what col_names = ... option does #specifies the names of the columns we want in the movies dataset

Add a comment describing what glimpse(...) command does

NOTE: col_names = ... option was added to this read_csv(...) import command because the variable names contain symbols that cause problems in R.
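As a rough illustration of how these three options work together, here is a minimal sketch on a tiny made-up file. The text in raw_lines and the object name toy are hypothetical stand-ins, not the real mojo210909.csv, and the sketch assumes readr 2.0 or later so that literal text can be passed with I(...).

```r
library(readr)

# Hypothetical stand-in for a raw file: 2 metadata lines, then a header row
# whose names contain spaces, then 2 rows of data.
raw_lines <- c("Report title (metadata)",
               "Downloaded 2021-09-09 (metadata)",
               "Date,Top 10 Gross",
               '9-Sep-21,"$5,863,916"',
               '8-Sep-21,"$6,675,960"')
raw_text <- paste0(paste(raw_lines, collapse = "\n"), "\n")

# skip = 3 skips the first 3 lines (2 metadata lines + the original header)
# show_col_types = FALSE suppresses the column type message
# col_names = ... supplies our own R-friendly variable names
toy <- read_csv(I(raw_text), skip = 3, show_col_types = FALSE,
                col_names = c("date", "top10_gross"))

toy
```

Because the original header row is skipped, the first line read is treated as data and the names come only from col_names.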

# Add comment(s) describing what read_csv does, 
# For full credit, comment should specify:
#   what the skip and show_col_types options do.
#   what the col_names option does

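#imports the data set and specifies separator values are commas 
#skip = 12 skips the first 12 lines of the file before reading the data
#show_col_types = FALSE hides the column type message; col_names sets the column names
#glimpse is used to see the columns of the dataset, and some of the data itself along with the type of data per column (chr, dbl, etc)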


movies <- read_csv("mojo210909.csv", skip=12, show_col_types = FALSE,
                   col_names=c("date", "day", "day_num", 
                               "top10_gross", "pct_chg_day", "pct_chg_wk",
                               "num_releases", "num1_release", "num1_gross")) |>
                   
  
# Add comment about glimpse (only required in Chunk 2)
  glimpse()

## Rows: 429
## Columns: 9
## $ date         <chr> "9-Sep-21", "8-Sep-21", "7-Sep-21", "6-Sep-21", "Labor Da…
## $ day          <chr> "Thursday", "Wednesday", "Tuesday", "Monday", NA, "Sunday…
## $ day_num      <dbl> 252, 251, 250, 249, NA, 248, 247, 246, 245, 244, 243, 242…
## $ top10_gross  <chr> "$5,863,916", "$6,675,960", "$9,169,492", "$27,571,995", …
## $ pct_chg_day  <chr> "-12.20%", "-27.20%", "-66.70%", "656.10%", NA, "-3%", "-…
## $ pct_chg_wk   <chr> "56.10%", "54.40%", "64.10%", "13%", NA, "136.30%", "53.6…
## $ num_releases <dbl> 28, 28, 28, 28, NA, 29, 29, 29, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross   <chr> "$3,908,701", "$4,614,556", "$6,619,036", "$19,284,160", …

Select Variables

Chunk 3: select variables

OPTIONAL: Add your own text here describing what Chunk 3 does.

Steps:

The Chunk below uses select(...) command to select only variables needed
- Add a comment describing what the select(!day) command does.
- Note: We drop this variable because we will replace it in Chunk 5.

Examine data using glimpse(...)

# Add a comment describing what this select command does
# NOTE: We drop this variable because we will create a better version with
#       wday command in lubridate package (Chunk 5)

#The select command tells R that we want to select all the data except the day
#glimpse does exactly what it sounds like, gives us a glimpse into the data

movies <- movies |>
  
  select(!day) |>

  glimpse()

## Rows: 429
## Columns: 8
## $ date         <chr> "9-Sep-21", "8-Sep-21", "7-Sep-21", "6-Sep-21", "Labor Da…
## $ day_num      <dbl> 252, 251, 250, 249, NA, 248, 247, 246, 245, 244, 243, 242…
## $ top10_gross  <chr> "$5,863,916", "$6,675,960", "$9,169,492", "$27,571,995", …
## $ pct_chg_day  <chr> "-12.20%", "-27.20%", "-66.70%", "656.10%", NA, "-3%", "-…
## $ pct_chg_wk   <chr> "56.10%", "54.40%", "64.10%", "13%", NA, "136.30%", "53.6…
## $ num_releases <dbl> 28, 28, 28, 28, NA, 29, 29, 29, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross   <chr> "$3,908,701", "$4,614,556", "$6,619,036", "$19,284,160", …

Filter/Clean/Convert Data

Chunk 4: filter and clean-up and convert data types

OPTIONAL: Add your own text here describing what Chunk 4 does.

Steps:

Add comment describing what filter(!is.na(day_num)) command does to the dataset.
- If you are not sure, examine the raw data before you run this chunk.
- filter(...) removes (filters out) rows as specified
- !is.na(day_num) tells R to only keep rows where day_num is not NA (missing)
- Recall that ! means not, so !is.na means NOT NA or NOT MISSING
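A minimal sketch of this filter step on a three-row toy dataset may help. The object names toy and kept are hypothetical; the middle row mimics the "Labor Day" holiday row in the raw data, which has no day_num.

```r
library(dplyr)

# Toy data: the holiday label row has a missing (NA) day_num
toy <- tibble(date    = c("6-Sep-21", "Labor Day", "5-Sep-21"),
              day_num = c(249, NA, 248))

# is.na(day_num) is TRUE only for the holiday row;
# ! flips it, so filter keeps the rows where day_num is NOT missing
kept <- toy |>
  filter(!is.na(day_num))

kept
```

Only the two rows with a day number remain, which is why the row count in the real data drops after this step.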

Add FIVE comments describing the FIVE statements included in the mutate(...) command:
- We remove nuisance symbols with the MAGICAL NUISANCE DESTROYER (MND)
- MND is gsub("[\\___,]", "__", ________)
  - First blank replaced by symbol to be removed
  - Second blank deleted OR replaced by what you want to replace first blank with.
  - Third blank replaced by variable name
- We convert date variable to be recognized as dates using dmy(...) command in the lubridate package.
  - lubridate commands can also be used for time or other date formats
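To see what the MND does on its own, here is a small sketch using a few values copied from the glimpse output above. The vector names gross_raw, pct_raw, gross_num, and pct_num are hypothetical and used only for illustration.

```r
# Character values as they appear before cleaning
gross_raw <- c("$5,863,916", "$6,675,960")
pct_raw   <- c("-12.20%", "656.10%")

# gsub replaces every $ or comma with nothing; as.numeric converts the
# cleaned text to numbers
gross_num <- as.numeric(gsub("[\\$,]", "", gross_raw))

# same idea for percent signs
pct_num <- as.numeric(gsub("[\\%,]", "", pct_raw))

gross_num
pct_num
```

Without the gsub step, as.numeric would return NA for these values because of the symbols.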

Examine data using glimpse()

# Add a comment explaining what this filter command combined
#   with !is.na() does to this dataset

#filters out (removes) rows where day_num is blank or missing (NA)

movies <- movies |>
  filter(!is.na(day_num)) |>
   
   #removes the $ and commas from top10_gross and converts it to a number
   #removes the $ and commas from num1_gross and converts it to a number
   #removes the % from pct_chg_day and converts it to a number
   #removes the % from pct_chg_wk and converts it to a number
   #converts the date variable from text to a date using the dmy command

# Add a comment explaining what mutate does.
# Add one comment for EACH of the five variable changes 
  mutate(top10_gross = as.numeric(gsub("[\\$,]", "", top10_gross)),
         num1_gross = as.numeric(gsub("[\\$,]", "", num1_gross)),
         pct_chg_day = as.numeric(gsub("[\\%,]", "", pct_chg_day)),
         pct_chg_wk = as.numeric(gsub("[\\%,]", "", pct_chg_wk)),
         date = dmy(date))|>
  
  glimpse()

## Rows: 252
## Columns: 8
## $ date         <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-09-…
## $ day_num      <dbl> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 24…
## $ top10_gross  <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201, …
## $ pct_chg_day  <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -22…
## $ pct_chg_wk   <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.4,…
## $ num_releases <dbl> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross   <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043, …

Convert/Create Variables

Chunk 5: convert and create variables

OPTIONAL: Add your own text here describing what Chunk 5 does.

Steps:

Use example above to create new empty mutate(...) command. Here is the R code to get you started:

movies <- movies |>
  mutate() |>
  
  glimpse()

Within the mutate(...) command:
- convert day_num and num_releases to integer variables.
- Chunk 4 shows how statements are included in a mutate(...) command separated by commas.
- The complete day_num conversion statement is shown below.
- You are asked to complete the num_releases conversion statement.
- Add both completed statements to mutate(...) command.

day_num = as.integer(day_num)
num_releases = ...

Create a new day variable using wday(...) command within the mutate(...) command:
- label = T option: day will be shown as name of day not a number
- abbr = T option: day will be abbreviated, e.g. Sun, Mon, etc.

day = wday(date, label=T, abbr=T)

Create a month variable using month(...) command within the mutate(...) command:
- abbr = T option: month will be abbreviated e.g. Jan, Feb, etc.

month = month(date, label=T, abbr = T)

Create a calculated variable, num1_pct_gross within the mutate(...) command:

num1_pct_gross = (num1_gross/top10_gross)*100

Examine data using glimpse()

Add FOUR comments in R Chunk using # to describe statements included in the mutate command to convert or create each variable
- One comment describing BOTH as.integer(...) conversions
- One comment describing what month = month(...) created
- One comment describing what day = wday(...) created
- One comment describing the calculated variable num1_pct_gross created

NOTE: dmy(...) command in Chunk 4 and wday(...) and month(...) commands in Chunk 5 will only work if lubridate package is loaded in p_load(...) command in setup.
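The sketch below tries these commands on a single value taken from the first row of the data shown above. The object names d and pct are hypothetical, and the day and month labels assume an English locale.

```r
library(lubridate)

# dmy reads text in day-month-year order and returns a Date
d <- dmy("9-Sep-21")

# label = TRUE shows names instead of numbers; abbr = TRUE abbreviates them
wday(d, label = TRUE, abbr = TRUE)
month(d, label = TRUE, abbr = TRUE)

# No. 1 movie gross as a percent of the top 10 gross for that day
pct <- (3908701 / 5863916) * 100
pct
```

The percent value should agree with the first num1_pct_gross entry in the Chunk 5 output, about 66.66.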

# Add FOUR comments at minimum describing the variables that are converted or created
# Add mutate and glimpse commands and complete mutate command as specified above.

#as.integer converts day_num and num_releases into integer variables
#day = wday(...) creates a day-of-week variable shown as abbreviated names (Sun, Mon, etc.) rather than numbers
#month = month(...) creates an abbreviated month variable (Jan, Feb, etc.)
#num1_pct_gross is a new variable: num1_gross as a percent of top10_gross
movies <- movies |>
  mutate(day_num = as.integer(day_num),
         num_releases = as.integer(num_releases),
         day = wday(date, label=T, abbr=T),
         month = month(date, label=T, abbr = T),
         num1_pct_gross = (num1_gross/top10_gross)*100) |>

  glimpse()

## Rows: 252
## Columns: 11
## $ date           <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-0…
## $ day_num        <int> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, …
## $ top10_gross    <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201…
## $ pct_chg_day    <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -…
## $ pct_chg_wk     <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.…
## $ num_releases   <int> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32,…
## $ num1_release   <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi…
## $ num1_gross     <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043…
## $ day            <ord> Thu, Wed, Tue, Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mon, …
## $ month          <ord> Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Aug, Aug, …
## $ num1_pct_gross <dbl> 66.65684, 69.12198, 72.18542, 69.94111, 65.71311, 65.14…

Reorder Variables

Chunk 6: reorder variables

OPTIONAL: Add your own text here describing what Chunk 6 does.

Steps:

Use select(...) to reorganize dataset as specified:
- order specified:
  - date, month, day, day_num, num_releases, num1_release, num1_gross, top10_gross, num1_pct_gross, pct_chg_day, pct_chg_wk
- Example code to get you started shown:

movies <- movies |>
  select(date, month, day, ...) |>
  
  glimpse()

Examine data to verify order using glimpse()

Add comment before select(...) command to explain what it is doing

# Add comment describing what select is doing to the dataset.
# Add select and glimpse commands and complete select command as specified above.

#select reorders the variables in the dataset to match the order listed in the parentheses
#glimpse command lets us verify the new variable order

movies <- movies |>
  select(date, month, day, day_num, num_releases, num1_release, num1_gross, top10_gross, num1_pct_gross, pct_chg_day, pct_chg_wk) |>
  
  glimpse()

## Rows: 252
## Columns: 11
## $ date           <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-0…
## $ month          <ord> Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Aug, Aug, …
## $ day            <ord> Thu, Wed, Tue, Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mon, …
## $ day_num        <int> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, …
## $ num_releases   <int> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32,…
## $ num1_release   <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi…
## $ num1_gross     <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043…
## $ top10_gross    <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201…
## $ num1_pct_gross <dbl> 66.65684, 69.12198, 72.18542, 69.94111, 65.71311, 65.14…
## $ pct_chg_day    <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -…
## $ pct_chg_wk     <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.…

Plot Data

Line Plot

Chunk 7: line plot showing two variables

OPTIONAL: Add your own text here describing what Chunk 7 does.

Steps

Add comment explaining what is being done in mutate command and why.
- If you are not sure:
  - Examine calculation
  - Comment out mutate(...) command and create plots to see the difference on the y-axis.

Complete these 2 geom_line(...) statements and add them after the ggplot() + line
- Both geom_line(...): Replace 1st blank with x variable, date, not in quotes.
- 1st geom_line(...): Replace 2nd blank with y variable, top10_gross, not in quotes.
- 2nd geom_line(...): Replace 2nd blank with y variable, num1_gross, not in quotes.
- NOTE: Each line in a ggplot sequence must end in plus, +

  geom_line(aes(x=______, y=_______, col="cornflowerblue"), size=1) +
  geom_line(aes(x=______, y=________, col="darkmagenta"), size=1) +

See all color choices here.

Complete and add labs(...) statement to:
- Format x and y labels
- Add a title, subtitle, and caption
  - Replace 1st blank in labs statement, after x =, with the x-axis label, Date
  - Replace 2nd blank in labs statement with a title.
  - Example Title: Movie Theater Gross by Day (or something similar)
  - Add labs(...) statement to ggplot code after theme_classic() + statement

labs(x = "_____", y = "Gross ($mil)", 
     title = "__________________ (2021)",
     subtitle = "Top 10 and No. 1 Movies",
     caption = "Data Source: https://www.boxofficemojo.com/")

Add additional comments
- Add brief comments describing what each geom_line(...) statement does.
- Add a brief comment describing what labs(...) command does

Once plot command is correct, remove eval=F from Chunk 7 header.

NOTES:

Every ggplot component must end with +, except for the final command, to chain all components together.
This plot is saved as an object, g_lines, AND printed to screen because the R code is enclosed in parentheses.
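As a small sketch of both notes, the toy example below chains components with + and wraps the whole assignment in parentheses. The data frame toy, its columns a and b, and the object g_toy are hypothetical and only loosely resemble the movies data.

```r
library(ggplot2)

# Toy data: five days and two made-up series
toy <- data.frame(date = as.Date("2021-09-01") + 0:4,
                  a    = c(5, 6, 9, 27, 34),
                  b    = c(3, 4, 6, 19, 22))

# Outer parentheses: the plot is saved as g_toy AND printed
(g_toy <- ggplot(toy) +
    # each geom_line adds one line; col inside aes creates a legend entry
    geom_line(aes(x = date, y = a, col = "cornflowerblue")) +
    geom_line(aes(x = date, y = b, col = "darkmagenta")) +
    # final component: no + at the end
    scale_color_manual(name = "",
                       values = c("cornflowerblue", "darkmagenta"),
                       labels = c("Series a", "Series b")))
```

Removing the outer parentheses would still save g_toy, but nothing would be displayed until g_toy is printed in a separate statement.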

# NOTE: full ggplot command won't run until at least one geom_line command is added below

# Add comment describing what mutate command is doing
# Add comments describing what geom_lines commands do
# Add comments describing what labs command does

# theme_classic() changes default theme to a simpler one 
# feel free to try a different theme

# scale_color_manual manually creates a legend from multiple lines
# scale_x_date allows for specifying breaks and labels on date axis
# %b in scale_x_date indicates month abbreviations (older code)

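#mutate command rescales top10_gross and num1_gross from dollars to millions of dollars so the y-axis is easier to read
#each geom_line adds one line to the plot with date on the x-axis: the first shows top10_gross, the second shows num1_gross
#labs command sets the x and y axis labels and adds the title, subtitle, and caption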
(g_lines <- movies |>

    mutate(top10_gross = top10_gross/1000000,
           num1_gross = num1_gross/1000000) |>

    ggplot() +

    geom_line(aes(x=date, y=top10_gross, col="cornflowerblue"), size=1) +
    geom_line(aes(x=date, y=num1_gross, col="darkmagenta"), size=1) +

# completed geom_lines commands are added here 
# Add completed labs statement here
# End labs statement with a +

    theme_classic() +

    labs(x = "Date", y = "Gross ($mil)", 
         title = "Daily Movie Theater Gross (2021)",
         subtitle = "Top 10 and No. 1 Movies",
         caption = "Data Source: https://www.boxofficemojo.com/") +

    #scale_color_manual creates a legend which specifies color of values and labels 
    scale_color_manual(name="",
                       values=c("cornflowerblue", "darkmagenta"),
                       labels=c("Top 10 Gross ($mil)", "No. 1 Gross ($mil)")) +

    scale_x_date(date_breaks = "month", 
                 date_labels = "%b"))

Export Plot/Data

Chunk 8: export plot and data

OPTIONAL: Add your own text here describing what Chunk 8 does.

Steps:

Export plot object g_lines to your code_data_output folder
- plot should be named "HW3_lineplot_YourName.png"
- For example, my plot would be named "HW3_lineplot_Penelope_Pooler.png"
- See HW 2 for a reminder of how to do this using three statements:
  1. png(...) statement
  2. g_lines
  3. dev.off()
- Incomplete example code to get you started:

png("_______.png")
g_lines
dev.off()

Export tidy data using write_csv(movies, ...) statement
- First input in write_csv is provided, the name of the dataset
- Second input is the file name: "HW3_movies_YourName.csv"
- This command should save the dataset to your code_data_output folder
- Incomplete example code to get you started:

movies |>
  write_csv("______.csv")

Add R comments in chunk describing what commands from Step 1 (plot export commands) and Step 2, write_csv(...), are doing.

NOTE: If we were done working with these data, we would edit the variable names before exporting the data. We will work with these data again next week.

# Add comment explaining what the three plot export commands are doing

# Follow instruction in Step 1 to export final line plot and add comments here to explain what these commands do
# HW 2 also has an example of exporting a plot


# Add comment explaining what write_csv does

# follow instructions in Step 2 to export dataset movies and add comments here to explain what write_csv command does

#png opens a new .png file with the given name, g_lines prints the plot into it, and dev.off() closes and saves the file
png("HW3_lineplot_Ari_Cohen.png")
g_lines
dev.off()

## quartz_off_screen 
##                 2

#write_csv exports the tidy movies dataset and saves it as a .csv file with the given name
movies |>
  write_csv("HW3_movies_Ari_Cohen.csv")

Final Steps

Final Steps (Reprinted from Above)

Knit completed Markdown file as an HTML file.
code_data_output should contain:
- completed R Markdown file (.Rmd)
- HTML file from knitting R Markdown file (.html)
- PNG file of final edited plot (.png)
- CSV file of tidy data (.csv)
Save project with completed code_data_output folder
Create README text file listing files in HW 3 project.
Zip this project and submit it.

HW Assignment 3

Ari Cohen

9/22/2021

End of HW Assignment 3