HW Assignment 3 (and lecture 5) will guide you through importing, wrangling, and plotting data in an R Markdown file.
This template will be saved as an R markdown (.Rmd) file with the raw data in a zipped code_data_output folder.
First Steps:
Change author above to your name.
Change date above to due date 9/22/21.
Change title from HW 3 Template to HW Assignment 3.
Once Steps 1-7 are completed, please do the following:
Run Chunk 1, the setup chunk, to load packages and suppress scientific notation
In Chunks 2, 3, and 4, you will add comments to R Chunks above R commands as specified.
In Chunk 5 you will:
mutate(...) command
In Chunk 6 you will:
In Chunk 7 you will:
geom_line(...) statement for each line in the plot
labs(...) command
In Chunk 8 you will:
Final Steps:
NOTES:
group_by(...) and summarize(...) (Review)
Chunk 1: setup (always)
All R Markdown files should start with a setup chunk.
This chunk with comments has been provided in the HW 3 template.
NOTES: One additional package, lubridate, has been added to the p_load(...) statement.
In the Chunk 1 header, include = F was replaced with message = F so you can examine the setup code in the HTML file.
#Set up and load function to ensure we have the tools we need when looking at the data
# this line specifies default options for all R Chunks
knitr::opts_chunk$set(echo=T, highlight=T)
## Setup ====
# install and load packages we'll need
if (!require("pacman")) install.packages("pacman", repos = "http://cran.us.r-project.org")
p_load(tidyverse, ggthemes, magrittr, lubridate)
# tidyverse - a large suite of packages that work together
# ggthemes - smaller add-on for tidyverse graphics package, ggplot2
# magrittr - needed for piping
# lubridate - needed for dealing with dates
# verify packages
# remove # in front of library if needed
# library()
# suppress scientific notation
options(scipen=100)
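The effect of the scipen option set in the setup chunk can be checked with a small sketch. The value below is made up purely for illustration, and only base R is used, so nothing here depends on the packages loaded above.

```r
# Hypothetical large value, chosen only to trigger scientific notation
big <- 1e10

# With the default penalty (scipen = 0), R prefers the shorter scientific form
options(scipen = 0)
default_text <- format(big)

# With a large penalty, R prefers fixed notation, as in the setup chunk
options(scipen = 100)
fixed_text <- format(big)

print(default_text)
print(fixed_text)
```

The first string is in scientific notation and the second is written out in full, which is why large dollar amounts in later chunks display as ordinary numbers.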
Best practices in R Markdown suggest breaking up data wrangling tasks into multiple chunks.
Below, tasks for this HW are subdivided into 7 Chunks (Chunks 2 - 8).
Chunk 2: import and examine data
For full credit you are required to add comments in R Chunk 2 below (using #) before each command in your own words to show that you understand what the command is doing.
OPTIONAL: Add your own text here describing what Chunk 2 does.
Steps:
read_csv(...) does
skip = 12 option does
#Tells R that we don't want to show column types
show_col_types = FALSE option does
#specifies the names of the columns we want in the movies dataset
col_names = ... option does
glimpse(...) command does
NOTE: The col_names = ... option was added to this read_csv(...) import command because the variable names contain symbols that cause problems in R.
# Add comment(s) describing what read_csv does,
# For full credit, comment should specify:
# what the skip and show_col_types options do.
# what the col_names option does
#read_csv imports the data set from a comma-separated (.csv) file
#skip = 12 skips the first 12 lines of the file before reading the data
#glimpse is used to see the columns of the dataset, and some of the data itself along with the type of data per column (chr, dbl, etc)
movies <- read_csv("mojo210909.csv", skip=12, show_col_types = FALSE,
col_names=c("date", "day", "day_num",
"top10_gross", "pct_chg_day", "pct_chg_wk",
"num_releases", "num1_release", "num1_gross")) |>
# Add comment about glimpse (only required in Chunk 2)
glimpse()
## Rows: 429
## Columns: 9
## $ date <chr> "9-Sep-21", "8-Sep-21", "7-Sep-21", "6-Sep-21", "Labor Da…
## $ day <chr> "Thursday", "Wednesday", "Tuesday", "Monday", NA, "Sunday…
## $ day_num <dbl> 252, 251, 250, 249, NA, 248, 247, 246, 245, 244, 243, 242…
## $ top10_gross <chr> "$5,863,916", "$6,675,960", "$9,169,492", "$27,571,995", …
## $ pct_chg_day <chr> "-12.20%", "-27.20%", "-66.70%", "656.10%", NA, "-3%", "-…
## $ pct_chg_wk <chr> "56.10%", "54.40%", "64.10%", "13%", NA, "136.30%", "53.6…
## $ num_releases <dbl> 28, 28, 28, 28, NA, 29, 29, 29, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross <chr> "$3,908,701", "$4,614,556", "$6,619,036", "$19,284,160", …
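The skip and column-name options used in Chunk 2 can be tried on a tiny example. This is only a sketch: the raw text is invented, and base R's read.csv(...) stands in for read_csv(...) so that it runs without any packages (header = FALSE plus col.names plays the role of col_names here).

```r
# Made-up raw text: two header lines that are not data, then two data rows
raw <- paste(
  "Box office report",
  "Generated for illustration only",
  "9-Sep-21,Thursday,252",
  "8-Sep-21,Wednesday,251",
  sep = "\n"
)

# skip = 2 drops the first two lines; header = FALSE plus col.names
# supplies our own variable names instead of reading them from the file
toy <- read.csv(text = raw, skip = 2, header = FALSE,
                col.names = c("date", "day", "day_num"))

str(toy)
```

Changing skip to 1 would make the second header line the first row read, which shows that skip counts lines from the top of the file rather than removing one particular row.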
Chunk 3: select variables
OPTIONAL:
Steps:
select(...) command to select only variables needed
select(!day) command does.
glimpse(...)
# Add a comment describing what this select command does
# NOTE: We drop this variable because we will create a better version with
# day command in lubridate package (Chunk 5)
#The select command tells R that we want to keep all the variables (columns) except day
#glimpse does exactly what it sounds like, gives us a glimpse into the data
movies <- movies |>
select(!day) |>
glimpse()
## Rows: 429
## Columns: 8
## $ date <chr> "9-Sep-21", "8-Sep-21", "7-Sep-21", "6-Sep-21", "Labor Da…
## $ day_num <dbl> 252, 251, 250, 249, NA, 248, 247, 246, 245, 244, 243, 242…
## $ top10_gross <chr> "$5,863,916", "$6,675,960", "$9,169,492", "$27,571,995", …
## $ pct_chg_day <chr> "-12.20%", "-27.20%", "-66.70%", "656.10%", NA, "-3%", "-…
## $ pct_chg_wk <chr> "56.10%", "54.40%", "64.10%", "13%", NA, "136.30%", "53.6…
## $ num_releases <dbl> 28, 28, 28, 28, NA, 29, 29, 29, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross <chr> "$3,908,701", "$4,614,556", "$6,619,036", "$19,284,160", …
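Dropping one variable by negation, as select(!day) does in Chunk 3, can be sketched in base R. The toy data frame below is invented for illustration, and the bracket syntax is a base R analogue rather than the dplyr command used in the assignment.

```r
# Toy data with the same three leading variables as the movies dataset
toy <- data.frame(date = c("9-Sep-21", "8-Sep-21"),
                  day = c("Thursday", "Wednesday"),
                  day_num = c(252, 251))

# Keep every column whose name is not "day"
toy_no_day <- toy[, names(toy) != "day", drop = FALSE]

names(toy_no_day)
```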
Chunk 4: filter and clean-up and convert data types
OPTIONAL: Add your own text here describing what Chunk 4 does.
Steps:
1. Add comment describing what filter(!is.na(day_num)) command does to the dataset.
If you are not sure, examine the raw data before you run this chunk.
filter(...) removes (filters out) rows as specified
!is.na(day_num) tells R to only keep rows where day_num is not NA (missing)
Recall that ! means not so !is.na means NOT NA or NOT MISSING
mutate(...) command:
We remove nuisance symbols with the MAGICAL NUISANCE DESTROYER (MND)
MND is gsub("[\\___,]", "__", ________)
We convert date variable to be recognized as dates using dmy(...) command in the lubridate package.
lubridate commands can also be used for time or other date formats
# Add a comment explaining what this filter command combined
# with !is.na() does to this dataset
#filters out rows where day_num is blank or missing (NA)
movies <- movies |>
filter(!is.na(day_num)) |>
# Add a comment explaining what mutate does.
# Add one comment for EACH of the five variable changes
#mutate changes existing variables (or creates new ones) in the dataset
#removes the $ and commas from top10_gross and converts it to numeric
#removes the $ and commas from num1_gross and converts it to numeric
#removes the % from pct_chg_day and converts it to numeric
#removes the % from pct_chg_wk and converts it to numeric
#converts the date variable to be recognized as dates using dmy command
mutate(top10_gross = as.numeric(gsub("[\\$,]", "", top10_gross)),
num1_gross = as.numeric(gsub("[\\$,]", "", num1_gross)),
pct_chg_day = as.numeric(gsub("[\\%,]", "", pct_chg_day)),
pct_chg_wk = as.numeric(gsub("[\\%,]", "", pct_chg_wk)),
date = dmy(date))|>
glimpse()
## Rows: 252
## Columns: 8
## $ date <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-09-…
## $ day_num <dbl> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 24…
## $ top10_gross <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201, …
## $ pct_chg_day <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -22…
## $ pct_chg_wk <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.4,…
## $ num_releases <dbl> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32, 3…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi a…
## $ num1_gross <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043, …
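The filter and clean-up steps in Chunk 4 can be sketched on three rows that mimic the raw data, including the holiday label row that has no day_num. This is a base R approximation under stated assumptions: the bracket filter stands in for filter(...), and as.Date(...) with a format string stands in for dmy(...), which assumes English month abbreviations in the session locale.

```r
toy <- data.frame(
  date = c("6-Sep-21", "Labor Day", "5-Sep-21"),
  day_num = c(249, NA, 248),
  top10_gross = c("$27,571,995", NA, "$34,538,599"),
  pct_chg_day = c("656.10%", NA, "-3%"),
  stringsAsFactors = FALSE
)

# Keep only rows where day_num is not missing (drops the holiday label row)
toy <- toy[!is.na(toy$day_num), ]

# [$,] is a character class matching a dollar sign or a comma;
# each match is replaced with an empty string before conversion
toy$top10_gross <- as.numeric(gsub("[$,]", "", toy$top10_gross))
toy$pct_chg_day <- as.numeric(gsub("%", "", toy$pct_chg_day, fixed = TRUE))

# Day-month-year text to Date; %b relies on English month abbreviations
toy$date <- as.Date(toy$date, format = "%d-%b-%y")

str(toy)
```

Converting with as.numeric(...) before removing the symbols would return NA for every value, which is why the nuisance characters have to be stripped first.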
Chunk 5: convert and create variables
OPTIONAL: Add your own text here describing what Chunk 5 does.
Steps:
mutate(...) command. Here is the R code to get you started:
movies <- movies |>
mutate() |>
glimpse()
mutate(...) command:
day_num and num_releases to integer variables.
mutate(...) command separated by commas.
day_num conversion statement is shown below.
num_releases conversion statement.
mutate(...) command.
day_num = as.integer(day_num)
num_releases = ...
wday(...) command within the mutate(...) command:
label = T option: day will be shown as name of day not a number
abbr = T option: day will be abbreviated, e.g. Sun, Mon, etc.
day = wday(date, label=T, abbr=T)
month(...) command within the mutate(...) command:
abbr = T option: month will be abbreviated e.g. Jan, Feb, etc.
month = month(date, label=T, abbr = T)
num1_pct_gross within the mutate(...) command:
num1_pct_gross = (num1_gross/top10_gross)*100
as.integer(...) conversions
month = month(...) created
day = wday(...) created
num1_pct_gross created
NOTE: The dmy(...) command in Chunk 4 and the wday(...) and month(...) commands in Chunk 5 will only work if the lubridate package is loaded in the p_load(...) command in setup.
# Add FOUR comments at minimum describing the variables that are converted or created
# Add mutate and glimpse commands and complete mutate command as specified above.
#as.integer converts day_num and num_releases into integer variables
#day = wday(...) creates a day-of-week variable shown as abbreviated names (Sun, Mon, etc.) rather than numbers
#month = month(...) creates an abbreviated month variable (Jan, Feb, etc.)
#num1_pct_gross is a new variable: num1_gross divided by top10_gross, times 100 (a percentage)
movies <- movies |>
mutate( day_num = as.integer(day_num),
num_releases = as.integer(num_releases),
day = wday(date, label=T, abbr=T),
month = month(date, label=T, abbr = T),
num1_pct_gross = (num1_gross/top10_gross)*100)|>
glimpse()
## Rows: 252
## Columns: 11
## $ date <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-0…
## $ day_num <int> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, …
## $ top10_gross <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201…
## $ pct_chg_day <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -…
## $ pct_chg_wk <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.…
## $ num_releases <int> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32,…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi…
## $ num1_gross <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043…
## $ day <ord> Thu, Wed, Tue, Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mon, …
## $ month <ord> Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Aug, Aug, …
## $ num1_pct_gross <dbl> 66.65684, 69.12198, 72.18542, 69.94111, 65.71311, 65.14…
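The conversions and new variables from Chunk 5 can be checked by hand on the first two rows shown in the output. The sketch below uses base R only: as.POSIXlt(...) and format(...) are stand-ins for wday(...) and month(...), and the helper columns day_index and month_index are hypothetical names added for illustration.

```r
toy <- data.frame(
  date = as.Date(c("2021-09-09", "2021-09-08")),
  day_num = c(252, 251),
  top10_gross = c(5863916, 6675960),
  num1_gross = c(3908701, 4614556)
)

# Double to integer, as in the as.integer(...) conversions
toy$day_num <- as.integer(toy$day_num)

# POSIXlt stores weekday as 0-6 (0 = Sunday) and month as 0-11
parts <- as.POSIXlt(toy$date)
toy$day_index <- parts$wday
toy$month_index <- parts$mon + 1

# Abbreviated labels; the text depends on the session locale
toy$day <- format(toy$date, "%a")
toy$month <- format(toy$date, "%b")

# Share of the top 10 gross earned by the No. 1 movie, as a percentage
toy$num1_pct_gross <- (toy$num1_gross / toy$top10_gross) * 100

print(toy)
```

The resulting percentages agree with the first two num1_pct_gross values in the glimpse output (about 66.66 and 69.12).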
Chunk 6: reorder variables
OPTIONAL: Add your own text here describing what Chunk 6 does.
Steps:
select(...) to reorganize dataset as specified:
movies <- movies |>
select(date, month, day, ...) |>
glimpse()
select(...) command to explain what it is doing
# Add comment describing what select is doing to the dataset.
# Add select and glimpse commands and complete select command as specified above.
#we are reordering the variables in the data set to match the order they are listed in the parentheses
#glimpse command gives us a look into what we did in the step above
movies <- movies |>
select(date, month, day, day_num, num_releases, num1_release, num1_gross, top10_gross, num1_pct_gross, pct_chg_day, pct_chg_wk) |>
glimpse()
## Rows: 252
## Columns: 11
## $ date <date> 2021-09-09, 2021-09-08, 2021-09-07, 2021-09-06, 2021-0…
## $ month <ord> Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Sep, Aug, Aug, …
## $ day <ord> Thu, Wed, Tue, Mon, Sun, Sat, Fri, Thu, Wed, Tue, Mon, …
## $ day_num <int> 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, …
## $ num_releases <int> 28, 28, 28, 28, 29, 29, 29, 32, 32, 32, 32, 32, 32, 32,…
## $ num1_release <chr> "Shang-Chi and the Legend of the Ten Rings", "Shang-Chi…
## $ num1_gross <dbl> 3908701, 4614556, 6619036, 19284160, 22696386, 23190043…
## $ top10_gross <dbl> 5863916, 6675960, 9169492, 27571995, 34538599, 35596201…
## $ num1_pct_gross <dbl> 66.65684, 69.12198, 72.18542, 69.94111, 65.71311, 65.14…
## $ pct_chg_day <dbl> -12.2, -27.2, -66.7, 656.1, -3.0, -5.9, 906.2, -13.1, -…
## $ pct_chg_wk <dbl> 56.1, 54.4, 64.1, 13.0, 136.3, 53.6, 103.2, -16.9, -11.…
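Reordering variables by listing them in the desired order, as select(...) does in Chunk 6, can be sketched in base R. The one-row data frame is invented, and indexing by a vector of names is a base R analogue of the dplyr command.

```r
# Columns deliberately created in the "wrong" order
toy <- data.frame(day_num = 252L, day = "Thu", month = "Sep",
                  date = as.Date("2021-09-09"))

# Listing names in the desired order returns the columns in that order
toy <- toy[, c("date", "month", "day", "day_num")]

names(toy)
```

Any variable left out of the list would be dropped, so a reordering step like this one needs to name every variable that should be kept.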
Chunk 7: line plot showing two variables
OPTIONAL: Add your own text here describing what Chunk 7 does.
Steps
mutate(...) command and create plots to see the difference on the y-axis.
geom_line(...) statements and add them after the ggplot() + line
geom_line(...): Replace 1st blank with x variable, date, not in quotes.
geom_line(...): Replace 2nd blank with y variable, top10_gross, not in quotes.
geom_line(...): Replace 2nd blank with y variable, num1_gross, not in quotes.
+ geom_line(aes(x=______, y=_______, col="cornflowerblue"), size=1) +
geom_line(aes(x=______, y=________, col="darkmagenta"), size=1) +
labs(...) statement to:
Date
Movie Theater Gross by Day (or something similar)
labs(...) statement to ggplot code after theme_classic() + statement
labs(x = "_____", y = "Gross ($mil)",
title = "__________________ (2021)",
subtitle = "Top 10 and No. 1 Movies",
caption = "Data Source: https://www.boxofficemojo.com/")
geom_line(...) statement does.
labs(...) command does
eval=F from Chunk 7 header.
NOTES:
Every ggplot component must end with +, except for the final command, to chain all components together.
This plot is saved as an object, g_lines, AND printed to screen because the R code is enclosed in parentheses.
# NOTE: full ggplot command won't run until at least one geom_line command is added below
# Add comment describing what mutate command is doing
# Add comments describing what geom_lines commands do
# Add comments describing what labs command does
# theme_classic() changes default theme to a simpler one
# feel free to try a different theme
# scale_color_manual manually creates a legend from multiple lines
# scale_x_date allows for specifying breaks and labels on date axis
# %b in scale_x_date indicates month abbreviations (older code)
#mutate command rescales top10_gross and num1_gross from dollars to millions of dollars for plotting
#geom_line draws a line connecting the observations in order of the x variable (date); one geom_line per variable plotted
#labs sets the axis labels, title, subtitle, and caption of the plot
(g_lines <- movies |>
mutate(top10_gross = top10_gross/1000000,
num1_gross = num1_gross/1000000) |>
ggplot() +
geom_line(aes(x=date, y=top10_gross, col="cornflowerblue"), size=1) +
geom_line(aes(x=date, y=num1_gross, col="darkmagenta"), size=1) +
# completed geom_lines commands are added here
# Add completed labs statement here
# End labs statement with a +
theme_classic() +
labs(x = "Date", y = "Gross ($mil)",
title = "Daily Movie Theater Gross (2021)",
subtitle = "Top 10 and No. 1 Movies",
caption = "Data Source: https://www.boxofficemojo.com/") +
#scale_color_manual creates a legend which specifies color of values and labels
scale_color_manual(name="",
values=c("cornflowerblue", "darkmagenta"),
labels=c("Top 10 Gross ($mil)", "No. 1 Gross ($mil)")) +
scale_x_date(date_breaks = "month",
date_labels = "%b"))
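The note about enclosing the plot code in parentheses applies to any assignment in R, not only to plots. A minimal base R sketch, using capture.output(...) to record what each statement would print (the names a, b, quiet, and loud are made up for illustration):

```r
# Plain assignment: the value is stored but nothing is printed
quiet <- capture.output(a <- 2 + 3)

# Assignment wrapped in parentheses: the value is stored and printed
loud <- capture.output((b <- 2 + 3))

print(quiet)
print(loud)
```

Both a and b hold the same value, but only the parenthesized version produces printed output. In Chunk 7 the same mechanism saves the plot as g_lines and displays it in one step.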
Chunk 8: export plot and data
OPTIONAL: Add your own text here describing what Chunk 8 does.
Steps:
g_lines to your code_data_output folder
"HW3_lineplot_YourName.png"
"HW3_lineplot_Penelope_Pooler.png"
png(...) statement
g_lines
dev.off()
png("_______.png")
g_lines
dev.off()
write_csv(movies, ...) statement
"HW3_movies_YourName.csv"
movies |>
write_csv("______.csv")
write_csv(...) are doing.
NOTE: If we were done working with these data, we would edit the variable names before exporting the data. We will work with these data again next week.
# Add comment explaining what the three plot export commands are doing
# Follow instruction in Step 1 to export final line plot and add comments here to explain what these commands do
# HW 2 also has an example of exporting a plot
# Add comment explaining what write_csv does
# follow instructions in Step 2 to export dataset movies and add comments here to explain what write_csv command does
#png opens a .png file with the specified name, g_lines prints the plot to that file, and dev.off() closes and saves the file
png("HW3_lineplot_Ari_Cohen.png")
g_lines
dev.off()
## quartz_off_screen
## 2
#exports and saves the tidy movies dataset as a .csv file with the specified name
movies |>
write_csv("HW3_movies_Ari_Cohen.csv")
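One way to confirm that an export worked is to read the file back in. The sketch below is a base R illustration, not the assignment's code: write.csv(...) and read.csv(...) stand in for write_csv(...) and read_csv(...), the file name is hypothetical, and a temporary folder is used so nothing in code_data_output is overwritten.

```r
toy <- data.frame(date = as.Date(c("2021-09-09", "2021-09-08")),
                  top10_gross = c(5863916, 6675960))

# Write to a temporary file so nothing in the project folder is touched
out_file <- file.path(tempdir(), "toy_movies.csv")
write.csv(toy, out_file, row.names = FALSE)

# Read the file back to confirm what was saved
check <- read.csv(out_file)
str(check)
```

With base R's read.csv(...), the date column comes back as text, so it would need to be converted to dates again after re-importing.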
Final Steps (Reprinted from Above)