Working with dates in R was much more difficult before the lubridate package was developed.
Dates are still troublesome in other software environments.
Below we create a date variable from the provided character variable, create other variables, examine data, and export the dataset with write_csv.
#| label: date example with lubridate
bom23 <- bom23 |>
  mutate(date = dmy(paste(Date, "2023")),         # year is required
                                                  # we paste it (add it as text) to each date
         month = month(date, label=T, abbr=T),    # month shown as 3 letter abbr.
         day = wday(date, label=T, abbr=T),       # weekday shown as 3 letter abbr.
         quart = quarter(date)) |>                # quarter shown as number
  select(date, month, day, quart, top10gross:num1) |>   # select and reorder variables
  glimpse() |>                                    # examine data
  write_csv("data/Box_Office_Mojo_Week3_HW3.csv") # export using write_csv
Notice that in the prior chunk, we use the command read_csv
True or False:
read_csv and read.csv are the same and can be used interchangeably to import data.
Hint: Here are three ways to determine this:
R help: In console type ?read_csv and/or type ?read.csv and look through documentation
Google R read_csv and read.csv
Ask ‘ChatGPT’, ‘Copilot’, or another AI search engine.
Note: R help files are sometimes hard to decipher, and Googling often requires time and effort, but both are excellent resources. AI search engines are getting better, but are not always 100% accurate.
    month               monthF        day               wkdayF
 Length:365         Jan    : 31   Length:365         M :52
 Class :character   Mar    : 31   Class :character   T :52
 Mode  :character   May    : 31   Mode  :character   W :52
                    Jul    : 31                      Th:52
                    Aug    : 31                      F :52
                    Oct    : 31                      Sa:52
                    (Other):179                      Su:53
# A tibble: 6 × 10
  date       monthF wkdayF quart num_releases num1gross num1grossM top10gross
  <date>     <fct>  <fct>  <dbl>        <int>     <dbl>      <dbl>      <dbl>
1 2023-12-31 Dec    Su         4           43   5208897       5.21   23078184
2 2023-12-30 Dec    Sa         4           44   8637841       8.64   40050370
3 2023-12-29 Dec    F          4           44   8630268       8.63   37348409
4 2023-12-28 Dec    Th         4           46   7988504       7.99   33261609
5 2023-12-27 Dec    W          4           45   8135639       8.14   33892628
6 2023-12-26 Dec    T          4           45   8970413       8.97   41788862
# ℹ 2 more variables: top10grossM <dbl>, num1pct <dbl>
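The date-derived variables shown in the output above can be reproduced on a very small scale. The sketch below uses a hypothetical toy tibble (`toy_dates`, not the course data) and assumes the text dates are in day-month order, as `dmy` expects. Weekday and month labels depend on your locale, so they may print differently on your computer.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: day and month stored as text, with no year
toy_dates <- tibble(Date = c("31 Dec", "4 Jul", "1 Jan"))

toy_dates <- toy_dates |>
  mutate(date  = dmy(paste(Date, "2023")),               # paste adds the year as text
         month = month(date, label = TRUE, abbr = TRUE), # month as abbreviated label
         day   = wday(date, label = TRUE, abbr = TRUE),  # weekday as abbreviated label
         quart = quarter(date))                          # quarter as a number

glimpse(toy_dates)  # date is now <date>; month and day are ordered factors
```

Without the pasted year, `dmy` has no way to build a complete date, which is why the year is added as text first.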
💥 Week 3 In-class Exercises - Q2 💥
Session ID: bua455f24
This is BB Question 2 in HW 3
The correct command used to convert a numeric variable to an integer variable is
____().
When you glimpse the data after Part 2 (Chunk 3) in HW 3, the type for the num_releases variable is shown as
<____> instead of <dbl>.
Grouping and Filtering Data
We can filter data by value within each group.
R command group_by allows us to group data before we filter.
Data are filtered by value WITHIN each specified group
Ungrouping data afterwards using ungroup is not required, but often helpful.
The example below is not used in the subsequent summary but can be very useful.
#| label: filter to last day of month
mojo_23_mnth_end <- mojo_23_mod |>
  select(date, monthF, top10grossM) |>
  group_by(monthF) |>               # doesn't change data appearance
  filter(date == max(date)) |>
  ungroup() |>                      # ungroup not required but helpful
  glimpse()
Rows: 12
Columns: 3
$ date        <date> 2023-12-31, 2023-11-30, 2023-10-31, 2023-09-30, 2023-08-3…
$ monthF      <fct> Dec, Nov, Oct, Sep, Aug, Jul, Jun, May, Apr, Mar, Feb, Jan
$ top10grossM <dbl> 23.08, 5.28, 9.82, 30.32, 5.27, 30.83, 41.92, 14.13, 27.13…
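To see why grouping matters before filtering, here is a minimal sketch on a hypothetical toy tibble (`toy`, with made-up values). The same `filter` condition keeps one row overall without grouping, but one row per group after `group_by`.

```r
library(dplyr)

# Hypothetical toy data: two groups with made-up values
toy <- tibble(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 5, 2, 9, 4)
)

# Without grouping: max(val) is the overall maximum, so only one row is kept
toy_max <- toy |>
  filter(val == max(val))

# With grouping: max(val) is found WITHIN each group, so one row per group is kept
toy_grp_max <- toy |>
  group_by(grp) |>
  filter(val == max(val)) |>
  ungroup()

toy_max
toy_grp_max
```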
Grouping and Summarizing Data
We will summarize data and then reshape it for a summary table.
R commands group_by and summarize allow us to summarize the data by category
When summarizing data, it is easier to select the variables you want first.
Plan what you want to do
mojo_23_smry <- mojo_23_mod |>
  select(monthF, wkdayF, top10grossM) |>
  group_by(monthF, wkdayF) |>       # doesn't change data appearance
  summarize(avg_top10gross = mean(top10grossM, na.rm=T),
            mdn_top10gross = median(top10grossM, na.rm=T),
            max_top10gross = max(top10grossM, na.rm=T)) |>
  ungroup() |>                      # ungroup not required but helpful
  glimpse()
# A tibble: 10 × 3
   Month Day   max_top10gross
   <fct> <chr>          <dbl>
 1 Jan   M              32.6
 2 Jan   T              17.0
 3 Jan   W              12.1
 4 Jan   Th             10.9
 5 Jan   F              31.0
 6 Jan   Sa             44.6
 7 Jan   Su             36.2
 8 Feb   M              21.2
 9 Feb   T              12.4
10 Feb   W               6.49
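The group_by and summarize pattern used for the summary above can be sketched on a hypothetical toy tibble (`toy_sales`, with made-up values). This version uses the `.groups = "drop"` argument of `summarize` as one alternative to a separate `ungroup()` step; either approach returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy_sales <- tibble(
  month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb"),
  day   = c("Sa", "Sa", "Su", "Sa", "Su", "Su"),
  gross = c(10, 20, 6, 8, NA, 4)
)

toy_smry <- toy_sales |>
  group_by(month, day) |>                             # one group per month/day pair
  summarize(avg_gross = mean(gross, na.rm = TRUE),    # na.rm = TRUE skips the NA
            max_gross = max(gross, na.rm = TRUE),
            .groups = "drop")                         # return an ungrouped tibble

toy_smry  # one row for each month/day combination in the data
```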
#| label: stacked no background
mojo_23_long <- mojo_23_long |>     # Day converted to factor to specify order
  mutate(Day = factor(Day, levels=c("M", "T", "W", "Th", "F", "Sa", "Su")))

(mojo_barplot <- mojo_23_long |>
  ggplot() +
  geom_bar(aes(x=Month, y=max_top10gross, fill=Day), stat="identity") +
  theme_classic())

#| label: side by side
(mojo_barplot <- mojo_23_long |>
  ggplot() +
  geom_bar(aes(x=Month, y=max_top10gross, fill=Day), stat="identity", position="dodge") +
  theme_classic())
Because the plot is saved as an object in the Global Environment, we can add formatting to it without rebuilding it.
#| label: label formatting
(mojo_barplot <- mojo_barplot +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(nrow = 1)) +
  labs(x="", y="Maximum Daily Gross ($M)",
       title = "Maximum Daily Gross of Top 10 Films by Month and Day of Week",
       caption = "Data Source: www.boxofficemojo.com"))
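The idea of building a plot in steps can be sketched with a hypothetical toy data frame (`toy`, with made-up values). The plot is saved first, then formatting is added to the saved object and the result overwrites it.

```r
library(ggplot2)

# Hypothetical toy data, not the course dataset
toy <- data.frame(
  month = c("Jan", "Jan", "Feb", "Feb"),
  day   = c("Sa", "Su", "Sa", "Su"),
  gross = c(10, 20, 30, 40)
)

# Step 1: build and save a basic side-by-side barplot
toy_plot <- ggplot(toy) +
  geom_bar(aes(x = month, y = gross, fill = day),
           stat = "identity", position = "dodge") +
  theme_classic()

# Step 2: add formatting to the saved object and overwrite it
toy_plot <- toy_plot +
  theme(legend.position = "bottom") +
  labs(x = "", y = "Gross ($M)", title = "Toy Example")

toy_plot  # print the updated plot
```

Wrapping an assignment in parentheses, as in the course chunks, saves and prints the plot in one step.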
#| label: area plot code
area_plt <- mojo_23_line_area |>
  ggplot() +                                                # changed to geom_area
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) +  # changed color to fill
                                                            # linewidth replaces size in ggplot2 3.4+
  theme_classic() +
  theme(legend.position="bottom") +
  scale_fill_manual(values=c("blue", "lightblue")) +        # changed color to fill
  labs(x="Date", y ="Gross ($Mill)", fill="",               # changed color to fill
       title="Top 10 and No. 1 Movie Gross by Date",
       subtitle="Jan. 1, 2023 - Dec. 31, 2023",
       caption="Data Source: www.boxofficemojo.com") +
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
💥 Week 3 In-class Exercises 💥
Lecture 6 - Q1 - NOT ON PointSolutions
In class we will practice:
Running chunks and exporting a table.
Preview for 1 Question in Quiz 1 where you will:
Select variables from a provided dataset
Group and summarize data
Export a summary table as a .csv file and submit it.
Instructions for In-class Exercise
Save Week 3 R project to your computer.
Open this project by clicking on .Rproj file.
Open .Rmd file within open R project.
Run all chunks above this exercise.
Modify the chunk below to:
Round all values in columns 2-4 of mojo_23_fall_wknd to 1 decimal place using round.
Export mojo_23_fall_wknd as a .csv file with your name.
Submit this .csv file with your name in the Week 3 In-class Exercise in the In-class Exercises folder on Blackboard.
NOTE: This counts as part of your in-class participation for the Week 3 lectures (due Fri. at midnight).
R Code Chunk for In-class Exercise
Remove , eval=F from chunk header. This will allow code in chunk to run when it is rendered.
Remove the # and complete round command to round numeric columns (columns 2 - 4) to 1 decimal place.
Choose EITHER of the write_csv commands and edit it so dataset will be exported to the data folder with your name.
Delete write_csv command you don’t edit or put # symbols in front of it.
Submit .csv file with your name in the filename
#| label: round and export summary dataset
mojo_23_fall_wknd |> glimpse() # examine data with glimpse

# round columns 2, 3 and 4 only

# export summary dataset using write_csv without piping
write_csv(mojo_23_fall_wknd, "data/Movie_Gross_Fall_2023_Weekends_FirstName_Last_Name.csv")

# export summary dataset using write_csv with piping
mojo_23_fall_wknd |>
  write_csv("data/Movie_Gross_Fall_2023_Weekends_FirstName_Last_Name.csv")
💥 Week 3 In-class Exercises 💥
Lecture 6 - Q2 - NOT ON PointSolutions
Practice:
If all the columns in a dataset are numeric, you can round the whole dataset at once with the command round(<name of dataset>).
Why wouldn’t that work for the dataset in the previous exercise, mojo_23_fall_wknd?
Hint: To answer this question, you are encouraged to:
try running the command round(mojo_23_fall_wknd).
examine the data using glimpse.
💥 Week 3 In-class Exercises - Q5 💥
Session ID: bua455f24
Which of the following commands should NOT be used within a mutate command or a summarize command?
as.integer
factor
mean
filter
HW 3 Introduction
Purpose
This assignment will give you experience with:
Creating an R Project Directory folder with data and img folders. (Review)
Creating, saving, and using a Quarto file (Review)
Importing data
Rendering a Quarto file to create an HTML file (Review)
Creating a README file (Review)
Using the dplyr commands along with commands to reshape and summarize data
Creating plots with some formatting
💥 Week 3 In-class Exercises - Q6 💥
Session ID: bua455f24
In HW 3, you will group the data by quarter and week day. This is Part 4 of HW 3 and is very similar to the group_by and summarize code covered in Lecture 5.
This is BB Question 3 in HW 3
Your grouped and summarized dataset, mojo_qtr_smry, has
____ rows and
____ columns
____ summary numeric variables
Key Points from This Week
Summarizing Data by Group
Use group_by to specify grouping variables followed by summarize
Within summarize, specify the type of summary, e.g., mean, median, or max.
Reshaping Data for Different Purposes
pivot_wider is useful for display tables
pivot_longer is useful for plots
Plotting Data
grouped barplots (stacked and side-by-side)
line plots and area plots
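The reshaping key point above can be sketched with a hypothetical toy tibble (`toy_long`, with made-up values). Here `pivot_wider` spreads the days into columns for a display table, and `pivot_longer` stacks them back into one column, which is the layout ggplot expects.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy summary in long format: one row per month/day pair
toy_long <- tibble(
  month     = c("Jan", "Jan", "Feb", "Feb"),
  day       = c("Sa", "Su", "Sa", "Su"),
  max_gross = c(10, 20, 30, 40)
)

# Wide format for a display table: one row per month, one column per day
toy_wide <- toy_long |>
  pivot_wider(names_from = day, values_from = max_gross)

# Back to long format for plotting: day names return to a single column
toy_long_again <- toy_wide |>
  pivot_longer(cols = c(Sa, Su), names_to = "day", values_to = "max_gross")

toy_wide
toy_long_again
```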
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.