Week 3

Reshaping and Summarizing Data

Penelope Pooler Eisenbies

2025-01-27

RStudio Global General Options

Reminders from Week 2 and HW 2

HW 2 is Due Wednesday, 1/29/2025

dplyr commands:

select - used to select variables (columns) of a dataset
slice - used to select rows by row number
filter - used to filter data rows by values of a variable
mutate - to create or transform a variable

ggplot introduction:

basic syntax and aesthetics statements (aes)
creating a basic boxplot (geom_boxplot) or scatterplot (geom_point)
removing default background by modifying the theme
adding a third categorical variable to color the data by category

Reordering variables

In class and HW 2 we used select to reorder variables.
Another option in the dplyr package is relocate

#|label: starwars numeric vars first
my_starwars <- starwars |>
  select(1:11) |>
  relocate(where(is.numeric)) |> 
  glimpse(width=40)

Rows: 87
Columns: 11
$ height     <int> 172, 167, 96, 202, …
$ mass       <dbl> 77, 75, 32, 136, 49…
$ birth_year <dbl> 19.0, 112.0, 33.0, …
$ name       <chr> "Luke Skywalker", "…
$ hair_color <chr> "blond", NA, NA, "n…
$ skin_color <chr> "fair", "gold", "wh…
$ eye_color  <chr> "blue", "yellow", "…
$ sex        <chr> "male", "none", "no…
$ gender     <chr> "masculine", "mascu…
$ homeworld  <chr> "Tatooine", "Tatooi…
$ species    <chr> "Human", "Droid", "…

#|label: starwars character vars first
my_starwars <- starwars |>
  select(1:11) |>
  relocate(where(is.character)) |> 
  glimpse(width=40)

Rows: 87
Columns: 11
$ name       <chr> "Luke Skywalker", "…
$ hair_color <chr> "blond", NA, NA, "n…
$ skin_color <chr> "fair", "gold", "wh…
$ eye_color  <chr> "blue", "yellow", "…
$ sex        <chr> "male", "none", "no…
$ gender     <chr> "masculine", "mascu…
$ homeworld  <chr> "Tatooine", "Tatooi…
$ species    <chr> "Human", "Droid", "…
$ height     <int> 172, 167, 96, 202, …
$ mass       <dbl> 77, 75, 32, 136, 49…
$ birth_year <dbl> 19.0, 112.0, 33.0, …
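
As a small sketch on made-up data (the tibble `toy` and its columns are hypothetical, not part of the course data), `relocate` can also place a column relative to another one using its `.before` and `.after` arguments:

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate relocate()
toy <- tibble(name = c("A", "B"), height = c(172L, 167L),
              species = c("Human", "Droid"), mass = c(77, 75))

# Numeric columns first (same idea as the starwars example above)
toy_num_first <- toy |> relocate(where(is.numeric))

# Move one column to a specific position with .after
toy_mass_after_name <- toy |> relocate(mass, .after = name)

names(toy_num_first)        # "height" "mass" "name" "species"
names(toy_mass_after_name)  # "name" "mass" "height" "species"
```

Columns that are not mentioned keep their original relative order.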

New Skills in Week 3 (and HW 3)

Importing a ‘clean’ dataset
- After Quiz 1 we’ll cover how to clean ‘messy’ data
Creating a character or factor variable
Coercing data to be a new data type
- e.g. character to numeric
Grouping, summarizing, and filtering data
Reshaping data for a summary table OR reshaping data for a plot

Preview of ‘cleaning’ messy data

This week, we will introduce data from Box Office Mojo
We will work with the cleaned (usable) data.
First, a quick preview of one way to acquire and clean data with no download option.
- These are proprietary data, but they can be used for educational purposes according to the fair use doctrine of the U.S. copyright statute.
Steps:
- Select data from website and save as .csv file.
- Examine raw ‘messy’ data in .csv file.
- Remove non-data rows at the top with skip.
- Select variables and filter data rows.
- Remove nuisance characters like $ and ,.
- Clean and convert date information variables, if present.
- Export and save a clean dataset.

Website
Raw Data (.csv)
Import,Select,Filter
Clean Numeric Data
Dates

Online Data are often formatted for viewing, not using.

Details that make online data easier to view have to be removed for data management.

Figure: Box Office Mojo screenshot (Data Source: Box Office Mojo)

Copying data from a website and saving them as a .csv file (CSV UTF-8) removes most of the formatting, but data cleaning is still required.

read_csv imports the raw data and skips the first 11 rows (above the var names).
filter is used to filter out rows that don’t contain data.
select is used to select only the variables we need.
rename (new command) is used to make the variable names easier to work with.
head is one of many options for examining the data.

#|label: import, select, filter, rename
bom23 <- read_csv("data/box_office_mojo_2023.csv", skip=11, show_col_types = FALSE) |>
  filter(!is.na(Day)) |>
  select(Date, `Top 10 Gross`, Gross, Releases, `#1 Release`) |>
  rename(top10gross = `Top 10 Gross`, 
         num_releases=Releases, num1gross=Gross, num1 = `#1 Release`) 
head(bom23)

# A tibble: 6 × 5
  Date   top10gross  num1gross  num_releases num1 
  <chr>  <chr>       <chr>             <dbl> <chr>
1 31-Dec $23,078,184 $5,208,897           43 Wonka
2 30-Dec $40,050,370 $8,637,841           44 Wonka
3 29-Dec $37,348,409 $8,630,268           44 Wonka
4 28-Dec $33,261,609 $7,988,504           46 Wonka
5 27-Dec $33,892,628 $8,135,639           45 Wonka
6 26-Dec $41,788,862 $8,970,413           45 Wonka
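
The same import steps can be tried on a tiny stand-in for a messy file. This is only a sketch: the text in `raw_txt` is a hypothetical example (two non-data lines, a header row, and one empty row), and `I()` tells `read_csv` to treat the string as literal data rather than a file path.

```r
library(readr)
library(dplyr)

# Hypothetical raw text standing in for a messy .csv file:
# two non-data lines at the top, then the header row, then data
raw_txt <- I('Toy daily box office
Copied from a web page
Date,Day,Top 10 Gross,Releases
31-Dec,Sunday,"$23,078,184",43
,,,
30-Dec,Saturday,"$40,050,370",44
')

toy_bom <- read_csv(raw_txt, skip = 2, show_col_types = FALSE) |>
  filter(!is.na(Day)) |>                      # drop the row with no data
  select(Date, `Top 10 Gross`, Releases) |>   # keep only needed variables
  rename(top10gross = `Top 10 Gross`, num_releases = Releases)

toy_bom
```

Column names that contain spaces or symbols must be wrapped in backticks until they are renamed.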

The two Gross variables both contained $ and , symbols that were removed with gsub and across.
Each variable was then converted to numeric with as.numeric.

#|label: clean numeric variables
bom23 <- bom23 |>
  mutate(across(.cols=top10gross:num1gross, 
                ~gsub(pattern="$", replacement="", fixed=T, .)),    # removes $ from 2 vars
         across(.cols=top10gross:num1gross, 
                ~gsub(pattern=",", replacement="", fixed=T, .))) |> # removes , from 2 vars
  mutate(across(.cols=top10gross:num1gross, as.numeric))            # converts to numeric

head(bom23)

# A tibble: 6 × 5
  Date   top10gross num1gross num_releases num1 
  <chr>       <dbl>     <dbl>        <dbl> <chr>
1 31-Dec   23078184   5208897           43 Wonka
2 30-Dec   40050370   8637841           44 Wonka
3 29-Dec   37348409   8630268           44 Wonka
4 28-Dec   33261609   7988504           46 Wonka
5 27-Dec   33892628   8135639           45 Wonka
6 26-Dec   41788862   8970413           45 Wonka
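
An alternative sketch, not the approach used in these slides: `parse_number` from the readr package drops non-numeric characters such as $ and , and converts to numeric in one step. The tibble `toy` below is hypothetical.

```r
library(dplyr)
library(readr)

# Hypothetical toy data with the same formatting problem
toy <- tibble(top10gross = c("$23,078,184", "$40,050,370"),
              num1gross  = c("$5,208,897", "$8,637,841"))

toy_clean <- toy |>
  mutate(across(top10gross:num1gross, parse_number))  # strips $ and , then converts

toy_clean
```

The `gsub` approach shown above is more explicit about which characters are removed, which can be safer when a column contains unexpected text.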

Dealing with dates used to be much more difficult prior to development of the lubridate package.
- Dates are still troublesome in other software environments.
Below we create a date variable from the provided character variable, create other variables, examine data, and export the dataset with write_csv.

#|label: date example with lubridate
bom23 <- bom23 |>
  mutate(date = dmy(paste(Date,"2023")),               # year is required
                                                       # we paste it (add it as text) to each date
         month = month(date, label=T, abbr=T),         # month shown as 3 letter abbr.
         day = wday(date, label=T, abbr=T),            # weekday shown as 3 letter abbr.
         quart = quarter(date)) |>                     # quarter shown as number
  select(date, month, day, quart, top10gross:num1) |>  # select and reorder variables
  glimpse() |>                                         # examine data
  write_csv("data/Box_Office_Mojo_Week3_HW3.csv")           # export using write_csv

Rows: 365
Columns: 8
$ date         <date> 2023-12-31, 2023-12-30, 2023-12-29, 2023-12-28, 2023-12-…
$ month        <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, De…
$ day          <ord> Sun, Sat, Fri, Thu, Wed, Tue, Mon, Sun, Sat, Fri, Thu, We…
$ quart        <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ top10gross   <dbl> 23078184, 40050370, 37348409, 33261609, 33892628, 4178886…
$ num1gross    <dbl> 5208897, 8637841, 8630268, 7988504, 8135639, 8970413, 181…
$ num_releases <dbl> 43, 44, 44, 46, 45, 45, 44, 40, 41, 41, 40, 40, 39, 39, 4…
$ num1         <chr> "Wonka", "Wonka", "Wonka", "Wonka", "Wonka", "Wonka", "Th…
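
To see what each lubridate function returns, here is a minimal sketch on a single date. The object `d` is just for illustration, and `week_start = 1` is used so that Monday counts as day 1.

```r
library(lubridate)

d <- dmy(paste("31-Dec", "2023"))   # "31-Dec 2023" parsed as day-month-year

d                          # 2023-12-31
month(d)                   # 12
quarter(d)                 # 4
wday(d, week_start = 1)    # 7: with Monday as day 1, this date is a Sunday
```

Adding `label=T` to `month` or `wday`, as in the chunk above, returns the name as an ordered factor instead of a number.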

Importing Clean Data

read_csv is used in this class
External datasets should be saved as .csv files to your project folder
- There are many CSV file options.
- Select CSV UTF-8 when saving Excel datasets as .csv files.
show_col_types=F suppresses the output message from importing data
- This option will be required when you create a dashboard.

#|label: import clean data

mojo_23 <- read_csv("data/Box_Office_Mojo_Week3_HW3.csv", show_col_types=F) |>
  glimpse(width=60)

Rows: 365
Columns: 8
$ date         <date> 2023-12-31, 2023-12-30, 2023-12-29, …
$ month        <chr> "Dec", "Dec", "Dec", "Dec", "Dec", "D…
$ day          <chr> "Sun", "Sat", "Fri", "Thu", "Wed", "T…
$ quart        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ top10gross   <dbl> 23078184, 40050370, 37348409, 3326160…
$ num1gross    <dbl> 5208897, 8637841, 8630268, 7988504, 8…
$ num_releases <dbl> 43, 44, 44, 46, 45, 45, 44, 40, 41, 4…
$ num1         <chr> "Wonka", "Wonka", "Wonka", "Wonka", "…

💥 Week 3 In-class Exercises - Q1 💥

Session ID: bua455f24

Notice that in the prior chunk, we use the command read_csv

True or False:

read_csv and read.csv are the same and can be used interchangeably to import data.

Hint: Here are three ways to determine this:

R help: In console type ?read_csv and/or type ?read.csv and look through documentation
Google R read_csv and read.csv
Ask ‘ChatGPT’, ‘Copilot’, or another AI search engine.

Note: R help files are sometimes hard to decipher and Googling often requires time and effort but both are excellent resources. AI search engines are getting better, but are not always 100% accurate.

Categorical Data
Exclude num1
Create Factors
Examine Factors

This data set is ALMOST ready to work with BUT there are a few additional tasks to cover:

Select all variables in dataset EXCEPT num1 (name of number 1 movie)
- We will work with text (character) variables after Quiz 1
Convert month to an ordinal factor, monthF
Convert day (of the week) to an ordinal factor, wkdayF, with Monday as 1st Day
- Change wkdayF labels to be M, T, W, Th, F, Sa, Su
Convert quart (Quarter) to an ordinal factor with text labels (HW 3):
- In HW 3 you will:
  - create a factor variable quartF with
    - levels: 1,2,3,4.
    - labels: “1st Qtr”, “2nd Qtr”, “3rd Qtr”, “4th Qtr” .
  - create a publication quality table showing data by week day and quarter.

Recall: We use ! to exclude a variable or filter out observations

#|label: exclude a variable
mojo_23_mod <- mojo_23 |>                             # save as new dataset 
  select(!num1) |>                                    # excludes text variable num1
  glimpse()

Rows: 365
Columns: 7
$ date         <date> 2023-12-31, 2023-12-30, 2023-12-29, 2023-12-28, 2023-12-…
$ month        <chr> "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "…
$ day          <chr> "Sun", "Sat", "Fri", "Thu", "Wed", "Tue", "Mon", "Sun", "…
$ quart        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ top10gross   <dbl> 23078184, 40050370, 37348409, 33261609, 33892628, 4178886…
$ num1gross    <dbl> 5208897, 8637841, 8630268, 7988504, 8135639, 8970413, 181…
$ num_releases <dbl> 43, 44, 44, 46, 45, 45, 44, 40, 41, 41, 40, 40, 39, 39, 4…

The factor command is used with mutate to create TWO factor variables
- levels option specifies order.
- labels option specifies appearance of values.

#|label: creating factor variables
mojo_23_mod <- mojo_23_mod |> 
  mutate(monthF = factor(month,     
                         levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")),
         wkdayF = factor(day,     
                         levels=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                         labels= c("M", "T", "W", "Th", "F", "Sa", "Su"))) |>
  glimpse()

Rows: 365
Columns: 9
$ date         <date> 2023-12-31, 2023-12-30, 2023-12-29, 2023-12-28, 2023-12-…
$ month        <chr> "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "Dec", "…
$ day          <chr> "Sun", "Sat", "Fri", "Thu", "Wed", "Tue", "Mon", "Sun", "…
$ quart        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ top10gross   <dbl> 23078184, 40050370, 37348409, 33261609, 33892628, 4178886…
$ num1gross    <dbl> 5208897, 8637841, 8630268, 7988504, 8135639, 8970413, 181…
$ num_releases <dbl> 43, 44, 44, 46, 45, 45, 44, 40, 41, 41, 40, 40, 39, 39, 4…
$ monthF       <fct> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec, De…
$ wkdayF       <fct> Su, Sa, F, Th, W, T, M, Su, Sa, F, Th, W, T, M, Su, Sa, F…

We can use unique or summary to examine the new variables monthF and wkdayF.

unique lists the distinct values in order of appearance; the Levels line below them shows the levels (categories) in the specified order
summary of a factor variable shows the number of observations in each level (category).

#|label: Examine factor variables                                                  
mojo_23_mod |> pull(monthF) |> unique()

 [1] Dec Nov Oct Sep Aug Jul Jun May Apr Mar Feb Jan
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

mojo_23_mod |> pull(wkdayF) |> unique()

[1] Su Sa F  Th W  T  M 
Levels: M T W Th F Sa Su

mojo_23_mod |> select(month, monthF, day, wkdayF) |> summary()

    month               monthF        day            wkdayF 
 Length:365         Jan    : 31   Length:365         M :52  
 Class :character   Mar    : 31   Class :character   T :52  
 Mode  :character   May    : 31   Mode  :character   W :52  
                    Jul    : 31                      Th:52  
                    Aug    : 31                      F :52  
                    Oct    : 31                      Sa:52  
                    (Other):179                      Su:53
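
The effect of levels and labels can be checked on a short vector. This is a sketch with a hypothetical vector `day`, using base R only:

```r
# Hypothetical character vector of weekday abbreviations
day <- c("Sun", "Sat", "Fri", "Mon", "Mon")

wkdayF <- factor(day,
                 levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                 labels = c("M", "T", "W", "Th", "F", "Sa", "Su"))

levels(wkdayF)      # all seven labels, in the order given by levels
as.integer(wkdayF)  # position of each value in that order: 7 6 5 1 1
table(wkdayF)       # counts per level, including levels with zero observations
```

Levels with no observations (here T, W, and Th) are still part of the factor, which is why they appear in the table with a count of 0.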

Numerical Data
R code for Numerical Data

The mutate command can contain many separate statements.
Good practice: Subdivide data management tasks into multiple chunks so that each chunk is easily understood.

In the next chunk we will:

modify top10gross and num1gross:
- divide by 1000000 and round for presentation purposes.
create percent of top 10 gross earned by number 1 film (HW 3), rounded to 2 decimal places.
- num1pct = (num1gross/top10gross * 100) |> round(2)
convert num_releases to an integer (HW 3).
- num_releases = as.integer(num_releases)

Note: Variables are rounded to two decimal places by using piping and round(2)

#|label: numerical data management
mojo_23_mod <- mojo_23_mod |>
  mutate(top10grossM = (top10gross/1000000) |> round(2),     # change scale and round
         num1grossM = (num1gross/1000000) |> round(2),       # change scale and round
         
         num1pct = (num1gross/top10gross * 100) |> round(2), # create rounded pct var
         
         num_releases = as.integer(num_releases)) |> # converts num_releases to integer 
  
  select(date, monthF, wkdayF, quart, num_releases, num1gross, num1grossM, 
         top10gross, top10grossM, num1pct)

head(mojo_23_mod)

# A tibble: 6 × 10
  date       monthF wkdayF quart num_releases num1gross num1grossM top10gross
  <date>     <fct>  <fct>  <dbl>        <int>     <dbl>      <dbl>      <dbl>
1 2023-12-31 Dec    Su         4           43   5208897       5.21   23078184
2 2023-12-30 Dec    Sa         4           44   8637841       8.64   40050370
3 2023-12-29 Dec    F          4           44   8630268       8.63   37348409
4 2023-12-28 Dec    Th         4           46   7988504       7.99   33261609
5 2023-12-27 Dec    W          4           45   8135639       8.14   33892628
6 2023-12-26 Dec    T          4           45   8970413       8.97   41788862
# ℹ 2 more variables: top10grossM <dbl>, num1pct <dbl>
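
The individual calculations can be checked on single values outside of mutate. A minimal sketch, using the Dec. 31 values from the output above (the object names `x`, `grossM`, and `pct` are just for illustration):

```r
x <- c(43, 44, 46)        # numbers typed this way are stored as doubles
typeof(x)                 # "double"
typeof(as.integer(x))     # "integer"

num1gross  <- 5208897
top10gross <- 23078184

grossM <- (num1gross / 1000000) |> round(2)           # 5.21
pct    <- (num1gross / top10gross * 100) |> round(2)  # 22.57
grossM
pct
```

The parentheses around the calculation matter: without them, only the last term would be piped into round.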

💥 Week 3 In-class Exercises - Q2 💥

Session ID: bua455f24

This is BB Question 2 in HW 3

The correct command used to convert a numeric variable to an integer variable is

____().

When you glimpse the data after Part 2 (Chunk 3) in HW 3, the type for the num_releases variable is shown as

<____> instead of <dbl>.

Grouping and Filtering Data

We can filter data by value within each group.
- R command group_by allows us to group data before we filter.
- Data are filtered by value WITHIN each specified group
- Ungrouping data afterwards using ungroup is not required, but often helpful.
The example below is not used in the subsequent summary but can be very useful.

#|label: filter to last day of month
mojo_23_mnth_end <- mojo_23_mod |>
  select(date, monthF, top10grossM) |>
  group_by(monthF) |>                             # doesn't change data appearance
  filter(date == max(date)) |>
  ungroup() |>                                    # ungroup not required but helpful
  glimpse()

Rows: 12
Columns: 3
$ date        <date> 2023-12-31, 2023-11-30, 2023-10-31, 2023-09-30, 2023-08-3…
$ monthF      <fct> Dec, Nov, Oct, Sep, Aug, Jul, Jun, May, Apr, Mar, Feb, Jan
$ top10grossM <dbl> 23.08, 5.28, 9.82, 30.32, 5.27, 30.83, 41.92, 14.13, 27.13…
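
To make the effect of grouping visible, the sketch below runs the same filter with and without group_by on a hypothetical four-row tibble:

```r
library(dplyr)

# Hypothetical toy data: two dates in each of two months
toy <- tibble(
  date  = as.Date(c("2023-01-30", "2023-01-31", "2023-02-27", "2023-02-28")),
  month = c("Jan", "Jan", "Feb", "Feb"),
  gross = c(10, 12, 8, 9)
)

# Ungrouped: max(date) is the latest date overall, so one row is kept
toy_last_day <- toy |> filter(date == max(date))

# Grouped: max(date) is evaluated within each month, so one row per month
toy_month_end <- toy |>
  group_by(month) |>
  filter(date == max(date)) |>
  ungroup()

toy_last_day
toy_month_end
```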

Grouping and Summarizing Data

We will summarize data and then reshape it for a summary table.
- R commands group_by and summarize allow us to summarize the data by category
When summarizing data, it is easier to select the variables you want first.
Plan what you want to do

mojo_23_smry <- mojo_23_mod |>
  select(monthF, wkdayF, top10grossM) |>
  group_by(monthF, wkdayF) |>                              # doesn't change data appearance
  summarize(avg_top10gross = mean(top10grossM, na.rm=T),
            mdn_top10gross = median(top10grossM, na.rm=T),
            max_top10gross = max(top10grossM, na.rm=T)) |>
  ungroup() |> glimpse()                                   # ungroup not required but helpful

Rows: 84
Columns: 5
$ monthF         <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Feb, Feb, Feb, Feb, …
$ wkdayF         <fct> M, T, W, Th, F, Sa, Su, M, T, W, Th, F, Sa, Su, M, T, W…
$ avg_top10gross <dbl> 14.3100, 10.3540, 8.1950, 7.8650, 23.2325, 36.5500, 26.…
$ mdn_top10gross <dbl> 7.920, 9.760, 7.225, 6.985, 22.710, 36.175, 28.590, 5.4…
$ max_top10gross <dbl> 32.55, 16.97, 12.13, 10.86, 31.03, 44.65, 36.21, 21.22,…
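
A minimal sketch of the same pattern on hypothetical data. It also shows the `.groups = "drop"` argument of summarize, which is an alternative to calling ungroup afterwards (the slides use ungroup):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  month = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  day   = c("M", "M", "T", "M", "M"),
  gross = c(10, 20, 5, 8, NA)
)

toy_smry <- toy |>
  group_by(month, day) |>
  summarize(avg_gross = mean(gross, na.rm = TRUE),   # NA is ignored
            max_gross = max(gross, na.rm = TRUE),
            n_days    = n(),                         # rows in each group
            .groups   = "drop")                      # result is ungrouped

toy_smry   # one row per month-day combination present in the data
```

The summarized dataset has one row for each combination of the grouping variables, which is why 12 months and 7 weekdays give 84 rows in the output above.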

Reshape Data using `pivot_wider`

A common task in data management is reshaping data
Display data tables must be compact for presentation

#|label: reshape data with pivot_wider

mojo_23_wide <- mojo_23_smry |>
  pivot_wider(id_cols=monthF, names_from=wkdayF, values_from=max_top10gross) |>
  rename(Month = monthF)
head(mojo_23_wide)

# A tibble: 6 × 8
  Month     M     T     W    Th     F    Sa    Su
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jan    32.6  17.0 12.1  10.9   31.0  44.6  36.2
2 Feb    21.2  12.4  6.49  6.24  54.7  47.6  36.4
3 Mar    11.8  13.9 10.5   9.21  39.9  44.9  30.3
4 Apr    27.0  22.0 40.4  36.5   73.8  79.0  47.9
5 May    40    19.4 14.1  10.1   57.9  55.2  49.7
6 Jun    27.0  27.1 19.9  18.4   76.6  70.3  56.5
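
A small sketch of what pivot_wider does, on a hypothetical long dataset with one row per month and day:

```r
library(tidyr)
library(dplyr)

# Hypothetical long data: one row per month-day combination
toy_long <- tibble(
  month     = c("Jan", "Jan", "Feb", "Feb"),
  day       = c("Sa", "Su", "Sa", "Su"),
  max_gross = c(44.6, 36.2, 47.6, 36.4)
)

toy_wide <- toy_long |>
  pivot_wider(id_cols = month, names_from = day, values_from = max_gross)

toy_wide   # 2 rows (one per month), 3 columns: month, Sa, Su
```

Each value of the names_from variable becomes a new column, and the values_from variable fills those columns.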

Creating Tables for Presentation

Below are two options for displaying a small dataset in tabular format.

Note: Appearance of kable tables varies for slides, documents, and html files

Basic Table with `kable`

#|label: filter select present data

mojo_23_fall_wknd <- mojo_23_wide |>            
  select(Month, F, Sa, Su) |> 
  filter(Month %in% c("Sep", "Oct", 
                      "Nov", "Dec"))
mojo_23_fall_wknd |>
  kable()

Month	F	Sa	Su
Sep	27.29	32.70	27.58
Oct	53.21	47.65	32.68
Nov	43.49	41.72	29.34
Dec	39.89	40.05	23.08

`kable` Table with styling

#|label: modifying alignment and styling
mojo_23_fall_wknd |>
 kable(align="lccc", 
       caption="Max. Fall `23 Top 10 Gross") |>
  kable_styling(full_width = F)

Max. Fall `23 Top 10 Gross
Month	F	Sa	Su
Sep	27.29	32.70	27.58
Oct	53.21	47.65	32.68
Nov	43.49	41.72	29.34
Dec	39.89	40.05	23.08
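
Because appearance depends on the output type, it can help to look at the plain-text version of a table in the console. The sketch below uses a hypothetical two-row data frame and the `format` and `col.names` arguments of `kable`, which are not used elsewhere in these slides:

```r
library(knitr)

# Hypothetical toy data
toy <- data.frame(Month = c("Sep", "Oct"), Fri = c(27.29, 53.21))

tbl <- kable(toy, format = "pipe", align = "lc",
             col.names = c("Month", "Friday"))   # display names for columns
tbl
```

The display names in col.names change only the table header, not the variable names in the dataset.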

Reshaping Data using `pivot_longer`

The longer data format is often needed for efficient data visualization

`pivot_longer` R code

#|label: pivot_longer code
mojo_23_long <- mojo_23_wide |>
  pivot_longer(cols=M:Su, names_to="Day", 
               values_to="max_top10gross") 
head(mojo_23_long, 10)

# A tibble: 10 × 3
   Month Day   max_top10gross
   <fct> <chr>          <dbl>
 1 Jan   M              32.6 
 2 Jan   T              17.0 
 3 Jan   W              12.1 
 4 Jan   Th             10.9 
 5 Jan   F              31.0 
 6 Jan   Sa             44.6 
 7 Jan   Su             36.2 
 8 Feb   M              21.2 
 9 Feb   T              12.4 
10 Feb   W               6.49
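
A small sketch of pivot_longer on a hypothetical wide dataset, the reverse of the pivot_wider step:

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per month, one column per day
toy_wide <- tibble(Month = c("Jan", "Feb"),
                   Sa = c(44.6, 47.6), Su = c(36.2, 36.4))

toy_long <- toy_wide |>
  pivot_longer(cols = Sa:Su, names_to = "Day", values_to = "max_gross")

toy_long   # 4 rows: one per Month-Day combination
```

The former column names become values of a character variable (here Day), so any factor ordering has to be specified again after reshaping.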

basic `geom_bar` barplot R code

#|label: stacked barplot
(mojo_barplot <- mojo_23_long |> ggplot() + 
  geom_bar(aes(x=Month, y=max_top10gross, fill=Day), 
           stat="identity"))

Stacked Barplot
Side-by-side
Labels Formatted
Format Palette and Text

#|label: stacked no background
mojo_23_long <- mojo_23_long |> # Day converted to factor to specify order
  mutate(Day = factor(Day, levels=c("M", "T", "W", "Th", "F", "Sa", "Su")))
(mojo_barplot <- mojo_23_long |> ggplot() + 
  geom_bar(aes(x=Month, y=max_top10gross, fill=Day), stat="identity") + 
  theme_classic())

#|label: side by side
(mojo_barplot <- mojo_23_long |> ggplot() + 
  geom_bar(aes(x=Month, y=max_top10gross, fill=Day), 
           stat="identity", position="dodge") + 
  theme_classic())
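
As an aside, ggplot2 also provides `geom_col`, which plots the y values as given and so behaves like `geom_bar` with `stat="identity"`. The sketch below uses a hypothetical four-row data frame and is an alternative to, not a replacement for, the `geom_bar` code in these slides:

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(Month = c("Jan", "Jan", "Feb", "Feb"),
                  Day   = c("Sa", "Su", "Sa", "Su"),
                  gross = c(44.6, 36.2, 47.6, 36.4))

# Default position: bars for each Month are stacked
p_stacked <- ggplot(toy) +
  geom_col(aes(x = Month, y = gross, fill = Day)) +
  theme_classic()

# position = "dodge": bars for each Month are placed side by side
p_dodged <- ggplot(toy) +
  geom_col(aes(x = Month, y = gross, fill = Day), position = "dodge") +
  theme_classic()

p_stacked
p_dodged
```

In the stacked version the total bar height is the sum over days, so the y axis range differs between the two plots.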

We can add on to the plot which is a saved object in the Global Environment.

#|label: label formatting
(mojo_barplot <- mojo_barplot +
  theme(legend.position ="bottom") +
  guides(fill = guide_legend(nrow = 1)) +
  labs(x="", y="Maximum Daily Gross ($M)",
       title = "Maximum Daily Gross of Top 10 Films by Month and Day of Week",
       caption = "Data Source: www.boxofficemojo.com"))

#|label: spectral palette and text resized
(mojo_barplot <- mojo_barplot + 
    scale_fill_brewer(palette = "Spectral") +
    theme(plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)))

💥 Week 3 In-class Exercises - Q2 💥

Session ID: bua455f24

This is part of BB Question 5 in HW 3

If you want a grouped barplot with side-by-side bars, what is the correct option to include in the geom_bar statement?

Here is some additional information about geom_bar barplots.

`pivot_longer` for Line and Area Plots

An alternative to summarizing the data is to show the data as a time series.
- Two ways to do this are a line plot or an area plot
- These plots are an effective data management and presentation tool.
To make a line plot with multiple variables, we use pivot_longer to reshape the data.

#|label: reshape for line plot
mojo_23_line_area <- mojo_23_mod |>
  select(date, top10grossM, num1grossM) |>                  # select variables
  rename(`Top 10` = top10grossM, `No. 1` = num1grossM) |>   # rename for plot
  pivot_longer(cols=`Top 10`:`No. 1`,                       # reshape data  
               names_to = "type", values_to = "grossM") |>
  mutate(type=factor(type, levels=c("Top 10", "No. 1")))    # convert gross type to factor
head(mojo_23_line_area, 4)

# A tibble: 4 × 3
  date       type   grossM
  <date>     <fct>   <dbl>
1 2023-12-31 Top 10  23.1 
2 2023-12-31 No. 1    5.21
3 2023-12-30 Top 10  40.0 
4 2023-12-30 No. 1    8.64

Line Plot
Labels & Colors
Resize Text
Area Plot Code
Area Plot

#|label: basic line plot
(line_plt <- mojo_23_line_area |> ggplot() + 
  geom_line(aes(x=date, y=grossM, color=type), linewidth=1) + 
  theme_classic())

#|label: labels and colors formatted
(line_plt <- line_plt + 
  theme(legend.position="bottom") +                    # legend at bottom 
  scale_color_manual(values=c("blue", "lightblue")) +  # specify colors 
  labs(x="Date", y = "Gross ($Mill)", color="",
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2023 - Dec. 31, 2023",
       caption="Data Source: www.boxofficemojo.com"))

#|label: adjust text size
(line_plt <- line_plt +
  theme(plot.title = element_text(size = 20), plot.caption = element_text(size = 10),
        axis.text = element_text(size=15), axis.title = element_text(size=18),
        legend.text = element_text(size = 12),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth = 2)))

Change geom_line to geom_area and color to fill

#|label: area plot code
area_plt <- mojo_23_line_area |>                         
  ggplot() +                                            # changed to geom_area
  geom_area(aes(x=date, y=grossM, fill=type), linewidth=1) + # changed color to fill
  theme_classic() + theme(legend.position="bottom") +                    
  scale_fill_manual(values=c("blue", "lightblue")) +    # changed color to fill
  labs(x="Date", y = "Gross ($Mill)", fill="",          # changed color to fill
       title="Top 10 and No. 1 Movie Gross by Date", 
       subtitle="Jan. 1, 2023 - Dec. 31, 2023",
       caption="Data Source: www.boxofficemojo.com") + 
    theme(plot.title = element_text(size = 20),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
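
One detail to keep in mind when reading an area plot: `geom_area` stacks the series by default (`position = "stack"`), so the top edge shows the sum of the series rather than the larger series alone. The sketch below, on a hypothetical three-day subset of rounded values from the data above, compares the default with `position = "identity"`, which draws each series from zero. This is offered as an option to consider, not as the method used in these slides.

```r
library(ggplot2)

# Hypothetical toy data: three dates, two series per date
toy <- data.frame(
  date   = rep(as.Date(c("2023-12-29", "2023-12-30", "2023-12-31")), each = 2),
  type   = rep(c("Top 10", "No. 1"), times = 3),
  grossM = c(37.35, 8.63, 40.05, 8.64, 23.08, 5.21)
)

# Default: position = "stack", so the areas are added on top of each other
p_stack <- ggplot(toy) +
  geom_area(aes(x = date, y = grossM, fill = type))

# Alternative: position = "identity" draws each series from zero;
# alpha makes the overlapping areas visible
p_overlay <- ggplot(toy) +
  geom_area(aes(x = date, y = grossM, fill = type),
            position = "identity", alpha = 0.5)

p_stack
p_overlay
```

With stacking, the top of the plot on Dec. 30 is at 40.05 + 8.64; with the overlay, it is at 40.05.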

💥 Week 3 In-class Exercises 💥

Lecture 6 - Q1 - NOT ON PointSolutions

In class we will practice:

Running chunks and exporting a table.
Preview for 1 Question in Quiz 1 where you will:
- Select variables from a provided dataset
- Group and summarize data
- Export a summary table as a .csv file and submit it.

Instructions for In-class Exercise

Save Week 3 R project to your computer.
Open this project by clicking on .Rproj file.
Open .Rmd file within open R project.
Run all chunks above this exercise.
Modify the following chunk below to:
1. Round all values in columns 2-4 of mojo_23_fall_wknd to 1 decimal place using round.
2. Export mojo_23_fall_wknd as a .csv file with your name.
Submit this .csv file with your name in the Week 3 In-class Exercise in the In-class Exercises folder on Blackboard.

NOTE: This counts as part of your in-class participation for the Week 3 lectures (due Fri. at midnight).

R Code Chunk for In-class Exercise

Remove , eval=F from chunk header. This will allow code in chunk to run when it is rendered.
Remove the # and complete round command to round numeric columns (columns 2 - 4) to 1 decimal place.
Choose EITHER of the write_csv commands and edit it so dataset will be exported to the data folder with your name.
Delete write_csv command you don’t edit or put # symbols in front of it.
Submit .csv file with your name in the filename

#|label: round and export summary dataset

mojo_23_fall_wknd |> glimpse()         # examine data with glimpse

# round columns 2, 3 and 4 only
  
# export summary dataset using write_csv without piping
write_csv(mojo_23_fall_wknd, "data/Movie_Gross_Fall_2023_Weekends_FirstName_Last_Name.csv")

# export summary dataset using write_csv with piping
mojo_23_fall_wknd |>
  write_csv("data/Movie_Gross_Fall_2023_Weekends_FirstName_Last_Name.csv")

💥 Week 3 In-class Exercises 💥

Lecture 6 - Q2 - NOT ON PointSolutions

Practice:

If all the columns in a dataset are numeric, you can round the whole dataset at once with the command round(<name of dataset>).

Why wouldn’t that work for the dataset in the previous exercise, mojo_23_fall_wknd?

Hint: To answer this question, you are encouraged to

try running the command round(mojo_23_fall_wknd).
examine the data using glimpse.

💥 Week 3 In-class Exercises - Q5 💥

Session ID: bua455f24

Which of the following commands should NOT be used within a mutate command or a summarize command?

as.integer
factor
mean
filter

HW 3 Introduction

Purpose

This assignment will give you experience with:

Creating an R Project Directory folder with data and img folders. (Review)
Creating, saving, using a Quarto file (Review)
Importing data
Rendering a Quarto file to create an HTML file (Review)
Creating a README file (Review)
Using the dplyr commands along with commands to reshape and summarize data
Creating plots with some formatting

💥 Week 3 In-class Exercises - Q6 💥

Session ID: bua455f24

In HW 3, you will group the data by quarter and week day. This is Part 4 of HW 3 and is very similar to the group_by and summarize code covered in Lecture 5.

This is BB Question 3 in HW 3

Your grouped and summarized dataset, mojo_qtr_smry, has

____ rows and

____ columns

____ summary numeric variables

Key Points from This Week

Summarizing Data by Group

Use group_by to specify grouping variables followed by summarize
- Within summarize specify type, e.g. mean, median, max, etc.

Reshaping Data for Different Purposes

pivot_wider is useful for display tables
pivot_longer is useful for plots

Plotting Data

grouped barplots (stacked and side-by-side)
line plots and area plots

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.
