HW 5 - Part 1 and Related Skills

Author

Penelope Pooler Eisenbies

Published

October 27, 2025

Dashboard Project - Data Exploration

  • Groups have been assigned. Please meet and explore data resources this week.

    • Proposal Rough draft due Thursday, 10/30 at 6:00 PM

    • Course website has many data sources and I recommend using AI for data searches.

You are building a dashboard to achieve TWO GOALS:

  1. Answer questions proposed.

  2. Display your data management skills that include, but are not limited to:

  • acquiring data.

  • combining data from different sources.

  • managing data for different goals and curating data in visual and tabular formats.

  • documenting your work product.

Keep these goals in mind as you look for data and develop your proposals.

Collaborating on an R Project using Posit Cloud

  • Groups have been created on Posit Cloud.

  • Group membership invitations expire within 7 days.

    • Create your account and accept invitation ASAP
    • $5 Student subscription required before you can contribute to online group workspace.
  • Create multiple projects with your Workspace

    • Two laptops cannot easily edit same file simultaneously.

      • Usually, Posit Cloud will make a ‘forked’ copy so you can both keep working.
    • Collaboration is still doable but it’s not as seamless as Google Docs.

  • Course TAs and I will be members of all groups so we can help you as needed.

  • Demo Using HW 5.1 was done during Lecture 18 (see recording)

Tour of Dashboard Template (Review)

  • For HW 5, I have provided you with a partially complete dashboard.

  • I am also providing you with an empty template that you can upload to Posit Cloud and use for your project.

    • This format is extremely flexible.

    • You also can make multiple temporary projects with simple ’.qmd` files within your group workspace.

  • Note that the course websites for all of my courses were created using Quarto Dashboard, because the format is so flexible and easy to use.

  • When working with your data, you can start out working in an unformatted .qmd file and then transfer work to your dashbard page.

New Skills Covered in Previous Lecture

  • Using ifelse to create a categorical variable

    • Can be combined with factor to create a factor variable

    • Factor variables are useful for plots, tables, and analyses

      • Character (text variable) are easier to work with when managing and manipulating data.
      • Factor variables are preferred when plotting or presenting data in tables.
  • Modifying and cleaning text variables

    • tolower and toupper

    • str_trim (as well as str_squish and str_pad)

    • gsub and ifelse

Data Mgmt for HW 5 - Part 1 Dashboard

Review of Previous Data Mgmt skills used

  • filter and select

  • group_by and summarize

  • pivot_wider and pivot_longer

Note: This week we will cover how to use pivot_wider and pivot_longer to complete and incomplete dataset.

  • In HW 5 - Part 1:

    • Not needed for Netflix Movie data even though these data are not fully complete.

    • Only the first year is incomplete and there are relatively little data in the first year.

HW 5 - Part 1 - Page 1

  • Value Boxes and Interactive Time Series Plots

  • Review of importing, managing, and plotting xts data.

    • xts stands for extensible time series
  • Value Boxes do not have to be edited.

  • Multiple xts plot objects can be displayed using hw_grid

    • in HW 5 plots are shown seprately without hw_grid.
  • You should also review (Not in Page 1):

    • Converting a tibble to an xts

    • Converting an xts to a tibble

  • New Skills

HW 5 - Part 1 - Page 2 - Data Mgmt.

  • Managing data for Grouped Stacked Barplots with Facets and Summary Table

  • Previously Covered Skills

    • pivot_longerto reshape nflx_mv data

    • ifelse is used to simplify release period data

    • mutate and factor is used to create two factor variables with specified levels

    • filter is used to subset the data to release periods with enough data.

  • New Skills - Page 2 Plot is a grouped stack bar plot using facet_grid.

HW 5 - Part 1 - Page 2 - Bar Plot and Table

Page 2 - Bar Plot

  • Be aware of how the barplot is specified in the geom_barplot statement.

  • Stacked bars were used to save space.

Page 2 Summary Table

  • The skills used to create to select and summarize data are review.

  • The table is presented using kable (also review).

  • Note that for projects, you can refine your table further, if needed, using commands from kableExtra package.

  • An alternative option for publishing tables is the gt package.

HW 5 - Part 1 - Page 2 - Area Plot

  • Useful new Skill:

    • summing across rows: mutate(total = sum(c_across(comedies:kids)))

    • sometimes used with rowwise() and ungroup but that is not needed here

  • Area Plot (Review)

    • Similar to line plot but uses

      • geom_area(aes(x = , y = , fill= ))
    • scale_x_continuous(breaks = ) used to show all years in data

    • scale_fill_brewer used to specify same palette as Page 2

Week 10 In-class Exercises

  • This short exercise demonstrates why a complete dataset with all combination levels is often needed for area plots.
  1. Save R gtcars dataset to Global Environment and examine the data.

    • Note: This dataset comes from the gt package.
Code
```{r}
#|label: save and examine data
my_gtcars <- gtcars |> glimpse()
```
Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "Ferra…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "488 GTB…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015, 2015…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Base Co…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "conver…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573, 545…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250, 9000…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476, 436…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000, 6750…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 16, 12…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 20, 20…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd", "r…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a"…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", "Ital…
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 298000, 29…

Week 10 In-class Exercises Cont’d

  1. Select the following variables

    • year, bdy_style (body style), drivetrain (drive train)
  2. Group data by year, body style, and drive train and count number of observations

    • command to summarize number of observations is n()
Code
```{r}
#|label: select vars then group data and summarize
my_gtcars <- my_gtcars |>
  select(year, bdy_style, drivetrain) |>
  group_by(year, bdy_style, drivetrain) |>
  summarize(n = n())                              # new command n()
```
`summarise()` has grouped output by 'year', 'bdy_style'. You can override using
the `.groups` argument.
Code
```{r}
head(my_gtcars)                                   # incomplete dataset
```
# A tibble: 6 × 4
# Groups:   year, bdy_style [5]
   year bdy_style   drivetrain     n
  <dbl> <chr>       <chr>      <int>
1  2014 coupe       rwd            2
2  2015 convertible rwd            2
3  2015 coupe       awd            3
4  2015 coupe       rwd            4
5  2016 convertible rwd            2
6  2016 coupe       awd            3

Week 10 In-class Exercises Con’t

  1. Use pivot_wider to create a column for each drivetrain

  2. Replace missing values with 0

Code
```{r reshape and replace NAs with 0s}
my_gtcars <- my_gtcars |> 
  pivot_wider(id_cols= c("year", "bdy_style"), names_from = drivetrain, values_from = n)
my_gtcars[is.na(my_gtcars)] <- 0  # convert missing values to 0
head(my_gtcars)
```
# A tibble: 6 × 4
# Groups:   year, bdy_style [6]
   year bdy_style     rwd   awd
  <dbl> <chr>       <int> <int>
1  2014 coupe           2     0
2  2015 convertible     2     0
3  2015 coupe           4     3
4  2016 convertible     2     0
5  2016 coupe          13     3
6  2016 hatchback       0     2

Week 10 In-class Exercises - Q1

Poll Everywhere - My User Name: penelopepoolereisenbies685

  1. Sum across rows to total data from both drive trains

Q1: What is the command to sum across rows?

Code
```{r}
#|label: sum across rows to create total
# simple version
my_gtcars <- my_gtcars |>                      
   mutate(total = sum(c_across(rwd:awd))) 

# more complete version with rowwise and ungroup
my_gtcars <- my_gtcars |>                      
   rowwise() |>
   mutate(total = sum(c_across(rwd:awd))) |>   
   ungroup()
```

Data with total column

# A tibble: 10 × 5
    year bdy_style     rwd   awd total
   <dbl> <chr>       <int> <int> <int>
 1  2014 coupe           2     0     2
 2  2015 convertible     2     0     2
 3  2015 coupe           4     3     7
 4  2016 convertible     2     0     2
 5  2016 coupe          13     3    16
 6  2016 hatchback       0     2     2
 7  2016 sedan           5     2     7
 8  2017 convertible     1     0     1
 9  2017 coupe           5     2     7
10  2017 sedan           0     1     1

Week 10 In-class Exercises - Q2-Q3

Poll Everywhere - My User Name: penelopepoolereisenbies685

  1. Select year, body style and total

Q2: How many observations do you have?

Q3: If every body style was made in every year, how many observations would you have?

  • There are 4 body styles and 4 years represented in these data.

  • Also in this chunk: Command to capitalize first letter of words in a string: str_to_title

Code
```{r}
#|label: select vars of interest and examine

my_gtcars <- my_gtcars |> select(year, bdy_style, total) |>
  mutate(bdy_style = str_to_title(bdy_style)) |>    # Capitalize first letters
  glimpse()
```
Rows: 10
Columns: 3
$ year      <dbl> 2014, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017
$ bdy_style <chr> "Coupe", "Convertible", "Coupe", "Convertible", "Coupe", "Ha…
$ total     <int> 2, 2, 7, 2, 16, 2, 7, 1, 7, 1

Fill in incomplete dataset using

# A tibble: 10 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2015 Convertible     2
 3  2015 Coupe           7
 4  2016 Convertible     2
 5  2016 Coupe          16
 6  2016 Hatchback       2
 7  2016 Sedan           7
 8  2017 Convertible     1
 9  2017 Coupe           7
10  2017 Sedan           1

Reshaping Data creates a 4x4 matrix with NAs

Code
```{r}
#|label:  pivot wider to reshape and include NAs
(my_gtcars1 <- my_gtcars |>
   pivot_wider(id_cols= year, names_from = bdy_style, values_from = total))
```
# A tibble: 4 × 5
   year Coupe Convertible Hatchback Sedan
  <dbl> <int>       <int>     <int> <int>
1  2014     2          NA        NA    NA
2  2015     7           2        NA    NA
3  2016    16           2         2     7
4  2017     7           1        NA     1

Reshaping data back to long format keeps NAs, categories with no observations

Code
```{r}
#|label: long form with NAs
my_gtcars1 <- my_gtcars1 |>
   pivot_longer(cols=Coupe:Sedan, 
                names_to = "bdy_style", 
                values_to = "total")
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible    NA
 3  2014 Hatchback      NA
 4  2014 Sedan          NA
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback      NA
 8  2015 Sedan          NA
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback      NA
16  2017 Sedan           1

These data values are number of observations so NAs can be replaced with 0s.

Code
```{r}
#|label:  convert NAs to 0s
my_gtcars1[is.na(my_gtcars1)] <- 0
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible     0
 3  2014 Hatchback       0
 4  2014 Sedan           0
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback       0
 8  2015 Sedan           0
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback       0
16  2017 Sedan           1

Complete code is relatively short and efficient.

Factors should be created AFTER this process, if needed.

Code
```{r complete conversion code, eval=FALSE}
my_gtcars1 <- my_gtcars |>
  
  pivot_wider(id_cols= year, 
              names_from = bdy_style, 
              values_from = total) |>
  
  pivot_longer(cols=Coupe:Sedan, 
               names_to = "bdy_style", 
               values_to = "total") |>
my_gtcars1[is.na(my_gtcars1)] <- 0
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible     0
 3  2014 Hatchback       0
 4  2014 Sedan           0
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback       0
 8  2015 Sedan           0
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback       0
16  2017 Sedan           1

Comparison of Area Plots

  • The plot on the left was made with the incomplete data.

  • The plot on the right was made with complete data where all levels of the variables shown are in the dataset.

Key Points from Today

Data Management Skills - Some Review and Some New

  • summarizing number of observations by category using n()

  • Using sum(c_across()) to sum across rows

  • Knowing how to complete data that is incomplete

Reminders:

  • Rough Draft Proposals due by 10/30 at 6:00 PM

  • HW 5 - Part 1 is due Wednesday, 10/29.

  • Quiz is on Thu. 11/6. Practice questions for Quiz 2 will be posted by 10/30.

  • Managing and presenting geographic data in choropleth maps will be covered on Thursday and is not on Quiz 2.

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.