HW 5 - Part 1 - Quarto Dashboard and Data Mgmt

Author

Penelope Pooler Eisenbies

Published

March 19, 2025

Dashboard Project - Data Exploration

  • Groups have been assigned so it is time to meet and explore data resources.

  • The course website has many data sources and I recommend search for data with AI

  • When deciding on a topic and data, keep the project parameters in mind.

You are building a dashboard to achieve TWO GOALS:

  1. Answer questions proposed.

  2. Display your data management skills that include, but are not limited to:

  • acquiring data.

  • combining data from different sources.

  • managing data for different goals and curating data in visual and tabular formats.

  • documenting your work product.

Keep these goals in mind as you look for data and develop your proposals.

Collaborating on an R Project using Posit Cloud

  • Groups have been created on Posit Cloud.

  • Group membership invitations expire within 7 days.

  • Create multiple projects with your Workspace

    • Two laptops cannot easily edit same project simultaneously.

    • Collaboration is still doable but it’s not as seamless as Google Docs.

  • Course TAs and I will be members of all groups so we can help you as needed.

  • Demo Using HW 5.1

Tour of Dashboard Template

  • For HW 5, I have provided you with a partially complete dashboard.

  • I am also providing you with an empty template that you can upload to Posit Cloud and use for your project.

    • This format is extremely flexible.

    • You also can make multiple temporary projects with simple ’.qmd` files within your group workspace.

  • Note that the course websites for all of my courses were created using Quarto Dashboard, because the format is so flexible and easy to use.

  • When working with your data, you can start out working in an unformatted .qmd file and then transfer work to your dashbard page.

New Skills Covered in Previous Lecture

  • Using ifelse to create a categorical variable

    • Can be combined with factor to create a factor variable

    • Factor variables are useful for plots, tables, and analyses

      • Character (text variable) are easier to work with when managing and manipulating data.
      • Factor variables are preferred when plotting or presenting data in tables.
  • Modifying and cleaning text variables

    • tolower and toupper

    • str_trim (as well as str_squish and str_pad)

    • gsub and ifelse

Data Mgmt for HW 5 - Part 1 Dashboard

Review of Previous Data Mgmt skills used

  • filter and select

  • group_by and summarize

  • pivot_wider and pivot_longer

Note: This week we will cover how to use pivot_wider and pivot_longer to complete and incomplete dataset.

  • In HW 5 - Part 1:

    • Not needed for Netflix Movie data even though these data are not fully complete.

    • Only the first year is incomplete and there are relatively little data in the first year.

HW 5 - Part 1 - Page 1

  • Value Boxes and Interactive Time Series Plots

  • Review of importing, managing, and plotting xts data.

    • xts stands for extensible time series
  • Value Boxes do not have to be edited.

  • Multiple xts plot objects can be displayed using hw_grid

    • in HW 5 plots are shown seprately without hw_grid.
  • You should also review (Not in Page 1):

    • Converting a tibble to an xts

    • Converting an xts to a tibble

  • New Skills

HW 5 - Part 1 - Page 2 - Data Mgmt.

  • Managing data for Grouped Stacked Barplots with Facets and Summary Table

  • Previously Covered Skills

    • pivot_longerto reshape nflx_mv data

    • ifelse is used to simplify release period data

    • mutate and factor is used to create two factor variables with specified levels

    • filter is used to subset the data to release periods with enough data.

  • New Skills - Page 2 Plot is a grouped stack bar plot using facet_grid.

HW 5 - Part 1 - Page 2 - Bar Plot and Table

Page 2 - Bar Plot

  • Be aware of how the barplot is specified in the geom_barplot statement.

  • Stacked bars were used to save space.

Page 2 Summary Table

  • The skills used to create to select and summarize data are review.

  • The table is presented using kable (also review).

  • Note that for projects, you can refine your table further, if needed, using commands from kableExtra package.

  • An alternative option for publishing tables is the gt package.

HW 5 - Part 1 - Page 2 - Area Plot

  • Useful new Skill:

    • summing across rows: mutate(total = sum(c_across(comedies:kids)))

    • sometimes used with rowwise() and ungroup but that is not needed here

  • Area Plot (Review)

    • Similar to line plot but uses

      • geom_area(aes(x = , y = , fill= ))
    • scale_x_continuous(breaks = ) used to show all years in data

    • scale_fill_brewer used to specify same palette as Page 2

Week 9 In-class Exercises

  • This short exercise demonstrates why a complete dataset with all combination levels is often needed for area plots.
  1. Save R gtcars dataset to Global Environment and examine the data.

    • Note: This dataset comes from the gt package.
Code
```{r}
#|label: save and examine data
my_gtcars <- gtcars |> glimpse()
```
Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "Ferra…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "488 GTB…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015, 2015…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Base Co…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "conver…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573, 545…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250, 9000…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476, 436…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000, 6750…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 16, 12…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 20, 20…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd", "r…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a"…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", "Ital…
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 298000, 29…

Week 9 In-class Exercises Cont’d

  1. Select the following variables

    • year, bdy_style (body style), drivetrain (drive train)
  2. Group data by year, body style, and drive train and count number of observations

    • command to summarize number of observations is n()
Code
```{r}
#|label: select vars then group data and summarize
my_gtcars <- my_gtcars |>
  select(year, bdy_style, drivetrain) |>
  group_by(year, bdy_style, drivetrain) |>
  summarize(n = n())                              # new command n()
```
`summarise()` has grouped output by 'year', 'bdy_style'. You can override using
the `.groups` argument.
Code
```{r}
head(my_gtcars)                                   # incomplete dataset
```
# A tibble: 6 × 4
# Groups:   year, bdy_style [5]
   year bdy_style   drivetrain     n
  <dbl> <chr>       <chr>      <int>
1  2014 coupe       rwd            2
2  2015 convertible rwd            2
3  2015 coupe       awd            3
4  2015 coupe       rwd            4
5  2016 convertible rwd            2
6  2016 coupe       awd            3

Week 9 In-class Exercises Con’t

  1. Use pivot_wider to create a column for each drivetrain

  2. Replace missing values with 0

Code
```{r reshape and replace NAs with 0s}
my_gtcars <- my_gtcars |> 
  pivot_wider(id_cols= c("year", "bdy_style"), names_from = drivetrain, values_from = n)
my_gtcars[is.na(my_gtcars)] <- 0  # convert missing values to 0
head(my_gtcars)
```
# A tibble: 6 × 4
# Groups:   year, bdy_style [6]
   year bdy_style     rwd   awd
  <dbl> <chr>       <int> <int>
1  2014 coupe           2     0
2  2015 convertible     2     0
3  2015 coupe           4     3
4  2016 convertible     2     0
5  2016 coupe          13     3
6  2016 hatchback       0     2

Week 9 In-class Exercises - Q1

Session ID: bua455s25

  1. Sum across rows to total data from both drive trains

Q1: What is the command to sum across rows?

Code
```{r}
#|label: sum across rows to create total
# simple version
my_gtcars <- my_gtcars |>                      
   mutate(total = sum(c_across(rwd:awd))) 

# more complete version with rowwise and ungroup
my_gtcars <- my_gtcars |>                      
   rowwise() |>
   mutate(total = sum(c_across(rwd:awd))) |>   
   ungroup()
```

Data with total column

# A tibble: 10 × 5
    year bdy_style     rwd   awd total
   <dbl> <chr>       <int> <int> <int>
 1  2014 coupe           2     0     2
 2  2015 convertible     2     0     2
 3  2015 coupe           4     3     7
 4  2016 convertible     2     0     2
 5  2016 coupe          13     3    16
 6  2016 hatchback       0     2     2
 7  2016 sedan           5     2     7
 8  2017 convertible     1     0     1
 9  2017 coupe           5     2     7
10  2017 sedan           0     1     1

Week 9 In-class Exercises - Q2

Session ID: bua455s25

  1. Select year, body style and total

Q2: How many observations do you have?

Q3: If every body style was made in every year, how many observations would you have?

  • There are 4 body styles and 4 years represented in these data.

  • Also in this chunk: Command to capitalize first letter of words in a string: str_to_title

Code
```{r}
#|label: select vars of interest and examine

my_gtcars <- my_gtcars |> select(year, bdy_style, total) |>
  mutate(bdy_style = str_to_title(bdy_style)) |>    # Capitalize first letters
  glimpse()
```
Rows: 10
Columns: 3
$ year      <dbl> 2014, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017
$ bdy_style <chr> "Coupe", "Convertible", "Coupe", "Convertible", "Coupe", "Ha…
$ total     <int> 2, 2, 7, 2, 16, 2, 7, 1, 7, 1

Fill in incomplete dataset using

# A tibble: 10 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2015 Convertible     2
 3  2015 Coupe           7
 4  2016 Convertible     2
 5  2016 Coupe          16
 6  2016 Hatchback       2
 7  2016 Sedan           7
 8  2017 Convertible     1
 9  2017 Coupe           7
10  2017 Sedan           1

Reshaping Data creates a 4x4 matrix with NAs

Code
```{r}
#|label:  pivot wider to reshape and include NAs
(my_gtcars1 <- my_gtcars |>
   pivot_wider(id_cols= year, names_from = bdy_style, values_from = total))
```
# A tibble: 4 × 5
   year Coupe Convertible Hatchback Sedan
  <dbl> <int>       <int>     <int> <int>
1  2014     2          NA        NA    NA
2  2015     7           2        NA    NA
3  2016    16           2         2     7
4  2017     7           1        NA     1

Reshaping data back to long format keeps NAs, categories with no observations

Code
```{r}
#|label: long form with NAs
my_gtcars1 <- my_gtcars1 |>
   pivot_longer(cols=Coupe:Sedan, 
                names_to = "bdy_style", 
                values_to = "total")
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible    NA
 3  2014 Hatchback      NA
 4  2014 Sedan          NA
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback      NA
 8  2015 Sedan          NA
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback      NA
16  2017 Sedan           1

These data values are number of observations so NAs can be replaced with 0s.

Code
```{r}
#|label:  convert NAs to 0s
my_gtcars1[is.na(my_gtcars1)] <- 0
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible     0
 3  2014 Hatchback       0
 4  2014 Sedan           0
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback       0
 8  2015 Sedan           0
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback       0
16  2017 Sedan           1

Complete code is relatively short and efficient.

Factors should be created AFTER this process, if needed.

Code
```{r complete conversion code, eval=FALSE}
my_gtcars1 <- my_gtcars |>
  
  pivot_wider(id_cols= year, 
              names_from = bdy_style, 
              values_from = total) |>
  
  pivot_longer(cols=Coupe:Sedan, 
               names_to = "bdy_style", 
               values_to = "total") |>
my_gtcars1[is.na(my_gtcars1)] <- 0
```
# A tibble: 16 × 3
    year bdy_style   total
   <dbl> <chr>       <int>
 1  2014 Coupe           2
 2  2014 Convertible     0
 3  2014 Hatchback       0
 4  2014 Sedan           0
 5  2015 Coupe           7
 6  2015 Convertible     2
 7  2015 Hatchback       0
 8  2015 Sedan           0
 9  2016 Coupe          16
10  2016 Convertible     2
11  2016 Hatchback       2
12  2016 Sedan           7
13  2017 Coupe           7
14  2017 Convertible     1
15  2017 Hatchback       0
16  2017 Sedan           1

Comparison of Area Plots

  • The plot on the left was made with the incomplete data.

  • The plot on the right was made with complete data where all levels of the variables shown are in the dataset.

Key Points from This Week

Data Management Skills - Some Review and Some New

  • summarizing number of observations by category using n()

  • Using sum(c_across()) to sum across rows

  • Knowing how to complete data that is incomplete

Next Week:

  • HW 5 - Part 1 is due Wednesday, 3/26.

  • Practice questions for Quiz 2 are posted.

    • Quiz is on Thu. 4/3.
  • Managing and presenting geographic data in choropleth maps (Not on Quiz 2).

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.