Week 2

Introduction to dplyr commands and ggplot

Penelope Pooler Eisenbies

2025-08-19

RStudio Global General Options

A few simple options can greatly help you.
Workspace:
- Set save workspace to Never
Maintain defaults for
- Restore most recently opened project at startup
- Always save history
- These two options can help you if R crashes or your computer does.

RStudio Markdown Global Options

Quarto files are one type of Markdown file.
The R Markdown options shown here will make your Quarto file easier to navigate and read.
R Notebooks are a basic draft Markdown file that we don’t use in this course.
- Leave R Notebooks section as is.

RStudio Appearance Global Options

By default the RStudio panes are white.
Changing the appearance is completely optional but can help with eye fatigue.
It also makes working in the RStudio environment a little more interesting.

RStudio Code Options

There are many code options that can be changed.
I recommend maintaining the defaults until you know what you want to change and why.
There are, however, three options under Syntax in the Display tab that I recommend enabling.
Checking all three of these options makes it easier to write and proofread your code.

Reminders:

Assignment Reminders

Pre-class Survey Due 9/3/25
HW 1 (Parts 1, 2, and 3) Due 9/3/25

Week 1 - File Management

Creating a Quarto Project
- Quarto project automatically creates a Quarto (.qmd) file.
- Adding data and img folders to your project.
- Creating and editing a setup chunk in your Quarto file.
- Creating and editing code chunks.

Week 1 - Data Management

Selecting data by rows and columns with square brackets
Examining data with R commands: glimpse, summary, unique, table
Types of variables
- numeric variables (<dbl>, <int>)
- categorical variables (<chr>, <fct>, <ord>)
- Type of variable dictates how we examine, summarize and present the data
Using the pipe operator, |>, to write R code more efficiently.
Using the c() function to create a group (vector) of values
Using $ or pull or select to specify a variable within a dataset
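
As a quick refresher, the Week 1 ideas above can be sketched with the starwars data that ships with dplyr. This is a minimal illustration rather than the Week 1 code itself; the object names (heights_dollar, heights_pull, heights_tbl, first_two) are made up for this example.

```r
library(dplyr)

# three ways to get at one variable in a dataset
heights_dollar <- starwars$height              # $ returns a vector
heights_pull   <- starwars |> pull(height)     # pull also returns a vector
heights_tbl    <- starwars |> select(height)   # select returns a dataset with one column

# c() combines values into a vector; square brackets pick rows and columns
first_two <- starwars[c(1, 2), c("name", "height")]

identical(heights_dollar, heights_pull)        # TRUE: same vector either way
```

The practical difference is that select keeps the result as a dataset, which is what most dplyr commands expect next in a pipe, while $ and pull hand back the bare values.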

Additional R syntax

Operators are used to filter data or create new variables

For example:
- Filter a dataset of heights to heights <= 6 feet
- Filter a dataset of cars to exclude SUVs

Operators in R is a good reference for some of the common operators used for data management in R.

💥 Week 2 In-class Exercises - Q1 💥

Poll Everywhere

Use the Operators in R reference link to find the operator that is put before = to indicate not equal to.

This same operator can be put before any value, e.g., X, to indicate not X.
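
Once you have answered the question, the two filtering examples from the previous slide can be sketched as follows. The datasets here (people, cars) are small made-up examples, not course data.

```r
library(dplyr)

# small made-up datasets, for illustration only
people <- tibble(name = c("Ana", "Ben", "Cy"), height_ft = c(5.5, 6.2, 6.0))
cars   <- tibble(model = c("A", "B", "C"), type = c("SUV", "Sedan", "Truck"))

# keep heights less than or equal to 6 feet
people_6ft <- people |> filter(height_ft <= 6)

# exclude SUVs: != means "not equal to"
cars_no_suv <- cars |> filter(type != "SUV")

# ! in front of a logical condition reverses it, so this keeps the same rows
cars_no_suv2 <- cars |> filter(!(type == "SUV"))
```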

Introduction to `dplyr`

Recall the starwars data from Week 1 (see the online dataset documentation).

Original Data

#|label: original data
my_starwars |> glimpse(width=40)

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "…
$ height     <int> 172, 167, 96, 202, …
$ mass       <dbl> 77, 75, 32, 136, 49…
$ hair_color <chr> "blond", NA, NA, "n…
$ skin_color <chr> "fair", "gold", "wh…
$ eye_color  <chr> "blue", "yellow", "…
$ birth_year <dbl> 19.0, 112.0, 33.0, …
$ sex        <chr> "male", "none", "no…
$ gender     <chr> "masculine", "mascu…
$ homeworld  <chr> "Tatooine", "Tatooi…
$ species    <chr> "Human", "Droid", "…
$ films      <list> <"A New Hope", "Th…
$ vehicles   <list> <"Snowspeeder", "I…
$ starships  <list> <"X-wing", "Imperi…

Modified Data

#|label: modified data
my_starwars_plot_dat |>        
  glimpse(width=40)

Rows: 24
Columns: 5
$ species <chr> "Human", "Droid", "Dro…
$ sex     <chr> "male", "none", "none"…
$ height  <int> 172, 167, 96, 202, 150…
$ mass    <dbl> 77, 75, 32, 136, 49, 1…
$ bmi     <dbl> 26.02758, 26.89232, 34…

Complete Data Mgmt.
select
filter
mutate
remove NA’s

Data Mgmt for a Boxplot Visualization

In Week 1, we previewed some data management of the starwars data for a boxplot visualization.
Today we will examine each data management step above in the subsequent panels of this slide.

#|label: starwars data management

# select, filter, and mutate commands are part of tidyverse suite
# bmi = weight(kg)/height(m)^2

my_starwars_plot_dat <- my_starwars |>         # my_starwars_plot_dat created for plot
  select(species, sex, height, mass) |>        # select specific variables
  filter(species %in% c("Human", "Droid")) |>  # filter data to humans and droids only
  mutate(bmi = mass/((height/100))^2) |>       # use mutate to create new variable, bmi
  filter(!is.na(bmi))                          # filter data to remove missing BMI values

Use the select command in the dplyr package to select variables.
The select command also orders the variables as written in the command.
We save this dataset with fewer variables as a NEW dataset, my_starwars_plot_dat.
glimpse is NOT required at each step, but we will use it here to examine the dataset modifications.

#|label: selecting variables

my_starwars_plot_dat <- my_starwars |>            # save as new dataset my_starwars_plot_dat
  select(species, sex, height, mass) |>           # select variables of interest         
  glimpse(width=60)

Rows: 87
Columns: 4
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex     <chr> "male", "none", "none", "male", "female", …
$ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass    <dbl> 77, 75, 32, 136, 49, 120, 75, 32, 84, 77, …
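
Two other behaviors of select are worth seeing once. The sketch below uses the starwars data from dplyr directly (not my_starwars), and the object names reordered and sw_no_lists are made up for this example.

```r
library(dplyr)

# select keeps variables in the order they are written in the command
reordered <- starwars |> select(mass, name)
names(reordered)

# a minus sign in front of a name drops that variable and keeps the rest
sw_no_lists <- starwars |> select(-films, -vehicles, -starships)
ncol(sw_no_lists)
```

Dropping variables with a minus sign is convenient when you want to keep most of a dataset and remove only a few columns.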

The filter command in the dplyr package is one common way to filter data.
Datasets can be filtered by numeric values, character (text) values, or factor levels.
A very useful operator for selecting data from specific categories is %in%, which means "contained in".

#|label: filter observations

# filter data to include only two species categories Human and Droid
my_starwars_plot_dat <- my_starwars_plot_dat |>                     # overwrite dataset
  filter(species %in% c("Human", "Droid")) |>                       # filter data
  glimpse(width=60)

Rows: 41
Columns: 4
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex     <chr> "male", "none", "none", "male", "female", …
$ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass    <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0…
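
To see what %in% does on its own, it can help to try it on a short vector first. This is a small sketch using a made-up vector (species_vec) and the starwars data from dplyr; the names in_result, with_in and with_or are hypothetical.

```r
library(dplyr)

species_vec <- c("Human", "Droid", "Wookiee", NA)

# %in% asks, for each value on the left: is it one of the values on the right?
in_result <- species_vec %in% c("Human", "Droid")
in_result   # TRUE TRUE FALSE FALSE

# the same filter written with two == tests joined by | (or)
with_in <- starwars |> filter(species %in% c("Human", "Droid"))
with_or <- starwars |> filter(species == "Human" | species == "Droid")
nrow(with_in) == nrow(with_or)   # TRUE
```

The %in% version stays short as the list of categories grows, which is why it is used here.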

The mutate command in the dplyr package can be used to create a new variable.
New variables can be created from other variables or can overwrite variables (be careful).
We will use mutate for many varied tasks throughout this course.

#|label: mutate or create variable

my_starwars_plot_dat <- my_starwars_plot_dat |>           # overwrite dataset
  mutate(bmi = mass/((height/100))^2) |>                  # create new calculated variable, bmi
  glimpse()

Rows: 41
Columns: 5
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",…
$ sex     <chr> "male", "none", "none", "male", "female", "male", "female", "n…
$ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 180,…
$ mass    <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, …
$ bmi     <dbl> 26.02758, 26.89232, 34.72222, 33.33007, 21.77778, 37.87401, 27…
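
As a check on the formula, the first bmi value can be worked by hand from the first row of the output above (height 172 cm, mass 77 kg). The variable names below are made up for this worked example.

```r
# first row of the data: height = 172 cm, mass = 77 kg
height_cm <- 172
mass_kg   <- 77

height_m <- height_cm / 100        # convert cm to m: 1.72
bmi      <- mass_kg / height_m^2   # 77 / 2.9584

round(bmi, 5)   # 26.02758, matching the first bmi value shown above
```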

A common task in data management is removing missing values.
In R, missing values are denoted as NA
Missing values can be filtered out using filter together with the function is.na and the operator !.

#|label: remove NAs

my_starwars_plot_dat <- my_starwars_plot_dat |>     # overwrite dataset
  filter(!is.na(bmi)) |>                            # filter out NA's with !is.na(...)
  glimpse(width=60)

Rows: 24
Columns: 5
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex     <chr> "male", "none", "none", "male", "female", …
$ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass    <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0…
$ bmi     <dbl> 26.02758, 26.89232, 34.72222, 33.33007, 21…
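
The pieces of filter(!is.na(bmi)) can be tried one at a time on a short vector. This is a minimal sketch with made-up values; x, mini and mini_complete are hypothetical names.

```r
library(dplyr)

x <- c(26.0, NA, 34.7)

77 / NA      # NA: any calculation with a missing value is also missing
is.na(x)     # FALSE  TRUE FALSE : TRUE marks the missing value
!is.na(x)    #  TRUE FALSE  TRUE : ! flips it, so TRUE marks the values to keep

# made-up mini dataset: the row with a missing bmi is dropped
mini <- tibble(id = 1:3, bmi = x)
mini_complete <- mini |> filter(!is.na(bmi))
nrow(mini_complete)   # 2
```

This also explains where the missing bmi values come from: bmi is NA whenever height or mass is NA.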

💥 Week 2 In-class Exercises - Q2 💥

Session ID: bua455s25

Fill in the blanks to make this sentence correct:

The select command is used to select ___ (or deselect them) and can be used to ___ them.

Comparing `slice` and `filter`

In both examples below, three variables of the my_starwars data are selected.

#|label: slice by row number
# slice data to first 5 rows and last 5 rows
(my_starwars_sliced <- my_starwars |>
  select(name, height, species) |>
  slice(1:5, 83:87))

# A tibble: 10 × 3
   name           height species
   <chr>           <int> <chr>  
 1 Luke Skywalker    172 Human  
 2 C-3PO             167 Droid  
 3 R2-D2              96 Droid  
 4 Darth Vader       202 Human  
 5 Leia Organa       150 Human  
 6 Finn               NA Human  
 7 Rey                NA Human  
 8 Poe Dameron        NA Human  
 9 BB8                NA Droid  
10 Captain Phasma     NA Human

#|label: filter by height
# filter data to heights >= 200 cm
(my_starwars_tall <- my_starwars |> 
  select(name, height, species) |>
  filter(height >= 200))

# A tibble: 11 × 3
   name         height species 
   <chr>         <int> <chr>   
 1 Darth Vader     202 Human   
 2 Chewbacca       228 Wookiee 
 3 IG-88           200 Droid   
 4 Roos Tarpals    224 Gungan  
 5 Rugor Nass      206 Gungan  
 6 Yarael Poof     264 Quermian
 7 Lama Su         229 Kaminoan
 8 Taun We         213 Kaminoan
 9 Grievous        216 Kaleesh 
10 Tarfful         234 Wookiee 
11 Tion Medon      206 Pau'an
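
The key difference is that slice works by row position while filter works by value. A small sketch with a made-up dataset (mini) shows this; it also uses arrange, a dplyr command for sorting rows that has not been covered yet, only to change the row order.

```r
library(dplyr)

# made-up mini dataset, for illustration only
mini <- tibble(name = c("A", "B", "C", "D"), height = c(210, 150, NA, 205))

mini_sliced <- mini |> slice(1:2)              # rows 1 and 2, whatever their values
mini_tall   <- mini |> filter(height >= 200)   # rows whose height is at least 200

# sorting first changes what slice returns
mini_sorted_sliced <- mini |> arrange(height) |> slice(1:2)

mini_sliced$name          # "A" "B"
mini_tall$name            # "A" "D"  (the row with a missing height is dropped)
mini_sorted_sliced$name   # "B" "D"
```

Note that filter drops rows where the condition cannot be evaluated because the value is NA, which is why the characters with missing heights in the sliced output do not appear in the filtered output.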

Why do these data management tasks?

Filtering data to a subset by value
Slicing data by row number
Selecting variables
Removing missing values
Creating new variables

These are among the most common tasks performed on raw data to make it usable.

Uses for Managed, Usable Data

Usable data can communicate information. It can be:

summarized in a table for presentation.
visualized in a plot.
analyzed using statistical models.
presented or published.

In the next demonstration we review

creating a simple plot from managed data.
formatting the plot for presentation.

Save Plot
fill Option
Theme
R code
Final Plot

Plot is saved as sw_box_1

Plot is NOT printed in this column.

#|label: save sw plot
sw_box_1 <- my_starwars_plot_dat |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi))

Code is hidden in this column.

Unformatted plot is shown.

Hidden code chunk calls plot by name:

sw_box_1

Plot is saved as sw_box_2

Plot is NOT printed in this column.

#|label: plot with fill option
sw_box_2 <- my_starwars_plot_dat |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sex))

Code is hidden in this column.

Hidden code chunk calls plot by name:

sw_box_2

Plot is saved as sw_box_3

Plot is NOT printed in this column.

#|label: plot with theme
sw_box_3 <- my_starwars_plot_dat |> 
  ggplot() +
  geom_boxplot(aes(x=species,y=bmi,fill=sex)) +
  theme_classic()

Code is hidden in this column.

Hidden code chunk calls plot by name:

sw_box_3

Previous plot code from sw_box_3 is on lines 9 - 12.

The rest of the code above and below includes formatting details.

#|label: final complete plot code 
#| code-line-numbers: "9-12"

my_starwars_plot_dat <- my_starwars_plot_dat |>
  mutate(sexF = factor(sex,                                    # create factor variable, sexF
                       levels = c("male", "female", "none"),   # specify order (levels)
                       labels =c("Male", "Female", "None")))   # specify labels

sw_box_final <- my_starwars_plot_dat |>
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sexF)) + 
  theme_classic() + 
  labs(title="Comparison of Human and Droid BMI",              # labs specifies text labels
       subtitle="22 Humans and 4 Droids from Star Wars Universe",
       caption="Data Source: dplyr package in R",
       x="",y="BMI", fill="Sex") + 
  theme(plot.title = element_text(size = 20),                  # theme formats plot elements
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 15),
        panel.border = element_rect(colour = "lightgrey", fill=NA, linewidth=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
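
The factor step at the top of this code controls the order and wording of the legend. A minimal sketch on a short made-up vector shows what levels and labels each do.

```r
sex <- c("male", "none", "female", "male")

sexF <- factor(sex,
               levels = c("male", "female", "none"),   # order used in legends and tables
               labels = c("Male", "Female", "None"))   # text shown for each level

levels(sexF)         # "Male" "Female" "None"
table(sexF)          # counts appear in level order, not alphabetical order
as.character(sexF)   # "Male" "None" "Female" "Male"
```

Without the factor step, ggplot would order the legend alphabetically (female, male, none) and show the lowercase text.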

Showing Plots in A 2x2 Grid

Another common presentation task is to show a grid, row, or column of plots.
The final plot is simplified for display in the 2x2 grid; the simplified version, sw_box4_grid, is created in code not shown here.

#|label: 4 plots in a grid 
grid.arrange(sw_box_1, sw_box_2, sw_box_3, sw_box4_grid, ncol=2)
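
The grid.arrange command comes from the gridExtra package. The sketch below shows the same idea with two simple plots made from the built-in mtcars data, so it does not depend on the Star Wars plots; p1, p2 and grid_2 are made-up names, and it assumes ggplot2 and gridExtra are installed.

```r
library(ggplot2)
library(gridExtra)   # provides grid.arrange

# two simple plots from the built-in mtcars data
p1 <- ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg))
p2 <- p1 + theme_classic()

# ncol sets the number of columns; plots fill the grid row by row
grid_2 <- grid.arrange(p1, p2, ncol = 2)   # side by side

# nrow = 2 would stack the same plots in one column instead
```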

Exporting a Plot: Two Methods

Method 1 (quick for one plot):

Right click on plot on right
Select ‘Save image as…’
Save image as .png file (or other preference) to img folder

Method 2: (ideal for multiple plots)

Use ggsave command
Defaults to last plot displayed
In the code shown, I override the default to specify the plot I want.

#|label: export final plot 
ggsave("img/StarWars_BMI_Boxplot.png", 
       plot=sw_box_final, width = 8, height = 6)

sw_box_final
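
To try ggsave without touching your project folders, the sketch below saves a simple plot to R's temporary folder. The plot (p) and file name are made up for this example; in your own project you would save to the img folder as shown above.

```r
library(ggplot2)

p <- ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg))

# hypothetical file name in a temporary folder
out_file <- file.path(tempdir(), "example_boxplot.png")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 8, height = 6)

file.exists(out_file)   # confirm that the file was written
```

Naming the plot in the plot argument is the safer habit, because the default (the last plot displayed) changes every time you print a different plot.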

Creating a README File

In HW 2 you will:

create and modify an R Quarto (.qmd) file.
render the Quarto file to create an HTML file.
create a README file.

A README file documents all files in your R project.

README files can be simple or complex.
BUA 455 will use one README file format.

Editing a README.txt file

in RStudio: File > Open File > click on file
in Notepad (Windows OS) or TextEdit (Mac OS)

💥 Week 2 In-class Exercises 💥

Session ID: bua455s25

Question 3. What type of file should the README file be saved as?

Question 4. We will use the pacman R package in every lecture and assignment because it simplifies installing and loading other packages.

There is a package suite that includes both dplyr and ggplot2 that we will use in every lecture, assignment, and quiz in BUA 455.

The name of this package suite is ____.

Introduction to HW 2

In class we will work through HW 2 Instructions

Students are encouraged but not required to collaborate with classmates at least once for HW 2, HW 3, or HW 4.

All students are responsible for understanding all coding.

Collaboration Options

An easy collaboration option for now:
- Each student makes their own R Quarto project.
- Students can email or share .qmd files in a cloud drive.
Posit Cloud allows for “google-drive” style collaboration on an R project.
- Posit Cloud will be used for the course projects.
GitHub is used for collaboration and version control in more advanced courses and other disciplines (Not required in this course).

Homework 2 NOTES

This assignment will not take too long but it includes important data management details.
Completing this assignment will allow you to practice the skills covered in class in Weeks 1 and 2.
Your completed assignment will create an HTML file with a clickable Table of Contents.
- This HTML will provide concise notes on the code and concepts we have covered.
- OPTIONAL: You can also render HW 1 for your notes.
NOTE: I provide code for you to copy, paste and MODIFY BUT you are responsible for reviewing and understanding this code before Quiz 1.
I will also provide a set of short demo videos that guide you through the assignment.
The remainder of this week’s lecture time will be devoted to working through this assignment.

Homework 2 INSTRUCTIONS

Instructions HTML File

Key Points from This Week

Building on R Project Skills:

Creating and Managing R Projects
- Creating, Editing, Rendering Quarto Files
- Documenting files in a README.txt file

Data Management and Visualization:

Review of glimpse, unique, table, summary
dplyr commands: select, filter, mutate, slice
Useful operators and functions: !, %in%, c(), is.na
Intro to plotting data with ggplot

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.
