Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "…
$ height <int> 172, 167, 96, 202, …
$ mass <dbl> 77, 75, 32, 136, 49…
$ hair_color <chr> "blond", NA, NA, "n…
$ skin_color <chr> "fair", "gold", "wh…
$ eye_color <chr> "blue", "yellow", "…
$ birth_year <dbl> 19.0, 112.0, 33.0, …
$ sex <chr> "male", "none", "no…
$ gender <chr> "masculine", "mascu…
$ homeworld <chr> "Tatooine", "Tatooi…
$ species <chr> "Human", "Droid", "…
$ films <list> <"A New Hope", "Th…
$ vehicles <list> <"Snowspeeder", "I…
$ starships <list> <"X-wing", "Imperi…
Week 2
Introduction to `dplyr` commands and `ggplot`
RStudio Global General Options
RStudio Markdown Global Options
RStudio Appearance Global Options
RStudio Code Options
There are many code options that can be changed.
I recommend maintaining the defaults until you know what you want to change and why.
There are, however, three options under Syntax in the Display tab.
Checking all three of these options makes it easier to write and proofread your code.
Reminders:
Assignment Reminders
Pre-class Survey Due 9/3/25
HW 1 (Parts 1, 2, and 3) Due 9/3/25
Week 1 - File Management
Creating a Quarto Project
Quarto project automatically creates a Quarto (`.qmd`) file.
Adding `data` and `img` folders to your project.
Creating and editing a `setup` chunk in your Quarto file.
Creating and editing code chunks.
Week 1 - Data Management
Selecting data by rows and columns with square brackets
Examining data with R commands: `glimpse`, `summary`, `unique`, `table`
Types of variables
numeric variables (`<dbl>`, `<int>`)
categorical variables (`<chr>`, `<fct>`, `<ord>`)
Type of variable dictates how we examine, summarize and present the data
Using piping, `|>`, to write R code more efficiently.
Using the `c()` function to create a group of values.
Using `$` or `pull` or `select` to specify a variable within a dataset.
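As a quick refresher, here is a minimal sketch of the last three items. The `heights` dataset is a hypothetical toy example, not one of the course files, and the sketch assumes the `dplyr` package is installed.

```r
library(dplyr)

# Hypothetical toy dataset (not from the course files)
heights <- tibble(
  name   = c("A", "B", "C"),    # c() combines values into one vector
  height = c(150, 172, 202)
)

heights$height               # $ returns the variable as a vector
heights |> pull(height)      # pull() does the same inside a pipe
heights |> select(height)    # select() keeps the variable as a one-column dataset
```

`$` and `pull` both return a plain vector, while `select` returns a dataset, which matters when the next command expects one or the other.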
Additional R syntax
Operators are used to filter data or create new variables
For example:
Filter a dataset of heights to heights <= 6 feet
Filter a dataset of cars to exclude SUVs
Operators in R is a good reference for some of the common operators used for data management in R.
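A minimal sketch of the first example, filtering heights to 6 feet or less with the `<=` operator. The `people` dataset and its `height_ft` variable are hypothetical and only for illustration; the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy dataset of heights in feet
people <- tibble(
  person    = c("P1", "P2", "P3", "P4"),
  height_ft = c(5.5, 6.0, 6.3, 5.9)
)

# Keep only rows where height is less than or equal to 6 feet
six_ft_or_less <- people |>
  filter(height_ft <= 6)

six_ft_or_less
```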
Week 2 In-class Exercises - Q1
Use the Operators in R reference link to find the operator that is put before `=` to indicate not equal to.
This same operator can be put before any value, e.g., `X`, to indicate not `X`.
Introduction to dplyr
Recall the `starwars` data from Week 1.
Online dataset documentation
Original Data
Modified Data
Data Mgmt for a Boxplot Visualization
In Week 1, we previewed some data management of the `starwars` data for a boxplot visualization.
Today we will examine each data management step in the subsequent panels of this slide.
```{r}
#| label: starwars-data-management
# select, filter, and mutate commands are part of the tidyverse suite
# bmi = weight(kg) / height(m)^2; height is recorded in cm, so divide by 100
my_starwars_plot_dat <- my_starwars |>           # my_starwars_plot_dat created for plot
  select(species, sex, height, mass) |>          # select specific variables
  filter(species %in% c("Human", "Droid")) |>    # filter data to humans and droids only
  mutate(bmi = mass / (height / 100)^2) |>       # use mutate to create new variable, bmi
  filter(!is.na(bmi))                            # filter data to remove missing BMI values
```
Use the `select` command in the `dplyr` package to select variables.
The `select` command also orders the variables as written in the command.
We save this dataset with fewer variables as a NEW dataset, `my_starwars_plot_dat`.
`glimpse` is NOT required at each step but we will use it here to examine the dataset modifications.
Rows: 87
Columns: 4
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex <chr> "male", "none", "none", "male", "female", …
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass <dbl> 77, 75, 32, 136, 49, 120, 75, 32, 84, 77, …
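A `glimpse` like the one above could be produced with a step along these lines. This is a sketch, not the exact slide code: it assumes `dplyr` is installed and uses the built-in `starwars` data as a stand-in for the course's modified `my_starwars` data. The name `sw_selected` is hypothetical.

```r
library(dplyr)

# Assumption: dplyr's built-in starwars stands in for my_starwars
sw_selected <- starwars |>
  select(species, sex, height, mass)   # keeps 4 variables, in this order

glimpse(sw_selected)                   # examine the result
```

The number of rows is unchanged because `select` only affects columns.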
The `filter` command in the `dplyr` package is one common way to filter data.
Datasets can be filtered by numeric values, character (text) values, or factor levels.
A very useful operator for selecting data from specific categories is `%in%`, which means "contained in".
Rows: 41
Columns: 4
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex <chr> "male", "none", "none", "male", "female", …
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0…
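The filtering step can be sketched as follows, again assuming `dplyr` is installed and using the built-in `starwars` data as a stand-in for `my_starwars`. The name `sw_subset` is hypothetical.

```r
library(dplyr)

# %in% asks, for each value on the left, "is it contained in the set on the right?"
"Droid" %in% c("Human", "Droid")

# Assumption: dplyr's built-in starwars stands in for my_starwars
sw_subset <- starwars |>
  select(species, sex, height, mass) |>
  filter(species %in% c("Human", "Droid"))   # keep humans and droids only

glimpse(sw_subset)
```

Rows where `species` is missing are dropped too, because `NA %in% c("Human", "Droid")` is `FALSE`.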
The `mutate` command in the `dplyr` package can be used to create a new variable.
New variables can be created from other variables or can overwrite existing variables (be careful).
We will use `mutate` for many varied tasks throughout this course.
Rows: 41
Columns: 5
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",…
$ sex <chr> "male", "none", "none", "male", "female", "male", "female", "n…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 180,…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, …
$ bmi <dbl> 26.02758, 26.89232, 34.72222, 33.33007, 21.77778, 37.87401, 27…
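To see how `mutate` computes `bmi`, here is a small sketch using only the first three height and mass values printed above. The `first_three` dataset is built by hand for illustration; the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# First three height (cm) and mass (kg) values from the output above
first_three <- tibble(
  height = c(172L, 167L, 96L),
  mass   = c(77, 75, 32)
)

# bmi = mass (kg) / height (m)^2; dividing by 100 converts cm to m
first_three <- first_three |>
  mutate(bmi = mass / (height / 100)^2)

first_three
```

For the first row this works out to 77 / 1.72^2, which matches the first `bmi` value shown in the `glimpse` output.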
A common task in data management is removing missing values.
In R, missing values are denoted as `NA`.
Missing values can be filtered out using `filter`, the command `is.na`, and the operator `!`.
Rows: 24
Columns: 5
$ species <chr> "Human", "Droid", "Droid", "Human", "Human…
$ sex <chr> "male", "none", "none", "male", "female", …
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183,…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0…
$ bmi <dbl> 26.02758, 26.89232, 34.72222, 33.33007, 21…
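The three pieces fit together as sketched below. The `toy` dataset is hypothetical and the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy dataset with one missing BMI value
toy <- tibble(
  name = c("A", "B", "C"),
  bmi  = c(26.0, NA, 34.7)
)

is.na(toy$bmi)     # TRUE where bmi is missing
!is.na(toy$bmi)    # ! flips it: TRUE where bmi is NOT missing

# filter keeps the rows where the condition is TRUE
toy_complete <- toy |>
  filter(!is.na(bmi))

toy_complete
```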
Week 2 In-class Exercises - Q2
Comparing `slice` and `filter`
In both examples below, three variables of the `my_starwars` data are selected.
# A tibble: 10 × 3
   name           height species
   <chr>           <int> <chr>
 1 Luke Skywalker    172 Human
 2 C-3PO             167 Droid
 3 R2-D2              96 Droid
 4 Darth Vader       202 Human
 5 Leia Organa       150 Human
 6 Finn               NA Human
 7 Rey                NA Human
 8 Poe Dameron        NA Human
 9 BB8                NA Droid
10 Captain Phasma     NA Human
# A tibble: 11 × 3
   name         height species
   <chr>         <int> <chr>
 1 Darth Vader     202 Human
 2 Chewbacca       228 Wookiee
 3 IG-88           200 Droid
 4 Roos Tarpals    224 Gungan
 5 Rugor Nass      206 Gungan
 6 Yarael Poof     264 Quermian
 7 Lama Su         229 Kaminoan
 8 Taun We         213 Kaminoan
 9 Grievous        216 Kaleesh
10 Tarfful         234 Wookiee
11 Tion Medon      206 Pau'an
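The general difference between the two commands can be sketched on a small hypothetical dataset. This is an illustration only, not the code behind the two outputs above, and it assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy dataset
toy <- tibble(
  name   = c("A", "B", "C", "D", "E"),
  height = c(172, 202, 96, 228, NA)
)

# slice picks rows by POSITION, regardless of their values
by_position <- toy |> slice(1:2)

# filter picks rows by VALUE; rows where the condition is NA are dropped
by_value <- toy |> filter(height >= 200)

by_position
by_value
```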
Why do these data management tasks?
Filtering data to a subset by value
Slicing data by row number
Selecting variables
Removing missing values
Creating new variables
These are among the most common tasks performed on raw data to make it usable.
Uses for Managed, Usable Data
Usable data can communicate information:
- can be summarized in a table for presentation.
- can be visualized in a plot.
- can be analyzed using statistical models.
- can be presented or published.
In the next demonstration we review
- creating a simple plot from managed data.
- formatting the plot for presentation.
Previous plot code from `sw_box_3` is on lines 9 - 12.
The rest of the code above and below includes formatting details.
```{r}
#| label: final-complete-plot-code
#| code-line-numbers: "9-12"
my_starwars_plot_dat <- my_starwars_plot_dat |>
  mutate(sexF = factor(sex,                                    # create factor variable, sexF
                       levels = c("male", "female", "none"),   # specify order (levels)
                       labels = c("Male", "Female", "None")))  # specify labels
sw_box_final <- my_starwars_plot_dat |>
  ggplot() +
  geom_boxplot(aes(x = species, y = bmi, fill = sexF)) +
  theme_classic() +
  labs(title = "Comparison of Human and Droid BMI",            # labs specifies text labels
       subtitle = "22 Humans and 4 Droids from Star Wars Universe",
       caption = "Data Source: dplyr package in R",
       x = "", y = "BMI", fill = "Sex") +
  theme(plot.title = element_text(size = 20),                  # theme formats plot elements
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 15),
        panel.border = element_rect(colour = "lightgrey", fill = NA, linewidth = 2),
        plot.background = element_rect(colour = "darkgrey", fill = NA, linewidth = 2))
```
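The `factor` call in the code above controls both the order and the display text of the legend categories. A minimal base R sketch of what `levels` and `labels` do, using a short hypothetical `sex` vector:

```r
# Hypothetical character vector, in no particular order
sex <- c("none", "male", "female", "male")

sexF <- factor(sex,
               levels = c("male", "female", "none"),   # order of the categories
               labels = c("Male", "Female", "None"))   # text shown for each category

levels(sexF)         # categories in the specified order, with the new labels
as.character(sexF)   # each value is relabeled; the data order is unchanged
```

Without `levels`, R orders the categories alphabetically, which is often not the order wanted in a plot legend.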
Showing Plots in A 2x2 Grid
- Another common presentation task is to show a grid, row, or column of plots.
- Final plot is simplified for showing in the 2x2 grid
Exporting a Plot: Two Methods
Method 1 (quick for one plot):
- Right click on plot on right
- Select ‘Save image as…’
- Save image as .png file (or other preference) to
img
folder
Method 2: (ideal for multiple plots)
- Use `ggsave` command
- Defaults to last plot displayed
- In the code shown, I override the default to specify the plot I want.
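A minimal sketch of Method 2, assuming `ggplot2` is installed. The plot `demo_plot`, the file name, and the width and height values are hypothetical choices for illustration, and R's built-in `mtcars` data is used so the example runs on its own.

```r
library(ggplot2)

# Hypothetical example plot built from R's built-in mtcars data
demo_plot <- ggplot(mtcars) +
  geom_boxplot(aes(x = factor(cyl), y = mpg))

# Make sure the img folder exists in the project
dir.create("img", showWarnings = FALSE)
out_path <- file.path("img", "demo_plot.png")

# plot = overrides the default (last plot displayed); width and height are in inches
ggsave(filename = out_path, plot = demo_plot, width = 6, height = 4)
```

Naming the plot explicitly with `plot =` avoids saving the wrong plot when several plots are created in the same file.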
Creating a README File
In HW 2 you will:
create and modify an R Quarto (`.qmd`) file.
render the Quarto file to create an HTML file.
create a README file.
A README file documents all files in your R project.
README files can be simple or complex.
BUA 455 will use one README file format.
Editing a README.txt file
in RStudio: File > Open File > click on file
in Notepad (Windows OS) or TextEdit (Mac OS)
Week 2 In-class Exercises
Session ID: bua455s25
Question 3.
What type of file should the README file be saved as?
Question 4.
We will use the `pacman` R package in every lecture and assignment because it simplifies installing and loading other packages.
There is a package suite that includes both `dplyr` and `ggplot2` that we will use in every lecture, assignment, and quiz in BUA 455.
The name of this package suite is ____.
Introduction to HW 2
In class we will work through HW 2 Instructions
Students are encouraged, but not required, to collaborate with classmates at least once for HW 2, HW 3, or HW 4.
- All students are responsible for understanding all coding.
Collaboration Options
An easy collaboration option for now:
Each student makes their own R Quarto project.
Students can email or share .qmd files in a cloud drive.
Posit Cloud allows for “google-drive” style collaboration on an R project.
- Posit Cloud will be used for the course projects.
GitHub is used for collaboration and version control in more advanced courses and other disciplines (Not required in this course).
Homework 2 NOTES
This assignment will not take too long but it includes important data management details.
Completing this assignment will allow you to practice the skills covered in class in Weeks 1 and 2.
Your completed assignment will create an HTML file with a clickable Table of Contents.
This HTML will provide concise notes on the code and concepts we have covered.
OPTIONAL: You can also render HW 1 for your notes.
NOTE: I provide code for you to copy, paste and MODIFY BUT you are responsible for reviewing and understanding this code before Quiz 1.
I will also provide a set of short demo videos that guide you through the assignment.
The remainder of this week’s lecture time will be devoted to working through this assignment.
Homework 2 INSTRUCTIONS
Key Points from This Week
Building on R Project Skills:
Creating and Managing R Projects
Creating, Editing, Rendering Quarto Files
Documenting files in a `README.txt` file
Data Management and Visualization:
Review of `glimpse`, `unique`, `table`, `summary`
`dplyr` commands: `select`, `filter`, `mutate`, `slice`
Useful operators and functions: `!`, `%in%`, `c()`, `is.na`
Intro to plotting data with `ggplot`
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.