R Projects, R Quarto Files, and R Syntax
Data Management is a public-facing aspect of all industries.
All students must have the following to complete this course:
Hardware
Personal laptop with Windows or Mac OS required
Laptop can’t be a loaner.
Chromebooks and iPads are insufficient.
Ideally should have a minimum of 4 GB RAM and 256 GB of storage (more preferred)
Note: Temporary laptop issues of one week or less can be managed using a cloud option, but you cannot complete this course without your own laptop.
Software:
Install the latest versions of R, RStudio, and Quarto.
Uninstall previous versions and reinstall latest version, if needed.
If you have trouble installing or reinstalling R, RStudio, or Quarto, the TAs or I will help you.
You are responsible for maintaining up-to-date software on your laptop.
All of these software environments are open-source and post updates every few months.
Students should always do their best to update their software as soon as possible.
Course Website Includes Links to:
Syllabus
Schedule of Assignments and Project Deadlines
Lecture Slides and PDFs
Homework Assignments
If substantial changes are made, I will post an announcement on Blackboard.
Installation Demos for Windows
Creating a new project with correct file structure (Required for all assignments.)
Using Quarto Files
More to come as semester progresses.
I updated all aspects of this course in 2024.
I updated all demo videos in 2024.
This semester I will update videos as needed.
This Blackboard Assignment counts as 20% of HW Assignment 1.
Six questions including questions about:
some course policies from the syllabus.
the hardware requirements for this course (not negotiable).
the current versions of R and RStudio.
Each version of R is given a unique number to differentiate them.
What is the current version of R?
R.version.string
RStudio.Version()$version
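As a small sketch of the version check above: R.version.string is part of base R, while RStudio.Version() is only defined when code runs inside RStudio. The exists() guard below is an addition for illustration, not part of the lecture code.

```r
# Print the version of R that is currently running (base R, works anywhere)
r_version <- R.version.string
print(r_version)

# RStudio.Version() exists only inside an RStudio session,
# so check for it before calling (this guard is an illustrative assumption)
if (exists("RStudio.Version")) {
  print(RStudio.Version()$version)
} else {
  message("Not running inside RStudio; RStudio version not available.")
}
```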
Introduced today and covered in-depth in Lecture 2
During Lecture 1 we will:
Create a folder where all coursework will be stored.
Create a Quarto Project for HW 1 with the file structure required for all work in this course.
The project includes a Quarto (.qmd) file.
Create data and img folders within the project.
Modify the Quarto (.qmd) file header and create a setup code block.
During Lecture 2 we will:
Create, edit, and run R code blocks in our Quarto (.qmd) file.
Use provided code to export a plot to the img folder.
When work in R is completed students will:
Answer Blackboard questions based on R output.
Submit the zipped HW 1 R project with your name, the correct file structure, and the edited .qmd file.
Instructions for HW 1 - Parts 2 & 3 are on the BUA Course Website Assignments page.
We will use the same project structure for assignments in this course.
These steps will also be shown in a demo video.
Create a folder named BUA 455 on your desktop or somewhere convenient.
Open RStudio, which also opens R, and select:
File > New Project... > New Directory > Quarto Project
Create project as HW 1 <Your Full Name> in your desktop BUA 455 folder.
See example on next slide.
Every R Project in this course will have two additional folders within the project folder.
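In class the two additional folders (data and img) are created by hand, but the same structure can be sketched in code. This is a minimal illustration only: the project name below is hypothetical, and a temporary directory stands in for your BUA 455 folder.

```r
# Hypothetical project location used only for this demonstration
project_dir <- file.path(tempdir(), "HW 1 Example Student")
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Create the two additional folders inside the project folder
for (folder in c("data", "img")) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# List the folders now inside the project folder
project_folders <- list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
print(project_folders)
```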
Creating a Quarto project also creates a basic Quarto (.qmd) file with the same name as the project.
Soon we will learn how to create these files.
First, let's edit this file header so it is more functional.
Here is the unedited file header.
Replace the unedited header with this text which is also included in HW 1 Instructions.
This header will add a table of contents and provide options for seeing or hiding code.
---
title: "HW 1 Penelope Pooler"
date: last-modified
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---
In source view, we will delete the default text and code block so the file is blank below the header.
Then we will add this text:
## Setup
- This code block (code chunk) will include code to specify some options for the whole document.
- It will also install and load required packages using the pacman package.
- The final command, p_loaded(), will list the loaded packages so we can verify that all required packages have been loaded correctly.
Quarto files are the newest type of R markdown files
Markdown files allow the user to have text, code, and output together in one document.
All Quarto files in this course will start with a code chunk labeled setup.
To create an empty code chunk:
Press Ctrl+Alt+I (Cmd+Option+I on Mac)
OR click the green C icon at the top of the file pane.
In code chunks in Quarto, options are added two ways:
using fences, #|
within the top curly brackets
You will see both methods used throughout this course.
In the setup chunk we add the label using a fence, #|, so that it appears in the printout.
In addition to specifying options, we use the setup chunk to install and load packages and set options for the whole document.
The final command, p_loaded(), shows all the packages that are loaded.
Commands are explained with comments (# followed by text).
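A setup chunk along these lines might look like the sketch below. The package list is an assumption inferred from the output shown in this lecture, and the document-wide options are omitted, so treat the HW 1 instructions as the authoritative version.

```r
#| label: setup
# (the same label written inside the curly brackets would be: {r setup})

# Install pacman only if it is missing, then load it
if (!require("pacman")) {
  install.packages("pacman")
  library(pacman)
}

# p_load() installs (if needed) and loads each listed package;
# this package list is an assumption, not the official HW 1 list
p_load(tidyverse, gridExtra, magrittr)

# List all packages that are currently loaded
p_loaded()
```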
Loading required package: pacman
[1] "magrittr" "gridExtra" "lubridate" "forcats" "stringr" "dplyr"
[7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
[13] "pacman"
We run the whole chunk of code by clicking the green arrow in the upper right corner of the chunk or by pressing Ctrl+Shift+Enter.
Output from setup should show 13 packages are loaded.
In HW 1 we will look at two R datasets to learn some basic skills.
I will also provide example code from class to
export a dataset to your data folder.
create a formatted plot and export it to your img folder.
You are not expected to learn these export commands for HW 1 or to understand this plot code yet.
The plot and export commands are provided to verify that your data and img folders are set up correctly to receive exported files.
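To give a rough idea of what exporting to these folders involves, here is a base R sketch. It is not the HW 1 code: the file names are hypothetical, a temporary directory stands in for the project folder, and the provided HW 1 code may use different functions.

```r
# Hypothetical project folder; in HW 1 the data and img folders already exist
project_dir <- file.path(tempdir(), "export_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "img"), showWarnings = FALSE)

# Export a dataset to the data folder as a .csv file
csv_path <- file.path(project_dir, "data", "my_cars.csv")
write.csv(cars, csv_path, row.names = FALSE)

# Export a simple plot to the img folder as a .png file
png_path <- file.path(project_dir, "img", "cars_plot.png")
png(png_path, width = 600, height = 400)
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Distance")
dev.off()

# Check that both files were written
file.exists(c(csv_path, png_path))
```

If either folder is missing or misnamed, the export step fails with an error, which is why running the export code is a useful check of the project structure.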
Importing and examining data
```{r}
#| label: import and examine cars data
# cars is an internal R dataset
# this code saves a copy of the cars data in the Global Environment
my_cars <- cars
# examine the dataset my_cars using glimpse
glimpse(my_cars)
# same command with piping:
# read |> as 'is sent to' or 'goes into'
my_cars |> glimpse()
```
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
$ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
$ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
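The two glimpse outputs above are identical because piping only changes how the command is written, not what it does. A minimal base R sketch of the same idea, using nrow() and head() so that no extra packages are needed (the native pipe, |>, assumes R 4.1 or later):

```r
# cars is a built-in R dataset; save a copy with a new name
my_cars <- cars

# Without piping: the dataset is the first argument of the function
rows_direct <- nrow(my_cars)

# With piping: my_cars 'is sent to' nrow()
rows_piped <- my_cars |> nrow()

# Both forms return the same value
identical(rows_direct, rows_piped)

# Pipes can be chained: my_cars is sent to head(), whose result is sent to nrow()
first_rows <- my_cars |> head(3) |> nrow()
first_rows
```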
NOTE: This question is also Question 1 of the Blackboard portion of HW 1.
cars is a dataset that is internal to the R software. The previous chunk of code saves a copy of this dataset with a new name.
Where is the copy of the cars dataset temporarily stored after running the previous chunk of code so we can work with it?
NOTE: This question is also Question 3 of the Blackboard portion of HW 1.
The my_cars dataset, which is a saved copy of the R dataset, cars, has
____ rows (observations)
____ columns (variables)
The number of rows and columns of a dataset can be seen in the Global Environment or by viewing the output from glimpse.
Datasets in R (data frames) are arranged like matrices, with rows and columns.
Using square brackets is the most basic and reliable way to select rows, columns, and observations.
For example, we created a dataset called my_cars.
Square brackets are placed after the name of the dataset.
Values BEFORE the comma in the square bracket specify selected ROW(S).
Values AFTER the comma in the square bracket specify selected COLUMN(S).
If space before or after comma is left blank, that indicates ALL
Examples:
my_cars[3,2]: observation in 3rd row and 2nd column of my_cars
my_cars[3,]: entire 3rd row of my_cars including ALL columns
my_cars[,2]: entire 2nd column of my_cars including ALL rows
speed dist
3 7 4
4 7 22
5 8 16
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
[1] 11 11 12
[1] 26 40 48
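The chunk that produced this output is not shown here. Selections like the following are consistent with it and illustrate the bracket rules above; the specific rows chosen are an assumption.

```r
my_cars <- cars  # copy of the built-in cars dataset

my_cars[3, 2]              # single observation: row 3, column 2
my_cars[3, ]               # entire 3rd row, all columns
my_cars[3:5, ]             # consecutive rows 3 through 5, all columns
my_cars[10:12, 1]          # rows 10 through 12 of column 1 (speed)
my_cars[c(20, 30, 40), 2]  # non-consecutive rows of column 2 (dist), grouped with c()
```

Note that 3:5 selects a consecutive range, while c() is needed when the rows are not consecutive.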
Students will use the above examples to add code to the chunk below as specified in Blackboard Questions 6 and 7.
We will work on Question 7 together in class.
What is the correct R command below to select and print rows 10, 15, and 20 of column 1 of the my_cars dataset to the screen?
Remove # from the incomplete R command.
Remove ____ (blank spaces) and replace with correct values.
c() is used to group non-consecutive elements.
starwars dataset
Save starwars to our Global Environment as my_starwars and examine the data.
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Da…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 1…
$ mass <dbl> 77, 75, 32, 136, 49, 120, 75, 32, 84, 7…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white",…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "bro…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47…
$ sex <chr> "male", "none", "none", "male", "female…
$ gender <chr> "masculine", "masculine", "masculine", …
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatoo…
$ species <chr> "Human", "Droid", "Droid", "Human", "Hu…
$ films <list> <"A New Hope", "The Empire Strikes Bac…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>…
Variable type dictates how variable is managed and presented.
By default, R assumes all variables are numeric or character (text) variables.
Numeric variables:
decimal, dbl
integer, int
Non-numeric variables (may appear as numbers):
Character variables, chr, are text strings
Numeric data may be classified as character variables.
Numeric character variables can be converted to numeric values.
Factors (common) and Logical variables can be created from numeric or character variables as needed.
Factor variables, fct or ord
Can be created from character variables or numeric variables
Interpreted as categorical variables by R
ord refers to ordinal or ordered factor variables
Factors are essential for plots, tables, and analyses
Logical variables, lgl
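The conversions described above can be sketched with a few toy vectors in base R. All values and variable names here are hypothetical and chosen only for illustration.

```r
# Numeric data stored as text (character)
scores_chr <- c("10", "25", "40")
class(scores_chr)                     # "character"
scores_num <- as.numeric(scores_chr)  # convert to numeric
class(scores_num)                     # "numeric"

# Factor: categorical variable created from a character variable
sex_chr <- c("male", "female", "none", "male")
sex_fct <- factor(sex_chr, levels = c("male", "female", "none"))
levels(sex_fct)

# Ordered factor (ord): the levels have a meaningful order
size_ord <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)
size_ord < "large"

# Logical variable created from a numeric variable
is_high <- scores_num > 20
class(is_high)                        # "logical"
```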
unique for Character or Factor Variables
The unique command lists all levels (categories) in a text variable.
[1] "Human" "Droid" "Wookiee" "Rodian"
[5] "Hutt" NA "Yoda's species" "Trandoshan"
[9] "Mon Calamari" "Ewok" "Sullustan" "Neimodian"
[13] "Gungan" "Toydarian" "Dug" "Zabrak"
[17] "Twi'lek" "Aleena" "Vulptereen" "Xexto"
[21] "Toong" "Cerean" "Nautolan" "Tholothian"
[25] "Iktotchi" "Quermian" "Kel Dor" "Chagrian"
[29] "Geonosian" "Mirialan" "Clawdite" "Besalisk"
[33] "Kaminoan" "Skakoan" "Muun" "Togruta"
[37] "Kaleesh" "Pau'an"
table for Character or Factor Variables
table outputs a tally of the number of observations in each variable category
Aleena Besalisk Cerean Chagrian Clawdite
1 1 1 1 1
Droid Dug Ewok Geonosian Gungan
6 1 1 1 3
Human Hutt Iktotchi Kaleesh Kaminoan
35 1 1 1 2
Kel Dor Mirialan Mon Calamari Muun Nautolan
1 2 1 1 1
Neimodian Pau'an Quermian Rodian Skakoan
1 1 1 1 1
Sullustan Tholothian Togruta Toong Toydarian
1 1 1 1 1
Trandoshan Twi'lek Vulptereen Wookiee Xexto
1 2 1 2 1
Yoda's species Zabrak
1 2
Aleena Besalisk Cerean Chagrian Clawdite
1 1 1 1 1
Droid Dug Ewok Geonosian Gungan
6 1 1 1 3
Human Hutt Iktotchi Kaleesh Kaminoan
35 1 1 1 2
Kel Dor Mirialan Mon Calamari Muun Nautolan
1 2 1 1 1
Neimodian Pau'an Quermian Rodian Skakoan
1 1 1 1 1
Sullustan Tholothian Togruta Toong Toydarian
1 1 1 1 1
Trandoshan Twi'lek Vulptereen Wookiee Xexto
1 2 1 2 1
Yoda's species Zabrak
1 2
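The commands behind the unique and table output are not shown in this section. As a hedged sketch, the same two commands applied to a small toy vector (hypothetical values, not the full starwars data) behave as follows:

```r
# Toy character variable (hypothetical values, including one missing value)
species <- c("Human", "Droid", "Droid", "Human", "Human", NA, "Wookiee")

# unique() lists each distinct value once, in order of first appearance
unique(species)

# table() tallies the observations in each category (NA is dropped by default)
species_counts <- table(species)
species_counts

# The same tally written with piping
species |> table()

# Include missing values in the tally
table(species, useNA = "ifany")
```

Note that unique reports NA as a value, while table leaves it out unless useNA is specified.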
table for examining 2 variables
```{r}
#| label: using table to summarize 2 vars
table(my_starwars$sex, my_starwars$hair_color) # $ specifies variables in dataset
```
hair_color
sex auburn auburn, grey auburn, white black blond blonde brown
female 1 0 0 3 0 1 5
hermaphroditic 0 0 0 0 0 0 0
male 0 1 1 9 3 0 11
none 0 0 0 0 0 0 0
hair_color
sex brown, grey grey none white
female 0 0 5 1
hermaphroditic 0 0 0 0
male 1 1 29 3
none 0 0 3 0
You may want to save a plot or a summary with the assignment operator, <-, and display it elsewhere in a document or website.
hair_color
sex auburn auburn, grey auburn, white black blond blonde brown
female 1 0 0 3 0 1 5
hermaphroditic 0 0 0 0 0 0 0
male 0 1 1 9 3 0 11
none 0 0 0 0 0 0 0
hair_color
sex brown, grey grey none white
female 0 0 5 1
hermaphroditic 0 0 0 0
male 1 1 29 3
none 0 0 3 0
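A minimal sketch of saving a summary and displaying it later, using a toy data frame with hypothetical values (the object name sex_hair_tbl is also an assumption):

```r
# Toy dataset (hypothetical values) with two character variables
toy <- data.frame(
  sex        = c("male", "female", "male", "none", "female"),
  hair_color = c("brown", "brown", "none", "none", "black")
)

# Save the two-way table to an object instead of printing it immediately
sex_hair_tbl <- table(toy$sex, toy$hair_color)

# Nothing is printed until the saved object is called by name
sex_hair_tbl

# The saved object can be reused, for example to look up one cell
sex_hair_tbl["female", "brown"]
```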
summary and mean are two quick ways to summarize numeric variables.
```{r}
#| label: summary and mean examples
my_starwars |> select(height) |> summary() # summarize height of starwars characters
mean(my_starwars$height, na.rm=T) # calculate the mean (without piping)
# na.rm=T is required to remove missing obs before calculation
my_starwars |> pull(height) |> mean() # mean with piping, NA's not removed
my_starwars |> pull(height) |> mean(na.rm=T) # mean with piping, NA's removed
```
height
Min. : 66.0
1st Qu.:167.0
Median :180.0
Mean :174.6
3rd Qu.:191.0
Max. :264.0
NA's :6
[1] 174.6049
[1] NA
[1] 174.6049
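The NA result above comes from the six missing heights. The same behavior can be seen on a toy vector (hypothetical values); TRUE is spelled out here, and T is shorthand for the same value.

```r
# Toy heights with one missing value (hypothetical data)
heights <- c(172, 167, 96, NA, 150)

mean(heights)                  # NA, because one value is missing
mean(heights, na.rm = TRUE)    # 146.25, missing value removed first
median(heights, na.rm = TRUE)  # 158.5
max(heights, na.rm = TRUE)     # 172

# Count how many values are missing
sum(is.na(heights))
```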
na.rm=T is used in mean, var, sd, min, max, median, etc.
The output from the glimpse(my_starwars) command lists each variable and shows its type. This dataset has
____ character variables (labeled <chr>) and
____ numeric variables (labeled <int>, <dbl>, or <num>).
NOTES:
To answer this question, examine the glimpse output and count the number of variables of each type.
RECALL: Numeric variables include both decimal (<dbl>) AND integer (<int>) variables.
The glimpse command is a newer alternative to str and will only work if the tidyverse package suite is installed and loaded.
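For comparison, str is part of base R and needs no packages. The sketch below uses the built-in iris dataset (not one of the HW 1 datasets) and also shows one possible way to tally column types in code instead of counting by eye.

```r
# str() works without loading the tidyverse
str(iris)

# Class of each column, then a tally of how many columns have each class
column_types <- sapply(iris, class)
type_counts <- table(column_types)
type_counts
```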
“The greatest value of a picture is when it forces us to notice what we never expected to see.” -John W. Tukey
This is a preview of
data management skills we will cover soon in this course
data visualization skills using ggplot, which we will use throughout this course.
This preview of plot code will be included in HW 1 for you to run.
You do not have to fully understand this code in HW 1, but you do have to run it.
The HW 1 code will
create the final plot from this week’s lecture and export it to your HW 1 img folder.
create a small summary dataset and save it to your HW 1 data folder.
Running the provided code to export the plot and table is a required part of HW 1.
my_starwars dataset:
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Da…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 1…
$ mass <dbl> 77, 75, 32, 136, 49, 120, 75, 32, 84, 7…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white",…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "bro…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47…
$ sex <chr> "male", "none", "none", "male", "female…
$ gender <chr> "masculine", "masculine", "masculine", …
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatoo…
$ species <chr> "Human", "Droid", "Droid", "Human", "Hu…
$ films <list> <"A New Hope", "The Empire Strikes Bac…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>…
The code below shows data management for a plot, with comments.
Using piping, |>, results in efficient, streamlined data management.
```{r}
#| label: starwars data management
# select, filter, and mutate commands are part of tidyverse suite
# bmi = weight(kg)/height(m)^2
my_starwars_plot_dat <- my_starwars |>          # my_starwars_plot_dat created for plot
  select(species, sex, height, mass) |>         # select specific variables
  filter(species %in% c("Human", "Droid")) |>   # filter data to humans and droids only
  mutate(bmi = mass / (height / 100)^2) |>      # use mutate to create new variable, bmi
  filter(!is.na(bmi))                           # filter data to remove missing BMI values
```
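To make the BMI step concrete, here is a base R sketch of the same calculation on the first five characters visible in the glimpse output (heights in cm, masses in kg). The data frame name toy_sw is hypothetical, and this version uses base R rather than the tidyverse commands above.

```r
# First five characters from the glimpse() output shown in this lecture
toy_sw <- data.frame(
  species = c("Human", "Droid", "Droid", "Human", "Human"),
  height  = c(172, 167, 96, 202, 150),  # cm
  mass    = c(77, 75, 32, 136, 49)      # kg
)

# Convert height from cm to m, then compute bmi = mass / height^2
toy_sw$bmi <- toy_sw$mass / (toy_sw$height / 100)^2

# Keep only rows where bmi is not missing (none are missing in this toy data)
toy_sw <- toy_sw[!is.na(toy_sw$bmi), ]

# Show the rounded BMI values
round(toy_sw$bmi, 1)
```

Dividing height by 100 matters: the formula expects meters, and the dataset stores centimeters.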
Original Data
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "…
$ height <int> 172, 167, 96, 202, …
$ mass <dbl> 77, 75, 32, 136, 49…
$ hair_color <chr> "blond", NA, NA, "n…
$ skin_color <chr> "fair", "gold", "wh…
$ eye_color <chr> "blue", "yellow", "…
$ birth_year <dbl> 19.0, 112.0, 33.0, …
$ sex <chr> "male", "none", "no…
$ gender <chr> "masculine", "mascu…
$ homeworld <chr> "Tatooine", "Tatooi…
$ species <chr> "Human", "Droid", "…
$ films <list> <"A New Hope", "Th…
$ vehicles <list> <"Snowspeeder", "I…
$ starships <list> <"X-wing", "Imperi…
Modified Data
Previous plot code from sw_box_3 is on lines 9 - 12.
The rest of the code above and below includes formatting details.
```{r}
#| label: final complete plot code
#| code-line-numbers: "9-12"
my_starwars_plot_dat <- my_starwars_plot_dat |>
  mutate(sexF = factor(sex,                                    # create factor variable, sexF
                       levels = c("male", "female", "none"),   # specify order (levels)
                       labels = c("Male", "Female", "None")))  # specify labels
sw_box_final <- my_starwars_plot_dat |>
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sexF)) +
  theme_classic() +
  labs(title="Comparison of Human and Droid BMI",              # labs specifies text labels
       subtitle="22 Humans and 4 Droids from Star Wars Universe",
       caption="Data Source: dplyr package in R",
       x="",y="BMI", fill="Sex") +
  theme(plot.title = element_text(size = 20),                  # theme formats plot elements
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 15),
        panel.border = element_rect(colour = "lightgrey", fill=NA, linewidth=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
File Management:
REQUIRED: current versions of R, RStudio, Quarto
Create a BUA 455 folder on your laptop
R Projects and Quarto files are used for all coursework
Data Management:
Examining data: glimpse, unique, table, summary
Selecting data by row and column using square brackets
Different types of variables and how to summarize them
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.