Lectures 1 and 2 and HW 1 Information

Plan for Week 1

Course Background (20 min.)

Pre-class Information

If you haven’t already read through the Pre-class Information, please do.

Pre-class survey

This is a short required survey.
Your responses are very helpful to me as I get to know you and your familiarity with R.
The last question is optional.
Completing this survey counts as 20% of HW 1.

Textbook

Lectures and HW Assignments will be augmented with text sections from Introduction to Data Science by Rafael A. Irizarry.
A list of text sections can be found here and will be updated as the course progresses.

Blackboard Tour (10 min.)

Some key features of the Blackboard site for this course

If you have a legitimate temporary reason for not attending class in person, please email me and I will allow you to attend by Zoom (but not for the whole semester).
In order for absence(s) to be excused, you must fill out this form.

Syllabus (20 min.)

The Syllabus is available as a Google Doc link on Blackboard.
If substantial changes are made, I will post an announcement on Blackboard.

Information about R and RStudio

Pre-class Information Videos are available in the Kaltura Channel for this course: BUA 455 - Spring 2022
Additional resources are in the textbook sections and LinkedIn Learning videos mentioned in the Pre-class Information.
There is also an R/RStudio Resources section on the Blackboard site for the course that includes:
- Tutorial videos (I will add more)
- A curated list of text resources I have created for BUA 455
- A curated list of video resources I have created for BUA 455

HW 1 - Part 1

This Blackboard Assignment counts as 20% of HW 1. It includes six questions about:

Some course policies from the syllabus
The hardware requirements for this course
The current version of R and RStudio

Week 1 In-class Exercises

TurningPoint Session ID: bua455s22

TP Question 1 (L1)

This is also Question 6 of HW 1 - Part 1

Each version of R is given a unique name to differentiate them. What is the current version of R (and hopefully the version on your computer) called?

Bird Hippie
Camp Pontanezen
Bunny-Wunnies Freak Out
Shake and Throw
Kick Things
Lost Library Book

See the provided tutorial video if you need to update your version of R or RStudio.

HW 1 - Parts 2 and 3

We will talk about this part of HW 1 during Lecture 2 this week. The instructions can be found here.
Information below about R Markdown, creating chunks, and examining data are relevant to HW 1.
The plot preview is a preview of week 2 material, but gives context to what we are doing in week 1.

Information about R Markdown

By default R Markdown files include example text and R code chunks.
- The default example code uses the plot command (older base R command)
- In BUA 455, we will mainly be creating plots using ggplot.
To create a NEW R Markdown file:
- Click File > New File > R Markdown…
You can click ‘Create Empty Document’ but the default text is helpful.
There is a default setup chunk, but I will provide the setup chunk for HW 1.
All R Markdown files have plain text areas.
R Markdown files can ALSO have R chunks where code can be run.
Being able to mix R Chunks and Text areas makes markdown files very versatile.
Chunks could also be used for Python, SQL, or other coding languages (Not in BUA 455).

R Chunks in R Markdown

Chunks are small, self-contained R scripts embedded within the R Markdown file.

Two ways to create a new chunk:

Click the green C at the top of the screen
Press Ctrl + Alt + I (Mac users: Cmd + Option + I)

Things to note about R Chunks:

In order for an R chunk to be functional:
- an R chunk must start with ```{r}
- an R chunk must end with ```
After the r in {r} you can put a space and then a short name (label) for the chunk
- In the default R Markdown file, the first chunk is named ‘cars’ after the dataset.
- we will cover additional options that are specified in these brackets later.
To run the chunk you are working in (the current chunk):
- click green sideways arrow on the right, or type Ctrl/Cmd + Shift + Enter
To run all chunks before the current one:
- click grey downward arrow on the right, or type Ctrl + Alt + P (Mac users: Cmd + Option + P)
To run one line, or group of lines:
- place cursor on that line and type Ctrl/Cmd + Enter
To add comments to your code (which is essential to good coding):
- Within a chunk, start lines with # if you want them to be comments
- Comments are text that R does not run but that help explain the code
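
As a minimal sketch of the points above, the lines below show what the inside of a chunk might look like. In an R Markdown file they would sit between the opening {r chunk_demo} line and the closing line of three backticks. The chunk name chunk_demo and the variable speed_limit are made up for this illustration.

```r
# This whole line is a comment: R skips it when the chunk runs
speed_limit <- 25    # a comment can also follow code on the same line

# Place the cursor on the next line and press Ctrl/Cmd + Enter to run only it
speed_limit * 2
```

Running the whole chunk with the green arrow runs both lines of code in order and skips the comments.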

Selecting data values by location

The R chunk below uses the cars data and saves it to the Global Environment.
- Saving data to your Global Environment should always be your first step.
- This is true for BOTH built-in R datasets and external datasets that you import.
The R code demonstrates how to use a pipe operator, |>.
- Piping makes data management coding much more fluid and efficient.
- This course will use piping regularly
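- An older pipe operator, %>% (from the magrittr package), appears in much existing documentation.
- For the code in this course, |> and %>% are interchangeable.
- If you get an error message when you use |>:
  - You need to update R (and RStudio, if it is also out of date); the |> operator requires R 4.1.0 or later.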
The R code below demonstrates how to select rows, columns and observations from a dataset.
- We will cover other ways to do this, but understanding the basics is useful
- Datasets in R are stored as data frames, which are rectangular tables with rows and columns.
- Locations within a dataset can be specified with square brackets
  - cars[3,2] is the observation in the 3rd row and 2nd column of the cars dataset
- There are more examples below which you will modify to answer questions in HW 1.

# note that cars is an internal R dataset
# make your own copy of the cars data in the Global Environment
# name it something different than cars
my_cars <- cars

# examine the dataset my_cars using glimpse
# glimpse() comes from the dplyr package, which must be loaded first
glimpse(my_cars)

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~

# same command with piping:
# read |> as 'is sent to' or 'goes into'
my_cars |>
  glimpse()

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~

# select rows 3, 4 and 5 only:
my_cars[3:5,]

##   speed dist
## 3     7    4
## 4     7   22
## 5     8   16

# select column 1 only:
my_cars[,1]

##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

# select observations 10, 11, and 12 within column 1:
my_cars[10:12, 1]

## [1] 11 11 12

# select observations 20, 30, and 40, within column 2:
my_cars[c(20,30,40),2]

## [1] 26 40 48
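
The short sketch below restates two ideas from this section using only base R, so it should run without loading any packages. The pipe sends the object on its left into the function on its right, and a single cell can be reached by position or by column name. The name cars_copy is an arbitrary choice for this illustration, and the pipe assumes R 4.1.0 or later.

```r
# copy the built-in dataset under a new name (cars_copy is illustrative only)
cars_copy <- cars

# these two lines do the same thing: show the first 3 rows
head(cars_copy, 3)
cars_copy |> head(3)

# one cell, three equivalent ways: row 3 of column 2 (dist)
cars_copy[3, 2]        # by row and column position
cars_copy[3, "dist"]   # by row position and column name
cars_copy$dist[3]      # pick the column with $, then take the 3rd value
```

Each of the last three lines returns 4, which matches the dist value shown for row 3 in the output above.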

Week 1 In-class Exercises

TurningPoint Session ID: bua455s22

TP Question 2 (L1)

This is also Question 1 of HW 1 - Parts 2 and 3

The cars dataset is a dataset that is internal to the R software. The code above saves a copy of this dataset with a new name.

Where is this copy dataset saved?

TP Question 1 (L2)

This is also Question 3 of HW 1 - Parts 2 and 3

This copy of the cars dataset has

____ rows (observations)
____ columns (variables)

TP Question 2 (L2)

This is also Question 7 of HW 1 - Parts 2 and 3

Complete the command my_cars[c(____,____,____), ____] to print out rows 10, 15, and 20 of column 1 of the my_cars dataset.

Rows to be selected are specified before the comma.
Columns to be selected are specified after the comma.
The last example in the provided code shows how to use c() to group non-consecutive elements.
Add your code to the chunk you created with the code above.
Replace ___ with the correct numbers in your code before submitting it in R.
Test your code to verify that it is correct.
You can click on the dataset in the Global Environment to view the full dataset and verify your work.

Different types of variables

There are two main types of data variables:
- Numeric values can be decimal values, dbl, or integer values, int
- Character values, chr, are text strings
- Sometimes numeric data is classified as a character variable (it can be converted to numeric).
- The type of variable dictates what you can do with it.
- Character and numeric data can be converted to a factor for plots, tables, and analyses
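
A small sketch of these points, using made-up values rather than course data: numbers stored as text cannot be used in arithmetic until they are converted. The vectors prices_chr and sizes_chr are hypothetical.

```r
# numbers that were read in as text (character)
prices_chr <- c("10.5", "20", "7.5")
class(prices_chr)              # "character"

# convert to numeric so arithmetic works
prices_num <- as.numeric(prices_chr)
class(prices_num)              # "numeric"
sum(prices_num)                # 38

# convert a character variable to a factor to count categories
sizes_chr <- c("small", "large", "small")
sizes_fct <- as.factor(sizes_chr)
levels(sizes_fct)              # "large" "small" (alphabetical by default)
summary(sizes_fct)             # counts per category
```
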
The starwars data has both numeric and character variables.
In the chunk below we will save the R dataset, starwars, to our Global Environment.
We will then examine the data and variables and answer questions using R commands.

# save the starwars data to your global environment
my_starwars <- starwars

# examine the data
glimpse(my_starwars)

## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

# examine the list of species
# note that to indicate a variable WITHIN a dataset we use $, the accessor operator
unique(my_starwars$species)

##  [1] "Human"          "Droid"          "Wookiee"        "Rodian"        
##  [5] "Hutt"           "Yoda's species" "Trandoshan"     "Mon Calamari"  
##  [9] "Ewok"           "Sullustan"      "Neimodian"      "Gungan"        
## [13] NA               "Toydarian"      "Dug"            "Zabrak"        
## [17] "Twi'lek"        "Vulptereen"     "Xexto"          "Toong"         
## [21] "Cerean"         "Nautolan"       "Tholothian"     "Iktotchi"      
## [25] "Quermian"       "Kel Dor"        "Chagrian"       "Geonosian"     
## [29] "Mirialan"       "Clawdite"       "Besalisk"       "Kaminoan"      
## [33] "Aleena"         "Skakoan"        "Muun"           "Togruta"       
## [37] "Kaleesh"        "Pau'an"

# examine the list of hair colors
# again, we use $ to specify hair_color within this dataset
unique(my_starwars$hair_color)

##  [1] "blond"         NA              "none"          "brown"        
##  [5] "brown, grey"   "black"         "auburn, white" "auburn, grey" 
##  [9] "white"         "grey"          "auburn"        "blonde"       
## [13] "unknown"

# do a quick summary of these data by sex and hair color
table(my_starwars$sex, my_starwars$hair_color)

##                 
##                  auburn auburn, grey auburn, white black blond blonde brown
##   female              1            0             0     3     0      1     6
##   hermaphroditic      0            0             0     0     0      0     0
##   male                0            1             1     9     3      0    11
##   none                0            0             0     0     0      0     0
##                 
##                  brown, grey grey none unknown white
##   female                   0    0    4       0     1
##   hermaphroditic           0    0    0       0     0
##   male                     1    1   29       0     3
##   none                     0    0    3       0     0

# save this table as an object
sw_gender_hrclr_smry <- table(my_starwars$sex, my_starwars$hair_color)

# summarize height of starwars characters
summary(my_starwars$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    66.0   167.0   180.0   174.4   191.0   264.0       6

# calculate the mean
mean(my_starwars$height)

## [1] NA

# mean() returns NA if any value is missing
# NAs must be excluded with na.rm = TRUE
# also true for min(), max(), median(), sd(), etc.
mean(my_starwars$height, na.rm = TRUE)

## [1] 174.358

sd(my_starwars$height, na.rm = TRUE)

## [1] 34.77043

# summarize sex
summary(my_starwars$sex)

##    Length     Class      Mode 
##        87 character character

# summarize sex using table
table(my_starwars$sex)

## 
##         female hermaphroditic           male           none 
##             16              1             60              6

# summarize sex but use as.factor
summary(as.factor(my_starwars$sex))

##         female hermaphroditic           male           none           NA's 
##             16              1             60              6              4

# save this last summary as an object
sw_sex_smry <- summary(as.factor(my_starwars$sex))

Plot Preview

Data Mgmt for a Basic Boxplot

Below is a preview of ggplot, which we will use throughout this course.
Plotting data well requires data management and planning
- The preview below ALSO includes data management skills we will cover
- comments are provided to explain what is done.
The very first plot is the bare minimum to create a boxplot.
- code shown here with and without piping

# dataset my_starwars_plt is created for the plot
# used select command to select variables
# used filter command to filter data to only two species, Human and Droid
# used mutate command to create new variable bmi
  # bmi = weight(kg)/height(m)^2
# filtered out observations where bmi was a missing value, NA
my_starwars_plt <- my_starwars |>
  select(species, sex, height, mass) |>
  filter(species %in% c("Human", "Droid")) |>
  mutate(bmi = mass / (height / 100)^2) |>
  filter(!is.na(bmi)) 

# most basic UNSAVED boxplot
# output to screen only
my_starwars_plt |> ggplot() +
  geom_boxplot(aes(x=species, y=bmi))

# same basic unsaved plot without piping
ggplot(my_starwars_plt) +
  geom_boxplot(aes(x=species, y=bmi))
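
To make the bmi calculation in the mutate step concrete, the sketch below works it through by hand for the first character in the glimpse output above (height 172 cm, mass 77 kg). The variable names are chosen for this illustration only.

```r
# values for the first row shown by glimpse(my_starwars)
height_cm <- 172
mass_kg   <- 77

# convert height from centimeters to meters, then apply bmi = kg / m^2
height_m <- height_cm / 100
bmi_demo <- mass_kg / height_m^2
round(bmi_demo, 1)   # 26
```

Because ^ is evaluated before /, only the height is squared, which is what the formula requires.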

Improving a Plot (Demo)

The four plots below are saved as objects in the Global Environment
- Saved plots don’t print to the screen by default
- Enclose set of plot commands in parentheses to print to screen
sw_boxplot1 is the basic plot.
sw_boxplot2 is the basic plot with data shown by sex
sw_boxplot3 removes the default background
sw_boxplot4 formats all plot text and moves legend to bottom
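
The parentheses behavior noted above is not specific to plots. As a minimal illustration with an ordinary number (total_demo is a made-up name):

```r
# assignment alone saves the value but prints nothing
total_demo <- 2 + 3

# wrapping the assignment in parentheses saves AND prints the value
(total_demo <- 2 + 3)

# typing the object's name also prints it
total_demo
```

The same pattern applies to a saved plot: the assignment stores it, and the surrounding parentheses display it.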

# most basic saved boxplot
# data are piped into the plot

# this plot is saved as sw_boxplot1 but won't appear on screen
sw_boxplot1 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi))

# same plot as above enclosed in parentheses so it appears on screen
(sw_boxplot1 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi)))

# same plot but with sex included (fill=sex)
(sw_boxplot2 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sex)))

# same plot but with sex included (fill=sex)
# also default background removed by changing theme to theme_classic
(sw_boxplot3 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sex))+
  theme_classic())

# created a factor variable sexF
# modified order (levels) and labels for plot
my_starwars_plt <- my_starwars_plt |>
  mutate(sexF = factor(sex, 
                       levels = c("male", "female", "none"),
                       labels =c("Male", "Female", "None")))

# used new factor variable
# moved legend to the bottom
# formatted titles, captions, axis labels with labs command
(sw_boxplot4 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sexF)) +
  theme_classic() + 
  theme(legend.position="bottom") +
  labs(title="Comparison of Human and Droid BMI",
       subtitle="22 Humans and 4 Droids from Star Wars Universe",
       caption="Data Source: dplyr package in R",
       x="",y="BMI", fill="Sex"))
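
The factor() step in the chunk above controls both the order and the display text of the categories. A base R sketch on a made-up vector, sex_demo, shows the effect without needing the plot:

```r
# a small made-up character vector
sex_demo <- c("none", "male", "female", "male")

# default factor: levels are sorted alphabetically
levels(factor(sex_demo))   # "female" "male" "none"

# choose the order with levels and the display text with labels
sex_demo_f <- factor(sex_demo,
                     levels = c("male", "female", "none"),
                     labels = c("Male", "Female", "None"))
levels(sex_demo_f)         # "Male" "Female" "None"
table(sex_demo_f)          # counts appear in the chosen order
```
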

Presenting a grid of plots together

One reason to save plots is to use them later in a composed presentation.
The 2 x 2 grid below shows all four of the previous plots together.

# create 2x2 grid of boxplots
# shows transition from first draft to final plot
# could also change colors
# grid.arrange() comes from the gridExtra package, which must be loaded first
grid.arrange(sw_boxplot1, sw_boxplot2, sw_boxplot3, sw_boxplot4, ncol=2)

HW 1 (Parts 1, 2, and 3) is Due 2/2/2022