Plan for Week 1

Course Background (20 min.)

Pre-class Information

If you haven’t already read through the Pre-class Information, please do.

 

Pre-class survey

  • This is a short required survey.

  • Your responses are very helpful to me as I get to know you and your familiarity with R.

  • The last question is optional.

  • Completing this survey counts as 20% of HW 1.

     

Textbook

  • Lectures and HW Assignments will be augmented with text sections from Introduction to Data Science by Rafael A Irizarry.

  • A list of text sections can be found here and will be updated as the course progresses.

     

Blackboard Tour (10 min.)

Some key features of the Blackboard site for this course

  • If you have a legitimate temporary reason for not attending class in person, please email me and I will allow you to attend by Zoom (this option is not available for the whole semester).

  • In order for absence(s) to be excused, you must fill out this form.

     

Syllabus (20 min.)

  • The Syllabus is available as a Google Doc link on Blackboard.

  • If substantial changes are made, I will post an announcement on Blackboard.

     

Information about R and R Studio

  • Pre-class Information Videos are available in the Kaltura Channel for this course: BUA 455 - Spring 2022

  • Additional resources are in the textbook sections and LinkedIn Learning videos mentioned in the Pre-class Information.

  • There is also an R/RStudio Resources section on the Blackboard site for the course.

HW 1 - Part 1

This Blackboard Assignment counts as 20% of HW 1. It includes six questions about:

  • Some course policies from the syllabus
  • The hardware requirements for this course
  • The current version of R and RStudio

Week 1 In-class Exercises

TurningPoint Session ID: bua455s22

TP Question 1 (L1)

This is also Question 6 of HW 1 - Part 1

Each version of R is given a unique name to differentiate them. What is the current version of R (and hopefully the version on your computer) called?

  • Bird Hippie
  • Camp Pontanezen
  • Bunny-Wunnies Freak Out
  • Shake and Throw
  • Kick Things
  • Lost Library Book
See the provided tutorial video if you need to update your version of R or RStudio.

 


HW 1 - Parts 2 and 3

  • We will talk about this part of HW 1 during Lecture 2 this week. The instructions can be found here.

  • The information below about R Markdown, creating chunks, and examining data is relevant to HW 1.

  • The plot preview is a preview of week 2 material, but gives context to what we are doing in week 1.

     

Information about R Markdown

  • By default R Markdown files include example text and R code chunks.

    • The default example code uses the plot command (older base R command)

    • In BUA 455, we will mainly be creating plots using ggplot.

       

  • To create a NEW R Markdown file:

    • Click File > New File > R Markdown…

       

  • You can click ‘Create Empty Document’ but the default text is helpful.

  • There is a default setup chunk, but I will provide the setup chunk for HW 1.

  • All R Markdown files have plain text areas.

  • R Markdown files can ALSO have R chunks where code can be run.

  • Being able to mix R Chunks and Text areas makes markdown files very versatile.

  • Chunks could also be used for Python, SQL, or other coding languages (Not in BUA 455).

     


R Chunks in R Markdown

Chunks are small, self-contained R scripts embedded within the R Markdown file.

Two ways to create a new chunk:

  1. Click the green C at the top of the screen
  2. Press Ctrl + Alt + I (on a Mac, Cmd + Option + I)

     

Things to note about R Chunks:

  • In order for an R chunk to be functional:
    • an R chunk must start with ```{r}
    • an R chunk must end with ```

       

  • After the r in {r} you can put a space and then a brief label to name the chunk.
    • In the default R Markdown file, the first chunk after the setup chunk is named ‘cars’ after the dataset.
    • We will cover additional options that are specified in these curly braces later.

       

  • To run the chunk you are working in (the current chunk):
    • click the green sideways arrow on the right, or press Ctrl/Cmd + Shift + Enter

       

  • To run all chunks above the current one:
    • click the grey downward arrow on the right, or press Ctrl + Alt + P (Cmd + Option + P on a Mac)

       

  • To run one line, or a group of lines:
    • place the cursor on that line (or highlight the group of lines) and press Ctrl/Cmd + Enter

       

  • To add comments to your code (which is essential to good coding):
    • Within a chunk, start lines with # if you want them to be comments.
    • Comments are text that R does not run but that help explain the code.
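
As a small illustration of commenting, the lines below could be placed inside a chunk. This is only a sketch: the object name n_rows is made up for the example and is not part of HW 1.

```r
# Lines that start with # are comments; R skips them when the chunk runs.
# cars is a dataset built into R, so nothing needs to be imported.
n_rows <- nrow(cars)  # a comment can also follow code on the same line

# print the saved value to the screen
n_rows
```

Running the chunk executes only the two lines of code; the comments are there for the person reading it.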

       


Selecting data values by location

  • The R chunk below makes a copy of the cars data and saves it to the Global Environment.
    • Saving data to your Global Environment should always be your first step.
    • This is true for both built-in R datasets and external datasets that you import.

       

  • The R code below demonstrates how to use a pipe operator, |>.
    • Piping makes data management coding much more fluid and efficient.
    • This course will use piping regularly.
    • An older pipe operator, %>% (from the magrittr package), appears in much existing documentation.
    • For the code in this course, |> and %>% are interchangeable.
    • If you get an error message when you use |>:
      • You need to update your version of R or RStudio (|> requires R 4.1.0 or later).
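
To make the idea of piping concrete, here is a minimal sketch that runs the same command with and without the pipe. It uses head() and a position-based column selection only to stay within base R; these choices, and the object names same_result and avg_speed, are assumptions for illustration rather than part of the lecture code.

```r
# make a copy of the built-in cars data
my_cars <- cars

# without piping: the dataset is the first argument inside the function
head(my_cars, 3)

# with piping: the dataset is sent into head() as its first argument
my_cars |>
  head(3)

# both versions return exactly the same result
same_result <- identical(head(my_cars, 3), my_cars |> head(3))
same_result

# pipes can be chained: take column 1 (speed), average it, then round
avg_speed <- my_cars[, 1] |>
  mean() |>
  round(1)
avg_speed
```

Reading the chained version from top to bottom follows the order in which the steps happen, which is the main reason piping is easier to follow than nested function calls.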

         

  • The R code below demonstrates how to select rows, columns, and observations from a dataset.
    • We will cover other ways to do this, but understanding the basics is useful.
    • Datasets in R are typically stored as data frames, which are arranged in rows and columns like a matrix.
    • Locations within a dataset can be specified with square brackets.
      • cars[3,2] is the observation in the 3rd row and 2nd column of the cars dataset.
    • There are more examples below, which you will modify to answer questions in HW 1.

       

# note that cars is an internal R dataset
# make your own copy of the cars data in the Global Environment
# name it something different than cars
my_cars <- cars

# examine the dataset my_cars using glimpse (from the dplyr package)
glimpse(my_cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~
# same command with piping:
# read |> as 'is sent to' or 'goes into'
my_cars |>
  glimpse()
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~
# select rows 3, 4 and 5 only:
my_cars[3:5,]
##   speed dist
## 3     7    4
## 4     7   22
## 5     8   16
# select column 1 only:
my_cars[,1]
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
# select observations 10, 11, and 12 within column 1:
my_cars[10:12, 1]
## [1] 11 11 12
# select observations 20, 30, and 40, within column 2:
my_cars[c(20,30,40),2]
## [1] 26 40 48
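
The square-bracket rule (rows before the comma, columns after) also accepts column names in place of column numbers. The sketch below is an optional illustration under that assumption; the object names by_position, by_name, and all_dist are made up for the example.

```r
# make a copy of the built-in cars data
my_cars <- cars

# one value: row 3, column 2, selected by position
by_position <- my_cars[3, 2]

# the same value, naming the column (dist) instead of counting it
by_name <- my_cars[3, "dist"]

by_position
by_name

# leave the row position empty to keep all rows of the named column
all_dist <- my_cars[, "dist"]
length(all_dist)
```

Naming the column can be safer than counting it, because the code still works if the column order changes.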

Week 1 In-class Exercises

TurningPoint Session ID: bua455s22

TP Question 2 (L1)

This is also Question 1 of HW 1 - Parts 2 and 3

The cars dataset is a dataset that is internal to the R software. The code above saves a copy of this dataset with a new name.

  • Where is this copy dataset saved?

     

TP Question 1 (L2)

This is also Question 3 of HW 1 - Parts 2 and 3

This copy of the cars dataset has

  • ____ rows (observations)
  • ____ columns (variables)

     

TP Question 2 (L2)

This is also Question 7 of HW 1 - Parts 2 and 3

Complete the command my_cars[c(____,____,____), ____] to print out rows 10, 15, and 20 of column 1 of the my_cars dataset.

  • Rows to be selected are specified before the comma.
  • Columns to be selected are specified after the comma.
  • The last example in the provided code shows how to use c() to group non-consecutive elements.
  • Add your code to the chunk you created with the code above.
  • Replace ___ with correct numbers in your code before submitting it in R.
  • Test your code to verify that it is correct.
  • You can click on the dataset in the Global Environment to view the full dataset and verify your work.

     


Different types of variables

  • There are two main types of data variables:

    • Numeric values can be decimal values, dbl, or integer values, int.
    • Character values, chr, are text strings.
    • Sometimes numeric data is classified as a character variable (it can be converted to numeric).
    • The type of variable dictates what you can do with it.
    • Character and numeric data can be converted to a factor for plots, tables, and analyses.
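
The conversions described above can be sketched with a few made-up values. The vectors price_chr and size_chr are hypothetical and are not from any course dataset.

```r
# hypothetical values: numbers that were read in as text
price_chr <- c("10", "25", "7")
class(price_chr)

# convert to numeric so that arithmetic works
price_num <- as.numeric(price_chr)
class(price_num)
total <- sum(price_num)
total

# convert a character variable to a factor (categories for tables and plots)
size_chr <- c("small", "large", "small", "medium")
size_fct <- as.factor(size_chr)
levels(size_fct)
table(size_fct)
```

Note that sum(price_chr) would stop with an error, which is one example of the variable type dictating what you can do with the data.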

       

  • The starwars data has both numeric and character variables.

  • In the chunk below we will save the R dataset, starwars to our Global Environment.

  • We will then examine the data and variables and answer questions using R commands.

# save the starwars data to your global environment
my_starwars <- starwars

# examine the data
glimpse(my_starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
# examine the list of species
# note that to indicate a variable WITHIN a dataset we use $, the accessor operator
unique(my_starwars$species)
##  [1] "Human"          "Droid"          "Wookiee"        "Rodian"        
##  [5] "Hutt"           "Yoda's species" "Trandoshan"     "Mon Calamari"  
##  [9] "Ewok"           "Sullustan"      "Neimodian"      "Gungan"        
## [13] NA               "Toydarian"      "Dug"            "Zabrak"        
## [17] "Twi'lek"        "Vulptereen"     "Xexto"          "Toong"         
## [21] "Cerean"         "Nautolan"       "Tholothian"     "Iktotchi"      
## [25] "Quermian"       "Kel Dor"        "Chagrian"       "Geonosian"     
## [29] "Mirialan"       "Clawdite"       "Besalisk"       "Kaminoan"      
## [33] "Aleena"         "Skakoan"        "Muun"           "Togruta"       
## [37] "Kaleesh"        "Pau'an"
# examine the list of hair colors
# again, we use $ to specify hair_color within this dataset
unique(my_starwars$hair_color)
##  [1] "blond"         NA              "none"          "brown"        
##  [5] "brown, grey"   "black"         "auburn, white" "auburn, grey" 
##  [9] "white"         "grey"          "auburn"        "blonde"       
## [13] "unknown"
# do a quick summary of these data by sex and hair color
table(my_starwars$sex, my_starwars$hair_color)
##                 
##                  auburn auburn, grey auburn, white black blond blonde brown
##   female              1            0             0     3     0      1     6
##   hermaphroditic      0            0             0     0     0      0     0
##   male                0            1             1     9     3      0    11
##   none                0            0             0     0     0      0     0
##                 
##                  brown, grey grey none unknown white
##   female                   0    0    4       0     1
##   hermaphroditic           0    0    0       0     0
##   male                     1    1   29       0     3
##   none                     0    0    3       0     0
# save this table as an object
sw_gender_hrclr_smry <- table(my_starwars$sex, my_starwars$hair_color)

# summarize height of starwars characters
summary(my_starwars$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    66.0   167.0   180.0   174.4   191.0   264.0       6
# calculate the mean
mean(my_starwars$height)
## [1] NA
# NAs must be excluded with na.rm=TRUE
# also true for min(), max(), median(), sd(), etc.
mean(my_starwars$height, na.rm=TRUE)
## [1] 174.358
sd(my_starwars$height, na.rm=TRUE)
## [1] 34.77043
# summarize sex
summary(my_starwars$sex)
##    Length     Class      Mode 
##        87 character character
# summarize sex using table
table(my_starwars$sex)
## 
##         female hermaphroditic           male           none 
##             16              1             60              6
# summarize sex but use as.factor
summary(as.factor(my_starwars$sex))
##         female hermaphroditic           male           none           NA's 
##             16              1             60              6              4
# save this last summary as an object
sw_sex_smry <- summary(as.factor(my_starwars$sex))
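
The output above shows that table() leaves missing values out by default, while summary(as.factor()) counts them. As a hedged sketch, the base R argument useNA = "ifany" makes table() report them as well. The vectors sex_chr and height_cm below are small made-up examples, not the starwars data.

```r
# hypothetical character variable with two missing values
sex_chr <- c("male", "female", NA, "male", "none", NA)

# default table: the NAs are silently dropped
table(sex_chr)

# useNA = "ifany" adds a column for the NAs when there are any
sex_counts <- table(sex_chr, useNA = "ifany")
sex_counts

# count the missing values directly
n_missing <- sum(is.na(sex_chr))
n_missing

# the same idea for numeric summaries: one NA makes the mean NA
height_cm <- c(172, 167, NA, 96)
mean(height_cm)
height_mean <- mean(height_cm, na.rm = TRUE)
height_mean
```

Checking for NAs before summarizing avoids reporting counts or averages that quietly ignore part of the data.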

Plot Preview

Data Mgmt for a Basic Boxplot

  • Below is a preview of ggplot, which we will use throughout this course.

  • Plotting data well requires data management and planning.

    • The preview below ALSO includes data management skills we will cover.
    • Comments are provided to explain what is done.
  • The very first plot is the bare minimum to create a boxplot.

    • The code is shown here with and without piping.
# dataset my_starwars_plt is created for the plot
# used select command to select variables
# used filter command to keep only two species, Human and Droid
# used mutate command to create new variable bmi
  # bmi = mass (kg) / height (m)^2; height is recorded in cm, so it is divided by 100
# filtered out observations where bmi was a missing value, NA
my_starwars_plt <- my_starwars |>
  select(species, sex, height, mass) |>
  filter(species %in% c("Human", "Droid")) |>
  mutate(bmi = mass / (height / 100)^2) |>
  filter(!is.na(bmi))

# most basic UNSAVED boxplot
# output to screen only
my_starwars_plt |> ggplot() +
  geom_boxplot(aes(x=species, y=bmi))

# same basic unsaved plot without piping
ggplot(my_starwars_plt) +
  geom_boxplot(aes(x=species, y=bmi))
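
To see what the bmi calculation does for a single character, here is a worked sketch in base R using Luke Skywalker's height (172 cm) and mass (77 kg) from the glimpse output earlier. The object names are made up for this illustration.

```r
# Luke Skywalker's values from the starwars data
height_cm <- 172
mass_kg <- 77

# convert height to meters, then apply bmi = mass / height^2
height_m <- height_cm / 100
bmi <- mass_kg / height_m^2
round(bmi, 2)  # about 26.03

# the one-line version used inside mutate() gives the same value
bmi_one_line <- mass_kg / (height_cm / 100)^2
all.equal(bmi, bmi_one_line)

# a missing mass produces a missing bmi, which is why NAs are filtered out
bmi_missing <- NA / height_m^2
bmi_missing
```

The exponent is applied before the division, so only the height in meters is squared.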


Improving a Plot (Demo)

  • The four plots below are saved as objects in the Global Environment.

    • Saved plots don’t print to the screen by default.
    • Enclose the set of plot commands in parentheses to print to screen.
  • sw_boxplot1 is the basic plot.

  • sw_boxplot2 is the basic plot with data shown by sex.

  • sw_boxplot3 removes the default background.

  • sw_boxplot4 formats all plot text and moves the legend to the bottom.
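
The parentheses trick is general R behavior and is not specific to plots. A minimal sketch with made-up numeric vectors (speeds and speeds_doubled are hypothetical names):

```r
# assignment alone saves the object but prints nothing
speeds <- c(4, 7, 8)

# wrapping the assignment in parentheses saves AND prints the object
(speeds_doubled <- speeds * 2)

# typing the object's name afterwards also prints it
speeds_doubled
```

Either approach works for saved plots as well; the parentheses simply save a line of code.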

# most basic saved boxplot
# data are piped into the plot

# this plot is saved as sw_boxplot1 but won't appear on screen
sw_boxplot1 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi))

# same plot as above enclosed in parentheses so it appears on screen
(sw_boxplot1 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi)))

# same plot but with sex included (fill=sex)
(sw_boxplot2 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sex)))

# same plot but with sex included (fill=sex)
# also default background removed by changing theme to theme_classic
(sw_boxplot3 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sex))+
  theme_classic())

# created a factor variable sexF
# modified order (levels) and labels for plot
my_starwars_plt <- my_starwars_plt |>
  mutate(sexF = factor(sex, 
                       levels = c("male", "female", "none"),
                       labels =c("Male", "Female", "None")))

# used new factor variable
# moved legend to the bottom
# formatted titles, captions, axis labels with labs command
(sw_boxplot4 <- my_starwars_plt |> 
  ggplot() +
  geom_boxplot(aes(x=species, y=bmi, fill=sexF)) +
  theme_classic() + 
  theme(legend.position="bottom") +
  labs(title="Comparison of Human and Droid BMI",
       subtitle="22 Humans and 4 Droids from Star Wars Universe",
       caption="Data Source: dplyr package in R",
       x="",y="BMI", fill="Sex"))
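
The levels and labels arguments used to build sexF can be tried on a small vector first. The sketch below uses a made-up four-value vector, not the starwars data, to show how the category order and display text change.

```r
# hypothetical character values
sex <- c("male", "none", "female", "male")

# default factor: levels are sorted alphabetically
default_levels <- levels(factor(sex))
default_levels

# custom order (levels) and display text (labels)
sexF <- factor(sex,
               levels = c("male", "female", "none"),
               labels = c("Male", "Female", "None"))
levels(sexF)

# each value is stored as the position of its level
as.integer(sexF)

# tables and plot legends follow the level order
table(sexF)
```

The order given in levels controls the order of the boxes and legend entries in the plot, and labels controls the text that is displayed.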


Presenting a grid of plots together

  • One reason to save plots is to use them later in a composed presentation.

  • The 2 x 2 grid below shows all four of the previous plots together.

# create 2x2 grid of boxplots
# grid.arrange() comes from the gridExtra package, which must be loaded first
# shows transition from first draft to final plot
# could also change colors
grid.arrange(sw_boxplot1, sw_boxplot2, sw_boxplot3, sw_boxplot4, ncol=2)