BUA 455 - Lecture 21 - Review for Quiz 2

2021-04-27

Load these packages:

  • I will specify packages to be loaded for the quiz

  • You may also load different packages or use base R commands if you choose.

# install.packages("gridExtra")

library(nycflights13)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(knitr)

Format of Quiz 2

  • Timed take-home

  • Multiple different versions

    • Each version will have a unique data set

    • Each version will ask students to complete 4 - 5 data management skills

    • You will be writing commands yourself, but I may provide some hints

    • Each version will ask questions based on managed data

    • Each version will require one basic readable plot to be submitted

    • Each version WILL ALSO include some conceptual (why/how) questions

      • For example: What is the %>% operator, and why is it used?
  • Can be completed between 2:00 PM on Thursday and Midnight on Friday

  • I will be available:

    • Thursday 2:00 PM to 3:30 PM

    • Thursday 4:15 PM to 5:30 PM

    • Thursday 7:30 PM to 9:00 PM

    • Friday TBD (depends on Meeting and Family - will post on BB Thursday)

Lecture 13 (HW 5)

Data Management Essential Skills
  • Note that many of these skills build on and expand skills from the first half of the course.

  • Recall these dplyr commands (expanded list to include some tidyr commands):

Details about dplyr functions (Ch. 5) and tidyr functions (Ch. 12) are in R for Data Science.

Function      Use
filter()      Pick (subset) observations by their value
arrange()     Reorder the rows
select()      Pick variables by their names
rename()      Rename a variable in a data set
relocate()    Reorder variables in a data set (can also use select for this)
separate()    Split text variables into multiple variables
mutate()      Create new variables with functions of existing variables
summarise()   Collapse many values down to a single summary
group_by()    Used in conjunction with these functions to change scope, e.g., by category

  • group_by(…) followed by summarize(…)

    • NOTE: group_by followed by summarize has been useful in multiple lectures or HW assignments

    • Given that universal utility, I am likely to ask a question that will require using these commands
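
As a quick refresher, here is a minimal sketch of the group_by(…) then summarise(…) pattern. The delays data set and its values are made up for illustration; they are not from the course data.

```r
library(dplyr)

# Small made-up data set (hypothetical values, for illustration only)
delays <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(10, 20, 5, NA, 15)
)

# group_by() sets the scope (one group per carrier);
# summarise() then computes the summaries within each group
delay_summary <- delays %>%
  group_by(carrier) %>%
  summarise(
    n_flights  = n(),                           # rows in each group
    mean_delay = mean(dep_delay, na.rm = TRUE)  # ignore missing delays
  )

delay_summary
```

Note that summarise(…) and summarize(…) are two spellings of the same dplyr function.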

Interactive Question 1

When you create a group_by object, how many observations will it have in comparison to the original data set?

  • NOTE paste(…) vs unite(…)

    • In this course, I use paste(…) as the opposite of separate

    • tidyr package has a different command, unite(…)

    • You are welcome to use either, but show the link you used as a reference (see below)
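
The sketch below compares the two approaches on a small made-up data set (the dates tibble is hypothetical). It assumes the goal is a single text column built from three existing columns, and then shows separate(…) undoing the result.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: year, month, and day stored in separate columns
dates <- tibble(year = c(2021, 2021), month = c(4, 5), day = c(27, 3))

# Base R paste() inside mutate(): adds the text column, keeps the inputs
pasted <- dates %>%
  mutate(date_txt = paste(year, month, day, sep = "-"))

# tidyr unite(): builds the same text, but drops the input columns by default
united <- dates %>%
  unite("date_txt", year, month, day, sep = "-")

# separate() splits the text back into columns (as character by default)
split_again <- united %>%
  separate(date_txt, into = c("year", "month", "day"), sep = "-")
```

One practical difference visible here: paste(…) leaves the original columns in place, while unite(…) removes them unless remove = FALSE is specified.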

  • filter(…) and select(…) commands, OR […,…] square brackets

    • filter is used to subset data by row, e.g. category, value, or row number

    • select is used to subset the data by column, e.g. variable name or column number

    • Square brackets can also be used to subset data to specific rows and columns
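
A short sketch of the two subsetting styles, using the built-in cars data set (columns speed and dist). The cutoff of 20 is arbitrary and chosen only for illustration.

```r
library(dplyr)

# dplyr: filter() subsets rows, select() subsets columns
fast_dplyr <- cars %>%
  filter(speed > 20) %>%
  select(dist)

# Base R square brackets: [rows, columns]
# drop = FALSE keeps the result as a data frame instead of a vector
fast_base <- cars[cars$speed > 20, "dist", drop = FALSE]
```

Both objects contain the same dist values; the bracket version keeps the original row names, while the dplyr version renumbers the rows.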

Interactive Question 2

In order to use filter in some cases it is efficient to add a numeric variable of the row numbers to the data set. If the data set is named cars, what is a command to do this in R?

NOTE: cars is a real data set in R that you can use to test commands like the ones in this question.

  • mutate(…) with calculations or other variable modifications and creations

    • used to create a new variable
  • merge(…) or full_join(…)

    • RECALL: in lecture 13 and HW 5 we used full_join to join the weather and flight data

    • In the code library there is an example of merge

    • Other types of joins are used to achieve specific goals, e.g. left_join, right_join, etc.

    • you are responsible for full_join and why joining is needed and how it is limited

      • For example: how would you merge 3 or more data sets?
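
For reference, a minimal sketch of full_join(…) on two tiny data sets. The column names imitate the nycflights13 flights and weather tables, but the rows and values are made up; this is not the HW 5 code.

```r
library(dplyr)

# Hypothetical stand-ins for the flights and weather data
flights_small <- tibble(
  origin    = c("JFK", "LGA", "EWR"),
  hour      = c(6, 7, 8),
  dep_delay = c(4, 12, -2)
)
weather_small <- tibble(
  origin = c("JFK", "LGA", "JFK"),
  hour   = c(6, 7, 9),
  temp   = c(39, 41, 45)
)

# full_join() keeps every row from both data sets;
# rows without a match on the key columns get NA in the other set's columns
joined <- full_join(flights_small, weather_small, by = c("origin", "hour"))

joined
```

Here the EWR flight has no weather record and the JFK 9:00 weather record has no flight, so each of those rows contains one NA.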
  • pivot_wider(…) and pivot_longer(…)

    • Recall that pivot_wider was used in HW 5 to create a new variable (column) for each category (airline)

    • In Lecture 15 (and HW 6) we used pivot_longer (mentioned below)

    • Quiz 2 will ask each student to use at least one of these skills to manage data and answer questions

    • both pivot_longer and pivot_wider are essential skills
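
The following sketch shows the two pivots as inverses of each other. The monthly flight counts are made up for illustration and are not the HW 5 data.

```r
library(dplyr)
library(tidyr)

# Hypothetical "long" (stacked) data: one row per month and carrier
long <- tibble(
  month   = c(1, 1, 2, 2),
  carrier = c("AA", "UA", "AA", "UA"),
  flights = c(100, 120, 90, 130)
)

# pivot_wider(): one new column per category (carrier)
wide <- long %>%
  pivot_wider(names_from = carrier, values_from = flights)

# pivot_longer(): stack the carrier columns back into name/value pairs
long_again <- wide %>%
  pivot_longer(cols = c(AA, UA), names_to = "carrier", values_to = "flights")
```

The wide version has one row per month with columns month, AA, and UA; pivoting longer recovers the original four rows.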

  • is.na(…) with colSums(…) and colMeans(…)

    • Recall that lecture 13 begins with a demo of a small “toy” data set

    • That demo and the lecture show that you can calculate the number or percent of missing values

    • We can determine number of missing values or percent missing for

      • a whole data frame

      • a single variable (column)

      • all columns in a data set simultaneously
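
A minimal sketch of these three cases in Base R. The toy data frame below is made up for this review and is not the one from Lecture 13.

```r
# Hypothetical toy data set with some missing values
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, "c", "d"),
  z = c(TRUE, FALSE, TRUE, TRUE)
)

# Whole data frame: is.na() returns a TRUE/FALSE matrix
n_missing_all   <- sum(is.na(toy))         # count of missing values
pct_missing_all <- mean(is.na(toy)) * 100  # percent of all cells

# A single variable (column)
n_missing_x <- sum(is.na(toy$x))

# All columns simultaneously
n_missing_by_col   <- colSums(is.na(toy))
pct_missing_by_col <- colMeans(is.na(toy)) * 100
```

This works because TRUE counts as 1 and FALSE as 0, so sums give counts and means give proportions.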

Lectures 14 and 15 (HW 6)

  • Lecture 14 focused on some useful commands from lubridate package

  • students should know

    • paste(…) which is a Base R command used to join text

    • all functions and variations from list below

    • by variations: ymd(…), mdy(…), dmy(…)

# Function names and descriptions for the reference table
fnctns <- c("wday(...)", "month(...)", "day(...)", "now()", "today()",
            "ymd(...)", "ymd_h(...)", "ymd_hm(...)", "ymd_hms(...)")

descrips <- c("outputs day of week for an input date as a number (Sun. = 1) or text",
              "outputs month for an input date as a number or text",
              "outputs day of month for an input date as a number",
              "outputs current date and time in the system time zone, similar to Sys.time() in Base R",
              "outputs current date, similar to Sys.Date() in Base R",
              "converts year, month and day input to a date (also includes dmy(), mdy())",
              "converts year, month, day and hour to a date-time object",
              "converts year, month, day, hour and minutes to a date-time object",
              "converts year, month, day, hour, minutes, and seconds to a date-time object")

# Combine into a data frame and print as a formatted table
lbrdt_df <- data.frame(fnctns, descrips)
kable(lbrdt_df, col.names=c("Function", "Description"), caption="Common Lubridate Functions")

Common Lubridate Functions

Function     Description
wday(…)      outputs day of week for an input date as a number (Sun. = 1) or text
month(…)     outputs month for an input date as a number or text
day(…)       outputs day of month for an input date as a number
now()        outputs current date and time in the system time zone, similar to Sys.time() in Base R
today()      outputs current date, similar to Sys.Date() in Base R
ymd(…)       converts year, month and day input to a date (also includes dmy(), mdy())
ymd_h(…)     converts year, month, day and hour to a date-time object
ymd_hm(…)    converts year, month, day, hour and minutes to a date-time object
ymd_hms(…)   converts year, month, day, hour, minutes, and seconds to a date-time object
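
A few of the functions in the table can be tried as follows. This is a small sketch using the date of this lecture; the date-time string is chosen only for illustration.

```r
library(lubridate)

# The parser name matches the order of the pieces in the text
d <- mdy("04/27/2021")          # same date as ymd("2021-04-27")

wday(d)                         # day of week as a number (Sun. = 1 by default)
wday(d, label = TRUE)           # day of week as text
month(d)                        # month as a number
day(d)                          # day of month

# Date-time parsing: year, month, day, hour, and minutes (UTC by default)
dt <- ymd_hm("2021-04-29 14:00")
hour(dt)
```

April 27, 2021 was a Tuesday, so wday(d) returns 3 under the default Sunday = 1 numbering.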
  • Lecture 15 (and HW 6):

    • as.xts was used to convert a data frame with a date variable to an .xts object

    • converting an .xts object to a data frame

    • pivot_longer was used to ‘stack’ data from multiple columns for plotting

    • Line (geom_line) plot is short and simple if data are in “stacked” format

    • hchart (highcharter package), as used in class, requires an .xts object
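
The steps above can be sketched as follows. The prices data set is made up, and the conversion uses the xts(…) constructor with order.by rather than the exact as.xts call from Lecture 15, so treat this as one possible approach rather than the class code.

```r
library(xts)      # also attaches zoo, which provides index() and coredata()
library(tidyr)
library(ggplot2)

# Hypothetical data frame with a date variable and two series
prices <- data.frame(
  date    = as.Date("2021-04-01") + 0:3,
  stock_a = c(10, 11, 12, 11),
  stock_b = c(20, 19, 21, 22)
)

# Data frame -> xts: numeric columns are the data, the date column is the index
prices_xts <- xts(prices[, c("stock_a", "stock_b")], order.by = prices$date)

# xts -> data frame: pull the index and the data back out
prices_df <- data.frame(date = index(prices_xts), coredata(prices_xts))

# Stack the two series so that one geom_line() call draws both
prices_long <- pivot_longer(prices_df, cols = c(stock_a, stock_b),
                            names_to = "stock", values_to = "price")

price_plot <- ggplot(prices_long, aes(x = date, y = price, color = stock)) +
  geom_line()
```

With the data in stacked format, the color aesthetic separates the series, which is why the plotting code stays short.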

  • Lectures 16 and 17 (and HW 7)

    • writing functions to automate repetitive tasks

    • key components of a function are

      • function input(s): data frames, variables, values

      • what the function does: how the function acts on the inputs

      • function output(s): result of calling the function

    • ESSENTIAL FUNCTION FORMATTING: the last step in the function is NOT assigned to a name

      • This is key so that the function's result is returned and can be printed or saved
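
A minimal sketch of these components. The function name pct_missing and the toy data frame are hypothetical, made up for this review.

```r
# Inputs: a data frame (df) and the name of one variable (var) as text
pct_missing <- function(df, var) {
  # What the function does: count missing values in the chosen column
  n_miss <- sum(is.na(df[[var]]))

  # Output: the last step is NOT assigned to a name,
  # so its value is what the function returns
  100 * n_miss / nrow(df)
}

# Hypothetical data to call the function on
toy <- data.frame(x = c(1, NA, 3, 4))

pct_missing(toy, "x")            # prints the result to the console
result <- pct_missing(toy, "x")  # saves the result under a name
```

If the last step were instead written as an assignment, R would return the value invisibly, so calling the function on its own would print nothing.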

Interactive Question 3

True or False: A function can either output a result to the console OR save the result to a .csv file in the working directory, but not both.

  • Lectures 18 and 19

    • These lectures focused on a single complex data management example

    • some aspects of this work reviewed concepts from previous lectures

    • For example: in these lectures (and previous ones), gsub(…) was used

      • What does gsub(…) do?
  • Other basic data management commands

    • rename(…): renames a variable

    • relocate(…): moves a variable in a data set which can also be done with select(…)

      • although moving variables using relocate and renaming variables seem trivial, they are not

      • good data management requires these steps

    • separate(…): opposite of paste(…)

    • paste(…) combines or concatenates text and separate(…) splits text

      • both together are essential for being able to manipulate data
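
As a brief illustration of rename(…) and relocate(…), here is a sketch using the built-in cars data set. The new variable names are made up; they simply add the units documented for cars (mph and feet).

```r
library(dplyr)

# rename(new_name = old_name), then move dist_ft in front of speed_mph
cars_clean <- cars %>%
  rename(speed_mph = speed, dist_ft = dist) %>%
  relocate(dist_ft, .before = speed_mph)

# select() can rename and reorder in one step
cars_select <- cars %>%
  select(dist_ft = dist, speed_mph = speed)
```

Both versions end with the same two columns in the same order, which is the sense in which select(…) can stand in for relocate(…).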
  • Additional Comments

    • You are not limited to code covered in class.

    • If you know or find R code online that helps you answer a question, that is great.

    • If you do use code you find online, you will be asked to provide a link in your R script to show where you got the code from

    • For example: Let’s say I google: change facet text in ggplot (I did this search for Lecture 19)

      • I would include the link as a comment in my script:
    • Another example: as mentioned above, you may use unite(…) instead of paste(…) but provide a link to your documentation

# source for my R code about facets
# https://r-graphics.org/recipe-facet-label-text

# reference for separate(...):
# https://tidyr.tidyverse.org/reference/separate.html

# reference for unite(...)
# https://tidyr.tidyverse.org/reference/unite.html
  • It should go without saying that you cannot work together or ask questions of anyone except me (Prof. Pooler)

  • You can use the internet, your textbook, previous lectures, previous HW assignments, etc.