BUA 455 - Lecture 21 - Review for Quiz 2

2021-04-27

Load these packages:

I will specify the packages to load for the quiz.
You may also load different packages or use base R commands if you choose.

# install.packages("gridExtra")

library(nycflights13)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(knitr)

Format of Quiz 2

Timed take-home
Multiple different versions
- Each version will have a unique data set
- Each version will ask students to complete 4 - 5 data management skills
- You will be writing commands yourself, but I may provide some hints
- Each version will ask questions based on managed data
- Each version will require one basic readable plot to be submitted
- Each version WILL ALSO include some conceptual (why/how) questions
  - For example: what is the %>% operator, and why is it used?
Can be completed between 2:00 PM on Thursday and Midnight on Friday
I will be available:
- Thursday 2:00 PM to 3:30 PM
- Thursday 4:15 PM to 5:30 PM
- Thursday 7:30 PM to 9:00 PM
- Friday TBD (depends on Meeting and Family - will post on BB Thursday)

Lecture 13 (HW 5)

Data Management Essential Skills

Note that many of these skills build on and expand skills from the first half of the course.
Recall these dplyr commands (expanded list to include some tidyr commands):

Details about dplyr functions (CH. 5) and tidyr functions (CH. 12) are in R for Data Science
Function	Use
filter()	Pick (subset) observations by their value
arrange()	Reorder the rows
select()	Pick variables by their names
rename()	Rename a variable in a data set
relocate()	Reorder variables in a data set (can also use select for this)
separate()	Split text variables into multiple variables
mutate()	Create new variables with functions of existing variables
summarise()	Collapse many values down to a single summary
group_by()	Used in conjunction with these functions to change scope, e.g., by category

group_by(…) followed by summarize(…)
- NOTE: group_by followed by summarize has been useful in multiple lectures or HW assignments
- Given that universal utility, I am likely to ask a question that will require using these commands
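
As a reminder of the pattern, here is a minimal sketch using mtcars, a data set that ships with R. The choice of data set, the grouping variable cyl, and the output name mpg_by_cyl are assumptions made for this illustration; they are not from the lecture or the quiz.

```r
library(dplyr)

# mtcars ships with R; cyl (number of cylinders) is the grouping category here
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n(),          # number of cars in each group
            mean_mpg = mean(mpg))  # average mpg within each group

mpg_by_cyl
```

The same two steps work for any categorical variable: group first, then summarize within each group.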

Interactive Question 1

When you create a group_by object, how many observations will it have in comparison to the original data set?

NOTE paste(…) vs unite(…)
- In this course, I use paste(…) as the opposite of separate
- tidyr package has a different command, unite(…)
- You are welcome to use either, but show the link you used as a reference (see below)
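
A short sketch of the two options side by side. The data frame flights_toy and its values are hypothetical, made up only to show that paste(…) and unite(…) can produce the same text column and that separate(…) reverses either one.

```r
library(tidyr)

# Hypothetical toy data, made up for this illustration
flights_toy <- data.frame(carrier = c("AA", "UA"), flight = c(101, 202))

# Base R: paste() builds the combined text column
flights_toy$id_paste <- paste(flights_toy$carrier, flights_toy$flight, sep = "_")

# tidyr: unite() does the same job; remove = FALSE keeps the original columns
flights_toy <- unite(flights_toy, "id_unite", carrier, flight,
                     sep = "_", remove = FALSE)

# separate() splits the combined text back into two columns
flights_split <- separate(flights_toy, id_paste,
                          into = c("carrier2", "flight2"), sep = "_")
```

Note that separate(…) returns text columns, so flight2 would need as.numeric(…) to become a number again.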
filter(…) and select(…) commands OR can use […,…] square brackets
- filter is used to subset data by row, e.g. category, value, or row number
- select is used to subset the data by column, e.g. variable name or column number
- Square brackets can also be used to subset data to specific rows and columns
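
The sketch below shows one subset done both ways on the built-in cars data set. The cutoff of 20 and the object names fast_dplyr and fast_base are arbitrary choices for this illustration.

```r
library(dplyr)

# dplyr: keep rows with speed above 20, then keep only the dist column
fast_dplyr <- cars %>%
  filter(speed > 20) %>%
  select(dist)

# Base R: rows go before the comma, columns after it;
# drop = FALSE keeps the result as a data frame instead of a vector
fast_base <- cars[cars$speed > 20, "dist", drop = FALSE]
```

Both objects hold the same values; they differ only in their row names.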

Interactive Question 2

To use filter, it is sometimes efficient to first add a numeric variable of the row numbers to the data set. If the data set is named cars, what is a command to do this in R?

NOTE: cars is a real data set in R that you can use to test commands like the ones in this question.

mutate(…) with calculations or other variable modifications and creations
- used to create a new variable
merge(…) or full_join(…)
- RECALL: in lecture 13 and HW 5 we used full_join to join the weather and flight data
- In the code library there is an example of merge
- Other types of joins are used to achieve specific goals, e.g. left_join, right_join, etc.
- You are responsible for full_join, why joining is needed, and how it is limited
  - For example: how would you merge 3 or more data sets?
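
One possible approach to the three-table case, sketched with hypothetical toy tables (a, b, c3 and the key column id are made up for this example). It assumes all tables share the same key; other approaches exist.

```r
library(dplyr)

# Hypothetical toy tables that share the key column "id"
a  <- data.frame(id = c(1, 2, 3), x = c("a1", "a2", "a3"))
b  <- data.frame(id = c(2, 3, 4), y = c("b2", "b3", "b4"))
c3 <- data.frame(id = c(3, 4, 5), z = c("c3", "c4", "c5"))

# full_join() takes two tables at a time, so a third table needs a second call
abc <- a %>%
  full_join(b, by = "id") %>%
  full_join(c3, by = "id")

abc
```

full_join keeps every id from every table and fills unmatched cells with NA, which is one reason to check for missing values after joining.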
pivot_wider(…) and pivot_longer(…)
- Recall that pivot_wider was used in HW 5 to create a new variable (column) for each category (airline)
- In Lecture 15 (and HW 6) we used pivot_longer (mentioned below)
- Quiz 2 will ask each student to use at least one of these skills to manage data and answer questions
- both pivot_longer and pivot_wider are essential skills
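
A minimal round trip between the two shapes. The data frame long_df and its values are hypothetical; the column names only loosely mirror the airline example from HW 5.

```r
library(tidyr)

# Hypothetical long data: one row per month-airline pair
long_df <- data.frame(month   = c(1, 1, 2, 2),
                      airline = c("AA", "UA", "AA", "UA"),
                      flights = c(10, 20, 30, 40))

# pivot_wider(): one new column per airline
wide_df <- pivot_wider(long_df, names_from = airline, values_from = flights)

# pivot_longer(): stack the airline columns back into name/value pairs
back_long <- pivot_longer(wide_df, cols = c(AA, UA),
                          names_to = "airline", values_to = "flights")
```

Wide format is usually easier to read as a table; long ("stacked") format is usually what ggplot expects.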
is.na(…) with colSums(…) and colMeans(…)
- Recall that lecture 13 begins with a demo of a small “toy” data set
- That demo and the lecture show that you can calculate the number or percent of missing values
- We can determine number of missing values or percent missing for
  - a whole data frame
  - a single variable (column)
  - all columns in a data set simultaneously
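
The three cases above can be sketched with base R only. The data frame toy is hypothetical and is not the toy data set from Lecture 13.

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, "x", "y"),
                  c = c(TRUE, FALSE, TRUE, NA))

sum(is.na(toy))             # number of missing values in the whole data frame
mean(is.na(toy)) * 100      # percent missing in the whole data frame

sum(is.na(toy$a))           # number missing in a single column
mean(is.na(toy$a)) * 100    # percent missing in a single column

colSums(is.na(toy))         # number missing in every column at once
colMeans(is.na(toy)) * 100  # percent missing in every column at once
```

is.na(…) returns TRUE/FALSE values, and R treats TRUE as 1 when summing or averaging, which is why sums give counts and means give proportions.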

Lectures 14 and 15 (HW 6)

Lecture 14 focused on some useful commands from the lubridate package
Students should know:
- paste(…), a Base R command used to join text
- all functions and variations from the list below
- variations include ymd(…), mdy(…), dmy(…)

fnctns <- c("wday(...)", "month(...)", "day(...)", "now()", "today()",
            "ymd(...)", "ymd_h(...)", "ymd_hm(...)", "ymd_hms(...)")

descrips <- c("outputs day of week for an input date as a number (Sun. = 1) or text",
              "outputs month for an input date as a number or text",
              "outputs day of month for an input date as a number",
              "outputs current date and time in the system time zone, similar to Sys.time() in Base R",
              "outputs current date, similar to Sys.Date() in Base R",
              "converts year, month and day input to a date (also includes dmy(), mdy())",
              "converts year, month, day and hour to a date-time object",
              "converts year, month, day, hour and minutes to a date-time object",
              "converts year, month, day, hour, minutes, and seconds to a date-time object")

lbrdt_df <- data.frame(fnctns, descrips)
kable(lbrdt_df, col.names=c("Function", "Description"), caption="Common Lubridate Functions")

Common Lubridate Functions
Function	Description
wday(…)	outputs day of week for an input date as a number (Sun. = 1) or text
month(…)	outputs month for an input date as a number or text
day(…)	outputs day of month for an input date as a number
now()	outputs current date and time in the system time zone, similar to Sys.time() in Base R
today()	outputs current date, similar to Sys.Date() in Base R
ymd(…)	converts year, month and day input to a date (also includes dmy(), mdy())
ymd_h(…)	converts year, month, day and hour to a date-time object
ymd_hm(…)	converts year, month, day, hour and minutes to a date-time object
ymd_hms(…)	converts year, month, day, hour, minutes, and seconds to a date-time object
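
A few of the functions in the table, sketched on the date of this lecture. The object names d1, d2, d3, and dt are arbitrary; the input strings are made-up formats chosen to match each parsing function.

```r
library(lubridate)

# The same date written three ways, each parsed by the matching function
d1 <- ymd("2021-04-27")
d2 <- mdy("04/27/2021")
d3 <- dmy("27-04-2021")

wday(d1)                 # day of week as a number (Sunday = 1 by default)
wday(d1, label = TRUE)   # day of week as text
month(d1)                # month as a number
month(d1, label = TRUE)  # month as text
day(d1)                  # day of the month

# Date-time version: year, month, day, hour, and minutes
dt <- ymd_hm("2021-04-27 14:00")
```

The function name simply spells out the order of the pieces in the input text, so pick the one that matches how the date is written.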

Lecture 15 (and HW 6):
- as.xts is used to convert a data frame with a date variable to an .xts object
- converting an .xts object back to a data frame
- pivot_longer was used to ‘stack’ data from multiple columns for plotting
- A line plot (geom_line) is short and simple if the data are in “stacked” format
- hchart (highcharter package) requires an .xts object
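
A minimal sketch of the round trip between a data frame and an .xts object. It assumes the xts package is installed (it is not in the package list at the top of these notes), and the data frame prices with its two series is hypothetical. This is one way to do the conversion, not necessarily the exact code from Lecture 15.

```r
library(xts)  # assumption: xts is installed; it also attaches zoo

# Hypothetical daily values for two made-up series
prices <- data.frame(date    = as.Date("2021-04-26") + 0:2,
                     stock_a = c(10, 11, 12),
                     stock_b = c(20, 19, 21))

# Data frame to xts: the date column becomes the time index (order.by)
prices_xts <- as.xts(prices[, c("stock_a", "stock_b")], order.by = prices$date)

# xts back to a data frame: index() recovers the dates, coredata() the values
prices_df <- data.frame(date = index(prices_xts), coredata(prices_xts))
```

Once the data are back in a data frame, pivot_longer can stack stock_a and stock_b for a geom_line plot.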

Lectures 16 and 17 (and HW 7)
- writing functions to automate repetitive tasks
- key components of a function are
  - function input(s): data frames, variables, values
  - what the function does: how the function acts on the inputs
  - function output(s): result of calling the function
- ESSENTIAL FUNCTION FORMATTING: last step in function NOT assigned to a name
  - This is key so that function output can be output and saved
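
The components above, sketched in a small function. The name pct_missing and its input are hypothetical and are not from the lectures or HW 7.

```r
# Hypothetical helper: percent of missing values in one vector
pct_missing <- function(x) {       # input: a vector (e.g. one column)
  n_missing <- sum(is.na(x))       # intermediate steps can be assigned to names
  n_missing / length(x) * 100      # last step is NOT assigned, so it is the output
}

pct_missing(c(1, NA, 3, NA))            # called on its own: result prints to the console
result <- pct_missing(c(1, NA, 3, NA))  # assigned to a name: result is saved
```

Because the last line inside the function is left unassigned, the caller decides whether to print the result or store it.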

Interactive Question 3

True or False: A function can either output a result to the console OR save the result to a .csv file in the working directory, but not both.


Lectures 18 and 19
- These lectures focused on a single complex data management example
- some aspects of this work reviewed concepts from previous lectures
- For example: in this lecture (and previous), gsub(…) was used
  - What does gsub(…) do?
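
As a refresher, a small sketch of gsub(…) on made-up text values. The vector raw_price is hypothetical and is not the data from Lectures 18 and 19.

```r
# Hypothetical text values with dollar signs and commas, as often found in raw data
raw_price <- c("$1,200", "$35", "$4,050")

# gsub(pattern, replacement, x) replaces every match of pattern in x
no_dollar <- gsub("$", "", raw_price, fixed = TRUE)  # fixed = TRUE treats "$" literally
no_comma  <- gsub(",", "", no_dollar)                # remove every comma

price_num <- as.numeric(no_comma)  # the cleaned text can now be converted to numbers
price_num
```

Without fixed = TRUE, "$" is read as a regular-expression symbol for "end of the text" rather than as a dollar sign.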

Other basic data management commands
- rename(…): renames a variable
- relocate(…): moves a variable in a data set which can also be done with select(…)
  - although moving variables using relocate and renaming variables seem trivial, they are not
  - good data management requires these steps
- separate(…): opposite of paste(…)
- paste(…) combines or concatenates text and separate(…) splits text
  - both together are essential for being able to manipulate data
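
A short sketch of rename(…) and relocate(…) on the built-in cars data set. The new variable name stop_dist and the object names cars2 and cars3 are arbitrary choices for this example.

```r
library(dplyr)

cars2 <- cars %>%
  rename(stop_dist = dist) %>%  # new name on the left, old name on the right
  relocate(stop_dist)           # with no other arguments, moves the column to the front

# select() can also reorder: list the columns in the order wanted
cars3 <- cars2 %>%
  select(speed, stop_dist)

names(cars2)
names(cars3)
```

relocate(…) keeps all columns and only changes their order, while select(…) drops any column that is not listed.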

Additional Comments
- You are not limited to code covered in class.
- If you know or find R code online that helps you answer a question, that is great.
- If you do use code you find online, you will be asked to provide a link in your R script to show where you got the code from
- For example: let’s say I google "change facet text in ggplot" (I did this search for Lecture 19)
  - I would include the link as a comment in my script (see the comments below)
- Another example: as mentioned above, you may use unite(…) instead of paste(…), but provide a link to the documentation you used

# source for my R code about facets
# https://r-graphics.org/recipe-facet-label-text

# reference for separate(...):
# https://tidyr.tidyverse.org/reference/separate.html

# reference for unite(...)
# https://tidyr.tidyverse.org/reference/unite.html

It should go without saying that you cannot work together or ask questions of anyone except me (Prof. Pooler).
You can use the internet, your textbook, previous lecture, previous HW assignments, etc.