Load these packages:
I will specify the packages to be loaded for the quiz.
You may also load different packages or use base R commands if you choose.
# install.packages("gridExtra")
library(nycflights13)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(knitr)
Format of Quiz 2
Timed take-home
Multiple different versions
Each version will have a unique data set
Each version will ask students to demonstrate 4 to 5 data management skills
You will be writing commands yourself, but I may provide some hints
Each version will ask questions based on managed data
Each version will require one basic readable plot to be submitted
Each version WILL ALSO include some conceptual (why/how) questions
- For example: what is the %>% operator, and why is it used?
Can be completed between 2:00 PM on Thursday and Midnight on Friday
I will be available:
Thursday 2:00 PM to 3:30 PM
Thursday 4:15 PM to 5:30 PM
Thursday 7:30 PM to 9:00 PM
Friday TBD (depends on Meeting and Family - will post on BB Thursday)
Lecture 13 (HW 5)
Data Management Essential Skills
Note that many of these skills build on and expand skills from the first half of the course.
Recall these dplyr commands (expanded list to include some tidyr commands):
Function | Use |
---|---|
filter() | Pick (subset) observations by their value |
arrange() | Reorder the rows |
select() | Pick variables by their names |
rename() | Rename a variable in a data set |
relocate() | Reorder variables in a data set (can also use select for this) |
separate() | Split text variables into multiple variables |
mutate() | Create new variables with functions of existing variables |
summarise() | Collapse many values down to a single summary |
group_by() | Used in conjunction with these functions to change scope, e.g., by category |
group_by(…) followed by summarize(…)
NOTE: group_by followed by summarize has been useful in multiple lectures or HW assignments
Given that universal utility, I am likely to ask a question that will require using these commands
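As a refresher, here is a minimal sketch of the group_by(…) then summarise(…) pattern. It assumes the built-in mtcars data set rather than any quiz data, and the names mpg_by_cyl, n_cars, and mean_mpg are made up for illustration.

```r
library(dplyr)

# mtcars is a built-in data set; cyl is the number of cylinders
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                  # change scope: one group per cyl value
  summarise(n_cars = n(),            # number of rows in each group
            mean_mpg = mean(mpg))    # mean miles per gallon in each group

mpg_by_cyl
```

The summary has one row per category of the grouping variable.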
Interactive Question 1
When you create a group_by object, how many observations will it have in comparison to the original data set?
NOTE paste(…) vs unite(…)
In this course, I use paste(…) as the opposite of separate
tidyr package has a different command, unite(…)
You are welcome to use either, but show the link you used as a reference (see below)
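The two approaches can be compared with a small sketch. The data frame people and its columns are hypothetical toy data, not anything from the lectures.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
people <- data.frame(first = c("Ada", "Alan"),
                     last  = c("Lovelace", "Turing"))

# Option 1: paste(...) inside mutate(...); original columns are kept
pasted <- people %>%
  mutate(full_name = paste(first, last, sep = " "))

# Option 2: unite(...); remove = FALSE keeps the original columns
united <- people %>%
  unite("full_name", first, last, sep = " ", remove = FALSE)

pasted$full_name
united$full_name
```

Both versions produce the same text; they differ mainly in where the new column is placed and in whether the original columns are kept by default.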
filter(…) and select(…) commands OR can use […,…] square brackets
filter is used to subset data by row, e.g. category, value, or row number
select is used to subset the data by column, e.g. variable name or column number
Square brackets can also be used to subset data to specific rows and columns
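A short sketch of the same subset done both ways, assuming the built-in iris data set; the cutoff of 5.5 and the object names are arbitrary choices for illustration.

```r
library(dplyr)

# dplyr: filter(...) picks rows, select(...) picks columns
big_setosa_dplyr <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5.5) %>%
  select(Species, Sepal.Length)

# Base R: data[rows, columns] does both in one step
big_setosa_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5,
                        c("Species", "Sepal.Length")]

big_setosa_dplyr
big_setosa_base
```

The values match; the base R version keeps the original row numbers as row names, while the dplyr version renumbers the rows.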
Interactive Question 2
In order to use filter in some cases it is efficient to add a numeric variable of the row numbers to the data set. If the data set is named cars, what is a command to do this in R?
NOTE: cars is a real data set in R that you can use to test commands like the ones in this question.
mutate(…) with calculations or other variable modifications and creations
- used to create a new variable
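For instance, a minimal sketch using the built-in cars data set, where speed is in miles per hour and dist is in feet. The new variable names speed_kph and dist_m are made up for this example.

```r
library(dplyr)

cars_metric <- cars %>%
  mutate(speed_kph = speed * 1.609344,   # miles per hour to kilometers per hour
         dist_m    = dist * 0.3048)      # feet to meters

head(cars_metric)
```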
merge(…) or full_join(…)
RECALL: in lecture 13 and HW 5 we used full_join to join the weather and flight data
In the code library there is an example of merge
Other types of joins are used to achieve specific goals, e.g. left_join, right_join, etc.
You are responsible for full_join, including why joining is needed and how it is limited
- For example: how would you merge 3 or more data sets?
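A minimal sketch of full_join on two small data frames. The data here are invented toy values, not the real flights and weather data, and the names flights_toy and weather_toy are hypothetical.

```r
library(dplyr)

# Invented toy data: each data frame covers a slightly different set of days
flights_toy <- data.frame(day = c(1, 2, 3), n_flights = c(842, 943, 914))
weather_toy <- data.frame(day = c(2, 3, 4), mean_temp = c(28.7, 29.6, 34.3))

# full_join keeps every day that appears in either data frame
joined <- full_join(flights_toy, weather_toy, by = "day")

joined
```

Days that appear in only one data frame are kept, with NA filled in for the variables from the other data frame.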
pivot_wider(…) and pivot_longer(…)
Recall that pivot_wider was used in HW 5 to create a new variable (column) for each category (airline)
In Lecture 15 (and HW 6) we used pivot_longer (mentioned below)
Quiz 2 will ask each student to use at least one of these skills to manage data and answer questions
both pivot_longer and pivot_wider are essential skills
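The two commands undo each other, which a small sketch can show. The data frame below is invented toy data (two carriers over two months), and all object names are hypothetical.

```r
library(tidyr)

# Invented toy data in "long" (stacked) format
long_df <- data.frame(month     = c(1, 1, 2, 2),
                      carrier   = c("AA", "UA", "AA", "UA"),
                      n_flights = c(10, 12, 9, 14))

# pivot_wider: one new column per category of carrier
wide_df <- pivot_wider(long_df,
                       names_from  = carrier,
                       values_from = n_flights)

# pivot_longer: stack the carrier columns back into two columns
back_to_long <- pivot_longer(wide_df,
                             cols      = c(AA, UA),
                             names_to  = "carrier",
                             values_to = "n_flights")

wide_df
back_to_long
```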
is.na(…) with colSums(…) and colMeans(…)
Recall that lecture 13 begins with a demo of a small “toy” data set
That demo and the lecture show that you can calculate the number or percent of missing values
We can determine number of missing values or percent missing for
a whole data frame
a single variable (column)
all columns in a data set simultaneously
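A minimal base R sketch of all three cases. The data frame toy is invented for this example and is not the toy data set from lecture 13.

```r
# Invented toy data with a few missing values
toy <- data.frame(x = c(1, NA, 3, NA),
                  y = c("a", "b", NA, "d"),
                  z = c(TRUE, FALSE, TRUE, TRUE))

# Whole data frame
n_missing_all   <- sum(is.na(toy))          # 3 missing values in total
pct_missing_all <- mean(is.na(toy)) * 100   # 25 percent of the 12 cells

# A single variable (column)
n_missing_x <- sum(is.na(toy$x))            # 2

# All columns simultaneously
n_missing_col   <- colSums(is.na(toy))          # x = 2, y = 1, z = 0
pct_missing_col <- colMeans(is.na(toy)) * 100   # x = 50, y = 25, z = 0
```

This works because is.na(…) returns TRUE/FALSE values, and R treats TRUE as 1 and FALSE as 0 when summing or averaging.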
Lectures 14 and 15 (HW 6)
Lecture 14 focused on some useful commands from lubridate package
students should know
paste(…), which is a Base R command used to join text
all functions and variations from list below
by variations: ymd(…), mdy(…), dmy(…)
fnctns <- c("wday(...)", "month(...)", "day(...)", "now()", "today()",
            "ymd(...)", "ymd_h(...)", "ymd_hm(...)", "ymd_hms(...)")
descrips <- c("outputs day of week for an input date as a number (Sun. = 1) or text",
              "outputs month for an input date as a number or text",
              "outputs day of month for an input date as a number",
              "outputs current date and time in time zone, similar to Sys.time() in Base R",
              "outputs current date, similar to Sys.Date() in Base R",
              "converts year, month and day input to a date (also includes dmy(), mdy())",
              "converts year, month, day and hour to a date-time object",
              "converts year, month, day, hour and minutes to a date-time object",
              "converts year, month, day, hour, minutes, and seconds to a date-time object")
lbrdt_df <- data.frame(fnctns, descrips)
kable(lbrdt_df, col.names=c("Duration Function", "Description"), caption="Common Lubridate Functions")
Duration Function | Description |
---|---|
wday(…) | outputs day of week for an input date as a number (Sun. = 1) or text |
month(…) | outputs month for an input date as a number or text |
day(…) | outputs day of month for an input date as a number |
now() | outputs current date and time in time zone, similar to Sys.time() in Base R |
today() | outputs current date, similar to Sys.Date() in Base R |
ymd(…) | converts year, month and day input to a date (also includes dmy(), mdy()) |
ymd_h(…) | converts year, month, day and hour to a date-time object |
ymd_hm(…) | converts year, month, day, hour and minutes to a date-time object |
ymd_hms(…) | converts year, month, day, hour, minutes, and seconds to a date-time object |
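A few of these functions in action, as a minimal sketch. The date 2013-01-15 is an arbitrary example, and the object names are made up.

```r
library(lubridate)

# The same date entered in three different text formats
d_ymd <- ymd("2013-01-15")
d_mdy <- mdy("01/15/2013")
d_dmy <- dmy("15-01-2013")

day_num  <- wday(d_ymd)     # 3, because Sunday = 1 and this date is a Tuesday
month_num <- month(d_ymd)   # 1
day_of_month <- day(d_ymd)  # 15

wday(d_ymd, label = TRUE)   # day of week as text (abbreviated by default)
month(d_ymd, label = TRUE)  # month as text (abbreviated by default)

# Date-time object: year, month, day, hour, and minutes
dt <- ymd_hm("2013-01-15 14:30")
dt
```

All three parsing functions return the same date; the function name just tells R the order in which year, month, and day appear in the text.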
Lecture 15 (and HW 6):
as.xts used to convert a data frame with a date variable to an .xts object
converting an .xts object to a data frame
pivot_longer was used to ‘stack’ data from multiple columns for plotting
Line (geom_line) plot is short and simple if data are in “stacked” format
hchart (highcharter package) requires an .xts object
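A minimal sketch of the round trip between a data frame and an .xts object. Note the assumptions: the data are invented, and this sketch uses the xts(…) constructor with order.by instead of as.xts(…), because it states explicitly which column holds the dates. The lecture code may look different.

```r
library(xts)  # also attaches zoo, which provides index() and coredata()

# Invented toy data with a date variable
prices_df <- data.frame(date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),
                        a = c(10, 11, 12),
                        b = c(20, 19, 21))

# Data frame to xts: numeric columns are the data, the dates are the index
prices_xts <- xts(prices_df[, c("a", "b")], order.by = prices_df$date)

# xts back to a data frame: index() gives the dates, coredata() the values
back_df <- data.frame(date = index(prices_xts), coredata(prices_xts))

prices_xts
back_df
```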
Lectures 16 and 17 (and HW 7)
writing functions to automate repetitive tasks
key components of a function are
function input(s): data frames, variables, values
what the function does: how the function acts on the inputs
function output(s): result of calling the function
ESSENTIAL FUNCTION FORMATTING: last step in function NOT assigned to a name
- This is key so that function output can be output and saved
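A minimal sketch of this formatting, using a hypothetical helper named pct_missing that is not from the lectures.

```r
# Hypothetical helper: percent of missing values in a vector
pct_missing <- function(x) {            # input: a vector
  n_missing <- sum(is.na(x))            # intermediate step, assigned to a name
  100 * n_missing / length(x)           # last step is NOT assigned to a name
}

# The output can be shown in the console ...
pct_missing(c(1, NA, 3, NA))

# ... or saved to a name for later use
result <- pct_missing(c(1, NA, 3, NA))
result
```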
Interactive Question 3
True or False: A function can either output a result to the console OR save the result to a .csv file in the working directory, but not both.
Lectures 18 and 19
These lectures focused on a single complex data management example
some aspects of this work reviewed concepts from previous lectures
For example: in these lectures (and previous ones), gsub(…) was used
- What does gsub(…) do?
Other basic data management commands
rename(…): renames a variable
relocate(…): moves a variable in a data set which can also be done with select(…)
although moving variables using relocate and renaming variables seem trivial, they are not
good data management requires these steps
separate(…): opposite of paste(…)
paste(…) combines or concatenates text and separate(…) splits text
- both together are essential for being able to manipulate data
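These commands are often chained together. Here is a minimal sketch on invented toy data; the data frame routes and its column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Invented toy data: origin and destination stored in one text variable
routes <- data.frame(rt = c("JFK-LAX", "EWR-ORD"), n = c(5, 7))

tidy_routes <- routes %>%
  rename(n_flights = n) %>%                                # new_name = old_name
  separate(rt, into = c("origin", "dest"), sep = "-") %>%  # split text at "-"
  relocate(n_flights, .before = origin)                    # move column to the front

tidy_routes
```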
Additional Comments
You are not limited to code covered in class.
If you know or find R code online that helps you answer a question, that is great.
If you do use code you find online, you will be asked to provide a link in your R script to show where you got the code from
For example: Let’s say I google: change facet text in ggplot (I did this search for Lecture 19)
- I would include the link as a comment in my script:
Another example: as mentioned above, you may use unite(…) instead of paste(…) but provide a link to your documentation
# source for my R code about facets
# https://r-graphics.org/recipe-facet-label-text
# reference for separate(...):
# https://tidyr.tidyverse.org/reference/separate.html
# reference for unite(...)
# https://tidyr.tidyverse.org/reference/unite.html
It should go without saying that you cannot work together or ask questions of anyone except me (Prof. Pooler)
You can use the internet, your textbook, previous lecture, previous HW assignments, etc.