- 3rd-year PhD Student
- I’ve been using R for about 6 years
- Developed 3 Packages
- Written ~60,000 lines of R code
- Began learning R outside of Academia
- I’m teaching an undergrad R class; follow along here
- Always willing to help those interested in learning

09/01/2020

array of topics (i.e., beginner to advanced)
rm(list=ls())
library(ggplot2)
library(extrafont, quietly = TRUE)
library(gridExtra)
library(grid)
library(gtable)
library(dplyr)
source('M:/Applications/R/Workspace/General/get_data_functions.R')
source('U:/Work/Reports/Spring Meeting 2018/spring_meeting_functions_pvi.R')
certs <- c("RDCS","RDMS","RPVI","RMSKS","RMSK","RVT")
status <- c("Granted","Revoked","Permanently Revoked")
df <- get_certifications(start_date = "2013-12-31", end_date = "2018-05-02")
df$Year <- lubridate::year(as.Date(df$DateGranted))
df <- subset(df, CertificationTitle %in% certs & CertificationStatus %in% status &
Year >= 2014 & Year != 2018)
Scripts are not meant to be reused nor interacted with
Functions consist of 4 basic parts:
Function Name
Arguments
Function Body
Return Value
Think of washing clothes as the function name; dirty clothes, detergent amount, wash time, and washer setting as a few possible arguments; the act of washing as the function body; and clean clothes as the return value.
wash_clothes <- function(dirty_clothes,
detergent_amount = 100,
wash_time = 60,
washer_setting = "delicates") {
...
return(clean_clothes)
}
# inputs can change
wash_clothes(dirty_clothes =
c("jeans","jorts", "wallet"))
# so you're not repeating yourself
wash_clothes(dirty_clothes = rep("shirt", 7),
washer_setting = "cottons")
Think of the arguments as inputs that the function takes back to their workshop to produce something nice for us.
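The body of wash_clothes is elided with `...`, so it cannot be run as written. As a minimal sketch only, here is the same function with a hypothetical body (the `paste()` call and what it returns are assumptions made for illustration) so the four parts can be seen working together:

```r
# Function name: wash_clothes
# Arguments: dirty_clothes, plus three arguments that have default values
wash_clothes <- function(dirty_clothes,
                         detergent_amount = 100,
                         wash_time = 60,
                         washer_setting = "delicates") {
  # Function body (hypothetical): label every item as clean
  clean_clothes <- paste("clean", dirty_clothes)
  # Return value
  return(clean_clothes)
}

# only the required argument is supplied; the defaults fill in the rest
wash_clothes(dirty_clothes = c("jeans", "jorts", "wallet"))
```

Arguments with defaults only need to be supplied when you want something other than the default.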
say_hello <- function(person, and_goodbye = FALSE) {
res <- paste0("Hello ", person, collapse = "\n")
if(and_goodbye) {
res <- paste0(res, "\n" , paste0("Goodbye ", person, collapse = "\n"))
}
return(cat(res))
}
# R will search its parent environment, so be careful when naming those arguments
person <- "Shea"
say_hello(person, FALSE)
## Hello Shea
# Just hello to one "person"
say_hello("Adele")
## Hello Adele
# And Goodbye
say_hello("Adele", TRUE)
## Hello Adele
## Goodbye Adele
# Just hello to many people
say_hello(c("It's me", "Can you hear me?", "from the other side", "from the outside"))
## Hello It's me
## Hello Can you hear me?
## Hello from the other side
## Hello from the outside
# Even numbers
say_hello(1:5, TRUE)
## Hello 1
## Hello 2
## Hello 3
## Hello 4
## Hello 5
## Goodbye 1
## Goodbye 2
## Goodbye 3
## Goodbye 4
## Goodbye 5
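The comment about R searching the parent environment can be made concrete. In this sketch (the names `greeting` and `greet` are hypothetical and not part of the slides), the function uses a variable it never received as an argument, so R looks for it outside the function:

```r
greeting <- "Hello"

# `greeting` is not an argument, so R finds it in the enclosing environment
greet <- function(person) {
  paste(greeting, person)
}

greet("Shea")

# changing the outside variable silently changes what the function returns
greeting <- "Goodbye"
greet("Shea")
```

Passing everything a function needs as an argument avoids this kind of surprise.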
Indexing is fundamental to programming; it allows us to subset and filter data. In R, we can index data using the basic format vector1[index_vector]. No matter how large y is, indexing by positive positions always returns data the same length as the index_vector inside the [].
y <- seq(10, 100, by = 10)
# if we want to look at the first element
y[1]
## [1] 10
# or the last 6 elements
y[5:10]
## [1] 50 60 70 80 90 100
# we can also repeat things
y[c(1,1,1,3,3,3)]
## [1] 10 10 10 30 30 30
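A short sketch of the length rule, reusing the same y. The last two calls go slightly beyond the slide and are included as edge cases worth knowing about:

```r
y <- seq(10, 100, by = 10)

# the result is as long as the index, not as long as y
length(y[c(1, 1, 1, 3, 3, 3)])

# asking for a position past the end of y returns NA
y[11]

# negative positions drop elements instead of selecting them,
# so here the result is shorter than y by one
y[-1]
```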
You can also index by logical vectors; this is called filtering. A logical (Boolean) vector is a vector that contains only TRUE and FALSE values.
# Notice what changes
y[TRUE]
## [1] 10 20 30 40 50 60 70 80 90 100
# R recycles logical indices, unlike positional indexing
y[c(TRUE, FALSE)]
## [1] 10 30 50 70 90
# we can use conditional statements
greater_than_40 <- y > 40
y[greater_than_40]
## [1] 50 60 70 80 90 100
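A few related idioms, sketched with the same y. These are standard base R, but they are additions to the slide rather than part of it:

```r
y <- seq(10, 100, by = 10)

# combine conditions with & (and) or | (or)
y[y > 40 & y < 80]

# which() converts a logical vector into positions
which(y > 40)

# sum() counts how many values are TRUE
sum(y > 40)
```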
Since we work with tabular data, let’s simulate a data.frame (i.e., dataset), but let’s do it really fast: 200 columns by 300 rows.
# just so we are all doing the same thing
set.seed(46)
# 100 items on a 5-point scale
likert5 <- replicate(100, sample(1:5, 300, replace = T), simplify = "list")
# 100 items on a 3-point scale
likert3 <- replicate(100, sample(1:3, 300, replace = T), simplify = "list")
# combine em
sim_data <- data.frame(likert5, likert3)
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 5 | 1 | 1 | 2 | 2 | 2 | 4 | 1 | 3 |
| 1 | 2 | 3 | 2 | 5 | 1 | 2 | 3 | 5 | 4 |
| 4 | 2 | 5 | 4 | 2 | 2 | 1 | 1 | 3 | 2 |
| 4 | 5 | 2 | 3 | 2 | 4 | 3 | 2 | 3 | 2 |
| 1 | 4 | 5 | 3 | 1 | 1 | 3 | 2 | 1 | 4 |
| 4 | 1 | 4 | 5 | 1 | 1 | 5 | 5 | 1 | 2 |
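After simulating data it is worth confirming that it has the shape you intended. This sketch repeats the simulation code from the slide unchanged and then runs two quick checks, which are additions for illustration:

```r
set.seed(46)
likert5 <- replicate(100, sample(1:5, 300, replace = T), simplify = "list")
likert3 <- replicate(100, sample(1:3, 300, replace = T), simplify = "list")
sim_data <- data.frame(likert5, likert3)

# rows first, then columns
dim(sim_data)

# the first 100 columns should only contain values from 1 to 5
range(unlist(sim_data[, 1:100]))

# the last 100 columns should only contain values from 1 to 3
range(unlist(sim_data[, 101:200]))
```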
Recoding data is a standard activity in the social sciences. Imagine recoding the 200 variables we just simulated within a script.
Instead, why don’t we make our own function to do just that? One that we can use regardless of the dataset.
This gets at the idea of extensibility. It’s intuitive to distinguish between data A and data B, especially when you think about the details.
If you close your eyes and hold your nose, apples and potatoes aren’t that different.
Focus on single-use (i.e., if you chase two rabbits, you’ll catch neither). Don’t make a function that recodes values, plots the data, and makes you coffee.
Start simple.
Let’s think about replacing just one value in one column.
What arguments would we need?
- x: the column or vector of values
- old_value: a value in x to be replaced
- new_value: a value to replace the old_value

recode_value <- function(x, old_value, new_value) {
# let's check where x is equal to the old value
old_val_indx <- x == old_value
# now we can tell R to look WITHIN x and replace the values where the old_value was found
x[old_val_indx] <- new_value
# now let's return the updated x
return(x)
}
| old | new |
|---|---|
| 2 | 2 |
| 1 | -99 |
| 4 | 4 |
| 4 | 4 |
| 1 | -99 |
| 4 | 4 |
| 5 | 5 |
| 3 | 3 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 3 | 3 |
| 4 | 4 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 2 | 2 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 3 | 3 |
| 3 | 3 |
| 4 | 4 |
| 1 | -99 |
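The call that produced the table is not shown. Judging from the values, it replaces 1 with -99, so the sketch below is an inference rather than the original code. It uses a compact version of recode_value and the first eight values of the old column:

```r
# compact version of the function defined above
recode_value <- function(x, old_value, new_value) {
  x[x == old_value] <- new_value
  return(x)
}

# first eight values of the "old" column in the table
old <- c(2, 1, 4, 4, 1, 4, 5, 3)

# assumed call: replace every 1 with -99
new <- recode_value(old, old_value = 1, new_value = -99)

# put the two side by side, like the table
data.frame(old, new)
```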
There are a few issues that we can’t take for granted…
What about user error?

# accidentally using a character vector instead of a numeric vector
recode_value(rep(c("A", "B"), 150), 1, -99)
# specifies an old value that isn't in x
recode_value(sim_data[,1], -1, -99)
When a character vector is used instead of a numeric vector
# accidentally using a character vector instead of a numeric vector
recode_value(rep(c("A", "B"), 150), 1, -99)
##   [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [19] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [37] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [55] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [73] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [91] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [109] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [127] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [145] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [163] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [181] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [199] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [217] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [235] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [253] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [271] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [289] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
When no old value is found in the vector to be recoded (i.e., x)
# specifies an old value that isn't in x
recode_value(sim_data[,1], -1, -99)
##   [1] 2 1 4 4 1 4 5 3 4 4 4 3 4 3 3 3 3 2 5 5 5 3 3 4 1 5 4 5 1 5 2 3 5 2 5 4 1
##  [38] 2 3 1 2 2 3 4 2 3 4 3 3 2 3 5 4 4 5 4 1 4 3 3 4 3 5 3 4 1 5 5 4 1 1 4 2 3
##  [75] 4 1 2 2 3 5 1 1 3 2 5 5 2 4 4 4 2 3 5 4 3 4 1 3 4 3 1 2 5 1 4 5 5 5 4 5 1
## [112] 4 1 5 2 2 1 5 4 1 4 1 5 5 4 4 5 3 3 2 1 1 3 2 1 1 3 3 2 1 5 2 2 5 5 4 1 2
## [149] 3 3 1 3 3 2 2 1 1 1 2 4 3 5 1 4 4 2 1 3 1 1 4 1 1 2 1 2 2 4 4 5 3 4 2 4 5
## [186] 5 1 4 5 5 4 4 5 2 1 3 5 2 5 1 5 1 3 5 3 4 5 1 3 4 5 3 2 2 4 3 3 5 5 5 5 2
## [223] 2 3 3 4 2 4 4 4 3 3 1 5 1 3 4 1 3 1 2 5 3 3 3 2 5 2 2 1 4 4 2 5 4 3 5 1 2
## [260] 3 5 3 4 1 1 3 4 5 2 3 4 5 5 2 2 3 2 1 2 3 5 1 4 5 4 2 1 1 1 1 4 3 2 2 4 4
## [297] 3 4 2 5
Let’s add some checks to validate the arguments so they’re what we want
recode_value <- function(x, old_value, new_value) {
# make sure everything is numeric
tryCatch(
stopifnot(is.numeric(x), is.numeric(old_value), is.numeric(new_value)),
error = function(err) {
err$message <- "all arguments must be numeric"
stop(err)
})
# check for argument length
if(length(old_value) != 1 || length(new_value) != 1) {
stop("old_value and new_value must be a single value")
}
# let's check where x is equal to the old value
old_val_indx <- x == old_value
#check to see if old_value wasn't found
if(!any(old_val_indx)) {
warning("old_value not found in x...returning x")
return(x)
}
# now we can tell R to look WITHIN x and replace the values where the old_value was found
x[old_val_indx] <- new_value
# now let's return the updated x
return(x)
}
When a character vector is used instead of a numeric vector
## Error in recode_value(rep(c("A", "B"), 150), 1, -99): all arguments must be numeric
When no old value is found in the vector to be recoded (i.e., x)
## Warning in recode_value(sim_data[, 1], -1, -99): old_value not found in
## x...returning x
##   [1] 2 1 4 4 1 4 5 3 4 4 4 3 4 3 3 3 3 2 5 5 5 3 3 4 1 5 4 5 1 5 2 3 5 2 5 4 1
##  [38] 2 3 1 2 2 3 4 2 3 4 3 3 2 3 5 4 4 5 4 1 4 3 3 4 3 5 3 4 1 5 5 4 1 1 4 2 3
##  [75] 4 1 2 2 3 5 1 1 3 2 5 5 2 4 4 4 2 3 5 4 3 4 1 3 4 3 1 2 5 1 4 5 5 5 4 5 1
## [112] 4 1 5 2 2 1 5 4 1 4 1 5 5 4 4 5 3 3 2 1 1 3 2 1 1 3 3 2 1 5 2 2 5 5 4 1 2
## [149] 3 3 1 3 3 2 2 1 1 1 2 4 3 5 1 4 4 2 1 3 1 1 4 1 1 2 1 2 2 4 4 5 3 4 2 4 5
## [186] 5 1 4 5 5 4 4 5 2 1 3 5 2 5 1 5 1 3 5 3 4 5 1 3 4 5 3 2 2 4 3 3 5 5 5 5 2
## [223] 2 3 3 4 2 4 4 4 3 3 1 5 1 3 4 1 3 1 2 5 3 3 3 2 5 2 2 1 4 4 2 5 4 3 5 1 2
## [260] 3 5 3 4 1 1 3 4 5 2 3 4 5 5 2 2 3 2 1 2 3 5 1 4 5 4 2 1 1 1 1 4 3 2 2 4 4
## [297] 3 4 2 5
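The tryCatch(stopifnot(...)) pattern above is one way to raise a custom error message. As an alternative sketch, a plain if statement with stop() can raise the same message. The helper name `check_numeric` is hypothetical and not part of the original function. The second half shows how code that calls a function can catch the error instead of halting:

```r
# hypothetical helper: raise an error unless every argument is numeric
check_numeric <- function(...) {
  args <- list(...)
  all_numeric <- all(vapply(args, is.numeric, logical(1)))
  if (!all_numeric) {
    stop("all arguments must be numeric", call. = FALSE)
  }
  invisible(TRUE)
}

# tryCatch() lets the caller decide what happens when an error occurs;
# here we simply keep the error message as a character string
result <- tryCatch(
  check_numeric(c("A", "B"), 1, -99),
  error = function(err) conditionMessage(err)
)
result
```

Which pattern to use is a matter of taste; the if/stop() version trades the compactness of stopifnot() for a message you control directly.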
But our function isn’t that efficient…just one value at a time? We should be able to do multiple values. Let’s look at our old code and change some things. Starting with the arguments. Let’s name them old_values and new_values, so it makes sense to others. Additionally we can keep our validator to make sure everything is numeric.
recode_values <- function(x, old_values, new_values) {
# make sure everything is numeric
tryCatch(
stopifnot(is.numeric(x), is.numeric(old_values), is.numeric(new_values)),
error = function(err) {
err$message <- "all arguments must be numeric"
stop(err)
})
# we need to change things here
...
# now let's return the updated x
return(x)
}
There are many ways to create this function (we could even use recode_value over and over again), but that isn’t efficient. Luckily, R has a function that helps us match multiple values in a vector. As an example, let me show you just one column.
# assign one example column to an object
first_column <- sim_data[,1]
# find 2, 4, and 5 in that column
match(first_column, c(2,4,5))
##   [1] 1 NA 2 2 NA 2 3 NA 2 2 2 NA 2 NA NA NA NA 1 3 3 3 NA NA 2 NA
##  [26] 3 2 3 NA 3 1 NA 3 1 3 2 NA 1 NA NA 1 1 NA 2 1 NA 2 NA NA 1
##  [51] NA 3 2 2 3 2 NA 2 NA NA 2 NA 3 NA 2 NA 3 3 2 NA NA 2 1 NA 2
##  [76] NA 1 1 NA 3 NA NA NA 1 3 3 1 2 2 2 1 NA 3 2 NA 2 NA NA 2 NA
## [101] NA 1 3 NA 2 3 3 3 2 3 NA 2 NA 3 1 1 NA 3 2 NA 2 NA 3 3 2
## [126] 2 3 NA NA 1 NA NA NA 1 NA NA NA NA 1 NA 3 1 1 3 3 2 NA 1 NA NA
## [151] NA NA NA 1 1 NA NA NA 1 2 NA 3 NA 2 2 1 NA NA NA NA 2 NA NA 1 NA
## [176] 1 1 2 2 3 NA 2 1 2 3 3 NA 2 3 3 2 2 3 1 NA NA 3 1 3 NA
## [201] 3 NA NA 3 NA 2 3 NA NA 2 3 NA 1 1 2 NA NA 3 3 3 3 1 1 NA NA
## [226] 2 1 2 2 2 NA NA NA 3 NA NA 2 NA NA NA 1 3 NA NA NA 1 3 1 1 NA
## [251] 2 2 1 3 2 NA 3 NA 1 NA 3 NA 2 NA NA NA 2 3 1 NA 2 3 3 1 1
## [276] NA 1 NA 1 NA 3 NA 2 3 2 1 NA NA NA NA 2 NA 1 1 2 2 NA 2 1 3
What is happening?
match() looks at each element (i.e., value) in the column one by one
Then it tells you the position at which it finds that value in the second argument, which is c(2,4,5) in our example
For example, the first value match returned was a 1, meaning the value it matched is the first number in c(2,4,5)
If match doesn’t find the column value in the look-up values (i.e., for our example, anything that isn’t 2, 4, or 5), it returns an NA
# let's look at a simplified example
some_letters <- rep(LETTERS[1:5], 5)
some_letters
##  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D"
## [20] "E" "A" "B" "C" "D" "E"
# get indices of matches
match(some_letters, c("E","A"))
## [1] 2 NA NA NA 1 2 NA NA NA 1 2 NA NA NA 1 2 NA NA NA 1 2 NA NA NA 1
# from a different perspective
df <- data.frame(letter = some_letters,
indx = match(some_letters, c("E","A")),
is_E = some_letters == "E",
is_A = some_letters == "A")
head(df,10)
##    letter indx  is_E  is_A
## 1       A    2 FALSE  TRUE
## 2       B   NA FALSE FALSE
## 3       C   NA FALSE FALSE
## 4       D   NA FALSE FALSE
## 5       E    1  TRUE FALSE
## 6       A    2 FALSE  TRUE
## 7       B   NA FALSE FALSE
## 8       C   NA FALSE FALSE
## 9       D   NA FALSE FALSE
## 10      E    1  TRUE FALSE
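If you only need to know whether a value was matched, and not where, base R's `%in%` operator answers that question directly. A small sketch connecting it to match(), using the same letters:

```r
some_letters <- rep(LETTERS[1:5], 5)

# match() returns the position of each match, or NA when there is none
idx <- match(some_letters, c("E", "A"))

# %in% only reports whether a match exists
found <- some_letters %in% c("E", "A")

# the two agree: a position is missing exactly when %in% is FALSE
identical(!is.na(idx), found)
```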
If the code match(first_column[1:5], c(2,4,5)) returned
## [1] 1 2 2 3 1
What would be the first 5 values in the first_column?
Let’s check…
# Hypothetical example
c(2,4,5)[c(1,2,2,3,1)]
## [1] 2 4 4 5 2
We can take the values returned from match and use them to index a smaller vector.

So we have our first_column. For this example, let’s make it a bit smaller, just the first 10 values.
first_column_10 <- first_column[1:10]
And we can match it against the possible values of our column, 1 to 5; the result is known as an index
match(first_column_10, c(1,2,3,4,5))
## [1] 2 1 4 4 1 4 5 3 4 4
We can use this index to create a vector based on new values (given they are the same length as the values we mapped to)
# find the index of old values
indx_of_old_values <- match(first_column_10, c(1,2,3,4,5))
# index the new values to "replace" them
c(10,20,30,40,50)[indx_of_old_values]
## [1] 20 10 40 40 10 40 50 30 40 40
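The old and new values are paired by position, so any new vector of the same length works. As a sketch, here is the same trick used to reverse code the ten values shown above (typed in by hand so the example runs on its own):

```r
# first_column_10, copied from the output above
x <- c(2, 1, 4, 4, 1, 4, 5, 3, 4, 4)

old_values <- 1:5
new_values <- rev(old_values)  # 5 4 3 2 1

# position of each x value within old_values
indx <- match(x, old_values)

# use those positions to pick out the paired new value
new_values[indx]
```

Here a 1 becomes a 5, a 2 becomes a 4, and so on, without writing one replacement per value.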
recode_values <- function(x, old_values, new_values) {
# make sure everything is numeric
tryCatch(
stopifnot(is.numeric(x), is.numeric(old_values), is.numeric(new_values)),
error = function(err) {
err$message <- "all arguments must be numeric"
stop(err)
})
# What validation could we use?
# ?
# ?
# ?
# find the old values in x
old_value_index <- match(x, old_values)
# use the index to look up the new values, then return the updated x
x <- new_values[old_value_index]
return(x)
}
How long do you think it would take you to reverse code each of the 200 columns? Or even subtract 1 from them? Maybe add 1? Let’s use the function we made to do these things.
We can time ourselves.
t0 <- Sys.time()
# reverse code
rev_sim_data <- sapply(sim_data, function(x) {
old_vals <- sort(unique(x))
x <- recode_values(x, old_vals, rev(old_vals))
})
# add 1
add1_sim_data <- sapply(sim_data, function(x) {
old_vals <- sort(unique(x))
x <- recode_values(x, old_vals, old_vals + 1)
})
# subtract 1
sub1_sim_data <- sapply(sim_data, function(x) {
old_vals <- sort(unique(x))
x <- recode_values(x, old_vals, old_vals - 1)
})
#how much time did it take
final_time <- Sys.time() - t0
Time difference of 0.1130021 secs
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  2  5  1  1  2  2  2  4  1   3
2  1  2  3  2  5  1  2  3  5   4

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  4  1  5  5  4  4  4  2  5   3
[2,]  5  4  3  4  1  5  4  3  1   2

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  3  6  2  2  3  3  3  5  2   4
[2,]  2  3  4  3  6  2  3  4  6   5

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  1  4  0  0  1  1  1  3  0   2
[2,]  0  1  2  1  4  0  1  2  4   3
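The `[1,]` row labels in the recoded output show that sapply() returned a matrix rather than a data.frame. If you want to keep a data.frame, one common alternative is to assign the result of lapply() back into the columns. The sketch below uses a small made-up data.frame (`toy`) and a hypothetical helper (`reverse_code`) so that it runs on its own:

```r
# small stand-in for sim_data (made-up values)
toy <- data.frame(a = c(1, 2, 3), b = c(3, 3, 1))

# hypothetical helper: reverse code one column with the match() trick
reverse_code <- function(x) {
  old_vals <- sort(unique(x))
  rev(old_vals)[match(x, old_vals)]
}

# sapply() simplifies its result to a matrix
as_matrix <- sapply(toy, reverse_code)
is.matrix(as_matrix)

# assigning into as_df[] replaces the columns but keeps the data.frame
as_df <- toy
as_df[] <- lapply(toy, reverse_code)
is.data.frame(as_df)

# system.time() is an alternative to subtracting two Sys.time() values
system.time(sapply(toy, reverse_code))
```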
To complete your attendance assignment, create a post on I/O Learning’s Blackboard Discussion Board for week 2.
Answer the following questions:
recode_values) would not work?
Feel free to reach out if you have any questions:
email: sfyffe@masonlive.gmu.edu