It’s Time to Stop Being a Script Kitty

09/01/2020

Bio

3rd-year PhD Student
I’ve been using R for about 6 years
Developed 3 Packages
Written ~60,000 lines of R code
Began learning R outside of Academia
I’m teaching an undergrad R class; follow along here
Always willing to help those interested in learning

What to Expect

To have fun
To cover an array of topics (i.e., beginner to advanced)
To create something useful
To not be judged by your level of knowledge

Agenda

(Re)Conceptualizing R
The Second Curve
User-defined Functions
Indexing
Example: A function to recode a single value
- Arguments
- Validation
Exercise: A function to recode multiple values

Conceptualizing R

Programming is a skill, not an ability
R is a programming language, not a software package
There are multiple ways to solve a problem, just as there are multiple ways to write a sentence

The Second Curve

A Humbling Experience

rm(list=ls())

library(ggplot2)
library(extrafont, quietly = TRUE)
library(gridExtra)
library(grid)
library(gtable)
library(dplyr)
source('M:/Applications/R/Workspace/General/get_data_functions.R')
source('U:/Work/Reports/Spring Meeting 2018/spring_meeting_functions_pvi.R')



certs <- c("RDCS","RDMS","RPVI","RMSKS","RMSK","RVT")
status <- c("Granted","Revoked","Permanently Revoked")


df <- get_certifications(start_date = "2013-12-31", end_date = "2018-05-02")

df$Year <- lubridate::year(as.Date(df$DateGranted))
df <- subset(df, CertificationTitle %in% certs & CertificationStatus %in% status & Year 
             >= "2014" & Year != "2018")

Functions and Scripts: An Analogy

Imagine trying to cook something using ingredients in pre-packaged meals

Or

Editing reports that are PDFs and not DOCX

Scripts are not meant to be reused nor interacted with

Functions: Intro

Functions are the verbs of the programming world
Scripts are just a series of functions with fixed inputs, meaning a person shouldn’t be interacting with them
Functions are ideal when you are doing the same thing over and over under varying circumstances
Functions reduce redundancy in your work; you only need to update code in one place
Unlike script development, when creating functions you are concerned with a variety of inputs and situations
In R, we can write our own functions, known as User-Defined Functions

Functions: Parts

Functions consist of 4 basic parts:

Function Name
Arguments
Function Body
Return Value
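
As a minimal sketch, here are the four parts labelled in a function that actually runs. The name rectangle_area and its arguments are made up for illustration and are not part of the talk.

```r
# Function name: rectangle_area (hypothetical, for illustration only)
# Arguments: width (required) and height (optional, defaults to 1)
rectangle_area <- function(width, height = 1) {
  # Function body: the work done with the arguments
  area <- width * height
  # Return value: what the caller gets back
  return(area)
}

rectangle_area(3, 4)      # both arguments supplied
rectangle_area(width = 5) # height falls back to its default
```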

Analogy

Think of washing clothes as the function name; dirty clothes, detergent amount, wash time, and washer setting as a few possible arguments; the act of washing as the function body; and clean clothes as the return value.

wash_clothes <- function(dirty_clothes,
                         detergent_amount = 100, 
                         wash_time = 60,
                         washer_setting = "delicates") {
  ...
  return(clean_clothes)
}

# inputs can change
wash_clothes(dirty_clothes = 
               c("jeans","jorts", "wallet"))

# so you're not repeating yourself
wash_clothes(dirty_clothes = rep("shirt", 7), 
             washer_setting = "cottons")

Think of the arguments as inputs that the function takes back to its workshop to produce something nice for us.

say_hello <- function(person, and_goodbye = F) {
  res <- paste0("Hello ", person, collapse = "\n")
  if(and_goodbye) {
    res <- paste0(res, "\n" , paste0("Goodbye ", person, collapse = "\n"))
  }
  return(cat(res))
}

# R will search its parent environment, so be careful when naming those arguments
person <- "Shea"
say_hello(person, F)

## Hello Shea

# Just hello to one "person"
say_hello("Adele")

## Hello Adele

# And Goodbye
say_hello("Adele", T)

## Hello Adele
## Goodbye Adele

# Just hello to many people
say_hello(c("It's me", "Can you hear me?", "from the other side", "from the outside"))

## Hello It's me
## Hello Can you hear me?
## Hello from the other side
## Hello from the outside

# Even numbers
say_hello(1:5, T)

## Hello 1
## Hello 2
## Hello 3
## Hello 4
## Hello 5
## Goodbye 1
## Goodbye 2
## Goodbye 3
## Goodbye 4
## Goodbye 5

Indexing: By Position

Indexing is fundamental to programming; it allows us to subset and filter data. In R, we can index data using the basic format vector1[index_vector]. No matter how large vector1 is, indexing by (positive) position always returns data the same length as the index_vector inside the [].

y <- seq(10, 100, by = 10)
# if we want to look at the first element
y[1]

## [1] 10

# or the last 6 elements
y[5:10]

## [1]  50  60  70  80  90 100

# we can also repeat things
y[c(1,1,1,3,3,3)]

## [1] 10 10 10 30 30 30
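
The length rule can be checked directly. The sketch below reuses y from above; idx is a made-up index vector chosen only for illustration.

```r
y <- seq(10, 100, by = 10)

# hypothetical index vector: four positions, one of them repeated
idx <- c(2, 4, 4, 9)
picked <- y[idx]
picked

# the result has as many elements as the index vector, not as y
length(picked) == length(idx)

# asking for a position that does not exist gives NA rather than an error
y[12]
```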

Indexing: By Logical Statement

You can also index by logical vectors; this is called filtering. A logical (Boolean) vector is a vector that contains only TRUE and FALSE values.

# Notice what changes
y[TRUE]

##  [1]  10  20  30  40  50  60  70  80  90 100

# R recycles a short logical index to the length of y, unlike positional indexing
y[c(TRUE, FALSE)]

## [1] 10 30 50 70 90

# we can use a condition to build the logical vector
greater_than_40 <- y > 40

y[greater_than_40]

## [1]  50  60  70  80  90 100
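
A short sketch of what a logical index contains and how it relates to positions. The condition and the name is_mid are chosen here only for illustration.

```r
y <- seq(10, 100, by = 10)

# a comparison returns one TRUE/FALSE per element of y
is_mid <- y > 30 & y <= 70
is_mid

# TRUE counts as 1, so sum() tells us how many elements pass the filter
sum(is_mid)

# which() converts the logical vector into the matching positions
which(is_mid)

# filtering by the logical vector and by those positions give the same result
identical(y[is_mid], y[which(is_mid)])
```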

Functions: An Example

Since we work with tabular data, let’s simulate a data.frame (i.e., a dataset), but let’s do it really fast: 200 columns by 300 rows.

# just so we are all doing the same thing
set.seed(46)
# 100 items on a 5-point scale (replicate() returns a 300 x 100 matrix)
likert5 <- replicate(100, sample(1:5, 300, replace = TRUE), simplify = TRUE)
# 100 items on a 3-point scale
likert3 <- replicate(100, sample(1:3, 300, replace = TRUE), simplify = TRUE)

#combine em
sim_data <- data.frame(likert5, likert3)

View data

X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
2	5	1	1	2	2	2	4	1	3
1	2	3	2	5	1	2	3	5	4
4	2	5	4	2	2	1	1	3	2
4	5	2	3	2	4	3	2	3	2
1	4	5	3	1	1	3	2	1	4
4	1	4	5	1	1	5	5	1	2

Recoding data is a standard activity in the social sciences. Imagine recoding the 200 variables we just simulated within a script.
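
To make that concrete, here is roughly what the script approach looks like on a toy data frame (toy is a stand-in made up for illustration). With 200 columns the same line would have to be copied 200 times, and any change to the rule would have to be made in 200 places.

```r
# toy data standing in for sim_data (three columns instead of 200)
toy <- data.frame(X1 = c(2, 1, 4), X2 = c(5, 2, 1), X3 = c(1, 3, 5))

# script style: the same line copied once per column
toy$X1[toy$X1 == 1] <- -99
toy$X2[toy$X2 == 1] <- -99
toy$X3[toy$X3 == 1] <- -99

toy
```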

Instead, why don’t we make our own function to do just that? One that we can use regardless of the dataset.

This gets at the idea of extensibility. It’s intuitive to distinguish between data A and data B, especially when you think about the details, but a reusable function should not depend on those details.

If you close your eyes and hold your nose, apples and potatoes aren’t that different.

Tip

Focus on single-use (i.e., if you chase two rabbits, you’ll catch neither). Don’t make a function that recodes values, plots the data, and makes you coffee.

Start simple.

Let’s think about replacing just one value in one column.

Functions: An Example Source Code

What arguments would we need?

x: the column or vector of values
old_value: a value in x to be replaced
new_value: a value to replace the old_value

recode_value <- function(x, old_value, new_value) {
  # let's check where x is equal to the old value
  old_val_indx <- x == old_value
  
  # now we can tell R to look WITHIN x and replace the values where the old_value was found
  x[old_val_indx] <- new_value
  
  # now let's return the updated x
  return(x)
}

Let’s test it out

old	new
2	2
1	-99
4	4
4	4
1	-99
4	4
5	5
3	3
4	4
4	4
4	4
3	3
4	4
3	3
3	3
3	3
3	3
2	2
5	5
5	5
5	5
3	3
3	3
4	4
1	-99
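
The talk does not show the code behind this table. One way to build such a side-by-side comparison is sketched below, assuming the first ten values of the column are typed in by hand and using a condensed copy of recode_value so the block runs on its own.

```r
# condensed copy of recode_value so this block is self-contained
recode_value <- function(x, old_value, new_value) {
  x[x == old_value] <- new_value
  x
}

# the first ten values of the first column, copied from the table above
old <- c(2, 1, 4, 4, 1, 4, 5, 3, 4, 4)

# put the original and recoded values next to each other
comparison <- data.frame(old = old, new = recode_value(old, 1, -99))
comparison
```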

Sharing is Caring

Possible User Error

# accidentally using a character vector instead of a numeric vector
recode_value(rep(c("A", "B"), 150), 1, -99)

# specifies an old value that isn't in x
recode_value(sim_data[,1], -1, -99)

Possible User Error Results

When a character vector is used instead of a numeric vector

# accidentally using a character vector instead of a numeric vector
recode_value(rep(c("A", "B"), 150), 1, -99)

##   [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [19] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [37] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [55] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [73] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
##  [91] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [109] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [127] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [145] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [163] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [181] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [199] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [217] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [235] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [253] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [271] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [289] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"

When no old value is found in the vector to be recoded (i.e., x)

# specifies an old value that isn't in x
recode_value(sim_data[,1], -1, -99)

##   [1] 2 1 4 4 1 4 5 3 4 4 4 3 4 3 3 3 3 2 5 5 5 3 3 4 1 5 4 5 1 5 2 3 5 2 5 4 1
##  [38] 2 3 1 2 2 3 4 2 3 4 3 3 2 3 5 4 4 5 4 1 4 3 3 4 3 5 3 4 1 5 5 4 1 1 4 2 3
##  [75] 4 1 2 2 3 5 1 1 3 2 5 5 2 4 4 4 2 3 5 4 3 4 1 3 4 3 1 2 5 1 4 5 5 5 4 5 1
## [112] 4 1 5 2 2 1 5 4 1 4 1 5 5 4 4 5 3 3 2 1 1 3 2 1 1 3 3 2 1 5 2 2 5 5 4 1 2
## [149] 3 3 1 3 3 2 2 1 1 1 2 4 3 5 1 4 4 2 1 3 1 1 4 1 1 2 1 2 2 4 4 5 3 4 2 4 5
## [186] 5 1 4 5 5 4 4 5 2 1 3 5 2 5 1 5 1 3 5 3 4 5 1 3 4 5 3 2 2 4 3 3 5 5 5 5 2
## [223] 2 3 3 4 2 4 4 4 3 3 1 5 1 3 4 1 3 1 2 5 3 3 3 2 5 2 2 1 4 4 2 5 4 3 5 1 2
## [260] 3 5 3 4 1 1 3 4 5 2 3 4 5 5 2 2 3 2 1 2 3 5 1 4 5 4 2 1 1 1 1 4 3 2 2 4 4
## [297] 3 4 2 5
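
Why do both calls run without complaint? A small sketch of what happens inside the comparison x == old_value (the short vectors here are made up for illustration):

```r
# comparing a character value with a number coerces the number to "1"
"A" == 1
"1" == 1

# so the index is all FALSE and nothing is replaced, without any error
x <- c("A", "B", "A")
any(x == 1)

# the same happens when the old value is simply absent from a numeric vector
any(c(2, 1, 4) == -1)
```

In both cases the function returns x unchanged, which is easy to miss.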

Meet: Validation

Let’s add some checks to validate the arguments so they’re what we want

recode_value <- function(x, old_value, new_value) {
  # make sure everything is numeric
  tryCatch(
    stopifnot(is.numeric(x), is.numeric(old_value), is.numeric(new_value)),
    error = function(err) {
      err$message <- "all arguments must be numeric"
      stop(err)
    }
  )
  # check for argument length
  if (length(old_value) != 1 || length(new_value) != 1) {
    stop("old_value and new_value must be a single value")
  }
  # let's check where x is equal to the old value
  old_val_indx <- x == old_value

  # check to see if old_value wasn't found
  if (!any(old_val_indx)) {
    warning("old_value not found in x...returning x")
    return(x)
  }

  # now we can tell R to look WITHIN x and replace the values where the old_value was found
  x[old_val_indx] <- new_value

  # now let's return the updated x
  return(x)
}

Let’s Try again

When a character vector is used instead of a numeric vector

## Error in recode_value(rep(c("A", "B"), 150), 1, -99): all arguments must be numeric

When no old value is found in the vector to be recoded (i.e., x)

## Warning in recode_value(sim_data[, 1], -1, -99): old_value not found in
## x...returning x

##   [1] 2 1 4 4 1 4 5 3 4 4 4 3 4 3 3 3 3 2 5 5 5 3 3 4 1 5 4 5 1 5 2 3 5 2 5 4 1
##  [38] 2 3 1 2 2 3 4 2 3 4 3 3 2 3 5 4 4 5 4 1 4 3 3 4 3 5 3 4 1 5 5 4 1 1 4 2 3
##  [75] 4 1 2 2 3 5 1 1 3 2 5 5 2 4 4 4 2 3 5 4 3 4 1 3 4 3 1 2 5 1 4 5 5 5 4 5 1
## [112] 4 1 5 2 2 1 5 4 1 4 1 5 5 4 4 5 3 3 2 1 1 3 2 1 1 3 3 2 1 5 2 2 5 5 4 1 2
## [149] 3 3 1 3 3 2 2 1 1 1 2 4 3 5 1 4 4 2 1 3 1 1 4 1 1 2 1 2 2 4 4 5 3 4 2 4 5
## [186] 5 1 4 5 5 4 4 5 2 1 3 5 2 5 1 5 1 3 5 3 4 5 1 3 4 5 3 2 2 4 3 3 5 5 5 5 2
## [223] 2 3 3 4 2 4 4 4 3 3 1 5 1 3 4 1 3 1 2 5 3 3 3 2 5 2 2 1 4 4 2 5 4 3 5 1 2
## [260] 3 5 3 4 1 1 3 4 5 2 3 4 5 5 2 2 3 2 1 2 3 5 1 4 5 4 2 1 1 1 1 4 3 2 2 4 4
## [297] 3 4 2 5
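
The length check is not exercised by the two calls above. A minimal sketch of triggering it follows; check_single is a hypothetical helper that holds only that one check, and tryCatch() is used to capture the message instead of halting.

```r
# hypothetical helper containing only the length check from recode_value
check_single <- function(old_value, new_value) {
  if (length(old_value) != 1 || length(new_value) != 1) {
    stop("old_value and new_value must be a single value")
  }
  invisible(TRUE)
}

# passing two old values trips the check; tryCatch keeps the script running
msg <- tryCatch(
  check_single(c(1, 2), -99),
  error = function(err) conditionMessage(err)
)
msg
```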

Recode Multiple Values

But our function isn’t that efficient…just one value at a time? We should be able to do multiple values. Let’s look at our old code and change some things. Starting with the arguments. Let’s name them old_values and new_values, so it makes sense to others. Additionally we can keep our validator to make sure everything is numeric.

recode_values <- function(x, old_values, new_values) {
    # make sure everything is numeric
    tryCatch(
      stopifnot(is.numeric(x), is.numeric(old_values), is.numeric(new_values)),
           error = function(err) {
             err$message <- "all arguments must be numeric"
             stop(err)
             })
    # we need to change things here
    ...
  # now lets return the updated x
  return(x)
}

Recode Multiple Values: Match

There are many ways to create this function (we could even call recode_value over and over again), but that isn’t efficient. Luckily, R has a function that helps us match multiple values in a vector. As an example, let me show you just one column.

# assign one example column to an object
first_column <- sim_data[,1]

# find 2, 4, and 5 in that column
match(first_column, c(2,4,5))

##   [1]  1 NA  2  2 NA  2  3 NA  2  2  2 NA  2 NA NA NA NA  1  3  3  3 NA NA  2 NA
##  [26]  3  2  3 NA  3  1 NA  3  1  3  2 NA  1 NA NA  1  1 NA  2  1 NA  2 NA NA  1
##  [51] NA  3  2  2  3  2 NA  2 NA NA  2 NA  3 NA  2 NA  3  3  2 NA NA  2  1 NA  2
##  [76] NA  1  1 NA  3 NA NA NA  1  3  3  1  2  2  2  1 NA  3  2 NA  2 NA NA  2 NA
## [101] NA  1  3 NA  2  3  3  3  2  3 NA  2 NA  3  1  1 NA  3  2 NA  2 NA  3  3  2
## [126]  2  3 NA NA  1 NA NA NA  1 NA NA NA NA  1 NA  3  1  1  3  3  2 NA  1 NA NA
## [151] NA NA NA  1  1 NA NA NA  1  2 NA  3 NA  2  2  1 NA NA NA NA  2 NA NA  1 NA
## [176]  1  1  2  2  3 NA  2  1  2  3  3 NA  2  3  3  2  2  3  1 NA NA  3  1  3 NA
## [201]  3 NA NA  3 NA  2  3 NA NA  2  3 NA  1  1  2 NA NA  3  3  3  3  1  1 NA NA
## [226]  2  1  2  2  2 NA NA NA  3 NA NA  2 NA NA NA  1  3 NA NA NA  1  3  1  1 NA
## [251]  2  2  1  3  2 NA  3 NA  1 NA  3 NA  2 NA NA NA  2  3  1 NA  2  3  3  1  1
## [276] NA  1 NA  1 NA  3 NA  2  3  2  1 NA NA NA NA  2 NA  1  1  2  2 NA  2  1  3

What is happening?

match() looks at each element (i.e., value) in the column one by one
Then it tells you the position at which it finds that value in the second argument, which is c(2,4,5) in our example
For example, the first value match() returned was a 1, meaning the value it matched is the first number in c(2,4,5)
If match() doesn’t find the column value in the look-up values (i.e., in our example, anything that isn’t 2, 4, or 5), it returns an NA

# let's look at a simplified example
some_letters <- rep(LETTERS[1:5], 5)

some_letters

##  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D"
## [20] "E" "A" "B" "C" "D" "E"

# get indices of matches
match(some_letters, c("E","A"))

##  [1]  2 NA NA NA  1  2 NA NA NA  1  2 NA NA NA  1  2 NA NA NA  1  2 NA NA NA  1

# from a different perspective
df <- data.frame(letter = some_letters,
             indx = match(some_letters, c("E","A")), 
             is_E = some_letters == "E",
             is_A = some_letters == "A")

head(df,10)

##    letter indx  is_E  is_A
## 1       A    2 FALSE  TRUE
## 2       B   NA FALSE FALSE
## 3       C   NA FALSE FALSE
## 4       D   NA FALSE FALSE
## 5       E    1  TRUE FALSE
## 6       A    2 FALSE  TRUE
## 7       B   NA FALSE FALSE
## 8       C   NA FALSE FALSE
## 9       D   NA FALSE FALSE
## 10      E    1  TRUE FALSE
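
The indx column also ties back to logical indexing: a value is in the look-up vector exactly when match() does not return NA. A short sketch using base R's %in% operator (the names indx and found are chosen here for illustration):

```r
some_letters <- rep(LETTERS[1:5], 5)

# match() answers "where in the look-up vector?"; NA means "nowhere"
indx <- match(some_letters, c("E", "A"))

# %in% answers "is it in the look-up vector at all?"
found <- some_letters %in% c("E", "A")

# the two agree: a value is found exactly when match() is not NA
identical(found, !is.na(indx))
```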

Question

If the code match(first_column[1:5], c(2,4,5)) returned

## [1] 1 2 2 3 1

What would be the first 5 values in first_column?

Let’s check…

# Hypothetical example
c(2,4,5)[c(1,2,2,3,1)]

## [1] 2 4 4 5 2

An epiphany!

We can take the values returned from match and use them to index a smaller vector.

One last example to bring it home…

So we have our first_column. For this example, let’s make it a bit smaller, just the first 10 values.

first_column_10 <- first_column[1:10]

And we can match it against the possible values of our column, 1 to 5. The result is known as an index.

match(first_column_10, c(1,2,3,4,5))

##  [1] 2 1 4 4 1 4 5 3 4 4

We can use this index to create a vector based on new values (given they are the same length as the values we mapped to)

# find the index of old values
indx_of_old_values <- match(first_column_10, c(1,2,3,4,5))

# index the new values to "replace" them
c(10,20,30,40,50)[indx_of_old_values]

##  [1] 20 10 40 40 10 40 50 30 40 40
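
What matters is that each old value lines up with its new value, not that the old values are sorted. A small sketch, with the ten values typed in by hand and the reordered pairs (old_b, new_b) made up for illustration:

```r
first_column_10 <- c(2, 1, 4, 4, 1, 4, 5, 3, 4, 4)

# original pairing: 1 -> 10, 2 -> 20, ..., 5 -> 50
old_a <- c(1, 2, 3, 4, 5)
new_a <- c(10, 20, 30, 40, 50)

# the same pairs, listed in a different order
old_b <- c(5, 3, 1, 4, 2)
new_b <- c(50, 30, 10, 40, 20)

res_a <- new_a[match(first_column_10, old_a)]
res_b <- new_b[match(first_column_10, old_b)]

res_a
identical(res_a, res_b)
```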

Exercise Recode Values

recode_values <- function(x, old_values, new_values) {
    # make sure everything is numeric
    tryCatch(
      stopifnot(is.numeric(x), is.numeric(old_values), is.numeric(new_values)),
           error = function(err) {
             err$message <- "all arguments must be numeric"
             stop(err)
             })
  # What validation could we use?
  # ?
  # ?
  # ?
  
  # find the old values in x
  old_value_index <- match(x, old_values)
  # now let's index the new values and return the updated x
  x <- new_values[old_value_index]
  return(x)
}
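
One possible answer to the validation question, offered as a sketch rather than the talk's own solution: check that the old and new values pair up one-to-one. The helper name check_pairs and its messages are made up for illustration.

```r
# hypothetical helper: checks that old_values and new_values pair up one-to-one
check_pairs <- function(old_values, new_values) {
  if (length(old_values) != length(new_values)) {
    stop("old_values and new_values must be the same length")
  }
  if (anyDuplicated(old_values) > 0) {
    stop("old_values must not contain duplicates")
  }
  invisible(TRUE)
}

# a valid pairing passes silently
check_pairs(c(1, 2, 3), c(3, 2, 1))

# a mismatch is reported instead of silently producing wrong output
tryCatch(
  check_pairs(c(1, 2, 3), c(3, 2)),
  error = function(err) conditionMessage(err)
)
```

Checks like these could sit where the question marks are in the exercise; other checks are possible.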

One last magic trick

How long do you think it would take you to reverse code each of the 200 columns? Or even subtract 1 from them? Maybe add 1? Let’s use the function we made to do these things.

We can time ourselves.

t0 <- Sys.time()
# reverse code
rev_sim_data <- sapply(sim_data, function(x) {
  old_vals <- sort(unique(x))
  x <- recode_values(x, old_vals, rev(old_vals))
})
# add 1
add1_sim_data <- sapply(sim_data, function(x) {
  old_vals <- sort(unique(x))
  x <- recode_values(x, old_vals, old_vals + 1)
})

# subtract 1
sub1_sim_data <- sapply(sim_data, function(x) {
  old_vals <- sort(unique(x))
  x <- recode_values(x, old_vals, old_vals - 1)
})
#how much time did it take
final_time <- Sys.time() - t0

Time difference of 0.1130021 secs

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  2  5  1  1  2  2  2  4  1   3
2  1  2  3  2  5  1  2  3  5   4

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  4  1  5  5  4  4  4  2  5   3
[2,]  5  4  3  4  1  5  4  3  1   2

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  3  6  2  2  3  3  3  5  2   4
[2,]  2  3  4  3  6  2  3  4  6   5

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,]  1  4  0  0  1  1  1  3  0   2
[2,]  0  1  2  1  4  0  1  2  4   3
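
The [1,] row labels in the output above show that sapply() returned matrices rather than data.frames. If a data.frame is wanted back, one common base R idiom is to assign lapply() output into the existing columns. A sketch on a toy data frame (toy and the add-one function are stand-ins for illustration):

```r
# toy data frame standing in for sim_data
toy <- data.frame(X1 = c(2, 1, 4), X2 = c(5, 2, 1))

# sapply() simplifies the per-column results into a matrix
as_matrix <- sapply(toy, function(x) x + 1)
is.matrix(as_matrix)

# assigning lapply() output into toy_plus1[] keeps the data.frame structure
toy_plus1 <- toy
toy_plus1[] <- lapply(toy, function(x) x + 1)
is.data.frame(toy_plus1)
toy_plus1
```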

For those in I/O Learning

To complete your attendance assignment, create a post on I/O Learning’s Blackboard Discussion Board for week 2.

Answer the following questions:

Is there an activity related to your school-life or work-life where you feel that programming (in R) can help? Briefly support your response.
Can you think of two examples/situations where the function we created (recode_values) would not work?
Feel free to give any feedback on the talk (as harsh as you want to be)

Takeaways

R (or really any programming language) does not make you great, it makes you greater
Programming is just like writing
- Most people can write
- Writing “well” involves problem-solving and computational thinking
You must make your journey your own
You don’t fall on top of mountains, you climb them
A mastery of R means that you are not writing code for yourself but for others

Thanks!

Feel free to reach out if you have any questions:

email: sfyffe@masonlive.gmu.edu

https://linkedin.com/in/sheafyffe/

https://github.com/Shea-Fyffe
