WPA #8: Custom Functions

WPA Hints

For this WPA, you will frequently use the paste() function. The function takes numeric and string arguments and combines them into one string. Here is an example of paste() in action (for more information, see the help page with ?paste):

a <- 5
b <- 10
paste("The number a is", a, "and the number b is", b)

## [1] "The number a is 5 and the number b is 10"
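By default, paste() puts a single space between its arguments. The sep argument changes that separator, which is handy when you need to control punctuation yourself. Here is a minimal sketch (the variable ship is made up for illustration):

```r
ship <- "Black Pearl"   # hypothetical example value

# Default: arguments are joined with a single space
paste("The ship is", ship)

## [1] "The ship is Black Pearl"

# sep = "" joins the arguments with nothing in between,
# so every space has to be written out by hand
paste("The ship is ", ship, ".", sep = "")

## [1] "The ship is Black Pearl."
```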

You will also be doing logical indexing a LOT! Here’s a quick reminder on how it works.

x <- c(1, 4, -2, 1, 4, -5, 10)

# logical vector of positive values
pos.log <- x > 0
pos.log

## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

# Now use the logical vector to create a vector of ONLY positive values from x

x.pos <- x[pos.log]  # Give me the values of x for which pos.log is TRUE
x.pos

## [1]  1  4  1  4 10
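Logical vectors can also be combined with & (and) and | (or), and counted with sum(). The short sketch below reuses the same x to show both ideas:

```r
x <- c(1, 4, -2, 1, 4, -5, 10)

# Values of x that are positive AND less than 5
x[x > 0 & x < 5]

## [1] 1 4 1 4

# Values of x that are below 0 OR above 5
x[x < 0 | x > 5]

## [1] -2 -5 10

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(x > 0)

## [1] 5
```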

Question 1. Captain Jack is convinced that he can predict how much gold he will find on an island with the following equation: (a * b) - c * 324 + log(a), where a is the area of the island in square meters, b is the number of trees on the island, and c is how drunk he is on a scale of 1 to 10. Create a function called Jacks.Equation that takes a, b, and c as arguments and returns Captain Jack’s predictions. Test your function for an island with an area of 1,000 square meters that contains 30 trees when Jack is at a 7 on a drunkenness scale.

Jacks.Equation <- function(a, b, c) {
  
  return(a * b - c * 324 + log(a))
  
}


Jacks.Equation(a = 1000, b = 30, c = 7)

## [1] 27738.91

Question 2. Write a function called standardize.me that takes a vector x as an argument, and returns a vector that standardizes the values of x (standardization means subtracting the mean and dividing by the standard deviation).

standardize.me <- function(x) {
  
  output <- (x - mean(x)) / sd(x)
  
  return(output)
  
}

Try your function on this vector. You should get the following result:

data <- c(6, 3, 8, 6, 3, 2, 3, 2, 100)
standardize.me(data)

## [1] -0.2740789 -0.3677514 -0.2116305 -0.2740789 -0.3677514 -0.3989756
## [7] -0.3677514 -0.3989756  2.6609937
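One way to sanity-check a standardizing function is to confirm that its output has a mean of 0 and a standard deviation of 1. The sketch below repeats the function definition so that it runs on its own; the name z.scores is made up for illustration.

```r
standardize.me <- function(x) {
  (x - mean(x)) / sd(x)
}

data <- c(6, 3, 8, 6, 3, 2, 3, 2, 100)
z.scores <- standardize.me(data)

# The mean should be 0, up to floating-point rounding error
abs(mean(z.scores)) < 1e-10

## [1] TRUE

# The standard deviation should be 1
all.equal(sd(z.scores), 1)

## [1] TRUE
```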

Question 3. Write a function called how.many that takes two arguments (data and value). The function should return a value indicating how many times the element value occurs in the vector data.

how.many <- function(data, value) {
  
  output <- sum(data == value)
  
  return(output)
  
}

Try running your function with these arguments:

how.many(data = c(1, 1, 9, 3, 2, 1, 1), value = 1)

## [1] 4

how.many(data = c(1, 1, 9, 3, 2, 1, 1), value = -100)

## [1] 0
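One caveat: if data contains missing values, data == value is NA for those elements, and sum() then returns NA. A possible variant, here called how.many.narm (a hypothetical name, not part of the assignment), ignores missing values with na.rm = TRUE:

```r
how.many.narm <- function(data, value) {

  # na.rm = TRUE drops the NA comparisons before counting
  sum(data == value, na.rm = TRUE)

}

how.many.narm(data = c(1, NA, 1, 3), value = 1)

## [1] 2
```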

Question 4. You will often need to recode values in a dataset. For example, if you have a survey of age data, you may want to convert any crazy values (like anything below 0 or above 100) to NA. Let’s create a function to do this in R. Write a function called recode.numeric() with 3 arguments: x, lb, and ub. We’ll assume that x is a numeric vector. The function should look at the values of x, convert any values below lb or above ub to NA, and then return the resulting vector.

recode.numeric <- function(x, lb, ub) {
  
  y <- x
  
  y[x < lb] <- NA
  y[x > ub] <- NA
  
  return(y)
  
}

Here is the function in action:

recode.numeric(x = c(5, 6, -10, 2, 1000, 2), lb = 0, ub = 100)

## [1]  5  6 NA  2 NA  2

Here is a dataset with the results of a survey containing 3 questions. The answer to every question should be between 1 and 10. Create a new dataframe called survey.fixed that contains the original data but with all invalid values converted to NA.

survey <- data.frame(
                     id = 1:6,
                     q1 = c(6, 2, 5, -1, 11, 100),
                     q2 = c(-5, 4, 65, 3, 7, 6),
                     q3 = c(2, 1, 2, 45, 5, -5)
                     )

survey

##   id  q1 q2 q3
## 1  1   6 -5  2
## 2  2   2  4  1
## 3  3   5 65  2
## 4  4  -1  3 45
## 5  5  11  7  5
## 6  6 100  6 -5

survey.fixed <- survey

survey.fixed$q1 <- recode.numeric(survey$q1, lb = 1, ub = 10)
survey.fixed$q2 <- recode.numeric(survey$q2, lb = 1, ub = 10)
survey.fixed$q3 <- recode.numeric(survey$q3, lb = 1, ub = 10)

survey.fixed

##   id q1 q2 q3
## 1  1  6 NA  2
## 2  2  2  4  1
## 3  3  5 NA  2
## 4  4 NA  3 NA
## 5  5 NA  7  5
## 6  6 NA  6 NA
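Recoding each column by hand makes copy-and-paste slips easy. As an alternative sketch, lapply() can apply the same function to several columns at once. The block repeats the definitions so that it runs on its own; the names survey.fixed2 and question.cols are made up for illustration.

```r
recode.numeric <- function(x, lb, ub) {
  y <- x
  y[x < lb | x > ub] <- NA
  return(y)
}

survey <- data.frame(
  id = 1:6,
  q1 = c(6, 2, 5, -1, 11, 100),
  q2 = c(-5, 4, 65, 3, 7, 6),
  q3 = c(2, 1, 2, 45, 5, -5)
)

question.cols <- c("q1", "q2", "q3")   # columns to recode (id is left alone)

survey.fixed2 <- survey

# lapply() calls recode.numeric() once per column, passing lb and ub along
survey.fixed2[question.cols] <- lapply(survey[question.cols],
                                       recode.numeric, lb = 1, ub = 10)

survey.fixed2

##   id q1 q2 q3
## 1  1  6 NA  2
## 2  2  2  4  1
## 3  3  5 NA  2
## 4  4 NA  3 NA
## 5  5 NA  7  5
## 6  6 NA  6 NA
```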

Question 5. Now, write recode.factor(), which does the same thing as recode.numeric(), except with categorical (e.g., factor) values. recode.factor() should have three arguments: x, old, and new. old is the vector of old values that will be replaced, and new is the vector of replacement values. For now, assume that old and new always have length 3.

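recode.factor <- function(x, old, new) {

  # Work on a character copy so new values can be assigned even if x is a factor
  y <- as.character(x)

  # Compare against the original x so an earlier replacement is never recoded again
  y[x == old[1]] <- new[1]
  y[x == old[2]] <- new[2]
  y[x == old[3]] <- new[3]

  return(y)
}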

Here is the function in action. Let’s say you’re throwing a party for a group of 10 young pirates. You ask each one what they would like to drink and store the results in a vector called orders. Here is the vector:

orders <- c("coke light", 
            "coke", 
            "pepsi", 
            "coke", 
            "coke light", 
            "water", 
            "pepsi", 
            "pepsi light", 
            "water", 
            "water")

Now, unfortunately, your ship is sponsored by Pepsi, so you don’t have any Coke products. You’ll need to replace all the Coke products with Pepsi products. For example, “coke” should become “pepsi”, and “coke light” should become “pepsi light”. Also, because you don’t want the party to be too lame, you want to replace all the orders for “water” with “pepsi max”.

Now use your function to fix the orders!

new.orders <- recode.factor(x = orders,
              old = c("coke light", "coke", "water"),
              new = c("pepsi light", "pepsi", "pepsi max")
              )

new.orders

##  [1] "pepsi light" "pepsi"       "pepsi"       "pepsi"       "pepsi light"
##  [6] "pepsi max"   "pepsi"       "pepsi light" "pepsi max"   "pepsi max"
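The assumption that old and new always have length 3 can be relaxed. One possible approach, sketched below under the hypothetical name recode.any, uses match() to look up each value's position in old and then picks the replacement from the same position in new. Values that do not appear in old are left unchanged. The example values "rum" and "grog" are made up.

```r
recode.any <- function(x, old, new) {

  y <- as.character(x)

  # For each element of y, its position in old (NA if it is not in old)
  position <- match(y, old)
  found <- !is.na(position)

  # Replace only the values that were found in old
  y[found] <- new[position[found]]

  return(y)
}

recode.any(x = c("coke", "water", "rum", "coke light"),
           old = c("coke light", "coke", "water", "rum"),
           new = c("pepsi light", "pepsi", "pepsi max", "grog"))

## [1] "pepsi"       "pepsi max"   "grog"        "pepsi light"
```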

Question 6. Write a function called madlib that takes three strings as arguments and returns the following sentence with the arguments inserted:

“If you talk to an ADJECTIVE pirate like NAME, you may find that he/she spends more time talking about PLURALNOUN than the pirate arts.”

Your three arguments to the function should be:

adjective, a string indicating an adjective
name, a string of a person’s name
plural.noun, a string indicating plural noun

madlib <- function(adjective, name, plural.noun) {
  
  output <- paste("If you talk to an ", adjective, " pirate like ", name,
                  ", you may find that he/she spends more time talking about ",
                  plural.noun, " than the pirate arts.", sep = "")
  
  return(output)
  
}

Try your function with the arguments “hipster”, “Bruce”, and “kale”:

madlib("hipster", "Bruce", "kale")

## [1] "If you talk to an hipster pirate like Bruce, you may find that he/she spends more time talking about kale than the pirate arts."

Question 7. Write a function called remove.outliers that takes a vector as an argument, determines which values of the vector are outliers, and returns a vector with the outliers removed. Define an outlier as any value that is more than 2 standard deviations below the mean, or more than 2 standard deviations above the mean.

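remove.outliers <- function(x) {
  
  # TRUE for values more than 2 standard deviations away from the mean
  out.log <- x > (mean(x) + 2 * sd(x)) | x < (mean(x) - 2 * sd(x))
  
  # Keep only the values that are NOT outliers
  output <- x[!out.log]
  
  return(output)
  
}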

When you are finished writing the function, run these commands to make sure your function works:

data <- c(rep(1, 50), -529484903)
remove.outliers(data)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Question 8. Write a function called ttest.apa that takes a numeric vector as an argument, conducts a one-sample t-test on that vector with t.test(), and returns a string summarizing the test in APA style. Your function should have 3 arguments:

x, a vector of data
null, the population mean under the null hypothesis
p.critical, the critical value for determining significance

ttest.apa <- function(x, null, p.critical) {
  
  test.result <- t.test(x, mu = null)
  
  # Round only for display; the decision below uses the unrounded p-value
  test.statistic <- round(test.result$statistic, 2)
  df <- round(test.result$parameter, 2)
  p.value <- round(test.result$p.value, 2)
  
  if (test.result$p.value <= p.critical) {
    
    output <- paste("A one sample t-test was significant (t(", df, ") = ", test.statistic, ", p = ", p.value, "). We reject the null hypothesis that the true population mean is ", null, sep = "")
    
  } else {
    
    output <- paste("A one sample t-test was non-significant (t(", df, ") = ", test.statistic, ", p = ", p.value, "). We fail to reject the null hypothesis that the true population mean is ", null, sep = "")
    
  }
  
  return(output)
  
}

Once you have programmed your function, test it on these data:

data <- seq(-10, 10, 1)

ttest.apa(x = data, null = 0, p.critical = .05)

## [1] "A one sample t-test was non-significant (t(20) = 0, p = 1). We fail to reject the null hypothesis that the true population mean is 0"

ttest.apa(x = data, null = 6, p.critical = .05)

## [1] "A one sample t-test was significant (t(20) = -4.43, p = 0). We reject the null hypothesis that the true population mean is 6"
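Reporting "p = 0" is a side effect of rounding to two decimals. APA style usually reports very small p-values as "p < .001" and drops the leading zero. As a rough sketch of how that could be done, here is a small helper under the hypothetical name apa.p (not part of the assignment):

```r
apa.p <- function(p) {

  # Very small p-values are reported as an inequality
  if (p < .001) {
    return("p < .001")
  }

  # sprintf() keeps three decimals; sub() drops the leading zero
  p.string <- sub("^0", "", sprintf("%.3f", p))

  return(paste("p =", p.string))
}

apa.p(0.00026)

## [1] "p < .001"

apa.p(0.25)

## [1] "p = .250"
```

The returned string could be pasted into the sentence in place of the "p = " piece if you want the stricter format.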

Question 9. Create a function called three.var.scatter that takes three vectors of numeric data, x, y, and z, and creates a scatterplot where the vector z is used in constructing the plot. Here are the arguments:

x, y, z: Three vectors of numerical data. x and y are the data for the x and y axes, while z can be used for the size and color of the plotting points.
point.size.log: A logical value indicating whether or not to make the size of the points depend on the vector z (hint, use cex)
point.col.log: A logical value indicating whether or not to make the color of the points depend on the vector z (hint, use gray())
mean.lines.log: A logical value indicating whether or not to add lines showing the means of x and y
regression.log: A logical value indicating whether or not to add a regression line

three.var.scatter <- function(x, y, z, 
                              point.size.log, point.col.log, mean.lines.log, regression.log, ...) {

  # Rescale z to the range [0, 1] so it maps onto valid point sizes and gray levels
  # (this assumes z contains at least two different values)
  z.scaled <- (z - min(z)) / (max(z) - min(z))
  
  # cex must be positive, so point sizes run from 0.5 to 2.5
  if (point.size.log) {point.cex <- 0.5 + 2 * z.scaled} else {point.cex <- 1}
  
  # gray() needs numeric levels between 0 and 1, so the default is a mid-gray of 0.5
  if (point.col.log) {point.col <- z.scaled} else {point.col <- 0.5}
  
  # Additional arguments in ... (e.g. main, xlab) are passed on to plot()
  plot(x, y, cex = point.cex, pch = 16, col = gray(point.col, alpha = .5), ...)
  
  if (mean.lines.log) {
    
    abline(h = mean(y), lty = 2, col = gray(.8))
    abline(v = mean(x), lty = 2, col = gray(.8))
    
  }
  
  if (regression.log) {abline(lm(y ~ x), lwd = 2, col = gray(.5))}
  
}

Here are some examples of three.var.scatter in action. Try them on your function!

x <- rnorm(1000)
y <- x + rnorm(1000)
z <- rnorm(1000)

three.var.scatter(x, y, z, 
                  point.size.log = T, 
                  point.col.log = T, 
                  mean.lines.log = T, 
                  regression.log = T)