Welcome back! Today, we’re going to learn about functions. These notes are in depth to be sure everyone can learn these skills no matter where you are in your understanding.
Let’s start our lecture as we always do: load some packages. Packages, as you’ll recall, are functions someone has written for R. For us today, we’ll be working with…
Fire these suckers up
library(pacman)
p_load(tidyverse, magrittr, ggplot2)
Before we talk about functions in R, we should first make sure we understand what a function is in general. You all are probably used to thinking of something like:
But what really does this even mean? Why is this a useful tool?
It takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). We can represent the function \(f\) above as:
Number -> ADifferentNumber where that arrow is something we call a mapping. A mapping links one object to another. This doesn’t have to be two numbers, it could be a word and a dataframe, or two words. For instance, I could come up with a function called color labeler and then pass it some object, which it will map to a color.
color_labeler(banana) = yellow
Knowing all of this, we need to go back to lecture 1 to remind ourselves about R-facts before we make a function.
name, and everything is an objectinputs/arguments and outputsWith these in mind, let’s write a function that will take some \(x\), and spit out \(x^2 + 10\). That is, let’s write the code for \(f(x)=x^2+10\).
Everything in R needs a name That includes our function.
We do this by starting with a name, setting it equal to a ‘function()’ function.
In general, this function looks like:
function([some_set_of_arguments]){your function code}
Let’s look at an example below.
#function() is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit. Think of the parentheses here as the toys you're giving your computer to play with in the sandbox.
squarePlusten = function(x){
#tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into something else.
x_squaredten = x^2+10
#Now, in order to make use of this value, we need our function to spit something out. We do this with another special function, 'return()'. This will stop your function and tell it to spit out whatever object is in the parentheses. In a sense, it's the toy your computer gives back to the rest of your workspace.
return(x_squaredten)
}
notice however: x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined in the context of your written function.
Try running this, it won’t work
# x_squaredten
Now, let’s see what our function can do here.
#check for 10. should be 110
squarePlusten(10)
## [1] 110
A couple things to note
Brackets around functions lets R know where the function starts and ends.
Technically we do not need them for one line functions
Return: this tells the function what it should return for us. We don’t hold onto any temporary objects created while your function runs, though the function can modify objects outside of itself.
Remember, functions don’t just change numbers to numbers. A more accurate way to think of functions is as a map from one object to another. The starting point is the argument, and the end point is the output. The functions take an object as an input For example, we can write a function that takes a vector, square all the elements, and then return a plot. In this case, (lets call our function g) does the following:
Vector -> Plot
We’re going to need a new tool for this which I will get to.
sapply()So, what should our code do?
Our first step is to figure out all of our moving parts. Here, we need some set of numbers. We can use vectors with the c() command!
Lets store them in a vector called x: x = c(1,4,9)
Now, I want a function to return a plot with all of the (x,x^2 = y) pairs affiliated. We can use a special function called sapply() which lets us perform a function a whole bunch of times.
sapply() is cool. The sapply() function takes list, vector or data frame as input and gives output in vector or matrix.
This is how it works: sapply(some_vector,function) will take every element in the vector and apply your function to it. We can use this to run a function a whole ton of times all at once.
We will use sapply(), which takes three arguments, an input, an argument to build a new set of values.
x = c(1,4,9) # Build a vector, as we discussed above
plot_sq = function(x){ # Now we can use the `sapply()` function as discussed above.
y = sapply(x, function(x) x^2)
return(plot(x=x,y=y)) # Specify return() to be the plot
}
plot_sq(1:3) # Check that it works
We just built our own plotting function! Pretty exciting
Note that we could have also written the same thing with a for loop. What is a for loop? the definition is actually hinted at in the name: the logical flow here is ‘for something i in a list/vector, do argument x.’ A better source:
From wikipedia:
A for-loop has two parts: a header specifying the iteration, and a body which is executed once per iteration
How do I use for loops in everyday tasks? Let say I need to do some task that simple/tedious but I need to do it 1000 times. For example:
Let’s say I want to find the average between each observation (row) and the two observations (rows) above and below and save it to a new variable. One way I can do this is to manual code for each row and use Ctrl + C and Ctrl + V 1000 times. Or I can just write a loop that will iterate down all rows and do it in a fraction of the time.
Writing a loop will save you hours. They will be your best friend, this is an important thing to learn. Not just for R, but for all coding languages. However, they can be hard to write if you have never done it before.
Here is an example of how to write a basic for loop:
x = c() # Build an empty vector
for (i in 1:10){ # for object i from 1 to 10
x[i]= 10/i # Divide 10 by the numbers 1 through 10.
}
x
## [1] 10.000000 5.000000 3.333333 2.500000 2.000000 1.666667 1.428571
## [8] 1.250000 1.111111 1.000000
i is, for each loop, storing the value in your sequence (here, it’s a number in 1:10) then performing the operation you defined in the loop. Like a function, a for loop defines its “body” by setting the start point with a { and an end point with a } (<– curly braces).
Always make sure that you complete your curly braces!! Always makes sure in curly braces, the code is tabbed to the right!
But, what is this doing? Let’s do this manually, but for i in 1:3
For i = 1
x_slow = c() # Start with an empty vector
i = 1
x_slow[i] = 10/i
x_slow
## [1] 10
For i = 2
i=i+1
x_slow[i] = 10/i
x_slow
## [1] 10 5
For i = 3
i=i+1
x_slow[i] = 10/i
x_slow
## [1] 10.000000 5.000000 3.333333
Really though, we can loop over any object we want. Maybe we want to loop over some weird sequence, say c(2,300,-4,6) and the loop will perform some set of operations based on it. For instance,
z = 0
for (i in c(2,300,-4,6)) {
print(paste0(i, ' plus ', z, ' equals'))
z = z + i
print(z)
}
## [1] "2 plus 0 equals"
## [1] 2
## [1] "300 plus 2 equals"
## [1] 302
## [1] "-4 plus 302 equals"
## [1] 298
## [1] "6 plus 298 equals"
## [1] 304
Lets rewrite our function plot_sq with a for loop inside it
plot_sqf = function(x){
y=c() # Initialize y
for (i in 1:length(x)) {
y[i] = x[i]^2 # For each i in the vector i:length, update the ith value in y with:
}
return(plot(x=x,y=y)) # Return the plot
}
plot_sqf(1:10) # Plot squared values between 1-10
Understanding the loop
for (i in 1:length(x)): i does what is called “indexing.” My goal for this function is to “loop” through all elements of x, and fill a vector called y with the squared elements of x. Hence the loop needs to terminate at the final element of x.
length(x) gives the # of rows x has.y[i] = x[i]^2 remember how we index arrays in R: row by column. Thus for each i=1,2.., the iterator fills row i in y with the squared element of the ith row of x.Let’s take a look at the plot and make sure it is the same as our last version with sapply()
plot_sqf(1:3)
Great! But that graph looked horrible. How should we spruce it up? Great question, that is something we will talk more about, in great detail, at a later date (see ggplot2).
You don’t have to code this, but I want to give you guys some intuition for part 1f on your problem set. In particular, what does an unbiased & inconsistent vs. biased and consistent estimator look like? Why would we want one or another?
To do this, I’m going to simulate my own data. We’re going to build one column of a normal variable. But, we’ll be using this functionality a LOT. So let’s make a function to generate this dataframe.
data_gen <- function(n) {
return(data.frame(
v1 = rnorm(n,0,2)
)
)
}
Now, let’s create an inconsistent but unbiased estimator.
What exactly does inconsistent mean? What about unbiased? Let’s try to estimate the mean of v1
df <- data_gen(20)
head(df, 10)
We also are going to generate a lot of these dataframes. Let’s make a second function
# First create an empty object - you will see why we need this
est1 = c()
iter <- function(itr, n) {
for (i in 1:itr) {
data<-data_gen(n)
est1[i] = mean(sample(data$v1,1))
}
return(est1)
}
This function created an estimator that randomly picks one observation of a variable and uses that to estimate the mean.
Does our estimator fit the bill? Is it unbiased? Not consistent? Let’s plot it across several N to get a better idea. We can use ggplot here, but use some of its more advanced functions to do so. Ggplot objects are built in layers, where each layer can be a different section of the graph. We’ll also use cowplot which lets us build an array of graphs.
Lets load in a cool package: ‘ggthemes’
p_load(ggthemes)
We will use this package to make sweet graphs using the following code and our generated data.
#n = 1
p1 <- ggplot(,aes(iter(500,1))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0) +
geom_vline(xintercept = mean(iter(500,1)), color = 'red')
#n = 2
p2 <- ggplot(,aes(iter(500,2))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter(500,2)), color = 'red')
#n=4
p3 <- ggplot(,aes(iter(500,4))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter(500,4)), color = 'red')
#n = 1000
p4 <- ggplot(,aes(iter(2000,1000))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter(500,1000)), color = 'red')
#let's plot these guys together.
p_load(cowplot)
plot_grid(p1, p2, p3, p4,
labels = c("n = 1", "n = 2", "n = 4", 'n = 1000'),
ncol = 2, nrow = 2)
now, to hurry things along, let’s do our other option. Consistent but biased. That means, as n-> infinity, we are converging on beta, but E(estimator) isn’t equal to our E(estimated). What kind of function would do that? Well…
est2 = c()
iter2 <- function(itr, n) {
for (i in 1:itr) {
data<-data_gen(n)
est2[i] = mean(data$v1)+6/n
}
return(est2)
}
Now let’s look at this new estimator at different n
#n = 1
p1 <- ggplot(,aes(iter2(500,1))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0) +
geom_vline(xintercept = mean(iter2(500,1)), color = 'red')
#n = 2
p2 <- ggplot(,aes(iter2(500,2))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter2(500,2)), color = 'red')
#n=4
p3 <- ggplot(,aes(iter2(500,4))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter2(500,4)), color = 'red')
#n = 1000
p4 <- ggplot(,aes(iter2(500,1000))) +
geom_density(color = 'darkred') +
theme_fivethirtyeight() +
geom_vline(xintercept = 0)+
geom_vline(xintercept = mean(iter2(500,1000)), color = 'red')
#let's plot these guys together.
library(cowplot)
plot_grid(p1, p2, p3, p4,
labels = c("n = 1", "n = 2", "n = 4", 'n = 1000'),
ncol = 2, nrow = 2)
So, now that you’ve seen this, which option would you choose? The choice is more complicated than you think, so definitely spend some time on this. The homework question itself is mostly about how much thought you put into your answer, but this question actually has a great deal of value if you see yourself going into data anlaysis in the future.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.