Week 5 Lab Notes

Welcome to week 5! Today, we’re going to learn about functions.

Let’s start our lecture as we always do: load some packages. Packages, as you’ll recall, are functions someone has written for R. For us today, we’ll be working with… - ggplot2 to make amazing data visualizations, - tidyverse to help manipulate data, -pacman which makes managing all of these packages so much simpler.

##Lesson 0: packages

if you haven’t already, you can install pacman with the following command: install.packages("pacman",dependencies = T,repos = "http://cran.us.r-project.org")

library(pacman)
p_load(tidyverse, magrittr, ggplot2, ggthemes)

Lesson 1: Function Basics

Before we talk about functions in R, we should first make sure we understand what a function is in general. You all are probably used to thinking of something like:

     `f(x)= x^2`

But what really does this even mean? Why is this a useful tool? - It takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). We can represent the function \(f\) above as: Number -> ADifferentNumber where that arrow is something we call a mapping. A mapping links one object to another. This doesn’t have to be two numbers, it could be a word and a dataframe, or two words. For instance, I could come up with a function called color labeler and then pass it some object, which it will map to a color.

color_labeler(banana) = yellow

Where color_labeler is the name of the function, that maps fruits(inputs) to colors(outputs)

Knowing all of this, we need to go back to lecture 1 to remind ourselves about R-facts before we make a function.

  • Everything in R has a name, and everything is an object
  • Functions have inputs/arguments and outputs

With these in mind, let’s write a function that will take some x, and spit out x^2 + 10. That is, let’s write the code for f(x)=x^2+10.

Everything in R needs a name That includes our function.
We do this by starting with a name, setting it equal to a ‘function()’ function.

In general, this function looks like function([some_set_of_arguments]){your function code}. Let’s look at an example below.

function() is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit. Think of the parentheses here as the toys you’re giving your computer to play with inthe sandbox.

squarePlusten = function(x){
  #tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into
  #something else.
  x_squaredten = x^2+10
  #Now, in order to make use of this value, we need our function to spit something out. 
  #We do this with another special function, 'return()'. This will stop your function and tell it to spit out
  #whatever object is in the parentheses. In a sense, it's the toy your computer gives back to the rest of your 
  #workspace.
  return(x_squaredten)
}

notice however: x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined in the context of your written function.

x_squaredten

Now, let’s see what our function can do here.

#check for 10. should be 110
squarePlusten(10)
## [1] 110

A couple things to note

  • Brackets around functions lets R know where the function starts and ends.

  • Technically we do not need them for one line functions

  • Return: this tells the function what it should return for us. We don’t hold onto any temporary objects created while your function runs, though the function can modify objects outside of itself.

Remember, functions don’t just change numbers to numbers. A more accurate way to think of functions is as a map from one object to another. The starting point is the argument, and the end point is the output. The functions take an object as an input.

For example, we can write a function that takes a vector, square all the elements, and then return a plot. In this case, (lets call our function g) does the following:

Vector -> F -> Plot

We’re going to need a new tool for this …

Sapply()

So, what should our code do? - Take a vector, say c(1,2,3) - Square each element of the vector. note that we will need a way to store these results.

Our first step is to figure out all of our moving parts. Here, we need some set of numbers. We can use vectors with the c() command!

Lets store them in a vector called x: x = c(1,4,9)

Now, I want a function to return a plot with all of the (x,x^2 = y) pairs affiliated. We can use a special function called sapply() which lets us perform a function a whole bunch of times.

sapply() is cool. This is how it works: sapply(some_vector,function) will take every element in the vector and apply your function to it. We can use this to run a function a whole ton of times all at once.

#build a vector, as we discussed above
x = c(1,4,9)
#we'll use the sapply, which takes three arguments, an input, an argument to build a new set of values.
plot_sq = function(x){
  #now we can use the sapply function as discussed above.
  y = sapply(x, function(x) x^2)
  #return the plot
  return(plot(x=x,y=y))
}
#check that it works
plot_sq(1:3)

We just built our own plotting function! Pretty exciting

Lesson 2: For Loops

Note that we could have also written the same thing with a for loop. What is a for loop?

the definition is actually hinted at in the name: the logical flow here is ‘for something in a list, do an argument I provide.’ A better source:

From wikipedia:

A for-loop has two parts: a header specifying the iteration, and a body which is executed once per iteration

Here is an example of how to write a basic for loop:

#we will build an empty vector x, which we will use a for loop to fill in
x=c()
#starting our loop, we're saying that i is an object in 1-10.
for (i in 1:10){
  #dividing 10 by the numbers 1 through 10.
  x[i]= 10/i
}
#let's look at what this makes
x
##  [1] 10.000000  5.000000  3.333333  2.500000  2.000000  1.666667  1.428571
##  [8]  1.250000  1.111111  1.000000

i is, for each loop, storing the value in your sequence (here, it’s a number in 1:10) then performing the operation you defined in the loop. Like a function, a for loop defines its “body” by setting the start point with a { and an end point with a }.

But, what is this doing? Let’s do this manually, but for i in 1:3

# start with i equals 1, but first we need an empty vector to fill in.
x_slow = c()
i = 1
x_slow[i] = 10/i
x_slow
## [1] 10

Now, we update i(i = i+1) and repeat the above

#look now at i = 2. 
i=i+1
x_slow[i] = 10/i
x_slow
## [1] 10  5

Again, we update i and repeat

#look now at i = 3
i=i+1
x_slow[i] = 10/i
x_slow #and so forth
## [1] 10.000000  5.000000  3.333333

Really though, we can loop over any object we want. Maybe we want to loop over some weird sequence, say c(2,300,-4,6) and the loop will perform some set of operations based on it. For instance,

z = 0
for (i in c(2,300,-4,6)) {
      print(paste0(i, ' plus ', z, ' equals'))
      z = z + i
      print(z)
      }
## [1] "2 plus 0 equals"
## [1] 2
## [1] "300 plus 2 equals"
## [1] 302
## [1] "-4 plus 302 equals"
## [1] 298
## [1] "6 plus 298 equals"
## [1] 304

Lets rewrite our function plot_sq with a for loop inside it

plot_sqf = function(x){
    #initialize y:
    y=c()
    #start our for loop. We're looping over every object in the vector x, so we want to iterate a number of times
    #equal to the length of x so we don't miss any.
    for (i in 1:length(x)) {
       #for each i in the vector i:length, update the ith value in y with:
       y[i] = x[i]^2
       }
      #return the plot
       return(plot(x=x,y=y))
}

#plot squared values between 1-10
plot_sqf(1:10)

Understanding the loop

Let’s look at this closer

for (i in 1:length(x)): i does what is called “indexing.” My goal for this function is to “loop” through all elements of x, and fill a vector called y with the squared elements of x. Hence the loop needs to terminate at the final element of x.

  • length(x) gives the # of rows x has.
  • y[i] = x[i]^2 remember how we index arrays in R: row by column. Thus for each i=1,2.., the iterator fills row i in y with the squared element of the ith row of x.

Let’s take a look at the plot and make sure it is the same as our last version with sapply()

(plot_sqf(1:3))
## NULL
plot_sq(1:3)

Great! But that graph looked horrible. How should we spruce it up? We could use ggplot, but we learned that already, so let’s skip a bit. Mostly, we need to learn how to plot time series.

Let’s make a basic ggplot plot using the starwars data. This data has information on star wars characters. First, let’s see what information we have here.

names(starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"

Now lets plot height of characters on mass. If you don’t remember your ggplot formula, go review lab 2. In general, we can plot points this way:

ggplot(starwars,aes(x=height,y=mass)) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

We have a solid dataset…. but what’s that way up on top. Over 1000 kilos??

Who is the outlier? Any guesses?

  • If you guessed Jabba, nice work!

Now lets plot it WITHOUT Jabba and make the graph look a bit nicer. This will require a few more layers. We’ll filter our data so that we only keep observations with mass less than 1000, and maybe we want a line of best fit. So let’s do that.

#So, we need some layers. Let's filter to get rid of Jabba. The theme layer will let us make things prettier
ggplot(starwars %>% filter(mass<1000),aes(x=height,y=mass))+ 
         geom_point()+geom_smooth(method = "lm", col = "red") +
         labs(x= "Mass", y="Height") + theme_minimal()

Great! Now lets return to our function from before and make the plots we produce look a bit nicer, using ggplot.

plot_sq2 = function(x){
 #can do this with a for loop as well. sapply applies the function x^2 to all elements of x
y = sapply(x, function(x) x^2)
#return the plot
return(ggplot()+geom_line(aes(x=x,y=y)))
 }
                                                         
 #This will make it so our graph is indeed a connected line.
#check that it works
plot_sq2(-20:20) 

#nice!!

Back to metrics: Simulating time series data

We will now return to econometrics. Why did we learn about all of this? To set up a virtual lab- a simulation! Why do we care about simulations?

They allow us to mess around with model parameters when we know we don’t need to worry about data, because we generated it!

So we will write a function that simulates the model:

y_t = alpha_o+alpha_1 * y_{t-1} + u_t, u_t ~ N(0, sigmasq)

How do we do this? We can use the tools we have just talked about. This takes a few steps.

1.) Draw T values of u, where T is the number of periods we simulate 2.) Set some starting point for y 3.) Generate y data with a for loop, which lets us keep its time-dependent status.

we set up a function that takes an initial value for y, some alphas, a std deviation parameter and a number for our total number of observations

#we start with a function, passing in all of the parameters we'll need. See above for what all these are.
ts_model = function(y_init,a_0,a_1, sigma, T){
  #our error
  u = rnorm(T,mean =0, sd=sigma )
  y = c()
  #set first observation to our provided level
  y[1]= y_init
  #loop through with for
  for (i in 1:T){
    y[i] = a_0+a_1*y[i-1]+u[i]
  }
  #finally, return y
  return(y)
  
  
  #let's create a dataframe, declare 3 variables. We haven't done this before, but we can use the data.frame() function to create a dataframe, where each column is separated by columns.
  ts_data = data.frame(
    time = c(1:T), 
    model =  y)
  return(ts_data)
  #and don't forget to close your function off with a }!
}

ts_model(10,1,.02,.22,22)

It didn’t work! What happened? Look at the the sequence we are iterating over: 1:T. What would the first element of y be here?

y[1] = a_0+y[0]+u[1] What is y[0]? R has no idea

R starts indexing at 1. This is unusual for coding languages and can trip you up if you don’t keep it in mind. So, y[0] is not defined. This should be an easy fix.

ts_model2 = function(y_init,a_0,a_1, sigma, T){
  #our error
  u = rnorm(T,mean =0, sd=sigma )
  y = c()
  #set first observation to our provided level
  y[1]= y_init
  #loop through with for
  for (i in 2:T){
    y[i] = a_0+a_1*y[i-1]+u[i]
  }
  #finally, return y
  #return(y) We comment this out so our function returns a data frame rather than y
  
  
  #let's create a dataframe, declare 3 variables. We haven't done this before, but we can use the data.frame() function to create a dataframe, where each column is separated by columns.
  ts_data = data.frame(
    time = c(1:T), 
    model =  y)
  return(ts_data)
  #and don't forget to close your function off with a }!
}
#let's test our data!
mean(ts_model2(10,1,.02,.22,22)$model)
## [1] 1.458959

Now, using ggplot, which we talked about, we can pass the ts_model2() function we made, and generate a graph of the process!

ggplot(aes(x = time, y = model), data = ts_model2(3,.39, .87, .04, 100)) +  
       geom_line(col = 'red') +
       theme_minimal()

A quick reminder about AR models: - Before you jump into plotting this, you need to be careful. What happens when \(|a_1|>=1\)?

#make a_1 equal to 1 in ts_model2()
ggplot(aes(x = time, y = model), data = ts_model2(4,.1, 1, 10, 100)) +  
       geom_line( col = 'red') +
       theme_minimal()

It doesn’t really stay in place…

This is also a problem when a = -1. It looks a LOT different though:

 #make a_1 equal to -1 in ts_model2()
ggplot(aes(x = time, y = model), data = ts_model2(4,.1, -1, 10, 100)) +  
     geom_line( col = 'red') +
     theme_minimal()

Let’s turn it up to 11!!!

#make a_1 equal to 11 in ts_model2()
ggplot(aes(x = time, y = model), data = ts_model2(4,.1, 11, 10, 100)) +  
            geom_line( col = 'red') +
            theme_minimal()

  • These processes are non-stationary. In other words, it will diverge over time, which means it’s hard to predict into the future and may not be a realistic representation of a process.

  • The great thing about simulations is you can see how different parameter values impact your model

  • Sometimes changing a parameter can change the output of your model in a not obvious way

  • I just gave you guys an example below, but mess around with the parameters.

Another important ability: how can we save this graph for future use?

png('#FILE PATH HERE\file_name.png')

we can clear all plots if our workspace gets messy with dev.off()

We can also just screenshot from plots or use the save option in the plot window if you prefer.

ADL Simulation

You guys learned about this in class, but what does ADL stand for?

Autocorrelated Distributed Lag. Let’s walk through these terms.

What does Autocorrelated mean? Autocorrelated: - Auto: self, directed from within (Greek). - Correlated. Correlated: Mirriam-Webster: a phenomenon that accompanies another phenomenon, is usually parallel to it, and is related in some way to it

So in total, this means an ADL model is in some way self-related. In practice, you guys will use autocorrelation to be purely related to time, that is, objects are related to recent observations. However, autocorrelation can also show up in other ways, for instance, in spatial autocorrelation where data that is collected from places close to one another are correlated.

Distributed Lag: This is describing some variable x where the impact of x on our outcome y happens over several periods, that is, X_t at t=T and X_t at t=T-1 will BOTH effect y_T. As an example:

Building new roads, for instance, may have an effect today. Maybe we believe that opening some new roads will reduce congestion. However, it’s possible that drivers take some time to really start using these roads efficiently, so those new roads will also impact commutes next year too. Let’s think of some way of simulating this data, where t is in months, and x will be thousands of dollars invested in roads.

Let’s simulate an adl(1,1) model. That means one ‘lag’ of y and one ‘lag’ of x. Let’s write this out

y_t = b0 + a1y_(t-1) + b1x_t + b2x_(t-1) + u_t

to build a simulation function, we’re going to need to build each of these variables. Before we build the function in full, let’s just build y_1.

First, we need our model parameters

Step 1: We need to know where our commute time starts. Let’s say in our very first month, commutes were an hour on average.

Note:our commute times will be in minutes

#at time 0, our commutes are an hour, ie, they are 60 minutes long
y_0 <- 60

#Step 2: Let’s set a constant: for us, we’ll expect every year will increase our commutes by 10 minutes, all else equal and aside from whatever our commute time was last period

#commute times are increasing by 10 minutes
b0 <- 10

Step 3: However, we know that next month’s average commute is PROBABLY related to last month’s commute. To model this, we need an ‘a1’ coefficient, which will determine how any month’s commute will change a future month’s commute

a1 <- 5/6 #5/6 of last month's commute will impact this month's commute

Step 4: We need to figure out the effect of investment in roads on our commute times.

Let’s say this month, a thousand dollars invested in roads reduces average commute times by 20 seconds.

#-1/2 of a minute corresponds to a decrease of 30 seconds for every 1000 dollars increase (ie, when X increases by 1.)
b1 <- -1/3

and next month, every thousand dollars will decrease average commute times by 10 seconds as people adjust their paths

b2 <- -1/6

Step 5: Now we need our data.

x_0, x_1, and u are what we need to calculate, but do simulate these we need to make some assumptions on the data. We will assume that dollars spent on roads is not correlated over time. So X0 and X1 will be two draws from a normal distribution. We get to pick the parameters for this. Let’s say on average we spend 10,000 dollars or 10*1000 dollars (ie, the mean for x will be 10,)

rnorm takes 3 arguments, number of draws(creates a vector of this size), a mean, and a standard deviation (sd)

x <- rnorm(2,mean = 10, sd = 2.5)

#x <- rep(0,2)
#to check that we made 2 xs here,
length(x)
## [1] 2

great! Now we do the same thing to generate our error u. Let’s say our error is mean 0, standard deviation 10. Our error only needs 1 draw, and has a standard deviation of 10 minutes.

u <- rnorm(1,mean = 0, sd = 10)

Now we get to make y_1! What does this look like? Let’s look

**y_t = b0 + a1y_(t-1) + b1x_t + b2*x_(t-1) + u_t**

So we can use this to create y_1.

y_1 = b0 + a1*y_0 + b1*x[2] + b2*x[1] + u
#let's look at our results: how many minutes have we shaved off commute times on average?
print(paste0('Commutes have fallen by ', y_0 - y_1, ' minutes.'))
## [1] "Commutes have fallen by 12.0859341111107 minutes."
#after investing
print(paste0('After investing ', sum(x)*1000, ' dollars in roads'))
## [1] "After investing 22392.8393529744 dollars in roads"

Great! next week we’ll build a for loop that does all of this over a longer period to examine this process. I hope your week goes well and good luck on the midterm!