Create a lagged variable conditional on another variable

Author

Tom Hanna, Ph.D. Candidate, University of Houston, Department of Political Science

The problem

I have a panel data set of countries that joined an organization at varying dates, denoted by a variable member where 0 is not a member and 1 is a member. I have good theoretical reason to believe that there is a selection effect based on some of my dependent variables of interest. For example, I am interested in the effect of joining on the countries ideology, but I know that those with left wing ideologies to start with are more likely to join. I want to account for this at least in part by taking into account their initial ideology.

To do this, I want to create new variables, d in the example, for the existing value of certain variables that I think may have a selection effect, in the example c, during the year prior to joining. For years before they joined, the value of d would simply equal the current year’s actual value of c. For countries that never join, d will also equal c.

A simple lag won’t work, because they don’t all join at the same time and because for years prior to joining I want d = c. 

Here is a sample data setup. (Bear in mind that the actual “c” variables aren’t anywhere near this regular, these are just sequences of numbers for the sake of example. They don’t necessarily run in ascending order. Think, for example, GDP, export-import totals, or population values as potential variables.):

country <- c("cuba","cuba","cuba","cuba","cuba","cuba","cuba",
       "venezuela","venezuela","venezuela","venezuela","venezuela",
       "venezuela","venezuela","honduras","honduras","honduras",
       "honduras","honduras","honduras","honduras",
       "nicaragua","nicaragua","nicaragua","nicaragua","nicaragua",
       "nicaragua","nicaragua","belize","belize","belize","belize",
       "belize","belize","belize")

year <- c(2004:2010,2004:2010,2004:2010,2004:2010,2004:2010)
member <- c(0,1,1,1,1,1,1,0,1,1,1,1,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,
           0,0,0,0)
c <- c(1:7,2:8,3:9,0:6,4:10)
data <- data.frame(country,year,member,c)
head(data, n = 35L)
     country year member  c
1       cuba 2004      0  1
2       cuba 2005      1  2
3       cuba 2006      1  3
4       cuba 2007      1  4
5       cuba 2008      1  5
6       cuba 2009      1  6
7       cuba 2010      1  7
8  venezuela 2004      0  2
9  venezuela 2005      1  3
10 venezuela 2006      1  4
11 venezuela 2007      1  5
12 venezuela 2008      1  6
13 venezuela 2009      1  7
14 venezuela 2010      1  8
15  honduras 2004      0  3
16  honduras 2005      0  4
17  honduras 2006      1  5
18  honduras 2007      1  6
19  honduras 2008      1  7
20  honduras 2009      1  8
21  honduras 2010      0  9
22 nicaragua 2004      0  0
23 nicaragua 2005      0  1
24 nicaragua 2006      1  2
25 nicaragua 2007      1  3
26 nicaragua 2008      1  4
27 nicaragua 2009      1  5
28 nicaragua 2010      1  6
29    belize 2004      0  4
30    belize 2005      0  5
31    belize 2006      0  6
32    belize 2007      0  7
33    belize 2008      0  8
34    belize 2009      0  9
35    belize 2010      0 10

This is how it should look when done:

     country year member  c  d
1       cuba 2004      0  1  1
2       cuba 2005      1  2  1
3       cuba 2006      1  3  1
4       cuba 2007      1  4  1
5       cuba 2008      1  5  1
6       cuba 2009      1  6  1
7       cuba 2010      1  7  1
8  venezuela 2004      0  2  2
9  venezuela 2005      1  3  2
10 venezuela 2006      1  4  2
11 venezuela 2007      1  5  2
12 venezuela 2008      1  6  2
13 venezuela 2009      1  7  2
14 venezuela 2010      1  8  2
15  honduras 2004      0  3  3
16  honduras 2005      0  4  4
17  honduras 2006      1  5  4
18  honduras 2007      1  6  4
19  honduras 2008      1  7  4
20  honduras 2009      1  8  4
21  honduras 2010      0  9  4
22 nicaragua 2004      0  0  0
23 nicaragua 2005      0  1  1
24 nicaragua 2006      1  2  1
25 nicaragua 2007      1  3  1
26 nicaragua 2008      1  4  1
27 nicaragua 2009      1  5  1
28 nicaragua 2010      1  6  1
29    belize 2004      0  4  4
30    belize 2005      0  5  5
31    belize 2006      0  6  6
32    belize 2007      0  7  7
33    belize 2008      0  8  8
34    belize 2009      0  9  9
35    belize 2010      0 10 10

Solution:

Because I use the plm package to run linear panel models with two way fixed effects, I use it here. The lag() function in plm has the advantage that it automatically groups the lags based on the first index variable. So, year 1 of the second country uses the values from that country rather than using the last year of the previous country. Other lag functions don’t do this without the added step of using something like the tidyverse group_by() function.

library(plm)

data <- pdata.frame(data, index = c("country","year"))


data$c_lag <- lag(data$c)
data$d <- NA

data$d <- ifelse(lag(data$member == 0) & data$member == 0,data$c,
                 ifelse(lag(data$member == 0) & data$member == 1, data$c_lag, lag(data$d)))


for (i in 1:7) {
        data$d <- ifelse(is.na(lag(data$d)) == 1,data$d, ifelse(data$member == 0,lag(data$c),lag(data$d)))
}
        


#drop the c_lag that is no longer needed
data <- data[c(1:4,6)]



head(data, n = 35L)
                 country year member  c  d
belize-2004       belize 2004      0  4 NA
belize-2005       belize 2005      0  5  5
belize-2006       belize 2006      0  6  5
belize-2007       belize 2007      0  7  6
belize-2008       belize 2008      0  8  7
belize-2009       belize 2009      0  9  8
belize-2010       belize 2010      0 10  9
cuba-2004           cuba 2004      0  1 NA
cuba-2005           cuba 2005      1  2  1
cuba-2006           cuba 2006      1  3  1
cuba-2007           cuba 2007      1  4  1
cuba-2008           cuba 2008      1  5  1
cuba-2009           cuba 2009      1  6  1
cuba-2010           cuba 2010      1  7  1
honduras-2004   honduras 2004      0  3 NA
honduras-2005   honduras 2005      0  4  4
honduras-2006   honduras 2006      1  5  4
honduras-2007   honduras 2007      1  6  4
honduras-2008   honduras 2008      1  7  4
honduras-2009   honduras 2009      1  8  4
honduras-2010   honduras 2010      0  9  8
nicaragua-2004 nicaragua 2004      0  0 NA
nicaragua-2005 nicaragua 2005      0  1  1
nicaragua-2006 nicaragua 2006      1  2  1
nicaragua-2007 nicaragua 2007      1  3  1
nicaragua-2008 nicaragua 2008      1  4  1
nicaragua-2009 nicaragua 2009      1  5  1
nicaragua-2010 nicaragua 2010      1  6  1
venezuela-2004 venezuela 2004      0  2 NA
venezuela-2005 venezuela 2005      1  3  2
venezuela-2006 venezuela 2006      1  4  2
venezuela-2007 venezuela 2007      1  5  2
venezuela-2008 venezuela 2008      1  6  2
venezuela-2009 venezuela 2009      1  7  2
venezuela-2010 venezuela 2010      1  8  2

The only issue now will be with the country that left - Honduras. Since only two countries in the full data set did this, I will fix them manually.