Coding for Social Science

Workshop: Using Time-Lagged Variables to Model Solar Farm Adoption in Japan

Click here to run the code for this workshop yourself on RStudio.Cloud!

Some longitudinal studies use time-lagged variables, but what do we mean by a time-lagged variable? Lagging refers to when we investigate the effect of a past outcome (eg. COVID-19 cases last week) on a future outcome (eg. COVID-19 cases this week). Many processes in social life are path dependent, meaning what happened in the past constrains the future outcomes we may experience. Let’s use an example of solar adoption to explore this.

This example comes from a study I did with my terrific student coauthors Lily Cunningham and Amos Nasongo. We investigated why some communities saw greater adoption of rooftop solar than others in the months and years following the Fukushima disaster, which ushered in a new era or renewable energy promotion policies. Below, I’m going to demo three helpful techniques for analyzing effects over time:

Dealing with Right Skewed Rates
Lagged Variables
Likelihood Ratio Test
Simulation

0. Import Data

In this workshop, we’ll use the data.frame jp, which is a matched sample of 43 monthly observations on 147 Japanese municipalities, where some experienced disaster deaths or damage from the 2011 tsunami and earthquake (disaster == 1), while others did not (disaster == 0). This creates a total dataset of 6,321 city-month observations.

Load Data

In this dataset, each row is a city-month!

library(tidyverse) # for data wrangling
library(texreg) # for charts
library(lmtest) # Likelihood ratio test
library(broom) # model tables
library(gtools) # significance
library(Zelig) # simulation
library(viridis) # colors

# Save the diamonds dataset as an object in my environment
jp <- read_csv("jp_solar.csv") %>%
  select(muni_code, date, year, solar_rate, disaster, pop) %>%
  arrange(muni_code, date)

View Data

# View first 3 rows of dataset
jp %>% head(3)

muni_code	date	year	solar_rate	pop
02204	2014-05-25	2014	0.0583363	34284
02204	2014-06-25	2014	0.0291681	34284
02204	2014-07-25	2014	0.0000000	34284

Codebook

In this dataset, our variables mean:

solar_rate: rate of new rooftop solar units adopted in that city that month, normalized per 1000 residents.
disaster: dichotomous variable indicating whether city experienced disaster deaths or damage from the 2011 tsunami and earthquake (disaster == 1), or did not (disaster == 0).
pop: population from city’s most recent 5-year census tally.

1. Right Skewed Rates

We’re going to investigate the rate of rooftop solar adoption per 1000 residents in each town over time (solar_rate).

Problems

However, we have two competing problems:

Our outcome is a right skewed rate, going from zero to infinity. We need to log-transform it in order to use ordinary least squares regression. Otherwise, we will get a bunch of highly correlated residuals, because all our cities with zero or few observed solar farms will receive impossible negative predicted values, which makes no sense.
A lot of our municipalities recorded a rate of zero solar farms. But it is not possible to log-transform a rate of zero solar farms. Some people exclude these zero-observations, but this is a very dicey decision, because those zeros are deeply theoretically important. If we think of our measurements as being subject to even the slightest amount of error, which all measurements are, then these are municipalities that probably had very, very, very few solar farms per 1,000 residents, if not zero. If we remove all of these cases, our model results will skew in favor of our cases which adopted more solar farms.

Visual

For example, see the visual below, which tallies the really substantial number of city-month observations in this dataset which involve zero recorded solar farms.

# Let's count up in how many rows the outcome is zero, vs. not zero
jp %>%
  group_by(values = if_else(solar_rate == 0, "0 Units", "1 Unit or More")) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(percent = count / sum(count) * 100,
         percent = paste(round(percent,1), "%", sep = "")) %>%
  # Visualize!
  ggplot(mapping = aes(x = values, y = count, 
                       fill = values, 
                       label = percent)) +
  geom_col(color = "black") +
  geom_text(nudge_y = -200, size = 7, color = "white") +
  scale_fill_viridis(option = "mako", discrete = TRUE, begin = 0.8, end = 0.2) +
  guides(fill = "none") +
  theme_classic(base_size = 14) +
  labs(x = "Municipality-Month Observations by Level of Rooftop Solar Adoption",
       y = "Observations (n = 6,321)")

Solution

So what’s a data scientist to do? On the one hand, you could log-transform your outcome, and lose 38.4% of your total sample size. (Not great.) On the other hand, you could add a very small constant to those zero observations (eg. 0.000000001) and end up drastically skewing your results without realizing it. (Also not great!)

A halfway-option might be a good solution. Our real goal is to log-transform our outcome so that our data can be appropriately modeled, while preserving our zero-observations which are theoretically important to our data. So, one solution is to add a small but realistic constant to the zero-observations in your outcome. This way, the distribution of your variable remains almost identical, but your variable no longer faces any problems when dividing by zero.

Small-but-realistic

This raises the question, what counts as a small-but-realistic value? A solution I once heard is half the size of your smallest non-zero observation. The idea here is that this preserves the general shape of your distribution.

# Let's a vector of my unique observations, sorted from smallest to largest.
myoutcome <- jp$solar_rate %>% unique() %>% sort()
# Let's view the first 3
myoutcome %>% head(3)

## [1] 0.000000000 0.009499262 0.010846340

Constant

So now that we’ve got our range of unique values, let’s grab the smallest non-zero value, the second value in the vector, and divide it by two. This essentially says, hey, it’s probably an error anyways that we recorded zero rooftop solar units in this town, but that low value is important, so let’s capture the lowness of that observation using a value that is grounded in our distribution (eg. tethered to the smallest non-zero value). (This is as opposed to some random constant like 0.000000001 that might skew drastically when log-transformed.)

# Get your constant
myconstant <- myoutcome[2] / 2
# View your constnat
myconstant

## [1] 0.004749631

Comparison

For example, let’s view side by side three proposed distributions in the figure below. These show logged solar rates, based on three different conditions.

In the left-most panel, we just log transformed the original dataset, losing 38% of observations which reported zero new solar panels that month.
In the central panel, we replaced that with our realistic constant, ~0.004, then log-transformed the data.
In the right-most panel, we replaced the constant with a very small constant (0.000000000001) and log-transformed the data.

While we would think that each of these produces a similar result when not log-transformed, after log-transformation, the results are very, very different:

Our unadjusted sample loses a tremendous amount of data.
Our realistically adjusted sample shows nicely the fact that yes, many towns experienced a very low amount of solar adoption (de-facto zero), but the log-transformed value remains relatively close to the rest of the distribution.
Finally, the unrealistically adjusted sample shows that many towns experience a low amount of solar, but after log-transformation, the scale is completely off the rest of the data, registering at around -20 rather than the majority of data, which is located around the values of -5 to 0 in logged values.

So, this highlights how using small-but-realistic constants can help get us closer to the truth by preserving the general shape of our distribution, preserving the original size of our dataset, and allowing us to test our hypotheses in ways that aren’t deeply biased by whatever random constant we used.

# Get a dataset comparing our different transformation options
compare <- bind_rows(
  # Let's first get a version of the dataset as it is originally
  jp %>%
    mutate(type = "Original\nNo Adjustment",
           finding = "Excludes 38.4%",
           position = 0),
  # Next, let's get a version where we impute zeros with our realistic constnat
  jp %>%
    mutate(solar_rate = if_else(
      condition = solar_rate == 0, 
      true = myconstant, false = solar_rate),
      type = "Realistic\nAdjustment",
      finding = "Shows frequent\nlow values\npretty well",
      position = 0),
  # Then, let's get a version where we impute zeros with our unrealistic constant
  jp %>%
    mutate(solar_rate = if_else(
      condition = solar_rate == 0, 
      true = 0.000000000001, false = solar_rate),
           type = "Unrealistic\nAdjustment",
           finding = "Creates\nvery strong\nartificial outliers",
           position = -10)
)

# Let's map their distributions
compare %>%
  ggplot(mapping = aes(x = log(solar_rate))) +
  geom_histogram(mapping = aes(y = ..density..), bins = 30, color = "white") +
  geom_density(fill = "steelblue", alpha = 0.5) +
  geom_text(mapping = aes(x = position, y = 0.50, 
                          label = finding), size = 5) +
  facet_wrap(~type, scales = "free_x") +
  theme_classic(base_size = 14) +
  theme(panel.border = element_rect(fill = NA, color = "black")) +
  labs(x = "Logged Solar Rate", y = "(%) Frequency (n = 6321)")

Revised Dataset

So in the code below, we will use our constant to impute all zero values, and then log-transform the outcome variable.

# Let's a vector of my unique observations, sorted from smallest to largest.
myoutcome <- jp$solar_rate %>% unique() %>% sort()

# Get your constant
myconstant <- myoutcome[2] / 2

# Update our data.frame with new value
jp2 <- jp %>%
    mutate(solar_log = if_else(
      condition = solar_rate == 0, 
      true = myconstant,
      false = solar_rate) %>%
        log())

2. Lagged Variables

Next, let’s make some lagged variables to account for the effect of past solar deployment on future solar deployment. Cities that have been ‘on a roll’ in recent months and years might see a boost in future deployment, because their local businesses and partners are seeing investment, gaining name recognition, and getting practice at pulling off these projects. Let’s test that hypothesis!

Below, we’re going to learn to use the lag(), lm(), and htmlreg() function.

lag()

First, we need to create a lagged variable. We will order the dataset with arrange(), then lag the variable, one city at a time using group_by(), using lag() to push back the variable for that city, leaving an NA behind. Important to check that it has worked as expected.

jp3 <- jp2 %>%
  # Must order by city and time
  arrange(muni_code, date) %>%
  # Must group by municipality - IMPORTANT
  group_by(muni_code) %>%
  # Then lag the solar rate, using n to specify how many steps.
  mutate(
    # First let's lag by 1 row,
    solar_log_lag1 = lag(solar_log, n = 1),
    # Then let's lag by 2 rows
    solar_log_lag2 = lag(solar_log, n = 2),
    # Then let's lag by 3 rows
    solar_log_lag3 = lag(solar_log, n = 3)) %>%
  ungroup() %>%
  # Turn date into a factor for easy modeling
  mutate(date = factor(date))

Let’s check the result!

jp3 %>%
  select(muni_code, date, solar_log, contains("lag") ) %>%
  head(5)

muni_code	date	solar_log	solar_log_lag1	solar_log_lag2	solar_log_lag3
02204	2014-05-25	-2.841532	NA	NA	NA
02204	2014-06-25	-3.534679	-2.841532	NA	NA
02204	2014-07-25	-5.349688	-3.534679	-2.841532	NA
02204	2014-08-25	-2.436066	-5.349688	-3.534679	-2.841532
02204	2014-09-25	-2.436066	-2.436066	-5.349688	-3.534679

lm()

Finally, let’s model with them!

This dataset was original designed as a matching experiment to test the effect of disasters on rooftop solar adoption. So, we’ll run our first model to test the direct effect of disasters on solar adoption. Each model will also adjust for date as a fixed effect.

m1 <- jp3 %>%
  lm(formula = solar_log ~ disaster + date)

Next, let’s also control for the effect of solar adoption one month prior (a manual auto-regressive control.)

m2 <- jp3 %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1)

Next, let’s also control for the effect of solar adoption two months prior.

m3 <- jp3 %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1 + solar_log_lag2)

Finally, let’s control for the effect of solar adoption three months prior.

m4 <- jp3 %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1 + solar_log_lag2 + solar_log_lag3)

htmlreg()

library(texreg)
htmlreg(
  list(m1,m2,m3,m4), 
  # We'll omit our many date-fixed effects for readability
  omit.coef = "date",
  caption = "*** p < 0.001; ** p < 0.01; * p < 0.05.<br>With monthly fixed effects, omitted from table.")

*** p < 0.001; ** p < 0.01; * p < 0.05.
With monthly fixed effects, omitted from table.
	Model 1	Model 2	Model 3	Model 4
(Intercept)	-3.71^***	-2.72^***	-2.09^***	-1.86^***
	(0.13)	(0.13)	(0.14)	(0.14)
disaster	1.58^***	1.12^***	0.90^***	0.78^***
	(0.04)	(0.04)	(0.05)	(0.05)
solar_log_lag1		0.29^***	0.23^***	0.21^***
		(0.01)	(0.01)	(0.01)
solar_log_lag2			0.19^***	0.16^***
			(0.01)	(0.01)
solar_log_lag3				0.14^***
				(0.01)
R²	0.44	0.49	0.51	0.53
Adj. R²	0.44	0.49	0.51	0.53
Num. obs.	6321	6174	6027	5880
^*p < 0.001; ^p < 0.01; ^*p < 0.05

Finally, we can compare our results above in the texreg package’s htmlreg() function (or its handy counterpart screenreg()), which shows that the addition of our autoregressive terms did not change the direction or significance of our disaster variable’s effect on solar. However, it did decrease the effect slightly, and we see that prior solar adoption in each time period is strongly related to an increase in future solar adoption. This is the main value added of lagged variables.

3. Likelihood Ratio Tests

Finally, what if we need to justify further the inclusion, or exclusion, of a lagged variable? Likelihood ratio tests are a great way to do this. They compare the change in a measure of your model’s fit, the log-likelihood. Plus, they calculate a chi-squared statistic to evaluate how extreme an improvement your new model makes upon a prior specification of the model, and whether that improvement is statistically significant, or not. Below, we’re going to learn the steps, using the functions drop_na(), htmlreg(), and lrtest().

drop_na()

However, a key requirement of these tests is that the sample size must be identical in each model. Due to lagged terms, we regularly dropped several cases from earlier timesteps no longer covered, since each model is working with slightly different sizes of data. So, to give us a fair baseline for comparison, we need to rerun our models, having dropped all such missing values at the beginning. We will do this using the dplyr package’s drop_na() function. We will call these models z1, z2, z3, and z4.

z1 <- jp3 %>%
  drop_na() %>%
  lm(formula = solar_log ~ disaster + date)

z2 <- jp3 %>%
  drop_na() %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1)

z3 <- jp3 %>%
  drop_na() %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1 + solar_log_lag2)

z4 <- jp3 %>%
  drop_na() %>%
  lm(formula = solar_log ~ disaster + date +
       solar_log_lag1 + solar_log_lag2 + solar_log_lag3)

htmlreg() (Again!)

Let’s examine the texreg table now.

htmlreg(
  list(z1,z2,z3,z4), 
  # We'll omit our many date-fixed effects for readability
  omit.coef = "date",
  caption = "*** p < 0.001; ** p < 0.01; * p < 0.05.<br>With monthly fixed effects, omitted from table.")

*** p < 0.001; ** p < 0.01; * p < 0.05.
With monthly fixed effects, omitted from table.
	Model 1	Model 2	Model 3	Model 4
(Intercept)	-3.76^***	-2.70^***	-2.17^***	-1.86^***
	(0.13)	(0.13)	(0.14)	(0.14)
disaster	1.58^***	1.12^***	0.91^***	0.78^***
	(0.04)	(0.04)	(0.05)	(0.05)
solar_log_lag1		0.29^***	0.23^***	0.21^***
		(0.01)	(0.01)	(0.01)
solar_log_lag2			0.19^***	0.16^***
			(0.01)	(0.01)
solar_log_lag3				0.14^***
				(0.01)
R²	0.46	0.50	0.52	0.53
Adj. R²	0.45	0.50	0.52	0.53
Num. obs.	5880	5880	5880	5880
^*p < 0.001; ^p < 0.01; ^*p < 0.05

lrtest()

Finally, let’s run our likelihood ratio test, using the lmtest package’s lrtest() function! Then, we can extract the results into a tidy data.frame using the broom package’s tidy() function, and add stars for significance with the gtools package’s stars.pval() function.

library(lmtest)
library(broom)
library(gtools)

result <- lrtest(z1,z2,z3,z4) %>%
  tidy() %>%
  mutate(sig = stars.pval(p.value))

result

X.Df	LogLik	df	statistic	p.value	sig
42	-11034.05	NA	NA	NA
43	-10777.57	1	512.9506	0	***
44	-10665.58	1	223.9834	0	***
45	-10603.17	1	124.8214	0	***

This is great news! Looks like the inclusion of each term greatly increases the log-likelihood. (Note how it gets progressively less and less negative in the LogLik column?) And each time, the increase is statistically significant at the p < 0.001 level! This is strong justification for the inclusion of such a term. Although, one should probably add more control variables before making a final call.

4. Simulate!

Finally, we can take our models, which now involve a logged outcome variable, and use simulation in the Zelig package to get back interpretable, exponentiated predictions. Below, we’re going to use the functions sim(), bind_rows(), quantile(), and ggplot().

sim()

library(Zelig)

# Let's get the interquartile range for logged solar adoption
myrange <- jp3$solar_log %>% 
  quantile(probs = c(0.25, 0.75), na.rm = TRUE) 

# Simulate effects for high solar for 3-months prior
simlag3 <- z4 %>%
  # Convert to Zelig object
  to_zelig() %>%
  # Set range of predictor values
  setx(solar_log_lag3 = seq(from = myrange[1], to = myrange[2], length.out = 10),
       solar_log_lag2 = seq(from = myrange[1], to = myrange[2], length.out = 10),
       solar_log_lag1 = seq(from = myrange[1], to = myrange[2], length.out = 10)) %>%
  # Simulate expected value
  sim() %>%
  # Extract expected value
  zelig_qi_to_df() %>%
  select(x = solar_log_lag3, ev = expected_value)

# Simulate effects for high solar for 2-months prior
simlag2 <- z4 %>%
  # Convert to Zelig object
  to_zelig() %>%
  # Set range of predictor values
  setx(solar_log_lag2 = seq(from = myrange[1], to = myrange[2], length.out = 10),
       solar_log_lag1 = seq(from = myrange[1], to = myrange[2], length.out = 10)) %>%
  # Simulate expected value
  sim() %>%
  # Extract expected value
  zelig_qi_to_df() %>%
  select(x = solar_log_lag2, ev = expected_value)

# Simulate effects for high solar for 1 months prior
simlag1 <- z4 %>%
  # Convert to Zelig object
  to_zelig() %>%
  # Set range of predictor values
  setx(solar_log_lag1 = seq(from = myrange[1], to = myrange[2], length.out = 10)) %>%
  # Simulate expected value
  sim() %>%
  # Extract expected value
  zelig_qi_to_df() %>%
  select(x = solar_log_lag1, ev = expected_value)

bind_rows()

Finally, let’s combine our simulations for these different variables to visualize the changing trajectory of solar adoption in an average city, if only the level of solar adoption in previous months was altered.

mysim <- bind_rows(
  simlag1 %>% 
    mutate(type = "1 month"),
  simlag2 %>% 
    mutate(type = "2 months"),
  simlag3 %>% 
    mutate(type = "3 months")) %>%
  # exponentiate the predictor and expected value
  # To bring them back into their original range
  mutate(ev = exp(ev),
         x = exp(x))

quantile()

mystats <- mysim %>%
  # Get a 95% confidence interval for each type and predictor
  group_by(type, x) %>%
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))

ggplot()

mystats %>% 
  ggplot(mapping = aes(x = x, y = median, ymin = lower, ymax = upper, fill = type)) +
  geom_ribbon(alpha = 0.5, color = "white") +
  scale_fill_viridis(option = "plasma", begin = 0.2, end = 0.8, discrete = TRUE) +
  labs(fill = "Increased\nSolar Adoption\nfor X months\nprior",
       x = "Rate of Solar Adoption in Prior Month(s)",
       y = "Expected Rate of Solar Adoption\n(with 95% simulated confidence interval)",
       subtitle = "Effect of Past Solar Adoption\non Future Solar Adoption",
       caption = "Based on 1000 simulations in the Zelig package in R,\nfor an otherwise average Japanese city.") +
  theme_bw()

That’s a wrap! Hope this helped demonstrate the power of logged outcomes, lagged variables, likelihood ratio tests, and simulation on panel datasets. Please be in touch if you have any questions!