This R package implements an approach to estimating the causal effect of a designed intervention on a time series. For example, how many additional daily clicks were generated by an advertising campaign? Answering a question like this can be difficult when a randomized experiment is not available.
Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets or clicks on other sites), the package constructs a Bayesian structural time-series model. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had never occurred. For a quick overview, watch the tutorial video. For details, see: Brodersen et al., Annals of Applied Statistics (2015).
We apply LIWC only to the following categories:
## [1] "i" "we" "you" "shehe" "they"
## [6] "ipron" "negate" "compare" "posemo" "negemo"
## [11] "anx" "anger" "sad" "social" "family"
## [16] "friend" "female" "male" "insight" "cause"
## [21] "discrep" "tentat" "certain" "differ" "see"
## [26] "hear" "feel" "body" "health" "sexual"
## [31] "ingest" "affiliation" "achiev" "power" "reward"
## [36] "risk" "focuspast" "focuspresent" "focusfuture" "relativ"
## [41] "work" "leisure" "home" "money" "relig"
## [46] "death" "informal" "swear" "assent" "nonflu"
## [51] "filler"
Example: our example data cover two days (-1 and 1) and two users (A and B). On day -1, user A tweeted 3 times, using happy words twice in the first tweet and once in the third; user B tweeted once and used one ipron word. On day 1, both users tweeted twice: A and B each used one happy word in their first tweet, and A also used one ipron word.
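library(tidyverse)   # tibble(), %>%, group_by(), summarise_at()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# Toy data: one row per tweet
days_tweet <- c(rep(-1, 4), rep(1, 4))
id <- c(rep("A", 3), "B", rep("A", 2), rep("B", 2))
# Random tweet IDs (no seed is set, so they change on every run)
id_tweet <- sample(30000:40000, 8, replace = FALSE)
happy <- c(2, 0, 1, 0, 1, 0, 1, 0)
ipron <- c(0, 0, 0, 1, 1, 0, 0, 0)
d <- tibble(days_tweet, id, id_tweet, happy, ipron)
d %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)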
| days_tweet | id | id_tweet | happy | ipron |
|---|---|---|---|---|
| -1 | A | 33451 | 2 | 0 |
| -1 | A | 31439 | 0 | 0 |
| -1 | A | 32404 | 1 | 0 |
| -1 | B | 38084 | 0 | 1 |
| 1 | A | 36666 | 1 | 1 |
| 1 | A | 35090 | 0 | 0 |
| 1 | B | 34824 | 1 | 0 |
| 1 | B | 32405 | 0 | 0 |
First approach: we compute the mean value per day over all tweets, regardless of which user wrote them.
d %>%
group_by(days_tweet) %>%
summarise_at(vars(happy:ipron), mean) %>%
ungroup() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)
| days_tweet | happy | ipron |
|---|---|---|
| -1 | 0.75 | 0.25 |
| 1 | 0.50 | 0.25 |
Second approach, in two steps:
We first compute the mean value per day and user.
Then we compute the mean of those per-user values for each day.
d %>%
group_by(days_tweet, id) %>%
summarise_at(vars(happy:ipron), mean) %>%
ungroup() %>%
group_by(days_tweet) %>%
summarise_at(vars(happy:ipron), mean) %>%
ungroup() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)
| days_tweet | happy | ipron |
|---|---|---|
| -1 | 0.5 | 0.50 |
| 1 | 0.5 | 0.25 |
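The two approaches differ only in how much weight each user gets: pooling all tweets of a day lets frequent tweeters dominate, while averaging per user first gives every user the same weight. The following is a minimal base R sketch of that difference for the `happy` column, using the toy values from the example table; the names `d_toy`, `pooled`, `per_user` and `balanced` are illustrative and do not appear in the original analysis.

```r
# Toy data copied from the example table (tweet IDs omitted).
d_toy <- data.frame(
  days_tweet = c(rep(-1, 4), rep(1, 4)),
  id         = c(rep("A", 3), "B", rep("A", 2), rep("B", 2)),
  happy      = c(2, 0, 1, 0, 1, 0, 1, 0)
)

# First approach: pool all tweets of a day, so heavy tweeters weigh more.
pooled <- aggregate(happy ~ days_tweet, data = d_toy, FUN = mean)

# Second approach: average within each user first, then across users.
per_user <- aggregate(happy ~ days_tweet + id, data = d_toy, FUN = mean)
balanced <- aggregate(happy ~ days_tweet, data = per_user, FUN = mean)

print(pooled)    # day -1: 0.75, day 1: 0.50
print(balanced)  # day -1: 0.50, day 1: 0.50
```

On day -1 user A wrote three of the four tweets, which is why the pooled mean (0.75) is pulled towards A's value while the per-user mean (0.5) is not.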
These are the values for NEDA and baseline under the first approach:
baseline_liwc %>%
select(-text, -created_at, -id_tweet, -id) %>%
group_by(days_tweet) %>%
summarise_all(mean) %>%
ungroup() %>%
pivot_longer(cols = fun:filler, names_to = "categ", values_to = "values_baseline") -> ci_baseline
neda_liwc %>%
select(-id_tweet, -id) %>%
group_by(days_tweet) %>%
summarise_all(mean) %>%
ungroup() %>%
pivot_longer(-days_tweet, names_to = "categ", values_to = "values_neda") -> ci_neda_liwc
pre_period <- c(1, 16)
post_period <- c(17, 31)
ci_baseline %>%
inner_join(ci_neda_liwc) -> d_first
d_first %>%
pivot_longer(cols = values_baseline:values_neda,
names_to = "names", values_to = "values") %>%
mutate(names = case_when(names == "values_baseline" ~ "Baseline",
names == "values_neda" ~ "NEDA")) %>%
filter(categ %in% sel_cat[1:24]) %>%
ggplot(aes(x = days_tweet, y = values, color = names)) +
geom_line() +
facet_wrap(categ ~., scales = "free", ncol = 6) +
labs(color = "", title = "First Approach") +
theme(legend.position="top")
These are the values for NEDA and baseline under the second approach:
In the table below we compare the results of running the CausalImpact package under these two approaches. Only categories with a p-value < 0.05 are included. The columns contain:
Word Category: Word category in LIWC.
1st App Relative Eff.: Cumulative relative effect, in percent, with the first approach.
1st P Value: P-value with the first approach.
2nd App Relative Eff.: Cumulative relative effect, in percent, with the second approach.
d_first %>%
select(categ, values_neda, values_baseline) %>%
nest(data = - categ) %>%
mutate(mod = map(data, ~CausalImpact::CausalImpact(., pre_period, post_period))) -> ci
ci %>%
mutate(summary_mod = map(mod, "summary")) %>%
filter(!map_lgl(summary_mod, is.null)) -> ci_resul
ci_resul %>%
mutate(p = map(summary_mod, "p")) %>%
mutate(p = map_dbl(p, 1)) %>%
filter(p < 0.05) %>%
mutate(relative_effect = map(summary_mod, "RelEffect")) %>%
mutate(relative_effect = map_dbl(relative_effect, 2))-> sig_cat
sig_cat %>%
filter(categ %in% sel_cat) -> sig_cat
sig_cat %>%
arrange(desc(relative_effect)) %>%
select(categ, first_relative_effect = relative_effect, p_first = p) -> first_ap
d_second %>%
select(categ, values_neda, values_baseline) %>%
nest(data = - categ) %>%
mutate(mod = map(data, ~CausalImpact::CausalImpact(., pre_period, post_period))) -> ci
ci %>%
mutate(summary_mod = map(mod, "summary")) %>%
filter(!map_lgl(summary_mod, is.null)) -> ci_resul
ci_resul %>%
mutate(p = map(summary_mod, "p")) %>%
mutate(p = map_dbl(p, 1)) %>%
filter(p < 0.05) %>%
mutate(relative_effect = map(summary_mod, "RelEffect")) %>%
mutate(relative_effect = map_dbl(relative_effect, 2))-> sig_cat
sig_cat %>%
filter(categ %in% sel_cat) -> sig_cat
sig_cat %>%
arrange(desc(relative_effect)) %>%
select(categ, second_relative_effect = relative_effect, p_second = p) -> second_ap
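In the pipelines above, `map_dbl(relative_effect, 2)` takes the second element of the `RelEffect` column, which corresponds to the Cumulative row of the summary table stored in a CausalImpact result (the first row is Average). The sketch below shows the same extraction by row name in base R. It runs on a hand-made stand-in for that summary table: `mock_summary` and `extract_effect()` are hypothetical names, and the numbers are rounded values from the shehe output shown later in this report.

```r
# Hand-made stand-in for the `summary` element of a CausalImpact result.
# Only the two columns used here are included (an assumption for brevity;
# the real table has more columns).
mock_summary <- data.frame(
  RelEffect = c(-0.07, -0.07),
  p         = c(0.02926, 0.02926),
  row.names = c("Average", "Cumulative")
)

# Pull the cumulative relative effect and the tail-area probability by name,
# which is less fragile than relying on positions 1 and 2.
extract_effect <- function(summary_df) {
  c(relative_effect = summary_df["Cumulative", "RelEffect"],
    p               = summary_df["Average", "p"])
}

res <- extract_effect(mock_summary)
print(res)
```

Selecting by row name makes the intent explicit and would fail loudly if the layout of the summary table ever changed.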
second_ap %>%
mutate(categ = str_to_title(categ)) %>%
mutate_at(vars(second_relative_effect), ~.*100) %>%
mutate_if(is.numeric, ~round(., digits = 3)) %>%
mutate_at(vars(second_relative_effect), function(x){
cell_spec(x, "html", color = spec_color(x), bold = T)
}) %>%
kable("html", escape = F,
align = "lrr",
col.names = c("Word Category", "Relative Eff.(%)",
"P Value")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)
| Word Category | Relative Eff.(%) | P Value |
|---|---|---|
| Female | 17.418 | 0.001 |
| Anx | 7.566 | 0.004 |
| Family | 6.452 | 0.004 |
| Money | 5.991 | 0.001 |
| Relig | 5.209 | 0.042 |
| Achiev | 3.813 | 0.006 |
| They | 3.397 | 0.034 |
| Negate | 2.889 | 0.004 |
| Health | 2.526 | 0.005 |
| Power | 2.458 | 0.017 |
| Negemo | 2.066 | 0.041 |
| Informal | 1.116 | 0.036 |
| Ipron | -1.476 | 0.011 |
| Discrep | -1.977 | 0.037 |
| See | -2.036 | 0.027 |
| You | -2.238 | 0.029 |
| Differ | -2.694 | 0.012 |
| Posemo | -3.277 | 0.001 |
| Tentat | -3.347 | 0.002 |
| Shehe | -7.042 | 0.030 |
| Affiliation | -7.167 | 0.002 |
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "female") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period) -> ci
plot(ci) +
theme_light()
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.12. By contrast, in the absence of an intervention, we would have expected an average response of 0.10. The 95% interval of this counterfactual prediction is [0.096, 0.11]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 0.018 with a 95% interval of [0.012, 0.024]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 1.80. By contrast, had the intervention not taken place, we would have expected a sum of 1.54. The 95% interval of this prediction is [1.45, 1.63].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +17%. The 95% interval of this percentage is [+11%, +23%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (0.018) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
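As a rough sanity check, the relative effect in the report above can be approximated from the two cumulative figures it prints. This is only a back-of-the-envelope sketch on rounded numbers: the package derives its estimate from the posterior distribution, so the result agrees with the reported +17% only up to rounding. The variable names are illustrative.

```r
# Cumulative values as printed in the report for the "female" category.
actual_sum    <- 1.80  # observed sum over the post-intervention period
predicted_sum <- 1.54  # expected sum had the intervention not taken place

# Absolute effect: observed minus counterfactual prediction.
absolute_effect <- actual_sum - predicted_sum

# Relative effect: absolute effect as a share of the prediction.
relative_effect <- absolute_effect / predicted_sum

print(round(100 * relative_effect))  # 17
```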
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "family") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period) -> ci
plot(ci)
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.061. By contrast, in the absence of an intervention, we would have expected an average response of 0.057. The 95% interval of this counterfactual prediction is [0.054, 0.060]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 0.0037 with a 95% interval of [0.00091, 0.0062]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 0.91. By contrast, had the intervention not taken place, we would have expected a sum of 0.85. The 95% interval of this prediction is [0.81, 0.89].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +6%. The 95% interval of this percentage is [+2%, +11%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (0.0037) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.004). This means the causal effect can be considered statistically significant.
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "anx") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period) -> ci
plot(ci) +
theme_light()
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.049. By contrast, in the absence of an intervention, we would have expected an average response of 0.046. The 95% interval of this counterfactual prediction is [0.044, 0.048]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 0.0034 with a 95% interval of [0.0015, 0.0055]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 0.74. By contrast, had the intervention not taken place, we would have expected a sum of 0.68. The 95% interval of this prediction is [0.65, 0.71].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +8%. The 95% interval of this percentage is [+3%, +12%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (0.0034) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "shehe") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period) -> ci
plot(ci)
summary(ci)
## Posterior inference {CausalImpact}
##
## Average Cumulative
## Actual 0.077 1.160
## Prediction (s.d.) 0.083 (0.003) 1.248 (0.045)
## 95% CI [0.077, 0.089] [1.158, 1.331]
##
## Absolute effect (s.d.) -0.0059 (0.003) -0.0879 (0.045)
## 95% CI [-0.011, 0.00014] [-0.170, 0.00216]
##
## Relative effect (s.d.) -7% (3.6%) -7% (3.6%)
## 95% CI [-14%, 0.17%] [-14%, 0.17%]
##
## Posterior tail-area probability p: 0.02926
## Posterior prob. of a causal effect: 97.074%
##
## For more details, type: summary(impact, "report")
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.077. In the absence of an intervention, we would have expected an average response of 0.083. The 95% interval of this counterfactual prediction is [0.077, 0.089]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -0.0059 with a 95% interval of [-0.011, 0.00014]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 1.16. Had the intervention not taken place, we would have expected a sum of 1.25. The 95% interval of this prediction is [1.16, 1.33].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -7%. The 95% interval of this percentage is [-14%, +0%].
##
## This means that, although it may look as though the intervention has exerted a negative effect on the response variable when considering the intervention period as a whole, this effect is not statistically significant, and so cannot be meaningfully interpreted. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.029). This means the causal effect can be considered statistically significant.
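The report above calls the effect both "not statistically significant" and "statistically significant". Judging from the printed numbers, a plausible reading (an assumption, not something the output states) is that the first statement follows the 95% interval, which includes zero, while the second follows the one-sided tail-area probability, which is below 0.05. The base R sketch below applies both criteria to the absolute-effect intervals and p-values printed for shehe and affiliation; `interval_excludes_zero()` and `effects` are hypothetical names.

```r
# TRUE when a [lower, upper] interval lies entirely on one side of zero.
interval_excludes_zero <- function(lower, upper) {
  lower > 0 | upper < 0
}

# Absolute-effect intervals and p-values copied from the printed reports.
effects <- data.frame(
  categ = c("shehe", "affiliation"),
  lower = c(-0.011, -0.054),
  upper = c(0.00014, -0.011),
  p     = c(0.029, 0.004)
)

effects$sig_interval <- interval_excludes_zero(effects$lower, effects$upper)
effects$sig_p        <- effects$p < 0.05

print(effects)
```

For shehe the two criteria disagree, so categories like this one pass the `p < 0.05` filter used for the table even though their interval touches zero.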
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "affiliation") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period) -> ci
plot(ci)
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.43. By contrast, in the absence of an intervention, we would have expected an average response of 0.46. The 95% interval of this counterfactual prediction is [0.44, 0.48]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -0.033 with a 95% interval of [-0.054, -0.011]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 6.41. By contrast, had the intervention not taken place, we would have expected a sum of 6.91. The 95% interval of this prediction is [6.57, 7.22].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -7%. The 95% interval of this percentage is [-12%, -2%].
##
## This means that the negative effect observed during the intervention period is statistically significant. If the experimenter had expected a positive effect, it is recommended to double-check whether anomalies in the control variables may have caused an overly optimistic expectation of what should have happened in the response variable in the absence of the intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.004). This means the causal effect can be considered statistically significant.
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "friend") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period, alpha = .05) -> ci
plot(ci)
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.048. In the absence of an intervention, we would have expected an average response of 0.058. The 95% interval of this counterfactual prediction is [0.047, 0.070]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -0.0095 with a 95% interval of [-0.022, 0.0016]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 0.72. Had the intervention not taken place, we would have expected a sum of 0.87. The 95% interval of this prediction is [0.70, 1.05].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -16%. The 95% interval of this percentage is [-37%, +3%].
##
## This means that, although it may look as though the intervention has exerted a negative effect on the response variable when considering the intervention period as a whole, this effect is not statistically significant, and so cannot be meaningfully interpreted. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.045). This means the causal effect can be considered statistically significant.
d_second %>%
select(categ, values_neda, values_baseline) %>%
filter(categ == "relig") %>%
select(-categ) -> d
CausalImpact::CausalImpact(data = d, pre_period, post_period, alpha = .05) -> ci
plot(ci)
summary(ci, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 0.046. In the absence of an intervention, we would have expected an average response of 0.044. The 95% interval of this counterfactual prediction is [0.041, 0.046]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 0.0023 with a 95% interval of [-0.00012, 0.0048]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 0.69. Had the intervention not taken place, we would have expected a sum of 0.66. The 95% interval of this prediction is [0.62, 0.69].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +5%. The 95% interval of this percentage is [-0%, +11%].
##
## This means that, although the intervention appears to have caused a positive effect, this effect is not statistically significant when considering the entire post-intervention period as a whole. Individual days or shorter stretches within the intervention period may of course still have had a significant effect, as indicated whenever the lower limit of the impact time series (lower plot) was above zero. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.034). This means the causal effect can be considered statistically significant.