You only need to do this once; comment this line out afterwards.
install.packages("CausalImpact")
library(tidyverse)
library(lubridate)
library(CausalImpact)
library(readxl)
andmed <- read_excel("eres_cit_res.xlsx")
The data contain the country name in Estonian and in English, the month the data refer to, how many citizens of that country joined e-Residency, how many residents of that country joined e-Residency and, for convenience, the date in date format (the first day of the month).
In early 2021 (we treat March 2021 as the month) e-Residency added four new pickup locations: Singapore, South Africa, Brazil and Thailand. We want to know how opening these pickup locations increased the number of e-residents from those countries and, later on, how it increased company formation from those countries.
For this we use the CausalImpact package, created by Google. We estimate the growth in the number of a country's residents among e-residents.
4.1) Data preparation
CausalImpact wants the data in wide format.
We convert the data to wide format:
andmed_lai <- andmed %>%
select(country, value_residence, date) %>%
pivot_wider(names_from = country, values_from = value_residence)
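To see what this reshaping does, here is a minimal sketch on made-up toy data (the values and the `toy_` names are hypothetical; only the column names `country`, `value_residence` and `date` come from the code above). It also shows where NAs come from: a country with no row for a given month ends up with an empty cell.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per country and month.
# Finland has no row for February 2021.
toy_long <- tibble(
  country = c("Brazil", "Brazil", "Finland"),
  value_residence = c(3, 5, 12),
  date = as.Date(c("2021-02-01", "2021-03-01", "2021-03-01"))
)

# One row per month, one column per country.
toy_wide <- toy_long %>%
  pivot_wider(names_from = country, values_from = value_residence)

toy_wide
```

The result has the columns `date`, `Brazil` and `Finland`, and the February cell for Finland is NA.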
Let us first try with all countries (except the other countries where the intervention took place). CausalImpact expects the first column to be the country where the intervention took place and the remaining columns to be countries where it did not. We move Brazil to the very front.
andmed_analysis <- andmed_lai %>%
select(-Thailand, -"South Africa", -Singapore, -date) %>%
relocate(Brazil)
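Since the same selection has to be repeated for each country that received the intervention, it can be wrapped in a small helper. This is only a sketch: `prepare_for` and `treated` are hypothetical names, and the toy table stands in for `andmed_lai`.

```r
library(dplyr)

# Countries where the intervention took place.
treated <- c("Brazil", "Thailand", "South Africa", "Singapore")

# Drop the other treated countries and the date column,
# then move the target country to the first column.
prepare_for <- function(wide, target, treated) {
  wide %>%
    select(-any_of(c(setdiff(treated, target), "date"))) %>%
    relocate(all_of(target))
}

# Hypothetical stand-in for andmed_lai.
toy_wide <- tibble(
  date = as.Date(c("2021-02-01", "2021-03-01")),
  Finland = c(10, 12),
  Brazil = c(3, 5),
  Thailand = c(1, 4),
  `South Africa` = c(0, 2),
  Singapore = c(2, 6)
)

names(prepare_for(toy_wide, "Brazil", treated))
names(prepare_for(toy_wide, "Thailand", treated))
```

`any_of()` is used so that the helper does not fail if one of the listed columns is absent from the table.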
CausalImpact does not want any NAs in the data. In reality all the NAs are zeros, so we replace them.
andmed_analysis <- andmed_analysis %>%
replace(is.na(.), 0)
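An alternative spelling of the same step, sketched on made-up toy data, uses `tidyr::replace_na()` column by column and then checks that nothing is left missing. The toy values are hypothetical; the substitution itself is only valid because a missing month here means zero applicants.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with gaps.
toy_wide <- tibble(Brazil = c(3, NA, 5), Finland = c(NA, 10, 12))

# Replace NA with 0 in every column.
filled <- toy_wide %>%
  mutate(across(everything(), ~ replace_na(.x, 0)))

# Sanity check: no missing values remain.
stopifnot(!anyNA(filled))
```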
Next we determine up to which point the pre-intervention period ran and from which point the intervention period starts:
andmed_lai$date # here we can read off which period 2021-03-01 is: it is the 76th
## [1] "2014-12-01 UTC" "2015-01-01 UTC" "2015-02-01 UTC" "2015-03-01 UTC"
## [5] "2015-04-01 UTC" "2015-05-01 UTC" "2015-06-01 UTC" "2015-07-01 UTC"
## [9] "2015-08-01 UTC" "2015-09-01 UTC" "2015-10-01 UTC" "2015-11-01 UTC"
## [13] "2015-12-01 UTC" "2016-01-01 UTC" "2016-02-01 UTC" "2016-03-01 UTC"
## [17] "2016-04-01 UTC" "2016-05-01 UTC" "2016-06-01 UTC" "2016-07-01 UTC"
## [21] "2016-08-01 UTC" "2016-09-01 UTC" "2016-10-01 UTC" "2016-11-01 UTC"
## [25] "2016-12-01 UTC" "2017-01-01 UTC" "2017-02-01 UTC" "2017-03-01 UTC"
## [29] "2017-04-01 UTC" "2017-05-01 UTC" "2017-06-01 UTC" "2017-07-01 UTC"
## [33] "2017-08-01 UTC" "2017-09-01 UTC" "2017-10-01 UTC" "2017-11-01 UTC"
## [37] "2017-12-01 UTC" "2018-01-01 UTC" "2018-02-01 UTC" "2018-03-01 UTC"
## [41] "2018-04-01 UTC" "2018-05-01 UTC" "2018-06-01 UTC" "2018-07-01 UTC"
## [45] "2018-08-01 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-11-01 UTC"
## [49] "2018-12-01 UTC" "2019-01-01 UTC" "2019-02-01 UTC" "2019-03-01 UTC"
## [53] "2019-04-01 UTC" "2019-05-01 UTC" "2019-06-01 UTC" "2019-07-01 UTC"
## [57] "2019-08-01 UTC" "2019-09-01 UTC" "2019-10-01 UTC" "2019-11-01 UTC"
## [61] "2019-12-01 UTC" "2020-01-01 UTC" "2020-02-01 UTC" "2020-03-01 UTC"
## [65] "2020-04-01 UTC" "2020-05-01 UTC" "2020-06-01 UTC" "2020-07-01 UTC"
## [69] "2020-08-01 UTC" "2020-09-01 UTC" "2020-10-01 UTC" "2020-11-01 UTC"
## [73] "2020-12-01 UTC" "2021-01-01 UTC" "2021-02-01 UTC" "2021-03-01 UTC"
## [77] "2021-04-01 UTC" "2021-05-01 UTC" "2021-06-01 UTC" "2021-07-01 UTC"
## [81] "2021-08-01 UTC" "2021-09-01 UTC" "2021-10-01 UTC" "2021-11-01 UTC"
## [85] "2021-12-01 UTC" "2022-01-01 UTC" "2022-02-01 UTC" "2022-03-01 UTC"
## [89] "2022-04-01 UTC" "2022-05-01 UTC" "2022-06-01 UTC" "2022-07-01 UTC"
## [93] "2022-08-01 UTC" "2022-09-01 UTC" "2022-10-01 UTC" "2022-11-01 UTC"
pre.period <- c(1, 75)
post.period <- c(76, 96)
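Instead of counting by eye, the two index pairs can also be derived from the dates. The sketch below rebuilds the monthly date sequence printed above as a stand-in for `andmed_lai$date`; `dates` and `start_idx` are hypothetical names. With the real column, which is a date-time in UTC, the comparison would presumably be against `as.POSIXct("2021-03-01", tz = "UTC")`.

```r
# Monthly dates covering the same range as the output above
# (hypothetical stand-in for andmed_lai$date).
dates <- seq(as.Date("2014-12-01"), as.Date("2022-11-01"), by = "month")

# Position of the first intervention month.
start_idx <- which(dates == as.Date("2021-03-01"))

# Everything before that is the pre-period, the rest is the post-period.
pre.period <- c(1, start_idx - 1)
post.period <- c(start_idx, length(dates))

pre.period
post.period
```

This gives the same values as typed in by hand: 1 to 75 and 76 to 96.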
Now we run the analysis (telling it to account for 12-period seasonality and to run 5000 iterations instead of the default 1000; in addition, for some unclear reason the data table has to be wrapped in zoo()):
impact <- CausalImpact(zoo(andmed_analysis), pre.period, post.period, model.args = list(niter = 5000, nseasons = 12))
summary(impact)
## Posterior inference {CausalImpact}
##
##                          Average        Cumulative
## Actual                   14             290
## Prediction (s.d.)        4.4 (4)        93.3 (84)
## 95% CI                   [-0.94, 9.4]   [-19.70, 198.4]
##
## Absolute effect (s.d.)   9.4 (4)        196.7 (84)
## 95% CI                   [4.4, 15]      [91.6, 310]
##
## Relative effect (s.d.)   201% (214%)    201% (214%)
## 95% CI                   [-158%, 400%]  [-158%, 400%]
##
## Posterior tail-area probability p: 0.01742
## Posterior prob. of a causal effect: 98.258%
##
## For more details, type: summary(impact, "report")
summary(impact, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 13.81. In the absence of an intervention, we would have expected an average response of 4.44. The 95% interval of this counterfactual prediction is [-0.94, 9.45]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 9.37 with a 95% interval of [4.36, 14.75]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 290.00. Had the intervention not taken place, we would have expected a sum of 93.29. The 95% interval of this prediction is [-19.70, 198.44].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +201%. The 95% interval of this percentage is [-158%, +400%].
##
## This means that, although the intervention appears to have caused a positive effect, this effect is not statistically significant when considering the entire post-intervention period as a whole. Individual days or shorter stretches within the intervention period may of course still have had a significant effect, as indicated whenever the lower limit of the impact time series (lower plot) was above zero. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.017). This means the causal effect can be considered statistically significant.
Let us visualise the results:
plot(impact)
This plot is a ggplot object, so we can modify it, for example like this:
graafik_brazil <- plot(impact)
graafik_brazil +
theme_minimal() +
labs(title = "Impact of opening a pickup-location in Brazil")
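Because `zoo(andmed_analysis)` carries only a running integer index, the x-axis of these plots presumably shows period numbers rather than months. A possible variation, sketched here on made-up toy data, is to give the zoo object a date index via `order.by` and to state the periods as dates, which the CausalImpact documentation describes as a supported way of specifying them. The toy values and the `toy_` names are hypothetical, and the final call is left as a comment because it is an untested assumption for this data set.

```r
library(zoo)

# Monthly dates covering the same range as the data
# (hypothetical stand-in for andmed_lai$date).
dates <- seq(as.Date("2014-12-01"), as.Date("2022-11-01"), by = "month")

# Made-up counts: a response column first, then one control column.
set.seed(1)
toy <- data.frame(
  Brazil = rpois(length(dates), 5),
  Control = rpois(length(dates), 20)
)

# zoo object indexed by date instead of by 1, 2, 3, ...
toy_zoo <- zoo(as.matrix(toy), order.by = dates)

# Periods given as dates rather than row numbers.
pre.period <- as.Date(c("2014-12-01", "2021-02-01"))
post.period <- as.Date(c("2021-03-01", "2022-11-01"))

# Assumed usage, mirroring the call above:
# impact <- CausalImpact(toy_zoo, pre.period, post.period,
#                        model.args = list(niter = 5000, nseasons = 12))
```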
The detailed data can also be extracted from this object if you want to build the plots yourself:
str(impact)
impact$series
Which predictors were used:
plot(impact$model$bsts.model, "coefficients")
OK, this plot is useless; we will redo it later.
The package should be cited like this:
citation("CausalImpact")
##
## To cite CausalImpact in publications use:
##
## Brodersen et al., 2015, Annals of Applied Statistics. Inferring
## causal impact using Bayesian structural time-series models.
## https://research.google/pubs/pub41854/
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Inferring causal impact using {B}ayesian structural time-series models},
## author = {Kay H. Brodersen and Fabian Gallusser and Jim Koehler and Nicolas Remy and Steven L. Scott},
## journal = {Annals of Applied Statistics},
## year = {2014},
## volume = {9},
## pages = {247--274},
## url = {https://research.google/pubs/pub41854/},
## }
Your turn: repeat the same analysis for the remaining countries that received the intervention, and then do it once more with Argentina left out as well, because it clearly also received the intervention via Brazil.