In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. Synthetic data which mimics the original observed data and preserves the relationships between variables but do not contain any disclosive records are one possible solution to this problem (Nowok et al., 2011).
Here we investigate the synthpop package and examine its utility for generating a dataframe to work with facilitating tool produciton with the pretence that the observed data is restricted for legal reasons. Imagine we are a synthesiser, it is our job to convert the real data into a meaningful equivalent which protects against data intruders. We’ll use the Student Math Performance data from Cortez (2008).
## [1] "C:/Users/mammykins/Google Drive/R/student_cortez"
The synthpop package aims to provide a user with an easy way of generating synthetic versions of original observed data sets. Via the function syn() a synthetic data set is produced using a single command.
Let’s glimpse() the observed data set (assigned to the object, ods).
glimpse(ods)
## Observations: 395
## Variables: 33
## $ school (fctr) GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex (fctr) F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age (int) 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address (fctr) U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize (fctr) GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus (fctr) A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu (int) 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu (int) 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob (fctr) at_home, at_home, at_home, health, other, services...
## $ Fjob (fctr) teacher, other, other, services, other, other, oth...
## $ reason (fctr) course, course, other, home, home, reputation, hom...
## $ guardian (fctr) mother, father, mother, mother, father, mother, mo...
## $ traveltime (int) 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime (int) 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures (int) 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup (fctr) yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup (fctr) no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid (fctr) no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities (fctr) no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ nursery (fctr) yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ higher (fctr) yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ internet (fctr) no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ romantic (fctr) no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel (int) 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime (int) 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout (int) 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc (int) 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc (int) 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health (int) 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences (int) 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
## $ G1 (int) 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
## $ G2 (int) 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
## $ G3 (int) 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...
Let’s make this reproducible by fixing the seed for the pseudo-random number generator.
my.seed <- 1337
# Here we sunthesiste a data set using default methods of syn()
sds.default <- syn(ods, seed = my.seed) # progress is printed to console
## Variable(s): Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health numeric but with fewer than 5 levels turned into factor(s).
##
## syn variables
## 1 school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## reason guardian traveltime studytime failures schoolsup famsup paid activities nursery
## higher internet romantic famrel freetime goout Dalc Walc health absences
## G1 G2 G3
The function produces an object or list comprised of 22 components.
names(sds.default)
## [1] "call" "m" "syn"
## [4] "method" "visit.sequence" "predictor.matrix"
## [7] "smoothing" "event" "denom"
## [10] "proper" "n" "k"
## [13] "rules" "rvalues" "cont.na"
## [16] "semicont" "drop.not.used" "drop.pred.only"
## [19] "seed" "var.lab" "val.lab"
## [22] "obs.vars"
#str(sds.default) # a detailed idea of the structure, but lengthy
The package provides the compare() function. It is a generic function for comparison of various aspects of synthesised and observed data. The function invokes particular methods depending on the class of the first argument.
Here we compare the G3 attainment as well as the other variables that our previous analysis with a decision tree showed to be useful in predicting G3 pass or fail.
compare(sds.default, ods, vars = c("G2", "Fjob", "reason", "G3"))
##
## Comparing percentages observed with synthetic
##
## $G2
## 0 1 2 3 4 5 6 7
## observed 3.291139 0 0 0.2531646 3.797468 3.544304 5.316456 8.101266
## synthetic 4.050633 0 0 0.0000000 2.025316 3.037975 5.063291 11.898734
## 8 9 10 11 12 13 14
## observed 12.65823 11.64557 8.860759 10.37975 9.367089 5.822785 8.607595
## synthetic 13.41772 11.89873 9.620253 10.37975 9.367089 5.822785 6.835443
## 15 16 17 18
## observed 3.291139 1.2658228 3.037975 0.7594937
## synthetic 3.037975 0.2531646 2.531646 0.7594937
##
## $Fjob
## at_home health other services teacher
## observed 5.063291 4.556962 54.93671 28.10127 7.341772
## synthetic 3.797468 3.037975 56.70886 30.88608 5.569620
##
## $reason
## course home other reputation
## observed 36.70886 27.59494 9.113924 26.58228
## synthetic 34.17722 28.60759 9.113924 28.10127
##
## $G3
## 0 1 2 3 4 5 6 7
## observed 9.620253 0 0 0.2531646 1.772152 3.797468 2.278481 8.101266
## synthetic 4.303797 0 0 0.2531646 2.531646 5.822785 3.037975 10.379747
## 8 9 10 11 12 13 14
## observed 7.088608 14.17722 11.89873 7.848101 7.848101 6.835443 8.354430
## synthetic 7.848101 15.18987 12.91139 6.835443 9.873418 6.835443 6.835443
## 15 16 17 18 19
## observed 4.050633 1.5189873 3.037975 1.2658228 0.2531646
## synthetic 3.037975 0.7594937 2.784810 0.5063291 0.2531646
The synthetic data set does pretty well except for the G3 where the weird distribution causes a noticeable discrepancy between the observed and synthetic for students scoring zero. Errors at the tails of a distribution could be problematic. However, as the pass and fail boundary is at the half-way point we notice these are more similar. We’ll give the function some slack as we used the default methods, perhaps we can tweak a better fit.
We estimate the ods using generalised linear model implemented in R glm() function. We compare the coeffecient estimates
m.ods <- glm(G3 ~ G2 + sex + age + Fjob, data = ods, family = "gaussian") # linear model, real data
m.ods
##
## Call: glm(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian",
## data = ods)
##
## Coefficients:
## (Intercept) G2 sexM age Fjobhealth
## 0.2059 1.0956 0.1939 -0.1007 0.4801
## Fjobother Fjobservices Fjobteacher
## 0.1702 -0.2339 0.1553
##
## Degrees of Freedom: 394 Total (i.e. Null); 387 Residual
## Null Deviance: 8270
## Residual Deviance: 1471 AIC: 1658
Note the use of glm.synds here:
m.sds <- glm.synds(G3 ~ G2 + sex + age + Fjob, data = sds.default, family = "gaussian") # synth data
m.sds
##
## Call:
## glm.synds(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian",
## data = sds.default)
##
## Combined coefficient estimates:
## (Intercept) G2 sexM age Fjobhealth
## 1.30662664 0.87422795 0.21764512 -0.02684721 0.95276818
## Fjobother Fjobservices Fjobteacher
## 0.34931705 0.50794834 0.32453025
Function compare() allows the synthesiser to compare the estimates based on the synthesised data sets with those based on the original data and presents the results in both tabular and graphical form.
compare(m.sds, ods)
##
## Call used to fit models to the data:
## glm.synds(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian",
## data = sds.default)
##
## Estimates for the observed data set:
## Beta se(Beta) Z
## (Intercept) 0.2059053 1.48114650 0.1390175
## G2 1.0955770 0.02665090 41.1084353
## sexM 0.1939183 0.19851243 0.9768571
## age -0.1007112 0.07864223 -1.2806248
## Fjobhealth 0.4801359 0.63920845 0.7511413
## Fjobother 0.1701975 0.45854165 0.3711712
## Fjobservices -0.2339209 0.47569709 -0.4917433
## Fjobteacher 0.1552785 0.57241615 0.2712685
##
## Combined estimates for the synthetised data set(s):
## B.syn se(Beta).syn se(B.syn) Z.syn
## (Intercept) 1.30662664 1.52389772 1.52389772 0.8574241
## G2 0.87422795 0.02904873 0.02904873 30.0952184
## sexM 0.21764512 0.21135399 0.21135399 1.0297658
## age -0.02684721 0.08285669 0.08285669 -0.3240198
## Fjobhealth 0.95276818 0.80856798 0.80856798 1.1783402
## Fjobother 0.34931705 0.55902263 0.55902263 0.6248710
## Fjobservices 0.50794834 0.57328561 0.57328561 0.8860302
## Fjobteacher 0.32453025 0.70388129 0.70388129 0.4610582
##
## Differences synthetic vs. observed:
## Std. coef diff p value CI overlap
## G2 -7.6199208 0.000 0.0000000
## sexM 0.1122611 0.911 0.9696207
## age 0.8914667 0.373 0.7671729
## Fjobhealth 0.5845301 0.559 0.8450050
## Fjobother 0.3204156 0.749 0.9101280
## Fjobservices 1.2940656 0.196 0.6447500
## Fjobteacher 0.2404550 0.810 0.9066141
##
## Confidence interval plot:
From both original and synthetic data we conclude that only the student’s G2 score is a useful predictor, the other variables were uninformative given this particular model structure. It is clear the default methods are struggling with the attainment scores distribution so this will require some fine tuning, this may also be a relic of poor model choice. The fact that the results from synthetic data can have a similar pattern to the results from the real data is encouraging for further developments of synthetic data tools.
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] synthpop_1.2-0 ggplot2_2.0.0 nnet_7.3-11 MASS_7.3-45
## [5] lattice_0.20-33 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 formatR_1.2.1 plyr_1.8.3
## [4] tools_3.2.3 rpart_4.1-10 digest_0.6.9
## [7] polspline_1.1.12 evaluate_0.8 gtable_0.1.2
## [10] DBI_0.3.1 yaml_2.1.13 parallel_3.2.3
## [13] mvtnorm_1.0-3 coin_1.1-2 proto_0.3-10
## [16] stringr_1.0.0 knitr_1.12 stats4_3.2.3
## [19] grid_3.2.3 R6_2.1.1 survival_2.38-3
## [22] foreign_0.8-66 rmarkdown_0.9.2 multcomp_1.4-1
## [25] TH.data_1.0-6 magrittr_1.5 scales_0.3.0
## [28] codetools_0.2-14 htmltools_0.3 modeltools_0.2-21
## [31] splines_3.2.3 randomForest_4.6-12 assertthat_0.1
## [34] strucchange_1.5-1 colorspace_1.2-6 labeling_0.3
## [37] sandwich_2.3-4 stringi_1.0-1 party_1.0-25
## [40] munsell_0.4.2 zoo_1.7-12