Synthesising Sensitive Student dataset to test machine learning tools for prediction

Introduction

In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. Synthetic data which mimics the original observed data and preserves the relationships between variables but do not contain any disclosive records are one possible solution to this problem (Nowok et al., 2011).

Here we investigate the synthpop package and examine its utility for generating a dataframe to work with facilitating tool produciton with the pretence that the observed data is restricted for legal reasons. Imagine we are a synthesiser, it is our job to convert the real data into a meaningful equivalent which protects against data intruders. We’ll use the Student Math Performance data from Cortez (2008).

Aims

  • Our aim is to produce an equivalent dataset that does not allow identification of any individuals by intruders who have access to a few key variables concerning people of interest.
  • The synthesised data should have similar characteristics when compared to the real observed data set.
## [1] "C:/Users/mammykins/Google Drive/R/student_cortez"

Basic example

The synthpop package aims to provide a user with an easy way of generating synthetic versions of original observed data sets. Via the function syn() a synthetic data set is produced using a single command.
Let’s glimpse() the observed data set (assigned to the object, ods).

glimpse(ods)  
## Observations: 395
## Variables: 33
## $ school     (fctr) GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        (fctr) F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        (int) 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    (fctr) U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    (fctr) GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    (fctr) A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       (int) 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       (int) 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       (fctr) at_home, at_home, at_home, health, other, services...
## $ Fjob       (fctr) teacher, other, other, services, other, other, oth...
## $ reason     (fctr) course, course, other, home, home, reputation, hom...
## $ guardian   (fctr) mother, father, mother, mother, father, mother, mo...
## $ traveltime (int) 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  (int) 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   (int) 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  (fctr) yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     (fctr) no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       (fctr) no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities (fctr) no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ nursery    (fctr) yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ higher     (fctr) yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ internet   (fctr) no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ romantic   (fctr) no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     (int) 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   (int) 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      (int) 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       (int) 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       (int) 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     (int) 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   (int) 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
## $ G1         (int) 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
## $ G2         (int) 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
## $ G3         (int) 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...

Let’s make this reproducible by fixing the seed for the pseudo-random number generator.

my.seed <- 1337 
# Here we sunthesiste a data set using default methods of syn()
sds.default <- syn(ods, seed = my.seed)  #  progress is printed to console
## Variable(s):  Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health numeric but with fewer than 5 levels turned into factor(s).
## 
## syn  variables
## 1    school sex age address famsize Pstatus Medu Fedu Mjob Fjob
##      reason guardian traveltime studytime failures schoolsup famsup paid activities nursery
##      higher internet romantic famrel freetime goout Dalc Walc health absences
##      G1 G2 G3

The function produces an object or list comprised of 22 components.

names(sds.default)
##  [1] "call"             "m"                "syn"             
##  [4] "method"           "visit.sequence"   "predictor.matrix"
##  [7] "smoothing"        "event"            "denom"           
## [10] "proper"           "n"                "k"               
## [13] "rules"            "rvalues"          "cont.na"         
## [16] "semicont"         "drop.not.used"    "drop.pred.only"  
## [19] "seed"             "var.lab"          "val.lab"         
## [22] "obs.vars"
#str(sds.default)  #  a detailed idea of the structure, but lengthy

The package provides the compare() function. It is a generic function for comparison of various aspects of synthesised and observed data. The function invokes particular methods depending on the class of the first argument.
Here we compare the G3 attainment as well as the other variables that our previous analysis with a decision tree showed to be useful in predicting G3 pass or fail.

compare(sds.default, ods, vars = c("G2", "Fjob", "reason", "G3"))
## 
## Comparing percentages observed with synthetic
## 
## $G2
##                  0 1 2         3        4        5        6         7
## observed  3.291139 0 0 0.2531646 3.797468 3.544304 5.316456  8.101266
## synthetic 4.050633 0 0 0.0000000 2.025316 3.037975 5.063291 11.898734
##                  8        9       10       11       12       13       14
## observed  12.65823 11.64557 8.860759 10.37975 9.367089 5.822785 8.607595
## synthetic 13.41772 11.89873 9.620253 10.37975 9.367089 5.822785 6.835443
##                 15        16       17        18
## observed  3.291139 1.2658228 3.037975 0.7594937
## synthetic 3.037975 0.2531646 2.531646 0.7594937
## 
## $Fjob
##            at_home   health    other services  teacher
## observed  5.063291 4.556962 54.93671 28.10127 7.341772
## synthetic 3.797468 3.037975 56.70886 30.88608 5.569620
## 
## $reason
##             course     home    other reputation
## observed  36.70886 27.59494 9.113924   26.58228
## synthetic 34.17722 28.60759 9.113924   28.10127
## 
## $G3
##                  0 1 2         3        4        5        6         7
## observed  9.620253 0 0 0.2531646 1.772152 3.797468 2.278481  8.101266
## synthetic 4.303797 0 0 0.2531646 2.531646 5.822785 3.037975 10.379747
##                  8        9       10       11       12       13       14
## observed  7.088608 14.17722 11.89873 7.848101 7.848101 6.835443 8.354430
## synthetic 7.848101 15.18987 12.91139 6.835443 9.873418 6.835443 6.835443
##                 15        16       17        18        19
## observed  4.050633 1.5189873 3.037975 1.2658228 0.2531646
## synthetic 3.037975 0.7594937 2.784810 0.5063291 0.2531646

Evaluation

The synthetic data set does pretty well except for the G3 where the weird distribution causes a noticeable discrepancy between the observed and synthetic for students scoring zero. Errors at the tails of a distribution could be problematic. However, as the pass and fail boundary is at the half-way point we notice these are more similar. We’ll give the function some slack as we used the default methods, perhaps we can tweak a better fit.

Compare fitted generalised linear model

We estimate the ods using generalised linear model implemented in R glm() function. We compare the coeffecient estimates

m.ods <- glm(G3 ~ G2 + sex + age + Fjob, data = ods, family = "gaussian")  # linear model, real data
m.ods
## 
## Call:  glm(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian", 
##     data = ods)
## 
## Coefficients:
##  (Intercept)            G2          sexM           age    Fjobhealth  
##       0.2059        1.0956        0.1939       -0.1007        0.4801  
##    Fjobother  Fjobservices   Fjobteacher  
##       0.1702       -0.2339        0.1553  
## 
## Degrees of Freedom: 394 Total (i.e. Null);  387 Residual
## Null Deviance:       8270 
## Residual Deviance: 1471  AIC: 1658

Note the use of glm.synds here:

m.sds <- glm.synds(G3 ~ G2 + sex + age + Fjob, data = sds.default, family = "gaussian") # synth data
m.sds
## 
## Call:
## glm.synds(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian", 
##     data = sds.default)
## 
## Combined coefficient estimates:
##  (Intercept)           G2         sexM          age   Fjobhealth 
##   1.30662664   0.87422795   0.21764512  -0.02684721   0.95276818 
##    Fjobother Fjobservices  Fjobteacher 
##   0.34931705   0.50794834   0.32453025

Function compare() allows the synthesiser to compare the estimates based on the synthesised data sets with those based on the original data and presents the results in both tabular and graphical form.

compare(m.sds, ods)
## 
## Call used to fit models to the data:
## glm.synds(formula = G3 ~ G2 + sex + age + Fjob, family = "gaussian", 
##     data = sds.default)
## 
## Estimates for the observed data set:
##                    Beta   se(Beta)          Z
## (Intercept)   0.2059053 1.48114650  0.1390175
## G2            1.0955770 0.02665090 41.1084353
## sexM          0.1939183 0.19851243  0.9768571
## age          -0.1007112 0.07864223 -1.2806248
## Fjobhealth    0.4801359 0.63920845  0.7511413
## Fjobother     0.1701975 0.45854165  0.3711712
## Fjobservices -0.2339209 0.47569709 -0.4917433
## Fjobteacher   0.1552785 0.57241615  0.2712685
## 
## Combined estimates for the synthetised data set(s):
##                    B.syn se(Beta).syn  se(B.syn)      Z.syn
## (Intercept)   1.30662664   1.52389772 1.52389772  0.8574241
## G2            0.87422795   0.02904873 0.02904873 30.0952184
## sexM          0.21764512   0.21135399 0.21135399  1.0297658
## age          -0.02684721   0.08285669 0.08285669 -0.3240198
## Fjobhealth    0.95276818   0.80856798 0.80856798  1.1783402
## Fjobother     0.34931705   0.55902263 0.55902263  0.6248710
## Fjobservices  0.50794834   0.57328561 0.57328561  0.8860302
## Fjobteacher   0.32453025   0.70388129 0.70388129  0.4610582
## 
## Differences synthetic vs. observed:
##              Std. coef diff p value CI overlap
## G2               -7.6199208   0.000  0.0000000
## sexM              0.1122611   0.911  0.9696207
## age               0.8914667   0.373  0.7671729
## Fjobhealth        0.5845301   0.559  0.8450050
## Fjobother         0.3204156   0.749  0.9101280
## Fjobservices      1.2940656   0.196  0.6447500
## Fjobteacher       0.2404550   0.810  0.9066141
## 
## Confidence interval plot:

Conclusion

From both original and synthetic data we conclude that only the student’s G2 score is a useful predictor, the other variables were uninformative given this particular model structure. It is clear the default methods are struggling with the attainment scores distribution so this will require some fine tuning, this may also be a relic of poor model choice. The fact that the results from synthetic data can have a similar pattern to the results from the real data is encouraging for further developments of synthetic data tools.

References

  • Cortez and Silva (2008). Using data mining to predict secondary school performance
  • Dreschler (2010). Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata
  • Nowok (2011). Synthpop Package in R. Also see the vignette for illustrative example.
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] synthpop_1.2-0  ggplot2_2.0.0   nnet_7.3-11     MASS_7.3-45    
## [5] lattice_0.20-33 dplyr_0.4.3    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3         formatR_1.2.1       plyr_1.8.3         
##  [4] tools_3.2.3         rpart_4.1-10        digest_0.6.9       
##  [7] polspline_1.1.12    evaluate_0.8        gtable_0.1.2       
## [10] DBI_0.3.1           yaml_2.1.13         parallel_3.2.3     
## [13] mvtnorm_1.0-3       coin_1.1-2          proto_0.3-10       
## [16] stringr_1.0.0       knitr_1.12          stats4_3.2.3       
## [19] grid_3.2.3          R6_2.1.1            survival_2.38-3    
## [22] foreign_0.8-66      rmarkdown_0.9.2     multcomp_1.4-1     
## [25] TH.data_1.0-6       magrittr_1.5        scales_0.3.0       
## [28] codetools_0.2-14    htmltools_0.3       modeltools_0.2-21  
## [31] splines_3.2.3       randomForest_4.6-12 assertthat_0.1     
## [34] strucchange_1.5-1   colorspace_1.2-6    labeling_0.3       
## [37] sandwich_2.3-4      stringi_1.0-1       party_1.0-25       
## [40] munsell_0.4.2       zoo_1.7-12