class: center, middle, inverse, title-slide

# Tutorial {simPop}
## Simulation of Synthetic Complex Data
### Shang Chi Lee
### Department of Public Health, NCKU
### 2021-12-27

---

## Overview

###- Synthetic population: Methods and code
####- Synthetic data
####- Creating the household structure
####- Adding categorical variables
####- Adding continuous variables
####- Simulation of components
###- Data utility of the simulated population
###- Conclusions

---

#Synthetic population: Methods and code
#Input data

```r
# Load simPop and sas7bdat (p_load installs them if missing)
pacman::p_load(simPop, sas7bdat)
# Fix the random seed so the simulation is reproducible
set.seed(1234)
```

Data source: the National Health Insurance database of the National Health Research Institutes (NHRI).

<img src="img/data0.png" width="100%" />

---

#Codebook

.pull-left[
<img src="img/note.png" width="100%" />
]

.pull-right[

Field name | Description
----------------------------|---------------------------------
`TX_CODE` | Change type
`ID_IN_DATE` | Enrollment date
`ID_OUT_DATE` | Withdrawal date

]

---

#simPop format summary tables

Function | Description
----------------------------|---------------------------------
`addWeights()` | add weights to a ‘simPopObj’ object
`calibSample()` | calibrate the sample
`specifyInput()` | create a ‘dataObj’ object
`simStructure()` | simulate the household structure
`simCategorical()` | simulate categorical variables
`simContinuous()` | simulate continuous variables

---

#Synthetic population: Methods and code
##Synthetic data

```r
# Create an object of class "dataObj" with specifyInput()
# hhid = household ID, hhsize = household size,
# strata = stratification variable, weight = sampling-weight column
inp <- specifyInput(raw, hhid = "ID", hhsize = "hsize",
                    strata = "ID_Sex", weight = "age")
```

```r
print(inp)
```

```
##
## --------------
## survey sample of size 10000 x 17
##
## Selected important variables:
##
## household ID: ID
## personal ID: pid
## variable household size: hsize
## sampling weight: age
## strata: ID_Sex
## --------------
```

---

### Creating the household structure

`INS_ID_TYPE1`: insured identity type

```r
# Simulate the household structure, carrying over the basic household variables
synthP <- simStructure(data = inp, method = "direct",
                       basicHHvars = c("age", "ID_Sex", "INS_ID_TYPE1"))
synthP
```

```
##
## --------------
## synthetic population of size
## 560874 x 7
##
## build from a sample of size
## 10000 x 17
## --------------
##
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight
```

---

### Adding categorical variables

`INS_relation`: relationship (title) code

`REG_ZIP_CODE`: residential area code

```r
# Add two categorical variables to the synthetic population
synthP <- simCategorical(synthP,
                         additional = c("INS_relation", "REG_ZIP_CODE"),
                         method = "distribution", nr_cpus = 1)
```

```
## [1] "age" "INS_ID_TYPE1"
```

```r
print(synthP)
```

```
##
## --------------
## synthetic population of size
## 560874 x 9
##
## build from a sample of size
## 10000 x 17
## --------------
##
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE
```

---

### Adding continuous variables

.pull-left[

```r
# Add the continuous variable INS_AMT1, with an upper bound of 200000
synthP <- simContinuous(synthP, additional = "INS_AMT1",
                        upper = 200000, equidist = FALSE, nr_cpus = 1)
```

```r
synthP
```

```
##
## --------------
## synthetic population of size
## 560874 x 11
##
## build from a sample of size
## 10000 x 17
## --------------
##
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE,INS_AMT1Cat,INS_AMT1
```

]

.pull-right[

#### Alternatively, the call can be written with `method = "lm"`: `synthP <- simContinuous(synthP, additional = "INS_AMT1", method = "lm")`

]

---

### Simulation of components

`Unit_INS_type`: occupation category (civil servants, farmers, fishers, occupational unions, ...)
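The step on this slide bins a continuous variable at breakpoints and stores the resulting category as a new variable. A minimal base-R sketch of that idea is shown here, using made-up values and unweighted quantiles. Both are assumptions for illustration, and this is not how `getBreaks()` or `getCat()` are implemented.

```r
# Hypothetical income values, invented for this sketch only
income <- c(0, 12000, 25000, 25000, 48000, 90000, 150000)

# Breakpoints at the quartiles of the values (unweighted, as an assumption);
# unique() guards against duplicated breakpoints when values are tied
breaks <- unique(quantile(income, probs = c(0, 0.25, 0.5, 0.75, 1),
                          names = FALSE))

# Assign each value to the interval it falls into;
# include.lowest = TRUE keeps the minimum value inside the first interval
income_cat <- cut(income, breaks = breaks, include.lowest = TRUE)

# Frequency of each category
table(income_cat)
```

The same pattern, breakpoints first and category assignment second, is applied once to the sample values and once to the population values.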
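```r
# Income in the sample, the sample weight column (age), and income in the population
sIncome <- manageSimPopObj(synthP, var = "INS_AMT1", sample = TRUE)
sWeight <- manageSimPopObj(synthP, var = "age", sample = TRUE)
pIncome <- manageSimPopObj(synthP, var = "INS_AMT1")

# Weighted breakpoints computed from the sample
breaks <- getBreaks(x = sIncome, w = sWeight, upper = Inf, equidist = FALSE)

# Store the binned income as Unit_INS_type, first in the sample, then in the population
synthP <- manageSimPopObj(synthP, var = "Unit_INS_type", sample = TRUE,
                          set = TRUE, values = getCat(x = sIncome, breaks))
synthP <- manageSimPopObj(synthP, var = "Unit_INS_type", sample = FALSE,
                          set = TRUE, values = getCat(x = pIncome, breaks))
```

```r
synthP
```

```
##
## --------------
## synthetic population of size
## 560874 x 12
##
## build from a sample of size
## 10000 x 17
## --------------
##
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE,INS_AMT1Cat,INS_AMT1,Unit_INS_type
```

---

##Data utility of the simulated population

`INS_ID_TYPE` insured identity type: 1 = employer, 2 = general, 4 = farmer, 6 = parental leave

```r
# Cross-tabulate sample and population, then compare them in a mosaic plot
tab <- spTable(synthP, select = c("ID_Sex", "INS_ID_TYPE1", "hsize"))
spMosaic(tab, labeling = labeling_border(abbreviate = c(ID = TRUE)))
```

---

##Data utility of the simulated population

```r
# Compare the cumulative distribution of INS_AMT1 by sex
spCdfplot(synthP, "INS_AMT1", cond = "ID_Sex", layout = c(1, 2))
```

---

##Data utility of the simulated population

For women, the simulated income distribution does not match the sample closely.

```r
# Box plots of INS_AMT1 by sex, sample versus population
spBwplot(synthP, x = "INS_AMT1", cond = "ID_Sex", layout = c(1, 2))
```

---

##Conclusions

.pull-left[

### Advantages
####- A synthetic sample with the same structure as the original can serve as the survey sample
####- Usable even when the original sample has missing values
####- No research ethics issues
####- Can be validated against the original data

]

.pull-right[

### Further improvements
####- Need to test whether the synthetic sample contains many missing values
####- Small-area data can be synthesized, but fine-grained conditions cannot be specified

]

---

##Personal takeaways

.pull-left[

Gains
----------------------------
#### 1. Well suited to future incidence or prevalence studies
#### 2. When a topic is unfamiliar, this method can be used first for trial sampling
#### 3. Useful for simulated tests when discussing research with teachers and classmates

]

.pull-right[

Difficulties
----------------------------
#### 1. Many detailed functions still need to be tested, and calibration is slow
#### 2. The method is designed for population sampling; can it be extended to clinical data (is the method transferable)?

]

---

## Happy New Year 2022

<img src="img/happy1.jpg" width="40%" />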