Tutorial {simPop}

class: center, middle, inverse, title-slide

# Tutorial {simPop}
## Simulation of Synthetic Complex Data
### Shang Chi Lee
### Department of Public Health, NCKU
### 2021-12-27

---

## Overview
###- Synthetic population: Methods and code
####- Synthetic data
####- Creating the household structure 
####- Adding categorical variables
####- Adding continuous variables
####- Simulation of components
###- Data utility of the simulated population
###- Conclusions

---
#Synthetic population: Methods and code 
#Input data

```r
pacman::p_load(simPop, sas7bdat)
 set.seed(1234)
```

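The input data come from the National Health Insurance database maintained by the National Health Research Institutes (NHRI).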
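
The `specifyInput()` call used later needs a household identifier and a household-size column. The slides do not show how `hsize` was derived, so the following is only a minimal sketch on a hypothetical toy data frame (`raw_toy` and its values are made up for illustration), assuming that `ID` identifies the household:

```r
# Hypothetical toy data: one row per person, ID treated as the household identifier
raw_toy <- data.frame(
  ID     = c(1, 1, 2, 3, 3, 3),
  ID_Sex = c("F", "M", "F", "M", "F", "F"),
  age    = c(34, 36, 71, 45, 43, 12)
)

# Household size = number of rows sharing the same ID
raw_toy$hsize <- ave(raw_toy$ID, raw_toy$ID, FUN = length)

raw_toy
```

Each person row then carries the size of its own household, which is the layout the `hhsize` argument expects.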

---
#Codebook

.pull-left[
<img src="img/note.png" width="100%" />
]

.pull-right[

Field name                  | Description
----------------------------|---------------------------------
`TX_CODE`                   | Change type (type of enrollment change)
`ID_IN_DATE`                | Enrollment date
`ID_OUT_DATE`               | Withdrawal date

]

---

#simPop format summary tables

Function                    | Description
----------------------------|---------------------------------
`addWeights()`              | add weights to a ‘simPopObj’ object
`calibSample()`             | calibrate the sample 
`specifyInput()`            | create a ‘dataObj’ object
`simStructure()`            | simulate the household structure
`simCategorical()`          | simulate categorical variables
`simContinuous()`           | simulate continuous variables

---
#Synthetic population: Methods and code
##synthetic data

```r
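# Create an object of class "dataObj" with specifyInput()
# Note: `weight` expects the column holding the sampling weights; this run passes "age"
inp <- specifyInput(raw, hhid = "ID", hhsize = "hsize", strata = "ID_Sex", weight = "age")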
```

```r
print(inp)
```

```
## 
##  -------------- 
## survey sample of size 10000 x 17 
## 
##  Selected important variables: 
## 
##  household ID: ID
##  personal ID: pid
##  variable household size: hsize
##  sampling weight: age
##  strata: ID_Sex
##  --------------
```

---
### Creating the household structure
`INS_ID_TYPE1`: insured identity type

```r
synthP <- simStructure(data = inp, method = "direct",
                       basicHHvars = c("age", "ID_Sex", "INS_ID_TYPE1"))
synthP
```

```
## 
## -------------- 
## synthetic population  of size 
##  560874 x 7 
## 
## build from a sample of size 
## 10000 x 17
## -------------- 
## 
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight
```

---
### Adding categorical variables
`INS_relation`: relationship-to-insured code; `REG_ZIP_CODE`: residential area (ZIP) code

```r
synthP <- simCategorical(synthP, 
                         additional=c("INS_relation","REG_ZIP_CODE"),
                         method = "distribution", nr_cpus = 1)
```

```
## [1] "age"          "INS_ID_TYPE1"
```

```r
print(synthP)
```

```
## 
## -------------- 
## synthetic population  of size 
##  560874 x 9 
## 
## build from a sample of size 
## 10000 x 17
## -------------- 
## 
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE
```

---
### Adding continuous variables

.pull-left[

```r
synthP <- simContinuous(synthP, additional = "INS_AMT1" ,
                        upper = 200000 ,equidist = FALSE, 
                        nr_cpus = 1)
```

```r
synthP
```

```
## 
## -------------- 
## synthetic population  of size 
##  560874 x 11 
## 
## build from a sample of size 
## 10000 x 17
## -------------- 
## 
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE,INS_AMT1Cat,INS_AMT1
```
]

.pull-right[

#### Alternatively, fit a linear regression model instead of the default multinomial approach: `synthP <- simContinuous(synthP, additional = "INS_AMT1", method = "lm")`
]
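
The extra `INS_AMT1Cat` column in the output reflects the default multinomial approach: the continuous variable is first cut into categories, a category is predicted for each synthetic person, and a value is then drawn inside that category. The base-R sketch below mimics those steps on hypothetical toy values. As a stated simplification, categories are drawn from their marginal sample frequencies instead of from a fitted multinomial model, and no special treatment is applied to the top category:

```r
set.seed(1234)

# Hypothetical sample incomes and breakpoints
income <- c(1200, 3500, 800, 15000, 4200, 2600, 9800, 500)
breaks <- c(0, 1000, 5000, 20000)

# Step 1: discretize the continuous variable
income_cat <- cut(income, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Step 2: draw a category for each of 1000 synthetic persons
# (simplification: marginal frequencies stand in for a multinomial model)
probs   <- as.numeric(prop.table(table(income_cat)))
pop_cat <- sample(levels(income_cat), size = 1000, replace = TRUE, prob = probs)

# Step 3: draw a value uniformly within the bounds of the drawn category
k <- match(pop_cat, levels(income_cat))
pop_income <- runif(1000, min = breaks[k], max = breaks[k + 1])

summary(pop_income)
```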

---
### Simulation of components

`Unit_INS_type`: occupational category of the insuring unit (civil servants, farmers, fishers, occupational unions, ...)

```r
# Income and weights from the sample, income from the synthetic population
sIncome <- manageSimPopObj(synthP, var = "INS_AMT1", sample = TRUE)
sWeight <- manageSimPopObj(synthP, var = "age", sample = TRUE)
pIncome <- manageSimPopObj(synthP, var = "INS_AMT1")
# Weighted breakpoints, then store the categorized income as Unit_INS_type
# in both the sample and the population
breaks <- getBreaks(x = sIncome, w = sWeight, upper = Inf, equidist = FALSE)
synthP <- manageSimPopObj(synthP, var = "Unit_INS_type", sample = TRUE,
                          set = TRUE, values = getCat(x = sIncome, breaks))
synthP <- manageSimPopObj(synthP, var = "Unit_INS_type", sample = FALSE,
                          set = TRUE, values = getCat(x = pIncome, breaks))
```

```r
synthP 
```

```
## 
## -------------- 
## synthetic population  of size 
##  560874 x 12 
## 
## build from a sample of size 
## 10000 x 17
## -------------- 
## 
## variables in the population:
## ID,hsize,age,ID_Sex,INS_ID_TYPE1,pid,weight,INS_relation,REG_ZIP_CODE,INS_AMT1Cat,INS_AMT1,Unit_INS_type
```

---
##Data utility of the simulated population

`INS_ID_TYPE` (insured identity type): 1 = employer, 2 = general, 4 = farmer, 6 = parental leave

```r
tab <- spTable(synthP, select = c("ID_Sex", "INS_ID_TYPE1", "hsize"))
spMosaic(tab, labeling = labeling_border(abbreviate = c(ID = TRUE)))
```

![](20211226_Tutorial-simpop_SCL_files/figure-html/unnamed-chunk-13-1.png)
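
The mosaic plot compares the cell counts expected from the weighted sample with the counts realized in the synthetic population. The same comparison can be sketched numerically in base R. All data below are hypothetical toy values, and a single variable is used to keep it short:

```r
# Hypothetical weighted sample and synthetic population
samp <- data.frame(sex = c("F", "F", "M", "M", "M"),
                   w   = c(100, 120, 90, 110, 80))
pop  <- data.frame(sex = rep(c("F", "M"), times = c(215, 285)))

# Expected counts: sum of sampling weights per cell
expected <- xtabs(w ~ sex, data = samp)
# Realized counts: number of synthetic persons per cell
realized <- table(pop$sex)

comparison <- data.frame(
  sex      = names(expected),
  expected = as.numeric(expected),
  realized = as.numeric(realized)
)
# Relative deviation of the synthetic population from the expectation
comparison$rel_diff <- (comparison$realized - comparison$expected) / comparison$expected

comparison
```

Small relative deviations in every cell suggest that the synthetic population reproduces the structure of the weighted sample.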

---
##Data utility of the simulated population

```r
spCdfplot(synthP, "INS_AMT1", cond = "ID_Sex", layout = c(1, 2))
```

![](20211226_Tutorial-simpop_SCL_files/figure-html/unnamed-chunk-14-1.png)
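
The CDF plot overlays the weighted distribution of the sample with the unweighted distribution of the synthetic population. A minimal base-R sketch of that comparison is shown below. The helper `w_ecdf()` and all values are hypothetical and only meant to illustrate the idea:

```r
set.seed(1234)

# Weighted empirical CDF: cumulative share of the weights, ordered by x
w_ecdf <- function(x, w) {
  ord <- order(x)
  data.frame(x = x[ord], cdf = cumsum(w[ord]) / sum(w))
}

# Hypothetical sample incomes with sampling weights
samp_income <- c(500, 1200, 2600, 9800)
samp_weight <- c(10, 30, 40, 20)
res <- w_ecdf(samp_income, samp_weight)

# Hypothetical synthetic population drawn in proportion to the weights
pop_income <- sample(samp_income, size = 1000, replace = TRUE, prob = samp_weight)

# Unweighted CDF of the population, evaluated at the sample values
res$pop_cdf <- ecdf(pop_income)(res$x)
res$abs_diff <- abs(res$cdf - res$pop_cdf)

res
```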

---
##Data utility of the simulated population
For women, the simulated income distribution does not match the sample very closely.

```r
 spBwplot(synthP, x = "INS_AMT1", cond = "ID_Sex", layout = c(1, 2))
```

![](20211226_Tutorial-simpop_SCL_files/figure-html/unnamed-chunk-15-1.png)

---
##Conclusions

.pull-left[

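### Advantages
####- A synthetic sample with the same structure as the original data can serve as the survey sample
####- Works even when the original sample has missing values
####- Avoids research-ethics concerns
####- Can be validated against the original data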
]

.pull-right[

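### Further improvements

####- Need to check whether the synthetic sample contains many missing values
####- Small-area data can be synthesized, but fine-grained conditions cannot be specified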

]

---
##Personal takeaways

.pull-left[

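Takeaways                       
----------------------------  
#### 1. Well suited to future incidence or prevalence studies
#### 2. When the topic is unfamiliar, this method can be used first to draw test samples
#### 3. Simulated data can be used for testing when discussing research with advisors and classmates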

]

.pull-right[

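Difficulties
----------------------------  
#### 1. Many detailed functions still need testing, and calibration is slow
#### 2. The method targets population sampling; can it be generalized to clinical data (are the methods transferable)?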

]

---

## Happy New Year 2022