** Part 1. 27 October 2021, 15:30 to 18:20
** Part 2. 3 November 2021, 15:30 to 18:20
Alvaro Passi-Solar (arpassi@uc.cl)
Thanks to Dr Shaun Scholes. I used some slides from the class “Basic Statistics for Medical Sciences” (Stata), November 29th, 2019
• Identify complex survey design features
• Specify survey design
• Perform survey analysis: mean, cross-tabulations, and proportions
• Perform survey analysis: linear and logistic regressions
• Perform sub-population analysis (correctly)
• Recognise the importance of the design effect and estimate the design effect (DEFF) or design factor
https://rstudio.cloud/project/1676149
No need to install R or RStudio
You can open it on your phone
It includes the ENS2017 dataset and today's workshop
Memory is limited to 1 GB
Choose .R for a simpler code file
Choose .Rmd for a complete code file
library(rio) # Import and export data
library(tidyverse) # Prepare data and plot your results
library(survey) # Analyse using survey design
library(srvyr) # Analyse using survey design with tidyverse-style grammar
RStudio offers to install the packages for you. Save the file; a yellow message offering to install them will appear shortly.
# install.packages("survey")
You can activate the code by deleting the # before the text
You can change the “package name” to install
The quotation marks in install.packages() are important
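If the yellow banner does not appear, the packages can be installed manually; a one-time sketch:
# Install all four workshop packages in one call (run once, then comment out)
install.packages(c("rio", "tidyverse", "survey", "srvyr"))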
Dr Shaun Scholes, 2019
Stratify (partition the population into groups called strata) and sample independently from each and every stratum. Examples? Region; faculty.
Sample from the set of clusters in each stratum (these are known as the Primary Sampling Units: PSUs)
Why? For reasons of cost and convenience!
But… there is a statistical price to pay. In this session, we treat the clustering of data as a nuisance (not as something of primary interest)
Example: Precision for a proportion. The standard error of a proportion is \(\sqrt{P(1-P)/n}\); the 95% margin of error (half-width of the CI) is
\[MoE_{95\%}=1.96\times\sqrt{\frac{P(1-P)}{n}}\]
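A quick numeric illustration in R (p and n are hypothetical, assuming SRS):
p <- 0.30 # estimated proportion (hypothetical)
n <- 1000 # sample size (hypothetical)
se <- sqrt(p * (1 - p) / n) # standard error
moe <- 1.96 * se # 95% margin of error
c(se = se, moe = moe, ci_low = p - moe, ci_upp = p + moe)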
Both stratification and clustering can influence the magnitude of the standard errors (SEs): and hence the value of test statistics and width of confidence intervals (CIs).
Because SEs affect significance levels (P-values), the conclusions drawn from an analysis that does not take these complex survey design features into account may be misleading.
Usually misleading in one direction: under-estimation of the amount of uncertainty in the survey data (P-values too small; CIs too narrow)
Correct inference requires taking the features of the complex survey design into account.
When using complex survey designs, the study might have any or all of the three following features:
• Stratification
• Clustering (selection of PSUs within strata)
• Weighting
In practice, these features are recorded as variables in the complex survey dataset, which software such as Stata and R can use to perform appropriate statistical inference (generalise from your sample to the underlying population)
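As a sketch of how those variables map onto a design object in R (psu_id, stratum, weight and mydata are placeholder names; the actual ENS variables appear later in the workshop):
library(survey)
design <- svydesign(ids = ~psu_id,      # clustering: PSU identifier
                    strata = ~stratum,  # stratification variable
                    weights = ~weight,  # survey weight
                    data = mydata)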
– ENS 2003 (SPSS)
– ENS 2003 database – Region and comuna (SPSS)
– ENS 2009-2010
(a.1) ENS 2009-2010 Comuna (SPSS)
– ENS 2016-17
(a.1) Formulario 1-Formulario 2 and exams database – comuna and complex-design variables (SPSS)
(a.2) Formulario 1-Formulario 2 and exams database – Heavy metals (SPSS)
Formulario 4 database
User manual (updated 14/02/2018)
Codebooks F1, F2, F3 and F4
Medications database
Formulario 3 database and codebook
Today we will use the ENS2017 heavy-metals database
# library(rio)     # Import and export data (used below via rio::)
library(tidyverse) # Collection of packages to prepare data and plot
library(survey) # Analyse using survey design
library(srvyr) # Analyse using survey design with tidyverse-style grammar
Knit: Do you see that blue button with the knitting needles? Press it
# Current wd
getwd()
## [1] "/home/passi/MEGA/Taller_ENS"
# Setting wd
# setwd('~/Dropbox/PhD_web/ENS/static/Referencias ENS')
df0_original <- rio::import('Base de datos Encuesta Nacional de Salud 2016-2017(ENS).Formulario 1_2_EX.MINSAL_EPI. (2)_CIDI_SF_Comuna_metales pesados.sav')
df0<-df0_original
Use rio:: when you have not loaded the “rio” library, or to make sure you use the “import” function from the “rio” package rather than from another package.
# View(df0)
# WARNING: View() on huge datasets can be slow
# ENS is not huge
# variable / column names
# names(df0)
# total columns
# length(names(df0))
# total rows
# length(df0$IdEncuesta)
# alternatives:
# nrow(df0)
# ncol(df0)
dim(df0)
## [1] 6233 1165
# Sometimes it is better to create an object ("namesdf0")
namesdf0 <- names(df0)
namesdf0 <- as.data.frame(namesdf0)
# namesdf0
# and open it in another window
# View(namesdf0)
namesdf0[1:40,]
## [1] "IdEncuesta" "FechaInicioF1" "Region" "Comuna"
## [5] "Zona" "IdSegmento" "IdPersona_1" "Ident7"
## [9] "Edad" "Edad_Codificada" "Sexo" "c1"
## [13] "c1_esp" "c2" "c2a" "c2b"
## [17] "c3" "c3a" "c3b" "c3c"
## [21] "c5" "c5_otro" "c5b" "c6"
## [25] "c7_0_niño" "c7_1_niño" "c7_1_cuidador" "c7_2_niño"
## [29] "c7_2_cuidador" "c7_3_niño" "c7_3_cuidador" "c7_4_niño"
## [33] "c7_4_cuidador" "c7_5_niño" "c7_5_cuidador" "e1"
## [37] "e2a" "e2b" "e2c" "e2d"
# names(df0)
# summary(df0)
df0$Region<-as.factor(df0$Region)
summary(df0[1:10])
## IdEncuesta FechaInicioF1 Region
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 912
## 1st Qu.:22684 1st Qu.:2016-10-14 16:18:55 6 : 668
## Median :25698 Median :2016-11-06 11:51:08 10 : 661
## Mean :25764 Mean :2016-11-08 10:39:39 9 : 369
## 3rd Qu.:28501 3rd Qu.:2016-11-28 12:38:12 1 : 359
## Max. :70000 Max. :2017-02-23 18:03:30 13 : 346
## (Other):2918
## Comuna Zona IdSegmento IdPersona_1
## Length:6233 Min. :1.000 Min. : 1101101 Min. : 1.0
## Class :character 1st Qu.:1.000 1st Qu.: 5109123 1st Qu.: 40.0
## Mode :character Median :1.000 Median : 8301107 Median : 82.0
## Mean :1.159 Mean : 8503764 Mean :106.6
## 3rd Qu.:1.000 3rd Qu.:13101108 3rd Qu.:146.0
## Max. :2.000 Max. :15101204 Max. :590.0
##
## Ident7 Edad Edad_Codificada
## Min. :1918-11-26 Min. :15.00 Min. :1.000
## 1st Qu.:1952-04-23 1st Qu.:33.00 1st Qu.:2.000
## Median :1966-10-30 Median :50.00 Median :3.000
## Mean :1967-06-20 Mean :48.91 Mean :2.684
## 3rd Qu.:1983-07-25 3rd Qu.:64.00 3rd Qu.:3.000
## Max. :2002-01-19 Max. :98.00 Max. :4.000
##
names(df0)[grepl("Vitamina",names(df0))]
## [1] "v_25_OH_Vitamina_D2" "aux_25_OH_Vitamina_D2" "v_25_OH_Vitamina_D2_D3"
## [4] "v_25_OH_Vitamina_D3"
names(df0)[grepl("Fexp",names(df0))]
## [1] "Fexp_F1p_Corr" "Fexp_F2p_Corr" "Fexp_F1F2p_Corr"
## [4] "Fexp_EX1p_Corr" "Fexp_F1F2EX1p_Corr" "Fexp_EX2p_Corr"
## [7] "Fexp_F1F2EX2p_Corr" "Fexp_EX3p_Corr" "Fexp_F1F2EX3p_Corr"
names(df0)[grepl("Congl",names(df0))]
## [1] "Conglomerado"
names(df0)[grepl("Estr",names(df0))]
## [1] "Estrato"
names(df0)[grepl("m4p3",names(df0))]
## [1] "m4p3" "m4p3_1" "m4p3_2"
df0_ext<-df0 %>%
select("IdEncuesta","FechaInicioF1","Region","Zona", "Edad",
"Edad_Codificada","Sexo","NEDU1_MINSAL_1",
"Fexp_F1p_Corr","Fexp_F2p_Corr","Fexp_F1F2p_Corr",
"Fexp_EX1p_Corr","Fexp_F1F2EX1p_Corr", "Fexp_EX2p_Corr",
"Fexp_F1F2EX2p_Corr" ,"Fexp_EX3p_Corr","Fexp_F1F2EX3p_Corr",
"Conglomerado","Estrato","v_25_OH_Vitamina_D2_D3")
summary(df0_ext)
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 912 Min. :1.000
## 1st Qu.:22684 1st Qu.:2016-10-14 16:18:55 6 : 668 1st Qu.:1.000
## Median :25698 Median :2016-11-06 11:51:08 10 : 661 Median :1.000
## Mean :25764 Mean :2016-11-08 10:39:39 9 : 369 Mean :1.159
## 3rd Qu.:28501 3rd Qu.:2016-11-28 12:38:12 1 : 359 3rd Qu.:1.000
## Max. :70000 Max. :2017-02-23 18:03:30 13 : 346 Max. :2.000
## (Other):2918
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:33.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :50.00 Median :3.000 Median :2.000 Median :2.000
## Mean :48.91 Mean :2.684 Mean :1.629 Mean :1.983
## 3rd Qu.:64.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :98.00 Max. :4.000 Max. :2.000 Max. :3.000
## NA's :59
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 2.594 Min. : 3.131 Min. : 3.131
## 1st Qu.: 414.818 1st Qu.: 454.585 1st Qu.: 454.585
## Median : 1084.403 Median : 1209.348 Median : 1209.348
## Mean : 2329.371 Mean : 2630.248 Mean : 2630.248
## 3rd Qu.: 2633.405 3rd Qu.: 3017.059 3rd Qu.: 3017.059
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :713 NA's :713
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 3.163 Min. : 3.163 Min. : 4.32 Min. : 4.34
## 1st Qu.: 464.486 1st Qu.: 465.550 1st Qu.: 726.64 1st Qu.: 728.37
## Median : 1232.532 Median : 1241.019 Median : 1781.00 Median : 1791.02
## Mean : 2666.477 Mean : 2674.828 Mean : 3765.63 Mean : 3774.10
## 3rd Qu.: 3071.311 3rd Qu.: 3071.912 3rd Qu.: 4341.99 3rd Qu.: 4349.47
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :788 NA's :805 NA's :2386 NA's :2386
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 10.27 Min. : 10.41 Min. : 1101101 Min. : 11.00
## 1st Qu.: 1808.85 1st Qu.: 1821.60 1st Qu.: 5109123 1st Qu.: 51.00
## Median : 4823.59 Median : 4832.53 Median : 8301107 Median : 81.00
## Mean :10532.52 Mean :10551.58 Mean : 8503691 Mean : 84.27
## 3rd Qu.:12958.39 3rd Qu.:12962.88 3rd Qu.:13101108 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :4857 NA's :4857
## v_25_OH_Vitamina_D2_D3
## Min. : 1.30
## 1st Qu.:13.50
## Median :18.90
## Mean :19.74
## 3rd Qu.:24.90
## Max. :59.70
## NA's :3347
by(df0_ext, df0_ext$Sexo, summary)
## df0_ext$Sexo: 1
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20013 Min. :2016-09-03 18:35:55 7 : 350 Min. :1.000
## 1st Qu.:22616 1st Qu.:2016-10-14 11:57:05 10 : 257 1st Qu.:1.000
## Median :25585 Median :2016-11-04 19:28:26 6 : 247 Median :1.000
## Mean :25608 Mean :2016-11-07 11:56:48 1 : 140 Mean :1.143
## 3rd Qu.:28369 3rd Qu.:2016-11-27 04:25:12 3 : 138 3rd Qu.:1.000
## Max. :31839 Max. :2017-02-23 18:03:30 8 : 133 Max. :2.000
## (Other):1050
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :1 Min. :1.000
## 1st Qu.:31.00 1st Qu.:2.000 1st Qu.:1 1st Qu.:2.000
## Median :49.00 Median :3.000 Median :1 Median :2.000
## Mean :47.93 Mean :2.634 Mean :1 Mean :2.043
## 3rd Qu.:63.00 3rd Qu.:3.000 3rd Qu.:1 3rd Qu.:2.000
## Max. :95.00 Max. :4.000 Max. :1 Max. :3.000
## NA's :18
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 6.424 Min. : 8.888 Min. : 8.888
## 1st Qu.: 568.766 1st Qu.: 639.303 1st Qu.: 639.303
## Median : 1507.263 Median : 1786.075 Median : 1786.075
## Mean : 3080.486 Mean : 3532.108 Mean : 3532.108
## 3rd Qu.: 3768.953 3rd Qu.: 4395.049 3rd Qu.: 4395.049
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :296 NA's :296
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 8.888 Min. : 8.888 Min. : 9.11 Min. : 9.11
## 1st Qu.: 662.982 1st Qu.: 665.465 1st Qu.: 1038.81 1st Qu.: 1048.51
## Median : 1827.887 Median : 1837.309 Median : 2561.75 Median : 2579.03
## Mean : 3567.447 Mean : 3583.581 Mean : 5053.01 Mean : 5068.46
## 3rd Qu.: 4470.406 3rd Qu.: 4514.441 3rd Qu.: 6300.13 3rd Qu.: 6323.65
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :316 NA's :325 NA's :908 NA's :908
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 64.65 Min. : 64.65 Min. : 1101101 Min. : 11.00
## 1st Qu.: 2810.20 1st Qu.: 2813.99 1st Qu.: 5301101 1st Qu.: 51.00
## Median : 6550.06 Median : 6554.51 Median : 8301106 Median : 81.00
## Mean :14537.63 Mean :14553.73 Mean : 8501529 Mean : 84.21
## 3rd Qu.:20210.62 3rd Qu.:20237.63 3rd Qu.:13101108 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :1825 NA's :1825
## v_25_OH_Vitamina_D2_D3
## Min. : 1.30
## 1st Qu.:14.60
## Median :20.45
## Mean :21.14
## 3rd Qu.:26.75
## Max. :50.10
## NA's :1851
## ------------------------------------------------------------
## df0_ext$Sexo: 2
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 562 Min. :1.000
## 1st Qu.:22707 1st Qu.:2016-10-14 18:06:12 6 : 421 1st Qu.:1.000
## Median :25778 Median :2016-11-07 13:05:54 10 : 404 Median :1.000
## Mean :25856 Mean :2016-11-09 00:04:54 9 : 256 Mean :1.168
## 3rd Qu.:28584 3rd Qu.:2016-11-28 17:59:45 12 : 225 3rd Qu.:1.000
## Max. :70000 Max. :2017-02-22 20:02:35 2 : 223 Max. :2.000
## (Other):1827
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :2 Min. :1.000
## 1st Qu.:34.00 1st Qu.:2.000 1st Qu.:2 1st Qu.:1.000
## Median :50.00 Median :3.000 Median :2 Median :2.000
## Mean :49.49 Mean :2.713 Mean :2 Mean :1.948
## 3rd Qu.:64.00 3rd Qu.:3.000 3rd Qu.:2 3rd Qu.:2.000
## Max. :98.00 Max. :4.000 Max. :2 Max. :3.000
## NA's :41
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 2.594 Min. : 3.131 Min. : 3.131
## 1st Qu.: 346.014 1st Qu.: 387.065 1st Qu.: 387.065
## Median : 876.152 Median : 965.907 Median : 965.907
## Mean : 1885.565 Mean : 2110.152 Mean : 2110.152
## 3rd Qu.: 2138.313 3rd Qu.: 2402.234 3rd Qu.: 2402.234
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :417 NA's :417
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 3.163 Min. : 3.163 Min. : 4.32 Min. : 4.34
## 1st Qu.: 393.154 1st Qu.: 393.644 1st Qu.: 590.95 1st Qu.: 591.24
## Median : 983.557 Median : 984.762 Median : 1446.51 Median : 1448.54
## Mean : 2143.831 Mean : 2148.820 Mean : 3023.28 Mean : 3027.72
## 3rd Qu.: 2458.421 3rd Qu.: 2461.515 3rd Qu.: 3504.26 3rd Qu.: 3511.23
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :472 NA's :480 NA's :1478 NA's :1478
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 10.27 Min. : 10.41 Min. : 1101101 Min. : 11.00
## 1st Qu.: 1473.66 1st Qu.: 1482.27 1st Qu.: 5109122 1st Qu.: 51.00
## Median : 3715.37 Median : 3722.69 Median : 8301108 Median : 81.00
## Mean : 8317.50 Mean : 8338.20 Mean : 8504969 Mean : 84.32
## 3rd Qu.: 9521.39 3rd Qu.: 9521.39 3rd Qu.:13101109 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :3032 NA's :3032
## v_25_OH_Vitamina_D2_D3
## Min. : 2.00
## 1st Qu.:13.20
## Median :18.60
## Mean :19.47
## 3rd Qu.:24.60
## Max. :59.70
## NA's :1496
df0_ext<-df0 %>%
select( "IdEncuesta","Fexp_F1p_Corr","Fexp_F1F2EX1p_Corr", "Conglomerado","Estrato","v_25_OH_Vitamina_D2_D3")
df0_ext[1:10,]
• Weights provided with complex datasets are positive values associated with each (response) unit in the sample. There may be different weights for different stages of a survey (e.g. interview; nurse examination; collection of blood samples)
• Sampling weights: “inverse probability weights” are the inverse / reciprocal of the probability of selection
• The main purpose is to reduce bias (systematic error) in population estimates by up-weighting population subgroups that are under-represented in the sample and down-weighting subgroups that are over-represented in the sample
• Unequal probability of selection
• Differences in response rates
• Post-stratification: adjust the sample distribution for key variables of interest (e.g. age, sex, region) so that the weighted sample conforms to a known population distribution in analysis (i.e. an adjustment after data has been collected)
• Weights for analysis can include all 3 components: selection × (non-response × poststratification)
We won't worry today about how the weights are calculated, but note that subjects aged 65 years or more were deliberately oversampled to ensure adequate numbers for subgroup analysis (and bearing in mind their response rates); and men (especially young men) are less likely to respond to surveys than women.
To achieve nationally-representative estimates (avoid bias) you would expect the weighting scheme to show: (1) lower weights for people aged 65 or more (compensate for over-sampling); and (2) higher weights for young men (adjust for lower response)
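A toy illustration of inverse-probability weights (hypothetical selection probabilities, not the actual ENS scheme):
p_sel <- c(adult_65plus = 1/200, young_men = 1/1000) # probability of selection
w <- 1 / p_sel # base weight = reciprocal of the selection probability
w # the oversampled 65+ group gets the smaller weight, as expected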
SEs (estimates of variability) attached to estimates from complex surveys tend to be larger than those from a Simple Random Sample (of the same size): this is the statistical price to be paid for the cost and convenience of complex designs.
Typically, stratification reduces SEs; clustering increases them
The difference in precision of estimates produced by a complex design relative to a SRS is estimated by the Design Effect (DEFF)
\[DEFF=\frac{variance~estimate~(complex~design)}{variance~estimate~(SRS)}\]
• Design effects (DEFF) and root design effects (design factor, DEFT) compare the sample-to-sample variability from a given survey dataset with a hypothetical SRS design with the same number of individuals sampled from the population.
It is the ratio of the actual variance of an estimate under the sampling method used (e.g. stratification & clustering) to the variance calculated under the assumption of SRS.
The DEFF increases both as cluster size increases and as rho increases (fewer clusters and greater homogeneity within them: the intraclass correlation)
DEFF = 1 implies similar precision for the complex design and SRS
Magnitude of the design effect is estimate-specific
DEFF increases both as cluster size increases and as ρ increases (sample more people who are alike)
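Kish's standard approximation for roughly equal cluster sizes makes this relationship explicit:
\[DEFF \approx 1+(b-1)\rho\]
where \(b\) is the average number of sampled individuals per cluster and \(\rho\) is the intraclass correlation.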
Interpretation? A DEFF = 2.0 means that the sampling variance for that estimate is twice as large as it would be if the survey had been an SRS of the same size (hypothetical). A guide to the greater imprecision in the estimates.
Alternative interpretation: only half as many sample cases would be needed to obtain estimates with the same precision had you used an SRS (effective sample size = n_complex / DEFF)
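A quick worked example (hypothetical numbers):
n_complex <- 6000 # complex-survey sample size (hypothetical)
DEFF <- 2.0 # design effect (hypothetical)
n_complex / DEFF # effective sample size: an SRS of 3000 gives the same precision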
But remember the costs and convenience!
The SE is the square root of the variance.
Similarly, the design factor (DEFT) is the square root of the DEFF: \(DEFT=\sqrt{DEFF}\)
A DEFT of 2.0 indicates that the SEs are twice as large as they would have been had the design been SRS.
DEFT is typically greater than 1: so SEs for complex surveys are larger than those assuming SRS.
Specify survey design features
Compute estimates of interest: means; crosstabulations; proportions; regression coefficients taking into account those design features
Possible post-estimation commands giving more information: e.g. design effects; design factors; appropriate statistical testing for complex survey data
Use the weight with the smallest n among the variables you will analyse
F1: Visit 1 - interviewer
F2: Visit 2 - nurse
Ex: Laboratory exams (several)
And their combinations (intersections)
Weights ENS2017: “Fexp_F1p_Corr”,“Fexp_F2p_Corr”,“Fexp_F1F2p_Corr”, “Fexp_EX1p_Corr”,“Fexp_F1F2EX1p_Corr”, “Fexp_EX2p_Corr”, “Fexp_F1F2EX2p_Corr” ,“Fexp_EX3p_Corr”,“Fexp_F1F2EX3p_Corr”
FactoresExp1 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=1)
FactoresExp3 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=3)
FactoresExp2 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=2)
FactoresExp2
table(df0$NEDU1_MINSAL_1)
##
## 1 2 3
## 1477 3323 1374
table(df0$NEDU1_MINSAL_1, useNA = "ifany")
##
## 1 2 3 <NA>
## 1477 3323 1374 59
prop.table(table(df0$NEDU1_MINSAL_1))
##
## 1 2 3
## 0.2392290 0.5382248 0.2225462
hist(df0$v_25_OH_Vitamina_D2_D3)
df0$Gender <- factor(df0$Sexo,
levels=c("1","2"),
labels=c("Male","Female"))
df0$Educational_level <- factor(df0$NEDU1_MINSAL_1,
levels=c("1","2","3"),
labels=c("Low","Mid","High"))
df0$Area <- factor(df0$Zona,
levels=c("1","2"),
labels=c("Urban","Rural"))
df0$Age <- factor(df0$Edad_Codificada,
levels=c("1","2","3","4"),
labels=c("17-24","25-44","45-64","65+"))
df0$Conglomerado_ <- df0$Conglomerado
df0$strata_ <- df0$Estrato
df0$fexp <- df0$Fexp_F1F2EX1p_Corr
Explore using pipes %>% (magrittr): a sequence of actions
First I wake up %>%
Then I get up %>%
Finally, I make coffee
Nested equivalent: Finally, I make coffee(Then I get up(First I wake up))
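The same idea as a tiny runnable sketch; both lines below are equivalent:
round(sqrt(sum(c(1, 4, 9)))) # nested calls, read inside-out
c(1, 4, 9) %>% sum() %>% sqrt() %>% round() # piped, read left-to-right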
df0$Area<-df0$Zona
res_0b <- df0 %>%
dplyr::group_by(Region,Area,strata_) %>%
summarize(Conglomerado_l = length(unique(Conglomerado_)))
res_0b
res_0b <- df0 %>%
group_by(Region,Area,strata_) %>%
summarize(missing_fexp = sum(is.na(Fexp_F1F2EX1p_Corr)),
valid_fexp = sum(!is.na(Fexp_F1F2EX1p_Corr)),
mean_fexp = mean(Fexp_F1F2EX1p_Corr, na.rm=TRUE)
)
res_0b
Explore
length(unique(df0$Conglomerado_))
## [1] 1077
# table(df0$Conglomerado_,df0$strata_)
Explore using ggplot
ggplot(df0)+
geom_point(aes(x=Edad, y=Fexp_F1F2EX1p_Corr))
ggplot(df0)+
geom_point(aes(x=Edad, y=Fexp_F1F2EX1p_Corr,
col=Gender, shape=Gender))
df0$exposure <- df0$Educational_level
df0$outcome <- df0$v_25_OH_Vitamina_D2_D3
df0$outcome1 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<30)
df0$outcome2 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<20)
df0$outcome3 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<12)
# Ordered overwriting: each later, stricter condition takes precedence
df0$outcome5_ <- NA
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3>=30] <- 0 # 30 or more
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<30] <- 1 # [20, 30)
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<20] <- 2 # [12, 20)
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<12] <- 3 # below 12
df0$outcome5_ <- as.factor(df0$outcome5_)
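An equivalent, more compact alternative uses cut(); a sketch, where outcome5_cut is a hypothetical name (right = FALSE makes the intervals [lower, upper)):
df0$outcome5_cut <- cut(df0$v_25_OH_Vitamina_D2_D3,
                        breaks = c(-Inf, 12, 20, 30, Inf),
                        labels = c("3", "2", "1", "0"),
                        right = FALSE)
# table(df0$outcome5_, df0$outcome5_cut) # check agreement with the loop above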
Create a subset with valid values for fexp, strata_ and Conglomerado_
# drop cases
df0 <- subset(df0,!is.na(fexp)& !is.na(strata_)& !is.na(Conglomerado_))
# I have df0_original in case I want to use another fexp
survey_design <- df0 %>%
as_survey_design(id=Conglomerado_,
weight = fexp,
strata=strata_
)
options(survey.lonely.psu="certainty")
# nest=TRUE IS REQUIRED TO ANALYSE TRENDS USING ENS. If TRUE, cluster ids are relabelled to enforce nesting within strata (the same PSU labels appear in different surveys).
res_0b <- df0 %>%
group_by(Region,Area,strata_) %>%
summarize(Conglomerado_l = length(unique(Conglomerado_)))
Single_PSU <- subset(res_0b,Conglomerado_l==1)
Single_PSU
http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html
If only one PSU is sampled from a particular stratum the variance can’t be computed (there is no unbiased estimator and the standard estimator gives 0/0). Variance estimation in sample surveys involves variances computed within primary sampling units.
One exception to this is “certainty” PSUs in sampling without replacement, where the population has only one PSU in the stratum. With 100% sampling, there is no contribution to the variance from the first stage of sampling in this stratum. A single-PSU stratum makes no contribution to the variance (for multistage sampling it makes no contribution at that level of sampling).
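For reference, the survey package supports several policies for lonely PSUs (see ?survey.lonely.psu); this workshop uses "certainty":
# How the survey package should treat single-PSU (lonely) strata:
options(survey.lonely.psu = "certainty") # no variance contribution at that level
# Other accepted values: "fail" (the default: raise an error), "remove",
# "adjust" (centre at the grand mean) and "average" (use the average variance)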
“As a general rule when working with complex survey data such as NHANES, you should never drop records from your analysis dataset before executing your analysis procedures. Instead, use the special statements provided in your software’s analysis procedure to perform subgroup analyses.” https://wwwn.cdc.gov/nchs/nhanes/tutorials/module4.aspx
survey_design1 <- subset(survey_design,
!is.na(outcome) &!is.na(exposure) & Edad>=15)
# Mean and 95% CI Simple Random Sample
ttest_table<-t.test(df0$outcome)
ttest_table$estimate
## mean of x
## 19.73233
ttest_table$conf.int
## [1] 19.42651 20.03814
## attr(,"conf.level")
## [1] 0.95
# Survey design created
survey_design <- df0 %>%
as_survey_design(id=Conglomerado_,
weight = fexp,
strata=strata_)
options(survey.lonely.psu="certainty")
# Subset survey design with valid outcome (vit D) and exposure (Educ level)
survey_design1 <- subset(survey_design,
!is.na(outcome) &!is.na(exposure) & Edad>=15 )
# Mean and SE: the simplest code
survey_design1 %>%
summarize(mean = survey_mean(outcome))
# outcome in this case is v_25_OH_Vitamina_D2_D3
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3))
# Mean and 95%CI
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")))
# Mean and 95%CI and unweighted n
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")),
n = unweighted(n()))
# Mean and 95%CI and unweighted n saved in a table
res_survey <- survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")),
n = unweighted(n()))
res_survey
# Mean and 95%CI by groups (exposure:educ level)
res_v0 <- survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v0$Sexo<-0
res_v0
# Mean and 95%CI by groups (sex and exposure:educ level)
res_v1 <- survey_design1 %>%
group_by(Sexo, exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
Generate a table of means, SEs, n and 95% CIs
# Bind tables
res_v01 <- bind_rows(res_v0, res_v1)
# Move variable Sexo as first column
res_v01 <- res_v01 %>%
select(Sexo, everything())
res_v01
survey_design1 <- subset(survey_design,Edad<50 & Sexo==2&!is.na(outcome) & !is.na(exposure))
# Eligibility criterion for the Vit D exam: ((female) & (age>=15 & age<=49)) OR (age>=65)
res_v0 <- survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v0$Region<-"Total"
res_v1 <- survey_design1 %>%
group_by(Region, exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
res_v01 <- bind_rows(res_v0, res_v1)
res_v01 <- res_v01 %>%
select(Region, everything())
res_v01
# names(res0)
ggplot(res_v01)+
geom_point(aes(x=Region, y=mean, col=exposure))
plot1vd<-ggplot(res_v01)+
geom_point(aes(x=Region, y=mean, col=exposure))+
geom_errorbar(aes(x=Region,
ymin = mean_low, ymax = mean_upp, col=exposure),
width = 0.1,
size = 0.1,
position = position_dodge(0.9)
)+
labs(y="mean vitamin D levels (ng / mL)",
x="Nivel educacional")+
theme_minimal()
plot1vd
# Analysis ignoring the sampling design
prop_t1<-prop.test(sum(df0$outcome1, na.rm=T), sum(!is.na(df0$outcome1)))
prop_t1$conf.int[1]
## [1] 0.8741721
prop_t1$conf.int[2]
## [1] 0.8976996
# Analysis using the sampling design
# to obtain the proportion of each category
survey_design1 <- subset(survey_design,!is.na(outcome1) & !is.na(exposure))
res1<-survey_design1 %>%
group_by(exposure,outcome1) %>%
summarize(mean = survey_mean(
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
# https://r-survey.r-forge.r-project.org/survey/html/svyciprop.html
res1$outcome<-"<30"
# For binary 0-1 outcomes
res1<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome1,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res1$outcome<-"<30"
res2<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome2,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res2$outcome<-"<20"
res3<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome3,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res3$outcome<-"<12"
res_t_p<-bind_rows(res1,res2,res3)
res_t_p
plot2<-ggplot(res_t_p)+
geom_bar(aes(x = exposure, y = mean, fill=outcome),
width = 0.9, stat="identity",
position = position_dodge(0.8))+
geom_errorbar(aes(x=exposure,
ymin = mean_low, ymax = mean_upp, fill=outcome),
width = 0.1,size = 0.1, position = position_dodge(0.9))
plot2
plot2<-plot2 +
facet_grid(.~outcome, scales = "fixed")
plot2
# https://ggplot2.tidyverse.org/reference/labs.html
# https://www.r-graph-gallery.com/275-add-text-labels-with-ggplot2.html
survey_design1 <- subset(survey_design,!is.na(outcome5_) & !is.na(exposure))
res1<-survey_design1 %>%
group_by(exposure, outcome5_) %>%
summarize(mean = survey_mean(
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res1
write.csv(res1,file="check.csv")
res1<-survey_design %>%
subset(!is.na(outcome1) & !is.na(exposure)) %>%
subset(Sexo==1) %>%
group_by(exposure, outcome1) %>%
summarize(svytotal = survey_total(),
n = unweighted(n()))
res1
# svytable(~outcome1 +exposure, subset(survey_design, Sexo==1))
svychisq(~outcome1 +exposure, subset(survey_design, Sexo==1))
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: NextMethod()
## F = 0.43519, ndf = 1.8662, ddf = 1683.3296, p-value = 0.6333
svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
First I describe, then I model
# Describe
res_v1 <- survey_design1 %>%
group_by(Area) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
# simple random sample
fit2_SRS <- glm(outcome~ Area+Edad,
data=subset(df0, !is.na(outcome)
&!is.na(exposure) & Edad>=15))
summary(fit2_SRS)
##
## Call:
## glm(formula = outcome ~ Area + Edad, data = subset(df0, !is.na(outcome) &
## !is.na(exposure) & Edad >= 15))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19.273 -5.951 -0.900 5.116 41.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.335781 0.591592 27.613 < 2e-16 ***
## Area 4.420967 0.402864 10.974 < 2e-16 ***
## Edad -0.035053 0.006956 -5.039 4.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 66.90658)
##
## Null deviance: 200345 on 2858 degrees of freedom
## Residual deviance: 191085 on 2856 degrees of freedom
## AIC: 20136
##
## Number of Fisher Scoring iterations: 2
# Model using the survey design
fit2 <- svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
summary(fit2)
##
## Call:
## svyglm(formula = outcome ~ Area + Edad, design = survey_design,
## deff = TRUE)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.57713 1.05260 14.799 < 2e-16 ***
## Area 4.74497 0.64577 7.348 4.28e-13 ***
## Edad -0.02449 0.01257 -1.948 0.0517 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 52.01855)
##
## Number of Fisher Scoring iterations: 2
fit2 <- svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
fit2 <- summary(fit2)
fit2 <- as.data.frame(coef(fit2))
fit2$name <- rownames(fit2)
fit2$outcome <- "Vit D numeric"
fit2
write.csv(fit2,file="fit2.csv")
svyglm(outcome1 ~Area+Edad, survey_design, family = quasibinomial(link = "logit"), deff=TRUE)
svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (1001) clusters.
## Called via srvyr
## Sampling variables:
## - ids: Conglomerado_
## - strata: strata_
## - weights: fexp
##
## Call: svyglm(formula = outcome1 ~ Area + Edad, design = survey_design,
## family = quasibinomial(link = "logit"), deff = TRUE)
##
## Coefficients:
## (Intercept) Area Edad
## 3.2191557 -0.9665616 -0.0004236
##
## Degrees of Freedom: 2879 Total (i.e. Null); 969 Residual
## (2548 observations deleted due to missingness)
## Null Deviance: 1632
## Residual Deviance: 1600 AIC: NA
fit2 <- svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
summary(fit2)
##
## Call:
## svyglm(formula = outcome1 ~ Area + Edad, design = survey_design,
## family = quasibinomial(link = "logit"), deff = TRUE)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2191557 0.3697096 8.707 < 2e-16 ***
## Area -0.9665616 0.2005148 -4.820 1.66e-06 ***
## Edad -0.0004236 0.0042062 -0.101 0.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.8087385)
##
## Number of Fisher Scoring iterations: 4
# res1a <- svyby(~outcome1, ~Area, survey_design, svymean, vartype=c("se","ci"),
# deff=TRUE)
fit_t <- svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
fit_t <- summary(fit_t)
fit_t <- as.data.frame(coef(fit_t))
fit_t$name <- rownames(fit_t)
fit_t$outcome <- "VitD<30"
fit_t$OR <- round(exp(fit_t$Estimate),digits=2)
fit_t$OR_i <- round(exp((fit_t$Estimate)-(1.96*fit_t$`Std. Error`)),2)
fit_t$OR_s <- round(exp((fit_t$Estimate)+(1.96*fit_t$`Std. Error`)),2)
fit_t$Estimate <- NULL
fit_t$`Std. Error` <- NULL
fit_t$`t value` <- NULL
fit_t$pvalue <- round(fit_t$`Pr(>|t|)`,digits=4)
fit_t$`Pr(>|t|)` <- NULL
fit_t <- subset(fit_t,name!="(Intercept)")
fit_t$name <- NULL
fit_t
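As an alternative to the hand-rolled 1.96 normal approximation above, a sketch using confint() on the svyglm object itself (before summarising it), which gives design-based Wald intervals:
fit <- svyglm(outcome1 ~ Area + Edad, survey_design,
              family = quasibinomial(link = "logit"))
round(exp(cbind(OR = coef(fit), confint(fit))), 2) # ORs with 95% CIs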
Expansion factors (weights): “fact_af1”, “fact_af2”, “fact_aex”
Stratum: “region”
Cluster: “comuna”
You must choose an expansion factor. The expansion factor depends on your analysis variables.
Expansion factors (weights): “factor”, “FEXP1”, “FEXP2”, “FEXP_ex”, “FEXP_factor”
Stratum: “estrato”
Cluster: “id_Comuna”, “conglomerado1”, “conglomerado2”
You must choose an expansion factor and a cluster. The expansion factor depends on your analysis variables. As for the cluster… all three can be justified. I use the one that gives the largest estimation error (the most “conservative”).
Subscribe at https://rpubs.com/
Do you see the blue “Publish” button? Press it, select RPubs and follow the instructions
https://faculty.washington.edu/tlumley/old-survey/survey-wss.pdf
https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx
https://github.com/pachamaltese/casen/blob/master/R/descriptive_statistics.R
https://www.tidyverse.org/packages/
https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
https://jef.works/art-with-code/
https://medium.com/@alinastepanova/tiny-art-with-r-d6d93d110619
Individual assignment no. 1 - Course MSP 3010, 2021. Objective: acquire the skills to analyse variables included in the Chilean National Health Survey databases, using a statistical package for “complex samples”. Instructions:
1. Beforehand, review the questionnaires of any ENS database publicly available in Chile.
2. Choose a topic of interest and build a specific indicator that can be dichotomised (0-1).
3. Calculate a national prevalence rate, and one by sex, for that parameter.
4. Calculate the prevalence rates by educational level (low, mid, high).
5. Estimate the odds ratios comparing the prevalence of the low vs. high educational level, adjusted for age and sex (using logistic regression in the complex-samples module).
6. Try running the same calculations from points 3, 4 and 5 outside the complex-samples module. Do you notice any change? Of what magnitude? Why does this happen? Discuss with your classmates.
7. Submit 2 files by email:
File 1. A file with 3 tables (example):
“Prevalence of (consumption of 5 or more fruits or vegetables per day) by sex, Chile ENS 20XX”
“Prevalence of … by educational level, Chile ENS …”
“Age- and sex-adjusted odds ratio comparing the prevalence of … between the low and high educational levels”
File 2. A file with the syntax used to run these analyses.
For this assignment you will have the following support:
PDF slides on the ENS databases and on the statistical theory behind complex sampling.
Readings and notes on using the complex-samples module in STATA or R.
STATA or R workshops.
Due date: Wednesday 9 November (until 23:59). Send the files, labelled 1_yourname.xls and 2_yourname.doc, to Alvaro Passi: passi.solar@gmail.com, with a copy to Francisco Valenzuela: fjvalen2@uc.cl