** Part 1. 27 October 2021, 15:30 to 18:20
** Part 2. 3 November 2021, 15:30 to 18:20
Alvaro Passi-Solar (arpassi@uc.cl)
Thanks to Dr Shaun Scholes. I used some slides from the class “Basic Statistics for Medical Sciences” (Stata), November 29th, 2019
• Identify complex survey design features
• Specify survey design
• Perform survey analysis: mean, cross-tabulations, and proportions
• Perform survey analysis: linear and logistic regressions
• Perform sub-population analysis (correctly)
• Recognise the importance of the design effect and estimate the design effect (DEFF) or design factor
https://rstudio.cloud/project/1676149
No need to install R or RStudio
You can open it on your phone
It includes the ENS2017 dataset and today's workshop
Memory is limited to 1 GB
Choose .R for a simpler code file
Choose .Rmd for a complete code file
library(rio) # Import and export data
library(tidyverse) # Prepare data and plot your results
library(survey) # Analyse using survey design
library(srvyr) # Analyse using survey design with tidyverse-style grammar
RStudio offers to install the packages for you. Save the file; a yellow message offering to install them will appear shortly.
# install.packages("survey")
You can activate the code by deleting the # before the text
You can change the “package name” to install
The quotation marks in install.packages() are important
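If the yellow banner does not appear, the packages can be installed manually; a one-time sketch:
# Install all four workshop packages in one call (run once, then comment out)
install.packages(c("rio", "tidyverse", "survey", "srvyr"))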
Dr Shaun Scholes, 2019
Stratify (partition the population into groups called strata) and sample independently from each and every stratum. Examples? Region; faculty.
Sample from the set of clusters in each stratum (these are known as the Primary Sampling Units: PSUs)
Why? For reasons of cost and convenience!
But… there is a statistical price to pay. In this session, we treat the clustering of data as a nuisance (not as something of primary interest)
Example: Precision for a proportion. The standard error of a proportion is \(\sqrt{P(1-P)/n}\); the 95% margin of error (half-width of the CI) is
\[MoE_{95\%}=1.96\times\sqrt{\frac{P(1-P)}{n}}\]
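A quick numeric illustration in R (p and n are hypothetical, assuming SRS):
p <- 0.30 # estimated proportion (hypothetical)
n <- 1000 # sample size (hypothetical)
se <- sqrt(p * (1 - p) / n) # standard error
moe <- 1.96 * se # 95% margin of error
c(se = se, moe = moe, ci_low = p - moe, ci_upp = p + moe)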
Both stratification and clustering can influence the magnitude of the standard errors (SEs): and hence the value of test statistics and width of confidence intervals (CIs).
Because SEs affect significance levels (P-values), the conclusions drawn from an analysis that does not take these complex survey design features into account may be misleading.
Usually misleading in one direction: under-estimation of the amount of uncertainty in the survey data (P-values too small; CIs too narrow)
Correct inference requires taking the features of the complex survey design into account.
When using complex survey designs, the study might have any or all of the three following features:
• Stratification
• Clustering (selection of PSUs within strata)
• Weighting
In practice, these features are recorded as variables in the complex survey dataset, which software such as Stata and R can use to perform appropriate statistical inference (generalise from your sample to the underlying population)
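As a sketch of how those variables map onto a design object in R (psu_id, stratum, weight and mydata are placeholder names; the actual ENS variables appear later in the workshop):
library(survey)
design <- svydesign(ids = ~psu_id,      # clustering: PSU identifier
                    strata = ~stratum,  # stratification variable
                    weights = ~weight,  # survey weight
                    data = mydata)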
– ENS 2003 (SPSS)
– ENS 2003 database – Region and comuna (SPSS)
– ENS 2009-2010
(a.1) ENS 2009-2010 Comuna (SPSS)
– ENS 2016-17
(a.1) Formulario 1-Formulario 2 and exams database – comuna and complex-design variables (SPSS)
(a.2) Formulario 1-Formulario 2 and exams database – Heavy metals (SPSS)
Formulario 4 database
User manual (updated 14/02/2018)
Codebooks F1, F2, F3 and F4
Medications database
Formulario 3 database and codebook
Today we will use the ENS2017 heavy-metals database
# library(rio)     # Import and export data (used below via rio::)
library(tidyverse) # Collection of packages to prepare data and plot
library(survey) # Analyse using survey design
library(srvyr) # Analyse using survey design with tidyverse-style grammar
Knit: Do you see that blue button with the knitting needles? Press it
# Current wd
getwd()
## [1] "/home/passi/MEGA/Taller_ENS"
# Setting wd
# setwd('~/Dropbox/PhD_web/ENS/static/Referencias ENS')
df0_original <- rio::import('Base de datos Encuesta Nacional de Salud 2016-2017(ENS).Formulario 1_2_EX.MINSAL_EPI. (2)_CIDI_SF_Comuna_metales pesados.sav')
df0<-df0_original
Use rio:: when you have not loaded the “rio” library, or to make sure you use the “import” function from the “rio” package rather than from another package.
# View(df0)
# WARNING: View() on huge datasets can be slow
# ENS is not huge
# variable / column names
# names(df0)
# total columns
# length(names(df0))
# total rows
# length(df0$IdEncuesta)
# alternatives:
# nrow(df0)
# ncol(df0)
dim(df0)
## [1] 6233 1165
# Sometimes it is better to create an object ("namesdf0")
namesdf0 <- names(df0)
namesdf0 <- as.data.frame(namesdf0)
# namesdf0
# and open it in another window
# View(namesdf0)
namesdf0[1:40,]
## [1] "IdEncuesta" "FechaInicioF1" "Region" "Comuna"
## [5] "Zona" "IdSegmento" "IdPersona_1" "Ident7"
## [9] "Edad" "Edad_Codificada" "Sexo" "c1"
## [13] "c1_esp" "c2" "c2a" "c2b"
## [17] "c3" "c3a" "c3b" "c3c"
## [21] "c5" "c5_otro" "c5b" "c6"
## [25] "c7_0_niño" "c7_1_niño" "c7_1_cuidador" "c7_2_niño"
## [29] "c7_2_cuidador" "c7_3_niño" "c7_3_cuidador" "c7_4_niño"
## [33] "c7_4_cuidador" "c7_5_niño" "c7_5_cuidador" "e1"
## [37] "e2a" "e2b" "e2c" "e2d"
# names(df0)
# summary(df0)
df0$Region<-as.factor(df0$Region)
summary(df0[1:10])
## IdEncuesta FechaInicioF1 Region
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 912
## 1st Qu.:22684 1st Qu.:2016-10-14 16:18:55 6 : 668
## Median :25698 Median :2016-11-06 11:51:08 10 : 661
## Mean :25764 Mean :2016-11-08 10:39:39 9 : 369
## 3rd Qu.:28501 3rd Qu.:2016-11-28 12:38:12 1 : 359
## Max. :70000 Max. :2017-02-23 18:03:30 13 : 346
## (Other):2918
## Comuna Zona IdSegmento IdPersona_1
## Length:6233 Min. :1.000 Min. : 1101101 Min. : 1.0
## Class :character 1st Qu.:1.000 1st Qu.: 5109123 1st Qu.: 40.0
## Mode :character Median :1.000 Median : 8301107 Median : 82.0
## Mean :1.159 Mean : 8503764 Mean :106.6
## 3rd Qu.:1.000 3rd Qu.:13101108 3rd Qu.:146.0
## Max. :2.000 Max. :15101204 Max. :590.0
##
## Ident7 Edad Edad_Codificada
## Min. :1918-11-26 Min. :15.00 Min. :1.000
## 1st Qu.:1952-04-23 1st Qu.:33.00 1st Qu.:2.000
## Median :1966-10-30 Median :50.00 Median :3.000
## Mean :1967-06-20 Mean :48.91 Mean :2.684
## 3rd Qu.:1983-07-25 3rd Qu.:64.00 3rd Qu.:3.000
## Max. :2002-01-19 Max. :98.00 Max. :4.000
##
names(df0)[grepl("Vitamina",names(df0))]
## [1] "v_25_OH_Vitamina_D2" "aux_25_OH_Vitamina_D2" "v_25_OH_Vitamina_D2_D3"
## [4] "v_25_OH_Vitamina_D3"
names(df0)[grepl("Fexp",names(df0))]
## [1] "Fexp_F1p_Corr" "Fexp_F2p_Corr" "Fexp_F1F2p_Corr"
## [4] "Fexp_EX1p_Corr" "Fexp_F1F2EX1p_Corr" "Fexp_EX2p_Corr"
## [7] "Fexp_F1F2EX2p_Corr" "Fexp_EX3p_Corr" "Fexp_F1F2EX3p_Corr"
names(df0)[grepl("Congl",names(df0))]
## [1] "Conglomerado"
names(df0)[grepl("Estr",names(df0))]
## [1] "Estrato"
names(df0)[grepl("m4p3",names(df0))]
## [1] "m4p3" "m4p3_1" "m4p3_2"
df0_ext<-df0 %>%
select("IdEncuesta","FechaInicioF1","Region","Zona", "Edad",
"Edad_Codificada","Sexo","NEDU1_MINSAL_1",
"Fexp_F1p_Corr","Fexp_F2p_Corr","Fexp_F1F2p_Corr",
"Fexp_EX1p_Corr","Fexp_F1F2EX1p_Corr", "Fexp_EX2p_Corr",
"Fexp_F1F2EX2p_Corr" ,"Fexp_EX3p_Corr","Fexp_F1F2EX3p_Corr",
"Conglomerado","Estrato","v_25_OH_Vitamina_D2_D3")
summary(df0_ext)
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 912 Min. :1.000
## 1st Qu.:22684 1st Qu.:2016-10-14 16:18:55 6 : 668 1st Qu.:1.000
## Median :25698 Median :2016-11-06 11:51:08 10 : 661 Median :1.000
## Mean :25764 Mean :2016-11-08 10:39:39 9 : 369 Mean :1.159
## 3rd Qu.:28501 3rd Qu.:2016-11-28 12:38:12 1 : 359 3rd Qu.:1.000
## Max. :70000 Max. :2017-02-23 18:03:30 13 : 346 Max. :2.000
## (Other):2918
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:33.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :50.00 Median :3.000 Median :2.000 Median :2.000
## Mean :48.91 Mean :2.684 Mean :1.629 Mean :1.983
## 3rd Qu.:64.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :98.00 Max. :4.000 Max. :2.000 Max. :3.000
## NA's :59
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 2.594 Min. : 3.131 Min. : 3.131
## 1st Qu.: 414.818 1st Qu.: 454.585 1st Qu.: 454.585
## Median : 1084.403 Median : 1209.348 Median : 1209.348
## Mean : 2329.371 Mean : 2630.248 Mean : 2630.248
## 3rd Qu.: 2633.405 3rd Qu.: 3017.059 3rd Qu.: 3017.059
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :713 NA's :713
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 3.163 Min. : 3.163 Min. : 4.32 Min. : 4.34
## 1st Qu.: 464.486 1st Qu.: 465.550 1st Qu.: 726.64 1st Qu.: 728.37
## Median : 1232.532 Median : 1241.019 Median : 1781.00 Median : 1791.02
## Mean : 2666.477 Mean : 2674.828 Mean : 3765.63 Mean : 3774.10
## 3rd Qu.: 3071.311 3rd Qu.: 3071.912 3rd Qu.: 4341.99 3rd Qu.: 4349.47
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :788 NA's :805 NA's :2386 NA's :2386
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 10.27 Min. : 10.41 Min. : 1101101 Min. : 11.00
## 1st Qu.: 1808.85 1st Qu.: 1821.60 1st Qu.: 5109123 1st Qu.: 51.00
## Median : 4823.59 Median : 4832.53 Median : 8301107 Median : 81.00
## Mean :10532.52 Mean :10551.58 Mean : 8503691 Mean : 84.27
## 3rd Qu.:12958.39 3rd Qu.:12962.88 3rd Qu.:13101108 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :4857 NA's :4857
## v_25_OH_Vitamina_D2_D3
## Min. : 1.30
## 1st Qu.:13.50
## Median :18.90
## Mean :19.74
## 3rd Qu.:24.90
## Max. :59.70
## NA's :3347
by(df0_ext, df0_ext$Sexo, summary)
## df0_ext$Sexo: 1
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20013 Min. :2016-09-03 18:35:55 7 : 350 Min. :1.000
## 1st Qu.:22616 1st Qu.:2016-10-14 11:57:05 10 : 257 1st Qu.:1.000
## Median :25585 Median :2016-11-04 19:28:26 6 : 247 Median :1.000
## Mean :25608 Mean :2016-11-07 11:56:48 1 : 140 Mean :1.143
## 3rd Qu.:28369 3rd Qu.:2016-11-27 04:25:12 3 : 138 3rd Qu.:1.000
## Max. :31839 Max. :2017-02-23 18:03:30 8 : 133 Max. :2.000
## (Other):1050
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :1 Min. :1.000
## 1st Qu.:31.00 1st Qu.:2.000 1st Qu.:1 1st Qu.:2.000
## Median :49.00 Median :3.000 Median :1 Median :2.000
## Mean :47.93 Mean :2.634 Mean :1 Mean :2.043
## 3rd Qu.:63.00 3rd Qu.:3.000 3rd Qu.:1 3rd Qu.:2.000
## Max. :95.00 Max. :4.000 Max. :1 Max. :3.000
## NA's :18
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 6.424 Min. : 8.888 Min. : 8.888
## 1st Qu.: 568.766 1st Qu.: 639.303 1st Qu.: 639.303
## Median : 1507.263 Median : 1786.075 Median : 1786.075
## Mean : 3080.486 Mean : 3532.108 Mean : 3532.108
## 3rd Qu.: 3768.953 3rd Qu.: 4395.049 3rd Qu.: 4395.049
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :296 NA's :296
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 8.888 Min. : 8.888 Min. : 9.11 Min. : 9.11
## 1st Qu.: 662.982 1st Qu.: 665.465 1st Qu.: 1038.81 1st Qu.: 1048.51
## Median : 1827.887 Median : 1837.309 Median : 2561.75 Median : 2579.03
## Mean : 3567.447 Mean : 3583.581 Mean : 5053.01 Mean : 5068.46
## 3rd Qu.: 4470.406 3rd Qu.: 4514.441 3rd Qu.: 6300.13 3rd Qu.: 6323.65
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :316 NA's :325 NA's :908 NA's :908
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 64.65 Min. : 64.65 Min. : 1101101 Min. : 11.00
## 1st Qu.: 2810.20 1st Qu.: 2813.99 1st Qu.: 5301101 1st Qu.: 51.00
## Median : 6550.06 Median : 6554.51 Median : 8301106 Median : 81.00
## Mean :14537.63 Mean :14553.73 Mean : 8501529 Mean : 84.21
## 3rd Qu.:20210.62 3rd Qu.:20237.63 3rd Qu.:13101108 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :1825 NA's :1825
## v_25_OH_Vitamina_D2_D3
## Min. : 1.30
## 1st Qu.:14.60
## Median :20.45
## Mean :21.14
## 3rd Qu.:26.75
## Max. :50.10
## NA's :1851
## ------------------------------------------------------------
## df0_ext$Sexo: 2
## IdEncuesta FechaInicioF1 Region Zona
## Min. :20006 Min. :2016-08-04 12:16:22 7 : 562 Min. :1.000
## 1st Qu.:22707 1st Qu.:2016-10-14 18:06:12 6 : 421 1st Qu.:1.000
## Median :25778 Median :2016-11-07 13:05:54 10 : 404 Median :1.000
## Mean :25856 Mean :2016-11-09 00:04:54 9 : 256 Mean :1.168
## 3rd Qu.:28584 3rd Qu.:2016-11-28 17:59:45 12 : 225 3rd Qu.:1.000
## Max. :70000 Max. :2017-02-22 20:02:35 2 : 223 Max. :2.000
## (Other):1827
## Edad Edad_Codificada Sexo NEDU1_MINSAL_1
## Min. :15.00 Min. :1.000 Min. :2 Min. :1.000
## 1st Qu.:34.00 1st Qu.:2.000 1st Qu.:2 1st Qu.:1.000
## Median :50.00 Median :3.000 Median :2 Median :2.000
## Mean :49.49 Mean :2.713 Mean :2 Mean :1.948
## 3rd Qu.:64.00 3rd Qu.:3.000 3rd Qu.:2 3rd Qu.:2.000
## Max. :98.00 Max. :4.000 Max. :2 Max. :3.000
## NA's :41
## Fexp_F1p_Corr Fexp_F2p_Corr Fexp_F1F2p_Corr
## Min. : 2.594 Min. : 3.131 Min. : 3.131
## 1st Qu.: 346.014 1st Qu.: 387.065 1st Qu.: 387.065
## Median : 876.152 Median : 965.907 Median : 965.907
## Mean : 1885.565 Mean : 2110.152 Mean : 2110.152
## 3rd Qu.: 2138.313 3rd Qu.: 2402.234 3rd Qu.: 2402.234
## Max. :24000.000 Max. :24000.000 Max. :24000.000
## NA's :417 NA's :417
## Fexp_EX1p_Corr Fexp_F1F2EX1p_Corr Fexp_EX2p_Corr Fexp_F1F2EX2p_Corr
## Min. : 3.163 Min. : 3.163 Min. : 4.32 Min. : 4.34
## 1st Qu.: 393.154 1st Qu.: 393.644 1st Qu.: 590.95 1st Qu.: 591.24
## Median : 983.557 Median : 984.762 Median : 1446.51 Median : 1448.54
## Mean : 2143.831 Mean : 2148.820 Mean : 3023.28 Mean : 3027.72
## 3rd Qu.: 2458.421 3rd Qu.: 2461.515 3rd Qu.: 3504.26 3rd Qu.: 3511.23
## Max. :24000.000 Max. :24000.000 Max. :34000.00 Max. :34000.00
## NA's :472 NA's :480 NA's :1478 NA's :1478
## Fexp_EX3p_Corr Fexp_F1F2EX3p_Corr Conglomerado Estrato
## Min. : 10.27 Min. : 10.41 Min. : 1101101 Min. : 11.00
## 1st Qu.: 1473.66 1st Qu.: 1482.27 1st Qu.: 5109122 1st Qu.: 51.00
## Median : 3715.37 Median : 3722.69 Median : 8301108 Median : 81.00
## Mean : 8317.50 Mean : 8338.20 Mean : 8504969 Mean : 84.32
## 3rd Qu.: 9521.39 3rd Qu.: 9521.39 3rd Qu.:13101109 3rd Qu.:131.00
## Max. :98000.00 Max. :98000.00 Max. :15101204 Max. :152.00
## NA's :3032 NA's :3032
## v_25_OH_Vitamina_D2_D3
## Min. : 2.00
## 1st Qu.:13.20
## Median :18.60
## Mean :19.47
## 3rd Qu.:24.60
## Max. :59.70
## NA's :1496
df0_ext<-df0 %>%
select( "IdEncuesta","Fexp_F1p_Corr","Fexp_F1F2EX1p_Corr", "Conglomerado","Estrato","v_25_OH_Vitamina_D2_D3")
df0_ext[1:10,]
• Weights provided with complex datasets are positive values associated with each (response) unit in the sample. There may be different weights for different stages of a survey (e.g. interview; nurse examination; collection of blood samples)
• Sampling weights: “inverse probability weights” are the inverse / reciprocal of the probability of selection
• The main purpose is to reduce bias (systematic error) in population estimates by up-weighting population subgroups that are under-represented in the sample and down-weighting subgroups that are over-represented in the sample
• Unequal probability of selection
• Differences in response rates
• Post-stratification: adjust the sample distribution for key variables of interest (e.g. age, sex, region) so that the weighted sample conforms to a known population distribution in analysis (i.e. an adjustment after data has been collected)
• Weights for analysis can include all 3 components: selection × (non-response × poststratification)
We won't worry today about how the weights are calculated, but note that subjects aged 65 years or more were deliberately oversampled to ensure adequate numbers for subgroup analysis (and bearing in mind their response rates); and men (especially young men) are less likely to respond to surveys than women.
To achieve nationally-representative estimates (avoid bias) you would expect the weighting scheme to show: (1) lower weights for people aged 65 or more (compensate for over-sampling); and (2) higher weights for young men (adjust for lower response)
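A toy illustration of inverse-probability weights (hypothetical selection probabilities, not the actual ENS scheme):
p_sel <- c(adult_65plus = 1/200, young_men = 1/1000) # probability of selection
w <- 1 / p_sel # base weight = reciprocal of the selection probability
w # the oversampled 65+ group gets the smaller weight, as expected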
SEs (estimates of variability) attached to estimates from complex surveys tend to be larger than those from a Simple Random Sample (of the same size): this is the statistical price to be paid for the cost and convenience of complex designs.
Typically, stratification reduces SEs; clustering increases them
The difference in precision of estimates produced by a complex design relative to a SRS is estimated by the Design Effect (DEFF)
\[DEFF=\frac{variance~estimate~(complex~design)}{variance~estimate~(SRS)}\]
• Design effects (DEFF) and root design effects (design factor, DEFT) compare the sample-to-sample variability from a given survey dataset with a hypothetical SRS design with the same number of individuals sampled from the population.
It is the ratio of the actual variance of an estimate under the sampling method used (e.g. stratification & clustering) to the variance calculated under the assumption of SRS.
The DEFF increases both as cluster size increases and as rho increases (fewer clusters and greater homogeneity within them: the intraclass correlation)
DEFF = 1 implies similar precision for the complex design and SRS
Magnitude of the design effect is estimate-specific
DEFF increases both as cluster size increases and as ρ increases (sample more people who are alike)
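Kish's standard approximation for roughly equal cluster sizes makes this relationship explicit:
\[DEFF \approx 1+(b-1)\rho\]
where \(b\) is the average number of sampled individuals per cluster and \(\rho\) is the intraclass correlation.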
Interpretation? A DEFF = 2.0 means that the sampling variance for that estimate is twice as large as it would be if the survey had been an SRS of the same size (hypothetical). A guide to the greater imprecision in the estimates.
Alternative interpretation: only half as many sample cases would be needed to obtain estimates with the same precision had you used an SRS (effective sample size = n_complex / DEFF)
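A quick worked example (hypothetical numbers):
n_complex <- 6000 # complex-survey sample size (hypothetical)
DEFF <- 2.0 # design effect (hypothetical)
n_complex / DEFF # effective sample size: an SRS of 3000 gives the same precision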
But remember the costs and convenience!
The SE is the square root of the variance.
Similarly, the design factor (DEFT) is the square root of the DEFF: \(DEFT=\sqrt{DEFF}\)
A DEFT of 2.0 indicates that the SEs are twice as large as they would have been had the design been SRS.
DEFT is typically greater than 1: so SEs for complex surveys are larger than those assuming SRS.
Specify survey design features
Compute estimates of interest: means; crosstabulations; proportions; regression coefficients taking into account those design features
Possible post-estimation commands giving more information: e.g. design effects; design factors; appropriate statistical testing for complex survey data
Use the weight with the smallest n among the variables you will analyse
F1: Visit 1 - interviewer
F2: Visit 2 - nurse
Ex: Laboratory exams (several)
And their combinations (intersections)
Weights ENS2017: “Fexp_F1p_Corr”,“Fexp_F2p_Corr”,“Fexp_F1F2p_Corr”, “Fexp_EX1p_Corr”,“Fexp_F1F2EX1p_Corr”, “Fexp_EX2p_Corr”, “Fexp_F1F2EX2p_Corr” ,“Fexp_EX3p_Corr”,“Fexp_F1F2EX3p_Corr”
FactoresExp1 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=1)
FactoresExp3 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=3)
FactoresExp2 <- rio::import('ENS2017_FactoresExp.xlsx', sheet=2)
FactoresExp2
table(df0$NEDU1_MINSAL_1)
##
## 1 2 3
## 1477 3323 1374
table(df0$NEDU1_MINSAL_1, useNA = "ifany")
##
## 1 2 3 <NA>
## 1477 3323 1374 59
prop.table(table(df0$NEDU1_MINSAL_1))
##
## 1 2 3
## 0.2392290 0.5382248 0.2225462
hist(df0$v_25_OH_Vitamina_D2_D3)
df0$Gender <- factor(df0$Sexo,
levels=c("1","2"),
labels=c("Male","Female"))
df0$Educational_level <- factor(df0$NEDU1_MINSAL_1,
levels=c("1","2","3"),
labels=c("Low","Mid","High"))
df0$Area <- factor(df0$Zona,
levels=c("1","2"),
labels=c("Urban","Rural"))
df0$Age <- factor(df0$Edad_Codificada,
levels=c("1","2","3","4"),
labels=c("17-24","25-44","45-64","65+"))
df0$Conglomerado_ <- df0$Conglomerado
df0$strata_ <- df0$Estrato
df0$fexp <- df0$Fexp_F1F2EX1p_Corr
Explore using pipes %>% (magrittr): a sequence of actions
First I wake up %>%
Then I get up %>%
Finally, I make coffee
Nested equivalent: Finally, I make coffee(Then I get up(First I wake up))
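The same idea as a tiny runnable sketch; both lines below are equivalent:
round(sqrt(sum(c(1, 4, 9)))) # nested calls, read inside-out
c(1, 4, 9) %>% sum() %>% sqrt() %>% round() # piped, read left-to-right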
df0$Area<-df0$Zona
res_0b <- df0 %>%
dplyr::group_by(Region,Area,strata_) %>%
summarize(Conglomerado_l = length(unique(Conglomerado_)))
res_0b
res_0b <- df0 %>%
group_by(Region,Area,strata_) %>%
summarize(missing_fexp = sum(is.na(Fexp_F1F2EX1p_Corr)),
valid_fexp = sum(!is.na(Fexp_F1F2EX1p_Corr)),
mean_fexp = mean(Fexp_F1F2EX1p_Corr, na.rm=TRUE)
)
res_0b
Explore
length(unique(df0$Conglomerado_))
## [1] 1077
# table(df0$Conglomerado_,df0$strata_)
Explore using ggplot
ggplot(df0)+
geom_point(aes(x=Edad, y=Fexp_F1F2EX1p_Corr))
ggplot(df0)+
geom_point(aes(x=Edad, y=Fexp_F1F2EX1p_Corr,
col=Gender, shape=Gender))
df0$exposure <- df0$Educational_level
df0$outcome <- df0$v_25_OH_Vitamina_D2_D3
df0$outcome1 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<30)
df0$outcome2 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<20)
df0$outcome3 <- as.numeric(df0$v_25_OH_Vitamina_D2_D3<12)
# Ordered overwriting: each later, stricter condition takes precedence
df0$outcome5_ <- NA
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3>=30] <- 0 # 30 or more
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<30] <- 1 # [20, 30)
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<20] <- 2 # [12, 20)
df0$outcome5_[df0$v_25_OH_Vitamina_D2_D3<12] <- 3 # below 12
df0$outcome5_ <- as.factor(df0$outcome5_)
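An equivalent, more compact alternative uses cut(); a sketch, where outcome5_cut is a hypothetical name (right = FALSE makes the intervals [lower, upper)):
df0$outcome5_cut <- cut(df0$v_25_OH_Vitamina_D2_D3,
                        breaks = c(-Inf, 12, 20, 30, Inf),
                        labels = c("3", "2", "1", "0"),
                        right = FALSE)
# table(df0$outcome5_, df0$outcome5_cut) # check agreement with the loop above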
Create a subset with valid values for fexp, strata_ and Conglomerado_
# drop cases
df0 <- subset(df0,!is.na(fexp)& !is.na(strata_)& !is.na(Conglomerado_))
# I have df0_original in case I want to use another fexp
survey_design <- df0 %>%
as_survey_design(id=Conglomerado_,
weight = fexp,
strata=strata_
)
options(survey.lonely.psu="certainty")
# nest=TRUE IS REQUIRED TO ANALYSE TRENDS USING ENS. If TRUE, cluster ids are relabelled to enforce nesting within strata (the same PSU labels appear in different surveys).
res_0b <- df0 %>%
group_by(Region,Area,strata_) %>%
summarize(Conglomerado_l = length(unique(Conglomerado_)))
Single_PSU <- subset(res_0b,Conglomerado_l==1)
Single_PSU
http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html
If only one PSU is sampled from a particular stratum the variance can’t be computed (there is no unbiased estimator and the standard estimator gives 0/0). Variance estimation in sample surveys involves variances computed within primary sampling units.
One exception to this is “certainty” PSUs in sampling without replacement, where the population has only one PSU in the stratum. With 100% sampling, there is no contribution to the variance from the first stage of sampling in this stratum. A single-PSU stratum makes no contribution to the variance (for multistage sampling it makes no contribution at that level of sampling).
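For reference, the survey package supports several policies for lonely PSUs (see ?survey.lonely.psu); this workshop uses "certainty":
# How the survey package should treat single-PSU (lonely) strata:
options(survey.lonely.psu = "certainty") # no variance contribution at that level
# Other accepted values: "fail" (the default: raise an error), "remove",
# "adjust" (centre at the grand mean) and "average" (use the average variance)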
“As a general rule when working with complex survey data such as NHANES, you should never drop records from your analysis dataset before executing your analysis procedures. Instead, use the special statements provided in your software’s analysis procedure to perform subgroup analyses.” https://wwwn.cdc.gov/nchs/nhanes/tutorials/module4.aspx
survey_design1 <- subset(survey_design,
!is.na(outcome) &!is.na(exposure) & Edad>=15)
# Mean and 95% CI Simple Random Sample
ttest_table<-t.test(df0$outcome)
ttest_table$estimate
## mean of x
## 19.73233
ttest_table$conf.int
## [1] 19.42651 20.03814
## attr(,"conf.level")
## [1] 0.95
# Survey design created
survey_design <- df0 %>%
as_survey_design(id=Conglomerado_,
weight = fexp,
strata=strata_)
options(survey.lonely.psu="certainty")
# Subset survey design with valid outcome (vit D) and exposure (Educ level)
survey_design1 <- subset(survey_design,
!is.na(outcome) &!is.na(exposure) & Edad>=15 )
# Mean and SE: the simplest code
survey_design1 %>%
summarize(mean = survey_mean(outcome))
# outcome in this case is v_25_OH_Vitamina_D2_D3
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3))
# Mean and 95%CI
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")))
# Mean and 95%CI and unweighted n
survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")),
n = unweighted(n()))
# Mean and 95%CI and unweighted n saved in a table
res_survey <- survey_design1 %>%
summarize(mean = survey_mean(v_25_OH_Vitamina_D2_D3, vartype = c("se","ci")),
n = unweighted(n()))
res_survey
# Mean and 95%CI by groups (exposure:educ level)
res_v0 <- survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v0$Sexo<-0
res_v0
# Mean and 95%CI by groups (sex and exposure:educ level)
res_v1 <- survey_design1 %>%
group_by(Sexo, exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
Generate a table of means, SEs, n and 95% CIs
# Bind tables
res_v01 <- bind_rows(res_v0, res_v1)
# Move variable Sexo as first column
res_v01 <- res_v01 %>%
select(Sexo, everything())
res_v01
survey_design1 <- subset(survey_design,Edad<50 & Sexo==2&!is.na(outcome) & !is.na(exposure))
# Eligibility criterion for the Vit D exam: ((female) & (age>=15 & age<=49)) OR (age>=65)
res_v0 <- survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v0$Region<-"Total"
res_v1 <- survey_design1 %>%
group_by(Region, exposure) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
res_v01 <- bind_rows(res_v0, res_v1)
res_v01 <- res_v01 %>%
select(Region, everything())
res_v01
# names(res0)
ggplot(res_v01)+
geom_point(aes(x=Region, y=mean, col=exposure))
plot1vd<-ggplot(res_v01)+
geom_point(aes(x=Region, y=mean, col=exposure))+
geom_errorbar(aes(x=Region,
ymin = mean_low, ymax = mean_upp, col=exposure),
width = 0.1,
size = 0.1,
position = position_dodge(0.9)
)+
labs(y="mean vitamin D levels (ng / mL)",
x="Nivel educacional")+
theme_minimal()
plot1vd
# Analysis ignoring the sampling design
prop_t1<-prop.test(sum(df0$outcome1, na.rm=T), sum(!is.na(df0$outcome1)))
prop_t1$conf.int[1]
## [1] 0.8741721
prop_t1$conf.int[2]
## [1] 0.8976996
# Analysis using the sampling design
# to obtain the proportion of each category
survey_design1 <- subset(survey_design,!is.na(outcome1) & !is.na(exposure))
res1<-survey_design1 %>%
group_by(exposure,outcome1) %>%
summarize(mean = survey_mean(
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
# https://r-survey.r-forge.r-project.org/survey/html/svyciprop.html
res1$outcome<-"<30"
# For binary 0-1 outcomes
res1<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome1,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res1$outcome<-"<30"
res2<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome2,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res2$outcome<-"<20"
res3<-survey_design1 %>%
group_by(exposure) %>%
summarize(mean = survey_mean(outcome3,
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res3$outcome<-"<12"
res_t_p<-bind_rows(res1,res2,res3)
res_t_p
plot2<-ggplot(res_t_p)+
geom_bar(aes(x = exposure, y = mean, fill=outcome),
width = 0.9, stat="identity",
position = position_dodge(0.8))+
geom_errorbar(aes(x=exposure,
ymin = mean_low, ymax = mean_upp, fill=outcome),
width = 0.1,size = 0.1, position = position_dodge(0.9))
plot2
plot2<-plot2 +
facet_grid(.~outcome, scales = "fixed")
plot2
# https://ggplot2.tidyverse.org/reference/labs.html
# https://www.r-graph-gallery.com/275-add-text-labels-with-ggplot2.html
survey_design1 <- subset(survey_design,!is.na(outcome5_) & !is.na(exposure))
res1<-survey_design1 %>%
group_by(exposure, outcome5_) %>%
summarize(mean = survey_mean(
vartype = c("se","ci"),
proportion = TRUE,
prop_method = c("likelihood")),
n = unweighted(n()))
res1
write.csv(res1,file="check.csv")
res1<-survey_design %>%
subset(!is.na(outcome1) & !is.na(exposure)) %>%
subset(Sexo==1) %>%
group_by(exposure, outcome1) %>%
summarize(svytotal = survey_total(),
n = unweighted(n()))
res1
# svytable(~outcome1 +exposure, subset(survey_design, Sexo==1))
svychisq(~outcome1 +exposure, subset(survey_design, Sexo==1))
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: NextMethod()
## F = 0.43519, ndf = 1.8662, ddf = 1683.3296, p-value = 0.6333
svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
First I describe, then I model
# Describe
res_v1 <- survey_design1 %>%
group_by(Area) %>%
summarize(mean = survey_mean(outcome,
vartype = c("se","ci"),
deff=TRUE),
n = unweighted(n()))
res_v1
# simple random sample
fit2_SRS <- glm(outcome~ Area+Edad,
data=subset(df0, !is.na(outcome)
&!is.na(exposure) & Edad>=15))
summary(fit2_SRS)
##
## Call:
## glm(formula = outcome ~ Area + Edad, data = subset(df0, !is.na(outcome) &
## !is.na(exposure) & Edad >= 15))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19.273 -5.951 -0.900 5.116 41.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.335781 0.591592 27.613 < 2e-16 ***
## Area 4.420967 0.402864 10.974 < 2e-16 ***
## Edad -0.035053 0.006956 -5.039 4.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 66.90658)
##
## Null deviance: 200345 on 2858 degrees of freedom
## Residual deviance: 191085 on 2856 degrees of freedom
## AIC: 20136
##
## Number of Fisher Scoring iterations: 2
# Model using the survey design
fit2 <- svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
summary(fit2)
##
## Call:
## svyglm(formula = outcome ~ Area + Edad, design = survey_design,
## deff = TRUE)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.57713 1.05260 14.799 < 2e-16 ***
## Area 4.74497 0.64577 7.348 4.28e-13 ***
## Edad -0.02449 0.01257 -1.948 0.0517 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 52.01855)
##
## Number of Fisher Scoring iterations: 2
fit2 <- svyglm(outcome~ Area+Edad, survey_design, deff=TRUE)
fit2 <- summary(fit2)
fit2 <- as.data.frame(coef(fit2))
fit2$name <- rownames(fit2)
fit2$outcome <- "Vit D numeric"
fit2
write.csv(fit2,file="fit2.csv")
svyglm(outcome1 ~Area+Edad, survey_design, family = quasibinomial(link = "logit"), deff=TRUE)
svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (1001) clusters.
## Called via srvyr
## Sampling variables:
## - ids: Conglomerado_
## - strata: strata_
## - weights: fexp
##
## Call: svyglm(formula = outcome1 ~ Area + Edad, design = survey_design,
## family = quasibinomial(link = "logit"), deff = TRUE)
##
## Coefficients:
## (Intercept) Area Edad
## 3.2191557 -0.9665616 -0.0004236
##
## Degrees of Freedom: 2879 Total (i.e. Null); 969 Residual
## (2548 observations deleted due to missingness)
## Null Deviance: 1632
## Residual Deviance: 1600 AIC: NA
fit2 <- svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
summary(fit2)
##
## Call:
## svyglm(formula = outcome1 ~ Area + Edad, design = survey_design,
## family = quasibinomial(link = "logit"), deff = TRUE)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2191557 0.3697096 8.707 < 2e-16 ***
## Area -0.9665616 0.2005148 -4.820 1.66e-06 ***
## Edad -0.0004236 0.0042062 -0.101 0.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.8087385)
##
## Number of Fisher Scoring iterations: 4
# res1a <- svyby(~outcome1, ~Area, survey_design, svymean, vartype=c("se","ci"),
# deff=TRUE)
fit_t <- svyglm(outcome1 ~Area+Edad, survey_design,
family = quasibinomial(link = "logit"), deff=TRUE)
fit_t <- summary(fit_t)
fit_t <- as.data.frame(coef(fit_t))
fit_t$name <- rownames(fit_t)
fit_t$outcome <- "VitD<30"
fit_t$OR <- round(exp(fit_t$Estimate),digits=2)
fit_t$OR_i <- round(exp((fit_t$Estimate)-(1.96*fit_t$`Std. Error`)),2)
fit_t$OR_s <- round(exp((fit_t$Estimate)+(1.96*fit_t$`Std. Error`)),2)
fit_t$Estimate <- NULL
fit_t$`Std. Error` <- NULL
fit_t$`t value` <- NULL
fit_t$pvalue <- round(fit_t$`Pr(>|t|)`,digits=4)
fit_t$`Pr(>|t|)` <- NULL
fit_t <- subset(fit_t,name!="(Intercept)")
fit_t$name <- NULL
fit_t
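As an alternative to the hand-rolled 1.96 normal approximation above, a sketch using confint() on the svyglm object itself (before summarising it), which gives design-based Wald intervals:
fit <- svyglm(outcome1 ~ Area + Edad, survey_design,
              family = quasibinomial(link = "logit"))
round(exp(cbind(OR = coef(fit), confint(fit))), 2) # ORs with 95% CIs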
Expansion factors (weights): “fact_af1”, “fact_af2”, “fact_aex”
Stratum: “region”
Cluster: “comuna”
You must choose an expansion factor. The expansion factor depends on your analysis variables.
Expansion factors (weights): “factor”, “FEXP1”, “FEXP2”, “FEXP_ex”, “FEXP_factor”
Stratum: “estrato”
Cluster: “id_Comuna”, “conglomerado1”, “conglomerado2”
You must choose an expansion factor and a cluster. The expansion factor depends on your analysis variables. As for the cluster… all three can be justified. I use the one that gives the largest estimation error (the most “conservative”).
Subscribe at https://rpubs.com/
Do you see the blue “Publish” button? Press it, select RPubs and follow the instructions
https://faculty.washington.edu/tlumley/old-survey/survey-wss.pdf
https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx
https://github.com/pachamaltese/casen/blob/master/R/descriptive_statistics.R
https://www.tidyverse.org/packages/
https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
https://jef.works/art-with-code/
https://medium.com/@alinastepanova/tiny-art-with-r-d6d93d110619
Individual assignment no. 1 - Course MSP 3010, 2021. Objective: acquire the skills to analyse variables included in the Chilean National Health Survey databases, using a statistical package for “complex samples”. Instructions:
1. Beforehand, review the questionnaires of any ENS database publicly available in Chile.
2. Choose a topic of interest and build a specific indicator that can be dichotomised (0-1).
3. Calculate a national prevalence rate, and one by sex, for that parameter.
4. Calculate the prevalence rates by educational level (low, mid, high).
5. Estimate the odds ratios comparing the prevalence of the low vs. high educational level, adjusted for age and sex (using logistic regression in the complex-samples module).
6. Try running the same calculations from points 3, 4 and 5 outside the complex-samples module. Do you notice any change? Of what magnitude? Why does this happen? Discuss with your classmates.
7. Submit 2 files by email:
File 1. A file with 3 tables (example):
“Prevalence of (consumption of 5 or more fruits or vegetables per day) by sex, Chile ENS 20XX”
“Prevalence of … by educational level, Chile ENS …”
“Age- and sex-adjusted odds ratio comparing the prevalence of … between the low and high educational levels”
File 2. A file with the syntax used to run these analyses.
For this assignment you will have the following support:
PDF slides on the ENS databases and on the statistical theory behind complex sampling.
Readings and notes on using the complex-samples module in STATA or R.
STATA or R workshops.
Due date: Wednesday 9 November (until 23:59). Send the files, labelled 1_yourname.xls and 2_yourname.doc, to Alvaro Passi: passi.solar@gmail.com, with a copy to Francisco Valenzuela: fjvalen2@uc.cl