We begin with the pedigree file constructed by the amazing Lisa Reiber (https://lisallreiber.github.io/GeneAnalysis/index.html). Using SOEP-IS data v34, Lisa identified the following family structure in the GSOEP sample:
Using Lisa’s R syntax and SOEP-IS data v35 (i.e., 2018), we reconstruct her genetic pedigree and observe the following:
gen_ped <- import("../Input/Pedigree/gen_pedigree_igene.rda")
gen_ped %>%
count("mother avail" = !(is.na(mother_id)),
"father avail" = !(is.na(father_id)))
## # A tibble: 3 x 3
## `mother avail` `father avail` n
## <lgl> <lgl> <int>
## 1 FALSE TRUE 27
## 2 TRUE FALSE 141
## 3 TRUE TRUE 227
So it appears that we have gained a few genetic trios, but otherwise we find almost exactly the sample sizes we expected using the v35 data.
Next, we import the GSOEP data file so that we can identify the GSOEP cases that passed the QC process:
gsoep <- import("../Input/I_Gene/GSOEP_PGI_v1.1.txt") %>%
select(pid, QCpass) %>%
drop_na()
(n_distinct(gsoep$pid))
## [1] 2494
After importing the QC status of the GSOEP sample, we see that n=2,494 unique cases exist in the data.
Here is the breakdown of the QC pass/fail rates in the GSOEP:
gsoep %>%
mutate(QC_Status = ifelse(QCpass==1, "Pass", "Fail")) %>%
count(QC_Status)
## QC_Status n
## 1 Fail 250
## 2 Pass 2245
Now that we have identified which cases passed the PGI QC process, we will drop the cases that failed QC from our pedigree file and observe the changes in the family structure of the GSOEP sample.
We begin by dropping all cases that failed the QC process:
gsoep_qc <- gsoep %>%
filter(QCpass==1)
(n_distinct(gsoep_qc$pid))
## [1] 2244
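Note that the pass/fail table has 250 + 2,245 = 2,495 rows while there are only 2,494 unique pids (and 2,244 unique passing pids), which suggests that one pid is listed twice. A minimal sketch of how such duplicates could be located, using toy data; `qc_toy`, `dup_pids`, and `qc_toy_unique` are hypothetical names that are not part of the original analysis:

```r
library(dplyr)

# Toy QC table (hypothetical values); pid 3 is listed twice
qc_toy <- tibble(
  pid    = c(1, 2, 3, 3, 4),
  QCpass = c(1, 0, 1, 1, 1)
)

# pids that occur on more than one row
dup_pids <- qc_toy %>%
  count(pid, name = "n_rows") %>%
  filter(n_rows > 1)

# one row per pid and QC status
qc_toy_unique <- distinct(qc_toy, pid, QCpass)
```

Because the later steps only use the QC file in semi_join(), a duplicated pid does not duplicate pedigree rows, so this check is mainly useful for reconciling the counts.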
Then, we follow a three-step process:
Separate the child, mother, and father IDs
Keep only the IDs that passed QC
Merge genetic pedigree file back together using temporary pair ID
#generate temporary pair IDs to facilitate re-merging data
gen_ped <- gen_ped %>%
mutate(temp_id = row_number())
# STEP 1 - Separate IDs
child <- select(gen_ped, child_id, temp_id)
mother <- select(gen_ped, mother_id, temp_id)
father <- select(gen_ped, father_id, temp_id)
#STEP 2 - Keep only cases that passed QC
child <- child %>%
semi_join(gsoep_qc, by = c("child_id" = "pid"))
mother <- mother %>%
semi_join(gsoep_qc, by = c("mother_id" = "pid"))
father <- father %>%
semi_join(gsoep_qc, by = c("father_id" = "pid"))
#STEP 3 - Merge IDs back into duo/trio pairs
gen_ped_qc <- full_join(child, mother, by = "temp_id")
gen_ped_qc <- full_join(gen_ped_qc, father, by = "temp_id")
gen_ped_qc <- select(gen_ped_qc, -temp_id)
#drop cases with: 1) missing child IDs and 2) cases missing both parent IDs
gen_ped_qc <- gen_ped_qc %>%
filter(!is.na(child_id)) %>%
filter(!is.na(mother_id) | !is.na(father_id))
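As a cross-check on the three-step process, here is a minimal sketch of a more compact alternative that blanks out any ID that did not pass QC and then applies the same two drop rules. This is not the approach used above; `ped_toy` and `pass_ids` are hypothetical toy objects, and we assume the IDs are stored as numeric values:

```r
library(dplyr)

# Toy pedigree (hypothetical IDs): one row per child
ped_toy <- tibble(
  child_id  = c(11, 12, 13, 14),
  mother_id = c(21, 22, NA, 24),
  father_id = c(31, NA, 33, 34)
)

# Hypothetical set of IDs that passed QC
pass_ids <- c(11, 12, 13, 21, 31, 33, 34)

ped_toy_qc <- ped_toy %>%
  # set every ID that is not in the QC-pass set to NA
  mutate(across(c(child_id, mother_id, father_id),
                ~ if_else(.x %in% pass_ids, .x, NA_real_))) %>%
  # drop rows without a child, then rows without any parent
  filter(!is.na(child_id)) %>%
  filter(!is.na(mother_id) | !is.na(father_id))
```

In the toy data, child 14 fails QC and child 12 loses its only parent, so two rows remain: one trio (11, 21, 31) and one father-child duo (13, 33).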
Now that we are only working with cases that passed QC, let’s see what the family structure in the GSOEP sample looks like now:
gen_ped_qc %>%
count("mother avail" = !(is.na(mother_id)),
"father avail" = !(is.na(father_id)))
## # A tibble: 3 x 3
## `mother avail` `father avail` n
## <lgl> <lgl> <int>
## 1 FALSE TRUE 35
## 2 TRUE FALSE 114
## 3 TRUE TRUE 168
Finally, we will add pair IDs for pair type (duo or trio), pair number (which pair), and position within the pair (child, mother, or father). These will allow us to stack IDs and merge with conventionally structured data files.
#Trios
#select full trios.
#add pair type and pair number
trios <- gen_ped_qc %>%
filter(!is.na(mother_id) & !is.na(father_id)) %>%
arrange(child_id) %>%
mutate(pair_type = "T",
pair_num = row_number(),
pair_num = str_pad(pair_num, 3, side = "left", pad = "0"))
#separate IDs
#add pair position
tchild <- trios %>%
select(pid = child_id, pair_type, pair_num) %>%
mutate(pair_pos = "Ch")
tmother <- trios %>%
select(pid = mother_id, pair_type, pair_num) %>%
mutate(pair_pos = "Mo")
tfather <- trios %>%
select(pid = father_id, pair_type, pair_num) %>%
mutate(pair_pos = "Fa")
# Duos
#select full duos
#add pair type and pair number
duos <- gen_ped_qc %>%
filter(is.na(mother_id) | is.na(father_id)) %>%
arrange(child_id) %>%
mutate(pair_type = "D",
pair_num = row_number(),
pair_num = str_pad(pair_num, 3, side = "left", pad = "0"))
#separate IDs
#add pair position
dchild <- select(duos, pid = child_id, pair_type, pair_num) %>%
mutate(pair_pos = "Ch")
dmother <- select(duos, pid = mother_id, pair_type, pair_num) %>%
mutate(pair_pos = "Mo")
dfather <- select(duos, pid = father_id, pair_type, pair_num) %>%
mutate(pair_pos = "Fa")
Now we stack individual pedigree files:
gen_ped_qc_stacked <- rbind(tchild,dchild,
tmother,dmother,
tfather,dfather
)
#stacked dataframe means that no IDs should be missing
gen_ped_qc_stacked <- gen_ped_qc_stacked %>%
drop_na() %>%
distinct(., .keep_all = TRUE)
#concatenate PIDs with pair ID information to create pair-specific unique IDs
gen_ped_qc_stacked <- gen_ped_qc_stacked %>%
mutate(pair_id = paste0(pair_type, pair_num, pair_pos, "_", pid)) %>%
arrange(pair_num, pair_type)
head(gen_ped_qc_stacked, n=20)
## # A tibble: 20 x 5
## pid pair_type pair_num pair_pos pair_id
## <dbl> <chr> <chr> <chr> <chr>
## 1 1161602 D 001 Ch D001Ch_1161602
## 2 1161603 D 001 Fa D001Fa_1161603
## 3 1233704 T 001 Ch T001Ch_1233704
## 4 1233702 T 001 Mo T001Mo_1233702
## 5 2093203 T 001 Fa T001Fa_2093203
## 6 1342703 D 002 Ch D002Ch_1342703
## 7 1342702 D 002 Mo D002Mo_1342702
## 8 1342707 T 002 Ch T002Ch_1342707
## 9 1342702 T 002 Mo T002Mo_1342702
## 10 2156903 T 002 Fa T002Fa_2156903
## 11 1342705 D 003 Ch D003Ch_1342705
## 12 1342702 D 003 Mo D003Mo_1342702
## 13 1342803 T 003 Ch T003Ch_1342803
## 14 2098603 T 003 Mo T003Mo_2098603
## 15 1342802 T 003 Fa T003Fa_1342802
## 16 1344404 D 004 Ch D004Ch_1344404
## 17 1344403 D 004 Mo D004Mo_1344403
## 18 1360004 T 004 Ch T004Ch_1360004
## 19 1360002 T 004 Mo T004Mo_1360002
## 20 2097903 T 004 Fa T004Fa_2097903
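The same stacked layout can also be sketched in one pass with tidyr::pivot_longer(), which may be easier to follow than six separate data frames. This is only an illustration on toy data; `ped_toy`, `pos_lookup`, and `stacked_toy` are hypothetical names, and the pair numbering is assumed to follow the same rule as above (ordered by child_id within pair type):

```r
library(dplyr)
library(tidyr)

# Toy QC'd pedigree (hypothetical IDs): one trio and one father-child duo
ped_toy <- tibble(
  child_id  = c(11, 13),
  mother_id = c(21, NA),
  father_id = c(31, 33)
)

# lookup from the wide column name to the pair position code
pos_lookup <- c(child_id = "Ch", mother_id = "Mo", father_id = "Fa")

stacked_toy <- ped_toy %>%
  mutate(pair_type = if_else(!is.na(mother_id) & !is.na(father_id), "T", "D")) %>%
  arrange(child_id) %>%
  group_by(pair_type) %>%
  mutate(pair_num = sprintf("%03d", row_number())) %>%  # zero-padded, per pair type
  ungroup() %>%
  # one row per person; missing parents are dropped while stacking
  pivot_longer(c(child_id, mother_id, father_id),
               names_to = "role", values_to = "pid",
               values_drop_na = TRUE) %>%
  mutate(pair_pos = unname(pos_lookup[role]),
         pair_id  = paste0(pair_type, pair_num, pair_pos, "_", pid)) %>%
  select(pid, pair_type, pair_num, pair_pos, pair_id)
```

For the toy pedigree this yields five rows, with IDs such as T001Mo_21 and D001Fa_33.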
Let’s check the number of unique duos and trios.
#Trios first
(n_distinct(gen_ped_qc_stacked$pair_num[gen_ped_qc_stacked$pair_type=="T"], na.rm = TRUE))
## [1] 168
#Duos next
(n_distinct(gen_ped_qc_stacked$pair_num[gen_ped_qc_stacked$pair_type=="D"], na.rm = TRUE))
## [1] 149
Looks like we didn’t lose any pairs while stacking.
We are now ready to merge longitudinal phenotype data to our QC’d genetic pedigree file backbone.
We start by importing the longitudinal tracker file developed by Lisa. This will allow us to get some handle on the demographics of the GSOEP sample.
soepis_long <- import("../Code/soepis_igene_longish.rda") %>%
select(pid, hid, cid, sex, syear, byear, bmonth) %>%
arrange(pid, syear)
head(soepis_long, n=20)
## # A tibble: 20 x 7
## pid hid cid sex syear byear bmonth
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1161602 NA 219940 2 2008 2008 5
## 2 1161602 116165 219940 2 2009 2008 5
## 3 1161602 116165 219940 2 2010 2008 5
## 4 1161602 116165 219940 2 2011 2008 5
## 5 1161602 116165 219940 2 2012 2008 5
## 6 1161602 116165 219940 2 2013 2008 5
## 7 1161602 116165 219940 2 2014 2008 5
## 8 1161602 116165 219940 2 2015 2008 5
## 9 1161602 116165 219940 2 2016 2008 5
## 10 1161602 116165 219940 2 2017 2008 5
## 11 1161602 116165 219940 2 2018 2008 5
## 12 1161603 NA 219940 1 1998 1976 1
## 13 1161603 NA 219940 1 1999 1976 1
## 14 1161603 NA 219940 1 2000 1976 1
## 15 1161603 NA 219940 1 2001 1976 1
## 16 1161603 NA 219940 1 2002 1976 1
## 17 1161603 NA 219940 1 2003 1976 1
## 18 1161603 NA 219940 1 2004 1976 1
## 19 1161603 NA 219940 1 2005 1976 1
## 20 1161603 NA 219940 1 2006 1976 1
Next we will merge the longitudinal tracker file with our stacked pedigree file. This must be done by pair, however, as some pairs have overlapping members (this is the case for parents with multiple children).
gsoep_long <- gen_ped_qc_stacked %>%
group_by(pair_type, pair_num) %>%
left_join(soepis_long, by = c("pid")) %>%
ungroup() %>%
select(pid, cid, hid, sex, syear, byear, bmonth,
starts_with("pair")) %>%
arrange(pid, syear) %>%
drop_na(pair_id) %>%
distinct(pair_id, syear, .keep_all = TRUE)
head(gsoep_long, n=20)
## # A tibble: 20 x 11
## pid cid hid sex syear byear bmonth pair_type pair_num pair_pos
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> <chr>
## 1 1161602 219940 NA 2 2008 2008 5 D 001 Ch
## 2 1161602 219940 116165 2 2009 2008 5 D 001 Ch
## 3 1161602 219940 116165 2 2010 2008 5 D 001 Ch
## 4 1161602 219940 116165 2 2011 2008 5 D 001 Ch
## 5 1161602 219940 116165 2 2012 2008 5 D 001 Ch
## 6 1161602 219940 116165 2 2013 2008 5 D 001 Ch
## 7 1161602 219940 116165 2 2014 2008 5 D 001 Ch
## 8 1161602 219940 116165 2 2015 2008 5 D 001 Ch
## 9 1161602 219940 116165 2 2016 2008 5 D 001 Ch
## 10 1161602 219940 116165 2 2017 2008 5 D 001 Ch
## 11 1161602 219940 116165 2 2018 2008 5 D 001 Ch
## 12 1161603 219940 NA 1 1998 1976 1 D 001 Fa
## 13 1161603 219940 NA 1 1999 1976 1 D 001 Fa
## 14 1161603 219940 NA 1 2000 1976 1 D 001 Fa
## 15 1161603 219940 NA 1 2001 1976 1 D 001 Fa
## 16 1161603 219940 NA 1 2002 1976 1 D 001 Fa
## 17 1161603 219940 NA 1 2003 1976 1 D 001 Fa
## 18 1161603 219940 NA 1 2004 1976 1 D 001 Fa
## 19 1161603 219940 NA 1 2005 1976 1 D 001 Fa
## 20 1161603 219940 NA 1 2006 1976 1 D 001 Fa
## # … with 1 more variable: pair_id <chr>
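To see why overlapping members matter, here is a small sketch with toy data in which one mother belongs to both a trio and a duo. As far as we understand dplyr, the join matches on pid only, so it is the pair_id carried on each stacked row that keeps her two copies apart. `stacked_toy` and `long_toy` are hypothetical objects:

```r
library(dplyr)

# Toy stacked pedigree: mother 21 belongs to trio 001 and to duo 001
stacked_toy <- tibble(
  pid     = c(11, 21, 31, 12, 21),
  pair_id = c("T001Ch_11", "T001Mo_21", "T001Fa_31", "D001Ch_12", "D001Mo_21")
)

# Toy tracker file: two survey years for the mother, one for child 11
long_toy <- tibble(
  pid   = c(21, 21, 11),
  syear = c(2017, 2018, 2018)
)

# Each stacked row picks up every survey year of its pid.
# Recent dplyr versions may warn about a many-to-many match here,
# which is expected because pid 21 is repeated on both sides.
joined_toy <- stacked_toy %>%
  left_join(long_toy, by = "pid")

mother_rows <- filter(joined_toy, pid == 21)
```

The mother ends up with four rows (two survey years in each of her two pairs), which is why distinct() above is applied on pair_id and syear rather than on pid and syear.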
Let’s check the number of unique duos and trios.
#Trios first
(n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="T"], na.rm = TRUE))
## [1] 168
#Duos next
(n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="D"], na.rm = TRUE))
## [1] 149
We maintained our pairs but now we have longitudinal data for each individual. See how our observations increased:
#Observations in stacked pedigree file
(nrow(gen_ped_qc_stacked))
## [1] 802
#Observations in GSOEP long file
(nrow(gsoep_long))
## [1] 14861
Now we are ready to characterize our sample and select based on the features we need. First, we want to assign each pair a “pair age” based on the byear of the youngest member of the pair (because before that year, they weren’t a pair).
Once we know the age of each pair, we will drop observations where the “pair age” is 4 years or younger. We do this because 5 years of age is the youngest age at which the SOEP collects data on the externalizing behavior of children (i.e., the Strengths and Difficulties Questionnaire, SDQ).
#assign pair age to each pair
gsoep_long <- gsoep_long %>%
group_by(pair_type, pair_num) %>%
mutate(pair_startyr = max(byear),
pair_age = syear - pair_startyr) %>%
ungroup() %>%
select(pid, pair_id, pair_age, everything())
#drop observations of pairs 4yrs or younger
gsoep_long <- gsoep_long %>%
filter(pair_age>4)
head(gsoep_long, n=20)
## # A tibble: 20 x 13
## pid pair_id pair_age cid hid sex syear byear bmonth pair_type
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
## 1 1161602 D001Ch_116… 5 219940 116165 2 2013 2008 5 D
## 2 1161602 D001Ch_116… 6 219940 116165 2 2014 2008 5 D
## 3 1161602 D001Ch_116… 7 219940 116165 2 2015 2008 5 D
## 4 1161602 D001Ch_116… 8 219940 116165 2 2016 2008 5 D
## 5 1161602 D001Ch_116… 9 219940 116165 2 2017 2008 5 D
## 6 1161602 D001Ch_116… 10 219940 116165 2 2018 2008 5 D
## 7 1161603 D001Fa_116… 5 219940 116165 1 2013 1976 1 D
## 8 1161603 D001Fa_116… 6 219940 116165 1 2014 1976 1 D
## 9 1161603 D001Fa_116… 7 219940 116165 1 2015 1976 1 D
## 10 1161603 D001Fa_116… 8 219940 116165 1 2016 1976 1 D
## 11 1161603 D001Fa_116… 9 219940 116165 1 2017 1976 1 D
## 12 1161603 D001Fa_116… 10 219940 116165 1 2018 1976 1 D
## 13 1233702 T001Mo_123… 5 209325 123374 2 2014 1982 8 T
## 14 1233702 T001Mo_123… 6 209325 123374 2 2015 1982 8 T
## 15 1233702 T001Mo_123… 7 209325 123374 2 2016 1982 8 T
## 16 1233702 T001Mo_123… 8 209325 123374 2 2017 1982 8 T
## 17 1233702 T001Mo_123… 9 209325 123374 2 2018 1982 8 T
## 18 1233704 T001Ch_123… 5 209325 123374 1 2014 2009 7 T
## 19 1233704 T001Ch_123… 6 209325 123374 1 2015 2009 7 T
## 20 1233704 T001Ch_123… 7 209325 123374 1 2016 2009 7 T
## # … with 3 more variables: pair_num <chr>, pair_pos <chr>, pair_startyr <dbl>
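A worked toy example of the pair-age rule may help. The values below are hypothetical but mirror the first duo in the output (child born 2008, father born 1976), so the pair "starts" in 2008 and the first retained survey year is 2013:

```r
library(dplyr)

# Toy duo (hypothetical): child born 2008, father born 1976, surveyed 2008-2018
duo_toy <- tibble(
  pair_type = "D",
  pair_num  = "001",
  pair_pos  = rep(c("Ch", "Fa"), each = 11),
  byear     = rep(c(2008, 1976), each = 11),
  syear     = rep(2008:2018, times = 2)
)

duo_toy_aged <- duo_toy %>%
  group_by(pair_type, pair_num) %>%
  mutate(pair_startyr = max(byear),             # birth year of the youngest member
         pair_age     = syear - pair_startyr) %>%
  ungroup() %>%
  filter(pair_age > 4)                          # keeps pair ages 5 and up
```

One thing to keep in mind: max(byear) is computed without na.rm = TRUE, so if any member of a pair had a missing byear, pair_age would be NA for the whole pair and filter() would drop it. Whether that applies to any pair here is not something the output above can tell us.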
Did that reduce the number of pairs?
Let’s check the number of unique duos and trios.
#Trios first
(n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="T"], na.rm = TRUE))
## [1] 143
#Duos next
(n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="D"], na.rm = TRUE))
## [1] 131
So we definitely lost a few cases. This means that some of our pairs were “too young” at the time of every survey they are recorded in.
Next, let’s check the breakdown of pairs according to whether the youngest member was a youth (i.e., 5-17) or an adult (i.e., 18+).
Note: these groups may be overlapping due to the longitudinal nature of the SOEP. Some youth-based pairs may also end up among the adult-based pairs if they took part in data collection for enough years.
We will start by separating the pairs based on “pair age”.
#Youth-based pairs
gsoep_long_youth <- gsoep_long %>%
filter(pair_age<=17)
#Adult-based pairs
gsoep_long_adult <- gsoep_long %>%
filter(pair_age>17)
Let’s check the number of youth-based duos/trios:
#Trios first
(n_distinct(gsoep_long_youth$pair_num[gsoep_long_youth$pair_type=="T"]))
## [1] 143
#Duos next
(n_distinct(gsoep_long_youth$pair_num[gsoep_long_youth$pair_type=="D"]))
## [1] 122
Now we check the number of adult-based duos/trios:
#Trios first
(n_distinct(gsoep_long_adult$pair_num[gsoep_long_adult$pair_type=="T"]))
## [1] 62
#Duos next
(n_distinct(gsoep_long_adult$pair_num[gsoep_long_adult$pair_type=="D"]))
## [1] 71
So the majority of pairs are youth-based but there is clearly some overlap. How many of the adult-based pairs also appear in the youth-based sample?
#select only cases that match pair IDs in the youth-based sample
overlap <- gsoep_long_adult %>%
semi_join(gsoep_long_youth, by = "pair_id")
#Trios first
(n_distinct(overlap$pair_num[overlap$pair_type=="T"], na.rm = TRUE))
## [1] 62
#Duos next
(n_distinct(overlap$pair_num[overlap$pair_type=="D"], na.rm = TRUE))
## [1] 62
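The youth/adult overlap can also be summarised in a single table by classifying each pair according to the pair ages it was observed at. A minimal sketch on toy data; `long_toy`, `pair_phase`, and `phase_counts` are hypothetical names, and the cut-off of 17 is taken from the split above:

```r
library(dplyr)

# Toy long file (hypothetical): one trio observed on both sides of the
# cut-off, one duo observed only as youth, one duo observed only as adult
long_toy <- tibble(
  pair_type = c("T", "T", "T", "D", "D", "D"),
  pair_num  = c("001", "001", "001", "001", "001", "002"),
  pair_age  = c(16, 17, 18, 10, 11, 25)
)

pair_phase <- long_toy %>%
  group_by(pair_type, pair_num) %>%
  summarise(has_youth = any(pair_age <= 17),
            has_adult = any(pair_age > 17),
            .groups = "drop") %>%
  mutate(phase = case_when(has_youth & has_adult ~ "both",
                           has_youth             ~ "youth only",
                           TRUE                  ~ "adult only"))

# number of pairs per pair type and phase
phase_counts <- count(pair_phase, pair_type, phase)
```

Grouping on both pair_type and pair_num matters because pair numbers restart at 001 within each pair type.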
All 62 adult-based trios and 62 of the 71 adult-based duos also appear as youth-based pairs in the sample.
Let’s focus on the youth-based sample and check out the top externalizing phenotype: the SDQ.
The SDQ was administered in the SOEP-Core survey in a number of different situations:
Households with youths ages 5-6
For each survey, the items were averaged to produce overall scores for each subscale. We begin by importing datafiles for each questionnaire and then stacking the files to create an SDQ long file.
#SDQ questionnaires
child_6_sdq <- import("../Input/Child/child_6_cleaned.rda") %>%
select(pid, syear, starts_with("sdq_"))
child_10_sdq <- import("../Input/Child/child_10_cleaned.rda") %>%
select(pid, syear, starts_with("sdq_"))
child_12_sdq <- import("../Input/Child/preteen_12_cleaned.rda") %>%
select(pid, syear, starts_with("sdq_"))
child_14_sdq <- import("../Input/Child/teen_14_cleaned.rda") %>%
select(pid, syear, starts_with("sdq_"))
#rbind SDQ data frames
sdq <- rbind(child_6_sdq,
child_10_sdq,
child_12_sdq,
child_14_sdq)
#fill by group, drop duplicates, drop if missing for all SDQ subscales
sdq <- sdq %>%
group_by(pid, syear) %>%
fill(starts_with("sdq_"), .direction = "updown") %>%
ungroup() %>%
distinct(pid, syear, .keep_all = TRUE) %>%
filter(!is.na(sdq_hyper_mean)
| !is.na(sdq_emoprob_mean)
| !is.na(sdq_prosoc_mean)
| !is.na(sdq_conduct_mean)
| !is.na(sdq_peerprob_mean))
head(sdq, n=20)
## # A tibble: 20 x 7
## pid syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean sdq_conduct_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.99e5 2012 2 1 6.5 3
## 2 7.99e5 2017 1.5 2.33 6.25 1
## 3 8.93e5 2008 1 2.67 6.5 1.5
## 4 8.93e5 2010 1 4 7 1
## 5 7.15e5 2010 2.75 6.33 6.67 1
## 6 1.26e6 2012 5 4 3.75 4.5
## 7 1.26e6 2015 5 1.67 5.75 1
## 8 1.15e6 2009 2.25 2 5.5 2
## 9 1.26e6 2008 3.25 2 5.5 1
## 10 2.39e4 2008 1 2.67 7 1
## 11 1.08e6 2008 4.67 2.33 5 3
## 12 1.08e6 2012 2.75 2 5.5 2
## 13 1.23e6 2011 3.5 4.67 5.5 4
## 14 1.10e6 2011 3.5 3 6 4
## 15 1.10e6 2014 4.25 2 5.5 3.5
## 16 1.12e6 2012 5.5 5 3 3.5
## 17 8.27e5 2012 7 6 4.5 7
## 18 8.27e5 2009 6.75 7 5.5 4.5
## 19 9.53e5 2011 3.25 2.33 5 1
## 20 9.55e5 2009 3.25 3.67 5.5 3.5
## # … with 1 more variable: sdq_peerprob_mean <dbl>
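The fill-then-distinct step is easier to see on a toy example. In the sketch below (hypothetical data, with only two subscales for brevity), one child has two partial rows for the same survey year, and a second child has no SDQ values at all:

```r
library(dplyr)
library(tidyr)

# Toy SDQ file (hypothetical): pid 1 has two partial rows for 2010,
# pid 2 has no SDQ values
sdq_toy <- tibble(
  pid              = c(1, 1, 2),
  syear            = c(2010, 2010, 2012),
  sdq_hyper_mean   = c(2, NA, NA),
  sdq_conduct_mean = c(NA, 3, NA)
)

sdq_toy_clean <- sdq_toy %>%
  group_by(pid, syear) %>%
  # copy non-missing values across the duplicate rows of a person-year
  fill(starts_with("sdq_"), .direction = "updown") %>%
  ungroup() %>%
  # keep one row per person-year
  distinct(pid, syear, .keep_all = TRUE) %>%
  # drop person-years with no SDQ information at all
  filter(!is.na(sdq_hyper_mean) | !is.na(sdq_conduct_mean))
```

The two partial rows for pid 1 collapse into a single complete row, and pid 2 is dropped. Note that if two rows for the same person-year held conflicting non-missing values, distinct() would simply keep the first one.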
How many observations/individuals do we have in the SDQ data?
#How many non-missing observations?
sdq %>%
count(!is.na(sdq_conduct_mean))
## # A tibble: 2 x 2
## `!is.na(sdq_conduct_mean)` n
## <lgl> <int>
## 1 FALSE 11
## 2 TRUE 13065
#How many unique IDs?
(n_distinct(sdq$pid, na.rm = TRUE))
## [1] 8767
It looks like there are n=13,065 observations with a non-missing conduct score (n=13,076 observations in total) for n=8,767 individuals.
Finally, we merge the new SDQ file with our youth-based GSOEP long file.
gsoep_long_youth_sqd <- gsoep_long_youth %>%
left_join(sdq, by=c("pid", "syear")) %>%
select(pair_id, syear, starts_with("sdq"), everything()) %>%
arrange(pair_id, syear )
head(gsoep_long_youth_sqd, n=20)
## # A tibble: 20 x 18
## pair_id syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 D001Ch_1161602 2013 NA NA NA
## 2 D001Ch_1161602 2014 NA NA NA
## 3 D001Ch_1161602 2015 NA NA NA
## 4 D001Ch_1161602 2016 NA NA NA
## 5 D001Ch_1161602 2017 NA NA NA
## 6 D001Ch_1161602 2018 NA NA NA
## 7 D001Fa_1161603 2013 NA NA NA
## 8 D001Fa_1161603 2014 NA NA NA
## 9 D001Fa_1161603 2015 NA NA NA
## 10 D001Fa_1161603 2016 NA NA NA
## 11 D001Fa_1161603 2017 NA NA NA
## 12 D001Fa_1161603 2018 NA NA NA
## 13 D002Ch_1342703 2006 NA NA NA
## 14 D002Ch_1342703 2007 NA NA NA
## 15 D002Ch_1342703 2008 NA NA NA
## 16 D002Ch_1342703 2009 NA NA NA
## 17 D002Ch_1342703 2010 NA NA NA
## 18 D002Ch_1342703 2011 NA NA NA
## 19 D002Ch_1342703 2012 NA NA NA
## 20 D002Ch_1342703 2013 NA NA NA
## # … with 13 more variables: sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>,
## # pid <dbl>, pair_age <dbl>, cid <dbl>, hid <dbl>, sex <dbl>, byear <dbl>,
## # bmonth <dbl>, pair_type <chr>, pair_num <chr>, pair_pos <chr>,
## # pair_startyr <dbl>
It looks like there is a lot of missing data. See how many non-missing observations we have:
#How many non-missing observations do we have?
gsoep_long_youth_sqd %>%
count(!is.na(sdq_conduct_mean))
## # A tibble: 2 x 2
## `!is.na(sdq_conduct_mean)` n
## <lgl> <int>
## 1 FALSE 6465
## 2 TRUE 2
That’s not good. It looks like there is almost no overlap between SDQ data and the youth-based GSOEP sample.
Let’s check the adult-based sample.
gsoep_long_adult_sqd <- gsoep_long_adult %>%
left_join(sdq, by=c("pid", "syear")) %>%
select(pair_id, syear, starts_with("sdq"), everything()) %>%
arrange(pair_id, syear )
head(gsoep_long_adult_sqd, n=20)
## # A tibble: 20 x 18
## pair_id syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 D006Ch_2031306 2008 NA NA NA
## 2 D006Ch_2031306 2009 NA NA NA
## 3 D006Ch_2031306 2010 NA NA NA
## 4 D006Ch_2031306 2011 NA NA NA
## 5 D006Ch_2031306 2012 NA NA NA
## 6 D006Ch_2031306 2013 NA NA NA
## 7 D006Ch_2031306 2014 NA NA NA
## 8 D006Ch_2031306 2015 NA NA NA
## 9 D006Ch_2031306 2016 NA NA NA
## 10 D006Ch_2031306 2017 NA NA NA
## 11 D006Ch_2031306 2018 NA NA NA
## 12 D006Mo_2031302 2008 NA NA NA
## 13 D006Mo_2031302 2009 NA NA NA
## 14 D006Mo_2031302 2010 NA NA NA
## 15 D006Mo_2031302 2011 NA NA NA
## 16 D006Mo_2031302 2012 NA NA NA
## 17 D006Mo_2031302 2013 NA NA NA
## 18 D006Mo_2031302 2014 NA NA NA
## 19 D006Mo_2031302 2015 NA NA NA
## 20 D006Mo_2031302 2016 NA NA NA
## # … with 13 more variables: sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>,
## # pid <dbl>, pair_age <dbl>, cid <dbl>, hid <dbl>, sex <dbl>, byear <dbl>,
## # bmonth <dbl>, pair_type <chr>, pair_num <chr>, pair_pos <chr>,
## # pair_startyr <dbl>
And the non-missing?
gsoep_long_adult_sqd %>%
count(!is.na(sdq_conduct_mean))
## # A tibble: 1 x 2
## `!is.na(sdq_conduct_mean)` n
## <lgl> <int>
## 1 FALSE 2263
No overlap.
Let’s try it another way. I am going to drop all of the PIDs in the GSOEP youth long file that do not appear in the SDQ file:
#drop PIDs in GSOEP youth file that do not appear in the SDQ file.
gsoep_long_youth_in_sdq <- gsoep_long_youth %>%
semi_join(sdq, by = "pid")
#How many observations are left?
(nrow(gsoep_long_youth_in_sdq))
## [1] 21
So it appears that only n=21 observations in the GSOEP youth long file belong to individuals who appear in the SDQ data. And the adult file?
#drop PIDs in GSOEP adult file that do not appear in the SDQ file.
gsoep_long_adult_in_sdq <- gsoep_long_adult %>%
semi_join(sdq, by = "pid")
#How many observations are left?
(nrow(gsoep_long_adult_in_sdq))
## [1] 0
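Because nrow() counts person-years rather than people, it can help to separate three quantities: individuals shared between the files, rows belonging to those individuals, and rows that match on both pid and syear. A minimal sketch on toy data; `gsoep_toy` and `sdq_toy` are hypothetical stand-ins for the long file and the SDQ file:

```r
library(dplyr)

# Toy long file and toy SDQ file (hypothetical values)
gsoep_toy <- tibble(
  pid   = c(1, 1, 1, 2, 2, 3),
  syear = c(2012, 2013, 2014, 2013, 2014, 2014)
)
sdq_toy <- tibble(
  pid   = c(1, 9),
  syear = c(2008, 2012)
)

# individuals present in both files
shared_pids      <- base::intersect(unique(gsoep_toy$pid), unique(sdq_toy$pid))
n_shared_persons <- length(shared_pids)

# person-years in the long file that belong to those individuals
n_shared_rows <- gsoep_toy %>%
  semi_join(sdq_toy, by = "pid") %>%
  nrow()

# person-years that match on both pid and survey year
n_matched <- gsoep_toy %>%
  inner_join(sdq_toy, by = c("pid", "syear")) %>%
  nrow()
```

In the toy data one individual is shared and contributes three rows, but none of them line up on survey year. That is the same pattern as above, where 21 rows share a pid with the SDQ file but only 2 observations carry SDQ values after merging on pid and syear.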
None in the adult file. It appears that SDQ data are not available for the children in the SOEP-IS sample.
Based on these results, I think we should consider dropping the GSOEP sample for the EXIT project. For externalizing behavior there just doesn’t seem to be any good coverage for the youth-based sample, which is the largest grouping of genotyped duos/trios.