Starting where Lisa left off…

We begin with the pedigree file constructed by the amazing Lisa Reiber (https://lisallreiber.github.io/GeneAnalysis/index.html). Using SOEP-IS data v34, Lisa identified the following family structure in the GSOEP sample:

Using Lisa’s R syntax and SOEP-IS data v35 (i.e., 2018), we reconstruct her genetic pedigree and observe the following:

gen_ped <- import("../Input/Pedigree/gen_pedigree_igene.rda")

gen_ped %>% 
   count("mother avail" = !(is.na(mother_id)), 
         "father avail" = !(is.na(father_id)))
## # A tibble: 3 x 3
##   `mother avail` `father avail`     n
##   <lgl>          <lgl>          <int>
## 1 FALSE          TRUE              27
## 2 TRUE           FALSE            141
## 3 TRUE           TRUE             227

So it appears that we have gained a few genetic trios, but otherwise we find almost exactly the sample sizes we expected using the v35 data.

GSOEP Data - Post PGI Processing

Next, import the GSOEP data file in order to extract only those GSOEP cases that passed the QC process:

gsoep <- import("../Input/I_Gene/GSOEP_PGI_v1.1.txt") %>%
  select(pid, QCpass) %>%
  drop_na()

(n_distinct(gsoep$pid))
## [1] 2494

After importing the QC status of the GSOEP sample, we see that n=2,494 unique cases exist in the data.

Here is the breakdown of the QC pass/fail rates in the GSOEP:

gsoep %>% 
  mutate(QC_Status = ifelse(QCpass==1, "Pass", "Fail")) %>% 
  count(QC_Status)
##   QC_Status    n
## 1      Fail  250
## 2      Pass 2245
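The pass and fail counts sum to 2,495 rows, one more than the 2,494 unique IDs, which suggests that one pid occurs on two rows. A minimal sketch of how such a duplicate could be located, using a small made-up table (`gsoep_toy`, a hypothetical stand-in for `gsoep`):

```r
library(dplyr)

# Toy stand-in for the imported QC table (all values are made up)
gsoep_toy <- tibble(
  pid    = c(101, 102, 102, 103),
  QCpass = c(1, 1, 1, 0)
)

# IDs that occur on more than one row
dup_ids <- gsoep_toy %>%
  count(pid, name = "n_rows") %>%
  filter(n_rows > 1)

dup_ids

# Keep one row per combination of ID and QC status
gsoep_toy_unique <- distinct(gsoep_toy, pid, QCpass)
```

If a duplicated pid carried conflicting QC flags, `distinct()` would keep both rows, so it is worth looking at `dup_ids` before collapsing.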

Dropping GSOEP cases that failed PGI QC

Now that we have identified which cases passed the PGI QC process, we will drop the cases that failed from our pedigree file and observe the changes in the family structure of the GSOEP sample.

We begin by dropping all cases that failed the QC process:

gsoep_qc <- gsoep %>%
  filter(QCpass==1)

(n_distinct(gsoep_qc$pid))
## [1] 2244

Then, we follow a three-step process:
  1. Separate the IDs in the pedigree file (adding a temporary pair ID for recombining)
  2. From each file, remove IDs that do not appear in the GSOEP file of QC-passed IDs
  3. Merge the genetic pedigree file back together using the temporary pair ID

    #generate temporary pair IDs to facilitate re-merging data
    gen_ped <- gen_ped %>%
      mutate(temp_id = row_number())
    
    # STEP 1 - Separate IDs
    child <- select(gen_ped, child_id, temp_id)
    mother <- select(gen_ped, mother_id, temp_id)
    father <- select(gen_ped, father_id, temp_id)
    
    #STEP 2 - Keep only cases that passed QC
    child <- child %>%
      semi_join(gsoep_qc, by = c("child_id" = "pid"))
    mother <- mother %>%
      semi_join(gsoep_qc, by = c("mother_id" = "pid"))
    father <- father %>%
      semi_join(gsoep_qc, by = c("father_id" = "pid"))
    
    #STEP 3 - Merge IDs back into duo/trio pairs
    gen_ped_qc <- full_join(child, mother, by = "temp_id")
    gen_ped_qc <- full_join(gen_ped_qc, father, by = "temp_id")
    gen_ped_qc <- select(gen_ped_qc, -temp_id)
    
    #drop cases with: 1) missing child IDs and 2) cases missing both parent IDs
    gen_ped_qc <- gen_ped_qc %>%
      filter(!is.na(child_id)) %>%
      filter(!is.na(mother_id) | !is.na(father_id))
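The same three steps can also be written without splitting and re-merging, by setting every ID that did not pass QC to missing in place. This is only a sketch on made-up data; `ped_toy` and `qc_pass_ids` are hypothetical stand-ins for `gen_ped` and `gsoep_qc$pid`:

```r
library(dplyr)

# Toy pedigree and QC-passed ID list (all values are made up)
ped_toy <- tibble(
  child_id  = c(1, 2, 3, 4),
  mother_id = c(11, 12, NA, 14),
  father_id = c(21, NA, 23, 24)
)
qc_pass_ids <- c(1, 2, 3, 11, 21, 23, 24)

ped_toy_qc <- ped_toy %>%
  # set every ID that is not in the QC-passed list to NA
  mutate(across(c(child_id, mother_id, father_id),
                ~ if_else(.x %in% qc_pass_ids, .x, NA_real_))) %>%
  # keep rows with a child and at least one parent
  filter(!is.na(child_id),
         !is.na(mother_id) | !is.na(father_id))

ped_toy_qc
```

In the toy data, child 4 fails QC and child 2 loses its only parent, so both rows are dropped, leaving one trio and one father-child duo.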

    New family structure in the QC’d genetic pedigree file

    Now that we are only working with cases that passed QC, let’s see what the family structure in the GSOEP sample looks like now:

    gen_ped_qc %>% 
       count("mother avail" = !(is.na(mother_id)), 
             "father avail" = !(is.na(father_id)))
    ## # A tibble: 3 x 3
    ##   `mother avail` `father avail`     n
    ##   <lgl>          <lgl>          <int>
    ## 1 FALSE          TRUE              35
    ## 2 TRUE           FALSE            114
    ## 3 TRUE           TRUE             168


    Finally, we will add pair IDs for pair type (duo or trio), pair number (which pair), and position within the pair (child, mother, or father). These will allow us to stack IDs and merge with conventionally structured data files.

    #Trios
    
    #select full trios. 
    #add pair type and pair number
    trios <- gen_ped_qc %>%
      filter(!is.na(mother_id) & !is.na(father_id)) %>%
      arrange(child_id) %>%
      mutate(pair_type = "T",
             pair_num = row_number(),
             pair_num = str_pad(pair_num, 3, side = "left", pad = "0"))
      
    #separate IDs
    #add pair position
    tchild  <- trios %>%  
      select(pid = child_id,  pair_type, pair_num) %>%
      mutate(pair_pos = "Ch")
    tmother <- trios %>%
      select(pid = mother_id, pair_type, pair_num) %>%
      mutate(pair_pos = "Mo") 
    tfather <- trios %>%
      select(pid = father_id, pair_type, pair_num) %>%
      mutate(pair_pos = "Fa")
    
    
    
    # Duos
    
    #select full duos
    #add pair type and pair number
    duos <- gen_ped_qc %>%
      filter(is.na(mother_id) | is.na(father_id)) %>%
      arrange(child_id) %>%
      mutate(pair_type = "D",
             pair_num = row_number(),
             pair_num = str_pad(pair_num, 3, side = "left", pad = "0")) 
      
    #separate IDs
    #add pair position
    dchild  <- select(duos, pid = child_id,  pair_type, pair_num) %>%
      mutate(pair_pos = "Ch")
    dmother <- select(duos, pid = mother_id, pair_type, pair_num) %>%
      mutate(pair_pos = "Mo")
    dfather <- select(duos, pid = father_id, pair_type, pair_num) %>%
      mutate(pair_pos = "Fa")


    Now we stack individual pedigree files:

    gen_ped_qc_stacked <- rbind(tchild,dchild,
                                tmother,dmother,
                                tfather,dfather
                                )
    
    #stacked dataframe means that no IDs should be missing
    gen_ped_qc_stacked <- gen_ped_qc_stacked %>%
      drop_na() %>%
      distinct(., .keep_all = TRUE)
    
    #concatenate PIDs with pair ID information to create pair-specific unique IDs
    gen_ped_qc_stacked <- gen_ped_qc_stacked %>%
      mutate(pair_id = paste0(pair_type, pair_num, pair_pos, "_", pid)) %>%
      arrange(pair_num, pair_type)
    
    head(gen_ped_qc_stacked, n=20)
    ## # A tibble: 20 x 5
    ##        pid pair_type pair_num pair_pos pair_id       
    ##      <dbl> <chr>     <chr>    <chr>    <chr>         
    ##  1 1161602 D         001      Ch       D001Ch_1161602
    ##  2 1161603 D         001      Fa       D001Fa_1161603
    ##  3 1233704 T         001      Ch       T001Ch_1233704
    ##  4 1233702 T         001      Mo       T001Mo_1233702
    ##  5 2093203 T         001      Fa       T001Fa_2093203
    ##  6 1342703 D         002      Ch       D002Ch_1342703
    ##  7 1342702 D         002      Mo       D002Mo_1342702
    ##  8 1342707 T         002      Ch       T002Ch_1342707
    ##  9 1342702 T         002      Mo       T002Mo_1342702
    ## 10 2156903 T         002      Fa       T002Fa_2156903
    ## 11 1342705 D         003      Ch       D003Ch_1342705
    ## 12 1342702 D         003      Mo       D003Mo_1342702
    ## 13 1342803 T         003      Ch       T003Ch_1342803
    ## 14 2098603 T         003      Mo       T003Mo_2098603
    ## 15 1342802 T         003      Fa       T003Fa_1342802
    ## 16 1344404 D         004      Ch       D004Ch_1344404
    ## 17 1344403 D         004      Mo       D004Mo_1344403
    ## 18 1360004 T         004      Ch       T004Ch_1360004
    ## 19 1360002 T         004      Mo       T004Mo_1360002
    ## 20 2097903 T         004      Fa       T004Fa_2097903
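For comparison, the stacking could also be done in one reshaping step. The sketch below swaps in `pivot_longer()` from tidyr in place of the select-and-rbind approach used above; `trios_toy` and `pos_lookup` are hypothetical names and the IDs are made up:

```r
library(dplyr)
library(tidyr)

# Toy trios: the same mother (pid 50) appears in two pairs
trios_toy <- tibble(
  child_id  = c(5, 6),
  mother_id = c(50, 50),
  father_id = c(500, 600),
  pair_type = "T",
  pair_num  = c("001", "002")
)

# Lookup from the original column name to the pair position label
pos_lookup <- c(child_id = "Ch", mother_id = "Mo", father_id = "Fa")

stacked_toy <- trios_toy %>%
  pivot_longer(c(child_id, mother_id, father_id),
               names_to = "role", values_to = "pid") %>%
  filter(!is.na(pid)) %>%
  mutate(pair_pos = unname(pos_lookup[role]),
         pair_id  = paste0(pair_type, pair_num, pair_pos, "_", pid)) %>%
  select(pid, pair_type, pair_num, pair_pos, pair_id)

stacked_toy
```

The mother with pid 50 ends up on two rows with two different values of `pair_id`, which is the reason the pair-specific ID is needed at all.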

    Let’s check the number of unique duos and trios.

    #Trios first
    (n_distinct(gen_ped_qc_stacked$pair_num[gen_ped_qc_stacked$pair_type=="T"], na.rm = TRUE))
    ## [1] 168
    #Duos next
    (n_distinct(gen_ped_qc_stacked$pair_num[gen_ped_qc_stacked$pair_type=="D"], na.rm = TRUE))
    ## [1] 149

    Looks like we didn’t lose any pairs while stacking.


    We are now ready to merge longitudinal phenotype data to our QC’d genetic pedigree file backbone.

    Phenotype Data

    We start by importing the longitudinal tracker file developed by Lisa. This will allow us to get some handle on the demographics of the GSOEP sample.

    soepis_long <- import("../Code/soepis_igene_longish.rda") %>%
      select(pid, hid, cid, sex, syear, byear, bmonth) %>%
      arrange(pid, syear)
    
    head(soepis_long, n=20)
    ## # A tibble: 20 x 7
    ##        pid    hid    cid   sex syear byear bmonth
    ##      <dbl>  <dbl>  <dbl> <dbl> <int> <dbl>  <dbl>
    ##  1 1161602     NA 219940     2  2008  2008      5
    ##  2 1161602 116165 219940     2  2009  2008      5
    ##  3 1161602 116165 219940     2  2010  2008      5
    ##  4 1161602 116165 219940     2  2011  2008      5
    ##  5 1161602 116165 219940     2  2012  2008      5
    ##  6 1161602 116165 219940     2  2013  2008      5
    ##  7 1161602 116165 219940     2  2014  2008      5
    ##  8 1161602 116165 219940     2  2015  2008      5
    ##  9 1161602 116165 219940     2  2016  2008      5
    ## 10 1161602 116165 219940     2  2017  2008      5
    ## 11 1161602 116165 219940     2  2018  2008      5
    ## 12 1161603     NA 219940     1  1998  1976      1
    ## 13 1161603     NA 219940     1  1999  1976      1
    ## 14 1161603     NA 219940     1  2000  1976      1
    ## 15 1161603     NA 219940     1  2001  1976      1
    ## 16 1161603     NA 219940     1  2002  1976      1
    ## 17 1161603     NA 219940     1  2003  1976      1
    ## 18 1161603     NA 219940     1  2004  1976      1
    ## 19 1161603     NA 219940     1  2005  1976      1
    ## 20 1161603     NA 219940     1  2006  1976      1


    Next we will merge the longitudinal tracker file with our stacked pedigree file. This must be done by pair, however, as some pairs have overlapping members (this is the case for parents with multiple children).

    gsoep_long <- gen_ped_qc_stacked %>%
      group_by(pair_type, pair_num) %>%
      left_join(soepis_long, by = c("pid")) %>%
      ungroup() %>%
      select(pid, cid, hid, sex, syear, byear, bmonth,
             starts_with("pair")) %>%
      arrange(pid, syear) %>%
      drop_na(pair_id) %>%
      distinct(pair_id, syear, .keep_all = TRUE)
    
    head(gsoep_long, n=20)
    ## # A tibble: 20 x 11
    ##        pid    cid    hid   sex syear byear bmonth pair_type pair_num pair_pos
    ##      <dbl>  <dbl>  <dbl> <dbl> <int> <dbl>  <dbl> <chr>     <chr>    <chr>   
    ##  1 1161602 219940     NA     2  2008  2008      5 D         001      Ch      
    ##  2 1161602 219940 116165     2  2009  2008      5 D         001      Ch      
    ##  3 1161602 219940 116165     2  2010  2008      5 D         001      Ch      
    ##  4 1161602 219940 116165     2  2011  2008      5 D         001      Ch      
    ##  5 1161602 219940 116165     2  2012  2008      5 D         001      Ch      
    ##  6 1161602 219940 116165     2  2013  2008      5 D         001      Ch      
    ##  7 1161602 219940 116165     2  2014  2008      5 D         001      Ch      
    ##  8 1161602 219940 116165     2  2015  2008      5 D         001      Ch      
    ##  9 1161602 219940 116165     2  2016  2008      5 D         001      Ch      
    ## 10 1161602 219940 116165     2  2017  2008      5 D         001      Ch      
    ## 11 1161602 219940 116165     2  2018  2008      5 D         001      Ch      
    ## 12 1161603 219940     NA     1  1998  1976      1 D         001      Fa      
    ## 13 1161603 219940     NA     1  1999  1976      1 D         001      Fa      
    ## 14 1161603 219940     NA     1  2000  1976      1 D         001      Fa      
    ## 15 1161603 219940     NA     1  2001  1976      1 D         001      Fa      
    ## 16 1161603 219940     NA     1  2002  1976      1 D         001      Fa      
    ## 17 1161603 219940     NA     1  2003  1976      1 D         001      Fa      
    ## 18 1161603 219940     NA     1  2004  1976      1 D         001      Fa      
    ## 19 1161603 219940     NA     1  2005  1976      1 D         001      Fa      
    ## 20 1161603 219940     NA     1  2006  1976      1 D         001      Fa      
    ## # … with 1 more variable: pair_id <chr>

    Let’s check the number of unique duos and trios.

    #Trios first
    (n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="T"], na.rm = TRUE))
    ## [1] 168
    #Duos next
    (n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="D"], na.rm = TRUE))
    ## [1] 149


    We maintained our pairs but now we have longitudinal data for each individual. See how our observations increased:

    #Observations in stacked pedigree file
    (nrow(gen_ped_qc_stacked))
    ## [1] 802
    #Observations in GSOEP long file
    (nrow(gsoep_long))
    ## [1] 14861

    Selecting our sample based on demographics


    Now we are ready to characterize our sample and select based on the features we need. First, we want to assign each pair a “pair age” based on the byear of the youngest member of the pair (because before that year, they weren’t a pair).

    Once we know the age of each pair, we will drop observations where the “pair age” is 4 years or younger. We do this because 5 years of age is the youngest age at which the SOEP collects data on the externalizing behavior of children (i.e., the SDQ questionnaire).

    #assign pair age to each pair
    gsoep_long <- gsoep_long %>%
      group_by(pair_type, pair_num) %>%
      mutate(pair_startyr = max(byear),
             pair_age = syear - pair_startyr) %>%
      ungroup() %>%
      select(pid, pair_id, pair_age, everything())
      
    #drop observations of pairs 4yrs or younger
    gsoep_long <- gsoep_long %>%
      filter(pair_age>4)
    
    head(gsoep_long, n=20)
    ## # A tibble: 20 x 13
    ##        pid pair_id     pair_age    cid    hid   sex syear byear bmonth pair_type
    ##      <dbl> <chr>          <dbl>  <dbl>  <dbl> <dbl> <int> <dbl>  <dbl> <chr>    
    ##  1 1161602 D001Ch_116…        5 219940 116165     2  2013  2008      5 D        
    ##  2 1161602 D001Ch_116…        6 219940 116165     2  2014  2008      5 D        
    ##  3 1161602 D001Ch_116…        7 219940 116165     2  2015  2008      5 D        
    ##  4 1161602 D001Ch_116…        8 219940 116165     2  2016  2008      5 D        
    ##  5 1161602 D001Ch_116…        9 219940 116165     2  2017  2008      5 D        
    ##  6 1161602 D001Ch_116…       10 219940 116165     2  2018  2008      5 D        
    ##  7 1161603 D001Fa_116…        5 219940 116165     1  2013  1976      1 D        
    ##  8 1161603 D001Fa_116…        6 219940 116165     1  2014  1976      1 D        
    ##  9 1161603 D001Fa_116…        7 219940 116165     1  2015  1976      1 D        
    ## 10 1161603 D001Fa_116…        8 219940 116165     1  2016  1976      1 D        
    ## 11 1161603 D001Fa_116…        9 219940 116165     1  2017  1976      1 D        
    ## 12 1161603 D001Fa_116…       10 219940 116165     1  2018  1976      1 D        
    ## 13 1233702 T001Mo_123…        5 209325 123374     2  2014  1982      8 T        
    ## 14 1233702 T001Mo_123…        6 209325 123374     2  2015  1982      8 T        
    ## 15 1233702 T001Mo_123…        7 209325 123374     2  2016  1982      8 T        
    ## 16 1233702 T001Mo_123…        8 209325 123374     2  2017  1982      8 T        
    ## 17 1233702 T001Mo_123…        9 209325 123374     2  2018  1982      8 T        
    ## 18 1233704 T001Ch_123…        5 209325 123374     1  2014  2009      7 T        
    ## 19 1233704 T001Ch_123…        6 209325 123374     1  2015  2009      7 T        
    ## 20 1233704 T001Ch_123…        7 209325 123374     1  2016  2009      7 T        
    ## # … with 3 more variables: pair_num <chr>, pair_pos <chr>, pair_startyr <dbl>
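To make the “pair age” logic concrete, here is a small sketch on made-up data. `long_toy` is a hypothetical stand-in for `gsoep_long`, and `na.rm = TRUE` is an added assumption (the code above calls `max(byear)` without it):

```r
library(dplyr)

# One toy duo: a child born in 2010 and a mother born in 1980
long_toy <- tibble(
  pair_num = c("001", "001", "001", "001"),
  pair_pos = c("Ch", "Ch", "Mo", "Mo"),
  byear    = c(2010, 2010, 1980, 1980),
  syear    = c(2014L, 2015L, 2014L, 2015L)
)

aged_toy <- long_toy %>%
  group_by(pair_num) %>%
  # the pair "starts" in the birth year of its youngest member
  mutate(pair_startyr = max(byear, na.rm = TRUE),
         pair_age     = syear - pair_startyr) %>%
  ungroup()

# keep only observations where the pair is at least 5 years old
kept_toy <- filter(aged_toy, pair_age > 4)

kept_toy
```

Both the child's and the mother's 2014 rows are dropped, because the pair age is attached to every member of the pair and not only to the child.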

    Did the age filter cost us any pairs?
    Let’s check the number of unique duos and trios.

    #Trios first
    (n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="T"], na.rm = TRUE))
    ## [1] 143
    #Duos next
    (n_distinct(gsoep_long$pair_num[gsoep_long$pair_type=="D"], na.rm = TRUE))
    ## [1] 131


    So we definitely lost a few cases. This means that some of our pairs were “too young” at the time of every survey they are recorded in.
    Next, let’s check the breakdown of pairs according to whether the youngest member was a youth (i.e., 5-17) or an adult (i.e., 18+).
    Note: these groups may be overlapping due to the longitudinal nature of the SOEP. Some youth-based pairs may also end up among the adult-based pairs if they took part in data collection for enough years.

    We will start by separating the pairs based on “pair age”.

    #Youth-based pairs
    gsoep_long_youth <- gsoep_long %>%
      filter(pair_age<=17)
    
    #Adult-based pairs
    gsoep_long_adult <- gsoep_long %>%
      filter(pair_age>17)


    Let’s check the number of youth-based duos/trios:

    #Trios first
    (n_distinct(gsoep_long_youth$pair_num[gsoep_long_youth$pair_type=="T"]))
    ## [1] 143
    #Duos next
    (n_distinct(gsoep_long_youth$pair_num[gsoep_long_youth$pair_type=="D"]))
    ## [1] 122


    Now we check the number of adult-based duos/trios:

    #Trios first
    (n_distinct(gsoep_long_adult$pair_num[gsoep_long_adult$pair_type=="T"]))
    ## [1] 62
    #Duos next
    (n_distinct(gsoep_long_adult$pair_num[gsoep_long_adult$pair_type=="D"]))
    ## [1] 71


    So the majority of pairs are youth-based but there is clearly some overlap. How many of the adult-based pairs also appear in the youth-based sample?

    #select only cases that match pair IDs in the youth-based sample
    overlap <- gsoep_long_adult %>%
      semi_join(gsoep_long_youth, by = "pair_id")
    
    #Trios first
    (n_distinct(overlap$pair_num[overlap$pair_type=="T"], na.rm = TRUE))
    ## [1] 62
    #Duos next
    (n_distinct(overlap$pair_num[overlap$pair_type=="D"], na.rm = TRUE))
    ## [1] 62


    All 62 adult-based trios and 62 of the 71 adult-based duos also appear as youth-based pairs in the sample.
    Let’s focus on the youth-based sample and check out the top externalizing phenotype: the SDQ questionnaire.

    SDQ Questionnaire in the GSOEP

    The SDQ questionnaire was administered in the SOEP-Core survey in a number of different situations:

    Households with youths ages 5-6
    • Informant: Mother
    • Since: 2008
    Households with youths ages 9-10
    • Informant: Mother & Father
    • Since: 2012
    Households with youths ages 11-12
    • Informant: Youth
    • Since: 2014
    Households with youths ages 13-14
    • Informant: Youth
    • Since: 2015

    The SDQ is divided into a number of subscales:
    • Hyperactivity
    • Emotional Problems
    • Prosocial Behavior
    • Conduct Problems
    • Peer Problems


    For each survey, the items were averaged to produce an overall score for each subscale. We begin by importing the data file for each questionnaire and then stacking the files to create an SDQ long file.

    #SDQ questionnaires
    child_6_sdq  <- import("../Input/Child/child_6_cleaned.rda") %>%
      select(pid, syear, starts_with("sdq_"))
    child_10_sdq <- import("../Input/Child/child_10_cleaned.rda") %>%
      select(pid, syear, starts_with("sdq_"))
    child_12_sdq <- import("../Input/Child/preteen_12_cleaned.rda") %>%
      select(pid, syear, starts_with("sdq_"))
    child_14_sdq <- import("../Input/Child/teen_14_cleaned.rda") %>%
      select(pid, syear, starts_with("sdq_"))
    
    #rbind SDQ data frames
    sdq <- rbind(child_6_sdq,
                 child_10_sdq,
                 child_12_sdq, 
                 child_14_sdq)
    
    #fill by group, drop duplicates, drop if missing for all SDQ subscales
    sdq <- sdq %>%
      group_by(pid, syear) %>%
      fill(starts_with("sdq_"), .direction = "updown") %>%
      ungroup() %>%
      distinct(pid, syear, .keep_all = TRUE) %>%
      filter(!is.na(sdq_hyper_mean) 
             | !is.na(sdq_emoprob_mean)
             | !is.na(sdq_prosoc_mean) 
             | !is.na(sdq_conduct_mean) 
             | !is.na(sdq_peerprob_mean))
    
    head(sdq, n=20)
    ## # A tibble: 20 x 7
    ##       pid syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean sdq_conduct_mean
    ##     <dbl> <dbl>          <dbl>            <dbl>           <dbl>            <dbl>
    ##  1 7.99e5  2012           2                1               6.5               3  
    ##  2 7.99e5  2017           1.5              2.33            6.25              1  
    ##  3 8.93e5  2008           1                2.67            6.5               1.5
    ##  4 8.93e5  2010           1                4               7                 1  
    ##  5 7.15e5  2010           2.75             6.33            6.67              1  
    ##  6 1.26e6  2012           5                4               3.75              4.5
    ##  7 1.26e6  2015           5                1.67            5.75              1  
    ##  8 1.15e6  2009           2.25             2               5.5               2  
    ##  9 1.26e6  2008           3.25             2               5.5               1  
    ## 10 2.39e4  2008           1                2.67            7                 1  
    ## 11 1.08e6  2008           4.67             2.33            5                 3  
    ## 12 1.08e6  2012           2.75             2               5.5               2  
    ## 13 1.23e6  2011           3.5              4.67            5.5               4  
    ## 14 1.10e6  2011           3.5              3               6                 4  
    ## 15 1.10e6  2014           4.25             2               5.5               3.5
    ## 16 1.12e6  2012           5.5              5               3                 3.5
    ## 17 8.27e5  2012           7                6               4.5               7  
    ## 18 8.27e5  2009           6.75             7               5.5               4.5
    ## 19 9.53e5  2011           3.25             2.33            5                 1  
    ## 20 9.55e5  2009           3.25             3.67            5.5               3.5
    ## # … with 1 more variable: sdq_peerprob_mean <dbl>
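The fill-then-deduplicate step is easier to follow on a tiny example. The sketch below uses made-up scores in a hypothetical `sdq_toy` table with only two subscales, and uses `if_any()` as a compact way to write the "at least one subscale is non-missing" filter:

```r
library(dplyr)
library(tidyr)

# Toy SDQ data: pid 1 has two partial rows for 2012, pid 2 has no scores
sdq_toy <- tibble(
  pid              = c(1, 1, 2),
  syear            = c(2012, 2012, 2012),
  sdq_hyper_mean   = c(2.0, NA, NA),
  sdq_conduct_mean = c(NA, 3.0, NA)
)

sdq_toy_clean <- sdq_toy %>%
  group_by(pid, syear) %>%
  # copy non-missing scores across the rows of one person-year
  fill(starts_with("sdq_"), .direction = "updown") %>%
  ungroup() %>%
  # keep one row per person-year
  distinct(pid, syear, .keep_all = TRUE) %>%
  # drop person-years with no SDQ score at all
  filter(if_any(starts_with("sdq_"), ~ !is.na(.x)))

sdq_toy_clean
```

The two partial rows for pid 1 collapse into one complete row, and pid 2 is dropped. Note that if two rows of one person-year held different non-missing values for the same subscale, `distinct()` would keep whichever row comes first.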


    How many observations/individuals do we have in the SDQ data?

    #How many non-missing observations?
    sdq %>%
      count(!is.na(sdq_conduct_mean))
    ## # A tibble: 2 x 2
    ##   `!is.na(sdq_conduct_mean)`     n
    ##   <lgl>                      <int>
    ## 1 FALSE                         11
    ## 2 TRUE                       13065
    #How many unique IDs?
    (n_distinct(sdq$pid, na.rm = TRUE))
    ## [1] 8767


    It looks like there are n=13,065 observations with a non-missing conduct problems score (13,076 rows in total) for n=8,767 individuals.

    Merging the SDQ into the GSOEP long file.


    Finally, we merge the new SDQ file with our youth-based GSOEP long file.

    gsoep_long_youth_sqd <- gsoep_long_youth %>%
      left_join(sdq, by=c("pid", "syear")) %>%
      select(pair_id, syear, starts_with("sdq"), everything()) %>%
      arrange(pair_id, syear )
    
    head(gsoep_long_youth_sqd, n=20)
    ## # A tibble: 20 x 18
    ##    pair_id        syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean
    ##    <chr>          <dbl>          <dbl>            <dbl>           <dbl>
    ##  1 D001Ch_1161602  2013             NA               NA              NA
    ##  2 D001Ch_1161602  2014             NA               NA              NA
    ##  3 D001Ch_1161602  2015             NA               NA              NA
    ##  4 D001Ch_1161602  2016             NA               NA              NA
    ##  5 D001Ch_1161602  2017             NA               NA              NA
    ##  6 D001Ch_1161602  2018             NA               NA              NA
    ##  7 D001Fa_1161603  2013             NA               NA              NA
    ##  8 D001Fa_1161603  2014             NA               NA              NA
    ##  9 D001Fa_1161603  2015             NA               NA              NA
    ## 10 D001Fa_1161603  2016             NA               NA              NA
    ## 11 D001Fa_1161603  2017             NA               NA              NA
    ## 12 D001Fa_1161603  2018             NA               NA              NA
    ## 13 D002Ch_1342703  2006             NA               NA              NA
    ## 14 D002Ch_1342703  2007             NA               NA              NA
    ## 15 D002Ch_1342703  2008             NA               NA              NA
    ## 16 D002Ch_1342703  2009             NA               NA              NA
    ## 17 D002Ch_1342703  2010             NA               NA              NA
    ## 18 D002Ch_1342703  2011             NA               NA              NA
    ## 19 D002Ch_1342703  2012             NA               NA              NA
    ## 20 D002Ch_1342703  2013             NA               NA              NA
    ## # … with 13 more variables: sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>,
    ## #   pid <dbl>, pair_age <dbl>, cid <dbl>, hid <dbl>, sex <dbl>, byear <dbl>,
    ## #   bmonth <dbl>, pair_type <chr>, pair_num <chr>, pair_pos <chr>,
    ## #   pair_startyr <dbl>


    It looks like there is a lot of missing data. See how many non-missing observations we have:

    #How many non-missing observations do we have?
    gsoep_long_youth_sqd %>%
      count(!is.na(sdq_conduct_mean))
    ## # A tibble: 2 x 2
    ##   `!is.na(sdq_conduct_mean)`     n
    ##   <lgl>                      <int>
    ## 1 FALSE                       6465
    ## 2 TRUE                           2


    That’s not good. It looks like there is almost no overlap between SDQ data and the youth-based GSOEP sample.


    Let’s check the adult-based sample.

    gsoep_long_adult_sqd <- gsoep_long_adult %>%
      left_join(sdq, by=c("pid", "syear")) %>%
      select(pair_id, syear, starts_with("sdq"), everything()) %>%
      arrange(pair_id, syear )
    
    head(gsoep_long_adult_sqd, n=20)
    ## # A tibble: 20 x 18
    ##    pair_id        syear sdq_hyper_mean sdq_emoprob_mean sdq_prosoc_mean
    ##    <chr>          <dbl>          <dbl>            <dbl>           <dbl>
    ##  1 D006Ch_2031306  2008             NA               NA              NA
    ##  2 D006Ch_2031306  2009             NA               NA              NA
    ##  3 D006Ch_2031306  2010             NA               NA              NA
    ##  4 D006Ch_2031306  2011             NA               NA              NA
    ##  5 D006Ch_2031306  2012             NA               NA              NA
    ##  6 D006Ch_2031306  2013             NA               NA              NA
    ##  7 D006Ch_2031306  2014             NA               NA              NA
    ##  8 D006Ch_2031306  2015             NA               NA              NA
    ##  9 D006Ch_2031306  2016             NA               NA              NA
    ## 10 D006Ch_2031306  2017             NA               NA              NA
    ## 11 D006Ch_2031306  2018             NA               NA              NA
    ## 12 D006Mo_2031302  2008             NA               NA              NA
    ## 13 D006Mo_2031302  2009             NA               NA              NA
    ## 14 D006Mo_2031302  2010             NA               NA              NA
    ## 15 D006Mo_2031302  2011             NA               NA              NA
    ## 16 D006Mo_2031302  2012             NA               NA              NA
    ## 17 D006Mo_2031302  2013             NA               NA              NA
    ## 18 D006Mo_2031302  2014             NA               NA              NA
    ## 19 D006Mo_2031302  2015             NA               NA              NA
    ## 20 D006Mo_2031302  2016             NA               NA              NA
    ## # … with 13 more variables: sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>,
    ## #   pid <dbl>, pair_age <dbl>, cid <dbl>, hid <dbl>, sex <dbl>, byear <dbl>,
    ## #   bmonth <dbl>, pair_type <chr>, pair_num <chr>, pair_pos <chr>,
    ## #   pair_startyr <dbl>


    And the non-missing?

    gsoep_long_adult_sqd %>%
      count(!is.na(sdq_conduct_mean))
    ## # A tibble: 1 x 2
    ##   `!is.na(sdq_conduct_mean)`     n
    ##   <lgl>                      <int>
    ## 1 FALSE                       2263


    No overlap.


    Let’s try it another way. I am going to drop all of the PIDs in the GSOEP youth long file that do not appear in the SDQ file:

    #drop PIDs in GSOEP youth file that do not appear in the SDQ file.
    gsoep_long_youth_in_sdq <- gsoep_long_youth %>%
      semi_join(sdq, by = "pid")
    
    #How many observations are left?
    (nrow(gsoep_long_youth_in_sdq))
    ## [1] 21


    So it appears that only n=21 observations in the GSOEP youth long file belong to individuals who have any SDQ data. And the adult file?

    #drop PIDs in GSOEP adult file that do not appear in the SDQ file.
    gsoep_long_adult_in_sdq <- gsoep_long_adult %>%
      semi_join(sdq, by = "pid")
    
    #How many observations are left?
    (nrow(gsoep_long_adult_in_sdq))
    ## [1] 0


    None in the adult file. It appears as though the SDQ is not a data source that is available for the children of the SOEP-IS sample.
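One way to double-check a near-empty merge like this is to compare the join keys directly: first their types, then the overlap at the person level and at the person-year level. A minimal sketch on made-up data, where `youth_toy` and `sdq_toy` are hypothetical stand-ins for `gsoep_long_youth` and `sdq`:

```r
library(dplyr)

# Toy files: pid 1 is in both, but never in the same survey year
youth_toy <- tibble(pid   = c(1, 1, 2, 3),
                    syear = c(2013L, 2014L, 2013L, 2013L))
sdq_toy   <- tibble(pid   = c(1, 4, 5),
                    syear = c(2012, 2012, 2013))

# Storage type of the year key in each file
key_classes <- c(youth = class(youth_toy$syear),
                 sdq   = class(sdq_toy$syear))

# Persons that appear in both files
n_shared_pids <- n_distinct(semi_join(youth_toy, sdq_toy, by = "pid")$pid)

# Person-years that appear in both files
n_shared_pidyears <- nrow(semi_join(youth_toy, sdq_toy, by = c("pid", "syear")))

key_classes
n_shared_pids
n_shared_pidyears
```

In the toy data one person is shared but no person-year is, which mirrors the gap above between the 21 observations matched on pid alone and the 2 matched on pid and syear.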


    Conclusion

    Based on these results, I think we should consider dropping the GSOEP sample for the EXIT project. For externalizing behavior there just doesn’t seem to be any good coverage for the youth-based sample, which is the largest grouping of genotyped duos/trios.