GSOEP - Data Availability Check (Pt. 2)

We are going to start off by importing the GSOEP data frame. After working with Lisa, this file is good to go and has the following features:

It contains only individuals who have passed the genetic data QC process
It is in long format (i.e., containing multiple observations per pair)
It contains phenotype data for adults and children

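#import() is assumed to come from the rio package; the pipe and verbs from the tidyverse
library(rio)
library(tidyverse)

gsoep <- import("../Input/gsoep_wide_pheno.rda") %>%
  select(pid, syear, starts_with(c("pair_", "sdq_")), everything()) %>%
  arrange(pair_pos)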

head(gsoep, n = 20)

## # A tibble: 20 x 26
##        pid syear pair_type pair_num pair_pos pair_id        sdq_hyper_mean
##      <dbl> <dbl> <chr>     <chr>    <chr>    <chr>                   <dbl>
##  1 1161602  2009 D         001      Ch       D001Ch_1161602          NA   
##  2 1161602  2010 D         001      Ch       D001Ch_1161602          NA   
##  3 1161602  2011 D         001      Ch       D001Ch_1161602          NA   
##  4 1161602  2012 D         001      Ch       D001Ch_1161602           5   
##  5 1161602  2013 D         001      Ch       D001Ch_1161602           5   
##  6 1161602  2014 D         001      Ch       D001Ch_1161602           4   
##  7 1161602  2015 D         001      Ch       D001Ch_1161602           4.25
##  8 1161602  2016 D         001      Ch       D001Ch_1161602           4.75
##  9 1161602  2017 D         001      Ch       D001Ch_1161602           4   
## 10 1161602  2018 D         001      Ch       D001Ch_1161602           5   
## 11 1233704  2010 T         001      Ch       T001Ch_1233704          NA   
## 12 1233704  2011 T         001      Ch       T001Ch_1233704          NA   
## 13 1233704  2012 T         001      Ch       T001Ch_1233704          NA   
## 14 1233704  2013 T         001      Ch       T001Ch_1233704           2.5 
## 15 1233704  2014 T         001      Ch       T001Ch_1233704           2.5 
## 16 1233704  2015 T         001      Ch       T001Ch_1233704           2.5 
## 17 1233704  2016 T         001      Ch       T001Ch_1233704           2.75
## 18 1233704  2017 T         001      Ch       T001Ch_1233704           3.5 
## 19 1233704  2018 T         001      Ch       T001Ch_1233704           2.5 
## 20 1342703  2009 D         002      Ch       D002Ch_1342703          NA   
## # … with 19 more variables: sdq_emoprob_mean <dbl>, sdq_prosoc_mean <dbl>,
## #   sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>, pid_type <chr>,
## #   byear <dbl>, netto <dbl>, psample <dbl>, sex <dbl>, youngest_byear <dbl>,
## #   child_age <dbl>, alc_mean <dbl>, imp_mean <dbl>, risk_mean <dbl>,
## #   bfi_o_mean <dbl>, bfi_c_mean <dbl>, bfi_e_mean <dbl>, bfi_a_mean <dbl>,
## #   bfi_n_mean <dbl>

Case Counts Across Age Categories

Next, let’s get an idea of what our case counts will be across the age bins that we specified in our analysis plan:

#bin child_age
gsoep <- gsoep %>%
  mutate(child_age_cat = case_when(child_age < 5        ~ "Child",
                                   child_age %in% 5:10  ~ "Preadolescent",
                                   child_age %in% 11:18 ~ "Adolescent",
                                   child_age >18        ~ "Adult"),
         child_age_cat = fct_relevel(child_age_cat, 
                                     "Child",
                                     "Preadolescent",
                                     "Adolescent",
                                     "Adult"))

#total case counts by age bins
gsoep_cases_tot <- gsoep %>%
  filter(pair_pos=="Ch") %>%
  count("Pair Type" = pair_type,
        "Age Category" = child_age_cat) %>% 
  pivot_wider(values_from = "n", names_from = "Age Category") %>%
  print()

## # A tibble: 2 x 5
##   `Pair Type` Child Preadolescent Adolescent Adult
##   <chr>       <int>         <int>      <int> <int>
## 1 D             116           195        366   423
## 2 T             162           287        366   189

We can see that we have more than 100 observations for both duos and trios in each age category. However, these counts include repeated observations of the same pair within an age bin, so how many unique pairs are we looking at? Note: this matters because our analysis plan calls for averaging observations within developmental epochs, so each pair contributes only one value per epoch.

#unique case counts by age bins
gsoep_cases_uniq <- gsoep %>%
  filter(pair_pos=="Ch") %>%
  group_by(child_age_cat) %>%
  distinct(pair_type, pair_num, .keep_all = TRUE) %>%
  ungroup() %>%
  count("Pair Type" = pair_type,
        "Age Category" = child_age_cat) %>% 
  pivot_wider(values_from = "n", names_from = "Age Category")

#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total", "Total", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()

## # A tibble: 4 x 6
##   `Pair Type` Child Preadolescent Adolescent Adult Obs_Type
##   <chr>       <int>         <int>      <int> <int> <chr>   
## 1 D             116           195        366   423 Total   
## 2 D              43            55         85    69 Unique  
## 3 T             162           287        366   189 Total   
## 4 T              57            71         87    42 Unique

It appears that the largest number of unique pairs is observed during adolescence (i.e., 11-18 years old), with 85 duos and 87 trios.
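The within-epoch averaging called for in the analysis plan is not shown in this report. A minimal sketch of what it might look like follows, using toy data with hypothetical values; the grouping by `pair_id` and `child_age_cat` is an assumption about how the averaging would be set up, not the plan's confirmed implementation.

```r
library(dplyr)

# Toy data (hypothetical values): repeated waves for two child records
toy <- tibble(
  pair_id        = c("D001Ch", "D001Ch", "D001Ch", "T001Ch", "T001Ch"),
  child_age_cat  = c("Preadolescent", "Preadolescent", "Adolescent",
                     "Adolescent", "Adolescent"),
  sdq_hyper_mean = c(5, 4, 4.5, 2.5, NA)
)

# Average all waves that fall within the same epoch for each pair.
# n_waves is computed first so it counts the raw observations,
# not the already-averaged value.
epoch_means <- toy %>%
  group_by(pair_id, child_age_cat) %>%
  summarise(
    n_waves        = sum(!is.na(sdq_hyper_mean)),
    sdq_hyper_mean = mean(sdq_hyper_mean, na.rm = TRUE),
    .groups = "drop"
  )

print(epoch_means)
```

After averaging, each pair contributes one row per epoch, which is why the "Unique" rows in the table above are the relevant sample sizes.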

Unique Pairs vs. Unique Individuals

Another angle we should examine is the number of “unique pairs” vs. “unique individuals”. Some of the mothers and fathers in the sample have multiple children.

#Unique Pairs
gsoep %>%
  distinct(pair_id, .keep_all = TRUE) %>%
  count(pair_type, pair_pos) %>% 
  pivot_wider(values_from = "n", names_from = "pair_pos")

## # A tibble: 2 x 4
##   pair_type    Ch    Fa    Mo
##   <chr>     <int> <int> <int>
## 1 D           158    29   129
## 2 T           149   149   149

#Unique Individuals
gsoep %>%
  distinct(pair_type, pair_pos, pid, .keep_all = TRUE) %>%
  count(pair_type, pair_pos) %>% 
  pivot_wider(values_from = "n", names_from = "pair_pos")

## # A tibble: 2 x 4
##   pair_type    Ch    Fa    Mo
##   <chr>     <int> <int> <int>
## 1 D           158    24    96
## 2 T           149   102   102

Notice that the number of children (which is equal to the total number of duo/trio pairs) does not change. The number of mothers and fathers does change, however, which indicates that some parents have multiple children in the sample. This layer of family structure will need to be kept in mind when analyzing the data.
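To see which parents appear in more than one pair, one option is to count distinct pairs per parent. The sketch below uses toy data with hypothetical IDs, and it assumes parent rows carry a `pair_id` built the same way as the child rows shown earlier (pair type, pair number, position, then `pid`).

```r
library(dplyr)

# Toy data (hypothetical IDs): mother 9001 belongs to two different pairs
toy <- tibble(
  pid      = c(9001, 9001, 9002, 9003),
  pair_pos = c("Mo", "Mo", "Mo", "Fa"),
  pair_id  = c("D001Mo_9001", "D002Mo_9001", "T001Mo_9002", "T001Fa_9003")
)

# Number of distinct pairs each parent belongs to
pairs_per_parent <- toy %>%
  filter(pair_pos != "Ch") %>%
  group_by(pid, pair_pos) %>%
  summarise(n_pairs = n_distinct(pair_id), .groups = "drop")

# Parents contributing to more than one pair
multi_child_parents <- pairs_per_parent %>%
  filter(n_pairs > 1)

print(multi_child_parents)
```

A table like this could later serve as the basis for a family-level clustering variable.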

Phenotype Availability

Next, we examine the availability of phenotypes in the GSOEP sample, starting with duos and trios whose offspring are still youths.

Youth Phenotypes

Strengths and Difficulties Questionnaire (SDQ)
Big Five Inventory (BFI)

First, we’ll explore the most important youth phenotype: the Strengths and Difficulties Questionnaire (SDQ), specifically the conduct problems and hyperactivity subscales. We’ll get a handle on the case counts retaining only those pairs with valid data from the SDQ subscales and then repeat the above process of examining the total/unique number of pairs across age categories. Doing this, we get the following breakdown:

## # A tibble: 4 x 5
##   `Pair Type` Child Preadolescent Adolescent Obs_Type
##   <chr>       <int>         <int>      <int> <chr>   
## 1 D              17           141         94 Total   
## 2 D              17            43         49 Unique  
## 3 T              22           198        115 Total   
## 4 T              22            64         53 Unique

Note: no pairs with adult offspring were retained, as the SDQ was not administered to adults.

Based on the above case counts, we will almost certainly be underpowered in the “child” age category (17 unique duos and 22 unique trios), and possibly in the other age categories as well.
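The filtering step behind the SDQ table is not shown. A rough sketch on toy data (hypothetical values) follows; it assumes "valid data" means both the conduct problems and hyperactivity subscales are non-missing in a given wave, which the report does not state explicitly.

```r
library(dplyr)

# Toy child records (hypothetical values)
toy <- tibble(
  pair_type        = c("D", "D", "D", "T", "T"),
  pair_num         = c("001", "001", "002", "001", "001"),
  pair_pos         = "Ch",
  child_age_cat    = c("Child", "Child", "Child",
                       "Preadolescent", "Preadolescent"),
  sdq_conduct_mean = c(2, 3, NA, 1, 1.5),
  sdq_hyper_mean   = c(5, 4, NA, 2.5, NA)
)

# Keep child observations with both SDQ subscales present (assumed rule)
sdq_valid <- toy %>%
  filter(pair_pos == "Ch",
         !is.na(sdq_conduct_mean),
         !is.na(sdq_hyper_mean))

# Total observations per pair type and age category
sdq_tot <- sdq_valid %>%
  count(pair_type, child_age_cat)

# Unique pairs per pair type and age category
sdq_uniq <- sdq_valid %>%
  distinct(pair_type, pair_num, child_age_cat) %>%
  count(pair_type, child_age_cat)
```

Swapping the two `!is.na()` conditions for the BFI scale columns would give the analogous BFI counts.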

Next we’ll examine the availability of data on the BFI. We are particularly interested in the conscientiousness and agreeableness factor scales. We’ll repeat the same process we used for the SDQ, filtering cases on BFI availability and examining total/unique case counts across ages. We get the following breakdown for BFI:

## # A tibble: 4 x 5
##   `Pair Type` Child Preadolescent Adolescent Obs_Type
##   <chr>       <int>         <int>      <int> <chr>   
## 1 D              17           147        223 Total   
## 2 D              17            45         76 Unique  
## 3 T              23           206        260 Total   
## 4 T              23            64         83 Unique

The BFI has case counts similar to the SDQ in the “child” and “preadolescent” categories. In the “adolescent” category it has noticeably more cases (76 vs. 49 unique duos and 83 vs. 53 unique trios).

Adult Phenotypes

This category of phenotypes covers both the parents and the adult offspring, as they would have been offered the same questionnaires to fill out. We focus on four phenotypes measured in the adult questionnaire:

Big Five Inventory (BFI)
Risk-taking
Impulsivity
Alcohol Use

We will not be examining age categories here, as all observations occurring after age 18 fall into the “adult” category. To make things easier, we will examine parents and adult offspring separately, starting with adult offspring.

Adult offspring

First, we need to know the number of observations/unique pairs for which we have data:

#total case counts for adult offspring
gsoep_cases_tot <- gsoep %>%
  filter(child_age>18) %>%
  filter(pair_pos=="Ch") %>%
  count("Pair Type" = pair_type)

#unique case counts for adult offspring
gsoep_cases_uniq <- gsoep %>%
  filter(child_age>18) %>%
  filter(pair_pos=="Ch") %>%
  distinct(pair_type, pair_num, .keep_all = TRUE) %>%
  count("Pair Type" = pair_type)

#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total","Total", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()

## # A tibble: 4 x 3
##   `Pair Type`     n Obs_Type
##   <chr>       <int> <chr>   
## 1 D             423 Total   
## 2 D              69 Unique  
## 3 T             189 Total   
## 4 T              42 Unique
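The combined tables in this report label rows by position, with a hard-coded `Obs_Type` vector that must match the row order produced by `rbind()`. As a hedged alternative, `bind_rows()` with its `.id` argument can attach the labels directly. The sketch below uses hypothetical stand-ins for `gsoep_cases_tot` and `gsoep_cases_uniq`, filled with the counts printed above.

```r
library(dplyr)

# Hypothetical stand-ins for gsoep_cases_tot and gsoep_cases_uniq
cases_tot  <- tibble(`Pair Type` = c("D", "T"), n = c(423L, 189L))
cases_uniq <- tibble(`Pair Type` = c("D", "T"), n = c(69L, 42L))

# The argument names become the values of the Obs_Type column,
# so the labels cannot drift out of sync with the rows
cases <- bind_rows(Total = cases_tot, Unique = cases_uniq, .id = "Obs_Type") %>%
  relocate(Obs_Type, .after = last_col()) %>%
  arrange(`Pair Type`, Obs_Type)

print(cases)
```

This keeps the same column layout as the table above while removing the dependence on row order.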

Because we will be using data from all across adulthood to compute PCs of externalizing for adults, we construct a count variable that captures the number of variables each individual adult-offspring pair is missing across all waves.

Using the “naniar” package, we see that alcohol use is contributing almost all of the cross-wave missingness in the adult offspring sample.

Dropping the alcohol variable increases the number of pairs with at least one observation of each variable as adult offspring.
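The code for the missingness count is not shown. One possible construction is sketched below on toy data (hypothetical values). It assumes a variable counts as missing for a pair only if it was never observed in any wave; the actual definition used for the count variable may differ.

```r
library(dplyr)

# Toy adult-offspring data (hypothetical values), two waves per pair
toy <- tibble(
  pair_id   = c("D001", "D001", "T001", "T001"),
  alc_mean  = c(NA, NA, 2, NA),
  imp_mean  = c(3, NA, 4, 4),
  risk_mean = c(5, 6, NA, 7)
)

# TRUE if the variable was never observed for that pair in any wave
never_observed <- toy %>%
  group_by(pair_id) %>%
  summarise(across(c(alc_mean, imp_mean, risk_mean), ~ all(is.na(.x))),
            .groups = "drop")

# Number of never-observed variables per pair
n_missing <- never_observed %>%
  mutate(n_vars_missing = rowSums(across(c(alc_mean, imp_mean, risk_mean))))

# Share of pairs missing each variable across all waves
var_missing <- never_observed %>%
  summarise(across(-pair_id, mean))

print(n_missing)
print(var_missing)
```

In this toy example only alcohol use is ever fully missing, which mirrors the pattern described above: dropping that column would leave every pair with at least one observation of each remaining variable.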

Parents

Now we examine the same variables for the parents in each pair. The complicating factor here is that missingness may differ between mothers and fathers. (We may consider computing a mid-parent score for externalizing.)

First, we need to know the number of observations/unique pairs for which we have data:

#total case counts for parents, by pair type and position
gsoep_cases_tot <- gsoep %>%
  filter(pair_pos != "Ch") %>%
  count("Pair Type" = pair_type,
        "Pair Position" = pair_pos)

#unique case counts for parents, by pair type and position
gsoep_cases_uniq <- gsoep %>%
  filter(pair_pos != "Ch") %>%
  distinct(pair_type, pair_num, pair_pos, .keep_all = TRUE) %>%
  count("Pair Type" = pair_type,
        "Pair Position" = pair_pos)

#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total","Total", "Total","Total", "Unique", "Unique", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()

## # A tibble: 8 x 4
##   `Pair Type` `Pair Position`     n Obs_Type
##   <chr>       <chr>           <int> <chr>   
## 1 D           Fa                163 Total   
## 2 D           Mo                952 Total   
## 3 D           Fa                 29 Unique  
## 4 D           Mo                129 Unique  
## 5 T           Fa                996 Total   
## 6 T           Mo               1004 Total   
## 7 T           Fa                149 Unique  
## 8 T           Mo                149 Unique

Next, because we will be using data from all across adulthood to compute PCs of externalizing for adults, we construct a count variable that captures the number of variables each individual parent is missing across all waves.

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_text).

Only a very small minority of pairs have any missing data. Once again, using the “naniar” package, we see that alcohol use is contributing almost all of the cross-wave missingness in the parent sample.

Dropping the alcohol variable increases the number of pairs with at least one observation of each variable for parents.

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_text).
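The mid-parent score mentioned at the start of this section could be sketched as follows. This is only an illustration under stated assumptions: `ext_score` is a hypothetical per-parent externalizing score (no such column exists in the data shown), and the score is left undefined for duos, where only one parent is available.

```r
library(dplyr)
library(tidyr)

# Toy parent-level scores (hypothetical values and column name)
toy <- tibble(
  pair_type = c("T", "T", "D"),
  pair_num  = c("001", "001", "001"),
  pair_pos  = c("Mo", "Fa", "Mo"),
  ext_score = c(1.0, 2.0, 0.5)
)

# One row per pair with mother and father scores side by side;
# the mean is NA whenever either parent's score is missing (e.g., duos)
midparent <- toy %>%
  pivot_wider(names_from = pair_pos, values_from = ext_score) %>%
  mutate(midparent = (Mo + Fa) / 2)

print(midparent)
```

Whether duos should instead fall back on the single available parent's score is a design decision for the analysis plan.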

Takeaways

The number of unique pairs across developmental epochs may limit the power of the analysis
The SDQ and BFI both have decent coverage for youth offspring in the preadolescent and adolescent epochs, but only 17-23 unique pairs per pair type in the child epoch
Adult phenotypes have good coverage, except for alcohol