gsoep <- import("../Input/gsoep_wide_pheno.rda") %>%
  select(pid, syear, starts_with(c("pair_", "sdq_")), everything()) %>%
  arrange(pair_pos)
head(gsoep, n = 20)
## # A tibble: 20 x 26
##        pid syear pair_type pair_num pair_pos pair_id        sdq_hyper_mean
##      <dbl> <dbl> <chr>     <chr>    <chr>    <chr>                   <dbl>
##  1 1161602  2009 D         001      Ch       D001Ch_1161602          NA
##  2 1161602  2010 D         001      Ch       D001Ch_1161602          NA
##  3 1161602  2011 D         001      Ch       D001Ch_1161602          NA
##  4 1161602  2012 D         001      Ch       D001Ch_1161602           5
##  5 1161602  2013 D         001      Ch       D001Ch_1161602           5
##  6 1161602  2014 D         001      Ch       D001Ch_1161602           4
##  7 1161602  2015 D         001      Ch       D001Ch_1161602           4.25
##  8 1161602  2016 D         001      Ch       D001Ch_1161602           4.75
##  9 1161602  2017 D         001      Ch       D001Ch_1161602           4
## 10 1161602  2018 D         001      Ch       D001Ch_1161602           5
## 11 1233704  2010 T         001      Ch       T001Ch_1233704          NA
## 12 1233704  2011 T         001      Ch       T001Ch_1233704          NA
## 13 1233704  2012 T         001      Ch       T001Ch_1233704          NA
## 14 1233704  2013 T         001      Ch       T001Ch_1233704           2.5
## 15 1233704  2014 T         001      Ch       T001Ch_1233704           2.5
## 16 1233704  2015 T         001      Ch       T001Ch_1233704           2.5
## 17 1233704  2016 T         001      Ch       T001Ch_1233704           2.75
## 18 1233704  2017 T         001      Ch       T001Ch_1233704           3.5
## 19 1233704  2018 T         001      Ch       T001Ch_1233704           2.5
## 20 1342703  2009 D         002      Ch       D002Ch_1342703          NA
## # … with 19 more variables: sdq_emoprob_mean <dbl>, sdq_prosoc_mean <dbl>,
## #   sdq_conduct_mean <dbl>, sdq_peerprob_mean <dbl>, pid_type <chr>,
## #   byear <dbl>, netto <dbl>, psample <dbl>, sex <dbl>, youngest_byear <dbl>,
## #   child_age <dbl>, alc_mean <dbl>, imp_mean <dbl>, risk_mean <dbl>,
## #   bfi_o_mean <dbl>, bfi_c_mean <dbl>, bfi_e_mean <dbl>, bfi_a_mean <dbl>,
## #   bfi_n_mean <dbl>
Next, let’s get an idea of what our case counts will be across the age bins that we specified in our analysis plan:
#bin child_age
gsoep <- gsoep %>%
  mutate(child_age_cat = case_when(child_age < 5 ~ "Child",
                                   child_age %in% 5:10 ~ "Preadolescent",
                                   child_age %in% 11:18 ~ "Adolescent",
                                   child_age > 18 ~ "Adult"),
         child_age_cat = fct_relevel(child_age_cat,
                                     "Child",
                                     "Preadolescent",
                                     "Adolescent",
                                     "Adult"))
#total case counts by age bins
gsoep_cases_tot <- gsoep %>%
  filter(pair_pos == "Ch") %>%
  count("Pair Type" = pair_type,
        "Age Category" = child_age_cat) %>%
  pivot_wider(values_from = "n", names_from = "Age Category") %>%
  print()
## # A tibble: 2 x 5
##   `Pair Type` Child Preadolescent Adolescent Adult
##   <chr>       <int>         <int>      <int> <int>
## 1 D             116           195        366   423
## 2 T             162           287        366   189
We can see that we have >100 observations for both duos and trios in each age category. However, these counts pool repeated observations of the same pairs within each age bin, so how many unique pairs are we looking at? Note: this is important to consider because our analysis plan calls for averaging observations within developmental epochs.
#unique case counts by age bins
gsoep_cases_uniq <- gsoep %>%
  filter(pair_pos == "Ch") %>%
  group_by(child_age_cat) %>%
  distinct(pair_type, pair_num, .keep_all = TRUE) %>%
  ungroup() %>%
  count("Pair Type" = pair_type,
        "Age Category" = child_age_cat) %>%
  pivot_wider(values_from = "n", names_from = "Age Category")
#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total", "Total", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()
## # A tibble: 4 x 6
##   `Pair Type` Child Preadolescent Adolescent Adult Obs_Type
##   <chr>       <int>         <int>      <int> <int> <chr>
## 1 D             116           195        366   423 Total
## 2 D              43            55         85    69 Unique
## 3 T             162           287        366   189 Total
## 4 T              57            71         87    42 Unique
It appears that the largest number of unique pairs is observed during adolescence (i.e., 11-18 years old), for both duos and trios.
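Since the analysis plan calls for averaging observations within developmental epochs, it may help to see what that collapsing step could look like. The following is only a minimal sketch on made-up toy data, not the analysis code itself: the values are hypothetical, and `n_obs` is an illustrative helper column that is not part of the original data.

```r
library(dplyr)

# Toy data (hypothetical values): one row per child-year
toy <- tibble(
  pair_id        = c("D001Ch", "D001Ch", "D001Ch", "T001Ch", "T001Ch"),
  child_age_cat  = c("Adolescent", "Adolescent", "Adult", "Adolescent", "Adolescent"),
  sdq_hyper_mean = c(4, 5, NA, 2.5, 3.5)
)

# Average repeated observations within each pair and developmental epoch,
# so that each pair contributes at most one row per epoch
epoch_means <- toy %>%
  group_by(pair_id, child_age_cat) %>%
  summarise(
    n_obs          = sum(!is.na(sdq_hyper_mean)),       # valid observations averaged
    sdq_hyper_mean = mean(sdq_hyper_mean, na.rm = TRUE), # NaN if no valid observation
    .groups = "drop"
  )

print(epoch_means)
```

After this step the number of rows per epoch corresponds to the "Unique" counts rather than the "Total" counts, which is why the unique counts are the relevant ones for judging sample size.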
Another angle we should examine is the number of “unique pairs” vs. “unique individuals”. Some of the mothers and fathers in the sample have multiple children.
#Unique Pairs
gsoep %>%
  distinct(pair_id, .keep_all = TRUE) %>%
  count(pair_type, pair_pos) %>%
  pivot_wider(values_from = "n", names_from = "pair_pos")
## # A tibble: 2 x 4
##   pair_type    Ch    Fa    Mo
##   <chr>     <int> <int> <int>
## 1 D           158    29   129
## 2 T           149   149   149
#Unique Individuals
gsoep %>%
  distinct(pair_type, pair_pos, pid, .keep_all = TRUE) %>%
  count(pair_type, pair_pos) %>%
  pivot_wider(values_from = "n", names_from = "pair_pos")
## # A tibble: 2 x 4
##   pair_type    Ch    Fa    Mo
##   <chr>     <int> <int> <int>
## 1 D           158    24    96
## 2 T           149   102   102
Notice that the number of children (which is equal to the total number of duo/trio pairs) does not change. The number of mothers and fathers does change, however, suggesting the presence of multiple children within the same household. This layer of family structure will need to be kept in mind when analyzing the data.
Next, we examine the availability of phenotypes for the GSOEP sample, starting with phenotypes for duos and trios whose offspring are still youths.
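One way to quantify this nesting is to count how many pairs each parent belongs to. The sketch below uses toy data with hypothetical IDs and assumes that a parent keeps the same `pid` across all of their pairs and that each pair corresponds to one child; `n_pairs` and `n_parents` are illustrative column names.

```r
library(dplyr)

# Toy data (hypothetical IDs): parent 900 appears in two trios,
# parent 901 appears in a single duo
toy <- tibble(
  pid       = c(100, 900, 800, 101, 900, 801, 102, 901),
  pair_type = c("T", "T", "T", "T", "T", "T", "D", "D"),
  pair_num  = c("001", "001", "001", "002", "002", "002", "001", "001"),
  pair_pos  = c("Ch", "Mo", "Fa", "Ch", "Mo", "Fa", "Ch", "Mo")
)

children_per_parent <- toy %>%
  filter(pair_pos != "Ch") %>%
  distinct(pid, pair_type, pair_num) %>%  # one row per parent-by-pair combination
  count(pid, name = "n_pairs")            # number of pairs (children) per parent

# Distribution of the number of pairs per parent
children_per_parent %>%
  count(n_pairs, name = "n_parents") %>%
  print()
```

Parents with `n_pairs > 1` are the ones who contribute to more than one pair, so their observations are not independent across pairs.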
First, we’ll explore the most important youth phenotype: the Strengths and Difficulties Questionnaire (SDQ), specifically the conduct problems and hyperactivity subscales. We’ll get a handle on the case counts by retaining only those pairs with valid data on these SDQ subscales and then repeating the above process of examining the total/unique number of pairs across age categories. Doing this, we get the following breakdown:
## # A tibble: 4 x 5
##   `Pair Type` Child Preadolescent Adolescent Obs_Type
##   <chr>       <int>         <int>      <int> <chr>
## 1 D              17           141         94 Total
## 2 D              17            43         49 Unique
## 3 T              22           198        115 Total
## 4 T              22            64         53 Unique
Note: no pairs with adult offspring were retained, as the SDQ was not administered to adults.
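The code behind this breakdown is not shown here. A minimal sketch of the filtering and counting it describes could look like the following, run on toy data with hypothetical values. It assumes that "valid" means non-missing on both `sdq_conduct_mean` and `sdq_hyper_mean`, which may differ from the rule actually used.

```r
library(dplyr)
library(tidyr)

# Toy child rows (hypothetical values); the real input would be `gsoep`
toy <- tibble(
  pair_type        = c("D", "D", "D", "T", "T"),
  pair_num         = c("001", "001", "002", "001", "001"),
  pair_pos         = "Ch",
  child_age_cat    = c("Preadolescent", "Preadolescent", "Child",
                       "Adolescent", "Adolescent"),
  sdq_conduct_mean = c(2, 3, NA, 1, NA),
  sdq_hyper_mean   = c(5, 4, 3, 2.5, 3)
)

# Assumption: keep child rows with non-missing scores on both subscales
sdq_valid <- toy %>%
  filter(pair_pos == "Ch",
         !is.na(sdq_conduct_mean),
         !is.na(sdq_hyper_mean))

# Total observations per pair type and age category
sdq_tot <- sdq_valid %>%
  count(pair_type, child_age_cat)

# Unique pairs per pair type and age category
sdq_uniq <- sdq_valid %>%
  distinct(pair_type, pair_num, child_age_cat) %>%
  count(pair_type, child_age_cat)

# Stack both sets of counts and spread age categories into columns
sdq_cases <- bind_rows(Total = sdq_tot, Unique = sdq_uniq, .id = "Obs_Type") %>%
  pivot_wider(names_from = child_age_cat, values_from = n)

print(sdq_cases)
```

Labelling the rows through `.id` in `bind_rows()` avoids relying on row order when tagging counts as "Total" or "Unique".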
Based on the above case counts, we will definitely have issues with power for the “child” age category and possibly for the other age categories as well.
Next we’ll examine the availability of data on the BFI. We are particularly interested in the conscientiousness and agreeableness factor scales. We’ll repeat the same process we used for the SDQ, filtering cases on BFI availability and examining total/unique case counts across ages. We get the following breakdown for BFI:
## # A tibble: 4 x 5
##   `Pair Type` Child Preadolescent Adolescent Obs_Type
##   <chr>       <int>         <int>      <int> <chr>
## 1 D              17           147        223 Total
## 2 D              17            45         76 Unique
## 3 T              23           206        260 Total
## 4 T              23            64         83 Unique
The BFI has case counts similar to the SDQ in every age category except “adolescent”, where it has considerably more cases (76 vs. 49 unique duos and 83 vs. 53 unique trios).
This category of phenotypes covers both the parents and the adult offspring, as they would have been offered the same questionnaires to fill out. We focus on four phenotypes measured in the adult questionnaire:
First, we need to know the number of observations/unique pairs for which we have data:
#total case counts for adult offspring
gsoep_cases_tot <- gsoep %>%
  filter(child_age > 18) %>%
  filter(pair_pos == "Ch") %>%
  count("Pair Type" = pair_type)
#unique case counts for adult offspring
gsoep_cases_uniq <- gsoep %>%
  filter(child_age > 18) %>%
  filter(pair_pos == "Ch") %>%
  distinct(pair_type, pair_num, .keep_all = TRUE) %>%
  count("Pair Type" = pair_type)
#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total", "Total", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()
## # A tibble: 4 x 3
##   `Pair Type`     n Obs_Type
##   <chr>       <int> <chr>
## 1 D             423 Total
## 2 D              69 Unique
## 3 T             189 Total
## 4 T              42 Unique
Because we will be using data from all across adulthood to compute PCs of externalizing for adults, we construct a count variable that captures the number of variables each individual adult-offspring pair is missing across all waves.
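The construction of this count variable is not shown. One plausible reading, sketched below on toy data with hypothetical values, treats a variable as missing for a pair when it was never observed in any wave. The three columns in `pheno_vars` are an illustrative subset chosen for the example, and `n_vars_missing` is a made-up column name.

```r
library(dplyr)

# Toy adult-offspring rows (hypothetical values), one row per pair-year
toy <- tibble(
  pair_id   = c("D001Ch", "D001Ch", "T001Ch", "T001Ch", "T002Ch"),
  syear     = c(2016, 2018, 2016, 2018, 2018),
  alc_mean  = c(NA, NA, 2, NA, NA),
  imp_mean  = c(3, NA, 4, 5, NA),
  risk_mean = c(6, 7, NA, 5, NA)
)

# Assumption: illustrative subset of the adult phenotype columns
pheno_vars <- c("alc_mean", "imp_mean", "risk_mean")

# For each pair, flag variables that were never observed in any wave
never_observed <- toy %>%
  group_by(pair_id) %>%
  summarise(across(all_of(pheno_vars), ~ all(is.na(.x))), .groups = "drop")

# Count how many variables each pair is missing across all waves
never_observed$n_vars_missing <- rowSums(never_observed[pheno_vars])

print(never_observed)
```

A pair with `n_vars_missing == 0` has at least one observation of every variable, even if individual waves are incomplete.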
Using the “naniar” package, we see that alcohol use is contributing almost all of the cross-wave missingness in the adult offspring sample.
Dropping the alcohol variable increases the number of pairs with at least one observation of each variable as adult offspring.
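To make the comparison concrete, the following sketch counts pairs with at least one observation of every variable, with and without the alcohol variable. It runs on toy data with hypothetical values; `count_complete_pairs()` is a hypothetical helper written for this illustration, and the variable sets are assumptions rather than the exact ones used in the analysis.

```r
library(dplyr)

# Toy adult-offspring rows (hypothetical values), one row per pair-year
toy <- tibble(
  pair_id   = c("D001Ch", "D001Ch", "D002Ch", "T001Ch", "T001Ch", "T002Ch"),
  alc_mean  = c(NA, NA, NA, 2, NA, NA),
  imp_mean  = c(3, NA, 2, 4, 5, NA),
  risk_mean = c(6, 7, 4, NA, 5, NA)
)

# Hypothetical helper: number of pairs with >= 1 observation of every variable
count_complete_pairs <- function(data, vars) {
  observed <- data %>%
    group_by(pair_id) %>%
    summarise(across(all_of(vars), ~ any(!is.na(.x))), .groups = "drop")
  # A pair is complete if every variable was observed at least once
  sum(rowSums(observed[vars]) == length(vars))
}

n_with_alc    <- count_complete_pairs(toy, c("alc_mean", "imp_mean", "risk_mean"))
n_without_alc <- count_complete_pairs(toy, c("imp_mean", "risk_mean"))

cat("Complete pairs with alcohol:   ", n_with_alc, "\n")
cat("Complete pairs without alcohol:", n_without_alc, "\n")
```

In the toy data, pairs that never report alcohol use but do report the other variables move from incomplete to complete once `alc_mean` is dropped, which mirrors the pattern described above.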
Now we examine the exact same variables, but for the adult parents of pairs. The complicating factor here is that the missingness may be different across mothers and fathers. (We may think about computing a mid-parent score for externalizing.)
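If a mid-parent score were pursued, it could be computed as the average of the mother's and father's scores within a pair. The sketch below is only an illustration on toy data: `ext_score` is a placeholder for whatever externalizing score is eventually derived, and requiring both parents to be observed is an assumption (under this strict rule duos never receive a mid-parent score).

```r
library(dplyr)
library(tidyr)

# Toy parent-level scores (hypothetical values); `ext_score` is a placeholder name
toy <- tibble(
  pair_type = c("T", "T", "T", "T", "D"),
  pair_num  = c("001", "001", "002", "002", "001"),
  pair_pos  = c("Mo", "Fa", "Mo", "Fa", "Mo"),
  ext_score = c(0.4, -0.2, 1.0, NA, 0.6)
)

midparent <- toy %>%
  # One row per pair, with mother and father scores side by side
  pivot_wider(names_from = pair_pos, values_from = ext_score) %>%
  mutate(
    n_parents = (!is.na(Mo)) + (!is.na(Fa)),
    # Strict mid-parent score: defined only when both parents have a score
    midparent = if_else(n_parents == 2, (Mo + Fa) / 2, NA_real_)
  )

print(midparent)
```

The `n_parents` column makes it easy to see how much differential missingness across mothers and fathers would cost under this rule.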
First, we need to know the number of observations/unique pairs for which we have data:
#total case counts for parents
gsoep_cases_tot <- gsoep %>%
  filter(pair_pos != "Ch") %>%
  count("Pair Type" = pair_type,
        "Pair Position" = pair_pos)
#unique case counts for parents
gsoep_cases_uniq <- gsoep %>%
  filter(pair_pos != "Ch") %>%
  distinct(pair_type, pair_num, pair_pos, .keep_all = TRUE) %>%
  count("Pair Type" = pair_type,
        "Pair Position" = pair_pos)
#combine into table
gsoep_cases <- gsoep_cases_tot %>%
  rbind(gsoep_cases_uniq) %>%
  mutate(Obs_Type = c("Total", "Total", "Total", "Total", "Unique", "Unique", "Unique", "Unique")) %>%
  arrange(`Pair Type`) %>%
  print()
## # A tibble: 8 x 4
##   `Pair Type` `Pair Position`     n Obs_Type
##   <chr>       <chr>           <int> <chr>
## 1 D           Fa                163 Total
## 2 D           Mo                952 Total
## 3 D           Fa                 29 Unique
## 4 D           Mo                129 Unique
## 5 T           Fa                996 Total
## 6 T           Mo               1004 Total
## 7 T           Fa                149 Unique
## 8 T           Mo                149 Unique
Next, because we will be using data from all across adulthood to compute PCs of externalizing for adults, we construct a count variable that captures the number of variables each individual parent is missing across all waves.
Only a very small minority of pairs have any missing data. Once again, using the “naniar” package, we see that alcohol use is contributing almost all of the cross-wave missingness in the parent sample.
Dropping the alcohol variable increases the number of pairs with at least one observation of each variable for parents.