Data Dive 4

Teresa Ortyl

First, the tidyverse library and the data set were loaded.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sampling the Data Set

I chose to sample 8 columns from the data set. The first two, hypothyroid and on_antithyroid_medication, were selected to know whether the patient might have hypothyroidism, hyperthyroidism, or a normal thyroid. The remaining 6 are the categorical columns TSH_measured, TT4_measured, and TBG_measured and their respective numerical columns for test results (TSH, TT4, and TBG). I chose these specific tests because TSH and TT4 are the standard blood tests for hypothyroidism patients and because TBG, as discovered last week, is the test most often given on its own (if people do not get the TT4/TSH blood tests, the next most likely option is that they got only the TBG test).

I will collect six samples of 1500 patients. The data set has 3163 patients, so each sample covers a little less than 50% of the data set (about 47%, to be specific, but 1500 is a neater number to work with than the true 50% count, which is 1582).
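The sampling fraction can be checked directly from the row count that read_delim() reported when the data set was loaded. This is only a minimal sketch; the variable names are my own and are not used elsewhere in the analysis.

```r
# Row count as reported by read_delim() when the data set was loaded.
n_rows <- 3163
sample_size <- 1500

# Share of the data set covered by one sample, as a percentage.
sampling_percent <- 100 * sample_size / n_rows

# Number of rows that a true 50% sample would need (rounded up).
half_count <- ceiling(n_rows / 2)

round(sampling_percent, 1)
half_count
```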

Collecting the Six Samples

sample1 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample1["Sample"]="Sample 1"

sample2 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample2["Sample"]="Sample 2"

sample3 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample3["Sample"]="Sample 3"

sample4 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample4["Sample"]="Sample 4"

sample5 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample5["Sample"]="Sample 5"

sample6 <- hypothyroid |>
  select(hypothyroid,on_antithyroid_medication,TSH_measured,TSH,TT4_measured,TT4,TBG_measured,TBG) |>
  sample_n(size=1500, replace= FALSE)
sample6["Sample"]="Sample 6"

As you can see, we gathered 6 samples of 1500 entries from the hypothyroidism data set. You may also notice I added a column to each sample that says which sample the data came from. This is because I want to join all the samples into 1 tibble for ease of comparison and plotting later.

Combining the Samples into One Tibble

allSamples <- bind_rows(sample1, sample2, sample3, sample4, sample5, sample6)

I will still use the individual sample tibbles for some parts of this analysis, but having a single tibble of all six samples is also useful.
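As far as the code shown here goes, no random seed is set, so every run or knit draws different rows and any numbers typed into the text by hand can drift away from the printed tables. The sketch below shows one possible way to make the draws repeatable and to build the combined tibble in one step. It runs on a small made-up tibble rather than the real data, and draw_samples(), toy, and the seed value are hypothetical names and choices of mine.

```r
library(dplyr)
library(purrr)

# Toy stand-in for the hypothyroid tibble (made-up values, illustration only).
toy <- tibble(
  hypothyroid = rep(c("hypothyroid", "negative"), times = c(10, 190)),
  TSH = seq(0.1, 20, length.out = 200)
)

set.seed(4)  # arbitrary seed (assumption) so the same rows are drawn on every run

# Hypothetical helper: draw n_samples samples of `size` rows each, without
# replacement within a sample, and stack them with a "Sample" label column.
draw_samples <- function(data, n_samples, size) {
  map(seq_len(n_samples), \(i) slice_sample(data, n = size, replace = FALSE)) |>
    set_names(paste("Sample", seq_len(n_samples))) |>
    bind_rows(.id = "Sample")
}

toy_samples <- draw_samples(toy, n_samples = 6, size = 50)
count(toy_samples, Sample)
```

With the real data, the same helper could be called on the selected columns with size = 1500, and the individual samples recovered with filter(Sample == "Sample 1") and so on.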

Analysing the Samples

Comparing Proportion of Individuals with Hypothyroidism in Each Sample

First, we will start by looking at proportions of the categorical variables in each sample.

allSamples |>
  group_by(hypothyroid,Sample) |>
  summarize(total=n(),proportion=(n()/1500))
## `summarise()` has grouped output by 'hypothyroid'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   hypothyroid [2]
##    hypothyroid Sample   total proportion
##    <chr>       <chr>    <int>      <dbl>
##  1 hypothyroid Sample 1    66     0.044 
##  2 hypothyroid Sample 2    67     0.0447
##  3 hypothyroid Sample 3    64     0.0427
##  4 hypothyroid Sample 4    77     0.0513
##  5 hypothyroid Sample 5    70     0.0467
##  6 hypothyroid Sample 6    71     0.0473
##  7 negative    Sample 1  1434     0.956 
##  8 negative    Sample 2  1433     0.955 
##  9 negative    Sample 3  1436     0.957 
## 10 negative    Sample 4  1423     0.949 
## 11 negative    Sample 5  1430     0.953 
## 12 negative    Sample 6  1429     0.953

The total number of people negative for hypothyroidism in Samples 5 and 6 are 1430 and 1429 respectively, and the associated proportions are 0.9533 and 0.9527 respectively.
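The proportion tables in this section all divide by a hard-coded 1500. A sketch of a reusable alternative that derives the denominator from each sample's own row count is shown below, run on made-up data. The helper name proportion_by_sample() and the toy tibble are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for allSamples (made-up values, illustration only).
toy_samples <- tibble(
  Sample = rep(c("Sample 1", "Sample 2"), each = 10),
  hypothyroid = c(rep("hypothyroid", 2), rep("negative", 8),
                  rep("hypothyroid", 1), rep("negative", 9))
)

# Hypothetical helper: counts and within-sample proportions for one column.
proportion_by_sample <- function(data, column) {
  data |>
    count({{ column }}, Sample, name = "total") |>
    group_by(Sample) |>
    mutate(proportion = total / sum(total)) |>
    ungroup()
}

toy_proportions <- proportion_by_sample(toy_samples, hypothyroid)
toy_proportions
```

The same call with a different column name (for example TSH_measured) would cover the other categorical variables.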

Overall, the proportion of people with hypothyroidism in any given sample is around 4.5%. The minimum proportion is 4.3% in Sample 3 and the maximum is 5.1% in Sample 4. The samples are pretty similar in the number of people with hypothyroidism (about half of the number in the full data set), with Sample 3 maybe a little low by comparison. If we were to run a statistical test to see whether it was meaningfully different from the population proportion, I doubt it would be significant.
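One way such a test could be run is an exact binomial test of Sample 3's count (64 of 1500, from the table above) against the population proportion. This is only a sketch: the population proportion below is a placeholder I made up, and because the sample is drawn without replacement from a finite data set, the binomial model is an approximation.

```r
# Sample 3 counts taken from the table above.
n_hypothyroid <- 64
n_sample <- 1500

# PLACEHOLDER population proportion (assumption, not computed from the data).
# With the real data it could be obtained as
# mean(hypothyroid$hypothyroid == "hypothyroid").
population_proportion <- 0.046

# Exact binomial test of the sample proportion against the placeholder value.
test_result <- binom.test(n_hypothyroid, n_sample, p = population_proportion)
test_result$p.value
```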

This similarity between samples can be visualized below:

allSamples |>
  ggplot() +
  geom_bar(mapping = aes(x = hypothyroid, fill = Sample), position = "dodge") +
  labs(title = "Counts of Individuals' Hypothyroidism Status by Sample", x = "Patient Hypothyroidism Status") +
  theme_bw()

Comparing Proportions of Individuals on Antithyroid Medications in Each Sample

allSamples |>
  group_by(on_antithyroid_medication,Sample) |>
  summarize(total=n(),proportion=(n()/1500))
## `summarise()` has grouped output by 'on_antithyroid_medication'. You can
## override using the `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   on_antithyroid_medication [2]
##    on_antithyroid_medication Sample   total proportion
##    <lgl>                     <chr>    <int>      <dbl>
##  1 FALSE                     Sample 1  1482     0.988 
##  2 FALSE                     Sample 2  1480     0.987 
##  3 FALSE                     Sample 3  1480     0.987 
##  4 FALSE                     Sample 4  1479     0.986 
##  5 FALSE                     Sample 5  1483     0.989 
##  6 FALSE                     Sample 6  1477     0.985 
##  7 TRUE                      Sample 1    18     0.012 
##  8 TRUE                      Sample 2    20     0.0133
##  9 TRUE                      Sample 3    20     0.0133
## 10 TRUE                      Sample 4    21     0.014 
## 11 TRUE                      Sample 5    17     0.0113
## 12 TRUE                      Sample 6    23     0.0153

The total number of people on antithyroid medications in Samples 5 and 6 are 17 and 23 respectively, and the associated proportions are 0.0113 and 0.0153 respectively.

Again, the proportion of people on antithyroid medication in each sample is pretty similar, with a difference of only 6 individuals between the samples with the highest and lowest counts (Sample 6 has the highest at 23, while Sample 5 has the lowest at 17). Given that the total number of people on antithyroid medication in the full data set is around 40, the samples are all fairly representative of the whole data set for the number of people on antithyroid medication.

This similarity between samples can be visualized below:

allSamples |>
  ggplot() +
  geom_bar(mapping = aes(x = on_antithyroid_medication, fill = Sample), position = "dodge") +
  labs(title = "Counts of Individuals Who Took Antithyroid Medication by Sample", x = "Antithyroid Medication Usage") +
  theme_bw()

Comparing Proportions of Individuals Who Got the TSH Blood Test in Each Sample

allSamples |>
  group_by(TSH_measured,Sample) |>
  summarize(total=n(),proportion=(n()/1500))
## `summarise()` has grouped output by 'TSH_measured'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   TSH_measured [2]
##    TSH_measured Sample   total proportion
##    <chr>        <chr>    <int>      <dbl>
##  1 n            Sample 1   207      0.138
##  2 n            Sample 2   226      0.151
##  3 n            Sample 3   204      0.136
##  4 n            Sample 4   234      0.156
##  5 n            Sample 5   229      0.153
##  6 n            Sample 6   225      0.15 
##  7 y            Sample 1  1293      0.862
##  8 y            Sample 2  1274      0.849
##  9 y            Sample 3  1296      0.864
## 10 y            Sample 4  1266      0.844
## 11 y            Sample 5  1271      0.847
## 12 y            Sample 6  1275      0.85

The total number of people who got the TSH test in Samples 5 and 6 are 1271 and 1275 respectively, and the associated proportions are 0.8473 and 0.85 respectively.

Again, the proportion of people who got the TSH test is fairly consistent between all six samples, continuing the trend that these samples are rather similar. The proportions range from 84.4% (Sample 4) to 86.4% (Sample 3). Since the minimum and maximum proportions of individuals with hypothyroidism were also set by Samples 3 and 4, this may suggest that they are the most different of the samples. Of course, the other two categorical columns and the numerical columns will also need to be looked at to confirm this. However, even though they differ, both are still quite close to the proportion in the full data set, which is 85.3%.
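Reading the minimum and maximum off a twelve-row table by eye is easy to get wrong, so they can also be pulled out with code. The sketch below retypes the "y" rows of the TSH table above into a small tibble (tsh_yes and extremes are names of my own) and uses slice_min() and slice_max() to find the extreme samples.

```r
library(dplyr)

# Counts of patients with TSH_measured == "y", copied from the table above.
tsh_yes <- tibble(
  Sample = paste("Sample", 1:6),
  total = c(1293, 1274, 1296, 1266, 1271, 1275)
) |>
  mutate(proportion = total / 1500)

# First row: sample with the smallest proportion; second row: the largest.
extremes <- bind_rows(
  slice_min(tsh_yes, proportion, n = 1),
  slice_max(tsh_yes, proportion, n = 1)
)
extremes
```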

This similarity between samples can be visualized below:

allSamples |>
  ggplot() +
  geom_bar(mapping = aes(x = TSH_measured, fill = Sample), position = "dodge") +
  labs(title = "Counts of Whether Individuals Took the TSH Test by Sample", x = "TSH Test Taken") +
  theme_bw()

Comparing Proportions of Individuals Who Got the TT4 Blood Test in Each Sample

allSamples |>
  group_by(TT4_measured,Sample) |>
  summarize(total=n(),proportion=(n()/1500))
## `summarise()` has grouped output by 'TT4_measured'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   TT4_measured [2]
##    TT4_measured Sample   total proportion
##    <chr>        <chr>    <int>      <dbl>
##  1 n            Sample 1   104     0.0693
##  2 n            Sample 2   115     0.0767
##  3 n            Sample 3   106     0.0707
##  4 n            Sample 4   126     0.084 
##  5 n            Sample 5   125     0.0833
##  6 n            Sample 6   120     0.08  
##  7 y            Sample 1  1396     0.931 
##  8 y            Sample 2  1385     0.923 
##  9 y            Sample 3  1394     0.929 
## 10 y            Sample 4  1374     0.916 
## 11 y            Sample 5  1375     0.917 
## 12 y            Sample 6  1380     0.92

The total number of people who got the TT4 test in Samples 5 and 6 are 1375 and 1380 respectively, and the associated proportions are 0.9167 and 0.92 respectively.

The proportions of people who got the TT4 blood test in each sample are quite consistent and similar to the true proportion of 92.2% in the full data set. The minimum and maximum proportions are 91.6% and 93.1%, from Sample 4 and Sample 1 respectively. A different pair of samples sets the extremes here, which does not support the conjecture from the previous proportion comparison that one particular pair of samples is the most dissimilar. Overall, there seems to be a trend that the samples are all quite representative of the overall data set, which is good but unfortunately makes this analysis a bit boring to read.

This similarity between groups can be visualized below:

allSamples |>
  ggplot() +
  geom_bar(mapping = aes(x = TT4_measured, fill = Sample), position = "dodge") +
  labs(title = "Counts of Whether Individuals Got the TT4 Test by Sample", x = "TT4 Test Taken") +
  theme_bw()

Comparing Proportions of Individuals Who Got the TBG Test in Each Sample

allSamples |>
  group_by(TBG_measured,Sample) |>
  summarize(total=n(),proportion=(n()/1500))
## `summarise()` has grouped output by 'TBG_measured'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   TBG_measured [2]
##    TBG_measured Sample   total proportion
##    <chr>        <chr>    <int>      <dbl>
##  1 n            Sample 1  1394     0.929 
##  2 n            Sample 2  1380     0.92  
##  3 n            Sample 3  1390     0.927 
##  4 n            Sample 4  1369     0.913 
##  5 n            Sample 5  1373     0.915 
##  6 n            Sample 6  1376     0.917 
##  7 y            Sample 1   106     0.0707
##  8 y            Sample 2   120     0.08  
##  9 y            Sample 3   110     0.0733
## 10 y            Sample 4   131     0.0873
## 11 y            Sample 5   127     0.0847
## 12 y            Sample 6   124     0.0827

The total number of people who got the TBG test in Samples 5 and 6 are 127 and 124 respectively, and the associated proportions are 0.0847 and 0.0827 respectively.

As expected, the proportion of individuals who got the TBG test is quite similar among samples, with minimum and maximum proportions of 7.1% and 8.7% in Sample 1 and Sample 4 respectively. Neither the minimum nor the maximum is particularly different from the proportion in the overall data set, which is 8.2%, which again suggests all of these samples are quite representative of the overall data set.

This similarity between samples can be visualized below:

allSamples |>
  ggplot() +
  geom_bar(mapping = aes(x = TBG_measured, fill = Sample), position = "dodge") +
  labs(title = "Counts of Whether Individuals Took the TBG Test by Sample", x = "TBG Test Taken") +
  theme_bw()

Comparing Means and Medians for Numerical Variables

Given that the proportions for each of the categorical variables are representative of the overall data set for each sample, we would expect the numerical columns to have similar means.

allSamples |>
  group_by(Sample) |>
  summarize(meanTSH=mean(TSH,na.rm=TRUE),meanTT4=mean(TT4,na.rm=TRUE),meanTBG=mean(TBG,na.rm=TRUE))
## # A tibble: 6 × 4
##   Sample   meanTSH meanTT4 meanTBG
##   <chr>      <dbl>   <dbl>   <dbl>
## 1 Sample 1    5.53    110.    33.2
## 2 Sample 2    5.69    109.    31.2
## 3 Sample 3    5.87    109.    29.4
## 4 Sample 4    6.22    108.    30.0
## 5 Sample 5    6.41    108.    31.8
## 6 Sample 6    5.20    109.    30.7

For comparison’s sake, here are also the means for the overall data set:

hypothyroid |>
  summarize(meanTSH=mean(TSH,na.rm=TRUE),meanTT4=mean(TT4,na.rm=TRUE),meanTBG=mean(TBG,na.rm=TRUE))
## # A tibble: 1 × 3
##   meanTSH meanTT4 meanTBG
##     <dbl>   <dbl>   <dbl>
## 1    5.92    109.    31.3

Again, the means for each sample are similar both to each other and to the means of the overall data set. I was hoping one of the samples would pick up a disproportionate number of outliers, just to have something to talk about. TSH in particular has some outliers that I have not addressed much in my data dives so far, other than by zooming in on graphs to make up for the readability issues they cause. However, every sample's mean TSH test result is reasonably close to that of the overall data set and of the other samples.

Since I mentioned outliers above, I will also look at the medians to make sure those are similar as medians are far more resistant to outliers compared to the mean.

allSamples |>
  group_by(Sample) |>
  summarize(medianTSH=median(TSH,na.rm=TRUE),medianTT4=median(TT4,na.rm=TRUE),medianTBG=median(TBG,na.rm=TRUE))
## # A tibble: 6 × 4
##   Sample   medianTSH medianTT4 medianTBG
##   <chr>        <dbl>     <dbl>     <dbl>
## 1 Sample 1       0.7       105      29  
## 2 Sample 2       0.7       104      28  
## 3 Sample 3       0.7       103      27.5
## 4 Sample 4       0.8       102      28  
## 5 Sample 5       0.7       103      28  
## 6 Sample 6       0.7       104      28

I will also calculate the medians for the overall data set for comparison’s sake:

hypothyroid |>
  summarize(medianTSH=median(TSH,na.rm=TRUE),medianTT4=median(TT4,na.rm=TRUE),medianTBG=median(TBG,na.rm=TRUE))
## # A tibble: 1 × 3
##   medianTSH medianTT4 medianTBG
##       <dbl>     <dbl>     <dbl>
## 1       0.7       104        28

You can see a massive effect of outliers on TSH just by comparing the median and the mean TSH test results. The median TSH is within the normal TSH test result range of 0.5-4, while the mean is above that range and would suggest that the average person in the data set has hypothyroidism, which we know is not true. Thus, you will likely see me using the median as a measure of central tendency as much as I can going forward for the TSH column to account for the presence of outliers in TSH test results (I know there are a few values in the 100s for TSH, which understandably have a massive effect on the mean TSH test result).

Compared with the means, the similarity of the medians in the samples to each other and to the medians of the overall data set is even more striking for each set of test results. This confirms that overall, all of the samples seem to be both representative of the overall data set and not meaningfully different from each other.
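The means and medians could also be computed in a single call with across(), which makes the two statistics easier to compare side by side. The sketch below runs on made-up values (toy_samples and toy_summary are hypothetical names); Sample 2 in the toy data contains one very large TSH value to show how it pulls the mean but barely moves the median.

```r
library(dplyr)

# Toy stand-in for allSamples (made-up values, illustration only).
toy_samples <- tibble(
  Sample = rep(c("Sample 1", "Sample 2"), each = 4),
  TSH = c(1, 2, NA, 9, 0.5, 0.7, 100, NA),
  TT4 = c(100, 110, 120, NA, 90, 95, 100, 105)
)

# One row per sample, with a mean and a median column for each test result.
toy_summary <- toy_samples |>
  group_by(Sample) |>
  summarize(
    across(
      c(TSH, TT4),
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        median = \(x) median(x, na.rm = TRUE)
      )
    )
  )
toy_summary
```

The resulting columns are named TSH_mean, TSH_median, TT4_mean, and TT4_median.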

We can also compare distributions for each of the numerical variables:

allSamples |>
  ggplot() +
  geom_density(mapping = aes(x = TSH, color = Sample)) +
  xlim(0, 10) +
  labs(title = "Density Plot of TSH Test Results By Sample", x = "TSH Test Result (milliunits/liter)") +
  theme_bw()
## Warning: Removed 2026 rows containing non-finite values (`stat_density()`).

From this, we can see that TSH test results follow a similar distribution of values for all six samples. The x range was limited to the range 0-10 to see the distribution better as there are some outliers that massively skew the distribution otherwise.
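The warning above counts two different kinds of rows: patients with no TSH result at all, and patients whose result falls outside the 0-10 window, because xlim() discards out-of-range values before the density is estimated. The sketch below separates the two counts on a made-up vector (toy and removed are hypothetical names). If the goal is only to zoom in without discarding values, coord_cartesian(xlim = c(0, 10)) is a possible alternative, although the density would then be estimated from all values and may look different.

```r
library(dplyr)

# Toy stand-in for the TSH column (made-up values, illustration only).
toy <- tibble(TSH = c(0.4, 0.7, 1.2, 3.5, 9.9, 12, 150, NA, NA))

# Split the removed rows into missing values and values outside the window.
removed <- toy |>
  summarize(
    n_missing = sum(is.na(TSH)),
    n_outside_window = sum(TSH < 0 | TSH > 10, na.rm = TRUE),
    n_removed = n_missing + n_outside_window
  )
removed
```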

Next, we will compare for TT4 test results:

allSamples |>
  ggplot() +
  geom_density(mapping = aes(x = TT4, color = Sample)) +
  labs(title = "Density Plot of TT4 Test Results By Sample", x = "TT4 Test Result (nanomoles/liter)") +
  theme_bw()
## Warning: Removed 696 rows containing non-finite values (`stat_density()`).

Again, the distributions are about the same between samples.

Finally, we’ll compare the TBG test results:

allSamples |>
  ggplot() +
  geom_density(mapping = aes(x = TBG, color = Sample)) +
  labs(title = "Density Plot of TBG Test Results By Sample", x = "TBG Test Result (micrograms/milliliter)") +
  theme_bw()
## Warning: Removed 8282 rows containing non-finite values (`stat_density()`).

These differ a little more, largely because there are so many fewer individuals who got the TBG test in each sample compared to the other two tests, but all samples follow the same general density distribution pattern.

Comparing Combinations of Categorical Variables

This is the one comparison where we will break back into the individual sample data sets. My goal here is to see if there are any particularly unique combinations of the five categorical variables in each of the six samples. Given probabilities are so similar, I do not expect there to be many combinations that are unique to a particular sample.

Sample 1’s combinations are shown below:

sample1 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 12 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [9]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            y            n           
##  3 negative    FALSE                     n            n            y           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    FALSE                     n            y            y           
##  7 negative    TRUE                      n            n            y           
##  8 hypothyroid FALSE                     n            y            n           
##  9 hypothyroid FALSE                     y            y            y           
## 10 hypothyroid TRUE                      y            y            n           
## 11 negative    FALSE                     n            n            n           
## 12 negative    FALSE                     y            n            y           
## # ℹ 1 more variable: count <int>

Sample 2’s combinations:

sample2 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 13 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [10]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            n            y           
##  3 negative    FALSE                     n            y            n           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    FALSE                     n            y            y           
##  7 negative    TRUE                      n            n            y           
##  8 hypothyroid FALSE                     y            y            y           
##  9 negative    TRUE                      n            y            n           
## 10 hypothyroid FALSE                     n            y            n           
## 11 hypothyroid TRUE                      y            y            n           
## 12 negative    FALSE                     n            n            n           
## 13 negative    FALSE                     y            n            y           
## # ℹ 1 more variable: count <int>

Sample 3’s combinations:

sample3 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 13 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [10]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            n            y           
##  3 negative    FALSE                     n            y            n           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    FALSE                     n            y            y           
##  7 negative    TRUE                      n            n            y           
##  8 hypothyroid FALSE                     y            y            y           
##  9 hypothyroid FALSE                     n            y            n           
## 10 hypothyroid TRUE                      y            y            n           
## 11 negative    FALSE                     n            n            n           
## 12 negative    FALSE                     y            n            y           
## 13 negative    TRUE                      n            y            n           
## # ℹ 1 more variable: count <int>

Sample 4’s combinations are shown below:

sample4 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 14 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [10]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            n            y           
##  3 negative    FALSE                     n            y            n           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    FALSE                     n            y            y           
##  7 hypothyroid FALSE                     y            y            y           
##  8 negative    FALSE                     n            n            n           
##  9 negative    TRUE                      n            n            y           
## 10 hypothyroid FALSE                     n            y            n           
## 11 hypothyroid TRUE                      y            y            n           
## 12 negative    FALSE                     y            n            y           
## 13 negative    FALSE                     y            y            y           
## 14 negative    TRUE                      n            y            n           
## # ℹ 1 more variable: count <int>

Sample 5’s combinations:

sample5 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 14 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [10]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            n            y           
##  3 negative    FALSE                     n            y            n           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    TRUE                      n            n            y           
##  7 hypothyroid FALSE                     y            y            y           
##  8 negative    FALSE                     n            n            n           
##  9 hypothyroid FALSE                     n            y            n           
## 10 hypothyroid TRUE                      y            y            n           
## 11 negative    FALSE                     n            y            y           
## 12 negative    FALSE                     y            n            y           
## 13 negative    FALSE                     y            y            y           
## 14 negative    TRUE                      n            y            n           
## # ℹ 1 more variable: count <int>

Sample 6’s combinations:

sample6 |>
  group_by(hypothyroid,on_antithyroid_medication,TSH_measured,TT4_measured,TBG_measured) |>
  summarize(count=n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'hypothyroid', 'on_antithyroid_medication',
## 'TSH_measured', 'TT4_measured'. You can override using the `.groups` argument.
## # A tibble: 14 × 6
## # Groups:   hypothyroid, on_antithyroid_medication, TSH_measured, TT4_measured
## #   [10]
##    hypothyroid on_antithyroid_medication TSH_measured TT4_measured TBG_measured
##    <chr>       <lgl>                     <chr>        <chr>        <chr>       
##  1 negative    FALSE                     y            y            n           
##  2 negative    FALSE                     n            n            y           
##  3 negative    FALSE                     n            y            n           
##  4 hypothyroid FALSE                     y            y            n           
##  5 negative    TRUE                      y            y            n           
##  6 negative    TRUE                      n            n            y           
##  7 hypothyroid FALSE                     y            y            y           
##  8 negative    FALSE                     n            y            y           
##  9 hypothyroid FALSE                     n            y            n           
## 10 hypothyroid TRUE                      y            y            n           
## 11 negative    FALSE                     n            n            n           
## 12 negative    FALSE                     y            n            y           
## 13 negative    FALSE                     y            y            y           
## 14 negative    TRUE                      n            y            n           
## # ℹ 1 more variable: count <int>

In five of the six samples, the two most common combinations are people without hypothyroidism and not on antithyroid medication who did the TSH and TT4 tests, and people without hypothyroidism and not on antithyroid medication who did only the TBG test; in Sample 1 the TBG-only combination ranks third, behind the same kind of patient who did only the TT4 test. In fact, the top five combinations are the same for all six samples (only their order differs), with around 20 individuals in each sample with combinations not in those top five. The samples had 7-9 uncommon combinations (the fewest in Sample 1 and the most in Samples 4, 5, and 6). However, given that samples of 1500 individuals have only around 20 individuals who do not fall into the top 5 combinations of these five categorical variables, there is little need to worry about them: there are so few of them, and the samples all seem quite similar to each other and to the original data set beyond that.
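The printed tables above cut off the count column, and comparing six tables by eye is tedious. The sketch below shows one possible way to print every column and to list the combinations that do not appear in every sample. It uses made-up data with only two of the five categorical columns, and toy_samples, combination_counts, and not_in_every_sample are hypothetical names.

```r
library(dplyr)

# Toy stand-in for allSamples (made-up values, illustration only).
toy_samples <- tibble(
  Sample = c(rep("Sample 1", 5), rep("Sample 2", 5)),
  hypothyroid = c("negative", "negative", "negative", "hypothyroid", "negative",
                  "negative", "negative", "negative", "hypothyroid", "hypothyroid"),
  TSH_measured = c("y", "y", "n", "y", "n",
                   "y", "y", "n", "y", "n")
)

# One row per sample and combination.
combination_counts <- toy_samples |>
  count(Sample, hypothyroid, TSH_measured, name = "count")

# width = Inf prints every column, so the count column is not cut off.
print(combination_counts, width = Inf)

# Combinations that are missing from at least one sample.
not_in_every_sample <- combination_counts |>
  group_by(hypothyroid, TSH_measured) |>
  summarize(n_samples = n_distinct(Sample), total = sum(count), .groups = "drop") |>
  filter(n_samples < n_distinct(toy_samples$Sample))
not_in_every_sample
```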

Conclusions

Overall, the samples ended up being very similar both to each other and to the original data set. This suggests that if I intend to sample the data set in later data dives, a sample size of 1500 should give a representative sample to work with. There were no major differences to discuss other than some unique combinations of categorical variables that showed up in some samples but not others, and even they were not of much concern because they made up such a small portion of each sample. The only other major conclusion is that the median should be used as the measure of central tendency for TSH test results because of the outliers in that column, which I will remember for future data dives.