Data Dive 3

Teresa Ortyl

First, the tidyverse library and the dataset were loaded.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")

## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Group_By Dataframes

For the group_by dataframes, my goal is to see whether patients with different medical treatments (mainly whether they have had thyroid surgery or are on certain thyroid medications) have different TSH/TT4 profiles. Essentially, I am looking at whether their TSH/TT4 results are normal, more similar to those of patients with hypothyroidism, or more similar to those of patients with hyperthyroidism. I will also look at the probability that a patient in this dataset falls into each group within the three groupings.

Grouped by whether patients are on thyroxine

Thyroxine is an important thyroid hormone, and those with hypothyroidism generally have to take a synthetic version of it to make up for their bodies’ lack of thyroxine. If the medication is working as it should, people in this group should have normal TSH/TT4 tests, but it would not be unreasonable for the TSH/TT4 results to resemble hypothyroidism results since people who take synthetic thyroxine generally have hypothyroidism.

hypothyroid|>
  group_by(on_thyroxine)|>
  summarize(mean_TSH=mean(TSH,na.rm=TRUE),mean_TT4=mean(TT4,na.rm=TRUE),size=n())

## # A tibble: 2 × 4
##   on_thyroxine mean_TSH mean_TT4  size
##   <lgl>           <dbl>    <dbl> <int>
## 1 FALSE            5.96     105.  2702
## 2 TRUE             5.74     132.   461

Interestingly, people on synthetic thyroxine had a lower mean TSH and a higher mean TT4 than those not on it. This seems to suggest that, on average, patients on synthetic thyroxine have slightly overshot their thyroxine levels. The results are nowhere near hyperthyroidism levels, though. They’re still in the normal range, which is good, as it means the medication is generally working.

This can also be visualized below:

hypothyroid|>
  ggplot()+geom_point(mapping=aes(x=TSH,y=TT4,color=on_thyroxine))+labs(title="TSH vs. TT4 scores Based On Patients' Synthetic Thyroxine Usage")+theme_bw()

## Warning: Removed 469 rows containing missing values (`geom_point()`).

As you can see, TSH/TT4 results are similarly distributed regardless of synthetic thyroxine usage.

The probabilities for whether an individual is on thyroxine are calculated below:

total<- 2702+461
print(paste("The total number of individuals who responded to this question is:",total))

## [1] "The total number of individuals who responded to this question is: 3163"

print(paste("The probability that a person is not on synthetic thyroxine is:", 2702/total))

## [1] "The probability that a person is not on synthetic thyroxine is: 0.854252292127727"

print(paste("The probability that a person is on synthetic thyroxine is:",461/total))

## [1] "The probability that a person is on synthetic thyroxine is: 0.145747707872273"

So, about 15% of people in the dataset are on synthetic thyroxine, while about 85% are not.
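Typing group counts by hand makes it easy to introduce a transcription slip, so one option is to compute the proportions straight from a named vector of counts. The sketch below assumes the group sizes reported in the summary table above (2702 and 461); the names thyroxine_counts and thyroxine_props are illustrative and not part of the original analysis.

```r
# Group sizes taken from the on_thyroxine summary table above
thyroxine_counts <- c(not_on_thyroxine = 2702, on_thyroxine = 461)

# prop.table() divides each count by the total, so the shares always sum to 1
thyroxine_props <- prop.table(thyroxine_counts)

print(round(thyroxine_props, 3))
```

With the full data frame loaded, the same idea should work on table(hypothyroid$on_thyroxine), which avoids hard-coding the counts at all.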

The expected proportions would in theory be identical to the proportion of people who have hypothyroidism in the set or who have had thyroid surgery.

hypothyroid|>
  group_by(hypothyroid,thyroid_surgery)|>
  summarize(size=n())

## `summarise()` has grouped output by 'hypothyroid'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 3
## # Groups:   hypothyroid [2]
##   hypothyroid thyroid_surgery  size
##   <chr>       <lgl>           <int>
## 1 hypothyroid FALSE             141
## 2 hypothyroid TRUE               10
## 3 negative    FALSE            2918
## 4 negative    TRUE               94

In this case, the 2918 people who are negative for hypothyroidism and have not had thyroid surgery would be expected not to be on thyroxine, while the remaining 245 (141 + 10 + 94) would be expected to be on it.

total<-2918+245
print(paste("The expected proportion of people on thyroxine is:",245/total))

## [1] "The expected proportion of people on thyroxine is: 0.0774581093898198"

print(paste("The expected proportion of people not on thyroxine is:",2918/total))

## [1] "The expected proportion of people not on thyroxine is: 0.92254189061018"

Thus, we would expect 8% of people in the dataset to be on thyroxine and 92% to not be on it.

One testable hypothesis is whether the difference between the expected and actual proportions of people on thyroxine in the dataset is likely due to random chance, or whether there may be other confounding variables that explain why more people are on thyroxine than expected.
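One possible way to frame this test is an exact binomial test, sketched here under the assumptions that patients are independent and that the expected share from the table above is treated as a fixed null value. The variable names are illustrative, and this is only one of several reasonable choices (a chi-squared goodness-of-fit test would be another).

```r
# Counts from the tables above
n_total     <- 3163            # all patients in the dataset
observed_on <- 461             # patients recorded as on thyroxine
expected_on <- 141 + 10 + 94   # hypothyroid and/or thyroid surgery

# Null hypothesis: the true share on thyroxine equals the expected share
expected_share <- expected_on / n_total

# Exact binomial test of the observed count against the expected share
fit <- binom.test(observed_on, n_total, p = expected_share)

print(fit$p.value)    # chance of a gap this large under the null
print(fit$conf.int)   # 95% confidence interval for the observed share
```

A small p-value would only indicate that the gap is unlikely to be chance alone; it would not identify which other variables explain it.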

Grouped by whether patients are on antithyroid medications

Unlike thyroxine, antithyroid medications are used to treat hyperthyroidism, which is when the thyroid is overactive. Checking TSH/TT4 test results allows us to determine if antithyroid medication usage could be a proxy for hyperthyroidism in this data set. So first, we will group by antithyroid medication usage and determine group means for TSH and TT4 to see if there is any difference.

hypothyroid|>
  group_by(on_antithyroid_medication)|>
  summarize(mean_TSH=mean(TSH,na.rm=TRUE),mean_TT4=mean(TT4,na.rm=TRUE),size=n())

## # A tibble: 2 × 4
##   on_antithyroid_medication mean_TSH mean_TT4  size
##   <lgl>                        <dbl>    <dbl> <int>
## 1 FALSE                         5.93     109.  3121
## 2 TRUE                          5.55     107.    42

Interestingly, both the mean TSH and mean TT4 results are lower for individuals on antithyroid medication than for those who are not. However, the difference for TT4 is very slight and within the normal range. The difference is more meaningful for TSH, as TSH values have a smaller range. For TSH, lower results point in the direction of hyperthyroidism, so it would make sense for those on antithyroid medication to have lower levels compared to those not on it.

The lack of difference between groups can also be visualized below:

hypothyroid|>
  ggplot()+geom_point(mapping=aes(x=TSH,y=TT4,color=on_antithyroid_medication),alpha=.7)+labs(title="TSH vs. TT4 scores Based On Patients' Antithyroid Medication Usage")+theme_bw()+xlim(0,10)+ylim(0,100)

## Warning: Removed 2148 rows containing missing values (`geom_point()`).

Because there are only 42 individuals on antithyroid medication, it is tough to visualize, but they are concentrated in the same normal range as those not on the medication. This would suggest the medication is working as intended.

Although this group does not show obvious hyperthyroidism TSH/TT4 results, the main people who take antithyroid medication are those with hyperthyroidism. It is therefore still worth calculating the probability that an individual in this dataset is on antithyroid medication, as it could be a good estimate of the share of people in this data set with hyperthyroidism. This and the probability of not being on antithyroid medication are calculated below:

total2<- 3121+42
print(paste("The total number of individuals who responded to this question is:",total2))

## [1] "The total number of individuals who responded to this question is: 3163"

print(paste("The probability that a person is not on antithyroid medication is:", 3121/total2))

## [1] "The probability that a person is not on antithyroid medication is: 0.986721466961745"

print(paste("The probability that a person is on antithyroid medication is:",42/total2))

## [1] "The probability that a person is on antithyroid medication is: 0.0132785330382548"

From this, we can see that 99% of people in the data set are not on antithyroid medications while 1% are. The stark difference is to be expected; given the data set focuses on hypothyroidism, it would make sense to exclude individuals thought to have hyperthyroidism. We could always test if that exclusion was successful by comparing the proportion of people taking antithyroid medications with the proportion of people taking those medications in the general population and seeing if they are meaningfully different.

However, because the number of people on antithyroid medications in this dataset is so small, it will likely not be a major focus of study throughout the rest of these data dives.

Grouped by whether individuals had thyroid surgery

An individual’s thyroid can be removed, partially or completely, for a number of reasons, including cancer, nodules, and hyperthyroidism. In cases where it is mostly or totally removed, these patients will have to take synthetic thyroxine as they cannot produce enough or any on their own. By looking at TSH/TT4 levels for individuals who had thyroid surgery, we are mostly looking to see if there is any overlap with hypothyroidism.

hypothyroid|>
  group_by(thyroid_surgery)|>
  summarize(mean_TSH=mean(TSH,na.rm=TRUE),mean_TT4=mean(TT4,na.rm=TRUE),size=n())

## # A tibble: 2 × 4
##   thyroid_surgery mean_TSH mean_TT4  size
##   <lgl>              <dbl>    <dbl> <int>
## 1 FALSE               5.87     109.  3059
## 2 TRUE                7.37     104.   104

As expected, mean TSH is higher for people who had thyroid surgery than for those who did not, and mean TT4 is lower. This is more in line with hypothyroidism results.

This can also be visualized below:

hypothyroid|>
  ggplot()+geom_point(mapping=aes(x=TSH,y=TT4,color=thyroid_surgery),alpha=.6)+labs(title="TSH vs. TT4 scores Based On Patients' Thyroid Surgery History")+theme_bw()+xlim(0,200)+ylim(0,200)

## Warning: Removed 579 rows containing missing values (`geom_point()`).

While, on the whole, individuals who had thyroid surgery tend to have normal TSH/TT4 test results, those outside the normal range trend towards hypothyroidism, which is depicted by a few blue points towards the lower right of the above graph.

It is also useful to know the proportion of people in the data set who have had thyroid surgery, which is calculated below:

total3<- 3059+104
print(paste("The total number of individuals who responded to this question is:",total3))

## [1] "The total number of individuals who responded to this question is: 3163"

print(paste("The probability that a person has not had thyroid surgery is:", 3059/total3))

## [1] "The probability that a person has not had thyroid surgery is: 0.967119822952893"

print(paste("The probability that a person has had thyroid surgery is:",104/total3))

## [1] "The probability that a person has had thyroid surgery is: 0.0328801770471072"

From this, 97% of people in this data set have not had thyroid surgery, while 3% of people in this data set have had thyroid surgery. One interesting testable hypothesis would be if people who have had their thyroids removed are more likely to be diagnosed with hypothyroidism than those who have not had their thyroids removed, which could be calculated via conditional probabilities.

As a reminder from above, this is the breakdown of hypothyroidism presence based on whether an individual had thyroid surgery:

hypothyroid|>
  group_by(hypothyroid,thyroid_surgery)|>
  summarize(size=n())

## `summarise()` has grouped output by 'hypothyroid'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 3
## # Groups:   hypothyroid [2]
##   hypothyroid thyroid_surgery  size
##   <chr>       <lgl>           <int>
## 1 hypothyroid FALSE             141
## 2 hypothyroid TRUE               10
## 3 negative    FALSE            2918
## 4 negative    TRUE               94

Since we discussed probability this week, I’ll calculate the conditional probabilities below:

total4 <- 141+10+2918+94
surg_prob <- 104/(total4)
norm_prob <- 1-surg_prob
hypo_given_surgery_prob <- (10/total4)/surg_prob
hypo_given_no_surgery_prob <- (141/total4)/norm_prob
print(paste("The conditional probability of hypothyroidism if a patient had thyroid surgery is:",hypo_given_surgery_prob))

## [1] "The conditional probability of hypothyroidism if a patient had thyroid surgery is: 0.0961538461538461"

print(paste("The conditional probability of hypothyroidism if a patient did not have thyroid surgery is:",hypo_given_no_surgery_prob))

## [1] "The conditional probability of hypothyroidism if a patient did not have thyroid surgery is: 0.0460934946060804"

From this, we can compare these two conditional probabilities:

risk_ratio <- hypo_given_surgery_prob/hypo_given_no_surgery_prob
print(paste("The ratio of the probability of hypothyroidism given thyroid surgery to the probability of hypothyroidism given no thyroid surgery is:",round(risk_ratio,2)))

## [1] "The ratio of the probability of hypothyroidism given thyroid surgery to the probability of hypothyroidism given no thyroid surgery is: 2.09"

Essentially, this means that individuals who have had thyroid surgery are about twice as likely to have hypothyroidism as those who have not had thyroid surgery. Strictly speaking, this is a ratio of probabilities (a relative risk) rather than a ratio of odds.
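The same conditional probabilities can also be read off a 2 x 2 table rather than assembled by hand. The sketch below assumes the four counts from the grouped output above; the names tab, cond, and ratio are illustrative.

```r
# Counts from the hypothyroid x thyroid_surgery table above
tab <- matrix(
  c(141, 2918,   # no surgery: hypothyroid, negative
     10,   94),  # surgery:    hypothyroid, negative
  nrow = 2,
  dimnames = list(diagnosis = c("hypothyroid", "negative"),
                  thyroid_surgery = c("FALSE", "TRUE"))
)

# margin = 2 divides each column by its total,
# giving P(diagnosis | surgery status)
cond <- prop.table(tab, margin = 2)
print(round(cond, 4))

# Ratio of the two conditional probabilities of hypothyroidism
ratio <- cond["hypothyroid", "TRUE"] / cond["hypothyroid", "FALSE"]
print(round(ratio, 2))
```

Working from the table keeps every figure tied to the same four counts, so a single mistyped digit cannot carry through to the final comparison.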

Categorical Variable Combinations

While I looked at some combinations through grouping above, this section focuses specifically on the combinations of tests given to patients.

They are grouped below:

hypothyroid|>
  group_by(TSH_measured,T3_measured,TT4_measured,T4U_measured,FTI_measured,TBG_measured)|>
  summarize(number=n()) |>
  arrange(desc(number))

## `summarise()` has grouped output by 'TSH_measured', 'T3_measured',
## 'TT4_measured', 'T4U_measured', 'FTI_measured'. You can override using the
## `.groups` argument.

## # A tibble: 10 × 7
## # Groups:   TSH_measured, T3_measured, TT4_measured, T4U_measured, FTI_measured
## #   [8]
##    TSH_measured T3_measured TT4_measured T4U_measured FTI_measured TBG_measured
##    <chr>        <chr>       <chr>        <chr>        <chr>        <chr>       
##  1 y            y           y            y            y            n           
##  2 y            n           y            y            y            n           
##  3 n            n           n            n            n            y           
##  4 n            n           y            y            y            n           
##  5 n            y           y            y            y            n           
##  6 n            n           y            y            y            y           
##  7 y            y           y            y            y            y           
##  8 n            n           n            n            y            n           
##  9 n            n           n            y            y            n           
## 10 y            y           n            n            n            y           
## # ℹ 1 more variable: number <int>

Knitting this file to an HTML document leaves out the last column (number) of the printed tibble, so for context, the counts for each group are 2394, 296, 246, 142, 69, 9, 4, 1, 1, and 1, respectively.

Surprisingly, there are only 10 unique combinations of tests given; I expected significantly more, considering there are 6 blood tests discussed in the data set. I would have expected more combinations taken by just 1 or 2 patients, but only 3 combinations were taken by a single patient, and five combinations were taken by fewer than 10 patients.
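To put a number on how many combinations six yes/no test flags allow, the possibilities can be enumerated directly. This is a minimal sketch; the test names come from the column names in the grouped output above, while all_combos, n_possible, and n_observed are illustrative names.

```r
tests <- c("TSH", "T3", "TT4", "T4U", "FTI", "TBG")

# Each test is either measured ("y") or not ("n");
# expand.grid() enumerates every combination of the six flags
all_combos <- expand.grid(rep(list(c("y", "n")), length(tests)))
names(all_combos) <- paste0(tests, "_measured")

n_possible <- nrow(all_combos)   # 2^6 possible combinations
n_observed <- 10                 # combinations seen in the grouped output

print(c(possible = n_possible, observed = n_observed))
```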

The other five combinations were much better represented in the data set. The most common was all the tests except TBG, while the second most common was all the tests except TBG and T3. These two combinations make sense, as T3 and TBG are secondary tests done after the TT4 and TSH tests respectively, so they are not as necessary. The third most common combination was the TBG test alone, which suggests that patients may take that test separately from the others.

Perhaps most surprising was that only 4 patients took all the tests; I would have expected far more. It also surprised me that no one in the data set skipped every test: everyone took at least 1 of the blood tests reported in the data set.

A visualization of the 5 most common combinations and their relative counts is below:

labels<-c("All but TBG", "All But TBG and T3", "Just TBG", "TT4, T4U, and FTI", "All But TSH and TBG")
counts <- c(2394,296,246,142,69)
barplot(counts,main="Top 5 Most Common Testing Combinations",xlab="Combinations of Tests",ylab="Counts",names.arg=labels, col=c("darkred","darkorange","darkgreen","navy","purple"),cex.names = .5,cex.axis = .7)

Sorry this is not a pretty ggplot bar plot, but for labeling purposes it was easier to use R’s built-in barplot function. As you can see, the vast majority of people got all the blood tests except TBG. People who got all the tests except TBG and T3 and people who got just the TBG test were similar in frequency to each other, but far less frequent than the most common combination.
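For comparison, a ggplot2 version might look like the sketch below. It assumes the same five labels and counts as the base R plot and flips the axes so the long labels have room; combos and p are illustrative names, and the styling is only one option.

```r
library(ggplot2)

# Same labels and counts as the base R bar plot above
combos <- data.frame(
  combination = c("All but TBG", "All But TBG and T3", "Just TBG",
                  "TT4, T4U, and FTI", "All But TSH and TBG"),
  count = c(2394, 296, 246, 142, 69)
)

# reorder() sorts the bars by count;
# coord_flip() moves the labels to the vertical axis
p <- ggplot(combos, aes(x = reorder(combination, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 5 Most Common Testing Combinations",
       x = "Combinations of Tests", y = "Counts") +
  theme_bw()

print(p)
```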

From this comparison, one of the big takeaways is that the TBG test should not be a huge focus for future studies as the most common combination of blood tests taken by patients in the data set is the one where patients did not get a TBG test. Also, it may be interesting to look at patients who only got a TBG test in comparison to individuals in the data set as a whole to see if there are any notable differences between them.