Data Dive 2

Teresa Ortyl

The first steps are loading in the tidyverse package and the hypothyroid dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I did some data cleaning before first use: I added the column names directly to the CSV file and replaced the ? marks in the columns with empty cells so that R reads them as missing values.
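The ? marks could also be handled at read time instead of by editing the file. This is only a minimal sketch, assuming the raw file still uses ? for missing values; the inline CSV text, its four columns, and the names raw_csv and sample_rows are made up for illustration and are not the real file.

```r
library(readr)

# Hypothetical three-row excerpt standing in for the raw file,
# where "?" marks a missing value.
raw_csv <- I("hypothyroid,age,sex,TSH\nnegative,45,F,1.2\nhypothyroid,?,M,30\nnegative,62,?,?\n")

# Treat both empty cells and "?" as NA while reading.
sample_rows <- read_csv(raw_csv, na = c("", "?"), show_col_types = FALSE)

sample_rows
```

With the ? marks read as NA, the numeric columns are parsed as numbers rather than text, which is the same result the manual cleaning achieves.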

First, we will start by looking at some basic summaries of the columns.

Categorical Columns of Interest

To start, we will look at a few of the categorical columns and their counts for unique values.

Hypothyroid

The first is the hypothyroid column, which lists whether or not an individual has hypothyroidism.

hypothyroid|>
  count(hypothyroid)
## # A tibble: 2 × 2
##   hypothyroid     n
##   <chr>       <int>
## 1 hypothyroid   151
## 2 negative     3012

As you can see, only 151 people have it in this dataset.

on_thyroxine

The second column we will look at is on_thyroxine. Thyroxine is a hormone produced by the thyroid and is not produced at high enough levels in people with hypothyroidism. This column asks whether individuals are taking a synthetic version of the hormone.

hypothyroid|>
  count(on_thyroxine)
## # A tibble: 2 × 2
##   on_thyroxine     n
##   <lgl>        <int>
## 1 FALSE         2702
## 2 TRUE           461

Interestingly, there are 461 who are taking a synthetic version of thyroxine, despite only 151 people specifically having hypothyroidism.
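One way to see how the two columns overlap would be to count them together. This is a minimal sketch on made-up rows, not values from the dataset; the names toy and crosstab are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only; not values from the dataset.
toy <- tibble(
  hypothyroid  = c("hypothyroid", "negative", "negative", "negative", "hypothyroid"),
  on_thyroxine = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Count every combination of diagnosis and thyroxine use.
crosstab <- toy |>
  count(hypothyroid, on_thyroxine)

crosstab
```

Run on the real data, the same count() call would show how many of the people on thyroxine are labeled negative for hypothyroidism.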

on_antithyroid_medication

The third column to investigate is on_antithyroid_medication. Antithyroid medications are used when the thyroid is overactive, which is called hyperthyroidism. Thus, this column can serve as a way of checking how many individuals in this group have hyperthyroidism.

hypothyroid|>
  count(on_antithyroid_medication)
## # A tibble: 2 × 2
##   on_antithyroid_medication     n
##   <lgl>                     <int>
## 1 FALSE                      3121
## 2 TRUE                         42

Only 42 people are on antithyroid medications, which means that presumably only 42 people in the dataset have hyperthyroidism.

thyroid_surgery

The fourth categorical column to look at is thyroid_surgery, which shows whether or not an individual has had their thyroid completely or partially removed.

hypothyroid|>
  count(thyroid_surgery)
## # A tibble: 2 × 2
##   thyroid_surgery     n
##   <lgl>           <int>
## 1 FALSE            3059
## 2 TRUE              104

104 people have had thyroid surgery, which may help explain the elevated numbers of people on synthetic thyroxine as those who do not have thyroids have to take synthetic thyroxine as well.

Sex

The fifth categorical column we will look at is sex, just to get an idea of the sex breakdown of individuals in the dataset.

hypothyroid|>
  count(sex)
## # A tibble: 3 × 2
##   sex       n
##   <chr> <int>
## 1 F      2182
## 2 M       908
## 3 <NA>     73

Women make up about 2/3 of the dataset. This makes sense given that women are at an increased risk of getting hypothyroidism, particularly after menopause, so more women probably undergo the various tests covered in this dataset. Interestingly, 73 people did not list a sex, though it has no bearing on this dataset.
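The "about 2/3" figure can be checked directly from the counts in the output above. This is a small sketch that re-enters those three counts by hand; the names sex_counts and sex_share are my own.

```r
library(dplyr)

# Counts copied from the count(sex) output above.
sex_counts <- tibble(
  sex = c("F", "M", NA),
  n   = c(2182, 908, 73)
)

# Share of all rows, and share among rows with a recorded sex.
sex_share <- sex_counts |>
  mutate(
    share_all      = n / sum(n),
    share_recorded = if_else(is.na(sex), NA_real_, n / sum(n[!is.na(sex)]))
  )

sex_share
```

Women come out at roughly 69% of all rows, or about 71% of the rows with a recorded sex, so "about 2/3" is slightly on the low side.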

Numeric Columns of Interest

TSH

The sixth column that will be looked at is TSH, which measures the level of thyroid-stimulating hormone in the blood.

The distribution of this column’s data is as follows:

hypothyroid |>
  ggplot()+geom_histogram(mapping=aes(x=TSH), bins=30,fill="blue")+xlim(0,7.5)+ylim(0,300)+labs(title="Histogram of TSH Test Values",y="Count",x="TSH Test Result (microunits per milliliter)")+theme_bw()+geom_vline(xintercept=.7,color="orange",size=1.5)+annotate("text",x=1.4,y=270, label="<- the median")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 793 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

The distribution is right-skewed, which reflects that a number of patients in this dataset have hypothyroidism, as a higher result on this test is associated with hypothyroidism.
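One thing to keep in mind with the histogram code above is that xlim() drops values outside the limits before binning, so the removed-rows warning counts out-of-range TSH results together with missing ones. Below is a minimal sketch of an alternative, using made-up values rather than the dataset. It computes the median instead of hard-coding it, uses linewidth (which the deprecation warning recommends), and zooms with coord_cartesian() so that no rows are dropped. The toy data frame and the names tsh_median and p are hypothetical.

```r
library(ggplot2)

# Made-up right-skewed values standing in for the TSH column.
toy <- data.frame(TSH = c(0, 0.3, 0.5, 0.7, 0.7, 0.9, 1.4, 2.3, 4.1, 6.8, NA))

# Compute the median once so the reference line tracks the data.
tsh_median <- median(toy$TSH, na.rm = TRUE)

p <- ggplot(toy, aes(x = TSH)) +
  geom_histogram(bins = 30, fill = "blue", na.rm = TRUE) +
  geom_vline(xintercept = tsh_median, color = "orange", linewidth = 1.5) +
  # coord_cartesian() zooms the view without removing rows, unlike xlim().
  coord_cartesian(xlim = c(0, 7.5)) +
  labs(title = "Histogram of TSH Test Values", x = "TSH Test Result", y = "Count") +
  theme_bw()

p
```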

The min, max, mean and median are shown below.

hypothyroid |>
  summarize(mean=mean(TSH,na.rm=TRUE),median=median(TSH,na.rm=TRUE),min=min(TSH,na.rm=TRUE), max=max(TSH,na.rm=TRUE))
## # A tibble: 1 × 4
##    mean median   min   max
##   <dbl>  <dbl> <dbl> <dbl>
## 1  5.92    0.7     0   530

The mean is exceptionally high (the normal range is 0.5-4), which may be due to some outliers in this dataset. A likely explanation is that some results were recorded in different units and not converted to the standard unit, which is microunits per milliliter. Because of these outliers, the median seems a better measure of central tendency here.
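To illustrate why a few extreme results pull the mean but not the median, here is a small sketch with made-up values (not taken from the dataset) that includes one result as large as the maximum seen above. All variable names are hypothetical.

```r
# Made-up TSH-like values: mostly low results plus one extreme result.
tsh_toy <- c(0.4, 0.7, 0.9, 1.5, 2.3, NA, 530)

mean_all   <- mean(tsh_toy, na.rm = TRUE)
median_all <- median(tsh_toy, na.rm = TRUE)

# Drop the single extreme value (keeping the NA) and recompute.
trimmed     <- tsh_toy[is.na(tsh_toy) | tsh_toy < 100]
mean_trim   <- mean(trimmed, na.rm = TRUE)
median_trim <- median(trimmed, na.rm = TRUE)

c(mean_all = mean_all, median_all = median_all,
  mean_trim = mean_trim, median_trim = median_trim)
```

In this toy case the mean falls from about 89 to about 1.2 once the extreme value is removed, while the median only moves from 1.2 to 0.9.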

Next, we’ll look at the 25th, 50th, and 75th percentiles of the TSH column.

hypothyroid |>
  summarize(quantile(TSH,probs=seq(.25,.75,.25),na.rm=TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 3 × 1
##   `quantile(TSH, probs = seq(0.25, 0.75, 0.25), na.rm = TRUE)`
##                                                          <dbl>
## 1                                                          0  
## 2                                                          0.7
## 3                                                          2.3

This gives the 25th, 50th, and 75th percentiles for the TSH column. Interestingly, the 25th percentile is 0, which suggests a sizable group in this dataset may have hyperthyroidism, as that is what a low result on this test indicates.
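The deprecation warning in the output above suggests reframe() for summaries that return more than one row. A minimal sketch of what that could look like, on made-up values rather than the dataset; the names toy, probs, and tsh_quartiles are hypothetical.

```r
library(dplyr)

# Made-up values for illustration only.
toy <- tibble(TSH = c(0, 0.2, 0.7, 1.1, 2.3, 4.8, NA))

probs <- c(0.25, 0.50, 0.75)

# reframe() allows several rows per group, so no warning is raised.
tsh_quartiles <- toy |>
  reframe(
    percentile = probs * 100,
    value      = unname(quantile(TSH, probs = probs, na.rm = TRUE))
  )

tsh_quartiles
```

Naming the two columns also avoids the long backticked column header that the unnamed quantile() call produces.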

T3

The seventh column that will be looked at is T3, which measures the level of triiodothyronine in the blood. Triiodothyronine is one of the major hormones made by the thyroid.

We’ll start by looking at a histogram for the T3 column:

hypothyroid |>
  ggplot()+geom_histogram(mapping=aes(x=T3), bins=50,fill="purple")+labs(title="Histogram of T3 Test Values",x="T3 Test Results (nanomoles per liter)",y="count")+theme_bw()+geom_vline(xintercept=1.8,color="cyan",size=1.5)+annotate("text",x=2.7,y=300,label="<- the median")
## Warning: Removed 695 rows containing non-finite values (`stat_bin()`).

The distribution is centered around the median (the specific number is below) with a right skew that may reflect people in this dataset that have hyperthyroidism.

The min, max, mean and median are shown below.

hypothyroid |>
  summarize(mean=mean(T3,na.rm=TRUE),median=median(T3,na.rm=TRUE),min=min(T3,na.rm=TRUE), max=max(T3,na.rm=TRUE))
## # A tibble: 1 × 4
##    mean median   min   max
##   <dbl>  <dbl> <dbl> <dbl>
## 1  1.94    1.8     0  10.2

The mean and the median are much more similar for T3 than for TSH. The normal range for T3 is 0.9-2.8 nanomoles per liter, so both the median and the mean fall within it. The max is exceptionally high and likely reflects someone who has hyperthyroidism, while the minimum of 0 likely reflects someone with hypothyroidism.

Next, we’ll look at the 25th, 50th, and 75th percentiles of the T3 column.

hypothyroid |>
  summarize(quantile(T3,probs=seq(.25,.75,.25),na.rm=TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 3 × 1
##   `quantile(T3, probs = seq(0.25, 0.75, 0.25), na.rm = TRUE)`
##                                                         <dbl>
## 1                                                         1.4
## 2                                                         1.8
## 3                                                         2.3

This gives the 25th, 50th, and 75th percentiles for the T3 column. Thankfully, these are all within the normal range for the T3 test, unlike with the TSH column.

TT4

The eighth column that will be looked at is TT4, which measures the level of total thyroxine in the blood. Thyroxine is one of the other major hormones made by the thyroid and can be referred to as T4.

The distribution of test values is shown below:

hypothyroid |>
  ggplot()+geom_histogram(mapping=aes(x=TT4), bins=50,fill="green")+labs(title="Histogram of TT4 Test Values",x="TT4 Test Results (nanomoles per liter)",y="count")+theme_bw()+geom_vline(xintercept=104,color="black",size=1.5)+annotate("text",x=145,y=300,label="<- the median")
## Warning: Removed 249 rows containing non-finite values (`stat_bin()`).

The TT4 test results are the most normally distributed so far, though there is still a right skew. Again, these are people who likely have hyperthyroidism.

The min, max, mean and median are shown below.

hypothyroid |>
  summarize(mean=mean(TT4,na.rm=TRUE),median=median(TT4,na.rm=TRUE),min=min(TT4,na.rm=TRUE), max=max(TT4,na.rm=TRUE))
## # A tibble: 1 × 4
##    mean median   min   max
##   <dbl>  <dbl> <dbl> <dbl>
## 1  109.    104     2   450

The mean and the median are also fairly similar for TT4. Reference ranges for TT4 are usually not quoted in nanomoles per liter; in the more common units, the numbers are closer to the 1-2 range. However, I was able to find a standard range in nanomoles per liter, 57-148, which is more in line with these values. Both the mean and median are right in the middle of that range. The max again likely reflects someone who has hyperthyroidism, while the minimum of 2 likely reflects someone with hypothyroidism.

Next, we’ll look at the 25th, 50th, and 75th percentiles of the TT4 column.

hypothyroid |>
  summarize(quantile(TT4,probs=seq(.25,.75,.25),na.rm=TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 3 × 1
##   `quantile(TT4, probs = seq(0.25, 0.75, 0.25), na.rm = TRUE)`
##                                                          <dbl>
## 1                                                           83
## 2                                                          104
## 3                                                          128

All 3 percentiles (25th, 50th, and 75th) are within the normal range.

FTI

I am skipping the T4U column/test as I cannot find much information on that test.

The ninth column that will be looked at is FTI (free thyroxine index), which estimates the level of free thyroxine in the blood. While the TT4 test looks at total thyroxine, this test looks only at the free thyroxine.

The distribution for FTI is shown below:

hypothyroid |>
  ggplot()+geom_histogram(mapping=aes(x=FTI), bins=50,fill="orange")+labs(title="Histogram of FTI Test Values",x="FTI Test Results (nanomoles per liter)",y="count")+theme_bw()+geom_vline(xintercept=107,color="black",size=1.5)+annotate("text",x=145,y=385,label="<- the median")+xlim(0,400)
## Warning: Removed 266 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

Again, FTI is fairly normally distributed, though with some right skew from potential outliers with hyperthyroidism. Interestingly, there is some added weight near zero, which seems likely to reflect patients with hypothyroidism.

The min, max, mean and median are shown below.

hypothyroid |>
  summarize(mean=mean(FTI,na.rm=TRUE),median=median(FTI,na.rm=TRUE),min=min(FTI,na.rm=TRUE), max=max(FTI,na.rm=TRUE))
## # A tibble: 1 × 4
##    mean median   min   max
##   <dbl>  <dbl> <dbl> <dbl>
## 1  115.    107     0   881

The mean and the median are fairly similar for FTI as well. The normal range for FTI is 61-163 nanomoles per liter (I had to convert an existing range to nanomoles per liter, but the conversion factor was provided with the range, so this range should be pretty accurate), so both the median and the mean fall within it. The max probably reflects someone who has hyperthyroidism, while the minimum of 0 likely reflects someone with hypothyroidism.

Next, we’ll look at the 25th, 50th, and 75th percentiles of the FTI column.

hypothyroid |>
  summarize(quantile(FTI,probs=seq(.25,.75,.25),na.rm=TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 3 × 1
##   `quantile(FTI, probs = seq(0.25, 0.75, 0.25), na.rm = TRUE)`
##                                                          <dbl>
## 1                                                           91
## 2                                                          107
## 3                                                          129

The 25th, 50th, and 75th percentiles are all in the normal range for the FTI test, which is to be expected.

TBG

The final column we’ll look at is the TBG column, which measures the level of thyroxine binding globulin (TBG), which is the protein that carries thyroxine through the blood.

The distribution for TBG is shown below:

hypothyroid |>
  ggplot()+geom_histogram(mapping=aes(x=TBG), bins=30,fill="red")+labs(title="Histogram of TBG Test Values",x="TBG Test Results (nanomoles per liter)",y="count")+theme_bw()+geom_vline(xintercept=28,color="black",size=1.5)+annotate("text",x=45,y=50,label="<- the median")
## Warning: Removed 2903 rows containing non-finite values (`stat_bin()`).

The distribution has a slight right skew. One thing of note is that counts are significantly smaller for this test, reflecting that a much smaller group of people got the TBG test compared to the other 4 looked at so far.
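Going by the histogram warning above, 2,903 of the 3,163 rows have no TBG value, which leaves 260 people with a TBG result. The number of available results per test could be counted directly. This is a minimal sketch on made-up rows, with hypothetical names toy and n_measured.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- tibble(
  TSH = c(0.7, NA, 2.3, 1.1),
  T3  = c(1.8, 2.0, NA, NA),
  TBG = c(NA, NA, NA, 28)
)

# Number of non-missing results in each test column.
n_measured <- toy |>
  summarize(across(c(TSH, T3, TBG), ~ sum(!is.na(.x))))

n_measured
```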

The min, max, mean and median are shown below.

hypothyroid |>
  summarize(mean=mean(TBG,na.rm=TRUE),median=median(TBG,na.rm=TRUE),min=min(TBG,na.rm=TRUE), max=max(TBG,na.rm=TRUE))
## # A tibble: 1 × 4
##    mean median   min   max
##   <dbl>  <dbl> <dbl> <dbl>
## 1  31.3     28     0   122

The mean and the median are fairly similar for TBG. The normal range for TBG is 12-30 micrograms per milliliter. The median is in range, but the mean is not, and both are on the high end of the range. A higher value on this test is connected with hypothyroidism, so the maximum may come from someone who has hypothyroidism, while the minimum may come from someone who has hyperthyroidism.

Next, we’ll look at the 25th, 50th, and 75th percentiles of the TBG column.

hypothyroid |>
  summarize(quantile(TBG,probs=seq(.25,.75,.25),na.rm=TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 3 × 1
##   `quantile(TBG, probs = seq(0.25, 0.75, 0.25), na.rm = TRUE)`
##                                                          <dbl>
## 1                                                           21
## 2                                                           28
## 3                                                           34

The 25th and 50th percentiles are in range for a normal TBG test value, while the 75th percentile is out of range. This may reflect a number of people showing signs of hypothyroidism who have not been diagnosed yet.

Novel Questions to Investigate

  1. First, I would want to double check that a high TSH value is correlated with a low TT4 value as that is the general diagnostic tool for hypothyroidism. Mostly this is because the dataset does not use the most standard values for test results, and I want to be sure the dataset is usable.

  2. Second, I would like to see if sex makes one more likely to have a hypothyroidism test profile (so test results align with what would be seen with hypothyroidism).

  3. Similarly, I would like to see if age is related to having a more hypothyroidism-aligned profile.

  4. I would also like to investigate what impact the confounding factors from the dataset (illness, pregnancy, tumor, lithium, and goitre) have on test results.

  5. The other main question I have at this point is what factors were associated with getting these thyroid tests done (like age, sex, and potential confounding factors). I don’t plan on really exploring the columns on whether or not an individual got a test in this data dive, as there are so many other columns that are more interesting to look at on their own, but since they make up a significant portion of this dataset, it is worth looking into.
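For the first question, one option would be a rank correlation between TSH and TT4. This is only a sketch on made-up pairs, not results from the dataset; choosing Spearman over Pearson is my own assumption, made because the TSH values are so skewed.

```r
# Made-up paired results for illustration only.
toy <- data.frame(
  TSH = c(0.5, 1.2, 2.0, 15, 60, NA),
  TT4 = c(120, 110, 105, 60, 20, 95)
)

# Spearman works on ranks, so extreme TSH values carry less weight
# than they would with Pearson; "complete.obs" drops rows with an NA.
rho <- cor(toy$TSH, toy$TT4, method = "spearman", use = "complete.obs")

rho
```

A clearly negative value on the real data would support the expectation that high TSH goes with low TT4.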

A Couple of Interesting Aggregations to Look At So Far

Hypothyroidism and TSH/TT4 Means

The first thing worth aggregating is TSH/TT4 test results based on whether the person has hypothyroidism or not. This was mostly to make sure the data is accurate to what is expected.

First, we will compare means.

hypothyroid |>
  group_by(hypothyroid) |>
  summarize(mean_TSH=mean(TSH,na.rm=TRUE),mean_TT4=mean(TT4,na.rm=TRUE))
## # A tibble: 2 × 3
##   hypothyroid mean_TSH mean_TT4
##   <chr>          <dbl>    <dbl>
## 1 hypothyroid    63.6      35.4
## 2 negative        2.52    113.

As expected, those with hypothyroidism have higher TSH results and lower TT4 results on average than those who do not have hypothyroidism.
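Since the TSH mean is sensitive to extreme values, group medians and the number of available results could be reported alongside the means. This is a minimal sketch on made-up rows; the names toy and by_status are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only; not values from the dataset.
toy <- tibble(
  hypothyroid = c("hypothyroid", "hypothyroid", "negative", "negative", "negative"),
  TSH = c(40, 90, 0.8, 1.6, NA),
  TT4 = c(30, 42, 110, 98, 120)
)

# Per-group count of available TSH results plus medians for both tests.
by_status <- toy |>
  group_by(hypothyroid) |>
  summarize(
    n_TSH      = sum(!is.na(TSH)),
    median_TSH = median(TSH, na.rm = TRUE),
    median_TT4 = median(TT4, na.rm = TRUE)
  )

by_status
```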

This can also be visualized below:

hypothyroid |>
  filter(!is.na(TSH)) |>
  filter(!is.na(TT4)) |>
  ggplot()+geom_point(mapping=aes(x=TSH,y=TT4,color=hypothyroid, shape=hypothyroid))+labs(title="TT4 vs. TSH Test Results Grouped By Hypothyroidism Status",x="TSH Test Results (microunits per milliliter)",y="TT4 Test Results (nanomoles per liter)")+theme_bw()+xlim(0,200)
## Warning: Removed 7 rows containing missing values (`geom_point()`).

Very obviously, individuals with hypothyroidism have high TSH scores and low TT4 scores in comparison to those without it.

Gender Breakdown of Hypothyroidism Profile (via TSH/TT4 test)

Since one of the questions I had was whether there were any sex differences in TSH/TT4 profiles, I’ll also use that as an aggregation example.

First, we will compare means like above:

hypothyroid |>
  filter(!is.na(sex)) |> # removing missing values
  group_by(sex) |>
  summarize(mean_TSH=mean(TSH,na.rm=TRUE),mean_TT4=mean(TT4,na.rm=TRUE))
## # A tibble: 2 × 3
##   sex   mean_TSH mean_TT4
##   <chr>    <dbl>    <dbl>
## 1 F         6.54    114. 
## 2 M         4.55     96.5

Interestingly, there is little difference between the two. Women do average higher on both tests, but that does not point either way, since hypothyroidism is indicated by a higher TSH result but a lower TT4 result.

We will also visualize to see if we can see any differences there:

hypothyroid |>
  filter(!is.na(TSH)) |>
  filter(!is.na(TT4)) |>
  filter(!is.na(sex)) |>
  ggplot()+geom_point(mapping=aes(x=TSH,y=TT4,color=sex, shape=sex),alpha=.5)+labs(title="TT4 Test Results vs. TSH Test Results Grouped By Sex",x="TSH Test Results (microunits per milliliter)",y="TT4 Test Results (nanomoles per liter)")+theme_bw()+xlim(0,150)+scale_color_brewer(palette="Set1")
## Warning: Removed 16 rows containing missing values (`geom_point()`).

Again, there seem to be no major differences between the two. It seems more women skew to the lower right of the graph, which would reflect how women have an increased risk of hypothyroidism, but overall, both men and women have similar test result distributions. I would add smooth curves to show the trends for both men and women, but they would overlap too much in the hypothyroidism area of the graph to be meaningful.

I do find it interesting that women make up the extremes for both test results, which may explain the higher means seen above.

Conclusions

Since I don’t want to analyze this data too much before knowing what I have to do in the other data dives, I will stop here for this week. There are plenty more trends worth considering, which I will likely explore in future weeks. For now, a couple of additional questions I have going forward are whether age will prove to be as unrelated to a hypothyroidism profile as sex seems to be in the aggregation above, and whether the test results of people with hypothyroidism differ in how extreme they are by sex or age.