Data Dive 5

Teresa Ortyl

First, the tidyverse library and the data set were loaded.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Columns that Documentation Clarified

The documentation for this dataset ended up being very limited: it gives little more than the name of what each column represents and a brief description of why the data set was collected. In most cases, the name and the possible values of a column were clear enough that this lack of thorough documentation was not an issue. As a result, the documentation used for clarification will largely come from looking up what the different tests included in the dataset are and what normal results for those tests should be.

TT4, T3, FTI, and TBG Columns

These four numerical columns in particular greatly benefited from additional research. While I knew they were all blood tests that could be used with testing for hypothyroidism, I had no idea what each test specifically tested for and what the normal range of results would be for each test. However, through researching the blood tests, I was able to find not only what each test specifically tested, but also the normal range and units for test results. The results of this research are explained for each of these 4 columns in Data Dive 2 so I will not repeat it here.

TestName_measured Columns Associated with the 4 Test Columns Listed Above

Along with these numerical columns of test results, there are associated columns named TestName_measured that can contain only yes and no (for instance, TT4_measured is one such column). On their own, the names might suggest that these columns hold the test results, such as the measured level of the tested protein or molecule in the blood. However, the documentation showing that the columns can only contain yes or no values, together with the fact that corresponding columns with numerical values exist, makes it clear that each of these columns records whether an individual got that specific blood test or not. Comparing the complementary columns confirms this, because anywhere there is a no in a TestName_measured column, there is an NA in the matching TestName column. For instance, if an entry in TT4_measured is no, the corresponding entry in TT4 is NA.
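The comparison described above can be sketched in a few lines. This is only an illustration on a small made-up data frame, not the real data: the `toy` values are hypothetical, and the "n" code for "not measured" is an assumption that should be replaced with whatever the file actually stores. It uses base R so it runs without extra packages, and the same calls work on the `hypothyroid` tibble.

```r
# Hypothetical toy data standing in for two columns of the real dataset.
toy <- data.frame(
  TT4_measured = c("y", "n", "y", "n"),
  TT4          = c(110, NA, 95, NA)
)

# Cross-tabulate the flag against whether the numeric value is missing.
flag_vs_missing <- table(
  measured   = toy$TT4_measured,
  is_missing = is.na(toy$TT4)
)
print(flag_vs_missing)

# Count rows that break the rule "flag says not measured <=> value is NA".
# The "n" code is an assumption about how "no" is stored.
inconsistent <- sum((toy$TT4_measured == "n") != is.na(toy$TT4))
print(inconsistent)
```

If the count of inconsistent rows is zero for every TestName_measured/TestName pair, the flag columns carry no information beyond what the NA values already show.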

on_thyroxine, on_antithyroid_medication Columns

These two columns, based on their names and the data documentation, clearly record whether patients were on certain types of medication to treat thyroid conditions, but the documentation alone did not provide enough information about the medications to interpret the columns meaningfully. As with the 4 test result columns mentioned above, I had to look up the medications, which I learned are used to treat hypothyroidism and hyperthyroidism respectively. That has allowed me to understand what being on thyroxine or being on anti-thyroid medication represents.

Elements of Data that are Unclear Even After Reading the Data Documentation

T4U Column

From my research, I know the T4U test tests for T4 uptake, but it is tough to find documentation on what a normal value for it would look like, which is why I have never used this column or the T4U_measured column in any of my data dives so far. It also seems to be less common given how hard it is to find information on it in comparison to the other tests, so not using it in my analyses is not likely a major problem.

query_hypothyroid, query_hyperthyroid, query_on_thyroxine Columns

There is no good documentation on what these columns are supposed to represent. I even checked the documentation for the other datasets included with the hypothyroid dataset I am using, since they have similarly named columns, but was unable to find any explanation there either. My best guess is that they record whether patients were asked about these things, but I cannot say for certain what these columns represent, which is why I have not used them and will probably never use them in my data dives.
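One way to at least get a feel for these columns is to count how often each one is TRUE. The sketch below does this on a small made-up data frame. The `toy` values are hypothetical, and treating all three columns as logical is an assumption.

```r
# Hypothetical toy data; the real columns live in the hypothyroid tibble.
toy <- data.frame(
  query_hypothyroid  = c(FALSE, TRUE, FALSE, FALSE),
  query_hyperthyroid = c(FALSE, FALSE, FALSE, TRUE),
  query_on_thyroxine = c(FALSE, FALSE, FALSE, FALSE)
)

query_cols <- c("query_hypothyroid", "query_hyperthyroid", "query_on_thyroxine")

# Number of TRUE entries per query column (NA values are ignored).
true_counts <- sapply(toy[query_cols], sum, na.rm = TRUE)
print(true_counts)
```

Counts like these cannot say what the columns mean, but they do show how rarely each flag is set, which helps judge how much is lost by leaving the columns out.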

TSH Column

This is not a case of not understanding what the column represents. I know from the dataset documentation and from research that it holds TSH test results, and I know what normal TSH values should be. However, the typical range for TSH is 0.5-4 microunits per milliliter, with 10 microunits per milliliter being considered high enough to suggest hypothyroidism, yet the column contains values well into the hundreds. One explanation for this discrepancy could be that some of the TSH values were measured in different units from the standard. Another is that, because so many TSH values are that high, at least some of them are accurate.
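To put numbers on how many values sit above each reference point, a quick count per threshold helps. The sketch below uses a made-up vector of TSH values, so the counts it prints are purely illustrative. The threshold names are my own labels, and the 100 cutoff is just a round stand-in for "in the hundreds".

```r
# Hypothetical TSH values (microunits per milliliter), including one NA.
tsh <- c(0.8, 2.1, 3.9, 12, 45, 160, NA, 530)

# Reference points taken from the text above; the names are illustrative labels.
thresholds <- c(upper_normal = 4, hypothyroid_flag = 10, hundreds = 100)

# For each threshold, count the non-missing values above it.
counts <- sapply(thresholds, function(cut) sum(tsh > cut, na.rm = TRUE))
print(counts)
```

On the real data, `tsh` would be replaced with `hypothyroid$TSH`.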

To determine what might be an outlier, we will look first at the distribution of TSH values.

hypothyroid |>
  ggplot() +
  geom_histogram(mapping = aes(x = TSH), bins = 30, fill = "blue") +
  # vertical line at the TSH level considered high enough to suggest hypothyroidism
  # (linewidth replaces the deprecated size argument for lines)
  geom_vline(xintercept = 10, color = "orange", linewidth = 1.5) +
  annotate("text", x = 105, y = 500, label = "<- hypothyroidism TSH level") +
  labs(title = "Histogram of TSH Test Values", y = "Count", x = "TSH Test Result (microunits per milliliter)") +
  theme_bw()
## Warning: Removed 468 rows containing non-finite values (`stat_bin()`).

As you can see, the distribution is fairly continuous up until around 100 microunits per milliliter, where some gaps appear.

To get a better look at it, here is the same histogram, but only for TSH levels above 50 microunits per milliliter and with color based on hypothyroidism status.

hypothyroid |>
  ggplot() +
  geom_histogram(mapping = aes(x = TSH, fill = hypothyroid), bins = 30) +
  xlim(50, 500) +
  ylim(0, 20) +
  labs(title = "Histogram of TSH Test Values above 50 microunits per milliliter", y = "Count", x = "TSH Test Result (microunits per milliliter)") +
  theme_bw()
## Warning: Removed 3091 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).

This also confirms that there are still substantial counts of TSH results up to 100 microunits per milliliter, and there are some between 100 and 200 microunits per milliliter that might still be reasonable, though extremely high. Additionally, breaking this down by class shows that the majority of values this high are associated with hypothyroidism, which suggests these are legitimate values and not the result of wrong units or input errors. To be safe, I would probably consider anything higher than 100 microunits per milliliter in the TSH column to be an outlier, simply because there are so many fewer test results above 100 microunits per milliliter than between 50 and 100 microunits per milliliter.
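A minimal sketch of how that cutoff could be applied is below. The data frame is made up for illustration, the class labels "hypothyroid" and "negative" are assumptions about how the class column is coded, and `tsh_outlier` is a hypothetical column name.

```r
# Hypothetical toy data; class labels are assumed, not taken from the file.
toy <- data.frame(
  hypothyroid = c("hypothyroid", "negative", "hypothyroid", "negative", "hypothyroid"),
  TSH         = c(145, 1.2, 60, NA, 230)
)

# Cutoff proposed in the text above (microunits per milliliter).
tsh_cutoff <- 100

# Flag values above the cutoff; missing TSH values are never flagged.
toy$tsh_outlier <- !is.na(toy$TSH) & toy$TSH > tsh_cutoff

# Cross-tabulate the flag against hypothyroidism status.
outliers_by_class <- table(class = toy$hypothyroid, outlier = toy$tsh_outlier)
print(outliers_by_class)
```

Keeping the flag as its own column, rather than dropping rows, makes it easy to run later analyses both with and without the flagged values.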

Conclusions

Overall, the documentation and additional research have made this hypothyroidism dataset perfectly usable other than the 5 columns (T4U/T4U_measured and the 3 query columns) mentioned above as not having enough documentation to be certain what they represent or what the expected value may be. The other column of concern, the TSH column, has some outliers that will likely need to be acknowledged in further analysis, but the unusually high values do appear legitimate and not due to any sort of error. For the columns without concern, the data is perfectly understandable and normal test result ranges and the associated units for the data have been identified for use in all future analyses I will conduct with this dataset.