Data Dive 6

Teresa Ortyl

First, the tidyverse library and the data set were loaded.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("boot")
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")

## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Variable Combinations and Correlations

Set 1: Age and TSH/TT4 Results

For the first combination of variables, I decided to look at age versus the ratio of TSH to TT4 results. Because a higher TSH-to-TT4 ratio makes it more likely that a patient has hypothyroidism, this shows whether hypothyroidism is correlated with age. First, I will select only the desired variables from the data set, then add the ratio column:

set1 <- hypothyroid |> 
  select(hypothyroid,age,TSH, TT4) |> 
  mutate(TSHtoTT4=TSH/TT4)

I left the hypothyroid column in for visualization purposes.

Visualization

From here, we can visualize the relationship between age and TSH/TT4 results:

set1 |>
  ggplot()+geom_point(mapping=aes(x=age,y=TSHtoTT4,color=hypothyroid))+labs(title="Age Versus the Proportion of TSH results to TT4 Results",x="Age",y="TSH/TT4 (microunits per milliliter/nanomoles per liter)")+theme_bw()+ylim(0,10)

## Warning: Removed 888 rows containing missing values (`geom_point()`).

From this, there appears to be little to no correlation between age and the ratio of TSH results to TT4 results, which would suggest that the likelihood of having hypothyroidism does not depend strongly on age in this data set. There are also some extremely high outliers not shown on the graph (almost certainly because TSH has a few extremely high outliers, and TSH and TT4 are negatively correlated, as shown in set 2), but they are few and far between.
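The "removed rows" warning above lumps together two kinds of rows: those with a missing value and those that fall outside `ylim(0, 10)`. The sketch below separates the two counts in base R. The `ratio` vector is made up for illustration and is not taken from the hypothyroid data.

```r
# Synthetic stand-in for a TSH/TT4 ratio column (values are invented)
ratio <- c(0.01, 0.2, NA, 12.5, 0.05, NA, 3.1, 48.0)

# Rows dropped because the value is missing
n_missing <- sum(is.na(ratio))

# Rows dropped because the value lies outside the [0, 10] axis limits
n_out_of_range <- sum(!is.na(ratio) & (ratio < 0 | ratio > 10))

cat("missing:", n_missing, " out of range:", n_out_of_range, "\n")
```

If the goal is only to zoom in, `coord_cartesian(ylim = c(0, 10))` in ggplot2 changes the visible range without dropping the out-of-range rows.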

Correlation Coefficient

We will also calculate Pearson’s correlation coefficient to confirm this:

round(cor(set1$age, set1$TSHtoTT4,use="pairwise.complete.obs"), 2)

## [1] -0.03

As expected, Pearson’s correlation coefficient is nearly zero, which again suggests no linear correlation between age and the ratio of TSH results to TT4 results. This is in line with the fairly flat, horizontal band of points on the graph, with a few outliers here and there.
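The `use="pairwise.complete.obs"` argument matters here because both columns contain missing values. A minimal sketch with made-up vectors (not the real data) shows what it does for two variables:

```r
# Made-up vectors with missing values in different positions
x <- c(1, 2, NA, 4, 5, 6)
y <- c(2, NA, 3, 8, 10, 12)

# With the default settings, any NA makes the result NA
r_default <- cor(x, y)

# Keep only the pairs where both values are present
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Manual equivalent for two vectors: filter to complete pairs first
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])
```

For a single pair of vectors, the pairwise option and the manual filter give the same coefficient; the two approaches only differ when a whole matrix of variables is correlated at once.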

Confidence Interval

Next, we will calculate the 95% confidence interval for the mean of the TSH/TT4 ratio values:

#I am using the boot_ci function as written in the lab from class on Tuesday
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  
  boot.ci(b, conf = conf, type = "perc")
}

nona1<-set1 |> #removing NA rows so the confidence interval function can run properly
  na.omit()

boot_ci(nona1$TSHtoTT4, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.2539,  0.5835 )  
## Calculations and Intervals on Original Scale

The bootstrap percentile method used here is designed so that intervals built this way capture the true mean of the TSH/TT4 ratio about 95% of the time. As such, we can be 95% confident that the true mean TSH/TT4 ratio lies between 0.2539 and 0.5835 microunits per milliliter/nanomoles per liter. Because the bootstrap resamples at random, these endpoints shift slightly each time the code is run.
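To make the idea behind the percentile interval concrete, here is a minimal base R sketch that does not use the `boot` package. The sample `v` is simulated and the seed value is arbitrary; both are assumptions for illustration. The endpoints may differ slightly from what `boot.ci()` reports, since that function does its own interpolation.

```r
set.seed(42)  # arbitrary seed, fixed only so repeated runs give the same interval

# Simulated right-skewed sample standing in for the ratio values
v <- rexp(200, rate = 3)

# Resample with replacement 1000 times and record the mean of each resample
boot_means <- replicate(1000, mean(sample(v, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

Calling `set.seed()` before a bootstrap is also one way to keep the interval quoted in the text identical to the one printed in the output when the document is re-knit.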

Set 2: TSH and TT4

For the second variable combination, I intend to demonstrate that TSH and TT4 are correlated in this data set, as many of my analyses thus far have been based on the fact that, with hypothyroidism, TSH results are higher than they should be while TT4 results are lower than they should be. I will also create columns comparing TT4 and TSH values to the mean TT4 and TSH values, mostly to prove I can make other columns:

set2 <- hypothyroid |> 
  select(hypothyroid, TSH, TT4) |> 
  mutate(TT4toMean=TT4-mean(TT4,na.rm=TRUE)) |>
  mutate(TSHtoMean=TSH-mean(TSH,na.rm=TRUE))

I will use the TT4toMean and TSHtoMean columns in the visualization just for fun and because they give an idea of the distribution of TT4 and TSH results in comparison to the mean values. Again, the hypothyroid column was kept in as a visualization tool.

Visualization

set2 |>
  ggplot(aes(x=TSHtoMean,y=TT4toMean, linetype="Trend Line"))+geom_point(mapping=aes(color=hypothyroid))+labs(title="TSH Results' Versus TT4 Results' Differences from the Mean",x="TSH Results' Difference from Mean TSH (microunits per milliliter)",y="TT4 Results' Difference from Mean TT4 (nanomoles per liter)")+theme_bw()+xlim(-10,150)+ylim(-100,100)+geom_smooth(method="lm",se=FALSE,color="black")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 582 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 582 rows containing missing values (`geom_point()`).

## Warning: Removed 26 rows containing missing values (`geom_smooth()`).

As you can see, there appears to be a negative linear relationship between TSH results’ and TT4 results’ differences from their respective means. You can also see clearly that patients with hypothyroidism have TT4 results that are lower than the mean TT4 result and TSH results that are higher than the mean TSH result. In contrast, normal patients tend to have TSH results that are lower than the mean TSH value and TT4 results that are much closer to the mean than those of patients with hypothyroidism.

Correlation Coefficient

To confirm there is a negative linear relationship between TSH and TT4 results’ differences from their respective means, we will calculate Pearson’s correlation coefficient.

round(cor(set2$TSHtoMean, set2$TT4toMean,use="pairwise.complete.obs"), 2)

## [1] -0.32

There is a negative correlation, which is to be expected. However, it is not as strong as one might expect; this is likely because there are many more normal patients in this data set than patients with hypothyroidism, so the natural variation among the normal patients dilutes the correlation that is known to exist between TSH and TT4. The few extreme outliers in TSH may also affect the strength of the correlation.
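One point worth checking is that correlating the differences from the mean gives the same coefficient as correlating the raw TSH and TT4 values, because Pearson’s coefficient is unaffected by subtracting a constant. A small sketch with simulated data (the variables and seed below are invented, not taken from the data set):

```r
set.seed(1)  # arbitrary seed for a reproducible illustration

# Simulated, negatively related variables
x <- rnorm(100, mean = 5, sd = 2)
y <- 110 - 3 * x + rnorm(100, sd = 10)

# Correlation of the raw values
r_raw <- cor(x, y)

# Correlation after subtracting each variable's mean
r_centered <- cor(x - mean(x), y - mean(y))

all.equal(r_raw, r_centered)
```

So the centered columns change where the points sit on the axes, but not the strength or direction of the linear relationship.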

Confidence Interval

Next, we will calculate the 95% confidence interval for the mean TT4 results’ difference from the mean value:

nona2<-set2 |> #removing NA rows so the confidence interval function can run properly
  na.omit()

boot_ci(nona2$TT4toMean, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1.8726,  1.3815 )  
## Calculations and Intervals on Original Scale

From this, we can be 95% confident that the true mean difference between a TT4 result and the mean TT4 result lies between -1.8726 and 1.3815 nanomoles per liter. The interval is centered slightly below zero. Differences from the mean average to zero by construction, so this small offset most likely arises because the mean TT4 was computed before the rows with missing values were removed.
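The order of operations matters for this interval: the mean is subtracted first, and rows with missing values are dropped afterwards. The following sketch, using a handful of invented values, shows how that ordering can move the average difference away from zero:

```r
# Invented data: tt4 is complete, tsh has gaps
tt4 <- c(80, 95, 110, 120, 150)
tsh <- c(NA, 2.1, 1.4, NA, 0.9)

# Center using every available tt4 value
tt4_to_mean <- tt4 - mean(tt4, na.rm = TRUE)
mean_all <- mean(tt4_to_mean)         # zero by construction

# Dropping rows where tsh is missing changes which differences are averaged
keep <- !is.na(tsh)
mean_kept <- mean(tt4_to_mean[keep])  # generally not zero
```

Computing the mean after filtering instead would center the remaining differences exactly on zero, which is a design choice rather than a correction.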

Set 3: TT4 and FTI

TT4 and FTI results should be positively correlated, since the TT4 test measures total thyroxine while FTI measures just the free thyroxine. If one goes up, it seems reasonable to assume that the other will also go up. I will again make difference-from-the-mean columns for both TT4 and FTI, as I did with variable set 2.

set3 <- hypothyroid |> 
  select(hypothyroid, TT4,FTI) |> 
  mutate(TT4toMean=TT4-mean(TT4,na.rm=TRUE)) |>
  mutate(FTItoMean=FTI-mean(FTI,na.rm=TRUE))

Again, the hypothyroid column was kept in for visualization purposes.

Visualization

Much like set 2, I will visualize the differences from the mean to give more of an idea of the distribution.

set3 |>
  ggplot(aes(x=TT4toMean,y=FTItoMean, linetype="Trend Line"))+geom_point(mapping=aes(color=hypothyroid),alpha=0.3)+labs(title="TT4 Results' Versus FTI Results' Differences from the Mean",x="TT4 Results' Difference from Mean TT4 (nanomoles per liter)",y="FTI Results' Difference from Mean FTI (nanomoles per liter)")+theme_bw()+geom_smooth(method="lm",se=FALSE,color="black")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 249 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 249 rows containing missing values (`geom_point()`).

As expected, there is a pretty direct correlation between TT4 and FTI results. Also as expected, hypothyroid patients’ results for TT4 and FTI are both lower than the mean. Overall, most of the data falls between -100 and 100 nanomoles per liter for TT4 difference and between -100 and 100 nanomoles per liter for FTI difference, but there is somewhat of a right/positive skew for both TT4 and FTI differences from the mean result.

Correlation Coefficient

However, to confirm the strength of that direct correlation, we will calculate Pearson’s correlation coefficient between the two variables.

round(cor(set3$TT4toMean, set3$FTItoMean,use="pairwise.complete.obs"), 2)

## [1] 0.68

As you can see, Pearson’s correlation coefficient once again confirms there is a positive correlation between TT4 and FTI test results and, specifically in this case, their differences from their respective means. This is a stronger correlation than the one between TSH and TT4, and the difference in strength can be seen fairly easily by comparing the graphs. One thing that may have kept this correlation from being stronger is the secondary, steeper linear trend visible on the graph above; a single coefficient has to account for the points in both trends, which weakens it.

Confidence Interval

Finally, we will calculate the 95% confidence interval for the mean of FTI results’ differences from the mean value, as we already did this for TT4 in set 2:

nona3<-set3 |> #removing NA rows so the confidence interval function can run properly
  na.omit()

boot_ci(nona3$FTItoMean, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (-2.1984,  2.4723 )  
## Calculations and Intervals on Original Scale

From this, we can be 95% confident that the true mean difference between an FTI result and the mean FTI result lies between -2.1984 and 2.4723 nanomoles per liter. The interval is centered slightly above zero, which indicates that, on average, the difference between a given FTI value and the mean FTI value is very slightly positive among the rows that remain after removing missing values. Since differences from the mean average to zero by construction, this small offset most likely arises because the mean FTI was computed before those rows were removed.

Data Dive Week 6

Teresa Ortyl

2023-09-26
