First, the tidyverse library and the data set were loaded.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("boot")
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Variable Combinations and Correlations
For the first combination of variables, I decided to look at age versus the ratio of TSH to TT4 results (referred to below as the TSH/TT4 proportion). The higher the ratio of TSH to TT4, the more likely it is that a patient has hypothyroidism, so this comparison shows whether hypothyroidism is correlated with age. First, I will reduce the data set to only the desired combination of variables, then add the proportion column:
set1 <- hypothyroid |>
select(hypothyroid,age,TSH, TT4) |>
mutate(TSHtoTT4=TSH/TT4)
I left the hypothyroid column in for visualization purposes.
From here, we can visualize the relationship between age and TSH/TT4 results:
set1 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = TSHtoTT4, color = hypothyroid)) +
  labs(
    title = "Age Versus the Proportion of TSH Results to TT4 Results",
    x = "Age",
    y = "TSH/TT4 (microunits per milliliter/nanomoles per liter)"
  ) +
  theme_bw() +
  ylim(0, 10) # points outside 0-10 (and NAs) are dropped, hence the warning below
## Warning: Removed 888 rows containing missing values (`geom_point()`).
From this, there appears to be little to no linear correlation between age
and the proportion of TSH results to TT4 results, which would suggest that
the likelihood of having hypothyroidism does not depend strongly on age.
There are also some extremely high outliers not shown on the graph (almost
certainly because TSH has a few extremely high outliers, and TSH and
TT4 are negatively correlated, as shown in set 2), but they are
few and far between.
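To see how many points are hidden only by the axis limits, as opposed to being missing altogether, the two cases can be counted separately. The sketch below uses a small made-up vector named `ratio` as a stand-in for `set1$TSHtoTT4`; the values are hypothetical and are not taken from the data set.

```r
# Hypothetical toy ratios standing in for set1$TSHtoTT4 (not the real data).
ratio <- c(0.01, 0.2, NA, 12.5, 3.4, NA, 45.0, 0.7)

# ylim(0, 10) drops points that are missing or that fall outside [0, 10].
is_missing   <- is.na(ratio)
out_of_range <- !is_missing & (ratio < 0 | ratio > 10)

n_missing      <- sum(is_missing)    # rows with no ratio at all
n_out_of_range <- sum(out_of_range)  # rows hidden only by the axis limits
n_removed      <- n_missing + n_out_of_range

print(c(missing = n_missing, out_of_range = n_out_of_range, removed = n_removed))
```

Running the same three counts on the real column would show how much of the warning is due to missing values and how much is due to outliers beyond the plotted range.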
We will also calculate Pearson’s correlation coefficient to confirm this:
round(cor(set1$age, set1$TSHtoTT4,use="pairwise.complete.obs"), 2)
## [1] -0.03
As expected, Pearson’s correlation coefficient is nearly zero, which again suggests no linear correlation between age and the proportion of TSH results to TT4 results. This is in line with the graph, where the points form a fairly flat horizontal band with a few outliers here and there.
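Because Pearson’s coefficient can be dominated by a few extreme values, a rank-based coefficient such as Spearman’s is a useful cross-check when outliers are present. The sketch below uses made-up vectors `x` and `y` (not the hypothyroid data) in which a single extreme value weakens Pearson’s coefficient while leaving the rank order untouched.

```r
# Made-up data: y increases with x, but the last value is an extreme outlier.
x <- 1:20
y <- x
y[20] <- 1000

# Pearson measures linear association and is pulled around by the outlier.
r_pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks, so it only sees that y still increases with x.
r_spearman <- cor(x, y, method = "spearman")

print(round(c(pearson = r_pearson, spearman = r_spearman), 2))
```

The same `method = "spearman"` argument can be added to the `cor()` call used in this report, together with `use = "pairwise.complete.obs"`, to check whether the near-zero result depends on the outliers.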
Next, we will calculate the 95% confidence interval for the mean of the TSH/TT4 proportion values:
#I am using the boot_ci function as written in the lab from class on Tuesday
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
boot.ci(b, conf = conf, type = "perc")
}
nona1<-set1 |> #removing NA rows so the confidence interval function can run properly
na.omit()
boot_ci(nona1$TSHtoTT4, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.2539, 0.5835 )
## Calculations and Intervals on Original Scale
A 95% bootstrap confidence interval is built so that, over repeated samples, about 95% of the intervals produced this way would capture the true mean of the TSH/TT4 proportion. For this run, the interval is 0.2539 to 0.5835 microunits per milliliter/nanomoles per liter, so we can be 95% confident that the true mean lies in that range. Because the bootstrap resamples at random and no seed is set, the endpoints change slightly each time the code is run.
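Bootstrap intervals depend on random resampling, so the endpoints move a little from run to run unless a seed is fixed. The sketch below illustrates this with a percentile bootstrap written in base R. The helper `percentile_ci` and the vector `toy` are hypothetical names introduced here for illustration; this is not the `boot` package’s implementation, and its quantile interpolation may differ slightly from `boot.ci()`.

```r
# Percentile bootstrap for a statistic, written in base R for illustration.
percentile_ci <- function(v, func = mean, conf = 0.95, n_iter = 1000) {
  # Recompute the statistic on n_iter resamples drawn with replacement.
  stats <- replicate(n_iter, func(sample(v, replace = TRUE)))
  alpha <- 1 - conf
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Hypothetical right-skewed toy data standing in for the TSH/TT4 ratios.
set.seed(42)
toy <- rexp(200, rate = 2)

set.seed(123)
ci_first <- percentile_ci(toy)
set.seed(123)
ci_second <- percentile_ci(toy)    # same seed, so the same interval

ci_unseeded <- percentile_ci(toy)  # seed not reset: endpoints usually differ slightly

print(rbind(ci_first, ci_second, ci_unseeded))
```

Calling `set.seed()` once before `boot_ci()` in the report would likewise make the printed interval reproducible.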
For the second variable combination, I intend to demonstrate that TSH and TT4 are correlated in this data set, since a lot of my analyses thus far have been based on the fact that, with hypothyroidism, TSH results are higher than they should be while TT4 results are lower than they should be. I will also create columns comparing TT4 and TSH values to the mean TT4 and TSH values, mostly to prove I can make other columns:
set2 <- hypothyroid |>
select(hypothyroid, TSH, TT4) |>
mutate(TT4toMean=TT4-mean(TT4,na.rm=TRUE)) |>
mutate(TSHtoMean=TSH-mean(TSH,na.rm=TRUE))
I will use the TT4toMean and TSHtoMean columns in the visualization just for fun and because they give an idea of the distribution of TT4 and TSH results in comparison to the mean values. Again, the hypothyroid column was kept in as a visualization tool.
### Visualization
set2 |>
  ggplot(aes(x = TSHtoMean, y = TT4toMean, linetype = "Trend Line")) +
  geom_point(mapping = aes(color = hypothyroid)) +
  labs(
    title = "TSH Results' Versus TT4 Results' Differences from the Mean",
    x = "TSH Results' Difference from Mean TSH (microunits per milliliter)",
    y = "TT4 Results' Difference from Mean TT4 (nanomoles per liter)"
  ) +
  theme_bw() +
  xlim(-10, 150) +
  ylim(-100, 100) +
  geom_smooth(method = "lm", se = FALSE, color = "black")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 582 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 582 rows containing missing values (`geom_point()`).
## Warning: Removed 26 rows containing missing values (`geom_smooth()`).
As you can see, there appears to be some negative linear relationship
between TSH results and TT4 results’ difference from the mean TT4 test
results. You can also see clearly that patients with hypothyroidism have
TT4 test results that are lower than the mean TT4 result and TSH results
that are higher than the mean TSH result. In contrast, normal patients
tend to have TSH results that are lower than the mean TSH value and TT4
results that are much closer to the mean than those of patients with
hypothyroidism.
### Correlation Coefficient
To confirm there is a negative linear relationship between TSH and TT4 results’ differences from their respective means, we will calculate Pearson’s correlation coefficient.
round(cor(set2$TSHtoMean, set2$TT4toMean,use="pairwise.complete.obs"), 2)
## [1] -0.32
There is a negative correlation, which is to be expected. However, it is not as strong a correlation as one might expect. This is likely because there are far more normal patients in this data set than patients with hypothyroidism, so the natural variation among the many normal patients dilutes the correlation that is known to exist between TSH and TT4. The few extreme outliers in TSH may also weaken the correlation.
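Pearson’s coefficient is unchanged by subtracting a constant from either variable, so the correlation between the difference-from-the-mean columns is the same as the correlation between the raw TSH and TT4 values. The sketch below checks this on simulated vectors `tsh` and `tt4`, which are hypothetical and are not drawn from the data set.

```r
# Simulated stand-ins for TSH and TT4 with a built-in negative relationship.
set.seed(7)
tsh <- rexp(100, rate = 0.2)
tt4 <- 110 - 2 * tsh + rnorm(100, sd = 20)

# Correlation of the raw values.
r_raw <- cor(tsh, tt4)

# Correlation after centering each variable on its own mean.
r_centered <- cor(tsh - mean(tsh), tt4 - mean(tt4))

# The two agree up to floating-point error.
print(all.equal(r_raw, r_centered))
```

This means the coefficient reported for `TSHtoMean` and `TT4toMean` can be read directly as the correlation between TSH and TT4.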
Next, we will calculate the 95% confidence interval for the mean TT4 results’ difference from the mean value:
nona2<-set2 |> #removing NA rows so the confidence interval function can run properly
na.omit()
boot_ci(nona2$TT4toMean, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-1.8726, 1.3815 )
## Calculations and Intervals on Original Scale
From this, we can be 95% confident that the true mean difference between a TT4 result and the mean TT4 result lies between -1.8726 and 1.3815 nanomoles per liter; over repeated samples, about 95% of intervals built this way would capture it. The endpoints vary slightly between runs because the bootstrap resamples at random. The interval contains zero and is centered only slightly below it. That is expected: differences from the mean average to exactly zero over the rows used to compute the mean, so the small offset most likely comes from na.omit() dropping rows with missing values in other columns (such as TSH) and from resampling noise, rather than from the skew visible in the visualization above.
For the third variable combination, I will look at TT4 and FTI. The TT4 and FTI tests should be directly correlated, since the TT4 test measures total thyroxine while FTI measures just the free thyroxine. If one goes up, it seems reasonable to assume that the other will also go up. I will again make difference-from-the-mean columns for both TT4 and FTI, as I did with variable set 2.
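Differences from a mean always average to zero over the rows that were used to compute that mean, so a non-zero average can only appear when the rows change afterwards, for example when `na.omit()` drops rows that are missing a value in another column. The sketch below shows this on a made-up data frame `toy`; the numbers are hypothetical.

```r
# Made-up data: TT4 is complete, but TSH is missing in two rows.
toy <- data.frame(
  TSH = c(1.2, NA, 0.8, 30.0, NA, 2.1),
  TT4 = c(100, 120, 90, 40, 130, 110)
)

# Difference from the mean, using every available TT4 value.
toy$TT4toMean <- toy$TT4 - mean(toy$TT4, na.rm = TRUE)

# Over all rows the differences average to zero (up to rounding error).
mean_all <- mean(toy$TT4toMean)

# After na.omit() the rows with missing TSH are gone, so the average shifts.
mean_kept <- mean(na.omit(toy)$TT4toMean)

print(c(all_rows = mean_all, after_na_omit = mean_kept))
```

Computing the mean after removing incomplete rows, rather than before, would keep the centered column averaging to zero in the subset that is actually analyzed.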
set3 <- hypothyroid |>
select(hypothyroid, TT4,FTI) |>
mutate(TT4toMean=TT4-mean(TT4,na.rm=TRUE)) |>
mutate(FTItoMean=FTI-mean(FTI,na.rm=TRUE))
Again, the hypothyroid column was kept in for visualization purposes.
Much like set 2, I will visualize the differences from the mean to give more of an idea of the distribution.
set3 |>
  ggplot(aes(x = TT4toMean, y = FTItoMean, linetype = "Trend Line")) +
  geom_point(mapping = aes(color = hypothyroid), alpha = 0.3) +
  labs(
    title = "TT4 Results' Versus FTI Results' Differences from the Mean",
    x = "TT4 Results' Difference from Mean TT4 (nanomoles per liter)",
    y = "FTI Results' Difference from Mean FTI (nanomoles per liter)"
  ) +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, color = "black")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 249 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 249 rows containing missing values (`geom_point()`).
As expected, there is a pretty direct correlation between TT4 and FTI
results. Also as expected, hypothyroid patients’ results for TT4 and FTI
are both lower than the mean. Overall, most of the data falls between
-100 and 100 nanomoles per liter for TT4 difference and between -100 and
100 nanomoles per liter for FTI difference, but there is somewhat of a
right/positive skew for both TT4 and FTI differences from the mean
result.
However, to confirm the strength of that direct correlation, we will calculate Pearson’s correlation coefficient between the two variables.
round(cor(set3$TT4toMean, set3$FTItoMean,use="pairwise.complete.obs"), 2)
## [1] 0.68
As you can see, Pearson’s correlation coefficient once again confirms there is a direct correlation between TT4 and FTI test results and, specifically in this case, between their differences from their respective means. This is a stronger correlation than between TSH and TT4, and the difference in strength can be seen easily by comparing the graphs. One thing that may have kept this correlation from being stronger is the secondary, steeper linear trend visible on the graph above: a single coefficient has to account for the points on both trends, which lowers its value.
Finally, we will calculate the 95% confidence interval for the mean FTI results’ difference from the mean value, since we already did this for TT4 in set 2:
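The effect of a secondary trend on a single correlation coefficient can be illustrated with simulated data. In the sketch below, the vectors `x`, `y`, and `group` are hypothetical: most points follow a line with slope 1 and a smaller group follows a steeper line with slope 3. It shows only that mixing two tight trends lowers the pooled coefficient; it does not establish that this is what happens in the hypothyroid data.

```r
# Simulated data with two linear trends of different steepness.
set.seed(11)
x <- runif(200, -50, 50)
group <- rep(c("main", "steep"), times = c(150, 50))
slope <- ifelse(group == "main", 1, 3)
y <- slope * x + rnorm(200, sd = 5)

# Correlation when both trends are pooled together.
r_overall <- cor(x, y)

# Correlation within each trend on its own.
r_main  <- cor(x[group == "main"],  y[group == "main"])
r_steep <- cor(x[group == "steep"], y[group == "steep"])

print(round(c(overall = r_overall, main = r_main, steep = r_steep), 2))
```

If the secondary trend in the real plot corresponds to an identifiable subgroup, computing the coefficient within each subgroup would be one way to test this explanation.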
nona3<-set3 |> #removing NA rows so the confidence interval function can run properly
na.omit()
boot_ci(nona3$FTItoMean, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-2.1984, 2.4723 )
## Calculations and Intervals on Original Scale
From this, we can be 95% confident that the true mean difference between an FTI result and the mean FTI result lies between -2.1984 and 2.4723 nanomoles per liter; over repeated samples, about 95% of intervals built this way would capture it. As before, the endpoints vary slightly between runs because the bootstrap resamples at random. The interval contains zero and is centered only slightly above it. Since differences from the mean average to exactly zero over the rows used to compute the mean, this small positive offset most likely comes from na.omit() dropping rows with missing values in other columns and from resampling noise, rather than from the positive skew visible on the graph.
From these analyses, we can conclude that age is not linearly correlated with the TSH/TT4 proportion, which suggests that age plays little to no role in whether a person has hypothyroidism. We can also conclude that TSH and TT4 are negatively correlated as expected, though the correlation wasn’t nearly as strong as expected, which may mean reconsidering how much I rely on TSH and TT4 being correlated in future analyses. Finally, TT4 and FTI were positively correlated as expected, and more strongly than TSH and TT4 at that, which provides some relief that the data set is as it should be. If I had not found a positive correlation between FTI and TT4, there would be serious concerns about the quality of the data set.