First, the tidyverse library, effsize library, pwrss library, and the data set were loaded.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(effsize)
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Since one of the main points of the data set is showing difference between patients with and without hypothyroidism, I decided to first test one of the core differences between the two groups- their TSH and TT4 blood test results. I have long operated under the assumption that they are different based on existing literature, and I will now test to see if there is any evidence in my data set to disprove the null hypothesis that the test results are the same between the two groups.
I will select an alpha of 0.01 as so much of my data analyses have relied on the hypothyroidism and normal patients having different TSH/TT4 test results, so I want to minimize the probability of Type I error (and concluding that I reject the null hypothesis that the difference for TSH/TT4 test result proportions is equal to zero even when that null hypothesis is genuinely correct). I will pick .90 for power because while I do not want Type II error if I can help it, I am less concerned it than Type I error in this case and would rather the test be overzealous in being unable to reject the null hypothesis. The minimum effect size I will use is .8 (from calculating Cohen’s d) as I want a large effect size to draw conclusions about rejecting the null hypothesis since rejecting this null hypothesis has massive implications for further analysis and I want to be confident that if I reject the null hypothesis, it is based on a notable difference.
First, I will create a data subset to use for testing the null hypothesis. I removed all null values in the TSH/TT4 columns and removed the obvious outliers from the TSH column to prevent a wrong result from TSH outliers skewing the hypothyroidism TSH/TT4 mean and causing it to appear like the null hypothesis should be rejected in a case where it shouldn’t be rejected.
null1 <- hypothyroid |>
select(hypothyroid,TSH,TT4) |>
filter(is.na(TSH)==FALSE) |>
filter(is.na(TT4)==FALSE) |>
filter(TSH<=150) |> #to handle outliers
mutate(TSHtoTT4=TSH/TT4)
null1
## # A tibble: 2,678 × 4
## hypothyroid TSH TT4 TSHtoTT4
## <chr> <dbl> <dbl> <dbl>
## 1 hypothyroid 30 15 2
## 2 hypothyroid 145 19 7.63
## 3 hypothyroid 0 4 0
## 4 hypothyroid 7.3 57 0.128
## 5 hypothyroid 138 27 5.11
## 6 hypothyroid 7.7 54 0.143
## 7 hypothyroid 21 34 0.618
## 8 hypothyroid 92 39 2.36
## 9 hypothyroid 48 7.6 6.32
## 10 hypothyroid 21 53 0.396
## # ℹ 2,668 more rows
Next, I will calculate the mean TSH/TT4 for each group of patients.
means1 <- null1 |>
group_by(hypothyroid) |>
summarize(meanProp=mean(TSHtoTT4),count=n())
means1
## # A tibble: 2 × 3
## hypothyroid meanProp count
## <chr> <dbl> <int>
## 1 hypothyroid 3.04 136
## 2 negative 0.0309 2542
They are visibly different here, even after removing obvious outliers in the TSH category. However, just looking at the numbers is not enough to tell if this is difference is meaningful, so I will continue with statistical testing.
Next, I will calculate the effect size using Cohen’s d:
cohen.d(d = filter(null1, hypothyroid == "hypothyroid") |> pluck("TSHtoTT4"),
f = filter(null1, hypothyroid == "negative") |> pluck("TSHtoTT4"))
##
## Cohen's d
##
## d estimate: 2.699849 (large)
## 95 percent confidence interval:
## lower upper
## 2.512721 2.886977
This is a large estimate for Cohen’s d which seems to suggest there is a difference, but I will continue on with hypothesis testing for the purpose of this assignment. I calculated the kappa based on the counts in the table above.
Next, I will check to make sure my data set is large enough to meet the minimum sample requirements for my chosen alpha and power values. I set kappa equal to .05 to reflect the proportion of data in each group.
test <- pwrss.t.2means(mu1 = .8, sd1 = 1, kappa = .05, power = .90, alpha = 0.01, alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.9
## n1 = 25
## n2 = 492
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 515
## Non-centrality parameter = 3.902
## Type I error rate = 0.01
## Type II error rate = 0.1
plot(test)
The minimum sample size required for a Neyman-Pearson hypothesis test with these conditions is 517, which the dataset is much larger than, with individual groups needing to be at least 25 and 492 respectively, which is met as shown by the counts above. You can see in the graph above that to safely reject the null hypothesis, the t-value needs to be greater than approximately 3 (the critical value is probably smaller than that just looking at the graph above, but I will go for the conservative answer and say 3).
I will finally perform a 2-sample t-test to calculate the t-statistic for the difference, which will also give me the p-value I need for the Fisher’s style test.
test1<-t.test(TSHtoTT4 ~ hypothyroid, data = null1)
test1
##
## Welch Two Sample t-test
##
## data: TSHtoTT4 by hypothyroid
## t = 7.119, df = 135.01, p-value = 5.829e-11
## alternative hypothesis: true difference in means between group hypothyroid and group negative is not equal to 0
## 95 percent confidence interval:
## 2.175761 3.849648
## sample estimates:
## mean in group hypothyroid mean in group negative
## 3.0436266 0.0309221
With regards Neyman-Pearson hypothesis test, the t-value is 7.119, which is much larger than the critical t-value of around 3 as determined by the graph above, which means that I can reject the null hypothesis that the mean proportion of TSH test results to TT4 test results is the same between hypothyroidism and normal patients. From the Fisher’s style test, I got a p-value of 5.829 x 10^-11, which is incredibly low (the alpha I selected for the Neyman-Pearson hypothesis test is the same p-value I intended to use as a baseline of whether to reject the null hypothesis for the Fisher’s style test) and much lower than my maximum allowed p-value of .01, which means I can also safely reject the null hypothesis based on the results of the Fisher’s style test.
To better visualize this difference, I created a density plot:
null1 |>
ggplot(aes(x=TSHtoTT4,color=hypothyroid))+geom_density()+xlim(0,2.5)+ylim(0,2.5)+labs(title="Density of TSH per TT4 Test Results for Hypothyroidism and Normal Patients",x="TSH/TT4 (microunits/milliter per nanomoles/liter)")
## Warning: Removed 40 rows containing non-finite values (`stat_density()`).
Here, it is obvious that the proportions are very dissimilar between groups based on the very different density distributions.
I chose to test goiter patients’ TSH/TT4 test results compared to non-goiter patients mainly because I have not touched the columns of confounding factors (like pregnancy, lithium, etc.) in the data set up until this point and thought I needed to do so for something different and to keep things interesting. A goiter is an enlargement of the thyroid and can result in test results similar to hyperthyroidism or hypothyroidism depending on the type of goiter. Therefore, it seems reasonable that there could be a difference in TSH/TT4 test results when compared to patients in the data set that do not have a goiter, which is why I am performing this test to see if I can reject the null hypothesis that TSH/TT4 test results do not differ between goiter and non-goiter patients.
I will keep the alpha and power the same as I did with the first null hypothesis test (.01 and .09 respectively) since while this is not nearly as important to my overall data analysis as the data and test was with the first null hypothesis, I do not want to be more forgiving of errors. Essentially, I want to keep all my hypothesis testing to the same rigorous standard in terms of alpha and power. However, I will use a lower minimum effect size just because I do not necessarily expect there should be a strong effect like with the means tested in null hypothesis 1. In this case, I will use .5 for the minimum effect size since it is the average between the maximum effect size for negligible effect (.2) and the minimum effect size for large effect (.8).
First, I will create a data subset to use for testing the null hypothesis. Again, I removed NA values and obvious outliers in the TSH column.
null2 <- hypothyroid |>
select(goitre,TSH,TT4) |>
filter(is.na(TSH)==FALSE) |>
filter(is.na(TT4)==FALSE) |>
filter(TSH<=150) |> #to handle outliers
mutate(TSHtoTT4=TSH/TT4)
null2 |>
group_by(goitre) |>
summarize(count=n())
## # A tibble: 2 × 2
## goitre count
## <lgl> <int>
## 1 FALSE 2609
## 2 TRUE 69
Next, I will calculate the mean TSH/TT4 proportion for each group of patients.
means2 <- null2 |>
group_by(goitre) |>
summarize(meanProp=mean(TSHtoTT4),count=n())
means2
## # A tibble: 2 × 3
## goitre meanProp count
## <lgl> <dbl> <int>
## 1 FALSE 0.187 2609
## 2 TRUE 0.0856 69
There is a difference between the means, but they are much more similar than they were for the hypothyroidism versus normal patient test so I would not be surprised if hypothesis testing resulted in not being able to reject this null hypothesis.
Next, I will calculate the effect size via Cohen’s d:
cohen.d(d = filter(null2, goitre == FALSE) |> pluck("TSHtoTT4"),
f = filter(null2, goitre == TRUE) |> pluck("TSHtoTT4"))
##
## Cohen's d
##
## d estimate: 0.07780132 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.1613676 0.3169703
This is a negligible value for Cohen’s d which would suggest they are not meaningfully different (and in fact, this is smaller than the minimum effect size I listed above), but I will continue on with hypothesis testing for the purpose of this assignment.
Next, I want to determine the minimum sample size needed for the alpha, power, and minimum effect size I wanted. I calculated the kappa based on the counts in the table above.
test2 <- pwrss.t.2means(mu1 = .5, sd1 = 1, kappa = .026, power = .90, alpha = 0.01, alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.9
## n1 = 62
## n2 = 2352
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 2412
## Non-centrality parameter = 3.886
## Type I error rate = 0.01
## Type II error rate = 0.1
plot(test2)
The minimum sample size required for a Neyman-Pearson hypothesis test with these conditions is 2414, which the null2 dataset is larger than, with individual groups needing to be at least 62 and 2352 respectively, which is met as shown by the counts above though not by much for the group with a goiter. You can see in the graph above that to safely reject the null hypothesis, the t-value needs to be greater than approximately 3 (again, I am being conservative and saying 3 to be safe, even though this looks like it might be closer to 2.5).
Finally, I will perform a 2-sample t-test to calculate the t-statistic for the difference, which will also give me the p-value I need for the Fisher’s style test.
test2<-t.test(TSHtoTT4 ~ goitre, data = null2)
test2
##
## Welch Two Sample t-test
##
## data: TSHtoTT4 by goitre
## t = 2.1371, df = 136.67, p-value = 0.03437
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 0.007537072 0.194314122
## sample estimates:
## mean in group FALSE mean in group TRUE
## 0.18652019 0.08559459
With regards Neyman-Pearson hypothesis test, the t-value is 2.1371, which is smaller than the critical t-value of around 3 as determined by the graph above. This means that I failed to reject the null hypothesis that the mean proportion of TSH test results to TT4 test results is the same between goiter and non-goiter patients. From the Fisher’s style test, I got a p-value of .03437, which is higher than my maximum allowed p-value of .01, which means I also cannot reject the null hypothesis based on the results of the Fisher’s style test. I was significantly closer to rejecting the null hypothesis than expected based on Cohen’s d, which was surprising, though I still ultimately failed to reject the null hypothesis due to my strict standards for alpha and power.
To better visualize this lack of difference, I created a density plot:
null2 |>
ggplot(aes(x=TSHtoTT4,color=goitre))+geom_density()+xlim(0,2.5)+ylim(0,5)+labs(title="Density of TSH per TT4 Test Results for Goiter and Non-Goiter Patients",x="TSH/TT4 (microunits/milliter per nanomoles/liter)")
## Warning: Removed 40 rows containing non-finite values (`stat_density()`).
As you can see on the density plot above, the density plots for patients with and without goiter are much more similar than they were for patients with and without hypothyroidism. There is some notable spikes in the goiter density plot, which may reflect that goiter can result in hyperthyroidism or hypothyroidism-type symptoms and test patterns depending on the kind of goiter it is.
From the tests, I can safely reject the null hypothesis that the TSH/TT4 proportion for patients with and without hypothyroidism is the same and cannot reject the same null hypothesis for goiter and non-goiter patients. This vindicates both my decision to use the TSH/TT4 test result proportion as a meaningful way to tell hypothyroidism and non-hypothyroidism patients apart throughout my analyses and my decision to not investigate the confounding factors’ columns much since the different in TSH/TT4 proportion means between goiter and non-goiter patients turned out to not be statistically significant.