First, the tidyverse and lindia libraries and hypothyroidism dataset were loaded in.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lindia)
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")

## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Looking At a Specific Response Variable: TT4 Test Results

I wanted to focus on TSH test results as my response variable of interest as the TSH test is generally the go-to blood test for thyroid issues, but it is not normally distributed which makes running an ANOVA test difficult as the normality assumption will not be met. Instead, I decided to choose the second most-common test, the TT4 test, as it is still important and the TT4 test results are normally distributed.
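A numeric check can back up a visual judgment of normality. The sketch below is only an illustration on simulated values (the two vectors are stand-ins I made up, not columns from the hypothyroid data): it runs a Shapiro-Wilk test on a right-skewed sample and on a roughly bell-shaped one.

```r
# Simulated stand-ins, NOT the hypothyroid data (assumption for illustration only).
set.seed(42)
skewed_values    <- rlnorm(500, meanlog = 0, sdlog = 1)  # strongly right-skewed
symmetric_values <- rnorm(500, mean = 108, sd = 45)      # roughly bell-shaped

# Shapiro-Wilk tests the null hypothesis that a sample comes from a normal distribution.
skewed_test    <- shapiro.test(skewed_values)
symmetric_test <- shapiro.test(symmetric_values)

skewed_test$p.value     # very small, so normality is rejected for the skewed sample
symmetric_test$p.value  # compare against the chosen threshold
```

With thousands of rows, as in this dataset, such tests can flag even minor departures from normality, so they are best read alongside a histogram rather than instead of one.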

ANOVA Test: Comparing TT4 Test Results Based On Whether a Patient is Sick

Being sick can have an effect on TT4 levels, so it seemed worthwhile to determine whether a patient being sick had a notable effect on the TT4 test result for patients in this dataset.

The null hypothesis for this ANOVA test is that the mean TT4 value will not differ between patients who are sick and patients who are not sick.
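Because there are only two groups here, a one-way ANOVA tests the same hypothesis as a pooled-variance two-sample t-test, and the F statistic equals the squared t statistic. The sketch below illustrates this on simulated data (the `toy` data frame and its group sizes are assumptions for illustration, not the real dataset).

```r
# Simulated two-group data, NOT the hypothyroid data (assumption for illustration only).
set.seed(1)
toy <- data.frame(
  sick = rep(c(FALSE, TRUE), times = c(80, 20)),
  TT4  = c(rnorm(80, mean = 109, sd = 45), rnorm(20, mean = 98, sd = 40))
)

# One-way ANOVA with a two-level grouping variable.
anova_fit <- aov(TT4 ~ sick, data = toy)
f_value   <- summary(anova_fit)[[1]][["F value"]][1]

# Pooled-variance (equal variance) two-sample t-test on the same data.
t_fit   <- t.test(TT4 ~ sick, data = toy, var.equal = TRUE)
t_value <- unname(t_fit$statistic)

c(F = f_value, t_squared = t_value^2)  # the two numbers agree
```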

In terms of the assumptions, all the data points are independent as they are from different patients whose health and blood test results are not dependent on other patients’ health and results.

Checking Assumptions

With regards to the normality assumption, we will look at histograms for TT4 results for the groups:

hypothyroid |>
  ggplot()+facet_wrap(facets=vars(sick),scales="free")+geom_histogram(mapping=aes(x=TT4),fill="cornflowerblue",color="darkblue")+theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 249 rows containing non-finite values (`stat_bin()`).

Both groups (sick and not sick) have a reasonably normal distribution for TT4 test results, so I will say that the normality assumption is met.

With regards to the same variance assumption, I chose to look at standard deviation:

grouped <-hypothyroid |>
  group_by(sick) |>
  summarize(mean_TT4=mean(TT4,na.rm=TRUE),sd_TT4=sd(TT4,na.rm=TRUE))
grouped

## # A tibble: 2 × 3
##   sick  mean_TT4 sd_TT4
##   <lgl>    <dbl>  <dbl>
## 1 FALSE    109.    45.6
## 2 TRUE      98.4   39.2

From here, it is clear that the standard deviations for both groups are reasonably similar to each other, so the homoscedasticity assumption is met.
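One common rule of thumb (an assumption on my part, not a formal test) treats the equal-variance assumption as reasonable when the largest group standard deviation is less than about twice the smallest. Applied to the two standard deviations in the table above:

```r
# Standard deviations copied from the grouped summary table above.
sd_not_sick <- 45.6
sd_sick     <- 39.2

# Ratio of the larger to the smaller standard deviation.
sd_ratio <- max(sd_not_sick, sd_sick) / min(sd_not_sick, sd_sick)
sd_ratio      # about 1.16
sd_ratio < 2  # TRUE under the rule-of-thumb cutoff of 2
```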

Running the ANOVA Test

Now, I will run the ANOVA test, continuing to use a significance level of .01, as I did last week, as my standard for rejecting the null hypothesis:

anovaResults <- aov(TT4 ~ sick, data = hypothyroid)
summary(anovaResults)

##               Df  Sum Sq Mean Sq F value Pr(>F)  
## sick           1   10843   10843   5.248  0.022 *
## Residuals   2912 6015931    2066                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 249 observations deleted due to missingness

After conducting the ANOVA test, the resulting p-value was .022 (as seen above). This means that, if the null hypothesis that the groups have the same mean TT4 results were true, there would be a 2.2% chance of getting results at least this extreme. Since I set .01 as my threshold for rejecting the null hypothesis, I fail to reject the null hypothesis that the mean TT4 test result is the same between groups. Thus, this test does not provide sufficient evidence that the average TT4 test result for a sick person differs from the average TT4 test result for a non-sick person.
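The F value and p-value in the ANOVA table can be reproduced from the sums of squares and degrees of freedom it reports. This is just an arithmetic check using the rounded numbers printed above, so small rounding differences are expected.

```r
# Values copied from the ANOVA table above (already rounded in the printout).
mean_sq_sick  <- 10843 / 1        # Sum Sq / Df for the sick term
mean_sq_resid <- 6015931 / 2912   # Sum Sq / Df for the residuals

# F is the ratio of the two mean squares.
f_value <- mean_sq_sick / mean_sq_resid

# Upper-tail probability of the F distribution with (1, 2912) degrees of freedom.
p_value <- pf(f_value, df1 = 1, df2 = 2912, lower.tail = FALSE)

round(f_value, 3)  # close to the reported 5.248
round(p_value, 3)  # close to the reported 0.022
p_value < 0.01     # FALSE, so the null hypothesis is not rejected at .01
```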

Linear Regression: Evaluating a TT4 vs. TBG Linear Model

I chose to compare TBG and TT4 as the TBG test measures the level of the protein (thyroxine binding globulin or TBG for short) that moves thyroxine through the body, while TT4 measures the levels of total thyroxine. Therefore, it seems reasonable to assume TBG levels might influence TT4 levels.

Visualizing the Relationship

We can visualize the relationship between these 2 variables below:

hypothyroid |>
  filter(TT4<350)|> #there is an outlier that massively skews the trend line, so I am removing it to show the true trend
  ggplot(aes(x=TBG,y=TT4))+geom_point()+labs(title="TT4 Test Results Versus TBG Test Results")+theme_bw()+geom_smooth(method = "lm", se = FALSE, color = 'magenta')

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 2894 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 2894 rows containing missing values (`geom_point()`).

There was surprisingly little overlap in patients who got the TT4 test and patients who got the TBG test, so this line is made based on very few points. However, the trend seems clearly linear.
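The amount of overlap can be counted directly. The sketch below uses a tiny made-up data frame (`toy` is hypothetical, not the real data) to show one way to count rows where both tests were recorded; on the real data the same idea would be applied to the `TT4` and `TBG` columns.

```r
# Tiny made-up example, NOT the hypothyroid data (assumption for illustration only).
toy <- data.frame(
  TT4 = c(100, NA, 120, 95, NA, 110),
  TBG = c(NA, 20, 25, NA, NA, 30)
)

# TRUE only for rows where neither TT4 nor TBG is missing.
both_measured <- complete.cases(toy[, c("TT4", "TBG")])
sum(both_measured)  # 2 rows have both measurements
```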

Determining the Linear Regression Model

Next, I will determine what the trend line actually is:

sansOutlier <-hypothyroid |>
  filter(TT4<350)
baseLinearRegressionModel <- lm(TT4 ~ TBG, sansOutlier)
baseLinearRegressionModel$coefficients

## (Intercept)         TBG 
##  334.968876   -9.900195

From here, we can get the trend line of approximately y=335.0-9.9x, where x is the TBG test result and y is the TT4 test result. This means that when TBG test results are zero, TT4 test results would be expected to be about 335 nanomoles/liter, and that for every 1 nanomole/liter increase in TBG results, TT4 test results would be expected to decrease by around 10 nanomoles/liter.
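To make the interpretation concrete, the fitted line can be written as a small function using the two coefficients printed above. The function name `predict_tt4` and the example TBG value of 20 are my own choices for illustration.

```r
# Coefficients copied from the model output above.
intercept <- 334.968876
slope_tbg <- -9.900195

# Hypothetical helper: expected TT4 for a given TBG under the fitted line.
predict_tt4 <- function(tbg) {
  intercept + slope_tbg * tbg
}

predict_tt4(0)                    # the intercept, about 335
predict_tt4(20)                   # about 137
predict_tt4(21) - predict_tt4(20) # the slope, about -9.9 per unit of TBG
```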

Hypothesis Tests with the Linear Regression Model

Now that I have a trend line, hypothesis tests need to be run to see if we have the evidence to reject the null hypothesis of there not being a relationship between TBG test results and TT4 test results.

To start, I will look at a summary of the model:

summary(baseLinearRegressionModel)

## 
## Call:
## lm(formula = TT4 ~ TBG, data = sansOutlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.86  -51.44   13.34   56.01   67.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  334.969     75.763   4.421  0.00129 **
## TBG           -9.900      3.052  -3.244  0.00881 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.68 on 10 degrees of freedom
##   (2894 observations deleted due to missingness)
## Multiple R-squared:  0.5128, Adjusted R-squared:  0.4641 
## F-statistic: 10.53 on 1 and 10 DF,  p-value: 0.008806

Sticking with the threshold of 0.01 for rejecting the null hypothesis, the p-values for both coefficients are lower than the threshold. Additionally, the p-value associated with the F-statistic is also lower than 0.01. Thus, I have sufficient evidence to reject the null hypothesis that there is no relationship between TBG test results and TT4 test results, which indicates that there may be a relationship between the two.
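The t value and p-value for the TBG coefficient can be recomputed from the estimate, standard error, and residual degrees of freedom in the summary. This is an arithmetic check on the rounded printout, so the last digits may differ slightly.

```r
# Values copied from the coefficient table and residual line of the summary above.
estimate  <- -9.900
std_error <- 3.052
resid_df  <- 10

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 10 degrees of freedom.
p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

# With a single predictor, the overall F statistic is the squared t statistic.
f_value <- t_value^2

round(t_value, 3)  # about -3.244
round(p_value, 4)  # about 0.0088
round(f_value, 2)  # close to the reported 10.53
```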

The R-squared value is .5128, which seems decent enough (though we have no other models to compare it to in order to say whether it is good or bad).

From here, it is safe to say the model meets the assumptions that TBG and TT4 are linearly correlated and that observations are independent and uncorrelated. Since there is only one independent variable, the assumption that independent variables are not linearly correlated with each other is met automatically.
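The adjusted R-squared in the summary follows from the multiple R-squared, the number of observations used, and the number of predictors. A minimal sketch, assuming n = 12 (10 residual degrees of freedom plus 2 estimated coefficients); the helper name is my own:

```r
# Hypothetical helper implementing the standard adjusted R-squared formula.
adjusted_r_squared <- function(r_squared, n, p) {
  # n = observations used in the fit, p = number of predictors (excluding the intercept)
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

adj <- adjusted_r_squared(0.5128, n = 12, p = 1)
round(adj, 4)  # matches the reported 0.4641
```

The adjustment matters here because the sample is so small: with only 12 observations, each extra predictor costs a noticeable share of the degrees of freedom.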

Looking at Diagnostic Plots

Next, I will make a plot to see if errors have constant variance among all predictions:

gg_resfitted(baseLinearRegressionModel) +
  theme_bw()

Based on the plot above, the residuals seem pretty randomly distributed with no clear trends in variance as predictions of TT4 test results increase, which would suggest that the assumption that errors have constant variance across all predictions is true.

Finally, I will look at a QQ plot to see if errors are normally distributed over the prediction line:

gg_qqplot(baseLinearRegressionModel) +
  theme_bw()

The pattern of residuals is not linear, especially towards the quantile extremes, which would suggest that the assumption that errors are normally distributed over the prediction line is not met. However, it is not as blatant as it could be (like all residuals being on one side of the prediction line), which may suggest that we do not violate this assumption too badly.
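A numeric test can complement the visual read of a QQ plot. The sketch below fits a line to simulated data of the same size (12 points; `toy` and its coefficients are made up, not the real data) and runs a Shapiro-Wilk test on the residuals. With so few observations the test has little power, so it should not be treated as decisive.

```r
# Simulated 12-point dataset, NOT the hypothyroid data (assumption for illustration only).
set.seed(7)
toy   <- data.frame(x = runif(12, min = 15, max = 35))
toy$y <- 335 - 10 * toy$x + rnorm(12, sd = 60)

toy_model <- lm(y ~ x, data = toy)

# Null hypothesis: the residuals come from a normal distribution.
resid_test <- shapiro.test(residuals(toy_model))
resid_test$p.value  # a small value would count against the normality assumption
```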

Overall, the linear regression model between TBG and TT4 test results seems reasonable with the one major issue being with the assumption that errors are normally distributed over the prediction line, but even that is still not a blatant violation of the assumption so I can still accept this as a reasonable linear regression model.

Multi-Variable Linear Regression: TT4 Test Results Compared to TBG and FTI Test Results

Given how little overlap there is in patients who got the TBG test and patients who got the TT4 test, I was limited in what I could use for linear regression, hence only adding FTI for the multi-variable linear regression.

Checking for Interactions

First, I will check to make sure TBG and FTI are not correlated:

hypothyroid |>
  filter(TBG<100 & FTI <300)|> #removing 2 outliers that may give the illusion of a trend where there is none
  ggplot(mapping = aes(x = TBG, y = FTI)) + geom_point()+ geom_smooth(method = 'lm', se = FALSE, linewidth = 0.5)+labs(title="FTI Test Results Versus TBG Test Results")

## `geom_smooth()` using formula = 'y ~ x'

Though a trend line was drawn, the points are pretty scattered on the chart and do not follow the trend line closely, so it seems reasonable to conclude there is no collinearity between the two types of test results.
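Collinearity between two predictors can also be summarized numerically with their correlation and the variance inflation factor (VIF), which for exactly two predictors is 1 / (1 - r^2). The sketch below uses simulated values (the `toy` data frame is made up, not the real data) to show the calculation.

```r
# Simulated predictors, NOT the hypothyroid data (assumption for illustration only).
set.seed(3)
toy <- data.frame(
  TBG = rnorm(12, mean = 25, sd = 5),
  FTI = rnorm(12, mean = 110, sd = 30)
)

# Pearson correlation between the two predictors.
r <- cor(toy$TBG, toy$FTI)

# VIF for a model with exactly these two predictors.
vif <- 1 / (1 - r^2)

c(correlation = r, vif = vif)  # a VIF near 1 indicates little collinearity
```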

Making the Multiple Variable Linear Regression Model

multivariateLinearRegressionModel <- lm(TT4 ~ TBG + FTI, sansOutlier)
multivariateLinearRegressionModel$coefficients

## (Intercept)         TBG         FTI 
## -20.3011866   1.1146503   0.8803319

This gives a trend line of y=1.11x1+.88x2-20.30, where x1 is the TBG test result, x2 is the FTI test result, and y is the TT4 test result. This means that when FTI and TBG test results both equal zero, TT4 would be -20.30 (which is not physically possible). This also means that for every 1 nanomole/liter TBG goes up, TT4 will increase by 1.11 nanomoles/liter if FTI stays the same. Likewise, for every 1 nanomole/liter FTI goes up, TT4 will go up by .88 nanomoles/liter if TBG stays the same.
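The two-predictor equation can be written as a function using the coefficients printed above. The function name `predict_tt4_multi` and the example inputs (TBG = 20, FTI = 100) are my own choices for illustration.

```r
# Coefficients copied from the model output above.
intercept <- -20.3011866
coef_tbg  <- 1.1146503
coef_fti  <- 0.8803319

# Hypothetical helper: expected TT4 for given TBG and FTI under the fitted model.
predict_tt4_multi <- function(tbg, fti) {
  intercept + coef_tbg * tbg + coef_fti * fti
}

predict_tt4_multi(tbg = 20, fti = 100)  # about 90.0

# Raising FTI by 1 while holding TBG fixed changes the prediction by the FTI coefficient.
predict_tt4_multi(20, 101) - predict_tt4_multi(20, 100)  # about 0.88
```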

Hypothesis Testing with the Multiple Variable Linear Regression Model

Next, I will look at the summary to see the results of the hypothesis tests:

summary(multivariateLinearRegressionModel)

## 
## Call:
## lm(formula = TT4 ~ TBG + FTI, data = sansOutlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.788 -10.706   1.425   6.323  31.093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.30119   32.89875  -0.617    0.552    
## TBG           1.11465    1.12075   0.995    0.346    
## FTI           0.88033    0.06814  12.920 4.09e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.94 on 9 degrees of freedom
##   (2894 observations deleted due to missingness)
## Multiple R-squared:  0.9751, Adjusted R-squared:  0.9695 
## F-statistic: 176.1 on 2 and 9 DF,  p-value: 6.091e-08

From here, we can see that the p-values for the intercept and the TBG coefficient are both higher than .01. In particular, this suggests that there is no significant relationship between TT4 and TBG results once FTI is in the model, at least in comparison to the relationship between TT4 and FTI test results. However, the coefficient for FTI has an extremely low p-value, suggesting that the null hypothesis that there is no relationship between TT4 and FTI can be rejected. This is to be expected: see the plot comparing TT4 and FTI in the Week 6 Data Dive to see how strongly correlated they are. Additionally, the F-statistic has a p-value that is much lower than .01, which means that we reject the null hypothesis that neither TBG nor FTI test results have a relationship with TT4 test results, which is a given since FTI test results and TT4 test results are correlated.
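The comparison of each coefficient against the .01 threshold can be tabulated from the estimates and standard errors in the summary. This is an arithmetic check on the printed values with 9 residual degrees of freedom; the data frame layout is my own.

```r
# Estimates and standard errors copied from the coefficient table above.
coefs <- data.frame(
  term      = c("(Intercept)", "TBG", "FTI"),
  estimate  = c(-20.30119, 1.11465, 0.88033),
  std_error = c(32.89875, 1.12075, 0.06814)
)

# t statistic and two-sided p-value for each coefficient (9 residual degrees of freedom).
coefs$t_value <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pt(abs(coefs$t_value), df = 9, lower.tail = FALSE)

# Flag which coefficients fall below the .01 threshold used in this analysis.
coefs$below_threshold <- coefs$p_value < 0.01
coefs  # only FTI falls below the threshold
```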

Additionally, the adjusted R-squared value of .9695 (compared to .4641 for the single variable model) shows this model is much better than the single variable model, which is to be expected given that the variable added (FTI test results) is much more strongly correlated to TT4 test results than TBG test results are.

With regards to assumptions, it is safe to say by this point that the model meets the assumptions that TBG and FTI are linearly correlated to TT4, observations are independent and uncorrelated, and independent variables are not linearly correlated with each other.

Looking at Diagnostic Plots

Next, I will make a plot to see if errors have constant variance among all predictions:

gg_resfitted(multivariateLinearRegressionModel) +
  theme_bw()

Based on the plot above, the residuals seem decently randomly distributed, though they do shift from more negative residuals to more positive residuals as fitted values increase. This may suggest that the assumption that errors have consistent variance is not accurate.

Finally, I will look at a QQ plot to see if errors are normally distributed over the prediction line:

gg_qqplot(multivariateLinearRegressionModel) +
  theme_bw()

The pattern of residuals is mostly linear, which would suggest that the assumption that errors are normally distributed over the prediction line is met.

Overall, the linear regression model of TT4 test results on TBG and FTI test results seems reasonable, with the one major issue being the assumption that errors have consistent variance. Even so, it could be argued that this assumption is met, because it is the direction of the residuals that changes and not the overall size of the variance (the points remain on average about the same distance from the center line).

Conclusions

The ANOVA test above did not provide sufficient evidence of a relationship between sickness status and TT4 results, while the two linear regression models showed that it appears unlikely that there is no relationship between TT4 and TBG test results (Model 1) and between TT4 and FTI test results (Model 2). Of the two linear regression models, the multiple variable one is technically better, but that is largely because the relationship between FTI and TT4 test results is so strong (which is to be expected since FTI measures free thyroxine and TT4 measures total thyroxine). To better visualize the TBG and TT4 relationship, it would be better just to look at the single variable linear regression model.

Week 8 Data Dive

Teresa Ortyl

2023-10-10
