Week 11 Data Dive

First, the tidyverse, lindia, and car libraries and hypothyroidism dataset were loaded in.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")

## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl  (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Building a Linear Model

Selection of Response and Explanatory Variables

I will be predicting TSH values, which will require transformation via log since TSH values have a Poisson distribution. For explanatory variables, I will pick the hypothyroid and TT4 columns since they have long been established to be related to TSHthrough these data dives. I had intended to use poisson regression, but to be able to diagnose the model with the diagnostic plots we learned about, it requires using a regular model so I will use the log of TSH instead of doing Poisson regression with TSH as the response variable.

Converting the hypothyroid column to binary

First, for linear regression, I have to convert the hypothyroid column into binary form, as done below:

binarydata<-hypothyroid |>
  select(hypothyroid,TT4, TSH) |>
  filter(TSH>0) |> #This has to be done to prevent strange log values
  mutate(hypothyroid= ifelse(hypothyroid=="hypothyroid",1,0)) |>
  mutate(logTSH=log(TSH))
na.omit(binarydata)

## # A tibble: 1,801 × 4
##    hypothyroid   TT4   TSH logTSH
##          <dbl> <dbl> <dbl>  <dbl>
##  1           1  15    30     3.40
##  2           1  19   145     4.98
##  3           1   6   430     6.06
##  4           1  57     7.3   1.99
##  5           1  27   138     4.93
##  6           1  54     7.7   2.04
##  7           1  34    21     3.04
##  8           1  39    92     4.52
##  9           1   7.6  48     3.87
## 10           1  53    21     3.04
## # ℹ 1,791 more rows

binarydata

## # A tibble: 1,801 × 4
##    hypothyroid   TT4   TSH logTSH
##          <dbl> <dbl> <dbl>  <dbl>
##  1           1  15    30     3.40
##  2           1  19   145     4.98
##  3           1   6   430     6.06
##  4           1  57     7.3   1.99
##  5           1  27   138     4.93
##  6           1  54     7.7   2.04
##  7           1  34    21     3.04
##  8           1  39    92     4.52
##  9           1   7.6  48     3.87
## 10           1  53    21     3.04
## # ℹ 1,791 more rows

Checking That There Are No Other Transformations Needed

It is important now that the logTSH column exists to make sure that this new response variable has a linear relationship to the TT4 and hypothyroid variables (although hypothyroid is binary, so it is more looking for an obvious difference between groups).

First, we will look at log(TSH) vs TT4:

ggplot(data=binarydata, aes(x=TT4,y=logTSH))+geom_point(color="seagreen")+labs(title="The Log of TSH Test Results vs. TT4 Test Results",x="TT4 Test Results",y="log(TSH Test Results")+theme_bw()

As you can see, there is a clear linear relationship between TT4 and log(TSH), so no further transformations need to be done.

ggplot(data=binarydata, aes(x=hypothyroid,y=logTSH))+geom_point(color="navy")+labs(title="The Log of TSH Test Results vs Hypothyroidism Classification",x="Hypothyroidism Classification (0=Without Hypothyroidism, 1= With Hypothyroidism)",y="log(TSH Test Results)")+theme_bw()

Though not as obvious as with log(TSH) vs TT4, there appears to be a monotonic (as monotonic as a binary variable versus a continuous variable can be) relationship between the two and no further transformations need to be performed.

Creating the Model

Finally, we can create the linear model:

model <- lm(logTSH~ hypothyroid +TT4, binarydata)
summary(model)

## 
## Call:
## lm(formula = logTSH ~ hypothyroid + TT4, data = binarydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0130 -0.7645 -0.1111  0.7261  4.3386 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.5648944  0.0869282   18.00   <2e-16 ***
## hypothyroid  2.4828953  0.1219059   20.37   <2e-16 ***
## TT4         -0.0114559  0.0007602  -15.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.278 on 1798 degrees of freedom
## Multiple R-squared:  0.3874, Adjusted R-squared:  0.3867 
## F-statistic: 568.4 on 2 and 1798 DF,  p-value: < 2.2e-16

One thing of note with this model is the adjusted R-squared value is .3867, which is not particularly high and can probably be improved upon by tweaking the model.

Interpreting the Coefficients

The intercept means that when a patient doesn’t have hypothyroidism and has a TT4 test result of 0, the log(TSH test result) is equal to 1.56. THe hypothryoid coefficient means that when an individual has hypothyroidism, the log(TSH test result) increases by 2.48 compared if that individual hadn’t had hypothryoidism. Finally, the TT4 coefficient means that as the TT4 test result increases by one, the log(TSH test result) decrease by .01.

Diagnosing the Model

Beyond making sure the response variable (TSH) was normally distributed via a log transformation and ensuring that the two explanatory variables (hypothyroid and TT4) were reasonably linearly-related to the response variable, I will check a couple of diagnostic plots to ensure residuals have constant variance and are normally distributed.

Checking For Constant Variance of Residuals

I will use a residuals plot to check for this:

gg_resfitted(model) +
  theme_bw()

There is clearly greater variance closer to zero than there is at 4, so there is not constant variance of residuals (the two chunk nature reflects that one of the explanatory variables, hypothyroid was a binary attribute).

Checking for Normal Distribution of Residuals

Next, I will look at a QQ plot to see if the errors are normally distributed.

gg_qqplot(model) +  theme_bw()

Though the plot somewhat follows the line, the tails stray too far from it to say that the errors are normally distributed.

Checking that Hypothryoid and TT4 are not Linearly Correlated

Another assumption of linear regression is that independent variables are not linearly correlated, so it makes sense to check that TT4 and hypothyroid are not correlated:

ggplot(data=binarydata, aes(x=hypothyroid,y=TT4))+geom_point(color="purple")+labs(title="TT4 Test Results vs Hypothyroidism Classification",x="Hypothyroidism Classification (0=Without Hypothyroidism, 1= With Hypothyroidism)",y="TT4 Test Results")+theme_bw()

Unfortunately, there does seem to be a linear correlation between TT4 and hypothryoid, so one or the other should have been picked for the model and not both.

This can also be checked with variance inflation factor, which checks multicollinearity:

vif(model)

## hypothyroid         TT4 
##    1.244027    1.244027

The values are both about 1.24, which is fairly low (10 would be cause for concern) and may suggest that correlation we saw with the plot may not be as much of a concern as I initially thought.

Conclusions

A linear regression model between log(TSH) and hypothyroid and TT4 was created, which met the independent and linear relationship assumptions of a linear model but failed to meet the constant variance and normally distributed errors assumptions. The assumption that independent variables not being correlated was inconclusive whether it was met or not. As such, this linear model has concerns which may not make it particularly usable even if the coefficients were highly significant due to their extremely low p-values (although those p-values are based on neither hypothyroidism and TT4 test results being correlated to log(TSH test results) instead of just the explanatory variable in question).