First, the tidyverse, lindia, and car libraries and hypothyroidism dataset were loaded in.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I will be predicting TSH values, which will require transformation via log since TSH values have a Poisson distribution. For explanatory variables, I will pick the hypothyroid and TT4 columns since they have long been established to be related to TSHthrough these data dives. I had intended to use poisson regression, but to be able to diagnose the model with the diagnostic plots we learned about, it requires using a regular model so I will use the log of TSH instead of doing Poisson regression with TSH as the response variable.
First, for linear regression, I have to convert the hypothyroid column into binary form, as done below:
binarydata<-hypothyroid |>
select(hypothyroid,TT4, TSH) |>
filter(TSH>0) |> #This has to be done to prevent strange log values
mutate(hypothyroid= ifelse(hypothyroid=="hypothyroid",1,0)) |>
mutate(logTSH=log(TSH))
na.omit(binarydata)
## # A tibble: 1,801 × 4
## hypothyroid TT4 TSH logTSH
## <dbl> <dbl> <dbl> <dbl>
## 1 1 15 30 3.40
## 2 1 19 145 4.98
## 3 1 6 430 6.06
## 4 1 57 7.3 1.99
## 5 1 27 138 4.93
## 6 1 54 7.7 2.04
## 7 1 34 21 3.04
## 8 1 39 92 4.52
## 9 1 7.6 48 3.87
## 10 1 53 21 3.04
## # ℹ 1,791 more rows
binarydata
## # A tibble: 1,801 × 4
## hypothyroid TT4 TSH logTSH
## <dbl> <dbl> <dbl> <dbl>
## 1 1 15 30 3.40
## 2 1 19 145 4.98
## 3 1 6 430 6.06
## 4 1 57 7.3 1.99
## 5 1 27 138 4.93
## 6 1 54 7.7 2.04
## 7 1 34 21 3.04
## 8 1 39 92 4.52
## 9 1 7.6 48 3.87
## 10 1 53 21 3.04
## # ℹ 1,791 more rows
It is important now that the logTSH column exists to make sure that this new response variable has a linear relationship to the TT4 and hypothyroid variables (although hypothyroid is binary, so it is more looking for an obvious difference between groups).
First, we will look at log(TSH) vs TT4:
ggplot(data=binarydata, aes(x=TT4,y=logTSH))+geom_point(color="seagreen")+labs(title="The Log of TSH Test Results vs. TT4 Test Results",x="TT4 Test Results",y="log(TSH Test Results")+theme_bw()
As you can see, there is a clear linear relationship between TT4 and log(TSH), so no further transformations need to be done.
ggplot(data=binarydata, aes(x=hypothyroid,y=logTSH))+geom_point(color="navy")+labs(title="The Log of TSH Test Results vs Hypothyroidism Classification",x="Hypothyroidism Classification (0=Without Hypothyroidism, 1= With Hypothyroidism)",y="log(TSH Test Results)")+theme_bw()
Though not as obvious as with log(TSH) vs TT4, there appears to be a monotonic (as monotonic as a binary variable versus a continuous variable can be) relationship between the two and no further transformations need to be performed.
Finally, we can create the linear model:
model <- lm(logTSH~ hypothyroid +TT4, binarydata)
summary(model)
##
## Call:
## lm(formula = logTSH ~ hypothyroid + TT4, data = binarydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0130 -0.7645 -0.1111 0.7261 4.3386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5648944 0.0869282 18.00 <2e-16 ***
## hypothyroid 2.4828953 0.1219059 20.37 <2e-16 ***
## TT4 -0.0114559 0.0007602 -15.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.278 on 1798 degrees of freedom
## Multiple R-squared: 0.3874, Adjusted R-squared: 0.3867
## F-statistic: 568.4 on 2 and 1798 DF, p-value: < 2.2e-16
One thing of note with this model is the adjusted R-squared value is .3867, which is not particularly high and can probably be improved upon by tweaking the model.
The intercept means that when a patient doesn’t have hypothyroidism and has a TT4 test result of 0, the log(TSH test result) is equal to 1.56. THe hypothryoid coefficient means that when an individual has hypothyroidism, the log(TSH test result) increases by 2.48 compared if that individual hadn’t had hypothryoidism. Finally, the TT4 coefficient means that as the TT4 test result increases by one, the log(TSH test result) decrease by .01.
Beyond making sure the response variable (TSH) was normally distributed via a log transformation and ensuring that the two explanatory variables (hypothyroid and TT4) were reasonably linearly-related to the response variable, I will check a couple of diagnostic plots to ensure residuals have constant variance and are normally distributed.
I will use a residuals plot to check for this:
gg_resfitted(model) +
theme_bw()
There is clearly greater variance closer to zero than there is at 4, so there is not constant variance of residuals (the two chunk nature reflects that one of the explanatory variables, hypothyroid was a binary attribute).
Next, I will look at a QQ plot to see if the errors are normally distributed.
gg_qqplot(model) + theme_bw()
Though the plot somewhat follows the line, the tails stray too far from it to say that the errors are normally distributed.
A linear regression model between log(TSH) and hypothyroid and TT4 was created, which met the independent and linear relationship assumptions of a linear model but failed to meet the constant variance and normally distributed errors assumptions. The assumption that independent variables not being correlated was inconclusive whether it was met or not. As such, this linear model has concerns which may not make it particularly usable even if the coefficients were highly significant due to their extremely low p-values (although those p-values are based on neither hypothyroidism and TT4 test results being correlated to log(TSH test results) instead of just the explanatory variable in question).