First, the tidyverse library and hypothyroidism dataset were loaded in.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
hypothyroid <- read_delim("./hypothyroid data set.csv", delim = ",")
## Rows: 3163 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): hypothyroid, sex, TSH_measured, T3_measured, TT4_measured, T4U_mea...
## dbl (7): age, TSH, T3, TT4, T4U, FTI, TBG
## lgl (11): on_thyroxine, query_on_thyroxine, on_antithyroid_medication, thyro...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Building a Generalized Linear Model for the Hypothyroid Column
Given this is a hypothyroidism-focused dataset, the binary column most worth modeling is whether or not an individual has hypothyroidism. As such, it will be used as the response variable.
First, I need to change this column so that it is 0/1 binary.
binarydata<-hypothyroid |>
mutate(hypothyroid= ifelse(hypothyroid=="hypothyroid",1,0))
binarydata
## # A tibble: 3,163 Ă— 26
## hypothyroid age sex on_thyroxine query_on_thyroxine
## <dbl> <dbl> <chr> <lgl> <lgl>
## 1 1 72 M FALSE FALSE
## 2 1 15 F TRUE FALSE
## 3 1 24 M FALSE FALSE
## 4 1 24 F FALSE FALSE
## 5 1 77 M FALSE FALSE
## 6 1 85 F FALSE FALSE
## 7 1 64 F FALSE FALSE
## 8 1 72 F FALSE FALSE
## 9 1 20 F FALSE FALSE
## 10 1 42 F FALSE FALSE
## # ℹ 3,153 more rows
## # ℹ 21 more variables: on_antithyroid_medication <lgl>, thyroid_surgery <lgl>,
## # query_hypothyroid <lgl>, query_hyperthyroid <lgl>, pregnant <lgl>,
## # sick <lgl>, tumor <lgl>, lithium <lgl>, goitre <lgl>, TSH_measured <chr>,
## # TSH <dbl>, T3_measured <chr>, T3 <dbl>, TT4_measured <chr>, TT4 <dbl>,
## # T4U_measured <chr>, T4U <dbl>, FTI_measured <chr>, FTI <dbl>,
## # TBG_measured <chr>, TBG <dbl>
Converting to 1 and 0 allows for a model to be built with hypothyroidism as the response variable. Being negative is 0 and having hypothyroidism is 1.
I chose 4 response variables- on_thyroxine, thyroid_surgery, TSH, and TT4. Being on thyroxine and having thyroid surgery are both attributes that are connected with hypothyroidism as hypothyroidism patients are generally on thyroxine and thyroid surgery, especially when it is a complete thyroid removal, can result in a patient essentially having hypothyroidism for the rest of their lives. TSH and TT4 were chosen because these are the two tests most commonly done when diagnosing hypothyroidism so an assoication between whether a patient has hypothyroidism and their results for these 2 blood tests should be associated.
Like with the hypothyroid column, I also had to convert the on_thyroxine and thyroid_surgery columns to 0 and 1 where 0 is False and 1 is True for both columns.
binarydata<-binarydata |>
select(hypothyroid,on_thyroxine,thyroid_surgery,TSH,TT4) |>
mutate(on_thyroxine= ifelse(on_thyroxine==TRUE,1,0)) |>
mutate(thyroid_surgery= ifelse(thyroid_surgery==TRUE,1,0))
binarydata
## # A tibble: 3,163 Ă— 5
## hypothyroid on_thyroxine thyroid_surgery TSH TT4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 30 15
## 2 1 1 0 145 19
## 3 1 0 0 0 4
## 4 1 0 0 430 6
## 5 1 0 0 7.3 57
## 6 1 0 0 138 27
## 7 1 0 1 7.7 54
## 8 1 0 0 21 34
## 9 1 0 0 92 39
## 10 1 0 0 48 7.6
## # ℹ 3,153 more rows
###Creating a Basic Model and Explaining the Coefficients
hypomodel <- glm(hypothyroid ~on_thyroxine+thyroid_surgery+TSH+TT4 , data = binarydata,
family = binomial(link = 'logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
hypomodel$coefficients
## (Intercept) on_thyroxine thyroid_surgery TSH TT4
## 1.88730043 -0.36479160 1.24559288 0.04522853 -0.07478100
It is difficult to interpret the coefficients on their own with logistic regression; however, e to the power of the coefficient is able to be interpreted. These are calculated below:
thyroxineco<-exp(-.36479160)
surgeryco<-exp(1.24559288)
TSHco<-exp(.04522853)
TT4co<-exp(-.07478100)
intercept<-exp(1.88740043)
paste("on_thyroxine Coefficent:",round(thyroxineco,3),
"thryoid_surgery Coefficient:",round(surgeryco,3),
"TSH Coefficient:", round(TSHco,3),
"TT4 Coefficient:",round(TT4co,3),
"Intercept:", round(intercept,3))
## [1] "on_thyroxine Coefficent: 0.694 thryoid_surgery Coefficient: 3.475 TSH Coefficient: 1.046 TT4 Coefficient: 0.928 Intercept: 6.602"
From here, we can see that a unit increase in on_thyroixine results in an approximately 31% decrease in the odds of a patient having hypothyroidism. Next, thyroid_surgery results in a 345% increase in the odds of a patient having hypothyroidism per unit increase. TSH results in a 4.6% increase in the odds per unit increase, while TT4 results in a 7.2% decrease in the odds per unit increase.The odds of a patient having hypothyroidism. The odds of a patient having hypothyroidism if all 4 explanatory variables equaled zero (the patient does not take thyrxoine, has not had thyroid surgery, and has TSH and TT4 results both equal to 0) is 6, which means the odds are in favor of the patient having hypothyroidism in this circumstance.For the most part, they are consistent with what one would expect based on known relationships with the lone exception of on_thyroxine, which results in a decrease in odds of having hypothyroidism as on_thyroxine shifts from 0 to 1 instead of the expected increase in odds for the same shift.
Next, I will calculate the 95% confidence intervals for each of the coefficients;
confint(hypomodel)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 2.5 % 97.5 %
## (Intercept) 1.02541100 2.78764779
## on_thyroxine -1.44265687 0.59855578
## thyroid_surgery 0.13927603 2.22655468
## TSH 0.03135026 0.06037731
## TT4 -0.08841836 -0.06239622
From this, we can say the coefficients’ confidence intervals, when calculated the same way, will capture the true coefficient 95% of the time. Interestingly, one of on_thyroxine’s confidence interval bounds is negative and the other is positive, which may suggest that using on_thyroxine in this model may not be a good choice as the coefficient may not actually affect the odds of having hypothyroidism the way the model shows.
###Transformation to Improve Model
From past data dives, I know that TSH does not have a normal distribution, which can once again be illustrated below:
binarydata |>
ggplot(mapping=aes(x=TSH))+geom_histogram(fill="maroon")+xlim(0,100)+ylim(0,300)+labs(title="Histogram of TSH Test Results (microunits/milliliter)", x="TSH Test Results (microunits/milliliter)")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 496 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
As you can see, the data is heavily right-skewed. I will be trying various transformations to get more normal TSH test results.
First I will try taking the square root of TSH test results:
binarydata |>
mutate(sqrtTSH=sqrt(TSH))|>
ggplot(mapping=aes(x=sqrtTSH))+geom_histogram(fill="maroon")+labs(title="Histogram of Square Root of TSH Test Results", x="Square Root of TSH Test Results")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 468 rows containing non-finite values (`stat_bin()`).
Square root does not seem to help, so I will try taking the log of TSH next:
binarydata |>
mutate(logTSH=log(TSH))|>
ggplot(mapping=aes(x=logTSH))+geom_histogram(fill="maroon")+labs(title="Histogram of Log of TSH Test Results", x="Log of TSH Test Results ")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1362 rows containing non-finite values (`stat_bin()`).
Taking the log seems to work to normalize TSH test results, so I will create a new data set for a second model featuring the on_thyroxine, thyroid_surgery, log(TSH), and TT4 columns.
The second model is built below:
binarydata2<-binarydata |>
filter(TSH>0)|> #I had to set TSH greater than 0 for the model to work
mutate(logTSH=log(TSH))
hypomodel2 <- glm(hypothyroid ~on_thyroxine+thyroid_surgery+logTSH+TT4 , data = binarydata2,
family = binomial(link = 'logit'))
hypomodel2$coefficients
## (Intercept) on_thyroxine thyroid_surgery logTSH TT4
## -0.50296678 -0.81330625 1.02607492 1.20067187 -0.06102514
The intercept and coefficients changed, as it to be expected.
thyroxineco2<-exp(-.81330625)
surgeryco2<-exp(1.02607492)
TSHco2<-exp(1.20067187)
TT4co2<-exp(-.06102514)
intercept2<-exp(-.50296678)
paste("on_thyroxine Coefficent:",round(thyroxineco2,3),
"thryoid_surgery Coefficient:",round(surgeryco2,3),
"Log(TSH) Coefficient:", round(TSHco2,3),
"TT4 Coefficient:",round(TT4co2,3),
"Intercept:", round(intercept2,3))
## [1] "on_thyroxine Coefficent: 0.443 thryoid_surgery Coefficient: 2.79 Log(TSH) Coefficient: 3.322 TT4 Coefficient: 0.941 Intercept: 0.605"
The Log(TSH) Coefficient has a much bigger effect on the odds than the TSH Coefficient from the first model, which is interesting and may be a reflection of the smaller scale of log(TSH) values compared to TSH values, but the direction of the 4 explanatory variables effecting the response variable did not change. However, the intercept is now with an odds of 0.605, which suggests it is more likely that an individual with 0 for all four explanatory variables does not have hypothyroidism.
From the 2 logistic regression models made, using the thyroid_surgery, TT4, and TSH columns seemed reasonable for trying to predict a patient’s hypothryoidism status using logistic regression. The on_thyroxine model would likely not be used going forward since its confidence interval covered both negative and positive values, which would have changed the direction of how a change from 0 (not on thryoxine) to 1 (on thyroxine) affected whether or not a patient had hypothyroidism.