Group 07A: Enola Heinzer, Leonie von Moos
Group 07B: Selina Kohler, Razoare Iulia
The data analyzed stems from an exploratory research study focusing on scoliosis and its diagnostic methods. Exploratory research means that no prior hypothesis is tested, so no definitive conclusions can be drawn without confirming them in a new study setting.
The provided dataset is a combination of two separately acquired datasets. The first one originated from the Larson Study, where data was collected from 72 participants who did not suffer from scoliosis. The second dataset was collected at the Balgrist University Hospital and included 120 participants with diagnosed or suspected scoliosis. The dataset analysed in the following pages is limited to 38 participants without and 39 participants with scoliosis. Reasons for this limitation were not given.
The diagnosis as well as the monitoring of scoliosis progression typically require frequent X-ray imaging, which exposes patients to potentially harmful levels of radiation over time. A new, less harmful measurement method is therefore needed. For this reason, two ideas were tested.
The first was to fit a model that correctly predicts whether or not a person suffers from scoliosis, using only parameters that do not require an X-ray image to be taken. In this model, the variable scoliosis was considered the response variable.
In a second step, a model was fitted to predict the primary Cobb angle of every participant, as this is the standard measure used to diagnose scoliosis (> 10°). For this prediction only the 38 participants of the Balgrist study could be considered, as no Cobb angle measurements were available for the other study. In this case, the primary Cobb angle was considered the response variable.
## 'data.frame': 77 obs. of 24 variables:
## $ age : int 15 17 20 23 16 15 14 16 23 15 ...
## $ gender : int 1 2 1 1 2 1 2 1 1 1 ...
## $ height : int 164 173 158 163 185 167 176 167 157 177 ...
## $ weight : int 46 68 56 51 60 61 59 62 43 63 ...
## $ primary_cobbAngle : int -33 18 15 38 -25 -12 -21 28 -33 54 ...
## $ cobbAngle_proximal_thoracic: int -15 -6 -12 NA -14 NA -7 -26 -6 -27 ...
## $ cobbAngle_main_thoracic : int 28 18 15 38 21 9 18 28 24 54 ...
## $ cobbAngle_thoracolumbar : int -33 -11 NA -38 -25 -12 -21 -16 -33 -31 ...
## $ rib_hump : num NA 5 0 10 5 0 5 10 5 5 ...
## $ lumbar_hump : num NA 20 5 13 22 10 7 0 10 5 ...
## $ RoM : num 63.1 54 71.3 69.9 57.8 ...
## $ inclination_diff : num -4.85 5.03 1.06 -7.23 -7.3 ...
## $ bending_diff : num 15.84 1.36 -11.15 -3.99 27.23 ...
## $ relative_bending_asymmetry : num 0.2509 0.0252 -0.1564 -0.057 0.4714 ...
## $ left_bending_angle : num 16.44 13.61 20.55 27.66 4.46 ...
## $ right_bending_angle : num 32.3 15 9.4 23.7 31.7 ...
## $ left_inclination_angle : num -34 -24.5 -35.1 -38.6 -32.5 ...
## $ right_inclination_angle : num 29.1 29.5 36.2 31.3 25.2 ...
## $ scoliosis : num 1 1 1 1 1 1 1 1 1 1 ...
## $ spine_diagnosis : int NA NA NA NA NA NA NA NA NA NA ...
## $ apex_proximal_thoracic.fac : Factor w/ 11 levels "","-","Th1","Th1/2",..: 7 4 3 2 7 1 11 8 5 8 ...
## $ apex_main_thoracic.fac : Factor w/ 14 levels "","Th10","Th10/11",..: 11 6 11 5 13 3 14 2 9 13 ...
## $ apex_thoracolumbar.fac : Factor w/ 10 levels "","L1","L1/2",..: 2 7 1 9 3 4 5 5 10 4 ...
## $ scoliosis.fac : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
scoliosis_data$severity <- NA
scoliosis_data <- scoliosis_data %>%
  mutate(severity = case_when(
    scoliosis == 0 ~ 'no',
    abs(primary_cobbAngle) >= 40 ~ 'severe',
    abs(primary_cobbAngle) >= 20 & abs(primary_cobbAngle) < 40 ~ 'moderate',
    abs(primary_cobbAngle) >= 10 & abs(primary_cobbAngle) < 20 ~ 'mild',
    TRUE ~ NA_character_  # default case: leave severity as NA if no condition matches
  ))
How is the variable scoliosis distributed in the dataset?
0: no scoliosis
1: scoliosis
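The count below was presumably produced with a simple frequency table; a minimal sketch, assuming the data frame is still called scoliosis_data:
# hedged sketch: frequency table of the binary scoliosis variable
table(scoliosis_data$scoliosis)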
##
## 0 1
## 38 39
In the beginning, the scoliosis cases were categorized according to their severity. This is how the categories are distributed over all participants.
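A minimal sketch of how this table could be produced, using the severity column created above (table() drops the missing values by default):
# hedged sketch: frequency table of the severity categories
table(scoliosis_data$severity)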
##
## mild moderate no severe
## 13 15 38 9
Bending Difference:
hist(scoliosis_data$bending_diff, main = "Histogram of Bending Difference",
     xlab = "Angle Difference [°]", col = "steelblue")
qqnorm(scoliosis_data$bending_diff, main = "Q-Q Plot of Bending Differences")
qqline(scoliosis_data$bending_diff, col = "steelblue")
Primary Cobb Angle:
hist(abs(scoliosis_data$primary_cobbAngle), main = "Histogram of Absolute Primary Cobb Angles",
     xlab = "Angle [°]", col = "steelblue")
qqnorm(abs(scoliosis_data$primary_cobbAngle), main = "Q-Q Plot of Absolute Primary Cobb Angles")
qqline(abs(scoliosis_data$primary_cobbAngle), col = "steelblue")
Rib Hump:
hist(scoliosis_data$rib_hump, main = "Histogram of Rib Humps",
     xlab = "Angle of Trunk Rotation [°]", col = "steelblue")
qqnorm(scoliosis_data$rib_hump, main = "Q-Q Plot of Rib Humps")
qqline(scoliosis_data$rib_hump, col = "steelblue")
Models were fitted for two different response variables. On the one hand, a binary model was used to predict the scoliosis status of each person (yes/no); on the other hand, a model was fitted for the primary Cobb angle of every participant, as this angle is used not only to diagnose scoliosis but also to monitor its progression.
As the scoliosis status is a binary variable (0 or 1), the first model was a binomial generalized linear model predicting scoliosis from the bending difference.
glm_skoli <- glm(scoliosis ~ abs(bending_diff),
                 data = scoliosis_data,
                 family = "binomial")
summary(glm_skoli)
##
## Call:
## glm(formula = scoliosis ~ abs(bending_diff), family = "binomial",
## data = scoliosis_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67121 0.40002 -1.678 0.0934 .
## abs(bending_diff) 0.07360 0.03494 2.107 0.0351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 106.73 on 76 degrees of freedom
## Residual deviance: 101.88 on 75 degrees of freedom
## AIC: 105.88
##
## Number of Fisher Scoring iterations: 4
ggplot(data = scoliosis_data,
       mapping = aes(y = scoliosis,
                     x = abs(bending_diff))) +
  geom_hline(yintercept = c(0, 1), colour = "lightgrey") +
  geom_point(aes(color = scoliosis)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
It can be observed that the fitted curve is nearly linear over the observed range, so this model would not be able to clearly distinguish between a healthy and an affected individual. This raised the question of whether a difference could be found for the extreme groups, i.e. participants with no scoliosis versus those with severe scoliosis.
glm_severity <- glm(scoliosis ~ abs(bending_diff),
                    data = scoliosis_data %>%
                      filter(severity %in% c("severe", "no")),
                    family = "binomial")
summary(glm_severity)
##
## Call:
## glm(formula = scoliosis ~ abs(bending_diff), family = "binomial",
## data = scoliosis_data %>% filter(severity %in% c("severe",
## "no")))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.14161 0.89671 -3.503 0.000459 ***
## abs(bending_diff) 0.15444 0.06336 2.438 0.014787 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.907 on 46 degrees of freedom
## Residual deviance: 38.746 on 45 degrees of freedom
## AIC: 42.746
##
## Number of Fisher Scoring iterations: 5
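The model summary below resembles the output of jtools::summ() with exponentiated coefficients; assuming that package was used, it could be reproduced roughly like this (a sketch, not the confirmed original call):
# hedged sketch: odds-ratio summary of the severity model (assumes the jtools package)
library(jtools)
summ(glm_severity, exp = TRUE)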
## MODEL INFO:
## Observations: 47
## Dependent Variable: scoliosis
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(1) = 7.16, p = 0.01
## Pseudo-R² (Cragg-Uhler) = 0.23
## Pseudo-R² (McFadden) = 0.16
## AIC = 42.75, BIC = 46.45
##
## Standard errors:MLE
## ------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ----------------------- ----------- ------ ------- -------- ------
## (Intercept) 0.04 0.01 0.25 -3.50 0.00
## abs(bending_diff) 1.17 1.03 1.32 2.44 0.01
## ------------------------------------------------------------------
ggplot(data = scoliosis_data %>%
         filter(severity %in% c("severe", "no")),
       mapping = aes(y = scoliosis,
                     x = abs(bending_diff))) +
  geom_hline(yintercept = c(0, 1), colour = "lightgrey") +
  geom_point(aes(color = severity)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
As shown in the first summary, 'bending_diff' seems to have a significant effect on the scoliosis status. To follow this up and obtain a more precise prediction, the additional category of moderate scoliosis was incorporated; however, 'bending_diff' then lost its significance as a predictor.
The model curve appears to be approximately sigmoid. However, the points of the two categories (scoliosis/no scoliosis) are not easily separable. Larger bending differences seem to predict scoliosis, but there are also some scoliosis patients with a small bending difference, and it is possible that these would not be identified by the model. A major limitation is the very small number of severe scoliosis patients in the dataset, which considerably reduces the power of the model. To further investigate this possible relationship, a new study testing specifically for this hypothesis would be necessary.
The second idea was to try to predict the patient's primary Cobb angle, which is clinically the most relevant. By definition, this is the most prominent angle of the scoliosis, independent of its location on the spine. It is determined from X-rays of the spine, which is why a simple model based on more easily measurable parameters (such as the size of the humps of the spine) would be advantageous for future scoliosis patients.
For this, a linear model based on 'bending_diff' was fitted again. This time the variable 'rib_hump' was added, as it seemed to be the more promising of the two humps when looking at correlation plots between the primary Cobb angle and the rib/lumbar hump (a sketch of such a check follows below). The range of motion was also added to the model, but it showed a considerably higher p-value and was therefore quickly discarded.
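A minimal sketch of the kind of correlation check described above; the plotting choices are assumptions, the variable names are taken from the dataset:
# hedged sketch: compare the primary Cobb angle with the two hump measurements
pairs(scoliosis_data[, c("primary_cobbAngle", "rib_hump", "lumbar_hump")])
cor(scoliosis_data[, c("primary_cobbAngle", "rib_hump", "lumbar_hump")],
    use = "pairwise.complete.obs")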
It is important to note that the model can only be trained and evaluated using the data from the Balgrist study (n = 38), because no Cobb angle was available for the participants of the Larson study.
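The summary below belongs to the linear model referred to later as lm2param; the fitting call is not shown, but based on the Call line it was presumably along these lines:
# hedged reconstruction of the model object used in the diagnostics below
# (the name lm2param is taken from the residual plots further down)
lm2param <- lm(primary_cobbAngle ~ bending_diff + rib_hump, data = scoliosis_data)
summary(lm2param)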
##
## Call:
## lm(formula = primary_cobbAngle ~ bending_diff + rib_hump, data = scoliosis_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.459 -14.275 -0.614 16.113 44.147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8376 5.8480 -0.485 0.63092
## bending_diff -0.8235 0.2996 -2.749 0.00989 **
## rib_hump 1.6565 0.6776 2.445 0.02038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.96 on 31 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 0.3866, Adjusted R-squared: 0.347
## F-statistic: 9.769 on 2 and 31 DF, p-value: 0.0005129
Both variables, 'bending_diff' and 'rib_hump', show low p-values and can be considered significant predictors of the Cobb angle.
Assumptions:
ggplot(mapping = aes(y = resid(lm2param), x = fitted(lm2param))) +
  geom_abline(intercept = 0, slope = 0) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Expected Value of Errors is Zero:
Looking at the first plot, the smoothed line does not lie exactly on the zero line, but it is not far off, so the model seems to perform reasonably well. The residuals mostly lie within a range of about 10°. Because a curved trend line was observed, it was assumed that the model cannot completely capture the relationship between the predictors and the outcome, indicating that the relationship may not be linear. In that case, the assumption that the errors have zero expectation is not fulfilled.
Because of this, quadratic terms were added for the predictors, but as soon as one was added, the corresponding variable lost its significance in the model.
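As an illustration of this step, one of the quadratic extensions might have looked like the following; the model name and the choice of which predictor to square are assumptions:
# hedged sketch: add a quadratic term for rib_hump to the two-predictor model
lm2param_quad <- lm(primary_cobbAngle ~ bending_diff + rib_hump + I(rib_hump^2),
                    data = scoliosis_data)
summary(lm2param_quad)  # per the text, adding such a term made the variable lose significance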
Normality of Residuals: The residuals seemed to be mostly normally distributed, as seen in the Q-Q plot. To ensure that the remaining deviation was within an acceptable range, 20 lines simulated from a normal distribution were plotted as well; this is shown in the last plot above. The residuals were therefore considered roughly normally distributed.
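A minimal sketch of how such a comparison with simulated normal samples could be drawn; the exact plotting code is an assumption, lm2param is the model from above:
# hedged sketch: Q-Q plot of the residuals with 20 lines from simulated normal samples
res <- resid(lm2param)
qqnorm(res, main = "Q-Q Plot of Residuals")
qqline(res, col = "steelblue")
for (i in 1:20) {
  sim <- rnorm(length(res), mean = 0, sd = sd(res))
  qqline(sim, col = adjustcolor("grey", alpha.f = 0.4))  # reference line from one simulated sample
}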
Homoscedasticity (Constant Variability of Residuals): To check if the variability is constant, the square-root transformed absolute values of the residuals were plotted against the fitted values.
ggplot(mapping = aes(y = sqrt(abs(resid(lm2param))), x = fitted(lm2param))) +
  geom_abline(intercept = 0, slope = 0) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Looking at the grey curve, which represents the spread of the residuals, the variance seems to be roughly constant.
scoliosis_non_zero <- scoliosis_data %>%
  filter(!is.na(rib_hump))
scoliosis_non_zero$resid.lm2param <- resid(lm2param)
ggplot(data = scoliosis_non_zero,
       mapping = aes(y = resid.lm2param, x = severity)) +
  geom_boxplot()
Influential observations:
Looking at the outliers, three are found and highlighted in the plots (observations 11, 25, 30). However, all of them lie below the threshold of 0.5 for Cook's distance. Therefore, none of them is influential enough to distort the model, and they do not have to be handled separately.
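A minimal sketch of the influence check described above, assuming the model object lm2param and the base R diagnostic plots:
# hedged sketch: Cook's distances of the fitted linear model
cooks_d <- cooks.distance(lm2param)
plot(lm2param, which = 4)   # Cook's distance per observation
which(cooks_d > 0.5)        # observations above the 0.5 threshold (none expected)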
Because the relationship between the predictors and the outcome did not seem to be linear, regression splines were used to try to improve the model.
library(splines)  # provides ns()
lm_spline <- lm(primary_cobbAngle ~ ns(bending_diff, df = 4) +
                  ns(rib_hump, df = 3),
                data = scoliosis_data)
summary(lm_spline)
##
## Call:
## lm(formula = primary_cobbAngle ~ ns(bending_diff, df = 4) + ns(rib_hump,
## df = 3), data = scoliosis_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.03 -11.82 1.61 10.28 47.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.388 30.183 0.212 0.8340
## ns(bending_diff, df = 4)1 -20.890 27.572 -0.758 0.4555
## ns(bending_diff, df = 4)2 -23.544 24.730 -0.952 0.3498
## ns(bending_diff, df = 4)3 -19.748 65.156 -0.303 0.7642
## ns(bending_diff, df = 4)4 -48.707 21.025 -2.317 0.0287 *
## ns(rib_hump, df = 3)1 51.347 23.927 2.146 0.0414 *
## ns(rib_hump, df = 3)2 33.732 23.922 1.410 0.1704
## ns(rib_hump, df = 3)3 29.547 25.088 1.178 0.2496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.32 on 26 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 0.4698, Adjusted R-squared: 0.327
## F-statistic: 3.291 on 7 and 26 DF, p-value: 0.01216
Two of the spline terms appear to be significant; the residuals of this model are analysed further below. As an additional check, the same two predictors were also used in a linear model for the proximal thoracic Cobb angle.
lm2paramProx <- lm(cobbAngle_proximal_thoracic ~ bending_diff + rib_hump,
                   data = scoliosis_data)
summary(lm2paramProx)
##
## Call:
## lm(formula = cobbAngle_proximal_thoracic ~ bending_diff + rib_hump,
## data = scoliosis_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2976 -5.0418 0.2205 5.7497 10.2303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.83585 1.94898 -5.560 1.18e-05 ***
## bending_diff 0.20636 0.09292 2.221 0.03648 *
## rib_hump -0.73456 0.21564 -3.406 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.573 on 23 degrees of freedom
## (51 observations deleted due to missingness)
## Multiple R-squared: 0.4885, Adjusted R-squared: 0.444
## F-statistic: 10.98 on 2 and 23 DF, p-value: 0.0004489
To look for remaining non-linear patterns, the residuals of the regression spline model fitted above were also examined graphically.
ggplot(mapping = aes(y = resid(lm_spline), x = fitted(lm_spline))) +
  geom_abline(intercept = 0, slope = 0) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As the relationship between the outcome and the predictors might not be linear, a generalized additive model (GAM) was also fitted with the best predictors so far (bending_diff and rib_hump) and primary_cobbAngle as the outcome, to improve the model's performance.
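The GAM summarised below was presumably fitted with the mgcv package; a minimal sketch under that assumption (the object name gam_cobb is made up):
# hedged sketch: GAM with smooth terms for both predictors
library(mgcv)
gam_cobb <- gam(primary_cobbAngle ~ s(bending_diff) + s(rib_hump),
                data = scoliosis_data)
summary(gam_cobb)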
##
## Family: gaussian
## Link function: identity
##
## Formula:
## primary_cobbAngle ~ s(bending_diff) + s(rib_hump)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.471 4.039 2.345 0.0258 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(bending_diff) 1.000 1.000 8.628 0.00631 **
## s(rib_hump) 1.604 1.966 3.207 0.04249 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 17/19
## R-sq.(adj) = 0.369 Deviance explained = 41.9%
## GCV = 620.42 Scale est. = 554.65 n = 34
The plots of the smooth terms show two different shapes: the effect of the bending difference is essentially linear (edf = 1), while the effect of the rib hump is not. The bending difference was therefore entered as a linear term in the following GAM, and the rib hump was kept as a smooth term (edf ≈ 1.6).
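A minimal sketch of the refined GAM described above; the object name gam_cobb2 is an assumption, and the basis-dimension check at the end of the output resembles what mgcv::gam.check() prints:
# hedged sketch: bending_diff as a linear term, rib_hump kept as a smooth term
gam_cobb2 <- gam(primary_cobbAngle ~ bending_diff + s(rib_hump),
                 data = scoliosis_data)
summary(gam_cobb2)
gam.check(gam_cobb2)  # smoothing-convergence and basis-dimension diagnostics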
##
## Family: gaussian
## Link function: identity
##
## Formula:
## primary_cobbAngle ~ bending_diff + s(rib_hump)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5801 4.0899 1.853 0.07356 .
## bending_diff -0.8845 0.3011 -2.937 0.00626 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(rib_hump) 1.605 1.967 3.211 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.369 Deviance explained = 41.9%
## GCV = 620.43 Scale est. = 554.65 n = 34
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 6.775169e-06 .
## The Hessian was positive definite.
## Model rank = 11 / 11
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(rib_hump) 9.0 1.6 1.18 0.78