Tina Pivk, 2. 2. 2023

Bone Density Data

mydata <- read.table("C:/Users/TinaP/Desktop/EF/IMB/multivariate analysis/MVA_2022_2023/DN-R/bone density/bmd.csv", header=TRUE, sep=",", dec=".")

mydata <- mydata[,c(1,2,3,5,6,9)]

head(mydata)
##      id      age sex weight_kg height_cm    bmd
## 1   469 57.05277   F        64     155.5 0.8793
## 2  8724 75.74122   F        78     162.0 0.7946
## 3  6736 70.77890   M        73     170.5 0.9067
## 4 24180 78.24718   F        60     148.0 0.7112
## 5 17072 54.19188   M        55     161.0 0.7909
## 6  3806 77.17775   M        65     168.0 0.7301

Description:

  • unit: one patient (more specifics on what kind of patients, from where, when are not provided)
  • sample size: 169

Variables:

  • id – patient’s number
  • age – patient’s age
  • sex - patient’s gender
  • weight_kg – weight measured in kg
  • height_cm – height measured in cm
  • bmd – bone mineral density measure in the hip (g/cm2)

Data source:

Teixeira-Pinto, A. (2022, July 23). Machine Learning for Biostatistics: Module 2. Retrieved from https://bookdown.org/tpinto_home/Regression-and-Classification/#datasets-used-in-the-examples

Main Goal of the Analysis

The main goal of the analysis carried out in homework 3 is to see if the bone mineral density (in the hip) of patients is dependent on age, gender, weight and height.

Data Manipulation

colSums(is.na(mydata))
##        id       age       sex weight_kg height_cm       bmd 
##         0         0         0         0         0         0
mydata$sexF <- as.factor(mydata$sex)

sapply(mydata[,c(2,4,5,6)], FUN=max)
##       age weight_kg height_cm       bmd 
##   88.7538   96.0000  177.0000    1.3624
sapply(mydata[,c(2,4,5,6)], FUN=min)
##       age weight_kg height_cm       bmd 
##  35.81406  36.00000 142.00000   0.40760

None of the values is outside a normal range.

Multiple Regression Model Definition

Theoretical Background

According to Alswat (2017) males have higher bone density than females. According to Ruffing, et al. (2006) height and weight are importand determinants of body mineral density as well. Furthermore, the bone mineral density is also supposed to decrease with age (Berger et al., 2008).

Scatterplot Matrix

library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
scatterplotMatrix(mydata[ , c(6,2,4,5)], 
                  smooth = FALSE) 

Based on scatterplots BMD seems to have some positive linear correlation with height and weight and negative linear correlation with age. There does not seem to be a problem with non-linearity. However, height and weight seem to be a bit correlated, which could present a problem with multicolinearity. We will check this later using VIF.

Checking Assumptions

We have already checked for linearity, so in this section we will check the other assumptions.

fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
           data = mydata)
Multicolinierity
vif(fit1)
##       age weight_kg height_cm      sexF 
##  1.075589  1.163407  2.102418  1.921106
mean(vif(fit1))
## [1] 1.56563

VIF values are below 5, height having the highest VIF of about 2 and the mean is not 1 but is relatively close to it. From this we can conclude there are no problems with multicolinearity so, therefore we can procede.

Outliers
mydata$StdResid <- round(rstandard(fit1), 3)  

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

There are units with standardized residuals higher than 3 (none lower than -3), meaning we have some outliers.

head(mydata[order(-mydata$StdResid),], 3)
##       id      age sex weight_kg height_cm    bmd sexF StdResid
## 67  3090 84.78739   M        68     165.0 1.2508    M    3.831
## 49   197 69.72845   M        84     164.5 1.3624    M    3.415
## 110   69 67.92553   F        62     159.0 1.0176    F    2.154

There are 2 outliers: ID 3090 and ID 197. We exclude them below.

mydata <- mydata[c(-67, -49), ]
Units With High Impact
fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
           data = mydata)

mydata$CooksD <- round(cooks.distance(fit1), 3) 

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We see from the histogram that there is a gap between some units and the rest so we will exclude these for being high impact units.

head(mydata[order(-mydata$CooksD),], 6)
##        id      age sex weight_kg height_cm    bmd sexF StdResid CooksD
## 97  24200 72.77239   F        81     149.5 0.5090    F   -2.624  0.070
## 135  4335 54.15273   M        79     161.0 0.5693    M   -2.778  0.053
## 139 23995 69.97770   F        85     148.5 0.6264    F   -2.018  0.052
## 54   6973 88.75380   M        58     153.0 0.8548    M    1.527  0.040
## 129  5348 80.77319   M        79     159.0 0.5676    M   -2.100  0.035
## 140   145 71.17212   F        63     148.0 0.4076    F   -2.416  0.030

There are 3 units with high impact; ID 24200, ID 4335 and ID 23995. We exclude them below.

mydata <- mydata[c(-97, -135, -139), ]
Normal Distribution of Errors
fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
           data = mydata)

mydata$StdResid <- round(rstandard(fit1), 3) 

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.98573, p-value = 0.09177

H0: Errors are normally distributed

H1: Errors are not normally distributed

After removing outliers and units with high impact we cannot claim that errors are not normally distributed, since p-value from Shapiro-Wilk normality test is higher than 5% (p=0.09177), meaning the assumption regarding normal distribution of errors is met.

Homoscedasticity
mydata$StdFittedValues <- scale(fit1$fitted.values) 

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the scatterplot there seems to be a little bit lower variability on the left than on the right. Therefore, let us check homoscedasticity assumption using a formal test. Otherwise, the points are random, again indicating no non-linearity problems.

suppressPackageStartupMessages(library(olsrr))
## Warning: package 'olsrr' was built under R version 4.2.2
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : bmd 
##  Variables: fitted values of bmd 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    3.000226 
##  Prob > Chi2   =    0.0832529

From Breusch Pagan test we cannot reject the null hypothesis stated in the test results, therefore, we cannot claim there is a problem with heteroscedasticity, since the p-value is higher than 5%. (It is below 10% but we do have a relatively large sample.)

Descriptive Statistics of the Variables Used in the Final Model

library(pastecs)
round(stat.desc(mydata[ ,c(2,4,5,6)]), 2)
##                   age weight_kg height_cm    bmd
## nbr.val        164.00    164.00    164.00 164.00
## nbr.null         0.00      0.00      0.00   0.00
## nbr.na           0.00      0.00      0.00   0.00
## min             35.81     36.00    142.00   0.41
## max             88.75     96.00    177.00   1.13
## range           52.94     60.00     35.00   0.73
## sum          10401.44  10585.50  26284.50 127.85
## median          63.36     64.25    160.25   0.78
## mean            63.42     64.55    160.27   0.78
## SE.mean          0.97      0.90      0.62   0.01
## CI.mean.0.95     1.91      1.78      1.23   0.02
## var            153.77    133.81     63.16   0.02
## std.dev         12.40     11.57      7.95   0.16
## coef.var         0.20      0.18      0.05   0.20
sum(mydata[,7] == 'F')
## [1] 81
sum(mydata[,7] == 'M')
## [1] 83

The minimum of age is 35.81 years and the maximum 88.75 years, the minimum of weight 36kg and maximum 96kg, the minimum of height 142cm and the maximum 177cm, the minimum of BMD 0.41g/cm2 and the maximum 1.13g/cm2. (Above about 0.88 is normal.)

Medians and means for all variables are pretty close. The patients are on average 63.42 years old, weigh 64.55kg, are 160.27cm tall and have their BMD 0.78g/cm2. 50% of patients is older than 63.36 years, weighs more than 64.25kg, is taller than 160.25cm, and has BMD higher than 0.78g/cm2. The other 50% is younger, weighs less, is smaller than the numbers stated and has BMD lower than 0.78g/cm2.

The 95% confidence intervals are relatively small for all variables. Provided the sample is random the true mean of age is between 61.51 and 65.33 years, the true weight mean between 62.77 and 66.33kg, the true height mean between 159.04 and 161.50cm, and the true BMD mean between 0.76 and 0.80g/cm2 (p-value = 5%).

The highest variability according to coefficient of variance is for age and BMD, very closely followed by weight, and the smallest variability is observed for height.

Out of 164 units in the sample, 81 are women and 83 men.

The Model Description and Explanation

#from the last estimation of the model nothing changed so we will not re-estimate it here
summary(fit1)
## 
## Call:
## lm(formula = bmd ~ age + weight_kg + height_cm + sexF, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35135 -0.06853 -0.00516  0.08205  0.28286 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4497743  0.2765318   1.626    0.106    
## age         -0.0038527  0.0007912  -4.869 2.68e-06 ***
## weight_kg    0.0065988  0.0008833   7.471 4.99e-12 ***
## height_cm    0.0008650  0.0017148   0.504    0.615    
## sexFM        0.0189404  0.0257956   0.734    0.464    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1205 on 159 degrees of freedom
## Multiple R-squared:  0.4159, Adjusted R-squared:  0.4012 
## F-statistic:  28.3 on 4 and 159 DF,  p-value: < 2.2e-16

With this model 41.59% of variability of BMD is explained by linear effect of all the explanatory variables included in the model; age, weight, height, and sex.

F STATISTICS:

H0: RO2 = 0

H1: RO2 > 0

We reject H0 at p-value < 0.001. Therefor, at least some of the variability is explained with explanatory variables, meaning the model is good.

If we take a look at partial regression coefficients we see that we did not find a significant effect of height or sex on BMD, but we have found it for age and weight: If a patient’s age increases by a year the BMD will on average decrease by 0.0038527g/cm2 provided everything else is left unchanged (p-value < 0.001). If a patient’s weight increases by a kilogram their BMD will on average increase by 0.0065988g/cm2 provided everything else is left unchanged (p-value < 0.001).

sqrt(summary(fit1)$r.squared) 
## [1] 0.6449007

Linear relationship between BMD and all four explanatory variables is semi strong.

Conclusion

By using linear regression we have found a good model in which bone mineral density (in the hip) of patients is the dependent variable and age, sex, height, and weight the independent variables, meaning the explanatory variables explain a part of BMD variability. Furthermore, we have found a significant effect of weight and age on BMD but not for height or sex. We have to keep in mind this can only be generalized on the population of the specific patients.

References

  • Alswat K. A. (2017, May). Gender Disparities in Osteoporosis. J Clin Med Res. 9(5), pp. 382-387.
  • Berger C., Langsetmo L., Joseph L., Hanley D. A., Davison K. S., Josse R., Kreiger N., Tenenhouse A., Goltzman D. (2008, June 17). Change in bone mineral density as a function of age in women and men and association with the use of antiresorptive agents. CMAJ, 178(13)
  • Ruffing, J., Cosman, F., Zion, M. et al. (2006). Determinants of bone mass and bone size in a large cohort of physically active young adult men. Nutr Metab (Lond) 3(14).