Main Goal of the Analysis
The main goal of the analysis carried out in homework 3 is to see if the bone mineral density (in the hip) of patients is dependent on age, gender, weight and height.
mydata <- read.table("C:/Users/TinaP/Desktop/EF/IMB/multivariate analysis/MVA_2022_2023/DN-R/bone density/bmd.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[,c(1,2,3,5,6,9)]
head(mydata)
## id age sex weight_kg height_cm bmd
## 1 469 57.05277 F 64 155.5 0.8793
## 2 8724 75.74122 F 78 162.0 0.7946
## 3 6736 70.77890 M 73 170.5 0.9067
## 4 24180 78.24718 F 60 148.0 0.7112
## 5 17072 54.19188 M 55 161.0 0.7909
## 6 3806 77.17775 M 65 168.0 0.7301
Description:
Variables:
Data source:
Teixeira-Pinto, A. (2022, July 23). Machine Learning for Biostatistics: Module 2. Retrieved from https://bookdown.org/tpinto_home/Regression-and-Classification/#datasets-used-in-the-examples
The main goal of the analysis carried out in homework 3 is to see if the bone mineral density (in the hip) of patients is dependent on age, gender, weight and height.
colSums(is.na(mydata))
## id age sex weight_kg height_cm bmd
## 0 0 0 0 0 0
mydata$sexF <- as.factor(mydata$sex)
sapply(mydata[,c(2,4,5,6)], FUN=max)
## age weight_kg height_cm bmd
## 88.7538 96.0000 177.0000 1.3624
sapply(mydata[,c(2,4,5,6)], FUN=min)
## age weight_kg height_cm bmd
## 35.81406 36.00000 142.00000 0.40760
None of the values is outside a normal range.
According to Alswat (2017) males have higher bone density than females. According to Ruffing, et al. (2006) height and weight are importand determinants of body mineral density as well. Furthermore, the bone mineral density is also supposed to decrease with age (Berger et al., 2008).
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
scatterplotMatrix(mydata[ , c(6,2,4,5)],
smooth = FALSE)
Based on scatterplots BMD seems to have some positive linear correlation with height and weight and negative linear correlation with age. There does not seem to be a problem with non-linearity. However, height and weight seem to be a bit correlated, which could present a problem with multicolinearity. We will check this later using VIF.
We have already checked for linearity, so in this section we will check the other assumptions.
fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
data = mydata)
vif(fit1)
## age weight_kg height_cm sexF
## 1.075589 1.163407 2.102418 1.921106
mean(vif(fit1))
## [1] 1.56563
VIF values are below 5, height having the highest VIF of about 2 and the mean is not 1 but is relatively close to it. From this we can conclude there are no problems with multicolinearity so, therefore we can procede.
mydata$StdResid <- round(rstandard(fit1), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
There are units with standardized residuals higher than 3 (none lower than -3), meaning we have some outliers.
head(mydata[order(-mydata$StdResid),], 3)
## id age sex weight_kg height_cm bmd sexF StdResid
## 67 3090 84.78739 M 68 165.0 1.2508 M 3.831
## 49 197 69.72845 M 84 164.5 1.3624 M 3.415
## 110 69 67.92553 F 62 159.0 1.0176 F 2.154
There are 2 outliers: ID 3090 and ID 197. We exclude them below.
mydata <- mydata[c(-67, -49), ]
fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
data = mydata)
mydata$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We see from the histogram that there is a gap between some units and the rest so we will exclude these for being high impact units.
head(mydata[order(-mydata$CooksD),], 6)
## id age sex weight_kg height_cm bmd sexF StdResid CooksD
## 97 24200 72.77239 F 81 149.5 0.5090 F -2.624 0.070
## 135 4335 54.15273 M 79 161.0 0.5693 M -2.778 0.053
## 139 23995 69.97770 F 85 148.5 0.6264 F -2.018 0.052
## 54 6973 88.75380 M 58 153.0 0.8548 M 1.527 0.040
## 129 5348 80.77319 M 79 159.0 0.5676 M -2.100 0.035
## 140 145 71.17212 F 63 148.0 0.4076 F -2.416 0.030
There are 3 units with high impact; ID 24200, ID 4335 and ID 23995. We exclude them below.
mydata <- mydata[c(-97, -135, -139), ]
fit1 <- lm(bmd ~ age + weight_kg + height_cm + sexF,
data = mydata)
mydata$StdResid <- round(rstandard(fit1), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.98573, p-value = 0.09177
H0: Errors are normally distributed
H1: Errors are not normally distributed
After removing outliers and units with high impact we cannot claim that errors are not normally distributed, since p-value from Shapiro-Wilk normality test is higher than 5% (p=0.09177), meaning the assumption regarding normal distribution of errors is met.
mydata$StdFittedValues <- scale(fit1$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot there seems to be a little bit lower variability on the left than on the right. Therefore, let us check homoscedasticity assumption using a formal test. Otherwise, the points are random, again indicating no non-linearity problems.
suppressPackageStartupMessages(library(olsrr))
## Warning: package 'olsrr' was built under R version 4.2.2
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : bmd
## Variables: fitted values of bmd
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 3.000226
## Prob > Chi2 = 0.0832529
From Breusch Pagan test we cannot reject the null hypothesis stated in the test results, therefore, we cannot claim there is a problem with heteroscedasticity, since the p-value is higher than 5%. (It is below 10% but we do have a relatively large sample.)
library(pastecs)
round(stat.desc(mydata[ ,c(2,4,5,6)]), 2)
## age weight_kg height_cm bmd
## nbr.val 164.00 164.00 164.00 164.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 35.81 36.00 142.00 0.41
## max 88.75 96.00 177.00 1.13
## range 52.94 60.00 35.00 0.73
## sum 10401.44 10585.50 26284.50 127.85
## median 63.36 64.25 160.25 0.78
## mean 63.42 64.55 160.27 0.78
## SE.mean 0.97 0.90 0.62 0.01
## CI.mean.0.95 1.91 1.78 1.23 0.02
## var 153.77 133.81 63.16 0.02
## std.dev 12.40 11.57 7.95 0.16
## coef.var 0.20 0.18 0.05 0.20
sum(mydata[,7] == 'F')
## [1] 81
sum(mydata[,7] == 'M')
## [1] 83
The minimum of age is 35.81 years and the maximum 88.75 years, the minimum of weight 36kg and maximum 96kg, the minimum of height 142cm and the maximum 177cm, the minimum of BMD 0.41g/cm2 and the maximum 1.13g/cm2. (Above about 0.88 is normal.)
Medians and means for all variables are pretty close. The patients are on average 63.42 years old, weigh 64.55kg, are 160.27cm tall and have their BMD 0.78g/cm2. 50% of patients is older than 63.36 years, weighs more than 64.25kg, is taller than 160.25cm, and has BMD higher than 0.78g/cm2. The other 50% is younger, weighs less, is smaller than the numbers stated and has BMD lower than 0.78g/cm2.
The 95% confidence intervals are relatively small for all variables. Provided the sample is random the true mean of age is between 61.51 and 65.33 years, the true weight mean between 62.77 and 66.33kg, the true height mean between 159.04 and 161.50cm, and the true BMD mean between 0.76 and 0.80g/cm2 (p-value = 5%).
The highest variability according to coefficient of variance is for age and BMD, very closely followed by weight, and the smallest variability is observed for height.
Out of 164 units in the sample, 81 are women and 83 men.
#from the last estimation of the model nothing changed so we will not re-estimate it here
summary(fit1)
##
## Call:
## lm(formula = bmd ~ age + weight_kg + height_cm + sexF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35135 -0.06853 -0.00516 0.08205 0.28286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4497743 0.2765318 1.626 0.106
## age -0.0038527 0.0007912 -4.869 2.68e-06 ***
## weight_kg 0.0065988 0.0008833 7.471 4.99e-12 ***
## height_cm 0.0008650 0.0017148 0.504 0.615
## sexFM 0.0189404 0.0257956 0.734 0.464
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1205 on 159 degrees of freedom
## Multiple R-squared: 0.4159, Adjusted R-squared: 0.4012
## F-statistic: 28.3 on 4 and 159 DF, p-value: < 2.2e-16
With this model 41.59% of variability of BMD is explained by linear effect of all the explanatory variables included in the model; age, weight, height, and sex.
F STATISTICS:
H0: RO2 = 0
H1: RO2 > 0
We reject H0 at p-value < 0.001. Therefor, at least some of the variability is explained with explanatory variables, meaning the model is good.
If we take a look at partial regression coefficients we see that we did not find a significant effect of height or sex on BMD, but we have found it for age and weight: If a patient’s age increases by a year the BMD will on average decrease by 0.0038527g/cm2 provided everything else is left unchanged (p-value < 0.001). If a patient’s weight increases by a kilogram their BMD will on average increase by 0.0065988g/cm2 provided everything else is left unchanged (p-value < 0.001).
sqrt(summary(fit1)$r.squared)
## [1] 0.6449007
Linear relationship between BMD and all four explanatory variables is semi strong.
By using linear regression we have found a good model in which bone mineral density (in the hip) of patients is the dependent variable and age, sex, height, and weight the independent variables, meaning the explanatory variables explain a part of BMD variability. Furthermore, we have found a significant effect of weight and age on BMD but not for height or sex. We have to keep in mind this can only be generalized on the population of the specific patients.