Mathematical Modelling Assignment - Modelling Kidney Diseases Among Youths in Ghana

Abdul-Karim Kadiri - Student ID: 223009555

2023-05-19

knitr::include_graphics("kidneypic.jpg")

Introduction:

Recently, the health landscape in Ghana has seen a disturbing trend: an increase in the incidence of kidney failure among the country’s youth. My colleague, a bio-statistician with the Ghana Health Service (GHS), and I have taken particular interest in this issue. We looked at the variables that could account for this prevalence, focusing in particular on age, exercise habits, and alcohol intake among patients presenting with these conditions at various hospitals across the country.

Problem Statement:

Kidney failure, medically referred to as End-Stage Kidney Disease (ESKD) or End-Stage Renal Disease (ESRD), is a critical condition that manifests when the kidneys lose their filtering capacity (American Kidney Fund Medical Advisory Committee, 2022). This inability to adequately eliminate waste from the body produces severe symptoms, including but not limited to itchy skin or rashes, muscle cramps, nausea, swelling in the extremities, changes in urination patterns, frothy urine, breathlessness, and disrupted sleep (American Kidney Fund Medical Advisory Committee, 2022).

What causes kidney failure?

Rationale:

This investigation stems from the concerning visual evidence that often emerges on social media platforms and news outlets. Images of young adults between the ages of 25 and 45, incapacitated by kidney disease, are far too common. The recurring appeals for funds to cover expenses such as dialysis treatment further underscore the severity of this health crisis.

Explanation of the variables

Explanation of the variables (Contd)

Testing whether the data meets the four main assumptions for multiple regression

# Distribution of the dependent variable
hist(kidney.data$kidney_failure, 
     main="Histogram of Kidney Failure", 
     xlab="Kidney Failure")

Multicollinearity check (exercise and alcohol consumption)

# We test the relationship between the independent variables to be sure
# that they are not highly correlated (i.e. that there is no multicollinearity)

cor.test(kidney.data$exercise, kidney.data$alcohol_intake)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$exercise and kidney.data$alcohol_intake
## t = 0.33714, df = 496, p-value = 0.7362
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07282732  0.10286603
## sample estimates:
##        cor 
## 0.01513618

Checking for multicollinearity

The correlation between the proportion of youths who exercise and the proportion who take alcohol is very small (r = 0.015, p = 0.736), so there is no evidence of multicollinearity and both predictors can be included in our model.
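Pairwise correlations only examine two predictors at a time; variance inflation factors (VIFs) check for multicollinearity across all predictors simultaneously. A minimal sketch for the numeric predictors, assuming the car package is installed (VIFs close to 1 indicate no problematic collinearity):

library(car)

# VIF for each numeric predictor in a model containing all of them;
# values well below 5 suggest multicollinearity is not a concern
vif(lm(kidney_failure ~ exercise + alcohol_intake + age, data = kidney.data))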

Multicollinearity check - exercise and age

# We test the relationship between the independent variables to be sure
# that they are not highly correlated (i.e. that there is no multicollinearity)

cor.test(kidney.data$exercise, kidney.data$age)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$exercise and kidney.data$age
## t = 0.93097, df = 496, p-value = 0.3523
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04627127  0.12915786
## sample estimates:
##        cor 
## 0.04176519

Checking for multicollinearity - exercise and age

The correlation between the proportion of youths who exercise and age is also small (r = 0.042, p = 0.352), so both predictors can be included in our model.

Testing for linearity.

There are different types of models: linear, exponential, logarithmic, and polynomial. To determine which model to use, we plot the data, specifically the dependent variable against each of the independent variables.
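As a quick complement to the individual plots below, a scatter-plot matrix shows all pairwise relationships at once; a minimal sketch in base R, using the column names that appear elsewhere in this document:

# Scatter-plot matrix of the outcome and the numeric predictors
pairs(kidney.data[, c("kidney_failure", "exercise", "alcohol_intake", "age")],
      main = "Pairwise Relationships")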

Linearity test - Exercise and Kidney failure

plot(kidney_failure ~ exercise, data = kidney.data, 
     main="Kidney Failure vs. Exercise", 
     xlab="Exercise", 
     ylab="Kidney Failure")

# The relationship appears to be approximately linear (and negative)

Linearity test: Correlation - Exercise and Kidney failure

cor.test(kidney.data$kidney_failure, kidney.data$exercise)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$kidney_failure and kidney.data$exercise
## t = -25.435, df = 496, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7881167 -0.7115190
## sample estimates:
##        cor 
## -0.7523496

From the plot, we see that the relationship between kidney failure incidence and exercise is approximately linear and strongly negative, with a correlation coefficient of -0.752. This means that higher levels of exercise are associated with a lower incidence of kidney failure.
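To make the negative linear trend easier to see, the least-squares line can be overlaid on the scatter plot; a minimal sketch:

plot(kidney_failure ~ exercise, data = kidney.data, 
     main = "Kidney Failure vs. Exercise (with fitted line)", 
     xlab = "Exercise", 
     ylab = "Kidney Failure")

# Overlay the simple least-squares fit of kidney failure on exercise
abline(lm(kidney_failure ~ exercise, data = kidney.data), col = "red")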

Linearity test - Alcohol intake and kidney failure

plot(kidney_failure ~ alcohol_intake, data = kidney.data, 
     main="Kidney Failure vs. Alcohol Intake", 
     xlab="Alcohol Intake", 
     ylab="Kidney Failure")

# The relationship between alcohol intake and kidney failure is less clear,
# but appears roughly linear

Linearity test: Correlation - Alcohol intake and kidney failure

cor.test(kidney.data$kidney_failure, kidney.data$alcohol_intake)

Although the relationship between alcohol intake and kidney failure is less pronounced than the relationship with exercise, the scatter plot suggests a weak but roughly linear association, so alcohol intake is retained as a predictor in the model.

Given that the data appear to be linear, I proceed to fit a multiple linear regression model. As a comparison, I will also fit a model with a log-transformed response (a logarithmic model) and compare the two on held-out test data using the mean squared error (MSE).

Dividing the data into training and testing

set.seed(123)

# As indicated last week, at least 75 percent of the data can be used for training the model
train_size <- floor(0.75 * nrow(kidney.data))

# Generating the sample for training
train_indices <- sample(seq_len(nrow(kidney.data)), size = train_size)

# Creating the training and test sets
train_data <- kidney.data[train_indices, ]
test_data <- kidney.data[-train_indices, ]

Multiple regression:

kidney_failure_lm_train <- lm(kidney_failure ~ exercise + alcohol_intake + age + sex, data = train_data)

summary(kidney_failure_lm_train)
## 
## Call:
## lm(formula = kidney_failure ~ exercise + alcohol_intake + age + 
##     sex, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3934  -0.9507   0.5415   1.4587   4.4154 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    11.292019   0.760749  14.843  < 2e-16 ***
## exercise       -0.122956   0.006888 -17.851  < 2e-16 ***
## alcohol_intake  0.075259   0.014851   5.068 6.39e-07 ***
## age             0.028917   0.019090   1.515    0.131    
## sexMale         0.389293   0.302616   1.286    0.199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 368 degrees of freedom
## Multiple R-squared:  0.5869, Adjusted R-squared:  0.5824 
## F-statistic: 130.7 on 4 and 368 DF,  p-value: < 2.2e-16

Multiple regression: Interpreting the result

Multiple regression: Interpreting the result (Contd)
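As an aid to interpretation, 95 percent confidence intervals for the coefficients can be extracted from the fitted model; a minimal sketch (intervals that exclude zero correspond to the significant terms in the summary above):

# 95% confidence intervals for the regression coefficients
confint(kidney_failure_lm_train, level = 0.95)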

Predicting using the test data - multiple regression

predictions <- predict(kidney_failure_lm_train, newdata = test_data)

Computing the MSE - multiple regression

mse_linear <- mean((test_data$kidney_failure - predictions)^2)
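Because the MSE is in squared units, its square root (the RMSE) is often easier to interpret, as it is expressed in the same units as kidney_failure; a minimal sketch:

# RMSE: average prediction error in the original units of kidney_failure
rmse_linear <- sqrt(mse_linear)
rmse_linear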

Using a logarithmic model (log-transformed response):

kidney_failure_log_lm <- lm(log(kidney_failure) ~ exercise + alcohol_intake + age + sex, data = train_data)

summary(kidney_failure_log_lm)
## 
## Call:
## lm(formula = log(kidney_failure) ~ exercise + alcohol_intake + 
##     age + sex, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72325 -0.11313  0.08284  0.23855  0.43179 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.620660   0.113527  23.084  < 2e-16 ***
## exercise       -0.020157   0.001028 -19.610  < 2e-16 ***
## alcohol_intake  0.016686   0.002216   7.529 3.98e-13 ***
## age             0.002695   0.002849   0.946    0.345    
## sexMale        -0.183891   0.045159  -4.072 5.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3436 on 368 degrees of freedom
## Multiple R-squared:  0.5674, Adjusted R-squared:  0.5627 
## F-statistic: 120.7 on 4 and 368 DF,  p-value: < 2.2e-16

Predicting using the test data - logarithmic model

predictions_log <- predict(kidney_failure_log_lm, newdata = test_data)

Computing the MSE - logarithmic model

mse_log <- mean((test_data$kidney_failure - predictions_log)^2)
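Note that predict() on the log model returns fitted values of log(kidney_failure), so the MSE above compares log-scale predictions with raw outcomes and is therefore inflated. For a like-for-like comparison with the linear model, the predictions can first be back-transformed to the original scale; a minimal sketch (a simple exp() back-transformation, ignoring retransformation bias):

# Back-transform log-scale predictions before computing the error
# on the original scale of kidney_failure
predictions_log_original <- exp(predictions_log)
mse_log_original <- mean((test_data$kidney_failure - predictions_log_original)^2)
mse_log_original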

Computing the MSEs - Linear and Logarithmic model

print(paste("Mean Squared Error for the multiple regression is: ", mse_linear))
## [1] "Mean Squared Error for the multiple regression is:  5.4881304361219"
print(paste("Mean Squared Error for the logarithmic model is: ", mse_log))
## [1] "Mean Squared Error for the logarithmic model is:  54.5914662756264"

Implication of kidney failure among Ghanaian youths

Implication of kidney failure among Ghanaian youths (Contd)

Conclusion

Thank you.