Mathematical Modelling Assignment - Modelling Kidney Diseases Among Youths in Ghana

Abdul-Karim Kadiri - Student ID: 223009555

2023-05-19

knitr::include_graphics("kidneypic.jpg")

Introduction:

Recently, the health landscape in Ghana has seen a disturbing trend: an increase in the incidence of kidney failure among the country’s youth. My colleague, a bio-statistician with the Ghana Health Service (GHS), and I have taken particular interest in this issue. We looked at the variables that could account for this prevalence, focusing in particular on age, exercise habits, and alcohol intake among patients presenting with these conditions at various hospitals across the country.

Problem Statement:

Kidney failure, medically referred to as End-Stage Kidney Disease (ESKD) or End-Stage Renal Disease (ESRD), is a critical condition that manifests when the kidneys lose their filtering capacity (American Kidney Fund Medical Advisory Committee, 2022). This inability to adequately eliminate waste from the body produces severe symptoms, including but not limited to itchy skin or rashes, muscle cramps, nausea, swelling in the extremities, changes in urination patterns, frothy urine, breathlessness, and disrupted sleep (American Kidney Fund Medical Advisory Committee, 2022).

What causes kidney failure?

Rationale:

This investigation stems from the concerning visual evidence that often emerges on social media platforms and news outlets. Images of young adults between the ages of 25 and 45, incapacitated by kidney disease, are far too common. The recurring appeals for funds to cover expenses such as dialysis treatment further underscore the severity of this health crisis.

Explanation of the variables

Explanation of the variables (Contd)

Testing whether the data meets the four main assumptions for multiple regression

# Distribution of the dependent variable
hist(kidney.data$kidney_failure, 
     main="Histogram of Kidney Failure", 
     xlab="Kidney Failure")

Multicollinearity check (exercise and alcohol consumption)

# We test the relationship between the independent variables to be sure
# that they are not highly correlated (i.e. that there is no multicollinearity)

cor.test(kidney.data$exercise, kidney.data$alcohol_intake)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$exercise and kidney.data$alcohol_intake
## t = 0.33714, df = 496, p-value = 0.7362
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07282732  0.10286603
## sample estimates:
##        cor 
## 0.01513618

Checking for multicollinearity

The correlation between the proportion of youths who exercise and the proportion who take alcohol is very small (r = 0.015, p = 0.736), so there is no evidence of multicollinearity and both predictors can be included in our model.
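Pairwise correlations only examine two predictors at a time; variance inflation factors (VIFs) check for multicollinearity across all predictors simultaneously. A minimal sketch for the numeric predictors, assuming the car package is installed (VIFs close to 1 indicate no problematic collinearity):

library(car)

# VIF for each numeric predictor in a model containing all of them;
# values well below 5 suggest multicollinearity is not a concern
vif(lm(kidney_failure ~ exercise + alcohol_intake + age, data = kidney.data))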

Multicollinearity check - exercise and age

# We test the relationship between the independent variables to be sure
# that they are not highly correlated (i.e. that there is no multicollinearity)

cor.test(kidney.data$exercise, kidney.data$age)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$exercise and kidney.data$age
## t = 0.93097, df = 496, p-value = 0.3523
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04627127  0.12915786
## sample estimates:
##        cor 
## 0.04176519

Checking for multicollinearity - exercise and age

The correlation between the proportion of youths who exercise and age is also small (r = 0.042, p = 0.352), so both predictors can be included in our model.

Testing for linearity.

There are different types of models: linear, exponential, logarithmic, and polynomial. To determine which model to use, we plot the data, specifically the dependent variable against each of the independent variables.
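As a quick complement to the individual plots below, a scatter-plot matrix shows all pairwise relationships at once; a minimal sketch in base R, using the column names that appear elsewhere in this document:

# Scatter-plot matrix of the outcome and the numeric predictors
pairs(kidney.data[, c("kidney_failure", "exercise", "alcohol_intake", "age")],
      main = "Pairwise Relationships")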

Linearity test - Exercise and Kidney failure

plot(kidney_failure ~ exercise, data = kidney.data, 
     main="Kidney Failure vs. Exercise", 
     xlab="Exercise", 
     ylab="Kidney Failure")

# The relationship appears to be approximately linear (and negative)

Linearity test: Correlation - Exercise and Kidney failure

cor.test(kidney.data$kidney_failure, kidney.data$exercise)
## 
##  Pearson's product-moment correlation
## 
## data:  kidney.data$kidney_failure and kidney.data$exercise
## t = -25.435, df = 496, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7881167 -0.7115190
## sample estimates:
##        cor 
## -0.7523496

From the plot, we see that the relationship between kidney failure incidence and exercise is approximately linear and strongly negative, with a correlation coefficient of -0.752. This means that higher levels of exercise are associated with a lower incidence of kidney failure.
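To make the negative linear trend easier to see, the least-squares line can be overlaid on the scatter plot; a minimal sketch:

plot(kidney_failure ~ exercise, data = kidney.data, 
     main = "Kidney Failure vs. Exercise (with fitted line)", 
     xlab = "Exercise", 
     ylab = "Kidney Failure")

# Overlay the simple least-squares fit of kidney failure on exercise
abline(lm(kidney_failure ~ exercise, data = kidney.data), col = "red")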

Linearity test - Alcohol intake and kidney failure

plot(kidney_failure ~ alcohol_intake, data = kidney.data, 
     main="Kidney Failure vs. Alcohol Intake", 
     xlab="Alcohol Intake", 
     ylab="Kidney Failure")

# The relationship between alcohol intake and kidney failure is less clear,
# but appears roughly linear

Linearity test: Correlation - Alcohol intake and kidney failure

cor.test(kidney.data$kidney_failure, kidney.data$alcohol_intake)

Although the relationship between alcohol intake and kidney failure is less pronounced than the relationship with exercise, the scatter plot suggests a weak but roughly linear association, so alcohol intake is retained as a predictor in the model.

Given that the data appear to be linear, I proceed to fit a multiple linear regression model. As a comparison, I will also fit a model with a log-transformed response (a logarithmic model) and compare the two on held-out test data using the mean squared error (MSE).

Dividing the data into training and testing

set.seed(123)

# As indicated last week, at least 75 percent of the data can be used for training the model
train_size <- floor(0.75 * nrow(kidney.data))

# Generating the sample for training
train_indices <- sample(seq_len(nrow(kidney.data)), size = train_size)

# Creating the training and test sets
train_data <- kidney.data[train_indices, ]
test_data <- kidney.data[-train_indices, ]

Multiple regression:

kidney_failure_lm_train <- lm(kidney_failure ~ exercise + alcohol_intake + age + sex, data = train_data)

summary(kidney_failure_lm_train)
## 
## Call:
## lm(formula = kidney_failure ~ exercise + alcohol_intake + age + 
##     sex, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3934  -0.9507   0.5415   1.4587   4.4154 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    11.292019   0.760749  14.843  < 2e-16 ***
## exercise       -0.122956   0.006888 -17.851  < 2e-16 ***
## alcohol_intake  0.075259   0.014851   5.068 6.39e-07 ***
## age             0.028917   0.019090   1.515    0.131    
## sexMale         0.389293   0.302616   1.286    0.199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 368 degrees of freedom
## Multiple R-squared:  0.5869, Adjusted R-squared:  0.5824 
## F-statistic: 130.7 on 4 and 368 DF,  p-value: < 2.2e-16

Multiple regression: Interpreting the result

Multiple regression: Interpreting the result (Contd)
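As an aid to interpretation, 95 percent confidence intervals for the coefficients can be extracted from the fitted model; a minimal sketch (intervals that exclude zero correspond to the significant terms in the summary above):

# 95% confidence intervals for the regression coefficients
confint(kidney_failure_lm_train, level = 0.95)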

Predicting using the test data - multiple regression

predictions <- predict(kidney_failure_lm_train, newdata = test_data)

Computing the MSE - multiple regression

mse_linear <- mean((test_data$kidney_failure - predictions)^2)
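Because the MSE is in squared units, its square root (the RMSE) is often easier to interpret, as it is expressed in the same units as kidney_failure; a minimal sketch:

# RMSE: average prediction error in the original units of kidney_failure
rmse_linear <- sqrt(mse_linear)
rmse_linear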

Using a logarithmic model (log-transformed response):

kidney_failure_log_lm <- lm(log(kidney_failure) ~ exercise + alcohol_intake + age + sex, data = train_data)

summary(kidney_failure_log_lm)
## 
## Call:
## lm(formula = log(kidney_failure) ~ exercise + alcohol_intake + 
##     age + sex, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72325 -0.11313  0.08284  0.23855  0.43179 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.620660   0.113527  23.084  < 2e-16 ***
## exercise       -0.020157   0.001028 -19.610  < 2e-16 ***
## alcohol_intake  0.016686   0.002216   7.529 3.98e-13 ***
## age             0.002695   0.002849   0.946    0.345    
## sexMale        -0.183891   0.045159  -4.072 5.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3436 on 368 degrees of freedom
## Multiple R-squared:  0.5674, Adjusted R-squared:  0.5627 
## F-statistic: 120.7 on 4 and 368 DF,  p-value: < 2.2e-16

Predicting using the test data - logarithmic model

predictions_log <- predict(kidney_failure_log_lm, newdata = test_data)

Computing the MSE - logarithmic model

mse_log <- mean((test_data$kidney_failure - predictions_log)^2)
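Note that predict() on the log model returns fitted values of log(kidney_failure), so the MSE above compares log-scale predictions with raw outcomes and is therefore inflated. For a like-for-like comparison with the linear model, the predictions can first be back-transformed to the original scale; a minimal sketch (a simple exp() back-transformation, ignoring retransformation bias):

# Back-transform log-scale predictions before computing the error
# on the original scale of kidney_failure
predictions_log_original <- exp(predictions_log)
mse_log_original <- mean((test_data$kidney_failure - predictions_log_original)^2)
mse_log_original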

Computing the MSEs - Linear and Logarithmic model

print(paste("Mean Squared Error for the multiple regression is: ", mse_linear))
## [1] "Mean Squared Error for the multiple regression is:  5.4881304361219"
print(paste("Mean Squared Error for the logarithmic model is: ", mse_log))
## [1] "Mean Squared Error for the logarithmic model is:  54.5914662756264"

Implication of kidney failure among Ghanaian youths

Implication of kidney failure among Ghanaian youths (Contd)

Conclusion

Thank you.