Sajjad Abolhassani Larki (s3438733)
Aswathy Murali Ottur (s3782618)
2023-05-28
Nonalcoholic fatty liver disease (NAFLD), which can progress to severe fibrosis and cirrhosis, affects 10%–30% of the general U.S. population.
NAFLD is essentially the presence of excess fat in the liver and is closely linked to the ongoing obesity epidemic.
In most cases, early-stage NAFLD is harmless, but if it worsens, it can result in severe liver damage, including cirrhosis.
Given the data provided and the background of the condition, fat deposition in the liver appears to be the main driver of NAFLD.
Three tables related to this study are provided on the website, including the clinical tests of the population. Only the main data set, which contains the key variables, was used in this investigation.
Obesity is considered a high risk factor for NAFLD. A BMI of 30 or higher falls within the range of obesity.
Obesity leads to an accumulation of ectopic fat, including visceral obesity and fatty liver.
The purpose of this investigation is to check whether BMI is associated with the status and gender of a patient diagnosed with NAFLD.
We compare variables in the table through hypothesis testing of BMI against status and gender.
The dataset was found on the Kaggle open data source website.
ID (Numeric) - Subject identifier
Age (Numeric) - Age of participant
Male (Binary) - 1 = male, 0 = female
weight (Numeric) - Weight in kg
height (Numeric) - Height in cm
BMI (Numeric) - Body Mass Index
Case.id (Numeric) - The id of the NAFLD case to whom this subject is matched
futime (Numeric) - Time to death or last follow-up
status (Binary) - 0 = alive at last follow-up, 1 = dead

Overview of the data structure after importing.
## X id age male weight height bmi case.id futime status
## 1 3631 1 57 0 60.0 163 22.69094 10630 6261 0
## 2 8458 2 67 0 70.4 168 24.88403 14817 624 0
## 3 6298 3 53 1 105.8 186 30.45354 3 1783 0
## 4 15398 4 56 1 109.3 170 37.83010 6628 3143 0
## 5 13261 5 68 1 NA NA NA 1871 1836 1
## 6 14423 6 39 0 63.9 155 26.61559 7144 1581 0
Overview of the variables and their data types. Knowing the types helps in choosing appropriate analyses for the data.
## 'data.frame': 17549 obs. of 10 variables:
## $ X : int 3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 57 67 53 56 68 39 49 30 47 79 ...
## $ male : int 0 0 1 1 1 0 0 1 1 0 ...
## $ weight : num 60 70.4 105.8 109.3 NA ...
## $ height : int 163 168 186 170 NA 155 161 180 188 155 ...
## $ bmi : num 22.7 24.9 30.5 37.8 NA ...
## $ case.id: int 10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
## $ futime : int 6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
## $ status : int 0 0 0 0 1 0 0 0 0 1 ...
In this stage the data is filtered and manipulated to suit further analysis.
First, the variable names are changed for better understanding.
The binary variables are also converted to factors, so that the renamed column gender (previously male) shows the values F and M instead of 0 and 1.
The values of the status variable are likewise replaced with Alive and Dead so that the factor levels are self-explanatory.
library(dplyr)  # provides %>% and rename()
# Rename columns to more descriptive names
Liver_dis <- Liver_dis %>% rename(gender = male)
Liver_dis <- Liver_dis %>% rename(follow_up_time = futime)
# Convert the binary indicators to labelled factors
Liver_dis$gender <- factor(Liver_dis$gender, levels = c(0,1), labels = c("F","M"))
Liver_dis$status <- factor(Liver_dis$status, levels = c(0,1), labels = c("Alive","Dead"))
# Store height as numeric, like weight and bmi
Liver_dis$height <- as.numeric(Liver_dis$height)
## 'data.frame': 17549 obs. of 10 variables:
## $ X : int 3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 57 67 53 56 68 39 49 30 47 79 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 2 2 2 1 1 2 2 1 ...
## $ weight : num 60 70.4 105.8 109.3 NA ...
## $ height : num 163 168 186 170 NA 155 161 180 188 155 ...
## $ bmi : num 22.7 24.9 30.5 37.8 NA ...
## $ case.id : int 10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
## $ follow_up_time: int 6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
## $ status : Factor w/ 2 levels "Alive","Dead": 1 1 1 1 2 1 1 1 1 2 ...
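The next step was to eliminate any missing (NA) values in the data set. The first output below counts the remaining missing values per column, and the second summarises the cleaned data.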
## X id age gender weight
## 0 0 0 0 0
## height bmi case.id follow_up_time status
## 0 0 0 0 0
## X id age gender weight
## Min. : 6 Min. : 1 Min. :18.00 F:6993 Min. : 33.40
## 1st Qu.: 4563 1st Qu.: 4404 1st Qu.:44.00 M:5569 1st Qu.: 70.00
## Median : 8944 Median : 8800 Median :53.00 Median : 84.00
## Mean : 8802 Mean : 8782 Mean :53.55 Mean : 86.41
## 3rd Qu.:12793 3rd Qu.:13152 3rd Qu.:64.00 3rd Qu.: 99.30
## Max. :17565 Max. :17566 Max. :98.00 Max. :181.70
## height bmi case.id follow_up_time status
## Min. :123.0 Min. : 9.207 Min. : 3 Min. : 7 Alive:11550
## 1st Qu.:162.0 1st Qu.:25.127 1st Qu.: 4697 1st Qu.:1143 Dead : 1012
## Median :169.0 Median :28.870 Median : 8783 Median :2146
## Mean :169.3 Mean :30.069 Mean : 8868 Mean :2402
## 3rd Qu.:177.0 3rd Qu.:33.703 3rd Qu.:13269 3rd Qu.:3341
## Max. :207.0 Max. :84.396 Max. :17563 Max. :7145
This provides us with a clean data set to work on. The variable names, data types and missing values have been taken care of.
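The report does not show the code used to remove the missing values. A minimal sketch of one common approach, assuming rows with any NA are dropped with na.omit(), is given here. The data frame toy and its values are hypothetical stand-ins, not the real data.

```r
# Hypothetical miniature of the data set; the values are illustrative only.
toy <- data.frame(
  id     = 1:5,
  weight = c(60.0, 70.4, NA, 109.3, 63.9),
  height = c(163, 168, NA, 170, 155),
  bmi    = c(22.7, 24.9, NA, 37.8, 26.6)
)

# Count missing values per column before cleaning.
na_before <- colSums(is.na(toy))

# Drop every row that contains at least one missing value.
toy_clean <- na.omit(toy)

# Count again: every column should now report zero.
na_after <- colSums(is.na(toy_clean))

print(na_before)
print(na_after)
print(nrow(toy_clean))
```

Dropping whole rows is the simplest choice, but it also discards the non-missing values in those rows, which is worth keeping in mind when many rows are affected.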
As observed in the graph above, there are evident outliers, but the decision was made to keep them, as they may have an important impact on the final hypothesis testing.
The histograms show the distribution of each variable, so that its shape can be compared with a normal distribution.
par(mfrow=c(3,1))
par(mar=c(4,4,4,4))
#BMI
hist(Liver_dis_clean$bmi,
     main = "BMI",
     xlab = "BMI of Patients", col = "orange")
#Weight
hist(Liver_dis_clean$weight,
     main = "Weight",
     xlab = "Weight of Patients", col = "orange")
par(mfrow=c(3,1))
par(mar=c(4,4,4,4))
#height
hist(Liver_dis_clean$height,
     main = "height",
     xlab = "height of Patients", col = "orange")
#follow_up_time
hist(Liver_dis_clean$follow_up_time,
     main = "follow_up_time",
     xlab = "follow_up_time of Patients", col = "orange")
This part demonstrates random sampling and the construction of a confidence interval for a single variable, to check whether a sample can represent the whole population at a chosen confidence level.
Liver_dis_clean is first copied to a new variable, liver.
The code draws a random sample of 50 BMI values and checks whether the population mean of BMI (mu) lies within the sample's 95% confidence interval. This indicates whether a sample of the chosen size is a good representative of the population.
BMI_randomSample <- liver$bmi %>% sample(size = 50, replace = FALSE)
cof_int_BMI <- BMI_randomSample %>% t.test(conf.level = 0.95)
cof_int_BMI$conf.int[1]
## [1] 27.24586
## [1] 30.56695
Popul_mean_BMI <- mean(liver$bmi)
hypoMean <- cof_int_BMI$conf.int[1] <= Popul_mean_BMI & Popul_mean_BMI <= cof_int_BMI$conf.int[2]
hypoMean
## [1] TRUE
The first and second numbers are the lower and upper bounds of the confidence interval.
TRUE means that the population mean lies within the 95% confidence interval of the sample. FALSE means that it lies outside the sample's confidence interval.
Repeating the code many times gives the proportion of TRUE results, which should be close to 95%.
In hypothesis-testing terms, TRUE corresponds to failing to reject the null hypothesis that the sample comes from a population with mean mu, and FALSE corresponds to rejecting it.
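The repetition described above can be automated. The following is a minimal sketch, assuming a simulated population in place of liver$bmi. The names population_bmi, covers_mean and n_repeats, and the chosen mean and standard deviation, are hypothetical and not taken from the report.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for liver$bmi; this is not the real data.
population_bmi <- rnorm(10000, mean = 30, sd = 6)
population_mean <- mean(population_bmi)

# Draw one sample, build a confidence interval from it, and report
# whether that interval covers the population mean.
covers_mean <- function(values, mu, n = 50, level = 0.95) {
  drawn <- sample(values, size = n, replace = FALSE)
  ci <- t.test(drawn, conf.level = level)$conf.int
  ci[1] <= mu && mu <= ci[2]
}

# Repeat the experiment many times and take the share of TRUE results.
n_repeats <- 1000
hits <- replicate(n_repeats, covers_mean(population_bmi, population_mean))
coverage <- mean(hits)
print(coverage)
```

If the interval procedure behaves as intended, the printed proportion should be near 0.95, although it will vary with the seed and the number of repetitions.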
Next, the male and female populations are compared through an independent two-sample t-test: "Is there any difference between the mean BMI of males and females?"
H0: μ1 - μ2 = 0 ==> There is no difference between the male mean and the female mean.
HA: μ1 - μ2 ≠ 0 ==> There is a difference between the male mean and the female mean.
Here μ1 and μ2 refer to the mean BMI of the two genders.
The BMI values are separated for males and females; the group sizes are:
## [1] 5569
## [1] 6993
The two samples differ in size, so we do not assume equal variances and use the Welch two-sample t-test in this case.
ttest_resultGenderBMI <- t.test(Male_BMI,Female_BMI, alternative = "two.sided", conf.level = 0.95,
var.equal = FALSE)
ttest_resultGenderBMI
##
## Welch Two Sample t-test
##
## data: Male_BMI and Female_BMI
## t = 2.7719, df = 12551, p-value = 0.00558
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1004225 0.5853809
## sample estimates:
## mean of x mean of y
## 30.26008 29.91718
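The p-value of 0.00558 is below 0.05, which indicates that the null hypothesis is rejected in favor of the alternative hypothesis. As shown, the mean of x (male, 30.26) differs from the mean of y (female, 29.92), although the estimated difference is small: the 95% confidence interval runs from about 0.10 to 0.59 BMI units.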
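To make the Welch test less of a black box, the statistic, degrees of freedom and p-value can be computed by hand and compared with t.test(). This is a minimal sketch on simulated data. The vectors group_a and group_b are hypothetical and do not reproduce the values reported above.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical BMI samples of unequal size; this is not the real data.
group_a <- rnorm(120, mean = 30.3, sd = 6)
group_b <- rnorm(150, mean = 29.9, sd = 7)

# Squared standard error of each group mean, without pooling the variances.
var_a <- var(group_a) / length(group_a)
var_b <- var(group_b) / length(group_b)

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch <- (var_a + var_b)^2 /
  (var_a^2 / (length(group_a) - 1) + var_b^2 / (length(group_b) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Cross-check against the built-in Welch test.
builtin <- t.test(group_a, group_b, var.equal = FALSE)
print(c(t = t_stat, df = df_welch, p = p_value))
print(builtin)
```

The hand-computed values should agree with the built-in output up to floating-point precision. The fractional degrees of freedom are a feature of the Welch approximation, not an error.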
Next, the BMI of the Dead and Alive populations is compared through an independent two-sample t-test: "Is there any difference between the mean BMI of Alive and Dead participants?"
H0: μ1 - μ2 = 0 ==> There is no difference between the dead mean and the alive mean.
HA: μ1 - μ2 ≠ 0 ==> There is a difference between the dead mean and the alive mean.
The BMI values are separated for Dead and Alive; the group sizes are:
## [1] 1012
## [1] 11550
TtestResultStatusBMI <- t.test(Dead_BMI,Alive_BMI, alternative = "two.sided", conf.level = 0.95,
var.equal = FALSE)
TtestResultStatusBMI
##
## Welch Two Sample t-test
##
## data: Dead_BMI and Alive_BMI
## t = -0.074943, df = 1168, p-value = 0.9403
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5048168 0.4676706
## sample estimates:
## mean of x mean of y
## 30.05212 30.07069
Here the p-value (0.9403) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence of a difference between the mean BMI of Dead and Alive participants.
boxplot(bmi~status,data=Liver_dis_clean, main="BMI and Status",
xlab="Status of patient", ylab="BMI")
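The box plot above also suggests that the BMI distributions of the dead and alive groups are centered very close to each other. Note that the line inside each box marks the median rather than the mean.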
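Because a box plot displays medians, the group means that the t-test compares can be checked separately. The following is a minimal sketch using tapply() on a hypothetical data frame toy. The values are illustrative and are not taken from the study.

```r
# Hypothetical miniature data set; the values are illustrative only.
toy <- data.frame(
  bmi    = c(22.7, 24.9, 30.5, 37.8, 26.6, 29.1, 31.4, 28.0),
  status = factor(c("Alive", "Alive", "Alive", "Alive",
                    "Dead", "Dead", "Alive", "Dead"))
)

# Mean BMI per status group (what the t-test compares).
group_means <- tapply(toy$bmi, toy$status, mean)

# Median BMI per status group (what the box plot draws).
group_medians <- tapply(toy$bmi, toy$status, median)

print(group_means)
print(group_medians)
```

With skewed data the two summaries can differ noticeably, so it helps to read them side by side before interpreting the plot.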
Our investigation found no evidence that participants' BMI differs by status (Dead or Alive).
The p-value of 0.94, which is higher than 0.05, led to this conclusion and indicates that, in this NAFLD population, BMI is not related to the mortality of the participants.
According to the statistical investigation of the NAFLD data set, males and females (Gender) showed different mean BMI. The p-value of 0.00558, which is lower than 0.05, led to this conclusion.
Limitations of the study include the relatively small population size and the limited description provided for each variable. The data was understood only through its basic initial description.
Further investigation could include other variables in the data set, such as age. This might provide more insight and a better outcome.
https://www.cdc.gov/obesity/basics/adult-defining.html#:~:text=If%20your%20BMI%20is%20less,falls%20within%20the%20obesity%20range.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6367583/#:~:text=Visceral%20obesity%20is%20an%20important,increase%20in%20pro%2Dinflammatory%20cytokines.