Introduction

Nonalcoholic fatty liver disease (NAFLD), which can progress to severe fibrosis and cirrhosis, affects 10%–30% of the general U.S. population.
NAFLD is essentially the presence of excess fat in the liver, and its spread parallels the ongoing obesity epidemic.
Early-stage NAFLD is usually harmless, but if it worsens it can result in severe liver damage, including cirrhosis.
Given the data provided and the background of the condition, fat deposition in the liver appears to be the main driver of NAFLD.
Three tables related to this study, covering the clinical tests of the population, are provided on the website. Only the main data set, which contains the variables of interest, was used in this investigation.

Problem Statement

Obesity is considered a major risk factor for NAFLD. A BMI of 30 or higher falls within the obese range.
Obesity leads to an accumulation of ectopic fat, including visceral fat and fatty liver.
The purpose of this investigation is to check whether BMI differs by status and by gender among patients in the NAFLD data set.
We compare these variables through hypothesis tests of BMI against status and of BMI against gender.
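
As a quick illustration of the BMI threshold mentioned above, the sketch below computes BMI from weight in kg and height in cm and flags values of 30 or more. The function names bmi_from and is_obese are hypothetical helpers introduced only for this example, and the input values are made up.

```r
# BMI = weight (kg) / height (m)^2; heights in this data set are in cm.
bmi_from <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Cut-off used in the text: a BMI of 30 or higher counts as obese.
is_obese <- function(bmi) bmi >= 30

example_bmi <- bmi_from(weight_kg = 81, height_cm = 180)  # made-up values
example_bmi            # 25
is_obese(example_bmi)  # FALSE
```

The bmi column supplied with the data may differ slightly from values recomputed this way, because the recorded weight and height appear to be rounded.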

Data

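The data provide information on a sample of participants who are all affected by NAFLD; the sample is small relative to the affected population.
The data set has 17,549 observations across 10 variables. Nine are described below; the tenth, X, appears to be an index column carried over from the source file.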

The dataset was found on the Kaggle open data source website.

id (Numeric) - subject identifier
age (Numeric) - age of participant
male (Binary) - 1 = male, 0 = female
weight (Numeric) - weight in kg
height (Numeric) - height in cm
bmi (Numeric) - body mass index
case.id (Numeric) - the id of the NAFLD case to whom this subject is matched
futime (Numeric) - time to death or last follow-up
status (Binary) - 0 = alive at last follow-up, 1 = dead

Statistical Analysis

Overview of the data structure after importing.

# head() takes n, not obs; the default n = 6 matches the six rows shown below
head(Liver_dis, n = 6)

##       X id age male weight height      bmi case.id futime status
## 1  3631  1  57    0   60.0    163 22.69094   10630   6261      0
## 2  8458  2  67    0   70.4    168 24.88403   14817    624      0
## 3  6298  3  53    1  105.8    186 30.45354       3   1783      0
## 4 15398  4  56    1  109.3    170 37.83010    6628   3143      0
## 5 13261  5  68    1     NA     NA       NA    1871   1836      1
## 6 14423  6  39    0   63.9    155 26.61559    7144   1581      0

Overview of the variables and their data types. This helps us understand the data and choose suitable analyses.

str(Liver_dis)

## 'data.frame':    17549 obs. of  10 variables:
##  $ X      : int  3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : int  57 67 53 56 68 39 49 30 47 79 ...
##  $ male   : int  0 0 1 1 1 0 0 1 1 0 ...
##  $ weight : num  60 70.4 105.8 109.3 NA ...
##  $ height : int  163 168 186 170 NA 155 161 180 188 155 ...
##  $ bmi    : num  22.7 24.9 30.5 37.8 NA ...
##  $ case.id: int  10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
##  $ futime : int  6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
##  $ status : int  0 0 0 0 1 0 0 0 0 1 ...

Data Preprocessing

In this stage the data is filtered and reshaped to suit further analysis.

The variable names are changed first for readability. The binary variables are then converted to factors, so that the renamed column gender (previously male) shows F and M instead of 0 and 1. The status values are likewise relabelled as Alive and Dead.

library(dplyr)  # provides %>% and rename()

# Rename columns for readability
Liver_dis <- Liver_dis %>% rename(gender = male, follow_up_time = futime)

Liver_dis$gender <- factor(Liver_dis$gender, levels = c(0,1), labels = c("F","M"))
Liver_dis$status <- factor(Liver_dis$status, levels = c(0,1), labels = c("Alive","Dead"))

Liver_dis$height <- as.numeric(Liver_dis$height)

str(Liver_dis)

## 'data.frame':    17549 obs. of  10 variables:
##  $ X             : int  3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
##  $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age           : int  57 67 53 56 68 39 49 30 47 79 ...
##  $ gender        : Factor w/ 2 levels "F","M": 1 1 2 2 2 1 1 2 2 1 ...
##  $ weight        : num  60 70.4 105.8 109.3 NA ...
##  $ height        : num  163 168 186 170 NA 155 161 180 188 155 ...
##  $ bmi           : num  22.7 24.9 30.5 37.8 NA ...
##  $ case.id       : int  10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
##  $ follow_up_time: int  6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
##  $ status        : Factor w/ 2 levels "Alive","Dead": 1 1 1 1 2 1 1 1 1 2 ...

Data Preprocessing Continued

The next step was to remove any rows with missing (NA) values from the data set.

Liver_dis_clean <- na.omit(Liver_dis)
colSums(is.na(Liver_dis_clean))

##              X             id            age         gender         weight 
##              0              0              0              0              0 
##         height            bmi        case.id follow_up_time         status 
##              0              0              0              0              0

summary(Liver_dis_clean)

##        X               id             age        gender       weight      
##  Min.   :    6   Min.   :    1   Min.   :18.00   F:6993   Min.   : 33.40  
##  1st Qu.: 4563   1st Qu.: 4404   1st Qu.:44.00   M:5569   1st Qu.: 70.00  
##  Median : 8944   Median : 8800   Median :53.00            Median : 84.00  
##  Mean   : 8802   Mean   : 8782   Mean   :53.55            Mean   : 86.41  
##  3rd Qu.:12793   3rd Qu.:13152   3rd Qu.:64.00            3rd Qu.: 99.30  
##  Max.   :17565   Max.   :17566   Max.   :98.00            Max.   :181.70  
##      height           bmi            case.id      follow_up_time   status     
##  Min.   :123.0   Min.   : 9.207   Min.   :    3   Min.   :   7   Alive:11550  
##  1st Qu.:162.0   1st Qu.:25.127   1st Qu.: 4697   1st Qu.:1143   Dead : 1012  
##  Median :169.0   Median :28.870   Median : 8783   Median :2146                
##  Mean   :169.3   Mean   :30.069   Mean   : 8868   Mean   :2402                
##  3rd Qu.:177.0   3rd Qu.:33.703   3rd Qu.:13269   3rd Qu.:3341                
##  Max.   :207.0   Max.   :84.396   Max.   :17563   Max.   :7145

This gives us a clean data set of 12,562 complete observations to work on. The variable names, data types and missing values have been taken care of.
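
To see how many rows a step like na.omit() removes, it can help to count the missing values per column before dropping them. The sketch below does this on a small made-up data frame (toy is a hypothetical stand-in, not the real Liver_dis data).

```r
# Toy data frame standing in for Liver_dis (values are made up).
toy <- data.frame(
  id     = 1:5,
  weight = c(60.0, 70.4, NA, 109.3, 63.9),
  height = c(163, 168, NA, 170, NA)
)

na_per_column <- colSums(is.na(toy))          # missing values per column
toy_clean     <- na.omit(toy)                 # drop rows with any NA
rows_dropped  <- nrow(toy) - nrow(toy_clean)  # rows lost to cleaning

na_per_column
rows_dropped
```

Reporting the number of dropped rows alongside the cleaned data makes it easier to judge whether the complete cases still represent the original sample.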

Data Preprocessing Continued

boxplot(Liver_dis_clean$bmi, main="BMI outlier check")

As observed in the graph above, there are evident outliers, but we decided to keep them because they may have an important impact on the final hypothesis testing.
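
The points drawn beyond the whiskers can also be listed numerically. The sketch below uses boxplot.stats(), which applies the same 1.5 x IQR rule as boxplot() by default, on simulated values; toy_bmi and its two extreme values are assumptions made only for illustration.

```r
set.seed(1)
# Simulated BMI-like values plus two made-up extremes.
toy_bmi <- c(rnorm(200, mean = 30, sd = 5), 9.2, 84.4)

stats    <- boxplot.stats(toy_bmi)  # default coef = 1.5, as in boxplot()
outliers <- stats$out               # values beyond the whiskers

length(outliers)
range(outliers)
```

Listing the flagged values makes it possible to check whether they are plausible measurements before deciding to keep or remove them.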

Check the normal distribution

The histograms below are used to check whether each variable is approximately normally distributed.

par(mfrow=c(3,1))
par(mar=c(4,4,4,4))
#BMI
hist(Liver_dis_clean$bmi, 
     main = "BMI", 
     xlab = "BMI of Patients", col = "orange")

#Weight
hist(Liver_dis_clean$weight, 
     main = "Weight", 
     xlab = "Weight of Patients", col = "orange")

Check the normal distribution

par(mfrow=c(3,1))
par(mar=c(4,4,4,4))

#height
hist(Liver_dis_clean$height, 
     main = "height", 
     xlab = "height of Patients", col = "orange")

#follow_up_time
hist(Liver_dis_clean$follow_up_time, 
     main = "follow_up_time", 
     xlab = "follow_up_time of Patients", col = "orange")
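
Histograms give a visual impression only. As a complementary check, a normal Q-Q plot and a Shapiro-Wilk test could be used, as sketched below on simulated right-skewed data (toy_bmi is made up). Note that shapiro.test() accepts between 3 and 5000 values, so a subsample is tested here.

```r
set.seed(42)
# Made-up, right-skewed values standing in for a BMI column.
toy_bmi <- rlnorm(1000, meanlog = log(29), sdlog = 0.2)

# Q-Q plot: points far from the line suggest departure from normality.
qqnorm(toy_bmi, main = "Normal Q-Q plot (toy BMI)")
qqline(toy_bmi, col = "orange")

# Shapiro-Wilk test on a random subsample of 500 values.
sw <- shapiro.test(sample(toy_bmi, size = 500))
sw$p.value
```

With samples as large as the one in this report, formal tests tend to flag even small departures from normality, so the plot is usually the more informative of the two.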

Random Sampling

This part demonstrates random sampling and the confidence interval for a single variable, to show whether a sample can represent the whole population at a chosen confidence level.

Liver_dis_clean is first copied to a new variable.

The code then draws a random sample of BMI values and checks whether the population mean of BMI (mu) lies within the sample’s 95% confidence interval. This indicates whether a sample of the chosen size is a good representative of the population.

liver <- Liver_dis_clean

Random Sampling Continuation

BMI_randomSample <- liver$bmi %>%  sample(size=50, replace = FALSE)
cof_int_BMI <- BMI_randomSample %>% t.test( conf.level= 0.95)
cof_int_BMI$conf.int[1]

## [1] 27.24586

cof_int_BMI$conf.int[2]

## [1] 30.56695

Popul_mean_BMI <- mean(liver$bmi)
hypoMean <- cof_int_BMI$conf.int[1] <= Popul_mean_BMI & Popul_mean_BMI <= cof_int_BMI$conf.int[2] 
hypoMean

## [1] TRUE

The first and second numbers are the lower and upper limits of the confidence interval. TRUE means that the population mean lies within the sample’s 95% confidence interval; FALSE means that it does not. Because the sample is random, the result changes from run to run, and repeating the code many times gives the proportion of TRUE results, which should be close to 95%. A TRUE result corresponds to failing to reject the null hypothesis that the mean equals the population mean at the 5% level, and a FALSE result corresponds to rejecting it.
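
The repetition described above can be automated. The sketch below repeats the sampling step 2000 times on a simulated population and reports the proportion of intervals that contain the population mean. The population vector is made up (it is not liver$bmi), and the number of repetitions is an arbitrary choice.

```r
set.seed(123)
# Made-up population standing in for liver$bmi.
population <- rnorm(12562, mean = 30, sd = 7)
pop_mean   <- mean(population)

# For each repetition: sample 50 values, build a 95% CI, check coverage.
covers <- replicate(2000, {
  s  <- sample(population, size = 50, replace = FALSE)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= pop_mean && pop_mean <= ci[2]
})

mean(covers)  # proportion of intervals containing the population mean
```

A proportion close to 0.95 is what the 95% confidence level predicts; a single TRUE or FALSE from one sample says little on its own.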

Hypothesis Testing for BMI and Gender

We now compare the male and female populations using an independent two-sample t-test: “Is there any difference between the mean BMI of males and females?”

H0: μ1 - μ2 = 0 ==> There is no difference between the male mean and the female mean.

HA: μ1 - μ2 ≠ 0 ==> There is a difference between the male mean and the female mean.

Where μ1 and μ2 refer to the mean BMI of males and females, respectively.

Separate the BMI values for males and females:

Male_BMI <- liver$bmi[liver$gender== "M"]
Female_BMI <- liver$bmi[liver$gender== "F"]

Male_BMI %>% length()

## [1] 5569

Female_BMI %>% length()

## [1] 6993

The two groups have different sizes, so we do not assume equal variances and use the Welch two-sample t-test in this case.
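
For readers who want to see what the Welch test computes, the sketch below reproduces its t statistic, degrees of freedom and p-value by hand and compares them with t.test(). The two groups x and y are simulated, not the real BMI vectors. Strictly speaking, the Welch test is chosen because it does not require equal variances; unequal group sizes simply make that assumption more consequential.

```r
set.seed(7)
x <- rnorm(80,  mean = 30.3, sd = 6)  # made-up group 1
y <- rnorm(120, mean = 29.9, sd = 8)  # made-up group 2

# Squared standard error of each group mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
# Welch-Satterthwaite approximation for the degrees of freedom.
df     <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

builtin <- t.test(x, y, var.equal = FALSE)
all.equal(unname(builtin$statistic), t_stat)
all.equal(unname(builtin$parameter), df)
all.equal(builtin$p.value, p_val)
```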

Hypothesis Testing - Two-Sample t-test for BMI and Gender

ttest_resultGenderBMI <- t.test(Male_BMI,Female_BMI, alternative = "two.sided", conf.level = 0.95,
                       var.equal = FALSE)
ttest_resultGenderBMI

## 
##  Welch Two Sample t-test
## 
## data:  Male_BMI and Female_BMI
## t = 2.7719, df = 12551, p-value = 0.00558
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1004225 0.5853809
## sample estimates:
## mean of x mean of y 
##  30.26008  29.91718

The p-value of 0.00558 is below 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis. The mean BMI of x (male, 30.26) differs from the mean of y (female, 29.92), although the difference of about 0.34 is small, with a 95% confidence interval of 0.10 to 0.59.
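
With samples this large, a statistically significant result can correspond to a small practical difference. One common way to express the size of a difference is Cohen's d, sketched below. The helper cohens_d is hypothetical (it is not part of the original analysis), and the two vectors are simulated.

```r
# Cohen's d with a pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

set.seed(11)
toy_male   <- rnorm(500, mean = 30.3, sd = 7)  # made-up values
toy_female <- rnorm(600, mean = 29.9, sd = 7)  # made-up values
cohens_d(toy_male, toy_female)
```

Values of d near zero indicate that the group means differ by only a small fraction of a standard deviation, whatever the p-value says.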

We now compare the BMI of the Dead and Alive populations using an independent two-sample t-test: “Is there any difference between the mean BMI of the Alive and Dead groups?”

Hypothesis Testing for BMI and Status

H0: μ1 - μ2 = 0 ==> There is no difference between the dead mean and the alive mean.

HA: μ1 - μ2 ≠ 0 ==> There is a difference between the dead mean and the alive mean.

Separate the BMI values for the Dead and Alive groups:

Dead_BMI <- liver$bmi[liver$status== "Dead"]
Alive_BMI <- liver$bmi[liver$status== "Alive"]

Dead_BMI %>% length()

## [1] 1012

Alive_BMI %>% length()

## [1] 11550

Hypothesis Testing - Two-Sample t-test for BMI and Status

TtestResultStatusBMI <- t.test(Dead_BMI,Alive_BMI, alternative = "two.sided", conf.level = 0.95,
                       var.equal = FALSE)
TtestResultStatusBMI

## 
##  Welch Two Sample t-test
## 
## data:  Dead_BMI and Alive_BMI
## t = -0.074943, df = 1168, p-value = 0.9403
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5048168  0.4676706
## sample estimates:
## mean of x mean of y 
##  30.05212  30.07069

Here the p-value of 0.9403 is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence of a difference between the mean BMI of the Dead and Alive groups.

Box Plot

boxplot(bmi~status,data=Liver_dis_clean, main="BMI and Status",
   xlab="Status of patient", ylab="BMI")

The box plot above also shows that the BMI distributions of the Dead and Alive groups are very similar; note that the line inside each box marks the median rather than the mean.
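
Since a box plot displays medians, the group means and medians can be printed side by side to back up the visual comparison. The sketch below does this with tapply() on a small made-up data frame (toy is hypothetical, not Liver_dis_clean).

```r
# Made-up values for illustration only.
toy <- data.frame(
  bmi    = c(22.7, 24.9, 30.5, 37.8, 26.6, 31.0),
  status = factor(c("Alive", "Alive", "Alive", "Alive", "Dead", "Dead"))
)

group_means   <- tapply(toy$bmi, toy$status, mean)    # mean BMI per status
group_medians <- tapply(toy$bmi, toy$status, median)  # median BMI per status

rbind(mean = group_means, median = group_medians)
```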

Conclusion

Our investigation found no significant difference in BMI between participants by status (Dead or Alive). This conclusion follows from the p-value of 0.94, which is higher than 0.05, so there is no evidence in this analysis that BMI in the NAFLD population is related to the mortality of the participants.
According to the statistical investigation on the NAFLD data set, males and females (Gender) showed different BMI means. This conclusion follows from the p-value of 0.00558, which is lower than 0.05.
Limitations of the study include the smaller population size as well as the limited description provided for each variable. Our understanding of the data relied on the basic initial description of the data.
Further investigation could include other variables in the data set, such as age. This might provide more insight and a better outcome.
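
One possible way to bring age into the analysis is a linear model of BMI on gender and age. The sketch below is only an illustration on simulated data: the data frame toy, the coefficients used to generate it, and the choice of lm() are all assumptions, not results from the NAFLD data set.

```r
set.seed(2024)
n <- 300
# Simulated participants (all values are made up).
toy <- data.frame(
  age    = round(runif(n, 18, 90)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$bmi <- 30 + 0.4 * (toy$gender == "M") -
  0.02 * (toy$age - 50) + rnorm(n, sd = 6)

# BMI modelled on gender and age together.
fit <- lm(bmi ~ gender + age, data = toy)
summary(fit)$coefficients
```

The genderM row of the coefficient table would then estimate the male and female difference in BMI after adjusting for age.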

Applied Analytics Data Project 2

Non-Alcoholic fatty liver disease (NAFLD)
