Applied Analytics Data Project 2

Non-Alcoholic Fatty Liver Disease (NAFLD)

Sajjad Abolhassani Larki (s3438733)
Aswathy Murali Ottur (s3782618)

2023-05-28

Introduction


Problem Statement

Data

The dataset was obtained from Kaggle, an open data website.

Statistical Analysis

Overview of the data structure after importing.

head(Liver_dis)
##       X id age male weight height      bmi case.id futime status
## 1  3631  1  57    0   60.0    163 22.69094   10630   6261      0
## 2  8458  2  67    0   70.4    168 24.88403   14817    624      0
## 3  6298  3  53    1  105.8    186 30.45354       3   1783      0
## 4 15398  4  56    1  109.3    170 37.83010    6628   3143      0
## 5 13261  5  68    1     NA     NA       NA    1871   1836      1
## 6 14423  6  39    0   63.9    155 26.61559    7144   1581      0

Overview of the variables and their data types. Knowing the types helps in choosing appropriate transformations and analyses.

str(Liver_dis)
## 'data.frame':    17549 obs. of  10 variables:
##  $ X      : int  3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : int  57 67 53 56 68 39 49 30 47 79 ...
##  $ male   : int  0 0 1 1 1 0 0 1 1 0 ...
##  $ weight : num  60 70.4 105.8 109.3 NA ...
##  $ height : int  163 168 186 170 NA 155 161 180 188 155 ...
##  $ bmi    : num  22.7 24.9 30.5 37.8 NA ...
##  $ case.id: int  10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
##  $ futime : int  6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
##  $ status : int  0 0 0 0 1 0 0 0 0 1 ...

Data Preprocessing

In this stage the data is cleaned and transformed to suit the later analysis.

First, two variables are renamed for readability: male becomes gender and futime becomes follow_up_time. The binary variables are then converted to factors, so that gender shows F and M instead of 0 and 1, and status shows Alive and Dead instead of 0 and 1. Finally, height is converted from integer to numeric.

library(dplyr)  # provides rename() and the %>% pipe

# Rename columns for readability
Liver_dis <- Liver_dis %>% rename(gender = male, follow_up_time = futime)

# Recode the 0/1 indicators as labelled factors
Liver_dis$gender <- factor(Liver_dis$gender, levels = c(0,1), labels = c("F","M"))
Liver_dis$status <- factor(Liver_dis$status, levels = c(0,1), labels = c("Alive","Dead"))

# Store height as numeric, like weight and bmi
Liver_dis$height <- as.numeric(Liver_dis$height)
str(Liver_dis)
## 'data.frame':    17549 obs. of  10 variables:
##  $ X             : int  3631 8458 6298 15398 13261 14423 9518 15366 8827 688 ...
##  $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age           : int  57 67 53 56 68 39 49 30 47 79 ...
##  $ gender        : Factor w/ 2 levels "F","M": 1 1 2 2 2 1 1 2 2 1 ...
##  $ weight        : num  60 70.4 105.8 109.3 NA ...
##  $ height        : num  163 168 186 170 NA 155 161 180 188 155 ...
##  $ bmi           : num  22.7 24.9 30.5 37.8 NA ...
##  $ case.id       : int  10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
##  $ follow_up_time: int  6261 624 1783 3143 1836 1581 3109 1339 1671 2239 ...
##  $ status        : Factor w/ 2 levels "Alive","Dead": 1 1 1 1 2 1 1 1 1 2 ...

Data Preprocessing Continued

The next step was to remove every row that contains a missing (NA) value from the data set.

Liver_dis_clean <- na.omit(Liver_dis)
colSums(is.na(Liver_dis_clean))
##              X             id            age         gender         weight 
##              0              0              0              0              0 
##         height            bmi        case.id follow_up_time         status 
##              0              0              0              0              0
summary(Liver_dis_clean)
##        X               id             age        gender       weight      
##  Min.   :    6   Min.   :    1   Min.   :18.00   F:6993   Min.   : 33.40  
##  1st Qu.: 4563   1st Qu.: 4404   1st Qu.:44.00   M:5569   1st Qu.: 70.00  
##  Median : 8944   Median : 8800   Median :53.00            Median : 84.00  
##  Mean   : 8802   Mean   : 8782   Mean   :53.55            Mean   : 86.41  
##  3rd Qu.:12793   3rd Qu.:13152   3rd Qu.:64.00            3rd Qu.: 99.30  
##  Max.   :17565   Max.   :17566   Max.   :98.00            Max.   :181.70  
##      height           bmi            case.id      follow_up_time   status     
##  Min.   :123.0   Min.   : 9.207   Min.   :    3   Min.   :   7   Alive:11550  
##  1st Qu.:162.0   1st Qu.:25.127   1st Qu.: 4697   1st Qu.:1143   Dead : 1012  
##  Median :169.0   Median :28.870   Median : 8783   Median :2146                
##  Mean   :169.3   Mean   :30.069   Mean   : 8868   Mean   :2402                
##  3rd Qu.:177.0   3rd Qu.:33.703   3rd Qu.:13269   3rd Qu.:3341                
##  Max.   :207.0   Max.   :84.396   Max.   :17563   Max.   :7145

This gives us a clean data set to work on. The variable names, data types and missing values have all been taken care of.
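To see how much data the NA removal discards, the rows can be counted before and after cleaning. In this report the step reduces the 17,549 imported rows to 12,562 (the gender counts in the summary add up to 12,562). The sketch below shows the check on a small toy data frame built from the first six rows printed earlier; the name toy is only for illustration.

```r
# Toy data frame built from the first six rows shown by head()
toy <- data.frame(
  id     = 1:6,
  weight = c(60.0, 70.4, 105.8, 109.3, NA, 63.9),
  height = c(163, 168, 186, 170, NA, 155)
)

# Missing values per column before cleaning
na_per_column <- colSums(is.na(toy))

# Rows with at least one NA are exactly the rows na.omit() drops
rows_with_na <- sum(!complete.cases(toy))
toy_clean    <- na.omit(toy)
rows_removed <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
cat("Rows removed:", rows_removed, "of", nrow(toy), "\n")
```

Running the same three steps on Liver_dis would report how many patients are excluded from the later tests.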

Data Preprocessing Continued

boxplot(Liver_dis_clean$bmi, main="BMI outlier check")

As the box plot above shows, there are clear outliers. We decided to keep them, as they may have an important impact on the final hypothesis testing.
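The points a box plot draws as outliers can also be listed numerically. The following is a minimal sketch using boxplot.stats(), which applies the same 1.5 x IQR whisker rule as boxplot(); the vector bmi_example is made up for illustration and would be replaced by Liver_dis_clean$bmi.

```r
# Hypothetical BMI values, for illustration only
bmi_example <- c(22.7, 24.9, 30.5, 37.8, 26.6, 28.9, 25.1, 33.7, 84.4, 9.2)

# boxplot.stats() returns the values that fall beyond the whiskers
stats    <- boxplot.stats(bmi_example)
outliers <- stats$out

# Number of outliers and their values
length(outliers)
sort(outliers)
```

Listing the outliers this way makes it easier to judge whether they are plausible measurements or data-entry errors before deciding to keep them.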

Check the normal distribution

The histograms below are used to check whether each variable is approximately normally distributed.

par(mfrow=c(3,1))
par(mar=c(4,4,4,4))
#BMI
hist(Liver_dis_clean$bmi, 
     main = "BMI", 
     xlab = "BMI of Patients", col = "orange")

#Weight
hist(Liver_dis_clean$weight, 
     main = "Weight", 
     xlab = "Weight of Patients", col = "orange")

Check the normal distribution

par(mfrow=c(3,1))
par(mar=c(4,4,4,4))

#height
hist(Liver_dis_clean$height, 
     main = "height", 
     xlab = "height of Patients", col = "orange")

#follow_up_time
hist(Liver_dis_clean$follow_up_time, 
     main = "follow_up_time", 
     xlab = "follow_up_time of Patients", col = "orange")
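A histogram gives only a visual impression of normality. A Q-Q plot and the Shapiro-Wilk test are two common complements. The sketch below uses made-up right-skewed data rather than the liver data, so its result says nothing about the real variables. Note that shapiro.test() accepts at most 5000 values, so a variable from the cleaned data set would first need to be subsampled, for example with sample(Liver_dis_clean$bmi, 5000).

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed data standing in for a real variable
skewed <- rexp(500, rate = 1 / 30)

# Shapiro-Wilk test: a small p-value is evidence against normality
check <- shapiro.test(skewed)
check$p.value

# Q-Q plot: points bending away from the line indicate non-normality
qqnorm(skewed, main = "Normal Q-Q plot (hypothetical data)")
qqline(skewed, col = "orange")
```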

Random Sampling

This part demonstrates random sampling and the confidence interval for a single variable, BMI, to see whether a sample can represent the whole population at a chosen confidence level.

The code draws a random sample of 50 BMI values and checks whether the population mean of BMI (mu) lies within the sample’s 95% confidence interval. This indicates whether the chosen sample size gives a good representation of the population.

First, Liver_dis_clean is copied to a new variable.

liver <- Liver_dis_clean

Random Sampling Continuation

BMI_randomSample <- liver$bmi %>%  sample(size=50, replace = FALSE)
cof_int_BMI <- BMI_randomSample %>% t.test( conf.level= 0.95)
cof_int_BMI$conf.int[1]
## [1] 27.24586
cof_int_BMI$conf.int[2]
## [1] 30.56695
Popul_mean_BMI <- mean(liver$bmi)
hypoMean <- cof_int_BMI$conf.int[1] <= Popul_mean_BMI & Popul_mean_BMI <= cof_int_BMI$conf.int[2] 
hypoMean
## [1] TRUE

The first and second numbers are the lower and upper bounds of the sample’s 95% confidence interval. TRUE means that the population mean lies within that interval, and FALSE means that it does not. Because sample() draws a different sample on every run, repeating the code many times gives the proportion of TRUE results, which should be close to 95%. A FALSE result corresponds to rejecting, at the 5% level, the null hypothesis that the sample comes from a population with that mean. A TRUE result means the null hypothesis is not rejected, not that it is proven true.
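The repetition described above can be automated with replicate(). This is a minimal sketch, assuming a simulated normal population in place of liver$bmi; the mean, standard deviation and the number of repetitions are arbitrary choices for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population standing in for liver$bmi
population <- rnorm(12562, mean = 30, sd = 6)
pop_mean   <- mean(population)

# Draw a sample of 50, build its 95% CI, and record whether it covers pop_mean
covers <- replicate(1000, {
  s  <- sample(population, size = 50, replace = FALSE)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= pop_mean && pop_mean <= ci[2]
})

# Proportion of intervals that contain the population mean
coverage <- mean(covers)
coverage
```

Replacing population with liver$bmi would give the observed coverage for the real data.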

Hypothesis Testing for BMI and Gender

We now compare the male and female populations with an independent two-sample t-test: “Is there any difference between the mean BMI of males and females?”

H0: μ1 - μ2 = 0 ==> There is no difference between the male mean and the female mean.

HA: μ1 - μ2 ≠ 0 ==> There is a difference between the male mean and the female mean.

Here μ1 and μ2 are the mean BMI of males and females, respectively.

Separate the BMI values for males and females:

Male_BMI <- liver$bmi[liver$gender== "M"]
Female_BMI <- liver$bmi[liver$gender== "F"]
Male_BMI %>% length()
## [1] 5569
Female_BMI %>% length()
## [1] 6993

The two samples have different sizes, and we do not test for equal variances. Instead we use the Welch two-sample t-test, which does not assume that the two groups have equal variances.
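To make the Welch test concrete, its statistic and degrees of freedom can be computed by hand and compared with t.test(). The sketch below uses two made-up groups with unequal sizes and variances, not the liver data; the variable names are for illustration only.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical groups with unequal sizes and variances
x <- rnorm(60, mean = 30.3, sd = 6)
y <- rnorm(80, mean = 29.9, sd = 7)

# Squared standard error contributed by each group
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# The built-in test should agree with the manual values
res <- t.test(x, y, var.equal = FALSE)
c(manual = t_manual, builtin = unname(res$statistic))
```

The non-integer degrees of freedom explain why the df reported by a Welch test is usually not simply n1 + n2 - 2.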

Hypothesis Testing - Two-Sample t-test for BMI and Gender

ttest_resultGenderBMI <- t.test(Male_BMI,Female_BMI, alternative = "two.sided", conf.level = 0.95,
                       var.equal = FALSE)
ttest_resultGenderBMI
## 
##  Welch Two Sample t-test
## 
## data:  Male_BMI and Female_BMI
## t = 2.7719, df = 12551, p-value = 0.00558
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1004225 0.5853809
## sample estimates:
## mean of x mean of y 
##  30.26008  29.91718

The p-value of 0.00558 is below 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis. As shown, the mean of x (male, 30.26) differs from the mean of y (female, 29.92).
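With more than 12,000 observations, even a small difference in means can be statistically significant, so it can help to report an effect size alongside the p-value. The sketch below computes Cohen's d with a pooled standard deviation. cohens_d is a hypothetical helper name, not part of the original analysis, and the input vectors are made up.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled standard deviation of the two groups
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Small made-up example: means 4 and 3, both variances 4, so d = 0.5
d_example <- cohens_d(c(2, 4, 6), c(1, 3, 5))
d_example
```

In this report the helper could be called as cohens_d(Male_BMI, Female_BMI) to judge how large the gender difference is in practical terms.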

We now compare the BMI of the Dead and Alive populations with an independent two-sample t-test: “Is there any difference between the mean BMI of Alive and Dead patients?”

Hypothesis Testing for BMI and Status

H0: μ1 - μ2 = 0 ==> There is no difference between the dead mean and the alive mean.

HA: μ1 - μ2 ≠ 0 ==> There is a difference between the dead mean and the alive mean.

Separate the BMI values for dead and alive patients:

Dead_BMI <- liver$bmi[liver$status== "Dead"]
Alive_BMI <- liver$bmi[liver$status== "Alive"]
Dead_BMI %>% length()
## [1] 1012
Alive_BMI %>% length()
## [1] 11550

Hypothesis Testing - Two-Sample t-test for BMI and Status

TtestResultStatusBMI <- t.test(Dead_BMI,Alive_BMI, alternative = "two.sided", conf.level = 0.95,
                       var.equal = FALSE)
TtestResultStatusBMI
## 
##  Welch Two Sample t-test
## 
## data:  Dead_BMI and Alive_BMI
## t = -0.074943, df = 1168, p-value = 0.9403
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5048168  0.4676706
## sample estimates:
## mean of x mean of y 
##  30.05212  30.07069

Here the p-value of 0.9403 is greater than 0.05, so we fail to reject the null hypothesis: the data show no evidence of a difference in mean BMI between Dead and Alive patients.

Box Plot

boxplot(bmi~status,data=Liver_dis_clean, main="BMI and Status",
   xlab="Status of patient", ylab="BMI")

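The box plot above is consistent with this result: the BMI distributions of the dead and alive groups are centered at very similar values. Note that the line inside each box marks the median, not the mean.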
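Because a box plot marks the median, the group means have to be computed separately if they are to be compared visually. The following sketch does this on a small made-up data frame (toy is an illustrative name) and overlays the means on the box plot as points.

```r
# Hypothetical data, for illustration only
toy <- data.frame(
  bmi    = c(22.7, 24.9, 30.5, 37.8, 26.6, 31.0),
  status = factor(c("Alive", "Alive", "Alive", "Alive", "Dead", "Dead"))
)

# Mean and median BMI per status group
group_means   <- tapply(toy$bmi, toy$status, mean)
group_medians <- tapply(toy$bmi, toy$status, median)
print(rbind(mean = group_means, median = group_medians))

# Box plot with the group means overlaid as orange diamonds
boxplot(bmi ~ status, data = toy, main = "BMI and Status",
        xlab = "Status of patient", ylab = "BMI")
points(seq_along(group_means), group_means, pch = 18, col = "orange")
```

Using Liver_dis_clean in place of toy would add the two group means reported by the t-test to the plot.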

Conclusion

Reference