ANNALISHIA GEORGE CHETTIAR (S3794870)
TAMIL MUHIL Karuppiah (S3775152)
Niranjan Kumar Ramachandran (S3711568)
Last updated: 27 October, 2019
To test whether the mean age differs between male and female patients, we use a statistical hypothesis test.
The null hypothesis is that there is no difference in mean age between Indian male and female liver disease patients, where μ1 and μ2 are the population mean ages of the two gender groups:
H0 : μ1 = μ2
The alternative hypothesis is that the mean ages of Indian male and female liver disease patients differ:
HA : μ1 ≠ μ2
Because the patients fall into two independent groups, Male and Female, we later use a two-sample t-test to test this hypothesis.
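For reference, the standard form of the equal-variance (pooled) two-sample t statistic is sketched below. The symbols \(\bar{x}_k\), \(s_k\) and \(n_k\) (sample mean, sample standard deviation and size of group \(k\)) are notation introduced here for illustration and are not defined in the original report.

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

Under \(H_0\), and assuming equal population variances, this statistic follows a t distribution with \(n_1 + n_2 - 2\) degrees of freedom.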
The dataset selected for this investigation is from Kaggle.com. It is an open dataset of Indian liver patient records, collected from the north east of Andhra Pradesh, India.
Link for dataset: https://www.kaggle.com/uciml/indian-liver-patient-records
Number of missing values: 0 in Age and Gender, the two variables used in this analysis (Albumin_and_Globulin_Ratio has 4).
The variables available in the dataset are Age, Gender, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase, Aspartate_Aminotransferase, Total_Protiens, Albumin, Albumin_and_Globulin_Ratio and Dataset.
In the dataset, age ranges from 4 to 90 years.
To simplify the analysis, Gender, which was initially read as a character variable, was converted to a factor.
We also counted the NA values in each column. Age and Gender have none; the only column with missing values is Albumin_and_Globulin_Ratio, which has 4.
The R code below loads the data, converts Gender to a factor and inspects the structure of the dataset:
library(readr)
library(dplyr) # provides %>%, filter() and summarise(), which are used further below
indian_liver_patient <- read_csv("indian_liver_patient.csv")
View(indian_liver_patient)
# Convert Gender from character to factor
indian_liver_patient$Gender <- factor(indian_liver_patient$Gender, ordered = TRUE)
str(indian_liver_patient)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 583 obs. of 11 variables:
## $ Age : num 65 62 62 58 72 46 26 29 17 55 ...
## $ Gender : Ord.factor w/ 2 levels "Female"<"Male": 1 2 2 2 2 2 1 1 2 2 ...
## $ Total_Bilirubin : num 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
## $ Direct_Bilirubin : num 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
## $ Alkaline_Phosphotase : num 187 699 490 182 195 208 154 202 202 290 ...
## $ Alamine_Aminotransferase : num 16 64 60 14 27 19 16 14 22 53 ...
## $ Aspartate_Aminotransferase: num 18 100 68 20 59 14 12 11 19 58 ...
## $ Total_Protiens : num 6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
## $ Albumin : num 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
## $ Albumin_and_Globulin_Ratio: num 0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
## $ Dataset : num 1 1 1 1 1 1 1 1 2 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Total_Bilirubin = col_double(),
## .. Direct_Bilirubin = col_double(),
## .. Alkaline_Phosphotase = col_double(),
## .. Alamine_Aminotransferase = col_double(),
## .. Aspartate_Aminotransferase = col_double(),
## .. Total_Protiens = col_double(),
## .. Albumin = col_double(),
## .. Albumin_and_Globulin_Ratio = col_double(),
## .. Dataset = col_double()
## .. )
## Age Gender
## 0 0
## Total_Bilirubin Direct_Bilirubin
## 0 0
## Alkaline_Phosphotase Alamine_Aminotransferase
## 0 0
## Aspartate_Aminotransferase Total_Protiens
## 0 0
## Albumin Albumin_and_Globulin_Ratio
## 0 4
## Dataset
## 0
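The call that produced the per-column counts above is not shown in the report. One common way to get output of this shape is `colSums(is.na(...))`. The sketch below applies it to a small made-up data frame (`toy`, with invented values) rather than to the real data, so it runs without the CSV file.

```r
# Hypothetical toy data, NOT taken from the liver patient dataset
toy <- data.frame(
  Age    = c(65, 62, NA, 58),
  Gender = c("Female", "Male", "Male", "Male"),
  Ratio  = c(0.9, NA, NA, 1.0)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the full dataset, the same pattern would be `colSums(is.na(indian_liver_patient))`.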
For the next step we split the data into male and female subsets by filtering on the Gender factor.
Next we use the summarise function to find the quartiles, median, mean, maximum, minimum and standard deviation of age for each gender separately.
To see visually whether there is any difference in age between the two groups, we use a box plot.
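The box plot code is not included in the report. A minimal sketch using base R's `boxplot()` with a formula might look as follows. The data frame `toy` and its values are invented for illustration; with the real data the call would presumably be `boxplot(Age ~ Gender, data = indian_liver_patient)`.

```r
# Hypothetical toy data, NOT taken from the liver patient dataset
toy <- data.frame(
  Age    = c(25, 30, 35, 40, 45, 30, 40, 50, 60, 70),
  Gender = factor(rep(c("Female", "Male"), each = 5))
)

# Draw one box per gender; boxplot() invisibly returns the plotted statistics
bp <- boxplot(Age ~ Gender, data = toy,
              main = "Age by gender (toy data)", ylab = "Age (years)")

# Row 3 of bp$stats holds the median of each group
bp$stats[3, ]
```

The returned `bp$stats` matrix holds the lower whisker, lower hinge, median, upper hinge and upper whisker for each group, which can be compared with the summary statistics.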
The summary statistics for each gender were computed with the following R code:
# Summary statistics of Age for male patients
indian_liver_patient_Male <- indian_liver_patient %>% filter(Gender == "Male")
indian_liver_patient_Male %>%
  summarise(
    Min = min(Age, na.rm = TRUE),
    Q1 = quantile(Age, probs = .25, na.rm = TRUE),
    Median = median(Age, na.rm = TRUE),
    Q3 = quantile(Age, probs = .75, na.rm = TRUE),
    Max = max(Age, na.rm = TRUE),
    Mean = mean(Age, na.rm = TRUE),
    SD = sd(Age, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(Age))
  )
# Summary statistics of Age for female patients
indian_liver_patient_Female <- indian_liver_patient %>% filter(Gender == "Female")
indian_liver_patient_Female %>%
  summarise(
    Min = min(Age, na.rm = TRUE),
    Q1 = quantile(Age, probs = .25, na.rm = TRUE),
    Median = median(Age, na.rm = TRUE),
    Q3 = quantile(Age, probs = .75, na.rm = TRUE),
    Max = max(Age, na.rm = TRUE),
    Mean = mean(Age, na.rm = TRUE),
    SD = sd(Age, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(Age))
  )
##
## Two Sample t-test
##
## data: Age by Gender
## t = -1.3655, df = 581, p-value = 0.1726
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.197309 0.934302
## sample estimates:
## mean in group Female mean in group Male
## 43.13380 45.26531
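The call that produced this output is not shown. The header "Two Sample t-test" and "data: Age by Gender" are consistent with `t.test(Age ~ Gender, data = indian_liver_patient, var.equal = TRUE)`, but that is an assumption. As a sanity check that does not need the data file, the sketch below recomputes the p-value and confidence interval from the reported t statistic, degrees of freedom and group means. The variable names are chosen here for illustration.

```r
# Values copied from the t-test output above
t_stat      <- -1.3655
df          <- 581
mean_female <- 43.13380
mean_male   <- 45.26531

# Assumed original call (needs the CSV, so it is left commented out):
# t.test(Age ~ Gender, data = indian_liver_patient, var.equal = TRUE)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Recover the standard error from t = difference / SE, then the 95% interval
diff_means <- mean_female - mean_male
se         <- diff_means / t_stat
ci         <- diff_means + c(-1, 1) * qt(0.975, df) * se

print(p_value)
print(ci)
```

Up to rounding of the reported t statistic, the results should agree with the p-value and the 95 percent confidence interval printed above.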
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
In the two-sample t-test assuming equal variances, the p-value is 0.1726, which is greater than 0.05, so we fail to reject H0: the data do not provide evidence of a difference in mean age between Indian male and female patients of liver disease. The 95% confidence interval for the difference in means, (-5.197309, 0.934302), contains 0, which is consistent with this result.
The mean age of males is 45.26531 and the mean age of females is 43.1338.
From this analysis it can be concluded that the male patients in this dataset have a mean age of 45.26531 and the female patients a mean age of 43.1338, and that this difference is not statistically significant at the 5% level.
The findings of this investigation need to be an eye-opener to people all around. Males and females suffering from liver disease at such an early age is a matter of concern for society. It may also affect the next generation, who are likely to be influenced by the habits of their elders. Proper initiatives need to be taken to spread awareness of this issue, along with precautionary measures.