Shivaniben Nikunjkumar Prajapati (s3738826) and Santosh Kumaravel Sundaravadivelu (s3729461)
Last updated: 29 October, 2018
We investigate the Total Protein consumed by males and females and how it supports the function of the liver and other organs in the long run. In this problem we try to identify whether there is an association between two variables, Total_Protiens and Gender. Protein levels are expected to differ between males and females, and this test helps us understand whether total protein is associated with gender.
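library(readr)  # read_csv()
library(dplyr)  # %>%, group_by(), summarise(), filter()
library(car)    # qqPlot(), leveneTest()
core <- read_csv("D:/Intro to Stats/Assignment3/indian_liver_patient.csv")

As mentioned in the introduction, the dataset is taken from www.kaggle.com. It contains the following variables: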
Age - Age of the patient
Gender - Gender of the patient
Total_Bilirubin - Total Bilirubin
Direct_Bilirubin - Direct Bilirubin
Alkaline_Phosphotase - Alkaline Phosphatase
Alamine_Aminotransferase - Alanine Aminotransferase
Aspartate_Aminotransferase - Aspartate Aminotransferase
Total_Protiens - Total Proteins
Albumin - Albumin
Albumin_and_Globulin_Ratio - Albumin and Globulin Ratio
Dataset - Field used to split the data into two sets (patients with liver disease, or no disease)
Two variables are central to the further analysis:
Total_Protiens - Total protein in the human body. It is a numeric variable with a scale of 2.7 - 9.6
Gender - Gender information. It has a character data type, which is later converted into a factor
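The conversion of Gender from character to factor is not shown in the report's code. A minimal sketch in base R, assuming a small made-up data frame (`demo` and its values are hypothetical and not taken from the dataset), might look like this:

```r
# Hypothetical toy data; only the column names mirror the dataset.
demo <- data.frame(
  Gender = c("Male", "Female", "Male", "Female"),
  Total_Protiens = c(6.8, 7.5, 7.0, 6.2),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor with an explicit level order.
demo$Gender <- factor(demo$Gender, levels = c("Female", "Male"))

levels(demo$Gender)  # level order used by grouping and formula interfaces
table(demo$Gender)   # number of rows per level
```

Setting the levels explicitly is optional here, since the default alphabetical order already places Female before Male.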
Numeric Variables : Scale of Numeric Variable
Total_Bilirubin : 1 - 31
Direct_Bilirubin : 0.1 - 19.7
Alkaline_Phosphotase : 63 - 2110
Alamine_Aminotransferase : 10 - 2000
Total_Protiens : 2.7 - 9.6
Albumin : 0.9 - 5.5
Albumin_and_Globulin_Ratio : 0.3 - 2.8
In the statistical analysis that follows, we use a boxplot to check for outliers in Total_Protiens, since we perform hypothesis testing on that variable. The detected outliers are handled by capping, that is, they are replaced with the 5th or 95th percentile.
Here we randomly sampled the data down to 100 observations so that we can perform the statistical analysis efficiently.
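The sampling step described above is not shown in the report's code. One way it could be done in base R is sketched below, assuming a stand-in data frame (`demo`, its `id` column and the seed are hypothetical; only the 583 rows match the group sizes of 142 and 441 in the summary table):

```r
set.seed(1)  # arbitrary seed, used only to make the draw reproducible

# Hypothetical stand-in for the data: 583 rows split 142 / 441 by gender.
demo <- data.frame(
  id = seq_len(583),
  Gender = rep(c("Female", "Male"), times = c(142, 441))
)

# Draw 100 distinct row indices and keep only those rows.
sampled_rows <- sample(nrow(demo), size = 100, replace = FALSE)
demo_sample <- demo[sampled_rows, ]

nrow(demo_sample)          # number of sampled rows
table(demo_sample$Gender)  # gender split within the sample
```

Sampling without replacement keeps each patient at most once, so the sample remains a subset of the original rows.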
boxplot(core$Total_Protiens)

# Cap outliers: values beyond the 1.5 * IQR fences are replaced
# by the 5th (low side) or 95th (high side) percentile.
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}
core$Total_Protiens <- core$Total_Protiens %>% cap()
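To see what the capping rule does, it can be applied to a small made-up vector. The sketch below redefines the same rule as `cap_demo()` so that it runs on its own. The vector `toy` is hypothetical and is not drawn from the dataset.

```r
# Same rule as cap(): values beyond the 1.5 * IQR fences are replaced
# by the 5th or 95th percentile of the input.
cap_demo <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  fence <- 1.5 * IQR(x)
  x[x < q[2] - fence] <- q[1]
  x[x > q[3] + fence] <- q[4]
  x
}

# Hypothetical values: 31 evenly spaced points plus one high outlier (30).
toy <- c(seq(5, 8, by = 0.1), 30)
capped <- cap_demo(toy)

toy[32]        # the original outlier
capped[32]     # the outlier after capping (95th percentile of toy)
range(capped)  # all values now lie within the bulk of the data
```

In very small samples the 95th percentile is itself pulled towards an extreme value, so the capped value can still sit far from the rest of the data.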
boxplot(core$Total_Protiens)

The summarise() function provides summary statistics such as the minimum, maximum, first quartile (Q1), third quartile (Q3), mean, median, standard deviation and number of missing values.
Here we summarise the Total_Protiens variable by Gender, and knitr::kable(table1) displays the result in tabular form.
table1 <- core %>%
  group_by(Gender) %>%
  summarise(
    Min = min(Total_Protiens, na.rm = TRUE),
    Q1 = quantile(Total_Protiens, probs = .25, na.rm = TRUE),
    Median = median(Total_Protiens, na.rm = TRUE),
    Q3 = quantile(Total_Protiens, probs = .75, na.rm = TRUE),
    Max = max(Total_Protiens, na.rm = TRUE),
    Mean = mean(Total_Protiens, na.rm = TRUE),
    SD = sd(Total_Protiens, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(Total_Protiens))
  )
knitr::kable(table1)

| Gender | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Female | 4.1 | 5.925 | 6.8 | 7.5 | 9.2 | 6.660634 | 1.118123 | 142 | 0 |
| Male | 3.7 | 5.700 | 6.5 | 7.1 | 9.2 | 6.438435 | 1.007471 | 441 | 0 |
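Levene's test gives a p-value of 0.09503, which is greater than 0.05, so there is no significant evidence of unequal variances. The two-sample t-test is nevertheless run with var.equal = FALSE (Welch), which does not rely on equal variances.
* The two-sample t-test gives a p-value of 0.03612.
* H0: Males and females have equal mean Total Protein.
* HA: Mean Total Protein differs between males and females. In this sample the female mean is higher than the male mean.

Since the p-value is less than 0.05, we reject H0.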
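Written formally, the two-sided test requested with `alternative = "two.sided"` can be stated as follows. The symbols \(\mu_F\) and \(\mu_M\) (population mean Total Protein for females and males) and the significance level \(\alpha = 0.05\) are notation introduced here for illustration.

```latex
\begin{aligned}
H_0 &: \mu_F - \mu_M = 0 \\
H_A &: \mu_F - \mu_M \neq 0 \\
\text{Decision rule} &: \text{reject } H_0 \text{ if } p < \alpha = 0.05
\end{aligned}
```

A two-sided alternative tests only whether the means differ. The direction of the difference is read from the sample means, not from the test itself.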
# Normal Q-Q plots of Total_Protiens for each gender
total_protein_male <- core %>% filter(Gender == "Male")
total_protein_male$Total_Protiens %>% qqPlot(dist = "norm")
## [1] 391 77
total_protein_female <- core %>% filter(Gender == "Female")
total_protein_female$Total_Protiens %>% qqPlot(dist = "norm")
## [1] 111 129

# Cap outliers within each group, then recombine the two groups
total_protein_male$Total_Protiens <- total_protein_male$Total_Protiens %>% cap()
total_protein_female$Total_Protiens <- total_protein_female$Total_Protiens %>% cap()
core_new <- rbind(total_protein_male, total_protein_female)
core_new %>% boxplot(Total_Protiens ~ Gender, data = ., ylab = "Protein Value", xlab = "Gender")

# Levene's test for homogeneity of variance
leveneTest(Total_Protiens ~ Gender, data = core_new)

# Welch two-sample t-test (equal variances not assumed)
t.test(
  Total_Protiens ~ Gender,
  data = core_new,
  var.equal = FALSE,
  alternative = "two.sided"
)
##
## Welch Two Sample t-test
##
## data: Total_Protiens by Gender
## t = 2.1085, df = 219.55, p-value = 0.03612
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01450457 0.42989229
## sample estimates:
## mean in group Female mean in group Male
## 6.660634 6.438435
\(t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s^2_p}{n_1} + \dfrac{s^2_p}{n_2}}}\)
\(s^2_p = \dfrac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}\)
\(df' = \dfrac{\left(\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}\right)^2}{\dfrac{(s^2_1/n_1)^2}{n_1 - 1} + \dfrac{(s^2_2/n_2)^2}{n_2 - 1}}\)
The t-statistic is compared to a two-tailed t-critical value \(t^*\) with df:
\(df = n_1 + n_2 - 2\)
The 95% CI of the difference between the means was calculated using the following formula in R:
\(\left[\bar{x}_1 - \bar{x}_2 - t_{n_1+n_2-2,\,1-\alpha/2}\sqrt{\dfrac{s^2_p}{n_1} + \dfrac{s^2_p}{n_2}},\ \ \bar{x}_1 - \bar{x}_2 + t_{n_1+n_2-2,\,1-\alpha/2}\sqrt{\dfrac{s^2_p}{n_1} + \dfrac{s^2_p}{n_2}}\right]\)
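As a cross-check, the Welch result reported above can be recomputed from the summary table alone. The sketch below copies the group means, standard deviations and sizes from that table. Because the test was run with var.equal = FALSE, it uses the unpooled standard error \(\sqrt{s^2_1/n_1 + s^2_2/n_2}\) together with the \(df'\) formula, not the pooled quantities. All variable names are illustrative.

```r
# Summary statistics copied from the summary table in this report.
mean_f <- 6.660634; sd_f <- 1.118123; n_f <- 142  # Female
mean_m <- 6.438435; sd_m <- 1.007471; n_m <- 441  # Male

# Variance of each group mean, s^2 / n.
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t-statistic: difference in means over the unpooled standard error.
se_diff <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom (the df' formula).
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference in means.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
t_crit <- qt(0.975, df = df_welch)
ci <- (mean_f - mean_m) + c(-1, 1) * t_crit * se_diff

round(c(t = t_stat, df = df_welch, p = p_value), 4)
round(ci, 4)
```

The values should closely match the t = 2.1085, df = 219.55, p-value = 0.03612 and the confidence interval in the t.test() output. Small differences can arise because the table values are rounded.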
Strength:
* The major strength of this finding is that females show a higher mean Total Protein than males, which challenges a commonly believed myth.
* It can play a vital role in changing people's mindset, and the approach towards Total Protein consumption could change considerably.

Limitation:
* The only limitation of this study is that only a small group of people and their liver condition was examined. We could get more interesting results by performing this test on different variables such as age group, gender and the levels of different chemicals in the liver.