Vaibhav Kulkarni (S3959656)
Last updated: 28 May, 2023
As part of the data pre-processing activity, we shall conduct the following tasks:
Diabetics %>% group_by(gender) %>% summarise(Min = min(blood_glucose_level,na.rm = TRUE),
Q1 = quantile(blood_glucose_level,probs = .25,na.rm = TRUE),
Median = median(blood_glucose_level, na.rm = TRUE),
Q3 = quantile(blood_glucose_level,probs = .75,na.rm = TRUE),
Max = max(blood_glucose_level,na.rm = TRUE),
Mean = mean(blood_glucose_level, na.rm = TRUE),
SD = sd(blood_glucose_level, na.rm = TRUE),
n = n(),
Missing = sum(is.na(blood_glucose_level))) -> table1
knitr::kable(table1)

| gender | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Female | 80 | 100 | 140 | 159.00 | 300 | 137.4690 | 40.10283 | 58552 | 0 |
| Male | 80 | 100 | 140 | 159.00 | 300 | 138.8900 | 41.53797 | 41430 | 0 |
| Other | 80 | 126 | 158 | 159.75 | 200 | 139.4444 | 33.38055 | 18 | 0 |
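We plot a boxplot of blood glucose level by gender to check whether the data contain outliers. If they do, a few additional steps are needed to deal with them. Here, values lying more than 1.5 × IQR beyond the quartiles are capped at the 5th and 95th percentiles rather than removed.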
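# Cap outliers: values outside the 1.5 * IQR fences are replaced by the
# 5th / 95th percentile of the data.
out_norm <- function(x){
  qntile <- quantile(x, probs = c(.25, .75), na.rm = TRUE)  # Q1 and Q3
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)    # replacement values
  H <- 1.5 * IQR(x, na.rm = TRUE)                           # fence width
  x[x < (qntile[1] - H)] <- caps[1]  # low outliers -> 5th percentile
  x[x > (qntile[2] + H)] <- caps[2]  # high outliers -> 95th percentile
  return(x)
}
Diabetics$blood_glucose_level <- out_norm(Diabetics$blood_glucose_level)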
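To make the effect of `out_norm()` concrete, here is a minimal sketch on a made-up vector. The values in `toy` are illustrative only and are not taken from the `Diabetics` data, and `cap_outliers()` is a hypothetical name that repeats the same capping rule so the example runs on its own.

```r
# Same capping rule as out_norm(), repeated so this example is self-contained.
cap_outliers <- function(x) {
  qntile <- quantile(x, probs = c(.25, .75), na.rm = TRUE)  # Q1 and Q3
  caps   <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # replacement values
  H      <- 1.5 * IQR(x, na.rm = TRUE)                      # fence width
  x[x < (qntile[1] - H)] <- caps[1]
  x[x > (qntile[2] + H)] <- caps[2]
  x
}

# Illustrative values (assumption): one reading far above the rest.
toy <- c(80, 85, 90, 100, 126, 130, 140, 145, 155, 158, 159, 160, 200, 300)
capped <- cap_outliers(toy)

# Compare the original and capped vectors side by side.
data.frame(original = toy, capped = capped)
```

Only the value above the upper fence changes; every other observation is left as it was. Because outliers are capped rather than dropped, the number of observations stays the same.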
ggplot(Diabetics, mapping = aes(x = gender, y = blood_glucose_level)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 4, outlier.size = 2)

library(car) # Used to plot the qq-plot
par(mfrow=c(1,3))
Gender_Male <- Diabetics %>% filter(gender == "Male")
M <- Gender_Male$blood_glucose_level %>% qqPlot(dist="norm", xlab = "Gender_Male")
Gender_Female <- Diabetics %>% filter(gender == "Female")
Fe <- Gender_Female$blood_glucose_level %>% qqPlot(dist="norm",xlab = "Gender_Female")
Gender_Others <- Diabetics %>% filter(gender == "Other")
Oth <- Gender_Others$blood_glucose_level %>% qqPlot(dist="norm", xlab = "Gender_Others")

##
##  Pearson's Chi-squared test
##
## data:  table(Diabetics$blood_glucose_level, Diabetics$gender)
## X-squared = 55.553, df = 28, p-value = 0.001458
##
##       Female Male Other
##   80    4198 2907     1
##   85    4113 2787     1
##   90    4189 2921     2
##   100   4124 2901     0
##   126   4562 3138     2
##   130   4599 3195     0
##
##         Female     Male   Other
##   80  4160.705 2944.016 1.27908
##   85  4040.674 2859.084 1.24218
##   90  4164.218 2946.502 1.28016
##   100 4113.278 2910.457 1.26450
##   126 4509.675 3190.939 1.38636
##   130 4563.543 3229.054 1.40292
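The chi-squared output above names `table(Diabetics$blood_glucose_level, Diabetics$gender)` as its input, but the call that produced it is not shown. The following is a sketch of how output of this shape can be obtained with `chisq.test()`, run on a small synthetic data frame (`toy` and its values are assumptions for illustration, not the report's data).

```r
set.seed(1)

# Synthetic stand-in for the Diabetics data (assumption, illustration only).
toy <- data.frame(
  gender = sample(c("Female", "Male"), 200, replace = TRUE),
  blood_glucose_level = sample(c(80, 100, 126, 140), 200, replace = TRUE)
)

# Contingency table of glucose level (rows) by gender (columns).
tab <- table(toy$blood_glucose_level, toy$gender)

chi <- chisq.test(tab)
chi                  # test statistic, degrees of freedom, p-value
head(chi$observed)   # observed counts
head(chi$expected)   # counts expected under independence
```

The degrees of freedom equal (rows - 1) × (columns - 1), so the reported df = 28 is consistent with 15 distinct glucose values and 3 gender categories. Note that the expected counts in the `Other` column above are well below 5, a situation in which `chisq.test()` warns that the chi-squared approximation may be incorrect.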
result <- t.test(blood_glucose_level ~ gender,
                 data = filtered_df,
                 var.equal = TRUE,
                 alternative = "two.sided"
)
result

##
##  Two Sample t-test
##
## data:  blood_glucose_level by gender
## t = -4.3246, df = 99980, p-value = 1.53e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.4627625 -0.5503688
## sample estimates:
## mean in group Female   mean in group Male
##             136.0022             137.0088
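`filtered_df` is not defined in the code shown. Since the formula interface of `t.test()` needs a grouping variable with exactly two levels, and the output compares Female and Male, it is presumably the data restricted to those two groups. A minimal sketch under that assumption, using made-up values and base R subsetting (the names `toy` and `filtered_toy` are hypothetical):

```r
# Made-up data with three gender categories (assumption, illustration only).
toy <- data.frame(
  gender = rep(c("Female", "Male", "Other"), times = c(6, 5, 2)),
  blood_glucose_level = c(80, 100, 126, 140, 158, 159,
                          85, 90, 130, 145, 160,
                          126, 158)
)

# Keep only the two groups being compared; t.test() with a formula
# stops with an error if the grouping variable has more than two levels.
filtered_toy <- toy[toy$gender %in% c("Female", "Male"), ]

res <- t.test(blood_glucose_level ~ gender,
              data = filtered_toy,
              var.equal = TRUE,          # pooled-variance (Student) t-test
              alternative = "two.sided")

res$parameter  # degrees of freedom: n1 + n2 - 2
res$conf.int   # 95% CI for mean(Female) - mean(Male)
```

With `var.equal = TRUE` the degrees of freedom are n1 + n2 - 2. For the group sizes in the summary table, 58552 + 41430 - 2 = 99980, which matches the df reported above.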
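Both tests give p-values below 0.05 (0.001458 for the chi-squared test and 1.53e-05 for the two-sample t-test). For the t-test, we therefore reject the null hypothesis \(H_0\) at the 5% significance level. Consistent with this, the 95% confidence interval for the difference in means, [-1.46, -0.55], does not contain 0, the value of \(\mu_1 - \mu_2\) under \(H_0\). The null and alternative hypotheses are: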
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
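For reference, because the test was run with `var.equal = TRUE`, the reported statistic is the pooled-variance t statistic. Writing \(\bar{x}_i\), \(s_i\) and \(n_i\) for the sample mean, standard deviation and size of group \(i\) (notation assumed here, not defined elsewhere in the report):

\[s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]
\[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2\]
\[(\bar{x}_1 - \bar{x}_2) \pm t_{0.975,\,df}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\]

As a check, the midpoint of the reported interval, (-1.4627625 - 0.5503688)/2 ≈ -1.0066, equals the difference of the reported sample means, 136.0022 - 137.0088.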