MATH1324 Introduction to Statistics Assignment 2

Hypothesis Testing of Diabetics Dataset

Vaibhav Kulkarni (S3959656)

Last updated: 28 May, 2023

Introduction

Problem Statement

Data

Data Preprocessing

As part of the data pre-processing activity, we shall conduct the following tasks:

Descriptive Statistics and Visualisation

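library(dplyr)    # provides %>%, group_by(), summarise() and filter()
library(ggplot2)  # provides ggplot() and geom_boxplot()
Diabetics <- read.csv("diabetes_prediction_dataset.csv")
head(Diabetics)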

Descriptive Statistics Cont.

Diabetics %>% group_by(gender) %>% summarise(Min = min(blood_glucose_level,na.rm = TRUE),
                                           Q1 = quantile(blood_glucose_level,probs = .25,na.rm = TRUE),
                                           Median = median(blood_glucose_level, na.rm = TRUE),
                                           Q3 = quantile(blood_glucose_level,probs = .75,na.rm = TRUE),
                                           Max = max(blood_glucose_level,na.rm = TRUE),
                                           Mean = mean(blood_glucose_level, na.rm = TRUE),
                                           SD = sd(blood_glucose_level, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(blood_glucose_level))) -> table1
knitr::kable(table1)
gender   Min   Q1    Median   Q3       Max   Mean       SD         n       Missing
Female   80    100   140      159.00   300   137.4690   40.10283   58552   0
Male     80    100   140      159.00   300   138.8900   41.53797   41430   0
Other    80    126   158      159.75   200   139.4444   33.38055   18      0

Descriptive Statistics Cont.

We plot a boxplot to check whether the data contain any outliers. If they do, a few additional steps are needed to handle them.

boxplot(blood_glucose_level~gender, data = Diabetics, ylab = "Blood Glucose Level", xlab= "Gender")

Descriptive Statistics Cont.

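# Cap outliers: values more than 1.5 * IQR beyond the quartiles are
# replaced with the 5th / 95th percentile instead of being dropped.
out_norm <- function(x){
   qntile <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)
   x[x < (qntile[1] - H)] <- caps[1]
   x[x > (qntile[2] + H)] <- caps[2]
   return(x)
}
Diabetics$blood_glucose_level <- out_norm(Diabetics$blood_glucose_level)
# Redraw the boxplot after capping; any remaining outliers are marked in red
ggplot(Diabetics, mapping = aes(x = gender, y = blood_glucose_level)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 4, outlier.size = 2)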
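
To see what this capping rule does, the following minimal sketch applies the same logic to a small made-up vector. The values and the helper name `cap_outliers` are illustrative assumptions, not taken from the dataset. Values beyond 1.5 * IQR from the quartiles are replaced with the 5th or 95th percentile rather than dropped, so the sample size stays the same.

```r
# Illustrative helper mirroring the capping rule (hypothetical name).
cap_outliers <- function(x) {
  q    <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)  # quartiles
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)  # replacement values
  h    <- 1.5 * IQR(x, na.rm = TRUE)                        # fence width
  x[x < q[1] - h] <- caps[1]
  x[x > q[2] + h] <- caps[2]
  x
}

toy    <- c(90, 100, 110, 120, 130, 140, 150, 160, 170, 400)  # made-up values
capped <- cap_outliers(toy)

length(capped) == length(toy)  # no observations are removed
range(capped)                  # the extreme value 400 is pulled down
```

Because the extreme value is replaced with a percentile of the same data, a capped value can still sit outside the boxplot whiskers, which is one reason to redraw the boxplot after capping.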

Normality Check

library(car) # Used to plot the qq-plot
par(mfrow=c(1,3))
Gender_Male <- Diabetics %>% filter(gender == "Male")
M <- Gender_Male$blood_glucose_level %>% qqPlot(dist="norm", xlab = "Gender_Male")
Gender_Female <- Diabetics %>% filter(gender == "Female")
Fe <- Gender_Female$blood_glucose_level %>% qqPlot(dist="norm",xlab = "Gender_Female")
Gender_Others <- Diabetics %>% filter(gender == "Other")
Oth <- Gender_Others$blood_glucose_level %>% qqPlot(dist="norm",xlab = "Gender_Others")

Hypothesis Testing

dia_chi <- chisq.test(table(Diabetics$blood_glucose_level,Diabetics$gender))
dia_chi
## 
##  Pearson's Chi-squared test
## 
## data:  table(Diabetics$blood_glucose_level, Diabetics$gender)
## X-squared = 55.553, df = 28, p-value = 0.001458
head(dia_chi$observed)
##      
##       Female Male Other
##   80    4198 2907     1
##   85    4113 2787     1
##   90    4189 2921     2
##   100   4124 2901     0
##   126   4562 3138     2
##   130   4599 3195     0
head(dia_chi$expected)
##      
##         Female     Male   Other
##   80  4160.705 2944.016 1.27908
##   85  4040.674 2859.084 1.24218
##   90  4164.218 2946.502 1.28016
##   100 4113.278 2910.457 1.26450
##   126 4509.675 3190.939 1.38636
##   130 4563.543 3229.054 1.40292
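
The expected counts printed above follow from the marginal totals: each cell is (row total * column total) / N. As a quick check, this sketch recomputes the first row (glucose level 80) from the observed counts above and the group sizes in the descriptive statistics table. The variable names are illustrative.

```r
# Observed counts for glucose level 80, copied from the output above
row_80 <- c(Female = 4198, Male = 2907, Other = 1)

# Group sizes (column totals), copied from the descriptive statistics table
group_n <- c(Female = 58552, Male = 41430, Other = 18)

n_total     <- sum(group_n)                      # total number of observations
expected_80 <- sum(row_80) * group_n / n_total   # row total * column total / N
round(expected_80, 3)
```

The expected counts shown for the Other group are well below 5, the usual rule of thumb for the chi-squared approximation, so the result for that group should be read with caution.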

Hypothesis Testing Cont.

filtered_df <- filter(Diabetics, gender %in%  c("Male", "Female"))
head(filtered_df)

Hypothesis Testing Cont.

leveneTest(blood_glucose_level~gender, data = filtered_df)
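
For intuition, Levene's test with its default median centring can be reproduced as a one-way ANOVA on the absolute deviations from each group's median. The sketch below uses simulated data, not the diabetes dataset, and the group labels, sample sizes and 0.05 threshold are assumptions chosen for illustration.

```r
set.seed(1)  # simulated data for illustration only
group <- factor(rep(c("A", "B"), each = 50))
y     <- c(rnorm(50, mean = 140, sd = 40), rnorm(50, mean = 140, sd = 42))

# Absolute deviation of each observation from its own group's median
dev <- abs(y - ave(y, group, FUN = median))

# One-way ANOVA on the deviations; the F test is the Levene statistic
levene  <- anova(lm(dev ~ group))
p_value <- levene[["Pr(>F)"]][1]

# A p-value at or above 0.05 gives no evidence against equal variances,
# which is the situation in which var.equal = TRUE is reasonable in t.test()
equal_var <- p_value >= 0.05
```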

Hypothesis Testing Cont.

result<- t.test(blood_glucose_level ~ gender,
 data = filtered_df,
 var.equal = TRUE,
 alternative = "two.sided"
 )
result
## 
##  Two Sample t-test
## 
## data:  blood_glucose_level by gender
## t = -4.3246, df = 99980, p-value = 1.53e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.4627625 -0.5503688
## sample estimates:
## mean in group Female   mean in group Male 
##             136.0022             137.0088
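
As a sanity check, the confidence interval and p-value can be approximately reconstructed from the numbers printed above. This sketch uses only the reported means, t statistic and group sizes, so the results agree with the output up to rounding. The variable names are illustrative.

```r
mean_female <- 136.0022
mean_male   <- 137.0088
t_stat      <- -4.3246
df          <- 58552 + 41430 - 2   # n_Female + n_Male - 2 = 99980

diff <- mean_female - mean_male    # estimated difference in means
se   <- diff / t_stat              # standard error implied by t = diff / se

# 95% confidence interval for the difference in means
ci <- diff + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value
p <- 2 * pt(-abs(t_stat), df)
```

The interval lies entirely below zero, which matches the small p-value: a difference of 0 is not among the plausible values at the 95% level.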

Hypothesis Testing Cont.

The p-values for both tests are below 0.05 (chi-squared test: p = 0.001458; two-sample t-test: p = 1.53e-05). For the t-test we therefore reject H0, and the 95% CI for the difference in means, [-1.46, -0.55], does not capture the value implied by H0, \(\mu_1 - \mu_2 = 0\). The null and alternative hypotheses are stated as follows:

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

Discussion

References