Introduction

In recent times, diabetes has become a common metabolic disease. It is assessed through several parameters, one of which is an elevated blood glucose level.
We have taken a dataset on diabetes that covers people aged 10 to 80, with gender recorded as Male, Female or Other.
A lot of information about the population with diabetes is readily available from open sources on the internet, which leads us to ask which gender is more prone to diabetes and whether the reasons are genetic, hereditary or otherwise.
People with higher blood glucose levels are more likely to have diabetes, so we can use blood glucose level as one of the parameters to address our problem statement.

Problem Statement

The purpose of this investigation is to analyze the data and determine which gender is more likely to have diabetes, based on the dataset we have chosen.
To do so, we shall perform hypothesis testing to identify the relationship and assess its significance.
We will first use descriptive statistics to understand the data, and then determine whether there is a significant difference between the variables we are considering, or whether the observed frequencies differ from the expected ones.
Using R, we can run hypothesis tests such as the two-sided t-test, which compares the means of two groups, or the chi-square test of association, which checks whether two categorical variables are related. This lets us ground our investigation in our objective.

Data

The data was collected from Kaggle, from the dataset named Diabetes Prediction Dataset (Mohammed Mustafa, n.d.).
The dataset contains the following information:
- Age
- Gender
- BMI
- Smoking History
- Blood Glucose Level
- Diabetes Status
We will treat Gender and Blood Glucose Level as our key variables for the tests.
Blood Glucose Level is a numerical variable, whereas Gender is a categorical variable with three levels: Male, Female and Other.

Data Preprocessing

As part of the data pre-processing activity, we shall conduct the following tasks:

- Calculate the mean, median, minimum, maximum and other summary statistics for the blood glucose level column.
- Check whether the data has any missing values and, if so, remove them with an appropriate function.
- Check whether the data has any outliers and, if so, take the necessary steps to treat them.
- Check whether blood glucose level is normally distributed within each gender group, using the filter function to split the data.
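The first three items of this checklist can be sketched in base R as follows. This is a minimal illustration on simulated data, not the Kaggle file: the data frame `toy` and all of its values are assumptions made for the example, and only the column names `gender` and `blood_glucose_level` come from the dataset.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  gender = sample(c("Female", "Male"), 200, replace = TRUE),
  blood_glucose_level = round(rnorm(200, mean = 138, sd = 40))
)
toy$blood_glucose_level[c(5, 17)] <- NA   # inject two missing values

# 1. Summary statistics for the numeric column
summary(toy$blood_glucose_level)

# 2. Count missing values, then drop the incomplete rows
n_missing <- sum(is.na(toy$blood_glucose_level))
toy_clean <- toy[!is.na(toy$blood_glucose_level), ]

# 3. Flag outliers with the 1.5 * IQR rule
q <- quantile(toy_clean$blood_glucose_level, probs = c(0.25, 0.75))
fence <- 1.5 * IQR(toy_clean$blood_glucose_level)
is_outlier <- toy_clean$blood_glucose_level < q[1] - fence |
  toy_clean$blood_glucose_level > q[2] + fence

n_missing          # number of missing values found
sum(is_outlier)    # number of values flagged as outliers
```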

Descriptive Statistics and Visualisation

First, we load the dataset; thereafter we perform the data pre-processing to make sure the data is fit for hypothesis testing.

Diabetics <- read.csv("diabetes_prediction_dataset.csv")
head(Diabetics)

Descriptive Statistics Cont.

The table below shows the summary statistics of blood_glucose_level by gender: minimum, maximum, quartiles, median, mean and standard deviation. The code also counts missing values, which we need to deal with before moving on to hypothesis testing.

library(dplyr)  # provides %>%, group_by() and summarise()

# Summary statistics of blood glucose level for each gender
table1 <- Diabetics %>%
  group_by(gender) %>%
  summarise(
    Min = min(blood_glucose_level, na.rm = TRUE),
    Q1 = quantile(blood_glucose_level, probs = .25, na.rm = TRUE),
    Median = median(blood_glucose_level, na.rm = TRUE),
    Q3 = quantile(blood_glucose_level, probs = .75, na.rm = TRUE),
    Max = max(blood_glucose_level, na.rm = TRUE),
    Mean = mean(blood_glucose_level, na.rm = TRUE),
    SD = sd(blood_glucose_level, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(blood_glucose_level))
  )
knitr::kable(table1)

gender	Min	Q1	Median	Q3	Max	Mean	SD	n
Female	80	100	140	159.00	300	137.4690	40.10283	58552
Male	80	100	140	159.00	300	138.8900	41.53797	41430
Other	80	126	158	159.75	200	139.4444	33.38055	18

Descriptive Statistics Cont.

We plot a boxplot to check whether the data has any outliers. If it does, we have to take a few additional steps to treat them.

boxplot(blood_glucose_level~gender, data = Diabetics, ylab = "Blood Glucose Level", xlab= "Gender")

Descriptive Statistics Cont.

Looking at the visualization, we can observe that there are a lot of outliers.
We need to treat them, which will ultimately reduce the standard deviation of the values.
The function below caps the outliers: values lying more than 1.5 times the interquartile range beyond the quartiles are replaced by the 5th or 95th percentile. The boxplot that follows shows the result.

library(ggplot2)  # needed for the boxplot below

# Cap outliers: values beyond 1.5 * IQR from the quartiles are replaced
# by the 5th (lower) or 95th (upper) percentile
out_norm <- function(x) {
  qntile <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qntile[1] - H)] <- caps[1]
  x[x > (qntile[2] + H)] <- caps[2]
  return(x)
}
Diabetics$blood_glucose_level <- out_norm(Diabetics$blood_glucose_level)
ggplot(Diabetics, mapping = aes(x = gender, y = blood_glucose_level)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 4, outlier.size = 2)
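To see what this rule does to individual values, here is a small sketch on a made-up vector. The helper `cap_outliers` restates the same rule so the block runs on its own, and the numbers in `glucose` are hypothetical, not rows from the dataset.

```r
# Same capping rule as above, restated so this block is self-contained
cap_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < q[1] - h] <- caps[1]   # pull low outliers up to the 5th percentile
  x[x > q[2] + h] <- caps[2]   # pull high outliers down to the 95th percentile
  x
}

glucose <- c(80, 85, 90, 100, 126, 130, 140, 145, 155, 158, 159, 300)  # made-up values
capped <- cap_outliers(glucose)

length(capped) == length(glucose)  # TRUE: no observation is dropped
max(glucose)                       # 300
max(capped)                        # below 300: the extreme value was pulled in
```

Note that the values are capped rather than deleted, so the number of observations per gender stays the same after this step.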

Normality Check

To check whether the data is normally distributed, we draw a Q-Q plot of blood glucose level for each of the three gender groups.

library(car) # Used to plot the qq-plot
par(mfrow=c(1,3))
Gender_Male <- Diabetics %>% filter(gender == "Male")
M <- Gender_Male$blood_glucose_level %>% qqPlot(dist="norm", xlab = "Gender_Male")
Gender_Female <- Diabetics %>% filter(gender == "Female")
Fe <- Gender_Female$blood_glucose_level %>% qqPlot(dist="norm",xlab = "Gender_Female")
Gender_Others <- Diabetics %>% filter(gender == "Other")
Oth <- Gender_Others$blood_glucose_level %>% qqPlot(dist="norm",xlab = "Gender_Others")
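A Q-Q plot is read by eye, so a numeric companion can help. The sketch below, on simulated data only, computes the correlation between the sample quantiles and the theoretical normal quantiles; the helper `qq_cor` and both samples are assumptions for illustration and are not part of the analysis above.

```r
set.seed(42)
normal_sample <- rnorm(500, mean = 138, sd = 40)   # roughly bell-shaped
skewed_sample <- rexp(500, rate = 1 / 138)         # strongly right-skewed

# Correlation between sample quantiles and theoretical normal quantiles
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)   # returns the plotted coordinates without drawing
  cor(qq$x, qq$y)
}

qq_cor(normal_sample)   # close to 1: points would hug the reference line
qq_cor(skewed_sample)   # lower: points would bend away from the line
```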

Hypothesis Testing

Since gender has three levels, we can't perform the two-sided t-test, which is only feasible with two groups.
So, in this case we can perform a chi-square test of association to check whether the observed counts of blood glucose level across the gender categories match the counts expected if the two variables were unrelated.
If p < 0.05, we can state that there is a significant association between gender and blood glucose level.

dia_chi <- chisq.test(table(Diabetics$blood_glucose_level,Diabetics$gender))
dia_chi

## 
##  Pearson's Chi-squared test
## 
## data:  table(Diabetics$blood_glucose_level, Diabetics$gender)
## X-squared = 55.553, df = 28, p-value = 0.001458

head(dia_chi$observed)

##      
##       Female Male Other
##   80    4198 2907     1
##   85    4113 2787     1
##   90    4189 2921     2
##   100   4124 2901     0
##   126   4562 3138     2
##   130   4599 3195     0

head(dia_chi$expected)

##      
##         Female     Male   Other
##   80  4160.705 2944.016 1.27908
##   85  4040.674 2859.084 1.24218
##   90  4164.218 2946.502 1.28016
##   100 4113.278 2910.457 1.26450
##   126 4509.675 3190.939 1.38636
##   130 4563.543 3229.054 1.40292
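The expected counts come from the rule row total × column total / grand total. The sketch below applies it to only the six observed rows printed above, so its expected values differ from the full output, which uses every glucose level in the table. The object names are ours.

```r
# The six observed rows displayed above (a subset of the full table)
obs <- matrix(
  c(4198, 2907, 1,
    4113, 2787, 1,
    4189, 2921, 2,
    4124, 2901, 0,
    4562, 3138, 2,
    4599, 3195, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(glucose = c("80", "85", "90", "100", "126", "130"),
                  gender = c("Female", "Male", "Other"))
)

# Expected count under no association: row total * column total / grand total
expected_manual <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# chisq.test() computes the same quantity (it may warn about small expected counts)
fit <- suppressWarnings(chisq.test(obs))
all.equal(unname(expected_manual), unname(fit$expected))

# Cells with an expected count below 5, where the chi-square approximation is weakest
sum(fit$expected < 5)
```

Every cell in the Other column has an expected count below 5, here as in the full output above, which is a common rule-of-thumb warning sign for the chi-square approximation.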

Hypothesis Testing Cont.

The count of observations for Other is just 18, which is almost negligible in comparison to Male and Female.
So, we make the assumption that Other can be dropped from the dataset, and perform a two-sided t-test to check whether there is a significant difference between the mean blood glucose levels of the two remaining groups.

filtered_df <- filter(Diabetics, gender %in%  c("Male", "Female"))
head(filtered_df)

Hypothesis Testing Cont.

To check the homogeneity of variance, we use the function leveneTest and look at the value of p.
If the value of p is greater than 0.05, i.e. the significance level, we fail to reject the null hypothesis of equal variances; if it is less than 0.05, the variances differ between the groups.

leveneTest(blood_glucose_level~gender, data = filtered_df)
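As we understand it, leveneTest with its default center = median amounts to a one-way ANOVA on the absolute deviations from each group's median. The base R sketch below reproduces that idea on simulated data; the data frame `toy` and its values are assumptions, with unequal spread built in on purpose.

```r
set.seed(7)
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  blood_glucose_level = c(rnorm(100, 137, 20), rnorm(100, 139, 45))  # unequal spread by design
)

# Absolute deviation of each value from its own group's median
group_median <- ave(toy$blood_glucose_level, toy$gender, FUN = median)
toy$abs_dev <- abs(toy$blood_glucose_level - group_median)

# One-way ANOVA on the deviations; a small p-value points to unequal variances
levene_fit <- anova(lm(abs_dev ~ gender, data = toy))
p_value <- levene_fit[["Pr(>F)"]][1]
p_value < 0.05   # TRUE here, because the two groups were simulated with different spreads
```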

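Hypothesis Testing Cont.

As the value of p from Levene's test is less than 0.05, we reject the null hypothesis of equal variances. Strictly, this calls for Welch's t-test (var.equal = FALSE); the test below was run with var.equal = TRUE, and with samples this large the two versions should give very similar results.
From the two-sided t-test, we need to draw a conclusion about the null hypothesis of equal means: if the value of p is less than 0.05 and the 95% confidence interval doesn't contain zero, we can reject it.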

result<- t.test(blood_glucose_level ~ gender,
 data = filtered_df,
 var.equal = TRUE,
 alternative = "two.sided"
 )
result

## 
##  Two Sample t-test
## 
## data:  blood_glucose_level by gender
## t = -4.3246, df = 99980, p-value = 1.53e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.4627625 -0.5503688
## sample estimates:
## mean in group Female   mean in group Male 
##             136.0022             137.0088
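The pieces of this output fit together, and the sketch below shows how. Using only the reported means, t statistic and degrees of freedom, it recovers the standard error, the 95% confidence interval and the p-value; small discrepancies are due to rounding in the printed figures. The variable names are ours.

```r
# Values copied from the t-test output above
mean_female <- 136.0022
mean_male   <- 137.0088
t_stat      <- -4.3246
df          <- 99980

diff_means <- mean_female - mean_male   # estimated difference (Female - Male)
se <- diff_means / t_stat               # standard error implied by t = diff / se

# 95% confidence interval: estimate +/- critical value * standard error
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(ci, 3)
signif(p_value, 3)
```

Both limits of the interval are negative, so zero lies outside it, which agrees with the p-value being below 0.05.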

Hypothesis Testing Cont.

Looking at the values of p for both tests, we can observe that each is less than 0.05. Hence, for the t-test we reject the null hypothesis H0 stated below, and the 95% CI does not capture 0, the value of \(\mu_1 - \mu_2\) under H0. The hypotheses are as follows:

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

Discussion

The Q-Q plots show that for Male and Female most points deviate from the reference line. For Other, which has very few observations, most points lie close to the line, so the data for that group appears approximately normally distributed, although 18 observations are too few to judge this reliably. Because the Male and Female samples are very large, the t-test on their means remains reasonable under the central limit theorem.
Before making the assumption, we ran the chi-square test of association, where the value of p (0.001458) is less than the significance level of 0.05. Hence we can reject H0 and conclude that the association between gender and blood glucose level is statistically significant.
We then made the assumption to drop Other from the gender column, as it had just 18 of almost 100,000 observations, and ran leveneTest first. Its p-value is well below 0.05, so the variances of the two groups are not homogeneous.
After this, we performed the two-sided t-test, in which the mean for males was 137.0088 and for females 136.0022 after outlier treatment (138.8900 and 137.4690 before it), which is slightly higher for men.
As the value of p is less than 0.05, we can say that the mean blood glucose levels of males and females differ. Looking at the means, males are slightly more likely than females to have diabetes in this dataset, as their blood glucose levels are marginally higher. The difference is only about one unit, and the data does not tell us the medical reason behind it.
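To put the size of this difference in context, a standardized effect size can be recovered from figures already reported: the t statistic and the two group sizes from the summary table. The sketch below uses the identity d = t × sqrt(1/n1 + 1/n2) for a pooled two-sample t-test. Cohen's d is our addition here, not a quantity computed in the analysis above.

```r
# Values reported earlier: t statistic from the t-test, group sizes from the summary table
t_stat   <- -4.3246
n_female <- 58552
n_male   <- 41430

# Cohen's d implied by a pooled two-sample t statistic
cohens_d <- t_stat * sqrt(1 / n_female + 1 / n_male)
round(cohens_d, 3)
```

An absolute value this far below 0.2, the usual threshold for a small effect, suggests the gap between the genders is statistically detectable mainly because the sample is so large.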

MATH1324 Introduction to Statistics Assignment 2

Hypothesis Testing of Diabetics Dataset
