Introduction

This presentation includes descriptive statistics and answers real-world statistical questions using statistical analysis (hypothesis testing), such as

Chi-square test of association
Two sample t-test for independent samples
One sample t-test
Interval estimation

The analysis focuses on the relationship between diabetes diagnosis and two demographic factors: age and gender. The main explorations revolve around finding possible associations between age, gender and diabetes diagnosis.
To do so, we first load the data and examine it. Some of the variables need data type conversion. Then we begin our hypothesis testing.
The chi-square test of association is used to assess the statistical significance of the association between diabetes diagnosis and gender, and also between diabetes diagnosis and age (after discretization).
The two-sample t-test for independent samples is used to test whether the mean ages of males with diabetes and females with diabetes differ. Point estimates for diabetic males and females are calculated, and interval estimates for the corresponding population parameters are found.

Problem Statement

Many factors, such as age, gender, sudden weight loss, obesity and muscle stiffness, are linked to diabetes. This project focuses on the prevalence of diabetes by gender and age.
Descriptive Statistics 1: What is the age of the patients who developed diabetes in the sample?
Analysis 1: Does age have any association with having diabetes? We use the chi-square test of association to assess that.
Descriptive Statistics 2: What is the mean age of males and females with diabetes in the data set?
Analysis 2: Is there a difference in mean age between diabetic males and females in the sample?
Analysis 3: What are the interval estimates for the mean age of diabetic male and female patients? We use the one-sample t-test.
Analysis 4: Does having diabetes have an association with gender (female/ male)? We use the chi-square test of association to analyse that.

Data

The Diabetes UCI data set was taken from Kaggle (https://www.kaggle.com/datasets/alakaaay/diabetes-uci-dataset).
It contains 520 observations of patients, each recording whether the patient had developed diabetes. The data were collected at a hospital in Sylhet, Bangladesh.
The dataset contains 17 variables.
Age : 20 - 65 as listed in the data set description (the sample values summarised later range from 16 to 90)
Gender : Male/ Female
Polyuria : Excessive urine output per day (Yes/ No)
Polydipsia : Excessive thirst (Yes/ No)
Sudden weight loss : Yes/ No
Weakness : Yes/ No
Polyphagia : Excess appetite (Yes/ No)
Genital thrush : Yes/ No
Visual blurring : Yes/ No
Itching : Yes/ No
Irritability : Yes/ No
Delayed healing : Yes/ No
Partial Paresis : Yes/ No
Muscle Stiffness : Yes/ No
Alopecia : Yes/ No
Obesity : Yes/ No
Class : Positive: Have diabetes/ Negative: Don’t have diabetes

Data Cont.

These are the 3 variables that this project focuses on:
Age : Age is a numeric variable, listed as ranging from 20 to 65 years in the data set description (the sample itself ranges from 16 to 90, as the summary table later shows). We discretized it into bins.
Age < 30 is represented by 1
Age from “30 - 39” is represented by 2
Age from “40 - 49” is represented by 3
Age 50 and above (labelled “> 50” in the output) is represented by 4
Gender : Male/ Female. Male is represented by 0, female by 1.
Class : Positive (has diabetes) is represented by 1, Negative (does not have diabetes) is represented by 0.

Gender and class are converted to factors because they are categorical variables with levels.
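The binning described above can also be sketched with base R's `cut()`, which avoids an explicit loop. This is a minimal sketch, not the code used in the project: the toy `ages` vector is made up to sit on the bin boundaries, and the label ">= 50" is our own choice to make the closed lower bound explicit.

```r
# Toy ages chosen to sit on the bin boundaries (illustrative, not from the data set)
ages <- c(16, 29, 30, 39, 40, 49, 50, 90)

# right = FALSE makes each interval closed on the left: [30, 40), [40, 50), [50, Inf)
# labels = FALSE returns the integer bin codes 1..4 instead of a factor
age_code <- cut(ages,
                breaks = c(-Inf, 30, 40, 50, Inf),
                right = FALSE,
                labels = FALSE)

# Attach readable, ordered labels to the codes
age_cat <- factor(age_code,
                  levels = 1:4,
                  labels = c("< 30", "30 - 39", "40 - 49", ">= 50"),
                  ordered = TRUE)

data.frame(ages, age_code, age_cat)
```

With `right = FALSE`, an age of exactly 30, 40 or 50 falls into the higher bin, which matches the encoding rules listed above.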

Data Cont.

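#Packages used in this analysis (not shown in the original output; assumed from the functions called)
library(readr)         # read_csv
library(dplyr)         # %>%, group_by, summarise, filter
library(car)           # qqPlot, leveneTest
library(RColorBrewer)  # brewer.pal
diabetes <- read_csv("diabetes.csv")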
typeof(diabetes$Age)

## [1] "double"

class(diabetes$Gender)

## [1] "character"

unique(diabetes$Gender)

## [1] "Male"   "Female"

#Changing Gender to factors
diabetes$Gender <- factor(diabetes$Gender, 
                          levels = c("Male","Female"),
                          labels = c("Male","Female"))
unique(diabetes$class)

## [1] "Positive" "Negative"

#Changing Diabetes Class to factors
diabetes$class <- factor(diabetes$class,
                         levels = c("Positive", "Negative"),
                         labels = c("Positive", "Negative"))

Descriptive Statistics 1: Age of the patients who developed diabetes in the sample?

diabetes %>% group_by(class) %>% summarise(Min = min(Age, na.rm = TRUE),
                                           Q1 = quantile(Age, probs = 0.25, na.rm=TRUE),
                                           Median = median(Age, na.rm = TRUE),
                                           Q3 = quantile(Age, probs = 0.75, na.rm = TRUE),
                                           Max = max(Age, na.rm = TRUE),
                                           Mean = mean(Age, na.rm = TRUE),
                                           SD = sd(Age, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(Age))) -> table1
knitr::kable(table1)

class	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Positive	16	39	48	57	90	49.07187	12.09748	320	0
Negative	26	37	45	55	72	46.36000	12.08098	200	0

age_diabetes <- diabetes %>% boxplot(Age ~ class, data = ., 
                                     xlab = "Diabetes",
                                     ylab = "Age",
                                     main = "Age and Diabetes")

Finding : Patients with diabetes (Positive) appear to be slightly older on average than those without. We therefore test whether age is associated with having diabetes using the chi-square test of association.

Analysis 1: Does age have anything to do with having diabetes? We will use Chi-square association test.

#Creating age categories 
no_of_samples <- nrow(diabetes)
age_encoded <- rep(0, no_of_samples)
#Loop over every row instead of a hard-coded 520
for (i in seq_len(no_of_samples)) {
  if (diabetes$Age[i] < 30) {
    age_encoded[i] <- 1
  } else if (diabetes$Age[i] < 40) {
    age_encoded[i] <- 2
  } else if (diabetes$Age[i] < 50) {
    age_encoded[i] <- 3
  } else {
    age_encoded[i] <- 4
  }
}
diabetes$age_cat <- age_encoded
#Changing to factors variables for age categories
diabetes$age_cat<- factor(diabetes$age_cat, 
                          levels = c(1,2,3,4), 
                          labels = c("< 30","30 - 39","40 - 49","> 50"),
                          ordered = TRUE)

#Cross tabulation of data
table3 <- table(diabetes$age_cat, diabetes$class)
table3 %>% addmargins()

##          
##           Positive Negative Sum
##   < 30           8       12  20
##   30 - 39       77       47 124
##   40 - 49       88       63 151
##   > 50         147       78 225
##   Sum          320      200 520

#Distribution of age category conditional on class (margin = 2 gives column proportions)
table4 <- table3 %>% prop.table(margin=2) %>% round(2)
knitr::kable(table4)

	Positive	Negative
< 30	0.03	0.06
30 - 39	0.24	0.23
40 - 49	0.28	0.32
> 50	0.46	0.39

#Visualize the association between age category and diabetes class using a clustered bar chart.
#The columns of table4 (Positive/ Negative) form the groups on the x-axis; the bars within each group are age categories.
barplot(table4,
        main = "Diabetes by Age Group",
        ylab = "Proportion within Diabetes Class", 
        ylim = c(0,1),
        legend = rownames(table4),
        beside = TRUE,
        args.legend = list(x = "topright", horiz = TRUE, title = "Age Category"), 
        xlab = "Diabetes Class",
        col = brewer.pal(4, name = "RdBu"))

- The proportions in each age category differ somewhat between the positive and negative classes, which hints at an association between age and diabetes. We use the chi-square test of association to check whether this relationship is statistically significant.

Analysis 1: Hypothesis Testing

\[H_0: \text{There is no association between age and having diabetes}\]
\[H_A: \text{There is an association between age and having diabetes}\]
Assumption : No more than 25% of the expected cell counts are below 5.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Chi-square test of association

chi2age <- chisq.test(table3)
chi2age

## 
##  Pearson's Chi-squared test
## 
## data:  table3
## X-squared = 5.9835, df = 3, p-value = 0.1124

chi2age$expected

##          
##            Positive  Negative
##   < 30     12.30769  7.692308
##   30 - 39  76.30769 47.692308
##   40 - 49  92.92308 58.076923
##   > 50    138.46154 86.538462

chi2age$observed

##          
##           Positive Negative
##   < 30           8       12
##   30 - 39       77       47
##   40 - 49       88       63
##   > 50         147       78

qchisq(p = .95,df = 3)

## [1] 7.814728

pchisq(q = 5.9835,df = 3,lower.tail = FALSE)

## [1] 0.1124158

chi2age$p.value

## [1] 0.1124169

Conclusion : No cell has an expected count below 5. The p-value of 0.1124 is not less than 0.05 (the alpha significance level), so we fail to reject H0. The chi-square test of association is not statistically significant: the sample provides no evidence of an association between age category and having diabetes. Diabetes is commonly reported to be associated with age, but our sample did not show this.
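The assumption check and the test statistic can be reproduced by hand from the observed counts in the cross-tabulation. The sketch below re-enters those counts as a matrix (so it runs without the CSV file); the variable names `share_small`, `x2` and `dof` are our own.

```r
# Observed counts copied from the age category by class cross-tabulation
observed <- matrix(c(8, 77, 88, 147,     # Positive column
                     12, 47, 63, 78),    # Negative column
                   nrow = 4,
                   dimnames = list(age = c("< 30", "30 - 39", "40 - 49", "> 50"),
                                   class = c("Positive", "Negative")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Share of cells with an expected count below 5 (the assumption allows at most 0.25)
share_small <- mean(expected < 5)

# Pearson chi-square statistic, degrees of freedom and upper-tail p-value
x2 <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

round(c(share_small = share_small, x2 = x2, dof = dof, p_value = p_value), 4)
```

The statistic agrees with the `chisq.test()` output above (X-squared = 5.9835, df = 3), which is below the critical value 7.814728 from `qchisq()`.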

Descriptive Statistics 2: What is the mean age of females and males with diabetes?

positive <- diabetes %>% filter(class == "Positive")
#Gender is on the x-axis and Age on the y-axis
boxplot(positive$Age ~ positive$Gender, xlab = "Gender", ylab = "Age", main = "Gender and age of having diabetes")

#Mean age of men with diabetes
positive_male <- diabetes %>% filter(class == "Positive" & Gender == "Male")
mean_age_male_positive <- mean(positive_male$Age)

#Mean age of women with diabetes
positive_female<- diabetes %>% filter(class == "Positive" & Gender == "Female")
mean_age_female_positive <- mean(positive_female$Age)

Finding : The mean age of men with diabetes is around 51 years, whereas the mean age of women with diabetes is around 47 years. We explore the role of gender further using the two-sample t-test, the one-sample t-test and the chi-square test of association.

Analysis 2: Is there a statistically significant difference in mean age between diabetic males and females in the sample?

To test the difference in mean age, we split the diabetic patients by gender and performed an independent two-sample t-test. We also calculated confidence interval estimates of the mean age for diabetic males and females.

dim(positive)

## [1] 320  18

dim(positive_male)

## [1] 147  18

dim(positive_female)

## [1] 173  18

Analysis 2: Hypothesis Testing

\[H_0: \text{The difference in mean age between males and females with diabetes is 0}\]
\[H_A: \text{The difference in mean age between males and females with diabetes is not 0}\]
Assumption : We assume equal variances, as both samples come from the same population, and we also perform Levene's test for confirmation.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Two-sample t-test (Assumptions - Normality and Homogeneity of variance)

Normal Q-Q plot for checking normality of the age of diabetic males

# Normality tests - QQ plot
positive_male$Age %>% qqPlot(dist="norm")

## [1] 61 40

As most of the data points fall along the line within the 95% confidence band and the sample size (147 men) is greater than 30, we can rely on the CLT and treat the sampling distribution of the mean as approximately normal.

Normal Q-Q plot for checking normality of the age of diabetic females

# Normality tests  - QQ plot
positive_female$Age %>% qqPlot(dist="norm")

## [1] 63 91

As most of the data points fall along the line within the 95% confidence band and the sample size (173 women) is greater than 30, we can rely on the CLT and treat the sampling distribution of the mean as approximately normal.

# Homogeneity of Variance
leveneTest(Age ~ Gender, data = positive)

The p-value for Levene’s test of equal variance in age between males and females was p = 0.72. As p > .05, we fail to reject H0, so it is reasonable to assume that the variances of the two groups are equal.

Two-sample t-test - Assuming Equal Variance

t.test( Age ~ Gender, data = positive, var.equal = TRUE, alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  Age by Gender
## t = 3.1924, df = 318, p-value = 0.001552
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  1.638895 6.903356
## sample estimates:
##   mean in group Male mean in group Female 
##             51.38095             47.10983

Conclusion : As the p-value of 0.001552 is less than 0.05, we reject H0. The test is statistically significant: the mean age of diabetic patients differs by gender, with men about 4.3 years older on average (95% CI 1.64 to 6.90).
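To make the equal-variance t-test less of a black box, here is a minimal sketch of the pooled-variance calculation that `t.test(..., var.equal = TRUE)` performs. The helper name `pooled_t` is hypothetical, and the two age vectors are simulated for illustration only; they are not the project data, so the numbers will not match the output above.

```r
# Pooled-variance (equal variance) two-sample t statistic, computed by hand
pooled_t <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  dof <- nx + ny - 2
  sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / dof   # pooled variance
  se <- sqrt(sp2 * (1 / nx + 1 / ny))                    # SE of the mean difference
  t_stat <- (mean(x) - mean(y)) / se
  p_value <- 2 * pt(abs(t_stat), df = dof, lower.tail = FALSE)  # two-sided p-value
  c(t = t_stat, df = dof, p = p_value)
}

# Simulated ages for illustration only (NOT the diabetes data)
set.seed(1)
x <- rnorm(147, mean = 51, sd = 12)
y <- rnorm(173, mean = 47, sd = 12)

by_hand <- pooled_t(x, y)
built_in <- t.test(x, y, var.equal = TRUE, alternative = "two.sided")

by_hand
```

Comparing `by_hand` with `built_in$statistic` and `built_in$p.value` confirms that the manual calculation and `t.test()` agree; the degrees of freedom are 147 + 173 - 2 = 318, as in the output above.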

Analysis 3: What are the interval estimates for the mean age of diabetic male and female patients? We use the one-sample t-test.

#one-sample t-test
t.test(positive_male$Age, conf.level = .95)$conf.int

## [1] 49.42801 53.33389
## attr(,"conf.level")
## [1] 0.95

#one-sample t-test
t.test(positive_female$Age, conf.level = .95)$conf.int

## [1] 45.32687 48.89279
## attr(,"conf.level")
## [1] 0.95

Conclusion : We are 95% confident that the mean age of male patients with diabetes lies between 49.43 and 53.33 years, and that the mean age of female patients with diabetes lies between 45.33 and 48.89 years. That is, intervals constructed this way would capture the true mean in about 95% of repeated samples.
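The interval returned by `t.test()$conf.int` is the sample mean plus or minus a t critical value times the standard error. The sketch below shows that calculation; `ci_mean` is a hypothetical helper name, and the `ages` vector is simulated for illustration rather than taken from the data set.

```r
# t-based confidence interval for a mean: mean +/- t_crit * sd / sqrt(n)
ci_mean <- function(x, conf.level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                                 # standard error of the mean
  t_crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)    # two-sided critical value
  mean(x) + c(-1, 1) * t_crit * se
}

# Simulated ages for illustration only (NOT the diabetes data)
set.seed(2)
ages <- rnorm(147, mean = 51, sd = 12)

ci_by_hand <- ci_mean(ages)
ci_built_in <- as.numeric(t.test(ages, conf.level = 0.95)$conf.int)

rbind(ci_by_hand, ci_built_in)
```

Both rows should be identical, which shows that the one-sample t-test is used here only as a convenient way to obtain the interval estimate.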

Analysis 4: Does having diabetes have association with gender (female/ male)? Chi-square test of association will be used to analyse that.

#Cross tabulation of data
table1 <- table(diabetes$class, diabetes$Gender) 
table1 %>% addmargins()

##           
##            Male Female Sum
##   Positive  147    173 320
##   Negative  181     19 200
##   Sum       328    192 520

#Distribution of class conditional on gender
table2 <- table1 %>% prop.table(margin=2) %>% round(2)

#Visualize the association between gender and whether have diabetes or not using a clustered bar chart.
barplot(table2,
        main = "Diabetes by Gender",
        ylab= "Proportion within Gender", 
        ylim=c(0,1),
        legend=rownames(table2),
        beside=TRUE,
        args.legend = list(x = "topright", horiz = TRUE, title = "Diabetes"), 
        xlab="Gender")

- If there were no association between gender and diabetes, the proportions of positive and negative cases would be the same for males and females. In the bar chart, a larger proportion of women than men have diabetes, i.e. the probability of having diabetes appears to depend on gender. We therefore use the chi-square test of association to determine whether this relationship is statistically significant.

Analysis 4 : Hypothesis Testing

\[H_0: \text{There is no association between gender and having diabetes}\]
\[H_A: \text{There is an association between gender and having diabetes}\]
Assumption : No more than 25% of the expected cell counts are below 5.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Chi square test of association

chi2 <- chisq.test(table1)
chi2

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table1
## X-squared = 103.04, df = 1, p-value < 2.2e-16

chi2$expected

##           
##                Male    Female
##   Positive 201.8462 118.15385
##   Negative 126.1538  73.84615

chi2$observed

##           
##            Male Female
##   Positive  147    173
##   Negative  181     19

qchisq(p = .95,df = 1)

## [1] 3.841459

pchisq(q = 103.04,df = 1,lower.tail = FALSE)

## [1] 3.284493e-24

chi2$p.value

## [1] 3.289704e-24

Conclusion: No cell has an expected count below 5. The p-value (3.29e-24, reported as < 2.2e-16) is less than 0.05 (the alpha significance level), so we reject H0. The chi-square test of association is statistically significant: there is a statistically significant association between gender and having diabetes.
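For a 2 x 2 table, `chisq.test()` applies Yates' continuity correction by default, which subtracts 0.5 from each absolute difference before squaring. The sketch below reproduces the statistic from the observed counts in the cross-tabulation and also shows the proportion of positive cases within each gender; the variable names are our own.

```r
# Observed counts copied from the class by gender cross-tabulation
observed <- matrix(c(147, 181,    # Male column: Positive, Negative
                     173, 19),    # Female column: Positive, Negative
                   nrow = 2,
                   dimnames = list(class = c("Positive", "Negative"),
                                   Gender = c("Male", "Female")))

# Proportion of each class within gender (columns sum to 1)
prop_by_gender <- prop.table(observed, margin = 2)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

round(prop_by_gender, 2)
c(x2_yates = x2_yates, p_value = p_value)
```

About 90% of the women in the sample are positive compared with about 45% of the men, which is the pattern behind the large test statistic (X-squared = 103.04).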

Discussion

Many factors play a role in having diabetes.
As we focused on only two factors, age and gender, the findings are as follows.
First, there is a difference in the mean age of diabetic patients by gender. The mean age is 47.11 years for women and 51.38 years for men.
To obtain interval estimates for the mean age of diabetic male and female patients, we used the one-sample t-test. With 95% confidence, the mean age of male patients with diabetes lies between 49.43 and 53.33 years, and that of female patients between 45.33 and 48.89 years.
We also explored whether gender is associated with diabetes. In this sample, a higher proportion of women than men had diabetes, and the association was statistically significant.
Surprisingly, we found no evidence of an association between age and diabetes. From the charts, we expected diabetes to become more common as people get older, but the result of the chi-square test was not significant, so we could not conclude that there is an association between age and diabetes based on this sample.
In future investigations, the same statistical methods can be used to look for associations between diabetes and other factors in the data set.

ANALYSIS OF DEMOGRAPHICS OF DIABETES PATIENTS

APPLIED ANALYTICS PROJECT 2
