ANALYSIS OF DEMOGRAPHICS OF DIABETES PATIENTS

APPLIED ANALYTICS PROJECT 2

Su Myat Noe Yee(S3913797), Priya Krishnamurthi Chandra(S3939191), Usman Khalid(S3914769)

Last updated: 29 May, 2022

Introduction

  1. Chi-square test of association
  2. Two sample t-test for independent samples
  3. One sample t-test
  4. Interval estimation

Problem Statement

Data

Data Cont.

Data Cont.

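library(readr)        # read_csv()
library(dplyr)        # %>%, group_by(), summarise(), filter()
library(car)          # qqPlot(), leveneTest()
library(RColorBrewer) # brewer.pal()
diabetes <- read_csv("diabetes.csv")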
typeof(diabetes$Age)
## [1] "double"
class(diabetes$Gender)
## [1] "character"
unique(diabetes$Gender)
## [1] "Male"   "Female"
#Changing Gender to factors
diabetes$Gender <- factor(diabetes$Gender, 
                          levels = c("Male","Female"),
                          labels = c("Male","Female"))
unique(diabetes$class)
## [1] "Positive" "Negative"
#Changing Diabetes Class to factors
diabetes$class <- factor(diabetes$class,
                         levels = c("Positive", "Negative"),
                         labels = c("Positive", "Negative"))

Descriptive Statistics 1: What is the age of the patients who developed diabetes in the sample?

diabetes %>% group_by(class) %>% summarise(Min = min(Age, na.rm = TRUE),
                                           Q1 = quantile(Age, probs = 0.25, na.rm=TRUE),
                                           Median = median(Age, na.rm = TRUE),
                                           Q3 = quantile(Age, probs = 0.75, na.rm = TRUE),
                                           Max = max(Age, na.rm = TRUE),
                                           Mean = mean(Age, na.rm = TRUE),
                                           SD = sd(Age, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(Age))) -> table1
knitr::kable(table1)
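| class    | Min | Q1 | Median | Q3 | Max |     Mean |       SD |   n | Missing |
|:---------|----:|---:|-------:|---:|----:|---------:|---------:|----:|--------:|
| Positive |  16 | 39 |     48 | 57 |  90 | 49.07187 | 12.09748 | 320 |       0 |
| Negative |  26 | 37 |     45 | 55 |  72 | 46.36000 | 12.08098 | 200 |       0 |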
age_diabetes <- diabetes %>% boxplot(Age ~ class, data = ., 
                                     xlab = "Diabetes",
                                     ylab = "Age",
                                     main = "Age and Diabetes")

Finding : The diabetic (Positive) group is slightly older on average (mean 49.1 vs 46.4 years; median 48 vs 45), which suggests that older people may be more likely to have diabetes. We therefore test whether age is associated with diabetes status using a Chi-square test of association.

Analysis 1: Is age associated with having diabetes? We will use a Chi-square test of association.

#Creating age categories 
no_of_samples <- dim(diabetes)[1]
age_encoded <- rep(0, no_of_samples)
for (i in seq_len(no_of_samples)) {
  if (diabetes$Age[i] < 30){
    age_encoded[i] = 1
  }  else if (diabetes$Age[i] >= 30 & diabetes$Age[i] < 40){
    age_encoded[i] = 2
  } else if (diabetes$Age[i] >= 40 & diabetes$Age[i] < 50){
    age_encoded[i] = 3
  } else if (diabetes$Age[i] >= 50){
    age_encoded[i] = 4
  }
}
diabetes$age_cat <- age_encoded
#Changing to factors variables for age categories
diabetes$age_cat<- factor(diabetes$age_cat, 
                          levels = c(1,2,3,4), 
                          labels = c("< 30","30 - 39","40 - 49","> 50"),
                          ordered = TRUE)
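
The same binning can be sketched without an explicit loop using base R's cut(). This is an illustrative alternative, not the code used for the results in this report. The `ages` vector is made-up example data, and the labels are copied from the coding above (note that the last label, "> 50", also covers age 50 exactly).

```r
# Hypothetical example ages (not taken from diabetes.csv)
ages <- c(16, 29, 30, 39, 40, 49, 50, 90)

# right = FALSE makes each interval closed on the left: [30, 40), [40, 50), ...
age_cat <- cut(ages,
               breaks = c(-Inf, 30, 40, 50, Inf),
               labels = c("< 30", "30 - 39", "40 - 49", "> 50"),
               right = FALSE,
               ordered_result = TRUE)

# Count how many example ages fall into each category
table(age_cat)
```

Using half-open intervals keeps the boundaries identical to the if/else chain, so an age of exactly 30, 40 or 50 lands in the upper category in both versions.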

#Cross tabulation of data
table3 <- table(diabetes$age_cat, diabetes$class)
table3 %>% addmargins()
##          
##           Positive Negative Sum
##   < 30           8       12  20
##   30 - 39       77       47 124
##   40 - 49       88       63 151
##   > 50         147       78 225
##   Sum          320      200 520
#Distribution of age category conditional on class (column proportions)
table4 <- table3 %>% prop.table(margin=2) %>% round(2)
knitr::kable(table4)
|         | Positive | Negative |
|:--------|---------:|---------:|
| < 30    |     0.03 |     0.06 |
| 30 - 39 |     0.24 |     0.23 |
| 40 - 49 |     0.28 |     0.32 |
| > 50    |     0.46 |     0.39 |
#Visualize the association between age category and diabetes status using a clustered bar chart.
#Bars are grouped by diabetes class (columns of table4); the legend shows the age categories.
barplot(table4,
        main = "Diabetes by Age Group",
        ylab = "Proportion within Diabetes Class",
        ylim = c(0, 1),
        legend = rownames(table4),
        beside = TRUE,
        args.legend = list(x = "topright", horiz = TRUE, title = "Age Category"),
        xlab = "Diabetes Class",
        col = brewer.pal(4, name = "RdBu"))

- The age distribution differs somewhat between the Positive and Negative groups (the bar heights are not identical), which hints at an association between age and diabetes. A Chi-square test of association is needed to determine whether this relationship is statistically significant.
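
Because the direction of conditioning is easy to mix up, the following sketch rebuilds the cross tabulation from the counts shown above and contrasts the two options. With margin = 2 each column (diabetes class) sums to 1, as in table4; with margin = 1 each row (age group) sums to 1, which gives the proportion of Positive cases within each age group. The object names `counts`, `within_class` and `within_age` are introduced here for illustration only.

```r
# Observed counts copied from the cross tabulation above
counts <- matrix(c(8, 77, 88, 147,    # Positive column
                   12, 47, 63, 78),   # Negative column
                 nrow = 4,
                 dimnames = list(age_cat = c("< 30", "30 - 39", "40 - 49", "> 50"),
                                 class = c("Positive", "Negative")))

# Age distribution within each diabetes class (columns sum to 1)
within_class <- round(prop.table(counts, margin = 2), 2)

# Diabetes status within each age group (rows sum to 1)
within_age <- round(prop.table(counts, margin = 1), 2)

within_class
within_age
```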

Analysis 1: Hypothesis Testing

\[H_0: \text{There is no association between age and having diabetes}\]
\[H_A: \text{There is an association between age and having diabetes}\]

Assumption: No more than 25% of expected cell counts are below 5.

Decision Rule: Reject H0 if the p-value is less than 0.05 (alpha significance level). Otherwise, fail to reject H0.

Conclusion: The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Chi-square test of association

chi2age <- chisq.test(table3)
chi2age
## 
##  Pearson's Chi-squared test
## 
## data:  table3
## X-squared = 5.9835, df = 3, p-value = 0.1124
chi2age$expected
##          
##            Positive  Negative
##   < 30     12.30769  7.692308
##   30 - 39  76.30769 47.692308
##   40 - 49  92.92308 58.076923
##   > 50    138.46154 86.538462
chi2age$observed
##          
##           Positive Negative
##   < 30           8       12
##   30 - 39       77       47
##   40 - 49       88       63
##   > 50         147       78
qchisq(p = .95,df = 3)
## [1] 7.814728
pchisq(q = 5.9835,df = 3,lower.tail = FALSE)
## [1] 0.1124158
chi2age$p.value
## [1] 0.1124169

Conclusion : No cell has an expected count below 5 (the smallest is 7.69), so the assumption is met. The p-value of 0.1124 is not less than 0.05 (alpha significance level); equivalently, the test statistic of 5.98 is below the critical value of 7.81. We therefore fail to reject H0, and the Chi-square test of association is not statistically significant. There was no evidence of an association between age group and diabetes status. Diabetes is commonly claimed to be strongly associated with age, but our sample did not show this.
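
As a check on the reported statistic, the sketch below recomputes the Pearson Chi-square value by hand from the observed counts listed above: each expected count is (row total x column total) / n, and the statistic is the sum of (observed - expected)^2 / expected over all cells. The object names are introduced here for illustration; the counts are copied from the cross tabulation.

```r
# Observed counts copied from the age-by-class cross tabulation
observed <- matrix(c(8, 77, 88, 147,    # Positive column
                     12, 47, 63, 78),   # Negative column
                   nrow = 4,
                   dimnames = list(age_cat = c("< 30", "30 - 39", "40 - 49", "> 50"),
                                   class = c("Positive", "Negative")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic, degrees of freedom and upper-tail p-value
statistic <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

c(statistic = statistic, df = df, p_value = p_value)
```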

Descriptive Statistics 2: What is the mean age of female and male patients with diabetes?

positive <- diabetes %>% filter(class == "Positive")
boxplot(positive$Age ~ positive$Gender, xlab = "Gender", ylab = "Age", main = "Gender and age of having diabetes")

#Mean age of men with diabetes
positive_male <- diabetes %>% filter(class == "Positive" & Gender == "Male")
mean_age_male_positive <- mean(positive_male$Age)

#Mean age of women with diabetes
positive_female<- diabetes %>% filter(class == "Positive" & Gender == "Female")
mean_age_female_positive <- mean(positive_female$Age)

Finding : The mean age of men with diabetes is about 51 years, whereas for women it is about 47 years. Next, we examine whether this difference is statistically significant and whether gender is associated with having diabetes, using a two-sample t-test, one-sample t-based confidence intervals and a Chi-square test of association.

Analysis 2: Is there a statistically significant difference in mean age between diabetic males and females in the sample?

To assess the difference in mean age, we performed an independent two-sample t-test on the diabetic patients grouped by gender. We also calculated confidence interval estimates of the mean age for diabetic males and females.

dim(positive)
## [1] 320  18
dim(positive_male)
## [1] 147  18
dim(positive_female)
## [1] 173  18

Analysis 2: Hypothesis Testing

\[H_0: \text{The difference in mean age between diabetic males and diabetic females is 0}\]
\[H_A: \text{The difference in mean age between diabetic males and diabetic females is not 0}\]

Assumption: We assume equal variances, as both samples come from the same population, and we confirm this with Levene's test.

Decision Rule: Reject H0 if the p-value is less than 0.05 (alpha significance level). Otherwise, fail to reject H0.

Conclusion: The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Two-sample t-test (Assumptions - Normality and Homogeneity of variance)

Normality: Q-Q plot for checking normality of age of diabetic males

# Normality tests - QQ plot
positive_male$Age %>% qqPlot(dist="norm") 

## [1] 61 40

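As most of the data points fall along the line within the 95% confidence band, and the sample size (147 men) is greater than 30, the Central Limit Theorem (CLT) implies that the sampling distribution of the mean is approximately normal, so we can proceed as if normality holds.

Normality: Q-Q plot for checking normality of age of diabetic females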

# Normality tests  - QQ plot
positive_female$Age %>% qqPlot(dist="norm")

## [1] 63 91

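As most of the data points fall along the line within the 95% confidence band, and the sample size (173 women) is greater than 30, the Central Limit Theorem (CLT) implies that the sampling distribution of the mean is approximately normal, so we can proceed as if normality holds.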

# Homogeneity of Variance
leveneTest(Age ~ Gender, data = positive)

The p-value for Levene's test of equal variance for age between males and females was p = 0.72. As p > .05, we fail to reject H0 of equal variances, so it is reasonable to assume that the variances of the two groups are equal.

Two-sample t-test - Assuming Equal Variance

t.test( Age ~ Gender, data = positive, var.equal = TRUE, alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  Age by Gender
## t = 3.1924, df = 318, p-value = 0.001552
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  1.638895 6.903356
## sample estimates:
##   mean in group Male mean in group Female 
##             51.38095             47.10983

Conclusion : As the p-value of 0.001552 is less than 0.05, we reject H0. The test is statistically significant: the mean age of diabetic patients differs by gender. Diabetic males were on average 4.27 years older than diabetic females (51.38 vs 47.11 years), 95% CI [1.64, 6.90].
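
For reference, the equal-variance (pooled) two-sample t statistic reported above has the standard form below. The subscripts M and F (males, females) and the symbols s_p and SE are notation introduced here; the numbers are taken from the t-test output and the group sizes shown earlier, and the standard error is back-calculated from the reported difference and t value rather than read from the output.

```latex
\[
t = \frac{\bar{x}_M - \bar{x}_F}{s_p \sqrt{\frac{1}{n_M} + \frac{1}{n_F}}},
\qquad
s_p^2 = \frac{(n_M - 1)s_M^2 + (n_F - 1)s_F^2}{n_M + n_F - 2},
\qquad
df = n_M + n_F - 2 = 147 + 173 - 2 = 318
\]
\[
\bar{x}_M - \bar{x}_F = 51.38095 - 47.10983 = 4.27112,
\qquad
SE = \frac{\bar{x}_M - \bar{x}_F}{t} = \frac{4.27112}{3.1924} \approx 1.338
\]
```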

Analysis 3: Interval estimates for the mean age of diabetic male and female patients, obtained from one-sample t-tests.

#one-sample t-test: 95% CI for the mean age of diabetic males
t.test(positive_male$Age, conf.level = .95)$conf.int
## [1] 49.42801 53.33389
## attr(,"conf.level")
## [1] 0.95
#one-sample t-test: 95% CI for the mean age of diabetic females
t.test(positive_female$Age, conf.level = .95)$conf.int
## [1] 45.32687 48.89279
## attr(,"conf.level")
## [1] 0.95

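Conclusion : We are 95% confident that the mean age of male patients with diabetes lies between 49.43 and 53.33 years, and that the mean age of female patients with diabetes lies between 45.33 and 48.89 years. That is, if we took repeated samples, about 95% of the intervals constructed this way would contain the true mean.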
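
The interval returned by t.test() is the usual t-based interval, mean ± t(0.975, n - 1) × s / sqrt(n). The sketch below illustrates this on simulated ages; the data and the object names are hypothetical and are not drawn from diabetes.csv, so the numbers will differ from those reported above.

```r
# Simulated ages for illustration only (not the project data)
set.seed(1)
ages <- rnorm(50, mean = 50, sd = 12)

# Manual 95% confidence interval for the mean
n <- length(ages)
margin <- qt(0.975, df = n - 1) * sd(ages) / sqrt(n)
manual_ci <- c(mean(ages) - margin, mean(ages) + margin)

# Interval reported by the one-sample t-test
builtin_ci <- as.numeric(t.test(ages, conf.level = 0.95)$conf.int)

manual_ci
builtin_ci
```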

Analysis 4: Is having diabetes associated with gender (female/male)? A Chi-square test of association will be used to analyse this.

#Cross tabulation of data
table1 <- table(diabetes$class, diabetes$Gender) 
table1 %>% addmargins()
##           
##            Male Female Sum
##   Positive  147    173 320
##   Negative  181     19 200
##   Sum       328    192 520
#Distribution of class conditional on gender
table2 <- table1 %>% prop.table(margin=2) %>% round(2)

#Visualize the association between gender and whether have diabetes or not using a clustered bar chart.
barplot(table2,
        main = "Diabetes by Gender",
        ylab= "Proportion within Gender", 
        ylim=c(0,1),
        legend=rownames(table2),
        beside=TRUE,
        args.legend = list(x = "topright", horiz = TRUE, title = "Diabetes"),
        xlab="Gender")

- If there were no association between gender and diabetes, the proportions of Positive and Negative cases would be the same for males and females, so the corresponding bars would have equal heights. In the bar chart, a much larger proportion of women than men have diabetes (173 of 192 women, about 90%, versus 147 of 328 men, about 45%), i.e. the probability of having diabetes appears to depend on gender. A Chi-square test of association is needed to determine whether this relationship is statistically significant.

Analysis 4 : Hypothesis Testing

\[H_0: \text{There is no association between gender and having diabetes}\]
\[H_A: \text{There is an association between gender and having diabetes}\]

Assumption: No more than 25% of expected cell counts are below 5.

Decision Rule: Reject H0 if the p-value is less than 0.05 (alpha significance level). Otherwise, fail to reject H0.

Conclusion: The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.

Chi square test of association

chi2 <- chisq.test(table1)
chi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table1
## X-squared = 103.04, df = 1, p-value < 2.2e-16
chi2$expected
##           
##                Male    Female
##   Positive 201.8462 118.15385
##   Negative 126.1538  73.84615
chi2$observed
##           
##            Male Female
##   Positive  147    173
##   Negative  181     19
qchisq(p = .95,df = 1)
## [1] 3.841459
pchisq(q = 103.04,df = 1,lower.tail = FALSE)
## [1] 3.284493e-24
chi2$p.value
## [1] 3.289704e-24

Conclusion: No cell has an expected count below 5 (the smallest is 73.85). The p-value is less than 0.001 (p < 2.2e-16), which is below 0.05 (alpha significance level). Our decision is to reject H0. The Chi-square test of association is statistically significant: there is a statistically significant association between gender and having diabetes.
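
For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default, which subtracts 0.5 from each absolute difference |observed - expected| before squaring. The sketch below reproduces the reported statistic from the counts in the cross tabulation; the object names are introduced here for illustration.

```r
# Observed counts copied from the class-by-gender cross tabulation
observed <- matrix(c(147, 181,   # Male column: Positive, Negative
                     173, 19),   # Female column: Positive, Negative
                   nrow = 2,
                   dimnames = list(class = c("Positive", "Negative"),
                                   Gender = c("Male", "Female")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction
yates_statistic <- sum((abs(observed - expected) - 0.5)^2 / expected)

# The built-in test should give the same value for this table
builtin_statistic <- unname(chisq.test(observed)$statistic)

c(manual = yates_statistic, builtin = builtin_statistic)
```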

Discussion

References