Su Myat Noe Yee(S3913797), Priya Krishnamurthi Chandra(S3939191), Usman Khalid(S3914769)
Last updated: 29 May, 2022
These are the three variables this project focuses on:
Age : a numeric variable, described as ranging from 20 to 65 years (the sample summarised below spans 16 to 90 years). We discretized it into bins.
Age “< 30” is represented by 1
Age from “30 - 39” is represented by 2
Age from “40 - 49” is represented by 3
Age 50 and over is represented by 4 (labelled “> 50” in the output)
Gender : Male / Female. Male is represented by 0, Female by 1.
Class : Positive (has diabetes) is represented by 1; Negative (does not have diabetes) is represented by 0.
Gender and class are converted to factors because they are categorical.
## [1] "double"
## [1] "character"
## [1] "Male" "Female"
#Changing Gender to factors
diabetes$Gender <- factor(diabetes$Gender,
levels = c("Male","Female"),
labels = c("Male","Female"))
unique(diabetes$class)
## [1] "Positive" "Negative"
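The factor code above covers Gender only. A minimal sketch of the matching conversion for class follows, assuming the same `factor()` pattern was used; the data frame `toy` is hypothetical and only stands in for `diabetes`.

```r
# Hypothetical stand-in for the diabetes data (not the real data set)
toy <- data.frame(class = c("Positive", "Negative", "Positive"),
                  stringsAsFactors = FALSE)

# Convert class to a factor, listing Positive first so it appears first in tables
toy$class <- factor(toy$class,
                    levels = c("Positive", "Negative"),
                    labels = c("Positive", "Negative"))

levels(toy$class)
```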
diabetes %>% group_by(class) %>% summarise(Min = min(Age, na.rm = TRUE),
Q1 = quantile(Age, probs = 0.25, na.rm=TRUE),
Median = median(Age, na.rm = TRUE),
Q3 = quantile(Age, probs = 0.75, na.rm = TRUE),
Max = max(Age, na.rm = TRUE),
Mean = mean(Age, na.rm = TRUE),
SD = sd(Age, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Age))) -> table1
knitr::kable(table1)

| class | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Positive | 16 | 39 | 48 | 57 | 90 | 49.07187 | 12.09748 | 320 | 0 |
| Negative | 26 | 37 | 45 | 55 | 72 | 46.36000 | 12.08098 | 200 | 0 |
age_diabetes <- diabetes %>% boxplot(Age ~ class, data = .,
xlab = "Diabetes",
ylab = "Age",
main = "Age and Diabetes")
Finding : Older people appear somewhat more likely to have diabetes (Positive): the median age is 48 in the Positive group versus 45 in the Negative group. We therefore test whether age is associated with having diabetes using a Chi-square test of association.
#Creating age categories
no_of_samples <- dim(diabetes)[1]
age_encoded <- rep(0, no_of_samples)
for (i in seq_len(no_of_samples)) {
if (diabetes$Age[i] < 30){
age_encoded[i] = 1
} else if (diabetes$Age[i] >= 30 & diabetes$Age[i] < 40){
age_encoded[i] = 2
} else if (diabetes$Age[i] >= 40 & diabetes$Age[i] < 50){
age_encoded[i] = 3
} else if (diabetes$Age[i] >= 50){
age_encoded[i] = 4
}
}
diabetes$age_cat <- age_encoded
#Changing to factors variables for age categories
diabetes$age_cat<- factor(diabetes$age_cat,
levels = c(1,2,3,4),
labels = c("< 30","30 - 39","40 - 49","> 50"),
ordered = TRUE)
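The same binning can be written without a loop. The sketch below uses base R `cut()` as an alternative, not the method used in this report; the `ages` vector is illustrative and not taken from the data set.

```r
# Illustrative ages only (not taken from the data set)
ages <- c(16, 29, 30, 39, 40, 49, 50, 90)

# right = FALSE makes each bin closed on the left: [30, 40), [40, 50), [50, Inf)
age_cat <- cut(ages,
               breaks = c(-Inf, 30, 40, 50, Inf),
               labels = c("< 30", "30 - 39", "40 - 49", "> 50"),
               right = FALSE,
               ordered_result = TRUE)

table(age_cat)
```

With `right = FALSE`, an age of exactly 30, 40 or 50 falls into the higher bin, which matches the `>=` comparisons in the loop.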
#Cross tabulation of data
table3 <- table(diabetes$age_cat, diabetes$class)
table3 %>% addmargins()
##          
##           Positive Negative Sum
##   < 30           8       12  20
##   30 - 39       77       47 124
##   40 - 49       88       63 151
##   > 50         147       78 225
##   Sum          320      200 520
#Distribution of age category within each class (column proportions)
table4 <- table3 %>% prop.table(margin=2) %>% round(2)
knitr::kable(table4)

| | Positive | Negative |
|---|---|---|
| < 30 | 0.03 | 0.06 |
| 30 - 39 | 0.24 | 0.23 |
| 40 - 49 | 0.28 | 0.32 |
| > 50 | 0.46 | 0.39 |
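The `margin` argument of `prop.table()` decides which totals the proportions are taken over. The sketch below rebuilds the counts from the cross tabulation above and contrasts the two choices; the object names `tab`, `within_class` and `within_age` are introduced here for illustration only.

```r
# Counts copied from the cross tabulation of age category by class
tab <- matrix(c(8, 77, 88, 147,     # Positive column
                12, 47, 63, 78),    # Negative column
              nrow = 4,
              dimnames = list(age_cat = c("< 30", "30 - 39", "40 - 49", "> 50"),
                              class = c("Positive", "Negative")))

# margin = 2: each column sums to 1 (age distribution within each class)
within_class <- prop.table(tab, margin = 2)

# margin = 1: each row sums to 1 (share of Positive/Negative within each age group)
within_age <- prop.table(tab, margin = 1)

round(within_class, 2)
round(within_age, 2)
```

The table shown in this report corresponds to `margin = 2`; `margin = 1` would instead give the proportion of diabetic patients inside each age group.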
#Visualize the association between age category and diabetes status using a clustered bar chart.
barplot(table4,
main = "Diabetes by Age Group",
ylab= "Proportion within Class",
ylim=c(0,1),
legend=rownames(table4),
beside=TRUE,
args.legend= list(x = "topright",horiz=TRUE, title="Age Category"),
xlab="Diabetes Class",
col = brewer.pal(4, name = "RdBu"))
- The age-group proportions differ somewhat between the Positive and Negative classes, which suggests a possible association between age and diabetes. We use a Chi-square test of association to check whether this relationship is statistically significant.
\[H_0: \text{There is no association between age and having diabetes}\]
\[H_A: \text{There is an association between age and having diabetes}\]
Assumption : No more than 25% of expected cell counts are below 5.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.
Chi-square test of association
##
## Pearson's Chi-squared test
##
## data: table3
## X-squared = 5.9835, df = 3, p-value = 0.1124
##
## Positive Negative
## < 30 12.30769 7.692308
## 30 - 39 76.30769 47.692308
## 40 - 49 92.92308 58.076923
## > 50 138.46154 86.538462
##
## Positive Negative
## < 30 8 12
## 30 - 39 77 47
## 40 - 49 88 63
## > 50 147 78
## [1] 7.814728
## [1] 0.1124158
## [1] 0.1124169
Conclusion : No cell has an expected count below 5, so the assumption holds. The p-value of 0.1124 is not less than 0.05 (the alpha significance level), and the test statistic of 5.9835 is below the critical value of 7.8147, so we fail to reject H0. The Chi-square test of association is not statistically significant: this sample provides no evidence of an association between age group and diabetes status. Diabetes is commonly reported to be associated with age, but our sample did not show that.
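The test can be reproduced from the observed counts alone. The sketch below rebuilds the 4 x 2 table by hand and uses base R `chisq.test()`; the names `observed`, `res`, `crit` and `p_val` are introduced here for illustration and may differ from the hidden code chunk.

```r
# Observed counts copied from the cross tabulation of age category by class
observed <- matrix(c(8, 77, 88, 147,     # Positive column
                     12, 47, 63, 78),    # Negative column
                   nrow = 4,
                   dimnames = list(age_cat = c("< 30", "30 - 39", "40 - 49", "> 50"),
                                   class = c("Positive", "Negative")))

# Pearson's Chi-squared test (no continuity correction is applied to a 4 x 2 table)
res <- chisq.test(observed)

res$expected                                  # expected counts under H0
crit <- qchisq(0.95, df = unname(res$parameter))   # critical value at alpha = 0.05
p_val <- pchisq(unname(res$statistic),
                df = unname(res$parameter),
                lower.tail = FALSE)           # p-value from the statistic

res
```

Comparing `res$statistic` with `crit`, or `p_val` with 0.05, leads to the same decision.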
positive <- diabetes %>% filter(class == "Positive")
boxplot(positive$Age ~ positive$Gender, xlab = "Gender", ylab = "Age", main = "Age of Diabetic Patients by Gender")
#Mean age of men with diabetes
positive_male <- diabetes %>% filter(class == "Positive" & Gender == "Male")
mean_age_male_positive <- mean(positive_male$Age)
#Mean age of women with diabetes
positive_female<- diabetes %>% filter(class == "Positive" & Gender == "Female")
mean_age_female_positive <- mean(positive_female$Age)
Finding : The mean age of men with diabetes is around 51 years, whereas for women it is around 47 years. We explore whether gender is related to diabetes using a two-sample t-test, one-sample t-tests and a Chi-square test of association.
To report the difference in mean age, we split the diabetic patients by gender and performed an independent two-sample t-test. We also calculated confidence interval estimates of the mean age for diabetic males and females.
## [1] 320 18
## [1] 147 18
## [1] 173 18
\[H_0: \text{The difference in mean age between diabetic males and diabetic females is 0}\]
\[H_A: \text{The difference in mean age between diabetic males and diabetic females is not 0}\]
Assumption : We assume equal variances because both samples come from the same population; we also perform Levene's test to confirm this.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.
Two-sample t-test (Assumptions - Normality and Homogeneity of variance)
Normality Q-Q plot for checking normality of age of diabetic males
## [1] 61 40
As most data points fall along the line within the 95% CI band, and the sample size (147 men) is greater than 30, the Central Limit Theorem lets us treat the sampling distribution of the mean as approximately normal.
Normality Q-Q plot for checking normality of age of diabetic females
## [1] 63 91
As most data points fall along the line within the 95% CI band, and the sample size (173 women) is greater than 30, the Central Limit Theorem lets us treat the sampling distribution of the mean as approximately normal.
The p-value for Levene's test of equal variance in age between males and females was p = 0.72. As p > .05, we fail to reject H0, i.e. there is no evidence that the variances of the two samples differ, so we proceed assuming equal variances.
Two-sample t-test - Assuming Equal Variance
##
## Two Sample t-test
##
## data: Age by Gender
## t = 3.1924, df = 318, p-value = 0.001552
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## 1.638895 6.903356
## sample estimates:
## mean in group Male mean in group Female
## 51.38095 47.10983
Conclusion : As the p-value of 0.001552 is less than 0.05, we reject H0. The test is statistically significant: the mean age of diabetic patients differs by gender, with males older by an estimated 4.27 years (95% CI 1.64 to 6.90).
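The reported statistic, confidence interval and p-value are tied together, which gives a quick consistency check. The sketch below works only from the numbers printed in the output above; the variable names are introduced here for illustration.

```r
# Values copied from the two-sample t-test output
mean_male   <- 51.38095
mean_female <- 47.10983
ci_lower    <- 1.638895
ci_upper    <- 6.903356
df          <- 318          # 147 + 173 - 2

# Point estimate of the difference in means (centre of the interval)
diff_means <- mean_male - mean_female

# CI half-width = t_crit * SE, so the standard error can be recovered from it
t_crit  <- qt(0.975, df)
se_diff <- (ci_upper - ci_lower) / 2 / t_crit

# t statistic and two-sided p-value
t_stat  <- diff_means / se_diff
p_value <- 2 * pt(-abs(t_stat), df)

c(diff = diff_means, se = se_diff, t = t_stat, p = p_value)
```

The recovered values should agree with the printed t statistic and p-value up to rounding of the inputs.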
## [1] 49.42801 53.33389
## attr(,"conf.level")
## [1] 0.95
## [1] 45.32687 48.89279
## attr(,"conf.level")
## [1] 0.95
Conclusion : We are 95% confident that the mean age of male patients with diabetes lies between 49.43 and 53.33 years, and that of female patients between 45.33 and 48.89 years. That is, intervals constructed this way would contain the true mean in 95% of repeated samples.
##
## Male Female Sum
## Positive 147 173 320
## Negative 181 19 200
## Sum 328 192 520
#Distribution of class conditional on gender
table2 <- table1 %>% prop.table(margin=2) %>% round(2)
#Visualize the association between gender and whether have diabetes or not using a clustered bar chart.
barplot(table2,
main = "Diabetes by Gender",
ylab= "Proportion within Gender",
ylim=c(0,1),
legend=rownames(table2),
beside=TRUE,
args.legend= list(x = "topright",horiz=TRUE, title="Diabetes"),
xlab="Gender")
- If there were no association between gender and diabetes, the proportions of Positive and Negative would be the same for males and females. In the bar chart, a much larger proportion of women than men have diabetes (about 90% versus 45%), i.e. the probability of having diabetes appears to depend on gender. We therefore use a Chi-square test of association to determine whether this relationship is statistically significant.
\[H_0: \text{There is no association between gender and having diabetes}\]
\[H_A: \text{There is an association between gender and having diabetes}\]
Assumption : No more than 25% of expected cell counts are below 5.
Decision Rule : Reject H0 if the p-value is less than 0.05 (the alpha significance level). Otherwise, fail to reject H0.
Conclusion : The test is statistically significant if we reject H0. Otherwise, the test is not statistically significant.
Chi square test of association
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table1
## X-squared = 103.04, df = 1, p-value < 2.2e-16
##
## Male Female
## Positive 201.8462 118.15385
## Negative 126.1538 73.84615
##
## Male Female
## Positive 147 173
## Negative 181 19
## [1] 3.841459
## [1] 3.284493e-24
## [1] 3.289704e-24
Conclusion: No cell has an expected count below 5. The p-value is less than 2.2e-16 (approximately 3.3e-24), far below 0.05 (the alpha significance level), so we reject H0. The Chi-square test of association is statistically significant: there is a statistically significant association between gender and having diabetes.
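As with the age test, this result can be reproduced from the 2 x 2 counts. The sketch below uses base R `chisq.test()`, which applies Yates' continuity correction to 2 x 2 tables by default; the names `observed`, `res` and `res_uncorrected` are introduced here for illustration.

```r
# Observed counts copied from the cross tabulation of class by gender
observed <- matrix(c(147, 181,    # Male column: Positive, Negative
                     173, 19),    # Female column: Positive, Negative
                   nrow = 2,
                   dimnames = list(class = c("Positive", "Negative"),
                                   Gender = c("Male", "Female")))

# Default for a 2 x 2 table: Pearson's test with Yates' continuity correction
res <- chisq.test(observed)

# Same test without the correction, for comparison
res_uncorrected <- chisq.test(observed, correct = FALSE)

res$expected     # expected counts under H0
res
```

The correction shrinks the statistic slightly, but with counts this far from the expected values the decision is the same either way.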