h1.title {color: deeppink;font-family: "Copperplate", fantasy; font-size: 48px;}
.author {color: mediumaquamarine ;font-family: "Copperplate", fantasy; font-size: 24px;}
.date {color: blueviolet ;font-family: "Copperplate", fantasy; font-size: 18px;}
h1.title:hover {
text-shadow: 2px 2px 8px rgba(255, 20, 147, 0.8);
}
.columns {display: flex;}
h1 {color: salmon;font-family: 'Roboto', sans-serif; font-size: 36px;}
mydata <- read.csv("C:/Users/Pino/Documents/diabetes.csv")
colnames(mydata) [3] <- "Blood Pressure"
colnames(mydata) [4] <- "Skin Thickness"
colnames(mydata) [7] <- "DPF"
head(mydata)
## Pregnancies Glucose Blood Pressure Skin Thickness Insulin BMI DPF Age
## 1 6 148 72 35 0 33.6 0.627 50
## 2 1 85 66 29 0 26.6 0.351 31
## 3 8 183 64 0 0 23.3 0.672 32
## 4 1 89 66 23 94 28.1 0.167 21
## 5 0 137 40 35 168 43.1 2.288 33
## 6 5 116 74 0 0 25.6 0.201 30
## Outcome
## 1 1
## 2 0
## 3 1
## 4 0
## 5 1
## 6 0
This dataset is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective of the dataset is to
diagnostically predict whether or not a patient has diabetes, based on
certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a
larger database. In particular, all patients here are females at least
21 years old of Pima Indian heritage. The datasets consist of several
medical predictor (independent) variables and one target (dependent)
variable, Outcome. Independent variables include the number of
pregnancies the patient has had, their BMI, insulin level, age, and so
on.
Unit of observation: Female of Pima Indian heritage over the age of 21.
Sample size: 768
Explanation of variables:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose
tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function (DPF): Diabetes pedigree function
Age: Age in years
Outcome:1-diabetes 0-no diabetes
I renamed some variables so the name is with spaces between the words.
The data is taken from Kaggle website.
mydata$Outcome <- factor(mydata$Outcome,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata)
## Pregnancies Glucose Blood Pressure Skin Thickness Insulin BMI DPF Age
## 1 6 148 72 35 0 33.6 0.627 50
## 2 1 85 66 29 0 26.6 0.351 31
## 3 8 183 64 0 0 23.3 0.672 32
## 4 1 89 66 23 94 28.1 0.167 21
## 5 0 137 40 35 168 43.1 2.288 33
## 6 5 116 74 0 0 25.6 0.201 30
## Outcome
## 1 Yes
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 No
Factored the variable “Outcome”.
library(psych)
describeBy(mydata[ ,-9])
## Warning in describeBy(mydata[, -9]): no grouping variable requested
## vars n mean sd median trimmed mad min max range
## Pregnancies 1 768 3.85 3.37 3.00 3.46 2.97 0.00 17.00 17.00
## Glucose 2 768 120.89 31.97 117.00 119.38 29.65 0.00 199.00 199.00
## Blood Pressure 3 768 69.11 19.36 72.00 71.36 11.86 0.00 122.00 122.00
## Skin Thickness 4 768 20.54 15.95 23.00 19.94 17.79 0.00 99.00 99.00
## Insulin 5 768 79.80 115.24 30.50 56.75 45.22 0.00 846.00 846.00
## BMI 6 768 31.99 7.88 32.00 31.96 6.82 0.00 67.10 67.10
## DPF 7 768 0.47 0.33 0.37 0.42 0.25 0.08 2.42 2.34
## Age 8 768 33.24 11.76 29.00 31.54 10.38 21.00 81.00 60.00
## skew kurtosis se
## Pregnancies 0.90 0.14 0.12
## Glucose 0.17 0.62 1.15
## Blood Pressure -1.84 5.12 0.70
## Skin Thickness 0.11 -0.53 0.58
## Insulin 2.26 7.13 4.16
## BMI -0.43 3.24 0.28
## DPF 1.91 5.53 0.01
## Age 1.13 0.62 0.42
For “Age”, min = 21, indicating that the youngest female in the sample
is 21 years old.
For “BMI”, mean = 31.99, which means that the average BMI (Body Mass
Index) of all the females in the sample is 31.99.
For “Pregnancies”, max = 17, meaning that the female with the highest
number of pregnancies in the sample was pregnant 17 times.
table(mydata$Outcome)
##
## No Yes
## 500 268
Research Question 1
Research question: Is there a difference in the BMI for all adult
females in the USA, (average is 27.8, median is 26.7) and the BMI of
females of Pima Indian heritage over the age of 21?
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = BMI)) +
geom_histogram(binwidth = 2, colour = "black", fill = "salmon") +
ylab("Frequency") +
xlab("BMI")
The histogram shows a distribution that is approximately normal but
slightly skewed to the right with some outliers. To further check
normality I will carry out Shapiro-Wilk test.
shapiro.test(mydata$BMI)
##
## Shapiro-Wilk normality test
##
## data: mydata$BMI
## W = 0.94999, p-value = 1.842e-15
H₀ : Distrubtion of BMI is normal
H₁ : Distrubtion of BMI is not normal.
We can reject the null hypothesis at p<0.001. The distribution is not
normal.
t.test(mydata$BMI,
mu = 27.8,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$BMI
## t = 14.737, df = 767, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 27.8
## 95 percent confidence interval:
## 31.43410 32.55106
## sample estimates:
## mean of x
## 31.99258
H₀ : μ = 27.8
H₁ : μ ≠ 27.8
We can reject the null hypothesis at p<0.001.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$BMI, mu = 27.8)
## Cohen's d | 95% CI
## ------------------------
## 0.53 | [0.46, 0.61]
##
## - Deviation from a difference of 27.8.
interpret_cohens_d(0.53, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
Conclusion: We can say that there is a difference in the average BMI for
adult women in the USA, which is 27.8 and the average BMI of females of
Pima Indian heritage over the age of 21 (d = 0.53 - a medium effect).
median(mydata$BMI)
## [1] 32
wilcox.test(mydata$BMI,
mu = 26.7,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata$BMI
## V = 253954, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 26.7
H₀ : Me = 26.7
H₁ : Me ≠ 26.7
We can reject the null hypothesis at p<0.001.
library(effectsize)
effectsize(wilcox.test(mydata$BMI,
mu = 26.7,
correct = FALSE))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.72 | [0.68, 0.76]
##
## - Deviation from a difference of 26.7.
interpret_rank_biserial(0.72, rules = "funder2019")
## [1] "very large"
## (Rules: funder2019)
Conclusion: Based on the sample data, I found that the median BMI of
females of Pima Indian heritage over the age of 21 is 32, which is
different compared to the median BMI of all adult females in USA
(p<0.001, r = 0.72 - a very effect).
The most suitable test in my case is the non-parametric on, which is
Wilcoxon Signed Rank Test. Based on the latter, I conclude that there is
a difference between the BMI of all adult females in USA and the BMI of
females of Pima Indian heritage over the age of 21.
Research Question 1 (corrected)
Research question: Is there a difference in BMI between different age
groups for women of Pima Indian heritage?
mydata$Age_Group <- ifelse(mydata$Age >= 21 & mydata$Age <= 51, "0",
ifelse(mydata$Age >= 52 & mydata$Age <= 81, "1", NA))
table(mydata$Age_Group)
##
## 0 1
## 695 73
Divided the units by age, creating two age groups “Young”, which is from
21 to 51 years old and “Old”, which is from 52 to 81 years old
mydata$Age_Group <- factor(mydata$Age_Group,
levels = c(0, 1),
labels = c("Young", "Old"))
library(psych)
describeBy(mydata$BMI, mydata$Age_Group)
##
## Descriptive statistics by group
## group: Young
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 695 32.22 7.85 32.3 32.17 6.97 0 67.1 67.1 -0.36 3.17 0.3
## ------------------------------------------------------------
## group: Old
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 73 29.8 7.95 30.1 30.11 7.41 0 46.5 46.5 -1.08 3.44 0.93
library(ggplot2)
Age_Group_Young <- ggplot(mydata[mydata$Age_Group == "Young", ], aes(x = BMI)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "black") +
ylab("Frequency") +
ggtitle("Young")
Age_Group_Old <- ggplot(mydata[mydata$Age_Group == "Old", ], aes(x = BMI)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "black") +
ylab("Frequency") +
ggtitle("Old")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggarrange(Age_Group_Young, Age_Group_Old,
ncol = 2, nrow = 1)

library(ggpubr)
ggqqplot(mydata,
"BMI",
facet.by = "Age_Group")

shapiro.test(mydata$BMI[mydata$Age_Group == "Young"])
##
## Shapiro-Wilk normality test
##
## data: mydata$BMI[mydata$Age_Group == "Young"]
## W = 0.9532, p-value = 4.698e-14
shapiro.test(mydata$BMI[mydata$Age_Group == "Old"])
##
## Shapiro-Wilk normality test
##
## data: mydata$BMI[mydata$Age_Group == "Old"]
## W = 0.91164, p-value = 8.291e-05
H₀ : Distrubtion of BMI is normal.
H₁ : Distrubtion of BMI is not normal.
We can reject the null hypothesis at p<0.001 for both age groups. The
distribution is not normal.
t.test(mydata$BMI ~ mydata$Age_Group,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$BMI by mydata$Age_Group
## t = 2.4813, df = 87.392, p-value = 0.015
## alternative hypothesis: true difference in means between group Young and group Old is not equal to 0
## 95 percent confidence interval:
## 0.4825143 4.3662686
## sample estimates:
## mean in group Young mean in group Old
## 32.22302 29.79863
library(effectsize)
effectsize::cohens_d(mydata$BMI ~ mydata$Age_Group,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## ------------------------
## 0.31 | [0.06, 0.55]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.31, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
H₀ : The mean BMI is the same in both age groups.
H₁ : The mean BMI is different between the two age groups.
We reject the null hypothesis at p=0.0015
Conclusion: Based on the sample data, we find that the average BMI of
age groups differs (p = 0.015). Young women of Pima Indian heritage have
a higher average BMI (the effect size is small, 𝑑 = 0.31).
wilcox.test(mydata$BMI ~ mydata$Age_Group,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$BMI by mydata$Age_Group
## W = 29624, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
effectsize(wilcox.test(mydata$BMI ~ mydata$Age_Group,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.17 | [0.03, 0.30]
interpret_rank_biserial(0.17)
## [1] "small"
## (Rules: funder2019)
H₀ : Distribution location of BMI is equal for both age groups.
H₁ : Distribution location of BMI is not equal for both age groups.
We reject the null hypothesis at p=0.019
Conclusion: Based on the sample data, we find that young and old women
of Pima Indian heritage differ in the BMI (p=0.019) - young ones have a
higher BMI, the difference in distribution locations is small (r = 0.17)
The most suitable test in my case is the non-parametric on, which is the
Independant t-test with Welch correction. Based on the latter, I
conclude that there is a difference between the BMI of young women of
Pima Indian heritage and old women of Pima Indian heritage.
Research Question 2
Research question: Is there a correlation between BMI and Diabetes
Pedigree Function (DPF)?
H₀: ρ=0, there is no linear correlation between BMI and Diabetes
Pedigree Function (DPF).
H₁: ρ≠0, there is a linear correlation between BMI and Diabetes Pedigree
Function (DPF).
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(mydata[ ,c(6,7)])

The Pearson correlation coefficient between BMI and DPF is 0.141.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ ,c(6,7)]),
type = "pearson")
## BMI DPF
## BMI 1.00 0.14
## DPF 0.14 1.00
##
## n= 768
##
##
## P
## BMI DPF
## BMI 0
## DPF 0
Conclusion: The Pearson correlation coefficient (r = 0.141) indicates a
weak positive correlation between BMI and Diabetes Pedigree Function
(DPF). The p-value < 0.001, confirms that this correlation is
statistically significant.
Research Question 3
Research question: Does the outcome (having diabetes or no) vary
depending on the age group of women of Pima Indian heritage?
results <- chisq.test(mydata$Outcome, mydata$Age_Group,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$Outcome and mydata$Age_Group
## X-squared = 3.2892, df = 1, p-value = 0.06974
H₀: There is association between outcome and age group.
H₁: There is no association between outcome and age group.
We can not reject H₀ at p=0.700
We did not find enough information to say that there is association
between age group and outcome
addmargins(results$observed)
## mydata$Age_Group
## mydata$Outcome Young Old Sum
## No 460 40 500
## Yes 235 33 268
## Sum 695 73 768
Explanation of observed/empirical variables: The observed value of 460
means that, in the sample, there are 460 young individuals who do not
have diabetes.
round(results$expected, 2)
## mydata$Age_Group
## mydata$Outcome Young Old
## No 452.47 47.53
## Yes 242.53 25.47
Here we can conclude that both assumptions of association between two
categorical variables are met. Observations are independent of eachother
and all expected frequencies are greater than 5 (seen from the table
above)
Explanation of expected/theoretical variables: If there was no
association between age group and having diabetes, the expected number
of young women of Pima Indian heritage with diabetes would be 242.53.
round(results$res, 2)
## mydata$Age_Group
## mydata$Outcome Young Old
## No 0.35 -1.09
## Yes -0.48 1.49
Explanation of standard residual: The standardized residual of 0.35 by
which we can not say that there is a difference between the observed and
theoretical value for the combination of young age group and not having
diabetes, the values are very close.
addmargins(round(prop.table(results$observed), 3))
## mydata$Age_Group
## mydata$Outcome Young Old Sum
## No 0.599 0.052 0.651
## Yes 0.306 0.043 0.349
## Sum 0.905 0.095 1.000
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$Age_Group
## mydata$Outcome Young Old Sum
## No 0.920 0.080 1.000
## Yes 0.877 0.123 1.000
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$Age_Group
## mydata$Outcome Young Old
## No 0.662 0.548
## Yes 0.338 0.452
## Sum 1.000 1.000
library(effectsize)
effectsize::cramers_v(mydata$Age_Group, mydata$Outcome)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.06 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.06)
## [1] "very small"
## (Rules: funder2019)
The Pearson’s Chi-squared test with Yates’ continuity correction showed
no statistically significant association between age group and having
diabetes (p > 0.05). The effect size, measured by Cramér’s V, is
0.06, indicating a very small or negligible effect. This suggests that
even if there were a significant association, it would likely have
minimal practical importance.
oddsratio(mydata$Age_Group, mydata$Outcome)
## Odds ratio | 95% CI
## -------------------------
## 1.61 | [0.99, 2.63]
interpret_oddsratio(1.61)
## [1] "very small"
## (Rules: chen2010)
The odds of having diabetes in the old age group are 1.61 the times than
the odds of having diabetes in the young age group