h1.title {color: deeppink;font-family: "Copperplate", fantasy; font-size: 48px;}
.author {color: mediumaquamarine ;font-family: "Copperplate", fantasy; font-size: 24px;}
.date {color: blueviolet ;font-family: "Copperplate", fantasy; font-size: 18px;}

h1.title:hover {
    text-shadow: 2px 2px 8px rgba(255, 20, 147, 0.8);
}

.columns {display: flex;}
h1 {color: salmon;font-family: 'Roboto', sans-serif; font-size: 36px;}

mydata <- read.csv("C:/Users/Pino/Documents/diabetes.csv")

colnames(mydata) [3] <- "Blood Pressure"
colnames(mydata) [4] <- "Skin Thickness"
colnames(mydata) [7] <- "DPF"
head(mydata)

##   Pregnancies Glucose Blood Pressure Skin Thickness Insulin  BMI   DPF Age
## 1           6     148             72             35       0 33.6 0.627  50
## 2           1      85             66             29       0 26.6 0.351  31
## 3           8     183             64              0       0 23.3 0.672  32
## 4           1      89             66             23      94 28.1 0.167  21
## 5           0     137             40             35     168 43.1 2.288  33
## 6           5     116             74              0       0 25.6 0.201  30
##   Outcome
## 1       1
## 2       0
## 3       1
## 4       0
## 5       1
## 6       0

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The datasets consist of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Unit of observation: Female of Pima Indian heritage over the age of 21.

Sample size: 768

Explanation of variables:

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test

Blood Pressure: Diastolic blood pressure (mm Hg)

Skin Thickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

Diabetes Pedigree Function (DPF): Diabetes pedigree function

Age: Age in years

Outcome:1-diabetes 0-no diabetes

I renamed some variables so the name is with spaces between the words.

The data is taken from Kaggle website.

mydata$Outcome <- factor(mydata$Outcome,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
head(mydata)

##   Pregnancies Glucose Blood Pressure Skin Thickness Insulin  BMI   DPF Age
## 1           6     148             72             35       0 33.6 0.627  50
## 2           1      85             66             29       0 26.6 0.351  31
## 3           8     183             64              0       0 23.3 0.672  32
## 4           1      89             66             23      94 28.1 0.167  21
## 5           0     137             40             35     168 43.1 2.288  33
## 6           5     116             74              0       0 25.6 0.201  30
##   Outcome
## 1     Yes
## 2      No
## 3     Yes
## 4      No
## 5     Yes
## 6      No

Factored the variable “Outcome”.

library(psych)
describeBy(mydata[ ,-9])

## Warning in describeBy(mydata[, -9]): no grouping variable requested

##                vars   n   mean     sd median trimmed   mad   min    max  range
## Pregnancies       1 768   3.85   3.37   3.00    3.46  2.97  0.00  17.00  17.00
## Glucose           2 768 120.89  31.97 117.00  119.38 29.65  0.00 199.00 199.00
## Blood Pressure    3 768  69.11  19.36  72.00   71.36 11.86  0.00 122.00 122.00
## Skin Thickness    4 768  20.54  15.95  23.00   19.94 17.79  0.00  99.00  99.00
## Insulin           5 768  79.80 115.24  30.50   56.75 45.22  0.00 846.00 846.00
## BMI               6 768  31.99   7.88  32.00   31.96  6.82  0.00  67.10  67.10
## DPF               7 768   0.47   0.33   0.37    0.42  0.25  0.08   2.42   2.34
## Age               8 768  33.24  11.76  29.00   31.54 10.38 21.00  81.00  60.00
##                 skew kurtosis   se
## Pregnancies     0.90     0.14 0.12
## Glucose         0.17     0.62 1.15
## Blood Pressure -1.84     5.12 0.70
## Skin Thickness  0.11    -0.53 0.58
## Insulin         2.26     7.13 4.16
## BMI            -0.43     3.24 0.28
## DPF             1.91     5.53 0.01
## Age             1.13     0.62 0.42

For “Age”, min = 21, indicating that the youngest female in the sample is 21 years old.

For “BMI”, mean = 31.99, which means that the average BMI (Body Mass Index) of all the females in the sample is 31.99.

For “Pregnancies”, max = 17, meaning that the female with the highest number of pregnancies in the sample was pregnant 17 times.

table(mydata$Outcome)

## 
##  No Yes 
## 500 268

Research Question 1

Research question: Is there a difference in the BMI for all adult females in the USA, (average is 27.8, median is 26.7) and the BMI of females of Pima Indian heritage over the age of 21?

Source: DQYDJ. (2022). BMI percentile calculator for men and women in the United States. DQYDJ. https://dqydj.com/bmi-percentile-calculator-men-women-united-states/

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = BMI)) +
  geom_histogram(binwidth = 2, colour = "black", fill = "salmon") +
  ylab("Frequency") + 
  xlab("BMI")

The histogram shows a distribution that is approximately normal but slightly skewed to the right with some outliers. To further check normality I will carry out Shapiro-Wilk test.

shapiro.test(mydata$BMI)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$BMI
## W = 0.94999, p-value = 1.842e-15

H₀ : Distrubtion of BMI is normal

H₁ : Distrubtion of BMI is not normal.

We can reject the null hypothesis at p<0.001. The distribution is not normal.

t.test(mydata$BMI,
       mu = 27.8,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$BMI
## t = 14.737, df = 767, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 27.8
## 95 percent confidence interval:
##  31.43410 32.55106
## sample estimates:
## mean of x 
##  31.99258

H₀ : μ = 27.8

H₁ : μ ≠ 27.8

We can reject the null hypothesis at p<0.001.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata$BMI, mu = 27.8)

## Cohen's d |       95% CI
## ------------------------
## 0.53      | [0.46, 0.61]
## 
## - Deviation from a difference of 27.8.

interpret_cohens_d(0.53, rules = "sawilowsky2009")

## [1] "medium"
## (Rules: sawilowsky2009)

Conclusion: We can say that there is a difference in the average BMI for adult women in the USA, which is 27.8 and the average BMI of females of Pima Indian heritage over the age of 21 (d = 0.53 - a medium effect).

median(mydata$BMI)

## [1] 32

wilcox.test(mydata$BMI,
            mu = 26.7,
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$BMI
## V = 253954, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 26.7

H₀ : Me = 26.7

H₁ : Me ≠ 26.7

We can reject the null hypothesis at p<0.001.

library(effectsize)
effectsize(wilcox.test(mydata$BMI,
                       mu = 26.7,
                       correct = FALSE))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.72              | [0.68, 0.76]
## 
## - Deviation from a difference of 26.7.

interpret_rank_biserial(0.72, rules = "funder2019")

## [1] "very large"
## (Rules: funder2019)

Conclusion: Based on the sample data, I found that the median BMI of females of Pima Indian heritage over the age of 21 is 32, which is different compared to the median BMI of all adult females in USA (p<0.001, r = 0.72 - a very effect).

The most suitable test in my case is the non-parametric on, which is Wilcoxon Signed Rank Test. Based on the latter, I conclude that there is a difference between the BMI of all adult females in USA and the BMI of females of Pima Indian heritage over the age of 21.

Research Question 1 (corrected)

Research question: Is there a difference in BMI between different age groups for women of Pima Indian heritage?

mydata$Age_Group <- ifelse(mydata$Age >= 21 & mydata$Age <= 51, "0", 
                         ifelse(mydata$Age >= 52 & mydata$Age <= 81, "1", NA))

table(mydata$Age_Group)

## 
##   0   1 
## 695  73

Divided the units by age, creating two age groups “Young”, which is from 21 to 51 years old and “Old”, which is from 52 to 81 years old

mydata$Age_Group <- factor(mydata$Age_Group,
                            levels = c(0, 1),
                            labels = c("Young", "Old"))

library(psych)

describeBy(mydata$BMI, mydata$Age_Group)

## 
##  Descriptive statistics by group 
## group: Young
##    vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis  se
## X1    1 695 32.22 7.85   32.3   32.17 6.97   0 67.1  67.1 -0.36     3.17 0.3
## ------------------------------------------------------------ 
## group: Old
##    vars  n mean   sd median trimmed  mad min  max range  skew kurtosis   se
## X1    1 73 29.8 7.95   30.1   30.11 7.41   0 46.5  46.5 -1.08     3.44 0.93

library(ggplot2)

Age_Group_Young <- ggplot(mydata[mydata$Age_Group == "Young",  ], aes(x = BMI)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 2, col = "black") +
  ylab("Frequency") +
  ggtitle("Young")

Age_Group_Old <- ggplot(mydata[mydata$Age_Group == "Old",  ], aes(x = BMI)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 2, col = "black") +
  ylab("Frequency") +
  ggtitle("Old")


library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggarrange(Age_Group_Young, Age_Group_Old,
          ncol = 2, nrow = 1)

library(ggpubr)
ggqqplot(mydata,
         "BMI",
         facet.by = "Age_Group")

shapiro.test(mydata$BMI[mydata$Age_Group == "Young"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$BMI[mydata$Age_Group == "Young"]
## W = 0.9532, p-value = 4.698e-14

shapiro.test(mydata$BMI[mydata$Age_Group == "Old"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$BMI[mydata$Age_Group == "Old"]
## W = 0.91164, p-value = 8.291e-05

H₀ : Distrubtion of BMI is normal.

H₁ : Distrubtion of BMI is not normal.

We can reject the null hypothesis at p<0.001 for both age groups. The distribution is not normal.

t.test(mydata$BMI ~ mydata$Age_Group, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$BMI by mydata$Age_Group
## t = 2.4813, df = 87.392, p-value = 0.015
## alternative hypothesis: true difference in means between group Young and group Old is not equal to 0
## 95 percent confidence interval:
##  0.4825143 4.3662686
## sample estimates:
## mean in group Young   mean in group Old 
##            32.22302            29.79863

library(effectsize)
effectsize::cohens_d(mydata$BMI ~ mydata$Age_Group, 
                     pooled_sd = FALSE)

## Cohen's d |       95% CI
## ------------------------
## 0.31      | [0.06, 0.55]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.31, rules = "sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

H₀ : The mean BMI is the same in both age groups.

H₁ : The mean BMI is different between the two age groups.

We reject the null hypothesis at p=0.0015

Conclusion: Based on the sample data, we find that the average BMI of age groups differs (p = 0.015). Young women of Pima Indian heritage have a higher average BMI (the effect size is small, 𝑑 = 0.31).

wilcox.test(mydata$BMI ~ mydata$Age_Group,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$BMI by mydata$Age_Group
## W = 29624, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0

effectsize(wilcox.test(mydata$BMI ~ mydata$Age_Group,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.17              | [0.03, 0.30]

interpret_rank_biserial(0.17)

## [1] "small"
## (Rules: funder2019)

H₀ : Distribution location of BMI is equal for both age groups.

H₁ : Distribution location of BMI is not equal for both age groups.

We reject the null hypothesis at p=0.019

Conclusion: Based on the sample data, we find that young and old women of Pima Indian heritage differ in the BMI (p=0.019) - young ones have a higher BMI, the difference in distribution locations is small (r = 0.17)

The most suitable test in my case is the non-parametric on, which is the Independant t-test with Welch correction. Based on the latter, I conclude that there is a difference between the BMI of young women of Pima Indian heritage and old women of Pima Indian heritage.

Research Question 2

Research question: Is there a correlation between BMI and Diabetes Pedigree Function (DPF)?

H₀: ρ=0, there is no linear correlation between BMI and Diabetes Pedigree Function (DPF).

H₁: ρ≠0, there is a linear correlation between BMI and Diabetes Pedigree Function (DPF).

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(mydata[ ,c(6,7)])

The Pearson correlation coefficient between BMI and DPF is 0.141.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ ,c(6,7)]), 
      type = "pearson")

##      BMI  DPF
## BMI 1.00 0.14
## DPF 0.14 1.00
## 
## n= 768 
## 
## 
## P
##     BMI DPF
## BMI      0 
## DPF  0

Conclusion: The Pearson correlation coefficient (r = 0.141) indicates a weak positive correlation between BMI and Diabetes Pedigree Function (DPF). The p-value < 0.001, confirms that this correlation is statistically significant.

Research Question 3

Research question: Does the outcome (having diabetes or no) vary depending on the age group of women of Pima Indian heritage?

results <- chisq.test(mydata$Outcome, mydata$Age_Group, 
                      correct = TRUE)
results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$Outcome and mydata$Age_Group
## X-squared = 3.2892, df = 1, p-value = 0.06974

H₀: There is association between outcome and age group.

H₁: There is no association between outcome and age group.

We can not reject H₀ at p=0.700

We did not find enough information to say that there is association between age group and outcome

addmargins(results$observed)

##               mydata$Age_Group
## mydata$Outcome Young Old Sum
##            No    460  40 500
##            Yes   235  33 268
##            Sum   695  73 768

Explanation of observed/empirical variables: The observed value of 460 means that, in the sample, there are 460 young individuals who do not have diabetes.

round(results$expected, 2)

##               mydata$Age_Group
## mydata$Outcome  Young   Old
##            No  452.47 47.53
##            Yes 242.53 25.47

Here we can conclude that both assumptions of association between two categorical variables are met. Observations are independent of eachother and all expected frequencies are greater than 5 (seen from the table above)

Explanation of expected/theoretical variables: If there was no association between age group and having diabetes, the expected number of young women of Pima Indian heritage with diabetes would be 242.53.

round(results$res, 2)

##               mydata$Age_Group
## mydata$Outcome Young   Old
##            No   0.35 -1.09
##            Yes -0.48  1.49

Explanation of standard residual: The standardized residual of 0.35 by which we can not say that there is a difference between the observed and theoretical value for the combination of young age group and not having diabetes, the values are very close.

addmargins(round(prop.table(results$observed), 3))

##               mydata$Age_Group
## mydata$Outcome Young   Old   Sum
##            No  0.599 0.052 0.651
##            Yes 0.306 0.043 0.349
##            Sum 0.905 0.095 1.000

addmargins(round(prop.table(results$observed, 1), 3), 2)

##               mydata$Age_Group
## mydata$Outcome Young   Old   Sum
##            No  0.920 0.080 1.000
##            Yes 0.877 0.123 1.000

addmargins(round(prop.table(results$observed, 2), 3), 1)

##               mydata$Age_Group
## mydata$Outcome Young   Old
##            No  0.662 0.548
##            Yes 0.338 0.452
##            Sum 1.000 1.000

library(effectsize)
effectsize::cramers_v(mydata$Age_Group, mydata$Outcome)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.06              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.06)

## [1] "very small"
## (Rules: funder2019)

The Pearson’s Chi-squared test with Yates’ continuity correction showed no statistically significant association between age group and having diabetes (p > 0.05). The effect size, measured by Cramér’s V, is 0.06, indicating a very small or negligible effect. This suggests that even if there were a significant association, it would likely have minimal practical importance.

oddsratio(mydata$Age_Group, mydata$Outcome)

## Odds ratio |       95% CI
## -------------------------
## 1.61       | [0.99, 2.63]

interpret_oddsratio(1.61)

## [1] "very small"
## (Rules: chen2010)

The odds of having diabetes in the old age group are 1.61 the times than the odds of having diabetes in the young age group

HOMEWORK 1

Pino Sekač

2025-01-11

Research Question 1

Research Question 1 (corrected)

Research Question 2

Research Question 3