Link to the dataset: https://www.openintro.org/data/index.php?data=bdims&utm_
mydata_raw <- read.table("./bdims.csv", header = TRUE, sep=",")
head(mydata_raw)
## bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 1 41.7 26.8 32.0 19.7 27.8 14.0 10.5 18.4 14.6 105.0 91.5
## 2 41.3 31.1 34.9 26.4 30.2 15.0 11.6 18.8 15.8 110.6 104.9
## 3 41.1 27.8 32.0 21.1 30.6 14.8 11.4 17.6 14.2 123.1 109.6
## 4 43.4 27.3 34.7 19.9 31.3 14.0 11.2 20.2 14.3 121.0 102.5
## 5 40.7 27.8 34.0 21.1 29.4 15.6 12.0 21.2 16.4 111.7 101.2
## 6 42.5 25.2 30.6 20.9 30.4 15.3 11.4 18.9 13.8 111.0 98.7
## wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 1 80.2 85.7 89.2 48.5 28.7 25.0 35.4 32.3 23.0 16.2 22
## 2 109.2 104.4 101.7 56.4 34.0 29.2 39.3 38.5 24.3 17.4 55
## 3 88.4 92.0 95.1 52.5 39.4 29.8 34.5 34.0 22.0 17.0 42
## 4 81.0 84.5 105.0 56.0 31.5 26.5 38.5 36.9 21.3 16.6 29
## 5 88.8 90.8 101.3 62.5 35.2 28.9 40.6 39.2 23.0 17.5 40
## 6 78.1 78.6 92.0 56.5 35.7 29.2 36.4 37.0 22.0 17.2 24
## wgt hgt sex
## 1 61.4 177.8 1
## 2 94.1 185.4 1
## 3 75.0 168.9 1
## 4 83.6 185.4 1
## 5 85.5 180.3 1
## 6 73.9 174.0 1
Deleting all variables except age, weight, heigtht and gender
mydata <- mydata_raw[,-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)]
Renaming variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- rename(mydata,
"AGE" = "age",
"WEIGHT" = "wgt",
"HEIGHT" = "hgt",
"GENDER" = "sex")
Creating categorical variable out of variable age (AGE_categorical) where 0=young (19 to 41 years old) and 1=old (42 to 65 years old)
mydata <- mydata %>% mutate(AGE_categorical = ifelse(AGE <= 41, 0, 1))
Creating factor for categorical variable GENDER and AGE_categorical
mydata$GENDER <- factor(mydata$GENDER,
levels = c(0,1),
labels = c("Female", "Male"))
mydata$AGE_categorical <- factor(mydata$AGE_categorical,
levels = c(0,1),
labels = c("Young", "Old"))
head(mydata)
## AGE WEIGHT HEIGHT GENDER AGE_categorical
## 1 22 61.4 177.8 Male Young
## 2 55 94.1 185.4 Male Old
## 3 42 75.0 168.9 Male Old
## 4 29 83.6 185.4 Male Young
## 5 40 85.5 180.3 Male Young
## 6 24 73.9 174.0 Male Young
This data set explains persons weight, height, age and gender. It has 120 observations and 4 variables (2 numeric and 2 categorical). Unit of observations is one individual.
summary(mydata)
## AGE WEIGHT HEIGHT GENDER AGE_categorical
## Min. :19.00 Min. : 46.80 Min. :149.9 Female:60 Young:100
## 1st Qu.:24.75 1st Qu.: 61.85 1st Qu.:165.1 Male :60 Old : 20
## Median :29.00 Median : 72.05 Median :173.4
## Mean :31.65 Mean : 71.52 Mean :172.2
## 3rd Qu.:37.00 3rd Qu.: 80.92 3rd Qu.:179.1
## Max. :65.00 Max. :104.10 Max. :193.5
library(psych)
describeBy(mydata$HEIGHT, mydata$GENDER)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 60 165.69 6.62 165.55 165.86 7.93 149.9 176.5 26.6 -0.17 -0.91
## se
## X1 0.86
## ------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 60 178.73 6.9 179.1 178.87 5.93 160 193.5 33.5 -0.25 -0.21 0.89
Histogram
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
GENDER_F <- ggplot(mydata[mydata$GENDER == "Female", ], aes(x = HEIGHT)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "magenta") +
ylab("Frequency") +
ggtitle("Female height")
GENDER_M <- ggplot(mydata[mydata$GENDER == "Male", ], aes(x = HEIGHT)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "darkturquoise") +
ylab("Frequency") +
ggtitle("Male height")
library(ggpubr)
ggarrange(GENDER_F, GENDER_M,
ncol = 2, nrow = 1)
Quantile-quantile plot
library(ggpubr)
ggqqplot(mydata,
"HEIGHT",
facet.by = "GENDER")
Shapiro-Wilk test:
We cannot reject H0 hypothesis for both male and female population. We assume normality of height distributions for male and female.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(GENDER) %>%
shapiro_test(HEIGHT)
## # A tibble: 2 × 4
## GENDER variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Female HEIGHT 0.967 0.0991
## 2 Male HEIGHT 0.981 0.463
Based on the results of Welch two sample t-test (independent sample t-test) we reject H0 at p<0,001.
t.test(mydata$HEIGHT ~ mydata$GENDER,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$HEIGHT by mydata$GENDER
## t = -10.563, df = 117.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -15.48670 -10.59663
## sample estimates:
## mean in group Female mean in group Male
## 165.6883 178.7300
Based on the results of cohens d statistics effectsize is very large. Male have higher average height than female.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$HEIGHT ~ mydata$GENDER,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## --------------------------
## -1.93 | [-2.36, -1.49]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(1.93, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)
Based on Wilcoxon rank sum test we reject H0 at p<0,001.
wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$HEIGHT by mydata$GENDER
## W = 310, p-value = 4.97e-15
## alternative hypothesis: true location shift is not equal to 0
Based on the results of rank biserial statistics effectsize is very large. Male have higher average height than female.
library(effectsize)
effectsize(wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.83 | [-0.88, -0.75]
interpret_rank_biserial(0.83)
## [1] "very large"
## (Rules: funder2019)
Based on the sample data, we find that men and women differ in height (𝑝 < 0,001) - male are higher, the difference in average height is very large (d = 1,93). I used the parametric independent sample t-test as all assumptions are met (numeric variable, normality, variable has the same variance in both populations), for the t-test I also did welch correction (for variance assumption). I also conducted non-parametric Wilcoxon rank sum test which is more robust but in my example has lower statistical power than parametric test (all assumptions are met). Based on both test I rejected H0 at p<0,001 which suggests that male and female are of the same height.
Check assumptions
Based on chi-square test I reject H0 at p<0.001 and assume there is association between age and gender.
results <- chisq.test(mydata$GENDER, mydata$AGE_categorical, correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$GENDER and mydata$AGE_categorical
## X-squared = 10.14, df = 1, p-value = 0.001451
In our sample there is 120 total observations, out of all observations 60 are male and 60 are female and out of all 120 observations 100 fall into young age category while 20 fall into old age category. Out of all 60 females in sample there is 57 young females and 3 old. Out of all 60 males in sample there is 43 young males and 17 old. Out of 100 young people in sample 57 are female and 43 are male and out of 20 old people 3 are female and 17 are male.
addmargins(results$observed)
## mydata$AGE_categorical
## mydata$GENDER Young Old Sum
## Female 57 3 60
## Male 43 17 60
## Sum 100 20 120
If variables age and gender would not be associated we would expect 50 young female and 50 young male and 10 old female and 10 old male.
Assumption number 2 is met, all expected values are over 5. (sample size of 20 for old is not really got statistically should be over 100)
addmargins(round(results$expected,2))
## mydata$AGE_categorical
## mydata$GENDER Young Old Sum
## Female 50 10 60
## Male 50 10 60
## Sum 100 20 120
round(results$residuals,2)
## mydata$AGE_categorical
## mydata$GENDER Young Old
## Female 0.99 -2.21
## Male -0.99 2.21
Based on cramers v-statistics effectsize is large. Also the effectsize of oddsratio is also large. In this case we have 2x2 matrix its more suitable to use oddsratio as effectsize.
library(effectsize)
effectsize::cramers_v(mydata$GENDER,mydata$AGE_categorical)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.30 | [0.14, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.30)
## [1] "large"
## (Rules: funder2019)
oddsratio(mydata$AGE_categorical,mydata$GENDER)
## Odds ratio | 95% CI
## --------------------------
## 7.51 | [2.07, 27.28]
interpret_oddsratio(7.51)
## [1] "large"
## (Rules: cohen1988)
Based on chi-square test I discovered that categorical variables age and gender are associated.