1. Importing the raw dataset

Link to the dataset: https://www.openintro.org/data/index.php?data=bdims

mydata_raw <- read.table("./bdims.csv", header = TRUE, sep=",")

head(mydata_raw)
##   bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 1   41.7   26.8   32.0   19.7   27.8   14.0   10.5   18.4   14.6  105.0   91.5
## 2   41.3   31.1   34.9   26.4   30.2   15.0   11.6   18.8   15.8  110.6  104.9
## 3   41.1   27.8   32.0   21.1   30.6   14.8   11.4   17.6   14.2  123.1  109.6
## 4   43.4   27.3   34.7   19.9   31.3   14.0   11.2   20.2   14.3  121.0  102.5
## 5   40.7   27.8   34.0   21.1   29.4   15.6   12.0   21.2   16.4  111.7  101.2
## 6   42.5   25.2   30.6   20.9   30.4   15.3   11.4   18.9   13.8  111.0   98.7
##   wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 1   80.2   85.7   89.2   48.5   28.7   25.0   35.4   32.3   23.0   16.2  22
## 2  109.2  104.4  101.7   56.4   34.0   29.2   39.3   38.5   24.3   17.4  55
## 3   88.4   92.0   95.1   52.5   39.4   29.8   34.5   34.0   22.0   17.0  42
## 4   81.0   84.5  105.0   56.0   31.5   26.5   38.5   36.9   21.3   16.6  29
## 5   88.8   90.8  101.3   62.5   35.2   28.9   40.6   39.2   23.0   17.5  40
## 6   78.1   78.6   92.0   56.5   35.7   29.2   36.4   37.0   22.0   17.2  24
##    wgt   hgt sex
## 1 61.4 177.8   1
## 2 94.1 185.4   1
## 3 75.0 168.9   1
## 4 83.6 185.4   1
## 5 85.5 180.3   1
## 6 73.9 174.0   1

2. Data manipulation

Deleting all variables except age, weight, height and gender

mydata <- mydata_raw[, -(1:21)]
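
Equivalently, the same four columns could be kept by name rather than dropped by position (just a sketch; it assumes the raw column names age, wgt, hgt and sex shown in the head() output above):

mydata <- mydata_raw[, c("age", "wgt", "hgt", "sex")]  # keep only the four columns of interest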

Renaming variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- rename(mydata,
                   "AGE" = "age",
                   "WEIGHT" = "wgt",
                   "HEIGHT" = "hgt",
                   "GENDER" = "sex")

Creating a categorical variable AGE_categorical from the variable age, where 0 = young (19 to 41 years old) and 1 = old (42 to 65 years old)

mydata <- mydata %>% mutate(AGE_categorical = ifelse(AGE <= 41, 0, 1))

Creating factors for the categorical variables GENDER and AGE_categorical

mydata$GENDER <- factor(mydata$GENDER,
                                  levels = c(0,1),
                                  labels = c("Female", "Male"))

mydata$AGE_categorical <- factor(mydata$AGE_categorical,
                                  levels = c(0,1),
                                  labels = c("Young", "Old"))

3. Explaining the dataset

head(mydata)
##   AGE WEIGHT HEIGHT GENDER AGE_categorical
## 1  22   61.4  177.8   Male           Young
## 2  55   94.1  185.4   Male             Old
## 3  42   75.0  168.9   Male             Old
## 4  29   83.6  185.4   Male           Young
## 5  40   85.5  180.3   Male           Young
## 6  24   73.9  174.0   Male           Young

This data set describes each person's age, weight, height and gender. After the manipulation above it has 120 observations and 5 variables (3 numeric and 2 categorical, counting the derived AGE_categorical). The unit of observation is one individual.

summary(mydata)
##       AGE            WEIGHT           HEIGHT         GENDER   AGE_categorical
##  Min.   :19.00   Min.   : 46.80   Min.   :149.9   Female:60   Young:100      
##  1st Qu.:24.75   1st Qu.: 61.85   1st Qu.:165.1   Male  :60   Old  : 20      
##  Median :29.00   Median : 72.05   Median :173.4                              
##  Mean   :31.65   Mean   : 71.52   Mean   :172.2                              
##  3rd Qu.:37.00   3rd Qu.: 80.92   3rd Qu.:179.1                              
##  Max.   :65.00   Max.   :104.10   Max.   :193.5

4. RQ1: Are males and females of equal height?

library(psych)

describeBy(mydata$HEIGHT, mydata$GENDER)
## 
##  Descriptive statistics by group 
## group: Female
##    vars  n   mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 60 165.69 6.62 165.55  165.86 7.93 149.9 176.5  26.6 -0.17    -0.91
##      se
## X1 0.86
## ------------------------------------------------------------ 
## group: Male
##    vars  n   mean  sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 60 178.73 6.9  179.1  178.87 5.93 160 193.5  33.5 -0.25    -0.21 0.89

Check assumptions

  • Variable height is numeric
  • Normality (histogram, quantile-quantile plot, Shapiro-Wilk test)
  • Data from two independent populations (male, female)
  • Variable has the same variance in both populations (Welch correction)

Normality

Histogram

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
GENDER_F <- ggplot(mydata[mydata$GENDER == "Female", ], aes(x = HEIGHT)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "magenta") +
  ylab("Frequency") +
  ggtitle("Female height")

GENDER_M <- ggplot(mydata[mydata$GENDER == "Male", ], aes(x = HEIGHT)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "darkturquoise") +
  ylab("Frequency") +
  ggtitle("Male height")

library(ggpubr)
ggarrange(GENDER_F, GENDER_M,
          ncol = 2, nrow = 1)

Quantile-quantile plot

library(ggpubr)

ggqqplot(mydata,
         "HEIGHT",
         facet.by = "GENDER")

Shapiro-Wilk test:

  • H0: Height is normally distributed
  • H1: Height isn't normally distributed

We cannot reject H0 for either the female (p = 0.099) or the male (p = 0.463) population, so we assume height is normally distributed in both groups.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(GENDER) %>%
  shapiro_test(HEIGHT)
## # A tibble: 2 × 4
##   GENDER variable statistic      p
##   <fct>  <chr>        <dbl>  <dbl>
## 1 Female HEIGHT       0.967 0.0991
## 2 Male   HEIGHT       0.981 0.463

Independent sample t-test (parametric)

  • H0: the population mean height for males is equal to the population mean height for females
  • H1: the population mean height for males is not equal to the population mean height for females

Based on the results of the Welch two sample t-test (independent samples t-test), we reject H0 at p < 0.001.

t.test(mydata$HEIGHT ~ mydata$GENDER,
       var.equal = FALSE,
       alternative = "two.sided") 
## 
##  Welch Two Sample t-test
## 
## data:  mydata$HEIGHT by mydata$GENDER
## t = -10.563, df = 117.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -15.48670 -10.59663
## sample estimates:
## mean in group Female   mean in group Male 
##             165.6883             178.7300

Effect size for the independent samples t-test

Based on Cohen's d, the effect size is very large (d = 1.93). Males have a higher average height than females.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata$HEIGHT ~ mydata$GENDER,
                     pooled_sd = FALSE)
## Cohen's d |         95% CI
## --------------------------
## -1.93     | [-2.36, -1.49]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(1.93, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)

Wilcoxon rank sum test (non-parametric)

  • H0: the distribution location of height for males is equal to the distribution location of height for females
  • H1: the distribution location of height for males is not equal to the distribution location of height for females

Based on the Wilcoxon rank sum test, we reject H0 at p < 0.001.

wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$HEIGHT by mydata$GENDER
## W = 310, p-value = 4.97e-15
## alternative hypothesis: true location shift is not equal to 0

Effect size for the Wilcoxon rank sum test

Based on the rank-biserial correlation, the effect size is very large (r = 0.83). Males have a higher average height than females.

library(effectsize)

effectsize(wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.83             | [-0.88, -0.75]
interpret_rank_biserial(0.83)
## [1] "very large"
## (Rules: funder2019)

Conclusion

Based on the sample data, we find that men and women differ in height (p < 0.001): males are taller, and the difference in average height is very large (d = 1.93). I used the parametric independent samples t-test because its assumptions are met (numeric variable, normality in both groups, two independent populations); I applied the Welch correction, so equal variances do not have to be assumed (a supplementary variance check is sketched below). I also conducted the non-parametric Wilcoxon rank sum test, which is more robust but, since all parametric assumptions are met here, has lower statistical power than the t-test. Based on both tests I rejected H0 at p < 0.001, which indicates that males and females are not of the same height.
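
As a supplementary check of the equal-variance assumption (not part of the analysis above, only a sketch), the two group variances could be compared with an F test; with the Welch correction applied, the t-test remains valid either way:

# F test of H0: equal variances of HEIGHT across the two GENDER groups (illustrative only)
var.test(HEIGHT ~ GENDER, data = mydata)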

5. RQ2: Are variables weight and height correlated?

The Pearson correlation coefficient for height and weight is 0.61, so the correlation is positive and moderately strong. I then tested the Pearson correlation coefficient, where H0 is that rho equals 0 (the variables are not correlated) and H1 is that rho is not equal to 0 (the variables are correlated). Based on this test we reject H0 at p < 0.001: height and weight are correlated. We used the Pearson coefficient because both variables are numeric; with an ordinal variable we would use Spearman's coefficient instead (see the sketch after the output below). To conclude, weight and height are positively and moderately strongly correlated.

cor(mydata$WEIGHT,mydata$HEIGHT, method = "pearson")
## [1] 0.6094579
cor.test(mydata$WEIGHT,mydata$HEIGHT, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$WEIGHT and mydata$HEIGHT
## t = 8.3505, df = 118, p-value = 1.502e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4829770 0.7110267
## sample estimates:
##       cor 
## 0.6094579
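
For illustration only (a sketch, not run as part of the analysis above), the Spearman alternative mentioned in the text would use the same function with a different method argument:

# Spearman rank correlation - appropriate if one of the variables were ordinal
cor.test(mydata$WEIGHT, mydata$HEIGHT, method = "spearman", exact = FALSE)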

6. RQ3: Is there an association between age (categorical: young, old) and gender?

Check assumptions

Chi-square test
  • H0: There is no association between age and gender
  • H1: There is association between age and gender

Based on the chi-square test I reject H0 at p < 0.01 (p = 0.0015) and conclude that there is an association between age and gender.

results <- chisq.test(mydata$GENDER, mydata$AGE_categorical, correct = TRUE)

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$GENDER and mydata$AGE_categorical
## X-squared = 10.14, df = 1, p-value = 0.001451

Empirical frequencies

Our sample contains 120 observations in total: 60 male and 60 female. Of the 120 observations, 100 fall into the young age category and 20 into the old category. Of the 60 females, 57 are young and 3 are old; of the 60 males, 43 are young and 17 are old. Of the 100 young people, 57 are female and 43 are male, and of the 20 old people, 3 are female and 17 are male.

addmargins(results$observed)
##              mydata$AGE_categorical
## mydata$GENDER Young Old Sum
##        Female    57   3  60
##        Male      43  17  60
##        Sum      100  20 120

Theoretical frequencies

If age and gender were not associated, we would expect 50 young females, 50 young males, 10 old females and 10 old males.

The second assumption is met: all expected cell counts are above 5 (a programmatic check is sketched after the output below). The old category nevertheless contains only 20 observations, which is statistically rather small.

addmargins(round(results$expected,2))
##              mydata$AGE_categorical
## mydata$GENDER Young Old Sum
##        Female    50  10  60
##        Male      50  10  60
##        Sum      100  20 120
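
The expected-frequency assumption can also be checked programmatically from the stored test object (a minimal sketch):

all(results$expected > 5)  # TRUE when every expected cell count exceeds 5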

Standardized residuals

  • In the female/old cell there are fewer people than expected (residual = -2.21), significant at alpha = 0.05 since 2.21 > 1.96
  • In the male/old cell there are more people than expected (residual = 2.21), significant at alpha = 0.05 since 2.21 > 1.96
  • The difference between observed and expected in the female/young cell is not significant (0.99 < 1.96)
  • The difference between observed and expected in the male/young cell is not significant (0.99 < 1.96)
round(results$residuals,2)
##              mydata$AGE_categorical
## mydata$GENDER Young   Old
##        Female  0.99 -2.21
##        Male   -0.99  2.21
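
For reference, the 1.96 cutoff used above corresponds to alpha = 0.05 for a standard normal variable; approximate two-sided p-values per cell could be computed as follows (a sketch, treating the residuals as approximately standard normal):

# two-sided p-values for the cell residuals under a standard normal approximation
round(2 * pnorm(abs(results$residuals), lower.tail = FALSE), 3)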

Effect size

Based on Cramér's V, the effect size is large (V = 0.30). The odds ratio also indicates a large effect (OR = 7.51). Since we have a 2x2 table, the odds ratio is the more suitable effect-size measure here.

library(effectsize)
effectsize::cramers_v(mydata$GENDER,mydata$AGE_categorical)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.30              | [0.14, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.30)
## [1] "large"
## (Rules: funder2019)
oddsratio(mydata$AGE_categorical,mydata$GENDER)
## Odds ratio |        95% CI
## --------------------------
## 7.51       | [2.07, 27.28]
interpret_oddsratio(7.51)
## [1] "large"
## (Rules: cohen1988)
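
The odds ratio of 7.51 can also be reproduced directly from the observed 2x2 table above (a sketch using the counts 57, 3, 43 and 17):

# odds of being old among males divided by odds of being old among females
(17 / 43) / (3 / 57)
## [1] 7.511628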
Conclusion

Based on the chi-square test I conclude that the categorical variables age and gender are associated (p = 0.0015); the residuals show that the old age category contains more males and fewer females than would be expected under independence.