1. Importing the raw dataset

Link to the dataset: https://www.openintro.org/data/index.php?data=bdims&utm_

mydata_raw <- read.table("./bdims.csv", header = TRUE, sep=",")

head(mydata_raw)

##   bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 1   41.7   26.8   32.0   19.7   27.8   14.0   10.5   18.4   14.6  105.0   91.5
## 2   41.3   31.1   34.9   26.4   30.2   15.0   11.6   18.8   15.8  110.6  104.9
## 3   41.1   27.8   32.0   21.1   30.6   14.8   11.4   17.6   14.2  123.1  109.6
## 4   43.4   27.3   34.7   19.9   31.3   14.0   11.2   20.2   14.3  121.0  102.5
## 5   40.7   27.8   34.0   21.1   29.4   15.6   12.0   21.2   16.4  111.7  101.2
## 6   42.5   25.2   30.6   20.9   30.4   15.3   11.4   18.9   13.8  111.0   98.7
##   wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 1   80.2   85.7   89.2   48.5   28.7   25.0   35.4   32.3   23.0   16.2  22
## 2  109.2  104.4  101.7   56.4   34.0   29.2   39.3   38.5   24.3   17.4  55
## 3   88.4   92.0   95.1   52.5   39.4   29.8   34.5   34.0   22.0   17.0  42
## 4   81.0   84.5  105.0   56.0   31.5   26.5   38.5   36.9   21.3   16.6  29
## 5   88.8   90.8  101.3   62.5   35.2   28.9   40.6   39.2   23.0   17.5  40
## 6   78.1   78.6   92.0   56.5   35.7   29.2   36.4   37.0   22.0   17.2  24
##    wgt   hgt sex
## 1 61.4 177.8   1
## 2 94.1 185.4   1
## 3 75.0 168.9   1
## 4 83.6 185.4   1
## 5 85.5 180.3   1
## 6 73.9 174.0   1

2. Data manipulation

Deleting all variables except age, weight, heigtht and gender

mydata <- mydata_raw[,-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)]

Renaming variables

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- rename(mydata,
                   "AGE" = "age",
                   "WEIGHT" = "wgt",
                   "HEIGHT" = "hgt",
                   "GENDER" = "sex")

Creating categorical variable out of variable age (AGE_categorical) where 0=young (19 to 41 years old) and 1=old (42 to 65 years old)

mydata <- mydata %>% mutate(AGE_categorical = ifelse(AGE <= 41, 0, 1))

Creating factor for categorical variable GENDER and AGE_categorical

mydata$GENDER <- factor(mydata$GENDER,
                                  levels = c(0,1),
                                  labels = c("Female", "Male"))

mydata$AGE_categorical <- factor(mydata$AGE_categorical,
                                  levels = c(0,1),
                                  labels = c("Young", "Old"))

3. Expaining dataset

head(mydata)

##   AGE WEIGHT HEIGHT GENDER AGE_categorical
## 1  22   61.4  177.8   Male           Young
## 2  55   94.1  185.4   Male             Old
## 3  42   75.0  168.9   Male             Old
## 4  29   83.6  185.4   Male           Young
## 5  40   85.5  180.3   Male           Young
## 6  24   73.9  174.0   Male           Young

This data set explains persons weight, height, age and gender. It has 120 observations and 4 variables (2 numeric and 2 categorical). Unit of observations is one individual.

GENDER - 0 = female, 1 = male
AGE - age in years
WEIGHT - weight in kilograms
HEIGHT - height in centimeters
AGE_categorical - categorical variable of age, 0 is young (from 19 to 41 years of age) and 1 is old (from 42 to 65 years of age)

summary(mydata)

##       AGE            WEIGHT           HEIGHT         GENDER   AGE_categorical
##  Min.   :19.00   Min.   : 46.80   Min.   :149.9   Female:60   Young:100      
##  1st Qu.:24.75   1st Qu.: 61.85   1st Qu.:165.1   Male  :60   Old  : 20      
##  Median :29.00   Median : 72.05   Median :173.4                              
##  Mean   :31.65   Mean   : 71.52   Mean   :172.2                              
##  3rd Qu.:37.00   3rd Qu.: 80.92   3rd Qu.:179.1                              
##  Max.   :65.00   Max.   :104.10   Max.   :193.5

4. RQ1: Are male and female of equal height?

library(psych)

describeBy(mydata$HEIGHT, mydata$GENDER)

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n   mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 60 165.69 6.62 165.55  165.86 7.93 149.9 176.5  26.6 -0.17    -0.91
##      se
## X1 0.86
## ------------------------------------------------------------ 
## group: Male
##    vars  n   mean  sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 60 178.73 6.9  179.1  178.87 5.93 160 193.5  33.5 -0.25    -0.21 0.89

Check assumptions

Variable height is numeric
Normality (histogram, quantile-quantile plot, shapiro wilk test )
Data from two independent populations (male, female)
Variable has the same variance in both populations (Welch correction)

Normality

Histogram

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

GENDER_F <- ggplot(mydata[mydata$GENDER == "Female", ], aes(x = HEIGHT)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "magenta") +
  ylab("Frequency") +
  ggtitle("Female height")

GENDER_M <- ggplot(mydata[mydata$GENDER == "Male", ], aes(x = HEIGHT)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "darkturquoise") +
  ylab("Frequency") +
  ggtitle("Male height")

library(ggpubr)
ggarrange(GENDER_F, GENDER_M,
          ncol = 2, nrow = 1)

Quantile-quantile plot

library(ggpubr)

ggqqplot(mydata,
         "HEIGHT",
         facet.by = "GENDER")

Shapiro-Wilk test:

H0: Height is normaly distributed
H1: Height isn’t normaly distributed

We cannot reject H0 hypothesis for both male and female population. We assume normality of height distributions for male and female.

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(GENDER) %>%
  shapiro_test(HEIGHT)

## # A tibble: 2 × 4
##   GENDER variable statistic      p
##   <fct>  <chr>        <dbl>  <dbl>
## 1 Female HEIGHT       0.967 0.0991
## 2 Male   HEIGHT       0.981 0.463

Independent sample t-test (parametric)

H0: population mean of height for male is equal to population mean of height for female
H1: population mean of height for male is not equal to population mean of height for female

Based on the results of Welch two sample t-test (independent sample t-test) we reject H0 at p<0,001.

t.test(mydata$HEIGHT ~ mydata$GENDER,
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$HEIGHT by mydata$GENDER
## t = -10.563, df = 117.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -15.48670 -10.59663
## sample estimates:
## mean in group Female   mean in group Male 
##             165.6883             178.7300

Effectsize for independent sample t-test

Based on the results of cohens d statistics effectsize is very large. Male have higher average height than female.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata$HEIGHT ~ mydata$GENDER,
                     pooled_sd = FALSE)

## Cohen's d |         95% CI
## --------------------------
## -1.93     | [-2.36, -1.49]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(1.93, rules = "sawilowsky2009")

## [1] "very large"
## (Rules: sawilowsky2009)

Wilcoxon rank sum test (non-parametric)

H0: Distribution location of height for male is equal to distribution location of height for female
H1: Distribution location of height for male is not equal to distribution location of height for female

Based on Wilcoxon rank sum test we reject H0 at p<0,001.

wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$HEIGHT by mydata$GENDER
## W = 310, p-value = 4.97e-15
## alternative hypothesis: true location shift is not equal to 0

Effectsize for Wilcoxon rank sum test

Based on the results of rank biserial statistics effectsize is very large. Male have higher average height than female.

library(effectsize)

effectsize(wilcox.test(mydata$HEIGHT ~ mydata$GENDER,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.83             | [-0.88, -0.75]

interpret_rank_biserial(0.83)

## [1] "very large"
## (Rules: funder2019)

Conclusion

Based on the sample data, we find that men and women differ in height (𝑝 < 0,001) - male are higher, the difference in average height is very large (d = 1,93). I used the parametric independent sample t-test as all assumptions are met (numeric variable, normality, variable has the same variance in both populations), for the t-test I also did welch correction (for variance assumption). I also conducted non-parametric Wilcoxon rank sum test which is more robust but in my example has lower statistical power than parametric test (all assumptions are met). Based on both test I rejected H0 at p<0,001 which suggests that male and female are of the same height.

5. RQ2: Are variables weight and height correlated?

H0: rho = 0
H1: rho =/= 0

Pearson correlation coefficient for variables height and weight is 0.60, correlation is positive and semi strong. Then I also did test of pearson correlation coefficient, where H0 is that rho is equal to 0 which means, that variables are not correlated and H1 is that rho is not equal to 0 which means that variables are correlated. Based on the results of test of correlation coefficient we reject H0 at p<0.001, variables height and weight are correlated. We used pearson coefficient as both variables are numeric, in case of one ordinal variable we would use spearman. To conclude variables weight and height are correlated positively and semi strong.

cor(mydata$WEIGHT,mydata$HEIGHT, method = "pearson")

## [1] 0.6094579

cor.test(mydata$WEIGHT,mydata$HEIGHT, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  mydata$WEIGHT and mydata$HEIGHT
## t = 8.3505, df = 118, p-value = 1.502e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4829770 0.7110267
## sample estimates:
##       cor 
## 0.6094579

6. RQ3: Is there association between variables age (categorical: young, old) and gender?

Check assumptions

Observations are independent of each other
All expected frequencies are greater than 5.

Chi-square test

H0: There is no association between age and gender
H1: There is association between age and gender

Based on chi-square test I reject H0 at p<0.001 and assume there is association between age and gender.

results <- chisq.test(mydata$GENDER, mydata$AGE_categorical, correct = TRUE)

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$GENDER and mydata$AGE_categorical
## X-squared = 10.14, df = 1, p-value = 0.001451

Empirical frequencies

In our sample there is 120 total observations, out of all observations 60 are male and 60 are female and out of all 120 observations 100 fall into young age category while 20 fall into old age category. Out of all 60 females in sample there is 57 young females and 3 old. Out of all 60 males in sample there is 43 young males and 17 old. Out of 100 young people in sample 57 are female and 43 are male and out of 20 old people 3 are female and 17 are male.

addmargins(results$observed)

##              mydata$AGE_categorical
## mydata$GENDER Young Old Sum
##        Female    57   3  60
##        Male      43  17  60
##        Sum      100  20 120

Theoretical frequencies

If variables age and gender would not be associated we would expect 50 young female and 50 young male and 10 old female and 10 old male.

Assumption number 2 is met, all expected values are over 5. (sample size of 20 for old is not really got statistically should be over 100)

addmargins(round(results$expected,2))

##              mydata$AGE_categorical
## mydata$GENDER Young Old Sum
##        Female    50  10  60
##        Male      50  10  60
##        Sum      100  20 120

Standardized residuals

In the combination female and old is less than expected (-2.21) number of people at alpha = 0.05 (p<0,05), 2.21>1,96
In the combination male and old is more than expected (2.21) number of people at alpha = 0,05 (p<0.05), 2.21>1,96
Difference between observed and expected in the combination female and young is not significant (0.99<1.96)
Difference between observed and expected in the combination male and young is not significant (0.99<1.96)

round(results$residuals,2)

##              mydata$AGE_categorical
## mydata$GENDER Young   Old
##        Female  0.99 -2.21
##        Male   -0.99  2.21

Effectsize

Based on cramers v-statistics effectsize is large. Also the effectsize of oddsratio is also large. In this case we have 2x2 matrix its more suitable to use oddsratio as effectsize.

library(effectsize)
effectsize::cramers_v(mydata$GENDER,mydata$AGE_categorical)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.30              | [0.14, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.30)

## [1] "large"
## (Rules: funder2019)

oddsratio(mydata$AGE_categorical,mydata$GENDER)

## Odds ratio |        95% CI
## --------------------------
## 7.51       | [2.07, 27.28]

interpret_oddsratio(7.51)

## [1] "large"
## (Rules: cohen1988)

Conclusion

Based on chi-square test I discovered that categorical variables age and gender are associated.

MVA HW1

Luka Vrtač

2025-01-14

1. Importing the raw dataset

2. Data manipulation

3. Expaining dataset

4. RQ1: Are male and female of equal height?

Check assumptions

Normality

Independent sample t-test (parametric)

Effectsize for independent sample t-test

Wilcoxon rank sum test (non-parametric)

Effectsize for Wilcoxon rank sum test

Conclusion

5. RQ2: Are variables weight and height correlated?

6. RQ3: Is there association between variables age (categorical: young, old) and gender?

Chi-square test

Empirical frequencies

Theoretical frequencies

Standardized residuals

Effectsize

Conclusion