Milica Krivokapić Hypothesis testing- MVA1

{r}{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=70)

Research Question: Is there a difference in weight between male and female?

Data

data()

data(package = .packages(all.available = TRUE))

#install.packages("carData")
library(carData)

mydata <- force(Davis)
head(mydata)

##   sex weight height repwt repht
## 1   M     77    182    77   180
## 2   F     58    161    51   159
## 3   F     53    161    54   158
## 4   M     68    177    70   175
## 5   F     59    157    59   155
## 6   M     76    170    76   165

Now, I will add variable ID.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  mutate(ID = row_number())


head(mydata)

##   sex weight height repwt repht ID
## 1   M     77    182    77   180  1
## 2   F     58    161    51   159  2
## 3   F     53    161    54   158  3
## 4   M     68    177    70   175  4
## 5   F     59    157    59   155  5
## 6   M     76    170    76   165  6

Data explanation

Unit of observation: Person
Sample size: 200
Definition of all variables:
- Sex: A factor with levels: F, female; M,male.
- Weight: Measured weight in kg of subject.
- Height: Measured height in cm of subject.
- Repwt: Reported weight in kg of subject.
- Repht: Reported height in cm of suject.
- ID: Person ID
Source: Personal communication from C. Davis, Departments of Physical Education and Psychology, York University.

Data manipulation and descriptive statistics

Removing units with missing values

library(tidyr)

mydata <- drop_na(mydata)

nrow(mydata)

## [1] 181

Now my data has 181 rows in comparison with 200 in the beginning.

Converting categorical variables to factors

mydata$sexF <- factor(mydata$sex,
                       levels = c ("M", "F"),
                       labels = c ("M", "F"))

Descreptive statistics and explanation of parameters

library(psych)

describeBy(x=mydata$weight, group=mydata$sexF)

## 
##  Descriptive statistics by group 
## group: M
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 82 75.98 12.16     75   75.12 11.86  54 119    65 0.77     0.77 1.34
## ------------------------------------------------------------ 
## group: F
##    vars  n  mean    sd median trimmed  mad min max range skew kurtosis  se
## X1    1 99 58.29 12.92     56   57.02 5.93  39 166   127 5.84    46.01 1.3

The mean weight in group M (males) is 75.98 kilograms, representing the average weight of males in this group.
The range of 65 kilograms in group M (males) represents the difference between the highest and lowest kilogram values.
With a skewness of 0.77 in group M (males), the weight distribution among males is positively skewed, suggesting the presence of individuals with higher weights.
The minimum weight in group F (females) is 39 kilograms, signifying the smallest recorded weight in the female group.
On average, the weights of females in group F (females) deviate from the mean by approximately 12.92 kilograms.(standard deviation)

Statistical hypotesis

For this type of research question the appropriate parametric test would be independent t-test. Data belong to two different groups of units (male and female) and I want to asses whether there is a difference in weight between male and female. (hypothesis about the difference between two population arithmetic means (independent samples)

However, certain assumptions need to be met:

Variable is numeric
The distribution of the variable is normal in both populations
The data must come from two independent populations
Variable has the same variance in both populations – since this assumption is often violated, we apply Welch correction.

Since weight is measured in kilograms we can say that this variable is numeric and thus first assumption is fulfilled.

To visually observe the data, ggplot will be shown:

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes (x = weight)) +
geom_histogram(binwidth = 10, colour="gray" , fill= "blue") +
facet_wrap(~sexF, ncol = 1) +
ylab("'Frequency")

From the visual representation of the variable “weight” outliers can be observed, for the both groups “M” and “F”. In order to continue, the data should be cleaned.

Now I will check the second assumption:

library(dplyr)

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.3.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(sexF) %>%
  shapiro_test(weight)

## # A tibble: 2 × 4
##   sexF  variable statistic        p
##   <fct> <chr>        <dbl>    <dbl>
## 1 M     weight       0.960 1.25e- 2
## 2 F     weight       0.543 4.87e-16

H0: Variable weight is normally distributed within M group.

H1: Variable weight is not normally distributed within M group.

Since p-value = 0.013, we reject the null hypothesis.

H0: Variable weight is normally distributed within F group.

H1: Variable weight is not normally distributed within F group.

Since p-value < 0.001, we reject the null hypothesis.

Although second assumption for normality is violated , I will do the parametric test and Welch correction since we often assume that fourth assumption is violated.

 t.test(mydata$weight ~ mydata$sexF,
         paired= FALSE,
         var.equal = FALSE,
         alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$weight by mydata$sexF
## t = 9.4663, df = 176.09, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group M and group F is not equal to 0
## 95 percent confidence interval:
##  13.99620 21.36916
## sample estimates:
## mean in group M mean in group F 
##        75.97561        58.29293

library(effectsize)

## Warning: package 'effectsize' was built under R version 4.3.2

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize ::cohens_d(mydata$weight ~ mydata$sexF,
                    pooled_sd - FALSE)

## Cohen's d |       95% CI
## ------------------------
## 1.41      | [1.08, 1.73]
## 
## - Estimated using pooled SD.

interpret_cohens_d (1.41, rules = "sawilowsky2009")

## [1] "very large"
## (Rules: sawilowsky2009)

Using the sample data, it was able to demonstrate that the average weight differed between male and female (the effect size is very large, 𝑑 = 1.41).

As second assumption is violated we now conduct Wilcoxon Rank Sum Test which is alternative non-parametric test.

H0: Distribution location of a weight is same for male and female.

H1: Distribution location of a weight is not a same for male and female.

library(effectsize)

effectsize(wilcox.test(mydata$weight ~ mydata$sexF,
           paired = FALSE,
           correct = FALSE,
           exact = FALSE,
           alternative = "two.sided" ))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.84              | [0.79, 0.88]

interpret_rank_biserial(0.84)

## [1] "very large"
## (Rules: funder2019)