Author: Miha Fabjan
mydata <- read.table("./survey.csv.csv", header=TRUE, sep=";", dec=".")
# Add an ID column to the dataset
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata, 10) # Show the first 10 units
## ID Gender Age KmBefore KgBefore TimeBefore Medicine1 Medicine2 Medicine3
## 1 1 F 32 4.06 74.7 41.2 No No No
## 2 2 M 37 3.96 76.3 43.9 Yes Yes No
## 3 3 M 43 3.80 91.7 47.9 Yes No No
## 4 4 F 26 5.17 75.4 59.6 No No No
## 5 5 F 36 3.72 77.0 54.9 No Yes Yes
## 6 6 M 37 5.31 93.9 50.6 No Yes Yes
## 7 7 F 35 3.18 75.5 54.4 Yes Yes Yes
## 8 8 F 48 4.65 71.5 47.6 No No Yes
## 9 9 F 41 4.40 79.7 32.8 No No Yes
## 10 10 F 32 2.95 81.5 51.4 No No No
## KmAfter KgAfter TimeAfter SideEffects
## 1 4.37 91.8 61.1 N
## 2 3.09 89.6 69.7 N
## 3 6.26 92.7 49.8 N
## 4 5.81 89.1 60.9 N
## 5 7.80 91.7 60.7 Y
## 6 5.67 87.8 67.9 N
## 7 5.50 95.5 57.8 N
## 8 2.13 92.2 55.0 N
## 9 7.04 91.5 59.7 N
## 10 6.10 90.8 56.3 N
The unit of observation is an individual who goes to the gym and may have taken one or more of the medicines listed below. The sample size is 1000 individuals.
Explanation of variables:
Gender: A binary variable that identifies the gender of each individual, coded as Male (M) or Female (F).
Age: A numeric variable representing the age of the individual in years.
Baseline measurements (before medication):
Km Before: The distance (in kilometers) that an individual could run before taking any medicine.
Kg Before: The weight (in kilograms) of the individual before using any medicine.
Time Before: The time (in minutes) it took the individual to run the distance measured in “Km Before.”
Medicines used:
1. Medicine 1: A binary variable indicating whether the individual used Medicine 1 (Yes or No).
2. Medicine 2: A binary variable indicating whether the individual used Medicine 2 (Yes or No).
3. Medicine 3: A binary variable indicating whether the individual used Medicine 3 (Yes or No).
Post-medication measurements:
Km After: The distance (in kilometers) that an individual could run after taking the medicine(s).
Kg After: The weight (in kilograms) of the individual after using the medicine(s).
Time After: The time (in minutes) it took the individual to run the distance measured in “Km After.”
SideEffects: A binary variable that indicates whether the individual experienced any side effects from the medicine(s) (Y for Yes, N for No).
Source of data: https://www.kaggle.com/datasets/saralattarulo/do-we-still-need-gyms
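Depending on the R version, the categorical variables described above (Gender, Medicine1 to Medicine3, SideEffects) may be read in as plain text. A minimal sketch of how they could be converted to factors follows. The data frame `toy` is a hypothetical stand-in built from the first three rows printed above, not the full dataset, and the level labels are our own choice.

```r
# Hypothetical stand-in: first three rows of the printed data (not the full dataset)
toy <- data.frame(
  Gender      = c("F", "M", "M"),
  Medicine1   = c("No", "Yes", "Yes"),
  SideEffects = c("N", "N", "N"),
  stringsAsFactors = FALSE
)

# Convert text codes to factors with explicit levels and readable labels
toy$Gender      <- factor(toy$Gender, levels = c("M", "F"), labels = c("Male", "Female"))
toy$Medicine1   <- factor(toy$Medicine1, levels = c("No", "Yes"))
toy$SideEffects <- factor(toy$SideEffects, levels = c("N", "Y"), labels = c("No", "Yes"))

str(toy)           # check the resulting column types
table(toy$Gender)  # frequency table of the recoded variable
```

Setting the levels explicitly keeps categories that happen to be absent from a subset (here, "Yes" for SideEffects) visible in later tables.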
H0: μ_KmBefore = μ_KmAfter (equivalently, μ_d = 0)
H1: μ_KmBefore ≠ μ_KmAfter (equivalently, μ_d ≠ 0)
library(pastecs)
round(stat.desc(mydata[, c("KmBefore", "KmAfter")]), 2)
## KmBefore KmAfter
## nbr.val 1000.00 1000.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.72 0.51
## max 6.28 10.35
## range 4.56 9.84
## sum 4020.63 4996.93
## median 4.02 5.01
## mean 4.02 5.00
## SE.mean 0.02 0.05
## CI.mean.0.95 0.05 0.09
## var 0.60 2.13
## std.dev 0.77 1.46
## coef.var 0.19 0.29
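Several rows of the table above follow directly from the others. As a rough check, the sketch below recomputes SE.mean, CI.mean.0.95 and coef.var from the printed mean and standard deviation. The inputs are the rounded values copied from the output, so the results are only expected to agree to two decimals; the variable names are our own.

```r
n <- 1000  # sample size (nbr.val)

# Values copied from the stat.desc() output above (already rounded to 2 decimals)
mean_km <- c(KmBefore = 4.02, KmAfter = 5.00)
sd_km   <- c(KmBefore = 0.77, KmAfter = 1.46)
sum_km  <- c(KmBefore = 4020.63, KmAfter = 4996.93)

se_mean  <- sd_km / sqrt(n)                  # SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # CI.mean.0.95 (half-width of the interval)
coef_var <- sd_km / mean_km                  # coef.var

round(rbind(se_mean, ci_half, coef_var), 2)

# Mean change in distance implied by the two column sums
mean_change <- unname(sum_km["KmAfter"] - sum_km["KmBefore"]) / n
mean_change
```

The mean change obtained from the sums is the quantity that the paired test later examines.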
# Difference = after - before (positive values mean a longer distance after)
mydata$Difference <- mydata$KmAfter - mydata$KmBefore
library(ggplot2)
ggplot(mydata, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 0.5, colour = "black") +
  ylab("Frequency") +
  xlab("Difference")
H0: The differences are normally distributed.
H1: The differences are not normally distributed.
shapiro.test(mydata$Difference)
##
## Shapiro-Wilk normality test
##
## data: mydata$Difference
## W = 0.99817, p-value = 0.3608
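The Shapiro-Wilk normality test gives p = 0.3608, so we cannot reject the null hypothesis of normality at any conventional significance level. The differences can be treated as approximately normally distributed, and a parametric method is appropriate in our case.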
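A formal normality test can be complemented by a visual check. The sketch below shows one way to do this with a normal Q-Q plot. It uses simulated differences (`sim_diff` is a hypothetical stand-in whose mean and spread were chosen only to resemble the data), so its numbers say nothing about the survey itself.

```r
set.seed(123)  # simulated stand-in for mydata$Difference, not the survey data
sim_diff <- rnorm(1000, mean = 1, sd = 1.6)

# Formal test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(sim_diff)
sw$p.value

# Visual check: points close to the reference line suggest approximate normality
qq <- qqnorm(sim_diff, main = "Normal Q-Q plot of simulated differences")
qqline(sim_diff)

# Correlation between theoretical and sample quantiles (near 1 for normal data)
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

On the real data, `sim_diff` would be replaced by `mydata$Difference`.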
Parametric method: paired t-test
# Paired t-test; the reported mean difference is KmBefore - KmAfter
t.test(mydata$KmBefore, mydata$KmAfter,
       paired = TRUE,
       alternative = "two.sided")
##
## Paired t-test
##
## data: mydata$KmBefore and mydata$KmAfter
## t = -18.882, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.0777647 -0.8748353
## sample estimates:
## mean difference
## -0.9763
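H0: μ_KmBefore = μ_KmAfter (equivalently, μ_d = 0)
H1: μ_KmBefore ≠ μ_KmAfter (equivalently, μ_d ≠ 0)
The paired t-test gives t = -18.882 (df = 999) with p < 2.2e-16, so we reject the null hypothesis at p < 0.001: the mean difference is not 0. The estimated mean difference (KmBefore minus KmAfter) is -0.98 km, with a 95% confidence interval from -1.08 to -0.87 km, which means that the distance run after taking the medicine is on average about 0.98 km longer.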
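The paired t statistic is the mean difference divided by its standard error, t = mean(d) / (sd(d) / sqrt(n)). As a small consistency check, the sketch below recovers the standard error and the 95% confidence interval from the numbers printed in the test output. The inputs are rounded values copied from that output, so agreement is expected only to about three decimals.

```r
# Values copied from the t.test() output above
mean_diff <- -0.9763   # mean of KmBefore - KmAfter
t_stat    <- -18.882   # reported t statistic
df        <- 999       # degrees of freedom (n - 1)

se_diff <- mean_diff / t_stat        # standard error of the mean difference
t_crit  <- qt(0.975, df = df)        # two-sided 95% critical value
ci      <- mean_diff + c(-1, 1) * t_crit * se_diff

round(se_diff, 4)
round(ci, 3)
```

The recomputed interval can be compared with the one reported by `t.test()`.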
For illustration, we also perform the non-parametric alternative (Wilcoxon signed-rank test):
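# Wilcoxon signed-rank test using the normal approximation (exact = FALSE)
# without continuity correction (correct = FALSE)
wilcox.test(mydata$KmBefore, mydata$KmAfter,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")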
##
## Wilcoxon signed rank test
##
## data: mydata$KmBefore and mydata$KmAfter
## V = 100234, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
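Both tests above are computed on KmBefore minus KmAfter, while the Difference variable is defined as KmAfter minus KmBefore. The sketch below illustrates, on simulated paired data (`before`, `after` and `diff_ab` are hypothetical, not the survey data), that the direction only changes the sign or value of the test statistic, not the two-sided p-value.

```r
set.seed(42)  # simulated paired data, not the survey data
before  <- rnorm(50, mean = 4, sd = 0.8)
after   <- before + rnorm(50, mean = 1, sd = 1.6)
diff_ab <- after - before            # same direction as mydata$Difference

# Paired t-test (before - after) vs one-sample t-test on after - before
t_paired <- t.test(before, after, paired = TRUE)
t_one    <- t.test(diff_ab, mu = 0)
c(paired = unname(t_paired$statistic), one_sample = unname(t_one$statistic))

# Wilcoxon signed-rank test in both directions
w_paired <- wilcox.test(before, after, paired = TRUE)
w_one    <- wilcox.test(diff_ab, mu = 0)
c(paired = w_paired$p.value, one_sample = w_one$p.value)
```

The t statistics differ only in sign, and the two Wilcoxon V statistics add up to n(n + 1)/2 when there are no ties or zero differences.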
library(effectsize)
cohens_d(mydata$Difference)
## Cohen's d | 95% CI
## ------------------------
## 0.60 | [0.53, 0.66]
interpret_cohens_d(0.60, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
The effect size (Cohen's d = 0.60, 95% CI [0.53, 0.66]) indicates that the increase in the distance (in kilometers) an individual could run is “medium” sized.
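For paired data, Cohen's d on the differences is the mean difference divided by the standard deviation of the differences, which equals the absolute t statistic divided by sqrt(n). The sketch below checks the reported effect size against the paired t-test output under that assumption. The inputs are rounded values copied from the outputs above, and the variable names are our own.

```r
t_stat      <- 18.882   # absolute value of the paired t statistic reported above
n           <- 1000     # number of pairs
mean_change <- 0.9763   # mean of KmAfter - KmBefore (sign flipped from the t-test output)

d       <- t_stat / sqrt(n)    # Cohen's d for the differences: mean(diff) / sd(diff)
sd_diff <- mean_change / d     # implied standard deviation of the differences

round(d, 2)
round(sd_diff, 2)
```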
We wanted to test whether the distance an individual can run is the same before and after taking the medicine, or whether it changes.
First, we tested the normality of the distribution of the differences. The Shapiro-Wilk test did not reject normality (p = 0.3608), so we used a parametric method to test the hypothesis: the paired t-test. We also conducted the non-parametric Wilcoxon signed-rank test for illustration. Because normality of the differences is not rejected, its result is not needed for the conclusion, and it agrees with the t-test (p < 2.2e-16).
The paired t-test gave p < 2.2e-16, so we reject the null hypothesis at p < 0.001: the mean difference is not 0. On average, the distance an individual could run increased by about 0.98 km after taking the medicine.
The increase in the distance (in kilometers) that an individual could run is “medium” sized (Cohen's d = 0.60).