Author: Miha Fabjan
mydata <- read.table("./survey.csv.csv", header=TRUE, sep=";", dec=".")
# Add an ID column to the dataset
mydata <- cbind(ID = 1:nrow(mydata), mydata)
Write the first research question. Perform the statistical hypothesis tests to answer your research question. Perform both the parametric test and the corresponding non-parametric test and explain the results. For the parametric test, check all necessary assumptions. Finally, describe which test (parametric or non-parametric) is more suitable for your particular case and why. Also calculate the effect size and explain it. Finally, answer your research question clearly.
head(mydata, 10) # Show the first 10 units
## ID Gender Age KmBefore KgBefore TimeBefore Medicine1 Medicine2 Medicine3
## 1 1 F 32 4.06 74.7 41.2 No No No
## 2 2 M 37 3.96 76.3 43.9 Yes Yes No
## 3 3 M 43 3.80 91.7 47.9 Yes No No
## 4 4 F 26 5.17 75.4 59.6 No No No
## 5 5 F 36 3.72 77.0 54.9 No Yes Yes
## 6 6 M 37 5.31 93.9 50.6 No Yes Yes
## 7 7 F 35 3.18 75.5 54.4 Yes Yes Yes
## 8 8 F 48 4.65 71.5 47.6 No No Yes
## 9 9 F 41 4.40 79.7 32.8 No No Yes
## 10 10 F 32 2.95 81.5 51.4 No No No
## KmAfter KgAfter TimeAfter SideEffects
## 1 4.37 91.8 61.1 N
## 2 3.09 89.6 69.7 N
## 3 6.26 92.7 49.8 N
## 4 5.81 89.1 60.9 N
## 5 7.80 91.7 60.7 Y
## 6 5.67 87.8 67.9 N
## 7 5.50 95.5 57.8 N
## 8 2.13 92.2 55.0 N
## 9 7.04 91.5 59.7 N
## 10 6.10 90.8 56.3 N
The unit of observation is an individual who goes to the gym and may have taken one or more of the medicines described below. The sample size is 1000 individuals.
Explanation of variables:
Gender: A binary variable that identifies the gender of each individual, coded as Male (M) or Female (F).
Age: A numeric variable representing the age of the individual in years.
Baseline Measurements (Before Medication):
Km Before: The distance (in kilometers) that an individual could run before taking any medicine.
Kg Before: The weight (in kilograms) of the individual before using any medicine.
Time Before: The time (in minutes) it took the individual to run the distance measured in “Km Before.”
Medicines used:
Medicine 1: A binary variable indicating whether the individual used Medicine 1 (Yes or No).
Medicine 2: A binary variable indicating whether the individual used Medicine 2 (Yes or No).
Medicine 3: A binary variable indicating whether the individual used Medicine 3 (Yes or No).
Post-medication measurements:
Km After: The distance (in kilometers) that an individual could run after taking the medicine(s).
Kg After: The weight (in kilograms) of the individual after using the medicine(s).
Time After: The time (in minutes) it took the individual to run the distance measured in “Km After.”
SideEffects: A binary variable that indicates whether the individual experienced any side effects from the medicine(s) (Y for Yes, N for No).
Source of data: https://www.kaggle.com/datasets/saralattarulo/do-we-still-need-gyms
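As a possible preparation step, the categorical variables described above could be converted to factors with readable labels. The sketch below uses a small hypothetical data frame `toy` (three rows with the same coding as the printout above) instead of the full data set; the label names are assumptions, not part of the original analysis.

```r
# Hypothetical three-row excerpt with the same coding as the survey data
toy <- data.frame(
  Gender      = c("F", "M", "M"),
  Medicine1   = c("No", "Yes", "Yes"),
  SideEffects = c("N", "N", "N"),
  stringsAsFactors = FALSE
)

# Convert the coded character columns to factors with explicit levels
toy$Gender <- factor(toy$Gender, levels = c("F", "M"),
                     labels = c("Female", "Male"))
toy$Medicine1 <- factor(toy$Medicine1, levels = c("No", "Yes"))
toy$SideEffects <- factor(toy$SideEffects, levels = c("N", "Y"),
                          labels = c("No", "Yes"))

str(toy)
```

Declaring the levels explicitly keeps both categories in later tables even when one of them does not occur in a subset.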
library(pastecs)
round(stat.desc(mydata[,c(4,10)]), 2)
## KmBefore KmAfter
## nbr.val 1000.00 1000.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.72 0.51
## max 6.28 10.35
## range 4.56 9.84
## sum 4020.63 4996.93
## median 4.02 5.01
## mean 4.02 5.00
## SE.mean 0.02 0.05
## CI.mean.0.95 0.05 0.09
## var 0.60 2.13
## std.dev 0.77 1.46
## coef.var 0.19 0.29
The average distance was 4.02 km before taking the pills and 5.00 km after taking them. Before taking the pills, the distance an individual could run ranged from 1.72 km to 6.28 km. After taking the pills, it ranged from 0.51 km to 10.35 km.
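Several rows of the descriptive table follow directly from the mean, the standard deviation and the sample size. The sketch below recomputes the standard error, the half-width of the 95% confidence interval and the coefficient of variation from the rounded values in the table; because the inputs are rounded, the results only agree with the table to two decimals.

```r
n <- 1000
means <- c(KmBefore = 4.02, KmAfter = 5.00)   # rounded values from the table
sds   <- c(KmBefore = 0.77, KmAfter = 1.46)

se      <- sds / sqrt(n)                  # standard error of the mean
ci_half <- qt(0.975, df = n - 1) * se     # half-width of the 95% CI
cv      <- sds / means                    # coefficient of variation

round(rbind(SE.mean = se, CI.mean.0.95 = ci_half, coef.var = cv), 2)
```

The larger coefficient of variation after the medication (0.29 versus 0.19) reflects that the distances became more spread out relative to their mean.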
Median
Half of the individuals could run up to 4.02 km before taking the pill, while the other half could run more than 4.02 km.
After taking the pill, half of the individuals could run up to 5.01 km, while the other half could run more than 5.01 km.
Research question: Is there a significant difference between the distance (in kilometers) an individual could run before taking the medication and after taking it?
H0: 𝜇km_before = 𝜇km_after / 𝜇d = 0
H1: 𝜇km_before ≠ 𝜇km_after / 𝜇d ≠ 0
mydata$Difference <- mydata$KmAfter - mydata$KmBefore
library(ggplot2)
ggplot(mydata, aes(x = Difference)) +
geom_histogram(position = "identity", binwidth = 0.5, colour = "black",fill = "magenta") +
ylab("Frequency") +
xlab("Difference")
H0: The differences are normally distributed.
H1: The differences are not normally distributed.
shapiro.test(mydata$Difference)
##
## Shapiro-Wilk normality test
##
## data: mydata$Difference
## W = 0.99817, p-value = 0.3608
The Shapiro-Wilk normality test gives p = 0.3608, so we cannot reject H0 at the 5% level. There is no evidence against normality of the differences, so the parametric method is valid in our case.
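A normal Q-Q plot is a common visual complement to the Shapiro-Wilk test. The sketch below uses simulated values (`diff_sim`, a hypothetical stand-in for `mydata$Difference`), so its numbers are not the results of this analysis.

```r
set.seed(123)
# Simulated stand-in for mydata$Difference (hypothetical values)
diff_sim <- rnorm(1000, mean = 1, sd = 1.6)

# Formal test: a p-value above 0.05 means normality is not rejected
sw <- shapiro.test(diff_sim)
sw$p.value

# Visual check: points close to the reference line suggest normality
qqnorm(diff_sim, main = "Normal Q-Q plot of the differences")
qqline(diff_sim, col = "red")
```

With a sample as large as 1000 the formal test can flag even small departures, so looking at the plot alongside the p-value is a reasonable habit.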
t.test(mydata$KmBefore, mydata$KmAfter,
paired = TRUE,
alternative = "two.sided")
##
## Paired t-test
##
## data: mydata$KmBefore and mydata$KmAfter
## t = -18.882, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.0777647 -0.8748353
## sample estimates:
## mean difference
## -0.9763
H0: 𝜇km_before = 𝜇km_after / 𝜇d = 0
H1: 𝜇km_before ≠ 𝜇km_after / 𝜇d ≠ 0
Since the paired t-test gives a p-value below 2.2e-16, we reject H0 at p < 0.001.
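The paired t-test is equivalent to a one-sample t-test on the differences, with t = mean(d) / (sd(d) / sqrt(n)). Note that `t.test(x, y, paired = TRUE)` works with x - y, which is why the mean difference above is negative although the distance increased. The sketch below illustrates this on simulated data; `before` and `after` are hypothetical vectors, not the survey columns.

```r
set.seed(1)
# Hypothetical paired measurements
before <- rnorm(50, mean = 4, sd = 0.8)
after  <- before + rnorm(50, mean = 1, sd = 1.6)

# Paired test and the equivalent one-sample test on before - after
paired     <- t.test(before, after, paired = TRUE)
one_sample <- t.test(before - after, mu = 0)

# The same statistic computed by hand
d        <- before - after
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

c(paired = unname(paired$statistic),
  one_sample = unname(one_sample$statistic),
  manual = t_manual)
```

All three values coincide, and swapping the order of the two vectors only flips the sign of t and of the confidence interval.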
For illustration, we also perform the corresponding non-parametric test (Wilcoxon signed-rank test):
wilcox.test(mydata$KmBefore, mydata$KmAfter,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon signed rank test
##
## data: mydata$KmBefore and mydata$KmAfter
## V = 100234, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
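The statistic V reported above is the sum of the ranks of the positive differences (first vector minus second vector), where ranks are assigned to the absolute differences. The sketch below reproduces this on six hypothetical pairs; `x` and `y` are made-up values, not the survey data.

```r
# Hypothetical paired values ("before" and "after")
x <- c(4.1, 3.9, 3.8, 5.2, 3.7, 5.3)
y <- c(4.4, 3.1, 6.3, 5.8, 7.8, 5.7)

d <- x - y                      # differences, in the same order as wilcox.test
ranks <- rank(abs(d))           # ranks of the absolute differences
V_manual <- sum(ranks[d > 0])   # sum of ranks belonging to positive differences

V_test <- unname(wilcox.test(x, y, paired = TRUE)$statistic)
c(manual = V_manual, wilcox.test = V_test)
```

A V that is small compared with its maximum n(n + 1)/2 means that most differences are negative, that is, the second measurement tends to be larger.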
library(effectsize)
cohens_d(mydata$Difference)
## Cohen's d | 95% CI
## ------------------------
## 0.60 | [0.53, 0.66]
interpret_cohens_d(0.60, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
The increase in distance (in kilometers) that an individual could run is “medium” sized.
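For paired data, Cohen's d is the mean difference divided by the standard deviation of the differences, which equals |t| / sqrt(n). The short check below uses only the t value and sample size reported by the paired t-test above.

```r
t_value <- -18.882   # t statistic from the paired t-test output
n <- 1000            # number of pairs

# Cohen's d for paired data: mean(d) / sd(d) = |t| / sqrt(n)
d_from_t <- abs(t_value) / sqrt(n)
round(d_from_t, 2)
```

The result rounds to 0.6, in line with the value returned by `cohens_d()`.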
First we tested the normality of the distribution of the differences; the Shapiro-Wilk test gave no evidence against normality. This leads us to use the parametric method for testing the hypothesis, the paired t-test. We also conducted the non-parametric Wilcoxon signed-rank test, which leads to the same conclusion (p < 0.001), but since the differences are normally distributed, the paired t-test is the more suitable test in our case.
Since the paired t-test gave a p-value below 2.2e-16, we reject the null hypothesis at p < 0.001, which means that the mean difference between the distance before and after taking the medication is not 0.
On average, individuals could run 0.98 km farther after taking the medication. This increase in distance is "medium" sized (Cohen's d = 0.60).
Write the second research question. Using two numerical variables from your dataset, calculate the appropriate correlation coefficient and explain it. Justify your decision. Perform the appropriate statistical test and interpret the result obtained. Answer your research question clearly
head(mydata)
## ID Gender Age KmBefore KgBefore TimeBefore Medicine1 Medicine2 Medicine3
## 1 1 F 32 4.06 74.7 41.2 No No No
## 2 2 M 37 3.96 76.3 43.9 Yes Yes No
## 3 3 M 43 3.80 91.7 47.9 Yes No No
## 4 4 F 26 5.17 75.4 59.6 No No No
## 5 5 F 36 3.72 77.0 54.9 No Yes Yes
## 6 6 M 37 5.31 93.9 50.6 No Yes Yes
## KmAfter KgAfter TimeAfter SideEffects Difference
## 1 4.37 91.8 61.1 N 0.31
## 2 3.09 89.6 69.7 N -0.87
## 3 6.26 92.7 49.8 N 2.46
## 4 5.81 89.1 60.9 N 0.64
## 5 7.80 91.7 60.7 Y 4.08
## 6 5.67 87.8 67.9 N 0.36
Research question: Is there a correlation between the body weight (in kilograms) of individuals before taking the pills and after taking the pills?
H0: ρ_KgBefore,KgAfter = 0
H1: ρ_KgBefore,KgAfter ≠ 0
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[,c(5,11)], smooth=FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(5,11)]),
type = "pearson")
## KgBefore KgAfter
## KgBefore 1.00 0.06
## KgAfter 0.06 1.00
##
## n= 1000
##
##
## P
## KgBefore KgAfter
## KgBefore 0.043
## KgAfter 0.043
We reject H0, since the p-value (0.043) is below 0.05.
The correlation between KgBefore and KgAfter is statistically significant. The linear relationship is positive and very weak (r = 0.06).
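The p-value for a Pearson correlation comes from the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below checks this against `cor.test()` on simulated data; `kg_before` and `kg_after` are hypothetical vectors, so the numbers differ from the results above.

```r
set.seed(42)
n <- 200
# Hypothetical weights with a weak positive relationship
kg_before <- rnorm(n, mean = 80, sd = 8)
kg_after  <- 85 + 0.1 * kg_before + rnorm(n, mean = 0, sd = 5)

r <- cor(kg_before, kg_after)

# Test statistic and two-sided p-value computed by hand
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

res <- cor.test(kg_before, kg_after, method = "pearson")
c(t_manual = t_manual, t_cor.test = unname(res$statistic),
  p_manual = p_manual, p_cor.test = res$p.value)
```

Because the statistic grows with sqrt(n - 2), a very weak correlation can still be statistically significant in a sample of 1000, which is why the size of r should be reported alongside the p-value.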
Write the third research question. Using two categorical variables, perform the Pearson Chi2 test. Make sure that the necessary assumptions are met. Write down the null hypothesis and the alternative hypothesis as well as your findings based on the p-value of the test. Show empirical and theoretical frequencies and explain them. Also calculate the standardized residuals and interpret them. Calculate the effect size. Answer your research question clearly
Research question: Does the occurrence of side effects from the pills depend on gender?
We have the following hypotheses:
H0: There is no association between Gender and SideEffects.
H1: There is an association between Gender and SideEffects.
head(mydata)
## ID Gender Age KmBefore KgBefore TimeBefore Medicine1 Medicine2 Medicine3
## 1 1 F 32 4.06 74.7 41.2 No No No
## 2 2 M 37 3.96 76.3 43.9 Yes Yes No
## 3 3 M 43 3.80 91.7 47.9 Yes No No
## 4 4 F 26 5.17 75.4 59.6 No No No
## 5 5 F 36 3.72 77.0 54.9 No Yes Yes
## 6 6 M 37 5.31 93.9 50.6 No Yes Yes
## KmAfter KgAfter TimeAfter SideEffects Difference
## 1 4.37 91.8 61.1 N 0.31
## 2 3.09 89.6 69.7 N -0.87
## 3 6.26 92.7 49.8 N 2.46
## 4 5.81 89.1 60.9 N 0.64
## 5 7.80 91.7 60.7 Y 4.08
## 6 5.67 87.8 67.9 N 0.36
results <- chisq.test(mydata$Gender, mydata$SideEffects,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$Gender and mydata$SideEffects
## X-squared = 3.0973, df = 1, p-value = 0.07842
We cannot reject H0, since the p-value (0.078) is above 0.05.
addmargins(results$observed)
## mydata$SideEffects
## mydata$Gender N Y Sum
## F 354 93 447
## M 463 90 553
## Sum 817 183 1000
354 of the 447 females did not get any side effects from taking any kind of medication.
93 of the 447 females got side effects from taking any kind of medication.
463 of the 553 males did not get any side effects from taking any kind of medication.
90 of the 553 males got side effects from taking any kind of medication.
round(results$expected, 2)
## mydata$SideEffects
## mydata$Gender N Y
## F 365.2 81.8
## M 451.8 101.2
The theoretical (expected) frequencies are shown. All of them are greater than 5, so the second assumption (all expected frequencies greater than 5) is met.
If there were no association, we would expect 365.2 females to get no side effects from taking any kind of medication.
If there were no association, we would expect 81.8 females to get side effects from taking any kind of medication.
If there were no association, we would expect 451.8 males to get no side effects from taking any kind of medication.
If there were no association, we would expect 101.2 males to get side effects from taking any kind of medication.
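Each expected frequency is the row total times the column total divided by the sample size, for example 447 * 817 / 1000 = 365.2 for females without side effects. The sketch below rebuilds the observed table from the counts shown above and recomputes the expected frequencies and the Yates-corrected test statistic by hand.

```r
# Observed counts taken from the contingency table above
observed <- matrix(c(354, 463, 93, 90), nrow = 2,
                   dimnames = list(Gender = c("F", "M"),
                                   SideEffects = c("N", "Y")))
n <- sum(observed)

# Expected frequency = row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 1)

# Chi-squared statistic with Yates' continuity correction
yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(yates, 4)   # 3.0973, the X-squared value reported by chisq.test
```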
round(results$residuals, 2)
## mydata$SideEffects
## mydata$Gender N Y
## F -0.59 1.24
## M 0.53 -1.11
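None of the residuals exceeds 1.96 in absolute value, so none of the combinations of gender and side effects deviates significantly from its expected frequency.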
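The object returned by `chisq.test()` contains two kinds of residuals: `residuals` (Pearson residuals, (observed - expected) / sqrt(expected)) and `stdres` (standardized residuals, which additionally adjust for the row and column proportions). The abbreviation `results$res` resolves to `residuals` through partial matching. The sketch below rebuilds the table from the counts above and prints both.

```r
# Observed counts taken from the contingency table above
observed <- matrix(c(354, 463, 93, 90), nrow = 2,
                   dimnames = list(Gender = c("F", "M"),
                                   SideEffects = c("N", "Y")))
test <- chisq.test(observed, correct = TRUE)

# Pearson residuals: (observed - expected) / sqrt(expected)
round(test$residuals, 2)

# Standardized residuals, usually compared against +/- 1.96
round(test$stdres, 2)
```

For this table both sets of residuals stay below 1.96 in absolute value, so the interpretation does not depend on which one is used.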
library(effectsize)
effectsize::cramers_v(mydata$Gender, mydata$SideEffects)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.05 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.05)
## [1] "very small"
## (Rules: funder2019)
The effect size is very small.
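Cramér's V can also be computed by hand as sqrt(chi-squared / (n * (k - 1))), where k is the smaller of the number of rows and columns. The sketch below does this for the observed table, together with a bias-corrected version. The correction formula is an assumption about what the "(adj.)" label refers to (a commonly used small-sample correction), not a statement about the internals of the effectsize package.

```r
# Observed counts taken from the contingency table above
observed <- matrix(c(354, 463, 93, 90), nrow = 2,
                   dimnames = list(Gender = c("F", "M"),
                                   SideEffects = c("N", "Y")))
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Uncorrected chi-squared statistic (no continuity correction)
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Plain Cramer's V
v_raw <- sqrt(chi2 / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed form of the adjustment)
phi2_adj <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

Under these assumptions the adjusted value rounds to 0.05, the same as the reported effect size.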
We cannot say that there is a statistically significant association between gender and side effects from taking the pills (χ² = 3.10, df = 1, p = 0.078). The effect size is very small (Cramér's V = 0.05).