Homework assignment 1 at course MVA

Author: Miha Fabjan

1 Data import

mydata <- read.table("./survey.csv.csv", header=TRUE, sep=";", dec=".")

# Add an ID column to the dataset
mydata <- cbind(ID = 1:nrow(mydata), mydata)

2 Data description

head(mydata, 10) # Show the first 10 units

##    ID Gender Age KmBefore KgBefore TimeBefore Medicine1 Medicine2 Medicine3
## 1   1      F  32     4.06     74.7       41.2        No        No        No
## 2   2      M  37     3.96     76.3       43.9       Yes       Yes        No
## 3   3      M  43     3.80     91.7       47.9       Yes        No        No
## 4   4      F  26     5.17     75.4       59.6        No        No        No
## 5   5      F  36     3.72     77.0       54.9        No       Yes       Yes
## 6   6      M  37     5.31     93.9       50.6        No       Yes       Yes
## 7   7      F  35     3.18     75.5       54.4       Yes       Yes       Yes
## 8   8      F  48     4.65     71.5       47.6        No        No       Yes
## 9   9      F  41     4.40     79.7       32.8        No        No       Yes
## 10 10      F  32     2.95     81.5       51.4        No        No        No
##    KmAfter KgAfter TimeAfter SideEffects
## 1     4.37    91.8      61.1           N
## 2     3.09    89.6      69.7           N
## 3     6.26    92.7      49.8           N
## 4     5.81    89.1      60.9           N
## 5     7.80    91.7      60.7           Y
## 6     5.67    87.8      67.9           N
## 7     5.50    95.5      57.8           N
## 8     2.13    92.2      55.0           N
## 9     7.04    91.5      59.7           N
## 10    6.10    90.8      56.3           N

The unit of observation is an individual who goes to the gym and may have taken one or more of the medicines listed below. The sample size is 1000 individuals.

Explanation of variables:

Gender: A binary variable that identifies the gender of each individual, coded as Male (M) or Female (F).

Age: A numeric variable representing the age of the individual in years.

Baseline measurements (before medication):
- KmBefore: The distance (in kilometers) that the individual could run before taking any medicine.
- KgBefore: The weight (in kilograms) of the individual before taking any medicine.
- TimeBefore: The time (in minutes) it took the individual to run the distance recorded in KmBefore.

Medicines used:
1. Medicine1: A binary variable indicating whether the individual used Medicine 1 (Yes or No).
2. Medicine2: A binary variable indicating whether the individual used Medicine 2 (Yes or No).
3. Medicine3: A binary variable indicating whether the individual used Medicine 3 (Yes or No).

Post-medication measurements:
- KmAfter: The distance (in kilometers) that the individual could run after taking the medicine(s).
- KgAfter: The weight (in kilograms) of the individual after taking the medicine(s).
- TimeAfter: The time (in minutes) it took the individual to run the distance recorded in KmAfter.

SideEffects: A binary variable that indicates whether the individual experienced any side effects from the medicine(s) (Y for Yes, N for No).
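The categorical variables described above are stored as text codes, so it can help to convert them to labelled factors before analysis. The following is a minimal sketch on a small toy data frame; the data frame `toy`, its values, and the new column names ending in `F` are hypothetical and only mimic the coding of the real columns.

```r
# Toy data frame mimicking the coding of the categorical columns (hypothetical values)
toy <- data.frame(
  Gender      = c("F", "M", "M", "F"),
  Medicine1   = c("No", "Yes", "Yes", "No"),
  SideEffects = c("N", "N", "Y", "N"),
  stringsAsFactors = FALSE
)

# Convert the text codes into labelled factors
toy$GenderF      <- factor(toy$Gender, levels = c("M", "F"), labels = c("Male", "Female"))
toy$Medicine1F   <- factor(toy$Medicine1, levels = c("No", "Yes"))
toy$SideEffectsF <- factor(toy$SideEffects, levels = c("N", "Y"), labels = c("No", "Yes"))

# Frequency table of one of the new factors
table(toy$GenderF)
```

The same pattern could be applied to the corresponding columns of mydata, assuming they use exactly these codes.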

Source of data: https://www.kaggle.com/datasets/saralattarulo/do-we-still-need-gyms

3.1 Research question 1: XX

H0: 𝜇km_before = 𝜇km_after (equivalently, 𝜇d = 0)

H1: 𝜇km_before ≠ 𝜇km_after (equivalently, 𝜇d ≠ 0)

library(pastecs)
round(stat.desc(mydata[, c("KmBefore", "KmAfter")]), 2) # Descriptive statistics, selected by column name

##              KmBefore KmAfter
## nbr.val       1000.00 1000.00
## nbr.null         0.00    0.00
## nbr.na           0.00    0.00
## min              1.72    0.51
## max              6.28   10.35
## range            4.56    9.84
## sum           4020.63 4996.93
## median           4.02    5.01
## mean             4.02    5.00
## SE.mean          0.02    0.05
## CI.mean.0.95     0.05    0.09
## var              0.60    2.13
## std.dev          0.77    1.46
## coef.var         0.19    0.29

3.1.1 Analysis

mydata$Difference <- mydata$KmAfter - mydata$KmBefore

library(ggplot2)
ggplot(mydata, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 0.5, colour = "black") +
  ylab("Frequency") +
  xlab("Difference")

H0: The differences are normally distributed.

H1: The differences are not normally distributed.

shapiro.test(mydata$Difference)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Difference
## W = 0.99817, p-value = 0.3608

The Shapiro-Wilk normality test gives p = 0.3608, so we cannot reject the null hypothesis that the differences are normally distributed. A parametric method is therefore appropriate in our case.
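With a sample this large, a normal Q-Q plot is often used alongside the Shapiro-Wilk test as a visual check. The sketch below uses simulated differences rather than the survey data; the vector `sim_diff`, its mean and standard deviation, and the seed are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated differences standing in for mydata$Difference (hypothetical values)
sim_diff <- rnorm(1000, mean = 1, sd = 1.6)

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(sim_diff, main = "Normal Q-Q plot of simulated differences")
qqline(sim_diff)

# Shapiro-Wilk test on the same simulated data
sw <- shapiro.test(sim_diff)
sw$p.value
```

On the real data, the same two plotting calls could be run on mydata$Difference.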

Parametric method paired t-test:

t.test(mydata$KmBefore, mydata$KmAfter,
       paired = TRUE,
       alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  mydata$KmBefore and mydata$KmAfter
## t = -18.882, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -1.0777647 -0.8748353
## sample estimates:
## mean difference 
##         -0.9763

H0: 𝜇km_before = 𝜇km_after (equivalently, 𝜇d = 0)

H1: 𝜇km_before ≠ 𝜇km_after (equivalently, 𝜇d ≠ 0)

Since the paired t-test gives p < 2.2e-16, we reject the null hypothesis at p < 0.001: the mean difference between KmBefore and KmAfter is not 0.
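A paired t-test is equivalent to a one-sample t-test on the within-person differences, so the statistic can be reproduced by hand. The sketch below does this on simulated data, not on the survey data; the vectors `before` and `after`, the sample size of 50, and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated paired measurements (hypothetical values)
before <- rnorm(50, mean = 4, sd = 0.8)
after  <- before + rnorm(50, mean = 1, sd = 1.6)

# Differences in the same order that t.test(before, after, paired = TRUE) uses
d <- before - after
n <- length(d)

# t statistic: mean difference divided by its standard error
t_manual <- mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Compare with the built-in paired t-test
res <- t.test(before, after, paired = TRUE, alternative = "two.sided")
all.equal(t_manual, unname(res$statistic))
all.equal(p_manual, res$p.value)
```

Because the differences here are taken as before minus after, an improvement after medication shows up as a negative t statistic, which matches the sign convention of the output above.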

For illustration only, we also perform the non-parametric alternative (Wilcoxon signed-rank test):

wilcox.test(mydata$KmBefore, mydata$KmAfter, 
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$KmBefore and mydata$KmAfter
## V = 100234, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The non-parametric test leads to the same conclusion: p < 0.001, so the null hypothesis of no location shift is rejected.

library(effectsize)
cohens_d(mydata$Difference)

## Cohen's d |       95% CI
## ------------------------
## 0.60      | [0.53, 0.66]

interpret_cohens_d(0.60, rules = "sawilowsky2009")

## [1] "medium"
## (Rules: sawilowsky2009)

The increase in the distance (in kilometers) that an individual could run is of medium size (Cohen's d = 0.60).
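As a consistency check, the effect size can be recovered from the paired t-test output reported earlier. This assumes that cohens_d() applied to a single vector returns the one-sample effect size, the mean of the differences divided by their standard deviation; the symbols \bar{D} and s_D for that mean and standard deviation are introduced here only for the derivation.

```latex
\[
d = \frac{\bar{D}}{s_D},
\qquad
t = \frac{\bar{D}}{s_D / \sqrt{n}}
\quad\Longrightarrow\quad
|d| = \frac{|t|}{\sqrt{n}}
    = \frac{18.882}{\sqrt{1000}}
    \approx \frac{18.882}{31.623}
    \approx 0.597 \approx 0.60
\]
```

The sign is positive for Cohen's d because Difference is defined as KmAfter minus KmBefore, whereas the t-test subtracts in the opposite order.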

3.1.2 Conclusion

We wanted to test whether the distance an individual can run changes after taking the medicine.

First we tested the normality of the distribution of the differences. The Shapiro-Wilk test did not reject normality (p = 0.3608), so we used a parametric method, the paired t-test (discussed in the next paragraph). We also conducted the non-parametric Wilcoxon signed-rank test for illustration, but because the differences are approximately normally distributed, its result is not needed for the conclusion.

Since the paired t-test gave p < 2.2e-16, we reject the null hypothesis at p < 0.001. The mean difference between KmBefore and KmAfter is not 0: the distance an individual can run increased, on average by about 0.98 km.

The increase in the distance (in kilometers) that an individual could run is of medium size (Cohen's d = 0.60).

Date: 2024-12-12