library(rmarkdown)
library(knitr)
library(lsr)
Loading the ais.csv dataset:
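ais <- read.csv("ais.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE) # R >= 4.0 reads strings as character by default; droplevels() and the chi-square tests below need factors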
rmarkdown::paged_table(ais)
Compute the z-scores for athlete 163 (x = 163) for the following variables: body mass index, percentage body fat, lean body mass and height.
A z-score is calculated as z = (x - mean) / sd, where mean and sd are the mean and standard deviation of the variable.
Task 1a
So, first calculate the z-scores and then look at the z-scores at x = 163 (or the 163rd row).
bmi_z <- (ais$bmi - mean(ais$bmi)) / sd(ais$bmi) # z-score of body mass index
pcBfat_z <- (ais$pcBfat - mean(ais$pcBfat)) / sd(ais$pcBfat) # z-score of percentage body fat
lbm_z <- (ais$lbm - mean(ais$lbm)) / sd(ais$lbm) # z-score of lean body mass
ht_z <- (ais$ht - mean(ais$ht)) / sd(ais$ht) # z-score of height
athlete_163 <- data.frame(data = c(bmi_z[163], pcBfat_z[163], lbm_z[163], ht_z[163]), row.names = c('bmi', 'pcBfat', 'lbm', 'ht'))
rmarkdown::paged_table(athlete_163)
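The same z-scores can also be obtained in one step with base R's scale(), which centers each column on its mean and divides by its standard deviation. A minimal sketch on a small made-up data frame (the name `toy` and its values are illustrative assumptions, not rows from ais.csv):

```r
# Hypothetical toy data standing in for two ais columns (values are made up)
toy <- data.frame(
  bmi = c(20.5, 22.1, 24.3, 26.0, 31.9),
  ht  = c(170.2, 175.0, 180.4, 185.1, 190.3)
)

# scale() returns a matrix holding (x - mean) / sd for every column
toy_z <- scale(toy)

# z-scores of the 5th row for all variables at once
toy_z[5, ]

# Check against the manual formula used above
manual_bmi_z <- (toy$bmi - mean(toy$bmi)) / sd(toy$bmi)
all.equal(unname(toy_z[5, "bmi"]), manual_bmi_z[5])
```

With the real data, something like scale(ais[, c('bmi', 'pcBfat', 'lbm', 'ht')])[163, ] should give the four values in one call.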
Task 1b
Athlete 163’s BMI is 4 standard deviations above the mean, his lean body mass is ~3 standard deviations above the mean, and his height is ~1 standard deviation above the mean. His percentage body fat is close to the mean.
Get the mean values for all numeric variables for athletes that engaged in the sport rowing
Rowing is indicated by Row in the sport column of the dataset.
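str(ais) shows that rcc, wcc, hc, hg, bmi, ssf, pcBfat, lbm, ht and wt are numeric variables. X and ferr also pass is.numeric(), so they appear in the tables below; X looks like a row index rather than a measurement.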
Task 2a
Creating a data frame with athletes who row, and athletes who don’t:
ais_Rowing <- subset(ais, sport == 'Row') # athletes who row
ais_Rowing_num <- Filter(is.numeric, ais_Rowing) # keep numeric columns only
ais_Rowing_num_means <- sapply(ais_Rowing_num, mean)
ais_notRowing <- subset(ais, sport != 'Row') # all other athletes
ais_notRowing_num <- Filter(is.numeric, ais_notRowing)
ais_notRowing_num_means <- sapply(ais_notRowing_num, mean)
ais_Rowing_notRowing <- data.frame(rbind(ais_Rowing_num_means, ais_notRowing_num_means), row.names = c('Rowing', 'Not rowing'))
rmarkdown::paged_table(ais_Rowing_notRowing)
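Calculating 95% confidence intervals (conf = 0.95) with ciMean(). Note that the input here is the two-row table of group means, so each interval is based on only two values (the Rowing mean and the Not rowing mean):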
ciMean(ais_Rowing_notRowing, conf = 0.95)
## 2.5% 97.5%
## X -207.800301 381.415862
## rcc 4.385653 5.019943
## wcc 5.336817 8.712651
## hc 43.005820 43.169200
## hg 14.457377 14.686734
## ferr 36.838060 113.110670
## bmi 18.948984 27.383427
## ssf 12.860043 131.079155
## pcBfat -1.826580 30.451134
## lbm 51.910977 79.197224
## ht 163.316306 198.653916
## wt 48.922333 103.831869
Task 2b
If the number of respondents increases, the standard error (sd / sqrt(n)) decreases, so the boundaries of the confidence interval move closer together.
Task 2c
If the variance increases, the standard error increases too. A larger standard error means that sample means vary more from sample to sample, so the confidence interval widens and the sample mean becomes a less precise estimate of the true mean.
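The effect of sample size and variance on the interval can be checked numerically. The sketch below assumes a t-based confidence interval for a mean; the helper name `ci_half_width` and the input values are made up for illustration and are not part of the lsr package:

```r
# Half-width of a t-based confidence interval for a mean:
# qt(1 - (1 - conf) / 2, n - 1) * sd / sqrt(n)
# (hypothetical helper, assuming the interval is built from the t distribution)
ci_half_width <- function(sd, n, conf = 0.95) {
  qt(1 - (1 - conf) / 2, df = n - 1) * sd / sqrt(n)
}

# Same spread, more respondents: the interval narrows
ci_half_width(sd = 10, n = 25)
ci_half_width(sd = 10, n = 100)

# Same sample size, larger spread: the interval widens
ci_half_width(sd = 20, n = 25)
```

Quadrupling n roughly halves the width, while doubling sd doubles it.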
Compute the critical value and the GOF value for the sports practiced by males and females. For the critical value assume that p = 0.05 is considered significant. Furthermore, exclude the athletes who practice Gym, Netball and Waterpolo (W_Polo) from your analysis.
First dividing the data into two data frames: males and females, excluding those who practice Gym, Netball and Waterpolo.
excluded <- c('Gym', 'Netball', 'W_Polo') # sports left out of the analysis
aisMales <- subset(ais, sex == 'm')
aisMalesExcl <- subset(aisMales, !(sport %in% excluded))
aisMalesExcl <- droplevels(aisMalesExcl$sport) # factor of sports with unused levels dropped
aisFemales <- subset(ais, sex != 'm')
aisFemalesExcl <- subset(aisFemales, !(sport %in% excluded))
aisFemalesExcl <- droplevels(aisFemalesExcl$sport)
aisExcl <- subset(ais, !(sport %in% excluded))
aisExcl <- droplevels(aisExcl)
Task 3a
h0: The distribution of athletes over the sports is the same for males and females (sport and sex are independent).
h1: The distribution of athletes over the sports differs between males and females.
Task 3b
goodnessOfFitTest(aisMalesExcl)
##
## Chi-square test against specified probabilities
##
## Data variable: aisMalesExcl
##
## Hypotheses:
## null: true probabilities are as specified
## alternative: true probabilities differ from those specified
##
## Descriptives:
## observed freq. expected freq. specified prob.
## B_Ball 12 12.14286 0.1428571
## Field 12 12.14286 0.1428571
## Row 15 12.14286 0.1428571
## Swim 13 12.14286 0.1428571
## T_400m 18 12.14286 0.1428571
## T_Sprnt 11 12.14286 0.1428571
## Tennis 4 12.14286 0.1428571
##
## Test results:
## X-squared statistic: 9.129
## degrees of freedom: 6
## p-value: 0.166
goodnessOfFitTest(aisFemalesExcl)
##
## Chi-square test against specified probabilities
##
## Data variable: aisFemalesExcl
##
## Hypotheses:
## null: true probabilities are as specified
## alternative: true probabilities differ from those specified
##
## Descriptives:
## observed freq. expected freq. specified prob.
## B_Ball 13 10.42857 0.1428571
## Field 7 10.42857 0.1428571
## Row 22 10.42857 0.1428571
## Swim 9 10.42857 0.1428571
## T_400m 11 10.42857 0.1428571
## T_Sprnt 4 10.42857 0.1428571
## Tennis 7 10.42857 0.1428571
##
## Test results:
## X-squared statistic: 19.918
## degrees of freedom: 6
## p-value: 0.003
associationTest(data = aisExcl, formula = ~ sport + sex)
##
## Chi-square test of categorical association
##
## Variables: sport, sex
##
## Hypotheses:
## null: variables are independent of one another
## alternative: some contingency exists between variables
##
## Observed contingency table:
## sex
## sport f m
## B_Ball 13 12
## Field 7 12
## Row 22 15
## Swim 9 13
## T_400m 11 18
## T_Sprnt 4 11
## Tennis 7 4
##
## Expected contingency table under the null hypothesis:
## sex
## sport f m
## B_Ball 11.55 13.45
## Field 8.78 10.22
## Row 17.09 19.91
## Swim 10.16 11.84
## T_400m 13.40 15.60
## T_Sprnt 6.93 8.07
## Tennis 5.08 5.92
##
## Test results:
## X-squared statistic: 8.318
## degrees of freedom: 6
## p-value: 0.216
##
## Other information:
## estimated effect size (Cramer's v): 0.229
The GOF statistics are as follows:
Males: 9.129
Females: 19.918
Levels of significance:
Males: 0.166 (not significant)
Females: 0.003 (significant)
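The critical value asked for in the task can be computed with base R's qchisq(). A minimal sketch, assuming alpha = 0.05 and the 6 degrees of freedom reported in the test output; the variable names are illustrative:

```r
# Critical chi-square value for alpha = 0.05 with 6 degrees of freedom
# (7 sport categories - 1, as reported in the test output)
alpha <- 0.05
dof <- 6
critical_value <- qchisq(1 - alpha, df = dof)
critical_value  # approximately 12.59

# GOF statistics copied from the goodnessOfFitTest() output
gof <- c(Males = 9.129, Females = 19.918)

# TRUE where the statistic exceeds the critical value (significant at alpha)
gof > critical_value
```

A statistic above the critical value corresponds to a p-value below 0.05, which matches the significance levels listed above.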
Task 3c
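h0 is not rejected. The association test, which compares the distribution of sports between males and females, gives X-squared = 8.318 with 6 degrees of freedom and p = 0.216, which is not significant at p = 0.05. The significant GOF result for females (p = 0.003) only shows that female athletes are not spread evenly over the seven sports; it does not test the difference between the sexes.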
Task 3d
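The Cramer’s V is 0.229. Under the rule of thumb that a Cramer’s V between 0.150 and 0.250 indicates a strong association, this points to a strong association between the sex of an athlete and their choice of sport, although the association test itself is not significant (p = 0.216).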
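The reported effect size can be reproduced by hand from the observed contingency table. The sketch below copies the counts from the associationTest() output and uses base R's chisq.test(); the variable names are illustrative:

```r
# Observed sport-by-sex counts, copied from the associationTest() output
observed <- matrix(
  c(13, 12,
     7, 12,
    22, 15,
     9, 13,
    11, 18,
     4, 11,
     7,  4),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    sport = c("B_Ball", "Field", "Row", "Swim", "T_400m", "T_Sprnt", "Tennis"),
    sex = c("f", "m")
  )
)

# Pearson chi-square statistic (the continuity correction only applies to 2x2 tables)
x2 <- unname(chisq.test(observed)$statistic)

# Cramer's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(observed)
cramers_v <- sqrt(x2 / (n * (min(dim(observed)) - 1)))
round(c(x_squared = x2, cramers_v = cramers_v), 3)
```

Because the table has only two columns, min(rows, cols) - 1 equals 1 and V reduces to sqrt(X^2 / n).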
Task 4
Give an occasion in which you would choose to use a Fisher exact test (max: 100 words).
A Fisher’s exact test of independence can be used in place of a chi-square test in cases of small samples, to show association. The Fisher’s exact test is used with two types of classifications and a sample divided into two groups to form a 2x2 table. Two classifications can, for example, be ‘willing to drink alcohol at a party’ and ‘not willing to drink alcohol at a party’, while the group division can be ‘male’ and ‘female’. A Fisher’s test gives an exact P-value, while a chi-square test gives an approximation.
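As an illustration, base R provides fisher.test() for this kind of 2x2 table. The counts below are made up for the party example and do not come from any dataset:

```r
# Hypothetical 2x2 table of counts (made-up numbers for illustration only)
party <- matrix(
  c(3, 1,
    1, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("male", "female"),
    alcohol = c("willing", "not willing")
  )
)

# Fisher's exact test derives the p-value from the hypergeometric distribution
# instead of relying on the chi-square approximation
result <- fisher.test(party)
result$p.value
```

With only eight observations, every expected count is 2, a situation in which the chi-square approximation is usually considered unreliable.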
Task 5
You know why one would use a Fisher Exact test, can you explain why we had to exclude all the athletes who practice Gym, Netball and Waterpolo from our analysis?
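The chi-square test relies on an approximation that becomes unreliable when cells of the contingency table have very low counts. Gym, Netball and Waterpolo would produce such cells: in this dataset each appears to be practiced by athletes of only one sex, so the other sex would have a count of zero.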