library(rmarkdown)
library(knitr)
library(lsr)

Loading the ais.csv dataset:

ais <- read.csv("ais.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE) # keep sport and sex as factors; R >= 4.0 no longer does this by default
rmarkdown::paged_table(ais)

Task 1

Compute the z-scores for athlete 163 (x = 163) for the following variables: body mass index, percentage body fat, lean body mass and height.

Z-score can be calculated by: z = (x - mean) / sd

Task 1a

So, first calculate the z-scores and then look at the z-scores at x = 163 (or the 163rd row).

bmi_z <- (ais$bmi - mean(ais$bmi)) / sd(ais$bmi) # z-score of body mass index
pcBfat_z <- (ais$pcBfat - mean(ais$pcBfat)) / sd(ais$pcBfat) # z-score of percentage body fat
lbm_z <- (ais$lbm - mean(ais$lbm)) / sd(ais$lbm) # z-score of lean body mass
ht_z <- (ais$ht - mean(ais$ht)) / sd(ais$ht) # z-score of height

athlete_163 <- data.frame(data = c(bmi_z[163], pcBfat_z[163], lbm_z[163], ht_z[163]), row.names = c('bmi', 'pcBfat', 'lbm', 'ht'))

rmarkdown::paged_table(athlete_163)
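As a cross-check, the manual formula can be compared against base R's scale(), which by default centers on the mean and divides by the sample standard deviation. The sketch below uses a small made-up vector (toy_bmi is hypothetical, not taken from ais.csv):

```r
# Hypothetical values, for illustration only (not from ais.csv)
toy_bmi <- c(20.5, 22.1, 23.4, 24.0, 26.8, 31.2)

# Manual z-scores: (x - mean) / sd
z_manual <- (toy_bmi - mean(toy_bmi)) / sd(toy_bmi)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
z_scale <- as.numeric(scale(toy_bmi))

all.equal(z_manual, z_scale)  # TRUE if both approaches agree
```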

Task 1b

Athlete 163’s BMI is 4 standard deviations above the mean, his lean body mass is about 3 standard deviations above the mean, and his height is about 1 standard deviation above the mean. His percentage body fat is close to the mean.

Task 2

Get the mean values for all numeric variables for athletes that engaged in the sport rowing

Rowing is indicated by Row in the sport column of the dataset.

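str(ais) shows that rcc, wcc, hc, hg, bmi, ssf, pcBfat, lbm, ht and wt are numeric variables. The columns X (the row index) and ferr also pass is.numeric, so they appear in the tables below; the mean of X has no meaningful interpretation.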

Task 2a

Creating a data frame with athletes who row, and athletes who don’t:

ais_Rowing <- subset(ais, sport == 'Row') # athletes who row
ais_Rowing_num <- Filter(is.numeric, ais_Rowing) # keep numeric columns only

ais_Rowing_num_means <- sapply(ais_Rowing_num, mean)

ais_notRowing <- subset(ais, sport != 'Row') # all other athletes
ais_notRowing_num <- Filter(is.numeric, ais_notRowing)

ais_notRowing_num_means <- sapply(ais_notRowing_num, mean)

ais_Rowing_notRowing <- data.frame(rbind(ais_Rowing_num_means, ais_notRowing_num_means), row.names = c('Rowing', 'Not rowing'))

rmarkdown::paged_table(ais_Rowing_notRowing)

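Calculating 95% confidence intervals (conf = 0.95). Here ciMean() is applied to the two-row table of group means, so each interval below is based on n = 2 (the rowing mean and the non-rowing mean) rather than on the individual athletes, which is why the intervals are so wide: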

ciMean(ais_Rowing_notRowing, conf = 0.95)

##               2.5%      97.5%
## X      -207.800301 381.415862
## rcc       4.385653   5.019943
## wcc       5.336817   8.712651
## hc       43.005820  43.169200
## hg       14.457377  14.686734
## ferr     36.838060 113.110670
## bmi      18.948984  27.383427
## ssf      12.860043 131.079155
## pcBfat   -1.826580  30.451134
## lbm      51.910977  79.197224
## ht      163.316306 198.653916
## wt       48.922333 103.831869
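For comparison, a confidence interval based on the individual observations in a single group can be sketched in base R with the t distribution. The vector toy_wt and the helper ci_mean() are hypothetical names; with the real data one would pass a column such as ais_Rowing_num$wt instead:

```r
# Hypothetical weights (kg) for a handful of athletes, for illustration only
toy_wt <- c(68.2, 74.5, 79.1, 81.3, 85.0, 72.4, 77.8, 90.6)

# t-based confidence interval for the mean of a numeric vector
ci_mean <- function(x, conf = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)   # two-sided critical t value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

ci_mean(toy_wt)
```

The result should agree with the interval reported by t.test(toy_wt)$conf.int.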

Task 2b

If the number of respondents increases, the standard error decreases, so the boundaries move closer together.
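The effect of sample size on interval width can be illustrated with the half-width of a t-based 95% interval, t * sd / sqrt(n). The standard deviation of 10 and the sample sizes below are arbitrary assumptions chosen for illustration:

```r
sd_assumed <- 10            # assumed sample standard deviation
n <- c(10, 40, 160, 640)    # assumed sample sizes

# Half-width of a 95% confidence interval for the mean at each n
half_width <- qt(0.975, df = n - 1) * sd_assumed / sqrt(n)

data.frame(n = n, half_width = round(half_width, 2))
```

The half-width shrinks as n grows, so the lower and upper boundaries end up closer to the sample mean.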

Task 2c

If the variance increases, the standard error increases too. A larger standard error means that sample means from repeated samples would be more spread out around the true mean, so the sample mean becomes a less precise estimate of the true mean and the confidence interval becomes wider.
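The relation between variance, standard error and interval width can be written out as follows, where s is the sample standard deviation, n the sample size and t the critical value of the t distribution with n - 1 degrees of freedom (a standard t-based interval is assumed here):

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad
\mathrm{CI}_{1-\alpha} = \bar{x} \pm t_{1-\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}
```

For a fixed n, a larger s gives a larger SE and therefore a wider interval.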

Task 3

Compute critical value and GOF-value between the sports practiced between males and females. For the critical value assume that p = 0.05 is considered significant. Furthermore, exclude the athletes who practice Gym, Netball and Waterpolo (W_Polo) from your analysis.

First dividing the data into two data frames: males and females, excluding those who practice Gym, Netball and Waterpolo.

excluded <- c('Netball', 'Gym', 'W_Polo') # sports left out of the analysis

aisMales <- subset(ais, sex == 'm')
aisMalesExcl <- subset(aisMales, !(sport %in% excluded))
aisMalesExcl <- droplevels(aisMalesExcl$sport) # keep only the sport factor, without unused levels

aisFemales <- subset(ais, sex == 'f')
aisFemalesExcl <- subset(aisFemales, !(sport %in% excluded))
aisFemalesExcl <- droplevels(aisFemalesExcl$sport)

aisExcl <- subset(ais, !(sport %in% excluded)) # both sexes, used for the association test
aisExcl <- droplevels(aisExcl)

Task 3a

h0: There is no difference in the distribution over sports between males and females.

h1: There is a difference in the distribution over sports between males and females.

Task 3b

goodnessOfFitTest(aisMalesExcl)

## 
##      Chi-square test against specified probabilities
## 
## Data variable:   aisMalesExcl 
## 
## Hypotheses: 
##    null:        true probabilities are as specified
##    alternative: true probabilities differ from those specified
## 
## Descriptives: 
##         observed freq. expected freq. specified prob.
## B_Ball              12       12.14286       0.1428571
## Field               12       12.14286       0.1428571
## Row                 15       12.14286       0.1428571
## Swim                13       12.14286       0.1428571
## T_400m              18       12.14286       0.1428571
## T_Sprnt             11       12.14286       0.1428571
## Tennis               4       12.14286       0.1428571
## 
## Test results: 
##    X-squared statistic:  9.129 
##    degrees of freedom:  6 
##    p-value:  0.166

goodnessOfFitTest(aisFemalesExcl)

## 
##      Chi-square test against specified probabilities
## 
## Data variable:   aisFemalesExcl 
## 
## Hypotheses: 
##    null:        true probabilities are as specified
##    alternative: true probabilities differ from those specified
## 
## Descriptives: 
##         observed freq. expected freq. specified prob.
## B_Ball              13       10.42857       0.1428571
## Field                7       10.42857       0.1428571
## Row                 22       10.42857       0.1428571
## Swim                 9       10.42857       0.1428571
## T_400m              11       10.42857       0.1428571
## T_Sprnt              4       10.42857       0.1428571
## Tennis               7       10.42857       0.1428571
## 
## Test results: 
##    X-squared statistic:  19.918 
##    degrees of freedom:  6 
##    p-value:  0.003
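The two statistics above can be reproduced by hand from the observed counts in the output, using the equal expected probabilities (1/7 per sport) shown in the "specified prob." column. The helper gof_stat() is a hypothetical name used for this sketch:

```r
# Observed counts copied from the goodnessOfFitTest output above
males <- c(B_Ball = 12, Field = 12, Row = 15, Swim = 13,
           T_400m = 18, T_Sprnt = 11, Tennis = 4)
females <- c(B_Ball = 13, Field = 7, Row = 22, Swim = 9,
             T_400m = 11, T_Sprnt = 4, Tennis = 7)

# Chi-square goodness-of-fit statistic against equal probabilities
gof_stat <- function(observed) {
  expected <- sum(observed) / length(observed)  # same expected count in every category
  sum((observed - expected)^2 / expected)
}

round(gof_stat(males), 3)    # 9.129
round(gof_stat(females), 3)  # 19.918
```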

associationTest(data = aisExcl, formula =  ~ sport + sex)

## 
##      Chi-square test of categorical association
## 
## Variables:   sport, sex 
## 
## Hypotheses: 
##    null:        variables are independent of one another
##    alternative: some contingency exists between variables
## 
## Observed contingency table:
##          sex
## sport      f  m
##   B_Ball  13 12
##   Field    7 12
##   Row     22 15
##   Swim     9 13
##   T_400m  11 18
##   T_Sprnt  4 11
##   Tennis   7  4
## 
## Expected contingency table under the null hypothesis:
##          sex
## sport         f     m
##   B_Ball  11.55 13.45
##   Field    8.78 10.22
##   Row     17.09 19.91
##   Swim    10.16 11.84
##   T_400m  13.40 15.60
##   T_Sprnt  6.93  8.07
##   Tennis   5.08  5.92
## 
## Test results: 
##    X-squared statistic:  8.318 
##    degrees of freedom:  6 
##    p-value:  0.216 
## 
## Other information: 
##    estimated effect size (Cramer's v):  0.229

The GOF statistics (X-squared, 6 degrees of freedom) are as follows:

Males: 9.129
Females: 19.918

Levels of significance (p-values):

Males: 0.166 (not significant at 0.05)
Females: 0.003 (significant at 0.05)
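The critical value asked for in the task can be obtained from the chi-square distribution with qchisq(). With seven sports there are 6 degrees of freedom, and a significance level of 0.05 corresponds to the 0.95 quantile:

```r
alpha <- 0.05
df <- 7 - 1                        # number of sports minus one

critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)           # 12.592

# Test statistics copied from the goodnessOfFitTest output
gof <- c(males = 9.129, females = 19.918)
gof > critical_value               # males FALSE, females TRUE
```

The association test also has (7 - 1) * (2 - 1) = 6 degrees of freedom, so the same critical value applies to its statistic of 8.318.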

Task 3c

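The goodness-of-fit test for females is significant (p = 0.003), so the hypothesis that female athletes are spread evenly over the seven sports is rejected; for males it is not (p = 0.166). The direct test of association between sport and sex, however, is not significant (X-squared = 8.318, df = 6, p = 0.216), so h0 as stated in Task 3a is not rejected at the 0.05 level.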

Task 3d

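The Cramer’s V is 0.229. Under the rule of thumb that a Cramer’s V between 0.150 and 0.250 indicates a strong association, this points to a strong association between the sex of an athlete and their choice of sport, although the association test itself is not statistically significant (p = 0.216).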
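Cramer's V can be recomputed from the observed contingency table in the associationTest output as sqrt(X-squared / (N * (k - 1))), where N is the total count and k the smaller of the number of rows and columns. The sketch uses base R's chisq.test(); the object names are illustrative:

```r
# Observed counts copied from the associationTest output (rows: sports, columns: sex)
observed <- matrix(
  c(13, 12,
     7, 12,
    22, 15,
     9, 13,
    11, 18,
     4, 11,
     7,  4),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    sport = c("B_Ball", "Field", "Row", "Swim", "T_400m", "T_Sprnt", "Tennis"),
    sex = c("f", "m")
  )
)

x_squared <- unname(chisq.test(observed)$statistic)  # Pearson chi-square statistic
n <- sum(observed)                                   # total number of athletes
k <- min(dim(observed))                              # smaller table dimension

cramers_v <- sqrt(x_squared / (n * (k - 1)))
round(cramers_v, 3)  # 0.229
```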

Task 4 and 5

Task 4

Give an occasion in which you would choose to use a Fisher exact test (max: 100 words).

A Fisher’s exact test of independence can be used in place of a chi-square test to test for association when the sample is small. It is typically applied to two classifications with two categories each, which together form a 2x2 table. One classification can, for example, be ‘willing to drink alcohol at a party’ versus ‘not willing to drink alcohol at a party’, and the other ‘male’ versus ‘female’. Fisher’s test gives an exact p-value, while a chi-square test gives an approximation.
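In R, such a 2x2 table can be analysed with fisher.test() from the stats package. The counts below are made up purely for illustration and do not come from any dataset:

```r
# Hypothetical 2x2 table: willingness to drink alcohol at a party, by sex
drinking <- matrix(
  c(8, 2,
    1, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("male", "female"),
    willing = c("yes", "no")
  )
)

result <- fisher.test(drinking)
result$p.value  # exact two-sided p-value
```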

Task 5

You know why one would use a Fisher Exact test, can you explain why we had to exclude all the athletes who practice Gym, Netball and Waterpolo from our analysis?

The number of people practicing Gym, Netball and Waterpolo was too low to be included in the analysis.

Statistics for Pre-masters DSS - Assignment 3

Jesse Vervaart

26-10-2019
