Marathon Example

a question

mydata <- read.table("./Marathon.csv", header = TRUE, sep = ";", dec = ",")

head(mydata)

##   ID Weight Height Pressure Beat Hemoglobin Hematocrit Cholesterol Glucose
## 1  1     72  179.0      105   64        160         50         4.9     4.7
## 2  2     68  178.0      105   60        158         51         4.8     4.9
## 3  3     64  174.0      109   54        155         51         4.5     7.0
## 4  4     63  174.0      112   54        153         58         8.0     7.2
## 5  5     61  173.5      100   53        152         59         4.6     6.7
## 6  6     60  173.0       99   53        158         49         3.9     6.0
##   Gender
## 1      1
## 2      0
## 3      0
## 4      0
## 5      0
## 6      0

Explanation of variables:

ID
Weight (weight in kg)
Gender (0:F, 1:M)

b question

Unit: One athlete. Sample: 35 randomly chosen athletes.

c question

mean(mydata$Height)

## [1] 176.9571

sd(mydata$Height)

## [1] 5.85156

The average height of the 35 athletes in the sample is 177.0 cm. The standard deviation, which measures how much the heights vary around the mean, is 5.9 cm.
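To make the meaning of the standard deviation concrete, the sketch below computes the mean and the sample standard deviation by hand, with the n - 1 denominator, and compares them with mean() and sd(). The vector heights is an illustration only: it holds the six Height values printed by head(mydata), not the full sample, so its results differ from the 177.0 cm and 5.9 cm reported above.

```r
# First six Height values (cm) as printed by head(mydata); not the full sample
heights <- c(179, 178, 174, 174, 173.5, 173)

n <- length(heights)
m <- sum(heights) / n                        # arithmetic mean
s <- sqrt(sum((heights - m)^2) / (n - 1))    # sample standard deviation

# Both comparisons should return TRUE
all.equal(m, mean(heights))
all.equal(s, sd(heights))
```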

d question

mydata$GenderFactor <- factor(mydata$Gender, 
                              levels = c(0, 1), 
                              labels = c("Female", "Male"))

e question

library(psych)
describeBy(mydata$Glucose, mydata$GenderFactor)

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 14 5.96 0.93    5.8    5.97 1.33 4.6 7.2   2.6 0.12    -1.62 0.25
## ------------------------------------------------------------ 
## group: Male
##    vars  n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 21 4.54 0.7    4.6    4.45 0.74 3.8   6   2.2 0.97    -0.13 0.15

Additional question

Test whether the mean glucose value for males is equal to 5.0.

H0: Mu = 5.0
H1: Mu != 5.0

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydataM <- mydata %>%
  filter(GenderFactor == "Male")
  

t.test(mydataM$Glucose,
       mu = 5,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydataM$Glucose
## t = -3.0324, df = 20, p-value = 0.006578
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.216336 4.855093
## sample estimates:
## mean of x 
##  4.535714

We reject the null hypothesis (t = -3.03, df = 20, p = 0.007). The mean glucose level for males differs from 5.0. The sample mean of 4.54 and the 95% confidence interval [4.22, 4.86], which lies entirely below 5.0, show that it is lower.
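The quantities reported by the test can be reproduced by hand, which may help in reading the output. The sketch below is a minimal illustration on a made-up vector: glucose and mu0 are hypothetical names and values, not the marathon data, so the numbers will not match the output above.

```r
# Hypothetical glucose readings (illustrative only, not the marathon data)
glucose <- c(4.1, 4.6, 3.9, 5.2, 4.4, 4.8, 4.0, 4.7)
mu0 <- 5                                      # mean assumed under H0

n <- length(glucose)
se <- sd(glucose) / sqrt(n)                   # standard error of the mean
t_stat <- (mean(glucose) - mu0) / se          # one-sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

# Compare with the built-in test; both comparisons should return TRUE
res <- t.test(glucose, mu = mu0, alternative = "two.sided")
all.equal(unname(res$statistic), t_stat)
all.equal(unname(res$p.value), p_value)
```

The degrees of freedom are n - 1, which is why the output above shows df = 20 for the 21 male athletes.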

f question

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

round(stat.desc(mydata[c(-1, -10, -11)]), 2)

##               Weight  Height Pressure    Beat Hemoglobin Hematocrit Cholesterol
## nbr.val        35.00   35.00    35.00   35.00      35.00      35.00       35.00
## nbr.null        0.00    0.00     0.00    0.00       0.00       0.00        0.00
## nbr.na          0.00    0.00     0.00    0.00       0.00       0.00        0.00
## min            55.00  166.00    90.00   49.00     143.00      45.00        3.40
## max            81.00  189.00   135.00   64.00     183.00      69.00        8.00
## range          26.00   23.00    45.00   15.00      40.00      24.00        4.60
## sum          2375.00 6193.50  3838.00 1967.00    5445.00    1801.00      167.60
## median         68.00  177.00   108.00   55.00     157.00      51.00        4.70
## mean           67.86  176.96   109.66   56.20     155.57      51.46        4.79
## SE.mean         1.30    0.99     1.79    0.67       1.45       0.82        0.17
## CI.mean.0.95    2.64    2.01     3.64    1.37       2.94       1.66        0.34
## var            59.01   34.24   112.47   15.81      73.13      23.49        1.00
## std.dev         7.68    5.85    10.61    3.98       8.55       4.85        1.00
## coef.var        0.11    0.03     0.10    0.07       0.05       0.09        0.21
##              Glucose
## nbr.val        35.00
## nbr.null        0.00
## nbr.na          0.00
## min             3.80
## max             7.20
## range           3.40
## sum           178.65
## median          4.80
## mean            5.10
## SE.mean         0.18
## CI.mean.0.95    0.36
## var             1.12
## std.dev         1.06
## coef.var        0.21

Cholesterol and Glucose, the last two variables, show the highest relative variability: each has a coefficient of variation of 0.21.
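The coefficient of variation is the standard deviation divided by the mean, so the coef.var row can be checked from the std.dev and mean rows. The sketch below copies those rounded values from the table. Because the inputs are already rounded to two decimals, the result is an approximation of what stat.desc() computes from the raw data.

```r
# Values copied from the std.dev and mean rows of the stat.desc() output
std_dev <- c(Weight = 7.68, Height = 5.85, Pressure = 10.61, Beat = 3.98,
             Hemoglobin = 8.55, Hematocrit = 4.85, Cholesterol = 1.00,
             Glucose = 1.06)
means <- c(Weight = 67.86, Height = 176.96, Pressure = 109.66, Beat = 56.20,
           Hemoglobin = 155.57, Hematocrit = 51.46, Cholesterol = 4.79,
           Glucose = 5.10)

cv <- std_dev / means                    # coefficient of variation = sd / mean
sort(round(cv, 2), decreasing = TRUE)    # largest relative variability first
```

Because it is unitless, the coefficient of variation allows variables measured on different scales, such as Weight in kg and Height in cm, to be compared directly.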

g question

95% CI for mean Cholesterol: 4.79 ± 0.34 = [4.45, 5.13]. The half-width 0.34 is the CI.mean.0.95 value from the table above.
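The interval is the sample mean plus and minus a half-width, which is the t quantile times the standard error of the mean. The sketch below shows that calculation and checks it against t.test(). The vector chol holds only the six Cholesterol values printed by head(mydata), so it illustrates the method and does not reproduce [4.45, 5.13].

```r
# First six Cholesterol values as printed by head(mydata); not the full sample
chol <- c(4.9, 4.8, 4.5, 8.0, 4.6, 3.9)

n <- length(chol)
se <- sd(chol) / sqrt(n)                    # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se    # t quantile times standard error
ci <- mean(chol) + c(-1, 1) * half_width    # lower and upper limits

# Should return TRUE: same interval as the built-in one-sample t-test
all.equal(ci, as.numeric(t.test(chol)$conf.int))
```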

h question

hist(mydata$Hematocrit,
     main = "Distribution of Hematocrit",
     xlab = "Hematocrit",
     col = "lightblue",
     breaks = seq(45, 70, 2.5))

i question

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = GenderFactor, y = Glucose)) +
  geom_boxplot(fill = "lightgreen") +
  xlab("Gender")