mydata <- read.table("./Marathon.csv", header = TRUE, sep = ";", dec = "," )
head(mydata)
## ID Weight Height Pressure Beat Hemoglobin Hematocrit Cholesterol Glucose
## 1 1 72 179.0 105 64 160 50 4.9 4.7
## 2 2 68 178.0 105 60 158 51 4.8 4.9
## 3 3 64 174.0 109 54 155 51 4.5 7.0
## 4 4 63 174.0 112 54 153 58 8.0 7.2
## 5 5 61 173.5 100 53 152 59 4.6 6.7
## 6 6 60 173.0 99 53 158 49 3.9 6.0
## Gender
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Explanation of variables:
Unit: One athlete. Sample: 35 randomly chosen athletes.
mean(mydata$Height)
## [1] 176.9571
sd(mydata$Height)
## [1] 5.85156
The average height of the 35 athletes in the sample is 177 cm. The standard deviation, which measures the variability of the variable, is 5.9 cm.
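As a minimal sketch of what `mean()` and `sd()` compute, the block below redoes both by hand. It uses only the six heights printed by `head(mydata)`, not the full sample, so the numbers differ from the ones reported above.

```r
# The six heights (cm) shown in the head(mydata) output, not the full sample
heights <- c(179, 178, 174, 174, 173.5, 173)

n <- length(heights)
m <- sum(heights) / n                        # arithmetic mean
s <- sqrt(sum((heights - m)^2) / (n - 1))    # sample standard deviation (divides by n - 1)

# Compare the manual values with the built-in functions
all.equal(m, mean(heights))
all.equal(s, sd(heights))
```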
mydata$GenderFactor <- factor(mydata$Gender,
                              levels = c(0, 1),
                              labels = c("Female", "Male"))
library(psych)
describeBy(mydata$Glucose, mydata$GenderFactor)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 14 5.96 0.93 5.8 5.97 1.33 4.6 7.2 2.6 0.12 -1.62 0.25
## ------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 21 4.54 0.7 4.6 4.45 0.74 3.8 6 2.2 0.97 -0.13 0.15
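The `se` column in the output above is the standard error of the mean, sd / sqrt(n). A short check, assuming only the rounded `n` and `sd` values printed by `describeBy()`:

```r
# Values copied from the describeBy() output (sd is already rounded)
n_group  <- c(Female = 14, Male = 21)
sd_group <- c(Female = 0.93, Male = 0.70)

# Standard error of the mean for each group
se_group <- sd_group / sqrt(n_group)
round(se_group, 2)
```

Because the inputs are rounded, the result agrees with the reported `se` only to two decimals.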
Test whether the mean glucose level for males is equal to 5.0.
H0: Mu = 5.0
H1: Mu ≠ 5.0
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydataM <- mydata %>%
  filter(GenderFactor == "Male")
t.test(mydataM$Glucose,
       mu = 5,
       alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataM$Glucose
## t = -3.0324, df = 20, p-value = 0.006578
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 4.216336 4.855093
## sample estimates:
## mean of x
## 4.535714
We reject the null hypothesis (p = 0.007). The mean glucose level for males differs from 5; the sample mean (4.54) and the 95% confidence interval [4.22, 4.86] show that it is lower.
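To see where the test statistic comes from, the sketch below rebuilds it from the numbers printed by `t.test()`. The raw data are not used; the standard error is recovered from the half-width of the reported confidence interval, which is an assumption of this sketch rather than how `t.test()` works internally.

```r
# Values copied from the t.test() output
xbar <- 4.535714                 # sample mean
mu0  <- 5                        # hypothesised mean under H0
df   <- 20                       # n - 1 with n = 21 males
ci   <- c(4.216336, 4.855093)    # reported 95% confidence interval

# The interval is xbar +/- qt(0.975, df) * se, so se follows from its half-width
se <- (diff(ci) / 2) / qt(0.975, df)

t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value

round(c(t = t_stat, p = p_value), 4)
```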
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata[c(-1, -10, -11)]), 2)
## Weight Height Pressure Beat Hemoglobin Hematocrit Cholesterol
## nbr.val 35.00 35.00 35.00 35.00 35.00 35.00 35.00
## nbr.null 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## min 55.00 166.00 90.00 49.00 143.00 45.00 3.40
## max 81.00 189.00 135.00 64.00 183.00 69.00 8.00
## range 26.00 23.00 45.00 15.00 40.00 24.00 4.60
## sum 2375.00 6193.50 3838.00 1967.00 5445.00 1801.00 167.60
## median 68.00 177.00 108.00 55.00 157.00 51.00 4.70
## mean 67.86 176.96 109.66 56.20 155.57 51.46 4.79
## SE.mean 1.30 0.99 1.79 0.67 1.45 0.82 0.17
## CI.mean.0.95 2.64 2.01 3.64 1.37 2.94 1.66 0.34
## var 59.01 34.24 112.47 15.81 73.13 23.49 1.00
## std.dev 7.68 5.85 10.61 3.98 8.55 4.85 1.00
## coef.var 0.11 0.03 0.10 0.07 0.05 0.09 0.21
## Glucose
## nbr.val 35.00
## nbr.null 0.00
## nbr.na 0.00
## min 3.80
## max 7.20
## range 3.40
## sum 178.65
## median 4.80
## mean 5.10
## SE.mean 0.18
## CI.mean.0.95 0.36
## var 1.12
## std.dev 1.06
## coef.var 0.21
Cholesterol and Glucose, the last two variables, show the highest relative variability: both have the highest coefficient of variation (0.21).
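The coefficient of variation is the standard deviation divided by the mean, which makes variables measured on different scales comparable. A small sketch, assuming only the rounded `mean` and `std.dev` rows of the table above:

```r
# Means and standard deviations copied from the stat.desc() output (rounded)
vars  <- c("Weight", "Height", "Pressure", "Beat",
           "Hemoglobin", "Hematocrit", "Cholesterol", "Glucose")
means <- c(67.86, 176.96, 109.66, 56.20, 155.57, 51.46, 4.79, 5.10)
sds   <- c(7.68, 5.85, 10.61, 3.98, 8.55, 4.85, 1.00, 1.06)

# Coefficient of variation: standard deviation relative to the mean
cv <- setNames(sds / means, vars)

# Sort from most to least variable
round(sort(cv, decreasing = TRUE), 2)
```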
95% CI for mean Cholesterol: 4.79 ± 0.34, i.e. [4.45, 5.13]
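The half-width of this interval (the `CI.mean.0.95` row) is the t quantile times the standard error. The sketch below reproduces it from the rounded table values, so small rounding differences against the exact data are possible.

```r
# Values copied from the stat.desc() output for Cholesterol (rounded)
n    <- 35
xbar <- 4.79
s    <- 1.00

se         <- s / sqrt(n)                  # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se   # corresponds to CI.mean.0.95

round(c(lower = xbar - half_width, upper = xbar + half_width), 2)
```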
hist(mydata$Hematocrit,
     main = "Distribution of Hematocrit",
     xlab = "Hematocrit",
     col = "lightblue",
     breaks = seq(45, 70, 2.5))
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = GenderFactor, y = Glucose)) +
  geom_boxplot(fill = "lightgreen") +
  xlab("Gender")