Joel Correa da Rosa
December 14th 2016
The material has been prepared with two basic tools.
Statistics is the science of uncertainty.
Statistics is an auxiliary tool for making decisions in the presence of uncertainty.
Biostatistics is the application of Statistics to biological and health sciences.
Decisions in health sciences are related to:
There are two main applications of statistics:
Descriptive statistics is used to describe a phenomenon by means of numerical summaries, graphs and tables.
Inferential statistics is used to draw conclusions about a population (the universe) based on a sample.
The tools for descriptive statistics are:
How to compute the variance of a sample of \( n \) numbers \( X_1,X_2,...,X_n \), where \( \bar X \) is the sample mean:
\( S^2 = \frac{\sum_{i=1}^n(X_i-\bar X)^2}{n-1 } \)
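Why \( n-1 \) in the denominator? The deviations are measured from the sample mean \( \bar X \) rather than from the unknown population mean, and dividing by \( n-1 \) makes \( S^2 \) an unbiased estimator of the population variance.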
The standard deviation is the square root of the variance. \( S = \sqrt{S^2} \).
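As a minimal sketch, the two formulas above can be checked against R's built-in `var()` and `sd()`. The vector `x` is a made-up sample chosen only for illustration; it is not part of the original material.

```r
# hypothetical sample, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# variance from the definition: squared deviations from the mean, divided by n - 1
s2.manual <- sum((x - mean(x))^2) / (n - 1)

# standard deviation as the square root of the variance
s.manual <- sqrt(s2.manual)

# the built-in functions use the same n - 1 denominator
all.equal(s2.manual, var(x))
all.equal(s.manual, sd(x))
```

Both comparisons should return `TRUE`, since `var()` and `sd()` also divide by \( n-1 \).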
\( MAD = median(|X_i-Med|),i=1,2,...,n \)
Example
Dataset = \( \{5,10, 15, 25, 2000\} \)
\( MAD = median\{|5-15|,|10-15|,|15-15|,|25-15|,|2000-15|\} \)
\( \phantom{MAD} = median\{10,5,0,10,1985\} = 10 \)
While the variance is a measure of dispersion that uses the mean as reference, the MAD uses the median as reference.
dataset = c(5,10,15,25,2000)
dataset
[1] 5 10 15 25 2000
median.dataset = median(dataset)
median.dataset
[1] 15
abs.deviation.median = abs(dataset-median.dataset)
abs.deviation.median
[1] 10 5 0 10 1985
mad = median(abs.deviation.median)
mad
[1] 10
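R also ships a built-in `mad()` function in the `stats` package. By default it multiplies the raw value by the constant 1.4826, so it does not return 10 for this dataset unless `constant = 1` is set. The sketch below compares both versions with the standard deviation; the names `mad.raw` and `mad.scaled` are chosen here for illustration.

```r
dataset <- c(5, 10, 15, 25, 2000)

# raw MAD, as defined above (no scaling constant)
mad.raw <- stats::mad(dataset, constant = 1)

# default MAD: the raw value multiplied by 1.4826
mad.scaled <- stats::mad(dataset)

# the standard deviation is pulled up by the extreme value 2000, the MAD is not
c(mad.raw = mad.raw, mad.scaled = mad.scaled, sd = sd(dataset))
```

The scaling constant makes the MAD comparable to the standard deviation when the data are normally distributed, which is why it is the default.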
The skewness indicates how symmetric the distribution of the values is.
\( skewness = \frac{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^3}{S^3} \)
The skewness for the normal distribution is 0.
Pearson's second coefficient of skewness.
\( Sk = \frac{3(\bar X - Med)}{S} \)
\( \bar X \) : sample mean
\( Med \) : sample median
\( S \) : sample standard deviation
When \( Sk>0 \), the mean is larger than the median, which suggests a distribution skewed to the right.
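Both skewness measures can be sketched in a few lines of R. The function names `skewness.moment` and `skewness.pearson` are hypothetical helpers written for this example, not functions from base R, and the dataset is the one from the MAD example.

```r
dataset <- c(5, 10, 15, 25, 2000)

# moment-based skewness: third central moment divided by S^3
# (S is the sample standard deviation)
skewness.moment <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Pearson's second coefficient of skewness: 3 * (mean - median) / S
skewness.pearson <- function(x) {
  3 * (mean(x) - median(x)) / sd(x)
}

skewness.moment(dataset)
skewness.pearson(dataset)
```

Both values are positive here because the observation 2000 pulls the mean (411) far above the median (15).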
The kurtosis indicates how “peaked” or flat the distribution of values appears.
\( kurtosis = \frac{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^4}{S^4} \)
The kurtosis of the normal distribution is 3; the “excess kurtosis” is calculated as kurtosis − 3.
High kurtosis indicates a distribution with tails heavier than the normal distribution and more outliers.
# generate heavy-tailed (leptokurtic) data: 1000 draws from a Student's t distribution with 4 degrees of freedom
leptokurtic.data <- rt(1000, 4)
# visualize the estimated density of the leptokurtic data
plot(density(leptokurtic.data))
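To put numbers next to the plot, the kurtosis formula can be applied to a normal sample and to a heavy-tailed sample. This is a sketch under assumptions: `kurtosis.moment` is a hypothetical helper written for this example, and the seed and sample size are arbitrary choices.

```r
# sample kurtosis: fourth central moment divided by S^4
kurtosis.moment <- function(x) {
  mean((x - mean(x))^4) / sd(x)^4
}

set.seed(1)  # arbitrary seed, only for reproducibility
normal.data <- rnorm(100000)
heavy.data  <- rt(100000, 4)

# close to 3 for the normal sample, clearly larger for the t sample
kurtosis.moment(normal.data)
kurtosis.moment(heavy.data)

# excess kurtosis of the heavy-tailed sample
kurtosis.moment(heavy.data) - 3
```

A large sample is used because the sample kurtosis of heavy-tailed data is very unstable for small \( n \).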
A quantile \( Q(p) \) is a number that splits the data set into two parts: about \( 100\times p\% \) of the data are smaller than \( Q(p) \) and about \( 100 \times (1-p)\% \) are larger than \( Q(p) \).
\( 0 < p < 1 \)
Special cases are the median \( Q(0.5) \) and the first and third quartiles \( Q(0.25) \) and \( Q(0.75) \).
Interquartile Range: \( IQR = Q(0.75)-Q(0.25) \)
my.sample<-c(0.8,1.3,2,5,6,10.3)
my.sample
[1] 0.8 1.3 2.0 5.0 6.0 10.3
# Q(p), p=0.1
quantile(my.sample,0.1)
10%
1.05
# Q(p), p=0.5
quantile(my.sample,0.5)
50%
3.5
# Q(p), p=0.9
quantile(my.sample,0.9)
90%
8.15
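The interquartile range of the same sample can be obtained from the two quartiles or from the built-in `IQR()` function. This sketch uses the default interpolation rule of `quantile()`; other rules, selected through its `type` argument, can give slightly different quartiles.

```r
my.sample <- c(0.8, 1.3, 2, 5, 6, 10.3)

# first and third quartiles
q1 <- quantile(my.sample, 0.25)
q3 <- quantile(my.sample, 0.75)

# IQR from the definition and from the built-in function
iqr.manual <- unname(q3 - q1)
iqr.manual
IQR(my.sample)
```

With the default rule, \( Q(0.25) = 1.475 \) and \( Q(0.75) = 5.75 \), so both calls give \( IQR = 4.275 \).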
# generate 100 numbers from the standard normal distribution
x <- rnorm(100)
# draw a boxplot of the generated sample
boxplot(x)
An observation is classified as an outlier if it is:
Greater than \( Q(0.75)+1.5 \times IQR \)
OR
Less than \( Q(0.25)-1.5 \times IQR \)
where
\( IQR = Q(0.75)-Q(0.25) \)
boxplot(leptokurtic.data)
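The \( 1.5 \times IQR \) rule can also be applied by hand. The sketch below uses the small dataset from the MAD example; the names `lower.fence` and `upper.fence` are chosen here for illustration. Note that `boxplot()` computes its box limits from hinges rather than from `quantile()`, so the two approaches can differ slightly on some samples.

```r
dataset <- c(5, 10, 15, 25, 2000)

q1  <- unname(quantile(dataset, 0.25))
q3  <- unname(quantile(dataset, 0.75))
iqr <- q3 - q1

# limits beyond which an observation is called an outlier
lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr

# observations outside the fences are flagged as outliers
outliers <- dataset[dataset < lower.fence | dataset > upper.fence]
outliers
```

Here \( Q(0.25) = 10 \) and \( Q(0.75) = 25 \), so the fences are \( -12.5 \) and \( 47.5 \), and only the value 2000 is flagged.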
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.3 | 2.9 | 5.6 | 1.8 | virginica |
5.0 | 3.5 | 1.6 | 0.6 | setosa |
5.0 | 3.0 | 1.6 | 0.2 | setosa |
7.2 | 3.0 | 5.8 | 1.6 | virginica |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.4 | 3.0 | 1.3 | 0.2 | setosa |
5.6 | 2.8 | 4.9 | 2.0 | virginica |
5.2 | 2.7 | 3.9 | 1.4 | versicolor |
6.3 | 3.3 | 4.7 | 1.6 | versicolor |
6.3 | 2.8 | 5.1 | 1.5 | virginica |
# load the data
data(iris)
# summarize the data
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
boxplot(iris$Sepal.Length ~ iris$Species)
stats.box<-boxplot(iris$Sepal.Length ~ iris$Species)
stats.box
$stats
[,1] [,2] [,3]
[1,] 4.3 4.9 5.6
[2,] 4.8 5.6 6.2
[3,] 5.0 5.9 6.5
[4,] 5.2 6.3 6.9
[5,] 5.8 7.0 7.9
$n
[1] 50 50 50
$conf
[,1] [,2] [,3]
[1,] 4.910622 5.743588 6.343588
[2,] 5.089378 6.056412 6.656412
$out
[1] 4.9
$group
[1] 3
$names
[1] "setosa" "versicolor" "virginica"
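The components of the returned list are easier to read once they are labeled. In the sketch below, each column of `$stats` holds, from top to bottom, the lower whisker, lower hinge, median, upper hinge and upper whisker for one species; `$conf` holds the notch limits around each median; `$out` and `$group` give the outlying values and the index of the group they belong to. The object `five.numbers` and the row labels are names chosen here for illustration.

```r
# compute the boxplot statistics without drawing the plot
stats.box <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# label the five rows of $stats and use the species names as column names
five.numbers <- stats.box$stats
dimnames(five.numbers) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  stats.box$names
)
five.numbers

# pair each outlier in $out with the species indicated by $group
data.frame(species = stats.box$names[stats.box$group], value = stats.box$out)
```

For these data the only outlier is the sepal length 4.9, which belongs to group 3, that is, virginica.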