Lecture 1: Basic Concepts

Joel Correa da Rosa
December 14th 2016

Computational Resources

The material has been prepared with two basic tools.

  • R software

http://cran.r-project.org

  • RStudio

http://rstudio.com

What is Biostatistics?

Statistics is the science of uncertainty.

Statistics is an auxiliary tool for making decisions in the presence of uncertainty.

Biostatistics is the application of Statistics to biological and health sciences.

Decisions in health sciences are related to:

  • safety
  • efficacy
  • effectiveness
  • superiority
  • ….

Descriptive Statistics vs. Inferential Statistics

There are two main applications of statistics:

  • Description/Exploration
  • Inference

Descriptive statistics is used to describe a phenomenon by means of numerical summaries, graphs and tables.

Inferential statistics is used to draw conclusions about a population (the universe) based on a sample.

Descriptive Statistics

The tools for descriptive statistics are:

  • Numerical summaries
  • Graphics
  • Tables

Univariate Numerical Summaries #1

  • Central Tendency
    • mean
    • median
    • mode
  • Dispersion
    • standard deviation
    • variance
    • median absolute deviation (MAD)
    • interquartile range

Variance

The variance of a sample of \( n \) numbers \( X_1,X_2,...,X_n \) is computed as:

\( S^2 = \frac{\sum_{i=1}^n(X_i-\bar X)^2}{n-1 } \)

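Why \( n-1 \) in the denominator? The deviations \( X_i-\bar X \) always sum to zero, so only \( n-1 \) of them are free to vary; dividing by \( n-1 \) makes \( S^2 \) an unbiased estimator of the population variance.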

The standard deviation is the square root of the variance: \( S = \sqrt{S^2} \).
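
As a minimal sketch, the variance formula can be checked against R's built-in var() and sd(). The vector x below is a made-up example chosen for illustration, not data from the lecture.

```r
# hypothetical sample, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# variance from the definition: squared deviations from the mean, divided by n - 1
s2.manual <- sum((x - mean(x))^2) / (n - 1)

# built-in equivalents
s2.builtin <- var(x)
s.builtin  <- sd(x)

# both comparisons should report TRUE
all.equal(s2.manual, s2.builtin)
all.equal(sqrt(s2.manual), s.builtin)
```

Both var() and sd() use the \( n-1 \) denominator, so they should agree with the manual calculation.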

Median Absolute Deviation

\( MAD = median(|X_i-Med|),\ i=1,2,...,n \), where \( Med \) is the sample median.

Example

Dataset = \( \{5,10, 15, 25, 2000\} \)

The median of the dataset is 15, so:

\( MAD = median\{|5-15|,|10-15|,|15-15|,|25-15|,|2000-15|\} = median\{10,5,0,10,1985\} = 10 \)

While the variance is a measure of dispersion that uses the mean as reference, the MAD uses the median as reference.

Calculating the MAD in R

dataset = c(5,10,15,25,2000)
dataset
[1]    5   10   15   25 2000
median.dataset = median(dataset)
median.dataset
[1] 15
abs.deviation.median = abs(dataset-median.dataset) 
abs.deviation.median
[1]   10    5    0   10 1985
mad = median(abs.deviation.median)
mad
[1] 10
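
R also provides a built-in mad() function in the stats package. As a hedged sketch: by default it multiplies the raw MAD by the constant 1.4826, which makes it comparable to the standard deviation for normally distributed data, so constant = 1 is needed to reproduce the value computed step by step above. The names mad.raw, mad.scaled and sd.dataset are introduced here only for illustration.

```r
dataset <- c(5, 10, 15, 25, 2000)

# raw MAD, as defined above (no scaling constant)
mad.raw <- mad(dataset, constant = 1)

# default behaviour: raw MAD multiplied by 1.4826
mad.scaled <- mad(dataset)

# for comparison: the standard deviation is dominated by the extreme value 2000
sd.dataset <- sd(dataset)

mad.raw
mad.scaled
sd.dataset
```

The contrast between mad.raw and sd.dataset illustrates why the MAD is regarded as a robust measure of dispersion.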

Univariate Numerical Summaries #2

  • Skewness
    • coefficient of skewness
  • Kurtosis
    • coefficient of kurtosis
  • Thresholds for Outliers
    • Tukey's boxplot limits

Remark: The boxplot criterion for detecting outliers is based on the normal distribution.

Skewness #1

The skewness measures how asymmetric the distribution of the values is.

\( skewness = \frac{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^3}{S^3} \)

The skewness for the normal distribution is 0.
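
The formula above can be sketched directly in R. The function name skewness and the vector symmetric.data are assumptions made for this illustration; base R has no built-in skewness function, and \( S \) is taken to be the sample standard deviation returned by sd().

```r
# skewness as defined above: mean cubed deviation divided by S^3
# (S is the sample standard deviation with the n - 1 denominator)
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

symmetric.data <- c(1, 2, 3, 4, 5)        # hypothetical, symmetric around 3
skewed.data    <- c(5, 10, 15, 25, 2000)  # long right tail

skewness(symmetric.data)  # zero for perfectly symmetric data
skewness(skewed.data)     # positive for a long right tail
```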

Skewness #2

Pearson's second coefficient of skewness.

\( Sk = \frac{3(\bar X - Med)}{S} \)

  • \( \bar X \) : sample mean

  • \( Med \) : sample median

  • \( S \) : sample standard deviation

When \( Sk>0 \), the mean is larger than the median, which suggests a distribution with a longer right tail.
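
A minimal sketch of Pearson's second coefficient in R, using only mean(), median() and sd(). The function name pearson.sk2 is hypothetical.

```r
# Pearson's second coefficient of skewness: 3 * (mean - median) / S
pearson.sk2 <- function(x) {
  3 * (mean(x) - median(x)) / sd(x)
}

dataset <- c(5, 10, 15, 25, 2000)

mean(dataset)         # pulled upward by the value 2000
median(dataset)       # unaffected by the extreme value
pearson.sk2(dataset)  # positive, since the mean exceeds the median
```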

Kurtosis

The kurtosis indicates how “peaked” or flat the distribution of values appears.

\( kurtosis = \frac{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^4}{S^4} \)

The kurtosis of the normal distribution is 3. The “excess kurtosis” is calculated as “kurtosis - 3”.

High kurtosis indicates a distribution with tails heavier than the normal distribution and more outliers.
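
As a rough sketch, the kurtosis formula can be coded in a few lines and compared on simulated normal and heavy-tailed samples. The function name kurtosis, the seed, and the sample size are assumptions made for this illustration; exact values depend on the random draws.

```r
# kurtosis as defined above: mean fourth-power deviation divided by S^4
kurtosis <- function(x) {
  mean((x - mean(x))^4) / sd(x)^4
}

set.seed(1)                       # arbitrary seed, for reproducibility only
normal.data <- rnorm(100000)      # standard normal sample
heavy.data  <- rt(100000, 4)      # Student's t with 4 degrees of freedom

kurtosis(normal.data)             # should be close to 3
kurtosis(normal.data) - 3         # excess kurtosis, close to 0
kurtosis(heavy.data)              # expected to be well above 3
```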

Excess of kurtosis

# generate heavy-tailed (leptokurtic) data: 1000 draws from Student's t with 4 degrees of freedom
leptokurtic.data <- rt(1000, 4)

# visualize data with leptokurtic distribution
plot(density(leptokurtic.data))

Order Statistics

A quantile \( Q(p) \) is a number that separates the data set into two parts: \( 100\times p\% \) of the data are smaller than \( Q(p) \) and \( 100 \times (1-p)\% \) are larger than \( Q(p) \).

\( 0 < p < 1 \)

Special cases are:

  • Quartiles (\( p \in \{0.25,0.5,0.75\} \))
  • Deciles (\( p \in \{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\} \))
  • Percentiles (\( p \in \{0.01,0.02,...,0.97,0.98,0.99\} \))

Interquartile Range: \( IQR = Q(0.75)-Q(0.25) \)

Quantiles with R

my.sample<-c(0.8,1.3,2,5,6,10.3)
my.sample
[1]  0.8  1.3  2.0  5.0  6.0 10.3
# Q(p), p=0.1
quantile(my.sample,0.1)
 10% 
1.05 
# Q(p), p=0.5
quantile(my.sample,0.5)
50% 
3.5 
# Q(p), p=0.9
quantile(my.sample,0.9)
 90% 
8.15 
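
The values above are not observed data points because quantile() interpolates between order statistics. The sketch below reproduces R's default rule (type = 7) by hand; the function name quantile.type7 is hypothetical and is written only to make the interpolation explicit.

```r
# manual version of R's default quantile rule (type = 7):
# position h = (n - 1) * p + 1, then linear interpolation between neighbours
quantile.type7 <- function(x, p) {
  x  <- sort(x)
  n  <- length(x)
  h  <- (n - 1) * p + 1
  lo <- floor(h)
  hi <- min(lo + 1, n)   # guard against running past the last observation
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

my.sample <- c(0.8, 1.3, 2, 5, 6, 10.3)

quantile.type7(my.sample, 0.1)  # h = 1.5: halfway between 0.8 and 1.3
quantile.type7(my.sample, 0.5)  # h = 3.5: halfway between 2 and 5
quantile.type7(my.sample, 0.9)  # h = 5.5: halfway between 6 and 10.3
```

Other definitions are available through the type argument of quantile() and can give slightly different answers for small samples.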

Boxplot (a powerful descriptive tool)

# generate 100 numbers from standard normal
x<-rnorm(100)

# draw a boxplot of the sample generated above
boxplot(x)

Boxplot rule for outliers

An observation is classified as an outlier if it is:

Greater than \( Q(0.75)+1.5 \times IQR \)

OR

Less than \( Q(0.25)-1.5 \times IQR \)

where

\( IQR = Q(0.75)-Q(0.25) \)
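
The rule can be sketched in R as follows, applied to the small dataset used in the MAD example. The names lower.fence, upper.fence and outliers are introduced here for illustration. Note that boxplot() itself works with Tukey's hinges, which can differ slightly from the quartiles returned by quantile() for some sample sizes.

```r
dataset <- c(5, 10, 15, 25, 2000)

# quartiles and interquartile range
q1  <- unname(quantile(dataset, 0.25))
q3  <- unname(quantile(dataset, 0.75))
iqr <- q3 - q1                      # same value as IQR(dataset)

# thresholds of the boxplot rule
lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr

# observations outside the thresholds are flagged as outliers
outliers <- dataset[dataset < lower.fence | dataset > upper.fence]
outliers
```

For this dataset the thresholds are -12.5 and 47.5, so only the value 2000 is flagged.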

Boxplot for Leptokurtic Distribution

boxplot(leptokurtic.data)

Using R for data summary

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         6.3          2.9           5.6          1.8  virginica
         5.0          3.5           1.6          0.6  setosa
         5.0          3.0           1.6          0.2  setosa
         7.2          3.0           5.8          1.6  virginica
         4.4          2.9           1.4          0.2  setosa
         4.4          3.0           1.3          0.2  setosa
         5.6          2.8           4.9          2.0  virginica
         5.2          2.7           3.9          1.4  versicolor
         6.3          3.3           4.7          1.6  versicolor
         6.3          2.8           5.1          1.5  virginica

Basic Summary

Load the data:

data(iris)

Summarize the data:

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  



Categorized Boxplot (Bivariate Summary)

boxplot(iris$Sepal.Length ~ iris$Species)
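
A numerical counterpart of the categorized boxplot can be sketched with tapply(), which applies a summary function within each species. The names group.medians and group.iqr are hypothetical.

```r
# median sepal length within each species
group.medians <- tapply(iris$Sepal.Length, iris$Species, median)

# interquartile range of sepal length within each species
group.iqr <- tapply(iris$Sepal.Length, iris$Species, IQR)

group.medians
group.iqr
```

The medians correspond to the thick horizontal line inside each box of the plot.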

Statistics from Boxplot #1

stats.box<-boxplot(iris$Sepal.Length ~ iris$Species)

Statistics from Boxplot #2

stats.box
$stats
     [,1] [,2] [,3]
[1,]  4.3  4.9  5.6
[2,]  4.8  5.6  6.2
[3,]  5.0  5.9  6.5
[4,]  5.2  6.3  6.9
[5,]  5.8  7.0  7.9

$n
[1] 50 50 50

$conf
         [,1]     [,2]     [,3]
[1,] 4.910622 5.743588 6.343588
[2,] 5.089378 6.056412 6.656412

$out
[1] 4.9

$group
[1] 3

$names
[1] "setosa"     "versicolor" "virginica"
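
To make this output easier to read, the sketch below labels its parts. Each column of $stats holds, for one species, the end of the lower whisker, the lower hinge, the median, the upper hinge and the end of the upper whisker; $conf holds the notch limits; $out and $group give the flagged values and the index of the group each belongs to. The names five.num and flagged are introduced here for illustration, and plot = FALSE is used only to skip the drawing.

```r
# compute the boxplot statistics without drawing the plot
stats.box <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# label the rows and columns of the $stats matrix
five.num <- stats.box$stats
dimnames(five.num) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  stats.box$names
)
five.num

# match each flagged value to the species it belongs to
flagged <- data.frame(
  value   = stats.box$out,
  species = stats.box$names[stats.box$group]
)
flagged
```

Here the single flagged observation, 4.9, belongs to group 3, which is virginica.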