Extensive Statistical Data Analysis Pipeline

Introduction

In this statistical data analysis project, We will walk through a comprehensive data analysis pipeline starting with the basic analytical procedures later expanding towards more advanced, and sophisticated methods and finally hypothesis testing of different kinds. All the code chunks will be shown in rectangular box. The output of the codes will also be shown in rectangular box followed by ##. The comments in the code chunks will be followed by #.

For this project we will import a curated dataset, ICUdata, on the various observations of ICU patients, prepared by Statistician Matthias Kohl based on real life health data. Also, a couple of necessary libraries will be loaded. The simple actions of the codes will be summarized in the comments inside the code chunks. All the required libraries should be installed beforehand if any of them missing from the R environment.

icudata<-read.csv(file.choose(), header = T, stringsAsFactors = T) #Importing data and also converting all character vectors/columns to factors
library(ggplot2)
library(dplyr)
library(DescTools)
library(RColorBrewer)
icudata<-as.data.frame(icudata)

Let’s check the first few rows of the “ICUdata” and its structure.

head(icudata)

##   ID    sex age          surgery heart.rate temperature bilirubin SAPS.II
## 1  1 female  76            other       98.0        36.5  6.512142      57
## 2  2 female  60 gastrointestinal       80.0        38.1 14.523197      52
## 3  3   male  66   cardiothoracic       99.6        37.4 22.972480      57
## 4  4   male  74            other      110.0        39.1 19.299346      45
## 5  5 female  68            other       94.1        38.5 39.076485      49
## 6  6   male  68   cardiothoracic       88.8        35.1 14.805941      53
##   liver.failure LOS              outcome
## 1             0   1                 died
## 2             0   2                 home
## 3             0   1 secondary care/rehab
## 4             0   2                 home
## 5             0   1                 home
## 6             0   1 secondary care/rehab

str(icudata)

## 'data.frame':    500 obs. of  11 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex          : Factor w/ 2 levels "female","male": 1 1 2 2 1 2 2 2 2 2 ...
##  $ age          : int  76 60 66 74 68 68 70 55 75 71 ...
##  $ surgery      : Factor w/ 5 levels "cardiothoracic",..: 4 2 1 4 4 1 1 1 1 1 ...
##  $ heart.rate   : num  98 80 99.6 110 94.1 88.8 102 106 109 102 ...
##  $ temperature  : num  36.5 38.1 37.4 39.1 38.5 35.1 36.7 39.8 39.9 38.4 ...
##  $ bilirubin    : num  6.51 14.52 22.97 19.3 39.08 ...
##  $ SAPS.II      : int  57 52 57 45 49 53 25 19 58 56 ...
##  $ liver.failure: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LOS          : int  1 2 1 2 1 1 1 1 1 3 ...
##  $ outcome      : Factor w/ 4 levels "died","home",..: 1 2 4 2 2 4 4 4 3 4 ...

The general structure reveals there are 500 observations with 11 variables: 3 factors or categorical, 5 integer, and 3 numerical/continuous variables.

The table and prop.table functions show the absolute and relative frequencies of the variables.

table(icudata$surgery) #absolute frequencies

## 
##   cardiothoracic gastrointestinal            neuro            other 
##              223               79               46              121 
##           trauma 
##               31

prop.table(table(icudata$surgery)) #relative frequencies

## 
##   cardiothoracic gastrointestinal            neuro            other 
##            0.446            0.158            0.092            0.242 
##           trauma 
##            0.062

Barplot function has been used to graphically represent the absolute and relative frequencies.

barplot(table(icudata$surgery), main = "Absolute Frequency of Surgery Data") #Plotting absolute frequencies

barplot(prop.table(table(icudata$surgery)), main = "Relative Frequency of Surgery Data") #plotting relative frequencies

The following codes will measure the Median, Interquartile range and quantiles for some prescribed probabilities of SAPS-II Score. SAPS II was designed to measure the severity of disease for patients admitted to Intensive care units aged 18 or more. 24 hours after admission to the ICU, the measurement has been completed and resulted in an integer point score between 0 and 125, and a predicted mortality between 0% and 100%.

#find the median of SAPS.II
median(icudata$SAPS.II)

## [1] 42

#alternative to find the median
quantile(icudata$SAPS.II, probs = 0.5)

## 50% 
##  42

#interquartile range
IQR(icudata$SAPS.II)

## [1] 26

#Measure different quantiles of SAPS-II for different probabilities
quantile(icudata$SAPS.II, probs = c(0.5,.75,0.90,0.95))

##   50%   75%   90%   95% 
## 42.00 57.00 67.10 75.05

Let’s draw the box and whisker plot of SAPS-II values. We will graph two different boxplots: one by base R graphics and another by qplot function provided by ggplot library. In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles.

#box and whisker plot of the SAPS-II values
boxplot(icudata$SAPS.II, main="500 ICU Patients", ylab="SAPS-II Score")

#create boxplot again by using qplot function from ggplot2
qplot(x=1,y=SAPS.II,data = icudata, geom = "boxplot", xlim = c(0,2), ylab = "SAPS-II Score")

Technically, a cumulative frequency distribution is the sum of the class and all classes below it in a frequency distribution. You can plot the Empirical Cumulative distribution by using the ecdf function.

#plotting the cumulative distribution function
plot(ecdf(icudata$SAPS.II), xlab="SAPS II", do.points=T, main="Empirical Cumulative Dist Function")

Bivariate Analysis

So far we have analyzed variables separately but now we can investigate the relationship between variables. Let’s build the contingency table for sex and surgery using the table function.

table(icudata$sex, icudata$surgery)

##         
##          cardiothoracic gastrointestinal neuro other trauma
##   female             61               31    19    57      7
##   male              162               48    27    64     24

The table contains the absolute frequencies between all the levels of the variable sex and surgery. From the contingency table it is evident that men undergo more Cardiothoracic surgery than female.

It is possible to represent the table with the proportion row-wise or column-wise using the prop.table function.

prop.table(table(icudata$sex, icudata$surgery), margin = 1) #margin 1 selected for row-wise operation

##         
##          cardiothoracic gastrointestinal      neuro      other     trauma
##   female     0.34857143       0.17714286 0.10857143 0.32571429 0.04000000
##   male       0.49846154       0.14769231 0.08307692 0.19692308 0.07384615

That’s a little bit hard to appreciate. Let’s convert this in the easy-to-read percentage format rounded up to 1 decimal point.

round(100*prop.table(table(icudata$sex, icudata$surgery), margin = 1),1)

##         
##          cardiothoracic gastrointestinal neuro other trauma
##   female           34.9             17.7  10.9  32.6    4.0
##   male             49.8             14.8   8.3  19.7    7.4

Bar charts are normally used to graphically represent the contingency table. Here, we combine the use of table(), prob.table(), barplot() functions to plot the variables sex and surgery.

barplot(prop.table(table(icudata$sex,icudata$surgery), margin = 1), 
        beside = T, legend.text = T, ylab = "Relative Frequency", 
        main = " Barplot of Sex and Surgery") #beside=T ensures the plots for male and female are side by side

Correlation

Correlation is a statistic that measures the degree to which two variables move in relation to each other. Here, we compute the correlation between SAPS-II and LOS (Length of Stay) variables using the cor() function. We will measure two different types of correlation: Spearman’s rho and Kendall’s tau.

#Spearman's rho
cor(icudata$SAPS.II, icudata$LOS, method = "spearman")

## [1] 0.3379928

#Kendall's tau
cor(icudata$SAPS.II, icudata$LOS, method = "kendall")

## [1] 0.2518917

We found a positive correlation between SAPS-II and Length of Stay (LOS). The patients with SAPS-II surgery are expected to stay in the hospital for longer period of time. In contrary, the patients with very high SAP-II score might have a low LOS because there is a high probability those patients will die soon after they were admitted to ICU. Let’s confirm this with a scatter plot between SAPS-II score and lenght of stay.

plot(icudata$SAPS.II,icudata$LOS, xlab = "SAPS-II", ylab = "LOS", main = "SAPS-II vs LOS")

Indeed, the scatterplot suggests that most of the patients with high SAPS-II score have low LOS in the ICU.

We can also leverage ggplot2 library for better visualization of the scatterplot data.

icudata %>% ggplot(aes(SAPS.II,LOS))+
  geom_point(shape=19,alpha=0.5)+
  ggtitle("SAPS-II vs LOS")+xlab("SAPS-II Score")+ylab("LOS")

Precaution must be taken to interpret anything from the correlation as we have seen here a monotonic increase of LOS with respect to SAPS-II score for a limited range but not the full range. If a serious monotonic relationship does not exist between the variables then the Spearman’s rho and Kandall’s tau could be misleading.

Metric Evaluation

Univariate Analysis

The most frequently used statistical metric to describe data is the arithmetic mean. We can compute the arithmetic mean of the maximum temperature recorded during stay at ICU by the simple mean function.

mean(icudata$temperature)

## [1] 37.6632

It is suggested that we almost always compare mean with the median because median provides the description of middle of the data not being affected by the outliers, hence, more robust than the arithmetic mean.

median(icudata$temperature)

## [1] 37.7

Let’s compare the mean and median of the LOS variable.

mean(icudata$LOS)

## [1] 5.29

median(icudata$LOS)

## [1] 1

There’s evidently a clear difference between the mean and median of the length of stay in the ICU. The mean is greater than the median which suggests the LOS distribution is right skewed or the outliers in the right are pulling the mean towards right.

There’s another location parameter, geometric mean, which is strictly applicable only on positive data. There is no dedicated function in R to calculate the geometric mean but with a little bit trickery we can find the geometric mean by following the code below. First we find the mean of the logarithms and then take the exponential of it.

exp(mean(log(icudata$bilirubin)))

## [1] 17.24162

The geometric mean of the bilirubin levels is found to be 17.24.

In statistical analysis not only the location parameter but also the dispersion of data is considered important.One good measure of dispersion of the data is the standard deviation and another is Mean Absolute Deviation. Here we compare both.

sd(icudata$temperature)

## [1] 1.735474

mad(icudata$temperature)

## [1] 1.18608

We observe a clear difference between Standard Deviation and Mean Absolute Deviation of temperature. This indicates either the temperature distribution can’t be succinctly described by a distribution symmetric around the mean or may be there are outliers skewing the data. We will investigate on this matter.

Coefficient of Variation and Geometric Standard deviation

In statistical analysis, it is a common practice to find the standardized dispersion measure. We will find the Coefficient of Variation and Geometric Standard deviation. The coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data series around the mean. We will find the standardized dispersion measures of the maximum body temperature.

sd(icudata$temperature)/mean(icudata$temperature)#with respect to mean

## [1] 0.04607877

mad(icudata$temperature)/median(icudata$temperature)#with respect to mad and median

## [1] 0.03146101

IQR(icudata$temperature)/median(icudata$temperature)#with respect to interquartile range

## [1] 0.0397878

From these observations, we see minor variations with respect to mean and medina respectively with a range about 3-5%.

Geometric standard Deviation of bilirubin level will be calculated by the following code.

sd(log(icudata$bilirubin))

## [1] 0.7238379

Skewness and Kurtosis

Skewness also provides vital information on the shape of the distribution of data around the location parameters. A shape measure of symmetry is skewness. Here’s an illustration of skewness:

We will measure the skewness of maximum body temperature by using the function Skew from DescTools package.

Skew(icudata$temperature)

## [1] -8.77457

The skewness of the temperature distribution suggests a left skew here which contradicts our assumption of a symmetrical distribution. We assumed a symmetrical distribution from the fact that the mean and the median were more or less similar. So, what went wrong?! From a closer inspection of the icudata, it reveals the patient number 398 had a temperature record of only 9.1 degree celsius which could only be a transcription error.

icudata$temperature[398]

## [1] 9.1

We shall remove the erroneous data of the 398th patient and measure the skewness again.

Skew(icudata$temperature[-398])

## [1] 0.3142909

Voila! Now we notice the skewness is reduced and conforms with the symmetry of the temperature distribution.

Again we can check if removing the patient number 398 improves the standard deviation.

sd(icudata$temperature[-398])

## [1] 1.173187

The standard deviation also reduced clearly after excluding the 398th patient data which is now pretty close to the standardized MAD. Previously the standard deviation of temperature was 1.74.

All these analysis depicts the severity of a single outliers which can significantly distort the results of the statistical procedures.

We measure the skewness of the Length of Stay (LOS). We expect a positive skewness since we have seen previously that the mean and the median had a large disparity and also the mean was greater than the median for this LOS data.

Skew(icudata$LOS)

## [1] 4.880826

Our expectation of positive skewness was indeed correct.

Another metric for measure of shape is the Kurtosis. Normal distribution is the reference for the kurtosis. If kurtosis is negative or platykurtic then the distribution is flatter and less curvy than the normal distribution. In contrary, if the kurtosis is positive or leptokurtic then the distribution is much steeper and curvy than the normal distribution. Here’s an illustration of the Kurtosis:

We compute the Kurtosis of the maximum body temperature by using the function Kurt from the DescTools package. As we knew before that the patient number 398 have a great influence on the shape of the data so we compute the kurtosis of maximum body temperature both with including and excluding the 398th patient data.

Kurt(icudata$temperature)

## [1] 144.4649

Kurt(icudata$temperature[-398])

## [1] 0.3431707

We notice again how great the influence of a single observation can be on the shape of the distribution. After excluding the 398th patient data we can conclude that the maximum body temperature is almost normally distributed. Let’s check out the kurtosis of Length of Stay (LOS).

Kurt(icudata$LOS)

## [1] 33.59482

Thus the distribution of Length of Stay is leptokurtic.

Histogram and Density Plot

We proceed on with further analysis by using histograms. A frequency distribution shows how often each different value in a set of data occurs. A histogram is the most commonly used graph to show frequency distributions. We create a histogram of maximum body temperature, where we use an interval of 0.5 degree Celsius with the argument breaks.

icudata %>% ggplot(aes(temperature))+geom_histogram(breaks=seq(9,42,0.5), color="Black", fill="grey")+
  # color "black" will set the border color of each bin and "fill" argument will fill the bin with that color
  labs(title = "Histogram of Max Temperature of ICU Patients", x="Maximum Body Temperature",y="Absolute Frequency")

The extreme value of the patient 398 is clearly visible in the histogram at the far left corner. We can easily prune this extreme value point by simply using the xlim argument.

icudata %>% ggplot(aes(temperature))+geom_histogram(breaks=seq(9,42,0.5), color="Black", fill="grey")+
  labs(title = "Histogram of Max Temperature of ICU Patients", x="Maximum Body Temperature",y="Absolute Frequency")+xlim(33,43)

The histogram after excluding the extreme value seems more symmetric around the arithmetic mean and the distribution can be regarded as Normal distribution.

Next, we produce the histogram of Lenght of Stay (LOS) by the same fashion.

icudata %>% ggplot(aes(LOS))+geom_histogram(color="Black", fill="grey", binwidth = 1)+
  labs(title = "Histogram of Length of Stay at ICU", x="Length of Stay",y="Absolute Frequency")

The histogram clearly reveals a right skewed and more spiky distribution of Length of Stay at ICU. Most of the patients stayed for only a few days in the ICU.

On the other hand, we can visualize the distribution with an empirical density of the values. The Empirical density may be regarded as a smoothed out version of the histogram. We have precluded the data of 398th patient to plot the density of Maximum body temperature.

icudata[-398,] %>% ggplot(aes(temperature))+geom_density(color="black",fill="grey")+
  labs(title = "Density plot of Maxiumum Body Temperature at ICU", x="Temperature",y="Density")

we have found a density that is quite symmetric around the arithmetic mean.

Now, we can plot the histogram and density together if we can facilitate to plot the histogram from the count of density.

icudata[-398,] %>% ggplot(aes(temperature))+geom_histogram(aes(y=..density..), binwidth =0.3,
                                                           color="black",fill="grey")+
  geom_density(color="red")+
  labs(title = "Histogram and Density plot of Maxiumum Body Temperature at ICU", x="Temperature",y="Density")

Robustness of Rank Correlation

We now take a deeper look at the correlation between the Maximum body temperate and Maximum heart rate. By default the cor function utilizes the Pearson’s correlation.

cor(icudata$temperature,icudata$heart.rate)

## [1] 0.1763067

The result indicates a slight positive relation between Max body temperature and the heart rate. We plot the maximum heart rate against the maximum body temperature to have a visual understanding if there is a positive correlation between the data.

icudata %>% ggplot(aes(temperature,heart.rate))+
  geom_point(shape=20,alpha=0.5)+labs(title = "500 ICU Patients", x="Maximum Body Temperature",
                                      y="Maximum Heart Rate")

We can still see the extreme value at the far left side of the graph.

We proceed by excluding this extreme data and measure the correlation again.

cor(icudata$temperature[-398],icudata$heart.rate[-398])

## [1] 0.2978033

Suddenly, the correlation between Max Temperature and Max heart rate almost became double after excluding the extreme value. This implies a single outlier may have a profound impact on Pearson’s correlation.

Now we investigate the influence of outliers in the case of Spearman’s Rho and Kendalls Tau.

cor(icudata$temperature,icudata$heart.rate, method = "spearman")

## [1] 0.2659957

cor(icudata$temperature[-398],icudata$heart.rate[-398], method = "spearman")

## [1] 0.2707241

cor(icudata$temperature,icudata$heart.rate, method = "kendall")

## [1] 0.1826903

cor(icudata$temperature[-398],icudata$heart.rate[-398], method = "kendall")

## [1] 0.1858804

In both of the cases, the correlation changed slightly even with the presence and absence of the outlier. From the above analysis it is evident that, switching to a rank correlation provides better guard and robustness against outliers.

From the following plot, we can observe a monotone positive relationship between Maximum Body Temperature and Maximum Heart Rate, if not a linear relationship after excluding the outlier.

icudata[-398,] %>% ggplot(aes(temperature,heart.rate))+
  geom_point(shape=20,alpha=0.5)+labs(title = "500 ICU Patients", x="Maximum Body Temperature",
                                      y="Maximum Heart Rate")

Visualization with colors and Diagrams

There’s a saying that a picture says thousand words. Visualization plays a very crucial role in understanding the data and its implications. We will install and load the library RColorBrewer developed by Neuwirth in 2014. Let us wander through this beautiful color palette. First we check for the qualitative color palette which are normally used to visualize categorical data/variables.

display.brewer.all(type = "qual") # "qual" stands for qualitative

The second type of color palettes provide colors for different attributes, whose levels range from uninteresting or unimportant to important or interesting.

display.brewer.all(type = "seq")

Eventually, there’s a third group of color palettes with a range from negative to neutral to positive or simply divergent to convergent series.

display.brewer.all(type = "div")

We will plot the absolute frequencies of different types of Surgery but now we will additionally use color palette set1 from ColorBrewer. For this, we need to create a vector of colors first. We have seen there are 5 different types of surgery so we will choose 5 different colors from set1.

cols<-brewer.pal(n=5, name = "Set1")
cols

## [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3" "#FF7F00"

Colors are encoded in hexadecimal code. We will use this newly created cols vector in the barplot of absolute frequencies of different types of Surgery.

barplot(table(icudata$surgery), main = "Different Types of Surgery", 
        ylab = "Absolute Frequency", col = cols)

The barplot can also be drawn with ggplot2 developed by Hadley Wickham. Here we generate the barplot of relative frequencies in percentage by applying ggplot2 functions.

icudata %>% ggplot(aes(x=surgery))+geom_bar(aes(y=100*(..count..)/sum(..count..)), fill=cols, width = 0.5)+
  labs(title = "Barplot of Surgery",y="Relative Frequency in %") # "..count.." will count the number of surgery for each category.

In the sequel, we create the box and whisker plot of the SAPS-II score for the different types of surgery. There are no orderings for sugery so we can use colors for qualitative data from Set3 color palettes.

icudata %>% ggplot(aes(surgery,SAPS.II))+geom_boxplot(fill=cols)+ 
  labs(title = "Boxplot of SAPS-II Score in Different Surgery groups", y="SAPS-II Score")

Colors have huge implications in the case of histogram. Here we will again draw the histogram of maximum body temperature but this time with proper color annotations. We know that blue color normally denotes colder and red color denotes warmer. We create three different intervals: <36 (colder), 36-37.5 (Normal), >37.5 (warmer).

There are five sub-intervals below the 36 degree Celsius, three between 36-37.5 and nine sub-intervals after 37.5 degree Celsius. We use ColorBrewer palette Blues and invert it for the first five intervals. Then for the next three intervals we use Green for normal temperature and repeat if three times. For the last nine intervals we use the ColorBrewer palette Reds.

cols1<-rev(brewer.pal(5, "Blues")) # "rev" will reverse the ordering of colors. 
cols2<-rep("#00FF00",3) # "#00FF00" stands for color Green in hexadecimal. 
cols3<-brewer.pal(9,"Reds")

Let’s create the histogram of maximum body temperature annotated with colors for different levels of temperature. For simplicity we exclude the data of patient 398 beforehand.

icudata[-398,] %>% ggplot(aes(temperature))+geom_histogram(breaks=seq(from=33.5, to=42, by=0.5), col="black", 
                                                           fill=c(cols1, cols2, cols3))+
  labs(title = "Histogram of Maximum Body Temperature", x="Temperature", y="Absolute Frequency")

We will use the powerful functions of ggplot2 to produce a scatter diagram of Maximum heart rate vs Maximum body temperature where the points will be labeled with sex variable. One important aspect of using color that this third variable sex will function as another dimension on the two dimensional graph.

icudata[-398,] %>% ggplot(aes(temperature, heart.rate, colour=sex))+geom_point(shape=20, alpha=0.5)+
  scale_color_manual(values = c("#FF0000", "#0000FF"))+ #"#FF0000" for Red color and "#0000FF" for Blue
  labs(title = "Scatter Diagram of Max Heart Rate vs Max Body Temperature", x="Temperature",y="Heart Rate")

The observations of females and males are almost uniformly distributed over the scatter diagram. So, we can conclude that sex does not play any role on Maximum Body temperature and Heart Rate.

Probability Distributions

Probability distributions are the backbone for the inference of an unknown population given the sample data. Here, we are going to discuss on the most frequently observed probability distributions in practical scenarios and their application in R. In actual life, we seldom know what processes are driving the data generation for the random variables but we can definitely try to estimate the underlying processes and model them through probability distributions.

Uniform Distribution

We can start with the Uniform Distribution where each random event has equal probability to occur within a certain range.

uniform.data<-runif(100000, 0, 10) #here we are randomly producing 100000 uniformly distributed data within range 0 to 10
plot(density(uniform.data))

Here, we can see the uniform density plot of random variable within the range 0 to 10 where the probability for almost all of the data points are at the same height in the distribution. If we measure the area under the curve within the range we will get 1 which supports the total probability theorem.

Here we show some of the important functions in R to calculate different statistical measures. Specially there are 4 prefixes to exact front of the name of each distribution function: “p”, “q”, “r”, “d” which stand for cumulative, quantile, random number and density respectively.

runif(5, min = 0, max = 10) #this will generate 5 random numbers within 0 and 10
punif(6, min = 0, max = 10) #this will calculate the probability below 6. Here 6 is the quantile.
qunif(0.6, min = 0, max = 10) #this will calculate the number (quantile) below which the probability is 0.6. this is inverse of the function *punif*
dunif(2, min = 0, max = 10) #this will calculate the height of the density function for the value 2

Binomial Distribution

Binomial distribution deals with the probability for a certain number of successes x out of n number of trails with a given probability p for a success. The sample space of binomial trial will contain only two outcomes, success or failure.

Suppose, we have a biased coin with a probability of p=0.6 for the head and 1-p=0.4 for the tail. If we throw the dice for 20 times then what is the probability that we get exactly 6 heads? What is the probability that we get less than 6 heads? We can calculate these in R by the following codes.

dbinom(x=6, size = 20, prob = 0.6) #Probability of getting exactly 6 heads.

## [1] 0.004854351

pbinom(q=6, size = 20, prob = 0.6, lower.tail = T) # Probability of getting less than or equal to 6 heads

## [1] 0.006465875

dbinom(x=c(7,9,11), size = 20, prob = 0.6) # Probs of getting 7, 9 and 11 heads separately

## [1] 0.01456305 0.07099488 0.15973848

qbinom(p=0.7, size = 20, prob = 0.6, lower.tail = T) #getting the number or the quantile below which the prob is 0.7

## [1] 13

We will see different plots of binomial distribution with different probabilities for success.

plot(x = 0:20, y = dbinom(x = 0:20, size = 20, prob = 0.5), 
     type = "h", xlab = "Number of Successes", ylab = "Probability", main = "Binomial Distribution with success rate=0.5") #probs for 0 to 20 successes with 20 trials and type="h" for histogram

We see a symmetrical distribution of probability, when p=0.5. In this case, we get almost equal number of successes on average with respect to the failures for each 20 trails.

The distribution is right skewed when the probability for success is low. So, we get lower number of successes on average for each 20 trials.

The distribution has now shifted more towards right and become left skewed due to the high probability for success. This time, we get higher number of successes on average for each 20 trials.

Poisson Distribution

Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

For example, a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute has a Poisson distribution: the most likely numbers are 2 and 3 but 1 and 4 are also likely and there is a small probability of it being as low as zero and a very small probability it could be 10.

dpois(x=3, lambda = 180/60) #prob for 3 calls per minute with an average of 180/60 calls per min, p(x=3)

## [1] 0.2240418

dpois(x=c(2,4,6), lambda = 180/60) #probs for 2, 4 and 6 calls per min

## [1] 0.22404181 0.16803136 0.05040941

ppois(q=4, lambda = 180/60, lower.tail = T) #prob for less than or equal to 4 calls, p(x<=4)

## [1] 0.8152632

ppois(q=4, lambda = 180/60, lower.tail = F) #prob for greater than or equal to 4 calls, p(x>=4)

## [1] 0.1847368

qpois(p=0.90, lambda = 180/60, lower.tail = T) #number of calls for which prob lies below 90%

## [1] 5

We will plot different Poisson Distributions to see the effect of lambda on the shape of probability distribution. In the example above, we have a lambda=3 calls per minute. We will play around a little bit with lambda to see the change in distribution pattern.

plot(x = 0:20,y = dpois(x = 0:20,lambda = 3),type = "h", 
     xlab = "Number of calls Per Minute",ylab = "Probability Density",
     main = "Poisson Distribution with lambda=3")

The probability is higher for the number of calls equal to the parameter lambda and the distribution is skewed to the right.

Lets plot another Poisson distribution for lambda=10.

plot(x = 0:20,y = dpois(x = 0:20,lambda = 10),type = "h", 
     xlab = "Number of calls Per Minute",ylab = "Probability Density",
     main = "Poisson Distribution with lambda=10")

We see, the distribution has almost become symmetric around the mean which is lambda=10.

plot(x = 0:20,y = dpois(x = 0:20,lambda = 18),type = "h", 
     xlab = "Number of calls Per Minute",ylab = "Probability Density",
     main = "Poisson Distribution with lambda=18")

Here we see, the distribution has shifted to the right as the number of calls per minute has increased to 18, although this scenario is not realistic.

Normal Distribution

Normal distribution is by far the most important probability distribution in whole of statistics. The normal distribution produces a bell curve with a certain mean and a standard deviation.

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.

pnorm(q=65, mean = 70, sd = 5, lower.tail = T) #prob of getting less than 65, p(x<65)

## [1] 0.1586553

pnorm(q=75, mean = 70, sd=5, lower.tail = F) # prob of getting more than 75, p(x>75)

## [1] 0.1586553

pnorm(q=1, mean = 0, sd = 1) #prob of getting less than a z-score of 1

## [1] 0.8413447

pnorm(q=1, mean = 0, sd = 1, lower.tail = F) #prob of getting less than 1 z-score.

## [1] 0.1586553

qnorm(p=0.25, mean = 70, sd = 5) #finding the number below which the prob is 25% or the 1st quartile

## [1] 66.62755

qnorm(p=c(0.25,0.50,0.75), mean = 75, sd=5) #finding the 1st, 2nd and 3rd quartiles.

## [1] 71.62755 75.00000 78.37245

Lets draw a normal distribution density curve using the dnorm and plot function.

x<-seq(from=50, to=90, by=0.25) #creating a sequence of numbers for the density plot
density.x<-dnorm(x, mean = 70, sd = 5) #prob density for each of the number in x
plot(x,density.x, type="l", main = "Normal Density Curve: Mean=70, SD=5", xlab = "x", ylab = "Probability Density", las=1) #creating the normal density curve
abline(v=70) #a vertical line through the mean of 70

We can also approximate the Binomial Distribution by normal distribution when the parameter for Binomial distribution n is very large and probability for success p is very small. First, we will plot binomial distribution for which n=1000 and p=0.1. The distribution will have a mean or expected value, \(E=n*p=1000*0.1=100\)

plot(x = 0:1000, y = dbinom(x = 0:1000, size = 1000, prob = 0.1), 
     type = "l", xlab = "Number of Successes", ylab = "Probability", main = "Binomial Distribution with success rate=0.1") #type="l" for drawing line

The plot clearly reveals the distribution closely follows a normal distribution within the range 0 to 200. We will plot a normal distribution with mean=100 and standard deviation, \(sd=\sqrt{np}\) to approximate the Binomial distribution.

plot(x = 0:200, y = dnorm(x = 0:200, mean = 100, sd=10), type = "l", xlab = "Number of Successes", ylab = "Probability Density", main = "Normal Distribution with mean=100, SD=10")

We can clearly see, the normal distribution closely resembles the Binomial distribution when the sample size(n) is large and probability of success(p) is small.

t-Distribution

Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population’s standard deviation is unknown.

The t-distribution plays a role in a number of widely used statistical analyses, including Student’s t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis.

The t-distribution is symmetric and bell-shaped, like the normal distribution. However, the t-distribution has heavier tails, meaning that it is more prone to producing values that fall far from its mean.

Suppose, we have run an experiment and found a test statistics of 2.5 with 20 degrees of freedom. Lets compute the p-value for this statistics or simply p(t>2.5).

pt(q = 2.5, df = 20, lower.tail = F) #p-value for one sided t-test

## [1] 0.01061677

2*pt(q = 2.5, df=20, lower.tail = F) #p-value for two sided t-test

## [1] 0.02123355

We get a p-value close to 0.01 for one sided t-test and 0.02 for two sided t-test which will be statistically significant if the critical p-value for this experiment is 0.05.

If we want to construct a 95% confidence interval then we might need to find the t-values above 0.975 probability and below 0.025.

qt(p = 0.975, df = 20, lower.tail = T)

## [1] 2.085963

qt(p = 0.025, df = 20, lower.tail = T)

## [1] -2.085963

We will plot a t-distribution with 20 degrees of freedom and compare it with an equivalent Normal distribution.

x<-seq(-4,4,length.out = 100) #creating a sequence of 100 equally spaced numbers between -4 and 4
plot(x = x, y = dt(x = x, df = 20),xlab = "t-values", ylab = "Probability Density",
     main = "t-Distribution with 20 degrees of freedom", type = "l")

plot(x = x, y = dnorm(x = x), type = "l", xlab = "Z-score", ylab = "Probability Distribution",
     main = "Normal Distribution")

The t-distribution looks very similar to normal distribution but as the sample size becomes smaller the t-distribution differs significantly from the normal distribution.

Hypergeometric Distribution

If we draw from a sample without replacement, it results in a hypergeometric distribution.

Lets consider a box containing m white balls and n black box. If we draw k balls out of that box randomly and without replacement where k<m+n then the random variable x, describing the number of white balls drawn from the box follows a hypergeometric distribution. The probability mass function is as following:

\[P(X=j)=\frac{\binom{m}{j}\binom{n}{k-j}}{\binom{m+n}{k}}\]

Suppose, there are 500 lights bulbs. Five of them are defective and others are just fine. Now, if we draw 20 bulbs without replacement what is the probability that no bulbs will be defective? Number of defective bulbs are less than or equal to 3? What is the probability that at least one bulb is defective?

dhyper(x=0,m = 5, n = 495, k = 20) #prob of no defective bulb

## [1] 0.8146893

phyper(q = 3, m = 5, n = 495, k = 20, lower.tail = T) #prob of less than or equal to 3 defective bulbs

## [1] 0.9999908

phyper(q = 0, m = 5, n = 495, k = 20, lower.tail = F) #prob of at least one defective bulb

## [1] 0.1853107

qhyper(p = 0.95, m = 5,n = 495,k = 20,lower.tail = T) #number of defective bulbs for which prob is below 0.80

## [1] 1

We will plot a hypergeometric distribution for which number of total bulbs is 500, defective bulbs is 5 and we randomly select 20 bulbs.

x<-0:5
y<-dhyper(x = 0:5, m = 5, n = 495, k = 50) #vector of probs for 0 to 5 defective bulbs 
dat<-data.frame(x=x,y=y)
dat %>% ggplot(aes(x,y))+geom_point(col="dark red")+geom_smooth(method = "loess", se =F, col="sky blue")+ggtitle(label = "Hypergeometric Distribution")+
  labs(x="Number of Defective Bulbs", y="Probability Density") #"loess" method for fitting a smooth curve through the points.

Negative Binomial Distribution and Geometric Distribution

Negative Binomial distribution is a so-called waiting time distribution. It is applicable in the situation where we specify how many unsuccessful attempts we have to make until we get required number of successes.

Suppose, 1% of the bulbs are defective in a production line. Let, X be the random variable that describes the number of functioning bulbs drawn until we get a defective bulb. What is the probability that X=20 or the probability that 21st bulb in the first defective bulb? This is also known as the Geometric distribution as we are looking for only the first instance of defective bulb. We can find this by the following R-code.

dgeom(x = 20, prob = 0.01) #20 failures before getting the first success

## [1] 0.008179069

dnbinom(20, size = 1, prob = 0.01) #21st bulb is the first defective bulb

## [1] 0.008179069

Both codes show the same results because geometric distribution is a special form of negative binomial distribution.

pgeom(q = 10, prob = 0.01, lower.tail = T) #probability of 20 failures or less to get the first success.

## [1] 0.1046617

qgeom(p = 0.95, prob = 0.01, lower.tail = T) #number of failures for 95th percentile to get the first success

## [1] 298

We will plot the Geometric Distribution with a 1% probability for success.

x <- 0:20
y <- dgeom(x = 0:20, prob = 0.01)
dat<-data.frame(x=x,y=y)
dat %>% ggplot(aes(x,y))+geom_point(col="dark red")+geom_smooth(method = "lm", se =F, col="sky blue")+ggtitle(label = "Geometric Distribution with 1% success rate")+
  labs(x="Number of Failures Before the First Success", y="Probability Density")

In 2014, the prevalence of diabetes in adults was 9% around the world. We are ready to sample 250 people where we need at least 20 persons with diabetes. So, what is the likelihood that we will get 20 diabetes patients when we will sample a maximum of 250 adults. That means, we can at most get 230 adults without diabetes. Here, how we do it.

pnbinom(q = 230, size = 20, prob = 0.09) #at most 250 adults to get 20 patients with diabetes

## [1] 0.7407983

We can also use the quantile function to determine the sample size to get 20 diabetes patients given the probability of success.

qnbinom(p = 0.95, size = 20, prob = 0.09)

## [1] 286

The output 286 is the total number of failure before getting 20 diabetes patients. So, we need to sample 286+20=306 adults to get 20 diabetes patients with 95% probability.

Lets produce some random sampling to get 20 diabetes patients.

mean(rnbinom(n = 10, size = 20, prob = 0.09))

## [1] 196.1

Statistical Hypothesis Testing

Empirical investigations and studies often start with a new idea, a conjecture about a certain problem. This conjecture is usually postulated on the basis of empirical observations and may subject to theoretical considerations. It facilitates the verification of an assumption. In this case, we can talk about a Hypothesis. First, one should collect all the relevant available information about the problem and elaborate the theoretical background to verify whether the proposed Hypothesis is generally plausible.

In many cases, the direct proofs of these hypotheses are not possible and can’t be verified by a single experiment. At this point, statistics come into play. We collect representative and relevant data for the problem and subject the data to a suitable statistical analysis, where the results can be ensured by so-called Statistical Tests.

The origin of Statistical tests are two mutually exclusive hypotheses. They are usually denoted as follows:

Null Hypothesis \(H_{o}\): Hypothesis that shall be falsified. Alternate Hypothesis \(H_{o}\): Hypothesis that shall be confirmed(Research Hypothesis)

In the framework of inferential statistics we assume samples of larger populations. All values we compute, depend on the concrete sample and are subject to uncontrollable random variations. With respect to the decisions made based on statistical results, one has to conclude that wrong decisions can never be avoided. It is inevitable, that we make a wrong decision with a positive probability but we hope that to be very small as possible.

Thus, the possible wrong decisions are:

Type-I error: Probability of rejecting \(H_{o}\) although it is true.

Type-II error: Probability of not rejecting \(H_{o}\) although it is false.

In medical application, the type-I error is considered to be more serious.

We have to do some housekeeping before commencing any statistical hypothesis testing. We propose a list containing the necessary steps for conducting a statistical test. In the framework of clinical trial, it is strictly necessary to arrange the following set up, so that no one can interfere with the outcomes after the start of the trial.

Definition of the Hypotheses, \(H_o\) and \(H_1\) (One sided/Two sided)
Fixing the Type-I error(Significance Level)
Selection of an appropriate Test T
Sample size calculation and power analysis
Determination of rejection and acceptance region for \(H_o\)
Conducting the experiments and generation of relevant data \(x_1,.....,x_n\)
Calculation of the test statistics, \(t=T(x_1,....,x_n)\)
Decision for \(H_1\) or \(H_o\)

In practice, the test decision is often determined by the virtue of p-value. P-value is the conditional probability that the test statistics lies in the rejection region of \(H_0\) under the assumption that \(H_0\) is true. If p-value is very small we can say, it is unlikely that the data stems from the null hypothesis. We can opt for the alternative hypothesis in this case. More precisely, we shall decide as follows:

If \(p \le \alpha\): rejection of \(H_0\) (\(\alpha\) is the significance level) If \(p>\alpha\): acceptance of \(H_0\)

One Sample t-test

We will start with the most frequently used statistical test, the student’s t-test. We have shown previously that the temperature of ICU patients in our ICU data closely follows a normal distribution. Here, we will investigate the Hypothesis, whether the ICU patients have an increased body temperature. The Null and alternate hypothesis as follows:

\(H_o\): \(\mu \le 37.5\) \(H_1\): \(\mu>37.5\)

We will use one sample t-test, where we will omit the 398th patient.

t.test(icudata$temperature[-398],mu = 37.5, alternative = "greater") #alternative="greater" for one sided test

## 
##  One Sample t-test
## 
## data:  icudata$temperature[-398]
## t = 4.1973, df = 498, p-value = 1.599e-05
## alternative hypothesis: true mean is greater than 37.5
## 95 percent confidence interval:
##  37.63389      Inf
## sample estimates:
## mean of x 
##  37.72044

Based on a significance level of 5%, we can assert that the ICU patients had an increased body temperature during their stay on the ICU.

Two Sample t-test

Now, we will progress towards the two sample t-test, where we will compare two independent groups. In this case, we will investigate, whether the body temperature of females(\(\mu_1\)) and males (\(\mu_2\)) is significantly different. We assume the temperature of both females and males follow a normal distribution.

Before the test, we check the standard deviation of temperature for both female and male group. We omit the patient 398.

sd(icudata$temperature[icudata$sex=="female"])

## [1] 1.258335

sd(icudata$temperature[-398][icudata$sex[-398]=="male"])

## [1] 1.123373

We observe a difference between the variances of this two groups. So, we will follow a more conservative approach for the two sample t-test: Welch t-test where the separation or the grouping of the two sexes is done by the the formula temperature ~ sex.

t.test(temperature~sex, data = icudata[-398,])

## 
##  Welch Two Sample t-test
## 
## data:  temperature by sex
## t = -1.2514, df = 323.73, p-value = 0.2117
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.36618695  0.08144621
## sample estimates:
## mean in group female   mean in group male 
##             37.62800             37.77037

Although, we can see the mean temperature of females are somewhat lower than the male but on the significance level of 5%, our two sample test indicate, we must accept the Null hypothesis that there is no significant difference between the body temperature of female and male. The small difference in means is induced only by the random variations.

Wilcoxon Signed Rank Test

Wilcoxon signed rank test belongs to a group of non-parametric test for which we don’t necessarily require the assumption for the data stemming from any particular distribution (Normal distribution). Wilcoxon signed rank test is particularly useful when we have a very small sample size.

Following our previous example, we will again run the wilcoxon signed test on the temperature of the ICU patients to investigate whether the temperature increased. We run the experiment with and without omitting the 398th patient.

 wilcox.test(icudata$temperature, mu = 37.5, alternative = "greater", conf.int = T) #alternative is greater to check whether there is a significant increase in temperature beyond 37.5

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  icudata$temperature
## V = 69173, p-value = 0.0002349
## alternative hypothesis: true location is greater than 37.5
## 95 percent confidence interval:
##  37.59999      Inf
## sample estimates:
## (pseudo)median 
##       37.69991

This time, we exclude the 398th patient.

wilcox.test(icudata$temperature[-398], mu=37.5, alternative = "greater", conf.int = T)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  icudata$temperature[-398]
## V = 69173, p-value = 0.0001671
## alternative hypothesis: true location is greater than 37.5
## 95 percent confidence interval:
##  37.60003      Inf
## sample estimates:
## (pseudo)median 
##       37.69997

Still, we can assert there is a significant increase of body temperature, the same we have observed by conducting the one sample t-test. But, wilcoxon signed rank test is more robust than one sample t-test for rejecting the null hypothesis when 398th patient was included and excluded.

Wilcoxon-Mann-Whitney U test/Wilcoxon Rank Sum test

Now, we introduce another non-parametric test in this phase for testing two different and independent samples for which we don’t need to assume any prior distribution, aka normal.

Mann-Whitney U test can be considered an alternative approach for two sample t-test if our sample size is very small and we are unable to assume whether the sample data were procured from a distribution.

We will again compare the temperature of two distinct groups: female and male ICU patients as we did in the two sample t-test. We include the 398th patient with abnormal body temperature.

wilcox.test(temperature ~ sex, data = icudata, conf.int=T) #temperature will be grouped into sex

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  temperature by sex
## W = 26388, p-value = 0.1833
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -0.39997835  0.09992453
## sample estimates:
## difference in location 
##             -0.1000091

Without 398th patient:

wilcox.test(temperature ~ sex, data = icudata[-398,], conf.int=T)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  temperature by sex
## W = 26213, p-value = 0.1643
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -0.39995318  0.09997909
## sample estimates:
## difference in location 
##              -0.199997

The results are again the same as the two sample t-test as we have conducted earlier but Mann-Whitney U test is less disturbed by the inclusion of the 398th patient. In both cases, the confidence interval includes zero. We can’t reject the Null hypothesis. So, there is no significant differences in body temperature of female and male ICU patients.

Test of Variance

At this stage, we are no longer interested in knowing the difference between means rather we are more interested in investigating the difference in variances. We consider the ratio of two variances which leads to F-distribution and the test is called F-test. But, we assume normality for this F-test.

Another test of variance based on ranks, and thus not requiring the assumption of normality, is the Ansari- Bradley test.

As we have seen before, the variances of maximum body temperature for female and male ICU patients are different. But, is the difference in variances significant? Let’s do the F-test first with and without the patient 398.

var.test(temperature ~ sex, data = icudata)

## 
##  F test to compare two variances
## 
## data:  temperature by sex
## F = 0.41809, num df = 174, denom df = 324, p-value = 6.066e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3236546 0.5457195
## sample estimates:
## ratio of variances 
##          0.4180863

As we can see, the test suggest to reject the null hypothesis when we included the 398th patient. So, there is a significant difference in variances for maximum temperature of female and male.

Let’s see whether this holds true or not when we exclude the 398th patient with extreme temperature.

var.test(temperature ~ sex, data = icudata[-398,])

## 
##  F test to compare two variances
## 
## data:  temperature by sex
## F = 1.2547, num df = 174, denom df = 323, p-value = 0.08265
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9711595 1.6379524
## sample estimates:
## ratio of variances 
##           1.254713

From the result, it is evident that the previous suggestion to reject the null hypothesis does not hold true in this case when we exclude the 398th patient. So, we can say there is no significant difference in variances.

Next, we proceed on to perform the Ansari-Bradley test to investigate the difference in variances with and without the 398th patient.

ansari.test(temperature ~ sex, data = icudata)

## 
##  Ansari-Bradley test
## 
## data:  temperature by sex
## AB = 20904, p-value = 0.1688
## alternative hypothesis: true ratio of scales is not equal to 1

Without the 398th patient:

ansari.test(temperature ~ sex, data = icudata[-398,])

## 
##  Ansari-Bradley test
## 
## data:  temperature by sex
## AB = 20808, p-value = 0.1477
## alternative hypothesis: true ratio of scales is not equal to 1

In both cases, the ansari test suggest, we can’t reject the Null hypothesis, and there is no significant difference in variances of temperature between male and female patients. The Ansari-test is more robust against the extreme values such as the abnormal body temperature of 398th patient, and the influence of this patient is clearly smaller.

So, it is confirmed that we can assume equal variances.

One Sample Binomial Test and Proportionality Test

Suppose, there are 10 multiple choice questions in a Physics exam with 4 options to choose from for each of the question. Now, one student guessed all the answers randomly and got 8 correct! Do we suspect this student cheating in the exam, or is it just a random variation within a given confidence interval? Lets check this out:

Our Null hypothesis says, the probability of him getting each answer right is 0.25 meaning the student didn’t cheat. On the other hand, the alternate hypothesis says, the probability is greater than 0.25 meaning the student cheated in the exam.

\(H_0\): p=0.25 \(H_1\): p>0.25

We will do the binomial test to investigate whether the student cheated in the exam or not.

binom.test(x = 8, n = 10, p = 0.25, alternative = "greater") #8 correct answers guessed out of 10

## 
##  Exact binomial test
## 
## data:  8 and 10
## number of successes = 8, number of trials = 10, p-value = 0.0004158
## alternative hypothesis: true probability of success is greater than 0.25
## 95 percent confidence interval:
##  0.4930987 1.0000000
## sample estimates:
## probability of success 
##                    0.8

From the test result, there is evidence that suggests the student cheated on exam.

Using the same one sample binomial test, we will check whether the prevalence of liver failure for ICU patients is less than 0.05.

\(H_0\): p>=0.05 \(H_1\): p<0.05

sum(icudata$liver.failure==1)

## [1] 20

length(icudata$liver.failure)

## [1] 500

There are 20 patients with liver failure out of 500 patients.

binom.test(x = 20, n = 500, p = 0.05, alternative = "less")

## 
##  Exact binomial test
## 
## data:  20 and 500
## number of successes = 20, number of trials = 500, p-value = 0.1789
## alternative hypothesis: true probability of success is less than 0.05
## 95 percent confidence interval:
##  0.00000000 0.05759556
## sample estimates:
## probability of success 
##                   0.04

The test reveals, there are not enough evidence to reject the \(H_o\) and we can’t say the prevalence of liver failure is less than 0.05.

Let’s do the proportionality test for the same data.

prop.test(x = 20, n = 500, p = 0.05, alternative = "less")

## 
##  1-sample proportions test with continuity correction
## 
## data:  20 out of 500, null probability 0.05
## X-squared = 0.85263, df = 1, p-value = 0.1779
## alternative hypothesis: true p is less than 0.05
## 95 percent confidence interval:
##  0.00000000 0.05822552
## sample estimates:
##    p 
## 0.04

The proportionality test is also in accordance with the binomial test. Hence, we can’t reject the Null in this case. The prevalence of liver failure is not less than 0.05.

Fisher’s Exact test and Chi-Square test

Now, we proceed on to examine the prevalence of liver failure in female and male patients in the ICU data, and check whether the prevalence is equal or not for female and male.

For this purpose, we have to create a contingency table first.

contingency.table<-table(icudata$liver.failure, icudata$sex)
contingency.table

##    
##     female male
##   0    168  312
##   1      7   13

We will do the fisher’s exact test first to compare the prevalence of liver failure in female and male.

fisher.test(x = contingency.table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency.table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3625184 3.0194493
## sample estimates:
## odds ratio 
##          1

Chi-square test:

chisq.test(x = contingency.table)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency.table
## X-squared = 0, df = 1, p-value = 1

Both tests generate the same result. So, we can’t reject the null hypothesis and accept the premise that the prevalence of liver failure in female and male patients are the same. The minor difference in ratio arose from the random variation.

Correlation test

In this section, we will discuss and investigate the correlation between two variables/attributes. If we assume two normally distributed variables having a linear correlation then we can investigate this co-linearity by using Pearson’s correlation test.

And, if we are unable to assume normal distribution for the two variables and simply want to test the monotonic linear relationship between the variables then Spearman and Kendall offer better alternatives.

First, we investigate, whether the correlation between the maximum body temperature and the heart rate is significantly different from zero. The Null hypothesis says, there is no significant difference.

We will perform all the above mentioned correlation tests but we will remove the 398th patient for the Pearson correlation test.

#Pearson's correlation
cor.test(x = icudata$temperature[-398], y = icudata$heart.rate[-398])

## 
##  Pearson's product-moment correlation
## 
## data:  icudata$temperature[-398] and icudata$heart.rate[-398]
## t = 6.9546, df = 497, p-value = 1.118e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2156624 0.3757592
## sample estimates:
##       cor 
## 0.2978033

#Spearman's correlation
cor.test(x = icudata$temperature, y = icudata$heart.rate, method = "spearman")

## Warning in cor.test.default(x = icudata$temperature, y = icudata$heart.rate, :
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  icudata$temperature and icudata$heart.rate
## S = 15291695, p-value = 1.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2659957

#Kendall's correlation 
cor.test(x = icudata$temperature, y = icudata$heart.rate, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  icudata$temperature and icudata$heart.rate
## z = 6.0032, p-value = 1.935e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.1826903

All the three correlation tests suggest, there is a significant correlation between the maximum body temperature and heart rate of ICU patients. But, unfortunately there is a caveat. We can’t presumably measure, whether the correlation is significantly greater or lower than a prescribed value. We are only limited to investigating the significant correlation between variables with respect to zero.

ANOVA Test

Analysis of Variance(Anova) test is a very popular hypothesis testing method in which we have to test the equality of means between three or more independent groups. In the case of Anova test, we have to hold an assumption that the variances of different groups are almost similar. The other counterpart of ANOVA is the Welch one way ANOVA where we don’t have to assume equal variances for the different groups.

In the case of two sample t-test, it is called classical one way ANOVA. We will investigate, whether the mean maximum temperature for 4 categories in the outcome variable are equal. The outcome variable in the icudata has four levels or categories: died, home, other hospital, secondary care/rehab. The anova test will create four groups for each of these outcome categories and allocate the maximum temperature of each patient on different groups based on the status of their outcome. The test will be performed both including and excluding the 398th patient.

oneway.test(temperature~outcome, data = icudata, var.equal = T) #with patient 398 and

## 
##  One-way analysis of means
## 
## data:  temperature and outcome
## F = 1.9572, num df = 3, denom df = 496, p-value = 0.1195

#variance is assumed to be equal

The F-number is very low and the p-value is large which suggests we can’t reject the null hypothesis that there is no significant differences in the mean temperature for the four groups.

oneway.test(temperature~outcome, data = icudata, var.equal = F) #with patient 398 and

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  temperature and outcome
## F = 4.5435, num df = 3.00, denom df = 182.28, p-value = 0.004262

#variance is not assumed to be equal.

Assuming no equal variances and also including the 398th patient leads to rejecting the null hypothesis that there are significant differences in the mean temperature for the four groups.

For the next two experiments, we will exclude the 398th patient.

oneway.test(temperature~outcome, data = icudata[-398,], var.equal = T)

## 
##  One-way analysis of means
## 
## data:  temperature and outcome
## F = 5.2203, num df = 3, denom df = 495, p-value = 0.001481

This also leads to the rejection of the Null hypothesis.

oneway.test(temperature~outcome, data = icudata[-398,], var.equal = F)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  temperature and outcome
## F = 5.3574, num df = 3.00, denom df = 189.68, p-value = 0.001458

The final experiment also rejects the null hypothesis on behalf of the alternative that there exists significant differences in means between the four groups of patients with maximum body temperature.

Kruskal-Wallis test

The rank based counterpart of the ANOVA test is the Kruskal-Wallis test. We can also call it the non-parametric one way ANOVA. In non-parametric test, we don’t have to take in to account any prior distribution of the data among different groups. This experiment will also be performed with the inclusion and exclusion the 398th patient.

kruskal.test(temperature~outcome, data = icudata) #including 398th patient

## 
##  Kruskal-Wallis rank sum test
## 
## data:  temperature by outcome
## Kruskal-Wallis chi-squared = 15.259, df = 3, p-value = 0.001608

kruskal.test(temperature~outcome, data = icudata[-398,]) #excluding 398th patient

## 
##  Kruskal-Wallis rank sum test
## 
## data:  temperature by outcome
## Kruskal-Wallis chi-squared = 15.869, df = 3, p-value = 0.001206

In both the cases, the Kruskal-Wallis rank based test rejected the Null hypothesis suggesting significant differences in the means between the four groups of patients with maximum temperature.

Interestingly, Kruskal-Wallis test has not been disturbed by the presence of 398th patient for being a non-parametric rank based test.

In summary, we can conclude that there are significant differences among the groups. One question remains though! These tests does not shed any light on which groups and which way they are different than others and how large is the effect size.

Post Hoc Analysis

In the last section, we have surfaced some critical inquisitions. To address these issues, it is common practice to run post hoc test and generate plots to reveal the true differences between the groups. In the case of one-way ANOVA, there are several possible post hoc tests. The most natural choice is the pairwise t-tests. And, in the case of non-parametric rank based Kruskal-wallis test is preferable to use pairwise Wilcoxon-Mann-Whitney U tests.

As several tests are performed simultaneously, it is important to adjust the significance level or the p-values to control the type-I error. This is also known as the Multiple Testing. In R, the correction of Bonferroni-Holm is applied by default.

Here, we run the pairwise comparison between different outcome groups with respect to their maximum body temperature. In case of the pairwise t tests, we assume different variances(Welch t test) and exclude the 398th patient for accuracy of the tests.

pairwise.t.test(icudata$temperature[-398],icudata$outcome[-398],pool.sd = F) #pairwise Welch t tests

## 
##  Pairwise comparisons using t tests with non-pooled SD 
## 
## data:  icudata$temperature[-398] and icudata$outcome[-398] 
## 
##                      died   home   other hospital
## home                 0.0410 -      -             
## other hospital       1.0000 0.0459 -             
## secondary care/rehab 1.0000 0.0036 1.0000        
## 
## P value adjustment method: holm

pairwise.wilcox.test(icudata$temperature,icudata$outcome)

## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  icudata$temperature and icudata$outcome 
## 
##                      died   home   other hospital
## home                 0.0534 -      -             
## other hospital       1.0000 0.0534 -             
## secondary care/rehab 1.0000 0.0024 1.0000        
## 
## P value adjustment method: holm

The experimental tests reveals that the group home is different from all the other groups with a significant level of 5%.

We will now plot the temperature vs outcome data to illustrate the mean of home outcome being different from all the means of other outcomes.

icudata[-398,] %>% ggplot(aes(x=outcome,y=temperature,fill=outcome))+
  geom_boxplot()+
  stat_summary(fun = mean, color="darkred", geom = "point",shape=10, size=2 )+
  ggtitle("Maximum Body Temperature based on the Outcome")

It is clear from the plot that the maximum body temperature of the group home is smaller than the other groups in the experiment.

Conclusive Remarks

We have traversed through almost all the basic statistical methods for data analysis. Not only we have learned from introductory discussion for each of the methods applied in this project but also we have gained much greater insights with the elaboration on each method through practical programming sessions with real life practical data. The most frequently occurred probability distributions in real life are discussed with practical examples and illustrated using graphical methods. The hypothesis testing section was enumerated with different types tests to cover almost all the necessary decisions one has to make in dealing with myriads of complex situations.