Data Science is the new buzzword in every engineering and technology field. It is used for data extraction, data analysis, data visualization, and data modeling. Recently, data science has been applied to the biomedical field to gain knowledge about disease occurrence and diagnosis. This introductory book provides a few fundamentals of biostatistics and statistical modeling.
Requirements
The requirements for the readers of this book are the following:
At the center of any complex engineering application or analysis is the data/dataset. It is the foundation on which entire organizations and corporations run their business. There is a famous saying nowadays that data is the new oil. In this book, we are going to focus on analyzing biomedical data.
The biomedical data we are going to use is obtained from the Kaggle website. The dataset contains diabetes information for a cohort of subjects and originates from a research study by the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. All patients are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictors/features and one target/response variable named Outcome. We will first import the dataset into the workspace. Before that, we have to set up RStudio for carrying out the analysis. First, set the working directory using setwd(). Second, load the required libraries in the R Markdown file, as shown here.
#setwd("C:/Users/RajuPC/Documents/MyR/RajuBook")
library(tidyverse) # Required for data manipulation and visualization
library(plotly) # Required for interactive plotting of graphs
library(ggsci) # Colour-palette themes for the plots
diab<-read_csv("diabetes.csv") # Read the CSV file and assign it to the diab object
The diab object is an R data frame (a tibble), and we can view the data in many ways, as follows:
head(diab) # Elements of first few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## # … with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## # Outcome <dbl>
tail(diab) # Elements of last few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 89 62 0 0 22.5
## 2 10 101 76 48 180 32.9
## 3 2 122 70 27 0 36.8
## 4 5 121 72 23 112 26.2
## 5 1 126 60 0 0 30.1
## 6 1 93 70 31 0 30.4
## # … with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## # Outcome <dbl>
colnames(diab) # Column names, i.e., the names of the predictor and outcome variables
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
str(diab) # Structure of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : num 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : num 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : num 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : num 1 0 1 0 1 0 1 0 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Pregnancies = col_double(),
## .. Glucose = col_double(),
## .. BloodPressure = col_double(),
## .. SkinThickness = col_double(),
## .. Insulin = col_double(),
## .. BMI = col_double(),
## .. DiabetesPedigreeFunction = col_double(),
## .. Age = col_double(),
## .. Outcome = col_double()
## .. )
Here, using the summary() function, one can easily obtain the descriptive statistics of the imported data frame.
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
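Note that several predictors (Glucose, BloodPressure, SkinThickness, Insulin, BMI) have a minimum of 0, which is not physiologically plausible and most likely represents missing measurements coded as zero. As an optional side check (a minimal sketch, not part of the modeling workflow), the number of zero entries per predictor can be counted:
colSums(diab[, 1:8] == 0) # Number of zero entries in each predictor column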
Factoring the variables
Factors are a special data type in R used for categorical variables. Here, however, the Outcome variable is stored as a number. It is better to represent categorical variables as factors in R, which can be done as follows:
diab$Outcome<-factor(diab$Outcome)
class(diab$Outcome)
## [1] "factor"
Mere numbers and tables do not always provide the information required for decision making. It is necessary to extract visual patterns and trends from the data to support any decision-making process. This is where data visualization helps in gaining a visual perspective of the data. Basic plots such as the scatter plot, bar chart, box plot, and histogram are great tools for providing intuitive information about the data.
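For example, a simple histogram (a minimal sketch using ggplot2, which is already loaded through tidyverse; the number of bins is an arbitrary choice) shows the distribution of the glucose measurements:
ggplot(diab, aes(x = Glucose)) + geom_histogram(bins = 30) # Distribution of plasma glucose values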
First, we try to examine the relationship between the occurrence of diabetes and the age of the subjects, together with the number of pregnancies.
Note: The diabetic outcome is given as binary, where “0” refers to a normal (non-diabetic) subject and “1” refers to a diabetic subject.
p1<-ggplot(diab,aes(x=Age,y=Pregnancies,col=Outcome))+geom_point()+geom_smooth(method="loess", se=T)+facet_grid(.~Outcome)
ggplotly(p1)
The above plot also shows the trend modeled through the LOESS method for the data provided.
The next plot shows the distribution of pregnancies for subjects with and without the diabetes outcome.
p2<-ggplot(diab,aes(x=Age,y=Pregnancies))+geom_boxplot(aes(fill=Outcome))+facet_wrap(Outcome~.)
ggplotly(p2)
Through a density plot, we can examine the distribution of a univariate variable; in our case, the number of pregnancies of the subjects.
p3<-ggplot(diab,aes(x=Pregnancies))+geom_density(aes(fill=Outcome),alpha=0.6)+
geom_vline(aes(xintercept=mean(Pregnancies)),
color="blue", linetype="dashed", size=1)+facet_grid(.~Outcome)+scale_fill_aaas()
ggplotly(p3)
p3<-ggplot(diab,aes(x=Age, y=Pregnancies, size=Glucose, fill=BloodPressure))+geom_point(alpha=0.2)+
facet_grid(.~Outcome)+geom_jitter(width = 0.4)+scale_x_continuous(limits = c(18, 80))+scale_fill_material("red")
ggplotly(p3)
Statistical learning is, in general, also known as machine learning (ML). Here we try to develop ML models to predict the occurrence of the diabetic outcome.
The data consists of different features that need to be mapped to a common reference frame (a common scale). This is done in the data preprocessing step.
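As a rough illustration of such a mapping (assuming simple min-max scaling, which is what the range method used below performs), a single feature can be rescaled to [0, 1] as follows:
g <- diab$Glucose
g_scaled <- (g - min(g)) / (max(g) - min(g)) # Min-max scaling of one feature
range(g_scaled) # The rescaled values now lie within [0, 1]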
library(caret) # ML package for various methods
# Create the training and test datasets
set.seed(100)
hci<-diab
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(hci$Outcome, p=0.8, list=FALSE) # Data partition for dividing the dataset into training and testing data set. This is useful for cross validation
# Step 2: Create the training dataset
trainData <- hci[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- hci[-trainRowNumbers,]
# Store X and Y for later use.
x <- trainData[, 1:8]
y <- trainData$Outcome
xt <- testData[, 1:8]
yt <- testData$Outcome
# See the structure of the new dataset
The features are normalized to the range [0, 1] using the preProcess() function with the range method.
preProcess_range_modeltr <- preProcess(trainData, method='range')
preProcess_range_modelts <- preProcess(testData, method='range')
trainData <- predict(preProcess_range_modeltr, newdata = trainData)
testData <- predict(preProcess_range_modelts, newdata = testData)
# Append the Y variable
trainData$Outcome <- y
testData$Outcome<-yt
levels(trainData$Outcome) <- c("Class0", "Class1") # Relabel factor levels as valid R names, required by caret when classProbs = TRUE
levels(testData$Outcome) <- c("Class0", "Class1")
#apply(trainData[, 1:8], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})
#str(trainData)
#fit control
fitControl <- trainControl(
method = 'cv', # k-fold cross validation
number = 5, # number of folds
savePredictions = 'final', # saves predictions for optimal tuning parameter
classProbs = T, # should class probabilities be returned
summaryFunction=twoClassSummary # results summary function
)
The ML models we have chosen are LDA, KNN, SVM, a CART decision tree (trained with rpart and labeled RF in the comparison below), and AdaBoost. The caret package provides a uniform programming interface for all the machine learning models defined in its library.
# Step 1: Tune hyper parameters by setting tuneLength
set.seed(100)
model1 = train(Outcome ~ ., data=trainData, method='lda', tuneLength = 5, metric='ROC', trControl = fitControl)
model2 = train(Outcome ~ ., data=trainData, method='knn', tuneLength=2, trControl = fitControl)#KNN Model
model3 = train(Outcome ~ ., data=trainData, method='svmRadial', tuneLength=2, trControl = fitControl)#SVM
model4 = train(Outcome ~ ., data=trainData, method='rpart', tuneLength=2, trControl = fitControl)#CART decision tree (for a true random forest, use method='rf')
model5 = train(Outcome ~ ., data=trainData, method='adaboost', tuneLength=2, trControl = fitControl) # Adaboost
# Compare model performances using resamples()
models_compare <- resamples(list(LDA=model1,KNN=model2,SVM=model3,RF=model4, ADA=model5))
# Summary of the models performances
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: LDA, KNN, SVM, RF, ADA
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.7886628 0.8267442 0.8284884 0.8309884 0.8441860 0.8668605 0
## KNN 0.7204942 0.7507267 0.7848837 0.7715116 0.7976744 0.8037791 0
## SVM 0.8305233 0.8328488 0.8372093 0.8363372 0.8377907 0.8433140 0
## RF 0.6478198 0.6702035 0.6879360 0.6915698 0.6879360 0.7639535 0
## ADA 0.7311047 0.7436047 0.7938953 0.7763372 0.8029070 0.8101744 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.8125 0.8750 0.8750 0.8725 0.8875 0.9125 0
## KNN 0.8000 0.8000 0.8125 0.8300 0.8375 0.9000 0
## SVM 0.8375 0.8375 0.8875 0.8725 0.8875 0.9125 0
## RF 0.7125 0.7375 0.8625 0.8175 0.8875 0.8875 0
## ADA 0.7750 0.7750 0.7750 0.7950 0.8125 0.8375 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.5581395 0.5813953 0.5813953 0.5860465 0.6046512 0.6046512 0
## KNN 0.3720930 0.3953488 0.5348837 0.5069767 0.5813953 0.6511628 0
## SVM 0.5348837 0.5348837 0.6046512 0.6046512 0.6511628 0.6976744 0
## RF 0.4883721 0.4883721 0.5581395 0.5534884 0.6046512 0.6279070 0
## ADA 0.5348837 0.5581395 0.6046512 0.5953488 0.6279070 0.6511628 0
# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)
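An alternative view of the same resampling results (a sketch; caret also provides a dotplot method for resamples objects) is:
dotplot(models_compare, metric = "ROC") # Mean ROC with confidence intervals for each model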
# Step 2: Predict on testData and Compute the confusion matrix
# Using LDA Model
predicted <- predict(model1, testData[,1:8])
confusionMatrix(reference = testData$Outcome, data = predicted, mode='everything')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Class0 Class1
## Class0 88 29
## Class1 12 24
##
## Accuracy : 0.732
## 95% CI : (0.6545, 0.8003)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 0.02367
##
## Kappa : 0.36
##
## Mcnemar's Test P-Value : 0.01246
##
## Sensitivity : 0.8800
## Specificity : 0.4528
## Pos Pred Value : 0.7521
## Neg Pred Value : 0.6667
## Precision : 0.7521
## Recall : 0.8800
## F1 : 0.8111
## Prevalence : 0.6536
## Detection Rate : 0.5752
## Detection Prevalence : 0.7647
## Balanced Accuracy : 0.6664
##
## 'Positive' Class : Class0
##
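The same two-step procedure can be repeated for any of the other trained models. For example, a sketch (output not shown) for the SVM model, which had the highest mean ROC in the resampling comparison:
# Using SVM Model
predicted_svm <- predict(model3, testData[,1:8])
confusionMatrix(reference = testData$Outcome, data = predicted_svm, mode='everything')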