Data Science is the new buzzword in every engineering and technology field. It is used for data extraction, data analysis, data visualization, and data modeling. Recently, data science has been applied to the biomedical field to gain knowledge about disease occurrence and diagnosis. This introductory book provides a few fundamentals of biostatistics and statistical modeling.
Requirements
The requirements for the readers of this book are the following:
At the center of any complex engineering application or analysis is the data/dataset. It is the foundation on which entire organizations and corporations run their business. There is a famous saying nowadays that data is the new oil. In this book, we are going to focus on analyzing biomedical data.
The biomedical data we are going to use is obtained from the Kaggle website. The dataset contains diabetes information for a cohort of subjects and originates from a research study by the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. All patients are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictors/features and one target/response variable named Outcome. We will first import the dataset into the workspace. Before that, we have to set up RStudio for carrying out the analysis. First, set the working directory using setwd(). Second, load the required libraries in the R Markdown file, as shown here.
#setwd("C:/Users/RajuPC/Documents/MyR/RajuBook")
library(tidyverse) # Required for data manipulation and visualization
library(plotly) # Required for interactive plotting of graphs
library(ggsci) # Colour-palette themes for the plots
diab<-read_csv("diabetes.csv") # Read the CSV file and assign it to the diab object
The diab object is an R data frame (a tibble), and we can view the data in many ways, as follows:
head(diab) # Elements of first few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## # … with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## # Outcome <dbl>
tail(diab) # Elements of last few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 89 62 0 0 22.5
## 2 10 101 76 48 180 32.9
## 3 2 122 70 27 0 36.8
## 4 5 121 72 23 112 26.2
## 5 1 126 60 0 0 30.1
## 6 1 93 70 31 0 30.4
## # … with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## # Outcome <dbl>
colnames(diab) # Column names, i.e., the names of the predictor and outcome variables
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
str(diab) # Structure of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : num 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : num 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : num 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : num 1 0 1 0 1 0 1 0 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Pregnancies = col_double(),
## .. Glucose = col_double(),
## .. BloodPressure = col_double(),
## .. SkinThickness = col_double(),
## .. Insulin = col_double(),
## .. BMI = col_double(),
## .. DiabetesPedigreeFunction = col_double(),
## .. Age = col_double(),
## .. Outcome = col_double()
## .. )
Here, using the summary() function, one can easily obtain the descriptive statistics of the imported data frame.
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
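Note that several predictors (Glucose, BloodPressure, SkinThickness, Insulin, BMI) have a minimum of 0, which is not physiologically plausible and most likely represents missing measurements coded as zero. As an optional side check (a minimal sketch, not part of the modeling workflow), the number of zero entries per predictor can be counted:
colSums(diab[, 1:8] == 0) # Number of zero entries in each predictor column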
Factoring the variables
Factors are a special data type in R used for categorical variables. Here, however, the Outcome variable is stored as a number. It is better to represent categorical variables as factors in R, which can be done as follows:
diab$Outcome<-factor(diab$Outcome)
class(diab$Outcome)
## [1] "factor"
Mere numbers and tables do not always provide the information required for decision making. It is necessary to extract visual patterns and trends from the data to support any decision-making process. This is where data visualization helps in gaining a visual perspective of the data. Basic plots such as the scatter plot, bar chart, box plot, and histogram are great tools for providing intuitive information about the data.
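For example, a simple histogram (a minimal sketch using ggplot2, which is already loaded through tidyverse; the number of bins is an arbitrary choice) shows the distribution of the glucose measurements:
ggplot(diab, aes(x = Glucose)) + geom_histogram(bins = 30) # Distribution of plasma glucose values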
First, we try to examine the relationship between the occurrence of diabetes and the age of the subjects, together with the number of pregnancies.
Note: The diabetic outcome is given as binary, where “0” refers to a normal (non-diabetic) subject and “1” refers to a diabetic subject.
p1<-ggplot(diab,aes(x=Age,y=Pregnancies,col=Outcome))+geom_point()+geom_smooth(method="loess", se=T)+facet_grid(.~Outcome)
ggplotly(p1)
The above plot also shows the trend modeled through the LOESS method for the data provided.
The next plot shows the distribution of pregnancies for subjects with and without the diabetes outcome.
p2<-ggplot(diab,aes(x=Age,y=Pregnancies))+geom_boxplot(aes(fill=Outcome))+facet_wrap(Outcome~.)
ggplotly(p2)
Through a density plot, we can examine the distribution of a univariate variable; in our case, the number of pregnancies of the subjects.
p3<-ggplot(diab,aes(x=Pregnancies))+geom_density(aes(fill=Outcome),alpha=0.6)+
geom_vline(aes(xintercept=mean(Pregnancies)),
color="blue", linetype="dashed", size=1)+facet_grid(.~Outcome)+scale_fill_aaas()
ggplotly(p3)
p3<-ggplot(diab,aes(x=Age, y=Pregnancies, size=Glucose, fill=BloodPressure))+geom_point(alpha=0.2)+
facet_grid(.~Outcome)+geom_jitter(width = 0.4)+scale_x_continuous(limits = c(18, 80))+scale_fill_material("red")
ggplotly(p3)
Statistical learning is, in general, also known as machine learning (ML). Here we try to develop ML models to predict the occurrence of the diabetic outcome.
The data consists of different features that need to be mapped to a common reference frame (a common scale). This is done in the data preprocessing step.
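As a rough illustration of such a mapping (assuming simple min-max scaling, which is what the range method used below performs), a single feature can be rescaled to [0, 1] as follows:
g <- diab$Glucose
g_scaled <- (g - min(g)) / (max(g) - min(g)) # Min-max scaling of one feature
range(g_scaled) # The rescaled values now lie within [0, 1]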
library(caret) # ML package for various methods
# Create the training and test datasets
set.seed(100)
hci<-diab
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(hci$Outcome, p=0.8, list=FALSE) # Data partition for dividing the dataset into training and testing data set. This is useful for cross validation
# Step 2: Create the training dataset
trainData <- hci[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- hci[-trainRowNumbers,]
# Store X and Y for later use.
x <- trainData[, 1:8]
y <- trainData$Outcome
xt <- testData[, 1:8]
yt <- testData$Outcome
# See the structure of the new dataset
The features are normalized to the range [0, 1] using the preProcess() function with the range method.
preProcess_range_modeltr <- preProcess(trainData, method='range')
preProcess_range_modelts <- preProcess(testData, method='range')
trainData <- predict(preProcess_range_modeltr, newdata = trainData)
testData <- predict(preProcess_range_modelts, newdata = testData)
# Append the Y variable
trainData$Outcome <- y
testData$Outcome<-yt
levels(trainData$Outcome) <- c("Class0", "Class1") # Relabel factor levels as valid R names, required by caret when classProbs = TRUE
levels(testData$Outcome) <- c("Class0", "Class1")
#apply(trainData[, 1:8], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})
#str(trainData)
#fit control
fitControl <- trainControl(
method = 'cv', # k-fold cross validation
number = 5, # number of folds
savePredictions = 'final', # saves predictions for optimal tuning parameter
classProbs = T, # should class probabilities be returned
summaryFunction=twoClassSummary # results summary function
)
The ML models we have chosen are LDA, KNN, SVM, a CART decision tree (trained with rpart and labeled RF in the comparison below), and AdaBoost. The caret package provides a uniform programming interface for all the machine learning models defined in its library.
# Step 1: Tune hyper parameters by setting tuneLength
set.seed(100)
model1 = train(Outcome ~ ., data=trainData, method='lda', tuneLength = 5, metric='ROC', trControl = fitControl)
model2 = train(Outcome ~ ., data=trainData, method='knn', tuneLength=2, trControl = fitControl)#KNN Model
model3 = train(Outcome ~ ., data=trainData, method='svmRadial', tuneLength=2, trControl = fitControl)#SVM
model4 = train(Outcome ~ ., data=trainData, method='rpart', tuneLength=2, trControl = fitControl)#CART decision tree (for a true random forest, use method='rf')
model5 = train(Outcome ~ ., data=trainData, method='adaboost', tuneLength=2, trControl = fitControl) # Adaboost
# Compare model performances using resamples()
models_compare <- resamples(list(LDA=model1,KNN=model2,SVM=model3,RF=model4, ADA=model5))
# Summary of the models performances
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: LDA, KNN, SVM, RF, ADA
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.7886628 0.8267442 0.8284884 0.8309884 0.8441860 0.8668605 0
## KNN 0.7204942 0.7507267 0.7848837 0.7715116 0.7976744 0.8037791 0
## SVM 0.8305233 0.8328488 0.8372093 0.8363372 0.8377907 0.8433140 0
## RF 0.6478198 0.6702035 0.6879360 0.6915698 0.6879360 0.7639535 0
## ADA 0.7311047 0.7436047 0.7938953 0.7763372 0.8029070 0.8101744 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.8125 0.8750 0.8750 0.8725 0.8875 0.9125 0
## KNN 0.8000 0.8000 0.8125 0.8300 0.8375 0.9000 0
## SVM 0.8375 0.8375 0.8875 0.8725 0.8875 0.9125 0
## RF 0.7125 0.7375 0.8625 0.8175 0.8875 0.8875 0
## ADA 0.7750 0.7750 0.7750 0.7950 0.8125 0.8375 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.5581395 0.5813953 0.5813953 0.5860465 0.6046512 0.6046512 0
## KNN 0.3720930 0.3953488 0.5348837 0.5069767 0.5813953 0.6511628 0
## SVM 0.5348837 0.5348837 0.6046512 0.6046512 0.6511628 0.6976744 0
## RF 0.4883721 0.4883721 0.5581395 0.5534884 0.6046512 0.6279070 0
## ADA 0.5348837 0.5581395 0.6046512 0.5953488 0.6279070 0.6511628 0
# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)
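An alternative view of the same resampling results (a sketch; caret also provides a dotplot method for resamples objects) is:
dotplot(models_compare, metric = "ROC") # Mean ROC with confidence intervals for each model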
# Step 2: Predict on testData and Compute the confusion matrix
# Using LDA Model
predicted <- predict(model1, testData[,1:8])
confusionMatrix(reference = testData$Outcome, data = predicted, mode='everything')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Class0 Class1
## Class0 88 29
## Class1 12 24
##
## Accuracy : 0.732
## 95% CI : (0.6545, 0.8003)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 0.02367
##
## Kappa : 0.36
##
## Mcnemar's Test P-Value : 0.01246
##
## Sensitivity : 0.8800
## Specificity : 0.4528
## Pos Pred Value : 0.7521
## Neg Pred Value : 0.6667
## Precision : 0.7521
## Recall : 0.8800
## F1 : 0.8111
## Prevalence : 0.6536
## Detection Rate : 0.5752
## Detection Prevalence : 0.7647
## Balanced Accuracy : 0.6664
##
## 'Positive' Class : Class0
##
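The same two-step procedure can be repeated for any of the other trained models. For example, a sketch (output not shown) for the SVM model, which had the highest mean ROC in the resampling comparison:
# Using SVM Model
predicted_svm <- predict(model3, testData[,1:8])
confusionMatrix(reference = testData$Outcome, data = predicted_svm, mode='everything')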