Introduction

Exploratory Data Analysis (EDA) is an approach to analysing data sets that summarizes their main characteristics, often with visual methods. It encourages data scientists to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values and transforming variables as needed. EDA encompasses IDA.

Objective

This vignette walks through an Exploratory Data Analysis of the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. The source data (URL: https://www.kaggle.com/uciml/pima-indians-diabetes-database) allow us to predict whether a patient has diabetes on the basis of the diagnostic measures available in the dataset. The steps involved in EDA include: 1. Data Collection, 2. Data Cleaning and 3. Data Visualisation.

Several R packages support data import, manipulation and visualisation; the ones used in this vignette are loaded below.

Load Packages

All packages are installed and loaded in R as follows.

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(readr)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(dplyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(readxl)
library(corrplot)
## corrplot 0.84 loaded
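
The library() calls above assume that every package is already installed. As a minimal sketch, the availability of each package can be checked first with requireNamespace(); the vector name pkgs and the commented-out install step are illustrative and not part of the original workflow.

```r
# Packages attached in this vignette
pkgs <- c("tidyverse", "scales", "reshape2", "readxl", "corrplot")

# requireNamespace() returns FALSE (rather than raising an error) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```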

Load Data

# loading data (.csv)
data <- read_csv("C:/Users/Sanaa/Desktop/statistics data/diabetes.csv")
## Parsed with column specification:
## cols(
##   Pregnancies = col_double(),
##   Glucose = col_double(),
##   BloodPressure = col_double(),
##   SkinThickness = col_double(),
##   Insulin = col_double(),
##   BMI = col_double(),
##   DiabetesPedigreeFunction = col_double(),
##   Age = col_double(),
##   Outcome = col_double()
## )
diabetes<- as_tibble(data) 
diabetes #display data as tibble
## # A tibble: 768 x 9
##    Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##          <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
##  1           6     148            72            35       0  33.6
##  2           1      85            66            29       0  26.6
##  3           8     183            64             0       0  23.3
##  4           1      89            66            23      94  28.1
##  5           0     137            40            35     168  43.1
##  6           5     116            74             0       0  25.6
##  7           3      78            50            32      88  31  
##  8          10     115             0             0       0  35.3
##  9           2     197            70            45     543  30.5
## 10           8     125            96             0       0   0  
## # ... with 758 more rows, and 3 more variables:
## #   DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>

Data Understanding

Summary

Getting to know the data better involves understanding details such as how big the data set is and how many observations and variables exist. The summary below provides statistical information for every variable in the data set.

summary(diabetes) # numerical summary of each variable in the data (diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
head(diabetes)
## # A tibble: 6 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
## 1           6     148            72            35       0  33.6
## 2           1      85            66            29       0  26.6
## 3           8     183            64             0       0  23.3
## 4           1      89            66            23      94  28.1
## 5           0     137            40            35     168  43.1
## 6           5     116            74             0       0  25.6
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## #   Outcome <dbl>
dim(diabetes) # to find the dimensions of the dataset(Diabetes)
## [1] 768   9
names(diabetes) # Name of each variable in the dataset
## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"
str(diabetes)
## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : num  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : num  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : num  1 0 1 0 1 0 1 0 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Pregnancies = col_double(),
##   ..   Glucose = col_double(),
##   ..   BloodPressure = col_double(),
##   ..   SkinThickness = col_double(),
##   ..   Insulin = col_double(),
##   ..   BMI = col_double(),
##   ..   DiabetesPedigreeFunction = col_double(),
##   ..   Age = col_double(),
##   ..   Outcome = col_double()
##   .. )
sapply(diabetes, typeof) # individual feature's datatype in the dataset
##              Pregnancies                  Glucose            BloodPressure 
##                 "double"                 "double"                 "double" 
##            SkinThickness                  Insulin                      BMI 
##                 "double"                 "double"                 "double" 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                 "double"                 "double"                 "double"

Since Outcome is the dependent or target variable that indicates whether an individual has diabetes or not, it is important to check whether the dataset is reasonably balanced.

# value '0' is assigned to "Normal" people and '1' is assigned to people with "diabetes"
table(diabetes$Outcome) 
## 
##   0   1 
## 500 268
# bar chart displaying the target variable "Outcome":
# 268 of 768 people have diabetes and 500 are normal
g <- ggplot(diabetes, aes(Outcome)) 
g + geom_bar(aes(group=Outcome, color=Outcome)) + theme(legend.position = "none")

Plotting the distribution of class values for the target variable shows that there are 500 Normal and 268 Diabetic instances.
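
To judge balance, shares are often easier to read than raw counts. The sketch below rebuilds the two class counts reported by table(diabetes$Outcome) and converts them to percentages; the factor labels "Normal" and "Diabetes" are illustrative names chosen here.

```r
# Rebuild the class counts reported above: 500 zeros and 268 ones
outcome <- rep(c(0, 1), times = c(500, 268))

# Attach readable labels to the two classes (label text is illustrative)
outcome_f <- factor(outcome, levels = c(0, 1), labels = c("Normal", "Diabetes"))

counts <- table(outcome_f)
shares <- round(100 * prop.table(counts), 1)  # percentage of each class

print(counts)
print(shares)
```

Roughly 65% of the records are Normal and 35% are Diabetic, which agrees with the Outcome mean of 0.349 in the summary.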

Bivariate comparisons of predictor variables with the target variable (Outcome)

Pregnancies and Outcome

# ggplot (Pregnancies and Outcome)
g <- ggplot(diabetes, aes(Pregnancies))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

In the facets, ‘0’ is the group without diabetes and ‘1’ is the group with diabetes. The plot shows higher counts across the pregnancy values in the non-diabetic group and lower counts in the diabetic group. Part of this difference reflects group size (500 versus 268 people), and the dataset is too small to draw any firm conclusion.
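
Because the two outcome groups differ in size, raw counts are hard to compare across facets. One option, sketched here on a small made-up data frame (not the Pima values), is to convert the counts to within-group proportions with prop.table().

```r
# Toy data, made up for illustration
toy <- data.frame(
  Pregnancies = c(0, 1, 1, 2, 2, 2, 0, 1, 3, 3),
  Outcome     = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Cross-tabulate, then divide each row (one row per Outcome) by its row total
counts <- table(Outcome = toy$Outcome, Pregnancies = toy$Pregnancies)
props  <- prop.table(counts, margin = 1)

print(round(props, 2))
```

Each row of props sums to 1, so the two groups can be compared on the same scale regardless of how many people each contains.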

Glucose and Outcome

# ggplot (Glucose and Outcome)
g <- ggplot(diabetes, aes(Glucose))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

The count of people with diabetes rises when glucose levels are in the 55 to 170 range, whereas the second graph shows that the count of people without diabetes falls while glucose levels are still between 75 and 200.

BloodPressure and Outcome

# ggplot (BloodPressure and Outcome)
g <- ggplot(diabetes, aes(BloodPressure))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

# '0' shows the BloodPressure counts for people without diabetes
# '1' shows the BloodPressure counts for people with diabetes

The graph above shows that the count of people with diabetes rises when BP levels are in the 42 to 110 range, whereas the second graph shows that the count of people without diabetes falls while BP levels are still between 45 and 110.

SkinThickness and Outcome

# ggplot (SkinThickness and Outcome)
g <- ggplot(diabetes, aes(SkinThickness))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

Insulin and Outcome

2-hour serum insulin (mu U/ml) levels were recorded.

# ggplot (Insulin and Outcome)
g <- ggplot(diabetes, aes(Insulin))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

BMI and Outcome

BMI is the Body mass index (weight in kg/(height in m)^2).

# ggplot (BMI and Outcome)
g <- ggplot(diabetes, aes(BMI))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

Diabetes Pedigree Function and Outcome

Diabetes Pedigree Function provides data on diabetes mellitus running in the family and the genetic relationship of those relatives to the patient. This measures hereditary risk one might have with the onset of diabetes mellitus.

# ggplot (DiabetesPedigreeFunction and Outcome)
g <- ggplot(diabetes, aes(DiabetesPedigreeFunction))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

From the available data it is unclear how well this function predicts the onset of diabetes.
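
geom_bar() draws one bar per distinct value, which for a continuous measure such as DiabetesPedigreeFunction yields many thin bars. A binned view with geom_histogram() is one alternative. The sketch below uses simulated values, not the real data; the gamma distributions and the choice of 30 bins are assumptions made only for illustration.

```r
library(ggplot2)

# Simulated stand-in for a continuous predictor (not the real data)
set.seed(1)
toy <- data.frame(
  DiabetesPedigreeFunction = c(rgamma(100, shape = 2, rate = 5),
                               rgamma(60,  shape = 2, rate = 4)),
  Outcome = rep(c(0, 1), times = c(100, 60))
)

# A histogram groups the values into bins instead of counting each distinct value
p <- ggplot(toy, aes(DiabetesPedigreeFunction)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Outcome)

print(p)
```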

Age and Outcome

The population is generally young females: ages range from 21 to 81 years, but the median age is 29 and three quarters of the individuals are 41 or younger.

# ggplot (Age and Outcome)
g <- ggplot(diabetes, aes(Age))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")

Correlations

A correlation plot visually represents the relationships between all the variables of the “diabetes” dataset. Every variable is compared with every other, including the target “Outcome”, to find any correlation that may exist.

# the corrplot package was loaded at the beginning
# determine whether any correlation exists between the variables when plotted against each other
num_vars <- unlist(lapply(diabetes, is.numeric))  
dia_nums <- diabetes[ , num_vars]

dia_corr <- cor(dia_nums)
corrplot(dia_corr, method="number")

A moderate correlation (0.54) is observed between Pregnancies and Age. The intensity of the colour indicates how weak or strong the correlation between two variables is.
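
Reading the strongest pairs off a correlation plot by eye becomes harder as the number of variables grows. As a minimal sketch on made-up data, the correlation matrix can be flattened into a table of variable pairs sorted by absolute correlation; the names toy and ranked are illustrative.

```r
# Toy numeric data, made up for illustration
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 5, 4, 5, 7),
  c = c(6, 5, 4, 3, 2, 1)
)

corr <- cor(toy)

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

# Strongest relationships first, regardless of sign
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```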

Missing Values

The output of “summary(diabetes)” shows a minimum of zero for BloodPressure, Glucose, SkinThickness, BMI, and Insulin. These measurements can never be zero, so the zeros are treated as missing values.
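
Before imputing, it helps to know how many zeros each affected column contains. The sketch below counts them on a small made-up data frame and then recodes them as NA; the vector name zero_as_na is illustrative, and recoding to NA is shown only as one possible way to mark the values as missing.

```r
# Toy data with zeros standing in for missing measurements (values made up)
toy <- data.frame(
  Glucose       = c(148, 85, 0, 89),
  BloodPressure = c(72, 0, 64, 0),
  Pregnancies   = c(6, 1, 0, 1)   # zero is a valid value in this column
)

# Columns in which a zero is treated as missing
zero_as_na <- c("Glucose", "BloodPressure")

# Number of zeros per affected column
zero_counts <- colSums(toy[zero_as_na] == 0)
print(zero_counts)

# Recode those zeros as NA so that functions such as mean(x, na.rm = TRUE) skip them
toy[zero_as_na] <- lapply(toy[zero_as_na], function(x) replace(x, x == 0, NA))
print(colSums(is.na(toy)))
```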

To impute the missing values, the following steps are taken.

BloodPressure

Diastolic blood pressure (mm Hg) is recorded in this data set. A value lower than 80 mm Hg is considered ‘Normal’, but it can never be zero, so the zero entries are treated as missing values and replaced by the mean of the non-zero values.

# missing values
mean_bp <- mean(diabetes$BloodPressure[diabetes$BloodPressure > 0])
diabetes$BloodPressure <- ifelse(diabetes$BloodPressure == 0, round(mean_bp,0), diabetes$BloodPressure)

Glucose

Plasma glucose concentration (2 hours in an oral glucose tolerance test) was recorded; a value below 140 mg/dL is considered normal.

# Glucose
mean_Glu <- mean(diabetes$Glucose[diabetes$Glucose > 0])
diabetes$Glucose <- ifelse(diabetes$Glucose == 0, round(mean_Glu,0), diabetes$Glucose)

SkinThickness

The value of Triceps skin fold thickness (mm) was measured.

# SkinThickness
mean_SkT <- mean(diabetes$SkinThickness[diabetes$SkinThickness > 0])
diabetes$SkinThickness <- ifelse(diabetes$SkinThickness == 0, round(mean_SkT,0), diabetes$SkinThickness)

Insulin

# Insulin
mean_Insulin <- mean(diabetes$Insulin[diabetes$Insulin > 0])
diabetes$Insulin <- ifelse(diabetes$Insulin == 0, round(mean_Insulin,0), diabetes$Insulin)

BMI

# BMI
mean_BMI <- mean(diabetes$BMI[diabetes$BMI > 0])
diabetes$BMI <- ifelse(diabetes$BMI == 0, round(mean_BMI,0), diabetes$BMI)
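
The five imputation steps repeat the same rule with a different column name each time, so a slip in one name is easy to miss. As a minimal sketch, the rule can be wrapped in a single function and applied to several columns at once. The function name impute_zero_with_mean and the toy data are hypothetical.

```r
# Replace zeros in a numeric vector with the rounded mean of its non-zero values
impute_zero_with_mean <- function(x, digits = 0) {
  fill <- round(mean(x[x > 0]), digits)
  ifelse(x == 0, fill, x)
}

# Toy data, made up for illustration
toy <- data.frame(
  SkinThickness = c(35, 29, 0, 23),
  Insulin       = c(0, 0, 94, 168)
)

# Apply the same rule to each listed column
cols <- c("SkinThickness", "Insulin")
toy[cols] <- lapply(toy[cols], impute_zero_with_mean)
print(toy)
```

Each column is named only once, in cols, so its replacement value is always computed from that same column.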

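Summary

The SkinThickness column in the output below (median 100, maximum 199) carries Glucose-scale values, which is what results when the SkinThickness imputation step takes its replacement values from diabetes$Glucose. A skinfold thickness imputed from its own column cannot exceed the original maximum of 99, so these SkinThickness figures should not be read as skinfold measurements.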

summary(diabetes)
##   Pregnancies        Glucose       BloodPressure    SkinThickness   
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 21.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.: 21.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :100.00  
##  Mean   : 3.845   Mean   :121.69   Mean   : 72.39   Mean   : 91.41  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:127.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :199.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :156.0   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   :155.8   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:156.0   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Correlations after treating missing values

# determine whether any correlation exists between the variables when plotted against each other, after the missing values have been treated
num_vars <- unlist(lapply(diabetes, is.numeric))  
dia_nums <- diabetes[ , num_vars]

dia_corr <- cor(dia_nums)
corrplot(dia_corr, method="number")
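
Because dia_corr is overwritten above, the correlations before and after imputation cannot be compared directly. A minimal sketch on made-up data keeps both matrices and inspects their difference; the names corr_before, corr_after and delta are illustrative.

```r
# Toy data in which one zero stands for a missing x value (values made up)
before <- data.frame(
  x = c(10, 0, 30, 40, 60),
  y = c(1, 2, 3, 4, 5)
)

# Same rule as in the vignette: replace zeros by the rounded mean of the non-zero values
after <- before
after$x <- ifelse(after$x == 0, round(mean(after$x[after$x > 0]), 0), after$x)

corr_before <- cor(before)
corr_after  <- cor(after)

# Positive entries mean the correlation increased after imputation
delta <- corr_after - corr_before
print(round(delta, 2))
```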

Conclusion

Since variables such as BloodPressure, Glucose, SkinThickness, BMI, and Insulin can never be zero, we treated their zero entries as missing values and replaced them with the mean of the non-zero values. We have limited our analysis to bar charts, but additional techniques such as multi-vari charts, Pareto charts, scatter plots, stem-and-leaf plots and odds ratios can be used to understand the data better. Once we are satisfied with the level of data cleaning and exploration, the logical next step would be to build a classification model to predict the onset of diabetes from the available independent variables.
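
As a rough illustration of that next step, the sketch below fits a logistic regression with base R's glm() on a tiny made-up data set. The data, the single predictor and the choice of logistic regression are all assumptions made for illustration; this is not a model fitted in this vignette.

```r
# Toy data, made up for illustration (not the Pima values)
toy <- data.frame(
  Glucose = c(80, 90, 100, 110, 120, 130, 140, 150, 160, 170),
  Outcome = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

# Logistic regression: models the probability that Outcome equals 1
fit <- glm(Outcome ~ Glucose, data = toy, family = binomial)
print(coef(fit))

# Predicted probabilities for two hypothetical glucose readings
new_obs <- data.frame(Glucose = c(95, 155))
pred <- predict(fit, newdata = new_obs, type = "response")
print(round(pred, 2))
```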

References

  1. https://machinelearningmastery.com/case-study-predicting-the-onset-of-diabetes-within-five-years-part-1-of-3/
  2. https://www.kaggle.com/uciml/pima-indians-diabetes-database
  3. https://archive.org/details/cu31924013702968/page/n5
  4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4354266/