Exploratory Data Analysis (EDA) is an approach to analysing data sets that summarizes their main characteristics, often with visual methods. It encourages data scientists to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.
This vignette walks through an Exploratory Data Analysis of the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. The source data (URL: https://www.kaggle.com/uciml/pima-indians-diabetes-database) allows us to predict whether a patient has diabetes on the basis of the diagnostic measures available in the dataset. The steps involved in EDA include: 1. Data Collection, 2. Data Cleaning and 3. Data Visualisation.
Several R packages are used here to load, manipulate, and visualise the data.
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(readr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(dplyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(readxl)
library(corrplot)
## corrplot 0.84 loaded
# loading data (.csv)
data <- read_csv("C:/Users/Sanaa/Desktop/statistics data/diabetes.csv")
## Parsed with column specification:
## cols(
## Pregnancies = col_double(),
## Glucose = col_double(),
## BloodPressure = col_double(),
## SkinThickness = col_double(),
## Insulin = col_double(),
## BMI = col_double(),
## DiabetesPedigreeFunction = col_double(),
## Age = col_double(),
## Outcome = col_double()
## )
diabetes <- as_tibble(data)
diabetes # display data as a tibble
## # A tibble: 768 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## 7 3 78 50 32 88 31
## 8 10 115 0 0 0 35.3
## 9 2 197 70 45 543 30.5
## 10 8 125 96 0 0 0
## # ... with 758 more rows, and 3 more variables:
## # DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>
Getting to know the data better involves understanding details such as how big the data set is and how many observations and variables it contains. The summary below provides descriptive statistics for every variable in the data set.
summary(diabetes) # numerical summary of each variable in the data (diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
head(diabetes)
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>,
## # Outcome <dbl>
dim(diabetes) # to find the dimensions of the dataset(Diabetes)
## [1] 768 9
names(diabetes) # Name of each variable in the dataset
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
str(diabetes)
## Classes 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : num 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : num 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : num 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : num 1 0 1 0 1 0 1 0 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Pregnancies = col_double(),
## .. Glucose = col_double(),
## .. BloodPressure = col_double(),
## .. SkinThickness = col_double(),
## .. Insulin = col_double(),
## .. BMI = col_double(),
## .. DiabetesPedigreeFunction = col_double(),
## .. Age = col_double(),
## .. Outcome = col_double()
## .. )
sapply(diabetes, typeof) # individual feature's datatype in the dataset
## Pregnancies Glucose BloodPressure
## "double" "double" "double"
## SkinThickness Insulin BMI
## "double" "double" "double"
## DiabetesPedigreeFunction Age Outcome
## "double" "double" "double"
Since Outcome is the dependent or target variable that indicates whether an individual has diabetes or not, it is important to check whether the dataset is reasonably balanced.
# value '0' is assigned to "Normal" people and '1' is assigned to people with "diabetes"
table(diabetes$Outcome)
##
## 0 1
## 500 268
# bar chart displaying the target variable "Outcome":
# 268 of 768 people have diabetes and 500 are normal
g <- ggplot(diabetes, aes(Outcome))
g + geom_bar(aes(group=Outcome, color=Outcome)) + theme(legend.position = "none")
Plotting the distribution of class values for the target variable shows 500 Normal and 268 Diabetic instances, so roughly 35% of the records are diabetic and the classes are moderately imbalanced.
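The class balance can also be expressed as proportions. The sketch below rebuilds the target vector from the counts printed by `table(diabetes$Outcome)` (500 and 268) so that it runs without the CSV file; the names `outcome_f`, `class_counts` and `class_share` are illustrative and not part of the original analysis.

```r
# Rebuild the target vector from the counts reported by table(diabetes$Outcome)
outcome <- rep(c(0, 1), times = c(500, 268))

# Label the classes so that tables and plot legends are self-explanatory
outcome_f <- factor(outcome, levels = c(0, 1), labels = c("Normal", "Diabetes"))

class_counts <- table(outcome_f)
class_share <- round(prop.table(class_counts), 3)  # share of each class

print(class_counts)
print(class_share)
```

With the real data loaded, `prop.table(table(diabetes$Outcome))` should give the same shares.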
# ggplot (Pregnancies and Outcome)
g <- ggplot(diabetes, aes(Pregnancies))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
The panel labelled ‘1’ shows the people with diabetes in this data set and the panel labelled ‘0’ shows the people without diabetes. The counts at each number of pregnancies are higher in the non-diabetic panel and lower in the diabetic panel, which partly reflects the fact that there are more non-diabetic people (500 versus 268). The dataset is still too small to draw any firm conclusion.
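Because the two groups differ in size, raw counts are hard to compare across panels; a per-group summary is one way around that. The following is a minimal sketch using only the first ten rows printed earlier in this vignette (so the numbers are not representative of the full data); `first10`, `preg_mean` and `preg_median` are illustrative names.

```r
# First ten rows of the data set, as printed by the tibble and str() output above
first10 <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Summarise pregnancies within each outcome group instead of comparing raw counts
preg_mean   <- tapply(first10$Pregnancies, first10$Outcome, mean)
preg_median <- tapply(first10$Pregnancies, first10$Outcome, median)

print(preg_mean)
print(preg_median)
```

On the full data the same calls would use `diabetes$Pregnancies` and `diabetes$Outcome`.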
# ggplot (Glucose and Outcome)
g <- ggplot(diabetes, aes(Glucose))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
In the first panel (Outcome 0, no diabetes) the counts are highest, with glucose levels mostly between 55 and 170, whereas in the second panel (Outcome 1, diabetes) the counts are lower and glucose levels mostly lie between 75 and 200.
# ggplot (BloodPressure and Outcome)
g <- ggplot(diabetes, aes(BloodPressure))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
# '0' panel: blood pressure of non-diabetic people
# '1' panel: blood pressure of people with diabetes
In the first panel (Outcome 0, no diabetes) the counts are highest, with BP levels mostly between 42 and 110, whereas in the second panel (Outcome 1, diabetes) the counts are lower, with BP levels still mostly between 45 and 110.
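For measurements such as Glucose or BloodPressure, `geom_bar()` draws one bar per distinct value, which can make ranges hard to read. A binned histogram is a common alternative. The sketch below is an illustration only, not the plot used in this vignette: it uses the first ten rows printed earlier, and the bin width of 20 is an arbitrary assumption.

```r
library(ggplot2)

# First ten rows of Glucose and Outcome, as printed earlier in this vignette
first10 <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Group glucose values into bins of width 20, one panel per outcome
p <- ggplot(first10, aes(Glucose)) +
  geom_histogram(binwidth = 20, boundary = 0) +
  facet_wrap(~Outcome, labeller = label_both)

print(p)
```

Replacing `first10` with `diabetes` would apply the same plot to the full data set.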
# ggplot (SkinThickness and Outcome)
g <- ggplot(diabetes, aes(SkinThickness))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
# ggplot (Insulin and Outcome)
g <- ggplot(diabetes, aes(Insulin))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
# ggplot (BMI and Outcome)
g <- ggplot(diabetes, aes(BMI))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
The Diabetes Pedigree Function summarises the history of diabetes mellitus in a patient's family and the genetic relationship of those relatives to the patient. It is a measure of the hereditary risk of developing diabetes mellitus.
# ggplot (DiabetesPedigreeFunction and Outcome)
g <- ggplot(diabetes, aes(DiabetesPedigreeFunction))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
From the available data it is unclear how well this function predicts the onset of diabetes.
The population consists of females who are generally young, mostly between 21 and 50 years old, although the summary above shows ages ranging up to 81.
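A statement about the age range can be checked directly by computing the share of records inside it. A minimal sketch, using only the ages of the first ten rows printed by `str(diabetes)` above (the variable names are illustrative):

```r
# Ages of the first ten patients, as printed by str(diabetes)
age <- c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)

# Observed range and share of patients aged 21 to 50
age_range <- range(age)
share_21_to_50 <- mean(age >= 21 & age <= 50)

print(age_range)
print(share_21_to_50)
```

On the full data, `mean(diabetes$Age <= 50)` would give the corresponding share for all 768 records.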
# ggplot (Age and Outcome)
g <- ggplot(diabetes, aes(Age))
g + geom_bar(aes(group=Outcome)) + facet_wrap(~Outcome) + theme(legend.position = "none")
A correlation plot is used to visually represent the relationships between all the variables of the “diabetes” dataset. All the variables are plotted against each other and against the target “Outcome” to find any correlation that may exist.
# the corrplot package was loaded at the beginning
# determine whether any correlation exists between the variables when plotted against each other
num_vars <- unlist(lapply(diabetes, is.numeric))
dia_nums <- diabetes[ , num_vars]
dia_corr <- cor(dia_nums)
corrplot(dia_corr, method="number")
A moderate correlation (0.54) is observed between Pregnancies and Age. The intensity of the colour indicates how weak or strong the correlation between two variables is.
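Besides reading the plot, the correlations with the target can be pulled out of the matrix and ranked. The sketch below does this for a few columns of the first ten rows printed earlier, so its values will differ from those of the full data; `outcome_corr` and `predictors` are illustrative names.

```r
# A few columns of the first ten rows, as printed earlier in this vignette
first10 <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose     = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  Age         = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

corr <- cor(first10)  # Pearson correlation matrix

# Keep the Outcome column without its self-correlation, strongest first
predictors <- setdiff(colnames(corr), "Outcome")
outcome_corr <- sort(corr[predictors, "Outcome"], decreasing = TRUE)

print(round(outcome_corr, 2))
```

Applied to `dia_corr`, the same two lines would rank all eight predictors by their correlation with Outcome.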
Some of the variables, namely BloodPressure, Glucose, SkinThickness, BMI, and Insulin, can never be zero, yet the output of “summary(diabetes)” shows a minimum of zero for each of them. These zeros are therefore treated as missing values.
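Before imputing, it helps to know how many zeros each of these variables contains. A minimal sketch on the first ten rows printed earlier (the full data set will have different counts); `zero_cols` and `zero_counts` are illustrative names.

```r
# First ten rows of the five affected variables, as printed earlier
first10 <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Number of zero entries per column
zero_counts <- colSums(first10[zero_cols] == 0)
print(zero_counts)
```

Running `colSums(diabetes[zero_cols] == 0)` would report the counts for all 768 records.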
In order to impute the missing values, the following steps are taken:
Diastolic blood pressure (mm Hg) is recorded in this data set. A value lower than 80 mm Hg is considered ‘Normal’, but it can never be zero, so the missing values are replaced by the mean of the non-zero values.
# missing values
mean_bp <- mean(diabetes$BloodPressure[diabetes$BloodPressure > 0])
diabetes$BloodPressure <- ifelse(diabetes$BloodPressure == 0, round(mean_bp,0), diabetes$BloodPressure)
Plasma glucose concentration (2 hours in an oral glucose tolerance test) was recorded; a value below 140 mg/dL is considered normal.
# Glucose
mean_Glu <- mean(diabetes$Glucose[diabetes$Glucose > 0])
diabetes$Glucose <- ifelse(diabetes$Glucose == 0, round(mean_Glu,0), diabetes$Glucose)
The value of Triceps skin fold thickness (mm) was measured.
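# SkinThickness
# (the summary() output printed below was generated when these two lines referred to
# Glucose by mistake, so its SkinThickness column does not match this code)
mean_SkT <- mean(diabetes$SkinThickness[diabetes$SkinThickness > 0])
diabetes$SkinThickness <- ifelse(diabetes$SkinThickness == 0, round(mean_SkT,0), diabetes$SkinThickness)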
# Insulin
mean_Insulin <- mean(diabetes$Insulin[diabetes$Insulin > 0])
diabetes$Insulin <- ifelse(diabetes$Insulin == 0, round(mean_Insulin,0), diabetes$Insulin)
# BMI
mean_BMI <- mean(diabetes$BMI[diabetes$BMI > 0])
diabetes$BMI <- ifelse(diabetes$BMI == 0, round(mean_BMI,0), diabetes$BMI)
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 21.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.: 21.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :100.00
## Mean : 3.845 Mean :121.69 Mean : 72.39 Mean : 91.41
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:127.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :199.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.:121.5 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :156.0 Median :32.00 Median :0.3725 Median :29.00
## Mean :155.8 Mean :32.45 Mean :0.4719 Mean :33.24
## 3rd Qu.:156.0 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
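Repeating the same two lines for every variable makes it easy to mix up column names. One possible way to avoid that is a small helper applied to each column in turn. This is a sketch only: `impute_zero_with_mean` is a hypothetical function, not part of the original analysis, and it is shown on two columns of the first ten rows printed earlier.

```r
# Hypothetical helper: replace zeros with the rounded mean of the non-zero values
impute_zero_with_mean <- function(x) {
  nonzero_mean <- mean(x[x > 0])
  ifelse(x == 0, round(nonzero_mean, 0), x)
}

# Two affected columns from the first ten rows of the data set
first10 <- data.frame(
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0)
)

cols <- c("BloodPressure", "SkinThickness")

# Apply the same rule to every listed column
imputed <- first10
imputed[cols] <- lapply(first10[cols], impute_zero_with_mean)

print(imputed)
```

The same call with `diabetes` and the five affected column names would cover the whole data set, with each column always imputed from its own values.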
# determine whether any correlation exists between the variables when plotted against each other,
# after the missing values have been treated
num_vars <- unlist(lapply(diabetes, is.numeric))
dia_nums <- diabetes[ , num_vars]
dia_corr <- cor(dia_nums)
corrplot(dia_corr, method="number")
Since variables such as BloodPressure, Glucose, SkinThickness, BMI, and Insulin can never be zero, the zeros were treated as missing values and replaced by the mean of the non-zero values of each variable. We have limited our analysis to bar charts, but additional graphical techniques such as multi-vari charts, Pareto charts, scatter plots and stem-and-leaf plots, as well as odds ratios, can be used to understand the data better. Once we are satisfied with our level of data cleaning and exploration, the logical next step would be to build a classification model to predict the onset of diabetes from the available independent variables.
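As an illustration of one of the additional techniques mentioned, a scatter plot of two variables coloured by the target might look like the sketch below. It uses the first ten rows printed earlier; the choice of Glucose against Age is an arbitrary assumption made for the example.

```r
library(ggplot2)

# First ten rows of Glucose, Age and Outcome, as printed earlier
first10 <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  Age     = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# One point per patient, coloured by whether the patient has diabetes
p <- ggplot(first10, aes(Glucose, Age, colour = factor(Outcome))) +
  geom_point(size = 2) +
  labs(colour = "Outcome")

print(p)
```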