
Introduction

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

Data Dictionary

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Understand the data

The data were imported into R. head() was used to view the first 6 rows and summary() to get per-column summary statistics.

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

The dimensions and column names were also checked. The dataset has 768 rows and 9 columns.

dim(diabetes)
## [1] 768   9
colnames(diabetes)
## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"
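
The inspection steps above can be tried on a small sample. The sketch below rebuilds only the six rows printed by head(); the name diabetes_sample is made up for illustration, and in the real analysis the full 768-row file would be read from disk (the source does not give its path).

```r
# First six rows of the dataset, copied from the head() output above.
# `diabetes_sample` is a hypothetical name used only for this sketch.
diabetes_sample <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5),
  Glucose = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin = c(0, 0, 0, 94, 168, 0),
  BMI = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  DiabetesPedigreeFunction = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  Age = c(50, 31, 32, 21, 33, 30),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

head(diabetes_sample)     # first rows
summary(diabetes_sample)  # min, quartiles, mean and max per column

sample_dim <- dim(diabetes_sample)        # rows, then columns
sample_cols <- colnames(diabetes_sample)  # column names
```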

The data were checked for null values and duplicated rows. Neither was present.

## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0
## [1] 0
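
A check of this kind can be sketched in base R as follows. The toy data frame is invented for illustration and deliberately contains one missing value and one duplicated row, so both counts are non-zero here, unlike in the real data.

```r
# Hypothetical toy data standing in for the real `diabetes` data frame.
toy <- data.frame(
  Glucose = c(148, 85, NA, 89),
  BMI = c(33.6, 26.6, 23.3, 28.1)
)
toy <- rbind(toy, toy[1, ])  # append a copy of row 1 to create a duplicate

# Count missing values per column; lapply returns a named list,
# which prints in the same $Column / [1] n layout as the output above.
na_counts <- lapply(toy, function(col) sum(is.na(col)))

# Count rows that exactly repeat an earlier row.
dup_count <- sum(duplicated(toy))
```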

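Blood pressure, BMI, skin thickness and glucose cannot physiologically be zero, so zero values in these columns were treated as missing and replaced with the column mean or median rather than dropped, since removing those rows would change the data. The summary below indicates that zero insulin values were replaced as well (the minimum rises from 0 to 14).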

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:20.54  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :121.68   Mean   : 72.25   Mean   :26.61  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 30.50   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 31.25   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 94.65   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.25   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome        age_cat         
##  Min.   :0.000   Length:768        
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.349                     
##  3rd Qu.:1.000                     
##  Max.   :1.000
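
The zero-replacement step can be sketched as below. The helper replace_zeros and the toy data are hypothetical, and computing the replacement statistic from the non-zero values only is an assumption of this sketch; the report does not say how the mean or median was computed or which statistic was used for which column.

```r
# Hypothetical toy data; zeros in Glucose and SkinThickness are implausible.
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  SkinThickness = c(35, 29, 0, 23),
  Pregnancies = c(6, 1, 0, 1)  # zero is a valid value here, so left untouched
)

# Replace zeros in a numeric vector with a statistic of the non-zero values.
# (Assumption: zeros are excluded when computing the statistic.)
replace_zeros <- function(x, stat = median) {
  nonzero <- x[x != 0]
  x[x == 0] <- stat(nonzero)
  x
}

toy$Glucose <- replace_zeros(toy$Glucose, stat = mean)
toy$SkinThickness <- replace_zeros(toy$SkinThickness, stat = median)
```

Excluding the zeros keeps the placeholder values from pulling the replacement downward; including them is the other option and gives a lower fill value.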

Insights

How was the population spread?

age_cat Percent
Below_30 51.56
30s 21.48
40s 15.36
50s 7.42
60s 3.78
Above_70 0.39
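
An age breakdown like the one above can be produced with cut(). The bin edges below are assumptions inferred from the category labels (for example, that age 70 falls in Above_70), and the ages vector is made up for illustration.

```r
# Hypothetical ages used only to demonstrate the binning.
ages <- c(21, 25, 29, 30, 39, 45, 52, 63, 70, 81)

age_cat <- cut(
  ages,
  breaks = c(-Inf, 30, 40, 50, 60, 70, Inf),
  labels = c("Below_30", "30s", "40s", "50s", "60s", "Above_70"),
  right = FALSE  # intervals are [lower, upper), so 30 falls in "30s"
)

# Share of each age category, in percent, rounded to two decimals.
age_percent <- round(100 * prop.table(table(age_cat)), 2)
```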

Number of Pregnancies

The median number of pregnancies was 3 (mean 3.85). The distribution was right-skewed: most values sat at the low end, with a long tail of high counts. A few extreme cases with more than 11 pregnancies accounted for 4.43% of the patients.
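
A quick way to check the direction of skew is to compare the mean with the median: a mean above the median points to a long right tail. The sketch below uses invented counts, not the real data.

```r
# Hypothetical pregnancy counts with one extreme value.
pregnancies <- c(0, 0, 1, 1, 2, 3, 3, 4, 6, 12)

preg_mean <- mean(pregnancies)
preg_median <- median(pregnancies)

# Mean above median suggests a right-skewed distribution.
right_skewed <- preg_mean > preg_median

# Percentage of patients with more than 11 pregnancies.
pct_above_11 <- round(100 * mean(pregnancies > 11), 2)
```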

Comparison of pregnancies and outcome

Patients with a higher number of pregnancies were more likely to be diabetic.
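
One way to put a number on this comparison is the mean pregnancy count per outcome group. The data below are made up, and the labels "No diabetes" and "Diabetes" are assumptions about how Outcome 0 and 1 should be read.

```r
# Hypothetical toy data: pregnancy counts with a 0/1 outcome.
toy <- data.frame(
  Pregnancies = c(0, 1, 2, 3, 6, 8, 10, 13),
  Outcome = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Treat Outcome as a factor so it defines discrete groups
# (this also suits grouped plots such as boxplots).
toy$Outcome <- factor(toy$Outcome, levels = c(0, 1),
                      labels = c("No diabetes", "Diabetes"))

# Mean number of pregnancies within each outcome group.
mean_preg_by_outcome <- tapply(toy$Pregnancies, toy$Outcome, mean)
```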


BMI numbers

About 61.98% of the population had a BMI above 30, which is classified as obese. 1.04% had a BMI above 50, which is possible but may indicate outliers. 0.52% of the population were underweight, with a BMI below 18.5. Note that the Percent column in the table below holds patient counts (they sum to 768), not percentages.

## # A tibble: 9 × 2
##   BMI_cat           Percent
##   <chr>               <int>
## 1 Moderate Obese        235
## 2 Obese                 179
## 3 Severe Obese          150
## 4 Normal                102
## 5 Very severe Obese      62
## 6 Morbid Obese           27
## 7 Super Obese             8
## 8 Underweight             4
## 9 Hyper obese             1
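
The percentage shares quoted for BMI can be computed as the mean of a logical condition. The sketch below uses invented BMI values and only the three thresholds named in the text (30, 50 and 18.5); the cut-offs behind the finer BMI_cat labels are not given in the source and are not reproduced here.

```r
# Hypothetical BMI values for illustration.
bmi <- c(17.9, 22.4, 26.6, 31.2, 33.6, 43.1, 51.0, 28.1)

# Percentage of TRUE values in a logical vector, rounded to two decimals.
pct <- function(flag) round(100 * mean(flag), 2)

bmi_shares <- c(
  above_30 = pct(bmi > 30),
  above_50 = pct(bmi > 50),
  underweight = pct(bmi < 18.5)
)
```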

Diabetes pedigree Function and BMI

The Super Obese, Severe Obese and Very severe Obese groups had the highest diabetes pedigree function (DPF). On its own this is not a good measure, since even the underweight group still had a high DPF.

Glucose ranges

Patients with glucose above 125, classified as hyperglycemic, accounted for 40.49% of the population. Those with impaired glucose made up 34.5%, while the hypoglycemic group accounted for 1%.
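
A grouping of this kind can be sketched with nested ifelse() calls. Only the hyperglycemia cut-off (above 125) comes from the text; the other boundaries (below 70 for hypoglycemia, 100 to 125 for impaired glucose) are assumptions, and the glucose values are made up.

```r
# Hypothetical glucose readings for illustration.
glucose <- c(65, 85, 99, 100, 116, 125, 126, 148, 183)

# Above 125 is from the report; the 70 and 100 cut-offs are assumed.
glucose_cat <- ifelse(glucose > 125, "Hyperglycemia",
               ifelse(glucose >= 100, "Impaired glucose",
               ifelse(glucose >= 70, "Normal", "Hypoglycemia")))

# Share of each glucose category, in percent.
glucose_percent <- round(100 * prop.table(table(glucose_cat)), 2)
```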

Glucose Ranges and Diabetes Pedigree Function

The hyperglycemic and impaired-glucose groups had the highest diabetes pedigree function.

Patients with hyperglycemia, that is, high glucose 2 hours after an oral glucose test, were more likely to have diabetes as their outcome. By contrast, none of the hypoglycemic patients had diabetes as the outcome.
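
The comparison of outcome across glucose groups can be expressed as the percentage of diabetic patients in each group. The sketch below uses invented rows; because Outcome is coded 0/1, its group mean is the diabetic share.

```r
# Hypothetical toy data: glucose category and 0/1 diabetes outcome.
toy <- data.frame(
  glucose_cat = c("Hyperglycemia", "Hyperglycemia", "Hyperglycemia",
                  "Impaired glucose", "Impaired glucose",
                  "Hypoglycemia", "Hypoglycemia"),
  Outcome = c(1, 1, 0, 1, 0, 0, 0)
)

# Mean of a 0/1 column is the proportion of 1s; scale to percent.
diabetic_pct <- round(100 * tapply(toy$Outcome, toy$glucose_cat, mean), 1)
```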


How do the various parameters relate to each other?

Comparison of age and outcome

Blood Pressure comparison

The dataset records diastolic blood pressure, for which the normal range is 60-80 mmHg. Readings were approximately normally distributed, though 11.2% of patients had a low diastolic BP (below 60 mmHg). 62% had a normal BP, and 0.13% (1 person) had a reading in the hypertensive crisis range (above 120 mmHg).

Diastolic blood pressure did not appear to have a strong effect on the diabetes outcome.
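
The blood pressure bands can be sketched in the same way. The thresholds for low (below 60), normal (60 to 80) and hypertensive crisis (above 120) follow the text; the "Elevated" label for readings between 80 and 120 is a placeholder invented for this sketch, and the readings are made up.

```r
# Hypothetical diastolic readings in mmHg.
bp <- c(40, 58, 60, 72, 80, 86, 110, 122)

# "Elevated" is an assumed label for the band the report does not name.
bp_cat <- ifelse(bp > 120, "Hypertensive crisis",
          ifelse(bp > 80, "Elevated",
          ifelse(bp >= 60, "Normal", "Low")))

# Share of each band, in percent.
bp_percent <- round(100 * prop.table(table(bp_cat)), 2)
```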

Recommendation

Since more than 60% of the population were obese, education and lifestyle modification measures should be introduced to the community.

More screening measures should be introduced, and gym services and healthy eating should be encouraged.