Diabetes: Model Prediction Analysis

** Diabetes **

Diabetes symptoms vary depending on how much your blood sugar is elevated. Some people, especially those with prediabetes or type 2 diabetes, may not experience symptoms initially. In type 1 diabetes, symptoms tend to come on quickly and be more severe.

** Symptoms include: **

- Increased thirst
- Frequent urination
- Extreme hunger
- Unexplained weight loss
- Presence of ketones in the urine (ketones are a byproduct of the breakdown of muscle and fat that happens when there’s not enough available insulin)
- Fatigue
- Irritability
- Blurred vision
- Slow-healing sores
- Frequent infections, such as gums or skin infections and vaginal infections

Type 1 diabetes can develop at any age, though it often appears during childhood or adolescence. Type 2 diabetes, the more common type, occurs most often in people older than 40.

** Analysis of Diabetes. Kim Phan, Graduate Student at University of California, Riverside. September 2019 **

Loading libraries

library(data.table)
library(DataExplorer)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(PerformanceAnalytics)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## The following objects are masked from 'package:data.table':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

library(corrplot)

## corrplot 0.84 loaded

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

Loading dataset

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
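
The call that created `diabetes` is not shown in this report. A minimal sketch of how such a table could be loaded, assuming it is stored as a CSV file: the file path here is hypothetical (a temporary file), and only the first five rows from the `str()` output above are written to it so the example runs on its own.

```r
# Hypothetical CSV holding the first five rows shown in the str() output above.
# The real file name and location are not given in the source.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
  "6,148,72,35,0,33.6,0.627,50,1",
  "1,85,66,29,0,26.6,0.351,31,0",
  "8,183,64,0,0,23.3,0.672,32,1",
  "1,89,66,23,94,28.1,0.167,21,0",
  "0,137,40,35,168,43.1,2.288,33,1"
), csv_path)

# read.csv() infers integer and numeric column types from the text
diabetes <- read.csv(csv_path)

str(diabetes)      # column types and first values
summary(diabetes)  # per-column five-number summaries and means
```

With the full file, the same two calls would be expected to print the structure and summary shown above.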

Description of the data set: The data frame has 768 observations of 9 variables. Eight are numeric predictors (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age), and the ninth, Outcome, is a binary 0/1 indicator with a mean of 0.349, so about 35% of the records are coded 1. These variables and dimensions match the Pima Indians Diabetes Database. Several measurement columns have a minimum of 0 (Glucose, BloodPressure, SkinThickness, Insulin, BMI), which is not a plausible reading and is handled in the cleaning step below.

# Some rows contain zero values that are likely missing measurements, in the columns Insulin, BMI, BloodPressure, Glucose, and SkinThickness.
## Rows with a zero in Insulin, Glucose, or BMI are therefore removed; the summary below shows that no zeros remain in BloodPressure or SkinThickness afterwards.

# Note: zero is a valid value for Pregnancies and Outcome.

# Remove rows with zero values
clean <- filter(diabetes, Insulin !=0)
clean <- filter(clean, Glucose !=0)
clean <- filter(clean, BMI !=0)
summary(clean)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 2.000   Median :119.0   Median : 70.00   Median :29.00  
##  Mean   : 3.301   Mean   :122.6   Mean   : 70.66   Mean   :29.15  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00   3rd Qu.:37.00  
##  Max.   :17.000   Max.   :198.0   Max.   :110.00   Max.   :63.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0850           Min.   :21.00  
##  1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697           1st Qu.:23.00  
##  Median :125.50   Median :33.20   Median :0.4495           Median :27.00  
##  Mean   :156.06   Mean   :33.09   Mean   :0.5230           Mean   :30.86  
##  3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870           3rd Qu.:36.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3316  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
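
Before dropping rows, it can help to count how many zeros each measurement column holds. The sketch below is an illustration in base R rather than the report's own code: it uses only the first ten rows printed by `str()` earlier (DiabetesPedigreeFunction is left out because only five of its values are shown), and the names `first10`, `measured`, and `kept` are introduced here for the example.

```r
# First ten rows of the raw data, copied from the str() output above
first10 <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0),
  Age           = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Outcome       = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Columns where a zero cannot be a real measurement
measured <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Number of zeros per measurement column
zero_counts <- colSums(first10[, measured] == 0)
print(zero_counts)

# Same filter as the report, written with base R subset()
kept <- subset(first10, Insulin != 0 & Glucose != 0 & BMI != 0)
print(nrow(kept))

# Check whether any zeros are left in the measurement columns
remaining_zeros <- colSums(kept[, measured] == 0)
print(remaining_zeros)
```

In these ten rows, Insulin has by far the most zeros, and removing its zero rows also removes every zero in the other measurement columns. That is consistent with the cleaned summary above, where BloodPressure and SkinThickness have nonzero minimums even though they were not filtered directly.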

# Make outcome factor
clean$Outcome <- as.factor(clean$Outcome)

Normalizing and scaling dataset

# Min-max normalization: rescale x to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

clean$Insulin_normalize <- normalize(clean$Insulin)
clean$DPF_normalize <- normalize(clean$DiabetesPedigreeFunction)

# With center = FALSE, scale() divides by the root mean square and returns a one-column matrix
clean$BloodPressure_scale <- scale(clean$BloodPressure, center = FALSE, scale = TRUE)
clean$Glucose_scale <- scale(clean$Glucose, center = FALSE, scale = TRUE)
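
The two rescaling methods used above behave differently, and a small worked example may make that concrete. This sketch uses the first ten Glucose values from the `str()` output as sample data; the variable names are introduced here for illustration. It shows that min-max normalization maps values to [0, 1], while `scale()` with `center = FALSE` divides by the root mean square (with an `n - 1` denominator) and returns a matrix rather than a plain vector.

```r
# Min-max normalization, as defined in the report
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# First ten Glucose values from the str() output
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)

# Min-max: smallest value becomes 0, largest becomes 1
glucose_minmax <- normalize(glucose)
print(range(glucose_minmax))

# scale(center = FALSE, scale = TRUE) divides by the root mean square,
# computed as sqrt(sum(x^2) / (n - 1)), not by the standard deviation
rms <- sqrt(sum(glucose^2) / (length(glucose) - 1))
glucose_rms <- scale(glucose, center = FALSE, scale = TRUE)
print(all.equal(as.numeric(glucose_rms), glucose / rms))

# The result is a one-column matrix carrying a "scaled:scale" attribute
print(is.matrix(glucose_rms))

# as.numeric() drops the matrix attributes when a plain vector is preferred
glucose_rms_vec <- as.numeric(glucose_rms)
```

Because the scaled columns are stored as matrices, some downstream functions may treat them differently from ordinary numeric columns; wrapping the call in `as.numeric()` is one way to avoid that.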

Correlation between features

library(corrplot)
library(CorrToolBox)
library(corrr)
M<-cor(clean[,-9])
corrplot(M, method="number")

corrplot(M, method="number",type="upper", order="hclust")

plot_boxplot(clean, by="Outcome",nrow=4L,ncol=2L,title = " Boxplot of variables vs Outcome")

plot_density(clean, nrow=4L,ncol=2L)

chart.Correlation(clean[,-9], histogram=TRUE, col="grey10", pch=1, main="Chart Correlation of Variance")

library(C50)
library(tree)
hist(clean$BMI)

plot (density (clean$BMI, na.rm=TRUE) )

pairs(clean[,-9])

cor(clean[,-9])

##                           Pregnancies   Glucose BloodPressure
## Pregnancies               1.000000000 0.1982910     0.2133548
## Glucose                   0.198291043 1.0000000     0.2100266
## BloodPressure             0.213354775 0.2100266     1.0000000
## SkinThickness             0.093209397 0.1988558     0.2325712
## Insulin                   0.078983625 0.5812230     0.0985115
## BMI                      -0.025347276 0.2095159     0.3044034
## DiabetesPedigreeFunction  0.007562116 0.1401802    -0.0159711
## Age                       0.679608470 0.3436415     0.3000389
## Insulin_normalize         0.078983625 0.5812230     0.0985115
## DPF_normalize             0.007562116 0.1401802    -0.0159711
## BloodPressure_scale       0.213354775 0.2100266     1.0000000
## Glucose_scale             0.198291043 1.0000000     0.2100266
##                          SkinThickness    Insulin         BMI
## Pregnancies                  0.0932094 0.07898363 -0.02534728
## Glucose                      0.1988558 0.58122301  0.20951592
## BloodPressure                0.2325712 0.09851150  0.30440337
## SkinThickness                1.0000000 0.18219906  0.66435487
## Insulin                      0.1821991 1.00000000  0.22639652
## BMI                          0.6643549 0.22639652  1.00000000
## DiabetesPedigreeFunction     0.1604985 0.13590578  0.15877104
## Age                          0.1677611 0.21708199  0.06981380
## Insulin_normalize            0.1821991 1.00000000  0.22639652
## DPF_normalize                0.1604985 0.13590578  0.15877104
## BloodPressure_scale          0.2325712 0.09851150  0.30440337
## Glucose_scale                0.1988558 0.58122301  0.20951592
##                          DiabetesPedigreeFunction        Age
## Pregnancies                           0.007562116 0.67960847
## Glucose                               0.140180180 0.34364150
## BloodPressure                        -0.015971104 0.30003895
## SkinThickness                         0.160498526 0.16776114
## Insulin                               0.135905781 0.21708199
## BMI                                   0.158771043 0.06981380
## DiabetesPedigreeFunction              1.000000000 0.08502911
## Age                                   0.085029106 1.00000000
## Insulin_normalize                     0.135905781 0.21708199
## DPF_normalize                         1.000000000 0.08502911
## BloodPressure_scale                  -0.015971104 0.30003895
## Glucose_scale                         0.140180180 0.34364150
##                          Insulin_normalize DPF_normalize
## Pregnancies                     0.07898363   0.007562116
## Glucose                         0.58122301   0.140180180
## BloodPressure                   0.09851150  -0.015971104
## SkinThickness                   0.18219906   0.160498526
## Insulin                         1.00000000   0.135905781
## BMI                             0.22639652   0.158771043
## DiabetesPedigreeFunction        0.13590578   1.000000000
## Age                             0.21708199   0.085029106
## Insulin_normalize               1.00000000   0.135905781
## DPF_normalize                   0.13590578   1.000000000
## BloodPressure_scale             0.09851150  -0.015971104
## Glucose_scale                   0.58122301   0.140180180
##                          BloodPressure_scale Glucose_scale
## Pregnancies                        0.2133548     0.1982910
## Glucose                            0.2100266     1.0000000
## BloodPressure                      1.0000000     0.2100266
## SkinThickness                      0.2325712     0.1988558
## Insulin                            0.0985115     0.5812230
## BMI                                0.3044034     0.2095159
## DiabetesPedigreeFunction          -0.0159711     0.1401802
## Age                                0.3000389     0.3436415
## Insulin_normalize                  0.0985115     0.5812230
## DPF_normalize                     -0.0159711     0.1401802
## BloodPressure_scale                1.0000000     0.2100266
## Glucose_scale                      0.2100266     1.0000000
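
In the matrix above, Insulin_normalize, DPF_normalize, BloodPressure_scale, and Glucose_scale each correlate exactly 1 with their source column and repeat its row. That is expected, because Pearson correlation does not change under a positive linear rescaling. A short sketch using the first ten Glucose and Insulin values from the `str()` output (sample data only, with variable names introduced for this example):

```r
# Min-max normalization, as defined in the report
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# First ten raw values from the str() output
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)
insulin <- c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0)

# Rescaled versions, mirroring the derived columns in the report
glucose_scaled <- as.numeric(scale(glucose, center = FALSE, scale = TRUE))
insulin_minmax <- normalize(insulin)

# Correlation between the raw columns and between their rescaled versions
cor_raw <- cor(glucose, insulin)
cor_rescaled <- cor(glucose_scaled, insulin_minmax)
print(all.equal(cor_raw, cor_rescaled))

# A column and its rescaled copy are perfectly correlated
cor_self <- cor(insulin, insulin_minmax)
print(all.equal(cor_self, 1))
```

For correlation plots and tree models, keeping only one version of each variable avoids the duplicated rows and the duplicated split candidates.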

#-----------------------------------------
library(funModeling)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Registered S3 method overwritten by 'cli':
##   method     from
##   print.tree tree

## funModeling v.1.9.2 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai

plot_num(clean, bins = 10, path_out = NA)

## Warning: attributes are not identical across measure variables; they will
## be dropped

correlation_table(data=clean, target="Outcome")

##                    Variable Outcome
## 1                   Outcome    1.00
## 2                   Glucose    0.52
## 3                       Age    0.35
## 4                   Insulin    0.30
## 5         Insulin_normalize    0.30
## 6                       BMI    0.27
## 7               Pregnancies    0.26
## 8             SkinThickness    0.26
## 9  DiabetesPedigreeFunction    0.21
## 10            DPF_normalize    0.21
## 11            BloodPressure    0.19

profiling_num(clean)

##                    variable        mean     std_dev variation_coef
## 1               Pregnancies   3.3010204   3.2114245      0.9728581
## 2                   Glucose 122.6275510  30.8607806      0.2516627
## 3             BloodPressure  70.6632653  12.4960916      0.1768400
## 4             SkinThickness  29.1454082  10.5164239      0.3608261
## 5                   Insulin 156.0561224 118.8416898      0.7615317
## 6                       BMI  33.0862245   7.0276592      0.2124044
## 7  DiabetesPedigreeFunction   0.5230459   0.3454880      0.6605310
## 8                       Age  30.8647959  10.2007765      0.3304988
## 9         Insulin_normalize   0.1707405   0.1428386      0.8365827
## 10            DPF_normalize   0.1876000   0.1479606      0.7887028
##            p_01       p_05        p_25        p_50        p_75        p_95
## 1   0.000000000  0.0000000  1.00000000   2.0000000   5.0000000  10.0000000
## 2  70.730000000 81.0000000 99.00000000 119.0000000 143.0000000 181.0000000
## 3  39.820000000 50.0000000 62.00000000  70.0000000  78.0000000  90.0000000
## 4  10.000000000 13.0000000 21.00000000  29.0000000  37.0000000  46.4500000
## 5  18.000000000 42.5500000 76.75000000 125.5000000 190.0000000 396.5000000
## 6  19.500000000 22.2550000 28.40000000  33.2000000  37.1000000  45.2450000
## 7   0.106460000  0.1535500  0.26975000   0.4495000   0.6870000   1.1603500
## 8  21.000000000 21.0000000 23.00000000  27.0000000  36.0000000  52.4500000
## 9   0.004807692  0.0343149  0.07542067   0.1340144   0.2115385   0.4597356
## 10  0.009190578  0.0293576  0.07912206   0.1561028   0.2578158   0.4605353
##           p_99    skewness kurtosis         iqr
## 1   13.0000000  1.33048013 4.452184   4.0000000
## 2  196.0000000  0.51586626 2.507647  44.0000000
## 3  102.3600000 -0.08718115 3.770028  16.0000000
## 4   52.0000000  0.20850902 2.532854  16.0000000
## 5  580.8900000  2.15682248 9.260449 113.2500000
## 6   53.3620000  0.66094351 4.521463   8.7000000
## 7    1.7384200  1.95159662 9.270504   0.4172500
## 8   60.0000000  1.39822988 4.700180  13.0000000
## 9    0.6813582  2.15682248 9.260449   0.1361178
## 10   0.7081028  1.95159662 9.270504   0.1786938
##                                    range_98
## 1                                   [0, 13]
## 2                              [70.73, 196]
## 3                           [39.82, 102.36]
## 4                                  [10, 52]
## 5                    [18, 580.889999999999]
## 6                            [19.5, 53.362]
## 7               [0.10646, 1.73841999999999]
## 8                                  [21, 60]
## 9  [0.00480769230769231, 0.681358173076922]
## 10 [0.00919057815845824, 0.708102783725905]
##                                   range_80
## 1                                   [0, 8]
## 2                              [87, 170.9]
## 3                                 [56, 86]
## 4                                 [15, 43]
## 5                            [51.1, 292.8]
## 6                           [24.42, 42.07]
## 7                         [0.1834, 0.9422]
## 8                                 [22, 46]
## 9  [0.0445913461538462, 0.335096153846154]
## 10  [0.0421413276231263, 0.36710920770878]

Creating a Tree diagram

library(rpart)
library(rpart.plot)


# Create a decision tree model
tree <- rpart(Outcome~., data=clean, cp=.02,method="class")
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

#-----------------------------

prp(tree, main = "assorted arguments",
    extra = 106,           # display prob of the second class (Outcome = 1) and percent of obs
    nn = TRUE,             # display the node numbers
    fallen.leaves = TRUE,  # put the leaves on the bottom of the page
    shadow.col = "gray",   # shadows under the leaves
    branch.lty = 3,        # draw branches using dotted lines
    branch = .5,           # change angle of branch lines
    faclen = 0,            # faclen = 0 to print full factor names
    trace = 1,             # print the auto calculated cex, xlim, ylim
    split.cex = 1.2,       # make the split text larger than the node text
    split.prefix = "is ",  # put "is " before split text
    split.suffix = "?",    # put "?" after split text
    split.box.col = "lightgreen",   # lightgreen split boxes (default is white)
    split.border.col = "darkgray", # darkgray border on split boxes
    split.round = .5)              # round the split box corners a tad

## cex 1   xlim c(0, 1)   ylim c(0, 1)

#-----------------

printcp(tree) # display the results

## 
## Classification tree:
## rpart(formula = Outcome ~ ., data = clean, method = "class", 
##     cp = 0.02)
## 
## Variables actually used in tree construction:
## [1] Age     Glucose Insulin
## 
## Root node error: 130/392 = 0.33163
## 
## n= 392 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.284615      0   1.00000 1.00000 0.071703
## 2 0.065385      1   0.71538 0.71538 0.064787
## 3 0.030769      3   0.58462 0.69231 0.064053
## 4 0.020000      5   0.52308 0.76923 0.066390

plotcp(tree) # visualize cross-validation results
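
The complexity table above has its lowest cross-validated error (xerror 0.69231) at 3 splits, while the fitted tree with 5 splits has a higher xerror (0.76923). A common follow-up is to prune back to the cp value with the smallest xerror. The sketch below shows that idiom on `kyphosis`, a data set shipped with rpart, because the report's `clean` data frame is not available here; the object names are assumptions made for the example.

```r
library(rpart)

# Cross-validation inside rpart is random, so fix the seed for repeatability
set.seed(1)

# Grow a deliberately large tree on rpart's built-in kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001)

# The cp table has columns CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable

# Row with the smallest cross-validated error
best_row <- which.min(cp_table[, "xerror"])
best_cp <- cp_table[best_row, "CP"]

# Prune the tree back to that complexity value
pruned <- prune(fit, cp = best_cp)

printcp(pruned)
```

Applied to the report's model, the same two lines (`which.min` on the xerror column, then `prune`) would select the row of the table that the cross-validation favors.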

summary(tree) # detailed summary of splits

## Call:
## rpart(formula = Outcome ~ ., data = clean, method = "class", 
##     cp = 0.02)
##   n= 392 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.28461538      0 1.0000000 1.0000000 0.07170277
## 2 0.06538462      1 0.7153846 0.7153846 0.06478742
## 3 0.03076923      3 0.5846154 0.6923077 0.06405283
## 4 0.02000000      5 0.5230769 0.7692308 0.06639036
## 
## Variable importance
##             Glucose       Glucose_scale                 Age 
##                  28                  27                  14 
##             Insulin   Insulin_normalize         Pregnancies 
##                  10                  10                   5 
##       SkinThickness       BloodPressure BloodPressure_scale 
##                   1                   1                   1 
##                 BMI 
##                   1 
## 
## Node number 1: 392 observations,    complexity param=0.2846154
##   predicted class=0  expected loss=0.3316327  P(node) =1
##     class counts:   262   130
##    probabilities: 0.668 0.332 
##   left son=2 (241 obs) right son=3 (151 obs)
##   Primary splits:
##       Glucose           < 127.5     to the left,  improve=41.56381, (0 missing)
##       Glucose_scale     < 1.007084  to the left,  improve=41.56381, (0 missing)
##       Insulin           < 121       to the left,  improve=25.03804, (0 missing)
##       Insulin_normalize < 0.1286058 to the left,  improve=25.03804, (0 missing)
##       Age               < 28.5      to the left,  improve=24.48980, (0 missing)
##   Surrogate splits:
##       Glucose_scale     < 1.007084  to the left,  agree=1.000, adj=1.000, (0 split)
##       Insulin           < 121       to the left,  agree=0.742, adj=0.331, (0 split)
##       Insulin_normalize < 0.1286058 to the left,  agree=0.742, adj=0.331, (0 split)
##       Age               < 33.5      to the left,  agree=0.673, adj=0.152, (0 split)
##       Pregnancies       < 6.5       to the left,  agree=0.658, adj=0.113, (0 split)
## 
## Node number 2: 241 observations,    complexity param=0.03076923
##   predicted class=0  expected loss=0.1493776  P(node) =0.6147959
##     class counts:   205    36
##    probabilities: 0.851 0.149 
##   left son=4 (181 obs) right son=5 (60 obs)
##   Primary splits:
##       Insulin                  < 143.5     to the left,  improve=5.406876, (0 missing)
##       Insulin_normalize        < 0.155649  to the left,  improve=5.406876, (0 missing)
##       DiabetesPedigreeFunction < 0.6385    to the left,  improve=5.179878, (0 missing)
##       DPF_normalize            < 0.237045  to the left,  improve=5.179878, (0 missing)
##       Age                      < 28.5      to the left,  improve=5.046049, (0 missing)
##   Surrogate splits:
##       Insulin_normalize < 0.155649  to the left,  agree=1.000, adj=1.000, (0 split)
##       Glucose           < 121.5     to the left,  agree=0.776, adj=0.100, (0 split)
##       Glucose_scale     < 0.9596916 to the left,  agree=0.776, adj=0.100, (0 split)
##       Age               < 50.5      to the left,  agree=0.768, adj=0.067, (0 split)
##       SkinThickness     < 9         to the right, agree=0.759, adj=0.033, (0 split)
## 
## Node number 3: 151 observations,    complexity param=0.06538462
##   predicted class=1  expected loss=0.3774834  P(node) =0.3852041
##     class counts:    57    94
##    probabilities: 0.377 0.623 
##   left son=6 (105 obs) right son=7 (46 obs)
##   Primary splits:
##       Glucose       < 165.5     to the left,  improve=9.558606, (0 missing)
##       Glucose_scale < 1.307234  to the left,  improve=9.558606, (0 missing)
##       BMI           < 29.6      to the left,  improve=8.657035, (0 missing)
##       Age           < 23.5      to the left,  improve=7.919643, (0 missing)
##       SkinThickness < 21.5      to the left,  improve=5.499145, (0 missing)
##   Surrogate splits:
##       Glucose_scale            < 1.307234  to the left,  agree=1.000, adj=1.000, (0 split)
##       Insulin                  < 452.5     to the left,  agree=0.709, adj=0.043, (0 split)
##       DiabetesPedigreeFunction < 1.764     to the left,  agree=0.709, adj=0.043, (0 split)
##       Age                      < 51.5      to the left,  agree=0.709, adj=0.043, (0 split)
##       Insulin_normalize        < 0.5270433 to the left,  agree=0.709, adj=0.043, (0 split)
## 
## Node number 4: 181 observations
##   predicted class=0  expected loss=0.08839779  P(node) =0.4617347
##     class counts:   165    16
##    probabilities: 0.912 0.088 
## 
## Node number 5: 60 observations,    complexity param=0.03076923
##   predicted class=0  expected loss=0.3333333  P(node) =0.1530612
##     class counts:    40    20
##    probabilities: 0.667 0.333 
##   left son=10 (32 obs) right son=11 (28 obs)
##   Primary splits:
##       Age                      < 28.5      to the left,  improve=10.059520, (0 missing)
##       Pregnancies              < 7.5       to the left,  improve= 3.205128, (0 missing)
##       DiabetesPedigreeFunction < 0.6375    to the left,  improve= 2.986667, (0 missing)
##       DPF_normalize            < 0.2366167 to the left,  improve= 2.986667, (0 missing)
##       BloodPressure            < 73        to the left,  improve= 1.843810, (0 missing)
##   Surrogate splits:
##       Pregnancies         < 3.5       to the left,  agree=0.783, adj=0.536, (0 split)
##       BloodPressure       < 63        to the left,  agree=0.633, adj=0.214, (0 split)
##       SkinThickness       < 44        to the left,  agree=0.633, adj=0.214, (0 split)
##       BloodPressure_scale < 0.8768439 to the left,  agree=0.633, adj=0.214, (0 split)
##       Glucose             < 104.5     to the right, agree=0.617, adj=0.179, (0 split)
## 
## Node number 6: 105 observations,    complexity param=0.06538462
##   predicted class=1  expected loss=0.4952381  P(node) =0.2678571
##     class counts:    52    53
##    probabilities: 0.495 0.505 
##   left son=12 (19 obs) right son=13 (86 obs)
##   Primary splits:
##       Age                      < 23.5      to the left,  improve=9.484222, (0 missing)
##       SkinThickness            < 22.5      to the left,  improve=6.218768, (0 missing)
##       BMI                      < 26.3      to the left,  improve=6.066667, (0 missing)
##       Pregnancies              < 7.5       to the left,  improve=3.996881, (0 missing)
##       DiabetesPedigreeFunction < 0.8925    to the left,  improve=2.925346, (0 missing)
##   Surrogate splits:
##       BMI           < 25.1      to the left,  agree=0.848, adj=0.158, (0 split)
##       SkinThickness < 20.5      to the left,  agree=0.829, adj=0.053, (0 split)
## 
## Node number 7: 46 observations
##   predicted class=1  expected loss=0.1086957  P(node) =0.1173469
##     class counts:     5    41
##    probabilities: 0.109 0.891 
## 
## Node number 10: 32 observations
##   predicted class=0  expected loss=0.0625  P(node) =0.08163265
##     class counts:    30     2
##    probabilities: 0.937 0.062 
## 
## Node number 11: 28 observations
##   predicted class=1  expected loss=0.3571429  P(node) =0.07142857
##     class counts:    10    18
##    probabilities: 0.357 0.643 
## 
## Node number 12: 19 observations
##   predicted class=0  expected loss=0.05263158  P(node) =0.04846939
##     class counts:    18     1
##    probabilities: 0.947 0.053 
## 
## Node number 13: 86 observations
##   predicted class=1  expected loss=0.3953488  P(node) =0.2193878
##     class counts:    34    52
##    probabilities: 0.395 0.605

Comparing the distributions of each variable between the original dataset and the newly cleaned dataset

# par(mfrow) only affects base graphics; each ggplot2 call below draws its own figure
par(mfrow=c(4,2))


# One box per Outcome level (no group=1, which would merge both levels into one box)
ggplot(clean, aes(x=Outcome, y=DiabetesPedigreeFunction, color=Outcome)) +
  geom_boxplot(alpha=0.7) +
  theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  coord_flip()

#-------------------------

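# Original (uncleaned) data: Outcome is still an integer here, so convert it to a factor
ggplot(diabetes, aes(x=factor(Outcome), y=Age, color=factor(Outcome))) +
  geom_boxplot(alpha=0.7, fill="yellow") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  xlab("Outcome") +
  ggtitle("Age distribution") +
  coord_flip()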

#-----------------------


ggplot(clean, aes(x=Outcome, y=BMI, color=Outcome)) +
  geom_boxplot(alpha=0.7, fill="lightgreen") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  ggtitle("BMI Distribution")

#-----------------------
      
ggplot(clean, aes(x=Outcome, y=BloodPressure, color=Outcome)) +
  geom_boxplot(alpha=0.7, fill="lightblue") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  ggtitle("Blood Pressure Distribution") +
  coord_flip()

# par(mfrow) has no effect on the ggplot2 bar charts below
par(mfrow=c(4,2))


ggplot(clean, aes(Outcome))+ 
        geom_bar(aes(group=Outcome, fill=Outcome)) + 
        ggtitle("Outcome Ratio")

#----------------------

ggplot(clean, aes(Pregnancies))+ 
        geom_bar(aes(group=Outcome, fill=Outcome)) + 
        ggtitle("Number of Pregnancies vs Outcome")

#---------------------------

ggplot(clean, aes(Glucose))+ 
        geom_bar(aes(group=Outcome, fill=Outcome)) + 
        ggtitle("Glucose level Distribution")

#------------------------

ggplot(clean, aes(BloodPressure))+ 
        geom_bar(aes(group=Outcome, fill=Outcome)) + 
        ggtitle("Blood Pressure Distribution")

dia_corr <- cor(clean[,-9])
corrplot(dia_corr, method="number")

Visual correlation analysis: plot each numeric variable against the target variable to show how its distribution differs between the two Outcome classes.

library(funModeling)
plotar(clean, target = 'Outcome', plot_type = "boxplot")

Exploratory Data Analysis

library(DataExplorer)
clean <- select(clean, c(Pregnancies,BMI,Age,SkinThickness,Insulin_normalize,DPF_normalize,
                            BloodPressure_scale,Glucose_scale,Outcome))



plot_intro(clean)

plot_histogram(clean)

plot_density(clean)

Loading caret for classification modeling

library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

set.seed(12345)
index <- createDataPartition(clean$Outcome, p = .75, list = FALSE)
training <- clean[ index,]
testing  <- clean[-index,]
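
`createDataPartition()` splits within each class so that the training and testing sets keep roughly the same Outcome ratio. The base R sketch below illustrates that idea only; it is not caret's implementation, and `stratified_index` is a hypothetical helper written for this example. The class counts (262 and 130) are taken from the root node of the tree summary above.

```r
set.seed(12345)

# Outcome vector with the class counts reported for the cleaned data
outcome <- factor(rep(c(0, 1), times = c(262, 130)))

# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index <- function(y, p) {
  per_class <- lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = ceiling(length(rows) * p))]
  })
  sort(unlist(per_class, use.names = FALSE))
}

index <- stratified_index(outcome, p = 0.75)

# Class counts in the training and testing portions
train_counts <- table(outcome[index])
test_counts <- table(outcome[-index])
print(train_counts)
print(test_counts)
```

With these counts the sketch puts 197 negatives and 98 positives in the training portion, 295 rows in total, which agrees with the training-set totals in the confusion matrices later in the report.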

fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated three times
                           repeats = 3)

# get predictors
predictors <- training %>% select(-Outcome)
outcome <- training$Outcome

library(mboost)

## Loading required package: parallel

## Loading required package: stabs

## This is mboost 2.9-1. See 'package?mboost' and 'news(package  = "mboost")'
## for a complete list of changes.

## 
## Attaching package: 'mboost'

## The following object is masked from 'package:ggplot2':
## 
##     %+%

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

glmboost_model <- train(Outcome ~ ., data = training, method = 'glmboost', trControl = fitControl)

# Predictions are made on the training rows, so the statistics below are in-sample
glmboost_pred <- predict(glmboost_model, training)

confusionMatrix(glmboost_pred, reference=training$Outcome, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 174  40
##          1  23  58
##                                           
##                Accuracy : 0.7864          
##                  95% CI : (0.7352, 0.8318)
##     No Information Rate : 0.6678          
##     P-Value [Acc > NIR] : 4.944e-06       
##                                           
##                   Kappa : 0.4967          
##                                           
##  Mcnemar's Test P-Value : 0.04382         
##                                           
##             Sensitivity : 0.5918          
##             Specificity : 0.8832          
##          Pos Pred Value : 0.7160          
##          Neg Pred Value : 0.8131          
##              Prevalence : 0.3322          
##          Detection Rate : 0.1966          
##    Detection Prevalence : 0.2746          
##       Balanced Accuracy : 0.7375          
##                                           
##        'Positive' Class : 1               
##
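
The statistics printed by `confusionMatrix()` follow directly from the four cell counts. As a worked check, the sketch below recomputes several of them in base R from the glmboost table above, with class 1 as the positive class; the variable names are introduced here for illustration.

```r
# Cell counts from the glmboost confusion matrix above
# (rows are predictions, columns are reference labels)
cm <- matrix(c(174, 23, 40, 58), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual positives that were found
specificity <- tn / (tn + fp)   # share of actual negatives that were found
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = kappa,
              Sensitivity = sensitivity, Specificity = specificity,
              PPV = ppv, NPV = npv, Balanced = balanced), 4))
```

The sensitivity of about 0.59 means that 40 of the 98 positive cases in the training data were missed, which the overall accuracy alone does not show.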

gbm_model <- train(Outcome ~ ., data = training, method = "gbm", trControl = fitControl, verbose = FALSE)

gbm_pred <- predict(gbm_model, training)
confusionMatrix(gbm_pred, reference=training$Outcome, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 183  22
##          1  14  76
##                                          
##                Accuracy : 0.878          
##                  95% CI : (0.8351, 0.913)
##     No Information Rate : 0.6678         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7192         
##                                          
##  Mcnemar's Test P-Value : 0.2433         
##                                          
##             Sensitivity : 0.7755         
##             Specificity : 0.9289         
##          Pos Pred Value : 0.8444         
##          Neg Pred Value : 0.8927         
##              Prevalence : 0.3322         
##          Detection Rate : 0.2576         
##    Detection Prevalence : 0.3051         
##       Balanced Accuracy : 0.8522         
##                                          
##        'Positive' Class : 1              
##

bayes_model<-train(Outcome ~ ., data=training, method="bayesglm",trControl=fitControl)
bayes_pred <- predict(bayes_model,training[,-9])
confusionMatrix(bayes_pred, reference=training$Outcome, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 176  43
##          1  21  55
##                                           
##                Accuracy : 0.7831          
##                  95% CI : (0.7316, 0.8287)
##     No Information Rate : 0.6678          
##     P-Value [Acc > NIR] : 9.118e-06       
##                                           
##                   Kappa : 0.4818          
##                                           
##  Mcnemar's Test P-Value : 0.008665        
##                                           
##             Sensitivity : 0.5612          
##             Specificity : 0.8934          
##          Pos Pred Value : 0.7237          
##          Neg Pred Value : 0.8037          
##              Prevalence : 0.3322          
##          Detection Rate : 0.1864          
##    Detection Prevalence : 0.2576          
##       Balanced Accuracy : 0.7273          
##                                           
##        'Positive' Class : 1               
##
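
Kappa and the McNemar p-value are less self-explanatory than accuracy, so a short check may help. The sketch below recomputes both from the bayesglm confusion matrix above using base R only. It assumes that the reported p-value comes from the continuity-corrected stats::mcnemar.test(), which is consistent with the number shown; the variable names are illustrative.

```r
# Confusion matrix copied from the bayesglm output above
# (rows = prediction, columns = reference).
cm <- matrix(c(176, 21, 43, 55), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
n <- sum(cm)

# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from the row and column totals.
p_observed  <- sum(diag(cm)) / n
p_expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

# McNemar's test compares the two off-diagonal cells (43 vs 21),
# i.e. whether false negatives and false positives are balanced.
mcnemar_p <- mcnemar.test(cm)$p.value

print(round(c(Kappa = cohen_kappa, McnemarP = mcnemar_p), 6))
```

A small McNemar p-value here indicates that the model's errors are asymmetric: it misses positive cases (43) more often than it raises false alarms (21).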

rf_model <- train(Outcome ~ ., data = training, method = "rf", trControl = fitControl)
rf_model

## Random Forest 
## 
## 295 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 236, 236, 236, 237, 235, 236, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7774823  0.4766071
##   5     0.7842626  0.4999003
##   8     0.7831716  0.4968882
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

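# Predicting on the same rows the forest was trained on: a random forest
# can memorize its training data, so the perfect scores below do not
# reflect out-of-sample performance (cross-validated accuracy is about 0.78).
rf_pred <- predict(rf_model, training[, -9])
confusionMatrix(rf_pred, reference = training$Outcome, positive = "1")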

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 197   0
##          1   0  98
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9876, 1)
##     No Information Rate : 0.6678     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3322     
##          Detection Rate : 0.3322     
##    Detection Prevalence : 0.3322     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
##

Accuracy Results

rf_model$results

##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    2 0.7774823 0.4766071 0.03336247 0.08195144
## 2    5 0.7842626 0.4999003 0.03458244 0.07886623
## 3    8 0.7831716 0.4968882 0.03703514 0.08552193

gbm_model$results

##   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa AccuracySD    KappaSD
## 1       0.1                 1             10      50 0.7853432 0.4928655 0.04532948 0.10868468
## 4       0.1                 2             10      50 0.7786791 0.4777174 0.05075836 0.12136668
## 7       0.1                 3             10      50 0.7854400 0.4997576 0.04926423 0.11778351
## 2       0.1                 1             10     100 0.7843470 0.4931914 0.05353235 0.12912093
## 5       0.1                 2             10     100 0.7753075 0.4786375 0.05079529 0.11866165
## 8       0.1                 3             10     100 0.7706728 0.4695441 0.04349296 0.10456867
## 3       0.1                 1             10     150 0.7707676 0.4646304 0.05285915 0.12834362
## 6       0.1                 2             10     150 0.7718417 0.4723185 0.04974845 0.11435936
## 9       0.1                 3             10     150 0.7695610 0.4602532 0.03841921 0.09709254

glmboost_model$results

##   mstop prune  Accuracy     Kappa AccuracySD    KappaSD
## 1    50    no 0.7763322 0.4720058 0.04198236 0.09490990
## 2   100    no 0.7707585 0.4601780 0.03344112 0.07363558
## 3   150    no 0.7696091 0.4572041 0.03653811 0.07965128

bayes_model$results

##   parameter  Accuracy    Kappa AccuracySD    KappaSD
## 1      none 0.7716962 0.461895 0.04175356 0.09568866

# Comparing Multiple Models
# All models below share the same trainControl object (fitControl). If the
# same seed is set before each train() call, every model is evaluated on the
# same folds, so the resampled accuracies are paired and the models can be
# compared directly with resamples().



rpartFit <- train(Outcome ~ ., data = training,
                  method = "rpart",
                  tuneLength = 7,
                  trControl = fitControl)
rfFit <- train(Outcome ~ ., data = training,
               method = "rf",
               tuneLength = 7,
               trControl = fitControl)
ctreeFit <- train(Outcome ~ ., data = training,
                  method = "ctree",
                  tuneLength = 7,
                  trControl = fitControl)
earthFit <- train(Outcome ~ ., data = training,
                  method = "earth",
                  tuneLength = 7,
                  trControl = fitControl)

## Loading required package: earth

## Loading required package: plotmo

## Loading required package: plotrix

## Loading required package: TeachingDemos

## 
## Attaching package: 'TeachingDemos'

## The following objects are masked from 'package:Hmisc':
## 
##     cnvrt.coords, subplot

## The following object is masked from 'package:corrr':
## 
##     dice

bayesFit <- train(Outcome ~ ., data = training,
                  method = "bayesglm",
                  tuneLength = 7,
                  trControl = fitControl)
svmFit <- train(Outcome ~ ., data = training,
                method = "svmLinear",
                tuneLength = 7,
                trControl = fitControl)
knnFit <- train(Outcome ~ ., data = training,
                method = "knn",
                tuneLength = 7,
                trControl = fitControl)
glmFit <- train(Outcome ~ ., data = training,
                method = "glmnet",
                tuneLength = 7,
                trControl = fitControl)
glmboostFit <- train(Outcome ~ ., data = training,
                     method = "glmboost",
                     tuneLength = 7,
                     trControl = fitControl)

#----------------------------------------

results <- resamples(list(rpart    = rpartFit,
                          ctree    = ctreeFit,
                          rforest  = rfFit,
                          earth    = earthFit,
                          bayes    = bayesFit,
                          svm      = svmFit,
                          knn      = knnFit,
                          glm      = glmFit,
                          glmboost = glmboostFit))

results

## 
## Call:
## resamples.default(x = list(rpart = rpartFit, ctree = ctreeFit, rforest
##  = rfFit, earth = earthFit, bayes = bayesFit, svm = svmFit, knn =
##  knnFit, glm = glmFit, glmboost = glmboostFit))
## 
## Models: rpart, ctree, rforest, earth, bayes, svm, knn, glm, glmboost 
## Number of resamples: 15 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit

 summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: rpart, ctree, rforest, earth, bayes, svm, knn, glm, glmboost 
## Number of resamples: 15 
## 
## Accuracy 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart    0.7118644 0.7646893 0.7796610 0.7786006 0.7948568 0.8474576    0
## ctree    0.6440678 0.6888418 0.7758621 0.7580629 0.8234463 0.8644068    0
## rforest  0.7288136 0.7543103 0.7931034 0.7841295 0.8051724 0.8474576    0
## earth    0.7413793 0.7711864 0.7796610 0.7863978 0.7983051 0.8474576    0
## bayes    0.6724138 0.7310734 0.7796610 0.7772336 0.8067797 0.8666667    0
## svm      0.6896552 0.7543103 0.7627119 0.7792181 0.8066384 0.8666667    0
## knn      0.5762712 0.6583333 0.6949153 0.6907078 0.7288136 0.8103448    0
## glm      0.7288136 0.7646893 0.7796610 0.7773563 0.7966102 0.8135593    0
## glmboost 0.7288136 0.7627119 0.7796610 0.7809195 0.7966102 0.8644068    0
## 
## Kappa 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart     0.33266800 0.4228063 0.4842681 0.4835966 0.5201240 0.6554186    0
## ctree     0.14831131 0.3078515 0.4980027 0.4475801 0.5854031 0.6898817    0
## rforest   0.36388140 0.4230774 0.5173370 0.5015296 0.5483675 0.6278907    0
## earth     0.36285097 0.4625088 0.4896873 0.4938157 0.5257738 0.6457638    0
## bayes     0.17883756 0.3821894 0.4889868 0.4712926 0.5585223 0.6923077    0
## svm       0.27600555 0.3947985 0.4520918 0.4752084 0.5470495 0.6923077    0
## knn      -0.06191505 0.1455241 0.2248175 0.2305024 0.3169548 0.5245902    0
## glm       0.34492968 0.4478916 0.4615385 0.4608551 0.5047779 0.5348226    0
## glmboost  0.38356164 0.4262553 0.4565789 0.4829798 0.5245902 0.6819407    0
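
To read the resampling summary at a glance, the mean accuracy and mean Kappa columns can be sorted. The sketch below is base R only and simply re-enters the Mean values printed by summary(results); the object names (mean_accuracy, mean_kappa, ranking) are illustrative.

```r
# Mean resampled accuracy and Kappa, copied from summary(results) above.
mean_accuracy <- c(rpart = 0.7786006, ctree = 0.7580629, rforest = 0.7841295,
                   earth = 0.7863978, bayes = 0.7772336, svm = 0.7792181,
                   knn = 0.6907078, glm = 0.7773563, glmboost = 0.7809195)
mean_kappa <- c(rpart = 0.4835966, ctree = 0.4475801, rforest = 0.5015296,
                earth = 0.4938157, bayes = 0.4712926, svm = 0.4752084,
                knn = 0.2305024, glm = 0.4608551, glmboost = 0.4829798)

ranking <- data.frame(
  model    = names(mean_accuracy),
  accuracy = unname(mean_accuracy),
  kappa    = unname(mean_kappa[names(mean_accuracy)])
)

# Sort from highest to lowest mean accuracy.
ranking <- ranking[order(-ranking$accuracy), ]
print(ranking, row.names = FALSE)

# Spread between the best and the second-worst model; knn is a clear outlier.
print(round(max(mean_accuracy) - sort(mean_accuracy)[2], 4))
```

Apart from knn, the models lie within a few percentage points of each other, which is small relative to the fold-to-fold variation in the Min. and Max. columns, so the ordering should be read with caution. With the results object still in the session, caret's summary(diff(results)) and bwplot(results) give paired differences and a box plot of the same comparison.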
