** Diabetes ** Diabetes symptoms vary depending on how much your blood sugar is elevated. Some people, especially those with prediabetes or type 2 diabetes, may not experience symptoms initially. In type 1 diabetes, symptoms tend to come on quickly and be more severe.
** Symptoms include: **
- Increased thirst
- Frequent urination
- Extreme hunger
- Unexplained weight loss
- Presence of ketones in the urine (ketones are a byproduct of the breakdown of muscle and fat that happens when there’s not enough available insulin)
- Fatigue
- Irritability
- Blurred vision
- Slow-healing sores
- Frequent infections, such as gum or skin infections and vaginal infections
Type 1 diabetes can develop at any age, though it often appears during childhood or adolescence. Type 2 diabetes, the more common type, can also develop at any age, though it occurs more often in people older than 40.
** Analysis of Diabetes Kim Phan Graduate Student at University of California, Riverside September 2019 **
Loading libraries
library(data.table)
library(DataExplorer)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following objects are masked from 'package:data.table':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(corrplot)
## corrplot 0.84 loaded
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
Loading dataset
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Description of the data set: The data frame has 768 observations of 9 variables. Eight are numeric predictors (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age), and Outcome is a binary indicator of diabetes (1 = diabetic, 0 = not diabetic), with about 35% of records positive. This structure matches the Pima Indians Diabetes data set. The records carry no timestamps, so each row is a single cross-sectional measurement of one patient.
# Some rows contain zero values, which most likely represent missing measurements, in these columns: Insulin, BMI, BloodPressure, Glucose, and SkinThickness
## Rows with zeros in Insulin, Glucose, or BMI are therefore removed; the summary afterwards shows that no zeros remain in BloodPressure or SkinThickness either.
# Note: zeros in Pregnancies and Outcome are valid values and are kept
# Remove rows with zero (missing) values
clean <- filter(diabetes, Insulin !=0)
clean <- filter(clean, Glucose !=0)
clean <- filter(clean, BMI !=0)
summary(clean)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
## Median : 2.000 Median :119.0 Median : 70.00 Median :29.00
## Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
## Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00
## Median :125.50 Median :33.20 Median :0.4495 Median :27.00
## Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86
## 3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3316
## 3rd Qu.:1.0000
## Max. :1.0000
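The zero-filtering step can also be written by first recoding implausible zeros as NA, which makes the number of affected values visible before any row is dropped. The following is a minimal sketch on a small made-up data frame: `toy`, its values, and the helper names are hypothetical, and only the column names are taken from the analysis.

```r
# Hypothetical toy data: zeros in the first three columns stand for "not measured".
toy <- data.frame(
  Glucose     = c(148, 85, 0, 89, 137),
  Insulin     = c(0, 0, 94, 168, 88),
  BMI         = c(33.6, 26.6, 23.3, 0, 43.1),
  Pregnancies = c(6, 1, 8, 0, 0)   # a zero here is a valid value
)

# Columns in which a zero is physiologically implausible.
zero_as_missing <- c("Glucose", "Insulin", "BMI")

# Recode zeros to NA in those columns only.
toy_na <- toy
toy_na[zero_as_missing] <- lapply(
  toy_na[zero_as_missing],
  function(x) replace(x, x == 0, NA)
)

# Count missing values per column before dropping anything.
missing_per_column <- colSums(is.na(toy_na))
print(missing_per_column)

# Keep only the rows that are complete in the recoded columns.
toy_clean <- toy_na[complete.cases(toy_na[zero_as_missing]), ]
print(toy_clean)
```

Recoding before filtering keeps valid zeros (such as zero pregnancies) untouched and leaves the option of imputing the missing values instead of discarding the rows.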
# Make outcome factor
clean$Outcome <- as.factor(clean$Outcome)
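Normalizing and scaling dataset
# Min-max normalization: rescales x to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
clean$Insulin_normalize <- normalize(clean$Insulin)
clean$DPF_normalize <- normalize(clean$DiabetesPedigreeFunction)
# With center = FALSE, scale() divides by the root mean square rather than the standard deviation;
# note that scale() returns a one-column matrix, not a plain numeric vector
clean$BloodPressure_scale <- scale(clean$BloodPressure, center = FALSE, scale = TRUE)
clean$Glucose_scale <- scale(clean$Glucose, center = FALSE, scale = TRUE)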
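To make the two rescaling steps concrete, here is a small worked sketch on made-up numbers (the vector `x` is hypothetical). It checks that min-max normalization maps the values onto [0, 1], and that `scale()` with `center = FALSE` divides by the root mean square, defined in the `scale()` documentation as `sqrt(sum(x^2) / (n - 1))`, and returns a matrix.

```r
# Same min-max helper as in the analysis.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(14, 76, 125, 190, 846)   # made-up values for illustration

# Min-max normalization: smallest value becomes 0, largest becomes 1.
x_norm <- normalize(x)
print(range(x_norm))

# Root mean square as used by scale() when center = FALSE.
rms <- sqrt(sum(x^2) / (length(x) - 1))

# scale() returns a one-column matrix with a "scaled:scale" attribute.
x_scaled <- scale(x, center = FALSE, scale = TRUE)
print(is.matrix(x_scaled))

# Converting to a plain vector gives exactly x / rms.
x_scaled_vec <- as.numeric(x_scaled)
print(all.equal(x_scaled_vec, x / rms))
```

Wrapping the call in `as.numeric()` is one way to store a plain numeric column in a data frame instead of a matrix column.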
Correlation between features
library(corrplot)
library(CorrToolBox)
library(corrr)
# Column 9 is Outcome (a factor), so it is excluded from the correlation matrix
M <- cor(clean[, -9])
corrplot(M, method = "number")
corrplot(M, method = "number", type = "upper", order = "hclust")
plot_boxplot(clean, by = "Outcome", nrow = 4L, ncol = 2L, title = "Boxplot of variables vs Outcome")
plot_density(clean, nrow = 4L, ncol = 2L)
chart.Correlation(clean[, -9], histogram = TRUE, col = "grey10", pch = 1, main = "Correlation chart of variables")
library(C50)
library(tree)
hist(clean$BMI)
plot(density(clean$BMI, na.rm = TRUE))
pairs(clean[, -9])
cor(clean[, -9])
## Pregnancies Glucose BloodPressure
## Pregnancies 1.000000000 0.1982910 0.2133548
## Glucose 0.198291043 1.0000000 0.2100266
## BloodPressure 0.213354775 0.2100266 1.0000000
## SkinThickness 0.093209397 0.1988558 0.2325712
## Insulin 0.078983625 0.5812230 0.0985115
## BMI -0.025347276 0.2095159 0.3044034
## DiabetesPedigreeFunction 0.007562116 0.1401802 -0.0159711
## Age 0.679608470 0.3436415 0.3000389
## Insulin_normalize 0.078983625 0.5812230 0.0985115
## DPF_normalize 0.007562116 0.1401802 -0.0159711
## BloodPressure_scale 0.213354775 0.2100266 1.0000000
## Glucose_scale 0.198291043 1.0000000 0.2100266
## SkinThickness Insulin BMI
## Pregnancies 0.0932094 0.07898363 -0.02534728
## Glucose 0.1988558 0.58122301 0.20951592
## BloodPressure 0.2325712 0.09851150 0.30440337
## SkinThickness 1.0000000 0.18219906 0.66435487
## Insulin 0.1821991 1.00000000 0.22639652
## BMI 0.6643549 0.22639652 1.00000000
## DiabetesPedigreeFunction 0.1604985 0.13590578 0.15877104
## Age 0.1677611 0.21708199 0.06981380
## Insulin_normalize 0.1821991 1.00000000 0.22639652
## DPF_normalize 0.1604985 0.13590578 0.15877104
## BloodPressure_scale 0.2325712 0.09851150 0.30440337
## Glucose_scale 0.1988558 0.58122301 0.20951592
## DiabetesPedigreeFunction Age
## Pregnancies 0.007562116 0.67960847
## Glucose 0.140180180 0.34364150
## BloodPressure -0.015971104 0.30003895
## SkinThickness 0.160498526 0.16776114
## Insulin 0.135905781 0.21708199
## BMI 0.158771043 0.06981380
## DiabetesPedigreeFunction 1.000000000 0.08502911
## Age 0.085029106 1.00000000
## Insulin_normalize 0.135905781 0.21708199
## DPF_normalize 1.000000000 0.08502911
## BloodPressure_scale -0.015971104 0.30003895
## Glucose_scale 0.140180180 0.34364150
## Insulin_normalize DPF_normalize
## Pregnancies 0.07898363 0.007562116
## Glucose 0.58122301 0.140180180
## BloodPressure 0.09851150 -0.015971104
## SkinThickness 0.18219906 0.160498526
## Insulin 1.00000000 0.135905781
## BMI 0.22639652 0.158771043
## DiabetesPedigreeFunction 0.13590578 1.000000000
## Age 0.21708199 0.085029106
## Insulin_normalize 1.00000000 0.135905781
## DPF_normalize 0.13590578 1.000000000
## BloodPressure_scale 0.09851150 -0.015971104
## Glucose_scale 0.58122301 0.140180180
## BloodPressure_scale Glucose_scale
## Pregnancies 0.2133548 0.1982910
## Glucose 0.2100266 1.0000000
## BloodPressure 1.0000000 0.2100266
## SkinThickness 0.2325712 0.1988558
## Insulin 0.0985115 0.5812230
## BMI 0.3044034 0.2095159
## DiabetesPedigreeFunction -0.0159711 0.1401802
## Age 0.3000389 0.3436415
## Insulin_normalize 0.0985115 0.5812230
## DPF_normalize -0.0159711 0.1401802
## BloodPressure_scale 1.0000000 0.2100266
## Glucose_scale 0.2100266 1.0000000
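In the matrix above, each rescaled column (`Insulin_normalize`, `DPF_normalize`, `BloodPressure_scale`, `Glucose_scale`) has a correlation of exactly 1 with its source column and repeats that column's correlations with everything else. This is because Pearson correlation does not change under a positive linear rescaling. A minimal sketch with simulated values (the variables below are made up for illustration, not taken from the data set):

```r
set.seed(1)

# Simulated, hypothetical measurements.
insulin <- rexp(50, rate = 1 / 150)
glucose <- 100 + 0.1 * insulin + rnorm(50, sd = 20)

# Same min-max helper as in the analysis.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
insulin_norm <- normalize(insulin)

# A variable and its rescaled copy are perfectly correlated.
r_self <- cor(insulin, insulin_norm)

# Correlation with a third variable is the same before and after rescaling.
r_raw  <- cor(glucose, insulin)
r_norm <- cor(glucose, insulin_norm)

print(c(r_self = r_self, r_raw = r_raw, r_norm = r_norm))
```

The rescaled columns therefore add no new information to a correlation analysis; keeping either the raw or the rescaled version is enough.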
#-----------------------------------------
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Registered S3 method overwritten by 'cli':
## method from
## print.tree tree
## funModeling v.1.9.2 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
plot_num(clean, bins = 10, path_out = NA)
## Warning: attributes are not identical across measure variables; they will
## be dropped
correlation_table(data=clean, target="Outcome")
## Variable Outcome
## 1 Outcome 1.00
## 2 Glucose 0.52
## 3 Age 0.35
## 4 Insulin 0.30
## 5 Insulin_normalize 0.30
## 6 BMI 0.27
## 7 Pregnancies 0.26
## 8 SkinThickness 0.26
## 9 DiabetesPedigreeFunction 0.21
## 10 DPF_normalize 0.21
## 11 BloodPressure 0.19
profiling_num(clean)
## variable mean std_dev variation_coef
## 1 Pregnancies 3.3010204 3.2114245 0.9728581
## 2 Glucose 122.6275510 30.8607806 0.2516627
## 3 BloodPressure 70.6632653 12.4960916 0.1768400
## 4 SkinThickness 29.1454082 10.5164239 0.3608261
## 5 Insulin 156.0561224 118.8416898 0.7615317
## 6 BMI 33.0862245 7.0276592 0.2124044
## 7 DiabetesPedigreeFunction 0.5230459 0.3454880 0.6605310
## 8 Age 30.8647959 10.2007765 0.3304988
## 9 Insulin_normalize 0.1707405 0.1428386 0.8365827
## 10 DPF_normalize 0.1876000 0.1479606 0.7887028
## p_01 p_05 p_25 p_50 p_75 p_95
## 1 0.000000000 0.0000000 1.00000000 2.0000000 5.0000000 10.0000000
## 2 70.730000000 81.0000000 99.00000000 119.0000000 143.0000000 181.0000000
## 3 39.820000000 50.0000000 62.00000000 70.0000000 78.0000000 90.0000000
## 4 10.000000000 13.0000000 21.00000000 29.0000000 37.0000000 46.4500000
## 5 18.000000000 42.5500000 76.75000000 125.5000000 190.0000000 396.5000000
## 6 19.500000000 22.2550000 28.40000000 33.2000000 37.1000000 45.2450000
## 7 0.106460000 0.1535500 0.26975000 0.4495000 0.6870000 1.1603500
## 8 21.000000000 21.0000000 23.00000000 27.0000000 36.0000000 52.4500000
## 9 0.004807692 0.0343149 0.07542067 0.1340144 0.2115385 0.4597356
## 10 0.009190578 0.0293576 0.07912206 0.1561028 0.2578158 0.4605353
## p_99 skewness kurtosis iqr
## 1 13.0000000 1.33048013 4.452184 4.0000000
## 2 196.0000000 0.51586626 2.507647 44.0000000
## 3 102.3600000 -0.08718115 3.770028 16.0000000
## 4 52.0000000 0.20850902 2.532854 16.0000000
## 5 580.8900000 2.15682248 9.260449 113.2500000
## 6 53.3620000 0.66094351 4.521463 8.7000000
## 7 1.7384200 1.95159662 9.270504 0.4172500
## 8 60.0000000 1.39822988 4.700180 13.0000000
## 9 0.6813582 2.15682248 9.260449 0.1361178
## 10 0.7081028 1.95159662 9.270504 0.1786938
## range_98
## 1 [0, 13]
## 2 [70.73, 196]
## 3 [39.82, 102.36]
## 4 [10, 52]
## 5 [18, 580.889999999999]
## 6 [19.5, 53.362]
## 7 [0.10646, 1.73841999999999]
## 8 [21, 60]
## 9 [0.00480769230769231, 0.681358173076922]
## 10 [0.00919057815845824, 0.708102783725905]
## range_80
## 1 [0, 8]
## 2 [87, 170.9]
## 3 [56, 86]
## 4 [15, 43]
## 5 [51.1, 292.8]
## 6 [24.42, 42.07]
## 7 [0.1834, 0.9422]
## 8 [22, 46]
## 9 [0.0445913461538462, 0.335096153846154]
## 10 [0.0421413276231263, 0.36710920770878]
Creating a Tree diagram
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Outcome~., data=clean, cp=.02,method="class")
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)
#-----------------------------
prp(tree, main = "Decision tree for Outcome",
  extra = 106, # display the probability of class 1 (diabetes) and the percent of obs
  nn = TRUE, # display the node numbers
  fallen.leaves = TRUE, # put the leaves on the bottom of the page
  shadow.col = "gray", # shadows under the leaves
  branch.lty = 3, # draw branches using dotted lines
  branch = .5, # change angle of branch lines
  faclen = 0, # faclen = 0 to print full factor names
  trace = 1, # print the auto calculated cex, xlim, ylim
  split.cex = 1.2, # make the split text larger than the node text
  split.prefix = "is ", # put "is " before split text
  split.suffix = "?", # put "?" after split text
  split.box.col = "lightgreen", # light green split boxes (default is white)
  split.border.col = "darkgray", # dark gray border on split boxes
  split.round = .5) # round the split box corners a tad
## cex 1 xlim c(0, 1) ylim c(0, 1)
#-----------------
printcp(tree) # display the results
##
## Classification tree:
## rpart(formula = Outcome ~ ., data = clean, method = "class",
## cp = 0.02)
##
## Variables actually used in tree construction:
## [1] Age Glucose Insulin
##
## Root node error: 130/392 = 0.33163
##
## n= 392
##
## CP nsplit rel error xerror xstd
## 1 0.284615 0 1.00000 1.00000 0.071703
## 2 0.065385 1 0.71538 0.71538 0.064787
## 3 0.030769 3 0.58462 0.69231 0.064053
## 4 0.020000 5 0.52308 0.76923 0.066390
plotcp(tree) # visualize cross-validation results
summary(tree) # detailed summary of splits
## Call:
## rpart(formula = Outcome ~ ., data = clean, method = "class",
## cp = 0.02)
## n= 392
##
## CP nsplit rel error xerror xstd
## 1 0.28461538 0 1.0000000 1.0000000 0.07170277
## 2 0.06538462 1 0.7153846 0.7153846 0.06478742
## 3 0.03076923 3 0.5846154 0.6923077 0.06405283
## 4 0.02000000 5 0.5230769 0.7692308 0.06639036
##
## Variable importance
## Glucose Glucose_scale Age
## 28 27 14
## Insulin Insulin_normalize Pregnancies
## 10 10 5
## SkinThickness BloodPressure BloodPressure_scale
## 1 1 1
## BMI
## 1
##
## Node number 1: 392 observations, complexity param=0.2846154
## predicted class=0 expected loss=0.3316327 P(node) =1
## class counts: 262 130
## probabilities: 0.668 0.332
## left son=2 (241 obs) right son=3 (151 obs)
## Primary splits:
## Glucose < 127.5 to the left, improve=41.56381, (0 missing)
## Glucose_scale < 1.007084 to the left, improve=41.56381, (0 missing)
## Insulin < 121 to the left, improve=25.03804, (0 missing)
## Insulin_normalize < 0.1286058 to the left, improve=25.03804, (0 missing)
## Age < 28.5 to the left, improve=24.48980, (0 missing)
## Surrogate splits:
## Glucose_scale < 1.007084 to the left, agree=1.000, adj=1.000, (0 split)
## Insulin < 121 to the left, agree=0.742, adj=0.331, (0 split)
## Insulin_normalize < 0.1286058 to the left, agree=0.742, adj=0.331, (0 split)
## Age < 33.5 to the left, agree=0.673, adj=0.152, (0 split)
## Pregnancies < 6.5 to the left, agree=0.658, adj=0.113, (0 split)
##
## Node number 2: 241 observations, complexity param=0.03076923
## predicted class=0 expected loss=0.1493776 P(node) =0.6147959
## class counts: 205 36
## probabilities: 0.851 0.149
## left son=4 (181 obs) right son=5 (60 obs)
## Primary splits:
## Insulin < 143.5 to the left, improve=5.406876, (0 missing)
## Insulin_normalize < 0.155649 to the left, improve=5.406876, (0 missing)
## DiabetesPedigreeFunction < 0.6385 to the left, improve=5.179878, (0 missing)
## DPF_normalize < 0.237045 to the left, improve=5.179878, (0 missing)
## Age < 28.5 to the left, improve=5.046049, (0 missing)
## Surrogate splits:
## Insulin_normalize < 0.155649 to the left, agree=1.000, adj=1.000, (0 split)
## Glucose < 121.5 to the left, agree=0.776, adj=0.100, (0 split)
## Glucose_scale < 0.9596916 to the left, agree=0.776, adj=0.100, (0 split)
## Age < 50.5 to the left, agree=0.768, adj=0.067, (0 split)
## SkinThickness < 9 to the right, agree=0.759, adj=0.033, (0 split)
##
## Node number 3: 151 observations, complexity param=0.06538462
## predicted class=1 expected loss=0.3774834 P(node) =0.3852041
## class counts: 57 94
## probabilities: 0.377 0.623
## left son=6 (105 obs) right son=7 (46 obs)
## Primary splits:
## Glucose < 165.5 to the left, improve=9.558606, (0 missing)
## Glucose_scale < 1.307234 to the left, improve=9.558606, (0 missing)
## BMI < 29.6 to the left, improve=8.657035, (0 missing)
## Age < 23.5 to the left, improve=7.919643, (0 missing)
## SkinThickness < 21.5 to the left, improve=5.499145, (0 missing)
## Surrogate splits:
## Glucose_scale < 1.307234 to the left, agree=1.000, adj=1.000, (0 split)
## Insulin < 452.5 to the left, agree=0.709, adj=0.043, (0 split)
## DiabetesPedigreeFunction < 1.764 to the left, agree=0.709, adj=0.043, (0 split)
## Age < 51.5 to the left, agree=0.709, adj=0.043, (0 split)
## Insulin_normalize < 0.5270433 to the left, agree=0.709, adj=0.043, (0 split)
##
## Node number 4: 181 observations
## predicted class=0 expected loss=0.08839779 P(node) =0.4617347
## class counts: 165 16
## probabilities: 0.912 0.088
##
## Node number 5: 60 observations, complexity param=0.03076923
## predicted class=0 expected loss=0.3333333 P(node) =0.1530612
## class counts: 40 20
## probabilities: 0.667 0.333
## left son=10 (32 obs) right son=11 (28 obs)
## Primary splits:
## Age < 28.5 to the left, improve=10.059520, (0 missing)
## Pregnancies < 7.5 to the left, improve= 3.205128, (0 missing)
## DiabetesPedigreeFunction < 0.6375 to the left, improve= 2.986667, (0 missing)
## DPF_normalize < 0.2366167 to the left, improve= 2.986667, (0 missing)
## BloodPressure < 73 to the left, improve= 1.843810, (0 missing)
## Surrogate splits:
## Pregnancies < 3.5 to the left, agree=0.783, adj=0.536, (0 split)
## BloodPressure < 63 to the left, agree=0.633, adj=0.214, (0 split)
## SkinThickness < 44 to the left, agree=0.633, adj=0.214, (0 split)
## BloodPressure_scale < 0.8768439 to the left, agree=0.633, adj=0.214, (0 split)
## Glucose < 104.5 to the right, agree=0.617, adj=0.179, (0 split)
##
## Node number 6: 105 observations, complexity param=0.06538462
## predicted class=1 expected loss=0.4952381 P(node) =0.2678571
## class counts: 52 53
## probabilities: 0.495 0.505
## left son=12 (19 obs) right son=13 (86 obs)
## Primary splits:
## Age < 23.5 to the left, improve=9.484222, (0 missing)
## SkinThickness < 22.5 to the left, improve=6.218768, (0 missing)
## BMI < 26.3 to the left, improve=6.066667, (0 missing)
## Pregnancies < 7.5 to the left, improve=3.996881, (0 missing)
## DiabetesPedigreeFunction < 0.8925 to the left, improve=2.925346, (0 missing)
## Surrogate splits:
## BMI < 25.1 to the left, agree=0.848, adj=0.158, (0 split)
## SkinThickness < 20.5 to the left, agree=0.829, adj=0.053, (0 split)
##
## Node number 7: 46 observations
## predicted class=1 expected loss=0.1086957 P(node) =0.1173469
## class counts: 5 41
## probabilities: 0.109 0.891
##
## Node number 10: 32 observations
## predicted class=0 expected loss=0.0625 P(node) =0.08163265
## class counts: 30 2
## probabilities: 0.937 0.062
##
## Node number 11: 28 observations
## predicted class=1 expected loss=0.3571429 P(node) =0.07142857
## class counts: 10 18
## probabilities: 0.357 0.643
##
## Node number 12: 19 observations
## predicted class=0 expected loss=0.05263158 P(node) =0.04846939
## class counts: 18 1
## probabilities: 0.947 0.053
##
## Node number 13: 86 observations
## predicted class=1 expected loss=0.3953488 P(node) =0.2193878
## class counts: 34 52
## probabilities: 0.395 0.605
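The complexity table printed by `printcp(tree)` can be used to choose how far to prune. The sketch below copies the four rows of that table into a data frame and applies two common selection rules: the row with the lowest cross-validated error, and the "one standard error" rule (the smallest tree whose `xerror` is within one `xstd` of the minimum). The data frame and variable names are illustrative, and because `xerror` comes from random cross-validation, a rerun of `rpart()` may give a slightly different table.

```r
# Complexity table copied from the printcp(tree) output above.
cp_table <- data.frame(
  CP     = c(0.284615, 0.065385, 0.030769, 0.020000),
  nsplit = c(0, 1, 3, 5),
  xerror = c(1.00000, 0.71538, 0.69231, 0.76923),
  xstd   = c(0.071703, 0.064787, 0.064053, 0.066390)
)

# Rule 1: row with the smallest cross-validated error.
best <- which.min(cp_table$xerror)

# Rule 2 (one-standard-error rule): the first, i.e. smallest, tree whose
# xerror does not exceed the minimum xerror plus its standard error.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

cp_min    <- cp_table$CP[best]
cp_one_se <- cp_table$CP[one_se]

print(cp_table[c(best, one_se), ])

# With the fitted model at hand, the chosen value could then be passed to
# rpart's prune(), for example: pruned <- prune(tree, cp = cp_one_se)
```

For this table the minimum-error rule picks the tree with three splits, while the one-standard-error rule prefers the single split on Glucose.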
Comparing graphic distributions of each variable, between the original dataset and the newly cleaned dataset
# ggplot2 draws with grid graphics, so par(mfrow) does not arrange these plots; each plot is drawn on its own
ggplot(clean, aes(x = Outcome, y = DiabetesPedigreeFunction, color = Outcome)) +
  geom_boxplot(alpha = 0.7) +
  theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  coord_flip()
#-------------------------
# Outcome is still an integer in the original data, so it is converted to a factor for grouping
ggplot(diabetes, aes(x = factor(Outcome), y = Age, color = factor(Outcome))) +
  geom_boxplot(alpha = 0.7, fill = "yellow") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  xlab("Outcome") +
  ggtitle("Age distribution") +
  coord_flip()
#-----------------------
ggplot(clean, aes(x = Outcome, y = BMI, color = Outcome)) +
  geom_boxplot(alpha = 0.7, fill = "lightgreen") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  ggtitle("BMI Distribution")
#-----------------------
ggplot(clean, aes(x = Outcome, y = BloodPressure, color = Outcome)) +
  geom_boxplot(alpha = 0.7, fill = "lightblue") + theme_bw() +
  scale_colour_brewer(palette = "Set1", name = "Diabetes") +
  ggtitle("Blood Pressure Distribution") +
  coord_flip()
# par(mfrow) has no effect on ggplot2 plots, so each bar chart below is drawn separately
ggplot(clean, aes(Outcome))+
geom_bar(aes(group=Outcome, fill=Outcome)) +
ggtitle("Outcome Ratio")
#----------------------
ggplot(clean, aes(Pregnancies))+
geom_bar(aes(group=Outcome, fill=Outcome)) +
ggtitle("Number of Pregnancies vs Outcome")
#---------------------------
ggplot(clean, aes(Glucose))+
geom_bar(aes(group=Outcome, fill=Outcome)) +
ggtitle("Glucose level Distribution")
#------------------------
ggplot(clean, aes(BloodPressure))+
geom_bar(aes(group=Outcome, fill=Outcome)) +
ggtitle("Blood Pressure Distribution")
dia_corr <- cor(clean[,-9])
corrplot(dia_corr, method="number")
Visual correlation analysis: plot each numeric variable against the target variable to show how its distribution differs between outcomes
library(funModeling)
plotar(clean, target = 'Outcome', plot_type = "boxplot")
Exploratory Data Analysis
library(DataExplorer)
clean <- select(clean, c(Pregnancies,BMI,Age,SkinThickness,Insulin_normalize,DPF_normalize,
BloodPressure_scale,Glucose_scale,Outcome))
plot_intro(clean)
plot_histogram(clean)
plot_density(clean)
Loading caret for classification modeling
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
set.seed(12345)
index <- createDataPartition(clean$Outcome, p = .75, list = FALSE)
training <- clean[ index,]
testing <- clean[-index,]
fitControl <- trainControl(## 5-fold CV
  method = "repeatedcv",
  number = 5,
  ## repeated three times
  repeats = 3)
# get predictors
predictors <- training %>% select(-Outcome)
outcome <- training$Outcome
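The resampling scheme set up in `fitControl` (5 folds, repeated 3 times) yields 15 fit-and-evaluate rounds per model. The sketch below reproduces that bookkeeping in base R for 295 training rows; it is an unstratified illustration with hypothetical names, not caret's own implementation, which additionally balances the outcome classes across folds.

```r
set.seed(12345)

n       <- 295  # number of training rows
k       <- 5    # folds per repeat
repeats <- 3    # number of repeats

# For every repeat, shuffle fold labels and record which rows are used for fitting.
resample_rows <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    name <- sprintf("Fold%d.Rep%d", f, r)
    resample_rows[[name]] <- which(fold_id != f)  # rows outside fold f
  }
}

# 15 resamples in total; each fits on 236 rows and holds out 59.
print(length(resample_rows))
print(unique(lengths(resample_rows)))
```

Each model's reported accuracy is then the average over these 15 held-out folds, which is why it is a fairer estimate than accuracy measured on the rows used for fitting.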
library(mboost)
## Loading required package: parallel
## Loading required package: stabs
## This is mboost 2.9-1. See 'package?mboost' and 'news(package = "mboost")'
## for a complete list of changes.
##
## Attaching package: 'mboost'
## The following object is masked from 'package:ggplot2':
##
## %+%
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
glmboost_model <- train(Outcome ~ ., data = training, method = 'glmboost', trControl=fitControl)
# Predictions here are made on the training data, so the resulting metrics are optimistic
glmboost_pred <- predict(glmboost_model, training)
confusionMatrix(glmboost_pred, reference=training$Outcome, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 174 40
## 1 23 58
##
## Accuracy : 0.7864
## 95% CI : (0.7352, 0.8318)
## No Information Rate : 0.6678
## P-Value [Acc > NIR] : 4.944e-06
##
## Kappa : 0.4967
##
## Mcnemar's Test P-Value : 0.04382
##
## Sensitivity : 0.5918
## Specificity : 0.8832
## Pos Pred Value : 0.7160
## Neg Pred Value : 0.8131
## Prevalence : 0.3322
## Detection Rate : 0.1966
## Detection Prevalence : 0.2746
## Balanced Accuracy : 0.7375
##
## 'Positive' Class : 1
##
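The statistics reported by `confusionMatrix()` follow directly from the four cell counts. As a worked check, the sketch below recomputes the main figures for the glmboost matrix above, with class 1 as the positive class; the variable names are chosen here for illustration.

```r
# Cell counts from the glmboost confusion matrix (rows = prediction, columns = reference).
tn <- 174  # predicted 0, reference 0
fn <- 40   # predicted 0, reference 1
fp <- 23   # predicted 1, reference 0
tp <- 58   # predicted 1, reference 1
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of true diabetics that are detected
specificity <- tn / (tn + fp)   # share of non-diabetics correctly cleared
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

The low sensitivity relative to specificity shows that most of this model's errors are missed diabetes cases rather than false alarms.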
gbm_model <- train(Outcome ~ ., data = training, method = "gbm",trControl=fitControl,verbose=FALSE)
gbm_pred <- predict(gbm_model,training)
confusionMatrix(gbm_pred, reference=training$Outcome, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 183 22
## 1 14 76
##
## Accuracy : 0.878
## 95% CI : (0.8351, 0.913)
## No Information Rate : 0.6678
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7192
##
## Mcnemar's Test P-Value : 0.2433
##
## Sensitivity : 0.7755
## Specificity : 0.9289
## Pos Pred Value : 0.8444
## Neg Pred Value : 0.8927
## Prevalence : 0.3322
## Detection Rate : 0.2576
## Detection Prevalence : 0.3051
## Balanced Accuracy : 0.8522
##
## 'Positive' Class : 1
##
bayes_model<-train(Outcome ~ ., data=training, method="bayesglm",trControl=fitControl)
bayes_pred <- predict(bayes_model,training[,-9])
confusionMatrix(bayes_pred, reference=training$Outcome, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 176 43
## 1 21 55
##
## Accuracy : 0.7831
## 95% CI : (0.7316, 0.8287)
## No Information Rate : 0.6678
## P-Value [Acc > NIR] : 9.118e-06
##
## Kappa : 0.4818
##
## Mcnemar's Test P-Value : 0.008665
##
## Sensitivity : 0.5612
## Specificity : 0.8934
## Pos Pred Value : 0.7237
## Neg Pred Value : 0.8037
## Prevalence : 0.3322
## Detection Rate : 0.1864
## Detection Prevalence : 0.2576
## Balanced Accuracy : 0.7273
##
## 'Positive' Class : 1
##
rf_model<-train(Outcome ~ ., data=training, method="rf",trControl=fitControl)
rf_model
## Random Forest
##
## 295 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 236, 236, 236, 237, 235, 236, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7774823 0.4766071
## 5 0.7842626 0.4999003
## 8 0.7831716 0.4968882
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
rf_pred <- predict(rf_model,training[,-9])
confusionMatrix(rf_pred, reference=training$Outcome, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 197 0
## 1 0 98
##
## Accuracy : 1
## 95% CI : (0.9876, 1)
## No Information Rate : 0.6678
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3322
## Detection Rate : 0.3322
## Detection Prevalence : 0.3322
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##
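The perfect confusion matrix above is measured on the same rows the random forest was fitted to, while its cross-validated accuracy is about 0.78, so the value of 1 mostly reflects memorization of the training data. The sketch below illustrates the effect with a deliberately simple stand-in, a 1-nearest-neighbour classifier written in base R on simulated data; it is not the random forest used in the analysis, and all names and values are hypothetical.

```r
set.seed(42)

# Simulated two-feature data with noisy labels.
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- as.integer(x[, 1] + rnorm(n) > 0)

train_rows <- 1:150
test_rows  <- 151:200

# 1-nearest-neighbour prediction: return the label of the closest training point.
predict_1nn <- function(train_x, train_y, new_x) {
  apply(new_x, 1, function(p) {
    d <- colSums((t(train_x) - p)^2)  # squared Euclidean distances to p
    train_y[which.min(d)]
  })
}

# On the training rows every point is its own nearest neighbour.
train_pred <- predict_1nn(x[train_rows, ], y[train_rows], x[train_rows, ])
train_acc  <- mean(train_pred == y[train_rows])

# On held-out rows the model has to generalize.
test_pred <- predict_1nn(x[train_rows, ], y[train_rows], x[test_rows, ])
test_acc  <- mean(test_pred == y[test_rows])

print(c(train_acc = train_acc, test_acc = test_acc))
```

In the analysis itself, the held-out `testing` data frame created by `createDataPartition()` could be passed to `predict()` in place of `training` to obtain an honest estimate for each model.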
Cross-validated accuracy results
rf_model$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7774823 0.4766071 0.03336247 0.08195144
## 2 5 0.7842626 0.4999003 0.03458244 0.07886623
## 3 8 0.7831716 0.4968882 0.03703514 0.08552193
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.1 1 10 50 0.7853432 0.4928655
## 4 0.1 2 10 50 0.7786791 0.4777174
## 7 0.1 3 10 50 0.7854400 0.4997576
## 2 0.1 1 10 100 0.7843470 0.4931914
## 5 0.1 2 10 100 0.7753075 0.4786375
## 8 0.1 3 10 100 0.7706728 0.4695441
## 3 0.1 1 10 150 0.7707676 0.4646304
## 6 0.1 2 10 150 0.7718417 0.4723185
## 9 0.1 3 10 150 0.7695610 0.4602532
## AccuracySD KappaSD
## 1 0.04532948 0.10868468
## 4 0.05075836 0.12136668
## 7 0.04926423 0.11778351
## 2 0.05353235 0.12912093
## 5 0.05079529 0.11866165
## 8 0.04349296 0.10456867
## 3 0.05285915 0.12834362
## 6 0.04974845 0.11435936
## 9 0.03841921 0.09709254
glmboost_model$results
## mstop prune Accuracy Kappa AccuracySD KappaSD
## 1 50 no 0.7763322 0.4720058 0.04198236 0.09490990
## 2 100 no 0.7707585 0.4601780 0.03344112 0.07363558
## 3 150 no 0.7696091 0.4572041 0.03653811 0.07965128
bayes_model$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.7716962 0.461895 0.04175356 0.09568866
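# Comparing multiple models
# resamples() collects the cross-validation results of every fitted model
# so that their accuracy and kappa distributions can be compared.
# The comparison is paired only if the same seed is set before each train() call,
# so that every model is evaluated on identical folds.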
rpartFit <- train(Outcome~.,data=training,
"rpart",
tuneLength = 7,
trControl = fitControl)
rfFit <- train(Outcome ~ ., data=training, method="rf",
tuneLength=7,
trControl=fitControl)
ctreeFit <- train(Outcome~.,data=training,
"ctree",
tuneLength = 7,
trControl = fitControl)
earthFit <- train(Outcome~.,data=training,
"earth",
tuneLength = 7,
trControl = fitControl)
## Loading required package: earth
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
##
## Attaching package: 'TeachingDemos'
## The following objects are masked from 'package:Hmisc':
##
## cnvrt.coords, subplot
## The following object is masked from 'package:corrr':
##
## dice
bayesFit <- train(Outcome ~ ., data=training,
method="bayesglm",
tuneLength=7,
trControl=fitControl)
svmFit <- train(Outcome ~ ., data=training,
method="svmLinear",
tuneLength=7,
trControl=fitControl)
knnFit <- train(Outcome ~ ., data=training,
method="knn",
tuneLength=7,
trControl=fitControl)
glmFit <- train(Outcome ~ ., data=training,
method = "glmnet",
tuneLength=7,
trControl=fitControl)
glmboostFit <- train(Outcome ~ ., data = training,
method = 'glmboost',
tuneLength=7,
trControl=fitControl)
#----------------------------------------
results <- resamples(list(rpart = rpartFit,
ctree = ctreeFit,
rforest = rfFit,
earth = earthFit,
bayes = bayesFit,
svm = svmFit,
knn=knnFit,
glm = glmFit,
glmboost=glmboostFit))
results
##
## Call:
## resamples.default(x = list(rpart = rpartFit, ctree = ctreeFit, rforest
## = rfFit, earth = earthFit, bayes = bayesFit, svm = svmFit, knn =
## knnFit, glm = glmFit, glmboost = glmboostFit))
##
## Models: rpart, ctree, rforest, earth, bayes, svm, knn, glm, glmboost
## Number of resamples: 15
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rpart, ctree, rforest, earth, bayes, svm, knn, glm, glmboost
## Number of resamples: 15
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.7118644 0.7646893 0.7796610 0.7786006 0.7948568 0.8474576 0
## ctree 0.6440678 0.6888418 0.7758621 0.7580629 0.8234463 0.8644068 0
## rforest 0.7288136 0.7543103 0.7931034 0.7841295 0.8051724 0.8474576 0
## earth 0.7413793 0.7711864 0.7796610 0.7863978 0.7983051 0.8474576 0
## bayes 0.6724138 0.7310734 0.7796610 0.7772336 0.8067797 0.8666667 0
## svm 0.6896552 0.7543103 0.7627119 0.7792181 0.8066384 0.8666667 0
## knn 0.5762712 0.6583333 0.6949153 0.6907078 0.7288136 0.8103448 0
## glm 0.7288136 0.7646893 0.7796610 0.7773563 0.7966102 0.8135593 0
## glmboost 0.7288136 0.7627119 0.7796610 0.7809195 0.7966102 0.8644068 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## rpart 0.33266800 0.4228063 0.4842681 0.4835966 0.5201240 0.6554186
## ctree 0.14831131 0.3078515 0.4980027 0.4475801 0.5854031 0.6898817
## rforest 0.36388140 0.4230774 0.5173370 0.5015296 0.5483675 0.6278907
## earth 0.36285097 0.4625088 0.4896873 0.4938157 0.5257738 0.6457638
## bayes 0.17883756 0.3821894 0.4889868 0.4712926 0.5585223 0.6923077
## svm 0.27600555 0.3947985 0.4520918 0.4752084 0.5470495 0.6923077
## knn -0.06191505 0.1455241 0.2248175 0.2305024 0.3169548 0.5245902
## glm 0.34492968 0.4478916 0.4615385 0.4608551 0.5047779 0.5348226
## glmboost 0.38356164 0.4262553 0.4565789 0.4829798 0.5245902 0.6819407
## NA's
## rpart 0
## ctree 0
## rforest 0
## earth 0
## bayes 0
## svm 0
## knn 0
## glm 0
## glmboost 0
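To read the resampling summary at a glance, the mean accuracy and kappa of each model can be collected and sorted. The sketch below copies the "Mean" column of both tables above into a data frame (the object names are illustrative) and ranks the models.

```r
# Mean cross-validated accuracy and kappa, copied from summary(results).
cv_means <- data.frame(
  model    = c("rpart", "ctree", "rforest", "earth", "bayes",
               "svm", "knn", "glm", "glmboost"),
  accuracy = c(0.7786006, 0.7580629, 0.7841295, 0.7863978, 0.7772336,
               0.7792181, 0.6907078, 0.7773563, 0.7809195),
  kappa    = c(0.4835966, 0.4475801, 0.5015296, 0.4938157, 0.4712926,
               0.4752084, 0.2305024, 0.4608551, 0.4829798)
)

# Rank models from highest to lowest mean accuracy.
ranked <- cv_means[order(-cv_means$accuracy), ]
print(ranked, row.names = FALSE)

# Model with the highest mean kappa.
best_kappa <- cv_means$model[which.max(cv_means$kappa)]

# Spread in mean accuracy among all models except knn.
spread <- diff(range(cv_means$accuracy[cv_means$model != "knn"]))
print(round(spread, 4))
```

Apart from knn, the models lie within about 0.03 of each other in mean accuracy, which is comparable to the accuracy standard deviations of 0.03 to 0.05 reported in the tuning results, so the ranking among the leading models should be read with caution.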