The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. I downloaded it from the UCI Machine Learning Repository. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.
Look at the structure and the first few rows.
## [1] 768 9
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Check missing values
## Number of missing value: 0
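A count of `NA` values can miss gaps that are coded as zeros. The structure output above shows zeros in BloodPressure, SkinThickness, Insulin and BMI, and a blood pressure or BMI of 0 is not physiologically plausible, so these zeros likely stand in for unrecorded measurements. The sketch below is not part of the original analysis: it copies the first ten observations from the `str()` output and counts the zeros per column. The names `first_ten` and `as_missing` are illustrative, and recoding zeros to `NA` is an assumption about how one might handle them.

```r
# First ten observations, copied from the str() output above.
first_ten <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count zeros per column; is.na() alone would report none of these.
zero_counts <- colSums(first_ten == 0)
print(zero_counts)

# Assumption: treat zeros in these columns as missing values.
as_missing <- first_ten
as_missing[as_missing == 0] <- NA
print(colSums(is.na(as_missing)))
```

The same two calls can be run on the full data frame to see how many rows are affected before deciding whether to impute, drop, or keep the zeros.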
Statistical summary
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Histogram of numeric variables
All the variables have a reasonably broad distribution, so none is dropped at this stage.
Correlation Between Numeric Variables
The numeric variables are only weakly correlated with one another.
Correlation between numeric variables and outcome.
Blood pressure and skin thickness show little variation with diabetes, so they will be excluded from the model. The other variables show at least some correlation with diabetes, so they will be kept.
##
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4366 -0.7741 -0.4312 0.8021 2.7310
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.3461752 0.8157916 -10.231 < 2e-16 ***
## Pregnancies 0.1246856 0.0373214 3.341 0.000835 ***
## Glucose 0.0315778 0.0042497 7.431 1.08e-13 ***
## Insulin -0.0013400 0.0009441 -1.419 0.155781
## BMI 0.0881521 0.0164090 5.372 7.78e-08 ***
## DiabetesPedigreeFunction 0.9642132 0.3430094 2.811 0.004938 **
## Age 0.0018904 0.0107225 0.176 0.860053
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 700.47 on 539 degrees of freedom
## Residual deviance: 526.56 on 533 degrees of freedom
## AIC: 540.56
##
## Number of Fisher Scoring iterations: 5
The three most relevant features are “Glucose”, “BMI” and “Number of times pregnant” (Pregnancies), which have the lowest p-values.
“Insulin” and “Age” do not appear to be statistically significant.
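The estimates in the coefficient table are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the estimates from the summary output above; the vector name `coefs` and the choice of a 10-unit glucose increase are illustrative.

```r
# Estimates copied from the glm summary above (log-odds scale).
coefs <- c(
  Pregnancies              = 0.1246856,
  Glucose                  = 0.0315778,
  Insulin                  = -0.0013400,
  BMI                      = 0.0881521,
  DiabetesPedigreeFunction = 0.9642132,
  Age                      = 0.0018904
)

# Odds ratio for a one-unit increase in each predictor,
# holding the other predictors fixed.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# A one-unit change in glucose is small, so a 10-unit change
# is often more informative.
glucose_10 <- exp(10 * coefs[["Glucose"]])
print(round(glucose_10, 3))
```

An odds ratio above 1 means higher values of the predictor go with higher odds of diabetes in this model; a value close to 1, as for Insulin and Age, means almost no change in the odds per unit.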
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 539 700.47
## Pregnancies 1 26.314 538 674.16 2.901e-07 ***
## Glucose 1 102.960 537 571.20 < 2.2e-16 ***
## Insulin 1 0.062 536 571.14 0.803341
## BMI 1 36.135 535 535.00 1.841e-09 ***
## DiabetesPedigreeFunction 1 8.414 534 526.59 0.003723 **
## Age 1 0.031 533 526.56 0.860201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis of deviance table, we can see that adding insulin and age has little effect on the residual deviance.
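The p-values in the deviance table are chi-squared tail probabilities of each drop in deviance, and they can be checked by hand. The sketch below uses only numbers printed in the output above; the variable names are illustrative. The AIC line relies on the fact that, for an ungrouped 0/1 response, the residual deviance equals minus twice the log-likelihood.

```r
# Values copied from the model summary and the deviance table above.
null_dev  <- 700.47   # on 539 degrees of freedom
resid_dev <- 526.56   # on 533 degrees of freedom
n_coef    <- 7        # intercept plus six predictors

# AIC = residual deviance + 2 * number of estimated coefficients.
aic <- resid_dev + 2 * n_coef
print(aic)

# Sequential tests: each deviance drop is compared to chi-squared with 1 df.
p_pregnancies <- pchisq(26.314, df = 1, lower.tail = FALSE)
p_insulin     <- pchisq(0.062,  df = 1, lower.tail = FALSE)
print(c(Pregnancies = p_pregnancies, Insulin = p_insulin))

# Overall test of the fitted model against the intercept-only model.
p_overall <- pchisq(null_dev - resid_dev, df = 539 - 533, lower.tail = FALSE)
print(p_overall)
```

Because the terms are added sequentially, each p-value depends on the order of the predictors in the formula; Insulin is tested here after Pregnancies and Glucose are already in the model.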
## [1] "Accuracy 0.789473684210526"
Reading the fitted decision tree, this means that if a person’s BMI is less than 45.4 and her diabetes pedigree function is less than 0.8745, then she is more likely to have diabetes.
Confusion table and accuracy
##
## treePred 0 1
## 0 121 29
## 1 29 49
## [1] 0.745614
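The accuracy follows directly from the confusion table: correct predictions sit on the diagonal. The sketch below rebuilds the table from the counts above and also derives sensitivity and specificity. It assumes the columns hold the actual outcome (the output does not label them); since both off-diagonal counts are 29, the results are the same either way. The names `conf`, `sensitivity`, `specificity` and `baseline` are illustrative.

```r
# Confusion table from the output above (filled column by column).
# Assumption: rows are tree predictions, columns are actual outcomes.
conf <- matrix(
  c(121, 29, 29, 49),
  nrow = 2,
  dimnames = list(treePred = c("0", "1"), actual = c("0", "1"))
)

# Accuracy: share of test cases on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)

# Sensitivity: share of actual positives predicted as positive.
sensitivity <- conf["1", "1"] / sum(conf[, "1"])

# Specificity: share of actual negatives predicted as negative.
specificity <- conf["0", "0"] / sum(conf[, "0"])

# Baseline: accuracy of always predicting the majority class (0).
baseline <- max(colSums(conf)) / sum(conf)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 4))
```

Comparing accuracy with the majority-class baseline gives a sense of how much the tree adds beyond the class imbalance in the test set.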
In this project, I compared the performance of the Logistic Regression and Decision Tree algorithms and found that Logistic Regression performed better on this standard, unaltered dataset (test accuracy of about 0.79 versus 0.75). However, there are ways to improve the generalization performance of decision tree induction, such as pruning. I will cover that in future posts.