A. Rosa Castillo
06.11.2017
The dataset is split into two groups: train and test. The training set holds about 77% of the rows (a fraction of 1/1.3), so it is roughly three times the size of the test set. The test group is used to evaluate the model and to generate a confusion matrix.
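library(caret)
library(ElemStatLearn)  # assumed source of the SAheart data
data(SAheart)
set.seed(8484)
# draw about 77% of the row indices (n / 1.3) for training
inTrain = sample(seq_len(nrow(SAheart)), size = floor(nrow(SAheart) / 1.3), replace = FALSE)
trainSA = SAheart[inTrain,]
testSA = SAheart[-inTrain,]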
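To make the split sizes concrete, here is a minimal sketch in base R. It does not load SAheart itself: `df` is a hypothetical stand-in data frame that only assumes the same number of rows as SAheart (462), and the names `in_train`, `train_df` and `test_df` are illustrative.

```r
# Stand-in data frame with as many rows as SAheart (462 is assumed here);
# the single column is a placeholder, not the real data.
n <- 462
df <- data.frame(id = seq_len(n))

set.seed(8484)
train_size <- floor(n / 1.3)  # number of rows used for training
in_train <- sample(seq_len(n), size = train_size, replace = FALSE)

train_df <- df[in_train, , drop = FALSE]
test_df  <- df[-in_train, , drop = FALSE]

# Sizes of the two groups; together they cover every row exactly once
c(train = nrow(train_df), test = nrow(test_df))
```

Under that assumption the split gives 355 training rows and 107 test rows.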
We chose a generalized linear model with six predictors:
modelFit <- train(factor(chd) ~ age + alcohol + obesity + tobacco + typea + ldl, data = trainSA, method = "glm", family = "binomial")
Resampling (here 25 bootstrap replicates) is used to estimate the accuracy of the model on data it was not fitted to.
modelFit$resample
    Accuracy     Kappa   Resample
1  0.6811594 0.2177274 Resample01
2  0.7177419 0.3606364 Resample02
3  0.7109375 0.3268903 Resample03
4  0.7132353 0.3591107 Resample04
5  0.6694915 0.2584596 Resample05
6  0.6834532 0.2937644 Resample06
7  0.7251908 0.3908551 Resample07
8  0.6640000 0.2093373 Resample08
9  0.7054264 0.3046809 Resample09
10 0.7500000 0.4368985 Resample10
11 0.7123288 0.3204787 Resample11
12 0.7000000 0.2795883 Resample12
13 0.7142857 0.3238348 Resample13
14 0.6829268 0.3206345 Resample14
15 0.7500000 0.4470656 Resample15
16 0.8030303 0.5338223 Resample16
17 0.7021277 0.3063949 Resample17
18 0.7593985 0.4279570 Resample18
19 0.7600000 0.4149766 Resample19
20 0.6666667 0.2804757 Resample20
21 0.7727273 0.4655870 Resample21
22 0.7751938 0.5033851 Resample22
23 0.7153285 0.3983786 Resample23
24 0.6842105 0.2961190 Resample24
25 0.7886179 0.4889741 Resample25
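The 25 per-resample accuracies can be summarised by hand. The following sketch simply copies the Accuracy column from the output above into a vector (the name `acc` is illustrative) and computes its mean and spread in base R.

```r
# Accuracy column of modelFit$resample, copied from the printed output
acc <- c(0.6811594, 0.7177419, 0.7109375, 0.7132353, 0.6694915,
         0.6834532, 0.7251908, 0.6640000, 0.7054264, 0.7500000,
         0.7123288, 0.7000000, 0.7142857, 0.6829268, 0.7500000,
         0.8030303, 0.7021277, 0.7593985, 0.7600000, 0.6666667,
         0.7727273, 0.7751938, 0.7153285, 0.6842105, 0.7886179)

mean_acc <- mean(acc)   # average resampled accuracy
sd_acc   <- sd(acc)     # variability between resamples
range_acc <- range(acc) # worst and best resample

round(c(mean = mean_acc, sd = sd_acc,
        min = range_acc[1], max = range_acc[2]), 3)
```

The mean comes out at about 0.72, and the range shows how much a single resample can deviate from it.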
confusionMatrix(modelFit)
Bootstrapped (25 reps) Confusion Matrix
(entries are percentual average cell counts across resamples)
          Reference
Prediction    0    1
         0 54.6 18.0
         1 10.0 17.4
 Accuracy (average) : 0.72
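The average accuracy can be recovered from the four cells of this table, and the same cells also give the per-class rates. A minimal sketch in base R follows; treating chd = 1 as the positive class is an assumption of this example, and the names `cm_pct`, `sensitivity` and `specificity` are illustrative.

```r
# Percentage cell counts copied from the bootstrapped confusion matrix;
# rows are predictions, columns are the reference (observed) classes.
cm_pct <- matrix(c(54.6, 10.0,   # Reference = 0
                   18.0, 17.4),  # Reference = 1
                 nrow = 2,
                 dimnames = list(Prediction = c("0", "1"),
                                 Reference  = c("0", "1")))

# Share of cases on the diagonal (correct predictions)
accuracy <- sum(diag(cm_pct)) / sum(cm_pct)

# Assumption: class 1 (chd present) is the positive class
sensitivity <- cm_pct["1", "1"] / sum(cm_pct[, "1"])
specificity <- cm_pct["0", "0"] / sum(cm_pct[, "0"])

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

Read this way, the model identifies the chd = 0 cases far more reliably than the chd = 1 cases.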
The Shiny app displays the fitted coefficient of each predictor, which gives an idea of how much each one contributes to the prediction.
modelFit$finalModel[1]
$coefficients
 (Intercept)          age      alcohol      obesity      tobacco        typea          ldl
-5.266713846  0.067523879  0.000668665 -0.075501628  0.068533940  0.040885964  0.228222302
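ldl, obesity and tobacco have the largest coefficients in absolute value and are therefore important factors. Note that the obesity coefficient is negative, and that raw coefficients depend on the units of each predictor.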
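One common way to read logistic regression coefficients is as odds ratios, obtained by exponentiating them. The sketch below copies the coefficients from the output above into a named vector (`coefs` and `odds_ratios` are illustrative names) and is only meant as an aid to interpretation.

```r
# Coefficients copied from modelFit$finalModel
coefs <- c("(Intercept)" = -5.266713846,
           age     =  0.067523879,
           alcohol =  0.000668665,
           obesity = -0.075501628,
           tobacco =  0.068533940,
           typea   =  0.040885964,
           ldl     =  0.228222302)

# exp(beta) is the multiplicative change in the odds of chd = 1
# for a one-unit increase in the predictor, holding the others fixed
odds_ratios <- exp(coefs[-1])  # drop the intercept

round(sort(odds_ratios, decreasing = TRUE), 3)
```

An odds ratio above 1 (for example ldl) raises the odds of chd, while one below 1 (obesity in this fit) lowers them.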
For instance, for the default values of the app (tobacco = 15, obesity = 31, age = 40, alcohol = 50 and ldl = 7), the model predicts class 0, meaning FALSE (no chd). The remaining values needed for the prediction, such as typea, are taken from the first element of the test group.
[1] 0
Levels: 0 1
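The prediction can be reproduced by hand from the coefficients. The sketch below is a worked example under two assumptions: the typea value of 50 is hypothetical (the original takes it from the first test row, which is not shown here), and a 0.5 probability cutoff is assumed for turning the probability into a class.

```r
# Coefficients copied from modelFit$finalModel
coefs <- c("(Intercept)" = -5.266713846,
           age     =  0.067523879,
           alcohol =  0.000668665,
           obesity = -0.075501628,
           tobacco =  0.068533940,
           typea   =  0.040885964,
           ldl     =  0.228222302)

# App defaults; typea = 50 is a hypothetical placeholder value
x <- c("(Intercept)" = 1, age = 40, alcohol = 50, obesity = 31,
       tobacco = 15, typea = 50, ldl = 7)

# Linear predictor (log-odds), matched by name to avoid ordering mistakes
eta <- sum(coefs * x[names(coefs)])

# Inverse logit gives the estimated probability of chd = 1
p <- plogis(eta)

# Assumed cutoff of 0.5 for the predicted class
pred <- as.integer(p > 0.5)

c(log_odds = eta, probability = p, class = pred)
```

With these inputs the probability falls below 0.5, so the predicted class is 0. The result is close to the cutoff, so a larger typea value could change the predicted class.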
Finally, the accuracy is computed on the test group.
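# predict chd for the held-out test rows
predictedValues <- predict(modelFit, newdata = testSA)
# cross-tabulate predictions against the observed classes
xtab <- table(predictedValues, testSA$chd)
cm <- confusionMatrix(xtab)
cm$overall[1]  # overall accuracy on the test set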
Accuracy
0.6448598