Description
The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.
Usage
birthwt
Format
This data frame contains the following columns:
low: indicator of birth weight less than 2.5 kg.
age: mother’s age in years.
lwt: mother’s weight in pounds at last menstrual period.
race: mother’s race (1 = white, 2 = black, 3 = other).
smoke: smoking status during pregnancy.
ptl: number of previous premature labours.
ht: history of hypertension.
ui: presence of uterine irritability.
ftv: number of physician visits during the first trimester.
bwt: birth weight in grams.
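The two outcome columns are closely related: low flags a birth weight under 2.5 kg, and bwt is that weight in grams. A minimal sketch to confirm the relationship, assuming the MASS package is installed (the names bw and derived_low are introduced here for illustration only):

```r
bw <- MASS::birthwt  # untouched copy of the data set

# Recompute the indicator from the raw birth weight (2.5 kg = 2500 g)
derived_low <- as.integer(bw$bwt < 2500)

# TRUE if the stored indicator matches the recomputed one in every row
all(derived_low == bw$low)
```

Because low is a direct function of bwt, using bwt as a predictor of low would leak the answer into the model.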
library(MASS) #birthwt {MASS}
library(rpart) #to fit decision tree model
data("birthwt")
head(birthwt)
Check the percentage of unique values in each column; a low percentage suggests a categorical variable.
apply(birthwt,2, function(x) round(length(unique(x))/nrow(birthwt),3)*100)
  low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt
  1.1  12.7  39.7   1.6   1.1   2.1   1.1   1.1   3.2  69.3
col <- c(1,4:9) #low, race, smoke, ptl, ht, ui, ftv
for (i in col) {
  birthwt[,i] <- as.factor(birthwt[,i])
}
str(birthwt)
'data.frame': 189 obs. of 10 variables:
$ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
$ ptl : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
$ ftv : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
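The conversion loop above can also be written without an explicit loop. The sketch below converts the same seven columns by name with lapply and, as an optional extra, attaches readable labels to race using the coding given in the Format section. The names bw and cat_cols are illustrative, and the code works on a fresh copy so it does not alter the birthwt object used in the rest of the walkthrough.

```r
bw <- MASS::birthwt  # fresh copy of the data set

# Same columns as positions c(1, 4:9): low, race, smoke, ptl, ht, ui, ftv
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
bw[cat_cols] <- lapply(bw[cat_cols], as.factor)

# Optional: label race as described in the Format section
bw$race <- factor(bw$race, levels = c("1", "2", "3"),
                  labels = c("white", "black", "other"))

table(bw$race)
```

Selecting columns by name rather than by position keeps the code correct if the column order ever changes.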
any(is.na(birthwt))
[1] FALSE
The target variable is low.
summary(birthwt)
 low          age             lwt        race   smoke    ptl     ht      ui      ftv
 0:130   Min.   :14.00   Min.   : 80.0   1:96   0:115   0:159   0:177   0:161   0:100
 1: 59   1st Qu.:19.00   1st Qu.:110.0   2:26   1: 74   1: 24   1: 12   1: 28   1: 47
         Median :23.00   Median :121.0   3:67           2:  5                   2: 30
         Mean   :23.24   Mean   :129.8                  3:  1                   3:  7
         3rd Qu.:26.00   3rd Qu.:140.0                                          4:  4
         Max.   :45.00   Max.   :250.0                                          6:  1
      bwt
 Min.   : 709
 1st Qu.:2414
 Median :2977
 Mean   :2945
 3rd Qu.:3487
 Max.   :4990
Let's split the data into training and test sets (80% / 20%), keeping the ratio of the two low classes the same in both.
set.seed(1234)
library(caTools)
index <- sample.split(Y = birthwt$low, SplitRatio = 0.80)
train <- birthwt[index,]
test<- birthwt[!index,]
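For readers without caTools, a similar stratified split can be sketched in base R. This is an illustrative alternative, not what sample.split does internally, and it will not select the same rows as the split above. The helper stratified_index and the names bw, idx, train_bw and test_bw are hypothetical.

```r
bw <- MASS::birthwt  # fresh copy of the data set

# Hypothetical helper: sample `ratio` of the rows within each class of y
# and return a logical vector marking the sampled (training) rows
stratified_index <- function(y, ratio) {
  picked <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = round(ratio * length(rows)))]
  }))
  seq_along(y) %in% picked
}

set.seed(1234)
idx <- stratified_index(bw$low, 0.80)
train_bw <- bw[idx, ]
test_bw  <- bw[!idx, ]

# The class proportions should be close in both parts
prop.table(table(train_bw$low))
prop.table(table(test_bw$low))
```

Sampling within each class separately is what keeps the share of low birth weights roughly equal in the training and test sets, which matters here because only 59 of the 189 rows belong to the positive class.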
Fit the model, excluding bwt because low is derived directly from it.
tree <- rpart(low~.-bwt, data = train, method = 'class')
plot(tree)
text(tree, pretty = 1)
library(rpart.plot) #for a more readable tree plot
rpart.plot(tree)
pred <- predict(tree, test, type = "class")
pred
 89  99 101 112 114 115 123 126 127 137 145 147 148 163 166 174 180 184 186 209 212 213 216 218 219
  0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
220   4  15  17  20  23  30  35  37  42  56  79  82
  0   1   1   1   0   0   1   1   1   1   1   0   1
Levels: 0 1
tab <- table(pred, actual= test$low)
tab
    actual
pred  0  1
   0 23  3
   1  3  9
Accuracy metric
sum(diag(tab))/sum(tab)
[1] 0.8421053
Misclassification error
1-sum(diag(tab))/sum(tab)
[1] 0.1578947
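Accuracy alone does not show how the errors are distributed across the two classes. As a worked example, the sketch below re-enters the confusion matrix printed above and derives sensitivity, specificity and the accuracy of always predicting the majority class. The variable names are illustrative.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class; filled column by column)
tab <- matrix(c(23, 3, 3, 9), nrow = 2,
              dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

# Share of actual low-birth-weight cases that were predicted as such
sensitivity <- tab["1", "1"] / sum(tab[, "1"])   # 9 / 12

# Share of actual normal-weight cases that were predicted as such
specificity <- tab["0", "0"] / sum(tab[, "0"])   # 23 / 26

# Accuracy obtained by always predicting the larger class
baseline <- max(colSums(tab)) / sum(tab)         # 26 / 38

round(c(sensitivity = sensitivity, specificity = specificity, baseline = baseline), 4)
```

On these numbers the tree finds 75% of the low-birth-weight cases, and its accuracy of about 0.84 is to be judged against a baseline of about 0.68 rather than against 0.5.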
Plotting the ROC curve and calculating the AUC metric
library(pROC)
prd <- predict(tree,test, type = "prob")
head(prd,10)
0 1
89 0.8142857 0.1857143
99 0.3888889 0.6111111
101 0.8142857 0.1857143
112 0.8142857 0.1857143
114 0.8142857 0.1857143
115 0.6800000 0.3200000
123 0.8142857 0.1857143
126 0.8142857 0.1857143
127 0.8142857 0.1857143
137 0.4615385 0.5384615
roc_obj <- roc(test$low, prd[,2]) #second column = probability of low = 1
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_obj)
print(auc(roc_obj))
Area under the curve: 0.7788
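The AUC can be read as the probability that a randomly chosen case (low = 1) receives a higher predicted probability than a randomly chosen control (low = 0), with ties counted as one half. The base R sketch below computes it that way; auc_by_pairs is a hypothetical helper written for illustration, not part of pROC, and the four labels and scores are made-up toy values.

```r
# Hypothetical helper: AUC as the share of case/control pairs ranked correctly
auc_by_pairs <- function(labels, scores) {
  cases    <- scores[labels == 1]
  controls <- scores[labels == 0]
  diffs <- outer(cases, controls, "-")     # every case score minus every control score
  mean((diffs > 0) + 0.5 * (diffs == 0))   # a tie counts as half a correct pair
}

# Toy data (made up): two controls and two cases
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

auc_by_pairs(toy_labels, toy_scores)   # 3 of the 4 pairs are ranked correctly
```

With the objects from the walkthrough, auc_by_pairs(test$low, prd[,2]) should agree with the value reported by pROC, given that the direction there is controls < cases.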