#install.packages(c("rpart", "party", "randomForest", "e1071", "rpart.plot"))
par(ask=TRUE)
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")
breast <- read.table(url, sep = ",", header=FALSE, na.strings = "?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity","shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize", "bareNuclei","blandChromatin", "normalNucleoli", "mitosis", "class")
df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4), labels = c("benign","malignant"))
set.seed(1234)
train <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[train,]
df.validate <- df[-train,]
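As a rough check on the split, the sketch below reproduces the index arithmetic without downloading the data. It assumes the file has 699 rows (the figure usually quoted for this dataset; confirm with nrow(df)). floor() makes the rounding of the fractional size 0.7*699 = 489.3 explicit, and the name validate is introduced here only for illustration.

```r
n <- 699                              # assumed row count; check with nrow(df)
set.seed(1234)
train <- sample(n, floor(0.7 * n))    # indices of the training rows
validate <- setdiff(seq_len(n), train)  # every row not drawn for training

# Sizes of the two partitions; together they cover all n rows
c(train = length(train), validate = length(validate))
```

The two index sets do not overlap, so no case is used both to fit and to evaluate the model.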
#Logistic regression with glm()
fit.logit <- glm(class~., data=df.train, family = binomial())
summary(fit.logit)
##
## Call:
## glm(formula = class ~ ., family = binomial(), data = df.train)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.75813  -0.10602  -0.05679   0.01237   2.64317
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -10.42758    1.47602  -7.065 1.61e-12 ***
## clumpThickness             0.52434    0.15950   3.287  0.00101 **
## sizeUniformity            -0.04805    0.25706  -0.187  0.85171
## shapeUniformity            0.42309    0.26775   1.580  0.11407
## maginalAdhesion            0.29245    0.14690   1.991  0.04650 *
## singleEpithelialCellSize   0.11053    0.17980   0.615  0.53871
## bareNuclei                 0.33570    0.10715   3.133  0.00173 **
## blandChromatin             0.42353    0.20673   2.049  0.04049 *
## normalNucleoli             0.28888    0.13995   2.064  0.03900 *
## mitosis                    0.69057    0.39829   1.734  0.08295 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 612.063 on 482 degrees of freedom
## Residual deviance: 71.346 on 473 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 91.346
##
## Number of Fisher Scoring iterations: 8
From the p-values of the regression coefficients, we see that sizeUniformity, shapeUniformity and singleEpithelialCellSize may not make a significant contribution to the equation. Their p-values are 0.85171, 0.11407 and 0.53871, all greater than 0.1. For each coefficient, the null hypothesis is that the parameter is zero, H0: Beta = 0 versus H1: Beta not equal to zero. We reject the null hypothesis when the p-value is small. For these three variables the p-values are not small, so we cannot reject the null hypothesis. This does not prove that the coefficients are zero; it only means that, with the other predictors in the model, the data give no clear evidence that these variables affect the outcome.
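To make the test concrete, the z values and p-values in the summary can be recomputed by hand. The sketch below copies the estimates and standard errors of the three variables from the output above and assumes the usual Wald test, in which the estimate divided by its standard error is compared with a standard normal distribution.

```r
# Estimates and standard errors copied from the summary(fit.logit) output
est <- c(sizeUniformity = -0.04805,
         shapeUniformity = 0.42309,
         singleEpithelialCellSize = 0.11053)
se  <- c(sizeUniformity = 0.25706,
         shapeUniformity = 0.26775,
         singleEpithelialCellSize = 0.17980)

# Wald statistic: estimate divided by its standard error
z <- est / se

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))

round(cbind(z = z, p = p), 3)
```

Up to rounding, the results agree with the z value and Pr(>|z|) columns of the summary.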
prob <- predict(fit.logit, df.validate, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE), labels = c("benign","malignant"))
#By default, predict() returns the log odds of a malignant outcome; type="response" returns the probability instead
logit.perf <- table(df.validate$class, logit.pred, dnn = c("Actual", "Predicted") )
logit.perf
##            Predicted
## Actual      benign malignant
##   benign       118         2
##   malignant      4        76
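The accuracy, that is the proportion of cases classified correctly, was (118+76)/200 or 97% in the validation sample. Validation cases with missing predictor values receive no prediction and therefore do not appear in the table.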
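The same arithmetic can be written out in R. The sketch below rebuilds the table from the four counts shown above instead of using logit.perf, so it runs on its own. It also reports sensitivity and specificity, treating malignant as the positive class, which is an assumption made here and not something the text states.

```r
# Counts copied from the logit.perf table (rows = actual, columns = predicted)
perf <- matrix(c(118,  2,
                   4, 76),
               nrow = 2, byrow = TRUE,
               dimnames = list(Actual = c("benign", "malignant"),
                               Predicted = c("benign", "malignant")))

# Correct classifications sit on the diagonal
accuracy <- sum(diag(perf)) / sum(perf)

# Assumption: malignant is the positive class
sensitivity <- perf["malignant", "malignant"] / sum(perf["malignant", ])
specificity <- perf["benign", "benign"] / sum(perf["benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

With the fitted objects in memory, passing logit.perf in place of perf should give the same figures.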
It is often useful to remove predictor variables with non-significant coefficients from the final model. This is especially important when a large number of non-informative predictor variables add what is essentially noise to the system.
The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to the others, and so provides a means for model selection. It rewards goodness of fit (a smaller residual deviance) but penalises each additional parameter, which helps to avoid overfitting. A smaller AIC indicates a better model.
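The trade-off can be checked against the summary output. For a binary (0/1) response the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus two for every estimated coefficient. The sketch below applies this to the figures reported for fit.logit; the helper name aic_from_deviance is hypothetical.

```r
# Hypothetical helper: AIC of a binary-response glm from its residual deviance
# and its number of estimated coefficients k (intercept included)
aic_from_deviance <- function(deviance, k) deviance + 2 * k

deviance_full <- 71.346   # residual deviance from summary(fit.logit)
k_full <- 10              # intercept plus nine predictors

aic_full <- aic_from_deviance(deviance_full, k_full)
aic_full                  # compare with the AIC line of the summary
```

The result matches the AIC of 91.346 in the summary. It also shows why a predictor is worth keeping, in AIC terms, only if it lowers the deviance by more than 2.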
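Stepwise logistic regression can be used to generate a smaller model with fewer variables. Predictor variables are added or removed in order to obtain a model with a smaller AIC value. Called with only a fitted model, as here, step() performs backward elimination: it starts from the full model and drops one variable at a time for as long as doing so lowers the AIC. We obtain a more parsimonious model as follows: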
logit.fit.reduced <- step(fit.logit)
## Start: AIC=91.35
## class ~ clumpThickness + sizeUniformity + shapeUniformity + maginalAdhesion +
## singleEpithelialCellSize + bareNuclei + blandChromatin +
## normalNucleoli + mitosis
##
##                            Df Deviance     AIC
## - sizeUniformity            1   71.380  89.380
## - singleEpithelialCellSize  1   71.720  89.720
## <none>                          71.346  91.346
## - shapeUniformity           1   73.713  91.713
## - mitosis                   1   74.578  92.578
## - maginalAdhesion           1   75.289  93.289
## - blandChromatin            1   75.860  93.860
## - normalNucleoli            1   76.066  94.066
## - bareNuclei                1   82.485 100.485
## - clumpThickness            1   84.701 102.701
##
## Step: AIC=89.38
## class ~ clumpThickness + shapeUniformity + maginalAdhesion +
## singleEpithelialCellSize + bareNuclei + blandChromatin +
## normalNucleoli + mitosis
##
##                            Df Deviance     AIC
## - singleEpithelialCellSize  1   71.727  87.727
## <none>                          71.380  89.380
## - mitosis                   1   74.588  90.588
## - shapeUniformity           1   75.086  91.086
## - maginalAdhesion           1   75.308  91.308
## - blandChromatin            1   75.863  91.863
## - normalNucleoli            1   76.166  92.166
## - bareNuclei                1   82.511  98.511
## - clumpThickness            1   85.112 101.112
##
## Step: AIC=87.73
## class ~ clumpThickness + shapeUniformity + maginalAdhesion +
## bareNuclei + blandChromatin + normalNucleoli + mitosis
##
##                   Df Deviance    AIC
## <none>                 71.727 87.727
## - mitosis          1   74.912 88.912
## - shapeUniformity  1   76.248 90.248
## - blandChromatin   1   76.712 90.712
## - maginalAdhesion  1   76.778 90.778
## - normalNucleoli   1   77.183 91.183
## - bareNuclei       1   82.967 96.967
## - clumpThickness   1   85.316 99.316
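This reduced model drops sizeUniformity and singleEpithelialCellSize, two of the three variables mentioned previously. shapeUniformity is retained because removing it would raise the AIC from 87.727 to 90.248. When used to predict outcomes in the validation dataset, this reduced model makes fewer errors.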
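One way to compare a full and a reduced model on held-out data is sketched below. So that it runs without the download, it uses simulated stand-in data, not the breast cancer data, and its numbers say nothing about the results above. The helper validation_accuracy, the sim objects and the seed are all hypothetical names and values chosen for illustration.

```r
set.seed(42)  # arbitrary seed for the simulated data

# Simulated stand-in data: x1 and x2 drive the outcome, x3 and x4 are noise
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$class <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2)),
                    levels = c(0, 1), labels = c("benign", "malignant"))

# 70/30 split, as in the tutorial
idx <- sample(n, floor(0.7 * n))
sim.train <- sim[idx, ]
sim.validate <- sim[-idx, ]

# Hypothetical helper: share of validation rows classified correctly
# at a 0.5 probability cutoff; rows without a prediction are ignored
validation_accuracy <- function(model, newdata) {
  prob <- predict(model, newdata, type = "response")
  pred <- factor(prob > 0.5, levels = c(FALSE, TRUE),
                 labels = c("benign", "malignant"))
  mean(pred == newdata$class, na.rm = TRUE)
}

fit.full <- glm(class ~ ., data = sim.train, family = binomial())
fit.reduced <- step(fit.full, trace = 0)  # trace = 0 suppresses the step log

c(full = validation_accuracy(fit.full, sim.validate),
  reduced = validation_accuracy(fit.reduced, sim.validate))
```

With the tutorial's objects in memory, the equivalent calls would be validation_accuracy(fit.logit, df.validate) and validation_accuracy(logit.fit.reduced, df.validate).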