#install.packages(c("rpart", "party", "randomForest", "e1071", "rpart.plot"))

par(ask=TRUE)

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")

breast <- read.table(url, sep = ",", header=FALSE, na.strings = "?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity","shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize", "bareNuclei","blandChromatin", "normalNucleoli", "mitosis", "class")

df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4), labels = c("benign","malignant"))

set.seed(1234)
train <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[train,]
df.validate <- df[-train,]

#Logistic regression with glm()

fit.logit <- glm(class~., data=df.train, family = binomial())
summary(fit.logit)
## 
## Call:
## glm(formula = class ~ ., family = binomial(), data = df.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.75813  -0.10602  -0.05679   0.01237   2.64317  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -10.42758    1.47602  -7.065 1.61e-12 ***
## clumpThickness             0.52434    0.15950   3.287  0.00101 ** 
## sizeUniformity            -0.04805    0.25706  -0.187  0.85171    
## shapeUniformity            0.42309    0.26775   1.580  0.11407    
## maginalAdhesion            0.29245    0.14690   1.991  0.04650 *  
## singleEpithelialCellSize   0.11053    0.17980   0.615  0.53871    
## bareNuclei                 0.33570    0.10715   3.133  0.00173 ** 
## blandChromatin             0.42353    0.20673   2.049  0.04049 *  
## normalNucleoli             0.28888    0.13995   2.064  0.03900 *  
## mitosis                    0.69057    0.39829   1.734  0.08295 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 612.063  on 482  degrees of freedom
## Residual deviance:  71.346  on 473  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 91.346
## 
## Number of Fisher Scoring iterations: 8

Interpretation of Logistic Regression -

From the p-values of the regression coefficients, we see that sizeUniformity, shapeUniformity and singleEpithelialCellSize may not make a significant contribution to the equation. Their p-values are 0.85171, 0.11407 and 0.53871, all greater than 0.1. For each coefficient, the null hypothesis is that the parameter is zero: H0: Beta = 0 versus H1: Beta not equal to zero. We reject the null hypothesis when the p-value is small. For these three variables the p-values are not small, so we cannot reject the null hypothesis. This does not prove that the coefficients are zero; it means only that, with the other predictors already in the model, the data give no strong evidence that these variables add to the prediction.
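To make the link between the coefficient table and the p-values concrete, the sketch below recomputes the Wald z statistics and two-sided p-values from the estimates and standard errors printed in the summary above. The numbers are copied by hand from that output, so small rounding differences are expected; the 0.1 cutoff is the one used in the interpretation, not a fixed rule.

```r
# Estimates and standard errors copied from the summary(fit.logit) output
est <- c(clumpThickness = 0.52434, sizeUniformity = -0.04805,
         shapeUniformity = 0.42309, maginalAdhesion = 0.29245,
         singleEpithelialCellSize = 0.11053, bareNuclei = 0.33570,
         blandChromatin = 0.42353, normalNucleoli = 0.28888,
         mitosis = 0.69057)
se  <- c(0.15950, 0.25706, 0.26775, 0.14690, 0.17980,
         0.10715, 0.20673, 0.13995, 0.39829)

z <- est / se              # Wald z statistic: estimate / standard error
p <- 2 * pnorm(-abs(z))    # two-sided p-value from the standard normal

round(cbind(z = z, p = p), 3)

# Predictors whose p-value exceeds the 0.1 cutoff used in the text
names(p)[p > 0.1]
```

With the fitted model in memory, the same table should be available directly as coef(summary(fit.logit)).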

prob <- predict(fit.logit, df.validate, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE), labels = c("benign","malignant"))
#By default, predict() returns the log odds of a malignant outcome; type="response" returns the probability instead
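As a rough illustration of how log odds become a probability and then a class label, the sketch below scores one hypothetical patient by hand using the coefficients printed in the summary above. The patient (a score of 1 on every predictor) is an assumption made purely for illustration and is not a row of the data set.

```r
# Coefficients copied from the summary(fit.logit) output
b <- c(intercept = -10.42758, clumpThickness = 0.52434,
       sizeUniformity = -0.04805, shapeUniformity = 0.42309,
       maginalAdhesion = 0.29245, singleEpithelialCellSize = 0.11053,
       bareNuclei = 0.33570, blandChromatin = 0.42353,
       normalNucleoli = 0.28888, mitosis = 0.69057)

# Hypothetical patient scoring 1 on all nine predictors (illustrative only)
x <- rep(1, 9)

log.odds <- b[["intercept"]] + sum(b[-1] * x)  # what predict() gives by default
prob     <- plogis(log.odds)                   # inverse logit: 1 / (1 + exp(-log.odds))
label    <- ifelse(prob > .5, "malignant", "benign")

c(log.odds = log.odds, prob = prob)
label
```

The 0.5 cutoff mirrors the one used for logit.pred; a different cutoff would trade false positives against false negatives.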

logit.perf <- table(df.validate$class, logit.pred, dnn = c("Actual", "Predicted") )
logit.perf
##            Predicted
## Actual      benign malignant
##   benign       118         2
##   malignant      4        76

The proportion of cases correctly classified, also called the accuracy, was (118+76)/200 or 97% in the validation sample.
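The accuracy figure can be checked, and broken down by class, from the confusion matrix alone. The sketch below rebuilds the table from the printed counts and also computes sensitivity and specificity, treating malignant as the positive class (that choice is an assumption; the text reports accuracy only).

```r
# Confusion matrix rebuilt from the printed logit.perf counts
perf <- matrix(c(118, 4, 2, 76), nrow = 2,
               dimnames = list(Actual    = c("benign", "malignant"),
                               Predicted = c("benign", "malignant")))

accuracy    <- sum(diag(perf)) / sum(perf)                           # (118 + 76) / 200
sensitivity <- perf["malignant", "malignant"] / sum(perf["malignant", ])  # 76 / 80
specificity <- perf["benign", "benign"] / sum(perf["benign", ])      # 118 / 120

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

With the real objects in memory, the same expressions should work on logit.perf directly, since a two-way table can be indexed like a matrix.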

It is often useful to remove predictor variables with non-significant coefficients from the final model. This is especially important where a large number of non-informative predictor variables add what is essentially noise to the system.

The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Hence, AIC provides a means for model selection. AIC is a goodness-of-fit measure that favours smaller residual error in the model but penalises additional predictors, which helps to avoid overfitting.
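For a logistic regression on a binary (0/1) outcome, the residual deviance equals minus twice the log-likelihood, so AIC reduces to the residual deviance plus 2k, where k is the number of estimated coefficients including the intercept. The sketch below checks this against the deviance values printed in this document: the full model from the glm() summary and the two reduced models from the step() trace. The names full, step1 and step2 are labels chosen here for illustration.

```r
# Residual deviances copied from the printed output;
# k = number of estimated coefficients, including the intercept
dev <- c(full = 71.346, step1 = 71.380, step2 = 71.727)
k   <- c(full = 10,     step1 = 9,      step2 = 8)

aic <- dev + 2 * k   # fit term plus complexity penalty
aic
```

Dropping a predictor lowers the penalty by exactly 2, so a removal improves AIC only when it raises the deviance by less than 2.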

Stepwise logistic regression can be used to generate a smaller model with fewer variables. Predictor variables are added or removed one at a time in order to obtain a model with a smaller AIC value. Starting from the full model, step() here performs backward elimination. We obtain a more parsimonious model as follows:

logit.fit.reduced <- step(fit.logit)
## Start:  AIC=91.35
## class ~ clumpThickness + sizeUniformity + shapeUniformity + maginalAdhesion + 
##     singleEpithelialCellSize + bareNuclei + blandChromatin + 
##     normalNucleoli + mitosis
## 
##                            Df Deviance     AIC
## - sizeUniformity            1   71.380  89.380
## - singleEpithelialCellSize  1   71.720  89.720
## <none>                          71.346  91.346
## - shapeUniformity           1   73.713  91.713
## - mitosis                   1   74.578  92.578
## - maginalAdhesion           1   75.289  93.289
## - blandChromatin            1   75.860  93.860
## - normalNucleoli            1   76.066  94.066
## - bareNuclei                1   82.485 100.485
## - clumpThickness            1   84.701 102.701
## 
## Step:  AIC=89.38
## class ~ clumpThickness + shapeUniformity + maginalAdhesion + 
##     singleEpithelialCellSize + bareNuclei + blandChromatin + 
##     normalNucleoli + mitosis
## 
##                            Df Deviance     AIC
## - singleEpithelialCellSize  1   71.727  87.727
## <none>                          71.380  89.380
## - mitosis                   1   74.588  90.588
## - shapeUniformity           1   75.086  91.086
## - maginalAdhesion           1   75.308  91.308
## - blandChromatin            1   75.863  91.863
## - normalNucleoli            1   76.166  92.166
## - bareNuclei                1   82.511  98.511
## - clumpThickness            1   85.112 101.112
## 
## Step:  AIC=87.73
## class ~ clumpThickness + shapeUniformity + maginalAdhesion + 
##     bareNuclei + blandChromatin + normalNucleoli + mitosis
## 
##                   Df Deviance    AIC
## <none>                 71.727 87.727
## - mitosis          1   74.912 88.912
## - shapeUniformity  1   76.248 90.248
## - blandChromatin   1   76.712 90.712
## - maginalAdhesion  1   76.778 90.778
## - normalNucleoli   1   77.183 91.183
## - bareNuclei       1   82.967 96.967
## - clumpThickness   1   85.316 99.316

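This reduced model excludes two of the three variables mentioned previously, sizeUniformity and singleEpithelialCellSize. shapeUniformity is retained, because dropping it would raise the AIC from 87.727 to 90.248. When used to predict outcomes in the validation dataset, this reduced model makes fewer errors.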
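One way to compare a full and a stepwise-reduced model on held-out data is sketched below. To keep the example runnable on its own, it uses simulated stand-in data rather than the breast-cancer data; the variables x1, x2 and noise, the helper confusion(), and the sim.* objects are all hypothetical names introduced for illustration. The script is meant to be run at top level, because step() re-evaluates the model call and needs to find the training data.

```r
# Simulated stand-in data (hypothetical; not the breast-cancer data set)
set.seed(42)
n   <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$class <- factor(
  rbinom(n, 1, plogis(-0.5 + 1.0 * sim$x1 + 0.8 * sim$x2)),
  levels = c(0, 1), labels = c("benign", "malignant")
)

# 70/30 split into training and validation samples
idx          <- sample(n, round(0.7 * n))
sim.train    <- sim[idx, ]
sim.validate <- sim[-idx, ]

fit.full    <- glm(class ~ ., data = sim.train, family = binomial())
fit.reduced <- step(fit.full, trace = 0)   # trace = 0 suppresses the step log

# Hypothetical helper: confusion matrix for a fitted model at a 0.5 cutoff
confusion <- function(fit, newdata) {
  prob <- predict(fit, newdata, type = "response")
  pred <- factor(prob > .5, levels = c(FALSE, TRUE),
                 labels = c("benign", "malignant"))
  table(newdata$class, pred, dnn = c("Actual", "Predicted"))
}

perf.full    <- confusion(fit.full, sim.validate)
perf.reduced <- confusion(fit.reduced, sim.validate)

# Number of misclassified validation cases for each model
errors <- c(full    = sum(perf.full)    - sum(diag(perf.full)),
            reduced = sum(perf.reduced) - sum(diag(perf.reduced)))
errors
```

Applied to this document's objects, the analogous call would pass logit.fit.reduced and df.validate to the helper. Validation rows with missing predictor values yield NA predictions, which table() drops, so the two tables may not cover the same number of cases there.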