1 Introduction

In this report, we will use synthetic breast cancer data to build multiple logistic regression models. Using cross-validation and ROC curves, we will decide which model has the best predictive performance.

1.1 Data Description

# Load the synthetic breast cancer data from GitHub
BreastCancerData <- read.csv("https://raw.githubusercontent.com/emmalaughin/sta321/main/data/SyntheticBreastCancerData")

The data set we will be using is synthetic breast cancer data; it can be found in the book “Applied Analytics through Case Studies Using SAS and R” and is posted on GitHub. The response variable (Outcome) is a binary categorical variable, and all of the predictor variables are coded as continuous. Before we begin logistic regression on this data set, we need to make some of the predictor variables categorical. For this, I decided to recode the variables Bare_Nuclei and Mitoses. The variables are as follows:

  • Sample_No: Identification Variable

  • Thickness_of_Clump: Benign cells tend to form monolayers, while malignant cells form multilayers

  • Cell_Size_Uniformity: Benign cells do not vary in size, while malignant cells vary in size

  • Cell_Shape_Uniformity: Benign cells do not vary in shape, while malignant cells vary in shape

  • Marginal_Adhesion: Benign cells tend to stick together, while cancer cells are loose and do not stick together

  • Single_Epithelial_Cell_Size: In benign tissue the epithelial cells are of normal size, while in malignant tissue they are significantly enlarged

  • Bare_Nuclei: In benign cells the bare nucleus is not surrounded by cytoplasm, while in cancer cells it is surrounded by cytoplasm

          ** Made categorical (see the recoding sketch after this list):
              Score < 3, coded as "Normal";
              Score > 3, coded as "Surrounded".
  • Bland_Chromatin: Benign cells have uniform, fine chromatin, while cancer cells have coarse chromatin

  • Normal_Nucleoli: In benign cells the nucleoli are very small, while in cancer cells they are more prominent

  • Mitoses: In benign cells the cell growth is normal, while in cancer cells the growth is abnormal

          ** Made categorical (see the recoding sketch after this list):
              Score < 3, coded as "Normal";
              3 < Score < 7, coded as "Abnormal";
              Score > 7, coded as "Rapid".
  • Outcome (response): "No" denotes a benign tumor and "Yes" denotes malignant breast cancer
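A minimal R sketch of this recoding is given below. The names Bare_Nuclei.cat and Mitoses.cat match those used in the modeling code later in the report; the handling of the boundary scores (exactly 3 and 7), which the cut-offs above leave unspecified, is an assumption.

## Recode Bare_Nuclei: scores below 3 become "Normal", otherwise "Surrounded"
## (the treatment of a score of exactly 3 is an assumption).
BreastCancerData$Bare_Nuclei.cat <- ifelse(BreastCancerData$Bare_Nuclei < 3,
                                           "Normal", "Surrounded")
## Recode Mitoses into three categories (boundary handling is an assumption).
BreastCancerData$Mitoses.cat <- cut(BreastCancerData$Mitoses,
                                    breaks = c(-Inf, 3, 7, Inf),
                                    labels = c("Normal", "Abnormal", "Rapid"))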

2 Research Question

In this report, I will be building a logistic regression model to predict whether a breast cancer tumor is benign or malignant, using various risk factors associated with the individual patient.

We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using multiple logistic regression, we can include several explanatory variables.
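As a minimal sketch, this coding could be done with ifelse(); the name Outcome.bin is hypothetical, and the modeling code in Section 5.2 instead passes factor(Outcome) directly to glm(), which handles the coding internally.

## Code the response as binary: "No" (benign) -> 0, "Yes" (malignant) -> 1.
## The name Outcome.bin is hypothetical.
BreastCancerData$Outcome.bin <- ifelse(BreastCancerData$Outcome == "Yes", 1, 0)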

3 Data Exploration

3.1 Pairwise Scatterplot

We can start by making a pairwise scatterplot of the variables to see the correlations, spot patterns, and check for potential problems such as strong collinearity among the predictors.

library(psych)   # provides pairs.panels()
pairs.panels(BreastCancerData[,-c(7,10)],   # drop columns 7 and 10 (Bare_Nuclei and Mitoses, per the variable order above)
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

3.2 Discretizing Variables

The predictor variables are all unimodal. If any of them were heavily skewed, we would be able to discretize them.

3.3 Standardizing Numerical Predictor Variables

Since this is a predictive model, we are not concerned with interpreting the coefficients. The objective is to identify the model with the best predictive performance, so we standardize the numerical predictor variables.
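A sketch of how this standardization could be done, assuming the sd.* variable names used in the modeling code of Section 5.2 are created this way:

## Standardize each numerical predictor (subtract the mean, divide by the
## standard deviation); the sd.* names match the modeling code below.
num.vars <- c("Thickness_of_Clump", "Cell_Size_Uniformity",
              "Cell_Shape_Uniformity", "Marginal_Adhesion",
              "Single_Epithelial_Cell_Size", "Bland_Chromatin",
              "Normal_Nucleoli")
for (v in num.vars) {
  BreastCancerData[[paste0("sd.", v)]] <- as.vector(scale(BreastCancerData[[v]]))
}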

4 Cross Validation

We will use cross-validation to create candidate models, validate them, and identify the final model.

4.1 Data Split

To begin the cross-validation process, we split the data into two parts: 80% (the training set) and 20% (the testing set). All of the candidate modeling is done on the training set.
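A minimal sketch of the split, assuming the object names train.dat and test that appear in the modeling code of Section 5.2; the seed is a hypothetical choice for reproducibility:

set.seed(123)                                   # hypothetical seed
n <- nrow(BreastCancerData)
train.id <- sample(1:n, size = round(0.8 * n))  # 80% of rows for training
train.dat <- BreastCancerData[train.id, ]       # training set
test <- BreastCancerData[-train.id, ]           # testing set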

4.2 Candidate Models

We use three candidate models: the full model, a reduced model, and the model obtained from step-wise variable selection.
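A sketch of how the average prediction errors reported below could be computed; the 5-fold split and the 0.5 cut-off probability are assumptions, and candidate01 through candidate03 are the models defined in the code of Section 5.2.

## 5-fold cross-validation on the training set (fold count and 0.5 cut-off
## are assumptions; candidate01-candidate03 are defined in Section 5.2).
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train.dat)))
pe <- matrix(NA, nrow = k, ncol = 3)
forms <- list(formula(candidate01), formula(candidate02), formula(candidate03))
for (i in 1:k) {
  cv.train <- train.dat[folds != i, ]   # k-1 folds for fitting
  cv.valid <- train.dat[folds == i, ]   # held-out fold for validation
  for (j in 1:3) {
    fit <- glm(forms[[j]], family = binomial(link = "logit"), data = cv.train)
    phat <- predict(fit, newdata = cv.valid, type = "response")
    yhat <- ifelse(phat > 0.5, "Yes", "No")
    pe[i, j] <- mean(yhat != cv.valid$Outcome)   # misclassification rate
  }
}
avg.pe <- data.frame(PE1 = mean(pe[, 1]), PE2 = mean(pe[, 2]), PE3 = mean(pe[, 3]))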

kable(avg.pe, caption = "Average of prediction errors of candidate models")

Average of prediction errors of candidate models

       PE1         PE2         PE3
 0.0729167   0.0729167   0.0645833

4.3 Final Model Accuracy

The table below shows the accuracy rate of the final model, evaluated on the test data, which we have not used up until now.

kable(accuracy, caption="The actual accuracy of the final model")

The actual accuracy of the final model

    x
    1

That is, the final model classified every observation in the test set correctly (accuracy = 1).
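A sketch of how this accuracy could be computed, assuming a 0.5 cut-off and taking candidate 3 (the model with the smallest cross-validated prediction error) as the final model; both choices are assumptions:

## Predict on the held-out test set and compare with the observed outcomes
## (the 0.5 cut-off and the choice of candidate03 are assumptions).
pred.final <- predict(candidate03, newdata = test, type = "response")
pred.class <- ifelse(pred.final > 0.5, "Yes", "No")
accuracy <- mean(pred.class == test$Outcome)   # proportion classified correctly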

5 Predictive Performance Measures

5.1 Positivity Rates

We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models.
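The helper function TPR.FPR() used in the plotting code below is not defined in the report. A minimal sketch is given here, assuming it returns a matrix whose first column is the FPR and second column is the TPR (matching how it is indexed in the plot) and that it evaluates the predictions against the test-set outcomes:

## Sketch of the TPR.FPR helper: for a grid of cut-off probabilities,
## compute (FPR, TPR) pairs for the predicted probabilities `pred`.
## The column order and the use of the test-set labels are assumptions.
TPR.FPR <- function(pred) {
  truth <- test$Outcome == "Yes"
  cutoffs <- seq(0, 1, length = 50)
  t(sapply(cutoffs, function(p) {
    yhat <- pred > p
    c(FPR = sum(yhat & !truth) / sum(!truth),   # false positive rate
      TPR = sum(yhat & truth) / sum(truth))     # true positive rate
  }))
}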

5.2 ROC Curves

Then we make ROC curves of the three candidate models; these are given below.

library(MASS)   # provides stepAIC()

## Candidate models
## Candidate 1: full model
candidate01 = glm(factor(Outcome) ~ sd.Thickness_of_Clump + sd.Cell_Size_Uniformity +
                    sd.Cell_Shape_Uniformity + sd.Marginal_Adhesion +
                    sd.Single_Epithelial_Cell_Size + sd.Normal_Nucleoli +
                    sd.Bland_Chromatin + Bare_Nuclei.cat + Mitoses.cat,
                  family = binomial(link = "logit"),
                  data = train.dat)
## Candidate 3: reduced model
candidate03 = glm(factor(Outcome) ~ sd.Thickness_of_Clump + sd.Cell_Shape_Uniformity +
                    sd.Marginal_Adhesion + Bare_Nuclei.cat + sd.Bland_Chromatin,
                  family = binomial(link = "logit"),
                  data = train.dat)
## Candidate 2: step-wise (AIC) selection between the reduced and full models.
## Note: since the search starts from the full model, forward selection cannot
## add any terms, so candidate02 coincides with candidate01.
candidate02 = stepAIC(candidate01,
                      scope = list(lower = formula(candidate03),
                                   upper = formula(candidate01)),
                      direction = "forward",   # forward selection
                      trace = 0                # do not show the details
                      )

## Predicted probabilities on the test set
pred01 = predict(candidate01, newdata = test, type = "response")
pred02 = predict(candidate02, newdata = test, type = "response")
pred03 = predict(candidate03, newdata = test, type = "response")

## ROC curves of the three candidate models
plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2],
     type = "b", col = 2, lty = 1, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "FPR: 1 - specificity",
     ylab = "TPR: sensitivity",
     main = "ROC Curve",
     cex.main = 0.8,
     col.main = "navy")
lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2], type = "b", col = 3, lty = 2)
lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2], type = "b", col = 4, lty = 3)
legend("bottomright", c("Candidate #1", "Candidate #2", "Candidate #3"),
       col = 2:4, lty = 1:3, cex = 0.8, bty = "n")

From the ROC curves, we can see that the three candidate models are all similar in sensitivity and specificity. This means that, by this measure alone, any of the models could be used as the final working model.

6 Summary and Conclusion

The report focused on predicting whether a breast cancer tumor is benign or malignant. We used three candidate models and used both cross-validation and ROC curves to select the final working model. The ROC curves told us that any of the three models would perform well, while the cross-validation results favor candidate 3, the reduced model, which had the smallest average prediction error. (Candidates 1 and 2 had identical prediction errors, which is expected because the forward selection in Section 5.2 started from the full model, so candidate 2 coincides with candidate 1.)