Description of Data

“Dataset:

Biopsy Data on Breast Cancer Patients

Description

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.”

When I say “cancer” I mean malignant.

Reading the data, renaming the features, and coding benign to 0 and malignant to 1.

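library(MASS)    # provides the biopsy data set
library(dplyr)   # rename(), mutate(), select() and %>%
library(tidyr)   # drop_na()
library(visreg)  # visreg() plots used in the Visualizations section
data("biopsy")
head(biopsy)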
##        ID V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1 1000025  5  1  1  1  2  1  3  1  1    benign
## 2 1002945  5  4  4  5  7 10  3  2  1    benign
## 3 1015425  3  1  1  1  2  2  3  1  1    benign
## 4 1016277  6  8  8  1  3  4  3  7  1    benign
## 5 1017023  4  1  1  3  2  1  3  1  1    benign
## 6 1017122  8 10 10  8  7 10  9  7  1 malignant
biopsy=biopsy%>%
  rename(`ClumpThickness`=V1, `UniformityofCellSize`=V2, `UniformityofCellShape`=V3, `MarginalAdhesion`=V4, `SingleEpithelialCellSize`=V5, `BareNuclei`=V6, `BlandChromatin`=V7, `NormalNucleoli`=V8, `Mitoses`=V9, `Cancer`=class)%>%
    mutate(Cancer = sjmisc::rec(Cancer, rec = "benign=0; malignant=1")) %>%
  dplyr::select(-ID)
# Drop rows with missing values (683 of the 699 rows remain)
biopsy=drop_na(biopsy)

Running the regression

Let’s try to predict cancer, starting with clump thickness.

#Using clump thickness
cancer1=glm(Cancer~`ClumpThickness`,family="binomial",data=biopsy)
summary(cancer1)
## 
## Call:
## glm(formula = Cancer ~ ClumpThickness, family = "binomial", data = biopsy)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2026  -0.4332  -0.1743   0.1731   2.8965  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.11012    0.37894  -13.48   <2e-16 ***
## ClumpThickness  0.93042    0.07418   12.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 458.48  on 681  degrees of freedom
## AIC: 462.48
## 
## Number of Fisher Scoring iterations: 6
#Clump thickness and uniformity of cell size
cancer2=glm(Cancer~`ClumpThickness`+ `UniformityofCellSize`,family="binomial",data=biopsy)
summary(cancer2)
## 
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize, 
##     family = "binomial", data = biopsy)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9869  -0.2307  -0.0916   0.0190   2.6982  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -7.38224    0.64375 -11.468  < 2e-16 ***
## ClumpThickness        0.61964    0.09649   6.422 1.35e-10 ***
## UniformityofCellSize  1.29019    0.13740   9.390  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 196.58  on 680  degrees of freedom
## AIC: 202.58
## 
## Number of Fisher Scoring iterations: 7
#Clump thickness, uniformity of cell size, and uniformity of cell shape
cancer3=glm(Cancer~`ClumpThickness`+`UniformityofCellSize`+`UniformityofCellShape`,family="binomial",data=biopsy)
summary(cancer3)
## 
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize + 
##     UniformityofCellShape, family = "binomial", data = biopsy)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6703  -0.1914  -0.0791   0.0208   2.8316  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.7210     0.6969 -11.079  < 2e-16 ***
## ClumpThickness          0.5918     0.1030   5.746 9.14e-09 ***
## UniformityofCellSize    0.6390     0.1704   3.751 0.000176 ***
## UniformityofCellShape   0.7240     0.1661   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 176.50  on 679  degrees of freedom
## AIC: 184.5
## 
## Number of Fisher Scoring iterations: 7
#Let's use clump thickness and now interact uniformity of cell size and cell shape
cancer4=glm(Cancer~`ClumpThickness`+`UniformityofCellSize`*`UniformityofCellShape`,family="binomial",data=biopsy)
summary(cancer4)
## 
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize * 
##     UniformityofCellShape, family = "binomial", data = biopsy)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.02808  -0.12632  -0.05398   0.09175   3.10968  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                 -9.5369     0.9857  -9.675
## ClumpThickness                               0.5679     0.1076   5.276
## UniformityofCellSize                         1.3223     0.2641   5.006
## UniformityofCellShape                        1.2769     0.2281   5.599
## UniformityofCellSize:UniformityofCellShape  -0.1609     0.0362  -4.444
##                                            Pr(>|z|)    
## (Intercept)                                 < 2e-16 ***
## ClumpThickness                             1.32e-07 ***
## UniformityofCellSize                       5.55e-07 ***
## UniformityofCellShape                      2.16e-08 ***
## UniformityofCellSize:UniformityofCellShape 8.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 163.19  on 678  degrees of freedom
## AIC: 173.19
## 
## Number of Fisher Scoring iterations: 7
#Let's use all features
cancer5=glm(Cancer~.,family="binomial",data=biopsy)
summary(cancer5)
## 
## Call:
## glm(formula = Cancer ~ ., family = "binomial", data = biopsy)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4841  -0.1153  -0.0619   0.0222   2.4698  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -10.10394    1.17488  -8.600  < 2e-16 ***
## ClumpThickness             0.53501    0.14202   3.767 0.000165 ***
## UniformityofCellSize      -0.00628    0.20908  -0.030 0.976039    
## UniformityofCellShape      0.32271    0.23060   1.399 0.161688    
## MarginalAdhesion           0.33064    0.12345   2.678 0.007400 ** 
## SingleEpithelialCellSize   0.09663    0.15659   0.617 0.537159    
## BareNuclei                 0.38303    0.09384   4.082 4.47e-05 ***
## BlandChromatin             0.44719    0.17138   2.609 0.009073 ** 
## NormalNucleoli             0.21303    0.11287   1.887 0.059115 .  
## Mitoses                    0.53484    0.32877   1.627 0.103788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 102.89  on 673  degrees of freedom
## AIC: 122.89
## 
## Number of Fisher Scoring iterations: 8
coef(cancer5)
##              (Intercept)           ClumpThickness     UniformityofCellSize 
##            -10.103942243              0.535014068             -0.006279717 
##    UniformityofCellShape         MarginalAdhesion SingleEpithelialCellSize 
##              0.322706496              0.330636915              0.096635417 
##               BareNuclei           BlandChromatin           NormalNucleoli 
##              0.383024572              0.447187920              0.213030682 
##                  Mitoses 
##              0.534835631

Some Interpretations

We can interpret model 5 like this:

Let’s start with clump thickness because it is significant. Ceteris paribus, a 1-unit increase in clump thickness increases the log odds of having cancer by 0.535 on average. By contrast, a 1-unit increase in uniformity of cell size decreases the log odds of having cancer by 0.006 on average, but this estimate is not statistically significant (p = 0.976).
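Log odds are hard to read directly, so it can help to exponentiate a coefficient into an odds ratio. The sketch below hard-codes the clump thickness estimate from the coef(cancer5) output above rather than refitting the model; the variable names are illustrative only.

```r
# Clump thickness coefficient copied from coef(cancer5) above
b_clump <- 0.535014068

# exp() turns a change in log odds into a multiplicative change in odds
odds_ratio <- exp(b_clump)

# Percent change in the odds of malignancy per 1-unit increase
pct_change <- (odds_ratio - 1) * 100

round(c(odds_ratio = odds_ratio, pct_change = pct_change), 2)
```

This gives an odds ratio of roughly 1.7, so each extra point of clump thickness raises the odds of malignancy by about 70%, holding the other features fixed. With the fitted model in memory, exp(coef(cancer5)) applies the same conversion to every feature at once.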

Let’s Table and Compare

#htmlreg(list(cancer1,cancer2,cancer3,cancer4,cancer5))
Statistical models
                                                  Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)                                       -5.11***   -7.38***   -7.72***   -9.54***   -10.10***
                                                  (0.38)     (0.64)     (0.70)     (0.99)     (1.17)
Clump Thickness                                   0.93***    0.62***    0.59***    0.57***    0.54***
                                                  (0.07)     (0.10)     (0.10)     (0.11)     (0.14)
Uniformity of Cell Size                                      1.29***    0.64***    1.32***    -0.01
                                                             (0.14)     (0.17)     (0.26)     (0.21)
Uniformity of Cell Shape                                                0.72***    1.28***    0.32
                                                                        (0.17)     (0.23)     (0.23)
Uniformity of Cell Size:Uniformity of Cell Shape                                   -0.16***
                                                                                   (0.04)
Marginal Adhesion                                                                             0.33**
                                                                                              (0.12)
Single Epithelial Cell Size                                                                   0.10
                                                                                              (0.16)
Bare Nuclei                                                                                   0.38***
                                                                                              (0.09)
Bland Chromatin                                                                               0.45**
                                                                                              (0.17)
Normal Nucleoli                                                                               0.21
                                                                                              (0.11)
Mitoses                                                                                       0.53
                                                                                              (0.33)
AIC                                               462.48     202.58     184.50     173.19     122.89
BIC                                               471.54     216.15     202.60     195.83     168.15
Log Likelihood                                    -229.24    -98.29     -88.25     -81.60     -51.44
Deviance                                          458.48     196.58     176.50     163.19     102.89
Num. obs.                                         683        683        683        683        683
*** p < 0.001, ** p < 0.01, * p < 0.05

Looking at both AIC and BIC, where lower is better, we can see that our models improve as they become more complex. The best-performing model on both metrics is model 5, which includes all features (AIC 122.89, BIC 168.15).
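To see where these two numbers come from, both criteria can be rebuilt from the log likelihood. The sketch below uses the rounded Model 5 values from the table, so it should match the reported AIC and BIC only up to rounding; the variable names are illustrative.

```r
# Values copied from the Model 5 column of the table above
log_lik <- -51.44   # log likelihood (rounded)
k       <- 10       # estimated parameters: intercept + 9 features
n       <- 683      # observations after dropping missing rows

aic <- 2 * k - 2 * log_lik        # AIC = 2k - 2 logLik
bic <- log(n) * k - 2 * log_lik   # BIC = k ln(n) - 2 logLik

round(c(AIC = aic, BIC = bic), 2)
```

BIC charges ln(683), about 6.5, per parameter instead of 2, which is why it penalizes the larger models more heavily than AIC does. With the fitted models in memory, AIC(cancer5) and BIC(cancer5) return these values directly.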

Clump thickness also seems to matter a lot in determining whether a tumour is malignant, as it is highly statistically significant in every model. It is also interesting that uniformity of cell size and uniformity of cell shape are significant until model 5. This could hint at omitted variable bias in the smaller models: these two features may have been picking up the effects of predictors that only enter in model 5.

anova(cancer1,cancer2,cancer3,cancer4,cancer5, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Cancer ~ ClumpThickness
## Model 2: Cancer ~ ClumpThickness + UniformityofCellSize
## Model 3: Cancer ~ ClumpThickness + UniformityofCellSize + UniformityofCellShape
## Model 4: Cancer ~ ClumpThickness + UniformityofCellSize * UniformityofCellShape
## Model 5: Cancer ~ ClumpThickness + UniformityofCellSize + UniformityofCellShape + 
##     MarginalAdhesion + SingleEpithelialCellSize + BareNuclei + 
##     BlandChromatin + NormalNucleoli + Mitoses
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       681     458.48                          
## 2       680     196.58  1  261.908 < 2.2e-16 ***
## 3       679     176.50  1   20.080 7.427e-06 ***
## 4       678     163.19  1   13.301 0.0002653 ***
## 5       673     102.89  5   60.306 1.051e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

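In the analysis of deviance, every added term significantly reduces the residual deviance. Model 5 has the lowest residual deviance overall (102.89), and among models 1 through 4 the lowest belongs to model 4 (163.19). Note that model 5 drops the interaction term, so models 4 and 5 are not nested and the p-value in the last row should be read with caution.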
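Each p-value in the table comes from comparing the drop in residual deviance to a chi-squared distribution with degrees of freedom equal to the number of added parameters. As a minimal sketch, the model 3 to model 4 row can be reproduced from the printed (rounded) deviance drop; the variable names are illustrative.

```r
# Drop in residual deviance and degrees of freedom, model 3 -> model 4,
# copied from the analysis of deviance table above
dev_drop <- 13.301
df_diff  <- 1

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(dev_drop, df = df_diff, lower.tail = FALSE)
p_value
```

The result should agree with the 0.0002653 shown for model 4 in the table, up to rounding of the deviance.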

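Looking at model 4, everything is significant, including the interaction between uniformity of cell size and uniformity of cell shape. The negative interaction coefficient means that, ceteris paribus, each 1-unit increase in one of these two features lowers the effect of the other on the log odds of having cancer by 0.16. For example, the effect of a 1-unit increase in cell size on the log odds is 1.32 - 0.16 × (cell shape score).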
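Because of the interaction, the effect of cell size on the log odds in model 4 is not a single number: it depends on the cell shape score. The sketch below tabulates that effect over the 1 to 10 scoring range, using the coefficients printed in summary(cancer4); the variable names are illustrative.

```r
# Coefficients copied from summary(cancer4) above
b_size        <- 1.3223    # UniformityofCellSize
b_interaction <- -0.1609   # UniformityofCellSize:UniformityofCellShape

# Cell shape is scored from 1 to 10
shape <- 1:10

# Change in log odds for a 1-unit increase in cell size, at each shape score
size_effect <- b_size + b_interaction * shape

data.frame(shape = shape, size_effect = round(size_effect, 3))
```

The effect of cell size shrinks as cell shape rises and turns slightly negative at shape scores of 9 and 10, so the main-effect coefficient of 1.32 on its own overstates the effect for tumours with high shape scores.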

Visualizations

visreg(cancer1, "ClumpThickness",scale="response")

visreg(cancer5, "ClumpThickness",scale="response")

Looking at clump thickness in models 1 and 5 with the probability of cancer as the response, we see two different stories. Model 1 has clump thickness as its only feature and therefore shows a drastic increase in the probability of cancer with each unit increase in thickness. Since model 5 includes all predictors, the marginal effect of a 1-unit increase in thickness on the probability of cancer is subdued. This may be because the fifth model includes more relevant features, which allows it to better isolate the effect of clump thickness on the probability of cancer.
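The model 1 curve can also be traced by hand, since a logistic regression turns log odds into probabilities through the inverse logit. This is a minimal sketch using the two coefficients printed in summary(cancer1); it covers model 1 only, because the model 5 curve also depends on the values chosen for the other eight features.

```r
# Coefficients copied from summary(cancer1) above
b0 <- -5.11012   # intercept
b1 <-  0.93042   # ClumpThickness

# Clump thickness is scored from 1 to 10
thickness <- 1:10

# plogis() is the inverse logit: it maps log odds to a probability
prob <- plogis(b0 + b1 * thickness)

data.frame(thickness = thickness, prob = round(prob, 3))
```

The predicted probability of malignancy climbs from under 5% at a thickness of 1 to over 95% at a thickness of 10, which is the steep S-shaped rise described for model 1.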

visreg(cancer5,"BareNuclei",scale="response")

Using model 5, which scored best on AIC and BIC, we see that Bare Nuclei is a significant feature, so we plot it against cancer. An increase in the Bare Nuclei score increases the probability of having cancer. The curve looks similar to the flat, exponential-looking probability curve for clump thickness in model 5.

In conclusion, our analysis shows that clump thickness has a statistically significant effect on predicting cancer, regardless of the model and the other features included. A greater clump thickness in a patient’s biopsy is associated with greater log odds, and so a greater probability, of the tumour being malignant rather than benign.