Dataset:
Biopsy Data on Breast Cancer Patients
Description
This breast cancer database was obtained from Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.
When I say "Cancer", I mean malignant.
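library(MASS)       # provides the biopsy data set
library(tidyverse)  # dplyr (rename, mutate, %>%) and tidyr (drop_na)
data("biopsy")
head(biopsy)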
## ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
## 1 1000025 5 1 1 1 2 1 3 1 1 benign
## 2 1002945 5 4 4 5 7 10 3 2 1 benign
## 3 1015425 3 1 1 1 2 2 3 1 1 benign
## 4 1016277 6 8 8 1 3 4 3 7 1 benign
## 5 1017023 4 1 1 3 2 1 3 1 1 benign
## 6 1017122 8 10 10 8 7 10 9 7 1 malignant
# Give the nine scores descriptive names, recode the outcome to 0/1, and drop the ID column
biopsy <- biopsy %>%
  rename(ClumpThickness = V1, UniformityofCellSize = V2, UniformityofCellShape = V3, MarginalAdhesion = V4, SingleEpithelialCellSize = V5, BareNuclei = V6, BlandChromatin = V7, NormalNucleoli = V8, Mitoses = V9, Cancer = class) %>%
  mutate(Cancer = sjmisc::rec(Cancer, rec = "benign=0; malignant=1")) %>%
  dplyr::select(-ID)
biopsy=drop_na(biopsy)
Let’s try to predict cancer and understand the role of clump thickness.
#Using clump thickness
cancer1=glm(Cancer~`ClumpThickness`,family="binomial",data=biopsy)
summary(cancer1)
##
## Call:
## glm(formula = Cancer ~ ClumpThickness, family = "binomial", data = biopsy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2026 -0.4332 -0.1743 0.1731 2.8965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.11012 0.37894 -13.48 <2e-16 ***
## ClumpThickness 0.93042 0.07418 12.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 458.48 on 681 degrees of freedom
## AIC: 462.48
##
## Number of Fisher Scoring iterations: 6
#Clump thickness and uniformity of cell size
cancer2=glm(Cancer~`ClumpThickness`+ `UniformityofCellSize`,family="binomial",data=biopsy)
summary(cancer2)
##
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize,
## family = "binomial", data = biopsy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9869 -0.2307 -0.0916 0.0190 2.6982
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.38224 0.64375 -11.468 < 2e-16 ***
## ClumpThickness 0.61964 0.09649 6.422 1.35e-10 ***
## UniformityofCellSize 1.29019 0.13740 9.390 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 196.58 on 680 degrees of freedom
## AIC: 202.58
##
## Number of Fisher Scoring iterations: 7
#Clump thickness, uniformity of cell size, and uniformity of cell shape
cancer3=glm(Cancer~`ClumpThickness`+`UniformityofCellSize`+`UniformityofCellShape`,family="binomial",data=biopsy)
summary(cancer3)
##
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize +
## UniformityofCellShape, family = "binomial", data = biopsy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6703 -0.1914 -0.0791 0.0208 2.8316
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.7210 0.6969 -11.079 < 2e-16 ***
## ClumpThickness 0.5918 0.1030 5.746 9.14e-09 ***
## UniformityofCellSize 0.6390 0.1704 3.751 0.000176 ***
## UniformityofCellShape 0.7240 0.1661 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 176.50 on 679 degrees of freedom
## AIC: 184.5
##
## Number of Fisher Scoring iterations: 7
#Let's use clump thickness and now interact uniformity of cell size and cell shape
cancer4=glm(Cancer~`ClumpThickness`+`UniformityofCellSize`*`UniformityofCellShape`,family="binomial",data=biopsy)
summary(cancer4)
##
## Call:
## glm(formula = Cancer ~ ClumpThickness + UniformityofCellSize *
## UniformityofCellShape, family = "binomial", data = biopsy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.02808 -0.12632 -0.05398 0.09175 3.10968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.5369 0.9857 -9.675 < 2e-16 ***
## ClumpThickness 0.5679 0.1076 5.276 1.32e-07 ***
## UniformityofCellSize 1.3223 0.2641 5.006 5.55e-07 ***
## UniformityofCellShape 1.2769 0.2281 5.599 2.16e-08 ***
## UniformityofCellSize:UniformityofCellShape -0.1609 0.0362 -4.444 8.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 163.19 on 678 degrees of freedom
## AIC: 173.19
##
## Number of Fisher Scoring iterations: 7
#Let's use all features
cancer5=glm(Cancer~.,family="binomial",data=biopsy)
summary(cancer5)
##
## Call:
## glm(formula = Cancer ~ ., family = "binomial", data = biopsy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4841 -0.1153 -0.0619 0.0222 2.4698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.10394 1.17488 -8.600 < 2e-16 ***
## ClumpThickness 0.53501 0.14202 3.767 0.000165 ***
## UniformityofCellSize -0.00628 0.20908 -0.030 0.976039
## UniformityofCellShape 0.32271 0.23060 1.399 0.161688
## MarginalAdhesion 0.33064 0.12345 2.678 0.007400 **
## SingleEpithelialCellSize 0.09663 0.15659 0.617 0.537159
## BareNuclei 0.38303 0.09384 4.082 4.47e-05 ***
## BlandChromatin 0.44719 0.17138 2.609 0.009073 **
## NormalNucleoli 0.21303 0.11287 1.887 0.059115 .
## Mitoses 0.53484 0.32877 1.627 0.103788
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 102.89 on 673 degrees of freedom
## AIC: 122.89
##
## Number of Fisher Scoring iterations: 8
coef(cancer5)
## (Intercept) ClumpThickness UniformityofCellSize
## -10.103942243 0.535014068 -0.006279717
## UniformityofCellShape MarginalAdhesion SingleEpithelialCellSize
## 0.322706496 0.330636915 0.096635417
## BareNuclei BlandChromatin NormalNucleoli
## 0.383024572 0.447187920 0.213030682
## Mitoses
## 0.534835631
We can interpret model 5 like this:
Take clump thickness first, because it is significant. Ceteris paribus, a 1-unit increase in clump thickness increases the log odds of having cancer by 0.535 on average. By contrast, a 1-unit increase in uniformity of cell size decreases the log odds of having cancer by 0.006, but that estimate is not statistically significant (p = 0.976), so it cannot be distinguished from zero.
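Log odds are hard to read directly, so it can help to exponentiate them into odds ratios. The sketch below is not part of the original analysis; it hard-codes two coefficients copied from the `coef(cancer5)` output above rather than refitting the model, and the names `log_odds` and `odds_ratio` are illustrative.

```r
# Coefficients copied from the coef(cancer5) output above (log-odds scale).
log_odds <- c(ClumpThickness = 0.535014068,
              UniformityofCellSize = -0.006279717)

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds for a one-unit increase in that predictor, holding the others fixed.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

Under these numbers, each extra unit of clump thickness multiplies the odds of malignancy by roughly 1.71, while the odds ratio for uniformity of cell size is about 0.99, which is practically 1.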
#htmlreg(list(cancer1,cancer2,cancer3,cancer4,cancer5))
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| (Intercept) | -5.11*** | -7.38*** | -7.72*** | -9.54*** | -10.10*** |
| | (0.38) | (0.64) | (0.70) | (0.99) | (1.17) |
| Clump Thickness | 0.93*** | 0.62*** | 0.59*** | 0.57*** | 0.54*** |
| | (0.07) | (0.10) | (0.10) | (0.11) | (0.14) |
| Uniformity of Cell Size | | 1.29*** | 0.64*** | 1.32*** | -0.01 |
| | | (0.14) | (0.17) | (0.26) | (0.21) |
| Uniformity of Cell Shape | | | 0.72*** | 1.28*** | 0.32 |
| | | | (0.17) | (0.23) | (0.23) |
| Uniformity of Cell Size:Uniformity of Cell Shape | | | | -0.16*** | |
| | | | | (0.04) | |
| Marginal Adhesion | | | | | 0.33** |
| | | | | | (0.12) |
| Single Epithelial Cell Size | | | | | 0.10 |
| | | | | | (0.16) |
| Bare Nuclei | | | | | 0.38*** |
| | | | | | (0.09) |
| Bland Chromatin | | | | | 0.45** |
| | | | | | (0.17) |
| Normal Nucleoli | | | | | 0.21 |
| | | | | | (0.11) |
| Mitoses | | | | | 0.53 |
| | | | | | (0.33) |
| AIC | 462.48 | 202.58 | 184.50 | 173.19 | 122.89 |
| BIC | 471.54 | 216.15 | 202.60 | 195.83 | 168.15 |
| Log Likelihood | -229.24 | -98.29 | -88.25 | -81.60 | -51.44 |
| Deviance | 458.48 | 196.58 | 176.50 | 163.19 | 102.89 |
| Num. obs. | 683 | 683 | 683 | 683 | 683 |

Standard errors in parentheses. *** p < 0.001, ** p < 0.01, * p < 0.05
Looking at both AIC and BIC, we can see that each added term improves the fit by more than the complexity penalty costs: both criteria fall from model 1 through model 5. The best performing model on both metrics is model 5, which includes all features.
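As a check on how the two criteria trade fit against complexity, the AIC and BIC rows can be rebuilt from the deviance row. This is a minimal sketch using numbers copied from the table, assuming the standard definitions AIC = deviance + 2k and BIC = deviance + k log(n), where k counts estimated coefficients; the variable names are illustrative.

```r
# Values copied from the comparison table above.
dev <- c(458.48, 196.58, 176.50, 163.19, 102.89)  # residual deviance, models 1 to 5
k   <- c(2, 3, 4, 5, 10)                          # estimated coefficients, intercept included
n   <- 683                                        # observations after drop_na()

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so each criterion is the deviance plus its complexity penalty.
aic <- dev + 2 * k
bic <- dev + log(n) * k
print(round(data.frame(model = 1:5, k = k, AIC = aic, BIC = bic), 2))
```

BIC charges log(683), about 6.5, per coefficient instead of 2, which is why the gap between AIC and BIC is widest for model 5. Small differences in the last decimal come from the rounded deviances.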
Clump thickness also appears to matter a great deal in determining whether a tumour is malignant: it is highly statistically significant in every model. It is also interesting that uniformity of cell size and uniformity of cell shape are significant until model 5, which could hint at omitted variable bias in the smaller models.
anova(cancer1,cancer2,cancer3,cancer4,cancer5, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Cancer ~ ClumpThickness
## Model 2: Cancer ~ ClumpThickness + UniformityofCellSize
## Model 3: Cancer ~ ClumpThickness + UniformityofCellSize + UniformityofCellShape
## Model 4: Cancer ~ ClumpThickness + UniformityofCellSize * UniformityofCellShape
## Model 5: Cancer ~ ClumpThickness + UniformityofCellSize + UniformityofCellShape +
## MarginalAdhesion + SingleEpithelialCellSize + BareNuclei +
## BlandChromatin + NormalNucleoli + Mitoses
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 681 458.48
## 2 680 196.58 1 261.908 < 2.2e-16 ***
## 3 679 176.50 1 20.080 7.427e-06 ***
## 4 678 163.19 1 13.301 0.0002653 ***
## 5 673 102.89 5 60.306 1.051e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
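In the analysis of deviance, every step gives a statistically significant drop in residual deviance, and model 5 has the lowest deviance of all (102.89). One caveat: model 5 does not contain the interaction term, so models 4 and 5 are not nested and the last test should be read with caution. Model 4 is the largest model in which every term is significant.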
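The p-values in the table come from comparing each drop in deviance with a chi-squared distribution on the stated degrees of freedom. A short sketch that reproduces them from the printed numbers, without refitting anything (`dev_drop`, `df` and `p_value` are illustrative names):

```r
# Deviance drops and degrees of freedom copied from the anova() table above.
dev_drop <- c(261.908, 20.080, 13.301, 60.306)
df       <- c(1, 1, 1, 5)

# Upper-tail chi-squared probability of seeing a drop at least this large
# if the extra terms had no effect.
p_value <- pchisq(dev_drop, df = df, lower.tail = FALSE)
print(signif(p_value, 4))
```

The first value is far smaller than the `< 2.2e-16` that `anova()` prints, because the table simply caps very small p-values at that threshold.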
Looking at model 4, everything is significant, including the interaction between uniformity of cell size and uniformity of cell shape. The interaction coefficient means that the two effects dampen each other: ceteris paribus, each 1-unit increase in uniformity of cell shape reduces the effect of a 1-unit increase in uniformity of cell size on the log odds of having cancer by 0.16, and vice versa.
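With an interaction in the model, the slope of one predictor depends on the value of the other. As a rough illustration, using the model 4 estimates copied from the summary above (variable names are illustrative), the effect of cell size on the log odds can be evaluated at a few cell shape scores:

```r
# Model 4 estimates copied from summary(cancer4) above.
b_size  <- 1.3223    # UniformityofCellSize main effect
b_inter <- -0.1609   # UniformityofCellSize:UniformityofCellShape

# Slope of cell size on the log-odds scale at chosen cell shape scores:
# d(log odds) / d(size) = b_size + b_inter * shape
shape      <- c(1, 5, 10)
slope_size <- b_size + b_inter * shape
print(data.frame(shape = shape, slope_size = slope_size))
```

On these estimates the cell size slope is about 1.16 when cell shape is 1, about 0.52 when it is 5, and slightly negative (about -0.29) when it is 10.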
library(visreg)  # partial-effect plots for fitted models
visreg(cancer1, "ClumpThickness", scale = "response")
visreg(cancer5, "ClumpThickness", scale = "response")
Plotting clump thickness against the predicted probability of cancer in models 1 and 5 tells two different stories. Model 1 has clump thickness as its only feature, so it shows a steep rise in the probability of cancer as thickness increases. Model 5 contains all the predictors, and with the others held fixed (visreg uses their medians by default) the effect of a 1-unit increase in thickness on the probability of cancer is more subdued. This is likely because the additional relevant features in model 5 absorb variation that model 1 attributes to clump thickness alone, giving a cleaner picture of its own contribution.
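The model 1 curve can be approximated by hand by pushing the linear predictor through the inverse logit. This is only a sketch using the two coefficients printed in `summary(cancer1)`; it does not reproduce visreg's confidence band, and the variable names are illustrative.

```r
# Model 1 estimates copied from summary(cancer1) above.
b0 <- -5.11012   # intercept
b1 <-  0.93042   # ClumpThickness

# plogis() is the inverse logit: it maps log odds to a probability in (0, 1).
thickness <- 1:10
prob <- plogis(b0 + b1 * thickness)
print(round(data.frame(thickness = thickness, prob = prob), 3))

# Thickness score at which the predicted probability crosses 0.5.
midpoint <- -b0 / b1
print(round(midpoint, 2))
```

The predicted probability stays below 5% at a thickness of 1, passes 50% at a score of about 5.5, and exceeds 95% at a thickness of 10.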
visreg(cancer5,"BareNuclei",scale="response")
In model 5, which scored best on both AIC and BIC, Bare Nuclei is a significant feature, so we plot it against the probability of cancer. An increase in the Bare Nuclei score increases the probability of having cancer. The curve looks similar to the flat, exponential-looking curve for clump thickness in model 5.
In conclusion, our analysis shows that clump thickness has a statistically significant effect on predicting cancer regardless of the model and the other features included. A greater clump thickness in a patient’s biopsy is associated with greater log odds, and therefore a greater probability, of the tumour being malignant rather than benign.