This project aims to build a predictive model in order to forecast the likelihood of having a breast cancer when one, few or all of 9 descriptive factors namely clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, blandness of nuclear chromatin, normal nucleoli & infrequent mitoses and its interactions present. The binary variable “Class” with levels 0-benign, 1-malignant (having a breast cancer) is the response variable and logistic regression technique is used in modeling. The latter part devotes to analysis the standalone effect to the model by each factor selected.
In order to select the most relevant factors to the model, first we modeled response variable with each descriptive variable and recorded standalone effect in its AIC and residual deviance.
# Defining a summary table
summaryTab <- data.frame(matrix(ncol=3, nrow=9))
x <- c("Factor","AIC","Resi._Deviance")
colnames(summaryTab) <- x
# Looking for standalone effect of each variable
for (x in c(1:9)){
ft=glm(Class ~ mydata[,x+1], family =binomial(link=logit),data=mydata)
summaryTab[x,1]=colnames(mydata)[x+1]
summaryTab[x,2]=ft$aic
summaryTab[x,3]=ft$deviance
x=x+1
}
summaryTab %>% knitr::kable( caption = 'Standalone Factor Effect')
| Factor | AIC | Resi._Deviance |
|---|---|---|
| Clump_thickness | 468.0540 | 464.0540 |
| Cell_Size | 279.5543 | 275.5543 |
| Cell_Shape | 288.2465 | 284.2465 |
| Adhesion | 496.5542 | 492.5542 |
| Epi.Cell_Size | 485.7057 | 481.7057 |
| Bare_Nuclei | 346.5127 | 342.5127 |
| Bland_Chromatin | 405.3451 | 401.3451 |
| Normal_Nucleoli | 492.5092 | 488.5092 |
| Mitoses | 735.0814 | 731.0814 |
Then we kept adding factors to the model starting with lowest AIC and checked the significancy of each factor (5%) to the model. Even though some factors are individually highly significant to the model, with the inclusion of other factors they became insignificant. For a example, cell size is the most important factor to the model with AIC value if we consider all factors individually. However, with bare nuclei and blandness chromatin cell size became insignificant. After selecting individual factors to the model then we checked for two-factor interactive effects and any interaction with a statistically significance (5%) we added to the model.
best_fit <- glm(formula = Class~ Cell_Size,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Cell_Size, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1081 -0.2474 -0.2474 0.0099 2.6465
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.960 0.360 -13.78 <2e-16 ***
## Cell_Size 1.489 0.121 12.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 275.55 on 697 degrees of freedom
## AIC: 279.55
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8053 -0.2014 -0.2014 0.0262 2.7957
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.4771 0.4063 -13.482 < 2e-16 ***
## Cell_Size 0.7672 0.1503 5.104 3.32e-07 ***
## Cell_Shape 0.8223 0.1484 5.542 2.99e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 239.65 on 696 degrees of freedom
## AIC: 245.65
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7435 -0.1425 -0.1425 0.0322 2.2067
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.25102 0.51801 -12.067 < 2e-16 ***
## Cell_Size 0.51899 0.14347 3.617 0.000298 ***
## Cell_Shape 0.58587 0.16456 3.560 0.000371 ***
## Bare_Nuclei 0.56064 0.07683 7.297 2.93e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 158.62 on 695 degrees of freedom
## AIC: 166.62
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei+Bland_Chromatin, binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei +
## Bland_Chromatin, family = binomial(link = logit), data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4254 -0.1674 -0.1338 0.0347 2.3747
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.0221 0.6312 -11.125 < 2e-16 ***
## Cell_Size 0.3689 0.1513 2.438 0.01478 *
## Cell_Shape 0.5504 0.1689 3.259 0.00112 **
## Bare_Nuclei 0.4885 0.0811 6.024 1.7e-09 ***
## Bland_Chromatin 0.4510 0.1424 3.167 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 148.26 on 694 degrees of freedom
## AIC: 158.26
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei+
Bland_Chromatin+Clump_thickness, binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei +
## Bland_Chromatin + Clump_thickness, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.11247 -0.14126 -0.07861 0.03865 2.40859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.73285 0.87509 -9.979 < 2e-16 ***
## Cell_Size 0.26963 0.16348 1.649 0.09908 .
## Cell_Shape 0.34808 0.18371 1.895 0.05813 .
## Bare_Nuclei 0.46012 0.08849 5.200 2.0e-07 ***
## Bland_Chromatin 0.44554 0.15492 2.876 0.00403 **
## Clump_thickness 0.54039 0.12438 4.345 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 124.79 on 693 degrees of freedom
## AIC: 136.79
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+
Clump_thickness+Epi.Cell_Size, binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Epi.Cell_Size, family = binomial(link = logit), data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3254 -0.1524 -0.0599 0.0438 2.7024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.7929 0.9335 -10.491 < 2e-16 ***
## Bare_Nuclei 0.5455 0.0842 6.478 9.27e-11 ***
## Bland_Chromatin 0.6017 0.1320 4.560 5.13e-06 ***
## Clump_thickness 0.7035 0.1163 6.049 1.45e-09 ***
## Epi.Cell_Size 0.4148 0.1192 3.478 0.000504 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 133.45 on 694 degrees of freedom
## AIC: 143.45
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Epi.Cell_Size+Cell_Shape,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Epi.Cell_Size + Cell_Shape, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.00391 -0.13845 -0.07312 0.04506 2.47551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.21510 0.91516 -10.069 < 2e-16 ***
## Bare_Nuclei 0.46820 0.08921 5.248 1.53e-07 ***
## Bland_Chromatin 0.46858 0.14994 3.125 0.00178 **
## Clump_thickness 0.56555 0.12310 4.594 4.35e-06 ***
## Epi.Cell_Size 0.24622 0.12937 1.903 0.05701 .
## Cell_Shape 0.41226 0.14863 2.774 0.00554 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 124.21 on 693 degrees of freedom
## AIC: 136.21
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Cell_Shape+Normal_Nucleoli,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Cell_Shape + Normal_Nucleoli, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2734 -0.1366 -0.0724 0.0406 2.5339
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.93064 0.89891 -9.935 < 2e-16 ***
## Bare_Nuclei 0.47002 0.08760 5.366 8.07e-08 ***
## Bland_Chromatin 0.44424 0.15084 2.945 0.00323 **
## Clump_thickness 0.57194 0.12506 4.573 4.80e-06 ***
## Cell_Shape 0.40093 0.14903 2.690 0.00714 **
## Normal_Nucleoli 0.21304 0.09879 2.157 0.03104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 123.23 on 693 degrees of freedom
## AIC: 135.23
##
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Cell_Shape+Normal_Nucleoli+Adhesion,binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Cell_Shape + Normal_Nucleoli + Adhesion, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3455 -0.1303 -0.0611 0.0319 2.3192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.33879 0.98244 -9.506 < 2e-16 ***
## Bare_Nuclei 0.43564 0.09131 4.771 1.83e-06 ***
## Bland_Chromatin 0.41210 0.15389 2.678 0.00741 **
## Clump_thickness 0.61046 0.12987 4.701 2.59e-06 ***
## Cell_Shape 0.33389 0.15331 2.178 0.02941 *
## Normal_Nucleoli 0.19081 0.09830 1.941 0.05225 .
## Adhesion 0.25028 0.10650 2.350 0.01877 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 117.39 on 692 degrees of freedom
## AIC: 131.39
##
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Adhesion+Mitoses+Cell_Shape,family =binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Adhesion + Mitoses + Cell_Shape, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1965 -0.1291 -0.0687 0.0227 2.3039
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.64765 1.02018 -9.457 < 2e-16 ***
## Bare_Nuclei 0.43743 0.09175 4.768 1.86e-06 ***
## Bland_Chromatin 0.44458 0.15506 2.867 0.00414 **
## Clump_thickness 0.53815 0.13224 4.069 4.71e-05 ***
## Adhesion 0.27343 0.10927 2.502 0.01234 *
## Mitoses 0.58290 0.29068 2.005 0.04493 *
## Cell_Shape 0.43454 0.15145 2.869 0.00412 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 115.46 on 692 degrees of freedom
## AIC: 129.46
##
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Adhesion+Mitoses+Cell_Shape+Clump_thickness*Adhesion,
family =binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Adhesion + Mitoses + Cell_Shape + Clump_thickness * Adhesion,
## family = binomial(link = logit), data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.99550 -0.11282 -0.03899 0.03674 2.30419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.79913 1.49729 -7.880 3.27e-15 ***
## Bare_Nuclei 0.43843 0.09250 4.740 2.14e-06 ***
## Bland_Chromatin 0.45896 0.16077 2.855 0.004308 **
## Clump_thickness 0.94283 0.22564 4.178 2.93e-05 ***
## Adhesion 0.95133 0.28673 3.318 0.000907 ***
## Mitoses 0.63588 0.33175 1.917 0.055266 .
## Cell_Shape 0.40167 0.15488 2.593 0.009505 **
## Clump_thickness:Adhesion -0.12963 0.04667 -2.777 0.005481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 108.03 on 691 degrees of freedom
## AIC: 124.03
##
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
Adhesion+Cell_Shape+Clump_thickness*Adhesion,
family =binomial(link=logit),data=mydata)
summary(best_fit)
##
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness +
## Adhesion + Cell_Shape + Clump_thickness * Adhesion, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.96887 -0.11102 -0.03861 0.04767 2.32284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.29859 1.37679 -8.206 2.28e-16 ***
## Bare_Nuclei 0.41852 0.09106 4.596 4.30e-06 ***
## Bland_Chromatin 0.49786 0.15735 3.164 0.00156 **
## Clump_thickness 0.97366 0.20906 4.657 3.20e-06 ***
## Adhesion 0.88624 0.26643 3.326 0.00088 ***
## Cell_Shape 0.44317 0.14235 3.113 0.00185 **
## Clump_thickness:Adhesion -0.11786 0.04248 -2.775 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 114.21 on 692 degrees of freedom
## AIC: 128.21
##
## Number of Fisher Scoring iterations: 8
After few iterations we found the best model.
logit (Probability of having a breast cancer)= -11.29859 + 0.41852 X
Bare_Nuclei+ 0.49786 XBland_Chromatin+ 0.97366 XClump_thickness+ 0.88624 XAdhesion+ 0.44317XCell_Shape- 0.11786 XClump_thickness*Adhesion
When bare nuclei of the breast mass increases by 1 unit, the odds of having a breast cancer is increased by 1.52 times assuming other variables to the model are fixed. With 95% confidence, the odds of having a breast cancer is increased by an amount between 1.29 and 1.85 times for every 1 unit increases in bare nuclei of breast mass when other variables to the model are fixed.
# Calculating odds ratio for Bare_Nuclei
exp(best_fit$coefficients[2])
## Bare_Nuclei
## 1.519707
# Wald confidence interval for odds ratio
beta.ci.bare_nuclei <- confint(object=best_fit,parm = "Bare_Nuclei",level=0.95)
as.numeric(exp(beta.ci.bare_nuclei))
## [1] 1.285976 1.846174
When blandness of nuclear chromatin of the breast mass increases by 1 unit, the odds of having a breast cancer is increased by 1.64 times assuming other variables to the model are fixed and with 95% confidence we can say the odds of having breast cancer is increased by an amount between 1.22-2.28 times.
# Calculating odds ratio for Bland_Chromatin
exp(best_fit$coefficients[3])
## Bland_Chromatin
## 1.645191
# Wald confidence interval for odds ratio
beta.ci.Bland_Chromatin <- confint(object=best_fit,parm = "Bland_Chromatin",level=0.95)
as.numeric(exp(beta.ci.Bland_Chromatin))
## [1] 1.223256 2.276756
A unit increased in clump thickness or the extent to which epithelial cell aggregates of breast mass affects in two ways to odds ratio. As the interaction between clump thikness and adhesion came into the model, integrated effect of clump thickness itself and interaction with adhesion both should be countered on the adds ratio. This interaction effect changes with value of adhesion. Therefore, we use below plot to observe 1 unit increases in clump thickness how affects to the likelihood of having breast cancer. Due to integrated effect a unit increased in clump thickness reduces the likelihood of having breast cancer when adhesion increases.
OR <- exp(best_fit$coefficients[4]+best_fit$coefficients[7]*mydata$Adhesion)
plot <- ggplot(mydata, aes(x = Adhesion, y=OR)) +geom_point()+geom_jitter()+
labs(labs(title="How odds ratio changes when 1 unit increases in clump thickness",
y="Odds Ratio"))
plot
When a unit increased in Adhesion affects odds ratio in both ways as due to effect of adhesion itself and its interaction with clump thickness. Therefore, in order to see the integrated effect we used below plot.A per the plot a unit increased in adhesion will decrease the likelihood of having a breast cancer when clump thickness increases.
# Calculating odds ratio
OR <- exp(best_fit$coefficients[5]+best_fit$coefficients[7]*mydata$Clump_thickness)
plot <- ggplot(mydata, aes(x = Clump_thickness, y=OR)) +geom_point()+geom_jitter()+
labs(labs(title="How odds ratio changes when 1 unit increases in adhesion",
y="Odds Ratio",x="Clump Thickness"))
plot
When the scale for uniformity of cell shape of breast mass increases by 1 unit the odds of having a beast cancer is increased by 1.56 times or with 95% confidence it increased by an amount between 1.20-2.10 times while other factors to the model are being fixed.
# Calculating odds ratio for Cell_Shape
exp(best_fit$coefficients[6])
## Cell_Shape
## 1.55763
# Wald confidence interval for odds ratio
beta.ci.Cell_Shape <- confint(object=best_fit,parm = "Cell_Shape",level=0.95)
as.numeric(exp(beta.ci.Cell_Shape))
## [1] 1.198500 2.104554
In this analysis we examined the likelihood of having a breast cancer with few factors measured on patient’s breast mass. Each factor itself increases the likelihood of having a breast cancer however, Bare nuclei,Blandness of nuclear chromatin, Uniformity of cell shape, Adhesion, Clump thickness and interaction of adhesion and clump thickness were the most affected factors deciding the likelihood of having a breast cancer.
Unlike linear regression, logistic regression does not depend very much on model assumptions. It does not require to have a linear relationship between dependent and independent variables, no need residual terms to be distributed as normal with zero mean. However, logistic regression also requires to have no multicollinearity among its independent variables, but we saw in Phase I the variables we used in this analysis have quite strong linear correlation between few pairs of variables.