Metodology

This project aims to build a predictive model in order to forecast the likelihood of having a breast cancer when one, few or all of 9 descriptive factors namely clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, blandness of nuclear chromatin, normal nucleoli & infrequent mitoses and its interactions present. The binary variable “Class” with levels 0-benign, 1-malignant (having a breast cancer) is the response variable and logistic regression technique is used in modeling. The latter part devotes to analysis the standalone effect to the model by each factor selected.

Building the model

In order to select the most relevant factors to the model, first we modeled response variable with each descriptive variable and recorded standalone effect in its AIC and residual deviance.

# Defining a summary table
summaryTab <- data.frame(matrix(ncol=3, nrow=9))
x <- c("Factor","AIC","Resi._Deviance")
colnames(summaryTab) <- x

# Looking for standalone effect of each variable
for (x in c(1:9)){
  ft=glm(Class ~ mydata[,x+1], family =binomial(link=logit),data=mydata)
  summaryTab[x,1]=colnames(mydata)[x+1]
  summaryTab[x,2]=ft$aic
  summaryTab[x,3]=ft$deviance
  x=x+1
}
summaryTab %>% knitr::kable( caption = 'Standalone Factor Effect')
Standalone Factor Effect
Factor AIC Resi._Deviance
Clump_thickness 468.0540 464.0540
Cell_Size 279.5543 275.5543
Cell_Shape 288.2465 284.2465
Adhesion 496.5542 492.5542
Epi.Cell_Size 485.7057 481.7057
Bare_Nuclei 346.5127 342.5127
Bland_Chromatin 405.3451 401.3451
Normal_Nucleoli 492.5092 488.5092
Mitoses 735.0814 731.0814

Then we kept adding factors to the model starting with lowest AIC and checked the significancy of each factor (5%) to the model. Even though some factors are individually highly significant to the model, with the inclusion of other factors they became insignificant. For a example, cell size is the most important factor to the model with AIC value if we consider all factors individually. However, with bare nuclei and blandness chromatin cell size became insignificant. After selecting individual factors to the model then we checked for two-factor interactive effects and any interaction with a statistically significance (5%) we added to the model.

best_fit <- glm(formula = Class~ Cell_Size,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Cell_Size, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1081  -0.2474  -0.2474   0.0099   2.6465  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.960      0.360  -13.78   <2e-16 ***
## Cell_Size      1.489      0.121   12.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 275.55  on 697  degrees of freedom
## AIC: 279.55
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8053  -0.2014  -0.2014   0.0262   2.7957  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.4771     0.4063 -13.482  < 2e-16 ***
## Cell_Size     0.7672     0.1503   5.104 3.32e-07 ***
## Cell_Shape    0.8223     0.1484   5.542 2.99e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 239.65  on 696  degrees of freedom
## AIC: 245.65
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7435  -0.1425  -0.1425   0.0322   2.2067  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.25102    0.51801 -12.067  < 2e-16 ***
## Cell_Size    0.51899    0.14347   3.617 0.000298 ***
## Cell_Shape   0.58587    0.16456   3.560 0.000371 ***
## Bare_Nuclei  0.56064    0.07683   7.297 2.93e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 158.62  on 695  degrees of freedom
## AIC: 166.62
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei+Bland_Chromatin, binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei + 
##     Bland_Chromatin, family = binomial(link = logit), data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4254  -0.1674  -0.1338   0.0347   2.3747  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.0221     0.6312 -11.125  < 2e-16 ***
## Cell_Size         0.3689     0.1513   2.438  0.01478 *  
## Cell_Shape        0.5504     0.1689   3.259  0.00112 ** 
## Bare_Nuclei       0.4885     0.0811   6.024  1.7e-09 ***
## Bland_Chromatin   0.4510     0.1424   3.167  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 148.26  on 694  degrees of freedom
## AIC: 158.26
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Cell_Size +Cell_Shape+Bare_Nuclei+
                  Bland_Chromatin+Clump_thickness, binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Cell_Size + Cell_Shape + Bare_Nuclei + 
##     Bland_Chromatin + Clump_thickness, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.11247  -0.14126  -0.07861   0.03865   2.40859  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -8.73285    0.87509  -9.979  < 2e-16 ***
## Cell_Size        0.26963    0.16348   1.649  0.09908 .  
## Cell_Shape       0.34808    0.18371   1.895  0.05813 .  
## Bare_Nuclei      0.46012    0.08849   5.200  2.0e-07 ***
## Bland_Chromatin  0.44554    0.15492   2.876  0.00403 ** 
## Clump_thickness  0.54039    0.12438   4.345  1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 124.79  on 693  degrees of freedom
## AIC: 136.79
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+
                  Clump_thickness+Epi.Cell_Size, binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Epi.Cell_Size, family = binomial(link = logit), data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3254  -0.1524  -0.0599   0.0438   2.7024  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -9.7929     0.9335 -10.491  < 2e-16 ***
## Bare_Nuclei       0.5455     0.0842   6.478 9.27e-11 ***
## Bland_Chromatin   0.6017     0.1320   4.560 5.13e-06 ***
## Clump_thickness   0.7035     0.1163   6.049 1.45e-09 ***
## Epi.Cell_Size     0.4148     0.1192   3.478 0.000504 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 133.45  on 694  degrees of freedom
## AIC: 143.45
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                  Epi.Cell_Size+Cell_Shape,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Epi.Cell_Size + Cell_Shape, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.00391  -0.13845  -0.07312   0.04506   2.47551  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -9.21510    0.91516 -10.069  < 2e-16 ***
## Bare_Nuclei      0.46820    0.08921   5.248 1.53e-07 ***
## Bland_Chromatin  0.46858    0.14994   3.125  0.00178 ** 
## Clump_thickness  0.56555    0.12310   4.594 4.35e-06 ***
## Epi.Cell_Size    0.24622    0.12937   1.903  0.05701 .  
## Cell_Shape       0.41226    0.14863   2.774  0.00554 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 124.21  on 693  degrees of freedom
## AIC: 136.21
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                  Cell_Shape+Normal_Nucleoli,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Cell_Shape + Normal_Nucleoli, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2734  -0.1366  -0.0724   0.0406   2.5339  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -8.93064    0.89891  -9.935  < 2e-16 ***
## Bare_Nuclei      0.47002    0.08760   5.366 8.07e-08 ***
## Bland_Chromatin  0.44424    0.15084   2.945  0.00323 ** 
## Clump_thickness  0.57194    0.12506   4.573 4.80e-06 ***
## Cell_Shape       0.40093    0.14903   2.690  0.00714 ** 
## Normal_Nucleoli  0.21304    0.09879   2.157  0.03104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 123.23  on 693  degrees of freedom
## AIC: 135.23
## 
## Number of Fisher Scoring iterations: 7
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                  Cell_Shape+Normal_Nucleoli+Adhesion,binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Cell_Shape + Normal_Nucleoli + Adhesion, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3455  -0.1303  -0.0611   0.0319   2.3192  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -9.33879    0.98244  -9.506  < 2e-16 ***
## Bare_Nuclei      0.43564    0.09131   4.771 1.83e-06 ***
## Bland_Chromatin  0.41210    0.15389   2.678  0.00741 ** 
## Clump_thickness  0.61046    0.12987   4.701 2.59e-06 ***
## Cell_Shape       0.33389    0.15331   2.178  0.02941 *  
## Normal_Nucleoli  0.19081    0.09830   1.941  0.05225 .  
## Adhesion         0.25028    0.10650   2.350  0.01877 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 117.39  on 692  degrees of freedom
## AIC: 131.39
## 
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                 Adhesion+Mitoses+Cell_Shape,family =binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Adhesion + Mitoses + Cell_Shape, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1965  -0.1291  -0.0687   0.0227   2.3039  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -9.64765    1.02018  -9.457  < 2e-16 ***
## Bare_Nuclei      0.43743    0.09175   4.768 1.86e-06 ***
## Bland_Chromatin  0.44458    0.15506   2.867  0.00414 ** 
## Clump_thickness  0.53815    0.13224   4.069 4.71e-05 ***
## Adhesion         0.27343    0.10927   2.502  0.01234 *  
## Mitoses          0.58290    0.29068   2.005  0.04493 *  
## Cell_Shape       0.43454    0.15145   2.869  0.00412 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 115.46  on 692  degrees of freedom
## AIC: 129.46
## 
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                 Adhesion+Mitoses+Cell_Shape+Clump_thickness*Adhesion,
                family =binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Adhesion + Mitoses + Cell_Shape + Clump_thickness * Adhesion, 
##     family = binomial(link = logit), data = mydata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.99550  -0.11282  -0.03899   0.03674   2.30419  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -11.79913    1.49729  -7.880 3.27e-15 ***
## Bare_Nuclei                0.43843    0.09250   4.740 2.14e-06 ***
## Bland_Chromatin            0.45896    0.16077   2.855 0.004308 ** 
## Clump_thickness            0.94283    0.22564   4.178 2.93e-05 ***
## Adhesion                   0.95133    0.28673   3.318 0.000907 ***
## Mitoses                    0.63588    0.33175   1.917 0.055266 .  
## Cell_Shape                 0.40167    0.15488   2.593 0.009505 ** 
## Clump_thickness:Adhesion  -0.12963    0.04667  -2.777 0.005481 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 108.03  on 691  degrees of freedom
## AIC: 124.03
## 
## Number of Fisher Scoring iterations: 8
best_fit <- glm(formula = Class~ Bare_Nuclei+Bland_Chromatin+Clump_thickness+
                 Adhesion+Cell_Shape+Clump_thickness*Adhesion,
                family =binomial(link=logit),data=mydata)
summary(best_fit)
## 
## Call:
## glm(formula = Class ~ Bare_Nuclei + Bland_Chromatin + Clump_thickness + 
##     Adhesion + Cell_Shape + Clump_thickness * Adhesion, family = binomial(link = logit), 
##     data = mydata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.96887  -0.11102  -0.03861   0.04767   2.32284  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -11.29859    1.37679  -8.206 2.28e-16 ***
## Bare_Nuclei                0.41852    0.09106   4.596 4.30e-06 ***
## Bland_Chromatin            0.49786    0.15735   3.164  0.00156 ** 
## Clump_thickness            0.97366    0.20906   4.657 3.20e-06 ***
## Adhesion                   0.88624    0.26643   3.326  0.00088 ***
## Cell_Shape                 0.44317    0.14235   3.113  0.00185 ** 
## Clump_thickness:Adhesion  -0.11786    0.04248  -2.775  0.00553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.53  on 698  degrees of freedom
## Residual deviance: 114.21  on 692  degrees of freedom
## AIC: 128.21
## 
## Number of Fisher Scoring iterations: 8

After few iterations we found the best model.

Final Model

logit (Probability of having a breast cancer)= -11.29859 + 0.41852 X Bare_Nuclei + 0.49786 X Bland_Chromatin + 0.97366 X Clump_thickness + 0.88624 X Adhesion + 0.44317XCell_Shape- 0.11786 X Clump_thickness*Adhesion

Effect of Bare Nuclei on breast cancer

When bare nuclei of the breast mass increases by 1 unit, the odds of having a breast cancer is increased by 1.52 times assuming other variables to the model are fixed. With 95% confidence, the odds of having a breast cancer is increased by an amount between 1.29 and 1.85 times for every 1 unit increases in bare nuclei of breast mass when other variables to the model are fixed.

# Calculating odds ratio for Bare_Nuclei
exp(best_fit$coefficients[2])
## Bare_Nuclei 
##    1.519707
# Wald confidence interval for odds ratio
beta.ci.bare_nuclei <- confint(object=best_fit,parm = "Bare_Nuclei",level=0.95)
as.numeric(exp(beta.ci.bare_nuclei))
## [1] 1.285976 1.846174

Effect of Blandness of nuclear chromatin on breast cancer

When blandness of nuclear chromatin of the breast mass increases by 1 unit, the odds of having a breast cancer is increased by 1.64 times assuming other variables to the model are fixed and with 95% confidence we can say the odds of having breast cancer is increased by an amount between 1.22-2.28 times.

# Calculating odds ratio for Bland_Chromatin
exp(best_fit$coefficients[3])
## Bland_Chromatin 
##        1.645191
# Wald confidence interval for odds ratio
beta.ci.Bland_Chromatin <- confint(object=best_fit,parm = "Bland_Chromatin",level=0.95)
as.numeric(exp(beta.ci.Bland_Chromatin))
## [1] 1.223256 2.276756

Effect of Clump thickness on breast cancer

A unit increased in clump thickness or the extent to which epithelial cell aggregates of breast mass affects in two ways to odds ratio. As the interaction between clump thikness and adhesion came into the model, integrated effect of clump thickness itself and interaction with adhesion both should be countered on the adds ratio. This interaction effect changes with value of adhesion. Therefore, we use below plot to observe 1 unit increases in clump thickness how affects to the likelihood of having breast cancer. Due to integrated effect a unit increased in clump thickness reduces the likelihood of having breast cancer when adhesion increases.

OR <- exp(best_fit$coefficients[4]+best_fit$coefficients[7]*mydata$Adhesion)
plot <- ggplot(mydata, aes(x = Adhesion, y=OR)) +geom_point()+geom_jitter()+
  labs(labs(title="How odds ratio changes when 1 unit increases in clump thickness",
            y="Odds Ratio"))
plot

Effect of Adhesion on breast cancer

When a unit increased in Adhesion affects odds ratio in both ways as due to effect of adhesion itself and its interaction with clump thickness. Therefore, in order to see the integrated effect we used below plot.A per the plot a unit increased in adhesion will decrease the likelihood of having a breast cancer when clump thickness increases.

# Calculating odds ratio 
OR <- exp(best_fit$coefficients[5]+best_fit$coefficients[7]*mydata$Clump_thickness)
plot <- ggplot(mydata, aes(x = Clump_thickness, y=OR)) +geom_point()+geom_jitter()+
  labs(labs(title="How odds ratio changes when 1 unit increases in adhesion",
            y="Odds Ratio",x="Clump Thickness"))
plot

Effect of the uniformity of cell shape on breast cancer

When the scale for uniformity of cell shape of breast mass increases by 1 unit the odds of having a beast cancer is increased by 1.56 times or with 95% confidence it increased by an amount between 1.20-2.10 times while other factors to the model are being fixed.

# Calculating odds ratio for Cell_Shape
exp(best_fit$coefficients[6])
## Cell_Shape 
##    1.55763
# Wald confidence interval for odds ratio
beta.ci.Cell_Shape <- confint(object=best_fit,parm = "Cell_Shape",level=0.95)
as.numeric(exp(beta.ci.Cell_Shape))
## [1] 1.198500 2.104554

Conclusion

In this analysis we examined the likelihood of having a breast cancer with few factors measured on patient’s breast mass. Each factor itself increases the likelihood of having a breast cancer however, Bare nuclei,Blandness of nuclear chromatin, Uniformity of cell shape, Adhesion, Clump thickness and interaction of adhesion and clump thickness were the most affected factors deciding the likelihood of having a breast cancer.

Unlike linear regression, logistic regression does not depend very much on model assumptions. It does not require to have a linear relationship between dependent and independent variables, no need residual terms to be distributed as normal with zero mean. However, logistic regression also requires to have no multicollinearity among its independent variables, but we saw in Phase I the variables we used in this analysis have quite strong linear correlation between few pairs of variables.