For this assignment I chose to work with the Wisconsin breast cancer dataset. The question at hand is whether there are characteristics (features) that can accurately predict whether a tumor is benign or malignant. This is an important question: if such characteristics with strong predictive power can be found, potentially malignant tumors could be identified earlier and lives could be saved. This is not a truly causal question, since the shape or size of the tumor does not cause it to be malignant, but rather may simply be highly correlated with malignancy; it is therefore more of a descriptive question. For true causal inference, matching or an ATE for some treatment should be estimated. Classification tasks such as this one are well suited to building a descriptive understanding.
The explanatory variables are medical observations consisting of ten real-valued features computed for each cell nucleus. The variables are (as described by Kaggle):
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension (“coastline approximation” - 1)
The target variable is the class of the tumor: malignant or benign. Class distribution: 357 benign, 212 malignant.
Since this is a classic case-study dataset, I did not need to transform any variables. Numerous variables exist, and feature selection on this data could be a study in itself; but since our focus is on model specification rather than feature selection, I will use only one variable (“Cell.size”) to shed light on the differences between the models.
#install.packages("mlbench")
library(mlbench)
data(BreastCancer)
attach(BreastCancer)
The following objects are masked _by_ .GlobalEnv:
Bl.cromatin, Cell.shape, Cell.size
head(BreastCancer)
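Before modeling, it may help to confirm the class balance and how the chosen predictor is stored (a quick optional check using base R; the columns of this dataset are stored as factors, which is why they are converted to numeric below):
# Quick checks on the loaded data: class balance and storage type of Cell.size
table(BreastCancer$Class)
str(BreastCancer$Cell.size)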
#code the response variable as a binary numerical:
class_num <- ifelse(Class=="benign",0,1)
BreastCancer$class_num <- class_num
#turn the data into numeric form:
Cell.shape <- as.numeric(Cell.shape)
Cell.size <- as.numeric(Cell.size)
Bl.cromatin <- as.numeric(Bl.cromatin)
#linear Model:
mdl <- lm(BreastCancer$class_num ~ Cell.size)
summary(mdl)
Call:
lm(formula = BreastCancer$class_num ~ Cell.size)
Residuals:
Min 1Q Median 3Q Max
-1.09256 -0.07266 -0.07266 -0.07266 0.92734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.054830 0.014855 -3.691 0.000241 ***
Cell.size 0.127488 0.003397 37.530 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2739 on 697 degrees of freedom
Multiple R-squared: 0.669, Adjusted R-squared: 0.6685
F-statistic: 1409 on 1 and 697 DF, p-value: < 2.2e-16
We see that this variable is statistically significant at \(\alpha < .001\); however, since we are using only one explanatory variable, as explained above, the estimate likely suffers from omitted variable bias. The model estimates that a one-unit increase in cell size raises the probability that the tumor is malignant by roughly 0.13, quite a lot (but again, subject to OVB).
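To make the LPM interpretation concrete, the fitted line can be evaluated at a specific cell size using the coefficients above (an illustrative sketch; at the top of the 1-10 scale the fitted value already exceeds 1):
# Implied LPM fitted value at Cell.size = 10, using the estimated intercept and slope
coef(mdl)[1] + coef(mdl)[2] * 10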
# calculate y_hat
BreastCancer$y_hat = predict(mdl)
# Set new column values to appropriate colours
BreastCancer$col[BreastCancer$class_num==0]="red"
BreastCancer$col[BreastCancer$class_num==1]="blue"
par(mfrow=c(1,2))
plot(class_num,BreastCancer$Cell.size,col = BreastCancer$col,main="")
plot(BreastCancer$y_hat,BreastCancer$Cell.size,col = BreastCancer$col, main="")
#plot(class_num,BreastCancer$y_hat,col = BreastCancer$col, main="")
We see that cell shape and cell size could serve as relevant predictors of the tumor class. The model relies on cell size to a great extent, even though the raw data (left panel) show that it would be hard to cluster the observations on this feature alone. Moreover, the right-hand plot shows that some fitted values \(\hat{y}\) exceed 1, a well-known problem with the linear probability model. It is therefore better to move on to the logit and probit models.
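A quick way to quantify the out-of-range problem is to count fitted values outside [0, 1] (using the y_hat column computed above):
# How many LPM fitted values fall outside the valid probability range?
sum(BreastCancer$y_hat < 0 | BreastCancer$y_hat > 1)
range(BreastCancer$y_hat)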
model2 <- glm(class_num ~ Cell.size,family=binomial(link='logit'))
summary(model2)
Call:
glm(formula = class_num ~ Cell.size, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-4.1081 -0.2474 -0.2474 0.0099 2.6465
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.960 0.360 -13.78 <2e-16 ***
Cell.size 1.489 0.121 12.30 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 900.53 on 698 degrees of freedom
Residual deviance: 275.55 on 697 degrees of freedom
AIC: 279.55
Number of Fisher Scoring iterations: 7
model2$coefficients
(Intercept) Cell.size
-4.960187 1.488710
We see that Cell.size gets a coefficient of about 1.49 in the logit model; this is the expected change in the log odds of malignancy for a one-unit increase in cell size.
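Because log odds are hard to read directly, the coefficients can be exponentiated into odds ratios; with the estimate above, each one-unit increase in Cell.size multiplies the odds of malignancy by roughly exp(1.49) ≈ 4.4:
# Convert the logit coefficients from log odds to odds ratios
exp(coef(model2))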
#install.packages("popbio")
library(popbio)
logi.hist.plot(Cell.size,class_num,boxp=FALSE,type="hist",col="gray", xlab="Cell Size")
We see that Cell.size, which represents the uniformity of the cell size, is used by the logit model such that patients with a value above 4 are likely to have a malignant tumor.
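This reading can be cross-checked by asking the fitted logit model for predicted probabilities at a few Cell.size values (a small sketch using predict() with type = "response"; with the coefficients above, the probability first exceeds one half between Cell.size 3 and 4):
# Predicted probability of malignancy at selected Cell.size values
predict(model2, newdata = data.frame(Cell.size = c(3, 4, 5)), type = "response")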
Let’s check the confidence intervals (this is not a task required by the assignment, just extra work):
confint(model2)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -5.718134 -4.300687
Cell.size 1.266087 1.741961
The 95% CI gives the range of plausible values for the coefficient: under the model assumptions, intervals constructed this way would cover the true coefficient in about 95 out of 100 repeated samples.
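The same interval can be expressed on the odds-ratio scale by exponentiating its endpoints (again extra, not required by the assignment):
# 95% CI for the Cell.size odds ratio
exp(confint(model2))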
model3 <- glm(class_num ~ Cell.size,family=binomial(link='probit'))
summary(model3)
We see that the coefficients are highly significant, judging by their z and p values. Another way to test the significance of the coefficients is the Wald test:
#install.packages("aod")
library(aod)
wald.test(b = coef(model3), Sigma = vcov(model3), Terms = 1:2)
Wald test:
----------
Chi-squared test:
X2 = 287.4, df = 2, P(> X2) = 0.0
We see that with 2 degrees of freedom the associated p-value is essentially zero, so the coefficients are jointly highly significant.
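Since the logit and probit use the same single predictor, a simple way to compare the two fits is by AIC (a rough comparison; lower is better):
# Compare the logit (model2) and probit (model3) fits
AIC(model2, model3)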
require(ggplot2); ggplot(BreastCancer, aes(x=Cell.size, y=model3$fitted.values)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family=binomial(link='probit')),
se = FALSE)
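The 50% threshold can also be computed directly from the coefficients: for both logit and probit, the predicted probability equals 0.5 exactly where the linear predictor is zero, i.e. at Cell.size = -intercept/slope (a quick sketch):
# Cell.size at which each model predicts a 50% probability of malignancy
-coef(model2)[1] / coef(model2)[2]  # logit
-coef(model3)[1] / coef(model3)[2]  # probit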
We see that, consistent with the other models, a Cell.size above roughly 3.75 is associated with more than a 50% risk of a malignant tumor. Lastly, let’s add confidence intervals and jitter to the classes, and use LOWESS (LOcally Weighted Scatterplot Smoother)** so we can see the mass of y against the predictor:
require(ggplot2); ggplot(BreastCancer, aes(x = as.numeric(Cell.size), y = as.numeric(class_num))) +
geom_point(position = position_jitter(width = 0, height = .04)) + # jittering helps reduce overlap since the responses are all binary anyway
stat_smooth(method = 'loess') + # loess is nonparametric, so no binomial family is passed; its smoothing parameter (span, alpha) defaults to .75
xlab("Cell size") + ylab("Pr (tumor malignant)") +
scale_y_continuous(breaks = c(.10, .25, .50, .75, .90))
We see that, as our model predicted, cell size indeed encapsulates a lot of what characterizes the classes. Much more experimentation can, and should, be done to establish this correlation; nevertheless, this analysis shows that cell size is an important predictor when diagnosing tumors.
**LOWESS: “This is a nonparametric method because the linearity assumptions of conventional regression methods have been relaxed. Instead of estimating parameters like m and c in y = mx +c, a nonparametric regression focuses on the fitted curve. The fitted points and their standard errors represent are estimated with respect to the whole curve rather than a particular estimate. So, the overall uncertainty is measured as how well the estimated curve fits the population curve.” (from https://www.statsdirect.com/help/nonparametric_methods/loess.htm)