time | status | study | rx | sex | age | obstruct | perfor | adhere | nodes | differ | extent | surg | node4 | etype |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 1 | 1 | 3 | 0 | 32 | 0 | 0 | 0 | 14 | 3 | 2 | 0 | 1 | 1 |
9 | 1 | 1 | 3 | 1 | 55 | 1 | 0 | 0 | 5 | 3 | 3 | 0 | 1 | 1 |
19 | 1 | 1 | 1 | 1 | 59 | 1 | 0 | 0 | 2 | 2 | 3 | 1 | 0 | 1 |
20 | 1 | 1 | 2 | 1 | 66 | 1 | 0 | 0 | 4 | 2 | 3 | 0 | 0 | 1 |
23 | 0 | 1 | 3 | 0 | 52 | 0 | 0 | 0 | 3 | 3 | 3 | 1 | 0 | 1 |
24 | 0 | 1 | 1 | 1 | 72 | 1 | 0 | 0 | 4 | 3 | 3 | 1 | 0 | 1 |
colon %>%
ggplot(aes(x=age))+
geom_histogram(color="red")
There exists some outliers under age 30
data <- colon %>%
filter(age>30)
data %>% ggplot(aes(x=surg))+
geom_bar(color="blue")
Count of surgery group compared with non_surgery group.
attach(data)
tab <- table(status, surg)
chi <- chisq.test(tab)
print(chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 5.1145, df = 1, p-value = 0.02373
cramersV(tab)
## [1] 0.07645364
High effect size between surg and status of colon
attach(data)
tab2 <- table(status, sex)
chi <- chisq.test(tab2, correct=T)
print(chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab2
## X-squared = 0.60742, df = 1, p-value = 0.4358
Not significant p_value , so there is no relation between sex and status.
attach(data)
tab3 <- table(status,obstruct)
chi <- chisq.test(tab3, correct=T)
print(chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab3
## X-squared = 0.86882, df = 1, p-value = 0.3513
Not significant p_value.
attach(data)
tab4 <- table(status,adhere)
chi <- chisq.test(tab4, correct=T)
print(chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab4
## X-squared = 4.4589, df = 1, p-value = 0.03472
library(lsr)
cramersV(tab4)
## [1] 0.07138518
High effect size between adhesions and status of the colon
rpart(data$node4~., data=data)
## n= 875
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 875 169.067400 0.26171430
## 2) nodes< 4.5 652 8.875767 0.01380368 *
## 3) nodes>=4.5 223 2.959641 0.98654710 *
model <- rpart(data$node4~., data=data)
rpart.plot(model)
According to this risk ratio tree we have an exact number of nodes which splits into sensitive case,so we will impute it into our data
data <- data %>%
mutate(exposed =as.integer(nodes>5),
status=as.factor(status),
adhere=as.factor(adhere),
surg=as.factor(surg),
node4=as.factor(node4)) %>%
select(time, status, nodes,
adhere, surg, node4, exposed)
head(data) %>% gt() %>% opt_stylize(style=2,color="blue")
time | status | nodes | adhere | surg | node4 | exposed |
---|---|---|---|---|---|---|
8 | 1 | 14 | 0 | 0 | 1 | 1 |
9 | 1 | 5 | 0 | 0 | 1 | 0 |
19 | 1 | 2 | 0 | 1 | 0 | 0 |
20 | 1 | 4 | 0 | 0 | 0 | 0 |
23 | 0 | 3 | 0 | 1 | 0 | 0 |
24 | 0 | 4 | 0 | 1 | 0 | 0 |
nodesurv <- survfit(Surv(time, node4) ~ exposed , data)
plot(nodesurv, xlab="Time", ylab="Survival Probability of node4",
main="Kaplan-Meier Curve")
So as we observe for group unexposed the curve values is very low correlated approximately almost to zero survival of the disease(node4).But for exposed group it gives monotonically increasing curve meaning that as time increases,the survival probability of the disease increases with time.As in this survival we are studying the survival of the disease itself.
cox.mode <- coxph(Surv(time,exposed) ~ node4,data=data)
summary(cox.mode)
## Call:
## coxph(formula = Surv(time, exposed) ~ node4, data = data)
##
## n= 875, number of events= 177
##
## coef exp(coef) se(coef) z Pr(>|z|)
## node41 6.0019 404.2063 0.7115 8.436 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## node41 404.2 0.002474 100.2 1630
##
## Concordance= 0.896 (se = 0.007 )
## Likelihood ratio test= 573.5 on 1 df, p=<2e-16
## Wald test = 71.17 on 1 df, p=<2e-16
## Score (logrank) test = 753.5 on 1 df, p=<2e-16
So based on th hazard ratio in our cox model, The group exposed(>5 nodes) has a hazard of the cancer 404.2063 times higher than the unexposed group
set.seed(2)
split <- initial_split(data,
prop=.80,
strata=node4)
data_train <- training(split)
data_test <- testing(split)
model <- glm(node4 ~ status+exposed+adhere,
data=data_train,
family="binomial")
summary(model)
##
## Call:
## glm(formula = node4 ~ status + exposed + adhere, family = "binomial",
## data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1102 0.2816 -11.046 < 2e-16 ***
## status1 1.1890 0.3302 3.600 0.000318 ***
## exposed 6.6299 0.7357 9.012 < 2e-16 ***
## adhere1 0.0657 0.4238 0.155 0.876812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 803.76 on 698 degrees of freedom
## Residual deviance: 324.47 on 695 degrees of freedom
## AIC: 332.47
##
## Number of Fisher Scoring iterations: 7
data_test <- data_test %>%
mutate(node4_prob = predict(model,
data_test,
type= "response"),
node4_pred = ifelse(node4_prob > .5, 1, 0))
t <- table(data_test$node4,
data_test$node4_pred)
accuracy <- sum(diag(t))/sum(t)
print(accuracy)
## [1] 0.9545455
95% accuracy of the classification model built based on the three predictors.
tree_model <- rpart(node4 ~ exposed+adhere,data=data_train,
method = "class")
rpart.plot(tree_model, main = "Decision Tree for colon Classification")
TAB <- table(status,surg)
epi.2by2(TAB, method="cohort.count",
conf.level=0.95)
## Outcome+ Outcome- Total Inc risk *
## Exposure+ 334 103 437 76.43 (72.16 to 80.33)
## Exposure- 304 134 438 69.41 (64.86 to 73.69)
## Total 638 237 875 72.91 (69.84 to 75.83)
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.10 (1.02, 1.19)
## Inc odds ratio 1.43 (1.06, 1.93)
## Attrib risk in the exposed * 7.02 (1.15, 12.89)
## Attrib fraction in the exposed (%) 9.19 (1.55, 16.36)
## Attrib risk in the population * 3.51 (-1.72, 8.73)
## Attrib fraction in the population (%) 4.81 (2.83, 7.13)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 5.464 Pr>chi2 = 0.019
## Fisher exact test that OR = 1: Pr>chi2 = 0.022
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
A 1.1 risk ratio means that the incidence rate in the non-surgery group is 1.1 times higher than in surgery group, indicating a 10% higher risk of having colon status 0(high predictor for node4) in the non-surgery group.