Loading data

time status study rx sex age obstruct perfor adhere nodes differ extent surg node4 etype
8 1 1 3 0 32 0 0 0 14 3 2 0 1 1
9 1 1 3 1 55 1 0 0 5 3 3 0 1 1
19 1 1 1 1 59 1 0 0 2 2 3 1 0 1
20 1 1 2 1 66 1 0 0 4 2 3 0 0 1
23 0 1 3 0 52 0 0 0 3 3 3 1 0 1
24 0 1 1 1 72 1 0 0 4 3 3 1 0 1

Exploratory Analysis

colon %>% 
  ggplot(aes(x=age))+
  geom_histogram(color="red")

There exists some outliers under age 30

Filtering and preparing data

data <- colon %>% 
  filter(age>30)

EDA

data %>% ggplot(aes(x=surg))+
  geom_bar(color="blue")

Count of surgery group compared with non_surgery group.

Checking relations and associations

attach(data)
tab <- table(status, surg)
chi <- chisq.test(tab)
print(chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 5.1145, df = 1, p-value = 0.02373
cramersV(tab)
## [1] 0.07645364

High effect size between surg and status of colon

attach(data)
tab2 <- table(status, sex)
chi <- chisq.test(tab2, correct=T)
print(chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab2
## X-squared = 0.60742, df = 1, p-value = 0.4358

Not significant p_value , so there is no relation between sex and status.

attach(data)
tab3 <- table(status,obstruct)
chi <- chisq.test(tab3, correct=T)
print(chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab3
## X-squared = 0.86882, df = 1, p-value = 0.3513

Not significant p_value.

attach(data)
tab4 <- table(status,adhere)
chi <- chisq.test(tab4, correct=T)
print(chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab4
## X-squared = 4.4589, df = 1, p-value = 0.03472
library(lsr)
cramersV(tab4)
## [1] 0.07138518

High effect size between adhesions and status of the colon

Manipulating data

rpart(data$node4~., data=data)
## n= 875 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 875 169.067400 0.26171430  
##   2) nodes< 4.5 652   8.875767 0.01380368 *
##   3) nodes>=4.5 223   2.959641 0.98654710 *
model <- rpart(data$node4~., data=data)
rpart.plot(model)

According to this risk ratio tree we have an exact number of nodes which splits into sensitive case,so we will impute it into our data

data <- data %>% 
  mutate(exposed =as.integer(nodes>5),
         status=as.factor(status),
         adhere=as.factor(adhere),
         surg=as.factor(surg),
         node4=as.factor(node4)) %>% 
  select(time, status, nodes, 
         adhere, surg, node4, exposed)
head(data) %>% gt() %>% opt_stylize(style=2,color="blue")
time status nodes adhere surg node4 exposed
8 1 14 0 0 1 1
9 1 5 0 0 1 0
19 1 2 0 1 0 0
20 1 4 0 0 0 0
23 0 3 0 1 0 0
24 0 4 0 1 0 0

Survival Analysis

Creating our survival function

nodesurv <- survfit(Surv(time, node4) ~ exposed , data)

Kaplan-Meier Curve

plot(nodesurv, xlab="Time", ylab="Survival Probability of node4",
     main="Kaplan-Meier Curve")

So as we observe for group unexposed the curve values is very low correlated approximately almost to zero survival of the disease(node4).But for exposed group it gives monotonically increasing curve meaning that as time increases,the survival probability of the disease increases with time.As in this survival we are studying the survival of the disease itself.

Cox Proportional hazards model

cox.mode <- coxph(Surv(time,exposed) ~ node4,data=data)
summary(cox.mode)
## Call:
## coxph(formula = Surv(time, exposed) ~ node4, data = data)
## 
##   n= 875, number of events= 177 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)    
## node41   6.0019  404.2063   0.7115 8.436   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## node41     404.2   0.002474     100.2      1630
## 
## Concordance= 0.896  (se = 0.007 )
## Likelihood ratio test= 573.5  on 1 df,   p=<2e-16
## Wald test            = 71.17  on 1 df,   p=<2e-16
## Score (logrank) test = 753.5  on 1 df,   p=<2e-16

So based on th hazard ratio in our cox model, The group exposed(>5 nodes) has a hazard of the cancer 404.2063 times higher than the unexposed group

Logistic Regression

set.seed(2)
 split <- initial_split(data,
                        prop=.80,
                        strata=node4)  
 data_train <- training(split) 
 data_test <- testing(split)

Building the model

Here we are studying the predictors that may affect probability of node4 case in colon status

model <- glm(node4 ~ status+exposed+adhere,
              data=data_train,
              family="binomial")
summary(model)
## 
## Call:
## glm(formula = node4 ~ status + exposed + adhere, family = "binomial", 
##     data = data_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1102     0.2816 -11.046  < 2e-16 ***
## status1       1.1890     0.3302   3.600 0.000318 ***
## exposed       6.6299     0.7357   9.012  < 2e-16 ***
## adhere1       0.0657     0.4238   0.155 0.876812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 803.76  on 698  degrees of freedom
## Residual deviance: 324.47  on 695  degrees of freedom
## AIC: 332.47
## 
## Number of Fisher Scoring iterations: 7

Testing the accuracy

data_test <- data_test %>% 
   mutate(node4_prob = predict(model,
                                data_test,
                                type= "response"),
          node4_pred = ifelse(node4_prob > .5, 1, 0))
t <- table(data_test$node4,
            data_test$node4_pred)
accuracy <- sum(diag(t))/sum(t)     
print(accuracy)
## [1] 0.9545455

95% accuracy of the classification model built based on the three predictors.

Decision tree based on exposed cases(have nodes>5)

tree_model <- rpart(node4 ~ exposed+adhere,data=data_train,
                    method = "class")
rpart.plot(tree_model, main = "Decision Tree for colon Classification")

Incidence: Risk and Odds Ratios according to the surgery of etype 1

TAB <- table(status,surg)
epi.2by2(TAB, method="cohort.count",
         conf.level=0.95)
##              Outcome+    Outcome-      Total                 Inc risk *
## Exposure+         334         103        437     76.43 (72.16 to 80.33)
## Exposure-         304         134        438     69.41 (64.86 to 73.69)
## Total             638         237        875     72.91 (69.84 to 75.83)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 1.10 (1.02, 1.19)
## Inc odds ratio                                 1.43 (1.06, 1.93)
## Attrib risk in the exposed *                   7.02 (1.15, 12.89)
## Attrib fraction in the exposed (%)            9.19 (1.55, 16.36)
## Attrib risk in the population *                3.51 (-1.72, 8.73)
## Attrib fraction in the population (%)         4.81 (2.83, 7.13)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 5.464 Pr>chi2 = 0.019
## Fisher exact test that OR = 1: Pr>chi2 = 0.022
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

A 1.1 risk ratio means that the incidence rate in the non-surgery group is 1.1 times higher than in surgery group, indicating a 10% higher risk of having colon status 0(high predictor for node4) in the non-surgery group.