Data Acquisition & Mgmt

Final Project

Introduction

Cancer remains one of the leading causes of death in the United States. I have family and friends who have succumbed to this deadly disease, which is why studying it is very close to my heart. According to the American Cancer Society, in 2018 there will be an estimated 1,735,350 new cancer cases diagnosed and 609,640 cancer deaths in the United States.

In New York alone, over 110,000 New Yorkers learn they have cancer each year, and about 35,000 succumb to the disease, making it the second leading cause of death in the state. In 2015, New York's overall cancer incidence rate of 482.0 cases per 100,000 persons was the fourth highest among the 50 states.

Lung cancer is the single largest cancer killer in New York, causing nearly 9,000 deaths each year. It has a higher mortality rate than the other common malignancies and has been less amenable to therapeutic advances.

Four cancers - lung, prostate, breast and colorectal - account for more than half of all cancer diagnoses and nearly half of all cancer deaths.

There are approximately 250 different cancers, each with its own pathophysiology, cause, diagnosis, and treatment.

Objective

The purpose of my study is to perform some predictions and analysis on cancer, particularly lung and breast cancer, using data sets available on the web.

Load libraries

knitr::opts_chunk$set(comment=NA, message=FALSE, warning=FALSE)
library(class)
library(DT)
library(ggplot2)
library(plotrix)

library(dplyr)
library(survival)
library(survminer)

Data Set 1

Let's take the lung cancer data from the survival library. This is the North Central Cancer Treatment Group (NCCTG) lung cancer data set, with survival times for 228 patients with advanced lung cancer.

# load the lung data set shipped with the survival package
data <- lung

Then let's look at the structure of the data, the first few rows, and summary statistics for each column.

str(data)
'data.frame':   228 obs. of  10 variables:
 $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
 $ time     : num  306 455 1010 210 883 ...
 $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
 $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
 $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
 $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
 $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
 $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
 $ meal.cal : num  1175 1225 NA 1150 NA ...
 $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...
head(data)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
summary(data)
      inst            time            status           age       
 Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
 1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
 Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
 Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
 3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
 Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
 NA's   :1                                                       
      sex           ph.ecog          ph.karno        pat.karno     
 Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
 Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
 Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
 3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
 Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
                 NA's   :1        NA's   :1        NA's   :3       
    meal.cal         wt.loss       
 Min.   :  96.0   Min.   :-24.000  
 1st Qu.: 635.0   1st Qu.:  0.000  
 Median : 975.0   Median :  7.000  
 Mean   : 928.8   Mean   :  9.832  
 3rd Qu.:1150.0   3rd Qu.: 15.750  
 Max.   :2600.0   Max.   : 68.000  
 NA's   :47       NA's   :14       
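
The summary also flags missing values in several columns (47 NA's in meal.cal alone). A quick way to count them per column:

# missing values per column (matches the NA's rows in the summary above)
colSums(is.na(data))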

Let's assign descriptive column names.

colnames(data)<-c("InstitutionCode","SurvivalTime","Status",
                    "Age","Sex","ECOG Performance Score",
                    "K Physician Score","K Patient Score","Meals Calories","Weight Loss")

Some Data Wrangling

# recode the numeric Sex and Status codes as labelled factors
data$Sex <- factor(data$Sex, levels = c("1", "2"), labels = c("Male", "Female"))
data$Status <- factor(data$Status, levels = c("1", "2"), labels = c("Censored", "Dead"))

# render the cleaned data as an interactive table
datatable(data)

Data Visualization

Let's tabulate the counts of each combination of Gender and Status and then visualize them.

kount <- table(data$Sex,data$Status)
barplot(kount, main="Frequency Distribution by Gender and Status",
        xlab="Status of the Cancer",ylab="Frequency Count",
        col=c("blue","pink"),legend = rownames(kount), beside=TRUE)

As we can see, more males than females died from the disease. Among the censored patients, women are the majority; censoring means follow-up ended before a death was observed, so we don't know these patients' eventual outcomes from this data.
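
Since ggplot2 is already loaded, here is a minimal sketch of the same plot drawn with it; as.data.frame() unrolls the table into Var1 (Sex), Var2 (Status), and Freq columns.

# the same counts drawn with ggplot2 instead of base barplot()
ggplot(as.data.frame(kount), aes(x = Var2, y = Freq, fill = Var1)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("blue", "pink"), name = "Gender") +
  labs(title = "Frequency Distribution by Gender and Status",
       x = "Status of the Cancer", y = "Frequency Count")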

Let’s create a survival object

s <- Surv(data$SurvivalTime, data$Status)

s
  [1]  306:Dead  455:Dead 1010+      210:Dead  883:Dead 1022+      310:Dead
  [8]  361:Dead  218:Dead  166:Dead  170:Dead  654:Dead  728:Dead   71:Dead
 [15]  567:Dead  144:Dead  613:Dead  707:Dead   61:Dead   88:Dead  301:Dead
 [22]   81:Dead  624:Dead  371:Dead  394:Dead  520:Dead  574:Dead  118:Dead
 [29]  390:Dead   12:Dead  473:Dead   26:Dead  533:Dead  107:Dead   53:Dead
 [36]  122:Dead  814:Dead  965+       93:Dead  731:Dead  460:Dead  153:Dead
 [43]  433:Dead  145:Dead  583:Dead   95:Dead  303:Dead  519:Dead  643:Dead
 [50]  765:Dead  735:Dead  189:Dead   53:Dead  246:Dead  689:Dead   65:Dead
 [57]    5:Dead  132:Dead  687:Dead  345:Dead  444:Dead  223:Dead  175:Dead
 [64]   60:Dead  163:Dead   65:Dead  208:Dead  821+      428:Dead  230:Dead
 [71]  840+      305:Dead   11:Dead  132:Dead  226:Dead  426:Dead  705:Dead
 [78]  363:Dead   11:Dead  176:Dead  791:Dead   95:Dead  196+      167:Dead
 [85]  806+      284:Dead  641:Dead  147:Dead  740+      163:Dead  655:Dead
 [92]  239:Dead   88:Dead  245:Dead  588+       30:Dead  179:Dead  310:Dead
 [99]  477:Dead  166:Dead  559+      450:Dead  364:Dead  107:Dead  177:Dead
[106]  156:Dead  529+       11:Dead  429:Dead  351:Dead   15:Dead  181:Dead
[113]  283:Dead  201:Dead  524:Dead   13:Dead  212:Dead  524:Dead  288:Dead
[120]  363:Dead  442:Dead  199:Dead  550:Dead   54:Dead  558:Dead  207:Dead
[127]   92:Dead   60:Dead  551+      543+      293:Dead  202:Dead  353:Dead
[134]  511+      267:Dead  511+      371:Dead  387:Dead  457:Dead  337:Dead
[141]  201:Dead  404+      222:Dead   62:Dead  458+      356+      353:Dead
[148]  163:Dead   31:Dead  340:Dead  229:Dead  444+      315+      182:Dead
[155]  156:Dead  329:Dead  364+      291:Dead  179:Dead  376+      384+    
[162]  268:Dead  292+      142:Dead  413+      266+      194:Dead  320:Dead
[169]  181:Dead  285:Dead  301+      348:Dead  197:Dead  382+      303+    
[176]  296+      180:Dead  186:Dead  145:Dead  269+      300+      284+    
[183]  350:Dead  272+      292+      332+      285:Dead  259+      110:Dead
[190]  286:Dead  270:Dead   81:Dead  131:Dead  225+      269:Dead  225+    
[197]  243+      279+      276+      135:Dead   79:Dead   59:Dead  240+    
[204]  202+      235+      105:Dead  224+      239:Dead  237+      173+    
[211]  252+      221+      185+       92+       13:Dead  222+      192+    
[218]  183:Dead  211+      175+      197+      203+      116:Dead  188+    
[225]  191+      105+      174+      177+    
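
One caveat: because Status was recoded as a factor above, Surv() treats it as a multi-state outcome, which is why each time prints with a ":Dead" state label rather than as a bare number. A minimal sketch of the conventional right-censored object, using a logical event indicator instead:

# a logical (TRUE = death observed) event indicator gives an ordinary
# right-censored object; censored times print with a trailing "+"
s_rc <- Surv(data$SurvivalTime, data$Status == "Dead")
head(s_rc)

This also explains the shape of the first survfit() output below: fitted on the factor-coded object, it reports a restricted mean time in state, while the refit on the original numeric coding reports the familiar median survival of 310 days.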

Let’s fit a survival curve

survfit(s~1)
Call: survfit(formula = s ~ 1)

       n nevent   rmean*
Dead 228    165 645.7253
     228      0 376.2747
   *mean time in state, restricted (max time = 1022 )
survfit(Surv(time, status)~1, data=lung)
Call: survfit(formula = Surv(time, status) ~ 1, data = lung)

      n  events  median 0.95LCL 0.95UCL 
    228     165     310     285     363 
sfit <- survfit(Surv(time, status)~sex, data=lung)

ggsurvplot(sfit, conf.int=TRUE, pval=TRUE, risk.table=TRUE, 
           legend.labs=c("Male", "Female"), legend.title="Gender",  
           palette=c("blue", "pink"), 
           title="Lung Cancer Survival Chart", xlab = "Time in days",
           risk.table.height=.3)

As the chart depicts, females survive longer than males: the female curve sits above the male curve throughout follow-up.
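
To put numbers on the gap between the curves, we can pull the per-group medians out of the fitted object (a quick sketch using the table component of summary()):

# group sizes, event counts, and median survival (with 95% CI) per sex
summary(sfit)$table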

Let's fit a Cox proportional hazards regression.

fit <- coxph(Surv(time, status)~sex, data=lung)
fit
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111
n= 228, number of events= 165 

As the regression output shows, going from male (the baseline) to female is associated with approximately a 41% reduction in hazard (exp(coef) = 0.588). Equivalently, males die at approximately 1.7 times the rate per unit time of females (females die at 0.588 times the rate per unit time of males).
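
As a quick sketch, the hazard ratio and its confidence interval can be read directly off the fitted model:

# hazard ratio for female vs. male, with its 95% confidence interval
exp(coef(fit))     # 0.588, i.e. roughly a 41% reduction in hazard
exp(confint(fit))  # interval on the hazard-ratio scale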

Data Set 2 - Using MongoDB

Let's take the breast cancer data (the Wisconsin Diagnostic Breast Cancer data set) from a CSV file and then load it into MongoDB.

library(caret)
library(e1071) 
library(mongolite)

# read the csv file "data_cancer.csv"
cancer_csv <- read.csv("data_cancer.csv")

mdb = mongo(collection = "data", db = "bdata")

# drop the collection if it already exists
mdb$drop()

# insert the csv data into MongoDB
mdb$insert(cancer_csv)
List of 5
 $ nInserted  : num 569
 $ nMatched   : num 0
 $ nRemoved   : num 0
 $ nUpserted  : num 0
 $ writeErrors: list()
# read the collection back from MongoDB into an R data frame
bdata <- mdb$find('{}')

# show data from MongoDB
datatable(bdata)
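# A sketch of server-side filtering with mongolite's find(query, fields):
# pull only the malignant rows straight from MongoDB instead of subsetting
# in R. The raw "M" code is queried here because the factor relabelling
# below happens only on the R side.
malignant <- mdb$find(query  = '{"diagnosis": "M"}',
                      fields = '{"diagnosis": true, "radius_mean": true}')
nrow(malignant)  # 212 malignant cases, matching the summary below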
# Some data wrangling and cleaning:
# recode the M/B diagnosis codes as labelled factor levels
bdata$diagnosis <- factor(bdata$diagnosis, levels = c("M", "B"), labels = c("Malignant", "Benign"))

# remove the empty trailing column (the 33rd column from the csv export)
bdata[,33] <- NULL

summary(bdata)
       id                diagnosis    radius_mean      texture_mean  
 Min.   :     8670   Malignant:212   Min.   : 6.981   Min.   : 9.71  
 1st Qu.:   869218   Benign   :357   1st Qu.:11.700   1st Qu.:16.17  
 Median :   906024                   Median :13.370   Median :18.84  
 Mean   : 30371831                   Mean   :14.127   Mean   :19.29  
 3rd Qu.:  8813129                   3rd Qu.:15.780   3rd Qu.:21.80  
 Max.   :911320502                   Max.   :28.110   Max.   :39.28  
 perimeter_mean     area_mean      smoothness_mean   compactness_mean 
 Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
 1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
 Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
 Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
 3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
 Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
 concavity_mean    concave_points_mean symmetry_mean   
 Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
 1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
 Median :0.06154   Median :0.03350     Median :0.1792  
 Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
 3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
 Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
 fractal_dimension_mean   radius_se        texture_se      perimeter_se   
 Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
 1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
 Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
 Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
 3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
 Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
    area_se        smoothness_se      compactness_se      concavity_se    
 Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
 1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
 Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
 Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
 3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
 Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
 concave_points_se   symmetry_se       fractal_dimension_se
 Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
 1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
 Median :0.010930   Median :0.018730   Median :0.0031870   
 Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
 3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
 Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
  radius_worst   texture_worst   perimeter_worst    area_worst    
 Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
 1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
 Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
 Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
 3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
 Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
 smoothness_worst  compactness_worst concavity_worst  concave_points_worst
 Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
 1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
 Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
 Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
 3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
 Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
 symmetry_worst   fractal_dimension_worst
 Min.   :0.1565   Min.   :0.05504        
 1st Qu.:0.2504   1st Qu.:0.07146        
 Median :0.2822   Median :0.08004        
 Mean   :0.2901   Mean   :0.08395        
 3rd Qu.:0.3179   3rd Qu.:0.09208        
 Max.   :0.6638   Max.   :0.20750        

Let's split the data into training (70%) and test (30%) sets.

# fix the RNG seed for a reproducible split
set.seed(1000)

# stratified 70/30 partition on the diagnosis outcome;
# drop the id column (column 1), which carries no predictive information
bdata_index <- createDataPartition(bdata$diagnosis, p=0.7, list = FALSE)
train_data <- bdata[bdata_index, -1]
test_data <- bdata[-bdata_index, -1]

dim(train_data) 
[1] 399  31
dim(test_data) 
[1] 170  31
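# createDataPartition() stratifies on the outcome, so both splits should
# keep roughly the full data's class balance (about 37% malignant,
# 63% benign) -- a quick sanity check:
prop.table(table(train_data$diagnosis))
prop.table(table(test_data$diagnosis))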
# fit a Naive Bayes classifier on all features
model_naive <- naiveBayes(diagnosis ~ ., data = train_data)

# predict the diagnosis of the held-out test cases
predict_naive <- predict(model_naive, newdata = test_data)

# confusion matrix: predictions vs. actual diagnoses
cm <- table(predict_naive, test_data$diagnosis)

confusionMatrix(cm) 
Confusion Matrix and Statistics

             
predict_naive Malignant Benign
    Malignant        59      2
    Benign            4    105
                                          
               Accuracy : 0.9647          
                 95% CI : (0.9248, 0.9869)
    No Information Rate : 0.6294          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9238          
 Mcnemar's Test P-Value : 0.6831          
                                          
            Sensitivity : 0.9365          
            Specificity : 0.9813          
         Pos Pred Value : 0.9672          
         Neg Pred Value : 0.9633          
             Prevalence : 0.3706          
         Detection Rate : 0.3471          
   Detection Prevalence : 0.3588          
      Balanced Accuracy : 0.9589          
                                          
       'Positive' Class : Malignant       
                                          

As we can see, the Naive Bayes classifier correctly identified 105 of the 107 benign cases (2 were misclassified as malignant) and 59 of the 63 malignant cases (4 were misclassified as benign, the costlier error in a cancer screen). The overall accuracy of the classifier is 96.47%.
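
As a sanity check, that accuracy figure can be recomputed directly from the confusion matrix:

# correct predictions sit on the diagonal of the confusion matrix
sum(diag(cm)) / sum(cm)  # (59 + 105) / 170 = 0.9647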

Conclusion

The analyses shown in this project are only some of the techniques we can use to study data about the disease. In addition, this project uses survminer, a library we had not used in class, which provides functions that facilitate survival analysis and visualization.

Eleanor R Secoquian