knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(survival)
library(survminer)
## Loading required package: ggpubr
## Loading required package: magrittr
library(ggfortify)
library(dplyr)  # provides mutate(), used later to build the age-group variable
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
This is a survival analysis of a breast cancer data set. Breast cancer is the most common cancer among women worldwide (World Cancer Report 2014. World Health Organization. 2014), accounting for about a quarter of all cases. The survival rate is usually high, but it can depend on many factors. Two data sets are explored. The first is clinical and includes age, chemotherapy status, hormone therapy status, NPI scores, whether breast surgery was performed, and radiotherapy status. The second is a gene expression data set for the same patients, which will hopefully provide some insight into whether specific genes can be tied to breast cancer survival. Through these data sets I hope to find specific clinical traits or genes that can be used to identify significant changes in survival.
clinical <- readRDS("C:/Users/Dustin and Morgan/Desktop/clinical.rds")
# Recode by label rather than by internal factor code: 1 = DECEASED, 0 = LIVING
# (assumes OS_STATUS holds only the labels "DECEASED" and "LIVING")
clinical$OS_STATUS <- as.numeric(as.character(clinical$OS_STATUS) == "DECEASED")
clinical$CS <- clinical$CLAUDIN_SUBTYPE
clinical$HIST <- clinical$HISTOLOGICAL_SUBTYPE
clinical$TG <- clinical$THREEGENE
clinical$IS <- clinical$INFERRED_MENOPAUSAL_STATE
clinical$HIST <- as.character(clinical$HIST)
clinical$HIST[clinical$HIST == "MIXED NST AND A SPECIAL TYPE"] = "MIXED"
clinical$HIST <- as.factor(clinical$HIST)
summary(clinical)
## PATIENT_ID OS_MONTHS OS_STATUS
## Length:1904 Min. : 0.00 Min. :0.0000
## Class :character 1st Qu.: 60.83 1st Qu.:0.0000
## Mode :character Median :114.90 Median :1.0000
## Mean :125.03 Mean :0.5793
## 3rd Qu.:184.47 3rd Qu.:1.0000
## Max. :355.20 Max. :1.0000
##
## VITAL_STATUS INTCLUST COHORT AGE_AT_DIAGNOSIS
## Died of Disease :622 8 :289 Min. :1.000 Min. :21.93
## Died of Other Causes:480 3 :282 1st Qu.:1.000 1st Qu.:51.38
## Living :801 4ER+ :244 Median :3.000 Median :61.77
## NA's : 1 10 :219 Mean :2.644 Mean :61.09
## 5 :184 3rd Qu.:3.000 3rd Qu.:70.59
## 7 :182 Max. :5.000 Max. :96.29
## (Other):504
## LATERALITY NPI ER_IHC INFERRED_MENOPAUSAL_STATE
## l :935 Min. :1.000 neg : 429 post:1493
## null:106 1st Qu.:3.046 pos :1445 pre : 411
## r :863 Median :4.042 NA's: 30
## Mean :4.033
## 3rd Qu.:5.040
## Max. :6.360
##
## BREAST_SURGERY CELLULARITY HER2_SNP6
## BREAST CONSERVING: 755 high :939 GAIN : 417
## MASTECTOMY :1127 low :200 LOSS : 100
## null : 22 moderate:711 NEUT :1383
## null : 54 UNDEF: 4
##
##
##
## THREEGENE CLAUDIN_SUBTYPE CHEMOTHERAPY HORMONE_THERAPY
## ER-/HER2- :290 Basal :199 NO :1508 NO : 730
## ER+/HER2- High Prolif:603 claudin-low:199 YES: 396 YES:1174
## ER+/HER2- Low Prolif :619 Her2 :220
## HER2+ :188 LumA :679
## null :204 LumB :461
## NC : 6
## Normal :140
## RADIO_THERAPY HISTOLOGICAL_SUBTYPE CS HIST
## NO : 767 IDC :1500 Basal :199 IDC :1500
## YES:1137 ILC : 141 claudin-low:199 ILC : 141
## IDC+ILC: 87 Her2 :220 IDC+ILC: 87
## IDC-TUB: 67 LumA :679 IDC-TUB: 67
## IDC-MUC: 42 LumB :461 IDC-MUC: 42
## IDC-MED: 31 NC : 6 IDC-MED: 31
## (Other): 36 Normal :140 (Other): 36
## TG IS
## ER-/HER2- :290 post:1493
## ER+/HER2- High Prolif:603 pre : 411
## ER+/HER2- Low Prolif :619
## HER2+ :188
## null :204
##
##
This shows the variables in the data set, which include the follow-up time (OS_MONTHS) and the status of the patient (OS_STATUS), where 0 is alive and 1 is deceased. Other variables are INTCLUST, cohort, age at diagnosis, laterality (which breast the cancer was found in), NPI (Nottingham Prognostic Index), ER_IHC (where positive means the cancer cells grow in response to estrogen), inferred menopausal state, the type of breast surgery performed, cellularity (% of tumor volume occupied by invasive tumor cells), HER2_SNP6 (human epidermal growth factor receptor 2), THREEGENE, claudin subtype, chemotherapy, hormone therapy, radiotherapy, and histological subtype.
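The 0/1 coding of OS_STATUS described above is easy to get wrong when the column is a factor, because `as.numeric()` on a factor returns internal level codes rather than labels. A minimal sketch of a label-based recode, using a made-up `status` vector as a stand-in for the real column:

```r
# Toy stand-in for clinical$OS_STATUS (hypothetical values, not the real data)
status <- factor(c("LIVING", "DECEASED", "DECEASED", NA, "LIVING"))

# Compare labels, not internal factor codes: 1 = deceased, 0 = living, NA stays NA
event <- as.integer(as.character(status) == "DECEASED")

print(event)

# Cross-check the recode against the original labels
table(status, event, useNA = "ifany")
```

Cross-tabulating the original labels against the recoded values is a cheap way to confirm that no category was silently flipped.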
qplot(clinical$INTCLUST)
This shows a fair number of different INTCLUST types, with 8, 3, 4ER+, and 10 having the largest representation in the data set.
qplot(clinical$AGE_AT_DIAGNOSIS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Age at diagnosis looks almost normally distributed, with the bulk of our patients around 60 years old.
qplot(clinical$LATERALITY)
Looking at laterality, we see that most of our subjects in the data set have a left or right cancer origin, with a smaller population falling under “null”.
qplot(clinical$NPI)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at NPI scores, we see most of our patients are around 4, with the fewest patients in the 1-2 score range.
qplot(clinical$ER_IHC)
For ER_IHC, most of our population in the data set is positive, with about a third of that negative, and a small amount NA.
qplot(clinical$IS)
For inferred menopausal state, the majority are post-menopausal. This makes sense, since age at diagnosis clustered around 60 and cancer tends to develop later in life.
qplot(clinical$BREAST_SURGERY)
Mastectomy was the most common breast surgery in this data set, followed by breast conserving surgery, with a small percentage of null.
qplot(clinical$CELLULARITY)
High and moderate were the most frequent cellularity levels in this data set, with far fewer patients in low and null.
qplot(clinical$HER2_SNP6)
For the HER2_SNP6 variable, NEUT and GAIN were the most common types, with NEUT by far the largest group.
qplot(clinical$TG)
The high and low prolif ER+/HER2- groups were the most common types in the threegene variable. ER-/HER2-, HER2+, and null also had a fair number of patients, each roughly a third to a half the size of the more common types.
qplot(clinical$CS)
LumA and LumB were the most frequent claudin subtypes, with LumA being the most common. NC accounted for very few patients (6).
qplot(clinical$CHEMOTHERAPY)
The majority of these patients did not have chemotherapy in this data set.
qplot(clinical$HORMONE_THERAPY)
The majority of these patients did have hormone therapy for this data set.
qplot(clinical$RADIO_THERAPY)
The majority of these patients had radio therapy as well, but this is a much closer distribution than the hormone therapy and chemotherapy.
qplot(clinical$HIST)
The most common histological subtype by far is IDC, almost to the point where we may not have enough of the other types to use this as a meaningful variable when comparing histological subtypes.
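One common way to handle sparse categories like these is to lump rare levels together before modelling. The sketch below is not part of the original analysis: `lump_rare()` is a hypothetical helper, and the counts are made up for illustration.

```r
# Hypothetical histology labels standing in for clinical$HIST
hist_type <- factor(c(rep("IDC", 50), rep("ILC", 12), rep("DCIS", 1),
                      rep("MIXED", 2), rep("IDC-MUC", 3)))

# Lump every level with fewer than `min_n` patients into a single level
lump_rare <- function(x, min_n = 10, other = "OTHER") {
  counts <- table(x)
  rare <- names(counts)[counts < min_n]
  x <- as.character(x)
  x[x %in% rare] <- other
  factor(x)
}

hist_lumped <- lump_rare(hist_type, min_n = 10)
table(hist_lumped)
```

The threshold `min_n` is a judgment call. Levels with very few patients tend to produce unstable coefficients in a Cox model regardless of where the cut-off is set.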
A Cox proportional hazards model was used to see which of these variables are significant for survival in breast cancer.
coxclinical <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST + COHORT + AGE_AT_DIAGNOSIS + LATERALITY + NPI + ER_IHC + IS + BREAST_SURGERY + CELLULARITY + HER2_SNP6 + TG + CS + CHEMOTHERAPY + HORMONE_THERAPY + RADIO_THERAPY + HIST, data = clinical)
## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 39 ; coefficient may be infinite.
summary(coxclinical)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST + COHORT +
## AGE_AT_DIAGNOSIS + LATERALITY + NPI + ER_IHC + IS + BREAST_SURGERY +
## CELLULARITY + HER2_SNP6 + TG + CS + CHEMOTHERAPY + HORMONE_THERAPY +
## RADIO_THERAPY + HIST, data = clinical)
##
## n= 1874, number of events= 1089
## (30 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## INTCLUST10 -0.400162 0.670211 0.192484 -2.079 0.037623 *
## INTCLUST2 0.092095 1.096469 0.190409 0.484 0.628619
## INTCLUST3 -0.174030 0.840272 0.160485 -1.084 0.278189
## INTCLUST4ER- -0.231097 0.793663 0.229561 -1.007 0.314083
## INTCLUST4ER+ -0.128344 0.879551 0.166020 -0.773 0.439485
## INTCLUST5 0.551684 1.736175 0.211542 2.608 0.009109 **
## INTCLUST6 0.017199 1.017348 0.179034 0.096 0.923468
## INTCLUST7 -0.194490 0.823255 0.164340 -1.183 0.236628
## INTCLUST8 -0.082124 0.921158 0.152720 -0.538 0.590755
## INTCLUST9 0.058605 1.060356 0.161633 0.363 0.716918
## COHORT 0.036019 1.036676 0.028394 1.269 0.204608
## AGE_AT_DIAGNOSIS 0.053083 1.054517 0.003975 13.353 < 2e-16 ***
## LATERALITYnull 0.640261 1.896976 0.133339 4.802 1.57e-06 ***
## LATERALITYr -0.102772 0.902333 0.064413 -1.596 0.110597
## NPI 0.232808 1.262139 0.035971 6.472 9.67e-11 ***
## ER_IHCpos -0.226524 0.797301 0.139413 -1.625 0.104197
## ISpre 0.514892 1.673458 0.122973 4.187 2.83e-05 ***
## BREAST_SURGERYMASTECTOMY 0.249966 1.283982 0.081581 3.064 0.002184 **
## BREAST_SURGERYnull 0.401878 1.494629 0.314965 1.276 0.201974
## CELLULARITYlow 0.110008 1.116287 0.115241 0.955 0.339786
## CELLULARITYmoderate 0.037290 1.037994 0.069232 0.539 0.590144
## CELLULARITYnull 0.011039 1.011100 0.210071 0.053 0.958091
## HER2_SNP6LOSS 0.062863 1.064881 0.171567 0.366 0.714062
## HER2_SNP6NEUT 0.015280 1.015397 0.104886 0.146 0.884176
## HER2_SNP6UNDEF -0.196452 0.821640 0.594983 -0.330 0.741263
## TGER+/HER2- High Prolif 0.018773 1.018951 0.160394 0.117 0.906824
## TGER+/HER2- Low Prolif -0.043477 0.957455 0.163353 -0.266 0.790123
## TGHER2+ -0.408924 0.664365 0.217989 -1.876 0.060670 .
## TGnull -0.001178 0.998822 0.153684 -0.008 0.993882
## CSclaudin-low -0.195379 0.822523 0.159251 -1.227 0.219875
## CSHer2 -0.049823 0.951397 0.160328 -0.311 0.755984
## CSLumA -0.156718 0.854945 0.171453 -0.914 0.360687
## CSLumB -0.013972 0.986125 0.171688 -0.081 0.935139
## CSNC 0.123037 1.130926 0.480743 0.256 0.798005
## CSNormal 0.063251 1.065295 0.191390 0.330 0.741034
## CHEMOTHERAPYYES 0.393407 1.482022 0.113076 3.479 0.000503 ***
## HORMONE_THERAPYYES -0.057645 0.943985 0.077550 -0.743 0.457285
## RADIO_THERAPYYES -0.080586 0.922576 0.082288 -0.979 0.327425
## HISTDCIS -6.749872 0.001171 718.427982 -0.009 0.992504
## HISTIDC 0.120307 1.127843 0.583952 0.206 0.836774
## HISTIDC-MED -0.328760 0.719816 0.647502 -0.508 0.611638
## HISTIDC-MUC 0.188385 1.207298 0.626245 0.301 0.763555
## HISTIDC-TUB -0.117070 0.889523 0.615707 -0.190 0.849200
## HISTIDC+ILC 0.261433 1.298790 0.600519 0.435 0.663312
## HISTILC 0.381559 1.464566 0.591833 0.645 0.519117
## HISTINVASIVE TUMOUR -0.587939 0.555471 0.774588 -0.759 0.447831
## HISTMIXED -1.105652 0.330995 1.162556 -0.951 0.341577
## HISTnull -0.031327 0.969159 1.167941 -0.027 0.978601
## HISTOTHER -0.427040 0.652437 0.771826 -0.553 0.580068
## HISTOTHER INVASIVE NA NA 0.000000 NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## INTCLUST10 0.670211 1.4921 0.45959 0.9774
## INTCLUST2 1.096469 0.9120 0.75495 1.5925
## INTCLUST3 0.840272 1.1901 0.61350 1.1509
## INTCLUST4ER- 0.793663 1.2600 0.50610 1.2446
## INTCLUST4ER+ 0.879551 1.1369 0.63525 1.2178
## INTCLUST5 1.736175 0.5760 1.14691 2.6282
## INTCLUST6 1.017348 0.9829 0.71627 1.4450
## INTCLUST7 0.823255 1.2147 0.59655 1.1361
## INTCLUST8 0.921158 1.0856 0.68287 1.2426
## INTCLUST9 1.060356 0.9431 0.77245 1.4556
## COHORT 1.036676 0.9646 0.98056 1.0960
## AGE_AT_DIAGNOSIS 1.054517 0.9483 1.04633 1.0628
## LATERALITYnull 1.896976 0.5272 1.46071 2.4635
## LATERALITYr 0.902333 1.1082 0.79531 1.0238
## NPI 1.262139 0.7923 1.17622 1.3543
## ER_IHCpos 0.797301 1.2542 0.60667 1.0478
## ISpre 1.673458 0.5976 1.31504 2.1296
## BREAST_SURGERYMASTECTOMY 1.283982 0.7788 1.09425 1.5066
## BREAST_SURGERYnull 1.494629 0.6691 0.80619 2.7710
## CELLULARITYlow 1.116287 0.8958 0.89060 1.3992
## CELLULARITYmoderate 1.037994 0.9634 0.90628 1.1888
## CELLULARITYnull 1.011100 0.9890 0.66986 1.5262
## HER2_SNP6LOSS 1.064881 0.9391 0.76079 1.4905
## HER2_SNP6NEUT 1.015397 0.9848 0.82672 1.2471
## HER2_SNP6UNDEF 0.821640 1.2171 0.25599 2.6371
## TGER+/HER2- High Prolif 1.018951 0.9814 0.74409 1.3953
## TGER+/HER2- Low Prolif 0.957455 1.0444 0.69514 1.3188
## TGHER2+ 0.664365 1.5052 0.43337 1.0185
## TGnull 0.998822 1.0012 0.73905 1.3499
## CSclaudin-low 0.822523 1.2158 0.60199 1.1238
## CSHer2 0.951397 1.0511 0.69485 1.3027
## CSLumA 0.854945 1.1697 0.61094 1.1964
## CSLumB 0.986125 1.0141 0.70435 1.3806
## CSNC 1.130926 0.8842 0.44078 2.9016
## CSNormal 1.065295 0.9387 0.73208 1.5502
## CHEMOTHERAPYYES 1.482022 0.6748 1.18742 1.8497
## HORMONE_THERAPYYES 0.943985 1.0593 0.81088 1.0989
## RADIO_THERAPYYES 0.922576 1.0839 0.78516 1.0840
## HISTDCIS 0.001171 853.9499 0.00000 Inf
## HISTIDC 1.127843 0.8866 0.35908 3.5425
## HISTIDC-MED 0.719816 1.3892 0.20233 2.5608
## HISTIDC-MUC 1.207298 0.8283 0.35380 4.1198
## HISTIDC-TUB 0.889523 1.1242 0.26611 2.9734
## HISTIDC+ILC 1.298790 0.7699 0.40029 4.2141
## HISTILC 1.464566 0.6828 0.45913 4.6717
## HISTINVASIVE TUMOUR 0.555471 1.8003 0.12171 2.5351
## HISTMIXED 0.330995 3.0212 0.03390 3.2314
## HISTnull 0.969159 1.0318 0.09823 9.5620
## HISTOTHER 0.652437 1.5327 0.14373 2.9615
## HISTOTHER INVASIVE NA NA NA NA
##
## Concordance= 0.684 (se = 0.008 )
## Likelihood ratio test= 472.6 on 49 df, p=<2e-16
## Wald test = 463.8 on 49 df, p=<2e-16
## Score (logrank) test = 484.3 on 49 df, p=<2e-16
cox1 <- survfit(coxclinical)
autoplot(cox1)
This flags a few items as being significant to survival such as INTCLUST10, INTCLUST5, Age at diagnosis, Laterality being null, NPI, Inferred Menopausal State, Breast surgery type, and Chemotherapy being performed.
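Reading significant terms off a long summary is error-prone, so it can help to pull hazard ratios and confidence intervals into a data frame and filter them. A minimal sketch using the `lung` data set that ships with the survival package, as a stand-in for the clinical data (the covariates here are illustrative only):

```r
library(survival)

# `lung` ships with the survival package; status is coded 1 = censored, 2 = dead
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

s <- summary(fit)
hr <- data.frame(
  term  = rownames(s$conf.int),
  HR    = s$conf.int[, "exp(coef)"],
  lower = s$conf.int[, "lower .95"],
  upper = s$conf.int[, "upper .95"],
  p     = s$coefficients[, "Pr(>|z|)"],
  row.names = NULL
)

# Keep terms whose 95% interval excludes a hazard ratio of 1
subset(hr, lower > 1 | upper < 1)
```

The same pattern should apply to `coxclinical`. A hazard ratio above 1 indicates higher risk relative to the reference level, and below 1 indicates lower risk.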
Next, survival curves are made for some of the covariates indicated as significant
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ CHEMOTHERAPY, data = clinical))
Oddly enough, this shows that patients who did not receive chemotherapy have a better survival curve than those who did. One possible explanation is that patients who needed chemotherapy had more advanced cancer. To check this, we compare NPI scores in the chemotherapy = yes and chemotherapy = no populations.
#First create quantiles for NPI score
clinical$QNPI <- with(clinical, cut(NPI,
breaks=quantile(NPI, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
CHEMO <- subset(clinical, CHEMOTHERAPY == "YES")
qplot(CHEMO$QNPI)
This shows that the majority of patients who had chemotherapy were those with higher NPI scores, suggesting more advanced cancer.
NOCHEMO <- subset(clinical, CHEMOTHERAPY == "NO")
qplot(NOCHEMO$QNPI)
Now the chemotherapy survival curve starts to make more sense. Patients with a higher NPI score, suggesting more advanced cancer, were more likely to have undergone chemotherapy. Patients who did not undergo chemotherapy were more likely to have a lower NPI score, suggesting less advanced cancer.
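This pattern, where sicker patients are more likely to be treated, can make a treatment look harmful in an unadjusted comparison. The simulation below is a sketch under made-up assumptions (the variable names, effect sizes, and sample size are all hypothetical): treatment has no true effect, yet the crude hazard ratio is inflated until severity is added to the model.

```r
library(survival)
set.seed(1)

n <- 500
severity <- runif(n, 1, 6)                         # stand-in for NPI
treated  <- rbinom(n, 1, plogis(severity - 3.5))   # sicker patients treated more often

# True model: severity raises the hazard, treatment has no effect at all
time   <- rexp(n, rate = 0.01 * exp(0.4 * severity))
status <- rep(1, n)                                # no censoring, for simplicity

crude    <- coxph(Surv(time, status) ~ treated)
adjusted <- coxph(Surv(time, status) ~ treated + severity)

# Hazard ratio for treatment before and after adjusting for severity
round(c(crude    = exp(coef(crude))[["treated"]],
        adjusted = exp(coef(adjusted))[["treated"]]), 2)
```

On the real data, the analogous comparison would be the chemotherapy hazard ratio with and without NPI in the model.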
ggplot(clinical, aes(x=QNPI, color = CHEMOTHERAPY))+
geom_bar()
This shows us again that a higher proportion of patients with higher NPI scores had chemotherapy. We can look at NPI overall as well. Unfortunately, because the chemotherapy survival curves cross, we cannot use a log-rank test on them with confidence: the test can lose power when curves cross.
clinical$QNPI <- with(clinical, cut(NPI,
breaks=quantile(NPI, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ QNPI, data = clinical))
By breaking NPI into quartiles, we can see that generally, moving from Q1 to Q4, the survival curve drops with each quartile. This makes sense because the NPI score is calculated from tumor size, number of involved lymph nodes, and tumor grade, so higher NPI values correspond to more advanced cancer. NPI may therefore be a better indicator of survival than chemotherapy status.
survdiff(formula = Surv(OS_MONTHS, OS_STATUS) ~ QNPI, data = clinical)
## Call:
## survdiff(formula = Surv(OS_MONTHS, OS_STATUS) ~ QNPI, data = clinical)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## QNPI=[1,3.05] 478 227 338 36.31 52.74
## QNPI=(3.05,4.04] 480 252 313 11.98 16.77
## QNPI=(4.04,5.04] 470 296 253 7.47 9.71
## QNPI=(5.04,6.36] 476 328 199 82.88 102.14
##
## Chisq= 140 on 3 degrees of freedom, p= <2e-16
The log rank test shows that the survival functions in these groups are not identical, confirming that NPI score has an effect on the survival function.
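The quartile-then-log-rank workflow can be sketched end to end on the built-in `lung` data, with `age` standing in for NPI (an assumption for illustration only). The p-value is recomputed from the chi-square statistic so that it can be stored or reported programmatically.

```r
library(survival)

d <- lung

# Quartile breaks; unique() guards against tied quantiles producing duplicate breaks
brks <- unique(quantile(d$age, probs = seq(0, 1, by = 0.25), na.rm = TRUE))
d$age_q <- cut(d$age, breaks = brks, include.lowest = TRUE)

# Log-rank test across the quartile groups
lr <- survdiff(Surv(time, status) ~ age_q, data = d)

# Degrees of freedom = number of groups - 1
df <- length(lr$n) - 1
p_value <- pchisq(lr$chisq, df = df, lower.tail = FALSE)

c(chisq = lr$chisq, df = df, p = p_value)
```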
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ INFERRED_MENOPAUSAL_STATE, data = clinical))
This shows that pre-menopausal patients have a better survival curve than post-menopausal patients. This probably ties directly to another variable, age at diagnosis. To check, we use the mean age of about 61 to create an under-61/over-61 variable and compare the survival curves.
clinical1 <- mutate(clinical, AG = ifelse((AGE_AT_DIAGNOSIS < 61), "Under61", "Over61"))
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ AG, data = clinical1))
This agrees with our menopause variable: women under 61 tend to have a better survival curve than those over 61.
survdiff(Surv(OS_MONTHS, OS_STATUS) ~ AG, data = clinical1)
## Call:
## survdiff(formula = Surv(OS_MONTHS, OS_STATUS) ~ AG, data = clinical1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## AG=Over61 995 710 529 62.1 121
## AG=Under61 909 393 574 57.2 121
##
## Chisq= 121 on 1 degrees of freedom, p= <2e-16
Our log-rank test confirms that the survival curves for those over and under 61 are not identical, supporting our theory that age has an effect on survival.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ LATERALITY, data = clinical))
This actually shows us that which breast the cancer develops in does not appear to significantly change the survival curve. However, the null response seems to be the reason why our cox model shows it was significant. It would be interesting to know if these nulls were that of a cancer spread that didn’t have a known origin location, or just insufficient data.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST, data = clinical))
This shows the many levels of the INTCLUST variable and their survival curves. It is difficult to tell many of the levels apart, but our Cox model flagged 10 and 5 as significant. 10 appears to have a higher survival rate than most, while 5 looks like it may be the lowest curve. The colors make this hard to distinguish, so we will compare the two directly and then each against the rest of the patients.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST, data = subset(clinical, INTCLUST %in% c(5, 10))))
When we compare INTCLUST 5 vs. INTCLUST 10, we see a substantial difference in survival curves. It appears that INTCLUST 10 patients have a much better survival rate than INTCLUST 5 patients.
survdiff(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST, data = subset(clinical, INTCLUST %in% c(5, 10)))
## Call:
## survdiff(formula = Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST, data = subset(clinical,
## INTCLUST %in% c(5, 10)))
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## INTCLUST=10 219 104 136.2 7.6 19.1
## INTCLUST=5 184 126 93.8 11.0 19.1
##
## Chisq= 19.1 on 1 degrees of freedom, p= 1e-05
The log rank test again shows that the survival curves are not identical for intclust 10 and intclust 5.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST == 5, data = clinical))
This shows when we compare patients that are positive for integrative cluster 5, their overall survival rate is much lower than the rest of the patients.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST == 10, data = clinical))
This shows that if a patient is positive for integrative cluster 10, their overall survival rate at first is slightly lower than the rest, but after a certain amount of time the survival rate is better than average.
Since the survival rate is much lower with those who have INTCLUST 5, I decided to subset those individuals and run a cox model on them to see if any of the covariates have a significant impact on those patients.
L5 <- subset(clinical, INTCLUST == 5)
coxclinicalL5 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ COHORT + AGE_AT_DIAGNOSIS + NPI + ER_IHC + IS + BREAST_SURGERY + CELLULARITY + HER2_SNP6 + TG + CS + CHEMOTHERAPY + HORMONE_THERAPY + RADIO_THERAPY + HIST, data = L5)
summary(coxclinicalL5)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ COHORT + AGE_AT_DIAGNOSIS +
## NPI + ER_IHC + IS + BREAST_SURGERY + CELLULARITY + HER2_SNP6 +
## TG + CS + CHEMOTHERAPY + HORMONE_THERAPY + RADIO_THERAPY +
## HIST, data = L5)
##
## n= 180, number of events= 123
## (4 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## COHORT -0.061607 0.940252 0.096156 -0.641 0.521717
## AGE_AT_DIAGNOSIS 0.005041 1.005053 0.013242 0.381 0.703460
## NPI 0.492253 1.635998 0.147883 3.329 0.000873 ***
## ER_IHCpos -0.057880 0.943763 0.358479 -0.161 0.871730
## ISpre -0.158717 0.853238 0.334802 -0.474 0.635455
## BREAST_SURGERYMASTECTOMY -0.090428 0.913540 0.268971 -0.336 0.736720
## BREAST_SURGERYnull -0.623352 0.536144 1.109507 -0.562 0.574233
## CELLULARITYlow -0.077823 0.925129 0.407842 -0.191 0.848670
## CELLULARITYmoderate 0.172181 1.187893 0.239488 0.719 0.472169
## CELLULARITYnull -0.163305 0.849332 0.558759 -0.292 0.770085
## HER2_SNP6LOSS NA NA 0.000000 NA NA
## HER2_SNP6NEUT NA NA 0.000000 NA NA
## HER2_SNP6UNDEF 0.192964 1.212839 1.327718 0.145 0.884447
## TGER+/HER2- High Prolif -1.454652 0.233482 0.823217 -1.767 0.077223 .
## TGER+/HER2- Low Prolif NA NA 0.000000 NA NA
## TGHER2+ -1.301742 0.272057 0.691307 -1.883 0.059698 .
## TGnull -0.971167 0.378641 0.777439 -1.249 0.211596
## CSclaudin-low 0.046830 1.047944 0.653612 0.072 0.942882
## CSHer2 0.349554 1.418434 0.443426 0.788 0.430521
## CSLumA 0.937926 2.554676 0.606809 1.546 0.122184
## CSLumB 0.966508 2.628748 0.561971 1.720 0.085459 .
## CSNC NA NA 0.000000 NA NA
## CSNormal 1.448824 4.258105 0.608385 2.381 0.017246 *
## CHEMOTHERAPYYES -0.127841 0.879994 0.339974 -0.376 0.706894
## HORMONE_THERAPYYES -0.905836 0.404204 0.280192 -3.233 0.001225 **
## RADIO_THERAPYYES -0.068000 0.934260 0.262746 -0.259 0.795784
## HISTDCIS NA NA 0.000000 NA NA
## HISTIDC -0.744260 0.475086 0.785944 -0.947 0.343657
## HISTIDC-MED 0.772374 2.164901 1.371640 0.563 0.573365
## HISTIDC-MUC NA NA 0.000000 NA NA
## HISTIDC-TUB -0.850790 0.427077 1.334154 -0.638 0.523669
## HISTIDC+ILC -0.896735 0.407899 1.318650 -0.680 0.496479
## HISTILC 0.315274 1.370635 1.098859 0.287 0.774181
## HISTINVASIVE TUMOUR NA NA 0.000000 NA NA
## HISTMIXED NA NA 0.000000 NA NA
## HISTnull NA NA 0.000000 NA NA
## HISTOTHER NA NA 0.000000 NA NA
## HISTOTHER INVASIVE NA NA 0.000000 NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## COHORT 0.9403 1.0635 0.77875 1.135
## AGE_AT_DIAGNOSIS 1.0051 0.9950 0.97930 1.031
## NPI 1.6360 0.6112 1.22435 2.186
## ER_IHCpos 0.9438 1.0596 0.46744 1.905
## ISpre 0.8532 1.1720 0.44268 1.645
## BREAST_SURGERYMASTECTOMY 0.9135 1.0946 0.53924 1.548
## BREAST_SURGERYnull 0.5361 1.8652 0.06094 4.717
## CELLULARITYlow 0.9251 1.0809 0.41595 2.058
## CELLULARITYmoderate 1.1879 0.8418 0.74289 1.899
## CELLULARITYnull 0.8493 1.1774 0.28409 2.539
## HER2_SNP6LOSS NA NA NA NA
## HER2_SNP6NEUT NA NA NA NA
## HER2_SNP6UNDEF 1.2128 0.8245 0.08988 16.367
## TGER+/HER2- High Prolif 0.2335 4.2830 0.04651 1.172
## TGER+/HER2- Low Prolif NA NA NA NA
## TGHER2+ 0.2721 3.6757 0.07018 1.055
## TGnull 0.3786 2.6410 0.08250 1.738
## CSclaudin-low 1.0479 0.9542 0.29106 3.773
## CSHer2 1.4184 0.7050 0.59479 3.383
## CSLumA 2.5547 0.3914 0.77771 8.392
## CSLumB 2.6287 0.3804 0.87377 7.909
## CSNC NA NA NA NA
## CSNormal 4.2581 0.2348 1.29228 14.031
## CHEMOTHERAPYYES 0.8800 1.1364 0.45195 1.713
## HORMONE_THERAPYYES 0.4042 2.4740 0.23340 0.700
## RADIO_THERAPYYES 0.9343 1.0704 0.55824 1.564
## HISTDCIS NA NA NA NA
## HISTIDC 0.4751 2.1049 0.10181 2.217
## HISTIDC-MED 2.1649 0.4619 0.14720 31.841
## HISTIDC-MUC NA NA NA NA
## HISTIDC-TUB 0.4271 2.3415 0.03125 5.836
## HISTIDC+ILC 0.4079 2.4516 0.03077 5.407
## HISTILC 1.3706 0.7296 0.15906 11.811
## HISTINVASIVE TUMOUR NA NA NA NA
## HISTMIXED NA NA NA NA
## HISTnull NA NA NA NA
## HISTOTHER NA NA NA NA
## HISTOTHER INVASIVE NA NA NA NA
##
## Concordance= 0.665 (se = 0.026 )
## Likelihood ratio test= 41.64 on 27 df, p=0.04
## Wald test = 40.98 on 27 df, p=0.04
## Score (logrank) test = 45.51 on 27 df, p=0.01
This shows that NPI, CLAUDIN subtype normal, and having undergone hormone therapy are all significant covariates for those who are in the INTCLUST 5 group
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ CLAUDIN_SUBTYPE == 'Normal', data = L5))
This seems to show that, among INTCLUST 5 patients, those with claudin subtype = Normal have a lower survival rate than the other claudin subtypes. However, the sample size is very small, so this claim may not hold without more data.
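Before reading much into a curve like this, it helps to count how many patients and deaths sit in each group. A minimal sketch on the built-in `lung` data, where the rare `ph.ecog == 3` group stands in for `CLAUDIN_SUBTYPE == 'Normal'` (an illustrative substitution, not the real variable):

```r
library(survival)

# A rare subgroup in `lung`, used here only as a stand-in
grp    <- lung$ph.ecog == 3
events <- lung$status == 2          # in `lung`, status 2 means death

# Patients and deaths per group (rows with missing ph.ecog are dropped)
counts <- data.frame(
  group  = c("FALSE", "TRUE"),
  n      = as.vector(table(grp)),
  deaths = as.vector(tapply(events, grp, sum))
)
counts
```

A group with only a handful of events gives a Kaplan-Meier curve with very wide uncertainty, whatever its apparent shape.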
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ HORMONE_THERAPY, data = L5))
Interestingly, hormone therapy was not significant for breast cancer survival in our overall model. However, among INTCLUST 5 patients, those who had hormone therapy had a slightly better survival rate than those who did not.
survdiff(Surv(OS_MONTHS, OS_STATUS) ~ HORMONE_THERAPY, data =L5)
## Call:
## survdiff(formula = Surv(OS_MONTHS, OS_STATUS) ~ HORMONE_THERAPY,
## data = L5)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## HORMONE_THERAPY=NO 97 67 55.6 2.34 4.27
## HORMONE_THERAPY=YES 87 59 70.4 1.85 4.27
##
## Chisq= 4.3 on 1 degrees of freedom, p= 0.04
This shows that there is a difference between our curves, but since they cross at the end, this may not be a valid test to run.
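When curves diverge early and cross late, one sensitivity check is a weighted log-rank test that emphasises early event times. In `survdiff()`, `rho = 0` gives the standard log-rank test and `rho = 1` gives the Peto and Peto modification of the Gehan-Wilcoxon test. The sketch below uses the built-in `lung` data rather than the INTCLUST 5 subset:

```r
library(survival)

# Standard log-rank test (rho = 0) weights all event times equally
lr <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)

# rho = 1 puts more weight on early differences between the curves
pp <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)

c(logrank = lr$chisq, peto_peto = pp$chisq)
```

Neither test fully resolves the problem of crossing curves, but if the two disagree sharply, that suggests the conclusion depends on which part of follow-up is emphasised.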
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ QNPI, data = L5))
This seems to show us generally the same result as QNPI did on the full clinical data set, but with the first quantile possibly with a lower survival rate than the second. However, low volume would prevent us from making a decision on this theory.
L10 <- subset(clinical, INTCLUST == 10)
coxclinicalL10 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ COHORT + AGE_AT_DIAGNOSIS + NPI + ER_IHC + IS + BREAST_SURGERY + CELLULARITY + HER2_SNP6 + TG + CS + CHEMOTHERAPY + HORMONE_THERAPY + RADIO_THERAPY + HIST, data = L10)
## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 23,28,29 ; coefficient may be infinite.
summary(coxclinicalL10)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ COHORT + AGE_AT_DIAGNOSIS +
## NPI + ER_IHC + IS + BREAST_SURGERY + CELLULARITY + HER2_SNP6 +
## TG + CS + CHEMOTHERAPY + HORMONE_THERAPY + RADIO_THERAPY +
## HIST, data = L10)
##
## n= 218, number of events= 104
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## COHORT 6.002e-02 1.062e+00 9.889e-02 0.607 0.54391
## AGE_AT_DIAGNOSIS 3.220e-02 1.033e+00 1.431e-02 2.251 0.02441 *
## NPI 1.689e-01 1.184e+00 1.949e-01 0.867 0.38607
## ER_IHCpos -5.056e-01 6.031e-01 4.660e-01 -1.085 0.27787
## ISpre 2.510e-01 1.285e+00 3.783e-01 0.663 0.50709
## BREAST_SURGERYMASTECTOMY -6.840e-02 9.339e-01 2.316e-01 -0.295 0.76778
## BREAST_SURGERYnull 1.648e+00 5.198e+00 5.606e-01 2.940 0.00328 **
## CELLULARITYlow 7.263e-01 2.067e+00 3.398e-01 2.138 0.03254 *
## CELLULARITYmoderate 1.988e-01 1.220e+00 2.705e-01 0.735 0.46248
## CELLULARITYnull -2.438e-01 7.836e-01 1.035e+00 -0.236 0.81377
## HER2_SNP6LOSS -6.661e-02 9.356e-01 5.813e-01 -0.115 0.90876
## HER2_SNP6NEUT -2.661e-02 9.737e-01 3.791e-01 -0.070 0.94403
## HER2_SNP6UNDEF NA NA 0.000e+00 NA NA
## TGER+/HER2- High Prolif 3.809e+00 4.509e+01 1.436e+00 2.652 0.00800 **
## TGER+/HER2- Low Prolif NA NA 0.000e+00 NA NA
## TGHER2+ -3.564e-01 7.002e-01 1.110e+00 -0.321 0.74810
## TGnull 8.386e-01 2.313e+00 2.613e-01 3.209 0.00133 **
## CSclaudin-low -3.036e-01 7.381e-01 2.712e-01 -1.120 0.26289
## CSHer2 -1.127e-01 8.934e-01 6.246e-01 -0.180 0.85684
## CSLumA -1.569e+00 2.082e-01 8.664e+03 0.000 0.99986
## CSLumB -3.574e+00 2.803e-02 1.400e+00 -2.553 0.01067 *
## CSNC NA NA 0.000e+00 NA NA
## CSNormal -1.742e+01 2.707e-08 8.223e+03 -0.002 0.99831
## CHEMOTHERAPYYES 3.933e-01 1.482e+00 3.571e-01 1.101 0.27073
## HORMONE_THERAPYYES 3.405e-01 1.406e+00 2.436e-01 1.398 0.16220
## RADIO_THERAPYYES -2.716e-01 7.622e-01 2.803e-01 -0.969 0.33256
## HISTDCIS NA NA 0.000e+00 NA NA
## HISTIDC 1.773e+01 4.991e+07 6.963e+03 0.003 0.99797
## HISTIDC-MED 1.672e+01 1.817e+07 6.963e+03 0.002 0.99808
## HISTIDC-MUC NA NA 0.000e+00 NA NA
## HISTIDC-TUB -5.160e-01 5.969e-01 9.848e+03 0.000 0.99996
## HISTIDC+ILC NA NA 0.000e+00 NA NA
## HISTILC -2.873e-01 7.503e-01 8.664e+03 0.000 0.99997
## HISTINVASIVE TUMOUR NA NA 0.000e+00 NA NA
## HISTMIXED NA NA 0.000e+00 NA NA
## HISTnull NA NA 0.000e+00 NA NA
## HISTOTHER NA NA 0.000e+00 NA NA
## HISTOTHER INVASIVE NA NA 0.000e+00 NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## COHORT 1.062e+00 9.417e-01 0.874763 1.2890
## AGE_AT_DIAGNOSIS 1.033e+00 9.683e-01 1.004167 1.0621
## NPI 1.184e+00 8.446e-01 0.808123 1.7348
## ER_IHCpos 6.031e-01 1.658e+00 0.241981 1.5033
## ISpre 1.285e+00 7.780e-01 0.612309 2.6979
## BREAST_SURGERYMASTECTOMY 9.339e-01 1.071e+00 0.593086 1.4705
## BREAST_SURGERYnull 5.198e+00 1.924e-01 1.732392 15.5978
## CELLULARITYlow 2.067e+00 4.837e-01 1.062255 4.0238
## CELLULARITYmoderate 1.220e+00 8.197e-01 0.717886 2.0730
## CELLULARITYnull 7.836e-01 1.276e+00 0.103040 5.9593
## HER2_SNP6LOSS 9.356e-01 1.069e+00 0.299423 2.9232
## HER2_SNP6NEUT 9.737e-01 1.027e+00 0.463196 2.0470
## HER2_SNP6UNDEF NA NA NA NA
## TGER+/HER2- High Prolif 4.509e+01 2.218e-02 2.702484 752.3891
## TGER+/HER2- Low Prolif NA NA NA NA
## TGHER2+ 7.002e-01 1.428e+00 0.079528 6.1645
## TGnull 2.313e+00 4.323e-01 1.385932 3.8603
## CSclaudin-low 7.381e-01 1.355e+00 0.433814 1.2560
## CSHer2 8.934e-01 1.119e+00 0.262665 3.0390
## CSLumA 2.082e-01 4.803e+00 0.000000 Inf
## CSLumB 2.803e-02 3.568e+01 0.001803 0.4358
## CSNC NA NA NA NA
## CSNormal 2.707e-08 3.694e+07 0.000000 Inf
## CHEMOTHERAPYYES 1.482e+00 6.748e-01 0.735952 2.9838
## HORMONE_THERAPYYES 1.406e+00 7.114e-01 0.871999 2.2658
## RADIO_THERAPYYES 7.622e-01 1.312e+00 0.440056 1.3201
## HISTDCIS NA NA NA NA
## HISTIDC 4.991e+07 2.004e-08 0.000000 Inf
## HISTIDC-MED 1.817e+07 5.504e-08 0.000000 Inf
## HISTIDC-MUC NA NA NA NA
## HISTIDC-TUB 5.969e-01 1.675e+00 0.000000 Inf
## HISTIDC+ILC NA NA NA NA
## HISTILC 7.503e-01 1.333e+00 0.000000 Inf
## HISTINVASIVE TUMOUR NA NA NA NA
## HISTMIXED NA NA NA NA
## HISTnull NA NA NA NA
## HISTOTHER NA NA NA NA
## HISTOTHER INVASIVE NA NA NA NA
##
## Concordance= 0.693 (se = 0.026 )
## Likelihood ratio test= 55.54 on 27 df, p=0.001
## Wald test = 47.91 on 27 df, p=0.008
## Score (logrank) test = 56.38 on 27 df, p=8e-04
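Looking at INTCLUST 10 patients, a few more covariates are significant than for INTCLUST 5: age at diagnosis, breast surgery = null, cellularity = low, Threegene = null or ER+/HER2- High Prolif, and claudin subtype = LumB. Note that this fit produced a convergence warning, and several coefficients are NA or have enormous standard errors, so these estimates should be read with caution.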
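Coefficients that were dropped (NA) or that have very large standard errors usually point to levels with too few patients or events. The helper below is a hypothetical sketch for flagging them in a coefficient table. `flag_unstable()` and its `max_se` threshold are assumptions, and the example rows are copied from the INTCLUST 10 summary above.

```r
# Flag coefficients that were dropped (NA) or have an implausibly large standard error
flag_unstable <- function(coef_table, max_se = 10) {
  est <- unname(coef_table[, "coef"])
  se  <- unname(coef_table[, "se(coef)"])
  data.frame(
    term    = rownames(coef_table),
    dropped = is.na(est),
    huge_se = !is.na(est) & se > max_se,
    row.names = NULL
  )
}

# A few rows copied from the INTCLUST 10 summary (coef, se(coef))
tab <- rbind(
  "AGE_AT_DIAGNOSIS" = c(3.220e-02, 1.431e-02),
  "CSLumA"           = c(-1.569, 8.664e+03),
  "CSLumB"           = c(-3.574, 1.400),
  "HISTIDC"          = c(1.773e+01, 6.963e+03),
  "HISTDCIS"         = c(NA, 0)
)
colnames(tab) <- c("coef", "se(coef)")

res <- flag_unstable(tab)
res
```

On a fitted model, the same function could be applied to `summary(fit)$coefficients`, which uses these column names.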
L10AG <- mutate(L10, AG = ifelse((AGE_AT_DIAGNOSIS < 61), "Under61", "Over61"))
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ AG, data = L10AG))
Again we will split the ages by over/under 61, and for this we see that the curves are very similar to start, but start to diverge after t=100 to where those under61 have a better survival curve than those who are over 61.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ THREEGENE == 'null', data = L10))
This shows that patients with Threegene = null had a lower survival rate than subjects with any other threegene response. However, the sample is again quite small, so more data would be needed before drawing this conclusion.
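# The THREEGENE level is 'ER+/HER2- High Prolif'; 'ER-/HER2- High Prolif' does not exist in the data
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ THREEGENE == 'ER+/HER2- High Prolif', data = L10))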
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ THREEGENE, data = L10))
Looking at ER-/HER2- High Prolif vs. all of the Threegene responses, it seems like ER-/HER2- High Prolif has a slightly better survival rate at the beginning, but again with low volume it is difficult to make this conclusion.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ BREAST_SURGERY == 'null', data = L10))
Again, with a small sample size no real conclusion can be drawn from this plot. It would seem to suggest that patients with no recorded breast surgery (null) have a much lower survival rate, but that cannot be concluded from the information we have.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ CELLULARITY == 'low', data = L10))
This is another variable with a small sample size. The plot would suggest that patients with low cellularity have a lower survival curve, but that would need further testing.
autoplot(survfit(Surv(OS_MONTHS, OS_STATUS) ~ CLAUDIN_SUBTYPE == 'LumB', data = L10))
This would suggest that INTCLUST 10 patients with Claudin subtype = LumB have a higher survival curve than INTCLUST 10 patients with other subtypes.
coxclinicalpt2 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST + AGE_AT_DIAGNOSIS + LATERALITY + NPI + IS + BREAST_SURGERY + CHEMOTHERAPY, data = clinical)
AIC(coxclinical, coxclinicalpt2)
## Warning in AIC.default(coxclinical, coxclinicalpt2): models are not all fitted
## to the same number of observations
## df AIC
## coxclinical 49 14476.52
## coxclinicalpt2 18 14681.97
This tells us that removing all the covariates from the model except the significant ones does not result in a better model: the full model has the lower AIC (14476.52 vs. 14681.97). Note the warning, though: the two models were not fitted to the same number of observations, so the AIC values are not strictly comparable. The result could possibly be due to some variables that were not significant in our original Cox model still being significant within some of our subsets, as we saw in our exploration of INTCLUST 5 & 10.
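The warning about differing numbers of observations usually comes from rows being dropped for missing values in covariates that only one of the models uses. One way around it, sketched here under the assumption that missing values are the cause, is to restrict both fits to the same complete cases first. The example uses the `lung` data from the survival package as a stand-in, and the variable choices are purely illustrative.

```r
library(survival)

# Stand-in data: `lung` has missing values in several covariates,
# so fits with different covariates would otherwise use different rows.
vars <- c("time", "status", "age", "sex", "ph.ecog", "ph.karno", "wt.loss")
lung_cc <- lung[complete.cases(lung[, vars]), vars]

full_fit    <- coxph(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + wt.loss,
                     data = lung_cc)
reduced_fit <- coxph(Surv(time, status) ~ sex + ph.ecog, data = lung_cc)

# Both fits now use identical rows, so the AIC values are directly comparable
aic_table <- AIC(full_fit, reduced_fit)
aic_table
```

Applied to this report, the same idea would mean building one complete-case version of `clinical` over every covariate in the larger model and fitting both models to it.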
aaregclinical <- aareg(Surv(OS_MONTHS, OS_STATUS) ~ INTCLUST + LATERALITY + NPI + INFERRED_MENOPAUSAL_STATE + BREAST_SURGERY + CELLULARITY + CHEMOTHERAPY, data = clinical)
autoplot(aaregclinical)+theme(legend.position = "none")
This Aalen additive regression plot shows us how the estimated effect of each covariate changes over time. It is interesting to see the curves for Breast Surgery = mastectomy and for patients having chemotherapy.
gene <- readRDS("C:/Users/Dustin and Morgan/Desktop/gene_expression.rds")
# cbind() pairs rows by position, so this assumes both tables list the patients in the same order
geneclinical <- cbind(clinical, gene)
geneclincox <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 + CDH1 + PTEN + STK11 +TP53 + ATM + BARD1 + CASP8 + CTLA4 + CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT, data = geneclinical)
summary(geneclincox)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 +
## CDH1 + PTEN + STK11 + TP53 + ATM + BARD1 + CASP8 + CTLA4 +
## CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT, data = geneclinical)
##
## n= 1904, number of events= 1103
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BRCA1 -0.10850 0.89718 0.22330 -0.486 0.62705
## BRCA2 0.40503 1.49934 0.08722 4.644 3.42e-06 ***
## CDH1 0.15621 1.16908 0.21986 0.711 0.47739
## PTEN 0.27725 1.31950 0.05365 5.168 2.36e-07 ***
## STK11 0.19620 1.21677 0.08226 2.385 0.01708 *
## TP53 0.38638 1.47164 0.18901 2.044 0.04093 *
## ATM 0.01648 1.01662 0.09562 0.172 0.86313
## BARD1 -0.10585 0.89956 0.12528 -0.845 0.39817
## CASP8 -0.17005 0.84362 0.22217 -0.765 0.44403
## CTLA4 0.13552 1.14513 0.05507 2.461 0.01386 *
## CYP19A1 -0.07757 0.92536 0.04898 -1.584 0.11330
## FGFR2 0.01954 1.01973 0.03644 0.536 0.59187
## LSP1 -0.21961 0.80283 0.07875 -2.789 0.00529 **
## MAP3K1 0.05887 1.06064 0.09135 0.644 0.51931
## NBN 0.19444 1.21463 0.08128 2.392 0.01675 *
## RAD51 0.07100 1.07358 0.20395 0.348 0.72776
## TERT -0.21812 0.80403 0.06634 -3.288 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BRCA1 0.8972 1.1146 0.5792 1.3898
## BRCA2 1.4993 0.6670 1.2637 1.7789
## CDH1 1.1691 0.8554 0.7598 1.7988
## PTEN 1.3195 0.7579 1.1878 1.4658
## STK11 1.2168 0.8218 1.0356 1.4297
## TP53 1.4716 0.6795 1.0161 2.1315
## ATM 1.0166 0.9837 0.8429 1.2262
## BARD1 0.8996 1.1117 0.7037 1.1499
## CASP8 0.8436 1.1854 0.5458 1.3039
## CTLA4 1.1451 0.8733 1.0280 1.2757
## CYP19A1 0.9254 1.0807 0.8407 1.0186
## FGFR2 1.0197 0.9807 0.9494 1.0952
## LSP1 0.8028 1.2456 0.6880 0.9368
## MAP3K1 1.0606 0.9428 0.8868 1.2686
## NBN 1.2146 0.8233 1.0358 1.4244
## RAD51 1.0736 0.9315 0.7198 1.6012
## TERT 0.8040 1.2437 0.7060 0.9157
##
## Concordance= 0.599 (se = 0.009 )
## Likelihood ratio test= 88.22 on 17 df, p=1e-11
## Wald test = 89.98 on 17 df, p=6e-12
## Score (logrank) test = 90.1 on 17 df, p=6e-12
Using a Cox model based on a list of genes reported to have an impact on breast cancer (source: https://www.nationalbreastcancer.org/other-breast-cancer-genes), the genes significant at the 0.05 level are BRCA2, PTEN, STK11, TP53, CTLA4, LSP1, NBN, and TERT. ATM is not significant here (p = 0.86), although it is carried into the next model.
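As a quick check on how to read the summary table, the hazard ratio and its 95% confidence interval follow directly from the coefficient and its standard error. A small sketch using the BRCA2 row (the two numbers are copied from the output above; the variable names are only illustrative):

```r
# Values copied from the BRCA2 row of the summary above
coef_brca2 <- 0.40503
se_brca2   <- 0.08722

# Hazard ratio per one-unit increase in expression
hr <- exp(coef_brca2)

# Wald 95% confidence interval on the hazard-ratio scale
lower <- exp(coef_brca2 - qnorm(0.975) * se_brca2)
upper <- exp(coef_brca2 + qnorm(0.975) * se_brca2)

round(c(hr = hr, lower = lower, upper = upper), 4)
```

Up to rounding, this should reproduce the exp(coef), lower .95 and upper .95 columns for BRCA2. A hazard ratio above 1 means higher expression is associated with a higher hazard of death, and below 1 (as for LSP1 and TERT) with a lower one.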
geneclincox2 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ BRCA2 + PTEN + STK11 +TP53 + ATM + CTLA4 + LSP1 + NBN + TERT + INTCLUST + LATERALITY + NPI + IS + BREAST_SURGERY + AGE_AT_DIAGNOSIS + CHEMOTHERAPY, data = geneclinical)
summary(geneclincox2)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ BRCA2 + PTEN + STK11 +
## TP53 + ATM + CTLA4 + LSP1 + NBN + TERT + INTCLUST + LATERALITY +
## NPI + IS + BREAST_SURGERY + AGE_AT_DIAGNOSIS + CHEMOTHERAPY,
## data = geneclinical)
##
## n= 1904, number of events= 1103
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BRCA2 0.112928 1.119551 0.092206 1.225 0.220675
## PTEN 0.239694 1.270861 0.066874 3.584 0.000338 ***
## STK11 0.120160 1.127677 0.080362 1.495 0.134853
## TP53 0.415365 1.514923 0.188029 2.209 0.027171 *
## ATM 0.170071 1.185389 0.097170 1.750 0.080075 .
## CTLA4 0.153601 1.166026 0.055884 2.749 0.005986 **
## LSP1 -0.261818 0.769651 0.080128 -3.267 0.001085 **
## NBN 0.203465 1.225642 0.082080 2.479 0.013180 *
## TERT -0.203051 0.816236 0.067096 -3.026 0.002476 **
## INTCLUST10 -0.613731 0.541327 0.175168 -3.504 0.000459 ***
## INTCLUST2 0.102642 1.108094 0.188927 0.543 0.586931
## INTCLUST3 -0.247156 0.781019 0.158267 -1.562 0.118374
## INTCLUST4ER- -0.258340 0.772332 0.210216 -1.229 0.219098
## INTCLUST4ER+ -0.204196 0.815303 0.167388 -1.220 0.222504
## INTCLUST5 0.068813 1.071236 0.158224 0.435 0.663630
## INTCLUST6 -0.083801 0.919614 0.180701 -0.464 0.642822
## INTCLUST7 -0.259875 0.771148 0.164251 -1.582 0.113607
## INTCLUST8 -0.131000 0.877218 0.150574 -0.870 0.384300
## INTCLUST9 -0.105584 0.899799 0.162356 -0.650 0.515483
## LATERALITYnull 0.639837 1.896171 0.128277 4.988 6.10e-07 ***
## LATERALITYr -0.058580 0.943103 0.063723 -0.919 0.357942
## NPI 0.206454 1.229312 0.031896 6.473 9.62e-11 ***
## ISpre 0.510982 1.666928 0.120274 4.248 2.15e-05 ***
## BREAST_SURGERYMASTECTOMY 0.299936 1.349773 0.066680 4.498 6.85e-06 ***
## BREAST_SURGERYnull 0.512199 1.668957 0.295466 1.734 0.083002 .
## AGE_AT_DIAGNOSIS 0.052767 1.054184 0.003881 13.598 < 2e-16 ***
## CHEMOTHERAPYYES 0.402654 1.495789 0.100704 3.998 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BRCA2 1.1196 0.8932 0.9345 1.3413
## PTEN 1.2709 0.7869 1.1147 1.4488
## STK11 1.1277 0.8868 0.9633 1.3200
## TP53 1.5149 0.6601 1.0479 2.1900
## ATM 1.1854 0.8436 0.9798 1.4341
## CTLA4 1.1660 0.8576 1.0451 1.3010
## LSP1 0.7697 1.2993 0.6578 0.9005
## NBN 1.2256 0.8159 1.0435 1.4396
## TERT 0.8162 1.2251 0.7157 0.9310
## INTCLUST10 0.5413 1.8473 0.3840 0.7631
## INTCLUST2 1.1081 0.9025 0.7652 1.6047
## INTCLUST3 0.7810 1.2804 0.5727 1.0651
## INTCLUST4ER- 0.7723 1.2948 0.5115 1.1661
## INTCLUST4ER+ 0.8153 1.2265 0.5873 1.1319
## INTCLUST5 1.0712 0.9335 0.7856 1.4607
## INTCLUST6 0.9196 1.0874 0.6453 1.3104
## INTCLUST7 0.7711 1.2968 0.5589 1.0640
## INTCLUST8 0.8772 1.1400 0.6530 1.1784
## INTCLUST9 0.8998 1.1114 0.6546 1.2369
## LATERALITYnull 1.8962 0.5274 1.4746 2.4382
## LATERALITYr 0.9431 1.0603 0.8324 1.0686
## NPI 1.2293 0.8135 1.1548 1.3086
## ISpre 1.6669 0.5999 1.3169 2.1101
## BREAST_SURGERYMASTECTOMY 1.3498 0.7409 1.1844 1.5382
## BREAST_SURGERYnull 1.6690 0.5992 0.9353 2.9781
## AGE_AT_DIAGNOSIS 1.0542 0.9486 1.0462 1.0622
## CHEMOTHERAPYYES 1.4958 0.6685 1.2279 1.8222
##
## Concordance= 0.684 (se = 0.009 )
## Likelihood ratio test= 484.6 on 27 df, p=<2e-16
## Wald test = 476.2 on 27 df, p=<2e-16
## Score (logrank) test = 493.3 on 27 df, p=<2e-16
This model shows a number of significant covariates: the genes PTEN, TP53, CTLA4, LSP1, NBN, and TERT, along with INTCLUST 10, LATERALITY = null, NPI, inferred menopausal state = pre-menopausal, having a mastectomy, age at diagnosis, and chemotherapy = yes. Another model will be fit keeping only the significant genes, and we will use AIC to see which has the better fit.
geneclincox3 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ PTEN + TP53 + CTLA4 + LSP1 + NBN + TERT + INTCLUST + LATERALITY + NPI + IS + BREAST_SURGERY + AGE_AT_DIAGNOSIS + CHEMOTHERAPY, data = geneclinical)
AIC(geneclincox2, geneclincox3)
## df AIC
## geneclincox2 27 14647.34
## geneclincox3 24 14648.57
Interestingly, AIC shows that the model with all our covariates and genes is slightly better at explaining our data than the model restricted to significant genes (14647.34 vs. 14648.57, a difference of only about 1.2). This would suggest that arbitrarily removing variables from our model is not a good thing. We can compare both to our original gene-only model as well.
AIC(geneclincox, geneclincox2, geneclincox3)
## df AIC
## geneclincox 17 15023.72
## geneclincox2 27 14647.34
## geneclincox3 24 14648.57
This shows that both models combining the gene data set with the clinical data set explain our data better than the model that uses genes alone. It seems to suggest that a combination of clinical and genetic characteristics is needed to explain the survival of breast cancer patients, which could be useful clinically. For researchers, it indicates that focusing only on genetic or only on clinical characteristics would not give the whole story when investigating this topic.
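Because the third model uses a subset of the second model's terms, those two are nested, and a likelihood-ratio test is one alternative to comparing AIC values. The sketch below uses the `lung` data from the survival package as a stand-in with illustrative covariates; with the models in this report the call would presumably be `anova(geneclincox3, geneclincox2)`.

```r
library(survival)

# Stand-in data, restricted to complete cases so both fits use the same rows
vars <- c("time", "status", "age", "sex", "ph.ecog", "wt.loss")
lung_cc <- lung[complete.cases(lung[, vars]), vars]

smaller <- coxph(Surv(time, status) ~ sex + ph.ecog, data = lung_cc)
larger  <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss, data = lung_cc)

# Likelihood-ratio test: do the extra terms in `larger` improve the fit?
lrt <- anova(smaller, larger)
lrt
```

A large p-value here would say the extra terms add little, which is the same question the small AIC difference above is hinting at. The gene-only model is not nested in the combined models, so for that comparison AIC remains the more natural tool.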
When our gene data set was added to our clinical data set, INTCLUST 10 was the only INTCLUST level that was significant. With this we can now look at the genes within patients limited to INTCLUST 10 and see which may have an impact.
G10 <- subset(geneclinical, INTCLUST == 10)
genecoxclinicalL10 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 + CDH1 + PTEN + STK11 +TP53 + ATM + BARD1 + CASP8 + CTLA4 + CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT+ LATERALITY + NPI + INFERRED_MENOPAUSAL_STATE + BREAST_SURGERY + AGE_AT_DIAGNOSIS + CHEMOTHERAPY, data = G10)
summary(genecoxclinicalL10)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 +
## CDH1 + PTEN + STK11 + TP53 + ATM + BARD1 + CASP8 + CTLA4 +
## CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT + LATERALITY +
## NPI + INFERRED_MENOPAUSAL_STATE + BREAST_SURGERY + AGE_AT_DIAGNOSIS +
## CHEMOTHERAPY, data = G10)
##
## n= 219, number of events= 104
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BRCA1 -0.028281 0.972115 0.846902 -0.033 0.97336
## BRCA2 0.294597 1.342585 0.294755 0.999 0.31757
## CDH1 0.479831 1.615801 0.815985 0.588 0.55651
## PTEN 0.144698 1.155691 0.198833 0.728 0.46677
## STK11 0.346679 1.414362 0.256884 1.350 0.17716
## TP53 0.025830 1.026167 0.584406 0.044 0.96475
## ATM 0.105604 1.111381 0.258763 0.408 0.68319
## BARD1 -0.037027 0.963650 0.363144 -0.102 0.91879
## CASP8 -0.730440 0.481697 0.753662 -0.969 0.33245
## CTLA4 -0.032703 0.967825 0.157262 -0.208 0.83526
## CYP19A1 -0.001335 0.998666 0.152202 -0.009 0.99300
## FGFR2 -0.074534 0.928176 0.129162 -0.577 0.56390
## LSP1 -0.251084 0.777957 0.244325 -1.028 0.30411
## MAP3K1 -0.492720 0.610962 0.386783 -1.274 0.20270
## NBN 0.257665 1.293905 0.258267 0.998 0.31844
## RAD51 -0.197629 0.820674 0.584996 -0.338 0.73549
## TERT -0.246571 0.781476 0.260270 -0.947 0.34345
## LATERALITYnull 1.029540 2.799778 0.394833 2.608 0.00912 **
## LATERALITYr 0.128532 1.137157 0.227770 0.564 0.57255
## NPI 0.216408 1.241609 0.185431 1.167 0.24319
## INFERRED_MENOPAUSAL_STATEpre -0.008909 0.991131 0.378064 -0.024 0.98120
## BREAST_SURGERYMASTECTOMY 0.196144 1.216702 0.224739 0.873 0.38279
## BREAST_SURGERYnull 1.032426 2.807868 0.655793 1.574 0.11541
## AGE_AT_DIAGNOSIS 0.024687 1.024994 0.015290 1.615 0.10639
## CHEMOTHERAPYYES 0.340553 1.405725 0.301912 1.128 0.25932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BRCA1 0.9721 1.0287 0.1849 5.112
## BRCA2 1.3426 0.7448 0.7534 2.392
## CDH1 1.6158 0.6189 0.3265 7.998
## PTEN 1.1557 0.8653 0.7827 1.706
## STK11 1.4144 0.7070 0.8549 2.340
## TP53 1.0262 0.9745 0.3264 3.226
## ATM 1.1114 0.8998 0.6693 1.846
## BARD1 0.9637 1.0377 0.4729 1.963
## CASP8 0.4817 2.0760 0.1100 2.110
## CTLA4 0.9678 1.0332 0.7111 1.317
## CYP19A1 0.9987 1.0013 0.7411 1.346
## FGFR2 0.9282 1.0774 0.7206 1.196
## LSP1 0.7780 1.2854 0.4819 1.256
## MAP3K1 0.6110 1.6368 0.2863 1.304
## NBN 1.2939 0.7729 0.7799 2.147
## RAD51 0.8207 1.2185 0.2607 2.583
## TERT 0.7815 1.2796 0.4692 1.302
## LATERALITYnull 2.7998 0.3572 1.2913 6.070
## LATERALITYr 1.1372 0.8794 0.7277 1.777
## NPI 1.2416 0.8054 0.8633 1.786
## INFERRED_MENOPAUSAL_STATEpre 0.9911 1.0089 0.4724 2.079
## BREAST_SURGERYMASTECTOMY 1.2167 0.8219 0.7832 1.890
## BREAST_SURGERYnull 2.8079 0.3561 0.7765 10.153
## AGE_AT_DIAGNOSIS 1.0250 0.9756 0.9947 1.056
## CHEMOTHERAPYYES 1.4057 0.7114 0.7779 2.540
##
## Concordance= 0.662 (se = 0.027 )
## Likelihood ratio test= 38.85 on 25 df, p=0.04
## Wald test = 43.8 on 25 df, p=0.01
## Score (logrank) test = 50.32 on 25 df, p=0.002
Oddly enough, this shows that no genes are significant, and laterality = null is the only significant variable. I will attempt this again, removing the clinical covariates and looking only at genes.
genecoxclinicalL102 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 + CDH1 + PTEN + STK11 +TP53 + ATM + BARD1 + CASP8 + CTLA4 + CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT, data = G10)
summary(genecoxclinicalL102)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 +
## CDH1 + PTEN + STK11 + TP53 + ATM + BARD1 + CASP8 + CTLA4 +
## CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT, data = G10)
##
## n= 219, number of events= 104
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BRCA1 0.08984 1.09400 0.82256 0.109 0.9130
## BRCA2 0.41490 1.51421 0.27534 1.507 0.1318
## CDH1 0.05918 1.06096 0.79621 0.074 0.9408
## PTEN 0.14252 1.15318 0.19215 0.742 0.4583
## STK11 0.43541 1.54560 0.24991 1.742 0.0815 .
## TP53 0.31930 1.37616 0.57077 0.559 0.5759
## ATM 0.11560 1.12254 0.23840 0.485 0.6278
## BARD1 -0.40478 0.66713 0.34754 -1.165 0.2441
## CASP8 -0.51568 0.59709 0.73941 -0.697 0.4855
## CTLA4 -0.04635 0.95471 0.15271 -0.304 0.7615
## CYP19A1 -0.13210 0.87626 0.14847 -0.890 0.3736
## FGFR2 -0.03510 0.96551 0.12826 -0.274 0.7843
## LSP1 -0.13136 0.87690 0.23994 -0.547 0.5841
## MAP3K1 -0.56580 0.56790 0.36443 -1.553 0.1205
## NBN 0.32584 1.38520 0.24493 1.330 0.1834
## RAD51 0.06416 1.06627 0.55446 0.116 0.9079
## TERT -0.42513 0.65369 0.25350 -1.677 0.0935 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BRCA1 1.0940 0.9141 0.2182 5.485
## BRCA2 1.5142 0.6604 0.8827 2.598
## CDH1 1.0610 0.9425 0.2228 5.052
## PTEN 1.1532 0.8672 0.7913 1.681
## STK11 1.5456 0.6470 0.9471 2.522
## TP53 1.3762 0.7267 0.4496 4.212
## ATM 1.1225 0.8908 0.7035 1.791
## BARD1 0.6671 1.4990 0.3376 1.318
## CASP8 0.5971 1.6748 0.1402 2.543
## CTLA4 0.9547 1.0474 0.7078 1.288
## CYP19A1 0.8763 1.1412 0.6550 1.172
## FGFR2 0.9655 1.0357 0.7509 1.241
## LSP1 0.8769 1.1404 0.5479 1.403
## MAP3K1 0.5679 1.7609 0.2780 1.160
## NBN 1.3852 0.7219 0.8571 2.239
## RAD51 1.0663 0.9379 0.3597 3.161
## TERT 0.6537 1.5298 0.3977 1.074
##
## Concordance= 0.607 (se = 0.031 )
## Likelihood ratio test= 16.58 on 17 df, p=0.5
## Wald test = 17.26 on 17 df, p=0.4
## Score (logrank) test = 16.99 on 17 df, p=0.5
This again shows that no genes are significant for survival among patients with INTCLUST 10. Note that this subset has only 219 patients and 104 events, so the model has much less power than the fits on the full data set. Building on what we saw before, there may be room to explore why INTCLUST 10 seems to have a higher survival curve than most other INTCLUST groups, and why the genes associated with breast cancer do not seem to play a significant role for these patients.
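# Keep only patients outside INTCLUST 10 (`!=`). Writing `=!` instead passes INTCLUST = !10 as an
# unused argument and keeps every row, which is what the n = 1904 in the printed summary below reflects.
Gn10 <- subset(geneclinical, INTCLUST != 10)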
genecoxclinicalnL10 <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 + CDH1 + PTEN + STK11 +TP53 + ATM + BARD1 + CASP8 + CTLA4 + CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT+ LATERALITY + NPI + INFERRED_MENOPAUSAL_STATE + BREAST_SURGERY + AGE_AT_DIAGNOSIS + CHEMOTHERAPY, data = Gn10)
summary(genecoxclinicalnL10)
## Call:
## coxph(formula = Surv(OS_MONTHS, OS_STATUS) ~ BRCA1 + BRCA2 +
## CDH1 + PTEN + STK11 + TP53 + ATM + BARD1 + CASP8 + CTLA4 +
## CYP19A1 + FGFR2 + LSP1 + MAP3K1 + NBN + RAD51 + TERT + LATERALITY +
## NPI + INFERRED_MENOPAUSAL_STATE + BREAST_SURGERY + AGE_AT_DIAGNOSIS +
## CHEMOTHERAPY, data = Gn10)
##
## n= 1904, number of events= 1103
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BRCA1 -0.091805 0.912283 0.223277 -0.411 0.680947
## BRCA2 0.200737 1.222303 0.088790 2.261 0.023772 *
## CDH1 0.189967 1.209209 0.217097 0.875 0.381558
## PTEN 0.188517 1.207458 0.056694 3.325 0.000884 ***
## STK11 0.191844 1.211482 0.082760 2.318 0.020445 *
## TP53 0.333706 1.396133 0.188175 1.773 0.076165 .
## ATM 0.110259 1.116567 0.093972 1.173 0.240667
## BARD1 -0.156496 0.855135 0.124690 -1.255 0.209449
## CASP8 -0.248624 0.779873 0.225689 -1.102 0.270625
## CTLA4 0.133297 1.142589 0.054824 2.431 0.015042 *
## CYP19A1 -0.033263 0.967284 0.048323 -0.688 0.491230
## FGFR2 0.061420 1.063345 0.036370 1.689 0.091263 .
## LSP1 -0.281922 0.754333 0.079109 -3.564 0.000366 ***
## MAP3K1 -0.017227 0.982921 0.088798 -0.194 0.846177
## NBN 0.189548 1.208703 0.081095 2.337 0.019420 *
## RAD51 -0.053060 0.948323 0.202165 -0.262 0.792969
## TERT -0.150728 0.860082 0.066923 -2.252 0.024305 *
## LATERALITYnull 0.603814 1.829081 0.129596 4.659 3.17e-06 ***
## LATERALITYr -0.074757 0.927969 0.063113 -1.185 0.236213
## NPI 0.226434 1.254119 0.031596 7.167 7.69e-13 ***
## INFERRED_MENOPAUSAL_STATEpre 0.460658 1.585117 0.120118 3.835 0.000126 ***
## BREAST_SURGERYMASTECTOMY 0.310213 1.363715 0.066185 4.687 2.77e-06 ***
## BREAST_SURGERYnull 0.660326 1.935423 0.293005 2.254 0.024219 *
## AGE_AT_DIAGNOSIS 0.052394 1.053791 0.003879 13.509 < 2e-16 ***
## CHEMOTHERAPYYES 0.386462 1.471765 0.099187 3.896 9.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BRCA1 0.9123 1.0962 0.5889 1.4131
## BRCA2 1.2223 0.8181 1.0271 1.4546
## CDH1 1.2092 0.8270 0.7901 1.8505
## PTEN 1.2075 0.8282 1.0805 1.3494
## STK11 1.2115 0.8254 1.0301 1.4248
## TP53 1.3961 0.7163 0.9655 2.0188
## ATM 1.1166 0.8956 0.9287 1.3424
## BARD1 0.8551 1.1694 0.6697 1.0919
## CASP8 0.7799 1.2823 0.5011 1.2138
## CTLA4 1.1426 0.8752 1.0262 1.2722
## CYP19A1 0.9673 1.0338 0.8799 1.0634
## FGFR2 1.0633 0.9404 0.9902 1.1419
## LSP1 0.7543 1.3257 0.6460 0.8808
## MAP3K1 0.9829 1.0174 0.8259 1.1698
## NBN 1.2087 0.8273 1.0311 1.4169
## RAD51 0.9483 1.0545 0.6381 1.4094
## TERT 0.8601 1.1627 0.7544 0.9806
## LATERALITYnull 1.8291 0.5467 1.4188 2.3580
## LATERALITYr 0.9280 1.0776 0.8200 1.0502
## NPI 1.2541 0.7974 1.1788 1.3342
## INFERRED_MENOPAUSAL_STATEpre 1.5851 0.6309 1.2526 2.0059
## BREAST_SURGERYMASTECTOMY 1.3637 0.7333 1.1978 1.5526
## BREAST_SURGERYnull 1.9354 0.5167 1.0899 3.4370
## AGE_AT_DIAGNOSIS 1.0538 0.9490 1.0458 1.0618
## CHEMOTHERAPYYES 1.4718 0.6795 1.2117 1.7876
##
## Concordance= 0.678 (se = 0.008 )
## Likelihood ratio test= 460 on 25 df, p=<2e-16
## Wald test = 450.5 on 25 df, p=<2e-16
## Score (logrank) test = 466.7 on 25 df, p=<2e-16
In contrast to the INTCLUST 10 subset, this fit shows a number of significant genes, including BRCA2, PTEN, STK11, CTLA4, LSP1, NBN, and TERT. Note that the summary reports n = 1904, the size of the full data set, so the comparison here is effectively between INTCLUST 10 and all patients rather than the strict non-INTCLUST 10 group. It could still be useful to investigate why survival in INTCLUST 10 does not seem to be affected by these genes while survival in the wider cohort is. This could be a future area of study for these types of patients.
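Since the summary above reports n = 1904, it is worth checking that a `subset()` call really drops the rows it is meant to. The toy example below (made-up data, not the report's) shows how `=!` and `!=` behave differently inside `subset()`.

```r
# Toy data standing in for geneclinical
toy <- data.frame(INTCLUST = c("10", "5", "10", "3"), x = 1:4)

# `=!` is read as the named argument INTCLUST = !10, which subset() never uses,
# so no filtering happens and every row is returned
not_filtered <- subset(toy, INTCLUST =! 10)

# `!=` is the inequality operator and drops the INTCLUST 10 rows
filtered <- subset(toy, INTCLUST != 10)

c(not_filtered = nrow(not_filtered), filtered = nrow(filtered))
```

A quick `nrow()` or `table(Gn10$INTCLUST)` after subsetting would catch this kind of slip before the model is fit.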
For these data sets I used Cox proportional hazards models to examine the association between survival and the covariates listed. Some of the initial variables from the clinical data set that were seen to have an impact on survival were whether a patient had chemotherapy, their NPI score, age at diagnosis, INTCLUST 5 and 10, inferred menopausal state, null laterality, and whether a mastectomy had been performed.
A higher NPI score corresponds to more advanced cancer status in patients. When looking at the log-rank test of this variable, we saw clear differences between the quantiles of the NPI score, with survival rates generally worse for those with more advanced status. Using this information we were able to make an inference about chemotherapy status. Initially, our model showed that patients who had undergone chemotherapy had a lower survival rate than those who had not. Using NPI, we can see that patients with a higher NPI score were more likely to have undergone chemotherapy, which suggests that chemotherapy is acting as a marker of more advanced cancer status.
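The link between NPI and chemotherapy can be checked with a simple cross-tabulation of chemotherapy status by NPI quartile. The sketch below runs on simulated values, built so that chemotherapy is more likely at higher scores (an assumption made purely for illustration); with the report's data the inputs would presumably be `clinical$NPI` and `clinical$CHEMOTHERAPY`.

```r
set.seed(1)

# Simulated stand-in: NPI-like scores, with chemotherapy made more likely at higher scores
npi   <- runif(200, min = 2, max = 6.5)
chemo <- ifelse(runif(200) < (npi - 2) / 4.5, "YES", "NO")

# Quartile bins of the score
npi_q <- cut(npi, breaks = quantile(npi, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))

# Row proportions: share of each quartile that did or did not receive chemotherapy
tab <- prop.table(table(npi_q, chemo), margin = 1)
round(tab, 2)
```

If the YES share rises steadily from Q1 to Q4 in the real data, that supports reading the chemotherapy coefficient as confounded by disease severity.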
Inferred menopausal state and age at diagnosis were both significant in the model. Since inferred menopausal state tells us whether the patient had entered menopause, we saw a curve showing that women who have not yet entered menopause generally have a higher survival rate than those who have. Age at diagnosis told a similar story: splitting at the average age at diagnosis of 61, older patients (>= 61) had a lower survival rate than younger patients (< 61). This supports the claim that younger patients diagnosed with breast cancer have a higher survival rate than older patients, which seems reasonable. However, more questions remain before we can conclude that both variables tell the same story. Perhaps there is something in menopausal status, aside from age, that affects survival.
Laterality showed that the side on which the cancer occurred did not matter for survival. However, nulls were fairly common in this category, and they showed a lower survival rate than patients with a recorded side. This category would be more interesting if the meaning of null were defined. If it is simply missing data, it could be affecting our curves for the cases where a side was recorded. If it means that there was no known location because the cancer had spread to an advanced state, that would intuitively explain the lower survival rate observed. This highlights the need for clearly defined variables during analysis, since the definitions can change the results.
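If "null" does turn out to mean missing data, one option is to recode it as `NA` so that it is no longer treated as a category of its own. The sketch below shows the recoding on a toy factor; whether this is appropriate for LATERALITY depends on what "null" actually means in the source data, which is not documented here.

```r
# Toy factor with the same levels as LATERALITY in the output above
lat <- factor(c("l", "r", "null", "l", "null"))

# Recode the "null" label as a true missing value
lat_chr <- as.character(lat)
lat_chr[lat_chr == "null"] <- NA
lat_clean <- factor(lat_chr)

table(lat_clean, useNA = "ifany")
```

With R's default `na.action`, `coxph()` drops rows with missing covariates, so recoding this way would remove those patients from the fit instead of estimating a separate "null" effect.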
INTCLUST 5 and 10 were the only significant levels of this variable in our Cox model. INTCLUST 10 seemed to have a higher survival rate than INTCLUST patients overall, while INTCLUST 5 had a lower survival rate than average, and especially lower than INTCLUST 10.
For INTCLUST 5, Claudin subtype = normal, NPI, and hormone therapy were significant in the Cox model fit to the INTCLUST 5 subset alone. Claudin subtype = normal and NPI had small sample sizes in some categories, which made it difficult to draw conclusions from those survival curves, but hormone therapy showed interesting results. Even though hormone therapy was not significant in our overall model, it was in our INTCLUST 5 subset, where patients who received hormone therapy had a higher survival rate than those who did not. A log-rank test was performed to confirm this, but the survival curves crossed at the end, which may make this test invalid. However, this could warrant future study to show that some breast cancer patients, but not all, could benefit from hormone therapy.
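Crossing survival curves are a sign that the proportional hazards assumption may not hold. One standard check is `cox.zph()` from the survival package, which tests each covariate's scaled Schoenfeld residuals for a trend over time. The sketch below uses the `lung` data as a stand-in; with this report's data it would be applied to the Cox fit on the INTCLUST 5 subset.

```r
library(survival)

# Stand-in model on the `lung` data shipped with the survival package
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Test of proportional hazards, per covariate and globally
zph <- cox.zph(fit)
zph

# Small p-values indicate evidence against proportional hazards for that term
zph$table[, "p"]
```

`plot(zph)` draws the residuals against time for each covariate, which helps show where along the follow-up an effect changes.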
For INTCLUST 10, age at diagnosis, breast surgery = null, cellularity = low, Threegene = ER+/HER2- High Prolif, Threegene = null, and Claudin subtype = LumB were marked as significant in our Cox model. For age at diagnosis, the survival rates were very comparable at first but diverged after about t = 100. This would seem to suggest that younger patients had a better survival rate than older patients, but the crossing of the survival curves prevented us from using a log-rank test to confirm this. Threegene = ER+/HER2- High Prolif looked to have a higher survival rate than the other Threegene values, but with the small sample size this could not be concluded. Breast surgery = null also had a small sample size, but tended to mirror what we found in the overall data set. Cellularity = low and Claudin subtype = LumB were similarly inconclusive due to sample size. These could be ideas to explore in future work.
I fit a model removing all non-significant variables from our original model and compared the two using AIC, which preferred the original model. Removing all the covariates except the significant ones therefore does not result in a better model. This could possibly be due to some variables that were not significant in our original Cox model still being significant within some of our subsets, as we saw in our exploration of INTCLUST 5 & 10.
Gene Expression
Using a Cox model based on a list of genes reported to have an impact on breast cancer (source: https://www.nationalbreastcancer.org/other-breast-cancer-genes), the genes significant at the 0.05 level were BRCA2, PTEN, STK11, TP53, CTLA4, LSP1, NBN, and TERT. A second model was then made from these genes, plus ATM, together with the significant variables from the clinical data set for the sake of simplicity. This model shows a number of significant covariates: the genes PTEN, TP53, CTLA4, LSP1, NBN, and TERT, along with INTCLUST 10, LATERALITY = null, NPI, inferred menopausal state = pre-menopausal, having a mastectomy, age at diagnosis, and chemotherapy = yes. A third model was fit keeping only the genes that remained significant in the combined model, for comparison via AIC. This showed that the model without genes removed (the second model) was the best choice, though only by a small margin, matching our conclusion from the first section. An ideal next step would be to make a full model out of all variables and find a way to select variables, possibly through stepwise selection.
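As a rough sketch of what that stepwise selection could look like, `step()` from base R can run backward elimination by AIC on a `coxph` fit. The example uses the `lung` data from the survival package as a stand-in and restricts to complete cases so every candidate model is fit to the same rows. The covariates are illustrative and are not a recommendation for this analysis.

```r
library(survival)

# Stand-in data, restricted to complete cases so the row count never changes
vars <- c("time", "status", "age", "sex", "ph.ecog", "ph.karno", "wt.loss")
lung_cc <- lung[complete.cases(lung[, vars]), vars]

full_fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + wt.loss,
                  data = lung_cc)

# Backward elimination: drop terms one at a time while AIC keeps improving
selected <- step(full_fit, direction = "backward", trace = 0)
formula(selected)
```

Stepwise selection is known to give optimistic p-values, so penalized regression such as a Cox lasso (the glmnet package is already loaded in this report) may be worth comparing against.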
When our gene data set was added to our clinical data set, INTCLUST 10 was the only INTCLUST level that was significant. Looking at only the INTCLUST 10 subset, no genes were marked as significant by our Cox model, and the only significant variable was laterality = null. In contrast, the fit intended for patients outside INTCLUST 10 (which reported n = 1904, so it effectively covers the full cohort) shows a number of significant genes, including BRCA2, PTEN, STK11, CTLA4, LSP1, NBN, and TERT. It could be useful to investigate why survival in INTCLUST 10 does not seem to be affected by these genes while survival in the wider cohort is. This could be a future area of study for these types of patients.
Many factors seem to play into the survival rates of breast cancer patients. From this study, an approach that considers gene expression alongside physical and clinical variables seems to be the best one for future research into these patients' survival. As was seen with INTCLUST 5 and hormone therapy status, there are many combinations left to explore, and they could show that certain genes or procedures have an impact on a particular set of patients even when they do not for the cohort as a whole. This kind of understanding could help narrow down treatment options that are more personalized and effective in the future.
Data set obtained through the National Cancer Institute Genomic Data Commons Data Portal through the LinkedOmics portal https://portal.gdc.cancer.gov/
Vasaikar S., Straub P., Wang J., Zhang B. LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Research, gkx1090, 2017 https://doi.org/10.1093/nar/gkx1090
Selected breast cancer genes were obtained from the National Breast Cancer Foundation, Inc. website, "Other Breast Cancer Genes" https://www.nationalbreastcancer.org/other-breast-cancer-genes
Selected breast cancer statistics used in the introduction were obtained from World Cancer Report 2014. World Health Organization. 2014. pp. Chapter 1.1. ISBN 978-92-832-0429-9.