Development of cancer after a previous diagnosis of cancer can be devastating, especially for those people whose cancer went into remission. The following project seeks to use SEER data to understand what impact various first-cancer characteristics and treatments have on development of a second cancer.
Two approaches are taken to achieve this goal: (1) logistic regression where second cancer development is the outcome, and (2) survival analysis where second cancer development is the event. Predictors were initially selected on the basis of their suitability for model building, such that variables meant to be used for cohort construction (e.g. type of cancer) were not used as predictors. Models were first fitted by use of every pre-selected predictor, and a second model evaluated. Diagnostics and validation were performed, both in- and out-of- sample. A propensity score match was attempted, and revealed similar results.
The development of a second cancer after an initial diagnosis of breast cancer largely happens within the span of about 100 months. Distant metastases of cancer, certain cancer morphologies, and some grades of cancer appear to increase the odds or rate of second cancer development. Radiation treatment is associated with significantly decreased odds and rate of second cancer development, a finding that may be supported biologically by successful treatment decreasing the chance of cancer spreading to other areas of the body.
The next direction to move from this conclusion would be to match patients to see if the effect of radiation on second cancer development remains.
Provide a Table 1 that describes the predictors in your final model, broken down by the levels of your outcome variable.
| Only One Cancer | Second Cancer | p | test | |
|---|---|---|---|---|
| n | 340302 | 29660 | ||
| Race (%) | <0.001 | |||
| White | 273127 (80.3) | 24538 (82.7) | ||
| Black | 36149 (10.6) | 2859 ( 9.6) | ||
| Other | 31026 ( 9.1) | 2263 ( 7.6) | ||
| Sex = Male (%) | 2476 ( 0.7) | 227 ( 0.8) | 0.486 | |
| Age.At.Diagnosis (mean (sd)) | 59.81 (13.78) | 62.31 (13.24) | <0.001 | |
| Metastasis (%) | 0.004 | |||
| Localized | 215499 (63.3) | 18637 (62.8) | ||
| Regional | 107376 (31.6) | 9592 (32.3) | ||
| Distant | 17427 ( 5.1) | 1431 ( 4.8) | ||
| Morphology (%) | <0.001 | |||
| Adenocarcinoma in situ | 1094 ( 0.3) | 87 ( 0.3) | ||
| Carcinoma in situ, NOS | 1509 ( 0.4) | 98 ( 0.3) | ||
| Infiltrating duct carcinoma, NOS | 258327 (75.9) | 20831 (70.2) | ||
| Intraductal and lobular in situ carcinoma | 21789 ( 6.4) | 2650 ( 8.9) | ||
| Lobular carcinoma in situ | 24478 ( 7.2) | 2763 ( 9.3) | ||
| Mucinous adenocarcinoma | 6339 ( 1.9) | 701 ( 2.4) | ||
| Other | 26766 ( 7.9) | 2530 ( 8.5) | ||
| Grade (%) | <0.001 | |||
| Grade 1 | 73168 (21.5) | 7002 (23.6) | ||
| Grade 2 | 143867 (42.3) | 13196 (44.5) | ||
| Grade 3 | 119930 (35.2) | 9174 (30.9) | ||
| Grade 4 | 3337 ( 1.0) | 288 ( 1.0) | ||
| Radiation = Radiation (%) | 173864 (51.1) | 13108 (44.2) | <0.001 |
How did you make decisions about spending the degrees of freedom that you spent?
I’m not sure whether I can use the full number of observations that are available or the subset that I’ll use to build a model, to calculate Number of Observations / Number of Variables to Include. Either way, my use of degrees of freedom is not limited by the number of observations that are available–in this sense, I am rich in degrees of freedom to spend.
What approach(es) did you use in selecting your model? How well did they work?
My model was developed for its explanatory value, so I used all of the variables.
I did make a Spearman Rho-Squared plot to see whether there were good predictors on which to spend additional degrees of freedom, and found that I could use a restricted cubic spline on age at diagnosis as well as an interaction between two potentially-related variables (metastasis and grade).
On calculation of variance inflation factors, it appeared that there was a LOT of collinearity in the model including the RCS and interaction terms. So, I remade the model without the RCS and interaction terms and evaluated the two against each other using analysis of deviance.
How do you interpret your results in terms of probabilities or odds, rather than log odds?
For the purposes of this project, I will be limiting myself to a sample set of forty thousand observations to develop a logistic regression model. This is about 5% of my total sample size, so there is ample room for out-of-training validation.
Before selecting predictors, I will look at a spearman \(\rho^2\) plot to see if I should use any interaction, polynomial, or restricted cubic spline methods on predictors.
Age at diagnosis appears to be very important, based on its p-value.
Because Grade can realte to Metastasis, I’ll add an interaction term between the two. To visualize this potential interaction, here is a plot:
This doesn’t seem like a perfectly random association between categories, so I’ll include an interaction term between the two variables. I’ll also include a restricted cubic spline on age at diagnosis because the longer someone is alive, the longer the amount of time that they have to develop a second cancer.
As previously mentioned, missingness isn’t an issue outside of the variable Surgery. Missingness in any other variable accounts for less than 5% of the total number of observations, with the exception of the variables specifically for people who developed a second cancer.
Next, I’ll make the models.
Because I’ve used several interactions and an rcs, and because the severity (Grade) of cancer is known to be associated with its chance of metastasizing, I’ll take a quick look at variance inflation factors (VIFs) to make sure there’s no glaring issue with collinearity.
| Variable | VIF |
|---|---|
| RaceBlack | 1.029 |
| RaceOther | 1.019 |
| SexMale | 1.006 |
| Age.At.Diagnosis | 1.055 |
| ordered(Metastasis).L | 2.007 |
| ordered(Metastasis).Q | 1.862 |
| ordered(Grade).L | 5.48 |
| ordered(Grade).Q | 4.99 |
| ordered(Grade).C | 2.708 |
| MorphologyCarcinoma in situ, NOS | 3.245 |
| MorphologyInfiltrating duct carcinoma, NOS | 106 |
| MorphologyIntraductal and lobular in situ carcinoma | 42.47 |
| MorphologyLobular carcinoma in situ | 40.64 |
| MorphologyMucinous adenocarcinoma | 10.61 |
| MorphologyOther | 44.31 |
| RadiationRadiation | 1.018 |
| Size | 1.161 |
Note: The model used to evaluate these VIF values does NOT include interactions, this is to avoid introducing collinearity, which would make the test silly.
None of these values are too terrible, except all of the Morphologies and the “L” (linear, I believe) component of Grade.
I’ll make a new model without the interactions or restricted cubic splines and see which performs better on analysis of deviance.
To help choose between the models including and not including interactions, here are the results of an analysis of deviance:
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 39974 | 21752 | NA | NA | NA |
| 39982 | 21786 | -8 | -34.41 | 0 |
It looks like the interaction and restricted cubic splines offer negligible predictive value. So, I’ll stick with the model that doesn’t have either of those features.
Following are descriptions of the model predicting second cancer development. Note, this is the model that has no interaction effects included.
Model summary data is presented on the next page.
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| RaceBlack | -0.067 | 0.063 | -1.058 | 0.29 |
| RaceOther | -0.178 | 0.071 | -2.505 | 0.012 |
| SexMale | -0.212 | 0.218 | -0.972 | 0.331 |
| Age.At.Diagnosis | 0.011 | 0.001 | 8.24 | 0 |
| ordered(Metastasis).L | -0.018 | 0.064 | -0.281 | 0.779 |
| ordered(Metastasis).Q | -0.036 | 0.045 | -0.804 | 0.422 |
| ordered(Grade).L | 0.004 | 0.129 | 0.034 | 0.973 |
| ordered(Grade).Q | 0.09 | 0.098 | 0.924 | 0.355 |
| ordered(Grade).C | 0.067 | 0.052 | 1.303 | 0.193 |
| MorphologyCarcinoma in situ, NOS | 0.507 | 0.505 | 1.006 | 0.315 |
| MorphologyInfiltrating duct carcinoma, NOS | 0.422 | 0.421 | 1.002 | 0.316 |
| MorphologyIntraductal and lobular in situ carcinoma | 0.845 | 0.425 | 1.988 | 0.047 |
| MorphologyLobular carcinoma in situ | 0.597 | 0.425 | 1.405 | 0.16 |
| MorphologyMucinous adenocarcinoma | 0.494 | 0.442 | 1.117 | 0.264 |
| MorphologyOther | 0.654 | 0.425 | 1.539 | 0.124 |
| RadiationRadiation | -0.304 | 0.038 | -8.022 | 0 |
| Size | 0.001 | 0.001 | 1.032 | 0.302 |
| (Intercept) | -3.479 | 0.434 | -8.026 | 0 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 22022 on 39999 degrees of freedom |
| Residual deviance: | 21786 on 39982 degrees of freedom |
| OR | 2.5 % | 97.5 % | P | |
|---|---|---|---|---|
| (Intercept) | 0.031 | 0.012 | 0.066 | 0 |
| RaceBlack | 0.935 | 0.824 | 1.057 | 0.29 |
| RaceOther | 0.837 | 0.727 | 0.96 | 0.012 |
| SexMale | 0.809 | 0.513 | 1.212 | 0.331 |
| Age.At.Diagnosis | 1.011 | 1.009 | 1.014 | 0 |
| ordered(Metastasis).L | 0.982 | 0.865 | 1.11 | 0.779 |
| ordered(Metastasis).Q | 0.965 | 0.884 | 1.052 | 0.422 |
| ordered(Grade).L | 1.004 | 0.771 | 1.279 | 0.973 |
| ordered(Grade).Q | 1.094 | 0.896 | 1.315 | 0.355 |
| ordered(Grade).C | 1.07 | 0.964 | 1.181 | 0.193 |
| MorphologyCarcinoma in situ, NOS | 1.661 | 0.643 | 4.821 | 0.315 |
| MorphologyInfiltrating duct carcinoma, NOS | 1.524 | 0.729 | 3.907 | 0.316 |
| MorphologyIntraductal and lobular in situ carcinoma | 2.328 | 1.102 | 6.005 | 0.047 |
| MorphologyLobular carcinoma in situ | 1.817 | 0.86 | 4.69 | 0.16 |
| MorphologyMucinous adenocarcinoma | 1.638 | 0.744 | 4.333 | 0.264 |
| MorphologyOther | 1.923 | 0.911 | 4.956 | 0.124 |
| RadiationRadiation | 0.738 | 0.685 | 0.795 | 0 |
| Size | 1.001 | 0.999 | 1.002 | 0.302 |
Looking at the odds ratios, age and intraductal and lobular in situ carcinoma morphology are associated with increased odds of developing a second cancer. On the other hand, races that are not white or black and the use of radiation treatment are apparently associated with decreased odds of developing a second cancer.
Here’s an analysis of deviance for the chosen model:
| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | NA | NA | 39999 | 22022 | NA |
| Race | 2 | 14.07 | 39997 | 22008 | 0.001 |
| Sex | 1 | 0.242 | 39996 | 22008 | 0.622 |
| Age.At.Diagnosis | 1 | 92.4 | 39995 | 21915 | 0 |
| ordered(Metastasis) | 2 | 1.107 | 39993 | 21914 | 0.575 |
| ordered(Grade) | 3 | 8.982 | 39990 | 21905 | 0.03 |
| Morphology | 6 | 52.74 | 39984 | 21852 | 0 |
| Radiation | 1 | 65.23 | 39983 | 21787 | 0 |
| Size | 1 | 0.986 | 39982 | 21786 | 0.321 |
It appears that several of the predictors are statistically significant. Specifically, race, age at diagnosis, grade, morphology, and radiation are all statistically significant.
The p-value of the overall model is as close to zero as R can go, and I expect5 this might not be too useful a measure of performance because having so many observations probably drives this p-value down.
The Nagelkerke’s R^2 is 0.014, indicating that this model accounts for very little of the variation in response.
The Brier score is 0.072, which doesn’t have much predictive power.
The model has a C-statistic of 0.584, which is also pretty low. Here’s an ROC curve, where the C-statistic is equal to the area under the curve:
##
## Call:
## roc.formula(formula = bcfpModTrain$MultipleCancerDx ~ predict(modelGLM, type = "response"))
##
## Data: predict(modelGLM, type = "response") in 36857 controls (bcfpModTrain$MultipleCancerDx Only One Cancer) < 3143 cases (bcfpModTrain$MultipleCancerDx Second Cancer).
## Area under the curve: 0.5842
Here’s a calibration plot to show how our model compares to the observed probabilities used to create the model:
##
## n=40000 Mean absolute error=0.002 Mean squared error=2e-05
## 0.9 Quantile of absolute error=0.004
The model seems to perform fine between probabilities of 0.6 and 0.11. For probabilities greater than 0.12, it looks like the curve loses predictive value by an increasingly wide margin.
In section I of this document, I mentioned concern about a possible interaction between Grade and Metastasis. Here’s an interaction plot to see how that turned out:
It looks like there might be an interaction, where metastasis-stratified probability values converge as grade increases. This possible interaction is not visually significant enough to address.
Let’s see what diagnostic plots have to say about possible influential observations:
There are several high-leverage points, but none of them have major values for residuals. This is acceptable.
Here are results from cross-validation of the model:
| index.orig | training | test | optimism | index.corrected | n | |
|---|---|---|---|---|---|---|
| Dxy | 0.168 | 0.173 | 0.164 | 0.009 | 0.16 | 40 |
| R2 | 0.014 | 0.015 | 0.013 | 0.002 | 0.012 | 40 |
| Intercept | 0 | 0 | -0.137 | 0.137 | -0.137 | 40 |
| Slope | 1 | 1 | 0.944 | 0.056 | 0.944 | 40 |
| Emax | 0 | 0 | 0.04 | 0.04 | 0.04 | 40 |
| D | 0.006 | 0.006 | 0.006 | 0.001 | 0.005 | 40 |
| U | 0 | 0 | 0 | 0 | 0 | 40 |
| Q | 0.006 | 0.006 | 0.006 | 0.001 | 0.005 | 40 |
| B | 0.072 | 0.072 | 0.072 | 0 | 0.072 | 40 |
| g | 0.319 | 0.328 | 0.309 | 0.019 | 0.299 | 40 |
| gp | 0.023 | 0.024 | 0.022 | 0.001 | 0.022 | 40 |
Corrected indices do change, but it doesn’t look like too much. For Dxy and Nagelkerke’s R^2, the change is less than ten percent of the previous value, which isn’t too bad. However, the intercept decreases by 0.137, which might indicate the fit won’t perform as well on new data.
For fun, here’s a plot of the distribution of age at diagnosis, according to development of a second cancer:
Consistent with the model, it looks like age greater than the mean is associated with increased rates of developing a second cancer.
Next, I’ll look into the effect of age across different cancer morphologies. Here’s a bubble plot looking at Morphology vs Age.At.Diagnosis. The larger the bubble, the higher the probability of developing a second cancer.
The previous plot shows the predicted probability of developing a second cancer as proportional to the diameter of each circle, where each circle corresponds to an observation represented by its x- and y-axis values.
As an example, for the sixth level of morphology, at the age of 90, there is a much larger probability of developing a second cancer than there is for someone at 38 years old.
Also, it appears that if you have “carcinoma in situ”, your chances of developing a second cancer are relatively high no matter how old you are.
Here is an ROC curve for new data:
##
## Call:
## roc.default(response = newdata$MultipleCancerDx, predictor = predictedVals)
##
## Data: predictedVals in 9183 controls (newdata$MultipleCancerDx Only One Cancer) < 817 cases (newdata$MultipleCancerDx Second Cancer).
## Area under the curve: 0.5939
The AUC using this new data is 0.594, which corresponds nicely to the validated AUC of 0.580.
Overall, the model isn’t so great. It does, however, point out the impact of some of its predictors on the development of a second caner, and it was fun.
#retype the dates for easier handling
bcfp$Date.of.Diag.1 <-
parse_date_time(bcfp$Date.of.Diag, order = "my")
bcfp$Date.of.Diag.2 <-
parse_date_time(bcfp$Two.Date.of.Diag, order = "my")
#make an interval between the dates
bcfp$TimeToSecondCancer <-
with(bcfp, interval(start = Date.of.Diag.1, end = Date.of.Diag.2))
#convert the interval to years, then divide by 12 to get months.
#I divide by twelve because months have different numbers of days,
#but years always have the same number of months.
#the value 364.75 is used because every fourth year has 364 days
#this will introduce less issues than the difference in days of months,
#overall
bcfp$TTSCMonths <-
(bcfp$TimeToSecondCancer@.Data * 12) / (60 * 60 * 24 * 364.75)
#make a time variable to use for analysis
#this uses duration of survival for people who died without developing
#a second cancer,
#and uses time to second cancer for those who did develop a second cancer
bcfp$timeCPH <-
ifelse(is.na(bcfp$TTSCMonths) == TRUE,
bcfp$Survival.Months,
bcfp$TTSCMonths)
#give time units
units(bcfp$time) <- "Months"
#next, I'll make a censor variable for those people who died
bcfp$censor <-
ifelse(is.na(bcfp$TTSCMonths) == TRUE, 0, 1)
#for reproducability
set.seed(432431)
#again, time to subset the data for model training
bcfpModTrain <- sample_n(bcfp, 4e4) %>% select(MultipleCancerDx,
Race,
Sex,
Age.At.Diagnosis,
Metastasis,
Grade,
Morphology,
Radiation,
Size,
timeCPH,
censor,
Case.Number)
bcfpSurv <- with(bcfpModTrain, Surv(time = timeCPH, event = censor, type = "right"))
#here, I'll make the model
bcfpSFit <- survfit(bcfpSurv ~ 1)
The following responses refer to a survival analysis described in Section IV.
Provide a Table 1 that describes the predictors in your final model, broken down by the levels of your outcome variable.
Please see Table 1 in Section I.
Plot and assess the pair of Kaplan-Meier estimates comparing intervention groups
Here is a plot of Kaplan-Meier estimates comparing groups that did and did not receive radiation treatment:
Interestingly, it looks like the group that received radiation had a lower hazard of developing a second cancer. Here’s a quantitative approach to evaluating this difference:
| N | Observed | Expected | (O-E)^2/E | (O-E)^2/V | |
|---|---|---|---|---|---|
| Radiation=No Radiation | 19819 | 1804 | 1514 | 55.59 | 106.5 |
| Radiation=Radiation | 20181 | 1388 | 1678 | 50.15 | 106.5 |
A log-rank test indicates that there is, indeed, a difference between groups who were and were not exposed to radiation as a treatment for cancer.
Interestingly, it looks like the group that received radiation had a lower chance of developing a second cancer. Here’s a quantitative approach to evaluating this difference:
| N | Observed | Expected | (O-E)^2/E | (O-E)^2/V | |
|---|---|---|---|---|---|
| Race=White | 32102 | 2614 | 2576 | 0.555 | 2.892 |
| Race=Black | 4290 | 314 | 332.2 | 0.992 | 1.114 |
| Race=Other | 3608 | 264 | 283.6 | 1.36 | 1.502 |
A log-rank test indicates that there is no statistically significant difference in survival across race.
| N | Observed | Expected | (O-E)^2/E | (O-E)^2/V | |
|---|---|---|---|---|---|
| Metastasis=Localized | 25307 | 1971 | 2062 | 4.038 | 11.48 |
| Metastasis=Regional | 12600 | 1047 | 1006 | 1.641 | 2.41 |
| Metastasis=Distant | 2093 | 174 | 123.4 | 20.77 | 21.86 |
By the log-rank test, there is a very statistically significant difference across metastasis groups, where distant metastasis is associated with decreased survival.
| N | Observed | Expected | (O-E)^2/E | (O-E)^2/V | |
|---|---|---|---|---|---|
| Grade=Grade 1 | 8604 | 739 | 699.1 | 2.281 | 2.938 |
| Grade=Grade 2 | 16935 | 1451 | 1346 | 8.204 | 14.27 |
| Grade=Grade 3 | 14064 | 966 | 1112 | 19.06 | 29.41 |
| Grade=Grade 4 | 397 | 36 | 35.48 | 0.008 | 0.008 |
By the log-rank test, there is a very statistically significant difference across Grade groups, where people who have grade 3 cancer appear to survive at a greater proportion than expected.
What approach(es) did you use in selecting your Cox model? How well did they work?
I did stepwise forward selection. The approach didn’t work too well, I found that the selected model was just as good as the kitchen sink model except for the pruning of insignificant predictors (whose effect was minimized by the model, anyway).
How well does your Cox model match up with the proportional hazards assumption?
The p-values provided by the output of my cox.zph command are really concerning. They’re really low, indicating significant slope of residuals. However, I would expect such low p-values, considering the number of observations available to me.
Looking at rho values, the biggest issue was the predictor Radiation. I plotted the survival curve fit for this variable alone, and the curves did, indeed, meet towards the end of the survival time maximum.
The event under consideration for this project is development of a second cancer. SEER provides a date of diagnosis for each record of cancer, where records of cancer are created for each new cancer found in a patient. The ID number, Case.Number, then can be used to identify individuals with more than one record in SEER. Further, a separate variable, “Sequence Number” (processed out of the data before use in this data set), can be used to identify whether this is the first, second, third, or so on cancer that a patient may have.
For the purposes of this study, survival time will be defined as the interval between record of a first and second cancer. Patients will be considered censored if they die, and thus will be dropped out of the study.
To improve the accuracy of the dataset, observations will be kept only if the date of last contact is greater than the date of diagnosis. The rates of each possible follow-up interval are described for each population in the following tables:
| Only One Cancer | Second Cancer | |
|---|---|---|
| Date Last Contact = Date Dx | 340 | 4 |
| Date Last Contact > Date Dx | 308258 | 26459 |
| Incomplete dates, maybe zero FU days | 461 | 13 |
| Incomplete dates, not zero FU days | 31243 | 3184 |
| Unknown | 0 | 0 |
Looking at the distributions of follow-up times among patients who did and did not develop second cancers, each group is within about an order of magnitude difference of the other. I will take this as evidence that including only those patients with follow-up beyond the date of diagnosis will not introduce too much bias.
| exp(coef) | exp(-coef) | lower .95 | upper .95 | |
|---|---|---|---|---|
| RaceBlack | 0.979 | 1.021 | 0.87 | 1.102 |
| RaceOther | 0.973 | 1.027 | 0.857 | 1.105 |
| SexMale | 0.875 | 1.142 | 0.58 | 1.321 |
| Age.At.Diagnosis | 1.013 | 0.987 | 1.01 | 1.016 |
| ordered(Metastasis).L | 1.35 | 0.741 | 1.205 | 1.513 |
| ordered(Metastasis).Q | 1.064 | 0.94 | 0.981 | 1.153 |
| ordered(Grade).L | 0.966 | 1.035 | 0.77 | 1.212 |
| ordered(Grade).Q | 1.076 | 0.929 | 0.905 | 1.28 |
| ordered(Grade).C | 1.104 | 0.906 | 1.006 | 1.211 |
| MorphologyCarcinoma in situ, NOS | 1.209 | 0.827 | 0.45 | 3.249 |
| MorphologyInfiltrating duct carcinoma, NOS | 1.377 | 0.726 | 0.655 | 2.894 |
| MorphologyIntraductal and lobular in situ carcinoma | 1.826 | 0.548 | 0.862 | 3.867 |
| MorphologyLobular carcinoma in situ | 1.985 | 0.504 | 0.938 | 4.2 |
| MorphologyMucinous adenocarcinoma | 1.538 | 0.65 | 0.705 | 3.352 |
| MorphologyOther | 1.505 | 0.664 | 0.71 | 3.189 |
| RadiationRadiation | 0.719 | 1.391 | 0.67 | 0.772 |
| Size | 1 | 1 | 0.999 | 1.002 |
| C | se(C) |
|---|---|
| 0.609 | 0.005 |
| test | df | pvalue |
|---|---|---|
| 334.2 | 17 | 0 |
As seen above, distant metastasis is associated with 1.5 times hazard of developing a second cancer, and regional metastasis is also associated with increased hazard. Also as shown above, Grade 3 cancer is associated with decreased hazard of developing a second cancer. Finally, radiation is also associated with decreased hazard of developing a second cancer, at about a 0.77 rate compared to patients not treated with radiation. P-values correspond to these explanations of significance, which are mentioned if the 95% confidence interval of hazard does not cross 1.
The Wald test for this model shows that, overall, the model is statistically significant.
Because there is a continuous predictor in the model, I included the measure of concordance. At 0.609, which is a little less than typical for survival analysis data. This means that there is some agreement between the survival time and risk score generated by the predictor, but it’s not perfect.
The Cox-and-Snell’s pseudo-R-squared value for the overall model is .3f, which indicates that the fit isn’t that much better than the intercept alone. The maximum value of R-squared is .3f, indicating that our model reflects about one percent of the maximum.
Here is an analysis of deviance for the model:
| loglik | Chisq | Df | Pr(>|Chi|) | |
|---|---|---|---|---|
| NULL | -32495 | NA | NA | NA |
| Race | -32493 | 3.003 | 2 | 0.223 |
| Sex | -32493 | 0.02 | 1 | 0.887 |
| Age.At.Diagnosis | -32431 | 125.2 | 1 | 0 |
| ordered(Metastasis) | -32413 | 35.38 | 2 | 0 |
| ordered(Grade) | -32402 | 21.98 | 3 | 0 |
| Morphology | -32375 | 54.06 | 6 | 0 |
| Radiation | -32332 | 84.89 | 1 | 0 |
| Size | -32332 | 0.333 | 1 | 0.564 |
It appears that age at diagnosis, metastasis, grade, morphology, and radiation are all significant. Morphology wasn’t significant when broken down according to its levels, but the variable as a whole is significant. There may be a combination of factors that does have predictive value, but I wouldn’t know how to approach this possibility without imposing my own bias on the model.
Here is a plot of the survival curve for this model:
Note, the confidence interval curves are very close to the estimated curve. This makes the curve appear as if it is thick.
As an alternative, I’m going to see what the step function has to say about the kitchen sink model to see if there’s anything better. I won’t include step function output because it would make this document far longer.
Using AIC to select predictors, the step function returned the call coxph(formula = Surv(time = timeCPH, event = censor) ~ Age.At.Diagnosis + Metastasis + Grade + Morphology + Radiation, data = bcfpModTrain). So, the step function would have us exclude sex, race, and size from the model. This is consistent with what has appeared to be significant so far. I’ll use analysis of deviance to see which model has statistically significantly lower deviance.
| loglik | Chisq | Df | P(>|Chi|) |
|---|---|---|---|
| -32332 | NA | NA | NA |
| -32333 | 1.011 | 4 | 0.908 |
The drop in deviance is not significant, so I’ll probably stick with the kitchen sink model. Just to be sure that the stepwise-selected model isn’t that much better, I’ll see what a summary of the stepwise model looks like.
| exp(coef) | exp(-coef) | lower .95 | upper .95 | |
|---|---|---|---|---|
| Age.At.Diagnosis | 1.013 | 0.987 | 1.01 | 1.016 |
| ordered(Metastasis).L | 1.359 | 0.736 | 1.216 | 1.519 |
| ordered(Metastasis).Q | 1.064 | 0.94 | 0.981 | 1.154 |
| ordered(Grade).L | 0.967 | 1.034 | 0.771 | 1.213 |
| ordered(Grade).Q | 1.076 | 0.929 | 0.906 | 1.28 |
| ordered(Grade).C | 1.103 | 0.907 | 1.005 | 1.21 |
| MorphologyCarcinoma in situ, NOS | 1.219 | 0.821 | 0.454 | 3.273 |
| MorphologyInfiltrating duct carcinoma, NOS | 1.382 | 0.723 | 0.658 | 2.904 |
| MorphologyIntraductal and lobular in situ carcinoma | 1.838 | 0.544 | 0.868 | 3.891 |
| MorphologyLobular carcinoma in situ | 2.003 | 0.499 | 0.947 | 4.237 |
| MorphologyMucinous adenocarcinoma | 1.545 | 0.647 | 0.709 | 3.367 |
| MorphologyOther | 1.511 | 0.662 | 0.713 | 3.201 |
| RadiationRadiation | 0.72 | 1.389 | 0.671 | 0.772 |
| C | se(C) |
|---|---|
| 0.609 | 0.005 |
| test | df | pvalue |
|---|---|---|
| 332.9 | 13 | 0 |
All the predictors are statistically significant, but at the same significance levels. For the purposes of explanation and description of the model, I’ll keep the statistically insignificant predictors.
That’s a nice plot of the effects. Those bars that don’t cross one have a significant impact on hazard of developing a second cancer. For age at diagnosis going from 70 to 50, metastasis increasing from localized to regional, metastasis increasing from localized to distant, some of the morphology category changes, and radiation to lack of radiation, the hazard of developing a second cancer increases.
Still, curiously, going from grade 2 to grade 3 decreases the hazard of developing a second cancer.
Here’s a test for the cox proportional hazards assumption:
| rho | chisq | p | |
|---|---|---|---|
| Race=Black | 0.081 | 20.93 | 0 |
| Race=Other | -0.003 | 0.023 | 0.88 |
| Sex=Male | 0.028 | 2.493 | 0.114 |
| Age.At.Diagnosis | 0.089 | 23.26 | 0 |
| Metastasis | -0.072 | 15.92 | 0 |
| Metastasis=3 | 0.008 | 0.219 | 0.64 |
| Grade | 0.038 | 4.577 | 0.032 |
| Grade=3 | 0.007 | 0.17 | 0.68 |
| Grade=4 | 0.003 | 0.022 | 0.882 |
| Morphology=Carcinoma in situ, NOS | -0.031 | 3.059 | 0.08 |
| Morphology=Infiltrating duct carcinoma, NOS | -0.018 | 1.044 | 0.307 |
| Morphology=Intraductal and lobular in situ carcinoma | -0.024 | 1.856 | 0.173 |
| Morphology=Lobular carcinoma in situ | -0.026 | 2.214 | 0.137 |
| Morphology=Mucinous adenocarcinoma | -0.029 | 2.641 | 0.104 |
| Morphology=Other | -0.02 | 1.246 | 0.264 |
| Radiation=Radiation | 0.174 | 98.03 | 0 |
| Size | -0.017 | 0.45 | 0.502 |
| GLOBAL | NA | 202.3 | 0 |
Given the great number of observations in the data, it’s hard not to have low p-values. Still, we do have significantly non-zero slopes, and this indicates approximately proportional of hazards.
My greatest concern regards the predictor, radiation. It has a rho value of 0.195. Age is also a bit concerning, with rho = 0.088.
There are two continuous variables to evaluate visually, size and age at diagnosis. Here’s the plot for size:
This is pretty flat, so it doesn’t seem like the DFBetas change too much with increase in response.
This curve is a bit more curvy, corresponding to the numeric output above showing rho = 0.088.
In any case, I’m not sure that the proportional hazards assumption is violated. Still, I’ll check if the survival curves cross each other for radiation (which was the most concerning variable):
It appears that the curves cross when time is near its maximum. That makes me think that weight should be given to the left-hand side of the curve when doing a log-rank test.
In any case, I am worried about the proportional hazards violation.
| Variable | VIF |
|---|---|
| RaceBlack | 1.029 |
| RaceOther | 1.021 |
| SexMale | 1.005 |
| Age.At.Diagnosis | 1.057 |
| ordered(Metastasis).L | 1.907 |
| ordered(Metastasis).Q | 1.797 |
| ordered(Grade).L | 4.869 |
| ordered(Grade).Q | 4.567 |
| ordered(Grade).C | 2.52 |
| MorphologyCarcinoma in situ, NOS | 2.281 |
| MorphologyInfiltrating duct carcinoma, NOS | 97.39 |
| MorphologyIntraductal and lobular in situ carcinoma | 38.84 |
| MorphologyLobular carcinoma in situ | 43.48 |
| MorphologyMucinous adenocarcinoma | 10.67 |
| MorphologyOther | 36.28 |
| RadiationRadiation | 1.018 |
| Size | 1.125 |
Again, there is significant collinearity among Morphology categories. I won’t remove this, though, because the predictor was significant and because morphology plays a role in the natural history of cancer as a disease.
Here’s a calibration plot for the model:
## Using Cox survival estimates at 12 Days
The model looks well-calibrated. It performs worse as time progresses.
Here’s a 100-fold validation:
## index.orig training test optimism index.corrected n
## Dxy 0.2185 0.2223 0.2136 0.0087 0.2098 100
## R2 0.0101 0.0106 0.0096 0.0010 0.0091 100
## Slope 1.0000 1.0000 0.9525 0.0475 0.9525 100
## D 0.0050 0.0052 0.0047 0.0005 0.0045 100
## U 0.0000 0.0000 0.0000 -0.0001 0.0000 100
## Q 0.0050 0.0053 0.0047 0.0006 0.0045 100
## g 0.3570 0.3657 0.3480 0.0177 0.3393 100
| index.orig | training | test | optimism | index.corrected | n | |
|---|---|---|---|---|---|---|
| Dxy | 0.218 | 0.222 | 0.214 | 0.009 | 0.21 | 100 |
| R2 | 0.01 | 0.011 | 0.01 | 0.001 | 0.009 | 100 |
| Slope | 1 | 1 | 0.952 | 0.048 | 0.952 | 100 |
| D | 0.005 | 0.005 | 0.005 | 0.001 | 0.004 | 100 |
| U | 0 | 0 | 0 | 0 | 0 | 100 |
| Q | 0.005 | 0.005 | 0.005 | 0.001 | 0.004 | 100 |
| g | 0.357 | 0.366 | 0.348 | 0.018 | 0.339 | 100 |
The corrected Dxy decreases by less than five percent of its original value, so that’s not too bad.
Corrected R-squared decreases by more than ten percent, but that might be a rounding issue because it’s a smaller value.
Slope decreases a bit, and might be concerning depending on how the model performs out-of-sample.
Other values don’t change too much.
Here, I’ll make some predictions and see how they line up with actual values. I’ll be predicting the survival
## [1] 348
## [1] 363
There were 348 events predicted by the original model, but there were actually 363 events. To make a comparison between the two, it would be great if I recalculated survival for the new data to match the length of the model-fitted data. Instead, I’ll just pluck random events out of the actual data to produce the right number of events.
While I did learn a whole lot about data analysis, I think I learned just as much about how to make data useable. SEER provides its data in a widely dispersed format for reasons I can only speculate, but the R community has found ways around this common roadblock.
Recoding, however, is a less challenging task that requires more time and attention. I had the opportunity to do a lot of recoding. Despite the fact that it is less challenging for others, I found myself learning how to write and compose functions so that I could effectively manipulate data. For example, I remember when I realized that do.call can be thought of as analagous to the apply family, where do.call doesn’t behave sequentially. This allowed me to read many files at once, when used in combination with list.files.
I also took the time to learn how to use GitHub in RStudio. The first time I used GitHub was around September of 2015, and that little crack in the ice was enough for me to completely slip through during the course of this semester. Whatever the task, it seems, there is fertile ground for learning.
I’ve also been branching out into other languages. Javascript and D3 are really attractive technologies, but I’ve forgone learning that in lieu of buffing up my computer science knowledge using a python-based version of Structure and Interpretation of Computer Programs. Given my background in Economics and Philosophy, this has required that I re-read the same page several times before I get it. These baby steps, however, are still steps.
Of course, I’ve also spent time how to more effectively work in R. As I mentioned in a previous paragraph, I’ve learned how to use some commands effectively. Besides that, I’ve learned some of the theory behind the R environment. I learned a while ago that everything in a Unix system can be considered a file. Similarly, everything that exists in R is an object, and everything that happens is a function call. Learning more about R and abstracting away various layers of user interface to reach these absolute generalities has been very rewarding, even if it just meant learning how long a road lies ahead of me.
Finally, and perhaps most importantly, I’ve learned more about the rigorous application of quantitative methods. Rather than hide behind a p-value, for example, I’m now more inclined to provide more descriptive summaries of a concept. Along with this, I have the chance to explain myself more often, and find myself re-formulating and cementing concepts out of a need to say them out loud. It’s easy to think I’ve learned something after reading it, but it’s honest to think I’ve learned something after I can teach it.
I am a bit worried about the fact that a lot of the second cancers were breast cancers. This could be an issue if the second cancer is a recurrence of previous cancer, in which case this analysis would have lumped together cancer recurrence and new second cancer.
One solution to this potential issue would be to exclude breast cancer as a second cancer. However, I’m not sure how to confirm that a second cancer is recurrent instead of de novo. There may be a SEER protocol for doing so, but I haven’t been able to find it.
It would be interesting to see whether matching race, sex, age, and other variables would impact the analysis. I imagine that doing so would help identify those other variables (namely, cancer characteristics) that influence development of second cancers.
In the previous models, it looked like radiation is associated with lower odds of developing a second cancer. I think that’s really weird. So I’m going to see if I can successfully do a propensity score analysis where radiation is the predictor of interest, and all other covariates are used to generate propensity scores.
I can’t be confident in the following analysis because I don’t have much experience in this approach, but here’s my attempt at matching observations. Matching methods include (1) propensity score matching and (2) greedy algorithm matching.
It would be a lot cooler to implement some sort of matching on my own, but for now I’ll have to rely on an R package to help me out.
First, I’ll try doing this using the “Matching” package. I’ll print the code that I use for this, unlike previous chunks. I’ll be implementing this for the logistic regression whose outcome is development of a second cancer, looking to see if radiation treatment has the same impact as non-matched logistic regression. I’ll base the propensity score on all covariates.
library(Matching)
bcfpUnmatched <- sample_n(bcfp, 5e4)
#setting seed in case it makes a difference to the command
set.seed(432431)
#retype for ease of use
bcfpUnmatched$RadMat <- with(bcfpUnmatched, ifelse(Radiation == "Radiation", 1, 0))
#set up formals
Tr <- bcfpUnmatched$RadMat
Y <- bcfpUnmatched$MultipleCancerDx
X <- dplyr::select(.data = bcfpUnmatched, Race,
Sex,
Age.At.Diagnosis,
Metastasis,
Morphology,
Grade,
Size)
X <- data.matrix(X)
#do the match
matchedRadCan <- Match(Y = Y, Tr = Tr, X = X, M = 1, version = "fast")
#see how balance worked out
MBRadCan <- MatchBalance(RadMat ~
Race +
Sex +
Age.At.Diagnosis +
Metastasis +
Morphology +
Grade +
Size,
data = bcfpUnmatched,
match.out = matchedRadCan,
nboots = 100,
digits = 3)
##
## ***** (V1) RaceBlack *****
## Before Matching After Matching
## mean treatment........ 0.101 0.101
## mean control.......... 0.113 0.101
## std mean diff......... -3.93 0.0263
##
## mean raw eQQ diff..... 0.0118 3.77e-05
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00591 1.88e-05
## med eCDF diff........ 0.00591 1.88e-05
## max eCDF diff........ 0.0118 3.77e-05
##
## var ratio (Tr/Co)..... 0.907 1
## T-test p-value........ 1.86e-05 0.48
##
##
## ***** (V2) RaceOther *****
## Before Matching After Matching
## mean treatment........ 0.0827 0.0827
## mean control.......... 0.0953 0.0827
## std mean diff......... -4.57 0.0287
##
## mean raw eQQ diff..... 0.0126 3.77e-05
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0063 1.88e-05
## med eCDF diff........ 0.0063 1.88e-05
## max eCDF diff........ 0.0126 3.77e-05
##
## var ratio (Tr/Co)..... 0.88 1
## T-test p-value........ 7.58e-07 0.157
##
##
## ***** (V3) SexMale *****
## Before Matching After Matching
## mean treatment........ 0.00348 0.00348
## mean control.......... 0.0119 0.00348
## std mean diff......... -14.3 0
##
## mean raw eQQ diff..... 0.00841 0
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 0
##
## mean eCDF diff........ 0.0042 0
## med eCDF diff........ 0.0042 0
## max eCDF diff........ 0.00841 0
##
## var ratio (Tr/Co)..... 0.295 1
## T-test p-value........ <2e-16 1
##
##
## ***** (V4) Age.At.Diagnosis *****
## Before Matching After Matching
## mean treatment........ 58.9 58.9
## mean control.......... 61 58.9
## std mean diff......... -17.3 -0.557
##
## mean raw eQQ diff..... 2.37 0.0436
## med raw eQQ diff..... 2 0
## max raw eQQ diff..... 7 5
##
## mean eCDF diff........ 0.0279 0.000559
## med eCDF diff........ 0.00812 0.000283
## max eCDF diff........ 0.0907 0.00275
##
## var ratio (Tr/Co)..... 0.725 1.02
## T-test p-value........ <2e-16 3.51e-14
## KS Bootstrap p-value.. <2e-16 0.94
## KS Naive p-value...... <2e-16 0.988
## KS Statistic.......... 0.0907 0.00275
##
##
## ***** (V5) MetastasisRegional *****
## Before Matching After Matching
## mean treatment........ 0.314 0.314
## mean control.......... 0.316 0.314
## std mean diff......... -0.337 -0.0256
##
## mean raw eQQ diff..... 0.00158 5.65e-05
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.000783 2.83e-05
## med eCDF diff........ 0.000783 2.83e-05
## max eCDF diff........ 0.00157 5.65e-05
##
## var ratio (Tr/Co)..... 0.997 1
## T-test p-value........ 0.706 0.18
##
##
## ***** (V6) MetastasisDistant *****
## Before Matching After Matching
## mean treatment........ 0.043 0.043
## mean control.......... 0.063 0.0429
## std mean diff......... -9.85 0.0585
##
## mean raw eQQ diff..... 0.02 5.65e-05
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00999 2.83e-05
## med eCDF diff........ 0.00999 2.83e-05
## max eCDF diff........ 0.02 5.65e-05
##
## var ratio (Tr/Co)..... 0.697 1
## T-test p-value........ <2e-16 0.0833
##
##
## ***** (V7) MorphologyCarcinoma in situ, NOS *****
## Before Matching After Matching
## mean treatment........ 0.00281 0.00281
## mean control.......... 0.00505 0.00249
## std mean diff......... -4.24 0.598
##
## mean raw eQQ diff..... 0.00226 0.000207
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00112 0.000104
## med eCDF diff........ 0.00112 0.000104
## max eCDF diff........ 0.00224 0.000207
##
## var ratio (Tr/Co)..... 0.557 1.13
## T-test p-value........ 6.22e-05 0.144
##
##
## ***** (V8) MorphologyInfiltrating duct carcinoma, NOS *****
## Before Matching After Matching
## mean treatment........ 0.768 0.768
## mean control.......... 0.74 0.77
## std mean diff......... 6.57 -0.511
##
## mean raw eQQ diff..... 0.0277 0.00115
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0139 0.000575
## med eCDF diff........ 0.0139 0.000575
## max eCDF diff........ 0.0278 0.00115
##
## var ratio (Tr/Co)..... 0.927 1.01
## T-test p-value........ 5.85e-13 4.47e-11
##
##
## ***** (V9) MorphologyIntraductal and lobular in situ carcinoma *****
## Before Matching After Matching
## mean treatment........ 0.0644 0.0644
## mean control.......... 0.0677 0.0636
## std mean diff......... -1.33 0.339
##
## mean raw eQQ diff..... 0.00328 0.000415
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00163 0.000207
## med eCDF diff........ 0.00163 0.000207
## max eCDF diff........ 0.00327 0.000415
##
## var ratio (Tr/Co)..... 0.955 1.01
## T-test p-value........ 0.142 0.0127
##
##
## ***** (V10) MorphologyLobular carcinoma in situ *****
## Before Matching After Matching
## mean treatment........ 0.0695 0.0695
## mean control.......... 0.081 0.0695
## std mean diff......... -4.49 0.0194
##
## mean raw eQQ diff..... 0.0114 0
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 0
##
## mean eCDF diff........ 0.00571 0
## med eCDF diff........ 0.00571 0
## max eCDF diff........ 0.0114 0
##
## var ratio (Tr/Co)..... 0.87 1
## T-test p-value........ 1.29e-06 0.858
##
##
## ***** (V11) MorphologyMucinous adenocarcinoma *****
## Before Matching After Matching
## mean treatment........ 0.0165 0.0165
## mean control.......... 0.019 0.0154
## std mean diff......... -2 0.901
##
## mean raw eQQ diff..... 0.00255 0.000679
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00127 0.000339
## med eCDF diff........ 0.00127 0.000339
## max eCDF diff........ 0.00254 0.000679
##
## var ratio (Tr/Co)..... 0.869 1.07
## T-test p-value........ 0.0314 1.53e-05
##
##
## ***** (V12) MorphologyOther *****
## Before Matching After Matching
## mean treatment........ 0.0763 0.0763
## mean control.......... 0.0839 0.0769
## std mean diff......... -2.88 -0.235
##
## mean raw eQQ diff..... 0.00768 0.000415
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00382 0.000207
## med eCDF diff........ 0.00382 0.000207
## max eCDF diff........ 0.00765 0.000415
##
## var ratio (Tr/Co)..... 0.916 0.993
## T-test p-value........ 0.00163 0.00518
##
##
## ***** (V13) GradeGrade 2 *****
## Before Matching After Matching
## mean treatment........ 0.425 0.425
## mean control.......... 0.424 0.425
## std mean diff......... 0.185 -0.04
##
## mean raw eQQ diff..... 0.00089 9.42e-05
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.000457 4.71e-05
## med eCDF diff........ 0.000457 4.71e-05
## max eCDF diff........ 0.000913 9.42e-05
##
## var ratio (Tr/Co)..... 1 1
## T-test p-value........ 0.836 0.398
##
##
## ***** (V14) GradeGrade 3 *****
## Before Matching After Matching
## mean treatment........ 0.333 0.333
## mean control.......... 0.367 0.334
## std mean diff......... -7.15 -0.151
##
## mean raw eQQ diff..... 0.0337 0.000339
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0168 0.00017
## med eCDF diff........ 0.0168 0.00017
## max eCDF diff........ 0.0337 0.000339
##
## var ratio (Tr/Co)..... 0.957 0.999
## T-test p-value........ 2.89e-15 0.00146
##
##
## ***** (V15) GradeGrade 4 *****
## Before Matching After Matching
## mean treatment........ 0.00831 0.00831
## mean control.......... 0.0109 0.00795
## std mean diff......... -2.87 0.392
##
## mean raw eQQ diff..... 0.00263 0.00017
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0013 8.48e-05
## med eCDF diff........ 0.0013 8.48e-05
## max eCDF diff........ 0.00261 0.00017
##
## var ratio (Tr/Co)..... 0.763 1.04
## T-test p-value........ 0.00283 0.00665
##
##
## ***** (V16) Size *****
## Before Matching After Matching
## mean treatment........ 21.8 21.8
## mean control.......... 24 21.7
## std mean diff......... -9.99 0.619
##
## mean raw eQQ diff..... 2.32 0.154
## med raw eQQ diff..... 2 0
## max raw eQQ diff..... 580 51
##
## mean eCDF diff........ 0.0122 0.000854
## med eCDF diff........ 0.00115 0.000452
## max eCDF diff........ 0.0921 0.00899
##
## var ratio (Tr/Co)..... 0.751 1.05
## T-test p-value........ <2e-16 4.44e-16
## KS Bootstrap p-value.. <2e-16 <2e-16
## KS Naive p-value...... <2e-16 0.0274
## KS Statistic.......... 0.0921 0.00899
##
##
## Before Matching Minimum p.value: <2e-16
## Variable Name(s): SexMale Age.At.Diagnosis MetastasisDistant Size Number(s): 3 4 6 16
##
## After Matching Minimum p.value: <2e-16
## Variable Name(s): Size Number(s): 16
#in case I want to plot these values...
before <- sapply(1:16, function(x) MBRadCan$BeforeMatching[[x]]$sdiff) %>% abs
after <- sapply(1:16, function(x) MBRadCan$AfterMatching[[x]]$sdiff) %>% abs
library(tidyr)
plotFrame <- data.frame(before, after)
plotFrame <- plotFrame[order(plotFrame$before, decreasing = TRUE),]
plotFrame <- gather(plotFrame)
plotFrame$beforeAfter <- plotFrame$key
plotFrame$key <- c(1:16, 1:16)
ggplot(plotFrame, aes(y = key, x = value, color = beforeAfter)) +
geom_point() +
ggtitle("Love Plot for All Predictors") +
ylab("Variable (or Factor Level) ID") +
xlab("Absolute Standardized Difference") +
theme(legend.title = element_blank())
Looking at the Love plot, the absolute standardized differences relating to Radiation have been reduced by a lot. Note that I did not stratify this plot according to treatment with radiation. So, it’s perhaps not as good a representation as it could be. It may actually also not qualify as a Love plot.
Here, I’ll run the analysis using matched observations and see what I get.
Here, I’ll show a summary of the matched logistic regression:
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| RadiationRadiation | -0.315 | 0.023 | -13.71 | 0 |
| RaceBlack | -0.067 | 0.051 | -1.302 | 0.193 |
| RaceOther | -0.185 | 0.059 | -3.138 | 0.002 |
| SexMale | -0.788 | 0.387 | -2.038 | 0.042 |
| Age.At.Diagnosis | 0.015 | 0.001 | 14.75 | 0 |
| ordered(Metastasis).L | 0.194 | 0.053 | 3.662 | 0 |
| ordered(Metastasis).Q | 0.008 | 0.035 | 0.245 | 0.807 |
| MorphologyCarcinoma in situ, NOS | -0.487 | 0.475 | -1.025 | 0.305 |
| MorphologyInfiltrating duct carcinoma, NOS | -0.195 | 0.304 | -0.64 | 0.522 |
| MorphologyIntraductal and lobular in situ carcinoma | 0.145 | 0.308 | 0.472 | 0.637 |
| MorphologyLobular carcinoma in situ | 0.061 | 0.308 | 0.197 | 0.844 |
| MorphologyMucinous adenocarcinoma | -0.141 | 0.326 | -0.431 | 0.666 |
| MorphologyOther | -0.045 | 0.308 | -0.145 | 0.884 |
| ordered(Grade).L | -0.195 | 0.127 | -1.537 | 0.124 |
| ordered(Grade).Q | 0.094 | 0.095 | 0.99 | 0.322 |
| ordered(Grade).C | 0.099 | 0.046 | 2.142 | 0.032 |
| Size | 0.001 | 0.001 | 1.249 | 0.212 |
| (Intercept) | -2.963 | 0.315 | -9.401 | 0 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 59029 on 106109 degrees of freedom |
| Residual deviance: | 58334 on 106092 degrees of freedom |
I’ll remake a logistic regression to fit the unmatched data (that was used to generate the matched dataset). That way, variation in the specific observations won’t interfere with the results.
Here’s the unmatched model summary:
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| RadiationRadiation | -0.241 | 0.034 | -7.191 | 0 |
| RaceBlack | -0.07 | 0.056 | -1.234 | 0.217 |
| RaceOther | -0.094 | 0.061 | -1.529 | 0.126 |
| SexMale | -0.2 | 0.201 | -0.997 | 0.319 |
| Age.At.Diagnosis | 0.011 | 0.001 | 8.702 | 0 |
| ordered(Metastasis).L | 0.052 | 0.055 | 0.931 | 0.352 |
| ordered(Metastasis).Q | -0.09 | 0.039 | -2.306 | 0.021 |
| MorphologyCarcinoma in situ, NOS | 0.117 | 0.478 | 0.245 | 0.806 |
| MorphologyInfiltrating duct carcinoma, NOS | 0.418 | 0.364 | 1.147 | 0.251 |
| MorphologyIntraductal and lobular in situ carcinoma | 0.839 | 0.368 | 2.278 | 0.023 |
| MorphologyLobular carcinoma in situ | 0.783 | 0.368 | 2.13 | 0.033 |
| MorphologyMucinous adenocarcinoma | 0.534 | 0.383 | 1.397 | 0.163 |
| MorphologyOther | 0.579 | 0.368 | 1.574 | 0.116 |
| ordered(Grade).L | 0.009 | 0.111 | 0.08 | 0.936 |
| ordered(Grade).Q | 0.152 | 0.085 | 1.802 | 0.072 |
| ordered(Grade).C | 0.129 | 0.045 | 2.859 | 0.004 |
| Size | 0 | 0.001 | -0.023 | 0.982 |
| (Intercept) | -3.394 | 0.376 | -9.032 | 0 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 27896 on 49999 degrees of freedom |
| Residual deviance: | 27586 on 49982 degrees of freedom |
And here are the odds ratios for each of the models. First, I’ll display the odds ratios for the matched model:
## function(model) {
## cbind(
## exp(cbind(
## OR = coef(model),
## confint(model))),
## P = summary(model)$coefficients[,4])
## }
Waiting for profiling to be done…
| OR | 2.5 % | 97.5 % | P | |
|---|---|---|---|---|
| (Intercept) | 0.052 | 0.026 | 0.092 | 0 |
| RadiationRadiation | 0.73 | 0.698 | 0.764 | 0 |
| RaceBlack | 0.935 | 0.845 | 1.033 | 0.193 |
| RaceOther | 0.832 | 0.74 | 0.932 | 0.002 |
| SexMale | 0.455 | 0.193 | 0.899 | 0.042 |
| Age.At.Diagnosis | 1.015 | 1.013 | 1.017 | 0 |
| ordered(Metastasis).L | 1.214 | 1.093 | 1.345 | 0 |
| ordered(Metastasis).Q | 1.009 | 0.942 | 1.079 | 0.807 |
| MorphologyCarcinoma in situ, NOS | 0.615 | 0.233 | 1.541 | 0.305 |
| MorphologyInfiltrating duct carcinoma, NOS | 0.823 | 0.473 | 1.576 | 0.522 |
| MorphologyIntraductal and lobular in situ carcinoma | 1.157 | 0.659 | 2.23 | 0.637 |
| MorphologyLobular carcinoma in situ | 1.063 | 0.606 | 2.047 | 0.844 |
| MorphologyMucinous adenocarcinoma | 0.869 | 0.475 | 1.725 | 0.666 |
| MorphologyOther | 0.956 | 0.545 | 1.843 | 0.884 |
| ordered(Grade).L | 0.823 | 0.633 | 1.043 | 0.124 |
| ordered(Grade).Q | 1.099 | 0.903 | 1.314 | 0.322 |
| ordered(Grade).C | 1.104 | 1.005 | 1.205 | 0.032 |
| Size | 1.001 | 0.999 | 1.002 | 0.212 |
And here are the odds ratios for the unmatched model:
Waiting for profiling to be done…
| OR | 2.5 % | 97.5 % | P | |
|---|---|---|---|---|
| (Intercept) | 0.034 | 0.015 | 0.066 | 0 |
| RadiationRadiation | 0.786 | 0.736 | 0.839 | 0 |
| RaceBlack | 0.933 | 0.834 | 1.041 | 0.217 |
| RaceOther | 0.91 | 0.806 | 1.025 | 0.126 |
| SexMale | 0.818 | 0.539 | 1.19 | 0.319 |
| Age.At.Diagnosis | 1.011 | 1.008 | 1.013 | 0 |
| ordered(Metastasis).L | 1.053 | 0.943 | 1.172 | 0.352 |
| ordered(Metastasis).Q | 0.914 | 0.847 | 0.986 | 0.021 |
| MorphologyCarcinoma in situ, NOS | 1.124 | 0.443 | 2.976 | 0.806 |
| MorphologyInfiltrating duct carcinoma, NOS | 1.519 | 0.795 | 3.379 | 0.251 |
| MorphologyIntraductal and lobular in situ carcinoma | 2.313 | 1.2 | 5.177 | 0.023 |
| MorphologyLobular carcinoma in situ | 2.188 | 1.136 | 4.894 | 0.033 |
| MorphologyMucinous adenocarcinoma | 1.706 | 0.855 | 3.907 | 0.163 |
| MorphologyOther | 1.785 | 0.926 | 3.995 | 0.116 |
| ordered(Grade).L | 1.009 | 0.804 | 1.245 | 0.936 |
| ordered(Grade).Q | 1.165 | 0.98 | 1.367 | 0.072 |
| ordered(Grade).C | 1.138 | 1.039 | 1.241 | 0.004 |
| Size | 1 | 0.998 | 1.001 | 0.982 |
Looking at the odds ratios of matched versus unmatched models, it appears that matching actually shows reduced odds of developing a second cancer given radiation treatment. Despite the fact that I understand radiation is associated with cancer, it would appear that the therapeutic use of radiation is actually associated with decreased odds of new cancer.
The next step I would take is to remove observations of breast cancer as a second cancer to remove the chance that we observe recurring cancer.
It would be interesting to see how this implementation of propensity score matching compares to other implementations and other matching approaches. There’s a “genetic algorithm” approach in the Matching package, that could be fun.
It would be pretty cool to implement n-fold validation on the new data, given that there’s so much of it. This might provide a better estimate of model performance.
I want to learn more about survival analysis and longitudinal data analysis. I appreciated our discussion in class, but there’s just so much more for me to learn in so many different areas. I’m still figuring out where I should start.
I think that, at this point in my career, I know enough about statistics to be dangerous. To keep this danger in check, I’ll be consulting with professionals to make sure it’s safe when I try this at home.
However, I would like to reach a point where I can learn an arbitrary new technique and confidently present results using that technique.
Here are the numbers of observations falling into each category of categorical variables:
| Overall | |
|---|---|
| n | 369962 |
| MultipleCancerDx = Second Cancer (%) | 29660 ( 8.0) |
| Race (%) | |
| White | 297665 ( 80.5) |
| Black | 39008 ( 10.5) |
| Other | 33289 ( 9.0) |
| Sex = Male (%) | 2703 ( 0.7) |
| Metastasis (%) | |
| Localized | 234136 ( 63.3) |
| Regional | 116968 ( 31.6) |
| Distant | 18858 ( 5.1) |
| Morphology (%) | |
| Adenocarcinoma in situ | 1181 ( 0.3) |
| Carcinoma in situ, NOS | 1607 ( 0.4) |
| Infiltrating duct carcinoma, NOS | 279158 ( 75.5) |
| Intraductal and lobular in situ carcinoma | 24439 ( 6.6) |
| Lobular carcinoma in situ | 27241 ( 7.4) |
| Mucinous adenocarcinoma | 7040 ( 1.9) |
| Other | 29296 ( 7.9) |
| Grade (%) | |
| Grade 1 | 80170 ( 21.7) |
| Grade 2 | 157063 ( 42.5) |
| Grade 3 | 129104 ( 34.9) |
| Grade 4 | 3625 ( 1.0) |
| Behavior.For.Analysis (%) | |
| Benign | 0 ( 0.0) |
| Borderline Malignant | 0 ( 0.0) |
| In situ | 0 ( 0.0) |
| Maignant | 369962 (100.0) |
| No longer reportable in ICDO3 | 0 ( 0.0) |
| Only malignant 2010+ | 0 ( 0.0) |
| Only malignant in ICDO3 | 0 ( 0.0) |
| Survival.Flag (%) | |
| Date Last Contact = Date Dx | 344 ( 0.1) |
| Date Last Contact > Date Dx | 334717 ( 90.5) |
| Incomplete dates, maybe zero FU days | 474 ( 0.1) |
| Incomplete dates, not zero FU days | 34427 ( 9.3) |
| Unknown | 0 ( 0.0) |
| Survival.PA.Flag (%) | |
| Date Last Contact = Date Dx | 91 ( 0.0) |
| Date Last Contact > Date Dx | 335170 ( 90.6) |
| Incomplete dates, maybe zero FU days | 404 ( 0.1) |
| Incomplete dates, not zero FU days | 34297 ( 9.3) |
| Unknown | 0 ( 0.0) |
| Death.Due.To.Cancer (%) | |
| Alive or Dead due to cancer | 349553 ( 94.5) |
| Dead | 20409 ( 5.5) |
| NA - not first tumor | 0 ( 0.0) |
| Death.Not.Due.To.Cancer (%) | |
| Alive or Dead of other cause | 341638 ( 92.3) |
| Dead | 28324 ( 7.7) |
| NA - not first tumor | 0 ( 0.0) |
| Radiation = Radiation (%) | 186972 ( 50.5) |