Abstract

Background

Development of cancer after a previous diagnosis of cancer can be devastating, especially for those people whose cancer went into remission. The following project seeks to use SEER data to understand what impact various first-cancer characteristics and treatments have on development of a second cancer.

Methods

Two approaches are taken to achieve this goal: (1) logistic regression where second cancer development is the outcome, and (2) survival analysis where second cancer development is the event. Predictors were initially selected on the basis of their suitability for model building, such that variables meant to be used for cohort construction (e.g. type of cancer) were not used as predictors. Models were first fitted using every pre-selected predictor, and a second, reduced model was then evaluated against the full one. Diagnostics and validation were performed both in- and out-of-sample. A propensity score match was attempted and revealed similar results.

Conclusions

The development of a second cancer after an initial diagnosis of breast cancer largely happens within the span of about 100 months. Distant metastases of cancer, certain cancer morphologies, and some grades of cancer appear to increase the odds or rate of second cancer development. Radiation treatment is associated with significantly decreased odds and rate of second cancer development, a finding that may be supported biologically by successful treatment decreasing the chance of cancer spreading to other areas of the body.

The next direction to move from this conclusion would be to match patients to see if the effect of radiation on second cancer development remains.

I. Logistic Regression Project Instruction Answers

1. Table One

Provide a Table 1 that describes the predictors in your final model, broken down by the levels of your outcome variable.

Descriptions of predictors stratified by people who did not and did develop a second cancer
  Only One Cancer Second Cancer p test
n 340302 29660
Race (%) <0.001
White 273127 (80.3) 24538 (82.7)
Black 36149 (10.6) 2859 ( 9.6)
Other 31026 ( 9.1) 2263 ( 7.6)
Sex = Male (%) 2476 ( 0.7) 227 ( 0.8) 0.486
Age.At.Diagnosis (mean (sd)) 59.81 (13.78) 62.31 (13.24) <0.001
Metastasis (%) 0.004
Localized 215499 (63.3) 18637 (62.8)
Regional 107376 (31.6) 9592 (32.3)
Distant 17427 ( 5.1) 1431 ( 4.8)
Morphology (%) <0.001
Adenocarcinoma in situ 1094 ( 0.3) 87 ( 0.3)
Carcinoma in situ, NOS 1509 ( 0.4) 98 ( 0.3)
Infiltrating duct carcinoma, NOS 258327 (75.9) 20831 (70.2)
Intraductal and lobular in situ carcinoma 21789 ( 6.4) 2650 ( 8.9)
Lobular carcinoma in situ 24478 ( 7.2) 2763 ( 9.3)
Mucinous adenocarcinoma 6339 ( 1.9) 701 ( 2.4)
Other 26766 ( 7.9) 2530 ( 8.5)
Grade (%) <0.001
Grade 1 73168 (21.5) 7002 (23.6)
Grade 2 143867 (42.3) 13196 (44.5)
Grade 3 119930 (35.2) 9174 (30.9)
Grade 4 3337 ( 1.0) 288 ( 1.0)
Radiation = Radiation (%) 173864 (51.1) 13108 (44.2) <0.001

2. Spending Degrees of Freedom

How did you make decisions about spending the degrees of freedom that you spent?

I’m not sure whether the ratio of observations to candidate predictors should be calculated using the full set of available observations or the subset I’ll use to build the model. Either way, my use of degrees of freedom is not limited by the number of observations available; in this sense, I am rich in degrees of freedom to spend.

3. Model Selection Methods

What approach(es) did you use in selecting your model? How well did they work?

My model was developed for its explanatory value, so I used all of the variables.

I did make a Spearman Rho-Squared plot to see whether there were good predictors on which to spend additional degrees of freedom, and found that I could use a restricted cubic spline on age at diagnosis as well as an interaction between two potentially-related variables (metastasis and grade).

Calculating variance inflation factors showed substantial collinearity in the model that included the RCS and interaction terms. So, I refit the model without those terms and compared the two against each other using an analysis of deviance.

4. Interpreting Odds and Probabilities

How do you interpret your results in terms of probabilities or odds, rather than log odds?

Exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of developing a second cancer for a one-unit change in the predictor, or for membership in a category relative to its reference level. For example, the odds ratio of 0.738 for radiation in Section II means that, holding the other predictors constant, patients who received radiation had roughly 26% lower odds of developing a second cancer. The full set of odds ratios is presented and discussed in Section II.

II. Logistic Regression Model Development

For the purposes of this project, I will be limiting myself to a sample set of forty thousand observations to develop a logistic regression model. This is about 11% of my total sample size, so there is ample room for out-of-sample validation.

Spearman Rho-Squared

Before selecting predictors, I will look at a Spearman \(\rho^2\) plot to see whether I should use any interaction, polynomial, or restricted cubic spline terms for the predictors.

Age at diagnosis appears to be very important, based on its p-value.

Because Grade can relate to Metastasis, I’ll add an interaction term between the two. To visualize this potential interaction, here is a plot:

This doesn’t seem like a perfectly random association between categories, so I’ll include an interaction term between the two variables. I’ll also include a restricted cubic spline on age at diagnosis, since the relationship between age and second cancer development may not be linear: the longer someone lives, the more time they have to develop a second cancer.

As previously mentioned, missingness isn’t an issue outside of the variable Surgery. Missingness in any other variable accounts for less than 5% of the total number of observations, with the exception of the variables specifically for people who developed a second cancer.

Next, I’ll make the models.
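
Before comparing them, here is a minimal sketch of how the two candidate models might be fit. The training-sample name bcfpModTrain matches the code shown in Section III; the exact formula, the 3-df natural spline (the same family as a restricted cubic spline), and the object names are assumptions rather than the code actually used.

library(splines)   #ns() for a natural (restricted) cubic spline basis

#kitchen-sink model: every pre-selected predictor, no extra terms
modelGLM <- glm(MultipleCancerDx ~ Race + Sex + Age.At.Diagnosis +
                  ordered(Metastasis) + ordered(Grade) + Morphology +
                  Radiation + Size,
                family = binomial, data = bcfpModTrain)

#richer model: spline on age plus a Grade x Metastasis interaction
modelGLMFlex <- glm(MultipleCancerDx ~ Race + Sex + ns(Age.At.Diagnosis, df = 3) +
                      ordered(Metastasis) * ordered(Grade) + Morphology +
                      Radiation + Size,
                    family = binomial, data = bcfpModTrain)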

Collinearity

Because I’ve used several interactions and an RCS, and because the severity (Grade) of cancer is known to be associated with its chance of metastasizing, I’ll take a quick look at variance inflation factors (VIFs) to make sure there’s no glaring issue with collinearity.

Variance Inflation Factors for a GLM
Variable VIF
RaceBlack 1.029
RaceOther 1.019
SexMale 1.006
Age.At.Diagnosis 1.055
ordered(Metastasis).L 2.007
ordered(Metastasis).Q 1.862
ordered(Grade).L 5.48
ordered(Grade).Q 4.99
ordered(Grade).C 2.708
MorphologyCarcinoma in situ, NOS 3.245
MorphologyInfiltrating duct carcinoma, NOS 106
MorphologyIntraductal and lobular in situ carcinoma 42.47
MorphologyLobular carcinoma in situ 40.64
MorphologyMucinous adenocarcinoma 10.61
MorphologyOther 44.31
RadiationRadiation 1.018
Size 1.161

Note: the model used to evaluate these VIF values does NOT include the interaction terms; including them would introduce collinearity by construction and make the check uninformative.

None of these values are too terrible, except for the Morphology terms and the “L” (linear) contrast of Grade.

I’ll make a new model without the interactions or restricted cubic splines and see which performs better on analysis of deviance.
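
As a sketch of how the collinearity check and the model comparison might be carried out (using the hypothetical model objects from the sketch above; the per-coefficient VIFs are computed directly from the design matrix):

#per-coefficient VIFs from the design matrix of the simpler model
#(VIF_j is the jth diagonal element of the inverse correlation matrix)
X <- model.matrix(modelGLM)[, -1]      #drop the intercept column
round(diag(solve(cor(X))), 3)

#likelihood ratio (analysis of deviance) comparison of the nested models
anova(modelGLMFlex, modelGLM, test = "Chisq")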

How do these models compare in ANOVA?

To help choose between the models including and not including interactions, here are the results of an analysis of deviance:

Analysis of Deviance Table
Model Resid. Df Resid. Dev Df Deviance Pr(>Chi)
With RCS and interaction 39974 21752 NA NA NA
Without RCS or interaction 39982 21786 -8 -34.41 0

Although the difference is statistically significant, the interaction and restricted cubic spline terms reduce the deviance by only about 34 out of roughly 21,800, so their added predictive value is negligible in practical terms. Given that, and the collinearity they introduce, I’ll stick with the model that doesn’t include either of those features.

Logistic Regression Summary

Following are descriptions of the model predicting second cancer development. Note that this is the model without the interaction or spline terms.

Model summary data are presented below.

  Estimate Std. Error z value Pr(>|z|)
RaceBlack -0.067 0.063 -1.058 0.29
RaceOther -0.178 0.071 -2.505 0.012
SexMale -0.212 0.218 -0.972 0.331
Age.At.Diagnosis 0.011 0.001 8.24 0
ordered(Metastasis).L -0.018 0.064 -0.281 0.779
ordered(Metastasis).Q -0.036 0.045 -0.804 0.422
ordered(Grade).L 0.004 0.129 0.034 0.973
ordered(Grade).Q 0.09 0.098 0.924 0.355
ordered(Grade).C 0.067 0.052 1.303 0.193
MorphologyCarcinoma in situ, NOS 0.507 0.505 1.006 0.315
MorphologyInfiltrating duct carcinoma, NOS 0.422 0.421 1.002 0.316
MorphologyIntraductal and lobular in situ carcinoma 0.845 0.425 1.988 0.047
MorphologyLobular carcinoma in situ 0.597 0.425 1.405 0.16
MorphologyMucinous adenocarcinoma 0.494 0.442 1.117 0.264
MorphologyOther 0.654 0.425 1.539 0.124
RadiationRadiation -0.304 0.038 -8.022 0
Size 0.001 0.001 1.032 0.302
(Intercept) -3.479 0.434 -8.026 0

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 22022 on 39999 degrees of freedom
Residual deviance: 21786 on 39982 degrees of freedom

Logistic Regression Odds Ratios:

Odds Ratios for Model Predictors
  OR 2.5 % 97.5 % P
(Intercept) 0.031 0.012 0.066 0
RaceBlack 0.935 0.824 1.057 0.29
RaceOther 0.837 0.727 0.96 0.012
SexMale 0.809 0.513 1.212 0.331
Age.At.Diagnosis 1.011 1.009 1.014 0
ordered(Metastasis).L 0.982 0.865 1.11 0.779
ordered(Metastasis).Q 0.965 0.884 1.052 0.422
ordered(Grade).L 1.004 0.771 1.279 0.973
ordered(Grade).Q 1.094 0.896 1.315 0.355
ordered(Grade).C 1.07 0.964 1.181 0.193
MorphologyCarcinoma in situ, NOS 1.661 0.643 4.821 0.315
MorphologyInfiltrating duct carcinoma, NOS 1.524 0.729 3.907 0.316
MorphologyIntraductal and lobular in situ carcinoma 2.328 1.102 6.005 0.047
MorphologyLobular carcinoma in situ 1.817 0.86 4.69 0.16
MorphologyMucinous adenocarcinoma 1.638 0.744 4.333 0.264
MorphologyOther 1.923 0.911 4.956 0.124
RadiationRadiation 0.738 0.685 0.795 0
Size 1.001 0.999 1.002 0.302

Looking at the odds ratios, older age and the intraductal and lobular in situ carcinoma morphology are associated with increased odds of developing a second cancer. On the other hand, races other than white or black and the use of radiation treatment are associated with decreased odds of developing a second cancer.
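
For reference, the odds ratios and confidence intervals above can be produced with the same pattern as the helper function echoed later in Section VI; a sketch, assuming the fitted model object is called modelGLM:

#exponentiate the coefficients and their profile-likelihood confidence limits,
#then attach the Wald p-values from the model summary
oddsRatios <- cbind(
  exp(cbind(OR = coef(modelGLM), confint(modelGLM))),
  P = summary(modelGLM)$coefficients[, 4])

round(oddsRatios, 3)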

What does an ANOVA of this model say?

Here’s an analysis of deviance for the chosen model:

Analysis of Deviance Table
  Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA NA 39999 22022 NA
Race 2 14.07 39997 22008 0.001
Sex 1 0.242 39996 22008 0.622
Age.At.Diagnosis 1 92.4 39995 21915 0
ordered(Metastasis) 2 1.107 39993 21914 0.575
ordered(Grade) 3 8.982 39990 21905 0.03
Morphology 6 52.74 39984 21852 0
Radiation 1 65.23 39983 21787 0
Size 1 0.986 39982 21786 0.321

It appears that several of the predictors are statistically significant. Specifically, race, age at diagnosis, grade, morphology, and radiation are all statistically significant.

How does this model look according to summary statistics?

The p-value of the overall model is as close to zero as R can go, and I expect this might not be too useful a measure of performance because having so many observations probably drives the p-value down.

The Nagelkerke’s R^2 is 0.014, indicating that this model accounts for very little of the variation in response.

The Brier score is 0.072, which is close to what a no-information model would achieve given the roughly 8% event rate, so the model has little predictive power.

The model has a C-statistic of 0.584, which is also pretty low. Here’s an ROC curve, where the C-statistic is equal to the area under the curve:

## 
## Call:
## roc.formula(formula = bcfpModTrain$MultipleCancerDx ~ predict(modelGLM,     type = "response"))
## 
## Data: predict(modelGLM, type = "response") in 36857 controls (bcfpModTrain$MultipleCancerDx Only One Cancer) < 3143 cases (bcfpModTrain$MultipleCancerDx Second Cancer).
## Area under the curve: 0.5842
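
These summary measures can be obtained by refitting the same predictors with rms::lrm, whose stats vector includes Nagelkerke’s R^2, the Brier score, and the C-statistic, and by drawing the ROC curve with pROC. This is a sketch rather than the original code; for simplicity, Grade and Metastasis are entered as plain factors here, and the object names are assumptions.

library(rms)
library(pROC)

#refit with lrm, keeping the design and response for later validation
modelLRM <- lrm(MultipleCancerDx ~ Race + Sex + Age.At.Diagnosis + Metastasis +
                  Grade + Morphology + Radiation + Size,
                data = bcfpModTrain, x = TRUE, y = TRUE)

modelLRM$stats[c("R2", "Brier", "C", "Dxy")]   #Nagelkerke R2, Brier, C-statistic, Somers' Dxy

#ROC curve based on the glm's fitted probabilities
rocTrain <- roc(bcfpModTrain$MultipleCancerDx ~ predict(modelGLM, type = "response"))
plot(rocTrain)
auc(rocTrain)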

How does this perform in diagnostic tests?

Here’s a calibration plot to show how our model compares to the observed probabilities used to create the model:

## 
## n=40000   Mean absolute error=0.002   Mean squared error=2e-05
## 0.9 Quantile of absolute error=0.004

The model seems to perform fine for predicted probabilities between about 0.06 and 0.11. For probabilities greater than 0.12, the calibration curve departs from the observed probabilities by an increasingly wide margin.

In section I of this document, I mentioned concern about a possible interaction between Grade and Metastasis. Here’s an interaction plot to see how that turned out:

It looks like there might be an interaction, where the metastasis-stratified probability values converge as grade increases. However, the pattern is not pronounced enough to justify adding the interaction back into the model.

Let’s see what diagnostic plots have to say about possible influential observations:

There are several high-leverage points, but none of them has a large residual, so this is acceptable.

Here are results from cross-validation of the model:

40-fold validation of the model
  index.orig training test optimism index.corrected n
Dxy 0.168 0.173 0.164 0.009 0.16 40
R2 0.014 0.015 0.013 0.002 0.012 40
Intercept 0 0 -0.137 0.137 -0.137 40
Slope 1 1 0.944 0.056 0.944 40
Emax 0 0 0.04 0.04 0.04 40
D 0.006 0.006 0.006 0.001 0.005 40
U 0 0 0 0 0 40
Q 0.006 0.006 0.006 0.001 0.005 40
B 0.072 0.072 0.072 0 0.072 40
g 0.319 0.328 0.309 0.019 0.299 40
gp 0.023 0.024 0.022 0.001 0.022 40

The corrected indices do change, but not by much. For Dxy the change is about five percent of the original value, and the absolute change in Nagelkerke’s R^2 is small. However, the intercept decreases by 0.137, which might indicate the fit won’t perform quite as well on new data.
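
The table above comes from rms::validate; a sketch of the call, assuming the lrm refit from the earlier sketch (stored with x = TRUE and y = TRUE):

set.seed(432431)   #the seed used elsewhere in this project

#40-fold cross-validation with optimism-corrected indices
validate(modelLRM, method = "crossvalidation", B = 40)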

How can we understand this model visually?

For fun, here’s a plot of the distribution of age at diagnosis, according to development of a second cancer:

Consistent with the model, it looks like age greater than the mean is associated with increased rates of developing a second cancer.

Next, I’ll look into the effect of age across different cancer morphologies. Here’s a bubble plot looking at Morphology vs Age.At.Diagnosis. The larger the bubble, the higher the probability of developing a second cancer.

The previous plot shows the predicted probability of developing a second cancer as proportional to the diameter of each circle, where each circle corresponds to an observation represented by its x- and y-axis values.

As an example, for the sixth level of morphology (mucinous adenocarcinoma), the predicted probability of developing a second cancer at age 90 is much larger than it is at age 38.

Also, it appears that if you have “carcinoma in situ”, your chances of developing a second cancer are relatively high no matter how old you are.

How does the model perform on new data?

Here is an ROC curve for new data:

## 
## Call:
## roc.default(response = newdata$MultipleCancerDx, predictor = predictedVals)
## 
## Data: predictedVals in 9183 controls (newdata$MultipleCancerDx Only One Cancer) < 817 cases (newdata$MultipleCancerDx Second Cancer).
## Area under the curve: 0.5939

The AUC using this new data is 0.594, which corresponds nicely to the validated AUC of 0.580.
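
A sketch of the out-of-sample check above, assuming newdata is a held-out subset of bcfp with the same columns as the training sample:

library(pROC)

#predicted probabilities of a second cancer for the held-out observations
predictedVals <- predict(modelGLM, newdata = newdata, type = "response")

#compare the predictions against the observed outcomes
rocNew <- roc(newdata$MultipleCancerDx, predictedVals)
auc(rocNew)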

Overall, the model isn’t so great. It does, however, point out the impact of some of its predictors on the development of a second cancer, and it was fun.


III. Survival Analysis Project Instruction Questions

#retype the dates for easier handling
bcfp$Date.of.Diag.1 <- 
   parse_date_time(bcfp$Date.of.Diag, orders = "my")

bcfp$Date.of.Diag.2 <-
  parse_date_time(bcfp$Two.Date.of.Diag, orders = "my")

#make an interval between the dates
bcfp$TimeToSecondCancer <- 
  with(bcfp, interval(start = Date.of.Diag.1, end = Date.of.Diag.2))
  
#convert the interval (stored in seconds) to years, then multiply by 12 
  #to get months. I convert via years because months have different 
    #numbers of days, but years always have the same number of months.
#the value 365.25 is used because every fourth year has 366 days;
  #this average year length introduces fewer issues than the varying 
    #lengths of the months would, overall
bcfp$TTSCMonths <- 
  (bcfp$TimeToSecondCancer@.Data * 12) / (60 * 60 * 24 * 365.25)

#make a time variable to use for analysis
  #this uses duration of survival for people who did not develop 
  #a second cancer,
    #and time to second cancer for those who did develop one
bcfp$timeCPH <- 
  ifelse(is.na(bcfp$TTSCMonths), 
         bcfp$Survival.Months, 
         bcfp$TTSCMonths)

#give time units
units(bcfp$timeCPH) <- "Months"

#next, I'll make an event indicator: 1 if a second cancer was diagnosed,
  #0 if the observation is censored (no second cancer recorded)
bcfp$censor <- 
  ifelse(is.na(bcfp$TTSCMonths), 0, 1)

#for reproducibility
set.seed(432431)

#again, time to subset the data for model training
bcfpModTrain <- sample_n(bcfp, 4e4) %>% select(MultipleCancerDx,
                                               Race,
                                               Sex,
                                               Age.At.Diagnosis, 
                                               Metastasis, 
                                               Grade, 
                                               Morphology,
                                               Radiation, 
                                               Size, 
                                               timeCPH,
                                               censor, 
                                               Case.Number)

bcfpSurv <- with(bcfpModTrain, Surv(time = timeCPH, event = censor, type = "right"))

#here, I'll make the overall Kaplan-Meier fit
bcfpSFit <- survfit(bcfpSurv ~ 1)

The following responses refer to a survival analysis described in Section IV.

1. Table One

Provide a Table 1 that describes the predictors in your final model, broken down by the levels of your outcome variable.

Please see Table 1 in Section I.

2. Assessing K-M Estimates Across Groups

Plot and assess the pair of Kaplan-Meier estimates comparing intervention groups

Radiation

Here is a plot of Kaplan-Meier estimates comparing groups that did and did not receive radiation treatment:

Interestingly, it looks like the group that received radiation had a lower hazard of developing a second cancer. Here’s a quantitative approach to evaluating this difference:

Call: Surv(time = timeCPH, event = censor) ~ Radiation Chisq = 106.476624 on 1 degrees of freedom, p = 0.000000
  N Observed Expected (O-E)^2/E (O-E)^2/V
Radiation=No Radiation 19819 1804 1514 55.59 106.5
Radiation=Radiation 20181 1388 1678 50.15 106.5

A log-rank test indicates that there is, indeed, a difference between groups who were and were not exposed to radiation as a treatment for cancer.
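
A sketch of the Kaplan-Meier comparison and log-rank test for radiation (and, with the grouping variable swapped, for the other predictors below), using the survival objects built in the code at the start of this section:

library(survival)

#Kaplan-Meier curves for time to second cancer, stratified by radiation
kmRad <- survfit(Surv(time = timeCPH, event = censor) ~ Radiation,
                 data = bcfpModTrain)
plot(kmRad, col = 1:2, xlab = "Months", ylab = "Proportion free of second cancer")
legend("bottomleft", legend = names(kmRad$strata), col = 1:2, lty = 1)

#log-rank test for a difference between the radiation groups
survdiff(Surv(time = timeCPH, event = censor) ~ Radiation, data = bcfpModTrain)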

Race

Here is a plot of Kaplan-Meier estimates comparing racial groups. The curves look fairly similar across groups. Here’s a quantitative approach to evaluating any difference:

Call: Surv(time = timeCPH, event = censor) ~ Race Chisq = 2.924377 on 2 degrees of freedom, p = 0.231729
  N Observed Expected (O-E)^2/E (O-E)^2/V
Race=White 32102 2614 2576 0.555 2.892
Race=Black 4290 314 332.2 0.992 1.114
Race=Other 3608 264 283.6 1.36 1.502

A log-rank test indicates that there is no statistically significant difference in survival across race.

Metastasis

Call: Surv(time = timeCPH, event = censor) ~ Metastasis Chisq = 26.744043 on 2 degrees of freedom, p = 0.000002
  N Observed Expected (O-E)^2/E (O-E)^2/V
Metastasis=Localized 25307 1971 2062 4.038 11.48
Metastasis=Regional 12600 1047 1006 1.641 2.41
Metastasis=Distant 2093 174 123.4 20.77 21.86

By the log-rank test, there is a highly statistically significant difference across metastasis groups; distant metastasis is associated with more second-cancer events than expected (that is, a shorter time to second cancer).

Grade

Call: Surv(time = timeCPH, event = censor) ~ Grade Chisq = 29.719009 on 3 degrees of freedom, p = 0.000002
  N Observed Expected (O-E)^2/E (O-E)^2/V
Grade=Grade 1 8604 739 699.1 2.281 2.938
Grade=Grade 2 16935 1451 1346 8.204 14.27
Grade=Grade 3 14064 966 1112 19.06 29.41
Grade=Grade 4 397 36 35.48 0.008 0.008

By the log-rank test, there is a highly statistically significant difference across Grade groups; patients with grade 3 cancer experienced fewer second-cancer events than expected.

3. Model Selection

What approach(es) did you use in selecting your Cox model? How well did they work?

I used stepwise selection based on AIC (via the step function). The approach didn’t add much: the selected model was essentially as good as the kitchen sink model, differing only in the removal of insignificant predictors (whose estimated effects were small anyway).

4. Meeting Assumptions

How well does your Cox model match up with the proportional hazards assumption?

The p-values provided by the output of my cox.zph command are concerning. They’re quite low, indicating significantly non-zero slopes in the scaled Schoenfeld residuals and thus some departure from proportional hazards. However, I would expect low p-values given the number of observations available to me.

Looking at the rho values, the biggest issue was the predictor Radiation. I plotted the survival curves for this variable alone, and the curves did, indeed, meet toward the end of the follow-up period.


IV. Survival Analysis Development

Setting up data for survival analysis

The event under consideration for this project is development of a second cancer. SEER provides a date of diagnosis for each record of cancer, where records of cancer are created for each new cancer found in a patient. The ID number, Case.Number, then can be used to identify individuals with more than one record in SEER. Further, a separate variable, “Sequence Number” (processed out of the data before use in this data set), can be used to identify whether this is the first, second, third, or so on cancer that a patient may have.

For the purposes of this study, survival time will be defined as the interval between the records of a first and second cancer. Patients who do not develop a second cancer are treated as censored at their last recorded survival time (for example, at death), contributing follow-up time up to that point.

To improve the accuracy of the dataset, observations will be kept only if the date of last contact is greater than the date of diagnosis. The rates of each possible follow-up interval are described for each population in the following tables:

Types of Follow-up Distribution Across Outcome
  Only One Cancer Second Cancer
Date Last Contact = Date Dx 340 4
Date Last Contact > Date Dx 308258 26459
Incomplete dates, maybe zero FU days 461 13
Incomplete dates, not zero FU days 31243 3184
Unknown 0 0

Looking at the distributions of follow-up times among patients who did and did not develop second cancers, each group is within about an order of magnitude difference of the other. I will take this as evidence that including only those patients with follow-up beyond the date of diagnosis will not introduce too much bias.

Making Cox Proportional Hazards Models
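
Below are the results; first, a minimal sketch of how the kitchen-sink Cox model might be fit. The formula mirrors the logistic model, and the object name coxFull is an assumption.

library(survival)

#kitchen-sink Cox proportional hazards model for time to second cancer
coxFull <- coxph(Surv(time = timeCPH, event = censor) ~ Race + Sex +
                   Age.At.Diagnosis + ordered(Metastasis) + ordered(Grade) +
                   Morphology + Radiation + Size,
                 data = bcfpModTrain)

summary(coxFull)   #hazard ratios, confidence intervals, concordance, Wald test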

Hazards and Confidence Intervals for the Cox Proportional Hazards Model
  exp(coef) exp(-coef) lower .95 upper .95
RaceBlack 0.979 1.021 0.87 1.102
RaceOther 0.973 1.027 0.857 1.105
SexMale 0.875 1.142 0.58 1.321
Age.At.Diagnosis 1.013 0.987 1.01 1.016
ordered(Metastasis).L 1.35 0.741 1.205 1.513
ordered(Metastasis).Q 1.064 0.94 0.981 1.153
ordered(Grade).L 0.966 1.035 0.77 1.212
ordered(Grade).Q 1.076 0.929 0.905 1.28
ordered(Grade).C 1.104 0.906 1.006 1.211
MorphologyCarcinoma in situ, NOS 1.209 0.827 0.45 3.249
MorphologyInfiltrating duct carcinoma, NOS 1.377 0.726 0.655 2.894
MorphologyIntraductal and lobular in situ carcinoma 1.826 0.548 0.862 3.867
MorphologyLobular carcinoma in situ 1.985 0.504 0.938 4.2
MorphologyMucinous adenocarcinoma 1.538 0.65 0.705 3.352
MorphologyOther 1.505 0.664 0.71 3.189
RadiationRadiation 0.719 1.391 0.67 0.772
Size 1 1 0.999 1.002
Concordance: C = 0.609 (se = 0.005)
Wald test: 334.2 on 17 df, p ≈ 0

As seen above, distant metastasis is associated with about 1.5 times the hazard of developing a second cancer, and regional metastasis is also associated with increased hazard. Grade 3 cancer is associated with decreased hazard of developing a second cancer. Finally, radiation is also associated with decreased hazard, with a hazard ratio of about 0.72 relative to patients not treated with radiation. Effects are described as significant here when the 95% confidence interval of the hazard ratio does not cross 1.

The Wald test for this model shows that, overall, the model is statistically significant.

Because there is a continuous predictor in the model, I included the concordance statistic. At 0.609, it is a little lower than is typical for survival analyses. This means there is some agreement between the observed survival times and the risk scores generated by the model, but it is far from perfect.

The Cox and Snell pseudo-R-squared for the overall model is very small, indicating that the fit isn’t much better than the intercept-only model; it reaches only about one percent of its maximum attainable value.

Here is an analysis of deviance for the model:

Analysis of Deviance Table
  loglik Chisq Df Pr(>|Chi|)
NULL -32495 NA NA NA
Race -32493 3.003 2 0.223
Sex -32493 0.02 1 0.887
Age.At.Diagnosis -32431 125.2 1 0
ordered(Metastasis) -32413 35.38 2 0
ordered(Grade) -32402 21.98 3 0
Morphology -32375 54.06 6 0
Radiation -32332 84.89 1 0
Size -32332 0.333 1 0.564

It appears that age at diagnosis, metastasis, grade, morphology, and radiation are all significant. Morphology wasn’t significant when broken down according to its levels, but the variable as a whole is significant. There may be a combination of factors that does have predictive value, but I wouldn’t know how to approach this possibility without imposing my own bias on the model.

Here is a plot of the survival curve for this model:

Note, the confidence interval curves are very close to the estimated curve. This makes the curve appear as if it is thick.

As an alternative, I’m going to see what the step function has to say about the kitchen sink model to see if there’s anything better. I won’t include step function output because it would make this document far longer.
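
A sketch of that stepwise search, assuming the coxFull object from the earlier sketch; step() uses AIC by default, and trace = 0 suppresses the lengthy output:

#AIC-based stepwise selection starting from the kitchen-sink Cox model
coxStep <- step(coxFull, trace = 0)

coxStep$call   #the call of the selected model

#likelihood ratio comparison of the full and stepwise-selected models
anova(coxFull, coxStep)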

Using AIC to select predictors, the step function returned the call coxph(formula = Surv(time = timeCPH, event = censor) ~ Age.At.Diagnosis + Metastasis + Grade + Morphology + Radiation, data = bcfpModTrain). So, the step function would have us exclude sex, race, and size from the model. This is consistent with what has appeared to be significant so far. I’ll use analysis of deviance to see which model has statistically significantly lower deviance.

Analysis of Deviance Table
Model loglik Chisq Df P(>|Chi|)
Kitchen sink (full) -32332 NA NA NA
Stepwise (AIC-selected) -32333 1.011 4 0.908

The drop in deviance is not significant, so I’ll probably stick with the kitchen sink model. Just to be sure that the stepwise-selected model isn’t that much better, I’ll see what a summary of the stepwise model looks like.

Hazards and Confidence Intervals for the Stepwise Cox Proportional Hazards Model
  exp(coef) exp(-coef) lower .95 upper .95
Age.At.Diagnosis 1.013 0.987 1.01 1.016
ordered(Metastasis).L 1.359 0.736 1.216 1.519
ordered(Metastasis).Q 1.064 0.94 0.981 1.154
ordered(Grade).L 0.967 1.034 0.771 1.213
ordered(Grade).Q 1.076 0.929 0.906 1.28
ordered(Grade).C 1.103 0.907 1.005 1.21
MorphologyCarcinoma in situ, NOS 1.219 0.821 0.454 3.273
MorphologyInfiltrating duct carcinoma, NOS 1.382 0.723 0.658 2.904
MorphologyIntraductal and lobular in situ carcinoma 1.838 0.544 0.868 3.891
MorphologyLobular carcinoma in situ 2.003 0.499 0.947 4.237
MorphologyMucinous adenocarcinoma 1.545 0.647 0.709 3.367
MorphologyOther 1.511 0.662 0.713 3.201
RadiationRadiation 0.72 1.389 0.671 0.772
Concordance: C = 0.609 (se = 0.005)
Wald test: 332.9 on 13 df, p ≈ 0

The retained predictors have essentially the same estimates and significance levels as in the kitchen sink model. For the purposes of explanation and description of the model, I’ll keep the statistically insignificant predictors.

Summarizing the effect size of the Cox PH Model

The plot above summarizes the estimated effects. Bars that don’t cross one indicate a significant impact on the hazard of developing a second cancer. For age at diagnosis going from 50 to 70, metastasis increasing from localized to regional, metastasis increasing from localized to distant, some of the morphology contrasts, and going from radiation to no radiation, the hazard of developing a second cancer increases.

Still, curiously, going from grade 2 to grade 3 decreases the hazard of developing a second cancer.

Testing the Proportional Hazards Assumption

Here’s a test for the cox proportional hazards assumption:

Testing the Cox Proportional Hazards Assumption
  rho chisq p
Race=Black 0.081 20.93 0
Race=Other -0.003 0.023 0.88
Sex=Male 0.028 2.493 0.114
Age.At.Diagnosis 0.089 23.26 0
Metastasis -0.072 15.92 0
Metastasis=3 0.008 0.219 0.64
Grade 0.038 4.577 0.032
Grade=3 0.007 0.17 0.68
Grade=4 0.003 0.022 0.882
Morphology=Carcinoma in situ, NOS -0.031 3.059 0.08
Morphology=Infiltrating duct carcinoma, NOS -0.018 1.044 0.307
Morphology=Intraductal and lobular in situ carcinoma -0.024 1.856 0.173
Morphology=Lobular carcinoma in situ -0.026 2.214 0.137
Morphology=Mucinous adenocarcinoma -0.029 2.641 0.104
Morphology=Other -0.02 1.246 0.264
Radiation=Radiation 0.174 98.03 0
Size -0.017 0.45 0.502
GLOBAL NA 202.3 0

Given the great number of observations in the data, it’s hard not to have low p-values. Still, we do have significantly non-zero slopes, which indicates some departure from proportional hazards.

My greatest concern regards the predictor Radiation, which has a rho value of 0.174. Age at diagnosis is also a bit concerning, with rho = 0.089.

There are two continuous variables to evaluate visually, size and age at diagnosis. Here’s the plot for size:

This is pretty flat, so the residuals for size don’t appear to drift much over time.

The curve for age at diagnosis shows a bit more curvature, consistent with the numeric output above (rho = 0.089).

In any case, I’m not sure that the proportional hazards assumption is seriously violated. Still, I’ll check whether the survival curves for radiation (the most concerning variable) cross each other:

It appears that the curves cross when time is near its maximum. That makes me think more weight should be given to the earlier (left-hand) portion of the curves when interpreting a log-rank test.

Even so, I remain somewhat worried about a proportional hazards violation.
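
A sketch of the proportional hazards checks described above, again assuming the coxFull object from the earlier sketch:

#test of proportional hazards via scaled Schoenfeld residuals
phTest <- cox.zph(coxFull)
phTest

#residual-versus-time plots for the two continuous predictors
plot(phTest, var = "Size")
plot(phTest, var = "Age.At.Diagnosis")

#check visually whether the radiation survival curves cross
plot(survfit(Surv(timeCPH, censor) ~ Radiation, data = bcfpModTrain),
     col = 1:2, xlab = "Months", ylab = "Proportion free of second cancer")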

Assessing Collinearity

Variance Inflation Factors for the Cox PH Model
Variable VIF
RaceBlack 1.029
RaceOther 1.021
SexMale 1.005
Age.At.Diagnosis 1.057
ordered(Metastasis).L 1.907
ordered(Metastasis).Q 1.797
ordered(Grade).L 4.869
ordered(Grade).Q 4.567
ordered(Grade).C 2.52
MorphologyCarcinoma in situ, NOS 2.281
MorphologyInfiltrating duct carcinoma, NOS 97.39
MorphologyIntraductal and lobular in situ carcinoma 38.84
MorphologyLobular carcinoma in situ 43.48
MorphologyMucinous adenocarcinoma 10.67
MorphologyOther 36.28
RadiationRadiation 1.018
Size 1.125

Again, there is significant collinearity among Morphology categories. I won’t remove this, though, because the predictor was significant and because morphology plays a role in the natural history of cancer as a disease.

In-Sample Validation

Here’s a calibration plot for the model:

## Using Cox survival estimates at 12 Days

The model looks well-calibrated. It performs worse as time progresses.
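
The calibration plot and the validation shown next rely on an rms::cph refit of the model; a sketch, assuming survival estimates are requested at 12 time units as in the output above, and with placeholder object names:

library(rms)

#refit the Cox model with cph, keeping what calibrate() and validate() need
coxCPH <- cph(Surv(timeCPH, censor) ~ Race + Sex + Age.At.Diagnosis + Metastasis +
                Grade + Morphology + Radiation + Size,
              data = bcfpModTrain, x = TRUE, y = TRUE, surv = TRUE, time.inc = 12)

#calibration of predicted versus observed survival at 12 time units
plot(calibrate(coxCPH, u = 12, B = 100))

#100-fold cross-validation of the cph fit
validate(coxCPH, method = "crossvalidation", B = 100)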

Here’s a 100-fold validation:

100-fold validation of the model
  index.orig training test optimism index.corrected n
Dxy 0.218 0.222 0.214 0.009 0.21 100
R2 0.01 0.011 0.01 0.001 0.009 100
Slope 1 1 0.952 0.048 0.952 100
D 0.005 0.005 0.005 0.001 0.004 100
U 0 0 0 0 0 100
Q 0.005 0.005 0.005 0.001 0.004 100
g 0.357 0.366 0.348 0.018 0.339 100

The corrected Dxy decreases by less than five percent of its original value, so that’s not too bad.

Corrected R-squared decreases by more than ten percent, but that might be a rounding issue because it’s a smaller value.

Slope decreases a bit, and might be concerning depending on how the model performs out-of-sample.

Other values don’t change too much.

Out-of-Sample Validation

Here, I’ll make some predictions and see how they line up with actual values. Specifically, I’ll compare the number of second-cancer events predicted by the model with the number actually observed in the new data.

## [1] 348
## [1] 363

There were 348 events predicted by the original model, but 363 events were actually observed. To make a cleaner comparison, it would be better to recalculate survival for the new data so that its follow-up length matches that of the model-fitting data. Instead, I’ll just pluck random events out of the actual data to produce the right number of events.
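
One way to obtain the predicted event count, hedged as a sketch: predict.coxph with type = "expected" returns each subject's expected number of events over their own observed follow-up, so the sums can be compared directly. The names coxFull and newSurvData (a held-out data frame that includes timeCPH and censor) are assumptions.

#expected number of second-cancer events under the fitted model,
#accumulated over each subject's observed follow-up time
expectedEvents <- predict(coxFull, newdata = newSurvData, type = "expected")

sum(expectedEvents)        #model-predicted event count
sum(newSurvData$censor)    #observed event count (censor = 1 marks an event)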


V. Reflection and Lessons Learned

While I did learn a whole lot about data analysis, I think I learned just as much about how to make data usable. SEER provides its data in a widely dispersed format for reasons I can only speculate about, but the R community has found ways around this common roadblock.

Recoding, however, is a less challenging task that requires more time and attention, and I had the opportunity to do a lot of it. Even if it is less challenging for others, I found myself learning how to write and compose functions so that I could effectively manipulate data. For example, I remember realizing that do.call can be thought of as analogous to the apply family, except that do.call doesn’t behave sequentially. Used in combination with list.files, this allowed me to read many files at once, as in the sketch below.
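
As a small illustration of that pattern (the file pattern and reader function are assumptions, not the exact ones I used):

#read every SEER export in the working directory and stack them into one data frame
seerFiles <- list.files(pattern = "\\.txt$", full.names = TRUE)
seerList  <- lapply(seerFiles, read.delim, stringsAsFactors = FALSE)
seer      <- do.call(rbind, seerList)   #one call binds all of the pieces at once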

I also took the time to learn how to use GitHub in RStudio. The first time I used GitHub was around September of 2015, and that little crack in the ice was enough for me to completely slip through during the course of this semester. Whatever the task, it seems, there is fertile ground for learning.

I’ve also been branching out into other languages. JavaScript and D3 are really attractive technologies, but I’ve forgone learning them in favor of buffing up my computer science knowledge using a Python-based version of Structure and Interpretation of Computer Programs. Given my background in Economics and Philosophy, this has required that I re-read the same page several times before I get it. These baby steps, however, are still steps.

Of course, I’ve also spent time learning how to work more effectively in R. As I mentioned in a previous paragraph, I’ve learned how to use some commands effectively. Besides that, I’ve learned some of the theory behind the R environment. I learned a while ago that everything in a Unix system can be considered a file. Similarly, everything that exists in R is an object, and everything that happens is a function call. Learning more about R and abstracting away various layers of user interface to reach these generalities has been very rewarding, even if it just meant learning how long a road lies ahead of me.

Finally, and perhaps most importantly, I’ve learned more about the rigorous application of quantitative methods. Rather than hide behind a p-value, for example, I’m now more inclined to provide more descriptive summaries of a concept. Along with this, I have the chance to explain myself more often, and find myself re-formulating and cementing concepts out of a need to say them out loud. It’s easy to think I’ve learned something after reading it, but it’s honest to think I’ve learned something after I can teach it.

VI. Future Directions

1. Cohort Selection

I am a bit worried about the fact that a lot of the second cancers were breast cancers. This could be an issue if the second cancer is a recurrence of previous cancer, in which case this analysis would have lumped together cancer recurrence and new second cancer.

One solution to this potential issue would be to exclude breast cancer as a second cancer. However, I’m not sure how to confirm that a second cancer is recurrent instead of de novo. There may be a SEER protocol for doing so, but I haven’t been able to find it.

2. Propensity Score and Other Matching Methods

It would be interesting to see whether matching race, sex, age, and other variables would impact the analysis. I imagine that doing so would help identify those other variables (namely, cancer characteristics) that influence development of second cancers.

In the previous models, it looked like radiation is associated with lower odds of developing a second cancer. I think that’s really weird. So I’m going to see if I can successfully do a propensity score analysis where radiation is the predictor of interest, and all other covariates are used to generate propensity scores.

I can’t be confident in the following analysis because I don’t have much experience in this approach, but here’s my attempt at matching observations. Matching methods include (1) propensity score matching and (2) greedy algorithm matching.

It would be a lot cooler to implement some sort of matching on my own, but for now I’ll have to rely on an R package to help me out.

First, I’ll try doing this using the “Matching” package. I’ll print the code that I use for this, unlike previous chunks. I’ll be implementing this for the logistic regression whose outcome is development of a second cancer, looking to see if radiation treatment has the same impact as non-matched logistic regression. I’ll base the propensity score on all covariates.

library(Matching)

bcfpUnmatched <- sample_n(bcfp, 5e4)

#setting seed in case it makes a difference to the command
set.seed(432431)

#retype for ease of use
bcfpUnmatched$RadMat <- with(bcfpUnmatched, ifelse(Radiation == "Radiation", 1, 0))

#set up formals
Tr <- bcfpUnmatched$RadMat

Y <- bcfpUnmatched$MultipleCancerDx

X <- dplyr::select(.data = bcfpUnmatched, Race, 
                             Sex,
                             Age.At.Diagnosis,
                             Metastasis,
                             Morphology,
                             Grade,
                             Size)

X <- data.matrix(X)

#do the match
matchedRadCan <- Match(Y = Y, Tr = Tr, X = X, M = 1, version = "fast")

#see how balance worked out
 MBRadCan   <-    MatchBalance(RadMat ~ 
                                     Race + 
                                     Sex + 
                                     Age.At.Diagnosis + 
                                     Metastasis + 
                                     Morphology + 
                                     Grade + 
                                     Size,
                            data = bcfpUnmatched, 
                            match.out = matchedRadCan, 
                            nboots = 100, 
                            digits = 3)
## 
## ***** (V1) RaceBlack *****
##                        Before Matching        After Matching
## mean treatment........  0.101          0.101 
## mean control..........  0.113          0.101 
## std mean diff.........  -3.93         0.0263 
## 
## mean raw eQQ diff..... 0.0118         3.77e-05 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00591        1.88e-05 
## med  eCDF diff........ 0.00591        1.88e-05 
## max  eCDF diff........ 0.0118         3.77e-05 
## 
## var ratio (Tr/Co).....  0.907              1 
## T-test p-value........ 1.86e-05         0.48 
## 
## 
## ***** (V2) RaceOther *****
##                        Before Matching        After Matching
## mean treatment........ 0.0827         0.0827 
## mean control.......... 0.0953         0.0827 
## std mean diff.........  -4.57         0.0287 
## 
## mean raw eQQ diff..... 0.0126         3.77e-05 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.0063         1.88e-05 
## med  eCDF diff........ 0.0063         1.88e-05 
## max  eCDF diff........ 0.0126         3.77e-05 
## 
## var ratio (Tr/Co).....   0.88              1 
## T-test p-value........ 7.58e-07        0.157 
## 
## 
## ***** (V3) SexMale *****
##                        Before Matching        After Matching
## mean treatment........ 0.00348        0.00348 
## mean control.......... 0.0119         0.00348 
## std mean diff.........  -14.3              0 
## 
## mean raw eQQ diff..... 0.00841             0 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              0 
## 
## mean eCDF diff........ 0.0042              0 
## med  eCDF diff........ 0.0042              0 
## max  eCDF diff........ 0.00841             0 
## 
## var ratio (Tr/Co).....  0.295              1 
## T-test p-value........ <2e-16              1 
## 
## 
## ***** (V4) Age.At.Diagnosis *****
##                        Before Matching        After Matching
## mean treatment........   58.9           58.9 
## mean control..........     61           58.9 
## std mean diff.........  -17.3         -0.557 
## 
## mean raw eQQ diff.....   2.37         0.0436 
## med  raw eQQ diff.....      2              0 
## max  raw eQQ diff.....      7              5 
## 
## mean eCDF diff........ 0.0279         0.000559 
## med  eCDF diff........ 0.00812        0.000283 
## max  eCDF diff........ 0.0907         0.00275 
## 
## var ratio (Tr/Co).....  0.725           1.02 
## T-test p-value........ <2e-16         3.51e-14 
## KS Bootstrap p-value.. <2e-16           0.94 
## KS Naive p-value...... <2e-16          0.988 
## KS Statistic.......... 0.0907         0.00275 
## 
## 
## ***** (V5) MetastasisRegional *****
##                        Before Matching        After Matching
## mean treatment........  0.314          0.314 
## mean control..........  0.316          0.314 
## std mean diff......... -0.337         -0.0256 
## 
## mean raw eQQ diff..... 0.00158        5.65e-05 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.000783       2.83e-05 
## med  eCDF diff........ 0.000783       2.83e-05 
## max  eCDF diff........ 0.00157        5.65e-05 
## 
## var ratio (Tr/Co).....  0.997              1 
## T-test p-value........  0.706           0.18 
## 
## 
## ***** (V6) MetastasisDistant *****
##                        Before Matching        After Matching
## mean treatment........  0.043          0.043 
## mean control..........  0.063         0.0429 
## std mean diff.........  -9.85         0.0585 
## 
## mean raw eQQ diff.....   0.02         5.65e-05 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00999        2.83e-05 
## med  eCDF diff........ 0.00999        2.83e-05 
## max  eCDF diff........   0.02         5.65e-05 
## 
## var ratio (Tr/Co).....  0.697              1 
## T-test p-value........ <2e-16         0.0833 
## 
## 
## ***** (V7) MorphologyCarcinoma in situ, NOS *****
##                        Before Matching        After Matching
## mean treatment........ 0.00281        0.00281 
## mean control.......... 0.00505        0.00249 
## std mean diff.........  -4.24          0.598 
## 
## mean raw eQQ diff..... 0.00226        0.000207 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00112        0.000104 
## med  eCDF diff........ 0.00112        0.000104 
## max  eCDF diff........ 0.00224        0.000207 
## 
## var ratio (Tr/Co).....  0.557           1.13 
## T-test p-value........ 6.22e-05        0.144 
## 
## 
## ***** (V8) MorphologyInfiltrating duct carcinoma, NOS *****
##                        Before Matching        After Matching
## mean treatment........  0.768          0.768 
## mean control..........   0.74           0.77 
## std mean diff.........   6.57         -0.511 
## 
## mean raw eQQ diff..... 0.0277         0.00115 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.0139         0.000575 
## med  eCDF diff........ 0.0139         0.000575 
## max  eCDF diff........ 0.0278         0.00115 
## 
## var ratio (Tr/Co).....  0.927           1.01 
## T-test p-value........ 5.85e-13       4.47e-11 
## 
## 
## ***** (V9) MorphologyIntraductal and lobular in situ carcinoma *****
##                        Before Matching        After Matching
## mean treatment........ 0.0644         0.0644 
## mean control.......... 0.0677         0.0636 
## std mean diff.........  -1.33          0.339 
## 
## mean raw eQQ diff..... 0.00328        0.000415 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00163        0.000207 
## med  eCDF diff........ 0.00163        0.000207 
## max  eCDF diff........ 0.00327        0.000415 
## 
## var ratio (Tr/Co).....  0.955           1.01 
## T-test p-value........  0.142         0.0127 
## 
## 
## ***** (V10) MorphologyLobular carcinoma in situ *****
##                        Before Matching        After Matching
## mean treatment........ 0.0695         0.0695 
## mean control..........  0.081         0.0695 
## std mean diff.........  -4.49         0.0194 
## 
## mean raw eQQ diff..... 0.0114              0 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              0 
## 
## mean eCDF diff........ 0.00571             0 
## med  eCDF diff........ 0.00571             0 
## max  eCDF diff........ 0.0114              0 
## 
## var ratio (Tr/Co).....   0.87              1 
## T-test p-value........ 1.29e-06        0.858 
## 
## 
## ***** (V11) MorphologyMucinous adenocarcinoma *****
##                        Before Matching        After Matching
## mean treatment........ 0.0165         0.0165 
## mean control..........  0.019         0.0154 
## std mean diff.........     -2          0.901 
## 
## mean raw eQQ diff..... 0.00255        0.000679 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00127        0.000339 
## med  eCDF diff........ 0.00127        0.000339 
## max  eCDF diff........ 0.00254        0.000679 
## 
## var ratio (Tr/Co).....  0.869           1.07 
## T-test p-value........ 0.0314         1.53e-05 
## 
## 
## ***** (V12) MorphologyOther *****
##                        Before Matching        After Matching
## mean treatment........ 0.0763         0.0763 
## mean control.......... 0.0839         0.0769 
## std mean diff.........  -2.88         -0.235 
## 
## mean raw eQQ diff..... 0.00768        0.000415 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.00382        0.000207 
## med  eCDF diff........ 0.00382        0.000207 
## max  eCDF diff........ 0.00765        0.000415 
## 
## var ratio (Tr/Co).....  0.916          0.993 
## T-test p-value........ 0.00163        0.00518 
## 
## 
## ***** (V13) GradeGrade 2 *****
##                        Before Matching        After Matching
## mean treatment........  0.425          0.425 
## mean control..........  0.424          0.425 
## std mean diff.........  0.185          -0.04 
## 
## mean raw eQQ diff..... 0.00089        9.42e-05 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.000457       4.71e-05 
## med  eCDF diff........ 0.000457       4.71e-05 
## max  eCDF diff........ 0.000913       9.42e-05 
## 
## var ratio (Tr/Co).....      1              1 
## T-test p-value........  0.836          0.398 
## 
## 
## ***** (V14) GradeGrade 3 *****
##                        Before Matching        After Matching
## mean treatment........  0.333          0.333 
## mean control..........  0.367          0.334 
## std mean diff.........  -7.15         -0.151 
## 
## mean raw eQQ diff..... 0.0337         0.000339 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.0168         0.00017 
## med  eCDF diff........ 0.0168         0.00017 
## max  eCDF diff........ 0.0337         0.000339 
## 
## var ratio (Tr/Co).....  0.957          0.999 
## T-test p-value........ 2.89e-15       0.00146 
## 
## 
## ***** (V15) GradeGrade 4 *****
##                        Before Matching        After Matching
## mean treatment........ 0.00831        0.00831 
## mean control.......... 0.0109         0.00795 
## std mean diff.........  -2.87          0.392 
## 
## mean raw eQQ diff..... 0.00263        0.00017 
## med  raw eQQ diff.....      0              0 
## max  raw eQQ diff.....      1              1 
## 
## mean eCDF diff........ 0.0013         8.48e-05 
## med  eCDF diff........ 0.0013         8.48e-05 
## max  eCDF diff........ 0.00261        0.00017 
## 
## var ratio (Tr/Co).....  0.763           1.04 
## T-test p-value........ 0.00283        0.00665 
## 
## 
## ***** (V16) Size *****
##                        Before Matching        After Matching
## mean treatment........   21.8           21.8 
## mean control..........     24           21.7 
## std mean diff.........  -9.99          0.619 
## 
## mean raw eQQ diff.....   2.32          0.154 
## med  raw eQQ diff.....      2              0 
## max  raw eQQ diff.....    580             51 
## 
## mean eCDF diff........ 0.0122         0.000854 
## med  eCDF diff........ 0.00115        0.000452 
## max  eCDF diff........ 0.0921         0.00899 
## 
## var ratio (Tr/Co).....  0.751           1.05 
## T-test p-value........ <2e-16         4.44e-16 
## KS Bootstrap p-value.. <2e-16         <2e-16 
## KS Naive p-value...... <2e-16         0.0274 
## KS Statistic.......... 0.0921         0.00899 
## 
## 
## Before Matching Minimum p.value: <2e-16 
## Variable Name(s): SexMale Age.At.Diagnosis MetastasisDistant Size  Number(s): 3 4 6 16 
## 
## After Matching Minimum p.value: <2e-16 
## Variable Name(s): Size  Number(s): 16
#in case I want to plot these values...
before <- sapply(1:16, function(x) MBRadCan$BeforeMatching[[x]]$sdiff) %>% abs
after <- sapply(1:16, function(x) MBRadCan$AfterMatching[[x]]$sdiff) %>% abs

library(tidyr)

#combine the before/after absolute standardized differences into one frame
plotFrame <- data.frame(before, after)

#order variables by their pre-matching imbalance
plotFrame <- plotFrame[order(plotFrame$before, decreasing = TRUE),]

#reshape to long format: one row per variable per matching stage
plotFrame <- gather(plotFrame)

#keep the stage label, and replace the key with a numeric variable ID
plotFrame$beforeAfter <- plotFrame$key

plotFrame$key <- c(1:16, 1:16)

ggplot(plotFrame, aes(y = key, x = value, color = beforeAfter)) + 
  geom_point() + 
  ggtitle("Love Plot for All Predictors") + 
  ylab("Variable (or Factor Level) ID") + 
  xlab("Absolute Standardized Difference") + 
  theme(legend.title = element_blank()) 

Looking at the Love plot, the absolute standardized differences between the radiation groups have been reduced substantially by matching. Note that I did not stratify this plot according to treatment with radiation, so it’s perhaps not as good a representation as it could be; it may not strictly qualify as a Love plot.

Next, I’ll rerun the analysis using the matched observations and see what I get.

Here, I’ll show a summary of the matched logistic regression:

  Estimate Std. Error z value Pr(>|z|)
RadiationRadiation -0.315 0.023 -13.71 0
RaceBlack -0.067 0.051 -1.302 0.193
RaceOther -0.185 0.059 -3.138 0.002
SexMale -0.788 0.387 -2.038 0.042
Age.At.Diagnosis 0.015 0.001 14.75 0
ordered(Metastasis).L 0.194 0.053 3.662 0
ordered(Metastasis).Q 0.008 0.035 0.245 0.807
MorphologyCarcinoma in situ, NOS -0.487 0.475 -1.025 0.305
MorphologyInfiltrating duct carcinoma, NOS -0.195 0.304 -0.64 0.522
MorphologyIntraductal and lobular in situ carcinoma 0.145 0.308 0.472 0.637
MorphologyLobular carcinoma in situ 0.061 0.308 0.197 0.844
MorphologyMucinous adenocarcinoma -0.141 0.326 -0.431 0.666
MorphologyOther -0.045 0.308 -0.145 0.884
ordered(Grade).L -0.195 0.127 -1.537 0.124
ordered(Grade).Q 0.094 0.095 0.99 0.322
ordered(Grade).C 0.099 0.046 2.142 0.032
Size 0.001 0.001 1.249 0.212
(Intercept) -2.963 0.315 -9.401 0

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 59029 on 106109 degrees of freedom
Residual deviance: 58334 on 106092 degrees of freedom

I’ll also refit the logistic regression on the unmatched sample that was used to generate the matched dataset. That way, variation in the particular observations sampled won’t interfere with the comparison.

Here’s the unmatched model summary:

  Estimate Std. Error z value Pr(>|z|)
RadiationRadiation -0.241 0.034 -7.191 0
RaceBlack -0.07 0.056 -1.234 0.217
RaceOther -0.094 0.061 -1.529 0.126
SexMale -0.2 0.201 -0.997 0.319
Age.At.Diagnosis 0.011 0.001 8.702 0
ordered(Metastasis).L 0.052 0.055 0.931 0.352
ordered(Metastasis).Q -0.09 0.039 -2.306 0.021
MorphologyCarcinoma in situ, NOS 0.117 0.478 0.245 0.806
MorphologyInfiltrating duct carcinoma, NOS 0.418 0.364 1.147 0.251
MorphologyIntraductal and lobular in situ carcinoma 0.839 0.368 2.278 0.023
MorphologyLobular carcinoma in situ 0.783 0.368 2.13 0.033
MorphologyMucinous adenocarcinoma 0.534 0.383 1.397 0.163
MorphologyOther 0.579 0.368 1.574 0.116
ordered(Grade).L 0.009 0.111 0.08 0.936
ordered(Grade).Q 0.152 0.085 1.802 0.072
ordered(Grade).C 0.129 0.045 2.859 0.004
Size 0 0.001 -0.023 0.982
(Intercept) -3.394 0.376 -9.032 0

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 27896 on 49999 degrees of freedom
Residual deviance: 27586 on 49982 degrees of freedom

And here are the odds ratios for each of the models. First, I’ll display the odds ratios for the matched model:

## function(model) {
##   cbind(
##     exp(cbind(
##       OR = coef(model), 
##       confint(model))),
##     P = summary(model)$coefficients[,4])
## }

Waiting for profiling to be done…

Odds Ratios for Matched Data
  OR 2.5 % 97.5 % P
(Intercept) 0.052 0.026 0.092 0
RadiationRadiation 0.73 0.698 0.764 0
RaceBlack 0.935 0.845 1.033 0.193
RaceOther 0.832 0.74 0.932 0.002
SexMale 0.455 0.193 0.899 0.042
Age.At.Diagnosis 1.015 1.013 1.017 0
ordered(Metastasis).L 1.214 1.093 1.345 0
ordered(Metastasis).Q 1.009 0.942 1.079 0.807
MorphologyCarcinoma in situ, NOS 0.615 0.233 1.541 0.305
MorphologyInfiltrating duct carcinoma, NOS 0.823 0.473 1.576 0.522
MorphologyIntraductal and lobular in situ carcinoma 1.157 0.659 2.23 0.637
MorphologyLobular carcinoma in situ 1.063 0.606 2.047 0.844
MorphologyMucinous adenocarcinoma 0.869 0.475 1.725 0.666
MorphologyOther 0.956 0.545 1.843 0.884
ordered(Grade).L 0.823 0.633 1.043 0.124
ordered(Grade).Q 1.099 0.903 1.314 0.322
ordered(Grade).C 1.104 1.005 1.205 0.032
Size 1.001 0.999 1.002 0.212

And here are the odds ratios for the unmatched model:

Waiting for profiling to be done…

Odds Ratios for the Unmatched Data
  OR 2.5 % 97.5 % P
(Intercept) 0.034 0.015 0.066 0
RadiationRadiation 0.786 0.736 0.839 0
RaceBlack 0.933 0.834 1.041 0.217
RaceOther 0.91 0.806 1.025 0.126
SexMale 0.818 0.539 1.19 0.319
Age.At.Diagnosis 1.011 1.008 1.013 0
ordered(Metastasis).L 1.053 0.943 1.172 0.352
ordered(Metastasis).Q 0.914 0.847 0.986 0.021
MorphologyCarcinoma in situ, NOS 1.124 0.443 2.976 0.806
MorphologyInfiltrating duct carcinoma, NOS 1.519 0.795 3.379 0.251
MorphologyIntraductal and lobular in situ carcinoma 2.313 1.2 5.177 0.023
MorphologyLobular carcinoma in situ 2.188 1.136 4.894 0.033
MorphologyMucinous adenocarcinoma 1.706 0.855 3.907 0.163
MorphologyOther 1.785 0.926 3.995 0.116
ordered(Grade).L 1.009 0.804 1.245 0.936
ordered(Grade).Q 1.165 0.98 1.367 0.072
ordered(Grade).C 1.138 1.039 1.241 0.004
Size 1 0.998 1.001 0.982

Comparing the odds ratios of the matched and unmatched models, the association between radiation treatment and reduced odds of developing a second cancer persists after matching. Although I understand that radiation exposure can cause cancer, the therapeutic use of radiation here appears to be associated with decreased odds of a new cancer.

The next step I would take is to remove observations of breast cancer as a second cancer to remove the chance that we observe recurring cancer.

It would be interesting to see how this implementation of propensity score matching compares to other implementations and other matching approaches. There’s a “genetic algorithm” approach in the Matching package, that could be fun.
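
For comparison, here is a sketch of matching on an estimated propensity score rather than on the raw covariates, still using the Matching package; the object names are assumptions:

library(Matching)

#logistic model for the probability of receiving radiation, given the covariates
psModel <- glm(RadMat ~ Race + Sex + Age.At.Diagnosis + Metastasis +
                 Morphology + Grade + Size,
               family = binomial, data = bcfpUnmatched)

#1:1 matching on the fitted propensity score
matchedPS <- Match(Y = as.numeric(bcfpUnmatched$MultipleCancerDx == "Second Cancer"),
                   Tr = bcfpUnmatched$RadMat,
                   X = psModel$fitted.values,
                   M = 1)

summary(matchedPS)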

3. Better model evaluation

It would be pretty cool to implement n-fold validation on the new data, given that there’s so much of it. This might provide a better estimate of model performance.
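
A sketch of what an n-fold check might look like for the logistic model, using only base R and pROC on the full dataset; the fold count and object names are assumptions:

library(pROC)

set.seed(432431)

#complete cases on the model variables, so every fold can be scored
vars <- c("MultipleCancerDx", "Race", "Sex", "Age.At.Diagnosis", "Metastasis",
          "Grade", "Morphology", "Radiation", "Size")
dat  <- na.omit(bcfp[, vars])

k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))

#fit on k - 1 folds, compute the AUC on the held-out fold, then average
cvAUC <- sapply(1:k, function(i) {
  fit <- glm(MultipleCancerDx ~ Race + Sex + Age.At.Diagnosis +
               ordered(Metastasis) + ordered(Grade) + Morphology +
               Radiation + Size,
             family = binomial, data = dat[folds != i, ])
  heldOut <- dat[folds == i, ]
  as.numeric(auc(roc(heldOut$MultipleCancerDx,
                     predict(fit, newdata = heldOut, type = "response"))))
})

mean(cvAUC)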

4. Evaluating competing risks

I want to learn more about survival analysis and longitudinal data analysis. I appreciated our discussion in class, but there’s just so much more for me to learn in so many different areas. I’m still figuring out where I should start.

5. Better understanding what I’m doing

I think that, at this point in my career, I know enough about statistics to be dangerous. To keep this danger in check, I’ll be consulting with professionals to make sure it’s safe when I try this at home.

However, I would like to reach a point where I can learn an arbitrary new technique and confidently present results using that technique.


Appendix I: Population Distributions

Here are the numbers of observations falling into each category of categorical variables:

Overall Characteristics of the Entire Dataset
  Overall
n 369962
MultipleCancerDx = Second Cancer (%) 29660 ( 8.0)
Race (%)
White 297665 ( 80.5)
Black 39008 ( 10.5)
Other 33289 ( 9.0)
Sex = Male (%) 2703 ( 0.7)
Metastasis (%)
Localized 234136 ( 63.3)
Regional 116968 ( 31.6)
Distant 18858 ( 5.1)
Morphology (%)
Adenocarcinoma in situ 1181 ( 0.3)
Carcinoma in situ, NOS 1607 ( 0.4)
Infiltrating duct carcinoma, NOS 279158 ( 75.5)
Intraductal and lobular in situ carcinoma 24439 ( 6.6)
Lobular carcinoma in situ 27241 ( 7.4)
Mucinous adenocarcinoma 7040 ( 1.9)
Other 29296 ( 7.9)
Grade (%)
Grade 1 80170 ( 21.7)
Grade 2 157063 ( 42.5)
Grade 3 129104 ( 34.9)
Grade 4 3625 ( 1.0)
Behavior.For.Analysis (%)
Benign 0 ( 0.0)
Borderline Malignant 0 ( 0.0)
In situ 0 ( 0.0)
Malignant 369962 (100.0)
No longer reportable in ICDO3 0 ( 0.0)
Only malignant 2010+ 0 ( 0.0)
Only malignant in ICDO3 0 ( 0.0)
Survival.Flag (%)
Date Last Contact = Date Dx 344 ( 0.1)
Date Last Contact > Date Dx 334717 ( 90.5)
Incomplete dates, maybe zero FU days 474 ( 0.1)
Incomplete dates, not zero FU days 34427 ( 9.3)
Unknown 0 ( 0.0)
Survival.PA.Flag (%)
Date Last Contact = Date Dx 91 ( 0.0)
Date Last Contact > Date Dx 335170 ( 90.6)
Incomplete dates, maybe zero FU days 404 ( 0.1)
Incomplete dates, not zero FU days 34297 ( 9.3)
Unknown 0 ( 0.0)
Death.Due.To.Cancer (%)
Alive or Dead due to cancer 349553 ( 94.5)
Dead 20409 ( 5.5)
NA - not first tumor 0 ( 0.0)
Death.Not.Due.To.Cancer (%)
Alive or Dead of other cause 341638 ( 92.3)
Dead 28324 ( 7.7)
NA - not first tumor 0 ( 0.0)
Radiation = Radiation (%) 186972 ( 50.5)