Instructions: Write your code in the code “chunks” following each
question. Then compile the html file in RStudio using “Knit HTML”.
Submit your R Markdown file (rename to something like
“YOURNAME_hwR.Rmd”) and the HTML file.
Edit this R Markdown file directly. Add code to code chunks and text
answers to text sections.
Check out examples from class and R function help pages. Break a problem down into steps. Google each step! You don’t have to answer the question in one line of code.
Email your instructor with questions ahead of the due date. Send your code, which question you are working on, and a screenshot of the error.
Please do not print really long outputs to the html.
Load needed packages. Uncomment the install lines if you need to install the packages. Then recomment before knitting.
Read in the data. Do not modify this code chunk!!
Notice that the csv is imported and saved to dataframe called brcaClin
The table has 1003 rows and 15 columns.
The columns are PATIENT.ID (character), Diagnosis.Age (integer), Person.Gender (character), Ethnicity.Category (character), Race.Category (character), Year.Cancer.Initial.Diagnosis (integer), CNA (numeric), Mutation.Count (integer), Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code (character), Overall.Survival..Months. (numeric), Overall.Survival.Status (character), Disease.Free..Months. (numeric), Disease.Free.Status (character), Metastatic.Site (character), Tumor.Other.Histologic.Subtype (character)
## [1] 1003 15
## 'data.frame': 1003 obs. of 15 variables:
## $ PATIENT.ID : chr "TCGA-5T-A9QA" "TCGA-A1-A0SB" "TCGA-A1-A0SD" "TCGA-A1-A0SE" ...
## $ Diagnosis.Age : int 52 70 59 56 54 61 39 52 39 54 ...
## $ Person.Gender : chr "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
## $ Ethnicity.Category : chr "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" ...
## $ Race.Category : chr "BLACK OR AFRICAN AMERICAN" "WHITE" "WHITE" "WHITE" ...
## $ Year.Cancer.Initial.Diagnosis : int 2013 2008 2005 2005 2006 2006 2006 2007 2006 2007 ...
## $ CNA : num 0.413 0.000802 0.247 0.213 0.202 ...
## $ Mutation.Count : int NA 16 23 18 32 27 70 139 25 89 ...
## $ Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code: chr "Stage IB" "Stage I" "Stage IIA" "Stage I" ...
## $ Overall.Survival..Months. : num 0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 31.8 ...
## $ Overall.Survival.Status : chr "LIVING" "LIVING" "LIVING" "LIVING" ...
## $ Disease.Free..Months. : num 0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 NA ...
## $ Disease.Free.Status : chr "DiseaseFree" "DiseaseFree" "DiseaseFree" "DiseaseFree" ...
## $ Metastatic.Site : chr "" "" "" "" ...
## $ Tumor.Other.Histologic.Subtype : chr "" "Adenoid Cystic Carcinoma" "" "invasive ductal and lobular carcinoma" ...
68 subjects experienced recurrence/progression. 216 subjects are deceased.
brcaClin <- read.csv("TCGA_BRCA_Clinical_HW.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE, na.strings = c("NA","","[Not Evaluated]","[Unknown]"))
disease.free.status.vector <- brcaClin$Disease.Free.Status
disease.free.summary <- summary(disease.free.status.vector)
kable(disease.free.summary, caption = "Disease-Free Status")

| | x |
|---|---|
| DiseaseFree | 769 |
| Recurred/Progressed | 68 |
| NAs | 166 |
overall.survival.status.vector <- brcaClin$Overall.Survival.Status
survival.status.summary <- summary(overall.survival.status.vector)
kable(survival.status.summary, caption = "Overall Survival Status")

| | x |
|---|---|
| DECEASED | 216 |
| LIVING | 787 |
Metastatic.Site and identify how
many subjects had bone metastasis using grep or subset to
find all the entries that have ‘bone’ OR ‘Bone’. What were the
corresponding patient IDs for these patients?

13 patients had some kind of bone metastasis.
Patient IDs were: TCGA-A2-A04V TCGA-A2-A0SV TCGA-AO-A0J5 TCGA-AR-A0TY TCGA-AR-A0TZ TCGA-AR-A0U2 TCGA-AR-A2LH TCGA-B6-A0I9 TCGA-BH-A18J TCGA-BH-A1FH TCGA-EW-A1P8 TCGA-GM-A2D9 TCGA-GM-A2DA.
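The question asks for entries containing ‘bone’ OR ‘Bone’. A minimal sketch of one way to match both spellings in a single call is shown below; the `sites` and `ids` vectors are made-up toy values, not rows from the TCGA table.

```r
# Toy data (hypothetical values, not taken from brcaClin)
sites <- c("Bone", "Lung|bone|Liver", "Liver", NA, "Other, specify")
ids   <- c("P1", "P2", "P3", "P4", "P5")

# ignore.case = TRUE matches "bone" and "Bone" alike.
# grepl() returns FALSE for NA, so missing entries are never counted.
has_bone <- grepl("bone", sites, ignore.case = TRUE)

sum(has_bone)   # number of matching entries
ids[has_bone]   # IDs of the matching entries
```

Using `grepl` rather than `grep` gives a logical vector of the same length as the input, which can be passed directly to `subset` or used for indexing.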
metastatic.site.vector <- brcaClin$Metastatic.Site
metastatic.site.summary <- summary(metastatic.site.vector)
kable(metastatic.site.summary, caption = "Metastatic Site")

| | x |
|---|---|
| Bone | 9 |
| Bone\|Liver | 2 |
| Liver | 2 |
| Lung | 1 |
| Lung\|Bone\|Liver | 1 |
| Lung\|Bone\|Liver\|Other, specify | 1 |
| Lung\|Liver | 1 |
| Lung\|Other, specify | 1 |
| Other, specify | 5 |
| NAs | 980 |
# ignore.case = TRUE matches both 'bone' and 'Bone', as the question asks
metastatic.site.bone <- subset(brcaClin, grepl("bone", brcaClin$Metastatic.Site, ignore.case = TRUE), select = c("PATIENT.ID", "Metastatic.Site"))
metastatic.site.bone.IDs <- subset(brcaClin, grepl("bone", brcaClin$Metastatic.Site, ignore.case = TRUE), select = c("PATIENT.ID"))
metastatic.site.bone.list <- metastatic.site.bone$PATIENT.ID
length(metastatic.site.bone.list)
## [1] 13
| | PATIENT.ID |
|---|---|
| 22 | TCGA-A2-A04V |
| 61 | TCGA-A2-A0SV |
| 321 | TCGA-AO-A0J5 |
| 374 | TCGA-AR-A0TY |
| 375 | TCGA-AR-A0TZ |
| 377 | TCGA-AR-A0U2 |
| 421 | TCGA-AR-A2LH |
| 438 | TCGA-B6-A0I9 |
| 579 | TCGA-BH-A18J |
| 613 | TCGA-BH-A1FH |
| 924 | TCGA-EW-A1P8 |
| 943 | TCGA-GM-A2D9 |
| 944 | TCGA-GM-A2DA |
| | PATIENT.ID | Metastatic.Site |
|---|---|---|
| 22 | TCGA-A2-A04V | Bone |
| 61 | TCGA-A2-A0SV | Bone |
| 321 | TCGA-AO-A0J5 | Lung\|Bone\|Liver |
| 374 | TCGA-AR-A0TY | Bone |
| 375 | TCGA-AR-A0TZ | Bone |
| 377 | TCGA-AR-A0U2 | Bone\|Liver |
| 421 | TCGA-AR-A2LH | Bone\|Liver |
| 438 | TCGA-B6-A0I9 | Bone |
| 579 | TCGA-BH-A18J | Bone |
| 613 | TCGA-BH-A1FH | Bone |
| 924 | TCGA-EW-A1P8 | Lung\|Bone\|Liver\|Other, specify |
| 943 | TCGA-GM-A2D9 | Bone |
| 944 | TCGA-GM-A2DA | Bone |
There are 28 different centers in this data. 5T: 1 subject, A1: 14 subjects, A2: 102 subjects, A7: 39 subjects, A8: 84 subjects, AC: 29 subjects, AN: 39 subjects, AO: 47 subjects, AQ: 9 subjects, AR: 68 subjects, B6: 54 subjects, BH: 145 subjects, C8: 45 subjects, D8: 75 subjects, E2: 91 subjects, E9: 60 subjects, EW: 38 subjects, GI: 2 subjects, GM: 21 subjects, JL: 2 subjects, LD: 3 subjects, LL: 11 subjects, LQ: 1 subject, MS: 1 subject, OK: 1 subject, OL: 19 subjects, S3: 1 subject, W8: 1 subject.
?substring
TCGA.center.site <- substring(brcaClin$PATIENT.ID, 6, 7)
brcaClin<-cbind(TCGA.center.site,brcaClin)
head_brca <- head(brcaClin)
kable(head_brca, caption = "brcaClin Head")

| TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Race.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5T | TCGA-5T-A9QA | 52 | FEMALE | NOT HISPANIC OR LATINO | BLACK OR AFRICAN AMERICAN | 2013 | 0.4130000 | NA | Stage IB | 0.39 | LIVING | 0.39 | DiseaseFree | NA | NA |
| A1 | TCGA-A1-A0SB | 70 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2008 | 0.0008017 | 16 | Stage I | 8.51 | LIVING | 8.51 | DiseaseFree | NA | Adenoid Cystic Carcinoma |
| A1 | TCGA-A1-A0SD | 59 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2005 | 0.2470000 | 23 | Stage IIA | 14.40 | LIVING | 14.40 | DiseaseFree | NA | NA |
| A1 | TCGA-A1-A0SE | 56 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2005 | 0.2130000 | 18 | Stage I | 43.40 | LIVING | 43.40 | DiseaseFree | NA | invasive ductal and lobular carcinoma |
| A1 | TCGA-A1-A0SF | 54 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2006 | 0.2020000 | 32 | Stage IIA | 48.10 | LIVING | 48.10 | DiseaseFree | NA | NA |
| A1 | TCGA-A1-A0SG | 61 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2006 | 0.0969000 | 27 | Stage IIB | 14.30 | LIVING | 14.30 | DiseaseFree | NA | micropapillary carcinoma invasive |
## [1] 28
| TCGA.center.site | Freq |
|---|---|
| 5T | 1 |
| A1 | 14 |
| A2 | 102 |
| A7 | 39 |
| A8 | 84 |
| AC | 29 |
| AN | 39 |
| AO | 47 |
| AQ | 9 |
| AR | 68 |
| B6 | 54 |
| BH | 145 |
| C8 | 45 |
| D8 | 75 |
| E2 | 91 |
| E9 | 60 |
| EW | 38 |
| GI | 2 |
| GM | 21 |
| JL | 2 |
| LD | 3 |
| LL | 11 |
| LQ | 1 |
| MS | 1 |
| OK | 1 |
| OL | 19 |
| S3 | 1 |
| W8 | 1 |
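The center counts above come from characters 6 and 7 of each patient ID. As a minimal sketch of that extract-then-tabulate step, the example below uses four IDs that appear elsewhere in this report; the variable names `ids`, `center`, and `center_counts` are chosen here for illustration only.

```r
# Four patient IDs that appear in the tables of this report
ids <- c("TCGA-5T-A9QA", "TCGA-A1-A0SB", "TCGA-A1-A0SD", "TCGA-BH-A18J")

# Characters 6 and 7 hold the two-character center code
center <- substring(ids, 6, 7)

# table() counts subjects per center; its length is the number of centers
center_counts <- table(center)
length(center_counts)
center_counts[["A1"]]
```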
nchar to find the length of each variable (column)
name in the data set. Use summary and hist to
explore the distribution of these lengths. Which variable name is the
longest? The shortest?

The longest variable name is “Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code”, which has 62 characters, and the shortest is “CNA”, with 3 characters.
## [1] 16 10 13 13 18 13 29 3 14 62 25 23 21 19 15 30
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 13.00 17.00 20.25 23.50 62.00
hist(brcaClin_char, main = "Histogram of Variable Name Lengths", xlab="variable length", xlim=c(0,70),ylim=c(0,8))
brcaClin_char_df <- as.data.frame(brcaClin_char)
brcaClin_char_df$Variable_Name <- colnames(brcaClin)
brcaclinchar_summary <- summary(brcaClin_char_df)
head_brcaClin_df <- head(brcaClin_char_df)
kable(brcaclinchar_summary, caption = "Character Summary")

| | brcaClin_char | Variable_Name |
|---|---|---|
| | Min. : 3.00 | Length :16 |
| | 1st Qu.:13.00 | N.unique :16 |
| | Median :17.00 | N.blank : 0 |
| | Mean :20.25 | Min.nchar: 3 |
| | 3rd Qu.:23.50 | Max.nchar:62 |
| | Max. :62.00 | NA |
| brcaClin_char | Variable_Name |
|---|---|
| 16 | TCGA.center.site |
| 10 | PATIENT.ID |
| 13 | Diagnosis.Age |
| 13 | Person.Gender |
| 18 | Ethnicity.Category |
| 13 | Race.Category |
| 29 | Year.Cancer.Initial.Diagnosis |
| 3 | CNA |
| 14 | Mutation.Count |
| 62 | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code |
| 25 | Overall.Survival..Months. |
| 23 | Overall.Survival.Status |
| 21 | Disease.Free..Months. |
| 19 | Disease.Free.Status |
| 15 | Metastatic.Site |
| 30 | Tumor.Other.Histologic.Subtype |
## [1] 3
## [1] 62
## [1] "CNA"
## [1] "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code"
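The code that produced the four console lines above is not echoed in this report. One possible way to get the same kind of answer, shown here as a sketch on a hand-picked subset of the column names (not the full data frame), is to name the `nchar` result and index it with `which.max` and `which.min`.

```r
# A small subset of the column names, typed by hand for illustration
var_names <- c("PATIENT.ID", "Diagnosis.Age", "CNA", "Mutation.Count")

# Length of each name, labelled with the name itself
name_lengths <- nchar(var_names)
names(name_lengths) <- var_names

# which.max()/which.min() return the position of the first extreme value
longest  <- names(which.max(name_lengths))
shortest <- names(which.min(name_lengths))
longest
shortest
```

With the real table, `var_names` would be replaced by `colnames(brcaClin)`.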
Tumor.Other.Histologic.Subtype? Use grep to
find all the entries that have ‘ductal’ OR ‘Ductal’ in them. How many
subjects have this entry?

There are 59 unique entries in Tumor.Other.Histologic.Subtype. 34 patients have ductal or ductal-related histologic types.
## [1] 1003
## [1] 59
tumor_ductal_IDs <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype),select = c("PATIENT.ID", "Tumor.Other.Histologic.Subtype"))
tumor_ductal <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype), select = c("Tumor.Other.Histologic.Subtype"))
length(tumor_ductal_IDs$PATIENT.ID)
## [1] 34
The subjects with missing survival time (NA) are found at the bottom of the table.
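The missing values land at the bottom because `order` places `NA` last by default (`na.last = TRUE`). A minimal sketch on a toy data frame (the `surv` table and its values are hypothetical) shows the default and the alternative of dropping the missing rows:

```r
# Toy survival table with one missing time (hypothetical values)
surv <- data.frame(id = c("a", "b", "c", "d"),
                   months = c(14.4, NA, 0.39, 8.51))

# Default: ascending order, rows with NA placed last
ordered <- surv[order(surv$months), ]
as.character(ordered$id)

# na.last = NA drops rows with a missing sort key instead
dropped <- surv[order(surv$months, na.last = NA), ]
nrow(dropped)
```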
brcaClin_ordered <- brcaClin[order(brcaClin$Overall.Survival..Months., decreasing = FALSE),]
is.data.frame(brcaClin_ordered)
## [1] TRUE
brcaClin_ordered_head <- head(brcaClin_ordered)
brcaClin_ordered_tail <- tail(brcaClin_ordered)
kable(brcaClin_ordered_head, caption = "brcaClin Ordered Head")

| | TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Race.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | A8 | TCGA-A8-A06N | 66 | FEMALE | NA | NA | 2009 | 5.58e-01 | 25 | Stage IIIB | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| 185 | A8 | TCGA-A8-A081 | 80 | FEMALE | NA | NA | 2008 | 6.45e-01 | 67 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| 187 | A8 | TCGA-A8-A083 | 67 | FEMALE | NA | NA | 2008 | 1.82e-01 | 29 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| 196 | A8 | TCGA-A8-A08H | 66 | FEMALE | NA | NA | 2007 | 2.19e-05 | 46 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| 207 | A8 | TCGA-A8-A090 | 74 | FEMALE | NA | NA | 2008 | 2.86e-01 | 26 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| 211 | A8 | TCGA-A8-A094 | 75 | FEMALE | NA | NA | 2009 | 4.87e-01 | 138 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA |
| | TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Race.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 874 | E9 | TCGA-E9-A226 | 45 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2011 | 0.2560000 | 22 | Stage IIIA | NA | DECEASED | 25.9 | Recurred/Progressed | NA | NA |
| 884 | E9 | TCGA-E9-A243 | 52 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2011 | 0.1010000 | 49 | Stage IIA | NA | DECEASED | 10.0 | Recurred/Progressed | NA | NA |
| 886 | E9 | TCGA-E9-A245 | 47 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2011 | 0.0345000 | 13 | Stage IIB | NA | DECEASED | NA | NA | NA | NA |
| 917 | EW | TCGA-EW-A1P0 | 55 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 2008 | 0.2440000 | 21 | Stage IIB | NA | DECEASED | 35.9 | Recurred/Progressed | NA | IDC+ mucinous carcinoma |
| 918 | EW | TCGA-EW-A1P1 | 68 | FEMALE | HISPANIC OR LATINO | WHITE | 2008 | 0.0008294 | 1 | Stage IIIC | NA | DECEASED | 33.3 | Recurred/Progressed | Lung | NA |
| 944 | GM | TCGA-GM-A2DA | 46 | FEMALE | NOT HISPANIC OR LATINO | WHITE | 1993 | 0.1410000 | 21 | Stage IIB | NA | DECEASED | NA | NA | Bone | NA |
Problems 8 through 12 are best done using ggplot2
Comment here about what this plot tells us and how it can be useful:
This plot shows copy number alterations (CNA) by race and by age at diagnosis. Each box shows the median, the upper and lower quartiles, the maximum and minimum, and any outliers, so the plot makes it quick to compare several variables at once: race, CNA, and age category (by decade). Among Asian patients, CNA tends to rise with increasing age at diagnosis, but no such trend appears among Black/African American or White patients. Overall, White patients have lower median CNA values than Asian and Black/African American patients at any age of diagnosis, although the outliers in the White group include the highest CNA values of all three races. Boxplots are a useful tool for revealing a lot about a dataset at a glance.
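The medians and quartiles read off a boxplot can also be checked numerically. The sketch below does this on a toy data frame; `toy`, `group`, and the CNA values are hypothetical and only illustrate the mechanics of `tapply`.

```r
# Toy data: two groups with three CNA values each (hypothetical)
toy <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
                  cna   = c(0.1, 0.2, 0.9, 0.3, 0.4, 0.5))

# Median CNA per group, the centre line of each box
medians <- tapply(toy$cna, toy$group, median)
medians

# Tukey five-number summary per group (min, hinges, median, max)
five_num <- tapply(toy$cna, toy$group, fivenum)
five_num[["A"]]
```

For the real data, the grouping argument could be a list of two factors, such as race and age category, to get one median per panel and box.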
brcaClin_subset_race <- subset(brcaClin_ordered, Race.Category %in% c("ASIAN", "BLACK OR AFRICAN AMERICAN", "WHITE"))
brcaClin_subset_race$Race.Category2 <- factor(brcaClin_subset_race$Race.Category)
race_subset_table <- table(brcaClin_subset_race$Race.Category2)
brcaClin_subset_race$Race.Category <- NULL
kable(race_subset_table, caption = "brcaClin Race Subset")

| Var1 | Freq |
|---|---|
| ASIAN | 57 |
| BLACK OR AFRICAN AMERICAN | 129 |
| WHITE | 721 |
brcaClin_subset_race$Diag.Age.Cat <- cut(brcaClin_subset_race$Diagnosis.Age, breaks = seq(10, 100, by = 10))
brcaClin_subset_race$Diag.Age.Cat <- factor(brcaClin_subset_race$Diag.Age.Cat)
diag_table <- table(brcaClin_subset_race$Diag.Age.Cat)
head_diag.age.cat <- head(brcaClin_subset_race)
kable(head_diag.age.cat, caption = "brcaClin Diagnosis Age Category Head")

| | TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype | Race.Category2 | Diag.Age.Cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 577 | BH | TCGA-BH-A18H | 63 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.263000 | 24 | Stage IA | 0 | LIVING | 0 | DiseaseFree | NA | NA | WHITE | (60,70] |
| 639 | C8 | TCGA-C8-A12T | 43 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.231000 | 120 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 644 | C8 | TCGA-C8-A12Y | 44 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.233000 | 22 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 645 | C8 | TCGA-C8-A12Z | 45 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.270000 | 37 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 649 | C8 | TCGA-C8-A133 | 65 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.005322 | 17 | Stage IIIA | 0 | LIVING | 0 | DiseaseFree | NA | ductal/mucinous | WHITE | (60,70] |
| 663 | C8 | TCGA-C8-A1HO | 34 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.320000 | 16 | Stage IIIC | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (30,40] |
| Var1 | Freq |
|---|---|
| (20,30] | 9 |
| (30,40] | 78 |
| (40,50] | 201 |
| (50,60] | 234 |
| (60,70] | 226 |
| (70,80] | 122 |
| (80,90] | 37 |
ggplot(brcaClin_subset_race, aes(x=Diag.Age.Cat, y=CNA, fill=Diag.Age.Cat)) +
geom_boxplot()+ theme_bw() +
facet_wrap(~Race.Category2) +
scale_fill_brewer(palette = "Set3")

Then make a new column in the table of log mutation count. Remember you cannot log a 0 and you will need to add some small number. If there are 0 values in the mutation count, try something like log(Mutation+0.25)
Then produce a scatterplot of CNA (x variable) by log mutation count
(y variable) with points color coded by stage of disease. Use
geom_smooth to draw a smoothed line depicting the
relationship between CNA and log mutation count.
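To illustrate why the small offset matters, here is a minimal sketch on a toy vector; `mutation_count` and its values are hypothetical, and 0.25 is the offset suggested in the question.

```r
# Toy mutation counts including a zero and a missing value (hypothetical)
mutation_count <- c(0, 16, 24, NA)

# log(0) is -Inf, which a scatterplot cannot place on a finite axis
log_plain <- log(mutation_count)
log_plain[1]

# Adding a small constant keeps every observed count finite;
# NA stays NA and is simply not plotted
log_offset <- log(mutation_count + 0.25)
is.finite(log_offset)
```

The size of the offset changes where the zero counts sit relative to the other points, so it is worth stating which value was used when reporting the plot.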
brcaClin_head_race <- head(brcaClin_subset_race)
kable(brcaClin_head_race, caption = "brcaClin Head")

| | TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype | Race.Category2 | Diag.Age.Cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 577 | BH | TCGA-BH-A18H | 63 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.263000 | 24 | Stage IA | 0 | LIVING | 0 | DiseaseFree | NA | NA | WHITE | (60,70] |
| 639 | C8 | TCGA-C8-A12T | 43 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.231000 | 120 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 644 | C8 | TCGA-C8-A12Y | 44 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.233000 | 22 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 645 | C8 | TCGA-C8-A12Z | 45 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.270000 | 37 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] |
| 649 | C8 | TCGA-C8-A133 | 65 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.005322 | 17 | Stage IIIA | 0 | LIVING | 0 | DiseaseFree | NA | ductal/mucinous | WHITE | (60,70] |
| 663 | C8 | TCGA-C8-A1HO | 34 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.320000 | 16 | Stage IIIC | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (30,40] |
brcaClin_subset_race$cancer.stage <- gsub(pattern = "[ABC]", replacement = "",
x = brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
stage.table <- table(brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
kable(stage.table, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code")

| Var1 | Freq |
|---|---|
| Stage I | 76 |
| Stage IA | 77 |
| Stage IB | 11 |
| Stage II | 8 |
| Stage IIA | 305 |
| Stage IIB | 209 |
| Stage III | 2 |
| Stage IIIA | 128 |
| Stage IIIB | 19 |
| Stage IIIC | 47 |
| Stage IV | 11 |
| Stage Tis | 1 |
| Stage X | 12 |
no.abc <- table(brcaClin_subset_race$cancer.stage)
kable(no.abc, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Condensed")

| Var1 | Freq |
|---|---|
| Stage I | 164 |
| Stage II | 522 |
| Stage III | 196 |
| Stage IV | 11 |
| Stage Tis | 1 |
| Stage X | 12 |
## [1] "Stage I" "Stage II" "Stage III" "Stage IV"
brcaClin_subset_race$cancer.stage2 <- factor(brcaClin_subset_race$cancer.stage, levels = c("Stage I", "Stage II", "Stage III", "Stage IV"))
brcaClin_subset_race<-subset(brcaClin_subset_race, !is.na(brcaClin_subset_race$cancer.stage2))
table.cancer.stage2 <- table(brcaClin_subset_race$cancer.stage2)
kable(table.cancer.stage2, caption = "Cancer Stages I-IV")

| Var1 | Freq |
|---|---|
| Stage I | 164 |
| Stage II | 522 |
| Stage III | 196 |
| Stage IV | 11 |
brcaClin_subset_race$cancer.stage <- NULL
brcaClin_subset_race$logmc <- log((brcaClin_subset_race$Mutation.Count)+.000001)
head_logmc <- head(brcaClin_subset_race)
kable(head_logmc, caption = "brcaClin Head LogMC")

| | TCGA.center.site | PATIENT.ID | Diagnosis.Age | Person.Gender | Ethnicity.Category | Year.Cancer.Initial.Diagnosis | CNA | Mutation.Count | Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code | Overall.Survival..Months. | Overall.Survival.Status | Disease.Free..Months. | Disease.Free.Status | Metastatic.Site | Tumor.Other.Histologic.Subtype | Race.Category2 | Diag.Age.Cat | cancer.stage2 | logmc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 577 | BH | TCGA-BH-A18H | 63 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.263000 | 24 | Stage IA | 0 | LIVING | 0 | DiseaseFree | NA | NA | WHITE | (60,70] | Stage I | 3.178054 |
| 639 | C8 | TCGA-C8-A12T | 43 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.231000 | 120 | Stage IIA | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] | Stage II | 4.787492 |
| 644 | C8 | TCGA-C8-A12Y | 44 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.233000 | 22 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] | Stage II | 3.091042 |
| 645 | C8 | TCGA-C8-A12Z | 45 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.270000 | 37 | Stage IIB | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (40,50] | Stage II | 3.610918 |
| 649 | C8 | TCGA-C8-A133 | 65 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.005322 | 17 | Stage IIIA | 0 | LIVING | 0 | DiseaseFree | NA | ductal/mucinous | WHITE | (60,70] | Stage III | 2.833213 |
| 663 | C8 | TCGA-C8-A1HO | 34 | FEMALE | NOT HISPANIC OR LATINO | 2010 | 0.320000 | 16 | Stage IIIC | 0 | LIVING | 0 | DiseaseFree | NA | NA | ASIAN | (30,40] | Stage III | 2.772589 |
ggplot(brcaClin_subset_race, aes(x=CNA, y=logmc, color=cancer.stage2)) +theme_bw() +
geom_point(size=2) +
geom_smooth(method=lm , color="magenta3", fill="steelblue", se=TRUE)

facet_wrap to create separate panels for each stage rather
than shading. Also add a smooth line with geom_smooth.

Comment here about what this plot tells us and how it can be useful:
This plot is much easier to read than the one in question 10 because the points no longer all overlap, so you can see the log mutation count and CNA at each stage. Additionally, instead of one overall trend line, this plot gives a trend line for each stage. The correlation between CNA and log mutation count appears to be low within each stage, though the trend lines slope slightly upward, with logmc increasing as CNA increases. This can be useful for quickly spotting outliers and for identifying trends across multiple variables such as CNA, logmc, and stage of cancer.
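The visual impression of a low correlation within each stage could be backed up with a number per panel. A minimal sketch, assuming a toy data frame with made-up values (`toy`, `stage`, `cna`, and `logmc` here are illustrative, not the real columns):

```r
# Toy data: one perfectly increasing and one perfectly decreasing stage
toy <- data.frame(
  stage = c("Stage I", "Stage I", "Stage I", "Stage II", "Stage II", "Stage II"),
  cna   = c(0.1, 0.2, 0.3, 0.1, 0.2, 0.3),
  logmc = c(1, 2, 3, 3, 2, 1)
)

# Split by stage, then compute the Pearson correlation within each piece;
# use = "complete.obs" skips rows where either value is missing
by_stage <- split(toy, toy$stage)
cors <- sapply(by_stage, function(d) cor(d$cna, d$logmc, use = "complete.obs"))
cors
```

On the real table, values near 0 for every stage would agree with the nearly flat trend lines in the faceted plot.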
ggplot(brcaClin_subset_race, aes(x=CNA, y=logmc)) +theme_bw() +
geom_point(size=2) +
facet_wrap(~cancer.stage2) +
geom_smooth(method=lm , color="magenta3", fill="steelblue", se=TRUE)

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = cancer.stage2)) + theme_bw() +
geom_bar() +
labs(title = "Records per Stage",
x = "Cancer Stage",
y = "Record Count") +
scale_fill_manual(values = c("green","orange","slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
geom_bar() +
labs(title = "Records per Stage by survival status",
x = "Cancer Stage",
y = "Record Count")+
scale_fill_manual(values = c("slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
geom_bar() +
labs(title = "Records per Stage by Survival Status and Race",
x = "Cancer Stage",
y = "Record Count")+
facet_wrap(~Race.Category2) +
scale_fill_manual(values = c("slateblue","magenta") )

OPTIONAL: For added difficulty, try to make a hover-over text label using the patient ID.