Instructions: Write your code in the code “chunks” following each question. Then compile the html file in RStudio using “Knit HTML”. Submit your R Markdown file (rename to something like “YOURNAME_hwR.Rmd”) and the HTML file.
Edit this R Markdown file directly. Add code to code chunks and text answers to text sections.

Check out examples from class and R function help pages. Break a problem down into steps. Google each step! You don’t have to answer the question in one line of code.

Email your instructor with questions ahead of the due date. Send your code, which question you are working on, and a screenshot of the error.

Please do not print really long outputs to the html.

Load needed packages. Uncomment the install lines if you need to install the packages. Then recomment before knitting.

# install.packages("ggplot2")
# install.packages("plotly")
library("ggplot2")
library("plotly")
library("knitr")  # provides kable(), which is used for the tables below

TCGA Data

Read in the data. Do not modify this code chunk!!

Notice that the CSV is imported and saved to a data frame called brcaClin.

brcaClin <- read.csv("TCGA_BRCA_Clinical_HW.csv")
  1. Describe the table. What are the dimensions? What are the columns and data types?

The table has 1003 rows and 15 columns.

The columns are PATIENT.ID (character), Diagnosis.Age (integer), Person.Gender (character), Ethnicity.Category (character), Race.Category (character), Year.Cancer.Initial.Diagnosis (integer), CNA (numeric), Mutation.Count (integer), Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code (character), Overall.Survival..Months. (numeric), Overall.Survival.Status (character), Disease.Free..Months. (numeric), Disease.Free.Status (character), Metastatic.Site (character), Tumor.Other.Histologic.Subtype (character)

dim(brcaClin)
## [1] 1003   15
str(brcaClin)  # str() prints its summary and returns NULL, so there is nothing useful to assign
## 'data.frame':    1003 obs. of  15 variables:
##  $ PATIENT.ID                                                    : chr  "TCGA-5T-A9QA" "TCGA-A1-A0SB" "TCGA-A1-A0SD" "TCGA-A1-A0SE" ...
##  $ Diagnosis.Age                                                 : int  52 70 59 56 54 61 39 52 39 54 ...
##  $ Person.Gender                                                 : chr  "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
##  $ Ethnicity.Category                                            : chr  "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" ...
##  $ Race.Category                                                 : chr  "BLACK OR AFRICAN AMERICAN" "WHITE" "WHITE" "WHITE" ...
##  $ Year.Cancer.Initial.Diagnosis                                 : int  2013 2008 2005 2005 2006 2006 2006 2007 2006 2007 ...
##  $ CNA                                                           : num  0.413 0.000802 0.247 0.213 0.202 ...
##  $ Mutation.Count                                                : int  NA 16 23 18 32 27 70 139 25 89 ...
##  $ Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code: chr  "Stage IB" "Stage I" "Stage IIA" "Stage I" ...
##  $ Overall.Survival..Months.                                     : num  0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 31.8 ...
##  $ Overall.Survival.Status                                       : chr  "LIVING" "LIVING" "LIVING" "LIVING" ...
##  $ Disease.Free..Months.                                         : num  0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 NA ...
##  $ Disease.Free.Status                                           : chr  "DiseaseFree" "DiseaseFree" "DiseaseFree" "DiseaseFree" ...
##  $ Metastatic.Site                                               : chr  "" "" "" "" ...
##  $ Tumor.Other.Histologic.Subtype                                : chr  "" "Adenoid Cystic Carcinoma" "" "invasive ductal and lobular carcinoma" ...
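As a small cross-check on the description above, the sketch below pulls the dimensions and the class of each column from a data frame. The data frame `toy` and its values are invented for illustration; they are not the TCGA table.

```r
# Toy data frame standing in for brcaClin (hypothetical values)
toy <- data.frame(
  id  = c("P1", "P2", "P3"),
  age = c(52L, 70L, 59L),
  cna = c(0.413, 0.0008, 0.247),
  stringsAsFactors = FALSE
)

dim(toy)                         # number of rows, then number of columns
col_types <- sapply(toy, class)  # one class per column
col_types
```

Unlike str(), sapply(toy, class) returns a named vector, so the result can be stored and reused.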
  2. How many subjects in the data experienced recurrence / progression? How many subjects are deceased? Check out the columns on Disease Free Status and Overall Survival Status.

68 subjects experienced recurrence/progression. 216 subjects are deceased.

brcaClin <- read.csv("TCGA_BRCA_Clinical_HW.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE, na.strings = c("NA","","[Not Evaluated]","[Unknown]"))
disease.free.status.vector <- brcaClin$Disease.Free.Status
disease.free.summary <- summary(disease.free.status.vector)
kable(disease.free.summary, caption = "Disease-Free Status")
Disease-Free Status
x
DiseaseFree 769
Recurred/Progressed 68
NAs 166
overall.survival.status.vector <- brcaClin$Overall.Survival.Status
survival.status.summary <- summary(overall.survival.status.vector)
kable(survival.status.summary, caption = "Overall Survival Status")
Overall Survival Status
x
DECEASED 216
LIVING 787
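The same counts can also be obtained with table(), which works on character vectors as well as factors. This is a minimal sketch on a made-up status vector, not the homework data; `status` and its values are assumptions.

```r
# Hypothetical status vector with one missing value
status <- c("DiseaseFree", "Recurred/Progressed", NA, "DiseaseFree")

# useNA = "ifany" keeps an explicit count for missing entries
status_counts <- table(status, useNA = "ifany")
status_counts

# Count a single category directly; na.rm drops the missing entry
n_recurred <- sum(status == "Recurred/Progressed", na.rm = TRUE)
n_recurred
```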
  3. Look at the variable Metastatic.Site and identify how many subjects had bone metastasis using grep or subset to find all the entries that have ‘bone’ OR ‘Bone’. What were the corresponding patient IDs for these patients?

13 patients had some kind of bone metastasis.

Patient IDs were: TCGA-A2-A04V TCGA-A2-A0SV TCGA-AO-A0J5 TCGA-AR-A0TY TCGA-AR-A0TZ TCGA-AR-A0U2 TCGA-AR-A2LH TCGA-B6-A0I9 TCGA-BH-A18J TCGA-BH-A1FH TCGA-EW-A1P8 TCGA-GM-A2D9 TCGA-GM-A2DA.

metastatic.site.vector <- brcaClin$Metastatic.Site
metastatic.site.summary <- summary(metastatic.site.vector)
kable(metastatic.site.summary, caption = "Metastatic Site")
Metastatic Site
x
Bone 9
Bone|Liver 2
Liver 2
Lung 1
Lung|Bone|Liver 1
Lung|Bone|Liver|Other, specify 1
Lung|Liver 1
Lung|Other, specify 1
Other, specify 5
NAs 980
# ignore.case = TRUE matches both 'bone' and 'Bone', as the question asks
metastatic.site.bone <- subset(brcaClin, grepl("bone", Metastatic.Site, ignore.case = TRUE), select = c("PATIENT.ID", "Metastatic.Site"))
# Reuse the subset instead of filtering brcaClin a second time
metastatic.site.bone.IDs <- metastatic.site.bone["PATIENT.ID"]
metastatic.site.bone.list <- metastatic.site.bone$PATIENT.ID
length(metastatic.site.bone.list)
## [1] 13
kable(metastatic.site.bone.IDs, caption = "Bone Metastasis Patient IDs")
Bone Metastasis Patient IDs
PATIENT.ID
22 TCGA-A2-A04V
61 TCGA-A2-A0SV
321 TCGA-AO-A0J5
374 TCGA-AR-A0TY
375 TCGA-AR-A0TZ
377 TCGA-AR-A0U2
421 TCGA-AR-A2LH
438 TCGA-B6-A0I9
579 TCGA-BH-A18J
613 TCGA-BH-A1FH
924 TCGA-EW-A1P8
943 TCGA-GM-A2D9
944 TCGA-GM-A2DA
kable(metastatic.site.bone, caption = "Metastatic Site Patients")
Metastatic Site Patients
PATIENT.ID Metastatic.Site
22 TCGA-A2-A04V Bone
61 TCGA-A2-A0SV Bone
321 TCGA-AO-A0J5 Lung|Bone|Liver
374 TCGA-AR-A0TY Bone
375 TCGA-AR-A0TZ Bone
377 TCGA-AR-A0U2 Bone|Liver
421 TCGA-AR-A2LH Bone|Liver
438 TCGA-B6-A0I9 Bone
579 TCGA-BH-A18J Bone
613 TCGA-BH-A1FH Bone
924 TCGA-EW-A1P8 Lung|Bone|Liver|Other, specify
943 TCGA-GM-A2D9 Bone
944 TCGA-GM-A2DA Bone
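To illustrate matching ‘bone’ OR ‘Bone’ in one step, here is a minimal sketch on invented data. The data frame `toy`, its IDs, and its site values are hypothetical.

```r
# Hypothetical IDs and sites; "bone" appears in mixed case on purpose
toy <- data.frame(
  id   = c("P1", "P2", "P3", "P4"),
  site = c("Bone", "Lung|bone", "Liver", NA),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE (not NA) for missing entries, so NA rows drop out
has_bone <- grepl("bone", toy$site, ignore.case = TRUE)
bone_ids <- toy$id[has_bone]
bone_ids
length(bone_ids)
```

An alternative with the same effect is the pattern "[Bb]one", which spells out the two accepted capitalizations.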
  4. The middle portion of the patient ID, sandwiched between the dashes ‘-’, identifies the TCGA center/site for the subject. Check the help page for the function substring and use this to create a new variable for TCGA center in your brcaClin table. How many different centers are there in this data? How many subjects came from each center? The function table will help greatly.

There are 28 different centers in this data. 5T: 1 subject, A1: 14 subjects, A2: 102 subjects, A7: 39 subjects, A8: 84 subjects, AC: 29 subjects, AN: 39 subjects, AO: 47 subjects, AQ: 9 subjects, AR: 68 subjects, B6: 54 subjects, BH: 145 subjects, C8: 45 subjects, D8: 75 subjects, E2: 91 subjects, E9: 60 subjects, EW: 38 subjects, GI: 2 subjects, GM: 21 subjects, JL: 2 subjects, LD: 3 subjects, LL: 11 subjects, LQ: 1 subject, MS: 1 subject, OK: 1 subject, OL: 19 subjects, S3: 1 subject, W8: 1 subject.

?substring
TCGA.center.site <- substring(brcaClin$PATIENT.ID, 6, 7)
brcaClin<-cbind(TCGA.center.site,brcaClin)
head_brca <- head(brcaClin)
kable(head_brca, caption = "brcaClin Head")
brcaClin Head
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Race.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype
5T TCGA-5T-A9QA 52 FEMALE NOT HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN 2013 0.4130000 NA Stage IB 0.39 LIVING 0.39 DiseaseFree NA NA
A1 TCGA-A1-A0SB 70 FEMALE NOT HISPANIC OR LATINO WHITE 2008 0.0008017 16 Stage I 8.51 LIVING 8.51 DiseaseFree NA Adenoid Cystic Carcinoma
A1 TCGA-A1-A0SD 59 FEMALE NOT HISPANIC OR LATINO WHITE 2005 0.2470000 23 Stage IIA 14.40 LIVING 14.40 DiseaseFree NA NA
A1 TCGA-A1-A0SE 56 FEMALE NOT HISPANIC OR LATINO WHITE 2005 0.2130000 18 Stage I 43.40 LIVING 43.40 DiseaseFree NA invasive ductal and lobular carcinoma
A1 TCGA-A1-A0SF 54 FEMALE NOT HISPANIC OR LATINO WHITE 2006 0.2020000 32 Stage IIA 48.10 LIVING 48.10 DiseaseFree NA NA
A1 TCGA-A1-A0SG 61 FEMALE NOT HISPANIC OR LATINO WHITE 2006 0.0969000 27 Stage IIB 14.30 LIVING 14.30 DiseaseFree NA micropapillary carcinoma invasive
?base::table
length(unique(TCGA.center.site))
## [1] 28
TCGA.table <- table(TCGA.center.site)
kable(TCGA.table, caption = "TCGA Center Site")
TCGA Center Site
TCGA.center.site Freq
5T 1
A1 14
A2 102
A7 39
A8 84
AC 29
AN 39
AO 47
AQ 9
AR 68
B6 54
BH 145
C8 45
D8 75
E2 91
E9 60
EW 38
GI 2
GM 21
JL 2
LD 3
LL 11
LQ 1
MS 1
OK 1
OL 19
S3 1
W8 1
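substring with fixed positions works here because every ID has the same layout. As a sketch of an alternative that does not depend on character positions, the center code can also be taken as the second dash-separated field. The IDs below are a made-up sample in the same "TCGA-XX-YYYY" layout.

```r
# Hypothetical IDs in the same layout as PATIENT.ID
ids <- c("TCGA-5T-A9QA", "TCGA-A1-A0SB", "TCGA-A1-A0SD")

# Fixed positions: characters 6 and 7
center_fixed <- substring(ids, 6, 7)

# Delimiter-based: second field after splitting on "-"
center_split <- sapply(strsplit(ids, "-", fixed = TRUE), `[`, 2)

center_counts <- table(center_split)
center_counts
length(center_counts)   # number of distinct centers
```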
  5. Use nchar to find the length of each variable (column) name in the data set. Use summary and hist to explore the distribution of these lengths. Which variable name is the longest? The shortest?

The longest variable name is “Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code”, which has 62 characters, and the shortest is “CNA”, with 3 characters.

brcaClin_char <- nchar(colnames(brcaClin))
brcaClin_char
##  [1] 16 10 13 13 18 13 29  3 14 62 25 23 21 19 15 30
summary(brcaClin_char)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   13.00   17.00   20.25   23.50   62.00
hist(brcaClin_char, main = "Histogram of Variable Name Lengths", xlab="variable length", xlim=c(0,70),ylim=c(0,8))

brcaClin_char_df <- as.data.frame(brcaClin_char)
brcaClin_char_df$Variable_Name <- colnames(brcaClin)
brcaclinchar_summary  <- summary(brcaClin_char_df)
head_brcaClin_df <- head(brcaClin_char_df)
kable(brcaclinchar_summary, caption = "Character Summary")
Character Summary
brcaClin_char Variable_Name
Min. : 3.00 Length :16
1st Qu.:13.00 N.unique :16
Median :17.00 N.blank : 0
Mean :20.25 Min.nchar: 3
3rd Qu.:23.50 Max.nchar:62
Max. :62.00 NA
kable(brcaClin_char_df, caption = "Variable Character Number")
Variable Character Number
brcaClin_char Variable_Name
16 TCGA.center.site
10 PATIENT.ID
13 Diagnosis.Age
13 Person.Gender
18 Ethnicity.Category
13 Race.Category
29 Year.Cancer.Initial.Diagnosis
3 CNA
14 Mutation.Count
62 Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code
25 Overall.Survival..Months.
23 Overall.Survival.Status
21 Disease.Free..Months.
19 Disease.Free.Status
15 Metastatic.Site
30 Tumor.Other.Histologic.Subtype
# Use names that do not mask the base functions min() and max()
min_len <- min(brcaClin_char_df$brcaClin_char)
min_len
## [1] 3
max_len <- max(brcaClin_char_df$brcaClin_char)
max_len
## [1] 62
brcaClin_char_df$Variable_Name[which.min(brcaClin_char_df$brcaClin_char)]
## [1] "CNA"
brcaClin_char_df$Variable_Name[which.max(brcaClin_char_df$brcaClin_char)]
## [1] "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code"
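The lookup of the shortest and longest names can be condensed by attaching the names to the lengths first. A minimal sketch, using a hypothetical three-name subset of the columns:

```r
# Hypothetical subset of column names
col_names <- c("PATIENT.ID", "CNA", "Diagnosis.Age")

# Named vector: each length is labelled with the name it came from
name_lengths <- setNames(nchar(col_names), col_names)

shortest <- names(which.min(name_lengths))
longest  <- names(which.max(name_lengths))
shortest
longest
```

Note that which.min and which.max return only the first match, so ties would need a check such as name_lengths == max(name_lengths).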
  6. How many unique entries are in Tumor.Other.Histologic.Subtype? Use grep to find all the entries that have ‘ductal’ OR ‘Ductal’ in them. How many subjects have this entry?

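There are 59 unique entries in Tumor.Other.Histologic.Subtype. Because blank entries were read in as NA, this count includes NA as one of the entries. 34 patients have ductal or ductal-related histologic types. The pattern used below is case-sensitive, so this count covers entries containing lowercase ‘ductal’ only.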

tumor <- brcaClin$Tumor.Other.Histologic.Subtype
length(tumor)
## [1] 1003
uniquetumor <- unique(tumor)
length(uniquetumor)
## [1] 59
tumor_ductal_IDs <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype),select = c("PATIENT.ID", "Tumor.Other.Histologic.Subtype"))
tumor_ductal <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype), select = c("Tumor.Other.Histologic.Subtype"))
length(tumor_ductal_IDs$PATIENT.ID)
## [1] 34
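Two details affect these counts: whether NA is counted as a unique entry, and whether the match is case-sensitive. The sketch below shows both on an invented vector; `subtype` and its values are assumptions, and the numbers it produces say nothing about the real data.

```r
# Hypothetical subtype entries, with one NA and mixed capitalization
subtype <- c("invasive ductal and lobular carcinoma", "Ductal carcinoma",
             NA, "mucinous", "mucinous")

# unique() counts NA as an entry; drop it to count only recorded subtypes
n_unique_all      <- length(unique(subtype))
n_unique_recorded <- length(unique(subtype[!is.na(subtype)]))

# Case-sensitive vs. case-insensitive matching
n_lower <- sum(grepl("ductal", subtype))
n_any   <- sum(grepl("ductal", subtype, ignore.case = TRUE))

c(n_unique_all, n_unique_recorded, n_lower, n_any)
```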
  7. Sort the data set from shortest to longest survival time and save to a new data.frame. Where do the subjects with missing survival time end up? Use the head and tail functions and please do not print out the whole data frame!

The subjects with missing survival time (NA) end up at the bottom of the table, because order() places NA values last by default (na.last = TRUE).

brcaClin_ordered <- brcaClin[order(brcaClin$Overall.Survival..Months., decreasing = FALSE),]
is.data.frame(brcaClin_ordered)
## [1] TRUE
brcaClin_ordered_head <- head(brcaClin_ordered)
brcaClin_ordered_tail <- tail(brcaClin_ordered)
kable(brcaClin_ordered_head, caption = "brcaClin Ordered Head")
brcaClin Ordered Head
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Race.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype
157 A8 TCGA-A8-A06N 66 FEMALE NA NA 2009 5.58e-01 25 Stage IIIB 0 LIVING 0 DiseaseFree NA NA
185 A8 TCGA-A8-A081 80 FEMALE NA NA 2008 6.45e-01 67 Stage IIA 0 LIVING 0 DiseaseFree NA NA
187 A8 TCGA-A8-A083 67 FEMALE NA NA 2008 1.82e-01 29 Stage IIB 0 LIVING 0 DiseaseFree NA NA
196 A8 TCGA-A8-A08H 66 FEMALE NA NA 2007 2.19e-05 46 Stage IIA 0 LIVING 0 DiseaseFree NA NA
207 A8 TCGA-A8-A090 74 FEMALE NA NA 2008 2.86e-01 26 Stage IIA 0 LIVING 0 DiseaseFree NA NA
211 A8 TCGA-A8-A094 75 FEMALE NA NA 2009 4.87e-01 138 Stage IIA 0 LIVING 0 DiseaseFree NA NA
kable(brcaClin_ordered_tail, caption = "brcaClin Ordered Tail")
brcaClin Ordered Tail
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Race.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype
874 E9 TCGA-E9-A226 45 FEMALE NOT HISPANIC OR LATINO WHITE 2011 0.2560000 22 Stage IIIA NA DECEASED 25.9 Recurred/Progressed NA NA
884 E9 TCGA-E9-A243 52 FEMALE NOT HISPANIC OR LATINO WHITE 2011 0.1010000 49 Stage IIA NA DECEASED 10.0 Recurred/Progressed NA NA
886 E9 TCGA-E9-A245 47 FEMALE NOT HISPANIC OR LATINO WHITE 2011 0.0345000 13 Stage IIB NA DECEASED NA NA NA NA
917 EW TCGA-EW-A1P0 55 FEMALE NOT HISPANIC OR LATINO WHITE 2008 0.2440000 21 Stage IIB NA DECEASED 35.9 Recurred/Progressed NA IDC+ mucinous carcinoma
918 EW TCGA-EW-A1P1 68 FEMALE HISPANIC OR LATINO WHITE 2008 0.0008294 1 Stage IIIC NA DECEASED 33.3 Recurred/Progressed Lung NA
944 GM TCGA-GM-A2DA 46 FEMALE NOT HISPANIC OR LATINO WHITE 1993 0.1410000 21 Stage IIB NA DECEASED NA NA Bone NA
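Where the missing values land is controlled by the na.last argument of order(). A minimal sketch on a made-up data frame (`surv` and its values are hypothetical):

```r
# Hypothetical survival times with one missing value
surv <- data.frame(
  id     = c("P1", "P2", "P3", "P4"),
  months = c(14.4, NA, 0.39, 8.51),
  stringsAsFactors = FALSE
)

sorted_last  <- surv[order(surv$months), ]                   # NA rows last (default)
sorted_first <- surv[order(surv$months, na.last = FALSE), ]  # NA rows first
sorted_drop  <- surv[order(surv$months, na.last = NA), ]     # NA rows removed

sorted_last
```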

Problems 8 through 12 are best done using ggplot2

  8. Use code from the notes to make a race variable with only the levels “ASIAN”, “BLACK OR AFRICAN AMERICAN”, and “WHITE”. Be sure to factor after subsetting. Also make a new categorical age at diagnosis variable with age in decades or deciles. Then produce boxplots of CNA by diagnosis age category, with separate panels (facets) by the 3-level race variable.

Comment here about what this plot tells us and how it can be useful:

This plot shows copy number alterations (CNA) by race and by age at diagnosis. For each group, the boxplot shows the median, the lower and upper quartiles, the whiskers (the range excluding outliers), and any outliers. It is useful for quickly comparing several variables in the data at once, such as race, CNA, and age category (decades). In this plot, CNA appears to increase with age at diagnosis among Asian subjects, but no such trend is apparent among Black/African American or White subjects. Overall, White subjects have lower median CNA values than Asian and Black/African American subjects at any age of diagnosis, although the outliers among White subjects include the highest CNA values of all three groups. Boxplots are a useful tool for revealing this kind of structure in a dataset.

brcaClin_subset_race <- subset(brcaClin_ordered, Race.Category %in% c("ASIAN", "BLACK OR AFRICAN AMERICAN", "WHITE"))
brcaClin_subset_race$Race.Category2 <- factor(brcaClin_subset_race$Race.Category)
race_subset_table <- table(brcaClin_subset_race$Race.Category2)
brcaClin_subset_race$Race.Category <- NULL
kable(race_subset_table, caption = "brcaClin Race Subset")
brcaClin Race Subset
Var1 Freq
ASIAN 57
BLACK OR AFRICAN AMERICAN 129
WHITE 721
brcaClin_subset_race$Diag.Age.Cat <- cut(brcaClin_subset_race$Diagnosis.Age, breaks = seq(10, 100, by = 10))
brcaClin_subset_race$Diag.Age.Cat <- factor(brcaClin_subset_race$Diag.Age.Cat)
diag_table <- table(brcaClin_subset_race$Diag.Age.Cat)
head_diag.age.cat <- head(brcaClin_subset_race)
kable(head_diag.age.cat, caption = "brcaClin Diagnosis Age Category Head")
brcaClin Diagnosis Age Category Head
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype Race.Category2 Diag.Age.Cat
577 BH TCGA-BH-A18H 63 FEMALE NOT HISPANIC OR LATINO 2010 0.263000 24 Stage IA 0 LIVING 0 DiseaseFree NA NA WHITE (60,70]
639 C8 TCGA-C8-A12T 43 FEMALE NOT HISPANIC OR LATINO 2010 0.231000 120 Stage IIA 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
644 C8 TCGA-C8-A12Y 44 FEMALE NOT HISPANIC OR LATINO 2010 0.233000 22 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
645 C8 TCGA-C8-A12Z 45 FEMALE NOT HISPANIC OR LATINO 2010 0.270000 37 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
649 C8 TCGA-C8-A133 65 FEMALE NOT HISPANIC OR LATINO 2010 0.005322 17 Stage IIIA 0 LIVING 0 DiseaseFree NA ductal/mucinous WHITE (60,70]
663 C8 TCGA-C8-A1HO 34 FEMALE NOT HISPANIC OR LATINO 2010 0.320000 16 Stage IIIC 0 LIVING 0 DiseaseFree NA NA ASIAN (30,40]
kable(diag_table, caption = "Diagnosis Age Category")
Diagnosis Age Category
Var1 Freq
(20,30] 9
(30,40] 78
(40,50] 201
(50,60] 234
(60,70] 226
(70,80] 122
(80,90] 37
ggplot(brcaClin_subset_race, aes(x=Diag.Age.Cat, y=CNA, fill=Diag.Age.Cat)) + 
  geom_boxplot()+ theme_bw() +
  facet_wrap(~Race.Category2) +
  scale_fill_brewer(palette = "Set3") 
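The age categories above come from cut(), which by default builds right-closed intervals, so an age of exactly 30 falls in (20,30]. The sketch below compares this with left-closed intervals on a few invented ages; the vector `age` is hypothetical.

```r
age <- c(29, 30, 31, 45, 60, 89)  # hypothetical ages at diagnosis

# Default: right-closed intervals, so 30 falls in (20,30]
right_closed <- cut(age, breaks = seq(10, 100, by = 10))

# right = FALSE gives [30,40), which puts 30 with the other ages in the thirties
left_closed <- cut(age, breaks = seq(10, 100, by = 10), right = FALSE)

data.frame(age, right_closed, left_closed)
```

Either convention is defensible for a boxplot; the point is only that the choice decides where subjects with ages on a boundary are grouped.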

  9. Use code from the notes to create a Stage variable with only values “Stage I”, “Stage II”, “Stage III”, and “Stage IV”. Look at the table to find the name of the current stage variable. Rename variables like “Stage Ia” and “Stage Ib” to just “Stage I”. gsub may help with this.

Then make a new column in the table of log mutation count. Remember you cannot log a 0 and you will need to add some small number. If there are 0 values in the mutation count, try something like log(Mutation+0.25)

Then produce a scatterplot of CNA (x variable) by log mutation count (y variable) with points color coded by stage of disease. Use geom_smooth to draw a smoothed line depicting the relationship between CNA and log mutation count.

brcaClin_head_race <- head(brcaClin_subset_race)
kable(brcaClin_head_race, caption = "brcaClin Head")
brcaClin Head
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype Race.Category2 Diag.Age.Cat
577 BH TCGA-BH-A18H 63 FEMALE NOT HISPANIC OR LATINO 2010 0.263000 24 Stage IA 0 LIVING 0 DiseaseFree NA NA WHITE (60,70]
639 C8 TCGA-C8-A12T 43 FEMALE NOT HISPANIC OR LATINO 2010 0.231000 120 Stage IIA 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
644 C8 TCGA-C8-A12Y 44 FEMALE NOT HISPANIC OR LATINO 2010 0.233000 22 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
645 C8 TCGA-C8-A12Z 45 FEMALE NOT HISPANIC OR LATINO 2010 0.270000 37 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50]
649 C8 TCGA-C8-A133 65 FEMALE NOT HISPANIC OR LATINO 2010 0.005322 17 Stage IIIA 0 LIVING 0 DiseaseFree NA ductal/mucinous WHITE (60,70]
663 C8 TCGA-C8-A1HO 34 FEMALE NOT HISPANIC OR LATINO 2010 0.320000 16 Stage IIIC 0 LIVING 0 DiseaseFree NA NA ASIAN (30,40]
brcaClin_subset_race$cancer.stage <- gsub(pattern = "[ABC]", replacement = "", 
                            x = brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
stage.table <- table(brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
kable(stage.table, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code")
Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code
Var1 Freq
Stage I 76
Stage IA 77
Stage IB 11
Stage II 8
Stage IIA 305
Stage IIB 209
Stage III 2
Stage IIIA 128
Stage IIIB 19
Stage IIIC 47
Stage IV 11
Stage Tis 1
Stage X 12
no.abc <- table(brcaClin_subset_race$cancer.stage)
kable(no.abc, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Condensed")
Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Condensed
Var1 Freq
Stage I 164
Stage II 522
Stage III 196
Stage IV 11
Stage Tis 1
Stage X 12
cancer.stage_levels <- paste("Stage", c("I", "II", "III", "IV"), sep = " ")
cancer.stage_levels
## [1] "Stage I"   "Stage II"  "Stage III" "Stage IV"
brcaClin_subset_race$cancer.stage2 <- factor(brcaClin_subset_race$cancer.stage, levels = cancer.stage_levels)
brcaClin_subset_race<-subset(brcaClin_subset_race, !is.na(brcaClin_subset_race$cancer.stage2))
table.cancer.stage2 <- table(brcaClin_subset_race$cancer.stage2)
kable(table.cancer.stage2, caption = "Cancer Stages I-IV")
Cancer Stages I-IV
Var1 Freq
Stage I 164
Stage II 522
Stage III 196
Stage IV 11
brcaClin_subset_race$cancer.stage <- NULL
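The stage clean-up above can be sketched on a handful of invented labels. This version anchors the pattern at the end of the string, an assumption made here so that only a trailing A, B, or C is removed; the vector `stage_raw` is hypothetical.

```r
# Hypothetical stage labels, including two that fall outside Stage I-IV
stage_raw <- c("Stage IA", "Stage IIB", "Stage IIIC", "Stage IV",
               "Stage X", "Stage Tis")

# Remove a trailing A, B, or C only; "$" anchors the match at the end
stage_collapsed <- sub("[ABC]$", "", stage_raw)

# Values outside the listed levels (Stage X, Stage Tis) become NA
stage_levels <- paste("Stage", c("I", "II", "III", "IV"))
stage_factor <- factor(stage_collapsed, levels = stage_levels)

table(stage_factor, useNA = "ifany")
```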

brcaClin_subset_race$logmc <- log((brcaClin_subset_race$Mutation.Count)+.000001)
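The size of the offset matters mainly for subjects with a mutation count of 0. As a rough illustration on invented counts (the vector `mutation_count` is hypothetical), a very small offset sends zeros far below the rest of the data, while a larger offset such as the 0.25 suggested in the question keeps them closer:

```r
mutation_count <- c(0, 1, 16, 139)   # hypothetical counts, including a zero

log_tiny    <- log(mutation_count + 1e-6)  # zero maps to about -13.8
log_quarter <- log(mutation_count + 0.25)  # zero maps to about -1.39
log_plus1   <- log1p(mutation_count)       # log(1 + x); zero maps to 0

round(data.frame(mutation_count, log_tiny, log_quarter, log_plus1), 2)
```

For counts well above zero the three versions are nearly identical, so the choice only changes how zero counts appear on the plot.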

head_logmc <- head(brcaClin_subset_race)
kable(head_logmc, caption = "brcaClin Head LogMC")
brcaClin Head LogMC
TCGA.center.site PATIENT.ID Diagnosis.Age Person.Gender Ethnicity.Category Year.Cancer.Initial.Diagnosis CNA Mutation.Count Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Overall.Survival..Months. Overall.Survival.Status Disease.Free..Months. Disease.Free.Status Metastatic.Site Tumor.Other.Histologic.Subtype Race.Category2 Diag.Age.Cat cancer.stage2 logmc
577 BH TCGA-BH-A18H 63 FEMALE NOT HISPANIC OR LATINO 2010 0.263000 24 Stage IA 0 LIVING 0 DiseaseFree NA NA WHITE (60,70] Stage I 3.178054
639 C8 TCGA-C8-A12T 43 FEMALE NOT HISPANIC OR LATINO 2010 0.231000 120 Stage IIA 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50] Stage II 4.787492
644 C8 TCGA-C8-A12Y 44 FEMALE NOT HISPANIC OR LATINO 2010 0.233000 22 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50] Stage II 3.091042
645 C8 TCGA-C8-A12Z 45 FEMALE NOT HISPANIC OR LATINO 2010 0.270000 37 Stage IIB 0 LIVING 0 DiseaseFree NA NA ASIAN (40,50] Stage II 3.610918
649 C8 TCGA-C8-A133 65 FEMALE NOT HISPANIC OR LATINO 2010 0.005322 17 Stage IIIA 0 LIVING 0 DiseaseFree NA ductal/mucinous WHITE (60,70] Stage III 2.833213
663 C8 TCGA-C8-A1HO 34 FEMALE NOT HISPANIC OR LATINO 2010 0.320000 16 Stage IIIC 0 LIVING 0 DiseaseFree NA NA ASIAN (30,40] Stage III 2.772589
ggplot(brcaClin_subset_race, aes(x=CNA, y=logmc, color=cancer.stage2)) +theme_bw() +
  geom_point(size=2) +
  geom_smooth(method=lm , color="magenta3", fill="steelblue", se=TRUE)

  10. Produce the same plot as the previous problem of CNA (x-axis) vs. log mutation count (y-axis) but this time use facet_wrap to create separate panels for each stage rather than shading. Also add a smooth line with geom_smooth.

Comment here about what this plot tells us and how it can be useful:

This plot is much easier to read than the one in question 9 because the points from different stages no longer overlap, so you can see the log mutation count and CNA at each stage. Additionally, instead of an overall trend line, this plot gives a trend line for each stage. From the plot, the correlation between CNA and log mutation count appears low within each stage, though the trend lines slope slightly upward, with log mutation count increasing as CNA increases. This can be useful for quickly identifying outliers and trends across multiple variables such as CNA, log mutation count, and stage of cancer.

ggplot(brcaClin_subset_race, aes(x=CNA, y=logmc)) +theme_bw() +
  geom_point(size=2) +
    facet_wrap(~cancer.stage2) +
  geom_smooth(method=lm , color="magenta3", fill="steelblue", se=TRUE)

  11. First make a barplot of number of records per stage. Add in shading by survival status for the second plot. Then make a third version which will split the plot with separate panels for race (facet).
ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = cancer.stage2)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage",
       x = "Cancer Stage",
       y = "Record Count") +
  scale_fill_manual(values = c("green","orange","slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage by survival status",
       x = "Cancer Stage",
       y = "Record Count")+
  scale_fill_manual(values = c("slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage by Survival Status and Race",
       x = "Cancer Stage",
       y = "Record Count")+
  facet_wrap(~Race.Category2) +
  scale_fill_manual(values = c("slateblue","magenta") )
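The bar heights in these plots are simple counts, so they can be checked against a cross-tabulation. A minimal sketch on invented records (the data frame `toy` is hypothetical):

```r
# Hypothetical stage and survival status for five records
toy <- data.frame(
  stage  = c("Stage I", "Stage I", "Stage II", "Stage II", "Stage II"),
  status = c("LIVING", "DECEASED", "LIVING", "LIVING", "DECEASED"),
  stringsAsFactors = FALSE
)

# Rows are stages, columns are survival status: the shaded segments of each bar
stage_by_status <- table(toy$stage, toy$status)
stage_by_status

# Row totals correspond to the bar heights in the unshaded plot
stage_totals <- rowSums(stage_by_status)
stage_totals
```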

  12. For the final plot, use the plotly package to produce a scatterplot of CNA (x-axis) vs. log mutation count (y-axis), with points color-coded by diagnosis age category.

OPTIONAL: For added difficulty, try to make a hover-over text label using the patient ID.

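# mode = "markers" states the scatter mode explicitly; the object name avoids reusing the package name
cna_plot <- plot_ly(data = brcaClin_subset_race, type = "scatter", mode = "markers", x = ~CNA, y = ~logmc, color = ~Diag.Age.Cat)
cna_plot
# text = ~PATIENT.ID adds the patient ID to the hover label
cna_plot_ID <- plot_ly(data = brcaClin_subset_race, type = "scatter", mode = "markers", x = ~CNA, y = ~logmc, color = ~Diag.Age.Cat, text = ~PATIENT.ID)
cna_plot_ID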