Instructions: Write your code in the code “chunks” following each question. Then compile the html file in RStudio using “Knit HTML”. Submit your R Markdown file (rename to something like “YOURNAME_hwR.Rmd”) and the HTML file.
Edit this R Markdown file directly. Add code to code chunks and text answers to text sections.

Check out examples from class and R function help pages. Break a problem down into steps. Google each step! You don’t have to answer the question in one line of code.

Email your instructor with questions ahead of the due date. Send your code, which question you are working on, and a screenshot of the error.

Load needed packages. Uncomment the install lines if you need to install the packages. Then recomment before knitting.
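
A minimal sketch of such a loading chunk is given here. The package names (knitr for kable, ggplot2 for ggplot, plotly for plot_ly) are an assumption inferred from the functions used later in this file; the actual chunk does not appear in the knitted output.

```r
# Packages inferred from the functions used in this homework (an assumption):
# knitr (kable), ggplot2 (ggplot), plotly (plot_ly)
needed <- c("knitr", "ggplot2", "plotly")

# Uncomment to install, then re-comment before knitting
# install.packages(needed)

# Check which packages are installed without attaching them
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Attach the packages that are available and report any that are not
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Checking with requireNamespace first means the chunk reports a missing package by name instead of stopping at the first failed library call.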

TCGA Data

Read in the data. Do not modify this code chunk!!

Notice that the CSV is imported and saved to a data frame called brcaClin.

brcaClin <- read.csv("TCGA_BRCA_Clinical_HW.csv")

Describe the table. What are the dimensions? What are the columns and data types?

The table has 1003 rows and 15 columns.

The columns are PATIENT.ID (character), Diagnosis.Age (integer), Person.Gender (character), Ethnicity.Category (character), Race.Category (character), Year.Cancer.Initial.Diagnosis (integer), CNA (numeric), Mutation.Count (integer), Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code (character), Overall.Survival..Months. (numeric), Overall.Survival.Status (character), Disease.Free..Months. (numeric), Disease.Free.Status (character), Metastatic.Site (character), Tumor.Other.Histologic.Subtype (character)

dim(brcaClin)

## [1] 1003   15

str(brcaClin)

## 'data.frame':    1003 obs. of  15 variables:
##  $ PATIENT.ID                                                    : chr  "TCGA-5T-A9QA" "TCGA-A1-A0SB" "TCGA-A1-A0SD" "TCGA-A1-A0SE" ...
##  $ Diagnosis.Age                                                 : int  52 70 59 56 54 61 39 52 39 54 ...
##  $ Person.Gender                                                 : chr  "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
##  $ Ethnicity.Category                                            : chr  "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" "NOT HISPANIC OR LATINO" ...
##  $ Race.Category                                                 : chr  "BLACK OR AFRICAN AMERICAN" "WHITE" "WHITE" "WHITE" ...
##  $ Year.Cancer.Initial.Diagnosis                                 : int  2013 2008 2005 2005 2006 2006 2006 2007 2006 2007 ...
##  $ CNA                                                           : num  0.413 0.000802 0.247 0.213 0.202 ...
##  $ Mutation.Count                                                : int  NA 16 23 18 32 27 70 139 25 89 ...
##  $ Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code: chr  "Stage IB" "Stage I" "Stage IIA" "Stage I" ...
##  $ Overall.Survival..Months.                                     : num  0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 31.8 ...
##  $ Overall.Survival.Status                                       : chr  "LIVING" "LIVING" "LIVING" "LIVING" ...
##  $ Disease.Free..Months.                                         : num  0.39 8.51 14.4 43.4 48.1 14.3 47.2 20.9 13.7 NA ...
##  $ Disease.Free.Status                                           : chr  "DiseaseFree" "DiseaseFree" "DiseaseFree" "DiseaseFree" ...
##  $ Metastatic.Site                                               : chr  "" "" "" "" ...
##  $ Tumor.Other.Histologic.Subtype                                : chr  "" "Adenoid Cystic Carcinoma" "" "invasive ductal and lobular carcinoma" ...
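
The column types can also be collected into a single named vector, which is easier to copy into a written answer than the str printout. This is only a sketch on a small stand-in data frame; the object name toy and its three columns are illustrative, with values taken from the first rows shown above.

```r
# Small stand-in for brcaClin (three columns only, for illustration)
toy <- data.frame(
  PATIENT.ID = c("TCGA-5T-A9QA", "TCGA-A1-A0SB", "TCGA-A1-A0SD"),
  Diagnosis.Age = c(52L, 70L, 59L),
  CNA = c(0.413, 0.000802, 0.247),
  stringsAsFactors = FALSE
)

# Rows and columns
print(dim(toy))

# One class per column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)
```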

How many subjects in the data experienced recurrence / progression? How many subjects are deceased? Check out columns on Disease Free Status and Overall Survival Status

68 subjects experienced recurrence/progression. 216 subjects are deceased.

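# Re-read the file so that blanks and placeholder strings become NA and text columns become factors
brcaClin <- read.csv("TCGA_BRCA_Clinical_HW.csv", header = TRUE, sep = ",",
                     stringsAsFactors = TRUE,
                     na.strings = c("NA", "", "[Not Evaluated]", "[Unknown]"))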
disease.free.status.vector <- brcaClin$Disease.Free.Status
disease.free.summary <- summary(disease.free.status.vector)
kable(disease.free.summary, caption = "Disease-Free Status")

Disease-Free Status
	x
DiseaseFree	769
Recurred/Progressed	68
NAs	166

overall.survival.status.vector <- brcaClin$Overall.Survival.Status
survival.status.summary <- summary(overall.survival.status.vector)
kable(survival.status.summary, caption = "Overall Survival Status")

Overall Survival Status
	x
DECEASED	216
LIVING	787
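
The same counts can be obtained with table, which works on character vectors as well as factors. The sketch below uses a short made-up status vector (not the homework data) to show how missing values are reported.

```r
# Made-up status values for illustration, including two missing entries
status <- c("DiseaseFree", "Recurred/Progressed", NA, "DiseaseFree", NA)

# useNA = "ifany" adds an <NA> column only when missing values are present
counts <- table(status, useNA = "ifany")
print(counts)

# Count one category directly; na.rm = TRUE skips the missing entries
n_recurred <- sum(status == "Recurred/Progressed", na.rm = TRUE)
print(n_recurred)
```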

Look at the variable Metastatic.Site and identify how many subjects had bone metastasis using grep or subset to find all the entries that have ‘bone’ OR ‘Bone’. What were the corresponding patient IDs for these patients?

13 patients had some kind of bone metastasis.

Patient IDs were: TCGA-A2-A04V TCGA-A2-A0SV TCGA-AO-A0J5 TCGA-AR-A0TY TCGA-AR-A0TZ TCGA-AR-A0U2 TCGA-AR-A2LH TCGA-B6-A0I9 TCGA-BH-A18J TCGA-BH-A1FH TCGA-EW-A1P8 TCGA-GM-A2D9 TCGA-GM-A2DA.

metastatic.site.vector <- brcaClin$Metastatic.Site
metastatic.site.summary <- summary(metastatic.site.vector)
kable(metastatic.site.summary, caption = "Metastatic Site")

Metastatic Site
	x
Bone	9
Bone\|Liver	2
Liver	2
Lung	1
Lung\|Bone\|Liver	1
Lung\|Bone\|Liver\|Other, specify	1
Lung\|Liver	1
Lung\|Other, specify	1
Other, specify	5
NAs	980

# ignore.case = TRUE matches both 'bone' and 'Bone', as the question asks
metastatic.site.bone <- subset(brcaClin, grepl("bone", brcaClin$Metastatic.Site, ignore.case = TRUE), select = c("PATIENT.ID", "Metastatic.Site"))
metastatic.site.bone.IDs <- subset(brcaClin, grepl("bone", brcaClin$Metastatic.Site, ignore.case = TRUE), select = c("PATIENT.ID"))
metastatic.site.bone.list <- metastatic.site.bone$PATIENT.ID
length(metastatic.site.bone.list)

## [1] 13

kable(metastatic.site.bone.IDs, caption = "Bone Metastasis Patient IDs")

Bone Metastasis Patient IDs
	PATIENT.ID
22	TCGA-A2-A04V
61	TCGA-A2-A0SV
321	TCGA-AO-A0J5
374	TCGA-AR-A0TY
375	TCGA-AR-A0TZ
377	TCGA-AR-A0U2
421	TCGA-AR-A2LH
438	TCGA-B6-A0I9
579	TCGA-BH-A18J
613	TCGA-BH-A1FH
924	TCGA-EW-A1P8
943	TCGA-GM-A2D9
944	TCGA-GM-A2DA

kable(metastatic.site.bone, caption = "Metastatic Site Patients")

Metastatic Site Patients
	PATIENT.ID	Metastatic.Site
22	TCGA-A2-A04V	Bone
61	TCGA-A2-A0SV	Bone
321	TCGA-AO-A0J5	Lung\|Bone\|Liver
374	TCGA-AR-A0TY	Bone
375	TCGA-AR-A0TZ	Bone
377	TCGA-AR-A0U2	Bone\|Liver
421	TCGA-AR-A2LH	Bone\|Liver
438	TCGA-B6-A0I9	Bone
579	TCGA-BH-A18J	Bone
613	TCGA-BH-A1FH	Bone
924	TCGA-EW-A1P8	Lung\|Bone\|Liver\|Other, specify
943	TCGA-GM-A2D9	Bone
944	TCGA-GM-A2DA	Bone
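
As a small illustration of how grepl behaves on this kind of column, the sketch below uses a made-up vector of site strings. The lowercase "bone" entry and the P1 to P6 identifiers are hypothetical and do not come from the homework data.

```r
# Made-up metastatic site entries; NA marks a missing value
sites <- c("Bone", "Bone|Liver", "Lung", NA, "Lung|Bone|Liver", "bone")
ids <- paste0("P", seq_along(sites))

# Case-insensitive match; grepl returns FALSE (not NA) for missing entries
has_bone <- grepl("bone", sites, ignore.case = TRUE)

# Number of matches and the corresponding identifiers
print(sum(has_bone))
bone_ids <- ids[has_bone]
print(bone_ids)
```

Because grepl returns FALSE for NA, the logical vector can be used to index directly without an extra is.na check.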

The middle portion of the patient ID, sandwiched between the dashes ‘-’, identifies the TCGA center/site for the subject. Check the help page for the function substring and use it to create a new variable for TCGA center in your brcaClin table. How many different centers are there in this data? How many subjects came from each center? The function table will help greatly.

There are 28 different centers in this data. 5T: 1 subject, A1: 14 subjects, A2: 102 subjects, A7: 39 subjects, A8: 84 subjects, AC: 29 subjects, AN: 39 subjects, AO: 47 subjects, AQ: 9 subjects, AR: 68 subjects, B6: 54 subjects, BH: 145 subjects, C8: 45 subjects, D8: 75 subjects, E2: 91 subjects, E9: 60 subjects, EW: 38 subjects, GI: 2 subjects, GM: 21 subjects, JL: 2 subjects, LD: 3 subjects, LL: 11 subjects, LQ: 1 subject, MS: 1 subject, OK: 1 subject, OL: 19 subjects, S3: 1 subject, W8: 1 subject.

?substring
TCGA.center.site <- substring(brcaClin$PATIENT.ID, 6, 7)
brcaClin<-cbind(TCGA.center.site,brcaClin)
head_brca <- head(brcaClin)
kable(head_brca, caption = "brcaClin Head")

brcaClin Head
TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Race.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival..Months.	Overall.Survival.Status	Disease.Free..Months.	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype
5T	TCGA-5T-A9QA	52	FEMALE	NOT HISPANIC OR LATINO	BLACK OR AFRICAN AMERICAN	2013	0.4130000	NA	Stage IB	0.39	LIVING	0.39	DiseaseFree	NA	NA
A1	TCGA-A1-A0SB	70	FEMALE	NOT HISPANIC OR LATINO	WHITE	2008	0.0008017	16	Stage I	8.51	LIVING	8.51	DiseaseFree	NA	Adenoid Cystic Carcinoma
A1	TCGA-A1-A0SD	59	FEMALE	NOT HISPANIC OR LATINO	WHITE	2005	0.2470000	23	Stage IIA	14.40	LIVING	14.40	DiseaseFree	NA	NA
A1	TCGA-A1-A0SE	56	FEMALE	NOT HISPANIC OR LATINO	WHITE	2005	0.2130000	18	Stage I	43.40	LIVING	43.40	DiseaseFree	NA	invasive ductal and lobular carcinoma
A1	TCGA-A1-A0SF	54	FEMALE	NOT HISPANIC OR LATINO	WHITE	2006	0.2020000	32	Stage IIA	48.10	LIVING	48.10	DiseaseFree	NA	NA
A1	TCGA-A1-A0SG	61	FEMALE	NOT HISPANIC OR LATINO	WHITE	2006	0.0969000	27	Stage IIB	14.30	LIVING	14.30	DiseaseFree	NA	micropapillary carcinoma invasive

?base::table
length(unique(TCGA.center.site))

## [1] 28

TCGA.table <- table(TCGA.center.site)
kable(TCGA.table, caption = "TCGA Center Site")

TCGA Center Site
TCGA.center.site	Freq
5T	1
A1	14
A2	102
A7	39
A8	84
AC	29
AN	39
AO	47
AQ	9
AR	68
B6	54
BH	145
C8	45
D8	75
E2	91
E9	60
EW	38
GI	2
GM	21
JL	2
LD	3
LL	11
LQ	1
MS	1
OK	1
OL	19
S3	1
W8	1
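
The fixed positions 6 and 7 work because every ID has the same layout. As a sketch of an alternative that does not depend on character positions, the ID can be split on the dashes and the second piece kept. The four IDs below are taken from tables in this file; the object names are illustrative.

```r
ids <- c("TCGA-5T-A9QA", "TCGA-A1-A0SB", "TCGA-A1-A0SD", "TCGA-A2-A04V")

# Approach used in this homework: characters 6 to 7 of each ID
center_fixed <- substring(ids, 6, 7)

# Alternative: split on "-" and keep the second piece of each ID
center_split <- sapply(strsplit(ids, "-", fixed = TRUE), `[`, 2)

# Both approaches agree when the ID layout is uniform
print(identical(center_fixed, center_split))

# Subjects per center
center_counts <- table(center_split)
print(center_counts)
```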

Use nchar to find the length of each variable (column) name in the data set. Use summary and hist to explore the distribution of these lengths. Which variable name is the longest? The shortest?

The longest variable name is “Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code”, which has 62 characters, and the shortest is “CNA”, with 3 characters.

brcaClin_char <- nchar(colnames(brcaClin))
brcaClin_char

##  [1] 16 10 13 13 18 13 29  3 14 62 25 23 21 19 15 30

summary(brcaClin_char)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   13.00   17.00   20.25   23.50   62.00

hist(brcaClin_char, main = "Histogram of Variable Name Lengths", xlab="variable length", xlim=c(0,70),ylim=c(0,8))

brcaClin_char_df <- as.data.frame(brcaClin_char)
brcaClin_char_df$Variable_Name <- colnames(brcaClin)
brcaclinchar_summary  <- summary(brcaClin_char_df)
head_brcaClin_df <- head(brcaClin_char_df)
kable(brcaclinchar_summary, caption = "Character Summary")

Character Summary
	brcaClin_char	Variable_Name
	Min. : 3.00	Length :16
	1st Qu.:13.00	N.unique :16
	Median :17.00	N.blank : 0
	Mean :20.25	Min.nchar: 3
	3rd Qu.:23.50	Max.nchar:62
	Max. :62.00	NA

kable(brcaClin_char_df, caption = "Variable Character Number")

Variable Character Number
brcaClin_char	Variable_Name
16	TCGA.center.site
10	PATIENT.ID
13	Diagnosis.Age
13	Person.Gender
18	Ethnicity.Category
13	Race.Category
29	Year.Cancer.Initial.Diagnosis
3	CNA
14	Mutation.Count
62	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code
25	Overall.Survival..Months.
23	Overall.Survival.Status
21	Disease.Free..Months.
19	Disease.Free.Status
15	Metastatic.Site
30	Tumor.Other.Histologic.Subtype

# Named min_len so the base function min() is not masked
min_len <- min(brcaClin_char_df$brcaClin_char)
min_len

## [1] 3

# Named max_len so the base function max() is not masked
max_len <- max(brcaClin_char_df$brcaClin_char)
max_len

## [1] 62

brcaClin_char_df$Variable_Name[which.min(brcaClin_char_df$brcaClin_char)]

## [1] "CNA"

brcaClin_char_df$Variable_Name[which.max(brcaClin_char_df$brcaClin_char)]

## [1] "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code"
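
A more compact way to keep names and lengths together is a named vector, sketched here on three of the column names from this table. The object names are illustrative.

```r
col_names <- c("PATIENT.ID", "CNA", "Mutation.Count")

# Lengths labelled with the names they belong to
name_lengths <- setNames(nchar(col_names), col_names)
print(name_lengths)

# which.min / which.max return the position of the first extreme value
shortest <- names(name_lengths)[which.min(name_lengths)]
longest <- names(name_lengths)[which.max(name_lengths)]
print(c(shortest = shortest, longest = longest))
```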

How many unique entries are in Tumor.Other.Histologic.Subtype? Use grep to find all the entries that have ‘ductal’ OR ‘Ductal’ in them. How many subjects have this entry?

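There are 59 unique entries in Tumor.Other.Histologic.Subtype. This count includes NA (blank entries were read in as NA), so there are 58 distinct non-missing subtypes. 34 patients have an entry containing lowercase “ductal”; the pattern used below is case-sensitive, so entries spelled “Ductal”, if any, are not counted.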

tumor <- brcaClin$Tumor.Other.Histologic.Subtype
length(tumor)

## [1] 1003

uniquetumor <- unique(tumor)
length(uniquetumor)

## [1] 59

tumor_ductal_IDs <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype),select = c("PATIENT.ID", "Tumor.Other.Histologic.Subtype"))
tumor_ductal <- subset(brcaClin, grepl("ductal", brcaClin$Tumor.Other.Histologic.Subtype), select = c("Tumor.Other.Histologic.Subtype"))
length(tumor_ductal_IDs$PATIENT.ID)

## [1] 34
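
Two details affect these counts: unique counts NA as an entry, and grepl is case-sensitive unless told otherwise. The sketch below shows both on a short made-up vector; the "Ductal carcinoma" entry is hypothetical and is included only to show the case difference.

```r
# Made-up subtype entries; "Ductal carcinoma" is a hypothetical value
subtype <- c(NA, "Adenoid Cystic Carcinoma", NA,
             "invasive ductal and lobular carcinoma",
             "ductal/mucinous", "Ductal carcinoma")

# unique() keeps NA as one of the entries
n_with_na <- length(unique(subtype))
n_without_na <- length(unique(subtype[!is.na(subtype)]))
print(c(with_na = n_with_na, without_na = n_without_na))

# Case-sensitive versus case-insensitive matching
n_lower_only <- sum(grepl("ductal", subtype))
n_any_case <- sum(grepl("ductal", subtype, ignore.case = TRUE))
print(c(lower_only = n_lower_only, any_case = n_any_case))
```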

Sort the data set from shortest to longest survival time and save to a new data.frame. Where do the subjects with missing survival time end up? Use head and tail functions and please do not print out the whole dataframe!

The subjects with missing survival time (NA) end up at the bottom of the table, because order() places NA values last by default (na.last = TRUE).

brcaClin_ordered <- brcaClin[order(brcaClin$Overall.Survival..Months., decreasing = FALSE),]
is.data.frame(brcaClin_ordered)

## [1] TRUE

brcaClin_ordered_head <- head(brcaClin_ordered)
brcaClin_ordered_tail <- tail(brcaClin_ordered)
kable(brcaClin_ordered_head, caption = "brcaClin Ordered Head")

brcaClin Ordered Head
	TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Race.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival.Status	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype
157	A8	TCGA-A8-A06N	66	FEMALE	NA	NA	2009	5.58e-01	25	Stage IIIB	LIVING	DiseaseFree	NA	NA
185	A8	TCGA-A8-A081	80	FEMALE	NA	NA	2008	6.45e-01	67	Stage IIA	LIVING	DiseaseFree	NA	NA
187	A8	TCGA-A8-A083	67	FEMALE	NA	NA	2008	1.82e-01	29	Stage IIB	LIVING	DiseaseFree	NA	NA
196	A8	TCGA-A8-A08H	66	FEMALE	NA	NA	2007	2.19e-05	46	Stage IIA	LIVING	DiseaseFree	NA	NA
207	A8	TCGA-A8-A090	74	FEMALE	NA	NA	2008	2.86e-01	26	Stage IIA	LIVING	DiseaseFree	NA	NA
211	A8	TCGA-A8-A094	75	FEMALE	NA	NA	2009	4.87e-01	138	Stage IIA	LIVING	DiseaseFree	NA	NA

kable(brcaClin_ordered_tail, caption = "brcaClin Ordered Tail")

brcaClin Ordered Tail
	TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Race.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival..Months.	Overall.Survival.Status	Disease.Free..Months.	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype
874	E9	TCGA-E9-A226	45	FEMALE	NOT HISPANIC OR LATINO	WHITE	2011	0.2560000	22	Stage IIIA	NA	DECEASED	25.9	Recurred/Progressed	NA	NA
884	E9	TCGA-E9-A243	52	FEMALE	NOT HISPANIC OR LATINO	WHITE	2011	0.1010000	49	Stage IIA	NA	DECEASED	10.0	Recurred/Progressed	NA	NA
886	E9	TCGA-E9-A245	47	FEMALE	NOT HISPANIC OR LATINO	WHITE	2011	0.0345000	13	Stage IIB	NA	DECEASED	NA	NA	NA	NA
917	EW	TCGA-EW-A1P0	55	FEMALE	NOT HISPANIC OR LATINO	WHITE	2008	0.2440000	21	Stage IIB	NA	DECEASED	35.9	Recurred/Progressed	NA	IDC+ mucinous carcinoma
918	EW	TCGA-EW-A1P1	68	FEMALE	HISPANIC OR LATINO	WHITE	2008	0.0008294	1	Stage IIIC	NA	DECEASED	33.3	Recurred/Progressed	Lung	NA
944	GM	TCGA-GM-A2DA	46	FEMALE	NOT HISPANIC OR LATINO	WHITE	1993	0.1410000	21	Stage IIB	NA	DECEASED	NA	NA	Bone	NA
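
Where the missing values land is controlled by the na.last argument of order. The sketch below shows the three options on a short made-up vector of survival times.

```r
# Made-up survival times with two missing values
months <- c(14.4, NA, 0.39, 48.1, NA)

# Default (na.last = TRUE): missing values go to the end
idx_default <- order(months)
print(months[idx_default])

# na.last = FALSE: missing values come first
idx_first <- order(months, na.last = FALSE)
print(months[idx_first])

# na.last = NA: missing values are dropped from the ordering
idx_dropped <- order(months, na.last = NA)
print(months[idx_dropped])
```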

Problems 8 through 12 are best done using ggplot2

Use code from the notes to make race variable with only the levels “ASIAN”, “BLACK OR AFRICAN AMERICAN”, and “WHITE”. Be sure to factor after subsetting. Also make a new categorical age at diagnosis variable with age in decades or deciles. Then produce boxplots of CNA by diagnosis age category, with separate panels (facets) by the 3-level race variable.

Comment here about what this plot tells us and how it can be useful:

This plot shows the distribution of copy number alteration (CNA) values by age at diagnosis (in decades), with one panel per race category. Each boxplot shows the median, the lower and upper quartiles, the whiskers, and any outliers, so several variables (race, CNA, and age category) can be compared at a glance. In the ASIAN panel, CNA appears to increase with age at diagnosis; no such trend is apparent in the BLACK OR AFRICAN AMERICAN or WHITE panels. Overall, the WHITE group has lower median CNA values than the other two groups in every age category, although its outliers include the highest CNA values of all three groups. This makes boxplots a useful way to summarize a dataset quickly.

brcaClin_subset_race <- subset(brcaClin_ordered, Race.Category %in% c("ASIAN", "BLACK OR AFRICAN AMERICAN", "WHITE"))
brcaClin_subset_race$Race.Category2 <- factor(brcaClin_subset_race$Race.Category)
race_subset_table <- table(brcaClin_subset_race$Race.Category2)
brcaClin_subset_race$Race.Category <- NULL
kable(race_subset_table, caption = "brcaClin Race Subset")

brcaClin Race Subset
Var1	Freq
ASIAN	57
BLACK OR AFRICAN AMERICAN	129
WHITE	721

brcaClin_subset_race$Diag.Age.Cat <- cut(brcaClin_subset_race$Diagnosis.Age, breaks = seq(10, 100, by = 10))
brcaClin_subset_race$Diag.Age.Cat <- factor(brcaClin_subset_race$Diag.Age.Cat)
diag_table <- table(brcaClin_subset_race$Diag.Age.Cat)
head_diag.age.cat <- head(brcaClin_subset_race)
kable(head_diag.age.cat, caption = "brcaClin Diagnosis Age Category Head")

brcaClin Diagnosis Age Category Head
	TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival.Status	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype	Race.Category2	Diag.Age.Cat
577	BH	TCGA-BH-A18H	63	FEMALE	NOT HISPANIC OR LATINO	2010	0.263000	24	Stage IA	LIVING	DiseaseFree	NA	NA	WHITE	(60,70]
639	C8	TCGA-C8-A12T	43	FEMALE	NOT HISPANIC OR LATINO	2010	0.231000	120	Stage IIA	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
644	C8	TCGA-C8-A12Y	44	FEMALE	NOT HISPANIC OR LATINO	2010	0.233000	22	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
645	C8	TCGA-C8-A12Z	45	FEMALE	NOT HISPANIC OR LATINO	2010	0.270000	37	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
649	C8	TCGA-C8-A133	65	FEMALE	NOT HISPANIC OR LATINO	2010	0.005322	17	Stage IIIA	LIVING	DiseaseFree	NA	ductal/mucinous	WHITE	(60,70]
663	C8	TCGA-C8-A1HO	34	FEMALE	NOT HISPANIC OR LATINO	2010	0.320000	16	Stage IIIC	LIVING	DiseaseFree	NA	NA	ASIAN	(30,40]

kable(diag_table, caption = "Diagnosis Age Category")

Diagnosis Age Category
Var1	Freq
(20,30]	9
(30,40]	78
(40,50]	201
(50,60]	234
(60,70]	226
(70,80]	122
(80,90]	37

ggplot(brcaClin_subset_race, aes(x=Diag.Age.Cat, y=CNA, fill=Diag.Age.Cat)) + 
  geom_boxplot()+ theme_bw() +
  facet_wrap(~Race.Category2) +
  scale_fill_brewer(palette = "Set3")

Use code from the notes to create a Stage variable with only values “Stage I”, “Stage II”, “Stage III”, and “Stage IV”. Look at the table to find the name of the current stage variable. Rename variables like “Stage Ia” and “Stage Ib” to just “Stage I”. gsub may help with this.

Then make a new column in the table of log mutation count. Remember you cannot log a 0 and you will need to add some small number. If there are 0 values in the mutation count, try something like log(Mutation+0.25)

Then produce a scatterplot of CNA (x variable) by log mutation count (y variable) with points color coded by stage of disease. Use geom_smooth to draw a smoothed line depicting the relationship between CNA and log mutation count.

brcaClin_head_race <- head(brcaClin_subset_race)
kable(brcaClin_head_race, caption = "brcaClin Head")

brcaClin Head
	TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival.Status	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype	Race.Category2	Diag.Age.Cat
577	BH	TCGA-BH-A18H	63	FEMALE	NOT HISPANIC OR LATINO	2010	0.263000	24	Stage IA	LIVING	DiseaseFree	NA	NA	WHITE	(60,70]
639	C8	TCGA-C8-A12T	43	FEMALE	NOT HISPANIC OR LATINO	2010	0.231000	120	Stage IIA	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
644	C8	TCGA-C8-A12Y	44	FEMALE	NOT HISPANIC OR LATINO	2010	0.233000	22	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
645	C8	TCGA-C8-A12Z	45	FEMALE	NOT HISPANIC OR LATINO	2010	0.270000	37	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]
649	C8	TCGA-C8-A133	65	FEMALE	NOT HISPANIC OR LATINO	2010	0.005322	17	Stage IIIA	LIVING	DiseaseFree	NA	ductal/mucinous	WHITE	(60,70]
663	C8	TCGA-C8-A1HO	34	FEMALE	NOT HISPANIC OR LATINO	2010	0.320000	16	Stage IIIC	LIVING	DiseaseFree	NA	NA	ASIAN	(30,40]

brcaClin_subset_race$cancer.stage <- gsub(pattern = "[ABC]", replacement = "", 
                            x = brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
stage.table <- table(brcaClin_subset_race$Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code)
kable(stage.table, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code")

Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code
Var1	Freq
Stage I	76
Stage IA	77
Stage IB	11
Stage II	8
Stage IIA	305
Stage IIB	209
Stage III	2
Stage IIIA	128
Stage IIIB	19
Stage IIIC	47
Stage IV	11
Stage Tis	1
Stage X	12

no.abc <- table(brcaClin_subset_race$cancer.stage)
kable(no.abc, caption = "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Condensed")

Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code Condensed
Var1	Freq
Stage I	164
Stage II	522
Stage III	196
Stage IV	11
Stage Tis	1
Stage X	12

cancer.stage_levels <- paste("Stage", c("I", "II", "III", "IV"), sep = " ")
cancer.stage_levels

## [1] "Stage I"   "Stage II"  "Stage III" "Stage IV"

brcaClin_subset_race$cancer.stage2 <- factor(brcaClin_subset_race$cancer.stage, levels = c("Stage I", "Stage II", "Stage III", "Stage IV"))
brcaClin_subset_race<-subset(brcaClin_subset_race, !is.na(brcaClin_subset_race$cancer.stage2))
table.cancer.stage2 <- table(brcaClin_subset_race$cancer.stage2)
kable(table.cancer.stage2, caption = "Cancer Stages I-IV")

Cancer Stages I-IV
Var1	Freq
Stage I	164
Stage II	522
Stage III	196
Stage IV	11
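
The steps above (strip the sub-stage letter, then keep only Stages I to IV) can be sketched on a short vector of stage labels taken from the table above. This version anchors the pattern to the end of the string with sub, which is a variation on the gsub call used in this homework, not the same call.

```r
stage_raw <- c("Stage IA", "Stage IIB", "Stage IIIC", "Stage IV", "Stage X", "Stage Tis")

# Remove a single trailing A, B, or C; other labels are left unchanged
stage_collapsed <- sub("[ABC]$", "", stage_raw)
print(stage_collapsed)

# Values outside the listed levels become NA when the factor is built
stage_levels <- paste("Stage", c("I", "II", "III", "IV"))
stage_factor <- factor(stage_collapsed, levels = stage_levels)
print(table(stage_factor, useNA = "ifany"))
```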

brcaClin_subset_race$cancer.stage <- NULL

brcaClin_subset_race$logmc <- log((brcaClin_subset_race$Mutation.Count)+.000001)

head_logmc <- head(brcaClin_subset_race)
kable(head_logmc, caption = "brcaClin Head LogMC")

brcaClin Head LogMC
	TCGA.center.site	PATIENT.ID	Diagnosis.Age	Person.Gender	Ethnicity.Category	Year.Cancer.Initial.Diagnosis	CNA	Mutation.Count	Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code	Overall.Survival.Status	Disease.Free.Status	Metastatic.Site	Tumor.Other.Histologic.Subtype	Race.Category2	Diag.Age.Cat	cancer.stage2	logmc
577	BH	TCGA-BH-A18H	63	FEMALE	NOT HISPANIC OR LATINO	2010	0.263000	24	Stage IA	LIVING	DiseaseFree	NA	NA	WHITE	(60,70]	Stage I	3.178054
639	C8	TCGA-C8-A12T	43	FEMALE	NOT HISPANIC OR LATINO	2010	0.231000	120	Stage IIA	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]	Stage II	4.787492
644	C8	TCGA-C8-A12Y	44	FEMALE	NOT HISPANIC OR LATINO	2010	0.233000	22	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]	Stage II	3.091042
645	C8	TCGA-C8-A12Z	45	FEMALE	NOT HISPANIC OR LATINO	2010	0.270000	37	Stage IIB	LIVING	DiseaseFree	NA	NA	ASIAN	(40,50]	Stage II	3.610918
649	C8	TCGA-C8-A133	65	FEMALE	NOT HISPANIC OR LATINO	2010	0.005322	17	Stage IIIA	LIVING	DiseaseFree	NA	ductal/mucinous	WHITE	(60,70]	Stage III	2.833213
663	C8	TCGA-C8-A1HO	34	FEMALE	NOT HISPANIC OR LATINO	2010	0.320000	16	Stage IIIC	LIVING	DiseaseFree	NA	NA	ASIAN	(30,40]	Stage III	2.772589
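
The size of the offset matters only for counts at or near 0, but there it matters a lot. The sketch below compares the offset used here (1e-6), the 0.25 suggested in the question, and log1p, on made-up counts; it makes no claim about whether the homework data contain zeros.

```r
# Made-up mutation counts, including a zero
mutation_count <- c(0, 1, 16, 139)

log_tiny <- log(mutation_count + 1e-6)      # offset used in this homework
log_quarter <- log(mutation_count + 0.25)   # offset suggested in the question
log_plus1 <- log1p(mutation_count)          # log(1 + x)

comparison <- data.frame(mutation_count, log_tiny, log_quarter, log_plus1)
print(round(comparison, 3))
```

With the 1e-6 offset a zero count maps to about -13.8, far from the other points, while 0.25 maps it to about -1.39 and log1p maps it to 0. For larger counts the three columns are close to each other.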

ggplot(brcaClin_subset_race, aes(x=CNA, y=logmc, color=cancer.stage2)) +theme_bw() +
  geom_point(size=2) +
  geom_smooth(method=lm , color="magenta3", fill="steelblue", se=TRUE)

Produce the same plot as the previous problem of CNA (x-axis) vs. log mutation count (y-axis) but this time use facet_wrap to create separate panels for each stage rather than shading. Also add a smooth line with geom_smooth.

Comment here about what this plot tells us and how it can be useful:

This plot is easier to read than the one in the previous question because the points from different stages no longer overlap, so the log mutation count and CNA can be seen for each stage. In addition, instead of a single overall trend line, this plot gives a trend line for each stage. The correlation between CNA and log mutation count appears low within each stage, though the lines trend slightly upward, with log mutation count increasing as CNA increases. This kind of plot is useful for quickly spotting outliers and trends across several variables, here CNA, log mutation count, and cancer stage.

ggplot(brcaClin_subset_race, aes(x = CNA, y = logmc)) +
  theme_bw() +
  geom_point(size = 2) +
  facet_wrap(~cancer.stage2) +
  geom_smooth(method = "lm", color = "magenta3", fill = "steelblue", se = TRUE)

First make a barplot of number of records per stage. Add in shading by survival status for the second plot. Then make a third version which will split the plot with separate panels for race (facet).

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = cancer.stage2)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage",
       x = "Cancer Stage",
       y = "Record Count") +
  scale_fill_manual(values = c("green","orange","slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage by survival status",
       x = "Cancer Stage",
       y = "Record Count")+
  scale_fill_manual(values = c("slateblue","magenta") )

ggplot(brcaClin_subset_race, aes(x=cancer.stage2, fill = Overall.Survival.Status)) + theme_bw() +
  geom_bar() +
  labs(title = "Records per Stage by Survival Status and Race",
       x = "Cancer Stage",
       y = "Record Count")+
  facet_wrap(~Race.Category2) +
  scale_fill_manual(values = c("slateblue","magenta") )

For the final plot, use the plotly package to produce a scatterplot of CNA (x-axis) vs. log mutation count (y-axis), with points color-coded by diagnosis age category.

OPTIONAL: For added difficulty, try to make a hover-over text label using the patient ID.

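# mode = "markers" states the trace mode explicitly; the object is not named plotly to avoid confusion with the package
cna_plot <- plot_ly(data = brcaClin_subset_race, type = "scatter", mode = "markers", x = ~CNA, y = ~logmc, color = ~Diag.Age.Cat)
cna_plot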

# text = ~PATIENT.ID adds the patient ID to the hover label
cna_plot_ID <- plot_ly(data = brcaClin_subset_race, type = "scatter", mode = "markers", x = ~CNA, y = ~logmc, color = ~Diag.Age.Cat, text = ~PATIENT.ID)
cna_plot_ID

BMI 5750 - Homework for R

Meghan

Due: 06/13/2025 at 5 pm
