Meaningful question for analysis:
Did survival time for stage T2 prostate cancer patients differ depending
upon grade of cancer, moderately versus poorly differentiated?
Data Set:
ProstateSurvival
Description of dataset (The description below was copied and
pasted from a Word file that accompanied the dataset and is not meant
to represent my work. I thought it would benefit the reader of this
.Rmd):
This data set contains survival times for two competing causes: time
from prostate cancer diagnosis to death from prostate cancer, and time
from prostate cancer diagnosis to death from other causes. The data set
also contains information on several risk factors. The data in this data
set are simulated from detailed competing risk survival curves and
counts of numbers of patients per group presented in Lu-Yao et
al. (2009). Thus, the simulated data presented here contain many of the
characteristics of the original SEER-Medicare prostate cancer data used
in Lu-Yao et al. (2009).
Number of Records:
A data frame with 14294 observations on the following 5 variables.
Variables:
grade = a factor with levels mode (moderately differentiated) and poor
(poorly differentiated)
stage = a factor with levels T1ab (Stage T1, clinically diagnosed), T1c
(Stage T1, diagnosed via a PSA test), and T2 (Stage T2)
ageGroup = a factor with levels 66-69 70-74 75-79 80+
survTime = time (months) from diagnosis to death or last date known
alive
status = a censoring variable, 0, (censored), 1 (death from prostate
cancer), and 2 (death from other causes)
Question 1
Data Exploration: This should include summary statistics, means,
medians, quartiles, or any other relevant information about the data
set. Please include some conclusions in the R (Conclusions are included
below the code)
summary(dataset)
##        X            grade              stage             ageGroup
##  Min.   :    1   Length:14294       Length:14294       Length:14294
##  1st Qu.: 3574   Class :character   Class :character   Class :character
##  Median : 7148   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 7148
##  3rd Qu.:10721
##  Max.   :14294
##     survTime          status
##  Min.   :  0.00   Min.   :0.0000
##  1st Qu.: 13.00   1st Qu.:0.0000
##  Median : 30.00   Median :0.0000
##  Mean   : 38.96   Mean   :0.5092
##  3rd Qu.: 60.00   3rd Qu.:1.0000
##  Max.   :119.00   Max.   :2.0000
table(dataset$ageGroup)
##
## 66-69 70-74 75-79 80+
## 1423 2952 4313 5606
table(dataset$grade)
##
## mode poor
## 10988 3306
table(dataset$stage)
##
## T1ab T1c T2
## 3881 4493 5920
table(dataset$status)
##
## 0 1 2
## 10255 799 3240
#**Conclusions:**
#Mean survival time is ~39 months with a median of 30 months from first diagnosis.
#A majority of the patients (~69%) are 75 or older (age groups range from 66-69 to 80+)
#~77% of the patients were diagnosed with moderately differentiated cancer as opposed to poorly differentiated cancer.
#~41% of patients had a Stage T2 diagnosis
#survTime descriptive statistics
#mean survival time after diagnosis was 38.96 months
#median survival time after diagnosis was 30 months
#1st quartile:13
#3rd quartile:60
#min survival time= 0 months
#maximum survival time=119 months
#ageGroup descriptive statistics
#66-69 1423 patients
#70-74 2952 patients
#75-79 4313 patients
#80+ 5606 patients
#grade descriptive statistics
#moderately differentiated =10988
#poorly differentiated = 3306
#stage descriptive statistics
#T1ab=3881
#T1c =4493
#T2 =5920
#status descriptive statistics
#0(censored) =10255
#1(death from prostate cancer)= 799
#2(death from other causes) = 3240
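The percentages quoted in the conclusions can be checked directly from the counts printed above. The following is a minimal sketch that assumes only those tabulated counts; the vector names are illustrative and not part of the original analysis. With the full data frame loaded, `prop.table(table(dataset$grade))` would give the same proportions.

```r
# Counts copied from the table() output above
grade_counts <- c(mode = 10988, poor = 3306)
stage_counts <- c(T1ab = 3881, T1c = 4493, T2 = 5920)
age_counts   <- c("66-69" = 1423, "70-74" = 2952, "75-79" = 4313, "80+" = 5606)

# prop.table() divides each count by the total; multiply by 100 for percent
grade_pct <- round(100 * prop.table(grade_counts), 1)
stage_pct <- round(100 * prop.table(stage_counts), 1)
age_pct   <- round(100 * prop.table(age_counts), 1)

# Share of patients aged 75 or older
pct_75_plus <- round(100 * sum(age_counts[c("75-79", "80+")]) / sum(age_counts), 1)

print(grade_pct)
print(stage_pct)
print(pct_75_plus)
```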
Question 2
Data wrangling: Please perform some basic transformations. They
will need to make sense but could include column renaming, creating a
subset of the data, replacing values, or creating new columns with
derived data (for example – if it makes sense you could sum two columns
together)
#create a data frame that includes only stage T2 diagnoses
class(dataset)
## [1] "data.frame"
Stage2<- data.frame(subset(dataset,dataset$stage=='T2',select=c(X,grade,stage,ageGroup,survTime,status)))
View(Stage2)
#rename column survTime to survTimeMonths
names(Stage2)[names(Stage2) == "survTime"] <- "survTimeMonths"
View(Stage2)
#relabel grade values in the grade column only: mode = moderately differentiated, poor = poorly differentiated
Stage2$grade[Stage2$grade=='mode'] <-'moderately differentiated'
Stage2$grade[Stage2$grade=='poor'] <-'poorly differentiated'
View(Stage2)
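The same three wrangling steps (subset, rename, relabel) can also be sketched with the relabeling done through `factor()`, which keeps the change confined to the `grade` column. This is only an illustration on a small made-up data frame; `toy` and `toy_t2` are hypothetical names and the rows are not taken from the real data.

```r
# Hypothetical toy rows using the same column names as the data set
toy <- data.frame(
  grade    = c("mode", "poor", "mode", "poor"),
  stage    = c("T2", "T1c", "T2", "T2"),
  survTime = c(30, 12, 61, 27),
  stringsAsFactors = FALSE
)

# Keep stage T2 rows only
toy_t2 <- toy[toy$stage == "T2", ]

# Rename survTime to survTimeMonths
names(toy_t2)[names(toy_t2) == "survTime"] <- "survTimeMonths"

# Relabel only the grade column, leaving every other column untouched
toy_t2$grade <- factor(
  toy_t2$grade,
  levels = c("mode", "poor"),
  labels = c("moderately differentiated", "poorly differentiated")
)

print(table(toy_t2$grade))
```

Storing `grade` as a factor also fixes the order in which the two groups appear in later tables and plots.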
Question 3
Graphics: Please make sure to display at least one scatter plot,
box plot and histogram. Don’t be limited to this. Please explore the
many other options in R packages such as ggplot2.
#**I included the basic histogram, boxplot and scatterplot below to demonstrate the basic concepts and then included more elaborate ggplot2 visuals after them**
#basic histogram ()
hist(Stage2$survTimeMonths)
#basic boxplot
boxplot(Stage2$survTimeMonths)
#basic scatterplot - I realize this scatterplot does not show anything other than the ability to create a scatterplot. However, my dataset has only one continuous variable; the rest are categorical, so a meaningful scatterplot was not possible.
plot(x=Stage2$survTimeMonths, y=Stage2$survTimeMonths)
##using ggplot2
library(ggplot2)
##histogram/bar charts
ggplot(Stage2, aes(x=survTimeMonths, fill=grade)) + geom_bar(stat="count") +
xlab("Survival Time after Diagnosis(Months)") +
ylab("Number of Patients") +
labs(fill = "Prostate Cancer Grade")+
scale_y_continuous(minor_breaks = seq(0 , 120, 5), breaks = seq(0, 120, 10))
ggplot(Stage2, aes(x=ageGroup,fill=grade)) + geom_bar(position = "fill")+
ylab("Percent of Patients")+
xlab("Age Group")+
labs(fill = "Prostate Cancer Grade")+
stat_count(geom = "text",
aes(label = after_stat(count)),
position=position_fill(vjust=0.5), colour="white")
#Moderately differentiated had a slightly longer survival time per this boxplot, with a slightly higher median
library(ggplot2)
ggplot(Stage2, aes(x=grade, y=survTimeMonths, col=grade)) +
geom_boxplot() +
xlab("Prostate Cancer Grade")+
ylab("Survival Time after Diagnosis(Months)")+
scale_color_brewer(palette = "Dark2")
#In every age group, moderately differentiated patients had a longer survival time than poorly differentiated patients (all patients were diagnosed over age 65).
library(ggplot2)
ggplot(Stage2, aes(x=ageGroup, y=survTimeMonths, col=grade)) +
geom_boxplot() +
xlab("Age Group")+
ylab("Survival Time after Diagnosis(Months)")+
scale_color_brewer(palette = "Dark2")
Question 4
Meaningful question for analysis: Please state at the beginning
a meaningful question for analysis. Use the first three steps and
anything else that would be helpful to answer the question you are
posing from the data set you chose. Please write a brief conclusion
paragraph in R markdown at the end.
Meaningful question for analysis:
Does survival time for stage T2 prostate cancer patients vary
much depending upon grade of cancer, moderately versus poorly
differentiated?
For patients diagnosed over 65 years of age with stage T2 prostate cancer, the analysis suggests that patients with moderately differentiated cancer had a slightly longer survival time in months than patients with poorly differentiated cancer, as evidenced in the boxplot below. This finding is also supported by lower mean and median months survived in the poorly differentiated group when stratifying on grade. Moderately differentiated T2 patients had a mean survival time of 39.62 months and a median of 31.00 months, versus a mean of 34.74 months and a median of 27.00 months for poorly differentiated T2 patients. The mean survival time therefore differed by ~5 months and the median by 4 months. This analysis did not control for cause of death (prostate cancer or other causes) or for patients still alive at the conclusion of the study.
library(ggplot2)
ggplot(Stage2, aes(x=ageGroup, y=survTimeMonths, col=grade)) +
geom_boxplot() +
xlab("Age Group")+
ylab("Survival Time after Diagnosis(Months)")+
scale_color_brewer(palette = "Dark2")
Stratifying on grade, poorly versus moderately differentiated
T2PoorlyDiff<- data.frame(subset(Stage2,Stage2$grade=='poorly differentiated',select=c(X,grade,stage,ageGroup,survTimeMonths,status)))
T2ModeratelylyDiff<- data.frame(subset(Stage2,Stage2$grade=='moderately differentiated',select=c(X,grade,stage,ageGroup,survTimeMonths,status)))
summary(T2PoorlyDiff$survTimeMonths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.00 27.00 34.74 53.00 119.00
summary(T2ModeratelylyDiff$survTimeMonths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 14.00 31.00 39.62 61.00 119.00
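The two per-grade summaries can also be produced in one step with `tapply()`, without building a separate data frame for each grade. The sketch below uses a small made-up data frame (`toy` is a hypothetical name and its rows are not real records); the final two lines only redo the arithmetic on the means and medians reported in the output above.

```r
# Hypothetical toy rows; values chosen for illustration only
toy <- data.frame(
  grade = rep(c("moderately differentiated", "poorly differentiated"), each = 3),
  survTimeMonths = c(14, 31, 61, 12, 27, 53),
  stringsAsFactors = FALSE
)

# One summary per grade, returned as a named list
by_grade <- tapply(toy$survTimeMonths, toy$grade, summary)

# Group medians and means as named vectors
toy_medians <- tapply(toy$survTimeMonths, toy$grade, median)
toy_means   <- tapply(toy$survTimeMonths, toy$grade, mean)

# Differences between grades, using the values reported above for stage T2
mean_diff   <- 39.62 - 34.74  # difference in mean months
median_diff <- 31.00 - 27.00  # difference in median months

print(by_grade)
print(c(mean_diff = mean_diff, median_diff = median_diff))
```

On the real data the equivalent call would be `tapply(Stage2$survTimeMonths, Stage2$grade, summary)`.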
Question 5
BONUS – place the original .csv in a github file and have R read
from the link. This will be a very useful skill as you progress in your
data science education and career.
library(tidyverse)
library(readr)
install.packages("RCurl", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/maric/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'RCurl' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\maric\AppData\Local\Temp\RtmpgH7Sh0\downloaded_packages
library(RCurl)
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
temp<-getURL("https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/prostateSurvival.csv")
y<- read.csv(text=temp)
View(y)
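As an alternative to RCurl, base R's `read.csv()` can usually read a raw GitHub URL directly. The sketch below wraps that call in a helper that is not run here because it needs network access; `read_prostate`, `csv_text`, and `demo` are hypothetical names. The inline rows are made up and only show how the `text=` argument used above parses CSV content held in a string.

```r
# URL taken from the code above
csv_url <- "https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/prostateSurvival.csv"

# read.csv() accepts a URL directly; calling this helper requires network access
read_prostate <- function(url = csv_url) {
  read.csv(url, stringsAsFactors = FALSE)
}

# Offline illustration of the text= route, using made-up rows
csv_text <- "grade,stage,survTime,status\nmode,T2,30,0\npoor,T2,12,1\n"
demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(demo)
```

Calling `prostate <- read_prostate()` would then return the data frame without the intermediate string.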