Meaningful question for analysis:
Did survival time for stage T2 prostate cancer patients differ by cancer grade (moderately versus poorly differentiated)?

Data Set:
ProstateSurvival

Description of dataset (The description below was copied from a Word file that accompanied the dataset and is not my own work. I included it for the benefit of readers of this .Rmd):
This data set contains survival times for two competing causes: time from prostate cancer diagnosis to death from prostate cancer, and time from prostate cancer diagnosis to death from other causes. The data set also contains information on several risk factors. The data in this data set are simulated from detailed competing risk survival curves and counts of numbers of patients per group presented in Lu-Yao et al. (2009). Thus, the simulated data presented here contain many of the characteristics of the original SEER-Medicare prostate cancer data used in Lu-Yao et al. (2009).

Number of Records:
A data frame with 14294 observations on the following 5 variables.

Variables:
grade = a factor with levels mode (moderately differentiated) and poor (poorly differentiated)
stage = a factor with levels T1ab (Stage T1, clinically diagnosed), T1c (Stage T1, diagnosed via a PSA test), and T2 (Stage T2)
ageGroup = a factor with levels 66-69 70-74 75-79 80+
survTime = time (months) from diagnosis to death or last date known alive
status = a censoring variable, 0, (censored), 1 (death from prostate cancer), and 2 (death from other causes)

Question 1
Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R (Conclusions are included below the code)

summary(dataset)
##        X            grade              stage             ageGroup        
##  Min.   :    1   Length:14294       Length:14294       Length:14294      
##  1st Qu.: 3574   Class :character   Class :character   Class :character  
##  Median : 7148   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 7148                                                           
##  3rd Qu.:10721                                                           
##  Max.   :14294                                                           
##     survTime          status      
##  Min.   :  0.00   Min.   :0.0000  
##  1st Qu.: 13.00   1st Qu.:0.0000  
##  Median : 30.00   Median :0.0000  
##  Mean   : 38.96   Mean   :0.5092  
##  3rd Qu.: 60.00   3rd Qu.:1.0000  
##  Max.   :119.00   Max.   :2.0000
table(dataset$ageGroup)
## 
## 66-69 70-74 75-79   80+ 
##  1423  2952  4313  5606
table(dataset$grade)
## 
##  mode  poor 
## 10988  3306
table(dataset$stage)
## 
## T1ab  T1c   T2 
## 3881 4493 5920
table(dataset$status)
## 
##     0     1     2 
## 10255   799  3240
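#**Conclusions:**
#Mean recorded survival time (survTime, which includes censored follow-up) is ~39 months from diagnosis, with a median of 30 months.
#~69% of the patients were 75 or older at diagnosis (age groups range from 66-69 to 80+).
#~77% of the patients were diagnosed with moderately differentiated cancer as opposed to poorly differentiated cancer.
#~41% of patients had a stage T2 diagnosis.
#~72% of records are censored (status 0), so for most patients survTime is time to last follow-up rather than time to death.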

#survTime descriptive statistics
#mean survival time after diagnosis was 38.96 months
#median survival time after diagnosis was 30 months
#1st quartile:13
#3rd quartile:60
#min survival time= 0 months
#maximum survival time=119 months

#ageGroup descriptive statistics
#66-69 1423 patients 
#70-74 2952 patients
#75-79 4313 patients
#80+   5606 patients

#grade descriptive statistics
#moderately differentiated =10988
#poorly differentiated     = 3306

#stage descriptive statistics
#T1ab=3881
#T1c =4493
#T2  =5920

#status descriptive statistics
#0(censored)                  =10255
#1(death from prostate cancer)=  799
#2(death from other causes)   = 3240
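
The percentages quoted in the conclusions can be checked directly from the counts printed above. The sketch below re-enters those counts by hand (the vector names and the pct helper are illustrative, not part of the original script) and converts them to percentages with prop.table(). With the full data frame loaded, prop.table(table(dataset$grade)) should give the same shares.

```r
# Counts copied from the table() output above; the vector names are illustrative.
age_counts    <- c("66-69" = 1423, "70-74" = 2952, "75-79" = 4313, "80+" = 5606)
grade_counts  <- c(mode = 10988, poor = 3306)
stage_counts  <- c(T1ab = 3881, T1c = 4493, T2 = 5920)
status_counts <- c("0" = 10255, "1" = 799, "2" = 3240)

# Hypothetical helper: convert counts to percentages rounded to one decimal.
pct <- function(counts) round(100 * prop.table(counts), 1)

pct(age_counts)
pct(grade_counts)
pct(stage_counts)
pct(status_counts)

# Share of patients aged 75 or older.
pct_75_plus <- round(100 * sum(age_counts[c("75-79", "80+")]) / sum(age_counts), 1)
pct_75_plus
```

Each vector sums to 14294, the number of records, which is a quick check that the counts were copied correctly.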

Question 2
Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

#create a data frame that includes only stage T2 diagnoses
class(dataset)
## [1] "data.frame"
#subset() keeps every column by default, so no select= or data.frame() wrapper is needed
Stage2 <- subset(dataset, stage == 'T2')
View(Stage2)

#rename column survTime to survTimeMonths
names(Stage2)[names(Stage2) == "survTime"] <- "survTimeMonths"
View(Stage2)

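#recode the grade values (mode = moderately differentiated, poor = poorly differentiated)
#only the grade column is touched, so matching strings in other columns cannot be overwritten
Stage2$grade[Stage2$grade == 'mode'] <- 'moderately differentiated'
Stage2$grade[Stage2$grade == 'poor'] <- 'poorly differentiated'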
View(Stage2)
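
An alternative to overwriting the strings is to store grade as a factor with descriptive labels, which keeps the two levels in a fixed order for tables and plots. This is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not rows from the dataset). The same factor() call could be applied to the grade column while it still holds the raw mode and poor codes.

```r
# Toy stand-in for the stage T2 subset (values are made up for illustration).
toy <- data.frame(
  grade    = c("mode", "poor", "mode", "mode", "poor"),
  survTime = c(31, 12, 60, 45, 27)
)

# Rename one column by name, as in the renaming step above.
names(toy)[names(toy) == "survTime"] <- "survTimeMonths"

# Recode only the grade column, mapping the raw codes to readable labels.
toy$grade <- factor(
  toy$grade,
  levels = c("mode", "poor"),
  labels = c("moderately differentiated", "poorly differentiated")
)

table(toy$grade)
```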

Question 3
Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

#**I included a basic histogram, boxplot, and scatterplot below to demonstrate the basic concepts, followed by more elaborate ggplot2 visuals**
#basic histogram
hist(Stage2$survTimeMonths)

#basic boxplot
boxplot(Stage2$survTimeMonths)

#basic scatterplot - I realize this scatterplot shows nothing beyond the ability to create one. My dataset has only one continuous variable and the rest are categorical, so a scatterplot did not make sense.
plot(x=Stage2$survTimeMonths, y=Stage2$survTimeMonths)

##using ggplot2
library(ggplot2)
##histogram/bar charts
ggplot(Stage2, aes(x=survTimeMonths, fill=grade)) + geom_bar(stat="count") + 
  xlab("Survival Time after Diagnosis(Months)") +
  ylab("Number of Patients") +
  labs(fill = "Prostate Cancer Grade")+
  scale_y_continuous(minor_breaks = seq(0 , 120, 5), breaks = seq(0, 120, 10))
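
Because survival time is numeric, geom_histogram() with an explicit bin width is another way to draw this chart. The sketch below uses simulated values (toy, the seed, and the 6-month bin width are all assumptions chosen for illustration). Swapping in Stage2 would plot the real data.

```r
library(ggplot2)

# Simulated stand-in for Stage2; these are not values from the dataset.
set.seed(1)
toy <- data.frame(
  survTimeMonths = round(runif(200, min = 0, max = 119)),
  grade = sample(c("moderately differentiated", "poorly differentiated"),
                 size = 200, replace = TRUE)
)

# Histogram with 6-month bins starting at 0, stacked by grade.
p <- ggplot(toy, aes(x = survTimeMonths, fill = grade)) +
  geom_histogram(binwidth = 6, boundary = 0) +
  xlab("Survival Time after Diagnosis (Months)") +
  ylab("Number of Patients") +
  labs(fill = "Prostate Cancer Grade")

p
```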

#position = "fill" scales each bar to 1, so the y axis is a proportion; the labels show the raw counts
ggplot(Stage2, aes(x=ageGroup,fill=grade)) + geom_bar(position = "fill")+
  ylab("Proportion of Patients")+
  xlab("Age Group")+
  labs(fill = "Prostate Cancer Grade")+
  stat_count(geom = "text", 
             aes(label = after_stat(count)),
             position=position_fill(vjust=0.5), colour="white")

#Moderately differentiated had a slightly longer survival time per this boxplot, with a slightly higher median
library(ggplot2)
ggplot(Stage2, aes(x=grade, y=survTimeMonths, col=grade)) + 
  geom_boxplot() +
  xlab("Prostate Cancer Grade")+
  ylab("Survival Time after Diagnosis(Months)")+
  scale_color_brewer(palette = "Dark2")

#In every age group (all patients were 66 or older at diagnosis), moderately differentiated patients had a longer survival time than poorly differentiated patients.
library(ggplot2)
ggplot(Stage2, aes(x=ageGroup, y=survTimeMonths, col=grade)) + 
  geom_boxplot() +
  xlab("Age Group")+
  ylab("Survival Time after Diagnosis(Months)")+
  scale_color_brewer(palette = "Dark2")

Question 4
Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Meaningful question for analysis:
Does survival time for stage T2 prostate cancer patients vary much by cancer grade (moderately versus poorly differentiated)?

For patients diagnosed at age 66 or older with stage T2 prostate cancer, the analysis suggests that moderately differentiated patients had a slightly longer survival time, in months, than poorly differentiated patients, as shown in the boxplot below. Stratifying by grade supports this: the moderately differentiated group had a mean survival time of 39.62 months (median 31.00 months), while the poorly differentiated group had a mean of 34.74 months (median 27.00 months). The means differ by ~5 months and the medians by 4 months. This analysis did not account for cause of death (prostate cancer versus other causes) or for patients who were still alive at the conclusion of the study (censored), so these figures describe recorded follow-up time rather than true time to death.

library(ggplot2)
ggplot(Stage2, aes(x=ageGroup, y=survTimeMonths, col=grade)) + 
  geom_boxplot() +
  xlab("Age Group")+
  ylab("Survival Time after Diagnosis(Months)")+
  scale_color_brewer(palette = "Dark2")

Stratifying on grade, poorly versus moderately differentiated

T2PoorlyDiff <- subset(Stage2, grade == 'poorly differentiated')
T2ModeratelyDiff <- subset(Stage2, grade == 'moderately differentiated')
summary(T2PoorlyDiff$survTimeMonths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   27.00   34.74   53.00  119.00
summary(T2ModeratelyDiff$survTimeMonths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   14.00   31.00   39.62   61.00  119.00
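
A censored record only gives a lower bound on how long a patient lived, so means of survTimeMonths that mix censored and uncensored rows understate survival. One standard way to account for censoring, not used in the analysis above, is a Kaplan-Meier estimate. The sketch below uses the survival package that ships with R and a small made-up data frame (toy and its values are hypothetical). It treats death from any cause (status 1 or 2) as the event, which is an assumption, and it does not model the two causes as competing risks.

```r
library(survival)

# Made-up example rows; status follows the dataset's coding (0 = censored,
# 1 = death from prostate cancer, 2 = death from other causes).
toy <- data.frame(
  survTimeMonths = c(5, 12, 20, 31, 40, 8, 15, 27, 33, 60),
  status         = c(1, 0, 2, 0, 1, 2, 1, 0, 2, 0),
  grade          = rep(c("poorly differentiated", "moderately differentiated"),
                       each = 5)
)

# Assumption: any death counts as the event; censored rows are coded 0.
toy$died <- as.integer(toy$status != 0)

# Kaplan-Meier survival curves, one per grade.
km <- survfit(Surv(survTimeMonths, died) ~ grade, data = toy)
print(km)

# Log-rank test for a difference between the two curves.
lr <- survdiff(Surv(survTimeMonths, died) ~ grade, data = toy)
print(lr)
```

With the real data, replacing toy with Stage2 would fit the same model, and plot(km) would draw the two curves.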

Question 5
BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

library(tidyverse)
library(readr)
install.packages("RCurl", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/maric/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'RCurl' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\maric\AppData\Local\Temp\RtmpgH7Sh0\downloaded_packages
library(RCurl)
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
temp<-getURL("https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/prostateSurvival.csv")
#parse the downloaded text as CSV, then view the resulting data frame (not the raw text)
y <- read.csv(text = temp)
View(y)
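
Base R's read.csv() accepts a URL as its file argument, so the download and parsing steps can also be written as one call without RCurl. The sketch below wraps that call in a small helper (load_prostate is a hypothetical name) and exercises it on a temporary local file with made-up rows so that it runs without network access. Passing the raw GitHub URL in place of the file path should read the full dataset.

```r
# Hypothetical helper: read the CSV from a local path or a URL.
load_prostate <- function(src) {
  read.csv(src, stringsAsFactors = FALSE)
}

# Offline demonstration with a few made-up rows in a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "grade,stage,ageGroup,survTime,status",
  "mode,T1c,80+,18,0",
  "poor,T2,75-79,23,2",
  "mode,T2,66-69,60,1"
), tmp)

sample_rows <- load_prostate(tmp)
str(sample_rows)

# With network access, the same helper can be pointed at the raw GitHub URL:
# dataset <- load_prostate("https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/prostateSurvival.csv")
```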