The dataset contains observations made on patients with melanoma. Data dictionary:
time = survival time in days since the operation
status = the patient's status at the end of the study: 1 = died from melanoma, 2 = still alive, 3 = died from causes unrelated to melanoma
sex = the patient's sex; 1 = male, 0 = female
age = age in years at the time of the operation
year = year of operation
thickness = tumour thickness in mm
ulcer = indicator of ulceration; 1 = present, 0 = absent
The purpose is to find out how tumour thickness and the presence of an ulcer correlate (if at all) with the success of the operation, and how this relationship varies across age groups and gender.
melanoma<-read.csv("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/MASS/Melanoma.csv")
str(melanoma)
## 'data.frame': 205 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
By default, R reads the variables status, sex and ulcer as integers, although they are better treated as factors, so they are converted to that type. For the analysis, age is also broken down into 20-year categories, which adds a column “age_band”. Year of operation is omitted because it does not convey information relevant to my analysis.
library(plyr)
melanoma$ulcer<-factor(melanoma$ulcer)
melanoma$sex<-factor(melanoma$sex)
melanoma$status<-factor(melanoma$status)
df<-data.frame(melanoma)
df$age_band<- cut(df$age, breaks=seq(0,100,20))
# data dictionary: ulcer 1=present, 0=absent
df$ulcer<-revalue(df$ulcer, c("0"="no", "1"="yes"))
df$sex<-revalue(df$sex, c("0"="F", "1"="M"))
df$status<-revalue(df$status, c("1"="died from melanoma", "2"="alive","3"="died other reason"))
df$year<-NULL
colnames(df)<-c("id","time_since_operation","status","gender","age","thickness","ulcer","age_band")
summary(df)
## id time_since_operation status gender
## Min. : 1 Min. : 10 died from melanoma: 57 F:126
## 1st Qu.: 52 1st Qu.:1525 alive :134 M: 79
## Median :103 Median :2005 died other reason : 14
## Mean :103 Mean :2153
## 3rd Qu.:154 3rd Qu.:3042
## Max. :205 Max. :5565
## age thickness ulcer age_band
## Min. : 4.00 Min. : 0.10 no :115 (0,20] : 9
## 1st Qu.:42.00 1st Qu.: 0.97 yes: 90 (20,40] :37
## Median :54.00 Median : 1.94 (40,60] :92
## Mean :52.46 Mean : 2.92 (60,80] :61
## 3rd Qu.:65.00 3rd Qu.: 3.56 (80,100]: 6
## Max. :95.00 Max. :17.42
str(df)
## 'data.frame': 205 obs. of 8 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ time_since_operation: int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : Factor w/ 3 levels "died from melanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 1 2 1 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ thickness : num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : Factor w/ 2 levels "no","yes": 2 1 1 1 2 2 2 2 2 2 ...
## $ age_band : Factor w/ 5 levels "(0,20]","(20,40]",..: 4 3 3 4 3 2 4 3 3 4 ...
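A recoding step like this is easy to get backwards, so it can help to state each code and its label side by side and then cross-tabulate the raw codes against the new labels. The sketch below is one possible check, not part of the original analysis. It assumes the copy of the data bundled with the MASS package (`MASS::Melanoma`) matches the CSV apart from the row-index column, and it uses base `factor()` with explicit `levels` and `labels` instead of `plyr::revalue`.

```r
# Assumption: MASS::Melanoma is the same data as the CSV, minus the index column X.
melanoma <- MASS::Melanoma

# Pair every raw code with its label in one place (data dictionary: 1 = present).
ulcer  <- factor(melanoma$ulcer, levels = c(0, 1), labels = c("no", "yes"))
gender <- factor(melanoma$sex,   levels = c(0, 1), labels = c("F", "M"))

# Cross-tabulate raw codes against labels: every count should sit on the diagonal.
check <- table(code = melanoma$ulcer, label = ulcer)
print(check)
```

If any count appears off the diagonal, or the "yes" column lines up with code 0, the mapping is wrong.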
Structure: As a result of the data alterations mentioned above, the data set consists of 8 variables, 4 of which are factors with different numbers of levels.
Summary of Analysis: The average time alive since the operation is 2153 days (mean). Age: the average age is 52.46 years (mean). Thickness: the typical tumor thickness is 1.94 mm (median, used because the data are right-skewed; the mean is 2.92 mm).
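The choice of the median for thickness can be illustrated by comparing it with the mean: when the mean sits well above the median, a few large values are pulling it up. This is a small sketch in base R that again assumes `MASS::Melanoma` is the same data as the CSV; the variable names are illustrative.

```r
# Assumption: MASS::Melanoma is the same data as the CSV used in this report.
melanoma <- MASS::Melanoma
thick <- melanoma$thickness

# Mean and median side by side.
stats <- c(mean = mean(thick), median = median(thick))
print(round(stats, 2))

# A mean clearly larger than the median points to a long right tail.
skewed_right <- unname(stats["mean"] > stats["median"])
print(skewed_right)
```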
library(ggplot2)
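# Helper that adds axis labels, a title and text sizes to an existing ggplot object.
# Note: the first argument, data, is a ggplot object (not a data frame).
my_plot<-function(data,Xname,Yname,title,ATsize,Asize,colour,TitleSize){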
data + xlab(Xname) + ylab(Yname)+
ggtitle (title)+
theme(axis.title.x = element_text(colour="DarkGreen", size=ATsize),
axis.title.y = element_text(colour="DarkGreen", size=ATsize),
axis.text.x = element_text(size=Asize),
axis.text.y = element_text(size=Asize),
plot.title=element_text(colour=colour,
size=TitleSize,
family="Courier"))
}
Filters are used to split the data set into two subsets:
the first subset contains patients who were alive, or had died from causes unrelated to melanoma, by the end of the study;
the second subset contains patients who had died from melanoma by the end of the period covered by the study.
filter1<-df$status=="died from melanoma"
df_dead<-df[filter1,]
filter2<-(df$status=="alive")|(df$status=="died other reason")
df_alive<-df[filter2,]
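Because the second filter is the complement of the first, a quick check that the two subsets together account for every row can catch a mistyped label. The sketch below works on the raw status codes and assumes `MASS::Melanoma` is the same data as the CSV; `dead` and `alive` are illustrative names, not the report's `df_dead` and `df_alive` objects.

```r
# Assumption: MASS::Melanoma is the same data as the CSV used in this report.
melanoma <- MASS::Melanoma

died  <- melanoma$status == 1        # 1 = died from melanoma
dead  <- melanoma[died, ]
alive <- melanoma[!died, ]           # 2 = alive, 3 = died from other causes

# The two subsets should partition the data with no row lost or duplicated.
stopifnot(nrow(dead) + nrow(alive) == nrow(melanoma))
sizes <- c(dead = nrow(dead), alive = nrow(alive))
print(sizes)
```

The negation `!died` is only safe here because status has no missing values.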
f<-ggplot(data=df_dead, aes (x=thickness, y=time_since_operation))
fk<-f+geom_point(colour="DarkGreen")+geom_smooth()
k<-my_plot(fk,"Thickness","Survival Time After Operation","Thickness Distribution By Survival Time (not survived)",8,6,"DarkRed",10)
k
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
j<-ggplot(data=df_alive, aes (x=thickness, y=time_since_operation))
lj<-j+geom_point(colour="DarkGreen")+geom_smooth()
l<-my_plot(lj,"Thickness","Survival Time After Operation","Thickness Distribution By Survival Time (survived)",8,6,"DarkRed",10)
l
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The charts show no clear relationship between survival time after the operation and the thickness of the melanoma. Most tumors are under 5 mm thick, both among patients who survived and among those who had died by the end of the period covered. A weak negative relationship between thickness and survival time is visible among patients who had died by the end of the period.
In addition, the majority of patients who did not survive lived up to 2000 days after the operation, whereas the majority of patients who were still alive at the end of the study had survived for more than 2000 days. This may indicate that 2000 days is a survival break point for patients who were operated on.
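Both visual impressions can be put into numbers. The sketch below computes a Spearman rank correlation between thickness and time among patients who died from melanoma, and the share of each group on either side of the 2000-day mark. It assumes `MASS::Melanoma` matches the CSV; the Spearman method is my choice here because thickness is skewed, and the original analysis does not run a test. Note that for patients still alive, time is the length of follow-up rather than time to death.

```r
# Assumption: MASS::Melanoma holds the same 205 rows as the CSV used in this report.
melanoma <- MASS::Melanoma
dead  <- melanoma[melanoma$status == 1, ]   # died from melanoma
alive <- melanoma[melanoma$status != 1, ]   # alive, or died from other causes

# Rank correlation between thickness and time among patients who died.
# exact = FALSE uses the asymptotic p-value, which avoids the tied-ranks warning.
rho <- cor.test(dead$thickness, dead$time, method = "spearman", exact = FALSE)
print(rho$estimate)
print(rho$p.value)

# Share of each group on either side of the 2000-day mark.
shares <- c(dead_under_2000 = mean(dead$time < 2000),
            alive_over_2000 = mean(alive$time > 2000))
print(round(shares, 2))
```

A negative estimate with a small p-value would support the negative relationship seen in the first chart; shares well above 0.5 would support the 2000-day split.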
tm<-ggplot(data=df_alive,aes(x=time_since_operation))
tm+geom_histogram(binwidth=100,colour="DarkBlue",fill="DarkGreen")+facet_grid(ulcer~.) + ggtitle ("Time since operations being alive by existence of ulcer (survived patients)")
tl<-ggplot(data=df_dead,aes(x=time_since_operation))
tl+geom_histogram(binwidth=100,colour="DarkBlue")+facet_grid(ulcer~.)+ggtitle ("Time since operations being alive by existence of ulcer (not survived patients)")
There is no visible relationship between the presence of an ulcer and time alive in either group of patients.
hu<-ggplot(data=df_dead,aes(x=age_band, y=time_since_operation))
hk<-hu+geom_boxplot(size=0.5)+geom_point(size=0.5)
k<-my_plot(hk,"Age Band","Survival Time After Operation","Age By Survival Time (Not Survived Patients)",8,6,"DarkRed",10)
k
xm<-ggplot(data=df_alive,aes(x=age_band, y=time_since_operation))
xz<-xm+geom_boxplot(size=0.5, colour="DarkGreen")+geom_point(size=0.5)
z<-my_plot(xz,"Age Band","Survival Time After Operation","Age By Survival Time (Survived Patients)",8,6,"DarkRed",10)
z
For patients who did not survive: the largest groups are 40-60 and 60-80, with median survival times of approximately 1200 and 1100 days respectively (read from the boxplots); the 40-60 group has the greater spread. The 80-100 group has the shortest survival time. The trend: the older the patient, the shorter the survival time.
For patients who survived: the largest groups are 20-40, 40-60 and 60-80, with median survival times of approximately 3100, 2250 and 2050 days respectively. The 80-100 group has the shortest survival time. The first two groups have a greater spread than the last one. The trend: the older the patient, the shorter the survival time.
Difference: more observations fall within the 20-40 age group for patients who survived than for those who did not.
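The figures read off the boxplots can be cross-checked with a small table of medians and group sizes per age band. This is a sketch using base `aggregate()`, assuming `MASS::Melanoma` matches the CSV; the column `died_melanoma` is a hypothetical helper introduced only for this example.

```r
# Assumption: MASS::Melanoma is the same data as the CSV used in this report.
melanoma <- MASS::Melanoma
melanoma$age_band <- cut(melanoma$age, breaks = seq(0, 100, 20))
melanoma$died_melanoma <- melanoma$status == 1   # hypothetical helper column

# Median time and number of patients for each age band within each group.
med <- aggregate(time ~ age_band + died_melanoma, data = melanoma, FUN = median)
n   <- aggregate(time ~ age_band + died_melanoma, data = melanoma, FUN = length)
med$n <- n$time                                  # same grouping, same row order
names(med)[names(med) == "time"] <- "median_time"
print(med)
```

Bands with only a handful of patients, such as (0,20] and (80,100], give unstable medians and are best read with caution.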
tz<-ggplot(data=df_alive,aes(x=time_since_operation))
tz+geom_histogram(binwidth=100,colour="DarkBlue",fill="DarkGreen")+facet_grid(gender~.) + ggtitle ("Time since operations being alive by gender (survived patients)")
ta<-ggplot(data=df_dead,aes(x=time_since_operation))
ta+geom_histogram(binwidth=100,colour="DarkBlue")+facet_grid(gender~.)+ggtitle ("Time since operations being alive by gender (not survived patients)")
There is no visible correlation in the analysis by gender.
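The visual comparisons by ulceration and by gender can be complemented with a simple two-group test. The sketch below applies a Wilcoxon rank-sum test to time among patients who died from melanoma, once split by ulcer and once by sex. The test is my addition rather than part of the original analysis, `p_by` is a hypothetical helper, and `MASS::Melanoma` is assumed to match the CSV.

```r
# Assumption: MASS::Melanoma is the same data as the CSV used in this report.
melanoma <- MASS::Melanoma
dead <- melanoma[melanoma$status == 1, ]   # died from melanoma

# Hypothetical helper: rank-sum p-value for time across a two-level column.
p_by <- function(data, group) {
  parts <- split(data$time, data[[group]])   # one numeric vector per level
  stopifnot(length(parts) == 2)
  # exact = FALSE uses the normal approximation, which tolerates tied values.
  wilcox.test(parts[[1]], parts[[2]], exact = FALSE)$p.value
}

p_values <- c(ulcer = p_by(dead, "ulcer"), sex = p_by(dead, "sex"))
print(round(p_values, 3))
```

Large p-values would be consistent with the histograms showing no visible difference; the same helper can be applied to the subset of surviving patients.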
There is no clear correlation between survival time and tumor thickness or ulceration, whether broken down by age or by gender. However, among patients who died, a weak negative correlation was found between thickness and time alive after the operation.
The following observations were made:
1. For the majority of patients, tumor thickness is within 5 mm.
2. The death break point is approximately 2000 days after the operation.
3. The older the patient, the shorter the survival time.
BONUS
melanoma_MYgit<-read.csv("https://raw.githubusercontent.com/olgashiligin/R-assignment/master/melanoma.csv")
head(melanoma_MYgit,10)
## X time status sex age year thickness ulcer
## 1 1 10 3 1 76 1972 6.76 1
## 2 2 30 3 1 56 1968 0.65 0
## 3 3 35 2 1 41 1977 1.34 0
## 4 4 99 3 0 71 1968 2.90 0
## 5 5 185 1 1 52 1965 12.08 1
## 6 6 204 1 1 28 1971 4.84 1
## 7 7 210 1 1 77 1972 5.16 1
## 8 8 232 3 0 60 1974 3.22 1
## 9 9 232 1 1 49 1968 12.88 1
## 10 10 279 1 0 68 1971 7.41 1