melanoma.csv

The dataset contains observations made on patients with melanoma. Data dictionary:

  1. time = survival time in days since the operation

  2. status = the patient's status at the end of the study; 1 = died from melanoma, 2 = still alive, 3 = died from causes unrelated to melanoma

  3. sex = the patient's sex; 1 = male, 0 = female

  4. age = age in years at the time of the operation

  5. year = year of operation

  6. thickness = tumour thickness in mm

  7. ulcer = indicator of ulceration; 1 = present, 0 = absent

The purpose is to find out how tumour thickness and the presence of an ulcer correlate (if at all) with the success of the operation, and how this correlation varies across age groups and genders.

melanoma<-read.csv("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/MASS/Melanoma.csv")
str(melanoma)
## 'data.frame':    205 obs. of  8 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ time     : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status   : int  3 3 2 3 1 1 1 3 1 1 ...
##  $ sex      : int  1 1 1 0 1 1 1 0 1 0 ...
##  $ age      : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ year     : int  1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##  $ thickness: num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer    : int  1 0 0 0 1 1 1 1 1 1 ...

Approach and assumptions

By default R reads the variables status, sex and ulcer as integers, although they are better represented as factors, so they are converted to that type. For the analysis, age is also broken down into categories, which adds a column “age_band”. Year of operation is omitted because it does not convey information relevant to my analysis.

library(plyr)
melanoma$ulcer<-factor(melanoma$ulcer)
melanoma$sex<-factor(melanoma$sex)
melanoma$status<-factor(melanoma$status)
df<-data.frame(melanoma)
df$age_band<- cut(df$age, breaks=seq(0,100,20))
# Data dictionary: ulcer 1 = present, 0 = absent
df$ulcer<-revalue(df$ulcer, c("0"="no", "1"="yes"))
df$sex<-revalue(df$sex, c("0"="F", "1"="M"))
df$status<-revalue(df$status, c("1"="died from melanoma", "2"="alive","3"="died other reason"))
df$year<-NULL
colnames(df)<-c("id","time_since_operation","status","gender","age","thickness","ulcer","age_band")
summary(df)
##        id      time_since_operation                status    gender 
##  Min.   :  1   Min.   :  10         died from melanoma: 57   F:126  
##  1st Qu.: 52   1st Qu.:1525         alive             :134   M: 79  
##  Median :103   Median :2005         died other reason : 14          
##  Mean   :103   Mean   :2153                                         
##  3rd Qu.:154   3rd Qu.:3042                                         
##  Max.   :205   Max.   :5565                                         
##       age          thickness     ulcer         age_band 
##  Min.   : 4.00   Min.   : 0.10   no :115   (0,20]  : 9  
##  1st Qu.:42.00   1st Qu.: 0.97   yes: 90   (20,40] :37  
##  Median :54.00   Median : 1.94             (40,60] :92  
##  Mean   :52.46   Mean   : 2.92             (60,80] :61  
##  3rd Qu.:65.00   3rd Qu.: 3.56             (80,100]: 6  
##  Max.   :95.00   Max.   :17.42
str(df)
## 'data.frame':    205 obs. of  8 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ time_since_operation: int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status              : Factor w/ 3 levels "died from melanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
##  $ gender              : Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 1 2 1 ...
##  $ age                 : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ thickness           : num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer               : Factor w/ 2 levels "no","yes": 2 1 1 1 2 2 2 2 2 2 ...
##  $ age_band            : Factor w/ 5 levels "(0,20]","(20,40]",..: 4 3 3 4 3 2 4 3 3 4 ...

Structure: As a result of the data alterations mentioned above, the data set consists of 8 variables, 4 of which are factors with different numbers of levels (status, gender, ulcer and age_band).
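Because the factor labels are assigned by hand, it can be worth cross-tabulating the raw integer codes against the recoded labels. The sketch below is only an illustration: it assumes that the `Melanoma` data frame shipped with the MASS package holds the same rows as the CSV read above, and the names `raw`, `ulcer_f`, `gender_f` and `status_f` are hypothetical helpers, not part of the original analysis.

```r
# Assumption: MASS::Melanoma matches the CSV loaded from Rdatasets above.
raw <- MASS::Melanoma

# Recode with explicit level/label pairs so the mapping is visible in one place.
# Labels follow the data dictionary: ulcer 1 = present, sex 1 = male.
ulcer_f  <- factor(raw$ulcer,  levels = c(0, 1), labels = c("no", "yes"))
gender_f <- factor(raw$sex,    levels = c(0, 1), labels = c("F", "M"))
status_f <- factor(raw$status, levels = c(1, 2, 3),
                   labels = c("died from melanoma", "alive", "died other reason"))

# Each raw code should land in exactly one label; all other cells stay zero.
print(table(raw = raw$ulcer,  recoded = ulcer_f))
print(table(raw = raw$sex,    recoded = gender_f))
print(table(raw = raw$status, recoded = status_f))
```

A table with non-zero counts off the expected cells would point to a swapped or mistyped label.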

Summary of Analysis:
Time: the average time alive since the operation is 2153 days (mean).
Age: the average age is 52.46 years (mean).
Thickness: the typical tumor thickness is 1.94 mm (median, used because the distribution is right-skewed: the mean is 2.92 mm).
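The gap between the mean and the median thickness can be checked directly. This is a minimal sketch, assuming again that `MASS::Melanoma` mirrors the CSV used here; `thk` is a hypothetical helper name.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above.
thk <- MASS::Melanoma$thickness

# A mean well above the median suggests a right-skewed distribution.
print(c(mean = mean(thk), median = median(thk)))

# Share of tumours at or below the mean thickness.
print(mean(thk <= mean(thk)))
```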

Function for plotting themes

library(ggplot2)
# data is a ggplot object; the function adds axis labels, a title and text sizes to it
my_plot<-function(data,Xname,Yname,title,ATsize,Asize,colour,TitleSize){
  data + xlab(Xname) + ylab(Yname)+
    ggtitle (title)+
    theme(axis.title.x = element_text(colour="DarkGreen", size=ATsize),
          axis.title.y = element_text(colour="DarkGreen", size=ATsize),
          axis.text.x = element_text(size=Asize),
          axis.text.y = element_text(size=Asize),
          plot.title=element_text(colour=colour, 
                                  size=TitleSize,
                                  family="Courier"))
  
}

Data Set Splitting

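Logical filters are used to split the data set into 2 subsets: patients who died from melanoma (df_dead) and all other patients (df_alive). Note that df_alive also includes the patients who died from unrelated causes.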

filter1<-df$status=="died from melanoma"
df_dead<-df[filter1,]

filter2<-(df$status=="alive")|(df$status=="died other reason")
df_alive<-df[filter2,]
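The same split can also be written with `split()`, which makes it easy to confirm the group sizes. The sketch below is an alternative under assumptions, not the method used above: it works on `MASS::Melanoma` (assumed to match the CSV) with the raw status codes, and `mel`, `group` and `groups` are hypothetical names.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above; status 1 = died from melanoma.
mel <- MASS::Melanoma

# One label per patient: "dead" for melanoma deaths, "alive" for everyone else
# (still alive, or died from unrelated causes).
group  <- ifelse(mel$status == 1, "dead", "alive")
groups <- split(mel, group)

# Number of patients in each subset.
print(sapply(groups, nrow))
```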

Thickness

f<-ggplot(data=df_dead, aes (x=thickness, y=time_since_operation))
fk<-f+geom_point(colour="DarkGreen")+geom_smooth()
k<-my_plot(fk,"Thickness","Survival Time After Operation","Thickness Distribution By Survival Time (not survived)",8,6,"DarkRed",10)
k
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

j<-ggplot(data=df_alive, aes (x=thickness, y=time_since_operation))
lj<-j+geom_point(colour="DarkGreen")+geom_smooth()
l<-my_plot(lj,"Thickness","Survival Time After Operation","Thickness Distribution By Survival Time (survived)",8,6,"DarkRed",10)
l
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The charts show no clear overall correlation between survival time after the operation and tumor thickness. Most tumors are within 5 mm, both for patients who survived and for those who had died by the end of the period covered. Among the patients who died, there is some negative correlation between thickness and survival time.
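The visual impression from the scatter plots can be complemented with a rank correlation. The following is a sketch rather than part of the original analysis: it assumes `MASS::Melanoma` matches the CSV, and it uses Spearman's rho (my choice, not the author's) because a few very thick tumours would otherwise dominate a linear correlation.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above; status 1 = died from melanoma.
mel   <- MASS::Melanoma
dead  <- mel[mel$status == 1, ]
other <- mel[mel$status != 1, ]

# Spearman's rho is rank-based, so outlying thickness values carry less weight.
# exact = FALSE avoids the warning about ties when computing the p-value.
rho_dead  <- cor.test(dead$thickness,  dead$time,  method = "spearman", exact = FALSE)
rho_other <- cor.test(other$thickness, other$time, method = "spearman", exact = FALSE)

print(c(dead = unname(rho_dead$estimate), other = unname(rho_other$estimate)))
print(c(dead = rho_dead$p.value,          other = rho_other$p.value))
```

A clearly negative rho with a small p-value in the `dead` group would support the negative trend read from the first chart.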

In addition, the majority of the patients who did not survive lived up to 2000 days after the operation, whereas the majority of the patients who were still alive at the end of the study had survived for more than 2000 days. This may indicate that roughly 2000 days is a survival threshold for operated patients.
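The 2000-day observation can be expressed as the share of each group whose recorded time exceeds 2000 days. This is a minimal sketch, assuming `MASS::Melanoma` matches the CSV; the cut-off constant and the names `mel`, `group` and `share_over` are illustrative.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above; status 1 = died from melanoma.
mel   <- MASS::Melanoma
group <- ifelse(mel$status == 1, "dead", "alive")

# The mean of a logical vector is the proportion of TRUE values,
# i.e. the share of patients in each group with time above 2000 days.
share_over <- tapply(mel$time > 2000, group, mean)
print(share_over)
```

For patients still alive at the end of the study, the recorded time is the length of follow-up rather than a completed survival time, so the two shares are not directly comparable.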

Ulcer

tm<-ggplot(data=df_alive,aes(x=time_since_operation))
tm+geom_histogram(binwidth=100,colour="DarkBlue",fill="DarkGreen")+facet_grid(ulcer~.) + ggtitle ("Time since operations being alive by existence of ulcer (survived patients)")

tl<-ggplot(data=df_dead,aes(x=time_since_operation))
tl+geom_histogram(binwidth=100,colour="DarkBlue")+facet_grid(ulcer~.)+ggtitle ("Time since operations being alive by existence of ulcer (not survived patients)")

There is no visible correlation between the presence of an ulcer and the time alive after the operation in either group of patients.
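To put numbers next to the histograms, the median time can be tabulated by group and ulcer status. The sketch assumes `MASS::Melanoma` matches the CSV and labels ulcer according to the data dictionary (1 = present); `mel` and `med_by_ulcer` are hypothetical names.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above.
mel <- MASS::Melanoma
mel$group <- ifelse(mel$status == 1, "dead", "alive")
mel$ulcer <- factor(mel$ulcer, levels = c(0, 1), labels = c("no", "yes"))

# Median time since operation for every group/ulcer combination.
med_by_ulcer <- aggregate(time ~ group + ulcer, data = mel, FUN = median)
print(med_by_ulcer)
```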

Age

hu<-ggplot(data=df_dead,aes(x=age_band, y=time_since_operation))
hk<-hu+geom_boxplot(size=0.5)+geom_point(size=0.5)
k<-my_plot(hk,"Age Band","Survival Time After Operation","Age By Survival Time (Not Survived Patients)",8,6,"DarkRed",10)
k

xm<-ggplot(data=df_alive,aes(x=age_band, y=time_since_operation))
xz<-xm+geom_boxplot(size=0.5, colour="DarkGreen")+geom_point(size=0.5)
z<-my_plot(xz,"Age Band","Survival Time After Operation","Age By Survival Time (Survived Patients)",8,6,"DarkRed",10)
z

For patients who did not survive: the largest groups are 40-60 and 60-80, with central survival times of approximately 1200 and 1100 days respectively (as read from the boxplots); the 40-60 group has a wider spread. The 80-100 group has the shortest survival time. The trend: the older the patient, the shorter the survival time.

For patients who survived: the largest groups are 20-40, 40-60 and 60-80, with central survival times of approximately 3100, 2250 and 2050 days respectively (as read from the boxplots). The 80-100 group has the shortest survival time. The first 2 groups have a wider spread than the last one. The trend: the older the patient, the shorter the survival time.

Difference: more observations fall within the 20-40 age group for patients who survived than for those who did not.
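The group sizes and central values read off the boxplots can be computed directly. This sketch assumes `MASS::Melanoma` matches the CSV and rebuilds the age bands with the same breaks as above; `mel`, `counts` and `medians` are hypothetical names.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above; status 1 = died from melanoma.
mel <- MASS::Melanoma
mel$group    <- ifelse(mel$status == 1, "dead", "alive")
mel$age_band <- cut(mel$age, breaks = seq(0, 100, 20))

# Number of patients per age band and group.
counts <- table(mel$age_band, mel$group)
print(counts)

# Median time since operation per age band and group; NA marks an empty cell.
medians <- tapply(mel$time, list(mel$age_band, mel$group), median)
print(medians)
```

Bands with only a handful of patients, such as (0,20] and (80,100], give unstable medians, so the trend is best judged from the three middle bands.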

Gender

tz<-ggplot(data=df_alive,aes(x=time_since_operation))
tz+geom_histogram(binwidth=100,colour="DarkBlue",fill="DarkGreen")+facet_grid(gender~.) + ggtitle ("Time since operations being alive by gender (survived patients)")

ta<-ggplot(data=df_dead,aes(x=time_since_operation))
ta+geom_histogram(binwidth=100,colour="DarkBlue")+facet_grid(gender~.)+ggtitle ("Time since operations being alive by gender (not survived patients)")

There is no visible correlation between gender and time alive after the operation.
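One hedged way to back up the reading of the histograms is a two-sample rank test of time by gender within each group. The Wilcoxon rank-sum test is my choice for illustration, not something used in the original analysis, and the sketch assumes `MASS::Melanoma` matches the CSV; all variable names are hypothetical.

```r
# Assumption: MASS::Melanoma matches the CSV loaded above; sex 1 = male, 0 = female.
mel <- MASS::Melanoma
mel$gender <- factor(mel$sex, levels = c(0, 1), labels = c("F", "M"))
dead  <- mel[mel$status == 1, ]
alive <- mel[mel$status != 1, ]

# Wilcoxon rank-sum test: do the time distributions differ between F and M?
# exact = FALSE avoids the warning about ties when computing the p-value.
w_dead  <- wilcox.test(time ~ gender, data = dead,  exact = FALSE)
w_alive <- wilcox.test(time ~ gender, data = alive, exact = FALSE)

print(c(dead = w_dead$p.value, alive = w_alive$p.value))
```

Large p-values in both groups would be consistent with the absence of a visible gender effect.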

Conclusion

No clear correlation was found between survival time and tumor thickness, ulceration or gender. However, among the patients who died, some negative correlation was found between thickness and time alive after the operation.

The following observations were made:

1. For the majority of patients, tumor thickness is within 5 mm.

2. Most deaths from melanoma occurred within approximately 2000 days of the operation.

3. The older the patient, the shorter the survival time.

BONUS

melanoma_MYgit<-read.csv("https://raw.githubusercontent.com/olgashiligin/R-assignment/master/melanoma.csv")
head(melanoma_MYgit,10)
##     X time status sex age year thickness ulcer
## 1   1   10      3   1  76 1972      6.76     1
## 2   2   30      3   1  56 1968      0.65     0
## 3   3   35      2   1  41 1977      1.34     0
## 4   4   99      3   0  71 1968      2.90     0
## 5   5  185      1   1  52 1965     12.08     1
## 6   6  204      1   1  28 1971      4.84     1
## 7   7  210      1   1  77 1972      5.16     1
## 8   8  232      3   0  60 1974      3.22     1
## 9   9  232      1   1  49 1968     12.88     1
## 10 10  279      1   0  68 1971      7.41     1