Within the MASS package, the Melanoma data frame has data on 205 patients in Denmark with malignant melanoma.
This data frame contains the following columns:
time: survival time in days, possibly censored.
status: 1 = died from melanoma, 2 = alive, 3 = dead from other causes.
sex: 1 = male, 0 = female.
age: age in years.
year: year of operation.
thickness: tumour thickness in mm.
ulcer: 1 = presence, 0 = absence.
library(MASS)
str(Melanoma)
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
-> Integer. This is incorrect: the numbers are only codes for categories, so it should be a categorical variable (a factor).
-> Integer. This is also incorrect: it should also be a categorical variable (a factor).
#You will not have to do this part in your HW! But to use this data, we need to recode the variables we will use.
MM <- Melanoma # Duplicate the Melanoma data and call it MM for Malignant Melanoma
MM$status <- as.factor(MM$status) #Change status from an integer to a factor
levels(MM$status) <- list(DiedFromMelanoma = "1", Alive = "2", DiedOtherCauses = "3") #Rename the status numbers 1, 2, 3 to DiedFromMelanoma, Alive, and DiedOtherCauses
MM$ulcer <- as.factor(MM$ulcer) #Change ulcer from an integer to a factor
levels(MM$ulcer) <- list(Absence = "0", Presence = "1") #Rename the ulcer numbers 0,1 to Absence and Presence
str(MM) #Get MM data structure
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : Factor w/ 3 levels "DiedFromMelanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : Factor w/ 2 levels "Absence","Presence": 2 1 1 1 2 2 2 2 2 2 ...
-> Factor, 3.
-> Factor, 2.
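The recoding above can also be done in one step with factor(), which takes the original codes as levels and the new names as labels. This is only a sketch of an alternative, not the method the assignment uses; the vectors status_codes and ulcer_codes are illustrative and hold just the first ten values shown in the str() output.

```r
# First ten status and ulcer codes, copied from the str(Melanoma) output above
status_codes <- c(3, 3, 2, 3, 1, 1, 1, 3, 1, 1)
ulcer_codes  <- c(1, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# factor() maps each code in `levels` to the label in the same position
status_factor <- factor(status_codes,
                        levels = c(1, 2, 3),
                        labels = c("DiedFromMelanoma", "Alive", "DiedOtherCauses"))
ulcer_factor <- factor(ulcer_codes,
                       levels = c(0, 1),
                       labels = c("Absence", "Presence"))

str(status_factor)  # a factor with 3 levels
str(ulcer_factor)   # a factor with 2 levels
```

Listing the levels explicitly keeps the code-to-label mapping visible in one place, which makes it harder to attach a label to the wrong code.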
#We will use the status and ulcer variables to conduct a chi-square test of independence. First, we must do an exploratory data analysis (EDA) on the variables.
# Get table of status and save as object called status
status <- table(MM$status)
status
##
## DiedFromMelanoma Alive DiedOtherCauses
## 57 134 14
-> 57
# Get table of ulcer and save as object called ulcer
ulcer <- table(MM$ulcer)
ulcer
##
## Absence Presence
## 115 90
-> 90
# Get contingency table with ulcer on the rows and status on the columns and add the sum to each row & column
addmargins(table(MM$ulcer, MM$status))
##
## DiedFromMelanoma Alive DiedOtherCauses Sum
## Absence 16 92 7 115
## Presence 41 42 7 90
## Sum 57 134 14 205
-> 41/90= 0.4556
-> 16/115= 0.1391. 0.4556/0.1391= 3.275
-> The risk of dying from melanoma is 3.275 times as high for patients whose melanoma is ulcerated as for patients whose melanoma is not ulcerated.
-> 41/(42+7)= 0.8367
-> 16/(92+7)= 0.1616. 0.8367/0.1616= 5.178
-> The odds of dying from melanoma are 5.178 times as high for patients with an ulcer present on their melanoma as for patients without an ulcer present.
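The hand calculations above can be checked in R. The following is a minimal sketch that assumes the counts are retyped from the contingency table; the object names tab, risk, odds, relative_risk, and odds_ratio are illustrative and not part of the assignment.

```r
# Counts copied from the contingency table above (rows: ulcer, columns: status)
tab <- matrix(c(16, 92, 7,
                41, 42, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(ulcer  = c("Absence", "Presence"),
                              status = c("DiedFromMelanoma", "Alive", "DiedOtherCauses")))

died  <- tab[, "DiedFromMelanoma"]  # melanoma deaths in each ulcer group
total <- rowSums(tab)               # patients in each ulcer group

risk <- died / total                # deaths divided by all patients in the group
odds <- died / (total - died)       # deaths divided by non-deaths in the group

relative_risk <- unname(risk["Presence"] / risk["Absence"])
odds_ratio    <- unname(odds["Presence"] / odds["Absence"])

round(c(relative_risk = relative_risk, odds_ratio = odds_ratio), 3)
```

Working from unrounded proportions gives about 3.274 and 5.177. The hand calculations differ in the last digit only because they divide values that were already rounded to four decimal places.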
# Get a contingency table showing row percentages (100% for each row category), save it as rowpct and tell R to show the table
rowpct <- prop.table(table(MM$ulcer, MM$status),1)
rowpct
##
## DiedFromMelanoma Alive DiedOtherCauses
## Absence 0.13913043 0.80000000 0.06086957
## Presence 0.45555556 0.46666667 0.07777778
-> 45.56%
# Get a side-by-side bar chart based on row percentages table
barplot(rowpct, main="Ulcer and Status by Row Percentages", #Add title
xlab="Status", #Label x-axis with column name
col=c("hotpink", "pink"), #Color code bars; there must be same number of colors as row categories
legend = rownames(rowpct), #Add legend
beside=TRUE) #Put bars beside each other instead of stacked
-> There does appear to be a relationship: the proportion of patients who died from melanoma is much higher when an ulcer is present (45.56%) than when it is absent (13.91%).
Since you have commented on the relationship between ulcer and status, we will now test that relationship with statistical evidence.
-> Is there a relationship between ulcer and status among melanoma patients? Chi-square test of independence.
-> H0: There is no relationship between ulcer and status among melanoma patients in the population.
-> Ha: There is a relationship between ulcer and status among melanoma patients in the population.
#Save contingency table as ChiTable
ChiTable <- table(MM$ulcer,MM$status)
ChiTable
##
## DiedFromMelanoma Alive DiedOtherCauses
## Absence 16 92 7
## Presence 41 42 7
#Conduct chi-square test, save it as an object
chisq <- chisq.test(x = ChiTable)
We will not view the result yet, because it is only useful if the conditions for the test are met. So, check the conditions first. (R computes the expected counts as part of chisq.test(), so the test has to be run before the expected counts table is available. You could write code to compute them yourself first, but that is not necessary.)
#Get both the observed and expected values
chisq$observed #Get the observed values
##
## DiedFromMelanoma Alive DiedOtherCauses
## Absence 16 92 7
## Presence 41 42 7
round(chisq$expected,3) #Get expected values. Round three decimal places (optional)
##
## DiedFromMelanoma Alive DiedOtherCauses
## Absence 31.976 75.171 7.854
## Presence 25.024 58.829 6.146
-> Yes, all conditions are met: every expected count is at least 5 (the smallest is 6.146).
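For reference, the expected counts can also be computed without running the test: each one is (row total × column total) / grand total. A minimal sketch, assuming the observed counts are retyped from the table above; the names observed and expected are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(16, 92, 7,
                     41, 42, 7),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(ulcer  = c("Absence", "Presence"),
                                   status = c("DiedFromMelanoma", "Alive", "DiedOtherCauses")))

# outer() multiplies every row total by every column total;
# dividing by the grand total gives the expected count for each cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 3)

# Condition check: is every expected count at least 5?
all(expected >= 5)
```

The rounded values should match the output of chisq$expected shown above.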
#Tell R to show the chi-square test output we saved as chisq and compare to the values from the last two code chunks
chisq
##
## Pearson's Chi-squared test
##
## data: ChiTable
## X-squared = 26.974, df = 2, p-value = 1.389e-06
-> 26.974
-> 0.000001389
-> Since the p-value is less than 0.05, we reject the null hypothesis.
-> Since we rejected the null hypothesis, the data support the alternative hypothesis: there is a statistically significant relationship between ulcer and status among melanoma patients in the population.
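To see where the test statistic and p-value come from, they can be reproduced by hand: the statistic is the sum of (observed - expected)^2 / expected over all cells, and the degrees of freedom are (rows - 1) × (columns - 1). A minimal sketch, assuming the counts are retyped from ChiTable; the names observed, expected, stat, df, and p_value are illustrative.

```r
# Observed counts copied from ChiTable
observed <- matrix(c(16, 92, 7,
                     41, 42, 7),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over all six cells
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (2 - 1) * (3 - 1) = 2
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail area of the chi-square distribution beyond the statistic
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

round(stat, 3)
df
signif(p_value, 4)
```

These should agree with the X-squared, df, and p-value reported by chisq.test() above.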