Within the MASS package, the Melanoma data frame has data on 205 patients in Denmark with malignant melanoma.

This data frame contains the following columns: time: survival time in days, possibly censored. status: 1 died from melanoma, 2 alive, 3 dead from other causes. sex: 1 = male, 0 = female. age: age in years. year: year of operation. thickness: tumour thickness in mm. ulcer: 1 = presence, 0 = absence.

Get and view the data

library(MASS)
str(Melanoma)
## 'data.frame':    205 obs. of  7 variables:
##  $ time     : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status   : int  3 3 2 3 1 1 1 3 1 1 ...
##  $ sex      : int  1 1 1 0 1 1 1 0 1 0 ...
##  $ age      : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ year     : int  1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##  $ thickness: num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer    : int  1 0 0 0 1 1 1 1 1 1 ...
  1. What type of variable is status listed as, and is that correct?

-> Integer. This is incorrect: status is a categorical variable whose numbers (1, 2, 3) are only codes for categories, so R should store it as a factor.

  1. What type of variable is ulcer listed as, and is that correct?

-> Integer. This is also incorrect: ulcer is a categorical variable (1 = presence, 0 = absence), so it should also be a factor.

#You will not have to do this part in your HW! But to use this data, we need to recode the variables we will use.

MM <- Melanoma # Duplicate the Melanoma data and call it MM for Malignant Melanoma
MM$status <- as.factor(MM$status) #Change status from an integer to a factor
levels(MM$status) <- list(DiedFromMelanoma = "1", Alive = "2", DiedOtherCauses = "3") #Rename the status numbers 1, 2, 3 to DiedFromMelanoma, Alive, and DiedOtherCauses
MM$ulcer <- as.factor(MM$ulcer) #Change ulcer from an integer to a factor
levels(MM$ulcer) <- list(Absence = "0", Presence = "1") #Rename the ulcer numbers 0,1 to Absence and Presence
str(MM) #Get MM data structure
## 'data.frame':    205 obs. of  7 variables:
##  $ time     : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status   : Factor w/ 3 levels "DiedFromMelanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
##  $ sex      : int  1 1 1 0 1 1 1 0 1 0 ...
##  $ age      : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ year     : int  1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##  $ thickness: num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer    : Factor w/ 2 levels "Absence","Presence": 2 1 1 1 2 2 2 2 2 2 ...
  1. What type of variable is status now, and how many categories does it have?

-> Factor, 3.

  1. What type of variable is ulcer now, and how many categories does it have?

-> Factor, 2.
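
A quick way to confirm that a recode like the one above mapped every code to the intended label is to cross-tabulate the original integer codes against the new factor. The sketch below is only one way to do it: it uses `factor()` with explicit `levels` and `labels` instead of the `levels<-` assignment shown earlier, and the names `status_f` and `check` are made up for this illustration.

```r
library(MASS)  # provides the Melanoma data frame

# Recode status with factor(), listing the codes and their labels in matching order
status_f <- factor(Melanoma$status,
                   levels = c(1, 2, 3),
                   labels = c("DiedFromMelanoma", "Alive", "DiedOtherCauses"))

# Cross-tabulate the original codes against the new labels.
# If the recode is right, each code falls in exactly one label column,
# so all counts sit on the diagonal of this table.
check <- table(code = Melanoma$status, label = status_f)
check
```

If any count appeared off the diagonal, a code would have been given the wrong label.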

#We will use the status and ulcer variables to conduct a chi-square test of independence. First, we must do an exploratory data analysis (EDA) on the variables.

Explore the variables

# Get table of status and save as object called status
status <- table(MM$status)
status
## 
## DiedFromMelanoma            Alive  DiedOtherCauses 
##               57              134               14
  1. How many of the melanoma patients died from melanoma?

-> 57

# Get table of ulcer and save as object called ulcer
ulcer <- table(MM$ulcer)
ulcer
## 
##  Absence Presence 
##      115       90
  1. How many of these melanoma patients have a skin ulcer on their melanoma spot?

-> 90

# Get contingency table with ulcer on the rows and status on the columns and add the sum to each row & column
addmargins(table(MM$ulcer, MM$status))
##           
##            DiedFromMelanoma Alive DiedOtherCauses Sum
##   Absence                16    92               7 115
##   Presence               41    42               7  90
##   Sum                    57   134              14 205
  1. What is the risk of dying from melanoma if a melanoma patient has an ulcer on his/her melanoma spot?

-> 41/90= 0.4556

  1. What is the relative risk (i.e. risk ratio) of dying from melanoma if the melanoma patient has an ulcer on his/her melanoma spot versus if the patient does not have a melanoma ulceration?

-> Risk without an ulcer: 16/115 = 0.1391. Relative risk: (41/90)/(16/115) = 3.274 (dividing the rounded risks, 0.4556/0.1391, gives 3.275).

  1. Interpret the relative risk above:

-> The risk of dying from melanoma for a patient with an ulcer on his/her melanoma spot is 3.274 times the risk for a patient without a melanoma ulceration.
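
The hand calculation of the relative risk can also be done in R, which avoids rounding the two risks before dividing. This is a minimal sketch, not part of the assignment; the object names `tab`, `risk`, and `rr` are made up for this illustration, and the factors are rebuilt from `Melanoma` so the chunk runs on its own.

```r
library(MASS)  # provides the Melanoma data frame

# Rebuild the labelled factors so this chunk does not depend on MM
ulcer  <- factor(Melanoma$ulcer, levels = c(0, 1),
                 labels = c("Absence", "Presence"))
status <- factor(Melanoma$status, levels = c(1, 2, 3),
                 labels = c("DiedFromMelanoma", "Alive", "DiedOtherCauses"))
tab <- table(ulcer, status)

# Risk = deaths from melanoma divided by the row total, for each ulcer group
risk <- tab[, "DiedFromMelanoma"] / rowSums(tab)

# Relative risk = risk with an ulcer divided by risk without an ulcer
rr <- unname(risk["Presence"] / risk["Absence"])
round(risk, 4)
round(rr, 3)
```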

  1. What are the odds of dying from melanoma if the melanoma patient has an ulcer on his/her melanoma spot?

-> 41/(42+7)= 0.8367

  1. What is the odds ratio for dying from melanoma if the melanoma patient has an ulcer on his/her melanoma spot versus if the patient does not have a melanoma ulceration?

-> Odds without an ulcer: 16/(92+7) = 0.1616. Odds ratio: (41/49)/(16/99) = 5.177 (dividing the rounded odds, 0.8367/0.1616, gives 5.178).

  1. Interpret the odds ratio above:

-> The odds of dying from melanoma for a patient with an ulcer on their melanoma spot are 5.177 times the odds for a patient without an ulcer on their melanoma.
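
The odds and odds ratio can be checked the same way. In this sketch the odds are deaths from melanoma divided by everyone else in the row (alive plus died of other causes), matching the hand calculation; the names `died`, `not_died`, `odds`, and `odds_ratio` are made up for this illustration.

```r
library(MASS)  # provides the Melanoma data frame

# Rebuild the labelled factors so this chunk does not depend on MM
ulcer  <- factor(Melanoma$ulcer, levels = c(0, 1),
                 labels = c("Absence", "Presence"))
status <- factor(Melanoma$status, levels = c(1, 2, 3),
                 labels = c("DiedFromMelanoma", "Alive", "DiedOtherCauses"))
tab <- table(ulcer, status)

# Odds = (died from melanoma) / (did not die from melanoma), per ulcer group
died     <- tab[, "DiedFromMelanoma"]
not_died <- rowSums(tab) - died
odds     <- died / not_died

# Odds ratio = odds with an ulcer divided by odds without an ulcer
odds_ratio <- unname(odds["Presence"] / odds["Absence"])
round(odds, 4)
round(odds_ratio, 3)
```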

# Get a contingency table showing row percentages (100% for each row category), save it as rowpct and tell R to show the table
rowpct <- prop.table(table(MM$ulcer, MM$status),1)
rowpct
##           
##            DiedFromMelanoma      Alive DiedOtherCauses
##   Absence        0.13913043 0.80000000      0.06086957
##   Presence       0.45555556 0.46666667      0.07777778
  1. What is the percentage of death from melanoma among the patients with ulceration?

-> 45.56%

# Get a side-by-side bar chart based on row percentages table
barplot(rowpct, main="Ulcer and Status by Row Percentages", #Add title
  xlab="Status", #Label x-axis with column name
  col=c("hotpink", "pink"), #Color code bars; there must be same number of colors as row categories
  legend = rownames(rowpct), #Add legend
  beside=TRUE) #Put bars beside each other instead of stacked

  1. Explain why there does or does not appear to be a relationship between the variables.

-> There does appear to be a relationship: the proportion of patients who died from melanoma is much higher when an ulcer is present (about 46%) than when it is absent (about 14%).

Conduct test of independence with Chi-square test.

You have commented on the relationship between ulcer and status; now we will test that relationship with statistical evidence.

  1. What is our research question in terms of test of independence? What is the name of test?

-> Is there a relationship between ulcer and status among melanoma patients? Chi-square test of independence.

  1. What is the null hypothesis?

-> H0: There is no relationship between ulcer and status among melanoma patients in the population.

  1. What is the alternative hypothesis?

-> Ha: There is a relationship between ulcer and status among melanoma patients in the population.

#Save contingency table as ChiTable
ChiTable <- table(MM$ulcer,MM$status)
ChiTable
##           
##            DiedFromMelanoma Alive DiedOtherCauses
##   Absence                16    92               7
##   Presence               41    42               7
#Conduct chi-square test, save it as an object
chisq <- chisq.test(x = ChiTable)

We are not going to view it yet because, unless the conditions are met, it is not useful. So, check the conditions first. (R stores the expected counts in the saved test object, so the simplest way to get them is to run the chi-square test first; you could write code to compute them yourself, but that is not necessary.)

#Get both the observed and expected values
chisq$observed #Get the observed values
##           
##            DiedFromMelanoma Alive DiedOtherCauses
##   Absence                16    92               7
##   Presence               41    42               7
round(chisq$expected,3) #Get expected values. Round three decimal places (optional)
##           
##            DiedFromMelanoma  Alive DiedOtherCauses
##   Absence            31.976 75.171           7.854
##   Presence           25.024 58.829           6.146
  1. Based on the expected values, are all the conditions met?

-> Yes, all conditions are met: every expected count is at least 5 (the smallest is 6.146).
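
For anyone curious where the expected counts come from, each one is (row total x column total) / grand total. The sketch below computes them directly and checks the usual rule of thumb that every expected count is at least 5. It is optional and only illustrative; the names `tab`, `expected`, `all_ok`, and `matches` are made up here, and the raw 0/1 and 1/2/3 codes are used because the labels do not affect the counts.

```r
library(MASS)  # provides the Melanoma data frame

# Observed counts: ulcer on the rows, status on the columns (raw codes)
tab <- table(ulcer = Melanoma$ulcer, status = Melanoma$status)

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(expected, 3)

# Condition check: are all expected counts at least 5?
all_ok <- all(expected >= 5)
all_ok

# Compare with the expected counts stored by chisq.test()
matches <- isTRUE(all.equal(expected, chisq.test(tab)$expected,
                            check.attributes = FALSE))
matches
```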

#Tell R to show the chi-square test output we saved as chisq and compare to the values from the last two code chunks
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  ChiTable
## X-squared = 26.974, df = 2, p-value = 1.389e-06
  1. What is the chi-square test statistic?

-> 26.974
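
The test statistic reported by R is the sum over all six cells of (observed - expected)^2 / expected, with degrees of freedom (rows - 1) x (columns - 1). The following sketch recomputes it by hand as a check on the output above; it is not required for the assignment, and the names `observed`, `expected`, `stat`, `df`, and `p_value` are made up for this illustration.

```r
library(MASS)  # provides the Melanoma data frame

# Observed counts (raw codes; the labels do not change the statistic)
observed <- table(Melanoma$ulcer, Melanoma$status)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom for a 2 x 3 table: (2 - 1) * (3 - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

round(stat, 3)
df
p_value
```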

  1. What is the p-value (take it out of scientific notation)?

-> 0.000001389
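
R can also print the p-value without scientific notation, which saves counting zeros by hand. This is a small sketch using base R's `format()` and `sprintf()`; the names `test`, `p_plain`, and `p_fixed` are made up for this illustration.

```r
library(MASS)  # provides the Melanoma data frame

# Run the same chi-square test on the ulcer-by-status table
test <- chisq.test(table(Melanoma$ulcer, Melanoma$status))

# format() with scientific = FALSE writes the number out in fixed notation
p_plain <- format(test$p.value, scientific = FALSE)
p_plain

# sprintf() gives control over the number of decimal places shown
p_fixed <- sprintf("%.9f", test$p.value)
p_fixed
```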

  1. What is the conclusion we should make about the null hypothesis: Should we reject it or fail to reject it?

-> Since the p-value is less than 0.05, we must reject the null hypothesis.

  1. What is the conclusion we should make about the alternative hypothesis?

-> Since we have rejected the null hypothesis, the data support the alternative hypothesis. This means there is a statistically significant relationship between ulcer and status among melanoma patients in the population.

Knit the file as either an html or a PDF. Submitting as a PDF may require installing the tinytex package. If you can, great. If not, then just knit as html. Submit your file in the D2L Categorical EDA submission folder.