Use the SPSS dataset “Breast Cancer Survival.” In this study, 1207 women diagnosed with breast cancer were assessed on a number of demographic and health-related variables such as: age, pathological tumor size (in cm), the # of lymph nodes affected (lnpos), histologic grade (the higher the #, the more dangerous the cancer is), estrogen receptor status (a binary variable: positive or negative), progesteron receptor status (a binary variable: positive or negative), status (whether a patient is alive or dead), time (number of months under observation).

Import Data

bcs <- haven::read_sav("~/Northeastern/Fall 2018/CAEP7712/Week 2/Breast cancer survival.sav")

Explore Data

lapply(bcs[, lapply(bcs, class) %>% unlist == "labelled"], unique)
## $histgrad
## [1]  3 NA  2  1
## 
## $er
## [1]  0 NA  1
## 
## $pr
## [1]  0 NA  1
## 
## $status
## [1] 0 1
lapply(bcs[, lapply(bcs, class) %>% unlist == "numeric"], summary)
## $id
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   310.5   619.0   621.1   931.5  1266.0 
## 
## $age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   46.00   56.00   56.39   66.50   88.00 
## 
## $pathsize
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.100   1.000   1.500   1.733   2.200   7.000      86 
## 
## $lnpos
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8807  0.0000 35.0000 
## 
## $time
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.633  22.550  42.967  46.956  65.583 133.800
  1. Write out 3 research questions that can be asked of these data?
    • What is the relationship between size and histological grade / estrogen status / progesterone status?
    • Is the estrogen and progesterone receptor status frequency of this sample statistically different than that of the population as specified on cancer.org 1: 2/3 or .66?
    • Do any of the variables have a correlation with the patient’s status - ie is a particular variable more linked to mortality? (Intuitively, histological grade seems like it will be)
  2. Calculate descriptive statistics for the age variable.
  3. Descriptive Statistics

    tags$p("Mean age: ", bcs$age %>% mean(na.rm = T) %>% round(2))

    Mean age: 56.39

    tags$p("Standard deviation: ", bcs$age %>% sd(na.rm = T) %>% round(2))

    Standard deviation: 13.33

    tags$p("Range: ", range(bcs$age))

    Range: 22

    tags$p("IQR: ", IQR(bcs$age, na.rm = T, type = 6))

    IQR: 21

    tags$p("Min/Max/Median and Quartiles: ")

    Min/Max/Median and Quartiles:

    bcs$age %>% summary
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   22.00   46.00   56.00   56.39   66.50   88.00
    tags$em("Alternatively: ")
    Alternatively:
    psych::describe(bcs$age)

    Is this variable normally distributed? Test normality with a number of approaches: skewness and kurtosis statistics, Kolmogorov-Smirnov and Shapiro-Wilks tests.

    Tests of Normality

    bcs$age %>% hist(main = "Histogram of Age")

    tags$p("Measure of skew: ", psych::skew(bcs$age), ". Age is slightly right skewed according to this test, which was predictable given that breast cancer tends to develop more often in aged populations.")

    Measure of skew: 0.0493656960285639 . Age is slightly right skewed according to this test, which was predictable given that breast cancer tends to develop more often in aged populations.

    tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$age), ". Age is platykurtic indicating that the sample is well distributed across each age segment represented.")

    Measure of kurtosis: -0.770526714373488 . Age is platykurtic indicating that the sample is well distributed across each age segment represented.

    tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")

    Kolmogorov-Smirnov goodness-of-fit test:

    ks.test(x = bcs$age, y = rnorm(10000, mean = mean(bcs$age), sd = sd(bcs$age)), alternative = "two.sided", 
        exact = T)
    ## 
    ##  Two-sample Kolmogorov-Smirnov test
    ## 
    ## data:  bcs$age and rnorm(10000, mean = mean(bcs$age), sd = sd(bcs$age))
    ## D = 0.054868, p-value = 0.003053
    ## alternative hypothesis: two-sided
    tags$p("Based on comparison of the age sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of age sample differs significantly from normality. ")

    Based on comparison of the age sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of age sample differs significantly from normality.

    tags$p("Shapiro-Wilks goodness-of-fit test: ")

    Shapiro-Wilks goodness-of-fit test:

    shapiro.test(bcs$age)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  bcs$age
    ## W = 0.98719, p-value = 0.000000008472
    tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of ages significantly differs from the assumption of normality.")

    The Shapiro-Wilks test p-value is also < .05 indicating that the sample of ages significantly differs from the assumption of normality.

  4. Repeat computation of descriptive statistics and the test of normality for pathological tumor size and the # of lymph nodes affected.
  5. Descriptives for Pathological Size

    psych::describe(bcs$pathsize)
    bcs$pathsize %>% hist(main = "Histogram of pathsize")

    tags$p("Measure of skew: ", psych::skew(bcs$pathsize), ". pathsize is right skewed indicating that the mean size of tumors sampled was greater than the most common size.")

    Measure of skew: 1.337129628861 . pathsize is right skewed indicating that the mean size of tumors sampled was greater than the most common size.

    tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$pathsize), ". pathsize is leptokurtic indicating that the sample has a high number of observations clustered around the mean.")

    Measure of kurtosis: 2.63205770460441 . pathsize is leptokurtic indicating that the sample has a high number of observations clustered around the mean.

    tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")

    Kolmogorov-Smirnov goodness-of-fit test:

    ks.test(x = bcs$pathsize, y = rnorm(10000, mean = mean(bcs$pathsize, na.rm = T), 
        sd = sd(bcs$pathsize, na.rm = T)), alternative = "two.sided", exact = T)
    ## 
    ##  Two-sample Kolmogorov-Smirnov test
    ## 
    ## data:  bcs$pathsize and rnorm(10000, mean = mean(bcs$pathsize, na.rm = T), sd = sd(bcs$pathsize, bcs$pathsize and     na.rm = T))
    ## D = 0.12594, p-value = 0.00000000000002598
    ## alternative hypothesis: two-sided
    tags$p("Based on comparison of the pathsize sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of pathsize sample differs significantly from normality. ")

    Based on comparison of the pathsize sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of pathsize sample differs significantly from normality.

    tags$p("Shapiro-Wilks goodness-of-fit test: ")

    Shapiro-Wilks goodness-of-fit test:

    shapiro.test(bcs$pathsize)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  bcs$pathsize
    ## W = 0.90833, p-value < 0.00000000000000022
    tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of pathsizes significantly differs from the assumption of normality.")

    The Shapiro-Wilks test p-value is also < .05 indicating that the sample of pathsizes significantly differs from the assumption of normality.

    Descriptives for Lymph node number

    psych::describe(bcs$lnpos)
    bcs$lnpos %>% hist(main = "Histogram of lnpos")

    tags$p("Measure of skew: ", psych::skew(bcs$lnpos), ". lnpos is right skewed indicating that the mean # of effected lymph nodes sampled was greater than the most common #.")

    Measure of skew: 5.33317752160697 . lnpos is right skewed indicating that the mean # of effected lymph nodes sampled was greater than the most common #.

    tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$lnpos), ". lnpos is extremely leptokurtic with nearly all observations clustered around the mean. This makes sense because the number of lymph nodes in a normal breast is a bounded range with a small standard deviation.")

    Measure of kurtosis: 42.7882110700196 . lnpos is extremely leptokurtic with nearly all observations clustered around the mean. This makes sense because the number of lymph nodes in a normal breast is a bounded range with a small standard deviation.

    tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")

    Kolmogorov-Smirnov goodness-of-fit test:

    ks.test(x = bcs$lnpos, y = rnorm(10000, mean = mean(bcs$lnpos, na.rm = T), sd = sd(bcs$lnpos, 
        na.rm = T)), alternative = "two.sided", exact = T)
    ## 
    ##  Two-sample Kolmogorov-Smirnov test
    ## 
    ## data:  bcs$lnpos and rnorm(10000, mean = mean(bcs$lnpos, na.rm = T), sd = sd(bcs$lnpos, bcs$lnpos and     na.rm = T))
    ## D = 0.40688, p-value < 0.00000000000000022
    ## alternative hypothesis: two-sided
    tags$p("Based on comparison of the lnpos sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of lnpos sample differs significantly from normality. ")

    Based on comparison of the lnpos sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of lnpos sample differs significantly from normality.

    tags$p("Shapiro-Wilks goodness-of-fit test: ")

    Shapiro-Wilks goodness-of-fit test:

    shapiro.test(bcs$lnpos)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  bcs$lnpos
    ## W = 0.39459, p-value < 0.00000000000000022
    tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of lnposs significantly differs from the assumption of normality.")

    The Shapiro-Wilks test p-value is also < .05 indicating that the sample of lnposs significantly differs from the assumption of normality.

    Can the # of lymph nodes affected variable be transformed into categorical to resolve the distribution issue? Create a new binary indicator for those who do not have any nodes affected and those who have their nodes affected (through TRANSFORM – RECODE INTO DIFFERENT VARIABLES).

    Recode Number of Lymph Nodes

    bcs %<>% mutate(lnpos = if_else(lnpos > 0, T, F))
  6. Compute descriptive statistics on age and the tumor size by group (for those who have nodes affected and those who don’t) [In EXPLORE, enter a grouping variable as a FACTOR LIST].
  7. Descriptives

    bcs %>% group_by(lnpos) %>% dplyr::summarise(nAge = n(), nSize = n(), meanAge = mean(age, 
        na.rm = T), meanSize = mean(pathsize, na.rm = T), sdAge = sd(age, na.rm = T), 
        sdSize = sd(pathsize, na.rm = T), medAge = median(age, na.rm = T), medSize = median(pathsize, 
            na.rm = T), minAge = min(age, na.rm = T), minSize = min(pathsize, na.rm = T), 
        maxAge = max(age, na.rm = T), maxSize = max(pathsize, na.rm = T), rangeAge = {
            maxAge - minAge
        }, rangeSize = {
            maxSize - minSize
        })
  8. Standardize the variable ‘tumor size’. Compare histograms of the unstandardized and standardized variables. Do the distributions change? (TRANSFORM – COMPUTE VARIABLE).
  9. The scale on the x axis and the bins will change but the distribution will remain the same.

    Normalization

    bcs$pathsize %>% hist

    bcs$pathsize %>% scale %>% hist

  10. Using the mean and the standard deviation value form the descriptive statistics of the unstandardized variable, compute the z score for the tumor size of 2 cm and 3.5 cm. interpret them.
  11. z <- function(x, data) {
        out <- (x - mean(data, na.rm = T))/sd(data, na.rm = T)
        return(out)
    }
    x <- 2
    tags$p("A tumor with a size of ", x, " cm is ", z(x, bcs$pathsize) %>% round(2), 
        "standard deviations away from the mean of tumor sizes in the sample")

    A tumor with a size of 2 cm is 0.27 standard deviations away from the mean of tumor sizes in the sample

    x <- 3.5
    tags$p("A tumor with a size of ", x, " cm is ", z(x, bcs$pathsize) %>% round(2), 
        "standard deviations away from the mean of tumor sizes in the sample")

    A tumor with a size of 3.5 cm is 1.77 standard deviations away from the mean of tumor sizes in the sample

  12. Compute SE of the mean of the “tumor size” variable and construct the 95% CI. What would have happened to this CI had we decreased the sample size to 100 women?
  13. \(SE=\frac{s}{\sqrt{N}}\) Standard Error

    se <- function(data) {
        out <- sd(data, na.rm = T)/sqrt(length(data))
        return(out)
    }
    tags$p("The standard error of tumor size is ", se(bcs$pathsize) %>% round(3))

    The standard error of tumor size is 0.029

    psych::describe(bcs$pathsize)$se
    ## [1] 0.02974363

    \[ CI_{0.95} = [\bar{x} - 1.959964*se, \bar{x} \\ + 1.959964*se] \] Confidence Interval

    ci <- function(cl, data, tail = "2") {
        if (tail == "2") {
            z <- (1 - cl)/2 + cl
        } else if (tail == "L" | tail == "l") {
            z <- 1 - cl
        } else {
            z <- cl
        }
        out <- c(mean(data, na.rm = T) - qt(z, length(data) - 1) * sd(data, na.rm = T)/sqrt(length(data)), 
            mean(data, na.rm = T) + qt(z, length(data) - 1) * sd(data, na.rm = T)/sqrt(length(data)))
        return(out)
    }
    tags$p("The confidence interval at the .95% confidence level: ")

    The confidence interval at the .95% confidence level:

    ci(0.95, bcs$pathsize, "2")
    ## [1] 1.677250 1.789726
    tags$p("If we decreased the number of observations to 100 the confidence interval will have a much larger range. This is true because the degrees of freedom is the number of observations - 1 and is also in the divisor position on the standard error calculation - therefore the SE grows larger as the number of observations is reduced. To demonstrate: ")

    If we decreased the number of observations to 100 the confidence interval will have a much larger range. This is true because the degrees of freedom is the number of observations - 1 and is also in the divisor position on the standard error calculation - therefore the SE grows larger as the number of observations is reduced. To demonstrate:

    cl <- 0.95
    a <- (1 - cl)/2 + cl
    c(mean(bcs$pathsize, na.rm = T) - qt(a, 100 - 1) * sd(bcs$pathsize, na.rm = T)/sqrt(100), 
        mean(bcs$pathsize, na.rm = T) + qt(a, 100 - 1) * sd(bcs$pathsize, na.rm = T)/sqrt(100))
    ## [1] 1.535888 1.931087