Use the SPSS dataset “Breast Cancer Survival.” In this study, 1207 women diagnosed with breast cancer were assessed on a number of demographic and health-related variables such as: age, pathological tumor size (in cm), the # of lymph nodes affected (lnpos), histologic grade (the higher the #, the more dangerous the cancer is), estrogen receptor status (a binary variable: positive or negative), progesteron receptor status (a binary variable: positive or negative), status (whether a patient is alive or dead), time (number of months under observation).
Import Data
bcs <- haven::read_sav("~/Northeastern/Fall 2018/CAEP7712/Week 2/Breast cancer survival.sav")Explore Data
lapply(bcs[, lapply(bcs, class) %>% unlist == "labelled"], unique)## $histgrad
## [1] 3 NA 2 1
##
## $er
## [1] 0 NA 1
##
## $pr
## [1] 0 NA 1
##
## $status
## [1] 0 1
lapply(bcs[, lapply(bcs, class) %>% unlist == "numeric"], summary)## $id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 310.5 619.0 621.1 931.5 1266.0
##
## $age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 46.00 56.00 56.39 66.50 88.00
##
## $pathsize
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.100 1.000 1.500 1.733 2.200 7.000 86
##
## $lnpos
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8807 0.0000 35.0000
##
## $time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.633 22.550 42.967 46.956 65.583 133.800
Descriptive Statistics
tags$p("Mean age: ", bcs$age %>% mean(na.rm = T) %>% round(2))Mean age: 56.39
tags$p("Standard deviation: ", bcs$age %>% sd(na.rm = T) %>% round(2))Standard deviation: 13.33
tags$p("Range: ", range(bcs$age))Range: 22
tags$p("IQR: ", IQR(bcs$age, na.rm = T, type = 6))IQR: 21
tags$p("Min/Max/Median and Quartiles: ")Min/Max/Median and Quartiles:
bcs$age %>% summary## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 46.00 56.00 56.39 66.50 88.00
tags$em("Alternatively: ")psych::describe(bcs$age)Is this variable normally distributed? Test normality with a number of approaches: skewness and kurtosis statistics, Kolmogorov-Smirnov and Shapiro-Wilks tests.
Tests of Normality
bcs$age %>% hist(main = "Histogram of Age")tags$p("Measure of skew: ", psych::skew(bcs$age), ". Age is slightly right skewed according to this test, which was predictable given that breast cancer tends to develop more often in aged populations.")Measure of skew: 0.0493656960285639 . Age is slightly right skewed according to this test, which was predictable given that breast cancer tends to develop more often in aged populations.
tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$age), ". Age is platykurtic indicating that the sample is well distributed across each age segment represented.")Measure of kurtosis: -0.770526714373488 . Age is platykurtic indicating that the sample is well distributed across each age segment represented.
tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")Kolmogorov-Smirnov goodness-of-fit test:
ks.test(x = bcs$age, y = rnorm(10000, mean = mean(bcs$age), sd = sd(bcs$age)), alternative = "two.sided",
exact = T)##
## Two-sample Kolmogorov-Smirnov test
##
## data: bcs$age and rnorm(10000, mean = mean(bcs$age), sd = sd(bcs$age))
## D = 0.054868, p-value = 0.003053
## alternative hypothesis: two-sided
tags$p("Based on comparison of the age sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of age sample differs significantly from normality. ")Based on comparison of the age sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of age sample differs significantly from normality.
tags$p("Shapiro-Wilks goodness-of-fit test: ")Shapiro-Wilks goodness-of-fit test:
shapiro.test(bcs$age)##
## Shapiro-Wilk normality test
##
## data: bcs$age
## W = 0.98719, p-value = 0.000000008472
tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of ages significantly differs from the assumption of normality.")The Shapiro-Wilks test p-value is also < .05 indicating that the sample of ages significantly differs from the assumption of normality.
Descriptives for Pathological Size
psych::describe(bcs$pathsize)bcs$pathsize %>% hist(main = "Histogram of pathsize")tags$p("Measure of skew: ", psych::skew(bcs$pathsize), ". pathsize is right skewed indicating that the mean size of tumors sampled was greater than the most common size.")Measure of skew: 1.337129628861 . pathsize is right skewed indicating that the mean size of tumors sampled was greater than the most common size.
tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$pathsize), ". pathsize is leptokurtic indicating that the sample has a high number of observations clustered around the mean.")Measure of kurtosis: 2.63205770460441 . pathsize is leptokurtic indicating that the sample has a high number of observations clustered around the mean.
tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")Kolmogorov-Smirnov goodness-of-fit test:
ks.test(x = bcs$pathsize, y = rnorm(10000, mean = mean(bcs$pathsize, na.rm = T),
sd = sd(bcs$pathsize, na.rm = T)), alternative = "two.sided", exact = T)##
## Two-sample Kolmogorov-Smirnov test
##
## data: bcs$pathsize and rnorm(10000, mean = mean(bcs$pathsize, na.rm = T), sd = sd(bcs$pathsize, bcs$pathsize and na.rm = T))
## D = 0.12594, p-value = 0.00000000000002598
## alternative hypothesis: two-sided
tags$p("Based on comparison of the pathsize sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of pathsize sample differs significantly from normality. ")Based on comparison of the pathsize sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of pathsize sample differs significantly from normality.
tags$p("Shapiro-Wilks goodness-of-fit test: ")Shapiro-Wilks goodness-of-fit test:
shapiro.test(bcs$pathsize)##
## Shapiro-Wilk normality test
##
## data: bcs$pathsize
## W = 0.90833, p-value < 0.00000000000000022
tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of pathsizes significantly differs from the assumption of normality.")The Shapiro-Wilks test p-value is also < .05 indicating that the sample of pathsizes significantly differs from the assumption of normality.
Descriptives for Lymph node number
psych::describe(bcs$lnpos)bcs$lnpos %>% hist(main = "Histogram of lnpos")tags$p("Measure of skew: ", psych::skew(bcs$lnpos), ". lnpos is right skewed indicating that the mean # of effected lymph nodes sampled was greater than the most common #.")Measure of skew: 5.33317752160697 . lnpos is right skewed indicating that the mean # of effected lymph nodes sampled was greater than the most common #.
tags$p("Measure of kurtosis: ", psych::kurtosi(bcs$lnpos), ". lnpos is extremely leptokurtic with nearly all observations clustered around the mean. This makes sense because the number of lymph nodes in a normal breast is a bounded range with a small standard deviation.")Measure of kurtosis: 42.7882110700196 . lnpos is extremely leptokurtic with nearly all observations clustered around the mean. This makes sense because the number of lymph nodes in a normal breast is a bounded range with a small standard deviation.
tags$p("Kolmogorov-Smirnov goodness-of-fit test: ")Kolmogorov-Smirnov goodness-of-fit test:
ks.test(x = bcs$lnpos, y = rnorm(10000, mean = mean(bcs$lnpos, na.rm = T), sd = sd(bcs$lnpos,
na.rm = T)), alternative = "two.sided", exact = T)##
## Two-sample Kolmogorov-Smirnov test
##
## data: bcs$lnpos and rnorm(10000, mean = mean(bcs$lnpos, na.rm = T), sd = sd(bcs$lnpos, bcs$lnpos and na.rm = T))
## D = 0.40688, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
tags$p("Based on comparison of the lnpos sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of lnpos sample differs significantly from normality. ")Based on comparison of the lnpos sample with a normal distribution, the Kolmogorov-Smirnov test p-value is less than an alpha level of .05 with a two-tailed test indicating that the distribution of lnpos sample differs significantly from normality.
tags$p("Shapiro-Wilks goodness-of-fit test: ")Shapiro-Wilks goodness-of-fit test:
shapiro.test(bcs$lnpos)##
## Shapiro-Wilk normality test
##
## data: bcs$lnpos
## W = 0.39459, p-value < 0.00000000000000022
tags$p("The Shapiro-Wilks test p-value is also < .05 indicating that the sample of lnposs significantly differs from the assumption of normality.")The Shapiro-Wilks test p-value is also < .05 indicating that the sample of lnposs significantly differs from the assumption of normality.
Can the # of lymph nodes affected variable be transformed into categorical to resolve the distribution issue? Create a new binary indicator for those who do not have any nodes affected and those who have their nodes affected (through TRANSFORM – RECODE INTO DIFFERENT VARIABLES).
Recode Number of Lymph Nodes
bcs %<>% mutate(lnpos = if_else(lnpos > 0, T, F))Descriptives
bcs %>% group_by(lnpos) %>% dplyr::summarise(nAge = n(), nSize = n(), meanAge = mean(age,
na.rm = T), meanSize = mean(pathsize, na.rm = T), sdAge = sd(age, na.rm = T),
sdSize = sd(pathsize, na.rm = T), medAge = median(age, na.rm = T), medSize = median(pathsize,
na.rm = T), minAge = min(age, na.rm = T), minSize = min(pathsize, na.rm = T),
maxAge = max(age, na.rm = T), maxSize = max(pathsize, na.rm = T), rangeAge = {
maxAge - minAge
}, rangeSize = {
maxSize - minSize
})The scale on the x axis and the bins will change but the distribution will remain the same.
Normalization
bcs$pathsize %>% histbcs$pathsize %>% scale %>% histz <- function(x, data) {
out <- (x - mean(data, na.rm = T))/sd(data, na.rm = T)
return(out)
}
x <- 2
tags$p("A tumor with a size of ", x, " cm is ", z(x, bcs$pathsize) %>% round(2),
"standard deviations away from the mean of tumor sizes in the sample")A tumor with a size of 2 cm is 0.27 standard deviations away from the mean of tumor sizes in the sample
x <- 3.5
tags$p("A tumor with a size of ", x, " cm is ", z(x, bcs$pathsize) %>% round(2),
"standard deviations away from the mean of tumor sizes in the sample")A tumor with a size of 3.5 cm is 1.77 standard deviations away from the mean of tumor sizes in the sample
\(SE=\frac{s}{\sqrt{N}}\) Standard Error
se <- function(data) {
out <- sd(data, na.rm = T)/sqrt(length(data))
return(out)
}
tags$p("The standard error of tumor size is ", se(bcs$pathsize) %>% round(3))The standard error of tumor size is 0.029
psych::describe(bcs$pathsize)$se## [1] 0.02974363
\[ CI_{0.95} = [\bar{x} - 1.959964*se, \bar{x} \\ + 1.959964*se] \] Confidence Interval
ci <- function(cl, data, tail = "2") {
if (tail == "2") {
z <- (1 - cl)/2 + cl
} else if (tail == "L" | tail == "l") {
z <- 1 - cl
} else {
z <- cl
}
out <- c(mean(data, na.rm = T) - qt(z, length(data) - 1) * sd(data, na.rm = T)/sqrt(length(data)),
mean(data, na.rm = T) + qt(z, length(data) - 1) * sd(data, na.rm = T)/sqrt(length(data)))
return(out)
}
tags$p("The confidence interval at the .95% confidence level: ")The confidence interval at the .95% confidence level:
ci(0.95, bcs$pathsize, "2")## [1] 1.677250 1.789726
tags$p("If we decreased the number of observations to 100 the confidence interval will have a much larger range. This is true because the degrees of freedom is the number of observations - 1 and is also in the divisor position on the standard error calculation - therefore the SE grows larger as the number of observations is reduced. To demonstrate: ")If we decreased the number of observations to 100 the confidence interval will have a much larger range. This is true because the degrees of freedom is the number of observations - 1 and is also in the divisor position on the standard error calculation - therefore the SE grows larger as the number of observations is reduced. To demonstrate:
cl <- 0.95
a <- (1 - cl)/2 + cl
c(mean(bcs$pathsize, na.rm = T) - qt(a, 100 - 1) * sd(bcs$pathsize, na.rm = T)/sqrt(100),
mean(bcs$pathsize, na.rm = T) + qt(a, 100 - 1) * sd(bcs$pathsize, na.rm = T)/sqrt(100))## [1] 1.535888 1.931087