What are Confidence Intervals?

  • A confidence interval is a range of values, computed from a sample, that is likely to contain the true mean of a population.
  • The most common confidence levels are 90%, 95%, and 99%.
  • Let’s say we calculate a 95% two-sided confidence interval and get the result (3, 6). This means that:
    • We are 95% confident that the true mean of the population, \(\mu\), lies between 3 and 6: if we repeated the sampling many times, about 95% of the intervals built this way would contain \(\mu\).
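
One way to see what the 95% refers to is a small simulation: draw many samples from a population with a known mean, build an interval from each, and count how often the interval covers the true mean. The sketch below uses base R only; the population parameters (mean 5, standard deviation 2), the sample size of 30, and the number of repetitions are arbitrary choices made for illustration.

```r
set.seed(1)          # fixed seed so the run is reproducible
mu <- 5              # assumed true population mean (illustrative)
n_rep <- 2000        # number of simulated samples

covered <- replicate(n_rep, {
  x <- rnorm(30, mean = mu, sd = 2)            # one sample of 30 observations
  ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% interval for this sample
  ci[1] <= mu && mu <= ci[2]                   # TRUE if the interval contains mu
})

coverage <- mean(covered)   # proportion of intervals that contain mu
print(coverage)
```

The proportion printed should land close to 0.95, which is the sense in which the interval is "95%".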

Important Variables

n: sample size (number of observations)

\(x_i\): the \(i\)-th observation (e.g., if the observed values are 1, 5, 7, 3, 2, then \(x_2 = 5\))

sample mean: \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)

sample standard deviation: \(\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\) (the \(n - 1\) divisor is the one used by R's sd function)

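confidence level: \(1 - \alpha\) (e.g., \(\alpha = 0.05\) for a 95% interval)

critical value: \(z_{\alpha/2}\), the value that leaves an area of \(\alpha/2\) in the upper tail of the standard normal distribution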
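
For reference, the standard normal critical values for the three common confidence levels can be looked up with base R's qnorm. This is a small sketch; the variable names are illustrative.

```r
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels
z_crit <- qnorm(1 - alpha / 2)   # upper alpha/2 quantile of the standard normal
print(round(z_crit, 3))          # approximately 1.645, 1.960, 2.576
```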

Equations for Confidence Intervals

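We will focus on two-sided confidence intervals. When \(\sigma\) is estimated from the sample, the critical value \(t_{\alpha/2,\,n-1}\) from the t distribution with \(n - 1\) degrees of freedom is used in place of \(z_{\alpha/2}\) (this is what R's t.test does). Here are the necessary equations: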

Lower bound: \(\hat{L} = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\)

Upper bound: \(\hat{U} = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\)
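
The two bounds translate directly into a few lines of R. The helper below, z_interval, is a hypothetical name introduced here for illustration; it takes the sample mean, standard deviation, sample size, and confidence level, and uses qnorm for the critical value. The inputs in the usage line are made up and do not come from the dataset.

```r
z_interval <- function(x_bar, s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)     # critical value z_{alpha/2}
  margin <- z * s / sqrt(n)     # half-width of the interval
  c(lower = x_bar - margin, upper = x_bar + margin)
}

# Illustrative inputs, not taken from the dataset
ci <- z_interval(x_bar = 10, s = 2, n = 100)
print(ci)
```

Swapping qnorm(1 - alpha / 2) for qt(1 - alpha / 2, df = n - 1) gives the t-based version of the same interval.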

Dataset

We will be using data regarding survival in patients with advanced lung cancer from the North Central Cancer Treatment Group.

First, let’s load the dataset we’ll be using:

library(survival)   # package that ships the dataset
data(cancer)        # loads the cancer data frame

Calculating Confidence Interval

Find a 95% two-sided confidence interval for the mean survival time of cancer patients in this data set.

\(\bar{x}\) = 305.2324561

\(\sigma\) = 210.6455431

n = 228

\(1 - \alpha = .95\), so \(\alpha = .05\)

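Because \(\sigma\) is estimated from the sample, the critical value is taken from the t distribution with \(n - 1 = 227\) degrees of freedom: \(t_{0.025,\,227} \approx 1.9705\).

\(\hat{L} = 305.2324561 - t_{\frac{0.05}{2},\,227}\,\frac{210.6455431}{\sqrt{228}} = 277.7437328\)

\(\hat{U} = 305.2324561 + t_{\frac{0.05}{2},\,227}\,\frac{210.6455431}{\sqrt{228}} = 332.7211795\)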

Conclusion: Based on the sample data, we are 95% confident that the true mean survival time, \(\mu\), of patients with advanced lung cancer lies between 277.7437 and 332.7212 days.
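
The arithmetic can be checked in base R from the three summary numbers alone. Note that the reported bounds correspond to the critical value from the t distribution with n - 1 degrees of freedom (qt), which is what t.test uses; with the normal critical value (qnorm) the interval comes out slightly narrower. The variable names below are illustrative.

```r
x_bar <- 305.2324561
s <- 210.6455431
n <- 228
alpha <- 0.05

se <- s / sqrt(n)                          # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)    # t critical value, 227 degrees of freedom
z_crit <- qnorm(1 - alpha / 2)             # standard normal critical value

t_ci <- c(x_bar - t_crit * se, x_bar + t_crit * se)   # matches the bounds reported above
z_ci <- c(x_bar - z_crit * se, x_bar + z_crit * se)   # slightly narrower
print(round(t_ci, 4))
print(round(z_ci, 4))
```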

Code to calculate the confidence interval in R

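x_bar <- mean(cancer$time)       # sample mean (avoids masking the mean function)
s <- sd(cancer$time)             # sample standard deviation (n - 1 divisor)
n <- nrow(cancer)                # sample size
result <- t.test(cancer$time)    # 95% two-sided interval by default
L <- result$conf.int[1]          # lower bound
U <- result$conf.int[2]          # upper bound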

Boxplot using plot_ly

The Karnofsky performance score is a scale from 0 (bad) to 100 (good). This boxplot shows the scores assigned to patients by their physicians.
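
A boxplot like this could be drawn as follows. This is a minimal sketch, assuming the plotly package is installed and that the physician-rated score is the ph.karno column of the dataset; the trace name and axis title are illustrative choices.

```r
library(survival)   # provides the cancer dataset
library(plotly)
data(cancer)

# Boxplot of the physician-rated Karnofsky score (ph.karno)
fig <- plot_ly(cancer, y = ~ph.karno, type = "box", name = "ph.karno")
fig <- layout(fig, yaxis = list(title = "Karnofsky score (physician)"))
fig
```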

Scatterplot showing age of patients vs their survival time using ggplot
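
A scatterplot like this could be produced along the following lines. This is a minimal sketch, assuming the ggplot2 package is installed; the title and axis labels are illustrative choices.

```r
library(survival)   # provides the cancer dataset
library(ggplot2)
data(cancer)

# Age on the x-axis, survival time on the y-axis
p <- ggplot(cancer, aes(x = age, y = time)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Age vs. Survival Time",
       x = "Age (years)",
       y = "Survival time (days)")
p
```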

What is ph.ecog, and what do the numbers mean?

ph.ecog is the ECOG performance score as rated by the physician

  • 0 = asymptomatic
  • 1 = symptomatic but completely ambulatory
  • 2 = in bed < 50% of the day
  • 3 = in bed > 50% of the day but not bedbound
  • 4 = bedbound

It is important to note that there are no patients who scored a 4 in this dataset.
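
This can be checked directly by tabulating the column. A minimal sketch using base R's table; useNA = "ifany" is included so that any missing scores are shown as well, and ecog_counts is an illustrative name.

```r
library(survival)   # provides the cancer dataset
data(cancer)

# Count patients per ECOG score, including any missing values
ecog_counts <- table(cancer$ph.ecog, useNA = "ifany")
print(ecog_counts)
```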

Bar graph of number of patients who received each score, using ggplot

Code for the bar graph from the previous slide

library(ggplot2)

# One bar per ECOG score; bar height is the number of patients with that score
ggplot(cancer, aes(x = ph.ecog)) +
  geom_bar(stat = "count", width = 0.7, fill = "pink") +
  theme_minimal() +
  labs(title = "Number of Patients per ECOG Performance Score",
       x = "ECOG Performance Score",
       y = "Number of Patients")