Cancer research often uses statistics to study relationships between clinical variables and patient outcomes.
In this presentation, we apply linear regression to two cancer-related datasets:
- The lung dataset
- The BreastCancer dataset
2026-04-12
Linear regression is a statistical method that models the relationship between a dependent variable and an independent variable by fitting a straight line to the data.
In this presentation we will use linear regression to answer the following questions about the lung dataset and the BreastCancer dataset:
Linear regression can be written as:
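\[ y = \beta_0 + \beta_1 x + \epsilon \]
where
- \(y\) is the dependent variable
- \(x\) is the independent variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
- \(\epsilon\) is the error term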
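To make the roles of the intercept, slope, and error term concrete, here is a minimal sketch on simulated data. The values of `beta0`, `beta1`, the sample size, and the noise level are illustrative assumptions and do not come from either dataset.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Illustrative "true" parameters (hypothetical, not from either dataset)
beta0 <- 2
beta1 <- 0.5

# Simulate y = beta0 + beta1 * x + epsilon
n <- 1000
x <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)
y <- beta0 + beta1 * x + epsilon

# Fit the line; the estimates should land close to beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The fitted coefficients will not equal the true values exactly, because the error term adds noise to every observation.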
The lung dataset comes from the survival package in R.
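For this example, the independent variable is age (in years) and the dependent variable is time (survival time in days).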
##       age             time       
##  Min.   :39.00   Min.   :   5.0  
##  1st Qu.:56.00   1st Qu.: 166.8  
##  Median :63.00   Median : 255.5  
##  Mean   :62.45   Mean   : 305.2  
##  3rd Qu.:69.00   3rd Qu.: 396.5  
##  Max.   :82.00   Max.   :1022.0
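A summary like the one above could be produced along the following lines. This is a sketch, assuming that `lung2` is simply the `age` and `time` columns of `survival::lung`; the original preparation step is not shown in the presentation.

```r
library(survival)

# Referencing survival::lung directly avoids data(lung), which can warn
# "data set 'lung' not found" in some versions of the survival package.
# Assumption: lung2 holds only the two columns used in this example.
lung2 <- survival::lung[, c("age", "time")]

# Minimum, quartiles, mean, and maximum for each column
summary(lung2)
```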
The BreastCancer dataset comes from the mlbench package.
For this example, we use two variables: Cl.thickness (clump thickness) and Cell.size (uniformity of cell size).
These variables describe physical cell characteristics that may be related.
##   Cl.thickness      Cell.size     
##  Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 2.000   1st Qu.: 1.000  
##  Median : 4.000   Median : 1.000  
##  Mean   : 4.442   Mean   : 3.151  
##  3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000
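In mlbench, `Cl.thickness` and `Cell.size` are stored as ordered factors, so they need to be converted to numbers before a numeric summary or regression. The sketch below shows one way to do that. It rests on several assumptions: the mlbench package is installed, rows with missing values are dropped first, and `Cell.size` is treated as the response. The names `bc` and `bc_fit` are hypothetical, and the original preparation steps are not shown in the presentation.

```r
library(mlbench)
data(BreastCancer)

# Assumption: drop rows containing missing values before analysis
bc_complete <- na.omit(BreastCancer)

# Convert via character so the numeric values match the factor labels (1-10)
bc <- data.frame(
  Cl.thickness = as.numeric(as.character(bc_complete$Cl.thickness)),
  Cell.size    = as.numeric(as.character(bc_complete$Cell.size))
)

summary(bc)

# Hypothetical choice of roles: Cell.size as response, Cl.thickness as predictor
bc_fit <- lm(Cell.size ~ Cl.thickness, data = bc)
coef(bc_fit)
```

Converting with `as.numeric()` alone would return the internal level codes, which happen to match here but are not guaranteed to in general, so going through `as.character()` is the safer habit.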
The slope estimate in simple linear regression is:
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})} {\sum (x_i - \bar{x})^2} \]
This formula calculates the slope of the regression line. It shows how much the dependent variable tends to change as the independent variable increases.
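As a check on the formula, the slope can be computed by hand and compared with the result of `lm()`. This is a minimal sketch, assuming `x` is age and `y` is survival time from `survival::lung`; the names `b0_manual` and `b1_manual` are introduced here for illustration.

```r
library(survival)

# Assumption: x is age and y is survival time, as in the lung example
x <- survival::lung$age
y <- survival::lung$time

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
b0_manual <- mean(y) - b1_manual * mean(x)

# Compare the hand-computed values with the coefficients from lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0_manual, b1_manual))
```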
library(ggplot2)

# Scatter plot of survival time against age with a fitted least-squares line
ggplot(lung2, aes(x = age, y = time)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Regression Line Fit for Lung Cancer Data",
    x = "Age",
    y = "Survival Time (Days)"
  )
For the lung cancer dataset, the regression line describes the average relationship between age and survival time.
For the breast cancer dataset, the regression model describes the relationship between two cell characteristics.
Linear regression does not prove causation, but it is useful in identifying patterns in data.
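The fitted model can be read in a few lines of R. The sketch below fits survival time on age for the lung data and pulls out the quantities most often used for interpretation. The age of 60 used for the prediction is an arbitrary illustrative value.

```r
library(survival)

lung_fit <- lm(time ~ age, data = survival::lung)

# Slope: estimated average change in survival time (days) per extra year of age
coef(lung_fit)["age"]

# R-squared: share of the variation in survival time described by the line
summary(lung_fit)$r.squared

# Average predicted survival time at a hypothetical age of 60
predict(lung_fit, newdata = data.frame(age = 60))
```

A small R-squared would indicate that age alone describes little of the variation in survival time, even when the slope shows a visible trend.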
This presentation showed how linear regression can be applied in cancer research.
Main points: