2026-04-12

Loading Lung Cancer and Breast Cancer Datasets
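
As a minimal sketch, the two datasets could be loaded as shown below. The construction of `lung2` (the age and time columns with incomplete rows dropped) is an assumption, since the original loading code is not shown. Calling `data(lung)` can warn that the data set is not found in recent releases of survival, where `lung` is stored in the `cancer` data file, so the sketch loads `cancer` instead.

```r
# Sketch: load the two datasets used in this presentation.
library(survival)  # provides the lung data

# In recent survival releases, lung is stored in the "cancer" data file,
# so data(cancer, ...) is used here rather than data(lung).
data(cancer, package = "survival")

# Assumption: lung2 is the age/time subset of lung with incomplete rows dropped.
lung2 <- na.omit(lung[, c("age", "time")])
str(lung2)

# BreastCancer lives in mlbench, which must be installed separately.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(BreastCancer, package = "mlbench")
  str(BreastCancer[, c("Cl.thickness", "Cell.size")])
}
```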


Introduction

Cancer research often uses statistics to study relationships between clinical variables and patient outcomes.

In this presentation, we apply linear regression to two cancer-related datasets:

  • The lung dataset
  • The BreastCancer dataset

What is Linear Regression?

Linear regression is a statistical method that models the relationship between a dependent variable and an independent variable by fitting a straight line to the data.

In this presentation we will use linear regression to answer the following questions about the lung dataset and the BreastCancer dataset:

  • How does survival time change with age?
  • Do cellular features increase together?

Regression Model

Linear regression can be written as:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where

  • \(y\) is the dependent variable (outcome)
  • \(x\) is the independent variable (predictor)
  • \(\beta_0\) is the intercept (the expected value of \(y\) when \(x = 0\))
  • \(\beta_1\) is the slope (the expected change in \(y\) for each one-unit increase in \(x\))
  • \(\epsilon\) is the random error (the variation in \(y\) that \(x\) does not explain)
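
To make the roles of the terms concrete, the sketch below simulates data from this model and then recovers the coefficients with `lm()`. The parameter values, sample size, and error standard deviation are hypothetical and chosen only for illustration.

```r
set.seed(42)  # make the random errors reproducible

# Hypothetical "true" parameters, chosen only for illustration.
beta0 <- 2.0
beta1 <- 0.5

x <- seq(1, 100)                               # predictor values
epsilon <- rnorm(length(x), mean = 0, sd = 1)  # random error term
y <- beta0 + beta1 * x + epsilon               # outcome generated by the model

fit <- lm(y ~ x)  # least-squares fit of y on x
coef(fit)         # estimates should land close to beta0 and beta1
```

Because the errors are random, the estimates will not equal the true values exactly, but they should be close for a sample of this size.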

Lung Cancer Data

The lung dataset comes from the survival package in R.

For this example:

  • \(x\) = age of patient
  • \(y\) = survival time in days
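
A minimal sketch of fitting this model in R, assuming the `lung` data frame from the survival package is used directly. The ages passed to `predict()` are hypothetical values picked for illustration.

```r
library(survival)  # provides the lung data frame

# Fit survival time (days) as a linear function of age (years).
fit_lung <- lm(time ~ age, data = lung)
coef(fit_lung)  # intercept and slope

# Predicted average survival time at two hypothetical ages.
new_patients <- data.frame(age = c(50, 70))
pred <- predict(fit_lung, newdata = new_patients)
pred
```

The difference between the two predictions equals 20 times the slope, which is one way to read the slope as a change per year of age.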

Lung Cancer Data: Summary Statistics

##       age             time       
##  Min.   :39.00   Min.   :   5.0  
##  1st Qu.:56.00   1st Qu.: 166.8  
##  Median :63.00   Median : 255.5  
##  Mean   :62.45   Mean   : 305.2  
##  3rd Qu.:69.00   3rd Qu.: 396.5  
##  Max.   :82.00   Max.   :1022.0

Lung Cancer Scatter Plot
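
The code for this plot is not shown in the slides. A scatter plot of this kind could be produced along the following lines; the definition of `lung2` and the styling choices are assumptions.

```r
library(ggplot2)
library(survival)  # provides the lung data frame

# Assumption: lung2 is the age/time subset of lung.
lung2 <- na.omit(lung[, c("age", "time")])

# One point per patient: age on the x-axis, survival time on the y-axis.
p_scatter <- ggplot(lung2, aes(x = age, y = time)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Lung Cancer Scatter Plot",
    x = "Age",
    y = "Survival Time (Days)"
  )
p_scatter
```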

Lung Cancer Regression Line

Interactive Plotly Plot
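
The interactive plot itself is not reproduced here. One plausible way to build it, sketched under the assumption that the static ggplot2 figure is converted with `ggplotly()` from the plotly package:

```r
library(ggplot2)
library(plotly)
library(survival)  # provides the lung data frame

# Assumption: lung2 is the age/time subset of lung.
lung2 <- na.omit(lung[, c("age", "time")])

# Static plot: points plus the fitted least-squares line.
p <- ggplot(lung2, aes(x = age, y = time)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Age", y = "Survival Time (Days)")

# Convert the static plot into an interactive widget with hover tooltips.
p_interactive <- ggplotly(p)
# Print p_interactive in an R Markdown chunk or at the console to display it.
```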

Breast Cancer Data

The BreastCancer dataset comes from the mlbench package in R.

For this example:

  • \(x\) = clump thickness
  • \(y\) = cell size

These variables describe physical cell characteristics that may be related.

##   Cl.thickness      Cell.size     
##  Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 2.000   1st Qu.: 1.000  
##  Median : 4.000   Median : 1.000  
##  Mean   : 4.442   Mean   : 3.151  
##  3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000
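
In mlbench, `Cl.thickness` and `Cell.size` are stored as ordered factors, so they need to be converted to numbers before a numeric summary or a regression is possible. The sketch below shows one way to do this; the data frame name `bc` is hypothetical, and any filtering of incomplete rows in the original analysis is not shown, so the summary may differ slightly from the one above.

```r
library(mlbench)
data(BreastCancer)

# Convert the ordered factors to numeric scores via their character labels.
bc <- data.frame(
  Cl.thickness = as.numeric(as.character(BreastCancer$Cl.thickness)),
  Cell.size    = as.numeric(as.character(BreastCancer$Cell.size))
)
summary(bc)

# Regress cell size (y) on clump thickness (x).
fit_bc <- lm(Cell.size ~ Cl.thickness, data = bc)
coef(fit_bc)
```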

Breast Cancer Plot

Least Squares Estimate

The slope estimate in simple linear regression is:

\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})} {\sum (x_i - \bar{x})^2} \]

This formula calculates the slope of the regression line. It shows how much the dependent variable tends to change as the independent variable increases.
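
As a check on the formula, the slope can be computed by hand and compared with the value returned by `lm()`. This sketch uses the lung data from the survival package. The intercept line uses the standard companion result \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), which is not stated in the slides.

```r
library(survival)  # provides the lung data frame

x <- lung$age   # predictor
y <- lung$time  # outcome

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y)).
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Compare with R's built-in least-squares fit.
fit <- lm(time ~ age, data = lung)
slopes_match <- isTRUE(all.equal(beta1_hat, coef(fit)[["age"]]))
intercepts_match <- isTRUE(all.equal(beta0_hat, coef(fit)[["(Intercept)"]]))
c(slopes_match, intercepts_match)
```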

R Code for the Lung Regression Model

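library(ggplot2)

# lung2: the age/time subset of the lung data summarized earlier
# (its construction is not shown in these slides)
ggplot(lung2, aes(x = age, y = time)) +
  geom_point() +                                 # one point per patient
  geom_smooth(method = "lm", formula = y ~ x) +  # fitted least-squares line
  labs(
    title = "Regression Line Fit for Lung Cancer Data",
    x = "Age",
    y = "Survival Time (Days)"
  )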

Interpretation

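For the lung cancer dataset, the regression line describes the average relationship between age and survival time: its slope estimates the average change in survival time (in days) for each additional year of age.

For the breast cancer dataset, the regression model describes the relationship between two cell characteristics: its slope estimates the average change in cell size score for each one-point increase in clump thickness.

Linear regression does not prove causation, but it is useful for identifying patterns in data.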
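
The quantities behind this interpretation can be read from the fitted model object. A minimal sketch for the lung model, assuming the fit is `lm(time ~ age)` on the `lung` data from the survival package:

```r
library(survival)  # provides the lung data frame

fit_lung <- lm(time ~ age, data = lung)
fit_summary <- summary(fit_lung)

fit_summary$coefficients   # estimates, standard errors, t values, p-values
fit_summary$r.squared      # share of variance in time explained by age

# 95% confidence interval for the slope (change in days per year of age).
slope_ci <- confint(fit_lung, "age", level = 0.95)
slope_ci
```

A confidence interval that includes zero would suggest that the data do not clearly establish a linear trend in either direction.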

Conclusion

This presentation showed how linear regression can be applied in cancer research.

Main points:

  • Linear regression models relationships between variables
  • The lung dataset was used to study age and survival time
  • The BreastCancer dataset was used to study cellular measurements
  • Statistical tools such as plots, formulas, and regression output help to summarize data clearly

Sources

  • lung dataset from the survival package in R
  • BreastCancer dataset from the mlbench package in R
  • Linear regression formulas from Statistics by Jim Frost (online resource)
  • Calculating a Least Squares Regression Line: Equation, Example, Explanation by Andrew Lee (online resource)