Predicting esophageal cancer

2016-04-20

Farhad Mehdipour

  • In this project we analyze how the risk of esophageal cancer varies across age groups and levels of daily alcohol and tobacco consumption. The analysis is based on a case-control study of (o)esophageal cancer conducted in Ille-et-Vilaine, France.
  • We have developed a predictor tool that takes an age group and the daily amounts of alcohol and tobacco consumed as inputs and produces the probability of esophageal cancer.
  • In this tool, we fit linear models to the esoph dataset to model the effect of these parameters on the probability of getting esophageal cancer.
  • The user chooses an age group and levels of alcohol and tobacco consumption and gets the probability value as the output.

Source:

Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.

Looking at the Dataset

  • The esoph dataset ships with R's built-in datasets package:
head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

The ncases (number of cases) and ncontrols (number of controls) columns are recorded for each combination of three features: the age group and the daily amounts of alcohol and tobacco consumed.

dim(esoph)
## [1] 88  5
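
The three features are ordered factors, so each group has an integer code that later steps can use as a numeric score. A short sketch for inspecting the levels and their codes (the variable name alc_codes is only illustrative):

```r
# Levels of the three ordered factors, listed from lowest to highest group
levels(esoph$agegp)
levels(esoph$alcgp)
levels(esoph$tobgp)

# Integer codes behind the factor (1 = lowest group); these are the same
# codes that unclass() exposes
alc_codes <- as.integer(head(esoph$alcgp))
alc_codes
```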

A Preliminary Analysis

Re-arrange data for a mosaic plot

mosaicplot(tt, main = "esoph data set", color = TRUE)

[Figure: mosaic plot of the esoph data set]
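
The mosaicplot() call above expects a four-way array tt whose construction is not shown. A minimal sketch of one way to build it, following the example in R's help page for the dataset (?esoph); the original slides may have built it differently:

```r
# Count table over the three factors: 1 where a combination exists in esoph
ttt <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)

# Row order that matches the array's storage order (agegp varies fastest)
o <- with(esoph, order(tobgp, alcgp, agegp))

# Replace the 1s with the observed number of cases
ttt[ttt == 1] <- esoph$ncases[o]

# Same layout, filled with the number of controls
tt1 <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
tt1[tt1 == 1] <- esoph$ncontrols[o]

# Stack cases and controls along a fourth dimension
tt <- array(c(ttt, tt1), c(dim(ttt), 2),
            c(dimnames(ttt), list(c("Cancer", "control"))))

mosaicplot(tt, main = "esoph data set", color = TRUE)
```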

Fitting A Model

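  1. Linear models are fit for the two response variables ncases and ncontrols, using the three features mentioned before as predictors. unclass() converts each ordered factor to its integer level codes, so every feature enters the model as a single numeric score: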

    • mca <- lm(ncases ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)
    • mco <- lm(ncontrols ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)
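  2. Linear equations are formed for predicting ncases and ncontrols, where age, alc and tob are taken to be the integer level codes of the selected groups. The coefficients follow the order of the model terms (intercept, agegp, tobgp, alcgp):

    • nca <- mca$coefficients[1] + mca$coefficients[2]*age + mca$coefficients[3]*tob + mca$coefficients[4]*alc
    • nco <- mco$coefficients[1] + mco$coefficients[2]*age + mco$coefficients[3]*tob + mco$coefficients[4]*alc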
  3. The probability is estimated as the predicted number of cases divided by the predicted total of cases and controls, expressed as a percentage rounded to two decimals:

    • es_prob<- round((nca/(nca+nco))*100, digits= 2)
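
The three steps above can be combined into one small helper. This is a sketch only: predict_es_prob is a hypothetical name, not the function used in the original tool, and age, alc and tob are assumed to be integer level codes (1 = lowest group). The coefficients are multiplied in the order the terms appear in the model formula (agegp, tobgp, alcgp).

```r
# Step 1: fit the two linear models on the integer codes of the factors
mca <- lm(ncases ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)
mco <- lm(ncontrols ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)

# Hypothetical helper: returns the estimated percentage of cases
predict_es_prob <- function(age, alc, tob) {
  # Predictor vector in the same order as the model terms
  x <- c(1, age, tob, alc)
  # Step 2: predicted number of cases and controls
  nca <- sum(unname(coef(mca)) * x)
  nco <- sum(unname(coef(mco)) * x)
  # Step 3: share of cases, as a percentage
  round((nca / (nca + nco)) * 100, digits = 2)
}

# Example: age group 55-64 (code 4), alcohol 80-119 (code 3), tobacco 10-19 (code 2)
predict_es_prob(age = 4, alc = 3, tob = 2)
```

Because a linear fit to counts can produce negative predictions for some combinations, the result is not guaranteed to lie between 0 and 100.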