16 January 2018

System for Estimating Risk of Esophogeal Cancer

I have created an application to estimate a person's risk of having esophogeal cancer based on three criteria:

  • Age group
  • Daily alcohol consumption
  • Daily tobacco consumption

The estimations are based on data from a case-control study of esophageal cancer in Ille-et-Vilaine, France from 1980.

A demo is available at https://ericscuccimarra.shinyapps.io/DataProductsCourseProject/

The Data

The data comes from the R dataset "esoph." The data is preprocessed to calculate the prevalence of esophogeal cancer as cases / (cases + controls). In the data the features are stored as factors, I convert those to numeric by using the middle of each range to represent the range.

# Load the data
data(esoph)
esoph$rate <- esoph$ncases / (esoph$ncases + esoph$ncontrols)
esoph$ageI <- as.numeric(substr(esoph$agegp, 1, 2)) + 5
levels(esoph$alcgp) <- c(20,60,100,140)
esoph$alcI <- as.numeric(substr(esoph$alcgp,1,4))
levels(esoph$tobgp)[1] <- "00-09"
esoph$tobI <- as.numeric(substr(esoph$tobgp,1,2)) + 5

Fit a Linear Model to the Data

Due to the relatively small sample size of the raw data and unequal sample sizes for each category, no smooth trends were present. In the raw data the prevalence of cancer for 65-74 year old people who drink 2 drinks a day and smoke 20 cigarettes was 22%, while that for people who smoke more than 30 cigarettes a day was 0%.

To make this tool report usable data, I fit a linear model to the data, using the numeric representation of the factors as the features.

# fit a linear model
fit <- lm(rate ~ ageI + alcI + tobI, data = esoph)

App Interface

Since we are using a linear model, the user can enter their actual age rather than selecting from the categories. In the data the alcohol and tobacco consumption are in grams / day. To calculate the risk I assume that a cigarette contains 1 gram of tobacco and a drink contains 10 grams of alcohol.

Once the user has entered the data into the inputs in the application, the data is fed into the linear model and used to generate an estimate of the user's risk of having esophogeal cancer.

I use my actual data as an example:

age <- 40
alc <- 0 * 10
tob <- 0

Estimating Risk

The user's data is fed into the linear model, which uses a prediction interval to generate a confidence interval. Since the chance of having cancer is always greater than 0, I round any numbers less than or equal to 0 up to 0.1.

predict(fit, newdata=data.frame(ageI = age, alcI = alc, tobI=tob), 
        interval="prediction")
##          fit        lwr       upr
## 1 -0.1026984 -0.3426435 0.1372466

While accuracy of the model is questionable due to the vagaries of the raw data, it would be trivial to incorporate other data into the model, preferably from larger studies.

I hope this application will show people how reducing their tobacco and alcohol consumption can affect their risks for esophogeal cancer.