Developing Data Products Course Project, Part 2

author : Farhad M.

date: 2015-10-23

This presentation gives a brief, five-slide overview of the second part of the project for the Developing Data Products course.

Introduction

  • This project analyzes how age group and daily alcohol and tobacco consumption relate to (o)esophageal cancer. The analysis is based on a case-control study of (o)esophageal cancer carried out in Ille-et-Vilaine, France.
  • We have developed a predictor tool that takes an age group and daily alcohol and tobacco consumption levels as inputs and predicts the probability of esophageal cancer.
  • The user chooses an age group and levels of alcohol and tobacco consumption, and gets the probability value as the output.

Source: Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.

Looking at the Dataset

  • The esoph dataset comes from R's built-in datasets package:
head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7

The ncases (number of cases) and ncontrols (number of controls) counts are recorded for each combination of three features: age group, daily alcohol consumption, and daily tobacco consumption.
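As a quick check of the structure described above, the following sketch inspects the three factors and totals the counts per age group. It uses only base R. The name by_age is introduced here for illustration and does not appear in the original slides.

```r
# esoph ships with base R (datasets package), so no extra packages are needed
data(esoph)

# Three ordered factors (agegp, alcgp, tobgp) plus two count columns
str(esoph)

# Levels available for each feature
levels(esoph$agegp)
levels(esoph$alcgp)
levels(esoph$tobgp)

# Total cases and controls per age group (by_age is an illustrative name)
by_age <- aggregate(cbind(ncases, ncontrols) ~ agegp, data = esoph, FUN = sum)
print(by_age)
```

Each row of by_age corresponds to one age group, which gives a first impression of how the counts are distributed before any model is fit.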

A Primary Analysis

Re-arrange the data into a contingency table (tt) for a mosaic plot:

mosaicplot(tt, main = "esoph data set", color = TRUE)

Figure: mosaic plot of the esoph data set.
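The slide does not show how tt is built. One possible construction, an assumption rather than the author's own code, cross-tabulates both count columns with xtabs() so that cases and controls form a fourth dimension of the table:

```r
data(esoph)

# Assumed construction of tt (not taken from the slides):
# a matrix on the left-hand side of xtabs() adds one table dimension
# whose levels are the matrix columns, here ncases and ncontrols.
tt <- xtabs(cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp, data = esoph)

# Dimensions: age group x alcohol group x tobacco group x (cases, controls)
dim(tt)

# Same call as on the slide
mosaicplot(tt, main = "esoph data set", color = TRUE)
```

Combinations of age, alcohol and tobacco group that are absent from esoph appear as zero cells in tt.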

Fitting A Model

  1. Two linear models are fit, one for each response (ncases and ncontrols), using the three features described above as predictors. unclass() converts each factor to its integer level codes:
    • mca <- lm(ncases ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)
    • mco <- lm(ncontrols ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph)
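  2. Two equations are formed for predicting ncases and ncontrols. Here agegp, tobgp and alcgp are the integer level codes of the chosen groups, and the coefficients follow the order of the terms in the model formula (intercept, agegp, tobgp, alcgp):
    • ncases <- mca$coefficients[1] + mca$coefficients[2]*agegp + mca$coefficients[3]*tobgp + mca$coefficients[4]*alcgp
    • ncontrols <- mco$coefficients[1] + mco$coefficients[2]*agegp + mco$coefficients[3]*tobgp + mco$coefficients[4]*alcgp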
  3. Finally, the probability is predicted based on the estimated number of cases and controls: