Constructing the linear model

  1. Select a dataset from “100+ Interesting Data Sets for Statistics”: http://bit.ly/1uQIVLU
  2. Select one independent variable and select one dependent variable
    • Independent variable: CIG = Number of cigarettes purchased (hundreds per capita)
    • Dependent variable: LUNG = Deaths per 100K population from lung cancer (age adjusted)
  3. Read the dataset as a data.table or data.frame
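# Tab-separated file with a header row; TRUE is used instead of T, which can be reassigned
smoking.data.table <- read.table("~/Documents/MSSM/Applied_Regression_Analysis/smoking_cancer_data.txt", header = TRUE, sep = "\t")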
  1. Describe, in detail, H0, your null hypothesis
    • H0: the statewide lung cancer death rate (deaths per 100K population) does not depend linearly on statewide cigarette consumption, i.e. the slope of the regression of LUNG on CIG is zero.
  2. Describe, in detail, your (linear) model
    • The model is a simple linear regression that uses the number of cigarettes purchased (hundreds per capita) to predict the lung cancer death rate (per 100K population) in each state in 1960.
  3. Describe the dataset you selected
    • I’ve chosen a data set that contains per capita cigarette purchases and age-adjusted cancer death rates for 43 states and the District of Columbia in 1960. The data are taken from a study published in 1968 on the relationship between smoking and urinary tract cancers.
      • J.F. Fraumeni, “Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States,” Journal of the National Cancer Institute, 41, 1205-1211 (1968).
    • For each state/district, cigarette consumption is given as the number of cigarettes (in hundreds) purchased per capita in 1960.
      • CIG = Number of cigarettes purchased (hundreds per capita)
    • For each state/district, four different types of cancers are presented in the form of deaths per 100K population (adjusted for age).
      • BLAD = Deaths per 100K population from bladder cancer
      • LUNG = Deaths per 100K population from lung cancer
      • KID = Deaths per 100K population from kidney cancer
      • LEUK = Deaths per 100K population from leukemia
    • It should be noted that there are two obvious outliers in the cigarette consumption data:
      • Nevada (where non-resident tourists swell the cigarette purchase numbers): CIG=42.40
      • District of Columbia (where commuters from surrounding states purchase cigarettes): CIG=40.46
# smoking.data.table was read in step 3 above; attach() exposes its columns (CIG, LUNG, ...) by name
attach(smoking.data.table)
# Fit the simple linear regression of lung cancer deaths on cigarette purchases
smoking.lm <- lm(LUNG ~ CIG)
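
In symbols, the model and null hypothesis described above can be written as follows. The Greek letters are notation introduced here for the unknown population intercept and slope; \(b_0\) and \(b_1\) are their least-squares estimates, and the index runs over the 44 observations (43 states plus the District of Columbia).

```latex
\begin{aligned}
\text{LUNG}_i &= \beta_0 + \beta_1\,\text{CIG}_i + \varepsilon_i, \qquad i = 1, \dots, 44,\\
H_0 &: \beta_1 = 0 \qquad \text{versus} \qquad H_1 : \beta_1 \neq 0 .
\end{aligned}
```

Under \(H_0\) the expected lung cancer death rate is the same at every level of cigarette consumption, which is what the t-test on the CIG coefficient evaluates.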

Plots

  1. Plot the scattergram of your data
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")

  1. Plot the regression line
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")
abline(smoking.lm$coef, lwd=3)

  1. Plot the 95% confidence intervals of the regression line, b0 and b1
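# confint() gives 95% limits for the coefficients b0 (row 1) and b1 (row 2)
conf <- confint(smoking.lm, level = 0.95)
lwr <- conf[, 1]   # lower limits of (b0, b1)
upr <- conf[, 2]   # upper limits of (b0, b1)
plot(CIG, LUNG, pch = 21, bg = 'red', main = "Lung cancer vs Cigarette consumption")
abline(smoking.lm$coef, lwd = 3)
# Dashed lines pair the lower (resp. upper) limits of b0 and b1; they show the
# coefficient intervals, not a pointwise confidence band for the fitted line
abline(lwr, lty = 2, col = 'red')
abline(upr, lty = 2, col = 'red')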
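
The dashed lines above are built from the coefficient limits returned by confint(). A confidence band for the regression line itself is usually taken from predict() with interval = "confidence". The following is a minimal sketch of that alternative, not the method used in this analysis. The helper name plot_conf_band and its arguments are hypothetical, and the demonstration uses R's built-in cars data only so that the block runs on its own.

```r
# Hypothetical helper (not part of the original analysis): plots the data,
# the fitted line, and a pointwise confidence band for the mean response.
plot_conf_band <- function(fit, data, xname, yname, level = 0.95, n = 100) {
  # Evenly spaced grid over the observed range of the predictor
  x_grid <- seq(min(data[[xname]]), max(data[[xname]]), length.out = n)
  new_data <- data.frame(x_grid)
  names(new_data) <- xname

  # Matrix with columns "fit", "lwr", "upr" for each grid point
  band <- predict(fit, newdata = new_data, interval = "confidence", level = level)

  plot(data[[xname]], data[[yname]], pch = 21, bg = "red",
       xlab = xname, ylab = yname)
  lines(x_grid, band[, "fit"], lwd = 3)
  lines(x_grid, band[, "lwr"], lty = 2, col = "red")
  lines(x_grid, band[, "upr"], lty = 2, col = "red")

  invisible(cbind(x = x_grid, band))
}

# Demonstration on the built-in `cars` data (stopping distance vs speed)
demo_fit  <- lm(dist ~ speed, data = cars)
demo_band <- plot_conf_band(demo_fit, cars, "speed", "dist")
```

For the model in this report, the corresponding call would be plot_conf_band(smoking.lm, smoking.data.table, "CIG", "LUNG"). The resulting band is curved: it is narrowest near the mean of the predictor and widens toward the extremes.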

  1. Print a summary of your model
summary(smoking.lm)
## 
## Call:
## lm(formula = LUNG ~ CIG)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.943 -1.656  0.382  1.614  7.561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.4717     2.1407   3.023  0.00425 ** 
## CIG           0.5291     0.0839   6.306 1.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.066 on 42 degrees of freedom
## Multiple R-squared:  0.4864, Adjusted R-squared:  0.4741 
## F-statistic: 39.77 on 1 and 42 DF,  p-value: 1.439e-07
  1. Interpret the results of the statistical analysis \(b_0\), \(b_1\) and \(r\)
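    • \(b_0\) = 6.4717 (the y-intercept: the predicted lung cancer death rate per 100K when CIG = 0. This is an extrapolation outside the observed range of cigarette purchases, so it should not be read as the death rate "without smoking".)
    • \(b_1\) = 0.5291 (the slope: for each additional hundred cigarettes purchased per capita, the model predicts an additional 0.5291 lung cancer deaths per 100K)
    • \(r\) ≈ 0.70 (the correlation between cigarette consumption and lung cancer deaths, obtained as the square root of \(R^2\) = 0.4864 and positive because the slope is positive). \(R^2\) = 0.4864 means that variation in cigarette consumption explains almost half of the variation in lung cancer death rates.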
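
The quantities in this interpretation can be cross-checked by hand from the summary output. The sketch below uses only the numbers printed by summary(smoking.lm). The variable names are introduced here for illustration, and the confidence interval for the slope is an extra check that the original analysis does not report.

```r
# Values copied from the summary() output above
b1        <- 0.5291   # slope estimate for CIG
se_b1     <- 0.0839   # standard error of the slope
r_squared <- 0.4864   # Multiple R-squared
df_resid  <- 42       # residual degrees of freedom (44 observations - 2)

# In simple linear regression |r| = sqrt(R^2), and r has the sign of the slope
r <- sign(b1) * sqrt(r_squared)

# t statistic for H0: slope = 0; its square is the F statistic on 1 and 42 DF
t_stat <- b1 / se_b1
f_stat <- t_stat^2

# Two-sided p-value and 95% confidence interval for the slope
p_value <- 2 * pt(-abs(t_stat), df_resid)
ci_b1   <- b1 + c(-1, 1) * qt(0.975, df_resid) * se_b1

print(round(c(r = r, t = t_stat, F = f_stat), 3))
print(p_value)
print(round(ci_b1, 3))
```

Up to rounding of the printed inputs, the t and F values agree with the 6.306 and 39.77 shown in the summary, and the interval for the slope excludes zero, consistent with rejecting H0.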