Constructing the linear model
- Select a dataset from “100+ Interesting Data Sets for Statistics”: http://bit.ly/1uQIVLU
- Select one independent variable and select one dependent variable
- Independent variable: CIG = Number of cigarettes purchased (hundreds per capita)
- Dependent variable: LUNG = Deaths per 100K population from lung cancer (age adjusted)
- Read the dataset as a data.table or data.frame
smoking.data.table <- read.table("~/Documents/MSSM/Applied_Regression_Analysis/smoking_cancer_data.txt",header=T,sep="\t")
- Describe, in detail, H0, your null hypothesis
- H0: statewide deaths per 100K population from lung cancer are independent of statewide cigarette consumption. In terms of the linear model, the slope is zero (\(\beta_1 = 0\)).
- Describe, in detail, your (linear) model
- The linear model uses the number of cigarettes purchased (hundreds per capita) in each state in 1960 to predict that state's death rate from lung cancer (per 100K population): \(LUNG = b_0 + b_1 \cdot CIG\), plus an error term.
- Describe the dataset you selected
- I’ve chosen a data set that contains the per capita numbers for smoking and cancer incidence in 43 states and the District of Columbia in 1960. The data is taken from a study published in 1968 on the relationship between smoking and urinary tract cancers.
- J.F. Fraumeni, “Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States,” Journal of the National Cancer Institute, 41, 1205-1211 (1968).
- For each state/district, the cigarette consumption is presented in the form of number (hundreds) of cigarettes purchased per capita in 1960.
- CIG = Number of cigarettes purchased (hundreds per capita)
- For each state/district, death rates for four types of cancer are given as deaths per 100K population (adjusted for age).
- BLAD = Deaths per 100K population from bladder cancer
- LUNG = Deaths per 100K population from lung cancer
- KID = Deaths per 100K population from kidney cancer
- LEUK = Deaths per 100K population from leukemia
- It should be noted that there are two obvious outliers in the cigarette consumption data:
- Nevada (where non-resident tourists swell the cigarette purchase numbers): CIG=42.40
- District of Columbia (where commuters from surrounding states purchase cigarettes): CIG=40.46
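One way to gauge how much these two points drive the fit is to refit the model without them and compare the coefficients. The sketch below is an illustration only: `compare_without_outliers` is a hypothetical helper, and the default cutoff `CIG < 40` is an assumption chosen because both flagged values (42.40 and 40.46) exceed 40. It presumes no other state reaches that level.

```r
# Hypothetical helper (not part of the original analysis): fit LUNG ~ CIG on
# all rows and again on rows below an assumed CIG cutoff, then compare.
compare_without_outliers <- function(data, cig_cutoff = 40) {
  trimmed <- subset(data, CIG < cig_cutoff)
  full_fit <- lm(LUNG ~ CIG, data = data)
  trimmed_fit <- lm(LUNG ~ CIG, data = trimmed)
  # One row per fit: intercept, slope, and number of observations used
  rbind(
    all_rows = c(coef(full_fit), n = nrow(data)),
    trimmed  = c(coef(trimmed_fit), n = nrow(trimmed))
  )
}

# Usage, assuming the data frame has been read as shown earlier:
# compare_without_outliers(smoking.data.table)
```

A large shift in the slope between the two rows would suggest the fit is sensitive to Nevada and the District of Columbia.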
smoking.data.table <- read.table("~/Documents/MSSM/Applied_Regression_Analysis/smoking_cancer_data.txt",header=T,sep="\t")
attach(smoking.data.table)
smoking.lm <- lm(LUNG~CIG)
Plots
- Plot the scattergram of your data
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")

- Plot the regression line
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")
abline(smoking.lm$coef, lwd=3)

- Plot the 95% confidence intervals of the regression line, b0 and b1
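# 95% confidence limits for the coefficients (row 1: intercept b0, row 2: slope b1)
conf <- confint(smoking.lm,level=0.95)
lwr <- conf[,1]  # lower limits of b0 and b1
upr <- conf[,2]  # upper limits of b0 and b1
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")
# Fitted regression line
abline(smoking.lm$coef, lwd=3)
# Lines drawn from the lower and the upper coefficient limits
abline(lwr, lty=2, col='red')
abline(upr, lty=2, col='red')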
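
The dashed lines above pair the lower (and the upper) confidence limits of b0 and b1 into two extra straight lines. A commonly used alternative is the pointwise 95% confidence band for the fitted mean, which `predict()` returns for an `lm` fit when called with `interval = "confidence"`. The following is a minimal sketch of that alternative, not the method used above; `confidence_band` and `n_grid` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: pointwise confidence band for the fitted line LUNG ~ CIG.
confidence_band <- function(data, level = 0.95, n_grid = 100) {
  fit <- lm(LUNG ~ CIG, data = data)
  # Evenly spaced CIG values spanning the observed range
  grid <- data.frame(CIG = seq(min(data$CIG), max(data$CIG), length.out = n_grid))
  # Matrix with columns fit, lwr, upr for the mean response at each grid point
  band <- predict(fit, newdata = grid, interval = "confidence", level = level)
  cbind(grid, band)
}

# Usage, assuming smoking.data.table has been read as above:
# band <- confidence_band(smoking.data.table)
# plot(LUNG ~ CIG, data = smoking.data.table, pch = 21, bg = "red")
# lines(band$CIG, band$fit, lwd = 3)
# lines(band$CIG, band$lwr, lty = 2, col = "red")
# lines(band$CIG, band$upr, lty = 2, col = "red")
```

Unlike the straight dashed lines, this band is curved: it is narrowest near the mean of CIG and widens toward the ends of the observed range.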

- Print a summary of your model
summary(smoking.lm)
##
## Call:
## lm(formula = LUNG ~ CIG)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -6.943  -1.656   0.382   1.614   7.561
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.4717     2.1407   3.023  0.00425 **
## CIG           0.5291     0.0839   6.306 1.44e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.066 on 42 degrees of freedom
## Multiple R-squared: 0.4864, Adjusted R-squared: 0.4741
## F-statistic: 39.77 on 1 and 42 DF, p-value: 1.439e-07
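The estimates, standard errors, and residual degrees of freedom in this summary are enough to reproduce the 95% coefficient intervals by hand. The sketch below copies the rounded values from the output above, so its results can differ from `confint(smoking.lm)` in the last digits; the variable names are illustrative.

```r
# Values copied (rounded) from the summary output above
estimate  <- c("(Intercept)" = 6.4717, CIG = 0.5291)
std_error <- c("(Intercept)" = 2.1407, CIG = 0.0839)
df_resid  <- 42

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# estimate +/- t_crit * SE, the formula behind confint() for an lm fit
ci <- cbind(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)
ci

# Each reported t value is the estimate divided by its standard error
t_value <- estimate / std_error
t_value
```

If the interval for CIG excludes zero, that agrees with rejecting H0 at the 5% level.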
- Interpret the results of the statistical analysis \(b_0\), \(b_1\) and \(r\)
- \(b_0\) = 6.4717 (the y-intercept; the predicted lung cancer deaths per 100K at CIG = 0)
- \(b_1\) = 0.5291 (the slope; for each additional hundred cigarettes purchased per capita, the model predicts an additional 0.5291 lung cancer deaths per 100K)
- \(r^2\) = 0.4864 (the coefficient of determination, reported as Multiple R-squared), so variation in cigarette consumption explains almost half of the variation in lung cancer death rates. The correlation between cigarette consumption and lung cancer deaths is \(r = \sqrt{0.4864} \approx 0.697\), positive because the slope is positive.
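The summary output reports Multiple R-squared = 0.4864. In simple linear regression the correlation coefficient is the signed square root of that value, and it ties back to the slope's t value and the F statistic. The following sketch works only from the rounded numbers printed in the summary; the variable names are illustrative.

```r
r_squared <- 0.4864   # Multiple R-squared from the summary output
slope     <- 0.5291   # b1 from the summary output
df_resid  <- 42       # residual degrees of freedom (n - 2)

# r takes the sign of the slope and the magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)

# Consistency checks: t statistic for the slope, and F = t^2 with one predictor
t_from_r <- r * sqrt(df_resid) / sqrt(1 - r_squared)
f_from_t <- t_from_r^2

round(c(r = r, t = t_from_r, F = f_from_t), 3)
```

Up to rounding, the t and F values recovered this way should agree with the 6.306 and 39.77 shown in the summary.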