Applied Regression Analysis

Constructing the linear model

Select a dataset from “100+ Interesting Data Sets for Statistics”: http://bit.ly/1uQIVLU
- I chose this one: http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html
Select one independent variable and select one dependent variable
- Independent variable: CIG = Number of cigarettes purchased (hundreds per capita)
- Dependent variable: LUNG = Deaths per 100K population from lung cancer (age adjusted)
Read the dataset as a data.table or data.frame

smoking.data.table <- read.table("~/Documents/MSSM/Applied_Regression_Analysis/smoking_cancer_data.txt", header=TRUE, sep="\t")

Describe, in detail, H0, your null hypothesis
- H0: statewide deaths per 100K population from lung cancer are independent of statewide cigarette consumption. In the linear model, this means the slope of LUNG on CIG is zero.
Describe, in detail, your (linear) model
- The linear model uses the number of cigarettes purchased (hundreds per capita) to predict the lung cancer death rate (per 100K population) in each state in 1960.
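The model and null hypothesis described above can be written out explicitly. This is a sketch in standard simple-regression notation; the symbols \(\beta_0\), \(\beta_1\), \(\varepsilon_i\) and the two-sided alternative \(H_1\) are assumptions, not taken from the original write-up. The count of 44 observations is 43 states plus the District of Columbia.

```latex
\[
\mathrm{LUNG}_i = \beta_0 + \beta_1\,\mathrm{CIG}_i + \varepsilon_i, \qquad i = 1, \dots, 44
\]
\[
H_0:\ \beta_1 = 0 \qquad \text{versus} \qquad H_1:\ \beta_1 \neq 0
\]
```

The values \(b_0\) and \(b_1\) reported in the model summary are the least-squares estimates of \(\beta_0\) and \(\beta_1\).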
Describe the dataset you selected
- I’ve chosen a data set that contains per capita cigarette purchases and cancer death rates for 43 states and the District of Columbia in 1960. The data are taken from a study published in 1968 on the relationship between smoking and urinary tract cancers.
  - J.F. Fraumeni, “Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States,” Journal of the National Cancer Institute, 41, 1205-1211.
- For each state/district, the cigarette consumption is presented in the form of number (hundreds) of cigarettes purchased per capita in 1960.
  - CIG = Number of cigarettes purchased (hundreds per capita)
- For each state/district, four different types of cancers are presented in the form of deaths per 100K population (adjusted for age).
  - BLAD = Deaths per 100K population from bladder cancer
  - LUNG = Deaths per 100K population from lung cancer
  - KID = Deaths per 100K population from kidney cancer
  - LEUK = Deaths per 100K population from leukemia
- It should be noted that there are two obvious outliers in the cigarette consumption data:
  - Nevada (where non-resident tourists swell the cigarette purchase numbers): CIG=42.40
  - District of Columbia (where commuters from surrounding states purchase cigarettes): CIG=40.46
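One way to check for unusually large values such as these is a simple quartile-based screen. The sketch below uses the common 1.5 x IQR rule; the helper name `flag_outliers` and the multiplier `k` are hypothetical and not part of the original analysis, and whether the rule flags exactly Nevada and the District of Columbia depends on the data.

```r
# Hypothetical helper: flag values lying more than k * IQR outside the quartiles.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)  # first and third quartiles
  fence <- k * (q[2] - q[1])                     # k times the interquartile range
  unname(x < q[1] - fence | x > q[2] + fence)    # TRUE where a value is outside the fences
}

# Usage with the data set read above (assumes smoking.data.table exists):
# smoking.data.table[flag_outliers(smoking.data.table$CIG), ]
```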

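# Read the tab-delimited data file (same file as above)
smoking.data.table <- read.table("~/Documents/MSSM/Applied_Regression_Analysis/smoking_cancer_data.txt", header=TRUE, sep="\t")
# attach() puts the columns (CIG, LUNG, ...) on the search path; the plotting code below relies on this
attach(smoking.data.table)
# Simple linear regression of lung cancer death rate on cigarette purchases
smoking.lm <- lm(LUNG~CIG)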

Plots

Plot the scattergram of your data

plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")

Plot the regression line

plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")
abline(smoking.lm$coef, lwd=3)

Plot the 95% confidence intervals of the regression line, b0 and b1

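# 95% confidence limits for b0 (row 1) and b1 (row 2)
conf <- confint(smoking.lm,level=0.95)
lwr <- conf[,1]  # lower limits of b0 and b1
upr <- conf[,2]  # upper limits of b0 and b1
plot(CIG,LUNG,pch=21, bg='red',main="Lung cancer vs Cigarette consumption")
abline(smoking.lm$coef, lwd=3)
# Each dashed line uses the lower (or upper) limits of both coefficients as intercept and slope.
# They display the coefficient intervals; they are not a pointwise confidence band for the fitted line.
abline(lwr, lty=2, col='red')
abline(upr, lty=2, col='red')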
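The dashed lines above are built from the limits of \(b_0\) and \(b_1\) taken together. An alternative view, sketched here, is the pointwise 95% confidence band for the fitted line itself, obtained from `predict()` with `interval = "confidence"`. The helper name `conf_band` and its arguments are hypothetical, not part of the original analysis.

```r
# Hypothetical helper: fitted line and pointwise confidence band on an even grid.
conf_band <- function(fit, xname, level = 0.95, n = 100) {
  x <- model.frame(fit)[[xname]]                 # predictor values used in the fit
  grid <- data.frame(seq(min(x), max(x), length.out = n))
  names(grid) <- xname                           # predict() matches newdata columns by name
  band <- predict(fit, newdata = grid, interval = "confidence", level = level)
  cbind(grid, as.data.frame(band))               # columns: <xname>, fit, lwr, upr
}

# Usage with the model above (assumes smoking.lm, CIG and LUNG exist):
# band <- conf_band(smoking.lm, "CIG")
# plot(CIG, LUNG, pch = 21, bg = "red", main = "Lung cancer vs Cigarette consumption")
# lines(band$CIG, band$fit, lwd = 3)
# lines(band$CIG, band$lwr, lty = 2, col = "red")
# lines(band$CIG, band$upr, lty = 2, col = "red")
```

Unlike straight lines drawn from the coefficient limits, this band is curved: it is narrowest near the mean of CIG and widens toward the extremes.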

Print a summary of your model

summary(smoking.lm)

## 
## Call:
## lm(formula = LUNG ~ CIG)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.943 -1.656  0.382  1.614  7.561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.4717     2.1407   3.023  0.00425 ** 
## CIG           0.5291     0.0839   6.306 1.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.066 on 42 degrees of freedom
## Multiple R-squared:  0.4864, Adjusted R-squared:  0.4741 
## F-statistic: 39.77 on 1 and 42 DF,  p-value: 1.439e-07
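
The test statistics in this summary can be re-derived from the reported slope estimate and its standard error. The sketch below uses only the rounded numbers printed above, so results agree with the summary only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary output above (rounded as printed)
est <- 0.5291   # slope estimate for CIG
se  <- 0.0839   # standard error of the slope
df  <- 42       # residual degrees of freedom

t_value <- est / se                          # t statistic for H0: slope = 0
p_value <- 2 * pt(-abs(t_value), df)         # two-sided p-value
f_stat  <- t_value^2                         # with one predictor, F equals t squared
ci_b1   <- est + c(-1, 1) * qt(0.975, df) * se  # 95% confidence interval for the slope

print(c(t = t_value, p = p_value, F = f_stat))
print(ci_b1)
```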

Interpret the results of the statistical analysis \(b_0\), \(b_1\) and \(r\)
- \(b_0\) = 6.4717 (the y-intercept; the predicted lung cancer death rate per 100K when cigarette purchases are zero)
- \(b_1\) = 0.5291 (the slope; for each additional hundred cigarettes purchased per capita, the model predicts an additional 0.5291 lung cancer deaths/100K). The slope differs significantly from zero (t = 6.306, p = 1.44e-07), so H0 is rejected.
- \(r^2\) = 0.4864 (the Multiple R-squared), so the correlation between cigarette consumption and lung cancer deaths is \(r = \sqrt{0.4864} \approx 0.697\). Variation in cigarette consumption explains almost half of the variation in lung cancer death rates.
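
As a worked example of these quantities, the sketch below recovers the correlation coefficient from the Multiple R-squared of 0.4864 reported by `summary()` and evaluates the fitted line at one consumption level. The function name `predict_lung` and the input value of 25 (hundreds of cigarettes per capita) are hypothetical choices for illustration.

```r
# Values copied from the summary output (rounded as printed)
b0 <- 6.4717          # intercept
b1 <- 0.5291          # slope
r_squared <- 0.4864   # Multiple R-squared

# In simple linear regression, r has the sign of the slope and magnitude sqrt(R^2)
r <- sign(b1) * sqrt(r_squared)

# Predicted lung cancer deaths per 100K for a given CIG value
predict_lung <- function(cig) b0 + b1 * cig

print(r)
print(predict_lung(25))   # illustrative input only
```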

Applied Regression Analysis - Project #1

John Beaulaurier

February 19, 2015
