library("RCurl", lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
## Loading required package: bitops
URL<-getURL("http://www.amstat.org/publications/jse/datasets/cigarettes.dat.txt",ssl.verifypeer=FALSE)
cigs<-read.table(text=URL)
names(cigs)<-c("brand","tar","nicotene","weight","CO")
summary(cigs)
## brand tar nicotene weight
## Alpine : 1 Min. : 1.00 Min. :0.1300 Min. :0.7851
## Benson&Hedges: 1 1st Qu.: 8.60 1st Qu.:0.6900 1st Qu.:0.9225
## BullDurham : 1 Median :12.80 Median :0.9000 Median :0.9573
## CamelLights : 1 Mean :12.22 Mean :0.8764 Mean :0.9703
## Carlton : 1 3rd Qu.:15.10 3rd Qu.:1.0200 3rd Qu.:1.0070
## Chesterfield : 1 Max. :29.80 Max. :2.0300 Max. :1.1650
## (Other) :19
## CO
## Min. : 1.50
## 1st Qu.:10.00
## Median :13.00
## Mean :12.53
## 3rd Qu.:15.40
## Max. :23.50
##
I would use a scatter plot to plot the relationship between CO and a numeric variable.
library(ggplot2)
attach(cigs)
plot(tar,CO)
The relationship between tar and carbon monoxide does look linear. I would feel comfortable using a linear model to predict the amount of CO emitted by a cigarette brand given its tar content.
cor(cigs$CO,cigs$tar)
## [1] 0.9574853
There appears to be a strong, positive correlation between tar and carbon monoxide. There is also one outlier, but I think it may not have much influence on the least squares line, and if we were to apply a least squares line, I would expect the variation around the line to be roughly constant.
u<-"https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script<-getURL(u,ssl.verifypeer=FALSE)
eval(parse(text=script))
plot_ss(x=cigs$tar,y=cigs$CO)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2.743 0.801
##
## Sum of Squares: 44.869
u<-"https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script<-getURL(u,ssl.verifypeer=FALSE)
eval(parse(text=script))
plot_ss(x=cigs$tar,y=cigs$CO)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2.743 0.801
##
## Sum of Squares: 44.869
The sum of squares for my first least squares line was 65.1, while the sum of squares for the line that I intentionally fit poorly was 131.096. I am not surprised by my results; the sum of squares confirms that the line I intended to fit to the data is a better fit than the one that I purposely drew to not fit.
The smallest sum of squares that I was able to achieve was 46.502. I tried to change the intercept to see how it would influence the sum of squares, and from there I tried to place the second point where I thought I would end up closest to the greatest number of points.
m1<-lm(CO~tar,data=cigs)
summary(m1)
##
## Call:
## lm(formula = CO ~ tar, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1124 -0.7167 -0.3754 1.0091 2.5450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74328 0.67521 4.063 0.000481 ***
## tar 0.80098 0.05032 15.918 6.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9132
## F-statistic: 253.4 on 1 and 23 DF, p-value: 6.552e-14
b1<-sd(CO)/sd(tar)*cor(tar,CO)
b1
## [1] 0.800976
b0=mean(CO)-b1*mean(tar)
b0
## [1] 2.743278
The values I’ve calculated for intercept (b0) and slope (b1) do match R’s summary values.
The slope tells us that for every 1 unit change in cigarette tar, we would expect a .80098 unit change in carbon monoxide emitted into the environment.
m2<-lm(CO~weight,data=cigs)
summary(m2)
##
## Call:
## lm(formula = CO ~ weight, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.524 -2.533 0.622 2.842 7.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.795 9.722 -1.213 0.2373
## weight 25.068 9.980 2.512 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared: 0.2153, Adjusted R-squared: 0.1811
## F-statistic: 6.309 on 1 and 23 DF, p-value: 0.01948
plot(weight,CO)
Judging by the scatter of the points and the multiple R-squared of model 2, I would say that model 1 is a better fit. Model 1 has a visible linear relationship and its multiple R-squared is larger than model 2’s, meaning that model 1 explains more of the variation in the points.
qplot(tar,CO,data=cigs)
ggplot(cigs,aes(tar,CO))+geom_point()+geom_smooth(method="lm")
ggplot(cigs,aes(tar,CO))+geom_point()+geom_smooth(method="lm",se=FALSE)
If I saw the least squares regression line and not the data, I would predict that a cigarette containing 15 mg of tar would emit slightly less than 15 mg of CO. This prediction is pretty accurate, the data point at 15 mg of tar is at 15 mg of CO. I would estimate that the residual is something like .2.
qplot(tar,m1$residuals) + geom_hline(yintercept=0,linetype=3)
I think there is a slight positive trend in the residuals, suggesting that not all of the carbon monoxide variation is explained by CO~tar.
qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(m1$residuals)
qqline(m1$residuals)
Based on the histogram and the normal probability plot, the distribution of the residuals do seem nearly normal.
The constant variability condition does seem to be met. There does seem to be a weak trend in the variability, but it’s difficult to make a judgement since there are so few observations.
m3<-lm(CO~nicotene,data=cigs)
summary(m3)
##
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3273 -1.2228 0.2304 1.2700 3.9357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6647 0.9936 1.675 0.107
## nicotene 12.3954 1.0542 11.759 3.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8512
## F-statistic: 138.3 on 1 and 23 DF, p-value: 3.312e-11
qplot(nicotene,CO,data=cigs)
ggplot(cigs,aes(nicotene,CO))+geom_point()+geom_smooth(method="lm")
ggplot(cigs,aes(nicotene,CO))+geom_point()+geom_smooth(method="lm", se=FALSE)
At a glance, there does seem to be a linear relationship between nicotene and CO, the linear relationship does not look as strong however, as the relationship between tar and CO.
Based on the R squared values from the model summaries, CO~tar is a better model than CO~nicotene. The R squared value for carbon monoxide and tar is .9168, which means that there was a reduction of 91.68% in the data variation by modeling CO as a response to tar content. The R squared value for CO~nicotene was .8574, so this model did not reduce as much of the vartiation as CO~tar.
Of the three variables we’ve looked at, tar best predicts CO. The relationship between tar and CO appears linear by looking at the graph of CO~tar, and the R squared value for this model confirms that it is the best of the three models.
qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(m1$residuals)
qqline(m1$residuals)