library(RCurl)
## Loading required package: bitops
URL <-getURL("http://www.amstat.org/publications/jse/datasets/cigarettes.dat.txt",
ssl.verifypeer=FALSE)
cigs <-read.table(text=URL)
names(cigs)<-c("brand","tar","nicotene","weight","CO")
summary(cigs)
## brand tar nicotene weight
## Alpine : 1 Min. : 1.00 Min. :0.1300 Min. :0.7851
## Benson&Hedges: 1 1st Qu.: 8.60 1st Qu.:0.6900 1st Qu.:0.9225
## BullDurham : 1 Median :12.80 Median :0.9000 Median :0.9573
## CamelLights : 1 Mean :12.22 Mean :0.8764 Mean :0.9703
## Carlton : 1 3rd Qu.:15.10 3rd Qu.:1.0200 3rd Qu.:1.0070
## Chesterfield : 1 Max. :29.80 Max. :2.0300 Max. :1.1650
## (Other) :19
## CO
## Min. : 1.50
## 1st Qu.:10.00
## Median :13.00
## Mean :12.53
## 3rd Qu.:15.40
## Max. :23.50
##
I would use a scatter plot to examine the relationship between two variables in this data set. The plot I developed for tar vs. CO is below.
attach(cigs)
plot(tar, CO)
Based on the plot, I feel comfortable saying that the relationship between tar and CO appears to be linear. As such, I would also feel justified in predicting the CO level of a cigarette brand if I knew the tar level. To confirm the linear nature of this plot, I looked at the correlation coefficient.
cor(cigs$CO, cigs$tar)
## [1] 0.9574853
Based on the scatter plot above, the data appear to have a strongly positive correlation. As the tar content increases, so does the CO content. The points are clustered between 7 and 17 primarily, but the points that fall outside this range still fit the overall trend.
To obtain a poor fitting line, I selected two points that were nearly stacked one above the other. The sum I obtained was over 7300.Though I cannot say I was surprised that the result was large, I was surprised at the magnitude of the least squares. The first time I fit a line it was in the 100 range. However, this is much larger.
Click two points to make a line. Call: lm(formula = y ~ x, data = pts) Coefficients: (Intercept) x
-15.84 3.20
Sum of Squares: 7352.671
The smallest least squares value I obtained was 49.505. Based on some of the homeworks posted prior to mine, it seems that my value is in line with my classmates. Furthermore, the known least squares value is 43, so I think 49 is quite close.
Click two points to make a line. Call: lm(formula = y ~ x, data = pts)
Coefficients: (Intercept) x
1.8924 0.8772
Sum of Squares: 49.505
u <- "https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script <- getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
plot_ss(x = cigs$tar, y = cigs$CO)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2.743 0.801
##
## Sum of Squares: 44.869
m1 <- lm(CO ~ tar, data = cigs)
summary(m1)
##
## Call:
## lm(formula = CO ~ tar, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1124 -0.7167 -0.3754 1.0091 2.5450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74328 0.67521 4.063 0.000481 ***
## tar 0.80098 0.05032 15.918 6.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9132
## F-statistic: 253.4 on 1 and 23 DF, p-value: 6.552e-14
b1 = (sy/sx)R
b1 <- ((sd(CO))/(sd(tar)))*cor(tar, CO)
b1
## [1] 0.800976
b0 = y(mean) - b1x(mean)
b0 <- (mean(CO))-(b1*mean(tar))
b0
## [1] 2.743278
These values are matched with those produced earlier with lm.
The slope indicates that for every 1 unit increase in tar, the CO content will increase by about 0.8 units.
m2<- lm(CO ~ weight, data = cigs)
summary(m2)
##
## Call:
## lm(formula = CO ~ weight, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.524 -2.533 0.622 2.842 7.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.795 9.722 -1.213 0.2373
## weight 25.068 9.980 2.512 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared: 0.2153, Adjusted R-squared: 0.1811
## F-statistic: 6.309 on 1 and 23 DF, p-value: 0.01948
Based on this output, the line’s equation is: y^= -11.87 + 25.07*weight
From the given information about R squared, it can be used as a measure of variability in a model. It “represents the proportion of variability in the response variable that is explained by the explanatory variable.” For m1, the R squared value is 0.9168. Therefore, 91.68% of the variability in carbon monoxide content is explained by the amount of tar in the cigarette. However, for m2, the R scquared value is 0.2153. Therefore, 21.53% of the variability in carbon monoxide content is explained by the weight of the cigarette. Based on this information, the first model is a better predcictor of CO content. The higher R squared value is preferable.
library(ggplot2)
qplot(tar, CO, data=cigs)
ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm")
ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)
Looking at only the least squares regression line and not the actual data, I would predict around 14.9 mg emitted from a cigarette with 15 mg of tar. This represents a very slight underestimation of roughly 0.1 mg perhaps.
qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)
There seems to be a positive trend, however it is not very strong. This confirms the linearity of the relationship between tar and CO.
qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
The histogram does not show a pattern and the normal probability plot is evenly distributed. These factors indicate that the normal residuals conditions appear to have been met.
Yes, the constant variability condition seems to have been met because there is not a strong association in the residuals plot.
There does seem to be a linear relationship between nicotene and CO.
attach(cigs)
## The following objects are masked from cigs (pos = 4):
##
## brand, CO, nicotene, tar, weight
plot(CO, nicotene)
m3 <- lm(CO ~ nicotene, data = cigs)
summary(m3)
##
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3273 -1.2228 0.2304 1.2700 3.9357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6647 0.9936 1.675 0.107
## nicotene 12.3954 1.0542 11.759 3.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8512
## F-statistic: 138.3 on 1 and 23 DF, p-value: 3.312e-11
The R squared value given here is 0.8574. This means that 85.74% of the variability in carbon monoxide content is explained by the amount of nicotene in the cigarette. However, the R squared value for the relationship between tar and CO gives a value of 0.9168. Therefore, 91.68% of the variability in carbon monoxide content is explained by the amount of tar in a cigarette. Based on these two values, tar appears to be a more appropriate predictor of CO content.
The best predictor of CO based on R squared values is tar. Tar has a 0.9168 R squared value, while weight has a 0.2153 R squared value. As just shown above, the R squared for nicotene is 0.8574. Therefore, the highest value is for tar, making it the best predictor of CO.
I concluded that the best predictor of CO is tar. Therefore, these model diagnostics are for the tar vs. CO relationship.
qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)
qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(m1$residuals)
qqline(m1$residuals)
These model diagnostics all confirm that tar is an effective predictor of CO content. The residuals plot is randomly distributed, the histogram does not have a pattern, and the normal probability plot is evenly distributed around the line. These tools each confirm that tar is the best predictor of CO.