library(RCurl)
## Loading required package: bitops
URL <-getURL("http://www.amstat.org/publications/jse/datasets/cigarettes.dat.txt",
ssl.verifypeer=FALSE)
cigs <-read.table(text=URL)
names(cigs)<-c("brand","tar","nicotene","weight","CO")
summary(cigs)
## brand tar nicotene weight
## Alpine : 1 Min. : 1.00 Min. :0.1300 Min. :0.7851
## Benson&Hedges: 1 1st Qu.: 8.60 1st Qu.:0.6900 1st Qu.:0.9225
## BullDurham : 1 Median :12.80 Median :0.9000 Median :0.9573
## CamelLights : 1 Mean :12.22 Mean :0.8764 Mean :0.9703
## Carlton : 1 3rd Qu.:15.10 3rd Qu.:1.0200 3rd Qu.:1.0070
## Chesterfield : 1 Max. :29.80 Max. :2.0300 Max. :1.1650
## (Other) :19
## CO
## Min. : 1.50
## 1st Qu.:10.00
## Median :13.00
## Mean :12.53
## 3rd Qu.:15.40
## Max. :23.50
##
I would use a scatterplot to display the relationship between CO and one of the other numerical variables, such as weight, nicotene, or tar. Plotting this relationship using the variable tar:
library(ggplot2)
attach(cigs)
plot(tar, CO)
The relationship appears to be linear between tar and CO emissions, with the exception of one outlier at the top right of the graph. If I knew how much tar was in a given brand of cigarettes, I would feel comfortable using a linear model to predict the carbon monoxide content of that brand.
cor(cigs$CO, cigs$tar)
## [1] 0.9574853
There seems to be a strong, positive, linear correlation between the two variables. There is one outlier though, at the top-right-hand corner of the graph.
u <- "https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script <- getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
plot_ss(x = cigs$tar, y = cigs$CO)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2.743 0.801
##
## Sum of Squares: 44.869
When I selected the first data point on the scatterplot and the last data point, the squared residuals this time were much larger, and the sum of squares is 186.404, which is much larger (and therefore much worse) than the sum of squares of the first graph, where the sum of squares was only 44.869.
After running the above code several more times, the smallest sum of squares value I was able to get was 49.602. This is very close to the actual lowest sum of squares value of 44.869. The first sum of squares that I obtained, with a value of 186.404, this is a much better result. To lower the RSS value in each trial, I tried picking two points that were more and more representative of the data as a whole each trial. This involved “eyeballing it”, so that the two points I chose would create a line that minimized the distance from all the points to the line (the residual values).
m1 <- lm(CO ~
tar, data = cigs)
summary(m1)
##
## Call:
## lm(formula = CO ~ tar, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1124 -0.7167 -0.3754 1.0091 2.5450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74328 0.67521 4.063 0.000481 ***
## tar 0.80098 0.05032 15.918 6.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9132
## F-statistic: 253.4 on 1 and 23 DF, p-value: 6.552e-14
This gets us the regression line and its summary.
To calculate the B1, the following formula was used based off of the slides: B1 = (sy/sx)R, where sy = standard deviation in y (CO), sx = standard deviation in x (tar), and cor = R.
B1 <- (sd(CO)/sd(tar))*cor(tar, CO)
B1
## [1] 0.800976
To calculate the B0, the following formula was used based off of the slides: b0 = y¯ - B1x¯, where y¯ = sample mean of y (CO), x¯ = sample mean of x (tar), and B1 = slope of the regression.
B0 <- mean(CO) - (B1*mean(tar))
B0
## [1] 2.743278
To confirm that my calculated values of B0 and B1 match up to the fitted model, the following code was used again:
m1 <- lm(CO ~
tar, data = cigs)
summary(m1)
##
## Call:
## lm(formula = CO ~ tar, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1124 -0.7167 -0.3754 1.0091 2.5450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74328 0.67521 4.063 0.000481 ***
## tar 0.80098 0.05032 15.918 6.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9132
## F-statistic: 253.4 on 1 and 23 DF, p-value: 6.552e-14
The calculated intercept coefficient (B0) seems to be exactly the same number as the m1 value of B0, and the calculated tar coefficient (B1) is only 0.000004 less than the m1 value of B1, which is almost exactly the same number.
The slope tells us that in the context of the relationship between the amount of carbon monoxide emitted into the environment and the amount of tar in the cigarette, for every one unit increase in tar (in mg), we would expect the CO content to increase on average by about 0.8 mg.
m2 <- lm(CO ~
weight, data = cigs)
summary(m2)
##
## Call:
## lm(formula = CO ~ weight, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.524 -2.533 0.622 2.842 7.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.795 9.722 -1.213 0.2373
## weight 25.068 9.980 2.512 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared: 0.2153, Adjusted R-squared: 0.1811
## F-statistic: 6.309 on 1 and 23 DF, p-value: 0.01948
Fits a new model m2 that uses weight to predict CO. based off of the summary of m2, the equation of the new regression line would be:
y^ = -11.795 +25.068(weight)
According to the summary, the Multiple R-squared value is 0.2153, which is significantly lower than the Multiple R-squared value of 0.9168 for m1. The R-squared value tells us what percent of variability in the response variable is explained by the model. The closer the R-squared value is to 1, the better the model. Since the R-squared value for m2 is 0.2153, then 21.53% of the variability in CO emissions is explained by the linear relationship between the weight of the cigarette and CO emissions. I would trust m1 more to predict CO emissions becuase 91.68% of the variability in CO emissions is explained by the linear relationship between the amount of tar and CO emissions, which is much a much higher R-squared value.
ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)
Creates a scatterplot of CO versus tar with the least squares line laid on top.
If I saw the least squares regression line and not the actual data, I would predict about 14.8 mg of CO to be emitted from a cigarette with 15 mg of tar. This is an underestimate my 0.2 mg, since the residual for this prediction is:
ei = yi - y^i, = 15.0 - 14.8 = 0.2 mg.
qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)
Verifies that the relationship between CO content and amount of tar is linear.
In general, in the resdual plot there aren’t clear patterns, which is a good sign. In addition, the data points seem to be relatively symmetrically distributed above and below the y = 0 line, which is also indicative of a linear relationship between CO content and tar.
qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
Based on the histogram and the normal probability plot, the nearly normal residuals condition appears to be met. There seems to be virtually no unusual observations that don’t follow the trend of the rest of the data, which implies that the residuals are nearly normal.
Based on the plot in (1), the constant variability condition appears to be met. The variability of the residuals around the 0 line are roughly constant, which satisfies the constant variability condition.
plot(nicotene, CO)
Produces a scatteplot of CO and nicotene.
m1 <- lm(CO ~ nicotene, data = cigs)
summary(m1)
##
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3273 -1.2228 0.2304 1.2700 3.9357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6647 0.9936 1.675 0.107
## nicotene 12.3954 1.0542 11.759 3.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8512
## F-statistic: 138.3 on 1 and 23 DF, p-value: 3.312e-11
Fits the linear model.
At first glance, there seems to be a linear relationship between nicotene content and CO emissions.
According to the summary, the Multiple R-squared value is 0.8574, which is slightly lower than the Multiple R-squared value of 0.9168 for the tar model. Since the R-squared value for the nicotene model is lower than that of the tar model R-squared value, I would trust the tar model more to predict CO emissions becuase 91.68% of the variability in CO emissions is explained by the linear relationship between the amount of tar and CO emissions, which is a higher R-squared value than 85.74%.
Of all of the R-squared values, where an R-squared value closest to 1 is considered the best model of CO emissions due to the predictor variable, the R-squared value for the tar model is the highest, at 0.9168 (91.68%). Therefore, I conclude that tar best predicts CO emissions out of the tar, nicotene, and weight predictor variables.
ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)
ggplot(cigs, aes(weight, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)
ggplot(cigs, aes(nicotene, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)
Since the variability of points around the least squares line are the most constant in the CO vs nicotene, and the relationship between the explanatory variable (tar) and the response variable (CO) appears to be very linear, tar appears to be the best predictor of CO emissions.
qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)
In general, in the resdual plot there aren’t clear patterns, which is a good sign. In addition, the data points seem to be relatively symmetrically distributed above and below the y = 0 line, which is also indicative of a linear relationship between CO content and tar.
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
Based on the normal probability plot (Normal Q-Q Plot), the nearly normal residuals condition appears to be met. There seems to be virtually no unusual observations that don’t follow the trend of the rest of the data, which implies that the residuals are nearly normal. Additionally, the constant variability condition appears to be met. The variability of the residuals around the 0 line are roughly constant, which satisfies the constant variability condition. The linearity condition was already known to be met from part (3), since the relationship between the explanatory variable (tar) and the response variable (CO) appears to be very linear. All of these tests and results seem to prove that the tar predictor variable seems to be the best predictor of CO emissions.