The data

library(RCurl)
## Loading required package: bitops
library(ggplot2)


cigs = read.table("https://ww2.amstat.org/publications/jse/datasets/cigarettes.dat.txt")
names(cigs)<-c("brand","tar","nicotene","weight","CO")
summary(cigs)
##            brand         tar           nicotene          weight      
##  Alpine       : 1   Min.   : 1.00   Min.   :0.1300   Min.   :0.7851  
##  Benson&Hedges: 1   1st Qu.: 8.60   1st Qu.:0.6900   1st Qu.:0.9225  
##  BullDurham   : 1   Median :12.80   Median :0.9000   Median :0.9573  
##  CamelLights  : 1   Mean   :12.22   Mean   :0.8764   Mean   :0.9703  
##  Carlton      : 1   3rd Qu.:15.10   3rd Qu.:1.0200   3rd Qu.:1.0070  
##  Chesterfield : 1   Max.   :29.80   Max.   :2.0300   Max.   :1.1650  
##  (Other)      :19                                                    
##        CO       
##  Min.   : 1.50  
##  1st Qu.:10.00  
##  Median :13.00  
##  Mean   :12.53  
##  3rd Qu.:15.40  
##  Max.   :23.50  
## 

Exercise 1:

ggplot(cigs, aes(tar, CO)) +
  geom_point((aes(colour = brand))) +
  geom_smooth(method ="lm") +
  labs(title = "CO2 vs Tar by Brand of cigarettes ")

cor(cigs$CO, cigs$tar)
## [1] 0.9574853

The relationship does look linear. Looking at this graph, without running any other tests, I would say that I would be comfortable using a linear model to predict the amount of carbon monoxide in a brand of cigarette given the amount of tar it contained.


Sum of squared residuals

Exercise 2:
When looking at the amount of tar in a cigarette vs the amount of carbon monoxide that it releases into the atmosphere it is strong, positive, and linear realtionship. As tar in the cigarettes increases so does the amount of CO released into the atmosphere. One brand of cigarettes, BullDurham, was an outlier data point with much higher levels of both CO2 and tar than most of the others and also seemed to have high leverage, since it was overestimated by the trend line.

u <- "https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script <- RCurl::getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
plot_ss(x = cigs$tar, y = cigs$CO)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       2.743        0.801  
## 
## Sum of Squares:  44.869

Exercise 3:

lm(formula = y ~ x, data = pts)

Coefficients: (Intercept) x
0.6635 0.7605

Sum of Squares: 211.773

The sum of the Squares of the second time I ran the code to a line that did not fit well made the sum of squares skyrocket. They have a difference of 166.17. I’m not surprised by this result because the worst fit the line the higher the squared residuals and since it is squared by each point it is no surprise it is so much higher than the better fit line.

Sum_sqr_good <- 44.869
Sum_sqr_bad <- 211.773
Diff_Sum_sqrs <- Sum_sqr_bad - Sum_sqr_good
Diff_Sum_sqrs
## [1] 166.904

Exercise 4:
Oddly enough, the first time I tried this was the best I could get. I tried many more times to get a Sum of Squares better 44.869 with no luck. The adjustments I made to try to replicate the first line I drew was to pick points near each other that were in a sloping position or try to pick two points that were on the x=y line.


The linear model

m1 <- lm(CO ~ tar, data = cigs)
summary(m1)
## 
## Call:
## lm(formula = CO ~ tar, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1124 -0.7167 -0.3754  1.0091  2.5450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74328    0.67521   4.063 0.000481 ***
## tar          0.80098    0.05032  15.918 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9132 
## F-statistic: 253.4 on 1 and 23 DF,  p-value: 6.552e-14

Exercise 5:

r <- cor(cigs$CO, cigs$tar)
SD_CO <- sd(cigs$CO)
SD_tar <- sd(cigs$tar)
B_1 <- SD_CO/SD_tar * r
mean_CO <- mean(cigs$CO)
mean_tar <- mean(cigs$tar)
B_0 <- mean_CO - B_1 * mean_tar
B_0
## [1] 2.743278
B_1
## [1] 0.800976
summary(m1)
## 
## Call:
## lm(formula = CO ~ tar, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1124 -0.7167 -0.3754  1.0091  2.5450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74328    0.67521   4.063 0.000481 ***
## tar          0.80098    0.05032  15.918 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9132 
## F-statistic: 253.4 on 1 and 23 DF,  p-value: 6.552e-14

Exercise 6:

The slope is 0.800976, this tells us that for every 1 more milligram of tar in the cigarette there will be 0.800976 more milligrams of CO released into the atmosphere.

Exercise: 7

r_m2 <-cor(cigs$CO, cigs$weight)
SD_weight <- sd(cigs$weight)
B_1_m2 <- SD_CO/SD_weight *r_m2
mean_weight <- mean(cigs$weight)
B_0_m2 <- mean_CO - B_1_m2 * mean_weight
B_0_m2
## [1] -11.79527
B_1_m2
## [1] 25.0682
m2 <- lm(CO ~ weight, data = cigs)
summary(m2)
## 
## Call:
## lm(formula = CO ~ weight, data = cigs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.524 -2.533  0.622  2.842  7.268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -11.795      9.722  -1.213   0.2373  
## weight        25.068      9.980   2.512   0.0195 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared:  0.2153, Adjusted R-squared:  0.1811 
## F-statistic: 6.309 on 1 and 23 DF,  p-value: 0.01948

-11.79527(B_0_m2) = mean_CO(12.528) - B_1_m2(25.0682) * mean_weight(0.970284)

21.53% of CO emission is explained by the weight of the cigarette. I would trust the m1 model much more than the m2 model. This is because in the m1 model 91.68% of the CO emission is explained by the amount of tar in the cigarettes, which is a much more accurate predictor.


Prediction and prediction errors

qplot(tar, CO, data=cigs)

ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm")

ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)

Exercise 8:

If I did not see the data and just went off of the least squares regression line, it would seem that for a cigarette with 15mg of tar it would emit a little bit under 15mg of CO. This is a pretty accurate prediction because there is an actual data point on (15,15). So this would be a slight underestimate, by what looks like 0.5miligrams.


Model diagnostics

qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)

Exercise 9:

There is a slight pattern in the residuals plot in an upward sloping direction, this indicates that the linearity of the relationship between CO content and tar is, as assumed earlier, positive, linear, and strong.

qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot

Exercise 10:

The nearly normal residuals condition seems to be met here. The residuals are normally distributed, being centered at 0. Yet, there are outliers around 1 and -0.4, though besides that they seem to be distributed evenly.

Exercise 11:

Based on the plot in (1), it seems that the constant variability condition appears to be met. This is because the data is spread all over the place but still appears to be slightly linear and positive, there does not seem to be any other type of grouping, such as a quadrartic, which would warrent another test with a different type of model.


Comparing linear models with other variables

Exercise 12:

ggplot(cigs, aes(nicotene, CO)) +
  geom_point((aes(colour = brand))) +
  geom_smooth(method ="lm") +
  labs(title = "CO vs Nicotene by Brand of cigarettes ")

There seems to be a slight positive linear relationship between CO and Nicotine, though from first glance it is not as strong as CO vs tar.

Exercise 13:

m3 <- lm(CO ~ nicotene, data = cigs)
summary(m3)
## 
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3273 -1.2228  0.2304  1.2700  3.9357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.6647     0.9936   1.675    0.107    
## nicotene     12.3954     1.0542  11.759 3.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8512 
## F-statistic: 138.3 on 1 and 23 DF,  p-value: 3.312e-11
summary(m1)
## 
## Call:
## lm(formula = CO ~ tar, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1124 -0.7167 -0.3754  1.0091  2.5450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74328    0.67521   4.063 0.000481 ***
## tar          0.80098    0.05032  15.918 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9132 
## F-statistic: 253.4 on 1 and 23 DF,  p-value: 6.552e-14

Using the multiple R-squared from both model summaries it seems that tar is a better predictor of CO emissions than nicotine is. That is because 91.68% of the CO emission is explained by the amount of tar in the cigarettes and only 85.74% is explained by the nicotine in them.

Exercise 14:

The variable that best predicts CO emission from the data collected and tests run, is tar. When looking at the graphs more of the data points stay closer around the trend line than the other two variables, they have a constant, strong, linear, and positive trend. When we look at the multiple R-squared it also backs this assumption up because it has the highest prediction rate for CO emissions out of the three of them; 91.68% of CO emissions are explained by tar.

Exercise 15:

qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)

qplot(m1$residuals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals)