library(RCurl)

## Loading required package: bitops

URL <-getURL("http://www.amstat.org/publications/jse/datasets/cigarettes.dat.txt",
ssl.verifypeer=FALSE)
cigs <-read.table(text=URL)
names(cigs)<-c("brand","tar","nicotene","weight","CO")
summary(cigs)

##            brand         tar           nicotene          weight      
##  Alpine       : 1   Min.   : 1.00   Min.   :0.1300   Min.   :0.7851  
##  Benson&Hedges: 1   1st Qu.: 8.60   1st Qu.:0.6900   1st Qu.:0.9225  
##  BullDurham   : 1   Median :12.80   Median :0.9000   Median :0.9573  
##  CamelLights  : 1   Mean   :12.22   Mean   :0.8764   Mean   :0.9703  
##  Carlton      : 1   3rd Qu.:15.10   3rd Qu.:1.0200   3rd Qu.:1.0070  
##  Chesterfield : 1   Max.   :29.80   Max.   :2.0300   Max.   :1.1650  
##  (Other)      :19                                                    
##        CO       
##  Min.   : 1.50  
##  1st Qu.:10.00  
##  Median :13.00  
##  Mean   :12.53  
##  3rd Qu.:15.40  
##  Max.   :23.50  
##

Exercise 1

I would use a scatter plot to examine the relationship between two variables in this data set. The plot I developed for tar vs. CO is below.

attach(cigs)
plot(tar, CO)

Based on the plot, I feel comfortable saying that the relationship between tar and CO appears to be linear. As such, I would also feel justified in predicting the CO level of a cigarette brand if I knew the tar level. To confirm the linear nature of this plot, I looked at the correlation coefficient.

cor(cigs$CO, cigs$tar)

## [1] 0.9574853

Exercise 2

Based on the scatter plot above, the data appear to have a strongly positive correlation. As the tar content increases, so does the CO content. The points are clustered between 7 and 17 primarily, but the points that fall outside this range still fit the overall trend.

Exercise 3

To obtain a poor fitting line, I selected two points that were nearly stacked one above the other. The sum I obtained was over 7300.Though I cannot say I was surprised that the result was large, I was surprised at the magnitude of the least squares. The first time I fit a line it was in the 100 range. However, this is much larger.

Click two points to make a line. Call: lm(formula = y ~ x, data = pts) Coefficients: (Intercept) x
-15.84 3.20
Sum of Squares: 7352.671

Exercise 4

The smallest least squares value I obtained was 49.505. Based on some of the homeworks posted prior to mine, it seems that my value is in line with my classmates. Furthermore, the known least squares value is 43, so I think 49 is quite close.

Click two points to make a line. Call: lm(formula = y ~ x, data = pts)

Coefficients: (Intercept) x
1.8924 0.8772

Sum of Squares: 49.505

u <- "https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script <- getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
plot_ss(x = cigs$tar, y = cigs$CO)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       2.743        0.801  
## 
## Sum of Squares:  44.869

m1 <- lm(CO ~ tar, data = cigs)
summary(m1)

## 
## Call:
## lm(formula = CO ~ tar, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1124 -0.7167 -0.3754  1.0091  2.5450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74328    0.67521   4.063 0.000481 ***
## tar          0.80098    0.05032  15.918 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9132 
## F-statistic: 253.4 on 1 and 23 DF,  p-value: 6.552e-14

Exercise 5

b1 = (sy/sx)R

b1 <- ((sd(CO))/(sd(tar)))*cor(tar, CO)
b1

## [1] 0.800976

b0 = y(mean) - b1x(mean)

b0 <- (mean(CO))-(b1*mean(tar))
b0

## [1] 2.743278

These values are matched with those produced earlier with lm.

Exercise 6

The slope indicates that for every 1 unit increase in tar, the CO content will increase by about 0.8 units.

Exercise 7

m2<- lm(CO ~ weight, data = cigs)
summary(m2)

## 
## Call:
## lm(formula = CO ~ weight, data = cigs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.524 -2.533  0.622  2.842  7.268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -11.795      9.722  -1.213   0.2373  
## weight        25.068      9.980   2.512   0.0195 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared:  0.2153, Adjusted R-squared:  0.1811 
## F-statistic: 6.309 on 1 and 23 DF,  p-value: 0.01948

Based on this output, the line’s equation is: y^= -11.87 + 25.07*weight

From the given information about R squared, it can be used as a measure of variability in a model. It “represents the proportion of variability in the response variable that is explained by the explanatory variable.” For m1, the R squared value is 0.9168. Therefore, 91.68% of the variability in carbon monoxide content is explained by the amount of tar in the cigarette. However, for m2, the R scquared value is 0.2153. Therefore, 21.53% of the variability in carbon monoxide content is explained by the weight of the cigarette. Based on this information, the first model is a better predcictor of CO content. The higher R squared value is preferable.

Exercise 8

library(ggplot2)
qplot(tar, CO, data=cigs)

ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm")

ggplot(cigs, aes(tar, CO)) + geom_point() + geom_smooth(method="lm", se=FALSE)

Looking at only the least squares regression line and not the actual data, I would predict around 14.9 mg emitted from a cigarette with 15 mg of tar. This represents a very slight underestimation of roughly 0.1 mg perhaps.

Exercise 9

qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)

There seems to be a positive trend, however it is not very strong. This confirms the linearity of the relationship between tar and CO.

qplot(m1$residuals)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot

Exercise 10

The histogram does not show a pattern and the normal probability plot is evenly distributed. These factors indicate that the normal residuals conditions appear to have been met.

Exercise 11

Yes, the constant variability condition seems to have been met because there is not a strong association in the residuals plot.

On Your Own

1.

There does seem to be a linear relationship between nicotene and CO.

attach(cigs)

## The following objects are masked from cigs (pos = 4):
## 
##     brand, CO, nicotene, tar, weight

plot(CO, nicotene)

2.

m3 <- lm(CO ~ nicotene, data = cigs)
summary(m3)

## 
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3273 -1.2228  0.2304  1.2700  3.9357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.6647     0.9936   1.675    0.107    
## nicotene     12.3954     1.0542  11.759 3.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8512 
## F-statistic: 138.3 on 1 and 23 DF,  p-value: 3.312e-11

The R squared value given here is 0.8574. This means that 85.74% of the variability in carbon monoxide content is explained by the amount of nicotene in the cigarette. However, the R squared value for the relationship between tar and CO gives a value of 0.9168. Therefore, 91.68% of the variability in carbon monoxide content is explained by the amount of tar in a cigarette. Based on these two values, tar appears to be a more appropriate predictor of CO content.

3.

The best predictor of CO based on R squared values is tar. Tar has a 0.9168 R squared value, while weight has a 0.2153 R squared value. As just shown above, the R squared for nicotene is 0.8574. Therefore, the highest value is for tar, making it the best predictor of CO.

4.

I concluded that the best predictor of CO is tar. Therefore, these model diagnostics are for the tar vs. CO relationship.

qplot(tar, m1$residuals, data=cigs) + geom_hline(yintercept=0, linetype=3)

qplot(m1$residuals)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals)

These model diagnostics all confirm that tar is an effective predictor of CO content. The residuals plot is randomly distributed, the histogram does not have a pattern, and the normal probability plot is evenly distributed around the line. These tools each confirm that tar is the best predictor of CO.

Lab 2

Abigail Ray

February 24, 2016