Lab 2

library("RCurl", lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")

## Loading required package: bitops

URL<-getURL("http://www.amstat.org/publications/jse/datasets/cigarettes.dat.txt",ssl.verifypeer=FALSE)

cigs<-read.table(text=URL)

names(cigs)<-c("brand","tar","nicotene","weight","CO")

summary(cigs)

##            brand         tar           nicotene          weight      
##  Alpine       : 1   Min.   : 1.00   Min.   :0.1300   Min.   :0.7851  
##  Benson&Hedges: 1   1st Qu.: 8.60   1st Qu.:0.6900   1st Qu.:0.9225  
##  BullDurham   : 1   Median :12.80   Median :0.9000   Median :0.9573  
##  CamelLights  : 1   Mean   :12.22   Mean   :0.8764   Mean   :0.9703  
##  Carlton      : 1   3rd Qu.:15.10   3rd Qu.:1.0200   3rd Qu.:1.0070  
##  Chesterfield : 1   Max.   :29.80   Max.   :2.0300   Max.   :1.1650  
##  (Other)      :19                                                    
##        CO       
##  Min.   : 1.50  
##  1st Qu.:10.00  
##  Median :13.00  
##  Mean   :12.53  
##  3rd Qu.:15.40  
##  Max.   :23.50  
##

Exercise 1

I would use a scatter plot to plot the relationship between CO and a numeric variable.

library(ggplot2)

attach(cigs)

plot(tar,CO)

The relationship between tar and carbon monoxide does look linear. I would feel comfortable using a linear model to predict the amount of CO emitted by a cigarette brand given its tar content.

cor(cigs$CO,cigs$tar)

## [1] 0.9574853

Exercise 2

There appears to be a strong, positive correlation between tar and carbon monoxide. There is also one outlier, but I think it may not have much influence on the least squares line, and if we were to apply a least squares line, I would expect the variation around the line to be roughly constant.

u<-"https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script<-getURL(u,ssl.verifypeer=FALSE)
eval(parse(text=script))
plot_ss(x=cigs$tar,y=cigs$CO)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       2.743        0.801  
## 
## Sum of Squares:  44.869

Exercise 3

u<-"https://raw.githubusercontent.com/nickreich/stat-modeling-2015/gh-pages/assets/labs/plot_ss.R"
script<-getURL(u,ssl.verifypeer=FALSE)
eval(parse(text=script))
plot_ss(x=cigs$tar,y=cigs$CO)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       2.743        0.801  
## 
## Sum of Squares:  44.869

The sum of squares for my first least squares line was 65.1, while the sum of squares for the line that I intentionally fit poorly was 131.096. I am not surprised by my results; the sum of squares confirms that the line I intended to fit to the data is a better fit than the one that I purposely drew to not fit.

Exercise 4

The smallest sum of squares that I was able to achieve was 46.502. I tried to change the intercept to see how it would influence the sum of squares, and from there I tried to place the second point where I thought I would end up closest to the greatest number of points.

m1<-lm(CO~tar,data=cigs)
summary(m1)

## 
## Call:
## lm(formula = CO ~ tar, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1124 -0.7167 -0.3754  1.0091  2.5450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74328    0.67521   4.063 0.000481 ***
## tar          0.80098    0.05032  15.918 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 23 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9132 
## F-statistic: 253.4 on 1 and 23 DF,  p-value: 6.552e-14

Exercise 5

b1<-sd(CO)/sd(tar)*cor(tar,CO)
b1

## [1] 0.800976

b0=mean(CO)-b1*mean(tar)
b0

## [1] 2.743278

The values I’ve calculated for intercept (b0) and slope (b1) do match R’s summary values.

Exercise 6

The slope tells us that for every 1 unit change in cigarette tar, we would expect a .80098 unit change in carbon monoxide emitted into the environment.

Exercise 7

m2<-lm(CO~weight,data=cigs)
summary(m2)

## 
## Call:
## lm(formula = CO ~ weight, data = cigs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.524 -2.533  0.622  2.842  7.268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -11.795      9.722  -1.213   0.2373  
## weight        25.068      9.980   2.512   0.0195 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared:  0.2153, Adjusted R-squared:  0.1811 
## F-statistic: 6.309 on 1 and 23 DF,  p-value: 0.01948

plot(weight,CO)

Judging by the scatter of the points and the multiple R-squared of model 2, I would say that model 1 is a better fit. Model 1 has a visible linear relationship and its multiple R-squared is larger than model 2’s, meaning that model 1 explains more of the variation in the points.

qplot(tar,CO,data=cigs)

ggplot(cigs,aes(tar,CO))+geom_point()+geom_smooth(method="lm")

ggplot(cigs,aes(tar,CO))+geom_point()+geom_smooth(method="lm",se=FALSE)

Exercise 8

If I saw the least squares regression line and not the data, I would predict that a cigarette containing 15 mg of tar would emit slightly less than 15 mg of CO. This prediction is pretty accurate, the data point at 15 mg of tar is at 15 mg of CO. I would estimate that the residual is something like .2.

qplot(tar,m1$residuals) + geom_hline(yintercept=0,linetype=3)

Exercise 9

I think there is a slight positive trend in the residuals, suggesting that not all of the carbon monoxide variation is explained by CO~tar.

qplot(m1$residuals)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals)

Exercise 10

Based on the histogram and the normal probability plot, the distribution of the residuals do seem nearly normal.

Exercise 11

The constant variability condition does seem to be met. There does seem to be a weak trend in the variability, but it’s difficult to make a judgement since there are so few observations.

On my own

m3<-lm(CO~nicotene,data=cigs)
summary(m3)

## 
## Call:
## lm(formula = CO ~ nicotene, data = cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3273 -1.2228  0.2304  1.2700  3.9357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.6647     0.9936   1.675    0.107    
## nicotene     12.3954     1.0542  11.759 3.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8512 
## F-statistic: 138.3 on 1 and 23 DF,  p-value: 3.312e-11

qplot(nicotene,CO,data=cigs)

ggplot(cigs,aes(nicotene,CO))+geom_point()+geom_smooth(method="lm")

ggplot(cigs,aes(nicotene,CO))+geom_point()+geom_smooth(method="lm", se=FALSE)

At a glance, there does seem to be a linear relationship between nicotene and CO, the linear relationship does not look as strong however, as the relationship between tar and CO.
Based on the R squared values from the model summaries, CO~tar is a better model than CO~nicotene. The R squared value for carbon monoxide and tar is .9168, which means that there was a reduction of 91.68% in the data variation by modeling CO as a response to tar content. The R squared value for CO~nicotene was .8574, so this model did not reduce as much of the vartiation as CO~tar.
Of the three variables we’ve looked at, tar best predicts CO. The relationship between tar and CO appears linear by looking at the graph of CO~tar, and the R squared value for this model confirms that it is the best of the three models.

qplot(m1$residuals)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(m1$residuals)
qqline(m1$residuals)