Problem Set # 3

Olivia Williams

date()

## [1] "Tue Oct 16 17:02:26 2012"

Due Date: October 18, 2012
Total Points: 38

1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) contains simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to the external event is recorded in seconds.
a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

head(reaction.time)

##     age gender control  time
## 1 16-24      F       T 1.360
## 2 16-24      M       T 1.468
## 3 16-24      M       T 1.512
## 4 16-24      F       T 1.391
## 5 16-24      M       T 1.384
## 6 16-24      M       C 1.394

tail(reaction.time)

##    age gender control  time
## 55 25+      M       T 1.445
## 56 25+      M       C 1.443
## 57 25+      M       C 1.355
## 58 25+      F       T 1.615
## 59 25+      M       T 1.528
## 60 25+      M       T 1.467

attach(reaction.time)
class(time)

## [1] "numeric"

class(control)

## [1] "factor"

t.test(time ~ control, data = reaction.time)  # The p-value of 0.035 provides moderate evidence against the null hypothesis that using a cell phone while driving does not affect mean reaction time. The mean reaction time is higher in group T (1.446 s) than in group C (1.390 s).

## 
##  Welch Two Sample t-test
## 
## data:  time by control 
## t = -2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.107793 -0.004122 
## sample estimates:
## mean in group C mean in group T 
##           1.390           1.446

b) Repeat the test separately for women and men. What do you conclude? (4)

t.test(time[gender == "F"] ~ control[gender == "F"], data = reaction.time)  # The p-value of 0.40 is too large to provide evidence against the null hypothesis that using a cell phone while driving does not affect mean reaction time for women.

## 
##  Welch Two Sample t-test
## 
## data:  time[gender == "F"] by control[gender == "F"] 
## t = -0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.12179  0.05313 
## sample estimates:
## mean in group C mean in group T 
##           1.416           1.451

t.test(time[gender == "M"] ~ control[gender == "M"], data = reaction.time)  # The p-value of 0.06 provides suggestive but inconclusive evidence against the null hypothesis that using a cell phone while driving does not affect mean reaction time for men.

## 
##  Welch Two Sample t-test
## 
## data:  time[gender == "M"] by control[gender == "M"] 
## t = -1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.138626  0.003504 
## sample estimates:
## mean in group C mean in group T 
##           1.372           1.439

c) Repeat the test separately for the two age groups. What do you conclude? (4)

t.test(time[age == "16-24"] ~ control[age == "16-24"], data = reaction.time)  # The p-value of 0.13 provides suggestive but inconclusive evidence against the null hypothesis that using a cell phone while driving does not affect mean reaction time for drivers aged 16-24.

## 
##  Welch Two Sample t-test
## 
## data:  time[age == "16-24"] by control[age == "16-24"] 
## t = -1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.14171  0.02422 
## sample estimates:
## mean in group C mean in group T 
##           1.336           1.394

t.test(time[age == "25+"] ~ control[age == "25+"], data = reaction.time)  # The p-value of 0.016 provides moderate evidence against the null hypothesis that using a cell phone while driving does not affect mean reaction time for drivers aged 25+.

## 
##  Welch Two Sample t-test
## 
## data:  time[age == "25+"] by control[age == "25+"] 
## t = -2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.13715 -0.01607 
## sample estimates:
## mean in group C mean in group T 
##           1.403           1.480

detach(reaction.time)
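The subgroup tests above depend on attach() and index both sides of the formula by hand. One possible alternative is the subset argument of t.test(), which leaves the formula unchanged. The sketch below runs on a small simulated data frame (the hypothetical `sim`, with made-up values and the same column names as reaction.time) only so that it is self-contained; its output says nothing about the real data.

```r
# Simulated stand-in for reaction.time.
# Assumption: the values are made up; only the column names match the real data.
set.seed(1)
sim <- data.frame(
  gender  = factor(rep(c("F", "M"), each = 20)),
  control = factor(rep(c("C", "T"), times = 20)),
  time    = rnorm(40, mean = 1.4, sd = 0.1)
)

# Welch two-sample t-test within each gender, selected with `subset`
# instead of indexing attached vectors.
res_f <- t.test(time ~ control, data = sim, subset = gender == "F")
res_m <- t.test(time ~ control, data = sim, subset = gender == "M")

# p-values for the two subgroups
c(F = res_f$p.value, M = res_m$p.value)
```

With the real data, replacing `sim` with `reaction.time` should give the same tests as part b) without needing attach() or detach().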

2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars, and the variable carat records the size of the diamond. You are interested in predicting price from carat size.
a) Make a scatter plot of carat versus price. (2)

head(diamond)

##   carat price
## 1  0.17   355
## 2  0.16   328
## 3  0.17   350
## 4  0.18   325
## 5  0.25   642
## 6  0.16   342

require(ggplot2)

## Loading required package: ggplot2

## Attaching package: 'ggplot2'

## The following object(s) are masked from 'package:UsingR':
## 
## movies

ggplot(diamond, aes(x = carat, y = price)) + geom_point(size = 2)

[Figure: scatter plot of price versus carat]

b) Add a linear regression line to the plot. (2)

ggplot(diamond, aes(x = carat, y = price)) + geom_point(size = 2) + geom_smooth(method = lm, 
    se = FALSE)

[Figure: scatter plot of price versus carat with the fitted linear regression line]

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamond)
model

## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Coefficients:
## (Intercept)        carat  
##        -260         3721

predict(model, data.frame(carat = 1/3))

##     1 
## 980.7
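For a simple linear regression, predict() evaluates intercept + slope × carat. The sketch below repeats that arithmetic by hand using the rounded coefficients printed above (-260 and 3721). Because those values are rounded, the result is close to, but not exactly, the 980.7 reported by predict().

```r
# Rounded coefficients as printed in the model output above.
# Assumption: the unrounded values stored in `model` differ slightly.
intercept <- -260
slope <- 3721

# Predicted price (Singapore dollars) for a 1/3 carat ring
predicted_price <- intercept + slope * (1/3)
predicted_price
```

With the fitted object available, coef(model) returns the unrounded intercept and slope for the same calculation.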

3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.
a) Create a scatter plot of the data and label the axes. (4)

head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Girth (inches)") + 
    ylab("Volume")

[Figure: scatter plot of Volume versus Girth (inches)]

b) Add a linear regression line to the plot. (2)

ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Girth (inches)") + 
    ylab("Volume") + geom_smooth(method = lm, se = FALSE)

[Figure: scatter plot of Volume versus Girth (inches) with the fitted linear regression line]

c) Determine the sum of squared residuals. (4)

attach(trees)
sum(residuals(lm(Volume ~ Girth))^2)

## [1] 524.3
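As a cross-check, the same quantity can be obtained without attach() by passing data = trees to lm(). For a linear model, deviance() returns the residual sum of squares, so it should agree with the manual sum. The object names below are illustrative only.

```r
# trees ships with base R (datasets package), so no attach() is needed.
fit_girth <- lm(Volume ~ Girth, data = trees)

# Two equivalent ways to get the sum of squared residuals
sse_manual   <- sum(residuals(fit_girth)^2)
sse_deviance <- deviance(fit_girth)

c(manual = sse_manual, deviance = sse_deviance)
```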

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

SG = Girth^2
ggplot(trees, aes(x = SG, y = Volume)) + geom_point() + xlab("Girth squared (square inches)") + 
    ylab("Volume") + geom_smooth(method = lm, se = FALSE)

[Figure: scatter plot of Volume versus squared Girth (square inches) with the fitted linear regression line]

sum(residuals(lm(Volume ~ SG))^2)

## [1] 329.3

detach(trees)  # I prefer the second model because its SSE is smaller (329.3 versus 524.3), so the linear regression fits the data better when the square of the girth, rather than the girth itself, is the explanatory variable.
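The comparison can also be written without a separate SG variable by wrapping the squared term in I() inside the formula. The sketch below, with illustrative object names, puts the two fits side by side. Both models have the same response and the same number of parameters, so comparing SSE (or R-squared) directly is reasonable here.

```r
# Fit both candidate models on the built-in trees data.
fit_girth  <- lm(Volume ~ Girth, data = trees)
fit_girth2 <- lm(Volume ~ I(Girth^2), data = trees)  # I() protects ^ inside a formula

# Side-by-side summary: residual sum of squares and R-squared
comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  sse       = c(deviance(fit_girth), deviance(fit_girth2)),
  r_squared = c(summary(fit_girth)$r.squared, summary(fit_girth2)$r.squared)
)
comparison
```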