Problem Set # 3

Rob Leteff

date()

## [1] "Tue Oct 16 09:43:45 2012"

Due Date: October 18, 2012
Total Points: 38

1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) is simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to some external event is recorded in seconds.
a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

attach(reaction.time)
t.test(time[control == "T"], time[control == "C"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T"] and time[control == "C"] 
## t = 2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.004122 0.107793 
## sample estimates:
## mean of x mean of y 
##     1.446     1.390

detach(reaction.time)

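The p-value is 0.03529, which is moderate evidence against the null hypothesis of equal mean reaction times, though not overwhelming. The 95 percent confidence interval for the difference (0.004 to 0.108 seconds) excludes zero, so talking on a cell phone appears to have some effect on reaction time.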
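To make the Welch test in part a) less of a black box, the statistic, degrees of freedom, and p-value can be computed by hand. The sketch below uses simulated numbers, not the reaction.time data; the vector names `phone` and `nophone` and the group sizes are assumptions chosen only for illustration.

```r
# Simulated reaction times (hypothetical values, not the reaction.time data)
set.seed(1)
phone   <- rnorm(20, mean = 1.45, sd = 0.08)  # stand-in for control == "T"
nophone <- rnorm(40, mean = 1.39, sd = 0.08)  # stand-in for control == "C"

# Estimated variance of each sample mean
v1 <- var(phone) / length(phone)
v2 <- var(nophone) / length(nophone)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(phone) - mean(nophone)) / sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (length(phone) - 1) + v2^2 / (length(nophone) - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# Compare with R's built-in Welch test
res <- t.test(phone, nophone)
c(by_hand = t_stat, t.test = unname(res$statistic))
```

The hand-computed values should agree with what t.test() reports, which is why the degrees of freedom in the output above are not whole numbers.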

b) Repeat the test separately for women and men. What do you conclude? (4)

attach(reaction.time)
t.test(time[control == "T" & gender == "F"], time[control == "C" & gender == 
    "F"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & gender == "F"] and time[control == "C" & gender == "F"] 
## t = 0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.05313  0.12179 
## sample estimates:
## mean of x mean of y 
##     1.451     1.416

t.test(time[control == "T" & gender == "M"], time[control == "C" & gender == 
    "M"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & gender == "M"] and time[control == "C" & gender == "M"] 
## t = 1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.003504  0.138626 
## sample estimates:
## mean of x mean of y 
##     1.439     1.372

detach(reaction.time)

In the test for women, the p-value is 0.4021, too large to reject the null hypothesis of no difference in mean reaction time. The test for men gives a p-value of 0.06121, which is suggestive but not below the conventional 0.05 threshold, so the null hypothesis cannot be rejected in this case either. Taken separately, neither group shows a statistically significant effect of cell phone use, although the estimated difference is positive for both. These tests compare phone users with non-users within each gender, so they do not by themselves show a difference between men and women.

c) Repeat the test separately for the two age groups. What do you conclude? (4)

attach(reaction.time)
t.test(time[control == "T" & age == "16-24"], time[control == "C" & age == "16-24"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & age == "16-24"] and time[control == "C" & age == "16-24"] 
## t = 1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.02422  0.14171 
## sample estimates:
## mean of x mean of y 
##     1.394     1.336

t.test(time[control == "T" & age == "25+"], time[control == "C" & age == "25+"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & age == "25+"] and time[control == "C" & age == "25+"] 
## t = 2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.01607 0.13715 
## sample estimates:
## mean of x mean of y 
##     1.480     1.403

detach(reaction.time)

The t-test for the 16-24 age group gives a p-value of 0.1296, too large to reject the null hypothesis. The t-test for the 25+ age group gives a p-value of 0.01553, moderately convincing evidence that cell phone use increases mean reaction time in that group. The effect of cell phone use is therefore clearer among older drivers, but the 16-24 test rests on few observations (about 5 degrees of freedom), so further testing may be required.
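The subgroup tests in parts b) and c) can also be run without attach() by splitting the data frame and using the formula interface of t.test(). This is only a sketch on a simulated data frame `d`; it borrows the column names of reaction.time, but the values are hypothetical.

```r
# Hypothetical data frame with the same column names as reaction.time
set.seed(2)
d <- data.frame(
  # Level order T, C makes the reported difference T minus C
  control = factor(rep(c("T", "C"), each = 20), levels = c("T", "C")),
  age     = rep(c("16-24", "25+"), times = 20),
  time    = rnorm(40, mean = 1.4, sd = 0.08)
)

# One Welch t-test of time by control within each age group
tests <- lapply(split(d, d$age), function(g) t.test(time ~ control, data = g))

# Collect the p-values into a named vector, one entry per age group
p_values <- sapply(tests, function(x) x$p.value)
p_values
```

Replacing `d$age` with a gender column would give the split used in part b).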

2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond. You are interested in predicting price from carat size.
a) Make a scatter plot of carat versus price. (2)

require(UsingR)
require(ggplot2)

## Loading required package: ggplot2

## Attaching package: 'ggplot2'

## The following object(s) are masked from 'package:UsingR':
## 
## movies

ggplot(diamond, aes(x = carat, y = price)) + geom_point()

[Figure: scatter plot of price versus carat]

b) Add a linear regression line to the plot. (2)

ggplot(diamond, aes(x = carat, y = price)) + geom_point() + geom_smooth(method = lm, 
    se = FALSE)

[Figure: scatter plot of price versus carat with fitted regression line]

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamond)
# 0.33 is used as an approximation of a 1/3 carat diamond
predict(model, data.frame(carat = 0.33))

##     1 
## 968.3
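For a simple linear regression, predict() just evaluates intercept plus slope times the new x value. The sketch below checks this on simulated data, since the diamond data require UsingR; the data frame `sim` and the numbers used to generate it are hypothetical.

```r
# Simulated ring data (hypothetical numbers, not the diamond data set)
set.seed(3)
carat <- runif(48, min = 0.12, max = 0.35)
price <- -250 + 3700 * carat + rnorm(48, sd = 30)
sim <- data.frame(carat, price)

fit <- lm(price ~ carat, data = sim)
b <- coef(fit)

# Prediction by hand: intercept + slope * carat
by_hand <- unname(b["(Intercept)"] + b["carat"] * 0.33)

# Same prediction through predict()
by_predict <- unname(predict(fit, data.frame(carat = 0.33)))

c(by_hand = by_hand, predict = by_predict)
```

Passing interval = "prediction" to predict() would additionally return lower and upper bounds for the price of a single new ring.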

3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.
a) Create a scatter plot of the data and label the axes. (4)

vol = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Girth (inches)") + 
    ylab("Volume")
vol

[Figure: scatter plot of volume versus girth]

b) Add a linear regression line to the plot. (2)

vol = vol + geom_smooth(method = lm, se = FALSE)
vol

[Figure: scatter plot of volume versus girth with fitted regression line]

c) Determine the sum of squared residuals. (4)

model = lm(Volume ~ Girth, data = trees)
deviance(model)

## [1] 524.3
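For a model fit with lm(), deviance() returns the residual sum of squares, so the same number can be obtained directly from the residuals. A short check, using `datasets::trees` so that it runs on the unmodified built-in data:

```r
# Fit the same model on the built-in trees data
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Sum of squared residuals computed directly
sse_by_hand <- sum(residuals(fit)^2)

# The value deviance() reports for an lm fit
sse_deviance <- deviance(fit)

c(by_hand = sse_by_hand, deviance = sse_deviance)
```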

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

# Add squared girth as a new column (avoids attach(), which was never detached)
trees$Girth2 = trees$Girth^2
vol2 = ggplot(trees, aes(x = Girth2, y = Volume)) + geom_point() + xlab("Girth (inches squared)") + 
    ylab("Volume")
vol2

[Figure: scatter plot of volume versus squared girth]

vol2 = vol2 + geom_smooth(method = lm, se = FALSE)
vol2

[Figure: scatter plot of volume versus squared girth with fitted regression line]

model2 = lm(Volume ~ Girth2, data = trees)
deviance(model2)

## [1] 329.3

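The SSE for the model using the square of the girth (329.3) is lower than for the model using girth (524.3), so it is the preferred model for predicting volume. This also makes physical sense: the volume of a trunk scales with its cross-sectional area, which is proportional to the square of the girth.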
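The squared term can also be written inside the model formula with I(), which avoids creating a new column. The sketch below places the two fits side by side; the data frame `comparison` is a hypothetical helper, and R-squared is added only as a second summary alongside the SSE.

```r
# Both models fit on the unmodified built-in trees data
fit1 <- lm(Volume ~ Girth, data = datasets::trees)
fit2 <- lm(Volume ~ I(Girth^2), data = datasets::trees)  # I() protects the ^

# Side-by-side summary of the two fits
comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  sse       = c(deviance(fit1), deviance(fit2)),
  r_squared = c(summary(fit1)$r.squared, summary(fit2)$r.squared)
)
comparison
```

Because both models estimate the same number of coefficients for the same response, comparing their SSE values directly is a fair comparison.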