date()
## [1] "Thu Oct 18 12:09:41 2012"
Due Date: October 18, 2012
Total Points: 38
1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) contains simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to the external event is recorded in seconds.
a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)
require("UsingR")
## Loading required package: UsingR
## Loading required package: MASS
t.test(reaction.time$time ~ reaction.time$control)
##
## Welch Two Sample t-test
##
## data: reaction.time$time by reaction.time$control
## t = -2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.107793 -0.004122
## sample estimates:
## mean in group C mean in group T
## 1.390 1.446
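# With p = 0.035 (< 0.05), the difference in mean reaction time is
# statistically significant: drivers using a cell phone (T) react about
# 0.06 s more slowly on average than controls (C).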
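To make explicit what the Welch test above computes, here is a minimal sketch of the statistic, degrees of freedom, and two-sided p-value worked out by hand. The helper welch_t and the simulated vectors ctrl and phone are illustrative assumptions (including the group sizes), not the UsingR data.

```r
# Welch two-sample t statistic with Welch-Satterthwaite degrees of freedom.
# welch_t is a hypothetical helper written for this illustration.
welch_t <- function(x, y) {
  se2_x <- var(x) / length(x)  # squared standard error of mean(x)
  se2_y <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
  df <- (se2_x + se2_y)^2 /
    (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  c(t = t_stat, df = df, p.value = p_value)
}

# Simulated reaction times in seconds (assumed values, for illustration only)
set.seed(1)
ctrl  <- rnorm(20, mean = 1.39, sd = 0.08)
phone <- rnorm(40, mean = 1.45, sd = 0.08)

by_hand  <- welch_t(ctrl, phone)
built_in <- t.test(ctrl, phone)  # Welch test is the default (var.equal = FALSE)

by_hand
built_in
```

The two results should agree, which confirms that t.test() without var.equal = TRUE does not pool the group variances.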
b) Repeat the test separately for women and men. What do you conclude? (4)
men = subset(reaction.time, reaction.time$gender == "M")
t.test(men$time ~ men$control)
##
## Welch Two Sample t-test
##
## data: men$time by men$control
## t = -1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.138626 0.003504
## sample estimates:
## mean in group C mean in group T
## 1.372 1.439
female = subset(reaction.time, reaction.time$gender == "F")
t.test(female$time ~ female$control)
##
## Welch Two Sample t-test
##
## data: female$time by female$control
## t = -0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.12179 0.05313
## sample estimates:
## mean in group C mean in group T
## 1.416 1.451
# The difference is not significant at the 0.05 level for either group
# (p = 0.06 for men, p = 0.40 for women). The men's result is close to the
# threshold, which hints at a possible effect, but each subgroup is
# smaller than the full sample, so these tests have less power.
c) Repeat the test separately for the two age groups. What do you conclude? (4)
kids = subset(reaction.time, reaction.time$age == "16-24")
t.test(kids$time ~ kids$control)
##
## Welch Two Sample t-test
##
## data: kids$time by kids$control
## t = -1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.14171 0.02422
## sample estimates:
## mean in group C mean in group T
## 1.336 1.394
oldies = subset(reaction.time, reaction.time$age == "25+")
t.test(oldies$time ~ oldies$control)
##
## Welch Two Sample t-test
##
## data: oldies$time by oldies$control
## t = -2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.13715 -0.01607
## sample estimates:
## mean in group C mean in group T
## 1.403 1.480
# There is no significant difference for the younger age group (p = 0.13).
# In the older group (25+), however, reaction time is significantly slower
# with cell phone use (p = 0.016).
2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars, and the variable carat records the size of the diamond. You are interested in predicting price from carat size.
a) Make a scatter plot of carat versus price. (2)
require("ggplot2")
## Loading required package: ggplot2
## Attaching package: 'ggplot2'
## The following object(s) are masked from 'package:UsingR':
##
## movies
dplot = ggplot(diamond, aes(carat, price)) + geom_point(size = 2)
dplot
b) Add a linear regression line to the plot. (2)
dplotline = dplot + geom_smooth(method = lm, se = FALSE, col = "red")
dplotline
c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)
dmodel = lm(price ~ carat, data = diamond)
predict(dmodel, data.frame(carat = 1/3))
## 1
## 980.7
# The predicted price of a 1/3 carat ring is about 980.71 Singapore
# dollars.
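As a cross-check on part c, the prediction can be rebuilt from the fitted coefficients, and predict() can also report a prediction interval. This is a sketch that assumes the UsingR package is installed; the names b, by_hand and via_predict are illustrative.

```r
library(UsingR)  # provides the diamond data set used in the text

dmodel <- lm(price ~ carat, data = diamond)
b <- coef(dmodel)  # named vector with entries "(Intercept)" and "carat"

# Prediction by hand: intercept + slope * carat
by_hand <- unname(b["(Intercept)"] + b["carat"] * (1/3))

# The same prediction through predict()
via_predict <- unname(predict(dmodel, newdata = data.frame(carat = 1/3)))

by_hand
via_predict

# 95% prediction interval for a single new ring of this size
predict(dmodel, newdata = data.frame(carat = 1/3), interval = "prediction")
```

The interval is worth reporting alongside the point estimate because it reflects the scatter of individual ring prices around the regression line.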
3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.
a) Create a scatter plot of the data and label the axes. (4)
cplot = ggplot(trees, aes(Girth, Volume)) + geom_point(size = 2) + xlab("Girth (inches)") + ylab("Volume (cubic feet)")
cplot
b) Add a linear regression line to the plot. (2)
cplotline = cplot + geom_smooth(method = lm, se = FALSE, col = "red")
cplotline
c) Determine the sum of squared residuals. (4)
sum(residuals(lm(trees$Volume ~ trees$Girth))^2)
## [1] 524.3
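The sum of squared residuals can be obtained in several equivalent ways for a linear model. The sketch below uses the built-in trees data; the object names fit, ssr_resid, ssr_dev and ssr_anova are illustrative.

```r
fit <- lm(Volume ~ Girth, data = trees)

# Three equivalent routes to the residual sum of squares
ssr_resid <- sum(residuals(fit)^2)              # direct definition
ssr_dev   <- deviance(fit)                      # for lm, deviance is the RSS
ssr_anova <- anova(fit)["Residuals", "Sum Sq"]  # from the ANOVA table

c(resid = ssr_resid, deviance = ssr_dev, anova = ssr_anova)
```

Storing the fitted model in an object first also makes it easy to reuse for summary() or diagnostic plots.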
d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)
cplotsq = ggplot(trees, aes(Girth^2, Volume)) + geom_point(size = 2) + xlab("Girth squared (square inches)") + ylab("Volume (cubic feet)")
cplotsq
cplotlinesq = cplotsq + geom_smooth(method = lm, se = FALSE, col = "red")
cplotlinesq
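sum(residuals(lm(trees$Volume ~ I(trees$Girth^2)))^2)
# Note the I(): inside a model formula, ^ is a formula operator, so
# Volume ~ Girth^2 without I() fits the same model as Volume ~ Girth and
# returns 524.3 again. With I(), volume is regressed on squared girth,
# and the resulting sum of squared residuals can be compared with the
# 524.3 from part c. I prefer the model with the smaller sum of squared
# residuals. Squared girth is also the more natural predictor, because
# timber volume scales with the cross-sectional area of the trunk.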
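One point worth checking for part d: inside an R model formula, `^` is a formula operator rather than arithmetic, so a squared predictor has to be wrapped in I(). A small sketch on the built-in trees data shows the difference; the names fit_girth, fit_caret and fit_sq are illustrative.

```r
fit_girth <- lm(Volume ~ Girth, data = trees)
fit_caret <- lm(Volume ~ Girth^2, data = trees)     # formula ^: reduces to Girth
fit_sq    <- lm(Volume ~ I(Girth^2), data = trees)  # arithmetic square of Girth

names(coef(fit_caret))  # "(Intercept)" "Girth"
names(coef(fit_sq))     # "(Intercept)" "I(Girth^2)"

# Residual sums of squares: the first two match, the third differs
c(girth   = deviance(fit_girth),
  caret   = deviance(fit_caret),
  squared = deviance(fit_sq))
```

Two models whose sums of squared residuals agree to every digit are usually a sign that the same model was fit twice, so the comparison should use the I() version.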