Problem Set # 3

Holly Widen

date()

## [1] "Wed Oct 17 13:33:53 2012"

Due Date: October 18, 2012
Total Points: 38

1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) is simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to some external event is recorded in seconds.

a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

head(reaction.time)

##     age gender control  time
## 1 16-24      F       T 1.360
## 2 16-24      M       T 1.468
## 3 16-24      M       T 1.512
## 4 16-24      F       T 1.391
## 5 16-24      M       T 1.384
## 6 16-24      M       C 1.394

t.test(time ~ control, data = reaction.time)

## 
##  Welch Two Sample t-test
## 
## data:  time by control 
## t = -2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.107793 -0.004122 
## sample estimates:
## mean in group C mean in group T 
##           1.390           1.446

The p-value is about 0.035, which provides moderate evidence that cell phone use affects reaction time. We reject the null hypothesis of no difference in means at the 5% level. The mean reaction time is higher for the cell phone group (1.446 seconds) than for the control group (1.390 seconds).
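As a check on the output above, the Welch statistic can be reproduced from its definition. This is a minimal sketch; the helper name welch_t is hypothetical and is not part of base R or UsingR.

```r
library(UsingR)  # provides the reaction.time data set

# Hypothetical helper: Welch two-sample t-test computed from first principles.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  c(t = t_stat, df = df, p.value = p_value)
}

time_C <- reaction.time$time[reaction.time$control == "C"]
time_T <- reaction.time$time[reaction.time$control == "T"]
by_hand <- welch_t(time_C, time_T)
by_hand
```

The three values should agree with the t, df, and p-value that t.test() reports above.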

b) Repeat the test separately for women and men. What do you conclude? (4)

attach(reaction.time)
t.test(time[gender == "F"] ~ control[gender == "F"])

## 
##  Welch Two Sample t-test
## 
## data:  time[gender == "F"] by control[gender == "F"] 
## t = -0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.12179  0.05313 
## sample estimates:
## mean in group C mean in group T 
##           1.416           1.451

t.test(time[gender == "M"] ~ control[gender == "M"])

## 
##  Welch Two Sample t-test
## 
## data:  time[gender == "M"] by control[gender == "M"] 
## t = -1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.138626  0.003504 
## sample estimates:
## mean in group C mean in group T 
##           1.372           1.439

detach(reaction.time)

The p-value for women is approximately 0.40, so we fail to reject the null hypothesis of no difference in mean reaction time among women. The p-value for men is about 0.06, which is suggestive of a difference among men but falls short of significance at the 5% level, so the result is inconclusive.
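The same subgroup tests can be run without attach() by splitting the data frame on the grouping column. This is a sketch only; test_by_group is a hypothetical helper name introduced for illustration.

```r
library(UsingR)  # provides the reaction.time data set

# Hypothetical helper: run the Welch t-test of time on control within each
# level of a grouping column and return one p-value per level.
test_by_group <- function(data, group) {
  pieces <- split(data, data[[group]])
  sapply(pieces, function(d) t.test(time ~ control, data = d)$p.value)
}

p_gender <- test_by_group(reaction.time, "gender")
p_gender
```

Calling test_by_group(reaction.time, "age") would give the corresponding split by age group.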

c) Repeat the test separately for the two age groups. What do you conclude? (4)

tail(reaction.time)

##    age gender control  time
## 55 25+      M       T 1.445
## 56 25+      M       C 1.443
## 57 25+      M       C 1.355
## 58 25+      F       T 1.615
## 59 25+      M       T 1.528
## 60 25+      M       T 1.467

attach(reaction.time)
t.test(time[age == "16-24"] ~ control[age == "16-24"])

## 
##  Welch Two Sample t-test
## 
## data:  time[age == "16-24"] by control[age == "16-24"] 
## t = -1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.14171  0.02422 
## sample estimates:
## mean in group C mean in group T 
##           1.336           1.394

t.test(time[age == "25+"] ~ control[age == "25+"])

## 
##  Welch Two Sample t-test
## 
## data:  time[age == "25+"] by control[age == "25+"] 
## t = -2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.13715 -0.01607 
## sample estimates:
## mean in group C mean in group T 
##           1.403           1.480

detach(reaction.time)

The p-value for ages 16-24 is about 0.13, so we fail to reject the null hypothesis. The data give little evidence that cell phone use affects reaction time in this age group. The p-value for ages 25 and over is roughly 0.016, which gives moderate evidence that reaction time in this age group is affected by cell phone use, so we reject the null hypothesis at the 5% level.
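The Welch degrees of freedom for the 16-24 test (about 5.2) suggest that at least one of its two groups is small. Tabulating the cell counts, as sketched below, shows how many subjects each subgroup test rests on.

```r
library(UsingR)  # provides the reaction.time data set

# Number of subjects in each age-by-control cell.
counts <- with(reaction.time, table(age, control))
counts
```

Small cells mean less power, so a non-significant result in a subgroup is weak evidence of no effect.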

2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars, and the variable carat records the size of the diamond. You are interested in predicting price from carat size.

a) Make a scatter plot of carat versus price. (2)

head(diamond)

##   carat price
## 1  0.17   355
## 2  0.16   328
## 3  0.17   350
## 4  0.18   325
## 5  0.25   642
## 6  0.16   342

require(ggplot2)

## Loading required package: ggplot2

## Attaching package: 'ggplot2'

## The following object(s) are masked from 'package:UsingR':
## 
## movies

p = ggplot(diamond, aes(x = carat, y = price)) + geom_point() + ylab("Singapore Dollars") + 
    xlab("Carat")
p

plot of chunk ScatterPlot

b) Add a linear regression line to the plot. (2)

p + geom_smooth(method = lm, se = FALSE)

plot of chunk AddRegressionLine

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamond)
predict(model, data.frame(carat = 1/3))

##     1 
## 980.7

A 1/3 carat diamond ring would cost approximately 981 Singapore dollars.
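The prediction is the fitted intercept plus the fitted slope times 1/3. The sketch below reproduces it from the coefficients and also requests a 95% prediction interval, which is an addition not asked for in the problem.

```r
library(UsingR)  # provides the diamond data set

model <- lm(price ~ carat, data = diamond)
b <- coef(model)   # b[1] is the intercept, b[2] is the slope for carat

# Point prediction: intercept + slope * carat
by_hand <- unname(b[1] + b[2] * (1/3))
by_hand

# Same prediction with a 95% prediction interval for a single new ring
predict(model, data.frame(carat = 1/3), interval = "prediction")
```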

3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.

a) Create a scatter plot of the data and label the axes. (4)

head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

p2 = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + ylab("Volume") + 
    xlab("Girth")
p2

plot of chunk Scatterplot2

b) Add a linear regression line to the plot. (2)

p2 + geom_smooth(method = lm, se = FALSE)

plot of chunk AddRegressionLine2

c) Determine the sum of squared residuals. (4)

sum(residuals(lm(trees$Volume ~ trees$Girth))^2)

## [1] 524.3
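The sum of squared residuals is the sum of squared differences between observed and fitted volumes. As a sketch, it can be computed from that definition and cross-checked with deviance(), which returns the residual sum of squares for an lm fit.

```r
fit1 <- lm(Volume ~ Girth, data = trees)   # trees ships with R (datasets package)

# From the definition: observed minus fitted, squared, then summed
sse_manual <- sum((trees$Volume - fitted(fit1))^2)

# deviance() on an lm fit gives the same residual sum of squares
sse_deviance <- deviance(fit1)

c(manual = sse_manual, deviance = sse_deviance)
```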

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

Create a scatter plot of the data and label the axes.

attach(trees)
Girth2 = (Girth)^2
p3 = ggplot(trees, aes(x = Girth2, y = Volume)) + geom_point() + ylab("Volume") + 
    xlab("Girth squared")
p3

plot of chunk Scatterplot3

detach(trees)

Add a linear regression line to the plot.

p3 + geom_smooth(method = lm, se = FALSE)

plot of chunk AddRegressionLine3

Determine the sum of squared residuals.

sum(residuals(lm(trees$Volume ~ Girth2))^2)

## [1] 329.3

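Even though we are transforming the explanatory variable (by squaring the girth values), I prefer the second model because it fits the data better: its sum of squared residuals is smaller (329.3 versus 524.3). The squared term is also physically plausible, since timber volume should scale with the cross-sectional area of the trunk, which is proportional to the square of the girth.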
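Both models have the same response and the same number of coefficients, so their sums of squared residuals can be compared directly. The sketch below puts the two fits side by side and adds R-squared as a second summary. It uses I(Girth^2) inside the formula instead of the separate Girth2 vector, which is a stylistic substitution and not the approach used above.

```r
fit_girth  <- lm(Volume ~ Girth, data = trees)
fit_girth2 <- lm(Volume ~ I(Girth^2), data = trees)  # I() squares Girth inside the formula

comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  sse       = c(deviance(fit_girth), deviance(fit_girth2)),
  r_squared = c(summary(fit_girth)$r.squared, summary(fit_girth2)$r.squared)
)
comparison
```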