Problem Set # 3

Loury Migliorelli

date()

## [1] "Thu Oct 18 12:08:36 2012"

Due Date: October 18, 2012
Total Points: 38

1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) is simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to some external event is recorded in seconds.
a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

attach(reaction.time)
class(control)

## [1] "factor"

class(time)

## [1] "numeric"

t.test(time ~ control)

## 
##  Welch Two Sample t-test
## 
## data:  time by control 
## t = -2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.107793 -0.004122 
## sample estimates:
## mean in group C mean in group T 
##           1.390           1.446

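The difference in mean reaction time is significant at the 5% level (t = -2.205, p-value = 0.03529), and the 95% confidence interval for the difference in means (C minus T) excludes zero. Mean reaction time is longer for subjects using a cell phone (1.446 s) than for those who are not (1.390 s), so cell phone use appears to slow reaction time.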
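To make the Welch statistic reported by t.test() concrete, the sketch below recomputes t, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand. The vectors x and y are hypothetical illustrative values, not the reaction.time data, so the numbers will not match the output above.

```r
# Hypothetical reaction times in seconds (illustrative values only,
# NOT taken from the reaction.time data set)
x <- c(1.32, 1.41, 1.38, 1.45, 1.36, 1.40)        # e.g. not using a phone
y <- c(1.44, 1.39, 1.52, 1.47, 1.50, 1.41, 1.46)  # e.g. using a phone

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_by_hand <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_by_hand <- 2 * pt(-abs(t_by_hand), df = df_by_hand)

# Compare with R's default (Welch) two-sample t-test
res <- t.test(x, y)
c(t = t_by_hand, df = df_by_hand, p = p_by_hand)
res
```

The hand-computed values should agree with the statistic, df, and p-value that t.test() prints, which is why the reported degrees of freedom are generally not whole numbers.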

b) Repeat the test separately for women and men. What do you conclude? (4)

Fem = subset(reaction.time, gender == "F")
Men = subset(reaction.time, gender == "M")
t.test(Fem$time ~ Fem$control)

## 
##  Welch Two Sample t-test
## 
## data:  Fem$time by Fem$control 
## t = -0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.12179  0.05313 
## sample estimates:
## mean in group C mean in group T 
##           1.416           1.451

t.test(Men$time ~ Men$control)

## 
##  Welch Two Sample t-test
## 
## data:  Men$time by Men$control 
## t = -1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.138626  0.003504 
## sample estimates:
## mean in group C mean in group T 
##           1.372           1.439

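The p-value for women is 0.4021 and the p-value for men is 0.06121, so neither test rejects the hypothesis of equal mean reaction times at the 5% level. In both groups the mean for phone users is higher than for non-users, and the result for men is close to significant, so the lack of significance may partly reflect the smaller sample in each subgroup rather than the absence of an effect.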

c) Repeat the test separately for the two age groups. What do you conclude? (4)

Child = subset(reaction.time, age == "16-24")
Adult = subset(reaction.time, age == "25+")
t.test(Child$time ~ Child$control)

## 
##  Welch Two Sample t-test
## 
## data:  Child$time by Child$control 
## t = -1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.14171  0.02422 
## sample estimates:
## mean in group C mean in group T 
##           1.336           1.394

t.test(Adult$time ~ Adult$control)

## 
##  Welch Two Sample t-test
## 
## data:  Adult$time by Adult$control 
## t = -2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.13715 -0.01607 
## sample estimates:
## mean in group C mean in group T 
##           1.403           1.480

detach(reaction.time)

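Since the p-value for the younger age group (16-24) is 0.1296, the difference in mean reaction time between phone users and non-users is not significant at the 5% level for that group. Since the p-value for the older age group (25+) is 0.01553, the difference is significant for that group. In both age groups the mean for phone users is higher (1.394 vs. 1.336 s and 1.480 vs. 1.403 s), so the data suggest that cell phone use slows reaction time in both, but only for the older group is the evidence strong enough to reject equal means. The test for the younger group has only about 5 degrees of freedom, which points to a small sample and low power.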

2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond. You are interested in predicting price from carat size.
a) Make a scatter plot of carat versus price. (2)

attach(diamond)
head(diamond)

##   carat price
## 1  0.17   355
## 2  0.16   328
## 3  0.17   350
## 4  0.18   325
## 5  0.25   642
## 6  0.16   342

# install.packages("ggplot2")  # run once from the console; it fails while knitting because no CRAN mirror is set

require(ggplot2)

## Loading required package: ggplot2

## Attaching package: 'ggplot2'

## The following object(s) are masked from 'package:UsingR':
## 
## movies

p = ggplot(diamond, aes(x = carat, y = price)) + geom_point(size = 2) + xlab("Carat size") + 
    ylab("Price (in Singapore dollars)")
p

plot of chunk scatterPlotDiamond

b) Add a linear regression line to the plot. (2)

p = p + geom_smooth(method = lm, se = FALSE, col = "red")
p

plot of chunk linRegressionDiamond

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamond)
predict(model, data.frame(carat = 1/3))

##     1 
## 980.7

detach(diamond)

The model predicts a price of about 980.72 Singapore dollars for a 1/3 carat diamond ring.
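As a check on what predict() does for a simple linear regression, the sketch below computes the prediction as intercept plus slope times carat. To stay self-contained it uses only the six rows printed by head(diamond) above, so the result differs from the full-data prediction of 980.7, and 1/3 carat lies outside the range of those six rows. The names d6 and fit6 are introduced here for illustration.

```r
# First six rows of diamond, copied from the head(diamond) output above.
# This is a subset for illustration, not the full 48-row data set.
d6 <- data.frame(
  carat = c(0.17, 0.16, 0.17, 0.18, 0.25, 0.16),
  price = c(355, 328, 350, 325, 642, 342)
)

fit6 <- lm(price ~ carat, data = d6)
b <- coef(fit6)

# Prediction by hand: intercept + slope * carat
by_hand <- b[["(Intercept)"]] + b[["carat"]] * (1/3)

# Same prediction through predict()
by_predict <- predict(fit6, data.frame(carat = 1/3))

all.equal(by_hand, unname(by_predict))
```

The same two lines applied to the model fitted on the full diamond data would reproduce the value reported above.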

3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.
a) Create a scatter plot of the data and label the axes. (4)

attach(trees)
head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

p1 = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point(size = 2) + xlab("Girth (in)") + 
    ylab("Volume")
p1

plot of chunk scatterPlotTrees

b) Add a linear regression line to the plot. (2)

p1 = p1 + geom_smooth(method = lm, se = FALSE, col = "red")
p1

plot of chunk lmTrees

c) Determine the sum of squared residuals. (4)

sum(residuals(lm(trees$Volume ~ trees$Girth))^2)

## [1] 524.3
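The sum of squared residuals can be obtained in more than one way. The sketch below, which uses the built-in trees data, computes it from the fitted coefficients by hand, from residuals(), and from deviance(), which for an lm fit returns the residual sum of squares.

```r
fit <- lm(Volume ~ Girth, data = trees)

# Fitted values from the coefficients, then observed minus fitted
fitted_by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["Girth"]] * trees$Girth
ssr_by_hand <- sum((trees$Volume - fitted_by_hand)^2)

# Same quantity from the residuals stored in the fit
ssr_resid <- sum(residuals(fit)^2)

# For an lm object, deviance() is the residual sum of squares
ssr_deviance <- deviance(fit)

c(by_hand = ssr_by_hand, residuals = ssr_resid, deviance = ssr_deviance)
```

Passing data = trees to lm() also avoids the trees$ prefixes and the need for attach().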

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

p2 = ggplot(trees, aes(x = Girth^2, y = Volume)) + geom_point(size = 2) + xlab("Square of girth (sq in)") + 
    ylab("Volume")
p2

plot of chunk scatterPlotTreesSq

p2 = p2 + geom_smooth(method = lm, se = FALSE, col = "red")
p2

plot of chunk linRegressionTreesSq

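sum(residuals(lm(Volume ~ I(Girth^2), data = trees))^2)

Note that writing trees$Girth^2 directly in the formula does not square girth: inside a formula, ^ is a formula operator rather than arithmetic, so that call refits the girth model and returns the same 524.3. Wrapping the term in I() regresses volume on the square of girth. Both models have one explanatory variable, so the one with the smaller sum of squared residuals fits these data more closely and is the one I prefer. The squared-girth model is also the more natural choice, because the volume of a trunk depends on its cross-sectional area, which grows with the square of its girth.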
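How R reads ^ inside a model formula can be checked directly. The sketch below fits three models to the built-in trees data and compares coefficient names and residual sums of squares. The object names fit_girth, fit_caret, and fit_squared are introduced here for illustration.

```r
fit_girth   <- lm(Volume ~ Girth,      data = trees)
fit_caret   <- lm(Volume ~ Girth^2,    data = trees)  # ^ is a formula operator here
fit_squared <- lm(Volume ~ I(Girth^2), data = trees)  # I() forces arithmetic squaring

# Girth^2 in a formula expands back to Girth, so the term is unchanged
names(coef(fit_caret))
names(coef(fit_squared))

# Residual sums of squares for the three fits
c(girth   = deviance(fit_girth),
  caret   = deviance(fit_caret),
  squared = deviance(fit_squared))
```

The first two fits are the same model, which is why an unprotected Girth^2 reproduces the girth result exactly. Only the I() version uses squared girth as the explanatory variable.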