Problem Set # 3

Luis Enrique Ramos

date()

## [1] "Tue Oct 16 08:58:52 2012"

Due Date: October 18, 2012
Total Points: 38

1 The use of a cell phone while driving is hypothesized to increase the chance of an accident. The data set reaction.time (UsingR) is simulated data on the time it takes to react to an external event while driving. Subjects with control=C are not using a cell phone, and those with control=T are. The time to respond to some external event is recorded in seconds.
a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)
b) Repeat the test separately for women and men. What do you conclude? (4)
c) Repeat the test separately for the two age groups. What do you conclude? (4)

a) Perform a t-test on the difference in mean reaction time for groups T and C. What do you conclude? (2)

require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

head(reaction.time)

##     age gender control  time
## 1 16-24      F       T 1.360
## 2 16-24      M       T 1.468
## 3 16-24      M       T 1.512
## 4 16-24      F       T 1.391
## 5 16-24      M       T 1.384
## 6 16-24      M       C 1.394

attach(reaction.time)
t.test(time[control == "T"], time[control == "C"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T"] and time[control == "C"] 
## t = 2.205, df = 29.83, p-value = 0.03529
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.004122 0.107793 
## sample estimates:
## mean of x mean of y 
##     1.446     1.390

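The t-test reports t = 2.2053, df = 29.83 and p-value = 0.035, with a 95% confidence interval for the difference in means (T minus C) of [0.004, 0.108] seconds. The interval excludes zero and the p-value falls within the range 0.01-0.05, so this is moderate evidence against the null hypothesis that the two groups have the same mean reaction time. The sample means (1.446 s for T versus 1.390 s for C) indicate that subjects using a cell phone reacted more slowly on average.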
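As a sanity check on how the reported numbers fit together, the sketch below rebuilds the p-value and the confidence interval from the printed t statistic and degrees of freedom alone. It only uses values copied from the output above; the variable names are illustrative, and the small discrepancies in the last digits come from the rounding of the printed values.

```r
# Values copied from the printed t.test() output above
t_stat <- 2.205
df     <- 29.83
ci_rep <- c(0.004122, 0.107793)   # reported 95% confidence interval

# The point estimate of the difference is the midpoint of the interval
diff_hat <- mean(ci_rep)

# Since t = difference / standard error, the standard error can be backed out
se <- diff_hat / t_stat

# Two-sided p-value from the t distribution with the Welch degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

# Rebuild the 95% interval: estimate +/- critical value * standard error
ci_calc <- diff_hat + c(-1, 1) * qt(0.975, df) * se

round(p_value, 4)
round(ci_calc, 3)
```

The rebuilt interval is in units of seconds of difference between the two group means, which is why it sits near 0.05 rather than near the group means of about 1.4.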

b) Repeat the test separately for women and men. What do you conclude? (4)

t.test(time[control == "T" & gender == "F"], time[control == "C" & gender == 
    "F"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & gender == "F"] and time[control == "C" & gender == "F"] 
## t = 0.875, df = 9.966, p-value = 0.4021
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.05313  0.12179 
## sample estimates:
## mean of x mean of y 
##     1.451     1.416

t.test(time[control == "T" & gender == "M"], time[control == "C" & gender == 
    "M"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & gender == "M"] and time[control == "C" & gender == "M"] 
## t = 1.989, df = 19.13, p-value = 0.06121
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.003504  0.138626 
## sample estimates:
## mean of x mean of y 
##     1.439     1.372

The t-test for the female subsample reports t = 0.875 and p-value = 0.4021, so it fails to reject the null hypothesis. The test for the male subsample reports t = 1.9889 and p-value = 0.06121, which is suggestive but inconclusive evidence against the null hypothesis. As such, based on the simulated data there is insufficient evidence, within either gender taken separately, to conclude that mean reaction time differs between the two control groups (T and C). Note that these tests compare T with C within each gender and do not compare women with men directly. Each subsample is also smaller than the full sample, which reduces the power of the test.

c) Repeat the test separately for the two age groups. What do you conclude? (4)

t.test(time[control == "T" & age == "16-24"], time[control == "C" & age == "16-24"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & age == "16-24"] and time[control == "C" & age == "16-24"] 
## t = 1.8, df = 5.191, p-value = 0.1296
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -0.02422  0.14171 
## sample estimates:
## mean of x mean of y 
##     1.394     1.336

t.test(time[control == "T" & age == "25+"], time[control == "C" & age == "25+"])

## 
##  Welch Two Sample t-test
## 
## data:  time[control == "T" & age == "25+"] and time[control == "C" & age == "25+"] 
## t = 2.627, df = 21.58, p-value = 0.01553
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.01607 0.13715 
## sample estimates:
## mean of x mean of y 
##     1.480     1.403

The two-sample t-test for the 16-24 age group reports t = 1.8001, p-value = 0.1296 and 95% CI [-0.024, 0.142]. The interval includes zero, so there is at most weak evidence against the null hypothesis. The two-sample t-test for the 25+ age group reports t = 2.6273, p-value = 0.015 and 95% CI [0.016, 0.137], so there is moderate evidence against the null hypothesis for this older age group. The sample means are also higher in the 25+ group than in the 16-24 group under both conditions (1.480 versus 1.394 for T, 1.403 versus 1.336 for C), so it appears that reaction time increases with age. These tests do not compare the age groups directly, however, so further research and/or a larger sample is warranted.
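The subgroup tests in parts b) and c) repeat the same call with a different filter each time. One possible way to avoid the repetition is to split the data frame by the grouping variable and run the test once per piece. The sketch below uses the built-in ToothGrowth data purely as a stand-in so that it runs without UsingR; it is not the data analyzed above. With reaction.time loaded, the analogous calls would be split(reaction.time, reaction.time$gender) and t.test(time ~ control, data = d).

```r
# Stand-in data: ToothGrowth ships with base R (columns len, supp, dose).
# Here supp (two levels) plays the role of control, and dose plays the
# role of the subgroup variable (gender or age in the problem above).
by_group <- split(ToothGrowth, ToothGrowth$dose)

# Run one Welch two-sample t-test per subgroup and keep only the p-value
p_values <- sapply(by_group, function(d) t.test(len ~ supp, data = d)$p.value)

round(p_values, 4)
```

With the formula interface the groups are ordered by factor level, so for reaction.time the difference would be computed as C minus T. The t statistic and interval change sign relative to the output above, but the p-value is the same.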

2 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond and you are interested in predicting price from carat size.
a) Make a scatter plot of carat versus price. (2)
b) Add a linear regression line to the plot. (2)
c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

a) Make a scatter plot of carat versus price. (2)

attach(diamond)
require(ggplot2)

## Loading required package: ggplot2

## Attaching package: 'ggplot2'

## The following object(s) are masked from 'package:UsingR':
## 
## movies

p1 = ggplot(diamond, aes(x = carat, y = price)) + geom_point(size = 4, color = "red") + 
    xlab("Diamond Size (carat)") + ylab("Price (Singapore Dollars)")
p1

[Plot: scatter plot of price (Singapore dollars) versus diamond size (carat)]

b) Add a linear regression line to the plot. (2)

p1 = p1 + geom_smooth(method = lm, se = FALSE, col = "blue")
p1

[Plot: scatter plot of price versus diamond size with the fitted linear regression line]

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamond)
model

## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Coefficients:
## (Intercept)        carat  
##        -260         3721

predict(model, data.frame(carat = 1/3))

##     1 
## 980.7

detach(diamond)
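The prediction can be checked by hand from the fitted line, price = intercept + slope * carat. The sketch below plugs in the coefficients as they are printed above, which are rounded. It therefore gives about 980.3 rather than the 980.7 returned by predict(), which works from the unrounded coefficients. Either way, the model predicts a price of roughly 980 Singapore dollars for a 1/3 carat ring.

```r
# Coefficients as printed (rounded) in the model output above
intercept <- -260
slope     <- 3721

# Predicted price for a 1/3 carat diamond ring
carat     <- 1/3
price_hat <- intercept + slope * carat

round(price_hat, 1)
```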

3 The data set trees contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to predict the volume of timber from a measure of girth.
a) Create a scatter plot of the data and label the axes. (4)
b) Add a linear regression line to the plot. (2)
c) Determine the sum of squared residuals. (4)
d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

a) Create a scatter plot of the data and label the axes. (4)

data(trees)  # trees is a built-in data set (package datasets), not a package

attach(trees)
head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

p2 = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point(size = 2, color = "green") + 
    xlab("Girth (inches)") + ylab("Volume of Timber")
p2

[Plot: scatter plot of timber volume versus girth (inches)]

b) Add a linear regression line to the plot. (2)

p2 = p2 + geom_smooth(method = lm, se = FALSE, col = "gold")
p2

[Plot: scatter plot of timber volume versus girth with the fitted linear regression line]

c) Determine the sum of squared residuals. (4)

sum(residuals(lm(Volume ~ Girth, data = trees))^2)

## [1] 524.3

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why? (10)

a') Create a scatter plot of the data and label the axes. (4)

Girth2 = Girth^2
trees = data.frame(Girth, Height, Volume, Girth2)
head(trees)

##   Girth Height Volume Girth2
## 1   8.3     70   10.3  68.89
## 2   8.6     65   10.3  73.96
## 3   8.8     63   10.2  77.44
## 4  10.5     72   16.4 110.25
## 5  10.7     81   18.8 114.49
## 6  10.8     83   19.7 116.64

attach(trees)

## The following object(s) are masked _by_ '.GlobalEnv':
## 
##     Girth2
## The following object(s) are masked from 'trees (position 3)':
## 
##     Girth, Height, Volume

p3 = ggplot(trees, aes(x = Girth2, y = Volume)) + geom_point(size = 2, color = "green") + 
    xlab("Girth^2 (inches^2)") + ylab("Volume of Timber")
p3

[Plot: scatter plot of timber volume versus squared girth (inches^2)]

b') Add a linear regression line to the plot. (2)

p3 = p3 + geom_smooth(method = lm, se = FALSE, col = "gold")
p3

[Plot: scatter plot of timber volume versus squared girth with the fitted linear regression line]

c') Determine the sum of squared residuals. (4)

sum(residuals(lm(Volume ~ Girth2, data = trees))^2)

## [1] 329.3

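I prefer the squared model. Both models use a single explanatory variable, so their sums of squared residuals are directly comparable, and the squared model achieves the lower SSE (329.3 versus 524.3), i.e. the better fit. It also makes physical sense: the volume of timber depends on the cross-sectional area of the trunk, which is proportional to the square of the girth.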
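The comparison can be reproduced side by side in a few lines. The sketch below refits both models on the built-in trees data, writing the squared term with I(Girth^2) so that no extra column or attach() call is needed. It uses deviance(), which for an lm fit returns the residual sum of squares, and also reports R-squared as a second summary of fit. The object names are illustrative.

```r
# Fit both candidate models on the original built-in data set
fit_linear  <- lm(Volume ~ Girth,      data = datasets::trees)
fit_squared <- lm(Volume ~ I(Girth^2), data = datasets::trees)

# Residual sum of squares (SSE) for each model
sse <- c(linear  = deviance(fit_linear),
         squared = deviance(fit_squared))

# Proportion of variance in Volume explained by each model
r2 <- c(linear  = summary(fit_linear)$r.squared,
        squared = summary(fit_squared)$r.squared)

round(sse, 1)
round(r2, 3)
```

Because both fits have the same response and the same number of parameters, the model with the smaller SSE necessarily has the larger R-squared, so the two summaries agree on the ranking.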