Problem Set # 3

Sean P Nickerson

date()

## [1] "Sun Oct 20 23:30:38 2013"

Due Date: October 17, 2013 Total Points: 30

1 The babyboom dataset (UsingR) contains the time of birth, sex, and birth weight for 44 babies born in one 24-hour period at a hospital in Brisbane, Australia.

a) Create side-by-side box plots of birth weight (grams) by gender. (2)

library(UsingR)

## Warning: package 'UsingR' was built under R version 3.0.2

## Loading required package: MASS

library(ggplot2)

## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:UsingR':
## 
## movies

data(babyboom)
ggplot(babyboom, aes(x = factor(gender), y = wt)) + geom_boxplot() + xlab("Gender") + 
    ylab("Birthweight (grams)")

plot of chunk birthWeights

b) Perform a t-test under the hypothesis that there is no difference in birth weight against the alternative hypothesis that girls weigh less. What do you conclude? (5)

t.test(wt ~ gender, data = babyboom)

## 
##  Welch Two Sample t-test
## 
## data:  wt by gender
## t = -1.421, df = 27.63, p-value = 0.1665
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -593.2  107.4
## sample estimates:
## mean in group girl  mean in group boy 
##               3132               3375

with(babyboom, mean(wt[gender == "girl"]) - mean(wt[gender == "boy"]))

## [1] -242.9

library(nullabor)

## Warning: package 'nullabor' was built under R version 3.0.2

fun = null_permute("gender")
inf = lineup(fun, babyboom, n = 6)

## decrypt("yxES JIkI vU KAHvkvAU Z")

ggplot(inf, aes(x = factor(gender), y = wt)) + geom_boxplot() + facet_wrap(~.sample)

plot of chunk birtweightTtest

t.test(wt ~ gender, data = babyboom, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  wt by gender
## t = -1.523, df = 42, p-value = 0.1353
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -564.70   78.98
## sample estimates:
## mean in group girl  mean in group boy 
##               3132               3375

var.test(wt ~ gender, data = babyboom)

## 
##  F test to compare two variances
## 
## data:  wt by gender
## F = 2.177, num df = 17, denom df = 25, p-value = 0.07526
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9226 5.5482
## sample estimates:
## ratio of variances 
##              2.177

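The Welch test above is two-sided (p-value = 0.1665), but the alternative hypothesis here is one-sided: girls weigh less. Because the observed difference is in that direction (t = -1.421, mean difference of -242.9 grams), the one-sided p-value is half the two-sided value, or about 0.083. Based upon these results, the p-value as evidence against the null hypothesis is suggestive, but not conclusive.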
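Because the alternative in this problem is one-sided, the relevant p-value is the lower-tail probability of the t statistic. The following is a minimal sketch that uses only the t statistic and degrees of freedom reported in the Welch output above, so its results are subject to rounding. The guarded call at the end assumes the UsingR package is installed and relies on girl being the first factor level, as the printed group means indicate.

```r
# Welch statistics as reported (rounded) in the two-sided output above
t_stat <- -1.421
df <- 27.63

# Two-sided p-value: probability in both tails
p_two_sided <- 2 * pt(-abs(t_stat), df)

# One-sided p-value for "girls weigh less": lower tail only
p_one_sided <- pt(t_stat, df)

print(c(two_sided = p_two_sided, one_sided = p_one_sided))

# Equivalent direct call, run only if UsingR is available
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(babyboom, package = "UsingR")
  # alternative = "less" tests mean(girl) - mean(boy) < 0
  print(t.test(wt ~ gender, data = babyboom, alternative = "less"))
}
```

Setting alternative = "less" in t.test() produces the one-sided test directly, so no manual halving is needed.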

2 The BushApproval dataset (UsingR) contains approval ratings (%) for George W. Bush from different polling outlets. Perform a t-test under the hypothesis that there is no difference in approval rating between Fox and UPenn versus the alternative that there is a difference. Hint: Subset the data first. The 'or' logical predicate is indicated by the vertical line | on your keyboard. (5)

data(BushApproval)
BA.sub <- subset(BushApproval, who == "fox" | who == "upenn")
t.test(approval ~ who, data = BA.sub)

## 
##  Welch Two Sample t-test
## 
## data:  approval by who
## t = 4.269, df = 46.21, p-value = 9.65e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.791 10.553
## sample estimates:
##   mean in group fox mean in group upenn 
##               65.67               58.50
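The subsetting step can also be written with %in%, which scales better than chaining | when more than two outlets are kept. The sketch below uses a small hypothetical data frame (polls, with made-up approval values) as a stand-in for BushApproval, so it runs without UsingR.

```r
# Hypothetical toy data standing in for BushApproval; the values are made up
polls <- data.frame(
  who = factor(c("fox", "upenn", "gallup", "fox", "upenn", "gallup")),
  approval = c(60, 55, 58, 62, 57, 59)
)

# Two equivalent ways to keep only the fox and upenn rows
sub_or <- subset(polls, who == "fox" | who == "upenn")
sub_in <- subset(polls, who %in% c("fox", "upenn"))
same <- identical(sub_or, sub_in)
print(same)

# subset() keeps unused factor levels; droplevels() removes them
sub_in$who <- droplevels(sub_in$who)
print(levels(sub_in$who))
```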

3 The mtcars dataset contains the miles per gallon and whether or not the transmission is automatic (0 = automatic, 1 = manual) for 32 automobiles.

a) Plot a histogram of the miles per gallon over all cars. Use a bin width of 3 mpg. (3)

data(mtcars)
ggplot(mtcars, aes(mpg, fill = factor(am))) + geom_histogram(binwidth = 3, color = "white") + 
    xlab("Miles Per Gallon (mpg)")

plot of chunk unnamed-chunk-2

b) Perform a Mann-Whitney-Wilcoxon test under the hypothesis that there is no difference in mpg between automatic and manual transmission cars without assuming they follow a normal distribution. The alternative is there is a difference. What do you conclude? (5)

wilcox.test(mpg ~ am, data = mtcars)

## Warning: cannot compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by am
## W = 42, p-value = 0.001871
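## alternative hypothesis: true location shift is not equal to 0

Based upon these results, the p-value of 0.001871 is convincing evidence against the null hypothesis. I conclude that mpg differs between automatic and manual transmission cars.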
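The test result can be read alongside the group medians, which show the direction of the difference. The following is a minimal sketch using only base R; med and res are illustrative names. Passing exact = FALSE asks for the normal approximation up front, which is what R falls back to when ties are present, so the warning is avoided.

```r
# mtcars ships with base R (datasets package), so no extra packages are needed
data(mtcars)

# Median mpg by transmission type (0 = automatic, 1 = manual)
med <- tapply(mtcars$mpg, mtcars$am, median)
print(med)

# Request the normal approximation directly to avoid the ties warning
res <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
print(res$statistic)
print(res$p.value)
```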

4 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond. You are interested in predicting price from carat size.

a) Make a scatter plot of carat versus price. (3)

data(diamond)
dPlot = ggplot(diamond, aes(x = carat, y = price)) + geom_point(size = 3) + 
    xlab("size (carat)") + ylab("price (S$)")
dPlot

plot of chunk diamonCarats

b) Add a linear regression line to the plot. (3)

lm(price ~ carat, data = diamonds)

## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Coefficients:
## (Intercept)        carat  
##       -2256         7756

dPlot + geom_smooth(method = lm, se = FALSE)

plot of chunk diamonCaratsRegression

c) Use the model to predict the amount a 1/3 carat diamond ring would cost. (4)

model = lm(price ~ carat, data = diamonds)
summary(model)

## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18585   -805    -19    537  12732 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2256.4       13.1    -173   <2e-16 ***
## carat         7756.4       14.1     551   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1550 on 53938 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.849 
## F-statistic: 3.04e+05 on 1 and 53938 DF,  p-value: <2e-16

predict(model, data.frame(carat = (1/3)), interval = "predict")

##     fit   lwr  upr
## 1 329.1 -2706 3364

ggplot(diamond, aes(x = cut(carat, 6), y = price)) + geom_boxplot()

plot of chunk diamonCaratsPrediction

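The prediction based upon this model is off because the model was fit to the wrong data. The calls to lm() use diamonds, the ggplot2 data set of about 54,000 diamonds, instead of diamond, the UsingR data set of 48 rings. The 53938 residual degrees of freedom in the summary confirm this. The scatter plot of the diamond data shows a definite linear correlation, and looking at the graph I would predict that a 1/3 carat diamond ring would cost approximately S$980. The predict function returns 329.1 instead, so the model should be refit with data = diamond.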
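A minimal sketch of refitting the regression on the 48-ring diamond data from UsingR and repeating the 1/3 carat prediction. It assumes UsingR is installed; fit and pred are illustrative names not used elsewhere in this document. The result can then be compared with the roughly S$980 read off the scatter plot.

```r
# Run only if UsingR is available; otherwise the block does nothing
if (requireNamespace("UsingR", quietly = TRUE)) {
  # Load the 48-ring data set by its exact name (diamond, not diamonds)
  data(diamond, package = "UsingR")

  # Sanity check: the problem statement says there are 48 rings
  print(nrow(diamond))

  # Fit price as a linear function of carat size
  fit <- lm(price ~ carat, data = diamond)
  print(coef(fit))

  # Predicted price (with 95% prediction interval) for a 1/3 carat ring
  pred <- predict(fit, newdata = data.frame(carat = 1/3),
                  interval = "prediction")
  print(pred)
}
```

Checking nrow() or the residual degrees of freedom before interpreting a fit is a quick way to catch a data set mix-up like this one.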