Rules

On this exam you will explore the data set KaiserBabies.rda, which contains the data frame infants. The data frame holds information that Kaiser collected on all patients in the maternity ward at the Kaiser hospital in SF during May 2017.

#The function load reads a saved .rda file into your workspace. It returns a character vector with the names of the objects it loaded. In this case there is just one object, the data frame infants.
data <- load(url("http://www.stat.berkeley.edu/users/nolan/data/KaiserBabies.rda"))
data
## [1] "infants"
#You can see the variables and the first six rows of this data frame using the head command.
head(infants)
##   gestation bwt parity age          ed ht  wt dage              ded dht
## 1       284 120      1  27     College 62 100   31          College  65
## 2       282 113      2  33     College 64 135   38          College  70
## 3       279 128      1  28 High School 64 115   32 Some High School  NA
## 4        NA 123      2  36     College 69 190   43     Some College  68
## 5       282 108      1  23     College 67 125   24          College  NA
## 6       286 136      4  25 High School 62  93   28      High School  64
##   dwt marital            inc          smoke number
## 1 110 Married   [2500, 5000)          Never  Never
## 2 148 Married   [7000, 8000)          Never  Never
## 3  NA Married   [5000, 6000)            Now    1-4
## 4 197 Married [12500, 15000)  Once, Not Now  20-29
## 5  NA Married   [2500, 5000)            Now  20-29
## 6 130 Married   [7000, 8000) Until Pregnant    5-9

Question 1

Take a simple random sample (without replacement) of 100 observations using the two lines of code below. The function set.seed ensures that everyone uses the same sample.

set.seed(7)
mysample <- sample(na.omit(infants$wt), 100) #sample size 100, matching the question text

a)

Use the sample average to estimate the average weight of the mothers, calculate the estimated standard error of this estimate, and form a 95% confidence interval for the population average (assuming the normal approximation holds).
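
The calculation in part a) can be sketched as follows. This is only an illustration: `wt_pop` is a hypothetical, simulated stand-in for na.omit(infants$wt) so that the code runs without downloading the data, and the numbers it produces are not the exam answer.

```r
# Hypothetical stand-in for na.omit(infants$wt): simulated weights, not the real data
set.seed(1)
wt_pop <- rnorm(1200, mean = 128, sd = 20)

samp <- sample(wt_pop, 100)                   # simple random sample without replacement
n    <- length(samp)
xbar <- mean(samp)                            # point estimate of the population average
se   <- sd(samp) / sqrt(n)                    # estimated standard error of the sample average
ci   <- xbar + c(-1, 1) * qnorm(0.975) * se   # normal-theory 95% confidence interval
ci
```

Because the sample is drawn without replacement from a finite population, a finite population correction of sqrt((N - n)/(N - 1)) would shrink the SE slightly; the sketch leaves it out.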

b)

Repeat the sampling 1000 times (without calling set.seed) to get 1000 different confidence intervals. How many of them do you expect to cover the true average? How many actually do? Note that in practice you could not do this, since you only get one sample.
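
One possible way to count coverage is with replicate. The sketch below again assumes a simulated stand-in population `wt_pop` (hypothetical, not the infants data) and calls set.seed only so the illustration is reproducible.

```r
set.seed(2)
wt_pop <- rnorm(1200, mean = 128, sd = 20)   # hypothetical stand-in population
mu <- mean(wt_pop)                           # "true" average of that population

# For each of 1000 samples, build the 95% interval and record whether it covers mu
covers <- replicate(1000, {
  s  <- sample(wt_pop, 100)
  se <- sd(s) / sqrt(length(s))
  lo <- mean(s) - qnorm(0.975) * se
  hi <- mean(s) + qnorm(0.975) * se
  lo <= mu && mu <= hi
})
sum(covers)   # number of intervals that cover mu; roughly 950 if nominal coverage holds
```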

c)

Calculate the SD of the 1000 sample averages from b). Is it close to the estimated standard error from a)? Make a histogram of the sample averages to see whether it seems plausible that the probability histogram for the sample average follows the normal curve closely. Make a normal probability plot to investigate further. Does it seem like the confidence interval is valid?
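
A rough sketch of these checks, assuming the same kind of simulated stand-in population (`wt_pop` is hypothetical). Since the stand-in is itself normal, the plots here will look better behaved than they may for the real weights.

```r
set.seed(3)
wt_pop <- rnorm(1200, mean = 128, sd = 20)   # hypothetical stand-in population

means    <- replicate(1000, mean(sample(wt_pop, 100)))  # 1000 sample averages
sd_means <- sd(means)                                   # SD of the sample averages
se_one   <- sd(sample(wt_pop, 100)) / sqrt(100)         # estimated SE from a single sample
c(sd_of_averages = sd_means, estimated_se = se_one)

# Histogram on the density scale with a fitted normal curve on top
hist(means, freq = FALSE, main = "Sample averages", xlab = "average weight")
curve(dnorm(x, mean = mean(means), sd = sd(means)), add = TRUE)

# Normal probability (Q-Q) plot: points near the line support the normal approximation
qqnorm(means)
qqline(means)
```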

Question 2

Starting with your original sample do the following.

a)

Use the nonparametric bootstrap to get 1000 random samples of size 100, each drawn with replacement from your original sample. For each, get the sample average and make a histogram of these sample averages (this histogram approximates the sampling distribution of the mean). Put a vertical line through the average of the bootstrap averages. Calculate the SD of the bootstrap averages. Is it close to the estimated SE from 1a above?
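
A minimal sketch of the resampling step, assuming a hypothetical vector `mysample_demo` of simulated values in place of mysample. Each bootstrap sample is drawn with replacement from the one original sample, not from the full data set.

```r
set.seed(4)
mysample_demo <- rnorm(100, mean = 128, sd = 20)   # hypothetical stand-in for mysample

B <- 1000
# sample(x, replace = TRUE) draws length(x) values with replacement from x
boot_means <- replicate(B, mean(sample(mysample_demo, replace = TRUE)))

hist(boot_means, main = "Bootstrap sample averages", xlab = "average weight")
abline(v = mean(boot_means), col = "red", lwd = 2)   # average of the bootstrap averages

sd(boot_means)                                    # bootstrap estimate of the SE
sd(mysample_demo) / sqrt(length(mysample_demo))   # formula-based SE for comparison
```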

b)

Construct a 95% bootstrap CI by taking the 2.5th and 97.5th percentiles of the bootstrap averages. How does it compare to the CI you got in 1a)?
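
The percentile interval can be read off with quantile. The sketch below assumes hypothetical simulated data (`mysample_demo`) and puts the bootstrap interval next to the normal-theory interval for comparison.

```r
set.seed(5)
mysample_demo <- rnorm(100, mean = 128, sd = 20)   # hypothetical stand-in for mysample
boot_means <- replicate(1000, mean(sample(mysample_demo, replace = TRUE)))

# Percentile bootstrap interval: 2.5th and 97.5th percentiles of the bootstrap averages
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# Normal-theory interval from the same sample
n       <- length(mysample_demo)
norm_ci <- mean(mysample_demo) + c(-1, 1) * qnorm(0.975) * sd(mysample_demo) / sqrt(n)

rbind(bootstrap = boot_ci, normal = norm_ci)
```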

Question 3

We will now use the 1236 observations in the data set infants to perform a linear regression analysis.

a)

Fit a linear relationship between the gestation period (x=gestation) of a baby and the baby’s birthweight (y=bwt). What is the equation of the regression line? Make a plot of the regression line on top of the scatter diagram.
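
As an illustration of the mechanics, the sketch below fits and plots a least-squares line on a hypothetical simulated data frame `demo` with the same column names. The intercept and slope used to simulate it are made up and say nothing about the real data.

```r
set.seed(6)
# Hypothetical stand-in data; the coefficients 10 and 0.4 are arbitrary
demo <- data.frame(gestation = round(rnorm(200, mean = 279, sd = 16)))
demo$bwt <- 10 + 0.4 * demo$gestation + rnorm(200, sd = 15)

fit <- lm(bwt ~ gestation, data = demo)   # least-squares fit of bwt on gestation
coef(fit)                                 # intercept and slope of the regression line

plot(bwt ~ gestation, data = demo, xlab = "gestation (days)", ylab = "birthweight")
abline(fit, col = "red", lwd = 2)         # regression line on top of the scatter diagram
```

With R's default settings, lm drops rows that have an NA in either variable, which matters for infants because gestation has missing values.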

b)

Examine the residuals and comment whether the assumptions of the standard statistical model are satisfied.
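
Two common residual plots are sketched below on hypothetical simulated data (`demo` is a stand-in, not the infants data frame): residuals against fitted values to look for curvature or changing spread, and a normal Q-Q plot of the residuals.

```r
set.seed(7)
# Hypothetical stand-in data with arbitrary coefficients
demo <- data.frame(gestation = round(rnorm(200, mean = 279, sd = 16)))
demo$bwt <- 10 + 0.4 * demo$gestation + rnorm(200, sd = 15)
fit <- lm(bwt ~ gestation, data = demo)

res <- resid(fit)

# Residuals vs fitted values: look for a patternless band around zero
plot(fitted(fit), res, xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: points near the line suggest normal errors
qqnorm(res)
qqline(res)
```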

c)

Assuming the model is reasonable, what birthweight would you predict for a baby with a gestation period of 300 days?
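
A prediction at a new x value can be obtained with predict. The sketch assumes a fit on hypothetical simulated data (`demo`), so the number it returns is illustrative only.

```r
set.seed(11)
# Hypothetical stand-in data with arbitrary coefficients
demo <- data.frame(gestation = round(rnorm(200, mean = 279, sd = 16)))
demo$bwt <- 10 + 0.4 * demo$gestation + rnorm(200, sd = 15)
fit <- lm(bwt ~ gestation, data = demo)

new_baby <- data.frame(gestation = 300)   # column name must match the predictor
pred <- predict(fit, newdata = new_baby)  # intercept + slope * 300
pred

# Same point prediction with a 95% prediction interval for a single new baby
predict(fit, newdata = new_baby, interval = "prediction")
```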

Question 4

We wish to compare the mean gestation period (gestation) of mothers who smoke (smoke=Now) versus mothers who don’t smoke (smoke=Never). I will help you with data cleaning and then I want you to do a statistical analysis of whether gestation periods are significantly different if mothers smoke or not.

Let’s create a smaller data frame from infants consisting of two variables, gestation and smoke. We will eliminate observations where smoke doesn’t take the value smoke=Never or smoke=Now.

library(dplyr)
smoke <- infants$smoke
gestation <- infants$gestation
df <- data.frame(smoke, gestation)
df <- df %>% filter(smoke=="Never" |  smoke =="Now") #make data frame where smoke variable takes only two values, "Never" or "Now"
head(df)
##   smoke gestation
## 1 Never       284
## 2 Never       282
## 3   Now       279
## 4   Now       282
## 5 Never       244
## 6 Never       245

The data frame df is in tidy format (see lecture 14 in class exercises).

a)

Make a boxplot to examine gestation as a function of smoke. What do you conclude?
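
A sketch of the boxplot on hypothetical simulated data (`demo_all` and `demo_df` are stand-ins). It assumes smoke is stored as a factor; in that case the unused levels survive filtering and would show up as empty boxes unless they are dropped first.

```r
set.seed(9)
lev <- c("Never", "Now", "Until Pregnant", "Once, Not Now")
# Hypothetical stand-in for the unfiltered data
demo_all <- data.frame(
  smoke     = factor(sample(lev, 300, replace = TRUE), levels = lev),
  gestation = rnorm(300, mean = 279, sd = 16)
)

demo_df <- subset(demo_all, smoke %in% c("Never", "Now"))
nlevels(demo_df$smoke)                       # still 4: unused levels survive subsetting
demo_df$smoke <- droplevels(demo_df$smoke)   # keep only "Never" and "Now"

boxplot(gestation ~ smoke, data = demo_df, ylab = "gestation (days)")
```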

b)

Check the assumption that the distributions of gestation are normal in each group by making Normal Q-Q plots. Also make a plot to check that the variances are the same. What are your conclusions?
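
One way to run these checks is a Q-Q plot per group plus a numerical comparison of the group SDs. The data frame `demo_df` below is a hypothetical simulated stand-in for df.

```r
set.seed(10)
# Hypothetical stand-in for df; group sizes and means are arbitrary
demo_df <- data.frame(
  smoke     = factor(rep(c("Never", "Now"), times = c(120, 100))),
  gestation = c(rnorm(120, mean = 280, sd = 16), rnorm(100, mean = 277, sd = 16))
)

# One normal Q-Q plot per smoking group, side by side
op <- par(mfrow = c(1, 2))
for (g in levels(demo_df$smoke)) {
  x <- demo_df$gestation[demo_df$smoke == g]
  qqnorm(x, main = paste("Normal Q-Q:", g))
  qqline(x)
}
par(op)

# Group SDs as a numerical complement to comparing box heights in a boxplot
group_sd <- tapply(demo_df$gestation, demo_df$smoke, sd)
group_sd
```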

c)

Perform a t-test and a nonparametric Mann-Whitney test of the null hypothesis that the mean gestation periods are the same. Which result do you have more confidence in?
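
Both tests are available in base R. The sketch below runs them on a hypothetical simulated data frame `demo_df`, so the p-values it prints are illustrative only. In R the Mann-Whitney test is provided by wilcox.test.

```r
set.seed(8)
# Hypothetical stand-in for df; group sizes and means are arbitrary
demo_df <- data.frame(
  smoke     = factor(rep(c("Never", "Now"), times = c(120, 100))),
  gestation = c(rnorm(120, mean = 280, sd = 16), rnorm(100, mean = 277, sd = 16))
)

# Two-sample t-test; R's default is the Welch version (unequal variances)
tt <- t.test(gestation ~ smoke, data = demo_df)

# Mann-Whitney (Wilcoxon rank-sum) test, which does not assume normality
mw <- wilcox.test(gestation ~ smoke, data = demo_df)

c(t_test = tt$p.value, mann_whitney = mw$p.value)
```

Passing var.equal = TRUE to t.test gives the pooled-variance version, which is the one that relies on the equal-variance check in b).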