On this exam you will explore the data set KaiserBabies.rda, which contains the data frame infants. It holds information that Kaiser collected on all patients in the maternity ward at the Kaiser hospital in SF during May 2017.
# The function load() loads a saved data set into your workspace and returns a character vector with the names of the objects it loaded. In this case there is just one object, the data frame infants.
data <- load(url("http://www.stat.berkeley.edu/users/nolan/data/KaiserBabies.rda"))
data
## [1] "infants"
#you can see the variables in this data frame using the head command.
head(infants)
##   gestation bwt parity age          ed ht  wt dage              ded dht
## 1       284 120      1  27     College 62 100   31          College  65
## 2       282 113      2  33     College 64 135   38          College  70
## 3       279 128      1  28 High School 64 115   32 Some High School  NA
## 4        NA 123      2  36     College 69 190   43     Some College  68
## 5       282 108      1  23     College 67 125   24          College  NA
## 6       286 136      4  25 High School 62  93   28      High School  64
##   dwt marital            inc          smoke number
## 1 110 Married   [2500, 5000)          Never  Never
## 2 148 Married   [7000, 8000)          Never  Never
## 3  NA Married   [5000, 6000)            Now    1-4
## 4 197 Married [12500, 15000)  Once, Not Now  20-29
## 5  NA Married   [2500, 5000)            Now  20-29
## 6 130 Married   [7000, 8000) Until Pregnant    5-9
1. Take a simple random sample (without replacement) of 100 observations using the two lines of code below. The function set.seed ensures that everyone uses the same sample.
set.seed(7)
mysample <- sample(na.omit(infants$wt), 100)  # 100 mothers' weights, missing values dropped first
a) Use the sample average to estimate the average weight of the mothers, calculate the estimated standard error of this estimate, and form a 95% confidence interval for the population average (assuming the normal approximation holds).
b) Repeat the sampling 1000 times (without calling set.seed) to get 1000 different confidence intervals. How many of them do you expect to cover the true average? How many actually do? Note that in practice you would be unable to do this, since you only get one sample.
c) Calculate the SD of the 1000 sample averages. Is it close to the estimated standard error from a)? Make a histogram of the sample averages to see whether it is plausible that the probability histogram for the sample average follows the normal curve closely. Make a normal probability plot to investigate further. Does the confidence interval seem valid?
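The steps in this question can be sketched as follows. This is a minimal illustration on simulated stand-in data, not a worked solution: the vector pop, its mean and SD, and the helper ci_mean are all hypothetical names and values introduced here, and pop merely plays the role of na.omit(infants$wt).

```r
# Hypothetical stand-in for na.omit(infants$wt); NOT the exam data.
set.seed(1)                               # only so this sketch is reproducible
pop <- rnorm(1200, mean = 130, sd = 20)   # arbitrary illustrative values

# Normal-theory confidence interval for the mean of one sample (hypothetical helper).
ci_mean <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))           # estimated standard error of the average
  z  <- qnorm(1 - (1 - level) / 2)        # about 1.96 when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

one_sample <- sample(pop, 100)            # simple random sample without replacement
ci_mean(one_sample)

# Repeat the sampling 1000 times, keeping each average and its interval.
reps <- replicate(1000, {
  s <- sample(pop, 100)
  c(avg = mean(s), ci_mean(s))
})

# Count the intervals that contain the true population average.
covered <- reps["lower", ] <= mean(pop) & mean(pop) <= reps["upper", ]
sum(covered)

# Spread and shape of the 1000 sample averages.
sd(reps["avg", ])
hist(reps["avg", ], main = "Sample averages", xlab = "Average")
qqnorm(reps["avg", ]); qqline(reps["avg", ])
```

With a 95% interval one would expect roughly 950 of the 1000 intervals to cover. Because the draws are made without replacement from a finite population, the SD of the sample averages is slightly smaller than sd/sqrt(n), so the observed coverage can come out a little above 95%.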
2. Starting with your original sample, do the following.
a) Use the nonparametric bootstrap to draw 1000 resamples of size 100, with replacement, from your original sample. For each, compute the sample average, and make a histogram of these averages (this bootstrap distribution approximates the sampling distribution of the mean). Put a vertical line through the average of the bootstrap distribution. Calculate the SD of the bootstrap averages. Is it close to the estimated SE from 1a above?
b) Construct a 95% bootstrap CI by taking the 2.5th and 97.5th percentiles of the bootstrap averages. How does it compare to the CI you got in 1a)?
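A nonparametric bootstrap of the mean might look like the sketch below. It runs on a simulated vector x that stands in for mysample; x, B, boot_means and boot_ci are hypothetical names, and the simulated values are arbitrary.

```r
set.seed(2)                               # only so this sketch is reproducible
x <- rnorm(100, mean = 130, sd = 20)      # hypothetical stand-in for mysample

# Resample the sample itself, with replacement, B times and keep each average.
B <- 1000
boot_means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

# Histogram of the bootstrap averages with a line at their mean.
hist(boot_means, main = "Bootstrap sample averages", xlab = "Average")
abline(v = mean(boot_means), col = "red", lwd = 2)

sd(boot_means)                            # bootstrap estimate of the SE
sd(x) / sqrt(length(x))                   # formula-based estimate of the SE

# Percentile bootstrap interval: 2.5th and 97.5th percentiles.
boot_ci <- quantile(boot_means, c(0.025, 0.975))
boot_ci
```

The key difference from repeated sampling is that every resample comes from the one sample in hand (replace = TRUE), which is all that is available in practice.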
3. We will now use the 1236 observations in the data set infants to perform a linear regression analysis.
a) Fit a linear relationship between the gestation period (x=gestation) of a baby and the baby’s birthweight (y=bwt). What is the equation of the regression line? Plot the regression line on top of the scatter diagram.
b) Examine the residuals and comment on whether the assumptions of the standard statistical model are satisfied.
c) Assuming the model is reasonable, what birthweight would you predict for a baby with a gestation period of 300 days?
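The regression workflow can be sketched with lm, as below. The data frame sim is simulated and hypothetical: it only borrows the column names gestation and bwt, and the intercept, slope and noise level used to generate it are arbitrary, not estimates from infants.

```r
set.seed(3)                               # only so this sketch is reproducible
# Hypothetical stand-in data using the same column names as infants.
sim <- data.frame(gestation = round(rnorm(200, mean = 280, sd = 15)))
sim$bwt <- -10 + 0.46 * sim$gestation + rnorm(200, sd = 16)  # arbitrary coefficients

# Least-squares fit of birthweight on gestation.
fit <- lm(bwt ~ gestation, data = sim)
coef(fit)                                 # intercept and slope of the fitted line

# Scatter diagram with the fitted line on top.
plot(bwt ~ gestation, data = sim)
abline(fit, col = "red", lwd = 2)

# Residual checks: residuals vs fitted values, and a normal Q-Q plot.
plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))

# Predicted birthweight at a gestation of 300 days.
predict(fit, newdata = data.frame(gestation = 300))
```

On the real data, lm drops rows with a missing gestation or bwt under its default na.action, so no separate cleaning step is needed before fitting.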
4. We wish to compare the mean gestation period (gestation) of mothers who smoke (smoke=Now) versus mothers who have never smoked (smoke=Never). I will help you with the data cleaning, and then I want you to do a statistical analysis of whether gestation periods differ significantly depending on whether the mother smokes.
Let's create a smaller data frame from infants consisting of the two variables gestation and smoke. We will eliminate observations for which smoke does not take the value Never or Now.
library(dplyr)
smoke <- infants$smoke
gestation <- infants$gestation
df <- data.frame(smoke, gestation)
df <- df %>% filter(smoke=="Never" | smoke =="Now") #make data frame where smoke variable takes only two values, "Never" or "Now"
head(df)
## smoke gestation
## 1 Never 284
## 2 Never 282
## 3 Now 279
## 4 Now 282
## 5 Never 244
## 6 Never 245
The data frame df is in tidy format (see the lecture 14 in-class exercises).
a) Make a boxplot to examine gestation as a function of smoke. What do you conclude?
b) Check the assumption that the distributions of gestation are normal by making a Normal Q-Q plot. Also make a plot to check that the variances are the same. What are your conclusions?
c) Perform a t-test and a nonparametric Mann-Whitney test of the null hypothesis that the mean gestation periods are the same. Which result do you have more confidence in?
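The two-group comparison can be sketched as follows. The data frame sim is a simulated, hypothetical stand-in with the same two columns as df; the group sizes, means and SDs are arbitrary. In R, the Mann-Whitney test is run with wilcox.test.

```r
set.seed(4)                               # only so this sketch is reproducible
# Hypothetical stand-in with the same columns as df; values are arbitrary.
sim <- data.frame(
  smoke = factor(rep(c("Never", "Now"), times = c(120, 80))),
  gestation = c(rnorm(120, mean = 280, sd = 16), rnorm(80, mean = 278, sd = 16))
)

# Side-by-side boxplots of gestation by smoking status.
boxplot(gestation ~ smoke, data = sim)

# One Normal Q-Q plot per group.
par(mfrow = c(1, 2))
for (g in levels(sim$smoke)) {
  y <- sim$gestation[sim$smoke == g]
  qqnorm(y, main = g); qqline(y)
}
par(mfrow = c(1, 1))

# Compare the spread of the two groups.
tapply(sim$gestation, sim$smoke, sd)

# t.test uses the Welch (unequal-variance) version by default;
# pass var.equal = TRUE for the pooled-variance version.
tt <- t.test(gestation ~ smoke, data = sim)
# Mann-Whitney test, also known as the Wilcoxon rank-sum test.
mw <- wilcox.test(gestation ~ smoke, data = sim)
tt$p.value
mw$p.value
```

If smoke in df is stored as a factor (an assumption about the data), the filtered data frame may still carry the unused levels, and boxplot would then show empty groups; droplevels(df) removes them.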