Welcome! In this lab, you'll use the techniques you learned in the previous lab to analyze births in North Carolina, USA. Let's go!
In 2004, the state of North Carolina released a large data set containing information on births recorded in their state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/nc.Rdata'))
names(nc)
## [1] "fage" "mage" "mature" "weeks"
## [5] "premie" "visits" "marital" "gained"
## [9] "weight" "lowbirthweight" "gender" "habit"
## [13] "whitemom"
The nc data frame contains observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:

fage: father's age in years.
mage: mother's age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at birth.
gained: weight gained by mother during pregnancy in pounds.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: gender of the baby, female or male.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom is white or not white.

As a first step in the analysis, you should consider summaries of the data. This can be done using the summary command.
The variable summaries tell you which variables are categorical and which are numerical. Think about the types of the variables. If you aren’t sure or want to take a closer look at the data, make a graph.
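If you aren't sure about a variable's type, a quick sketch like the one below can help (str and plotting a factor are base R; habit is used here purely as an example):

str(nc)         # compact overview: variable names, types, and first few values
plot(nc$habit)  # plotting a factor produces a barplot of its level counts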
We will start by analyzing the weight gained by mothers throughout the pregnancy: gained.
summary(nc)
## fage mage mature weeks
## Min. :14.0 Min. :13 mature mom :133 Min. :20.0
## 1st Qu.:25.0 1st Qu.:22 younger mom:867 1st Qu.:37.0
## Median :30.0 Median :27 Median :39.0
## Mean :30.3 Mean :27 Mean :38.3
## 3rd Qu.:35.0 3rd Qu.:32 3rd Qu.:40.0
## Max. :55.0 Max. :50 Max. :45.0
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.0
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.0
## NA's : 2 Median :12.0 NA's : 1 Median :30.0
## Mean :12.1 Mean :30.3
## 3rd Qu.:15.0 3rd Qu.:38.0
## Max. :30.0 Max. :85.0
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.00 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.38 not low:889 male :497 smoker :126
## Median : 7.31 NA's : 1
## Mean : 7.10
## 3rd Qu.: 8.06
## Max. :11.75
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
summary(nc$gained)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 20.0 30.0 30.3 38.0 85.0 27
hist(nc$gained)
Great! Take a moment to analyze your results: can you describe the distribution of weight gained by mothers during their pregnancy?

Right skewed: the bulk of the distribution sits at lower values, with a long right tail.
Question 2 How many mothers are we missing weight gain data from?
sum(is.na(nc$gained))
## [1] 27
Since we have some missing data on weight gain, we’ll first create a cleaned-up version of the weight gain variable, and use this variable in the next few steps of the analysis.
There are many ways of accomplishing this task in R; we'll do it using the na.omit function. The command na.omit(a_data_set) returns a_data_set stripped of its (possible) NA values.
We’ll also store the sample size of the clean data set (which should be less than 1000 since we dropped the observations with NAs) in order to be able to use this value in the next few steps of the analysis as well.
gained_clean = na.omit(nc$gained)
n = length(gained_clean)
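As a quick sanity check (a sketch, not part of the original instructions), n should equal the 1000 observations minus the 27 missing values, and the cleaned vector should contain no NAs:

n                         # 973 = 1000 - 27
sum(is.na(gained_clean))  # 0, since na.omit() dropped the missing values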
Using this sample we would like to construct a bootstrap confidence interval for the average weight gained by all mothers during pregnancy. A quick reminder of how bootstrapping works: take a bootstrap sample (a sample of the same size as the original sample, drawn with replacement from it), compute the bootstrap statistic, here the mean, for that sample, and repeat many times to build the bootstrap distribution, from which the confidence interval is estimated.
Let's first take 100 bootstrap samples, sampling with replacement (i.e. setting the replace argument of sample to TRUE), and record their means in a new object called boot_means. You'll also make a histogram of the bootstrap distribution.
boot_means = rep(NA, 100)
for (i in 1:100) {
  boot_means[i] = mean(sample(gained_clean, n, replace = TRUE))
}
hist(boot_means)
The sampling distribution is constructed by repeatedly sampling from the population, while the bootstrap distribution is constructed by resampling from the sample.
TRUE
To construct the 95% bootstrap confidence interval using the percentile method, we estimate the values of the 5th and 95th percentiles of the bootstrap distribution.
FALSE
(for a 95% interval we use the 2.5th and 97.5th percentiles, leaving 2.5% in each tail)
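As a minimal sketch using the boot_means vector created above (your exact numbers will differ from run to run), the percentile interval is simply the 2.5th and 97.5th percentiles of the bootstrap distribution:

quantile(boot_means, probs = c(0.025, 0.975))  # 95% percentile bootstrap interval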
Next we’ll introduce a new function that you’ll be seeing a lot more of in the upcoming labs – a custom function that allows you to apply any statistical inference method that you’ll be learning in this course. Since this is a custom function, we need to load it first.
Writing a for loop every time you want to calculate a bootstrap interval or run a randomization test is cumbersome. This function automates the process. By default the inference function takes 10,000 bootstrap samples (instead of the 100 you’ve taken above), creates a bootstrap distribution, and calculates the confidence interval, using the percentile method.
The inference function is a bit on the heavy side, so it may take a while to run.
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata'))
inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.9, est = "mean",
boot_method = "perc")
## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Percentile
## 90 % Bootstrap interval = ( 29.5713 , 31.0617 )
We can easily change the confidence level to 95% by changing the argument conflevel of the inference function.
inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "mean",
boot_method = "perc")
## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 29.447 , 31.1771 )
We can also use the standard error method by changing the boot_method argument. Change the boot_method argument to "se" and run the code.
inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "mean",
boot_method = "se")
## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Standard error; Boot. mean = 30.3245 ; Boot. SE = 0.4451
## 95 % Bootstrap interval = ( 29.452 , 31.1969 )
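To see what the standard error method is doing, here is a rough check using the bootstrap mean and SE reported above (a sketch of the usual estimate plus or minus 1.96 standard errors; the inference function's internal details may differ slightly):

30.3245 + c(-1, 1) * 1.96 * 0.4451  # roughly (29.45, 31.20), in line with the reported interval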
To create an interval for the median instead of the mean, you can change the est argument of the inference function.
inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "median",
boot_method = "se")
## Single median
## Summary statistics:
## median = 30 ; n = 973
## Bootstrap method: Standard error; Boot. mean = 29.946 ; Boot. SE = 0.2473
## 95 % Bootstrap interval = ( 29.4613 , 30.4307 )
Question
The bootstrap distribution of the median weight gain is a smooth, unimodal, symmetric distribution that yields a reliable confidence interval estimate.
FALSE
Recognize that when the bootstrap distribution is extremely skewed and sparse, as the bootstrap distribution of the median tends to be since the median takes only a few distinct values across resamples, the bootstrap confidence interval may not be reliable.
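To see this for yourself, a sketch along the lines of the earlier bootstrap loop (again using gained_clean and n; your exact values will differ) shows how few distinct values the bootstrap medians take:

boot_medians = rep(NA, 100)
for (i in 1:100) {
  boot_medians[i] = median(sample(gained_clean, n, replace = TRUE))
}
table(boot_medians)  # only a handful of distinct values: a sparse, spiky distribution
hist(boot_medians)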
In this exercise, you’ll use the inference function to create a 95% bootstrap confidence interval for the mean age of fathers at the birth of their child using the standard error method.
inference(nc$fage, type="ci", method="simulation", conflevel=0.95, est="mean",
boot_method="se")
## Single mean
## Summary statistics:
## mean = 30.2557 ; sd = 6.7638 ; n = 829
## Bootstrap method: Standard error; Boot. mean = 30.2553 ; Boot. SE = 0.2398
## 95 % Bootstrap interval = ( 29.7853 , 30.7253 )
When the response variable is numerical and the explanatory variable is categorical, we can evaluate the relationship between the two variables by comparing means (or medians, or other measures) of the numerical response variable across the levels of the explanatory categorical variable.
boxplot(nc$weight ~ nc$habit)
Based on the plot from the previous exercise, which of the following is false about the relationship between habit and weight?

The false statement: both distributions are slightly right skewed.
The by function
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the by function to split the weight variable into the habit groups, and then take the mean of each using the mean function. You can use the by function as follows to compare the means of the groups: by(numerical_dataset, categorical_dataset, mean). You can also use other functions like median, summary, etc.
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144
## --------------------------------------------------------
## nc$habit: smoker
## [1] 6.829
by(nc$weight, nc$habit, summary)
## nc$habit: nonsmoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.44 7.31 7.14 8.06 11.80
## --------------------------------------------------------
## nc$habit: smoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.69 6.08 7.06 6.83 7.74 9.19
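The observed difference in means can also be computed directly; a quick sketch using tapply (an alternative to by that returns a named vector):

weight_means = tapply(nc$weight, nc$habit, mean)
weight_means["nonsmoker"] - weight_means["smoker"]  # about 0.315 pounds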
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Before you can start hypothesis testing, you need to think about the conditions that are necessary for inference.
To do this, you'll need to obtain the sample sizes in order to check the conditions. You can compute the group sizes using the same by command as above, replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
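Besides the group sizes, the theoretical test also relies on the distribution within each group not being strongly skewed, which matters most for the smaller smoker group. A quick, hypothetical way to eyeball this with base R histograms:

par(mfrow = c(1, 2))  # show the two histograms side by side
hist(nc$weight[nc$habit == "nonsmoker"], main = "nonsmoker", xlab = "birth weight (lbs)")
hist(nc$weight[nc$habit == "smoker"], main = "smoker", xlab = "birth weight (lbs)")
par(mfrow = c(1, 1))  # reset the plotting layout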
Next, we’ll use the inference function for evaluating whether there is a difference between the average birth weights of babies born to smoker and non-smoker mothers. Let’s pause for a moment to go through the arguments of this custom function:
y, the response variable that we are interested in: nc$weight.
x, the explanatory variable, i.e. the grouping variable across the levels of which we're comparing the average value of the response variable, smokers and non-smokers: nc$habit.
est, the parameter we're estimating: "mean" (other options are "median" or "proportion").
type, whether we want a hypothesis test ("ht") or a confidence interval ("ci").
null, the null value for the hypothesis test, here 0 (no difference between the groups).
alternative, the direction of the alternative hypothesis: "less", "greater", or "twosided".
method, whether the inference is "theoretical" or "simulation" based.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386
## Observed difference between means (nonsmoker-smoker) = 0.3155
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
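As a quick check of the reported p-value (a sketch based on the Z statistic printed above), the two-sided p-value is twice the upper-tail normal probability:

2 * pnorm(-2.359)  # approximately 0.0183, consistent with the reported p-value of 0.0184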
By default the inference function sets the parameter of interest to be mu_nonsmoker - mu_smoker. We can easily change this order by using the order argument. To set the order to mu_first - mu_second, use: order = c("first", "second").
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical", order=c('smoker', 'nonsmoker'))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## Observed difference between means (smoker-nonsmoker) = -0.3155
## H0: mu_smoker - mu_nonsmoker = 0
## HA: mu_smoker - mu_nonsmoker != 0
## Standard error = 0.134
## Test statistic: Z = -2.359
## p-value = 0.0184
Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and nonsmoking mothers.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386
## Observed difference between means (nonsmoker-smoker) = 0.3155
## Standard error = 0.1338
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
Which of the following is the best interpretation of the interval?

We are 95% confident that babies born to nonsmoker mothers are on average 0.05 to 0.58 pounds heavier at birth than babies born to smoker mothers.
Now, a non-inference task: determine the age cutoff for younger and mature mothers. Use a method of your choice. What is the maximum age of a younger mom and the minimum age of a mature mom, according to the data?
table(nc$mature)
by(nc$fage, nc$mature, summary)
## nc$mature: mature mom
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 26.0 35.0 38.0 38.4 41.0 55.0 11
## --------------------------------------------------------
## nc$mature: younger mom
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 24.0 29.0 28.9 33.0 48.0 160
boxplot(nc$fage ~ nc$mature)
The maximum age of a younger mom is 34 and the minimum age of a mature mom is 35. Note that the summary and boxplot above were computed on fage (the father's age) rather than mage (the mother's age); the maturity classification is based on the mother's age, which is why the 34/35 cutoff doesn't show up in the fage output.
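A minimal way to check the cutoff using the mother's age (a quick sketch; any equivalent approach works):

by(nc$mage, nc$mature, range)  # min and max of mage within each maturity group
boxplot(nc$mage ~ nc$mature)   # the gap between the two groups shows the 34/35 cutoff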