Complete all Exercises, and submit answers to Questions on the Coursera platform.
Let’s load the necessary packages for this week’s lab:
library(statsr)
library(dplyr)
library(ggplot2)
As usual, the data set and analysis functions will be provided by the statsr
package and we will be using dplyr
and ggplot2
for manipulating and visualizing the data.
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. These data contain information on both the expectant mothers and their children. We will be working with a random sample of the complete data set. For those of you who took the Inferential Statistics course as part of the Statistics with R specialization should recognize this as the same data set used in the Inference for numerical data lab where we used frequentist inference methods to explore these data.
You can load the nc
data set into our workspace using the data
function once the statsr
package is loaded.
data(nc)
This data set consists of 1000 observations on 13 different variables, some categorical and some numerical. The definition of each variable is as follows:
variable | description |
---|---|
fage |
father’s age in years. |
mage |
mother’s age in years. |
mature |
maturity status of mother. |
weeks |
length of pregnancy in weeks. |
premie |
whether the birth was classified as premature (premie) or full-term. |
visits |
number of hospital visits during pregnancy. |
marital |
whether mother is married or not married at birth. |
gained |
weight gained by mother during pregnancy in pounds. |
weight |
weight of the baby at birth in pounds. |
lowbirthweight |
whether baby was classified as low birthweight (low ) or not (not low ). |
gender |
gender of the baby, female or male . |
habit |
status of the mother as a nonsmoker or a smoker . |
whitemom |
whether mom is white or not white . |
weight
As a first step in the analysis, we should take a look at the variables in the dataset and how R has encoded them. The most straight forward way of doing this is using the str
command:
str(nc)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
As you review the variable summaries, consider which variables are categorical and which are numerical. For numerical variables, are there outliers? If you aren’t sure or want to take a closer look at the data, make a graph.
We will start with analyzing the weight of the babies at birth, which are contained in the variable weight
.
Exercise: Using visualization and summary statistics, describe the distribution of weight of the babies at birth.
# type your code for the Exercise here, and Knit
summary(nc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.380 7.310 7.101 8.060 11.750
ggplot(nc, aes(x = weight)) + geom_bar()
weight
?
As with the frequentist approach we use these data to perform basic inference on \(\mu\) the average weight of all babies born in North Carolina. To do this we will use the bayes_inference
function which will allow us construct credible intervals and calculate Bayes factors for a variety of different circumstances.
In order to construct a credible interval we must first provide the data, weight
in this case, and then indicate that we want a credible interval (type="ci"
) for a mean (statistic="mean"
).
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ci", cred_level = 0.99)
## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## 99% CI: (6.978 , 7.2243)
##
## Post. mean = 7.101
## Post. median = 7.101
## Post. mode = 7.1006
The credible level for the interval can be specified using the cred_level
argument.
We can also conduct a Bayesian hypothesis test by calculating a Bayes factor, let’s test to see if the average birth weight in North Carolina is significantly different from 7 pounds.
\[ H_1: \mu = 7 \] \[ H_2: \mu \ne 7 \]
To conduct this hypothesis test we will again use the bayes_inference
function but this time specify type="ht"
, will then also need to provide the null
and alternative
arguments which define the null value (7) and the type of alternative hypothesis ("twosided"
).
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ht", null = 7, alternative = "twosided")
## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## Hypotheses:
## H1: mu = 7
## H2: mu != 7
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 3.3915
## P(H1|data) = 0.7723
## P(H2|data) = 0.2277
The graphical results show the magnitude of \(P(H_1 ~|~ data)\) with the blue line and the posterior of \(\mu\) given \(H_2\) (scaled by \(P(H_2 ~|~ data)\)) via the black curve. The 95% credible interval for \(\mu|data,H_2\) is given in grey.
Exercise: In the US low birth is defined as being less than 2500 grams (\(\approx 5.5\) lbs). Use the bayes_inference
function to assess if the average birth weight in North Carolina is significantly different from this value. (The answer here should be obvious, but make sure that the Bayes factor result conforms with your intuition.)
# type your code for the Exercise here, and Knit
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ht", null = 5.5, alternative = "twosided")
## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## Hypotheses:
## H1: mu = 5.5
## H2: mu != 5.5
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 9.42506e+161
## P(H1|data) = 0
## P(H2|data) = 1
Next, let’s consider if the mother’s smoking habit has any clear effect on the child’s birth weight. Here we will use the variable habit
to distinguish between smoking and non-smoking mothers. As with any analysis, a visualization is a good place to start, and will give us a better understanding of the data.
Exercise: Construct a side-by-side boxplot of habit
and weight
and compare the two distributions.
# type your code for the Exercise here, and Knit
summary(nc$habit)
## nonsmoker smoker NA's
## 873 126 1
ggplot(nc, aes(y = weight, x = habit)) + geom_boxplot()+ coord_flip()
# type your code for the Question 3 here, and Knit
As before we can use the bayes_inference
function to either construct a credible interval and or calculate a Bayes factor. The calls will be identical to the single mean case except now we will provide habit
as a predictor variable (argument x
). Note we also change null=0
, since we are interested in comparing if the means of the two groups are equal or not.
bayes_inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (0.0512 , 0.5795)
##
## Post. mean = 0.3154
## Post. median = 0.3155
## Post. mode = 0.3213
bayes_inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_nonsmoker = mu_smoker
## H2: mu_nonsmoker != mu_smoker
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 1.4387
## P(H1|data) = 0.5899
## P(H2|data) = 0.4101
Finally, we can also conduct Bayesian inference when our outcome variable of interest is categorical. As with the frequentist inference function the only change is to specify a \(y\) argument that is categorical (with only two levels) and then specify which of its levels is the “success” using the success
argument and finally change the statistic
of interest to "proportion"
.
For example if we want to test if only 7.5% of births in North Carolina are considered low birth weight we can calculate the Bayes factor using the following code:
bayes_inference(y = lowbirthweight, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.075, alternative = "twosided")
## Warning in check_beta_prior(beta_prior): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior=c(a,b),
## where a and b are your desired hyperparameters.
## Single categorical variable, success: low
## n = 1000, p-hat = 0.111
##
## Hypotheses:
## H1: p = 0.075
## H2: p != 0.075
##
## Priors:
## P(p) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 93.071
## P(H1|data) = 0.0106
## P(H2|data) = 0.9894
bayes_inference(y = lowbirthweight, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.075, alternative = "twosided", beta_prior=c(75,925) )
## Single categorical variable, success: low
## n = 1000, p-hat = 0.111
##
## Hypotheses:
## H1: p = 0.075
## H2: p != 0.075
##
## Priors:
## P(p) ~ Beta(a=75,b=925)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 50.3154
## P(H1|data) = 0.0195
## P(H2|data) = 0.9805
Using what you have learned so far, conduct a Bayesian inference procedure to evaluate whether these data provide evidence for or against smoking being associated with low birth weight and smoking being associated with premature birth.
bayes_inference(y = lowbirthweight, x = habit, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.0, alternative = "twosided")
## Warning in check_beta_prior(beta_prior1): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior1=c(a,b),
## where a and b are your desired hyperparameters.
## Warning in check_beta_prior(beta_prior2): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior2=c(a,b),
## where a and b are your desired hyperparameters.
## Response variable: categorical (2 levels, success: low)
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, p_hat_nonsmoker = 0.1054
## n_smoker = 126, p_hat_smoker = 0.1429
## Hypotheses:
## H1: p_nonsmoker = p_smoker
## H2: p_nonsmoker != p_smoker
##
## Priors:
## P(p_nonsmoker) ~ Beta(a=1,b=1)
## P(p_smoker) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 3.4563
## P(H1|data) = 0.7756
## P(H2|data) = 0.2244
bayes_inference(y = premie, x = habit, data = nc, success = "premie", statistic = "proportion", type = "ht", null = 0.0, alternative = "twosided")
## Warning in check_beta_prior(beta_prior1): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior1=c(a,b),
## where a and b are your desired hyperparameters.
## Warning in check_beta_prior(beta_prior2): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior2=c(a,b),
## where a and b are your desired hyperparameters.
## Response variable: categorical (2 levels, success: premie)
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 872, p_hat_nonsmoker = 0.1525
## n_smoker = 126, p_hat_smoker = 0.1508
## Hypotheses:
## H1: p_nonsmoker = p_smoker
## H2: p_nonsmoker != p_smoker
##
## Priors:
## P(p_nonsmoker) ~ Beta(a=1,b=1)
## P(p_smoker) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 9.1008
## P(H1|data) = 0.901
## P(H2|data) = 0.099