Statistics With R
Bayesian Statisics
Week 3 (SR_4_3): Inference for numerical data“
Complete all Exercises, and submit answers to Questions on the Coursera platform.
Let’s load the necessary packages for this week’s lab:
library(statsr)
library(dplyr)
library(ggplot2)As usual, the data set and analysis functions will be provided by the statsr package and we will be using dplyr and ggplot2 for manipulating and visualizing the data.
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. These data contain information on both the expectant mothers and their children. We will be working with a random sample of the complete data set. For those of you who took the Inferential Statistics course as part of the Statistics with R specialization should recognize this as the same data set used in the Inference for numerical data lab where we used frequentist inference methods to explore these data.
You can load the nc data set into our workspace using the data function once the statsr package is loaded.
data(nc)This data set consists of 1000 observations on 13 different variables, some categorical and some numerical. The definition of each variable is as follows:
| variable | description |
|---|---|
fage |
father’s age in years. |
mage |
mother’s age in years. |
mature |
maturity status of mother. |
weeks |
length of pregnancy in weeks. |
premie |
whether the birth was classified as premature (premie) or full-term. |
visits |
number of hospital visits during pregnancy. |
marital |
whether mother is married or not married at birth. |
gained |
weight gained by mother during pregnancy in pounds. |
weight |
weight of the baby at birth in pounds. |
lowbirthweight |
whether baby was classified as low birthweight (low) or not (not low). |
gender |
gender of the baby, female or male. |
habit |
status of the mother as a nonsmoker or a smoker. |
whitemom |
whether mom is white or not white. |
weightAs a first step in the analysis, we should take a look at the variables in the dataset and how R has encoded them. The most straight forward way of doing this is using the str command:
str(nc)## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
As you review the variable summaries, consider which variables are categorical and which are numerical. For numerical variables, are there outliers? If you aren’t sure or want to take a closer look at the data, make a graph.
We will start with analyzing the weight of the babies at birth, which are contained in the variable weight.
Exercise: Using visualization and summary statistics, describe the distribution of weight of the babies at birth.
The median baby weight is 7.3 pounds. The mean baby weight is 7.1 pounds. The distribution is approximately normal, with some left skew despite the lower limit of zero and a theoretically infinite upper limit. No babies were above 11.8 pounds.
# type your code for the Exercise here, and Knit
summary(nc$weight)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.380 7.310 7.101 8.060 11.750
ggplot(nc, aes(x = weight)) +
geom_histogram() +
ggtitle("Weight of the baby at birth in pounds") +
xlab("weight") +
ylab("count of babies")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
weight?
As with the frequentist approach we use these data to perform basic inference on \(\mu\) the average weight of all babies born in North Carolina. To do this we will use the bayes_inference function which will allow us construct credible intervals and calculate Bayes factors for a variety of different circumstances.
In order to construct a credible interval we must first provide the data, weight in this case, and then indicate that we want a credible interval (type="ci") for a mean (statistic="mean").
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ci", cred_level = .99)## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## 99% CI: (6.9778 , 7.2242)
##
## Post. mean = 7.101
## Post. median = 7.101
## Post. mode = 7.1022
The credible level for the interval can be specified using the cred_level argument.
We can also conduct a Bayesian hypothesis test by calculating a Bayes factor, let’s test to see if the average birth weight in North Carolina is significantly different from 7 pounds.
\[ H_1: \mu = 7 \] \[ H_2: \mu \ne 7 \]
To conduct this hypothesis test we will again use the bayes_inference function but this time specify type="ht", will then also need to provide the null and alternative arguments which define the null value (7) and the type of alternative hypothesis ("twosided").
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ht", null = 7, alternative = "twosided")## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## Hypotheses:
## H1: mu = 7
## H2: mu != 7
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 3.3915
## P(H1|data) = 0.7723
## P(H2|data) = 0.2277
The graphical results show the magnitude of \(P(H_1 ~|~ data)\) with the blue line and the posterior of \(\mu\) given \(H_2\) (scaled by \(P(H_2 ~|~ data)\)) via the black curve. The 95% credible interval for \(\mu|data,H_2\) is given in grey.
Exercise: In the US low birth is defined as being less than 2500 grams (\(\approx 5.5\) lbs). Use the bayes_inference function to assess if the average birth weight in North Carolina is significantly different from this value. (The answer here should be obvious, but make sure that the Bayes factor result conforms with your intuition.)
# type your code for the Exercise here, and Knit
bayes_inference(y = weight, data = nc, statistic = "mean", type = "ht", null = 5.5, alternative = "twosided")## Single numerical variable
## n = 1000, y-bar = 7.101, s = 1.5089
## (Assuming improper prior: P(mu, sigma^2) = 1/sigma^2)
##
## Hypotheses:
## H1: mu = 5.5
## H2: mu != 5.5
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 9.42506e+161
## P(H1|data) = 0
## P(H2|data) = 1
Next, let’s consider if the mother’s smoking habit has any clear effect on the child’s birth weight. Here we will use the variable habit to distinguish between smoking and non-smoking mothers. As with any analysis, a visualization is a good place to start, and will give us a better understanding of the data.
Exercise: Construct a side-by-side boxplot of habit and weight and compare the two distributions.
The median weight of the baby for non-smoking mothers is about a half pound heavier than for smoking mothers (7.5 vs 7.0). The range of outcomes also appears to be somewhat tighter for non-smoking mothers.
# type your code for the Exercise here, and Knit
ggplot(na.omit(nc), aes(x = habit, y = weight)) +
geom_boxplot() +
ggtitle("Weight of the baby at birth") +
xlab("status of the mother") +
ylab("pounds")# type your code for Question 5 here, and Knit
nc %>%
group_by(habit) %>%
summarise(n = n(), avg = mean(weight), min = min(weight), q1 = quantile(weight, .25), median = median(weight), q3 = quantile(weight, .75), max = max(weight))## # A tibble: 3 x 8
## habit n avg min q1 median q3 max
## <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 873 7.14 1.00 6.44 7.31 8.06 11.8
## 2 smoker 126 6.83 1.69 6.08 7.06 7.74 9.19
## 3 <NA> 1 3.63 3.63 3.63 3.63 3.63 3.63
As before we can use the bayes_inference function to either construct a credible interval and or calculate a Bayes factor. The calls will be identical to the single mean case except now we will provide habit as a predictor variable (argument x). Note we also change null=0, since we are interested in comparing if the means of the two groups are equal or not.
bayes_inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (0.0513 , 0.5802)
##
## Post. mean = 0.3156
## Post. median = 0.3157
## Post. mode = 0.3168
bayes_inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, alternative = "twosided")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_nonsmoker = mu_smoker
## H2: mu_nonsmoker != mu_smoker
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 1.4387
## P(H1|data) = 0.5899
## P(H2|data) = 0.4101
Finally, we can also conduct Bayesian inference when our outcome variable of interest is categorical. As with the frequentist inference function the only change is to specify a \(y\) argument that is categorical (with only two levels) and then specify which of its levels is the “success” using the success argument and finally change the statistic of interest to "proportion".
For example if we want to test if only 7.5% of births in North Carolina are considered low birth weight we can calculate the Bayes factor using the following code:
bayes_inference(y = lowbirthweight, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.075, alternative = "twosided", beta_prior = c(1,1))## Single categorical variable, success: low
## n = 1000, p-hat = 0.111
##
## Hypotheses:
## H1: p = 0.075
## H2: p != 0.075
##
## Priors:
## P(p) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 93.071
## P(H1|data) = 0.0106
## P(H2|data) = 0.9894
bayes_inference(y = lowbirthweight, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.075, alternative = "twosided", beta_prior = c(1,1), hypothesis_prior = c(.25,.75))## Single categorical variable, success: low
## n = 1000, p-hat = 0.111
##
## Hypotheses:
## H1: p = 0.075
## H2: p != 0.075
##
## Priors:
## P(p) ~ Beta(a=1,b=1)
## P(H1) = 0.25
## P(H2) = 0.75
##
## Results:
## BF[H2:H1] = 93.071
## P(H1|data) = 0.0036
## P(H2|data) = 0.9964
bayes_inference(y = lowbirthweight, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.075, alternative = "twosided", beta_prior = c(75,925))## Single categorical variable, success: low
## n = 1000, p-hat = 0.111
##
## Hypotheses:
## H1: p = 0.075
## H2: p != 0.075
##
## Priors:
## P(p) ~ Beta(a=75,b=925)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 50.3154
## P(H1|data) = 0.0195
## P(H2|data) = 0.9805
Using what you have learned so far, conduct a Bayesian inference procedure to evaluate whether these data provide evidence for or against smoking being associated with low birth weight and smoking being associated with premature birth.
bayes_inference(y = lowbirthweight, x = habit, data = nc, success = "low", statistic = "proportion", type = "ht", null = 0.0, alternative = "twosided")## Warning in check_beta_prior(beta_prior1): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior1=c(a,b),
## where a and b are your desired hyperparameters.
## Warning in check_beta_prior(beta_prior2): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior2=c(a,b),
## where a and b are your desired hyperparameters.
## Response variable: categorical (2 levels, success: low)
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, p_hat_nonsmoker = 0.1054
## n_smoker = 126, p_hat_smoker = 0.1429
## Hypotheses:
## H1: p_nonsmoker = p_smoker
## H2: p_nonsmoker != p_smoker
##
## Priors:
## P(p_nonsmoker) ~ Beta(a=1,b=1)
## P(p_smoker) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 3.4563
## P(H1|data) = 0.7756
## P(H2|data) = 0.2244
bayes_inference(y = premie, x = habit, data = nc, success = "premie", statistic = "proportion", type = "ht", null = 0.0, alternative = "twosided")## Warning in check_beta_prior(beta_prior1): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior1=c(a,b),
## where a and b are your desired hyperparameters.
## Warning in check_beta_prior(beta_prior2): No beta prior for p was specified, assuming a uniform prior (p ~ Beta(a=1,b=1)).
## This beta prior is specified using the argument beta_prior2=c(a,b),
## where a and b are your desired hyperparameters.
## Response variable: categorical (2 levels, success: premie)
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 872, p_hat_nonsmoker = 0.1525
## n_smoker = 126, p_hat_smoker = 0.1508
## Hypotheses:
## H1: p_nonsmoker = p_smoker
## H2: p_nonsmoker != p_smoker
##
## Priors:
## P(p_nonsmoker) ~ Beta(a=1,b=1)
## P(p_smoker) ~ Beta(a=1,b=1)
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 9.1008
## P(H1|data) = 0.901
## P(H2|data) = 0.099