The nc dataset contains information on births recorded in North Carolina. With this dataset, we can examine the relationship between habits of expectant mothers and the birth of their children.
library(tidyverse)
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.
variable          description
fage              father’s age in years.
mage              mother’s age in years.
mature            maturity status of mother.
weeks             length of pregnancy in weeks.
premie            whether the birth was classified as premature (premie) or full-term.
visits            number of hospital visits during pregnancy.
marital           whether mother is married or not married at birth.
gained            weight gained by mother during pregnancy in pounds.
weight            weight of the baby at birth in pounds.
lowbirthweight    whether baby was classified as low birthweight (low) or not (not low).
gender            gender of the baby, female or male.
habit             status of the mother as a nonsmoker or a smoker.
whitemom          whether mom is white or not white.
What are the cases in this data set? How many cases are there in our sample?
summary(nc)
fage           mage        mature           weeks          premie         visits        marital          gained         weight          lowbirthweight
Min.   :14.00  Min.   :13  mature mom :133  Min.   :20.00  full term:846  Min.   : 0.0  married    :386  Min.   : 0.00  Min.   : 1.000  low    :111
1st Qu.:25.00  1st Qu.:22  younger mom:867  1st Qu.:37.00  premie   :152  1st Qu.:10.0  not married:613  1st Qu.:20.00  1st Qu.: 6.380  not low:889
Median :30.00  Median :27                   Median :39.00  NA's     :  2  Median :12.0  NA's       :  1  Median :30.00  Median : 7.310
Mean   :30.26  Mean   :27                   Mean   :38.33                 Mean   :12.1                   Mean   :30.33  Mean   : 7.101
3rd Qu.:35.00  3rd Qu.:32                   3rd Qu.:40.00                 3rd Qu.:15.0                   3rd Qu.:38.00  3rd Qu.: 8.060
Max.   :55.00  Max.   :50                   Max.   :45.00                 Max.   :30.0                   Max.   :85.00  Max.   :11.750
NA's   :171                                 NA's   :2                     NA's   :9                      NA's   :27
gender      habit          whitemom
female:503  nonsmoker:873  not white:284
male  :497  smoker   :126  white    :714
            NA's     :  1  NA's     :  2
Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
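One way to draw the plot this exercise asks for is sketched below with base R's boxplot() and a formula. The helper name plot_weight_by_habit is an illustrative assumption, not part of the lab; it expects a data frame with the weight and habit columns described above.

```r
# Hypothetical helper (the name is an assumption, not part of the lab):
# draws side-by-side boxplots of birth weight for each smoking habit.
plot_weight_by_habit <- function(df) {
  # Missing weights are ignored within each group, and rows with a
  # missing habit do not belong to either box.
  stats <- boxplot(weight ~ habit, data = df,
                   xlab = "Mother's smoking habit",
                   ylab = "Birth weight (pounds)",
                   main = "Birth weight by smoking habit")
  invisible(stats)  # quartiles and group sizes, returned without printing
}
```

After load("nc.RData"), calling plot_weight_by_habit(nc) should draw one box per level of habit, which makes it easy to compare the medians and spreads of the two groups.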
by(nc$weight, nc$habit, mean)
nc$habit: nonsmoker
[1] 7.144273
---------------------------------------------------------------------------------------------------------------------------
nc$habit: smoker
[1] 6.82873
On average, babies born to mothers who smoked weighed less (about 6.83 pounds) than babies born to nonsmokers (about 7.14 pounds).
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
nc$habit: nonsmoker
[1] 873
--------------------------------------------------------------------------------------------------------------------------
nc$habit: smoker
[1] 126
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
H0: There is no difference in mean birth weight between babies born to mothers who smoked and babies born to mothers who did not.
HA: There is a difference in mean birth weight between babies born to mothers who smoked and babies born to mothers who did not.
Use the inference function to conduct a hypothesis test:
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
Response variable: numerical, Explanatory variable: categorical
Difference between two means
Summary statistics:
n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
Observed difference between means (nonsmoker-smoker) = 0.3155
H0: mu_nonsmoker - mu_smoker = 0
HA: mu_nonsmoker - mu_smoker != 0
Standard error = 0.134
Test statistic: Z = 2.359
p-value = 0.0184
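As a cross-check, the standard error, test statistic, and p-value can be recomputed by hand from the summary statistics printed above. This is a minimal sketch assuming the normal approximation that the reported Z statistic suggests; it is not the internals of the inference function, and because the inputs are rounded the results should only match to about three decimal places.

```r
# Summary statistics copied from the inference() output above.
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker

diff_means <- m1 - m2                    # observed difference (nonsmoker - smoker)
se <- sqrt(s1^2 / n1 + s2^2 / n2)        # standard error of the difference
z <- (diff_means - 0) / se               # the null value is 0
p_value <- 2 * pnorm(-abs(z))            # two-sided p-value, normal approximation

round(c(se = se, z = z, p_value = p_value), 4)
```

Since the p-value is below 0.05, the data give evidence at the 5% significance level that the two mean birth weights differ.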
Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
Response variable: numerical, Explanatory variable: categorical
Difference between two means
Summary statistics:
n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
Observed difference between means (nonsmoker-smoker) = 0.3155
Standard error = 0.1338
95 % Confidence interval = ( 0.0534 , 0.5777 )
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical",
order = c("smoker","nonsmoker"))
Response variable: numerical, Explanatory variable: categorical
Difference between two means
Summary statistics:
n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
Observed difference between means (smoker-nonsmoker) = -0.3155
Standard error = 0.1338
95 % Confidence interval = ( -0.5777 , -0.0534 )
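The two intervals above are mirror images of each other: swapping the order of the groups only flips the sign of the difference. The sketch below rebuilds both from the printed summary statistics. The helper ci_diff_means is a hypothetical name introduced here, and it assumes a normal critical value, which may differ slightly from what the inference function uses internally.

```r
# Hypothetical helper: normal-approximation confidence interval for a
# difference of two means, computed from summary statistics.
ci_diff_means <- function(m1, s1, n1, m2, s2, n2, conf_level = 0.95) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)            # standard error of the difference
  z_star <- qnorm(1 - (1 - conf_level) / 2)    # critical value, about 1.96 at 95%
  (m1 - m2) + c(lower = -1, upper = 1) * z_star * se
}

ci_diff_means(7.1443, 1.5187, 873, 6.8287, 1.3862, 126)  # nonsmoker - smoker
ci_diff_means(6.8287, 1.3862, 126, 7.1443, 1.5187, 873)  # smoker - nonsmoker
```

Neither interval contains 0, which agrees with the hypothesis test rejecting the null at the 5% level.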
On your own:
Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
Single mean
Summary statistics: mean = 38.3347 ; sd = 2.9316 ; n = 998
Standard error = 0.0928
95 % Confidence interval = ( 38.1528 , 38.5165 )
Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical", conflevel = 0.90)
Single mean
Summary statistics: mean = 38.3347 ; sd = 2.9316 ; n = 998
Standard error = 0.0928
90 % Confidence interval = ( 38.182 , 38.4873 )
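Both intervals for the mean pregnancy length follow from the same three numbers in the output: the sample mean, the standard deviation, and n. The sketch below recomputes them under a normal approximation. The helper name ci_mean is an assumption made for illustration, and small differences in the last decimal place are expected because the inputs are rounded.

```r
# Hypothetical helper: normal-approximation confidence interval for one mean.
ci_mean <- function(xbar, s, n, conf_level = 0.95) {
  se <- s / sqrt(n)                            # standard error of the mean
  z_star <- qnorm(1 - (1 - conf_level) / 2)    # critical value for the chosen level
  xbar + c(lower = -1, upper = 1) * z_star * se
}

ci_mean(38.3347, 2.9316, 998)                     # 95% interval
ci_mean(38.3347, 2.9316, 998, conf_level = 0.90)  # 90% interval, narrower
```

Lowering the confidence level shrinks the critical value, so the 90% interval is narrower than the 95% interval around the same center.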
Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
Response variable: numerical, Explanatory variable: categorical
Difference between two means
Summary statistics:
n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
Observed difference between means (mature mom-younger mom) = -1.7697
Standard error = 1.2857
95 % Confidence interval = ( -4.2896 , 0.7502 )
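The 95% confidence interval contains 0, so at the 5% significance level we fail to reject the null hypothesis: the data do not provide evidence that the average weight gained differs between younger and mature mothers.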
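The exercise asks for a hypothesis test, and the interval answers it only indirectly. The test statistic and p-value can be recovered from the difference and standard error printed above. This is a minimal sketch assuming a two-sided test with a normal approximation, not output from the inference function.

```r
# Values copied from the inference() output above (mature mom - younger mom).
diff_means <- -1.7697
se <- 1.2857

z <- (diff_means - 0) / se        # test statistic under H0: difference = 0
p_value <- 2 * pnorm(-abs(z))     # two-sided p-value, normal approximation

round(c(z = z, p_value = p_value), 4)
```

The p-value should come out well above 0.05 (roughly 0.17), which is consistent with the interval containing 0.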
Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
nc %>%
  filter(mature == "younger mom") %>%
  summarise(max = max(mage))
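The maximum age among younger mothers gives only one side of the cutoff. Looking at the age range within both groups confirms that the groups do not overlap. The sketch below uses base R; the helper name age_range_by_maturity is an assumption introduced for illustration, and it expects the mage and mature columns described earlier.

```r
# Hypothetical helper: youngest and oldest mother within each maturity group.
age_range_by_maturity <- function(df) {
  keep <- !is.na(df$mage) & !is.na(df$mature)      # drop rows missing either value
  ages <- split(df$mage[keep], droplevels(factor(df$mature[keep])))
  out <- t(sapply(ages, range))                    # one row per group
  colnames(out) <- c("min_age", "max_age")
  out
}
```

Calling age_range_by_maturity(nc) returns one row per group. If the largest age in the younger group is below the smallest age in the mature group, the cutoff lies between those two values.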
Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.
Let’s look at whether babies born to mothers who smoked were more likely to be premature:
H0: There is no difference in the proportion of premature births between mothers who smoked and those who did not.
HA: There is a difference in the proportion of premature births between mothers who smoked and those who did not.
inference(y = nc$premie, x = nc$habit, est = "proportion", type = "ci", null = 0,
alternative = "twosided", method = "theoretical", success = "premie")
Response variable: categorical, Explanatory variable: categorical
Two categorical variables
Difference between two proportions -- success: premie
Summary statistics:
           x
y           nonsmoker smoker Sum
  full term       739    107 846
  premie          133     19 152
  Sum             872    126 998
Observed difference between proportions (nonsmoker-smoker) = 0.0017
Check conditions:
nonsmoker : number of successes = 133 ; number of failures = 739
smoker : number of successes = 19 ; number of failures = 107
Standard error = 0.0341
95 % Confidence interval = ( -0.0652 , 0.0686 )
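The 95% confidence interval contains 0, so we have insufficient evidence to reject the null hypothesis. In plain language, these data do not show that the rate of premature births differs between mothers who smoked and mothers who did not.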
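The interval can be reproduced from the counts in the table above. The sketch below assumes the usual normal-approximation interval for a difference of two proportions with an unpooled standard error. The helper name ci_diff_props is hypothetical, and it also enforces the success-failure condition listed under "Check conditions".

```r
# Hypothetical helper: normal-approximation confidence interval for a
# difference of two proportions, from success counts and group sizes.
ci_diff_props <- function(x1, n1, x2, n2, conf_level = 0.95) {
  # Success-failure condition: at least 10 successes and 10 failures per group.
  stopifnot(min(x1, n1 - x1, x2, n2 - x2) >= 10)
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  (p1 - p2) + c(lower = -1, upper = 1) * z_star * se
}

# 133 of 872 nonsmokers and 19 of 126 smokers had premature births.
ci_diff_props(133, 872, 19, 126)  # nonsmoker - smoker
```

The two sample proportions are both close to 15%, so the interval is centered near 0 and is wide mainly because the smoker group is small.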