# Inference for numerical data
# North Carolina births
# In 2004, the state of North Carolina released a large data set containing information on births recorded in this state.
# This data set is useful to researchers studying the relation between habits and practices of expectant mothers and
# the birth of their children.
# We will work with a random sample of observations from this data set.
# Exploratory analysis
# Load the nc data set into our workspace.
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
names(nc)
## [1] "fage" "mage" "mature" "weeks"
## [5] "premie" "visits" "marital" "gained"
## [9] "weight" "lowbirthweight" "gender" "habit"
## [13] "whitemom"
# We have observations on 13 different variables, some categorical and some numerical.
# The meaning of each variable is as follows.
#
# variable         description
# --------------   -----------
# fage             father's age in years.
# mage             mother's age in years.
# mature           maturity status of mother.
# weeks            length of pregnancy in weeks.
# premie           whether the birth was classified as premature (premie) or full-term.
# visits           number of hospital visits during pregnancy.
# marital          whether mother is married or not married at birth.
# gained           weight gained by mother during pregnancy in pounds.
# weight           weight of the baby at birth in pounds.
# lowbirthweight   whether baby was classified as low birthweight (low) or not (not low).
# gender           gender of the baby, female or male.
# habit            status of the mother as a nonsmoker or a smoker.
# whitemom         whether mom is white or not white.
# Exercise 1: What are the cases in this data set? How many cases are there in our sample?
# Each case is one recorded birth. The dim function returns the number of rows (cases) and columns (variables).
dim(nc)
## [1] 1000 13
# As a first step in the analysis, we should consider summaries of the data. This can be done using the summary command:
summary(nc)
##       fage            mage            mature        weeks             premie
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2
##  Mean   :30.26   Mean   :27                     Mean   :38.33
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00
##  Max.   :55.00   Max.   :50                     Max.   :45.00
##  NA's   :171                                    NA's   :2
##      visits            marital        gained          weight
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750
##  NA's   :9                        NA's   :27
##  lowbirthweight    gender          habit          whitemom
##  low    :111    female:503   nonsmoker:873   not white:284
##  not low:889    male  :497   smoker   :126   white    :714
##                              NA's     :  1   NA's     :  2
# As you review the variable summaries, consider which variables are categorical and which are numerical.
# For numerical variables, are there outliers? If you aren't sure or want to take a closer look at the data, make a graph.
# Consider the possible relationship between a mother's smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
plot(nc$weight ~ nc$habit)
# Exercise 2: Make a side-by-side boxplot of weight by habit.
# What does the plot highlight about the relationship between these two variables?
boxplot(nc$weight ~ nc$habit)

# The box plots show how the medians of the two distributions compare,
# but we can also compare the means. The by function splits the weight variable
# into the habit groups and then applies the mean function to each group.
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------
## nc$habit: smoker
## [1] 6.82873
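# The observed difference in means can also be computed directly. This is a small sketch
# that reuses the two group means printed above; the variable names are illustrative
# and not part of the lab.
mean_nonsmoker <- 7.144273  # mean birth weight (lbs) for nonsmokers, from the by() output
mean_smoker <- 6.82873      # mean birth weight (lbs) for smokers, from the by() output
obs_diff <- mean_nonsmoker - mean_smoker
print(round(obs_diff, 4))
# With the data loaded, the same quantity can be obtained from nc itself
# (this step is skipped when nc is not available).
if (exists("nc")) {
  group_means <- tapply(nc$weight, nc$habit, mean)
  print(unname(group_means["nonsmoker"] - group_means["smoker"]))
}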
# There is an observed difference, but is this difference statistically significant? To answer this question we
# will conduct a hypothesis test.
# Inference
# Exercise 3
# Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to
# check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------
## nc$habit: smoker
## [1] 126
table(nc$habit)
##
## nonsmoker smoker
## 873 126
# Exercise 4
# Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
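# Null Hypothesis H0: mu_nonsmoker - mu_smoker = 0 (the average birth weights are equal).
# Alt Hypothesis HA: mu_nonsmoker - mu_smoker != 0 (the average birth weights differ).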
# Next, we introduce a new function, "inference", that we will use for conducting hypothesis tests and
# constructing confidence intervals.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
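# The standard error, test statistic, and p-value above can be checked by hand from the
# reported summary statistics. This is a minimal sketch, assuming the "theoretical" method
# uses the unpooled standard error and a standard normal reference distribution (as the
# Z label in the output suggests). The variable names are illustrative.
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker: sample size, mean, sd
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker: sample size, mean, sd
se_diff <- sqrt(s1^2 / n1 + s2^2 / n2)  # standard error of the difference in means
z_stat <- (m1 - m2 - 0) / se_diff       # the null value is 0
p_two_sided <- 2 * pnorm(-abs(z_stat))  # two-sided p-value under the standard normal
print(round(c(se = se_diff, z = z_stat, p = p_two_sided), 4))
# Because the inputs are rounded, the results agree with the output above only up to rounding.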

# Let's pause for a moment to go through the arguments of this custom function.
# The first argument is y, which is the response variable that we are interested in: nc$weight.
# The second argument is the explanatory variable, x, which is the variable that splits the data into two groups,
# smokers and non-smokers: nc$habit.
# The third argument, est, is the parameter we're interested in: "mean" (other options are "median" or "proportion").
# Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci").
# When performing a hypothesis test, we also need to supply the null value, which in this case is 0,
# since the null hypothesis sets the two population means equal to each other.
# The alternative hypothesis can be "less", "greater", or "twosided".
# Lastly, the method of inference can be "theoretical" or "simulation" based.
# Exercise 5: Change the type argument to "ci" to construct and record a confidence interval for the difference between
# the weights of babies born to smoking and non-smoking mothers.
# By default the function reports an interval for (mu_nonsmoker - mu_smoker).
# We can easily change this order by using the order argument:
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical",
          order = c("smoker", "nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
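# The interval above has the form: point estimate +/- critical value * standard error.
# A minimal sketch that rebuilds it from the printed values, assuming a standard normal
# critical value; the variable names are illustrative.
diff_sn <- -0.3155        # observed difference (smoker - nonsmoker), from the output
se_sn <- 0.1338           # standard error, from the output
z_star <- qnorm(0.975)    # critical value for 95% confidence, about 1.96
ci_95 <- diff_sn + c(-1, 1) * z_star * se_sn
print(round(ci_95, 4))
# Both endpoints are negative, so 0 is not in the interval. This is consistent with the
# two-sided test above, which gave a p-value below 0.05.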
# On your own
# 1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context.
inference(y = nc$weeks, est = "mean", type = "ci", conflevel = 0.95, method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
# Interpretation: we are 95% confident that the average length of pregnancy for births in
# North Carolina is between about 38.15 and 38.52 weeks.
# 2. Calculate a new confidence interval for the same parameter at the 90% confidence level.
inference(y = nc$weeks, est = "mean", type = "ci", conflevel = 0.90, method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
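# The 90% interval is narrower than the 95% interval because a lower confidence level uses a
# smaller critical value. A minimal sketch with a hypothetical helper, mean_ci (not part of
# the lab), that assumes a standard normal critical value and uses the printed summary statistics.
mean_ci <- function(xbar, s, n, conflevel) {
  z_star <- qnorm(1 - (1 - conflevel) / 2)  # two-sided critical value
  se <- s / sqrt(n)                         # standard error of the sample mean
  xbar + c(-1, 1) * z_star * se
}
ci_weeks_95 <- mean_ci(38.3347, 2.9316, 998, 0.95)
ci_weeks_90 <- mean_ci(38.3347, 2.9316, 998, 0.90)
print(round(rbind(ci_weeks_95, ci_weeks_90), 4))
# The interval widths can be compared directly with diff().
print(round(c(width_95 = diff(ci_weeks_95), width_90 = diff(ci_weeks_90)), 4))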
# 3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different from the average weight gained by mature mothers.
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 1.286
## Test statistic: Z = -1.376
## p-value = 0.1686
# With a p-value of 0.1686, above the conventional 0.05 significance level, the data do not provide
# convincing evidence that the average weight gained differs between younger and mature mothers.

# 4. Determine the age cutoff for younger and mature mothers.
# Let's say younger mothers are those with age <= cutoff, and mature mothers are those with age > cutoff.
# We can use summary statistics to help us decide on a reasonable cutoff.
summary(nc$mage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13 22 27 27 32 50
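# The overall summary above does not show where the younger/mature split falls. One way to
# locate it is to look at the range of ages within each level of mature: the cutoff lies
# between the largest "younger mom" age and the smallest "mature mom" age.
# A minimal sketch with a hypothetical helper, group_ranges (not part of the lab),
# demonstrated on made-up toy ages that are NOT taken from nc.
group_ranges <- function(values, groups) {
  tapply(values, groups, range, na.rm = TRUE)  # min and max of values within each group
}
toy_age <- c(20, 25, 30, 40, 45)  # made-up ages for illustration only
toy_group <- factor(c("younger", "younger", "younger", "mature", "mature"))
toy_ranges <- group_ranges(toy_age, toy_group)
print(toy_ranges)
# With the data loaded, apply the same helper to the real variables
# (this step is skipped when nc is not available).
if (exists("nc")) {
  print(group_ranges(nc$mage, nc$mature))
}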
# 5. Pick a pair of numerical and categorical variables and come up with a research question.
# Evaluate the relationship between these variables using a hypothesis test and/or a confidence interval.
# For example, let's explore the relationship between the number of hospital visits (numerical) and marital status (categorical).
inference(y = nc$visits, x = nc$marital, est = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical",
          order = c("married", "not married"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_married = 380, mean_married = 10.9553, sd_married = 4.2408
## n_not married = 611, mean_not married = 12.82, sd_not married = 3.5883
## Observed difference between means (married-not married) = -1.8647
##
## H0: mu_married - mu_not married = 0
## HA: mu_married - mu_not married != 0
## Standard error = 0.262
## Test statistic: Z = -7.13
## p-value = 0
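# The p-value is printed as 0 only because of rounding: it is extremely small, not exactly zero.
# A minimal sketch that recomputes it from the reported test statistic, assuming a
# standard normal reference distribution; the variable names are illustrative.
z_visits <- -7.13                       # test statistic from the output above
p_visits <- 2 * pnorm(-abs(z_visits))   # two-sided p-value
print(format(p_visits, scientific = TRUE, digits = 3))
# Such a small p-value is strong evidence of a difference in the average number of
# visits between the two marital status groups.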
