load("more/nc.RData")
  1. What are the cases in this data set? How many cases are there in our sample?
dim(nc)
## [1] 1000   13

The cases in this dataset correspond to individual births that occurred in North Carolina.

There are 1000 cases in the sample.

  2. Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
boxplot(nc$weight ~ nc$habit, horizontal = TRUE, xlab = "Weight", main = "Baby weights by mother's smoking habit")

The plot highlights a slight difference in baby weights between the two groups: babies born to non-smoking mothers have a slightly higher median weight than babies born to smoking mothers.

  3. Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126
by(nc$weight, nc$habit, hist)

## nc$habit: nonsmoker
## $breaks
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## 
## $counts
##  [1]  15  10  12  28  86 197 296 174  46   7   2
## 
## $density
##  [1] 0.017182131 0.011454754 0.013745704 0.032073310 0.098510882
##  [6] 0.225658648 0.339060710 0.199312715 0.052691867 0.008018328
## [11] 0.002290951
## 
## $mids
##  [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $xname
## [1] "dd[x, ]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## -------------------------------------------------------- 
## nc$habit: smoker
## $breaks
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $counts
## [1]  1  2  1  8 19 31 41 17  6
## 
## $density
## [1] 0.007936508 0.015873016 0.007936508 0.063492063 0.150793651 0.246031746
## [7] 0.325396825 0.134920635 0.047619048
## 
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
## 
## $xname
## [1] "dd[x, ]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Each sample is less than 10% of its respective population, so we can assume that the observations are independent. Both group sizes (873 non-smokers and 126 smokers) are greater than 30. The weight distributions are left-skewed in both groups; however, since the samples are large, the normality condition can be relaxed.

Aside from the moderate skewness of the data, the conditions for inference are satisfied.
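
As a rough check on the skewness mentioned above, the bin midpoints and counts printed by hist() can be used to approximate a moment-based skewness coefficient for each group. This is only a sketch: binned_skewness is a hypothetical helper name, and treating every observation as sitting at its bin midpoint is an approximation.

```r
# Approximate moment-based skewness from histogram bins,
# treating each observation as if it sat at its bin midpoint.
binned_skewness <- function(mids, counts) {
  n  <- sum(counts)
  m  <- sum(mids * counts) / n            # approximate mean
  m2 <- sum(counts * (mids - m)^2) / n    # second central moment
  m3 <- sum(counts * (mids - m)^3) / n    # third central moment
  m3 / m2^1.5
}

# Counts copied from the hist() output above; midpoints are 1.5, 2.5, ...
nonsmoker_counts <- c(15, 10, 12, 28, 86, 197, 296, 174, 46, 7, 2)
smoker_counts    <- c(1, 2, 1, 8, 19, 31, 41, 17, 6)

skew_nonsmoker <- binned_skewness(seq(1.5, 11.5, by = 1), nonsmoker_counts)
skew_smoker    <- binned_skewness(seq(1.5, 9.5, by = 1), smoker_counts)

# Negative values indicate a longer left tail
print(c(nonsmoker = skew_nonsmoker, smoker = skew_smoker))
```

A negative coefficient in both groups is consistent with a longer tail toward low birth weights.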

  4. Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0: There is no difference between the average weights of babies born to smoking and non-smoking mothers (mu_nonsmoker - mu_smoker = 0).

HA: There is a difference between the average weights of babies born to smoking and non-smoking mothers (mu_nonsmoker - mu_smoker != 0).

  5. Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 3.5.3
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
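
The interval above can be checked by hand from the reported summary statistics. The sketch below assumes the unpooled standard error and a normal critical value; that assumption is mine, not taken from the inference function's source, and small differences in the last digit come from the rounded inputs.

```r
# Summary statistics copied from the inference() output above
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker

diff_means <- m1 - m2
se     <- sqrt(s1^2 / n1 + s2^2 / n2)   # unpooled standard error
z_star <- qnorm(0.975)                  # critical value for 95% confidence
ci     <- diff_means + c(-1, 1) * z_star * se

print(round(c(se = se, lower = ci[1], upper = ci[2]), 4))
```

Because the interval lies entirely above 0, it is consistent with rejecting the null hypothesis of no difference at the 5% level.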

On your own

1). Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

We are 95% confident that the average length of a pregnancy in North Carolina is between 38.1528 weeks and 38.5165 weeks.

2). Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", conflevel = 0.9)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
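
To see how the confidence level changes the interval, both intervals can be recomputed from the reported mean, standard deviation, and sample size. This is a minimal sketch: ci_mean is a hypothetical helper, and it assumes a normal critical value.

```r
# Confidence interval for a single mean from summary statistics.
# conflevel mirrors the argument name used by inference().
ci_mean <- function(xbar, s, n, conflevel = 0.95) {
  se     <- s / sqrt(n)
  z_star <- qnorm(1 - (1 - conflevel) / 2)
  xbar + c(-1, 1) * z_star * se
}

# Values copied from the output above
ci95 <- ci_mean(38.3347, 2.9316, 998, conflevel = 0.95)
ci90 <- ci_mean(38.3347, 2.9316, 998, conflevel = 0.90)

print(round(rbind(ci95, ci90), 4))
```

The 90% interval is narrower than the 95% interval because a lower confidence level uses a smaller critical value with the same standard error.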

3). Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
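## p-value =  0.1686

Since the p-value (0.1686) is greater than 0.05, we fail to reject the null hypothesis at the 5% significance level. The data do not provide convincing evidence that the average weight gained by younger mothers differs from the average weight gained by mature mothers.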
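
The test statistic and p-value reported for this test can be reproduced from the summary statistics. The sketch below assumes an unpooled standard error and a two-sided normal p-value, matching the Z statistic shown in the output.

```r
# Summary statistics copied from the inference() output above
n1 <- 129; m1 <- 28.7907; s1 <- 13.4824   # mature mom
n2 <- 844; m2 <- 30.5604; s2 <- 14.3469   # younger mom

se      <- sqrt(s1^2 / n1 + s2^2 / n2)    # unpooled standard error
z       <- (m1 - m2 - 0) / se             # null value of the difference is 0
p_value <- 2 * pnorm(-abs(z))             # two-sided p-value

print(round(c(se = se, z = z, p_value = p_value), 4))
```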

4). Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

To determine the age cutoff for younger and mature mothers, I used logistic regression. I fit a model with the glm function that predicts the mature classification from age, then plotted the fitted logistic curve to show the predicted probability of belonging to the "younger mom" class at each age. The inflection point of this curve, where the predicted probability crosses 0.5, marks the cutoff. Note that the model uses fage, which is the father’s age in the nc data; the mother’s age is recorded in mage, so the result is only an indirect estimate of the mother’s age cutoff.

For this model, the cutoff point appears to be at about 40 years old.

# glm models the probability of the second factor level of mature, here "younger mom"
logisticmodel = glm(mature ~ fage, data = nc, family = binomial(link = logit))

summary(logisticmodel)
## 
## Call:
## glm(formula = mature ~ fage, family = binomial(link = logit), 
##     data = nc)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.81115   0.09766   0.22658   0.45018   2.17974  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.27088    0.90053   12.52   <2e-16 ***
## fage        -0.28227    0.02498  -11.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 692.65  on 828  degrees of freedom
## Residual deviance: 464.17  on 827  degrees of freedom
##   (171 observations deleted due to missingness)
## AIC: 468.17
## 
## Number of Fisher Scoring iterations: 6
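
The inflection point of a fitted logistic curve is the age at which the linear predictor equals 0, so it can be computed directly from the two coefficients in the summary above rather than read off a plot. A minimal sketch, with the coefficient values copied by hand:

```r
# Coefficients copied from summary(logisticmodel)
b0 <- 11.27088   # (Intercept)
b1 <- -0.28227   # fage

# The predicted probability is 0.5 where b0 + b1 * age = 0
crossover_age <- -b0 / b1

# Sanity check: the logistic function at the crossover is 0.5
p_at_crossover <- plogis(b0 + b1 * crossover_age)

print(round(c(crossover_age = crossover_age, p = p_at_crossover), 2))
```

This gives a crossover just under 40, which agrees with the cutoff read from the plotted curve.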
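# TRUE for "younger mom", FALSE for "mature mom" (plotted as 1 and 0)
nc$maturebinary = nc$mature == "younger mom"

# Keep only the rows where both the age and the class are recorded
FemaleAgeVector = na.omit(data.frame(fage = nc$fage, maturebinary = nc$maturebinary))

# Evenly spaced grid of ages across the observed range, used to draw the fitted curve
newdata = data.frame(fage = seq(min(FemaleAgeVector$fage), max(FemaleAgeVector$fage), length.out = nrow(FemaleAgeVector)))

# Predicted probability of "younger mom" at each age on the grid
newdata$predicted = predict(logisticmodel, newdata = newdata, type = "response")

# Observed classes as points, with the fitted logistic curve overlaid
plot(maturebinary~fage, data = FemaleAgeVector, col = "red4")
lines(predicted~fage, data = newdata, col = "green4", lwd = 2)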
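
A simpler, model-free way to find a cutoff is to look at the range of the mother's age within each group: if the classification is a hard threshold, the oldest "younger mom" and the youngest "mature mom" bracket it. The sketch below runs on made-up toy data, not on nc; the ages and toy_cutoff are hypothetical values chosen only to show the mechanics. On the real data the same idea would be applied to nc$mage, assuming that column holds the mother's age.

```r
# Toy data only: these ages and this cutoff are made up for illustration
toy_cutoff <- 30
toy <- data.frame(mage = c(19, 24, 29, 30, 33, 41))
toy$mature <- factor(ifelse(toy$mage >= toy_cutoff, "mature mom", "younger mom"))

# Age range within each class; the gap between the ranges brackets the cutoff
age_ranges <- tapply(toy$mage, toy$mature, range)
print(age_ranges)

# The youngest "mature mom" recovers the threshold used to build the toy data
recovered_cutoff <- min(toy$mage[toy$mature == "mature mom"])
print(recovered_cutoff)
```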

5). Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.

Is there a difference in the average pregnancy length between premature babies and full term babies?

inference(y = nc$weeks, x = nc$premie, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", conflevel = 0.95)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 39.2482, sd_full term = 1.5674
## n_premie = 152, mean_premie = 33.25, sd_premie = 3.5064

## Observed difference between means (full term-premie) = 5.9982
## 
## Standard error = 0.2895 
## 95 % Confidence interval = ( 5.4309 , 6.5656 )

Since 0 is not inside the confidence interval, I can conclude that the average term for premie babies is different from the average term for full term babies.

In plain language, we are 95% confident that full term pregnancies last, on average, between 5.4309 weeks and 6.5656 weeks longer than premie pregnancies.
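
The link between "0 is outside the 95% interval" and "reject the null at the 5% level" can be illustrated with the reported numbers. This sketch assumes the same normal approximation used for the interval and takes the difference, standard error, and interval bounds directly from the output above.

```r
# Values copied from the inference() output above
diff_means <- 5.9982
se         <- 0.2895
ci         <- c(5.4309, 6.5656)

# Two-sided z test of H0: difference = 0
z       <- diff_means / se
p_value <- 2 * pnorm(-abs(z))

# Does the interval contain the null value?
zero_in_ci <- ci[1] <= 0 && 0 <= ci[2]

print(c(z = z, p_value = p_value))
print(zero_in_ci)
```

The two checks agree: the interval excludes 0 and the p-value falls well below 0.05.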