load("more/nc.RData")
dim(nc)
## [1] 1000 13
The cases in this dataset correspond to individual births that occurred in North Carolina.
There are 1000 cases in the sample.
habit and weight. What does the plot highlight about the relationship between these two variables?
boxplot(nc$weight ~ nc$habit, horizontal = TRUE, xlab = "Weight", main = "Baby weights by mother's smoking habit")
The plot shows a slight difference in baby weights between the two groups: babies of non-smokers appear to have a slightly higher median weight than babies of smokers.
by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
by(nc$weight, nc$habit, hist)
## nc$habit: nonsmoker
## $breaks
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
##
## $counts
## [1] 15 10 12 28 86 197 296 174 46 7 2
##
## $density
## [1] 0.017182131 0.011454754 0.013745704 0.032073310 0.098510882
## [6] 0.225658648 0.339060710 0.199312715 0.052691867 0.008018328
## [11] 0.002290951
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
##
## $xname
## [1] "dd[x, ]"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
## --------------------------------------------------------
## nc$habit: smoker
## $breaks
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $counts
## [1] 1 2 1 8 19 31 41 17 6
##
## $density
## [1] 0.007936508 0.015873016 0.007936508 0.063492063 0.150793651 0.246031746
## [7] 0.325396825 0.134920635 0.047619048
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
##
## $xname
## [1] "dd[x, ]"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
Each sample is less than 10% of its respective population, so we can assume that the observations are independent. Both groups have more than 30 observations (873 non-smokers and 126 smokers). The weight distribution is skewed in both groups; however, because the samples are large, the normality condition can be relaxed.
Aside from the moderate skewness of the data, the conditions for inference are satisfied.
H0: There is no difference between the average weights of babies born to smoking mothers and non-smoking mothers.
HA: There is a difference between the average weights of babies born to smoking mothers and non-smoking mothers.
type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 3.5.3
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
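The interval above can be reproduced by hand from the reported summary statistics. The following is a minimal sketch, assuming the theoretical method uses a normal critical value (this is consistent with the output but not confirmed by the source); the variable names are illustrative only.

```r
# Summary statistics copied from the inference() output above
n_non <- 873; mean_non <- 7.1443; sd_non <- 1.5187
n_smk <- 126; mean_smk <- 6.8287; sd_smk <- 1.3862

# Standard error of the difference between two independent means
se_diff <- sqrt(sd_non^2 / n_non + sd_smk^2 / n_smk)

# Assumption: normal critical value for a 95% interval
z_star <- qnorm(0.975)

# Point estimate plus/minus the margin of error
diff_means <- mean_non - mean_smk
ci_diff <- diff_means + c(-1, 1) * z_star * se_diff

round(se_diff, 4)  # should be close to the reported 0.1338
round(ci_diff, 4)  # should be close to the reported interval, up to rounding
```

Because the interval lies entirely above 0, it is consistent with babies of non-smokers being heavier on average.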
weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
With 95% confidence, the average length of a pregnancy in North Carolina is between 38.1528 weeks and 38.5165 weeks.
conflevel = 0.90.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical", conflevel = 0.9)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
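Both single-mean intervals follow from the same standard error and differ only in the critical value. This is a rough sketch under the assumption of a normal critical value; `mean_ci` is a hypothetical helper, not part of the lab's `inference` function.

```r
# Summary statistics copied from the inference() output above
x_bar <- 38.3347
s <- 2.9316
n <- 998

# Standard error of the sample mean
se <- s / sqrt(n)

# Hypothetical helper: normal-based interval for the mean at a given level
mean_ci <- function(level) {
  z_star <- qnorm(1 - (1 - level) / 2)
  x_bar + c(-1, 1) * z_star * se
}

round(mean_ci(0.95), 4)  # wider interval
round(mean_ci(0.90), 4)  # narrower interval
```

Lowering the confidence level from 95% to 90% shrinks the critical value, which is why the second interval is narrower.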
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 1.286
## Test statistic: Z = -1.376
## p-value = 0.1686
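The test statistic and p-value can be checked from the summary statistics. This is a sketch that assumes a two-sided Z test, as the output labels it; the variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_mat <- 129; mean_mat <- 28.7907; sd_mat <- 13.4824
n_yng <- 844; mean_yng <- 30.5604; sd_yng <- 14.3469

# Standard error of the difference between two independent means
se_gain <- sqrt(sd_mat^2 / n_mat + sd_yng^2 / n_yng)

# Z statistic for H0: mu_mature - mu_younger = 0
null_value <- 0
z_stat <- ((mean_mat - mean_yng) - null_value) / se_gain

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))

round(c(se = se_gain, z = z_stat, p = p_value), 4)
```

At the conventional 5% significance level (an assumed threshold; the source does not state one), a p-value of about 0.17 means we fail to reject H0: the data do not provide convincing evidence that average weight gain differs between younger and mature mothers.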
To determine the age cutoff between younger and mature mothers, I used a logistic regression to predict maturity status from age (the fage variable). I fit the model with the glm function and then used it to predict, for each age, the probability of being classified as a younger mom. Finally, I plotted the logistic curve of predicted probabilities against age. The inflection point of this curve, where the predicted probability crosses 0.5, marks the cutoff.
For this model, the cutoff point appears to be at about 40 years old.
logisticmodel = glm(mature~fage,data=nc,family=binomial(link=logit))
summary(logisticmodel)
##
## Call:
## glm(formula = mature ~ fage, family = binomial(link = logit),
## data = nc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.81115 0.09766 0.22658 0.45018 2.17974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.27088 0.90053 12.52 <2e-16 ***
## fage -0.28227 0.02498 -11.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.65 on 828 degrees of freedom
## Residual deviance: 464.17 on 827 degrees of freedom
## (171 observations deleted due to missingness)
## AIC: 468.17
##
## Number of Fisher Scoring iterations: 6
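The inflection point of a fitted logistic curve is the age at which the linear predictor equals zero, so it can be computed directly from the coefficients in the summary above instead of being read off a plot. A minimal sketch, with illustrative variable names; note that in the nc data fage records the father's age (mage is the mother's), so this boundary is on the fage scale.

```r
# Coefficients copied from the summary(logisticmodel) output above
b0 <- 11.27088   # intercept
b1 <- -0.28227   # slope on fage

# The fitted probability is 0.5 where b0 + b1 * age = 0
cutoff_age <- -b0 / b1

# Check: plogis() is the inverse logit, so this should equal 0.5
p_at_cutoff <- plogis(b0 + b1 * cutoff_age)

round(cutoff_age, 1)
```

The result is just under 40, in line with the cutoff read from the curve.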
nc$maturebinary = nc$mature == "younger mom"
FemaleAgeVector = na.omit(data.frame(fage = nc$fage, maturebinary = nc$maturebinary))
newdata = data.frame(fage = seq(min(FemaleAgeVector$fage), max(FemaleAgeVector$fage), len = length(FemaleAgeVector$fage)))
newdata$predicted = predict(logisticmodel, newdata = newdata, type = "response")
plot(maturebinary~fage, data = FemaleAgeVector, col = "red4")
lines(predicted~fage, data = newdata, col = "green4", lwd = 2)
inference function, report the statistical results, and also provide an explanation in plain language.
Is there a difference between the average pregnancy length of premature babies and that of full-term babies?
inference(y = nc$weeks, x = nc$premie, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical", conflevel = 0.95)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 39.2482, sd_full term = 1.5674
## n_premie = 152, mean_premie = 33.25, sd_premie = 3.5064
## Observed difference between means (full term-premie) = 5.9982
##
## Standard error = 0.2895
## 95 % Confidence interval = ( 5.4309 , 6.5656 )
Since 0 is not inside the confidence interval, I can conclude that the average term for premie babies is different from the average term for full-term babies.
In plain language, we are 95% confident that full-term pregnancies last, on average, between 5.4309 and 6.5656 weeks longer than premie pregnancies.