In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
names(nc)
## [1] "fage" "mage" "mature" "weeks"
## [5] "premie" "visits" "marital" "gained"
## [9] "weight" "lowbirthweight" "gender" "habit"
## [13] "whitemom"
dim(nc)
## [1] 1000 13
# The cases in this data set are births (babies and their parents), each described by 13 variables. There are 1,000 cases, or observations.
summary(nc)
##       fage            mage           mature         weeks             premie
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2
##  Mean   :30.26   Mean   :27                     Mean   :38.33
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00
##  Max.   :55.00   Max.   :50                     Max.   :45.00
##  NA's   :171                                    NA's   :2
##      visits            marital        gained          weight
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750
##  NA's   :9                        NA's   :27
##  lowbirthweight    gender          habit          whitemom
##  low    :111     female:503   nonsmoker:873   not white:284
##  not low:889     male  :497   smoker   :126   white    :714
##                               NA's     :  1   NA's     :  2
##
##
##
##
boxplot(nc$weight~nc$habit)
# This side-by-side boxplot shows that babies of nonsmoking mothers have a higher median birthweight. However, the IQRs of the two boxplots overlap substantially, meaning the middle 50% of birthweights is very similar for nonsmokers and smokers. Both distributions are skewed to the left, and the lower whiskers of the two boxplots end at about the same value (roughly 4 pounds). The nonsmoker distribution is the more strongly skewed, with a large number of outliers below the lower fence. The IQR of the nonsmoker boxplot is smaller than that of the smoker boxplot, but its whiskers are farther apart, which suggests that birthweights outside the middle 50% are more spread out for nonsmoking mothers. The upper whisker for nonsmokers is higher and is followed by some high outliers, so the heaviest babies of nonsmoking mothers weigh more than the heaviest babies of smoking mothers.
# In short, the plot shows a higher median and higher upper birthweights for babies of nonsmoking mothers, but a similar middle 50% for the two groups. It also shows more very low birthweights among babies of nonsmoking mothers. Part of this may simply reflect the much larger nonsmoker group (873 versus 126 observations), and part may be due to other variables that affect birthweight.
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------
## nc$habit: smoker
## [1] 6.82873
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------
## nc$habit: smoker
## [1] 126
# Both sample sizes are well above 30, so the sampling distribution of the difference in sample means is approximately normal. With samples this large, the normal distribution can be used in place of the t-distribution.
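One way to see why the normal model is a reasonable stand-in at these sample sizes is to compare the multipliers each distribution would give for a 95% interval. The sketch below assumes the conservative degrees of freedom min(n1, n2) - 1 found in many introductory texts; that choice is an assumption for illustration, not something the `inference` function is known to use.

```r
# Group sizes taken from the by(..., length) output above
n_nonsmoker <- 873
n_smoker    <- 126

# Assumed conservative degrees of freedom: smaller sample size minus one
df_conservative <- min(n_nonsmoker, n_smoker) - 1

# 97.5th percentiles, i.e. the multipliers for a 95% two-sided interval
z_star <- qnorm(0.975)
t_star <- qt(0.975, df = df_conservative)

# Relative difference between the t and normal multipliers
rel_diff <- (t_star - z_star) / z_star
print(c(z_star = z_star, t_star = t_star, rel_diff = rel_diff))
```

The t multiplier is always slightly larger than the normal one, but with this many observations the gap is small enough that the two models lead to nearly the same interval.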
Null Hypothesis: The difference in the population average birthweight of babies from nonsmoking mothers and that of smoking mothers is equal to 0.
Alternative Hypothesis: The difference in the population average birthweight of babies from nonsmoking mothers and that of smoking mothers is not equal to 0.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
# The p-value of 0.0184 is statistically significant at the α = 0.10 and α = 0.05 levels (though not at α = 0.01). Thus, we reject the null hypothesis that there is no difference between the population mean birthweight of babies born to smoking mothers and that of babies born to nonsmoking mothers, in favor of the alternative hypothesis that the two population means differ.
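The test statistic and p-value can be approximately reproduced by hand from the summary statistics in the output. This is a minimal sketch using the standard formula for the standard error of a difference between two independent means; the variable names are illustrative, and because the inputs are rounded, the results should only be expected to land close to the reported values.

```r
# Summary statistics copied from the inference() output above
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker

# Standard error of the difference between two independent sample means
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Test statistic under H0: mu_nonsmoker - mu_smoker = 0
z <- (m1 - m2 - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```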
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
# The 95% confidence interval of (0.0534, 0.5777) agrees with the result of the hypothesis test, since 0 is not included in the interval. We are 95% confident that the population average birthweight of babies born to nonsmoking mothers is between 0.0534 and 0.5777 pounds larger than the population average birthweight of babies born to smoking mothers.
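The interval is the observed difference plus or minus a normal multiplier times the standard error. The sketch below rebuilds it from the rounded values in the output and checks whether 0 falls inside; the variable names are illustrative, and small rounding differences from the reported bounds are expected.

```r
# Values copied from the inference() output above
diff_means <- 0.3155   # nonsmoker - smoker
se         <- 0.1338

# Multiplier for a 95% two-sided interval under the normal model
z_star <- qnorm(1 - 0.05 / 2)

# Point estimate plus or minus the margin of error
ci <- diff_means + c(-1, 1) * z_star * se

# A 95% interval that excludes 0 matches a two-sided test rejecting at alpha = 0.05
contains_zero <- ci[1] <= 0 && 0 <= ci[2]

print(round(ci, 4))
print(contains_zero)
```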
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
# The 95 percent confidence interval for mean pregnancy length is (38.1528 , 38.5165).
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = 0.90)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
# The 90% confidence interval for mean pregnancy length is (38.182 , 38.4873).
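The two intervals differ only in the multiplier applied to the standard error. A minimal sketch, using a hypothetical helper `mean_ci` (not part of the lab's `inference` function) and the rounded summary statistics from the output:

```r
# Hypothetical helper: normal-based confidence interval for a single mean
mean_ci <- function(xbar, s, n, conf_level = 0.95) {
  se <- s / sqrt(n)                            # standard error of the mean
  z_star <- qnorm(1 - (1 - conf_level) / 2)    # two-sided normal multiplier
  c(lower = xbar - z_star * se, upper = xbar + z_star * se)
}

# Summary statistics for weeks, copied from the output above
ci_95 <- mean_ci(xbar = 38.3347, s = 2.9316, n = 998, conf_level = 0.95)
ci_90 <- mean_ci(xbar = 38.3347, s = 2.9316, n = 998, conf_level = 0.90)

print(round(ci_95, 4))
print(round(ci_90, 4))
```

Lowering the confidence level shrinks the multiplier, so the 90% interval is narrower than the 95% interval and sits entirely inside it.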
Null Hypothesis: The population average weight gain of younger mothers is the same as the population average weight gain of mature mothers.
Alternative Hypothesis: The population average weight gain of younger mothers is different from the population average weight gain of mature mothers.
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 1.286
## Test statistic: Z = -1.376
## p-value = 0.1686
# With a p-value of 0.1686, which exceeds even the most lenient conventional significance level (α = 0.10), we fail to reject the null hypothesis that the difference between the population average weight gain of mature mothers and that of younger mothers is 0.
boxplot(nc$mage~nc$mature)
by(nc$mage, nc$mature, mean)
## nc$mature: mature mom
## [1] 37.18045
## ------------------------------------------------------------
## nc$mature: younger mom
## [1] 25.43829
# A side-by-side boxplot of the numerical variable "mage" (mother's age), split by the categorical variable "mature", reveals the age cutoff between the two categories. The boxplot suggests that younger moms are under 35 and that mature moms are 35 years old or older.
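The cutoff can also be read off numerically by taking the minimum and maximum age within each group. The sketch below uses a small toy data frame whose ages are made up for illustration and are not from the nc sample; on the real data, the equivalent call would be `by(nc$mage, nc$mature, range)`.

```r
# Toy data for illustration only; these ages are NOT from the nc sample
toy <- data.frame(mage = c(19, 24, 34, 35, 36, 41, 28, 50))
toy$mature <- factor(ifelse(toy$mage >= 35, "mature mom", "younger mom"))

# Minimum and maximum age within each group, one column per group
age_ranges <- sapply(split(toy$mage, toy$mature), range)
rownames(age_ranges) <- c("min", "max")

print(age_ranges)
```

The cutoff lies between the largest age among younger moms and the smallest age among mature moms.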
The research question is whether race is associated with hospital visits.
Null Hypothesis: The population average number of hospital visits for white mothers is the same as the population average number of hospital visits for non-white mothers.
Alternative Hypothesis: The population average number of hospital visits for white mothers is not the same as the population average number of hospital visits for non-white mothers.
inference(y = nc$visits, x = nc$whitemom, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_not white = 279, mean_not white = 11.6272, sd_not white = 4.3644
## n_white = 710, mean_white = 12.3014, sd_white = 3.7701
## Observed difference between means (not white-white) = -0.6742
##
## H0: mu_not white - mu_white = 0
## HA: mu_not white - mu_white != 0
## Standard error = 0.297
## Test statistic: Z = -2.269
## p-value = 0.0232
# With a p-value of 0.0232, which is below α = 0.05, we reject the null hypothesis that the population average number of hospital visits is the same for white and non-white mothers, in favor of the alternative hypothesis that the two population averages differ.
inference(y = nc$visits, x = nc$whitemom, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_not white = 279, mean_not white = 11.6272, sd_not white = 4.3644
## n_white = 710, mean_white = 12.3014, sd_white = 3.7701
## Observed difference between means (not white-white) = -0.6742
##
## Standard error = 0.2971
## 95 % Confidence interval = ( -1.2565 , -0.0918 )
# The 95% confidence interval for the difference in the population mean number of hospital visits (non-white mothers minus white mothers) is (-1.2565, -0.0918). We are 95% confident that non-white mothers make, on average, between 0.0918 and 1.2565 fewer hospital visits than white mothers.
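The sign of the interval depends on the order of subtraction, which here is not white minus white. A small sketch of how the same interval reads in the opposite direction (the variable names are illustrative):

```r
# Interval for (not white - white), copied from the output above
ci_notwhite_minus_white <- c(-1.2565, -0.0918)

# Reversing the subtraction negates both bounds and swaps their order
ci_white_minus_notwhite <- -rev(ci_notwhite_minus_white)

print(ci_white_minus_notwhite)
```

Both versions carry the same information: the interval lies entirely on one side of 0, with white mothers having the higher average number of visits.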