Load NC dataset:
Load inference function:
## Observations: 1,000
## Variables: 13
## $ fage <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, ...
## $ mage <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, ...
## $ mature <fct> younger mom, younger mom, younger mom, younger mom, ...
## $ weeks <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, ...
## $ premie <fct> full term, full term, full term, full term, full ter...
## $ visits <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15,...
## $ marital <fct> married, married, married, married, married, married...
## $ gained <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, ...
## $ weight <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81...
## $ lowbirthweight <fct> not low, not low, not low, not low, not low, low, no...
## $ gender <fct> male, male, female, male, female, male, male, male, ...
## $ habit <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoke...
## $ whitemom <fct> not white, not white, white, white, not white, not w...
There are 1,000 observations, so there are 1,000 cases.
Categorical variables here are mature, premie, marital, lowbirthweight, gender, habit, and whitemom.
Numerical variables are fage, mage, weeks, visits, gained, and weight.
I suspected there were some outliers in weight, and the histogram below shows that there are some extremely low weights, but they might not be classified as outliers.
The plot shows that the median birth weight is lower for babies of smoking mothers than for babies of nonsmoking mothers. This suggests an association between “habit” and “weight”, with smoking linked to a lower birth weight. Because the data are observational, the plot alone cannot show that smoking causes the difference.
## Warning: Factor `habit` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 3
## habit mean_weight `n()`
## <fct> <dbl> <int>
## 1 nonsmoker 7.14 873
## 2 smoker 6.83 126
## 3 <NA> 3.63 1
The means tell the same story: babies of nonsmokers have a higher mean birth weight (7.14 pounds) than babies of smokers (6.83 pounds).
The conditions we must check for are sample size, normality, and independence.
## Warning: Factor `habit` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
## habit n
## <fct> <int>
## 1 nonsmoker 873
## 2 smoker 126
## 3 <NA> 1
The nonsmoker sample has 873 cases and the smoker sample has 126. Both are larger than 30, so the sample size condition is met. Both are also less than 10% of the overall population, so, assuming the cases were sampled randomly, the independence condition is met as well.
The above boxplot shows that both are near-normal and aren’t extremely skewed.
H0: There is no difference between the mean weights of children born from smoking mothers and non-smoking mothers.
HA: There is a difference between the mean weights of children born from smoking mothers and non-smoking mothers.
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Warning: package 'gridExtra' was built under R version 3.6.3
## Warning: package 'broom' was built under R version 3.6.3
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## H0: mu_nonsmoker = mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.359, df = 125
## p_value = 0.0199
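The reported test statistic can be reproduced by hand from the summary statistics in the output. The following is a minimal sketch in base R, assuming the standard error uses the unpooled sample variances and, as the reported df = 125 suggests, that the degrees of freedom are the smaller sample size minus one. The variable names are illustrative and not part of the lab code.

```r
# Summary statistics copied from the inference() output above
n_non <- 873; ybar_non <- 7.1443; s_non <- 1.5187
n_smk <- 126; ybar_smk <- 6.8287; s_smk <- 1.3862

# Standard error of the difference in means (unpooled variances)
se <- sqrt(s_non^2 / n_non + s_smk^2 / n_smk)

# Test statistic for H0: mu_nonsmoker - mu_smoker = 0
t_stat <- (ybar_non - ybar_smk) / se

# Assumed degrees of freedom: smaller sample size minus one
df <- min(n_non, n_smk) - 1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p_value = p_value), 4)
```

Because the inputs are rounded to four decimals, the result may differ from the reported values in the last digit.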
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical",
order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## 95% CI (smoker - nonsmoker): (-0.5803 , -0.0508)
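According to this confidence interval, we are 95% confident that the difference in population mean birth weight (smoker minus nonsmoker) is between -0.5803 and -0.0508 pounds. In other words, babies born to mothers who smoke weigh, on average, between 0.0508 and 0.5803 pounds less than babies born to mothers who do not. The interval does not contain 0, which agrees with the result of the hypothesis test.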
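The interval for the difference in means has the form point estimate ± t* × SE. As a rough check, it can be rebuilt in base R from the summary statistics, assuming the same unpooled standard error and df = min(n1, n2) - 1 as in the hypothesis test. The variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_smk <- 126; ybar_smk <- 6.8287; s_smk <- 1.3862
n_non <- 873; ybar_non <- 7.1443; s_non <- 1.5187

# Point estimate in the order used by the output: smoker - nonsmoker
point_est <- ybar_smk - ybar_non

# Standard error of the difference (unpooled variances)
se <- sqrt(s_smk^2 / n_smk + s_non^2 / n_non)

# Critical value for a 95% interval, assuming df = smaller n minus one
df <- min(n_smk, n_non) - 1
t_star <- qt(0.975, df)

# Lower and upper bounds of the interval
ci <- point_est + c(-1, 1) * t_star * se
round(ci, 3)
```

Both bounds are negative, which is what places the smoker mean below the nonsmoker mean.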
inference(y = weeks, data = nc, statistic = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
## Single numerical variable
## n = 998, y-bar = 38.3347, s = 2.9316
## 95% CI: (38.1526 , 38.5168)
We are 95% confident that the true population mean length of pregnancies falls between 38.1526 and 38.5168 weeks.
inference(y = weeks, data = nc, statistic = "mean", type = "ci", conf_level = 0.90, null = 0,
alternative = "twosided", method = "theoretical")
## Single numerical variable
## n = 998, y-bar = 38.3347, s = 2.9316
## 90% CI: (38.1819 , 38.4874)
This confidence interval is visually indistinguishable from the 95% confidence interval, even though the values have changed: lowering the confidence level makes the interval slightly narrower. We are 90% confident that the true population mean length of pregnancies falls between 38.1819 and 38.4874 weeks.
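To see why the 90% interval is narrower, both intervals can be recomputed from the reported n, y-bar, and s. This sketch assumes the interval is based on the t distribution with n - 1 degrees of freedom; `ci_mean` is a hypothetical helper and not part of the lab's inference function.

```r
# Summary statistics copied from the inference() output above
n <- 998; ybar <- 38.3347; s <- 2.9316

# Standard error of the sample mean
se <- s / sqrt(n)

# Hypothetical helper: t-based interval for a given confidence level
ci_mean <- function(conf_level) {
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  ybar + c(-1, 1) * t_star * se
}

ci_95 <- ci_mean(0.95)
ci_90 <- ci_mean(0.90)

# Compare the bounds and the widths of the two intervals
round(rbind(ci_95, ci_90), 3)
c(width_95 = diff(ci_95), width_90 = diff(ci_90))
```

Only the critical value changes between the two calls, so the 90% interval is the same center with a smaller margin of error.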
inference(y = gained, x = mature, data = nc, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_mature mom = 129, y_bar_mature mom = 28.7907, s_mature mom = 13.4824
## n_younger mom = 844, y_bar_younger mom = 30.5604, s_younger mom = 14.3469
## H0: mu_mature mom = mu_younger mom
## HA: mu_mature mom != mu_younger mom
## t = -1.3765, df = 128
## p_value = 0.1711
Our p-value is 0.1711, which is larger than our alpha value of 0.05. We cannot conclude that a significant difference exists between the average weight gained by younger mothers and the average weight gained by mature mothers. We fail to reject the null hypothesis.
## # A tibble: 2 x 3
## mature max_age min_age
## <fct> <int> <int>
## 1 mature mom 50 35
## 2 younger mom 34 13
In this sample, the oldest younger mom is 34 years old and the youngest mature mom is 35 years old, so the cutoff for a mature mom is age 35.
The pipe operator passes the “nc” dataset into the code. The group_by line makes the summarise line run separately for each level of “mature”. Within summarise, we define “max_age” and “min_age” with the max and min functions applied to the mage column, which is the mother’s age.
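The pipeline described here can be sketched as follows. This is an illustration, not the original chunk: it assumes the dplyr package is installed, and `toy` is a hypothetical four-row stand-in for nc with only the two columns the summary needs.

```r
library(dplyr)

# Hypothetical stand-in for nc: only the columns used by the summary
toy <- data.frame(
  mature = factor(c("younger mom", "younger mom", "mature mom", "mature mom")),
  mage   = c(13L, 34L, 35L, 50L)
)

# Group by maturity status, then take the oldest and youngest mother per group
age_range <- toy %>%
  group_by(mature) %>%
  summarise(max_age = max(mage), min_age = min(mage))

age_range
```

Replacing `toy` with `nc` should give the two-row table shown in the output, provided mage has no missing values.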
Is there any difference in the mean number of hospital visits (visits) based upon whether the mother was white or not (whitemom)?
H0: There is no difference between the mean number of hospital visits of white mothers and non-white mothers.
HA: There is a difference between the mean number of hospital visits of white mothers and non-white mothers.
inference(y = visits, x = whitemom, data = nc, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_not white = 279, y_bar_not white = 11.6272, s_not white = 4.3644
## n_white = 710, y_bar_white = 12.3014, s_white = 3.7701
## H0: mu_not white = mu_white
## HA: mu_not white != mu_white
## t = -2.2689, df = 278
## p_value = 0.024
Report the statistical results: n_not white = 279, y_bar_not white = 11.6272; n_white = 710, y_bar_white = 12.3014; t = -2.2689, df = 278, p_value = 0.024.
The mean number of hospital visits was 11.6272 for non-white mothers and 12.3014 for white mothers.
The sample sizes for both groups are larger than 30 and less than 10% of the overall population, so we can assume that the sample size condition and independence condition are met. The sample distribution graphs above also show that both distributions were approximately normal with no extreme skew.
The default alpha level that we tested with was 0.05. Because our p-value of 0.024 is smaller than 0.05, we can reject the null hypothesis in favor of the alternative hypothesis. There is a statistically significant difference between the mean number of hospital visits of white mothers and non-white mothers.
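The decision rule used in the last two tests (compare the two-sided p-value with alpha) can be checked directly from the reported t statistics and degrees of freedom. This is a small base R sketch; `two_sided_p` is a hypothetical helper name.

```r
alpha <- 0.05

# Hypothetical helper: two-sided p-value for a t statistic
two_sided_p <- function(t_stat, df) 2 * pt(-abs(t_stat), df)

# t and df copied from the inference() output for each test
p_gained <- two_sided_p(-1.3765, 128)  # gained by mature
p_visits <- two_sided_p(-2.2689, 278)  # visits by whitemom

# Reject H0 only when the p-value falls below alpha
data.frame(
  test      = c("gained ~ mature", "visits ~ whitemom"),
  p_value   = round(c(p_gained, p_visits), 4),
  reject_H0 = c(p_gained, p_visits) < alpha
)
```

The same alpha leads to opposite decisions for the two tests: the weight gain comparison fails to reject H0, while the hospital visits comparison rejects it.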