Lab 5 - Inference for numerical data

Load packages and the NC dataset:

library(tidyverse)  # provides glimpse(), ggplot(), and the %>% pipe used below

load(url("https://raw.githubusercontent.com/GarciaRios/govt_3990/gh-pages/Labs/lab5/nc.RData"))

Load inference function:

load(url("https://raw.githubusercontent.com/GarciaRios/govt_3990/gh-pages/Labs/lab5/inference.RData"))

Exercises:

Exercise 1:

What are the cases in this data set? How many cases are there in our sample?

glimpse(nc)

## Observations: 1,000
## Variables: 13
## $ fage           <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, ...
## $ mage           <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, ...
## $ mature         <fct> younger mom, younger mom, younger mom, younger mom, ...
## $ weeks          <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, ...
## $ premie         <fct> full term, full term, full term, full term, full ter...
## $ visits         <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15,...
## $ marital        <fct> married, married, married, married, married, married...
## $ gained         <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, ...
## $ weight         <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81...
## $ lowbirthweight <fct> not low, not low, not low, not low, not low, low, no...
## $ gender         <fct> male, male, female, male, female, male, male, male, ...
## $ habit          <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoke...
## $ whitemom       <fct> not white, not white, white, white, not white, not w...

There are 1,000 observations, so there are 1,000 cases.

Categorical variables here are mature, premie, marital, lowbirthweight, gender, habit, and whitemom.

Numerical variables are fage, mage, weeks, visits, gained, and weight.

I suspected there were some outliers in weight. The histogram below shows some extremely low weights, although they might not be classified as outliers.

hist(nc$weight)
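
One way to decide whether the low weights count as outliers is the 1.5 * IQR rule that boxplots use. The sketch below is only an illustration: iqr_outliers is a helper name we made up, and toy_weights is a hypothetical vector, not values from nc.

```r
# Flag values outside the 1.5 * IQR fences (the usual boxplot rule).
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]                                # drop missing values first
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # first and third quartiles
  fence <- 1.5 * (q[2] - q[1])                     # 1.5 times the IQR
  x[x < q[1] - fence | x > q[2] + fence]
}

# Hypothetical birth weights in pounds, for illustration only (not from nc).
toy_weights <- c(1.5, 6.4, 6.9, 7.1, 7.3, 7.5, 7.8, 8.2, 8.6)
flagged <- iqr_outliers(toy_weights)
print(flagged)
```

With the lab data loaded, the same check would be iqr_outliers(nc$weight).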

Exercise 2:

Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

ggplot(nc, aes(y = weight, x = habit)) + 
  geom_boxplot()

The plot shows that the median birth weight is lower for babies of smoking mothers than for babies of nonsmoking mothers. This suggests an association between “habit” and “weight”: smoking is associated with lower birth weight. Because the data are observational, the plot alone cannot show that smoking causes the lower weight.

nc %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight), n())

## Warning: Factor `habit` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 3 x 3
##   habit     mean_weight `n()`
##   <fct>           <dbl> <int>
## 1 nonsmoker        7.14   873
## 2 smoker           6.83   126
## 3 <NA>             3.63     1

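The group means tell the same story: babies of nonsmokers are heavier on average (7.14 pounds) than babies of smokers (6.83 pounds). One case has a missing habit value and shows up as its own NA group.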

Exercise 3:

Are all conditions necessary for inference satisfied? Comment on each.

The conditions we must check are sample size, normality, and independence.

nc %>%
  count(habit)

## Warning: Factor `habit` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 3 x 2
##   habit         n
##   <fct>     <int>
## 1 nonsmoker   873
## 2 smoker      126
## 3 <NA>          1

The nonsmoker group has 873 cases and the smoker group has 126. Both are larger than 30, so the sample size condition is met. Each group is also less than 10% of its population, so, provided the cases were sampled randomly, the independence condition is met as well.

ggplot(nc, aes(y = weight, x = habit)) + 
  geom_boxplot()

The above boxplot shows that both are near-normal and aren’t extremely skewed.

Exercise 4:

Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0: There is no difference between the mean weights of children born from smoking mothers and non-smoking mothers.

HA: There is a difference between the mean weights of children born from smoking mothers and non-smoking mothers.

Exercise 5:

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, 
    alternative = "twosided", method = "theoretical")

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## H0: mu_nonsmoker =  mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.359, df = 125
## p_value = 0.0199
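
As a rough check, the test statistic above can be reproduced by hand from the printed summary statistics. This sketch assumes that inference() uses the unpooled standard error and the conservative degrees of freedom min(n1, n2) - 1, which is consistent with the reported df = 125. The variable names are ours.

```r
# Summary statistics copied from the inference() output above.
n_ns <- 873; ybar_ns <- 7.1443; s_ns <- 1.5187   # nonsmoker
n_s  <- 126; ybar_s  <- 6.8287; s_s  <- 1.3862   # smoker

# Unpooled standard error of the difference in sample means.
se <- sqrt(s_ns^2 / n_ns + s_s^2 / n_s)

# Test statistic for H0: mu_nonsmoker - mu_smoker = 0.
t_stat <- (ybar_ns - ybar_s - 0) / se

# Conservative degrees of freedom (assumption: smaller n minus 1).
df <- min(n_ns, n_s) - 1

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, df = df, p_value = p_value), 4))
```

Because the inputs are rounded to four decimals, the result may differ from the printed output in the last digit.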

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci", null = 0, 
    alternative = "twosided", method = "theoretical",
    order = c("smoker","nonsmoker"))

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## 95% CI (smoker - nonsmoker): (-0.5803 , -0.0508)

Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to nonsmoking and smoking mothers, and interpret this interval in context of the data.

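The interval is computed as smoker minus nonsmoker, so both endpoints are negative. We are 95% confident that babies born to mothers who smoke weigh, on average, between 0.0508 and 0.5803 pounds less than babies born to mothers who do not smoke. Because the interval does not contain 0, it agrees with the hypothesis test above.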
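
The interval for the difference can be sketched the same way: point estimate plus or minus a critical t value times the standard error. As before, the unpooled standard error and df = 125 are assumptions about what inference() does internally, chosen because they match the printed output.

```r
# Summary statistics copied from the inference() output above.
n_s  <- 126; ybar_s  <- 6.8287; s_s  <- 1.3862   # smoker
n_ns <- 873; ybar_ns <- 7.1443; s_ns <- 1.5187   # nonsmoker

# Point estimate in the same order as the output: smoker - nonsmoker.
point_est <- ybar_s - ybar_ns

# Standard error and conservative degrees of freedom (assumptions).
se <- sqrt(s_s^2 / n_s + s_ns^2 / n_ns)
df <- min(n_s, n_ns) - 1

# Critical value for 95% confidence, then the margin of error.
t_star <- qt(0.975, df)
margin <- t_star * se

ci <- c(lower = point_est - margin, upper = point_est + margin)
print(round(ci, 4))
```

Swapping the order to nonsmoker - smoker would flip the signs of both endpoints but not the conclusion.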

On your own:

1:

inference(y = weeks, data = nc, statistic = "mean", type = "ci", null = 0, 
    alternative = "twosided", method = "theoretical")

## Single numerical variable
## n = 998, y-bar = 38.3347, s = 2.9316
## 95% CI: (38.1526 , 38.5168)

Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.

We are 95% confident that the true population mean length of pregnancies falls between 38.1526 and 38.5168 weeks.

2:

inference(y = weeks, data = nc, statistic = "mean", type = "ci", conf_level = 0.90, null = 0, 
    alternative = "twosided", method = "theoretical")

## Single numerical variable
## n = 998, y-bar = 38.3347, s = 2.9316
## 90% CI: (38.1819 , 38.4874)

Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conf_level = 0.90. Comment on the width of this interval versus the one obtained in the previous exercise.

This 90% interval is narrower than the 95% interval: its width is about 0.31 weeks (38.4874 - 38.1819), compared with about 0.36 weeks (38.5168 - 38.1526). A lower confidence level uses a smaller critical value, which shrinks the margin of error. We are 90% confident that the true population mean length of pregnancies falls between 38.1819 and 38.4874 weeks.
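
Both intervals for pregnancy length can be reproduced from the printed n, y-bar, and s, which makes the width comparison explicit. This is a sketch, not the code behind inference(): mean_ci is a helper name of our own, and it assumes a t interval with n - 1 degrees of freedom.

```r
# t-based confidence interval for a single mean from summary statistics.
mean_ci <- function(n, ybar, s, conf_level = 0.95) {
  se <- s / sqrt(n)                                    # standard error of the mean
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # critical value
  c(lower = ybar - t_star * se, upper = ybar + t_star * se)
}

# Values copied from the inference() output for weeks.
ci_95 <- mean_ci(998, 38.3347, 2.9316, conf_level = 0.95)
ci_90 <- mean_ci(998, 38.3347, 2.9316, conf_level = 0.90)

# Width of each interval (upper minus lower).
width_95 <- unname(diff(ci_95))
width_90 <- unname(diff(ci_90))

print(round(rbind(ci_95, ci_90), 4))
print(round(c(width_95 = width_95, width_90 = width_90), 4))
```

Only the critical value changes between the two calls, so the 90% interval is always the narrower one for the same data.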

3:

Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

inference(y = gained, x = mature, data = nc, statistic = "mean", type = "ht", null = 0, 
    alternative = "twosided", method = "theoretical")

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_mature mom = 129, y_bar_mature mom = 28.7907, s_mature mom = 13.4824
## n_younger mom = 844, y_bar_younger mom = 30.5604, s_younger mom = 14.3469
## H0: mu_mature mom =  mu_younger mom
## HA: mu_mature mom != mu_younger mom
## t = -1.3765, df = 128
## p_value = 0.1711

Our p-value is 0.1711, which is larger than our alpha value of 0.05. We cannot conclude that a significant difference exists between the average weight gained by younger mothers and the average weight gained by mature mothers. We fail to reject the null hypothesis.
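
The decision rule can be written out in a few lines: turn the reported t and df into a two-sided p-value and compare it with alpha. The values of t and df are copied from the output above; alpha = 0.05 is the level we chose.

```r
# Reported test statistic and degrees of freedom from the output above.
t_stat <- -1.3765
df <- 128
alpha <- 0.05

# Two-sided p-value: probability of a t at least this extreme in either tail.
p_value <- 2 * pt(-abs(t_stat), df)

# Reject H0 only when the p-value falls below alpha.
reject_h0 <- p_value < alpha

print(round(p_value, 4))
print(reject_h0)
```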

4:

Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

nc %>%
  group_by(mature) %>%
  summarise(max_age = max(mage), min_age = min(mage))

## # A tibble: 2 x 3
##   mature      max_age min_age
##   <fct>         <int>   <int>
## 1 mature mom       50      35
## 2 younger mom      34      13

The oldest that younger moms can be is 34 years old, and the youngest that mature moms can be is 35 years old.

The pipe operator passes the “nc” dataset into group_by(), which splits the rows into one group for each level of “mature”. summarise() then computes one row per group: “max_age” and “min_age” are the results of max() and min() applied to the mage column, which is the mother’s age. Since the two age ranges do not overlap, the cutoff falls between 34 and 35.
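
To show what the grouping step does, here is the same split-then-summarise idea in base R on a small made-up data frame. The toy values are hypothetical and are not taken from nc.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  mature = c("younger mom", "younger mom", "mature mom", "mature mom", "younger mom"),
  mage   = c(22, 34, 35, 41, 19)
)

# split() makes one vector of ages per group; range() returns min and max.
age_range <- sapply(split(toy$mage, toy$mature), range)
rownames(age_range) <- c("min_age", "max_age")

print(age_range)

# The cutoff sits between the oldest younger mom and the youngest mature mom.
cutoff_low  <- age_range["max_age", "younger mom"]
cutoff_high <- age_range["min_age", "mature mom"]
```

In the toy data the groups meet at 34 and 35, mirroring the pattern in the real output.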

5:

Pick a pair of variables: one numerical (response) and one categorical (explanatory). Come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

Is there a difference in the mean number of hospital visits (visits) between white and non-white mothers (whitemom)?

H0: There is no difference between the mean number of hospital visits of white mothers and non-white mothers.

HA: There is a difference between the mean number of hospital visits of white mothers and non-white mothers.

inference(y = visits, x = whitemom, data = nc, statistic = "mean", type = "ht", null = 0, 
    alternative = "twosided", method = "theoretical")

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_not white = 279, y_bar_not white = 11.6272, s_not white = 4.3644
## n_white = 710, y_bar_white = 12.3014, s_white = 3.7701
## H0: mu_not white =  mu_white
## HA: mu_not white != mu_white
## t = -2.2689, df = 278
## p_value = 0.024

Report the statistical results:

n_not white = 279, y_bar_not white = 11.6272
n_white = 710, y_bar_white = 12.3014

The mean number of hospital visits for non-white mothers was 11.6272, and the mean number of hospital visits for white mothers was 12.3014.

Both sample sizes are larger than 30 and less than 10% of the overall population, so we can assume that the sample size condition is met and, provided the cases were sampled randomly, that the independence condition is met as well. The sample distribution graphs above also show that both distributions were approximately normal with no extreme skew.

We tested at the default alpha level of 0.05. Because our p-value of 0.024 is smaller than 0.05, we reject the null hypothesis in favor of the alternative hypothesis. There is a statistically significant difference between the mean number of hospital visits of white mothers and non-white mothers, with white mothers making about 0.67 more visits on average in this sample.