Inference for Numerical Data

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

names(nc)

##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"

dim(nc)

## [1] 1000   13

There are 1,000 cases (births recorded in North Carolina) in the sample dataset, with 13 variables.

Exercise 2

Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

nc %>% ggplot(aes(x = habit, y = weight)) + geom_boxplot()

The boxplots show that babies born to nonsmoking mothers have a slightly higher median weight than babies born to smoking mothers (a boxplot marks the median, not the mean). The nonsmoker group also shows more variability in birth weight and more outliers.

Exercise 3

Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions.

by(nc$weight, nc$habit, length)

## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126

nc %>%  ggplot(aes(x = weight)) + geom_density() + facet_grid(~ habit)

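The 1,000 births in the sample are almost certainly fewer than 10% of all births in North Carolina, so, assuming the sample was drawn at random, the observations can be treated as independent within and between the two groups. Both groups are large (873 nonsmokers and 126 smokers), so the slight left skew in the density plots and the outliers visible in the boxplot are not a serious concern: the sampling distribution of each group mean should be nearly normal.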
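One caveat on the size check: length() counts every entry, including missing values. A minimal sketch of a count that skips NAs is shown below. The helper name group_sizes and the toy data frame are hypothetical and made up purely for illustration; they are not part of the lab.

```r
# Hypothetical helper: count non-missing responses per group.
# Unlike length(), this ignores entries where the response is NA.
group_sizes <- function(response, group) {
  tapply(response, group, function(v) sum(!is.na(v)))
}

# Toy data, invented for illustration only (not taken from nc)
toy <- data.frame(
  weight = c(7.1, 6.8, NA, 7.5, 6.2),
  habit  = factor(c("nonsmoker", "smoker", "nonsmoker", "nonsmoker", "smoker"))
)
sizes <- group_sizes(toy$weight, toy$habit)
print(sizes)

# With the lab data loaded, the same call would be:
# group_sizes(nc$weight, nc$habit)
```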

Exercise 4

Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0: mu(weight of babies born to non-smoking mothers) = mu(weight of babies born to smoking mothers)

Ha: mu(weight of babies born to non-smoking mothers) != mu(weight of babies born to smoking mothers)

Exercise 5

Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.

# Before type change
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

## Warning: package 'BHH2' was built under R version 3.3.3

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184

# After type change
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )

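The observed difference between the two means is 0.3155 pounds, with babies born to non-smoking mothers weighing more on average. The 95% confidence interval is (0.0534, 0.5777): we are 95% confident that babies born to non-smoking mothers weigh, on average, between 0.0534 and 0.5777 pounds more than babies born to smoking mothers. The interval does not contain zero, which agrees with the p-value of 0.0184 from the hypothesis test.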
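The standard error, test statistic, p-value, and interval can be recomputed by hand from the summary statistics printed above. The sketch below assumes the normal (Z) approximation that the output reports. The variable names are my own, and the results should agree with the inference() output up to rounding.

```r
# Summary statistics copied from the inference() output above
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker

diff_means <- m1 - m2                    # observed difference in means
se <- sqrt(s1^2 / n1 + s2^2 / n2)        # standard error of the difference
z  <- (diff_means - 0) / se              # test statistic under H0: difference = 0
p_value <- 2 * pnorm(-abs(z))            # two-sided p-value, normal approximation

z_star <- qnorm(0.975)                   # critical value for 95% confidence
ci <- diff_means + c(-1, 1) * z_star * se

print(round(c(se = se, z = z, p_value = p_value), 4))
print(round(ci, 4))
```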

ON YOUR OWN

1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

We can be 95% confident that the mean length of pregnancy for all mothers in the population (smoking and non-smoking combined) is between 38.1528 and 38.5165 weeks, or roughly 38.2 to 38.5 weeks.

2. Calculate a new confidence interval for the same parameter at the 90% confidence level.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", conflevel = 0.90)

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )

We can be 90% confident that the mean length of pregnancy for all mothers in the population is between 38.182 and 38.4873 weeks. As expected, this interval is narrower than the 95% interval: a lower confidence level uses a smaller critical value (about 1.645 instead of 1.96), so the margin of error shrinks.
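Both intervals can be rebuilt from the printed summary statistics, which also makes the effect of the confidence level on the width explicit. This is a sketch under the normal approximation; ci_for is a hypothetical helper, not part of the lab's code.

```r
# Summary statistics copied from the inference() output above
x_bar <- 38.3347; s <- 2.9316; n <- 998
se <- s / sqrt(n)                        # standard error of the mean

# Hypothetical helper: normal-approximation interval at a given confidence level
ci_for <- function(conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  x_bar + c(-1, 1) * z_star * se
}

ci_95 <- ci_for(0.95)
ci_90 <- ci_for(0.90)
widths <- c("95%" = diff(ci_95), "90%" = diff(ci_90))

print(round(rbind(ci_95, ci_90), 4))
print(round(widths, 4))                  # the 90% interval is the narrower one
```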

3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

H0: mu(weight younger mothers gained) = mu(weight mature mothers gained)

Ha: mu(weight younger mothers gained) != mu(weight mature mothers gained)

inference(nc$weight, nc$mature, type="ht", est="mean", 
          null=0, method="theoretical", alternative="twosided")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 133, mean_mature mom = 7.1256, sd_mature mom = 1.6591
## n_younger mom = 867, mean_younger mom = 7.0972, sd_younger mom = 1.4855

## Observed difference between means (mature mom-younger mom) = 0.0283
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.152 
## Test statistic: Z =  0.186 
## p-value =  0.8526

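With a p-value as high as 0.8526, we fail to reject the null hypothesis of no difference between the two group means. Note, however, that the call above passes nc$weight (the baby's birth weight) as the response, so this output actually compares average birth weights for mature and younger mothers. To test the stated hypotheses about the weight gained by the mother, the response should be nc$gained.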
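Because the hypotheses are about the mother's weight gain, a test on the gained column is the one that matches them. A minimal sketch follows, using base R's t.test (a Welch t-test, so its p-value may differ slightly from the Z-based inference() output). The helper name compare_gain is hypothetical, and no result is shown here because the test was not run in the original.

```r
# Hypothetical helper: Welch two-sample t-test of mother's weight gain by age group.
# Assumes `data` has a numeric column `gained` and a two-level factor `mature`,
# as in the nc data set. Rows with missing values are dropped by the formula method.
compare_gain <- function(data) {
  t.test(gained ~ mature, data = data, alternative = "two.sided")
}

# Run only if the lab data has already been loaded with load("nc.RData")
if (exists("nc")) {
  print(compare_gain(nc))
  # The corresponding call with the lab's own function would be:
  # inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0,
  #           alternative = "twosided", method = "theoretical")
}
```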

4. Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

nc %>%  group_by(mature) %>%
  summarise(Max_Age = max(mage),
            Min_Age = min(mage))

## # A tibble: 2 × 3
##        mature Max_Age Min_Age
##        <fctr>   <int>   <int>
## 1  mature mom      50      35
## 2 younger mom      34      13

Grouping by mature and taking the minimum and maximum of mage shows where one group ends and the other begins. The oldest younger mom is 34 and the youngest mature mom is 35, so the cutoff is 35: mothers aged 35 or older are classified as mature. The other extremes (13 and 50) are simply the youngest and oldest mothers in the sample and are not part of the definition.
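The cutoff can be double-checked by cross-tabulating the recorded label against the rule "age 35 or older". The sketch below uses a hypothetical helper, check_cutoff, and a toy example built only from the four ages in the table above. If the rule holds, each row has all of its counts in a single column.

```r
# Hypothetical helper: cross-tabulate the recorded label against an age rule.
check_cutoff <- function(age, label, cutoff = 35) {
  table(label = label, at_or_above_cutoff = age >= cutoff)
}

# Toy example using only the min/max ages reported in the summary above
toy_age   <- c(13, 34, 35, 50)
toy_label <- c("younger mom", "younger mom", "mature mom", "mature mom")
tab <- check_cutoff(toy_age, toy_label)
print(tab)

# With the lab data loaded, the same check would be:
# check_cutoff(nc$mage, nc$mature)
```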

5. Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.

Does the birth classification (premature or full-term) have an impact on the baby’s average weight?

H0: mu(weight full-term babies) = mu(weight premature babies)

Ha: mu(weight full-term babies) != mu(weight premature babies)

inference(y = nc$weight, x = nc$premie, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 7.4594, sd_full term = 1.075
## n_premie = 152, mean_premie = 5.1284, sd_premie = 1.9696

## Observed difference between means (full term-premie) = 2.331
## 
## H0: mu_full term - mu_premie = 0 
## HA: mu_full term - mu_premie != 0 
## Standard error = 0.164 
## Test statistic: Z =  14.216 
## p-value =  0

inference(y = nc$weight, x = nc$premie, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 7.4594, sd_full term = 1.075
## n_premie = 152, mean_premie = 5.1284, sd_premie = 1.9696

## Observed difference between means (full term-premie) = 2.331
## 
## Standard error = 0.164 
## 95 % Confidence interval = ( 2.0096 , 2.6524 )

With a p-value so small that it prints as zero, we reject the null hypothesis that there is no difference between the average weight of premature babies and the average weight of full-term babies. The 95% confidence interval (2.0096, 2.6524) agrees: it does not include zero, which indicates that the two averages are very unlikely to be equal. In plain language, full-term babies weigh more on average than premature babies, by roughly 2.0 to 2.7 pounds.
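The printed p-value of 0 is a rounding artifact rather than an exact zero. A short sketch, assuming the same normal approximation as the output, recovers its order of magnitude from the reported test statistic and confirms that the interval excludes zero.

```r
z <- 14.216                        # test statistic copied from the output above
p_value <- 2 * pnorm(-abs(z))      # two-sided p-value, normal approximation
print(format(p_value, scientific = TRUE, digits = 3))

ci <- c(2.0096, 2.6524)            # 95% interval copied from the output above
excludes_zero <- ci[1] > 0 || ci[2] < 0
print(excludes_zero)
```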

Georgia Galanopoulos