Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of the inference on numerical data with the t-distribution. If not specifically mentioned, students will assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded, just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

str(ncbirths)

## tibble [1,000 × 13] (S3: tbl_df/tbl/data.frame)
##  $ fage          : int [1:1000] NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int [1:1000] 13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int [1:1000] 39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int [1:1000] 10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "not married",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int [1:1000] 38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num [1:1000] 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mean_weight_gained <- mean(ncbirths$gained, na.rm = TRUE) # Store mean
sd_weight_gained <- sd(ncbirths$gained, na.rm = TRUE) # Store standard deviation
n <- nrow(ncbirths) # Store sample size of ncbirths

mean_weight_gained

## [1] 30.3258

sd_weight_gained

## [1] 14.2413

## [1] 1000

# Calculate t-critical value for 90% confidence
alpha <- 0.10 # 100% - 90%
t_critical <- qt(alpha / 2, df = n - 1, lower.tail = FALSE) # calculate the cutoff mark for boundary of the confidence interval.  Alpha is the difference between 100% and the confidence interval 90%, we've divided by two and removed the lower tail values to focus on the upper tail.  I use qt() to find the quatnile in the Student's t-test.  df is the degrees of freedom, calculated by subtracting one from the size of sample.

# Calculate standard error
standard_error_90 <- sd_weight_gained / sqrt(n) # the standard error is the standard deviation for weight gain over the squae root of the sample size of North Carolinian births in this dataset.

# Calculate margin of error
margin_of_error <- t_critical * standard_error_90 # margin of error is calculated by a simple multiplication of tCritical by the standard error.

# Boundaries of confidence interval
lower_bound <- mean_weight_gained - margin_of_error # calculating the lower bound of the CI
upper_bound <- mean_weight_gained + margin_of_error # calculating the upper bound of the CI

# Printing results
cat("90% Confidence Interval for the average weight gained: [",lower_bound, ", ", upper_bound,"]", "\n\n") # cat() concatenates the unicode character string contained within double quotes with the variables resting outside the quote boundaries.  The backslash and lowercase 'n' character works to inserts a double new line after the printout, ensuring diligent adherence to the instructions set forth by our scholarly Professor!

## 90% Confidence Interval for the average weight gained: [ 29.58435 ,  31.06724 ]

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Switching to a 95% Confidence Interval, use the same structure as above.
alpha_95 <- 0.05 # 100% - 95%, the alpha here is 5%
t_critical_95 <- qt(1 - alpha_95 / 2, df = n - 1) # same as the comment above for sample size and df, but for an alpha at 5% (confidence interval at 95%)

# The standard error hasn't changed one iota!  I should have copied and pasted this entire section.
standard_error_95 <- sd_weight_gained / sqrt(n) # see section above, same thing

# Calculate margin of error for 95%
margin_of_error_95 <- t_critical_95 * standard_error_95 # 

# Confidence Interval boundaries for 95%
lower_bound_95 <- mean_weight_gained - margin_of_error_95
upper_bound_95 <- mean_weight_gained + margin_of_error_95

# Printing variables to double-check
alpha_95

## [1] 0.05

t_critical_95

## [1] 1.962341

standard_error_95

## [1] 0.4503493

margin_of_error_95

## [1] 0.8837392

lower_bound_95

## [1] 29.44206

upper_bound_95

## [1] 31.20954

# Display the 95% Confidence Interval
cat("95% Confidence Interval for the average weight gained: [",lower_bound_95,", ",upper_bound_95,"]", "\n\n")

## 95% Confidence Interval for the average weight gained: [ 29.44206 ,  31.20954 ]

How does that confidence interval compare to the one in Question #1?

The confidence interval at 95% is wider than the confidence interval at 90%. This folllows logically because in order to increase the likelihood that a sample of data is representative of the larger whole it must include more of the data from the whole. In essence, the higher confidence interval means I are interested in capturing more of the variability within the population.

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses \[ H_0: \mu_{north\_carolinian} = 7.7 \] \[ H_1: \mu_{north\_carolinian} \neq 7.7 \]

# calculating the mean, the standard deviation, and the sample size and storing as variables
mean_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm = TRUE)
n <- sum(!is.na(ncbirths$weight))

# Perform the t-test using a shortcut to compare to the laid out method per the instructions
t_test_result <- t.test(x = ncbirths$weight, mu = 7.7, alternative = "two.sided")

# Print out the result for reference
print(t_test_result)

## 
##  One Sample t-test
## 
## data:  ncbirths$weight
## t = -12.554, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 7.7
## 95 percent confidence interval:
##  7.007368 7.194632
## sample estimates:
## mean of x 
##     7.101

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size) 
mean_weight <- mean(ncbirths$weight, na.rm = TRUE) # calculating mean for weights
sd_weight <- sd(ncbirths$weight, na.rm = TRUE) # standard deviation for all birthweights
n <- sum(!is.na(ncbirths$weight)) # finding the sample size

mu_0 <- 7.7 # null hypothesis mean value
t_val <- (mean_weight - mu_0) / (sd_weight / sqrt(n)) # null hypothesis t value
p_val <- (2 * pt(-abs(t_val), df = n-1)) # null hypothesis p value

# print out the values
cat("Test statistic is ", t_val, "\n") # Test statistic

## Test statistic is  -12.55388

cat("P-value is ", p_val, "\n")

## P-value is  1.135415e-33

# Test statistic
t_val

## [1] -12.55388

# Probability of test statistic by chance
p_val

## [1] 1.135415e-33

Conclusion

I have used hypothesis testing to evaluate the birth weights of North Carolinian babies against the known average birth weights of European babies at 7.7 lbs. I obtained a test statistic of -12.55388 with a p-value of 1.135415e-33, which is far lower than 0.05. Based on this p-value being lower than the alpha, I can reject the null hypothesis that the birth weights of North Carolinian (NC) babies is statistically similar to the birth weights of European babies. Due to the negative test statistic, I can also conclude that the NC babies are on average less massive than their European baby counterparts.

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses

\[ H_0: \mu_{mother} = \mu_{father} \] \[ H_1: \mu_{mother} \neq \mu_{father} \]

Test by confidence interval or p-value and decision

 # b. Perform the paired t-test
t_test_output <- t.test(ncbirths$mage, ncbirths$fage, paired = TRUE)

# Print the results of the t-test
print(t_test_output)

## 
##  Paired t-test
## 
## data:  ncbirths$mage and ncbirths$fage
## t = -17.673, df = 828, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -2.947206 -2.357981
## sample estimates:
## mean difference 
##       -2.652593

# c. Conclusion (to be written after obtaining the p-value)

Conclusion

The p-value is far lower than alpha, I reject the null hypothesis.

Question 5 - Two Indendent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses

\[ H_0: \mu_{smokers} = \mu_{non\_smokers} \] \[ H_1: \mu_{smokers} \neq \mu_{non\_smokers} \] b. Test by confidence interval or p-value and decision

pregnancy_length_smokers <- ncbirths$weeks[ncbirths$habit == "smoker"] # select the smoker data independently
pregnancy_length_non_smokers <- ncbirths$weeks[ncbirths$habit == "nonsmoker"] # select the  nonsmoker data independently
t_test_output <- t.test(pregnancy_length_smokers, pregnancy_length_non_smokers) # perform Student's t.test with both
(mean(pregnancy_length_smokers, na.rm = TRUE)) # printing out means for reference, removing na's.

## [1] 38.44444

(mean(pregnancy_length_non_smokers, na.rm = TRUE)) # printing out means for reference, removing na's.

## [1] 38.31881

print(t_test_output) # show results

## 
##  Welch Two Sample t-test
## 
## data:  pregnancy_length_smokers and pregnancy_length_non_smokers
## t = 0.519, df = 182.63, p-value = 0.6044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3519903  0.6032646
## sample estimates:
## mean of x mean of y 
##  38.44444  38.31881

Conclusion

The p-value for this test is 0.6044 which is greater than the alpha, and so I fail to reject the null hypothesis. There is not enough evidence to show a significant difference in the length of pregnancy between the smoking and the nonsmoking populations.

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
\[ H_0: \mu_{young} = \mu_{mature} \] \[ H_1: \mu_{young} \neq \mu_{mature} \]
Test by confidence interval or p-value and decision

weight_gained_younger <- ncbirths$gained[ncbirths$mature == "younger mom"]
weight_gained_mature <- ncbirths$gained[ncbirths$mature == "mature mom"]
t_test_output <- t.test(weight_gained_younger, weight_gained_mature, alternative = "greater")
t_test_output

## 
##  Welch Two Sample t-test
## 
## data:  weight_gained_younger and weight_gained_mature
## t = 1.3765, df = 175.34, p-value = 0.08521
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.3562741        Inf
## sample estimates:
## mean of x mean of y 
##  30.56043  28.79070

Conclusion

Manually copying down the values from the Welch Two Sample t-test:

Test statistic: 1.3765

Degrees of freedom: 175.34

P-value: 0.08521

The p-value for this test is greater than the significance level and therefore fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight gain of mothers during pregnancy is different between young and mature mothers with the available data in this dataset. The vonfidence interval for this difference includes zero, and therefore the possibility of no difference or differences in favor of either subset cannot be eliminated.

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# Calculate the z-scores for each birth wieght
z_scores <- (ncbirths$weight - mean_weight) / sd_weight
# calculating a weight cutoff using 10 percentile
weight_cutoff <- quantile(ncbirths$weight, probs = 0.10, na.rm = TRUE)

# Results (in order of z and percentile)
head(z_scores) # a few z-scores

## [1]  0.3505958  0.5162837 -0.3121561  0.5958140 -0.4778441 -1.1405960

print(weight_cutoff) # weight cutoff by quantile

##  10% 
## 5.44

z_score_cutoff <- (5.44 - mean_weight) / sd_weight
print(z_score_cutoff)

## [1] -1.100831

The weight cutoff for babies being classified as “low” or “not low” should be roughly 5.44 lbs. The weight cutoff at 5.44 lbs was determined through use of a 90% cutoff. I’ve provided z scores for reference and calculated the z score for the cutoff calculated by quantile for reference. (As an anecdotal reference point, this was determined by looking up old texts from my friend about his son, who was born around 3.5 months premature at 1.75 lbs, and then frantically searching online for statistics help before the deadline. R_ is now a healthy young toddler with a normally shaped head, thanks to modern medicine and a very seriously dedicated NICU team in Maine.)

In re-evaluating this problem before publishing, I realized that the instructions were to use a 95% confidence interval for problems in which the CI was unspecified. That means I should re-do this problem.

# Calculate the z-scores for each birth wieght
z_scores <- (ncbirths$weight - mean_weight) / sd_weight
# calculating a weight cutoff using 5 percentile
weight_cutoff <- quantile(ncbirths$weight, probs = 0.05, na.rm = TRUE)

# Results (in order of z and percentile)
head(z_scores) # a few z-scores

## [1]  0.3505958  0.5162837 -0.3121561  0.5958140 -0.4778441 -1.1405960

print(weight_cutoff) # weight cutoff by quantile

##   5% 
## 4.44

z_score_cutoff <- (4.44 - mean_weight) / sd_weight
print(z_score_cutoff)

## [1] -1.763583

For the second attempt, the weight cutoff for babies being classified as “low” should be 4.44 lbs when taking into consideration a 95% confidence interval, as stipulated in the rules. Using a z-score does not add much information to the calculation, but provides useful insight into the structure of the problem, so that has been kept. Babies lighter than 4.44 lbs would be considered underweight using this test.

As an additional point of reference, R_’s birthweight was 1.75 lbs. That means he was…

# Calculating where R_ stood at birth.
R__weight <- 1.75
mean_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm = TRUE)

# z-score for R_'s weight
z_score <- (R__weight - mean_weight) / sd_weight

# Translate z-score to a confidence level
p_value <- 2 * pnorm(abs(z_score), lower.tail = FALSE)

# Convert the p-value to a confidence level
confidence_level <- 1 - p_value

# Display the results
cat("R_'s birth weight z-score: ", z_score, "\n")

## R_'s birth weight z-score:  -3.546385

cat("Corresponding p-value: ", p_value, "\n")

## Corresponding p-value:  0.0003905545

cat("Confidence level that includes R_'s weight: ", confidence_level*100, "%", "\n")

## Confidence level that includes R_'s weight:  99.96094 %

…over 3.5 sigma below the mean! What a strange coincidence, 3.5 sigma and 3.5 months premature.

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question

“Is there a relationship between the age of the father and the likelihood of having a premature infant?”

Write hypotheses

\[ H_0: \mu_{fage\_fullterm} = \mu_{fage\_premie}\] \[ H_1: \mu_{fage\_fullterm} \neq \mu_{fage\_premie}\]

Test by confidence interval or p-value and decision

fage_fullterm <- ncbirths[ncbirths$premie == 1, ] 
fage_premie <- ncbirths[ncbirths$premie == 2, ]

mean_fullterm <- mean(fage_fullterm$fage, na.rm = TRUE)
mean_premie <- mean(fage_premie$fage, na.rm = TRUE)
sd_fullterm <- sd(fage_fullterm$fage, na.rm = TRUE)
sd_premie <- sd(fage_premie$fage, na.rm = TRUE)
n_fullterm <- length(na.omit(fage_fullterm$fage))
n_premie <- length(na.omit(fage_premie$fage))

se_difference <- sqrt((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))


df <- ((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))^2 / (((sd_fullterm^2 / n_fullterm)^2 / (n_fullterm - 1)) + ((sd_premie^2 / n_premie)^2 / (n_premie - 1)))

alpha <- 0.05
t_critical_95 <- qt(1 - alpha / 2, df)

ci_lower <- (mean_fullterm - mean_premie) - t_critical_95 * se_difference
ci_upper <- (mean_fullterm - mean_premie) + t_critical_95 * se_difference

cat("The confidence interval for the difference in mean father's ages between fullterm and premature infants is [", ci_lower, ",", ci_upper, "]\n")

## The confidence interval for the difference in mean father's ages between fullterm and premature infants is [ NaN , NaN ]

Conclusion

I’ve chosen the wrong variables to compare. I am getting NaN results because I am dividing by zero, which results from removing NA values that strips the entire dataset of any values at all. This is an interesting conclusion because it leads me back to the start. Do I have to select a different pair of numerical and categorical values in order to complete this question?

There is not enough data to evaluate reject any hypothesis. I will re-evaluate this question as follows:

a2. Question

“Is there a relationship between the age of the mother and the likelihood of having a premature infant?”

b2. Write hypotheses

\[ H_0: \mu_{mage\_fullterm} = \mu_{mage\_premie}\] \[ H_1: \mu_{mage\_fullterm} \neq \mu_{mage\_premie}\]

c2. Test by confidence interval or p-value and decision

mage_fullterm <- ncbirths[ncbirths$premie == 1, ] 
mage_premie <- ncbirths[ncbirths$premie == 2, ]

mean_fullterm <- mean(mage_fullterm$mage, na.rm = TRUE)
mean_premie <- mean(mage_premie$mage, na.rm = TRUE)
sd_fullterm <- sd(mage_fullterm$mage, na.rm = TRUE)
sd_premie <- sd(mage_premie$mage, na.rm = TRUE)
n_fullterm <- length(na.omit(mage_fullterm$mage))
n_premie <- length(na.omit(mage_premie$mage))

se_difference <- sqrt((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))


df <- ((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))^2 / (((sd_fullterm^2 / n_fullterm)^2 / (n_fullterm - 1)) + ((sd_premie^2 / n_premie)^2 / (n_premie - 1)))

alpha <- 0.05
t_critical_95 <- qt(1 - alpha / 2, df)

ci_lower <- (mean_fullterm - mean_premie) - t_critical_95 * se_difference
ci_upper <- (mean_fullterm - mean_premie) + t_critical_95 * se_difference

cat("The confidence interval for the difference in mean mother's ages between fullterm and premature infants is [", ci_lower, ",", ci_upper, "]\n")

## The confidence interval for the difference in mean mother's ages between fullterm and premature infants is [ NaN , NaN ]

Nope, still getting NaN responses. Maybe my assumption that I’m stripping out the NAs and getting zeroes is wrong.

A correction on the above script below came from a suggestion online. I was misreading the “premie” variables as numbers representing categorical data due to the output of the str() above. It is correctly read as factor levels. Whoops.

Below is code representing the hypothesis test for Mother’s age and premie births.

# Correctly filtering the dataset based on 'premie' status
mage_fullterm <- ncbirths[ncbirths$premie == 'full term', ] 
mage_premie <- ncbirths[ncbirths$premie == 'premie', ]

# Calculating means, standard deviations, and sample sizes
mean_fullterm <- mean(mage_fullterm$mage, na.rm = TRUE)
mean_premie <- mean(mage_premie$mage, na.rm = TRUE)
sd_fullterm <- sd(mage_fullterm$mage, na.rm = TRUE)
sd_premie <- sd(mage_premie$mage, na.rm = TRUE)
n_fullterm <- length(na.omit(mage_fullterm$mage))
n_premie <- length(na.omit(mage_premie$mage))

# Standard error of the difference in means
se_difference <- sqrt((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))

# Degrees of freedom for the t-test
df <- ((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))^2 / 
      (((sd_fullterm^2 / n_fullterm)^2 / (n_fullterm - 1)) + ((sd_premie^2 / n_premie)^2 / (n_premie - 1)))

# Critical t-value for the 95% confidence interval
alpha <- 0.05
t_critical_95 <- qt(1 - alpha / 2, df)

# Confidence interval for the difference in means
ci_lower <- (mean_fullterm - mean_premie) - t_critical_95 * se_difference
ci_upper <- (mean_fullterm - mean_premie) + t_critical_95 * se_difference

# Output the confidence interval
cat("The confidence interval for the difference in mean mother's ages between fullterm and premature infants is [", ci_lower, ",", ci_upper, "]\n")

## The confidence interval for the difference in mean mother's ages between fullterm and premature infants is [ -0.999804 , 1.249804 ]

What follows is the corrected code which should calculate the confidence interval for Father’s age and premie births.

fage_fullterm <- ncbirths[ncbirths$premie == "full term", ] 
fage_premie <- ncbirths[ncbirths$premie == "premie", ]

mean_fullterm <- mean(fage_fullterm$fage, na.rm = TRUE)
mean_premie <- mean(fage_premie$fage, na.rm = TRUE)
sd_fullterm <- sd(fage_fullterm$fage, na.rm = TRUE)
sd_premie <- sd(fage_premie$fage, na.rm = TRUE)
n_fullterm <- length(na.omit(fage_fullterm$fage))
n_premie <- length(na.omit(fage_premie$fage))

se_difference <- sqrt((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))


df <- ((sd_fullterm^2 / n_fullterm) + (sd_premie^2 / n_premie))^2 / (((sd_fullterm^2 / n_fullterm)^2 / (n_fullterm - 1)) + ((sd_premie^2 / n_premie)^2 / (n_premie - 1)))

alpha <- 0.05
t_critical_95 <- qt(1 - alpha / 2, df)

ci_lower <- (mean_fullterm - mean_premie) - t_critical_95 * se_difference
ci_upper <- (mean_fullterm - mean_premie) + t_critical_95 * se_difference

cat("The confidence interval for the difference in mean father's ages between fullterm and premature infants is [", ci_lower, ",", ci_upper, "]\n")

## The confidence interval for the difference in mean father's ages between fullterm and premature infants is [ -1.561242 , 1.414257 ]

Conclusion 1-B: (Father’s age, fixed)

The confidence interval for the difference in mean paternal age between groups comprising fathers of fullterm and premature infants ranges from -1.56 to 1.41 years, which includes 0 in the range. I have failed to find a statiastical significance between the groups, and there may not be a difference because 0 is within the range.

Conclusion 2-B: (Mother’s age, fixed)

This confidence interval for the difference in mean maternal age between mothers of fullterm and premature infants can be as low as -0.999804 years and as high as 1.249804 years, including 0 in the range. As with the hypothesis test of the father’s age, I can infer that there is no significant difference in term length and mother’s age, and there may be no difference.