Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Explore it by running the str() function, viewing the data frame, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is for your own preparation.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

str(ncbirths)

## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset

mean_gained <- mean(ncbirths$gained, na.rm=TRUE)

sd_gained <- sd(ncbirths$gained, na.rm=TRUE)

table(is.na(ncbirths$gained))

## 
## FALSE  TRUE 
##   973    27

#Storing n number of observations and degrees of freedom

n_gained <- sum(!is.na(ncbirths$gained))  # 973 non-missing values, per the table above

df_gained <- n_gained-1

# Calculate t-critical value for 90% confidence

t_gained <- qt(0.05, df_gained,lower.tail=FALSE)  

t_gained

## [1] 1.646423

Our t-critical value is 1.646423.

# Calculate margin of error

me_gained <- (t_gained*sd_gained) / (sqrt(n_gained))  

me_gained

## [1] 0.7516826

The margin of error is 0.7516826.

# Boundaries of confidence interval

mean_gained - me_gained

## [1] 29.57411

mean_gained + me_gained

## [1] 31.07748

We can be 90% confident that the mean weight gained during pregnancy by all North Carolina mothers is between 29.57 and 31.08 pounds.
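As a cross-check on the step-by-step calculation above, the same interval can be computed in one pass, with n derived from the non-missing values instead of being typed in. This is a minimal sketch rather than part of the assignment: t_conf_int() is a hypothetical helper, and gained_demo holds only the first ten gained values printed by str() above, so the block runs without the openintro package.

```r
# Hypothetical helper (not from the assignment): t-based confidence
# interval for a mean, ignoring missing values.
t_conf_int <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                                    # drop NAs, as na.rm = TRUE does
  n <- length(x)                                       # sample size taken from the data
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # two-sided critical value
  me <- t_crit * sd(x) / sqrt(n)                       # margin of error
  c(lower = mean(x) - me, upper = mean(x) + me)
}

# First ten `gained` values shown in the str() output (one is missing)
gained_demo <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

manual_ci  <- t_conf_int(gained_demo, conf_level = 0.90)
builtin_ci <- t.test(gained_demo, conf.level = 0.90)$conf.int

manual_ci
builtin_ci
```

Calling t_conf_int(ncbirths$gained, 0.90) on the full dataset should reproduce the interval computed step by step above.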

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Finding the confidence interval using the t.test function (conf.level defaults to 0.95)

t.test(ncbirths$gained, mu=0, alternative="two.sided")$conf.int

## [1] 29.42985 31.22174
## attr(,"conf.level")
## [1] 0.95

We can be 95% confident that the mean weight gained during pregnancy by all North Carolina mothers is between 29.43 and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?

The 95% confidence interval is wider than the 90% interval. This is expected: a higher confidence level requires a larger t-critical value, which increases the margin of error and therefore widens the interval.
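The widening can be seen directly in the critical values. The sketch below is an illustration, not part of the assignment; it reuses df = 972 from Question 1 and adds a 99% level purely for comparison.

```r
# Two-sided critical values for several confidence levels (df = 972 from Question 1)
conf_levels <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - conf_levels) / 2, df = 972)

# A larger confidence level gives a larger t*, hence a larger margin of error
data.frame(conf_level = conf_levels, t_crit = t_crit)
```

Because the sample mean, standard deviation, and sample size are the same for every level, the critical value is the only part of the margin of error that changes.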

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses

\(H_0: \mu_{weight} = 7.7\)

\(H_A: \mu_{weight} \neq 7.7\)

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)

mean_weight <- mean(ncbirths$weight)

sd_weight <- sd(ncbirths$weight)

table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

#Storing n number of observations and degrees of freedom

n_weight <- 1000

df_weight <- n_weight-1


# Test statistic

t_weight <-(mean_weight-7.7) / (sd_weight/sqrt(n_weight))
t_weight

## [1] -12.55388

Our test statistic is -12.55388.

# Probability of test statistic by chance

pt(t_weight, df_weight) *2

## [1] 1.310452e-33

The p-value is far smaller than 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

Because the p-value is far below 0.05, we conclude that there is strong evidence that the mean birth weight of all North Carolina babies is different from the mean birth weight of European babies.
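The hand calculation can be checked against t.test(). The sketch below is illustrative only: one_sample_t() is a hypothetical helper, and weight_demo holds just the first ten weight values printed by str() above, so its numbers will not match the full-sample results.

```r
# Hypothetical helper: two-sided one-sample t-test computed by hand
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]
  n <- length(x)
  df <- n - 1                                     # df tied to this sample's own n
  t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  # abs() makes the two-sided p-value correct for either sign of t
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  c(t = t_stat, df = df, p_value = p_value)
}

# First ten `weight` values shown in the str() output
weight_demo <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)

by_hand <- one_sample_t(weight_demo, mu0 = 7.7)
builtin <- t.test(weight_demo, mu = 7.7, alternative = "two.sided")

by_hand
builtin$p.value
```

Deriving the degrees of freedom inside the helper keeps them tied to the sample being tested, and doubling the upper tail of abs(t) avoids a p-value above 1 when the test statistic happens to be positive.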

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses

\(H_0: \mu_{mothers} - \mu_{fathers} = 0\)

\(H_A: \mu_{mothers} - \mu_{fathers} \neq 0\)

Test by confidence interval or p-value and decision

#Calculating p-value using the t.test function

t.test(ncbirths$mage,ncbirths$fage, mu=0, alternative="two.sided", paired=TRUE)$p.value

## [1] 1.504608e-59

The p-value is 1.504608e-59, which is far less than 0.05. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

The data provide strong evidence of a difference between the mean age of North Carolina mothers and the mean age of North Carolina fathers.
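A paired t-test is equivalent to a one-sample t-test on the within-pair differences, and only pairs with both ages recorded contribute. The sketch below illustrates this on the first ten mage and fage values printed by str() above; the names mage_demo, fage_demo, and age_diff are introduced here for illustration.

```r
# First ten mother and father ages from the str() output (several fathers missing)
mage_demo <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)
fage_demo <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)

# Paired test on the two vectors
paired_p <- t.test(mage_demo, fage_demo, paired = TRUE)$p.value

# One-sample test on the differences; NA wherever either age is missing
age_diff <- mage_demo - fage_demo
diff_p <- t.test(age_diff, mu = 0)$p.value

sum(!is.na(age_diff))                        # number of complete pairs actually used
c(paired = paired_p, differences = diff_p)   # the two p-values agree
```

In the full dataset the same thing happens: births with a missing fage are dropped from the paired test, so its degrees of freedom reflect the number of complete pairs rather than all 1000 rows.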

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses

\(H_0: \mu_{smokers} - \mu_{nonsmokers} = 0\)

\(H_A: \mu_{smokers} - \mu_{nonsmokers} \neq 0\)

Test by confidence interval or p-value and decision

#creating subsets for each category

nc_smokers <- subset(ncbirths, ncbirths$habit=="smoker")

nc_nonsmokers <- subset(ncbirths, ncbirths$habit=="nonsmoker")


#Finding p-value using t.test function, calling on the variable 'weeks' for pregnancy length

t.test(nc_smokers$weeks, nc_nonsmokers$weeks, mu=0, alternative="two.sided", paired=FALSE)$p.value

## [1] 0.6043917

Our p-value is 0.6044, which is greater than the significance level of 0.05. We therefore fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence that the average pregnancy length for North Carolina smokers differs from the average pregnancy length for North Carolina non-smokers.
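The two-vector call above can also be written with R's formula interface, which avoids creating separate subsets. The sketch below is illustrative: the weeks values are the first ten from the str() output, but the habit labels are made up so that both groups are represented, and the data frame demo is hypothetical.

```r
# Hypothetical data frame: real weeks values, made-up habit labels
demo <- data.frame(
  weeks = c(39, 42, 37, 41, 39, 38, 37, 35, 38, 37),
  habit = factor(c("nonsmoker", "nonsmoker", "smoker", "nonsmoker", "smoker",
                   "nonsmoker", "smoker", "smoker", "nonsmoker", "nonsmoker"))
)

# Two-vector form, as used in the test above
two_vec <- t.test(demo$weeks[demo$habit == "smoker"],
                  demo$weeks[demo$habit == "nonsmoker"])

# Formula form: numeric response ~ two-level grouping factor
by_formula <- t.test(weeks ~ habit, data = demo)

two_vec$method   # t.test() defaults to the Welch (unequal variances) test
c(two_vectors = two_vec$p.value, formula = by_formula$p.value)
```

The formula form orders the groups by factor level, so the sign of the estimated difference may flip relative to the two-vector call, but the two-sided p-value is the same.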

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses

\(H_0: \mu_{younger} = \mu_{mature}\)

\(H_A: \mu_{younger} > \mu_{mature}\)

Test by confidence interval or p-value and decision

#creating subsets for each category

nc_younger <- subset(ncbirths, ncbirths$mature=="younger mom")

nc_mature <- subset(ncbirths, ncbirths$mature=="mature mom")

#Finding p-value using t.test function

t.test(nc_younger$gained, nc_mature$gained, mu=0, alternative="greater", paired=FALSE)$p.value

## [1] 0.0852137

Our p-value is 0.0852. Because this is greater than the significance level of 0.05, we fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence that younger North Carolina mothers gain more weight on average than mature North Carolina mothers.
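For a one-sided test, the order of the two samples and the alternative argument must agree with the alternative hypothesis. The sketch below uses made-up weight gains (younger_demo and mature_demo are hypothetical, not taken from ncbirths) to show how the one-sided and two-sided p-values relate.

```r
# Made-up weight gains in pounds, for illustration only
younger_demo <- c(38, 20, 38, 34, 27, 22, 41)
mature_demo  <- c(30, 15, 52, 25, 28, 19)

# H_A: mu_younger > mu_mature, so younger comes first with alternative = "greater"
p_greater <- t.test(younger_demo, mature_demo, alternative = "greater")$p.value

# Swapping the samples requires flipping the alternative to test the same claim
p_swapped <- t.test(mature_demo, younger_demo, alternative = "less")$p.value

# Opposite one-sided test and the two-sided test, for comparison
p_less <- t.test(younger_demo, mature_demo, alternative = "less")$p.value
p_two  <- t.test(younger_demo, mature_demo, alternative = "two.sided")$p.value

c(greater = p_greater, swapped = p_swapped, less = p_less, two_sided = p_two)
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, so choosing the wrong direction can turn a small p-value into one close to 1.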

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# Creating subsets for each category

nc_low <- subset(ncbirths, ncbirths$lowbirthweight=="low")

nc_notlow <- subset(ncbirths, ncbirths$lowbirthweight=="not low")

# Finding the minimum and maximum values of each subset using the summary() function

summary(nc_low$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500

summary(nc_notlow$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

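The maximum weight in the “low” subset is 5.50 pounds, and the minimum weight in the “not low” subset is 5.56 pounds. The cutoff therefore lies somewhere between these two values and is most likely 5.50 pounds: a weight of 5.50 pounds or less is classified as “low,” and anything heavier is “not low.” Because the two weight ranges do not overlap, this boundary separates the two categories cleanly in this sample.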
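The same comparison can be done without creating subsets by computing the range of weight within each category. The sketch below is illustrative: demo is a hypothetical data frame built from the first ten weight and lowbirthweight values printed by str() above, so its extremes differ from the full-sample values.

```r
# First ten weights and their categories from the str() output
demo <- data.frame(
  weight = c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94),
  lowbirthweight = factor(c("not low", "not low", "not low", "not low", "not low",
                            "low", "not low", "low", "not low", "not low"))
)

# Minimum and maximum weight within each category (one column per category)
ranges <- sapply(split(demo$weight, demo$lowbirthweight), range)
rownames(ranges) <- c("min", "max")
ranges

# The cutoff must lie between the heaviest "low" and the lightest "not low" baby
c(max_low = ranges["max", "low"], min_not_low = ranges["min", "not low"])
```

If the heaviest “low” weight were ever greater than the lightest “not low” weight, the classification could not be explained by a single weight cutoff.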

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question

Is there a significant difference in the number of hospital visits during pregnancy between mothers of premature babies and mothers of full-term babies?

Write hypotheses

\(H_0: \mu_{premature} - \mu_{fullterm} = 0\)

\(H_A: \mu_{premature} - \mu_{fullterm} \neq 0\)

Test by confidence interval or p-value and decision

# Creating subsets for each category

nc_premie <- subset(ncbirths, ncbirths$premie=="premie")

nc_full <- subset(ncbirths, ncbirths$premie=="full term")

# Calculating the p-value using the t.test function

t.test(nc_premie$visits, nc_full$visits, mu=0, alternative="two.sided", paired=FALSE)$p.value

## [1] 0.000108382

Our p-value for these data is 0.0001084, which is far less than 0.05. We can therefore reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

There is strong evidence that the mean number of hospital visits during pregnancy differs between North Carolina mothers of premature babies and North Carolina mothers of full-term babies.
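The same t.test() call also returns a confidence interval for the difference in means, which gives the direction and size of the difference as well as the decision. The sketch below uses made-up visit counts (premie_demo and full_demo are hypothetical, not taken from ncbirths).

```r
# Made-up numbers of hospital visits, for illustration only
premie_demo <- c(5, 8, 9, 7, 10, 6)
full_demo   <- c(12, 15, 11, 13, 10, 14, 12)

result <- t.test(premie_demo, full_demo, alternative = "two.sided", conf.level = 0.95)

ci <- result$conf.int          # 95% CI for mean(premie) - mean(full term)
ci
result$p.value

# At the 0.05 level, rejecting H_0 corresponds to the 95% CI excluding 0
excludes_zero <- ci[1] > 0 || ci[2] < 0
excludes_zero == (result$p.value < 0.05)
```

An interval that lies entirely below zero would indicate fewer visits on average for mothers of premature babies, which a two-sided p-value alone does not reveal.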