Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, students will assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)

## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mean_avgwghtgained <- mean(ncbirths$gained, na.rm = TRUE)
sd_avgwghtgained <- sd(ncbirths$gained, na.rm = TRUE)
sample_size <- sum(!is.na(ncbirths$gained))  # number of non-missing observations

# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))

## [1] 1.646423

# Calculate margin of error
1.646423 * sd_avgwghtgained/sqrt(sample_size)

## [1] 0.7516827

# Boundaries of confidence interval
mean_avgwghtgained - 0.7516827

## [1] 29.57411

mean_avgwghtgained + 0.7516827

## [1] 31.07748

We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
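
As a cross-check, the interval can be computed without copying rounded intermediate values by hand. The sketch below is only an illustration: `t_conf_int()` is a hypothetical helper (not part of the assignment), and `gained_demo` holds just the first ten values of `gained` shown in the str() output above, so its numbers will not match the full-sample interval. It compares the manual formula with base R's `t.test()`.

```r
# Hypothetical helper (assumption, not from the assignment): t-interval for a mean
t_conf_int <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                                   # drop missing values
  n <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # two-sided critical value
  margin <- t_crit * sd(x) / sqrt(n)                  # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# First ten values of ncbirths$gained, taken from the str() output
gained_demo <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

manual_ci  <- t_conf_int(gained_demo, conf_level = 0.90)
builtin_ci <- t.test(gained_demo, conf.level = 0.90)$conf.int

manual_ci
builtin_ci
```

On the full dataset the same call would be `t_conf_int(ncbirths$gained, 0.90)`, which avoids hard-coding the degrees of freedom.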

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Calculate t-critical value for 95% confidence
abs(qt(.025, df=972))

## [1] 1.962408

# Calculate margin of error
1.962408*sd_avgwghtgained/sqrt(sample_size)

## [1] 0.8959472

# Boundaries of confidence interval
mean_avgwghtgained - 0.8959472

## [1] 29.42985

mean_avgwghtgained + 0.8959472

## [1] 31.22174

We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 pounds and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?

# Width of the 90% confidence interval (bounds rounded to two decimals)
31.08-29.57

## [1] 1.51

# 95% confidence interval difference
31.22-29.43

## [1] 1.79

The 95% confidence interval is wider than the 90% confidence interval (about 1.8 pounds versus about 1.5 pounds). Both intervals are centered on the same sample mean, but a higher confidence level requires a larger critical value and therefore a larger margin of error, so the 95% interval encompasses more values.
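
The relationship between confidence level and interval width can be sketched directly from the critical values. This is a minimal illustration using the 972 degrees of freedom from Questions 1 and 2; the 99% level is added here only for comparison and is not part of the assignment.

```r
# Degrees of freedom used in Questions 1 and 2 (973 non-missing values minus 1)
df_gained   <- 972
conf_levels <- c(0.90, 0.95, 0.99)   # 0.99 is an extra level added for illustration

# Two-sided critical value for each confidence level
t_crit <- qt(1 - (1 - conf_levels) / 2, df = df_gained)

# With the standard error held fixed, interval width = 2 * t_crit * SE,
# so the widths grow in proportion to the critical values
data.frame(conf_level = conf_levels, t_crit = round(t_crit, 4))
```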

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\)
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mean_avgwght <- mean(ncbirths$weight, na.rm = TRUE)
sd_avgwght <- sd(ncbirths$weight, na.rm = TRUE)
table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

The sample has a mean of approximately 7.101, a standard deviation of approximately 1.509, and a sample size of 1000.

# Test statistic
(mean_avgwght - 7.7)/(sd_avgwght/sqrt(1000))

## [1] -12.55388

The t-score is -12.55388.

# Probability of test statistic by chance
pt(-12.55388, df = 999)*2

## [1] 1.135354e-33

Because the p-value (1.135354e-33) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

The data suggest that there is indeed a difference between the average birthweight of European babies and the average birthweight of babies in North Carolina.
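
The one-sample test can also be written as a small function so that the standard error is grouped correctly in the denominator. The sketch below is only an illustration: `one_sample_t()` is a hypothetical helper, and `weight_demo` contains just the first ten birth weights from the str() output above, so its result differs from the full-sample test. It checks the manual calculation against base R's `t.test()`.

```r
# Hypothetical helper (assumption): two-sided, one-sample t-test computed by hand
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]
  n <- length(x)
  # The standard error sd(x) / sqrt(n) must be parenthesized as a single denominator
  t_stat  <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)
  c(t = t_stat, df = n - 1, p_value = p_value)
}

# First ten values of ncbirths$weight, taken from the str() output
weight_demo <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)

manual  <- one_sample_t(weight_demo, mu0 = 7.7)
builtin <- t.test(weight_demo, mu = 7.7)

manual
builtin$statistic
builtin$p.value
```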

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
\(H_0: \mu_{diff} = 0\) \(H_A: \mu_{diff} \neq 0\), where \(\mu_{diff}\) is the mean of (father's age - mother's age)
Test by confidence interval or p-value and decision

# Calculate the age difference (father minus mother) and count the complete pairs
ncbirths$diff <- ncbirths$fage - ncbirths$mage
table(is.na(ncbirths$diff))

## 
## FALSE  TRUE 
##   829   171

In the “diff” column of the ncbirths dataset, a positive value indicates a case where the mother is younger than the father, while a negative value indicates a case where the mother is older than the father.

mean_diff <- mean(ncbirths$diff, na.rm = TRUE)
sd_diff <- sd(ncbirths$diff, na.rm = TRUE)
length_diff <- 829
df_diff <- 829 - 1

# Test stat
stat_diff <- (mean_diff-0) / (sd_diff/sqrt(length_diff))

# Probability Test
pt(abs(stat_diff), df_diff, lower.tail = FALSE)*2

## [1] 1.504608e-59

Because the p-value (1.504608e-59) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

The p-value is smaller than alpha, so we reject the null hypothesis in favor of the alternative hypothesis. The data suggest that there is a significant difference between the mean ages of mothers and fathers in the ncbirths dataset.
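
A paired test is a one-sample test on the within-pair differences, which the following sketch illustrates. It is only an example: the two vectors hold the first ten `fage` and `mage` values from the str() output above, so the numbers differ from the full-sample result. The manual calculation is compared with base R's `t.test(..., paired = TRUE)`, which drops incomplete pairs on its own.

```r
# First ten father and mother ages from the str() output
fage_demo <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)
mage_demo <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)

# Manual paired test: one-sample t-test on the non-missing differences
diff_demo <- fage_demo - mage_demo
d         <- diff_demo[!is.na(diff_demo)]
n_pairs   <- length(d)                       # counted from the data, not typed in
t_manual  <- mean(d) / (sd(d) / sqrt(n_pairs))
p_manual  <- 2 * pt(-abs(t_manual), df = n_pairs - 1)

# Built-in paired test on the same data
paired_test <- t.test(fage_demo, mage_demo, paired = TRUE)

c(t = t_manual, df = n_pairs - 1, p_value = p_manual)
paired_test$statistic
paired_test$p.value
```

Counting `n_pairs` from the data rather than typing 829 by hand keeps the degrees of freedom correct if the dataset changes.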

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
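\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean length of pregnancy for smokers and \(\mu_2\) is the mean for nonsmokers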
Test by confidence interval or p-value and decision

#Creating subsets for both smokers and nonsmokers
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

#Storing the mean lengths of pregnancies and standard deviations of pregnancy lengths for smokers and nonsmokers
mn_smoker <- mean(smokers$weeks,  na.rm = TRUE)
mn_nonsmoker <- mean(nonsmokers$weeks, na.rm=TRUE)

sd_smokers <- sd(smokers$weeks, na.rm=TRUE)
sd_nonsmokers <- sd(nonsmokers$weeks, na.rm=TRUE)

#Finding the sample size of each group
summary(ncbirths$habit)

## nonsmoker    smoker      NA's 
##       873       126         1

#Storing the standard error
se <- sqrt((sd_smokers^2/126)+(sd_nonsmokers^2/873))

#Finding the test-statistic
(((mn_smoker)-(mn_nonsmoker))-0)/se

## [1] 0.5190483

# Calculate the p-value
pt(0.5190483, df=125, lower.tail=FALSE)*2

## [1] 0.6046448

The p-value (.605) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence of a difference in the mean length of pregnancy (in weeks) between smoking and nonsmoking mothers.
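
The calculation above uses the conservative degrees of freedom, the smaller group size minus one. Base R's `t.test()` instead uses the Welch-Satterthwaite approximation, so its p-value can differ slightly. The sketch below is only an illustration: `two_sample_t()` is a hypothetical helper, the nonsmoker vector holds the first ten `weeks` values from the str() output, and the smoker vector is made up for the example.

```r
# Hypothetical helper (assumption): two-sample t statistic with two choices of df
two_sample_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)                   # squared standard error, group 1
  vy <- var(y) / length(y)                   # squared standard error, group 2
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df_conservative <- min(length(x), length(y)) - 1
  df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat,
    df_conservative = df_conservative,
    df_welch = df_welch,
    p_conservative = 2 * pt(-abs(t_stat), df_conservative),
    p_welch = 2 * pt(-abs(t_stat), df_welch))
}

weeks_smoker_demo    <- c(39, 37, 40, 38, 36, 41)                  # made-up values
weeks_nonsmoker_demo <- c(39, 42, 37, 41, 39, 38, 37, 35, 38, 37)  # from str() output

res   <- two_sample_t(weeks_smoker_demo, weeks_nonsmoker_demo)
welch <- t.test(weeks_smoker_demo, weeks_nonsmoker_demo)  # var.equal = FALSE by default

res
welch$parameter   # Welch degrees of freedom reported by t.test()
```

Because the helper counts group sizes after removing missing values, it also guards against using group counts that include rows where `weeks` is missing.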

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

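Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 > 0\), where \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers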
Test by confidence interval or p-value and decision

#Creating subsets for the weight gained by younger mothers and mature mothers
younger_mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
mature_mothers <- subset(ncbirths, ncbirths$mature == "mature mom")

#Storing the mean, standard deviation, and mean difference for the younger mothers and mature mothers subsets
mn_younger_mothers <- mean(younger_mothers$gained, na.rm=TRUE)
mn_mature_mothers <- mean(mature_mothers$gained, na.rm=TRUE)

sd_younger_mothers <- sd(younger_mothers$gained, na.rm=TRUE)
sd_mature_mothers <- sd(mature_mothers$gained, na.rm=TRUE)

mn_diff_gained <- (mn_younger_mothers - mn_mature_mothers)

#Finding the sample sizes of the younger mothers and mature mothers
table(is.na(younger_mothers$gained))

## 
## FALSE  TRUE 
##   844    23

table(is.na(mature_mothers$gained))

## 
## FALSE  TRUE 
##   129     4

# Storing the standard error
se_gained <- sqrt((sd_younger_mothers^2/844)+(sd_mature_mothers^2/129))

# Calculate the test statistic
(mn_diff_gained - 0)/se_gained

## [1] 1.376483

# Calculate the p-value
pt(1.376483, df=128, lower.tail=FALSE)

## [1] 0.08553763

The p-value (0.086) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence that the average weight gained by younger mothers is greater than the average weight gained by mature mothers.
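
Because the question asks whether younger mothers gain more, the test is one-sided, which is why the p-value above is not doubled. The sketch below illustrates this with base R's `t.test()` and its `alternative = "greater"` argument. Both vectors are made up for the example and are not taken from ncbirths; note also that `t.test()` uses Welch degrees of freedom rather than the conservative df = 128 used above.

```r
# Made-up weight gains in pounds (illustration only, not from ncbirths)
gained_younger_demo <- c(38, 20, 38, 34, 27, 22, 76, 15, 52, 30)
gained_mature_demo  <- c(25, 30, 18, 40, 22, 28)

# One-sided Welch test of H_A: mean(younger) - mean(mature) > 0
greater_test   <- t.test(gained_younger_demo, gained_mature_demo,
                         alternative = "greater")
two_sided_test <- t.test(gained_younger_demo, gained_mature_demo)

# When the observed difference is positive, the one-sided p-value
# is exactly half of the two-sided p-value
greater_test$p.value
two_sided_test$p.value / 2
```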

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

#Creating subsets of low and not low birth weights
low_birthwght <- subset(ncbirths, ncbirths$lowbirthweight == "low" )
notlow_birthwght <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

#Using the summary function to observe values within the classifications of "low" and "not low" weights
summary(low_birthwght$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500

summary(notlow_birthwght$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

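Using the summary function on each subset, I can see that the heaviest baby classified as “low” weighs 5.50 lbs, while the lightest baby classified as “not low” weighs 5.56 lbs. Because the two ranges do not overlap, the cutoff must lie between 5.50 and 5.56 lbs. This is consistent with the common clinical definition of low birth weight as less than 2,500 g (about 5.5 lbs).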
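
The boundary between the two classes can also be extracted directly as the largest “low” weight and the smallest “not low” weight. The sketch below is a minimal illustration using only the first ten `weight` and `lowbirthweight` values from the str() output above, so the bounds it prints are looser than those from the full dataset.

```r
# First ten weights and labels from the str() output
weight_demo <- c(7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
label_demo  <- factor(c("not low", "not low", "not low", "not low", "not low",
                        "low", "not low", "low", "not low", "not low"))

# Heaviest "low" baby and lightest "not low" baby bracket the cutoff
max_low     <- max(weight_demo[label_demo == "low"], na.rm = TRUE)
min_not_low <- min(weight_demo[label_demo == "not low"], na.rm = TRUE)

# The classes should not overlap if the label is a pure weight threshold
stopifnot(max_low < min_not_low)

c(max_low = max_low, min_not_low = min_not_low)
```

On the full dataset, replacing the demo vectors with `ncbirths$weight` and `ncbirths$lowbirthweight` gives the same bounds as the two summary() calls.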

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Is there a significant difference between the average weights of male and female babies?
Write hypotheses
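\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean weight of male babies and \(\mu_2\) is the mean weight of female babies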
Test by confidence interval or p-value and decision

#Creating a subset of male and female babies
male_babies <- subset(ncbirths, ncbirths$gender == "male")
female_babies <- subset(ncbirths, ncbirths$gender == "female")

#Storing the mean weight and standard deviation of male and female babies
mn_male_wt <- mean(male_babies$weight, na.rm = TRUE)
mn_female_wt <- mean(female_babies$weight, na.rm = TRUE)

sd_male_wt <- sd(male_babies$weight, na.rm = TRUE)
sd_female_wt <- sd(female_babies$weight, na.rm = TRUE)

#Storing the mean difference of male and female babies
mn_diff_babywt <- (mn_male_wt - mn_female_wt)

#Calculating the sample size of male and female babies
table(is.na(male_babies$weight))

## 
## FALSE 
##   497

table(is.na(female_babies$weight))

## 
## FALSE 
##   503

#Storing the standard error
SE_b_wt <- sqrt((sd_male_wt^2/497)+(sd_female_wt^2/503))

#Finding the test statistic
(mn_diff_babywt - 0)/SE_b_wt

## [1] 4.211303

#Finding the p-value
pt(4.211303, df=496, lower.tail = FALSE)*2

## [1] 3.015765e-05

The p-value (3.015765e-05) is less than alpha (0.05), and we therefore reject the null hypothesis in favor of the alternate hypothesis.

Conclusion

The data suggest that there is indeed a difference between the average weights of male and female babies.