Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless otherwise specified, students will assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)

## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mean.gained <- mean(ncbirths$gained, na.rm = TRUE)
sd.gained <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))

## 
## FALSE  TRUE 
##   973    27

sample.size <- 973

# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))

## [1] 1.646423

# Calculate margin of error
1.646423 * sd.gained/sqrt(sample.size)

## [1] 0.7516827

# Boundaries of interval
mean.gained - 0.7516827

## [1] 29.57411

mean.gained + 0.7516827

## [1] 31.07748

We are 90% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.57 and 31.08 lbs.
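The interval above can be cross-checked against R's built-in t.test(). The following is a minimal sketch, not part of the original project: t_interval is a hypothetical helper name, and the input is only the first ten values of gained shown in the str() output above, not the full dataset.

```r
# Hypothetical helper (an assumption, not from the project): t-interval for a mean
t_interval <- function(x, conf.level = 0.90) {
  x <- x[!is.na(x)]                                   # drop missing values
  n <- length(x)
  t.crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)  # upper critical value
  moe <- t.crit * sd(x) / sqrt(n)                     # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

# First ten values of ncbirths$gained as printed by str() above
first.gained <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

ci.manual <- t_interval(first.gained, conf.level = 0.90)
ci.builtin <- t.test(first.gained, conf.level = 0.90)$conf.int

print(ci.manual)
print(ci.builtin)
```

With openintro loaded, t_interval(ncbirths$gained, conf.level = 0.90) should match the boundaries computed by hand.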

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.
How does that confidence interval compare to the one in Question #1?

abs(qt(.025, df=sample.size-1))

## [1] 1.962408

##1.962408
mean.gained - 1.962341*sd.gained/sqrt(sample.size)

## [1] 29.42988

mean.gained + 1.962341*sd.gained/sqrt(sample.size)

## [1] 31.22171
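To see how the confidence level changes the interval, the two critical values can be compared directly. This is a small sketch under the same degrees of freedom used above (973 - 1); the variable names are illustrative only.

```r
# Critical values for 90% and 95% confidence with df = 973 - 1
deg.free <- 972
t.crit <- qt(c(0.95, 0.975), df = deg.free)
names(t.crit) <- c("90%", "95%")
print(t.crit)

# The standard error is the same in both intervals, so the ratio of the
# margins of error equals the ratio of the critical values
width.ratio <- t.crit[["95%"]] / t.crit[["90%"]]
print(width.ratio)
```

The ratio is greater than 1 (roughly 1.19), so the 95% interval is about 19% wider than the 90% interval built from the same sample.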

We are 95% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.43 and 31.22 lbs. This interval is wider than the 90% interval from Question 1 (29.57 to 31.08 lbs) because a higher confidence level requires a larger margin of error.

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
H0: μ = 7.7

HA: μ ≠ 7.7

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mean.weight <- mean(ncbirths$weight, na.rm = TRUE)
sd.weight <- sd(ncbirths$weight, na.rm=TRUE)
table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

##1000

# Test statistic
(mean.weight - 7.7)/(sd.weight/sqrt(1000))

## [1] -12.55388

t-score = -12.55388

# Probability of test statistic by chance
pt(-12.55388, df=999)*2

## [1] 1.135354e-33

Conclusion
Because the p-value (about 1.1e-33) is far below 0.05, we reject the null hypothesis in favor of the alternative hypothesis.

The data provide strong evidence that the average birthweight of NC babies differs from the European average of 7.7 lbs.
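The single-sample test can also be wrapped in a small function and compared with t.test(). This is a sketch only: one_sample_t is a hypothetical helper, and the input is the first ten values of weight shown in the str() output, not the full 1000 observations.

```r
# Hypothetical helper (an assumption): two-sided one-sample t-test
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]
  n <- length(x)
  t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  # -abs() picks the lower tail whatever the sign of the statistic,
  # so doubling it always gives the two-sided p-value
  p.value <- 2 * pt(-abs(t.stat), df = n - 1)
  list(statistic = t.stat, df = n - 1, p.value = p.value)
}

# First ten values of ncbirths$weight as printed by str() above
first.weight <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)

manual <- one_sample_t(first.weight, mu0 = 7.7)
builtin <- t.test(first.weight, mu = 7.7)

print(manual$statistic)
print(manual$p.value)
```

Using -abs() avoids having to switch between the lower and upper tail by hand when the test statistic changes sign.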

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
H0:μd=0

HA:μd≠0

Test by confidence interval or p-value and decision

ncbirths$dif <- ncbirths$fage - ncbirths$mage

table(is.na(ncbirths$dif))

## 
## FALSE  TRUE 
##   829   171

#The sample size is 829.



(mean(ncbirths$dif,na.rm=TRUE)-0)/(sd(ncbirths$dif,na.rm=TRUE)/sqrt(829))

## [1] 17.6727

# T-score is 17.6727


pt(17.6727, df=828, lower.tail=FALSE)*2

## [1] 1.504649e-59
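A paired test is a one-sample test on the differences, which is why the computation above works on ncbirths$dif. The sketch below checks that equivalence against t.test(..., paired = TRUE), using only the first ten fage and mage values printed by str(); rows with a missing father's age drop out of both calculations.

```r
# First ten values of fage and mage as printed by str() above
fage <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)
mage <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)

# Manual route: one-sample t-test on the differences
dif <- fage - mage
n <- sum(!is.na(dif))                      # number of complete pairs
t.stat <- mean(dif, na.rm = TRUE) / (sd(dif, na.rm = TRUE) / sqrt(n))
p.value <- 2 * pt(-abs(t.stat), df = n - 1)

# Built-in route: paired t-test on the two columns
paired <- t.test(fage, mage, paired = TRUE)

print(c(t = t.stat, df = n - 1, p = p.value))
```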

Conclusion
Because the p-value (about 1.5e-59) is far below 0.05, we reject the null hypothesis in favor of the alternative.

The data provide strong evidence that the average age of fathers differs from the average age of mothers; on average, the fathers in the sample are older than the mothers.

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
H0:μ1−μ2=0

HA:μ1−μ2≠0

Test by confidence interval or p-value and decision

smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")


mean.smoker <- mean(smokers$weeks,  na.rm = TRUE)
mean.nonsmoker <- mean(nonsmokers$weeks, na.rm = TRUE)

sd.smokers <- sd(smokers$weeks, na.rm =TRUE)
sd.nonsmokers <- sd(nonsmokers$weeks, na.rm = TRUE)
summary(ncbirths$habit)

## nonsmoker    smoker      NA's 
##       873       126         1


SE <- sqrt((sd.smokers^2/126)+(sd.nonsmokers^2/873))


(((mean.smoker)-(mean.nonsmoker))-0)/SE

## [1] 0.5190483

pt(0.5190483, df=125, lower.tail=FALSE)*2

## [1] 0.6046448

Because the p-value (0.605) is greater than 0.05, the data are not strong enough to reject the null hypothesis.
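The test above uses the conservative degrees of freedom, the smaller sample size minus one (here 126 - 1 = 125). R's t.test() instead uses the Welch-Satterthwaite approximation by default. The sketch below computes both from summary statistics; two_sample_t is a hypothetical helper, and the two vectors are made-up illustrative values, not ncbirths data.

```r
# Hypothetical helper (an assumption): two-sample t-test from summary statistics
two_sample_t <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  t.stat <- (m1 - m2) / sqrt(v1 + v2)        # same SE formula as above
  df.cons <- min(n1, n2) - 1                 # conservative df
  df.welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  c(t = t.stat,
    df.conservative = df.cons,
    p.conservative = 2 * pt(-abs(t.stat), df = df.cons),
    df.welch = df.welch,
    p.welch = 2 * pt(-abs(t.stat), df = df.welch))
}

# Made-up pregnancy lengths in weeks, NOT taken from ncbirths
group.a <- c(39, 40, 38, 37, 41, 36)
group.b <- c(39, 38, 40, 41, 39, 38, 37, 40)

res <- two_sample_t(mean(group.a), sd(group.a), length(group.a),
                    mean(group.b), sd(group.b), length(group.b))
welch <- t.test(group.a, group.b)            # var.equal = FALSE by default

print(res)
```

The conservative degrees of freedom never exceed the Welch value, so the conservative p-value is at least as large as the one t.test() reports.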

Conclusion
The data do not provide evidence of a difference in average pregnancy length between smokers and non-smokers.

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
H0:μ1−μ2=0

HA:μ1−μ2>0

Here μ1 is the mean weight gained by younger mothers and μ2 is the mean weight gained by mature mothers.

Test by confidence interval or p-value and decision

y.mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
m.mothers <- subset(ncbirths, ncbirths$mature == "mature mom")


mean.y.mothers <- mean(y.mothers$gained, na.rm=TRUE)
mean.m.mothers <- mean(m.mothers$gained, na.rm=TRUE)

sd.y.mothers <- sd(y.mothers$gained, na.rm=TRUE)
sd.m.mothers <- sd(m.mothers$gained, na.rm=TRUE)

agediff.gained <- (mean.y.mothers - mean.m.mothers)


table(is.na(y.mothers$gained))

## 
## FALSE  TRUE 
##   844    23

## 844  
table(is.na(m.mothers$gained))

## 
## FALSE  TRUE 
##   129     4

##  129

SE.age.diff <- sqrt((sd.y.mothers^2/844)+(sd.m.mothers^2/129))

#Finding the test statistic
(agediff.gained - 0)/SE.age.diff

## [1] 1.376483

#Finding the p-value
pt(1.376483, df=128, lower.tail=FALSE)

## [1] 0.08553763
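Because the alternative here is one-sided ("more than"), only the upper tail counts toward the p-value. The sketch below takes the test statistic and degrees of freedom reported above and shows how the p-value would differ under each possible alternative; the variable names are illustrative.

```r
# Test statistic and degrees of freedom reported above
t.stat <- 1.376483
deg.free <- 128

p.greater <- pt(t.stat, df = deg.free, lower.tail = FALSE)  # HA: difference > 0
p.less <- pt(t.stat, df = deg.free)                         # HA: difference < 0
p.two.sided <- 2 * pt(-abs(t.stat), df = deg.free)          # HA: difference != 0

print(c(greater = p.greater, less = p.less, two.sided = p.two.sided))
```

For a positive test statistic, the upper-tail p-value is exactly half of the two-sided p-value, so the choice of alternative has to be made before looking at the data.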

Conclusion
Because the p-value (0.086) is greater than 0.05, we fail to reject the null hypothesis. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Is there a significant difference in the length of pregnancy between younger mothers and mature mothers?

Write hypotheses
H0:μ1−μ2=0

HA:μ1−μ2≠0

Here μ1 is the mean length of pregnancy for younger mothers and μ2 is the mean length of pregnancy for mature mothers.

Test by confidence interval or p-value and decision

Mature.mom <- subset(ncbirths, ncbirths$mature == "mature mom")
younger.mom <- subset(ncbirths, ncbirths$mature == "younger mom")
# Mean pregnancy length in weeks (m = mature mothers, y = younger mothers)
mean.m.length <- mean(Mature.mom$weeks, na.rm = TRUE)
mean.y.length <- mean(younger.mom$weeks, na.rm = TRUE)

sd.mature.length <- sd(Mature.mom$weeks,na.rm = TRUE)
sd.younger.length <-sd(younger.mom$weeks, na.rm = TRUE)

# Difference in means: younger minus mature
mean.diff.length <- (mean.y.length - mean.m.length)

table(is.na(Mature.mom$weeks))

## 
## FALSE  TRUE 
##   132     1

##132

table(is.na(younger.mom$weeks))

## 
## FALSE  TRUE 
##   866     1

##866

SE.age.length <- sqrt((sd.mature.length^2/132)+(sd.younger.length^2/866))


(mean.diff.length- 0)/ SE.age.length

## [1] 1.211299

pt(1.211299, df=131, lower.tail = FALSE)*2

## [1] 0.2279614

Conclusion

The p-value is 0.228, which is greater than alpha (0.05), so we fail to reject the null hypothesis. The data do not provide evidence of a difference in the length of pregnancy between younger moms and mature moms.
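The decision rule used throughout (reject H0 only when the p-value is below alpha) can be written once and applied to every p-value reported in this project. This is a minimal sketch; decide is a hypothetical helper name, and the p-values are copied from the outputs above.

```r
# Hypothetical helper (an assumption): decision rule at significance level alpha
decide <- function(p.value, alpha = 0.05) {
  ifelse(p.value < alpha, "reject H0", "fail to reject H0")
}

# p-values copied from the outputs of Questions 3, 4, 5, 6, and 8
p.values <- c(Q3 = 1.135354e-33, Q4 = 1.504649e-59, Q5 = 0.6046448,
              Q6 = 0.08553763, Q8 = 0.2279614)

decisions <- decide(p.values)
print(decisions)
```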

MAT143H - Introduction to Statistics Honors

Nathaniel Lilly

Due: Wednesday, April 4