Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mn <- mean(ncbirths$weight, na.rm = TRUE)
std <- sd(ncbirths$weight, na.rm = TRUE)
SAM <- sum(!is.na(ncbirths$weight))  # number of non-missing observations

The mean is 7.1, the standard deviation is 1.51, and the sample size is 1000.

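# Calculate t-critical value for 90% confidence
# (5% in each tail leaves 90% in the middle)
t <- abs(qt(p = 0.05, df = 999))

The t-critical value for 90% confidence is 1.65.

# Calculate margin of error (critical value times the standard error)
round(t * (std / sqrt(1000)), 3)

## [1] 0.079

The margin of error is 0.08.

# Boundaries of confidence interval
mn - t * std / sqrt(1000)

## [1] 7.022444

mn + t * std / sqrt(1000)

## [1] 7.179556

We are 90% confident that the population mean lies between 7.02 and 7.18 lbs, 7.02 < \(\mu\) < 7.18. Note that the calculation uses the weight variable (the baby's birth weight in pounds), so the interval describes the mean birth weight of North Carolina babies; the mother's weight gain is recorded in the gained variable.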
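The interval steps can be collected into a small helper. This is a minimal sketch in base R: the function name t_interval and its arguments are illustrative assumptions (not part of the assignment), and the inputs are the summary statistics reported in this section rather than the raw data.

```r
# Hypothetical helper: t-confidence interval from summary statistics.
t_interval <- function(xbar, s, n, conf = 0.90) {
  alpha  <- 1 - conf                       # total area left in both tails
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # upper-tail critical value
  moe    <- t_crit * s / sqrt(n)           # margin of error uses the sd, not the mean
  c(lower = xbar - moe, upper = xbar + moe, margin = moe)
}

# Summary statistics reported in this section (mean, sd, n)
ci90 <- t_interval(xbar = 7.101, s = 1.50886, n = 1000, conf = 0.90)
print(round(ci90, 3))
```

With the raw data loaded, t.test(ncbirths$weight, conf.level = 0.90)$conf.int should give a matching interval and can serve as a cross-check.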

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.
How does that confidence interval compare to the one in Question #1?

Question 1 uses a 90% t-confidence interval and Question 2 uses a 95% t-confidence interval. Increasing the confidence level increases the t-critical value, so the interval becomes wider.

# Calculate t-critical value for 95% confidence
t2<-abs(qt(p =0.025, df =999))

The t-critical value for 95% confidence is 1.962.

# Calculate margin of error (critical value times the standard error)
round(t2 * (std / sqrt(1000)), 3)

## [1] 0.094

The margin of error is 0.09.

# Boundaries of confidence interval
mn - t2*std/sqrt(1000)

## [1] 7.007368

mn + t2*std/sqrt(1000)

## [1] 7.194632

The 95% confidence interval is 7.01 < \(\mu\) < 7.19.
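To see how the confidence level drives the width, the margin of error can be computed for several levels at once. The sketch below uses only the standard deviation and sample size reported above; the variable names are illustrative.

```r
# Margin of error at several confidence levels, using the reported sd and n.
s <- 1.50886
n <- 1000
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values: (1 - level) / 2 in each tail
t_crit  <- qt(1 - (1 - conf_levels) / 2, df = n - 1)
margins <- t_crit * s / sqrt(n)

comparison <- data.frame(level = conf_levels, t_crit = t_crit, margin = margins)
print(round(comparison, 4))
```

Each increase in the confidence level gives a larger critical value and therefore a larger margin of error, while the center of the interval (the sample mean) stays the same.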

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\(H_0\): \(\mu = 7.7\)
\(H_A\): \(\mu \neq 7.7\)
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mean(ncbirths$weight, na.rm = TRUE)

## [1] 7.101

sd(ncbirths$weight, na.rm = TRUE)

## [1] 1.50886

table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

The mean is 7.1, the standard deviation is 1.51, and the sample size is 1000.

# Test statistic
tscore <- (mn - 7.7) / (std / sqrt(1000))

The test statistic is -12.55.

# Two-sided p-value: probability of a test statistic at least this extreme if H_0 is true
2 * pt(-abs(tscore), df = 999)

## [1] 1.135415e-33

The two-sided p-value is 1.135415e-33.

Conclusion: Since the p-value is far below the significance level (1.135415e-33 < 0.05), we reject \(H_0\). There is sufficient evidence that the average birthweight of NC babies differs from the European average of 7.7 lbs; the sample mean of 7.1 lbs is lower.
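The test can be reproduced from the summary statistics alone. The helper below is a minimal sketch; t_test_summary is a hypothetical name, and the inputs are the mean, standard deviation, and sample size reported above.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics.
t_test_summary <- function(xbar, s, n, mu0) {
  se     <- s / sqrt(n)                       # standard error of the mean
  t_stat <- (xbar - mu0) / se                 # distance from mu0 in standard errors
  p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # area in both tails
  list(statistic = t_stat, df = n - 1, p.value = p_val)
}

res <- t_test_summary(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
print(res$statistic)
print(res$p.value < 0.05)  # TRUE means reject H_0 at the 0.05 level
```

Using -abs(t_stat) makes the p-value calculation work whether the statistic is negative or positive. With the raw data, t.test(ncbirths$weight, mu = 7.7) is a convenient cross-check.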

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
Ho: \(\mu_d = 0\)
Ha: \(\mu_d \neq 0\)
Test by confidence interval or p-value and decision

# Store the age difference (mother minus father) for each birth
ncbirths$diff <- ncbirths$mage - ncbirths$fage

# Find mean and standard deviation of the sample differences in mothers' and fathers' ages
mf <- mean(ncbirths$diff, na.rm = TRUE)
smf <- sd(ncbirths$diff, na.rm = TRUE)

# Calculating the t-score

(mf - 0)/(smf/sqrt(1000))

## [1] -19.41001

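# Two-sided p-value from a t-distribution with 999 degrees of freedom

pt(-19.41001, df = 999)* 2

## [1] 1.840171e-71

The test statistic is -19.41 and the two-sided p-value is 1.840171e-71. Since the p-value is far below 0.05, we reject \(H_0\). Note that the calculation uses n = 1000 and df = 999; because na.rm = TRUE drops any birth with a missing age, n should be the number of non-missing differences. With a p-value this small, the decision is very unlikely to change.

Conclusion
There is sufficient evidence of a difference between the mean ages of mothers and fathers. The negative mean difference (mother minus father) indicates that mothers are younger than fathers on average.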
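When some pairs are incomplete, the sample size for a paired test is the number of non-missing differences. The sketch below illustrates this with hypothetical ages for six couples (the values are made up for illustration, not taken from ncbirths) and cross-checks the manual calculation against t.test().

```r
# Hypothetical ages for six couples; NA marks a missing father's age.
mage <- c(25, 31, 19, 40, 28, 33)
fage <- c(27, 35, NA, 41, 30, NA)

d   <- mage - fage
d   <- d[!is.na(d)]   # keep complete pairs only
n_d <- length(d)      # n is the number of non-missing differences

t_stat <- mean(d) / (sd(d) / sqrt(n_d))
p_val  <- 2 * pt(-abs(t_stat), df = n_d - 1)

# Cross-check against the built-in paired test, which also drops incomplete pairs
builtin <- t.test(mage, fage, paired = TRUE)
print(c(manual = t_stat, builtin = unname(builtin$statistic)))
```

Computing n from the data, rather than typing it in, keeps the standard error and the degrees of freedom consistent with the values that mean() and sd() actually used.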

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Test by confidence interval or p-value and decision

 # Smokers subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")

# Nonsmokers subset
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

# Mean difference in length of pregnancy

mean(smokers$weeks)

## [1] 38.44444

mean(nonsmokers$weeks,na.rm = TRUE)

## [1] 38.31881

meandiffB <- mean(smokers$weeks,na.rm = TRUE) - mean(nonsmokers$weeks,na.rm = TRUE)
  

# store standard error
SEB <- sqrt((sd(smokers$weeks,na.rm = TRUE)^2/126)+(sd(nonsmokers$weeks,na.rm = TRUE)^2/873))
  
# test statistic
(meandiffB -0)/SEB

## [1] 0.5190483

# p-value
pt(-0.519, df= 125)*2

## [1] 0.6046784

The p-value for the difference in pregnancy length between smokers and nonsmokers is 0.605. Since 0.605 > 0.05, we fail to reject \(H_0\).

Conclusion: The data do not provide sufficient evidence of a difference in mean pregnancy length between smoking and non-smoking mothers.
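The two-sample calculation can be sketched end to end as follows. The pregnancy lengths are hypothetical values chosen for illustration, and the degrees of freedom follow the conservative rule used above (the smaller sample size minus one).

```r
# Hypothetical pregnancy lengths (weeks) for two small groups, no missing values.
smoker_weeks    <- c(38, 40, 37, 39, 41, 36)
nonsmoker_weeks <- c(39, 40, 38, 41, 37, 39, 40, 38)

n1 <- length(smoker_weeks)
n2 <- length(nonsmoker_weeks)

# Standard error of the difference in means (unpooled variances)
se     <- sqrt(var(smoker_weeks) / n1 + var(nonsmoker_weeks) / n2)
t_stat <- (mean(smoker_weeks) - mean(nonsmoker_weeks)) / se

df_cons <- min(n1, n2) - 1                     # conservative degrees of freedom
p_val   <- 2 * pt(-abs(t_stat), df = df_cons)  # two-sided p-value

# t.test() computes the same statistic but uses the Welch degrees of freedom
welch <- t.test(smoker_weeks, nonsmoker_weeks)
print(c(t = t_stat, p_conservative = p_val, p_welch = welch$p.value))
```

The conservative degrees of freedom never exceed the Welch value, so the hand-calculated p-value is never smaller than the one reported by t.test().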

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
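\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 > 0\)
where \(\mu_1\) is the mean for younger mothers and \(\mu_2\) is the mean for mature mothers.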
Test by confidence interval or p-value and decision

 # Younger mom subset
Youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")

# Mature mom subset
Maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")

# Mean difference in the average weight.

mean(Youngermom$weight, na.rm = TRUE)

## [1] 7.097232

mean(Maturemom$weight,na.rm = TRUE)

## [1] 7.125564

meandiffB <- mean(Youngermom$weight,na.rm = TRUE)- mean(Maturemom$weight,na.rm = TRUE)
  
# store standard error
SEW <- sqrt((sd(Youngermom$weight,na.rm = TRUE)^2/867)+(sd(Maturemom$weight,na.rm = TRUE)^2/133))
  
# test statistic
(meandiffB -0)/SEW

## [1] -0.1858449

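# p-value (one-sided, upper tail, because H_A is "greater than")
round(pt(-0.1858449, df = 132, lower.tail = FALSE), 3)

## [1] 0.574

Since the p-value is greater than the significance level (0.574 > 0.05), we fail to reject \(H_0\).

Conclusion: The data do not provide sufficient evidence that the average for younger mothers is greater than the average for mature mothers. Note that the code above compares the weight variable (the baby's birth weight), whereas the mother's weight gain is recorded in the gained variable.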
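The direction of the alternative hypothesis determines which tail area is the p-value. The sketch below evaluates all three options for the test statistic and degrees of freedom reported above; the variable names are illustrative.

```r
t_stat <- -0.1858449  # test statistic reported above
deg_f  <- 132         # conservative degrees of freedom used above

p_upper <- pt(t_stat, df = deg_f, lower.tail = FALSE)  # H_A: difference > 0
p_lower <- pt(t_stat, df = deg_f)                      # H_A: difference < 0
p_two   <- 2 * pt(-abs(t_stat), df = deg_f)            # H_A: difference != 0

print(round(c(upper = p_upper, lower = p_lower, two_sided = p_two), 3))
```

For a "greater than" alternative, a negative test statistic can never give a small p-value, because the sample difference points in the opposite direction from the alternative.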

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

Test by confidence interval or p-value and decision

# Low birth weight subset
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")

# Not low birth weight subset
notlow <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

# Heaviest "low" baby and lightest "not low" baby

max(low$weight)

## [1] 5.5

min(notlow$weight)

## [1] 5.56

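The heaviest baby classified as “low” weighs 5.5 lbs and the lightest baby classified as “not low” weighs 5.56 lbs, so the cutoff lies between these two values: babies weighing 5.5 lbs or less are classified as “low”. Because the two groups do not overlap, this cutoff separates every baby in the dataset correctly. It is also consistent with the conventional low birth weight threshold of 2,500 g (about 5.5 lbs).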
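The same max/min approach can be sketched on a small made-up data frame. The values below are hypothetical and only reuse the ncbirths column names; the point is the check that the two classes do not overlap.

```r
# Hypothetical data with the same column names as ncbirths.
births <- data.frame(
  weight         = c(4.2, 5.5, 5.1, 5.56, 7.3, 6.8),
  lowbirthweight = c("low", "low", "low", "not low", "not low", "not low")
)

# Range of weights within each class
ranges <- tapply(births$weight, births$lowbirthweight, range)
print(ranges)

upper_low     <- max(births$weight[births$lowbirthweight == "low"])
lower_not_low <- min(births$weight[births$lowbirthweight == "not low"])

# The classes are cleanly separated only if every "low" weight is below every "not low" weight
separated <- upper_low < lower_not_low
print(c(upper_low = upper_low, lower_not_low = lower_not_low))
```

If separated were FALSE, the classification could not be explained by a single weight cutoff, so the check is worth running before reporting one.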

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
In the ncbirths dataset, is there a significant difference in mean birth weight (weight) between babies of “white” and “not white” mothers, as recorded in the “whitemom” column?
Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
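Hypotheses of this form can be answered with either a p-value or a confidence interval for \(\mu_1 - \mu_2\). The sketch below uses simulated data (the numbers, the data frame df_demo, and its column names are assumptions for illustration, not results from ncbirths) to show that the two approaches lead to the same decision.

```r
# Hypothetical data: a numerical response and a two-level categorical variable.
set.seed(1)
df_demo <- data.frame(
  y     = c(rnorm(30, mean = 7.2, sd = 1.5), rnorm(25, mean = 6.9, sd = 1.5)),
  group = rep(c("white", "not white"), times = c(30, 25))
)

# Welch two-sample t-test of H_0: mu_1 - mu_2 = 0 against a two-sided alternative
res <- t.test(y ~ group, data = df_demo, conf.level = 0.95)

print(res$p.value)
print(res$conf.int)

# Reject H_0 at the 0.05 level exactly when 0 lies outside the 95% interval
reject_by_p  <- res$p.value < 0.05
reject_by_ci <- res$conf.int[1] > 0 || res$conf.int[2] < 0
print(c(by_p_value = reject_by_p, by_interval = reject_by_ci))
```

The confidence interval adds information the p-value lacks: a range of plausible sizes for the difference between the two group means.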