In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless otherwise specified, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis by using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mn<-mean(ncbirths$weight, na.rm = TRUE)
std<-sd(ncbirths$weight, na.rm = TRUE)
n<-sum(!is.na(ncbirths$weight))
Using the weight variable, the mean is 7.10, the standard deviation is 1.51, and the sample size is 1000.
# Calculate t-critical value for 90% confidence (5% in each tail)
t1<-abs(qt(p =0.05, df =n - 1))
The t-critical value for 90% confidence is about 1.646.
# Calculate margin of error (critical value times standard error)
round(t1*(std/sqrt(n)), 3)
## [1] 0.079
The margin of error is about 0.079.
# Boundaries of confidence interval
round(mn - t1*std/sqrt(n), 2)
## [1] 7.02
round(mn + t1*std/sqrt(n), 2)
## [1] 7.18
We are 90% confident that the population mean of the weight variable for North Carolina births lies between 7.02 and 7.18 lbs: 7.02 < \(\mu\) < 7.18.
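The hand calculation of a t-confidence interval can be cross-checked against R's built-in t.test(). The following is a minimal sketch under stated assumptions: t_ci is a hypothetical helper and toy is made-up data, neither of which comes from the project.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                                   # drop missing values
  n <- length(x)                                      # usable observations
  se <- sd(x) / sqrt(n)                               # standard error of the mean
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # upper critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Made-up data for illustration only (not from ncbirths)
toy <- c(6.2, 7.5, NA, 8.1, 7.0, 6.8)

ci_manual  <- t_ci(toy, conf_level = 0.90)
ci_builtin <- t.test(toy, conf.level = 0.90)$conf.int

print(ci_manual)
print(ci_builtin)
```

For a 90% interval the tail probability passed to qt() is 0.05 on each side, which is why the helper uses 1 - (1 - conf_level) / 2 rather than the confidence level itself.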
Question 1 uses a 90% t-confidence interval and question 2 uses a 95% t-confidence interval. Increasing the confidence level makes the interval wider.
# Calculate t-critical value for 95% confidence
t2<-abs(qt(p =0.025, df =999))
The t-critical value for 95% confidence is 1.962
# Calculate margin of error (critical value times standard error)
round(t2*(std/sqrt(1000)), 3)
## [1] 0.094
The margin of error is about 0.094.
# Boundaries of confidence interval
mn - t2*std/sqrt(1000)
## [1] 7.007368
mn + t2*std/sqrt(1000)
## [1] 7.194632
We are 95% confident that the population mean lies between 7.01 and 7.19 lbs.
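To see how the confidence level drives the interval width, the sketch below recomputes the interval at several levels from the summary statistics reported in this document (mean 7.101, standard deviation 1.50886, n = 1000). The variable names and the extra 99% level are assumptions added for illustration.

```r
# Summary statistics as reported in this document
mn_rep  <- 7.101
std_rep <- 1.50886
n_rep   <- 1000

# Confidence levels to compare (99% is added here for illustration)
conf   <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - conf) / 2, df = n_rep - 1)  # two-sided critical values
me     <- t_crit * std_rep / sqrt(n_rep)          # margins of error

ci_table <- data.frame(
  conf   = conf,
  t_crit = t_crit,
  lower  = mn_rep - me,
  upper  = mn_rep + me,
  width  = 2 * me
)
print(round(ci_table, 3))
```

The width column grows with the confidence level because only the critical value changes; the standard error stays fixed.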
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
# Sample statistics (sample mean, standard deviation, and size)
mean(ncbirths$weight, na.rm = TRUE)
## [1] 7.101
sd(ncbirths$weight, na.rm = TRUE)
## [1] 1.50886
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
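The sample mean is 7.10 lbs, the standard deviation is 1.51 lbs, and the sample size is 1000.
# Test statistic for H_0: mu = 7.7 versus H_A: mu != 7.7
tscore<-(mn - 7.7)/(std/sqrt(1000))
The test statistic is about -12.55.
# Two-sided p-value: probability of a test statistic at least this extreme if H_0 is true
2*pt(-abs(tscore), df = 999)
## [1] 1.135415e-33
The p-value is 1.135415e-33, far below the 0.05 significance level, so we reject H_0. The data provide strong evidence that the average birthweight of NC babies differs from 7.7 lbs.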
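The same one-sample test can be reproduced from the summary statistics alone. This is a sketch, not the project's required method: t_test_summary is a hypothetical helper, and the inputs are the mean, standard deviation, and sample size reported above.

```r
# Hypothetical helper: one-sample, two-sided t-test from summary statistics
t_test_summary <- function(xbar, s, n, mu0) {
  se      <- s / sqrt(n)                        # standard error of the mean
  t_stat  <- (xbar - mu0) / se                  # test statistic
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
  list(t = t_stat, df = n - 1, p = p_value)
}

# Values reported in this document; 7.7 lbs is the European average from the prompt
res <- t_test_summary(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
print(res$t)
print(res$p)
```

Using -abs(t_stat) inside pt() keeps the two-sided p-value correct whether the statistic is negative or positive.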
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
#store ncbirths diff
ncbirths$diff <- ncbirths$mage - ncbirths$fage
# Find mean and standard deviation of sample differences of mothers and fathers age.
mf<-mean(ncbirths$diff,na.rm = TRUE)
smf<-sd(ncbirths$diff,na.rm = TRUE)
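# Number of complete pairs (rows where the age difference is not missing)
ndiff <- sum(!is.na(ncbirths$diff))
# Calculating the t-score using the number of complete pairs
tdiff <- (mf - 0)/(smf/sqrt(ndiff))
# Two-sided p-value from a t-distribution with ndiff - 1 degrees of freedom
pdiff <- 2*pt(-abs(tdiff), df = ndiff - 1)
The mean difference (mother's age minus father's age) is negative, the t-score is far from 0, and the two-sided p-value is far below 0.05. We therefore reject H_0 and conclude that there is a significant difference between the mean ages of mothers and fathers, with fathers older on average.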
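Because this is a paired comparison, the sample size is the number of rows where both ages are present, not the number of rows in the dataset. The sketch below illustrates this on made-up ages (the vectors mother_age and father_age are hypothetical, not taken from ncbirths) and checks the hand calculation against t.test() with paired = TRUE.

```r
# Made-up paired ages for illustration only; one father's age is missing
mother_age <- c(25, 31, 28, 35, 22, 30)
father_age <- c(27, 33, NA, 40, 24, 29)

# Differences and the number of complete pairs
d       <- mother_age - father_age
n_pairs <- sum(!is.na(d))

# Hand calculation of the paired t-statistic and two-sided p-value
t_manual <- mean(d, na.rm = TRUE) / (sd(d, na.rm = TRUE) / sqrt(n_pairs))
p_manual <- 2 * pt(-abs(t_manual), df = n_pairs - 1)

# Built-in paired t-test; incomplete pairs are dropped automatically
paired <- t.test(mother_age, father_age, paired = TRUE)

print(c(manual = t_manual, builtin = unname(paired$statistic)))
print(c(manual = p_manual, builtin = paired$p.value))
```

Dividing by the square root of the full row count instead of n_pairs would understate the standard error whenever differences are missing.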
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
# Smokers subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
# Nonsmokers subset
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# Mean difference in length of pregnancy
mean(smokers$weeks)
## [1] 38.44444
mean(nonsmokers$weeks,na.rm = TRUE)
## [1] 38.31881
meandiffB <- mean(smokers$weeks,na.rm = TRUE) - mean(nonsmokers$weeks,na.rm = TRUE)
# store standard error
SEB <- sqrt((sd(smokers$weeks,na.rm = TRUE)^2/126)+(sd(nonsmokers$weeks,na.rm = TRUE)^2/873))
# test statistic
(meandiffB -0)/SEB
## [1] 0.5190483
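# p-value (two-sided, using the conservative df = smaller group size - 1 = 125)
pt(-0.519, df= 125)*2
## [1] 0.6046784
The p-value for the difference in pregnancy length between smokers and nonsmokers is 0.605. Since 0.605 > 0.05, we fail to reject H_0: the data do not show a significant difference in mean length of pregnancy between the two groups.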
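The two-sample calculation above uses the unpooled standard error with a conservative degrees of freedom equal to the smaller group size minus one. The sketch below compares that approach with R's default Welch test on made-up data (smoker_weeks and nonsmoker_weeks are hypothetical vectors, not the project's subsets).

```r
# Made-up pregnancy lengths in weeks, for illustration only
smoker_weeks    <- c(38, 39, 37, 40, 36, 39)
nonsmoker_weeks <- c(39, 40, 38, 41, 39, 37, 40, 38)

n1 <- length(smoker_weeks)
n2 <- length(nonsmoker_weeks)

# Unpooled standard error and test statistic, as in the hand calculation
se       <- sqrt(var(smoker_weeks) / n1 + var(nonsmoker_weeks) / n2)
t_manual <- (mean(smoker_weeks) - mean(nonsmoker_weeks)) / se

# Conservative degrees of freedom: smaller group size minus one
df_conservative <- min(n1, n2) - 1
p_conservative  <- 2 * pt(-abs(t_manual), df = df_conservative)

# Welch test (R's default): same statistic, larger approximate df
welch <- t.test(smoker_weeks, nonsmoker_weeks)

print(c(t_manual = t_manual, t_welch = unname(welch$statistic)))
print(c(p_conservative = p_conservative, p_welch = welch$p.value))
```

The test statistic is identical in both approaches; only the degrees of freedom differ, so the conservative p-value is never smaller than the Welch p-value.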
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
Write hypotheses
\(H_0: \mu_{\text{younger}} - \mu_{\text{mature}} = 0\) \(H_A: \mu_{\text{younger}} - \mu_{\text{mature}} > 0\)
Test by confidence interval or p-value and decision
# Younger mom subset
Youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
# Mature mom subset
Maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
# Mean difference in the average weight
mean(Youngermom$weight, na.rm = TRUE)
## [1] 7.097232
mean(Maturemom$weight,na.rm = TRUE)
## [1] 7.125564
meandiffB <- mean(Youngermom$weight,na.rm = TRUE)- mean(Maturemom$weight,na.rm = TRUE)
# store standard error
SEW <- sqrt((sd(Youngermom$weight,na.rm = TRUE)^2/867)+(sd(Maturemom$weight,na.rm = TRUE)^2/133))
# test statistic
(meandiffB -0)/SEW
## [1] -0.1858449
# p-value (one-sided, upper tail, matching the "greater than" alternative)
round(pt(-0.1858449, df= 132, lower.tail = FALSE), 3)
## [1] 0.574
Since the p-value is greater than our significance level (0.574 > 0.05), we fail to reject \(H_0\).
Conclusion
The data do not provide evidence that the average weight for younger mothers is greater than the average weight for mature mothers.
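The direction of the alternative hypothesis decides which tail of the t-distribution gives the p-value. As a sketch, the snippet below computes the two-sided, upper-tail, and lower-tail p-values for the test statistic and degrees of freedom reported above; the variable names are assumptions added for illustration.

```r
# Test statistic and degrees of freedom as reported in this document
t_stat   <- -0.1858449
deg_free <- 132

# Two-sided: H_A says the difference is not 0
p_two_sided <- 2 * pt(-abs(t_stat), df = deg_free)

# Upper tail: H_A says the difference is greater than 0
p_upper <- pt(t_stat, df = deg_free, lower.tail = FALSE)

# Lower tail: H_A says the difference is less than 0
p_lower <- pt(t_stat, df = deg_free)

print(c(two_sided = p_two_sided, upper = p_upper, lower = p_lower))
```

A negative statistic tested against a "greater than" alternative always yields an upper-tail p-value above 0.5, so such a test cannot reject the null hypothesis.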
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
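# Low birth weight subset
low<- subset(ncbirths, ncbirths$lowbirthweight == "low")
# Not low birth weight subset
notlow<- subset(ncbirths, ncbirths$lowbirthweight == "not low")
# Find the heaviest "low" baby and the lightest "not low" baby
max(low$weight)
## [1] 5.5
min(notlow$weight)
## [1] 5.56
The heaviest baby classified as “low” weighs 5.5 lbs and the lightest baby classified as “not low” weighs 5.56 lbs, so the cutoff lies between 5.5 and 5.56 lbs. Because the two groups do not overlap, any baby weighing 5.5 lbs or less is “low” and any baby weighing 5.56 lbs or more is “not low”.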
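The cutoff can be located by comparing the largest weight in one category with the smallest weight in the other. The sketch below shows the idea on a small made-up data frame (births and its values are hypothetical; only the column names mirror ncbirths).

```r
# Made-up data for illustration only; column names mirror ncbirths
births <- data.frame(
  weight         = c(4.1, 5.5, 5.56, 6.3, 7.8, 3.2),
  lowbirthweight = c("low", "low", "not low", "not low", "not low", "low")
)

# Split the weights by category
by_group <- split(births$weight, births$lowbirthweight)

# Heaviest "low" baby and lightest "not low" baby bracket the cutoff
max_low    <- max(by_group[["low"]])
min_notlow <- min(by_group[["not low"]])

# The categories are cleanly separated only if the two ranges do not overlap
separated <- max_low < min_notlow
print(c(max_low = max_low, min_notlow = min_notlow))
print(separated)
```

If separated were FALSE, no single weight cutoff could explain the labels and the classification would have to depend on something else.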
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
In the ncbirths dataset, is there a significant difference between the “white” and “not white” groups of the “whitemom” column? Specifically, is the birth rate of the two groups similar or significantly different?
Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\)