Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless specifically mentioned, students will assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load the openintro library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

#Analyzing the ncbirths dataset
str(ncbirths)

## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

#data.frame(ncbirths)

#In this project, the data.frame call above has been commented out to keep the knitted document neater.

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mn_gained <- mean(ncbirths$gained, na.rm = TRUE)
sd_gained <- sd(ncbirths$gained, na.rm = TRUE)

#Counting how many NA values are recorded for the gained variable to find the accurate sample size
table(is.na(ncbirths$gained))
## 
## FALSE  TRUE 
##   973    27

ss_gained <- 973

# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))

## [1] 1.646423

# Calculate margin of error
1.646423 * sd_gained/sqrt(ss_gained)

## [1] 0.7516827

# Boundaries of confidence interval

mn_gained - 0.7516827

## [1] 29.57411

mn_gained + 0.7516827

## [1] 31.07748

We are 90% confident that the average weight gained during pregnancy by all North Carolina mothers is between 29.57 and 31.08 pounds.
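The manual steps above can be wrapped in a small helper and cross-checked against R's built-in t.test(). This is only a sketch: the function name t_ci is hypothetical, and the toy vector holds just the first ten gained values shown in the str() output, so the block runs without the openintro package.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                                   # drop missing values
  n <- length(x)                                      # sample size after dropping NAs
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # two-sided critical value
  moe <- t_crit * sd(x) / sqrt(n)                     # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

# First ten gained values from the str() output (a toy sample, not the full data)
x <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

manual <- t_ci(x, conf_level = 0.90)
builtin <- t.test(x, conf.level = 0.90)$conf.int      # t.test() also drops NAs
manual
builtin
```

With the full dataset loaded, t_ci(ncbirths$gained, 0.90) should follow the same steps as the hand calculation, without hard-coding the sample size or the critical value.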

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

#Calculating the t-critical value for the 95% confidence interval
abs(qt(.025, df=ss_gained-1))
## [1] 1.962408

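#Calculating the lower and upper bounds of the confidence interval
#Note: the multiplier 1.962341 used below is the critical value for df = 999, not the 1.962408 computed above for df = 972; the bounds agree to two decimal places either way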
mn_gained - 1.962341*sd_gained/sqrt(ss_gained)
## [1] 29.42988
mn_gained + 1.962341*sd_gained/sqrt(ss_gained)
## [1] 31.22171

We are 95% confident that the average weight gained during pregnancy by all North Carolina mothers is between 29.43 and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?

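The 95% confidence interval (29.43 to 31.22) is slightly wider than the 90% confidence interval (29.57 to 31.08). A higher confidence level requires a larger t-critical value, so the margin of error grows and the interval covers more values.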
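The widening can be seen directly from the t-critical values. The sketch below reuses df = 972 from Question 1; the 99% level is added only for illustration and is not part of the assignment.

```r
# t-critical values at df = 972 for several confidence levels
conf_levels <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - conf_levels) / 2, df = 972)
names(t_crit) <- c("90%", "95%", "99%")

# The 90% and 95% entries match the values computed in Questions 1 and 2
round(t_crit, 6)
```

Because the standard error is the same in each case, the margin of error scales directly with these multipliers.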

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses

\(H_0\): \(\mu\) = 7.7

\(H_A\): \(\mu\) \(\neq\) 7.7

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mn_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm=TRUE)
table(is.na(ncbirths$weight))
## 
## FALSE 
##  1000

The sample has a mean of 7.101, a standard deviation of approximately 1.509, and a sample size of 1000.

# Test statistic
(mn_weight - 7.7)/(sd_weight/sqrt(1000))

## [1] -12.55388

The t-score is -12.55388.

# Probability of test statistic occurring by chance
pt(-12.55388, df=999)*2

## [1] 1.135354e-33

Because the p-value (1.135354e-33) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

The data suggest that there is indeed a difference between the average birthweight of European babies and the average birthweight of babies in North Carolina.
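As a rough check, the same test can be run from the rounded summary statistics reported above (mean 7.101, standard deviation 1.509, n = 1000). The helper name t_test_summary is hypothetical, and because the inputs are rounded, the t-score will be close to, but not exactly, the -12.55388 obtained from the raw data.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics
t_test_summary <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)                       # standard error of the mean
  t_stat <- (xbar - mu0) / se             # test statistic
  p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  list(t = t_stat, df = n - 1, p_value = p_value)
}

# Rounded sample statistics from this question; mu0 is the European average
res <- t_test_summary(xbar = 7.101, s = 1.509, n = 1000, mu0 = 7.7)
res$t
res$p_value < 0.05
```

Using abs(t_stat) with lower.tail = FALSE gives the correct two-sided p-value whether the statistic is negative or positive.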

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses

\(H_0 : \mu_d = 0\)

\(H_A : \mu_d \neq 0\)

Test by confidence interval or p-value and decision

#Creating a column of differences in age between mothers and fathers
ncbirths$dif <- ncbirths$fage - ncbirths$mage

In the “dif” column of the ncbirths dataset, a positive value indicates a case where the mother is younger than the father, while a negative value indicates a case where the mother is older than the father.

#Finding how many N.A. values exist in the "dif" column of the ncbirths dataset
table(is.na(ncbirths$dif))
## 
## FALSE  TRUE 
##   829   171
#The sample size is 829.

#Calculating the test statistic
(mean(ncbirths$dif,na.rm=TRUE)-0)/(sd(ncbirths$dif,na.rm=TRUE)/sqrt(829))
## [1] 17.6727
#The t-score is 17.6727

#Finding the probability of getting the test statistic by chance
pt(17.6727, df=828, lower.tail=FALSE)*2
## [1] 1.504649e-59

Because the p-value (1.504649e-59) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.

Conclusion

The data suggest that there is indeed a significant difference between the average ages of mothers and fathers.
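A paired t-test is a one-sample t-test on the differences, which a short sketch can confirm. The vectors below are only the first ten fage and mage values from the str() output, used as a toy sample so the block runs on its own.

```r
# First ten fage and mage values from the str() output (toy sample)
fage <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)
mage <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)

dif <- fage - mage                    # NA wherever either age is missing
n <- sum(!is.na(dif))                 # number of complete pairs
t_manual <- mean(dif, na.rm = TRUE) / (sd(dif, na.rm = TRUE) / sqrt(n))

# Built-in paired test; incomplete pairs are dropped automatically
paired <- t.test(fage, mage, paired = TRUE)
t_manual
paired$statistic
```

Counting complete pairs with sum(!is.na(dif)) avoids typing the sample size by hand.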

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses

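\(H_0 : \mu_1 - \mu_2 = 0\)

\(H_A : \mu_1 - \mu_2 \neq 0\)

Here \(\mu_1\) is the mean length of pregnancy (in weeks) for smokers and \(\mu_2\) is the mean for nonsmokers.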

Test by confidence interval or p-value and decision

#Creating subsets for both smokers and nonsmokers
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

#Storing the mean lengths of pregnancies and standard deviations of pregnancy lengths for smokers and nonsmokers
mn_smoker <- mean(smokers$weeks,  na.rm = TRUE)
mn_nonsmoker <- mean(nonsmokers$weeks, na.rm=TRUE)

sd_smokers <- sd(smokers$weeks, na.rm=TRUE)
sd_nonsmokers <- sd(nonsmokers$weeks, na.rm=TRUE)

#Finding the sample size of each group (this counts mothers by habit and assumes no weeks values are missing within either group)
summary(ncbirths$habit)
## nonsmoker    smoker      NA's 
##       873       126         1

#Storing the standard error
SE <- sqrt((sd_smokers^2/126)+(sd_nonsmokers^2/873))

#Finding the test-statistic
(((mn_smoker)-(mn_nonsmoker))-0)/SE
## [1] 0.5190483

#Finding the two-sided p-value with the conservative degrees of freedom, min(126, 873) - 1 = 125
pt(0.5190483, df=125, lower.tail=FALSE)*2
## [1] 0.6046448

The p-value (0.605) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence to say that there is a significant difference between the number of weeks the pregnancies of smoking and nonsmoking mothers lasted.
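The test above uses the conservative degrees of freedom, min(n1, n2) - 1. R's t.test() instead uses the Welch-Satterthwaite approximation by default, which always gives at least as many degrees of freedom and therefore a p-value that is no larger. The sketch below compares the two. The group standard deviations are hypothetical placeholders, since their values are not printed above, so only the degrees of freedom depend on them. The test statistic and sample sizes are the ones reported in this question.

```r
# Welch-Satterthwaite degrees of freedom from group SDs and sample sizes
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1                      # variance of the first group mean
  v2 <- s2^2 / n2                      # variance of the second group mean
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

t_stat <- 0.5190483                    # test statistic reported above
n_smoker <- 126
n_nonsmoker <- 873
sd_smoker_assumed <- 3.0               # hypothetical value, for illustration only
sd_nonsmoker_assumed <- 2.8            # hypothetical value, for illustration only

df_conservative <- min(n_smoker, n_nonsmoker) - 1
df_welch <- welch_df(sd_smoker_assumed, n_smoker,
                     sd_nonsmoker_assumed, n_nonsmoker)

p_conservative <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)
p_welch <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)
c(conservative = p_conservative, welch = p_welch)
```

With the real data, t.test(weeks ~ habit, data = ncbirths) would carry out the Welch version directly and handle missing values in both variables.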

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses

\(H_0 : \mu_y \leq \mu_m\)

\(H_A : \mu_y > \mu_m\)

Test by confidence interval or p-value and decision

#Creating subsets for the weight gained by younger mothers and mature mothers
younger_mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
mature_mothers <- subset(ncbirths, ncbirths$mature == "mature mom")

#Storing the mean, standard deviation, and mean difference for the younger mothers and mature mothers subsets
mn_younger_mothers <- mean(younger_mothers$gained, na.rm=TRUE)
mn_mature_mothers <- mean(mature_mothers$gained, na.rm=TRUE)

sd_younger_mothers <- sd(younger_mothers$gained, na.rm=TRUE)
sd_mature_mothers <- sd(mature_mothers$gained, na.rm=TRUE)

mn_diff_gained <- (mn_younger_mothers - mn_mature_mothers)

#Finding the sample sizes of the younger mothers and mature mothers
table(is.na(younger_mothers$gained))
## 
## FALSE  TRUE 
##   844    23
table(is.na(mature_mothers$gained))
## 
## FALSE  TRUE 
##   129     4

#Storing the standard error
SE_gained <- sqrt((sd_younger_mothers^2/844)+(sd_mature_mothers^2/129))

#Finding the test statistic
(mn_diff_gained - 0)/SE_gained
## [1] 1.376483

#Finding the one-sided (upper-tail) p-value with the conservative degrees of freedom, min(844, 129) - 1 = 128
pt(1.376483, df=128, lower.tail=FALSE)
## [1] 0.08553763

The p-value (0.086) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.

Conclusion

There is not sufficient evidence that the average weight gained by younger mothers is greater than the average weight gained by mature mothers.
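Because the alternative hypothesis is one-sided, only the upper tail is used and the result is not doubled. The sketch below reuses the test statistic and degrees of freedom from above to show how the p-value depends on the direction of the alternative.

```r
t_stat <- 1.376483    # test statistic from this question
df <- 128             # conservative degrees of freedom used above

p_greater <- pt(t_stat, df = df, lower.tail = FALSE)  # H_A: mu_y > mu_m
p_less <- pt(t_stat, df = df)                         # H_A: mu_y < mu_m
p_two_sided <- 2 * min(p_greater, p_less)             # H_A: mu_y != mu_m

c(greater = p_greater, less = p_less, two_sided = p_two_sided)
```

All three p-values are above 0.05 here, so the decision would be the same under any of the three alternatives.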

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

#Creating subsets of low and not low birth weights
low_wt <- subset(ncbirths, ncbirths$lowbirthweight == "low" )
notlow_wt <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

#Using the summary function to observe values within the classifications of "low" and "not low" weights
summary(low_wt$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500
summary(notlow_wt$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

#Creating a dataframe to show the frequency of the birthweights in each classification.
data.frame(table(low_wt$weight))
##    Var1 Freq
## 1     1    2
## 2  1.19    1
## 3  1.31    1
## 4  1.38    3
## 5  1.44    2
## 6   1.5    2
## 7  1.56    1
## 8  1.63    1
## 9  1.69    2
## 10 1.88    1
## 11 2.19    1
## 12 2.25    2
## 13  2.5    1
## 14 2.63    1
## 15 2.69    2
## 16 2.88    3
## 17 2.94    1
## 18    3    1
## 19 3.19    1
## 20 3.25    1
## 21 3.31    1
## 22 3.44    1
## 23 3.56    1
## 24 3.63    2
## 25 3.75    2
## 26 3.81    1
## 27 3.94    2
## 28    4    2
## 29 4.06    2
## 30 4.13    2
## 31 4.19    1
## 32 4.25    1
## 33 4.31    1
## 34 4.44    3
## 35  4.5    3
## 36 4.56    4
## 37 4.63    2
## 38 4.69    4
## 39 4.75    6
## 40 4.88    1
## 41 4.94    2
## 42    5    4
## 43 5.06    3
## 44 5.13    2
## 45 5.19    2
## 46 5.25    4
## 47 5.38    8
## 48 5.44    7
## 49  5.5    7
data.frame(table(notlow_wt$weight))
##     Var1 Freq
## 1   5.56    5
## 2   5.63    8
## 3   5.69    4
## 4   5.75    4
## 5   5.81    9
## 6   5.88   12
## 7   5.94   15
## 8      6   15
## 9   6.06   10
## 10  6.13    6
## 11  6.19   11
## 12  6.25   16
## 13  6.31   15
## 14  6.38   16
## 15  6.44    8
## 16   6.5   15
## 17  6.56   13
## 18  6.63    9
## 19  6.69   13
## 20  6.75   23
## 21  6.81   12
## 22  6.88   25
## 23  6.94   14
## 24     7   22
## 25  7.06   20
## 26  7.13   24
## 27  7.19   21
## 28  7.25   22
## 29  7.31   24
## 30  7.38   21
## 31  7.44   30
## 32   7.5   24
## 33  7.56   19
## 34  7.63   18
## 35  7.69   17
## 36  7.75   17
## 37  7.81   20
## 38  7.88   25
## 39  7.94   16
## 40     8   19
## 41  8.06   13
## 42  8.13   17
## 43  8.19   17
## 44  8.25   16
## 45  8.31   12
## 46  8.38   20
## 47  8.44   14
## 48   8.5   15
## 49  8.56   11
## 50  8.63    5
## 51  8.69    4
## 52  8.75   14
## 53  8.81   12
## 54  8.88    9
## 55  8.94    4
## 56     9    8
## 57  9.06    6
## 58  9.13    7
## 59  9.19    7
## 60  9.25    6
## 61  9.31    5
## 62  9.38    2
## 63   9.5    4
## 64  9.56    3
## 65  9.63    3
## 66  9.69    1
## 67  9.75    2
## 68  9.81    1
## 69  9.88    4
## 70  9.94    1
## 71 10.06    2
## 72 10.13    2
## 73 10.19    1
## 74 10.25    1
## 75 10.38    1
## 76 11.63    1
## 77 11.75    1

By observing the summaries of the weights of babies classified as having “low” and “not low” birthweights, it can be seen that the maximum value of “low” birthweight babies is 5.500 pounds and the minimum weight for “not low” babies is 5.560. This implies that the cutoff of what classifies a baby’s weight as “low” or “not low” is somewhere at or between these two values.

I believe that 5.500 pounds is the cutoff value, such that babies at or below this weight are classified as having a “low” birthweight and any babies above this weight are classified as “not low”. I believe this answer is correct because, as seen in the frequency tables, the seven babies weighing 5.500 pounds were all grouped as having “low” birthweights, while the five weighing 5.560 pounds, only 0.06 pounds more, were all grouped as “not low”. This is a difference of less than one tenth of a pound, showing that the two values are very close yet fall on different sides of the cutoff. A cutoff value of 5.500 is exactly halfway between 5 and 6 pounds, making it a more natural choice of cutoff than other possible decimal values.
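The same comparison can be made without reading through the full frequency tables by comparing the largest “low” weight with the smallest “not low” weight. The data frame below is a toy example built from a few weights that appear in the tables above. The name births is hypothetical.

```r
# Toy data: a few weights taken from the frequency tables above
births <- data.frame(
  weight = c(4.56, 5.38, 5.44, 5.50, 5.56, 5.63, 6.00, 7.44),
  lowbirthweight = factor(c("low", "low", "low", "low",
                            "not low", "not low", "not low", "not low"))
)

max_low <- max(births$weight[births$lowbirthweight == "low"])
min_not_low <- min(births$weight[births$lowbirthweight == "not low"])

# TRUE means the two groups do not overlap, so a single cutoff separates them
no_overlap <- max_low < min_not_low
c(max_low = max_low, min_not_low = min_not_low)
```

Replacing births with ncbirths would apply the same check to the full dataset.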

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question

Is there a significant difference between the average weights of male and female babies?

Write hypotheses