Grando 5 Lab

if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA606/Week5/Lab/Lab5")
} else {
    setwd("~/Documents/Masters/DATA606/Week5/Lab/Lab5")
}
require(ggplot2)
## Loading required package: ggplot2
load("more/nc.RData")

Exercise 1 - What are the cases in this data set? How many cases are there in our sample?

Answer:

The cases are children born in the state of North Carolina. There are 1000 cases in the sample.

Exercise 2 - Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

Answer:

ggplot(nc, aes(y = weight, x = habit)) + geom_boxplot() + labs(x = "Smoking Habit", 
    y = "Weight") + ggtitle("Smoking Habit vs. Weight") + theme(plot.title = element_text(hjust = 0.5))

Some cases have NA recorded for habit, which adds an NA category to the plot, so I filter them out and plot again.

nc_filtered <- subset(nc, !(is.na(nc$habit)))
ggplot(nc_filtered, aes(y = weight, x = habit)) + geom_boxplot() + 
    labs(x = "Smoking Habit", y = "Weight") + ggtitle("Smoking Habit vs. Weight Without NAs") + 
    theme(plot.title = element_text(hjust = 0.5))

The plot highlights the difference in birthweight between children of smoking and non-smoking mothers. The median, Q1, and lower whisker limit for children born to smokers are all lower than the corresponding values for non-smokers. This suggests that the average weight of children born to smoking mothers may be lower than that of children born to non-smoking mothers.
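The NA filtering step can be checked on a small example. The sketch below uses a hypothetical toy data frame (`toy` is not part of the lab data) to show how `table()` reveals missing values and how `subset()` drops them.

```r
# Hypothetical toy data standing in for nc$habit and nc$weight
toy <- data.frame(
  habit  = factor(c("nonsmoker", "smoker", NA, "nonsmoker", "smoker")),
  weight = c(7.6, 6.9, 7.1, 8.0, 6.5)
)

# Count each level, including missing values
habit_counts <- table(toy$habit, useNA = "ifany")
print(habit_counts)

# Drop rows with a missing habit, mirroring how nc_filtered is built
toy_filtered <- subset(toy, !is.na(habit))
print(nrow(toy_filtered))
```

Running the same `table()` call on nc$habit should show how many cases the filter removes.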

Exercise 3 - Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.

The conditions necessary for inference are as follows:

  1. The sample observations are independent.

According to the description provided at the beginning of the lab, this is a random sample, so the observations can be treated as independent.

  2. The sample size is large (when not using a t-test).
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126
  3. The population distribution is not strongly skewed. Note, the larger the sample size, the more lenient we can be with the sample’s skew.
smoker_mean <- mean(subset(nc$weight, nc$habit == "smoker"))
smoker_sd <- sd(subset(nc$weight, nc$habit == "smoker"))
nonsmoker_mean <- mean(subset(nc$weight, nc$habit == "nonsmoker"))
nonsmoker_sd <- sd(subset(nc$weight, nc$habit == "nonsmoker"))
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(nc_filtered, aes(x = weight, fill = habit)) + geom_histogram(binwidth = 0.5, 
    alpha = 0.5, position = "identity", aes(y = after_stat(density))) + 
    stat_function(fun = dnorm, color = "blue", args = list(mean = smoker_mean, 
        sd = smoker_sd)) + stat_function(fun = dnorm, color = "red", 
    args = list(mean = nonsmoker_mean, sd = nonsmoker_sd))

qqnorm(subset(nc$weight, nc$habit == "smoker"))
qqline(subset(nc$weight, nc$habit == "smoker"))

qqnorm(subset(nc$weight, nc$habit == "nonsmoker"))
qqline(subset(nc$weight, nc$habit == "nonsmoker"))

Both distributions appear to be slightly to moderately left-skewed; however, the sample sizes (873 non-smokers and 126 smokers) are much larger than 30, so the conditions for inference appear to be met.
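The visual skew check can be backed up with a number. The sketch below defines a moment-based sample skewness (`sample_skewness` is a hypothetical helper, not part of the lab) and tries it on simulated data. Negative values indicate a left skew and values near zero indicate symmetry.

```r
# Moment-based sample skewness: third central moment over sd cubed
sample_skewness <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(606)
left_skewed <- 10 - rexp(500)            # simulated, long tail to the left
symmetric   <- rnorm(500, mean = 7, sd = 1.5)

sample_skewness(left_skewed)             # negative for a left skew
sample_skewness(symmetric)               # close to zero
```

In the lab, the same helper could be applied to the weights of each group, for example `sample_skewness(nc_filtered$weight[nc_filtered$habit == "smoker"])`.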

Exercise 4 - Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

Answer:

\[\begin{aligned} H_0 &: \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} = 0 \\ H_A &: \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} \neq 0 \end{aligned}\]

Exercise 5 - Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.

Answer:

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", 
    null = 0, alternative = "twosided", method = "theoretical", 
    order = c("nonsmoker", "smoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )

We are 95% confident that the average birthweight of children born to non-smoking mothers is between 0.0534 and 0.5777 pounds more than the average birthweight of children born to smoking mothers.
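As a check, the interval can be reproduced by hand from the summary statistics in the output. This is a minimal sketch that assumes the inference function uses a normal critical value with an unpooled standard error; the variable names are my own.

```r
# Summary statistics copied from the inference() output above
n1 <- 873; mean1 <- 7.1443; sd1 <- 1.5187   # nonsmoker
n2 <- 126; mean2 <- 6.8287; sd2 <- 1.3862   # smoker

diff_means <- mean1 - mean2
se <- sqrt(sd1^2 / n1 + sd2^2 / n2)          # unpooled standard error
z_star <- qnorm(0.975)                       # about 1.96 for 95% confidence

ci <- diff_means + c(-1, 1) * z_star * se
round(se, 4)
round(ci, 4)
```

The result may differ from the reported interval in the last digit because the summary statistics are already rounded.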

Question 1 - Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.

Answer:

The null argument can also be omitted, since a confidence interval for a single mean does not test a hypothesized value.

inference(y = nc$weeks, est = "mean", type = "ci", alternative = "twosided", 
    method = "theoretical", conflevel = 0.95)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

We are 95% confident that the average length of pregnancy for the population is between 38.1528 and 38.5165 weeks.

Question 2 - Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.

Answer:

inference(y = nc$weeks, est = "mean", type = "ci", alternative = "twosided", 
    method = "theoretical", conflevel = 0.9)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )

We are 90% confident that the average length of pregnancy for the population is between 38.182 and 38.4873 weeks. As expected, this interval is narrower than the 95% interval.
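The effect of the confidence level can be seen by computing both intervals from the reported summary statistics. The sketch below assumes a normal-approximation interval; `mean_ci` is a hypothetical helper written for illustration.

```r
# Summary statistics copied from the inference() output above
x_bar <- 38.3347; s <- 2.9316; n <- 998

# Normal-approximation interval for a single mean (assumed method)
mean_ci <- function(x_bar, s, n, conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  x_bar + c(-1, 1) * z_star * s / sqrt(n)
}

ci_95 <- mean_ci(x_bar, s, n, 0.95)
ci_90 <- mean_ci(x_bar, s, n, 0.90)

diff(ci_95)   # width of the 95% interval
diff(ci_90)   # width of the 90% interval, which is narrower
```

Lowering the confidence level shrinks the critical value, so the interval narrows while the center stays at the sample mean.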

Question 3 - Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

Answer:

inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", 
    null = 0, alternative = "twosided", method = "theoretical", 
    order = c("mature mom", "younger mom"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
## p-value =  0.1686

With a p-value of 0.1686, which is above the conventional 0.05 significance level, there is not sufficient evidence to reject the null hypothesis. We therefore fail to reject the hypothesis that younger and mature mothers gain the same average weight.
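The test statistic and p-value can be reproduced from the summary statistics in the output. This is a sketch of a two-sided Z test with an unpooled standard error, which is what the output suggests; the variable names are my own.

```r
# Summary statistics copied from the inference() output above
n_mature <- 129; mean_mature <- 28.7907; sd_mature <- 13.4824
n_young  <- 844; mean_young  <- 30.5604; sd_young  <- 14.3469

null_value <- 0
se <- sqrt(sd_mature^2 / n_mature + sd_young^2 / n_young)
z  <- (mean_mature - mean_young - null_value) / se

# Two-sided p-value: probability in both tails beyond |z|
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```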

Question 4 - Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

Answer:

We can find the cutoff by taking the maximum age of younger mothers and the minimum age of mature mothers:

max(subset(nc$mage, nc$mature == "younger mom"), na.rm = TRUE)
## [1] 34
min(subset(nc$mage, nc$mature == "mature mom"), na.rm = TRUE)
## [1] 35

The cutoff appears to be that any mother who gives birth at 34 or younger is considered a younger mother, while a mother who gives birth at 35 or older is considered a mature mother. I assume the recorded age is truncated (rounded down) to whole years.
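The same idea can be written as a single call that returns the age range of each group. The sketch below uses a hypothetical toy data frame (`toy` is made up for illustration) rather than the lab data.

```r
# Hypothetical toy data standing in for nc$mage and nc$mature
toy <- data.frame(
  mage   = c(19, 27, 34, 35, 41, 38, 22),
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "mature mom", "younger mom")
)

# Split ages by group, then take the min and max of each group
age_ranges <- sapply(split(toy$mage, toy$mature), range, na.rm = TRUE)
rownames(age_ranges) <- c("min", "max")
age_ranges
```

If the maximum of one group sits directly below the minimum of the other, the boundary between them is the cutoff. On the lab data the call would be `sapply(split(nc$mage, nc$mature), range, na.rm = TRUE)`.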

Question 5 - Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.

Answer:

Research Question - Is the average age of the father different for premature (premie) and full-term births?

\[\begin{aligned} H_0 &: \mu_{\text{father's age} \mid \text{full term}} - \mu_{\text{father's age} \mid \text{premie}} = 0 \\ H_A &: \mu_{\text{father's age} \mid \text{full term}} - \mu_{\text{father's age} \mid \text{premie}} \neq 0 \end{aligned}\]

inference(y = nc$fage, x = nc$premie, est = "mean", type = "ht", 
    null = 0, alternative = "twosided", method = "theoretical", 
    order = c("full term", "premie"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 714, mean_full term = 30.2423, sd_full term = 6.6329
## n_premie = 114, mean_premie = 30.3158, sd_premie = 7.5859
## Observed difference between means (full term-premie) = -0.0735
## 
## H0: mu_full term - mu_premie = 0 
## HA: mu_full term - mu_premie != 0 
## Standard error = 0.753 
## Test statistic: Z =  -0.098 
## p-value =  0.9222

With a p-value of 0.9222, there is not sufficient evidence to reject the null hypothesis. We therefore fail to reject the hypothesis that there is no difference in average age between the fathers of premie and full-term children.

inference(y = nc$fage, x = nc$premie, est = "mean", type = "ci", 
    null = 0, alternative = "twosided", method = "theoretical", 
    order = c("full term", "premie"), conflevel = 0.95)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 714, mean_full term = 30.2423, sd_full term = 6.6329
## n_premie = 114, mean_premie = 30.3158, sd_premie = 7.5859

## Observed difference between means (full term-premie) = -0.0735
## 
## Standard error = 0.7526 
## 95 % Confidence interval = ( -1.5486 , 1.4016 )

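We are 95% confident that the difference in the average father's age (full term minus premie) is between -1.5486 and 1.4016 years. Because this interval contains 0, it agrees with the hypothesis test: the data do not show a difference in the average age of fathers between the two groups.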
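The link between the test and the interval can be seen by working both from the same standard error. The following is a sketch using the rounded values reported in the output, and it assumes a normal critical value of 1.96:

\[\begin{aligned} SE &= \sqrt{\frac{6.6329^2}{714} + \frac{7.5859^2}{114}} \approx 0.7526 \\ Z &= \frac{-0.0735 - 0}{0.7526} \approx -0.098 \\ \text{CI} &= -0.0735 \pm 1.96 \times 0.7526 \approx (-1.5486,\ 1.4016) \end{aligned}\]

The observed difference is less than a tenth of a standard error away from 0, which is why the p-value is large and the interval is centered almost exactly on 0.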