if (Sys.info()["sysname"] == "Windows") {
setwd("~/Masters/DATA606/Week5/Lab/Lab5")
} else {
setwd("~/Documents/Masters/DATA606/Week5/Lab/Lab5")
}
require(ggplot2)
## Loading required package: ggplot2
load("more/nc.RData")
Answer:
The cases are children born in the state of North Carolina. There are 1000 cases in the sample.
Answer:
ggplot(nc, aes(y = weight, x = habit)) + geom_boxplot() + labs(x = "Smoking Habit",
y = "Weight") + ggtitle("Smoking Habit vs. Weight") + theme(plot.title = element_text(hjust = 0.5))
One of the 1000 cases has NA recorded in the habit attribute (the two group counts further down sum to 999), so I will filter it out before plotting again.
nc_filtered <- subset(nc, !(is.na(nc$habit)))
ggplot(nc_filtered, aes(y = weight, x = habit)) + geom_boxplot() +
labs(x = "Smoking Habit", y = "Weight") + ggtitle("Smoking Habit vs. Weight Without NAs") +
theme(plot.title = element_text(hjust = 0.5))
The plot shows how birthweight differs with the smoking habit of the mother. The median, Q1, and lower whisker limit for children born to smokers are all below the corresponding values for nonsmokers, which suggests that the average birthweight of children born to smoking mothers may be lower than that of children born to nonsmoking mothers.
The conditions necessary for inference are as follows:
From the description provided at the beginning of the lab, this is a random sample, so the observations can be treated as independent.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
smoker_mean <- mean(subset(nc$weight, nc$habit == "smoker"))
smoker_sd <- sd(subset(nc$weight, nc$habit == "smoker"))
nonsmoker_mean <- mean(subset(nc$weight, nc$habit == "nonsmoker"))
nonsmoker_sd <- sd(subset(nc$weight, nc$habit == "nonsmoker"))
ggplot(nc_filtered, aes(x = weight, fill = habit)) + geom_histogram(binwidth = 0.5,
alpha = 0.5, position = "identity", aes(y = after_stat(density))) +
stat_function(fun = dnorm, color = "blue", args = list(mean = smoker_mean,
sd = smoker_sd)) + stat_function(fun = dnorm, color = "red",
args = list(mean = nonsmoker_mean, sd = nonsmoker_sd))
qqnorm(subset(nc$weight, nc$habit == "smoker"))
qqline(subset(nc$weight, nc$habit == "smoker"))
qqnorm(subset(nc$weight, nc$habit == "nonsmoker"))
qqline(subset(nc$weight, nc$habit == "nonsmoker"))
The distributions appear to be slightly to moderately left-skewed; however, the sample sizes are much larger than 30 so the conditions for inference appear to have been met.
Answer:
\[H_0: \mu_{nonsmoker} - \mu_{smoker} = 0 \\ H_A: \mu_{nonsmoker} - \mu_{smoker} \neq 0\]
Answer:
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci",
null = 0, alternative = "twosided", method = "theoretical",
order = c("nonsmoker", "smoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
We are 95% confident that the average birthweight for children born to nonsmoking mothers is between 0.0534 and 0.5777 pounds more than the average birthweight for children born to smoking mothers.
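As a rough check, the interval can be reproduced by hand from the summary statistics printed in the output. The sketch below assumes that the theoretical method uses the unpooled standard error and a normal critical value; the variable names are illustrative and not part of the lab.

```r
# Summary statistics copied from the inference() output
n_non <- 873; mean_non <- 7.1443; sd_non <- 1.5187
n_smk <- 126; mean_smk <- 6.8287; sd_smk <- 1.3862

# Unpooled standard error of the difference between two means
se <- sqrt(sd_non^2 / n_non + sd_smk^2 / n_smk)

# Normal critical value for a 95% interval (an assumption; a t value
# would be nearly identical at these sample sizes)
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error
diff_means <- mean_non - mean_smk
ci <- diff_means + c(-1, 1) * z_star * se

round(se, 4)
round(ci, 4)
```

Small differences in the last digit are expected because the inputs here are already rounded to four decimals.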
Answer:
Because this is a confidence interval for a single mean rather than a test of a difference of means, the null argument is not needed and has been removed.
inference(y = nc$weeks, est = "mean", type = "ci", alternative = "twosided",
method = "theoretical", conflevel = 0.95)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
We are 95% confident that the average length of pregnancy for the population is between 38.1528 and 38.5165 weeks.
Answer:
inference(y = nc$weeks, est = "mean", type = "ci", alternative = "twosided",
method = "theoretical", conflevel = 0.9)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
We are 90% confident that the average length of pregnancy for the population is between 38.182 and 38.4873 weeks.
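To see why the 90% interval is narrower than the 95% interval, both can be rebuilt from the same summary statistics; only the critical value changes. This is a minimal sketch assuming a normal critical value, and `ci_mean` is a hypothetical helper introduced here for illustration.

```r
# Summary statistics copied from the inference() output
x_bar <- 38.3347; s <- 2.9316; n <- 998

# Standard error of a single mean
se <- s / sqrt(n)

# Hypothetical helper: normal-based interval for a single mean
ci_mean <- function(conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  x_bar + c(-1, 1) * z_star * se
}

ci_95 <- ci_mean(0.95)
ci_90 <- ci_mean(0.90)

# Interval widths: the lower confidence level gives the smaller width
diff(ci_95)
diff(ci_90)
```

The point estimate and standard error are identical in both cases, so the trade-off is purely between confidence and precision.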
Answer:
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht",
null = 0, alternative = "twosided", method = "theoretical",
order = c("mature mom", "younger mom"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 1.286
## Test statistic: Z = -1.376
## p-value = 0.1686
The p-value of 0.1686 is larger than the conventional 0.05 significance level, so there is not sufficient evidence to reject the null hypothesis. We therefore fail to reject the hypothesis that there is no difference in average weight gained between younger and mature mothers.
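The test statistic and p-value can be checked by hand from the summary statistics in the output. The sketch below assumes the unpooled standard error and a two-sided normal p-value, which matches the Z statistic reported; the variable names are illustrative.

```r
# Summary statistics copied from the inference() output
n_mat <- 129; mean_mat <- 28.7907; sd_mat <- 13.4824
n_yng <- 844; mean_yng <- 30.5604; sd_yng <- 14.3469

# Unpooled standard error of the difference between two means
se <- sqrt(sd_mat^2 / n_mat + sd_yng^2 / n_yng)

# Z statistic: (observed difference - null value) / standard error
null_value <- 0
z <- ((mean_mat - mean_yng) - null_value) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```

Because the inputs are rounded, the results may differ from the reported values in the last digit.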
Answer:
We can find the cutoff by taking the maximum age of younger mothers and the minimum age of mature mothers:
max(subset(nc$mage, nc$mature == "younger mom"), na.rm = TRUE)
## [1] 34
min(subset(nc$mage, nc$mature == "mature mom"), na.rm = TRUE)
## [1] 35
The cutoff appears to be 35: a mother who gives birth at 34 or younger is classified as a younger mom, while a mother who gives birth at 35 or older is classified as a mature mom. My assumption is that ages are recorded in completed years (truncated, that is, rounded down).
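The two separate max and min calls can also be combined into a single per-group summary. The sketch below uses a small made-up data frame (`toy`, hypothetical values) in place of the lab data so that it runs on its own.

```r
# Hypothetical toy data standing in for nc$mage and nc$mature
toy <- data.frame(
  mage = c(19, 27, 34, 35, 41, NA),
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "younger mom")
)

# Minimum and maximum age within each group; na.rm drops missing ages
age_ranges <- tapply(toy$mage, toy$mature, range, na.rm = TRUE)
age_ranges
```

On the lab data the equivalent call would be `tapply(nc$mage, nc$mature, range, na.rm = TRUE)`; the cutoff lies between the maximum of the younger group and the minimum of the mature group.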
Answer:
Research question: Does the average age of the father differ between premature (premie) and full term births?
\[H_0: \mu_{\text{father's age} \mid \text{full term}} - \mu_{\text{father's age} \mid \text{premie}} = 0 \\ H_A: \mu_{\text{father's age} \mid \text{full term}} - \mu_{\text{father's age} \mid \text{premie}} \neq 0\]
inference(y = nc$fage, x = nc$premie, est = "mean", type = "ht",
null = 0, alternative = "twosided", method = "theoretical",
order = c("full term", "premie"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 714, mean_full term = 30.2423, sd_full term = 6.6329
## n_premie = 114, mean_premie = 30.3158, sd_premie = 7.5859
## Observed difference between means (full term-premie) = -0.0735
##
## H0: mu_full term - mu_premie = 0
## HA: mu_full term - mu_premie != 0
## Standard error = 0.753
## Test statistic: Z = -0.098
## p-value = 0.9222
With a p-value of 0.9222, there is not sufficient evidence to reject the null hypothesis. We therefore fail to reject the hypothesis that there is no difference in the average age of fathers between premie and full term births.
inference(y = nc$fage, x = nc$premie, est = "mean", type = "ci",
null = 0, alternative = "twosided", method = "theoretical",
order = c("full term", "premie"), conflevel = 0.95)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 714, mean_full term = 30.2423, sd_full term = 6.6329
## n_premie = 114, mean_premie = 30.3158, sd_premie = 7.5859
## Observed difference between means (full term-premie) = -0.0735
##
## Standard error = 0.7526
## 95 % Confidence interval = ( -1.5486 , 1.4016 )
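We are 95% confident that the difference in the average father's age between full term and premie births (full term minus premie) is between -1.5486 and 1.4016 years. Because this interval contains 0, it agrees with the hypothesis test result above.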