setwd("C:/Users/Robert/Documents/R/win-library/3.2/IS606/labs/Lab5")
load("more/nc.RData")
===============================================
Exercise 1
What are the cases in this data set? How many cases are there in our sample?
# number of cases (one row per case; each case is one recorded birth)
nrow(nc)
## [1] 1000
===============================================
Exercise 2
Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
ggplot(data=nc) + geom_boxplot(aes(x=habit, y=weight))
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## --------------------------------------------------------
## nc$habit: smoker
## [1] 6.82873
The boxplot suggests that babies born to smoking mothers tend to weigh less than babies born to non-smoking mothers, and the group means agree (6.83 for smokers versus 7.14 for nonsmokers). The nonsmoker group also shows a long run of low-weight outliers. This may reflect nothing more than its much larger share of the sample, since a larger group is more likely to include extreme values.
===============================================
Exercise 3
Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
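Both groups are large (873 nonsmokers and 126 smokers), so the skew and low-weight outliers visible in the boxplot above are not a serious concern for inference on the means. We also assume the births are a random sample, so that observations are independent both within and between the two groups.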
===============================================
Exercise 4
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
H0: The mean birth weights of babies born to smoking and non-smoking mothers are equal (mu_nonsmoker - mu_smoker = 0).
HA: The mean birth weights of the two groups differ (mu_nonsmoker - mu_smoker != 0).
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 3.2.4
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
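The reported standard error, test statistic, and p-value can be cross-checked by hand from the summary statistics printed above. The following is a minimal sketch, assuming the function uses the unpooled standard error and a standard normal reference distribution (the output labels the statistic Z). The variable names are illustrative and not part of the lab's code. Because the inputs are rounded to four decimals, the results should agree with the output only to about three decimals.

```r
# Summary statistics copied from the inference() output (rounded values)
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmoker group
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smoker group

# Unpooled standard error of the difference between two means
se <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)

# Test statistic for H0: mu_nonsmoker - mu_smoker = 0
z <- (mean_ns - mean_s) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```

Since the p-value is below 0.05, the null hypothesis is rejected at the 5% level: the data give evidence that mean birth weight differs between babies of smoking and non-smoking mothers.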
===============================================
Exercise 5
Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
library(BHH2)
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical",
order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
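As a check on the interval, the same bounds can be rebuilt from the observed difference and standard error in the output. This is a sketch under the assumption that the function uses a normal critical value; the variable names are illustrative only, and small differences in the last decimal come from rounding of the inputs.

```r
# Values copied from the inference() output (rounded)
diff_means <- -0.3155   # observed difference, smoker - nonsmoker
se <- 0.1338            # reported standard error

# Critical value for a 95% interval under the normal model (about 1.96)
z_star <- qnorm(0.975)

# Point estimate plus/minus the margin of error
ci <- diff_means + c(-1, 1) * z_star * se
round(ci, 4)
```

The interval lies entirely below zero, which is consistent with the two-sided test in Exercise 4. In context: we are 95% confident that babies born to smoking mothers weigh, on average, between about 0.05 and 0.58 less than babies born to non-smoking mothers, in the units of the weight variable.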
===============================================
1
Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
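Assuming the observations are independent and the sample is large enough for the sample mean to be nearly normal, we calculate the 95% confidence interval as follows. We are 95% confident that the average length of pregnancy in the population is between 38.15 and 38.52 weeks.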
#using the custom inference function
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
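The single-mean interval follows the usual form of point estimate plus or minus a critical value times the standard error. The sketch below recomputes it from the printed summary statistics, assuming a normal critical value; the variable names are illustrative, and the inputs are the rounded values from the output.

```r
# Summary statistics copied from the inference() output (rounded values)
x_bar <- 38.3347   # sample mean of weeks
s     <- 2.9316    # sample standard deviation
n     <- 998       # number of non-missing observations

# Standard error of the sample mean
se <- s / sqrt(n)

# 95% interval: mean +/- z* x SE
ci_95 <- x_bar + c(-1, 1) * qnorm(0.975) * se

round(c(se = se, lower = ci_95[1], upper = ci_95[2]), 4)
```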
===============================================
2
Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.
#using the custom inference function
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = 0.90)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
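The 90% interval is narrower than the 95% interval because only the critical value changes; the mean and standard error stay the same. The sketch below makes that visible by tabulating the critical value and bounds for several confidence levels. It assumes normal critical values, uses the rounded mean and standard error from the output, and adds a 99% row purely for illustration (that level is not part of the exercise).

```r
# Values copied from the inference() output (rounded)
x_bar <- 38.3347
se    <- 0.0928

# Confidence levels to compare; 0.99 is an extra, illustrative level
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical value for each level
z_star <- qnorm(1 - (1 - conf_levels) / 2)

intervals <- data.frame(
  conflevel = conf_levels,
  z_star    = round(z_star, 3),
  lower     = round(x_bar - z_star * se, 4),
  upper     = round(x_bar + z_star * se, 4)
)
intervals
```

Lower confidence buys a tighter interval, at the cost of a higher chance that the interval misses the true mean.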
===============================================
3
Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.
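The call below uses weight (the baby's birth weight) as the response, so it compares mean birth weight between the two age groups. The exercise asks about weight gained by the mother, which is recorded in the gained variable, and this analysis does not test it. With a p-value of 0.8526, well above 0.05, we fail to reject the null hypothesis: the data provide no evidence that mean birth weight differs between babies of younger and mature mothers.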
#using the custom inference function
inference(nc$weight, nc$mature, type="ht", est="mean",
null=0, method="theoretical", alternative="twosided")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 133, mean_mature mom = 7.1256, sd_mature mom = 1.6591
## n_younger mom = 867, mean_younger mom = 7.0972, sd_younger mom = 1.4855
## Observed difference between means (mature mom-younger mom) = 0.0283
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 0.152
## Test statistic: Z = 0.186
## p-value = 0.8526
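The exercise statement refers to weight gained by the mother, which the output above does not cover. A hedged sketch of how that comparison could be run is given here. It swaps in base R's t.test (Welch's unequal-variance t-test) in place of the lab's inference function, and it assumes the data set has a numeric column named gained alongside mature. The helper name test_gained_by_age is hypothetical. No results are shown because this analysis was not run in the report.

```r
# Hypothetical helper (not part of the lab): Welch two-sample t-test of
# weight gained during pregnancy by maternal age group.
# Assumes `data` has a numeric column `gained` and a two-level column `mature`.
test_gained_by_age <- function(data) {
  t.test(gained ~ mature, data = data, alternative = "two.sided")
}

# Run only when the lab's data set is loaded and has the assumed columns
if (exists("nc") && all(c("gained", "mature") %in% names(nc))) {
  print(test_gained_by_age(nc))
}
```

With group sizes as large as those in this sample, the t-based p-value should be very close to one based on the normal distribution.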
===============================================
4
Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
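The method subsets the data by the mature variable and takes the minimum and maximum of mage (mother’s age) within each group. Younger mothers range from 13 to 34 years old and mature mothers from 35 to 50, so the cutoff is 35: mothers aged 35 or older are classified as mature.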
#using a subset
#younger mother age range
range(subset(nc, mature == 'younger mom')$mage)
## [1] 13 34
#mature mother age range
range(subset(nc, mature != 'younger mom')$mage)
## [1] 35 50
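The same subset-and-range idea can be written in one step with tapply, which applies a function to each group. The sketch below runs on a small made-up data frame so that it is self-contained; only the column names mage and mature mirror the lab's data, and the ages are invented for illustration.

```r
# Toy data frame with made-up ages; only the column names mirror nc
toy <- data.frame(
  mage   = c(19, 24, 34, 35, 41, 28),
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "younger mom")
)

# Minimum and maximum age within each group, computed in a single call
age_ranges <- tapply(toy$mage, toy$mature, range)
age_ranges

# The cutoff is the youngest age found in the mature group
cutoff <- min(toy$mage[toy$mature == "mature mom"])
cutoff
```

On the real data the equivalent call would be tapply(nc$mage, nc$mature, range), assuming mage has no missing values.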
===============================================
5
Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.
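Research Question: Does the average length of pregnancy (weeks) differ between younger and mature mothers?
H0: The mean length of pregnancy is the same for both groups (mu_mature mom - mu_younger mom = 0).
HA: The mean length of pregnancy differs between the groups (mu_mature mom - mu_younger mom != 0), two-sided.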
inference(nc$weeks, nc$mature, type="ht", est="mean",
null=0, method="theoretical", alternative="twosided")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 132, mean_mature mom = 38.0227, sd_mature mom = 3.2184
## n_younger mom = 866, mean_younger mom = 38.3822, sd_younger mom = 2.8844
## Observed difference between means (mature mom-younger mom) = -0.3595
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 0.297
## Test statistic: Z = -1.211
## p-value = 0.2258
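Given a p-value of 0.2258, which is above 0.05, we fail to reject the null hypothesis: the data provide no convincing evidence that the average length of pregnancy differs between younger and mature mothers. In plain language, the mature mothers in this sample had pregnancies about 0.36 weeks shorter on average, but a gap of that size could easily arise by chance.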
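A confidence interval for the difference gives the same answer from another angle. The sketch below builds an approximate 95% interval from the rounded summary statistics in the output, assuming the unpooled standard error and a normal critical value. It is a hand calculation, not output from the inference function, and the variable names are illustrative.

```r
# Summary statistics copied from the inference() output (rounded values)
diff_means <- -0.3595          # observed difference, mature mom - younger mom
n_m <- 132; sd_m <- 3.2184     # mature mothers
n_y <- 866; sd_y <- 2.8844     # younger mothers

# Unpooled standard error of the difference between two means
se <- sqrt(sd_m^2 / n_m + sd_y^2 / n_y)

# Approximate 95% interval for mu_mature mom - mu_younger mom
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se

# An interval that spans zero agrees with failing to reject H0
contains_zero <- ci[1] < 0 && ci[2] > 0

round(ci, 3)
contains_zero
```

Because zero falls inside the interval, a difference of zero weeks remains plausible, which matches the conclusion of the hypothesis test.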