download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
##Exercise 1: What are the cases in this data set? How many cases are there in our sample?
Each case in this data set is a birth recorded in North Carolina (NC). The sample contains 1000 cases, each measured on 13 variables.
Numerical: fage, mage, weeks, visits, gained, weight
Categorical: mature, premie, marital, lowbirthweight, gender, habit, whitemom
Outliers found in all numerical variables: fage, mage, weeks, visits, gained, weight
summary(nc)
##       fage          mage          mature            weeks          premie
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2
##  Mean   :30.26   Mean   :27                     Mean   :38.33
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00
##  Max.   :55.00   Max.   :50                     Max.   :45.00
##  NA's   :171                                    NA's   :2
##     visits          marital          gained           weight
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750
##  NA's   :9                        NA's   :27
##  lowbirthweight     gender         habit         whitemom
##  low    :111      female:503   nonsmoker:873   not white:284
##  not low:889      male  :497   smoker   :126   white    :714
##                                NA's     :  1   NA's     :  2
boxplot(nc$fage)
boxplot(nc$mage)
boxplot(nc$weeks)
boxplot(nc$visits)
boxplot(nc$gained)
boxplot(nc$weight)
##Exercise 2: Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
The summary statistics and the boxplot show that nonsmokers have slightly heavier babies than smokers, in both mean (7.144 vs. 6.829) and median (7.310 vs. 7.060); more statistics are needed to judge whether the difference is significant. The median difference is visible in the boxplots, even though the smoker group has a slightly higher frequency of outliers below the lower fence.
weight<-nc$weight
habit<-nc$habit
boxplot(weight~habit)
by(weight,habit,summary)
## habit: nonsmoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.440 7.310 7.144 8.060 11.750
## ------------------------------------------------------------
## habit: smoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.690 6.077 7.060 6.829 7.735 9.190
weight_sorn<-split(weight, habit)
weight_smoker<-weight_sorn$smoker
weight_nonsmoker<-weight_sorn$nonsmoker
# Smoker quartiles, copied from the by() summary above
q1st.s<-6.077
q3rd.s<-7.735
IQR.s<-(q3rd.s- q1st.s)
'upper fence smoker'
## [1] "upper fence smoker"
upper_fence<- q3rd.s + 1.5*(IQR.s);upper_fence
## [1] 10.222
'lower fence smoker'
## [1] "lower fence smoker"
lower_fence<- q1st.s - 1.5*(IQR.s);lower_fence
## [1] 3.59
# Nonsmoker quartiles, copied from the by() summary above
q1st.n<-6.44
q3rd.n<-8.06
IQR.n<-(q3rd.n- q1st.n)
'upper fence nonsmoker'
## [1] "upper fence nonsmoker"
upper_fence<- q3rd.n + 1.5*(IQR.n);upper_fence
## [1] 10.49
'lower fence nonsmoker'
## [1] "lower fence nonsmoker"
lower_fence<- q1st.n - 1.5*(IQR.n);lower_fence
## [1] 4.01
19/873
## [1] 0.02176403
4/126
## [1] 0.03174603
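The fence arithmetic above can be wrapped in a small helper so the same 1.5 x IQR rule is applied to both groups. This is only a sketch: `tukey_fences` is a hypothetical name that does not come from the lab, and the quartiles are the rounded values copied from the `by()` output rather than recomputed from `nc`.

```r
# Hypothetical helper: Tukey fences from the first and third quartiles.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles as reported by by(weight, habit, summary)
fences_smoker    <- tukey_fences(6.077, 7.735)
fences_nonsmoker <- tukey_fences(6.44, 8.06)

print(round(fences_smoker, 3))
print(round(fences_nonsmoker, 3))
```

With the data loaded, an expression such as `sum(weight_smoker < fences_smoker["lower"])` would count the smoker weights that fall below the lower fence.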
##Exercise 3: Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
n(smoker) = 126, n(nonsmoker) = 873
n is large enough for both samples (greater than 30). The two samples are independent of one another and (relatively) random. Each n is less than 10% of the population.
n_nonsmoker<-length(weight_nonsmoker); n_nonsmoker
## [1] 873
n_smoker<- length(weight_smoker); n_smoker
## [1] 126
##Exercise 4: Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
H0: mu(weight of babies born to smokers) = mu(weight of babies born to nonsmokers)
HA: mu(weight of babies born to smokers) != mu(weight of babies born to nonsmokers)
or
H0: mu_nonsmoker - mu_smoker = 0
HA: mu_nonsmoker - mu_smoker != 0
inference(y = weight, x = habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
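As a cross-check, the standard error, test statistic, and p-value can be recomputed by hand from the summary statistics printed above. The sketch below assumes a normal reference distribution, since the output labels the statistic Z; the variable names are illustrative and are not part of the `inference` function.

```r
# Summary statistics copied from the inference() output above
n1 <- 873; mean1 <- 7.1443; sd1 <- 1.5187   # nonsmoker
n2 <- 126; mean2 <- 6.8287; sd2 <- 1.3862   # smoker

# Standard error of the difference between two independent means
se <- sqrt(sd1^2 / n1 + sd2^2 / n2)

# Z statistic under H0: mu_nonsmoker - mu_smoker = 0
z <- ((mean1 - mean2) - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```

The results should land close to the reported values; small discrepancies come from the rounding of the printed means and standard deviations.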
## Exercise 5: Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y = weight, x = habit, est = "mean", type = "ci",
alternative = "twosided", method = "theoretical",
order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
inference(y = weight, x = habit, est = "mean", type = "ci",
alternative = "twosided", method = "theoretical",
order = c("nonsmoker","smoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
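The two intervals above are mirror images of each other: reversing `order` flips the sign of the observed difference but leaves the standard error unchanged. A minimal sketch of that relationship, assuming a normal critical value and using the printed summary statistics (`ci_diff` is a hypothetical helper, not part of the lab):

```r
# Hypothetical helper: normal-based CI for a difference between two means
ci_diff <- function(mean_a, mean_b, se, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  (mean_a - mean_b) + c(-1, 1) * z_star * se
}

# Standard error from the printed group sizes and standard deviations
se <- sqrt(1.5187^2 / 873 + 1.3862^2 / 126)

ci_ns <- ci_diff(7.1443, 6.8287, se)  # nonsmoker - smoker
ci_sn <- ci_diff(6.8287, 7.1443, se)  # smoker - nonsmoker

print(round(ci_ns, 4))
print(round(ci_sn, 4))
```

Either ordering excludes 0, so both lead to the same conclusion about the difference in mean weights.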
##On your own:
###1: Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
##2. Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.
inference(y = nc$weeks, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical", conflevel = .90)
## Single mean
## Summary statistics:
## mean = 38.3347 ; sd = 2.9316 ; n = 998
## Standard error = 0.0928
## 90 % Confidence interval = ( 38.182 , 38.4873 )
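The only thing that changes between the 95% and 90% intervals is the critical value. The sketch below rebuilds both intervals from the printed mean, standard deviation, and sample size, assuming a normal critical value; `ci_mean` is an illustrative helper name.

```r
# Summary statistics copied from the inference() output above
xbar <- 38.3347
s    <- 2.9316
n    <- 998

se <- s / sqrt(n)

# Hypothetical helper: normal-based CI for a single mean
ci_mean <- function(conf) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  xbar + c(-1, 1) * z_star * se
}

ci_95 <- ci_mean(0.95)
ci_90 <- ci_mean(0.90)

print(round(ci_95, 4))
print(round(ci_90, 4))
```

Lowering the confidence level shrinks the critical value, so the 90% interval is narrower than the 95% interval and sits inside it.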
##3: Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.
With a p-value of 0.1686, there is no significant evidence that the average weight gained differs between younger and mature mothers.
inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
##
## H0: mu_mature mom - mu_younger mom = 0
## HA: mu_mature mom - mu_younger mom != 0
## Standard error = 1.286
## Test statistic: Z = -1.376
## p-value = 0.1686
##4. Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
I split ‘mage’ by the ‘mature’ variable, then ran the summary statistics to get the maximum age of ‘younger’ moms (max = 34) and the minimum age of ‘mature’ moms (min = 35). The cutoff is therefore 35: mothers aged 35 or older are classified as mature. This is consistent with the definition of geriatric pregnancy.
by(nc$mage, nc$mature, summary)
## nc$mature: mature mom
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 35.00 37.00 37.18 38.00 50.00
## ------------------------------------------------------------
## nc$mature: younger mom
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 21.00 25.00 25.44 30.00 34.00
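The same cutoff can be read off more directly by asking only for the range of ages within each group. The sketch below uses a small made-up vector in place of `nc$mage` and `nc$mature` so that it runs on its own; the toy ages are hypothetical and chosen only to mirror the 34/35 boundary seen above.

```r
# Hypothetical toy data standing in for nc$mage and nc$mature
mage   <- c(19, 24, 34, 35, 41, 28)
mature <- factor(c("younger mom", "younger mom", "younger mom",
                   "mature mom", "mature mom", "younger mom"))

# Minimum and maximum age within each group (one column per group)
age_range <- sapply(split(mage, mature), range)
rownames(age_range) <- c("min", "max")
print(age_range)

# The cutoff sits between the oldest younger mom and the youngest mature mom
oldest_younger  <- age_range["max", "younger mom"]
youngest_mature <- age_range["min", "mature mom"]
```

On the real data, replacing the toy vectors with `nc$mage` and `nc$mature` should reproduce the minimum and maximum values shown in the `by()` output.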
##5. Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.
H0: mu(age of mom for premie) = mu(age of mom for full-term)
HA: mu(age of mom for premie) != mu(age of mom for full-term)
With a p-value of 0.8266, there is no significant evidence that there is a difference in the effects of age of mother on premature birth.
Can I flip the way I state that conclusion, as I did above? Or do I need to say “With a p-value of 0.8266, there is no significant evidence that there is a difference in the effects of premature birth on the age of the mother”?
inference(y = nc$mage, x = nc$premie, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 27, sd_full term = 6.1444
## n_premie = 152, mean_premie = 26.875, sd_premie = 6.533
## Observed difference between means (full term-premie) = 0.125
##
## H0: mu_full term - mu_premie = 0
## HA: mu_full term - mu_premie != 0
## Standard error = 0.57
## Test statistic: Z = 0.219
## p-value = 0.8266
inference(y = nc$mage, x = nc$premie, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_full term = 846, mean_full term = 27, sd_full term = 6.1444
## n_premie = 152, mean_premie = 26.875, sd_premie = 6.533
## Observed difference between means (full term-premie) = 0.125
##
## Standard error = 0.5705
## 95 % Confidence interval = ( -0.9931 , 1.2431 )
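The hypothesis test and the confidence interval above tell the same story: a two-sided test at the 5% level fails to reject H0 exactly when the 95% interval contains 0. The sketch below checks this from the printed summary statistics, assuming the normal approximation that the Z label in the output suggests; the variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_full <- 846; sd_full <- 6.1444   # full term
n_prem <- 152; sd_prem <- 6.533    # premie
diff_means <- 0.125                # full term - premie

se <- sqrt(sd_full^2 / n_full + sd_prem^2 / n_prem)

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pnorm(-abs(diff_means / se))
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se

# The two checks agree: 0 is inside the interval and p exceeds 0.05
contains_zero  <- ci[1] < 0 && ci[2] > 0
fail_to_reject <- p_value > 0.05

print(round(c(p_value = p_value, lower = ci[1], upper = ci[2]), 4))
print(c(contains_zero = contains_zero, fail_to_reject = fail_to_reject))
```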