load("more/nc.RData")

Exercise 1 : What are the cases in this data set? How many cases are there in our sample?

Each case in this data set is a birth recorded in the state of North Carolina.

There are 1000 cases in the sample.

Exercise 2 : Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

library(tidyverse)
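# Side-by-side boxplot of birth weight by the mother's smoking habit;
# columns are referenced directly from nc instead of copied into globals
ggplot(data = nc, aes(x = habit, y = weight)) + geom_boxplot()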

The boxplot suggests that babies born to smoking mothers tend to weigh somewhat less than babies born to non-smoking mothers. The non-smoker group also shows a large set of low-weight outliers. This may reflect nothing more than the much larger number of non-smoking mothers in the sample.

Exercise 3 : Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.

by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126

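The conditions appear to be satisfied. Assuming the births were sampled at random, the observations can be treated as independent both within and between groups. Both groups are large (873 non-smokers and 126 smokers), so the sampling distribution of the difference in means should be nearly normal despite the skew and outliers visible in the boxplot above.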

Exercise 4 : Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0 : The mean birth weight of babies born to smoking mothers equals the mean birth weight of babies born to non-smoking mothers (mu_nonsmoker - mu_smoker = 0).

HA : The two mean birth weights differ (mu_nonsmoker - mu_smoker != 0).

library(DATA606)
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
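
The standard error, test statistic, and p-value reported above can be reproduced by hand from the summary statistics. The following is a minimal sketch, assuming inference() uses the unpooled standard error and a normal (Z) reference distribution, as its output suggests. The variable names are illustrative, and the inputs are the rounded values printed above, so the results should agree only up to rounding.

```r
# Summary statistics copied from the inference() output above (rounded)
n_non <- 873; mean_non <- 7.1443; sd_non <- 1.5187
n_smk <- 126; mean_smk <- 6.8287; sd_smk <- 1.3862

# Observed difference (nonsmoker - smoker)
diff_means <- mean_non - mean_smk

# Unpooled standard error of the difference between two means
se <- sqrt(sd_non^2 / n_non + sd_smk^2 / n_smk)

# Z statistic under H0: difference = 0, and the two-sided p-value
z <- diff_means / se
p_value <- 2 * pnorm(-abs(z))

round(c(diff = diff_means, se = se, z = z, p = p_value), 4)
```

The p-value is below 0.05, so at the 5% level the null hypothesis of equal mean birth weights is rejected.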

Exercise 5 : Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", 
          order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
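
The interval above is the point estimate plus or minus a critical value times the standard error. A short sketch of that arithmetic, assuming a normal critical value (the variable names are illustrative, and the inputs are the rounded values from the output, so the last digit may differ):

```r
# Values copied from the inference() output above (rounded)
diff_means <- -0.3155   # smoker - nonsmoker
se <- 0.1338

# Critical value for a 95% interval under the normal approximation
z_star <- qnorm(0.975)

# Point estimate +/- margin of error
ci <- diff_means + c(-1, 1) * z_star * se
round(ci, 4)
```

Because the whole interval lies below zero, it is consistent with the hypothesis test in Exercise 4: the data suggest a lower mean birth weight for babies of smoking mothers.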

On your own

# Confidence intervals for the mean pregnancy length (weeks), using the custom inference function
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = 0.90)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
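
Both intervals above can be checked from the reported mean, standard deviation, and sample size. The helper below, ci_mean(), is a hypothetical function written for this illustration (it is not part of the DATA606 package) and assumes a normal critical value. Lowering the confidence level shrinks the critical value, which is why the 90% interval is narrower.

```r
# Hypothetical helper: normal-approximation confidence interval for one mean
ci_mean <- function(xbar, s, n, conf_level = 0.95) {
  se <- s / sqrt(n)                           # standard error of the mean
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
  xbar + c(-1, 1) * z_star * se
}

# Summary statistics copied from the output above (rounded)
ci_95 <- ci_mean(38.3347, 2.9316, 998)
ci_90 <- ci_mean(38.3347, 2.9316, 998, conf_level = 0.90)

round(rbind(ci_95, ci_90), 4)
```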

The test below gives a p-value of 0.8526, which is above 0.05, so we fail to reject the null hypothesis: the difference in mean birth weight between babies of mature and younger mothers is not statistically significant.

inference(nc$weight, nc$mature, type="ht", est="mean", 
          null=0, method="theoretical", alternative="twosided")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 133, mean_mature mom = 7.1256, sd_mature mom = 1.6591
## n_younger mom = 867, mean_younger mom = 7.0972, sd_younger mom = 1.4855
## Observed difference between means (mature mom-younger mom) = 0.0283
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.152 
## Test statistic: Z =  0.186 
## p-value =  0.8526

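To find the age cutoff, the data are subset into younger and mature mothers, and the minimum and maximum of mage (mother’s age) are computed for each group. Younger mothers are aged 13 to 34 and mature mothers 35 to 50, so the cutoff is 35.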

# range() returns the minimum and maximum in one call
range(subset(nc, mature == 'younger mom')$mage)
## [1] 13 34
range(subset(nc, mature != 'younger mom')$mage)
## [1] 35 50
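
The same group-wise minimum and maximum can be obtained in a single call with tapply(). The sketch below runs on a small made-up data frame, toy, which is a hypothetical stand-in that only borrows the column names mage and mature from nc; its values are not taken from the real data.

```r
# Hypothetical stand-in for nc (illustrative values only)
toy <- data.frame(
  mage   = c(13, 22, 34, 35, 41, 50),
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "mature mom")
)

# Apply range() to mage within each level of mature
age_ranges <- tapply(toy$mage, toy$mature, range)
age_ranges
```

With the real data, the equivalent call would be tapply(nc$mage, nc$mature, range).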

Research Question : Is there a difference between the length of pregnancy based upon age (mature vs younger)?

H0: The mean length of pregnancy (in weeks) is the same for mature and younger mothers.

HA: The mean length of pregnancy differs between mature and younger mothers (two-sided).

inference(nc$weeks, nc$mature, type="ht", est="mean", 
          null=0, method="theoretical", alternative="twosided")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 132, mean_mature mom = 38.0227, sd_mature mom = 3.2184
## n_younger mom = 866, mean_younger mom = 38.3822, sd_younger mom = 2.8844
## Observed difference between means (mature mom-younger mom) = -0.3595
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.297 
## Test statistic: Z =  -1.211 
## p-value =  0.2258

Given a p-value of 0.2258, we fail to reject the null hypothesis: the data do not provide evidence of a significant difference in mean pregnancy duration between younger and mature mothers.
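
As a sensitivity check, the same comparison can be run as a Welch t-test computed from the summary statistics, which replaces the normal reference distribution with a t distribution. This is a swapped-in alternative for illustration, not the method inference() reports, and the variable names are assumptions. With groups this large the two approaches should lead to the same conclusion.

```r
# Summary statistics copied from the inference() output above (rounded)
n_mat <- 132; mean_mat <- 38.0227; sd_mat <- 3.2184
n_yng <- 866; mean_yng <- 38.3822; sd_yng <- 2.8844

# Per-group variance of the sample mean
v_mat <- sd_mat^2 / n_mat
v_yng <- sd_yng^2 / n_yng

# Unpooled standard error and t statistic (mature mom - younger mom)
se <- sqrt(v_mat + v_yng)
t_stat <- (mean_mat - mean_yng) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_mat + v_yng)^2 / (v_mat^2 / (n_mat - 1) + v_yng^2 / (n_yng - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(se = se, t = t_stat, df = df, p = p_value), 4)
```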