Inference for numerical data

load("more/nc.RData")

Exercise 1 : What are the cases in this data set? How many cases are there in our sample?

This data set contains information on births recorded in the state of North Carolina; each case (row) is one birth.

There are 1000 cases in the sample.

Exercise 2 : Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

library(tidyverse)

# Side-by-side boxplot of birth weight by the mother's smoking habit
ggplot(data = nc, aes(x = habit, y = weight)) + geom_boxplot()

The boxplot suggests that babies born to smoking mothers tend to weigh somewhat less than babies born to non-smoking mothers. The non-smoker group also shows a large number of low-weight outliers. This may reflect nothing more than the much larger number of non-smoking mothers in the sample.

Exercise 3 : Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.

by(nc$weight, nc$habit, length)

## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126

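The boxplots above show roughly symmetric distributions with some low-weight outliers, and both groups are large (873 non-smokers and 126 smokers), so the sampling distribution of the difference in means can be treated as nearly normal. Assuming the births are a random sample, the observations are independent within and between the two groups.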

Exercise 4 : Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0 : The mean birth weight of babies born to non-smoking mothers equals the mean birth weight of babies born to smoking mothers (mu_nonsmoker - mu_smoker = 0).

HA : The two mean birth weights differ (mu_nonsmoker - mu_smoker != 0).

library(DATA606)

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
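
The standard error, test statistic, and p-value reported above can be reproduced by hand from the printed summary statistics. The sketch below assumes the normal (Z) approximation that the output reports; `two_sample_z` is a hypothetical helper name used only for illustration and is not part of the DATA606 package.

```r
# Hypothetical helper: two-sample Z test computed from summary statistics.
two_sample_z <- function(mean1, sd1, n1, mean2, sd2, n2, null = 0) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
  z  <- ((mean1 - mean2) - null) / se   # test statistic
  p  <- 2 * pnorm(-abs(z))              # two-sided p-value
  list(diff = mean1 - mean2, se = se, z = z, p_value = p)
}

# Summary statistics copied from the inference() output above
# (nonsmoker first, smoker second)
res <- two_sample_z(7.1443, 1.5187, 873, 6.8287, 1.3862, 126)
print(res)
```

The p-value is below the usual 0.05 threshold, so under these assumptions the data provide evidence that the two mean birth weights differ.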

Exercise 5 : Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", 
          order = c("smoker","nonsmoker"))

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
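
As a cross-check, the interval above is the observed difference plus or minus a critical value times the standard error. A minimal sketch from the printed summary statistics, assuming the normal approximation; `diff_means_ci` is a hypothetical helper name, not a DATA606 function.

```r
# Hypothetical helper: normal-approximation confidence interval
# for a difference between two means, from summary statistics.
diff_means_ci <- function(mean1, sd1, n1, mean2, sd2, n2, conf_level = 0.95) {
  se    <- sqrt(sd1^2 / n1 + sd2^2 / n2)      # standard error of the difference
  zstar <- qnorm(1 - (1 - conf_level) / 2)    # critical value (about 1.96 for 95%)
  (mean1 - mean2) + c(-1, 1) * zstar * se     # lower and upper bounds
}

# smoker minus nonsmoker, matching the order used in the call above
ci <- diff_means_ci(6.8287, 1.3862, 126, 7.1443, 1.5187, 873)
print(round(ci, 4))
```

Both bounds are negative, which is consistent with babies of smoking mothers weighing less on average than babies of non-smoking mothers.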

On your own

Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.

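We are 95% confident that the average length of pregnancy for births in North Carolina is between 38.15 and 38.52 weeks.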

# Using the inference() function from the DATA606 package
inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.

inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = 0.90)

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
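
Both single-mean intervals above can be recomputed from the reported mean, standard deviation, and sample size. The sketch below assumes the normal approximation; `mean_ci` is a hypothetical helper name. It also shows why the 90% interval is narrower: the critical value shrinks while the standard error stays the same.

```r
# Hypothetical helper: normal-approximation confidence interval for one mean.
mean_ci <- function(xbar, s, n, conf_level = 0.95) {
  se    <- s / sqrt(n)                        # standard error of the mean
  zstar <- qnorm(1 - (1 - conf_level) / 2)    # critical value for the chosen level
  xbar + c(-1, 1) * zstar * se                # lower and upper bounds
}

# Summary statistics for nc$weeks copied from the output above
ci95 <- mean_ci(38.3347, 2.9316, 998, conf_level = 0.95)
ci90 <- mean_ci(38.3347, 2.9316, 998, conf_level = 0.90)

print(round(ci95, 4))
print(round(ci90, 4))
```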

Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

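Note that the call below passes nc$weight (the baby's birth weight) rather than the mother's weight gain that the question asks about, so the result describes birth weight. The p-value of 0.8526 is greater than 0.05, so we fail to reject the null hypothesis: the difference in mean birth weight between mature and younger mothers is not statistically significant.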

inference(nc$weight, nc$mature, type="ht", est="mean", 
          null=0, method="theoretical", alternative="twosided")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 133, mean_mature mom = 7.1256, sd_mature mom = 1.6591
## n_younger mom = 867, mean_younger mom = 7.0972, sd_younger mom = 1.4855

## Observed difference between means (mature mom-younger mom) = 0.0283
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.152 
## Test statistic: Z =  0.186 
## p-value =  0.8526
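
The question asks about weight gained by the mother, while the call above uses nc$weight. The sketch below shows how the same comparison could be run on a weight-gain column. It assumes nc has a numeric column named gained, which is not shown anywhere in this report. The small data frame is made up purely so the snippet runs on its own, and base R's t.test() (a Welch t-test) is swapped in for inference().

```r
# Made-up toy data standing in for nc; the values are illustrative only.
toy <- data.frame(
  gained = c(30, 25, 41, 28, 35, NA, 22, 38, 27, 33, 29, 31),
  mature = rep(c("younger mom", "mature mom"), each = 6)
)

# Welch two-sample t-test of mean weight gained by group.
# Rows with a missing value are dropped by the formula interface.
res <- t.test(gained ~ mature, data = toy, alternative = "two.sided", mu = 0)
print(res$p.value)

# On the real data the calls would be (assuming the gained column exists):
# t.test(gained ~ mature, data = nc)
# inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0,
#           alternative = "twosided", method = "theoretical")
```

No result for the real data is reported here because that test was not run in this report.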

Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

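The method subsets the data into younger and mature mothers and then computes the minimum and maximum of mage (mother’s age) within each group. Younger mothers range from 13 to 34 and mature mothers from 35 to 50, so the cutoff is 35: mothers aged 35 or older are classified as mature.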

# Minimum and maximum age among younger mothers
range(subset(nc, mature == 'younger mom')$mage)

## [1] 13 34

# Minimum and maximum age among all other (mature) mothers
range(subset(nc, mature != 'younger mom')$mage)

## [1] 35 50
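
The same cutoff can be found in one step by computing the age range within each group. A minimal sketch using base R's tapply(); the toy data frame is made up so the snippet runs without nc.RData, with column names mirroring nc$mage and nc$mature.

```r
# Made-up toy data standing in for nc; the values are illustrative only.
toy <- data.frame(
  mage   = c(13, 22, 34, 35, 41, 50),
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "mature mom")
)

# Age range (min, max) within each level of mature
ranges <- tapply(toy$mage, toy$mature, range)
print(ranges)

# The cutoff is the smallest age in the mature group
cutoff <- ranges[["mature mom"]][1]
print(cutoff)

# On the real data: tapply(nc$mage, nc$mature, range)
```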

Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.

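Research Question : Is there a difference in the mean length of pregnancy (weeks) between mature and younger mothers?

H0: The mean length of pregnancy is the same for mature and younger mothers (mu_mature mom - mu_younger mom = 0).

HA: The mean length of pregnancy differs between the two groups, two-tailed (mu_mature mom - mu_younger mom != 0).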

inference(nc$weeks, nc$mature, type="ht", est="mean", 
          null=0, method="theoretical", alternative="twosided")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 132, mean_mature mom = 38.0227, sd_mature mom = 3.2184
## n_younger mom = 866, mean_younger mom = 38.3822, sd_younger mom = 2.8844

## Observed difference between means (mature mom-younger mom) = -0.3595
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.297 
## Test statistic: Z =  -1.211 
## p-value =  0.2258

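Given a p-value of 0.2258, which is greater than 0.05, we fail to reject the null hypothesis: the data do not provide evidence of a difference in mean pregnancy duration between younger and mature mothers. In plain language, mature mothers in this sample had pregnancies about 0.36 weeks shorter on average, but a gap of that size could easily arise by chance.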
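
A confidence interval gives the same answer from a different angle: a two-sided test at the 0.05 level fails to reject exactly when the 95% interval for the difference contains 0. A sketch from the summary statistics printed above, assuming the normal approximation; all variable names are illustrative.

```r
# Summary statistics for nc$weeks by group, copied from the output above
m_mature <- 38.0227; s_mature <- 3.2184; n_mature <- 132
m_young  <- 38.3822; s_young  <- 2.8844; n_young  <- 866

se <- sqrt(s_mature^2 / n_mature + s_young^2 / n_young)  # standard error
d  <- m_mature - m_young                                 # observed difference
ci <- d + c(-1, 1) * qnorm(0.975) * se                   # 95% interval

# TRUE when the interval covers 0, i.e. no significant difference at 0.05
contains_zero <- ci[1] <= 0 && 0 <= ci[2]
print(round(ci, 4))
print(contains_zero)
```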

James Kuruvilla

March 20, 2017
