Lab 5

library(dplyr); library(ggplot2)  # needed for %>%, group_by/summarise and the plots
load("lab5/more/nc.RData")

Exercises

Exercise 1

There are 1000 observations in the dataset. Each observation represents a birth record.

Exercise 2

nc[!is.na(nc$habit), ] %>%
  ggplot() + geom_boxplot(mapping = aes(x = habit, y = weight))

The median birth weight is higher for nonsmokers, and their whiskers stretch farther. The whole distribution looks shifted upward relative to smokers, which suggests the two groups have a similar spread but different centers.

Exercise 3

by(nc$weight, nc$habit, mean)

## nc$habit: nonsmoker
## [1] 7.144273
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 6.82873

nc[!is.na(nc$habit), ] %>%
  ggplot() + geom_histogram(mapping = aes(x = weight)) + facet_wrap('habit')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The samples are large enough to handle some skew, and the distributions are fairly normal. The samples were chosen randomly and represent less than 10% of the population, so we can assume independence. We can proceed.

Exercise 4

H0: There is no difference in mean birth weight between babies born to mothers who smoked and babies born to mothers who did not.

HA: There is some difference in mean birth weight between babies born to smokers and babies born to nonsmokers.

Exercise 5

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")

## Warning: package 'BHH2' was built under R version 3.4.2

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
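
The interval above can be reproduced by hand from the printed summary statistics. This is a minimal sketch, assuming inference() uses a normal critical value for the theoretical method. The variable names are my own, and the inputs are the rounded values from the output, so the last digit may differ slightly.

```r
# Summary statistics copied from the inference() output above
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmokers
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smokers

# Point estimate and standard error of the difference in means
diff_means <- mean_ns - mean_s
se <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)

# Assumption: normal critical value for a 95% interval
z_star <- qnorm(0.975)
ci <- diff_means + c(-1, 1) * z_star * se

round(se, 4)
round(ci, 4)
```

Both endpoints are above 0, so the interval is consistent with nonsmokers' babies being heavier on average.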

On Your Own

Question 1

inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

Question 2

inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = .9)

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
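
To see why the 90% interval is narrower than the 95% one, the two intervals can be rebuilt from the same mean, sd, and n. This is a sketch under the assumption that a normal critical value is used. ci_mean is a hypothetical helper, not part of the lab code.

```r
# Summary statistics copied from the output above
x_bar <- 38.3347; s <- 2.9316; n <- 998
se <- s / sqrt(n)

# Hypothetical helper: normal-based interval for a given confidence level
ci_mean <- function(conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  x_bar + c(-1, 1) * z_star * se
}

ci_95 <- ci_mean(0.95)
ci_90 <- ci_mean(0.90)

round(ci_95, 4)
round(ci_90, 4)
```

Only the critical value changes between the two calls. The standard error stays the same, so lowering the confidence level shrinks the margin of error.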

Question 3

inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469

## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
## p-value =  0.1686
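
The test statistic and p-value can be checked from the summary statistics. This is a rough sketch that assumes the same normal approximation that the Z label in the output suggests. The variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_m <- 129; mean_m <- 28.7907; sd_m <- 13.4824   # mature moms
n_y <- 844; mean_y <- 30.5604; sd_y <- 14.3469   # younger moms

# Standard error of the difference in means
se <- sqrt(sd_m^2 / n_m + sd_y^2 / n_y)

# Z statistic under H0: mu_mature - mu_younger = 0
z <- (mean_m - mean_y - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```

At the 5% significance level, a p-value near 0.17 is not small enough to reject H0, so these data do not show a difference in average weight gained between mature and younger moms.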

Question 4

nc[!is.na(nc$mage), ] %>%
  group_by(mature) %>%
  summarise(age_min = min(mage), age_max = max(mage))

I thought that by getting the oldest younger moms and the youngest mature moms, the age cutoff could be determined. The variable that matters here is the mother's age, mage. Summarising the father's age, fage, instead suggests a cutoff of 25 and shows “young moms” who are 48, which is strange only because those are the fathers' ages.

Question 5

Does the average number of hospital visits during pregnancy differ by the marital status of the couple?

inference(y = nc$visits, x = nc$marital, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_married = 380, mean_married = 10.9553, sd_married = 4.2408
## n_not married = 611, mean_not married = 12.82, sd_not married = 3.5883

## Observed difference between means (married-not married) = -1.8647
## 
## H0: mu_married - mu_not married = 0 
## HA: mu_married - mu_not married != 0 
## Standard error = 0.262 
## Test statistic: Z =  -7.13 
## p-value =  0

The test statistic is -7.13, and its magnitude easily exceeds what is necessary for statistical significance. On average, there is a difference in the number of hospital visits during pregnancy between married and unmarried couples.
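
The printed p-value of 0 is a rounded display, not an exact zero. As a small sketch (assuming the same normal approximation as the Z statistic in the output), the tail probability can be computed directly:

```r
# Test statistic copied from the inference() output above
z <- -7.13

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

# Scientific notation shows the value is tiny but strictly positive
format(p_value, scientific = TRUE, digits = 3)
```

Reporting this as "p < 0.001" is more accurate than reporting it as 0.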

My question now is: what is the confidence interval around this difference?

inference(y = nc$visits, x = nc$marital, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_married = 380, mean_married = 10.9553, sd_married = 4.2408
## n_not married = 611, mean_not married = 12.82, sd_not married = 3.5883

## Observed difference between means (married-not married) = -1.8647
## 
## Standard error = 0.2615 
## 95 % Confidence interval = ( -2.3773 , -1.3521 )