download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
Exercise 1 -
What are the cases in this data set? How many cases are there in our sample?
dim(nc)
## [1] 1000 13
Each case is one birth (a baby and its mother). There are 1000 cases in our sample, each described by 13 variables.
summary(nc)
## fage mage mature weeks
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00
## Median :30.00 Median :27 Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.00
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.00
## NA's : 2 Median :12.0 NA's : 1 Median :30.00
## Mean :12.1 Mean :30.33
## 3rd Qu.:15.0 3rd Qu.:38.00
## Max. :30.0 Max. :85.00
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.000 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.380 not low:889 male :497 smoker :126
## Median : 7.310 NA's : 1
## Mean : 7.101
## 3rd Qu.: 8.060
## Max. :11.750
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
hist(nc$fage,breaks = 20)

hist(nc$mage, breaks=20)

hist(nc$weeks,breaks=20)

hist(nc$visits,breaks=20)

hist(nc$gained,breaks=20)

hist(nc$weight, breaks = 20)

Exercise 2 -
Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
boxplot(weight~habit,data=nc, horizontal = FALSE)

The boxplot suggests that babies born to mothers who smoked tend to have lower birth weights. However, the distributions for smokers and nonsmokers overlap considerably, so the plot alone cannot tell us whether the difference is statistically significant.
by(nc$weight,nc$habit,mean)
## nc$habit: nonsmoker
## [1] 7.144273
## --------------------------------------------------------
## nc$habit: smoker
## [1] 6.82873
Exercise 3 -
Check if the conditions necessary for inference are satisfied. Note that you will need to obtain the sample sizes to check the conditions. You can compute the group sizes using the same by command above but replacing mean with length.
by(nc$weight,nc$habit,length)
## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
The sample was chosen randomly, so the observations can be treated as independent of one another.
Each group contains more than 30 observations (873 nonsmokers and 126 smokers), so the sample size condition is met.
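One caveat with the by(..., length) call: length counts every element, including missing values. The summary output shows no NA's for weight, so the counts above are unaffected, but counting only non-missing values is safer in general. A minimal sketch on a small made-up data frame (the toy data and the names toy, n_all, n_obs and size_ok are hypothetical, not part of nc):

```r
# Hypothetical toy data standing in for nc; the values are made up for illustration
toy <- data.frame(
  weight = c(7.1, 6.5, NA, 8.0, 6.9, 7.4),
  habit  = factor(c("nonsmoker", "smoker", "smoker",
                    "nonsmoker", "smoker", "nonsmoker"))
)

# length() counts the NA as well, so the smoker group looks larger than it is
n_all <- tapply(toy$weight, toy$habit, length)

# Counting only non-missing weights gives the usable group sizes
n_obs <- tapply(toy$weight, toy$habit, function(x) sum(!is.na(x)))

# Sample size condition used in this lab: more than 30 observations per group
size_ok <- n_obs > 30

print(n_all)
print(n_obs)
print(size_ok)
```

On the real data, the same tapply call with nc$weight and nc$habit should reproduce the group sizes reported above, since weight has no missing values.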
Exercise 4 -
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
H(null): mean(smoker) = mean(nonsmoker)
H(a): mean(smoker) != mean(nonsmoker) (two-sided)
inference(y=nc$weight, x=nc$habit,est="mean",type="ht",null=0,alternative = "twosided",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
##
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
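
The standard error, test statistic, and p-value reported above can be reproduced by hand from the printed summary statistics. The sketch below assumes independent groups and a normal approximation, as the Z statistic in the output suggests. The variable names are illustrative, and the inputs are the rounded values from the output, so the last digit may differ slightly from what inference() prints.

```r
# Summary statistics copied from the inference() output above (rounded values)
n_nonsmoker <- 873; mean_nonsmoker <- 7.1443; sd_nonsmoker <- 1.5187
n_smoker    <- 126; mean_smoker    <- 6.8287; sd_smoker    <- 1.3862

# Point estimate: difference between the sample means (nonsmoker - smoker)
diff_means <- mean_nonsmoker - mean_smoker

# Standard error of the difference between two independent sample means
se <- sqrt(sd_nonsmoker^2 / n_nonsmoker + sd_smoker^2 / n_smoker)

# Test statistic under H0: mu_nonsmoker - mu_smoker = 0
z <- (diff_means - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```

Since the p-value is below 0.05, we reject the null hypothesis at the 5% significance level: the data provide evidence that average birth weight differs between babies born to smoking and non-smoking mothers.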

Exercise 5 -
Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y=nc$weight, x=nc$habit,est="mean",type="ci",null=0,alternative = "twosided",method = "theoretical",order=c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
##
## Standard error = 0.1338
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
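
The interval can be checked by hand as point estimate plus or minus the critical value times the standard error. This is a sketch under the same normal approximation, using the rounded summary statistics printed above; the variable names are illustrative, and the result may differ from the inference() output in the last digit.

```r
# Summary statistics copied from the inference() output above (rounded values)
n_smoker    <- 126; mean_smoker    <- 6.8287; sd_smoker    <- 1.3862
n_nonsmoker <- 873; mean_nonsmoker <- 7.1443; sd_nonsmoker <- 1.5187

# Point estimate in the order used above (smoker - nonsmoker)
diff_means <- mean_smoker - mean_nonsmoker

# Standard error of the difference between two independent sample means
se <- sqrt(sd_smoker^2 / n_smoker + sd_nonsmoker^2 / n_nonsmoker)

# Critical value for a 95% interval from the standard normal distribution
z_star <- qnorm(0.975)

# Confidence interval: estimate -/+ margin of error
ci <- diff_means + c(-1, 1) * z_star * se

print(round(ci, 4))
```

The whole interval lies below zero, which agrees with the two-sided test in Exercise 4: we are 95% confident that babies born to smoking mothers weigh less on average than babies born to non-smoking mothers, by somewhere between about 0.05 and 0.58 in the units of the weight variable.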