Inference for Numerical Data

Exercise 1: What are the cases in this data set? How many cases are there in our sample?

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

Answer: Each case is a birth recorded in North Carolina, described by characteristics of the mother, the father, and the baby. There are 1,000 cases in our random sample.

Exercise 2: Make a side-by-side boxplot of weight by habit. What does the plot highlight about the relationship between these two variables?

Weight <- nc$weight 
Smoking_Habits <- nc$habit

boxplot(Weight ~ Smoking_Habits)

Answer: The boxplot above suggests that babies born to mothers who smoke tend to weigh less than babies born to nonsmoking mothers.

Exercise 3: Check whether the conditions necessary for inference are satisfied. You will need the sample size of each group to check them. You can compute the group sizes with the same by command used for the group means, replacing mean with length.

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126

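Answer: The conditions necessary for inference appear to be satisfied. The births are a random sample, so the observations can be treated as independent both within and between the two groups. Each group is also large (873 nonsmokers and 126 smokers), so the sampling distribution of the difference in sample means can be expected to be approximately normal, even if the birth weights themselves are somewhat skewed.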
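The sample-size part of this check can be sketched in a few lines using the group sizes reported above. The threshold of 30 is a common rule of thumb and is an assumption here, not something stated in the lab. The commented-out line shows one way the counts could be tabulated directly from nc, assuming it is loaded.

```r
# Group sizes copied from the by(..., length) output above
group_n <- c(nonsmoker = 873, smoker = 126)

# Rule-of-thumb check (assumed threshold): each group has at least 30 cases
min_n <- 30
size_ok <- all(group_n >= min_n)
size_ok

# The two groups account for 999 of the 1,000 cases, because habit
# has one missing value (see the NA's row in the summary output)
total_n <- sum(group_n)
total_n

# With nc loaded, the same counts, including the NA, could be obtained with:
# table(nc$habit, useNA = "ifany")
```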

Exercise 4: Write the hypotheses for testing whether the average weights of babies born to smoking and non-smoking mothers are different.

Answer: Let mu_nonsmoker and mu_smoker denote the population mean weights of babies born to nonsmoking and smoking mothers. The null hypothesis (H0): mu_nonsmoker - mu_smoker = 0; that is, there is no difference in the mean weights of newborn babies between smoking and nonsmoking mothers. The alternative hypothesis (HA): mu_nonsmoker - mu_smoker != 0; that is, the mean weights of newborn babies differ between smoking and nonsmoking mothers. Hypotheses are statements about the population means, not about the evidence in the sample.

Exercise 5: Change the type argument to "ci" to construct and record a confidence interval for the difference between the mean weights of babies born to smoking and non-smoking mothers.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 4.0.4
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
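
The test output above can be reproduced by hand from its summary statistics, which may help clarify what the theoretical method computes. The sketch below assumes a normal (Z) reference distribution, as the output reports, and uses the rounded standard deviations from the output, so the last digit may differ slightly. The variable names are illustrative and not part of the lab.

```r
# Summary statistics copied from the output above
n_non <- 873; mean_non <- 7.144273; sd_non <- 1.5187
n_smk <- 126; mean_smk <- 6.82873;  sd_smk <- 1.3862

# Observed difference between means (nonsmoker - smoker)
diff_means <- mean_non - mean_smk

# Standard error of the difference between two independent sample means
se <- sqrt(sd_non^2 / n_non + sd_smk^2 / n_smk)

# Test statistic under H0: mu_nonsmoker - mu_smoker = 0
z <- (diff_means - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

round(c(difference = diff_means, se = se, z = z, p_value = p_value), 4)
```

Because the p-value is below 0.05, the data provide evidence against H0 at the 5% significance level.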

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", 
          order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
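
As a check on the interval above, the same bounds can be recovered from the summary statistics as point estimate plus or minus critical value times standard error. This is a sketch that assumes the normal critical value for 95% confidence; the variable names are illustrative, and the rounded standard deviations may shift the last digit.

```r
# Summary statistics copied from the output above
n_smk <- 126; mean_smk <- 6.82873;  sd_smk <- 1.3862
n_non <- 873; mean_non <- 7.144273; sd_non <- 1.5187

# Point estimate in the order used above (smoker - nonsmoker)
diff_means <- mean_smk - mean_non

# Standard error of the difference between two independent sample means
se <- sqrt(sd_smk^2 / n_smk + sd_non^2 / n_non)

# Critical value for a 95% confidence level (about 1.96)
z_star <- qnorm(0.975)

# Confidence interval: estimate +/- margin of error
ci <- diff_means + c(-1, 1) * z_star * se
round(ci, 4)
```

The interval lies entirely below zero, which agrees with the two-sided test: a difference of 0 is not a plausible value at the 95% confidence level, and the estimate points to lower mean weights for babies born to smoking mothers.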