Hypothesis Testing

Inference for Numerical Data

North Carolina Births

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Exploratory Analysis

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

Exercise 1: What are the cases in this data set? How many cases are there in our sample?

names(nc)

##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"

dim(nc)

## [1] 1000   13

# Each case in this dataset is a recorded birth (a baby and its parents), described by 13 variables. Our sample contains 1,000 cases.
summary(nc)

##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
##

Exercise 2: Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?

boxplot(nc$weight~nc$habit)

# The side-by-side boxplot shows that babies of nonsmoking mothers have a higher median birthweight, but the two boxes overlap substantially, so the middle 50% of birthweights is similar in both groups. Both distributions are skewed to the left; the nonsmoker distribution is strongly left-skewed, with many low outliers beyond the lower fence. The lower whiskers end at about the same value (roughly 4 pounds) in both groups. The nonsmoker box (IQR) is narrower than the smoker box, but its whiskers and outliers span a wider range, so birthweights among nonsmokers are more spread out overall. The upper whisker for nonsmokers is higher and is followed by some high outliers, so the heaviest babies in the sample were born to nonsmoking mothers. Nonsmoking mothers also account for more of the very low birthweights; this may partly reflect the much larger nonsmoker group (873 vs. 126 births), and it suggests that variables other than smoking affect birthweight.
by(nc$weight, nc$habit, mean)

## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873

Exercise 3: Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.

by(nc$weight, nc$habit, length)

## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126

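# The observations come from a random sample, so they can be treated as independent within and between groups. Both group sizes (873 nonsmokers and 126 smokers) are far above 30, so despite the left skew seen in the boxplots the sampling distribution of the difference in means is approximately normal. The conditions for inference are satisfied, and a normal (Z) model is appropriate.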
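One caveat about the group sizes above: length counts every element, including missing values, so it matches the number of usable observations only when the variable has no NAs (as is the case for weight here). A minimal sketch, using a small made-up vector rather than the lab data, of counting only the non-missing values per group:

```r
# Hypothetical toy data (not from nc): one weight is missing.
weight <- c(7.1, NA, 6.8, 7.4, 6.2)
habit  <- factor(c("nonsmoker", "nonsmoker", "smoker", "nonsmoker", "smoker"))

# length() counts the NA as well.
n_all <- tapply(weight, habit, length)

# Counting only non-missing values gives the sample size actually used.
n_used <- tapply(weight, habit, function(w) sum(!is.na(w)))

print(n_all)
print(n_used)
```

With the lab data, the same idea would be tapply(nc$gained, nc$mature, function(g) sum(!is.na(g))), which matters for gained because it has 27 missing values.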

Exercise 4: Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

Null Hypothesis: The difference in the population average birthweight of babies from nonsmoking mothers and that of smoking mothers is equal to 0.

Alternative Hypothesis: The difference in the population average birthweight of babies from nonsmoking mothers and that of smoking mothers is not equal to 0.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184

# The p-value of 0.0184 is below the significance levels α = 0.05 and α = 0.10, so at either level we reject the null hypothesis that there is no difference between the population mean birthweight of babies born to smoking mothers and that of babies born to nonsmoking mothers, in favor of the alternative hypothesis that the two population means differ.
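The reported standard error, test statistic, and p-value can be checked by hand. The sketch below is not the code inside inference; it assumes the usual two-sample formula SE = sqrt(s1^2/n1 + s2^2/n2) and uses the rounded summary statistics printed above, so the last digit may differ slightly from the output.

```r
# Summary statistics copied from the inference() output above.
n1 <- 873; m1 <- 7.1443; s1 <- 1.5187   # nonsmoker
n2 <- 126; m2 <- 6.8287; s2 <- 1.3862   # smoker

# Standard error of the difference between two independent sample means.
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Test statistic under H0: mu_nonsmoker - mu_smoker = 0.
z <- (m1 - m2 - 0) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```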

Exercise 5: Change the type argument to “ci” to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )

# The 95% confidence interval of (0.0534 , 0.5777) agrees with the result of the hypothesis test, since 0 is not in the interval. We are 95% confident that the population average birthweight of babies born to nonsmoking mothers is between 0.0534 and 0.5777 pounds larger than the population average birthweight of babies born to smoking mothers.
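The interval is the observed difference plus or minus a critical value times the standard error. A minimal sketch, assuming a normal critical value and the rounded figures from the output above (so the endpoints may differ in the last digit):

```r
# Values copied from the inference() output above.
diff_means <- 0.3155   # nonsmoker - smoker
se         <- 0.1338

# Critical value for a 95% interval under the normal model.
z_star <- qnorm(0.975)

# Point estimate +/- margin of error.
ci <- diff_means + c(-1, 1) * z_star * se
print(round(ci, 4))

# The interval excludes 0, which matches rejecting H0 at alpha = 0.05.
excludes_zero <- ci[1] > 0 || ci[2] < 0
print(excludes_zero)
```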

On Your Own

1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.

inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

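# The 95% confidence interval for mean pregnancy length is (38.1528 , 38.5165). We are 95% confident that the average length of pregnancies in the population is between about 38.15 and 38.52 weeks.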

2. Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90

inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical", conflevel = 0.90)

## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )

# The 90% confidence interval for mean pregnancy length is (38.182 , 38.4873).
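Lowering the confidence level shrinks the critical value, which is why the 90% interval is narrower than the 95% interval. The sketch below assumes a normal critical value and uses the rounded summary statistics printed above; ci_for_level is a helper name introduced here for illustration only.

```r
# Summary statistics copied from the inference() output above.
mean_weeks <- 38.3347
sd_weeks   <- 2.9316
n          <- 998

# Standard error of a single sample mean.
se <- sd_weeks / sqrt(n)

# Hypothetical helper: normal-based interval for a given confidence level.
ci_for_level <- function(level) {
  z_star <- qnorm(1 - (1 - level) / 2)
  mean_weeks + c(-1, 1) * z_star * se
}

ci95 <- ci_for_level(0.95)
ci90 <- ci_for_level(0.90)

print(round(ci95, 4))
print(round(ci90, 4))
```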

3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

Null Hypothesis: The population average weight gain of younger mothers is the same as the population average weight gain of mature mothers.

Alternative Hypothesis: The population average weight gain of younger mothers is different from the population average weight gain of mature mothers.

inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469

## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
## p-value =  0.1686

# The p-value of 0.1686 is larger than every conventional significance level, including the most lenient one (α = 0.10), so we fail to reject the null hypothesis that the difference between the population average weight gain of mature mothers and that of younger mothers is 0. The data do not provide convincing evidence of a difference.
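The inference output reports a Z statistic. As a rough cross-check, the sketch below compares the normal-based p-value with one from a t distribution using Welch's approximate degrees of freedom. The Welch step is an addition for illustration, not something the lab's function is known to do, and the inputs are the rounded summary statistics from the output above.

```r
# Summary statistics copied from the inference() output above.
n1 <- 129; s1 <- 13.4824   # mature mom
n2 <- 844; s2 <- 14.3469   # younger mom
diff_means <- -1.7697      # mature mom - younger mom

# Per-group variance of the sample mean, then the standard error.
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)
stat <- diff_means / se

# Welch-Satterthwaite approximate degrees of freedom.
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-values under the normal and t models.
p_z <- 2 * pnorm(-abs(stat))
p_t <- 2 * pt(-abs(stat), df = df_welch)

print(round(c(se = se, stat = stat, df = df_welch, p_z = p_z, p_t = p_t), 4))
```

With samples this large the two p-values are close, and both lead to the same decision at α = 0.10.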

4. Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

boxplot(nc$mage~nc$mature)

by(nc$mage, nc$mature, mean)

## nc$mature: mature mom
## [1] 37.18045
## ------------------------------------------------------------ 
## nc$mature: younger mom
## [1] 25.43829

# A side-by-side boxplot of mother's age (mage) grouped by the categorical variable mature shows where one group ends and the other begins, so the boundary between the two boxplots is the age cutoff. From the boxplot, younger moms are all under 35 and mature moms are 35 years old or older, so the cutoff is 35.
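Reading the cutoff off a plot can be confirmed numerically by comparing the oldest mother in the younger group with the youngest mother in the mature group. A minimal sketch on a small made-up data set (not the lab data) that is built with a cutoff at 35:

```r
# Hypothetical toy data (not from nc), labelled with a cutoff at 35.
mage   <- c(19, 24, 34, 28, 35, 41, 37, 33)
mature <- factor(ifelse(mage >= 35, "mature mom", "younger mom"))

# Largest and smallest age within each group.
oldest   <- tapply(mage, mature, max)
youngest <- tapply(mage, mature, min)

oldest_younger  <- oldest[["younger mom"]]
youngest_mature <- youngest[["mature mom"]]

# If the groups do not overlap, the cutoff lies between these two ages.
print(c(oldest_younger = oldest_younger, youngest_mature = youngest_mature))
```

With the lab data, the same two tapply calls would be run on nc$mage and nc$mature.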

5. Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.

The research question is whether race is associated with hospital visits.

Null Hypothesis: The population average number of hospital visits for white mothers is the same as the population average number of hospital visits for non-white mothers.

Alternative Hypothesis: The population average number of hospital visits for white mothers is not the same as the population average number of hospital visits for non-white mothers.

inference(y = nc$visits, x = nc$whitemom, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_not white = 279, mean_not white = 11.6272, sd_not white = 4.3644
## n_white = 710, mean_white = 12.3014, sd_white = 3.7701

## Observed difference between means (not white-white) = -0.6742
## 
## H0: mu_not white - mu_white = 0 
## HA: mu_not white - mu_white != 0 
## Standard error = 0.297 
## Test statistic: Z =  -2.269 
## p-value =  0.0232

# The p-value of 0.0232 is below α = 0.05, so we reject the null hypothesis that the population average number of hospital visits for white mothers is the same as that for non-white mothers, in favor of the alternative hypothesis that these population averages differ.
inference(y = nc$visits, x = nc$whitemom, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_not white = 279, mean_not white = 11.6272, sd_not white = 4.3644
## n_white = 710, mean_white = 12.3014, sd_white = 3.7701

## Observed difference between means (not white-white) = -0.6742
## 
## Standard error = 0.2971 
## 95 % Confidence interval = ( -1.2565 , -0.0918 )

# The 95% confidence interval for the difference in the population average number of hospital visits (non-white minus white) is (-1.2565 , -0.0918). We are 95% confident that non-white mothers average between 0.09 and 1.26 fewer hospital visits than white mothers. In plain language, white mothers in the population appear to make slightly more hospital visits during pregnancy, on average, than non-white mothers.