Lab report

Load data:

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

Set a seed:

set.seed(123478)

Exercises:

Exercise 1:

set.seed(446882)
dim(nc)
## [1] 1000   13
summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 
str(nc)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

The cases represent births in North Carolina in 2004, and there are 1000 cases in this sample.

Exercise 2:

set.seed(333561)
boxplot(nc$weight~nc$habit)

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873

The two box plots have similar shapes: both are left-skewed and their medians are close, but the smoker distribution has fewer outliers and a smaller spread. The group means from by() differ slightly, 7.14 for nonsmokers versus 6.83 for smokers, so babies born to smokers in this sample are a little lighter on average. This is an observed association; the plot and the means alone do not show that smoking causes the lower weight.

Exercise 3:

set.seed(777891)
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126

The sample was collected randomly, so the observations within each group are independent, and the 2 groups are independent of each other as well. Both group sizes (873 nonsmokers and 126 smokers) are well above 30, so even though the weight distributions are left-skewed and contain outliers, the sampling distribution of the difference in means should be approximately normal. The conditions for inference are therefore met.
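The role of sample size in these conditions can be illustrated with a small simulation. This is only a sketch on synthetic data, not the nc sample: the left-skewed population, the seed, and the names pop, skew and sample_means are assumptions made up for the illustration.

```r
# Illustration on synthetic data (NOT the nc data set).
set.seed(1)

# Hypothetical left-skewed "population": 10 minus an exponential draw.
pop <- 10 - rexp(100000, rate = 1)

# Simple moment-based skewness (helper defined here, not from a package).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Means of 2000 samples of size 126, the smaller group size in the report.
sample_means <- replicate(2000, mean(sample(pop, size = 126)))

skew(pop)           # strongly negative: the population is left-skewed
skew(sample_means)  # much closer to 0: the sample means are roughly symmetric
```

If the skewness of the sample means is far closer to 0 than that of the population, a normal approximation for the mean is reasonable at this sample size, which is the argument used for the nc data.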

Exercise 4:

set.seed(999865)
inference(y=nc$weight, x=nc$habit, est="mean", type="ht", null=0, alternative="twosided", method="theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184

H0: mu-smoking = mu-nonsmoking, which means there is no difference between the average weights of babies born to smoking and non-smoking mothers. HA: mu-smoking != mu-nonsmoking, which means there is a difference between the average weights of babies born to smoking and non-smoking mothers.
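The test statistic reported by inference() can be reproduced by hand from the summary statistics in the output. The sketch below assumes the usual two-sample Z formula with unpooled standard errors; the variable names are chosen here for illustration and are not part of the lab.

```r
# Summary statistics copied from the inference() output above.
n_non <- 873; mean_non <- 7.1443; sd_non <- 1.5187
n_smk <- 126; mean_smk <- 6.8287; sd_smk <- 1.3862

# Standard error of the difference between two independent means.
se <- sqrt(sd_non^2 / n_non + sd_smk^2 / n_smk)

# Z statistic under H0: mu_nonsmoker - mu_smoker = 0.
z <- (mean_non - mean_smk - 0) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```

The results should agree with the reported standard error, test statistic and p-value up to rounding of the inputs.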

Exercise 5:

set.seed(555555)
inference(y=nc$weight, x=nc$habit, est="mean", type="ci", null=0, alternative="twosided", method="theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )

We are 95% confident that the mean birth weight of babies born to nonsmokers is between 0.05 and 0.58 pounds higher than the mean birth weight of babies born to smokers. Because the interval does not contain 0, it agrees with the hypothesis test in Exercise 4: the data support the alternative hypothesis that the two mean birth weights differ. The box plots are consistent with this. Both groups are centered at around 7, but the nonsmoker distribution has many more outliers towards lower birth weight, and the spread of the smoker distribution is much smaller.
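The interval above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the interval is the usual normal-based one, point estimate plus or minus z* times the standard error; the variable names are illustrative.

```r
# Values copied from the inference() output above.
diff_means <- 0.3155   # nonsmoker mean minus smoker mean
se <- 0.1338           # reported standard error

# Critical value for a 95% interval from the standard normal distribution.
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error.
ci <- diff_means + c(-1, 1) * z_star * se
round(ci, 4)

# TRUE means 0 lies outside the interval, matching the rejection of H0.
ci[1] > 0 || ci[2] < 0
```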


On your own:

1:

set.seed(332255)
inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

We are 95% confident that the population mean falls between 38.15 and 38.52 weeks. In other words, we are 95% confident that the average length of all pregnancies in the population is between 38.15 and 38.52 weeks.

2:

set.seed(465983)
inference(y = nc$weeks, est = "mean", type = "ci", conflevel = 0.90, null = 0, 
          alternative = "twosided", method = "theoretical")
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )

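We are 90% confident that the population mean falls between 38.18 and 38.49 weeks. In other words, we are 90% confident that the average length of all pregnancies in the population is between 38.18 and 38.49 weeks. The interval is narrower than at the 95% level because a lower confidence level uses a smaller critical value.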
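Both intervals for the mean pregnancy length can be recomputed from the reported mean, standard deviation and sample size. The helper mean_ci below is a hypothetical function written for this sketch, assuming a normal-based interval; it is not part of the lab's inference() function.

```r
# Normal-based confidence interval for a single mean (illustrative helper).
mean_ci <- function(xbar, s, n, conf_level = 0.95) {
  se <- s / sqrt(n)                           # standard error of the mean
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
  c(lower = xbar - z_star * se, upper = xbar + z_star * se)
}

# Summary statistics copied from the output above.
ci95 <- mean_ci(38.3347, 2.9316, 998, conf_level = 0.95)
ci90 <- mean_ci(38.3347, 2.9316, 998, conf_level = 0.90)

round(rbind(ci95, ci90), 4)
```

Only the critical value changes between the two calls, which is why the 90% interval is the narrower of the two.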

3:

set.seed(777889)
inference(y = nc$gained, x = nc$mature,  est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
## p-value =  0.1686

Since our p-value of 0.169 is larger than our alpha of 0.05, we do not reject the null hypothesis. The data do not provide convincing evidence of a difference between the average weight gained by younger mothers and the average weight gained by mature mothers. This does not prove that the two averages are equal, only that this sample gives no convincing evidence that they differ.

4:

set.seed(468921)
by(nc$mage, nc$mature, range)
## nc$mature: mature mom
## [1] 35 50
## ------------------------------------------------------------ 
## nc$mature: younger mom
## [1] 13 34

Passing range as the function to by() gives the minimum and maximum age in each group, which reveals the cutoff between younger moms and mature moms. The ages of younger moms run from 13 to 34 and the ages of mature moms run from 35 to 50. So the cutoff is between 34 and 35: mothers aged 34 or younger are classified as younger moms, and mothers aged 35 or older are classified as mature moms.
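The cutoff can be sanity-checked by applying the inferred rule to a toy vector and confirming that the group ranges come out as expected. The ages below are hypothetical apart from the four group limits reported above, and the rule "35 or older is a mature mom" is the assumption being tested, not something stated by the data set itself.

```r
# Toy ages for illustration; 13, 34, 35 and 50 are the limits reported above.
mage <- c(13, 27, 34, 35, 41, 50)

# Assumed rule: 35 or older is "mature mom", otherwise "younger mom".
mature <- factor(ifelse(mage >= 35, "mature mom", "younger mom"))

# Range of ages within each group, analogous to the by() call in the report.
tapply(mage, mature, range)
```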

5:

set.seed(784512)
inference(y = nc$visits, x = nc$mature,  est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 131, mean_mature mom = 12.6107, sd_mature mom = 4.3793
## n_younger mom = 860, mean_younger mom = 12.0279, sd_younger mom = 3.8832
## Observed difference between means (mature mom-younger mom) = 0.5828
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.405 
## Test statistic: Z =  1.439 
## p-value =  0.15

Research question: Is there a difference between the average number of hospital visits during pregnancy for younger moms and the average number of hospital visits during pregnancy for mature moms? Since our p-value of 0.15 is greater than 0.05, we do not reject the null hypothesis. The data do not provide convincing evidence of a difference in the average number of hospital visits during pregnancy between younger moms and mature moms. This does not prove that the two averages are equal, only that this sample gives no convincing evidence that they differ.
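The decision rule used in the last two tests can be written as a small reusable function. The sketch below assumes the unpooled two-sample Z test with alpha = 0.05; two_sample_z is a hypothetical helper defined here, not the inference() function used in the lab.

```r
# Two-sample Z test from summary statistics (illustrative helper).
two_sample_z <- function(mean1, sd1, n1, mean2, sd2, n2,
                         null = 0, alpha = 0.05) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # unpooled standard error
  z <- (mean1 - mean2 - null) / se      # test statistic
  p_value <- 2 * pnorm(-abs(z))         # two-sided p-value
  list(se = se, z = z, p_value = p_value, reject_h0 = p_value < alpha)
}

# Mature mom minus younger mom, statistics copied from the outputs above.
visits <- two_sample_z(12.6107, 4.3793, 131, 12.0279, 3.8832, 860)
gained <- two_sample_z(28.7907, 13.4824, 129, 30.5604, 14.3469, 844)

str(visits)
str(gained)
```

For both variables reject_h0 should be FALSE, which corresponds to the "do not reject the null hypothesis" conclusions reached in the report.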