Inference for numerical data

Exploratory analysis

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

Cases are the units being described in the data set. Here each case is a single pregnancy, so the number of cases is the number of pregnancies recorded. That means there are 1000 cases in the sample.

boxplot(nc$fage)

boxplot(nc$mage)

boxplot(nc$weeks)

boxplot(nc$visits)

boxplot(nc$gained)

boxplot(nc$weight)

Exercise 2

Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between the two variables?

The medians are fairly close, but the smoker median is slightly lower. There are a lot of outliers among the nonsmoker weights, which seems odd at first (if there are that many outliers, wouldn’t they become part of the body of the data?). It is probably because there are so many more nonsmoker cases. The nonsmoker weights are also more spread out than the smoker weights.

plot(nc$habit, nc$weight)
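On why the larger group shows more outliers: a boxplot flags any point lying more than 1.5 times the interquartile range beyond the box, so a bigger sample tends to produce more flagged points even when the shape of the distribution is the same. Here is a rough sketch using simulated normal data rather than the nc data. The helper name `count_outliers` is made up for illustration, and the real weights are not exactly normal, so the actual counts will differ.

```r
# Count the points a boxplot would draw as outliers
# (beyond 1.5 * IQR from the hinges, the boxplot.stats default).
count_outliers <- function(x) length(boxplot.stats(x)$out)

set.seed(1)            # arbitrary seed, only for reproducibility
big   <- rnorm(873)    # same size as the nonsmoker group
small <- rnorm(126)    # same size as the smoker group

count_outliers(big)    # usually several points
count_outliers(small)  # usually few or none
```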

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873

Inference

Exercise 3

Check if the conditions necessary for inference are satisfied.

by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126

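I can see that nonsmoker: n = 873 and smoker: n = 126. Both groups are well above the usual n > 30 threshold, so the sample size is large enough for inference. The other condition, that the observations are independent within and between the groups, cannot be checked from these counts and depends on how the sample was collected.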
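One caveat about counting with `length`: it counts every row in a group, including rows where the value is missing. The summary above shows no NA's for weight, so it makes no difference here, but for a variable such as gained the two counts would differ. A small sketch on toy data (invented for illustration, not taken from nc):

```r
# Toy data, invented for illustration only (not from nc).
toy <- data.frame(
  weight = c(7.1, NA, 6.5, 8.0, 6.9, NA),
  habit  = factor(c("nonsmoker", "nonsmoker", "nonsmoker",
                    "smoker", "smoker", "smoker"))
)

# length() counts every row in the group, including missing weights.
n_rows <- tapply(toy$weight, toy$habit, length)

# Counting non-missing values gives the n actually available for a mean.
n_used <- tapply(toy$weight, toy$habit, function(w) sum(!is.na(w)))

n_rows
n_used
```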

Exercise 4

Write the hypothesis for testing if the average weights of babies born to smoking and non-smoking mothers are different.

The null hypothesis: There is no difference between the average weights of babies born to smoking versus non-smoking mothers, i.e. avg_baby_weight_smoking = avg_baby_weight_non-smoking.

The alternative hypothesis: The average weights of babies born to smoking versus non-smoking mothers are different, i.e. avg_baby_weight_smoking != avg_baby_weight_non-smoking.
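In symbols, writing the population mean birth weights as \mu with a group subscript (notation chosen here for illustration), the two hypotheses are:

```latex
H_0 : \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} = 0
\qquad
H_A : \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} \neq 0
```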

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 4.0.4
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
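As a check, the standard error, test statistic and p-value can be reproduced by hand from the summary statistics printed above. This is a sketch that assumes the normal approximation (the output labels the statistic Z). The variable names are my own, and the inputs are the rounded values from the output, so the last digit may differ slightly.

```r
# Summary statistics copied from the inference() output above.
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmoker
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smoker

diff_means <- mean_ns - mean_s                # observed difference in means
se <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)     # standard error of the difference
z  <- (diff_means - 0) / se                   # test statistic under H0: difference = 0
p_value <- 2 * pnorm(-abs(z))                 # two-sided p-value, normal approximation

round(c(se = se, z = z, p_value = p_value), 4)
```

Since the p-value is below 0.05, the null hypothesis of no difference would be rejected at the 5% level.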

Exercise 5

Change the type argument to “ci” to construct and record a confidence interval.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
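Both intervals can be reproduced from the summary statistics as the point estimate plus or minus a critical value times the standard error. This is a sketch that assumes a normal critical value; the variable names are my own and the inputs are the rounded values printed above. Swapping the order of the groups only flips the sign of the point estimate, so the second interval is the mirror image of the first.

```r
# Summary statistics copied from the inference() output above.
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmoker
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smoker

se     <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)     # standard error of the difference
z_star <- qnorm(0.975)                            # critical value for 95% confidence

# Interval for (nonsmoker - smoker) and for (smoker - nonsmoker).
ci_ns_minus_s <- (mean_ns - mean_s) + c(-1, 1) * z_star * se
ci_s_minus_ns <- (mean_s - mean_ns) + c(-1, 1) * z_star * se

round(ci_ns_minus_s, 4)
round(ci_s_minus_ns, 4)
```

Neither interval contains 0, which agrees with the two-sided p-value below 0.05 from the hypothesis test.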