# Inference for numerical data


# North Carolina births

# In 2004, the state of North Carolina released a large data set containing information on births recorded in this state.
# This data set is useful to researchers studying the relation between habits and practices of expectant mothers and
# the birth of their children.
# We will work with a random sample of observations from this data set.


# Exploratory analysis

# Load the nc data set into our workspace.
# mode = "wb" keeps the binary .RData file intact when downloading on Windows.
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData", mode = "wb")
load("nc.RData")
names(nc)
##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"
#
# We have observations on 13 different variables, some categorical and some numerical.
# The meaning of each variable is as follows.
#
# variable        description
# --------        -----------
# fage            father's age in years.
# mage            mother's age in years.
# mature          maturity status of mother.
# weeks           length of pregnancy in weeks.
# premie          whether the birth was classified as premature (premie) or full-term.
# visits          number of hospital visits during pregnancy.
# marital         whether mother is married or not married at birth.
# gained          weight gained by mother during pregnancy in pounds.
# weight          weight of the baby at birth in pounds.
# lowbirthweight  whether baby was classified as low birthweight (low) or not (not low).
# gender          gender of the baby, female or male.
# habit           status of the mother as a nonsmoker or a smoker.
# whitemom        whether mom is white or not white.


# Exercise 1: What are the cases in this data set? How many cases are there in our sample?

# As a first step in the analysis, we should consider summaries of the data. The dim command gives the number of
# rows and columns, and the summary command summarizes each variable:
 
dim(nc)
## [1] 1000   13
summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 
# As you review the variable summaries, consider which variables are categorical and which are numerical.
# For numerical variables, are there outliers? If you aren't sure or want to take a closer look at the data, make a graph.

# Consider the possible relationship between a mother's smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

plot(nc$weight ~ nc$habit)

# Exercise 2: Make a side-by-side boxplot of habit and weight.
# What does the plot highlight about the relationship between these two variables?

boxplot(nc$weight ~ nc$habit)

# The box plots show how the medians of the two distributions compare.
# We can also compare the means: the by function splits the weight variable into the habit groups,
# then applies the mean function to each group.
 
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873
# There is an observed difference, but is this difference statistically significant? To answer this question we
# will conduct a hypothesis test.


# Inference
# Exercise 3
# Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to
# check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126
table(nc$habit)
## 
## nonsmoker    smoker 
##       873       126
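#
# A note on the two counts above: length counts every element in a group, missing values included,
# so it can overstate the number of observations actually used for inference.
# A minimal sketch on made-up data (toy_weight and toy_habit are hypothetical, not taken from nc)
# of counting only the non-missing observations per group:
toy_weight <- c(7.1, NA, 6.5, 8.0, 6.9, NA)
toy_habit  <- factor(c("nonsmoker", "nonsmoker", "smoker", "nonsmoker", "smoker", "smoker"))
n_all      <- tapply(toy_weight, toy_habit, length)       # counts missing weights too
n_complete <- tapply(!is.na(toy_weight), toy_habit, sum)  # counts observed weights only
print(n_all)
print(n_complete)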
# Exercise 4
# Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

#     Null Hypothesis Ho:
#     Alt  Hypothesis Ha:

# Next, we introduce a new function, "inference", that we will use for conducting hypothesis tests and 
#  constructing confidence intervals.

  inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
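#
# The reported standard error, test statistic, and p-value can be checked by hand from the printed
# summary statistics. A minimal sketch, assuming the unpooled standard error sqrt(s1^2/n1 + s2^2/n2)
# and a normal reference distribution (consistent with the Z statistic reported above);
# the variable names are ours and are not part of inference(). Small differences in the last
# digit come from using the rounded summary statistics.
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmoker summary statistics
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smoker summary statistics
se_diff <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)    # standard error of the difference in means
z_stat  <- (mean_ns - mean_s - 0) / se_diff       # observed difference minus the null value 0
p_two   <- 2 * pnorm(-abs(z_stat))                # two-sided p-value
print(round(c(se = se_diff, z = z_stat, p = p_two), 4))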

  # Let's pause for a moment to go through the arguments of this custom function.
  # The first argument is y, which is the response variable that we are interested in: nc$weight.
  # The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, 
  #     smokers and non-smokers: nc$habit. 
  # The third argument, est, is the parameter we're interested in: "mean" (other options are "median" or "proportion").
  # Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci").
  # When performing a hypothesis test, we also need to supply the null value, which in this case is 0,
  #     since the null hypothesis sets the two population means equal to each other.
  # The alternative hypothesis can be "less", "greater", or "twosided". 
  # Lastly, the method of inference can be "theoretical" or "simulation" based.

# Exercise 5: Change the type argument to "ci" to construct and record a confidence interval for the difference between
# the weights of babies born to smoking and non-smoking mothers.

# By default the function reports an interval for (mu_nonsmoker - mu_smoker).
# We can easily change this order by using the order argument:
  
  inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
            alternative = "twosided", method = "theoretical", 
            order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
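#
# The interval above has the form (point estimate) +/- z* x SE. A minimal sketch reproducing it
# from the printed summary statistics, assuming a normal critical value (consistent with the
# theoretical method used above); the variable names are ours, not part of inference().
diff_sn <- 6.8287 - 7.1443                             # smoker mean minus nonsmoker mean
se_sn   <- sqrt(1.3862^2 / 126 + 1.5187^2 / 873)       # unpooled standard error
z_star  <- qnorm(0.975)                                # critical value for 95% confidence
ci_sn   <- diff_sn + c(-1, 1) * z_star * se_sn         # lower and upper bounds
print(round(ci_sn, 4))
# Both bounds are negative, so the interval excludes 0, in agreement with the two-sided test.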

# On your own

  # 1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context.
  inference(y = nc$weeks, est = "mean", type = "ci", conflevel = 0.95, method = "theoretical")
## Single mean 
## Summary statistics:
## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
  # 2. Calculate a new confidence interval for the same parameter at the 90% confidence level.
  inference(y = nc$weeks, est = "mean", type = "ci", conflevel = 0.90, method = "theoretical")
## Single mean 
## Summary statistics:
## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
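  #
  # Both intervals for weeks share the same center and standard error; only the critical value changes.
  # A minimal sketch from the printed summary statistics, assuming a normal critical value;
  # ci_weeks is a hypothetical helper of ours, not part of inference().
  se_weeks <- 2.9316 / sqrt(998)                       # standard error of the sample mean
  ci_weeks <- function(level) {
    z_crit <- qnorm(1 - (1 - level) / 2)               # two-sided critical value for this level
    38.3347 + c(-1, 1) * z_crit * se_weeks
  }
  print(round(ci_weeks(0.95), 4))
  print(round(ci_weeks(0.90), 4))
  # The 90% interval is narrower because it uses a smaller critical value.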
  # 3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers
  #    is different from the average weight gained by mature mothers.
  inference(y = nc$gained, x = nc$mature, est = "mean", type = "ht", null = 0, 
            alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 1.286 
## Test statistic: Z =  -1.376 
## p-value =  0.1686

  # 4. Determine the age cutoff for younger and mature mothers.
  # Let's say younger mothers are those with age <= cutoff, and mature mothers are those with age > cutoff.
  # The overall summary below shows the range of mothers' ages; the cutoff itself is found by comparing
  # the oldest mother labeled "younger mom" with the youngest mother labeled "mature mom".
  summary(nc$mage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      22      27      27      32      50
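  #
  # A minimal sketch of that comparison on made-up data (toy_age and toy_mature are hypothetical
  # values, not taken from nc). Replacing them with nc$mage and nc$mature would apply the same idea
  # to the sample.
  toy_age    <- c(19, 24, 31, 34, 35, 38, 41)
  toy_mature <- factor(c("younger mom", "younger mom", "younger mom", "younger mom",
                         "mature mom", "mature mom", "mature mom"))
  oldest_younger  <- max(toy_age[toy_mature == "younger mom"], na.rm = TRUE)
  youngest_mature <- min(toy_age[toy_mature == "mature mom"], na.rm = TRUE)
  print(c(oldest_younger = oldest_younger, youngest_mature = youngest_mature))
  # If the two groups do not overlap, the cutoff lies between these two ages.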
  # 5. Pick a pair of numerical and categorical variables and come up with a research question.
  # Evaluate the relationship between these variables using a hypothesis test and/or a confidence interval.
  # For example, let's explore the relationship between the number of hospital visits (numerical) and marital status (categorical).
  inference(y = nc$visits, x = nc$marital, est = "mean", type = "ht", null = 0, 
            alternative = "twosided", method = "theoretical", order = c("married", "not married"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_married = 380, mean_married = 10.9553, sd_married = 4.2408
## n_not married = 611, mean_not married = 12.82, sd_not married = 3.5883
## Observed difference between means (married-not married) = -1.8647
## 
## H0: mu_married - mu_not married = 0 
## HA: mu_married - mu_not married != 0 
## Standard error = 0.262 
## Test statistic: Z =  -7.13 
## p-value =  0
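  #
  # The p-value printed as 0 is a rounded value, not an exact zero. A minimal sketch recomputing it
  # from the reported test statistic, assuming a normal reference distribution as in the output above:
  z_visits <- -7.13
  p_visits <- 2 * pnorm(-abs(z_visits))                # two-sided p-value
  print(format(p_visits, scientific = TRUE))
  # The result is tiny but positive, so it is better reported as "p < 0.0001" than as "p = 0".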