Link: https://www.datacamp.com/courses/data-analysis-and-statistical-inference_mine-cetinkaya-rundel-by-datacamp/lab-4-inference-for-numerical-data-6

North Carolina births

Welcome! In this lab, you’ll use the techniques you learned in the previous lab to analyze births in North Carolina, USA. Vamonos!

In 2004, the state of North Carolina released a large data set containing information on births recorded in their state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/nc.Rdata'))
names(nc)
##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"

The nc data frame contains observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:

A first analysis

As a first step in the analysis, you should consider summaries of the data. This can be done using the summary command.

The variable summaries tell you which variables are categorical and which are numerical. Think about the types of the variables. If you aren’t sure or want to take a closer look at the data, make a graph.

We will first start with analyzing the weight gained by mothers throughout the pregnancy: gained.

summary(nc)
##       fage           mage            mature        weeks     
##  Min.   :14.0   Min.   :13   mature mom :133   Min.   :20.0  
##  1st Qu.:25.0   1st Qu.:22   younger mom:867   1st Qu.:37.0  
##  Median :30.0   Median :27                     Median :39.0  
##  Mean   :30.3   Mean   :27                     Mean   :38.3  
##  3rd Qu.:35.0   3rd Qu.:32                     3rd Qu.:40.0  
##  Max.   :55.0   Max.   :50                     Max.   :45.0  
##  NA's   :171                                   NA's   :2     
##        premie        visits            marital        gained    
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.0  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.0  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.0  
##                  Mean   :12.1                     Mean   :30.3  
##                  3rd Qu.:15.0                     3rd Qu.:38.0  
##                  Max.   :30.0                     Max.   :85.0  
##                  NA's   :9                        NA's   :27    
##      weight      lowbirthweight    gender          habit    
##  Min.   : 1.00   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.38   not low:889    male  :497   smoker   :126  
##  Median : 7.31                               NA's     :  1  
##  Mean   : 7.10                                              
##  3rd Qu.: 8.06                                              
##  Max.   :11.75                                              
##                                                             
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 
summary(nc$gained)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    20.0    30.0    30.3    38.0    85.0      27
hist(nc$gained)

plot of chunk unnamed-chunk-2

Great! Take a moment to analyze your results, can you describe the distribution of weight gained by mothers during their pregnancy?

skewed, with right tail

Question 2 How many mothers are we missing weight gain data from?

sum(is.na(nc$gained))
## [1] 27

Cleaning your data

Since we have some missing data on weight gain, we’ll first create a cleaned-up version of the weight gain variable, and use this variable in the next few steps of the analysis.

There are many ways of accomplishing this task in R, we’ll do it using the na.omit function. The command na.omit(a_data_set) returns a_data_set stripped of its (possible) NA values.

We’ll also store the sample size of the clean data set (which should be less than 1000 since we dropped the observations with NAs) in order to be able to use this value in the next few steps of the analysis as well.

gained_clean = na.omit(nc$gained)
n = length(gained_clean)

The bootstrap

Using this sample we would like to construct a bootstrap confidence interval for the average weight gained by all mothers during pregnancy. A quick reminder of how bootstrapping works:

  1. Take a bootstrap sample (a random sample with replacement of size equal to the original sample size) from the original sample.
  2. Record the mean of this bootstrap sample.
  3. Repeat steps (1) and (2) many times to build a bootstrap distribution.
  4. Calculate the XX% interval using the percentile or the standard error method.

Let’s first take 100 bootstrap samples (i.e. with replacement), and record their means in a new object called boot_means. You’ll also make a histogram of the bootstrap distribution.

  • In a loop, take a sample of size \(n\) (the original sample size) with replacement. You do this by setting the replace argument of the sample function to TRUE.
  • Make a histogram of the bootstrap distribution boot_means.
boot_means = rep(NA, 100)

for (i in 1:100) {
  boot_means[i] = mean( sample(gained_clean, n, replace=T) )
}

hist(boot_means)

plot of chunk unnamed-chunk-5

The sampling distribution is calculated by resampling from the population, the bootstrap distribution is calculated by resampling from the sample.

TRUE

To construct the 95% bootstrap confidence interval using the percentile method, we estimate the values of the 5th and 95th percentiles of the bootstrap distribution.

FALSE

(it’s 2.5 and 97.5)

The inference function

Next we’ll introduce a new function that you’ll be seeing a lot more of in the upcoming labs – a custom function that allows you to apply any statistical inference method that you’ll be learning in this course. Since this is a custom function, we need to load it first.

Writing a for loop every time you want to calculate a bootstrap interval or run a randomization test is cumbersome. This function automates the process. By default the inference function takes 10,000 bootstrap samples (instead of the 100 you’ve taken above), creates a bootstrap distribution, and calculates the confidence interval, using the percentile method.

The inference function is a bit on the heavy side, it may take a while to run.

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata'))
inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.9, est = "mean", 
    boot_method = "perc")
## Warning: package 'lmPerm' was built under R version 3.1.1
## package 'BHH2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alexey Grigorev\AppData\Local\Temp\RtmpA3dwhM\downloaded_packages
## Warning: package 'BHH2' was built under R version 3.1.1
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

plot of chunk unnamed-chunk-7

## Bootstrap method: Percentile
## 90 % Bootstrap interval = ( 29.5713 , 31.0617 )

Setting the confidence level

We can easily change the confidence level to 95% by changing the argument conflevel of the inference function.

inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "mean",
    boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

plot of chunk unnamed-chunk-8

## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 29.447 , 31.1771 )

Choosing a bootstrap method

We can also use the standard error method by changing the boot_method argument. Change the boot_method argument to “se” and run the code.

inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "mean",
    boot_method = "se") 
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

plot of chunk unnamed-chunk-9

## Bootstrap method: Standard error; Boot. mean = 30.3245 ; Boot. SE = 0.4451 
## 95 % Bootstrap interval = ( 29.452 , 31.1969 )

Setting the parameter of interest

To create an interval for the median instead of the mean, you can change the est argument of the inference function.

inference(nc$gained, type = "ci", method = "simulation", conflevel = 0.95, est = "median",
    boot_method = "se")
## Single median 
## Summary statistics:
## median = 30 ;  n = 973

plot of chunk unnamed-chunk-10

## Bootstrap method: Standard error; Boot. mean = 29.946 ; Boot. SE = 0.2473 
## 95 % Bootstrap interval = ( 29.4613 , 30.4307 )

Question

The bootstrap distribution of the median weight gain is a smooth, unimodal, symmetric distribution that yields a reliable confidence interval estimate.

FALSE

Recognize that when the bootstrap distribution is extremely skewed and sparse, the bootstrap confidence interval may not be reliable.

Father’s age

In this exercise, you’ll use the inference function to create a 95% bootstrap confidence interval for the mean age of fathers at the birth of their child using the standard error method.

inference(nc$fage, type="ci", method="simulation", conflevel=0.95, est="mean",
    boot_method="se")
## Single mean 
## Summary statistics:
## mean = 30.2557 ;  sd = 6.7638 ;  n = 829

plot of chunk unnamed-chunk-11

## Bootstrap method: Standard error; Boot. mean = 30.2553 ; Boot. SE = 0.2398 
## 95 % Bootstrap interval = ( 29.7853 , 30.7253 )

Relationships between two variables

When the response variable is numerical and the explanatory variable is categorical, we can evaluate the relationship between the two variables by comparing means (or medians, or other measures) of the numerical response variable across the levels of the explanatory categorical variable.

boxplot(nc$weight ~ nc$habit)

plot of chunk unnamed-chunk-12

Question 6

Based on the plot from the previous exercise, which of the following is false about the relationship between habit and weight?

  • Median birth weight of babies born to non-smoker mothers is slightly higher than that of babies born to smoker mothers.
  • Range of birth weights of babies born to non-smoker mothers is greater than that of babies born to smoker mothers.
  • Both distributions are slightly right skewed.
  • The IQRs of the distributions are roughly equal.

Both distributions are slightly right skewed.

The by function

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the by function to split the weight variable into the habit groups, and then take the mean of each using the mean function. You can use the by function as follows to compare the means of the groups: by(numerical_dataset, categorical_dataset, mean). You can also use other functions like median, summary, etc.

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 6.829
by(nc$weight, nc$habit, summary)
## nc$habit: nonsmoker
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.44    7.31    7.14    8.06   11.80 
## -------------------------------------------------------- 
## nc$habit: smoker
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.69    6.08    7.06    6.83    7.74    9.19

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.

Conditions for inference

Before you can start hypothesis testing, you need to think about the conditions that are necessary for inference.

To do this, you’ll need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above by replacing mean with length.

Instructions

  • Check if the conditions for inference are satisfied by computing the group sizes of the babies’ weight over the categories habit.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126

More inference

Next, we’ll use the inference function for evaluating whether there is a difference between the average birth weights of babies born to smoker and non-smoker mothers. Let’s pause for a moment to go through the arguments of this custom function:

  • The first argument is y, which is the response variable that we are interested in: nc$weight.
  • The second argument is the grouping variable, x, which is the explanatory variable – the grouping variable accross the levels of which we’re comparing the average value for the response variable, smokers and non-smokers: nc$habit.
  • The third argument, est, is the parameter we’re interested in: "mean" (other options are "median", or "proportion".)
  • Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval("ci").
  • When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
  • The alternative hypothesis can be "less", "greater", or "twosided".
  • Lastly, the method of inference can be "theoretical" or "simulation" based.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
    alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386
## Observed difference between means (nonsmoker-smoker) = 0.3155
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184

plot of chunk unnamed-chunk-15

Changing the order

By default the inference function sets the parameter of interest to be mu_nonsmokers - mu_smokers. We can easily change this order by using the order argument. To set the order to mu_first - mu_second use: order = c("first","second").

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
    alternative = "twosided", method = "theoretical", order=c('smoker', 'nonsmoker'))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## Observed difference between means (smoker-nonsmoker) = -0.3155
## H0: mu_smoker - mu_nonsmoker = 0 
## HA: mu_smoker - mu_nonsmoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  -2.359 
## p-value =  0.0184

plot of chunk unnamed-chunk-16

Question 7

Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and nonsmoking mothers.

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
    alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.144, sd_nonsmoker = 1.519
## n_smoker = 126, mean_smoker = 6.829, sd_smoker = 1.386

plot of chunk unnamed-chunk-17

## Observed difference between means (nonsmoker-smoker) = 0.3155
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )

Which of the following is the best interpretation of the interval? > We are 95% confident that babies born to nonsmoker mothers are on average 0.05 to 0.58 pounds heavier at birth than babies born to smoker mothers.

Question 8

Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice. What is the maximum age of a younger mom and the minimum age of a mature mom, according to the data? table(nc$mature)

by(nc$fage, nc$mature, summary)
## nc$mature: mature mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    26.0    35.0    38.0    38.4    41.0    55.0      11 
## -------------------------------------------------------- 
## nc$mature: younger mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    14.0    24.0    29.0    28.9    33.0    48.0     160
boxplot(nc$fage ~ nc$mature)

plot of chunk unnamed-chunk-18

The maximum age of younger moms is 34 and minimum age of mature moms is 35.

? (no idea why)

The General Social Survey

The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The survey asks many questions, two of which we will focus on for the next few exercises: wordsum (vocabulary test scores) and class (self-reported social class). wordsum ranges between 0 and 10, and is calculated as the number of vocabulary questions (out of 10) that the respondent answered correctly.

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/gss.Rdata'))

names(gss)
## [1] "wordsum" "class"
head(gss)
##   wordsum   class
## 1       6  MIDDLE
## 2       9 WORKING
## 3       6 WORKING
## 4       5 WORKING
## 5       6 WORKING
## 6       6 WORKING

Analyze the variables

Create numerical and visual summaries of the two variables individually to better understand their distribution. Also compute summary statistics and visualizations that display the relationship between the two variables.

table(gss$wordsum)
## 
##   0   1   2   3   4   5   6   7   8   9  10 
##   1  10  23  41  74 122 203 128  87  70  36
summary(gss$wordsum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00    6.00    6.14    7.00   10.00
boxplot(gss$wordsum)

plot of chunk unnamed-chunk-20

hist(gss$wordsum)

plot of chunk unnamed-chunk-20

summary(gss$class)
##   LOWER  MIDDLE   UPPER WORKING 
##      41     331      16     407
barplot(table(gss$class))

plot of chunk unnamed-chunk-20

# Numerical and visual summaries of their relations:
by(gss$wordsum, gss$class, mean)
## gss$class: LOWER
## [1] 5.073
## -------------------------------------------------------- 
## gss$class: MIDDLE
## [1] 6.761
## -------------------------------------------------------- 
## gss$class: UPPER
## [1] 6.188
## -------------------------------------------------------- 
## gss$class: WORKING
## [1] 5.749
by(gss$wordsum, gss$class, summary)
## gss$class: LOWER
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    5.00    5.07    6.00    9.00 
## -------------------------------------------------------- 
## gss$class: MIDDLE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00    7.00    6.76    8.00   10.00 
## -------------------------------------------------------- 
## gss$class: UPPER
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.75    6.00    6.19    8.00    9.00 
## -------------------------------------------------------- 
## gss$class: WORKING
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    6.00    5.75    7.00   10.00
boxplot(gss$wordsum ~ gss$class)

plot of chunk unnamed-chunk-20

plot(gss$wordsum, col=gss$class)

plot of chunk unnamed-chunk-20

Question 9

Which of the following methods is appropriate for testing for a difference between the average vocabulary test scores among the various classes?

  • \(Z\) test
  • \(T\) test
  • ANOVA
  • \(\chi^2\) test

ANOVA

ANOVA test

In this exercise, we’ll run the ANOVA test and analyze its outcome. Execute the code now and investigate the output.

Let’s examine the output: First, we have a numerical response variable (score on vocabulary test) and a categorical explanatory variable (class). Since class has four levels, comparing average scores across the levels of the class variable requires ANOVA.

Before we get to the ANOVA output we are presented with a series of summary statistics and the associated hypotheses. Note that the alternative hypothesis is set to greater because \(F\)-tests are always onesided.

inference(y = gss$wordsum, x = gss$class, est = "mean", method = "theoretical", 
    type = "ht", alternative = "greater")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.073, sd_LOWER = 2.24
## n_MIDDLE = 331, mean_MIDDLE = 6.761, sd_MIDDLE = 1.886
## n_UPPER = 16, mean_UPPER = 6.188, sd_UPPER = 2.344
## n_WORKING = 407, mean_WORKING = 5.749, sd_WORKING = 1.865
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value  Pr(>F)
## x           3    237    78.9    21.7 1.6e-13
## Residuals 791   2870     3.6                
## 
## Pairwise tests: t tests with pooled SD 
##          LOWER MIDDLE  UPPER
## MIDDLE  0.0000     NA     NA
## UPPER   0.0475 0.2396     NA
## WORKING 0.0306 0.0000 0.3671

plot of chunk unnamed-chunk-21

Nice! Since the \(p\)-value for the \(F\)-test is low, we reject \(H_0\) and conclude that there is evidence of at least one pair of means being different. Next, we investigate which pair of means are different.

Question 10

Calculate the modified \(\alpha\) (\(\alpha^*\)) to be used for these tests. > \(\alpha^* = \alpha / (3 + 2 + 1) = \alpha / 6\)

Question 11

View the \(p\)-values of the pairwise tests from the ANOVA output. Which of the following pairs of means are concluded to be different at the modified significance level?

middle and lower