My personal course notes from the Experimental Design in R course on DataCamp, taught by Kaelen Medeiros. Kaelen is a data scientist who works in the technology and health industries.
I have since realized that Prof. Alan T. Arnholt has a repository (repo) of this course's materials here.
Experimental design is a crucial part of data analysis in any field, whether you work in business, health, or tech. If you want to use data to answer a question, you need to design an experiment! In this course you will learn about basic experimental design, including block and factorial designs, and commonly used statistical tests, such as t-tests and ANOVA. You will use built-in R data and real-world datasets including the CDC NHANES survey, SAT Scores from NY Public Schools, and Lending Club Loan Data. Following the course, you will be able to design and analyze your own experiments!
library(tidyverse)
library(broom)
library(pwr)
library(haven)
library(simputation)
library(sampling)
library(agricolae)
library(naniar)
library(knitr)
lendingclub <- read_csv("data/lendclub.csv")
nhanes_bodymeasures <- read_csv("data/nhanes_bodymeasures.csv")
nhanes_demo <- read_csv("data/nhanes_demo.csv")
nhanes_medical <- read_csv("data/nhanes_medicalconditions.csv")
nyc_scores <- read_csv("data/nyc_scores.csv")
nhanes_final <- read_csv("data/nhanes_final.csv")
# Merge the 3 datasets you just created to create nhanes_combined
nhanes_combined <- list(nhanes_demo, nhanes_medical, nhanes_bodymeasures) %>%
Reduce(function(df1, df2) inner_join(df1, df2, by = "seqn"), .)
An introduction to key parts of experimental design plus some power and sample size calculations.
Let’s dive into experimental design. Note that all of these concepts will be covered in more detail in the next video, “Hypothesis Testing.”
ToothGrowth is a built-in R dataset from a study that
examined the effect of three different doses of Vitamin C on the length
of the odontoblasts, the cells responsible for tooth growth, in 60 guinea
pigs, where tooth length was the measured outcome variable.
Built-in data can be loaded with the data() function.
The dataset will be loaded as a data frame with the same name passed as
an argument to data(). For example, you can load the famous
iris dataset using data("iris").
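For example, a quick check in the console:
# Load the built-in iris data frame and peek at the first rows
data("iris")
head(iris)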
If you wanted to conduct a two-sided t-test with the famous
mtcars dataset, it would look like this, where
x is the outcome in question, alternative is
set to "two.sided", and mu is the value you’re
testing to see if the mean of mpg is not equal to.
data(mtcars)
t.test(x = mtcars$mpg, alternative = "two.sided", mu = 40)
Suppose you know that the average length of a guinea pig's odontoblasts is 18 micrometers. Conduct a two-sided t-test on the ToothGrowth dataset. Here, a two-sided t-test will check to see if the mean of len is not equal to 18.
# Load the ToothGrowth dataset
data("ToothGrowth")
# Perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
## 16.83731 20.78936
## sample estimates:
## mean of x
## 18.81333
Excellent job! Given the high p-value, we fail to reject the null
hypothesis that the mean of len is equal to 18. That is, we
don’t have evidence that it is different from 18 micrometers. P-values
and hypothesis testing will be covered in more detail in the next
video.
Randomization of subjects in an experiment helps spread any
variability that exists naturally between subjects evenly across groups.
For ToothGrowth, an example of effective randomization
would be randomly assigning male and female guinea pigs into different
experimental groups, ideally canceling out any existing differences that
naturally exist between male and female guinea pigs.
In the experiment that yielded the ToothGrowth dataset,
guinea pigs were randomized to receive Vitamin C either through orange
juice or ascorbic acid, indicated in the dataset by the
supp variable. It’s natural to wonder if there is a
difference in tooth length by supplement type - a question that a t-test
can also answer!
Starting with this exercise, you should use t.test() and
other modeling functions with formula notation:
t.test(outcome ~ explanatory_variable, data = dataset)
This can be read: “test outcome by
explanatory_variable in my dataset.” The
default test for t.test() is a two-sided t-test.
You no longer have to explicitly declare
dataset$outcome, because the data argument is
specified.
ToothGrowth_ttest <- t.test(len ~ supp, ToothGrowth)
tidy(ToothGrowth_ttest)
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.7 20.7 17.0 1.92 0.0606 55.3 -0.171 7.57
## # … with 2 more variables: method <chr>, alternative <chr>
Nice job! Given the p-value of around 0.06, we fail to reject the null hypothesis: we don't have convincing evidence of a difference in mean tooth length by supplement type. Generally, at the conventional 0.05 significance level, a p-value above 0.05 means we fail to reject the null hypothesis.
Recall that replication means you need to conduct an experiment with an adequate number of subjects to achieve an acceptable statistical power. Sample size and power will be discussed in more detail later in this chapter.
Let’s examine the ToothGrowth dataset to make sure they
followed the principle of replication. We’ll use dplyr to
do some exploratory data analysis (EDA). The data has already been
loaded for you.
When using dplyr functions, we can utilize the pipe
operator, %>%, to chain functions together. An example
using mtcars:
data(mtcars)
mtcars %>%
count(cyl)
count() groups mtcars by cyl
and then counts how many there are of each number of cylinders.
# Count number of observations for each combination of supp and dose
ToothGrowth %>%
count(supp, dose)
## supp dose n
## 1 OJ 0.5 10
## 2 OJ 1.0 10
## 3 OJ 2.0 10
## 4 VC 0.5 10
## 5 VC 1.0 10
## 6 VC 2.0 10
Great job! The researchers seem to have tested each combination of
supp and dose on 10 subjects each, which is
low, but was deemed adequate for this experiment.
Though this is not true, suppose the supplement type is actually a nuisance factor we’d like to control for by blocking, and we’re actually only interested in the effect of the dose of Vitamin C on guinea pig tooth growth.
If we block by supplement type, we create groups that are more similar, in that they’ll have the same supplement type, allowing us to examine only the effect of dose on tooth length.
We’ll use the aov() function to examine this.
aov() creates a linear regression model by calling
lm() and examining results with anova() all in
one function call. To use aov(), we'll still need
formula notation, as with the randomization exercise, but this time
the formula should be len ~ dose + supp to indicate we've
blocked by supplement type. (We'll cover aov() and
anova() in more detail in the next chapter.)
ggplot2 is loaded for you.
ggplot(ToothGrowth, aes(as_factor(dose), len))+
geom_boxplot()
# Create ToothGrowth_aov
#ToothGrowth_aov <- aov(len ~ as_factor(dose) + supp, ToothGrowth)
# Examine ToothGrowth_aov with summary()
#summary(ToothGrowth_aov)
# Both steps above combined and tidied
aov(len ~ as_factor(dose) + supp, ToothGrowth) %>% tidy()
## # A tibble: 3 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 as_factor(dose) 2 2426. 1213. 82.8 1.87e-17
## 2 supp 1 205. 205. 14.0 4.29e- 4
## 3 Residuals 56 820. 14.7 NA NA
Congrats! You have just designed your first Randomized Complete Block
Design (RCBD) experiment. We’ll learn more about this type of experiment
in Chapter 3. Given the very small observed p-value for
dose, it appears we have evidence to support the hypothesis
that mean len is different by dose amount.
Recall in the first exercise that we tested to see if the mean of the
guinea pigs’ teeth in ToothGrowth was not equal to 18
micrometers. That was an example of a two-sided t-test: we wanted to see
if the mean of len is some other number on either side of
18.
We can also conduct a one-sided t-test, explicitly checking to see if the mean is less than or greater than 18. Whether to use a one- or two-sided test usually follows from your research question. Does an intervention cause longer tooth growth? One-sided, greater than. Does a drug cause the test group to lose more weight? One-sided, less than. Is there a difference in mean test scores between two groups of students? Two-sided test.
The ToothGrowth data has been loaded for you.
# Less than
t.test(x = ToothGrowth$len,
alternative = "less",
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
## -Inf 20.46358
## sample estimates:
## mean of x
## 18.81333
# Greater than
t.test(x = ToothGrowth$len,
alternative = "greater",
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
## 17.16309 Inf
## sample estimates:
## mean of x
## 18.81333
Excellent! It turns out the mean of len is actually very
close to 18, so neither of these tests tells us much about the mean of
tooth length.
The pwr package has been loaded for you. Use the console
to look at the documentation for the pwr.t.test() function.
The list of arguments are specialized for a t-test and include the
ability to specify the alternative hypothesis.
If you’d like, take some time to explore the different
pwr package functions and read about their inputs.
What does a call to any pwr.*() function yield?
A call to any pwr.*() function returns an object of
class power.htest, which can then be manipulated in the
same way as many different R objects.
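To make that concrete, here is a minimal sketch (with illustrative inputs, not values from the course) showing that the returned object can be stored and inspected like any other R object:
# Store a pwr result and poke at it like any other R object
example_power <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")
class(example_power)   # "power.htest"
example_power$n        # required sample size per group
plot(example_power)    # pwr also provides a plot() method that draws a power curve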
One key part of designing an experiment is knowing the required sample size you’ll need to be able to test your hypothesis.
The pwr package provides a handy function,
pwr.t.test(), which will calculate that for you. However,
you do need to know your desired significance level (often 0.05), if the
test is one- or two-sided, if the data is from one sample, two samples,
or paired, the effect size, and the power. Some of this information will
be given to you or can be reasoned from the design.
A power or sample size calculation is usually different each time you conduct one, and the details of the calculation strongly depend on what kind of experiment you’re designing and what your end goals are.
# Calculate power
pwr.t.test(n = 100,
d = 0.35,
sig.level = 0.10,
type = "two.sample",
alternative = "two.sided",
power = NULL)
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.35
## sig.level = 0.1
## power = 0.7943532
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Calculate sample size
pwr.t.test(n = NULL,
d = 0.25,
sig.level = 0.05,
type = "one.sample", alternative = "greater",
power = 0.8)
##
## One-sample t test power calculation
##
## n = 100.2877
## d = 0.25
## sig.level = 0.05
## power = 0.8
## alternative = greater
Because this sample size calculation was for a one-sample test, we
only need 101 subjects in total, not 101 in each group. As you design experiments
in the future, note that the pwr package includes functions for
calculating power and sample size for a variety of different tests,
including ANOVA (more on that in the next chapter!).
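As a small taste of that, here is a sketch of an ANOVA power calculation with pwr.anova.test(); the group count and effect size below are illustrative, not from the course:
# Sample size per group for a one-way ANOVA with 3 groups (illustrative values)
pwr.anova.test(k = 3,          # number of groups
               f = 0.25,       # effect size (Cohen's f)
               sig.level = 0.05,
               power = 0.8)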
Explore the Lending Club dataset plus build and validate basic experiments, including an A/B test.
### Exploratory Data Analysis (EDA): Lending Club

A sample of 1500 observations from the Lending Club dataset has been loaded for you and is called lendingclub. Let's do some EDA on the data, in hopes that we'll learn what the dataset contains. We'll use functions from dplyr and ggplot2 to explore the data.
# Examine the variables with glimpse()
glimpse(lendingclub)
## Rows: 1,500
## Columns: 12
## $ member_id <dbl> 55096114, 1555332, 1009151, 69524202, 72128084, 53…
## $ loan_amnt <dbl> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 500…
## $ funded_amnt <dbl> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 500…
## $ term <chr> "36 months", "36 months", "60 months", "36 months"…
## $ int_rate <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.8…
## $ emp_length <chr> "10+ years", "10+ years", "3 years", "10+ years", …
## $ home_ownership <chr> "RENT", "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTG…
## $ annual_inc <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, 3…
## $ verification_status <chr> "Not Verified", "Verified", "Not Verified", "Not V…
## $ loan_status <chr> "Current", "Fully Paid", "Fully Paid", "Current", …
## $ purpose <chr> "debt_consolidation", "debt_consolidation", "home_…
## $ grade <chr> "C", "A", "B", "C", "A", "D", "C", "A", "D", "B", …
# Find median loan_amnt, mean int_rate, and mean annual_inc with summarize()
lendingclub %>% summarize(
median(loan_amnt),
mean(int_rate),
mean(annual_inc)
)
## # A tibble: 1 × 3
## `median(loan_amnt)` `mean(int_rate)` `mean(annual_inc)`
## <dbl> <dbl> <dbl>
## 1 13000 13.3 75736.
# Use ggplot2 to build a bar chart of purpose
ggplot(lendingclub, aes(purpose))+
geom_bar()+
coord_flip()
lendingclub$purpose_recode <- lendingclub$purpose %>% recode(
"credit_card" = "debt_related",
"debt_consolidation" = "debt_related",
"medical" = "debt_related",
"car" = "big_purchase",
"major_purchase" = "big_purchase",
"vacation" = "big_purchase",
"moving" = "life_change",
"small_business" = "life_change",
"wedding" = "life_change",
"house" = "home_related",
"home_improvement" = "home_related")
You can see that the original purpose variable had quite
a few levels, which were very detailed. By using recode()
here, you created purpose_recode, which collapses most of them into a much more
manageable set of broader levels (debt_related, big_purchase, life_change, and home_related;
other and renewable_energy are left as-is) that describe the purpose for people's loans.
In the last exercise, we pared the purpose variable down
to a more manageable set of broader categories and called it
purpose_recode. As a data scientist at Lending Club, we
might want to design an experiment where we examine how the loan purpose
influences the amount funded, which is the money actually issued to the
applicant.
Remember that for an ANOVA test, the null hypothesis will be that all
of the mean funded amounts are equal across the levels of
purpose_recode. The alternative hypothesis is that at least
one level of purpose_recode has a different mean. We will
not be sure which, however, without some post hoc analysis, so it will
be helpful to know how ANOVA results get stored as an object in R.
# Build a linear regression model, purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)
# Examine results of purpose_recode_model
summary(purpose_recode_model)
##
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14472 -6251 -1322 4678 25761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9888.1 1248.9 7.917 4.69e-15 ***
## purpose_recodedebt_related 5433.5 1270.5 4.277 2.02e-05 ***
## purpose_recodehome_related 4845.0 1501.0 3.228 0.00127 **
## purpose_recodelife_change 4095.3 2197.2 1.864 0.06254 .
## purpose_recodeother -649.3 1598.3 -0.406 0.68461
## purpose_recoderenewable_energy -1796.4 4943.3 -0.363 0.71636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared: 0.03473, Adjusted R-squared: 0.0315
## F-statistic: 10.75 on 5 and 1494 DF, p-value: 3.598e-10
# Get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova(purpose_recode_model)
# Print purpose_recode_anova
purpose_recode_anova
## Analysis of Variance Table
##
## Response: funded_amnt
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.6888e+09 737756668 10.75 3.598e-10 ***
## Residuals 1494 1.0253e+11 68629950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Examine class of purpose_recode_anova
class(purpose_recode_anova)
## [1] "anova" "data.frame"
Based on the very small p-value, purpose_recode_anova's
results indicate that there is evidence to support the hypothesis that
the mean funded amounts are different for at least one level of
purpose_recode. You also saw that the ANOVA
results are saved as a data frame, which is nice in case you need to
access results later. Loans aren't issued in a vacuum, however, and it's
likely that more than just purpose influences the amount funded.
Before we examine other factors besides purpose_recode
that might influence the amount of loan funded, let’s examine which
means of purpose_recode are different. This is the post-hoc
test referred to in the last exercise.
The result of that ANOVA test was statistically significant, with a very low p-value. This means we can reject the null hypothesis in favor of the alternative that at least one mean is different. But which one?
We should use Tukey's HSD test, which stands for Honest Significant
Difference. To conduct Tukey's HSD test in R, you can use
TukeyHSD():
TukeyHSD(aov_model, "explanatory_factor_name", conf.level = 0.9)
This would conduct Tukey's HSD test on some aov_model,
computing pairwise comparisons between the levels of a specific explanatory factor,
with a conf.level of 90%.
# Use aov() to build purpose_aov
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)
# Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD(purpose_aov, "purpose_recode", conf.level = 0.95)
# Tidy tukey_output to make sense of the results
tidy(tukey_output)
## # A tibble: 15 × 7
## term contrast null.value estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 purpose_recode debt_relat… 0 5434. 1808. 9059. 2.91e-4
## 2 purpose_recode home_relat… 0 4845. 562. 9128. 1.61e-2
## 3 purpose_recode life_chang… 0 4095. -2174. 10365. 4.25e-1
## 4 purpose_recode other-big_… 0 -649. -5210. 3911. 9.99e-1
## 5 purpose_recode renewable_… 0 -1796. -15902. 12309. 9.99e-1
## 6 purpose_recode home_relat… 0 -589. -3056. 1879. 9.84e-1
## 7 purpose_recode life_chang… 0 -1338. -6539. 3863. 9.78e-1
## 8 purpose_recode other-debt… 0 -6083. -9005. -3160. 5.32e-8
## 9 purpose_recode renewable_… 0 -7230. -20894. 6434. 6.58e-1
## 10 purpose_recode life_chang… 0 -750. -6429. 4929. 9.99e-1
## 11 purpose_recode other-home… 0 -5494. -9201. -1787. 3.58e-4
## 12 purpose_recode renewable_… 0 -6641. -20494. 7212. 7.46e-1
## 13 purpose_recode other-life… 0 -4745. -10636. 1147. 1.95e-1
## 14 purpose_recode renewable_… 0 -5892. -20482. 8698. 8.59e-1
## 15 purpose_recode renewable_… 0 -1147. -15088. 12794. 1.00e+0
Looking at the p-values for each comparison of the levels of
purpose_recode, we can see that only a few of the mean
differences are statistically significant, for example the difference
in means for the debt_related and
big_purchase loan amounts. In this case, these tiny
p-values are most likely due to the large sample size, and further
tests would be required to determine what's actually meaningful in the
case of loans (known as practical significance).
We tested whether the purpose of a loan affects loan amount funded and found that it does. However, we also know that it’s unlikely that loans are funded based only on their intended purpose. It’s more likely that the company is looking at a holistic picture of an applicant before they decide to issue a loan.
We can examine more than one explanatory factor in a multiple factor
experiment. Like our experiments on ToothGrowth from
Chapter 1, an experimenter can try and control two (or more!) different
factors and see how they affect the outcome. We’re using open data, so
we can’t quite control the factors here (they’re submitted as someone
fills out their loan application), but let’s look at how a few other
factors affect loan amount funded.
# Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov(funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
# Print purpose_emp_aov to the console
purpose_emp_aov
## Call:
## aov(formula = funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
##
## Terms:
## purpose_recode emp_length Residuals
## Sum of Squares 3688783338 2044273211 100488872355
## Deg. of Freedom 5 11 1483
##
## Residual standard error: 8231.679
## Estimated effects may be unbalanced
# Call summary() to see the p-values
summary(purpose_emp_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.689e+09 737756668 10.888 2.63e-10 ***
## emp_length 11 2.044e+09 185843019 2.743 0.00161 **
## Residuals 1483 1.005e+11 67760534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Call summary() to see the p-values in a tidier way
tidy(purpose_emp_aov)
## # A tibble: 3 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 purpose_recode 5 3688783338. 737756668. 10.9 2.63e-10
## 2 emp_length 11 2044273211. 185843019. 2.74 1.61e- 3
## 3 Residuals 1483 100488872355. 67760534. NA NA
Excellent! You could also perform Tukey’s HSD test on this model, but given that emp_length has 12 levels, it’ll be quite the output. If it was important to the experiment to know the answer, you’d definitely need to look at it.
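If you did want to look, a sketch of that call (using the purpose_emp_aov model fit above) would be:
# Pairwise comparisons for emp_length; with 12 levels this is 66 rows of output
tidy(TukeyHSD(purpose_emp_aov, "emp_length"))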
Let’s do some EDA with our experiment in mind. Lending Club has now
asked you, their data scientist, to examine what effect their Lending
Club-assigned loan grade variable has on the interest rate,
int_rate. They’re interested to see if the grade they
assign the applicant during the process of applying for the loan affects
the interest rate ultimately assigned to the applicant during the
repayment process.
The lendingclub data has been loaded for you, as has
dplyr and ggplot2.
# Examine the summary of int_rate
summary(lendingclub$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.32 9.99 12.99 13.31 16.29 26.77
# Examine int_rate by grade
lendingclub %>%
group_by(grade) %>%
summarize(mean = mean(int_rate), var = var(int_rate), median = median(int_rate))
## # A tibble: 7 × 4
## grade mean var median
## <chr> <dbl> <dbl> <dbl>
## 1 A 7.27 0.961 7.26
## 2 B 10.9 2.08 11.0
## 3 C 14.0 1.42 14.0
## 4 D 17.4 1.62 17.6
## 5 E 20.1 2.71 20.0
## 6 F 23.6 2.87 23.5
## 7 G 26.1 0.198 25.9
ggplot(lendingclub, aes(grade, int_rate))+
geom_boxplot()
# Use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate ~ grade, lendingclub)
summary(grade_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## grade 6 27013 4502 2637 <2e-16 ***
## Residuals 1493 2549 2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(grade_aov)
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 grade 6 27013. 4502. 2637. 0
## 2 Residuals 1493 2549. 1.71 NA NA
Excellent job! You can see from the numeric summary and the boxplot
that grade seems to heavily influence interest rate. Therefore, the
linear model results indicating that int_rate is
significantly different by grade are unsurprising.
In the last exercise, we found that int_rate does differ
by grade. Now we should validate this model, which for
linear regression means examining the Residuals vs. Fitted and Normal
Q-Q plots.
If you call plot() on a model object in R, it will
automatically plot both of those plots plus two more. You’ll interpret
these plots to evaluate model fit. We discussed how to do this in the
video.
Another assumption of ANOVA and linear modeling is homogeneity of
variance. Homogeneity means “same”, and here that would mean that the
variance of int_rate is the same for each level of grade.
We can test for homogeneity of variances using
bartlett.test(), which takes a formula and a dataset as
inputs.
# For a 2x2 grid of plots:
par(mfrow=c(2, 2))
# Plot grade_aov
plot(grade_aov)
# Bartlett's test for homogeneity of variance
bartlett.test(int_rate ~ grade , lendingclub)
##
## Bartlett test of homogeneity of variances
##
## data: int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15
Excellent! The residuals on this model are okay, though the residuals
on G have a much smaller range than any other level of
grade (the dots are far less spread out.) The Q-Q plot,
however, shows that the residuals are fairly normal. However, given the
highly significant p-value from Bartlett’s test, the assumption of
homogeneity of variances is violated, which is one of the assumptions of
an ANOVA model. Therefore, ANOVA might not be the best choice for this
experiment. It happens!
Given that we found in the last exercise that the homogeneity of variance assumption of linear modeling was violated, we may want to try an alternative.
One non-parametric alternative to ANOVA is the Kruskal-Wallis rank
sum test. For those with some statistics knowledge, it is an extension
of the Mann-Whitney U test for when there are more than two groups, like
with our grade variable. For us, the null hypothesis for
this test would be that int_rate has the same distribution (the same
mean rank) across the levels of grade.
The Kruskal-Wallis rank sum test can be conducted using the
kruskal.test() function, available in base R. Luckily for
you, the use of this function is very similar to using lm()
or aov(): you input a formula and a dataset, and a result
is returned.
# Conduct the Kruskal-Wallis rank sum test
kruskal.test(int_rate ~ grade,
data = lendingclub)
##
## Kruskal-Wallis rank sum test
##
## data: int_rate by grade
## Kruskal-Wallis chi-squared = 1365.5, df = 6, p-value < 2.2e-16
Good job! The low p-value indicates that based on this test, we can
be confident in our result, which we found across this experiment, that
int_rate varies by grade.
We’ll be testing the mean loan_amnt, which is the
requested amount of money the loan applicants ask for, based on which
color header (green or blue) that they saw on the Lending Club
website.
Which statistical test should we use after we’ve collected enough data?
Excellent! We’ll be looking at the mean difference in the amount of loan the applicants asked for, so a t-test is appropriate.
We know now that we need to analyze our A/B test results with a t-test after we’ve collected data. We have two pretty important questions we need to answer before we do that: what’s the effect size and what’s the sample size required for this test?
In this case, effect size was given to us. Lending Club is looking to
detect the relatively small effect size of 0.2. We’ll again use the
pwr package and calculate sample size using an appropriate
function to find out how many we’ll need to recruit into each group, A
and B.
# The pwr package is loaded above
# Use the correct function from pwr to find the sample size
pwr.t.test(n = NULL,
d = 0.2,
sig.level = 0.05,
power = 0.8,
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 393.4057
## d = 0.2
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Nice! You can see we need about 400 people per group to reach our desired power in this A/B test.
Now that we know the sample size required, and we allowed the experiment to run long enough to get at least 400 people in each group, we can analyze our A/B test.
Remember that when applicants were using the Lending Club website,
they were randomly assigned to two groups, A or B, where A was shown a
mint green website header and B was shown a light blue website header.
Lending Club was interested to see if website header color choice
influenced loan_amnt, the amount an applicant asked to
borrow.
A new dataset, lendingclub_ab is available in your
workspace. The A/B test was run until there were 500 applicants in each
group. Each applicant has been labeled as group A or B. Conduct the
proper test to see if the mean of loan_amnt is different
between the two groups.
# Make up lendingclub_ab data
lendingclub_ab <- lendingclub %>%
mutate( Group = rep(LETTERS[1:2], len = 1500 )) %>%
group_by(Group) %>%
slice_sample(n = 500)
# Confirm each group has got 500 applicants
lendingclub_ab %>% group_by(Group) %>% count()
## # A tibble: 2 × 2
## # Groups: Group [2]
## Group n
## <chr> <int>
## 1 A 500
## 2 B 500
# Plot the A/B test results
ggplot(lendingclub_ab, aes(Group, loan_amnt))+
geom_boxplot()
# Conduct a two-sided t-test
t.test(loan_amnt ~ Group, data = lendingclub_ab)
##
## Welch Two Sample t-test
##
## data: loan_amnt by Group
## t = -0.60117, df = 997.79, p-value = 0.5479
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -1373.294 729.194
## sample estimates:
## mean in group A mean in group B
## 14774.10 15096.15
Excellent! By looking at both the boxplot and the results of the
t-test, it seems that there is no compelling evidence to support the
hypothesis that there is a difference between the two A/B test groups' mean
loan_amnt, a result which you would use to help make
data-driven decisions at Lending Club.
The point of an A/B test is that only one thing is changed and the
effect of that change is measured. We saw this with our examples in the
video and the last few exercises. On the other hand, a multivariate
experiment, such as the ToothGrowth experiment from chapter
1, is where a few things are changed (and is similar to a multiple
factor experiment, which we covered earlier in this chapter.)
A Lending Club multivariate test can combine all of the explanatory
variables we’ve looked at in this chapter. Let’s examine how
Group, grade, and
verification_status affect loan_amnt in the
lendingclub_ab dataset.
# Build lendingclub_multi
lendingclub_multi <- lm(loan_amnt ~ Group + grade + verification_status, data = lendingclub_ab)
# Examine lendingclub_multi results
tidy(lendingclub_multi)
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11900. 751. 15.8 1.37e-50
## 2 GroupB 167. 504. 0.332 7.40e- 1
## 3 gradeB -1936. 765. -2.53 1.15e- 2
## 4 gradeC -1341. 786. -1.71 8.84e- 2
## 5 gradeD -38.0 870. -0.0437 9.65e- 1
## 6 gradeE 1892. 1082. 1.75 8.07e- 2
## 7 gradeF 2327. 1687. 1.38 1.68e- 1
## 8 gradeG 8337. 4643. 1.80 7.29e- 2
## 9 verification_statusSource Verified 3776. 646. 5.85 6.73e- 9
## 10 verification_statusVerified 6314. 645. 9.79 1.11e-21
From the results, verification status and grade (most clearly grade
B here) are the factors in this model that have a significant
effect on loan amount. Let's move on to the next chapter and conduct
more multivariable experiments like this.
Use the NHANES data to build RCBD and BIBD experiments, including model validation and design tips to make sure the BIBD is valid.
NHANES = National Health and Nutrition Examination Survey
Probability Sampling: probability is used to select the sample (in various ways)
Non-probability Sampling: probability is not used to select the sample
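As a rough sketch of the distinction, using the nhanes_final data loaded above: a probability sample gives every row a known chance of selection, while a convenience (non-probability) sample does not.
# Probability sample: 100 rows drawn at random
nhanes_prob <- nhanes_final %>% slice_sample(n = 100)
# Non-probability (convenience) sample: just the first 100 rows, however they happen to be stored
nhanes_conv <- nhanes_final %>% slice_head(n = 100)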
As downloaded from the NHANES
website, the NHANES datasets are available only as separate .XPT
files, a native format to SAS. Luckily for us, the haven
package exists.
Let’s combine the NHANES Demographics, Medical Conditions, and Body
Measures datasets, available in their raw .XPT format and accessible
through the variables DEMO_file, MCQ_file, and
BMX_file. Join all 3 datasets using the SEQN
variable. A good way to do this is using Reduce(), which
allows you to combine elements in a helpful way.
The joining code, which is provided for you, does the following:
- Imports the three datasets (nhanes_demo, nhanes_medical, nhanes_bodymeasures).
- Uses Reduce() to inner join all 3 datasets with the "SEQN" variable.
- Creates the nhanes_combined dataset.

Note: the code below is for the .XPT files demo; the real datasets used in this R project are the CSV files loaded in the Libraries and Data section above.
# haven package is loaded above
# Import the three datasets using read_xpt()
nhanes_demo <- read_xpt(DEMO_file)
nhanes_medical <- read_xpt(MCQ_file)
nhanes_bodymeasures <- read_xpt(BMX_file)
Awesome! Now that we have the NHANES data assembled, let’s get to work on EDA & cleaning.
Let’s examine our newly constructed dataset with a mind toward EDA.
As in the last chapter, it’s a good idea to look at both numerical
summary measures and visualizations. These help with understanding data
and are a good way to find data cleaning steps you may have missed. The
nhanes_combined dataset has been pre-loaded for you.
Say we have access to NHANES patients and want to conduct a study on the effect of being told by a physician to reduce calories/fat in their diet on weight. This is our treatment; we’re pretending that instead of this being a question asked of the patient, we randomly had physicians counsel some patients on their nutrition. However, we suspect that there may be a difference in weight based on the gender of the patient - a blocking factor!
# Fill in the dplyr code
nhanes_combined %>%
group_by(mcq365d) %>%
summarize(mean = mean(bmxwt, na.rm = TRUE))
## # A tibble: 4 × 2
## mcq365d mean
## <dbl> <dbl>
## 1 1 90.7
## 2 2 76.5
## 3 9 90.8
## 4 NA 33.5
# Fill in the ggplot2 code
nhanes_combined %>%
ggplot(aes(as_factor(mcq365d), bmxwt)) +
geom_boxplot() +
labs(x = "Treatment",
y = "Weight")
## Warning: Removed 99 rows containing non-finite values (stat_boxplot).
Great! Now we have an idea of some of the problems with the
data: children weren't given the treatment, which is why we
see an NA treatment category, and some patients have
missing weights, which is why the boxplot throws a warning. It's time for
some data cleaning.
During data cleaning, we discovered that no one under the age of 16 was given the treatment. Recall that we’re pretending that the variable that indicates if a doctor has ever advised them to reduce fat or calories in their diet is purposeful nutrition counseling, our treatment. Let’s only keep patients who are greater than 16 years old in the dataset.
You also may have noticed that the default settings in
ggplot2 delete any observations with a missing dependent
variable, in this case, body weight. One option for dealing with the
missing weights, imputation, can be implemented using the
simputation package. Imputation is a technique for dealing
with missing values where you replace them either with a summary
statistic, like mean or median, or use a model to predict a value to
use.
We’ll use impute_median(), which takes a dataset and the
variable to impute or formula to impute by as arguments. For example,
impute_median(ToothGrowth, len ~ dose) would fill in any
missing values in the variable len with the median value
for len by dose. So, if a guinea pig who
received a dose of 2.0 had a missing value for the len
variable, it would be filled in with the median len for
those guinea pigs with a dose of 2.0.
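Here's a minimal sketch of that ToothGrowth example, pretending one len value is missing (the real data has none):
tg <- ToothGrowth
tg$len[1] <- NA                       # pretend this guinea pig's length is missing
tg <- impute_median(tg, len ~ dose)   # filled with the median len for its dose group
head(tg, 3)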
# Filter to keep only those 16+
nhanes_filter <- nhanes_combined %>% filter(ridageyr > 16)
# simputation is loaded, impute bmxwt by riagendr
nhanes_final <- simputation::impute_median(nhanes_filter, bmxwt ~ riagendr)
# Recode mcq365d with recode() & examine with count()
nhanes_final$mcq365d <- recode(nhanes_final$mcq365d,
`1` = 1,
`2` = 2,
`9` = 2)
nhanes_final %>% count(mcq365d)
## # A tibble: 2 × 2
## mcq365d n
## <dbl> <int>
## 1 1 1802
## 2 2 4085
Excellent! Imputation is a powerful tool for dealing with missing data, but should be used with caution, as you can introduce bias into your data if you’re not careful how you impute. Now that we have the dataset cleaned, we’re ready to learn about RCBDs so we can analyze our experiment.
The NHANES data is collected on sampled units (people) specifically
selected to represent the U.S. population. However, let’s resample the
nhanes_final dataset in different ways so we get a feel for
the different sampling methods.
We can conduct a simple random sample using sample_n()
from dplyr. It takes as input a dataset and the number of rows
to sample.
Stratified sampling can be done by combining group_by()
and sample_n(). The function will sample n from each of the
groups specified in the group_by().
The sampling package's cluster() creates cluster
samples. It takes in a dataset, the name of the cluster variable
passed as a character vector (e.g. c("variable")), the number of
clusters to select, and a method.
# Use sample_n() to create nhanes_srs
nhanes_srs <- nhanes_final %>% sample_n(2500)
# Create nhanes_stratified with group_by() and sample_n()
nhanes_stratified <- nhanes_final %>%
group_by(riagendr) %>%
sample_n(2000)
nhanes_stratified %>%
count(riagendr)
## # A tibble: 2 × 2
## # Groups: riagendr [2]
## riagendr n
## <dbl> <int>
## 1 1 2000
## 2 2 2000
# Load sampling package and create nhanes_cluster with cluster()
nhanes_cluster <- cluster(nhanes_final, "indhhin2", 6, method = "srswor")
nhanes_cluster %>%
count(indhhin2)
## indhhin2 n
## 1 1 135
## 2 5 356
## 3 6 624
## 4 9 353
## 5 10 291
## 6 13 90
Excellent! These are some basic sampling methods you can use on your data to create the different kinds of samples that may be necessary in an experiment.
As discussed in the video, the purpose of blocking an experiment is to make the experimental groups more like one another. Groups are blocked by a variable that is known to introduce variability that will affect the outcome of the experiment but is not of interest to study in the experiment itself.
A rule of thumb in experimental design is often “block what you can, randomize what you cannot”, which means you should aim to block the effects you can control for (e.g. sex) and randomize on those you cannot (e.g. smoking status). Variability inside a block is expected to be fairly small, but variability between blocks will be larger.
Which of the following would NOT make a good blocking factor for an experiment?
Excellent! Current pregnancy is not really something the experimenter can control, and they’re also not likely to have enough subjects pregnant to justify blocking by that factor (unless they sampled for pregnant individuals in the first place!)
The agricolae package is very helpful when you want to
“draw” out the design of an experiment for yourself using R. It can draw
many different kinds of experiments, including a randomized complete
block design. Here’s an example of one:
## [,1] [,2] [,3] [,4]
## [1,] "D" "A" "C" "B"
## [2,] "A" "D" "C" "B"
## [3,] "C" "D" "B" "A"
## [4,] "B" "D" "A" "C"
In this RCBD, we have 4 blocks (each row of the output). Inside of each block, each treatment "A", "B", "C", and "D" is used, because this is a complete design. So if these 4 blocks/rows of the output were four of a farmer's fields, the first field would get the "D" treatment in the first season, then "A", then "C", then "B".
Let’s draw an RCBD design with 5 treatments and 4 blocks, which go in
the r argument. The agricolae package has been
loaded for you.
# Create designs using ls()
designs <- ls("package:agricolae", pattern = "design")
designs
## [1] "design.ab" "design.alpha" "design.bib" "design.crd"
## [5] "design.cyclic" "design.dau" "design.graeco" "design.lattice"
## [9] "design.lsd" "design.mat" "design.rcbd" "design.split"
## [13] "design.strip" "design.youden"
# Examine the structure of a design.rcbd() object (the example above): a list with parameters, sketch, and book
str(design.rcbd(LETTERS[1:4], r = 4, serie = 0, seed = 42))
## List of 3
## $ parameters:List of 7
## ..$ design: chr "rcbd"
## ..$ trt : chr [1:4] "A" "B" "C" "D"
## ..$ r : num 4
## ..$ serie : num 0
## ..$ seed : num 42
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## $ sketch : chr [1:4, 1:4] "D" "A" "C" "B" ...
## $ book :'data.frame': 16 obs. of 3 variables:
## ..$ plots: num [1:16] 11 12 13 14 21 22 23 24 31 32 ...
## ..$ block: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
## ..$ trt : Factor w/ 4 levels "A","B","C","D": 4 1 3 2 1 4 3 2 3 4 ...
# Build treats and blocks
treats <- LETTERS[1:5]
blocks <- 4
# Build my_design_rcbd and view the sketch
my_design_rcbd <- design.rcbd(trt = treats,
r = blocks,
seed = 42)
my_design_rcbd$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "D" "A" "C" "B" "E"
## [2,] "E" "A" "C" "D" "B"
## [3,] "D" "B" "E" "A" "C"
## [4,] "B" "D" "E" "C" "A"
Nice! Now that you have a better idea of what a RCBD looks like, let’s try a few examples, including one with the NHANES data we cleaned.
Recall that our blocked experiment involved a treatment wherein the
doctor asks the patient to reduce their fat or calories in their diet,
and we’re testing the effect this has on weight (bmxwt). We
plan to block by gender, which in the NHANES dataset is stored as
riagendr. Recall that blocking is done to create
experimental groups that are as similar as possible. Blocking this
experiment by gender means that if we observe an effect of the treatment
on bmxwt, it’s more likely that the effect was actually due
to the treatment versus the individual’s gender.
In your R code, you denote a blocked experiment by using a formula
that looks like: outcome ~ treatment + blocking_factor in
the appropriate modeling function.
nhanes_final is available.
# Use aov() to create nhanes_rcbd
nhanes_rcbd <- aov(bmxwt ~ mcq365d + riagendr, data = nhanes_final)
# Check results of nhanes_rcbd with summary()
summary(nhanes_rcbd)
## Df Sum Sq Mean Sq F value Pr(>F)
## mcq365d 1 229164 229164 571.2 <2e-16 ***
## riagendr 1 163069 163069 406.4 <2e-16 ***
## Residuals 5884 2360774 401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print mean weights by mcq365d and riagendr
nhanes_final %>%
group_by(mcq365d, riagendr) %>%
summarize(mean_wt = mean(bmxwt, na.rm = TRUE))
## `summarise()` has grouped output by 'mcq365d'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: mcq365d [2]
## mcq365d riagendr mean_wt
## <dbl> <dbl> <dbl>
## 1 1 1 95.2
## 2 1 2 86.6
## 3 2 1 82.7
## 4 2 2 71.3
Nice! It’s pretty clear that there truly is a mean difference in
weight by gender, so blocking was a good call for this experiment. We
also observed a statistically significant effect of the treatment on
bmxwt, which we hope is actually a result of the treatment.
Now that we have the RCBD down, let’s tackle Balanced Incomplete Block
Designs (BIBD).
As we did in the last chapter (and when building any model!) it’s a good idea to validate the results. We’ll examine the Residuals vs. Fitted and Normal Q-Q plots, though now we’ll only see a Constant Leverage plot in place of the other two. A good model has a Q-Q plot showing an approximately normal distribution and no clear patterns across blocks or treatments in the others.
We can also look at Interaction plots. We hope to see parallel lines, no matter which of the block or the treatment is on the x-axis. If they are, they satisfy a key assumption of the RCBD model called Additivity.
The nhanes_rcbd model object from the last exercise has
been loaded for you. Examine the results with
summary(nhanes_rcbd) in the console if you need a
refresher.
# Set up the 2x2 plotting grid and plot nhanes_rcbd
par(mfrow = c(2, 2))
plot(nhanes_rcbd)
# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(mcq365d, riagendr, bmxwt))
# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(riagendr, mcq365d, bmxwt))
Excellent! The initial diagnostic plots show that this model is pretty good but not great - especially at the larger end of the data, the Q-Q plot shows the data might not be normal. The interaction plots show nearly parallel lines, so we can move forward with this model.
We saw in the video that it’s possible a BIBD doesn’t exist at all.
It’s useful to calculate the lambda beforehand, and if the
result isn’t an integer, there isn’t a possible BIBD.
Don't bust out the pen and paper yet: the lambda()
function has been defined for you. It takes as input t =
number of treatments, k = number of treatments per block,
and r = number of repetitions. It then calculates
lambda according to this formula:
r(k - 1) / (t - 1).
Try the different combinations of t, k, and
r using lambda() in your console, and choose
the answer which does NOT have a BIBD.
lambda <- function(t, k, r) {
return((r*(k-1)) / (t-1))
}
# Calculate lambda
lambda(2,2,2)
## [1] 2
lambda(2,3,4 )
## [1] 8
lambda(3,4,11)
## [1] 16.5
lambda(12,2,22)
## [1] 2
Possible Answers
t = 2, k = 2, r = 2
t = 2, k = 3, r = 4
t = 3, k = 4, r = 11
t = 12, k = 2, r = 22
Good job! When designing experiments, you could build your own
version of the custom function lambda, or use agricolae to
figure it out visually, as we’ll see in the next exercise.
We can also use agricolae to draw BIBDs.
design.bib() takes, at minimum, the treatments
(treats), an integer k giving the block size (the
number of treatments that appear in each block), and a seed as inputs.
The main thing you should notice about a BIBD is that not every treatment is used in every block of the output.
From the video and the last exercise, however, you know that
sometimes a BIBD isn’t valid and that you have to do a little math to be
sure your BIBD design is possible. design.bib() will return
an error message letting you know if a design is not valid.
Let’s draw a few BIBDs with agricolae so we can see the
different warning messages and errors the package provides.
#create my_design_bibd_1
my_design_bibd_1 <- design.bib(LETTERS[1:3], k = 4, seed = 42)
## Error in AlgDesign::optBlock(~., withinData = factor(1:v), blocksizes = rep(k, : The number of trials must be at least as large as the minimum blocksize.
#create my_design_bibd_2
my_design_bibd_2 <- design.bib(LETTERS[1:8], k = 3, seed = 42)
## Error in rep(k, b): invalid 'times' argument
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:4], k = 4, seed = 42)
##
## Parameters BIB
## ==============
## Lambda : 2
## treatmeans : 4
## Block size : 4
## Blocks : 2
## Replication: 2
##
## Efficiency factor 1
##
## <<< Book >>>
my_design_bibd_3$sketch
## [,1] [,2] [,3] [,4]
## [1,] "C" "A" "D" "B"
## [2,] "C" "D" "B" "A"
Nice! You saw two different function errors which help lead you in
the right direction with your design, plus one that works. When the
design does work, the sketch element shows the design;
each row of the sketch is one block.
To be sure we truly understand what a BIBD looks like, let’s build a dataset containing a BIBD from scratch.
Say we want to test the difference between four different wet foods in cats’ diets on their kidney function. Cat food, however, is expensive, so we’ll only test 3 foods per block to save some money. Our cats will be blocked by color of cat, as we aren’t interested in that as part of our experiment. The outcome will be measured blood creatinine level, an indicator of kidney function and dysfunction in cats and humans alike.
lambda <- function(t, k, r) {
return((r*(k-1)) / (t-1))
}
# Calculate lambda
lambda(4, 3, 3)
## [1] 2
# Build the data.frame
creatinine <- c(1.98, 1.97, 2.35, 2.09, 1.87, 1.95, 2.08, 2.01, 1.84, 2.06, 1.97, 2.22)
food <- as.factor(c("A", "C", "D", "A", "B", "C", "B", "C", "D", "A", "B", "D"))
color <- as.factor(rep(c("Black", "White", "Orange", "Spotted"), each = 3))
# Note: cbind() coerces the factors to integers, so food and color become numeric
# columns here (1 df each in the ANOVA below); data.frame(creatinine, food, color)
# would keep them as factors
cat_experiment <- as.data.frame(cbind(creatinine, food, color))
# Create cat_model and examine with summary()
cat_model <- aov(creatinine ~ food + color, data = cat_experiment)
summary(cat_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## food 1 0.01204 0.012042 0.530 0.485
## color 1 0.00697 0.006971 0.307 0.593
## Residuals 9 0.20461 0.022735
Purrfect! It seems there are no differences by type of wet food in kidney function. Hopefully now you can see how a BIBD comes to life!
Let’s jump back into the NHANES data and pretend we have access to
NHANES patients ages 18-45. Blocking the adults by race, stored in
NHANES as ridreth1, we prescribe to our groups either no
particular upper body weightlifting regimen, a weightlifting regimen, or
a weightlifting regimen plus a prescribed daily vitamin supplement. This
information is stored in a variable called
weightlift_treat.
Those funding the study decide they want it to be a BIBD where only 2
treatments appear in each block. The outcome, arm circumference, is
stored as bmxarmc. The nhanes_final data is
loaded for you.
# Make up weightlift_treat variable (left as a numeric code here, hence 1 df in the ANOVA below)
nhanes_final <- nhanes_final %>%
mutate(weightlift_treat = rep(1:3, length = 5887))
# Calculate lambda
lambda(3, 2, 2)
## [1] 1
# Create weightlift_model & examine results
weightlift_model <- aov(bmxarmc ~ weightlift_treat + ridreth1, data = nhanes_final)
summary(weightlift_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## weightlift_treat 1 4 3.8 0.138 0.711
## ridreth1 1 1304 1304.3 47.260 6.9e-12 ***
## Residuals 5571 153756 27.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 313 observations deleted due to missingness
Nice! As it turns out, the weight lifting regimen doesn’t seem to have a significant effect on arm circumference when the patient population is blocked by race.
Evaluate the NYC SAT scores data and deal with its missing values, then evaluate Latin Square, Graeco-Latin Square, and Factorial experiments.
nyc_scores is an NYC open dataset
Math is a subject the U.S. is consistently behind the rest of the world on, so our experiments will focus on Math score. While the original dataset is an open dataset downloaded from Kaggle, throughout this chapter I will add a few variables that will allow you to pretend you are an education researcher conducting experiments ideally aimed at raising students’ scores, hopefully increasing the likelihood they will be admitted to colleges.
Before diving into analyzing the experiments, we should do some EDA
to make sure we fully understand the nyc_scores data. In
this lesson, we’ll do experiments where we block by Borough
and Teacher_Education_Level, so let’s examine math scores
by those variables. The nyc_scores dataset has been loaded
for you.
# Make up data to include a column on Teacher_Education_Level
nyc_scores <- nyc_scores %>%
mutate(
Teacher_Education_Level = as_factor( rep( c("BA", "College Student", "Grad Student", "MA" ,"PhD"),
c(91,90,94,125,35)))
)
# Mean, var, and median of Math score
nyc_scores %>%
group_by(Borough) %>%
summarize(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
var = var(Average_Score_SAT_Math, na.rm = TRUE),
median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 × 4
## Borough mean var median
## <chr> <dbl> <dbl> <dbl>
## 1 Bronx 404. 2726. 396.
## 2 Brooklyn 416. 3658. 395
## 3 Manhattan 456. 7026. 433
## 4 Queens 462. 5168. 448
## 5 Staten Island 486. 6911. 466.
# Mean, var, and median of Math score by Teacher Education Level
nyc_scores %>%
group_by(Teacher_Education_Level) %>%
summarize(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
var = var(Average_Score_SAT_Math, na.rm = TRUE),
median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 × 4
## Teacher_Education_Level mean var median
## <fct> <dbl> <dbl> <dbl>
## 1 BA 456. 6585. 433
## 2 College Student 416. 4318. 400.
## 3 Grad Student 428. 4696. 416.
## 4 MA 430. 4792. 404
## 5 PhD 441. 4802. 418
# Mean, var, and median of Math score by both
nyc_scores %>%
group_by(Borough, Teacher_Education_Level) %>%
summarize(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
var = var(Average_Score_SAT_Math, na.rm = TRUE),
median = median(Average_Score_SAT_Math, na.rm = TRUE))
## `summarise()` has grouped output by 'Borough'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 5
## # Groups: Borough [5]
## Borough Teacher_Education_Level mean var median
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 Bronx College Student 394. 851. 390
## 2 Bronx Grad Student 416. 4630. 405
## 3 Brooklyn Grad Student 407. 2005. 397
## 4 Brooklyn MA 419. 4051. 395
## 5 Manhattan BA 456. 6585. 433
## 6 Manhattan College Student 453. 10701. 433
## 7 Queens Grad Student 484. 4475. 478
## 8 Queens MA 476. 5498. 467
## 9 Queens PhD 441. 4802. 418
## 10 Staten Island College Student 486. 6911. 466.
Great job! Now that we’ve examined the data, we can move on to cleaning it, the next important step before analysis.
If we want to use SAT scores as our outcome, we should examine
missingness. Examine the pattern of missingness across all the variables
in nyc_scores using miss_var_summary() from
the naniar package. naniar integrates with
Tidyverse code styling, including the pipe operator
(%>%).
There are 60 missing scores in each subject. Though there are many R
packages which help with more advanced forms of imputation, such as
MICE, Amelia, and mi, we will
continue to use simputation and
impute_median().
Create a new dataset, nyc_scores_2, by imputing Math
score by Borough, but note that impute_median() returns the
imputed variable as type "impute". You'll convert the variable to
numeric in a separate step.
simputation and dplyr are loaded.
# naniar package is loaded
# Examine missingness with miss_var_summary()
nyc_scores %>% miss_var_summary()
## # A tibble: 23 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 Average_Score_SAT_Math 60 13.8
## 2 Average_Score_SAT_Reading 60 13.8
## 3 Average_Score_SAT_Writing 60 13.8
## 4 Percent_Tested 49 11.3
## 5 Student_Enrollment 7 1.61
## 6 Percent_White 7 1.61
## 7 Percent_Black 7 1.61
## 8 Percent_Hispanic 7 1.61
## 9 Percent_Asian 7 1.61
## 10 Start_Time 4 0.920
## # … with 13 more rows
# Examine missingness with md.pattern()
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(nyc_scores) # A function from the mice package
## School_ID School_Name Borough Building_Code Street_Address City State
## 374 1 1 1 1 1 1 1
## 11 1 1 1 1 1 1 1
## 42 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Zip_Code Latitude Longitude Phone_Number Teacher_Education_Level Start_Time
## 374 1 1 1 1 1 1
## 11 1 1 1 1 1 1
## 42 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 3 1 1 1 1 1 0
## 0 0 0 0 0 4
## End_Time Student_Enrollment Percent_White Percent_Black Percent_Hispanic
## 374 1 1 1 1 1
## 11 1 1 1 1 1
## 42 1 1 1 1 1
## 4 1 0 0 0 0
## 1 0 1 1 1 1
## 3 0 0 0 0 0
## 4 7 7 7 7
## Percent_Asian Percent_Tested Average_Score_SAT_Math
## 374 1 1 1
## 11 1 1 0
## 42 1 0 0
## 4 0 0 0
## 1 1 1 1
## 3 0 0 0
## 7 49 60
## Average_Score_SAT_Reading Average_Score_SAT_Writing
## 374 1 1 0
## 11 0 0 3
## 42 0 0 4
## 4 0 0 9
## 1 1 1 2
## 3 0 0 11
## 60 60 272
# Impute the Math score by Borough
nyc_scores_2 <- simputation::impute_median(nyc_scores, Average_Score_SAT_Math ~ Borough)
# Convert Math score to numeric
nyc_scores_2$Average_Score_SAT_Math <- as.numeric(nyc_scores_2$Average_Score_SAT_Math)
# Examine scores by Borough in both datasets, before and after imputation
nyc_scores %>%
group_by(Borough) %>%
summarize(median = median(Average_Score_SAT_Math, na.rm = TRUE),
mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 × 3
## Borough median mean
## <chr> <dbl> <dbl>
## 1 Bronx 396. 404.
## 2 Brooklyn 395 416.
## 3 Manhattan 433 456.
## 4 Queens 448 462.
## 5 Staten Island 466. 486.
nyc_scores_2 %>%
group_by(Borough) %>%
summarize(median = median(Average_Score_SAT_Math),
mean = mean(Average_Score_SAT_Math))
## # A tibble: 5 × 3
## Borough median mean
## <chr> <dbl> <dbl>
## 1 Bronx 396. 403.
## 2 Brooklyn 395 414.
## 3 Manhattan 433 452.
## 4 Queens 448 460.
## 5 Staten Island 466. 486.
Nice job! Did the median scores change before and after imputation? (Hint: they shouldn’t have changed by much, but rounding may have offset them by an integer or two.)
We return, once again, to the agricolae package to
examine what a Latin Square design can look like. Here’s an example:
## [,1] [,2] [,3] [,4]
## [1,] "A" "D" "C" "B"
## [2,] "B" "A" "D" "C"
## [3,] "D" "C" "B" "A"
## [4,] "C" "B" "A" "D"
Since a Latin Square experiment has two blocking factors, you can see that in this design, each treatment appears once in both each row (blocking factor 1) and each column (blocking factor 2).
Look at the help page for design.lsd() by typing
?design.lsd in the console for any help you need designing
your Latin Square experiment.
# agricolae is loaded
# Design a LS with 5 treatments A:E then look at the sketch
my_design_lsd <- design.lsd(trt = LETTERS[1:5],
seed = 42)
my_design_lsd$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "E" "D" "A" "C" "B"
## [2,] "D" "C" "E" "B" "A"
## [3,] "A" "E" "B" "D" "C"
## [4,] "C" "B" "D" "A" "E"
## [5,] "B" "A" "C" "E" "D"
Perhaps you're thinking to yourself 'This looks a lot like an RCBD'…bingo! It does, but as we know from the video, there are now two blocking factors in a LS design.
To execute a Latin Square design on this data, suppose we want to
know the effect of our tutoring program, which includes one-on-one
tutoring, two small groups, and an in- and after-school SAT prep class. A
new dataset nyc_scores_ls is available that represents this
experiment. Feel free to explore the dataset in the console.
We’ll block by Borough and
Teacher_Education_Level to reduce their known variance on
the score outcome. Borough is a good blocking factor
because schools in America are funded partly based on taxes paid in each
city, so it will likely make a difference in the quality of
education.
# More made-up data, assuming nyc_scores_ls is the same as nyc_scores
# (rep() recycles the values to the requested length, alternating Yes/No here)
nyc_scores$Tutoring_Program <- rep(c("Yes", "No"), length.out = 435)
nyc_scores$Homework_Type <- rep(c("Individual", "Large Group", "Mix of Large Group/Individual", "Mix of Small Group/Individual", "Small Group"),
                                length.out = 435)
nyc_scores_ls <- nyc_scores
# Build nyc_scores_ls_lm
nyc_scores_ls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level,
data = nyc_scores_ls)
# Tidy the results with broom
tidy(nyc_scores_ls_lm)
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 399. 23.1 17.2 3.53e-49
## 2 Tutoring_ProgramYes -4.46 6.95 -0.642 5.21e- 1
## 3 BoroughBrooklyn -3.91 15.9 -0.246 8.06e- 1
## 4 BoroughManhattan 59.6 21.5 2.77 5.95e- 3
## 5 BoroughQueens 60.3 16.4 3.67 2.75e- 4
## 6 BoroughStaten Island 92.5 23.2 3.99 8.15e- 5
## 7 Teacher_Education_LevelCollege Student -3.08 20.9 -0.147 8.83e- 1
## 8 Teacher_Education_LevelGrad Student 19.6 24.9 0.788 4.31e- 1
## 9 Teacher_Education_LevelMA 24.4 28.2 0.868 3.86e- 1
## 10 Teacher_Education_LevelPhD -15.8 30.6 -0.515 6.07e- 1
# Examine the results with anova
anova(nyc_scores_ls_lm)
## Analysis of Variance Table
##
## Response: Average_Score_SAT_Math
## Df Sum Sq Mean Sq F value Pr(>F)
## Tutoring_Program 1 2206 2206 0.4877 0.48539
## Borough 4 245041 61260 13.5433 2.644e-10 ***
## Teacher_Education_Level 4 38005 9501 2.1005 0.08022 .
## Residuals 365 1650999 4523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question At the 0.05 significance level, do we have
evidence to believe the tutoring program has an effect on math SAT
scores, when blocked by Borough and
Teacher_Education_Level?
Possible Answers
Nope! Given the p-value, we have no reason to reject the null hypothesis.
Yes! Given the p-value, we have reason to believe that the tutoring program had an effect on the Math score.
Excellent! It seems that when we block for Borough of
the school and Teacher_Education_Level, our
Tutoring_Program isn’t having a statistically significant
effect on the Math SAT score.
In the last lesson, when discussing Latin Squares, we did numerical EDA in the form of looking at means, variances, and medians of the math SAT scores. Another crucial part of EDA is data visualization, as it often helps in spotting outliers and gives you a visual representation of the distribution of your variables.
ggplot2 has been loaded for you and the
nyc_scores dataset is available. Create and examine the
requested boxplot. How do the medians differ by Borough? How many
outliers are present, and where are they mostly present?
# Create a boxplot of Math scores by Borough, with a title and x/y axis labels
ggplot(nyc_scores, aes(Borough, Average_Score_SAT_Math)) +
geom_boxplot() +
labs(title = "Average SAT Math Scores by Borough, NYC",
x = "Borough (NYC)",
y = "Average SAT Math Score (2014-15)")
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
Beautiful! It's interesting to see the differing distributions of scores by Borough and to see that every borough has scores that are outliers, though some more than others.
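To put a rough number on those outliers (a sketch of my own, not part of the course), we can count points that fall outside the usual 1.5 * IQR whiskers within each borough:
# Count boxplot-style outliers (beyond 1.5 * IQR from the quartiles) per borough
nyc_scores %>%
  dplyr::filter(!is.na(Average_Score_SAT_Math)) %>%
  group_by(Borough) %>%
  summarize(n_outliers = sum(Average_Score_SAT_Math < quantile(Average_Score_SAT_Math, 0.25) - 1.5 * IQR(Average_Score_SAT_Math) |
                             Average_Score_SAT_Math > quantile(Average_Score_SAT_Math, 0.75) + 1.5 * IQR(Average_Score_SAT_Math)))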
As we've seen, agricolae gives us the ability to draw
all of the experimental designs we've used so far, and it can also
draw Graeco-Latin Squares. One difference in the input to
design.graeco() that we haven’t seen before is that we’ll
need to input 2 vectors, trt1 and trt2, which
must be of equal length. You can think of trt1 as your
actual treatment and trt2 as one of your blocking factors.
agricolae has been loaded for you.
# Create trt1 and trt2
trt1 <- LETTERS[1:5]
trt2 <- 1:5
# Create my_graeco_design
my_graeco_design <- design.graeco(trt1, trt2, seed = 42)
# Examine the parameters and sketch
my_graeco_design$parameters
## $design
## [1] "graeco"
##
## $trt1
## [1] "A" "B" "C" "D" "E"
##
## $trt2
## [1] 1 2 3 4 5
##
## $r
## [1] 5
##
## $serie
## [1] 2
##
## $seed
## [1] 42
##
## $kinds
## [1] "Super-Duper"
##
## [[8]]
## [1] TRUE
my_graeco_design$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "D 5" "A 1" "C 3" "B 4" "E 2"
## [2,] "A 3" "C 4" "B 2" "E 5" "D 1"
## [3,] "C 2" "B 5" "E 1" "D 3" "A 4"
## [4,] "B 1" "E 3" "D 4" "A 2" "C 5"
## [5,] "E 4" "D 2" "A 5" "C 1" "B 3"
Superb! You can see that this time the sketch object includes both your treatment (the capital letter) and one of your blocking factors (the number).
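One more quick sanity check of my own (not from the course): in a 5 x 5 Graeco-Latin Square, every treatment/blocking-factor pair should occur exactly once, so all 25 cells of the sketch should be distinct:
# anyDuplicated() returns 0 when every treatment-number combination is unique
anyDuplicated(as.vector(my_graeco_design$sketch)) == 0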
Recall that our Latin Square exercise in this chapter tested the
effect of our tutoring program, blocked by Borough and
Teacher_Education_Level.
For our Graeco-Latin Square, say we also want to block out the known
effect of Homework_Type, which indicates what kind of
homework the student was given: individual only, small or large group
homework, or some combination. We can add this as another blocking
factor to create a Graeco-Latin Square experiment.
# The course's nyc_scores_gls dataset wasn't provided, so reuse the made-up nyc_scores_ls data here
nyc_scores_gls <- nyc_scores_ls
# Build nyc_scores_gls_lm
nyc_scores_gls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level + Homework_Type,
data = nyc_scores_gls)
# Tidy the results with broom
tidy(nyc_scores_gls_lm)
## # A tibble: 14 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 414. 24.4 16.9 9.40e-48
## 2 Tutoring_ProgramYes -4.56 6.95 -0.656 5.12e- 1
## 3 BoroughBrooklyn -3.86 15.9 -0.243 8.08e- 1
## 4 BoroughManhattan 58.1 21.6 2.69 7.41e- 3
## 5 BoroughQueens 61.4 16.5 3.73 2.21e- 4
## 6 BoroughStaten Island 91.8 23.2 3.95 9.38e- 5
## 7 Teacher_Education_LevelCollege Student -3.76 20.9 -0.180 8.57e- 1
## 8 Teacher_Education_LevelGrad Student 17.9 25.0 0.718 4.73e- 1
## 9 Teacher_Education_LevelMA 22.8 28.2 0.808 4.20e- 1
## 10 Teacher_Education_LevelPhD -18.1 30.7 -0.590 5.56e- 1
## 11 Homework_TypeLarge Group -14.8 11.2 -1.32 1.88e- 1
## 12 Homework_TypeMix of Large Group/Indivi… -18.0 11.0 -1.64 1.01e- 1
## 13 Homework_TypeMix of Small Group/Indivi… -15.5 10.9 -1.41 1.58e- 1
## 14 Homework_TypeSmall Group -18.6 11.0 -1.69 9.19e- 2
# Examine the results with anova
anova(nyc_scores_gls_lm)
## Analysis of Variance Table
##
## Response: Average_Score_SAT_Math
## Df Sum Sq Mean Sq F value Pr(>F)
## Tutoring_Program 1 2206 2206 0.4875 0.48549
## Borough 4 245041 61260 13.5375 2.716e-10 ***
## Teacher_Education_Level 4 38005 9501 2.0996 0.08037 .
## Homework_Type 4 17385 4346 0.9604 0.42922
## Residuals 361 1633614 4525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question
At the 0.05 significance level, do we have evidence to believe the
tutoring program has an effect on math SAT scores, when blocked by
Borough, Teacher_Education_Level, and
Homework_Type?
Possible Answers
Nope! Given the p-value, we have no reason to reject the null hypothesis.
Yes! Given the p-value, we have reason to believe that the tutoring program had an effect on the Math score.
Bravo! It seems that here, when blocked out by all the other factors, our Tutoring program has no effect on the Math score.
Let’s do some more EDA before we dive into the analysis of our factorial experiment.
Let's test the effect of Percent_Black_HL,
Percent_Tested_HL, and Tutoring_Program on the
outcome, Average_Score_SAT_Math. The HL stands
for high-low: for Percent_Black_HL, a 1 indicates that less than 50% of a
school's students are Black and a 2 indicates 50% or more; for
Percent_Tested_HL, a 1 indicates that less than 50% of a school's
students were tested and a 2 indicates 50% or more.
Build a boxplot of each factor vs. the outcome to get an idea of
which factors show a difference in median by factor level (ultimately,
the difference in means is what's tested). The nyc_scores dataset has
been loaded for you.
# ggplot2 is loaded
# The made-up Tutoring_Program variable was already created in a previous code chunk:
# nyc_scores$Tutoring_Program <- rep(c("Yes", "No"), length.out = 435)
# Build the boxplot for the tutoring program vs. Math SAT score
ggplot(nyc_scores,aes(Tutoring_Program, Average_Score_SAT_Math)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
# Feature engineering
Percent_Black_HL <- factor(ifelse(nyc_scores$Percent_Black < 0.5, 1, 2))
Percent_Tested_HL <- factor(ifelse(nyc_scores$Percent_Tested < 0.5, 1, 2))
nyc_scores$Percent_Black_HL <- Percent_Black_HL
nyc_scores$Percent_Tested_HL <- Percent_Tested_HL
rm("Percent_Tested_HL", "Percent_Black_HL")
# Build the boxplot for the percent black vs. Math SAT score
ggplot(nyc_scores, aes(Percent_Black_HL, Average_Score_SAT_Math)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
# Build the boxplot for percent tested vs. Math SAT score
ggplot(nyc_scores,
aes(Percent_Tested_HL, Average_Score_SAT_Math)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
Excellent! Now, let’s move on to the analysis of these factors on the score.
Now we want to examine the effect of tutoring programs on the NYC
schools’ SAT Math score. As noted in the last exercise: the variable
Tutoring_Program is simply yes or
no, depending on whether a school got a tutoring program
implemented. For Percent_Black_HL and
Percent_Tested_HL, HL stands for high/low: a 1
indicates less than 50% Black students or less than 50% of students
tested, and a 2 indicates 50% or more of either.
Remember that because we intend to test all of the possible
combinations of factor levels, we need to write the formula like:
outcome ~ factor1 * factor2 * factor3.
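To make that shorthand concrete (a sketch of my own, not from the course), the * operator expands into all main effects plus every two-way and three-way interaction, so the model we are about to fit could equivalently be written out in full:
# Equivalent long-form specification of the factorial model (hypothetical object name)
nyc_scores_factorial_long <- aov(Average_Score_SAT_Math ~ Percent_Tested_HL + Percent_Black_HL + Tutoring_Program +
                                   Percent_Tested_HL:Percent_Black_HL + Percent_Tested_HL:Tutoring_Program +
                                   Percent_Black_HL:Tutoring_Program +
                                   Percent_Tested_HL:Percent_Black_HL:Tutoring_Program,
                                 data = nyc_scores)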
# Create nyc_scores_factorial and examine the results
nyc_scores_factorial <- aov(Average_Score_SAT_Math ~ Percent_Tested_HL * Percent_Black_HL * Tutoring_Program, data = nyc_scores)
broom::tidy(nyc_scores_factorial)
## # A tibble: 8 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Percent_Tested_HL 1 1.90e5 1.90e5 43.7 1.35e-10
## 2 Percent_Black_HL 1 1.09e5 1.09e5 25.0 9.07e- 7
## 3 Tutoring_Program 1 6.03e3 6.03e3 1.39 2.40e- 1
## 4 Percent_Tested_HL:Percent_Black_HL 1 3.29e4 3.29e4 7.57 6.23e- 3
## 5 Percent_Tested_HL:Tutoring_Program 1 2.19e1 2.19e1 0.00504 9.43e- 1
## 6 Percent_Black_HL:Tutoring_Program 1 2.30e1 2.30e1 0.00528 9.42e- 1
## 7 Percent_Tested_HL:Percent_Black_HL:Tu… 1 1.58e3 1.58e3 0.362 5.48e- 1
## 8 Residuals 367 1.60e6 4.35e3 NA NA
Whoo! We can see from the results that Percent_Tested_HL and Percent_Black_HL each have a significant effect on the Math score, and that there is a significant interaction between those two factors. The Tutoring_Program term, however, has a p-value of about 0.24, so we cannot reject the null hypothesis that there is no difference in score based on tutoring program availability, and none of the interactions involving the tutoring program are significant. Next we need to check the model.
We’ve built our model, so we know what’s next: model checking! We
need to examine whether both our outcome and our model residuals are
normally distributed. We'll check the normality assumption using
shapiro.test(). A low p-value means we can reject the null
hypothesis that the sample came from a normally distributed
population.
Let’s carry out the requisite model checks for our 2^k factorial
model, nyc_scores_factorial, which has been loaded for
you.
# Use shapiro.test() to test the outcome
shapiro.test(nyc_scores$Average_Score_SAT_Math)
##
## Shapiro-Wilk normality test
##
## data: nyc_scores$Average_Score_SAT_Math
## W = 0.84672, p-value < 2.2e-16
# Plot nyc_scores_factorial to examine residuals
par(mfrow = c(2,2))
plot(nyc_scores_factorial)
Brilliant! The model appears to be fairly well fit, though our evidence indicates the score may not be from a normally distributed population. Looking at the Q-Q plot, we can see that towards the higher end, the points are not on the line, so we may not be dealing with normality here. If we had more time, we might consider a transformation on the outcome to move towards normality.
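If you'd like a formal test to accompany the Q-Q plot (my own addition, not part of the course), you can also run shapiro.test() on the model residuals directly:
# Test the residuals of the factorial model for normality
shapiro.test(residuals(nyc_scores_factorial))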