Ch. 1 - Introduction to Experimental Design

Intro to Experimental Design

[Video]

Steps of an Experiment

  • Planning
    • dependent variable = outcome
    • independent variable(s) = explanatory variables
  • Design
  • Analysis

Example timeline

  • Planning & Design - 6 months
  • Conduct Experiment - 18 months
  • Analysis - 2 months

Key Components of an Experiment

  • Randomization
  • Replication
  • Blocking

A basic experiment

# Load the ToothGrowth dataset
data(ToothGrowth)

# Perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333

Randomization

# Perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)

# Load broom
library(broom)

# Tidy ToothGrowth_ttest
tidy(ToothGrowth_ttest)
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     3.70      20.7      17.0      1.92  0.0606      55.3   -0.171      7.57
## # … with 2 more variables: method <chr>, alternative <chr>

Replication

# Load dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Count number of observations for each combination of supp and dose
ToothGrowth %>% 
    count(supp, dose)
## # A tibble: 6 x 3
##   supp   dose     n
##   <fct> <dbl> <int>
## 1 OJ      0.5    10
## 2 OJ      1      10
## 3 OJ      2      10
## 4 VC      0.5    10
## 5 VC      1      10
## 6 VC      2      10

Blocking

# Load ggplot2 for plotting
library(ggplot2)

# Create a boxplot with geom_boxplot()
ggplot(ToothGrowth, aes(x = dose, y = len)) + 
    geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
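
The warning appears because dose is numeric, so ggplot2 lumps everything into a single box. A quick fix (a sketch, not part of the original output) is to treat dose as discrete:

# Treat dose as a factor to get one box per dose level
ggplot(ToothGrowth, aes(x = factor(dose), y = len)) + 
    geom_boxplot()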

# Create ToothGrowth_aov
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth)

# Examine ToothGrowth_aov with summary()
summary(ToothGrowth_aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         1 2224.3  2224.3  123.99 6.31e-16 ***
## supp         1  205.3   205.3   11.45   0.0013 ** 
## Residuals   57 1022.6    17.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Testing

[Video]

Breaking down hypothesis testing:

  • Null hypothesis
    • there is no change
    • no difference between groups
    • the mean, median, or observation = a number
  • Alternative hypothesis
    • there is a change
    • difference between groups
    • mean, median, or observation is >, <, or != to a number

Power & Sample Size

  • Power: the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true.
  • Effect size: a standardized measure of the difference you’re trying to detect (see the Cohen’s d sketch below for one common example).
  • Sample size: the number of experimental units you need in order to detect the desired effect at the desired power.
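
For two equally sized groups, Cohen’s d is simply the difference in means divided by a pooled standard deviation. A rough sketch using the ToothGrowth supplement groups (an illustration, not course code):

# Cohen's d for the two supp groups in ToothGrowth
# (equal group sizes, so the pooled SD is the root mean of the two variances)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
pooled_sd <- sqrt((var(oj) + var(vc)) / 2)
(mean(oj) - mean(vc)) / pooled_sd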

Power & Sample Size Calculations

# library(pwr)
# pwr.anova.test(k = 3,
#                n = 20,
#                f = 0.2,
#                sig.level = 0.05,
#                power = NULL)
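
To solve for the per-group sample size instead (assuming the pwr package is installed), set n = NULL and supply the desired power:

# How many subjects per group for 80% power with 3 groups and effect size f = 0.2?
library(pwr)
pwr.anova.test(k = 3,
               n = NULL,
               f = 0.2,
               sig.level = 0.05,
               power = 0.8)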

One-sided vs. Two-sided tests

# Less than
t.test(x = ToothGrowth$len,
       alternative = "less",
       mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
##      -Inf 20.46358
## sample estimates:
## mean of x 
##  18.81333
# Greater than
t.test(x = ToothGrowth$len,
       alternative = "greater",
       mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
##  17.16309      Inf
## sample estimates:
## mean of x 
##  18.81333

pwr package Help Docs exploration

The pwr package has been loaded for you. Use the console to look at the documentation for the pwr.t.test() function. The list of arguments is specialized for a t-test and includes the ability to specify the alternative hypothesis.

If you’d like, take some time to explore the different pwr package functions and read about their inputs.

What does a call to any pwr.*() function yield?

  • A vector containing the thing to be calculated.
  • A data frame of the power, sample size, and other inputs.
  • [*] An object of class “power.htest”.
  • An object of class “integer”.
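
A quick way to confirm this in the console (a small check, assuming pwr is already loaded as in the exercise):

# The returned object prints like a small report and has class "power.htest"
class(pwr.t.test(n = 30, d = 0.5, sig.level = 0.05))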

Power & Sample Size Calculations

# Load the pwr package
library(pwr)

# Calculate power
pwr.t.test(n = 100, 
           d = 0.35,
           sig.level = 0.10,
           type = "two.sample", 
           alternative = "two.sided",
           power = NULL)
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.35
##       sig.level = 0.1
##           power = 0.7943532
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
# Calculate sample size
pwr.t.test(n = NULL, 
           d = 0.25, 
           sig.level = 0.05, 
           type = "one.sample", alternative = "greater", 
           power = 0.8)
## 
##      One-sample t test power calculation 
## 
##               n = 100.2877
##               d = 0.25
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater

Ch. 2 - Basic Experiments

Single & Multiple Factor Experiments

[Video]

ANOVA

  • Used to compare 3+ groups
  • An omnibus test:
    • won’t know which groups’ means are different without additional post hoc testing
  • Two ways to implement in R:
# Option 1: fit a linear model with lm(), then request its ANOVA table
# model_1 <- lm(y ~ x, data = dataset)
# anova(model_1)

# Option 2: fit the ANOVA directly with aov()
# aov(y ~ x, data = dataset)
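
A concrete sketch of both routes, using the built-in ToothGrowth data in place of the placeholder names above:

# Route one: fit a linear model, then ask for its ANOVA table
model_1 <- lm(len ~ supp, data = ToothGrowth)
anova(model_1)

# Route two: fit the ANOVA directly with aov()
summary(aov(len ~ supp, data = ToothGrowth))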

Single Factor Experiments

# model_1 <- lm(y ~ x, data = dataset)

Multiple Factor Experiments

# model_2 <- lm(y ~ x + r + s + t, data = dataset)
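
Filled in with real variables (again a sketch with ToothGrowth), extra factors are simply added to the right-hand side of the formula:

# Two explanatory variables: supplement type and dose
model_2 <- lm(len ~ supp + dose, data = ToothGrowth)
anova(model_2)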

Exploratory Data Analysis (EDA) Lending Club

# Examine the variables with glimpse()
glimpse(lendingclub)
## Observations: 1,500
## Variables: 12
## $ member_id           <int> 55096114, 1555332, 1009151, 69524202, 72128084, 5…
## $ loan_amnt           <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ funded_amnt         <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ term                <fct> 36 months, 36 months, 60 months, 36 months, 36 mo…
## $ int_rate            <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.…
## $ emp_length          <fct> 10+ years, 10+ years, 3 years, 10+ years, 10+ yea…
## $ home_ownership      <fct> RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MOR…
## $ annual_inc          <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, …
## $ verification_status <fct> Not Verified, Verified, Not Verified, Not Verifie…
## $ loan_status         <fct> Current, Fully Paid, Fully Paid, Current, Current…
## $ purpose             <fct> debt_consolidation, debt_consolidation, home_impr…
## $ grade               <fct> C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D…
# Find median loan_amnt, mean int_rate, and mean annual_inc with summarize()
lendingclub %>% summarize(median(loan_amnt), mean(int_rate), mean(annual_inc))
##   median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1             13000       13.31472         75736.03
# Use ggplot2 to build a bar chart of purpose
ggplot(data=lendingclub, aes(x = purpose)) + 
    geom_bar() +
    coord_flip()

# Use recode() to create the new purpose_recode variable
lendingclub$purpose_recode <- lendingclub$purpose %>% recode( 
        "credit_card" = "debt_related", 
        "debt_consolidation" = "debt_related",
        "medical" = "debt_related",
        "car" = "big_purchase", 
        "major_purchase" = "big_purchase", 
        "vacation" = "big_purchase",
        "moving" = "life_change", 
        "small_business" = "life_change", 
        "wedding" = "life_change",
        "house" = "home_related", 
        "home_improvement" = "home_related")

How does loan purpose affect amount funded?

# Build a linear regression model, purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)

# Examine results of purpose_recode_model
summary(purpose_recode_model)
## 
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14472  -6251  -1322   4678  25761 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9888.1     1248.9   7.917 4.69e-15 ***
## purpose_recodedebt_related       5433.5     1270.5   4.277 2.02e-05 ***
## purpose_recodehome_related       4845.0     1501.0   3.228  0.00127 ** 
## purpose_recodelife_change        4095.3     2197.2   1.864  0.06254 .  
## purpose_recodeother              -649.3     1598.3  -0.406  0.68461    
## purpose_recoderenewable_energy  -1796.4     4943.3  -0.363  0.71636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared:  0.03473,    Adjusted R-squared:  0.0315 
## F-statistic: 10.75 on 5 and 1494 DF,  p-value: 3.598e-10
# Get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova(purpose_recode_model)

# Print purpose_recode_anova
purpose_recode_anova
## Analysis of Variance Table
## 
## Response: funded_amnt
##                  Df     Sum Sq   Mean Sq F value    Pr(>F)    
## purpose_recode    5 3.6888e+09 737756668   10.75 3.598e-10 ***
## Residuals      1494 1.0253e+11  68629950                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Examine class of purpose_recode_anova
class(purpose_recode_anova)
## [1] "anova"      "data.frame"

Which loan purpose mean is different?

# Use aov() to build purpose_aov
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)

# Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD(purpose_aov, conf.level = 0.95)

# Tidy tukey_output to make sense of the results
tidy(tukey_output)
## # A tibble: 15 x 6
##    term         comparison               estimate conf.low conf.high adj.p.value
##    <chr>        <chr>                       <dbl>    <dbl>     <dbl>       <dbl>
##  1 purpose_rec… debt_related-big_purcha…    5434.    1808.     9059.     2.91e-4
##  2 purpose_rec… home_related-big_purcha…    4845.     562.     9128.     1.61e-2
##  3 purpose_rec… life_change-big_purchase    4095.   -2174.    10365.     4.25e-1
##  4 purpose_rec… other-big_purchase          -649.   -5210.     3911.     9.99e-1
##  5 purpose_rec… renewable_energy-big_pu…   -1796.  -15902.    12309.     9.99e-1
##  6 purpose_rec… home_related-debt_relat…    -589.   -3056.     1879.     9.84e-1
##  7 purpose_rec… life_change-debt_related   -1338.   -6539.     3863.     9.78e-1
##  8 purpose_rec… other-debt_related         -6083.   -9005.    -3160.     5.32e-8
##  9 purpose_rec… renewable_energy-debt_r…   -7230.  -20894.     6434.     6.58e-1
## 10 purpose_rec… life_change-home_related    -750.   -6429.     4929.     9.99e-1
## 11 purpose_rec… other-home_related         -5494.   -9201.    -1787.     3.58e-4
## 12 purpose_rec… renewable_energy-home_r…   -6641.  -20494.     7212.     7.46e-1
## 13 purpose_rec… other-life_change          -4745.  -10636.     1147.     1.95e-1
## 14 purpose_rec… renewable_energy-life_c…   -5892.  -20482.     8698.     8.59e-1
## 15 purpose_rec… renewable_energy-other     -1147.  -15088.    12794.     1.00e+0

Multiple Factor Experiments

# Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov(funded_amnt ~ purpose_recode + emp_length, data = lendingclub)

# Print purpose_emp_aov to the console
purpose_emp_aov
## Call:
##    aov(formula = funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
## 
## Terms:
##                 purpose_recode   emp_length    Residuals
## Sum of Squares      3688783338   2044273211 100488872355
## Deg. of Freedom              5           11         1483
## 
## Residual standard error: 8231.679
## Estimated effects may be unbalanced
# Call summary() to see the p-values
summary(purpose_emp_aov)
##                  Df    Sum Sq   Mean Sq F value   Pr(>F)    
## purpose_recode    5 3.689e+09 737756668  10.888 2.63e-10 ***
## emp_length       11 2.044e+09 185843019   2.743  0.00161 ** 
## Residuals      1483 1.005e+11  67760534                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Validation

[Video]

Pre-modeling EDA

lendingclub %>% 
  summarise(median(loan_amnt),
            mean(int_rate), 
            mean(annual_inc))
##   median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1             13000       13.31472         75736.03
lendingclub %>% 
  group_by(verification_status) %>%
  summarise(mean(funded_amnt), 
            var(funded_amnt))
## # A tibble: 3 x 3
##   verification_status `mean(funded_amnt)` `var(funded_amnt)`
##   <fct>                             <dbl>              <dbl>
## 1 Not Verified                     11018.          33994991.
## 2 Source Verified                  15155.          70477801.
## 3 Verified                         17562.          82211449.
ggplot(data = lendingclub,
       aes(x = verification_status, y = funded_amnt)) +
    geom_boxplot()

Post-modeling model validation

  • Residual plot
  • QQ-plot for normality
  • Test ANOVA assumptions
    • Homogeneity of variances
  • Try non-parametric alternatives to ANOVA

Pre-modeling EDA

# Examine the summary of int_rate
summary(lendingclub$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.32    9.99   12.99   13.31   16.29   26.77
# Examine int_rate by grade
lendingclub %>% 
    group_by(grade) %>% 
    summarize(mean = mean(int_rate), var = var(int_rate), median = median(int_rate))
## # A tibble: 7 x 4
##   grade  mean   var median
##   <fct> <dbl> <dbl>  <dbl>
## 1 A      7.27 0.961   7.26
## 2 B     10.9  2.08   11.0 
## 3 C     14.0  1.42   14.0 
## 4 D     17.4  1.62   17.6 
## 5 E     20.1  2.71   20.0 
## 6 F     23.6  2.87   23.5 
## 7 G     26.1  0.198  25.9
# Make a boxplot of int_rate by grade
ggplot(lendingclub, aes(x = grade, y = int_rate)) + 
    geom_boxplot()

# Use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate ~ grade, data = lendingclub)
summary(grade_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## grade          6  27013    4502    2637 <2e-16 ***
## Residuals   1493   2549       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-modeling validation plots + variance

# For a 2x2 grid of plots:
par(mfrow = c(2, 2))

# Plot grade_aov
plot(grade_aov)

# Bartlett's test for homogeneity of variance
bartlett.test(int_rate ~ grade, data = lendingclub)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15

Kruskal-Wallis rank sum test

# Conduct the Kruskal-Wallis rank sum test
kruskal.test(int_rate ~ grade,
             data = lendingclub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  int_rate by grade
## Kruskal-Wallis chi-squared = 1365.5, df = 6, p-value < 2.2e-16

A/B Testing

[Video]

Which post-A/B-test test?

We’ll be testing the mean loan_amnt, the amount of money the loan applicants request, based on which color header (green or blue) they saw on the Lending Club website.

Which statistical test should we use after we’ve collected enough data?

  • Kruskal-Wallis Rank Sum test
  • [*] T-test
  • Chi-Square Test
  • Linear Regression

Sample size for A/B test

# Load the pwr package
library(pwr)

# Use the correct function from pwr to find the sample size
pwr.t.test(n = NULL, 
           d = 0.2,
           sig.level = 0.05,
           power = 0.8,
           alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Basic A/B test

# Plot the A/B test results
ggplot(lendingclub_ab, aes(x = Group, y = loan_amnt)) + 
    geom_boxplot()

# Conduct a two-sided t-test
t.test(loan_amnt ~ Group, data = lendingclub_ab)
## 
##  Welch Two Sample t-test
## 
## data:  loan_amnt by Group
## t = -0.58112, df = 997.06, p-value = 0.5613
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1377.1748   747.8748
## sample estimates:
## mean in group A mean in group B 
##        14723.15        15037.80

A/B tests vs. multivariable experiments

# Build lendingclub_multi
lendingclub_multi <- lm(loan_amnt ~ Group + grade + verification_status, data = lendingclub_ab)

# Examine lendingclub_multi results
tidy(lendingclub_multi)
## # A tibble: 10 x 5
##    term                               estimate std.error statistic  p.value
##    <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                          11244.      792.    14.2   8.44e-42
##  2 GroupB                                 205.      515.     0.398 6.91e- 1
##  3 gradeB                                -975.      817.    -1.19  2.33e- 1
##  4 gradeC                                -631.      806.    -0.783 4.34e- 1
##  5 gradeD                                 718.      917.     0.783 4.34e- 1
##  6 gradeE                                1477.     1208.     1.22  2.22e- 1
##  7 gradeF                                5453.     1942.     2.81  5.09e- 3
##  8 gradeG                                3490.     3396.     1.03  3.04e- 1
##  9 verification_statusSource Verified    4528.      637.     7.10  2.30e-12
## 10 verification_statusVerified           5900.      668.     8.84  4.41e-18

Ch. 3 - Randomized Complete (& Balanced Incomplete) Block Designs

Intro to NHANES & Sampling

[Video]

Intro to Sampling

  • Probability Sampling: probability is used to select the sample (in various ways)
  • Non-probability Sampling: probability is not used to select the sample
    • Voluntary response: whoever agrees to respond is the sample
    • Convenience sampling: subjects convenient to the researcher are chosen.

Sampling

Simple Random Sampling (SRS)

# sample()
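
A minimal sketch of an SRS in base R, using ToothGrowth purely for illustration:

# Draw a simple random sample of 10 rows (without replacement)
srs_rows <- sample(nrow(ToothGrowth), size = 10)
ToothGrowth[srs_rows, ]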

Stratified Sampling

# dataset %>% 
#     group_by(strata_variable) %>% 
#     sample_n(size = n_per_stratum)
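
With the placeholders filled in (a sketch using ToothGrowth and its supp strata):

# Stratified sample: 5 rows from each supp stratum
ToothGrowth %>% 
    group_by(supp) %>% 
    sample_n(size = 5)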

Cluster Sampling

# cluster(dataset, 
#         cluster_var_name,
#         number_to_select,
#         method = "option")
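
This template points at cluster() from the sampling package; a rough sketch of what a call might look like (assuming that package is installed, and using grade as a stand-in cluster variable):

# Select 2 whole grade clusters from lendingclub by SRS without replacement
library(sampling)
grade_clusters <- cluster(lendingclub,
                          clustername = "grade",
                          size = 2,
                          method = "srswor")
head(grade_clusters)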

Systematic Sampling
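
No code accompanied this heading; a common base-R sketch is to pick a random starting row and then take every kth row:

# Systematic sample: random start, then every 6th row of ToothGrowth
k <- 6
start <- sample(seq_len(k), size = 1)
ToothGrowth[seq(from = start, to = nrow(ToothGrowth), by = k), ]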

Multi-stage Sampling

NHANES dataset construction

NHANES EDA

NHANES Data Cleaning

Resampling NHANES data

Randomized Complete Block Designs (RCBD)

Which is NOT a good blocking factor?

Drawing RCBDs with Agricolae

NHANES RCBD

RCBD Model Validation

Balanced Incomplete Block Designs (BIBD)

Is a BIBD even possible?

Drawing BIBDs with agricolae

BIBD - cat’s kidney function

NHANES BIBD


Ch. 4 - Latin Squares, Graeco-Latin Squares, & Factorial Experiments

Latin Squares

NYC SAT Scores EDA

Dealing with Missing Test Scores

Drawing Latin Squares with agricolae

Latin Square with NYC SAT Scores

Graeco-Latin Squares

NYC SAT Scores Data Viz

Drawing Graeco-Latin Squares with agricolae

Graeco-Latin Square with NYC SAT Scores

Factorial Experiments

NYC SAT Scores Factorial EDA

Factorial Experiment with NYC SAT Scores

Evaluating the NYC SAT Scores Factorial Model

What’s next in Experimental Design


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470