Experimental Design in R

DataCamp: Statistics with R

Bonnie Cooper

library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( broom )
library( pwr )
library( ggfortify )
library( haven )
library( simputation )
library( sampling )
library( agricolae )
library( naniar )
library( mice )

Introduction to Experimental Design

Intro

Planning, Design & Analysis

  • Planning: start with a hypothesis
    • dependent variable == outcome
    • independent variable(s) == explanatory variables
  • Design: logistic regression? Factorial design?
  • Analysis

Key components: Randomization, Replication & Blocking

  • Randomization: evenly distributes any variability in outcome due to outside factors across treatment groups
    • Ex: double-blind medical trials: neither the patient nor the doctor knows which group has been assigned; group assignment is made randomly by a 3rd party
  • Replication: an experiment must be repeated to fully assess variability; replication helps us feel more confident that the result will generalize to the rest of the population
  • Blocking: helps control variability by making treatment groups more alike. Within groups, differences will be minimal; across groups, differences will be large (see the sketch after this list)
    • classic example: blocking male/female subjects
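
The blocking and randomization ideas above can be made concrete with a few lines of dplyr (an illustrative sketch, not part of the course exercises): subjects are blocked by sex and a treatment is assigned at random, in balanced numbers, within each block.

# Illustrative sketch: randomize a balanced treatment assignment within sex blocks
set.seed(42)
subjects <- data.frame(id = 1:12, sex = rep(c("M", "F"), each = 6))
subjects <- subjects %>%
  group_by(sex) %>%
  mutate(treatment = sample(rep(c("control", "drug"), length.out = n()))) %>%
  ungroup()
# Each block (sex) ends up with 3 control and 3 drug subjects, in random order
table(subjects$sex, subjects$treatment)
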
# Load the ToothGrowth dataset
data(ToothGrowth)

# Perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = 'two.sided', mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333
# Count number of observations for each combination of supp and dose
ToothGrowth %>% 
    count( supp, dose )
##   supp dose  n
## 1   OJ  0.5 10
## 2   OJ  1.0 10
## 3   OJ  2.0 10
## 4   VC  0.5 10
## 5   VC  1.0 10
## 6   VC  2.0 10
# Perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)
ToothGrowth_ttest
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
# Tidy ToothGrowth_ttest
tidy( ToothGrowth_ttest )
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     3.70      20.7      17.0      1.92  0.0606      55.3   -0.171      7.57
## # … with 2 more variables: method <chr>, alternative <chr>

Because the p-value is 0.06, there is not enough evidence to support the alternative hypothesis that supp has an effect on len; i.e., we cannot conclude that mean tooth growth differs between the supplement groups.

glimpse( ToothGrowth )
## Rows: 60
## Columns: 3
## $ len  <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16.5, 16.5…
## $ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, …
## $ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0,…
ToothGrowth_f <- ToothGrowth %>%
  mutate( dose = as.factor( dose ) )
# Create a boxplot with geom_boxplot()
ggplot(ToothGrowth_f, aes(x = dose, y = len)) + 
    geom_boxplot()

# Create ToothGrowth_aov
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth_f)

# Examine ToothGrowth_aov with summary()
summary( ToothGrowth_aov )
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         2 2426.4  1213.2   82.81  < 2e-16 ***
## supp         1  205.4   205.4   14.02 0.000429 ***
## Residuals   56  820.4    14.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Testing

Hypothesis: central research question
Null Hypothesis: the uninteresting result

  • there is no change
  • no difference between groups
  • median, mean, or observation is the same

Alternative Hypothesis: there is a change

  • there is a difference between groups
  • median, mean, or observation is >, <, or != some number
    • >/< == one-sided test
    • != == two-sided test

Power and Sample Size

  • Power: probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true
  • Effect Size: standardized measure of the difference you’re trying to detect (it’s easier to detect a larger difference in means than a smaller one; see the sketch after this list)
  • Sample Size: how many experimental units you need to survey to detect the desired difference at the desired power (generally, the larger the sample size, the more power the test has)
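
For two groups, a common standardized effect size is Cohen’s d: the difference in means divided by the pooled standard deviation. A minimal sketch (illustrative only) using the ToothGrowth supplement groups loaded above:

# Cohen's d: difference in means divided by the pooled standard deviation
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
pooled_sd <- sqrt(((length(oj) - 1) * var(oj) + (length(vc) - 1) * var(vc)) /
                    (length(oj) + length(vc) - 2))
(mean(oj) - mean(vc)) / pooled_sd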

Power and sample size calculations using the pwr package

pwr.anova.test( k = 3,
                n = 20,
                f = 0.2,
                sig.level = 0.05,
                power = NULL )
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 3
##               n = 20
##               f = 0.2
##       sig.level = 0.05
##           power = 0.2521043
## 
## NOTE: n is number in each group
# Less than
t.test(x = ToothGrowth$len,
       alternative = 'less',
       mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
##      -Inf 20.46358
## sample estimates:
## mean of x 
##  18.81333
# Greater than
# Less than
t.test(x = ToothGrowth$len,
       alternative = 'greater',
       mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
##  17.16309      Inf
## sample estimates:
## mean of x 
##  18.81333
# Calculate power using an effect size of 0.35, a sample size of 100 in each group, and a significance level of 0.10
pwr.t.test(n = 100, 
           d = 0.35,
           sig.level = 0.1,
           type = "two.sample", 
           alternative = "two.sided",
           power = NULL)
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.35
##       sig.level = 0.1
##           power = 0.7943532
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
# Calculate sample size needed with an effect size of 0.25, a significance level of 0.05, and a power of 0.8.
pwr.t.test(n = NULL, 
           d = 0.25,
           sig.level = 0.05,
           type = "one.sample", 
           alternative = "greater",
           power = 0.8)
## 
##      One-sample t test power calculation 
## 
##               n = 100.2877
##               d = 0.25
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater

Basic Experiments

ANOVA, single and multiple factor experiments

t-tests are great for 2 groups, but ANOVA is necessary for comparisons of 3+ groups.

ANOVA: Analysis of variance test.

  • used to compare 3+ groups
  • will only indicate that at least one of the means differs from the others, not which one.

Here we look at two ways to implement ANOVA in R:

  1. lm() -> anova()
#glimpse( ToothGrowth )
model1 <- lm( data = ToothGrowth, len ~ dose )
anova( model1 )
## Analysis of Variance Table
## 
## Response: len
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## dose       1 2224.3 2224.30  105.06 1.233e-14 ***
## Residuals 58 1227.9   21.17                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  2. aov()
aov( len ~ dose, data = ToothGrowth )
## Call:
##    aov(formula = len ~ dose, data = ToothGrowth)
## 
## Terms:
##                     dose Residuals
## Sum of Squares  2224.304  1227.905
## Deg. of Freedom        1        58
## 
## Residual standard error: 4.601171
## Estimated effects may be unbalanced

Single Factor Experiments: only one explanatory variable is tested
model_1 <- lm( y ~ x )

Multiple Factor Experiments: more than one explanatory variable is tested
model_2 <- lm( y ~ x + r + s + t )
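
As a concrete illustration (a sketch using the built-in mtcars data rather than a course dataset), the same lm() syntax covers both cases:

# Single-factor model: one explanatory variable
single_factor <- lm(mpg ~ factor(cyl), data = mtcars)
anova(single_factor)

# Multiple-factor model: several explanatory variables
multi_factor <- lm(mpg ~ factor(cyl) + wt + am, data = mtcars)
anova(multi_factor)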

A look at the Lending Club sample dataset:
target variable: the amount of loan awarded

#work with a subset of the data provided by DataCamp:
lc_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/e14dbe91a0840393e86e4fb9a7ec1b958842ae39/lendclub.csv'
lendingclub <- read.csv( lc_url ) %>%
  mutate_if( is.character, as.factor )
glimpse( lendingclub )
## Rows: 1,500
## Columns: 12
## $ member_id           <int> 55096114, 1555332, 1009151, 69524202, 72128084, 5…
## $ loan_amnt           <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ funded_amnt         <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ term                <fct> 36 months, 36 months, 60 months, 36 months, 36 mo…
## $ int_rate            <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.…
## $ emp_length          <fct> 10+ years, 10+ years, 3 years, 10+ years, 10+ yea…
## $ home_ownership      <fct> RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MOR…
## $ annual_inc          <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, …
## $ verification_status <fct> Not Verified, Verified, Not Verified, Not Verifie…
## $ loan_status         <fct> Current, Fully Paid, Fully Paid, Current, Current…
## $ purpose             <fct> debt_consolidation, debt_consolidation, home_impr…
## $ grade               <fct> C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D…
# Find median loan_amnt and mean int_rate, annual_inc with summarize()
lendingclub %>% summarize( median( loan_amnt ),
mean( int_rate ),
mean( annual_inc ) )
##   median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1             13000       13.31472         75736.03
## set the levels in order we want
lendingclub <- within(lendingclub, 
                   purpose <- factor(purpose, 
                                      levels=names(sort(table(purpose), 
                                                        decreasing=FALSE))))
# Use ggplot2 to build a bar chart of purpose
ggplot( lendingclub, aes( x = purpose ) ) +
    geom_bar() +
    coord_flip()

The purpose feature is very detailed. Here we use recode() to simplify the representation into a new feature, purpose_recode.

# Use recode() to create the new purpose_recode variable
lendingclub$purpose_recode <- lendingclub$purpose %>% recode( 
        "credit_card" = "debt_related", 
        "debt_consolidation" = "debt_related",
        "medical" = "debt_related",
        "car" = "big_purchase", 
        "major_purchase" = "big_purchase", 
        "vacation" = "big_purchase",
        "moving" = "life_change", 
        "small_business" = "life_change", 
        "wedding" = "life_change",
        "house" = "home_related", 
        "home_improvement" = "home_related")

## set the levels in order we want
lendingclub <- within(lendingclub, 
                   purpose_recode <- factor(purpose_recode, 
                                      levels=names(sort(table(purpose_recode), 
                                                        decreasing=FALSE))))
# Use ggplot2 to build a bar chart of purpose_recode
ggplot( lendingclub, aes( x = purpose_recode ) ) +
    geom_bar() +
    coord_flip()

How does loan purpose affect the amount of loan awarded?

# Build a linear regression model, purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub )

# Examine results of purpose_recode_model
summary( purpose_recode_model )
## 
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14472  -6251  -1322   4678  25761 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    8092       4783   1.692   0.0909 .
## purpose_recodelife_change      5892       5113   1.152   0.2494  
## purpose_recodebig_purchase     1796       4943   0.363   0.7164  
## purpose_recodeother            1147       4886   0.235   0.8144  
## purpose_recodehome_related     6641       4855   1.368   0.1715  
## purpose_recodedebt_related     7230       4789   1.510   0.1313  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared:  0.03473,    Adjusted R-squared:  0.0315 
## F-statistic: 10.75 on 5 and 1494 DF,  p-value: 3.598e-10
# Get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova( purpose_recode_model )

# Print purpose_recode_anova
purpose_recode_anova
## Analysis of Variance Table
## 
## Response: funded_amnt
##                  Df     Sum Sq   Mean Sq F value    Pr(>F)    
## purpose_recode    5 3.6888e+09 737756668   10.75 3.598e-10 ***
## Residuals      1494 1.0253e+11  68629950                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Examine class of purpose_recode_anova
class( purpose_recode_anova )
## [1] "anova"      "data.frame"

The ANOVA p-value is very small. This suggests that we can reject the null hypothesis and accept the alternative that there is a significant difference in means for at least one of the recoded purpose categories.

Which loan purpose mean is different?

# Use aov() to build purpose_aov
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub )

# Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD( purpose_aov, "purpose_recode", conf.level = 0.95 )

# Tidy tukey_output to make sense of the results
tidy( tukey_output )
## # A tibble: 15 x 7
##    term      contrast         null.value estimate conf.low conf.high adj.p.value
##    <chr>     <chr>                 <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
##  1 purpose_… life_change-ren…          0    5892.   -8698.    20482.     8.59e-1
##  2 purpose_… big_purchase-re…          0    1796.  -12309.    15902.     9.99e-1
##  3 purpose_… other-renewable…          0    1147.  -12794.    15088.     1.00e+0
##  4 purpose_… home_related-re…          0    6641.   -7212.    20494.     7.46e-1
##  5 purpose_… debt_related-re…          0    7230.   -6434.    20894.     6.58e-1
##  6 purpose_… big_purchase-li…          0   -4095.  -10365.     2174.     4.25e-1
##  7 purpose_… other-life_chan…          0   -4745.  -10636.     1147.     1.95e-1
##  8 purpose_… home_related-li…          0     750.   -4929.     6429.     9.99e-1
##  9 purpose_… debt_related-li…          0    1338.   -3863.     6539.     9.78e-1
## 10 purpose_… other-big_purch…          0    -649.   -5210.     3911.     9.99e-1
## 11 purpose_… home_related-bi…          0    4845.     562.     9128.     1.61e-2
## 12 purpose_… debt_related-bi…          0    5434.    1808.     9059.     2.91e-4
## 13 purpose_… home_related-ot…          0    5494.    1787.     9201.     3.58e-4
## 14 purpose_… debt_related-ot…          0    6083.    3160.     9005.     5.32e-8
## 15 purpose_… debt_related-ho…          0     589.   -1879.     3056.     9.84e-1

The output consists of pairwise comparisons. Only a handful of comparisons have significant p-values.

Sort output by p-value

tukey_sigsort <- tukey_output %>%
  tidy() %>%
  select( contrast, adj.p.value ) %>%
  arrange( adj.p.value ) %>%
  mutate( adj.p.value = round( adj.p.value, 8 ) ) %>%
  head( 10 )
tukey_sigsort
## # A tibble: 10 x 2
##    contrast                      adj.p.value
##    <chr>                               <dbl>
##  1 debt_related-other             0.00000005
##  2 debt_related-big_purchase      0.000291  
##  3 home_related-other             0.000358  
##  4 home_related-big_purchase      0.0161    
##  5 other-life_change              0.195     
##  6 big_purchase-life_change       0.425     
##  7 debt_related-renewable_energy  0.658     
##  8 home_related-renewable_energy  0.746     
##  9 life_change-renewable_energy   0.859     
## 10 debt_related-life_change       0.978

Multiple factor experiment:

# Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov( funded_amnt ~ purpose_recode + emp_length, data = lendingclub )

# Call summary() to see the p-values
summary( purpose_emp_aov )
##                  Df    Sum Sq   Mean Sq F value   Pr(>F)    
## purpose_recode    5 3.689e+09 737756668  10.888 2.63e-10 ***
## emp_length       11 2.044e+09 185843019   2.743  0.00161 ** 
## Residuals      1483 1.005e+11  67760534                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Validation

lendingclub %>%
  group_by( verification_status ) %>%
  summarise( mean = mean( funded_amnt ),
             var = var( funded_amnt ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
##   verification_status   mean       var
##   <fct>                <dbl>     <dbl>
## 1 Not Verified        11018. 33994991.
## 2 Source Verified     15155. 70477801.
## 3 Verified            17562. 82211449.

look at the median and spread of the data with a boxplot:

ggplot( data = lendingclub,
        aes( x = verification_status,
             y = funded_amnt ) ) +
  geom_jitter() +
  geom_boxplot()

Post-modeling model validation: are lm assumptions met?

  • Residual plot
  • QQ-plot for normality
  • test ANOVA assumptions
    • homogeneity of variance
  • try non-parametric alternatives to ANOVA (non-parametric tests make no assumptions about the distribution of the data, i.e. whether or not it comes from a normal distribution)
    • Kruskal-Wallis test
purpose_mlr <- lm( funded_amnt ~ purpose_recode + emp_length, data = lendingclub )
autoplot( purpose_mlr, which = 1:4 ) + theme_minimal()

  • Residuals vs Fitted: want to see an even variance of the points
  • Normal QQ: want to see the points fall along the reference line
  • Scale-Location: want to see the horizontal line approximately level with even spread of data points
  • Residuals vs Leverage: highlights observations with high leverage that may have an outsized influence on the model fit
  • Cook’s Distance: a common measure of a data point’s leverage & a good way to identify influential outliers (see the snippet below)

If these validation criteria are not met, you might want to try adding more explanatory variables, or other (e.g. non-linear) modelling approaches might be a better option.
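
Cook’s distances can also be pulled directly from the fitted model with base R’s cooks.distance(); a quick illustrative check that lists the most influential observations:

# Extract Cook's distances from purpose_mlr and show the 5 largest values
cooks <- cooks.distance(purpose_mlr)
head(sort(cooks, decreasing = TRUE), 5)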

Pre-modeling EDA of int_rate

# Examine the summary of int_rate
summary(lendingclub$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.32    9.99   12.99   13.31   16.29   26.77
# Examine int_rate by grade
lendingclub %>% 
    group_by(grade) %>% 
    summarize(mean = mean( int_rate ), var = var( int_rate ), median = median( int_rate ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 4
##   grade  mean   var median
##   <fct> <dbl> <dbl>  <dbl>
## 1 A      7.27 0.961   7.26
## 2 B     10.9  2.08   11.0 
## 3 C     14.0  1.42   14.0 
## 4 D     17.4  1.62   17.6 
## 5 E     20.1  2.71   20.0 
## 6 F     23.6  2.87   23.5 
## 7 G     26.1  0.198  25.9
# Make a boxplot of int_rate by grade
ggplot( lendingclub, aes( x = grade, y = int_rate ) ) +
    geom_boxplot()

# Use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate~grade, data = lendingclub)
summary( grade_aov )
##               Df Sum Sq Mean Sq F value Pr(>F)    
## grade          6  27013    4502    2637 <2e-16 ***
## Residuals   1493   2549       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
autoplot( grade_aov ) + theme_minimal()

test the homogeneity of variance:

# Bartlett's test for homogeneity of variance
bartlett.test( int_rate ~ grade, lendingclub )
## 
##  Bartlett test of homogeneity of variances
## 
## data:  int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15

The residuals of this model are okay, though the residuals for grade G have a much smaller range than any other level of grade (the dots are far less spread out). The Q-Q plot shows that the residuals are fairly normal. However, given the highly significant p-value from Bartlett’s test, the assumption of homogeneity of variances is violated, which is one of the assumptions of an ANOVA model. Therefore, ANOVA might not be the best choice for this experiment.

Now try the Kruskal-Wallis rank sum test:

# Conduct the Kruskal-Wallis rank sum test
kruskal.test(int_rate ~ grade,
             data = lendingclub )
## 
##  Kruskal-Wallis rank sum test
## 
## data:  int_rate by grade
## Kruskal-Wallis chi-squared = 1365.5, df = 6, p-value < 2.2e-16

The Kruskal-Wallis test yields a very small p-value; this indicates that we can be confident in the result that int_rate varies as a function of grade.

A/B Testing

A type of controlled experiment with only two variants of something:

  • a word difference in an ad slogan
  • Red ‘buy’ button as opposed to blue
  • Consumer click-through rates using two different website headers

Power and Sample Size in A/B testing

  • Calculate sample size, given some power, significance level and effect size
  • Run A/B test until sample size that is needed has been collected

Lending Club A/B Test: effect of header color for loan application page
Do softer, gentler color tones influence loan applicants to apply for less money?

We’ll be looking at the difference in the mean loan amount requested between the two groups. Since there are only 2 groups, we can use a t-test.

What sample size is needed?: calculate the required sample size for each group with d = 0.2, a power of 0.8, and a 0.05 significance level.

# Use the t.test from pwr to find the sample size
pwr.t.test(n= NULL,
    d = 0.2, 
    power = 0.8, 
    sig.level = 0.05,
    type = 'two.sample',
    alternative = 'two.sided' )
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

At least 394 people are needed in each group to achieve the desired power criteria.
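
The lendingclub_ab data used in the next chunks is supplied by the DataCamp course and is not constructed in this document. Purely as an illustrative stand-in (a hypothetical Group assignment, not the course data), something comparable could be simulated from lendingclub:

# Hypothetical stand-in: randomly assign 800 sampled applications to headers A and B
set.seed(42)
lendingclub_ab <- lendingclub %>%
  sample_n(800) %>%
  mutate(Group = sample(rep(c("A", "B"), each = 400)))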

Visualize & run a two-sided t-test to determine if the difference between the means of the two groups is statistically significant:

# Plot the A/B test results
ggplot(lendingclub_ab, aes(y = loan_amnt, x = Group)) + 
    geom_boxplot()

# Conduct a two-sided t-test
t.test( formula = loan_amnt ~ Group, data = lendingclub_ab )

Judging by the boxplot and the t-test result, there is no compelling evidence to support a significant difference between responses for the two colors of header.

A/B testing with multivariate experiments

# Build lendingclub_multi
lendingclub_multi <-lm(loan_amnt ~ Group + grade + verification_status, lendingclub_ab )

# Examine lendingclub_multi results
tidy( lendingclub_multi )
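
In this model, the coefficient on Group (and its p-value) indicates whether the header color still shows an effect on loan_amnt after adjusting for grade and verification_status.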

Randomized Complete (& Balanced Incomplete) Block Design

Intro to NHANES and sampling

NHANES: National Health and Nutrition Examination Survey. Information collected from participants that covers medical, dental, socioeconomic, dietary, and general health-related conditions.

Intro to sampling: Probability sampling: probability is used to select the sample
Non-probability Sampling: probability is not used to inform sampling, as in collecting voluntary responses (whoever agrees to respond) and convenience sampling (subjects convenient to the researcher)

5 Methods of Sampling:

  1. Simple Random Sampling: sample() randomly sample from the data.
  2. Stratified Sampling: dataset %>% group_by(variable) %>% sample_n() stratify the data into groups (e.g. M/F) and randomly sample within groups
  3. Cluster Sampling: cluster(dataset, cluster_var, num, method = 'option') divide the data into clusters, randomly select a number of clusters, and sample all subjects in the selected clusters
  4. Systematic Sampling: e.g., sample every 10th observation (see the sketch after this list)
  5. Multistage Sampling: integrates 2+ of the above methods in a systematic way.
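
Systematic sampling is the only one of these methods not demonstrated on the NHANES data below; a minimal sketch (illustrative only) using the lendingclub data loaded earlier:

# Systematic sample: every 10th row of lendingclub, starting at a random offset
set.seed(42)
start <- sample(1:10, 1)
lendingclub_sys <- lendingclub[seq(from = start, to = nrow(lendingclub), by = 10), ]
nrow(lendingclub_sys)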

EDA & Resampling of the NHANES dataset:

BMX_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/ee832ef6c2fa7036704c53e90dc1e710a3b50dbc/nhanes_bodymeasures.csv'
DEMO_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/2be5ca94453a63e825bc30ccefd1429b7683c19c/nhanes_demo.csv'
MCQ_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/d34921a9255422617cdc42f6a3fbcd189f51c19d/nhanes_medicalconditions.csv'
# Import the three datasets using read.csv()
nhanes_demo <- read.csv(DEMO_url)
nhanes_medical <- read.csv(MCQ_url)
nhanes_bodymeasures <- read.csv(BMX_url)

# Merge the 3 datasets you just created to create nhanes_combined
nhanes_combined <- list(nhanes_demo, nhanes_medical, nhanes_bodymeasures) %>% Reduce(function(df1, df2) inner_join(df1, df2, by = "seqn"), .)

glimpse( nhanes_combined )
## Rows: 9,165
## Columns: 161
## $ seqn     <int> 83732, 83733, 83734, 83735, 83736, 83737, 83738, 83739, 8374…
## $ sddsrvyr <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
## $ ridstatr <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ riagendr <int> 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, …
## $ ridageyr <int> 62, 53, 78, 56, 42, 72, 11, 4, 1, 22, 32, 18, 56, 15, 4, 46,…
## $ ridagemn <int> NA, NA, NA, NA, NA, NA, NA, NA, 13, NA, NA, NA, NA, NA, NA, …
## $ ridreth1 <int> 3, 3, 3, 3, 4, 1, 1, 3, 2, 4, 1, 5, 4, 3, 5, 3, 4, 3, 5, 1, …
## $ ridreth3 <int> 3, 3, 3, 3, 4, 1, 1, 3, 2, 4, 1, 6, 4, 3, 6, 3, 4, 3, 7, 1, …
## $ ridexmon <int> 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, …
## $ ridexagm <int> NA, NA, NA, NA, NA, NA, 141, 54, 14, NA, NA, 217, NA, 185, 5…
## $ dmqmiliz <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, 2, 2, NA, NA, 2, NA, 2, …
## $ dmqadfc  <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ dmdborn4 <int> 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
## $ dmdcitzn <int> 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, …
## $ dmdyrsus <int> NA, 7, NA, NA, NA, 2, NA, NA, NA, NA, 6, NA, NA, NA, 2, 8, N…
## $ dmdeduc3 <int> NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, 11, NA, 9, NA, NA…
## $ dmdeduc2 <int> 5, 3, 3, 5, 4, 2, NA, NA, NA, 4, 4, NA, 3, NA, NA, 5, NA, NA…
## $ dmdmartl <int> 1, 3, 1, 6, 3, 4, NA, NA, NA, 5, 1, NA, 3, NA, NA, 6, NA, NA…
## $ ridexprg <int> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA…
## $ sialang  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ siaproxy <int> 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, …
## $ siaintrp <int> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, …
## $ fialang  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fiaproxy <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ fiaintrp <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, …
## $ mialang  <int> 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, 1, NA, 1, 1, NA, 1, NA, 1, 1…
## $ miaproxy <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, NA, 2, 2, NA, 2, NA, 2, 2…
## $ miaintrp <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, NA, 2, 2, NA, 2, NA, 2, 2…
## $ aialanga <int> 1, 1, NA, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1, 1, NA, 1, NA, 1, …
## $ dmdhhsiz <int> 2, 1, 2, 1, 5, 5, 5, 5, 7, 3, 4, 3, 1, 3, 4, 2, 6, 5, 5, 6, …
## $ dmdfmsiz <int> 2, 1, 2, 1, 5, 5, 5, 5, 7, 3, 4, 3, 1, 3, 4, 2, 6, 5, 1, 6, …
## $ dmdhhsza <int> 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 1, 0, 0, 0, 1, 0, 3, 0, 0, 1, …
## $ dmdhhszb <int> 0, 0, 0, 0, 2, 1, 2, 1, 2, 0, 1, 0, 0, 2, 1, 0, 0, 1, 2, 2, …
## $ dmdhhsze <int> 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, …
## $ dmdhrgnd <int> 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, …
## $ dmdhrage <int> 62, 53, 79, 56, 42, 34, 68, 35, 48, 47, 31, 55, 56, 42, 45, …
## $ dmdhrbr4 <int> 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, NA, 2, 1, 1, NA, 1…
## $ dmdhredu <int> 5, 3, 3, 5, 4, 5, 1, 5, 9, 4, 4, 5, 3, 4, NA, 5, 4, 4, NA, 2…
## $ dmdhrmar <int> 1, 3, 1, 6, 3, 1, 2, 1, 5, 1, 1, 1, 3, 5, 1, 6, 5, 1, 1, 1, …
## $ dmdhsedu <int> 3, NA, 3, NA, NA, 5, NA, 5, NA, 5, 4, 4, NA, NA, 4, NA, NA, …
## $ wtint2yr <dbl> 134671.370, 24328.560, 12400.009, 102717.996, 17627.675, 112…
## $ wtmec2yr <dbl> 135629.507, 25282.426, 12575.839, 102078.635, 18234.736, 108…
## $ sdmvpsu  <int> 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, …
## $ sdmvstra <int> 125, 125, 131, 131, 126, 128, 120, 124, 119, 128, 125, 122, …
## $ indhhin2 <int> 10, 4, 5, 10, 7, 14, 6, 15, 77, 7, 6, 15, 3, 4, 12, 3, 6, 14…
## $ indfmin2 <int> 10, 4, 5, 10, 7, 14, 6, 15, 77, 7, 6, 15, 3, 4, 12, 3, 6, 14…
## $ indfmpir <dbl> 4.39, 1.32, 1.51, 5.00, 1.23, 2.82, 1.18, 4.22, NA, 2.08, 1.…
## $ mcq010   <int> 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, …
## $ mcq025   <int> NA, NA, 60, NA, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq035   <int> NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2,…
## $ mcq040   <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq050   <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ agq030   <int> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq053   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ mcq080   <int> 1, 2, 1, 1, 2, 1, NA, NA, NA, 1, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq092   <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, 2, 2, 2, NA, 2, NA, 2, 2,…
## $ mcd093   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq149   <int> NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq151   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160a  <int> 1, 2, 1, 2, 1, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 1, NA, NA…
## $ mcq180a  <int> 40, NA, 55, NA, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq195   <int> 1, NA, 4, NA, 9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, …
## $ mcq160n  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180n  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, NA, NA, …
## $ mcq160b  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180b  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 53, NA, NA, …
## $ mcq160c  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180c  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160d  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180d  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160e  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180e  <int> NA, NA, 58, NA, NA, NA, NA, NA, NA, NA, NA, NA, 53, NA, NA, …
## $ mcq160f  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180f  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160g  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180g  <int> NA, NA, 59, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160m  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170m  <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq180m  <int> NA, NA, 39, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160k  <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170k  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq180k  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160l  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170l  <int> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq180l  <int> NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160o  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq203   <int> 2, 2, 1, 2, 2, 2, 2, NA, NA, 2, 2, 2, 2, 2, NA, 2, NA, 2, 2,…
## $ mcq206   <int> NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq220   <int> 1, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq230a  <int> 25, NA, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230b  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230c  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230d  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240a  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240aa <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240b  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240bb <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240c  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240cc <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240d  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240dd <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240dk <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240e  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240f  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240g  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240h  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240i  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240j  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240k  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240l  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240m  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240n  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240o  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240p  <int> 58, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240q  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240r  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240s  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240t  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240u  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240v  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240w  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240x  <int> NA, NA, 57, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240y  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240z  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq300a  <int> 2, 2, 2, 9, 2, 2, NA, NA, NA, 1, 2, NA, 2, NA, NA, 9, NA, NA…
## $ mcq300b  <int> 2, 2, 1, 9, 2, 2, 1, NA, NA, 2, 2, 2, 2, 2, NA, 9, NA, 2, 2,…
## $ mcq300c  <int> 1, 1, 2, 9, 9, 1, NA, NA, NA, 1, 1, NA, 1, NA, NA, 9, NA, NA…
## $ mcq365a  <int> 2, 2, 2, 1, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq365b  <int> 1, 2, 1, 1, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 2, NA, 1, …
## $ mcq365c  <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq365d  <int> 2, 2, 1, 1, 2, 1, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq370a  <int> 1, 2, 1, 1, 2, 2, NA, NA, NA, 1, 1, 2, 1, NA, NA, 2, NA, 2, …
## $ mcq370b  <int> 1, 2, 2, 2, 2, 2, NA, NA, NA, 1, 1, 2, 1, NA, NA, 1, NA, 2, …
## $ mcq370c  <int> 2, 2, 2, 1, 2, 1, NA, NA, NA, 1, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq370d  <int> 2, 2, 1, 1, 2, 1, NA, NA, NA, 1, 1, 1, 1, NA, NA, 1, NA, 2, …
## $ osq230   <int> 1, 2, 2, 2, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, 2, NA, …
## $ bmdstats <int> 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bmxwt    <dbl> 94.8, 90.4, 83.4, 109.8, 55.2, 64.4, 37.2, 16.4, 10.1, 76.6,…
## $ bmiwt    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, N…
## $ bmxrecum <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmirecum <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, N…
## $ bmxhead  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmihead  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxht    <dbl> 184.5, 171.4, 170.1, 160.9, 164.9, 150.0, 143.5, 102.1, NA, …
## $ bmiht    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxbmi   <dbl> 27.8, 30.8, 28.8, 42.4, 20.3, 28.6, 18.1, 15.7, NA, 28.0, 28…
## $ bmdbmic  <int> NA, NA, NA, NA, NA, NA, 2, 2, NA, NA, NA, 3, NA, 3, 2, NA, 2…
## $ bmxleg   <dbl> 43.3, 38.0, 35.6, 38.5, 37.4, 34.4, 32.2, NA, NA, 38.8, 34.1…
## $ bmileg   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…
## $ bmxarml  <dbl> 43.6, 40.0, 37.0, 37.7, 36.0, 33.5, 30.5, 22.0, NA, 38.0, 33…
## $ bmiarml  <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA…
## $ bmxarmc  <dbl> 35.9, 33.2, 31.0, 38.3, 27.2, 31.4, 21.7, 16.4, NA, 34.0, 31…
## $ bmiarmc  <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA…
## $ bmxwaist <dbl> 101.1, 107.9, 116.5, 110.1, 80.4, 92.9, 67.5, 48.5, NA, 86.6…
## $ bmiwaist <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…
## $ bmxsad1  <dbl> 22.9, 27.5, 26.7, 25.2, NA, 23.2, 15.3, NA, NA, 19.1, 22.5, …
## $ bmxsad2  <dbl> 22.7, 27.1, 26.5, 25.0, NA, 23.0, 15.3, NA, NA, 19.3, 22.1, …
## $ bmxsad3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxsad4  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmdavsad <dbl> 22.8, 27.3, 26.6, 25.1, NA, 23.1, 15.3, NA, NA, 19.2, 22.3, …
## $ bmdsadcm <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…

Find the mean weight by treatment

# Fill in the dplyr code
nhanes_combined %>% 
  group_by(mcq365d) %>% 
  summarize(mean = mean(bmxwt, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   mcq365d  mean
##     <int> <dbl>
## 1       1  90.7
## 2       2  76.5
## 3       9  90.8
## 4      NA  33.5
# Fill in the ggplot2 code
nhanes_combined %>% 
  ggplot(aes(as.factor(mcq365d), y = bmxwt )) +
  geom_boxplot() +
  labs(x = "Treatment",
       y = "Weight")
## Warning: Removed 99 rows containing non-finite values (stat_boxplot).

NHANES data cleaning:

# Filter to keep only those over 16
nhanes_filter <- nhanes_combined %>% filter(ridageyr > 16 )

# Load simputation & impute bmxwt by riagendr
nhanes_final <- simputation::impute_median(nhanes_filter, bmxwt ~ riagendr )

# Recode mcq365d with recode() & examine with count()
nhanes_final$mcq365d <- recode(nhanes_final$mcq365d, 
                               `1` = 1,
                               `2` = 2,
                               `9` = 2)
nhanes_final %>% count( mcq365d)
##   mcq365d    n
## 1       1 1802
## 2       2 4085

Resampling NHANES data:

# Use sample_n() to create nhanes_srs
nhanes_srs <- nhanes_final %>% sample_n(2500)

# Create nhanes_stratified with group_by() and sample_n()
nhanes_stratified <- nhanes_final %>% group_by( riagendr ) %>% sample_n( 2000 )
nhanes_stratified %>% count()
## # A tibble: 2 x 2
## # Groups:   riagendr [2]
##   riagendr     n
##      <int> <int>
## 1        1  2000
## 2        2  2000
# Load sampling package and create nhanes_cluster with cluster()
nhanes_cluster <- cluster(nhanes_final, 'indhhin2', 6, method = "srswor")

Randomized Complete Block Design

Block what you can, Randomize what you can’t

  • Randomized - the treatment is assigned randomly inside each block
  • Complete - each treatment is used the same number of times in every block
  • Block - experimental groups are blocked to be similar (e.g. by sex). Blocking is used when there is a nuisance factor: something that may affect the outcome but isn’t of interest to the experiment.
  • Design - this is your experiment!

Use the agricolae library to generate RCBD and other design layouts

trt <- letters[ 1:4 ]
rep <- 4
design.rcbd <- design.rcbd( trt,
                            r = rep,
                            seed = 42,
                            serie = 0 )
design.rcbd$sketch
##      [,1] [,2] [,3] [,4]
## [1,] "d"  "a"  "c"  "b" 
## [2,] "a"  "d"  "c"  "b" 
## [3,] "c"  "d"  "b"  "a" 
## [4,] "b"  "d"  "a"  "c"
# Create designs using ls()
designs <- ls("package:agricolae", pattern = "design")
designs
##  [1] "design.ab"      "design.alpha"   "design.bib"     "design.crd"    
##  [5] "design.cyclic"  "design.dau"     "design.graeco"  "design.lattice"
##  [9] "design.lsd"     "design.mat"     "design.rcbd"    "design.split"  
## [13] "design.strip"   "design.youden"
# Use str() to view design.rcbd's criteria
str(design.rcbd)
## List of 3
##  $ parameters:List of 7
##   ..$ design: chr "rcbd"
##   ..$ trt   : chr [1:4] "a" "b" "c" "d"
##   ..$ r     : num 4
##   ..$ serie : num 0
##   ..$ seed  : num 42
##   ..$ kinds : chr "Super-Duper"
##   ..$       : logi TRUE
##  $ sketch    : chr [1:4, 1:4] "d" "a" "c" "b" ...
##  $ book      :'data.frame':  16 obs. of  3 variables:
##   ..$ plots: num [1:16] 11 12 13 14 21 22 23 24 31 32 ...
##   ..$ block: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
##   ..$ trt  : Factor w/ 4 levels "a","b","c","d": 4 1 3 2 1 4 3 2 3 4 ...
# Build treats and rep
treats <- LETTERS[1:5]
blocks <- 4

# Build my_design_rcbd and view the sketch
my_design_rcbd <- design.rcbd(treats, r = blocks, seed = 42)
my_design_rcbd$sketch
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "D"  "A"  "C"  "B"  "E" 
## [2,] "E"  "A"  "C"  "D"  "B" 
## [3,] "D"  "B"  "E"  "A"  "C" 
## [4,] "B"  "D"  "E"  "C"  "A"

NHANES RCBD

# Use aov() to create nhanes_rcbd
nhanes_rcbd <- aov( bmxwt ~ mcq365d + riagendr, nhanes_final )

# Check results of nhanes_rcbd with summary()
summary( nhanes_rcbd )
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## mcq365d        1  229164  229164   571.2 <2e-16 ***
## riagendr       1  163069  163069   406.4 <2e-16 ***
## Residuals   5884 2360774     401                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print mean weights by mcq365d and riagendr
nhanes_final %>% 
    group_by(mcq365d, riagendr) %>% 
    summarize(mean_wt = mean(bmxwt))
## `summarise()` regrouping output by 'mcq365d' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   mcq365d [2]
##   mcq365d riagendr mean_wt
##     <dbl>    <int>   <dbl>
## 1       1        1    95.2
## 2       1        2    86.6
## 3       2        1    82.7
## 4       2        2    71.3

There are some clear differences between the gender groups, so blocking on gender was a good move. We also observe a consistent, statistically significant effect of the treatment on weight.

# Set up the 2x2 plotting grid and plot nhanes_rcbd
par(mfrow=c(2,2))

# Plot nhanes_rcbd
plot( nhanes_rcbd )

View the interaction plot between the treatment and gender and observe whether the lines are parallel.

# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(riagendr, mcq365d, bmxwt))

View the interaction plots between gender and the treatment (it’ll be a little different!) and observe if the lines are parallel.

# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(mcq365d, riagendr, bmxwt))

The initial diagnostic plots show that this model is pretty good but not great: especially at the larger end of the data, the Q-Q plot shows the data might not be normal. The interaction plots show nearly parallel lines, so we can move forward with this model.

Balanced Incomplete Block Design

Sometimes it is not possible/feasible/necessary to test every treatment condition in a block.

  • Balanced: each pair of treatments occurs together in a block an equal number of times
  • Incomplete: not every treatment will appear in every block
  • Block: experimental groups are blocked to be similar (e.g. by sex)
  • Design: this is your experiment!

Quick Maths: is a BIBD possible?

  • Let:
    • t = # of treatments
    • k = # of treatments per block
    • r = # replications
    • \(\lambda = r \cdot \frac{(k-1)}{(t-1)}\)

if \(\lambda\) is a whole number, then a BIBD is possible.

#a BIBD is possible
t <- 4
k <- 3
r <- 3
lambda <- r*(k-1)/(t-1)
lambda
## [1] 2
#a BIBD is NOT possible
t <- 3
k <- 4
r <- 11
lambda <- r*(k-1)/(t-1)
lambda
## [1] 16.5

Draw some agricolae sketches

# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:4], k = 4, seed = 42)
## 
## Parameters BIB
## ==============
## Lambda     : 2
## treatmeans : 4
## Block size : 4
## Blocks     : 2
## Replication: 2 
## 
## Efficiency factor 1 
## 
## <<< Book >>>
my_design_bibd_3$sketch
##      [,1] [,2] [,3] [,4]
## [1,] "C"  "A"  "D"  "B" 
## [2,] "C"  "D"  "B"  "A"
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:12], k = 2, seed = 42)
## 
## Parameters BIB
## ==============
## Lambda     : 1
## treatmeans : 12
## Block size : 2
## Blocks     : 66
## Replication: 11 
## 
## Efficiency factor 0.5454545 
## 
## <<< Book >>>
my_design_bibd_3$sketch
##       [,1] [,2]
##  [1,] "E"  "F" 
##  [2,] "C"  "I" 
##  [3,] "L"  "B" 
##  [4,] "H"  "A" 
##  [5,] "D"  "F" 
##  [6,] "A"  "J" 
##  [7,] "I"  "B" 
##  [8,] "G"  "B" 
##  [9,] "D"  "B" 
## [10,] "C"  "E" 
## [11,] "F"  "J" 
## [12,] "H"  "D" 
## [13,] "C"  "F" 
## [14,] "G"  "C" 
## [15,] "B"  "K" 
## [16,] "E"  "K" 
## [17,] "A"  "F" 
## [18,] "J"  "H" 
## [19,] "B"  "J" 
## [20,] "L"  "K" 
## [21,] "E"  "I" 
## [22,] "E"  "B" 
## [23,] "K"  "J" 
## [24,] "D"  "A" 
## [25,] "D"  "I" 
## [26,] "C"  "L" 
## [27,] "H"  "E" 
## [28,] "G"  "I" 
## [29,] "E"  "L" 
## [30,] "C"  "D" 
## [31,] "E"  "J" 
## [32,] "J"  "I" 
## [33,] "I"  "K" 
## [34,] "C"  "H" 
## [35,] "I"  "F" 
## [36,] "G"  "A" 
## [37,] "J"  "C" 
## [38,] "G"  "J" 
## [39,] "H"  "B" 
## [40,] "D"  "G" 
## [41,] "G"  "L" 
## [42,] "H"  "F" 
## [43,] "J"  "D" 
## [44,] "E"  "G" 
## [45,] "L"  "F" 
## [46,] "H"  "K" 
## [47,] "K"  "D" 
## [48,] "L"  "J" 
## [49,] "F"  "B" 
## [50,] "L"  "A" 
## [51,] "L"  "H" 
## [52,] "G"  "H" 
## [53,] "E"  "A" 
## [54,] "B"  "C" 
## [55,] "I"  "L" 
## [56,] "C"  "A" 
## [57,] "D"  "L" 
## [58,] "C"  "K" 
## [59,] "G"  "F" 
## [60,] "I"  "A" 
## [61,] "K"  "F" 
## [62,] "B"  "A" 
## [63,] "D"  "E" 
## [64,] "A"  "K" 
## [65,] "I"  "H" 
## [66,] "K"  "G"
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:6], k = 6, seed = 42)
## 
## Parameters BIB
## ==============
## Lambda     : 2
## treatmeans : 6
## Block size : 6
## Blocks     : 2
## Replication: 2 
## 
## Efficiency factor 1 
## 
## <<< Book >>>
my_design_bibd_3$sketch
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "E"  "F"  "A"  "D"  "B"  "C" 
## [2,] "A"  "E"  "C"  "B"  "D"  "F"
#write a lambda function to calculate lambda for us
lambda <- function( t, k, r ) {
  lambda = r*(k-1)/(t-1)
  return( lambda )
}

# Calculate lambda
lambda( t = 4, k = 3, r = 3 )
## [1] 2
# Build the data.frame
creatinine <- c(1.98, 1.97, 2.35, 2.09, 1.87, 1.95, 2.08, 2.01, 1.84, 2.06, 1.97, 2.22)
food <- as.factor(c("A", "C", "D", "A", "B", "C", "B", "C", "D", "A", "B", "D"))
color <- as.factor(rep(c("Black", "White", "Orange", "Spotted"), each = 3))
cat_experiment <- as.data.frame(cbind(creatinine, food, color))

# Create cat_model and examine with summary()
cat_model <- aov( creatinine ~ food + color, cat_experiment )
summary( cat_model )
##             Df  Sum Sq  Mean Sq F value Pr(>F)
## food         1 0.01204 0.012042   0.530  0.485
## color        1 0.00697 0.006971   0.307  0.593
## Residuals    9 0.20461 0.022735
# Calculate lambda
lambda( t = 2, k = 2, r =2 )
## [1] 2
# Create weightlift_model & examine results
weightlift_model <- aov( bmxarmc ~ riagendr + ridreth1, nhanes_final )
summary( weightlift_model )
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## riagendr       1   3779    3779  140.51  < 2e-16 ***
## ridreth1       1   1453    1453   54.02 2.27e-13 ***
## Residuals   5571 149832      27                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 313 observations deleted due to missingness

Latin Squares, Graeco-Latin Squares, & Factorial Experiments

Latin Squares

  • Two blocking factors (instead of 1)
  • All factors have the same number of levels
  • Key Assumption: the treatment and two blocking factors do not interact
  • Analyze like an RCBD

Take a look at the blocking diagram for a Latin Squares experiment:

# Create lsqr and view the sketch
lsqr <- design.rcbd(LETTERS[1:4], r = 4, seed = 42)
lsqr$sketch
##      [,1] [,2] [,3] [,4]
## [1,] "D"  "A"  "C"  "B" 
## [2,] "A"  "D"  "C"  "B" 
## [3,] "C"  "D"  "B"  "A" 
## [4,] "B"  "D"  "A"  "C"

dataset for this chapter:

url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/6eee2fcc47c8c8dbb2e9d4670cf2eabeda52b705/nyc_scores.csv'
nyc_scores <- read.csv( url )
glimpse( nyc_scores )
## Rows: 435
## Columns: 22
## $ School_ID                 <chr> "02M260", "06M211", "01M539", "02M294", "02…
## $ School_Name               <chr> "Clinton School Writers and Artists", "Inwo…
## $ Borough                   <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ Building_Code             <chr> "M933", "M052", "M022", "M445", "M445", "M4…
## $ Street_Address            <chr> "425 West 33rd Street", "650 Academy Street…
## $ City                      <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ State                     <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
## $ Zip_Code                  <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude                  <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude                 <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number              <chr> "212-695-9114", "718-935-3660  ", "212-677-…
## $ Start_Time                <chr> "", "8:30 AM", "8:15 AM", "8:00 AM", "8:30 …
## $ End_Time                  <chr> "", "3:00 PM", "4:00 PM", "2:45 PM", "3:00 …
## $ Student_Enrollment        <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White             <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black             <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic          <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian             <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math    <int> NA, NA, 657, 395, 418, 613, 410, 634, 389, …
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested            <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…

Experiment: look at the effects of tutoring programs on SAT scores.

EDA:

# Mean, var, and median of Math score
nyc_scores %>%
    group_by(Borough) %>% 
   summarize(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
        var = var(Average_Score_SAT_Math, na.rm = TRUE),
        median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 x 4
##   Borough        mean   var median
##   <chr>         <dbl> <dbl>  <dbl>
## 1 Bronx          404. 2726.   396.
## 2 Brooklyn       416. 3658.   395 
## 3 Manhattan      456. 7026.   433 
## 4 Queens         462. 5168.   448 
## 5 Staten Island  486. 6911.   466.
# Mean, var, and median of Math score by Zip Code
nyc_scores %>%
    group_by(Zip_Code) %>% 
    summarise(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
        var = var(Average_Score_SAT_Math, na.rm = TRUE),
        median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 120 x 4
##    Zip_Code  mean    var median
##       <int> <dbl>  <dbl>  <dbl>
##  1    10001  NaN     NA     NA 
##  2    10002  473. 11234.   430 
##  3    10003  450.  1211.   446 
##  4    10004  463   6005.   432 
##  5    10006  430.  3120.   430.
##  6    10009  454     NA    454 
##  7    10010  549.  1401.   534 
##  8    10011  460.  8731.   422.
##  9    10013  478.  3120.   478.
## 10    10016  362.   312.   362.
## # … with 110 more rows
# Examine missingness with miss_var_summary()
nyc_scores %>% miss_var_summary( )
## # A tibble: 22 x 3
##    variable                  n_miss pct_miss
##    <chr>                      <int>    <dbl>
##  1 Average_Score_SAT_Math        60    13.8 
##  2 Average_Score_SAT_Reading     60    13.8 
##  3 Average_Score_SAT_Writing     60    13.8 
##  4 Percent_Tested                49    11.3 
##  5 Student_Enrollment             7     1.61
##  6 Percent_White                  7     1.61
##  7 Percent_Black                  7     1.61
##  8 Percent_Hispanic               7     1.61
##  9 Percent_Asian                  7     1.61
## 10 School_ID                      0     0   
## # … with 12 more rows
# Examine missingness with md.pattern()
md.pattern(nyc_scores)

##     School_ID School_Name Borough Building_Code Street_Address City State
## 375         1           1       1             1              1    1     1
## 11          1           1       1             1              1    1     1
## 42          1           1       1             1              1    1     1
## 7           1           1       1             1              1    1     1
##             0           0       0             0              0    0     0
##     Zip_Code Latitude Longitude Phone_Number Start_Time End_Time
## 375        1        1         1            1          1        1
## 11         1        1         1            1          1        1
## 42         1        1         1            1          1        1
## 7          1        1         1            1          1        1
##            0        0         0            0          0        0
##     Student_Enrollment Percent_White Percent_Black Percent_Hispanic
## 375                  1             1             1                1
## 11                   1             1             1                1
## 42                   1             1             1                1
## 7                    0             0             0                0
##                      7             7             7                7
##     Percent_Asian Percent_Tested Average_Score_SAT_Math
## 375             1              1                      1
## 11              1              1                      0
## 42              1              0                      0
## 7               0              0                      0
##                 7             49                     60
##     Average_Score_SAT_Reading Average_Score_SAT_Writing    
## 375                         1                         1   0
## 11                          0                         0   3
## 42                          0                         0   4
## 7                           0                         0   9
##                            60                        60 264
#glimpse( nyc_scores_f )
nyc_scores_f <- nyc_scores %>%
  mutate_if( is.character, as.factor )
# Impute the Math score by Borough
nyc_scores_2 <- simputation::impute_median(nyc_scores_f, Average_Score_SAT_Math ~ Borough)

# Convert Math score to numeric
nyc_scores_2$Average_Score_SAT_Math <- as.numeric(nyc_scores_2$Average_Score_SAT_Math)

# Examine scores by Borough in both datasets, before and after imputation
nyc_scores %>% 
    group_by(Borough) %>% 
    dplyr::summarize(median = median(Average_Score_SAT_Math, na.rm = TRUE), 
              mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   Borough       median  mean
##   <chr>          <dbl> <dbl>
## 1 Bronx           396.  404.
## 2 Brooklyn        395   416.
## 3 Manhattan       433   456.
## 4 Queens          448   462.
## 5 Staten Island   466.  486.
nyc_scores_2 %>% 
    group_by(Borough) %>% 
    dplyr::summarize(median = median(Average_Score_SAT_Math, na.rm = TRUE), 
              mean = mean(Average_Score_SAT_Math, na.rm = TRUE)) 
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   Borough       median  mean
##   <fct>          <dbl> <dbl>
## 1 Bronx           396.  403.
## 2 Brooklyn        395   414.
## 3 Manhattan       433   452.
## 4 Queens          448   460.
## 5 Staten Island   466.  486.
# Design a LS with 5 treatments A:E then look at the sketch
my_design_lsd <- design.lsd( LETTERS[1:5], seed=42)
my_design_lsd$sketch
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "E"  "D"  "A"  "C"  "B" 
## [2,] "D"  "C"  "E"  "B"  "A" 
## [3,] "A"  "E"  "B"  "D"  "C" 
## [4,] "C"  "B"  "D"  "A"  "E" 
## [5,] "B"  "A"  "C"  "E"  "D"
glimpse( nyc_scores_2 )
## Rows: 435
## Columns: 22
## $ School_ID                 <fct> 02M260, 06M211, 01M539, 02M294, 02M308, 02M…
## $ School_Name               <fct> "Clinton School Writers and Artists", "Inwo…
## $ Borough                   <fct> Manhattan, Manhattan, Manhattan, Manhattan,…
## $ Building_Code             <fct> M933, M052, M022, M445, M445, M445, M056, M…
## $ Street_Address            <fct> 425 West 33rd Street, 650 Academy Street, 1…
## $ City                      <fct> Manhattan, Manhattan, Manhattan, Manhattan,…
## $ State                     <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY,…
## $ Zip_Code                  <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude                  <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude                 <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number              <fct> 212-695-9114, 718-935-3660  , 212-677-5190,…
## $ Start_Time                <fct> , 8:30 AM, 8:15 AM, 8:00 AM, 8:30 AM, 8:00 …
## $ End_Time                  <fct> , 3:00 PM, 4:00 PM, 2:45 PM, 3:00 PM, 3:35 …
## $ Student_Enrollment        <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White             <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black             <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic          <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian             <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math    <dbl> 433, 433, 657, 395, 418, 613, 410, 634, 389…
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested            <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…
# Build nyc_scores_ls_lm
nyc_scores_ls_lm <- lm(Average_Score_SAT_Math ~ Percent_White + Borough + Percent_Asian,
                        data = nyc_scores_2 )

# Tidy the results with broom
tidy( nyc_scores_ls_lm )
## # A tibble: 7 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            385.        3.77    102.   2.90e-299
## 2 Percent_White          217.       18.7      11.6  4.33e- 27
## 3 BoroughBrooklyn         -7.32      5.28     -1.39 1.67e-  1
## 4 BoroughManhattan        14.6       5.62      2.61 9.48e-  3
## 5 BoroughQueens          -15.2       6.58     -2.31 2.11e-  2
## 6 BoroughStaten Island   -29.7      15.3      -1.94 5.32e-  2
## 7 Percent_Asian          288.       16.9      17.1  5.14e- 50
# Examine the results with anova
anova(nyc_scores_ls_lm )
## Analysis of Variance Table
## 
## Response: Average_Score_SAT_Math
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## Percent_White   1 693879  693879 431.778 < 2.2e-16 ***
## Borough         4 139254   34814  21.663 2.863e-16 ***
## Percent_Asian   1 468093  468093 291.279 < 2.2e-16 ***
## Residuals     421 676558    1607                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Graeco-Latin Squares

  • Three blocking factors (instead of 1 or 2)
  • All factors have the same number of levels
  • Key assumption: the treatment and blocking factors do not interact
  • Analyze like an RCBD
# Create a boxplot of Math scores by Borough, with a title and x/y axis labels
ggplot(nyc_scores, aes( x = Borough, y =  Average_Score_SAT_Math)) +
  geom_boxplot() + 
  labs(title = "Average SAT Math Scores by Borough, NYC",
       x = "Borough (NYC)",
       y = "Average SAT Math Scores (2014-15)")
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

The distributions of SAT Math scores are broadly comparable across boroughs, though the medians differ somewhat and some boroughs show more outliers than others.

# Create trt1 and trt2
trt1 <- LETTERS[1:5]
trt2 <- c(1:5)

# Create my_graeco_design
my_graeco_design <- design.graeco( trt1, trt2, seed = 42)

# Examine the parameters and sketch
my_graeco_design$parameters
## $design
## [1] "graeco"
## 
## $trt1
## [1] "A" "B" "C" "D" "E"
## 
## $trt2
## [1] 1 2 3 4 5
## 
## $r
## [1] 5
## 
## $serie
## [1] 2
## 
## $seed
## [1] 42
## 
## $kinds
## [1] "Super-Duper"
## 
## [[8]]
## [1] TRUE
my_graeco_design$sketch
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "D 5" "A 1" "C 3" "B 4" "E 2"
## [2,] "A 3" "C 4" "B 2" "E 5" "D 1"
## [3,] "C 2" "B 5" "E 1" "D 3" "A 4"
## [4,] "B 1" "E 3" "D 4" "A 2" "C 5"
## [5,] "E 4" "D 2" "A 5" "C 1" "B 3"
#kinds == methods for randomization
#kinds <- c("Wichmann-Hill", "Marsaglia-Multicarry", "Super-Duper", "Mersenne-Twister", "Knuth-TAOCP", "user-supplied", "Knuth-TAOCP-2002", "default" )
my_graeco_design <- design.graeco( trt1, trt2, seed = 42, kinds = 'Wichmann-Hill')
my_graeco_design$sketch
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "D 1" "B 2" "A 4" "E 5" "C 3"
## [2,] "B 4" "A 5" "E 3" "C 1" "D 2"
## [3,] "A 3" "E 1" "C 2" "D 4" "B 5"
## [4,] "E 2" "C 4" "D 5" "B 3" "A 1"
## [5,] "C 5" "D 3" "B 1" "A 2" "E 4"
# Say we block by borough, teacher education level, and homework type: this experiment can be designed with Graeco-Latin square blocking to test the effect of a tutoring program

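The nyc_scores_gls data frame used below comes from the DataCamp exercise and is never built in this document. As a hypothetical sketch, the chunk below attaches made-up Tutoring_Program, Teacher_Education_Level, and Homework_Type factors to the imputed scores so that the model can be fit; the assignments are random, not a true Graeco-Latin square layout.

# Hypothetical stand-in for the exercise's nyc_scores_gls dataset:
# random factor assignments (not a real Graeco-Latin square), just so the model below runs
set.seed( 42 )
nyc_scores_gls <- nyc_scores_2 %>%
  mutate( Tutoring_Program        = factor( sample( c( "Yes", "No" ), n(), replace = TRUE ) ),
          Teacher_Education_Level = factor( sample( c( "BA", "MA", "PhD" ), n(), replace = TRUE ) ),
          Homework_Type           = factor( sample( c( "Short", "Long" ), n(), replace = TRUE ) ) )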
# Build nyc_scores_gls_lm
nyc_scores_gls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level + Homework_Type,
                        data = nyc_scores_gls )

# Tidy the results with broom
tidy( nyc_scores_gls_lm )

# Examine the results with anova
anova( nyc_scores_gls_lm )

Factorial Experiments

Experimental design that considers interactions of experimental variables.

  • 2 or more factor variables are combined and crossed
  • all the possible interactions between levels of factors are considered as effects on the outcome
    • Ex: high/low water and high/low sunlight’s effect on plant growth
  • Use Tukey’s test to look for significant differences of means between groups

\(2^k\) factorial experiments

  • \(2^k\) factorial experiments involve k factor variables with 2 levels
  • it results in \(2^k\) combinations of factor levels to test (enumerated in the small sketch after this list)
  • analyze with a linear model and ANOVA
  • also use TukeyHSD() to determine which pairwise combinations are significantly different
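For example, with the k = 3 high/low factors created below, the \(2^3 = 8\) treatment combinations can be enumerated directly (a small illustrative sketch; the labels are placeholders for the 1/2 coding used later):

# With k = 3 two-level factors there are 2^3 = 8 treatment combinations
expand.grid( ptest_high_low   = c( "low", "high" ),
             Percent_Black_HL = c( "low", "high" ),
             Percent_White_HL = c( "low", "high" ) )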
#glimpse( nyc_scores )
nyc_scores_ptest <- nyc_scores %>%
  mutate( ptest_high_low = as.factor( case_when( (Percent_Tested > 0.6) ~ 2,
                                      (Percent_Tested <= 0.6) ~ 1) ),
          Percent_Black_HL = as.factor( case_when( (Percent_Black > 0.3) ~ 2,
                                      (Percent_Black <= 0.3) ~ 1) ),
          Percent_White_HL = as.factor( case_when( (Percent_White > 0.3) ~ 2,
                                      (Percent_White <= 0.3) ~ 1) ) )
# Build the boxplot for percent tested (high/low) vs. Math SAT score
ggplot(nyc_scores_ptest,
       aes(y = Average_Score_SAT_Math, x = ptest_high_low)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

# Build the boxplot for the percent black vs. Math SAT score
ggplot(nyc_scores_ptest,
       aes(y = Average_Score_SAT_Math, x = Percent_Black_HL)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

# Build the boxplot for the percent white vs. Math SAT score
ggplot(nyc_scores_ptest,
       aes(y = Average_Score_SAT_Math, x = Percent_White_HL)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

glimpse( nyc_scores_ptest )
## Rows: 435
## Columns: 25
## $ School_ID                 <chr> "02M260", "06M211", "01M539", "02M294", "02…
## $ School_Name               <chr> "Clinton School Writers and Artists", "Inwo…
## $ Borough                   <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ Building_Code             <chr> "M933", "M052", "M022", "M445", "M445", "M4…
## $ Street_Address            <chr> "425 West 33rd Street", "650 Academy Street…
## $ City                      <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ State                     <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
## $ Zip_Code                  <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude                  <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude                 <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number              <chr> "212-695-9114", "718-935-3660  ", "212-677-…
## $ Start_Time                <chr> "", "8:30 AM", "8:15 AM", "8:00 AM", "8:30 …
## $ End_Time                  <chr> "", "3:00 PM", "4:00 PM", "2:45 PM", "3:00 …
## $ Student_Enrollment        <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White             <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black             <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic          <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian             <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math    <int> NA, NA, 657, 395, 418, 613, 410, 634, 389, …
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested            <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…
## $ ptest_high_low            <fct> NA, NA, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2,…
## $ Percent_Black_HL          <fct> NA, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, …
## $ Percent_White_HL          <fct> NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
# Create nyc_scores_factorial and examine the results
nyc_scores_factorial <- aov( Average_Score_SAT_Math ~ ptest_high_low * Percent_Black_HL * Percent_White_HL, data = nyc_scores_ptest )

tidy( nyc_scores_factorial )
## # A tibble: 6 x 6
##   term                                 df     sumsq   meansq statistic   p.value
##   <chr>                             <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1 ptest_high_low                        1  410564.  410564.  117.       7.66e-24
## 2 Percent_Black_HL                      1  153350.  153350.   43.6      1.38e-10
## 3 Percent_White_HL                      1      29.1     29.1   0.00827  9.28e- 1
## 4 ptest_high_low:Percent_Black_HL       1   73269.   73269.   20.8      6.78e- 6
## 5 Percent_Black_HL:Percent_White_HL     1    2286.    2286.    0.651    4.20e- 1
## 6 Residuals                           369 1296753.    3514.   NA       NA
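The bullets above mention TukeyHSD(); although the exercise stops at the ANOVA table, post-hoc pairwise comparisons could be run directly on the fitted aov object (sketch, output not shown):

# Tukey's Honest Significant Differences for the percent-tested high/low factor
TukeyHSD( nyc_scores_factorial, which = 'ptest_high_low' )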
# shapiro.test() performs the Shapiro-Wilk test of normality
# Use shapiro.test() to test the outcome for normality
shapiro.test( nyc_scores$Average_Score_SAT_Math)
## 
##  Shapiro-Wilk normality test
## 
## data:  nyc_scores$Average_Score_SAT_Math
## W = 0.84672, p-value < 2.2e-16
# Plot nyc_scores_factorial to examine residuals
par(mfrow = c(2,2))
plot( nyc_scores_factorial )

The model appears to fit fairly well, though the Shapiro-Wilk test suggests the scores may not come from a normally distributed population. In the Q-Q plot, the points drift away from the line at the upper end, again pointing away from normality. With more time, we might consider a transformation of the outcome to move it closer to normality (sketched below).
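A minimal sketch of that follow-up (not run here):

# Hypothetical follow-up: refit on log-transformed Math scores
nyc_scores_factorial_log <- aov( log( Average_Score_SAT_Math ) ~ ptest_high_low * Percent_Black_HL * Percent_White_HL,
                                 data = nyc_scores_ptest )

# Re-examine the residual diagnostics after the transformation
par( mfrow = c( 2, 2 ) )
plot( nyc_scores_factorial_log )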

What’s next….