library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( broom )
library( pwr )
library( ggfortify )
library( haven )
library( simputation )
library( sampling )
library( agricolae )
library( naniar )
library( mice )
Introduction to Experimental Design
Intro
Planning, Design & Analysis
- Planning: start with a hypothesis
- dependent variable == outcome
- independent variable(s) == explanatory variables
- Design: logistic regression? Factorial design?
- Analysis
Key components: Randomization, Replication & Blocking
- Randomization: Evenly distributes any variability in outcome due to outside factors across treatment groups (see the random-assignment sketch after this list)
- Ex: double-blind medical trials: neither patient nor doctor knows which group a patient has been assigned to; group assignment is made randomly by a 3rd party
- Replication: must repeat an experiment to fully assess variability. Helps us feel more confident that the result will generalize to the rest of the population.
- Blocking: helps control variability by making treatment groups more alike. Within groups, differences will be minimal; across groups, differences will be large
- classic example: blocking male/female subjects
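As an illustration (not part of the original notes), randomization can be done with sample(); a minimal sketch assigning 20 hypothetical subjects to two groups:
# randomly assign 20 hypothetical subjects to treatment/control
set.seed( 42 )
assignment <- data.frame( subject = paste0( "subject_", 1:20 ),
                          group = sample( rep( c( "treatment", "control" ), each = 10 ) ) )
head( assignment )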
# Load the ToothGrowth dataset
data(ToothGrowth)
# Perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = 'two.sided', mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
## 16.83731 20.78936
## sample estimates:
## mean of x
## 18.81333
# Count number of observations for each combination of supp and dose
ToothGrowth %>%
count( supp, dose )
## supp dose n
## 1 OJ 0.5 10
## 2 OJ 1.0 10
## 3 OJ 2.0 10
## 4 VC 0.5 10
## 5 VC 1.0 10
## 6 VC 2.0 10
# Perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)
ToothGrowth_ttest
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
# Tidy ToothGrowth_ttest
tidy( ToothGrowth_ttest )
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.70 20.7 17.0 1.92 0.0606 55.3 -0.171 7.57
## # … with 2 more variables: method <chr>, alternative <chr>
Because the p-value is about 0.06 (above 0.05), there is not enough evidence to support the alternative hypothesis that supp has an effect on len; i.e., we fail to reject the null that mean tooth growth is the same across supplement groups.
glimpse( ToothGrowth )
## Rows: 60
## Columns: 3
## $ len <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16.5, 16.5…
## $ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, …
## $ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0,…
ToothGrowth_f <- ToothGrowth %>%
mutate( dose = as.factor( dose ) )
# Create a boxplot with geom_boxplot()
ggplot(ToothGrowth_f, aes(x = dose, y = len)) +
geom_boxplot()
# Create ToothGrowth_aov
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth_f)
# Examine ToothGrowth_aov with summary()
summary( ToothGrowth_aov )
## Df Sum Sq Mean Sq F value Pr(>F)
## dose 2 2426.4 1213.2 82.81 < 2e-16 ***
## supp 1 205.4 205.4 14.02 0.000429 ***
## Residuals 56 820.4 14.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis Testing
Hypothesis: central research question
Null Hypothesis: the uninteresting result
- there is no change
- no difference between groups
- median, mean, or observation is the same
Alternative Hypothesis: there is a change
- there is a difference between groups
- median, mean, or observation is >, <, or != to a number
- > or < == one-sided test
- != == two-sided test
Power and Sample Size
- Power: probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true
- Effect Size: standardized measure of the difference you’re trying to detect (it’s easier to detect a larger difference between means than a smaller one)
- Sample Size: How many experimental units you need to survey to detect the desired difference at the desired power. (generally, the larger the sample size, the more power the statistic has)
Power and sample size calculations using the pwr package
pwr.anova.test( k = 3,
n = 20,
f = 0.2,
sig.level = 0.05,
power = NULL )
##
## Balanced one-way analysis of variance power calculation
##
## k = 3
## n = 20
## f = 0.2
## sig.level = 0.05
## power = 0.2521043
##
## NOTE: n is number in each group
# Less than
t.test(x = ToothGrowth$len,
alternative = 'less',
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
## -Inf 20.46358
## sample estimates:
## mean of x
## 18.81333
# Greater than
# Less than
t.test(x = ToothGrowth$len,
alternative = 'greater',
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
## 17.16309 Inf
## sample estimates:
## mean of x
## 18.81333
# Calculate power using an effect size of 0.35, a sample size of 100 in each group, and a significance level of 0.10
pwr.t.test(n = 100,
d = 0.35,
sig.level = 0.1,
type = "two.sample",
alternative = "two.sided",
power = NULL)
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.35
## sig.level = 0.1
## power = 0.7943532
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Calculate sample size needed with an effect size of 0.25, a significance level of 0.05, and a power of 0.8.
pwr.t.test(n = NULL,
d = 0.25,
sig.level = 0.05,
type = "one.sample",
alternative = "greater",
power = 0.8)
##
## One-sample t test power calculation
##
## n = 100.2877
## d = 0.25
## sig.level = 0.05
## power = 0.8
## alternative = greater
Basic Experiments
ANOVA, single and multiple factor experiments
t-tests are great for 2 groups, but ANOVA is necessary for comparisons of 3+ groups.
ANOVA: Analysis of variance test.
- used to compare 3+ groups
- will only indicate that at least 1 of the means differs from the others, but not which one (a post-hoc test such as Tukey’s HSD, used later, identifies which)
Here we look at 2 ways to implement ANOVA in R:
lm()->anova()
#glimpse( ToothGrowth )
model1 <- lm( data = ToothGrowth, len ~ dose )
anova( model1 )
## Analysis of Variance Table
##
## Response: len
## Df Sum Sq Mean Sq F value Pr(>F)
## dose 1 2224.3 2224.30 105.06 1.233e-14 ***
## Residuals 58 1227.9 21.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov()
aov( len ~ dose, data = ToothGrowth )
## Call:
## aov(formula = len ~ dose, data = ToothGrowth)
##
## Terms:
## dose Residuals
## Sum of Squares 2224.304 1227.905
## Deg. of Freedom 1 58
##
## Residual standard error: 4.601171
## Estimated effects may be unbalanced
Single Factor Experiments: only 1 possible explanatory variable is tested
model_1 <- lm( y ~ x )
Multiple Factor Experiments: more than 1 explanatory variable is tested (a runnable sketch with ToothGrowth follows below)
model_2 <- lm( y ~ x + r + s + t )
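As a concrete, runnable sketch (my own, using the ToothGrowth data already loaded above):
# single-factor model: tooth length explained by dose only
single_factor <- lm( len ~ dose, data = ToothGrowth )
# multiple-factor model: dose plus supplement type
multi_factor <- lm( len ~ dose + supp, data = ToothGrowth )
# compare the nested fits
anova( single_factor, multi_factor )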
A look at the Lending Club sample dataset:
target variable: the amount of loan awarded
#work with a subset of the data provided by DataCamp:
lc_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/e14dbe91a0840393e86e4fb9a7ec1b958842ae39/lendclub.csv'
lendingclub <- read.csv( lc_url ) %>%
mutate_if( is.character, as.factor )
glimpse( lendingclub )
## Rows: 1,500
## Columns: 12
## $ member_id <int> 55096114, 1555332, 1009151, 69524202, 72128084, 5…
## $ loan_amnt <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ funded_amnt <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ term <fct> 36 months, 36 months, 60 months, 36 months, 36 mo…
## $ int_rate <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.…
## $ emp_length <fct> 10+ years, 10+ years, 3 years, 10+ years, 10+ yea…
## $ home_ownership <fct> RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MOR…
## $ annual_inc <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, …
## $ verification_status <fct> Not Verified, Verified, Not Verified, Not Verifie…
## $ loan_status <fct> Current, Fully Paid, Fully Paid, Current, Current…
## $ purpose <fct> debt_consolidation, debt_consolidation, home_impr…
## $ grade <fct> C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D…
# Find median loan_amnt and mean int_rate, annual_inc with summarize()
lendingclub %>% summarize( median( loan_amnt ),
mean( int_rate ),
mean( annual_inc ) )
## median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1 13000 13.31472 75736.03
## set the levels in order we want
lendingclub <- within(lendingclub,
purpose <- factor(purpose,
levels=names(sort(table(purpose),
decreasing=FALSE))))
# Use ggplot2 to build a bar chart of purpose
ggplot( lendingclub, aes( x = purpose ) ) +
geom_bar() +
coord_flip()
The purpose feature is very detailed. Here we use recode() to simplify its representation in a new feature, purpose_recode.
# Use recode() to create the new purpose_recode variable
lendingclub$purpose_recode <- lendingclub$purpose %>% recode(
"credit_card" = "debt_related",
"debt_consolidation" = "debt_related",
"medical" = "debt_related",
"car" = "big_purchase",
"major_purchase" = "big_purchase",
"vacation" = "big_purchase",
"moving" = "life_change",
"small_business" = "life_change",
"wedding" = "life_change",
"house" = "home_related",
"home_improvement" = "home_related")
## set the levels in order we want
lendingclub <- within(lendingclub,
purpose_recode <- factor(purpose_recode,
levels=names(sort(table(purpose_recode),
decreasing=FALSE))))
# Use ggplot2 to build a bar chart of purpose_recode
ggplot( lendingclub, aes( x = purpose_recode ) ) +
geom_bar() +
coord_flip()
How does loan purpose affect the amount of loan awarded?
# Build a linear regression model, purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub )
# Examine results of purpose_recode_model
summary( purpose_recode_model )
##
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14472 -6251 -1322 4678 25761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8092 4783 1.692 0.0909 .
## purpose_recodelife_change 5892 5113 1.152 0.2494
## purpose_recodebig_purchase 1796 4943 0.363 0.7164
## purpose_recodeother 1147 4886 0.235 0.8144
## purpose_recodehome_related 6641 4855 1.368 0.1715
## purpose_recodedebt_related 7230 4789 1.510 0.1313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared: 0.03473, Adjusted R-squared: 0.0315
## F-statistic: 10.75 on 5 and 1494 DF, p-value: 3.598e-10
# Get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova( purpose_recode_model )
# Print purpose_recode_anova
purpose_recode_anova
## Analysis of Variance Table
##
## Response: funded_amnt
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.6888e+09 737756668 10.75 3.598e-10 ***
## Residuals 1494 1.0253e+11 68629950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Examine class of purpose_recode_anova
class( purpose_recode_anova )
## [1] "anova" "data.frame"
The ANOVA p-value is very small. This suggests that we can reject the null hypothesis and conclude that there is a significant difference in means for at least one of the recoded purpose categories.
Which loan purpose mean is different?
# Use aov() to build purpose_aov
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub )
# Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD( purpose_aov, "purpose_recode", conf.level = 0.95 )
# Tidy tukey_output to make sense of the results
tidy( tukey_output )
## # A tibble: 15 x 7
## term contrast null.value estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 purpose_… life_change-ren… 0 5892. -8698. 20482. 8.59e-1
## 2 purpose_… big_purchase-re… 0 1796. -12309. 15902. 9.99e-1
## 3 purpose_… other-renewable… 0 1147. -12794. 15088. 1.00e+0
## 4 purpose_… home_related-re… 0 6641. -7212. 20494. 7.46e-1
## 5 purpose_… debt_related-re… 0 7230. -6434. 20894. 6.58e-1
## 6 purpose_… big_purchase-li… 0 -4095. -10365. 2174. 4.25e-1
## 7 purpose_… other-life_chan… 0 -4745. -10636. 1147. 1.95e-1
## 8 purpose_… home_related-li… 0 750. -4929. 6429. 9.99e-1
## 9 purpose_… debt_related-li… 0 1338. -3863. 6539. 9.78e-1
## 10 purpose_… other-big_purch… 0 -649. -5210. 3911. 9.99e-1
## 11 purpose_… home_related-bi… 0 4845. 562. 9128. 1.61e-2
## 12 purpose_… debt_related-bi… 0 5434. 1808. 9059. 2.91e-4
## 13 purpose_… home_related-ot… 0 5494. 1787. 9201. 3.58e-4
## 14 purpose_… debt_related-ot… 0 6083. 3160. 9005. 5.32e-8
## 15 purpose_… debt_related-ho… 0 589. -1879. 3056. 9.84e-1
The output contains all pairwise comparisons. Only a handful of the comparisons have significant p-values.
Sort output by p-value
tukey_sigsort <- tukey_output %>%
tidy() %>%
select( contrast, adj.p.value ) %>%
arrange( adj.p.value ) %>%
mutate( adj.p.value = round( adj.p.value, 8 ) ) %>%
head( 10 )
tukey_sigsort
## # A tibble: 10 x 2
## contrast adj.p.value
## <chr> <dbl>
## 1 debt_related-other 0.00000005
## 2 debt_related-big_purchase 0.000291
## 3 home_related-other 0.000358
## 4 home_related-big_purchase 0.0161
## 5 other-life_change 0.195
## 6 big_purchase-life_change 0.425
## 7 debt_related-renewable_energy 0.658
## 8 home_related-renewable_energy 0.746
## 9 life_change-renewable_energy 0.859
## 10 debt_related-life_change 0.978
Multiple factor experiment:
# Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov( funded_amnt ~ purpose_recode + emp_length, data = lendingclub )
# Call summary() to see the p-values
summary( purpose_emp_aov )
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.689e+09 737756668 10.888 2.63e-10 ***
## emp_length 11 2.044e+09 185843019 2.743 0.00161 **
## Residuals 1483 1.005e+11 67760534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model Validation
lendingclub %>%
group_by( verification_status ) %>%
summarise( mean = mean( funded_amnt ),
var = var( funded_amnt ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## verification_status mean var
## <fct> <dbl> <dbl>
## 1 Not Verified 11018. 33994991.
## 2 Source Verified 15155. 70477801.
## 3 Verified 17562. 82211449.
look at the median and spread of the data with a boxplot:
ggplot( data = lendingclub,
aes( x = verification_status,
y = funded_amnt ) ) +
geom_jitter() +
geom_boxplot()
Post-modeling model validation: are lm assumptions met?
- Residual plot
- QQ-plot for normality
- test ANOVA assumptions
- homogeneity of variance
- try non-parametric alternatives to ANOVA (non-parametric tests make no assumptions about the distribution of the data, i.e. whether it comes from a normal distribution or not)
- Kruskal-Wallis test
purpose_mlr <- lm( funded_amnt ~ purpose_recode + emp_length, data = lendingclub )
autoplot( purpose_mlr, which = 1:4 ) + theme_minimal()
- Residuals vs Fitted: want to see an even variance of the points
- Normal QQ: want to see the points land along the regression line
- Scale-Location: want to see the horizontal line approximately level with even spread of data points
- Residuals vs Leverage: highlights observations with high leverage that may have an outsized influence on the fit
- Cook’s Distance: a common measure of a data point’s influence and a good way to identify influential outliers (a sketch of extracting it follows this list)
If these validation criteria are not met, adding more explanatory variables or trying other (e.g. non-linear) modelling approaches might be a better option.
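A small sketch (my own) pulling Cook’s distances from the purpose_mlr model fit above; the 4/n cutoff is one common rule of thumb, not a hard rule:
# compute Cook's distance for each observation and flag large values
cooks_d <- cooks.distance( purpose_mlr )
influential <- which( cooks_d > 4 / length( cooks_d ) )
head( influential )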
Pre-modeling EDA of int_rate
# Examine the summary of int_rate
summary(lendingclub$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.32 9.99 12.99 13.31 16.29 26.77
# Examine int_rate by grade
lendingclub %>%
group_by(grade) %>%
summarize(mean = mean( int_rate ), var = var( int_rate ), median = median( int_rate ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 4
## grade mean var median
## <fct> <dbl> <dbl> <dbl>
## 1 A 7.27 0.961 7.26
## 2 B 10.9 2.08 11.0
## 3 C 14.0 1.42 14.0
## 4 D 17.4 1.62 17.6
## 5 E 20.1 2.71 20.0
## 6 F 23.6 2.87 23.5
## 7 G 26.1 0.198 25.9
# Make a boxplot of int_rate by grade
ggplot( lendingclub, aes( x = grade, y = int_rate ) ) +
geom_boxplot()
# Use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate~grade, data = lendingclub)
summary( grade_aov )
## Df Sum Sq Mean Sq F value Pr(>F)
## grade 6 27013 4502 2637 <2e-16 ***
## Residuals 1493 2549 2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
autoplot( grade_aov ) + theme_minimal()
test the homogeneity of variance:
# Bartlett's test for homogeneity of variance
bartlett.test( int_rate ~ grade, lendingclub )
##
## Bartlett test of homogeneity of variances
##
## data: int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15
The residuals on this model are okay, though the residuals on G have a much smaller range than any other level of grade (the dots are far less spread out). The Q-Q plot shows that the residuals are fairly normal. However, given the highly significant p-value from Bartlett’s test, the homogeneity-of-variances assumption of ANOVA is violated. Therefore, ANOVA might not be the best choice for this experiment.
Now try the Kruskal-Wallis rank sum test:
# Conduct the Kruskal-Wallis rank sum test
kruskal.test(int_rate ~ grade,
data = lendingclub )
##
## Kruskal-Wallis rank sum test
##
## data: int_rate by grade
## Kruskal-Wallis chi-squared = 1365.5, df = 6, p-value < 2.2e-16
The Kruskal-Wallis test yields a very small p-value; this indicates that we can be confident that int_rate varies as a function of grade.
A/B Testing
A type of controlled experiment with only two variants of something:
- a one-word difference in an ad slogan
- Red ‘buy’ button as opposed to blue
- Consumer click-through rates using two different website headers
Power and Sample Size in A/B testing
- Calculate sample size, given some power, significance level and effect size
- Run the A/B test until the needed sample size has been collected
Lending Club A/B Test: effect of header color for loan application page
Do softer, gentler color tones influence loan applicants to apply for less money?
We’ll be looking at the difference in mean loan amount requested between the two groups. Since there are only 2 groups, we can use a t-test.
What sample size is needed?: calculate the required sample size for each group with d = 0.2, a power of 0.8, and a 0.05 significance level.
# Use the t.test from pwr to find the sample size
pwr.t.test(n= NULL,
d = 0.2,
power = 0.8,
sig.level = 0.05,
type = 'two.sample',
alternative = 'two.sided' )
##
## Two-sample t test power calculation
##
## n = 393.4057
## d = 0.2
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
At least 394 people per group are needed to achieve the desired power.
Visualize & run a two-sided t-test to determine if the difference between the means of the two groups is statistically significant:
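The lendingclub_ab data used below is supplied by the DataCamp exercise and is not built in this document; a purely hypothetical reconstruction (it just randomly assigns the existing applicants to two header-color groups, so no real treatment effect is present) could look like this:
# hypothetical: randomly assign each applicant to header-color group A or B
set.seed( 42 )
lendingclub_ab <- lendingclub %>%
  mutate( Group = sample( c( "A", "B" ), n(), replace = TRUE ) )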
# Plot the A/B test results
ggplot(lendingclub_ab, aes(y = loan_amnt, x = Group)) +
geom_boxplot()
# Conduct a two-sided t-test
t.test( formula = loan_amnt ~ Group, data = lendingclub_ab )
Judging by the boxplot and the t-test result, there is no compelling evidence to support a significant difference between responses for the two colors of header.
A/B testing with multivariate experiments
# Build lendingclub_multi
lendingclub_multi <-lm(loan_amnt ~ Group + grade + verification_status, lendingclub_ab )
# Examine lendingclub_multi results
tidy( lendingclub_multi )
Randomized Complete (& Balanced Incomplete) Block Design
Intro to NHANES and sampling
NHANES: National Health and Nutrition Examination Survey. Information collected from participants that covers medical, dental, socioeconomic, dietary, and general health-related conditions.
Intro to sampling: Probability sampling: probability is used to select the sample
Non-probability Sampling: probability is not used to inform sampling, as in collecting voluntary responses (whoever agrees to respond) and convenience sampling (subjects convenient to the researcher)
5 Methods of Sampling (a short sketch on the built-in iris data follows this list):
- Simple Random Sampling: sample() randomly samples from the data.
- Stratified Sampling: dataset %>% group_by(variable) %>% sample_n() stratifies the data into groups (e.g. M/F) and randomly samples within each group.
- Cluster Sampling: cluster(dataset, cluster_var, num, method = 'option') divides the data into clusters, randomly selects a number of clusters, and samples all subjects in the selected clusters.
- Systematic Sampling: e.g., sample every 10th observation.
- Multistage Sampling: integrates 2+ of the above methods in a systematic way.
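A minimal sketch of the first three methods using the built-in iris dataset (my own illustration; the grouping and cluster choices are arbitrary):
# simple random sample of 30 rows
iris_srs <- iris %>% sample_n( 30 )
# stratified sample: 10 rows from each Species
iris_strat <- iris %>% group_by( Species ) %>% sample_n( 10 )
# cluster sample: treat Species as the cluster variable and draw 2 whole clusters
iris_clus <- sampling::cluster( iris, clustername = "Species", size = 2, method = "srswor" )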
EDA & Resampling of the NHANES dataset:
BMX_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/ee832ef6c2fa7036704c53e90dc1e710a3b50dbc/nhanes_bodymeasures.csv'
DEMO_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/2be5ca94453a63e825bc30ccefd1429b7683c19c/nhanes_demo.csv'
MCQ_url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/d34921a9255422617cdc42f6a3fbcd189f51c19d/nhanes_medicalconditions.csv'
# Import the three datasets using read.csv()
nhanes_demo <- read.csv(DEMO_url)
nhanes_medical <- read.csv(MCQ_url)
nhanes_bodymeasures <- read.csv(BMX_url)
# Merge the 3 datasets you just created to create nhanes_combined
nhanes_combined <- list(nhanes_demo, nhanes_medical, nhanes_bodymeasures) %>% Reduce(function(df1, df2) inner_join(df1, df2, by = "seqn"), .)
glimpse( nhanes_combined )
## Rows: 9,165
## Columns: 161
## $ seqn <int> 83732, 83733, 83734, 83735, 83736, 83737, 83738, 83739, 8374…
## $ sddsrvyr <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
## $ ridstatr <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ riagendr <int> 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, …
## $ ridageyr <int> 62, 53, 78, 56, 42, 72, 11, 4, 1, 22, 32, 18, 56, 15, 4, 46,…
## $ ridagemn <int> NA, NA, NA, NA, NA, NA, NA, NA, 13, NA, NA, NA, NA, NA, NA, …
## $ ridreth1 <int> 3, 3, 3, 3, 4, 1, 1, 3, 2, 4, 1, 5, 4, 3, 5, 3, 4, 3, 5, 1, …
## $ ridreth3 <int> 3, 3, 3, 3, 4, 1, 1, 3, 2, 4, 1, 6, 4, 3, 6, 3, 4, 3, 7, 1, …
## $ ridexmon <int> 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, …
## $ ridexagm <int> NA, NA, NA, NA, NA, NA, 141, 54, 14, NA, NA, 217, NA, 185, 5…
## $ dmqmiliz <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, 2, 2, NA, NA, 2, NA, 2, …
## $ dmqadfc <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ dmdborn4 <int> 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
## $ dmdcitzn <int> 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, …
## $ dmdyrsus <int> NA, 7, NA, NA, NA, 2, NA, NA, NA, NA, 6, NA, NA, NA, 2, 8, N…
## $ dmdeduc3 <int> NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, 11, NA, 9, NA, NA…
## $ dmdeduc2 <int> 5, 3, 3, 5, 4, 2, NA, NA, NA, 4, 4, NA, 3, NA, NA, 5, NA, NA…
## $ dmdmartl <int> 1, 3, 1, 6, 3, 4, NA, NA, NA, 5, 1, NA, 3, NA, NA, 6, NA, NA…
## $ ridexprg <int> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA…
## $ sialang <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ siaproxy <int> 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, …
## $ siaintrp <int> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, …
## $ fialang <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fiaproxy <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ fiaintrp <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, …
## $ mialang <int> 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, 1, NA, 1, 1, NA, 1, NA, 1, 1…
## $ miaproxy <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, NA, 2, 2, NA, 2, NA, 2, 2…
## $ miaintrp <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, NA, 2, 2, NA, 2, NA, 2, 2…
## $ aialanga <int> 1, 1, NA, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1, 1, NA, 1, NA, 1, …
## $ dmdhhsiz <int> 2, 1, 2, 1, 5, 5, 5, 5, 7, 3, 4, 3, 1, 3, 4, 2, 6, 5, 5, 6, …
## $ dmdfmsiz <int> 2, 1, 2, 1, 5, 5, 5, 5, 7, 3, 4, 3, 1, 3, 4, 2, 6, 5, 1, 6, …
## $ dmdhhsza <int> 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 1, 0, 0, 0, 1, 0, 3, 0, 0, 1, …
## $ dmdhhszb <int> 0, 0, 0, 0, 2, 1, 2, 1, 2, 0, 1, 0, 0, 2, 1, 0, 0, 1, 2, 2, …
## $ dmdhhsze <int> 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, …
## $ dmdhrgnd <int> 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, …
## $ dmdhrage <int> 62, 53, 79, 56, 42, 34, 68, 35, 48, 47, 31, 55, 56, 42, 45, …
## $ dmdhrbr4 <int> 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, NA, 2, 1, 1, NA, 1…
## $ dmdhredu <int> 5, 3, 3, 5, 4, 5, 1, 5, 9, 4, 4, 5, 3, 4, NA, 5, 4, 4, NA, 2…
## $ dmdhrmar <int> 1, 3, 1, 6, 3, 1, 2, 1, 5, 1, 1, 1, 3, 5, 1, 6, 5, 1, 1, 1, …
## $ dmdhsedu <int> 3, NA, 3, NA, NA, 5, NA, 5, NA, 5, 4, 4, NA, NA, 4, NA, NA, …
## $ wtint2yr <dbl> 134671.370, 24328.560, 12400.009, 102717.996, 17627.675, 112…
## $ wtmec2yr <dbl> 135629.507, 25282.426, 12575.839, 102078.635, 18234.736, 108…
## $ sdmvpsu <int> 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, …
## $ sdmvstra <int> 125, 125, 131, 131, 126, 128, 120, 124, 119, 128, 125, 122, …
## $ indhhin2 <int> 10, 4, 5, 10, 7, 14, 6, 15, 77, 7, 6, 15, 3, 4, 12, 3, 6, 14…
## $ indfmin2 <int> 10, 4, 5, 10, 7, 14, 6, 15, 77, 7, 6, 15, 3, 4, 12, 3, 6, 14…
## $ indfmpir <dbl> 4.39, 1.32, 1.51, 5.00, 1.23, 2.82, 1.18, 4.22, NA, 2.08, 1.…
## $ mcq010 <int> 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, …
## $ mcq025 <int> NA, NA, 60, NA, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq035 <int> NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2,…
## $ mcq040 <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq050 <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ agq030 <int> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq053 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ mcq080 <int> 1, 2, 1, 1, 2, 1, NA, NA, NA, 1, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq092 <int> 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, 2, 2, 2, 2, NA, 2, NA, 2, 2,…
## $ mcd093 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq149 <int> NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq151 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160a <int> 1, 2, 1, 2, 1, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 1, NA, NA…
## $ mcq180a <int> 40, NA, 55, NA, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq195 <int> 1, NA, 4, NA, 9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, …
## $ mcq160n <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180n <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, NA, NA, …
## $ mcq160b <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180b <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 53, NA, NA, …
## $ mcq160c <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180c <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160d <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180d <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160e <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 1, NA, NA, 2, NA, NA…
## $ mcq180e <int> NA, NA, 58, NA, NA, NA, NA, NA, NA, NA, NA, NA, 53, NA, NA, …
## $ mcq160f <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180f <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160g <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq180g <int> NA, NA, 59, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160m <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170m <int> NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq180m <int> NA, NA, 39, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160k <int> 2, 2, 2, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170k <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq180k <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160l <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq170l <int> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mcq180l <int> NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq160o <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq203 <int> 2, 2, 1, 2, 2, 2, 2, NA, NA, 2, 2, 2, 2, 2, NA, 2, NA, 2, 2,…
## $ mcq206 <int> NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq220 <int> 1, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, NA, 2, NA, NA, 2, NA, NA…
## $ mcq230a <int> 25, NA, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230b <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230c <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq230d <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240a <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240aa <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240b <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240bb <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240c <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240cc <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240d <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240dd <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240dk <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240e <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240f <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240g <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240h <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240i <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240j <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240k <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240l <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240m <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240n <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240o <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240p <int> 58, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240q <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240r <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240s <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240t <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240u <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240v <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240w <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240x <int> NA, NA, 57, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240y <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq240z <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mcq300a <int> 2, 2, 2, 9, 2, 2, NA, NA, NA, 1, 2, NA, 2, NA, NA, 9, NA, NA…
## $ mcq300b <int> 2, 2, 1, 9, 2, 2, 1, NA, NA, 2, 2, 2, 2, 2, NA, 9, NA, 2, 2,…
## $ mcq300c <int> 1, 1, 2, 9, 9, 1, NA, NA, NA, 1, 1, NA, 1, NA, NA, 9, NA, NA…
## $ mcq365a <int> 2, 2, 2, 1, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq365b <int> 1, 2, 1, 1, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 2, NA, 1, …
## $ mcq365c <int> 2, 2, 1, 2, 2, 2, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq365d <int> 2, 2, 1, 1, 2, 1, NA, NA, NA, 2, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq370a <int> 1, 2, 1, 1, 2, 2, NA, NA, NA, 1, 1, 2, 1, NA, NA, 2, NA, 2, …
## $ mcq370b <int> 1, 2, 2, 2, 2, 2, NA, NA, NA, 1, 1, 2, 1, NA, NA, 1, NA, 2, …
## $ mcq370c <int> 2, 2, 2, 1, 2, 1, NA, NA, NA, 1, 2, 1, 1, NA, NA, 1, NA, 2, …
## $ mcq370d <int> 2, 2, 1, 1, 2, 1, NA, NA, NA, 1, 1, 1, 1, NA, NA, 1, NA, 2, …
## $ osq230 <int> 1, 2, 2, 2, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, 2, NA, …
## $ bmdstats <int> 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bmxwt <dbl> 94.8, 90.4, 83.4, 109.8, 55.2, 64.4, 37.2, 16.4, 10.1, 76.6,…
## $ bmiwt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, N…
## $ bmxrecum <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmirecum <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, N…
## $ bmxhead <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmihead <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxht <dbl> 184.5, 171.4, 170.1, 160.9, 164.9, 150.0, 143.5, 102.1, NA, …
## $ bmiht <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxbmi <dbl> 27.8, 30.8, 28.8, 42.4, 20.3, 28.6, 18.1, 15.7, NA, 28.0, 28…
## $ bmdbmic <int> NA, NA, NA, NA, NA, NA, 2, 2, NA, NA, NA, 3, NA, 3, 2, NA, 2…
## $ bmxleg <dbl> 43.3, 38.0, 35.6, 38.5, 37.4, 34.4, 32.2, NA, NA, 38.8, 34.1…
## $ bmileg <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…
## $ bmxarml <dbl> 43.6, 40.0, 37.0, 37.7, 36.0, 33.5, 30.5, 22.0, NA, 38.0, 33…
## $ bmiarml <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA…
## $ bmxarmc <dbl> 35.9, 33.2, 31.0, 38.3, 27.2, 31.4, 21.7, 16.4, NA, 34.0, 31…
## $ bmiarmc <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA…
## $ bmxwaist <dbl> 101.1, 107.9, 116.5, 110.1, 80.4, 92.9, 67.5, 48.5, NA, 86.6…
## $ bmiwaist <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…
## $ bmxsad1 <dbl> 22.9, 27.5, 26.7, 25.2, NA, 23.2, 15.3, NA, NA, 19.1, 22.5, …
## $ bmxsad2 <dbl> 22.7, 27.1, 26.5, 25.0, NA, 23.0, 15.3, NA, NA, 19.3, 22.1, …
## $ bmxsad3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmxsad4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bmdavsad <dbl> 22.8, 27.3, 26.6, 25.1, NA, 23.1, 15.3, NA, NA, 19.2, 22.3, …
## $ bmdsadcm <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N…
Find the mean weight by treatment
# Fill in the dplyr code
nhanes_combined %>%
group_by(mcq365d) %>%
summarize(mean = mean(bmxwt, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## mcq365d mean
## <int> <dbl>
## 1 1 90.7
## 2 2 76.5
## 3 9 90.8
## 4 NA 33.5
# Fill in the ggplot2 code
nhanes_combined %>%
ggplot(aes(as.factor(mcq365d), y = bmxwt )) +
geom_boxplot() +
labs(x = "Treatment",
y = "Weight")
## Warning: Removed 99 rows containing non-finite values (stat_boxplot).
NHANES data cleaning:
# Filter to keep only those 16+
nhanes_filter <- nhanes_combined %>% filter(ridageyr > 16 )
# Load simputation & impute bmxwt by riagendr
nhanes_final <- simputation::impute_median(nhanes_filter, bmxwt ~ riagendr )
# Recode mcq365d with recode() & examine with count()
nhanes_final$mcq365d <- recode(nhanes_final$mcq365d,
`1` = 1,
`2` = 2,
`9` = 2)
nhanes_final %>% count( mcq365d)
## mcq365d n
## 1 1 1802
## 2 2 4085
Resampling NHANES data:
# Use sample_n() to create nhanes_srs
nhanes_srs <- nhanes_final %>% sample_n(2500)
# Create nhanes_stratified with group_by() and sample_n()
nhanes_stratified <- nhanes_final %>% group_by( riagendr ) %>% sample_n( 2000 )
nhanes_stratified %>% count()
## # A tibble: 2 x 2
## # Groups: riagendr [2]
## riagendr n
## <int> <int>
## 1 1 2000
## 2 2 2000
# Load sampling package and create nhanes_cluster with cluster()
nhanes_cluster <- cluster(nhanes_final, 'indhhin2', 6, method = "srswor")
Randomized Complete Block Design
Block what you can, Randomize what you can’t
- Randomized - the treatment is assigned randomly inside each block
- Complete - each treatment is used the same number of times in every block
- Block - experimental groups are blocked to be similar (e.g. by sex). Blocking is used when there is a nuisance factor: something that may affect the outcome but isn’t of interest to the experiment.
- Design - this is your experiment!
Use the agricolae library to generate RCBD and other design layouts:
trt <- letters[ 1:4 ]
rep <- 4
design.rcbd <- design.rcbd( trt,
r = rep,
seed = 42,
serie = 0 )
design.rcbd$sketch
## [,1] [,2] [,3] [,4]
## [1,] "d" "a" "c" "b"
## [2,] "a" "d" "c" "b"
## [3,] "c" "d" "b" "a"
## [4,] "b" "d" "a" "c"
# Create designs using ls()
designs <- ls("package:agricolae", pattern = "design")
designs
## [1] "design.ab" "design.alpha" "design.bib" "design.crd"
## [5] "design.cyclic" "design.dau" "design.graeco" "design.lattice"
## [9] "design.lsd" "design.mat" "design.rcbd" "design.split"
## [13] "design.strip" "design.youden"
# Use str() to view design.rcbd's criteria
str(design.rcbd)
## List of 3
## $ parameters:List of 7
## ..$ design: chr "rcbd"
## ..$ trt : chr [1:4] "a" "b" "c" "d"
## ..$ r : num 4
## ..$ serie : num 0
## ..$ seed : num 42
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## $ sketch : chr [1:4, 1:4] "d" "a" "c" "b" ...
## $ book :'data.frame': 16 obs. of 3 variables:
## ..$ plots: num [1:16] 11 12 13 14 21 22 23 24 31 32 ...
## ..$ block: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
## ..$ trt : Factor w/ 4 levels "a","b","c","d": 4 1 3 2 1 4 3 2 3 4 ...
# Build treats and blocks
treats <- LETTERS[1:5]
blocks <- 4
# Build my_design_rcbd and view the sketch
my_design_rcbd <- design.rcbd(treats, r = blocks, seed = 42)
my_design_rcbd$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "D" "A" "C" "B" "E"
## [2,] "E" "A" "C" "D" "B"
## [3,] "D" "B" "E" "A" "C"
## [4,] "B" "D" "E" "C" "A"
NHANES RCBD
# Use aov() to create nhanes_rcbd
nhanes_rcbd <- aov( bmxwt ~ mcq365d + riagendr, nhanes_final )
# Check results of nhanes_rcbd with summary()
summary( nhanes_rcbd )
## Df Sum Sq Mean Sq F value Pr(>F)
## mcq365d 1 229164 229164 571.2 <2e-16 ***
## riagendr 1 163069 163069 406.4 <2e-16 ***
## Residuals 5884 2360774 401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print mean weights by mcq365d and riagendr
nhanes_final %>%
group_by(mcq365d, riagendr) %>%
summarize(mean_wt = mean(bmxwt))
## `summarise()` regrouping output by 'mcq365d' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: mcq365d [2]
## mcq365d riagendr mean_wt
## <dbl> <int> <dbl>
## 1 1 1 95.2
## 2 1 2 86.6
## 3 2 1 82.7
## 4 2 2 71.3
There are some clear differences between gender groups, so blocking them was a good move. Also, we observe a consistent statistically significant effect of the treatment on weight.
# Set up the 2x2 plotting grid and plot nhanes_rcbd
par(mfrow=c(2,2))
# Plot nhanes_rcbd
plot( nhanes_rcbd )
View the interaction plot between the treatment and gender and observe whether the lines are parallel.
# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(riagendr, mcq365d, bmxwt))
View the interaction plots between gender and the treatment (it’ll be a little different!) and observe if the lines are parallel.
# Run the code to view the interaction plots
with(nhanes_final, interaction.plot(mcq365d, riagendr, bmxwt))
The initial diagnostic plots show that this model is pretty good but not great - especially at the larger end of the data, the Q-Q plot shows the data might not be normal. The interaction plots show nearly parallel lines, so we can move forward with this model.
Balanced Incomplete Block Design
Sometimes it is not possible/feasible/necessary to test every treatment condition in a block.
- Balanced: each pair of treatments occurs together in a block an equal number of times
- Incomplete: not every treatment will appear in every block
- Block: experimental groups are blocked to be similar (e.g. by sex)
- Design: this is your experiment!
Quick Maths: is a BIBD possible?
- Let:
- t = # of treatments
- k = # of treatments per block
- r = # replications
- \(\lambda = r \cdot \frac{(k-1)}{(t-1)}\)
if \(\lambda\) is a whole number, then a BIBD is possible.
#a BIBD is possible
t <- 4
k <- 3
r <- 3
lambda <- r*(k-1)/(t-1)
lambda
## [1] 2
#a BIBD is NOT possible
t <- 3
k <- 4
r <- 11
lambda <- r*(k-1)/(t-1)
lambda
## [1] 16.5
Draw some agricolae sketches
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:4], k = 4, seed = 42)
##
## Parameters BIB
## ==============
## Lambda : 2
## treatmeans : 4
## Block size : 4
## Blocks : 2
## Replication: 2
##
## Efficiency factor 1
##
## <<< Book >>>
my_design_bibd_3$sketch
## [,1] [,2] [,3] [,4]
## [1,] "C" "A" "D" "B"
## [2,] "C" "D" "B" "A"
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:12], k = 2, seed = 42)
##
## Parameters BIB
## ==============
## Lambda : 1
## treatmeans : 12
## Block size : 2
## Blocks : 66
## Replication: 11
##
## Efficiency factor 0.5454545
##
## <<< Book >>>
my_design_bibd_3$sketch
## [,1] [,2]
## [1,] "E" "F"
## [2,] "C" "I"
## [3,] "L" "B"
## [4,] "H" "A"
## [5,] "D" "F"
## [6,] "A" "J"
## [7,] "I" "B"
## [8,] "G" "B"
## [9,] "D" "B"
## [10,] "C" "E"
## [11,] "F" "J"
## [12,] "H" "D"
## [13,] "C" "F"
## [14,] "G" "C"
## [15,] "B" "K"
## [16,] "E" "K"
## [17,] "A" "F"
## [18,] "J" "H"
## [19,] "B" "J"
## [20,] "L" "K"
## [21,] "E" "I"
## [22,] "E" "B"
## [23,] "K" "J"
## [24,] "D" "A"
## [25,] "D" "I"
## [26,] "C" "L"
## [27,] "H" "E"
## [28,] "G" "I"
## [29,] "E" "L"
## [30,] "C" "D"
## [31,] "E" "J"
## [32,] "J" "I"
## [33,] "I" "K"
## [34,] "C" "H"
## [35,] "I" "F"
## [36,] "G" "A"
## [37,] "J" "C"
## [38,] "G" "J"
## [39,] "H" "B"
## [40,] "D" "G"
## [41,] "G" "L"
## [42,] "H" "F"
## [43,] "J" "D"
## [44,] "E" "G"
## [45,] "L" "F"
## [46,] "H" "K"
## [47,] "K" "D"
## [48,] "L" "J"
## [49,] "F" "B"
## [50,] "L" "A"
## [51,] "L" "H"
## [52,] "G" "H"
## [53,] "E" "A"
## [54,] "B" "C"
## [55,] "I" "L"
## [56,] "C" "A"
## [57,] "D" "L"
## [58,] "C" "K"
## [59,] "G" "F"
## [60,] "I" "A"
## [61,] "K" "F"
## [62,] "B" "A"
## [63,] "D" "E"
## [64,] "A" "K"
## [65,] "I" "H"
## [66,] "K" "G"
# Create my_design_bibd_3
my_design_bibd_3 <- design.bib(LETTERS[1:6], k = 6, seed = 42)
##
## Parameters BIB
## ==============
## Lambda : 2
## treatmeans : 6
## Block size : 6
## Blocks : 2
## Replication: 2
##
## Efficiency factor 1
##
## <<< Book >>>
my_design_bibd_3$sketch
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "E" "F" "A" "D" "B" "C"
## [2,] "A" "E" "C" "B" "D" "F"
#write a lambda function to calculate lambda for us
lambda <- function( t, k, r ) {
lambda = r*(k-1)/(t-1)
return( lambda )
}
# Calculate lambda
lambda( t = 4, k = 3, r = 3 )
## [1] 2
# Build the data.frame
creatinine <- c(1.98, 1.97, 2.35, 2.09, 1.87, 1.95, 2.08, 2.01, 1.84, 2.06, 1.97, 2.22)
food <- as.factor(c("A", "C", "D", "A", "B", "C", "B", "C", "D", "A", "B", "D"))
color <- as.factor(rep(c("Black", "White", "Orange", "Spotted"), each = 3))
cat_experiment <- as.data.frame(cbind(creatinine, food, color))
# Create cat_model and examine with summary()
cat_model <- aov( creatinine ~ food + color, cat_experiment )
summary( cat_model )
## Df Sum Sq Mean Sq F value Pr(>F)
## food 1 0.01204 0.012042 0.530 0.485
## color 1 0.00697 0.006971 0.307 0.593
## Residuals 9 0.20461 0.022735
# Calculate lambda
lambda( t = 2, k = 2, r =2 )
## [1] 2
# Create weightlift_model & examine results
weightlift_model <- aov( bmxarmc ~ riagendr + ridreth1, nhanes_final )
summary( weightlift_model )
## Df Sum Sq Mean Sq F value Pr(>F)
## riagendr 1 3779 3779 140.51 < 2e-16 ***
## ridreth1 1 1453 1453 54.02 2.27e-13 ***
## Residuals 5571 149832 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 313 observations deleted due to missingness
Latin Squares, Graeco-Latin Squares, & Factorial Experiments
Latin Squares
- Two blocking factors (instead of 1)
- All factors have the same number of levels
- Key Assumption: the treatment and two blocking factors do not interact
- Analyze like an RCBD
Take a look at the blocking diagram for a Latin Squares experiment:
# Sketch a 4x4 layout with design.rcbd() (note: design.lsd(), used later, enforces a true Latin square)
lsqr <- design.rcbd(LETTERS[1:4], r = 4, seed = 42)
lsqr$sketch
## [,1] [,2] [,3] [,4]
## [1,] "D" "A" "C" "B"
## [2,] "A" "D" "C" "B"
## [3,] "C" "D" "B" "A"
## [4,] "B" "D" "A" "C"
dataset for this chapter:
url <- 'https://assets.datacamp.com/production/repositories/1793/datasets/6eee2fcc47c8c8dbb2e9d4670cf2eabeda52b705/nyc_scores.csv'
nyc_scores <- read.csv( url )
glimpse( nyc_scores )
## Rows: 435
## Columns: 22
## $ School_ID <chr> "02M260", "06M211", "01M539", "02M294", "02…
## $ School_Name <chr> "Clinton School Writers and Artists", "Inwo…
## $ Borough <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ Building_Code <chr> "M933", "M052", "M022", "M445", "M445", "M4…
## $ Street_Address <chr> "425 West 33rd Street", "650 Academy Street…
## $ City <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ State <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
## $ Zip_Code <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number <chr> "212-695-9114", "718-935-3660 ", "212-677-…
## $ Start_Time <chr> "", "8:30 AM", "8:15 AM", "8:00 AM", "8:30 …
## $ End_Time <chr> "", "3:00 PM", "4:00 PM", "2:45 PM", "3:00 …
## $ Student_Enrollment <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math <int> NA, NA, 657, 395, 418, 613, 410, 634, 389, …
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…
Experiment: look at the effects of tutoring programs on SAT scores.
EDA:
# Mean, var, and median of Math score
nyc_scores %>%
group_by(Borough) %>%
summarize(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
var = var(Average_Score_SAT_Math, na.rm = TRUE),
median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 x 4
## Borough mean var median
## <chr> <dbl> <dbl> <dbl>
## 1 Bronx 404. 2726. 396.
## 2 Brooklyn 416. 3658. 395
## 3 Manhattan 456. 7026. 433
## 4 Queens 462. 5168. 448
## 5 Staten Island 486. 6911. 466.
# Mean, var, and median of Math score by Zip Code
nyc_scores %>%
group_by(Zip_Code) %>%
summarise(mean = mean(Average_Score_SAT_Math, na.rm = TRUE),
var = var(Average_Score_SAT_Math, na.rm = TRUE),
median = median(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 120 x 4
## Zip_Code mean var median
## <int> <dbl> <dbl> <dbl>
## 1 10001 NaN NA NA
## 2 10002 473. 11234. 430
## 3 10003 450. 1211. 446
## 4 10004 463 6005. 432
## 5 10006 430. 3120. 430.
## 6 10009 454 NA 454
## 7 10010 549. 1401. 534
## 8 10011 460. 8731. 422.
## 9 10013 478. 3120. 478.
## 10 10016 362. 312. 362.
## # … with 110 more rows
# Examine missingness with miss_var_summary()
nyc_scores %>% miss_var_summary( )
## # A tibble: 22 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 Average_Score_SAT_Math 60 13.8
## 2 Average_Score_SAT_Reading 60 13.8
## 3 Average_Score_SAT_Writing 60 13.8
## 4 Percent_Tested 49 11.3
## 5 Student_Enrollment 7 1.61
## 6 Percent_White 7 1.61
## 7 Percent_Black 7 1.61
## 8 Percent_Hispanic 7 1.61
## 9 Percent_Asian 7 1.61
## 10 School_ID 0 0
## # … with 12 more rows
# Examine missingness with md.pattern()
md.pattern(nyc_scores)
## School_ID School_Name Borough Building_Code Street_Address City State
## 375 1 1 1 1 1 1 1
## 11 1 1 1 1 1 1 1
## 42 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Zip_Code Latitude Longitude Phone_Number Start_Time End_Time
## 375 1 1 1 1 1 1
## 11 1 1 1 1 1 1
## 42 1 1 1 1 1 1
## 7 1 1 1 1 1 1
## 0 0 0 0 0 0
## Student_Enrollment Percent_White Percent_Black Percent_Hispanic
## 375 1 1 1 1
## 11 1 1 1 1
## 42 1 1 1 1
## 7 0 0 0 0
## 7 7 7 7
## Percent_Asian Percent_Tested Average_Score_SAT_Math
## 375 1 1 1
## 11 1 1 0
## 42 1 0 0
## 7 0 0 0
## 7 49 60
## Average_Score_SAT_Reading Average_Score_SAT_Writing
## 375 1 1 0
## 11 0 0 3
## 42 0 0 4
## 7 0 0 9
## 60 60 264
#glimpse( nyc_scores_f )
nyc_scores_f <- nyc_scores %>%
mutate_if( is.character, as.factor )
# Impute the Math score by Borough
nyc_scores_2 <- simputation::impute_median(nyc_scores_f, Average_Score_SAT_Math ~ Borough)
# Convert Math score to numeric
nyc_scores_2$Average_Score_SAT_Math <- as.numeric(nyc_scores_2$Average_Score_SAT_Math)
# Examine scores by Borough in both datasets, before and after imputation
nyc_scores %>%
group_by(Borough) %>%
dplyr::summarize(median = median(Average_Score_SAT_Math, na.rm = TRUE),
mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## Borough median mean
## <chr> <dbl> <dbl>
## 1 Bronx 396. 404.
## 2 Brooklyn 395 416.
## 3 Manhattan 433 456.
## 4 Queens 448 462.
## 5 Staten Island 466. 486.
nyc_scores_2 %>%
group_by(Borough) %>%
dplyr::summarize(median = median(Average_Score_SAT_Math, na.rm = TRUE),
mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## Borough median mean
## <fct> <dbl> <dbl>
## 1 Bronx 396. 403.
## 2 Brooklyn 395 414.
## 3 Manhattan 433 452.
## 4 Queens 448 460.
## 5 Staten Island 466. 486.
# Design a LS with 5 treatments A:E then look at the sketch
my_design_lsd <- design.lsd( LETTERS[1:5], seed=42)
my_design_lsd$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "E" "D" "A" "C" "B"
## [2,] "D" "C" "E" "B" "A"
## [3,] "A" "E" "B" "D" "C"
## [4,] "C" "B" "D" "A" "E"
## [5,] "B" "A" "C" "E" "D"
glimpse( nyc_scores_2 )
## Rows: 435
## Columns: 22
## $ School_ID <fct> 02M260, 06M211, 01M539, 02M294, 02M308, 02M…
## $ School_Name <fct> "Clinton School Writers and Artists", "Inwo…
## $ Borough <fct> Manhattan, Manhattan, Manhattan, Manhattan,…
## $ Building_Code <fct> M933, M052, M022, M445, M445, M445, M056, M…
## $ Street_Address <fct> 425 West 33rd Street, 650 Academy Street, 1…
## $ City <fct> Manhattan, Manhattan, Manhattan, Manhattan,…
## $ State <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY,…
## $ Zip_Code <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number <fct> 212-695-9114, 718-935-3660 , 212-677-5190,…
## $ Start_Time <fct> , 8:30 AM, 8:15 AM, 8:00 AM, 8:30 AM, 8:00 …
## $ End_Time <fct> , 3:00 PM, 4:00 PM, 2:45 PM, 3:00 PM, 3:35 …
## $ Student_Enrollment <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math <dbl> 433, 433, 657, 395, 418, 613, 410, 634, 389…
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…
# Build nyc_scores_ls_lm
nyc_scores_ls_lm <- lm(Average_Score_SAT_Math ~ Percent_White + Borough + Percent_Asian,
data = nyc_scores_2 )
# Tidy the results with broom
tidy( nyc_scores_ls_lm )
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 385. 3.77 102. 2.90e-299
## 2 Percent_White 217. 18.7 11.6 4.33e- 27
## 3 BoroughBrooklyn -7.32 5.28 -1.39 1.67e- 1
## 4 BoroughManhattan 14.6 5.62 2.61 9.48e- 3
## 5 BoroughQueens -15.2 6.58 -2.31 2.11e- 2
## 6 BoroughStaten Island -29.7 15.3 -1.94 5.32e- 2
## 7 Percent_Asian 288. 16.9 17.1 5.14e- 50
# Examine the results with anova
anova(nyc_scores_ls_lm )
## Analysis of Variance Table
##
## Response: Average_Score_SAT_Math
## Df Sum Sq Mean Sq F value Pr(>F)
## Percent_White 1 693879 693879 431.778 < 2.2e-16 ***
## Borough 4 139254 34814 21.663 2.863e-16 ***
## Percent_Asian 1 468093 468093 291.279 < 2.2e-16 ***
## Residuals 421 676558 1607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Graeco-Latin Squares
- Three blocking factors (instead of 1 or 2)
- All factors have the same number of levels
- Key Assumption: the treatment and blocking factors do not interact
- Analyze like an RCBD
# Create a boxplot of Math scores by Borough, with a title and x/y axis labels
ggplot(nyc_scores, aes( x = Borough, y = Average_Score_SAT_Math)) +
geom_boxplot() +
labs(title = "Average SAT Math Scores by Borough, NYC",
x = "Borough (NYC)",
y = "Average SAT Math Scores (2014-15)")
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
There are some differences in the distributions of math SAT scores across boroughs, but they are comparable, though some boroughs have more outliers than others.
# Create trt1 and trt2
trt1 <- LETTERS[1:5]
trt2 <- c(1:5)
# Create my_graeco_design
my_graeco_design <- design.graeco( trt1, trt2, seed = 42)
# Examine the parameters and sketch
my_graeco_design$parameters
## $design
## [1] "graeco"
##
## $trt1
## [1] "A" "B" "C" "D" "E"
##
## $trt2
## [1] 1 2 3 4 5
##
## $r
## [1] 5
##
## $serie
## [1] 2
##
## $seed
## [1] 42
##
## $kinds
## [1] "Super-Duper"
##
## [[8]]
## [1] TRUE
my_graeco_design$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "D 5" "A 1" "C 3" "B 4" "E 2"
## [2,] "A 3" "C 4" "B 2" "E 5" "D 1"
## [3,] "C 2" "B 5" "E 1" "D 3" "A 4"
## [4,] "B 1" "E 3" "D 4" "A 2" "C 5"
## [5,] "E 4" "D 2" "A 5" "C 1" "B 3"
#kinds == methods for randomization
#kinds <- c("Wichmann-Hill", "Marsaglia-Multicarry", "Super-Duper", "Mersenne-Twister", "Knuth-TAOCP", "user-supplied", "Knuth-TAOCP-2002", "default" )
my_graeco_design <- design.graeco( trt1, trt2, seed = 42, kinds = 'Wichmann-Hill')
my_graeco_design$sketch
## [,1] [,2] [,3] [,4] [,5]
## [1,] "D 1" "B 2" "A 4" "E 5" "C 3"
## [2,] "B 4" "A 5" "E 3" "C 1" "D 2"
## [3,] "A 3" "E 1" "C 2" "D 4" "B 5"
## [4,] "E 2" "C 4" "D 5" "B 3" "A 1"
## [5,] "C 5" "D 3" "B 1" "A 2" "E 4"
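The nyc_scores_gls data used in the next chunk comes from the DataCamp exercise and is not created in this document; a purely hypothetical sketch that randomly assigns made-up tutoring, teacher-education, and homework factors to a subset of schools (just so the model below can run) might be:
# hypothetical blocking/treatment factors, for illustration only
set.seed( 42 )
nyc_scores_gls <- nyc_scores_2 %>%
  sample_n( 50 ) %>%
  mutate( Tutoring_Program = sample( rep( LETTERS[1:5], 10 ) ),
          Teacher_Education_Level = sample( rep( c( "BA", "MA", "MS", "PhD", "EdD" ), 10 ) ),
          Homework_Type = sample( rep( c( "Essay", "Problems", "Project", "Reading", "Mixed" ), 10 ) ) )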
#say we have borough, teacher education level, and homework type to block by. this experiment can be designed with graeco-latin square blocking to test the effect of a tutoring program
# Build nyc_scores_gls_lm
nyc_scores_gls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level + Homework_Type,
data = nyc_scores_gls )
# Tidy the results with broom
tidy( nyc_scores_gls_lm )
# Examine the results with anova
anova( nyc_scores_gls_lm )
Factorial Experiments
Experimental design that considers interactions of experimental variables.
- 2 or more factor variables are combined and crossed
- all the possible interactions between levels of factors are considered as effects on the outcome
- Ex: high/low water and high/low sunlight’s effect on plant growth
- Use Tukey’s test to look for significant differences of means between groups
\(2^k\) factorial experiments
- \(2^k\) factorial experiments involve k factor variables with 2 levels
- it results in \(2^k\) number of combinations of effects to test
- analyze with a linear model and ANOVA
- also use TukeyHSD() to determine which pairwise combinations are significantly different (see the sketch after the factorial model below)
#glimpse( nyc_scores )
nyc_scores_ptest <- nyc_scores %>%
mutate( ptest_high_low = as.factor( case_when( (Percent_Tested > 0.6) ~ 2,
(Percent_Tested <= 0.6) ~ 1) ),
Percent_Black_HL = as.factor( case_when( (Percent_Black > 0.3) ~ 2,
(Percent_Black <= 0.3) ~ 1) ),
# note: as in the original notes, this is derived from Percent_Tested
# (probably intended to be Percent_White), which helps explain the null
# Percent_White_HL effect in the factorial ANOVA below
Percent_White_HL = as.factor( case_when( (Percent_Tested > 0.3) ~ 2,
(Percent_Tested <= 0.3) ~ 1) ) )
# Build the boxplot for percent tested (high/low) vs. Math SAT score
ggplot(nyc_scores_ptest,
aes(y = Average_Score_SAT_Math, x = ptest_high_low)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
# Build the boxplot for the percent black vs. Math SAT score
ggplot(nyc_scores_ptest,
aes(y=Average_Score_SAT_Math, Percent_Black_HL)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
# Build the boxplot for percent white (high/low) vs. Math SAT score
ggplot(nyc_scores_ptest,
aes(y=Average_Score_SAT_Math, Percent_White_HL)) +
geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
glimpse( nyc_scores_ptest )
## Rows: 435
## Columns: 25
## $ School_ID <chr> "02M260", "06M211", "01M539", "02M294", "02…
## $ School_Name <chr> "Clinton School Writers and Artists", "Inwo…
## $ Borough <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ Building_Code <chr> "M933", "M052", "M022", "M445", "M445", "M4…
## $ Street_Address <chr> "425 West 33rd Street", "650 Academy Street…
## $ City <chr> "Manhattan", "Manhattan", "Manhattan", "Man…
## $ State <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
## $ Zip_Code <int> 10001, 10002, 10002, 10002, 10002, 10002, 1…
## $ Latitude <dbl> 40.75321, 40.86605, 40.71873, 40.71687, 40.…
## $ Longitude <dbl> -73.99786, -73.92486, -73.97943, -73.98953,…
## $ Phone_Number <chr> "212-695-9114", "718-935-3660 ", "212-677-…
## $ Start_Time <chr> "", "8:30 AM", "8:15 AM", "8:00 AM", "8:30 …
## $ End_Time <chr> "", "3:00 PM", "4:00 PM", "2:45 PM", "3:00 …
## $ Student_Enrollment <int> NA, 87, 1735, 358, 383, 416, 255, 545, 329,…
## $ Percent_White <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0.04, 0.4…
## $ Percent_Black <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0.24, 0.1…
## $ Percent_Hispanic <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0.57, 0.1…
## $ Percent_Asian <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0.13, 0.1…
## $ Average_Score_SAT_Math <int> NA, NA, 657, 395, 418, 613, 410, 634, 389, …
## $ Average_Score_SAT_Reading <int> NA, NA, 601, 411, 428, 453, 406, 641, 395, …
## $ Average_Score_SAT_Writing <int> NA, NA, 601, 387, 415, 463, 381, 639, 381, …
## $ Percent_Tested <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60, 0.71,…
## $ ptest_high_low <fct> NA, NA, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2,…
## $ Percent_Black_HL <fct> NA, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, …
## $ Percent_White_HL <fct> NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
# Create nyc_scores_factorial and examine the results
nyc_scores_factorial <- aov( Average_Score_SAT_Math ~ ptest_high_low * Percent_Black_HL * Percent_White_HL, nyc_scores_ptest)
tidy( nyc_scores_factorial )
## # A tibble: 6 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ptest_high_low 1 410564. 410564. 117. 7.66e-24
## 2 Percent_Black_HL 1 153350. 153350. 43.6 1.38e-10
## 3 Percent_White_HL 1 29.1 29.1 0.00827 9.28e- 1
## 4 ptest_high_low:Percent_Black_HL 1 73269. 73269. 20.8 6.78e- 6
## 5 Percent_Black_HL:Percent_White_HL 1 2286. 2286. 0.651 4.20e- 1
## 6 Residuals 369 1296753. 3514. NA NA
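The \(2^k\) bullets above mention TukeyHSD() for pairwise comparisons; it was not run in the original notes, but a minimal sketch on the fitted factorial model (restricted to one factor to keep the output small) could be:
# pairwise comparison for the percent-tested factor only
TukeyHSD( nyc_scores_factorial, which = "ptest_high_low" )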
#shapiro.test() Performs the Shapiro-Wilk test of normality.
# Use shapiro.test() to test the outcome
shapiro.test( nyc_scores$Average_Score_SAT_Math)
##
## Shapiro-Wilk normality test
##
## data: nyc_scores$Average_Score_SAT_Math
## W = 0.84672, p-value < 2.2e-16
# Plot nyc_scores_factorial to examine residuals
par(mfrow = c(2,2))
plot( nyc_scores_factorial )
The model appears to be fairly well fit, though our evidence indicates the score may not be from a normally distributed population. Looking at the Q-Q plot, we can see that towards the higher end, the points are not on the line, so we may not be dealing with normality here. If we had more time, we might consider a transformation on the outcome to move towards normality.
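As a possible follow-up to that suggestion (my own sketch, not part of the original notes), one could refit the factorial model on a log-transformed outcome and re-check the diagnostics:
# refit with a log-transformed outcome and re-examine the residual plots
nyc_scores_factorial_log <- aov( log( Average_Score_SAT_Math ) ~ ptest_high_low * Percent_Black_HL * Percent_White_HL,
                                 data = nyc_scores_ptest )
par( mfrow = c( 2, 2 ) )
plot( nyc_scores_factorial_log )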