[Video]
Steps of an Experiment
Example of timeline
Key Components of an Experiment
# Load the ToothGrowth dataset
data(ToothGrowth)
# Perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
## 16.83731 20.78936
## sample estimates:
## mean of x
## 18.81333
# Perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)
# Load broom
library(broom)
# Tidy ToothGrowth_ttest
tidy(ToothGrowth_ttest)
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.70 20.7 17.0 1.92 0.0606 55.3 -0.171 7.57
## # … with 2 more variables: method <chr>, alternative <chr>
# Load dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Count number of observations for each combination of supp and dose
ToothGrowth %>%
count(supp, dose)
## # A tibble: 6 x 3
## supp dose n
## <fct> <dbl> <int>
## 1 OJ 0.5 10
## 2 OJ 1 10
## 3 OJ 2 10
## 4 VC 0.5 10
## 5 VC 1 10
## 6 VC 2 10
# Create a boxplot with geom_boxplot()
ggplot(ToothGrowth, aes(x = dose, y = len)) +
geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
# Create ToothGrowth_aov
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth)
# Examine ToothGrowth_aov with summary()
summary(ToothGrowth_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## dose 1 2224.3 2224.3 123.99 6.31e-16 ***
## supp 1 205.3 205.3 11.45 0.0013 **
## Residuals 57 1022.6 17.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Video]
Breaking down hypothesis testing:
Power & Sample Size
Power & Sample Size Calculations
# library(pwr)
# pwr.anova.test(k = 3,
# n = 20,
# f = 0.2,
# sig.level = 0.05,
# power = NULL)
# Less than
t.test(x = ToothGrowth$len,
alternative = "less",
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
## -Inf 20.46358
## sample estimates:
## mean of x
## 18.81333
# Greater than
t.test(x = ToothGrowth$len,
alternative = "greater",
mu = 18)
##
## One Sample t-test
##
## data: ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
## 17.16309 Inf
## sample estimates:
## mean of x
## 18.81333
The pwr
package has been loaded for you. Use the console to look at the documentation for the pwr.t.test()
function. The list of arguments are specialized for a t-test and include the ability to specify the alternative hypothesis.
If you’d like, take some time to explore the different pwr
package functions and read about their inputs.
What does a call to any pwr.*()
function yield?
# Load the pwr package
library(pwr)
# Calculate power
pwr.t.test(n = 100,
d = 0.35,
sig.level = 0.10,
type = "two.sample",
alternative = "two.sided",
power = NULL)
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.35
## sig.level = 0.1
## power = 0.7943532
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Calculate sample size
pwr.t.test(n = NULL,
d = 0.25,
sig.level = 0.05,
type = "one.sample", alternative = "greater",
power = 0.8)
##
## One-sample t test power calculation
##
## n = 100.2877
## d = 0.25
## sig.level = 0.05
## power = 0.8
## alternative = greater
[Video]
ANOVA
# #one
# model_1 <- lm(y ~ x, data = dataset)
# anova(model_1)
#
# #two
# aov(y ~ x, data = dataset)
Single Factor Experiments
# model_1 <- lm(y ~ x)
Multiple Factor Experiments
# model2 <- lm(y ~ x + r + s + t)
# Examine the variables with glimpse()
glimpse(lendingclub)
## Observations: 1,500
## Variables: 12
## $ member_id <int> 55096114, 1555332, 1009151, 69524202, 72128084, 5…
## $ loan_amnt <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ funded_amnt <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 50…
## $ term <fct> 36 months, 36 months, 60 months, 36 months, 36 mo…
## $ int_rate <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.…
## $ emp_length <fct> 10+ years, 10+ years, 3 years, 10+ years, 10+ yea…
## $ home_ownership <fct> RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MOR…
## $ annual_inc <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, …
## $ verification_status <fct> Not Verified, Verified, Not Verified, Not Verifie…
## $ loan_status <fct> Current, Fully Paid, Fully Paid, Current, Current…
## $ purpose <fct> debt_consolidation, debt_consolidation, home_impr…
## $ grade <fct> C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D…
# Find median loan_amnt, mean int_rate, and mean annual_inc with summarize()
lendingclub %>% summarize(median(loan_amnt), mean(int_rate), mean(annual_inc))
## median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1 13000 13.31472 75736.03
# Use ggplot2 to build a bar chart of purpose
ggplot(data=lendingclub, aes(x = purpose)) +
geom_bar() +
coord_flip()
# Use recode() to create the new purpose_recode variable
lendingclub$purpose_recode <- lendingclub$purpose %>% recode(
"credit_card" = "debt_related",
"debt_consolidation" = "debt_related",
"medical" = "debt_related",
"car" = "big_purchase",
"major_purchase" = "big_purchase",
"vacation" = "big_purchase",
"moving" = "life_change",
"small_business" = "life_change",
"wedding" = "life_change",
"house" = "home_related",
"home_improvement" = "home_related")
# Build a linear regression model, purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)
# Examine results of purpose_recode_model
summary(purpose_recode_model)
##
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14472 -6251 -1322 4678 25761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9888.1 1248.9 7.917 4.69e-15 ***
## purpose_recodedebt_related 5433.5 1270.5 4.277 2.02e-05 ***
## purpose_recodehome_related 4845.0 1501.0 3.228 0.00127 **
## purpose_recodelife_change 4095.3 2197.2 1.864 0.06254 .
## purpose_recodeother -649.3 1598.3 -0.406 0.68461
## purpose_recoderenewable_energy -1796.4 4943.3 -0.363 0.71636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared: 0.03473, Adjusted R-squared: 0.0315
## F-statistic: 10.75 on 5 and 1494 DF, p-value: 3.598e-10
# Get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova(purpose_recode_model)
# Print purpose_recode_anova
purpose_recode_anova
## Analysis of Variance Table
##
## Response: funded_amnt
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.6888e+09 737756668 10.75 3.598e-10 ***
## Residuals 1494 1.0253e+11 68629950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Examine class of purpose_recode_anova
class(purpose_recode_anova)
## [1] "anova" "data.frame"
# Use aov() to build purpose_aov
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)
# Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD(purpose_aov, conf.level = 0.95)
# Tidy tukey_output to make sense of the results
tidy(tukey_output)
## # A tibble: 15 x 6
## term comparison estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 purpose_rec… debt_related-big_purcha… 5434. 1808. 9059. 2.91e-4
## 2 purpose_rec… home_related-big_purcha… 4845. 562. 9128. 1.61e-2
## 3 purpose_rec… life_change-big_purchase 4095. -2174. 10365. 4.25e-1
## 4 purpose_rec… other-big_purchase -649. -5210. 3911. 9.99e-1
## 5 purpose_rec… renewable_energy-big_pu… -1796. -15902. 12309. 9.99e-1
## 6 purpose_rec… home_related-debt_relat… -589. -3056. 1879. 9.84e-1
## 7 purpose_rec… life_change-debt_related -1338. -6539. 3863. 9.78e-1
## 8 purpose_rec… other-debt_related -6083. -9005. -3160. 5.32e-8
## 9 purpose_rec… renewable_energy-debt_r… -7230. -20894. 6434. 6.58e-1
## 10 purpose_rec… life_change-home_related -750. -6429. 4929. 9.99e-1
## 11 purpose_rec… other-home_related -5494. -9201. -1787. 3.58e-4
## 12 purpose_rec… renewable_energy-home_r… -6641. -20494. 7212. 7.46e-1
## 13 purpose_rec… other-life_change -4745. -10636. 1147. 1.95e-1
## 14 purpose_rec… renewable_energy-life_c… -5892. -20482. 8698. 8.59e-1
## 15 purpose_rec… renewable_energy-other -1147. -15088. 12794. 1.00e+0
# Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov(funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
# Print purpose_emp_aov to the console
purpose_emp_aov
## Call:
## aov(formula = funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
##
## Terms:
## purpose_recode emp_length Residuals
## Sum of Squares 3688783338 2044273211 100488872355
## Deg. of Freedom 5 11 1483
##
## Residual standard error: 8231.679
## Estimated effects may be unbalanced
# Call summary() to see the p-values
summary(purpose_emp_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose_recode 5 3.689e+09 737756668 10.888 2.63e-10 ***
## emp_length 11 2.044e+09 185843019 2.743 0.00161 **
## Residuals 1483 1.005e+11 67760534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Video]
Pre-modeling EDA
lendingclub %>%
summarise(median(loan_amnt),
mean(int_rate),
mean(annual_inc))
## median(loan_amnt) mean(int_rate) mean(annual_inc)
## 1 13000 13.31472 75736.03
lendingclub %>%
group_by(verification_status) %>%
summarise(mean(funded_amnt),
var(funded_amnt))
## # A tibble: 3 x 3
## verification_status `mean(funded_amnt)` `var(funded_amnt)`
## <fct> <dbl> <dbl>
## 1 Not Verified 11018. 33994991.
## 2 Source Verified 15155. 70477801.
## 3 Verified 17562. 82211449.
ggplot(data = lendingclub,
aes(x = verification_status, y = funded_amnt)) +
geom_boxplot()
Post-modeling model validation
# Examine the summary of int_rate
summary(lendingclub$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.32 9.99 12.99 13.31 16.29 26.77
# Examine int_rate by grade
lendingclub %>%
group_by(grade) %>%
summarize(mean = mean(int_rate), var = var(int_rate), median = median(int_rate))
## # A tibble: 7 x 4
## grade mean var median
## <fct> <dbl> <dbl> <dbl>
## 1 A 7.27 0.961 7.26
## 2 B 10.9 2.08 11.0
## 3 C 14.0 1.42 14.0
## 4 D 17.4 1.62 17.6
## 5 E 20.1 2.71 20.0
## 6 F 23.6 2.87 23.5
## 7 G 26.1 0.198 25.9
# Make a boxplot of int_rate by grade
ggplot(lendingclub, aes(x = grade, y = int_rate)) +
geom_boxplot()
# Use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate ~ grade, data = lendingclub)
summary(grade_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## grade 6 27013 4502 2637 <2e-16 ***
## Residuals 1493 2549 2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# For a 2x2 grid of plots:
par(mfrow = c(2, 2))
# Plot grade_aov
plot(grade_aov)
# Bartlett's test for homogeneity of variance
bartlett.test(int_rate ~ grade, data = lendingclub)
##
## Bartlett test of homogeneity of variances
##
## data: int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15
# Conduct the Kruskal-Wallis rank sum test
kruskal.test(int_rate ~ grade,
data = lendingclub)
##
## Kruskal-Wallis rank sum test
##
## data: int_rate by grade
## Kruskal-Wallis chi-squared = 1365.5, df = 6, p-value < 2.2e-16
[Video]
We’ll be testing the mean loan_amnt, which is the requested amount of money the loan applicants ask for, based on which color header (green or blue) that they saw on the Lending Club website.
Which statistical test should we use after we’ve collected enough data?
# Load the pwr package
library(pwr)
# Use the correct function from pwr to find the sample size
pwr.t.test(n = NULL,
d = 0.2,
sig.level = 0.05,
power = 0.8,
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 393.4057
## d = 0.2
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Plot the A/B test results
ggplot(lendingclub_ab, aes(x = Group, y = loan_amnt)) +
geom_boxplot()
# Conduct a two-sided t-test
t.test(loan_amnt ~ Group, data = lendingclub_ab)
##
## Welch Two Sample t-test
##
## data: loan_amnt by Group
## t = -0.58112, df = 997.06, p-value = 0.5613
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1377.1748 747.8748
## sample estimates:
## mean in group A mean in group B
## 14723.15 15037.80
# Build lendingclub_multi
lendingclub_multi <- lm(loan_amnt ~ Group + grade + verification_status, data = lendingclub_ab)
# Examine lendingclub_multi results
tidy(lendingclub_multi)
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11244. 792. 14.2 8.44e-42
## 2 GroupB 205. 515. 0.398 6.91e- 1
## 3 gradeB -975. 817. -1.19 2.33e- 1
## 4 gradeC -631. 806. -0.783 4.34e- 1
## 5 gradeD 718. 917. 0.783 4.34e- 1
## 6 gradeE 1477. 1208. 1.22 2.22e- 1
## 7 gradeF 5453. 1942. 2.81 5.09e- 3
## 8 gradeG 3490. 3396. 1.03 3.04e- 1
## 9 verification_statusSource Verified 4528. 637. 7.10 2.30e-12
## 10 verification_statusVerified 5900. 668. 8.84 4.41e-18
[Video]
Intro to Sampling
Sampling
Simple Random Sampling (SRS)
# sample()
Stratified Sampling
# dataset %>%
# group_by(strata_variable) %>%
# sample_n()
Cluster Sampling
# cluster(dataset,
# cluster_var_name,
# number_to_select,
# method = "option")
Systematic Sampling
Multi-stage Sampling
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470