Ch. 1 - Inference for a single parameter

The General Social Survey

[Video]

Exploring consci

# From previous steps
library(dplyr)
gss2016 <- gss %>%
  filter(year == 2016)
library(ggplot2)
ggplot(gss2016, aes(x = consci)) +
  geom_bar()

# Compute proportion of high conf
p_hat <- gss2016 %>%
  summarize(prop_high = mean(consci == "High")) %>%
  pull()

Generating via bootstrap

# From previous step
library(infer)
boot1 <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 1, type = "bootstrap")

# Compute proportion with high conf
boot1 %>%
  summarize(prop_high = mean(consci == "High")) %>%
  pull()
## [1] 0.4266667

Constructing a CI

# From previous steps
boot_dist <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
ggplot(boot_dist, aes(x = stat)) +
  geom_density()

SE <- boot_dist %>%
  summarize(se = sd(stat)) %>%
  pull()

# Create CI
c(p_hat - 2 * SE, p_hat + 2 * SE)
## [1] 0.3535923 0.5130744
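
An alternative to the plus-or-minus-2-SE construction is the percentile method, which reads the middle 95% directly off the bootstrap distribution; infer exposes this through get_confidence_interval(). A minimal sketch using the boot_dist computed above:

# Percentile-based 95% CI: middle 95% of the bootstrap distribution
boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")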

Why more bootstraps?

If we had used 5000 instead of 500 bootstrap samples, which of the following would be true?

  • [*] The bootstrap distribution would appear smoother.
  • The standard error would decrease.
  • The standard error would increase.
  • p_hat would be 10 times larger.
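
You can check the correct answer empirically: with ten times the reps, the density plot smooths out, but the standard error stays in the same ballpark, since more resamples estimate the same spread more precisely rather than shrinking it. A quick sketch reusing the pipeline above (the exact value varies from run to run):

# With 10x the reps, the SE estimate stabilizes but does not shrink
boot_dist_5k <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "prop")
boot_dist_5k %>%
  summarize(se = sd(stat)) %>%
  pull()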

Interpreting a Confidence Interval

[Video]

CIs and confidence level

We learned that for a 95% confidence interval (a confidence level of 0.95), if we were to take many samples of the same size and compute many intervals, we would expect 95% of the resulting intervals to contain the parameter. Based on the set of confidence intervals plotted here, what is your best guess at the confidence level used in these intervals?

As in the video, the population proportion is represented by the p in the cloud and by the dotted line, and each confidence interval is represented by a segment that extends out from its p-hat. Intervals that capture the true value are in green; those that miss it are in red. A short simulation after the answer options shows how coverage rates like these arise.

  • 0.05
  • 0.35
  • [*] 0.65
  • 0.95
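
To see where a coverage rate like 0.65 comes from, simulate it: draw many samples from a population with a known proportion, build an interval from each, and count how often the intervals capture the truth. A minimal sketch in base R, assuming a true proportion of 0.5, samples of size 30, and deliberately narrow plus-or-minus-1-SE intervals (which correspond to roughly 68% confidence):

# Simulate coverage: how often does p-hat +/- 1 SE capture the truth?
true_p <- 0.5  # assumed population proportion
n_obs <- 30    # assumed sample size
covered <- replicate(1000, {
  p_hat <- mean(rbinom(n_obs, 1, true_p))
  se <- sqrt(p_hat * (1 - p_hat) / n_obs)
  (p_hat - se <= true_p) & (true_p <= p_hat + se)
})
mean(covered)  # about 0.68: +/- 1 SE is roughly a 68% confidence level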

SE with less data

# From previous steps
boot_dist_small <- gss2016_small %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
SE_small_n <- boot_dist_small %>%
  summarize(se = sd(stat)) %>%
  pull()
boot_dist_smaller <- gss2016_smaller %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")

# Compute and save estimate of second SE
SE_smaller_n <- boot_dist_smaller %>%
  summarize(se = sd(stat)) %>%
  pull()

# Compare the results for each dataset size
message("gss2016_small has ", nrow(gss2016_small), " rows and standard error ", SE_small_n)
## gss2016_small has 50 rows and standard error 0.0714496333104684
message("gss2016_smaller has ", nrow(gss2016_smaller), " rows and standard error ", SE_smaller_n)
## gss2016_smaller has 10 rows and standard error 0.163956101461375
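
Note the pattern: cutting the sample from 50 rows to 10 roughly doubles the standard error, consistent with the SE shrinking like 1 / sqrt(n). A quick check with the approximation formula, assuming a proportion of about 0.43 in both subsamples (the exact values depend on which rows were sampled):

# 1/sqrt(n) scaling: theoretical SEs at n = 50 and n = 10
p_assumed <- 0.43  # assumed proportion with high confidence
sqrt(p_assumed * (1 - p_assumed) / c(50, 10))
# approximately 0.070 and 0.157, in line with the bootstrap SEs above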

SE with different p

# From previous steps
ggplot(gss2016, aes(x = meta_region)) +
  geom_bar()

boot_dist <- gss2016 %>%
  specify(response = meta_region, success = "pacific") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")

# Calculate std error
SE_low_p <- boot_dist %>%
  summarize(se = sd(stat)) %>%
  pull()

# Compare SEs
c(SE_low_p, SE)
## [1] 0.02537620 0.03987052
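
The smaller SE for the pacific proportion is expected: the variability of a proportion, p * (1 - p), peaks at p = 0.5 and shrinks as p approaches 0 or 1. A quick check with the approximation formula, assuming about 150 rows and proportions of roughly 0.10 (pacific) and 0.43 (high confidence), values read off the earlier output:

# SE of a proportion peaks at p = 0.5 and shrinks toward the extremes
n_assumed <- 150
p_vals <- c(0.10, 0.43)  # approx. pacific and high-confidence proportions
sqrt(p_vals * (1 - p_vals) / n_assumed)
# approximately 0.024 and 0.040, matching the bootstrap SEs above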

The approximation shortcut

[Video]

CI via approximation

# From previous step
n <- nrow(gss2016)
p_hat <- gss2016 %>%
  summarize(prop_pacific = mean(meta_region == "pacific")) %>%
  pull()

# Check conditions
n * p_hat >= 10
## [1] TRUE
n * (1 - p_hat) >= 10
## [1] TRUE
# Calculate SE
SE_approx <- sqrt(p_hat * (1 - p_hat) / n)

# Form 95% CI
c(p_hat - 2 * SE_approx, p_hat + 2 * SE_approx)
## [1] 0.05101021 0.14898979
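
The multiplier 2 is a convenient stand-in for the exact 97.5th percentile of the standard normal, qnorm(0.975), which is about 1.96; swapping it in gives a slightly narrower interval:

# Same interval with the exact normal multiplier instead of 2
z_star <- qnorm(0.975)  # about 1.96
c(p_hat - z_star * SE_approx, p_hat + z_star * SE_approx)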

Methods compared

Now you have seen two ways to construct a confidence interval for a proportion. Which of the following about the two methods is true?

  • Once the rule-of-thumb condition is violated, the approximation method instantly becomes extremely inaccurate.
  • [*] At a very small sample size, both approaches will perform poorly. Inference on very small data sets is not recommended!
  • In traditional surveys and scientific research, the bootstrap approach is more commonly used than the approximation approach.
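
One way to convince yourself of the small-sample caveat is to run both methods on a tiny subsample and watch each struggle. A sketch, assuming a hypothetical 10-row sample gss2016_tiny (not an object from the exercises):

# Draw a hypothetical 10-row sample
gss2016_tiny <- gss2016 %>%
  slice_sample(n = 10)

# Bootstrap SE: computable, but itself very noisy at this size
gss2016_tiny %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(se = sd(stat)) %>%
  pull()

# Approximation: at n = 10, the success-failure conditions cannot both hold
p_hat_tiny <- mean(gss2016_tiny$consci == "High")
c(10 * p_hat_tiny >= 10, 10 * (1 - p_hat_tiny) >= 10)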

Ch. 2 - Proportions: testing and power

Hypothesis test for a proportion

[Video]

Life after death

Generating from H0

Testing a claim

Making a decision

Intervals for differences

Death penalty and sex

Hypothesis test on the difference in proportions

Interpreting the test

Hypothesis tests and confidence intervals

Statistical errors

When the null is true

When the null is true: decision


Ch. 3 - Comparing many parameters: independence

Contingency tables

Politics and Space

Understanding contingency tables

From tidy to table to tidy

Chi-squared test statistic

A single permuted Chi-sq

Building two null distributions

Is the data consistent with the model?

Alternate method: the chi-squared distribution

Checking conditions

The geography of happiness

A p-value two ways

Intervals for the chi-squared distribution


Ch. 4 - Comparing many parameters: goodness of fit

Case study: election fraud

Getting to know the Iran data

Who won?

Breaking it down by province

Extracting the first digit I

Goodness of fit

Goodness of fit test

A p-value, two ways

Is this evidence of fraud?

And now to the US

Getting to know the Iowa data

Extracting the first digit II

Testing Iowa

Fraud in Iowa?

Election fraud in Iran and Iowa: debrief


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470