Unit 1: Introduction to Data

Data Basics

Variables

  • All variables are numerical or categorical
      • numerical variables take on numerical values
      • categorical variables take on a limited number of distinct categories
  • Numerical variables can be
      • continuous
      • discrete
  • Categorical variables can be
      • ordinal (some kind of natural order)
      • “regular” (no natural order)

Relationships between variables

  • When two variables show some connection with each other, they are associated or dependent
  • The association can be further described as positive or negative
  • If they are not associated, the two variables are independent

Observational Studies and Experiments

Observational study

  • Collect data in a way that does not directly interfere with how the data arise–“observe”
  • Can only establish association between explanatory and response variables
  • If the data are from the past, it is a retrospective study; if the data are collected going forward, it is a prospective study

Experiment

  • Randomly assign subjects to various treatments
  • Can establish causal relationships between explanatory and response variables

Confounding Variable

An extraneous variable that affects both the explanatory and response variables.

Correlation does not imply causation.

Sampling and Sources of Bias

A census gathers data on every case in a population. It can be expensive or even impossible to collect data on an entire population.

Sources of Sampling Bias

  • Convenience sample: sample “low-hanging fruit”
  • Non-response: Significant percentage of subjects do not respond
  • Voluntary response: Responses come only from self-motivated subjects

Sampling methods

  • Simple random
  • Stratified: divide population into homogeneous strata and sample randomly from each (example: sports teams)
  • Cluster: divide population into clusters and randomly sample from a few of them (example: villages)

Experimental Design

Principles of Experimental Design

  • control: compare treatment of interest to control group
  • randomize: randomly assign subjects to treatments
  • replicate: within a study, collect sufficiently large sample, or repeat entire study
  • block: if there are variables known or suspected to affect the response variable, group subjects into blocks based on those variables, then randomize cases within each block

Blocking vs. explanatory variables

  • Explanatory variables (sometimes called factors) are conditions we can impose on experimental units
  • Block variables are characteristics that the experimental units come with, that we would like to control for
  • Blocking is like stratifying, except that it is used in experimental settings when randomly assigning subjects to treatments, whereas stratifying is used when sampling

More terminology

  • placebo
  • placebo effect
  • blinding
  • double-blind

Random Sampling and Random Assignment

                     Random assignment                 No random assignment
Random sampling      causal and generalizable          not causal, but generalizable      (generalizability)
No random sampling   causal, but not generalizable     neither causal nor generalizable   (no generalizability)
                     (causation)                       (association)

Visualizing Numerical Data

Intro to tables, scatterplots

Scatterplot

  • Response variable on the y axis, explanatory variable on the x axis
  • A trend from lower left to upper right indicates positive association; upper left to lower right indicates negative association
  • A straight-line trend indicates a linear relationship
  • Strength of association is based on how much the data are “scattered” around the trend
  • Outliers are individual points that fall farther from the trend than most other points (see the sketch below)
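
A minimal sketch in R, with made-up data, of a positive linear association:

# Made-up data with a positive, roughly linear association
set.seed(1)
x = rnorm(50)          # explanatory variable
y = 2 * x + rnorm(50)  # response variable
plot(x, y, xlab = "explanatory", ylab = "response")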

Histogram

  • Provides a view of data density
  • Especially useful for describing shape of distribution

Skew

  • Left skew means a histogram with a long left tail; right skew means a long right tail
  • A symmetric distribution has tails of similar length on both sides

Modality

  • Unimodal
  • Bimodal
  • Uniform
  • Multimodal (more than two modes)

Intros to dot plots, box plots, heat maps

Measures of Center

  • mean
  • median
  • mode

Sample statistics are point estimates of population parameters.

In a left-skewed distribution, the mean is less than the median; in a right-skewed distribution, the mean is greater than the median (see the sketch below).
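
A minimal sketch with simulated right-skewed data (the exponential draw is an assumption for illustration; any strongly right-skewed sample behaves similarly):

set.seed(2)
x = rexp(1000)  # right-skewed data with a long right tail
mean(x)         # pulled toward the long tail, so larger...
median(x)       # ...than the median
hist(x)         # shows the right skew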

Measures of Spread

  • variance
  • standard deviation

Variability vs. diversity

  • A very diverse group might be five cars, each with a different color.
  • A very variable group might be five cars, two with mileage of 50 mpg and three with 10 mpg (see the check below)
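
A quick check of the car-mileage example, using the made-up mpg values from the bullet above:

mpg = c(50, 50, 10, 10, 10)  # two cars at 50 mpg, three at 10 mpg
var(mpg)                     # sample variance: 480
sd(mpg)                      # standard deviation: about 21.9 mpg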

Interquartile range

Range of the middle 50% of the data

Robust Statistics

Robust statistics are those on which extreme observations have little effect (a small sketch follows the two lists):

  • median
  • IQR

Non-robust:

  • mean
  • SD
  • range
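
A minimal sketch, with made-up numbers, of how a single extreme observation moves the non-robust statistics while barely touching the robust ones:

x = c(2, 4, 6, 8, 10)
x_out = c(2, 4, 6, 8, 100)             # same data with one extreme observation
c(median(x), median(x_out))            # robust: 6 in both cases
c(IQR(x), IQR(x_out))                  # robust: 4 in both cases
c(mean(x), mean(x_out))                # non-robust: jumps from 6 to 24
c(sd(x), sd(x_out))                    # non-robust: jumps from about 3.2 to about 42.5
c(diff(range(x)), diff(range(x_out)))  # non-robust: range widens from 8 to 98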

Transforming Data

A transformation is a rescaling of the data using a function. When data are very strongly skewed, we sometimes transform them so they are easier to model (see the sketch after the goals list below).

  • log transformation
  • square root
  • inverse

Goals of transformations:

  • See data structure differently
  • Reduce skew to assist in modeling
  • Straighten a nonlinear relationship in a scatter plot
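
A minimal sketch using simulated log-normal data (an assumption for illustration), where the log transformation removes most of the skew:

set.seed(4)
x = rlnorm(1000)  # strongly right-skewed data
hist(x)           # long right tail
hist(log(x))      # roughly symmetric after the log transformation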

Exploring Categorical Variables

Review of frequency tables and bar plots. Bar plots are different from histograms in that:

  • bar plots are for categorical variables, histograms for numerical variables
  • the x-axis on a histogram is a number line, so the ordering of its bars is fixed, whereas the ordering of bars in a bar plot is interchangeable (see the sketch below)
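
A minimal sketch, with made-up data, contrasting the two:

grades = c("A", "B", "B", "C", "A", "B")    # categorical variable
barplot(table(grades))                      # bar plot of counts; bar order is arbitrary
scores = c(61, 74, 78, 82, 85, 88, 90, 95)  # numerical variable
hist(scores)                                # histogram; the x-axis is a number line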

Pie Charts Bad!

Review of contingency tables: a contingency table has totals in the right and bottom margins for the levels of each categorical variable.

Also relative frequencies, segmented bar plot, relative frequency segmented bar plot, mosaic plot, side-by-side box plots

Introduction to Inference

Introduction to simulations and hypothesis tests; too detailed to cover in full here. Summary (a small simulation sketch follows the list):

  • set up null and alternative hypotheses
  • simulate the experiment assuming the null hypothesis is true
  • evaluate the probability of observing an outcome at least as extreme as the one observed in the original data (the p-value)
  • if this probability is low, reject the null hypothesis and favor the alternative
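
A minimal sketch of the simulation approach, using a made-up example: 16 heads observed in 20 coin flips, with the null hypothesis that the coin is fair:

set.seed(5)
n_flips = 20
observed = 16  # heads seen in the (made-up) original data
# simulate the experiment many times, assuming the null hypothesis (a fair coin)
sims = replicate(10000, sum(sample(c("H", "T"), n_flips, replace = TRUE) == "H"))
# p-value: probability of an outcome at least as extreme as the one observed
p_value = mean(sims >= observed)
p_value  # small (roughly 0.006), so we reject the null hypothesis of a fair coin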

Unit 2: Probability and Distributions

Disjoint Events and General Addition Rule

Disjoint (mutually exclusive)

Disjoint events cannot happen at the same time:

  • a coin toss can’t be both heads and tails
  • a student can’t both pass and fail a class
  • a single card drawn from a deck can’t be both an ace and a queen
  • P(A and B) = 0

Non-disjoint events can happen at the same time:

  • a student can get an A in Stats and in Econ in the same semester
  • a single card can be both an ace and a heart
  • P(A and B) \(\ne\) 0

Union of Disjoint Events

P(A or B) = P(A) + P(B)

Example: probability of drawing a jack or a 3 from a deck of cards is (4/52) + (4/52)

Union of Non-Disjoint Events

P(A or B) = P(A) + P(B) - P(A and B)

Example: the probability of drawing a card that is an ace or a heart is (4/52) + (13/52) - (1/52)

General Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)

When A and B are disjoint, then P(A and B) is zero, so the formula still holds.
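
A quick numeric check of the two card examples above:

4/52 + 4/52          # jack or 3 (disjoint): about 0.154
4/52 + 13/52 - 1/52  # ace or heart (non-disjoint): about 0.308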

A sample space is the collection of all possible outcomes of a trial. For example, for two flips of a coin, S = {HH, HT, TH, TT}

A probability distribution lists all possible outcomes in the sample space and the probabilities with which they occur. Rules:

  1. Events listed must be disjoint
  2. Each probability must be between 0 and 1
  3. Probabilities must total 1

Complementary events are two mutually exclusive events whose probabilities add up to 1, for example heads and tails in a coin flip.

Disjoint vs. complementary: does the sum of the probabilities of two disjoint outcomes always add up to 1? Not necessarily; for example, rolling a 1 and rolling a 2 on a die are disjoint outcomes, but you could also roll a 3, 4, 5, or 6. The sum of the probabilities of two complementary outcomes, however, always adds up to 1.

Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. For example, knowing that the first coin toss landed heads says nothing about the second toss.

If P(A|B)=P(A), then A and B are independent.

Product Rule for Independent Events

If A and B are independent, then P(A and B) = P(A) \(\times\) P(B).
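
A minimal sketch checking the product rule by simulation, using two independent fair coin tosses (a made-up example):

set.seed(6)
n = 100000
toss1 = sample(c("H", "T"), n, replace = TRUE)
toss2 = sample(c("H", "T"), n, replace = TRUE)
mean(toss1 == "H" & toss2 == "H")        # estimate of P(A and B), close to 0.25
mean(toss1 == "H") * mean(toss2 == "H")  # P(A) * P(B), also close to 0.25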

Probability Examples

Disjoint vs. Independent

Conditional Probability

Probability Trees

Bayesian Inference

Examples of Bayesian Inference

Normal Distribution

Evaluating the Normal Distribution

Working with the Normal Distribution

Binomial Distribution

Normal Approximation to Binomial

Working with the Binomial Distribution

Unit 3: Foundations for Inference

Sampling Variability and CLT

CLT (for the mean) examples

Confidence Interval (for the mean)

Accuracy vs. Precision

Required Sample Size for Margin of Error

CI (for the mean) examples

Another Introduction to Inference

Hypothesis Testing (for a mean)

HT (for the mean) examples

Inference for Other Estimators

Decision Errors

Significance vs. Confidence Level

Statistical vs. Practical Significance

Unit 4: Inference for Numerical Variables

Hypothesis Testing for Paired Data

To conduct such a test, compute the difference for each pair of observations, then treat those differences as a single sample: test their mean against the null value, using the standard error of the differences.

\(H_0: \mu_{diff}=0\)
\(H_A: \mu_{diff}\ne0\)

# Totally made-up example
data_1 = c(10, 12, 11, 14, 17, 18)
data_2 = c(13, 10, 18, 21, 16, 30)
n = length(data_1)
diff.mean = mean(data_1 - data_2)
sd.pop = 4.2 # This is a given
se.diff = sd.pop/sqrt(n)
nullval = 0
Z = abs(diff.mean - nullval)/se.diff
Z
## [1] 2.527
p_value = pnorm(Z, lower.tail=FALSE) * 2  # two-sided test
p_value
## [1] 0.0115

Confidence Intervals for Paired Data

Again, we use the mean of the differences and its standard error.

# continuing with example above
conf_level = 0.95
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(diff.mean - z_star * se.diff, diff.mean + z_star * se.diff)
ci
## [1] -7.6940 -0.9727

We are 95% confident that, on average, the values in the first set are 0.97 to 7.69 less than the values in the second set.

Comparing Independent Means

The data are not paired. Instead, we are given two sample means, and we have the standard deviations for both groups.

The new element here is the calculation for the SE for the difference in means:

\(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\)

Conditions

  1. Independence:
  • within groups: sampled observations must be independent
      • random sampling/assignment
      • if sampling without replacement, the sample is less than 10% of the population
  • between groups: the two groups must be independent of each other (not paired)
  2. Sample size/skew: each sample must have at least 30 observations, larger if the population is very skewed

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
x_bar_1 = 41.8
n_1 = 505
s_1 = 15.14
x_bar_2 = 39.4
n_2 = 667
s_2 = 15.12

x_bar_diff = x_bar_1 - x_bar_2

se_diff = sqrt((s_1^2/n_1) + (s_2^2/n_2))
se_diff
## [1] 0.8926
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] 0.6506 4.1494

Hypothesis test

\(H_0: \mu_1 - \mu_2=0\)
\(H_A: \mu_1 - \mu_2\ne0\)

nullval = 0
Z = abs(x_bar_diff - nullval)/se_diff
Z
## [1] 2.689
p_value = (1 - pnorm(Z)) * 2  # two-sided test
p_value
## [1] 0.007168

Bootstrapping

If the sample statistic is a median, the CLT does not apply. So we use bootstrapping, which takes many samples (with replacement) from the original sample and builds a distribution of the statistic from those resamples.

Confidence interval: percentile method

set.seed(3)
conf_level = 0.9
rents <- c(775, 625, 733, 929, 895, 749, 1020, 1349, 599, 1143, 1209, 1495, 
    879, 975, 1076, 1282, 665, 705, 799, 500)
num_samples = 500
rent_medians <- rep(0, num_samples)
for (i in 1:num_samples) {
    samp <- sample(rents, length(rents), replace = TRUE)
    rent_medians[i] <- median(samp)
}
num_in_interval = conf_level * num_samples
num_in_tail = (num_samples - num_in_interval)/2
medians_sorted = rent_medians[order(rent_medians)]
left_tail_val = medians_sorted[num_in_tail]
right_tail_val = medians_sorted[num_samples - num_in_tail + 1]
c(left_tail_val, right_tail_val)
## [1]  749 1020

The values above are the cutoff points of the confidence interval. But note that, if the sample is biased, so are these results.

Confidence interval: standard error method

First determine the mean of the bootstrap distribution, \(\bar{x}_{boot}\), and calculate the confidence interval as that mean plus or minus the critical value \(z^\star\) times the bootstrap standard error, \(SE_{boot}\) (1.96 for a 95% interval; 1.645 for the 90% interval below). The bootstrap standard error is the standard deviation of the observations in the bootstrap distribution.

se_boot = sd(rent_medians)  # std. error is just the standard deviation because we use the entire bootstrap population
rent_mean = mean(rent_medians)
conf_level = 0.9
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(rent_mean - z_star * se_boot, rent_mean + z_star * se_boot)
ci
## [1]  742.4 1014.1

t Distribution

When n is small and σ is unknown, use the t distribution to address the uncertainty of the standard error estimate. The t distribution is unimodal and symmetric, but its tails are somewhat thicker than the normal distribution’s. The exact shape depends on the degrees of freedom; as the degrees of freedom increase, the t distribution looks more and more like the normal distribution.
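
A quick look at how the t distribution’s critical values approach the normal critical value as the degrees of freedom grow:

qt(0.975, df = c(5, 10, 30, 100))  # roughly 2.57, 2.23, 2.04, 1.98
qnorm(0.975)                       # 1.96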

Inference for a Small Sample Mean

The data set up here will be used again later for comparing two small-sample means, but here we use only one of the groups.

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
x_bar_solitaire = 52.1  # biscuit intake
s_solitaire = 45.1
n_solitaire = 22
se_solitaire = s_solitaire/sqrt(n_solitaire)

x_bar_no_dist = 27.1  # biscuit intake
s_no_dist = 26.4
n_no_dist = 22
se_no_dist = s_no_dist/sqrt(n_no_dist)

df = n_solitaire - 1
t_star = qt(conf_level + (alpha/2), df = df)
t_star
## [1] 2.08
ci_solitaire = c(x_bar_solitaire - t_star * se_solitaire, x_bar_solitaire + 
    t_star * se_solitaire)
ci_solitaire
## [1] 32.1 72.1
ci_no_dist = c(x_bar_no_dist - t_star * se_no_dist, x_bar_no_dist + t_star * 
    se_no_dist)
ci_no_dist
## [1] 15.39 38.81

Hypothesis test

\(H_0: \mu=30\)
\(H_A: \mu\ne30\)

# continues with data defined in previous section
nullval_solitaire = 30
T = (x_bar_solitaire - nullval_solitaire)/se_solitaire
T
## [1] 2.298
p_value = (1 - pt(T, df = n_solitaire - 1)) * 2  # two-sided test; df = n - 1
p_value
## [1] 0.0319

Inference for Comparing Two Small Sample Means

The new point here is that df is the lesser of \(n_1-1\) and \(n_2-1\).

Hypothesis test

x_bar_diff = x_bar_solitaire - x_bar_no_dist
nullval = 0
se_diff = sqrt(s_solitaire^2/n_solitaire + s_no_dist^2/n_no_dist)
se_diff
## [1] 11.14
df = min(n_solitaire - 1, n_no_dist - 1)
T = abs(x_bar_diff - nullval)/se_diff
T
## [1] 2.244
p_value = (1 - pt(T, df = df)) * 2  # two-sided test
p_value
## [1] 0.03575

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(x_bar_diff - t_star * se_diff, x_bar_diff + t_star * se_diff)
ci
## [1]  1.83 48.17

Comparing More than Two Means

This is for when we need to compare three or more means. The new topic is the F statistic.

\(F = \frac{\text{variability between groups}}{\text{variability within groups}}\)

ANOVA

Variability partitioning

\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\)
\(H_A:\) The mean outcome is different for at least two of the categories.

ANOVA output

                      Df   Sum Sq   Mean Sq   F value   Pr(>F)
Group   class          3   236.56    78.855    21.735   <0.0001
Error   Residuals    791  2869.80     3.628
Total                794  3106.36

Sum of squares total: \(SST = \sum\limits^n_{i=1} (y_i - \bar{y})^2\) where \(y_i\) is the value of the response variable for each observation and \(\bar{y}\) is the grand mean of the response variable. This is 3106.36 in the table above.

sst <- function(y) {
    # total sum of squares: squared deviations of each observation from the grand mean
    sum((y - mean(y))^2)
}

Sum of squares groups: \(SSG = \sum\limits^k_{j=1} n_j (\bar{y}_j - \bar{y})^2\) where \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \(\bar{y}_j\) is the mean of the response variable for group \(j\), and \(\bar{y}\) is the grand mean of the response variable. This is 236.56 in table above.

ssg <- function(y, j) {
    # between-group sum of squares: size-weighted squared deviations of each
    # group mean from the grand mean
    total = 0
    mean.grand = mean(y)  # grand mean of the response variable
    for (grp in unique(j)) {
        grp_y = y[j == grp]  # response values for the current group
        total = total + length(grp_y) * (mean(grp_y) - mean.grand)^2
    }
    return(total)
}

Sum of squares error (SSE): \(SSE = SST - SSG\). This is 2869.8 in the table above.
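
A small made-up example tying the three together with the functions above (the response values y and group labels j are hypothetical):

y = c(2, 4, 6, 8, 9, 11)             # made-up response values
j = c("a", "a", "a", "b", "b", "b")  # made-up group labels
sst(y)              # total sum of squares, about 55.33
ssg(y, j)           # between-group sum of squares, about 42.67
sst(y) - ssg(y, j)  # SSE: the remaining within-group variability, about 12.67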

Degrees of freedom:

  • \(df_T = n - 1\), e.g., 795-1=794
  • \(df_G = k - 1\), e.g., 4-1=3
  • \(df_E = df_T - df_G\), e.g., 794-3=791

Mean squares: for the groups, \(MSG = SSG/df_G\) (this is 78.855 in the table); for the error, \(MSE = SSE/df_E\) (this is 3.628 in the table).

The F statistic is \(F=\frac{MSG}{MSE}\) (this is 21.735 in the table).
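
A quick check of these numbers against the table:

MSG = 236.56/3     # SSG / df_G, about 78.85
MSE = 2869.80/791  # SSE / df_E, about 3.628
MSG/MSE            # F, about 21.73 (matches the table up to rounding)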

The p-value can now be computed using the F statistic and the two degrees of freedom values.

F = 21.735
df1 = 4 - 1
df2 = 795 - df1 - 1
p_value = pf(21.735, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
## [1] 1.56e-13

This value tells us only that at least two of the group means differ; it does not tell us which ones.

Here is an example of using the inference function to get ANOVA output:

source("http://bit.ly/dasi_inference")
# This will do an ANOVA because the explanatory variable is categorical and
# has more than 2 levels. The alternative is 'greater' because an F test is
# always one-sided.
inference(y = gss$wordsum, x = gss$class, type = "ht", est = "mean", method = "theoretical", 
    alternative = "greater", eda_plot = F, inf_plot = F)

Conditions for ANOVA

Multiple Comparisons

Unit 5: Inference for Categorical Variables

Sampling Variability and CLT for Proportions

Confidence Interval for a Proportion

Hypothesis Test for a Proportion

Estimating the Difference Between Two Proportions

Hypothesis Test for Comparing Two Proportions

Small Sample Proportion

Examples

Comparing Two Small Sample Proportions

Chi-Square Test for Goodness of Fit

The Chi-Square Independence Test

Unit 6: Introduction to Linear Regression

Correlation

Residuals

Least Squares Line

Prediction and Extrapolation

Conditions for Linear Regression

R Squared

Regression with Categorical Explanatory Variables

Outliers in Regression

Inference for Linear Regression

Variability Partitioning

Unit 7: Multiple Regression

Multiple Predictors

Adjusted R Squared

Collinearity and Parsimony

Inference for MLR

Model Selection

Diagnostics for MLR

Unit 9: Review

Frequentist vs. Bayesian Inference