Unit 1: Introduction to Data

Data Basics

Variables

  • All variables are numerical or categorical
      • numerical variables take on numerical values
      • categorical variables take on a limited number of distinct categories
  • Numerical variables can be
      • continuous
      • discrete
  • Categorical variables can be
      • ordinal (some kind of natural order)
      • “regular” (no natural order)

Relationships between variables

  • When two variables show some connection with each other, they are associated or dependent
  • The association can be further described as positive or negative
  • If they are not associated, the two variables are independent

Observational Studies and Experiments

Observational study

  • Collect data in a way that does not directly interfere with how the data arise–“observe”
  • Can only establish association between explanatory and response variables
  • If the data are from the past, it is a retrospective study; if the data are collected going forward, it is a prospective study

Experiment

  • Randomly assign subjects to various treatments
  • Can establish causal relationships between explanatory and response variables

Confounding Variable

An extraneous variable that affects both the explanatory and response variables.

Correlation does not imply causation.

Sampling and Sources of Bias

A census gathers data on every case in a population. It can be expensive or even impossible to collect data on an entire population.

Sources of Sampling Bias

  • Convenience sample: sample “low-hanging fruit”
  • Non-response: Significant percentage of subjects do not respond
  • Voluntary response: Responses come only from self-motivated subjects

Sampling methods

  • Simple random
  • Stratified: divide population into homogeneous strata and sample randomly from each (example: sports teams)
  • Cluster: divide population into clusters and randomly sample from a few of them (example: villages)

Experimental Design

Principles of Experimental Design

  • control: compare treatment of interest to control group
  • randomize: randomly assign subjects to treatments
  • replicate: within a study, collect sufficiently large sample, or repeat entire study
  • block: if there are variables known or suspected to affect the response variable, group subjects into blocks based on those variables, then randomize cases within each block

Blocking vs. explanatory variables

  • Explanatory variables (sometimes called factors) are conditions we can impose on experimental units
  • Block variables are characteristics that the experimental units come with, that we would like to control for
  • Blocking is like stratifying, except that it is used in experimental settings when randomly assigning subjects to treatments, whereas stratifying is used when sampling

More terminology

  • placebo
  • placebo effect
  • blinding
  • double-blind

Random Sampling and Random Assignment

                     Random assignment                 No random assignment
Random sampling      causal and generalizable          not causal, but generalizable      (generalizability)
No random sampling   causal, but not generalizable     neither causal nor generalizable   (no generalizability)
                     (causation)                       (association)

Visualizing Numerical Data

Intro to tables, scatterplots

Scatterplot

  • Response variable on the y axis, explanatory variable on the x axis
  • A trend from lower left to upper right indicates positive association; upper left to lower right indicates negative association
  • A straight-line trend indicates a linear relationship
  • Strength of association is based on how much the data are “scattered” around the trend
  • Outliers are individual points that fall farther from the trend than most other points (see the sketch below)
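
A minimal sketch in R, with made-up data, of a positive linear association:

# Made-up data with a positive, roughly linear association
set.seed(1)
x = rnorm(50)          # explanatory variable
y = 2 * x + rnorm(50)  # response variable
plot(x, y, xlab = "explanatory", ylab = "response")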

Histogram

  • Provides a view of data density
  • Especially useful for describing shape of distribution

Skew

  • Left skew means a histogram with a long left tail; right skew means a long right tail
  • A symmetric distribution has tails of similar length on both sides

Modality

  • Unimodal
  • Bimodal
  • Uniform
  • Multimodal (more than two modes)

Intros to dot plots, box plots, heat maps

Measures of Center

  • mean
  • median
  • mode

Sample statistics are point estimates of population parameters.

In a left-skewed distribution, the mean is less than the median; in a right-skewed distribution, the mean is greater than the median (see the sketch below).
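
A minimal sketch with simulated right-skewed data (the exponential draw is an assumption for illustration; any strongly right-skewed sample behaves similarly):

set.seed(2)
x = rexp(1000)  # right-skewed data with a long right tail
mean(x)         # pulled toward the long tail, so larger...
median(x)       # ...than the median
hist(x)         # shows the right skew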

Measures of Spread

  • variance
  • standard deviation

Variability vs. diversity

  • A very diverse group might be five cars, each with a different color.
  • A very variable group might be five cars, two with mileage of 50 mpg and three with 10 mpg (see the check below)
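
A quick check of the car-mileage example, using the made-up mpg values from the bullet above:

mpg = c(50, 50, 10, 10, 10)  # two cars at 50 mpg, three at 10 mpg
var(mpg)                     # sample variance: 480
sd(mpg)                      # standard deviation: about 21.9 mpg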

Interquartile range

Range of the middle 50% of the data

Robust Statistics

Robust statistics are those on which extreme observations have little effect (a small sketch follows the two lists):

  • median
  • IQR

Non-robust:

  • mean
  • SD
  • range
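
A minimal sketch, with made-up numbers, of how a single extreme observation moves the non-robust statistics while barely touching the robust ones:

x = c(2, 4, 6, 8, 10)
x_out = c(2, 4, 6, 8, 100)             # same data with one extreme observation
c(median(x), median(x_out))            # robust: 6 in both cases
c(IQR(x), IQR(x_out))                  # robust: 4 in both cases
c(mean(x), mean(x_out))                # non-robust: jumps from 6 to 24
c(sd(x), sd(x_out))                    # non-robust: jumps from about 3.2 to about 42.5
c(diff(range(x)), diff(range(x_out)))  # non-robust: range widens from 8 to 98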

Transforming Data

A transformation is a rescaling of the data using a function. When data are very strongly skewed, we sometimes transform them so they are easier to model (see the sketch after the goals list below).

  • log transformation
  • square root
  • inverse

Goals of transformations:

  • See data structure differently
  • Reduce skew to assist in modeling
  • Straighten a nonlinear relationship in a scatter plot
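
A minimal sketch using simulated log-normal data (an assumption for illustration), where the log transformation removes most of the skew:

set.seed(4)
x = rlnorm(1000)  # strongly right-skewed data
hist(x)           # long right tail
hist(log(x))      # roughly symmetric after the log transformation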

Exploring Categorical Variables

Review of frequency tables and bar plots. Bar plots are different from histograms in that:

  • bar plots are for categorical variables, histograms for numerical variables
  • the x-axis on a histogram is a number line, so the ordering of its bars is fixed, whereas the ordering of bars in a bar plot is interchangeable (see the sketch below)
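
A minimal sketch, with made-up data, contrasting the two:

grades = c("A", "B", "B", "C", "A", "B")    # categorical variable
barplot(table(grades))                      # bar plot of counts; bar order is arbitrary
scores = c(61, 74, 78, 82, 85, 88, 90, 95)  # numerical variable
hist(scores)                                # histogram; the x-axis is a number line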

Pie Charts Bad!

Review of contingency tables: a contingency table has totals in the right and bottom margins for the levels of each categorical variable.

Also relative frequencies, segmented bar plot, relative frequency segmented bar plot, mosaic plot, side-by-side box plots

Introduction to Inference

Introduction to simulations and hypothesis tests; too detailed to cover in full here. Summary (a small simulation sketch follows the list):

  • set up null and alternative hypotheses
  • simulate the experiment assuming the null hypothesis is true
  • evaluate the probability of observing an outcome at least as extreme as the one observed in the original data (the p-value)
  • if this probability is low, reject the null hypothesis and favor the alternative
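
A minimal sketch of the simulation approach, using a made-up example: 16 heads observed in 20 coin flips, with the null hypothesis that the coin is fair:

set.seed(5)
n_flips = 20
observed = 16  # heads seen in the (made-up) original data
# simulate the experiment many times, assuming the null hypothesis (a fair coin)
sims = replicate(10000, sum(sample(c("H", "T"), n_flips, replace = TRUE) == "H"))
# p-value: probability of an outcome at least as extreme as the one observed
p_value = mean(sims >= observed)
p_value  # small (roughly 0.006), so we reject the null hypothesis of a fair coin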

Unit 2: Probability and Distributions

Disjoint Events and General Addition Rule

Disjoint (mutually exclusive)

Disjoint events cannot happen at the same time:

  • a coin toss can’t be both heads and tails
  • a student can’t both pass and fail a class
  • a single card drawn from a deck can’t be both an ace and a queen
  • P(A and B) = 0

Non-disjoint events can happen at the same time:

  • a student can get an A in Stats and in Econ in the same semester
  • a single card can be both an ace and a heart
  • P(A and B) \(\ne\) 0

Union of Disjoint Events

P(A or B) = P(A) + P(B)

Example: probability of drawing a jack or a 3 from a deck of cards is (4/52) + (4/52)

Union of Non-Disjoint Events

P(A or B) = P(A) + P(B) - P(A and B)

Example: the probability of drawing a card that is an ace or a heart is (4/52) + (13/52) - (1/52)

General Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)

When A and B are disjoint, then P(A and B) is zero, so the formula still holds.
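
A quick numeric check of the two card examples above:

4/52 + 4/52          # jack or 3 (disjoint): about 0.154
4/52 + 13/52 - 1/52  # ace or heart (non-disjoint): about 0.308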

A sample space is the collection of all possible outcomes of a trial. For example, for two flips of a coin, S = {HH, HT, TH, TT}

A probability distribution lists all possible outcomes in the sample space and the probabilities with which they occur. Rules:

  1. Events listed must be disjoint
  2. Each probability must be between 0 and 1
  3. Probabilities must total 1

Complementary events are two mutually exclusive events whose probabilities add up to 1, for example heads and tails in a coin flip.

Disjoint vs. complementary: does the sum of the probabilities of two disjoint outcomes always add up to 1? Not necessarily; for example, rolling a 1 and rolling a 2 on a die are disjoint outcomes, but you could also roll a 3, 4, 5, or 6. The sum of the probabilities of two complementary outcomes, however, always adds up to 1.

Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. For example, knowing that the first coin toss landed heads says nothing about the second toss.

If P(A|B)=P(A), then A and B are independent.

Product Rule for Independent Events

If A and B are independent, then P(A and B) = P(A) \(\times\) P(B).
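
A minimal sketch checking the product rule by simulation, using two independent fair coin tosses (a made-up example):

set.seed(6)
n = 100000
toss1 = sample(c("H", "T"), n, replace = TRUE)
toss2 = sample(c("H", "T"), n, replace = TRUE)
mean(toss1 == "H" & toss2 == "H")        # estimate of P(A and B), close to 0.25
mean(toss1 == "H") * mean(toss2 == "H")  # P(A) * P(B), also close to 0.25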

Probability Examples

Disjoint vs. Independent

Conditional Probability

Probability Trees

Bayesian Inference

Examples of Bayesian Inference

Normal Distribution

Evaluating the Normal Distribution

Working with the Normal Distribution

Binomial Distribution

Normal Approximation to Binomial

Working with the Binomial Distribution

Unit 3: Foundations for Inference

Sampling Variability and CLT

CLT (for the mean) examples

Confidence Interval (for the mean)

Accuracy vs. Precision

Required Sample Size for Margin of Error

CI (for the mean) examples

Another Introduction to Inference

Hypothesis Testing (for a mean)

HT (for the mean) examples

Inference for Other Estimators

Decision Errors

Significance vs. Confidence Level

Statistical vs. Practical Significance

Unit 4: Inference for Numerical Variables

Hypothesis Testing for Paired Data

To conduct such a test, compute the difference for each pair of observations, then treat those differences as a single sample: test their mean against the null value, using the standard error of the differences.

\(H_0: \mu_{diff}=0\)
\(H_A: \mu_{diff}\ne0\)

# Totally made-up example
data_1 = c(10, 12, 11, 14, 17, 18)
data_2 = c(13, 10, 18, 21, 16, 30)
n = length(data_1)
diff.mean = mean(data_1 - data_2)
sd.pop = 4.2 # This is a given
se.diff = sd.pop/sqrt(n)
nullval = 0
Z = abs(diff.mean - nullval)/se.diff
Z
## [1] 2.527
p_value = pnorm(Z, lower.tail=FALSE) * 2  # two-sided test
p_value
## [1] 0.0115

Confidence Intervals for Paired Data

Again, we use the mean of the differences and its standard error.

# continuing with example above
conf_level = 0.95
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(diff.mean - z_star * se.diff, diff.mean + z_star * se.diff)
ci
## [1] -7.6940 -0.9727

We are 95% confident that, on average, the values in the first set are 0.97 to 7.69 less than the values in the second set.

Comparing Independent Means

The data are not paired. Instead, we are given two sample means, and we have the standard deviations for both groups.

The new element here is the calculation for the SE for the difference in means:

\(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\)

Conditions

  1. Independence:
  • within groups: sampled observations must be independent
      • random sampling/assignment
      • if sampling without replacement, the sample is less than 10% of the population
  • between groups: the two groups must be independent of each other (not paired)
  2. Sample size/skew: each sample must have at least 30 observations, larger if the population is very skewed

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
x_bar_1 = 41.8
n_1 = 505
s_1 = 15.14
x_bar_2 = 39.4
n_2 = 667
s_2 = 15.12

x_bar_diff = x_bar_1 - x_bar_2

se_diff = sqrt((s_1^2/n_1) + (s_2^2/n_2))
se_diff
## [1] 0.8926
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] 0.6506 4.1494

Hypothesis test

\(H_0: \mu_1 - \mu_2=0\)
\(H_A: \mu_1 - \mu_2\ne0\)

nullval = 0
Z = abs(x_bar_diff - nullval)/se_diff
Z
## [1] 2.689
p_value = (1 - pnorm(Z)) * 2  # two-sided test
p_value
## [1] 0.007168

Bootstrapping

If the sample statistic is a median, the CLT does not apply. So we use bootstrapping, which takes many samples (with replacement) from the original sample and builds a distribution of the statistic from those resamples.

Confidence interval: percentile method

set.seed(3)
conf_level = 0.9
rents <- c(775, 625, 733, 929, 895, 749, 1020, 1349, 599, 1143, 1209, 1495, 
    879, 975, 1076, 1282, 665, 705, 799, 500)
num_samples = 500
rent_medians <- rep(0, num_samples)
for (i in 1:num_samples) {
    samp <- sample(rents, length(rents), replace = TRUE)
    rent_medians[i] <- median(samp)
}
num_in_interval = conf_level * num_samples
num_in_tail = (num_samples - num_in_interval)/2
medians_sorted = rent_medians[order(rent_medians)]
left_tail_val = medians_sorted[num_in_tail]
right_tail_val = medians_sorted[num_samples - num_in_tail + 1]
c(left_tail_val, right_tail_val)
## [1]  749 1020

The values above are the cutoff points of the confidence interval. But note that, if the sample is biased, so are these results.

Confidence interval: standard error method

First determine the mean of the bootstrap distribution, \(\bar{x}_{boot}\), and calculate the confidence interval as that mean plus or minus the critical value \(z^\star\) times the bootstrap standard error, \(SE_{boot}\) (1.96 for a 95% interval; 1.645 for the 90% interval below). The bootstrap standard error is the standard deviation of the observations in the bootstrap distribution.

se_boot = sd(rent_medians)  # std. error is just the standard deviation because we use the entire bootstrap population
rent_mean = mean(rent_medians)
conf_level = 0.9
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(rent_mean - z_star * se_boot, rent_mean + z_star * se_boot)
ci
## [1]  742.4 1014.1

t Distribution

When n is small and σ is unknown, use the t distribution to address the uncertainty of the standard error estimate. The t distribution is unimodal and symmetric, but its tails are somewhat thicker than the normal distribution’s. The exact shape depends on the degrees of freedom; as the degrees of freedom increase, the t distribution looks more and more like the normal distribution.
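
A quick look at how the t distribution’s critical values approach the normal critical value as the degrees of freedom grow:

qt(0.975, df = c(5, 10, 30, 100))  # roughly 2.57, 2.23, 2.04, 1.98
qnorm(0.975)                       # 1.96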

Inference for a Small Sample Mean

The data set up here will be used again later for comparing two small-sample means, but here we use only one of the groups.

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
x_bar_solitaire = 52.1  # biscuit intake
s_solitaire = 45.1
n_solitaire = 22
se_solitaire = s_solitaire/sqrt(n_solitaire)

x_bar_no_dist = 27.1  # biscuit intake
s_no_dist = 26.4
n_no_dist = 22
se_no_dist = s_no_dist/sqrt(n_no_dist)

df = n_solitaire - 1
t_star = qt(conf_level + (alpha/2), df = df)
t_star
## [1] 2.08
ci_solitaire = c(x_bar_solitaire - t_star * se_solitaire, x_bar_solitaire + 
    t_star * se_solitaire)
ci_solitaire
## [1] 32.1 72.1
ci_no_dist = c(x_bar_no_dist - t_star * se_no_dist, x_bar_no_dist + t_star * 
    se_no_dist)
ci_no_dist
## [1] 15.39 38.81

Hypothesis test

\(H_0: \mu=30\)
\(H_A: \mu\ne30\)

# continues with data defined in previous section
nullval_solitaire = 30
T = (x_bar_solitaire - nullval_solitaire)/se_solitaire
T
## [1] 2.298
p_value = (1 - pt(T, df = n_solitaire - 1)) * 2  # two-sided test; df = n - 1
p_value
## [1] 0.0319

Inference for Comparing Two Small Sample Means

The new point here is that df is the lesser of \(n_1-1\) and \(n_2-1\).

Hypothesis test

x_bar_diff = x_bar_solitaire - x_bar_no_dist
nullval = 0
se_diff = sqrt(s_solitaire^2/n_solitaire + s_no_dist^2/n_no_dist)
se_diff
## [1] 11.14
df = min(n_solitaire - 1, n_no_dist - 1)
T = abs(x_bar_diff - nullval)/se_diff
T
## [1] 2.244
p_value = (1 - pt(T, df = df)) * 2  # two-sided test
p_value
## [1] 0.03575

Confidence interval

conf_level = 0.95
alpha = 1 - conf_level
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(x_bar_diff - t_star * se_diff, x_bar_diff + t_star * se_diff)
ci
## [1]  1.83 48.17

Comparing More than Two Means

This is for when we need to compare three or more means. The new topic is the F statistic.

\(F = \frac{\text{variability between groups}}{\text{variability within groups}}\)

ANOVA

Variability partitioning

\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\)
\(H_A:\) The mean outcome is different for at least two of the categories.

ANOVA output

                      Df   Sum Sq   Mean Sq   F value   Pr(>F)
Group   class          3   236.56    78.855    21.735   <0.0001
Error   Residuals    791  2869.80     3.628
Total                794  3106.36

Sum of squares total: \(SST = \sum\limits^n_{i=1} (y_i - \bar{y})^2\) where \(y_i\) is the value of the response variable for each observation and \(\bar{y}\) is the grand mean of the response variable. This is 3106.36 in the table above.

sst <- function(y) {
    # total sum of squares: squared deviations of each observation from the grand mean
    sum((y - mean(y))^2)
}

Sum of squares groups: \(SSG = \sum\limits^k_{j=1} n_j (\bar{y}_j - \bar{y})^2\) where \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \(\bar{y}_j\) is the mean of the response variable for group \(j\), and \(\bar{y}\) is the grand mean of the response variable. This is 236.56 in table above.

ssg <- function(y, j) {
    # between-group sum of squares: size-weighted squared deviations of each
    # group mean from the grand mean
    total = 0
    mean.grand = mean(y)  # grand mean of the response variable
    for (grp in unique(j)) {
        grp_y = y[j == grp]  # response values for the current group
        total = total + length(grp_y) * (mean(grp_y) - mean.grand)^2
    }
    return(total)
}

Sum of squares error (SSE): \(SSE = SST - SSG\). This is 2869.8 in the table above.
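
A small made-up example tying the three together with the functions above (the response values y and group labels j are hypothetical):

y = c(2, 4, 6, 8, 9, 11)             # made-up response values
j = c("a", "a", "a", "b", "b", "b")  # made-up group labels
sst(y)              # total sum of squares, about 55.33
ssg(y, j)           # between-group sum of squares, about 42.67
sst(y) - ssg(y, j)  # SSE: the remaining within-group variability, about 12.67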

Degrees of freedom:

  • \(df_T = n - 1\), e.g., 795-1=794
  • \(df_G = k - 1\), e.g., 4-1=3
  • \(df_E = df_T - df_G\), e.g., 794-3=791

Mean squares: for the groups, \(MSG = SSG/df_G\) (this is 78.855 in the table); for the error, \(MSE = SSE/df_E\) (this is 3.628 in the table).

The F statistic is \(F=\frac{MSG}{MSE}\) (this is 21.735 in the table).
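
A quick check of these numbers against the table:

MSG = 236.56/3     # SSG / df_G, about 78.85
MSE = 2869.80/791  # SSE / df_E, about 3.628
MSG/MSE            # F, about 21.73 (matches the table up to rounding)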

The p-value can now be computed using the F statistic and the two degrees of freedom values.

F = 21.735
df1 = 4 - 1
df2 = 795 - df1 - 1
p_value = pf(21.735, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
## [1] 1.56e-13

This value tells us only that at least two of the group means differ; it does not tell us which ones.

Here is an example of using the inference function to get ANOVA output:

source("http://bit.ly/dasi_inference")
# This will do an ANOVA because the explanatory variable is categorical and
# has more than 2 levels. The alternative is 'greater' because an F test is
# always one-sided.
inference(y = gss$wordsum, x = gss$class, type = "ht", est = "mean", method = "theoretical", 
    alternative = "greater", eda_plot = F, inf_plot = F)

Conditions for ANOVA

Multiple Comparisons

Unit 5: Inference for Categorical Variables

Sampling Variability and CLT for Proportions

Confidence Interval for a Proportion

Hypothesis Test for a Proportion

Estimating the Difference Between Two Proportions

Hypothesis Test for Comparing Two Proportions

Small Sample Proportion

Examples

Comparing Two Small Sample Proportions

Chi-Square Test for Goodness of Fit

The Chi-Square Independence Test

Unit 6: Introduction to Linear Regression

Correlation

Residuals

Least Squares Line

Prediction and Extrapolation

Conditions for Linear Regression

R Squared

Regression with Categorical Explanatory Variables

Outliers in Regression

Inference for Linear Regression

Variability Partitioning

Unit 7: Multiple Regression

Multiple Predictors

Adjusted R Squared

Collinearity and Parsimony

Inference for MLR

Model Selection

Diagnostics for MLR

Unit 9: Review

Frequentist vs. Bayesian Inference