An extraneous variable that affects both the explanatory and response variables.
Correlation does not imply causation.
A census gathers data on every case in a population. It can be expensive or even impossible to collect all of the data for a population.
|  | Random assignment | No random assignment |  |
|---|---|---|---|
| Random sampling | causal and generalizable | not causal, but generalizable | Generalizability |
| No random sampling | causal, but not generalizable | neither causal nor generalizable | No generalizability |
|  | Causation | Association |  |
Intro to tables, scatterplots
Intros to dot plots, box plots, heat maps
Sample statistics are point estimates of population parameters.
In a left-skewed distribution, the mean is less than the median; vice versa for a right-skewed distribution.
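A quick R check of this relationship, using a made-up right-skewed sample:

# Made-up right-skewed sample: most values small, a few large
x <- c(1, 2, 2, 3, 3, 4, 5, 9, 15, 40)
mean(x) # 8.4, pulled up by the large values
median(x) # 3.5, resistant to them; mean > median signals right skew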
The interquartile range (IQR) is the range of the middle 50% of the data: \(IQR = Q3 - Q1\).
Robust statistics are those on which extreme observations have little effect: median, IQR.
Non-robust: mean, standard deviation, range.
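A small demonstration of robustness with made-up data (the extreme value is hypothetical):

x <- c(2, 4, 5, 7, 8, 9, 11)
x_out <- c(x, 100) # same data plus one extreme observation
median(x); median(x_out) # 7 vs 7.5: barely moves
IQR(x); IQR(x_out) # 4 vs 4.75: changes only slightly
mean(x); mean(x_out) # 6.57 vs 18.25: dragged toward the outlier
sd(x); sd(x_out) # 3.1 vs 33.2: inflates dramatically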
A transformation is a rescaling of data using a function. When data are very strongly skewed, we sometimes transform them so they are easier to model.
Goals of transformations:
* to see the data structure differently
* to reduce skew and assist in modeling
* to straighten a nonlinear relationship in a scatterplot
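For instance, a natural log transformation often tames strong right skew; a minimal sketch with made-up data:

# Strongly right-skewed made-up sample (e.g., incomes in $1000s)
x <- c(12, 15, 18, 22, 30, 45, 80, 200, 950)
summary(x) # mean (152.4) far above median (30): strong right skew
summary(log(x)) # on the log scale, mean (3.88) and median (3.40) are close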
Review of frequency tables and bar plots. Bar plots are different from histograms in that:
* bar plots are for categorical variables, histograms for numerical
* the x-axis on a histogram is a number line, so the ordering of its bars is not interchangeable
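A minimal sketch of the two plot types in base R (the data are made up):

# Categorical variable: bar plot of counts (bars could be reordered)
colors <- factor(c("red", "blue", "blue", "green", "red", "blue"))
barplot(table(colors))
# Numerical variable: histogram over a number line
heights <- c(61, 63, 64, 66, 67, 67, 68, 70, 72, 75)
hist(heights)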
Pie Charts Bad!
Review of contingency tables: a contingency table has totals in the right and bottom margins for the levels of the categorical variables.
Also relative frequencies, segmented bar plot, relative frequency segmented bar plot, mosaic plot, side-by-side box plots
Introduction to simulations and hypothesis tests, too detailed to cover here. Summary:
Disjoint events cannot happen at the same time:
* a coin toss can't be both heads and tails
* a student can't both pass and fail a class
* a single card from a deck can't be both an ace and a queen
* P(A and B) = 0
Non-disjoint events can happen at the same time:
* a student can get an A in Stats and Econ in the same semester
* a single card can be an ace and a heart
* P(A and B) \(\ne\) 0
For disjoint events: P(A or B) = P(A) + P(B)
Example: the probability of drawing a jack or a 3 from a deck of cards is (4/52) + (4/52) = 8/52 ≈ 0.154
For non-disjoint events: P(A or B) = P(A) + P(B) - P(A and B)
Example: the probability of drawing a card that is an ace or a heart is (4/52) + (13/52) - (1/52) = 16/52 ≈ 0.308
General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
When A and B are disjoint, then P(A and B) is zero, so the formula still holds.
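A quick arithmetic check of both card examples in R:

# Disjoint: jack or 3
4/52 + 4/52 # 0.1538
# Non-disjoint: ace or heart (subtract the ace of hearts, counted twice)
4/52 + 13/52 - 1/52 # 0.3077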
A sample space is the collection of all possible outcomes of a trial. For example, for two flips of a coin, S = {HH, HT, TH, TT}.
A probability distribution lists all possible outcomes in the sample space and the probabilities with which they occur. Rules:
* the outcomes listed must be disjoint
* each probability must be between 0 and 1
* the probabilities must total 1
Complementary events are two mutually exclusive events whose probabilities add up to 1. For example, heads and tails in a coin flip.
Disjoint vs. complementary: does the sum of the probabilities of two disjoint outcomes always add up to 1? Not necessarily. For example, rolling a 1 or a 2 on a die are disjoint outcomes, but you could also roll a 3, 4, 5, or 6. The sum of the probabilities of two complementary outcomes, however, always adds up to 1.
Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. For example, the first coin toss coming up heads says nothing about the second toss.
If P(A|B)=P(A), then A and B are independent.
If A and B are independent, then P(A and B) = P(A) \(\times\) P(B).
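A quick check of the multiplication rule for two fair coin flips:

# P(heads on 1st flip and heads on 2nd), with independent fair flips
p_a = 0.5
p_b = 0.5
p_a * p_b # 0.25: one outcome (HH) out of the four in S = {HH, HT, TH, TT}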
To conduct such a test, subtract each pair of observations to get a single set of differences, then treat their mean as a single sample mean. Likewise, compute the standard error from that set of differences.
\(H_0: \mu_{diff}=0\)
\(H_A: \mu_{diff}\ne0\)
# Totally made-up example
data_1 = c(10, 12, 11, 14, 17, 18)
data_2 = c(13, 10, 18, 21, 16, 30)
n = length(data_1)
diff.mean = mean(data_1 - data_2)
sd.pop = 4.2 # This is a given
se.diff = sd.pop/sqrt(n)
nullval = 0
Z = abs(diff.mean - nullval)/se.diff
Z
## [1] 2.527
p_value = pnorm(Z, lower.tail=FALSE) * 2 # two-sided test, to match the alternative hypothesis
p_value
## [1] 0.0115
Again, we use the mean of the differences and the standard error of the differences.
# continuing with example above
conf_level = 0.95
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(diff.mean - z_star * se.diff, diff.mean + z_star * se.diff)
ci
## [1] -7.6940 -0.9727
We are 95% confident that, on average, the values in the first set are between 0.97 and 7.69 less than the values in the second set.
Here the data are not paired. Instead, we are given two sample means, along with the standard deviation of each sample.
The new element here is the calculation for the SE for the difference in means:
\(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\)
Conditions:
* independence within each sample (random sample/assignment; if sampling without replacement, n < 10% of the population)
* independence between the two samples (the groups must not be paired)
* sample size/skew: each sample should have at least 30 observations, more if the underlying distributions are very skewed
conf_level = 0.95
alpha = 1 - conf_level
x_bar_1 = 41.8
n_1 = 505
s_1 = 15.14
x_bar_2 = 39.4
n_2 = 667
s_2 = 15.12
x_bar_diff = x_bar_1 - x_bar_2
se_diff = sqrt((s_1^2/n_1) + (s_2^2/n_2))
se_diff
## [1] 0.8926
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] 0.6506 4.1494
\(H_0: \mu_1 - \mu_2=0\)
\(H_A: \mu_1 - \mu_2\ne0\)
nullval = 0
Z = abs(x_bar_diff - nullval)/se_diff
Z
## [1] 2.689
p_value = (1 - pnorm(Z)) * 2 # two-sided test
p_value
## [1] 0.007168
If the sample statistic is a median, the CLT does not apply. So we use bootstrapping: take many samples (with replacement) from the original sample and build a distribution of the resulting medians.
set.seed(3)
conf_level = 0.9
rents <- c(775, 625, 733, 929, 895, 749, 1020, 1349, 599, 1143, 1209, 1495,
879, 975, 1076, 1282, 665, 705, 799, 500)
num_samples = 500
rent_medians <- rep(0, num_samples)
for (i in 1:num_samples) {
samp <- sample(rents, length(rents), replace = TRUE)
rent_medians[i] <- median(samp)
}
num_in_interval = conf_level * num_samples
num_in_tail = (num_samples - num_in_interval)/2
medians_sorted = rent_medians[order(rent_medians)]
left_tail_val = medians_sorted[num_in_tail]
right_tail_val = medians_sorted[num_samples - num_in_tail + 1]
c(left_tail_val, right_tail_val)
## [1] 749 1020
The above values show the expected cutoff points in the confidence interval. But note that, if the sample is biased, so are these results.
First determine the mean of the bootstrap distribution, \(\bar{x}_{boot}\), and calculate the confidence interval as that mean plus/minus \(z^\star\) times the standard error, which here is the bootstrap standard error, \(SE_{boot}\): the standard deviation of the observations in the bootstrap distribution.
se_boot = sd(rent_medians) # std. error is just the standard deviation because we use the entire bootstrap population
rent_mean = mean(rent_medians)
conf_level = 0.9
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(rent_mean - z_star * se_boot, rent_mean + z_star * se_boot)
ci
## [1] 742.4 1014.1
When n is small and \(\sigma\) is unknown, use the t distribution to address the uncertainty of the standard error estimate. The t distribution is unimodal and symmetric, but its tails are somewhat thicker. Its exact shape depends on the degrees of freedom: as the degrees of freedom increase, the distribution looks more and more like the normal distribution.
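This convergence is easy to see by comparing t critical values against the normal \(z^\star\):

# 97.5th percentile of t for increasing degrees of freedom
qt(0.975, df = c(5, 10, 30, 100, 1000)) # 2.571 2.228 2.042 1.984 1.962
qnorm(0.975) # 1.96, the limiting value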
The data set up here is used later for comparing two small sample means. But here, we use only one of them.
conf_level = 0.95
alpha = 1 - conf_level
x_bar_solitaire = 52.1 # biscuit intake
s_solitaire = 45.1
n_solitaire = 22
se_solitaire = s_solitaire/sqrt(n_solitaire)
x_bar_no_dist = 27.1 # biscuit intake
s_no_dist = 26.4
n_no_dist = 22
se_no_dist = s_no_dist/sqrt(n_no_dist)
df = n_solitaire - 1
t_star = qt(conf_level + (alpha/2), df = df)
t_star
## [1] 2.08
ci_solitaire = c(x_bar_solitaire - t_star * se_solitaire, x_bar_solitaire +
t_star * se_solitaire)
ci_solitaire
## [1] 32.1 72.1
ci_no_dist = c(x_bar_no_dist - t_star * se_no_dist, x_bar_no_dist + t_star *
se_no_dist)
ci_no_dist
## [1] 15.39 38.81
\(H_0: \mu=30\)
\(H_A: \mu\ne30\)
# continues with data defined in previous section
nullval_solitaire = 30
t_stat = (x_bar_solitaire - nullval_solitaire)/se_solitaire # avoid the name T, which masks TRUE
t_stat
## [1] 2.298
p_value = (1 - pt(t_stat, df = n_solitaire - 1)) * 2 # two-sided test; df = n - 1
p_value
## [1] 0.03191
The new point here is that df is the lesser of \(n_1-1\) and \(n_2-1\).
x_bar_diff = x_bar_solitaire - x_bar_no_dist
nullval = 0
se_diff = sqrt(s_solitaire^2/n_solitaire + s_no_dist^2/n_no_dist)
se_diff
## [1] 11.14
df = min(n_solitaire - 1, n_no_dist - 1)
t_stat = abs(x_bar_diff - nullval)/se_diff # a t statistic, given the small samples
t_stat
## [1] 2.244
p_value = (1 - pt(t_stat, df = df)) * 2
p_value
## [1] 0.03575
conf_level = 0.95
alpha = 1 - conf_level
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(x_bar_diff - t_star * se_diff, x_bar_diff + t_star * se_diff)
ci
## [1] 1.83 48.17
This is for when we need to compare three or more means. New topic is the F statistic.
\(F = \frac{\text{variability between groups}}{\text{variability within groups}}\)
\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\)
\(H_A:\) The mean outcome is different for at least two of the categories.
|  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Group (class) | 3 | 236.56 | 78.855 | 21.735 | <0.0001 |
| Error (Residuals) | 791 | 2869.80 | 3.628 |  |  |
| Total | 794 | 3106.36 |  |  |  |
Sum of squares total: \(SST = \sum\limits^n_{i=1} (y_i - \bar{y})^2\) where \(y_i\) is value of response variable for each observation and \(\bar{y}\) is grand mean of response variable. This is 3106 in table above.
sst <- function(y) {
    # total squared deviation of each observation from the grand mean
    sum((y - mean(y))^2)
}
Sum of squares groups: \(SSG = \sum\limits^k_{j=1} n_j (\bar{y}_j - \bar{y})^2\) where \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \(\bar{y}_j\) is the mean of the response variable for group \(j\), and \(\bar{y}\) is the grand mean of the response variable. This is 236.56 in table above.
ssg <- function(y, j) {
    # y: response values; j: group labels, parallel to y
    total = 0
    mean.grand = mean(y) # grand mean over all values
    for (grp in unique(j)) {
        grp_data = y[j == grp] # values of y in the current group
        mean.group = mean(grp_data) # mean for the current group
        total = total + length(grp_data) * (mean.group - mean.grand)^2
    }
    return(total)
}
Sum of squares error (SSE): \(SSE = SST - SSG\). This is 2869.8 in the table above.
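Given the two helper functions above, SSE follows directly; a one-line sketch:

sse <- function(y, j) {
    # sum of squares error: total minus group
    sst(y) - ssg(y, j)
}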
Degrees of freedom: \(df_T = n - 1\) (794 in the table), \(df_G = k - 1\) (3 in the table), and \(df_E = df_T - df_G\) (791 in the table).
Mean squares: for the groups, \(MSG = SSG/df_G\) (78.855 in the table); for the error, \(MSE = SSE/df_E\) (3.628 in the table).
The F statistic is \(F=\frac{MSG}{MSE}\) (this is 21.735 in the table).
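Checking the table's F value from its sums of squares and degrees of freedom:

msg = 236.56/3 # MSG from the table: 78.85
mse = 2869.80/791 # MSE from the table: 3.628
msg/mse # 21.73, matching the table's F value up to rounding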
The p-value can now be computed using the F statistic and the two degrees of freedom values.
f_stat = 21.735 # avoid the name F, which masks FALSE
df1 = 4 - 1
df2 = 795 - df1 - 1
p_value = pf(f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
## [1] 1.56e-13
This p-value tells us only that at least two of the group means differ; it does not tell us which ones.
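To find out which, a significant ANOVA is typically followed by pairwise t tests at a Bonferroni-corrected significance level; a minimal sketch of the arithmetic, assuming the four class groups from the table above:

k = 4 # number of groups
K = k * (k - 1)/2 # number of pairwise comparisons: 6
alpha_star = 0.05/K # modified significance level: 0.00833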
Here is an example of using the inference function to get ANOVA output:
source("http://bit.ly/dasi_inference")
# This will do an ANOVA analysis because the explanatory variable is
# categorical and has more than 2 levels. alternative is "greater" because
# an F test is always one-sided
inference(y = gss$wordsum, x = gss$class, type = "ht", est = "mean", method = "theoretical",
alternative = "greater", eda_plot = F, inf_plot = F)