1-1-1 Data basics

In a data matrix, each row is an observation (case) and each column is a variable

Types of variable:

  1. Numerical variable (takes on numerical values, sensible to add, subtract, take avg, etc)
    • Continuous numerical variable (is measured, and can take on any numerical value)
    • Discrete numerical variable (is counted, and can take on only whole non-negative numbers)
  2. Categorical variable (takes on a limited number of distinct categories, not sensible to do arithmetic operations)
    • Categorical variables that have ordered levels are called ordinal
    • If the levels do not have an inherent ordering to them, then the variable is simply called Categorical

Relationships between variables

  1. If two variables are associated, they are dependent
  2. If two variables are not associated, they are independent


1-1-2 Observational studies & experiments

| observational study | experiment |
|---|---|
| 1. collect data in a way that does not directly interfere with how the data arise, i.e. merely “observe” | 1. randomly assign subjects to various treatments |
| 2. can only establish an association between the explanatory and response variables | 2. can establish causal connections between the explanatory and response variables |
| 3. if an observational study uses data from the past, it’s called a retrospective study, whereas if data are collected throughout the study, it’s called prospective | - |

Correlation does not imply causation

For observational study, if A is associated with B, it can be:

  1. A->B
  2. B->A
  3. C->A and C->B

Confounding variable

Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between them are called confounding variables


1-1-3 Sampling & Sources of bias

A few sources of sampling bias

  • Convenience sample (Individuals who are easily accessible are more likely to be included in the sample)
  • Non-response (If only a non-random fraction of the randomly sampled people respond to a survey such that the sample is no longer representative of the population)
  • Voluntary response (Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue)

Sampling methods

  • Simple random sample (Randomly select cases from the population, such that each case is equally likely to be selected)
  • Stratified sample (Divide the population into homogenous strata, and then randomly sample from within each stratum)
  • Cluster sample (Divide the population into clusters, randomly sample a few clusters, and then randomly sample from within these clusters. The clusters should be similar to each other)


1-1-4 Experimental design

Principles of experimental design

  1. Control: Compare treatment of interest to a control group
  2. Randomize: Randomly assign subjects to treatments
  3. Replicate: Within a study, replicate by collecting a sufficiently large sample, or replicate the entire study
  4. Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups

Example on Blocking

  • We would like to design an experiment to investigate if energy gels make you run faster:
    • Treatment: energy gel
    • Control: no energy gel
  • It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status:
    • Divide the sample to pro and amateur
    • Randomly assign pro and amateur athletes to treatment and control groups
    • Pro and amateur athletes are equally represented in the resulting treatment and control groups

Blocking vs. Explanatory variables

  • Explanatory variables (also sometimes called factors) are conditions we can impose on the experimental units
  • Blocking variables are characteristics that the experimental units come with, that we would like to control for
  • Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when randomly sampling

More experimental design terminology

  • Placebo: fake treatment, often used as the control group for medical studies
  • Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
  • Blinding: when experimental units do not know whether they are in the control or treatment group
  • Double-blind: when both the experimental units and the researchers do not know who is in the control and who is in the treatment group


1-1-5 Spotlight - Random sampling vs. assignment

|  | Random assignment | No random assignment |  |
|---|---|---|---|
| Random sampling | causal and generalizable | not causal, but generalizable | Generalizability |
| No random sampling | causal, but not generalizable | neither causal nor generalizable | No generalizability |
|  | Causation | Association |  |
  1. Random sampling -> Generalizability
  2. Random assignment -> Causality


1-2-1 Visualizing numerical data

Scatterplots

  • We might conclude Correlation from the scatterplots, but never Causation
  • Relationship between A and B:
    • Positive or Negative (Direction)
    • Linear or Curved (Shape)
    • Strong or Weak (Strength)

Histogram

In a histogram, the values are “binned” into intervals and the height of each bar is the number of cases (frequency) in that bin

The chosen bin width can alter the story the histogram is telling

  1. Skewness (distributions are skewed to the side of the long tail)
    • Left skewed (long tail to the left)
    • Symmetric (no skew)
    • Right skewed (long tail to the right)
  2. Modality
    • Unimodal (a single prominent peak)
    • Bimodal (two prominent peaks)
    • Uniform (no prominent peaks)
    • Multimodal (more than two prominent peaks)

Dot plot

Useful when individual values are of interest, but can get very busy as the sample size increases.

Boxplot

  1. The interquartile range (IQR) is the range of the middle 50% of the data: the distance between the third quartile (75th percentile) and the first quartile (25th percentile)
  2. A boxplot can also show skewness, like a histogram:
    • If the minimum and first quartile are far from the median, the distribution is left skewed
    • If the maximum and third quartile are far from the median, the distribution is right skewed

Intensity map

  1. It’s a map (e.g. of the world or a country) where each region is shaded with varying color intensity (usually darker means a larger value)
  2. Useful for highlighting the spatial distribution


1-2-2 Measures of center

  1. Center
    • mean: arithmetic average
    • median: midpoint of the distribution (50th percentile)
    • mode: most frequent observation
  2. Relationship with skewness
    • left skewed: Mean < Median < Mode
    • symmetric: Mean == Median == Mode
    • right skewed: Mean > Median > Mode


1-2-3 Measures of spread

  1. Variance:
    1. If we only have a sample, the variance is \[s^{2} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]
    2. If we know the whole population, the variance is \[\sigma^{2} = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}\]
  2. Standard deviation:
    1. Square root of variance
    2. Spread of the data


1-2-4 Robust statistics

We define robust statistics as measures on which extreme observations have little effect

| robustness/measure | robust | non-robust |
|---|---|---|
| center | median | mean |
| spread | IQR | SD, range |


1-2-5 Transforming numerical data

A transformation is a rescaling of the data using a function. It is used when data are very strongly skewed

Types of transformation are: Log, Square root, and Inverse

Goals of transformations

  1. To see the data structure differently
  2. To reduce skew to assist in modeling
  3. To straighten a nonlinear relationship in a scatterplot

Log transformation:

  1. The natural log transformation is often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive

  2. The transformation can also be applied to one or both variables in a scatterplot to make the relationship between the variables more linear, and hence easier to model with simple methods


1-2-6 Exploring categorical variables

Visualizing distribution of a single categorical variable

  1. Frequency table
##           Counts Frequencies
## casein        12   0.1690141
## horsebean     10   0.1408451
## linseed       12   0.1690141
## meatmeal      11   0.1549296
## soybean       14   0.1971831
## sunflower     12   0.1690141
  2. Bar plot

Visualizing relationship between two categorical variables

  1. Contingency table
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  71 
## 
##  
##              | weight 
##         feed | (108,187] | (187,266] | (266,344] | (344,423] | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       casein |         0 |         3 |         3 |         6 |        12 | 
##              |     0.000 |     0.250 |     0.250 |     0.500 |     0.169 | 
##              |     0.000 |     0.125 |     0.130 |     0.667 |           | 
##              |     0.000 |     0.042 |     0.042 |     0.085 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##    horsebean |         8 |         2 |         0 |         0 |        10 | 
##              |     0.800 |     0.200 |     0.000 |     0.000 |     0.141 | 
##              |     0.533 |     0.083 |     0.000 |     0.000 |           | 
##              |     0.113 |     0.028 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##      linseed |         4 |         6 |         2 |         0 |        12 | 
##              |     0.333 |     0.500 |     0.167 |     0.000 |     0.169 | 
##              |     0.267 |     0.250 |     0.087 |     0.000 |           | 
##              |     0.056 |     0.085 |     0.028 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##     meatmeal |         1 |         5 |         4 |         1 |        11 | 
##              |     0.091 |     0.455 |     0.364 |     0.091 |     0.155 | 
##              |     0.067 |     0.208 |     0.174 |     0.111 |           | 
##              |     0.014 |     0.070 |     0.056 |     0.014 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##      soybean |         2 |         7 |         5 |         0 |        14 | 
##              |     0.143 |     0.500 |     0.357 |     0.000 |     0.197 | 
##              |     0.133 |     0.292 |     0.217 |     0.000 |           | 
##              |     0.028 |     0.099 |     0.070 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##    sunflower |         0 |         1 |         9 |         2 |        12 | 
##              |     0.000 |     0.083 |     0.750 |     0.167 |     0.169 | 
##              |     0.000 |     0.042 |     0.391 |     0.222 |           | 
##              |     0.000 |     0.014 |     0.127 |     0.028 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |        15 |        24 |        23 |         9 |        71 | 
##              |     0.211 |     0.338 |     0.324 |     0.127 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
## 
  2. Segmented bar plot or relative frequency segmented bar plot

  3. Mosaicplot
# assumes feed and a binned weight are available, e.g. feed <- chickwts$feed; weight <- cut(chickwts$weight, 4)
mosaicplot(table(feed, weight), color=rainbow(4))

Visualizing relationship between a categorical and a numerical variable

  1. Side-by-side box plots
# ChickWeight is a built-in data set; Diet is categorical and weight is numerical
boxplot(weight~Diet, data=ChickWeight, range=0, border=rainbow(4), main="Chick Weight by Diet")


1-3-1 Inference via simulation

  1. Set a null and an alternative hypothesis
  2. Simulate the experiment assuming that the null hypothesis is true
  3. Repeat the simulation many times to get the sampling distribution
  4. From the sampling distribution, evaluate the probability of observing an outcome at least as extreme as the one observed in the original data
  5. If this probability is low, then reject the null hypothesis in favor of the alternative.
  6. If this probability is not low, then fail to reject the null hypothesis
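
A minimal sketch of such a simulation in R; the scenario (48 promotion decisions, 24 male and 24 female candidates, 35 promotions in total) and all numbers are assumed for illustration:

set.seed(1)
outcomes <- c(rep("promoted", 35), rep("not promoted", 13))   # all 48 observed outcomes
obs_diff <- 21/24 - 14/24       # assumed observed difference in promotion rates (male - female)
sim_diffs <- replicate(10000, {
  shuffled <- sample(outcomes)  # re-deal the outcomes at random, as the null hypothesis implies
  mean(shuffled[1:24] == "promoted") - mean(shuffled[25:48] == "promoted")
})
mean(sim_diffs >= obs_diff)     # p-value: proportion of simulations at least as extreme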



2-1-1 Disjoint

  1. Disjoint (mutually exclusive) events cannot happen at the same time \[P(A\ or\ B) = P(A) + P(B)\]
  2. Non-disjoint events can happen at the same time \[P(A\ or\ B) = P(A) + P(B) - P(A\ and\ B)\]
  3. A sample space is a collection of all possible outcomes of a trial
  4. A probability distribution lists all possible outcomes in the sample space, and the probabilities with which they occur
  5. Complementary events are two mutually exclusive events whose probabilities add up to 1


2-1-2 Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other \[P(A\ |\ B) = P(A)\] \[P(A\ and\ B) = P(A) * P(B)\]

Disjoint events are dependent on each other: knowing that one occurred tells you the other did not


2-2-1 Bayes’s theorem

\[P(A\ |\ B) = \frac{P(A\ and\ B)}{P(B)} = \frac{P(B\ |\ A)\ P(A)}{P(B)}\]


2-2-2 Probability tree

A probability tree lays out the marginal and conditional probabilities of a sequence of events; it is useful for inverting conditional probabilities, i.e. going from \(P(A\ |\ B)\) to \(P(B\ |\ A)\)


2-2-3 Bayesian inference

Posterior probability

  • It is generally defined as: \(P(hypothesis\ |\ data)\)
  • It depends on both the prior probability we set and the observed data
  • In the next iteration, we update our prior with our posterior probability from the previous iteration

Recap

  • Take advantage of prior information, like a previously published study or a physical model
  • Naturally integrate data as you collect it, and update your priors
  • A good prior helps, a bad prior hurts, but the prior matters less when we have more data
  • Base decisions on the posterior probability: \(P(hypothesis\ is\ true\ |\ observed\ data)\)
  • It is different from Frequentist Inference \(P(observed\ data\ |\ hypothesis\ is\ true)\)


2-3-1 Normal distribution

Empirical rule:

  1. 68% of the observations lie in \([\mu-\sigma, \mu+\sigma]\)
  2. 95% of the observations lie in \([\mu-2*\sigma, \mu+2*\sigma]\)
  3. 99.7% of the observations lie in \([\mu-3*\sigma, \mu+3*\sigma]\)
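
A quick check of these percentages using the standard normal CDF in R:

# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3 under a normal distribution
sapply(1:3, function(k) pnorm(k) - pnorm(-k))   # approximately 0.683, 0.954, 0.997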

Standardized (Z) score

  • The Z score of an observation is the number of standard deviations it falls above or below the mean: \(Z = \frac{x - \mu}{\sigma}\)
  • Defined for distributions of any shape


2-3-2 Evaluating the normal distribution

Anatomy of a normal probability plot

  • Data are plotted on the y-axis of a normal probability plot, and theoretical quantiles (following a normal distribution) on the x-axis
  • If there is a one-to-one relationship between the data and the theoretical quantiles, then the data follow a nearly normal distribution
  • Since a one-to-one relationship would appear as a straight line on a scatter plot, the closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model
  • Constructing a normal probability plot requires calculating percentiles and corresponding z-scores for each observation, which is tedious
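
In R, qqnorm() and qqline() do this tedious work; a small sketch with simulated data (the sample below is assumed, not from these notes):

set.seed(2)
x <- rnorm(100, mean = 50, sd = 5)   # assumed nearly normal sample
qqnorm(x)                            # theoretical quantiles on the x-axis, data on the y-axis
qqline(x)                            # reference line; points close to it suggest near-normality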

Shape of distribution

  1. Right skew: Points bend up and to the left of the line
  2. Left skew: Points bend down and to the right of the line
  3. Short tails (narrower than the normal distribution): Points follow an S-shaped curve
  4. Long tails (wider than the normal distribution): Points start below the line, bend to follow it, and end above it


2-4-1 Binomial distribution

Bernoulli random variables

  • When an individual trial of an experiment has only two possible outcomes, it is called a Bernoulli random variable

Binomial distribution definition

  • The binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with probability of success p: \[{n \choose k} * p^k * (1-p)^{n-k}\]

Binomial distribution conditions

  1. The trials must be independent
  2. The number of trials n must be fixed
  3. Each trial outcome must be classified as a success or a failure (i.e. Bernoulli random variable)
  4. The probability of success p must be the same for each trial

Calculating probabilities

# P(exactly 8 successes in 10 independent trials, each with success probability 0.13)
dbinom(8, size=10, p=0.13)
## [1] 2.77842e-06

Mean and standard deviation of binomial distribution

  • Expected value (mean) of binomial distribution: \[\mu = n p\]

  • Standard deviation of binomial distribution: \[\sigma = \sqrt{n p (1-p)}\]


2-4-2 Normal approximation to binomial distribution

Success-failure rule

  • A binomial distribution with at least 10 expected successes and 10 expected failures closely follows a normal distribution \[np >= 10\] \[n(1-p) >= 10\]

  • If the success-failure condition holds, normal approximation to the binomial: \[Binomial(n,p)\ approximates\ Normal(\mu, \sigma)\] where \(\mu = n p\) and \(\sigma = \sqrt{n p (1-p)}\)

  • A small trick (continuity correction) when using the normal distribution to approximate the binomial, e.g. for \(P(X \geq x)\): use \(\frac{(x\ -\ 0.5)\ -\ \mu}{\sigma}\) instead of \(\frac{x\ -\ \mu}{\sigma}\)
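
A sketch comparing the exact binomial tail probability with the continuity-corrected normal approximation (n, p, and x below are assumed example values):

n <- 245; p <- 0.25; x <- 70
mu <- n * p; sigma <- sqrt(n * p * (1 - p))
pbinom(x - 1, size = n, prob = p, lower.tail = FALSE)   # exact P(X >= x)
pnorm((x - 0.5 - mu) / sigma, lower.tail = FALSE)       # normal approximation with the 0.5 shift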



3-1-1 Foundations for inference

Sampling distribution

The distribution of a sample statistic (e.g. the sample mean) computed from many samples drawn from the population

Central Limit Theorem (CLT)

The sampling distribution is nearly normal, centered at the population mean, and with a standard deviation equal to the population standard deviation divided by square root of the sample size. Formula: \(\bar{x}\) ~ \(N (mean=\mu, SE=\frac{\sigma}{\sqrt{n}})\) \[mean = \mu\] \[Variance = \frac{\sigma^2}{n}\] \[SE = \sqrt{\frac{\sigma^2}{n}}\]

We usually use \(s\) instead of \(\sigma\) since we don’t know the population standard deviation

Conditions for CLT

  1. Independence: Sampled observations must be independent.
    • Random sample/assignment
    • If sampling without replacement, n < 10% of the population
  2. Sample size/skewness: Either the population distribution is normal, or if the population is skewed, the sample size is large ( Rule of thumb: n > 30)
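
A small simulation illustrating the CLT; the exponential population below is an assumed example:

set.seed(3)
population <- rexp(100000, rate = 1)     # right-skewed population with mu = 1 and sigma = 1
sample_means <- replicate(5000, mean(sample(population, size = 50)))
c(mean(sample_means), sd(sample_means))  # close to mu = 1 and sigma/sqrt(50) = 0.141
hist(sample_means, breaks = 40)          # nearly normal, even though the population is skewed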


3-2-1 Confidence interval

A plausible range of values for the population parameter is called a confidence interval

Confidence interval for a population mean

Computed as the sample mean plus/minus a margin of error (critical value corresponding to the middle XX% of the normal distribution times the standard error of the sampling distribution) \[[\bar{x} - z^* \frac{s}{\sqrt{n}}, \bar{x} + z^* \frac{s}{\sqrt{n}}]\] where \(z^*\) is called critical value, and margin of error (ME) is \(z^* \frac{s}{\sqrt{n}}\)
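
A sketch of this calculation in R; the sample summary statistics are assumed:

xbar <- 3.2; s <- 1.74; n <- 50          # assumed sample mean, standard deviation, and size
se <- s / sqrt(n)
z_star <- qnorm(0.975)                   # critical value for a 95% confidence level
xbar + c(-1, 1) * z_star * se            # lower and upper bounds of the interval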


3-2-2 Confidence level

  1. Suppose we took many samples of the same size and built a confidence interval with 95% confidence level from each sample using the equation \[point\ estimate\ \pm\ 1.96 \times SE\]
  2. Then about 95% of those confidence intervals would contain the true population mean \(\mu\)
  3. Commonly used confidence levels in practice are 90%, 95%, 98%, and 99%
  4. Increasing the sample size shortens the confidence interval (higher precision) at a fixed confidence level; increasing the confidence level widens the interval (higher accuracy, but lower precision)


3-3-2 Hypothesis testing (for a mean)

Hypothesis testing for a single mean

  1. Set the hypotheses: \[H_0: \mu = null\ value\] \[H_A: \mu < or > or <> null\ value\] (one-sided or two-sided)

  2. Calculate the point estimate: \(\bar{x}\)

  3. Check conditions:
    1. Independence: Sampled observations must be independent (random sample/assignment & if sampling without replacement n < 10% of population)
    2. Sample size/skew: n > 30, larger if the population distribution is very skewed
  4. Draw sampling distribution, calculate test statistic \(Z = \frac{\bar{x}-\mu}{SE}\), \(SE = \frac{s}{\sqrt{n}}\), and get p-value. \[p-value = P(observed\ or\ more\ extreme\ outcome\ |\ H_0\ is\ true)\]

  5. Make a decision, and interpret it in context of the research question:
    • If p-value < \(\alpha\), reject \(H_0\); The data provide convincing evidence for \(H_A\)
    • If p-value > \(\alpha\), fail to reject \(H_0\); The data do not provide convincing evidence for \(H_A\)
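
A sketch of steps 4 and 5 in R, using assumed summary statistics:

xbar <- 118.2; mu_0 <- 100; s <- 6.5; n <- 36      # assumed values
se <- s / sqrt(n)
z <- (xbar - mu_0) / se
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value
p_value < 0.05                                     # TRUE means reject H0 at alpha = 0.05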

Relationship between HT and CI

  1. Using a corresponding confidence level (e.g. 95%) and two-sided significance level (e.g. 0.05), the hypothesis test and the confidence interval yield the same conclusion
  2. Using a corresponding confidence level (e.g. 90%) and one-sided significance level (e.g. 0.05), the hypothesis test and the confidence interval yield the same conclusion


3-4-1 Inference for other estimators

Unbiased estimator

An important assumption of CLT about point estimates is that they are unbiased, i.e. the sampling distribution of the estimate is centered at the true population parameter it estimates.
That is, an unbiased estimator does not systematically over- or underestimate the parameter; it provides a “good” estimate


3-5-1 Decision errors

Type 1 & type 2 errors

  1. Type 1 error (\(\alpha\)): Reject \(H_0\) when \(H_0\) is true. The probability of doing so is \(\alpha\)
  2. Type 2 error (\(\beta\) or \(1-power\)): Fail to reject \(H_0\) when \(H_0\) is false. The probability of doing so is \(\beta\)
  3. Power of a test is the probability of correctly rejecting \(H_0\) when \(H_0\) is false. This is what we want.


3-5-3 Statistical vs. practical significance

  • Real differences between the point estimate and null value are easier to detect with larger samples
  • However, very large samples will result in statistical significance even for tiny differences between the sample mean and the null value, even when the difference is not practically significant



4-1-1 Hypothesis testing for paired data

Comparing means of matched pairs

  • When two sets of observations have special correspondence (not independent), they are said to be paired:
    • Same individuals: pre-post studies, repeated measures, etc.
    • Different (but dependent) individuals: twins, partners, etc.
  • To analyze paired data, it is often useful to look at the difference in outcomes of each pair of observations:
    • Converting a two-variable problem into a familiar one-variable problem
  • It is important that we always subtract using a consistent order
  • Hypothesis testing for difference between paired means:
    1. Set the hypothesis: \[H_0: \mu_{diff} = null\ value\] \[H_A: \mu_{diff} < or > or <> null\ value\]
    2. Calculate the point estimate \(\bar{x_{diff}}\)
    3. Check conditions:
      1. Independence: Sampled observations must be independent (random sample/assignment & if sampling without replacement, \(n_{diff}\) < 10% of population)
      2. Sample size/skew: \(n_{diff}\) >= 30, larger if the population distribution is very skewed
    4. Draw sampling distribution, calculate test statistic, and get p-value
    5. Make a decision, and interpret it in context of the research question
  • Confidence interval for difference between paired means: \[[\bar{x_{diff}} - z^*\frac{s_{diff}}{\sqrt{n_{diff}}}, \bar{x_{diff}} + z^*\frac{s_{diff}}{\sqrt{n_{diff}}}]\]
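
A sketch of the paired test and interval, using assumed summary statistics for the differences:

xbar_diff <- -0.545; s_diff <- 8.887; n_diff <- 200   # assumed summaries of the differences
se <- s_diff / sqrt(n_diff)
z <- (xbar_diff - 0) / se
2 * pnorm(abs(z), lower.tail = FALSE)                 # two-sided p-value
xbar_diff + c(-1, 1) * qnorm(0.975) * se              # 95% confidence interval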


4-1-3 Comparing two independent means

CLT Conditions for inference for comparing two independent means:

  1. Independence:
    • within groups: sampled observations must be independent
      • random sample/assignment
      • if sampling without replacement, n < 10% of population
    • between groups: the two groups must be independent of each other (non-paired)
  2. Sample size/skew: Each sample size must be at least 30 (\(n_1\)>=30 and \(n_2\)>=30), larger if the population distribution is very skewed.

Confidence interval for Estimating the difference between independent means

\[CI = [(\bar{x_1}-\bar{x_2}) - z^*SE_{\bar{x_1}-\bar{x_2}}, (\bar{x_1}-\bar{x_2}) + z^*SE_{\bar{x_1}-\bar{x_2}}]\] where \(SE_{\bar{x_1}-\bar{x_2}} = \sqrt{\frac{{s_1}^2}{n_1}+\frac{{s_2}^2}{n_2}}\)

Hypothesis testing for difference between independent means

  • Null hypothesis: no difference. \(H_0: \mu_1 - \mu_2= 0\)
  • Alternative hypothesis: some difference. \(H_A: \mu_1 - \mu_2 <> 0\)
  • Same conditions and SE as the confidence interval
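
A sketch with assumed group summaries:

xbar1 <- 41.8; s1 <- 15.14; n1 <- 505               # assumed group 1 summaries
xbar2 <- 39.4; s2 <- 15.12; n2 <- 667               # assumed group 2 summaries
se <- sqrt(s1^2 / n1 + s2^2 / n2)
(xbar1 - xbar2) + c(-1, 1) * qnorm(0.975) * se      # 95% CI for mu_1 - mu_2
z <- ((xbar1 - xbar2) - 0) / se
2 * pnorm(abs(z), lower.tail = FALSE)               # two-sided p-value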


4-2-1 Bootstrapping

Use bootstrapping to construct confidence intervals for parameters of interest when CLT based approach does not apply.

Bootstrapping scheme

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
  2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, etc. computed on the bootstrap sample
  3. Repeat steps 1 and 2 many times to create a bootstrap distribution - a distribution of bootstrap statistics
  4. Get confidence interval with:
    1. percentile method: get middle x% (e.g. 95%) from bootstrap distribution
    2. standard error method: get \([\bar{x_{boot}} - z^*SE_{boot}, \bar{x_{boot}} + z^*SE_{boot}]\)
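
A minimal percentile-method sketch in R; the original sample below is simulated purely for illustration:

set.seed(4)
original_sample <- rexp(40, rate = 0.1)             # assumed original sample of size 40
boot_medians <- replicate(10000,
  median(sample(original_sample, replace = TRUE)))  # bootstrap statistic for each resample
quantile(boot_medians, c(0.025, 0.975))             # 95% bootstrap interval (percentile method)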

Bootstrapping limitations:

  • The conditions are not as rigid as for CLT-based methods
  • However if the bootstrap distribution is extremely skewed or sparse, the bootstrap interval might be unreliable
  • A representative sample is required for generalizability. If the sample is biased, the estimates resulting from this sample will also be biased

Bootstrap vs. sampling distribution

  • Sampling distribution is created using sampling (with replacement) from the population
  • Bootstrap distribution is created using sampling (with replacement) from the sample
  • Both are distributions of sample statistics


4-3-1 t-distribution

  • When n is small (n < 30) & \(\sigma\) is unknown (almost always), use the t distribution to address the uncertainty of the standard error estimate (since we use \(\frac{s}{\sqrt{n}}\) instead of \(\frac{\sigma}{\sqrt{n}}\))
  • Bell shaped but thicker tails than the normal
    • observations more likely to fall beyond 2 SDs from the mean
    • extra thick tails helpful for mitigating the effect of a less reliable estimate for the standard error of the sampling distribution
  • t distribution is always centered at 0 (like the standard normal)
  • has one parameter: degrees of freedom (df) - that determines thickness of the tails
  • The t statistic and p-value are calculated the same way: \[T = \frac{obs-null}{SE}\] \[P-value = P(x>T)\ or\ P(|x|>T)\]
  • The larger the degrees of freedom is, the closer the t distribution is to standard normal distribution
  • Usually we use t distribution for inference (hypothesis test and confidence interval) for one or two means
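
A quick illustration of the tails thinning as df grows:

qt(0.975, df = c(5, 10, 30, 100))   # 97.5th percentile of the t distribution for increasing df
qnorm(0.975)                        # the values approach the standard normal value, about 1.96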


4-3-2 Inference for one sample mean using t-distribution

CLT Conditions for inference for one sample mean

  1. Independence: Sampled observations must be independent (random sample/assignment & if sampling without replacement n < 10% of population)
  2. Sample size/skew: the n > 30 condition does not need to be met when using the t distribution, which is designed for small samples

Confidence interval for one sample mean

\[CI = [\bar{x} - t_{df}^*SE, \bar{x} + t_{df}^*SE]\] where \[SE = \frac{s}{\sqrt{n}}\] \[df = n - 1\]

Hypothesis testing for one sample mean

  • Same conditions and standard error as the confidence interval
  • Use t-distribution instead of standard normal distribution to calculate P-value
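
A sketch of the one-sample t interval and test, with assumed summary statistics:

xbar <- 52.1; s <- 45.1; n <- 22; mu_0 <- 30   # assumed values
se <- s / sqrt(n); df <- n - 1
t_stat <- (xbar - mu_0) / se
2 * pt(abs(t_stat), df, lower.tail = FALSE)    # two-sided p-value
xbar + c(-1, 1) * qt(0.975, df) * se           # 95% confidence interval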


4-3-3 Inference for comparing two sample means using t-distribution

Inference

  • Confidence interval \[CI = [(\bar{x_1}-\bar{x_2}) - t_{df}^*SE_{\bar{x_1}-\bar{x_2}}\ ,\ (\bar{x_1}-\bar{x_2}) + t_{df}^*SE_{\bar{x_1}-\bar{x_2}}]\]

  • Hypothesis testing \[T_{df} = \frac{(\bar{x_1}-\bar{x_2}) - (\mu_1-\mu_2)}{SE_{\bar{x_1}-\bar{x_2}}}\]

\[SE_{\bar{x_1}-\bar{x_2}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

  • Degrees of freedom for t statistic for inference on difference of two means \[df = min(n_1-1, n_2-1)\]
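
A sketch with assumed group summaries and the conservative degrees of freedom:

xbar1 <- 22.1; s1 <- 13.9; n1 <- 22             # assumed group 1 summaries
xbar2 <- 27.1; s2 <- 15.8; n2 <- 22             # assumed group 2 summaries
se <- sqrt(s1^2 / n1 + s2^2 / n2)
df <- min(n1 - 1, n2 - 1)
t_stat <- (xbar1 - xbar2) / se
2 * pt(abs(t_stat), df, lower.tail = FALSE)         # two-sided p-value
(xbar1 - xbar2) + c(-1, 1) * qt(0.975, df) * se     # 95% confidence interval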


4-4-1 Comparing more than two means

To compare the means of 2+ groups we use a test called analysis of variance (ANOVA) and a statistic called the F statistic

ANOVA

  • Compare means from more than two groups: are they so far apart that the observed differences cannot all reasonably be attributed to sampling variability?
  • Hypothesis: \[H_0: The\ mean\ outcome\ is\ the\ same\ across\ all\ categories\ (\mu_1=\mu_2=...=\mu_k)\] \[H_A: At\ least\ one\ pair\ of\ means\ are\ different\ from\ each\ other\]
  • Compute a test statistic \[F = \frac{variability\ between\ groups}{variability\ within\ groups}\]
  • Large test statistic leads to small p-values
  • If the p-value is small enough \(H_0\) is rejected, and we conclude that the data provide convincing evidence of a difference in the population means
  • ANOVA does not tell which means are different, it only tells at least two means are different

4-4-2 ANOVA in detail

Sample data:

|  | n | mean | sd |
|---|---|---|---|
| lower class | 41 | 5.07 | 2.24 |
| working class | 407 | 5.75 | 1.87 |
| middle class | 331 | 6.76 | 1.89 |
| upper class | 16 | 6.19 | 2.34 |
| overall | 795 | 6.14 | 1.98 |

Variability partitioning

Total variability in the sample data can be partitioned into:
  1. Variability attributed to class (between-group variability)
  2. Variability attributed to other factors (within-group variability)

ANOVA output

|  |  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Group | class | 3 | 236.56 | 78.855 | 21.735 | <0.0001 |
| Error | Residuals | 791 | 2869.80 | 3.628 |  |  |
|  | Total | 794 | 3106.36 |  |  |  |
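
A table like this can be produced with aov(); since the class data above are not included here, the built-in chickwts data set serves as a runnable stand-in:

fit <- aov(weight ~ feed, data = chickwts)   # one numerical response, one categorical explanatory variable
summary(fit)                                 # Df, Sum Sq, Mean Sq, F value, Pr(>F)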

Sum of Squares:

  1. Sum of squares total (SST): measures the total variability in the response variable \[SST = \sum_{i=1}^n(y_i - \bar{y})^2\] where
    \(y_i\): value of the response variable for each observation
    \(\bar{y}\): grand mean of the response variable

  2. Sum of squares groups (SSG): measures the variability between groups \[SSG = \sum_{j=1}^k{n_j(\bar{y_j}-\bar{y})^2}\] where
    \(n_j\): number of observations in group j
    \(\bar{y_j}\): mean of the response variable for group j
    \(\bar{y}\): grand mean of the response variable

  3. Sum of squares error/residual (SSE): measures the variability within groups \[SSE = SST - SSG\]

Degrees of freedom associated with ANOVA

\[Total:\ df_T = n - 1\] \[Group:\ df_G = k - 1\] \[Error:\ df_E = df_T - df_G\]

Mean squares

Average variability between and within groups, calculated as the total variability (sum of squares) scaled by the associated degrees of freedom \[Group: MSG = SSG / df_G\] \[Error: MSE = SSE / df_E\]

F statistic

Ratio of the between group and within group variability \[F = \frac{MSG}{MSE}\]

p-value

  • The p-value is the probability of observing a ratio of between-group to within-group variability at least as large as the one computed from the data, if in fact the means of all groups are equal
  • It is area under the F curve, with degrees of freedom \(df_G\) and \(df_E\), above the observed F statistic
# area under the F curve with df 3 and 791, above the observed F statistic
pf(21.735, 3, 791, lower.tail=FALSE)
## [1] 1.559855e-13

Conclusion

  • If the p-value is small (less than \(\alpha\)), reject \(H_0\). The data provide convincing evidence that at least one pair of population means are different from each other (but we can’t tell which pair)
  • If the p-value is large (greater than \(\alpha\)), fail to reject \(H_0\). The data do not provide convincing evidence that any pair of population means are different from each other; the observed differences in sample means are attributable to sampling variability (or chance)


4-4-3 Conditions for ANOVA

  1. Independence:
    • within groups: sampled observations must be independent
    • between groups: the groups must be independent of each other (non-paired)
  2. Approximate normality:
    • distributions should be nearly normal within each group, especially when sample sizes are small
  3. Equal variance:
    • groups should have roughly equal variability (constant variance), especially when sample sizes differ between groups


4-4-4 Multiple comparisons

Which means differ

  • After ANOVA, use two sample t-tests for differences in each possible pair of groups
  • Testing many pairs of groups is called multiple comparisons

Control Type 1 error

  • But multiple tests lead to increased Type 1 error rate
  • Solutions to control Type 1 error:
    • Bonferroni correction: adjust \(\alpha\) by the number of comparisons being considered \[\alpha^* = \alpha/K\] \[where\ K\ (number\ of\ comparisons) = \frac{k(k-1)}{2}\]

Pairwise comparisons

  • For multiple comparisons after ANOVA, since the assumption of equal variability across groups must have been satisfied, re-think the standard error and the degrees of freedom:
    • Use a consistent measure of standard error and consistent degrees of freedom for all tests
    • Standard error for multiple pairwise comparisons: \[SE = \sqrt{\frac{MSE}{n_1} + \frac{MSE}{n_2}}\]
    • Degrees of freedom for multiple pairwise comparisons: \[df_t = df_E\]
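
In R, pairwise.t.test() carries out all pairwise comparisons with a pooled standard deviation and can apply the Bonferroni adjustment; shown here on the built-in chickwts data:

pairwise.t.test(chickwts$weight, chickwts$feed,
                p.adjust.method = "bonferroni", pool.sd = TRUE)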



5-1-1 Sampling variability & CLT for proportions

Sampling distribution

The distribution of a sample statistic (here the sample proportion) computed from many samples drawn from the population

CLT for proportions

The sampling distribution is nearly normal, centered at the population proportion, and with a standard error inversely proportional to the sample size. Formula: \(\hat{p}\) ~ \(N(mean=p, SE=\sqrt{\frac{p(1-p)}{n}})\) If \(p\) is unknown, use \(\hat{p}\)

Conditions for the CLT

  1. Independence: Sampled observations must be independent
    • Random sample/assignment
    • If sampling without replacement, n < 10% of population
  2. Sample size/skew: There should be at least 10 successes and 10 failures in the sample
    • np >= 10 and n(1-p) >= 10
    • If p is unknown, use \(\hat{p}\)
    • If the success-failure condition is not met, center and spread of the sampling distribution can still be approximated using the same formula, but the shape of the distribution might be skewed depending on whether the true population proportion is closer to 0 or closer to 1


5-1-2 Confidence interval for a proportion

  • Confidence interval is calculated as point estimate minus/plus margin of error \[[\hat{p} - z^*SE_{\ \hat{p}}\ ,\ \hat{p} + z^*SE_{\ \hat{p}}]\] \[where\ SE_{\ \hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
  • Calculate the required sample size for desired Margin of Error:
    1. Remember \(ME = z^*\sqrt{\frac{p(1-p)}{n}}\)
    2. If there is a previous study that we can rely on for the value of \(p\), use that in the calculation of the required sample size
    3. If not, use \(p = 0.5\)
      • if you don’t know any better, 50-50 is a good guess
      • gives the most conservative estimate - largest possible sample size n
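
A sketch of the interval and of the sample-size calculation; the values are assumed:

p_hat <- 0.60; n <- 1000                       # assumed sample proportion and size
se <- sqrt(p_hat * (1 - p_hat) / n)
p_hat + c(-1, 1) * qnorm(0.975) * se           # 95% confidence interval
qnorm(0.975)^2 * 0.5 * 0.5 / 0.02^2            # n needed for ME = 2% when using p = 0.5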


5-1-3 Hypothesis test for a proportion

  1. Set the hypotheses: \[H_0: p= null value\] \[H_A: p < or > or <> null value\]
  2. Calculate the point estimate: \(\hat{p}\)
  3. Check conditions:
    1. Independence: Sampled observations must be independent (random sample/assignment & if sampling without replacement, n < 10% of population)
    2. Sample size/skew: \(np >= 10\) and \(n(1-p) >= 10\)
  4. Draw sampling distribution, calculate test statistic, and shade p-value, \[Z = \frac{\hat{p} - p}{SE},\ SE = \sqrt{\frac{p(1-p)}{n}}\]
  5. Make a decision, and interpret it in context of the research question:
    • If p-value < \(\alpha\), reject \(H_0\); the data provide convincing evidence for \(H_A\)
    • If p-value > \(\alpha\), fail to reject \(H_0\); the data do not provide convincing evidence for \(H_A\)

Note:
  • In a hypothesis test, we use the null value p for checking the success-failure condition and computing the SE
  • In a confidence interval, we use \(\hat{p}\) for checking the condition and computing the SE, since we don’t know p
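
A sketch of steps 3-5 with assumed values, using the null value for the standard error:

p_hat <- 0.38; p_0 <- 0.5; n <- 500            # assumed values
se <- sqrt(p_0 * (1 - p_0) / n)                # the SE uses the null value in a hypothesis test
z <- (p_hat - p_0) / se
2 * pnorm(abs(z), lower.tail = FALSE)          # two-sided p-value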


5-2-1 Inference for comparing two independent proportions

Confidence interval

It is calculated as point estimate minus/plus margin of error \[[(\hat{p_1} - \hat{p_2}) - z^*SE_{(\hat{p_1}-\hat{p_2})}\ ,\ (\hat{p_1} - \hat{p_2}) + z^*SE_{(\hat{p_1}-\hat{p_2})}]\] \[SE = \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}\]

Conditions for inference for comparing two independent proportions

  1. Independence:
    • Within groups: sampled observations must be independent within each group
      • random sample/assignment
      • if sampling without replacement, n < 10% of population
    • Between groups: the two groups must be independent of each other (non-paired)
  2. Sample size/skew: Each sample should meet the success-failure condition:
    • \(n_1p_1 >= 10\) and \(n_1(1-p_1) >= 10\)
    • \(n_2p_2 >= 10\) and \(n_2(1-p_2) >= 10\)

Hypothesis testing

Under \(H_0: p_1 = p_2\), use the pooled proportion for the success-failure condition check and the SE calculation: \[\hat{p}_{pool} = \frac{(number\ of\ successes_1) + (number\ of\ successes_2)}{n_1 + n_2}\]
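
A sketch with assumed counts:

x1 <- 45; n1 <- 90; x2 <- 60; n2 <- 90                # assumed successes and sample sizes
p_pool <- (x1 + x2) / (n1 + n2)
se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (x1 / n1 - x2 / n2) / se
2 * pnorm(abs(z), lower.tail = FALSE)                 # two-sided p-value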


5-3-1 Small sample proportion (when success-failure condition fails)

Inference via simulation

  • The ultimate goal of a hypothesis test is a p-value
    • p-value = P(observed or more extreme outcome | \(H_0\) is true)
  • Devise a simulation schema that assumes the null hypothesis is true
  • Repeat the simulation many times and record relevant sample statistic (one sample statistic per simulation)
  • Calculate the p-value as the proportion of simulations that yield a result at least as extreme as the observed proportion
  • For one small sample we can use coin flip; for two small samples we can use cards for simulation
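
A coin-flip style simulation for a single small-sample proportion; the counts below are assumed:

set.seed(5)
n <- 20; p_null <- 0.5; obs_prop <- 0.85            # assumed sample size, null value, observed proportion
sim_props <- replicate(10000,
  mean(sample(c(0, 1), n, replace = TRUE, prob = c(1 - p_null, p_null))))
mean(sim_props >= obs_prop)                         # one-sided p-value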


5-4-1 Chi-square goodness of fit (GOF) test

chi-square statistic

When dealing with counts and investigating how far the observed counts are from the expected counts, we use a new test statistic called the chi-square (\(\chi^2\)) statistic \[\chi^2\ statistic:\ \ \chi^2 = \sum_{i=1}^k \frac{(O-E)^2}{E}\] \[O: observed\] \[E: expected\] \[k: number\ of\ cells\]

chi-square distribution

  • chi-square distribution has just one parameter: degrees of freedom (df)
  • df influences the shape, center, and spread

Hypothesis test for one categorical variable with more than 2 levels

Example:

| ethnicity | white | black | nat.amer. | asian | other | total |
|---|---|---|---|---|---|---|
| expected # | 2007 | 302 | 20 | 73 | 98 | 2500 |
| observed # | 1920 | 347 | 19 | 84 | 130 | 2500 |


Steps:

  1. Set hypothesis
    • \(H_0\) (nothing going on): the distribution in the population follows the hypothesized (expected) distribution
    • \(H_A\) (something going on): the distribution in the population does not follow the hypothesized (expected) distribution
  2. Calculate expected and observed counts for each categorical level
  3. Conditions for the chi-square test
    1. Independence: Sampled observations must be independent
      • random sample/assignment
      • if sampling without replacement, n < 10% of population
      • each case only contributes to one cell in the table
    2. Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases
  4. Calculate chi-square (\(\chi^2\)) statistic \[\chi^2 = \sum_{i=1}^k\frac{(O-E)^2}{E}\] \[O: observed\ ;\ E: expected\ ;\ k: number\ of\ cells\]
  5. Calculate \(\chi^2\) degrees of freedom (df) \[df = k - 1\] \[k: number\ of\ cells\]
  6. Calculate p-value
    • p-value for a chi-square test is defined as the tail area above the calculated test statistic
    • because the test statistic is always positive, a higher test statistic means a higher deviation from the null hypothesis
# area under the chi-square curve with df = 4, above the observed statistic 22.63
pchisq(22.63, df=4, lower.tail=FALSE)
## [1] 0.000150104
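
The same test via chisq.test(), passing the hypothesized population proportions:

observed <- c(white = 1920, black = 347, nat.amer = 19, asian = 84, other = 130)
hypothesized <- c(2007, 302, 20, 73, 98) / 2500     # expected counts converted to proportions
chisq.test(observed, p = hypothesized)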


5-4-2 Chi-square independence test

Hypothesis test for two categorical variables, at least one of which has more than 2 levels

Example:

| status | dating | cohabiting | married | total |
|---|---|---|---|---|
| obese | 81 (113) | 103 (110) | 147 (108) | 331 |
| not obese | 359 (327) | 326 (319) | 277 (316) | 962 |
| total | 440 | 429 | 424 | 1293 |

(expected counts are shown in parentheses)


Steps:
  1. Set hypothesis:
    • \(H_0\) (nothing going on): the two categorical variables are independent; variable 1 does not vary by variable 2
    • \(H_A\) (something going on): the two categorical variables are dependent; variable 1 does vary by variable 2
  2. Calculate observed and expected counts for each cell \[expected\ count = \frac{(row\ total) * (column\ total)}{table\ total}\]
  3. Conditions for the chi-square test:
    1. Independence: Sampled observations must be independent
      • random sample/assignment
      • if sampling without replacement, n < 10% of population
      • each case only contributes to one cell in the table
    2. Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases
  4. Calculate the chi-square (\(\chi^2\)) statistic \[\chi^2 = \sum_{i=1}^k\frac{(O-E)^2}{E}\] \[O: observed\ ;\ E: expected\ ;\ k: number\ of\ cells\]
  5. Calculate the \(\chi^2\) degrees of freedom (df) \[df = (R - 1) * (C - 1)\] \[R: number\ of\ rows\] \[C: number\ of\ columns\]
  6. Calculate the p-value
    • the p-value for a chi-square test is defined as the tail area above the calculated test statistic
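
The same test via chisq.test() on the observed counts from the table above:

obs <- matrix(c(81, 103, 147,
                359, 326, 277),
              nrow = 2, byrow = TRUE,
              dimnames = list(status = c("obese", "not obese"),
                              relationship = c("dating", "cohabiting", "married")))
chisq.test(obs)    # expected counts are (row total * column total) / table total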



6-1-1 Correlation

Definition

  • Describes the linear association between two variables
  • Denoted as R

Properties

  1. The magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
  2. The sign of the correlation coefficient indicates the direction of association
  3. The correlation coefficient is always between -1 (perfect negative linear association) and 1 (perfect positive linear association). \(R=0\) indicates no linear relationship
  4. The correlation coefficient is unitless, and is not affected by changes in the center or scale of either variable (such as unit conversions)
  5. The correlation of X with Y is the same as of Y with X
  6. The correlation coefficient is sensitive to outliers


6-2-1 Residuals

  • leftovers from the model fit
  • data = fit + residual
  • difference between the observed and predicted y \[Residual: e_i = y_i - \hat{y_i}\]


6-2-2 Least squares line

\[\hat{y} = \beta_0 + \beta_1 x\ \ or\ \ \hat{y} = b_0 + b_1 x\]

Estimating the regression parameters: Slope

\[Slope: b_1 = \frac{s_y}{s_x} R\] \[s_x = SD\ of\ x\] \[s_y = SD\ of\ y\] \[R = cor(x,y)\]

Estimating the regression parameters: Intercept

The least squares line always goes through (\(\bar{x}\), \(\bar{y}\)) \[b_0 = \bar{y} - b_1\bar{x}\]
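
Checking these formulas against lm() on the built-in cars data set:

b1 <- (sd(cars$dist) / sd(cars$speed)) * cor(cars$speed, cars$dist)   # slope = (s_y / s_x) * R
b0 <- mean(cars$dist) - b1 * mean(cars$speed)                         # line passes through (x-bar, y-bar)
c(intercept = b0, slope = b1)
coef(lm(dist ~ speed, data = cars))                                   # should match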


6-2-3 Prediction & extrapolation

Prediction

  • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction
  • Plug in the value of x in the linear model equation

Extrapolation

  • Applying a model estimate to values outside of the realm of the original data is called extrapolation
  • The estimate might not be accurate
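
A small example with predict(); predicting far outside the observed range of the explanatory variable would be extrapolation:

fit <- lm(dist ~ speed, data = cars)
predict(fit, newdata = data.frame(speed = 21))   # 21 is within the observed range of speed
range(cars$speed)                                # estimates far outside this range are extrapolation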


6-2-4 Conditions for linear regression

1. Linearity

  • Relationship between the explanatory and the response variable should be linear
  • Methods for fitting a model to non-linear relationships exist
  • Check using a scatterplot of the data, or a residuals plot
  • A residuals plot with no apparent pattern suggests that a linear model is appropriate for the data

2. Nearly normal residuals

  • Residuals should be nearly normally distributed, centered at 0
  • May not be satisfied if there are unusual observations that don’t follow the trend of the rest of the data
  • Check using a histogram or normal probability plot of residuals

3. Constant variability

  • Variability of points around the least squares line should be roughly constant
  • Implies that the variability of residuals around 0 line should be roughly constant as well
  • Also called homoscedasticity
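
Quick visual checks of these conditions for a fitted model, using the built-in cars data as an example:

fit <- lm(dist ~ speed, data = cars)
plot(fit$fitted.values, fit$residuals); abline(h = 0)   # linearity and constant variability
hist(fit$residuals)                                     # nearly normal residuals, centered at 0
qqnorm(fit$residuals); qqline(fit$residuals)            # normal probability plot of residuals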


6-2-5 R-square

  • Strength of the fit of a linear model is most commonly evaluated using \(R^2\)
  • Calculated as the square of the correlation coefficient
  • Tells us what percent of variability in the response variable is explained by the model
  • The remainder of the variability is explained by variables not included in the model
  • Always between 0 and 1


6-2-6 Regression with categorical explanatory variables

  • For a categorical variable with three levels: \[\hat{y} = b_0 + b_1x_1 + b_2x_2 \]
  • Possible values of the indicator variables:
    • \(x_1 = x_2 = 0\)
    • \(x_1 = 1\ and\ x_2 = 0\)
    • \(x_1 = 0\ and\ x_2 = 1\)
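
An example with a three-level categorical predictor, where R creates the two indicator variables automatically (using the built-in iris data):

fit <- lm(Sepal.Length ~ Species, data = iris)   # Species has three levels
coef(fit)   # intercept = mean of the reference level; slopes = differences from that level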


6-3-1 Outliers in regression

Types of outliers

  • Outliers are points that fall away from the cloud of points
  • Outliers that fall horizontally away from the center of the cloud but don’t influence the slope of the regression line are called leverage points
  • Outliers that actually influence the slope of the regression line are called influential points
    • usually high leverage points
    • to determine if a point is influential, visualize the regression line with and without the point, and see if the slope of the line changes considerably
  • Outliers might reduce \(R^2\), but not always


6-4-1 Inference for linear regression

Hypothesis Test for the slope

  1. Set hypothesis:
    • \(H_0: \beta_1=0\) (nothing going on) The explanatory variable is not a significant predictor of the response variable, i.e. no relationship -> slope of the relationship is 0
    • \(H_A: \beta_1<>0\) (something going on) The explanatory variable is a significant predictor of the response variable, i.e. has relationship -> slope of the relationship is not 0
  2. Check conditions: Independence
  3. Calculate t-statistic: \[T = \frac{b_1 -0}{SE_{b_1}}\]
  4. Calculate degrees of freedom (df):
    • \(df = n - 2\)
    • Lose 1 df for each parameter estimated, and in linear regression we estimate 2 parameters: \(\beta_0\) and \(\beta_1\)

Confidence Interval for the slope

  • Point estimate minus/plus margin of error \[[b_1 - t_{df}^*SE_{b_1}\ ,\ b_1 + t_{df}^*SE_{b_1}]\]
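
From a fitted model, summary() gives the slope’s t statistic and p-value, and confint() gives the interval (cars used as an example):

fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients            # estimate, SE, t value, p-value for the intercept and slope
confint(fit, "speed", level = 0.95)  # b1 plus/minus t* SE with df = n - 2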


6-4-2 Variability partitioning

  • An alternative to the hypothesis test for the slope of the relationship between x and y
  • It considers the variability in y explained by x, compared to the unexplained variability
  • Partitioning the variability in y into explained and unexplained variability requires analysis of variance (ANOVA)

Hypothesis testing using ANOVA

  1. Sum of squares: \[Total\ variability\ in\ y:\ \ SS_{Total} = \sum(y-\bar{y})^2\] \[Unexplained\ variability\ in\ y\ (residuals):\ \ SS_{Residual} = \sum(y-\hat{y})^2\] \[Explained\ variability\ in\ y:\ \ SS_{Regression} = SS_{Total} - SS_{Residual}\]
  2. Degrees of freedom: \[Total\ degrees\ of\ freedom:\ \ df_{Total} = n - 1\] \[Regression\ degrees\ of\ freedom:\ \ df_{Regression} = number\ of\ predictors\] \[Residual\ degrees\ of\ freedom:\ \ df_{Residual} = df_{Total} - df_{Regression}\]
  3. Mean squares: \[MS\ Regression:\ \ MS_{Regression} = \frac{SS_{Regression}}{df_{Regression}}\] \[MS\ Residual:\ \ MS_{Residual} = \frac{SS_{Residual}}{df_{Residual}}\]
  4. F statistic (ratio of explained to unexplained variability) \[F_{(df_{Regression},\ df_{Residual})} = \frac{MS_{Regression}}{MS_{Residual}}\]
  5. Get p-value
    • If p-value is small, reject \(H_0\), the data provide convincing evidence that the slope is significantly different than 0
    • If p-value is not small, fail to reject \(H_0\), the data do not provide convincing evidence that the slope is different than 0

Another way to calculate R square

\[R^2 = \frac{explained\ variability}{total\ variability} = \frac{SS_{Regression}}{SS_{Total}}\]
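
anova() shows this partitioning for a fitted regression, and R-squared can be recovered from it (cars used as an example):

fit <- lm(dist ~ speed, data = cars)
anova(fit)                    # Sum Sq for the predictor (regression) and Residuals, with the F statistic
summary(fit)$r.squared        # equals SS_Regression / SS_Total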



7-1-1 Multiple predictors

Interaction variables

  • If the relationship between one explanatory variable and the response depends on another explanatory variable (the variables interact), then we would need to include an interaction variable in the model


7-1-2 Adjusted R square

Why adjusted R square

  • When any variable is added to the model \(R^2\) increases
  • But if the added variable doesn’t really provide any new information, adjusted \(R^2\) should not be expected to increase

Calculate adjusted R square

\[R_{adj}^2 = 1 - (\frac{SSE}{SST} * \frac{n-1}{n-k-1})\] \[k:\ number\ of\ predictors\]

Properties of adjusted R square

  • k is never negative -> adjusted \(R^2\) < \(R^2\)
  • adjusted \(R^2\) applies a penalty for the number of predictors included in the model
  • we choose models with higher adjusted \(R^2\) over others
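
Both quantities are available from summary(); a sketch on the built-in mtcars data (the predictors chosen here are arbitrary):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared        # increases whenever a predictor is added
summary(fit)$adj.r.squared    # penalized for the number of predictors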


7-1-3 Collinearity and parsimony

Collinearity

  • Two predictor variables are said to be collinear when they are correlated with each other
  • Inclusion of collinear predictors (also called multicollinearity) complicates model estimation

Parsimony

  • Avoid adding predictors associated with each other, because oftentimes the addition of such a variable brings nothing new to the table
  • Prefer the simplest best model, i.e. the parsimonious model
  • Addition of collinear variables can result in biased estimates of the regression parameters
  • While collinearity often cannot be avoided in observational data, experiments are usually designed to prevent correlated predictors (i.e. to control for confounding variables)


7-2-1 Inference for Multiple Linear Regression

Inference for the model as a whole

  • Set hypothesis \[H_0: \beta_1 = \beta_2 = ... = \beta_k = 0\] \[H_A: At\ least\ one\ \beta_i\ is\ different\ than\ 0\]
  • The F test yielding a significant result doesn’t mean the model fits the data well, it just means at least one of the \(\beta\) is non-zero
  • The F test not yielding a significant result doesn’t mean individual variables included in the model are not good predictors of \(y\), it just means that the combination of these variables doesn’t yield a good model
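
The overall F statistic and its p-value can be pulled from a fitted multiple regression (mtcars used as an assumed example):

fit <- lm(mpg ~ wt + hp, data = mtcars)
f <- summary(fit)$fstatistic               # F value with its two degrees of freedom
pf(f[1], f[2], f[3], lower.tail = FALSE)   # p-value for H0: all slopes equal 0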

T-test for the slopes

  1. Set hypothesis:
    • \(H_0: \beta_1=0\), when all other variables are included in the model
    • \(H_A: \beta_1<>0\), when all other variables are included in the model
  2. Check conditions: Independence
  3. Calculate t-statistic: \[T = \frac{b_1 -0}{SE_{b_1}}\]
  4. Calculate degrees of freedom (df):
    • \(df = n - k - 1\)
    • \(k\) is number of predictors
    • Lose 1 df for each parameter estimated, and in linear regression we estimate k + 1 parameters

Confidence Interval for the slopes

  • Point estimate minus/plus margin of error \[[b_1 - t_{df}^*SE_{b_1}\ ,\ b_1 + t_{df}^*SE_{b_1}]\]


7-3-1 Model selection for LR

Stepwise model selection

  • backwards elimination: start with a full model (containing all predictors), drop one predictor at a time until the parsimonious model is reached
  • forward selection: start with an empty model and add one predictor at a time until the parsimonious model is reached
  • Criteria:
    • p-value, adjusted \(R^2\)
    • AIC, BIC, DIC, Bayes factor, Mallows’ \(C_p\)

Backwards elimination - adjusted R square

  • Start with the full model
  • Drop one variable at a time and record adjusted \(R^2\) of each smaller model
  • Pick the model with the highest increase in adjusted \(R^2\)
  • Repeat until none of the models yield an increase in adjusted \(R^2\)
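
A minimal sketch of one round of backwards elimination by adjusted R-squared, on mtcars with three arbitrarily chosen predictors:

summary(lm(mpg ~ wt + hp + qsec, data = mtcars))$adj.r.squared   # full model
summary(lm(mpg ~ hp + qsec, data = mtcars))$adj.r.squared        # model without wt
summary(lm(mpg ~ wt + qsec, data = mtcars))$adj.r.squared        # model without hp
summary(lm(mpg ~ wt + hp, data = mtcars))$adj.r.squared          # model without qsec
# keep the candidate with the highest adjusted R^2; repeat until no drop improves it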

Backwards elimination - p-value

  • Start with the full model
  • Drop the variable with the highest p-value and refit a smaller model
  • Repeat until all variables left in the model are significant
  • For a categorical variable, drop it only if none of its levels are significant (don’t drop individual levels)

Adjusted R square vs. p-value

  • p-value: significant predictors
  • adjusted \(R^2\): more reliable predictions
  • p-value method depends on the (somewhat arbitrary) 5% significance level cutoff
    • different significant level -> different model
    • used commonly since it requires fitting fewer models (in the more commonly used backwards-selection approach)

Forward selection - adjusted R square

  • Start with single predictor regressions of response vs. each explanatory variable
  • Pick the model with the highest adjusted \(R^2\)
  • Add the remaining variables one at a time to the existing model, and pick the model with the highest adjusted \(R^2\)
  • Repeat until the addition of any of the remaining variables does not result in a higher adjusted \(R^2\)

Forward selection - p-value

  • Start with single predictor regressions of response vs. each explanatory variable
  • Pick the variable with the lowest p-value
  • Add the remaining variables one at a time to the existing model, and pick the variable with the lowest p-value
  • Repeat until any of the remaining variables do not have a significant p-value

Expert opinion

  • Variables can be included in (or eliminated from) the model based on expert opinion
  • If you are studying a certain variable, you might choose to leave it in the model regardless of whether it’s significant or yields a higher adjusted \(R^2\)


7-4-1 Conditions for MLR

1. Linear relationships between (numerical) x and y

  • Each (numerical) explanatory variable linearly related to the response variable
  • Check using residuals plots (e vs. x)
    • looking for a random scatter plot around 0
    • consider all variables in the model, instead of just the bivariate relationship between a given x and y

2. Nearly normal residuals with mean 0

  • Some residuals will be positive and some negative
  • On a residuals plot we look for random scatter of residuals around 0
  • This translates to a nearly normal distribution of residuals centered at 0
  • Check using histogram or normal probability plot

3. Constant variability of residuals

  • Residuals should be equally variable for low and high values of the predicted response variable
  • Check using residuals plots of residuals vs. predicted (e vs. \(\hat{y}\))
    • Residuals vs. predicted instead of residuals vs. x because it allows for considering the entire model (with all explanatory variables) at once
    • Residuals should be randomly scattered in a band with a constant width around 0 (no fan shape)
    • Also worthwhile to view absolute value of residuals vs. predicted to identify unusual observations easily

4. Independent residuals

  • Independent residuals -> independent observations
  • If time series structure is suspected, check using residuals vs. order of data collection
  • If not, think about how the data are sampled