Populations and Samples

Frequency Distributions

Mean as a simple model and Assessing this Model

Going beyond data

Null Hypothesis Testing and Effect Sizes

NHT

  • The arbitrary cut-off point of .05 probability (or 95% confidence) may be supplemented with the use of effect sizes.
  • In NHT, a test statistic is computed and used to decide whether an effect is significant or not.
  • Even if an effect is found to be significant, that does not imply the effect is important. Very small and unimportant effects may turn out to be significant with large sample sizes.
  • Failing to reject the Null Hypothesis does not mean the Null Hypothesis is true! It only means that any observed effect is not big enough to be distinguished from chance. Cohen points out that the Null Hypothesis is never exactly true - we know from sampling distributions that sample means will differ. Even if the difference is small, it is a difference nevertheless, and with a large enough sample even a tiny difference will reach statistical significance.
  • In addition to not being able to conclude that the Null Hypothesis is true, we also cannot conclude beyond doubt that the Alternative Hypothesis is true! This is because NHT does not follow strict Modus Tollens; the result is only interpreted as: the chances of obtaining the data we have collected, given that the Null Hypothesis is true, are very low…
  • NHT can use one- and two-tailed tests, for directional and non-directional hypotheses, respectively (see the sketch below).
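
A minimal sketch in R (my own toy data, not from the notes) of the directional vs. non-directional distinction, using base R's t.test():

```r
# Toy data: two groups whose true means differ by 0.4 SD
set.seed(7)
a <- rnorm(30, mean = 0)
b <- rnorm(30, mean = 0.4)

# Non-directional (two-tailed) hypothesis: the means differ
t.test(b, a, alternative = "two.sided")$p.value

# Directional (one-tailed) hypothesis: the mean of b is greater than the mean of a.
# When the observed difference lies in the predicted direction, this p-value is
# half the two-tailed one.
t.test(b, a, alternative = "greater")$p.value
```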

  • Weakness: Biased by sample size
    • In regression:
    • the p-value is based on the t-value
    • t = b/SE (b is the regression coefficient, SE is its standard error)
    • in simple regression SE = sqrt(SS_Residual/(N - 2))/sqrt(SS_x); as N grows SE shrinks, t inflates, and even tiny coefficients become significant (see the sketch after this list)
  • Remedy: Supplement all NHSTs with estimates of effect size
    • In regression:
    • report standardised regression coefficients and the model R-squared.
  • Weakness: Arbitrary decision rule (the cutoff value [alpha] is arbitrary)
  • Remedy: supplement all NHSTs with estimates of effect size
  • Weakness: "Yokel Lokel" test - NHST encourages weak hypothesis testing
  • Remedy: learn other forms of hypothesis testing, consider multiple alternative hypotheses
  • Weakness: Error prone
  • Remedy: replicate significant effects to avoid the long-term impact of Type I errors; obtain large and representative samples to avoid Type II errors.
  • Weakness: Logic becomes probabilistic in NHST
    • Modus Tollens in NHST becomes probabilistic
    • (IF p then q; NOT q; THEREFORE NOT p)
    • (IF the null hypothesis is correct, then these data are HIGHLY UNLIKELY; the data HAVE OCCURRED; THEREFORE, the null hypothesis is HIGHLY UNLIKELY)
  • Remedy: Don't use NHST | remember that p = P(D|H-null) | report confidence intervals only | apply Bayesian learning
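
A sketch of the sample-size bias and the effect-size remedy (my own simulation, not from the notes): the true effect is fixed at a trivially small d = 0.1, yet as N grows the p-value eventually drops below .05, while the effect-size estimate makes the triviality visible.

```r
# Simulate two groups with a tiny true difference (d = 0.1) at several sample sizes
set.seed(1)
for (n in c(20, 200, 2000, 20000)) {
  x <- rnorm(n, mean = 0,   sd = 1)
  y <- rnorm(n, mean = 0.1, sd = 1)
  p <- t.test(x, y)$p.value                               # p-value shrinks as N grows (on average)
  d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d (pooled SD)
  cat(sprintf("N per group = %5d   p = %.4f   d = %.2f\n", n, p, d))
}
```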

Effect Sizes

  • Since NHT has the aforementioned problems, and does not assist in finding whether an effect is important (how much variance is explained by an effect), we can instead measure the size of effects.
  • An effect size is an objective and (usually) standardised measure of the magnitude of observed effect.
  • Cohen’s d, Pearson’s Correlation Coefficient, Odds Ratio
  • These provide an objective measure of the size of an effect
  • Effect sizes are calculated for a given sample, but can be used to estimate the likely effect size for the population.
  • Effect size is linked to 3 other statistical properties:
    1. the sample size on which the sample effect size is based
    2. the probability level (alpha) at which an effect is judged to be significant
    3. the ability of a test to detect an effect of a given size when it is present in the population (aka statistical power)
  • Given any three of these four quantities (effect size, sample size, alpha, power), the fourth can be calculated; for example, the sample size necessary to achieve a given level of power (see the sketch below).
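
A sketch using base R's power.t.test(): with any three of effect size, alpha, power, and N fixed, it solves for the fourth. The numbers below (d = 0.5, alpha = .05, power = .80) are illustrative assumptions, not values from the notes.

```r
# Per-group sample size needed to detect a 0.5 SD difference at alpha = .05 with 80% power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# Power achieved with 64 participants per group for the same effect
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)
```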

Graphical Analyses

Assumptions in Statistical Analyses

Data Transformations

Measures of Association

Correlation Analysis

  • Covariance: The simplest way to check if two variables are associated is to check if they covary.
    • Instead of squaring deviations, the deviation of one variable from its mean is multiplied by the corresponding deviation of the other variable; these products are called cross-product deviations. If their sum is positive, the variables tend to move in the same direction; if negative, in opposite directions.
    • The cross-product deviations, summed across all observations and averaged (divided by N - 1), give the covariance.
    • However, a problem with covariance is that it is affected by the scales of measurement. Covariance, therefore, cannot be an objective measure of association between two variables.
  • Pearson's product-moment correlation coefficient: to obtain a standardised covariance (also known as the correlation coefficient), the covariance is divided by the product of the standard deviations of the two variables; this gives us the covariance in standard-deviation units.
  • We may perform a null hypothesis test for the correlation coefficient (asking whether the observed correlation is likely or unlikely to occur if there were no effect in the population) in two ways:
    • Converting the correlation coefficient to a z-score and dividing it by its standard error (the z-transformation is used because the sampling distribution of r is not normal, whereas z-scores have known probabilities under a normal distribution)
    • Using a t-statistic: t = r * sqrt(N - 2)/sqrt(1 - r^2)
  • Alternatively, we may compute confidence intervals for the correlation coefficient (after converting it to a z-score and computing the standard error):
    • Since Pearson's r does not have a normal sampling distribution, it is adjusted using Fisher's z-transformation: z_r = 0.5 * ln((1 + r)/(1 - r))
    • This z_r has a standard error of: SE_zr = 1/sqrt(N - 3)
    • The 95% CI is calculated as: z_r - (1.96 x SE_zr) to z_r + (1.96 x SE_zr)
    • Once the values for the CI are calculated, they need to be converted back to correlation coefficients, since the existing values are z-scores: r = (e^(2*z_r) - 1)/(e^(2*z_r) + 1)
    • (For comparison, the standard error of the mean is just the standard deviation divided by sqrt(N), where N is the total number of values.)
  • Another possible method is bootstrapping: repeatedly resample (with replacement) from the observed sample, compute the correlation for each resample, and use the resulting distribution of correlations. This is very easy to do in R (see the sketch below).
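
A sketch of both approaches on made-up data: the Fisher-z confidence interval computed by hand (atanh()/tanh() implement the transformation above), and a percentile bootstrap CI via the boot package.

```r
set.seed(2)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
r <- cor(x, y)
n <- length(x)

# Fisher z-transform, standard error, 95% CI on the z scale, back-transform to r
zr <- atanh(r)                          # = 0.5 * log((1 + r)/(1 - r))
se <- 1 / sqrt(n - 3)
tanh(zr + c(-1.96, 1.96) * se)          # compare with cor.test(x, y)$conf.int

# Bootstrap: resample cases with replacement and recompute r each time
library(boot)
boot_r <- boot(data.frame(x, y), function(d, i) cor(d$x[i], d$y[i]), R = 2000)
boot.ci(boot_r, type = "perc")          # percentile bootstrap CI
```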

  • There are two types of Correlations:
    • Bivariate
      • Pearson’s Correlation Coefficient
      • Spearman’s Rho
      • Kendall’s Tau
    • Partial: between two variables while controlling for effects of one or more other variables
  • Pearson’s Correlation Coefficient
    • Assumptions:
      • Variables are of interval type (continuous vs. continuous)
      • For checking statistical significance, data are distributed normally.
      • Linearity and constant variance
    • Pearson’s correlation is for linear relationships.
    • Squaring the correlation coefficient (known as the Coefficient of Determination, R^2) gives a measure of the amount of variability in one variable that is shared by the other.
  • Spearman’s Rho
    • Assumptions:
      • Variables are ordinal, ratio, or interval (ordinal vs. ordinal)
      • Variables are ranked
    • When data is not normal, the Spearman’s Rho can be used, as it is non-parametric.
    • This correlation is for monotonic relationships.
  • Kendall’s Tau
    • This is useful when Spearman's Rho would otherwise be used, but the data set is small with a large number of tied ranks.
  • Biserial and Point-biserial correlation
    • Used when one variable is dichotomous and another continuous.
    • The biserial correlation is used when the dichotomy has an underlying continuum (e.g., passing vs. failing an exam), while the point-biserial is used when the dichotomy is truly discrete (e.g., alive vs. dead).
  • For Continuous vs. Ordinal variables: Biserial correlation (use polyserial() from the polycor package)
  • For Continuous vs. Nominal variables: Point-Biserial correlation (use simple correlation in R)
  • For Ordinal vs. Nominal variables: use Rank Biserial
  • For Nominal vs. Nominal variables: use Phi Coefficient, Chi-Squared Statistic, Contingency Coefficient, or Cramer’s V
    • These do not show the direction of relationships
    • Phi is used when each variable has exactly 2 possible outcomes; it adjusts for sample size: Phi = sqrt(chi-squared/n)
    • The Contingency Coefficient is used when each variable has 3 or more possible outcomes: C = sqrt(phi^2/(1 + phi^2))
    • Cramer's V is used when the variables have unequal numbers of possible outcomes: V = sqrt(phi^2/t) = sqrt(chi-squared/(n*t)), where t = min(nrow - 1, ncol - 1)
  • For partial correlations (see the R sketch after this list):
    • Used when a third variable is controlled for any effects it has on both variables under investigation
    • Use the ggm package's pcor() function
    • Semi-partial correlations control a third variable's effect on only one of the two variables
  • For comparing correlation coefficients:
    • For independent correlation coefficients, convert each to a z-score (to normalise the sampling distribution and calculate the standard error) and test the difference between the z-scores
    • For dependent correlation coefficients, use the t-statistic
  • For calculating the effect size:
    • For Pearson's and Spearman's correlations, the coefficients themselves are the effect sizes. However, Kendall's Tau is not directly comparable with them, as it is generally smaller than Spearman's (and Pearson's).
  • It is worth bearing in mind that outliers do impact correlation coefficients. Especially in small samples, outliers can greatly affect correlation strengths.
  • Correlation coefficients can be converted to distance/dissimilarity measure in the following manner:
    • dissimilarity measure = (1 - correlationCoefficient)/2
  • A dissimilarity measure can be converted to a similarity measure by: 1 - dissimilarity Measure
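
A sketch (toy data) of the bivariate coefficients, a significance test, a partial correlation via ggm::pcor() as noted above, and the dissimilarity conversion:

```r
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 0.6 * d$x + rnorm(50)
d$z <- 0.4 * d$x + 0.4 * d$y + rnorm(50)

cor(d$x, d$y, method = "pearson")    # linear association
cor(d$x, d$y, method = "spearman")   # monotonic (rank-based) association
cor(d$x, d$y, method = "kendall")    # rank-based; better for small samples with ties
cor.test(d$x, d$y)                   # NHT and 95% CI for Pearson's r

library(ggm)
pcor(c("x", "y", "z"), var(d))       # r between x and y, controlling for z

# Correlation -> dissimilarity -> similarity, as above
r <- cor(d$x, d$y)
dissim <- (1 - r) / 2
sim <- 1 - dissim
```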

Regression

Simple Linear Regression

  • Assessing the Regression model:
    • The mean is used as a model to calculate the Sum of Squares - this is known as the Total Sum of Squares (SS_T).
    • The regression line is used to calculate the Sum of Squares - this is known as the Residual Sum of Squares (SS_R).
    • To see how much better our model is than the mean (the basic model), subtract: SS_M = SS_T - SS_R. This is the improvement due to the model.
    • Proportion of Improvement due to model: R^2 = (SS_M)/(SS_T)
    • R^2 is the correlation coefficient squared. It provides a measure of how much variance in the data is explained by the model, compared to how much variance there is to explain.
    • R^2 is also the square of the correlation between the observed values of the outcome variable and the values predicted by the model, while R itself is a good estimate of the overall fit of the regression model.
    • F-ratio is also based on the sums of squares.
    • The F-ratio (read more) is a measure of how much the model has improved the prediction, divided by the inaccuracy that still exists in the model (the difference between the model and the observed data): F = (MS_M)/(MS_R)
    • F-ratio tells how much variability is explained by model compared to how much it cannot explain
    • A good model has a large F-ratio
  • Assessing individual predictors:
    • A bad model's coefficients will be 0, indicating that the outcome's values do not change with a unit change in the predictor.
    • t-test tells whether a coefficient’s value is different than 0, relative to variation in coefficient’s values across samples
      • Variation in coefficient’s values found by using the standard error
      • coefficient’s value being 0 implies a unit change in predictor variable results in no change in outcome’s values.
      • t-statistic tests the null hypothesis, that the value of a coefficient is 0.
      • t-statistic = (b_observed - b_expected)/SE_b, where b_expected is 0 under the null hypothesis
    • (Estimation note: several candidate regression lines (models) are assessed, and whichever has the least sum of squared errors becomes our model; here, instead of the mean, the regression line is used to compute the errors. See the R sketch below.)
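
A sketch (toy data) tying the above together: the sums of squares computed by hand, and summary(lm(...)) reporting R^2, the F-ratio, and a t-test for each coefficient (t = b/SE_b).

```r
set.seed(4)
x <- rnorm(100)
y <- 2 + 0.7 * x + rnorm(100)
fit <- lm(y ~ x)

SS_T <- sum((y - mean(y))^2)     # total SS: the mean as the model
SS_R <- sum(residuals(fit)^2)    # residual SS: the regression line as the model
SS_M <- SS_T - SS_R              # improvement due to the model
SS_M / SS_T                      # R^2 by hand

summary(fit)                     # Multiple R-squared, F-statistic, and
                                 # Estimate / Std. Error / t value per coefficient
```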

Multiple Regression

  • The sums of squared deviations (SSD) are calculated in the same way, as are SS_R, SS_T, and SS_M.
  • Assessing the Regression model:
    • R^2 has the same interpretation (how much variation is accounted for by the model), except that it is now based on the multiple correlation R between the observed and predicted outcome values and is reported as the Multiple R^2.
    • However, R^2 will always increase with addition of more variables.
    • The Akaike Information Criterion (AIC) is a measure of fit which penalises models for having more variables, similar in spirit to the adjusted R^2.
      • It is a measure of parsimony adjusted model fit.
      • Another is Bayesian Information Criterion (BIC).
  • Assessing individual predictors — Feature Selection:
    • Add predictors based on past research
    • Add predictors based on theoretical/conceptual/logical significance in order of significance.
    • Order may matter if predictors are correlated.
    • Forced entry: all chosen predictors are entered at the same time, in no particular order
    • Step-wise regression: forward, backward, both, all subsets
    • Step-wise methods assess the fit of each candidate variable given the variables already selected.
    • The all-subsets method assesses every combination of predictors using Mallows' C_p
    • Step-wise methods best for exploratory analysis. For other types of analyses, important to cross-validate.
  • Final assessment of Regression model
    • Check whether the model represents all the data well or is unduly driven by a few influential cases (such cases may not show up in the residuals, since they can pull the regression line towards themselves and so end up with small residuals; Cook's distance, however, can reveal whether a case has an outsized influence on the model)
    • Check if model generalises well (balance between bias and variance)
    • Bias: systematic error in the model, e.g., when it is distorted by particular cases or is too simple to capture the relationship
    • Variance: how much the model's estimates change from sample to sample; a high-variance model fits the current sample too closely to generalise
    • Residuals: unstandardised are in the same unit as the outcome variable — difficult to compare across different models. Hence, standardised residuals.
    • Outliers: May bias our model, as they affect the values of the regression coefficients. These are points that lie far from the general trend.
    • Influential cases: May bias our model.
      • Cook's distance checks the influence of a case on the model as a whole; values > 1 are cause for concern.
      • Leverage measures the influence of an observed value of the outcome variable over the predicted values; the average leverage is (k + 1)/N, where k is the number of predictors.
      • DFBeta: the change in a regression coefficient when a given case is excluded.
      • DFFit: the change in a case's predicted value when that case is excluded.
      • Covariance Ratio: the effect of a case on the standard errors of the regression coefficients.
      • (These diagnostics are illustrated in the R sketch after this list.)
    • To check that the model generalises well: cross-validate.
  • Cross-validation:
    • Refers to assessing model accuracy across different samples from same population. Used to check if model generalises properly.
    • Method #1: Adjusted R^2
    • Method #2: Data Splitting
  • Confidence Intervals: Given that we know the estimate, standard error of the estimate, and the degrees of freedom, we can calculate the confidence intervals.
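
A sketch (toy data) of a multiple regression with the fit measures and case diagnostics listed above; all of the functions below are in base R.

```r
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)

summary(fit)$adj.r.squared   # parsimony-adjusted fit
AIC(fit); BIC(fit)           # information criteria (lower is better)

rstandard(fit)               # standardised residuals
cooks.distance(fit)          # influence on the model as a whole; > 1 is worrying
hatvalues(fit)               # leverage; the average is (k + 1)/N
dfbeta(fit)                  # change in each coefficient if a case is dropped
dffits(fit)                  # change in a case's fitted value if it is dropped
covratio(fit)                # effect of each case on the coefficients' standard errors

confint(fit)                 # confidence intervals for the coefficients
```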

  • Models can be compared using the F-ratio. This calculates the significance of R^2, allowing comparison of R^2 for different models.
    • For such a comparison the models must be nested (hierarchical): model 1 contains predictor x, model 2 predictors x and y, model 3 predictors x, y, and z.
    • Subtract the Multiple R^2 values to see the improvement. Use anova() to compare the models, and interpret the F-statistic (see the sketch after this list).
  • Multicollinearity: if observed, implies that impossible to obtain unique regression coefficients for correlated variables.
    • It creates difficulties in assessing feature importance
    • It increases standard errors of regression coefficients
    • Limits the size of R, the multiple correlation between the predictors and the outcome
    • To detect pairwise multicollinearity: inspect the correlation matrix. Otherwise, use the Variance Inflation Factor (e.g., vif() from the car package).
  • If all the assumptions of regression are met and the analysis is conducted successfully, then on average the regression model from our sample is the same as the population model. The model is thus generalisable to the population.
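
A sketch of nested-model comparison and multicollinearity checks (toy data regenerated here so the block runs on its own; vif() is from the car package):

```r
set.seed(5)
d <- data.frame(x1 = rnorm(100))
d$x2 <- 0.6 * d$x1 + rnorm(100)                    # deliberately correlated predictors
d$y  <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(100)

fit1 <- lm(y ~ x1, data = d)
fit2 <- lm(y ~ x1 + x2, data = d)

summary(fit2)$r.squared - summary(fit1)$r.squared  # improvement in R^2
anova(fit1, fit2)                                  # F-test of that improvement

cor(d$x1, d$x2)     # pairwise check via the correlation matrix
library(car)
vif(fit2)           # variance inflation factors; values near 10 are a common red flag
```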

  • If the assumptions are not met, use robust regression, e.g., bootstrapping the regression; this allows us to relax the distributional assumptions (see the sketch below).
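
A sketch of the bootstrap approach with the boot package (toy data with heavy-tailed errors): resample cases, refit the regression each time, and take percentile CIs for the coefficients instead of relying on normal-theory standard errors.

```r
library(boot)
set.seed(6)
d <- data.frame(x = rnorm(60))
d$y <- 1 + 0.5 * d$x + rt(60, df = 3)          # heavy-tailed errors violate normality

# Statistic: refit the model on a resample of cases and return its coefficients
boot_b <- boot(d, function(data, i) coef(lm(y ~ x, data = data[i, ])), R = 2000)

boot.ci(boot_b, type = "perc", index = 2)      # percentile CI for the slope
```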