Populations and Samples

Frequency Distributions

Mean as a simple model and Assessing this Model

Going beyond data

Null Hypothesis Testing and Effect Sizes

NHT

  • The arbitrary cut-off point of .05 probability (or 95% confidence) may be supplemented with the use of effect sizes.
  • In NHT, a test statistic is computed and used to decide whether an effect is significant or not.
  • Even if an effect is found to be significant, that does not imply the effect is important. Very small and unimportant effects may turn out to be significant with large sample sizes.
  • Failing to reject the Null Hypothesis does not mean the Null Hypothesis is true! It only means that any observed effect is not big enough to be distinguished from chance. Cohen points out that the Null Hypothesis is never exactly true - we know from sampling distributions that sample means will differ. Even if the difference is small, it is a difference nevertheless, and with a large enough sample even a tiny difference will reach statistical significance.
  • In addition to not being able to conclude that the Null Hypothesis is true, we also cannot conclude beyond doubt that the Alternative Hypothesis is true! This is because NHT does not follow strict Modus Tollens; the result is only interpreted as: the chances of obtaining the data we have collected, given that the Null Hypothesis is true, are very low…
  • NHT can use one- and two-tailed tests, for directional and non-directional hypotheses, respectively (see the sketch below).
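
A minimal sketch in R (my own toy data, not from the notes) of the directional vs. non-directional distinction, using base R's t.test():

```r
# Toy data: two groups whose true means differ by 0.4 SD
set.seed(7)
a <- rnorm(30, mean = 0)
b <- rnorm(30, mean = 0.4)

# Non-directional (two-tailed) hypothesis: the means differ
t.test(b, a, alternative = "two.sided")$p.value

# Directional (one-tailed) hypothesis: the mean of b is greater than the mean of a.
# When the observed difference lies in the predicted direction, this p-value is
# half the two-tailed one.
t.test(b, a, alternative = "greater")$p.value
```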

  • Weakness: Biased by sample size
    • In regression:
    • the p-value is based on the t-value
    • t = b/SE (b is the regression coefficient, SE is its standard error)
    • in simple regression SE = sqrt(SS_Residual/(N - 2))/sqrt(SS_x); as N grows SE shrinks, t inflates, and even tiny coefficients become significant (see the sketch after this list)
  • Remedy: Supplement all NHSTs with estimates of effect size
    • In regression:
    • report standardised regression coefficients and the model R-squared.
  • Weakness: Arbitrary decision rule (the cutoff value [alpha] is arbitrary)
  • Remedy: supplement all NHSTs with estimates of effect size
  • Weakness: "Yokel Lokel" test - NHST encourages weak hypothesis testing
  • Remedy: learn other forms of hypothesis testing, consider multiple alternative hypotheses
  • Weakness: Error prone
  • Remedy: replicate significant effects to avoid the long-term impact of Type I errors; obtain large and representative samples to avoid Type II errors.
  • Weakness: Logic becomes probabilistic in NHST
    • Modus Tollens in NHST becomes probabilistic
    • (IF p then q; NOT q; THEREFORE NOT p)
    • (IF the null hypothesis is correct, then these data are HIGHLY UNLIKELY; the data HAVE OCCURRED; THEREFORE, the null hypothesis is HIGHLY UNLIKELY)
  • Remedy: Don't use NHST | remember that p = P(D|H-null) | report confidence intervals only | apply Bayesian learning
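
A sketch of the sample-size bias and the effect-size remedy (my own simulation, not from the notes): the true effect is fixed at a trivially small d = 0.1, yet as N grows the p-value eventually drops below .05, while the effect-size estimate makes the triviality visible.

```r
# Simulate two groups with a tiny true difference (d = 0.1) at several sample sizes
set.seed(1)
for (n in c(20, 200, 2000, 20000)) {
  x <- rnorm(n, mean = 0,   sd = 1)
  y <- rnorm(n, mean = 0.1, sd = 1)
  p <- t.test(x, y)$p.value                               # p-value shrinks as N grows (on average)
  d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d (pooled SD)
  cat(sprintf("N per group = %5d   p = %.4f   d = %.2f\n", n, p, d))
}
```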

Effect Sizes

  • Since NHT has the aforementioned problems, and does not assist in finding whether an effect is important (how much variance is explained by an effect), we can instead measure the size of effects.
  • An effect size is an objective and (usually) standardised measure of the magnitude of observed effect.
  • Cohen’s d, Pearson’s Correlation Coefficient, Odds Ratio
  • These provide an objective measure of the size of an effect
  • Effect sizes are calculated for a given sample, but can be used to estimate the likely effect size for the population.
  • Effect size is linked to 3 other statistical properties:
    1. the sample size on which the sample effect size is based
    2. the probability level (alpha) at which an effect is judged to be significant
    3. the ability of a test to detect an effect of a given size when it is present in the population (aka statistical power)
  • Given any three of these four quantities (effect size, sample size, alpha, power), the fourth can be calculated; for example, the sample size necessary to achieve a given level of power (see the sketch below).
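
A sketch using base R's power.t.test(): with any three of effect size, alpha, power, and N fixed, it solves for the fourth. The numbers below (d = 0.5, alpha = .05, power = .80) are illustrative assumptions, not values from the notes.

```r
# Per-group sample size needed to detect a 0.5 SD difference at alpha = .05 with 80% power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# Power achieved with 64 participants per group for the same effect
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)
```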

Graphical Analyses

Assumptions in Statistical Analyses

Data Transformations

Measures of Association

Correlation Analysis

  • Covariance: The simplest way to check if two variables are associated is to check if they covary.
    • Instead of squaring deviations, the deviation of one variable from its mean is multiplied by the corresponding deviation of the other variable; these products are called cross-product deviations. If their sum is positive, the variables tend to move in the same direction; if negative, in opposite directions.
    • The cross-product deviations, summed across all observations and averaged (divided by N - 1), give the covariance.
    • However, a problem with covariance is that it is affected by the scales of measurement. Covariance, therefore, cannot be an objective measure of association between two variables.
  • Pearson's product-moment correlation coefficient: to obtain a standardised covariance (also known as the correlation coefficient), the covariance is divided by the product of the standard deviations of the two variables; this gives us the covariance in standard-deviation units.
  • We may perform a null hypothesis test for the correlation coefficient (asking whether the observed correlation is likely or unlikely to occur if there were no effect in the population) in two ways:
    • Converting the correlation coefficient to a z-score and dividing it by its standard error (the z-transformation is used because the sampling distribution of r is not normal, whereas z-scores have known probabilities under a normal distribution)
    • Using a t-statistic: t = r * sqrt(N - 2)/sqrt(1 - r^2)
  • Alternatively, we may compute confidence intervals for the correlation coefficient (after converting it to a z-score and computing the standard error):
    • Since Pearson's r does not have a normal sampling distribution, it is adjusted using Fisher's z-transformation: z_r = 0.5 * ln((1 + r)/(1 - r))
    • This z_r has a standard error of: SE_zr = 1/sqrt(N - 3)
    • The 95% CI is calculated as: z_r - (1.96 x SE_zr) to z_r + (1.96 x SE_zr)
    • Once the values for the CI are calculated, they need to be converted back to correlation coefficients, since the existing values are z-scores: r = (e^(2*z_r) - 1)/(e^(2*z_r) + 1)
    • (For comparison, the standard error of the mean is just the standard deviation divided by sqrt(N), where N is the total number of values.)
  • Another possible method is bootstrapping: repeatedly resample (with replacement) from the observed sample, compute the correlation for each resample, and use the resulting distribution of correlations. This is very easy to do in R (see the sketch below).
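
A sketch of both approaches on made-up data: the Fisher-z confidence interval computed by hand (atanh()/tanh() implement the transformation above), and a percentile bootstrap CI via the boot package.

```r
set.seed(2)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
r <- cor(x, y)
n <- length(x)

# Fisher z-transform, standard error, 95% CI on the z scale, back-transform to r
zr <- atanh(r)                          # = 0.5 * log((1 + r)/(1 - r))
se <- 1 / sqrt(n - 3)
tanh(zr + c(-1.96, 1.96) * se)          # compare with cor.test(x, y)$conf.int

# Bootstrap: resample cases with replacement and recompute r each time
library(boot)
boot_r <- boot(data.frame(x, y), function(d, i) cor(d$x[i], d$y[i]), R = 2000)
boot.ci(boot_r, type = "perc")          # percentile bootstrap CI
```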

  • There are two types of Correlations:
    • Bivariate
      • Pearson’s Correlation Coefficient
      • Spearman’s Rho
      • Kendall’s Tau
    • Partial: between two variables while controlling for effects of one or more other variables
  • Pearson’s Correlation Coefficient
    • Assumptions:
      • Variables are of interval type (continuous vs. continuous)
      • For checking statistical significance, data are distributed normally.
      • Linearity and constant variance
    • Pearson’s correlation is for linear relationships.
    • Squaring the correlation coefficient (known as the Coefficient of Determination, R^2) gives a measure of the amount of variability in one variable that is shared by the other.
  • Spearman’s Rho
    • Assumptions:
      • Variables are ordinal, ratio, or interval (ordinal vs. ordinal)
      • Variables are ranked
    • When data is not normal, the Spearman’s Rho can be used, as it is non-parametric.
    • This correlation is for monotonic relationships.
  • Kendall’s Tau
    • This is useful when Spearman's Rho would otherwise be used, but the data set is small with a large number of tied ranks.
  • Biserial and Point-biserial correlation
    • Used when one variable is dichotomous and another continuous.
    • The biserial correlation is used when the dichotomy has an underlying continuum (e.g., passing vs. failing an exam), while the point-biserial is used when the dichotomy is truly discrete (e.g., alive vs. dead).
  • For Continuous vs. Ordinal variables: Biserial correlation (use polyserial() from the polycor package)
  • For Continuous vs. Nominal variables: Point-Biserial correlation (use simple correlation in R)
  • For Ordinal vs. Nominal variables: use Rank Biserial
  • For Nominal vs. Nominal variables: use Phi Coefficient, Chi-Squared Statistic, Contingency Coefficient, or Cramer’s V
    • These do not show the direction of relationships
    • Phi is used when each variable has exactly 2 possible outcomes; it adjusts for sample size: Phi = sqrt(chi-squared/n)
    • The Contingency Coefficient is used when each variable has 3 or more possible outcomes: C = sqrt(phi^2/(1 + phi^2))
    • Cramer's V is used when the variables have unequal numbers of possible outcomes: V = sqrt(phi^2/t) = sqrt(chi-squared/(n*t)), where t = min(nrow - 1, ncol - 1)
  • For partial correlations (see the R sketch after this list):
    • Used when a third variable is controlled for any effects it has on both variables under investigation
    • Use the ggm package's pcor() function
    • Semi-partial correlations control a third variable's effect on only one of the two variables
  • For comparing correlation coefficients:
    • For independent correlation coefficients, convert each to a z-score (to normalise the sampling distribution and calculate the standard error) and test the difference between the z-scores
    • For dependent correlation coefficients, use the t-statistic
  • For calculating the effect size:
    • For Pearson's and Spearman's correlations, the coefficients themselves are the effect sizes. However, Kendall's Tau is not directly comparable with them, as it is generally smaller than Spearman's (and Pearson's).
  • It is worth bearing in mind that outliers do impact correlation coefficients. Especially in small samples, outliers can greatly affect correlation strengths.
  • Correlation coefficients can be converted to distance/dissimilarity measure in the following manner:
    • dissimilarity measure = (1 - correlationCoefficient)/2
  • A dissimilarity measure can be converted to a similarity measure by: 1 - dissimilarity Measure
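
A sketch (toy data) of the bivariate coefficients, a significance test, a partial correlation via ggm::pcor() as noted above, and the dissimilarity conversion:

```r
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 0.6 * d$x + rnorm(50)
d$z <- 0.4 * d$x + 0.4 * d$y + rnorm(50)

cor(d$x, d$y, method = "pearson")    # linear association
cor(d$x, d$y, method = "spearman")   # monotonic (rank-based) association
cor(d$x, d$y, method = "kendall")    # rank-based; better for small samples with ties
cor.test(d$x, d$y)                   # NHT and 95% CI for Pearson's r

library(ggm)
pcor(c("x", "y", "z"), var(d))       # r between x and y, controlling for z

# Correlation -> dissimilarity -> similarity, as above
r <- cor(d$x, d$y)
dissim <- (1 - r) / 2
sim <- 1 - dissim
```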

Regression

Simple Linear Regression

  • Assessing the Regression model:
    • The mean is used as a model to calculate the Sum of Squares - this is known as the Total Sum of Squares (SS_T).
    • The regression line is used to calculate the Sum of Squares - this is known as the Residual Sum of Squares (SS_R).
    • To see how much better our model is than the mean (the basic model), subtract: SS_M = SS_T - SS_R. This is the improvement due to the model.
    • Proportion of Improvement due to model: R^2 = (SS_M)/(SS_T)
    • R^2 is the correlation coefficient squared. It provides a measure of how much variance in the data is explained by the model, compared to how much variance there is to explain.
    • R^2 is also the square of the correlation between the observed values of the outcome variable and the values predicted by the model, while R itself is a good estimate of the overall fit of the regression model.
    • F-ratio is also based on the sums of squares.
    • The F-ratio (read more) is a measure of how much the model has improved the prediction, divided by the inaccuracy that still exists in the model (the difference between the model and the observed data): F = (MS_M)/(MS_R)
    • F-ratio tells how much variability is explained by model compared to how much it cannot explain
    • A good model has a large F-ratio
  • Assessing individual predictors:
    • A bad model's coefficients will be 0, indicating that the outcome's values do not change with a unit change in the predictor.
    • t-test tells whether a coefficient’s value is different than 0, relative to variation in coefficient’s values across samples
      • Variation in coefficient’s values found by using the standard error
      • coefficient’s value being 0 implies a unit change in predictor variable results in no change in outcome’s values.
      • t-statistic tests the null hypothesis, that the value of a coefficient is 0.
      • t-statistic = (b_observed - b_expected)/SE_b, where b_expected is 0 under the null hypothesis
    • (Estimation note: several candidate regression lines (models) are assessed, and whichever has the least sum of squared errors becomes our model; here, instead of the mean, the regression line is used to compute the errors. See the R sketch below.)
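
A sketch (toy data) tying the above together: the sums of squares computed by hand, and summary(lm(...)) reporting R^2, the F-ratio, and a t-test for each coefficient (t = b/SE_b).

```r
set.seed(4)
x <- rnorm(100)
y <- 2 + 0.7 * x + rnorm(100)
fit <- lm(y ~ x)

SS_T <- sum((y - mean(y))^2)     # total SS: the mean as the model
SS_R <- sum(residuals(fit)^2)    # residual SS: the regression line as the model
SS_M <- SS_T - SS_R              # improvement due to the model
SS_M / SS_T                      # R^2 by hand

summary(fit)                     # Multiple R-squared, F-statistic, and
                                 # Estimate / Std. Error / t value per coefficient
```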

Multiple Regression

  • The sums of squared deviations (SSD) are calculated in the same way, as are SS_R, SS_T, and SS_M.
  • Assessing the Regression model:
    • R^2 has the same interpretation (how much variation is accounted for by the model), except that it is now based on the multiple correlation R between the observed and predicted outcome values and is reported as the Multiple R^2.
    • However, R^2 will always increase with addition of more variables.
    • The Akaike Information Criterion (AIC) is a measure of fit which penalises models for having more variables, similar in spirit to the adjusted R^2.
      • It is a measure of parsimony adjusted model fit.
      • Another is Bayesian Information Criterion (BIC).
  • Assessing individual predictors — Feature Selection:
    • Add predictors based on past research
    • Add predictors based on theoretical/conceptual/logical significance in order of significance.
    • Order may matter if predictors are correlated.
    • Forced entry: all chosen predictors are entered at the same time, in no particular order
    • Step-wise regression: forward, backward, both, all subsets
    • Step-wise methods assess the fit of each candidate variable given the variables already selected.
    • The all-subsets method assesses every combination of predictors using Mallows' C_p
    • Step-wise methods best for exploratory analysis. For other types of analyses, important to cross-validate.
  • Final assessment of Regression model
    • Check whether the model represents all the data well or is unduly driven by a few influential cases (such cases may not show up in the residuals, since they can pull the regression line towards themselves and so end up with small residuals; Cook's distance, however, can reveal whether a case has an outsized influence on the model)
    • Check if model generalises well (balance between bias and variance)
    • Bias: systematic error in the model, e.g., when it is distorted by particular cases or is too simple to capture the relationship
    • Variance: how much the model's estimates change from sample to sample; a high-variance model fits the current sample too closely to generalise
    • Residuals: unstandardised are in the same unit as the outcome variable — difficult to compare across different models. Hence, standardised residuals.
    • Outliers: May bias our model, as they affect the values of the regression coefficients. These are points that lie far from the general trend.
    • Influential cases: May bias our model.
      • Cook's distance checks the influence of a case on the model as a whole; values > 1 are cause for concern.
      • Leverage measures the influence of an observed value of the outcome variable over the predicted values; the average leverage is (k + 1)/N, where k is the number of predictors.
      • DFBeta: the change in a regression coefficient when a given case is excluded.
      • DFFit: the change in a case's predicted value when that case is excluded.
      • Covariance Ratio: the effect of a case on the standard errors of the regression coefficients.
      • (These diagnostics are illustrated in the R sketch after this list.)
    • To check that the model generalises well: cross-validate.
  • Cross-validation:
    • Refers to assessing model accuracy across different samples from same population. Used to check if model generalises properly.
    • Method #1: Adjusted R^2
    • Method #2: Data Splitting
  • Confidence Intervals: Given that we know the estimate, standard error of the estimate, and the degrees of freedom, we can calculate the confidence intervals.
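
A sketch (toy data) of a multiple regression with the fit measures and case diagnostics listed above; all of the functions below are in base R.

```r
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)

summary(fit)$adj.r.squared   # parsimony-adjusted fit
AIC(fit); BIC(fit)           # information criteria (lower is better)

rstandard(fit)               # standardised residuals
cooks.distance(fit)          # influence on the model as a whole; > 1 is worrying
hatvalues(fit)               # leverage; the average is (k + 1)/N
dfbeta(fit)                  # change in each coefficient if a case is dropped
dffits(fit)                  # change in a case's fitted value if it is dropped
covratio(fit)                # effect of each case on the coefficients' standard errors

confint(fit)                 # confidence intervals for the coefficients
```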

  • Models can be compared using the F-ratio. This calculates the significance of R^2, allowing comparison of R^2 for different models.
    • For such a comparison the models must be nested (hierarchical): model 1 contains predictor x, model 2 predictors x and y, model 3 predictors x, y, and z.
    • Subtract the Multiple R^2 values to see the improvement. Use anova() to compare the models, and interpret the F-statistic (see the sketch after this list).
  • Multicollinearity: if observed, implies that impossible to obtain unique regression coefficients for correlated variables.
    • It creates difficulties in assessing feature importance
    • It increases standard errors of regression coefficients
    • Limits the size of R, the multiple correlation between the predictors and the outcome
    • To detect pairwise multicollinearity: inspect the correlation matrix. Otherwise, use the Variance Inflation Factor (e.g., vif() from the car package).
  • If all the assumptions of regression are met and the analysis is conducted successfully, then on average the regression model from our sample is the same as the population model. The model is thus generalisable to the population.
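
A sketch of nested-model comparison and multicollinearity checks (toy data regenerated here so the block runs on its own; vif() is from the car package):

```r
set.seed(5)
d <- data.frame(x1 = rnorm(100))
d$x2 <- 0.6 * d$x1 + rnorm(100)                    # deliberately correlated predictors
d$y  <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(100)

fit1 <- lm(y ~ x1, data = d)
fit2 <- lm(y ~ x1 + x2, data = d)

summary(fit2)$r.squared - summary(fit1)$r.squared  # improvement in R^2
anova(fit1, fit2)                                  # F-test of that improvement

cor(d$x1, d$x2)     # pairwise check via the correlation matrix
library(car)
vif(fit2)           # variance inflation factors; values near 10 are a common red flag
```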

  • If the assumptions are not met, use robust regression, e.g., bootstrapping the regression; this allows us to relax the distributional assumptions (see the sketch below).
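
A sketch of the bootstrap approach with the boot package (toy data with heavy-tailed errors): resample cases, refit the regression each time, and take percentile CIs for the coefficients instead of relying on normal-theory standard errors.

```r
library(boot)
set.seed(6)
d <- data.frame(x = rnorm(60))
d$y <- 1 + 0.5 * d$x + rt(60, df = 3)          # heavy-tailed errors violate normality

# Statistic: refit the model on a resample of cases and return its coefficients
boot_b <- boot(d, function(data, i) coef(lm(y ~ x, data = data[i, ])), R = 2000)

boot.ci(boot_b, type = "perc", index = 2)      # percentile CI for the slope
```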