About the Author

I am a data scientist and a statistical and data literacy trainer. I did my honours at the University of Zimbabwe and am certified as a Professional Data Scientist by DataCamp. I currently work as a data consultant. I have a deep understanding of, and can assist with, the following:

  • Inferential statistics
  • Survival analysis
  • Generalised linear, additive and mixed models
  • Supervised and unsupervised learning
  • Longitudinal data analysis
  • Time series
  • Panel data analysis
  • Mixed effects models
  • Biostatistics
  • Causal inference

Introduction

This is a data science series where I will be sharing how to use statistics in data science.

What to expect

  • Practical examples in inferential statistics
  • t-tests
  • ANOVA
  • Normality tests
  • Heteroscedasticity
  • Tests for association, etc.

What not to expect

  • Statistical modeling
  • Introduction to statistics
  • Introduction to R

Statistics Foundations

To properly understand data science, ML models and multivariate models, an analyst or data scientist needs a decent grasp of foundational statistics. Many of the assumptions, formulations and results of these models require an understanding of these foundations in order to be properly interpreted. There are three concepts one needs to understand on the way to becoming an effective data scientist or analyst:

  • Descriptive statistics of populations and samples
  • Distribution of random variables
  • Hypothesis testing

Elementary descriptive statistics of populations and samples

Any collection of numerical data on one or more variables can be described using a number of common statistical concepts. Let \(x = x_1, x_2, \dots, x_n\) be a sample of \(n\) observations of a variable drawn from a population.

| Category | Measure | Population | Sample |
|---|---|---|---|
| What is it? | | Reality | A small fraction of reality (inference) |
| Characteristics described by | | Parameters | Statistics |
| Central tendency | Mean | \(\mu = E(Y)\) | \(\hat{\mu} = \overline{y}\) |
| Central tendency | Median | 50th percentile | \(y_{(\frac{n+1}{2})}\) |
| Dispersion | Variance | \(\sigma^2 = \mathrm{var}(Y) = E(Y-\mu)^2\) | \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\overline{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\overline{y}^2\right)\) |
| Dispersion | Coefficient of variation | \(\frac{\sigma}{\mu}\) | \(\frac{s}{\overline{y}}\) |
| Dispersion | Interquartile range | Difference between the 75th and 25th percentiles; robust to outliers | Difference between the sample 75th and 25th percentiles |
| Shape | Skewness (standardised 3rd central moment, unitless) | \(g_1=\frac{\mu_3}{\mu_2^{3/2}}\) | \(\hat{g}_1=\frac{m_3}{m_2\sqrt{m_2}}\) |
| Shape | Central moments | \(\mu=E(Y)\), \(\mu_2 = \sigma^2 = E(Y-\mu)^2\), \(\mu_3 = E(Y-\mu)^3\), \(\mu_4 = E(Y-\mu)^4\) | \(m_2=\sum_{i=1}^{n}(y_i-\overline{y})^2/n\), \(m_3=\sum_{i=1}^{n}(y_i-\overline{y})^3/n\), \(m_4=\sum_{i=1}^{n}(y_i-\overline{y})^4/n\) |
| Shape | Kurtosis (peakedness and tail thickness; standardised 4th central moment) | \(g_2^*=\frac{E(Y-\mu)^4}{\sigma^4}\) | \(\hat{g}_2=\frac{m_4}{m_2^2}-3\) (excess kurtosis) |
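
To see the sample formulas in action, here is a minimal sketch in base R that computes the central moments, skewness and excess kurtosis by hand (the vector y is an arbitrary illustrative example, not data used elsewhere in this series):

y  <- c(2, 4, 4, 4, 5, 5, 7, 9)        # illustrative data
n  <- length(y)
m2 <- sum((y - mean(y))^2) / n          # 2nd central moment
m3 <- sum((y - mean(y))^3) / n          # 3rd central moment
m4 <- sum((y - mean(y))^4) / n          # 4th central moment
m3 / (m2 * sqrt(m2))                    # sample skewness g1
m4 / m2^2 - 3                           # sample excess kurtosis g2

Packages such as moments also provide skewness() and kurtosis() helpers (typically reporting raw rather than excess kurtosis).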

Mean, variance and standard deviation

  • The mean is the average value of the observations and is defined by adding up all the values and dividing by the number of observations. The mean \(\bar{x}\) of our sample \(x\) is defined as:

\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i \] While the mean of a sample \(x\) is denoted by \(\bar{x}\), the mean of an entire population is usually denoted by \(\mu\). The mean can have a different interpretation depending on the type of data being studied.

Let's create a small example dataset of study hours:

# Defining a vector of study hours
study_hours <- c(10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0,
                 8.5, 14.5, 15.5, 13.75, 9.0, 8.0, 15.5, 8.0,
                 9.0, 6.0, 10.0, 12.0, 12.5, 12.0)
mean(study_hours, na.rm = TRUE)
## [1] 10.52273

This is intuitive and gives the average number of hours studied by the individuals in the sample.

Other common statistical summary measures include the median, which is the middle value when the values are ranked in order, and the mode, which is the most frequently occurring value.

R does not have a built-in function for the mode, however, so we use the statip package.

library(statip)

## median
(med_val <- median(study_hours))
## [1] 10
## mode
(mod_val <- mfv(study_hours))
## [1] 9
  • The variance is a measure of how much the data varies around its mean. There are two different definitions of variance.
  • The population variance assumes that we are working with the entire population and is defined as the average squared difference from the mean:

\[ \mathrm{Var}_p(x) = \frac{1}{n}\sum_{i = 1}^{n}(x_i - \bar{x})^2 \]

  • The sample variance assumes that we are working with a sample and attempts to estimate the variance of a larger population by applying Bessel’s correction to account for potential sampling error. The sample variance is:

\[ \mathrm{Var}_s(x) = \frac{1}{n-1}\sum_{i = 1}^{n}(x_i - \bar{x})^2 \]

You can see that

\[ \mathrm{Var}_p(x) = \frac{n - 1}{n}\mathrm{Var}_s(x) \] So as the data set gets larger, the sample variance and the population variance become less and less distinguishable, which intuitively makes sense.

Because we rarely work with full populations, the sample variance is calculated by default in R and in many other statistical software packages.

# sample variance 
(sample_variance_study_hours <- var(study_hours, na.rm = TRUE))
## [1] 12.16017

So where necessary, we need to apply a transformation to get the population variance.

# population variance (need length of non-NA data)
n <- length(na.omit(study_hours))
(population_variance_study_hours <- ((n-1)/n) * sample_variance_study_hours)
## [1] 11.60744

Variance is not on an intuitive scale relative to the data being studied, because it uses a ‘squared distance’ metric. We can therefore take its square root to get a measure of deviation on the same scale as the data.

  • We call this the standard deviation \(\sigma(x)\), where \(\mathrm{Var}(x) = \sigma(x)^2\). As with variance, standard deviation has both population and sample versions, and the sample version is calculated by default. Conversion between the two takes the form

\[ \sigma_p(x) = \sqrt{\frac{n-1}{n}}\sigma_s(x) \]

# sample standard deviation
(sample_sd_study_hours <- sd(study_hours, na.rm = TRUE))
## [1] 3.487144
# verify that sample sd is sqrt(sample var)
sample_sd_study_hours == sqrt(sample_variance_study_hours)
## [1] TRUE
# calculate population standard deviation
(population_study_hours <- sqrt((n-1)/n) * sample_sd_study_hours)
## [1] 3.406969

Given that the range of study_hours is [1, 16] and the mean is about 10.5, we see that the standard deviation gives a more intuitive sense of the ‘spread’ of the data relative to its inherent scale.

Hypothesis testing

Hypothesis testing allows us to:

  • Make inferences (an interpretation) about the true parameter value based on our estimator/estimate.
  • Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.

However, testing does not:

  • Confirm with 100% certainty that a hypothesis is true
  • Confirm with 100% certainty that a hypothesis is false
  • Tell you how to interpret the estimated value (economic vs. practical vs. statistical significance)

Hypothesis: translates an objective (better understanding the results) into a statement specifying a value (or set of values) in which our population parameter should or should not lie. A worked R example follows the list below.

  • Null hypothesis (\(H_0\)): A statement about the population parameter that we take to be true unless the data provide substantial evidence against it.
    • Can be either a single value (ex: \(H_0: \beta=0\)) or a set of values (ex: \(H_0: \beta_1 \ge 0\))
    • Will generally be the value you would not like the population parameter to be (subjective)
      • \(H_0: \beta_1=0\) means you would like to see a non-zero coefficient
      • \(H_0: \beta_1 \ge 0\) means you would like to see a negative effect
    • “Test of Significance” refers to the two-sided test: \(H_0: \beta_j=0\)
  • Alternative hypothesis (\(H_a\) or \(H_1\)) (Research Hypothesis): All other possible values that the population parameter may be if the null hypothesis does not hold.
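
As a worked example, here is a one-sample t-test on the study_hours data defined earlier; the null value of 10 hours is an assumption chosen purely for illustration:

# One-sample t-test of H0: mu = 10 against Ha: mu != 10
# (the null value of 10 is purely illustrative)
t.test(study_hours, mu = 10, alternative = "two.sided")

A large p-value here would mean the data do not provide enough evidence to reject \(H_0\); it would not prove that \(H_0\) is true.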

Type I Error

Error made when \(H_0\) is rejected when, in fact, \(H_0\) is true.

The probability of committing a Type I error is \(\alpha\), known as the level of significance of the test: the probability of rejecting \(H_0\) when it is true.

Type II Error

  • Error made when \(H_0\) is not rejected when, in fact, \(H_1\) is true.

  • The probability of committing a Type II error is \(\beta\): the probability of failing to reject the null hypothesis when it is false. The power of the test is \(1-\beta\), the probability of correctly rejecting a false \(H_0\) (see the simulation sketch below).
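
A small simulation sketch (the normal population, sample size of 30 and effect size of 0.5 are arbitrary assumptions) shows both error rates in practice:

set.seed(123)
# When H0: mu = 0 is TRUE, a test at alpha = 0.05 rejects about 5% of the time (Type I error)
p_null <- replicate(5000, t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)
mean(p_null < 0.05)    # empirical Type I error rate, close to alpha = 0.05

# When H0 is FALSE (the true mean is 0.5), the rejection rate estimates the power (1 - beta)
p_alt <- replicate(5000, t.test(rnorm(30, mean = 0.5, sd = 1), mu = 0)$p.value)
mean(p_alt < 0.05)     # empirical power; 1 minus this is the Type II error rate beta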

Estimators

Point Estimator

\(\hat{\theta}\) is a statistic used to approximate a population parameter \(\theta\)

Point estimate

The numerical value assumed by \(\hat{\theta}\) when evaluated for a given sample

Unbiased estimator

If \(E(\hat{\theta}) = \theta\), then \(\hat{\theta}\) is an unbiased estimator for \(\theta\)

  1. \(\bar{X}\) is an unbiased estimator for \(\mu\)
  2. \(S^2\) is an unbiased estimator for \(\sigma^2\)
  3. \(\hat{p}\) is an unbiased estimator for \(p\)
  4. \(\widehat{p_1-p_2}\) is an unbiased estimator for \(p_1 - p_2\)
  5. \(\bar{X_1} - \bar{X_2}\) is an unbiased estimator for \(\mu_1 - \mu_2\)

Note: \(S\) is a biased estimator for \(\sigma\)
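
A quick simulation sketch (the normal population, \(\sigma = 2\) and sample size of 10 are arbitrary assumptions) makes point 2 and the note above concrete:

set.seed(42)
sigma   <- 2                                     # true population standard deviation
samples <- replicate(10000, rnorm(10, mean = 0, sd = sigma))
vars <- apply(samples, 2, var)                   # sample variances (n - 1 denominator)
sds  <- apply(samples, 2, sd)                    # sample standard deviations
mean(vars)   # close to sigma^2 = 4, so S^2 is (approximately) unbiased
mean(sds)    # noticeably below sigma = 2, so S is biased downwards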

Confidence intervals (C.I.)

When the population is (approximately) normal and \(\sigma^2\) is estimated by \(s^2\), then

\[ \frac{\bar{y}-\mu}{s/\sqrt{n}} \sim t_{n-1} \]

Then, a \(100(1-\alpha) \%\) confidence interval for \(\mu\) is obtained from:

\[ 1 - \alpha = P(-t_{\alpha/2;n-1} \le \frac{\bar{y}-\mu}{s/\sqrt{n}} \le t_{\alpha/2;n-1}) \\ = P(\bar{y} - (t_{\alpha/2;n-1})s/\sqrt{n} \le \mu \le \bar{y} + (t_{\alpha/2;n-1})s/\sqrt{n}) \]

And the interval is

\[ \bar{y} \pm (t_{\alpha/2;n-1})s/\sqrt{n} \]

and \(s/\sqrt{n}\) is the standard error of \(\bar{y}\)

If the experiment were repeated many times, \(100(1-\alpha) \%\) of these intervals would contain \(\mu\)
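
A minimal sketch computing this interval by hand for the study_hours data and checking it against R's built-in t.test (the 95% confidence level is an assumption chosen for illustration):

alpha  <- 0.05
n      <- length(study_hours)
se     <- sd(study_hours) / sqrt(n)               # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)           # critical t value
mean(study_hours) + c(-1, 1) * t_crit * se        # 95% CI computed by hand
t.test(study_hours, conf.level = 0.95)$conf.int   # same interval from t.test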

For a Single Mean (\(\mu\))

| | \(100(1-\alpha)\%\) Confidence Interval | Sample Size (confidence level \(\alpha\), error margin \(d\)) | Hypothesis Test Statistic |
|---|---|---|---|
| When \(\sigma^2\) is known, \(X\) is normal (or \(n \ge 25\)) | \(\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\) | \(n \approx \frac{z_{\alpha/2}^2 \sigma^2}{d^2}\) | \(z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\) |
| When \(\sigma^2\) is unknown, \(X\) is normal (or \(n \ge 25\)) | \(\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\) | \(n \approx \frac{z_{\alpha/2}^2 s^2}{d^2}\) | \(t = \frac{\bar{X}-\mu_0}{s/\sqrt{n}}\) |
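
A minimal sketch of the sample-size formula from the table above; the standard deviation and margin of error below are illustrative values, not estimates from the study_hours data:

# Required n for a 95% CI with margin of error d, assuming sigma is known
sigma <- 3.5   # assumed population standard deviation (illustrative)
d     <- 1     # desired margin of error, in hours (illustrative)
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)
ceiling(z^2 * sigma^2 / d^2)   # round up to the next whole observation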

For Difference of Means (\(\mu_1-\mu_2\)), Independent Samples

| | \(100(1-\alpha)\%\) Confidence Interval | Hypothesis Test Statistic | Notes |
|---|---|---|---|
| When \(\sigma^2\) is known | \(\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\) | \(z= \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\) | |
| When \(\sigma^2\) is unknown, variances assumed EQUAL | \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s^2_p\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\) | \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}\) | Pooled variance: \(s_p^2 = \frac{(n_1 -1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 -2}\); degrees of freedom: \(\gamma = n_1 + n_2 -2\) |
| When \(\sigma^2\) is unknown, variances assumed UNEQUAL | \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\) | \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\) | Degrees of freedom: \(\gamma = \frac{\left(\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1}+\frac{\left(s_2^2/n_2\right)^2}{n_2-1}}\) |
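
To close, here is a sketch of the two-sample case in R; the second vector of study hours is invented purely for illustration:

# Hypothetical second group of study hours (made up for illustration)
study_hours_b <- c(7.5, 9.0, 8.0, 11.0, 6.5, 10.0, 9.5, 8.5,
                   12.0, 7.0, 10.5, 9.0, 8.0, 11.5, 9.5)

# Welch test (variances assumed UNEQUAL) -- R's default
t.test(study_hours, study_hours_b)

# Pooled test (variances assumed EQUAL)
t.test(study_hours, study_hours_b, var.equal = TRUE)

The var.equal argument switches between the Welch (Satterthwaite degrees of freedom) and pooled-variance versions shown in the table above.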