Lecture 3

Review • Descriptive Stats • Contingency Tables • Simpson’s Paradox • Empirical/Chebyshev Rules • CoV

R. Muzaffer Musal, my original slides converted by AI and changed by me

Lecture

  • Review of descriptive statistics
  • Using the mean and standard deviation: Simpson’s Paradox, Empirical Rule, Chebyshev’s Rule, Coefficient of Variation (CoV)
  • Contingency Tables: Introduction to Probability.
  • Epilogue

Descriptive Statistics

  • Mean
  • Median
  • Mode — the most common observed value
  • Variance, Standard Deviation
  • Range = Max − Min
  • Interquartile Range (IQR) = Q3 − Q1
  • Five-number summary: (min, Q1, median, Q3, max)
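
As a minimal sketch, the summaries listed above can be computed with Python's standard `statistics` module. The `grades` list is made-up illustration data, and the quartiles use the module's `inclusive` method; other tools may use a different quartile convention and give slightly different values.

```python
import statistics

# Hypothetical grades, invented for illustration only.
grades = [50, 60, 70, 70, 80, 90]

mean = statistics.mean(grades)
median = statistics.median(grades)
mode = statistics.mode(grades)            # most common observed value
variance = statistics.variance(grades)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(grades)        # sample standard deviation
data_range = max(grades) - min(grades)    # Range = Max - Min

# Quartile conventions differ between tools; "inclusive" is one common choice.
q1, q2, q3 = statistics.quantiles(grades, n=4, method="inclusive")
iqr = q3 - q1                             # IQR = Q3 - Q1
five_number_summary = (min(grades), q1, median, q3, max(grades))

print(mean, median, mode, variance, round(std_dev, 2), data_range, iqr)
print(five_number_summary)
```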

Some Useful Excel Functions

  • =AVERAGE(range) — mean
  • =STDEV.S(range) — sample standard deviation
  • =VAR.S(range) — sample variance
  • =PERCENTILE.EXC(range, k) — percentile (exclusive); =PERCENTILE.INC(range, k) is the inclusive version
  • Counting by intervals: FREQUENCY
  • =COUNT(range), =SUM(range)

Descriptive Statistics & More

  • Mean, Variance, Standard Deviation, Range
  • Percentiles: Quartiles & IQR
  • Kurtosis, Mode
  • Define probability as quantification of uncertainty.
    • Chebyshev’s Theorem
    • Empirical Rule

Chebyshev’s Theorem and Empirical Rule

  • Both rules give a lower and an upper bound for the values of a variable, based on its mean and standard deviation.

  • In both, the distance from the mean to each bound is measured in numbers of standard deviations.

Chebyshev’s Theorem and Empirical Rule

  • The Empirical Rule applies only to symmetric distributions. (Strictly, it applies to the normal distribution; not all symmetric distributions are normal, but we will not make that distinction right now.) It can make statements in precise terms.

  • Chebyshev’s theorem can be applied to any distribution. It can only make statements in imprecise terms (at least p\(\%\) of the data).

Symmetric Distribution

Non-Symmetric Distribution Example

Chebyshev’s Theorem

  • For any distribution with mean \(\mu\) and standard deviation \(\sigma\), the proportion of observations within \(k\sigma\) of the mean is at least:

\[ 1 - \frac{1}{k^2},\quad k > 1. \]

  • Note: the statement within \(k\sigma\) of the mean above can be translated as \(\mu \pm k \times \sigma\)
  • k is a constant. It can be in decimals as long as it is greater than 1.
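
The bound can be sketched as a small Python function. The function names here are illustrative, not from the slides; the code only evaluates \(1 - 1/k^2\) and the interval \(\mu \pm k\sigma\).

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only stated for k > 1")
    return 1 - 1 / k**2


def interval_around_mean(mu: float, sigma: float, k: float) -> tuple[float, float]:
    """Return the pair (mu - k*sigma, mu + k*sigma)."""
    return (mu - k * sigma, mu + k * sigma)


# k does not have to be a whole number, only greater than 1.
for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 3))
```

Note how slowly the guarantee grows: the price of making no assumption about the shape of the distribution is a conservative bound.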

Example: Question

  • The mean \((\mu)\) grade in a classroom is 70. The standard deviation \((\sigma)\) is 10. Find an upper and lower bound such that at least \(75\%\) of the grades are in that interval.

  • At least what percentage of the grades are between 40 and 100?

  • This problem calls for Chebyshev’s theorem because we are NOT told that the students’ grades are symmetrically distributed.

Example: Solution

  • We know that at least \(75\%\) of the grades are between the upper and lower bounds we want to find.

  • Using this information we can find how many standard deviations there are between the upper and lower bound.

  • We know \[0.75= 1-\frac{1}{k^2}\]

Example: Solution-Algebra

  • Subtract 0.75 from both sides of the equation and move \(-\frac{1}{k^2}\) to the left side.

\[\frac{1}{k^{2}}=0.25\]

  • Multiply both sides by \(k^{2}\) (known values on one side, unknowns on the other side)

\[1=0.25 k^{2}\]

Example: Solution-Algebra

  • Divide both sides of the equation by 0.25.

\[4=k^{2}\]

  • Therefore \(k=2\) (we take the positive root because \(k > 1\)).

Example: Finalize

  • Now that we have found \(k = 2\), we read the bounds off Chebyshev’s theorem: at least \(75\%\) of the data lies within \(k = 2\) standard deviations (\(\sigma = 10\)) of the mean (\(\mu = 70\)).

\[70 \pm 2 \times 10 \]
\[\text{Upper bound}=70+2 \times 10 =90\]
\[\text{Lower bound}=70- 2 \times 10 =50\]
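
The algebra above amounts to inverting the bound: from \(p = 1 - 1/k^2\) we get \(k = 1/\sqrt{1-p}\). A short Python sketch of that inversion, using the example's mean and standard deviation (the function name is illustrative):

```python
import math


def k_for_proportion(p: float) -> float:
    """Solve p = 1 - 1/k**2 for k, taking the positive root."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return 1 / math.sqrt(1 - p)


mu, sigma = 70, 10            # mean and standard deviation from the example
k = k_for_proportion(0.75)    # 1 / sqrt(0.25)
lower_bound = mu - k * sigma
upper_bound = mu + k * sigma
print(k, lower_bound, upper_bound)
```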

Example: Question

  • Using the same mean and standard deviation that is used above, at least what percentage of the data is between 40 and 100?

  • This setup tells us that the upper bound and the lower bound are equidistant from the mean: \(100-70=30\), \(40-70=-30\).

  • To answer the question of at least what percentage of the data is between \(40\) and \(100\), we need to know how many standard deviations these values are away from the mean.

  • In other words, we need to find \(k\). How many standard deviations are there between 100 and 70? Or, equivalently, how many standard deviations are there between 40 and 70?

NOTE

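  • In Fall 2025 I said that to use Chebyshev’s theorem the upper and lower bounds need to be equidistant from the mean. While this is true for the questions we will solve in this class, the theorem can also be used with bounds at uneven distances, as long as the interval contains the mean: use the smaller of the two distances to find k. For instance, the interval from 50 to 100 contains the interval from 50 to 90 (\(k = 2\)), so at least 75\(\%\) of the data is between 50 and 100. What you can NOT do is use the theorem to claim a minimum percentage for a slice that does not contain the mean, for example that at least 5\(\%\) of the data is between 40 and 50.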

Example: Solution

  • There are 3 \(\sigma\)s between 100 and 70.

\[k=\frac{100-70}{10}=3 \]

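\[k=\frac{\text{Upper bound}-\mu}{\sigma} \]

  • The denominator standardizes the distance between 100 and 70 in terms of standard deviations. \(k\) is the number of standard deviations between 100 and 70.

  • With \(k = 3\), Chebyshev’s theorem says that at least \(1 - \frac{1}{3^{2}} = \frac{8}{9} \approx 88.9\%\) of the grades are between 40 and 100.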
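
A hedged Python sketch of this second question: find k from the bounds, then evaluate the Chebyshev bound. The function name is illustrative. Taking the distance to the nearer bound is an assumption that goes beyond the equidistant case used in class; for equidistant bounds it reduces to the formula above.

```python
def k_from_bounds(lower: float, upper: float, mu: float, sigma: float) -> float:
    """Number of standard deviations from the mean to the nearer bound."""
    if not lower < mu < upper:
        raise ValueError("the interval must contain the mean")
    return min(mu - lower, upper - mu) / sigma


k = k_from_bounds(40, 100, mu=70, sigma=10)   # 30 / 10
at_least = 1 - 1 / k**2                       # 1 - 1/9
print(k, round(at_least, 3))
```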

Empirical Rule (68–95–99.7)

For approximately symmetric (normal) distributions:

  • About 68% within \(\mu \pm 1\sigma\)
  • About 95% within \(\mu \pm 2\sigma\)
  • About 99.7% within \(\mu \pm 3\sigma\)
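
The three percentages can be checked against the normal distribution with `statistics.NormalDist` from Python's standard library. This is only a sketch; `share_within` is an illustrative helper name.

```python
from statistics import NormalDist


def share_within(k: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Proportion of a normal distribution lying within mu +/- k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The result does not depend on the particular mean and standard deviation, only on k, which is why the rule can be stated once for all normal distributions.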

Example:

  • Make the unrealistic assumption that the grades have a symmetric distribution. \(\mu=70\), \(\sigma=10\)
  • What percentage of the grades are between 40 and 100?
  • We already found that there were 3 standard deviations between 100 and 70. Of course the same applies to the distance between 40 and 70.
  • The \(-3\) you would obtain if you solved for k using the distance between 40 and 70 simply means that 40 is 3 standard deviations below the mean.

Example: Solution

  • This is kind of boring now. Since 40 and 100 are each 3 standard deviations from the mean, we know that about 99.7\(\%\) of the data is contained within these bounds.

  • If the distribution is symmetric, it is very easy to think in terms of non equidistant values from the mean. However we will leave this to the lecture where we properly introduce the normal distribution and reintroduce k as z.

Coefficient of Variation (CoV)

CoV indicates variability relative to the mean.

\[ \mathrm{CoV} = \frac{s}{\bar{x}}\times 100\%. \]

  • Larger CoV → greater spread relative to the mean.
  • Example: a drill with mean diameter 5 cm and SD 1 cm has much larger relative variation than a drill with mean 5 m and SD 1 cm.
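
The drill comparison can be worked out in a few lines of Python. Both measurements are converted to centimetres first, so that each standard deviation and its mean are in the same unit; the function name is illustrative.

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """CoV as a percentage: standard deviation relative to the mean."""
    if mean == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return sd / mean * 100


small_drill = coefficient_of_variation(sd=1, mean=5)     # mean 5 cm, SD 1 cm
large_drill = coefficient_of_variation(sd=1, mean=500)   # mean 5 m = 500 cm, SD 1 cm
print(round(small_drill, 2), round(large_drill, 2))
```

The same 1 cm of spread is a fifth of the small drill's mean but only a five-hundredth of the large drill's mean.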

Contingency Tables

  • Tables of counts, also referred to as contingency tables, are useful for gathering information about a population from a sample.

  • These frequency counts can be converted to percentages, which can be interpreted as probabilities.

  • We will discuss this term more technically later in the course.

  • Right now let us define probability as quantification of uncertainty.

Contingency Tables: Marginal, Joint, Conditional

  • Marginal: Relative frequency of a single variable’s value.
  • Joint: Relative frequency of a combination of values across variables.
  • Conditional: Relative frequency of a value of one variable given a value of another.

Example Contingency Table (Goals by Person & Week)

         Week 1: Alexis  Week 1: Peyton  Week 1: Total  Week 2: Alexis  Week 2: Peyton  Week 2: Total
Yes                 162              19            181              10              99            109
No                   18               1             19              10              81             91
Total               180              20            200              20             180            200

  • Illustrates marginal, joint, and conditional calculations and potential for Simpson’s Paradox.

Examples (from the 2×2 table) Week 1

  • Marginal of “Yes”: \(0.905 = 181/200\)
  • Joint of “Yes” and “Alexis”: \(0.81 = 162/200\)
  • Conditional of “Yes” given “Alexis”: \(0.90 = 162/180\)
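
A minimal Python sketch of the three Week 1 calculations, with the counts stored in a dictionary keyed by (outcome, person). The variable names are illustrative; the counts are the Week 1 cells of the table.

```python
# Week 1 counts from the table, keyed by (outcome, person).
week1 = {
    ("Yes", "Alexis"): 162, ("Yes", "Peyton"): 19,
    ("No", "Alexis"): 18,   ("No", "Peyton"): 1,
}
total = sum(week1.values())

# Marginal: one variable only (outcome "Yes"), divided by the grand total.
yes_total = sum(n for (outcome, _), n in week1.items() if outcome == "Yes")
marginal_yes = yes_total / total

# Joint: one specific combination, divided by the grand total.
joint_yes_alexis = week1[("Yes", "Alexis")] / total

# Conditional: restrict to Alexis first, then divide by Alexis's total.
alexis_total = sum(n for (_, person), n in week1.items() if person == "Alexis")
conditional_yes_given_alexis = week1[("Yes", "Alexis")] / alexis_total

print(marginal_yes, joint_yes_alexis, conditional_yes_given_alexis)
```

The only difference between the joint and the conditional value is the denominator: the grand total (200) versus the total for the given condition (180).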

Examples: Questions

  • Calculate the probability of Alexis completing a task satisfactorily in Week 1.
  • Calculate the probability of Alexis completing a task satisfactorily in Week 2.
  • Calculate the probability of Peyton completing a task satisfactorily in Week 1.
  • Calculate the probability of Peyton completing a task satisfactorily in Week 2.

Examples: Answers

  • \(P(\text{Yes} \mid \text{Week 1}, \text{Alexis})\) Look down the Alexis column for Week 1. \[\frac{162}{180}=0.90 \]

  • \(P(\text{Yes} \mid \text{Week 1}, \text{Peyton})\) Look down the Peyton column for Week 1. \[\frac{19}{20} =0.95 \]

Examples: Answers

  • \(P(\text{Yes} \mid \text{Week 2}, \text{Alexis})\) Look down the Alexis column for Week 2. \[\frac{10}{20}=0.50 \]

  • \(P(\text{Yes} \mid \text{Week 2}, \text{Peyton})\) Look down the Peyton column for Week 2. \[\frac{99}{180} =0.55 \]

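Examples

  • Calculate the probability of Alexis completing a task across both weeks. \[\frac{162+10}{180+20}=\frac{172}{200}=0.86 \]
  • Calculate the probability of Peyton completing a task across both weeks. \[\frac{19+99}{20+180}=\frac{118}{200}=0.59 \]

Simpson’s Paradox: Aggregation can hide or reverse relationships seen in subgroups. Here Peyton has the higher success rate in each week (0.95 vs. 0.90 in Week 1, 0.55 vs. 0.50 in Week 2), yet Alexis has the higher rate overall (0.86 vs. 0.59).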
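
The reversal can be checked directly from the table's counts. The sketch below (variable names are illustrative) computes each person's success rate per week and pooled over both weeks.

```python
# (successes, tasks) per person and week, from the "Yes" and "Total" rows.
counts = {
    "Alexis": {"Week 1": (162, 180), "Week 2": (10, 20)},
    "Peyton": {"Week 1": (19, 20), "Week 2": (99, 180)},
}

# Success rate within each week.
weekly = {
    person: {week: yes / n for week, (yes, n) in weeks.items()}
    for person, weeks in counts.items()
}

# Success rate after pooling both weeks together.
pooled = {
    person: sum(yes for yes, _ in weeks.values()) / sum(n for _, n in weeks.values())
    for person, weeks in counts.items()
}

peyton_wins_each_week = all(
    weekly["Peyton"][week] > weekly["Alexis"][week] for week in ("Week 1", "Week 2")
)
alexis_wins_overall = pooled["Alexis"] > pooled["Peyton"]
print(peyton_wins_each_week, alexis_wins_overall)
```

Both flags come out true. The reversal happens because the two people did most of their tasks in different weeks: Alexis did 180 of 200 tasks in Week 1, when success rates were high for both, while Peyton did 180 of 200 in Week 2, when rates were low for both.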

Empirical & Chebyshev at a Glance

  • Empirical (Normal): ~68% in \(\pm1\sigma\), ~95% in \(\pm2\sigma\), ~99.7% in \(\pm3\sigma\).
  • Chebyshev (Any dist.): at least \(1 - 1/k^2\) within \(\pm k\sigma\) \((k>1)\).
  • CoV: \(\mathrm{CoV}=s/\bar{x}\times 100\%\) (unitless).