Review • Descriptive Stats • Contingency Tables • Simpson’s Paradox • Empirical/Chebychev Rules • CoV
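=AVERAGE(range): mean
=STDEV.S(range): sample standard deviation
=VAR.S(range): sample variance
=PERCENTILE.EXC(range, k): k-th percentile, exclusive method (k strictly between 0 and 1); =PERCENTILE.INC(range, k) is the inclusive version
=COUNT(range), =SUM(range): count of numeric cells and their sum

The Empirical Rule and Chebychev’s theorem are used to create a lower and upper bound for a random variable based on its mean and standard deviation.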
In both, the distance from the mean to the lower and upper bound is measured in number of standard deviations.
The Empirical Rule applies only to symmetric, bell-shaped distributions (really the normal distribution; not all symmetric distributions are normal, but we will not make that distinction right now). It yields precise statements (approximately \(p\%\) of the data).
Chebychev’s theorem can be applied to any distribution. It yields weaker statements (at least \(p\%\) of the data): the proportion of the data within \(k\) standard deviations of the mean is at least
\[ 1 - \frac{1}{k^2},\quad k > 1. \]
The mean \((\mu)\) grade in a classroom is 70. The standard deviation \((\sigma)\) is 10. Find an upper and lower bound such that at least \(75\%\) of the grades are in that interval.
At least what percentage of the grades are between 40 and 100?
This problem calls for Chebychev’s theorem because we are NOT told that the grades of the students are symmetrically distributed.
We know that at least \(75\%\) of the grades are between the upper and lower bounds we want to find.
Using this information we can find \(k\), the number of standard deviations between the mean and each bound.
We know \[0.75= 1-\frac{1}{k^2}\]
\[\frac{1}{k^{2}}=0.25\]
\[1=0.25 k^{2}\]
\[4=k^{2}\]
\[k=2\]
\[70 \pm 2 \times 10 \] \[\text{Upper bound}=70+2 \times 10 =90\] \[\text{Lower bound}=70- 2 \times 10 =50\]
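The algebra above can be checked with a short Python sketch. The function name `chebyshev_interval` is hypothetical (it does not come from the course material); it simply solves \(p = 1 - 1/k^2\) for \(k\) and builds the interval around the mean.

```python
import math

def chebyshev_interval(mean, sd, proportion):
    """Return (lower, upper) containing at least `proportion` of the data.

    Hypothetical helper: solves proportion = 1 - 1/k^2 for k,
    then returns mean -/+ k * sd.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    k = math.sqrt(1 / (1 - proportion))  # number of standard deviations
    return mean - k * sd, mean + k * sd

# Grades example: mean 70, standard deviation 10, at least 75% coverage.
lower, upper = chebyshev_interval(70, 10, 0.75)
print(lower, upper)  # 50.0 90.0
```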
Using the same mean and standard deviation that is used above, at least what percentage of the data is between 40 and 100?
This setup gives us an upper bound and a lower bound that are equidistant from the mean: \(100-70=30\), \(40-70=-30\).
To find at least what percentage of the data is between \(40\) and \(100\), we need to know how many standard deviations these values are away from the mean.
In other words, we need to find \(k\). How many standard deviations are there between 100 and 70? Or equivalently, how many standard deviations are there between 40 and 70?
\[k=\frac{\text{Upper bound}-\mu}{\sigma} \]
\[k=\frac{100-70}{10}=3 \]
Chebychev’s theorem then says that at least
\[1-\frac{1}{3^{2}}=\frac{8}{9}\approx 88.9\% \]
of the grades are between 40 and 100.
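The reverse calculation (from bounds to a guaranteed percentage) can be sketched as follows. The function name `chebyshev_coverage` is hypothetical. Taking the smaller of the two distances when the bounds are not equidistant is an assumption added here so that the bound stays valid; the example in the notes is symmetric, so it does not matter there.

```python
def chebyshev_coverage(mean, sd, lower, upper):
    """Chebychev lower bound on the share of data inside [lower, upper].

    Hypothetical helper. k is the distance from the mean to the nearer
    bound, measured in standard deviations.
    """
    k = min(upper - mean, mean - lower) / sd
    if k <= 1:
        return 0.0  # the theorem says nothing useful for k <= 1
    return 1 - 1 / k**2

# Grades example: mean 70, standard deviation 10, bounds 40 and 100 (k = 3).
coverage = chebyshev_coverage(70, 10, 40, 100)
print(f"at least {coverage:.1%}")  # at least 88.9%
```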
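For approximately symmetric (normal) distributions, the Empirical Rule says that approximately \(68\%\), \(95\%\), and \(99.7\%\) of the data lie within 1, 2, and 3 standard deviations of the mean, respectively.
This is kind of boring now. Since 100 and 40 are 3 standard deviations from the mean, we know that approximately \(99.7\%\) of the data is contained within these bounds.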
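As a quick check of the Empirical Rule percentages, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) to compute the share of a normal distribution within \(k\) standard deviations of the mean. The variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(k, round(p, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

For \(k=3\) this is much tighter than the Chebychev guarantee of about \(88.9\%\), which is the price paid for assuming nothing about the shape of the distribution.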
If the distribution is symmetric, it is also easy to work with values that are not equidistant from the mean. However, we will leave this to the lecture where we properly introduce the normal distribution and reintroduce \(k\) as \(z\).
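The coefficient of variation (CoV) indicates variability relative to the mean: the sample standard deviation expressed as a percentage of the sample mean.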
\[ \mathrm{CoV} = \frac{s}{\bar{x}}\times 100\%. \]
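A minimal Python sketch of the CoV formula, using the sample standard deviation (the same quantity as =STDEV.S). The function name and the three grades are made up for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(values) / mean(values) * 100

grades = [60, 70, 80]  # hypothetical data: mean 70, sample sd 10
cov = coefficient_of_variation(grades)
print(round(cov, 1))  # 14.3
```

Because the CoV is unit-free, it can be used to compare the variability of variables measured on different scales.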
Tables of counts, also referred to as contingency tables, are useful for gathering information about a population from a sample.
These frequency counts can be converted to percentages, which can be interpreted as probabilities.
We will discuss this term more technically later in the course.
Right now let us define probability as quantification of uncertainty.
| | W 1: Alexis | W 1: Peyton | W 1: Total | W 2: Alexis | W 2: Peyton | W 2: Total |
|---|---|---|---|---|---|---|
| Yes | 162 | 19 | 181 | 10 | 99 | 109 |
| No | 18 | 1 | 19 | 10 | 81 | 91 |
| Total | 180 | 20 | 200 | 20 | 180 | 200 |
\(P(\text{Yes} \mid \text{Week 1, Alexis})\) Look down the Alexis column in Week 1. \[\frac{162}{180}=0.90 \]
\(P(\text{Yes} \mid \text{Week 1, Peyton})\) Look down the Peyton column in Week 1. \[\frac{19}{20} =0.95 \]
\(P(\text{Yes} \mid \text{Week 2, Alexis})\) Look down the Alexis column in Week 2. \[\frac{10}{20}=0.50 \]
\(P(\text{Yes} \mid \text{Week 2, Peyton})\) Look down the Peyton column in Week 2. \[\frac{99}{180} =0.55 \]
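Simpson’s Paradox: Aggregation can hide or reverse relationships seen in subgroups. Here Peyton has the higher “Yes” rate in each week (0.95 vs. 0.90 and 0.55 vs. 0.50), yet over both weeks combined Alexis has \(\frac{172}{200}=0.86\) while Peyton has \(\frac{118}{200}=0.59\).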
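The reversal can be reproduced directly from the counts in the table. The sketch below is illustrative; the dictionary layout and variable names are assumptions, while the numbers are the “Yes” and “Total” rows of the table.

```python
# (yes_count, total_count) for each worker in each week, from the table.
counts = {
    "Week 1": {"Alexis": (162, 180), "Peyton": (19, 20)},
    "Week 2": {"Alexis": (10, 20), "Peyton": (99, 180)},
}

# Rates within each week: Peyton is ahead both times.
weekly = {
    week: {name: yes / total for name, (yes, total) in workers.items()}
    for week, workers in counts.items()
}

# Rates after pooling both weeks: Alexis is ahead.
overall = {}
for name in ("Alexis", "Peyton"):
    yes = sum(counts[week][name][0] for week in counts)
    total = sum(counts[week][name][1] for week in counts)
    overall[name] = yes / total

print(weekly)
print(overall)  # Alexis 0.86, Peyton 0.59
```

The reversal comes from the unequal group sizes: most of Alexis’s tasks fall in Week 1, where both workers have high rates, and most of Peyton’s fall in Week 2, where both have low rates.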
Relative frequency is just a segue into probability.
We are interpreting percentages calculated from relative frequencies as probabilities. We are implicitly thinking: “if we randomly pick a task, this is the percentage, and therefore the probability, that it will be such and such” (for instance, that the task is finished).
It is important to define probability a little bit more formally (still informal by math standards).
Probability is the quantification of uncertainty that obeys certain rules (more on that later).
For now this treatment is ok, we will give a broader
Marginal Probability: Probability evaluated for a value of a single random variable, \(P(X=x)\). What is the probability that Peyton is assigned a task?
Joint Probability: Probability evaluated for the values of 2 or more random variables together, \(P(X=x,Y=y)\). What is the probability that Peyton is assigned a task and it is finished?
Conditional Probability: Probability evaluated for a value of one random variable given information on the other one, \(P(X=x|Y=y)\). What is the probability that Peyton is assigned a task given that the task is finished?
| | X=1 | X=2 |
|---|---|---|
| Y=0 | 10 | 30 |
| Y=1 | 15 | 15 |
| Y=2 | 25 | 10 |
The total count is \(10+30+15+15+25+10=105\).
\(P(Y >0)=\frac{15+15+25+10}{105}\approx 62\%\)
\(P(Y =0,X=1)=\frac{10}{105}\approx 9.5\%\)
\(P(Y >0|X=1)=\frac{15+25}{10+15+25}=80\%\)
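The three calculations above can be reproduced from the table of counts. This is a sketch; the nested-dictionary layout `table[y][x]` and the variable names are assumptions made for illustration.

```python
# Counts from the table, indexed as table[y][x].
table = {
    0: {1: 10, 2: 30},
    1: {1: 15, 2: 15},
    2: {1: 25, 2: 10},
}

total = sum(sum(row.values()) for row in table.values())  # 105

# Marginal: P(Y > 0), summing over both values of X.
p_y_pos = sum(sum(row.values()) for y, row in table.items() if y > 0) / total

# Joint: P(Y = 0, X = 1), a single cell over the grand total.
p_joint = table[0][1] / total

# Conditional: P(Y > 0 | X = 1), restricted to the X = 1 column.
x1_total = sum(row[1] for row in table.values())  # 50
p_cond = sum(row[1] for y, row in table.items() if y > 0) / x1_total

print(round(p_y_pos, 3), round(p_joint, 3), round(p_cond, 3))
# 0.619 0.095 0.8
```

Note the change of denominator: marginal and joint probabilities divide by the grand total, while the conditional probability divides by the total of the column being conditioned on.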