Introduction
The purpose of this article is to summarize what I have learned about the normal distribution in probability, along with some examples of its use. This and my future articles will be very tactical and will not go into much detail on the theory and proofs behind the material.
Normal Distribution
Let’s start with three main characteristics of the normal distribution:
When we look at the distribution plot, whether it is a histogram or a density plot, it needs to be unimodal and symmetric. We need to see a bell-shaped curve.
Many real-world variables are nearly normal, but none are exactly normal.
It is denoted as \(N(\mu , \sigma)\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation.
For example, consider the heights of females in a sample:
We can see that it follows a normal distribution. Our plot is symmetric, unimodal, and gives us the bell-shaped curve.
As a second example, suppose we want to compare two students’ scores from different tests. Jim took the ACT and Pam took the SAT. We have both scores, but the two tests use different scoring systems. Both SAT and ACT scores are normally distributed: SAT scores have a mean of 1500 and a standard deviation of 300, and ACT scores have a mean of 21 and a standard deviation of 5. Pam scored 1800 on the SAT and Jim scored 24 on the ACT. How do we know who did better?
Since we know both test scores follow a normal distribution, we can create a density plot for each test to see where Jim’s and Pam’s scores fall.
Z Scores (Or Standardized Scores)
Since we cannot compare the raw scores directly, we can compare how many standard deviations above the mean each score is. The blue lines represent the means and the red lines represent each student’s score.
Pam’s score is 1800, the SAT mean is 1500, and the SAT standard deviation is 300. So Pam’s score is 1 standard deviation above the mean. Jim’s score is 24, the ACT mean is 21, and the ACT standard deviation is 5. So Jim’s score is 0.6 standard deviations above the mean.
Pam’s score is further above the average than Jim’s score, measured in standard deviations. We conclude that Pam did better than Jim.
The number of standard deviations an observation falls away from the mean is called its standardized score or Z score. Below is the basic formula to calculate the Z score:
\(Z=\frac{\text{observation}-\text{mean}}{\text{SD}}\)
Two main things to flag:
We can calculate Z scores for distributions of any shape, but we can only use Z scores to calculate percentiles when the distribution is normal.
Observations that are more than two standard deviations away from the mean are usually considered unusual.
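The Z score formula above can be sketched in R using the SAT and ACT figures from the example. The helper name `z_score` is my own choice for illustration, not a built-in function.

```r
# Z score: how many standard deviations an observation is from the mean.
# `z_score` is a hypothetical helper name used only for this sketch.
z_score <- function(observation, mean, sd) {
  (observation - mean) / sd
}

z_pam <- z_score(1800, mean = 1500, sd = 300)  # SAT
z_jim <- z_score(24, mean = 21, sd = 5)        # ACT

print(c(Pam = z_pam, Jim = z_jim))

# Flag observations more than two standard deviations from the mean
unusual <- abs(c(Pam = z_pam, Jim = z_jim)) > 2
print(unusual)
```

Neither score is flagged as unusual here, since both Z scores are within two standard deviations of the mean.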
Percentile
A percentile is the percentage of observations that fall below a given point. For example, in our earlier score comparison, if we look at Pam’s and Jim’s scores (observations),
we see that the percentile is the area under the probability density curve to the left of that observation.
We can use the pnorm() function to compute percentiles (areas under the curve). For Pam’s SAT score of 1800, with mean 1500 and standard deviation 300, it returns:
[1] 0.8413447
We can also use Z tables to calculate the percentiles.
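As a minimal sketch, the percentiles for both students can be computed with `pnorm()`, either from the raw score with the test's mean and standard deviation, or from the Z score with the default standard normal. The variable names are illustrative.

```r
# Percentile = area under the normal curve to the left of the observation
pam_percentile <- pnorm(1800, mean = 1500, sd = 300)  # SAT
jim_percentile <- pnorm(24, mean = 21, sd = 5)        # ACT

# The same areas, computed from the Z scores on the standard normal N(0, 1)
pam_from_z <- pnorm(1)
jim_from_z <- pnorm(0.6)

print(round(c(Pam = pam_percentile, Jim = jim_percentile), 4))
```

Both routes give the same area, which is why a Z table (built for the standard normal) works for any normal distribution.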
68-95-99.7 Rule
If a data set is nearly normally distributed:
68% of the data will fall within 1 SD of the mean.
95% of the data will fall within 2 SD of the mean.
99.7% of the data will fall within 3 SD of the mean.
There is still a very small chance that an observation falls 4, 5, or more SD away from the mean, but this is very rare if the data are normally distributed.
Let’s take a look at this with an example.
We have age information of individuals within a data set. We can look and see if it is normally distributed.
age
1 32
2 25
3 24
4 26
5 32
6 29
[1] 23.44019
[1] 4.721365
The mean of the age distribution is 23.44 and the standard deviation is 4.72.
We can calculate the area within 1 SD, 2 SD, and 3 SD of the mean by using normalPlot.
[1] 0.6826895
[1] 0.9544997
[1] 0.9973002
As we can see, under a normal model 68.3% of the data falls within 1 SD, 95.4% within 2 SD, and 99.7% within 3 SD of the mean. If the age data are nearly normal, the observed proportions in the data should be close to these values.
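A minimal sketch of this check follows. The first part reproduces the three theoretical areas with `pnorm()`. The second part compares them to the observed share of a sample within k SD of its own mean. The original age data is not available here, so `ages` is simulated stand-in data, and the helper names are hypothetical.

```r
# Theoretical share of a normal distribution within k SD of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)
theoretical <- sapply(1:3, within_k_sd)
print(round(theoretical, 4))

# Observed share of a sample within k SD of its own mean
empirical_share <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# ASSUMPTION: simulated stand-in data, not the article's age column
set.seed(42)
ages <- rnorm(1000, mean = 23.44, sd = 4.72)
empirical <- sapply(1:3, function(k) empirical_share(ages, k))
print(round(empirical, 4))
```

With real data, large gaps between the two sets of numbers would suggest the variable is not well described by a normal model.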
Normal Probability Plot
When we create a normal probability plot, the data are plotted on the y-axis and the theoretical quantiles on the x-axis.
We build the normal probability plot by calculating the percentile and the corresponding Z score for each observation.
This is time consuming and tedious by hand, so we use the qqnorm() function to do it in R. We also use qqline() to add the reference line to the normal probability plot.
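The manual steps can be sketched as follows, assuming simulated stand-in data (`x`) rather than the article's data set. The sketch computes a percentile and matching Z score for each ordered observation, then checks that `qqnorm()` produces the same theoretical quantiles.

```r
# ASSUMPTION: simulated stand-in data for illustration
set.seed(1)
x <- rnorm(50, mean = 23.44, sd = 4.72)
n <- length(x)

# Manual route: a percentile for each ordered observation,
# then the Z score (theoretical quantile) for that percentile
percentiles <- ppoints(n)
theoretical_z <- qnorm(percentiles)

# Built-in route: qqnorm() returns the plotted coordinates
# (plot.it = FALSE skips drawing so only the numbers are returned)
qq <- qqnorm(x, plot.it = FALSE)

# x holds theoretical quantiles, y holds the data
manual_matches <- isTRUE(all.equal(sort(qq$x), theoretical_z))
print(manual_matches)
```

To draw the plot itself, call `qqnorm(x)` followed by `qqline(x)`.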
There are four patterns to look for in a normal probability plot:
Right skew: Points bend up and to the left of the line.
Left skew: Points bend down and to the right of the line.
Short tails: Points follow an S-shaped curve. The distribution is narrower than the normal distribution.
Long tails: Points start below the line, bend to follow it, and end above it.
Geometric Distribution and Bernoulli Random Variables
We are going to look at the geometric distribution through a famous experiment conducted by Stanley Milgram, a Yale University psychologist who ran a series of experiments on obedience to authority starting in 1963. The conditions of the experiment, its goals, and its results are as follows:
The experimenter orders the teacher, who is the subject of the experiment, to give severe electric shocks to a learner each time the learner gets a question wrong. The learner is actually an actor and the electric shocks are not real, but a prerecorded sound is played each time a shock is supposedly delivered.
The goal of the experiment is to measure the willingness of participants to obey an authority who instructs them to perform acts that conflict with their morals.
65% of the participants obeyed the authority.
In this case, the experiment is looking at success and failure outcomes. Each person is a trial: a person who refuses to administer the electric shock is a success, and a person who administers the shock is a failure. Since only 35% of the people refused to administer the shock, the probability of success is p = 0.35.
**When an individual trial has only two possible outcomes, it is called a Bernoulli random variable.**
We can ask a question such as “What is the probability that the first two people administer the shock and the third one refuses?”. In this case, we find each individual probability of success (or failure, depending on the question) and multiply them to find the probability in question.
\(P(\text{1st and 2nd shock, 3rd refuses})=0.65 * 0.65 * 0.35 \approx 0.15\)
What happens if we need to find the probability that the 100th trial is the first success, with every trial before it a failure? This is where the geometric distribution comes into play. The geometric distribution describes the waiting time until a success for Bernoulli random variables. There are two important conditions:
The Bernoulli random variables must be independent, meaning the outcome of one trial does not affect the outcome of any other.
The Bernoulli random variables must be identically distributed, meaning the probability of success is the same for each trial. (We cannot redefine success as administering the electric shock after the first trial.)
To calculate the probability that the first success occurs on the nth trial, we can use the formula below:
\(P(\text{success on the } n\text{th trial})= (1-p)^{n-1}*p\)
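Here is a small sketch of the formula above, using p = 0.35 from the Milgram example. The helper `geom_prob` is a name I made up for illustration. R's built-in `dgeom()` counts the number of failures before the first success, so the nth trial corresponds to `n - 1` failures.

```r
p <- 0.35  # probability of success (refusing to shock)

# Probability that the first success happens on the nth trial.
# `geom_prob` is a hypothetical helper for this sketch.
geom_prob <- function(n, p) (1 - p)^(n - 1) * p

# First success on the 3rd trial: two shocks, then a refusal
manual <- geom_prob(3, p)

# Built-in equivalent: dgeom() takes the number of failures (n - 1)
builtin <- dgeom(3 - 1, prob = p)

print(c(manual = manual, builtin = builtin))
```

Both values agree with the hand calculation 0.65 * 0.65 * 0.35 from the earlier example.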
Expected Value and Its Variability in Geometric Distribution
The expected value, which is the mean, is defined as \(\mu=1/p\)
The standard deviation of the geometric distribution is \(\sigma=\sqrt{(1-p)/p^2}\)
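As a quick worked sketch, these two formulas can be evaluated for the Milgram example with p = 0.35. The variable names are illustrative.

```r
p <- 0.35  # probability of success (refusing to shock)

# Expected number of trials until the first success
mu <- 1 / p

# Standard deviation of the number of trials until the first success
sigma <- sqrt((1 - p) / p^2)

print(round(c(mean = mu, sd = sigma), 2))
```

In other words, we would expect to test about 3 people on average before finding the first one who refuses, with a standard deviation of roughly 2.3 people.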
Binomial Distribution
Let’s say we pick four individuals at random. What is the probability that exactly one of them will refuse to administer the shock? There are several scenarios in which exactly one person is a success, depending on which of the four people refuses. For a small number of possibilities, we can list the scenarios, find the probability of each, and add them together to find the probability of exactly one success. However, when we have a large number of trials we need a formula.
Here k is the number of successes and n is the number of trials.
The probability of a single scenario is \(P(\text{single scenario})=p^{k}*(1-p)^{n-k}\)
The binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with probability of success p.
Besides the probability of a single scenario (for example, one particular way of getting exactly one success in 4 trials), we also need the number of ways that scenario can occur (in this case, the number of ways to have exactly one success). To calculate this we can use \(\binom{n}{k}\). We can also use the choose(n,k) function in R.
Combining the two pieces gives the probability of exactly k successes in n trials: \(P(k \text{ successes in } n \text{ trials})=\binom{n}{k}*p^k*(1-p)^{n-k}\)
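A minimal sketch of the four-person question, assuming p = 0.35 as in the Milgram example. It builds the probability from `choose()` and the single-scenario probability, then compares it with R's built-in `dbinom()`.

```r
n <- 4     # number of trials (people)
k <- 1     # number of successes (refusals)
p <- 0.35  # probability of success

# Probability of one particular scenario with k successes
single_scenario <- p^k * (1 - p)^(n - k)

# Number of scenarios with exactly k successes
n_scenarios <- choose(n, k)

# Binomial probability built by hand, and the built-in equivalent
manual <- n_scenarios * single_scenario
builtin <- dbinom(k, size = n, prob = p)

print(c(scenarios = n_scenarios, manual = manual, builtin = builtin))
```

The hand-built value is the same as adding up the four equally likely scenarios one by one.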
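A 2012 Gallup survey suggests that 26.2% of Americans are obese. Can we find the probability that at least 8 out of a random sample of 10 are obese?
Since the question is about obese individuals, we treat an obese person as a success, with p = 0.262 and n = 10. The pbinom() function returns the cumulative probability \(P(X \le k)\), so “at least 8” is the complement of “at most 7”: 1 - pbinom(7, 10, 0.262), which is about 0.0006. Note that pbinom(8, 10, 0.262) instead gives the probability that at most 8 are obese: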
[1] 0.9999555
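The difference between “at most” and “at least” is easy to get wrong with `pbinom()`, so here is a small sketch under the assumption that success means obese (p = 0.262). Variable names are illustrative.

```r
n <- 10
p <- 0.262  # probability that a sampled person is obese

# P(X <= 8): at most 8 of the 10 are obese
at_most_8 <- pbinom(8, size = n, prob = p)

# P(X >= 8): at least 8 of the 10 are obese, three equivalent ways
at_least_8_complement <- 1 - pbinom(7, size = n, prob = p)
at_least_8_upper_tail <- pbinom(7, size = n, prob = p, lower.tail = FALSE)
at_least_8_summed     <- sum(dbinom(8:10, size = n, prob = p))

print(at_most_8)
print(at_least_8_complement)
```

Note the cutoff of 7, not 8: `pbinom(7, ...)` covers 0 through 7, so its complement covers 8, 9, and 10.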
Expected Value, Variability and Unusual Observations in Binomial Distribution
Looking at the same example, how many Americans would we expect to be obese in a random sample of 100?
\(\mu=np=0.262*100=26.2\)
This does not mean that in every random sample of 100 exactly 26.2 people would be obese. In some samples it will be fewer and in some it will be more, so there is variability here.
\(\sigma=\sqrt{np*(1-p)}\)
As we see here, the mean and standard deviation of a binomial are not always integers, which is fine, as these values represent what we would expect to see on average.
If an observation is more than two standard deviations away from the mean, it is considered unusual.
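As a short sketch, the mean, standard deviation, and the “unusual” cutoffs can be computed for the obesity example with n = 100 and p = 0.262. The variable names are my own.

```r
n <- 100
p <- 0.262

mu <- n * p                       # expected number of obese people
sigma <- sqrt(n * p * (1 - p))    # standard deviation of that count

# Counts more than two standard deviations from the mean are unusual
lower <- mu - 2 * sigma
upper <- mu + 2 * sigma

print(round(c(mean = mu, sd = sigma, lower = lower, upper = upper), 2))
```

Under this rule of thumb, a sample of 100 with roughly fewer than 17 or more than 35 obese people would be considered unusual.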
Distribution of Number of Successes
Let’s look at what happens when we increase the number of trials.
The sample size is considered large enough if the expected numbers of successes and failures are both at least 10: \(n*p\ge10\) and \(n*(1-p) \ge 10\)
Normal approximation to the Binomial
When the sample size is large enough, the binomial distribution with parameters n (number of trials) and p (probability of success) can be approximated by the normal model with \(\mu=n*p\) and \(\sigma=\sqrt{np*(1-p)}\)
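A rough sketch of the approximation, reusing the obesity example with n = 100 and p = 0.262. The cutoff of 30 is an arbitrary value I picked for illustration. The code first checks the sample size conditions, then compares the exact binomial probability with the normal model.

```r
n <- 100
p <- 0.262

# Sample size conditions: at least 10 expected successes and failures
conditions_met <- (n * p >= 10) && (n * (1 - p) >= 10)

# Parameters of the approximating normal model
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

# P(X <= 30): exact binomial vs. normal approximation
# (30 is an arbitrary cutoff chosen for this sketch)
exact_prob  <- pbinom(30, size = n, prob = p)
normal_prob <- pnorm(30, mean = mu, sd = sigma)

print(conditions_met)
print(round(c(exact = exact_prob, normal = normal_prob), 3))
```

The two values are close but not identical, since the normal model is continuous while the binomial only takes whole-number values.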