Probability with Data

df <- iris

We can fit various distributions to data. For example, the Sepal.Width variable in the iris dataset looks approximately normal.

library(ggplot2)

ggplot(df, mapping = aes(x = Sepal.Width))+
  geom_histogram(aes(y = after_stat(density)), color="black", fill="white")+
  geom_density()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can run a statistical test to check whether it is normal.

shapiro.test(df$Sepal.Width)

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Sepal.Width
## W = 0.98492, p-value = 0.1012

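The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed. With a p-value of 0.1012, we fail to reject that null at the usual 5% level, so it is reasonable to treat the data as normal. Note that this is not proof of normality, only a lack of evidence against it.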
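The decision rule can also be written out in code. This is a minimal sketch: the 0.05 significance level and the names `sw`, `alpha`, and `reject_normality` are assumptions chosen for illustration, not part of the original analysis.

```r
# Shapiro-Wilk test on sepal width.
# H0: the sample comes from a normal distribution.
sw <- shapiro.test(iris$Sepal.Width)

alpha <- 0.05  # assumed significance level

# TRUE means we reject normality; FALSE means we fail to reject it
reject_normality <- sw$p.value < alpha
reject_normality
```

The p-value is stored in the `p.value` element of the object returned by `shapiro.test()`, so the comparison does not need to be read off the printed output.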

Let’s estimate the normal distribution’s parameters from the data.

library(MASS) # provides fitdistr()
norm.parm <- fitdistr(df$Sepal.Width, "normal")
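To see what the fit returns, the estimates can be compared with the usual sample statistics. The sketch below uses the names `x`, `fit`, and `n`, which are chosen here for illustration. For the normal distribution, `fitdistr()` reports maximum-likelihood estimates, so the fitted standard deviation divides by n rather than n - 1 and is slightly smaller than `sd(x)`.

```r
library(MASS)  # provides fitdistr()

x <- iris$Sepal.Width
fit <- fitdistr(x, "normal")

# Named vector with elements "mean" and "sd"
fit$estimate

# The fitted mean equals the sample mean
mean(x)

# The fitted sd is the maximum-likelihood estimate (divides by n),
# so it equals sd(x) rescaled by sqrt((n - 1) / n)
n <- length(x)
sd(x) * sqrt((n - 1) / n)
```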

We can now use these parameters in the normal cumulative distribution function (CDF). Remember that the CDF gives the probability of a value being less than or equal to $x$:

\[ F(x) = P(X \le x) \]

For the normal distribution, we would expect 50% of the data to be less than the mean.

pnorm(mean(df$Sepal.Width), mean = norm.parm$estimate["mean"], sd = norm.parm$estimate["sd"])

## [1] 0.5

Using the pnorm function, we see that the mean sits at a cumulative probability of 50%: under the fitted distribution, a point is equally likely to fall above or below it. We can also compute $1 - P(X \le x)$ to get the probability of seeing a point greater than $x$.
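As a sketch of the upper-tail calculation, the example below computes the probability of a sepal width greater than a threshold in two equivalent ways. The threshold of 3.5 and the names `mu`, `sigma`, `upper_a`, `upper_b`, and `median_fit` are hypothetical, chosen only for illustration.

```r
library(MASS)  # provides fitdistr()

x <- iris$Sepal.Width
fit <- fitdistr(x, "normal")
mu <- fit$estimate[["mean"]]
sigma <- fit$estimate[["sd"]]

threshold <- 3.5  # hypothetical value chosen for illustration

# P(X > threshold) as 1 - F(threshold)
upper_a <- 1 - pnorm(threshold, mean = mu, sd = sigma)

# The same probability, asking pnorm() for the upper tail directly
upper_b <- pnorm(threshold, mean = mu, sd = sigma, lower.tail = FALSE)

# qnorm() is the inverse of pnorm(): the 0.5 quantile is the fitted mean
median_fit <- qnorm(0.5, mean = mu, sd = sigma)
```

The `lower.tail = FALSE` form is usually preferable for probabilities far out in the tail, because it avoids the rounding error of subtracting a number very close to 1.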

Victor Feagins

11/9/2021