We introduce concepts briefly and then provide detailed case studies demonstrating how statistics is used in data analysis along with R code implementing these ideas. It is important for a data analyst to have an in-depth understanding of statistics.
prop.table(table(state.region))
## state.region
## Northeast South North Central West
## 0.18 0.32 0.24 0.26
barplot(prop.table(table(state.region)))
How, then, do we depict the distribution of numerical data? Consider the numerical variable height below. Many of its values occur only once or twice, so a table of proportions is long and hard to interpret.
library(dslabs)  # assumed source of the heights and murders data sets
prop.table(table(heights$height))
##
## 50 51 52 53
## 0.001904762 0.000952381 0.001904762 0.000952381
## 53.77 54 55 58
## 0.000952381 0.000952381 0.000952381 0.000952381
## 59 59.0551 59.0551181102362 59.8425196850394
## 0.005714286 0.000952381 0.001904762 0.000952381
## 60 61 61.32 61.8110236220472
## 0.016190476 0.011428571 0.000952381 0.001904762
## 62 62.2047244094488 62.4 62.5
## 0.018095238 0.002857143 0.000952381 0.001904762
## 62.5984251968504 62.6 62.992125984252 63
## 0.000952381 0.000952381 0.002857143 0.029523810
## 63.3858267716535 63.7795275590551 64 64.173
## 0.000952381 0.002857143 0.037142857 0.000952381
## 64.1732 64.1732283464567 64.2 64.5
## 0.000952381 0.000952381 0.001904762 0.001904762
## 64.5669291338583 64.57 64.9 64.96
## 0.001904762 0.000952381 0.000952381 0.001904762
## 64.9606299212598 64.961 65 65.748031496063
## 0.003809524 0.000952381 0.051428571 0.002857143
## 66 66.14 66.1416 66.1417
## 0.061904762 0.000952381 0.000952381 0.000952381
## 66.1417322834646 66.4 66.5 66.5354330708661
## 0.003809524 0.000952381 0.002857143 0.003809524
## 66.7 66.75 66.9 66.92
## 0.000952381 0.000952381 0.000952381 0.001904762
## 66.9291 66.9291338582677 66.93 67
## 0.001904762 0.011428571 0.000952381 0.073333333
## 67.2 67.3 67.5 67.7
## 0.000952381 0.000952381 0.003809524 0.002857143
## 67.71 67.7165 67.7165354330709 67.72
## 0.001904762 0.001904762 0.006666667 0.002857143
## 67.78 68 68.11 68.1102
## 0.000952381 0.072380952 0.004761905 0.000952381
## 68.1102362204724 68.11024 68.4 68.5
## 0.002857143 0.000952381 0.001904762 0.008571429
## 68.503937007874 68.8 68.89 68.8976
## 0.003809524 0.000952381 0.003809524 0.002857143
## 68.8976377952756 68.8976378 68.9 69
## 0.004761905 0.000952381 0.002857143 0.079047619
## 69.29 69.2913385826772 69.3 69.6
## 0.001904762 0.001904762 0.001904762 0.001904762
## 69.6850393700787 70 70.0787401574803 70.08
## 0.000952381 0.087619048 0.004761905 0.000952381
## 70.1 70.4724409448819 70.5 70.8
## 0.001904762 0.000952381 0.001904762 0.000952381
## 70.85 70.86 70.866 70.8661
## 0.000952381 0.000952381 0.000952381 0.000952381
## 70.8661417322835 70.87 71 71.5
## 0.008571429 0.000952381 0.056190476 0.003809524
## 71.65 71.6535433070866 71.7 72
## 0.000952381 0.000952381 0.000952381 0.083809524
## 72.0472440944882 72.05 72.4 72.44
## 0.002857143 0.001904762 0.000952381 0.003809524
## 72.4409448818898 72.45 72.5 72.83
## 0.002857143 0.000952381 0.000952381 0.001904762
## 72.8346 72.8346456692913 73 73.2
## 0.001904762 0.001904762 0.021904762 0.001904762
## 73.22 73.2283464566929 73.62 74
## 0.001904762 0.000952381 0.000952381 0.028571429
## 74.5 74.8 74.8031496062992 75
## 0.000952381 0.000952381 0.001904762 0.020952381
## 75.4 75.5905511811024 75.6 75.98
## 0.000952381 0.000952381 0.000952381 0.000952381
## 76 77 77.1654 78
## 0.006666667 0.004761905 0.000952381 0.003809524
## 78.74 78.740157480315 79 79.05
## 0.000952381 0.000952381 0.001904762 0.000952381
## 80 81 82.6771653543307
## 0.001904762 0.000952381 0.000952381
barplot(prop.table(table(heights$height)))
We can define a function that reports the proportion of the data entries
that are at or below \(a\), for all possible
values of \(a\). This function is
called the empirical cumulative distribution function (eCDF) and is often
denoted by \(F\): \(F(a) = \text{proportion of data points that are less than or equal to } a\).
# Evaluate the eCDF at each observed height: the share of values <= x
sorted <- sort(heights$height)
a <- sapply(heights$height, function(x) sum(sorted <= x) / length(sorted))
plot(heights$height, a)
You can also use the built-in ecdf() function.
ecdf(heights$height)
## Empirical CDF
## Call: ecdf(heights$height)
## x[1:139] = 50, 51, 52, ..., 81, 82.677
plot(ecdf(heights$height))
Reading the plot at 66 gives F(66)=0.164, so about 16% of the values are at or below 66; likewise, about 84% of the values are at or below 72, since F(72)=0.841, and so on.
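Rather than reading values off the plot, the function returned by ecdf() can be evaluated directly. The following is a minimal sketch on a small made-up vector (x and Fhat are illustrative names, and the numbers are not the heights data) so that it runs on its own:

```r
# Hypothetical toy data, used only to illustrate the calls
x <- c(60, 62, 65, 66, 66, 68, 70, 72, 74, 75)

Fhat <- ecdf(x)        # ecdf() returns a function

Fhat(66)               # proportion of values <= 66
Fhat(72) - Fhat(66)    # proportion of values in the interval (66, 72]
mean(x <= 66)          # same as Fhat(66), computed by hand
```

For the data in this section, the equivalent call would be ecdf(heights$height)(66).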
Histograms are often preferred to eCDF plots because they make it much easier to see where the values are centered and how spread out they are. The simplest way to make a histogram is to divide the span of our data into non-overlapping bins of the same size. Then, for each bin, we count the number of values that fall in that interval. The histogram plots these counts as bars, with the base of each bar defined by its interval. As the plots below show, a histogram is similar to a barplot, but it differs in that the x-axis is numerical, not categorical.
hist(heights$height,breaks=65)
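The binning step described above can be sketched by hand with cut() and table(). The vector x and the bin edges here are illustrative assumptions, not the heights data; hist() called with the same breaks should return the same counts.

```r
# Hypothetical toy data
x <- c(61, 63, 64, 66, 66, 67, 68, 68, 69, 70, 71, 73)

# Non-overlapping bins of equal width covering the span of the data
breaks <- seq(60, 75, by = 5)

# Assign each value to a bin, then count the values per bin
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
counts <- table(bins)
counts

# hist() does the same counting; plot = FALSE returns the result without drawing
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts
```

Changing the by argument of seq() changes the bin width, which is the main choice that affects how a histogram looks.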
library(ggplot2)
ggplot(heights, aes(x = height)) + geom_histogram(binwidth = 0.5)
ggplot(heights, aes(x = height)) + geom_density()
Density plots are similar to histograms, but instead of dividing the data into bins they draw a smooth curve that estimates the distribution.
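# Draw the histogram first so its white bars do not cover the density curve
ggplot(heights, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "white", colour = "black") +
  geom_density(color = "blue", alpha = 0.3)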
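The smooth curve is a kernel density estimate, which base R exposes as density(). The sketch below uses simulated values (the sample, seed, and variable names are assumptions for illustration) to show that the curve is scaled so the area under it is approximately 1, which is why the histogram has to be put on the density scale before the two are overlaid.

```r
set.seed(1)                           # arbitrary seed for reproducibility
x <- rnorm(500, mean = 68, sd = 4)    # simulated values, not the heights data

d <- density(x)                       # Gaussian kernel, default bandwidth

# Approximate the area under the curve with a Riemann sum
area <- sum(d$y) * diff(d$x[1:2])
area

# A histogram on the density scale has total bar area 1
h <- hist(x, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))
hist_area
```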
In the murders dataset, the region is a categorical variable and the following is its distribution:
barplot(prop.table(table(murders$region)))
- To the closest 5%, what proportion of the states are in the North
Central region?
- Is the graph above a histogram?
ggplot(heights, aes(height)) + stat_ecdf(geom = "step", pad = FALSE)
- What percentage of males are shorter than 75 inches?
Here is an eCDF of the murder rates across states:
# Scale to murders per 100,000 people so the axis matches the question below
ggplot(murders, aes(total / population * 10^5)) + stat_ecdf(geom = "step", pad = FALSE)
Knowing that there are 51 states (counting DC) and based on this plot,
how many states have murder rates larger than 10 per 100,000 people?
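One way to turn an eCDF value into a count: if F(10) is the proportion of values at or below 10, then n × (1 − F(10)) values lie above it. A sketch with made-up rates (hypothetical numbers and names, not the murders data):

```r
# Hypothetical rates per 100,000 for 8 units; illustrative only
rate <- c(0.5, 1.2, 2.8, 3.1, 4.4, 5.0, 11.3, 16.5)

Fhat <- ecdf(rate)
n <- length(rate)

above_via_ecdf <- n * (1 - Fhat(10))   # count above 10, via the eCDF
above_direct <- sum(rate > 10)         # direct count, for comparison

above_via_ecdf
above_direct
```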
Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two-number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.
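The normal distribution, also known as the bell curve or the Gaussian distribution, is defined by the density
\(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)
Note that it is completely defined by just two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). In any normal distribution, about 68% of the data falls within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. The code below draws the bell curve of the standard normal distribution (\(\mu = 0\), \(\sigma = 1\)) from simulated values.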
data <- data.frame(x = rnorm(100000, mean = 0, sd = 1))
ggplot(data, aes(x = x)) +geom_density()
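The 68%, 95%, and 99.7% figures can be checked with pnorm(), the normal cumulative distribution function in base R. This is a small sketch; the helper name within_k_sd and the simulated sample are illustrative assumptions.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

theory <- sapply(1:3, within_k_sd)
round(theory, 4)
# 0.6827 0.9545 0.9973

# The same proportions estimated from simulated standard normal values
set.seed(1)                      # arbitrary seed
x <- rnorm(100000, mean = 0, sd = 1)
simulated <- sapply(1:3, function(k) mean(abs(x) <= k))
round(simulated, 3)
```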
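The standard deviation \(\sigma\) is the square root of the variance, which is the average squared deviation from the mean over all \(N\) values:
\(\sigma^2=\frac{\sum(x_i-\mu)^2}{N}\)
The following example computes it step by step for nine heights.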
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
df <- data.frame(
Height = xi,
Mean = mean(xi),
Dispersion = xi-mean(xi),
Square_Dispersion = (xi-mean(xi))^2
)
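# Population standard deviation: divide by the number of values rather than a hard-coded 9
sigma <- sqrt(sum(df$Square_Dispersion) / length(xi))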
Here the mean is 75 inches and the standard deviation is about 9.3 inches. If these heights follow a normal distribution, about 68% of heights fall within 75 ± 9.3 inches (1 standard deviation from the mean), about 95% within 75 ± 18.7 inches (2 standard deviations), and about 99.7% within 75 ± 28 inches (3 standard deviations).
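Note that the variance formula above divides by N, whereas R's built-in sd() divides by N − 1. A short sketch comparing the two on the same nine heights (the names sigma_pop and sigma_sample are illustrative):

```r
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
n <- length(xi)

# Population standard deviation: divide by N, as in the formula
sigma_pop <- sqrt(sum((xi - mean(xi))^2) / n)

# Sample standard deviation: sd() divides by N - 1
sigma_sample <- sd(xi)

# Roughly 9.33 (population) versus 9.90 (sample)
round(c(population = sigma_pop, sample = sigma_sample), 2)

# Converting between the two
all.equal(sigma_pop, sigma_sample * sqrt((n - 1) / n))
```

With only nine values the difference is noticeable; for large data sets the two are nearly identical.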
A z-score measures how many standard deviations a value \(x\) lies above or below the mean:
\(Z=\frac{x-\mu}{\sigma}\)
z <- df$Dispersion/sigma
z[7]
## [1] 0.5357143
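The seventh student (student A, 80 inches tall) has a z-score of about 0.54, which corresponds to 0.7054 in a standard normal table. Assuming heights are normally distributed, this means that student A is taller than about 70.54% of the class and shorter than about 29.46% of the class.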
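The table lookup can be reproduced with pnorm(), which returns the proportion of a standard normal distribution at or below a given z-score. A self-contained sketch that recomputes the z-scores for the nine heights:

```r
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)

sigma <- sqrt(mean((xi - mean(xi))^2))   # population standard deviation
z <- (xi - mean(xi)) / sigma             # z-score of every height

z[7]                     # z-score of the seventh height (80 inches)
pnorm(z[7])              # normal-model proportion below that height
round(pnorm(0.54), 4)    # table value for z rounded to 0.54
# 0.7054

1 - pnorm(z[7])          # normal-model proportion above that height
```

Using the unrounded z-score gives a slightly different value than the table, because the table requires rounding z to two decimal places first.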