Common Terms and Statistics

We introduce concepts briefly and then provide detailed case studies demonstrating how statistics is used in data analysis along with R code implementing these ideas. It is important for a data analyst to have an in-depth understanding of statistics.

Distributions: a distribution is a compact way to summarize a list with many values. For categorical data, such as the example below, the distribution gives the proportion of entries in each category.

prop.table(table(state.region))
## state.region
##     Northeast         South North Central          West 
##          0.18          0.32          0.24          0.26
barplot(prop.table(table(state.region)))

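How do we depict the distribution of numerical data? Consider the numerical variable height below. Many values are reported by only one or two people, so listing the proportion of each value does not give a useful summary.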

# heights is assumed here to be the dataset from the dslabs package
library(dslabs)
prop.table(table(heights$height))
## 
##               50               51               52               53 
##      0.001904762      0.000952381      0.001904762      0.000952381 
##            53.77               54               55               58 
##      0.000952381      0.000952381      0.000952381      0.000952381 
##               59          59.0551 59.0551181102362 59.8425196850394 
##      0.005714286      0.000952381      0.001904762      0.000952381 
##               60               61            61.32 61.8110236220472 
##      0.016190476      0.011428571      0.000952381      0.001904762 
##               62 62.2047244094488             62.4             62.5 
##      0.018095238      0.002857143      0.000952381      0.001904762 
## 62.5984251968504             62.6  62.992125984252               63 
##      0.000952381      0.000952381      0.002857143      0.029523810 
## 63.3858267716535 63.7795275590551               64           64.173 
##      0.000952381      0.002857143      0.037142857      0.000952381 
##          64.1732 64.1732283464567             64.2             64.5 
##      0.000952381      0.000952381      0.001904762      0.001904762 
## 64.5669291338583            64.57             64.9            64.96 
##      0.001904762      0.000952381      0.000952381      0.001904762 
## 64.9606299212598           64.961               65  65.748031496063 
##      0.003809524      0.000952381      0.051428571      0.002857143 
##               66            66.14          66.1416          66.1417 
##      0.061904762      0.000952381      0.000952381      0.000952381 
## 66.1417322834646             66.4             66.5 66.5354330708661 
##      0.003809524      0.000952381      0.002857143      0.003809524 
##             66.7            66.75             66.9            66.92 
##      0.000952381      0.000952381      0.000952381      0.001904762 
##          66.9291 66.9291338582677            66.93               67 
##      0.001904762      0.011428571      0.000952381      0.073333333 
##             67.2             67.3             67.5             67.7 
##      0.000952381      0.000952381      0.003809524      0.002857143 
##            67.71          67.7165 67.7165354330709            67.72 
##      0.001904762      0.001904762      0.006666667      0.002857143 
##            67.78               68            68.11          68.1102 
##      0.000952381      0.072380952      0.004761905      0.000952381 
## 68.1102362204724         68.11024             68.4             68.5 
##      0.002857143      0.000952381      0.001904762      0.008571429 
##  68.503937007874             68.8            68.89          68.8976 
##      0.003809524      0.000952381      0.003809524      0.002857143 
## 68.8976377952756       68.8976378             68.9               69 
##      0.004761905      0.000952381      0.002857143      0.079047619 
##            69.29 69.2913385826772             69.3             69.6 
##      0.001904762      0.001904762      0.001904762      0.001904762 
## 69.6850393700787               70 70.0787401574803            70.08 
##      0.000952381      0.087619048      0.004761905      0.000952381 
##             70.1 70.4724409448819             70.5             70.8 
##      0.001904762      0.000952381      0.001904762      0.000952381 
##            70.85            70.86           70.866          70.8661 
##      0.000952381      0.000952381      0.000952381      0.000952381 
## 70.8661417322835            70.87               71             71.5 
##      0.008571429      0.000952381      0.056190476      0.003809524 
##            71.65 71.6535433070866             71.7               72 
##      0.000952381      0.000952381      0.000952381      0.083809524 
## 72.0472440944882            72.05             72.4            72.44 
##      0.002857143      0.001904762      0.000952381      0.003809524 
## 72.4409448818898            72.45             72.5            72.83 
##      0.002857143      0.000952381      0.000952381      0.001904762 
##          72.8346 72.8346456692913               73             73.2 
##      0.001904762      0.001904762      0.021904762      0.001904762 
##            73.22 73.2283464566929            73.62               74 
##      0.001904762      0.000952381      0.000952381      0.028571429 
##             74.5             74.8 74.8031496062992               75 
##      0.000952381      0.000952381      0.001904762      0.020952381 
##             75.4 75.5905511811024             75.6            75.98 
##      0.000952381      0.000952381      0.000952381      0.000952381 
##               76               77          77.1654               78 
##      0.006666667      0.004761905      0.000952381      0.003809524 
##            78.74  78.740157480315               79            79.05 
##      0.000952381      0.000952381      0.001904762      0.000952381 
##               80               81 82.6771653543307 
##      0.001904762      0.000952381      0.000952381
barplot(prop.table(table(heights$height)))

We can define a function that reports the proportion of the data entries that are less than or equal to \(a\), for all possible values of \(a\). This function is called the empirical cumulative distribution function (eCDF) and is often denoted by \(F\): \(F(a) = \text{proportion of data points that are less than or equal to } a\).

# eCDF by hand: for each height, the proportion of entries less than or equal to it
a <- sapply(heights$height, function(x) mean(heights$height <= x))
plot(heights$height, a)

You can also use the built-in ecdf() function.

ecdf(heights$height)
## Empirical CDF 
## Call: ecdf(heights$height)
##  x[1:139] =     50,     51,     52,  ...,     81, 82.677
plot(ecdf(heights$height))

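An eCDF plot lets us read proportions off directly: a value of F(66)=0.164 means that about 16% of the values are at or below 66, and F(72)=0.841 means that about 84% of the values are at or below 72, and so on. These two readings appear to refer to the male heights only; in the full table above, about 28% of the entries are at or below 66.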
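
As a minimal sketch of how \(F\) is evaluated, the following uses a small made-up vector of heights (not the heights dataset; the names x, F_manual, and F_builtin are illustrative) and compares the definition with the function returned by ecdf(). On real data the readings depend on which rows are included, so recomputing them this way is a useful check.

```r
# Toy heights in inches (illustrative values only, not the heights dataset)
x <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)

# eCDF by definition: proportion of entries less than or equal to a
F_manual <- function(a) mean(x <= a)

# Built-in equivalent: ecdf() returns a function that can be evaluated at any a
F_builtin <- ecdf(x)

F_manual(75)    # 4 of the 9 values are <= 75, so 4/9
F_builtin(75)   # same value

# Proportion of values in the interval (65, 80]: F(80) - F(65) = 7/9 - 2/9
F_builtin(80) - F_builtin(65)
```

Differences of \(F\) give the proportion of data in any interval, which is why the eCDF fully describes the distribution.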

Histograms

Although the eCDF carries all the information in the data, histograms are often preferred because they make the center, spread, and shape of a distribution easier to see. The simplest way to make a histogram is to divide the span of our data into non-overlapping bins of the same size. Then, for each bin, we count the number of values that fall in that interval. The histogram plots these counts as bars, with the base of each bar defined by its interval. As you can see in the figure below, a histogram is similar to a barplot, but it differs in that the x-axis is numerical, not categorical.
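
The binning step can be sketched by hand. This is only an illustration on a made-up vector; the bin edges and the names x, breaks, counts_manual, and h are assumptions chosen for the example, not part of the original analysis.

```r
# Toy heights in inches (illustrative values only)
x <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)

# Equal-width, non-overlapping bins: (50,60], (60,70], ..., (90,100]
breaks <- seq(50, 100, by = 10)

# Count the values falling in each bin by hand
counts_manual <- as.vector(table(cut(x, breaks = breaks)))
counts_manual   # 1 1 5 1 1

# hist() performs the same counting; plot = FALSE returns the numbers only
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts
```

Calling hist(x, breaks = breaks) without plot = FALSE draws the bars from these counts.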

hist(heights$height,breaks=65) 

library(ggplot2)
ggplot(heights, aes(x = height)) + geom_histogram(binwidth = 0.5)

ggplot(heights,aes(x=height)) +geom_density()

Density

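Density plots are similar to histograms, but the data are smoothed into a continuous curve instead of being divided into bins.

The density is, roughly, the curve that passes through the tops of the histogram bars when the bins are made very small and the bar heights are scaled so that the total area is 1.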
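
To make the scaling concrete, here is a small sketch on a made-up vector (the values, bin edges, and variable names are illustrative assumptions). It shows that density-scaled bar heights are counts divided by the number of values times the bin width, so the bars have total area 1, and it computes a kernel density estimate with density().

```r
# Toy heights in inches (illustrative values only)
x <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
breaks <- seq(50, 100, by = 10)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density-scaled bar heights: count / (n * bin width)
manual_density <- h$counts / (length(x) * diff(breaks))
all.equal(manual_density, h$density)   # TRUE

# The total area of the density-scaled bars is 1
area <- sum(h$density * diff(breaks))
area

# Smooth estimate of the same distribution (Gaussian kernel by default)
d <- density(x)
```

Calling plot(d) draws the smooth curve; it plays the role that geom_density() plays in ggplot2.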

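# Draw the histogram first so its white bars do not cover the density curve
ggplot(heights, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "white", colour = "black") +
  geom_density(color = "blue", alpha = 0.3)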

Exercise

In the murders dataset, the region is a categorical variable and the following is its distribution:

barplot(prop.table(table(murders$region)))

- To the closest 5%, what proportion of the states are in the North Central region?
- Is the graph above a histogram?

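Here is an eCDF of male heights:

# Keep only the male respondents (assumes the sex column of the dslabs heights data)
ggplot(heights[heights$sex == "Male", ], aes(height)) + stat_ecdf(geom = "step", pad = FALSE)

- What percentage of males are shorter than 75 inches?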

Here is an eCDF of the murder rates across states:

# Scale to a rate per 100,000 people so the x-axis matches the question
ggplot(murders, aes(total / population * 10^5)) + stat_ecdf(geom = "step", pad = FALSE)

Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?

The normal distribution

Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two-number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

\(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)
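
As a quick check of the formula, the sketch below codes it directly and compares it with R's built-in dnorm(). The function name normal_pdf and the example parameter values are illustrative assumptions.

```r
# Direct translation of the normal density formula
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)

# Matches the built-in density for the standard normal ...
all.equal(normal_pdf(x), dnorm(x))

# ... and for other choices of mean and standard deviation
all.equal(normal_pdf(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))
```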

Figure: Normal distribution

The density above is completely defined by just two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). In any normal distribution, about 68% of the data falls within 1 standard deviation of the mean, about 95% falls within 2 standard deviations, and about 99.7% falls within 3 standard deviations. Consider the image of the bell curve.
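
These percentages can be checked with the normal cumulative distribution function pnorm(). This is a small sketch; the variable names k and within_k are illustrative.

```r
# Probability that a normal value lies within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)   # 0.6827 0.9545 0.9973
```

Because the calculation is in units of standard deviations, the same three probabilities hold for any \(\mu\) and \(\sigma\).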

data <- data.frame(x = rnorm(100000, mean = 0, sd = 1))
ggplot(data, aes(x = x)) +geom_density()  

Standard Deviation: measures the dispersion of the data in relation to the mean. A standard deviation close to zero indicates that data points are very close to the mean, whereas a larger standard deviation indicates data points are spread further away from the mean.

\(\sigma=\sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}\)

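# Nine example heights in inches
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
df <- data.frame(
  Height = xi,
  Mean = mean(xi),
  Dispersion = xi - mean(xi),               # deviation from the mean
  Square_Dispersion = (xi - mean(xi))^2     # squared deviation
)
# Population standard deviation: divide by N rather than N - 1
sigma <- sqrt(sum(df$Square_Dispersion) / length(xi))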
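
Note that R's built-in sd() divides by \(N-1\) rather than \(N\), so it does not reproduce the value computed from the formula above. The sketch below compares the two on the same nine heights; the names sigma_pop and s_sample are illustrative.

```r
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
n <- length(xi)

# Population standard deviation: divide by N
sigma_pop <- sqrt(sum((xi - mean(xi))^2) / n)
sigma_pop   # about 9.33

# Sample standard deviation: sd() divides by N - 1
s_sample <- sd(xi)
s_sample    # about 9.90

# The two differ only by a factor of sqrt((N - 1) / N)
all.equal(sigma_pop, s_sample * sqrt((n - 1) / n))
```

For large \(N\) the difference is negligible, but for nine values it is noticeable.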

The mean of these heights is 75 inches and the standard deviation is about 9.3 inches. If the heights are approximately normally distributed, about 68% of them fall within 75 inches plus or minus 9.3 inches (1 standard deviation of the mean), about 95% within 75 inches plus or minus 18.7 inches (2 standard deviations), and about 99.7% within 75 inches plus or minus 28.0 inches (3 standard deviations).
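
As a rough check, the sketch below computes the three intervals and the share of the nine values that actually falls inside each. The variable names are illustrative; with so few values the observed shares can only approximate the normal percentages.

```r
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
mu <- mean(xi)
sigma <- sqrt(mean((xi - mu)^2))   # population standard deviation

k <- 1:3
lower <- mu - k * sigma
upper <- mu + k * sigma

# Observed share of values within k standard deviations of the mean
share <- sapply(k, function(j) mean(abs(xi - mu) <= j * sigma))

data.frame(k, lower, upper, share)   # shares are 6/9, 8/9 and 9/9
```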

Z-score: the number of standard deviations a value lies away from the mean.

\(Z=\frac{x-\mu}{\sigma}\)

z <- df$Dispersion/sigma
z[7]
## [1] 0.5357143

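The z-score of 0.54 corresponds to 0.7054 on the z-table. This means that the student whose height is 80 inches (the seventh entry, call them student A) is taller than about 70.54% of the class and shorter than about 29.46% of the class, assuming the heights are normally distributed.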
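
The z-table lookup can be reproduced with pnorm(), which returns the proportion of a standard normal distribution at or below a given z-score. This is a minimal sketch that recomputes the z-scores from the nine heights; the variable names are illustrative.

```r
xi <- c(56, 65, 74, 75, 76, 77, 80, 81, 91)
mu <- mean(xi)
sigma <- sqrt(mean((xi - mu)^2))   # population standard deviation

# z-score of every height; the seventh entry is the height of 80 inches
z <- (xi - mu) / sigma
z[7]

# Table lookup at the rounded z-score of 0.54
p_table <- pnorm(round(z[7], 2))
p_table        # about 0.7054

# Using the unrounded z-score gives a slightly different value
pnorm(z[7])
```

A z-table only lists z-scores to two decimals, which is why the rounded value is used for the comparison.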