2023-10-10

Background

Mean

  • The sum of all values in a quantitative data set divided by the size of the data set. For a sample of size n, the mean is calculated by:
    \(Mean = \mu = \frac{1}{n}\sum_{i=1}^{n} x_i\)

Standard Deviation

  • Roughly, the average distance between a value in a data set and the mean of that data set. For a sample of size n, the Standard Deviation (SD) is calculated by:
    \(SD = \sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1}}\)
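
As a minimal sketch of the two formulas above, the R snippet below computes the mean and standard deviation by hand and compares them with R's built-in mean() and sd(). The vector x is illustrative; it reuses the five male heights listed in the table in Example A.

```r
# Illustrative sample: the five male heights (inches) listed in Example A
x <- c(70, 71, 78, 72, 73)
n <- length(x)

# Mean: sum of the values divided by the sample size
manual_mean <- sum(x) / n

# Standard deviation: square root of the summed squared deviations over (n - 1)
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both comparisons should be TRUE, since sd() also divides by (n - 1)
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```

For this small sample the mean is 72.8 inches and the standard deviation is about 3.11 inches.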

Normal Distribution

  • A category or variable within a data set is said to be normally distributed if the graph of its distribution curve is symmetric and bell-shaped about its mean. The notation for the normal distribution is \(N(\mu, \sigma)\), where the mean (\(\mu\)) and standard deviation (\(\sigma\)) are its parameters.
  • When comparing two normally distributed data sets, \(N_1(\mu_1, \sigma_1)\) and \(N_2(\mu_2, \sigma_2)\), if \(\sigma_1 < \sigma_2\), then \(N_2(\mu_2, \sigma_2)\) is wider in shape. In other words, its distribution is more spread out because its data points deviate more from their mean.
  • The means, \(\mu_1\) and \(\mu_2\), indicate where each distribution is centered.
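
To make the effect of \(\sigma\) concrete, the sketch below compares two normal distributions using R's qnorm() and dnorm(). The parameter values are borrowed from the weights in Example A, and the variable names are illustrative only.

```r
# Parameters borrowed from Example A (weights in lbs): N1 = females, N2 = males
mu1 <- 130; sigma1 <- 2
mu2 <- 180; sigma2 <- 4

# Width of the interval holding the central 95% of each distribution
width1 <- diff(qnorm(c(0.025, 0.975), mean = mu1, sd = sigma1))
width2 <- diff(qnorm(c(0.025, 0.975), mean = mu2, sd = sigma2))

# Height of each density curve at its own mean (the peak)
peak1 <- dnorm(mu1, mean = mu1, sd = sigma1)
peak2 <- dnorm(mu2, mean = mu2, sd = sigma2)

width2 / width1  # ratio of widths equals sigma2 / sigma1
peak1 / peak2    # the narrower curve is taller by the same factor
```

Doubling the standard deviation doubles the width of the central interval and halves the height of the peak, while the means only shift where each curve sits.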

Example A: Comparing two Distributions

  • Suppose we are examining the distribution of weights among males and females from a small town. The growing town has 20,000 people: 10,000 women and 10,000 men. Their heights (inches) and weights (lbs) are generated by the code below:
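set.seed(123)  # fix the seed so the simulated data is reproducible

# Simulate 10,000 males: height ~ N(72, 4) inches, weight ~ N(180, 4) lbs
data_male = data.frame(Male_inch = rnorm(10000, mean=72, sd=4),
     Male_lbs = rnorm(10000, mean=180, sd=4))
# Round to whole numbers; round() keeps the columns numeric,
# whereas format() would convert them to character strings
data_male = round(data_male)
colnames(data_male) = c("Height", "Weight")

# Simulate 10,000 females: height ~ N(60, 2) inches, weight ~ N(130, 2) lbs
data_female = data.frame(Female_inch = rnorm(10000, mean=60, sd=2),
     Female_lbs = rnorm(10000, mean=130, sd=2))
data_female = round(data_female)
colnames(data_female) = c("Height", "Weight")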

Tables displaying First 5 Males and Females

  • Females
  Height Weight
1     58    130
2     60    131
3     56    129
4     57    128
5     58    132
  • Males
  Height Weight
1     70    189
2     71    179
3     78    184
4     72    178
5     73    181

Distribution of Weights

  • Plotting the histograms for the weight distribution among males and females we get:
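
The figure itself is not reproduced in this text. A minimal sketch of how overlaid histograms like these might be drawn in base R follows. It regenerates the four samples in the same order as the code above and works on the unrounded values; the colors, bin width, and variable names are assumptions, not the original plotting code.

```r
set.seed(123)
# Draw in the same order as the data-generation code so the samples match
male_height   <- rnorm(10000, mean = 72,  sd = 4)
male_weight   <- rnorm(10000, mean = 180, sd = 4)
female_height <- rnorm(10000, mean = 60,  sd = 2)
female_weight <- rnorm(10000, mean = 130, sd = 2)

# Shared 1 lb bins covering both samples, so the two histograms are comparable
breaks <- seq(floor(min(female_weight, male_weight)),
              ceiling(max(female_weight, male_weight)), by = 1)

female_col <- rgb(1, 0, 0, 0.5)  # semi-transparent red (assumed color)
male_col   <- rgb(0, 0, 1, 0.5)  # semi-transparent blue (assumed color)

# Plot the taller (female) histogram first so the y-axis fits both
hist(female_weight, breaks = breaks, col = female_col,
     xlab = "Weight (lbs)", main = "Distribution of Weights")
hist(male_weight, breaks = breaks, col = male_col, add = TRUE)
legend("top", legend = c("Females", "Males"), fill = c(female_col, male_col))
```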

Analysis of Histograms

  • As the spread of the data shows, the standard deviation is higher for male weights (4 lbs) than for female weights (2 lbs): the male histogram covers a wider range of weights.
  • Also, as previously mentioned, the mean dictates where each histogram is centered. Thus, the distribution of the male weights is centered at 180 lbs, while the distribution of the female weights is centered at 130 lbs.
  • One last thing to note: since \(\sigma_{female} < \sigma_{male}\), the female weights are more concentrated around their mean, resulting in fewer bins with higher peaks.
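
These observations can be checked numerically. The sketch below regenerates the weights (unrounded, drawn in the same order as in Example A) and compares their sample means, sample standard deviations, and tallest-bin counts; the 1 lb bin width and the variable names are assumptions made for illustration.

```r
set.seed(123)
male_height   <- rnorm(10000, mean = 72,  sd = 4)   # drawn only to keep the order
male_weight   <- rnorm(10000, mean = 180, sd = 4)
female_height <- rnorm(10000, mean = 60,  sd = 2)   # drawn only to keep the order
female_weight <- rnorm(10000, mean = 130, sd = 2)

# Sample means sit close to the centers 180 and 130
mean_m <- mean(male_weight)
mean_f <- mean(female_weight)

# Sample standard deviations sit close to 4 and 2
sd_m <- sd(male_weight)
sd_f <- sd(female_weight)

# Count the observations per 1 lb bin without drawing anything
breaks <- seq(floor(min(female_weight, male_weight)),
              ceiling(max(female_weight, male_weight)), by = 1)
counts_m <- hist(male_weight,   breaks = breaks, plot = FALSE)$counts
counts_f <- hist(female_weight, breaks = breaks, plot = FALSE)$counts

sum(counts_f > 0) < sum(counts_m > 0)  # females occupy fewer bins
max(counts_f) > max(counts_m)          # and their tallest bin is higher
```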

Example B: Calculating Percentages

  • Now let’s find what percentage of males have heights between 65 and 75 inches, the region shown below:

Calculating area of region

  • Using the following formula and a z-table, we can determine the area, or percentage, of males with heights within the desired bounds. Here \(Z_a\) is the z-value for 65 inches and \(Z_b\) is the z-value for 75 inches; the z-table gives the area to the left of each.

  • \(Z\text{-value} = \frac{x - \mu}{\sigma}\)

  • Therefore, \(Z_{a} = \frac{65 - 72}{4} = -1.75\) and \(Z_{b} = \frac{75 - 72}{4} = 0.75\)

  • Using any z-table we have: \(P(Z < Z_{b}) - P(Z < Z_{a}) = 0.7734 - 0.0401 = 0.7333\)

  • Thus, 73.33% of the male population has heights between 65 and 75 inches.
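
The same area can be obtained without a z-table. As a short sketch, R's pnorm() returns the area to the left of a z-value under the standard normal curve; the variable names below are illustrative.

```r
mu <- 72; sigma <- 4   # male height parameters from Example A
a <- 65; b <- 75       # bounds of the region, in inches

# Standardize each bound
z_a <- (a - mu) / sigma   # -1.75
z_b <- (b - mu) / sigma   # 0.75

# Area between the bounds = area left of z_b minus area left of z_a
area <- pnorm(z_b) - pnorm(z_a)
round(area, 4)  # 0.7333
```

Calling pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma) gives the same result and skips the explicit standardization step.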