March 25, 2020

##LibraryPackages in Use and Data Frame Read-in

Dataset cereal

cerealAll <- read.csv("cereal.csv", sep=",", header=TRUE)
cerealDisplay <- cerealAll %>% select(-name, -type, -weight, -cups)
cerealDisplay <- cerealDisplay%>% select(-vitamins, -shelf)
head(cerealDisplay)
  mfr calories protein fat sodium fiber carbo sugars potass   rating
1   N       70       4   1    130  10.0   5.0      6    280 68.40297
2   Q      120       3   5     15   2.0   8.0      8    135 33.98368
3   K       70       4   1    260   9.0   7.0      5    320 59.42551
4   K       50       4   0    140  14.0   8.0      0    330 93.70491
5   R      110       2   2    200   1.0  14.0      8     -1 34.38484
6   G      110       2   2    180   1.5  10.5     10     70 29.50954

Population Mean

We will begin by finding the actual mean (for demonstrating the Law of Large Numbers combined with Central Limit Theorem) of the data set with respect to the ratings category from the previous slide. The mean is defined as:

\[\displaystyle \mu = { \sum_{i=1}^n (x_i) \over n }\]

Note: We are assuming the dataset itself is the population for demonstrative purposes

rating <- cerealAll$rating
obs <- length(rating)
popMean = sum(rating)/obs
popMean
[1] 42.6657

Collecting Random Samples

Now that we have the population mean, we may start taking random samples. I will be showing the concept of the Central Limit Theorem by changing the observations per sample size of 5’s, 10’s, 20’s, and 40’s. We should see a fairly normal approximation distribution as well as a tighter cluster as the sample size increase in each one. I will take 100 samples of each size and we will display them in plots.

Sample Collecting Code

fiveSize <- seq(1:100)
tenSize <- seq(1:100)
twentySize <- seq(1:100)
fortySize <- seq(1:100)
str(fiveSize)
 int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
for(i in 1:100)
{
    fiveSize[i] <- sum(rating[sample(1:obs,5)])/5
    tenSize[i] <- sum(rating[sample(1:obs,10)])/10
    twentySize[i] <- sum(rating[sample(1:obs,20)])/20
    fortySize[i] <- sum(rating[sample(1:obs,40)])/40
}

Visualization

Visualization Cont. (5’s and 10’s)

Note The X range*

Visualization Cont. (20’s and 40’s)

Note The X range*

Code for all the graphs


fig <- plot_ly(alpha= 0.6)
fig <- fig %>% add_histogram(x = fiveSize, name ="5's" )
fig <- fig %>% add_histogram(x = tenSize, name="10's" )
fig <- fig %>% add_histogram(x = twentySize, name="20's" )
fig <- fig %>% add_histogram(x = fortySize, name="40's")
fig <- fig %>% layout(barmode = "overlay")

fig

fig1 <- plot_ly(alpha= 0.8)
fig2 <- plot_ly(alpha= 0.8)
fig1 <- fig1 %>% add_histogram(x = fiveSize, name="5's")
fig2 <- fig2 %>% add_histogram(x = tenSize, name="10's")

subplot(fig1,fig2)

fig3 <- plot_ly(alpha= 0.8)
fig4 <- plot_ly(alpha= 0.8)
fig3 <- fig3 %>% add_histogram(x = twentySize, name="20s's")
fig4 <- fig4 %>% add_histogram(x = fortySize, name="40's")

subplot(fig3,fig4)

Analysis

As we can see from the graphs before, the more observations we takke in a given sample, the more clustered the data becomes. The less we take the more spread out the data becomes. Notice that in the 40’s plot we barely go out of the range of the actual mean, with a very small standard deviation.

We will now look at the box plots for the data

Box Plot Analysis

Box Plot Analysis Cont.

Addition of confidence interval, maroon is 5’s, gold is 40’s, and black is the true population mean

allSamples <- c(fiveSize,tenSize,twentySize,fortySize)
categories <- c(replicate(100, "A5"),replicate(100,"B10"),
        replicate(100,"C20"),replicate(100,"D40"))
dataRatings <- data.frame(allSamples, categories)

p <- ggplot(dataRatings, aes(x = categories, y = allSamples,
                 fill = categories)) +
geom_boxplot(notch = T)

p

sampMeanFi  <- mean(fiveSize)
sampMeanFo  <- mean(fortySize)
sampSDFi <- sd(fiveSize)
sampSDFo <- sd(fortySize)

upConfFi <- sampMeanFi + 1.96*sampSDFi/sqrt(length(fiveSize))
lowConfFi <- sampMeanFi - 1.96*sampSDFi/sqrt(length(fiveSize))

upConfFo <- sampMeanFo + 1.96*sampSDFo/sqrt(length(tenSize))
lowConfFo <- sampMeanFo - 1.96*sampSDFo/sqrt(length(tenSize))

p <- p + geom_hline(yintercept=popMean, color="black") +
     geom_hline(yintercept=upConfFi, color="maroon",linetype="dashed") +
     geom_hline(yintercept=lowConfFi, color="maroon",linetype="dashed") +
     geom_hline(yintercept=upConfFo, color="gold",linetype="dashed") +
     geom_hline(yintercept=lowConfFo, color="gold",linetype="dashed")

p

Math Used in addition to the Mean

  • Confidence interval

\[ \bar{x} \pm Z_{\alpha \over 2} \cdot {sd \over \sqrt{n} }\] - Sample Standard Deviation

\[\sqrt{\sum_{i=1}^n (\bar{x} - x_i)^2 \over n-1}\]

End & Credits

As we can see, increasing obesrvation count is crucial to having more accurate results. Both the Law of Large Numbers and Central Limit Theorem play a huge roll in society and this trivial example shows how powerful they are.

Credits go to:

  • Kaggle for the data Set
  • ggplot2 for their good documentation
  • plotly for their good documentation
  • pandocs for converting all the different languages well