THE RULE OF BIG NUMBERS

The objective of this post is developing the R code to prove, graphically, the rule of big numbers and see how work some powerful R tools.

Definition

We can establish the following: when a sample tens to infinite, or is growing, the sample mean will be closer to population mean.

Empirically, we could demonstrate the last sentence with a practical exercise. First, we will take some samples normally distributed, each of them larger than previous one. The second step will be, for each sample, calculate the percentage of numbers between -1 and 1. Theoretically, we expected that the proportion tens to be 68.2%.

Before to start the exercise, we have to load the packages:

library(ggplot2) # MAKE GRAPHIPS
library(tidyverse) # TIDY DATA
library(kableExtra) # Formatting tables

Then, we are going to set a seed to maintain reproducible results.

set.seed(10)

Lets make a little example with a sample size of one thousand values.

sample1000<-rnorm(1000)
head(sample1000,20) # Only print the firsts 20 values

##  [1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513  0.38979430
##  [7] -1.20807618 -0.36367602 -1.62667268 -0.25647839  1.10177950  0.75578151
## [13] -0.23823356  0.98744470  0.74139013  0.08934727 -0.95494386 -0.19515038
## [19]  0.92552126  0.48297852

We could be able to know what of these numbers are less than 1 with the next line of code sample10<=1. The result is a logical vector when is TRUE the condition is satisfied:

head(sample1000<=1,20) # Only print the first 20 values

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

We can combine more than one condition as it is showed bellow:

head(sample1000>=-1&sample1000<=1,20)

##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

The result return our values of interest. Using this vector we can extract the numbers that are TRUE with squared brackets, []:

head(sample1000[sample1000>=-1&sample1000<=1],20)

##  [1]  0.01874617 -0.18425254 -0.59916772  0.29454513  0.38979430 -0.36367602
##  [7] -0.25647839  0.75578151 -0.23823356  0.98744470  0.74139013  0.08934727
## [13] -0.95494386 -0.19515038  0.92552126  0.48297852 -0.59631064 -0.67486594
## [19] -0.37366156 -0.68755543

Therefore, we have to divide the length of generated vector, which is equal to the numbers that satisfied the condition, between sample size:

length(sample1000[sample1000>=-1&sample1000<=1])/1000

## [1] 0.673

To generalize the previous results to different sample size, we will create a for loop. It will iterate through different sample size, as well as calculate the proportion of values that are between -1 & 1:

set.seed(10)
remove(i)
n<-seq(1000,1000000,by=1000)
p<-numeric()
for (i in 1:length(n)){
 sample<-rnorm(n[i]) 
 div<-length(sample[sample>=-1&sample<=1])/n[i]
 p[i]<-div
}

To understand what the previous code is doing, lets split it:

n is a vector that contains different samples size. In each iteration, we are going to use a different sample size from 1,000 to 1,000,000, increasing by 1,000.
The sample object contains numbers normally distributed and its size is equal to \(n_i\).
The div object is calculating the proportion of the numbers that is between -1 and 1 divided by \(n_i\).
The object p saves the results for each sample.

To better visualize the results, we could plot the proportion for each of 1,000 calculated samples and see that when the size of the sample is growing the proportion tends to 62.8%, as the rule of the big numbers says.

df_results<-data.frame(n,p)

df_results %>% ggplot(aes(x=n,y=p))+ geom_line(color='#11A579', size=1.3) +       geom_point(color='black') +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE))+
  scale_x_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE))+
  theme_minimal() + labs(x="",y="") +geom_smooth(se=FALSE, color='gray', size=1.5)

Finally, we created a histogram based on the sample of size 1,000,000. It is easy to see that the largest amount of data is inside one standard deviation.

df_hist<-data.frame(num=seq(1,length(sample),by=1),sample)

df_hist %>% ggplot(aes(x=sample)) + 
  geom_histogram(color='#11A579',fill='#11A579',alpha=0.5) + labs(x="",y="") +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme_minimal() + 
  geom_vline(xintercept =c(-1,1), size=1, linetype="dashed")+
  annotate("text", x = -1.7, y = 110000, parse = TRUE,
           label = "-sigma", size=7) +
  annotate("text", x = 1.7, y = 110000, parse = TRUE,
           label = "sigma", size=7)