Exploring Primes

By: Jared Dangar

Synopsis

 In this informal report, we will be exploring the distribution and frequency of low number prime numbers. We will begin be finding the spread of the prime numbers between 1 and 1,000,000. As results develop, further hypothesis will be given and tests developed.
 

Load Data

Install Packages

 Here are the packages used in this analysis. Their version numbers will be listed at the end of the report.
suppressMessages(require(numbers)); suppressMessages(require(dplyr)); suppressMessages(require(ggplot2))

Find the Primes

 To begin with, we will analyze all of the prime numbers between 1 and 1,000,000. The isPrime function from the numbers package does a wonderful job in efficiently and accurately determining if a number is a prime and will be used frequently in the report. Here is its first appearance.
nums <- as.data.frame(1:1000000)
#Creates a logical vector
truths <- isPrime(nums[ , 1])
#What are the prime numbers?
primes <- nums[truths, ]
 The numbers are put in a dataframe in order to keep track of future identities of all of the numbers. For instance, a factor of "Prime" and "Not Prime" can be useful in graphing the spread of these numbers.
nums <- mutate(nums, 
               #By default, say all numbers are not primes
               Prime_Val = "Not Prime",
               #Set all of the bars in a bar plot to the same height
               Height_Of_Bar = 1) 
#Using the logical vector previously created, relabel the primes
nums[truths, 2] <- "Prime"
names(nums)[1] <- "num"
 Now we can look visually at the location of the prime numbers. Before plotting, I will claim that as the number line approaches infinity, there will be a gradual decrease in the number of primes. This is because, even though there are an infinite number of primes, it seems way more likely that some number will divide any given large number. 
 
 Here is their spread. 

 It is appears that my original hypothesis is incorrect. There does not appear to be less primes closer to 1,000,000 than closer to one. If anything, it looks like the opposite may be true. 
 
 Let's find out: Are there more primes between 1 and 500,000 or 500,001 and 1,000,000?
range1 <- filter(nums, num <= 500000)
# Value displayed inline below
range1_prime_num <- nrow(filter(range1, Prime_Val == "Prime"))
range2 <- filter(nums, num > 500000)
# Value displayed inline below
range2_prime_num <- nrow(filter(range2, Prime_Val == "Prime"))
 There are 41538 primes between 1 and 500,000 and 36960 primes between 500,001 and 1,000,000. So there are more primes closer to 1 than 1,000,000 but the difference isn't very large.

Ratio of Primes

 There are many primes between 1 and 1,000,000. But what percent of the 1,000,000 are primes?
# Value displayed below
prime_ratio <- round(length(primes) / nrow(nums) * 100, 2)
 The 78498 prime numbers between 1 and 1,000,000 make up 7.85% of all the numbers between 1 and 1,000,000. This begs the question, is that a coincidence, or is this a limit? In other words, do ALL primes make up approximately 7% of numbers. A quick graphic of the primes ratio as n approaches 1,000,000 will be useful here.
### First, some math -----------------------------------------------------------
nPrime <- function(num){
     ### This function returns the percent of primes from 1 to i
     return (length(Primes(num)) / num)
}

# Creates a blank vector for the ratios
ratios <- numeric()
# To save space and time, we're only testing 10,000 cases.
for (i in seq(1, nrow(nums), by = nrow(nums) / 10000)){
     ratios <- c(ratios, nPrime(i))
}
# Tacks on the new vector to it's corresponding numeric
estimator <- as.data.frame(cbind(seq(1, 
                                     nrow(nums), 
                                     by = nrow(nums) / 10000), 
                                 ratios))

### Plotting -------------------------------------------------------------------
p2 <- ggplot(data = estimator,
             aes(x = V1,
                 y = ratios))
p2 <- p2 + geom_line(size = 1.5,
                     col = "black")
p2 <- p2 + geom_abline(slope = 0,
                       intercept = prime_ratio / 100,
                       color = "red",
                       size = 1,
                       linetype = 2)
p2 <- p2 + labs(x = "Integer",
                y = "Num-Prime Ratio",
                title = "Figure 2: Lineplot of Prime Ratio")
p2 <- p2 + theme(plot.title = element_text(size = rel(1.7),
                                           vjust = 1),
                 axis.title.x = element_text(size = rel(1.5),
                                             vjust = -0.5),
                 axis.title.y = element_text(size = rel(1.5),
                                             vjust = 0.5),
                 axis.text.x = element_text(size = rel(1.2)),
                 axis.text.y = element_text(size = rel(1.2)),
                 panel.background = element_rect(fill = "white"))
p2

 It appears that 7.85% may indeed be around a limit of the total prime to non prime ratio. In this case, it is wise to check larger numbers for consistency. Since my laptop cannot handle vectors on the order of 1,000,000+ length, confidence intervals will be used to predict the true ratio.
 
# We're sampling so we must set the seed
set.seed(48884)
# Let's start with sampling 100,000 numbers from 1 to 1,000,000,000
rand_nums <- sample(1:1000000000, 100000, replace = F)
# Find the % of primes
p_bar <- length(rand_nums[isPrime(rand_nums)]) / length(rand_nums)
# Constuct confidence interval for proportion
#Values are displayed inline below
bounds <- round((p_bar + c(-1,1) * 1.96 * (sqrt(p_bar * (1 - p_bar) / length(rand_nums)))) * 100, 2) 
 We can be 95% confident that the percent of primes in 1 to 1,000,000,000 is between 5% and 5.27%. We observe that 7.85% is not within this confidence interval. This means that 7.85 is not a limit of the prime to non prime ratio. In fact,
### Math -----------------------------------------------------------------------
# Create a blank dataframe
values <- data.frame()
# Do this loop for powers of 10
for (i in 4:9){
     #Sample 2000 numbers for each power of 10
     rand_nums <- sample(1:10^i, 5000)
     p_bar <- length(rand_nums[isPrime(rand_nums)]) / length(rand_nums)
     bounds <- p_bar + c(-1,1) * 1.96 * (sqrt(p_bar * (1 - p_bar) / length(rand_nums)))
     values <- rbind(values, data.frame(n = 10^i, 
                                       l_bound = bounds[1], 
                                       u_bound = bounds[2]))
}

### Plotting -------------------------------------------------------------------
grapher <- data.frame()
for(i in values$n){
     min <- values[values$n == i, 2]
     range <- values[values$n == i, 3] - min
     for (j in 1:1000){
          grapher <- rbind(grapher, 
                           data.frame(n = paste("10^",
                                                log10(i),
                                                sep = ""), 
                                      #simulating a line
                                      value = min + range / 1000 * j))
     }
}

p3 <- ggplot(data = grapher,
             aes(x = as.factor(n),
                 y = value))
p3 <- p3 + geom_point(col = "red")
p3 <- p3 + scale_y_continuous(breaks = seq(0, 
                                           round(max(grapher$value), 
                                                 1) + 0.05,
                                           1/80),
                              labels = paste(seq(0, 
                                             round(max(grapher$value), 
                                                   1) + 0.05,
                                             1/80) * 100, "%", sep = ""))
p3 <- p3 + labs(y = "Percent Prime",
                x = "Least Upper Bound of Sampling Set\n(2,000 Sampled)",
                title = "Figure 3: 95% Confidence Intervals\nfor Small Powers of 10")
p3 <- p3 + coord_flip()
p3 <- p3 + theme(plot.title = element_text(size = rel(1.7),
                                           vjust = 1),
                 axis.title.x = element_text(size = rel(1),
                                             vjust = rel(-0.25)),
                 axis.title.y = element_text(size = rel(1),
                                             vjust = rel(1.25)),
                 axis.text.x = element_text(size = rel(1.2)),
                 axis.text.y = element_text(size = rel(1.2)),
                 panel.grid.major = element_line(size = rel(0.75),
                                                 colour = "grey"),
                 panel.background = element_rect(fill = "white"))
p3

 It appears that as we include more numbers into the pool for sampling, the confidence interval for the ratio of primes shifts closer to zero. My origibal hypothesis appears to be correct.
 

Conclusion

 Through several exploratory graphics in this report it can be seen that as n approaches infinity along the integers, the prime to non-prime ratio under n gets smaller. This approach is gradual, but it may be the case that the ratio when n approaches infinity is 0%. In other words, in the context of infinity, there may as well be no primes!
 

Environment

Sys.info()
##                      sysname                      release 
##                    "Windows"                      "7 x64" 
##                      version                     nodename 
## "build 7601, Service Pack 1"                "JDANGAR3-HP" 
##                      machine                        login 
##                     "x86-64"                   "jdangar3" 
##                         user               effective_user 
##                   "jdangar3"                   "jdangar3"
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.2   numbers_0.5-6
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6      knitr_1.10.5     magrittr_1.5     MASS_7.3-42     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.0         stringr_1.0.0   
##  [9] plyr_1.8.3       tools_3.2.0      parallel_3.2.0   grid_3.2.0      
## [13] gtable_0.1.2     DBI_0.3.1        htmltools_0.2.6  lazyeval_0.1.10 
## [17] assertthat_0.1   digest_0.6.8     reshape2_1.4.1   formatR_1.2     
## [21] evaluate_0.7     rmarkdown_0.7    labeling_0.3     stringi_0.5-5   
## [25] scales_0.2.5     proto_0.3-10