Objective

The objective of this assignment is to demonstrate Zipf's and Heaps' laws by working on a corpus of documents and obtaining the frequencies of its words. Our data consists of 33 different novels, and with a provided Java word-counting program we are able to count the occurrences of each word across all the novels.

Methodology

Two problems are approached. The first one is to see whether the rank-frequency distribution of our data follows Zipf's law with the following parameters: \(f = c * (rank + b)^{a}\)

The second problem is to see whether our data follows Heaps' law, by checking if the number of distinct terms in a piece of text with \(N\) words is approximately: \(d = k * N^{\beta}\).

Zipf’s Law

In order to verify Zipf's law, an index over all 33 novels was created so that every distinct word appears together with its frequency. The data passed through the following preprocessing steps:

  • Removing numbers
  • Removing punctuation
  • Removing web addresses
  • Removing Roman numerals
  • Removing words containing symbols or accented characters from other alphabets

Then the frequencies were ordered in descending order and a rank was generated.
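As a rough sketch of these steps in R (the complete script is in the Code section below; the file name "words.txt" is only a placeholder for the actual word-count output):

library(dplyr)
# Load the word counts produced by the Java program
words <- read.csv("words.txt", col.names = c("Word", "Frequency"),
                  stringsAsFactors = FALSE, sep = " ")
# Keep only plain alphabetic tokens (no digits, no punctuation)
words <- filter(words, !grepl('[[:digit:]]|[[:punct:]]', Word))
# Order by frequency and assign ranks 1, 2, 3, ...
words <- arrange(words, desc(Frequency))
words$Rank <- seq_len(nrow(words))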

After following the previous process we ended up with the following dataset:

##    Word Frequency Rank
## 1    he     27976    1
## 2   had     20869    2
## 3   you     17540    3
## 4 which     15442    4
## 5  have     14840    5
## 6  from     14012    6

Then we plotted the data in order to see whether it behaves as stated in Zipf's law, where the frequency drops sharply for the top ranks and then decays slowly into a long tail.

Since the frequency drops heavily as the rank increases, we could guess that the data does follow Zipf's law. If it does, we would expect an approximately straight line when plotting log(rank) against log(frequency).
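This expectation comes from taking logarithms of the law: for ranks much larger than the shift \(b\),

\[log(f) = log(c) + a * log(rank + b) \approx log(c) + a * log(rank),\]

which is a straight line with slope \(a\) and intercept \(log(c)\) in the log-log plot.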

After plotting the logarithms of the frequencies and ranks, we could argue that the relationship is close to linear. Now we have to fit the Zipf's law function to our data to see whether it follows the power law.

\[f = c * (rank + b)^{a}\]

In this case we only have the frequency and rank, which means the parameters c, b and a are unknown. To estimate them, we first take the logarithm of the function and then fit it with the nls() R function.
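The model fitted by nls() is the logarithmic form shown above, implemented by the zipf() helper in the Code section. A minimal sketch of the fit, assuming the ranked words table from before and reusing the visually chosen starting values:

zipf <- function(rank, a, b, c) log(c) + a * log(rank + b)
df <- data.frame(x_rank = words$Rank, y = log(words$Frequency))
fitted <- nls(y ~ zipf(x_rank, a, b, c), data = df,
              start = list(a = -1.3, b = 18, c = 1e6))
coef(fitted)

In this case we get the following results: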

a = -1.9644594
b = 538.5050541
c = 4.4635203 × 10^8

Now if we plot the model obtained with these optimized parameters together with our data, we obtain the following.

The most important region to fit is the tail of lower frequencies; since the fit holds there, this supports the conclusion that the data follows Zipf's law.

Heaps’ Law

For Heaps' law, indices containing different numbers of novels were first created. The indices contained 5, 10, 15, 20, 25 and 30 novels. The Java word-counting program was executed for each index, and the following dataset was obtained:

##   Number of novels DistinctWords WordOccurrences
## 1                1         13575          117795
## 2                1         34224          947652
## 3                1         40353         1146953
## 4                1         55392         1661664
## 5                1         60656         1968274

Heaps' law follows the function: \[d = k * N^{\beta}\] where d is the number of distinct words in the group of novels (DistinctWords), and N is the total number of words contained (WordOccurrences).
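As a quick illustration of what such a sub-linear exponent means: with \(\beta \approx 0.5\), quadrupling the amount of text only roughly doubles the number of distinct words, since \(4^{0.5} = 2\).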

In order to fit the curve to this data, the approach was to convert the formula into a linear one by applying logarithms:

\[log(d) = log(k) + \beta * log(N)\] If we consider a linear model of the form

\[y = \beta_{0} + \beta_{1} * X\]

then, by fitting this linear model, we can find the fit while recovering the constants through \(\beta_{0} = log(k)\) and \(\beta_{1} = \beta\):

\(k = 30.1585375\), \(\beta = 0.5199596\)
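As a quick check of these constants against the first collection in the table above: for \(N = 117795\) word occurrences the model predicts \(d = 30.16 * 117795^{0.52} \approx 13100\) distinct words, reasonably close to the 13,575 actually observed.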

As shown in the plot, the fitted curve approximates the data for the novel collections well. We can conclude that the data follows Heaps' law with the constants shown above.

Conclusion

From the work done we can see that, given a text corpus and after some preprocessing, we can show that Zipf's and Heaps' laws hold. The parameters of the laws can be estimated to fit the data and further be used for predictions.

Code

library(dplyr)
library(stringi)
library(gtools)

##### Load Data
words <- read.csv("Data/someFiles.txt", col.names = c("Word","Frequency"), encoding = "UTF-8",stringsAsFactors = FALSE, sep =" ")


##### Cleaning #####
  # Remove numbers
words <- filter(words, !grepl('[[:digit:]]+', Word ))
  # Remove anything with punctuation or webpages.
words <- filter(words, !grepl('[[:punct:]]+', Word ))
  # Remove roman numbers (Slows down the code, might be ok to keep)
isroman <- roman2int(words$Word)
words <- words[is.na(isroman),]
  # Remove words with Accents
unwanted_array = list(    'Š', 'š', 'Ž', 'ž', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É',
                          'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù',
                          'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç',
                          'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ',
                          'ö', 'ø', 'ù', 'ú', 'û', 'ý', 'ý', 'þ', 'ÿ' )
words <-  filter(words, !grepl(paste(unwanted_array, collapse='|'), Word ))
  # Remove the last 9 rows (trailing entries at the end of the word-count file)
words <- words[1:(nrow(words)-9),]
#write.table(words, "processed.txt")


####### Create Frequency-Rank Table #######
  # Arrange
words = words %>% arrange(desc(Frequency))
head(words)
  # Add Rank
words$Rank = 1:nrow(words)
  # Plot Frequency vs. Rank
plot(words$Rank, words$Frequency, xlab= "Rank", ylab="Frequency", main="Word Distribution")
  # Remove the frequencies lower than 1
words <- words %>% filter(Frequency > 1)
plot(words$Rank, words$Frequency, xlab= "Rank", ylab="Frequency", main="Word Distribution")

  #Logarithm
x <- log(words$Rank)
y <- log(words$Frequency)
df <- data.frame(x,y)

x_rank <- words$Rank
 
plot(x, y, xlab = "log(Rank)", ylab = "log(Frequency)", main = "Zipf's Law")

#Zipf's Function
zipf <- function(rank, a,b,c) {
  return(log(c)+a*log(rank+b))
}

# Choosing the best parameters visually
lines(x, zipf(x_rank,-1.3,18,1000000),col = "green", lwd=3)

# Estimating the parameters
fitted <- nls(y ~ zipf(x_rank,a,b,c), data = df, start = list(a=-1.3, b=18, c=1000000), trace = TRUE)

# Plotting the Zipf's law with the parameters estimated
lines(x, predict(fitted),lty=2,col="lightblue",lwd=4)
Zipfs_coef = coef(fitted)


##### HEAPS LAW ######

files <- list.files("Data", full.names = TRUE)[1:5]

novels_collection = matrix(0, nrow = length(files), ncol = 2)
for(i in 1:length(files)){
  x <- read.csv(files[i], col.names = c("DistinctWords","WordOccurrences"), stringsAsFactors = FALSE, sep =" ")
  # The totals sit in the rows near the end of each file;
  # the distinct-word count carries a trailing character that is stripped off
  b = tail(x,3)[2,1]
  novels_collection[i,1] <- as.numeric(substr(b, 1, nchar(b)-1))
  novels_collection[i,2] <- as.numeric(tail(x,3)[3,2])
}
novels_collection <-  as.data.frame(novels_collection)
colnames(novels_collection) =  c("DistinctWords","WordOccurrences")
Novels <- 1:length(files)
novels_collection = cbind("Number of novels" = Novels, novels_collection)

print(novels_collection)

# Function for Heaps law
heaps <- function(K, n, B){
  K*n^B
}
heaps(2,117795,.7)
# y = K * n ^ B ==> log(y) = log(K) + B * log(n)

fitHeaps <- lm(log(DistinctWords) ~ log(WordOccurrences), data = novels_collection[,2:3])

logK <- summary(fitHeaps)$coef[1]
B <-  summary(fitHeaps)$coef[2]
K = exp(logK)

EstDistWords = heaps(K,novels_collection$WordOccurrences,B)
EstDistWords


#Plot of Data
plot(novels_collection$WordOccurrences, novels_collection$DistinctWords,
     xlab= "Word Occurrences (N)", ylab="Distinct Words (d)", main="Heaps' Law", pch=19)
lines(novels_collection$WordOccurrences,EstDistWords, lwd = 3, col="lightblue")