Lab 1: Zipf's Law

Pallavi Prakash

Preliminary Tasks


  • Size of patentsample.txt: The document had 107088 word tokens.
  • Record the size of the dictionary for patentsample.txt: The document has 7856 word types.
  • Contents of the patentsample.txt: The document is a compiled group of patents.
  • Predicting word frequencies in patentsample.txt: I predict that common words will have regular word frequencies, while technology-specific words will have higher word frequencies than usual.

Natural Language : English : Zipf's Plot

  • The preliminary task was to analyze patent samples in English. This was a slightly skewed sample due to the specificity of the content.


English

Constructed Languages

For my lab, I will observe the Zipf's Law plots of constructed languages.

  • Constructed languages are languages developed with one or more directives or goals. This is unlike natural languages, which developed organically and do not adhere perfectly to any linguistic directive.
  • I used Esperanto, Lojban and Elvish in this lab. Each had a different linguistic goal in its development.

Analysis

  • Open source .pdfs were taken from the internet. Most texts downloaded were non-technical fiction texts. Each pdf was converted to a .txt file using pdftotxt and then combined into one .txt file for analysis through doc2freq.py. Then the generated .tsv was used to produce a log-log plot of word rank and word frequency using zipf.R.

  • For a preliminary Zipf's Law analysis, accents were ignored in this process. Each character that had an accent had its accent removed during the conversion from .pdf to .txt. I believe this makes the analysis less accurate: accents can distinguish otherwise identical words, so removing them can merge distinct word types into one.
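The counting step described above can be sketched roughly as follows. This is not doc2freq.py itself, whose source is not shown here. The function names strip_accents and rank_frequency, the tokenization rule, and the lowercasing are all assumptions made for illustration. The accent stripping imitates in code what happened during the .pdf to .txt conversion.

```python
import re
import unicodedata
from collections import Counter


def strip_accents(text):
    """Drop combining marks so that, e.g., 'ĉ' becomes 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumed tokenization: lowercase runs of letters, allowing
    # apostrophes inside a word.
    pattern = r"[^\W\d_]+(?:'[^\W\d_]+)*"
    tokens = re.findall(pattern, strip_accents(text).lower())
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    sample = "La kato kaj la hundo. La ĉevalo!"
    rows = rank_frequency(sample)
    print("word types:", len(rows))
    print("word tokens:", sum(count for _, _, count in rows))
    for rank, word, count in rows:
        print(f"{rank}\t{word}\t{count}")
```

The number of rows is the number of word types, and the sum of the counts is the number of word tokens. A different tokenization rule would change both figures, so the counts reported in this lab should not be expected to match this sketch exactly.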

Constructed Language 1 : Esperanto

  • Esperanto is a constructed auxiliary language created with the intention of being a universal second language.
  • It is an a posteriori schematic language; that is, it is constructed using elements from natural languages with original grammar.
  • It uses the Latin alphabet along with traditionally Slavic accents. Its pronunciation is heavily based on Slavic languages and Romance languages.
  • Esperanto currently has several thousand native speakers and around 2 million speakers who speak it as a second or tertiary language. It is, by far, the most widely spoken constructed language.
  • Fun Fact: Voyager 1 has a message in Esperanto on it!

Hypothesis

  • Given the way Esperanto was constructed, I would hypothesize that Esperanto follows Zipf's Law.

Constructed Language 1 : Esperanto

  • A corpus was created of texts in Esperanto.

    • Texts in Esperanto were found here: http://i-espero.info/files/elibroj/
    • The texts found were literature from other languages translated into Esperanto. The largest group is the Oz stories, followed by the Alice in Wonderland books.
    • 21 texts were used in total.
  • The final .txt file was analyzed using zipf.R

    • It contained 694402 word tokens.
    • It contained 72269 word types (the number of words in the dictionary).

Constructed Language 1 : Esperanto : Zipf's Plot

Esperanto

The plotted results confirm the hypothesis: Esperanto follows Zipf's Law.
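Zipf's Law says that a word's frequency is roughly proportional to 1 / rank^s, with s close to 1, so a log-log plot of frequency against rank should look like a straight line with slope near -1. One rough way to put a number on what the plot shows is a least-squares fit of log(frequency) against log(rank). The sketch below is an illustration only: zipf.R is not shown here and may do nothing of the kind, and the function name zipf_slope is hypothetical.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive counts. A value near -1 is what
    Zipf's Law predicts.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Idealized counts that follow frequency = 1000 / rank exactly.
    ideal = [1000 / rank for rank in range(1, 51)]
    print(f"slope for ideal Zipf data: {zipf_slope(ideal):.3f}")
```

An unweighted fit over all ranks is dominated by the long tail of rare words, so the slope is best read as a rough check alongside the plot rather than as a precise estimate.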

Constructed Language 2 : Lojban

  • Lojban is a constructed, experimental, syntactically unambiguous language developed with the goal of unambiguous communication.
  • Lojban is considered both an a priori language (a constructed language not based on an existing natural language) and an a posteriori language. It is experimental in that its goal was to develop a language built around a specific linguistic feature.
  • Given its constructive goal, Lojban is a common choice for machine learning and translation efforts with respect to language processing.
  • It uses the Latin alphabet as well.

Hypothesis

  • Given the way Lojban was constructed, I would expect it to follow Zipf's Law, with certain caveats:
    • I would assume that since this language is so syntactically unambiguous, there would be a lot of word types and not as many tokens for each.

Constructed Language 2 : Lojban

  • A corpus was created of texts in Lojban.

    • Texts in Lojban were found here: http://tiki.lojban.org/tiki/tiki-index.php?page=Texts+in+Lojban&bl
    • The texts found were original texts written in Lojban. The largest group was poetry (claimed to be written in freestyle) by Robin's Palm. The corpus also included transcripts of conversations in Lojban on a list traffic server, and some of The Epic of Gilgamesh!
  • The final .txt file was analyzed using zipf.R

    • It contained 9410 word tokens.
    • It contained 1138 word types (the number of words in the dictionary).
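The hypothesis about "a lot of word types and not as many tokens for each" can be checked loosely by dividing tokens by types for each corpus. The sketch below uses only the counts reported in this lab. The helper name tokens_per_type is hypothetical.

```python
# (word tokens, word types) as reported for each corpus in this lab.
corpora = {
    "English (patentsample.txt)": (107088, 7856),
    "Esperanto": (694402, 72269),
    "Lojban": (9410, 1138),
}


def tokens_per_type(tokens, types):
    """Average number of occurrences per distinct word."""
    return tokens / types


if __name__ == "__main__":
    for name, (tokens, types) in corpora.items():
        ratio = tokens_per_type(tokens, types)
        print(f"{name}: {ratio:.2f} tokens per type")
```

Lojban comes out with the fewest tokens per type of the three, but it is also by far the smallest corpus. The number of word types grows more slowly than the number of tokens as a corpus gets larger, so ratios from corpora of such different sizes are not directly comparable.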

Constructed Language 2 : Lojban : Zipf's Plot


Lojban

The plotted results seem to confirm the hypothesis, and also seem to roughly follow Zipf's Law.

Zipf's Law Comparisons




English Esperanto Lojban