Libraries used in the Assignment

library(tidytext)
library(stringr)
library(stm)
library(dplyr)
library(qdapRegex)
library(Zelig)
library(aRxiv)
library(lubridate)
library(mosaic)
library(tm)
library(wordcloud)

Retrieval of Data from arXiv Package

ProgrammingPapers <- arxiv_search(query = '" Programming Languages"', limit = 200)

Programming languages are widely used throughout industry. From developing new software to creating the next smartphone application, computer programming is central to technological innovation. Thanks to the power and efficiency of computer programming, technology continues to evolve. This assignment is written in R, a programming language for advanced statistical computing and analysis that serves as an alternative to statistical packages such as SPSS and Stata. For this assignment, the R package aRxiv is used to retrieve research papers about programming languages from arXiv and to find the most common terms in their abstracts.

head(ProgrammingPapers)

##                   id           submitted             updated
## 1       cs/9301115v1 1991-12-01 00:00:00 1991-12-01 00:00:00
## 2 alg-geom/9203003v1 1992-03-18 15:25:55 1992-03-18 15:25:55
## 3       cs/9401102v1 1994-01-01 00:00:00 1994-01-01 00:00:00
## 4   cmp-lg/9406019v3 1994-06-10 09:17:05 1994-06-17 11:51:36
## 5   cmp-lg/9406026v1 1994-06-17 01:26:29 1994-06-17 01:26:29
## 6   cmp-lg/9409006v1 1994-09-07 14:37:03 1994-09-07 14:37:03
##                                                         title
## 1                                 Context-free multilanguages
## 2 Computing the Cohomological Brauer Group of a Toric Variety
## 3                          Mini-indexes for literate programs
## 4                     A Complete and Recursive Feature Theory
## 5                          The Very Idea of Dynamic Semantics
## 6                      Situated Modeling of Epistemic Puzzles
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       abstract
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                 This article is a sketch of ideas that were once intended to appear in the\nauthor's famous series, "The Art of Computer Programming". He generalizes the\nnotion of a context-free language from a set to a multiset of words over an\nalphabet. The idea is to keep track of the number of ways to parse a string.\nFor example, "fruit flies like a banana" can famously be parsed in two ways;\nanalogous examples in the setting of programming languages may yet be important\nin the future.\n  The treatment is informal but essentially rigorous.\n
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       The purpose of this article is to show how one might compute the \\'etale\ncohomology groups $H^p_{\\acute{e}t}(X,G_m)$ in degrees $p=0$, $1$ and $2$ of a\ntoric variety $X$ with coefficients in the sheaf of units. The method is to\nreduce the computation down to the problem of diagonalizing a matrix with\nintegral coefficients. The procedure outlined in this article has been fully\nimplemented by the author as a program written in the ``C'' programming\nlanguage.\n
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             This paper describes how to implement a documentation technique that helps\nreaders to understand large programs or collections of programs, by providing\nlocal indexes to all identifiers that are visible on every two-page spread. A\ndetailed example is given for a program that finds all Hamiltonian circuits in\nan undirected graph.\n
## 4   Various feature descriptions are being employed in logic programming\nlanguages and constrained-based grammar formalisms. The common notational\nprimitive of these descriptions are functional attributes called features. The\ndescriptions considered in this paper are the possibly quantified first-order\nformulae obtained from a signature of binary and unary predicates called\nfeatures and sorts, respectively. We establish a first-order theory FT by means\nof three axiom schemes, show its completeness, and construct three elementarily\nequivalent models. One of the models consists of so-called feature graphs, a\ndata structure common in computational linguistics. The other two models\nconsist of so-called feature trees, a record-like data structure generalizing\nthe trees corresponding to first-order terms. Our completeness proof exhibits a\nterminating simplification system deciding validity and satisfiability of\npossibly quantified feature descriptions.\n
## 5                                                                                                    "Natural languages are programming languages for minds." Can we or should we\ntake this slogan seriously? If so, how? Can answers be found by looking at the\nvarious "dynamic" treatments of natural language developed over the last decade\nor so, mostly in response to problems associated with donkey anaphora? In\nDynamic Logic of Programs, the meaning of a program is a binary relation on the\nset of states of some abstract machine. This relation is meant to model aspects\nof the effects of the execution of the program, in particular its input-output\nbehavior. What, if anything, are the dynamic aspects of various proposed\ndynamic semantics for natural languages supposed to model? Is there anything\ndynamic to be modeled? If not, what is all the full about? We shall try to\nanswer some, at least, of these questions and provide materials for answers to\nothers.\n
## 6                                                                                                                                                                                                                                                                                                                                                                                                         Situation theory is a mathematical theory of meaning introduced by Jon\nBarwise and John Perry. It has evoked great theoretical and practical interest\nand motivated the framework of a few `computational' systems. PROSIT is the\npioneering work in this direction. Unfortunately, there is a lack of real-life\napplications on these systems and this study is a preliminary attempt to remedy\nthis deficiency. Here, we examine how much PROSIT reflects situation-theoretic\nconcepts and solve a group of epistemic puzzles using the constructs provided\nby this programming language.\n
##                     authors                                affiliations
## 1           Donald E. Knuth                                            
## 2           Timothy J. Ford                                            
## 3           Donald E. Knuth                                            
## 4 Rolf Backofen|Gert Smolka                                            
## 5              David Israel         Artificial Intelligence Center, SRI
## 6   Murat Ersan|Varol Akman Brown University|Bilkent University, Ankara
##                             link_abstract
## 1       http://arxiv.org/abs/cs/9301115v1
## 2 http://arxiv.org/abs/alg-geom/9203003v1
## 3       http://arxiv.org/abs/cs/9401102v1
## 4   http://arxiv.org/abs/cmp-lg/9406019v3
## 5   http://arxiv.org/abs/cmp-lg/9406026v1
## 6   http://arxiv.org/abs/cmp-lg/9409006v1
##                                  link_pdf link_doi
## 1       http://arxiv.org/pdf/cs/9301115v1         
## 2 http://arxiv.org/pdf/alg-geom/9203003v1         
## 3       http://arxiv.org/pdf/cs/9401102v1         
## 4   http://arxiv.org/pdf/cmp-lg/9406019v3         
## 5   http://arxiv.org/pdf/cmp-lg/9406026v1         
## 6   http://arxiv.org/pdf/cmp-lg/9409006v1         
##                                                                                                                                                                                                                                                                                                                        comment
## 1                                                                                                                                                                                                                                                                                             Abstract added by Greg Kuperberg
## 2                                                                                                                                                                                                                                                                                                           3 pages, AMS-LaTeX
## 3                                                                                                                                                                                                                                                                                                                             
## 4                                                                                                                                                                                                                        Short version appeared in the 1992 Annual Meeting of the Association\n  for Computational Linguistics
## 5                                                                                                                                                                                                                                                                                                      22 pages. Vanilla LaTex
## 6 iii + 49 pages, compressed, uuencoded Postscript file; revised\n  version of the first author's Bilkent M.S. thesis, written under the\n  supervision of the second author; notify Akman via e-mail\n  (akman@cs.bilkent.edu.tr) or fax (+90-312-266-4126) if you are unable to\n  obtain hardcopy, he'll work out something
##                                                     journal_ref doi
## 1 Theoretical Studies in Computer Science, Ginsburg Festschrift    
## 2                                                                  
## 3               Software -- Concepts and Tools 15 (1994), 2--11    
## 4                                                                  
## 5                        Proc. Ninth Amsterdam Colloquium, 1993    
## 6                                                                  
##   primary_category       categories
## 1            cs.DS            cs.DS
## 2         alg-geom alg-geom|math.AG
## 3            cs.PL            cs.PL
## 4           cmp-lg     cmp-lg|cs.CL
## 5           cmp-lg     cmp-lg|cs.CL
## 6           cmp-lg     cmp-lg|cs.CL

Cleaning up the Dates

ProgrammingPapers <- ProgrammingPapers %>%
  mutate(submitted = ymd_hms(submitted), updated = ymd_hms(updated))
glimpse(ProgrammingPapers)

## Observations: 200
## Variables: 15
## $ id               <chr> "cs/9301115v1", "alg-geom/9203003v1", "cs/940...
## $ submitted        <dttm> 1991-12-01 00:00:00, 1992-03-18 15:25:55, 19...
## $ updated          <dttm> 1991-12-01 00:00:00, 1992-03-18 15:25:55, 19...
## $ title            <chr> "Context-free multilanguages", "Computing the...
## $ abstract         <chr> "  This article is a sketch of ideas that wer...
## $ authors          <chr> "Donald E. Knuth", "Timothy J. Ford", "Donald...
## $ affiliations     <chr> "", "", "", "", "Artificial Intelligence Cent...
## $ link_abstract    <chr> "http://arxiv.org/abs/cs/9301115v1", "http://...
## $ link_pdf         <chr> "http://arxiv.org/pdf/cs/9301115v1", "http://...
## $ link_doi         <chr> "", "", "", "", "", "", "", "http://dx.doi.or...
## $ comment          <chr> "Abstract added by Greg Kuperberg", "3 pages,...
## $ journal_ref      <chr> "Theoretical Studies in Computer Science, Gin...
## $ doi              <chr> "", "", "", "", "", "", "", "10.1016/0010-465...
## $ primary_category <chr> "cs.DS", "alg-geom", "cs.PL", "cmp-lg", "cmp-...
## $ categories       <chr> "cs.DS", "alg-geom|math.AG", "cs.PL", "cmp-lg...

Submission Year

xtabs(~ year(submitted), data = ProgrammingPapers)

## year(submitted)
## 1991 1992 1994 1995 1996 1997 1998 1999 2000 2001 2002 
##    1    1    7    3    3    2   17   16   51   67   32

More papers were submitted in 2001 (67 of 200) than in any other year. Furthermore, the 150 papers submitted in 2000, 2001, and 2002 are three times the 50 papers submitted between 1991 and 1999. There are several possible reasons for this disparity, such as the growing popularity of technology and programming at the start of the 21st century. Another possible cause is the attention drawn by the Y2K problem. Because the search was limited to 200 results and exactly 200 were returned, the count for 2002 may be incomplete.
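The shares behind these statements can be checked directly from the table above. The following is a minimal base R sketch: the counts are copied by hand from the `xtabs()` output, and the variable names are illustrative rather than part of the original analysis.

```r
# Counts copied by hand from the xtabs() output above (illustrative names).
papers_by_year <- c(
  "1991" = 1, "1992" = 1, "1994" = 7, "1995" = 3, "1996" = 3, "1997" = 2,
  "1998" = 17, "1999" = 16, "2000" = 51, "2001" = 67, "2002" = 32
)

total <- sum(papers_by_year)

# Year with the most submissions and its share of all papers.
peak_year  <- names(papers_by_year)[which.max(papers_by_year)]
peak_share <- max(papers_by_year) / total

# Papers submitted in 2000-2002 versus 1991-1999.
years  <- as.integer(names(papers_by_year))
recent <- sum(papers_by_year[years >= 2000])
early  <- sum(papers_by_year[years <= 1999])

cat("Total papers:", total, "\n")
cat("Peak year:", peak_year, sprintf("(%.1f%% of papers)", 100 * peak_share), "\n")
cat("2000-2002:", recent, " 1991-1999:", early, "\n")
```

With the full data frame, the same figures could be derived from `year(submitted)` instead of hand-copied counts.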

Fields

xtabs(~ primary_category, data = ProgrammingPapers)

## primary_category
##        alg-geom        astro-ph          cmp-lg cond-mat.str-el 
##               1               1               8               1 
##           cs.AI           cs.CL           cs.DC           cs.DS 
##               9               4               1               1 
##           cs.IR           cs.LO           cs.NI           cs.OS 
##               2              37               2               1 
##           cs.PL           cs.SC           cs.SE          hep-ex 
##              93               3              23               2 
##         hep-lat          hep-ph          hep-th         math.LO 
##               1               2               1               1 
##         math.NA  physics.acc-ph physics.ins-det        quant-ph 
##               1               1               1               3

Focus on the primary fields

ProgrammingPapers %>%
  mutate(field = str_extract(primary_category, "^[a-z-]+"))  %>%
  mosaic::tally(x = ~field)  %>%
  sort()

## field
## alg-geom astro-ph cond-mat  hep-lat   hep-th   hep-ex   hep-ph     math 
##        1        1        1        1        1        2        2        2 
##  physics quant-ph   cmp-lg       cs 
##        2        3        8      176
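The field extraction above keeps only the part of each category that precedes the first dot. The sketch below shows the same idea in base R on a handful of category strings taken from the table above. It uses `regmatches()` and `regexpr()` in place of `str_extract()`, and the vector names are illustrative.

```r
# Sample values taken from the primary_category table above.
categories <- c("cs.PL", "cond-mat.str-el", "hep-ex", "quant-ph",
                "physics.acc-ph", "math.LO", "cs.LO")

# Keep the leading run of lowercase letters and hyphens, i.e. everything
# before the first dot; categories without a dot are returned unchanged.
field <- regmatches(categories, regexpr("^[a-z-]+", categories))
print(field)

# Count entries per top-level field and sort ascending, as in the output above.
print(sort(table(field)))
```

Note that `regmatches()` silently drops strings with no match, whereas `str_extract()` returns `NA` for them. Every string in this sample matches, so the difference does not matter here.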

Text Corpus

Corpus <- with(ProgrammingPapers, VCorpus(VectorSource(abstract)))
Corpus[[1]]  %>%
  as.character()  %>%
  strwrap()

## [1] "This article is a sketch of ideas that were once intended to"       
## [2] "appear in the author's famous series, \"The Art of Computer"        
## [3] "Programming\". He generalizes the notion of a context-free language"
## [4] "from a set to a multiset of words over an alphabet. The idea is to" 
## [5] "keep track of the number of ways to parse a string. For example,"   
## [6] "\"fruit flies like a banana\" can famously be parsed in two ways;"  
## [7] "analogous examples in the setting of programming languages may yet" 
## [8] "be important in the future.  The treatment is informal but"         
## [9] "essentially rigorous."

Corpus <- Corpus  %>%
  tm_map(stripWhitespace)  %>%
  tm_map(removeNumbers)  %>%
  tm_map(removePunctuation)  %>%
  tm_map(content_transformer(tolower))  %>%
  tm_map(removeWords, c(stopwords("english"), "also", "will", "uses", "like", "can", "using", "used"))
strwrap(as.character(Corpus[[1]]))

## [1] "article sketch ideas intended appear authors famous series art"  
## [2] "computer programming generalizes notion contextfree language set"
## [3] "multiset words alphabet idea keep track number ways parse string"
## [4] "example fruit flies banana famously parsed two ways analogous"   
## [5] "examples setting programming languages may yet important future" 
## [6] "treatment informal essentially rigorous"

The code above collapses extra white space, removes digits and punctuation, converts all letters to lower case, and removes English stop words. The words also, will, uses, like, can, using, and used were removed as well so that more informative key words stand out in the analysis. One side effect of removing punctuation is visible in the output: "context-free" becomes "contextfree" and "author's" becomes "authors".
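To make the individual steps concrete, here is a rough base R approximation of the cleaning pipeline. It is a sketch only and not the tm implementation: the function `clean_text()` and the short stop list are hypothetical, and tm's `stopwords("english")` list is much longer.

```r
# Base R approximation of the cleaning steps; not the tm implementation.
clean_text <- function(x, drop_words) {
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  x <- gsub("[[:digit:]]+", "", x)    # remove digits
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- tolower(x)                     # convert to lower case
  # Drop whole words only, then tidy the gaps left behind.
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub(" +", " ", x))
}

# A short illustrative stop list (assumption, for demonstration only).
drop_words <- c("the", "a", "of", "from", "to", "he", "can", "like")

cleaned <- clean_text(
  "He generalizes the notion of a context-free language\nfrom a set to a multiset.",
  drop_words
)
print(cleaned)
```

The order of the steps matters: because punctuation is removed before the stop words, a hyphenated term such as "context-free" is fused into one token rather than split into two.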

Word Cloud

set.seed(1234)
wordcloud(Corpus, max.words = 200, scale = c(8, 1),
          colors = topo.colors(n = 200), random.color = TRUE)

The word cloud shows that languages, termination, analysis, code, and system were the most frequent words in the abstracts of the programming research papers. Words such as software, quantum, transformation, and computation were also prominent. Interestingly, the word cloud did not show the names of actual programming languages such as C++, Java, Python, or VBA. It is possible that the words found above are more widely used to describe programming languages and their processes. The cleaning steps may also play a part, since removing punctuation reduces a name like "C++" to the single letter "c".
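A word cloud is a picture of word frequencies, so the same ranking can be read from a simple frequency table. The base R sketch below counts words in three made-up cleaned abstracts (hypothetical data, for illustration only) and then applies the digit and punctuation removal to a few language names to show what remains of them.

```r
# Hypothetical cleaned abstracts, for illustration only.
docs <- c("logic programming languages termination analysis",
          "static analysis code programming languages",
          "termination analysis logic programs")

# Split on spaces and count how often each word occurs across all documents.
words <- unlist(strsplit(docs, " ", fixed = TRUE))
freq  <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))

# Language names after digit and punctuation removal plus lower-casing:
# "C++" is reduced to the single letter "c".
lang_names    <- c("C++", "Java", "Python", "VBA")
cleaned_names <- tolower(gsub("[[:punct:][:digit:]]+", "", lang_names))
print(cleaned_names)
```

On the real corpus, a comparable table could be built from a term-document matrix of the cleaned abstracts. That route would need checking against the tm documentation, since tm's default settings are understood to discard very short terms.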

Homework 09: Exploring common terminologies used in Computer Programming

Fahad Ahmed

May 3, 2017