Trends in Programming Language Popularity

Indrani, Indranil, Sharanya, Michael

Data Science with R

plot of chunk homeui

Overview and Motivation

Evaluating Programming Language Popularity based on various measurements

Reddit and Stackoverflow as two different platforms with different user bases

Using text mining techniques to better understand the data

Reddit

  • Founded in 2005
  • Known as the front page of the internet
  • Community forum for any kind of news or questions
  • Targeted at young people/digital natives

Stackoverflow

  • Founded in 2008
  • Platform for questions regarding programming issues
  • Targeted at Software Developers

Research Questions

  • Can we decide which topic/language a certain post is about?

  • How do the number of comments and the number of posts correlate with popularity?

  • How does the popularity of programming languages change over time?

  • Can we predict the popularity of programming languages in the future?

  • How do the two platforms compare based on programming languages?

Related Work

  • Stackoverflow blog post on the growth of the R programming language

  • Stackoverflow blog post on the growth of the Python programming language

  • The PYPL and TIOBE indices for programming languages

  • Topic Modeling

Reddit Data

Here is an example of the Reddit data set that we used:

subreddit | title | author | selftext | id | domain | url | created_utc | score | num_comments | text_new | language | year | text_length | month | day
programming | Watch "Compiling and a Running a java program - Part 1" on YouTube | repterx | | 3z2kmn | youtu.be | https://youtu.be/oF4X6jq69PI | 2016-01-01 | 1 | 0 | watch "compiling and a running a java program - part 1" on youtube | Java | 2016 | 69 | 1 | 1
Python | MIT is offering one of the best, if not the best, "Introduction to Computer Science" courses for free and it uses Python 2.7! Course starts January 13th! | Longhorns2102 | | 3z2n8y | edx.org | https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-6 | 2016-01-01 | 153 | 50 | mit is offering one of the best, if not the best, "introduction to computer science" courses for free and it uses python 2.7! course starts january 13th! | Python | 2016 | 156 | 1 | 1
javascript | What would be a nice short intro tutorial for beginners to JavaScript? | jamesfinn180 | Soon I’ll be running a 15 minute introductory class to JavaScript where I’ll be taking a group of adult students with little to no programming experience and will spend the time building a little application with them. If I do this well, they won’t get too confused with anything and by the end of the 15 minutes they will have something to be proud of. The idea is to entice them into learning more programming. What I was considering was to build a simple Rock, Paper, Scissors game. With this they’ll be introduced to variables, functions, function parameters and if else statements however it seems a bit convoluted with all the nested conditional statements and I don’t want to scare them. Also I might be a bit over zealous with the amount I can showcase in 15 minutes so perhaps less is more. So does anyone have any suggestions for a short and sweet JavaScript application that could be a good learning experience for newcomers? | 3z2xrh | self.javascript | https://www.reddit.com/r/javascript/comments/3z2xrh/what_would_be_a_nice_short_intro_tutorial_for/ | 2016-01-02 | 1 | 4 | soon i’ll be running a 15 minute introductory class to javascript where i’ll be taking a group of adult students with little to no programming experience and will spend the time building a little application with them. if i do this well, they won’t get too confused with anything and by the end of the 15 minutes they will have something to be proud of. the idea is to entice them into learning more programming. what i was considering was to build a simple rock, paper, scissors game. with this they’ll be introduced to variables, functions, function parameters and if else statements however it seems a bit convoluted with all the nested conditional statements and i don’t want to scare them. also i might be a bit over zealous with the amount i can showcase in 15 minutes so perhaps less is more. so does anyone have any suggestions for a short and sweet javascript application that could be a good learning experience for newcomers? what would be a nice short intro tutorial for beginners to javascript? | JavaScript | 2016 | 1009 | 1 | 2
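
The columns `text_new`, `text_length`, `year`, `month` and `day` look like fields derived from the raw post. The sketch below shows one way such columns could be computed in base R. It assumes that `text_new` is the lower-cased body followed by the lower-cased title; the toy rows and that assumption are illustrative and not taken from the project's code.

```r
# Hypothetical raw posts; only the columns needed for the derived fields.
posts <- data.frame(
  title       = c("Intro to Python", "Why use R?"),
  selftext    = c("", "Looking for Opinions."),
  created_utc = as.Date(c("2016-01-01", "2016-01-02")),
  stringsAsFactors = FALSE
)

# Assumption: text_new = lower-cased body, then lower-cased title.
posts$text_new    <- tolower(trimws(paste(posts$selftext, posts$title)))
# Number of characters in the combined text.
posts$text_length <- nchar(posts$text_new)
# Split the creation date into calendar parts.
posts$year        <- as.integer(format(posts$created_utc, "%Y"))
posts$month       <- as.integer(format(posts$created_utc, "%m"))
posts$day         <- as.integer(format(posts$created_utc, "%d"))
```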

Stackoverflow Data

Here is an example of the Stackoverflow data set that we used:

Id PostTypeId AcceptedAnswerId ParentId CreationDate DeletionDate Score ViewCount Body OwnerUserId OwnerDisplayName LastEditorUserId LastEditorDisplayName LastEditDate LastActivityDate Title Tags AnswerCount FavoriteCount ClosedDate CommunityOwnedDate ContentLicense language text_new num_comments id text_length year month
57300494 1 NA NA 2019-08-01 00:43:04 NA 0 34 <p>When saving an animated GIF from a Numpy array of shape <code>(20, 64, 64, 3)</code> and loading it again, the shape is suddenly <code>(20, 64, 64)</code>. I think the array may contain indices into a color palette but I’m not sure how to access that. How can I restore the original data from the saved GIF?</p> <pre><code>import imageio import numpy as np imageio.mimsave(‘animation.gif’, np.zeros((20, 64, 64, 3))) np.array(imageio.mimread(‘animation.gif’)).shape # (20, 64, 64) </code></pre> 1079110 NA 2019-08-01 00:43:04 Why is the color dimension gone after saving and loading an animated GIF with imageio? <python><animated-gif><python-imageio> 0 1 CC BY-SA 4.0 Python <p>when saving an animated gif from a numpy array of shape <code>(20, 64, 64, 3)</code> and loading it again, the shape is suddenly <code>(20, 64, 64)</code>. i think the array may contain indices into a color palette but i’m not sure how to access that. how can i restore the original data from the saved gif?</p> <pre><code>import imageio import numpy as np imageio.mimsave(‘animation.gif’, np.zeros((20, 64, 64, 3))) np.array(imageio.mimread(‘animation.gif’)).shape # (20, 64, 64) </code></pre> why is the color dimension gone after saving and loading an animated gif with imageio? 2 57300494 586 2019 8
57300495 1 NA NA 2019-08-01 00:43:27 NA 32 12140 <p><a href=“http://clang.llvm.org/docs/Modules.html” rel=“noreferrer”>Clang</a> and <a href=“http://blogs.msdn.com/b/vcblog/archive/2015/12/03/c-modules-in-vs-2015-update-1.aspx” rel=“noreferrer”>MSVC</a> already supports <a href=“https://github.com/cplusplus/modules-ts” rel=“noreferrer”>Modules TS</a> from unfinished C++20 standard. Can I build my modules based project with CMake or other build system and how?</p> <p>I tried <a href=“https://build2.org/” rel=“noreferrer”>build2</a>, it supports modules and it works very well, but i have a <a href=“https://stackoverflow.com/questions/57296089/build2-analog-of-cmakes-find-package”>question</a> about it’s dependency management (UPD: question is closed).</p> 5468048 3204551 2019-10-05 20:37:33 2021-05-05 20:45:43 How to use c++20 modules with CMake? <c++><cmake><c++20><c++-modules> 5 3 CC BY-SA 4.0 C++ <p><a href=“http://clang.llvm.org/docs/modules.html” rel=“noreferrer”>clang</a> and <a href=“http://blogs.msdn.com/b/vcblog/archive/2015/12/03/c-modules-in-vs-2015-update-1.aspx” rel=“noreferrer”>msvc</a> already supports <a href=“https://github.com/cplusplus/modules-ts” rel=“noreferrer”>modules ts</a> from unfinished c++20 standard. can i build my modules based project with cmake or other build system and how?</p> <p>i tried <a href=“https://build2.org/” rel=“noreferrer”>build2</a>, it supports modules and it works very well, but i have a <a href=“https://stackoverflow.com/questions/57296089/build2-analog-of-cmakes-find-package”>question</a> about it’s dependency management (upd: question is closed).</p> how to use c++20 modules with cmake? 3 57300495 752 2019 8
57300497 1 57300621 NA 2019-08-01 00:44:06 NA 1 33 <p>So basically I have an array of gallery items which contains class “GalleryItem” objects (contains gallery name and list of images and their names).</p> <p>I also have a component called “Gallery” which takes GalleryItem class object as a prop and renders it.</p> <p>What I want to do is possibility to navigate with …/galleries/:galleryName to rendering the specific gallery inside single page.</p> <p>Galleries render fine, but I need this to work together with nested routes!</p> <pre><code>&lt;Switch&gt; &lt;Route path=“/galleries/:name” render={(props) =&gt; &lt;Gallery {…props} galleryItem={this.state.galleryItems[:name]} /&gt;} /&gt; &lt;Switch&gt; </code></pre> <p>Obviously this doesn’t work so I’m asking how it’s done and what to know if I’m doing that absolutely wrong.</p> 9854267 8330162 2019-09-16 15:42:44 2019-09-16 15:42:44 Is there a way to use the nested route ID as a prop argument in React Router <javascript><reactjs><routes><nested><react-router> 2 NA CC BY-SA 4.0 JavaScript <p>so basically i have an array of gallery items which contains class “galleryitem” objects (contains gallery name and list of images and their names).</p> <p>i also have a component called “gallery” which takes galleryitem class object as a prop and renders it.</p> <p>what i want to do is possibility to navigate with …/galleries/:galleryname to rendering the specific gallery inside single page.</p> <p>galleries render fine, but i need this to work together with nested routes!</p> <pre><code>&lt;switch&gt; &lt;route path=“/galleries/:name” render={(props) =&gt; &lt;gallery {…props} galleryitem={this.state.galleryitems[:name]} /&gt;} /&gt; &lt;switch&gt; </code></pre> <p>obviously this doesn’t work so i’m asking how it’s done and what to know if i’m doing that absolutely wrong.</p> is there a way to use the nested route id as a prop argument in react router 1 57300497 873 2019 8

Exploratory Data Analysis

  • Total number of posts
  • Average number of comments
  • Average text length
  • Distribution of text length
  • Correlation Analysis

Total Number of Posts

Reddit

plot of chunk totalPostsReddit

Stackoverflow

plot of chunk totalPostsStack

Total Number of Posts - Combined

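library(dplyr)
library(ggplot2)

# Grouped bar chart of the number of posts per language, one bar per platform.
# Expects the columns `language` and `Platform`.
plot_language_counts_bar_comb <- function(data) {
  data %>%
    group_by(language, Platform) %>%
    summarise(N = n(), .groups = "drop") %>%
    # Order languages by descending post count
    ggplot(aes(x = reorder(language, -N),
               y = N,
               fill = Platform)) +
      geom_bar(stat = "identity",
               position = position_dodge()) +
      labs(title = "Total number of posts per language",
           x = "Language",
           y = "Number of posts") +
      scale_fill_manual(values = c('#FF5700', '#BCBBBB')) +
      theme_bw()
}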
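
The function above expects a single data frame that carries both platforms, distinguished by a `Platform` column. A minimal sketch of how such a frame might be assembled follows, assuming two hypothetical per-platform data frames named `reddit` and `stack` that share a `language` column. The names and toy rows are illustrative and not the project's data.

```r
library(dplyr)

# Hypothetical per-platform data, one row per post.
reddit <- data.frame(language = c("Python", "Python", "R"))
stack  <- data.frame(language = c("Python", "R", "R", "Java"))

# Label each source, then stack the rows into one frame.
combined <- bind_rows(
  mutate(reddit, Platform = "Reddit"),
  mutate(stack,  Platform = "Stackoverflow")
)

# The same aggregation the plotting function performs internally.
counts <- combined %>%
  group_by(language, Platform) %>%
  summarise(N = n(), .groups = "drop")
```

With an unnamed colour vector, `scale_fill_manual()` assigns colours in the order of the `Platform` levels, so with these labels the first colour goes to "Reddit" and the second to "Stackoverflow".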

plot of chunk totalPostsCombPlot

Average Number of Comments

Reddit

plot of chunk avgCommentsReddit

Stackoverflow

plot of chunk avgCommentsStack

Average Number of Comments - Combined

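# Grouped bar chart of the mean number of comments per post.
# Expects the columns `language`, `Platform` and `num_comments`.
# `color` is unused; it is kept so that existing calls keep working.
plot_language_comments_bar_combined <- function(data, color = '#FF5700') {
  data %>%
    group_by(language, Platform) %>%
    summarise(comments = mean(num_comments), .groups = "drop") %>%
    ggplot(aes(x = reorder(language, -comments),
               y = comments,
               fill = Platform)) +
      geom_bar(stat = "identity",
               position = position_dodge()) +
      scale_fill_manual(values = c('#FF5700', '#BCBBBB')) +
      labs(title = "Average number of comments per post for all languages",
           y = "Number of comments",
           x = "Programming Language") +
      theme_bw() +
      # Rotate the language labels so they do not overlap
      theme(axis.text.x = element_text(angle = 90))
}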

plot of chunk avgCommentsShow

Average Text Lengths

Reddit

plot of chunk reddit_textlength

Stackoverflow

plot of chunk stack_textlength

Average Text Lengths - Combined

# Grouped bar chart of the mean text length per post.
# Expects the columns `language`, `Platform` and `text_length`.
plot_text_length_bar <- function(data) {
  data %>%
    group_by(language, Platform) %>%
    summarise(text_length = mean(text_length), .groups = "drop") %>%
    ggplot(aes(x = reorder(language, -text_length),
               y = text_length,
               fill = Platform)) +
      geom_bar(stat = "identity", position = position_dodge()) +
      labs(title = "Average text length per post for all languages",
           y = "Average text length",
           x = "Programming Language") +
      scale_fill_manual(values = c('#FF5700', '#BCBBBB')) +
      theme_bw()
}

plot of chunk textlengthShow

Text Lengths Distribution

Reddit

plot of chunk reddit_textlength_d

Stackoverflow

plot of chunk stack_textlength_d

Text Lengths Distribution - Combined

plot of chunk textlength_d

Correlation of Languages

Reddit

plot of chunk corrReddit

Stackoverflow

plot of chunk corrStack

Trends in Programming Posts

plot of chunk trends2

Forecasting

  • Observations recorded at specific points in time are known as time series data
  • Such data points are usually sampled at equal intervals over the time period
  • Forecasting models can be used to predict future values of a time series
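
To make the idea of equally spaced observations concrete, the sketch below turns individual posts into a monthly count series in base R. It assumes a hypothetical data frame `posts` with `year` and `month` columns like those in the data sets; the rows are made up for illustration.

```r
# Hypothetical posts, one row per post.
posts <- data.frame(
  year  = c(2016, 2016, 2016, 2017, 2017),
  month = c(11, 11, 12, 2, 2)
)

# First month that appears in the data.
first_year  <- min(posts$year)
first_month <- min(posts$month[posts$year == first_year])

# Position of each post on a continuous monthly axis (1 = first month).
idx <- (posts$year - first_year) * 12 + (posts$month - first_month) + 1

# tabulate() keeps months without posts as zeros, so the spacing stays equal.
counts <- tabulate(idx)

# Monthly time series starting at the first observed month.
posts_ts <- ts(counts, start = c(first_year, first_month), frequency = 12)
```

Keeping the empty month (January 2017 in this toy data) as an explicit zero matters, because forecasting models such as ARIMA assume a regular time grid.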

ARIMA

Forecasting model: ARIMA (AutoRegressive Integrated Moving Average)

  1. Combines two models that forecast from lagged values and lagged forecast errors
  2. The AR model uses lagged values of the series
  3. The MA model uses lagged forecast errors
  4. ARIMA has three parameters: p (AR order), d (degree of differencing) and q (MA order)
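
As an illustration of the three parameters, the sketch below fits an ARIMA(1, 1, 1) model with `arima()` from base R's stats package and forecasts twelve steps ahead. The series is simulated, and the order (1, 1, 1), the simulation coefficients and the `method = "ML"` choice are assumptions made for the example, not the settings used in the project.

```r
set.seed(42)

# Simulated series standing in for monthly post counts (not the project's data).
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 120)
y   <- ts(as.numeric(sim), start = c(2016, 1), frequency = 12)

# order = c(p, d, q): one AR lag, one round of differencing, one MA term.
fit <- arima(y, order = c(1, 1, 1), method = "ML")

# Forecast the next 12 months.
fc <- predict(fit, n.ahead = 12)
fc$pred  # point forecasts
fc$se    # standard errors of the forecasts
```

An approximate 95% interval can be formed as `fc$pred ± 1.96 * fc$se`.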

Forecasting Reddit

Python

Python

Forecasting Reddit

R

R

Forecasting Stackoverflow

Python

Python

Forecasting Stackoverflow

R

R

Topic Modeling

  • Statistical modeling for finding topics in a collection of texts
  • Unsupervised machine learning technique
  • Clusters groups of words that characterise a set of texts/documents
  • Typical topic modeling algorithms
    • VSM (Vector Space Model)
    • LSA/LSI (Latent Semantic Analysis/Indexing)
    • LDA (Latent Dirichlet Allocation)

LDA

  • Latent Dirichlet Allocation
  • Generative model (probabilistic approach)
  • Fitted by expectation maximisation over multivariate distributions
  • Each topic is a probability distribution over terms
  • Each document is a mixture of topics and can be assigned to its most likely topic

Preprocessing

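library(dplyr)
library(tm)
library(topicmodels)

# lrd: data frame of posts with the cleaned text in `text_new`
text <- lrd %>% select(text_new)

# Create a corpus and normalise the text.
# Stop words are removed before punctuation so contractions such as "i'll" still match.
docs <- Corpus(VectorSource(text$text_new))
docs <- docs %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace)

# Build the document-term matrix and drop terms missing from more than 97% of documents
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.97)

# LDA needs every document to contain at least one remaining term
sel_idx <- slam::row_sums(dtm) > 0
dtm <- dtm[sel_idx, ]

# K: number of topics, chosen beforehand (its value is not given here)
model <- LDA(dtm, k = K)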
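
Once a model is fitted, its topics can be inspected with accessor functions from the topicmodels package. The sketch below runs the same steps on a toy corpus; the four example sentences, `k = 2` and the seed are assumptions chosen only to keep the example small and reproducible.

```r
library(tm)
library(topicmodels)

# Toy corpus (hypothetical posts), already lower-cased.
toy <- c("python pandas dataframe error",
         "career advice for junior developer",
         "pandas dataframe merge python",
         "developer salary career interview")

dtm_toy   <- DocumentTermMatrix(Corpus(VectorSource(toy)))
model_toy <- LDA(dtm_toy, k = 2, control = list(seed = 1))

top_terms  <- terms(model_toy, 3)           # 3 most probable terms per topic
doc_topic  <- topics(model_toy)             # most likely topic for each document
doc_shares <- posterior(model_toy)$topics   # topic proportions per document
```

`doc_shares` has one row per document and one column per topic, and each row sums to 1, which reflects that LDA treats a document as a mixture of topics.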

Topic Modeling Reddit

Python

R

plot of chunk pytho_tm

Topic Modeling Reddit

R

R

R

Topic Modeling Stackoverflow

Python

R

R

Topic Modeling Stackoverflow

R

R

R

Conclusion

  • Can we decide which topic/language a certain post is about?

    • Determined by using the “Tags” feature of Stack Overflow.
    • Lexical analysis is used to determine the proportion to which a post belongs to each language.
  • How do the number of comments and the number of posts correlate with popularity?

    • Most commented-on language: C
    • Since C is one of the oldest programming languages, its wide user base is a probable reason for the high number of comments.
    • Highest number of posts: Python
    • A probable reason why more people ask about Python is its growing popularity (as shown by the time series graph).
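
As a sketch of the tag-based approach mentioned above, the following base R code splits Stackoverflow tag strings such as `<python><animated-gif>` and maps the first recognised tag to a language label. The lookup table `language_of` and the helper names are hypothetical; the project's actual mapping is not shown in these slides.

```r
# Tag strings in the format used by the Stackoverflow data set.
tags <- c("<python><animated-gif><python-imageio>",
          "<c++><cmake><c++20><c++-modules>",
          "<javascript><reactjs><routes>")

# Split "<a><b>" into c("a", "b").
split_tags <- function(x) regmatches(x, gregexpr("[^<>]+", x))

# Hypothetical lookup from tag name to language label.
language_of <- c(python = "Python", "c++" = "C++",
                 javascript = "JavaScript", java = "Java", r = "R")

# Return the language of the first recognised tag, or NA if none matches.
first_language <- function(tag_vec) {
  hit <- tag_vec[tag_vec %in% names(language_of)]
  if (length(hit) == 0) NA_character_ else unname(language_of[hit[1]])
}

languages <- vapply(split_tags(tags), first_language, character(1))
```

Only exact tag names are matched here, so version-specific tags such as `c++20` are ignored unless they are added to the lookup table.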

Conclusion

  • How does the popularity of programming languages change over time?
    • On both Stackoverflow and Reddit, Python rose above Java over the course of 2016 and has remained on top since.
    • JavaScript, with new libraries released almost every year, also remains popular, just behind Python. It is mostly followed by Java and then the other languages.
    • C and C++ have surprisingly few posts by comparison. One reason might be that they have been in use for a long time, while the data we use is very recent.

Conclusion

  • Can we predict the popularity of programming languages in the future?

    • Based on our forecasting analysis, Python is expected to keep gaining popularity steadily on both Reddit and Stackoverflow in the upcoming years.
  • How do the two platforms compare based on programming languages?

    • The topic modeling suggests that Reddit focuses more on career advice and general opinions on programming topics, while Stackoverflow focuses on software engineering problems and coding-related issues.

Project Challenges

  • Obtaining the data sets
  • Working with two data sets that are not uniform
  • Memory issues during topic modeling
  • Deploying the Shiny application

Future Work

  • Unifying the data sets to enable direct comparison
  • Handling large data sets
  • Trying different topic modeling algorithms
  • Trying different forecasting methods

Thank you!