10/2/2021

The Problem: Comparing R and Python

Given that R and Python are both popular languages for data science, we wanted to compare Hacker News stories and comments about the two languages.

This could yield insights about the strengths and weaknesses of both languages, their applications in practice, and how their communities interact with Hacker News and each other.

Data Collection and Hurdles

  • We tried pulling data through the Hacker News API from R and through Google BigQuery (GBQ), and preferred the latter.
  • We ultimately collected enough data (10k-20k per case per language), but at times had trouble balancing both kinds of data.
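One part of collecting data per language is deciding which posts count as "about R" or "about Python". The slides do not say how this was done, so the following is only a minimal sketch under assumed rules: the patterns and the `label_title` helper are hypothetical. It mainly illustrates why a bare "R" is harder to match than "Python".

```python
import re

# Hypothetical patterns: the slides do not say how posts were matched to a language.
PYTHON_RE = re.compile(r"\bpython\b", re.IGNORECASE)
# A bare "R" is ambiguous, so this sketch requires a nearby cue such as "in R".
# The pattern is case-sensitive on purpose, so a lowercase "r" is not matched.
R_RE = re.compile(r"\b(?:R language|R programming|RStudio|in R|with R|for R)\b")


def label_title(title):
    """Return 'python', 'r', 'both', or None for a Hacker News title."""
    is_python = bool(PYTHON_RE.search(title))
    is_r = bool(R_RE.search(title))
    if is_python and is_r:
        return "both"
    if is_python:
        return "python"
    if is_r:
        return "r"
    return None
```

Because the R pattern is deliberately strict, it would likely miss some relevant posts, which is one plausible way the two groups could end up unbalanced.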

Response to Peer Feedback

  • Some peer feedback asked whether we would restrict the analysis to a particular time period. While this would be interesting, we chose to focus on the relevance of articles rather than on time.
  • Other feedback asked what our documents would contain. In the end we had data from titles, stories, and comments.
  • One commenter asked whether we would compute semantic coherence and exclusivity for both groups together or separately. We ended up computing them separately for each group.

Analysis & Results

We found associated terms for 20 topics for both R and Python stories.

Top Terms for Python and R by Topic
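As a rough illustration of how a "top terms by topic" table can be read off a fitted topic model, the sketch below ranks terms by their per-topic weight. The `top_terms` helper and the toy weights are assumptions made for illustration; they are not our results or the package we used.

```python
def top_terms(topic_word, n=5):
    """Return the n highest-weighted terms for each topic.

    topic_word maps a topic label to a {term: weight} dictionary.
    Ties are broken alphabetically so the output is deterministic.
    """
    return {
        topic: [
            term
            for term, _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for topic, weights in topic_word.items()
    }


# Toy weights, invented purely to show the shape of the input.
toy = {
    "topic_1": {"pandas": 0.30, "numpy": 0.25, "django": 0.05},
    "topic_2": {"ggplot": 0.40, "shiny": 0.20, "dplyr": 0.20},
}
result = top_terms(toy, n=2)
```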

Analysis & Results

We applied several methods learned in class: quantitative analysis of textual data (building a corpus, tokenizing, constructing a document-feature matrix (DFM), etc.), as well as advanced topic modeling and data visualization.
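The corpus, tokenizing, and DFM steps can be sketched in a few lines. This is a simplified stand-in written with the Python standard library, not the tooling we used; the stopword list and the `tokenize` and `build_dfm` helpers are assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_dfm(docs):
    """Return (features, rows): a sorted vocabulary and one count row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    features = sorted(set().union(*counts))
    rows = [[c[f] for f in features] for c in counts]
    return features, rows


corpus = ["Python for data science", "R and the tidyverse for data"]
features, rows = build_dfm(corpus)
```

Each row of the DFM is one document and each column is one feature (term), which is the input format that topic models typically expect.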

Top Terms for Python and R Articles

Analysis & Results

We visualized semantic coherence for Python and R text and comments.
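For reference, semantic coherence in the commonly used formulation sums, over pairs of a topic's top terms, the log of how often the two terms co-occur in documents relative to how often the higher-ranked term appears. The sketch below is a minimal pure-Python version under that assumption; the `semantic_coherence` helper and the toy documents are hypothetical and are not the code that produced our plots.

```python
import math
from itertools import combinations


def semantic_coherence(terms, docs):
    """Co-occurrence-based coherence for one topic.

    terms: the topic's top terms, ordered from most to least probable.
    docs:  an iterable of tokenized documents (each a list of tokens).
    Scores closer to zero mean the top terms tend to appear together.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*wanted):
        # Number of documents containing every term in `wanted`.
        return sum(1 for d in doc_sets if all(t in d for t in wanted))

    score = 0.0
    for l, m in combinations(range(len(terms)), 2):  # l < m, so terms[l] ranks higher
        d_l = doc_freq(terms[l])
        if d_l == 0:
            continue  # term never appears; skip to avoid dividing by zero
        score += math.log((doc_freq(terms[m], terms[l]) + 1) / d_l)
    return score


# Toy documents, invented to show the behavior.
toy_docs = [["python", "pandas"], ["python", "pandas"], ["python", "ggplot"]]
coherent = semantic_coherence(["python", "pandas"], toy_docs)
less_coherent = semantic_coherence(["python", "ggplot"], toy_docs)
```

Computing the score separately for each group, as we did, simply means calling the function once per group with that group's own documents and topics.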

The End