1 Overview

This project analyzes and visualizes the network of relationships among programming language questions on Stack Overflow, an open community for developers. The data set was downloaded from Kaggle (see here), and only data from 2020 onward is used. The goal is to find frequent patterns and analyze relationships among the tags in the dataset. Since this project focuses on visualization, only result data will be shown.

2 Load Packages And Data

2.1 Load Packages

The tidyverse is an opinionated collection of R packages designed for data science, including dplyr, tidyr, stringr, readr, tibble, ggplot2, purrr and others. plotly makes visualizations interactive. visNetwork visualizes network data with interactive effects. arulesSequences mines frequent sequential patterns from a dataset; it builds on arules, which implements the Apriori algorithm.

packages <- c("tidyverse", "plotly", "visNetwork", "arulesSequences")

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.2     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: visNetwork
## Loading required package: arulesSequences
## Loading required package: arules
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

2.2 Load Data

The function read_csv from package readr is used to read the stack dataset.

tags <- read_csv("E:/111Visual Analytics & Applications/Session10 Assign5/tag_list.csv")
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   tag_list = col_character(),
##   size = col_double()
## )
head(tags)
## # A tibble: 6 x 3
##      X1 tag_list                                    size
##   <dbl> <chr>                                      <dbl>
## 1     0 ['git', 'macos', 'file', 'git-clean']          4
## 2     1 ['django', 'amazon-s3', 'tinymce']             3
## 3     2 ['windows', 'jenkins', 'jenkins-pipeline']     3
## 4     3 ['php']                                        1
## 5     4 ['c++', 'cmd', 'compilation']                  3
## 6     5 ['symfony', 'webpack']                         2

Each row of the tags dataset describes one question: X1 is the row index, tag_list holds the tags attached to the question, and size is the number of tags in that list. In the next step, we find the most important tags, that is, the languages with the most questions asked on Stack Overflow.

To do so, we count the occurrences of each tag and convert each count into a percentage of all questions. Results are shown in descending order, and the columns are renamed to make their meaning more intuitive.

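# Strip the brackets and quotes around each tag list
tags <- tags %>% mutate(tag_list = gsub("\\[|\\]|\\'", '', tag_list))
# Split every list on commas and count how often each tag occurs
overview <- data.frame(table(unlist(strsplit(tags$tag_list, ","))))

# Sort by descending count, then convert counts to a percentage of questions
overview <- overview[order(-overview$Freq),]
overview$Freq <- overview$Freq / length(tags$tag_list) * 100

# Give the columns more intuitive names
overview <- overview %>%
  rename(Language = Var1, Support = Freq)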

head(overview)
##         Language   Support
## 46454     python 13.800344
## 43812 javascript 10.641598
## 43778       java  6.652674
## 40534         c#  4.544757
## 45996        php  3.302423
## 39394    android  3.281297

The result shows that Python accounts for the largest share of the 2020 dataset, followed by JavaScript. The meaning of each column is shown below.

Column Name   Explanation
Language      The languages (tags) that appear most frequently on Stack Overflow
Support       The percentage of the dataset that each language accounts for
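
One detail of the counting step is worth a closer look. After the brackets and quotes are removed, each list is separated by a comma followed by a space, so splitting on "," alone leaves a leading space on every tag after the first, and "python" and " python" end up in separate rows of the count. The sketch below uses a small made-up vector (raw is hypothetical, not the Kaggle data) to show the effect and one possible fix with trimws. It is an illustration only, and the figures reported above were produced by the original code.

```r
# Hypothetical raw tag lists in the same format as the tag_list column
raw <- c("['git', 'macos', 'file']", "['python', 'numpy']", "['django', 'python']")

# Remove brackets and single quotes
cleaned <- gsub("\\[|\\]|'", "", raw)

# Splitting on "," keeps the space after each comma, so " python" != "python"
naive <- table(unlist(strsplit(cleaned, ",")))

# Trimming whitespace merges the two spellings into one tag
trimmed <- table(trimws(unlist(strsplit(cleaned, ","))))

# Percentage of questions carrying each tag, in descending order
support <- sort(trimmed / length(cleaned) * 100, decreasing = TRUE)
head(support)
```

With the trimmed version, a tag is counted the same way regardless of its position in the list.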

Ah-ha! Now we know what we should focus on. Since Python is the most popular language for the time being, we would like to delve deeper into this language.

2.3 Proposed Design And Difficulties

2.3.1 Proposed Design

2.3.2 Potential Difficulties

  • Since the data is downloaded from Kaggle, it is raw, so a lot of data manipulation needs to be done in R to carry out this project.

  • We need to find an R package that implements the Apriori algorithm for frequent pattern mining. Otherwise, we would have to write the algorithm ourselves.

  • Visualizing a network is hard, with lots of formatting problems. Moreover, a huge number of nodes would hurt the readability of the chart, while too few nodes would make the network less informative.

3 Network Analysis of Python

3.1 Retrieve Python Dataset

Now let’s get the subset of questions related to Python. We keep the rows whose tag list contains ‘python’ (this also matches tags such as python-3.x), rename the columns to the fields expected by the frequent pattern mining step, reorder the columns, and convert every column to a factor.

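# Keep questions whose tag list contains "python" (also matches python-3.x etc.)
python_data <- tags %>% filter(grepl('python', tag_list))

# Rename columns to the sequence fields used in the mining step
python_data <- python_data %>%
  rename(sequenceID = X1, SIZE = size, items = tag_list) %>%
  mutate(eventID = sequenceID)

# Reorder columns to sequenceID, eventID, SIZE, items
python_data <- python_data[, c(1,4,3,2)]

# Convert every column to a factor
python_data <- data.frame(lapply(python_data, as.factor))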

head(python_data)
##   sequenceID eventID SIZE
## 1         21      21    3
## 2         27      27    5
## 3         30      30    3
## 4         32      32    3
## 5         33      33    3
## 6         39      39    4
##                                                       items
## 1                                     python, arrays, numpy
## 2 python, python-3.x, tkinter, tkinter-layout, tkinter-menu
## 3                            python, keystroke, directinput
## 4                               python, scrapy, scrapinghub
## 5                    python, tkinter, graphical-interaction
## 6                  python, postgresql, sql-update, psycopg2

The manipulated data contains four columns: sequenceID, eventID, SIZE and items.
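
Because grepl does substring matching, the filter above keeps any question with a tag that contains "python", including python-3.x. That may well be intended here. For comparison, the sketch below shows how an exact tag match could be written. The helper has_tag and the sample vector are hypothetical and are not part of the original analysis.

```r
# Hypothetical cleaned tag lists
tag_list <- c("python, numpy", "ipython, jupyter", "java, spring", "django, python-3.x")

# Substring match: TRUE for any tag that contains "python"
substring_match <- grepl("python", tag_list)

# Exact match: split each list into tags and test membership
has_tag <- function(x, tag) {
  vapply(strsplit(x, ", *"), function(tags) tag %in% tags, logical(1))
}
exact_match <- has_tag(tag_list, "python")

data.frame(tag_list, substring_match, exact_match)
```

Which variant is appropriate depends on whether version-specific and related tags should count as Python questions.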

3.2 Frequent Pattern Mining

Now let’s write the data into a transaction matrix to fit the model.

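# Write the data to a semicolon-separated text file without headers
write.table(python_data, "mytxtout.txt", sep=";", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Read it back as transactions; the first three fields carry sequence information
trans_matrix <- read_baskets("mytxtout.txt", sep = ";", info = c("sequenceID","eventID","SIZE"))

# Mine frequent sequences with cSPADE at a minimum support of 0.1
graph <- cspade(trans_matrix, parameter = list(support = 0.1), control = list(verbose = TRUE))
# Convert the mined sequences to a data frame
graph.df <- as(graph, "data.frame")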
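
To make the idea of support more concrete, here is a minimal base R sketch that computes the support of tag pairs, meaning the share of questions in which both tags appear together. It is a simplified co-occurrence count on made-up data, not the cSPADE procedure used above, and the names questions, min_support and frequent_pairs are hypothetical.

```r
# Hypothetical tag lists, one character vector per question
questions <- list(
  c("python", "pandas", "numpy"),
  c("python", "pandas", "dataframe"),
  c("python", "django"),
  c("python", "numpy")
)

# All unordered tag pairs that co-occur within a single question
pairs <- unlist(lapply(questions, function(tags) {
  if (length(tags) < 2) return(character(0))
  combn(sort(tags), 2, FUN = paste, collapse = " + ")
}))

# Support = share of questions that contain the pair
pair_support <- sort(table(pairs) / length(questions), decreasing = TRUE)

# Keep only pairs at or above a minimum support threshold
min_support <- 0.5
frequent_pairs <- pair_support[pair_support >= min_support]
frequent_pairs
```

Raising min_support keeps fewer, stronger pairs, which is the same trade-off the support parameter controls in the mining step.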

4 Visualization of the Network

Let’s rename the columns of graph.nodes and graph.edges to the names visNetwork expects for the following visualizations. The group column is used to color the nodes and edges accordingly. The width column sets the edge widths. The title column creates a tooltip, which is shown when the mouse hovers over a node. We also multiply the widths by 200 so that the edges show up clearly in the chart.

graph.nodes <- graph.nodes %>%
  rename(group = relation)
graph.edges <- graph.edges %>%
  rename(width = weight)
graph.nodes <- graph.nodes %>%
  mutate(title = label)

graph.edges$width <- graph.edges$width*200
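
Since only result data is shown in this project, the construction of graph.nodes and graph.edges does not appear above. The sketch below shows one shape these tables might take, assuming a node table with id, label and relation columns and an edge table with from, to and weight columns. All values are invented for illustration. visNetwork itself only requires an id column for nodes and from and to columns for edges.

```r
# Hypothetical node table: one row per tag, with its category in `relation`
graph.nodes <- data.frame(
  id = 1:3,
  label = c("python", "pandas", "numpy"),
  relation = c(NA, "DW", "DW"),
  stringsAsFactors = FALSE
)

# Hypothetical edge table: `weight` holds the support of each tag pair
graph.edges <- data.frame(
  from = c(1, 1, 2),
  to = c(2, 3, 3),
  weight = c(0.12, 0.08, 0.03)
)

# Same preparation as above, written in base R so the sketch runs on its own
names(graph.nodes)[names(graph.nodes) == "relation"] <- "group"
names(graph.edges)[names(graph.edges) == "weight"] <- "width"
graph.nodes$title <- graph.nodes$label
graph.edges$width <- graph.edges$width * 200

str(graph.nodes)
str(graph.edges)
```

Scaling by 200 turns small support values into edge widths of a few to a few dozen pixels, which is why a plain doubling would not be enough.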

4.1 Visualize The Network By Group

Now, let’s see the relationships among tags that frequently appear with Python. The viz below is an ego-network of Python. Nodes fall into 5 categories according to their function, and similar nodes tend to be located near one another. The wider the edge, the more important the connection between the two nodes. Blue edges run from Python to other nodes. They are the widest and are all frequent patterns, together with Pandas-DataFrame, Pandas-NumPy, and Keras-TensorFlow. When a group ID is chosen, all the nodes in that group are highlighted.

visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization by Group") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, selectedBy = "group") %>%
  visLegend(useGroups = TRUE)

Legend   Explanation
NULL     Only for Python, since the chart is an ego-network of it
DW       Data Wrangling
WS       Website Related
ML       Machine Learning
PL       Plotting
BF       Basic Function

Insight 1: Nodes in the Machine Learning and Data Wrangling groups are densely clustered, showing that nodes within these groups are more strongly related to one another than nodes in other groups. Also, the Data Wrangling group sits at the core of the network, closer to the Website Related nodes than Machine Learning is.

Insight 2: Even though matplotlib, tkinter, and pyqt5 all belong to the Plotting group, they are spread across different parts of the network. matplotlib is the most popular, followed by tkinter, and both interact relatively strongly with other tags. pyqt5 is far away from the other nodes, suggesting that questions about it rarely involve its interaction with other packages.

4.2 Visualize The Network By Node ID

When an ID is selected, nodes that are connected to it are highlighted.

visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization by ID") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visLegend(useGroups = TRUE) 

Insight 3: Nodes in the Data Wrangling group are located in the middle of the chart, nearest to Python. When we click NumPy, Pandas, and DataFrame in turn, almost all nodes in the network are highlighted. This reveals that data wrangling is of great importance in Python coding. In more detail, Pandas is the most important tool, except in the field of deep learning, which belongs to machine learning and is more related to NumPy. DataFrame is more related to basic functions.
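
The observation that selecting a node highlights many others can also be checked numerically by counting each node's direct neighbors, that is, its degree. The sketch below does this in base R on a small invented edge list. The edges data frame is hypothetical and does not reproduce the network above.

```r
# Hypothetical undirected edge list between tags
edges <- data.frame(
  from = c("python", "python", "python", "pandas", "pandas"),
  to   = c("pandas", "numpy", "django", "numpy", "dataframe"),
  stringsAsFactors = FALSE
)

# Degree = number of edges touching each node, counted from both endpoints
degree <- sort(table(c(edges$from, edges$to)), decreasing = TRUE)
degree
```

Nodes with a high degree are the ones that light up most of the chart when selected with highlightNearest enabled.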

4.3 Your Turn Now

What do you think a good network visualization should have? Below is the interactive DIY board for you to play with.

visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization DIY", height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, selectedBy = "group", nodesIdSelection = TRUE) %>%
  visLegend(useGroups = TRUE) %>%
  visInteraction(navigationButtons = TRUE) %>%
  visConfigure(enabled = TRUE)