This project aims to analyze and visualize the network of relationships between programming language questions on Stack Overflow, an open community for developers. The dataset was downloaded from Kaggle. Only data from 2020 onward is used in this project. The goal is to find frequent patterns and analyze relationships in the dataset. Since this project focuses more on visualization, only result data will be shown.
The tidyverse package is an opinionated collection of R packages designed for data science, including dplyr, tidyr, stringr, readr, tibble, ggplot2, purrr and so on. plotly helps to make visualizations interactive. visNetwork is a great tool for visualizing network data with interactive effects. arulesSequences is a package for mining frequent sequential patterns from a dataset; it provides the cSPADE algorithm, which is used here for frequent pattern mining.
packages <- c("tidyverse", "plotly", "visNetwork", "arulesSequences")
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.2 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: visNetwork
## Loading required package: arulesSequences
## Loading required package: arules
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
The function read_csv from the readr package is used to read the Stack Overflow tag dataset.
tags <- read_csv("E:/111Visual Analytics & Applications/Session10 Assign5/tag_list.csv")
## Parsed with column specification:
## cols(
## X1 = col_double(),
## tag_list = col_character(),
## size = col_double()
## )
head(tags)
## # A tibble: 6 x 3
## X1 tag_list size
## <dbl> <chr> <dbl>
## 1 0 ['git', 'macos', 'file', 'git-clean'] 4
## 2 1 ['django', 'amazon-s3', 'tinymce'] 3
## 3 2 ['windows', 'jenkins', 'jenkins-pipeline'] 3
## 4 3 ['php'] 1
## 5 4 ['c++', 'cmd', 'compilation'] 3
## 6 5 ['symfony', 'webpack'] 2
Each row of the tags dataset corresponds to one question: X1 is the row index, tag_list is the list of tags attached to the question, and size is the number of tags in that list. In the next step, we find the most important tags, that is, the languages and technologies with the most questions asked on Stack Overflow.
Next, we count the occurrences of each tag and compute the percentage of questions in which it appears. Results are sorted in descending order, and the columns are renamed to make their meaning more intuitive.
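# Strip the brackets and quotes from the tag strings.
tags <- tags %>% mutate(tag_list = gsub("\\[|\\]|\\'", '', tag_list))
# Count how often each tag occurs across all questions.
overview <- data.frame(table(unlist(strsplit(tags$tag_list, ","))))
# Sort by frequency, descending.
overview <- overview[order(-overview$Freq), ]
# Express each count as a percentage of the number of questions.
overview$Freq <- overview$Freq / length(tags$tag_list) * 100
overview <- overview %>%
  rename(Language = Var1, Support = Freq)
head(overview)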
## Language Support
## 46454 python 13.800344
## 43812 javascript 10.641598
## 43778 java 6.652674
## 40534 c# 4.544757
## 45996 php 3.302423
## 39394 android 3.281297
The result shows that Python accounts for the largest share of questions in the 2020 dataset (about 13.8%), followed by JavaScript (about 10.6%). The meaning of each column is shown below.
| Column Name | Explanation |
|---|---|
| Language | Tag name (a language or technology) on Stack Overflow |
| Support | Percentage of questions in the dataset that carry the tag |
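The support calculation can be illustrated on a few made-up tag strings. The sketch below is not the project's code: the `toy_tags` data is invented, and the `trimws` step is an added assumption. It is included because splitting "a, b" on a bare comma leaves a leading space on every tag after the first, so " python" and "python" would be counted as different tags. Whether this affects the figures above has not been checked here.

```r
# Made-up tag strings in the same "a, b, c" shape as tag_list after the
# brackets and quotes are removed (not the Kaggle data).
toy_tags <- c("python, pandas", "javascript", "pandas, python, numpy", "java")

# Splitting on "," alone keeps the space that follows each comma.
untrimmed <- unlist(strsplit(toy_tags, ",", fixed = TRUE))

# Split on commas, then trim surrounding whitespace from every tag.
split_tags <- lapply(strsplit(toy_tags, ",", fixed = TRUE), trimws)

# Count each tag once per question and convert to a percentage of questions.
counts <- table(unlist(lapply(split_tags, unique)))
support <- sort(100 * counts / length(toy_tags), decreasing = TRUE)
support_df <- data.frame(Language = names(support),
                         Support = as.numeric(support),
                         stringsAsFactors = FALSE)
print(support_df)
```

In this toy data, python and pandas each appear in 2 of 4 questions, so their support is 50, while the remaining tags have a support of 25.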
Ah-ha! Now we know what we should focus on. Since Python is the most popular language for the time being, we would like to delve deeper into this language.
Since the data is downloaded from Kaggle, it is raw, so a lot of data manipulation needs to be done in R to carry out this project.
We need to find an R package that implements the Apriori algorithm for frequent pattern mining. Otherwise, we could write the algorithm ourselves.
Visualizing a network is hard and involves many formatting problems. Moreover, a huge number of nodes makes the chart cluttered, while too few nodes make the network less informative.
Now let’s take the subset of questions tagged with Python. We filter rows whose tag list contains ‘python’ and reshape the data to fit the Apriori model, then reorder the columns and convert them to factors for the following steps.
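# Keep questions whose tag list contains "python".
python_data <- tags %>% filter(grepl('python', tag_list))
# Rename columns to the names used by the sequence-mining step.
python_data <- python_data %>%
  rename(sequenceID = X1, SIZE = size, items = tag_list) %>%
  mutate(eventID = sequenceID)
# Reorder columns to sequenceID, eventID, SIZE, items.
python_data <- python_data[, c(1, 4, 3, 2)]
# Convert every column to a factor.
python_data <- data.frame(lapply(python_data, as.factor))
head(python_data)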
## sequenceID eventID SIZE
## 1 21 21 3
## 2 27 27 5
## 3 30 30 3
## 4 32 32 3
## 5 33 33 3
## 6 39 39 4
## items
## 1 python, arrays, numpy
## 2 python, python-3.x, tkinter, tkinter-layout, tkinter-menu
## 3 python, keystroke, directinput
## 4 python, scrapy, scrapinghub
## 5 python, tkinter, graphical-interaction
## 6 python, postgresql, sql-update, psycopg2
After manipulation, the data contains four columns: sequenceID, eventID, SIZE and items.
Now let’s write the data into a transaction matrix to fit the model.
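One detail of the filter is worth illustrating: grepl('python', tag_list) is a substring match, so it also keeps rows whose tags merely contain "python", such as python-3.x or ipython. That may well be intended. The sketch below, on invented data, contrasts it with an exact tag match. The helper `has_tag` is hypothetical and not part of the project.

```r
# Made-up rows in the same shape as the cleaned tags data.
toy <- data.frame(
  X1 = 0:3,
  tag_list = c("python, numpy", "ipython, jupyter",
               "r, ggplot2", "python-3.x, tkinter"),
  stringsAsFactors = FALSE
)

# Substring match: TRUE for any row with a tag that contains "python".
substring_hit <- grepl("python", toy$tag_list)

# Exact match: split the string into tags and test for the tag itself.
has_tag <- function(tag_string, tag) {
  tag %in% trimws(strsplit(tag_string, ",", fixed = TRUE)[[1]])
}
exact_hit <- vapply(toy$tag_list, has_tag, logical(1),
                    tag = "python", USE.NAMES = FALSE)

print(data.frame(tag_list = toy$tag_list, substring_hit, exact_hit))
```

Here the substring match keeps three of the four rows, while the exact match keeps only the first. Which behavior is appropriate depends on whether version-specific tags should count as Python questions.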
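# Write the data as semicolon-separated lines, without header or quotes.
write.table(python_data, "mytxtout.txt", sep=";", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Read the file back as transactions; the first three fields are transaction info.
trans_matrix <- read_baskets("mytxtout.txt", sep = ";", info = c("sequenceID","eventID","SIZE"))
# Mine frequent sequences with cSPADE at a minimum support of 0.1.
graph <- cspade(trans_matrix, parameter = list(support = 0.1), control = list(verbose = TRUE))
# Convert the mined sequences to a data frame.
graph.df <- as(graph, "data.frame")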
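The choice of separator decides what counts as one item. The base R sketch below uses strsplit on a single line shaped like the rows written above. It assumes that read_baskets splits each line on the sep pattern in a similar way, which is worth confirming in the arulesSequences documentation. Under that assumption, splitting on ";" alone would leave the whole tag string as a single item, whereas a pattern that also matches ", " would yield one item per tag.

```r
# One line in the shape written by write.table() above.
line <- "21;21;3;python, arrays, numpy"

# Splitting on ";" alone keeps the whole tag string as one field.
fields_semicolon <- strsplit(line, ";")[[1]]

# Splitting on ";" or ", " separates the info fields and each tag.
fields_both <- strsplit(line, ";|, ")[[1]]

info  <- fields_both[1:3]      # sequenceID, eventID, SIZE
items <- fields_both[-(1:3)]   # individual tags

print(fields_semicolon)
print(items)
```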
Let’s rename the columns of graph.nodes and graph.edges to the names visNetwork expects for the following visualizations. The group column colors the nodes and edges accordingly. The width column sets the edge widths. The title column creates a tooltip, which is shown when the mouse hovers over a node. We also multiply the width by 200 for a better effect in the chart.
graph.nodes <- graph.nodes %>%
  rename(group = relation)
graph.edges <- graph.edges %>%
  rename(width = weight)
graph.nodes <- graph.nodes %>%
  mutate(title = label)
graph.edges$width <- graph.edges$width * 200
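The renames above imply that graph.nodes carries label and relation columns and that graph.edges carries a weight column, but the construction of these tables is not shown. As a purely hypothetical sketch, assuming that edge weights are pairwise co-occurrence shares, tables of that shape could be built as follows. The data, the group assignments, and the names `toy_items`, `toy_groups`, `toy_nodes` and `toy_edges` are all invented for illustration.

```r
# Made-up tag lists; every question here includes "python".
toy_items <- list(
  c("python", "pandas", "numpy"),
  c("python", "pandas", "dataframe"),
  c("python", "keras", "tensorflow"),
  c("python", "pandas")
)

# Hypothetical group codes for each tag (NA for the ego node).
toy_groups <- c(python = NA, pandas = "DW", numpy = "DW",
                dataframe = "DW", keras = "ML", tensorflow = "ML")

# All unordered tag pairs per question; tags are sorted so a-b equals b-a.
pairs <- do.call(rbind, lapply(toy_items, function(tags) {
  t(combn(sort(unique(tags)), 2))
}))

# Edge weight = share of questions in which the pair co-occurs.
pair_counts <- table(paste(pairs[, 1], pairs[, 2], sep = "|"))
parts <- strsplit(names(pair_counts), "|", fixed = TRUE)
toy_edges <- data.frame(
  from   = vapply(parts, `[`, character(1), 1),
  to     = vapply(parts, `[`, character(1), 2),
  weight = as.numeric(pair_counts) / length(toy_items),
  stringsAsFactors = FALSE
)

# One node per distinct tag, with the columns renamed in the code above.
tag_names <- sort(unique(unlist(toy_items)))
toy_nodes <- data.frame(
  id       = tag_names,
  label    = tag_names,
  relation = unname(toy_groups[tag_names]),
  stringsAsFactors = FALSE
)

print(toy_edges)
print(toy_nodes)
```

In this toy data the pandas-python pair co-occurs in 3 of 4 questions, giving it a weight of 0.75. A real pipeline would more likely derive the weights from the support values in graph.df.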
Now, let’s look at the relationships among tags that frequently appear together with Python. The visualization below is an ego network of Python. Nodes fall into 5 categories according to their function, and similar nodes tend to sit near one another. The wider the edge, the more important the connection between the two nodes. Blue edges run from Python to other nodes. They are the widest edges and are all frequent patterns, together with Pandas-DataFrame, Pandas-NumPy, and Keras-TensorFlow. When a group ID is chosen, all the nodes in that group are highlighted.
visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization by Group") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, selectedBy = "group") %>%
  visLegend(useGroups = TRUE)
| Legend | Explanation |
|---|---|
| NULL | Only for Python, since the chart is an ego network of it |
| DW | Data Wrangling |
| WS | Website Related |
| ML | Machine Learning |
| PL | Plotting |
| BF | Basic Function |
Insight 1: Nodes in the Machine Learning and Data Wrangling groups are densely clustered, showing that nodes within these groups are more strongly interrelated than nodes in other groups. Also, the Data Wrangling group is at the core of the network, closer to the Website Related nodes than Machine Learning is.
Insight 2: Even though matplotlib, tkinter, and pyqt5 all belong to the Plotting group, they sit in different parts of the network. matplotlib is the most popular, followed by tkinter, and both interact relatively often with other tags. pyqt5 is far from the other nodes, suggesting that questions on this topic are mostly asked on their own, without involving other packages.
When an ID is selected, nodes that are connected to it are highlighted.
visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization by ID") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visLegend(useGroups = TRUE)
Insight 3: Nodes in the Data Wrangling group sit in the middle of the chart, nearest to Python. When we click NumPy, Pandas, and DataFrame in turn, almost all nodes in the network are highlighted. This reveals that data wrangling is of great importance in Python coding. In more detail, Pandas is the most important tool, except in deep learning, which belongs to machine learning and is more closely related to NumPy. DataFrame is more closely related to basic functions.
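The highlighting behavior can be reasoned about without the chart: selecting a node highlights the nodes directly connected to it. The base R sketch below computes those direct neighbors from an edge table. The `demo_edges` data and the `neighbours` helper are invented for illustration and do not come from the project's results.

```r
# Made-up edge list with from/to columns (not the project's edges).
demo_edges <- data.frame(
  from = c("python", "python", "python", "pandas", "keras"),
  to   = c("pandas", "numpy", "keras", "numpy", "tensorflow"),
  stringsAsFactors = FALSE
)

# Direct neighbours of a node, treating edges as undirected.
neighbours <- function(edges, node) {
  sort(unique(c(edges$to[edges$from == node],
                edges$from[edges$to == node])))
}

print(neighbours(demo_edges, "pandas"))
print(neighbours(demo_edges, "tensorflow"))
```

A node whose neighbor set covers most of the network, as pandas does in this toy example, is the kind of hub that Insight 3 describes.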
What do you think a good network visualization should have? Below is the interactive DIY board for you to play with.
visNetwork(graph.nodes, graph.edges, main = "Python Network Visualization DIY", height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, selectedBy = "group", nodesIdSelection = TRUE) %>%
  visLegend(useGroups = TRUE) %>%
  visInteraction(navigationButtons = TRUE) %>%
  visConfigure(enabled = TRUE)