Introduction

A quick look at the kinds of questions posted on the Data Science Stack Exchange.

Get the data

The following query retrieves all open questions (PostTypeId = 1, with no ClosedDate):

SELECT *
FROM Posts
WHERE ClosedDate IS NULL AND PostTypeId = 1;

Let’s look at the data

library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)

df <- read.csv("./DS_SE_questions.csv")

colnames(df)

 [1] "Id"                    "PostTypeId"            "AcceptedAnswerId"     
 [4] "ParentId"              "CreationDate"          "DeletionDate"         
 [7] "Score"                 "ViewCount"             "Body"                 
[10] "OwnerUserId"           "OwnerDisplayName"      "LastEditorUserId"     
[13] "LastEditorDisplayName" "LastEditDate"          "LastActivityDate"     
[16] "Title"                 "Tags"                  "AnswerCount"          
[19] "CommentCount"          "FavoriteCount"         "ClosedDate"           
[22] "CommunityOwnedDate"

## Show the 10 most recent questions (ranked explicitly by CreationDate)
df %>%
    select(Title, Tags, CreationDate) %>%
    top_n(10, CreationDate) %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))

Title	Tags	CreationDate
Difference between Convolution and Pooling?	<neural-network><deep-learning>	2020-02-08 07:41:53
Does it make sense to use train_test_split and cross-validation when using GridSearchCV to play with hyperparameters?	<cross-validation><model-selection>	2020-02-08 12:04:19
Use Python sklearn in Matlab, MLPRegressor	<python><scikit-learn><matlab><mlp>	2020-02-08 13:02:31
Predict a label based on multiple rows each one case?	<classification>	2020-02-08 16:00:37
How to summarize multiple time series like dataset	<pandas><graphs>	2020-02-08 16:32:43
Keras 1x1 convolution network	<keras>	2020-02-08 18:06:21
How to select checkpoint for model evaluation?	<neural-network><convolution><evaluation><overfitting>	2020-02-08 18:41:32
Data Labeling domain specific	<machine-learning><python><deep-learning><data-cleaning><data-science-model>	2020-02-08 20:32:22
How to optimize client’s portafolio with analytical models?	<optimization>	2020-02-08 21:13:49
Stylegan train.py Assertion Error	<python><gan><nvidia>	2020-02-08 21:54:39

Some quick EDA

dim(df)

[1] 20663    22

Prepare the Tags data

df <- df %>% mutate(Tags = str_replace_all(Tags, c(`><` = ", ", `>` = "", `<` = "")))

head(df$Tags)

[1] "machine-learning, python, scikit-learn, clustering, unsupervised-learning"
[2] "neural-network, deep-learning, tensorflow, lstm"                          
[3] "preprocessing"                                                            
[4] "machine-learning, python, cnn, image-classification"                      
[5] "python, scikit-learn, anomaly-detection, outlier, data-imputation"        
[6] "machine-learning, data"
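
As a minimal sketch of what the replacement above does, the snippet below applies the same three substitutions to a single raw tag string. The sample value is taken from the first row of the preview table; the names raw_tags and clean_tags are illustrative and not part of the original analysis.

```r
library(stringr)

# Raw tag string as it appears in the Tags column before cleaning
raw_tags <- "<neural-network><deep-learning>"

# The named replacements are applied in order: "><" becomes ", ",
# then the remaining ">" and "<" characters are dropped
clean_tags <- str_replace_all(raw_tags, c("><" = ", ", ">" = "", "<" = ""))

print(clean_tags)
```

The order matters: replacing "><" first preserves the boundary between tags before the leftover angle brackets are removed.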

## Write one comma-separated line of tags per question (basket format)
write.csv(as.data.frame(df$Tags), "./tags_data.csv", row.names = FALSE, quote = FALSE)

Association rule mining

Item frequency

library(arules)
library(RColorBrewer)

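## skip = 1 drops the header line that write.csv() adds, so it is not read as a transaction
tr <- read.transactions("./tags_data.csv", format = "basket", sep = ",", skip = 1)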

itemFrequencyPlot(tr, topN = 20, type = "absolute", col = brewer.pal(8, "Pastel2"), 
    main = "Absolute Item Frequency Plot")
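
What the plot counts can be reproduced by hand on a small sample. The following is a rough sketch in base R that uses only the six tag strings printed by head(df$Tags) earlier; sample_tags and tag_counts are illustrative names, and the real plot is of course based on every transaction in tr.

```r
# The six cleaned tag strings shown by head(df$Tags)
sample_tags <- c(
  "machine-learning, python, scikit-learn, clustering, unsupervised-learning",
  "neural-network, deep-learning, tensorflow, lstm",
  "preprocessing",
  "machine-learning, python, cnn, image-classification",
  "python, scikit-learn, anomaly-detection, outlier, data-imputation",
  "machine-learning, data"
)

# Split each string into individual tags and count how often each tag occurs
all_tags   <- unlist(strsplit(sample_tags, ", ", fixed = TRUE))
tag_counts <- sort(table(all_tags), decreasing = TRUE)

# Absolute frequency of the most common tags in this sample
print(head(tag_counts, 3))
```

In this sample, machine-learning and python each occur three times and scikit-learn twice, which is the kind of absolute count the plot reports per item.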

Create and filter association rules

assn_rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.3, minlen = 2, 
    maxlen = 5))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.3    0.1    1 none FALSE            TRUE       5   0.001      2
 maxlen target   ext
      5  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 20 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[568 item(s), 20664 transaction(s)] done [0.01s].
sorting and recoding items ... [264 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [411 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].

## Filter out redundant rules
nonr_rules <- assn_rules[!is.redundant(assn_rules)]

## Keep only statistically significant rules (Fisher's exact test, Bonferroni-adjusted)
sig_rules <- nonr_rules[is.significant(nonr_rules, tr, method = "fisher", adjust = "bonferroni")]

## Convert rules matrix to dataframe
sig_rules_df <- DATAFRAME(sig_rules, setStart = "", setEnd = "", separate = TRUE)

sig_rules_df %>%
    arrange(desc(count)) %>%
    top_n(10, count) %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "responsive"))

LHS	RHS	support	confidence	lift	count
scikit-learn	machine-learning	0.0207607	0.3341121	1.040714	429
regression	machine-learning	0.0150019	0.3634232	1.132014	310
cnn	machine-learning	0.0126307	0.3182927	0.991438	261
statistics	machine-learning	0.0114692	0.3885246	1.210201	237
python,scikit-learn	machine-learning	0.0105014	0.3604651	1.122799	217
decision-trees	machine-learning	0.0079849	0.3900709	1.215017	165
random-forest	machine-learning	0.0078397	0.3537118	1.101764	162
logistic-regression	machine-learning	0.0072590	0.3807107	1.185861	150
svm	machine-learning	0.0072106	0.3941799	1.227816	149
linear-regression	machine-learning	0.0071622	0.3557692	1.108172	148
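
The measures in the table are related by simple arithmetic. Below is a small sketch using the first row (scikit-learn => machine-learning) and the 20664 transactions reported in the apriori log; the variable names are illustrative, and the derived values are approximate because the table entries are rounded.

```r
n_transactions <- 20664      # transactions reported in the apriori log
rule_count     <- 429        # questions carrying both scikit-learn and machine-learning
confidence     <- 0.3341121  # from the table
lift           <- 1.040714   # from the table

# support = share of all transactions that contain both LHS and RHS
support <- rule_count / n_transactions

# confidence = count(LHS and RHS) / count(LHS), so count(LHS) can be recovered
lhs_count <- rule_count / confidence

# lift = confidence / support(RHS), so support(RHS) can be recovered
rhs_support <- confidence / lift

print(round(support, 7))      # 0.0207607, matching the support column
print(round(lhs_count))       # 1284 questions tagged scikit-learn
print(round(rhs_support, 3))  # 0.321, the overall share of machine-learning
```

Read this way, a lift close to 1 means the RHS tag appears about as often alongside the LHS as it does across all questions: machine-learning is on roughly 32% of all questions and on roughly 33% of the scikit-learn ones.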

Visualize the rules

library(arulesViz)

## Plot the higher-support rules as a graph with a circular layout
plot(subset(sig_rules, support > 0.0055), method = "graph", cex = 1, control = list(main = NULL, 
    layout = igraph::in_circle()))