Introduction
A quick look at the kinds of questions that get posted on the Data Science StackExchange.
Get the data
Get all open questions:
Query used:
Let’s look at the data
library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)
df <- read.csv("./DS_SE_questions.csv")
colnames(df)
[1] "Id" "PostTypeId" "AcceptedAnswerId"
[4] "ParentId" "CreationDate" "DeletionDate"
[7] "Score" "ViewCount" "Body"
[10] "OwnerUserId" "OwnerDisplayName" "LastEditorUserId"
[13] "LastEditorDisplayName" "LastEditDate" "LastActivityDate"
[16] "Title" "Tags" "AnswerCount"
[19] "CommentCount" "FavoriteCount" "ClosedDate"
[22] "CommunityOwnedDate"
df %>% select(Title, Tags, CreationDate) %>% top_n(10, CreationDate) %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
Title | Tags | CreationDate |
---|---|---|
Difference between Convolution and Pooling? | <neural-network><deep-learning> | 2020-02-08 07:41:53 |
Does it make sense to use train_test_split and cross-validation when using GridSearchCV to play with hyperparameters? | <cross-validation><model-selection> | 2020-02-08 12:04:19 |
Use Python sklearn in Matlab, MLPRegressor | <python><scikit-learn><matlab><mlp> | 2020-02-08 13:02:31 |
Predict a label based on multiple rows each one case? | <classification> | 2020-02-08 16:00:37 |
How to summarize multiple time series like dataset | <pandas><graphs> | 2020-02-08 16:32:43 |
Keras 1x1 convolution network | <keras> | 2020-02-08 18:06:21 |
How to select checkpoint for model evaluation? | <neural-network><convolution><evaluation><overfitting> | 2020-02-08 18:41:32 |
Data Labeling domain specific | <machine-learning><python><deep-learning><data-cleaning><data-science-model> | 2020-02-08 20:32:22 |
How to optimize client’s portafolio with analytical models? | <optimization> | 2020-02-08 21:13:49 |
Stylegan train.py Assertion Error | <python><gan><nvidia> | 2020-02-08 21:54:39 |
Association rule mining
Item frequency
library(arules)
library(RColorBrewer)
tr <- read.transactions("./tags_data.csv", format = "basket", sep = ",")
itemFrequencyPlot(tr, topN = 20, type = "absolute", col = brewer.pal(8, "Pastel2"),
main = "Absolute Item Frequency Plot")
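The post does not show how `tags_data.csv` was produced. As a minimal sketch, assuming it was derived from the `Tags` column shown earlier (tags stored as `<tag1><tag2>`), the conversion to basket format (one comma-separated line per question) could look like this. The sample vector and the temporary output path are hypothetical stand-ins for `df$Tags` and the real file.

```r
# Hypothetical sample of the Tags column, in the "<tag1><tag2>" form shown earlier
tags <- c("<neural-network><deep-learning>",
          "<python><scikit-learn><matlab><mlp>",
          "<keras>")

# Drop the leading "<" and trailing ">", then turn each "><" separator into a comma
inner   <- gsub("^<|>$", "", tags)
baskets <- gsub("><", ",", inner, fixed = TRUE)

# Write one basket per line; a temporary file stands in for "./tags_data.csv"
out_file <- tempfile(fileext = ".csv")
writeLines(baskets, out_file)

# Read the file back to confirm the layout that read.transactions(format = "basket") expects
readLines(out_file)
```

Each line of the resulting file is one transaction, which is the layout `read.transactions(..., format = "basket", sep = ",")` reads.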
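Create and filter association rules
## Mine rules with the parameters reported in the log below
assn_rules <- apriori(tr, parameter = list(support = 0.001, confidence = 0.3, minlen = 2, maxlen = 5))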
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.3 0.1 1 none FALSE TRUE 5 0.001 2
maxlen target ext
5 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 20
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[568 item(s), 20664 transaction(s)] done [0.01s].
sorting and recoding items ... [264 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [411 rule(s)] done [0.00s].
creating S4 object ... done [0.01s].
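The "Absolute minimum support count" in the log can be reproduced from two other numbers it prints. This is a small arithmetic check, assuming the count is the relative support threshold times the number of transactions, truncated to an integer; the variable names are illustrative.

```r
# Values copied from the apriori log above
n_transactions <- 20664   # from the "set transactions" line
min_support    <- 0.001   # from the parameter specification

# Relative threshold times number of transactions, truncated to a whole count
abs_min_count <- as.integer(min_support * n_transactions)
abs_min_count
```

In other words, a relative support of 0.001 on this data set corresponds to a tag combination appearing in roughly 20 questions.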
## Filter out redundant rules
nonr_rules <- assn_rules[!is.redundant(assn_rules)]
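## Filter on statistical significance (Fisher's exact test, Bonferroni-adjusted).
## Note: is.significant() returns TRUE for significant rules, so the negation below
## keeps the rules that fail the test; drop the `!` to keep only significant rules.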
sig_rules <- nonr_rules[!is.significant(nonr_rules, tr, method = "fisher", adjust = "bonferroni")]
## Convert the rules object to a data frame
sig_rules_df <- DATAFRAME(sig_rules, setStart = "", setEnd = "", separate = TRUE)
sig_rules_df %>% arrange(desc(count)) %>% top_n(10, count) %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
LHS | RHS | support | confidence | lift | count |
---|---|---|---|---|---|
scikit-learn | machine-learning | 0.0207607 | 0.3341121 | 1.040714 | 429 |
regression | machine-learning | 0.0150019 | 0.3634232 | 1.132014 | 310 |
cnn | machine-learning | 0.0126307 | 0.3182927 | 0.991438 | 261 |
statistics | machine-learning | 0.0114692 | 0.3885246 | 1.210201 | 237 |
python,scikit-learn | machine-learning | 0.0105014 | 0.3604651 | 1.122799 | 217 |
decision-trees | machine-learning | 0.0079849 | 0.3900709 | 1.215017 | 165 |
random-forest | machine-learning | 0.0078397 | 0.3537118 | 1.101764 | 162 |
logistic-regression | machine-learning | 0.0072590 | 0.3807107 | 1.185861 | 150 |
svm | machine-learning | 0.0072106 | 0.3941799 | 1.227816 | 149 |
linear-regression | machine-learning | 0.0071622 | 0.3557692 | 1.108172 | 148 |
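To make the columns of the table concrete, the sketch below recomputes the metrics for the first rule (scikit-learn => machine-learning) from figures printed above. It uses the standard definitions: support = count / number of transactions, confidence = count(LHS and RHS) / count(LHS), and lift = confidence / support(RHS). The variable names are illustrative.

```r
# Figures copied from the apriori log and the first row of the table above
n_transactions <- 20664       # transactions reported by apriori
rule_count     <- 429         # questions tagged with both scikit-learn and machine-learning
confidence     <- 0.3341121
lift           <- 1.040714

# support: share of all questions that carry both the LHS and RHS tags
support <- rule_count / n_transactions

# confidence = count(LHS and RHS) / count(LHS), so the LHS count can be recovered
lhs_count <- rule_count / confidence

# lift = confidence / support(RHS), so the baseline frequency of the RHS can be recovered
rhs_support <- confidence / lift

round(support, 7)
round(lhs_count)
round(rhs_support, 3)
```

The recovered baseline shows that `machine-learning` appears on roughly 32% of all questions, so a confidence of about 33% is barely above what chance alone would give. That is why most lifts in the table sit close to 1, and why the `cnn` rule, with confidence just under the baseline, has a lift slightly below 1.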