Market basket analysis is a data mining technique for finding patterns between items. It is mostly applied to transactional data in the retail and marketing industries. Association rules play a central role in market basket analysis: they look for combinations of items that frequently occur together in transactions and express them as if-then rules, e.g. if A is purchased then B is also likely to be purchased.
Association rules can also be applied to other types of data and in different scenarios; examples include medical diagnosis, stock market analysis and churn analysis. In this article I will apply the Apriori algorithm to answer the question whether the tools and algorithms used in data science differ by industry.
The data comes from the 2017 Kaggle survey - link. Based on it I will check whether there are associations between the type of industry and the tools/algorithms that people use in data-science-related jobs.
The article is divided into three parts. The first part covers data preparation, the second finds association rules about tools and the last one finds association rules about algorithms.
Loading libraries
library(tidyverse)
library(knitr)
library(arules)
library(arulesViz)
Let's read the data and check its dimensions.
df <- read.csv('kagglesurvey.csv',na.strings=c("","NA"))
dim(df)
## [1] 10153 5
As we can see, there are 10153 rows and 5 columns. Let's look at the variable types.
summary(df) %>% kable() %>% kableExtra::kable_styling()
| Respondent | WorkToolsSelect | LanguageRecommendationSelect | EmployerIndustry | WorkAlgorithmsSelect |
|---|---|---|---|---|
| Min. : 1 | Length:10153 | Length:10153 | Length:10153 | Length:10153 |
| 1st Qu.: 2539 | Class :character | Class :character | Class :character | Class :character |
| Median : 5077 | Mode :character | Mode :character | Mode :character | Mode :character |
| Mean : 5077 | | | | |
| 3rd Qu.: 7615 | | | | |
| Max. :10153 | | | | |
The first column, ‘Respondent’, contains a unique ID for each survey participant. All other features are of character type. In the analysis we will only be interested in WorkToolsSelect - the tools that people use in their jobs, EmployerIndustry - the type of industry the employer belongs to, and WorkAlgorithmsSelect - the algorithms that a respondent uses in his/her job.
df[3:6,] %>% kable(row.names=FALSE) %>% kableExtra::kable_styling()
| Respondent | WorkToolsSelect | LanguageRecommendationSelect | EmployerIndustry | WorkAlgorithmsSelect |
|---|---|---|---|---|
| 3 | C/C++,Jupyter notebooks,MATLAB/Octave,Python,R,TensorFlow | Python | Technology | Bayesian Techniques,CNNs,Ensemble Methods,Neural Networks,Regression/Logistic Regression,SVMs |
| 4 | Jupyter notebooks,Python,SQL,TensorFlow | Python | Academic | Bayesian Techniques,CNNs,Decision Trees,Gradient Boosted Machines,Neural Networks,Random Forests,Regression/Logistic Regression |
| 5 | C/C++,Cloudera,Hadoop/Hive/Pig,Java,NoSQL,R,Unix shell / awk | R | Government | NA |
| 6 | SQL | Python | Non-profit | NA |
There are some missing values. I will remove the rows that contain them and visualise the industry data.
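Before dropping them, we can quickly check how many values are missing in each column (an illustrative check only, not needed for the rest of the analysis):
#count missing values per column
colSums(is.na(df))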
df <- na.omit(df)
pie(sort(table(df$EmployerIndustry)),main = 'Share of industries')
As we can see, data scientists work in many industries. I will analyse only the three most popular: Technology, Academic and Financial. Furthermore, I decided to merge Telecommunications into Technology and Insurance into Financial.
df$EmployerIndustry <- ifelse(df$EmployerIndustry=='Telecommunications','Technology',df$EmployerIndustry)
df$EmployerIndustry <- ifelse(df$EmployerIndustry=='Insurance','Financial',df$EmployerIndustry)
Now I will prepare the data for the Apriori algorithm. The data will be in wide format: the first column will contain the industry type and the remaining columns will hold one used tool/algorithm per cell for each person.
#creating matrix for tools
m_col <- max(apply(array(df$WorkToolsSelect),1,function(x)length(c(str_split(x,','))[[1]])))
foo_m <- matrix('', nrow = nrow(df), ncol = m_col)
#splitting from single observation to multiple observations
for( i in 1:nrow(df)){
foo_vec <- c(str_split(df$WorkToolsSelect[i],',')[[1]])
foo_m[i,1:length(foo_vec)] <- foo_vec
}
foo_m <- cbind(df$EmployerIndustry,foo_m)
#saving to csv
write.csv(foo_m, file="lang_ds.csv")
#creating matrix for algorithms
m_col <- max(apply(array(df$WorkAlgorithmsSelect),1,function(x)length(c(str_split(x,','))[[1]])))
foo_m <- matrix('', nrow = nrow(df), ncol = m_col)
#splitting from single observation to multiple observations
for( i in 1:nrow(df)){
foo_vec <- c(str_split(df$WorkAlgorithmsSelect[i],',')[[1]])
foo_m[i,1:length(foo_vec)] <- foo_vec
}
foo_m <- cbind(df$EmployerIndustry,foo_m)
#saving to csv
write.csv(foo_m, file="algo_ds.csv")
Reading the tools data for the Apriori algorithm and summarising it.
trans1<-read.transactions("lang_ds.csv", format="basket", sep=",", skip=1)
summary(trans1)
## transactions as itemMatrix in sparse format with
## 5991 rows (elements/itemsets/transactions) and
## 6053 columns (items) and a density of 0.00130583
##
## most frequent items:
## Python R SQL Jupyter notebooks
## 4732 3712 3286 2639
## TensorFlow (Other)
## 1834 31151
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## 2 257 491 719 810 851 740 580 430 338 238 186 119 65 47 25 24 22 10 9
## 22 23 24 25 26 27 30 34 35 50
## 9 5 3 4 1 2 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 6.000 7.000 7.904 10.000 50.000
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
plot(summary(trans1)@lengths,ylab='Frequency',xlab='size',main='Density of sizes')
itemFrequencyPlot(trans1, topN=25, type="relative", main="ItemFrequency", col=wesanderson::wes_palette(name = 'GrandBudapest2'))
As we can see from the summary of all ‘transactions’, most people use between 3 and 9 tools/languages in their job. The most popular are Python, R and SQL. It is also worth noting that very few people limit themselves to only one tool, which is useful information for newcomers devoted to a single language.
In the next parts I will run the Apriori algorithm. The most important parameters that the analyst must set are minimum support and minimum confidence. Support is the fraction of transactions that contain the specified items, and confidence tells us how often the “then” part occurs in transactions that contain the “if” part. In data science, tools are widely used across all industries, so to find a sweet spot between popularity and acceptable confidence I calibrated the support to 0.005 and the confidence to 0.08. Setting the support lower surfaces rules built on rarely used items (with counts below 10), which could skew the big picture; the goal here is to find a more universal characterisation of the industries.
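To make these two measures concrete, here is a small illustrative calculation for a hypothetical rule {Python} => {Technology}; the total and Python counts come from summary(trans1) above, while the joint count is made up purely for the example:
#illustrative only: support and confidence of a hypothetical rule {Python} => {Technology}
n_total <- 5991     #number of transactions (from summary(trans1))
n_python <- 4732    #transactions containing Python (from summary(trans1))
n_both <- 1500      #hypothetical count of transactions containing both Python and Technology
support <- n_both / n_total       #fraction of all transactions containing both items
confidence <- n_both / n_python   #how often Technology appears given that Python appears
With these (partly made-up) numbers the rule would have support of about 0.25 and confidence of about 0.32.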
rules_technology<-apriori(data=trans1, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Technology"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_technology,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_technology,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
We can see that in technology people are Python lovers. R and SQL are also very popular, followed by two tools: Jupyter notebooks and TensorFlow. Sorting the results by confidence, we can still see Python in almost every association. The associations with the highest confidence share tools such as Spark/MLlib, SQL, TensorFlow and Unix shell/awk. Moreover, tools like C/C++, Hadoop or IBM Watson also appear.
Below every set of results I will post two interactive plots: a scatter plot and a graph.
plot(rules_technology,engine='plotly')
plot(sort(rules_technology,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
rules_financial<-apriori(data=trans1, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Financial"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_financial,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_financial,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
In finance Python is also the most popular language, but the gap to the second most popular is very small. We can even say that there are three equally used languages: Python, R and SQL. Furthermore, combinations of these three tools are very common, for example Python & R, Python & SQL or R & SQL. Looking at confidence, the picture is different: SAS is the king now. SAS Base, SAS Enterprise Miner, SQL and R form the combinations with the highest confidence values. This confirmed my impression that SAS is commonly used in banks and other financial institutions but is not that popular elsewhere.
plot(rules_financial,engine='plotly')
plot(sort(rules_financial,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
rules_academic<-apriori(data=trans1, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Academic"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_academic,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_academic,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
In academia Python and R are also the most popular, but a new tool appears: MATLAB/Octave, which is used in science for numerical computations. Looking further at support, C/C++ is also quite popular and SQL is less popular than in the private sector. A likely explanation is that people in private firms work with databases built from their customers' behaviour, so they need SQL to query the observations they want and then do their data science on top of that, whereas academic data science can be more theoretical and relies less on such databases.
Looking at confidence there are many different associations, but MATLAB/Octave and C/C++ are present in almost every one of them. Also worth mentioning is the presence of SPSS in the association with the highest confidence and Mathematica in the third one.
These differences can probably also be explained by the fact that people in academia are experienced and have been working on and with data science methods for a long time, so they use tools that were popular years ago and that they learned when they were young; they also do not have time to learn new languages when they are already efficient in the old ones. People in business, on the other hand, are younger and have more time, so they learn tools that are new and popular.
plot(rules_academic,engine='plotly')
plot(sort(rules_academic,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
Reading the algorithms data for the Apriori algorithm and summarising it.
trans2<-read.transactions("algo_ds.csv", format="basket", sep=",", skip=1)
summary(trans2)
## transactions as itemMatrix in sparse format with
## 5991 rows (elements/itemsets/transactions) and
## 6019 columns (items) and a density of 0.0009430439
##
## most frequent items:
## Regression/Logistic Regression Decision Trees
## 3906 3027
## Random Forests Neural Networks
## 2863 2242
## Bayesian Techniques (Other)
## 1830 20138
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 23 1106 974 1090 915 729 473 288 195 105 40 26 12 6 7 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 5.676 7.000 17.000
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
plot(summary(trans2)@lengths,ylab='Frequency',xlab='size',main='Density of sizes')
itemFrequencyPlot(trans2, topN=15, type="relative", main="Item Frequency",
col=wesanderson::wes_palette(name = 'Royal2'))
From this brief summary we can see that regression/logistic regression is the most popular algorithm, followed by decision trees and random forests. Most people use between 2 and 6 algorithms in their job.
rules_technology<-apriori(data=trans2, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Technology"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_technology,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_technology,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
In technology the most popular algorithms are the same as in the whole sample. When we look at confidence, Bayesian techniques, CNNs (convolutional neural networks) and neural networks are the most specific to this industry. Overall there is a very broad choice of algorithms in technology; there are also people who use GANs (generative adversarial networks), RNNs (recurrent neural networks) or HMMs (hidden Markov models).
plot(rules_technology,engine='plotly')
plot(sort(rules_technology,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
rules_financial<-apriori(data=trans2, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Financial"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_financial,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_financial,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
In finance, looking at popularity, regression, decision trees and random forests are the most used. Confidence indicates that combinations of Bayesian techniques, ensemble methods, gradient boosted machines, regression and SVMs are the most specific to this industry, although the confidence values are low, which tells us that almost every algorithm is used. The difference between finance and technology is that the popularity of neural networks, CNNs and RNNs is low here.
plot(rules_financial,engine='plotly')
plot(sort(rules_financial,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
rules_academic<-apriori(data=trans2, parameter=list(supp=0.005, conf=0.08), appearance=list(default="lhs", rhs="Academic"), control=list(verbose=F))
count <- DATAFRAME(sort(rules_academic,by="count"))
count[,3:7] <- round(count[,3:7],4)
confident <- DATAFRAME(sort(rules_academic,by="confidence"))
confident[,3:7] <- round(confident[,3:7],4)
DT::datatable(count,filter = 'top',rownames = FALSE)
DT::datatable(confident,filter = 'top',rownames = FALSE)
In academia the popularity of methods is similar to the previous cases, with the difference that neural networks are the second most popular algorithm. Looking at confidence, the results are totally different from the two private-sector industries: people in academia use HMMs, RNNs, neural networks and, the most interesting thing for me, evolutionary approaches. These results support the supposition that researchers at universities work on new and complex methods, while the private sector mostly uses old and tested algorithms.
plot(rules_academic,engine='plotly')
plot(sort(rules_academic,by="confidence")[1:5], method = "graph", engine = "htmlwidget")
In this article we saw that association rules can be used to find interesting relationships in survey data. With the Apriori algorithm we can easily perform an analysis that uncovers valuable patterns, and many rules can be extracted even from an inconvenient data format. We discovered dissimilarities between the academic and private sectors in the usage of algorithms. In the tools part we noticed that a few languages are shared across industries, but there were also differences such as SAS in finance or MATLAB/Octave in academia. For me the most important takeaway from this analysis is to stay open to new tools and methods, because in a future job I will probably use several of them.