Data Mining Tasks (Styles of learning):
Supervised learning is the machine learning method of learning a function that maps an input to an output based on example input-output pairs. The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels, enabling it to yield accurate labels when presented with new, never-before-seen data.
Supervised learning is good at classification and regression problems, such as determining what category a news article belongs to or predicting the volume of sales for a given future date. In supervised learning, the aim is to make sense of data within the context of a specific question.
Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
Classification is the process of categorizing a given set of data into classes. It uses an algorithm to assign test data to specific categories, recognizing particular entities within the dataset and attempting to draw conclusions about how those entities should be labeled or defined.
While linear regression is leveraged when dependent variables are continuous, logistic regression is selected when the dependent variable is categorical, meaning it has binary outputs, such as “true” and “false” or “yes” and “no.” While both regression models seek to understand relationships between data inputs, logistic regression is mainly used to solve binary classification problems, such as spam identification.
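As a minimal sketch of the spam-identification idea, a logistic regression can be fit in base R with glm(); the data and variable names here are simulated purely for illustration.
# Minimal sketch: logistic regression on simulated "spam" data (variables are illustrative)
set.seed(1)
n <- 200
n_links <- rpois(n, 2)   # number of links in a message
n_caps  <- rpois(n, 5)   # number of all-caps words
is_spam <- rbinom(n, 1, plogis(-2 + 0.8 * n_links + 0.3 * n_caps))
spam_fit <- glm(is_spam ~ n_links + n_caps, family = binomial)
summary(spam_fit)
# Predicted probability of spam for a message with 4 links and 10 all-caps words
predict(spam_fit, newdata = data.frame(n_links = 4, n_caps = 10), type = "response")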
A support vector machine is a popular supervised learning model used for both data classification and regression. That said, it is typically leveraged for classification problems, constructing a hyperplane where the distance between two classes of data points is at its maximum. This hyperplane is known as the decision boundary, separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.
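A minimal SVM sketch on the built-in iris data, assuming the e1071 package (not used elsewhere in this text):
library(e1071)   # assumed package providing svm()
set.seed(1)
train_idx <- sample(nrow(iris), 100)
svm_fit <- svm(Species ~ ., data = iris[train_idx, ], kernel = "linear")
svm_pred <- predict(svm_fit, iris[-train_idx, ])
table(predicted = svm_pred, actual = iris$Species[-train_idx])   # confusion matrix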
K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other data. This algorithm assumes that similar data points can be found near each other. As a result, it seeks to calculate the distance between data points, usually through Euclidean distance, and then it assigns a category based on the most frequent category or average. Its ease of use and low calculation time make it a preferred algorithm among data scientists, but as the test dataset grows, the processing time lengthens, making it less appealing for classification tasks. KNN is typically used for recommendation engines and image recognition.
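A minimal KNN sketch, assuming the class package; the features are scaled first because KNN relies on Euclidean distances:
library(class)   # assumed package providing knn()
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
knn_pred <- knn(train = train_x, test = test_x, cl = iris$Species[train_idx], k = 5)
table(predicted = knn_pred, actual = iris$Species[-train_idx])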
Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
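A minimal classification-tree sketch, assuming the rpart package; printing the fitted object shows the simple decision rules inferred from the features.
library(rpart)   # assumed package providing rpart()
tree_fit <- rpart(Species ~ ., data = iris, method = "class")
print(tree_fit)                  # the learned decision rules
plot(tree_fit); text(tree_fit)   # basic tree plot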
Random forest is another flexible supervised machine learning algorithm used for both classification and regression purposes. The “forest” references a collection of uncorrelated decision trees, which are then merged together to reduce variance and create more accurate data predictions. In other words, a decision tree is a map of the possible outcomes of a series of related choices. A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a treelike shape.
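A minimal random-forest sketch, assuming the randomForest package; the fitted object reports an out-of-bag error estimate and a confusion matrix.
library(randomForest)   # assumed package providing randomForest()
set.seed(1)
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)
rf_fit                # out-of-bag error estimate and confusion matrix
importance(rf_fit)    # variable importance (mean decrease in Gini)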
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. For example, the iris dataset consists of the physical parameters of three species of flower: Versicolor, Setosa, and Virginica. The numeric parameters the dataset contains are sepal width, sepal length, petal width, and petal length. With this data we will be predicting the classes of the flowers based on these parameters. The data consist of continuous numeric values which describe the dimensions of the respective features. Even if these features depend on each other or upon the existence of the other features, a naïve Bayes classifier would consider all of these properties to independently contribute to the probability that the flower belongs to a particular species.
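A minimal sketch of this iris example, assuming the e1071 package for its naiveBayes() implementation:
library(e1071)   # assumed package providing naiveBayes()
set.seed(1)
train_idx <- sample(nrow(iris), 100)
nb_fit <- naiveBayes(Species ~ ., data = iris[train_idx, ])
nb_pred <- predict(nb_fit, iris[-train_idx, ])
table(predicted = nb_pred, actual = iris$Species[-train_idx])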
Linear regression helps predict a continuous quantity. It is used to identify the relationship between a dependent variable and one or more independent variables and to make predictions about future outcomes. When there is only one independent variable and one dependent variable, it is known as simple linear regression. As the number of independent variables increases, it is referred to as multiple linear regression. Each type of linear regression seeks to plot a line of best fit, which is calculated through the method of least squares. However, unlike other regression models, this line is straight when plotted on a graph.
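A minimal least-squares sketch with base R's lm(), using the built-in mtcars data as an illustrative stand-in:
# Simple linear regression: one independent variable
simple_fit <- lm(mpg ~ wt, data = mtcars)
coef(simple_fit)   # intercept and slope of the least-squares line
# Multiple linear regression: several independent variables
multiple_fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(multiple_fit)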
Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. It allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabeled data. These algorithms discover hidden patterns or data groupings without the need for human intervention. Their ability to discover similarities and differences in information makes them the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
Unsupervised learning models are utilized for three main tasks—clustering, association, and dimensionality reduction.
Clustering is a data mining technique which groups unlabeled data based on their similarities or differences. Clustering algorithms are used to process raw, unclassified data objects into groups represented by structures or patterns in the information. Clustering algorithms can be categorized into a few types, specifically exclusive, overlapping, hierarchical, and probabilistic.
Exclusive clustering is a form of grouping that requires a data point to exist in only one cluster. This can also be referred to as “hard” clustering. The K-means clustering algorithm is an example of exclusive clustering.
K-means clustering is a common example of an exclusive clustering method where data points are assigned into K groups, where K represents the number of clusters, based on the distance from each group’s centroid. The data points closest to a given centroid will be clustered under the same category. K-means clustering is commonly used in market segmentation, document clustering, image segmentation, and image compression. Overlapping clustering differs from exclusive clustering in that it allows data points to belong to multiple clusters with different degrees of membership. “Soft” or fuzzy k-means clustering is an example of overlapping clustering.
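A minimal K-means sketch with base R's kmeans(), clustering two iris features into K = 3 groups:
set.seed(1)
km <- kmeans(scale(iris[, c("Petal.Length", "Petal.Width")]), centers = 3, nstart = 25)
km$centers                                            # the three centroids (on the scaled features)
table(cluster = km$cluster, species = iris$Species)   # how the clusters line up with the species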
knitr::include_graphics("https://miro.medium.com/max/1122/0*ipBIcsy9jjvqEpbK.png")
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is an unsupervised clustering algorithm that can be categorized in two ways: it can be agglomerative or divisive. Agglomerative clustering is considered a “bottom-up” approach. Its data points are isolated as separate groupings initially, and then they are merged together iteratively on the basis of similarity until one cluster has been achieved. Four different methods are commonly used to measure similarity:
Ward’s linkage: This method states that the distance between two clusters is defined by the increase in the sum of squared errors after the clusters are merged.
Average linkage: This method is defined by the mean distance between two points in each cluster.
Complete (or maximum) linkage: This method is defined by the maximum distance between two points in each cluster.
Single (or minimum) linkage: This method is defined by the minimum distance between two points in each cluster.
knitr::include_graphics("https://editor.analyticsvidhya.com/uploads/40351linkages.PNG")
Euclidean distance is the most common metric used to calculate these distances; however, other metrics, such as Manhattan distance, are also cited in the clustering literature.
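A minimal agglomerative sketch with base R's dist() and hclust(), comparing the four linkage methods listed above (hclust names Ward's linkage "ward.D2"):
d <- dist(scale(iris[, 1:4]), method = "euclidean")   # pairwise Euclidean distances
hc_ward     <- hclust(d, method = "ward.D2")
hc_average  <- hclust(d, method = "average")
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
plot(hc_ward, labels = FALSE)        # dendrogram of the successive merges (see below)
clusters <- cutree(hc_ward, k = 3)   # cut the dendrogram into three clusters
table(clusters, iris$Species)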
Divisive clustering can be defined as the opposite of agglomerative clustering; instead it takes a “top-down” approach. In this case, a single data cluster is divided based on the differences between data points. Divisive clustering is not commonly used, but it is still worth noting in the context of hierarchical clustering. These clustering processes are usually visualized using a dendrogram, a tree-like diagram that documents the merging or splitting of data points at each iteration.
1. Diagram of a Dendrogram - “bottom up” demonstrates agglomerative clustering, while “top-down” demonstrates divisive clustering
While more data generally yields more accurate results, it can also impact the performance of machine learning algorithms (e.g., overfitting) and make datasets difficult to visualize. Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the integrity of the dataset as much as possible. It is commonly used in the data preprocessing stage, and there are a few different dimensionality reduction methods that can be used, such as:
Principal component analysis (PCA) is a type of dimensionality reduction method which is used to reduce redundancies and to compress datasets through feature extraction; in other words, it reduces the dimensionality of a large dataset into a smaller one that still contains most of the information needed. This method uses a linear transformation to create a new data representation, yielding a set of “principal components.” The first principal component is the direction which maximizes the variance of the dataset. While the second principal component also finds the maximum variance in the data, it is completely uncorrelated to the first principal component, yielding a direction that is perpendicular, or orthogonal, to the first component. This process repeats based on the number of dimensions, where each next principal component is the direction orthogonal to the prior components with the most variance. Reducing the number of variables of a dataset naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller datasets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms because there are no extraneous variables to process.
Step-by-Step Explanation of PCA:
It is critical to perform standardization prior to PCA because PCA is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges, which will lead to biased results. Transforming the data to comparable scales can prevent this problem. Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
\(z=\frac{value-mean}{standard\:deviation}\)
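In base R, this standardization is a one-liner with scale(); a minimal sketch on the numeric iris variables:
X <- iris[, 1:4]                            # numeric variables only
Z <- scale(X, center = TRUE, scale = TRUE)  # subtract each mean, divide by each standard deviation
round(colMeans(Z), 10)                      # approximately 0 for every variable
apply(Z, 2, sd)                             # exactly 1 for every variable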
The aim of this step is to understand how the variables of the input dataset vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
\[\begin{equation*} Covariance\:matrix = \begin{bmatrix} Cov(x,x) & Cov(x,y) & Cov(x,z) \\ Cov(y,x) & Cov(y,y) & Cov(y,z) \\ Cov(z,x) & Cov(z,y) & Cov(z,z) \end{bmatrix} \end{equation*}\]
Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
What do the covariances that we have as entries of the matrix tell us about the correlations between the variables? It is actually the sign of the covariance that matters: if it is positive, the two variables increase or decrease together (they are correlated); if it is negative, one increases when the other decreases (they are inversely correlated).
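A minimal sketch with base R's cov() on the standardized iris variables from the previous step:
Z <- scale(iris[, 1:4])   # standardized variables, as in the previous step
C <- cov(Z)               # the covariance matrix
round(C, 2)               # symmetric, with the variances (here 1, because of scaling) on the main diagonal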
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.
Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, the lines that capture most of the information in the data. The relationship between variance and information is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.
As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set.
The first principal component is approximately the line that matches the purple marks because it goes through the origin and it’s the line in which the projection of the points (red dots) is the most spread out. Or mathematically speaking, it’s the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin).
2. Principal components graphical representation
The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.
PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
Mathematically, the transformation is defined by a set of size \({l}\) of p-dimensional vectors of weights or coefficients \({w_k = (w_1,\dots,w_p)_k}\) that map each row vector \({x_i}\) of X to a new vector of principal component scores \({t_i = (t_1,\dots,t_l)_i}\), given by \({t_{k_i} = x_i\cdot{w_k}}\) for \({i = 1,\dots,n}\) and \({k = 1,\dots,l}\), in such a way that the individual variables \({t_{1},\dots,t_{l}}\) of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector (where \({l}\) is usually selected to be strictly less than \({p}\) to reduce dimensionality).
First component: in order to maximize variance, the first weight vector \({w_1}\) has to satisfy
\({w_1 = arg\:max_{\lVert{w}\rVert = 1}\{\sum_{i}(t_1)_i^2\} = arg\:max_{\lVert{w}\rVert = 1}\{\sum_{i}(x_i\cdot{w})^2\}}\)
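A minimal PCA sketch with base R's prcomp(), which centers and scales the data and returns the weight vectors, the explained variance, and the component scores:
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
pca$rotation          # the weight vectors w_k (one column per principal component)
summary(pca)          # standard deviation and proportion of variance of each component
head(pca$x[, 1:2])    # scores t_i of the observations on the first two components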
Singular value decomposition (SVD) is another dimensionality reduction approach which factorizes a matrix A into three low-rank matrices. SVD is denoted by the formula \({A = USV^T}\), where U and V are orthogonal matrices and S is a diagonal matrix whose entries are the singular values of A. Similar to PCA, it is commonly used to reduce noise and compress data, such as image files.
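A minimal sketch with base R's svd(), including the kind of low-rank reconstruction used for compression:
A <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)   # centered data matrix
s <- svd(A)                      # returns u, d (singular values), and v, with A = U diag(d) t(V)
s$d                              # the singular values of A
A_rank2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])         # rank-2 approximation of A
max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))                         # ~0: the full decomposition is exact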
Autoencoders leverage neural networks to compress data and then recreate a new representation of the original data’s input. Looking at the image below, you can see that the hidden layer specifically acts as a bottleneck to compress the input layer prior to reconstructing within the output layer. The stage from the input layer to the hidden layer is referred to as “encoding” while the stage from the hidden layer to the output layer is known as “decoding.”
Association rule learning is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines. Examples of this can be seen in Amazon’s “Customers Who Bought This Item Also Bought” or Spotify’s “Discover Weekly” playlist.
Apriori algorithms have been popularized through market basket analyses, leading to different recommendation engines for music platforms and online retailers. They are used within transactional datasets to identify frequent itemsets, or collections of items, and to estimate the likelihood of consuming a product given the consumption of another product. For example, if I play Black Sabbath’s radio on Spotify, starting with their song “Orchid”, one of the other songs on this channel will likely be a Led Zeppelin song, such as “Over the Hills and Far Away.” This is based on my prior listening habits as well as those of others. Apriori algorithms use a hash tree to count itemsets, navigating through the dataset in a breadth-first manner. Apriori is generally considered an unsupervised learning approach, since it is often used to discover or mine interesting patterns and relationships. Apriori can also be modified to do classification based on labelled data.
# Items in each transaction (these values are referenced with inline R chunks in the text below)
T1 <- "hot dogs, buns, ketchup"
T2 <- "hot dogs, buns"
T3 <- "hot dogs, coke, chips"
T4 <- "chips, coke"
T5 <- "chips, ketchup"
T6 <- "hot dogs, coke, chips"
# Transaction identifiers (also referenced inline)
Transaction_ID <- c("T1","T2","T3","T4","T5","T6")
t1 <- "T1"
t2 <- "T2"
t3 <- "T3"
t4 <- "T4"
t5 <- "T5"
t6 <- "T6"
For example, we can have transactions T1, T2, T3, T4, T5, T6 and, for these transactions, the associated items T1: hot dogs, buns, ketchup; T2: hot dogs, buns; T3: hot dogs, coke, chips; T4: chips, coke; T5: chips, ketchup; T6: hot dogs, coke, chips. Here I use inline R chunks for the values.
| Transaction ID | Items |
|---|---|
| T1 | hot dogs, buns, ketchup |
| T2 | hot dogs, buns |
| T3 | hot dogs, coke, chips |
| T4 | chips, coke |
| T5 | chips, ketchup |
| T6 | hot dogs, coke, chips |
By setting a minimum support threshold and a minimum confidence threshold, we can estimate how likely a customer is to buy a certain product given that they buy another product.
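A minimal sketch of mining these toy transactions, assuming the arules package (which is not used elsewhere in this text); the support and confidence thresholds are illustrative:
library(arules)   # assumed package providing apriori()
trans_list <- list(
  T1 = c("hot dogs", "buns", "ketchup"),
  T2 = c("hot dogs", "buns"),
  T3 = c("hot dogs", "coke", "chips"),
  T4 = c("chips", "coke"),
  T5 = c("chips", "ketchup"),
  T6 = c("hot dogs", "coke", "chips")
)
trans <- as(trans_list, "transactions")
rules <- apriori(trans, parameter = list(supp = 2/6, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "confidence"))   # e.g. {coke} => {chips} with confidence 1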
Since there was no informative dataset specifically about this topic, I found one about people employed in the data science field.
dataset <- read_csv("data_cleaned_2021.csv")
## Rows: 742 Columns: 42
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (17): Job Title, Salary Estimate, Job Description, Company Name, Locatio...
## dbl (25): index, Rating, Founded, Hourly, Employer provided, Lower Salary, U...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Specification (column types)
spec(dataset)
## cols(
## index = col_double(),
## `Job Title` = col_character(),
## `Salary Estimate` = col_character(),
## `Job Description` = col_character(),
## Rating = col_double(),
## `Company Name` = col_character(),
## Location = col_character(),
## Headquarters = col_character(),
## Size = col_character(),
## Founded = col_double(),
## `Type of ownership` = col_character(),
## Industry = col_character(),
## Sector = col_character(),
## Revenue = col_character(),
## Competitors = col_character(),
## Hourly = col_double(),
## `Employer provided` = col_double(),
## `Lower Salary` = col_double(),
## `Upper Salary` = col_double(),
## `Avg Salary(K)` = col_double(),
## company_txt = col_character(),
## `Job Location` = col_character(),
## Age = col_double(),
## Python = col_double(),
## spark = col_double(),
## aws = col_double(),
## excel = col_double(),
## sql = col_double(),
## sas = col_double(),
## keras = col_double(),
## pytorch = col_double(),
## scikit = col_double(),
## tensor = col_double(),
## hadoop = col_double(),
## tableau = col_double(),
## bi = col_double(),
## flink = col_double(),
## mongo = col_double(),
## google_an = col_double(),
## job_title_sim = col_character(),
## seniority_by_title = col_character(),
## Degree = col_character()
## )
# What the data looks like
dataset
## # A tibble: 742 x 42
## index `Job Title` `Salary Estima~` `Job Descripti~` Rating `Company Name`
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 0 Data Scientist $53K-$91K (Glas~ "Data Scientist~ 3.8 "Tecolote Res~
## 2 1 Healthcare Dat~ $63K-$112K (Gla~ "What You Will ~ 3.4 "University o~
## 3 2 Data Scientist $80K-$90K (Glas~ "KnowBe4, Inc. ~ 4.8 "KnowBe4\n4.8"
## 4 3 Data Scientist $56K-$97K (Glas~ "*Organization ~ 3.8 "PNNL\n3.8"
## 5 4 Data Scientist $86K-$143K (Gla~ "Data Scientist~ 2.9 "Affinity Sol~
## 6 5 Data Scientist $71K-$119K (Gla~ "CyrusOne is se~ 3.4 "CyrusOne\n3.~
## 7 6 Data Scientist $54K-$93K (Glas~ "Job Descriptio~ 4.1 "ClearOne Adv~
## 8 7 Data Scientist $86K-$142K (Gla~ "Advanced Analy~ 3.8 "Logic20/20\n~
## 9 8 Research Scien~ $38K-$84K (Glas~ "SUMMARY\n\nThe~ 3.3 "Rochester Re~
## 10 9 Data Scientist $120K-$160K (Gl~ "isn’t your usu~ 4.6 "<intent>\n4.~
## # ... with 732 more rows, and 36 more variables: Location <chr>,
## # Headquarters <chr>, Size <chr>, Founded <dbl>, `Type of ownership` <chr>,
## # Industry <chr>, Sector <chr>, Revenue <chr>, Competitors <chr>,
## # Hourly <dbl>, `Employer provided` <dbl>, `Lower Salary` <dbl>,
## # `Upper Salary` <dbl>, `Avg Salary(K)` <dbl>, company_txt <chr>,
## # `Job Location` <chr>, Age <dbl>, Python <dbl>, spark <dbl>, aws <dbl>,
## # excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, pytorch <dbl>, ...
# New column names for easier manipulation
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
df_clean <- clean_names(dataset)
# The new column names
colnames(df_clean)
## [1] "index" "job_title" "salary_estimate"
## [4] "job_description" "rating" "company_name"
## [7] "location" "headquarters" "size"
## [10] "founded" "type_of_ownership" "industry"
## [13] "sector" "revenue" "competitors"
## [16] "hourly" "employer_provided" "lower_salary"
## [19] "upper_salary" "avg_salary_k" "company_txt"
## [22] "job_location" "age" "python"
## [25] "spark" "aws" "excel"
## [28] "sql" "sas" "keras"
## [31] "pytorch" "scikit" "tensor"
## [34] "hadoop" "tableau" "bi"
## [37] "flink" "mongo" "google_an"
## [40] "job_title_sim" "seniority_by_title" "degree"
For example, we can look at the average salaries of people working in the data science field by job title and by sector,
df_clean %>% ggplot(aes(x = avg_salary_k, y = job_title_sim, color = job_title_sim)) + geom_point(alpha = 0.8) + theme_bw(base_size = 7)
df_clean %>% ggplot(aes(x = avg_salary_k, y = sector, color = sector)) + geom_point(alpha = 0.8) + theme_bw(base_size = 5)
or the presence of different job titles in different sectors.
df_clean %>% ggplot(aes(x = job_title_sim, y = sector, color = job_title_sim)) + geom_point(alpha = 0.8) + theme_bw(base_size = 7)
Citations: (Zaki and Meira 2020) (Yang 2019)