This project presents three topics from Unsupervised Learning. I will start with clustering of the data, then move to dimension reduction, and finish with association rules. All algorithms will be demonstrated on a single dataset, which was gathered as part of my graduation project.
Nowadays, analyzing text and measuring its sentiment is becoming more and more popular. At first glance such data seems impossible to describe with math or statistics, but in fact it can be. This makes it a bit harder to analyze, yet it can still tell us a lot about the topic we want to observe.
First of all, I would like to describe the data, as it is important to know what we can expect. The data was gathered from Twitter and concerns the Polish gaming stock company CD Projekt RED. The tweets about the company were collected via the Twitter API and contain the tweet text, the number of likes and retweets, the date of publication and the user. Only tweets containing one of the keywords connected with CDPR were taken into consideration. They were gathered from February 2021 to May 2021, and the database contains about 45,000 unique tweets. From each tweet I would like to extract the emojis and the text. Then sentiment analysis will be conducted to obtain numeric variables from the text, and we will proceed with the analyses. I would also like to extract the words from each tweet in order to mine some association rules.
All further steps and analyses will be described in each block of the project.
Before loading the packages with library(package_name), first make sure that they are installed. If they are not, please use install.packages("package_name") to install them.
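A minimal convenience sketch (the package list below is assumed to match the library() calls that follow) that installs only whatever is missing:
required_packages <- c("factoextra", "flexclust", "fpc", "clustertend", "cluster", "ClusterR",
                       "tsne", "NbClust", "arules", "arulesViz", "arulesCBA", "purrr", "stringr",
                       "tm", "syuzhet", "plyr", "dplyr", "wordcloud", "corrplot", "ggplot2",
                       "gridExtra", "knitr")
# compare against what is already installed and install only the missing packages
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)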
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(stats)
library(tsne)
library(NbClust)
library(arules)
library(arulesViz)
library(arulesCBA)
library(graphics)
library(purrr)
library(stringr)
library(tm)
library(syuzhet)
library(plyr)
library(dplyr)
library(wordcloud)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(grid)
library(knitr)
options(scipen = 999, "digits"=4)
After loading the needed libraries we can go through cleaning and exploring the data, before we get to the actual analysis with the Unsupervised Learning algorithms.
#Data <- read.csv("path to tweets data")
#Emoji_db <- read.csv("path to base of emojis")
#Emoji_sentiment <- read.csv("path to emojis sentiment")
Cleaning the Twitter data from additional special signs around the emoji codepoints - this will be used later to check which emoji appears in each tweet and how many times it appears.
Data$text <- gsub("[<>]"," ", Data$text)
Data$text <- gsub("000","",Data$text)
Data$text <- gsub("\\+","!",Data$text)
Data$text <- gsub("U!","",Data$text)
Merging the emoji codepoints with their sentiment scores.
Emoji_sentiment$codepoints <- sprintf("%x", Emoji_sentiment$Unicode.codepoint)
Emoji <- merge(x = Emoji_db, y = Emoji_sentiment, by = "codepoints", all.y = TRUE)
rm(Emoji_db)
rm(Emoji_sentiment)
Emoji <- Emoji[,c(1,8,9,10,11,12)]
colnames(Emoji)<-c("codepoints","Ocurrences","position","negative","neutral","positive")
Emoji$codepoints <- toupper(Emoji$codepoints)
Normalizing the sentiment score of each emoji (the difference between its positive and negative scores is standardized).
Emoji$substract<-Emoji$positive-Emoji$negative
Emoji$sentiment <-((Emoji$substract- mean(Emoji$substract))/(sd(Emoji$substract)))
Checking for outliers - a large sentiment outlier, especially for a very common emoji, is a problem because it can have a huge impact on the overall sentiment of a tweet.
Based on the plot I decided to set a threshold: if the sentiment is greater than 1 or less than -1, it is set to 1 or -1 respectively (so all emoji sentiments fall in the interval from -1 to 1).
Emoji$sentiment <- ifelse(Emoji$sentiment>=1,1,Emoji$sentiment)
Emoji$sentiment<- ifelse(Emoji$sentiment<=-1,-1,Emoji$sentiment)
Plot after applying threshold
Now I will check whether each tweet contains emojis and, if it does, count each type of emoji and multiply the count by the sentiment calculated above. Then I sum all emoji sentiments in each tweet to obtain its overall emoji sentiment.
z <- c()
matrix_Emoji <- c()
for(i in Emoji$codepoints){
  z <- as.data.frame(str_count(Data$text, i))             # how many times emoji i occurs in each tweet
  z <- z * Emoji$sentiment[Emoji$codepoints == i]         # weight the counts by that emoji's sentiment
  matrix_Emoji <- as.data.frame(append(matrix_Emoji, z))  # collect one column per emoji
}
Emoji_sentiment_sum_mat <- as.matrix(matrix_Emoji)
Emoji_sentiment_sum <- rowSums(Emoji_sentiment_sum_mat)
We must take into consideration that some tweets contain the same emoji many times, which can create outliers. Let's plot the sums and see whether there are tweets whose emoji sentiment falls outside the interval from -3 to 3.
plot(1:length(Data$text),Emoji_sentiment_sum, xlab="N.o. of tweet", ylab="Emoji sentiment")
abline(h=c(-3,3), col="red", lwd = 2)
We can see some outliers. Let's assume that if a tweet's emoji sentiment is greater than 3 or less than -3, we cap it at 3 or -3 respectively.
Emoji_sentiment_sum <- ifelse(Emoji_sentiment_sum >=3,3,Emoji_sentiment_sum)
Emoji_sentiment_sum <- ifelse(Emoji_sentiment_sum <=-3,-3,Emoji_sentiment_sum)
Emoji_sentiment_sum <- as.data.frame(Emoji_sentiment_sum)
First we need to get rid of special characters, stop words, numbers, URLs, extra blank spaces, etc.
twitterCorpus <- Corpus(VectorSource(Data$text))
twitterCorpus <- tm_map(twitterCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(tolower)):
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeWords, stopwords("en")):
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeNumbers): transformation
## drops documents
removeURL <- function(x) gsub("http[[:alnum:]]*","",x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(removeURL)):
## transformation drops documents
removeURL_2 <- function(x) gsub("edua[[:alnum:]]*","",x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeURL_2))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(removeURL_2)):
## transformation drops documents
removeNonAscii <- function(x) textclean::replace_non_ascii(x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeNonAscii))
## Warning in tm_map.SimpleCorpus(twitterCorpus,
## content_transformer(removeNonAscii)): transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeWords,c("amp","ufef","ufeft","uufefuufefuufef","uufef","s","uffuffuufef"))
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeWords, c("amp", "ufef", :
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(twitterCorpus, stripWhitespace): transformation
## drops documents
twitterCorpus <- tm_map(twitterCorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(twitterCorpus, removePunctuation): transformation
## drops documents
Now I will create one data frame with all the needed data: the cleaned tweets and the emoji sentiment of each tweet.
Data[,1] <- twitterCorpus$content
Data$EmojisSentiment <- Emoji_sentiment_sum$Emoji_sentiment_sum
We can use sentiment dictionaries built into R packages to evaluate each tweet's sentiment based on its text. First we use the get_nrc_sentiment function to obtain the emotions connected with the text of each tweet - anger, anticipation, disgust, fear, joy, sadness, surprise and trust - together with negative and positive scores.
emotions <- get_nrc_sentiment(Data$text)
Data$anger <- emotions$anger
Data$anticipation <- emotions$anticipation
Data$disgust <- emotions$disgust
Data$fear <- emotions$fear
Data$joy <- emotions$joy
Data$sadness <- emotions$sadness
Data$surprise <- emotions$surprise
Data$trust <- emotions$trust
Data$negative <- emotions$negative
Data$positive <- emotions$positive
Now I will use a second method that evaluates sentiment on an interval scale. Positive values mean the tweet's sentiment is positive, negative values mean it is negative, and values around 0 mean the tweet is neutral.
sentiment_syuzhet <- get_sentiment(Data$text, method = "syuzhet")
Data$Sentiment <- sentiment_syuzhet
Now let's build a term-document matrix of tweets and words - it will be needed later to mine the association rules.
dtm <- TermDocumentMatrix(twitterCorpus)
dtm2 <- removeSparseTerms(dtm, sparse = 0.999)
matrix <- as.matrix(dtm2)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
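Since the wordcloud package is loaded and the word-frequency table df is already available, a quick visual sanity check of the vocabulary could be drawn as well (just a sketch - the frequency thresholds are arbitrary):
set.seed(1234)
wordcloud(words = df$word, freq = df$freq,
          min.freq = 100,       # plot only words that appear at least 100 times
          max.words = 100,      # cap the number of words drawn
          random.order = FALSE) # place the most frequent words in the centre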
Adding month and hour as separate columns, together with SentimentOverall, which is the sum of the sentiment from the text and from the emojis.
Data <- Data[,-1]
Data <- Data[,-3]
month <- as.numeric(substr(Data$created,7,7))
hour <- as.numeric(substr(Data$created,12,13))
Data <- Data[,-2]
Data$month <- month
Data$hour <- hour
Data$SentimentOverall <- Data$EmojisSentiment+Data$Sentiment
Now I would like to see some descriptive statistics for our data.
summary(Data)
## favoriteCount retweetCount EmojisSentiment anger
## Min. : 0 Min. : 0.0 Min. :-3.0000 Min. :0.000
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.:0.000
## Median : 1 Median : 0.0 Median : 0.0000 Median :0.000
## Mean : 7 Mean : 1.1 Mean : 0.0591 Mean :0.151
## 3rd Qu.: 3 3rd Qu.: 1.0 3rd Qu.: 0.0000 3rd Qu.:0.000
## Max. :11600 Max. :1192.0 Max. : 3.0000 Max. :4.000
## anticipation disgust fear joy
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000 Median :0.000
## Mean :0.349 Mean :0.117 Mean :0.174 Mean :0.275
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :5.000 Max. :4.000 Max. :4.000 Max. :5.000
## sadness surprise trust negative positive
## Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.0 Median :0.000 Median :0.000
## Mean :0.161 Mean :0.174 Mean :0.3 Mean :0.365 Mean :0.548
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:1.0 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :4.000 Max. :4.000 Max. :4.0 Max. :5.000 Max. :6.000
## Sentiment month hour SentimentOverall
## Min. :-4.050 Min. :2.00 Min. : 0 Min. :-4.250
## 1st Qu.:-0.100 1st Qu.:3.00 1st Qu.: 8 1st Qu.:-0.106
## Median : 0.000 Median :4.00 Median :14 Median : 0.000
## Mean : 0.198 Mean :3.77 Mean :13 Mean : 0.257
## 3rd Qu.: 0.750 3rd Qu.:4.00 3rd Qu.:18 3rd Qu.: 0.750
## Max. : 5.400 Max. :5.00 Max. :23 Max. : 6.800
As we can see, there are some very large outliers in the variables favoriteCount and retweetCount, which is why I standardize these two variables to make sure their extreme values do not dominate the analyses.
Data$favoriteCount<-scale(Data$favoriteCount)
Data$retweetCount<-scale(Data$retweetCount)
Now let us look at the sentiment of tweets depending on the month and hour when a tweet was posted. The red line on the graphs shows the mean sentiment of all gathered tweets. We can see that in February and March the sentiment of tweets was above the mean and that it fell in each subsequent month.
ag1<-aggregate.data.frame(Data$SentimentOverall,list(Data$month),mean)
ag1$positive <- ifelse(ag1$x>=0,1,2)
plot(as.character(ag1$Group.1),ag1$x, cex=3,col=ag1$positive, pch=19, xlab="Month", ylab="Mean sentiment")
abline(h=mean(Data$SentimentOverall), col="red", lwd = 2)
Analyzing the sentiment by the hour at which tweets were posted, we can also see groups of hours in which posts were on average more positive and groups in which they were more negative.
ag2<-aggregate.data.frame(Data$SentimentOverall,list(Data$hour),mean)
ag2$positive <- ifelse(ag2$x>=0,1,2)
plot(ag2$Group.1,ag2$x, cex=3,col=ag2$positive, pch=19, xlab="Hour", ylab="Mean sentiment")
abline(h=mean(Data$SentimentOverall), col="red", lwd = 2)
We should also look at the correlations between variables. As we might expect, some emotions are highly correlated with each other; now we can see how strong those correlations are.
cor.matrix <- cor(Data, method = c("spearman"))
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))
We start the analysis with clustering. Based on the data we may expect at least two groups - positive tweets and negative tweets - and perhaps clustering will reveal some other groups. The dataset has around 45,000 observations, which may be quite demanding for some algorithms, so I decided to randomly pick a sample of 10,000 observations on which to perform the clustering. We also standardize the variables to make sure that very high or low values do not dominate the analysis. After standardization the values are not as easy to interpret directly, but we can still interpret them relative to the other groups.
data.standarized <- as.data.frame(scale(Data))
set.seed(12444)
Sample <- sample_n(data.standarized,10000)
str(Sample)
## 'data.frame': 10000 obs. of 17 variables:
## $ favoriteCount : num -0.081 -0.081 -0.081 -0.081 -0.081 ...
## $ retweetCount : num -0.1167 -0.1167 -0.1167 -0.1167 -0.0111 ...
## $ EmojisSentiment : num -0.167 -0.167 -0.167 -0.167 -0.167 ...
## $ anger : num -0.376 -0.376 -0.376 -0.376 -0.376 ...
## $ anticipation : num 1.05 -0.56 -0.56 -0.56 -0.56 ...
## $ disgust : num -0.333 -0.333 -0.333 -0.333 -0.333 ...
## $ fear : num -0.404 -0.404 -0.404 -0.404 -0.404 ...
## $ joy : num 1.34 -0.51 -0.51 -0.51 -0.51 ...
## $ sadness : num -0.394 -0.394 -0.394 -0.394 -0.394 ...
## $ surprise : num -0.41 -0.41 -0.41 -0.41 -0.41 ...
## $ trust : num 1.263 -0.542 -0.542 -0.542 -0.542 ...
## $ negative : num 1.024 -0.588 -0.588 -0.588 -0.588 ...
## $ positive : num 0.589 -0.715 -0.715 -0.715 -0.715 ...
## $ Sentiment : num 0.261 -0.255 -0.255 -0.255 -0.255 ...
## $ month : num 0.29 1.578 -0.998 0.29 0.29 ...
## $ hour : num 1.518 -1.527 -0.309 -0.766 -1.223 ...
## $ SentimentOverall: num 0.164 -0.294 -0.294 -0.294 -0.294 ...
summary(Sample)
## favoriteCount retweetCount EmojisSentiment anger
## Min. :-0.08 Min. :-0.117 Min. :-8.653 Min. :-0.376
## 1st Qu.:-0.08 1st Qu.:-0.117 1st Qu.:-0.167 1st Qu.:-0.376
## Median :-0.07 Median :-0.117 Median :-0.167 Median :-0.376
## Mean :-0.01 Mean :-0.008 Mean : 0.010 Mean : 0.001
## 3rd Qu.:-0.05 3rd Qu.:-0.011 3rd Qu.:-0.167 3rd Qu.:-0.376
## Max. :34.28 Max. :14.662 Max. : 8.319 Max. : 9.553
## anticipation disgust fear joy
## Min. :-0.560 Min. :-0.333 Min. :-0.404 Min. :-0.51
## 1st Qu.:-0.560 1st Qu.:-0.333 1st Qu.:-0.404 1st Qu.:-0.51
## Median :-0.560 Median :-0.333 Median :-0.404 Median :-0.51
## Mean : 0.001 Mean : 0.012 Mean :-0.006 Mean : 0.00
## 3rd Qu.: 1.045 3rd Qu.:-0.333 3rd Qu.:-0.404 3rd Qu.:-0.51
## Max. : 5.860 Max. : 8.210 Max. : 8.880 Max. : 8.74
## sadness surprise trust negative
## Min. :-0.394 Min. :-0.410 Min. :-0.542 Min. :-0.588
## 1st Qu.:-0.394 1st Qu.:-0.410 1st Qu.:-0.542 1st Qu.:-0.588
## Median :-0.394 Median :-0.410 Median :-0.542 Median :-0.588
## Mean : 0.007 Mean : 0.009 Mean : 0.004 Mean : 0.018
## 3rd Qu.:-0.394 3rd Qu.:-0.410 3rd Qu.: 1.263 3rd Qu.: 1.024
## Max. : 9.402 Max. : 9.035 Max. : 6.680 Max. : 7.471
## positive Sentiment month hour
## Min. :-0.715 Min. :-5.096 Min. :-2.2862 Min. :-1.9837
## 1st Qu.:-0.715 1st Qu.:-0.449 1st Qu.:-0.9981 1st Qu.:-0.7658
## Median :-0.715 Median :-0.255 Median : 0.2901 Median : 0.1476
## Mean : 0.005 Mean :-0.016 Mean :-0.0085 Mean :-0.0033
## 3rd Qu.: 0.589 3rd Qu.: 0.648 3rd Qu.: 0.2901 3rd Qu.: 0.7565
## Max. : 5.804 Max. : 4.779 Max. : 1.5782 Max. : 1.5177
## SentimentOverall
## Min. :-4.580
## 1st Qu.:-0.465
## Median :-0.294
## Mean :-0.010
## 3rd Qu.: 0.564
## Max. : 6.505
Before we apply the actual clustering algorithms we should check whether the data can be clustered at all. For that we will use the Hopkins statistic. The null hypothesis of the test is that there are no meaningful clusters in the data; the alternative hypothesis assumes that the data can be clustered. Basically, if the statistic is close to 1 we can reject the null hypothesis and proceed with clustering. Unfortunately R Markdown had a problem handling such a big vector in the function, but the statistic is equal to 0.9649.
#get_clust_tendency(Sample,100)
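A possible workaround (just a sketch, assuming a random subsample is representative enough for the tendency check) is to compute the statistic on fewer rows:
set.seed(12444)
tendency_sample <- Sample[sample(nrow(Sample), 1000), ]                   # smaller random subsample
get_clust_tendency(tendency_sample, n = 100, graph = FALSE)$hopkins_stat  # Hopkins statistic only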
As we can see, the Hopkins statistic is close to 1, which means we can cluster our data. The question is how many clusters we should use. We will check it first with the Calinski-Harabasz and Duda-Hart indices. We will also take a look at the Silhouette index, the elbow method and the shadow statistic.
Let us start with the Calinski-Harabasz and Duda-Hart indices. Basically, we run k-means clustering three times (with k=4, k=3 and k=2) and then check the index values for each clustering. The higher the statistic, the better that number of clusters is.
km4<-kmeans(Sample, 4)
km3<-kmeans(Sample, 3)
km2<-kmeans(Sample,2)
round(calinhara(Sample, km4$cluster),digits=2)
## [1] 1892
round(calinhara(Sample, km3$cluster),digits=2)
## [1] 1588
round(calinhara(Sample, km2$cluster),digits=2)
## [1] 2017
dudahart2(Sample, km4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.3128
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
dudahart2(Sample, km3$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.5693
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
dudahart2(Sample, km2$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.8321
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
As we can see, both indices show that the optimal number of clusters is 2 (we should look at the highest values of $dh). Next we move to the elbow method, which assesses the optimal number of clusters visually, and the Silhouette statistic, interpreted as: the higher the value, the better.
opt1<-Optimal_Clusters_KMeans(Sample, max_clusters=10, plot_clusters = TRUE, max_iters = 1000)
## [1] 0.0000 0.2265 0.2106 0.1877 0.1411 0.1136 0.1260 0.1186 0.1107 0.1166
## attr(,"class")
## [1] "k-means clustering"
Here we have a similar situation: the plot shows that k=2 is optimal. Nevertheless k=3 and k=4 may also not be bad, as they have quite high values. Now we can move to the shadow statistic, which is also used to assess the optimal number of clusters; this time the lower the value, the better. I ran two versions, first with the "euclidean" distance and second with the "manhattan" distance measure.
d1<-cclust(Sample, 5, dist="euclidean")
shadow(d1)
## 1 2 3 4 5
## 0.8844 0.7492 0.8100 0.8274 0.8121
plot(shadow(d1))
d2<-cclust(Sample, 5, dist="manhattan")
shadow(d2)
## 1 2 3 4 5
## 0.8392 0.8002 0.8142 0.7941 0.7366
plot(shadow(d2))
The shadow statistic with the euclidean measure indicates k=2 as optimal, whereas with the manhattan distance other numbers of clusters do better (the values are similar and k=5 comes out best). Now let's look at a different method based on medoids, with the two distance measures "euclidean" and "manhattan".
opt4<-Optimal_Clusters_Medoids(Sample, 5, 'euclidean', plot_clusters=TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
opt4<-Optimal_Clusters_Medoids(Sample, 5, 'manhattan', plot_clusters=TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
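The Silhouette index mentioned earlier can also be inspected directly for k-means; a minimal sketch with factoextra (computing silhouettes needs a full distance matrix, so a smaller subsample is assumed here to keep it light):
set.seed(12444)
sil_sample <- Sample[sample(nrow(Sample), 2000), ]                  # subsample keeps the distance matrix small
fviz_nbclust(sil_sample, kmeans, method = "silhouette", k.max = 8)  # average silhouette width for each k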
Based on many different statistics we can see that there are several options to choose from. Knowing the data, we may expect just 2 clusters, positive and negative tweets, but that is not very informative. That is why I would like to see the clustering for k=2 just to make sure it looks as I expected. Then I will try k=3 and k=4, which may be less optimal but may give us some more insight into the data.
fviz_cluster(list(data=Sample, cluster=km2$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=Sample, cluster=km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=Sample, cluster=km4$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
Now let us do some post-hoc analysis for each solution. First, let's try to name each cluster based on what kind of observations it contains. We will also look at plots showing how far each observation is from its centroid.
km2$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.00268 -0.002497 0.01914 -0.3173 -0.0206 -0.2546
## 2 -0.02177 -0.040349 -0.03740 1.7095 0.1154 1.4419
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.304 -0.004688 -0.271 -0.04863 -0.01488 -0.2861 -0.003373 0.1574
## 2 1.597 0.028083 1.499 0.31667 0.10576 1.6536 0.052781 -0.9448
## month hour SentimentOverall
## 1 -0.01473 -0.002381 0.1471
## 2 0.02490 -0.008131 -0.8516
km3$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.005257 -0.0006652 0.28878 -0.3408 -0.1929 -0.3126
## 2 -0.007125 -0.0125843 -0.18067 0.1610 -0.3434 0.1157
## 3 -0.001694 -0.0105422 0.07587 0.1668 1.5466 0.3293
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.3310 -0.07994 -0.3153 -0.3455 -0.05196 -0.4434 0.1325 0.5935
## 2 0.1457 -0.46544 0.1224 -0.2655 -0.41216 0.2937 -0.5263 -0.6599
## 3 0.1566 1.72031 0.2822 1.6507 1.50806 0.0465 1.5194 0.8853
## month hour SentimentOverall
## 1 -0.02514 0.02129 0.6421
## 2 -0.03054 -0.03151 -0.6573
## 3 0.09912 0.04050 0.8145
km4$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.021302 -0.016251 -0.16066 -0.3162 -0.33565 -0.28981
## 2 -0.021160 -0.040349 -0.06762 1.8010 0.02253 1.51861
## 3 0.007418 -0.002137 0.17773 -0.1722 1.32192 -0.02993
## 4 0.014234 0.012897 0.17118 -0.3188 -0.35016 -0.26976
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.3079 -0.3941 -0.29059 -0.3452 -0.32817 -0.2842 -0.35107 -0.10486
## 2 1.7069 -0.1104 1.59937 0.2053 -0.01916 1.7207 -0.05795 -1.06310
## 3 -0.1635 1.6009 -0.03701 1.2455 1.40014 -0.2503 1.47336 1.09248
## 4 -0.3126 -0.3777 -0.31802 -0.3506 -0.37568 -0.2181 -0.36630 -0.06681
## month hour SentimentOverall
## 1 0.70599 0.0007237 -0.15775
## 2 0.03544 -0.0127952 -0.96857
## 3 0.07233 0.0400711 1.03907
## 4 -1.01193 -0.0297372 0.01001
d1<-cclust(Sample, 2, dist="euclidean")
stripes(d1)
d2<-cclust(Sample, 3, dist="euclidean")
stripes(d2)
d3<-cclust(Sample, 4, dist="euclidean")
stripes(d3)
For k=2 the second cluster has clearly higher values of the negative emotions and a lower overall sentiment, so we could call it the negative tweets cluster, whereas the first one is milder and slightly positive.
For k=3 the third cluster seems to be the cluster of positive tweets, as it has high values of the positive emotion variables (joy, anticipation, trust, surprise); the second cluster may be called negative, as it has higher values of the negative variables and negative sentiment; the first one could be called neutral-to-positive, as most of its emotion values are close to 0 while its sentiment is slightly positive. I think that k=3 shows the differences between groups better than k=2.
For k=4 the second and third clusters are again the negative and positive tweet groups. The first and fourth clusters are both fairly neutral in terms of emotions; judging by the centers they differ mainly in the month variable rather than in sentiment.
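To back up these labels, the cluster sizes and the average (standardized) overall sentiment per cluster can be tabulated; a minimal sketch for the k=3 solution:
table(km3$cluster)                                       # how many tweets fall into each cluster
aggregate(Sample$SentimentOverall,
          by = list(cluster = km3$cluster), FUN = mean)  # mean SentimentOverall per cluster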
Now we can move to the CLARA algorithm and see its results. We cannot use PAM, as the dataset is too big for that algorithm.
clara_2<-eclust(Sample, "clara", k=2)
clara_2$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.08097 -0.1167 -0.1673 -0.3756 1.0452 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## fear joy sadness surprise trust negative positive Sentiment month
## [1,] -0.404 1.3408 -0.3938 -0.4102 1.2635 -0.5879 0.5888 0.7129 0.2901
## [2,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.3844 0.2901
## hour SentimentOverall
## [1,] 1.2132 0.5635
## [2,] 0.1476 -0.4079
fviz_silhouette(clara_2)
## cluster size ave.sil.width
## 1 1 2592 0.08
## 2 2 7408 0.35
clara_3<-eclust(Sample, "clara", k=3)
clara_3$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.08097 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 2.6500 -0.333
## [3,] 0.08084 0.3055 -0.1673 2.1066 -0.5597 2.515
## fear joy sadness surprise trust negative positive Sentiment
## [1,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.3844
## [2,] -0.404 1.3408 -0.3938 1.9510 1.2635 -0.5879 1.8926 0.9065
## [3,] 1.917 -0.5097 2.0551 -0.4102 -0.5421 1.0239 -0.7151 -1.2235
## month hour SentimentOverall
## [1,] 0.2901 0.1476 -0.4079
## [2,] 0.2901 -0.3091 0.7350
## [3,] -0.9981 0.2998 -1.1508
fviz_silhouette(clara_3)
## cluster size ave.sil.width
## 1 1 7150 0.42
## 2 2 1685 0.13
## 3 3 1165 0.10
clara_4<-eclust(Sample, "clara", k=4)
clara_4$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.06941 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 2.6500 -0.333
## [3,] 0.08084 0.3055 -0.1673 2.1066 -0.5597 2.515
## [4,] 0.05773 0.9389 6.7578 -0.3756 1.0452 -0.333
## fear joy sadness surprise trust negative positive Sentiment
## [1,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.2553
## [2,] -0.404 1.3408 -0.3938 1.9510 1.2635 -0.5879 1.8926 0.9065
## [3,] 1.917 -0.5097 2.0551 -0.4102 -0.5421 1.0239 -0.7151 -1.2235
## [4,] 1.917 1.3408 -0.3938 1.9510 -0.5421 -0.5879 1.8926 4.2628
## month hour SentimentOverall
## [1,] 0.2901 -0.004638 -0.2936
## [2,] -0.9981 0.147598 0.7350
## [3,] -0.9981 0.299833 -1.1508
## [4,] 1.5782 0.452069 6.5046
fviz_silhouette(clara_4)
## cluster size ave.sil.width
## 1 1 7195 0.41
## 2 2 1559 0.16
## 3 3 1162 0.10
## 4 4 84 0.33
We can see that the CLARA algorithm gives results similar to k-means; the clusters generated by CLARA could be labelled more or less like the k-means ones.
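One quick way to check this similarity is to cross-tabulate the k-means and CLARA assignments on the same sample (a sketch; cluster numbers are arbitrary labels, so only the pattern of agreement matters - eclust exposes the assignment vector as $cluster):
table(kmeans = km3$cluster, clara = clara_3$cluster)   # agreement between the two 3-cluster solutions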
The second part of the project concerns dimensionality reduction. We will try to merge some of the variables into common components and get rid of those that do not contribute much to the variance of the data.
We start with the basic PCA algorithm to see whether the data can be reduced to a lower dimensionality. Knowing from the exploration at the beginning of the analysis that the variables are correlated with each other, we can expect that some groups of related variables can be reduced to single components.
Data.PCA <- prcomp(data.standarized, center = TRUE, scale = TRUE)
summary(Data.PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.02 1.866 1.386 1.1170 1.0004 0.9940 0.8602 0.8121
## Proportion of Variance 0.24 0.205 0.113 0.0734 0.0589 0.0581 0.0435 0.0388
## Cumulative Proportion 0.24 0.445 0.558 0.6314 0.6903 0.7484 0.7920 0.8308
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## Standard deviation 0.7450 0.6913 0.6761 0.626 0.5798 0.5735 0.4994 0.28486
## Proportion of Variance 0.0326 0.0281 0.0269 0.023 0.0198 0.0193 0.0147 0.00477
## Cumulative Proportion 0.8634 0.8915 0.9184 0.941 0.9612 0.9806 0.9952 1.00000
## PC17
## Standard deviation 0.000000000000104
## Proportion of Variance 0.000000000000000
## Cumulative Proportion 1.000000000000000
We can see that around 10 components explain almost 90% of the variance in the data. Now let's plot it, with a table of the cumulative variance explained underneath. Then we plot the eigenvalues to see which are greater than 1; only the components above that threshold will be used in the further analysis.
eig.val <- get_eigenvalue(Data.PCA)
eig.val_plot <- eig.val$variance.percent
eig.val_plot2 <- eig.val$eigenvalue
qplot(c(1:17),eig.val_plot) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot") +
ylim(0, 25)+xlim(1,10)
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 row(s) containing missing values (geom_path).
eig.val$cumulative.variance.percent
## [1] 24.01 44.50 55.81 63.14 69.03 74.84 79.20 83.08 86.34 89.15
## [11] 91.84 94.14 96.12 98.06 99.52 100.00 100.00
plot(summary(Data.PCA)$importance[3,],ylab="Cumulative variance")
qplot(c(1:17),eig.val_plot2) +
geom_line() +
xlab("Principal Component") +
ylab("Eigenvalue") +
ggtitle("Scree Plot") +
ylim(0, 5)+xlim(1,17) + geom_hline(aes(yintercept = 1), colour="red")
eig.val_plot2
## [1] 4.08184865134454000212826940697 3.48304354844858021778009060654
## [3] 1.92198698897711861555137602409 1.24771762413537623537251874950
## [5] 1.00080459421469813996452558058 0.98807962439654106745479111851
## [7] 0.73997232161528225802271663269 0.65953918239787645561023055052
## [9] 0.55499390636153178579803579851 0.47783394546824053961131539836
## [11] 0.45716876416437218777488737942 0.39141839750203621939306231070
## [13] 0.33612548659948815465980942463 0.32887482358300179630106185868
## [15] 0.24944853716893922168651442917 0.08114360362077405186553846761
## [17] 0.00000000000000000000000001085
Based on this analysis we should choose 5 components, which together explain around 70% of the overall variance. Now we can look at the reduced dimensions on the correlation circle; the darker the colour, the more important the variable. Let us compare the first dimension with all the others. We should also compare the remaining pairs, but that would become a big mess, so we focus only on comparisons with the 1st dimension.
fviz_pca_var(Data.PCA, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 3)) +
labs(title="Variables loadings for PC1 and PC3", x="PC1", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 4)) +
labs(title="Variables loadings for PC1 and PC4", x="PC1", y="PC4")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 5)) +
labs(title="Variables loadings for PC1 and PC5", x="PC1", y="PC5")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We can see that the negative and positive emotions point in opposite directions.
fviz_pca_ind(Data.PCA, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))
PC1 <- fviz_contrib(Data.PCA, choice = "var", axes = 1,fill = "#FF3366",color = "#FF3366")
PC2 <- fviz_contrib(Data.PCA, choice = "var", axes = 2,fill = "#FF3366",color = "#FF3366")
PC3 <- fviz_contrib(Data.PCA, choice = "var", axes = 3,fill = "#FF3366",color = "#FF3366")
PC4 <- fviz_contrib(Data.PCA, choice = "var", axes = 4,fill = "#FF3366",color = "#FF3366")
PC5 <- fviz_contrib(Data.PCA, choice = "var", axes = 5,fill = "#FF3366",color = "#FF3366")
PC1
PC2
PC3
PC4
PC5
As we can see, the first dimension is based on Sentiment and SentimentOverall (which is derived from Sentiment); moreover, the positive emotions also load on this dimension.
The 2nd dimension is constructed from the negative emotions.
The 3rd one reflects the popularity of the tweet.
The 4th is just the sentiment from text and emojis, without the emotions.
The 5th is the hour at which the tweet was posted.
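If this reduced representation were to be reused (for example to repeat the clustering on 5 components instead of 17 variables), the component scores can be taken straight from the prcomp object; a small sketch (the column labels are my interpretation from above, not anything returned by PCA):
pca_scores <- as.data.frame(Data.PCA$x[, 1:5])   # observations projected onto the first 5 components
colnames(pca_scores) <- c("TextSentiment", "NegativeEmotions",
                          "Popularity", "RawSentiment", "Hour")  # hypothetical, interpretation-based names
head(pca_scores)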
distance.m<-dist(t(data.standarized))
hc<-hclust(distance.m, method="complete")
plot(hc, hang=-1)
rect.hclust(hc, k = 5, border="red")
sub_grp<-cutree(hc, k=5,h=-1)
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("#339900", "#CC0066", "black", "blue", "yellow" ))
We can see the 5 groups marked with red rectangles. They are similar to the PCA results in terms of the main dimensions - positive emotions, negative emotions, when the tweet was posted and how popular it is.
I have also plotted the variables clustered with k = 5 and, as we can see, there are clusters of positive emotions, negative emotions, tweet popularity, and two related to the month and hour of posting.
It took a lot of time to convert the document-term matrix into the proper data structure (transaction data). Basically, the code below goes over each tweet and each word to check whether that word was used in the tweet; if so, it appends the word to a comma-separated list. It had to check 1132 words x 45,851 tweets = 51,903,332 cells. The data prepared this way was then used to mine association rules on the words from tweets. That is why I load the ready-made data here instead of re-running the conversion.
#words_df <- as.data.frame(matrix)
#tweet_words <-c()
#words<-data.frame(tweet_words=c(1:length(words_df)))
#i=1
#n=1
#for (i in 1:length(words_df)){
# tweet_words<-c()
# for (n in 1:nrow(words_df)){
# if(words_df[n,i]==0) next
# x<-row.names(words_df)[n]
# tweet_words<-append(tweet_words,x)
# }
# print(i)
# words$tweet_words[i]<-paste(tweet_words, collapse = ', ')
#}
#trans <- read.transactions("path to file with transacion data", format = "basket",
# header = TRUE, sep=",")
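An alternative that avoids the nested loop would be to coerce a binary tweet-word incidence matrix directly into transactions with arules; a sketch, left commented out like the block above because it was not the route actually used here:
#tweet_term <- t(as.matrix(dtm2) > 0)          # rows = tweets, columns = words, TRUE if the word occurs
#trans_alt <- as(tweet_term, "transactions")   # arules coercion from a logical incidence matrix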
Let's take a look at the first 100 tweets and see how many words each of them contains, to make sure the data was loaded correctly.
size(trans[1:100])
## [1] 7 3 2 9 12 4 5 3 4 12 8 4 3 7 6 4 7 6 5 3 6 6 3 3 5
## [26] 10 2 3 7 2 7 7 4 2 4 8 2 5 10 6 9 8 3 3 3 8 2 5 5 3
## [51] 3 1 2 7 7 7 9 8 4 6 2 8 5 6 6 7 7 6 2 8 7 5 10 7 5
## [76] 4 0 0 5 2 0 7 3 4 6 5 9 3 6 4 7 2 5 6 7 6 2 7 5 5
length(trans)
## [1] 45559
ifre<-itemFrequency(trans, type="relative")
sort(ifre,decreasing = TRUE)[1:100]
## cyberpunk cdpr game cdprojektred
## 0.671196 0.108080 0.101846 0.094866
## virtualphotography cyberpunkgame red psshare
## 0.060317 0.060164 0.058232 0.057069
## just projekt time like
## 0.052613 0.048772 0.046116 0.045787
## via night new city
## 0.042955 0.041419 0.040475 0.039004
## now gaming photomode vgpunite
## 0.038214 0.036458 0.036217 0.035010
## one patch good live
## 0.034022 0.030905 0.030334 0.029083
## will get still future
## 0.028798 0.028293 0.028227 0.026888
## cyberpunkphotomode can love fef
## 0.026230 0.025483 0.024035 0.023552
## inspired epic twitch judy
## 0.023223 0.023135 0.022915 0.022498
## games witcher enter explore
## 0.021972 0.021906 0.021642 0.021357
## razorsharp neojapanese see judyalvarez
## 0.021050 0.020940 0.020237 0.020062
## back got play really
## 0.019996 0.019864 0.019162 0.018701
## xboxshare playing travel people
## 0.018547 0.018350 0.018218 0.017647
## know update think nightcity
## 0.017560 0.017318 0.017274 0.016967
## youtube going made rcyberpunk
## 0.016879 0.016835 0.016572 0.015913
## come check first xbox
## 0.015782 0.015716 0.015650 0.015057
## part playstation johnny never
## 0.014596 0.014245 0.014004 0.013828
## day even rcyberpunkgame make
## 0.013740 0.013587 0.013477 0.013367
## stream today much project
## 0.013367 0.013345 0.013279 0.013016
## last best video want
## 0.012094 0.011941 0.011919 0.011897
## well worldofvp finally ign
## 0.011524 0.011414 0.011150 0.011106
## look say edge gamergram
## 0.011063 0.010667 0.010624 0.010316
## great amazing another work
## 0.010272 0.010185 0.010075 0.010053
## panam also take hope
## 0.009965 0.009877 0.009811 0.009746
## ever better dev right
## 0.009680 0.009548 0.009548 0.009438
ifre2<-itemFrequency(trans, type="absolute")
sort(ifre2,decreasing = TRUE)[1:100]
## cyberpunk cdpr game cdprojektred
## 30579 4924 4640 4322
## virtualphotography cyberpunkgame red psshare
## 2748 2741 2653 2600
## just projekt time like
## 2397 2222 2101 2086
## via night new city
## 1957 1887 1844 1777
## now gaming photomode vgpunite
## 1741 1661 1650 1595
## one patch good live
## 1550 1408 1382 1325
## will get still future
## 1312 1289 1286 1225
## cyberpunkphotomode can love fef
## 1195 1161 1095 1073
## inspired epic twitch judy
## 1058 1054 1044 1025
## games witcher enter explore
## 1001 998 986 973
## razorsharp neojapanese see judyalvarez
## 959 954 922 914
## back got play really
## 911 905 873 852
## xboxshare playing travel people
## 845 836 830 804
## know update think nightcity
## 800 789 787 773
## youtube going made rcyberpunk
## 769 767 755 725
## come check first xbox
## 719 716 713 686
## part playstation johnny never
## 665 649 638 630
## day even rcyberpunkgame make
## 626 619 614 609
## stream today much project
## 609 608 605 593
## last best video want
## 551 544 543 542
## well worldofvp finally ign
## 525 520 508 506
## look say edge gamergram
## 504 486 484 470
## great amazing another work
## 468 464 459 458
## panam also take hope
## 454 450 447 444
## ever better dev right
## 441 435 435 430
As we can see, the relative frequencies are rather small because we have a huge database. That is why we need to set rather small support levels, around 2%.
The first plot shows words that appear in more than 2.5% of tweets. The second shows the 25 most popular words across all tweets. The last visualization shows 1000 random tweets from the data on the y axis, and each dot indicates whether the word (on the x axis) occurs in the tweet. Based on the plot we can see some vertical lines, which means there are some quite frequent words in the tweets.
itemFrequencyPlot(trans, support = 0.025)
itemFrequencyPlot(trans, topN = 25)
image(sample(trans, 1000))
## APRIORI
Now let's take a look at the actual association rules, with the minimal support level set to 2% and the minimal confidence level set to 25%. The rules are then sorted by lift, confidence, support and count.
rules <- apriori(trans, parameter = list(support = 0.02, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.02 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 911
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [1044 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
arules::inspect(sort(rules, by = "lift")[1:15])
## lhs rhs support confidence
## [1] {explore, razorsharp} => {neojapanese} 0.02028 1
## [2] {enter, razorsharp} => {neojapanese} 0.02028 1
## [3] {inspired, razorsharp} => {neojapanese} 0.02028 1
## [4] {future, razorsharp} => {neojapanese} 0.02028 1
## [5] {explore, inspired} => {neojapanese} 0.02079 1
## [6] {epic, explore} => {neojapanese} 0.02070 1
## [7] {explore, future} => {neojapanese} 0.02079 1
## [8] {enter, inspired} => {neojapanese} 0.02090 1
## [9] {enter, epic} => {neojapanese} 0.02070 1
## [10] {enter, explore, razorsharp} => {neojapanese} 0.02028 1
## [11] {explore, inspired, razorsharp} => {neojapanese} 0.02028 1
## [12] {epic, explore, razorsharp} => {neojapanese} 0.02028 1
## [13] {explore, future, razorsharp} => {neojapanese} 0.02028 1
## [14] {cyberpunk, explore, razorsharp} => {neojapanese} 0.02028 1
## [15] {enter, inspired, razorsharp} => {neojapanese} 0.02028 1
## coverage lift count
## [1] 0.02028 47.76 924
## [2] 0.02028 47.76 924
## [3] 0.02028 47.76 924
## [4] 0.02028 47.76 924
## [5] 0.02079 47.76 947
## [6] 0.02070 47.76 943
## [7] 0.02079 47.76 947
## [8] 0.02090 47.76 952
## [9] 0.02070 47.76 943
## [10] 0.02028 47.76 924
## [11] 0.02028 47.76 924
## [12] 0.02028 47.76 924
## [13] 0.02028 47.76 924
## [14] 0.02028 47.76 924
## [15] 0.02028 47.76 924
arules::inspect(sort(rules, by = "confidence")[1:15])
## lhs rhs support confidence coverage
## [1] {razorsharp} => {cyberpunk} 0.02105 1 0.02105
## [2] {explore, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [3] {enter, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [4] {inspired, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [5] {future, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [6] {neojapanese, razorsharp} => {cyberpunk} 0.02030 1 0.02030
## [7] {explore, neojapanese} => {enter} 0.02079 1 0.02079
## [8] {explore, neojapanese} => {inspired} 0.02079 1 0.02079
## [9] {explore, inspired} => {neojapanese} 0.02079 1 0.02079
## [10] {epic, neojapanese} => {explore} 0.02070 1 0.02070
## [11] {epic, explore} => {neojapanese} 0.02070 1 0.02070
## [12] {explore, neojapanese} => {future} 0.02079 1 0.02079
## [13] {explore, future} => {neojapanese} 0.02079 1 0.02079
## [14] {inspired, neojapanese} => {enter} 0.02090 1 0.02090
## [15] {enter, inspired} => {neojapanese} 0.02090 1 0.02090
## lift count
## [1] 1.49 959
## [2] 47.76 924
## [3] 47.76 924
## [4] 47.76 924
## [5] 47.76 924
## [6] 1.49 925
## [7] 46.21 947
## [8] 43.06 947
## [9] 47.76 947
## [10] 46.82 943
## [11] 47.76 943
## [12] 37.19 947
## [13] 47.76 947
## [14] 46.21 952
## [15] 47.76 952
arules::inspect(sort(rules, by = "support")[1:15])
## lhs rhs support confidence coverage lift
## [1] {cdprojektred} => {cyberpunk} 0.07636 0.8050 0.09487 1.1993
## [2] {game} => {cyberpunk} 0.05641 0.5539 0.10185 0.8252
## [3] {psshare} => {cyberpunk} 0.05555 0.9735 0.05707 1.4503
## [4] {virtualphotography} => {cyberpunk} 0.05408 0.8967 0.06032 1.3359
## [5] {projekt} => {red} 0.04486 0.9199 0.04877 15.7970
## [6] {red} => {projekt} 0.04486 0.7704 0.05823 15.7970
## [7] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.06016 0.9235
## [8] {via} => {cyberpunk} 0.03584 0.8344 0.04296 1.2432
## [9] {time} => {cyberpunk} 0.03494 0.7577 0.04612 1.1289
## [10] {photomode} => {cyberpunk} 0.03448 0.9521 0.03622 1.4185
## [11] {night} => {cyberpunk} 0.03288 0.7939 0.04142 1.1827
## [12] {city} => {night} 0.03222 0.8261 0.03900 19.9453
## [13] {night} => {city} 0.03222 0.7780 0.04142 19.9453
## [14] {vgpunite} => {cyberpunk} 0.03106 0.8871 0.03501 1.3217
## [15] {city} => {cyberpunk} 0.03104 0.7957 0.03900 1.1855
## count
## [1] 3479
## [2] 2570
## [3] 2531
## [4] 2464
## [5] 2044
## [6] 2044
## [7] 1699
## [8] 1633
## [9] 1592
## [10] 1571
## [11] 1498
## [12] 1468
## [13] 1468
## [14] 1415
## [15] 1414
arules::inspect(sort(rules, by = "count")[1:15])
## lhs rhs support confidence coverage lift
## [1] {cdprojektred} => {cyberpunk} 0.07636 0.8050 0.09487 1.1993
## [2] {game} => {cyberpunk} 0.05641 0.5539 0.10185 0.8252
## [3] {psshare} => {cyberpunk} 0.05555 0.9735 0.05707 1.4503
## [4] {virtualphotography} => {cyberpunk} 0.05408 0.8967 0.06032 1.3359
## [5] {projekt} => {red} 0.04486 0.9199 0.04877 15.7970
## [6] {red} => {projekt} 0.04486 0.7704 0.05823 15.7970
## [7] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.06016 0.9235
## [8] {via} => {cyberpunk} 0.03584 0.8344 0.04296 1.2432
## [9] {time} => {cyberpunk} 0.03494 0.7577 0.04612 1.1289
## [10] {photomode} => {cyberpunk} 0.03448 0.9521 0.03622 1.4185
## [11] {night} => {cyberpunk} 0.03288 0.7939 0.04142 1.1827
## [12] {city} => {night} 0.03222 0.8261 0.03900 19.9453
## [13] {night} => {city} 0.03222 0.7780 0.04142 19.9453
## [14] {vgpunite} => {cyberpunk} 0.03106 0.8871 0.03501 1.3217
## [15] {city} => {cyberpunk} 0.03104 0.7957 0.03900 1.1855
## count
## [1] 3479
## [2] 2570
## [3] 2531
## [4] 2464
## [5] 2044
## [6] 2044
## [7] 1699
## [8] 1633
## [9] 1592
## [10] 1571
## [11] 1498
## [12] 1468
## [13] 1468
## [14] 1415
## [15] 1414
The rules with the highest lift and confidence are mostly connected with the word "neojapanese" - there are some words that very often appear together with it.
The rules with the highest support and count focus more on Cyberpunk (the game made by CDPR). We can find some common word associations, for example projekt -> red, city -> night, etc.
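These associations are easier to read as a graph; a sketch with arulesViz, taking the 20 rules with the highest lift:
plot(sort(rules, by = "lift")[1:20], method = "graph")   # draw the top rules as an item/rule network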
I have also checked some rules to see which words lead to some chosen words on the right-hand side (xbox, sony, witcher, love, hate).
rules.xbox<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="xbox"), control=list(verbose=F))
rules.xbox.byconf<-sort(rules.xbox, by="confidence", decreasing=TRUE)
arules::inspect(rules.xbox.byconf[1:10])
## lhs rhs support confidence coverage lift
## [1] {xboxseriesx} => {xbox} 0.002173 0.31132 0.006980 20.676
## [2] {cyberpunk, xboxseriesx} => {xbox} 0.001954 0.30479 0.006409 20.242
## [3] {series} => {xbox} 0.001383 0.24324 0.005685 16.154
## [4] {cyberpunk, one} => {xbox} 0.002744 0.13951 0.019667 9.265
## [5] {playstation} => {xbox} 0.001405 0.09861 0.014245 6.549
## [6] {one} => {xbox} 0.003292 0.09677 0.034022 6.427
## [7] {cyberpunk, part} => {xbox} 0.001032 0.08918 0.011567 5.923
## [8] {cyberpunk, xboxshare} => {xbox} 0.001427 0.07711 0.018503 5.121
## [9] {xboxshare} => {xbox} 0.001427 0.07692 0.018547 5.109
## [10] {part} => {xbox} 0.001097 0.07519 0.014596 4.993
## count
## [1] 99
## [2] 89
## [3] 63
## [4] 125
## [5] 64
## [6] 150
## [7] 47
## [8] 65
## [9] 65
## [10] 50
rules.sony<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="sony"), control=list(verbose=F))
rules.sony.byconf<-sort(rules.sony, by="confidence", decreasing=TRUE)
arules::inspect(rules.sony.byconf[1:2])
## lhs rhs support confidence coverage lift count
## [1] {playstation} => {sony} 0.001778 0.12481 0.01425 23.209 81
## [2] {cdpr} => {sony} 0.001756 0.01625 0.10808 3.021 80
rules.witcher<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="witcher"), control=list(verbose=F))
rules.witcher.byconf<-sort(rules.witcher, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.witcher.byconf))
## lhs rhs support confidence coverage lift
## [1] {following, workplace} => {witcher} 0.001097 1 0.001097 45.65
## [2] {projekt, workplace} => {witcher} 0.001251 1 0.001251 45.65
## [3] {hunt, thewitcherwildhunt} => {witcher} 0.001119 1 0.001119 45.65
## [4] {thewitcherwildhunt, wild} => {witcher} 0.001119 1 0.001119 45.65
## [5] {thewitcher, wild} => {witcher} 0.001163 1 0.001163 45.65
## [6] {play, wild} => {witcher} 0.001010 1 0.001010 45.65
## count
## [1] 50
## [2] 57
## [3] 51
## [4] 51
## [5] 53
## [6] 46
rules.love<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="love"), control=list(verbose=F))
rules.love.byconf<-sort(rules.love, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.love.byconf))
## lhs rhs support confidence coverage lift count
## [1] {cyberpunk, much} => {love} 0.001163 0.17608 0.006607 7.326 53
## [2] {much} => {love} 0.001866 0.14050 0.013279 5.846 85
## [3] {really} => {love} 0.001032 0.05516 0.018701 2.295 47
## [4] {fef} => {love} 0.001251 0.05312 0.023552 2.210 57
## [5] {cyberpunk, game} => {love} 0.002963 0.05253 0.056410 2.186 135
## [6] {judy} => {love} 0.001054 0.04683 0.022498 1.948 48
rules.hate<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="hate"), control=list(verbose=F))
rules.hate.byconf<-sort(rules.hate, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.hate.byconf))
## lhs rhs support confidence coverage lift count
## [1] {cdpr} => {hate} 0.001141 0.010561 0.1081 2.781 52
## [2] {} => {hate} 0.003797 0.003797 1.0000 1.000 173
Now I have checked the rules in the other direction, to see which words could follow if I write one of the chosen words (xbox, sony, witcher, love, hate).
rules.xbox.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="xbox"), control=list(verbose=F))
rules.xbox.byconf<-sort(rules.xbox.r, by="confidence", decreasing=TRUE)
arules::inspect(rules.xbox.byconf[1:10])
## lhs rhs support confidence coverage lift count
## [1] {xbox} => {cyberpunk} 0.011611 0.77114 0.01506 1.1489 529
## [2] {} => {cyberpunk} 0.671196 0.67120 1.00000 1.0000 30579
## [3] {xbox} => {one} 0.003292 0.21866 0.01506 6.4270 150
## [4] {xbox} => {cdprojektred} 0.002371 0.15743 0.01506 1.6595 108
## [5] {xbox} => {xboxseriesx} 0.002173 0.14431 0.01506 20.6756 99
## [6] {} => {cdpr} 0.108080 0.10808 1.00000 1.0000 4924
## [7] {} => {game} 0.101846 0.10185 1.00000 1.0000 4640
## [8] {xbox} => {game} 0.001493 0.09913 0.01506 0.9733 68
## [9] {} => {cdprojektred} 0.094866 0.09487 1.00000 1.0000 4322
## [10] {xbox} => {xboxshare} 0.001427 0.09475 0.01506 5.1087 65
rules.sony.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="sony"), control=list(verbose=F))
rules.sony.byconf<-sort(rules.sony.r, by="confidence", decreasing=TRUE)
arules::inspect(rules.sony.byconf[1:5])
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.6712 1.000000 1.000 30579
## [2] {sony} => {playstation} 0.001778 0.3306 0.005378 23.209 81
## [3] {sony} => {cdpr} 0.001756 0.3265 0.005378 3.021 80
## [4] {sony} => {cyberpunk} 0.001624 0.3020 0.005378 0.450 74
## [5] {} => {cdpr} 0.108080 0.1081 1.000000 1.000 4924
rules.witcher.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="witcher"), control=list(verbose=F))
rules.witcher.byconf<-sort(rules.witcher.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.witcher.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.6712 1.00000 1.0000 30579
## [2] {witcher} => {cdpr} 0.006629 0.3026 0.02191 2.7998 302
## [3] {witcher} => {red} 0.005904 0.2695 0.02191 4.6287 269
## [4] {witcher} => {projekt} 0.005663 0.2585 0.02191 5.3005 258
## [5] {witcher} => {cyberpunk} 0.004170 0.1904 0.02191 0.2836 190
## [6] {witcher} => {cdprojektred} 0.003753 0.1713 0.02191 1.8062 171
rules.love.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="love"), control=list(verbose=F))
rules.love.byconf<-sort(rules.love.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.love.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.67120 1.00000 1.0000 30579
## [2] {love} => {cyberpunk} 0.016001 0.66575 0.02403 0.9919 729
## [3] {love} => {game} 0.004258 0.17717 0.02403 1.7396 194
## [4] {} => {cdpr} 0.108080 0.10808 1.00000 1.0000 4924
## [5] {} => {game} 0.101846 0.10185 1.00000 1.0000 4640
## [6] {love} => {cdpr} 0.002283 0.09498 0.02403 0.8788 104
rules.hate.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="hate"), control=list(verbose=F))
rules.hate.byconf<-sort(rules.hate.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.hate.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.67120 1.000000 1.0000 30579
## [2] {hate} => {cyberpunk} 0.001383 0.36416 0.003797 0.5426 63
## [3] {hate} => {cdpr} 0.001141 0.30058 0.003797 2.7811 52
## [4] {} => {cdpr} 0.108080 0.10808 1.000000 1.0000 4924
## [5] {} => {game} 0.101846 0.10185 1.000000 1.0000 4640
## [6] {} => {cdprojektred} 0.094866 0.09487 1.000000 1.0000 4322
plot(rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{cyberpunk,enter,epic,explore,future,inspired,razorsharp}"
## [2] "{cyberpunk,enter,epic,future,inspired,razorsharp}"
## [3] "{cyberpunk,enter,epic,explore,future,inspired}"
## [4] "{cyberpunk,enter,epic,future,inspired}"
## [5] "{cyberpunk,epic,explore,future,inspired,razorsharp}"
## [6] "{cyberpunk,epic,future,inspired,razorsharp}"
## [7] "{cyberpunk,epic,explore,future,inspired}"
## [8] "{cyberpunk,epic,future,inspired}"
## [9] "{cyberpunk,enter,epic,future,inspired,neojapanese,razorsharp}"
## [10] "{cyberpunk,enter,epic,future,inspired,neojapanese}"
## [11] "{cyberpunk,enter,epic,explore,future,inspired,neojapanese}"
## [12] "{cyberpunk,epic,future,inspired,neojapanese}"
## [13] "{cyberpunk,epic,future,inspired,neojapanese,razorsharp}"
## [14] "{cyberpunk,epic,explore,future,inspired,neojapanese}"
## [15] "{cyberpunk,epic,explore,future,inspired,neojapanese,razorsharp}"
## [16] "{cyberpunk,future,inspired}"
## [17] "{cyberpunk,enter,future,inspired}"
## [18] "{cyberpunk,enter,epic,future}"
## [19] "{cyberpunk,epic,future}"
## [20] "{cyberpunk,future,inspired,razorsharp}"
## [21] "{cyberpunk,epic,future,razorsharp}"
## [22] "{cyberpunk,enter,future,inspired,razorsharp}"
## [23] "{cyberpunk,explore,future,inspired}"
## [24] "{cyberpunk,epic,explore,future}"
## [25] "{cyberpunk,enter,epic,future,razorsharp}"
## [26] "{cyberpunk,enter,explore,future,inspired}"
## [27] "{cyberpunk,enter,epic,explore,future}"
## [28] "{cyberpunk,explore,future,inspired,razorsharp}"
## [29] "{cyberpunk,future,inspired,neojapanese}"
## [30] "{cyberpunk,epic,explore,future,razorsharp}"
## [31] "{cyberpunk,epic,future,neojapanese}"
## [32] "{cyberpunk,enter,future,inspired,neojapanese}"
## [33] "{cyberpunk,enter,explore,future,inspired,razorsharp}"
## [34] "{cyberpunk,enter,epic,future,neojapanese}"
## [35] "{cyberpunk,future,inspired,neojapanese,razorsharp}"
## [36] "{cyberpunk,enter,future}"
## [37] "{cyberpunk,future,razorsharp}"
## [38] "{cyberpunk,enter,epic,explore,future,razorsharp}"
## [39] "{cyberpunk,epic,future,neojapanese,razorsharp}"
## [40] "{cyberpunk,explore,future}"
## [41] "{cyberpunk,explore,future,inspired,neojapanese}"
## [42] "{cyberpunk,epic,explore,future,neojapanese}"
## [43] "{cyberpunk,enter,future,razorsharp}"
## [44] "{cyberpunk,future,neojapanese}"
## [45] "{cyberpunk,enter,explore,future}"
## [46] "{cyberpunk,explore,future,razorsharp}"
## [47] "{cyberpunk,enter,future,inspired,neojapanese,razorsharp}"
## [48] "{cyberpunk,enter,epic,future,neojapanese,razorsharp}"
## [49] "{cyberpunk,enter,future,neojapanese}"
## [50] "{cyberpunk,epic,inspired}"
## [51] "{cyberpunk,enter,explore,future,inspired,neojapanese}"
## [52] "{cyberpunk,future,neojapanese,razorsharp}"
## [53] "{cyberpunk,enter,epic,explore,future,neojapanese}"
## [54] "{cyberpunk,explore,future,neojapanese}"
## [55] "{cyberpunk,explore,future,inspired,neojapanese,razorsharp}"
## [56] "{cyberpunk,enter,explore,future,razorsharp}"
## [57] "{cyberpunk,epic,explore,future,neojapanese,razorsharp}"
## [58] "{cyberpunk,enter,epic,inspired}"
## [59] "{cyberpunk,epic,inspired,razorsharp}"
## [60] "{cyberpunk,epic,explore,inspired}"
## [61] "{cyberpunk,enter,future,neojapanese,razorsharp}"
## [62] "{cyberpunk,enter,inspired}"
## [63] "{cyberpunk,enter,epic}"
## [64] "{cyberpunk,enter,explore,future,neojapanese}"
## [65] "{cyberpunk,inspired,razorsharp}"
## [66] "{cyberpunk,epic,inspired,neojapanese}"
## [67] "{cyberpunk,explore,inspired}"
## [68] "{cyberpunk,explore,future,neojapanese,razorsharp}"
## [69] "{cyberpunk,epic,explore}"
## [70] "{cyberpunk,inspired,neojapanese}"
## [71] "{cyberpunk,epic,neojapanese}"
## [72] "{cyberpunk,enter,epic,inspired,razorsharp}"
## [73] "{cyberpunk,enter,epic,explore,inspired}"
## [74] "{cyberpunk,neojapanese}"
## [75] "{cyberpunk,enter,inspired,razorsharp}"
## [76] "{cyberpunk,epic,explore,inspired,razorsharp}"
## [77] "{cyberpunk,enter,epic,razorsharp}"
## [78] "{cyberpunk,enter,explore,inspired}"
## [79] "{cyberpunk,enter,epic,explore}"
## [80] "{cyberpunk,enter,razorsharp}"
## [81] "{cyberpunk,explore,inspired,razorsharp}"
## [82] "{cyberpunk,epic,explore,razorsharp}"
## [83] "{cyberpunk,enter,epic,inspired,neojapanese}"
## [84] "{cyberpunk,enter,explore}"
## [85] "{cyberpunk,explore,razorsharp}"
## [86] "{cyberpunk,enter,inspired,neojapanese}"
## [87] "{cyberpunk,epic,inspired,neojapanese,razorsharp}"
## [88] "{cyberpunk,enter,epic,neojapanese}"
## [89] "{cyberpunk,inspired,neojapanese,razorsharp}"
## [90] "{cyberpunk,enter,neojapanese}"
## [91] "{cyberpunk,epic,neojapanese,razorsharp}"
## [92] "{cyberpunk,epic,explore,inspired,neojapanese}"
## [93] "{cyberpunk,epic,razorsharp}"
## [94] "{cyberpunk,explore,inspired,neojapanese}"
## [95] "{cyberpunk,explore}"
## [96] "{cyberpunk,neojapanese,razorsharp}"
## [97] "{cyberpunk,epic,explore,neojapanese}"
## [98] "{cyberpunk,enter}"
## [99] "{cyberpunk,explore,neojapanese}"
## [100] "{cyberpunk,enter,explore,future,inspired,neojapanese,razorsharp}"
## [101] "{cyberpunk,enter,explore,future,neojapanese,razorsharp}"
## [102] "{cyberpunk,enter,epic,explore,future,neojapanese,razorsharp}"
## [103] "{cyberpunk,enter,explore,razorsharp}"
## [104] "{cyberpunk,enter,explore,inspired,razorsharp}"
## [105] "{cyberpunk,enter,epic,explore,razorsharp}"
## [106] "{cyberpunk,razorsharp}"
## [107] "{cyberpunk,enter,neojapanese,razorsharp}"
## [108] "{cyberpunk,enter,explore,neojapanese}"
## [109] "{cyberpunk,enter,epic,explore,inspired,razorsharp}"
## [110] "{cyberpunk,explore,neojapanese,razorsharp}"
## [111] "{cyberpunk,enter,inspired,neojapanese,razorsharp}"
## [112] "{cyberpunk,enter,epic,neojapanese,razorsharp}"
## [113] "{cyberpunk,enter,explore,inspired,neojapanese}"
## [114] "{cyberpunk,enter,epic,explore,neojapanese}"
## [115] "{cyberpunk,explore,inspired,neojapanese,razorsharp}"
## [116] "{cyberpunk,epic,explore,neojapanese,razorsharp}"
## [117] "{cyberpunk,enter,epic,inspired,neojapanese,razorsharp}"
## [118] "{cyberpunk,enter,epic,explore,inspired,neojapanese}"
## [119] "{cyberpunk,epic}"
## [120] "{cyberpunk,epic,explore,inspired,neojapanese,razorsharp}"
## [121] "{cyberpunk,inspired}"
## [122] "{cyberpunk,enter,explore,neojapanese,razorsharp}"
## [123] "{cyberpunk,enter,explore,inspired,neojapanese,razorsharp}"
## [124] "{cyberpunk,enter,epic,explore,neojapanese,razorsharp}"
## [125] "{epic,future}"
## [126] "{cyberpunk,future}"
## [127] "{future,inspired}"
## [128] "{future,razorsharp}"
## [129] "{explore,future}"
## [130] "{enter,future}"
## [131] "{epic,future,inspired}"
## [132] "{future,neojapanese}"
## [133] "{epic,inspired}"
## [134] "{neojapanese}"
## [135] "{cyberpunk,enter,epic,explore,inspired,neojapanese,razorsharp}"
## [136] "{enter,epic}"
## [137] "{enter,epic,future}"
## [138] "{inspired,razorsharp}"
## [139] "{future,inspired,razorsharp}"
## [140] "{epic,future,razorsharp}"
## [141] "{epic,explore}"
## [142] "{epic,explore,future}"
## [143] "{explore,inspired}"
## [144] "{enter,future,inspired}"
## [145] "{enter,inspired}"
## [146] "{explore,future,inspired}"
## [147] "{epic,neojapanese}"
## [148] "{explore}"
## [149] "{epic,future,neojapanese}"
## [150] "{razorsharp}"
## [151] "{inspired,neojapanese}"
## [152] "{future,inspired,neojapanese}"
## [153] "{enter,razorsharp}"
## [154] "{enter}"
## [155] "{explore,razorsharp}"
## [156] "{enter,future,razorsharp}"
## [157] "{enter,explore}"
## [158] "{explore,future,razorsharp}"
## [159] "{enter,explore,future}"
## [160] "{epic,razorsharp}"
## [161] "{neojapanese,razorsharp}"
## [162] "{explore,neojapanese}"
## [163] "{future,neojapanese,razorsharp}"
## [164] "{enter,neojapanese}"
## [165] "{enter,future,neojapanese}"
## [166] "{explore,future,neojapanese}"
## [167] "{enter,epic,inspired}"
## [168] "{epic,inspired,razorsharp}"
## [169] "{epic,explore,inspired}"
## [170] "{enter,epic,future,inspired}"
## [171] "{epic,inspired,neojapanese}"
## [172] "{epic,future,inspired,razorsharp}"
## [173] "{epic,explore,future,inspired}"
## [174] "{enter,inspired,razorsharp}"
## [175] "{epic,future,inspired,neojapanese}"
## [176] "{enter,epic,razorsharp}"
## [177] "{enter,epic,explore}"
## [178] "{explore,inspired,razorsharp}"
## [179] "{enter,explore,inspired}"
## [180] "{epic,explore,razorsharp}"
## [181] "{enter,epic,neojapanese}"
## [182] "{inspired,neojapanese,razorsharp}"
## [183] "{epic,neojapanese,razorsharp}"
## [184] "{epic,explore,neojapanese}"
## [185] "{explore,inspired,neojapanese}"
## [186] "{enter,future,inspired,razorsharp}"
## [187] "{enter,inspired,neojapanese}"
## [188] "{enter,epic,future,razorsharp}"
## [189] "{enter,epic,explore,future}"
## [190] "{explore,future,inspired,razorsharp}"
## [191] "{enter,explore,future,inspired}"
## [192] "{epic,explore,future,razorsharp}"
## [193] "{epic}"
## [194] "{enter,explore,razorsharp}"
## [195] "{inspired}"
## [196] "{enter,epic,future,neojapanese}"
## [197] "{future,inspired,neojapanese,razorsharp}"
## [198] "{epic,future,neojapanese,razorsharp}"
## [199] "{enter,neojapanese,razorsharp}"
## [200] "{epic,explore,future,neojapanese}"
## [201] "{enter,future,inspired,neojapanese}"
## [202] "{explore,future,inspired,neojapanese}"
## [203] "{explore,neojapanese,razorsharp}"
## [204] "{enter,explore,neojapanese}"
## [205] "{enter,explore,future,razorsharp}"
## [206] "{enter,future,neojapanese,razorsharp}"
## [207] "{explore,future,neojapanese,razorsharp}"
## [208] "{enter,explore,future,neojapanese}"
## [209] "{enter,epic,inspired,razorsharp}"
## [210] "{enter,epic,explore,inspired}"
## [211] "{epic,explore,inspired,razorsharp}"
## [212] "{enter,epic,inspired,neojapanese}"
## [213] "{epic,inspired,neojapanese,razorsharp}"
## [214] "{epic,explore,inspired,neojapanese}"
## [215] "{enter,explore,inspired,razorsharp}"
## [216] "{enter,epic,explore,razorsharp}"
## [217] "{enter,inspired,neojapanese,razorsharp}"
## [218] "{enter,epic,neojapanese,razorsharp}"
## [219] "{enter,epic,explore,neojapanese}"
## [220] "{explore,inspired,neojapanese,razorsharp}"
## [221] "{enter,epic,future,inspired,razorsharp}"
## [222] "{enter,explore,inspired,neojapanese}"
## [223] "{epic,explore,neojapanese,razorsharp}"
## [224] "{enter,epic,explore,future,inspired}"
## [225] "{epic,explore,future,inspired,razorsharp}"
## [226] "{enter,epic,future,inspired,neojapanese}"
## [227] "{epic,future,inspired,neojapanese,razorsharp}"
## [228] "{epic,explore,future,inspired,neojapanese}"
## [229] "{enter,explore,neojapanese,razorsharp}"
## [230] "{enter,explore,future,inspired,razorsharp}"
## [231] "{enter,epic,explore,future,razorsharp}"
## [232] "{enter,future,inspired,neojapanese,razorsharp}"
## [233] "{enter,epic,future,neojapanese,razorsharp}"
## [234] "{future}"
## [235] "{enter,epic,explore,future,neojapanese}"
## [236] "{explore,future,inspired,neojapanese,razorsharp}"
## [237] "{enter,explore,future,inspired,neojapanese}"
## [238] "{epic,explore,future,neojapanese,razorsharp}"
## [239] "{enter,explore,future,neojapanese,razorsharp}"
## [240] "{enter,epic,explore,inspired,razorsharp}"
## [241] "{enter,epic,inspired,neojapanese,razorsharp}"
## [242] "{enter,epic,explore,inspired,neojapanese}"
## [243] "{epic,explore,inspired,neojapanese,razorsharp}"
## [244] "{enter,explore,inspired,neojapanese,razorsharp}"
## [245] "{enter,epic,explore,neojapanese,razorsharp}"
## [246] "{enter,epic,explore,future,inspired,razorsharp}"
## [247] "{enter,epic,future,inspired,neojapanese,razorsharp}"
## [248] "{enter,epic,explore,future,inspired,neojapanese}"
## [249] "{epic,explore,future,inspired,neojapanese,razorsharp}"
## [250] "{enter,explore,future,inspired,neojapanese,razorsharp}"
## [251] "{enter,epic,explore,future,neojapanese,razorsharp}"
## [252] "{cyberpunk,night}"
## [253] "{city,cyberpunk}"
## [254] "{enter,epic,explore,inspired,neojapanese,razorsharp}"
## [255] "{cyberpunk,red}"
## [256] "{cyberpunk,projekt}"
## [257] "{city}"
## [258] "{night}"
## [259] "{projekt}"
## [260] "{red}"
## [261] "{vgpunite}"
## [262] "{virtualphotography}"
## [263] "{cyberpunkgame}"
## [264] "{enter,epic,explore,future,inspired,neojapanese,razorsharp}"
## [265] "{cyberpunkphotomode}"
## [266] "{psshare}"
## [267] "{photomode}"
## [268] "{via}"
## [269] "{gaming}"
## [270] "{live}"
## [271] "{patch}"
## [272] "{cdprojektred}"
## [273] "{city,night}"
## [274] "{time}"
## [275] "{now}"
## [276] "{new}"
## [277] "{game}"
## [278] "{just}"
## [279] "{like}"
## [280] "{projekt,red}"
## Itemsets in Consequent (RHS)
## [1] "{cyberpunk}" "{cdprojektred}" "{vgpunite}"
## [4] "{virtualphotography}" "{red}" "{projekt}"
## [7] "{night}" "{city}" "{future}"
## [10] "{inspired}" "{epic}" "{enter}"
## [13] "{razorsharp}" "{explore}" "{neojapanese}"
plot(rules, measure=c("support","lift"), shading="confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
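As the message above suggests, the jitter added to reduce overplotting can be disabled; a minimal variant of the same scatter plot, assuming the same rules object, would be:
# Same support/lift scatter plot as above, but with jittering turned off,
# as suggested by the arulesViz message.
plot(rules, measure=c("support","lift"), shading="confidence", jitter=0)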
plot(rules, method="grouped")
Let us look at the graph representation of the rules for the previously selected keywords. It makes it much easier to observe how the words are connected. Note that the ggplot2 engine used by arulesViz does not recognize the main argument, which is why the warning below appears; the titles in the calls are kept only for readability.
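For reference, a minimal sketch of how such a keyword-specific subset can be built from the full rule set (illustration only; the actual rules.xbox, rules.sony, rules.witcher, rules.love and rules.hate objects were created earlier in the project, and rules.xbox.sketch below is just a placeholder name):
# Keep only rules whose antecedent or consequent contains the keyword "xbox"
# (partial item matching with %pin%), then order them by lift and preview
# the strongest ones.
rules.xbox.sketch <- subset(rules, subset = lhs %pin% "xbox" | rhs %pin% "xbox")
rules.xbox.sketch <- sort(rules.xbox.sketch, by = "lift")
arules::inspect(head(rules.xbox.sketch, 5))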
plot(rules.xbox, method="graph",main="xbox")
## Warning: Unknown control parameters: main
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
plot(rules.sony, method="graph",main="sony")
plot(rules.witcher, method="graph",main="witcher")
plot(rules.love, method="graph",main="love")
plot(rules.hate, method="graph",main="hate")
plot(rules.xbox, method="paracoord", control=list(reorder=TRUE),main="xbox")
plot(rules.sony, method="paracoord", control=list(reorder=TRUE),main="sony")
plot(rules.witcher, method="paracoord", control=list(reorder=TRUE),main="witcher")
plot(rules.love, method="paracoord", control=list(reorder=TRUE),main="love")
plot(rules.hate, method="paracoord", control=list(reorder=TRUE),main="hate")
Now let us turn to the ECLAT algorithm and the rules induced from its frequent itemsets. As we will see, the outcome is quite similar to the previous analysis; a short check of this overlap is sketched after the first rule induction below.
freq.items<-eclat(trans, parameter=list(supp=0.02, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.02 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 911
##
## create itemset ...
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating sparse bit matrix ... [44 row(s), 45559 column(s)] done [0.00s].
## writing ... [320 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
freq.rules<-ruleInduction(freq.items, trans, confidence=0.5)
freq.rules
## set of 1038 rules
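To back the claim above that ECLAT leads to a similar outcome, one rough check (a sketch, assuming both the rules object from the earlier Apriori run and freq.rules are still in the workspace) is to count how many rule labels the two sets have in common:
# Compare the ECLAT-induced rules with the Apriori rules by their labels
# (character strings of the form "{lhs} => {rhs}"); the number of shared
# labels gives a rough measure of the overlap between the two rule sets.
shared <- intersect(labels(freq.rules), labels(rules))
length(shared)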
freq.items<-eclat(trans, parameter=list(supp=0.03, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 1366
##
## create itemset ...
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating sparse bit matrix ... [23 row(s), 45559 column(s)] done [0.00s].
## writing ... [37 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
arules::inspect(freq.items)
## items support count
## [1] {cyberpunk, gaming} 0.03033 1382
## [2] {cyberpunk, photomode} 0.03448 1571
## [3] {cyberpunk, vgpunite} 0.03106 1415
## [4] {cyberpunk, via} 0.03584 1633
## [5] {city, cyberpunk} 0.03104 1414
## [6] {city, night} 0.03222 1468
## [7] {cyberpunk, psshare} 0.05555 2531
## [8] {cyberpunk, night} 0.03288 1498
## [9] {projekt, red} 0.04486 2044
## [10] {cyberpunk, cyberpunkgame} 0.03729 1699
## [11] {cyberpunk, virtualphotography} 0.05408 2464
## [12] {cyberpunk, time} 0.03494 1592
## [13] {cdprojektred, cyberpunk} 0.07636 3479
## [14] {cyberpunk, game} 0.05641 2570
## [15] {cyberpunk} 0.67120 30579
## [16] {game} 0.10185 4640
## [17] {cdpr} 0.10808 4924
## [18] {cdprojektred} 0.09487 4322
## [19] {red} 0.05823 2653
## [20] {time} 0.04612 2101
## [21] {virtualphotography} 0.06032 2748
## [22] {just} 0.05261 2397
## [23] {cyberpunkgame} 0.06016 2741
## [24] {projekt} 0.04877 2222
## [25] {like} 0.04579 2086
## [26] {new} 0.04047 1844
## [27] {night} 0.04142 1887
## [28] {psshare} 0.05707 2600
## [29] {now} 0.03821 1741
## [30] {city} 0.03900 1777
## [31] {via} 0.04296 1957
## [32] {vgpunite} 0.03501 1595
## [33] {one} 0.03402 1550
## [34] {photomode} 0.03622 1650
## [35] {good} 0.03033 1382
## [36] {gaming} 0.03646 1661
## [37] {patch} 0.03090 1408
freq.rules<-ruleInduction(freq.items, trans, confidence=0.1)
freq.rules
## set of 17 rules
arules::inspect(freq.rules)
## lhs rhs support confidence lift itemset
## [1] {gaming} => {cyberpunk} 0.03033 0.8320 1.2396 1
## [2] {photomode} => {cyberpunk} 0.03448 0.9521 1.4185 2
## [3] {vgpunite} => {cyberpunk} 0.03106 0.8871 1.3217 3
## [4] {via} => {cyberpunk} 0.03584 0.8344 1.2432 4
## [5] {city} => {cyberpunk} 0.03104 0.7957 1.1855 5
## [6] {night} => {city} 0.03222 0.7780 19.9453 6
## [7] {city} => {night} 0.03222 0.8261 19.9453 6
## [8] {psshare} => {cyberpunk} 0.05555 0.9735 1.4503 7
## [9] {night} => {cyberpunk} 0.03288 0.7939 1.1827 8
## [10] {red} => {projekt} 0.04486 0.7704 15.7970 9
## [11] {projekt} => {red} 0.04486 0.9199 15.7970 9
## [12] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.9235 10
## [13] {virtualphotography} => {cyberpunk} 0.05408 0.8967 1.3359 11
## [14] {time} => {cyberpunk} 0.03494 0.7577 1.1289 12
## [15] {cyberpunk} => {cdprojektred} 0.07636 0.1138 1.1993 13
## [16] {cdprojektred} => {cyberpunk} 0.07636 0.8050 1.1993 13
## [17] {game} => {cyberpunk} 0.05641 0.5539 0.8252 14
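As a quick sanity check of the measures reported above, lift can be recovered from the printed supports. For the rule {night} => {city}, confidence is supp({city,night}) divided by supp({night}), and lift is that confidence divided by supp({city}):
# Recompute confidence and lift for {night} => {city} from the supports
# printed above; the results should match the table (about 0.778 and 19.9).
conf_night_city <- 0.03222 / 0.04142           # supp({city,night}) / supp({night})
lift_night_city <- conf_night_city / 0.03900   # divided by supp({city})
c(confidence = conf_night_city, lift = lift_night_city)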
To conclude, this project covered three kinds of analyses using Unsupervised Learning algorithms: first clustering, then dimensionality reduction and, last but not least, association rules. These tools are powerful for mining information from unlabeled data such as tweets. Based on such analyses one could target specific groups of people or choose the right words to get the best response. I think the project fulfilled its goal of presenting Unsupervised Learning algorithms on real data and showing how we can take advantage of them to gain additional knowledge.
List of emojis downloaded from: https://unicode.org/emoji/charts/full-emoji-list.html
Sentiment of emojis: https://www.clarin.si/repository/xmlui/handle/11356/1048
Data from Twitter: https://github.com/sunnyline99/Analiza-sentymentu-CDPR (own work on project)