This project presents three topics from Unsupervised Learning. I will start with clustering of the data, then move to dimension reduction, and finish with association rules. All algorithms will be demonstrated on a single dataset, which was gathered as part of my graduation project.
Nowadays, analyzing text and measuring its sentiment is becoming more and more popular. At first glance such data seems impossible to describe with math or statistics, but in fact it can be. This makes it a bit harder to analyze, yet it can still tell us a lot about the topic we want to observe.
First of all, I would like to describe the data, as it is important to know what we can expect. The data was gathered from Twitter and concerns the Polish gaming stock company CD Projekt RED. The tweets about the company were collected via the Twitter API and contain the tweet text, the number of likes and retweets, the date of publication and the user. Only tweets containing one of the keywords connected with CDPR were taken into consideration. They were gathered from February 2021 to May 2021, and the database contains about 45,000 unique tweets. From each tweet I would like to extract the emojis and the text. Then sentiment analysis will be conducted to obtain numeric variables from the text, and we will proceed with the analyses. I would also like to extract the words from each tweet in order to mine some association rules.
All further steps and analyses will be described in each block of the project.
Before loading the packages with library(package_name), first make sure that they are installed. If they are not, please use install.packages("package_name") to install them.
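A minimal convenience sketch (the package list below is assumed to match the library() calls that follow) that installs only whatever is missing:
required_packages <- c("factoextra", "flexclust", "fpc", "clustertend", "cluster", "ClusterR",
                       "tsne", "NbClust", "arules", "arulesViz", "arulesCBA", "purrr", "stringr",
                       "tm", "syuzhet", "plyr", "dplyr", "wordcloud", "corrplot", "ggplot2",
                       "gridExtra", "knitr")
# compare against what is already installed and install only the missing packages
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)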
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(stats)
library(tsne)
library(NbClust)
library(arules)
library(arulesViz)
library(arulesCBA)
library(graphics)
library(purrr)
library(stringr)
library(tm)
library(syuzhet)
library(plyr)
library(dplyr)
library(wordcloud)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(grid)
library(knitr)
options(scipen = 999, "digits"=4)
After loading the needed libraries we can go through cleaning and exploring the data, before we get to the actual analysis with the Unsupervised Learning algorithms.
#Data <- read.csv("path to tweets data")
#Emoji_db <- read.csv("path to base of emojis")
#Emoji_sentiment <- read.csv("path to emojis sentiment")
Cleaning the Twitter data from additional special signs around the emoji codepoints - this will be used later to check which emoji appears in each tweet and how many times it appears.
Data$text <- gsub("[<>]"," ", Data$text)
Data$text <- gsub("000","",Data$text)
Data$text <- gsub("\\+","!",Data$text)
Data$text <- gsub("U!","",Data$text)
Merging the emoji codepoints with their sentiment scores.
Emoji_sentiment$codepoints <- sprintf("%x", Emoji_sentiment$Unicode.codepoint)
Emoji <- merge(x = Emoji_db, y = Emoji_sentiment, by = "codepoints", all.y = TRUE)
rm(Emoji_db)
rm(Emoji_sentiment)
Emoji <- Emoji[,c(1,8,9,10,11,12)]
colnames(Emoji)<-c("codepoints","Ocurrences","position","negative","neutral","positive")
Emoji$codepoints <- toupper(Emoji$codepoints)
Normalizing the sentiment score of each emoji (the difference between its positive and negative scores is standardized).
Emoji$substract<-Emoji$positive-Emoji$negative
Emoji$sentiment <-((Emoji$substract- mean(Emoji$substract))/(sd(Emoji$substract)))
Checking for outliers - a large sentiment outlier, especially for a very common emoji, is a problem because it can have a huge impact on the overall sentiment of a tweet.
Based on the plot I decided to set a threshold: if the sentiment is greater than 1 or less than -1, it is set to 1 or -1 respectively (so all emoji sentiments fall in the interval from -1 to 1).
Emoji$sentiment <- ifelse(Emoji$sentiment>=1,1,Emoji$sentiment)
Emoji$sentiment<- ifelse(Emoji$sentiment<=-1,-1,Emoji$sentiment)
Plot after applying threshold
Now I will check whether each tweet contains emojis and, if it does, count each type of emoji and multiply the count by the sentiment calculated above. Then I sum all emoji sentiments in each tweet to obtain its overall emoji sentiment.
z <- c()
matrix_Emoji <- c()
for(i in Emoji$codepoints){
  z <- as.data.frame(str_count(Data$text, i))             # how many times emoji i occurs in each tweet
  z <- z * Emoji$sentiment[Emoji$codepoints == i]         # weight the counts by that emoji's sentiment
  matrix_Emoji <- as.data.frame(append(matrix_Emoji, z))  # collect one column per emoji
}
Emoji_sentiment_sum_mat <- as.matrix(matrix_Emoji)
Emoji_sentiment_sum <- rowSums(Emoji_sentiment_sum_mat)
We must take into consideration that some tweets contain the same emoji many times, which can create outliers. Let's plot the sums and see whether there are tweets whose emoji sentiment falls outside the interval from -3 to 3.
plot(1:length(Data$text),Emoji_sentiment_sum, xlab="N.o. of tweet", ylab="Emoji sentiment")
abline(h=c(-3,3), col="red", lwd = 2)
We can see some outliers. Let's assume that if a tweet's emoji sentiment is greater than 3 or less than -3, we cap it at 3 or -3 respectively.
Emoji_sentiment_sum <- ifelse(Emoji_sentiment_sum >=3,3,Emoji_sentiment_sum)
Emoji_sentiment_sum <- ifelse(Emoji_sentiment_sum <=-3,-3,Emoji_sentiment_sum)
Emoji_sentiment_sum <- as.data.frame(Emoji_sentiment_sum)
First we need to get rid of special characters, stop words, numbers, URLs, extra blank spaces, etc.
twitterCorpus <- Corpus(VectorSource(Data$text))
twitterCorpus <- tm_map(twitterCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(tolower)):
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeWords, stopwords("en")):
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeNumbers): transformation
## drops documents
removeURL <- function(x) gsub("http[[:alnum:]]*","",x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(removeURL)):
## transformation drops documents
removeURL_2 <- function(x) gsub("edua[[:alnum:]]*","",x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeURL_2))
## Warning in tm_map.SimpleCorpus(twitterCorpus, content_transformer(removeURL_2)):
## transformation drops documents
removeNonAscii <- function(x) textclean::replace_non_ascii(x)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeNonAscii))
## Warning in tm_map.SimpleCorpus(twitterCorpus,
## content_transformer(removeNonAscii)): transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, removeWords,c("amp","ufef","ufeft","uufefuufefuufef","uufef","s","uffuffuufef"))
## Warning in tm_map.SimpleCorpus(twitterCorpus, removeWords, c("amp", "ufef", :
## transformation drops documents
twitterCorpus <- tm_map(twitterCorpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(twitterCorpus, stripWhitespace): transformation
## drops documents
twitterCorpus <- tm_map(twitterCorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(twitterCorpus, removePunctuation): transformation
## drops documents
Now I will create one data frame with all the needed data: the cleaned tweets and the emoji sentiment of each tweet.
Data[,1] <- twitterCorpus$content
Data$EmojisSentiment <- Emoji_sentiment_sum$Emoji_sentiment_sum
We can use sentiment dictionaries built into R packages to evaluate each tweet's sentiment based on its text. First we use the get_nrc_sentiment function to obtain the emotions connected with the text of each tweet - anger, anticipation, disgust, fear, joy, sadness, surprise and trust - together with negative and positive scores.
emotions <- get_nrc_sentiment(Data$text)
Data$anger <- emotions$anger
Data$anticipation <- emotions$anticipation
Data$disgust <- emotions$disgust
Data$fear <- emotions$fear
Data$joy <- emotions$joy
Data$sadness <- emotions$sadness
Data$surprise <- emotions$surprise
Data$trust <- emotions$trust
Data$negative <- emotions$negative
Data$positive <- emotions$positive
Now I will use a second method that evaluates sentiment on an interval scale. Positive values mean the tweet's sentiment is positive, negative values mean it is negative, and values around 0 mean the tweet is neutral.
sentiment_syuzhet <- get_sentiment(Data$text, method = "syuzhet")
Data$Sentiment <- sentiment_syuzhet
Now let's build a term-document matrix of tweets and words - it will be needed later to mine the association rules.
dtm <- TermDocumentMatrix(twitterCorpus)
dtm2 <- removeSparseTerms(dtm, sparse = 0.999)
matrix <- as.matrix(dtm2)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
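Since the wordcloud package is loaded and the word-frequency table df is already available, a quick visual sanity check of the vocabulary could be drawn as well (just a sketch - the frequency thresholds are arbitrary):
set.seed(1234)
wordcloud(words = df$word, freq = df$freq,
          min.freq = 100,       # plot only words that appear at least 100 times
          max.words = 100,      # cap the number of words drawn
          random.order = FALSE) # place the most frequent words in the centre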
Adding month and hour as separate columns, together with SentimentOverall, which is the sum of the sentiment from the text and from the emojis.
Data <- Data[,-1]
Data <- Data[,-3]
month <- as.numeric(substr(Data$created,7,7))
hour <- as.numeric(substr(Data$created,12,13))
Data <- Data[,-2]
Data$month <- month
Data$hour <- hour
Data$SentimentOverall <- Data$EmojisSentiment+Data$Sentiment
Now I would like to see some descriptive statistics for our data.
summary(Data)
## favoriteCount retweetCount EmojisSentiment anger
## Min. : 0 Min. : 0.0 Min. :-3.0000 Min. :0.000
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.:0.000
## Median : 1 Median : 0.0 Median : 0.0000 Median :0.000
## Mean : 7 Mean : 1.1 Mean : 0.0591 Mean :0.151
## 3rd Qu.: 3 3rd Qu.: 1.0 3rd Qu.: 0.0000 3rd Qu.:0.000
## Max. :11600 Max. :1192.0 Max. : 3.0000 Max. :4.000
## anticipation disgust fear joy
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000 Median :0.000
## Mean :0.349 Mean :0.117 Mean :0.174 Mean :0.275
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :5.000 Max. :4.000 Max. :4.000 Max. :5.000
## sadness surprise trust negative positive
## Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.0 Median :0.000 Median :0.000
## Mean :0.161 Mean :0.174 Mean :0.3 Mean :0.365 Mean :0.548
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:1.0 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :4.000 Max. :4.000 Max. :4.0 Max. :5.000 Max. :6.000
## Sentiment month hour SentimentOverall
## Min. :-4.050 Min. :2.00 Min. : 0 Min. :-4.250
## 1st Qu.:-0.100 1st Qu.:3.00 1st Qu.: 8 1st Qu.:-0.106
## Median : 0.000 Median :4.00 Median :14 Median : 0.000
## Mean : 0.198 Mean :3.77 Mean :13 Mean : 0.257
## 3rd Qu.: 0.750 3rd Qu.:4.00 3rd Qu.:18 3rd Qu.: 0.750
## Max. : 5.400 Max. :5.00 Max. :23 Max. : 6.800
As we can see, there are some very large outliers in the variables favoriteCount and retweetCount, which is why I standardize these two variables to make sure their extreme values do not dominate the analyses.
Data$favoriteCount<-scale(Data$favoriteCount)
Data$retweetCount<-scale(Data$retweetCount)
Now let us look at the sentiment of tweets depending on the month and hour when a tweet was posted. The red line on the graphs shows the mean sentiment of all gathered tweets. We can see that in February and March the sentiment of tweets was above the mean and that it fell in each subsequent month.
ag1<-aggregate.data.frame(Data$SentimentOverall,list(Data$month),mean)
ag1$positive <- ifelse(ag1$x>=0,1,2)
plot(as.character(ag1$Group.1),ag1$x, cex=3,col=ag1$positive, pch=19, xlab="Month", ylab="Mean sentiment")
abline(h=mean(Data$SentimentOverall), col="red", lwd = 2)
Analyzing the sentiment by the hour at which tweets were posted, we can also see groups of hours in which posts were on average more positive and groups in which they were more negative.
ag2<-aggregate.data.frame(Data$SentimentOverall,list(Data$hour),mean)
ag2$positive <- ifelse(ag2$x>=0,1,2)
plot(ag2$Group.1,ag2$x, cex=3,col=ag2$positive, pch=19, xlab="Hour", ylab="Mean sentiment")
abline(h=mean(Data$SentimentOverall), col="red", lwd = 2)
We should also look at the correlations between variables. As we might expect, some emotions are highly correlated with each other; now we can see how strong those correlations are.
cor.matrix <- cor(Data, method = c("spearman"))
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))
We start the analysis with clustering. Based on the data we may expect at least two groups - positive tweets and negative tweets - and perhaps clustering will reveal some other groups. The dataset has around 45,000 observations, which may be quite demanding for some algorithms, so I decided to randomly pick a sample of 10,000 observations on which to perform the clustering. We also standardize the variables to make sure that very high or low values do not dominate the analysis. After standardization the values are not as easy to interpret directly, but we can still interpret them relative to the other groups.
data.standarized <- as.data.frame(scale(Data))
set.seed(12444)
Sample <- sample_n(data.standarized,10000)
str(Sample)
## 'data.frame': 10000 obs. of 17 variables:
## $ favoriteCount : num -0.081 -0.081 -0.081 -0.081 -0.081 ...
## $ retweetCount : num -0.1167 -0.1167 -0.1167 -0.1167 -0.0111 ...
## $ EmojisSentiment : num -0.167 -0.167 -0.167 -0.167 -0.167 ...
## $ anger : num -0.376 -0.376 -0.376 -0.376 -0.376 ...
## $ anticipation : num 1.05 -0.56 -0.56 -0.56 -0.56 ...
## $ disgust : num -0.333 -0.333 -0.333 -0.333 -0.333 ...
## $ fear : num -0.404 -0.404 -0.404 -0.404 -0.404 ...
## $ joy : num 1.34 -0.51 -0.51 -0.51 -0.51 ...
## $ sadness : num -0.394 -0.394 -0.394 -0.394 -0.394 ...
## $ surprise : num -0.41 -0.41 -0.41 -0.41 -0.41 ...
## $ trust : num 1.263 -0.542 -0.542 -0.542 -0.542 ...
## $ negative : num 1.024 -0.588 -0.588 -0.588 -0.588 ...
## $ positive : num 0.589 -0.715 -0.715 -0.715 -0.715 ...
## $ Sentiment : num 0.261 -0.255 -0.255 -0.255 -0.255 ...
## $ month : num 0.29 1.578 -0.998 0.29 0.29 ...
## $ hour : num 1.518 -1.527 -0.309 -0.766 -1.223 ...
## $ SentimentOverall: num 0.164 -0.294 -0.294 -0.294 -0.294 ...
summary(Sample)
## favoriteCount retweetCount EmojisSentiment anger
## Min. :-0.08 Min. :-0.117 Min. :-8.653 Min. :-0.376
## 1st Qu.:-0.08 1st Qu.:-0.117 1st Qu.:-0.167 1st Qu.:-0.376
## Median :-0.07 Median :-0.117 Median :-0.167 Median :-0.376
## Mean :-0.01 Mean :-0.008 Mean : 0.010 Mean : 0.001
## 3rd Qu.:-0.05 3rd Qu.:-0.011 3rd Qu.:-0.167 3rd Qu.:-0.376
## Max. :34.28 Max. :14.662 Max. : 8.319 Max. : 9.553
## anticipation disgust fear joy
## Min. :-0.560 Min. :-0.333 Min. :-0.404 Min. :-0.51
## 1st Qu.:-0.560 1st Qu.:-0.333 1st Qu.:-0.404 1st Qu.:-0.51
## Median :-0.560 Median :-0.333 Median :-0.404 Median :-0.51
## Mean : 0.001 Mean : 0.012 Mean :-0.006 Mean : 0.00
## 3rd Qu.: 1.045 3rd Qu.:-0.333 3rd Qu.:-0.404 3rd Qu.:-0.51
## Max. : 5.860 Max. : 8.210 Max. : 8.880 Max. : 8.74
## sadness surprise trust negative
## Min. :-0.394 Min. :-0.410 Min. :-0.542 Min. :-0.588
## 1st Qu.:-0.394 1st Qu.:-0.410 1st Qu.:-0.542 1st Qu.:-0.588
## Median :-0.394 Median :-0.410 Median :-0.542 Median :-0.588
## Mean : 0.007 Mean : 0.009 Mean : 0.004 Mean : 0.018
## 3rd Qu.:-0.394 3rd Qu.:-0.410 3rd Qu.: 1.263 3rd Qu.: 1.024
## Max. : 9.402 Max. : 9.035 Max. : 6.680 Max. : 7.471
## positive Sentiment month hour
## Min. :-0.715 Min. :-5.096 Min. :-2.2862 Min. :-1.9837
## 1st Qu.:-0.715 1st Qu.:-0.449 1st Qu.:-0.9981 1st Qu.:-0.7658
## Median :-0.715 Median :-0.255 Median : 0.2901 Median : 0.1476
## Mean : 0.005 Mean :-0.016 Mean :-0.0085 Mean :-0.0033
## 3rd Qu.: 0.589 3rd Qu.: 0.648 3rd Qu.: 0.2901 3rd Qu.: 0.7565
## Max. : 5.804 Max. : 4.779 Max. : 1.5782 Max. : 1.5177
## SentimentOverall
## Min. :-4.580
## 1st Qu.:-0.465
## Median :-0.294
## Mean :-0.010
## 3rd Qu.: 0.564
## Max. : 6.505
Before we apply the actual clustering algorithms we should check whether the data can be clustered at all. For that we will use the Hopkins statistic. The null hypothesis of the test is that there are no meaningful clusters in the data; the alternative hypothesis assumes that the data can be clustered. Basically, if the statistic is close to 1 we can reject the null hypothesis and proceed with clustering. Unfortunately R Markdown had a problem handling such a big vector in the function, but the statistic is equal to 0.9649.
#get_clust_tendency(Sample,100)
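A possible workaround (just a sketch, assuming a random subsample is representative enough for the tendency check) is to compute the statistic on fewer rows:
set.seed(12444)
tendency_sample <- Sample[sample(nrow(Sample), 1000), ]                   # smaller random subsample
get_clust_tendency(tendency_sample, n = 100, graph = FALSE)$hopkins_stat  # Hopkins statistic only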
As we can see, the Hopkins statistic is close to 1, which means we can cluster our data. The question is how many clusters we should use. We will check it first with the Calinski-Harabasz and Duda-Hart indices. We will also take a look at the Silhouette index, the elbow method and the shadow statistic.
Let us start with the Calinski-Harabasz and Duda-Hart indices. Basically, we run k-means clustering three times (with k=4, k=3 and k=2) and then check the index values for each clustering. The higher the statistic, the better that number of clusters is.
km4<-kmeans(Sample, 4)
km3<-kmeans(Sample, 3)
km2<-kmeans(Sample,2)
round(calinhara(Sample, km4$cluster),digits=2)
## [1] 1892
round(calinhara(Sample, km3$cluster),digits=2)
## [1] 1588
round(calinhara(Sample, km2$cluster),digits=2)
## [1] 2017
dudahart2(Sample, km4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.3128
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
dudahart2(Sample, km3$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.5693
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
dudahart2(Sample, km2$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.8321
##
## $compare
## [1] 0.9522
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.09
As we can see, both indices show that the optimal number of clusters is 2 (we should look at the highest values of $dh). Next we move to the elbow method, which assesses the optimal number of clusters visually, and the Silhouette statistic, interpreted as: the higher the value, the better.
opt1<-Optimal_Clusters_KMeans(Sample, max_clusters=10, plot_clusters = TRUE, max_iters = 1000)
## [1] 0.0000 0.2265 0.2106 0.1877 0.1411 0.1136 0.1260 0.1186 0.1107 0.1166
## attr(,"class")
## [1] "k-means clustering"
Here we have a similar situation: the plot shows that k=2 is optimal. Nevertheless k=3 and k=4 may also not be bad, as they have quite high values. Now we can move to the shadow statistic, which is also used to assess the optimal number of clusters; this time the lower the value, the better. I ran two versions, first with the "euclidean" distance and second with the "manhattan" distance measure.
d1<-cclust(Sample, 5, dist="euclidean")
shadow(d1)
## 1 2 3 4 5
## 0.8844 0.7492 0.8100 0.8274 0.8121
plot(shadow(d1))
d2<-cclust(Sample, 5, dist="manhattan")
shadow(d2)
## 1 2 3 4 5
## 0.8392 0.8002 0.8142 0.7941 0.7366
plot(shadow(d2))
The shadow statistic with the euclidean measure indicates k=2 as optimal, whereas with the manhattan distance other numbers of clusters do better (the values are similar and k=5 comes out best). Now let's look at a different method based on medoids, with the two distance measures "euclidean" and "manhattan".
opt4<-Optimal_Clusters_Medoids(Sample, 5, 'euclidean', plot_clusters=TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
opt4<-Optimal_Clusters_Medoids(Sample, 5, 'manhattan', plot_clusters=TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
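The Silhouette index mentioned earlier can also be inspected directly for k-means; a minimal sketch with factoextra (computing silhouettes needs a full distance matrix, so a smaller subsample is assumed here to keep it light):
set.seed(12444)
sil_sample <- Sample[sample(nrow(Sample), 2000), ]                  # subsample keeps the distance matrix small
fviz_nbclust(sil_sample, kmeans, method = "silhouette", k.max = 8)  # average silhouette width for each k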
Based on many different statistics we can see that there are several options to choose from. Knowing the data, we may expect just 2 clusters, positive and negative tweets, but that is not very informative. That is why I would like to see the clustering for k=2 just to make sure it looks as I expected. Then I will try k=3 and k=4, which may be less optimal but may give us some more insight into the data.
fviz_cluster(list(data=Sample, cluster=km2$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=Sample, cluster=km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=Sample, cluster=km4$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
Now let us do some post-hoc analysis for each solution. First, let's try to name each cluster based on what kind of observations it contains. We will also look at plots showing how far each observation is from its centroid.
km2$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.00268 -0.002497 0.01914 -0.3173 -0.0206 -0.2546
## 2 -0.02177 -0.040349 -0.03740 1.7095 0.1154 1.4419
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.304 -0.004688 -0.271 -0.04863 -0.01488 -0.2861 -0.003373 0.1574
## 2 1.597 0.028083 1.499 0.31667 0.10576 1.6536 0.052781 -0.9448
## month hour SentimentOverall
## 1 -0.01473 -0.002381 0.1471
## 2 0.02490 -0.008131 -0.8516
km3$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.005257 -0.0006652 0.28878 -0.3408 -0.1929 -0.3126
## 2 -0.007125 -0.0125843 -0.18067 0.1610 -0.3434 0.1157
## 3 -0.001694 -0.0105422 0.07587 0.1668 1.5466 0.3293
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.3310 -0.07994 -0.3153 -0.3455 -0.05196 -0.4434 0.1325 0.5935
## 2 0.1457 -0.46544 0.1224 -0.2655 -0.41216 0.2937 -0.5263 -0.6599
## 3 0.1566 1.72031 0.2822 1.6507 1.50806 0.0465 1.5194 0.8853
## month hour SentimentOverall
## 1 -0.02514 0.02129 0.6421
## 2 -0.03054 -0.03151 -0.6573
## 3 0.09912 0.04050 0.8145
km4$centers
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## 1 -0.021302 -0.016251 -0.16066 -0.3162 -0.33565 -0.28981
## 2 -0.021160 -0.040349 -0.06762 1.8010 0.02253 1.51861
## 3 0.007418 -0.002137 0.17773 -0.1722 1.32192 -0.02993
## 4 0.014234 0.012897 0.17118 -0.3188 -0.35016 -0.26976
## fear joy sadness surprise trust negative positive Sentiment
## 1 -0.3079 -0.3941 -0.29059 -0.3452 -0.32817 -0.2842 -0.35107 -0.10486
## 2 1.7069 -0.1104 1.59937 0.2053 -0.01916 1.7207 -0.05795 -1.06310
## 3 -0.1635 1.6009 -0.03701 1.2455 1.40014 -0.2503 1.47336 1.09248
## 4 -0.3126 -0.3777 -0.31802 -0.3506 -0.37568 -0.2181 -0.36630 -0.06681
## month hour SentimentOverall
## 1 0.70599 0.0007237 -0.15775
## 2 0.03544 -0.0127952 -0.96857
## 3 0.07233 0.0400711 1.03907
## 4 -1.01193 -0.0297372 0.01001
d1<-cclust(Sample, 2, dist="euclidean")
stripes(d1)
d2<-cclust(Sample, 3, dist="euclidean")
stripes(d2)
d3<-cclust(Sample, 4, dist="euclidean")
stripes(d3)
For k=2 the second cluster has clearly higher values of the negative emotions and a lower overall sentiment, so we could call it the negative tweets cluster, whereas the first one is milder and slightly positive.
For k=3 the third cluster seems to be the cluster of positive tweets, as it has high values of the positive emotion variables (joy, anticipation, trust, surprise); the second cluster may be called negative, as it has higher values of the negative variables and negative sentiment; the first one could be called neutral-to-positive, as most of its emotion values are close to 0 while its sentiment is slightly positive. I think that k=3 shows the differences between groups better than k=2.
For k=4 the second and third clusters are again the negative and positive tweet groups. The first and fourth clusters are both fairly neutral in terms of emotions; judging by the centers they differ mainly in the month variable rather than in sentiment.
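To back up these labels, the cluster sizes and the average (standardized) overall sentiment per cluster can be tabulated; a minimal sketch for the k=3 solution:
table(km3$cluster)                                       # how many tweets fall into each cluster
aggregate(Sample$SentimentOverall,
          by = list(cluster = km3$cluster), FUN = mean)  # mean SentimentOverall per cluster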
Now we can move to the CLARA algorithm and see its results. We cannot use PAM, as the dataset is too big for that algorithm.
clara_2<-eclust(Sample, "clara", k=2)
clara_2$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.08097 -0.1167 -0.1673 -0.3756 1.0452 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## fear joy sadness surprise trust negative positive Sentiment month
## [1,] -0.404 1.3408 -0.3938 -0.4102 1.2635 -0.5879 0.5888 0.7129 0.2901
## [2,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.3844 0.2901
## hour SentimentOverall
## [1,] 1.2132 0.5635
## [2,] 0.1476 -0.4079
fviz_silhouette(clara_2)
## cluster size ave.sil.width
## 1 1 2592 0.08
## 2 2 7408 0.35
clara_3<-eclust(Sample, "clara", k=3)
clara_3$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.08097 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 2.6500 -0.333
## [3,] 0.08084 0.3055 -0.1673 2.1066 -0.5597 2.515
## fear joy sadness surprise trust negative positive Sentiment
## [1,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.3844
## [2,] -0.404 1.3408 -0.3938 1.9510 1.2635 -0.5879 1.8926 0.9065
## [3,] 1.917 -0.5097 2.0551 -0.4102 -0.5421 1.0239 -0.7151 -1.2235
## month hour SentimentOverall
## [1,] 0.2901 0.1476 -0.4079
## [2,] 0.2901 -0.3091 0.7350
## [3,] -0.9981 0.2998 -1.1508
fviz_silhouette(clara_3)
## cluster size ave.sil.width
## 1 1 7150 0.42
## 2 2 1685 0.13
## 3 3 1165 0.10
clara_4<-eclust(Sample, "clara", k=4)
clara_4$medoids
## favoriteCount retweetCount EmojisSentiment anger anticipation disgust
## [1,] -0.06941 -0.1167 -0.1673 -0.3756 -0.5597 -0.333
## [2,] -0.08097 -0.1167 -0.1673 -0.3756 2.6500 -0.333
## [3,] 0.08084 0.3055 -0.1673 2.1066 -0.5597 2.515
## [4,] 0.05773 0.9389 6.7578 -0.3756 1.0452 -0.333
## fear joy sadness surprise trust negative positive Sentiment
## [1,] -0.404 -0.5097 -0.3938 -0.4102 -0.5421 -0.5879 -0.7151 -0.2553
## [2,] -0.404 1.3408 -0.3938 1.9510 1.2635 -0.5879 1.8926 0.9065
## [3,] 1.917 -0.5097 2.0551 -0.4102 -0.5421 1.0239 -0.7151 -1.2235
## [4,] 1.917 1.3408 -0.3938 1.9510 -0.5421 -0.5879 1.8926 4.2628
## month hour SentimentOverall
## [1,] 0.2901 -0.004638 -0.2936
## [2,] -0.9981 0.147598 0.7350
## [3,] -0.9981 0.299833 -1.1508
## [4,] 1.5782 0.452069 6.5046
fviz_silhouette(clara_4)
## cluster size ave.sil.width
## 1 1 7195 0.41
## 2 2 1559 0.16
## 3 3 1162 0.10
## 4 4 84 0.33
We can see that the CLARA algorithm gives results similar to k-means; the clusters generated by CLARA could be labelled more or less like the k-means ones.
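One quick way to check this similarity is to cross-tabulate the k-means and CLARA assignments on the same sample (a sketch; cluster numbers are arbitrary labels, so only the pattern of agreement matters - eclust exposes the assignment vector as $cluster):
table(kmeans = km3$cluster, clara = clara_3$cluster)   # agreement between the two 3-cluster solutions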
The second part of the project concerns dimensionality reduction. We will try to merge some of the variables into common components and get rid of those that do not contribute much to the variance of the data.
We start with the basic PCA algorithm to see whether the data can be reduced to a lower dimensionality. Knowing from the exploration at the beginning of the analysis that the variables are correlated with each other, we can expect that some groups of related variables can be reduced to single components.
Data.PCA <- prcomp(data.standarized, center = TRUE, scale = TRUE)
summary(Data.PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.02 1.866 1.386 1.1170 1.0004 0.9940 0.8602 0.8121
## Proportion of Variance 0.24 0.205 0.113 0.0734 0.0589 0.0581 0.0435 0.0388
## Cumulative Proportion 0.24 0.445 0.558 0.6314 0.6903 0.7484 0.7920 0.8308
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## Standard deviation 0.7450 0.6913 0.6761 0.626 0.5798 0.5735 0.4994 0.28486
## Proportion of Variance 0.0326 0.0281 0.0269 0.023 0.0198 0.0193 0.0147 0.00477
## Cumulative Proportion 0.8634 0.8915 0.9184 0.941 0.9612 0.9806 0.9952 1.00000
## PC17
## Standard deviation 0.000000000000104
## Proportion of Variance 0.000000000000000
## Cumulative Proportion 1.000000000000000
We can see that around 10 components explain almost 90% of the variance in the data. Now let's plot it, with a table of the cumulative variance explained underneath. Then we plot the eigenvalues to see which are greater than 1; only the components above that threshold will be used in the further analysis.
eig.val <- get_eigenvalue(Data.PCA)
eig.val_plot <- eig.val$variance.percent
eig.val_plot2 <- eig.val$eigenvalue
qplot(c(1:17),eig.val_plot) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot") +
ylim(0, 25)+xlim(1,10)
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 row(s) containing missing values (geom_path).
eig.val$cumulative.variance.percent
## [1] 24.01 44.50 55.81 63.14 69.03 74.84 79.20 83.08 86.34 89.15
## [11] 91.84 94.14 96.12 98.06 99.52 100.00 100.00
plot(summary(Data.PCA)$importance[3,],ylab="Cumulative variance")
qplot(c(1:17),eig.val_plot2) +
geom_line() +
xlab("Principal Component") +
ylab("Eigenvalue") +
ggtitle("Scree Plot") +
ylim(0, 5)+xlim(1,17) + geom_hline(aes(yintercept = 1), colour="red")
eig.val_plot2
## [1] 4.08184865134454000212826940697 3.48304354844858021778009060654
## [3] 1.92198698897711861555137602409 1.24771762413537623537251874950
## [5] 1.00080459421469813996452558058 0.98807962439654106745479111851
## [7] 0.73997232161528225802271663269 0.65953918239787645561023055052
## [9] 0.55499390636153178579803579851 0.47783394546824053961131539836
## [11] 0.45716876416437218777488737942 0.39141839750203621939306231070
## [13] 0.33612548659948815465980942463 0.32887482358300179630106185868
## [15] 0.24944853716893922168651442917 0.08114360362077405186553846761
## [17] 0.00000000000000000000000001085
Based on this analysis we should choose 5 components, which together explain around 70% of the overall variance. Now we can look at the reduced dimensions on the correlation circle; the darker the colour, the more important the variable. Let us compare the first dimension with all the others. We should also compare the remaining pairs, but that would become a big mess, so we focus only on comparisons with the 1st dimension.
fviz_pca_var(Data.PCA, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 3)) +
labs(title="Variables loadings for PC1 and PC3", x="PC1", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 4)) +
labs(title="Variables loadings for PC1 and PC4", x="PC1", y="PC4")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_pca_var(Data.PCA, col.var="contrib", repel = TRUE, axes = c(1, 5)) +
labs(title="Variables loadings for PC1 and PC5", x="PC1", y="PC5")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We can see that the negative and positive emotions point in opposite directions.
fviz_pca_ind(Data.PCA, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))
PC1 <- fviz_contrib(Data.PCA, choice = "var", axes = 1,fill = "#FF3366",color = "#FF3366")
PC2 <- fviz_contrib(Data.PCA, choice = "var", axes = 2,fill = "#FF3366",color = "#FF3366")
PC3 <- fviz_contrib(Data.PCA, choice = "var", axes = 3,fill = "#FF3366",color = "#FF3366")
PC4 <- fviz_contrib(Data.PCA, choice = "var", axes = 4,fill = "#FF3366",color = "#FF3366")
PC5 <- fviz_contrib(Data.PCA, choice = "var", axes = 5,fill = "#FF3366",color = "#FF3366")
PC1
PC2
PC3
PC4
PC5
As we can see, the first dimension is based on Sentiment and SentimentOverall (which is derived from Sentiment); moreover, the positive emotions also load on this dimension.
The 2nd dimension is constructed from the negative emotions.
The 3rd one reflects the popularity of the tweet.
The 4th is just the sentiment from text and emojis, without the emotions.
The 5th is the hour at which the tweet was posted.
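If this reduced representation were to be reused (for example to repeat the clustering on 5 components instead of 17 variables), the component scores can be taken straight from the prcomp object; a small sketch (the column labels are my interpretation from above, not anything returned by PCA):
pca_scores <- as.data.frame(Data.PCA$x[, 1:5])   # observations projected onto the first 5 components
colnames(pca_scores) <- c("TextSentiment", "NegativeEmotions",
                          "Popularity", "RawSentiment", "Hour")  # hypothetical, interpretation-based names
head(pca_scores)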
distance.m<-dist(t(data.standarized))
hc<-hclust(distance.m, method="complete")
plot(hc, hang=-1)
rect.hclust(hc, k = 5, border="red")
sub_grp<-cutree(hc, k=5,h=-1)
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("#339900", "#CC0066", "black", "blue", "yellow" ))
We can see the 5 groups marked with red rectangles. They are similar to the PCA results in terms of the main dimensions - positive emotions, negative emotions, when the tweet was posted and how popular it is.
I have also plotted the variables clustered with k = 5 and, as we can see, there are clusters of positive emotions, negative emotions, tweet popularity, and two related to the month and hour of posting.
It took a lot of time to convert the document-term matrix into the proper data structure (transaction data). Basically, the code below goes over each tweet and each word to check whether that word was used in the tweet; if so, it appends the word to a comma-separated list. It had to check 1132 words x 45,851 tweets = 51,903,332 cells. The data prepared this way was then used to mine association rules on the words from tweets. That is why I load the ready-made data here instead of re-running the conversion.
#words_df <- as.data.frame(matrix)
#tweet_words <-c()
#words<-data.frame(tweet_words=c(1:length(words_df)))
#i=1
#n=1
#for (i in 1:length(words_df)){
# tweet_words<-c()
# for (n in 1:nrow(words_df)){
# if(words_df[n,i]==0) next
# x<-row.names(words_df)[n]
# tweet_words<-append(tweet_words,x)
# }
# print(i)
# words$tweet_words[i]<-paste(tweet_words, collapse = ', ')
#}
#trans <- read.transactions("path to file with transacion data", format = "basket",
# header = TRUE, sep=",")
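An alternative that avoids the nested loop would be to coerce a binary tweet-word incidence matrix directly into transactions with arules; a sketch, left commented out like the block above because it was not the route actually used here:
#tweet_term <- t(as.matrix(dtm2) > 0)          # rows = tweets, columns = words, TRUE if the word occurs
#trans_alt <- as(tweet_term, "transactions")   # arules coercion from a logical incidence matrix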
Let's take a look at the first 100 tweets and see how many words each of them contains, to make sure the data was loaded correctly.
size(trans[1:100])
## [1] 7 3 2 9 12 4 5 3 4 12 8 4 3 7 6 4 7 6 5 3 6 6 3 3 5
## [26] 10 2 3 7 2 7 7 4 2 4 8 2 5 10 6 9 8 3 3 3 8 2 5 5 3
## [51] 3 1 2 7 7 7 9 8 4 6 2 8 5 6 6 7 7 6 2 8 7 5 10 7 5
## [76] 4 0 0 5 2 0 7 3 4 6 5 9 3 6 4 7 2 5 6 7 6 2 7 5 5
length(trans)
## [1] 45559
ifre<-itemFrequency(trans, type="relative")
sort(ifre,decreasing = TRUE)[1:100]
## cyberpunk cdpr game cdprojektred
## 0.671196 0.108080 0.101846 0.094866
## virtualphotography cyberpunkgame red psshare
## 0.060317 0.060164 0.058232 0.057069
## just projekt time like
## 0.052613 0.048772 0.046116 0.045787
## via night new city
## 0.042955 0.041419 0.040475 0.039004
## now gaming photomode vgpunite
## 0.038214 0.036458 0.036217 0.035010
## one patch good live
## 0.034022 0.030905 0.030334 0.029083
## will get still future
## 0.028798 0.028293 0.028227 0.026888
## cyberpunkphotomode can love fef
## 0.026230 0.025483 0.024035 0.023552
## inspired epic twitch judy
## 0.023223 0.023135 0.022915 0.022498
## games witcher enter explore
## 0.021972 0.021906 0.021642 0.021357
## razorsharp neojapanese see judyalvarez
## 0.021050 0.020940 0.020237 0.020062
## back got play really
## 0.019996 0.019864 0.019162 0.018701
## xboxshare playing travel people
## 0.018547 0.018350 0.018218 0.017647
## know update think nightcity
## 0.017560 0.017318 0.017274 0.016967
## youtube going made rcyberpunk
## 0.016879 0.016835 0.016572 0.015913
## come check first xbox
## 0.015782 0.015716 0.015650 0.015057
## part playstation johnny never
## 0.014596 0.014245 0.014004 0.013828
## day even rcyberpunkgame make
## 0.013740 0.013587 0.013477 0.013367
## stream today much project
## 0.013367 0.013345 0.013279 0.013016
## last best video want
## 0.012094 0.011941 0.011919 0.011897
## well worldofvp finally ign
## 0.011524 0.011414 0.011150 0.011106
## look say edge gamergram
## 0.011063 0.010667 0.010624 0.010316
## great amazing another work
## 0.010272 0.010185 0.010075 0.010053
## panam also take hope
## 0.009965 0.009877 0.009811 0.009746
## ever better dev right
## 0.009680 0.009548 0.009548 0.009438
ifre2<-itemFrequency(trans, type="absolute")
sort(ifre2,decreasing = TRUE)[1:100]
## cyberpunk cdpr game cdprojektred
## 30579 4924 4640 4322
## virtualphotography cyberpunkgame red psshare
## 2748 2741 2653 2600
## just projekt time like
## 2397 2222 2101 2086
## via night new city
## 1957 1887 1844 1777
## now gaming photomode vgpunite
## 1741 1661 1650 1595
## one patch good live
## 1550 1408 1382 1325
## will get still future
## 1312 1289 1286 1225
## cyberpunkphotomode can love fef
## 1195 1161 1095 1073
## inspired epic twitch judy
## 1058 1054 1044 1025
## games witcher enter explore
## 1001 998 986 973
## razorsharp neojapanese see judyalvarez
## 959 954 922 914
## back got play really
## 911 905 873 852
## xboxshare playing travel people
## 845 836 830 804
## know update think nightcity
## 800 789 787 773
## youtube going made rcyberpunk
## 769 767 755 725
## come check first xbox
## 719 716 713 686
## part playstation johnny never
## 665 649 638 630
## day even rcyberpunkgame make
## 626 619 614 609
## stream today much project
## 609 608 605 593
## last best video want
## 551 544 543 542
## well worldofvp finally ign
## 525 520 508 506
## look say edge gamergram
## 504 486 484 470
## great amazing another work
## 468 464 459 458
## panam also take hope
## 454 450 447 444
## ever better dev right
## 441 435 435 430
As we can see, the relative frequencies are rather small because we have a huge database. That is why we need to set rather small support levels, around 2%.
The first plot shows words that appear in more than 2.5% of tweets. The second shows the 25 most popular words across all tweets. The last visualization shows 1000 random tweets from the data on the y axis, and each dot indicates whether the word (on the x axis) occurs in the tweet. Based on the plot we can see some vertical lines, which means there are some quite frequent words in the tweets.
itemFrequencyPlot(trans, support = 0.025)
itemFrequencyPlot(trans, topN = 25)
image(sample(trans, 1000))
## APRIORI
Now let's take a look at the actual association rules, with the minimal support level set to 2% and the minimal confidence level set to 25%. The rules are then sorted by lift, confidence, support and count.
rules <- apriori(trans, parameter = list(support = 0.02, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.02 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 911
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [1044 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
arules::inspect(sort(rules, by = "lift")[1:15])
## lhs rhs support confidence
## [1] {explore, razorsharp} => {neojapanese} 0.02028 1
## [2] {enter, razorsharp} => {neojapanese} 0.02028 1
## [3] {inspired, razorsharp} => {neojapanese} 0.02028 1
## [4] {future, razorsharp} => {neojapanese} 0.02028 1
## [5] {explore, inspired} => {neojapanese} 0.02079 1
## [6] {epic, explore} => {neojapanese} 0.02070 1
## [7] {explore, future} => {neojapanese} 0.02079 1
## [8] {enter, inspired} => {neojapanese} 0.02090 1
## [9] {enter, epic} => {neojapanese} 0.02070 1
## [10] {enter, explore, razorsharp} => {neojapanese} 0.02028 1
## [11] {explore, inspired, razorsharp} => {neojapanese} 0.02028 1
## [12] {epic, explore, razorsharp} => {neojapanese} 0.02028 1
## [13] {explore, future, razorsharp} => {neojapanese} 0.02028 1
## [14] {cyberpunk, explore, razorsharp} => {neojapanese} 0.02028 1
## [15] {enter, inspired, razorsharp} => {neojapanese} 0.02028 1
## coverage lift count
## [1] 0.02028 47.76 924
## [2] 0.02028 47.76 924
## [3] 0.02028 47.76 924
## [4] 0.02028 47.76 924
## [5] 0.02079 47.76 947
## [6] 0.02070 47.76 943
## [7] 0.02079 47.76 947
## [8] 0.02090 47.76 952
## [9] 0.02070 47.76 943
## [10] 0.02028 47.76 924
## [11] 0.02028 47.76 924
## [12] 0.02028 47.76 924
## [13] 0.02028 47.76 924
## [14] 0.02028 47.76 924
## [15] 0.02028 47.76 924
arules::inspect(sort(rules, by = "confidence")[1:15])
## lhs rhs support confidence coverage
## [1] {razorsharp} => {cyberpunk} 0.02105 1 0.02105
## [2] {explore, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [3] {enter, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [4] {inspired, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [5] {future, razorsharp} => {neojapanese} 0.02028 1 0.02028
## [6] {neojapanese, razorsharp} => {cyberpunk} 0.02030 1 0.02030
## [7] {explore, neojapanese} => {enter} 0.02079 1 0.02079
## [8] {explore, neojapanese} => {inspired} 0.02079 1 0.02079
## [9] {explore, inspired} => {neojapanese} 0.02079 1 0.02079
## [10] {epic, neojapanese} => {explore} 0.02070 1 0.02070
## [11] {epic, explore} => {neojapanese} 0.02070 1 0.02070
## [12] {explore, neojapanese} => {future} 0.02079 1 0.02079
## [13] {explore, future} => {neojapanese} 0.02079 1 0.02079
## [14] {inspired, neojapanese} => {enter} 0.02090 1 0.02090
## [15] {enter, inspired} => {neojapanese} 0.02090 1 0.02090
## lift count
## [1] 1.49 959
## [2] 47.76 924
## [3] 47.76 924
## [4] 47.76 924
## [5] 47.76 924
## [6] 1.49 925
## [7] 46.21 947
## [8] 43.06 947
## [9] 47.76 947
## [10] 46.82 943
## [11] 47.76 943
## [12] 37.19 947
## [13] 47.76 947
## [14] 46.21 952
## [15] 47.76 952
arules::inspect(sort(rules, by = "support")[1:15])
## lhs rhs support confidence coverage lift
## [1] {cdprojektred} => {cyberpunk} 0.07636 0.8050 0.09487 1.1993
## [2] {game} => {cyberpunk} 0.05641 0.5539 0.10185 0.8252
## [3] {psshare} => {cyberpunk} 0.05555 0.9735 0.05707 1.4503
## [4] {virtualphotography} => {cyberpunk} 0.05408 0.8967 0.06032 1.3359
## [5] {projekt} => {red} 0.04486 0.9199 0.04877 15.7970
## [6] {red} => {projekt} 0.04486 0.7704 0.05823 15.7970
## [7] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.06016 0.9235
## [8] {via} => {cyberpunk} 0.03584 0.8344 0.04296 1.2432
## [9] {time} => {cyberpunk} 0.03494 0.7577 0.04612 1.1289
## [10] {photomode} => {cyberpunk} 0.03448 0.9521 0.03622 1.4185
## [11] {night} => {cyberpunk} 0.03288 0.7939 0.04142 1.1827
## [12] {city} => {night} 0.03222 0.8261 0.03900 19.9453
## [13] {night} => {city} 0.03222 0.7780 0.04142 19.9453
## [14] {vgpunite} => {cyberpunk} 0.03106 0.8871 0.03501 1.3217
## [15] {city} => {cyberpunk} 0.03104 0.7957 0.03900 1.1855
## count
## [1] 3479
## [2] 2570
## [3] 2531
## [4] 2464
## [5] 2044
## [6] 2044
## [7] 1699
## [8] 1633
## [9] 1592
## [10] 1571
## [11] 1498
## [12] 1468
## [13] 1468
## [14] 1415
## [15] 1414
arules::inspect(sort(rules, by = "count")[1:15])
## lhs rhs support confidence coverage lift
## [1] {cdprojektred} => {cyberpunk} 0.07636 0.8050 0.09487 1.1993
## [2] {game} => {cyberpunk} 0.05641 0.5539 0.10185 0.8252
## [3] {psshare} => {cyberpunk} 0.05555 0.9735 0.05707 1.4503
## [4] {virtualphotography} => {cyberpunk} 0.05408 0.8967 0.06032 1.3359
## [5] {projekt} => {red} 0.04486 0.9199 0.04877 15.7970
## [6] {red} => {projekt} 0.04486 0.7704 0.05823 15.7970
## [7] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.06016 0.9235
## [8] {via} => {cyberpunk} 0.03584 0.8344 0.04296 1.2432
## [9] {time} => {cyberpunk} 0.03494 0.7577 0.04612 1.1289
## [10] {photomode} => {cyberpunk} 0.03448 0.9521 0.03622 1.4185
## [11] {night} => {cyberpunk} 0.03288 0.7939 0.04142 1.1827
## [12] {city} => {night} 0.03222 0.8261 0.03900 19.9453
## [13] {night} => {city} 0.03222 0.7780 0.04142 19.9453
## [14] {vgpunite} => {cyberpunk} 0.03106 0.8871 0.03501 1.3217
## [15] {city} => {cyberpunk} 0.03104 0.7957 0.03900 1.1855
## count
## [1] 3479
## [2] 2570
## [3] 2531
## [4] 2464
## [5] 2044
## [6] 2044
## [7] 1699
## [8] 1633
## [9] 1592
## [10] 1571
## [11] 1498
## [12] 1468
## [13] 1468
## [14] 1415
## [15] 1414
The rules with the highest lift and confidence are mostly connected with the word "neojapanese" - there are some words that very often appear together with it.
The rules with the highest support and count focus more on Cyberpunk (the game made by CDPR). We can find some common word associations, for example projekt -> red, city -> night, etc.
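These associations are easier to read as a graph; a sketch with arulesViz, taking the 20 rules with the highest lift:
plot(sort(rules, by = "lift")[1:20], method = "graph")   # draw the top rules as an item/rule network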
I have also checked some rules to see which words lead to some chosen words on the right-hand side (xbox, sony, witcher, love, hate).
rules.xbox<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="xbox"), control=list(verbose=F))
rules.xbox.byconf<-sort(rules.xbox, by="confidence", decreasing=TRUE)
arules::inspect(rules.xbox.byconf[1:10])
## lhs rhs support confidence coverage lift
## [1] {xboxseriesx} => {xbox} 0.002173 0.31132 0.006980 20.676
## [2] {cyberpunk, xboxseriesx} => {xbox} 0.001954 0.30479 0.006409 20.242
## [3] {series} => {xbox} 0.001383 0.24324 0.005685 16.154
## [4] {cyberpunk, one} => {xbox} 0.002744 0.13951 0.019667 9.265
## [5] {playstation} => {xbox} 0.001405 0.09861 0.014245 6.549
## [6] {one} => {xbox} 0.003292 0.09677 0.034022 6.427
## [7] {cyberpunk, part} => {xbox} 0.001032 0.08918 0.011567 5.923
## [8] {cyberpunk, xboxshare} => {xbox} 0.001427 0.07711 0.018503 5.121
## [9] {xboxshare} => {xbox} 0.001427 0.07692 0.018547 5.109
## [10] {part} => {xbox} 0.001097 0.07519 0.014596 4.993
## count
## [1] 99
## [2] 89
## [3] 63
## [4] 125
## [5] 64
## [6] 150
## [7] 47
## [8] 65
## [9] 65
## [10] 50
rules.sony<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="sony"), control=list(verbose=F))
rules.sony.byconf<-sort(rules.sony, by="confidence", decreasing=TRUE)
arules::inspect(rules.sony.byconf[1:2])
## lhs rhs support confidence coverage lift count
## [1] {playstation} => {sony} 0.001778 0.12481 0.01425 23.209 81
## [2] {cdpr} => {sony} 0.001756 0.01625 0.10808 3.021 80
rules.witcher<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="witcher"), control=list(verbose=F))
rules.witcher.byconf<-sort(rules.witcher, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.witcher.byconf))
## lhs rhs support confidence coverage lift
## [1] {following, workplace} => {witcher} 0.001097 1 0.001097 45.65
## [2] {projekt, workplace} => {witcher} 0.001251 1 0.001251 45.65
## [3] {hunt, thewitcherwildhunt} => {witcher} 0.001119 1 0.001119 45.65
## [4] {thewitcherwildhunt, wild} => {witcher} 0.001119 1 0.001119 45.65
## [5] {thewitcher, wild} => {witcher} 0.001163 1 0.001163 45.65
## [6] {play, wild} => {witcher} 0.001010 1 0.001010 45.65
## count
## [1] 50
## [2] 57
## [3] 51
## [4] 51
## [5] 53
## [6] 46
rules.love<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="love"), control=list(verbose=F))
rules.love.byconf<-sort(rules.love, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.love.byconf))
## lhs rhs support confidence coverage lift count
## [1] {cyberpunk, much} => {love} 0.001163 0.17608 0.006607 7.326 53
## [2] {much} => {love} 0.001866 0.14050 0.013279 5.846 85
## [3] {really} => {love} 0.001032 0.05516 0.018701 2.295 47
## [4] {fef} => {love} 0.001251 0.05312 0.023552 2.210 57
## [5] {cyberpunk, game} => {love} 0.002963 0.05253 0.056410 2.186 135
## [6] {judy} => {love} 0.001054 0.04683 0.022498 1.948 48
rules.hate<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="lhs", rhs="hate"), control=list(verbose=F))
rules.hate.byconf<-sort(rules.hate, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.hate.byconf))
## lhs rhs support confidence coverage lift count
## [1] {cdpr} => {hate} 0.001141 0.010561 0.1081 2.781 52
## [2] {} => {hate} 0.003797 0.003797 1.0000 1.000 173
Now I have checked the rules in the other direction, to see which words could follow if I write one of the chosen words (xbox, sony, witcher, love, hate).
rules.xbox.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="xbox"), control=list(verbose=F))
rules.xbox.byconf<-sort(rules.xbox.r, by="confidence", decreasing=TRUE)
arules::inspect(rules.xbox.byconf[1:10])
## lhs rhs support confidence coverage lift count
## [1] {xbox} => {cyberpunk} 0.011611 0.77114 0.01506 1.1489 529
## [2] {} => {cyberpunk} 0.671196 0.67120 1.00000 1.0000 30579
## [3] {xbox} => {one} 0.003292 0.21866 0.01506 6.4270 150
## [4] {xbox} => {cdprojektred} 0.002371 0.15743 0.01506 1.6595 108
## [5] {xbox} => {xboxseriesx} 0.002173 0.14431 0.01506 20.6756 99
## [6] {} => {cdpr} 0.108080 0.10808 1.00000 1.0000 4924
## [7] {} => {game} 0.101846 0.10185 1.00000 1.0000 4640
## [8] {xbox} => {game} 0.001493 0.09913 0.01506 0.9733 68
## [9] {} => {cdprojektred} 0.094866 0.09487 1.00000 1.0000 4322
## [10] {xbox} => {xboxshare} 0.001427 0.09475 0.01506 5.1087 65
rules.sony.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="sony"), control=list(verbose=F))
rules.sony.byconf<-sort(rules.sony.r, by="confidence", decreasing=TRUE)
arules::inspect(rules.sony.byconf[1:5])
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.6712 1.000000 1.000 30579
## [2] {sony} => {playstation} 0.001778 0.3306 0.005378 23.209 81
## [3] {sony} => {cdpr} 0.001756 0.3265 0.005378 3.021 80
## [4] {sony} => {cyberpunk} 0.001624 0.3020 0.005378 0.450 74
## [5] {} => {cdpr} 0.108080 0.1081 1.000000 1.000 4924
rules.witcher.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="witcher"), control=list(verbose=F))
rules.witcher.byconf<-sort(rules.witcher.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.witcher.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.6712 1.00000 1.0000 30579
## [2] {witcher} => {cdpr} 0.006629 0.3026 0.02191 2.7998 302
## [3] {witcher} => {red} 0.005904 0.2695 0.02191 4.6287 269
## [4] {witcher} => {projekt} 0.005663 0.2585 0.02191 5.3005 258
## [5] {witcher} => {cyberpunk} 0.004170 0.1904 0.02191 0.2836 190
## [6] {witcher} => {cdprojektred} 0.003753 0.1713 0.02191 1.8062 171
rules.love.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="love"), control=list(verbose=F))
rules.love.byconf<-sort(rules.love.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.love.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.67120 1.00000 1.0000 30579
## [2] {love} => {cyberpunk} 0.016001 0.66575 0.02403 0.9919 729
## [3] {love} => {game} 0.004258 0.17717 0.02403 1.7396 194
## [4] {} => {cdpr} 0.108080 0.10808 1.00000 1.0000 4924
## [5] {} => {game} 0.101846 0.10185 1.00000 1.0000 4640
## [6] {love} => {cdpr} 0.002283 0.09498 0.02403 0.8788 104
rules.hate.r<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.0025),
appearance=list(default="rhs", lhs="hate"), control=list(verbose=F))
rules.hate.byconf<-sort(rules.hate.r, by="confidence", decreasing=TRUE)
arules::inspect(head(rules.hate.byconf))
## lhs rhs support confidence coverage lift count
## [1] {} => {cyberpunk} 0.671196 0.67120 1.000000 1.0000 30579
## [2] {hate} => {cyberpunk} 0.001383 0.36416 0.003797 0.5426 63
## [3] {hate} => {cdpr} 0.001141 0.30058 0.003797 2.7811 52
## [4] {} => {cdpr} 0.108080 0.10808 1.000000 1.0000 4924
## [5] {} => {game} 0.101846 0.10185 1.000000 1.0000 4640
## [6] {} => {cdprojektred} 0.094866 0.09487 1.000000 1.0000 4322
plot(rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{cyberpunk,enter,epic,explore,future,inspired,razorsharp}"
## [2] "{cyberpunk,enter,epic,future,inspired,razorsharp}"
## [3] "{cyberpunk,enter,epic,explore,future,inspired}"
## [4] "{cyberpunk,enter,epic,future,inspired}"
## [5] "{cyberpunk,epic,explore,future,inspired,razorsharp}"
## [6] "{cyberpunk,epic,future,inspired,razorsharp}"
## [7] "{cyberpunk,epic,explore,future,inspired}"
## [8] "{cyberpunk,epic,future,inspired}"
## [9] "{cyberpunk,enter,epic,future,inspired,neojapanese,razorsharp}"
## [10] "{cyberpunk,enter,epic,future,inspired,neojapanese}"
## [11] "{cyberpunk,enter,epic,explore,future,inspired,neojapanese}"
## [12] "{cyberpunk,epic,future,inspired,neojapanese}"
## [13] "{cyberpunk,epic,future,inspired,neojapanese,razorsharp}"
## [14] "{cyberpunk,epic,explore,future,inspired,neojapanese}"
## [15] "{cyberpunk,epic,explore,future,inspired,neojapanese,razorsharp}"
## [16] "{cyberpunk,future,inspired}"
## [17] "{cyberpunk,enter,future,inspired}"
## [18] "{cyberpunk,enter,epic,future}"
## [19] "{cyberpunk,epic,future}"
## [20] "{cyberpunk,future,inspired,razorsharp}"
## [21] "{cyberpunk,epic,future,razorsharp}"
## [22] "{cyberpunk,enter,future,inspired,razorsharp}"
## [23] "{cyberpunk,explore,future,inspired}"
## [24] "{cyberpunk,epic,explore,future}"
## [25] "{cyberpunk,enter,epic,future,razorsharp}"
## [26] "{cyberpunk,enter,explore,future,inspired}"
## [27] "{cyberpunk,enter,epic,explore,future}"
## [28] "{cyberpunk,explore,future,inspired,razorsharp}"
## [29] "{cyberpunk,future,inspired,neojapanese}"
## [30] "{cyberpunk,epic,explore,future,razorsharp}"
## [31] "{cyberpunk,epic,future,neojapanese}"
## [32] "{cyberpunk,enter,future,inspired,neojapanese}"
## [33] "{cyberpunk,enter,explore,future,inspired,razorsharp}"
## [34] "{cyberpunk,enter,epic,future,neojapanese}"
## [35] "{cyberpunk,future,inspired,neojapanese,razorsharp}"
## [36] "{cyberpunk,enter,future}"
## [37] "{cyberpunk,future,razorsharp}"
## [38] "{cyberpunk,enter,epic,explore,future,razorsharp}"
## [39] "{cyberpunk,epic,future,neojapanese,razorsharp}"
## [40] "{cyberpunk,explore,future}"
## [41] "{cyberpunk,explore,future,inspired,neojapanese}"
## [42] "{cyberpunk,epic,explore,future,neojapanese}"
## [43] "{cyberpunk,enter,future,razorsharp}"
## [44] "{cyberpunk,future,neojapanese}"
## [45] "{cyberpunk,enter,explore,future}"
## [46] "{cyberpunk,explore,future,razorsharp}"
## [47] "{cyberpunk,enter,future,inspired,neojapanese,razorsharp}"
## [48] "{cyberpunk,enter,epic,future,neojapanese,razorsharp}"
## [49] "{cyberpunk,enter,future,neojapanese}"
## [50] "{cyberpunk,epic,inspired}"
## [51] "{cyberpunk,enter,explore,future,inspired,neojapanese}"
## [52] "{cyberpunk,future,neojapanese,razorsharp}"
## [53] "{cyberpunk,enter,epic,explore,future,neojapanese}"
## [54] "{cyberpunk,explore,future,neojapanese}"
## [55] "{cyberpunk,explore,future,inspired,neojapanese,razorsharp}"
## [56] "{cyberpunk,enter,explore,future,razorsharp}"
## [57] "{cyberpunk,epic,explore,future,neojapanese,razorsharp}"
## [58] "{cyberpunk,enter,epic,inspired}"
## [59] "{cyberpunk,epic,inspired,razorsharp}"
## [60] "{cyberpunk,epic,explore,inspired}"
## [61] "{cyberpunk,enter,future,neojapanese,razorsharp}"
## [62] "{cyberpunk,enter,inspired}"
## [63] "{cyberpunk,enter,epic}"
## [64] "{cyberpunk,enter,explore,future,neojapanese}"
## [65] "{cyberpunk,inspired,razorsharp}"
## [66] "{cyberpunk,epic,inspired,neojapanese}"
## [67] "{cyberpunk,explore,inspired}"
## [68] "{cyberpunk,explore,future,neojapanese,razorsharp}"
## [69] "{cyberpunk,epic,explore}"
## [70] "{cyberpunk,inspired,neojapanese}"
## [71] "{cyberpunk,epic,neojapanese}"
## [72] "{cyberpunk,enter,epic,inspired,razorsharp}"
## [73] "{cyberpunk,enter,epic,explore,inspired}"
## [74] "{cyberpunk,neojapanese}"
## [75] "{cyberpunk,enter,inspired,razorsharp}"
## [76] "{cyberpunk,epic,explore,inspired,razorsharp}"
## [77] "{cyberpunk,enter,epic,razorsharp}"
## [78] "{cyberpunk,enter,explore,inspired}"
## [79] "{cyberpunk,enter,epic,explore}"
## [80] "{cyberpunk,enter,razorsharp}"
## [81] "{cyberpunk,explore,inspired,razorsharp}"
## [82] "{cyberpunk,epic,explore,razorsharp}"
## [83] "{cyberpunk,enter,epic,inspired,neojapanese}"
## [84] "{cyberpunk,enter,explore}"
## [85] "{cyberpunk,explore,razorsharp}"
## [86] "{cyberpunk,enter,inspired,neojapanese}"
## [87] "{cyberpunk,epic,inspired,neojapanese,razorsharp}"
## [88] "{cyberpunk,enter,epic,neojapanese}"
## [89] "{cyberpunk,inspired,neojapanese,razorsharp}"
## [90] "{cyberpunk,enter,neojapanese}"
## [91] "{cyberpunk,epic,neojapanese,razorsharp}"
## [92] "{cyberpunk,epic,explore,inspired,neojapanese}"
## [93] "{cyberpunk,epic,razorsharp}"
## [94] "{cyberpunk,explore,inspired,neojapanese}"
## [95] "{cyberpunk,explore}"
## [96] "{cyberpunk,neojapanese,razorsharp}"
## [97] "{cyberpunk,epic,explore,neojapanese}"
## [98] "{cyberpunk,enter}"
## [99] "{cyberpunk,explore,neojapanese}"
## [100] "{cyberpunk,enter,explore,future,inspired,neojapanese,razorsharp}"
## [101] "{cyberpunk,enter,explore,future,neojapanese,razorsharp}"
## [102] "{cyberpunk,enter,epic,explore,future,neojapanese,razorsharp}"
## [103] "{cyberpunk,enter,explore,razorsharp}"
## [104] "{cyberpunk,enter,explore,inspired,razorsharp}"
## [105] "{cyberpunk,enter,epic,explore,razorsharp}"
## [106] "{cyberpunk,razorsharp}"
## [107] "{cyberpunk,enter,neojapanese,razorsharp}"
## [108] "{cyberpunk,enter,explore,neojapanese}"
## [109] "{cyberpunk,enter,epic,explore,inspired,razorsharp}"
## [110] "{cyberpunk,explore,neojapanese,razorsharp}"
## [111] "{cyberpunk,enter,inspired,neojapanese,razorsharp}"
## [112] "{cyberpunk,enter,epic,neojapanese,razorsharp}"
## [113] "{cyberpunk,enter,explore,inspired,neojapanese}"
## [114] "{cyberpunk,enter,epic,explore,neojapanese}"
## [115] "{cyberpunk,explore,inspired,neojapanese,razorsharp}"
## [116] "{cyberpunk,epic,explore,neojapanese,razorsharp}"
## [117] "{cyberpunk,enter,epic,inspired,neojapanese,razorsharp}"
## [118] "{cyberpunk,enter,epic,explore,inspired,neojapanese}"
## [119] "{cyberpunk,epic}"
## [120] "{cyberpunk,epic,explore,inspired,neojapanese,razorsharp}"
## [121] "{cyberpunk,inspired}"
## [122] "{cyberpunk,enter,explore,neojapanese,razorsharp}"
## [123] "{cyberpunk,enter,explore,inspired,neojapanese,razorsharp}"
## [124] "{cyberpunk,enter,epic,explore,neojapanese,razorsharp}"
## [125] "{epic,future}"
## [126] "{cyberpunk,future}"
## [127] "{future,inspired}"
## [128] "{future,razorsharp}"
## [129] "{explore,future}"
## [130] "{enter,future}"
## [131] "{epic,future,inspired}"
## [132] "{future,neojapanese}"
## [133] "{epic,inspired}"
## [134] "{neojapanese}"
## [135] "{cyberpunk,enter,epic,explore,inspired,neojapanese,razorsharp}"
## [136] "{enter,epic}"
## [137] "{enter,epic,future}"
## [138] "{inspired,razorsharp}"
## [139] "{future,inspired,razorsharp}"
## [140] "{epic,future,razorsharp}"
## [141] "{epic,explore}"
## [142] "{epic,explore,future}"
## [143] "{explore,inspired}"
## [144] "{enter,future,inspired}"
## [145] "{enter,inspired}"
## [146] "{explore,future,inspired}"
## [147] "{epic,neojapanese}"
## [148] "{explore}"
## [149] "{epic,future,neojapanese}"
## [150] "{razorsharp}"
## [151] "{inspired,neojapanese}"
## [152] "{future,inspired,neojapanese}"
## [153] "{enter,razorsharp}"
## [154] "{enter}"
## [155] "{explore,razorsharp}"
## [156] "{enter,future,razorsharp}"
## [157] "{enter,explore}"
## [158] "{explore,future,razorsharp}"
## [159] "{enter,explore,future}"
## [160] "{epic,razorsharp}"
## [161] "{neojapanese,razorsharp}"
## [162] "{explore,neojapanese}"
## [163] "{future,neojapanese,razorsharp}"
## [164] "{enter,neojapanese}"
## [165] "{enter,future,neojapanese}"
## [166] "{explore,future,neojapanese}"
## [167] "{enter,epic,inspired}"
## [168] "{epic,inspired,razorsharp}"
## [169] "{epic,explore,inspired}"
## [170] "{enter,epic,future,inspired}"
## [171] "{epic,inspired,neojapanese}"
## [172] "{epic,future,inspired,razorsharp}"
## [173] "{epic,explore,future,inspired}"
## [174] "{enter,inspired,razorsharp}"
## [175] "{epic,future,inspired,neojapanese}"
## [176] "{enter,epic,razorsharp}"
## [177] "{enter,epic,explore}"
## [178] "{explore,inspired,razorsharp}"
## [179] "{enter,explore,inspired}"
## [180] "{epic,explore,razorsharp}"
## [181] "{enter,epic,neojapanese}"
## [182] "{inspired,neojapanese,razorsharp}"
## [183] "{epic,neojapanese,razorsharp}"
## [184] "{epic,explore,neojapanese}"
## [185] "{explore,inspired,neojapanese}"
## [186] "{enter,future,inspired,razorsharp}"
## [187] "{enter,inspired,neojapanese}"
## [188] "{enter,epic,future,razorsharp}"
## [189] "{enter,epic,explore,future}"
## [190] "{explore,future,inspired,razorsharp}"
## [191] "{enter,explore,future,inspired}"
## [192] "{epic,explore,future,razorsharp}"
## [193] "{epic}"
## [194] "{enter,explore,razorsharp}"
## [195] "{inspired}"
## [196] "{enter,epic,future,neojapanese}"
## [197] "{future,inspired,neojapanese,razorsharp}"
## [198] "{epic,future,neojapanese,razorsharp}"
## [199] "{enter,neojapanese,razorsharp}"
## [200] "{epic,explore,future,neojapanese}"
## [201] "{enter,future,inspired,neojapanese}"
## [202] "{explore,future,inspired,neojapanese}"
## [203] "{explore,neojapanese,razorsharp}"
## [204] "{enter,explore,neojapanese}"
## [205] "{enter,explore,future,razorsharp}"
## [206] "{enter,future,neojapanese,razorsharp}"
## [207] "{explore,future,neojapanese,razorsharp}"
## [208] "{enter,explore,future,neojapanese}"
## [209] "{enter,epic,inspired,razorsharp}"
## [210] "{enter,epic,explore,inspired}"
## [211] "{epic,explore,inspired,razorsharp}"
## [212] "{enter,epic,inspired,neojapanese}"
## [213] "{epic,inspired,neojapanese,razorsharp}"
## [214] "{epic,explore,inspired,neojapanese}"
## [215] "{enter,explore,inspired,razorsharp}"
## [216] "{enter,epic,explore,razorsharp}"
## [217] "{enter,inspired,neojapanese,razorsharp}"
## [218] "{enter,epic,neojapanese,razorsharp}"
## [219] "{enter,epic,explore,neojapanese}"
## [220] "{explore,inspired,neojapanese,razorsharp}"
## [221] "{enter,epic,future,inspired,razorsharp}"
## [222] "{enter,explore,inspired,neojapanese}"
## [223] "{epic,explore,neojapanese,razorsharp}"
## [224] "{enter,epic,explore,future,inspired}"
## [225] "{epic,explore,future,inspired,razorsharp}"
## [226] "{enter,epic,future,inspired,neojapanese}"
## [227] "{epic,future,inspired,neojapanese,razorsharp}"
## [228] "{epic,explore,future,inspired,neojapanese}"
## [229] "{enter,explore,neojapanese,razorsharp}"
## [230] "{enter,explore,future,inspired,razorsharp}"
## [231] "{enter,epic,explore,future,razorsharp}"
## [232] "{enter,future,inspired,neojapanese,razorsharp}"
## [233] "{enter,epic,future,neojapanese,razorsharp}"
## [234] "{future}"
## [235] "{enter,epic,explore,future,neojapanese}"
## [236] "{explore,future,inspired,neojapanese,razorsharp}"
## [237] "{enter,explore,future,inspired,neojapanese}"
## [238] "{epic,explore,future,neojapanese,razorsharp}"
## [239] "{enter,explore,future,neojapanese,razorsharp}"
## [240] "{enter,epic,explore,inspired,razorsharp}"
## [241] "{enter,epic,inspired,neojapanese,razorsharp}"
## [242] "{enter,epic,explore,inspired,neojapanese}"
## [243] "{epic,explore,inspired,neojapanese,razorsharp}"
## [244] "{enter,explore,inspired,neojapanese,razorsharp}"
## [245] "{enter,epic,explore,neojapanese,razorsharp}"
## [246] "{enter,epic,explore,future,inspired,razorsharp}"
## [247] "{enter,epic,future,inspired,neojapanese,razorsharp}"
## [248] "{enter,epic,explore,future,inspired,neojapanese}"
## [249] "{epic,explore,future,inspired,neojapanese,razorsharp}"
## [250] "{enter,explore,future,inspired,neojapanese,razorsharp}"
## [251] "{enter,epic,explore,future,neojapanese,razorsharp}"
## [252] "{cyberpunk,night}"
## [253] "{city,cyberpunk}"
## [254] "{enter,epic,explore,inspired,neojapanese,razorsharp}"
## [255] "{cyberpunk,red}"
## [256] "{cyberpunk,projekt}"
## [257] "{city}"
## [258] "{night}"
## [259] "{projekt}"
## [260] "{red}"
## [261] "{vgpunite}"
## [262] "{virtualphotography}"
## [263] "{cyberpunkgame}"
## [264] "{enter,epic,explore,future,inspired,neojapanese,razorsharp}"
## [265] "{cyberpunkphotomode}"
## [266] "{psshare}"
## [267] "{photomode}"
## [268] "{via}"
## [269] "{gaming}"
## [270] "{live}"
## [271] "{patch}"
## [272] "{cdprojektred}"
## [273] "{city,night}"
## [274] "{time}"
## [275] "{now}"
## [276] "{new}"
## [277] "{game}"
## [278] "{just}"
## [279] "{like}"
## [280] "{projekt,red}"
## Itemsets in Consequent (RHS)
## [1] "{cyberpunk}" "{cdprojektred}" "{vgpunite}"
## [4] "{virtualphotography}" "{red}" "{projekt}"
## [7] "{night}" "{city}" "{future}"
## [10] "{inspired}" "{epic}" "{enter}"
## [13] "{razorsharp}" "{explore}" "{neojapanese}"
plot(rules, measure=c("support","lift"), shading="confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
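As the message above suggests, the jitter added to reduce overplotting can be disabled; a minimal variant of the same scatter plot, assuming the same rules object, would be:
# Same support/lift scatter plot as above, but with jittering turned off,
# as suggested by the arulesViz message.
plot(rules, measure=c("support","lift"), shading="confidence", jitter=0)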
plot(rules, method="grouped")
Let us look at the graph representation of the rules for the previously selected keywords. It makes it much easier to observe how the words are connected. Note that the ggplot2 engine used by arulesViz does not recognize the main argument, which is why the warning below appears; the titles in the calls are kept only for readability.
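For reference, a minimal sketch of how such a keyword-specific subset can be built from the full rule set (illustration only; the actual rules.xbox, rules.sony, rules.witcher, rules.love and rules.hate objects were created earlier in the project, and rules.xbox.sketch below is just a placeholder name):
# Keep only rules whose antecedent or consequent contains the keyword "xbox"
# (partial item matching with %pin%), then order them by lift and preview
# the strongest ones.
rules.xbox.sketch <- subset(rules, subset = lhs %pin% "xbox" | rhs %pin% "xbox")
rules.xbox.sketch <- sort(rules.xbox.sketch, by = "lift")
arules::inspect(head(rules.xbox.sketch, 5))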
plot(rules.xbox, method="graph",main="xbox")
## Warning: Unknown control parameters: main
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
plot(rules.sony, method="graph",main="sony")
plot(rules.witcher, method="graph",main="witcher")
plot(rules.love, method="graph",main="love")
plot(rules.hate, method="graph",main="hate")
plot(rules.xbox, method="paracoord", control=list(reorder=TRUE),main="xbox")
plot(rules.sony, method="paracoord", control=list(reorder=TRUE),main="sony")
plot(rules.witcher, method="paracoord", control=list(reorder=TRUE),main="witcher")
plot(rules.love, method="paracoord", control=list(reorder=TRUE),main="love")
plot(rules.hate, method="paracoord", control=list(reorder=TRUE),main="hate")
Now let us turn to the ECLAT algorithm and the rules induced from its frequent itemsets. As we will see, the outcome is quite similar to the previous analysis; a short check of this overlap is sketched after the first rule induction below.
freq.items<-eclat(trans, parameter=list(supp=0.02, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.02 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 911
##
## create itemset ...
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating sparse bit matrix ... [44 row(s), 45559 column(s)] done [0.00s].
## writing ... [320 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
freq.rules<-ruleInduction(freq.items, trans, confidence=0.5)
freq.rules
## set of 1038 rules
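To back the claim above that ECLAT leads to a similar outcome, one rough check (a sketch, assuming both the rules object from the earlier Apriori run and freq.rules are still in the workspace) is to count how many rule labels the two sets have in common:
# Compare the ECLAT-induced rules with the Apriori rules by their labels
# (character strings of the form "{lhs} => {rhs}"); the number of shared
# labels gives a rough measure of the overlap between the two rule sets.
shared <- intersect(labels(freq.rules), labels(rules))
length(shared)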
freq.items<-eclat(trans, parameter=list(supp=0.03, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 1366
##
## create itemset ...
## set transactions ...[1123 item(s), 45559 transaction(s)] done [0.05s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating sparse bit matrix ... [23 row(s), 45559 column(s)] done [0.00s].
## writing ... [37 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
arules::inspect(freq.items)
## items support count
## [1] {cyberpunk, gaming} 0.03033 1382
## [2] {cyberpunk, photomode} 0.03448 1571
## [3] {cyberpunk, vgpunite} 0.03106 1415
## [4] {cyberpunk, via} 0.03584 1633
## [5] {city, cyberpunk} 0.03104 1414
## [6] {city, night} 0.03222 1468
## [7] {cyberpunk, psshare} 0.05555 2531
## [8] {cyberpunk, night} 0.03288 1498
## [9] {projekt, red} 0.04486 2044
## [10] {cyberpunk, cyberpunkgame} 0.03729 1699
## [11] {cyberpunk, virtualphotography} 0.05408 2464
## [12] {cyberpunk, time} 0.03494 1592
## [13] {cdprojektred, cyberpunk} 0.07636 3479
## [14] {cyberpunk, game} 0.05641 2570
## [15] {cyberpunk} 0.67120 30579
## [16] {game} 0.10185 4640
## [17] {cdpr} 0.10808 4924
## [18] {cdprojektred} 0.09487 4322
## [19] {red} 0.05823 2653
## [20] {time} 0.04612 2101
## [21] {virtualphotography} 0.06032 2748
## [22] {just} 0.05261 2397
## [23] {cyberpunkgame} 0.06016 2741
## [24] {projekt} 0.04877 2222
## [25] {like} 0.04579 2086
## [26] {new} 0.04047 1844
## [27] {night} 0.04142 1887
## [28] {psshare} 0.05707 2600
## [29] {now} 0.03821 1741
## [30] {city} 0.03900 1777
## [31] {via} 0.04296 1957
## [32] {vgpunite} 0.03501 1595
## [33] {one} 0.03402 1550
## [34] {photomode} 0.03622 1650
## [35] {good} 0.03033 1382
## [36] {gaming} 0.03646 1661
## [37] {patch} 0.03090 1408
freq.rules<-ruleInduction(freq.items, trans, confidence=0.1)
freq.rules
## set of 17 rules
arules::inspect(freq.rules)
## lhs rhs support confidence lift itemset
## [1] {gaming} => {cyberpunk} 0.03033 0.8320 1.2396 1
## [2] {photomode} => {cyberpunk} 0.03448 0.9521 1.4185 2
## [3] {vgpunite} => {cyberpunk} 0.03106 0.8871 1.3217 3
## [4] {via} => {cyberpunk} 0.03584 0.8344 1.2432 4
## [5] {city} => {cyberpunk} 0.03104 0.7957 1.1855 5
## [6] {night} => {city} 0.03222 0.7780 19.9453 6
## [7] {city} => {night} 0.03222 0.8261 19.9453 6
## [8] {psshare} => {cyberpunk} 0.05555 0.9735 1.4503 7
## [9] {night} => {cyberpunk} 0.03288 0.7939 1.1827 8
## [10] {red} => {projekt} 0.04486 0.7704 15.7970 9
## [11] {projekt} => {red} 0.04486 0.9199 15.7970 9
## [12] {cyberpunkgame} => {cyberpunk} 0.03729 0.6198 0.9235 10
## [13] {virtualphotography} => {cyberpunk} 0.05408 0.8967 1.3359 11
## [14] {time} => {cyberpunk} 0.03494 0.7577 1.1289 12
## [15] {cyberpunk} => {cdprojektred} 0.07636 0.1138 1.1993 13
## [16] {cdprojektred} => {cyberpunk} 0.07636 0.8050 1.1993 13
## [17] {game} => {cyberpunk} 0.05641 0.5539 0.8252 14
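As a quick sanity check of the measures reported above, lift can be recovered from the printed supports. For the rule {night} => {city}, confidence is supp({city,night}) divided by supp({night}), and lift is that confidence divided by supp({city}):
# Recompute confidence and lift for {night} => {city} from the supports
# printed above; the results should match the table (about 0.778 and 19.9).
conf_night_city <- 0.03222 / 0.04142           # supp({city,night}) / supp({night})
lift_night_city <- conf_night_city / 0.03900   # divided by supp({city})
c(confidence = conf_night_city, lift = lift_night_city)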
To conclude, this project covered three kinds of analyses using Unsupervised Learning algorithms: first clustering, then dimensionality reduction and, last but not least, association rules. These tools are powerful for mining information from unlabeled data such as tweets. Based on such analyses one could target specific groups of people or choose the right words to get the best response. I think the project fulfilled its goal of presenting Unsupervised Learning algorithms on real data and showing how we can take advantage of them to gain additional knowledge.
List of emojis downloaded from: https://unicode.org/emoji/charts/full-emoji-list.html
Sentiment of emojis: https://www.clarin.si/repository/xmlui/handle/11356/1048
Data from Twitter: https://github.com/sunnyline99/Analiza-sentymentu-CDPR (own work on project)