Read Data and Create State Sequence Object

library(TraMineR)

## 
## TraMineR stable version 2.0-11.1 (Built: 2019-04-24)

## Website: http://traminer.unige.ch

## Please type 'citation("TraMineR")' for citation information.

cafe <- read.csv(file = "cafeData.csv", header = TRUE)
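# Print the first 3 observations. The "ï.." prefix on the first column name is a
# UTF-8 byte-order mark in the CSV; read.csv(..., fileEncoding = "UTF-8-BOM") avoids it.
head(cafe, 3)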

##   ï..Name Reading.1 Reading.2 Reading.3 Reading.4 Reading.5 Reading.6
## 1     Tim         5         5         5         5         5         5
## 2     Tim         3         3         3         3         3         3
## 3     Tim         1         1         1         1         1         1
##   Reading.7 Reading.8 Reading.9 Reading.10 Reading.11 Reading.12
## 1         5         5         5          5          5          5
## 2         3         3         3          3          3          3
## 3         1         1         1          1          1          1
##   Reading.13 Reading.14 Reading.15 Reading.16 Reading.17 Reading.18
## 1          5          5          7          7          7          7
## 2          3          3          3          3          3          3
## 3          1          1          1          1          1          1
##   Reading.19 Reading.20 Reading.21 Reading.22 Reading.23 Reading.24
## 1          7          7          7          7          7          7
## 2          3          3          3          3          3          3
## 3          1          1          1          5          5          5
##   Reading.25 Reading.26 Reading.27 Reading.28 Reading.29 Reading.30
## 1          7          7          7          5          5          5
## 2          3          3          3          3          3          3
## 3          5          5          5          5          5          5

# Dimensions of the data set (this example uses 10 observations from each of 2 customers)
dim(cafe)

## [1] 20 31

#Create labels for the sequence object
cafe.seq.labels <- c("BookShelf", "PS4", "Counter", "Side1", "Side2", "Business", "Corner")
# Create the sequence object. The states are in columns 2 to 31; column 1 holds the customer's name.
cafe.seq <- seqdef(cafe, var = 2:31, labels = cafe.seq.labels)

##  [>] 7 distinct states appear in the data:
##      1 = 1
##      2 = 2
##      3 = 3
##      4 = 4
##      5 = 5
##      6 = 6
##      7 = 7
##  [>] state coding:
##        [alphabet]  [label]  [long label]
##      1  1           1        BookShelf
##      2  2           2        PS4
##      3  3           3        Counter
##      4  4           4        Side1
##      5  5           5        Side2
##      6  6           6        Business
##      7  7           7        Corner
##  [>] 20 sequences in the data set
##  [>] min/max sequence length: 30/30

Plotting State Frequency and Distribution plot

# Plot the first 10 sequences
seqiplot(cafe.seq, main = "Index plot (first 10 sequences)",
         with.legend = "right")

# The index plot shows that the 10th sequence (reading) spent the whole period at side table 1.

# State distribution
seqdplot(cafe.seq, main = "State distribution plot", with.legend = "right")

# The plot above shows how the distribution of states changes over time (as we take more readings)
seqfplot(cafe.seq, main = "Sequence frequency plot", with.legend = "right",
         pbarw = TRUE)

# This plots the 10 most frequent sequences, i.e. what customers do most often in the
# coffee shop. With only 20 observations, however, it is not very informative.

Transition Rates

# Compute transition probabilities
tr <- seqtrate(cafe.seq)

##  [>] computing transition probabilities for states 1/2/3/4/5/6/7 ...

#Print the table
round(tr, 2)

##        [-> 1] [-> 2] [-> 3] [-> 4] [-> 5] [-> 6] [-> 7]
## [1 ->]   0.91   0.00   0.00   0.03   0.02   0.02   0.02
## [2 ->]   0.00   0.97   0.01   0.00   0.00   0.01   0.01
## [3 ->]   0.03   0.03   0.92   0.00   0.01   0.01   0.00
## [4 ->]   0.04   0.00   0.00   0.96   0.00   0.00   0.00
## [5 ->]   0.00   0.01   0.01   0.01   0.94   0.00   0.02
## [6 ->]   0.00   0.02   0.02   0.00   0.00   0.97   0.00
## [7 ->]   0.02   0.02   0.00   0.00   0.03   0.00   0.93

# seqtrate() estimates the probability of moving from each state to each of the 7 states between consecutive readings. For instance, the probability of going from state 1 to state 1 (staying in state 1) is 0.91. The diagonal is large for every state (0.91 to 0.97), so a customer in any zone is very likely to still be there at the next reading.

# This is useful if, for example, we want to know which zone a customer is most likely to move to after picking up a book from the shelf: a side table, the business table, the corner, or a seat near the window? In this table, the most likely move out of state 1 (BookShelf) is to state 4 (Side1), with probability 0.03.
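
To make the table above concrete, here is a minimal base-R sketch of how such rates can be estimated: count every move from one reading to the next, then divide each row by its total. The helper name `transition_rates` and the toy matrix are illustrative assumptions, not part of TraMineR, and `seqtrate()` may differ in its details.

```r
# Hypothetical helper (not a TraMineR function): estimate first-order
# transition rates from an integer matrix with one row per sequence.
transition_rates <- function(seqs, n_states) {
  from <- as.vector(seqs[, -ncol(seqs), drop = FALSE])  # state at reading t
  to   <- as.vector(seqs[, -1, drop = FALSE])           # state at reading t + 1
  counts <- table(factor(from, levels = seq_len(n_states)),
                  factor(to,   levels = seq_len(n_states)))
  # Divide each row by its total; a state that never occurs before the
  # last reading has a zero row total and yields NaN.
  prop.table(counts, margin = 1)
}

# Toy data (made up): two customers, five readings each, two zones
toy <- rbind(c(1, 1, 1, 2, 2),
             c(2, 2, 1, 1, 1))
toy_tr <- transition_rates(toy, n_states = 2)
round(toy_tr, 2)
```

Each row of the result sums to 1, which is why the rows of the rounded table above add up to roughly 1 as well.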

Mean Time in Each State (zone) by Customers

seqmtplot(cafe.seq, group = cafe$ï..Name, main = "Mean time")

# The same plot can be grouped by gender to get insights for each gender, for example whether males spend more time than females near the PS4 (gaming) zone.
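
As a rough illustration of what a grouped mean-time summary computes, the sketch below averages per-sequence state counts within each group using base R only. The count matrix and the grouping vector are made-up toy values; any grouping variable (customer name, gender) could take the place of `toy_group`.

```r
# Toy per-sequence state counts (rows = sequences, columns = zones).
# These numbers are invented for illustration only.
toy_counts <- rbind(c(10, 20),
                    c(30,  0),
                    c( 0, 30))
toy_group <- c("Tim", "Tim", "Mark")  # hypothetical grouping variable

# Mean number of readings spent in each zone, computed within each group
mean_time <- apply(toy_counts, 2, function(zone) tapply(zone, toy_group, mean))
mean_time
```

Rows of `mean_time` are groups and columns are zones, so each row can be read like one panel of the mean-time plot.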

Find Subsequences in a Sequence

# Let's say we want to find out whether anyone spent a significant amount of time near the bookshelf. Since the bookshelf is zone 1, we search each sequence for the pattern "11111" (five consecutive readings in zone 1).
seqpm(cafe.seq, "11111")

##  [>] pattern 11111 has been found in 8 sequences

## $MTab
##   pattern nbocc
## 1   11111     8
## 
## $MIndex
## [1]  3  4  6  7  9 15 16 20

a <- seqpm(cafe.seq, "11111")

##  [>] pattern 11111 has been found in 8 sequences

# The pattern appears in 8 of the 20 sequences. MIndex holds their row indices, which we use to retrieve the matching sequences.
cafe.seq[a$MIndex, ]

##    Sequence                                                   
## 3  1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-5-5-5-5-5-5-5-5-5
## 4  3-3-3-3-3-1-1-1-1-1-1-1-1-1-1-1-1-1-1-4-4-4-4-4-1-1-1-1-1-1
## 6  1-1-1-1-1-1-1-1-1-1-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-1-1-1
## 7  1-1-1-1-1-1-1-1-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-1-1-1
## 9  1-1-1-1-1-1-1-1-1-1-1-7-7-7-7-7-7-7-7-7-7-7-7-7-7-7-7-7-7-7
## 15 2-2-2-2-2-2-2-2-3-3-3-3-3-3-3-1-1-1-1-1-7-7-7-7-7-7-7-2-2-2
## 16 1-1-1-1-1-1-1-1-1-1-5-5-5-5-5-5-5-5-5-5-5-5-5-2-2-2-2-2-2-2
## 20 1-1-1-1-1-1-1-1-1-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-6

# This is one way to trace a customer's behavior: check whether the subsequence that characterizes a behavior is present in his observations.
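
The same kind of search can be sketched in base R by collapsing each sequence into a string and looking for the pattern as a fixed substring. This is only an illustration on made-up toy sequences, and it assumes single-character state codes (with 10 or more states, collapsing without a separator would be ambiguous).

```r
# Toy sequences (made up): one row per customer, one column per reading
toy_seqs <- rbind(c(1, 1, 1, 1, 1, 5, 5),
                  c(3, 3, 1, 1, 4, 4, 1),
                  c(2, 2, 1, 1, 1, 1, 1))

# Collapse each row into a single string such as "1111155"
collapsed <- apply(toy_seqs, 1, paste, collapse = "")

# Row indices of the sequences that contain five consecutive 1s
hits <- which(grepl("11111", collapsed, fixed = TRUE))
hits
toy_seqs[hits, , drop = FALSE]
```

The second toy sequence contains five 1s in total but never five in a row, so it is not matched.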

Tabulate Each Observation Based on Time Spent in Each State

seqistatd(cafe.seq)

##  [>] computing state distribution for 20 sequences ...

##     1  2  3  4  5  6  7
## 1   0  0  0  0 17  0 13
## 2   0  0 30  0  0  0  0
## 3  21  0  0  0  9  0  0
## 4  20  0  5  5  0  0  0
## 5   0  0  0 11 19  0  0
## 6  13  0  0 17  0  0  0
## 7  11  0  0 19  0  0  0
## 8   0  0 22  0  8  0  0
## 9  11  0  0  0  0  0 19
## 10  0  0  0 30  0  0  0
## 11  0 13  5  0  0 12  0
## 12  0 30  0  0  0  0  0
## 13  0  0  0  0 17  0 13
## 14  0 30  0  0  0  0  0
## 15  5 11  7  0  0  0  7
## 16 10  7  0  0 13  0  0
## 17  0  3  4  0  0 23  0
## 18  4  8  5  0  0  5  8
## 19  0 22  0  0  0  8  0
## 20  9  0  0  0  0 21  0

# In the first observation, the customer spent 17 of the 30 readings in zone 5 and 13 in zone 7, so he never visited any other zone.
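
For readers who want to see what this tabulation amounts to, here is a minimal base-R sketch that counts readings per zone for the first and third sequences shown earlier, rebuilt by hand with `rep()`. The object names are illustrative and the code does not use TraMineR.

```r
# Sequences 1 and 3 from the data, rebuilt by hand (30 readings each)
s1 <- c(rep(5, 14), rep(7, 13), rep(5, 3))
s3 <- c(rep(1, 21), rep(5, 9))
two_seqs <- rbind(s1, s3)

# Count how many readings each sequence spends in each of the 7 zones
state_counts <- t(apply(two_seqs, 1, tabulate, nbins = 7))
colnames(state_counts) <- 1:7
state_counts

# Average time per zone across these sequences
colMeans(state_counts)
```

The two rows match rows 1 and 3 of the table above, and every row sums to the sequence length of 30.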

Calculate Mean Time Spent in Each State For All Customers

statd <- seqistatd(cafe.seq)

##  [>] computing state distribution for 20 sequences ...

apply(statd, 2, mean)

##    1    2    3    4    5    6    7 
## 5.20 6.20 3.90 4.10 4.15 3.45 3.00

# Based on the current data, zone 2 is the most popular, with customers spending on average 6.2 of 30 readings there, and zone 7 is the least popular (3.0 of 30).

Calculate Entropy for Sequences

cafe.ient <- seqient(cafe.seq)

##  [>] computing entropy for 20 sequences ...

##  [>] computing state distribution for 20 sequences ...

head(cafe.ient)

##     Entropy
## 1 0.3516256
## 2 0.0000000
## 3 0.3139222
## 4 0.4458393
## 5 0.3377123
## 6 0.3516256

# Within-sequence entropy measures how evenly a person's time is spread across the states, so low entropy implies stability: the person tends to stay in the same zone. In the 2nd observation the entropy is 0 because the reading is "333.....333", i.e. the person never left zone 3.
boxplot(cafe.ient ~ cafe$ï..Name, data = cafe.seq, xlab = "Person", ylab = "Sequences entropy", col = "cyan")

# The boxplot shows that Tim is more stable (lower entropy) than Mark. Entropy can be compared across genders in the same way.
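
The entropy values can be checked by hand. The sketch below computes the Shannon entropy of a sequence's state distribution and divides it by log(7), the maximum for 7 states. That normalization is an assumption about how `seqient()` scales its result, although it appears consistent with the values printed above. The function name `normalized_entropy` is illustrative.

```r
# Hypothetical helper: normalized Shannon entropy of a vector of state counts
normalized_entropy <- function(counts, n_states = length(counts)) {
  p <- counts / sum(counts)
  p <- p[p > 0]                     # states never visited contribute nothing
  -sum(p * log(p)) / log(n_states)  # divide by the maximum possible entropy
}

# State counts of observations 1 and 2, taken from the seqistatd() table
e1 <- normalized_entropy(c(0, 0, 0, 0, 17, 0, 13))
e2 <- normalized_entropy(c(0, 0, 30, 0, 0, 0, 0))
c(e1, e2)
```

Under this assumption the first value is about 0.3516 and the second is 0, in line with the first two rows of the entropy output. Note that the measure depends only on how much time is spent in each state, not on the order of the states or the number of moves.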

Calculate Similarity and Distances Between Sequences (Optimal Matching)

# Optimal matching (OM) is based on the Levenshtein distance (edit distance)
couts <- seqsubm(cafe.seq, method = "TRATE")

##  [>] creating substitution-cost matrix using transition rates ...

##  [>] computing transition probabilities for states 1/2/3/4/5/6/7 ...

# The substitution-cost matrix above is derived from the transition probabilities; it is passed to seqdist() as sm
cafe.OM <- seqdist(cafe.seq, method = "OM", sm = couts)

##  [>] 20 sequences with 7 distinct states
##  [>] checking 'sm' (one value for each state, triangle inequality)
##  [>] 18 distinct sequences
##  [>] min/max sequence length: 30/30
##  [>] computing distances using the OM metric
##  [>] elapsed time: 0.01 secs

# Let's find the distance between the 4th observation and the 3rd one
cafe.OM[4, 3]

## [1] 19.9328

# So the cost of transforming the 4th sequence into the 3rd is about 19.93 (roughly 20).
# We can use this to measure how close a person's behavior is to a predefined set of behaviors: the smaller the distance, the more similar he is to that behavior.
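
To show where a number like this comes from, here is a minimal base-R sketch of an edit distance with insertion/deletion and substitution costs, computed by dynamic programming. It is not TraMineR's implementation. The helper names `trate_costs` and `om_distance` are illustrative, the indel cost of 1 is an assumption about the `seqdist()` default, and the formula 2 - p(i|j) - p(j|i) is how the TraMineR documentation describes the "TRATE" costs as far as I know.

```r
# Hypothetical helper: substitution costs from a transition-rate matrix,
# using cost(i, j) = 2 - p(i -> j) - p(j -> i) and zero on the diagonal
trate_costs <- function(tr) {
  sm <- 2 - tr - t(tr)
  diag(sm) <- 0
  sm
}

# Hypothetical helper: edit distance between two integer-coded sequences
om_distance <- function(a, b, sm, indel = 1) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel  # delete every element of a
  d[1, ] <- (0:m) * indel  # insert every element of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(d[i, j + 1] + indel,        # deletion
                             d[i + 1, j] + indel,        # insertion
                             d[i, j] + sm[a[i], b[j]])   # substitution or match
    }
  }
  d[n + 1, m + 1]
}

# Sequences 3 and 4 from the output above, rebuilt by hand
s3 <- c(rep(1, 21), rep(5, 9))
s4 <- c(rep(3, 5), rep(1, 14), rep(4, 5), rep(1, 6))

# Simplification: a constant substitution cost of 2 instead of TRATE costs
const_sm <- matrix(2, 7, 7)
diag(const_sm) <- 0
om_distance(s3, s4, const_sm)

# TRATE-style costs for a made-up 2-state transition matrix
toy_tr <- matrix(c(0.9, 0.1,
                   0.2, 0.8), nrow = 2, byrow = TRUE)
toy_sm <- trate_costs(toy_tr)
toy_sm
```

With the constant cost the distance between sequences 3 and 4 is exactly 20. Transition-based substitution costs are slightly below 2 whenever the two states have non-zero transition rates between them, which would presumably account for the reported 19.93.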

Sequence Data Mining for Coffee Shop

Sakib Shahriar

April 30, 2019
