Co-Occurrences & Association:
Data Science in Context Presentation

Data 607

Joseph Simone

11/11/19

Background

\(Co\)-\(Occurrence\) \(Grouping\) \(and\) \(Association\) \(Discovery\) attempts to find associations between entities based on transactions involving them.

Why Find This?

To see whether these Co-Occurrences & Associations capture true consumer preferences. If they do, acting on them may increase revenue from cross-selling.

Co-Occurrence

Co-Occurrence is simply a search through the data for combinations of items whose statistics are “interesting.”

“If A occurs then B is likely to occur as well”
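A minimal sketch of such a search in base R, assuming a small hypothetical list of baskets (the `baskets` object and its item names are illustrative and not part of the presentation). It counts how often each pair of items appears in the same transaction:

```r
# Hypothetical toy transactions; item names are illustrative only.
baskets <- list(
  c("beer", "lottery", "chips"),
  c("beer", "lottery"),
  c("milk", "bread"),
  c("beer", "chips"),
  c("lottery", "bread")
)

# Build a transaction-by-item indicator matrix (1 = item is in the basket).
items <- sort(unique(unlist(baskets)))
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

# crossprod(X) = t(X) %*% X: entry [i, j] counts baskets containing both i and j.
# The diagonal holds each item's own transaction count.
co_counts <- crossprod(incidence)
co_counts["beer", "lottery"]
```

In this toy data, beer and lottery share two of the five baskets. Raw counts alone do not say whether that is "interesting"; that judgment needs a measure that accounts for how common each item is on its own.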

Association

Association Discovery is commonly referred to as Market Basket Analysis (MBA).

MBA helps us to understand which items are likely to be purchased together.

Complexity Control

Support of Association

Strength of the Rule

Lift and Leverage: surprise measures

Measuring Surprise

\(Lift(A,B)=\frac{p(A,B)}{p(A)*p(B)}\)

\(Leverage(A,B)= p(A,B) - p(A)*p(B)\)
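The two definitions above translate directly into R. This is a sketch only; the function names `lift` and `leverage` and their argument names are chosen here for illustration:

```r
# Lift: ratio of the observed joint probability to the joint probability
# expected if A and B were independent. Values above 1 suggest association.
lift <- function(p_ab, p_a, p_b) p_ab / (p_a * p_b)

# Leverage: difference between the observed and the expected joint probability.
# Values above 0 suggest association.
leverage <- function(p_ab, p_a, p_b) p_ab - p_a * p_b

# Under exact independence, lift is 1 and leverage is 0.
lift(0.25, 0.5, 0.5)
leverage(0.25, 0.5, 0.5)
```

Lift is a ratio, so it can look dramatic for rare items; leverage is a difference, so it stays small unless the items are reasonably common. Reporting both gives a more balanced picture.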

Beer and Lottery Tickets

We operate a small convenience store where people buy groceries, liquor, lottery tickets, etc.

We estimate that:

30% of all transactions involve beer.

40% of all transactions involve lottery tickets.

20% of all the transactions include both beer and lottery tickets.

We assume:

“Customers who buy beer are also likely to buy lottery tickets”

\(p(beer) = 0.3\)
\(p(lotterytickets) = 0.4\)

beer = 0.3
lottery_tickets = 0.4
blt  = beer * lottery_tickets
blt
## [1] 0.12

\(p(beer)*p(lotterytickets)=0.12\)

What is the Lift?

As mentioned before, 20% of the transactions involve both.

Therefore, \(p(lotterytickets, beer) = 0.2\)

\(Lift(A,B)=\frac{p(A,B)}{p(A)*p(B)}\)

pbl = 0.2
bl_lift = pbl / blt
bl_lift
## [1] 1.666667

This means that buying lottery tickets and beer together is about \(1\frac{2}{3}\) times as likely as we would expect if the two purchases were independent.

What is the Leverage?

\(Leverage(A,B)= p(A,B) - p(A)*p(B)\)

\(p(lotterytickets,beer)-p(lotterytickets)*p(beer)\)

bl_leverage = pbl - blt
bl_leverage
## [1] 0.08

Whatever is driving the Co-Occurrence results in an 8 percentage point increase in the probability of buying both together, over what we would expect simply because both are popular items.

Furthermore

When talking about Co-Occurrences and Associations, we can also calculate two other measures:

The Support of the Association is just the prevalence in the data of buying both items together

or
\(p(lotterytickets, beer)\)

pbl
## [1] 0.2

\(Support =\) 20%

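The Strength is the Conditional Probability, \(p(lotterytickets \mid beer)\)

or
\(\frac{p(lotterytickets, beer)}{p(beer)}\)

# strength of the rule "beer -> lottery tickets": joint probability divided by p(beer)
bl_strength = pbl / beer
bl_strength
## [1] 0.6666667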

\(Strength =\) 67%
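The figures in this example start from estimated probabilities. As a sketch of where such estimates could come from, the following assumes ten hypothetical transactions constructed to match the 30% / 40% / 20% figures; the vectors `has_beer` and `has_lottery` are illustrative and are not real store data.

```r
# Ten hypothetical transactions built to match the stated percentages:
# beer in 3 of 10, lottery tickets in 4 of 10, both in 2 of 10.
has_beer    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
has_lottery <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# The mean of a logical vector is the share of TRUE values.
p_beer    <- mean(has_beer)                # share of transactions with beer
p_lottery <- mean(has_lottery)             # share with lottery tickets
p_both    <- mean(has_beer & has_lottery)  # share with both (the support)

measures <- c(
  support  = p_both,
  strength = p_both / p_beer,              # p(lottery tickets | beer)
  lift     = p_both / (p_beer * p_lottery),
  leverage = p_both - p_beer * p_lottery
)
round(measures, 4)
```

Strength is directional: dividing by `p_lottery` instead of `p_beer` would give the strength of the reverse rule, lottery tickets implying beer, which is 50% for these numbers.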

Questions or Comments

References

“12. Other Data Science Tasks and Techniques.” Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, by Foster Provost and Tom Fawcett, O’Reilly, 2013, pp. 292–295.