Imagine that you run a website or company that sells watches and laptops. Let's say we know the queries that users enter and whether or not they purchased either of these products. Here I have 20 users, each of whom enters a query looking for the product they are interested in. To keep the model simple, I assume that we have three features: “awesome”, “watch” and “laptop”. So if a user is looking for “awesome laptop”, the feature vector for this user is [1,0,1]; if she enters “laptop”, the feature vector changes to [0,0,1]. When a user enters “awesome laptop”, it means that she intends to buy a laptop, so “laptop” is an important word in this query, or at least more important than “awesome”. If we just look at the occurrence of the words, then “awesome” and “laptop” are both 1 in the feature vector for this user, so both get the same value.
How can we solve this problem automatically and learn the value of each feature? I suggested using logistic regression to learn these values. Here I show that the suggested model works.
In order to make predictions, we can treat the purchasing behavior of the users for each product as a logistic regression problem. So for each product we learn the parameter vector (theta) of the features. Theta shows the importance of each feature in purchasing that product.
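As a minimal sketch of how such a binary feature vector could be built from a query, the helper below splits the query on whitespace and checks which vocabulary words occur in it. The function name `query_to_features` and its `vocab` argument are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: map a query string to a 0/1 vector over a fixed vocabulary.
query_to_features = function(query, vocab = c("awesome", "watch", "laptop")) {
  # Split the lower-cased query into words on runs of whitespace
  words = strsplit(tolower(query), "\\s+")[[1]]
  # 1 if the vocabulary word occurs in the query, 0 otherwise
  features = as.integer(vocab %in% words)
  names(features) = vocab
  features
}

query_to_features("awesome laptop")  # awesome = 1, watch = 0, laptop = 1
query_to_features("laptop")          # awesome = 0, watch = 0, laptop = 1
```

Words outside the vocabulary (such as a brand name) are simply ignored by this sketch.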
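To make the role of theta concrete, here is a small sketch of how logistic regression turns a feature vector into a purchase probability: the intercept plus the weighted sum of the features is passed through the logistic function (`plogis` in base R). The numbers in `theta` and the helper `purchase_prob` are made up for illustration; they are not fitted values.

```r
# Illustrative (made-up) parameters: an intercept plus one weight per feature
theta = c(intercept = -2, awesome = 0.1, watch = -3, laptop = 4)

# Probability of purchase for a 0/1 feature vector x = (awesome, watch, laptop)
purchase_prob = function(x, theta) {
  z = theta[["intercept"]] + sum(theta[c("awesome", "watch", "laptop")] * x)
  plogis(z)  # logistic function 1 / (1 + exp(-z))
}

purchase_prob(c(1, 0, 1), theta)  # query "awesome laptop"
purchase_prob(c(1, 0, 0), theta)  # query "awesome"
```

With these assumed weights, the word “laptop” moves the probability far more than “awesome” does, which is the kind of difference the fitted models are meant to learn from data.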
Users = c("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8",
"user9", "user10", "user11", "user12", "user13", "user14", "user15", "user16",
"user17", "user18", "user19", "user20")
queries = c("awesome laptop", "laptop awesome", "small laptop", "awesome laptop",
"sony laptop", "acer laptop", "ac laptop", "good laptop", "awesome laptop",
"acer laptop", "awesome watch", "watch", "good watch", "watch awesom", "watch good",
"watch swatch", "red watch", "awesome watch", "awesome watch", "watch")
The feature vectors are extracted from the query of each user:
awesome = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1)
watch = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
laptop = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
# purchasing behavior of the users for each product
purchased_laptop = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0)
purchased_watch = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1)
df = data.frame(Users, queries, purchased_laptop, purchased_watch, awesome,
watch, laptop)
print(df)  # Take a look at the data set
## Users queries purchased_laptop purchased_watch awesome watch
## 1 user1 awesome laptop 1 0 1 0
## 2 user2 laptop awesome 1 0 1 0
## 3 user3 small laptop 1 0 0 0
## 4 user4 awesome laptop 1 0 1 0
## 5 user5 sony laptop 1 0 0 0
## 6 user6 acer laptop 1 0 0 0
## 7 user7 ac laptop 1 0 0 0
## 8 user8 good laptop 1 0 0 0
## 9 user9 awesome laptop 1 0 1 0
## 10 user10 acer laptop 1 0 0 0
## 11 user11 awesome watch 0 1 1 1
## 12 user12 watch 0 1 0 1
## 13 user13 good watch 0 0 0 1
## 14 user14 watch awesom 0 1 1 1
## 15 user15 watch good 0 1 0 1
## 16 user16 watch swatch 0 1 0 1
## 17 user17 red watch 0 1 0 1
## 18 user18 awesome watch 0 1 1 1
## 19 user19 awesome watch 0 1 1 1
## 20 user20 watch 0 1 1 1
## laptop
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 0
## 12 0
## 13 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 0
## 19 0
## 20 0
# Make the logistic regression model for laptop
p1 = glm(formula = purchased_laptop ~ awesome + laptop, data = df, family = binomial)
# Make the logistic regression model for watch
p2 = glm(formula = purchased_watch ~ awesome + watch, data = df, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
So the values of “awesome” and “laptop” are:
coef(p1)
## (Intercept) awesome laptop
## -2.557e+01 1.845e-11 5.113e+01
and the value of “watch” is:
coef(p2)
## (Intercept) awesome watch
## -39.56 19.17 40.95
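As these results show, logistic regression gives a high value to the “laptop” and “watch” features and a much lower value to “awesome”. It learns the value of the product words from the purchase behavior of the users. Note that the warning from glm means the fitted probabilities reached 0 or 1, which happens here because the product word almost perfectly separates buyers from non-buyers in this toy data; the exact magnitudes of the coefficients should therefore not be over-interpreted, only their relative order.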
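Once fitted, the models can also score queries from new users. The sketch below rebuilds the same toy data, refits both models, and calls `predict()` with `type = "response"` to get purchase probabilities. The data frame `new_queries` and the two example queries are assumptions added for illustration.

```r
# Rebuild the toy data used in this post
awesome = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1)
watch = rep(c(0, 1), each = 10)
laptop = rep(c(1, 0), each = 10)
purchased_laptop = laptop
purchased_watch = c(rep(0, 10), 1, 1, 0, 1, 1, 1, 1, 1, 1, 1)
df = data.frame(awesome, watch, laptop, purchased_laptop, purchased_watch)

# Refit both models; suppressWarnings() hides the glm.fit warning about
# fitted probabilities of 0 or 1
p1 = suppressWarnings(glm(purchased_laptop ~ awesome + laptop, data = df, family = binomial))
p2 = suppressWarnings(glm(purchased_watch ~ awesome + watch, data = df, family = binomial))

# Feature vectors for two hypothetical new queries: "awesome laptop" and "watch"
new_queries = data.frame(awesome = c(1, 0), watch = c(0, 1), laptop = c(1, 0))

# type = "response" returns probabilities rather than log-odds
prob_laptop = predict(p1, newdata = new_queries, type = "response")
prob_watch = predict(p2, newdata = new_queries, type = "response")
print(round(cbind(new_queries, prob_laptop, prob_watch), 3))
```

Because the toy data is almost perfectly separated, most predicted probabilities come out very close to 0 or 1. The exception is a plain “watch” query, which scores around 0.8 in the watch model, since four of the five watch queries without “awesome” ended in a purchase.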