Coupon purchase prediction for Ponpare

Introduction

Ponpare is a leading Japanese coupon site. Japan is a strong e-commerce market, with an Internet penetration rate of almost 80%. According to the “Cross-border Ecommerce Report - Japan”, 75% of the population have purchased products online, contributing to e-commerce sales of 104.7 billion USD (JPY 11.2 trillion).

Primary goal: Using past purchase and browsing behavior, predict which coupons a customer will buy in a given period of time.

Secondary goal: Conduct exploratory analysis on transactional data for 22,873 users on Ponpare.

Motivation: The resulting models will be used to improve Ponpare’s recommendation system, so they can make sure their customers don’t miss out on their next favorite thing.

Solution: Develop a recommender by calculating the dot product of the user profile and the item profile and computing the cosine distance between the two (user-based and item-based collaborative filtering); implement a matrix factorization model based on the Bayesian Personalized Ranking (BPR) framework.
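
As a rough illustration of the similarity computation named in the solution, the following minimal sketch computes the dot product and the cosine similarity of a user profile and an item profile. The helper `cosine_similarity` and the toy vectors are hypothetical, and cosine distance is taken here as one minus cosine similarity, which is an assumption about the intended definition.

```r
# Cosine similarity between two numeric feature vectors (hypothetical helper).
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) return(0)  # convention chosen here for all-zero vectors
  sum(a * b) / denom
}

# Toy profiles over three made-up features.
user_profile <- c(1, 0, 1)
item_profile <- c(1, 1, 0)

dot_product  <- sum(user_profile * item_profile)
cos_sim      <- cosine_similarity(user_profile, item_profile)
cos_distance <- 1 - cos_sim  # assumed definition of cosine distance
```

A higher similarity (lower distance) would suggest that the coupon matches the user's past behavior more closely.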

Literature Review

The number of Internet users engaged in e-commerce continues to increase worldwide. The amount of information encountered by the average e-commerce user is overwhelming, and it becomes difficult for customers to find what they are actually looking for, as opposed to what is pushed through ads. Search engines allow users to narrow results down to relevant ones; however, personalized searches are not available [1]. A promising solution is a recommender system, a class of Web applications that predict user responses to options. A good recommender system has to satisfy users by providing effective recommendations, as well as improve sales and attract new customers for e-commerce businesses.

There are two types of recommender systems: collaborative filtering-based recommenders utilize only consumer-product interaction data, while content-based recommenders examine properties of the products. In e-commerce, most recommendation algorithms take three types of data as input: user attributes, product attributes, and previous interactions between user and product (buying, browsing, etc.) [2]. The most popular prediction algorithms employed in e-commerce applications are [3]:

  1. Baselines (constant, random, overall average, user average, item average)

  2. Memory-based neighborhood algorithms (user-based, item-based collaborative filtering)

  3. Matrix Factorization methods (SVD, NMF, PMF, Bayesian PMF, Non-linear PMF) [4].
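
To make the baseline family in the list above concrete, here is a minimal sketch of the overall, user, and item averages on a toy user-by-coupon matrix. The matrix values and names are made up for illustration and are not taken from the Ponpare data.

```r
# Toy user-by-item matrix; NA marks an unobserved interaction (made-up data).
ratings <- matrix(c(1, NA, 1,
                    NA, 0, 1,
                    1, 1, NA),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("c1", "c2", "c3")))

overall_avg <- mean(ratings, na.rm = TRUE)      # one constant for everyone
user_avg    <- rowMeans(ratings, na.rm = TRUE)  # one value per user
item_avg    <- colMeans(ratings, na.rm = TRUE)  # one value per item

# Baseline predictions for the unobserved pair (u2, c1):
pred_user_baseline <- user_avg[["u2"]]
pred_item_baseline <- item_avg[["c1"]]
```

Baselines like these are mainly useful as a reference point when judging whether a more elaborate model adds anything.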

The performance of recommendation algorithms depends on the data characteristics [5]. Some algorithms may work better or worse depending on the amount of missing data (sparsity), the distribution of ratings, and the number of users and items. Most algorithms perform better on a dense (unreduced) dataset. In addition, there are different evaluation methods with conflicting orderings over algorithms [6]. The Ponpare coupon prediction problem follows a cold-start scenario, with user and product (coupon) attributes present. User attributes (e.g. gender, age, geographical location) were extracted from log files; coupon attributes (e.g. coupon category, price, discount rate) were provided.
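
Sparsity can be quantified as the share of user-coupon pairs with no recorded interaction. The sketch below shows the calculation on a made-up matrix; the object names are hypothetical.

```r
# Toy interaction matrix: rows are users, columns are coupons,
# 1 = purchased, 0 = no recorded interaction (made-up data).
interactions <- matrix(c(1, 0, 0, 0,
                         0, 0, 1, 0,
                         0, 0, 0, 0),
                       nrow = 3, byrow = TRUE)

n_observed <- sum(interactions != 0)
sparsity   <- 1 - n_observed / length(interactions)  # fraction of empty cells
```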

Datasets

We received user transactional data sourced from Ponpare and downloaded from Kaggle. The training set spans 2011-07-01 to 2012-06-23, covering 359 days of customer activity. The test set spans the following week, 2012-06-24 to 2012-06-30, covering seven days of customer activity.
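
The day counts can be checked directly from the boundary dates. This small sketch counts both the first and the last day of each period; the variable names are only illustrative.

```r
train_start <- as.Date("2011-07-01")
train_end   <- as.Date("2012-06-23")
test_start  <- as.Date("2012-06-24")
test_end    <- as.Date("2012-06-30")

# Add 1 so that both the first and the last day are counted.
train_days <- as.integer(train_end - train_start) + 1L
test_days  <- as.integer(test_end - test_start) + 1L
```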

Dataset variables

user_list.csv

user_list.csv is the master list of users in the dataset.
Total number of records in user_list.csv is 22,873.

| Column Name | Description | Type | Note |
|---|---|---|---|
| USER_ID_hash | User ID | CHAR | |
| REG_DATE | Registration date | CHAR | Sign up date |
| SEX_ID | Gender | CHAR | f=female, m=male |
| AGE | Age | NUMBER | |
| WITHDRAW_DATE | Date of withdrawal | CHAR | |
| PREF_NAME | Residential Prefecture | CHAR | Japanese |

coupon_list_train.csv and coupon_list_test.csv

coupon_list_train.csv is the master list of coupons which are considered part of the training set.
Total number of records in coupon_list_train.csv is 19,413.

coupon_list_test.csv is the master list of coupons which are considered part of the test set.
Total number of records in coupon_list_test.csv is 310. Predictions for this project will be sourced from those 310 coupons.

| Column Name | Description | Type | Note |
|---|---|---|---|
| CAPSULE_TEXT | Coupon text | CHAR | Japanese |
| GENRE_NAME | Category name | CHAR | Japanese |
| PRICE_RATE | Discount rate | INT | |
| CATALOG_PRICE | List price | INT | |
| DISCOUNT_PRICE | Discount price | INT | |
| DISPFROM | Sales release date | CHAR | |
| DISPEND | Sales end date | CHAR | |
| DISPPERIOD | Sales period (days) | INT | |
| VALIDFROM | The term of validity starts | DATE | |
| VALIDEND | The term of validity ends | DATE | |
| VALIDPERIOD | Validity period (days) | INT | |
| USABLE_DATE_MON | Is available on Monday | INT | |
| USABLE_DATE_TUE | Is available on Tuesday | INT | |
| USABLE_DATE_WED | Is available on Wednesday | INT | |
| USABLE_DATE_THU | Is available on Thursday | INT | |
| USABLE_DATE_FRI | Is available on Friday | INT | |
| USABLE_DATE_SAT | Is available on Saturday | INT | |
| USABLE_DATE_SUN | Is available on Sunday | INT | |
| USABLE_DATE_HOLIDAY | Is available on holiday | INT | |
| USABLE_DATE_BEFORE_HOLIDAY | Is available on the day before holiday | INT | |
| large_area_name | Large area name of shop location | CHAR | Japanese |
| ken_name | Prefecture name of shop | CHAR | Japanese |
| small_area_name | Small area name of shop location | CHAR | Japanese |
| COUPON_ID_hash | Coupon ID | CHAR | |

coupon_visit_train.csv

coupon_visit_train.csv is the viewing log of users browsing coupons during the training set time period.
Total number of records in coupon_visit_train.csv is 2,833,180.

| Column Name | Description | Type | Note |
|---|---|---|---|
| PURCHASE_FLG | Purchased flag | INT | 0: Not purchased, 1: Purchased |
| PURCHASEID_hash | Purchase ID | CHAR | |
| I_DATE | View date | CHAR | Purchase date if purchased |
| PAGE_SERIAL | Page serial | INT | |
| REFERRER_hash | Referrer | CHAR | |
| VIEW_COUPON_ID_hash | Browsing coupon ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| SESSION_ID_hash | Session ID | CHAR | |

coupon_detail_train.csv

coupon_detail_train.csv is the purchase log of users buying coupons during the training set time period. Total number of records in coupon_detail_train.csv is 168,996.

| Column Name | Description | Type | Note |
|---|---|---|---|
| ITEM_COUNT | Purchased item count | INT | |
| I_DATE | Purchase date | CHAR | |
| SMALL_AREA_NAME | Small area name | CHAR | Japanese |
| PURCHASEID_hash | Purchase ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| COUPON_ID_hash | Coupon ID | CHAR | |

Approach

Step 1 Download datasets:

    *Install required packages and connect to libraries in R
    *Import datasets in R
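
One possible way to import the files in a single pass is sketched below. The helper `read_ponpare` and the `data` directory in the usage comment are hypothetical; only the file names come from the dataset description above.

```r
# Hypothetical helper: read the listed CSV files from `data_dir` into a named list.
read_ponpare <- function(data_dir, files) {
  paths <- file.path(data_dir, files)
  missing <- files[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing files: ", paste(missing, collapse = ", "))
  }
  # as.is = TRUE keeps character columns as character instead of factors
  tables <- lapply(paths, read.csv, as.is = TRUE)
  names(tables) <- sub("\\.csv$", "", files)
  tables
}

ponpare_files <- c("user_list.csv", "coupon_list_train.csv",
                   "coupon_list_test.csv", "coupon_visit_train.csv",
                   "coupon_detail_train.csv")

# Usage (assumes the files were downloaded into ./data):
# ponpare <- read_ponpare("data", ponpare_files)
# str(ponpare$user_list)
```

Failing early on a missing file avoids discovering the problem only when a later step touches that table.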

Step 2 Data transformation (deriving and casting)

user_list:

Data deriving

Translate the Japanese text in PREF_NAME into English and save it in a new variable, PREF_NAME_en:

user_list<-read.csv("user_list.csv", as.is=TRUE)
user_list$PREF_NAME[(user_list$PREF_NAME=="")] <- NA #replace empty cells in PREF_NAME with NA
#unique() keeps first-appearance order, so user_list_en below must follow exactly the same order
user_list_jp<-unique(user_list$PREF_NAME)
user_list_en<-c("NA", "Tokyo", "Aichi Prefecture", "Kanagawa Prefecture", 
                "Hiroshima Prefecture", "Saitama Prefecture", "Nara Prefecture",
                "Ishikawa Prefecture", "Osaka prefecture",
                "Kumamoto Prefecture", "Fukuoka Prefecture", "Hokkaido", "Kyoto", 
                "Akita", "Chiba Prefecture", "Nagasaki Prefecture", 
                "Hyogo Prefecture", "Okinawa","Mie", "Ibaraki Prefecture", 
                "Kagoshima prefecture", "Miyagi Prefecture", "Shizuoka Prefecture", 
                "Wakayama Prefecture", "Nagano Prefecture", "Okayama Prefecture", 
                "Tochigi Prefecture","Shiga Prefecture", "Toyama Prefecture", 
                "Saga Prefecture", "Miyazaki Prefecture", "Iwate Prefecture", 
                "Niigata Prefecture", "Oita Prefecture", "Yamaguchi Prefecture", 
                "Gifu Prefecture","Gunma Prefecture", "Fukushima Prefecture", 
                "Ehime Prefecture", "Kagawa Prefecture", "Yamanashi Prefecture", 
                "Kochi Prefecture", "Shimane Prefecture", "Tokushima Prefecture", 
                "Fukui Prefecture","Aomori Prefecture", "Yamagata Prefecture", 
                "Tottori Prefecture")
translation_user_list<-data.frame(user_list_jp, user_list_en, stringsAsFactors=FALSE)

names(translation_user_list)<-c("Japanese", "PREF_NAME_en")
user_list<-merge(user_list, translation_user_list, by.x="PREF_NAME", by.y="Japanese", all.x=TRUE)
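
Because the translation above pairs the two vectors by position, a single misplaced entry would shift every label after it. A name-keyed lookup is one way to make each pairing explicit. The sketch below is an alternative illustration, not the method used in this report; it shows only three prefectures and assumes the raw values look like 東京都.

```r
# Name-keyed lookup: each Japanese string maps explicitly to its English label.
# Only three prefectures are shown; a full mapping would list all of them.
pref_lookup <- c("東京都"   = "Tokyo",
                 "神奈川県" = "Kanagawa Prefecture",
                 "大阪府"   = "Osaka prefecture")

# Toy stand-in for user_list$PREF_NAME, including a missing value.
pref_name <- c("大阪府", NA, "東京都", "東京都")

# Indexing by name keeps the original row order; NA stays NA.
pref_name_en <- unname(pref_lookup[pref_name])
```

Unlike merge(), this lookup does not reorder the rows of the data frame it is applied to.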

Data casting - changing type of the variables for further analysis

Variable REG_DATE is a character string containing the date and time in the form “2011-05-24 08:42:15”. It is split into 2 variables: Date_REG_DATE (Date format) and Time_REG_DATE (Factor):

#characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
user_list$Date_REG_DATE<-substring(user_list$REG_DATE, 1, 10)
user_list$Time_REG_DATE<-substring(user_list$REG_DATE, 12, 19)
user_list$Date_REG_DATE<-as.Date(user_list$Date_REG_DATE)
user_list$Time_REG_DATE<-as.factor(user_list$Time_REG_DATE)
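
An alternative to slicing the string by position is to parse the timestamp once and derive the parts from it. The sketch below uses toy values in the REG_DATE format; the `tz = "UTC"` setting is an assumption, since the time zone of the logs is not stated.

```r
reg_date <- c("2011-05-24 08:42:15", "2012-01-03 23:05:00")

# Parse once as a date-time (time zone assumed), then derive the parts.
reg_ts   <- as.POSIXct(reg_date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
reg_day  <- as.Date(reg_ts)                    # Date class
reg_time <- format(reg_ts, "%H:%M:%S")         # character time of day
reg_hour <- as.integer(format(reg_ts, "%H"))   # integer hour, 0-23
```

Parsing this way turns malformed timestamps into NA instead of silently producing shifted substrings.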

Variable WITHDRAW_DATE is a character string containing the date and time. It is split into 2 variables: Date_WITHDRAW_DATE (Date format) and Time_WITHDRAW_DATE (Factor):

user_list$Date_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 1, 10)
user_list$Time_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 12, 19)
user_list$Date_WITHDRAW_DATE<-as.Date(user_list$Date_WITHDRAW_DATE)
user_list$Time_WITHDRAW_DATE<-as.factor(user_list$Time_WITHDRAW_DATE)

Binary variable SEX_ID is a character and is converted into factor with 2 levels:

user_list$SEX_ID<-as.factor(user_list$SEX_ID)

New variable AGE_GROUPS is created from AGE for further analysis:

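#cut() uses right-closed intervals by default: (14,24], (24,34], ...
#so the labels name the integer ages each interval actually contains
user_list$AGE_GROUPS<-cut(user_list$AGE, breaks=c(14,24,34,44,54,64,74,84), 
                labels=c("15-24", "25-34","35-44", "45-54", "55-64", "65-74", "75-84"))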
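
When grouping with cut(), it is worth checking on which side of a break a boundary value falls. The following sketch uses three made-up ages and the default interval labels to show the difference between right-closed and left-closed intervals.

```r
ages   <- c(15, 24, 25)
breaks <- c(14, 24, 34)

# Default (right = TRUE): intervals are (14,24] and (24,34].
right_closed <- cut(ages, breaks = breaks)
# right = FALSE: intervals are [14,24) and [24,34).
left_closed  <- cut(ages, breaks = breaks, right = FALSE)

as.character(right_closed)  # "(14,24]" "(14,24]" "(24,34]"
as.character(left_closed)   # "[14,24)" "[24,34)" "[24,34)"
```

Age 24 moves from the first group to the second depending on the `right` argument, so custom labels should be written to match the setting in use.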

The outcome of this step is the addition of the new variables (in bold), so the user_list dataset has the following variables:

| Column Name | Description | Type | Note |
|---|---|---|---|
| USER_ID_hash | User ID | CHAR | |
| REG_DATE | Registration date | CHAR | Sign up date |
| **Date_REG_DATE** | Registration date | DATE | |
| **Time_REG_DATE** | Registration time | Factor | |
| SEX_ID | Gender | Factor | f=female, m=male |
| AGE | Age | NUMBER | |
| **AGE_GROUPS** | Age groups | Factor | |
| WITHDRAW_DATE | Date of withdrawal | CHAR | |
| **Date_WITHDRAW_DATE** | Date of withdrawal | DATE | |
| **Time_WITHDRAW_DATE** | Time of withdrawal | Factor | |
| PREF_NAME | Residential Prefecture | CHAR | Japanese |
| **PREF_NAME_en** | Residential Prefecture | CHAR | English |

coupon_list_train.csv and coupon_list_test.csv:

Data deriving

Translate the Japanese text in CAPSULE_TEXT, GENRE_NAME, large_area_name, ken_name, and small_area_name into English and save it in the corresponding new variables CAPSULE_TEXT_en, GENRE_NAME_en, large_area_name_en, ken_name_en, and small_area_name_en in the training and test datasets:

#For train set:
train<-read.csv("coupon_list_train.csv", as.is=TRUE) #read csv file, 
#argument as.is=TRUE suppresses conversion of character vectors to factors
translation<-data.frame(Japanese=unique(c(train$CAPSULE_TEXT, train$GENRE_NAME, 
train$large_area_name, train$ken_name, train$small_area_name)), 
                English=c("Food", "Hair salon", "Spa", "Relaxation","Beauty", 
                          "Nail and eye salon","Delivery service","Lesson",
                          "Gift card","Other coupon","Leisure",
                          "Hotel and Japanese hotel","Health and medical","Other",
                          "Hotel","Japanese hotel","Vacation rental","Lodge",
                          "Resort inn","Guest house","Japanese guest house",
                          "Public hotel","Beauty","Event","Web service","Class",
                          "Correspondence course","Kanto","Kansai","East Sea",
                          "Hokkaido","Kyushu-Okinawa","Northeast","Shikoku",
                          "China","Hokushinetsu","Saitama Prefecture",
                          "Chiba Prefecture","Tokyo","Kyoto","Aichi Prefecture",
                          "Kanagawa Prefecture","Fukuoka Prefecture",
                          "Tochigi Prefecture","Osaka prefecture","Miyagi Prefecture",
                          "Fukushima Prefecture","Oita Prefecture","Kochi Prefecture",
                          "Hiroshima Prefecture","Niigata Prefecture",
                          "Okayama Prefecture","Ehime Prefecture","Kagawa Prefecture",
                          "Tokushima Prefecture","Hyogo Prefecture","Gifu Prefecture",
                          "Miyazaki Prefecture","Nagasaki Prefecture", 
                          "Ishikawa Prefecture","Yamagata Prefecture","Shizuoka Prefecture",
                          "Aomori Prefecture", "Okinawa","Akita","Nagano Prefecture",
                          "Iwate Prefecture","Kumamoto Prefecture",
                          "Yamaguchi Prefecture","Saga Prefecture","Nara Prefecture",
                          "Mie","Gunma Prefecture","Wakayama Prefecture",
                          "Yamanashi Prefecture","Tottori Prefecture","Kagoshima prefecture",
                          "Fukui Prefecture","Shiga Prefecture","Toyama Prefecture",
                          "Shimane Prefecture","Ibaraki Prefecture","Saitama","Chiba",
                          "Shinjuku, Takadanobaba Nakano - Kichijoji","Kyoto",
                          "Ebisu, Meguro Shinagawa","Ginza Shinbashi, Tokyo, Ueno",
                          "Aichi","Kawasaki, Shonan-Hakone other","Fukuoka","Tochigi",
                          "Minami other","Shibuya, Aoyama, Jiyugaoka",
                          "Ikebukuro Kagurazaka-Akabane","Akasaka, Roppongi, Azabu",
                          "Yokohama","Miyagi","Fukushima","Much","Kochi",
                          "Tachikawa Machida, Hachioji other","Hiroshima","Niigata",
                          "Okayama","Ehime","Kagawa","Northern","Tokushima","Hyogo",
                          "Gifu","Miyazaki","Nagasaki","Ishikawa","Yamagata",
                          "Shizuoka","Aomori","Okinawa","Akita","Nagano","Iwate",
                          "Kumamoto","Yamaguchi","Saga","Nara","Triple","Gunma",
                          "Wakayama","Yamanashi","Tottori","Kagoshima","Fukui",
                          "Shiga","Toyama","Shimane","Ibaraki"), 
                stringsAsFactors=FALSE) #stringsAsFactors=FALSE 
                                        #because we have some text fields

# Merge translated data with original data column by column:

names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
train<-merge(train, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
train<-merge(train, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
train<-merge(train, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
train<-merge(train, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
train<-merge(train, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)

#For test set:
test<-read.csv("coupon_list_test.csv", as.is=TRUE)
names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
test<-merge(test, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
test<-merge(test, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
test<-merge(test, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
test<-merge(test, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
test<-merge(test, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)
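
The repeated rename-and-merge pattern above could also be wrapped in a small helper. The sketch below is one hypothetical way to do it, using match() so that the row order is preserved; `add_translations`, `dictionary`, and the toy rows are illustrative names and data, not part of the original code.

```r
# Hypothetical helper: add a <col>_en column for each listed column by looking
# up values in a two-column translation table (Japanese, English).
add_translations <- function(df, dictionary, cols) {
  for (col in cols) {
    idx <- match(df[[col]], dictionary$Japanese)  # NA when no translation exists
    df[[paste0(col, "_en")]] <- dictionary$English[idx]
  }
  df
}

# Toy data standing in for the coupon lists and the translation table.
dictionary <- data.frame(Japanese = c("グルメ", "関東"),
                         English  = c("Food", "Kanto"),
                         stringsAsFactors = FALSE)
coupons <- data.frame(GENRE_NAME      = c("グルメ", "グルメ"),
                      large_area_name = c("関東", NA),
                      stringsAsFactors = FALSE)

coupons <- add_translations(coupons, dictionary,
                            c("GENRE_NAME", "large_area_name"))
```

The same call could then be applied to both the training and the test coupon lists, which keeps the two in sync.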

Data casting - changing type of the variables for further analysis

Variables DISPFROM and DISPEND are character type and contain date and time components.

Each variable is split into 2 variables: Date_DISPFROM, Date_DISPEND (Date format) and Time_DISPFROM, Time_DISPEND (Factor):

#characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
train$Date_DISPFROM<-substring(train$DISPFROM, 1, 10)
train$Time_DISPFROM<-substring(train$DISPFROM, 12, 19)
train$Date_DISPFROM<-as.Date(train$Date_DISPFROM)
train$Time_DISPFROM<-as.factor(train$Time_DISPFROM)

train$Date_DISPEND<-substring(train$DISPEND, 1, 10)
train$Time_DISPEND<-substring(train$DISPEND, 12, 19)
train$Date_DISPEND<-as.Date(train$Date_DISPEND)
train$Time_DISPEND<-as.factor(train$Time_DISPEND)

test$Date_DISPFROM<-substring(test$DISPFROM, 1, 10)
test$Time_DISPFROM<-substring(test$DISPFROM, 12, 19)
test$Date_DISPFROM<-as.Date(test$Date_DISPFROM)
test$Time_DISPFROM<-as.factor(test$Time_DISPFROM)

test$Date_DISPEND<-substring(test$DISPEND, 1, 10)
test$Time_DISPEND<-substring(test$DISPEND, 12, 19)
test$Date_DISPEND<-as.Date(test$Date_DISPEND)
test$Time_DISPEND<-as.factor(test$Time_DISPEND)

Variables VALIDFROM (the date when the coupon becomes valid) and VALIDEND (the date when the coupon expires) are character type and contain a date. They are converted into Date format:

train$VALIDFROM<-as.Date(train$VALIDFROM)
train$VALIDEND<-as.Date(train$VALIDEND)


test$VALIDFROM<-as.Date(test$VALIDFROM)
test$VALIDEND<-as.Date(test$VALIDEND)

Variable PRICE_RATE describes discount rate for every coupon. PRICE_RATE_GROUPS was derived from it for the sake of future visualization:

#Create interval groups for train$PRICE_RATE.
#cut() uses right-closed intervals by default: (0,20], (20,40], ...
#so the labels name the integer rates each interval actually contains
train$PRICE_RATE_GROUPS<-cut(train$PRICE_RATE, 
                        breaks=c(0,20,40,60, 80,100), 
                        labels=c("1-20","21-40", "41-60", "61-80", "81-100"))

#Create interval groups for test$PRICE_RATE:
test$PRICE_RATE_GROUPS<-cut(test$PRICE_RATE, 
                            breaks=c(0,20,40,60, 80,100), 
                            labels=c("1-20","21-40", "41-60", "61-80", "81-100"))

The outcome of this step is the addition of the new variables (in bold), so the train and test datasets have the following variables:

| Column Name | Description | Type | Note |
|---|---|---|---|
| CAPSULE_TEXT | Coupon text | CHAR | Japanese |
| **CAPSULE_TEXT_en** | Coupon text | CHAR | English |
| GENRE_NAME | Category name | CHAR | Japanese |
| **GENRE_NAME_en** | Category name | CHAR | English |
| PRICE_RATE | Discount rate | INT | |
| **PRICE_RATE_GROUPS** | Grouped discount rates | Factor | |
| CATALOG_PRICE | List price | INT | |
| DISCOUNT_PRICE | Discount price | INT | |
| DISPFROM | Sales release date | CHAR | |
| **Date_DISPFROM** | Sales release date | DATE | |
| **Time_DISPFROM** | Sales release time | Factor | |
| DISPEND | Sales end date | CHAR | |
| **Date_DISPEND** | Sales end date | DATE | |
| **Time_DISPEND** | Sales end time | Factor | |
| DISPPERIOD | Sales period (days) | INT | |
| VALIDFROM | The term of validity starts | DATE | |
| VALIDEND | The term of validity ends | DATE | |
| VALIDPERIOD | Validity period (days) | INT | |
| USABLE_DATE_MON | Is available on Monday | INT | |
| USABLE_DATE_TUE | Is available on Tuesday | INT | |
| USABLE_DATE_WED | Is available on Wednesday | INT | |
| USABLE_DATE_THU | Is available on Thursday | INT | |
| USABLE_DATE_FRI | Is available on Friday | INT | |
| USABLE_DATE_SAT | Is available on Saturday | INT | |
| USABLE_DATE_SUN | Is available on Sunday | INT | |
| USABLE_DATE_HOLIDAY | Is available on holiday | INT | |
| USABLE_DATE_BEFORE_HOLIDAY | Is available on the day before holiday | INT | |
| large_area_name | Large area name of shop location | CHAR | Japanese |
| **large_area_name_en** | Large area name of shop location | CHAR | English |
| ken_name | Prefecture name of shop | CHAR | Japanese |
| **ken_name_en** | Prefecture name of shop | CHAR | English |
| small_area_name | Small area name of shop location | CHAR | Japanese |
| **small_area_name_en** | Small area name of shop location | CHAR | English |
| COUPON_ID_hash | Coupon ID | CHAR | |

coupon_visit_train.csv

Data casting - changing type of the variables for further analysis

Variable PURCHASE_FLG is a binary variable that describes whether the coupon was purchased (1) or not (0). It is converted into a factor with 2 levels and saved as PURCHASE_FLG_f.

coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE)
coupon_visit_train$PURCHASE_FLG_f<-as.factor(coupon_visit_train$PURCHASE_FLG)

Variable I_DATE holds the view date of the coupon (the purchase date if it was purchased). It is a character type and contains date and time components. It is split into 2 variables, Date_I_DATE and Time_I_DATE, both stored in POSIXlt format:

#characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
coupon_visit_train$Date_I_DATE<-substring(coupon_visit_train$I_DATE, 1, 10)
coupon_visit_train$Time_I_DATE<-substring(coupon_visit_train$I_DATE, 12, 19)
coupon_visit_train$Date_I_DATE<-strptime(coupon_visit_train$Date_I_DATE, format="%Y-%m-%d")
coupon_visit_train$Time_I_DATE<-strptime(coupon_visit_train$Time_I_DATE, format="%H:%M:%S")

Extract the hour of browsing from Time_I_DATE and create a new variable, Hour:

coupon_visit_train$Hour<-coupon_visit_train$Time_I_DATE$hour

Variable Weekday is created from Date_I_DATE and describes the day of the week on which the coupon was browsed or purchased:

coupon_visit_train$Weekday<-weekdays(coupon_visit_train$Date_I_DATE)

The outcome of this step is the addition of the new variables (in bold), so the coupon_visit_train dataset has the following variables:

| Column Name | Description | Type | Note |
|---|---|---|---|
| PURCHASE_FLG | Purchased flag | INT | 0: Not purchased, 1: Purchased |
| **PURCHASE_FLG_f** | Purchased flag | Factor | 0: Not purchased, 1: Purchased |
| PURCHASEID_hash | Purchase ID | CHAR | |
| I_DATE | View date | CHAR | Purchase date if purchased |
| **Date_I_DATE** | View date | POSIXlt format | Purchase date if purchased |
| **Time_I_DATE** | View time | POSIXlt format | Purchase time if purchased |
| **Weekday** | Day of week | CHAR | |
| **Hour** | Hour | INT | |
| PAGE_SERIAL | Page serial | INT | |
| REFERRER_hash | Referrer | CHAR | |
| VIEW_COUPON_ID_hash | Browsing coupon ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| SESSION_ID_hash | Session ID | CHAR | |

coupon_detail_train

Data deriving

Translate the Japanese text in SMALL_AREA_NAME into English and save it in a new variable, SMALL_AREA_NAME_en:

coupon_detail_train<-read.csv("coupon_detail_train.csv", as.is=TRUE)
names(translation)<-c("Japanese", "SMALL_AREA_NAME_en")
coupon_detail_train<-merge(coupon_detail_train, translation, by.x="SMALL_AREA_NAME", 
                           by.y="Japanese", all.x=TRUE)

Data casting - changing type of the variables for further analysis

Variable I_DATE is a character type and contains date and time components. It is split into 2 variables: Date_I_DATE (Date format) and Time_I_DATE (Factor):

coupon_detail_train$Date_I_DATE<-substring(coupon_detail_train$I_DATE, 1, 10)
coupon_detail_train$Time_I_DATE<-substring(coupon_detail_train$I_DATE, 12, 19)
coupon_detail_train$Date_I_DATE<-as.Date(coupon_detail_train$Date_I_DATE)
coupon_detail_train$Time_I_DATE<-as.factor(coupon_detail_train$Time_I_DATE)

Step 3 Exploratory Data Analysis:

    *Examine summaries of the data
    *Check for missing data
    *Explore relationship between variables
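
A minimal sketch of the first two checks in the list above is shown below on a made-up stand-in for user_list. The data frame `users` and its values are hypothetical; with the real data, the same calls would be applied to the imported tables.

```r
# Toy stand-in for user_list (made-up values).
users <- data.frame(AGE = c(25, 41, 33),
                    WITHDRAW_DATE = c(NA, "2012-01-05 10:00:00", NA),
                    PREF_NAME = c("", "東京都", ""),
                    stringsAsFactors = FALSE)

summary(users$AGE)  # minimum, quartiles, mean, and maximum

# Count NA values per column. Empty strings are counted separately because
# read.csv() does not treat "" in character columns as missing by default.
na_counts    <- colSums(is.na(users))
empty_counts <- sapply(users, function(x) sum(!is.na(x) & x == ""))
```

Counting empty strings separately matters here because PREF_NAME marks missing prefectures with empty cells rather than NA.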

for user_list:

Variable SEX_ID describes gender of the users.

counts_SEX_ID<-table(user_list$SEX_ID) # Build and save contingency table
barplot(counts_SEX_ID, names.arg=c("Females", "Males"), 
        main= "Sex distribution among website users ", 
        ylab="Website users", ylim=c(0,20000))
axis(side=2, at=c(0, 5000, 10000, 15000,20000))
text(x=0.7, y=7000, "48%", col="pink", cex=2)
text(1.9,7000, "52%", col="blue", cex=2)

Variable AGE describes the age of the users. To visualize it, the new variable AGE_GROUPS was introduced.

counts_AGE_GROUPS<-table(user_list$AGE_GROUPS) # Build and save contingency table
barplot(counts_AGE_GROUPS, names.arg=c("15-24", "25-34","35-44", "45-54", 
        "55-64", "65-74", "75-84"), 
        main= "Age distribution among website users ", 
        ylab="Website users", ylim=c(0,10000))
axis(side=2, at=c(0, 2000, 4000, 6000,8000, 10000))
#add percentages to each age group on the plot: (counts_AGE_GROUPS/22873)*100
text(x=0.8, y=600, "4%", col="blue", cex=1)
text(x=2, y=3000, "24%", col="blue", cex=1)
text(x=3.1, y=4000, "31.3%", col="blue", cex=1)
text(x=4.2, y=3000, "23.8%", col="blue", cex=1)
text(x=5.5, y=2000, "12.8%", col="blue", cex=1)
text(x=6.8, y=600, "3.6%", col="blue", cex=1)
text(x=8, y=600, "0.5%", col="blue", cex=1)

Variable PREF_NAME_en describes the residential prefecture of the user. Of the 22,873 users in user_list, the prefecture name was NA for 7,256 (about 32%). Among users whose residential prefecture is available, Tokyo is the most common. The top ten values by number of users (including NA) are:

sort(table (user_list$PREF_NAME_en), decreasing=TRUE)[1:10]
## 
##                  NA               Tokyo Kanagawa Prefecture 
##                7256                2830                1653 
##    Osaka prefecture    Aichi Prefecture    Hyogo Prefecture 
##                1638                 938                 879 
##  Saitama Prefecture    Chiba Prefecture  Fukuoka Prefecture 
##                 874                 835                 731 
##            Hokkaido 
##                 628

for train and test:

Variable PRICE_RATE describes the discount rate for every coupon in %. To visualize it, the new variable PRICE_RATE_GROUPS was introduced.

#for training set:    
counts_PRICE_RATE_GROUPS<-table(train$PRICE_RATE_GROUPS) # Build and save contingency table

barplot(counts_PRICE_RATE_GROUPS, 
        names.arg=c("1-20","21-40", "41-60", "61-80", "81-100"),
        main= "Coupons discount rate in training set ", 
        ylab="Number of coupons", 
        xlab="Discount rate in %",
        ylim=c(0, 14000))
axis(side=2, at=c(0,4000, 8000, 12000))
#add percentages to each discount group on the plot: 
#(counts_PRICE_RATE_GROUPS/19413)*100
text(x=0.8, y=2000, "0.03%", col="blue", cex=1)
text(x=2, y=2000, "0.41%", col="blue", cex=1)
text(x=3.1, y=6000, "66.80%", col="blue", cex=1)
text(x=4.2, y=4000, "27.00%", col="blue", cex=1)
text(x=5.5, y=3000, "5.76%", col="blue", cex=1)

Variable DISPPERIOD describes the number of days a coupon is displayed on the website.

summary(train$DISPPERIOD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.167   4.000  36.000
plot(table(train$DISPPERIOD),  ylab="Number of coupons",
                        xlab="Display period in days",
                        main="Coupons display period")

Variable VALIDPERIOD describes the number of days during which a coupon is valid for use.

summary(train$VALIDPERIOD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      89     128     126     177     179    6147
plot(table(train$VALIDPERIOD),  ylab="Number of coupons",
     xlab="Validity period in days",
     main="Coupons validity period")

Variable large_area_name_en describes the large area name of the shop location.

for coupon_visit_train

Variable PURCHASE_FLG is a binary variable that describes whether a purchase was made (1) or not (0).

coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE) #Read in data
counts<-table(coupon_visit_train$PURCHASE_FLG) #Build and save contingency table

Create barplot to visualise numbers of viewed and purchased coupons

barplot(counts, names.arg=c("Viewed coupons", "Purchased coupons"), 
        main= "Browsed and purchased coupons during training time period", 
        ylab="Users browsing coupons, train set, *10^5", 
        ylim=c(0,3000000), yaxt="n")
axis(side=2, at=c(0, 1000000, 2000000,3000000),  labels=c(0, 10, 20,30))
text(x=0.7, y=2000000, "95.7%", col="blue", cex=2)
text(1.9,1000000, "4.3%", col="red", cex=2)

Variable Hour was derived from Time_I_DATE and describes the hour of the day when the coupon was browsed.
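
One way the hourly distribution might be tabulated and plotted is sketched below. The timestamps are made up and stand in for coupon_visit_train$I_DATE; the hour is taken from characters 12-13 of the string, which assumes the “YYYY-MM-DD HH:MM:SS” layout.

```r
# Toy stand-in for coupon_visit_train$I_DATE (made-up timestamps).
i_date <- c("2011-07-01 09:15:00", "2011-07-01 21:40:10",
            "2011-07-02 21:05:30", "2011-07-03 00:10:00")

hour <- as.integer(substring(i_date, 12, 13))

# Tabulate over all 24 hours so that hours without views still appear as 0.
views_by_hour <- table(factor(hour, levels = 0:23))

barplot(views_by_hour, xlab = "Hour of day", ylab = "Number of views",
        main = "Coupon views by hour of day")
```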

Step 4 Build the recommender system on a training set

I plan to try cosine similarity and regression approaches. …to be continued…
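
As a rough sketch of what a cosine-similarity recommender could look like, the example below builds a user profile from the features of purchased coupons and ranks test coupons against it. This is an illustration under stated assumptions, not the model of this report: the features, coupon IDs, and the helper `cosine_scores` are all made up.

```r
# Toy one-hot coupon features (rows: coupons, columns: made-up features).
features <- c("genre_food", "genre_spa", "area_kanto")
train_coupons <- matrix(c(1, 0, 1,
                          1, 0, 0,
                          0, 1, 1),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("t1", "t2", "t3"), features))
test_coupons <- matrix(c(1, 0, 1,
                         0, 1, 0,
                         0, 0, 1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("c1", "c2", "c3"), features))

# User profile: mean feature vector of the coupons the user purchased.
purchased    <- c("t1", "t2")
user_profile <- colMeans(train_coupons[purchased, , drop = FALSE])

# Cosine similarity between the profile and every row of `items`.
cosine_scores <- function(profile, items) {
  dots  <- as.vector(items %*% profile)
  norms <- sqrt(rowSums(items^2)) * sqrt(sum(profile^2))
  scores <- ifelse(norms == 0, 0, dots / norms)
  names(scores) <- rownames(items)
  scores
}

scores <- cosine_scores(user_profile, test_coupons)
top_n  <- names(sort(scores, decreasing = TRUE))[1:2]  # two best test coupons
```

In this toy case the user bought food coupons, mostly in Kanto, so the Kanto food coupon `c1` ranks first.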

Step 5 Model evaluation

To evaluate the model, Mean Average Precision will be used.
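
A minimal sketch of Mean Average Precision at a cutoff k follows. It uses one common definition (average precision normalized by the smaller of k and the number of purchased coupons); the cutoff `k = 10`, the function names, and the toy IDs are assumptions rather than details stated in this report.

```r
# Average precision at k for one user.
# `predicted`: ranked vector of recommended coupon IDs, best first,
#              assumed to contain no duplicates.
# `actual`:    vector of coupon IDs the user really purchased.
ap_at_k <- function(predicted, actual, k = 10) {
  actual <- unique(actual)
  if (length(actual) == 0) return(0)
  predicted <- head(predicted, k)
  hits <- predicted %in% actual
  if (!any(hits)) return(0)
  precision_at_i <- cumsum(hits) / seq_along(hits)
  sum(precision_at_i[hits]) / min(length(actual), k)
}

# Mean of the per-user average precisions.
map_at_k <- function(predicted_list, actual_list, k = 10) {
  mean(mapply(ap_at_k, predicted_list, actual_list,
              MoreArgs = list(k = k)))
}

# Toy example with two users and made-up coupon IDs.
predicted_list <- list(c("a", "b", "c"), c("x", "y"))
actual_list    <- list(c("a", "c"), c("z"))
map_score <- map_at_k(predicted_list, actual_list, k = 3)
```

Here the first user gets hits at ranks 1 and 3, giving an average precision of 5/6, while the second user gets none, so the mean is 5/12.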