Coupon purchase prediction for Ponpare

Introduction

Ponpare is a leading Japanese coupon site. Japan is a strong e-commerce market, with an Internet penetration rate of almost 80%. According to the “Cross-border Ecommerce Report - Japan”, 75% of the population has purchased products online, contributing to e-commerce sales of 104.7 billion USD (JPY 11.2 trillion).

Primary goal: Using past purchase and browsing behavior, predict which coupons a customer will buy in a given period of time.

Secondary goal: Conduct exploratory analysis on transactional data for 22,873 users on Ponpare.

Motivation: The resulting models will be used to improve Ponpare’s recommendation system, so they can make sure their customers don’t miss out on their next favorite thing.

Solution: Develop a recommender that scores each user-coupon pair by the cosine similarity (the normalized dot product) between the user profile and the item profile (user-based and item-based collaborative filtering), and implement a matrix factorization model based on the Bayesian Personalized Ranking (BPR) framework.
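
The profile scoring described above can be illustrated with a minimal sketch. The feature names, the toy profiles and the helper cosine_similarity are hypothetical and are not part of the project code; the sketch only shows how a dot product normalized by vector lengths ranks coupons for one user.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical user profile over three illustrative features.
user_profile <- c(food = 3, spa = 0, discount = 1)

# Hypothetical coupon (item) profiles over the same features, one row per coupon.
coupons <- rbind(
  coupon_a = c(food = 1, spa = 0, discount = 0.5),
  coupon_b = c(food = 0, spa = 1, discount = 0.5)
)

# Score every coupon against the user, then rank from most to least similar.
scores <- apply(coupons, 1, cosine_similarity, b = user_profile)
ranking <- names(sort(scores, decreasing = TRUE))

print(round(scores, 3))
print(ranking)
```

A score close to 1 means the coupon profile points in nearly the same direction as the user profile, regardless of the magnitude of either vector.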

Literature Review

The number of Internet users engaged in e-commerce continues to increase worldwide, and the amount of information an average e-commerce user encounters is overwhelming. It becomes hard for customers to find what they are actually looking for, as opposed to what is pushed through ads. Search engines allow users to filter for relevant results; however, personalized searches are not available [1]. A promising solution is the recommender system, a class of Web applications that predict user responses to options. A good recommender system has to satisfy users by providing effective recommendations, as well as improve sales and attract new customers for e-commerce businesses.

There are two types of recommender systems: collaborative filtering-based recommenders utilize only consumer-product interaction data, while content-based recommenders examine properties of the products. In e-commerce, most recommendation algorithms take as input three types of data: user attributes, product attributes, and previous interactions between user and product (buying, browsing, etc.) [2]. The most popular prediction algorithms employed in e-commerce applications are [3]:

Baselines (constant, random, overall average, user average, item average)
Memory-based neighborhood algorithms (user-based, item-based collaborative filtering)
Matrix Factorization methods (SVD, NMF, PMF, Bayesian PMF, Non-linear PMF) [4].
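
As an illustration of the simplest family in the list above, the following sketch computes the overall, user and item average baselines on a toy user-by-item matrix. The matrix values are made up for illustration and are not taken from the Ponpare data.

```r
# Toy user-by-item matrix; NA marks an unobserved interaction.
# All values are illustrative only.
ratings <- matrix(
  c( 1, NA,  1,
     0,  1, NA,
    NA,  1,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3"))
)

# Overall average: one constant prediction for every user-item pair.
overall_avg <- mean(ratings, na.rm = TRUE)

# User average: one prediction per user, ignoring unobserved cells.
user_avg <- rowMeans(ratings, na.rm = TRUE)

# Item average: one prediction per item, ignoring unobserved cells.
item_avg <- colMeans(ratings, na.rm = TRUE)

print(overall_avg)
print(user_avg)
print(item_avg)
```

Baselines like these are mainly useful as a reference point: a more complex model is only worth keeping if it beats them on the chosen evaluation metric.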

The performance of recommendation algorithms depends on the data characteristics [5]. Some algorithms may work better or worse depending on the amount of missing data (sparsity), the distribution of ratings, and the number of users and items. Most algorithms perform better on a dense (unreduced) dataset. In addition, different evaluation methods can produce conflicting orderings of algorithms [6]. The Ponpare coupon prediction problem follows a cold-start scenario with user and product (coupon) attributes present. User attributes (e.g. gender, age, geographical location) were extracted from log files; coupon attributes (e.g. coupon category, price, discount rate) were provided.
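
Sparsity, as used above, can be computed as the share of user-item pairs with no observed interaction. The sketch below uses a small made-up interaction log; the column names user and item are hypothetical stand-ins for the hashed IDs in the real data.

```r
# Hypothetical interaction log: one row per observed user-item event.
interactions <- data.frame(
  user = c("u1", "u1", "u2", "u3", "u3"),
  item = c("i1", "i2", "i2", "i1", "i1"),  # the last row repeats (u3, i1)
  stringsAsFactors = FALSE
)

# Count each user-item pair once, however many events it has.
observed_pairs <- nrow(unique(interactions))
n_users <- length(unique(interactions$user))
n_items <- length(unique(interactions$item))

# Sparsity = 1 - (observed pairs / all possible pairs).
sparsity <- 1 - observed_pairs / (n_users * n_items)
print(sparsity)
```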

Datasets

We received users' transactional data sourced from Ponpare and downloaded from Kaggle. The training set spans the dates from 2011-07-01 to 2012-06-23, covering 359 days of customer activity. The test set spans the week after the training set, from 2012-06-24 to 2012-06-30, covering seven days of customer activity.
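
The day counts above include both the first and the last day of each period. A quick check in R, using only the dates stated in this section:

```r
train_start <- as.Date("2011-07-01")
train_end   <- as.Date("2012-06-23")
test_start  <- as.Date("2012-06-24")
test_end    <- as.Date("2012-06-30")

# Subtracting two Date objects gives the difference in days;
# adding 1 counts both endpoints of the period.
train_days <- as.numeric(train_end - train_start) + 1
test_days  <- as.numeric(test_end - test_start) + 1

print(c(train_days = train_days, test_days = test_days))
```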

Dataset variables

user_list.csv

user_list.csv is the master list of users in the dataset.
Total number of records in user_list.csv is 22,873.

Column Name	Description	Type	Note
USER_ID_hash	User ID	CHAR
REG_DATE	Registration date	CHAR	Sign up date
SEX_ID	Gender	CHAR	f=female, m=male
AGE	Age	NUMBER
WITHDRAW_DATE	Date of withdrawal	CHAR
PREF_NAME	Residential Prefecture	CHAR	Japanese

coupon_list_train.csv and coupon_list_test.csv

coupon_list_train.csv is the master list of coupons which are considered part of the training set.
Total number of records in coupon_list_train.csv is 19,413.

coupon_list_test.csv is the master list of coupons which are considered part of the test set.
Total number of records in coupon_list_test.csv is 310. Predictions for this project will be sourced from those 310 coupons.

Column Name	Description	Type	Note
CAPSULE_TEXT	Coupon text	CHAR	Japanese
GENRE_NAME	Category name	CHAR	Japanese
PRICE_RATE	Discount Rate	INT
CATALOG_PRICE	List price	INT
DISCOUNT_PRICE	Discount price	INT
DISPFROM	Sales release date	CHAR
DISPEND	Sales end date	CHAR
DISPPERIOD	Sales period(day)	INT
VALIDFROM	The term of validity starts	DATE
VALIDEND	The term of validity ends	DATE
VALIDPERIOD	Validity period(day)	INT
USABLE_DATE_MON	Is available on Monday	INT
USABLE_DATE_TUE	Is available on Tuesday	INT
USABLE_DATE_WED	Is available on Wednesday	INT
USABLE_DATE_THU	Is available on Thursday	INT
USABLE_DATE_FRI	Is available on Friday	INT
USABLE_DATE_SAT	Is available on Saturday	INT
USABLE_DATE_SUN	Is available on Sunday	INT
USABLE_DATE_HOLIDAY	Is available on holiday	INT
USABLE_DATE_BEFORE_HOLIDAY	Is available on the day before holiday	INT
large_area_name	Large area name of shop location	CHAR	Japanese
ken_name	Prefecture name of shop	CHAR	Japanese
small_area_name	Small area name of shop location	CHAR	Japanese
COUPON_ID_hash	Coupon ID	CHAR

coupon_visit_train.csv

coupon_visit_train.csv is the viewing log of users browsing coupons during the training set time period.
Total number of records in coupon_visit_train.csv is 2,833,180.

Column Name	Description	Type	Note
PURCHASE_FLG	Purchased flag	INT	0: Not purchased, 1: Purchased
PURCHASEID_hash	Purchase ID	CHAR
I_DATE	View date	CHAR	Purchase date if purchased
PAGE_SERIAL	Page serial	INT
REFERRER_hash	Referrer	CHAR
VIEW_COUPON_ID_hash	Browsing coupon ID	CHAR
USER_ID_hash	User ID	CHAR
SESSION_ID_hash	Session ID	CHAR

coupon_detail_train.csv

coupon_detail_train.csv is the purchase log of users buying coupons during the training set time period.
Total number of records in coupon_detail_train.csv is 168,996.

Column Name	Description	Type	Note
ITEM_COUNT	Purchased item count	INT
I_DATE	Purchase date	CHAR
SMALL_AREA_NAME	Small area name	CHAR	Japanese
PURCHASEID_hash	Purchase ID	CHAR
USER_ID_hash	User ID	CHAR
COUPON_ID_hash	Coupon ID	CHAR

Approach

Step 1 Download datasets:

    *Install required packages and load libraries in R
    *Import datasets in R

Step 2 Data transformation (deriving and casting)

user_list:

Data deriving

Translate Japanese text in PREF_NAME into English and save it into new variable PREF_NAME_en:

user_list<-read.csv("user_list.csv", as.is=TRUE)
user_list$PREF_NAME[(user_list$PREF_NAME=="")] <- NA #replace empty cells in PREF_NAME with NA
user_list_jp<-unique(user_list$PREF_NAME)
user_list_en<-c("NA", "Tokyo", "Aichi Prefecture", "Kanagawa Prefecture", 
                "Hiroshima Prefecture", "Saitama Prefecture", "Nara Prefecture",
                "Ishikawa Prefecture", "Osaka prefecture",
                "Kumamoto Prefecture", "Fukuoka Prefecture", "Hokkaido", "Kyoto", 
                "Akita", "Chiba Prefecture", "Nagasaki Prefecture", 
                "Hyogo Prefecture", "Okinawa","Mie", "Ibaraki Prefecture", 
                "Kagoshima prefecture", "Miyagi Prefecture", "Shizuoka Prefecture", 
                "Wakayama Prefecture", "Nagano Prefecture", "Okayama Prefecture", 
                "Tochigi Prefecture","Shiga Prefecture", "Toyama Prefecture", 
                "Saga Prefecture", "Miyazaki Prefecture", "Iwate Prefecture", 
                "Niigata Prefecture", "Oita Prefecture", "Yamaguchi Prefecture", 
                "Gifu Prefecture","Gunma Prefecture", "Fukushima Prefecture", 
                "Ehime Prefecture", "Kagawa Prefecture", "Yamanashi Prefecture", 
                "Kochi Prefecture", "Shimane Prefecture", "Tokushima Prefecture", 
                "Fukui Prefecture","Aomori Prefecture", "Yamagata Prefecture", 
                "Tottori Prefecture")
translation_user_list<-data.frame(user_list_jp, user_list_en, stringsAsFactors=FALSE)

names(translation_user_list)<-c("Japanese", "PREF_NAME_en")
user_list<-merge(user_list, translation_user_list, by.x="PREF_NAME", by.y="Japanese", all.x=TRUE)
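
The translation above pairs the English names with the Japanese values by position, so it relies on the order returned by unique(). A minimal alternative sketch keys the lookup on the Japanese text itself. Only three entries are shown; the Japanese spellings and the toy data frame toy_users are assumptions for illustration and should be checked against the actual values in PREF_NAME.

```r
# Explicit Japanese -> English lookup keyed by the Japanese text.
# Only three entries are shown; a real table would list every prefecture.
pref_lookup <- c(
  "東京都"   = "Tokyo",
  "神奈川県" = "Kanagawa Prefecture",
  "大阪府"   = "Osaka prefecture"
)

# Hypothetical stand-in for user_list with one empty PREF_NAME.
toy_users <- data.frame(
  PREF_NAME = c("大阪府", "", "東京都", "東京都"),
  stringsAsFactors = FALSE
)
toy_users$PREF_NAME[toy_users$PREF_NAME == ""] <- NA

# Indexing the named vector by the Japanese value returns the English name;
# missing values stay NA and the row order is preserved.
toy_users$PREF_NAME_en <- unname(pref_lookup[toy_users$PREF_NAME])
print(toy_users$PREF_NAME_en)
```

Unlike merge(), this keeps the rows in their original order, and a Japanese value that is missing from the lookup shows up as NA instead of silently shifting the other translations.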

Data casting - changing type of the variables for further analysis

Variable REG_DATE is a character string that contains date and time in the form “2011-05-24 08:42:15”. It is split into 2 variables: Date_REG_DATE (Date format) and Time_REG_DATE (Factor):

#characters 1-10 hold the date, character 11 is the space, characters 12-19 hold the time
user_list$Date_REG_DATE<-substring(user_list$REG_DATE, 1,10)
user_list$Time_REG_DATE<-substring(user_list$REG_DATE, 12, 19)
user_list$Date_REG_DATE<-as.Date(user_list$Date_REG_DATE)
user_list$Time_REG_DATE<-as.factor(user_list$Time_REG_DATE)
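
As a quick check of the character positions on the example timestamp from the text, the sketch below splits it with substring() and, as an alternative, parses it in one step. The time zone "UTC" is an assumption made only so that the example is reproducible; the source does not state the time zone of the logs.

```r
stamp <- "2011-05-24 08:42:15"

# Characters 1-10 are the date, character 11 is the space, 12-19 are the time.
date_part <- substring(stamp, 1, 10)
time_part <- substring(stamp, 12, 19)

# Alternative: parse the full timestamp once, then format the parts.
# tz = "UTC" is an assumption for reproducibility only.
parsed <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(date_part)
print(time_part)
print(format(parsed, "%Y-%m-%d"))
print(format(parsed, "%H:%M:%S"))
```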

Variable WITHDRAW_DATE is a character string that contains date and time. It is split into 2 variables: Date_WITHDRAW_DATE (Date format) and Time_WITHDRAW_DATE (Factor):

user_list$Date_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 1,10)
user_list$Time_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 12, 19)
user_list$Date_WITHDRAW_DATE<-as.Date(user_list$Date_WITHDRAW_DATE)
user_list$Time_WITHDRAW_DATE<-as.factor(user_list$Time_WITHDRAW_DATE)

Binary variable SEX_ID is a character and is converted into a factor with 2 levels:

user_list$SEX_ID<-as.factor(user_list$SEX_ID)

New variable AGE_GROUPS is created from AGE for further analysis:

user_list$AGE_GROUPS<-cut(user_list$AGE, breaks=c(14,24,34,44,54,64,74,84), 
                labels=c("14-23", "24-33","34-43", "44-53", "54-63", "64-73", "74-83"),
                right=FALSE) #left-closed intervals [14,24), [24,34), ... match the labels
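
How cut() assigns values that fall exactly on a break depends on its right argument. The sketch below uses the same breaks and labels on a few made-up boundary ages to compare the default (right-closed) intervals with left-closed ones.

```r
ages   <- c(14, 23, 24, 33, 34, 83)  # illustrative boundary values
breaks <- c(14, 24, 34, 44, 54, 64, 74, 84)
labels <- c("14-23", "24-33", "34-43", "44-53", "54-63", "64-73", "74-83")

# Default right = TRUE: intervals are (14,24], (24,34], ...
# so 14 falls outside every interval and 24 lands in "14-23".
right_closed <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: intervals are [14,24), [24,34), ...
# so every age lands in the group whose label contains it.
left_closed <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

print(data.frame(ages, right_closed, left_closed))
```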

The outcome of this step is the addition of the new variables Date_REG_DATE, Time_REG_DATE, AGE_GROUPS, Date_WITHDRAW_DATE, Time_WITHDRAW_DATE and PREF_NAME_en, so the user_list dataset has the following variables:

Column Name	Description	Type	Note
USER_ID_hash	User ID	CHAR
REG_DATE	Registration date	CHAR	Sign up date
Date_REG_DATE	Registration date	DATE
Time_REG_DATE	Registration time	Factor
SEX_ID	Gender	Factor	f=female, m=male
AGE	Age	NUMBER
AGE_GROUPS	Age groups	Factor
WITHDRAW_DATE	Date of withdrawal	CHAR
Date_WITHDRAW_DATE	Date of withdrawal	DATE
Time_WITHDRAW_DATE	Time of withdrawal	Factor
PREF_NAME	Residential Prefecture	CHAR	Japanese
PREF_NAME_en	Residential Prefecture	CHAR	English

coupon_list_train.csv and coupon_list_test.csv:

Data deriving

Translate Japanese text in CAPSULE_TEXT, GENRE_NAME, large_area_name, ken_name, small_area_name into English and save it into new corresponding variables CAPSULE_TEXT_en, GENRE_NAME_en, large_area_name_en, ken_name_en, small_area_name_en in training and test data sets:

#For train set:
train<-read.csv("coupon_list_train.csv", as.is=TRUE) #read csv file, 
#argument as.is=TRUE suppresses conversion of character vectors to factors
#The English vector must list translations in the same order as the unique() Japanese values
translation<-data.frame(Japanese=unique(c(train$CAPSULE_TEXT, train$GENRE_NAME, 
train$large_area_name, train$ken_name, train$small_area_name)), 
                English=c("Food", "Hair salon", "Spa", "Relaxation","Beauty", 
                          "Nail and eye salon","Delivery service","Lesson",
                          "Gift card","Other coupon","Leisure",
                          "Hotel and Japanese hotel","Health and medical","Other",
                          "Hotel","Japanese hotel","Vacation rental","Lodge",
                          "Resort inn","Guest house","Japanese guest house",
                          "Public hotel","Beauty","Event","Web service","Class",
                          "Correspondence course","Kanto","Kansai","Tokai",
                          "Hokkaido","Kyushu-Okinawa","Tohoku","Shikoku",
                          "Chugoku","Hokushinetsu","Saitama Prefecture",
                          "Chiba Prefecture","Tokyo","Kyoto","Aichi Prefecture",
                          "Kanagawa Prefecture","Fukuoka Prefecture",
                          "Tochigi Prefecture","Osaka prefecture","Miyagi Prefecture",
                          "Fukushima Prefecture","Oita Prefecture","Kochi Prefecture",
                          "Hiroshima Prefecture","Niigata Prefecture",
                          "Okayama Prefecture","Ehime Prefecture","Kagawa Prefecture",
                          "Tokushima Prefecture","Hyogo Prefecture","Gifu Prefecture",
                          "Miyazaki Prefecture","Nagasaki Prefecture", 
                          "Ishikawa Prefecture","Yamagata Prefecture","Shizuoka Prefecture",
                          "Aomori Prefecture", "Okinawa","Akita","Nagano Prefecture",
                          "Iwate Prefecture","Kumamoto Prefecture",
                          "Yamaguchi Prefecture","Saga Prefecture","Nara Prefecture",
                          "Mie","Gunma Prefecture","Wakayama Prefecture",
                          "Yamanashi Prefecture","Tottori Prefecture","Kagoshima prefecture",
                          "Fukui Prefecture","Shiga Prefecture","Toyama Prefecture",
                          "Shimane Prefecture","Ibaraki Prefecture","Saitama","Chiba",
                          "Shinjuku, Takadanobaba Nakano - Kichijoji","Kyoto",
                          "Ebisu, Meguro Shinagawa","Ginza Shinbashi, Tokyo, Ueno",
                          "Aichi","Kawasaki, Shonan-Hakone other","Fukuoka","Tochigi",
                          "Minami other","Shibuya, Aoyama, Jiyugaoka",
                          "Ikebukuro Kagurazaka-Akabane","Akasaka, Roppongi, Azabu",
                          "Yokohama","Miyagi","Fukushima","Oita","Kochi",
                          "Tachikawa Machida, Hachioji other","Hiroshima","Niigata",
                          "Okayama","Ehime","Kagawa","Northern","Tokushima","Hyogo",
                          "Gifu","Miyazaki","Nagasaki","Ishikawa","Yamagata",
                          "Shizuoka","Aomori","Okinawa","Akita","Nagano","Iwate",
                          "Kumamoto","Yamaguchi","Saga","Nara","Mie","Gunma",
                          "Wakayama","Yamanashi","Tottori","Kagoshima","Fukui",
                          "Shiga","Toyama","Shimane","Ibaraki"), 
                stringsAsFactors=FALSE) #stringsAsFactors=FALSE 
                                        #because we have some text fields

# Merge translated data with original data column by column:

names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
train<-merge(train, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
train<-merge(train, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
train<-merge(train, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
train<-merge(train, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
train<-merge(train, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)

#For test set:
test<-read.csv("coupon_list_test.csv", as.is=TRUE)
names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
test<-merge(test, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
test<-merge(test, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
test<-merge(test, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
test<-merge(test, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
test<-merge(test, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)

Data casting - changing type of the variables for further analysis

Variables DISPFROM and DISPEND are character type and contain date and time components.

Each variable is split into 2 variables: Date_DISPFROM, Date_DISPEND (Date format) and Time_DISPFROM, Time_DISPEND (Factor):

#For train set (characters 1-10 hold the date, 12-19 the time):
train$Date_DISPFROM<-substring(train$DISPFROM, 1,10)
train$Time_DISPFROM<-substring(train$DISPFROM, 12, 19)
train$Date_DISPFROM<-as.Date(train$Date_DISPFROM)
train$Time_DISPFROM<-as.factor(train$Time_DISPFROM)

train$Date_DISPEND<-substring(train$DISPEND, 1,10)
train$Time_DISPEND<-substring(train$DISPEND, 12, 19)
train$Date_DISPEND<-as.Date(train$Date_DISPEND)
train$Time_DISPEND<-as.factor(train$Time_DISPEND)

#For test set:
test$Date_DISPFROM<-substring(test$DISPFROM, 1,10)
test$Time_DISPFROM<-substring(test$DISPFROM, 12, 19)
test$Date_DISPFROM<-as.Date(test$Date_DISPFROM)
test$Time_DISPFROM<-as.factor(test$Time_DISPFROM)

test$Date_DISPEND<-substring(test$DISPEND, 1,10)
test$Time_DISPEND<-substring(test$DISPEND, 12, 19)
test$Date_DISPEND<-as.Date(test$Date_DISPEND)
test$Time_DISPEND<-as.factor(test$Time_DISPEND)

Variables VALIDFROM (date when the coupon becomes valid) and VALIDEND (date when the coupon expires) are character type and contain the date. They are converted into date format:

train$VALIDFROM<-as.Date(train$VALIDFROM)
train$VALIDEND<-as.Date(train$VALIDEND)


test$VALIDFROM<-as.Date(test$VALIDFROM)
test$VALIDEND<-as.Date(test$VALIDEND)

Variable PRICE_RATE describes discount rate for every coupon. PRICE_RATE_GROUPS was derived from it for the sake of future visualization:

#Create interval groups for train$PRICE_RATE
#(right=FALSE with include.lowest=TRUE gives [0,20), [20,40), ..., [80,100], matching the labels):
train$PRICE_RATE_GROUPS<-cut(train$PRICE_RATE, 
                        breaks=c(0,20,40,60, 80,100), 
                        labels=c("0-19","20-39", "40-59", "60-79", "80-100"),
                        right=FALSE, include.lowest=TRUE)


#Create interval groups for test$PRICE_RATE:
test$PRICE_RATE_GROUPS<-cut(test$PRICE_RATE, 
                            breaks=c(0,20,40,60, 80,100), 
                            labels=c("0-19","20-39", "40-59", "60-79", "80-100"),
                            right=FALSE, include.lowest=TRUE)

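The outcome of this step is the addition of the new variables CAPSULE_TEXT_en, GENRE_NAME_en, PRICE_RATE_GROUPS, Date_DISPFROM, Time_DISPFROM, Date_DISPEND, Time_DISPEND, large_area_name_en, ken_name_en and small_area_name_en, so the train and test datasets have the following variables: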

Column Name	Description	Type	Note
CAPSULE_TEXT	Coupon text	CHAR	Japanese
CAPSULE_TEXT_en	Coupon text	CHAR	English
GENRE_NAME	Category name	CHAR	Japanese
GENRE_NAME_en	Category name	CHAR	English
PRICE_RATE	Discount Rate	INT
PRICE_RATE_GROUPS	Grouped discount rates	Factor
CATALOG_PRICE	List price	INT
DISCOUNT_PRICE	Discount price	INT
DISPFROM	Sales release date	CHAR
Date_DISPFROM	Sales release date	DATE
Time_DISPFROM	Sales release time	Factor
DISPEND	Sales end date	CHAR
Date_DISPEND	Sales end date	DATE
Time_DISPEND	Sales end time	Factor
DISPPERIOD	Sales period(day)	INT
VALIDFROM	The term of validity starts	DATE
VALIDEND	The term of validity ends	DATE
VALIDPERIOD	Validity period(day)	INT
USABLE_DATE_MON	Is available on Monday	INT
USABLE_DATE_TUE	Is available on Tuesday	INT
USABLE_DATE_WED	Is available on Wednesday	INT
USABLE_DATE_THU	Is available on Thursday	INT
USABLE_DATE_FRI	Is available on Friday	INT
USABLE_DATE_SAT	Is available on Saturday	INT
USABLE_DATE_SUN	Is available on Sunday	INT
USABLE_DATE_HOLIDAY	Is available on holiday	INT
USABLE_DATE_BEFORE_HOLIDAY	Is available on the day before holiday	INT
large_area_name	Large area name of shop location	CHAR	Japanese
large_area_name_en	Large area name of shop location	CHAR	English
ken_name	Prefecture name of shop	CHAR	Japanese
ken_name_en	Prefecture name of shop	CHAR	English
small_area_name	Small area name of shop location	CHAR	Japanese
small_area_name_en	Small area name of shop location	CHAR	English
COUPON_ID_hash	Coupon ID	CHAR

coupon_visit_train.csv

Data casting - changing type of the variables for further analysis

Variable PURCHASE_FLG is a binary variable that describes whether the coupon was purchased (1) or not (0). It is converted into a factor with 2 levels and saved as PURCHASE_FLG_f.

coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE)
coupon_visit_train$PURCHASE_FLG_f<-as.factor(coupon_visit_train$PURCHASE_FLG)

Variable I_DATE describes the view date of the coupon (the purchase date if purchased). It is character type and contains date and time components. It is split into 2 variables: Date_I_DATE and Time_I_DATE (both POSIXlt format):

coupon_visit_train$Date_I_DATE<-substring(coupon_visit_train$I_DATE, 1,10)
coupon_visit_train$Time_I_DATE<-substring(coupon_visit_train$I_DATE, 12, 19)
coupon_visit_train$Date_I_DATE<-strptime(coupon_visit_train$Date_I_DATE, format="%Y-%m-%d")
coupon_visit_train$Time_I_DATE<-strptime(coupon_visit_train$Time_I_DATE, format="%H:%M:%S")

Extract the hour of browsing from Time_I_DATE and create a new variable Hour:

coupon_visit_train$Hour<-coupon_visit_train$Time_I_DATE$hour

Variable Weekday was created from Date_I_DATE and describes the day of the week when the coupon was browsed or purchased:

coupon_visit_train$Weekday<-weekdays(coupon_visit_train$Date_I_DATE)
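
The hour and weekday derivation can be checked on a couple of made-up timestamps. Note that weekdays() returns names in the current locale, so this sketch maps the numeric wday field to English names instead. The timestamps, the "UTC" time zone and the day_names vector are assumptions for illustration.

```r
# Hypothetical timestamps in the same layout as I_DATE.
stamps <- c("2011-09-10 13:29:18", "2012-01-01 00:05:00")

# tz = "UTC" is an assumption so the result does not depend on the machine.
parsed <- as.POSIXlt(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# POSIXlt exposes the components directly: hour is 0-23, wday is 0-6 (0 = Sunday).
hour <- parsed$hour
wday <- parsed$wday

# Locale-independent weekday names.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday_name <- day_names[wday + 1]

print(data.frame(stamps, hour, weekday_name))
```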

The outcome of this step is the addition of the new variables PURCHASE_FLG_f, Date_I_DATE, Time_I_DATE, Weekday and Hour, so the coupon_visit_train dataset has the following variables:

Column Name	Description	Type	Note
PURCHASE_FLG	Purchased flag	INT	0: Not purchased, 1: Purchased
PURCHASE_FLG_f	Purchased flag	Factor	0: Not purchased, 1: Purchased
PURCHASEID_hash	Purchase ID	CHAR
I_DATE	View date	CHAR	Purchase date if purchased
Date_I_DATE	View date	POSIXlt format	Purchase date if purchased
Time_I_DATE	View time	POSIXlt format	Purchase time if purchased
Weekday	Day of week	CHAR
Hour	Hour	INT
PAGE_SERIAL	Page serial	INT
REFERRER_hash	Referrer	CHAR
VIEW_COUPON_ID_hash	Browsing coupon ID	CHAR
USER_ID_hash	User ID	CHAR
SESSION_ID_hash	Session ID	CHAR

coupon_detail_train

Data deriving

Translate Japanese text in SMALL_AREA_NAME into English and save it into new variable SMALL_AREA_NAME_en:

coupon_detail_train<-read.csv("coupon_detail_train.csv", as.is=TRUE)
names(translation)<-c("Japanese", "SMALL_AREA_NAME_en")
coupon_detail_train<-merge(coupon_detail_train, translation, by.x="SMALL_AREA_NAME", 
                           by.y="Japanese", all.x=TRUE)

Data casting - changing type of the variables for further analysis

Variable I_DATE is character type and contains date and time components. It is split into 2 variables: Date_I_DATE (Date format) and Time_I_DATE (Factor):

coupon_detail_train$Date_I_DATE<-substring(coupon_detail_train$I_DATE, 1,10)
coupon_detail_train$Time_I_DATE<-substring(coupon_detail_train$I_DATE, 12, 19)
coupon_detail_train$Date_I_DATE<-as.Date(coupon_detail_train$Date_I_DATE)
coupon_detail_train$Time_I_DATE<-as.factor(coupon_detail_train$Time_I_DATE)

Step 3 Exploratory Data Analysis:

    *Examine summaries of the data
    *Check for missing data
    *Explore relationship between variables

for user_list:

Variable SEX_ID describes gender of the users.

counts_SEX_ID<-table(user_list$SEX_ID) # Build and save contingency table
barplot(counts_SEX_ID, names.arg=c("Females", "Males"), 
        main= "Sex distribution among website users ", 
        ylab="Website users", ylim=c(0,20000))
axis(side=2, at=c(0, 5000, 10000, 15000,20000))
text(x=0.7, y=7000, "48%", col="pink", cex=2)
text(1.9,7000, "52%", col="blue", cex=2)
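
The percentage labels in the plot are typed in by hand. As a sketch of how such labels could be derived from the contingency table itself, the example below uses a small made-up factor (toy_sex) in place of user_list$SEX_ID.

```r
# Hypothetical stand-in for user_list$SEX_ID.
toy_sex <- factor(c("f", "m", "m", "f", "m"))

counts <- table(toy_sex)

# prop.table() turns counts into shares that sum to 1.
pct <- round(100 * prop.table(counts), 1)

# Ready-made text labels, one per bar, in the same order as the table.
pct_labels <- paste0(pct, "%")
print(pct_labels)
```

Since barplot() returns the x positions of the bar midpoints, those positions together with labels like pct_labels can be passed to text() instead of hand-picked coordinates.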

Variable AGE describes the age of the users. To visualize it, the new variable AGE_GROUPS was introduced.

counts_AGE_GROUPS<-table(user_list$AGE_GROUPS) # Build and save contingency table
barplot(counts_AGE_GROUPS, names.arg=c("14-23", "24-33","34-43", "44-53", 
        "54-63", "64-73", "74-83"), 
        main= "Age distribution among website users ", 
        ylab="Website users", ylim=c(0,10000))
axis(side=2, at=c(0, 2000, 4000, 6000,8000, 10000))
#add percentages to each age group on the plot: (counts_AGE_GROUPS/22873)*100
text(x=0.8, y=600, "4%", col="blue", cex=1)
text(x=2, y=3000, "24%", col="blue", cex=1)
text(x=3.1, y=4000, "31.3%", col="blue", cex=1)
text(x=4.2, y=3000, "23.8%", col="blue", cex=1)
text(x=5.5, y=2000, "12.8%", col="blue", cex=1)
text(x=6.8, y=600, "3.6%", col="blue", cex=1)
text(x=8, y=600, "0.5%", col="blue", cex=1)

Variable PREF_NAME_en describes the residential prefecture of the user. Among the 22,873 users in user_list, the prefecture name was NA for 7,256 users (about 32%). Among users whose residential prefecture is available, Tokyo is the most common. The ten most frequent values of PREF_NAME_en (including NA) are:

sort(table (user_list$PREF_NAME_en), decreasing=TRUE)[1:10]

## 
##                  NA               Tokyo Kanagawa Prefecture 
##                7256                2830                1653 
##    Osaka prefecture    Aichi Prefecture    Hyogo Prefecture 
##                1638                 938                 879 
##  Saitama Prefecture    Chiba Prefecture  Fukuoka Prefecture 
##                 874                 835                 731 
##            Hokkaido 
##                 628

for train and test:

Variable PRICE_RATE describes the discount rate for every coupon in %. To visualize it, the new variable PRICE_RATE_GROUPS was introduced.

#for training set:    
counts_PRICE_RATE_GROUPS<-table(train$PRICE_RATE_GROUPS) # Build and save contingency table

barplot(counts_PRICE_RATE_GROUPS, 
        names.arg=c("0-19","20-39", "40-59", "60-79", "80-100"),
        main= "Coupons discount rate in training set ", 
        ylab="Number of coupons", 
        xlab="Discount rate in %",
        ylim=c(0, 14000))
axis(side=2, at=c(0,4000, 8000, 12000))
#add percentages to each discount group on the plot: 
#(counts_PRICE_RATE_GROUPS/19413)*100
text(x=0.8, y=2000, "0.03%", col="blue", cex=1)
text(x=2, y=2000, "0.41%", col="blue", cex=1)
text(x=3.1, y=6000, "66.80%", col="blue", cex=1)
text(x=4.2, y=4000, "27.00%", col="blue", cex=1)
text(x=5.5, y=3000, "5.76%", col="blue", cex=1)

Variable DISPPERIOD describes the number of days a coupon is displayed on the website.

summary(train$DISPPERIOD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.167   4.000  36.000

plot(table(train$DISPPERIOD),  ylab="Number of coupons",
                        xlab="Display period in days",
                        main="Coupons display period")

Variable VALIDPERIOD describes the number of days during which a coupon is valid for use.

summary(train$VALIDPERIOD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      89     128     126     177     179    6147

plot(table(train$VALIDPERIOD),  ylab="Number of coupons",
     xlab="Validity period in days",
     main="Coupons validity period")

Variable large_area_name_en describes the large area name of the shop location.

for coupon_visit_train

Variable PURCHASE_FLG is a binary variable that describes whether a purchase was made (1) or not (0).

coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE) #Read in data
counts<-table(coupon_visit_train$PURCHASE_FLG) #Build and save contingency table

Create a barplot to visualise the numbers of viewed and purchased coupons:

barplot(counts, names.arg=c("Viewed coupons", "Purchased coupons"), 
        main= "Browsed and purchased coupons during training time period", 
        ylab="Users browsing coupons, train set, *10^5", 
        ylim=c(0,3000000), yaxt="n")
axis(side=2, at=c(0, 1000000, 2000000,3000000),  labels=c(0, 10, 20,30))
text(x=0.7, y=2000000, "95.7%", col="blue", cex=2)
text(1.9,1000000, "4.3%", col="red", cex=2)

Variable Hour was derived from Time_I_DATE and describes the hour of the day when the coupon was browsed.

Step 4 Build the recommender system on a training set

I plan to try cosine similarity and regression approaches. …to be continued…

Step 5 Model evaluation

To evaluate the model, Mean Average Precision will be used.
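
A minimal sketch of Mean Average Precision for ranked recommendation lists follows. The cutoff k, the normalization by min(number of purchased coupons, k) and the function names apk and mapk are assumptions based on a common convention for this metric; the exact variant used for this project is not specified here.

```r
# Average precision at k for a single user.
# actual:    coupons the user really purchased (order irrelevant)
# predicted: recommended coupons, best first
apk <- function(actual, predicted, k = 10) {
  predicted <- head(unique(predicted), k)
  if (length(actual) == 0 || length(predicted) == 0) return(0)
  hits <- predicted %in% actual
  # Precision after each position, counted only where a hit occurs.
  precision_at_i <- cumsum(hits) / seq_along(hits)
  sum(precision_at_i * hits) / min(length(actual), k)
}

# Mean of the per-user average precisions.
mapk <- function(actual_list, predicted_list, k = 10) {
  mean(mapply(apk, actual_list, predicted_list, MoreArgs = list(k = k)))
}

# Toy example with two users and made-up coupon IDs.
actual    <- list(c("a", "b"), "c")
predicted <- list(c("a", "x", "b"), c("y", "c"))
score <- mapk(actual, predicted, k = 3)
print(score)
```

For the first toy user the hits at positions 1 and 3 give (1 + 2/3) / 2, for the second user the hit at position 2 gives 1/2, and the reported score is the mean of the two.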