Ponpare is a leading Japanese coupon site. Japan is a strong e-commerce market with an Internet penetration rate of almost 80%. According to the “Cross-border Ecommerce Report - Japan”, 75% of the population have purchased products online, contributing to e-commerce sales of 104.7 billion USD (JPY 11.2 trillion).
Primary goal: Using past purchase and browsing behavior, predict which coupons a customer will buy in a given period of time.
Secondary goal: Conduct exploratory analysis on transactional data for 22,873 users on Ponpare.
Motivation: The resulting models will be used to improve Ponpare’s recommendation system, so they can make sure their customers don’t miss out on their next favorite thing.
Solution: Develop a recommender by calculating the dot product of the user profile and the item profile and computing the cosine distance between the two (user-based and item-based collaborative filtering); implement a matrix factorization model based on the Bayesian Personalized Ranking (BPR) framework.
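As a minimal sketch of the profile-matching idea above, the snippet below scores two coupons against one user with cosine similarity (one minus the cosine distance). The feature names and profile values are made up for illustration and are not taken from the Ponpare data.

```r
# Cosine similarity between two numeric vectors of equal length:
# the dot product divided by the product of the vector norms.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical profiles over three made-up features
user_profile <- c(food = 2, spa = 0, hotel = 1)
coupon_food  <- c(food = 1, spa = 0, hotel = 0)
coupon_spa   <- c(food = 0, spa = 1, hotel = 0)

# Score each coupon against the user and rank from most to least similar
scores <- c(food = cosine_similarity(user_profile, coupon_food),
            spa  = cosine_similarity(user_profile, coupon_spa))
ranked <- names(sort(scores, decreasing = TRUE))
```

In this toy case the food coupon ranks first because the user profile has no weight on the spa feature.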
The number of Internet users engaged in e-commerce continues to increase worldwide. The amount of information encountered by the average e-commerce user is overwhelming. It becomes a big problem for customers to find what they are actually looking for, as opposed to what is pushed through ads. Search engines allow users to filter for relevant results; however, personalized searches are not available [1]. A promising solution is a recommender system, a class of Web applications that predict user responses to options. A good-quality recommender system has to satisfy users by providing effective recommendations, as well as improve sales and attract new customers for e-commerce businesses.
There are two types of recommender systems: collaborative filtering-based recommenders use only consumer-product interaction data, while content-based recommenders examine properties of the products. In e-commerce, most recommendation algorithms take three types of data as input: user attributes, product attributes, and previous interactions between user and product (buying, browsing, etc.) [2]. The most popular prediction algorithms employed in e-commerce applications are [3]:
Baselines (constant, random, overall average, user average, item average)
Memory-based neighborhood algorithms (user-based, item-based collaborative filtering)
Matrix Factorization methods (SVD, NMF, PMF, Bayesian PMF, Non-linear PMF) [4].
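To make the simplest entries in the list above concrete, here is a sketch of the average-based baselines on a toy table of explicit ratings. The data frame and the helper `predict_item_avg` are hypothetical; the Ponpare logs record views and purchases rather than ratings.

```r
# Toy explicit ratings (hypothetical; not Ponpare data)
ratings <- data.frame(
  user   = c("u1", "u1", "u2", "u2", "u3"),
  item   = c("i1", "i2", "i1", "i3", "i2"),
  rating = c(5, 3, 4, 2, 1),
  stringsAsFactors = FALSE
)

# Baseline predictors: one global mean, one mean per user, one mean per item
overall_avg <- mean(ratings$rating)
user_avg    <- tapply(ratings$rating, ratings$user, mean)
item_avg    <- tapply(ratings$rating, ratings$item, mean)

# Predict with the item average, falling back to the overall average
# for items that have no ratings yet
predict_item_avg <- function(item) {
  if (item %in% names(item_avg)) item_avg[[item]] else overall_avg
}
```

The fallback for unseen items is one possible design choice; it matters in cold-start settings where new items have no history.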
The performance of recommendation algorithms depends on the data characteristics [5]. Some algorithms may work better or worse depending on the amount of missing data (sparsity), the distribution of ratings, and the number of users and items. Most algorithms perform better on a dense (unreduced) dataset. In addition, different evaluation methods can produce conflicting orderings over algorithms [6]. The Ponpare coupon prediction problem follows a cold-start scenario with user and product (coupon) attributes present. User attributes (e.g. gender, age, geographical location) were extracted from log files; coupon attributes (e.g. coupon category, price, discount rate) were provided.
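To give a rough sense of sparsity for this dataset, the sketch below uses the record counts reported in this document (22,873 users, 19,413 training coupons, 168,996 purchase records). Treating every purchase record as a distinct user-coupon pair is an assumption, so the result is only an upper bound on density.

```r
# Record counts as reported for user_list, coupon_list_train and coupon_detail_train
n_users     <- 22873
n_coupons   <- 19413
n_purchases <- 168996

# Fraction of the user x coupon purchase matrix that could be non-empty
# (upper bound: repeated purchases of the same coupon would lower it)
density_upper  <- n_purchases / (n_users * n_coupons)
sparsity_lower <- 1 - density_upper
```

Under that assumption more than 99.9% of the user-coupon purchase matrix is empty.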
We received users' transactional data sourced from Ponpare and downloaded from Kaggle. The training set spans 2011-07-01 to 2012-06-23, covering 359 days of customer activity. The test set spans the following week, 2012-06-24 to 2012-06-30, covering seven days of customer activity.
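The day counts above can be checked with Date arithmetic. This is only a quick sanity check; both end dates are counted as part of the period.

```r
# Inclusive number of days between two dates
days_inclusive <- function(from, to) {
  as.integer(as.Date(to) - as.Date(from)) + 1L
}

train_days <- days_inclusive("2011-07-01", "2012-06-23")
test_days  <- days_inclusive("2012-06-24", "2012-06-30")
```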
user_list.csv is the master list of users in the dataset.
Total number of records in user_list.csv is 22,873.
| Column Name | Description | Type | Note |
|---|---|---|---|
| USER_ID_hash | User ID | CHAR | |
| REG_DATE | Registration date | CHAR | Sign up date |
| SEX_ID | Gender | CHAR | f=female, m=male |
| AGE | Age | NUMBER | |
| WITHDRAW_DATE | Date of withdrawal | CHAR | |
| PREF_NAME | Residential Prefecture | CHAR | Japanese |
coupon_list_train.csv is the master list of coupons which are considered part of the training set.
Total number of records in coupon_list_train.csv is 19,413.
coupon_list_test.csv is the master list of coupons which are considered part of the test set.
Total number of records in coupon_list_test.csv is 310. Predictions for this project will be sourced from those 310 coupons.
| Column Name | Description | Type | Note |
|---|---|---|---|
| CAPSULE_TEXT | Coupon text | CHAR | Japanese |
| GENRE_NAME | Category name | CHAR | Japanese |
| PRICE_RATE | Discount Rate | INT | |
| CATALOG_PRICE | List price | INT | |
| DISCOUNT_PRICE | Discount price | INT | |
| DISPFROM | Sales release date | CHAR | |
| DISPEND | Sales end date | CHAR | |
| DISPPERIOD | Sales period(day) | INT | |
| VALIDFROM | The term of validity starts | DATE | |
| VALIDEND | The term of validity ends | DATE | |
| VALIDPERIOD | Validity period(day) | INT | |
| USABLE_DATE_MON | Is available on Monday | INT | |
| USABLE_DATE_TUE | Is available on Tuesday | INT | |
| USABLE_DATE_WED | Is available on Wednesday | INT | |
| USABLE_DATE_THU | Is available on Thursday | INT | |
| USABLE_DATE_FRI | Is available on Friday | INT | |
| USABLE_DATE_SAT | Is available on Saturday | INT | |
| USABLE_DATE_SUN | Is available on Sunday | INT | |
| USABLE_DATE_HOLIDAY | Is available on holiday | INT | |
| USABLE_DATE_BEFORE_HOLIDAY | Is available on the day before holiday | INT | |
| large_area_name | Large area name of shop location | CHAR | Japanese |
| ken_name | Prefecture name of shop | CHAR | Japanese |
| small_area_name | Small area name of shop location | CHAR | Japanese |
| COUPON_ID_hash | Coupon ID | CHAR | |
coupon_visit_train.csv is the viewing log of users browsing coupons during the training set time period.
Total number of records in coupon_visit_train.csv is 2,833,180.
| Column Name | Description | Type | Note |
|---|---|---|---|
| PURCHASE_FLG | Purchased flag | INT | 0: Not purchased, 1: Purchased |
| PURCHASEID_hash | Purchase ID | CHAR | |
| I_DATE | View date | CHAR | Purchase date if purchased |
| PAGE_SERIAL | Page serial | INT | |
| REFERRER_hash | Referrer | CHAR | |
| VIEW_COUPON_ID_hash | Browsing coupon ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| SESSION_ID_hash | Session ID | CHAR | |
coupon_detail_train.csv is the purchase log of users buying coupons during the training set time period. Total number of records in coupon_detail_train is 168,996.
| Column Name | Description | Type | Note |
|---|---|---|---|
| ITEM_COUNT | Purchased item count | INT | |
| I_DATE | Purchase date | CHAR | |
| SMALL_AREA_NAME | Small area name | CHAR | Japanese |
| PURCHASEID_hash | Purchase ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| COUPON_ID_hash | Coupon ID | CHAR | |
Install required packages and connect to libraries in R
* Import datasets in R
Translate the Japanese text in PREF_NAME into English and save it in a new variable PREF_NAME_en:
user_list<-read.csv("user_list.csv", as.is=TRUE)
user_list$PREF_NAME[(user_list$PREF_NAME=="")] <- NA #replace empty cells in PREF_NAME with NA
user_list_jp<-unique(user_list$PREF_NAME)
user_list_en<-c("NA", "Tokyo", "Aichi Prefecture", "Kanagawa Prefecture",
"Hiroshima Prefecture", "Saitama Prefecture", "Nara Prefecture",
"Ishikawa Prefecture", "Osaka prefecture",
"Kumamoto Prefecture", "Fukuoka Prefecture", "Hokkaido", "Kyoto",
"Akita", "Chiba Prefecture", "Nagasaki Prefecture",
"Hyogo Prefecture", "Okinawa","Mie", "Ibaraki Prefecture",
"Kagoshima prefecture", "Miyagi Prefecture", "Shizuoka Prefecture",
"Wakayama Prefecture", "Nagano Prefecture", "Okayama Prefecture",
"Tochigi Prefecture","Shiga Prefecture", "Toyama Prefecture",
"Saga Prefecture", "Miyazaki Prefecture", "Iwate Prefecture",
"Niigata Prefecture", "Oita Prefecture", "Yamaguchi Prefecture",
"Gifu Prefecture","Gunma Prefecture", "Fukushima Prefecture",
"Ehime Prefecture", "Kagawa Prefecture", "Yamanashi Prefecture",
"Kochi Prefecture", "Shimane Prefecture", "Tokushima Prefecture",
"Fukui Prefecture","Aomori Prefecture", "Yamagata Prefecture",
"Tottori Prefecture")
translation_user_list<-data.frame(user_list_jp, user_list_en, stringsAsFactors=FALSE)
names(translation_user_list)<-c("Japanese", "PREF_NAME_en")
user_list<-merge(user_list, translation_user_list, by.x="PREF_NAME", by.y="Japanese", all.x=TRUE)
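The merge above relies on the hand-typed English vector being in the same order as unique(user_list$PREF_NAME), and merge() may also reorder rows. One alternative, shown here as a sketch on toy data with placeholder keys (not the real prefecture strings), is a named lookup vector:

```r
# Toy stand-in for user_list; the keys are placeholders, not real Japanese text
users <- data.frame(
  id        = 1:4,
  PREF_NAME = c("tokyo_jp", NA, "osaka_jp", "tokyo_jp"),
  stringsAsFactors = FALSE
)

# Each translation is tied to its key by name, so the order of entries is irrelevant
lookup <- c(osaka_jp = "Osaka prefecture", tokyo_jp = "Tokyo")

# Indexing by the key column keeps the original row order; missing keys give NA
users$PREF_NAME_en <- unname(lookup[users$PREF_NAME])
```

Unlike the merge, this leaves users with a missing prefecture as a true NA rather than the string "NA".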
Variable REG_DATE is a character string containing date and time in the form “2011-05-24 08:42:15”. It is split into 2 variables: Date_REG_DATE (Date format) and Time_REG_DATE (Factor):
# characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
user_list$Date_REG_DATE<-substring(user_list$REG_DATE, 1, 10)
user_list$Time_REG_DATE<-substring(user_list$REG_DATE, 12, 19)
user_list$Date_REG_DATE<-as.Date(user_list$Date_REG_DATE)
user_list$Time_REG_DATE<-as.factor(user_list$Time_REG_DATE)
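As a sketch of how a timestamp in this format breaks down, the snippet below slices the example string by character position and, as an alternative, parses it once and derives the parts from the parsed value. The UTC time zone is an assumption made only so the example is reproducible; the source does not state a time zone.

```r
stamp <- "2011-05-24 08:42:15"

# Character positions: 1-10 date, 11 separating space, 12-19 time
date_chr <- substring(stamp, 1, 10)
time_chr <- substring(stamp, 12, 19)

# Alternative: parse the full timestamp, then extract the pieces
parsed    <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
date_part <- as.Date(parsed, tz = "UTC")
hour_part <- as.integer(format(parsed, "%H"))
```

Parsing has the side benefit that malformed timestamps become NA instead of silently producing odd substrings.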
Variable WITHDRAW_DATE is a character string containing date and time. It is split into 2 variables: Date_WITHDRAW_DATE (Date format) and Time_WITHDRAW_DATE (Factor):
# characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
user_list$Date_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 1, 10)
user_list$Time_WITHDRAW_DATE<-substring(user_list$WITHDRAW_DATE, 12, 19)
user_list$Date_WITHDRAW_DATE<-as.Date(user_list$Date_WITHDRAW_DATE)
user_list$Time_WITHDRAW_DATE<-as.factor(user_list$Time_WITHDRAW_DATE)
Binary variable SEX_ID is a character and is converted into a factor with 2 levels:
user_list$SEX_ID<-as.factor(user_list$SEX_ID)
New variable AGE_GROUPS is created from AGE for further analysis:
# cut() uses right-closed intervals by default, (14,24], (24,34], ..., so the labels name those bins
user_list$AGE_GROUPS<-cut(user_list$AGE, breaks=c(14,24,34,44,54,64,74,84),
labels=c("15-24", "25-34","35-44", "45-54", "55-64", "65-74", "75-84"))
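How `cut()` assigns values that fall exactly on a break depends on its `right` argument. The sketch below uses a shortened set of breaks and a few made-up ages to show the difference; the label vectors are illustrative only.

```r
ages   <- c(14, 23, 24, 33, 34)
breaks <- c(14, 24, 34, 44)

# Default (right = TRUE): intervals are (14,24], (24,34], (34,44]
right_closed <- cut(ages, breaks = breaks,
                    labels = c("15-24", "25-34", "35-44"))

# right = FALSE: intervals are [14,24), [24,34), [34,44)
left_closed <- cut(ages, breaks = breaks, right = FALSE,
                   labels = c("14-23", "24-33", "34-43"))
```

With the default, an age of 14 falls outside every interval and becomes NA, while an age of 24 lands in the first bin; with `right = FALSE`, 14 is in the first bin and 24 moves to the second.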
The outcome of this step is the addition of new variables (in bold), so the user_list dataset has the following variables:
| Column Name | Description | Type | Note |
|---|---|---|---|
| USER_ID_hash | User ID | CHAR | |
| REG_DATE | Registration date | CHAR | Sign up date |
| **Date_REG_DATE** | Registration date | DATE | |
| **Time_REG_DATE** | Registration time | Factor | |
| SEX_ID | Gender | Factor | f=female, m=male |
| AGE | Age | NUMBER | |
| **AGE_GROUPS** | Age groups | Factor | |
| WITHDRAW_DATE | Date of withdrawal | CHAR | |
| **Date_WITHDRAW_DATE** | Date of withdrawal | DATE | |
| **Time_WITHDRAW_DATE** | Time of withdrawal | Factor | |
| PREF_NAME | Residential Prefecture | CHAR | Japanese |
| **PREF_NAME_en** | Residential Prefecture | CHAR | English |
Translate the Japanese text in CAPSULE_TEXT, GENRE_NAME, large_area_name, ken_name, and small_area_name into English and save it in the corresponding new variables CAPSULE_TEXT_en, GENRE_NAME_en, large_area_name_en, ken_name_en, and small_area_name_en in the training and test data sets:
#For train set:
train<-read.csv("coupon_list_train.csv", as.is=TRUE) #read csv file,
#argument as.is=TRUE suppresses conversion of character vectors to factors
translation<-data.frame(Japanese=unique(c(train$CAPSULE_TEXT, train$GENRE_NAME,
train$large_area_name, train$ken_name, train$small_area_name)),
English=c("Food", "Hair salon", "Spa", "Relaxation","Beauty",
"Nail and eye salon","Delivery service","Lesson",
"Gift card","Other coupon","Leisure",
"Hotel and Japanese hotel","Health and medical","Other",
"Hotel","Japanese hotel","Vacation rental","Lodge",
"Resort inn","Guest house","Japanese guest house",
"Public hotel","Beauty","Event","Web service","Class",
"Correspondence course","Kanto","Kansai","Tokai",
"Hokkaido","Kyushu-Okinawa","Tohoku","Shikoku",
"Chugoku","Hokushinetsu","Saitama Prefecture",
"Chiba Prefecture","Tokyo","Kyoto","Aichi Prefecture",
"Kanagawa Prefecture","Fukuoka Prefecture",
"Tochigi Prefecture","Osaka prefecture","Miyagi Prefecture",
"Fukushima Prefecture","Oita Prefecture","Kochi Prefecture",
"Hiroshima Prefecture","Niigata Prefecture",
"Okayama Prefecture","Ehime Prefecture","Kagawa Prefecture",
"Tokushima Prefecture","Hyogo Prefecture","Gifu Prefecture",
"Miyazaki Prefecture","Nagasaki Prefecture",
"Ishikawa Prefecture","Yamagata Prefecture","Shizuoka Prefecture",
"Aomori Prefecture", "Okinawa","Akita","Nagano Prefecture",
"Iwate Prefecture","Kumamoto Prefecture",
"Yamaguchi Prefecture","Saga Prefecture","Nara Prefecture",
"Mie","Gunma Prefecture","Wakayama Prefecture",
"Yamanashi Prefecture","Tottori Prefecture","Kagoshima prefecture",
"Fukui Prefecture","Shiga Prefecture","Toyama Prefecture",
"Shimane Prefecture","Ibaraki Prefecture","Saitama","Chiba",
"Shinjuku, Takadanobaba Nakano - Kichijoji","Kyoto",
"Ebisu, Meguro Shinagawa","Ginza Shinbashi, Tokyo, Ueno",
"Aichi","Kawasaki, Shonan-Hakone other","Fukuoka","Tochigi",
"Minami other","Shibuya, Aoyama, Jiyugaoka",
"Ikebukuro Kagurazaka-Akabane","Akasaka, Roppongi, Azabu",
"Yokohama","Miyagi","Fukushima","Oita","Kochi",
"Tachikawa Machida, Hachioji other","Hiroshima","Niigata",
"Okayama","Ehime","Kagawa","Northern","Tokushima","Hyogo",
"Gifu","Miyazaki","Nagasaki","Ishikawa","Yamagata",
"Shizuoka","Aomori","Okinawa","Akita","Nagano","Iwate",
"Kumamoto","Yamaguchi","Saga","Nara","Mie","Gunma",
"Wakayama","Yamanashi","Tottori","Kagoshima","Fukui",
"Shiga","Toyama","Shimane","Ibaraki"),
stringsAsFactors=FALSE) #stringsAsFactors=FALSE
#because we have some text fields
# Merge translated data with original data column by column:
names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
train<-merge(train, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
train<-merge(train, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
train<-merge(train, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
train<-merge(train, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
train<-merge(train, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)
#For test set:
test<-read.csv("coupon_list_test.csv", as.is=TRUE)
names(translation)<-c("Japanese", "CAPSULE_TEXT_en")
test<-merge(test, translation, by.x="CAPSULE_TEXT", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "GENRE_NAME_en")
test<-merge(test, translation, by.x="GENRE_NAME", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "large_area_name_en")
test<-merge(test, translation, by.x="large_area_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "ken_name_en")
test<-merge(test, translation, by.x="ken_name", by.y="Japanese", all.x=TRUE)
names(translation)<-c("Japanese", "small_area_name_en")
test<-merge(test, translation, by.x="small_area_name", by.y="Japanese", all.x=TRUE)
Variables DISPFROM and DISPEND are character type and contain date and time components.
Each variable is split into 2 variables: Date_DISPFROM and Date_DISPEND (Date format), and Time_DISPFROM and Time_DISPEND (Factor):
# characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
train$Date_DISPFROM<-substring(train$DISPFROM, 1, 10)
train$Time_DISPFROM<-substring(train$DISPFROM, 12, 19)
train$Date_DISPFROM<-as.Date(train$Date_DISPFROM)
train$Time_DISPFROM<-as.factor(train$Time_DISPFROM)
train$Date_DISPEND<-substring(train$DISPEND, 1, 10)
train$Time_DISPEND<-substring(train$DISPEND, 12, 19)
train$Date_DISPEND<-as.Date(train$Date_DISPEND)
train$Time_DISPEND<-as.factor(train$Time_DISPEND)
test$Date_DISPFROM<-substring(test$DISPFROM, 1, 10)
test$Time_DISPFROM<-substring(test$DISPFROM, 12, 19)
test$Date_DISPFROM<-as.Date(test$Date_DISPFROM)
test$Time_DISPFROM<-as.factor(test$Time_DISPFROM)
test$Date_DISPEND<-substring(test$DISPEND, 1, 10)
test$Time_DISPEND<-substring(test$DISPEND, 12, 19)
test$Date_DISPEND<-as.Date(test$Date_DISPEND)
test$Time_DISPEND<-as.factor(test$Time_DISPEND)
Variables VALIDFROM (date when the coupon becomes valid) and VALIDEND (date when the coupon expires) are character type and contain a date. They are converted into Date format:
train$VALIDFROM<-as.Date(train$VALIDFROM)
train$VALIDEND<-as.Date(train$VALIDEND)
test$VALIDFROM<-as.Date(test$VALIDFROM)
test$VALIDEND<-as.Date(test$VALIDEND)
Variable PRICE_RATE describes the discount rate for every coupon. PRICE_RATE_GROUPS was derived from it for the sake of future visualization:
#Create interval groups for train$PRICE_RATE.
#cut() uses right-closed intervals by default, (0,20], (20,40], ..., so the labels name those bins:
train$PRICE_RATE_GROUPS<-cut(train$PRICE_RATE,
breaks=c(0,20,40,60, 80,100),
labels=c("1-20","21-40", "41-60", "61-80", "81-100"))
#Create interval groups for test$PRICE_RATE:
test$PRICE_RATE_GROUPS<-cut(test$PRICE_RATE,
breaks=c(0,20,40,60, 80,100),
labels=c("1-20","21-40", "41-60", "61-80", "81-100"))
The outcome of this step is the addition of new variables (in bold), so the train and test datasets have the following variables:
| Column Name | Description | Type | Note |
|---|---|---|---|
| CAPSULE_TEXT | Coupon text | CHAR | Japanese |
| **CAPSULE_TEXT_en** | Coupon text | CHAR | English |
| GENRE_NAME | Category name | CHAR | Japanese |
| **GENRE_NAME_en** | Category name | CHAR | English |
| PRICE_RATE | Discount Rate | INT | |
| **PRICE_RATE_GROUPS** | Grouped discount rates | Factor | |
| CATALOG_PRICE | List price | INT | |
| DISCOUNT_PRICE | Discount price | INT | |
| DISPFROM | Sales release date | CHAR | |
| **Date_DISPFROM** | Sales release date | DATE | |
| **Time_DISPFROM** | Sales release time | Factor | |
| DISPEND | Sales end date | CHAR | |
| **Date_DISPEND** | Sales end date | DATE | |
| **Time_DISPEND** | Sales end time | Factor | |
| DISPPERIOD | Sales period (days) | INT | |
| VALIDFROM | The term of validity starts | DATE | |
| VALIDEND | The term of validity ends | DATE | |
| VALIDPERIOD | Validity period (days) | INT | |
| USABLE_DATE_MON | Is available on Monday | INT | |
| USABLE_DATE_TUE | Is available on Tuesday | INT | |
| USABLE_DATE_WED | Is available on Wednesday | INT | |
| USABLE_DATE_THU | Is available on Thursday | INT | |
| USABLE_DATE_FRI | Is available on Friday | INT | |
| USABLE_DATE_SAT | Is available on Saturday | INT | |
| USABLE_DATE_SUN | Is available on Sunday | INT | |
| USABLE_DATE_HOLIDAY | Is available on holiday | INT | |
| USABLE_DATE_BEFORE_HOLIDAY | Is available on the day before holiday | INT | |
| large_area_name | Large area name of shop location | CHAR | Japanese |
| **large_area_name_en** | Large area name of shop location | CHAR | English |
| ken_name | Prefecture name of shop | CHAR | Japanese |
| **ken_name_en** | Prefecture name of shop | CHAR | English |
| small_area_name | Small area name of shop location | CHAR | Japanese |
| **small_area_name_en** | Small area name of shop location | CHAR | English |
| COUPON_ID_hash | Coupon ID | CHAR | |
Variable PURCHASE_FLG is a binary variable that describes whether the coupon was purchased (1) or not (0). It is converted into a factor with 2 levels and saved as PURCHASE_FLG_f.
coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE)
coupon_visit_train$PURCHASE_FLG_f<-as.factor(coupon_visit_train$PURCHASE_FLG)
Variable I_DATE describes the view date of the coupon (the purchase date if purchased). It is character type and contains date and time components. It is split into 2 variables, Date_I_DATE and Time_I_DATE (both POSIXlt):
# characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
coupon_visit_train$Date_I_DATE<-substring(coupon_visit_train$I_DATE, 1, 10)
coupon_visit_train$Time_I_DATE<-substring(coupon_visit_train$I_DATE, 12, 19)
coupon_visit_train$Date_I_DATE<-strptime(coupon_visit_train$Date_I_DATE, format="%Y-%m-%d")
coupon_visit_train$Time_I_DATE<-strptime(coupon_visit_train$Time_I_DATE, format="%H:%M:%S")
Extract the hour of browsing from Time_I_DATE and create a new variable Hour:
coupon_visit_train$Hour<-coupon_visit_train$Time_I_DATE$hour
Variable Weekday was created from Date_I_DATE and describes the day of the week when the coupon was browsed or purchased:
coupon_visit_train$Weekday<-weekdays(coupon_visit_train$Date_I_DATE)
The outcome of this step is the addition of new variables (in bold), so the coupon_visit_train dataset has the following variables:
| Column Name | Description | Type | Note |
|---|---|---|---|
| PURCHASE_FLG | Purchased flag | INT | 0: Not purchased, 1: Purchased |
| **PURCHASE_FLG_f** | Purchased flag | Factor | 0: Not purchased, 1: Purchased |
| PURCHASEID_hash | Purchase ID | CHAR | |
| I_DATE | View date | CHAR | Purchase date if purchased |
| **Date_I_DATE** | View date | POSIXlt | Purchase date if purchased |
| **Time_I_DATE** | View time | POSIXlt | Purchase time if purchased |
| **Weekday** | Day of week | CHAR | |
| **Hour** | Hour | INT | |
| PAGE_SERIAL | Page serial | INT | |
| REFERRER_hash | Referrer | CHAR | |
| VIEW_COUPON_ID_hash | Browsing coupon ID | CHAR | |
| USER_ID_hash | User ID | CHAR | |
| SESSION_ID_hash | Session ID | CHAR | |
Translate the Japanese text in SMALL_AREA_NAME into English and save it in a new variable SMALL_AREA_NAME_en:
coupon_detail_train<-read.csv("coupon_detail_train.csv", as.is=TRUE)
names(translation)<-c("Japanese", "SMALL_AREA_NAME_en")
coupon_detail_train<-merge(coupon_detail_train, translation, by.x="SMALL_AREA_NAME",
by.y="Japanese", all.x=TRUE)
Variable I_DATE is character type and contains date and time components. It is split into 2 variables: Date_I_DATE (Date format) and Time_I_DATE (Factor):
# characters 1-10 hold the date, character 11 is the separating space, 12-19 hold the time
coupon_detail_train$Date_I_DATE<-substring(coupon_detail_train$I_DATE, 1, 10)
coupon_detail_train$Time_I_DATE<-substring(coupon_detail_train$I_DATE, 12, 19)
coupon_detail_train$Date_I_DATE<-as.Date(coupon_detail_train$Date_I_DATE)
coupon_detail_train$Time_I_DATE<-as.factor(coupon_detail_train$Time_I_DATE)
* Examine summaries of the data
* Check for missing data
* Explore relationships between variables
Variable SEX_ID describes gender of the users.
counts_SEX_ID<-table(user_list$SEX_ID) # Build and save contingency table
barplot(counts_SEX_ID, names.arg=c("Females", "Males"),
main= "Sex distribution among website users ",
ylab="Website users", ylim=c(0,20000))
axis(side=2, at=c(0, 5000, 10000, 15000,20000))
text(x=0.7, y=7000, "48%", col="pink", cex=2)
text(1.9,7000, "52%", col="blue", cex=2)
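The percentage labels above are typed in by hand. As a sketch of an alternative, they could be derived from the contingency table with `prop.table()` and positioned using the bar midpoints that `barplot()` returns. The toy factor below stands in for user_list$SEX_ID.

```r
# Toy stand-in for user_list$SEX_ID
sex    <- factor(c("f", "m", "m", "f", "m"))
counts <- table(sex)

# Share of each level in percent, rounded to one decimal place
pct_labels <- paste0(round(100 * prop.table(counts), 1), "%")

# Draw only in an interactive session; barplot() returns the bar midpoints
if (interactive()) {
  mids <- barplot(counts, names.arg = c("Females", "Males"),
                  ylab = "Website users")
  text(x = mids, y = as.numeric(counts) / 2, labels = pct_labels)
}
```

Computing the labels this way keeps the plot in sync with the data if the input file changes.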
Variable AGE describes the age of the users. To visualize it, the new variable AGE_GROUPS was introduced.
counts_AGE_GROUPS<-table(user_list$AGE_GROUPS) # Build and save contingency table
# bin labels match cut()'s default right-closed intervals: (14,24], (24,34], ...
barplot(counts_AGE_GROUPS, names.arg=c("15-24", "25-34","35-44", "45-54",
"55-64", "65-74", "75-84"),
main= "Age distribution among website users ",
ylab="Website users", ylim=c(0,10000))
axis(side=2, at=c(0, 2000, 4000, 6000,8000, 10000))
#add percentages to each age group on the plot: (counts_AGE_GROUPS/22873)*100
text(x=0.8, y=600, "4%", col="blue", cex=1)
text(x=2, y=3000, "24%", col="blue", cex=1)
text(x=3.1, y=4000, "31.3%", col="blue", cex=1)
text(x=4.2, y=3000, "23.8%", col="blue", cex=1)
text(x=5.5, y=2000, "12.8%", col="blue", cex=1)
text(x=6.8, y=600, "3.6%", col="blue", cex=1)
text(x=8, y=600, "0.5%", col="blue", cex=1)
Variable PREF_NAME_en describes the residential prefecture of the user. Of the 22,873 users in user_list, the prefecture name was missing (NA) for 7,256 users (about 32%). Among users whose residential prefecture was available, the largest group resides in Tokyo. The ten most frequent values of PREF_NAME_en (NA followed by the top nine prefectures in terms of number of customers) are:
sort(table (user_list$PREF_NAME_en), decreasing=TRUE)[1:10]
##
## NA Tokyo Kanagawa Prefecture
## 7256 2830 1653
## Osaka prefecture Aichi Prefecture Hyogo Prefecture
## 1638 938 879
## Saitama Prefecture Chiba Prefecture Fukuoka Prefecture
## 874 835 731
## Hokkaido
## 628
Variable PRICE_RATE describes the discount rate for every coupon in %. To visualize it, the new variable PRICE_RATE_GROUPS was introduced.
#for training set:
counts_PRICE_RATE_GROUPS<-table(train$PRICE_RATE_GROUPS) # Build and save contingency table
barplot(counts_PRICE_RATE_GROUPS,
names.arg=c("1-20","21-40", "41-60", "61-80", "81-100"), # cut()'s right-closed bins
main= "Coupons discount rate in training set ",
ylab="Number of coupons",
xlab="Discount rate in %",
ylim=c(0, 14000))
axis(side=2, at=c(0,4000, 8000, 12000))
#add percentages to each discount group on the plot:
#(counts_PRICE_RATE_GROUPS/19413)*100
text(x=0.8, y=2000, "0.03%", col="blue", cex=1)
text(x=2, y=2000, "0.41%", col="blue", cex=1)
text(x=3.1, y=6000, "66.80%", col="blue", cex=1)
text(x=4.2, y=4000, "27.00%", col="blue", cex=1)
text(x=5.5, y=3000, "5.76%", col="blue", cex=1)
Variable DISPPERIOD describes the number of days a coupon is displayed on the website.
summary(train$DISPPERIOD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.167 4.000 36.000
plot(table(train$DISPPERIOD), ylab="Number of coupons",
xlab="Display period in days",
main="Coupons display period")
Variable VALIDPERIOD describes the number of days during which a coupon is valid for use.
summary(train$VALIDPERIOD)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 89 128 126 177 179 6147
plot(table(train$VALIDPERIOD), ylab="Number of coupons",
xlab="Validity period in days",
main="Coupons validity period")
Variable large_area_name_en describes the large area name of the shop location.
Variable PURCHASE_FLG is a binary variable that describes whether a purchase was made (1) or not (0).
coupon_visit_train<-read.csv("coupon_visit_train.csv", as.is=TRUE) #Read in data
counts<-table(coupon_visit_train$PURCHASE_FLG) #Build and save contingency table
Create a barplot to visualise the numbers of viewed and purchased coupons:
barplot(counts, names.arg=c("Viewed coupons", "Purchased coupons"), main= "Browsed and purchased coupons during training time period", ylab="Users browsing coupons, train set, *10^5", ylim=c(0,3000000), yaxt="n")
axis(side=2, at=c(0, 1000000, 2000000,3000000), labels=c(0, 10, 20,30))
text(x=0.7, y=2000000, "95.7%", col="blue", cex=2)
text(1.9,1000000, "4.3%", col="red", cex=2)
Variable Hour was derived from Time_I_DATE and describes the hour of the day when the coupon was browsed.
I plan to try cosine similarity and regression approaches. …to be continued…
To evaluate the model, Mean Average Precision will be used.
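A minimal sketch of Mean Average Precision for ranked recommendation lists is given below. The cutoff `k = 10`, the function names, and the toy coupon IDs are assumptions for illustration; the text above does not fix a cutoff.

```r
# Average precision at k for one user.
# actual:    coupons the user really bought
# predicted: recommended coupons, best first
average_precision_at_k <- function(actual, predicted, k = 10) {
  predicted <- head(unique(predicted), k)
  if (length(actual) == 0) return(0)
  hits <- predicted %in% actual
  if (!any(hits)) return(0)
  # precision at each rank, summed over the ranks where a hit occurs
  precision_at_i <- cumsum(hits) / seq_along(hits)
  sum(precision_at_i[hits]) / min(length(actual), k)
}

# Mean of the per-user average precisions
mean_average_precision <- function(actual_list, predicted_list, k = 10) {
  mean(mapply(average_precision_at_k, actual_list, predicted_list,
              MoreArgs = list(k = k)))
}

# Toy example with two users and made-up coupon IDs
actual    <- list(c("a", "b"), c("c"))
predicted <- list(c("a", "x", "b"), c("x", "c"))
ap_user1  <- average_precision_at_k(actual[[1]], predicted[[1]])
map_score <- mean_average_precision(actual, predicted)
```

For the first toy user the hits at ranks 1 and 3 give (1 + 2/3) / 2 = 5/6; the second user scores 1/2, so the mean is 2/3.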