Getting datasets to work on is often tricky. Collecting data takes time and money, and even after that effort the amount of data collected is often small. There is also a need to try different models on a dataset whose characteristics are similar to those of the actual data, so that the chosen model can then be applied to the actual dataset. Very often, variations of an analysis can be tested on two datasets, one simulated and one actual.
We start by setting the seed of the random number generator so that the simulation is reproducible. Here the seed is 19800, but it could be any number. The dataset describes 2000 customers of a store; customers visit the store and also make online purchases, provided they share their email IDs with the store.
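# Fix the seed so that every run produces the same simulated data
set.seed(19800)
# Number of customers to simulate
noofcust <- 2000
# One row per customer, identified by a factor-valued id
cust.df <- data.frame(cust.id = as.factor(1:noofcust))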
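As a minimal sketch of why the seed matters, the snippet below draws a few values twice with the same seed and once with a different one. The helper `draw_ages` and the second seed `12345` are hypothetical names and values used only for this illustration.

```r
# Hypothetical helper: reset the seed, then draw five ages
draw_ages <- function(seed) {
  set.seed(seed)
  rnorm(n = 5, mean = 32, sd = 4)
}

first.run  <- draw_ages(19800)
second.run <- draw_ages(19800)
other.seed <- draw_ages(12345)

identical(first.run, second.run)  # same seed, so the draws repeat
identical(first.run, other.seed)  # different seed, so the draws differ
```

Re-running the whole script after `set.seed(19800)` should therefore reproduce the same simulated customers, at least within the same version of R.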
Next, the variables age, credit score, presence or absence of an email address on record, and distance to the store are defined.
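# Age: normally distributed around 32 years
cust.df$age <- rnorm(n = noofcust, mean = 32, sd = 4)
# Credit score: normal, with a mean that rises by 3 points per year of age
cust.df$credit.score <- rnorm(n = noofcust, mean = 3 * cust.df$age + 500, sd = 45)
# Email on record: "yes" with probability 0.7, "no" with probability 0.3
cust.df$email <- factor(sample(c("yes", "no"), size = noofcust, replace = TRUE, prob = c(0.7, 0.3)))
# Distance to store: log-normal (exponential of a normal draw), always positive
cust.df$distance.to.store <- exp(rnorm(n = noofcust, mean = 2, sd = 1.2))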
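One way to check that the simulated variables behave as intended is to recover the built-in relationships from the data. The sketch below is self-contained and regenerates only age, credit score and distance, so its draws will not match `cust.df` exactly; the names `n`, `fit` and `distance` are assumptions made for this illustration.

```r
set.seed(19800)
n <- 2000

age <- rnorm(n, mean = 32, sd = 4)
credit.score <- rnorm(n, mean = 3 * age + 500, sd = 45)

# A linear fit should give an intercept near 500 and a slope near 3
fit <- lm(credit.score ~ age)
coef(fit)

# A log-normal variable is right-skewed, so its mean exceeds its median
distance <- exp(rnorm(n, mean = 2, sd = 1.2))
c(median = median(distance), mean = mean(distance))
```

The estimates will not equal 500 and 3 exactly, because the noise term has a standard deviation of 45.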
The online visits, online transactions and online spend are also simulated, along with store transactions and store spend. Different distributions are assumed on the basis of common knowledge of such events: the normal, log-normal, negative binomial and binomial distributions are used.
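# Online visits: negative binomial; the mean is higher with an email on record and lower for older customers
cust.df$online.visits <- rnbinom(noofcust, size = 0.3, mu = 15 + ifelse(cust.df$email == "yes", 15, 0) - 0.7 * (cust.df$age - median(cust.df$age)))
# Online transactions: each visit leads to a purchase with probability 0.3
cust.df$online.trans <- rbinom(noofcust, size = cust.df$online.visits, prob = 0.3)
# Online spend: a log-normal amount per transaction times the number of transactions
cust.df$online.spend <- exp(rnorm(noofcust, mean = 3, sd = 0.1)) * cust.df$online.trans
# Store transactions: negative binomial, with a mean that falls as distance grows
cust.df$store.trans <- rnbinom(noofcust, size = 5, mu = 3 / sqrt(cust.df$distance.to.store))
# Store spend: a log-normal amount per transaction times the number of transactions
cust.df$store.spend <- exp(rnorm(noofcust, mean = 3.5, sd = 0.4)) * cust.df$store.trans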
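The negative binomial is a natural choice for visit counts because it allows far more spread than a Poisson with the same mean. In the `mu`/`size` parameterisation of `rnbinom`, the variance is `mu + mu^2 / size`. The sketch below illustrates this with a fixed mean of 15 and `size = 0.3`; the sample size `n` and the variable names are assumptions for this example and are not part of the dataset.

```r
set.seed(19800)
n <- 100000
mu <- 15
size <- 0.3

visits <- rnbinom(n, size = size, mu = mu)

# Compare the sample moments with the theoretical variance mu + mu^2 / size
c(sample.mean = mean(visits),
  sample.var = var(visits),
  theoretical.var = mu + mu^2 / size)

# A small size value also produces many zero counts
mean(visits == 0)
```

This is why the summary of `online.visits` can show a first quartile of zero alongside a very large maximum.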
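Customer satisfaction scores are also created. Overall satisfaction cannot be observed directly, so it is simulated first; satisfaction with service and with the selection of products is then derived from it by adding noise and rounding down. Satisfaction scores are on a scale of 1 to 7, so any score above 7 is converted to 7 and any score below 1 is converted to 1. Finally, some customers are marked as non-respondents, with a probability of non-response that rises with age, and their scores are set to NA.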
sat.overall <- rnorm(noofcust, mean=5.1, sd=0.9)
summary(sat.overall)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.942   4.492   5.077   5.091   5.683   7.876
sat.service <- floor(sat.overall + rnorm(noofcust, mean=-0.1,sd=0.7))
sat.selection <- floor(sat.overall + rnorm(noofcust, mean=-0.2, sd=0.6))
summary(cbind(sat.service, sat.selection))
## sat.service sat.selection
## Min. :1.000 Min. :1.00
## 1st Qu.:4.000 1st Qu.:4.00
## Median :4.000 Median :4.00
## Mean :4.488 Mean :4.37
## 3rd Qu.:5.000 3rd Qu.:5.00
## Max. :8.000 Max. :8.00
sat.service[sat.service > 7] <- 7
sat.service[sat.service < 1] <- 1
sat.selection[sat.selection > 7] <- 7
sat.selection[sat.selection < 1] <- 1
summary(cbind(sat.service, sat.selection))
## sat.service sat.selection
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :4.000 Median :4.000
## Mean :4.484 Mean :4.369
## 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :7.000 Max. :7.000
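The four clipping assignments can also be written as one small helper built on `pmin` and `pmax`. This is a sketch of an alternative, not the approach used above; the function name `clip` and the vector `raw.scores` are hypothetical.

```r
# Hypothetical helper: force every value into the range [lower, upper]
clip <- function(x, lower = 1, upper = 7) {
  pmin(pmax(x, lower), upper)
}

raw.scores <- c(-1, 0, 1, 4, 7, 8, NA)
clip(raw.scores)
# returns 1 1 1 4 7 7 NA
```

Missing values pass through unchanged, which matters once non-response is introduced.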
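# Simulate non-response: the probability of not answering is age / 100, so it rises with age
no.response <- as.logical(rbinom(noofcust, size = 1, prob = cust.df$age / 100))
# Non-respondents have no satisfaction scores
sat.service[no.response] <- NA
sat.selection[no.response] <- NA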
summary(cbind(sat.service, sat.selection))
## sat.service sat.selection
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :4.000 Median :4.000
## Mean :4.477 Mean :4.347
## 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :7.000 Max. :7.000
## NA's :630 NA's :630
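Because the non-response probability is `age / 100`, the overall share of missing scores should be close to the mean age divided by 100, and older customers should be missing more often. The self-contained sketch below checks this on freshly simulated ages; the age bands and the name `age.group` are assumptions chosen for illustration.

```r
set.seed(19800)
n <- 2000

age <- rnorm(n, mean = 32, sd = 4)
no.response <- as.logical(rbinom(n, size = 1, prob = age / 100))

# The overall non-response rate should be near mean(age) / 100
c(observed = mean(no.response), expected = mean(age) / 100)

# Non-response rate within three illustrative age bands
age.group <- cut(age, breaks = c(-Inf, 30, 35, Inf),
                 labels = c("under 30", "30 to 35", "over 35"))
tapply(no.response, age.group, mean)
```

Missingness of this kind depends on an observed variable (age), which is worth keeping in mind when models are later fitted to the satisfaction scores.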
cust.df$sat.service <- sat.service
cust.df$sat.selection <- sat.selection
summary(cust.df)
## cust.id age credit.score email
## 1 : 1 Min. :18.88 Min. :432.8 no : 619
## 2 : 1 1st Qu.:29.23 1st Qu.:563.4 yes:1381
## 3 : 1 Median :31.87 Median :593.3
## 4 : 1 Mean :31.94 Mean :594.6
## 5 : 1 3rd Qu.:34.64 3rd Qu.:625.7
## 6 : 1 Max. :45.39 Max. :751.6
## (Other):1994
## distance.to.store online.visits online.trans online.spend
## Min. : 0.1928 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.1796 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 7.3314 Median : 6.00 Median : 2.000 Median : 37.53
## Mean : 15.8046 Mean : 26.91 Mean : 8.088 Mean : 163.33
## 3rd Qu.: 16.5525 3rd Qu.: 27.00 3rd Qu.: 8.000 3rd Qu.: 156.60
## Max. :1279.5724 Max. :721.00 Max. :228.000 Max. :4384.17
##
## store.trans store.spend sat.service sat.selection
## Min. : 0.000 Min. : 0.00 Min. :1.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:4.000 1st Qu.:4.000
## Median : 1.000 Median : 30.60 Median :4.000 Median :4.000
## Mean : 1.351 Mean : 48.08 Mean :4.477 Mean :4.347
## 3rd Qu.: 2.000 3rd Qu.: 67.08 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :12.000 Max. :1047.75 Max. :7.000 Max. :7.000
## NA's :630 NA's :630
Age and credit score are rounded to whole numbers so that they read like real records.
cust.df$age <- round(cust.df$age)
cust.df$credit.score <- round(cust.df$credit.score)
summary(cust.df)
## cust.id age credit.score email
## 1 : 1 Min. :19.00 Min. :433.0 no : 619
## 2 : 1 1st Qu.:29.00 1st Qu.:563.0 yes:1381
## 3 : 1 Median :32.00 Median :593.0
## 4 : 1 Mean :31.94 Mean :594.6
## 5 : 1 3rd Qu.:35.00 3rd Qu.:626.0
## 6 : 1 Max. :45.00 Max. :752.0
## (Other):1994
## distance.to.store online.visits online.trans online.spend
## Min. : 0.1928 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.1796 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 7.3314 Median : 6.00 Median : 2.000 Median : 37.53
## Mean : 15.8046 Mean : 26.91 Mean : 8.088 Mean : 163.33
## 3rd Qu.: 16.5525 3rd Qu.: 27.00 3rd Qu.: 8.000 3rd Qu.: 156.60
## Max. :1279.5724 Max. :721.00 Max. :228.000 Max. :4384.17
##
## store.trans store.spend sat.service sat.selection
## Min. : 0.000 Min. : 0.00 Min. :1.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:4.000 1st Qu.:4.000
## Median : 1.000 Median : 30.60 Median :4.000 Median :4.000
## Mean : 1.351 Mean : 48.08 Mean :4.477 Mean :4.347
## 3rd Qu.: 2.000 3rd Qu.: 67.08 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :12.000 Max. :1047.75 Max. :7.000 Max. :7.000
## NA's :630 NA's :630