Creating a dataset for experiments

Why create a dataset?

Getting a dataset to work on is often tricky. Collecting data takes time and money, and even after that time, money and effort, the amount of data collected is often small. There is also a need to try different models on a dataset whose characteristics are similar to those of the actual data, so that the chosen model can then be applied to the actual dataset. Very often, variations in an analysis can be tested on two datasets, one simulated and the other actual.

Creating a customer dataset

We start by setting a random number seed so that the simulated data can be reproduced. Let the seed be 19800; it could be any number. The dataset is for a store that customers visit in person; they can also make online purchases, provided they share their email IDs with the store.

set.seed(19800)
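As a minimal sketch of why the seed matters, the snippet below resets the same seed before two separate draws and checks that the draws match. The names first.draw and second.draw are illustrative only and are not used elsewhere in this document.

```r
# Resetting the seed to the same value reproduces the same random draws.
set.seed(19800)
first.draw <- rnorm(5)

set.seed(19800)
second.draw <- rnorm(5)

identical(first.draw, second.draw)  # TRUE
```

Without the second call to set.seed, the two vectors would come from different parts of the random number stream and would differ.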

We create a variable noofcust to define the sample size, and then create the data frame cust.df with one customer ID per row.

noofcust <- 2000
cust.df <- data.frame(cust.id=as.factor(c(1:noofcust)))
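A quick check of the result, as a sketch: the data frame should have one row per customer, and cust.id should be a factor with one level per customer. Here seq_len(noofcust) is used in place of c(1:noofcust); the two give the same IDs for a positive sample size.

```r
# Build the ID column and confirm its size and type.
noofcust <- 2000
cust.df <- data.frame(cust.id = as.factor(seq_len(noofcust)))

nrow(cust.df)              # 2000
nlevels(cust.df$cust.id)   # 2000
is.factor(cust.df$cust.id) # TRUE
```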

Then the variables age, credit score, presence or absence of an email on record, and distance to the store are defined.

cust.df$age <- rnorm(n=noofcust, mean=32, sd=4)
cust.df$credit.score <- rnorm(n=noofcust, mean=3*cust.df$age+500, sd=45)
cust.df$email <- factor(sample(c("yes", "no"), size=noofcust, replace=TRUE,prob=c(0.7, 0.3)))

cust.df$distance.to.store <- exp(rnorm(n=noofcust, mean=2, sd=1.2))
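The definitions above build in two relationships that can be checked, sketched here on a fresh simulation. Credit score is drawn around 3 * age + 500, so a regression of credit score on age should give a slope near 3. Distance is the exponential of a normal variable, that is, log-normal, so its median should be near exp(2) and its mean near exp(2 + 1.2^2 / 2). The standalone names n, age, credit.score and distance are used only for this illustration, and the sampled values will not match cust.df exactly because the random number stream is consumed in a different order.

```r
set.seed(19800)
n <- 2000

# Same generating rules as in the text, on standalone vectors.
age <- rnorm(n, mean = 32, sd = 4)
credit.score <- rnorm(n, mean = 3 * age + 500, sd = 45)
distance <- exp(rnorm(n, mean = 2, sd = 1.2))

# Slope of credit score on age; the generating rule implies a value near 3.
slope <- coef(lm(credit.score ~ age))[["age"]]

# Theoretical median and mean of a log-normal with meanlog = 2, sdlog = 1.2.
theoretical.median <- exp(2)
theoretical.mean <- exp(2 + 1.2^2 / 2)

slope
c(sample.median = median(distance), theoretical.median = theoretical.median)
c(sample.mean = mean(distance), theoretical.mean = theoretical.mean)
```

The mean of a log-normal variable is well above its median, which is why the summary of distance.to.store shows a long right tail.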

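Online visits, online transactions and online spend are simulated next, followed by store transactions and store spend. Different distributions are assumed on the basis of common knowledge of such events. Visits and store transactions follow negative binomial distributions, online transactions follow a binomial distribution conditional on the number of visits, and the spend per transaction follows a log-normal distribution (the exponential of a normal variable).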

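# Online visits: negative binomial; the mean is higher with an email on record and lower for older customers
cust.df$online.visits <- rnbinom(noofcust, size=0.3, mu=15 + ifelse(cust.df$email=="yes", 15, 0) - 0.7 * (cust.df$age - median(cust.df$age)))
# Online transactions: each visit leads to a purchase with probability 0.3
cust.df$online.trans <- rbinom(noofcust, size=cust.df$online.visits, prob=0.3)
# Online spend: log-normal amount per transaction times the number of transactions
cust.df$online.spend <- exp(rnorm(noofcust, mean=3, sd=0.1)) * cust.df$online.trans
# Store transactions: negative binomial; the mean falls with distance to the store
cust.df$store.trans <- rnbinom(noofcust, size=5, mu=3 / sqrt(cust.df$distance.to.store))
# Store spend: log-normal amount per transaction times the number of transactions
cust.df$store.spend <- exp(rnorm(noofcust, mean=3.5, sd=0.4)) * cust.df$store.trans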
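The small size value in the online visits model makes the counts very dispersed. As a sketch: in the mu and size parameterization used by rnbinom, the variance is mu + mu^2 / size. The value mu = 30 below is an illustrative assumption, corresponding to a customer of median age with an email on record (15 + 15).

```r
# Negative binomial in the (mu, size) parameterization used by rnbinom.
size <- 0.3
mu <- 30   # assumed: median-age customer with an email on record

# Variance implied by the parameterization: mu + mu^2 / size.
nb.variance <- mu + mu^2 / size

# Probability of zero visits under this distribution.
p.zero <- dnbinom(0, size = size, mu = mu)

# Compare with a simulated sample.
set.seed(19800)
visits <- rnbinom(2000, size = size, mu = mu)

c(theoretical.variance = nb.variance,
  theoretical.p.zero = p.zero,
  sample.p.zero = mean(visits == 0))
```

A substantial share of zeros combined with a few very large counts is the pattern the negative binomial is meant to produce here: many customers never visit the site, while a few visit very often.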

Customer satisfaction scores are also created. Overall satisfaction cannot be observed directly, so it is simulated as an unobserved variable, sat.overall. Satisfaction with service and with the selection of products is then calculated from it by adding some item-specific noise and rounding down to a whole number. Satisfaction scores are on a scale of 1 to 7. Any score above 7 is converted to 7, and any score below 1 is converted to 1.

sat.overall <- rnorm(noofcust, mean=5.1, sd=0.9)

summary(sat.overall)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.942   4.492   5.077   5.091   5.683   7.876

sat.service <- floor(sat.overall + rnorm(noofcust, mean=-0.1,sd=0.7))
sat.selection <- floor(sat.overall + rnorm(noofcust, mean=-0.2, sd=0.6))
summary(cbind(sat.service, sat.selection))

##   sat.service    sat.selection 
##  Min.   :1.000   Min.   :1.00  
##  1st Qu.:4.000   1st Qu.:4.00  
##  Median :4.000   Median :4.00  
##  Mean   :4.488   Mean   :4.37  
##  3rd Qu.:5.000   3rd Qu.:5.00  
##  Max.   :8.000   Max.   :8.00

sat.service[sat.service > 7] <- 7
sat.service[sat.service < 1] <- 1
sat.selection[sat.selection > 7] <- 7
sat.selection[sat.selection < 1] <- 1
summary(cbind(sat.service, sat.selection))

##   sat.service    sat.selection  
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :4.000   Median :4.000  
##  Mean   :4.484   Mean   :4.369  
##  3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :7.000   Max.   :7.000
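The capping of scores to the 1 to 7 range can also be written in one step with pmax and pmin. This is an alternative sketch, not the code used above; the vector scores is a made-up example.

```r
# Clamp values to the range [1, 7]: pmax raises anything below 1 to 1,
# then pmin lowers anything above 7 to 7.
scores <- c(-1, 0, 1, 4, 7, 8, 9)
clamped <- pmin(pmax(scores, 1), 7)
clamped  # 1 1 1 4 7 7 7
```

Both approaches give the same result; the indexed assignment used above makes each of the two rules explicit, while the pmin/pmax form is more compact.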

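Not every customer answers a satisfaction survey. Non-response is simulated with a probability of age/100, so older customers are less likely to respond, and the scores of non-respondents are set to NA.

no.response <- as.logical(rbinom(noofcust, size=1, prob=cust.df$age/100))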
sat.service[no.response] <- NA
sat.selection[no.response] <- NA
summary(cbind(sat.service, sat.selection))

##   sat.service    sat.selection  
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :4.000   Median :4.000  
##  Mean   :4.477   Mean   :4.347  
##  3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :7.000   Max.   :7.000  
##  NA's   :630     NA's   :630
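The number of missing scores can be compared with what the non-response rule implies. As a sketch on a fresh simulation: each customer is missing with probability age/100, so the expected number of missing values is the sum of age/100, roughly 0.32 * 2000 = 640 for ages centred on 32. The standalone names n and age are used only for this illustration, and the counts will not match the 630 reported above because the random draws differ.

```r
set.seed(19800)
n <- 2000
age <- rnorm(n, mean = 32, sd = 4)

# One Bernoulli draw per customer; TRUE means the customer did not respond.
no.response <- as.logical(rbinom(n, size = 1, prob = age / 100))

# Expected count under the rule versus the count actually drawn.
expected.missing <- sum(age / 100)
observed.missing <- sum(no.response)

c(expected = expected.missing, observed = observed.missing)
```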

cust.df$sat.service <- sat.service
cust.df$sat.selection <- sat.selection
summary(cust.df)

##     cust.id          age         credit.score   email     
##  1      :   1   Min.   :18.88   Min.   :432.8   no : 619  
##  2      :   1   1st Qu.:29.23   1st Qu.:563.4   yes:1381  
##  3      :   1   Median :31.87   Median :593.3             
##  4      :   1   Mean   :31.94   Mean   :594.6             
##  5      :   1   3rd Qu.:34.64   3rd Qu.:625.7             
##  6      :   1   Max.   :45.39   Max.   :751.6             
##  (Other):1994                                             
##  distance.to.store   online.visits     online.trans      online.spend    
##  Min.   :   0.1928   Min.   :  0.00   Min.   :  0.000   Min.   :   0.00  
##  1st Qu.:   3.1796   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:   0.00  
##  Median :   7.3314   Median :  6.00   Median :  2.000   Median :  37.53  
##  Mean   :  15.8046   Mean   : 26.91   Mean   :  8.088   Mean   : 163.33  
##  3rd Qu.:  16.5525   3rd Qu.: 27.00   3rd Qu.:  8.000   3rd Qu.: 156.60  
##  Max.   :1279.5724   Max.   :721.00   Max.   :228.000   Max.   :4384.17  
##                                                                          
##   store.trans      store.spend       sat.service    sat.selection  
##  Min.   : 0.000   Min.   :   0.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:   0.00   1st Qu.:4.000   1st Qu.:4.000  
##  Median : 1.000   Median :  30.60   Median :4.000   Median :4.000  
##  Mean   : 1.351   Mean   :  48.08   Mean   :4.477   Mean   :4.347  
##  3rd Qu.: 2.000   3rd Qu.:  67.08   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :1047.75   Max.   :7.000   Max.   :7.000  
##                                     NA's   :630     NA's   :630

Age and credit score are rounded to whole numbers for readability.

cust.df$age<-round(cust.df$age)
cust.df$credit.score<-round(cust.df$credit.score)
summary(cust.df)

##     cust.id          age         credit.score   email     
##  1      :   1   Min.   :19.00   Min.   :433.0   no : 619  
##  2      :   1   1st Qu.:29.00   1st Qu.:563.0   yes:1381  
##  3      :   1   Median :32.00   Median :593.0             
##  4      :   1   Mean   :31.94   Mean   :594.6             
##  5      :   1   3rd Qu.:35.00   3rd Qu.:626.0             
##  6      :   1   Max.   :45.00   Max.   :752.0             
##  (Other):1994                                             
##  distance.to.store   online.visits     online.trans      online.spend    
##  Min.   :   0.1928   Min.   :  0.00   Min.   :  0.000   Min.   :   0.00  
##  1st Qu.:   3.1796   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:   0.00  
##  Median :   7.3314   Median :  6.00   Median :  2.000   Median :  37.53  
##  Mean   :  15.8046   Mean   : 26.91   Mean   :  8.088   Mean   : 163.33  
##  3rd Qu.:  16.5525   3rd Qu.: 27.00   3rd Qu.:  8.000   3rd Qu.: 156.60  
##  Max.   :1279.5724   Max.   :721.00   Max.   :228.000   Max.   :4384.17  
##                                                                          
##   store.trans      store.spend       sat.service    sat.selection  
##  Min.   : 0.000   Min.   :   0.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:   0.00   1st Qu.:4.000   1st Qu.:4.000  
##  Median : 1.000   Median :  30.60   Median :4.000   Median :4.000  
##  Mean   : 1.351   Mean   :  48.08   Mean   :4.477   Mean   :4.347  
##  3rd Qu.: 2.000   3rd Qu.:  67.08   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :1047.75   Max.   :7.000   Max.   :7.000  
##                                     NA's   :630     NA's   :630

Sanjay Fuloria

July 27, 2018
