Why simulate data?
Have you ever spent an inordinate amount of time looking for the right dataset to try out an analytical technique, only to never quite find what you were looking for?
Well, why not just create your own dataset? Simulating the data yourself gives you maximum flexibility.
This article will walk through how to create synthetic data in R.
We want to simulate weekly sales of a product at a store.
# Install pacman if needed
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
# Load packages (p_load installs any that are missing)
pacman::p_load(pacman,
  tidyverse, openxlsx, psych, ggcorrplot, corrplot, cowplot)
First, set the seed so that the random draws are reproducible and you can re-create the same dataset later.
#Set seed for reproducibility
set.seed(90210)
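To see what the seed buys you, the short sketch below draws the same three random numbers twice. The names first_draw and second_draw are illustrative and not part of the walkthrough. The final line resets the seed so that the results in the rest of the article still match.

```r
# Draw three random numbers right after setting the seed
set.seed(90210)
first_draw <- rnorm(3)

# Set the same seed and draw again: the numbers repeat exactly
set.seed(90210)
second_draw <- rnorm(3)
identical(first_draw, second_draw)
## [1] TRUE

# Reset once more so the later random draws start from the same state
set.seed(90210)
```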
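The dataset will contain one row per week of 2019 and five columns: sales_date, unit_sales, promotion, social, and price. We start with the weekly dates.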
sales_date <- seq(as.Date('2019-01-01'), by = 'week', length.out=52)
#Check results
print(sales_date)
## [1] "2019-01-01" "2019-01-08" "2019-01-15" "2019-01-22" "2019-01-29"
## [6] "2019-02-05" "2019-02-12" "2019-02-19" "2019-02-26" "2019-03-05"
## [11] "2019-03-12" "2019-03-19" "2019-03-26" "2019-04-02" "2019-04-09"
## [16] "2019-04-16" "2019-04-23" "2019-04-30" "2019-05-07" "2019-05-14"
## [21] "2019-05-21" "2019-05-28" "2019-06-04" "2019-06-11" "2019-06-18"
## [26] "2019-06-25" "2019-07-02" "2019-07-09" "2019-07-16" "2019-07-23"
## [31] "2019-07-30" "2019-08-06" "2019-08-13" "2019-08-20" "2019-08-27"
## [36] "2019-09-03" "2019-09-10" "2019-09-17" "2019-09-24" "2019-10-01"
## [41] "2019-10-08" "2019-10-15" "2019-10-22" "2019-10-29" "2019-11-05"
## [46] "2019-11-12" "2019-11-19" "2019-11-26" "2019-12-03" "2019-12-10"
## [51] "2019-12-17" "2019-12-24"
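A quick sanity check on the dates can be sketched as follows. The name dates_check is illustrative; it rebuilds the same sequence so that the snippet runs on its own. Because by = 'week' steps seven days at a time from the start date, every date falls on the same weekday as January 1, 2019, which was a Tuesday.

```r
# Rebuild the weekly sequence under a separate, illustrative name
dates_check <- seq(as.Date('2019-01-01'), by = 'week', length.out = 52)

# Number of weeks generated
length(dates_check)
## [1] 52

# Gap between consecutive dates, in days
unique(as.numeric(diff(dates_check)))
## [1] 7

# ISO weekday number (1 = Monday), so 2 means every date is a Tuesday
unique(format(dates_check, "%u"))
## [1] "2"
```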
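Next, we flag the weeks that have a marketing promotion.
We simulate these with the binomial distribution. The notation of the binomial distribution is B(n, p), where n is the number of trials and p is the probability of success. Note that R's rbinom() names things differently: its n argument is the number of draws, and the number of trials per draw is the size argument. With size = 1, each week is a single yes/no draw.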
#Flag the weeks where a promotion is running. Each week has a 10% chance of a promotion
promotion <- rbinom(n = length(sales_date), size = 1, prob = 0.1)
#Check results
table(promotion)
## promotion
## 0 1
## 45 7
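With a 10% chance in each of 52 weeks, the expected number of promotion weeks is 52 × 0.1 = 5.2, so the 7 drawn above is a little above average. The sketch below works this out with the binomial functions in base R. The names n_weeks and p_promo are illustrative, and no random numbers are drawn, so the seed is unaffected.

```r
n_weeks <- 52    # number of weekly draws
p_promo <- 0.1   # chance of a promotion in any given week

# Expected number of promotion weeks
expected_promos <- n_weeks * p_promo
expected_promos
## [1] 5.2

# Probability of seeing exactly 7 promotion weeks
dbinom(7, size = n_weeks, prob = p_promo)

# Probability of seeing 7 or more promotion weeks
pbinom(6, size = n_weeks, prob = p_promo, lower.tail = FALSE)
```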
#Create the Paid Social vector, initialized to zero (values are filled in later)
social <- rep(0, length(sales_date))
#Check results
print(social)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Next, put the product's two possible prices in a vector and use the sample function to randomly assign one of them to each week.
price <- sample(x=c(4.50, 4.99), size = length(sales_date), replace = TRUE)
#Check output
table(price)
## price
## 4.5 4.99
## 26 26
Next, generate unit sales and store them in a temporary sales variable.
The sales figures are drawn at random from a Poisson distribution.
#First argument is the number of draws (one per week)
#lambda is the mean unit sales per week
temp_sales <- rpois(length(sales_date), lambda = 8300)
#Check output
temp_sales
## [1] 8384 8479 8339 8561 8378 8161 8246 8215 8252 8122 8231 8441 8424 8226 8305
## [16] 8409 8448 8332 8362 8354 8259 8159 8378 8265 8345 8198 8241 8184 8259 8236
## [31] 8153 8311 8206 8049 8419 8355 8260 8316 8311 8381 8327 8117 8343 8293 8332
## [46] 8272 8369 8253 8274 8417 8231 8156
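A Poisson distribution has a variance equal to its mean, so with lambda = 8300 the standard deviation is sqrt(8300), or roughly 91 units. That is why every simulated week sits within a few hundred units of 8300. The sketch below makes this concrete without drawing any random numbers; lambda_sales is an illustrative name.

```r
lambda_sales <- 8300

# Standard deviation of a Poisson variable is the square root of its mean
sqrt(lambda_sales)   # about 91.1

# Range that covers the middle 95% of weekly draws
qpois(c(0.025, 0.975), lambda = lambda_sales)
```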
#Scale sales by multiplying sales by the natural log of price
temp_sales <- temp_sales * log(price)
#Check output
temp_sales
## [1] 13476.74 12753.07 12542.50 13761.26 12601.16 13118.28 12402.62 13205.09
## [9] 12411.65 13055.59 12380.06 12695.92 13541.04 13222.77 12491.36 13516.93
## [17] 12706.45 13393.16 13441.38 13428.52 13275.81 13115.07 13467.10 13285.46
## [25] 12551.53 13177.76 12395.10 13155.26 12422.18 12387.58 13105.42 13359.40
## [33] 13190.62 12106.32 12662.83 13430.13 13277.42 13367.44 12500.39 12605.67
## [41] 12524.45 12208.60 12548.52 12473.31 13393.16 13296.71 12587.62 12413.15
## [49] 12444.74 12659.82 13230.80 12267.26
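To make the scaling concrete, the sketch below reproduces the first two values by hand. R's log() is the natural logarithm, so the two prices give multipliers of about 1.504 and 1.607. Judging from the output above, week 1 (8384 units) was assigned the 4.99 price and week 2 (8479 units) the 4.50 price. Note that under this scaling the higher price receives the larger multiplier; that is a property of this simulation rather than a claim about real demand. The names week1 and week2 are illustrative.

```r
# Multipliers applied to the raw Poisson draws
log(c(4.50, 4.99))

# Week 1: 8384 units at a price of 4.99
week1 <- 8384 * log(4.99)
round(week1, 2)
## [1] 13476.74

# Week 2: 8479 units at a price of 4.50
week2 <- 8479 * log(4.50)
round(week2, 2)
## [1] 12753.07
```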
Next, we take the temporary sales and add a 30% increase in unit sales for the weeks in which a promotion is running, so that the boost from promotions shows up in the data.
#Add impact of increased sales due to week where a promotion was running
unit_sales <- floor(temp_sales * (1 + promotion * 0.30))
#Check results
unit_sales
## [1] 13476 12753 12542 13761 12601 13118 16123 13205 12411 13055 12380 12695
## [13] 13541 13222 12491 13516 12706 13393 13441 13428 13275 17049 13467 17271
## [25] 12551 13177 12395 13155 12422 12387 13105 13359 13190 12106 12662 13430
## [37] 13277 17377 12500 12605 12524 12208 12548 12473 13393 17285 12587 16137
## [49] 16178 12659 13230 12267
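The uplift formula can be checked by hand on two weeks from the output above: week 7, which had a promotion, and week 8, which did not. The names temp_example and promo_example are illustrative.

```r
# Scaled sales for weeks 7 and 8, copied from the output above
temp_example <- c(12402.62, 13205.09)

# Week 7 had a promotion (1), week 8 did not (0)
promo_example <- c(1, 0)

# Promotion weeks are multiplied by 1.30, other weeks by 1, then rounded down
uplift_example <- floor(temp_example * (1 + promo_example * 0.30))
uplift_example
## [1] 16123 13205
```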
#Initialize data frame
df <- data.frame(sales_date,
unit_sales, promotion, social, price)
#view data frame
df
We only want Paid Social values in two date ranges: July 2 through September 10 (a value of 350 per week) and again in December (a value of 200 per week). This is much easier to do after the data frame has been created.
df <- df %>%
  mutate(social = replace(social, between(sales_date, as.Date('2019-07-02'), as.Date('2019-09-10')), 350)) %>%
  mutate(social = replace(social, between(sales_date, as.Date('2019-12-03'), as.Date('2019-12-24')), 200))
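As a cross-check, the same assignment can be sketched in base R and tabulated. Both date ranges are inclusive at each end, matching the behaviour of between(), which gives 11 weeks at 350 and 4 weeks at 200. The names dates_check and social_check are illustrative and leave df untouched.

```r
# Rebuild the weekly dates so the snippet runs on its own
dates_check <- seq(as.Date('2019-01-01'), by = 'week', length.out = 52)

# Logical flags for the two Paid Social windows (inclusive at both ends)
in_summer   <- dates_check >= as.Date('2019-07-02') & dates_check <= as.Date('2019-09-10')
in_december <- dates_check >= as.Date('2019-12-03') & dates_check <= as.Date('2019-12-24')

# 350 in the summer window, 200 in December, 0 otherwise
social_check <- ifelse(in_summer, 350, ifelse(in_december, 200, 0))

# Count the weeks at each level
table(social_check)
## social_check
##   0 200 350
##  37   4  11
```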
#Descriptive statistics for the numeric columns
df %>%
  select(where(is.numeric)) %>%
  psych::describe()
Let’s visualize the variables in the dataset we created.
#Create visualizations
p1 <- ggplot(df, aes(sales_date, unit_sales)) + geom_line() + theme_minimal()
p2 <- ggplot(df, aes(social, unit_sales)) + geom_point() + theme_minimal()
p3 <- ggplot(df, aes(factor(promotion), unit_sales)) + geom_boxplot() + theme_minimal()
p4 <- ggplot(df, aes(factor(price), unit_sales)) + geom_boxplot() + theme_minimal()
plot_grid(p1, p2, p3, p4, labels = "AUTO")
Next, we visualize the correlation coefficients between the numeric variables.
#Correlation Plot
df %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = "upper", addCoef.col = "black", diag = FALSE)
Our synthetic data looks good enough for some data analysis, e.g. marketing mix modeling or measuring advertising effectiveness.
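#Create the output folder if it does not exist yet
dir.create("datasets", showWarnings = FALSE)
#Write csv file (row.names = FALSE keeps the row index out of the file)
write.csv(df, "datasets/weekly_sales_data.csv", row.names = FALSE)
#Write to an Excel file
#Using openxlsx package
write.xlsx(x = df, file = "datasets/weekly_sales_data.xlsx")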
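When you read the CSV back in, the dates arrive as text and need converting. The sketch below shows the round trip on a two-row stand-in written to a temporary file; example_df, csv_path, and read_back are illustrative names, and the real file would be read from the datasets folder instead.

```r
# Two-row stand-in for the simulated data frame
example_df <- data.frame(
  sales_date = as.Date(c('2019-01-01', '2019-01-08')),
  unit_sales = c(13476, 12753)
)

# Write to a temporary file, leaving out the row index
csv_path <- tempfile(fileext = ".csv")
write.csv(example_df, csv_path, row.names = FALSE)

# Read it back: the date column is no longer of class Date
read_back <- read.csv(csv_path)
class(read_back$sales_date)

# Convert the column back to Date before analysis
read_back$sales_date <- as.Date(read_back$sales_date)
str(read_back)
```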