setwd("~/code-repositories/book-use-r-mkt-research-analytics/")
library('ProjectTemplate')
library(psych)
load.project()

Summarizing data

Discrete Variables

A basic way to describe discrete data is with frequency counts.

(p1.table <- table(store.df$p1price))
## 
## 2.19 2.49 2.79 2.99 
##  839  423  443  375
plot(p1.table)

How often each product was promoted at each price point? Two-way cross tabs:

(p1.table2 <- table(store.df$p1price, store.df$p1prom))
##       
##          0   1
##   2.19 752  87
##   2.49 381  42
##   2.79 396  47
##   2.99 343  32

The exact fraction of times product 1 is on promotion at each price point should be close to 10%:

p1.table2[, 2] / (p1.table2[, 1] + p1.table2[, 2])
##       2.19       2.49       2.79       2.99 
## 0.10369487 0.09929078 0.10609481 0.08533333

Continuous Variables

With continuous data it is more helpful to summarize the data in terms of its distribution.

Describe Function Value
Extremes min(x) Minimum value
max(x) Maximum value
Central dentency mean(x) Arithmetic mean
median(x) Median
Dispersion var(x) Variance around the mean
sd(x) Standard deviation (sqrt(var(x)))
IQR(x) Interquantile range, 75th-25th percentile
mad(x) Median absolute deviation (a robust variance estimator)
min(store.df$p1sales)
## [1] 73
max(store.df$p2sales)
## [1] 225
mean(store.df$p1prom)
## [1] 0.1
median(store.df$p2sales)
## [1] 95
var(store.df$p1sales)
## [1] 877.8111
sd(store.df$p1sales)
## [1] 29.62788
IQR(store.df$p1sales)
## [1] 39
mad(store.df$p1sales)
## [1] 28.1694
quantile(store.df$p1sales, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
## 114 131 153

describe()

The trimmed mean is the mean after dropping a small proportion of extreme values.

Statistics such as skew and kurtosis are useful when interpreting data with regard to normal distributions.

describe(store.df)
##           vars    n   mean    sd median trimmed   mad   min    max range
## storeNum*    1 2080  10.50  5.77  10.50   10.50  7.41  1.00  20.00  19.0
## Year         2 2080   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.0
## Week         3 2080  26.50 15.01  26.50   26.50 19.27  1.00  52.00  51.0
## p1sales      4 2080 134.80 29.63 131.00  132.79 28.17 73.00 263.00 190.0
## p2sales      5 2080  99.10 24.83  95.00   96.97 22.24 48.00 225.00 177.0
## p1price      6 2080   2.52  0.32   2.49    2.51  0.44  2.19   2.99   0.8
## p2price      7 2080   2.70  0.33   2.59    2.69  0.44  2.29   3.19   0.9
## p1prom       8 2080   0.10  0.30   0.00    0.00  0.00  0.00   1.00   1.0
## p2prom       9 2080   0.14  0.35   0.00    0.05  0.00  0.00   1.00   1.0
## country*    10 2080   4.55  1.72   4.50    4.62  2.22  1.00   7.00   6.0
##            skew kurtosis   se
## storeNum*  0.00    -1.21 0.13
## Year       0.00    -2.00 0.01
## Week       0.00    -1.20 0.33
## p1sales    0.71     0.52 0.65
## p2sales    0.98     1.46 0.54
## p1price    0.24    -1.52 0.01
## p2price    0.32    -1.40 0.01
## p1prom     2.66     5.10 0.01
## p2prom     2.09     2.38 0.01
## country*  -0.29    -0.81 0.04

apply()*

apply(store.df[, 2:9], 2, function(x) { mean(x) - median(x) })
##       Year       Week    p1sales    p2sales    p1price    p2price 
## 0.00000000 0.00000000 3.80288462 4.10288462 0.03302885 0.10951923 
##     p1prom     p2prom 
## 0.10000000 0.13846154

Single Variable Visualization

Histograms

hist(store.df$p1sales,
     main = "Product 1 Weekly Sales Frequencies, All Stores",
     xlab = "Product 1 Sales (Units)",
     ylab = "Relative frequency",
     breaks = 30,
     col = "lightblue",
     freq = FALSE,
     xaxt = "n")
axis(side = 1, at = seq(60, 300, by = 20))
lines(density(store.df$p1sales, bw = 10), type = "l", col = "darkred", lwd = 2)

Boxplots

Tukey boxplot (or box-and-wiskers plot):

boxplot(store.df$p2sales, xlab = "Weekly Sales", ylab = "P2", main = "Weekly Sales of P2, All Stores", horizontal = T)

Boxplots are even more useful when you compare distributions by some other factor.

How do different store compare on sales of product 2?

boxplot(store.df$p2sales ~ store.df$storeNum, 
        horizontal = T, ylab = "Store", xlab = "Weekly unit sales", 
        las = 1, 
        main = "Weekly Sales of P2 by Store")

Do P2 sales differ in relation to in-store promotion?

boxplot(p2sales ~ p2prom, data = store.df, horizontal = T, yaxt = "n",
        ylab = "P2 promoted in store?", xlab = "Weekly sales",
        main = "Weekly sales of P2 with and without promotion")
axis(side = 2, at = c(1, 2), labels = c("No", "Yes"))

QQ Plot to Check Normality

A QQ plot can confirm that the distribution is, in fact, normal by plotting the observed quantiles of your data against the quantiles that would be expected for a normal distribution.

We check p1sales to see whether it is normally distributed:

qqnorm(store.df$p1sales)
qqline(store.df$p1sales)

A common pattern in marketing data is a logarithmic destribution. We examine whether p1sales is more approximately normal after a log() transform:

qqnorm(log(store.df$p1sales))
qqline(log(store.df$p1sales))

Cumulative Distribution

Cumulative distribution plot with lines to emphasize the 90th percentile. The chart identifies that 90% of weekly sales are lower than or equal to 171 units. Other values are easy to read off the chart. For instance, roughly 10% of weeks sell less than 100 units, and fewer than 5% sell more than 200 units.

plot(ecdf(store.df$p1sales),
     main = "Cumulative Distribution of P1 Weekly Sales",
     ylab = "Cumulative Proportion",
     xlab = c("P1 weekly sales, all stores", "90% of weeks sold <= 171 units"),
     yaxt = "n")
axis(side = 2, at = seq(0, 1, by = 0.1), las = 1, labels = paste(seq(0, 100, by = 10), "%", sep = ""))
abline(h = 0.9, lty = 3)
abline(v = quantile(store.df$p1sales, pr = 0.9), lty = 3)