setwd("~/code-repositories/book-use-r-mkt-research-analytics/")
library('ProjectTemplate')
library(psych)
load.project()
A basic way to describe discrete data is with frequency counts.
(p1.table <- table(store.df$p1price))
##
## 2.19 2.49 2.79 2.99
## 839 423 443 375
plot(p1.table)
How often each product was promoted at each price point? Two-way cross tabs:
(p1.table2 <- table(store.df$p1price, store.df$p1prom))
##
## 0 1
## 2.19 752 87
## 2.49 381 42
## 2.79 396 47
## 2.99 343 32
The exact fraction of times product 1 is on promotion at each price point should be close to 10%:
p1.table2[, 2] / (p1.table2[, 1] + p1.table2[, 2])
## 2.19 2.49 2.79 2.99
## 0.10369487 0.09929078 0.10609481 0.08533333
With continuous data it is more helpful to summarize the data in terms of its distribution.
| Describe | Function | Value |
|---|---|---|
| Extremes | min(x) | Minimum value |
| max(x) | Maximum value | |
| Central dentency | mean(x) | Arithmetic mean |
| median(x) | Median | |
| Dispersion | var(x) | Variance around the mean |
| sd(x) | Standard deviation (sqrt(var(x))) |
|
| IQR(x) | Interquantile range, 75th-25th percentile | |
| mad(x) | Median absolute deviation (a robust variance estimator) |
min(store.df$p1sales)
## [1] 73
max(store.df$p2sales)
## [1] 225
mean(store.df$p1prom)
## [1] 0.1
median(store.df$p2sales)
## [1] 95
var(store.df$p1sales)
## [1] 877.8111
sd(store.df$p1sales)
## [1] 29.62788
IQR(store.df$p1sales)
## [1] 39
mad(store.df$p1sales)
## [1] 28.1694
quantile(store.df$p1sales, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 114 131 153
The trimmed mean is the mean after dropping a small proportion of extreme values.
Statistics such as skew and kurtosis are useful when interpreting data with regard to normal distributions.
describe(store.df)
## vars n mean sd median trimmed mad min max range
## storeNum* 1 2080 10.50 5.77 10.50 10.50 7.41 1.00 20.00 19.0
## Year 2 2080 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.0
## Week 3 2080 26.50 15.01 26.50 26.50 19.27 1.00 52.00 51.0
## p1sales 4 2080 134.80 29.63 131.00 132.79 28.17 73.00 263.00 190.0
## p2sales 5 2080 99.10 24.83 95.00 96.97 22.24 48.00 225.00 177.0
## p1price 6 2080 2.52 0.32 2.49 2.51 0.44 2.19 2.99 0.8
## p2price 7 2080 2.70 0.33 2.59 2.69 0.44 2.29 3.19 0.9
## p1prom 8 2080 0.10 0.30 0.00 0.00 0.00 0.00 1.00 1.0
## p2prom 9 2080 0.14 0.35 0.00 0.05 0.00 0.00 1.00 1.0
## country* 10 2080 4.55 1.72 4.50 4.62 2.22 1.00 7.00 6.0
## skew kurtosis se
## storeNum* 0.00 -1.21 0.13
## Year 0.00 -2.00 0.01
## Week 0.00 -1.20 0.33
## p1sales 0.71 0.52 0.65
## p2sales 0.98 1.46 0.54
## p1price 0.24 -1.52 0.01
## p2price 0.32 -1.40 0.01
## p1prom 2.66 5.10 0.01
## p2prom 2.09 2.38 0.01
## country* -0.29 -0.81 0.04
read.csv() or another appropriate functionmy.data <- data.frame(DATA)) and set column names (names(my.data) <- c(...)) if neededdim() for expected number of rows and columnshead() and tail() to check the first and last rowssome() from the car package to examine a few sets of random rowsstr() to ensure that variable types are appropriatesummary() and look for unexpected valuespsych library and examine basic descriptives with describe()apply(store.df[, 2:9], 2, function(x) { mean(x) - median(x) })
## Year Week p1sales p2sales p1price p2price
## 0.00000000 0.00000000 3.80288462 4.10288462 0.03302885 0.10951923
## p1prom p2prom
## 0.10000000 0.13846154
hist(store.df$p1sales,
main = "Product 1 Weekly Sales Frequencies, All Stores",
xlab = "Product 1 Sales (Units)",
ylab = "Relative frequency",
breaks = 30,
col = "lightblue",
freq = FALSE,
xaxt = "n")
axis(side = 1, at = seq(60, 300, by = 20))
lines(density(store.df$p1sales, bw = 10), type = "l", col = "darkred", lwd = 2)
Tukey boxplot (or box-and-wiskers plot):
boxplot(store.df$p2sales, xlab = "Weekly Sales", ylab = "P2", main = "Weekly Sales of P2, All Stores", horizontal = T)
Boxplots are even more useful when you compare distributions by some other factor.
How do different store compare on sales of product 2?
boxplot(store.df$p2sales ~ store.df$storeNum,
horizontal = T, ylab = "Store", xlab = "Weekly unit sales",
las = 1,
main = "Weekly Sales of P2 by Store")
Do P2 sales differ in relation to in-store promotion?
boxplot(p2sales ~ p2prom, data = store.df, horizontal = T, yaxt = "n",
ylab = "P2 promoted in store?", xlab = "Weekly sales",
main = "Weekly sales of P2 with and without promotion")
axis(side = 2, at = c(1, 2), labels = c("No", "Yes"))
A QQ plot can confirm that the distribution is, in fact, normal by plotting the observed quantiles of your data against the quantiles that would be expected for a normal distribution.
We check p1sales to see whether it is normally distributed:
qqnorm(store.df$p1sales)
qqline(store.df$p1sales)
A common pattern in marketing data is a logarithmic destribution. We examine whether p1sales is more approximately normal after a log() transform:
qqnorm(log(store.df$p1sales))
qqline(log(store.df$p1sales))
Cumulative distribution plot with lines to emphasize the 90th percentile. The chart identifies that 90% of weekly sales are lower than or equal to 171 units. Other values are easy to read off the chart. For instance, roughly 10% of weeks sell less than 100 units, and fewer than 5% sell more than 200 units.
plot(ecdf(store.df$p1sales),
main = "Cumulative Distribution of P1 Weekly Sales",
ylab = "Cumulative Proportion",
xlab = c("P1 weekly sales, all stores", "90% of weeks sold <= 171 units"),
yaxt = "n")
axis(side = 2, at = seq(0, 1, by = 0.1), las = 1, labels = paste(seq(0, 100, by = 10), "%", sep = ""))
abline(h = 0.9, lty = 3)
abline(v = quantile(store.df$p1sales, pr = 0.9), lty = 3)