setwd("~/code-repositories/book-use-r-mkt-research-analytics/")
library('ProjectTemplate')
library(psych)
load.project()

Summarizing data

Discrete Variables

A basic way to describe discrete data is with frequency counts.

(p1.table <- table(store.df$p1price))

## 
## 2.19 2.49 2.79 2.99 
##  839  423  443  375

plot(p1.table)

How often each product was promoted at each price point? Two-way cross tabs:

(p1.table2 <- table(store.df$p1price, store.df$p1prom))

##       
##          0   1
##   2.19 752  87
##   2.49 381  42
##   2.79 396  47
##   2.99 343  32

The exact fraction of times product 1 is on promotion at each price point should be close to 10%:

p1.table2[, 2] / (p1.table2[, 1] + p1.table2[, 2])

##       2.19       2.49       2.79       2.99 
## 0.10369487 0.09929078 0.10609481 0.08533333

Continuous Variables

With continuous data it is more helpful to summarize the data in terms of its distribution.

Describe	Function	Value
Extremes	min(x)	Minimum value
	max(x)	Maximum value
Central dentency	mean(x)	Arithmetic mean
	median(x)	Median
Dispersion	var(x)	Variance around the mean
	sd(x)	Standard deviation (`sqrt(var(x))`)
	IQR(x)	Interquantile range, 75th-25th percentile
	mad(x)	Median absolute deviation (a robust variance estimator)

min(store.df$p1sales)

## [1] 73

max(store.df$p2sales)

## [1] 225

mean(store.df$p1prom)

## [1] 0.1

median(store.df$p2sales)

## [1] 95

var(store.df$p1sales)

## [1] 877.8111

sd(store.df$p1sales)

## [1] 29.62788

IQR(store.df$p1sales)

## [1] 39

mad(store.df$p1sales)

## [1] 28.1694

quantile(store.df$p1sales, probs = c(0.25, 0.5, 0.75))

## 25% 50% 75% 
## 114 131 153

describe()

The trimmed mean is the mean after dropping a small proportion of extreme values.

Statistics such as skew and kurtosis are useful when interpreting data with regard to normal distributions.

describe(store.df)

##           vars    n   mean    sd median trimmed   mad   min    max range
## storeNum*    1 2080  10.50  5.77  10.50   10.50  7.41  1.00  20.00  19.0
## Year         2 2080   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.0
## Week         3 2080  26.50 15.01  26.50   26.50 19.27  1.00  52.00  51.0
## p1sales      4 2080 134.80 29.63 131.00  132.79 28.17 73.00 263.00 190.0
## p2sales      5 2080  99.10 24.83  95.00   96.97 22.24 48.00 225.00 177.0
## p1price      6 2080   2.52  0.32   2.49    2.51  0.44  2.19   2.99   0.8
## p2price      7 2080   2.70  0.33   2.59    2.69  0.44  2.29   3.19   0.9
## p1prom       8 2080   0.10  0.30   0.00    0.00  0.00  0.00   1.00   1.0
## p2prom       9 2080   0.14  0.35   0.00    0.05  0.00  0.00   1.00   1.0
## country*    10 2080   4.55  1.72   4.50    4.62  2.22  1.00   7.00   6.0
##            skew kurtosis   se
## storeNum*  0.00    -1.21 0.13
## Year       0.00    -2.00 0.01
## Week       0.00    -1.20 0.33
## p1sales    0.71     0.52 0.65
## p2sales    0.98     1.46 0.54
## p1price    0.24    -1.52 0.01
## p2price    0.32    -1.40 0.01
## p1prom     2.66     5.10 0.01
## p2prom     2.09     2.38 0.01
## country*  -0.29    -0.81 0.04

Recommended Approach to Inspecting Data

Import data with read.csv() or another appropriate function
Convert it to a data frame if needed (my.data <- data.frame(DATA)) and set column names (names(my.data) <- c(...)) if needed
Examine dim() for expected number of rows and columns
Use head() and tail() to check the first and last rows
Use some() from the car package to examine a few sets of random rows
Check the data structure with str() to ensure that variable types are appropriate
Run summary() and look for unexpected values
Load the psych library and examine basic descriptives with describe()

apply()*

apply(store.df[, 2:9], 2, function(x) { mean(x) - median(x) })

##       Year       Week    p1sales    p2sales    p1price    p2price 
## 0.00000000 0.00000000 3.80288462 4.10288462 0.03302885 0.10951923 
##     p1prom     p2prom 
## 0.10000000 0.13846154

Single Variable Visualization

Histograms

hist(store.df$p1sales,
     main = "Product 1 Weekly Sales Frequencies, All Stores",
     xlab = "Product 1 Sales (Units)",
     ylab = "Relative frequency",
     breaks = 30,
     col = "lightblue",
     freq = FALSE,
     xaxt = "n")
axis(side = 1, at = seq(60, 300, by = 20))
lines(density(store.df$p1sales, bw = 10), type = "l", col = "darkred", lwd = 2)

Boxplots

Tukey boxplot (or box-and-wiskers plot):

boxplot(store.df$p2sales, xlab = "Weekly Sales", ylab = "P2", main = "Weekly Sales of P2, All Stores", horizontal = T)

Boxplots are even more useful when you compare distributions by some other factor.

How do different store compare on sales of product 2?

boxplot(store.df$p2sales ~ store.df$storeNum, 
        horizontal = T, ylab = "Store", xlab = "Weekly unit sales", 
        las = 1, 
        main = "Weekly Sales of P2 by Store")

Do P2 sales differ in relation to in-store promotion?

boxplot(p2sales ~ p2prom, data = store.df, horizontal = T, yaxt = "n",
        ylab = "P2 promoted in store?", xlab = "Weekly sales",
        main = "Weekly sales of P2 with and without promotion")
axis(side = 2, at = c(1, 2), labels = c("No", "Yes"))

QQ Plot to Check Normality

A QQ plot can confirm that the distribution is, in fact, normal by plotting the observed quantiles of your data against the quantiles that would be expected for a normal distribution.

We check p1sales to see whether it is normally distributed:

qqnorm(store.df$p1sales)
qqline(store.df$p1sales)

A common pattern in marketing data is a logarithmic destribution. We examine whether p1sales is more approximately normal after a log() transform:

qqnorm(log(store.df$p1sales))
qqline(log(store.df$p1sales))

Cumulative Distribution

Cumulative distribution plot with lines to emphasize the 90th percentile. The chart identifies that 90% of weekly sales are lower than or equal to 171 units. Other values are easy to read off the chart. For instance, roughly 10% of weeks sell less than 100 units, and fewer than 5% sell more than 200 units.

plot(ecdf(store.df$p1sales),
     main = "Cumulative Distribution of P1 Weekly Sales",
     ylab = "Cumulative Proportion",
     xlab = c("P1 weekly sales, all stores", "90% of weeks sold <= 171 units"),
     yaxt = "n")
axis(side = 2, at = seq(0, 1, by = 0.1), las = 1, labels = paste(seq(0, 100, by = 10), "%", sep = ""))
abline(h = 0.9, lty = 3)
abline(v = quantile(store.df$p1sales, pr = 0.9), lty = 3)

Chapter 3 - Describing Data