1 Objective

The objective of this data analytics project is to gain a systematic understanding of the factors that explain the variability of scratch ticket sales in the US. We approach the question from the point of view of state lotteries that exclusively distribute this product in the US. Deeper insight into these relationships will help lotteries develop and market scratch tickets that better satisfy the diverse and evolving needs of players in different market and customer segments.

2 Business Context

Scratch-off lottery tickets (also known as scratch tickets or instant games) are distributed in the US through networks of retailers by state lotteries that operate monopolies within their jurisdictions. The product does compete with other forms of gambling such as casino-based games, draw games and various other forms of betting. Although revenue growth held steady at 3-6% between 2013 and 2018, US lotteries have reasons to be concerned about the long-term sustainability of the traditional scratch ticket product, which contributes over $50 billion in combined annual revenue.

2.1 Challenges

Demographics - Fewer young people are purchasing scratch tickets. In a 2017 research report, IGT (a major lottery consulting group) found that millennials (born between 1981 and 1997), the largest generation and also the largest share of the work force, are showing declining interest in traditional lottery products in favour of digital games accessible through their mobile phones. Their economic weight will only rise as their incomes increase over the coming years.

Competition - New gaming options (e.g. online sports betting) more in tune with younger consumers’ habits and preferences are receiving government assent in a growing number of states across the country. The trend is forecast to continue.

Uncertainty around the factors that explain sales variability - There is remarkable variance in the sales performance of scratch tickets. Aside from a few core games that remain in market permanently, states have to rely on sales from new games launched at frequent intervals - players typically expect new releases every month. The performance of these new games is hard to predict and lotteries often find themselves uncomfortably dependent on a few hit games to drive fiscal year sales growth.

The factors that affect sales are not well understood. There are several, often competing, ideas that are mostly based on subjective experience, intuition, anecdote and survey-based research on very limited population samples.

This list is by no means exhaustive. The industry is faced with several other significant challenges including mediocre management (at some lotteries) and regulatory restrictions imposed by state authorities.

2.2 Attempted Solutions

U.S. lotteries are attempting to respond to some of these threats by appealing to younger audiences through online instant games with more gamification features (subject to regulatory approval), increasing their presence on social media and communicating more about the social causes lotteries support.

There is also a greater effort to better understand the factors that explain sales variability. Multiple opposing ideas have emerged from these efforts, largely because the methods adopted are mostly unscientific, which leads different stakeholders to put forward “analysis” and “evidence” that favors their interests. Ticket art designers believe sales mostly depend on the appeal of the ticket artwork; sales people argue that sales are mainly driven by field sales and retail optimization; others are convinced it is all about the prize structure of the ticket. This last view requires some explanation.

A pre-determined proportion of the revenue generated by the sales of each instant game must be returned to players in the form of prizes. The lottery would generally design their portfolio of instant games such that this payout ratio cascades downwards with the highest priced tickets returning the greatest proportion of their sales to players. Whilst the payout ratios are pre-determined (by lottery policy and/or state regulation), the specific manner in which the prize fund is allocated between prize tiers may be varied in endless ways. For example, do you sacrifice small wins for a large top prize to attract players chasing the dream or do you reduce the marquee prize in favour of a large number of winning tickets offering more, albeit modest, win opportunities for players? These prize structure calibrations are thought by many to have a fundamental influence on players’ experience of an instant game and thus the sales of the ticket. Lotteries are investing greater efforts in seeking the optimal prize allocations for instant games.

Considerable effort is also going into improving field marketing and sales operations as well as optimizing product portfolios. We remain convinced that the most significant opportunity lies in gaining a true understanding of the variables that affect sales. This knowledge should then be used to inform growth initiatives in all other areas.

3 Analytics Approach

We will attempt to develop a systematic understanding of the factors that explain the variability of scratch ticket sales by answering the following questions:

Which independent variables (predictors) are significantly associated with sales, the dependent variable (response)?
What is the nature of the relationship between each predictor and sales: positive or negative, linear or non-linear?
If, as is most likely, it turns out that sales variability is associated with some combination of several predictors, can we accurately estimate the effect of each independent variable on sales?
Do the independent variables interact with each other in significant ways?

3.1 Potential Business Benefits

Accurate answers to the above questions could help scratch ticket marketers:

Develop products that better match the expectations of different customer segments
Improve retail execution with a view to optimizing performance in overcrowded retail spaces where thousands of products vie for consumers’ attention
Optimize marketing efforts through highly efficient budget allocation, more relevant promotions, better conceived and targeted campaigns etc.

3.2 Methods

We will use regression techniques that model a numerical response variable (sales, in our case) from a given set of predictors. Several models will be implemented and these will be chosen for their effectiveness in modeling data with the features and characteristics of our dataset.

Because our primary goal is inferential, that is, our priority is to understand how sales change as a function of our inputs, we will focus on building relatively simple, interpretable models. Although developments in the statistical learning field have produced powerful, sophisticated modeling techniques, those tend to sacrifice interpretability for predictive accuracy. However, we note the recent arrival of analytic frameworks that seek to explain the predictions of black box models. These will be used as appropriate.

3.3 Performance Evaluation and Model Selection

Models are evaluated on relevant assessment statistics and on their interpretability.

The final model is selected based on an honest assessment on held-out data samples, using cross-validation.
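As an illustration of selecting between candidate models on held-out data, the following is a minimal k-fold cross-validation sketch in base R. It runs on simulated data rather than the report's dataset, and the helper `cv_rmse`, the simulated coefficients and the two candidate formulas are assumptions made purely for the example.

```r
# Minimal k-fold cross-validation sketch on simulated data (not the report's dataset)
set.seed(42)
n <- 200
sim <- data.frame(payout = runif(n, 0.55, 0.80),
                  odds   = runif(n, 3, 5))
sim$log_sales <- 10 + 6 * sim$payout - 0.3 * sim$odds + rnorm(n, sd = 0.5)

# Hypothetical helper: cross-validated RMSE of a linear model
cv_rmse <- function(formula, data, k = 5) {
  # Randomly assign each row to one of k folds of (nearly) equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])       # train on k - 1 folds
    pred <- predict(fit, newdata = data[folds == i, ])   # predict the held-out fold
    sqrt(mean((data[[response]][folds == i] - pred)^2))
  })
  mean(fold_rmse)
}

rmse_small <- cv_rmse(log_sales ~ payout, sim)
rmse_full  <- cv_rmse(log_sales ~ payout + odds, sim)
c(payout_only = rmse_small, payout_and_odds = rmse_full)
```

The candidate with the lower cross-validated error would normally be preferred, provided it remains interpretable enough for the inferential goals set out above.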

4 Dataset Details

4.1 Sources

This analysis is based on a dataset made up of sales and product data relating to scratch tickets sold in the following 7 North American lottery jurisdictions:

Florida
North Carolina
Nebraska
Virginia
Texas
West Virginia
California

The initial dataset includes 1178 unique scratch tickets in market between 01/03/2015 and 05/26/2018.

The code book that accompanies this report describes each of the variables included in the dataset.

Data are drawn from:

State Lotteries
La Fleur’s magazine
Public sources identified in the codebook

4.2 Features

sales, the response variable, is our outcome of interest. These are weekly sales. The numbers are not observed directly but are derived by combining information about the number of winning tickets validated at retail locations with known odds of winning a prize for each game. These “sales” figures are therefore imprecise and include a time lag between the day of purchase and the day the sale is recorded.
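The report does not spell out the exact derivation, so the following is only a sketch under an assumed rule: if overall odds are quoted as "1 in X", each validated winning ticket stands for roughly X tickets sold, and multiplying by the retail price gives an estimated dollar amount. The function `estimate_sales` and the input figures are hypothetical.

```r
# Hypothetical helper: estimate sales from validated winning tickets.
# Assumption (not confirmed by the source): tickets sold ~ validations * odds
estimate_sales <- function(validations, odds, price) {
  tickets_sold <- validations * odds   # odds of 4 means roughly 1 winner per 4 tickets
  tickets_sold * price                 # convert ticket count to dollars
}

# 50,000 validated winners on a $5 ticket with overall odds of 1 in 4
weekly_estimate <- estimate_sales(validations = 50000, odds = 4, price = 5)
weekly_estimate
```

Under these assumed inputs the estimate comes to one million dollars. Because validations trail purchases, an estimate of this kind is attributed to the week of validation rather than the week of sale.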

Each sales value represents the sum of the first 13 weeks of sales for each case in the dataset, that is, each scratch ticket. This period, approximately a quarter, is commonly used in the industry to compare sales performance between tickets. 13 weeks roughly captures the lifecycle of most games. The idea is to ensure “fair” comparisons given that games are in market for varying periods. Using total lifetime sales would include the long tail of declining sales that is often observed for tickets that are maintained in market for long periods of time.

All the predictors in our dataset are related to product characteristics. See table 1.

4.3 Data Challenges

We can safely assume that sales are also associated with non-product variables. Advertising and promotion are obvious examples.

Analyses performed in a limited feature space that excludes relevant variables will likely restrict the predictive performance and explanatory power of any statistical model. Nevertheless, the effort remains worthwhile because, at the very least, the results will permit us to confirm or discredit some of the received truths and unproven theories that drive important decisions in the industry. This study should also provide useful guidance for future research and data gathering efforts.

5 Data Preparation

Set Up

R Packages used

library(tidyverse) ## Hadley Wickham (2017)
library(lubridate) ## Garrett Grolemund, Hadley Wickham (2011)
library(DataExplorer) ## Boxuan Cui (2019)
library(table1) ## Benjamin Rich (2018)
library(inspectdf) ## Alastair Rushworth (2019)
library(scales) ## Hadley Wickham (2017)
library(knitr) ## Yihui Xie (2018)
library(kableExtra) ##  Hao Zhu (2018)
library(ggthemes) ## Jeffrey B. Arnold (2018)
library(gridExtra) ## Baptiste Auguie (2017)
library(stringi) ## Gagolewski M. and others (2019)
library(car) ## John Fox and Sanford Weisberg (2011)
library(formattable) ## Kun Ren and Kenton Russell (2016)
library(dunn.test) # Alexis Dinno (2017)
library(gplots) # Gregory R. Warnes et al (2019)

Load datasets

#Load weekly sales dataset
sales <- read.csv("val_sales.csv", stringsAsFactors = FALSE)

#Load game features
features <- read.csv("game_features.csv", stringsAsFactors = FALSE)

#Load prize structure data
prize_structures <- read.csv("prize_structures.csv", stringsAsFactors = FALSE)

Create analysis dataset

The original sales dataset is in a wide format with weekly sales figures for each game laid out horizontally in separate columns. Multiple columns have data that represent the same variable. The dataset is rearranged into a tidier, long format more amenable to data analysis by collapsing all columns with sales data to a single column.

sales <- sales %>% gather(key = sales.date, value = sales, 
                          -c("game.id","game.name", "price","odds","start.date","quantity"))
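The reshaping step can be illustrated on a toy table. This sketch uses `pivot_longer()`, the newer tidyr equivalent of `gather()` (it assumes tidyr 1.0.0 or later); the toy values and the `X`-prefixed column names, which mimic what `read.csv` produces for headers that start with a digit, are assumptions for illustration.

```r
library(tidyr)

# Toy wide table: one column per sales week, as in the raw sales file
wide <- data.frame(game.id = c("CA1", "CA2"),
                   price = c("$2", "$5"),
                   X01.03.2015 = c("1,000", "2,500"),
                   X01.10.2015 = c("900", NA),
                   stringsAsFactors = FALSE)

# Collapse every weekly column into a (sales.date, sales) pair of columns
long <- pivot_longer(wide,
                     cols = -c(game.id, price),
                     names_to = "sales.date",
                     values_to = "sales")
long
```

Each game now contributes one row per week, including rows with missing sales for weeks in which the game was not in market.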

Combine datasets

Three separate datasets with sales, prize structure and information on scratch ticket features need to be combined into a single dataset to facilitate subsequent data exploration and modelling.

scratch_data <- left_join(sales, select(features, -c(start.date)), by = "game.id") %>%
                               left_join(., prize_structures, by = "game.id")
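A left join keeps every row of the sales table but silently multiplies rows if the lookup table holds more than one record per key. The sketch below, on hypothetical toy tables, shows two checks that can be run around a join of this kind.

```r
library(dplyr)

# Toy tables (hypothetical values)
sales_toy <- data.frame(game.id = c("CA1", "CA1", "TX7"),
                        sales = c(100, 80, 60))
features_toy <- data.frame(game.id = c("CA1", "TX7"),
                           theme = c("Money / Cash", "Crossword"))

# The lookup table should hold one row per key, otherwise the join duplicates rows
stopifnot(anyDuplicated(features_toy$game.id) == 0)

joined <- left_join(sales_toy, features_toy, by = "game.id")
stopifnot(nrow(joined) == nrow(sales_toy))   # row count is unchanged by the join

# Games with sales but no matching feature record
unmatched <- anti_join(sales_toy, features_toy, by = "game.id")
nrow(unmatched)
```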

A jurisdiction variable is created to identify the state in which each ticket was distributed.

scratch_data <- mutate(scratch_data, 
                       jurisdiction = substr(scratch_data$game.id, start = 1, stop = 2))

Clean data

Imported formatting is cleaned from sales.date, price, sales, start.month, start.year and all columns containing prize tier allocation percentages.

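# An unescaped '.' is a regex wildcard, so sub() drops only the first character of each value,
# presumably the "X" that read.csv prepends to column names starting with a digit
scratch_data$sales.date <- sub('.',"", scratch_data$sales.date) 

# Likewise drops the leading character of each price value
scratch_data$price <- sub('.',"", scratch_data$price)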

scratch_data$sales <- parse_number(scratch_data$sales)

scratch_data$start.year <- parse_number(scratch_data$start.year)

scratch_data$start.month <- gsub("-.*","",scratch_data$start.month)

scratch_data[,21:29] <- lapply(scratch_data[,21:29], 
                               function(x){as.numeric(sub("%", "", x, fixed=TRUE))/100})

# remove spaces from `price` column
scratch_data$price <- stri_replace_all_fixed(scratch_data$price, " ", "")
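The cleaning calls above rely on a few string-handling behaviours that are easy to misread. The following sketch shows them on made-up values (the inputs are assumptions, not rows from the dataset).

```r
library(readr)

# Unescaped "." is a regex wildcard: only the first character is dropped
wildcard <- sub(".", "", "X01.03.2015")                 # "01.03.2015"

# With fixed = TRUE the dot is literal: the first "." is dropped instead
literal <- sub(".", "", "X01.03.2015", fixed = TRUE)    # "X0103.2015"

# parse_number() strips currency symbols and grouping marks
amount <- parse_number("$1,234,567")                    # 1234567

# Percent strings become proportions
share <- as.numeric(sub("%", "", "14.5%", fixed = TRUE)) / 100   # 0.145
```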

The combined dataset, created after joining 3 tables, includes lottery ticket sales data, product features and prize structure information.

Table 1. Overview of independent variables. Several predictors are recorded in the “wrong” data type and will need to be converted, e.g. dates or ticket prices recorded as character variables.

var_names <- read.csv("var_names.csv", stringsAsFactors = FALSE) # load variable descriptions

data_desc_table <- data.frame(Variable = names(scratch_data),
                              Data_Type = sapply(scratch_data, class),
                              First_Values = sapply(scratch_data, 
                                                    function(x) paste0(head(x),  collapse = ", ")), 
                              row.names = NULL) %>%
                   left_join(., var_names, by = "Variable") %>%
                   # %in% tests membership; != against a vector would recycle it element-wise
                   filter(!Variable %in% c("sales_13", "sales")) %>%
                   kable(format = "html", longtable = TRUE) %>%
                   kable_styling(bootstrap_options = "striped")

data_desc_table

Variable	Data_Type	First_Values	Description
game.id	character	CA1154, CA1155, CA1156, CA1159, CA1157, CA1158	Unique identifier
game.name	character	$25,000 Taxes Paid, $250,000 Taxes Paid, $1,000,000 Taxes Paid, Hit $500, Wild 9’s, Tripling Bonus Cashword	Ticket name
price	character	2, 5, 10, 5, 1, 3	Ticket retail price
odds	numeric	4.33, 3.74, 3.29, 4.43, 4.89, 3.34	Overall odds of winning a prize
start.date	character	1/25/2015, 1/25/2015, 1/25/2015, 2/22/2015, 2/22/2015, 2/22/2015	Ticket market launch date
quantity	character	21283500, 18241605, 15744152, 20095684, 23641500, 23484000	Number of tickets printed
sales.date	character	01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015	Date ticket was sold (estimate based on validations)
family	logical	TRUE, TRUE, TRUE, FALSE, FALSE, FALSE	Is the ticket part of a family of games?
license	logical	FALSE, FALSE, FALSE, FALSE, FALSE, FALSE	Is the ticket a licensed property? Hence likely a well known concept and likely advertized
ltp	logical	NA, NA, NA, NA, NA, NA	Is the top prize relatively low? suggesting more prize fund allocated to low/mid tier prizes
start.month	character	Jan, Jan, Jan, Feb, Feb, Feb	Month ticket was launched
play.style	character	Key Number Match, Key Number Match, Key Number Match, Key Number Match, Multiple, Crossword	Play style
color	character	Green, , Purple, Green, Black, Red	Predominant ticket color
feature	character	Multiplier - 5x, Multiplier - 10 x, Multiplier - 20 x, Multiplier - 5x, Autowin - Single, Multiplier - Tripler	Additional feature
theme	character	Money / Cash, Money / Cash, Money / Cash, Money / Cash, Numbers, Crossword	Ticket theme
vendor	character	Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Pollard Banknote	Print vendor
start.year	numeric	2015, 2015, 2015, 2015, 2015, 2015	Year ticket was launched
num.win	integer	10, 18, 20, 18, 1, 2	Number of ways to win
payout	numeric	0.62, 0.68, 0.73, 0.68, 0.57, 0.63	Percentage of sales allocated to prize fund
breakeven	numeric	0.08, 0.075, 0.17, 0.062, 0.1, 0.06	Percentage of sales allocated to breakeven prizes
oneX_two.fiveX	numeric	0.145, 0.316, 0.274, 0.313, 0.225, 0.45	Percentage of sales allocated to prizes between 1X and 2.5X of retail price
two.fiveX_fiveX	numeric	0.202, 0.184, 0.082, 0.202, 0, 0.149	Percentage of sales allocated to prizes between 2.5X and 5X
fiveX_tenX	numeric	0.202, 0.186, 0.137, 0.066, 0.357, 0.128	Percentage of sales allocated to prizes between 5X and 10X
tenX_twentyfiveX	numeric	0.293, 0.123, 0.182, 0.142, 0.238, 0.155	Percentage of sales allocated to prizes between 10X and 25X
twentyfiveX_fiftyX	numeric	0, 0.025, 0, 0, 0.052, 0.036	Percentage of sales allocated to prizes between 25X and 50X
fiftyX_hundredX	numeric	0.054, 0, 0.051, 0, 0.023, 0.01	Percentage of sales allocated to prizes between 50X and 100X
hundredX_TP	numeric	0.007, 0.01, 0.012, 0, 0, 0.006	Percentage of sales allocated to prizes between 100X and the top prize
top.prize	numeric	0.018, 0.081, 0.093, 0.214, 0.004, 0.009	Percentage of sales allocated to the top prize
jurisdiction	character	CA, CA, CA, CA, CA, CA	State lottery that launched the ticket

Convert predictors to R appropriate data types to facilitate analysis and plotting.

scratch_data$quantity <- as.numeric(scratch_data$quantity)

scratch_data$start.date <- mdy(scratch_data$start.date)

scratch_data$sales.date <- mdy(scratch_data$sales.date)

scratch_data[,13:17] <- lapply(scratch_data[,13:17], 
                               function(x) as.factor(x))

scratch_data$price <- as.factor(scratch_data$price)


scratch_data$start.month <- as.factor(scratch_data$start.month)

Re-order price factor levels in ascending order.

scratch_data$price <- fct_relevel(scratch_data$price, 
                                  levels = c("1", "2", "3", "5", "7", "10", "15", 
                                             "20", "25", "30", "50"))

Dataset is restricted to games launched since 2015 to focus analysis on recently launched tickets.

scratch_data <- filter(scratch_data, start.date >= as.Date("2015-01-01") & 
                         start.date <= as.Date("2018-12-01"))

Rows with missing sales values are removed. These are mostly the result of the earlier transformation reshaping sales data from wide format to a tidier, long format.

scratch_data <- filter(scratch_data, !is.na(sales))

Filter games with at least 13 weeks of sales and extract the first 13 weeks of sales for each ticket.

This excludes seasonal games with short print runs as well as games that may have been prematurely withdrawn from market. 13 weeks captures the typical period from ticket launch to peak sales. The aim is to compare sales performance over the same length of time in market.

scratch_data <- scratch_data %>%
  group_by(game.id) %>%
  filter(n() >= 13) %>%                  # keep games with at least 13 weeks of sales
  arrange(game.id, sales.date) %>%       # chronological order within each game
  slice(1:13)                            # first 13 weeks per game

Weekly sales are summed to obtain a total 13 week sales amount for each ticket. We are interested in analyzing the absolute performance of the games, not trends or variations apparent in the time series.

scratch_data <- scratch_data %>%
  group_by(game.id) %>%
  mutate(sales_13 = sum(sales)) %>%
  ungroup() %>%
  distinct(game.id, .keep_all = TRUE) %>%
  select(-sales)
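To make the "first N weeks, then sum" logic concrete, here is a small sketch on a toy table with three weekly rows per game, keeping the first two weeks in date order. The data and the column `sales_total` are hypothetical.

```r
library(dplyr)

# Toy weekly sales, deliberately out of date order
toy <- data.frame(
  game.id = rep(c("A", "B"), each = 3),
  sales.date = as.Date(rep(c("2015-01-17", "2015-01-03", "2015-01-10"), 2)),
  sales = c(30, 10, 50, 5, 20, 15)
)

totals <- toy %>%
  group_by(game.id) %>%
  filter(n() >= 3) %>%                  # games with at least 3 weeks of data
  arrange(game.id, sales.date) %>%      # chronological order within each game
  slice(1:2) %>%                        # first 2 weeks per game
  summarise(sales_total = sum(sales))   # one total per game

totals
```

Game A totals 60 (10 + 50) and game B totals 35 (20 + 15). Sorting by sales instead of date would select the two best weeks rather than the first two, giving 80 for game A.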

6 Data Exploration

6.1 Dataset summary

Table 2. Basic dataset characteristics. Scroll right to see additional columns

introduce(scratch_data) %>%
   kable(format = "html", longtable = TRUE) %>%
   kable_styling(bootstrap_options = "striped") %>%
   scroll_box(width = "100%")

rows	columns	discrete_columns	continuous_columns	all_missing_columns	total_missing_values	complete_rows	total_observations	memory_usage
1222	30	15	15	0	1218	392	36660	396384

The data includes an equal number of quantitative and qualitative predictors. Only a third of cases are complete (no missing values across all predictors). However, the total number of missing values is relatively small - 3.3%.
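The two proportions quoted above follow directly from the figures in Table 2; a short check of the arithmetic:

```r
# Figures taken from Table 2
rows <- 1222
columns <- 30
total_missing <- 1218
complete_rows <- 392

total_observations <- rows * columns                       # 36660 cells in total
pct_missing  <- 100 * total_missing / total_observations   # share of missing cells
pct_complete <- 100 * complete_rows / rows                 # share of complete cases

round(c(pct_missing = pct_missing, pct_complete = pct_complete), 1)
```

This gives roughly 3.3% missing cells and 32.1% complete cases, consistent with "a third of cases".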

plot_intro(scratch_data,
           title = "Figure 1 - Dataset Summary",
           ggtheme = theme_classic())

6.2 Missing Values

A closer examination of missing values.

plot_missing(scratch_data,
             title = "Figure 2 - Distribution of Missing Values by Feature",
             ggtheme = theme_classic())

quantity and ltp (low top prize) have a high proportion of values missing.

In addition, the classes of the logical variable ltp are severely unbalanced - 98% of tickets do not have low top prizes. Most algorithms used in the modelling process (Part II - Regression Analysis) will see their performance significantly diminished by one or both of these factors.

ggplot(scratch_data, aes(x = ltp)) +
  geom_bar(fill = "steelblue") +
  theme_classic() +
  labs(title = "ltp (low top prize)", 
       subtitle = "Severely unbalanced classes with high proportion of missing values" ,
       x = "", 
       y = "# of Cases",
       tag = "Figure 3") +
   theme(plot.tag = element_text(color = "steelblue"))

We remove the predictor from the dataset.

scratch_data$ltp <- NULL

North Carolina, Virginia and West Virginia have no data for ticket quantity. The quantity of tickets ordered is unlikely to directly explain variations in sales.

Table 3. Number of missing values per variable by jurisdiction

scratch_data %>%
  select(jurisdiction, quantity, family, play.style, feature, 
         theme, vendor, start.year, breakeven) %>%
  group_by(jurisdiction) %>%
  summarise_all(~ sum(is.na(.))) %>%
  kable(format = "html", caption = "") %>%
  kable_styling(full_width = TRUE, position = "left",
                bootstrap_options = c("striped", "hover"))


jurisdiction	quantity	family	play.style	feature	theme	vendor	start.year	breakeven
CA	0	0	0	0	0	0	0	1
FL	0	1	1	1	1	1	1	1
NC	174	0	0	0	0	0	0	0
NE	0	6	0	0	0	0	0	0
TX	0	3	3	3	3	3	3	4
VA	215	0	0	0	0	0	0	0
WV	157	3	3	3	3	3	3	4

The predictor has limited value and is removed from the dataset.

# Remove 'quantity' variable

scratch_data$quantity <- NULL

The 1,222 cases in the dataset break down by jurisdiction as follows:

Table 4. Number of cases by jurisdiction

scratch_data %>%
  group_by(jurisdiction) %>%
  summarise(cases = n()) %>%
  arrange(desc(cases)) %>%
  kable(col.names = c("Jurisdiction", "Cases"), format = "html") %>%
  kable_styling(full_width = FALSE, position = "left",
                bootstrap_options = c("striped", "hover"))

Jurisdiction	Cases
TX	260
VA	215
NC	174
WV	157
NE	150
CA	143
FL	123

6.3 Summary Statistics

Numeric variables

Several features have minimum values of 0. This is to be expected for features relating to prize fund allocation; 0% of the prize fund may be allocated to breakeven prizes, for example. However, 0 minimum values for odds, payout and num.win raise red flags. Every scratch ticket must have above 0 odds of winning a prize and the proportion of sales paid out to players must certainly be positive.

Table 5. Descriptive statistics for numeric variables.

table1(~., data = select_if(scratch_data, is.numeric))

	Overall (n=1222)
odds
Mean (SD)	4.09 (0.734)
Median [Min, Max]	4.09 [0.00, 9.73]
Missing	20 (1.6%)
start.year
Mean (SD)	2020 (0.997)
Median [Min, Max]	2020 [2020, 2020]
Missing	7 (0.6%)
num.win
Mean (SD)	10.8 (9.31)
Median [Min, Max]	10.0 [0.00, 115]
Missing	15 (1.2%)
payout
Mean (SD)	0.660 (0.0700)
Median [Min, Max]	0.660 [0.00, 0.820]
Missing	7 (0.6%)
breakeven
Mean (SD)	0.160 (0.0545)
Median [Min, Max]	0.157 [0.00, 0.448]
Missing	10 (0.8%)
oneX_two.fiveX
Mean (SD)	0.225 (0.0970)
Median [Min, Max]	0.225 [0.00, 0.917]
Missing	10 (0.8%)
two.fiveX_fiveX
Mean (SD)	0.149 (0.0948)
Median [Min, Max]	0.137 [0.00, 0.703]
Missing	10 (0.8%)
fiveX_tenX
Mean (SD)	0.136 (0.0906)
Median [Min, Max]	0.126 [0.00, 0.714]
Missing	10 (0.8%)
tenX_twentyfiveX
Mean (SD)	0.169 (0.0876)
Median [Min, Max]	0.159 [0.00, 0.444]
Missing	10 (0.8%)
twentyfiveX_fiftyX
Mean (SD)	0.0419 (0.0530)
Median [Min, Max]	0.0220 [0.00, 0.334]
Missing	10 (0.8%)
fiftyX_hundredX
Mean (SD)	0.0439 (0.0478)
Median [Min, Max]	0.0320 [0.00, 0.453]
Missing	10 (0.8%)
hundredX_TP
Mean (SD)	0.0209 (0.0252)
Median [Min, Max]	0.0130 [0.00, 0.237]
Missing	10 (0.8%)
top.prize
Mean (SD)	0.0530 (0.0539)
Median [Min, Max]	0.0350 [0.00, 0.573]
Missing	10 (0.8%)
sales_13
Mean (SD)	17800000 (24900000)
Median [Min, Max]	8940000 [2140, 2.81e+08]

Table 6. Number of cases with 0 values.

scratch_data %>%
  select(odds, payout, num.win) %>%
  summarise_all(~ sum(. == 0, na.rm = TRUE)) %>%
  kable(format = "html") %>%
  kable_styling(full_width = TRUE, position = "center",
                bootstrap_options = c("striped", "hover"))

odds	payout	num.win
1	8	55

0 values for odds, payout and num.win are recoded as missing.

is.na(scratch_data$odds) <- !scratch_data$odds
is.na(scratch_data$num.win) <- !scratch_data$num.win
is.na(scratch_data$payout) <- !scratch_data$payout
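The `is.na(x) <- !x` idiom is compact but not obvious: `!x` is TRUE only where `x` equals 0, so those positions are set to missing while existing NAs stay as they are. The sketch below shows this on a made-up vector and compares it with `dplyr::na_if()`, offered here only as a more explicit alternative.

```r
x <- c(4.33, 0, NA, 3.29)

# Idiom used above: flag positions where the value is 0
y <- x
is.na(y) <- !y            # !y is FALSE, TRUE, NA, FALSE

# Alternative spelling with dplyr
z <- dplyr::na_if(x, 0)

same_result <- identical(y, z)
y
```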

Games in market by year.

Table 7. Number of games in market by year in our dataset. The dataset does not cover all of 2018, hence the relatively small number of games for that year. Note that many more games (> 1,500) than are included in this dataset were launched in the US between 2015 and 2018.

tapply(scratch_data$sales.date, year(scratch_data$sales.date), length) %>%
  
  kable(format = "html") %>%
  kable_styling(full_width = TRUE, position = "center",
                bootstrap_options = c("striped", "hover"))

	x
2015	350
2016	405
2017	373
2018	94

Games by price point

The $5 price point is the most important with the highest number of tickets launched and the most significant contributor to sales.

After $5 tickets, $1 and $2 tickets are most frequently launched although the higher priced tickets generate more sales.

Note the large sales contribution of $10 and $20 tickets compared to the relatively small number of tickets launched at those price points.

scratch_data %>%
  group_by(price) %>%
  summarise("# of Games" = n(), "Total 13 Wk.Sales" = sum(as.numeric(sales_13))) %>%
  gather(variable, value, -price) %>%
  filter(price != 7 & price != 15) %>%
  
  ggplot(aes(x = price, y = value)) +
  geom_col(fill = "steelblue") +
  theme_classic() +
  facet_grid(variable ~., scales = "free") +
  labs(title = "Number of Games by Ticket Price \nSales (1st 13 Weeks) by Ticket Price", 
       x = "Ticket Price", 
       y = "Value",
       tag = "Figure 4") +
   theme(plot.tag = element_text(color = "steelblue"))

6.4 Data Distributions

Sales

The histogram shows strong right skewness with a high concentration of values at the low end of the data range.

ggplot(data = scratch_data, aes(x = sales_13/1000)) +
   geom_histogram(bins = 60, fill = "steelblue") +
   scale_x_continuous(labels = dollar, breaks = seq(0, 300000, 50000)) +
   labs(title = "Ticket Sales Histogram", 
        x = "Sales ($000) - Sum of 1st 13 Weeks", 
        y = "# of Tickets",
        tag = "Figure 5") +
   theme_hc() +
   theme(plot.tag = element_text(color = "steelblue"))

Sample percentiles confirm strong right skewness in the data and highlight the presence of unusually low sales values.

  data.frame(Min = min(comma(scratch_data$sales_13, digits = 0)),
             Quartile_1 = quantile(comma(scratch_data$sales_13, digits = 0), 0.25),
             Median = median(comma(scratch_data$sales_13, digits = 0)),
             Mean = mean(comma(scratch_data$sales_13, digits = 0)),
             Quartile_3 = quantile(comma(scratch_data$sales_13, digits = 0), 0.75),
             Max = max(comma(scratch_data$sales_13, digits = 0)),
             row.names = NULL) %>%
 
  kable(format = "html") %>%
  kable_styling(full_width = FALSE, position = "left")

Min	Quartile_1	Median	Mean	Quartile_3	Max
2,135	2,025,878	8,935,066	17,833,058	23,984,254	281,491,560

Log Transformed Sales

The wide range and extreme skewness of the data obscure potentially significant information in the previous sales histogram. Plotting the data on a logarithmic scale reveals previously unseen details. The multi-modality of the distribution suggests the presence of distinct sub groups in the data.

ggplot(data = scratch_data, aes(x = sales_13/1000, fill = jurisdiction)) +
   geom_histogram(bins = 60) +
  
   scale_x_log10(breaks = c(100, 1000, 10000),  labels = dollar) +
   scale_y_continuous(labels = comma, breaks = seq(0,10000,500)) +
  
   labs(title = "Ticket Sales Histogram",
        subtitle = "Log10 Transformed",
        x = "Sum of 1st 13 Weeks ($000)",
        y ="# of Cases",
        tag = "Figure 6",
        caption = "Source: La Fleur's Magazine; State Lotteries") +
   theme_hc() +
   theme(plot.tag = element_text(color = "steelblue"))

The significant differences in 13-week sales between jurisdictions may explain the multimodality observed in the histogram above.

Looking at the sales boxplots, we can roughly divide the jurisdictions into 3 tiers: large jurisdictions (California, Texas and Florida), mid-size (North Carolina and Virginia) and small (West Virginia and Nebraska).

ggplot(data = scratch_data, 
       aes(x = reorder(jurisdiction, sales_13, FUN = median), y = sales_13/1000)) +
   geom_boxplot(fill = "steelblue") +
  
  scale_y_continuous(labels = dollar_format(), breaks = seq(0, 300000, 50000)) +
  labs(title = "Ticket Sales by Jurisdiction", 
       x = "", 
       y = "Sales ($000) - Sum of 1st 13 Weeks",
       tag = "Figure 7") +
  coord_flip() +
  theme_classic() +
   theme(plot.tag = element_text(color = "steelblue"))

We will focus the following steps of our data exploration on the top 2 tiers.

scratch_data_toptiers <- scratch_data%>%
                           filter(jurisdiction %in% 
                                    c("CA", "TX", "FL", "VA", "NC"))

6.5 Quantitative Variables

6.5.1 Correlations

Correlations between all variable pairs

Correlations between most variable pairs are either non-existent or very weak.

3 pairs can be described as moderately correlated.

Higher payouts are moderately correlated with higher sales. Such tickets also tend to have more ways to win (num.win) built into their prize structures.

These relationships are well understood within the industry and merely serve to validate our data.

Nevertheless, the relative weakness of the associations is very informative.

scratch_data_toptiers %>% 
  inspect_cor() %>%
  show_plot() +
  labs(title = "Figure 8", 
       subtitle = "Pairwise Correlations - All Numeric Variables") +
  theme(plot.title = element_text(color = "steelblue"))

The pink and grey error bars are the confidence intervals around the correlation coefficients. The bars are grey when they touch the dotted centerline indicating that the true coefficient is not significantly different from 0.
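The same reading of a confidence interval can be reproduced for a single pair of variables with base R's `cor.test()`. This is a sketch on simulated data, and it may differ in detail from how `inspect_cor()` computes its intervals.

```r
# Simulated pair of variables with a moderate positive association
set.seed(1)
x <- rnorm(100)
y <- 0.4 * x + rnorm(100)

ct <- cor.test(x, y)      # Pearson correlation with a 95% confidence interval
ct$estimate
ct$conf.int

# The coefficient is distinguishable from 0 only if the interval excludes 0
excludes_zero <- ct$conf.int[1] > 0 || ct$conf.int[2] < 0
excludes_zero
```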

Correlations with sales

As noted above, the variables with the strongest positive correlations with sales are price and payout.

Weak correlations between sales and most prize tiers are noteworthy because considerable effort is expended by game designers to calibrate prize tiers in the belief that this is the main factor that explains the performance of a scratch ticket.

scratch_data_toptiers %>% 
  inspect_cor(with_col = "sales_13") %>%
  show_plot() +
  ggtitle("Figure 9", subtitle = "Correlations with Sales") +
  theme(plot.title = element_text(color = "steelblue"))

These results should be interpreted with caution. Several variables are inter-related. Higher priced tickets also have higher payouts and lower overall odds of winning, for example.

Also, correlations describe linear relationships between variables. Several of these relationships are likely non-linear as we discover below. Part II of this report covers regression modelling where we define more complex models with a view to developing a sophisticated understanding of the relationships between the variables in our dataset.
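A small constructed example shows why a weak Pearson correlation does not rule out a strong relationship: below, `y` is completely determined by `x`, yet the linear correlation is essentially zero because the relationship is symmetric rather than monotonic. The data are artificial.

```r
# y depends perfectly on x, but not linearly
x <- seq(-3, 3, by = 0.1)
y <- x^2

pearson <- cor(x, y)      # approximately 0 despite the exact dependence
pearson

# Once the curvature is modelled, the association is perfect
curved <- cor(x^2, y)
curved
```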

6.5.2 Sales vs. Odds of Winning

We examine the relationship between a key element of the prize structure of a scratch ticket - the overall odds of winning a prize - and sales performance.

Most lotteries design their scratch ticket portfolios to ensure that overall odds generally decrease as you go up in price point (odds are quoted as "1 in X", so a lower figure means a better chance of winning). The idea is to offer more winning opportunities to players who bet on higher priced tickets. Breakeven prizes are considered wins for these purposes.

There are exceptions - high price tickets with high odds that offer fewer, more substantial wins have recently proved successful.

Low price points $1, $2 and $3

The relationship between sales and odds is non-linear for low priced tickets.

Note, on the $3 chart, the consistently negative curve across most of the data range (between 3 and 4.25).

The $2 chart shows cases with odds between 4.25 and 4.75 separated into 2 clusters, one with high sales and another with weaker sales. It might be useful to perform a separate cluster analysis to understand the features that characterize those sub-groups.

# %in% tests membership; == would recycle 1:3 against the price column
ggplot(filter(scratch_data_toptiers, price %in% 1:3), aes(x = odds, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  facet_grid(~price) +
  labs(title = "Sales vs. Odds of Winning by Ticket Price",
       subtitle = "Low priced tickets - $1, $2 and $3",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 10") +
  theme(plot.tag = element_text(color = "steelblue"))
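Several plots in this section subset the data to more than one price point at once. The sketch below, which uses a made-up price vector rather than the real data, shows why membership in a set of price points is better tested with %in% than with ==: the latter recycles the comparison vector and compares position by position.

```r
# Hypothetical price vector; the values are illustrative only.
price <- c(1, 2, 3, 1, 2, 3, 3, 2, 1)

# `==` recycles 1:3 along `price` and compares element by element
by_equals <- price == 1:3

# `%in%` asks, for each element, whether it belongs to the set {1, 2, 3}
by_in <- price %in% 1:3

sum(by_equals)  # 7: two rows are dropped because the recycled values do not line up
sum(by_in)      # 9: every $1, $2 and $3 row is kept
```

With == the number of rows retained depends on the order of the data, so a filter written that way can silently discard valid cases.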

Figure 11 offers some initial insights.

Most of the tickets grouped at the higher end of the sales range are from Florida, a large, top performing lottery.

The relationship between sales and odds is not identical in the different markets. Consider Florida and Texas (also a large, top performing lottery). Texas $2 tickets perform significantly worse than Florida’s.

The Florida lottery’s $2 tickets with higher odds generally perform better. The same is not true for Texas.

There are clearly other factors at work.

ggplot(filter(scratch_data_toptiers, price == 2), aes(x = odds, y = sales_13, color = jurisdiction)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom") +
  geom_smooth(se = FALSE) +
  labs(title = "$2 Tickets - Sales vs. Odds of Winning by Jurisdiction",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 11") +
  theme(plot.tag = element_text(color = "steelblue"))

$5 price point

Figure 12 shows a negative relationship between $5 ticket odds and sales. This suggests that sales of $5 tickets decrease with higher odds.

However, the tight cluster of cases at the bottom right of the plot indicates the possible presence of a distinct sub-group.

ggplot(filter(scratch_data_toptiers, price == 5), aes(x = odds, y = sales_13)) +
  geom_point() +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 6, 0.25)) +
  geom_smooth(method = "loess") +
  labs(title = "$5 Tickets - Sales vs. Odds of Winning",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 12") +
  theme(plot.tag = element_text(color = "steelblue"))

Figure 13 shows us that almost all the cases in question are Virginia tickets. This lottery’s $5 tickets have higher odds and lower sales than the other jurisdictions - note that Virginia is a smaller lottery than California or Texas. Those 2 factors combine to accentuate the negative slope of the smooth curve.

The negative relationship between sales and odds of $5 tickets does not seem to hold when we look at each market individually, including Virginia. Indeed, the opposite is true for some jurisdictions - Texas and North Carolina, for example.

ggplot(filter(scratch_data_toptiers, price == 5), 
       aes(x = odds, y = sales_13, color = jurisdiction)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom") +
  scale_x_continuous(limits = c(3,5), breaks = seq(0, 5, 0.25)) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "$5 Tickets - Sales vs. Odds of Winning",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 13") +
  theme(plot.tag = element_text(color = "steelblue"))
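The pattern described for the $5 price point, a negative pooled trend that does not hold within individual markets, can be reproduced with a small simulation. This is only a sketch: the two markets "A" and "B", their odds ranges and their sales levels are invented for illustration and are not estimates from the dataset.

```r
set.seed(42)
n <- 100

# Hypothetical market A: lower odds, higher sales; sales rise with odds
odds_a  <- runif(n, 3.0, 3.8)
sales_a <- 50 + 5 * (odds_a - 3.0) + rnorm(n, sd = 1)

# Hypothetical market B: higher odds, lower sales; sales also rise with odds
odds_b  <- runif(n, 4.2, 5.0)
sales_b <- 10 + 5 * (odds_b - 4.2) + rnorm(n, sd = 1)

sim <- data.frame(odds   = c(odds_a, odds_b),
                  sales  = c(sales_a, sales_b),
                  market = rep(c("A", "B"), each = n))

# Slope when the two markets are pooled
pooled_slope <- coef(lm(sales ~ odds, data = sim))[["odds"]]

# Slope within each market
within_slopes <- sapply(split(sim, sim$market),
                        function(d) coef(lm(sales ~ odds, data = d))[["odds"]])

pooled_slope
within_slopes
```

The pooled slope is negative only because the market with higher odds also happens to have lower sales overall, while the slope inside each market is positive. This is one reason to examine each jurisdiction separately before drawing conclusions from the aggregate plot.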

$10 price point

The prevailing thinking among many industry experts is that higher priced tickets with tougher odds are likely to perform better because players attracted to such tickets prefer tougher odds in exchange for the opportunity to win “significant” prizes (approximately > 3X for a $10 ticket).

Figure 14 appears to confirm this.

However, it is noticeable that whilst the overwhelming majority of observations lie within a tight odds range (3.25 to 3.75), the corresponding sales values range widely.

The cases with very high odds (> 6) that seem like outliers are almost certainly tickets with non-traditional prize structures; for example 2 tier prize structures with all prizes paying out either $50 or $100. Because the prize fund is fixed (say 70% of revenue for a $10 ticket), the total number of winning tickets is significantly reduced (pushing up the overall odds of winning) to make the structure work. Some players are happy to accept the much harder odds in exchange for the opportunity to win a “meaningful” sum.

ggplot(filter(scratch_data_toptiers, price == 10), aes(x = odds, y = sales_13)) +
  geom_point() +
  theme_classic() +
  geom_smooth() +
  labs(title = "$10 Tickets - Sales vs. Odds of Winning",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 14") +
  theme(plot.tag = element_text(color = "steelblue"))
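The arithmetic behind the very high odds of the outlying $10 tickets can be sketched as follows. The $10 price and 70% prize fund are the illustrative figures quoted in the text; the helper function overall_odds() and the even split between $50 and $100 winners are assumptions made for this example only.

```r
# Illustrative figures: a $10 ticket with a prize fund equal to 70% of revenue
ticket_price <- 10
payout_ratio <- 0.70
prize_fund_per_ticket <- ticket_price * payout_ratio   # $7 of prizes per ticket sold

# Hypothetical helper: overall odds expressed as "1 in X".
# If W winners share a fund of 7 * N dollars, then N / W = average prize / 7.
overall_odds <- function(prize_values, prize_shares) {
  avg_prize <- sum(prize_values * prize_shares)   # average prize per winning ticket
  avg_prize / prize_fund_per_ticket
}

odds_all_50 <- overall_odds(c(50, 100), c(1, 0))      # every winner receives $50
odds_even   <- overall_odds(c(50, 100), c(0.5, 0.5))  # assumed even split of winners

round(c(all_50 = odds_all_50, even_split = odds_even), 2)
```

Under these assumptions the overall odds come out at roughly 1 in 7 to 1 in 11, well above the 3.25 to 3.75 range in which most $10 tickets sit.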

When we add information about jurisdiction, we see that the positive trend is less pronounced within each individual market.

ggplot(filter(scratch_data_toptiers, price == 10), 
       aes(x = odds, y = sales_13, color = jurisdiction)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "$10 Tickets - Sales vs. Odds of Winning",
       subtitle = "X axis limits are set between 3 and 4 to zoom the plot",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 15") +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_cartesian(xlim = c(3,4))

$20 price point

There is not an easily recognizable pattern in the relationship between $20 ticket odds and sales performance.

With only 56 cases spread across 5 distinct markets, we must be careful in interpreting this plot. The wide confidence interval around the smooth curve is an important reminder.

ggplot(filter(scratch_data_toptiers, price == 20), 
       aes(x = odds, y = sales_13)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom") +
  geom_smooth(method = "loess") +
  labs(title = "$20 Tickets - Sales vs. Odds of Winning",
       subtitle = "X axis limits are set between 2.5 and 4.5 to zoom the plot",
       x = "Overall Odds of Winning a Prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 16") +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_cartesian(xlim = c(2.5,4.5), ylim = c(0, 150000000))

6.5.3 Sales vs. Payout Ratio

The payout ratio refers to the proportion of sales that is paid out to scratch ticket buyers in the form of cash prizes. As figure 17 shows, payout ratio increases with ticket price. As we noted above, payout is the variable (in our dataset) most correlated with sales. Higher payout tickets typically sell more, on average.

Note that this positive relationship does not hold above $25. There are a couple of possible explanations. $25 tickets are often anniversary or other sorts of special tickets whose sales may be artificially boosted by heavy advertising and promotion; notice the very small number of cases. $50 tickets are only available in a small number of lotteries such as Texas, and the price point represents a significant spend for a scratch ticket. Only the most committed players are willing to make the purchase.

scratch_data_toptiers %>%
  group_by(price) %>%
  summarise("Average Payout Ratio" = mean(payout, na.rm = TRUE), 
            "Average 13 Week Sales" = mean(sales_13), 
            "# of Cases" = n()) %>%
  gather(variable, value, -price) %>%
  filter(price != 7) %>%
  
  ggplot(aes(x = as.factor(price), y = value)) +
  geom_col(fill = "steelblue") +
  theme_classic() +
  facet_grid(variable ~., scales = "free") +
  labs(title = "Average Payout and Sales by Price Point",
       subtitle = "Average Ticket Payout Ratio and Sales generally increase with price point",
       x = "Ticket Price",
       y = "Value",
       tag = "Figure 17") +
  theme(plot.tag = element_text(color = "steelblue"))

The Texas cases at the extreme right of the payout range are $50 tickets. Sales for those tickets are lower for the reasons described above.

ggplot(scratch_data_toptiers, 
       aes(x = payout, y = sales_13)) +
  geom_point() +
  facet_wrap(~jurisdiction) +
  geom_smooth() +
  scale_x_continuous(labels = percent) +
  labs(title = "Sales vs. Payout Ratio by Jurisdiction",
       subtitle = "A positive relationship between sales and payout ratio in all jurisdictions",
       x = "Payout Ratio",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 19") +
  theme(plot.tag = element_text(color = "steelblue"))

6.5.4 Sales vs. Prize tier allocations

As previously mentioned, the distribution of the prize fund between prize tiers (and indeed the value of the prize tiers themselves) is increasingly seen as the main objective factor affecting the performance of a ticket in market. This section focuses on examining the association between sales vs. breakeven prizes and sales vs. top prizes. These tiers are arguably the most important as they act as key anchor points for the entire prize structure, highly visible to players (in the case of the top prize) and crucial for controlling the overall odds of winning, a key variable for players.

Breakeven

The graph shows sales gently declining as breakeven allocation increases up to about 15%. Beyond that, sales begin to rise with further increases in the proportion of the prize fund allocated to breakeven prizes. This observation should be interpreted with great caution because the data points represent all the tickets in the dataset. The following caveats apply here, as they do to all analyses relating to prize structures:

Scratch ticket price points should almost be considered as separate instant game product lines. Each price point will generally comprise tickets of various themes, play styles, features etc. Lotteries will generally take a portfolio approach in designing prize structures and will take care to differentiate them to ensure that structures correspond to player expectations of tickets purchased at different prices. It is therefore useful to examine prize structures both globally and price point by price point.
Ticket sales volumes vary significantly by jurisdiction and individual state lotteries do not adopt uniform approaches to prize structure design. It is worth exploring sales vs. prize structure relationships by individual jurisdiction.

ggplot(scratch_data_toptiers, aes(x = breakeven, y = sales_13)) +
  geom_point() +
  theme_classic() +
  geom_smooth() +
  scale_x_continuous(labels = percent, breaks = seq(0, 0.5, 0.05)) +
  labs(title = "Sales vs. Breakeven",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to Breakeven Prizes",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 18") +
  theme(plot.tag = element_text(color = "steelblue"))

ggplot(scratch_data_toptiers, 
       aes(x = breakeven, y = sales_13, color = jurisdiction)) +
  geom_point() +
  theme_classic() +
  geom_smooth(se = FALSE) +
  scale_x_continuous(labels = percent, breaks = seq(0, 0.5, 0.05)) +
  labs(title = "Sales vs. Breakeven by Jurisdiction",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to Breakeven Prizes",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 19") +
  theme(plot.tag = element_text(color = "steelblue"), legend.position="bottom")

The proportion of the prize fund dedicated to breakeven prizes for low-priced tickets ($1 - $3) ranges between about 10% and 25%.

A clear negative relationship is apparent in the plots. Increasing the allocation to breakeven is associated with lower sales. This runs counter to the widely held view within the industry that low-priced tickets should have high breakeven allocations to maximize the winning experiences (research suggests players feel they have won even when they are merely getting their money back) of the casual/new players who are thought to favour these price points.

This observation is consistent with the growing view that most players prefer fewer, more meaningful wins; simply getting your money back may not be as satisfying for players as previously thought by scratch ticket marketers.

ggplot(filter(scratch_data_toptiers, price %in% c(1,2,3)), aes(x = breakeven, y = sales_13)) +
  geom_point() +
  facet_wrap(~price) +
  geom_smooth() +
  labs(title = "Sales vs. Breakeven (Low price points)",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to Breakeven Prizes",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 20") +
  theme(plot.tag = element_text(color = "steelblue"))

The proportion of the prize fund dedicated to breakeven prizes for mid-priced tickets ($5 - $10) ranges between about 7.5% and 20%. The range is wider and we can observe some non-breakeven tickets: games with prize structures that completely eschew parity and focus the prize fund on mid-tier prizes (roughly 5X to 10X ticket price). These non-traditional tickets are gaining in popularity with lotteries as they prove successful in market.

The negative association observed with low-priced tickets is less pronounced. The relationship is more complex, certainly at the $10 price point.

ggplot(filter(scratch_data_toptiers, price %in% c(5,10)), aes(x = breakeven, y = sales_13)) +
  geom_point() +
  facet_wrap(~price) +
  geom_smooth() +
  labs(title = "Sales vs. Breakeven (mid price points)",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to Breakeven Prizes",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 21") +
  theme(plot.tag = element_text(color = "steelblue"))

ggplot(filter(scratch_data_toptiers, price == 20), aes(x = breakeven, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Sales vs. Breakeven ($20 tickets)",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to Breakeven Prizes",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 22") +
  theme(plot.tag = element_text(color = "steelblue"))

Top prize

The top prize is the most visible element of a scratch ticket’s prize structure. It is almost always clearly displayed on the face of the ticket. Indeed, this number is often used as a marketing tool and is sometimes even built into the ticket’s name - “$10,000,000 Cash Spectacular!”. Most lottery players play in the hope of winning a life transforming prize. The top prize is usually the only prize tier in the structure large enough to make such dreams come true, improbable as this may be; most games will only have a small handful of such prizes.

The plot showing all data points together suggests there is a moderately strong, positive association between the size of the top prize and the sales performance of a ticket.

ggplot(scratch_data_toptiers, aes(x = top.prize, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  theme_classic() +
  labs(title = "Sales vs. Top Prize",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to the top prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 23A") +
  theme(plot.tag = element_text(color = "steelblue"))

Figure 23B presents a clearer view of the data by ignoring outliers and zooming in on the section of the plot that concentrates the vast majority of the information.

ggplot(scratch_data_toptiers, aes(x = top.prize, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  coord_cartesian(xlim = c(0, 0.15)) +
  theme_classic() +
  labs(title = "",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to the top prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 23B") +
  theme(plot.tag = element_text(color = "steelblue"))

The price point plots in figure 24 and figure 25 underscore the need to take a more granular view when exploring scratch ticket data. Looking at the data in aggregate (figure 23), we clearly notice a moderate, positive relationship between top prize and sales. However, looking at the data price point by price point we observe the relationship to be at times positive, negative or barely existent. In some cases we can clearly see that the association between the 2 variables may be non-linear. This reinforces the motivation for the application of sophisticated statistical methods that can handle the complexity of these associations. Part II of this study is dedicated to regression modelling on the data explored here.

ggplot(filter(scratch_data_toptiers, price %in% c(1,2,3)),
       aes(x = top.prize, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~price) +
  coord_cartesian(ylim = c(0,50000000), xlim = c(0,0.10)) +
  labs(title = "Sales vs. Top prize by price point (low price points)",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to the top prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 24") +
  theme(plot.tag = element_text(color = "steelblue"))

ggplot(filter(scratch_data_toptiers, price %in% c(5,10,20)),
       aes(x = top.prize, y = sales_13)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~price) +
  coord_cartesian(ylim = c(0,75000000), xlim = c(0,0.10)) +
  labs(title = "Sales vs. Top prize by price point (mid-high price points)",
       subtitle = "",
       x = "Proportion of Prize Fund Allocated to the top prize",
       y = "Weekly Sales - 1st 13 Weeks",
       tag = "Figure 25") +
  theme(plot.tag = element_text(color = "steelblue"))

6.6 Categorical Variables - ANOVA

So far we have examined the numerical variables in our dataset. However, about half the predictors available to us are categorical and those require separate data exploration techniques.

The table and stacked barplot below show the relative frequencies of categorical variables. We can observe, for example, that the most common ticket color is blue, the most frequently occurring features are multipliers and doublers, and by far the most dominant play style among scratch-off tickets in the dataset is the key number match, where players are invited to match their numbers to winning numbers on the ticket in order to win prizes.

Table 8. Relative frequency of categorical variables

scratch_data_toptiers_cat <- scratch_data_toptiers %>% 
                             select(-c(game.id, game.name, sales.date,
                                       start.date, start.month)) %>%
                             inspect_cat()

scratch_data_toptiers_cat %>%
  select(-levels) %>%
  kable(format = "html") %>%
  kable_styling(full_width = TRUE, position = "center")

col_name	cnt	common	common_pcnt
color	21	Blue	19.56284
family	3	FALSE	78.03279
feature	16	Multiplier - Doubler	19.78142
jurisdiction	5	TX	28.41530
license	3	FALSE	93.55191
play.style	20	Key Number Match	61.85792
price	10	5	29.83607
theme	36	Money / Cash	30.16393
vendor	7	Unknown	55.95628

scratch_data_toptiers_cat %>%
  show_plot(high_cardinality = 1)

More significantly, we aim to understand the nature and strength of any differential associations between individual groups and sales. We attempt to answer these questions using ANOVA (analysis of variance) methods to test for significant differences in our continuous dependent variable (sales) across the categorical independent variables (e.g. price, with each price point constituting a group).

For example, do sales vary significantly by price point? Do $5 games typically generate more or less sales revenue than $10 games or any other price point and, if so, are the observed differences statistically significant?

6.6.1 Ticket Price

The top bar chart in figure 27 shows that the highest priced tickets generate significantly greater sales revenue on average than the lower priced options. The bottom chart reminds us that the relative frequency of these high priced tickets in the dataset is very low. Lotteries launch far fewer tickets priced at $20 and above than they do $2 or $5 tickets.

scratch_data_toptiers %>%
  group_by(price) %>%
  summarise(count = n(), avg.sales = mean(sales_13)) %>%
  gather(variable, value, -price) %>%
  filter(!price %in% c(7, 15)) %>%
  
  ggplot(aes(x = as.factor(price), y = value)) +
  geom_col(fill = "steelblue") +
  labs(title = "",
       subtitle = "",
       x = "Ticket Price",
       y = "",
       tag = "Figure 27") +
  theme(plot.tag = element_text(color = "steelblue")) +
  facet_grid(variable ~., scales = "free")

The marked difference in the sample size of the groups (price points) underlines the need to verify the validity of a key condition required for the reliability of ANOVA results - the variances within each of the groups are assumed to be (roughly) equal.

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.

leveneTest(sales_13 ~ as.factor(price), data = scratch_data_toptiers)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   9  40.493 < 2.2e-16 ***
##       905                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
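For readers unfamiliar with the test, the following base R sketch shows what a median-centred Levene test computes: a one-way ANOVA on the absolute deviations of each observation from its group median. We believe this mirrors the calculation reported above with center = median, but the data, group labels and object names below are simulated and hypothetical.

```r
set.seed(7)

# Simulated groups with equal means but very different spreads
sales <- c(rnorm(60, mean = 100, sd = 5),
           rnorm(60, mean = 100, sd = 20),
           rnorm(60, mean = 100, sd = 60))
group <- factor(rep(c("low", "mid", "high"), each = 60))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(sales - ave(sales, group, FUN = median))

# One-way ANOVA on the deviations: the F test here is the Levene statistic
levene_tab <- anova(lm(abs_dev ~ group))
levene_p   <- levene_tab[["Pr(>F)"]][1]

levene_tab
```

A small p-value indicates that the typical deviation from the median differs between groups, that is, that the group variances are unlikely to be equal.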

This, combined with the extreme skewness of the data and the presence of significant outliers (see figures 5, 6 and 7), leads us to use a non-parametric test that does not assume normally distributed data. The Kruskal-Wallis test indicates that there are significant differences between the groups: the p-value is far below the 5% cutoff, meaning that the observed variation would be very unlikely if there were no differences between the groups.

kruskal.test(sales_13 ~ as.factor(price), data = scratch_data_toptiers)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  sales_13 by as.factor(price)
## Kruskal-Wallis chi-squared = 427.25, df = 9, p-value < 2.2e-16
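As a rough illustration of what the test statistic measures, the sketch below computes the Kruskal-Wallis H statistic by hand from the ranks and compares it with kruskal.test(). The data are simulated, the group labels are hypothetical, and the hand calculation omits the tie correction, which is harmless here because the simulated values are continuous.

```r
set.seed(11)

# Simulated skewed "sales" for three hypothetical price groups
sales <- c(rexp(40, rate = 1 / 10), rexp(30, rate = 1 / 25), rexp(50, rate = 1 / 60))
group <- factor(rep(c("p1", "p5", "p20"), times = c(40, 30, 50)))

r         <- rank(sales)                 # ranks over the pooled sample
n_total   <- length(sales)
rank_sums <- tapply(r, group, sum)       # sum of ranks per group
n_group   <- tapply(r, group, length)    # group sizes

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
h_stat  <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / n_group) - 3 * (n_total + 1)
p_value <- pchisq(h_stat, df = nlevels(group) - 1, lower.tail = FALSE)

kw <- kruskal.test(sales ~ group)
c(by_hand = h_stat, kruskal_test = unname(kw$statistic))
```

Because the statistic depends only on ranks, a handful of extreme sales values cannot dominate it in the way they would dominate a comparison of means.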

The Kruskal-Wallis test only tells us that at least one group differs significantly from the others; it does not identify which pairs differ.

The nonparametric pairwise Wilcoxon rank sum test performs pairwise comparisons between the groups. The table of p-values shows the statistical significance of the difference between each pair. We can see, for example, that sales generated by the typical $20 game are significantly different (we presume higher, based on figure 27) from those of $1, $2, $3 and $5 games but not from those of $7, $10 or $50 games. This may be relevant for state lotteries whose instant game portfolios are still overwhelmingly dominated by low priced tickets, an approach usually adopted in the mistaken view that low priced tickets, being more accessible to a larger number of players, will generate higher overall sales.

pairwise.wilcox.test(as.numeric(scratch_data_toptiers$sales_13), scratch_data_toptiers$price,
                     p.adjust.method = "BH")

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  as.numeric(scratch_data_toptiers$sales_13) and scratch_data_toptiers$price 
## 
##    1       2       3       5       7       10      20      25      30     
## 2  2.8e-10 -       -       -       -       -       -       -       -      
## 3  1.9e-10 0.08298 -       -       -       -       -       -       -      
## 5  < 2e-16 < 2e-16 7.7e-07 -       -       -       -       -       -      
## 7  0.11499 0.10652 0.10652 0.10652 -       -       -       -       -      
## 10 < 2e-16 < 2e-16 5.0e-16 2.9e-11 0.10652 -       -       -       -      
## 20 < 2e-16 < 2e-16 6.9e-15 2.9e-11 0.10917 0.08417 -       -       -      
## 25 0.00552 0.00552 0.00569 0.00552 0.52326 0.00569 0.00764 -       -      
## 30 5.8e-06 7.0e-06 2.0e-05 3.1e-05 0.24390 0.00113 0.00744 0.10652 -      
## 50 8.1e-05 9.5e-05 0.00018 0.00552 0.30612 0.52573 0.71209 0.03571 0.04254
## 
## P value adjustment method: BH
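The BH (Benjamini-Hochberg) adjustment used above controls the false discovery rate across the many pairwise comparisons. A minimal sketch of the adjustment follows, checked against p.adjust(); the function name bh_adjust() and the raw p-values are made up for illustration and are not taken from the output above.

```r
# Hypothetical raw p-values from six pairwise comparisons
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)

# Hand-rolled Benjamini-Hochberg adjustment (illustrative helper)
bh_adjust <- function(p) {
  m  <- length(p)
  o  <- order(p, decreasing = TRUE)   # largest p-value first
  ro <- order(o)                      # permutation that restores the input order
  # scale the i-th smallest p-value by m / i, then enforce monotonicity
  adj <- pmin(1, cummin(m / (m:1) * p[o]))
  adj[ro]
}

p_manual  <- bh_adjust(p_raw)
p_builtin <- p.adjust(p_raw, method = "BH")

round(rbind(raw = p_raw, manual = p_manual, builtin = p_builtin), 4)
```

Each adjusted value can be compared directly with the 5% cutoff, which is how the table of pairwise p-values above is read.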

Dunn’s test identifies which pairs are significantly different from one another and also shows the direction (higher or lower) and the magnitude of each difference.

dunn.test(as.numeric(scratch_data_toptiers$sales_13), scratch_data_toptiers$price,
          method = "bh")

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 427.2458, df = 9, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                              (Benjamini-Hochberg)                              
## Col Mean-|
## Row Mean |          1         10          2         20         25          3
## ---------+------------------------------------------------------------------
##       10 |  -15.16835
##          |    0.0000*
##          |
##        2 |  -4.781530   11.28802
##          |    0.0000*    0.0000*
##          |
##       20 |  -13.29149  -1.055345  -10.14171
##          |    0.0000*     0.1725    0.0000*
##          |
##       25 |  -4.575735  -1.365293  -3.721347  -1.054627
##          |    0.0000*     0.1107    0.0002*     0.1682
##          |
##        3 |  -5.658379   8.060847  -1.778457   7.798167   3.310132
##          |    0.0000*    0.0000*     0.0530    0.0000*    0.0009*
##          |
##       30 |  -6.907189  -1.723846  -5.536920  -1.210984   0.247084  -4.817075
##          |    0.0000*     0.0578    0.0000*     0.1374     0.4024    0.0000*
##          |
##        5 |  -13.21820   5.130171  -8.268639   5.190616   2.388291  -4.640317
##          |    0.0000*    0.0000*    0.0000*    0.0000*    0.0131*    0.0000*
##          |
##       50 |  -5.258462  -0.758147  -4.062113  -0.334778   0.680488  -3.467747
##          |    0.0000*     0.2461    0.0001*     0.3773     0.2658    0.0006*
##          |
##        7 |   0.763030   2.617768   1.260896   2.779721   2.969962   1.480201
##          |     0.2506    0.0071*     0.1296    0.0047*    0.0027*     0.0919
## Col Mean-|
## Row Mean |         30          5         50
## ---------+---------------------------------
##        5 |   3.398862
##          |    0.0007*
##          |
##       50 |   0.581231  -2.193448
##          |     0.2936    0.0212*
##          |
##        7 |   3.075574   2.039249   2.729539
##          |    0.0020*     0.0301    0.0053*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

We apply the same procedure to other key categorical variables.

6.6.2 Theme

The most common ticket themes are Money/Cash, Multiplier, Numbers and Holiday tickets.

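# Keep only themes with more than 10 cases (as.numeric() on a factor returns level codes, not counts)
ggplot(scratch_data_toptiers %>% group_by(theme) %>% filter(n() > 10) %>% ungroup(),
       aes(x = fct_infreq(theme))) +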
  geom_bar(fill = "steelblue") +
  scale_y_continuous() +
  labs(title = "Ticket Theme (themes with > 10 cases in the dataset)",
       subtitle = "",
       x = "",
       y = "Count",
       tag = "Figure 28") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

Note that some of the themes with the highest median sales have relatively few observations (Novelty, Annuity, Extended play - other) leading us to refrain from making any general inferences from this information.

ggplot(data = scratch_data_toptiers, aes(x = fct_reorder(theme, sales_13), y = sales_13/1000)) +
  geom_boxplot(fill = "steelblue") +
  
  scale_y_continuous(labels = dollar) +
  
  labs(title = "Sales Distribution by Theme", 
       x = "", 
       y = "Sales ($000)",
       tag = "Figure 29") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

We test for homogeneity of variance because of the non-normality of group distributions.

leveneTest(sales_13 ~ theme, data = scratch_data_toptiers)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group  34  1.4202 0.05788 .
##       876                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test does not reject homogeneity (equality) of variance at the 5% level (p = 0.058), although the result is borderline.

Multiple pairwise comparison between groups.

6.6.3 Play Style

About 62% of tickets are key number match (see Table 8). The top 4 play styles make up >80% of all cases.

ggplot(data = scratch_data_toptiers, aes(x = fct_infreq(play.style))) +
  geom_bar(fill = "steelblue") +
  
  scale_y_continuous(breaks = seq(0,700,50)) +
  
  labs(title = "Play Style", 
       x = "", 
       y = "Count",
       tag = "Figure 30") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

Note the very small sample sizes of some of the top ranked play styles. Poker, the play style with the highest median sales, is one example; its number of observations is given by nrow(filter(scratch_data_II, play.style == "Poker")).

ggplot(data = scratch_data_toptiers, aes(x = fct_reorder(play.style, sales_13), y = sales_13/1000)) +
  geom_boxplot(fill = "steelblue") +
  scale_y_continuous() +
  labs(title = "Play Style", 
       x = "", 
       y = "Sales (000)",
       tag = "Figure 31") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

Test for homogeneity of variance

leveneTest(sales_13 ~ play.style, data = scratch_data_toptiers)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group  18  1.7252 0.03037 *
##       892                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.

6.6.4 Features

Multiplier features are the most frequent and the best performing in the market.

ggplot(data = scratch_data_toptiers, 
       aes(x = fct_infreq(feature, ordered = TRUE))) +
  geom_bar(fill = "steelblue") +
  scale_y_continuous() +
  labs(title = "Ticket Feature", 
       x = "", 
       y = "Count",
       tag = "Figure 32") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

ggplot(data = scratch_data_toptiers, 
       aes(x = fct_reorder(feature, sales_13), y = sales_13/1000)) +
  geom_boxplot(fill = "steelblue") +
  
  scale_y_continuous() +
  
  labs(title = "", x = "", 
       y = "Sales ($000)",
       tag = "Figure 33") +
  theme_classic() +
  theme(plot.tag = element_text(color = "steelblue")) +
  coord_flip()

Test for homogeneity of variance

leveneTest(sales_13 ~ feature, data = scratch_data_toptiers)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group  14  7.7524 1.191e-15 ***
##       896                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.

Data exploration is an iterative, non-exhaustive process. Whilst we recognize that we could have developed a more complete view of the data with an even more thorough exploratory data analysis, we are satisfied that our current understanding of the data is sufficient for us to proceed with the regression analysis portion of the project. Please see Part II of this report.

7 Appendix

7.1 R Session Information

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1.1      dunn.test_1.3.5     formattable_0.2.0.1
##  [4] car_3.0-3           carData_3.0-2       stringi_1.4.3      
##  [7] gridExtra_2.3       ggthemes_4.2.0      kableExtra_1.1.0   
## [10] knitr_1.28          scales_1.0.0        inspectdf_0.0.5    
## [13] table1_1.1          DataExplorer_0.8.0  lubridate_1.7.4    
## [16] forcats_0.4.0       stringr_1.4.0       dplyr_0.8.3        
## [19] purrr_0.3.2         readr_1.3.1         tidyr_0.8.3        
## [22] tibble_2.1.3        ggplot2_3.2.1       tidyverse_1.2.1    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-139       bitops_1.0-6       webshot_0.5.1      progress_1.2.2    
##  [5] httr_1.4.1         tools_3.6.0        backports_1.1.4    R6_2.4.0          
##  [9] KernSmooth_2.23-15 lazyeval_0.2.2     colorspace_1.4-1   withr_2.1.2       
## [13] tidyselect_0.2.5   prettyunits_1.0.2  curl_4.0           compiler_3.6.0    
## [17] cli_1.1.0          rvest_0.3.4        xml2_1.2.2         labeling_0.3      
## [21] caTools_1.17.1.2   digest_0.6.20      foreign_0.8-71     rmarkdown_1.15    
## [25] rio_0.5.16         pkgconfig_2.0.2    htmltools_0.3.6    highr_0.8         
## [29] htmlwidgets_1.3    rlang_0.4.0        readxl_1.3.1       rstudioapi_0.10   
## [33] generics_0.0.2     jsonlite_1.6       gtools_3.8.1       zip_2.0.3         
## [37] magrittr_1.5       Formula_1.2-3      Rcpp_1.0.2         munsell_0.5.0     
## [41] ggfittext_0.8.1    abind_1.4-5        yaml_2.2.0         plyr_1.8.4        
## [45] grid_3.6.0         parallel_3.6.0     gdata_2.18.0       crayon_1.3.4      
## [49] lattice_0.20-38    haven_2.1.1        hms_0.5.1          zeallot_0.1.0     
## [53] pillar_1.4.2       igraph_1.2.4.1     reshape2_1.4.3     glue_1.3.1        
## [57] evaluate_0.14      data.table_1.12.2  modelr_0.1.5       vctrs_0.2.0       
## [61] networkD3_0.4      cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1  
## [65] xfun_0.9           openxlsx_4.1.0.1   broom_0.5.2        viridisLite_0.3.0 
## [69] ellipsis_0.2.0.1

What Drives Scratch Ticket Sales in the US? Part I - Data Preparation and Exploration

Karl Melo

July 2018

jurisdiction	quantity	family	play.style	feature	theme	vendor	start.year	breakeven
CA	0	0	0	0	0	0	0	1
FL	0	1	1	1	1	1	1	1
NC	174	0	0	0	0	0	0	0
NE	0	6	0	0	0	0	0	0
TX	0	3	3	3	3	3	3	4
VA	215	0	0	0	0	0	0	0
WV	157	3	3	3	3	3	3	4
