1 Objective

The objective of this data analytics project is to gain a systematic understanding of the factors that explain the variability of scratch ticket sales in the US. We approach the question from the point of view of state lotteries that exclusively distribute this product in the US. Deeper insight into these relationships will help lotteries develop and market scratch tickets that better satisfy the diverse and evolving needs of players in different market and customer segments.

2 Business Context

Scratch-off lottery tickets (also known as scratch tickets or instant games) are distributed in the US through networks of retailers by state lotteries that operate monopolies within their jurisdictions. The product nevertheless competes with other forms of gambling such as casino-based games, draw games and various other forms of betting. Although revenue growth held steady at 3-6% between 2013 and 2018, US lotteries have reasons to be concerned about the long-term sustainability of the traditional scratch ticket product, which contributes over $50 billion in combined annual revenue.

2.1 Challenges

Demographics - Fewer young people are purchasing scratch tickets. In a 2017 research report, IGT (a major lottery consulting group) found that millennials (born between 1981 and 1997), the largest generation and also the largest share of the work force, are showing declining interest in traditional lottery products in favour of digital games accessible through their mobile phones. Their economic weight will only rise as their incomes increase over the coming years.

Competition - New gaming options (e.g. online sports betting) more in tune with younger consumers’ habits and preferences are receiving government assent in a growing number of states across the country. The trend is forecast to continue.

Uncertainty around the factors that explain sales variability - There is remarkable variance in the sales performance of scratch tickets. Aside from a few core games that remain in market permanently, states have to rely on sales from new games launched at frequent intervals - players typically expect new releases every month. The performance of these new games is hard to predict and lotteries often find themselves uncomfortably dependent on a few hit games to drive fiscal year sales growth.

The factors that affect sales are not well understood. There are several, often competing, ideas that are mostly based on subjective experience, intuition, anecdote and survey-based research on very limited population samples.

This list is by no means exhaustive. The industry is faced with several other significant challenges including mediocre management (at some lotteries) and regulatory restrictions imposed by state authorities.

2.2 Attempted Solutions

US lotteries are attempting to respond to some of these threats by appealing to younger audiences through online instant games with more gamification features (subject to regulatory approval), increasing their presence on social media and communicating more about the social causes lotteries support.

There is also a greater effort to better understand the factors that explain sales variability. Multiple opposing ideas have emerged from these efforts, largely because the methods adopted are mostly unscientific, leading different stakeholders to put forward “analysis” and “evidence” that favors their interests. Ticket art designers believe sales mostly depend on the appeal of the ticket artwork; sales people argue that sales are mainly driven by field sales and retail optimization; others are convinced it is all about the prize structure of the ticket. The prize structure argument requires some explanation.

A pre-determined proportion of the revenue generated by the sales of each instant game must be returned to players in the form of prizes. The lottery would generally design their portfolio of instant games such that this payout ratio cascades downwards with the highest priced tickets returning the greatest proportion of their sales to players. Whilst the payout ratios are pre-determined (by lottery policy and/or state regulation), the specific manner in which the prize fund is allocated between prize tiers may be varied in endless ways. For example, do you sacrifice small wins for a large top prize to attract players chasing the dream or do you reduce the marquee prize in favour of a large number of winning tickets offering more, albeit modest, win opportunities for players? These prize structure calibrations are thought by many to have a fundamental influence on players’ experience of an instant game and thus the sales of the ticket. Lotteries are investing greater efforts in seeking the optimal prize allocations for instant games.

Considerable effort is also going into improving field marketing and sales operations as well as optimizing product portfolios. We remain convinced that the most significant opportunity lies in gaining a true understanding of the variables that affect sales. This knowledge should then be used to inform growth initiatives in all other areas.

3 Analytics Approach

We will attempt to develop a systematic understanding of the factors that explain the variability of scratch ticket sales by answering the following questions:

  • Which independent variables (predictors) are significantly associated with sales, the dependent variable (response)?

  • What is the nature of the relationship between each predictor and sales: positive or negative, linear or non-linear?

  • If, as is most likely, it turns out that sales variability is associated with some combination of several predictors, can we accurately estimate the effect of each independent variable on sales?

  • Do the independent variables interact with each other in significant ways?

3.1 Potential Business Benefits

Accurate answers to the above questions could help scratch ticket marketers:

  • Develop products that better match the expectations of different customer segments

  • Improve retail execution with a view to optimizing performance in overcrowded retail spaces where thousands of products vie for consumers’ attention

  • Optimize marketing efforts through highly efficient budget allocation, more relevant promotions, better conceived and targeted campaigns etc.

3.2 Methods

We will use regression techniques that model a numerical response variable (sales, in our case) from a given set of predictors. Several models will be implemented and these will be chosen for their effectiveness in modeling data with the features and characteristics of our dataset.

Because our primary goal is inferential, that is, our priority is to understand how sales change as a function of our inputs, we will focus on building relatively simple, interpretable models. Although developments in the statistical learning field have produced powerful, sophisticated modeling techniques, those tend to sacrifice interpretability for predictive accuracy. However, we note the recent arrival of analytic frameworks that seek to explain the predictions of black box models. These will be used as appropriate.

3.3 Performance Evaluation and Model Selection

Models are evaluated on relevant assessment statistics and on their interpretability.

The final model is selected based on an honest assessment of performance on held-out data samples, using cross-validation.
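As an illustration of the selection procedure, the sketch below runs 5-fold cross-validation for a simple linear model in base R. The data are synthetic, and the model formula, the number of folds and the use of RMSE as the assessment statistic are assumptions made for the example, not the choices made in this study.

```r
# Synthetic data: log sales driven by price and payout (values are made up)
set.seed(1)
n <- 100
games <- data.frame(
  price  = sample(c(1, 2, 5, 10, 20), n, replace = TRUE),
  payout = runif(n, 0.55, 0.80)
)
games$log_sales <- 10 + 0.15 * games$price + 4 * games$payout + rnorm(n, sd = 0.5)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each game

# Fit on k - 1 folds, then score on the fold that was held out
rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(log_sales ~ price + payout, data = games[fold != i, ])
  pred <- predict(fit, newdata = games[fold == i, ])
  sqrt(mean((games$log_sales[fold == i] - pred)^2))
})

cv_rmse <- mean(rmse)  # average out-of-sample error across the folds
```

Competing models can be compared on `cv_rmse` computed with the same fold labels, so that each is scored on identical held-out games.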

4 Dataset Details

4.1 Sources

This analysis is based on a dataset made up of sales and product data relating to scratch tickets sold in the following 7 North American lottery jurisdictions:

  • Florida
  • North Carolina
  • Nebraska
  • Virginia
  • Texas
  • West Virginia
  • California

The initial dataset includes 1178 unique scratch tickets in market between 01/03/2015 and 05/26/2018.

The codebook that accompanies this report describes each of the variables included in the dataset.

Data are drawn from:

  • State Lotteries

  • La Fleur’s magazine

  • Public sources identified in the codebook

4.2 Features

sales, the response variable, is our outcome of interest. These are weekly sales. The numbers are not observed directly but are derived by combining information about the number of winning tickets validated at retail locations with known odds of winning a prize for each game. These “sales” figures are therefore imprecise and include a time lag between the day of purchase and the day the sale is recorded.
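The derivation can be sketched as follows. The exact formula used by the data provider is not documented here, so the function below is an assumption: it treats the overall odds as "1 in N" and takes each validated winning ticket to stand for roughly N tickets sold. The function name and the input values are hypothetical.

```r
# Assumption: overall odds are "1 in N", so one validated winner
# represents about N tickets sold.
estimate_sales <- function(validated_winners, odds, price) {
  tickets_sold <- validated_winners * odds  # estimated number of tickets sold
  tickets_sold * price                      # estimated sales in dollars
}

# Hypothetical week: 10,000 validated winners on a $5 game with odds of 1 in 4
weekly_sales <- estimate_sales(validated_winners = 10000, odds = 4, price = 5)
```

Because winners can validate a ticket days after buying it, an estimate of this kind is attributed to the validation week rather than the purchase week, which is the source of the time lag noted above.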

Each sales value represents the sum of the first 13 weeks of sales for each case (scratch ticket) in the dataset. This period, approximately a quarter, is commonly used in the industry to compare sales performance between tickets. 13 weeks roughly captures the lifecycle of most games. The idea is to ensure “fair” comparisons given that games are in market for varying periods. Using total lifetime sales would include the long tail of declining sales that is often observed for tickets that are maintained in market for long periods of time.

All the predictors in our dataset are related to product characteristics. See table 1.

4.3 Data Challenges

We can safely assume that sales are also associated with non-product variables. Advertising and promotion are obvious examples.

Analyses performed in a limited feature space that excludes relevant variables will likely restrict the predictive performance and explanatory power of any statistical model. Nevertheless, the effort remains worthwhile because, at the very least, the results will permit us to confirm or discredit some of the received truths and unproven theories that drive important decisions in the industry. This study should also provide useful guidance for future research and data gathering efforts.

5 Data Preparation

Set Up

R Packages used


Load datasets


Create analysis dataset

The original sales dataset is in a wide format with weekly sales figures for each game laid out horizontally in separate columns. Multiple columns have data that represent the same variable. The dataset is rearranged into a tidier, long format more amenable to data analysis by collapsing all columns with sales data to a single column.
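A minimal base R sketch of this reshaping step is shown below. The two-game table and its sales values are made up for illustration; only the column names `game.id`, `sales.date` and `sales` come from the dataset description.

```r
# Wide layout: one row per game, one column per weekly sales date
wide <- data.frame(
  game.id      = c("CA1154", "CA1155"),
  `01.03.2015` = c(100, 200),
  `01.10.2015` = c(150, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

date_cols <- setdiff(names(wide), "game.id")

# Long layout: one row per game and week, with all sales in a single column
long <- data.frame(
  game.id    = rep(wide$game.id, times = length(date_cols)),
  sales.date = rep(date_cols, each = nrow(wide)),
  sales      = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Weeks in which a game was not on sale appear as rows with a missing `sales` value in the long layout.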


Combine datasets

Three separate datasets with sales, prize structure and information on scratch ticket features need to be combined into a single dataset to facilitate subsequent data exploration and modelling.
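One way to combine the three tables, sketched here in base R, is to merge them successively on the game identifier. The table names, the game TX2001 and all values other than the column names are hypothetical, and the choice of a left join is an assumption.

```r
sales_tbl <- data.frame(game.id = c("CA1154", "CA1155", "TX2001"),
                        sales   = c(1200000, 850000, 430000))
prize_tbl <- data.frame(game.id = c("CA1154", "CA1155", "TX2001"),
                        payout  = c(0.62, 0.68, 0.65))
feature_tbl <- data.frame(game.id = c("CA1154", "CA1155"),
                          theme   = c("Money / Cash", "Numbers"))

# all.x = TRUE keeps every game that has sales, even when a lookup table
# has no matching row (the unmatched columns are filled with NA)
combined <- Reduce(function(x, y) merge(x, y, by = "game.id", all.x = TRUE),
                   list(sales_tbl, prize_tbl, feature_tbl))
```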


A jurisdiction variable is created to identify the state in which each ticket was distributed.
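The game identifiers shown in Table 1 begin with a state code (CA1154 belongs to CA), so one plausible way to build this variable is to take the first two characters of `game.id`. This is an assumption about how the variable was derived; the FL and TX identifiers below are made up.

```r
game.id <- c("CA1154", "FL5012", "TX2001")

# Assumes every identifier starts with a two-letter state code
jurisdiction <- substr(game.id, 1, 2)
```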


Clean data

Imported formatting is cleaned from sales.date, price, sales, start.month, start.year and all columns including prize tier allocation percentages.
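The sketch below shows the kind of cleaning involved. The raw strings are hypothetical, since the exact formatting carried over from the source files is not shown in this report, and `to_number` is a helper introduced for the example.

```r
# Hypothetical raw values; the actual imported formatting may differ
raw_price  <- c("$2", "$5", "$10")
raw_sales  <- c("1,250,300", "980,120", " 45,000 ")
raw_payout <- c("62%", "68%", "73%")

# Strip currency signs, thousands separators, percent signs and spaces
to_number <- function(x) as.numeric(gsub("[$,%[:space:]]", "", x))

price  <- to_number(raw_price)
sales  <- to_number(raw_sales)
payout <- to_number(raw_payout) / 100  # express percentages as proportions
```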


The combined dataset, created after joining 3 tables, includes lottery ticket sales data, product features and prize structure information.

Table 1. Overview of independent variables. Several predictors are recorded in the “wrong” data type and will need to be converted, e.g. dates or ticket prices recorded as character variables.

Variable | Data_Type | First_Values | Description
game.id | character | CA1154, CA1155, CA1156, CA1159, CA1157, CA1158 | Unique identifier
game.name | character | $25,000 Taxes Paid, $250,000 Taxes Paid, $1,000,000 Taxes Paid, Hit $500, Wild 9’s, Tripling Bonus Cashword | Ticket name
price | character | 2, 5, 10, 5, 1, 3 | Ticket retail price
odds | numeric | 4.33, 3.74, 3.29, 4.43, 4.89, 3.34 | Overall odds of winning a prize
start.date | character | 1/25/2015, 1/25/2015, 1/25/2015, 2/22/2015, 2/22/2015, 2/22/2015 | Ticket market launch date
quantity | character | 21283500, 18241605, 15744152, 20095684, 23641500, 23484000 | Number of tickets printed
sales.date | character | 01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015, 01.03.2015 | Date ticket was sold (estimate based on validations)
family | logical | TRUE, TRUE, TRUE, FALSE, FALSE, FALSE | Is the ticket part of a family of games?
license | logical | FALSE, FALSE, FALSE, FALSE, FALSE, FALSE | Is the ticket a licensed property? If so, it is likely a well-known concept and likely advertised
ltp | logical | NA, NA, NA, NA, NA, NA | Is the top prize relatively low? If so, more of the prize fund is allocated to low/mid tier prizes
start.month | character | Jan, Jan, Jan, Feb, Feb, Feb | Month ticket was launched
play.style | character | Key Number Match, Key Number Match, Key Number Match, Key Number Match, Multiple, Crossword | Play style
color | character | Green, , Purple, Green, Black, Red | Predominant ticket color
feature | character | Multiplier - 5x, Multiplier - 10 x, Multiplier - 20 x, Multiplier - 5x, Autowin - Single, Multiplier - Tripler | Additional feature
theme | character | Money / Cash, Money / Cash, Money / Cash, Money / Cash, Numbers, Crossword | Ticket theme
vendor | character | Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Scientific Games Corporation, Pollard Banknote | Print vendor
start.year | numeric | 2015, 2015, 2015, 2015, 2015, 2015 | Year ticket was launched
num.win | integer | 10, 18, 20, 18, 1, 2 | Number of ways to win
payout | numeric | 0.62, 0.68, 0.73, 0.68, 0.57, 0.63 | Percentage of sales allocated to prize fund
breakeven | numeric | 0.08, 0.075, 0.17, 0.062, 0.1, 0.06 | Percentage of sales allocated to breakeven prizes
oneX_two.fiveX | numeric | 0.145, 0.316, 0.274, 0.313, 0.225, 0.45 | Percentage of sales allocated to prizes between 1X and 2.5X of retail price
two.fiveX_fiveX | numeric | 0.202, 0.184, 0.082, 0.202, 0, 0.149 | Percentage of sales allocated to prizes between 2.5X and 5X
fiveX_tenX | numeric | 0.202, 0.186, 0.137, 0.066, 0.357, 0.128 | Percentage of sales allocated to prizes between 5X and 10X
tenX_twentyfiveX | numeric | 0.293, 0.123, 0.182, 0.142, 0.238, 0.155 | Percentage of sales allocated to prizes between 10X and 25X
twentyfiveX_fiftyX | numeric | 0, 0.025, 0, 0, 0.052, 0.036 | Percentage of sales allocated to prizes between 25X and 50X
fiftyX_hundredX | numeric | 0.054, 0, 0.051, 0, 0.023, 0.01 | Percentage of sales allocated to prizes between 50X and 100X
hundredX_TP | numeric | 0.007, 0.01, 0.012, 0, 0, 0.006 | Percentage of sales allocated to prizes between 100X and the top prize
top.prize | numeric | 0.018, 0.081, 0.093, 0.214, 0.004, 0.009 | Percentage of sales allocated to the top prize
jurisdiction | character | CA, CA, CA, CA, CA, CA | State lottery that launched the ticket


Convert predictors to appropriate R data types to facilitate analysis and plotting.


Re-order price factor levels in ascending order.
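A minimal sketch of the type conversion and level re-ordering, using the first values listed in Table 1. The date format string is inferred from those values, and the data frame name `tickets` is introduced for the example.

```r
tickets <- data.frame(
  price      = c("2", "5", "10", "5", "1", "3"),
  start.date = c("1/25/2015", "1/25/2015", "1/25/2015",
                 "2/22/2015", "2/22/2015", "2/22/2015"),
  quantity   = c("21283500", "18241605", "15744152",
                 "20095684", "23641500", "23484000"),
  stringsAsFactors = FALSE
)

tickets$start.date <- as.Date(tickets$start.date, format = "%m/%d/%Y")
tickets$quantity   <- as.numeric(tickets$quantity)

# Sort the levels numerically; alphabetical order would put "10" before "2"
price_levels  <- as.character(sort(unique(as.numeric(tickets$price))))
tickets$price <- factor(tickets$price, levels = price_levels)
```

With the levels in ascending price order, plots and model summaries list the price points from $1 upwards.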

The dataset is restricted to games launched since 2015 to focus the analysis on recently launched tickets.


Rows with missing sales values are removed. These are mostly the result of the earlier transformation reshaping sales data from wide format to a tidier, long format.


Keep only games with at least 13 weeks of sales and extract the first 13 weeks of sales for each ticket.

This excludes seasonal games with short print runs as well as games that may have been prematurely withdrawn from market. 13 weeks captures the typical period from ticket launch to peak sales. The aim is to compare sales performance over the same length of time in market.


Weekly sales are summed to obtain a total 13 week sales amount for each ticket. We are interested in analyzing the absolute performance of the games, not trends or variations apparent in the time series.
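The filtering and summing steps can be sketched in base R as follows. The two games and their weekly sales are synthetic; game "A" has 15 weeks on sale and is kept, game "B" has only 10 and is dropped.

```r
weekly <- data.frame(
  game.id    = rep(c("A", "B"), times = c(15, 10)),
  sales.date = c(as.Date("2015-01-03") + 7 * (0:14),
                 as.Date("2015-02-07") + 7 * (0:9)),
  sales      = c(rep(100, 15), rep(50, 10))
)

weekly <- weekly[order(weekly$game.id, weekly$sales.date), ]

# Week number within each game (1 = launch week) and weeks on sale per game
weekly$week    <- ave(seq_along(weekly$sales), weekly$game.id, FUN = seq_along)
weekly$n_weeks <- ave(weekly$sales, weekly$game.id, FUN = length)

# Keep games with at least 13 weeks, then only their first 13 weeks
first13 <- weekly[weekly$n_weeks >= 13 & weekly$week <= 13, ]

# One total per game
sales_13 <- aggregate(sales ~ game.id, data = first13, FUN = sum)
```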


6 Data Exploration

6.1 Dataset summary


Table 2. Basic dataset characteristics.

rows columns discrete_columns continuous_columns all_missing_columns total_missing_values complete_rows total_observations memory_usage
1222 30 15 15 0 1218 392 36660 396384


The data includes an equal number of quantitative and qualitative predictors. Only about a third of cases are complete (no missing values across all predictors). However, the total number of missing values is relatively small: 3.3% of all observations.
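Both proportions follow directly from the counts in Table 2, as the short calculation below shows. The variable names are introduced for the example.

```r
rows          <- 1222
columns       <- 30
total_missing <- 1218
complete_rows <- 392

pct_missing  <- 100 * total_missing / (rows * columns)  # share of all cells that are missing
pct_complete <- 100 * complete_rows / rows              # share of rows with no missing value
```

The two figures differ so much because a single missing cell is enough to make a whole row incomplete.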


6.2 Missing Values

A closer examination of missing values.

quantity and ltp (low top prize) have a high proportion of values missing.

In addition, the classes of the logical variable ltp are severely unbalanced - 98% of tickets do not have low top prizes. Most algorithms used in the modelling process (Part II - Regression Analysis) will see their performance significantly diminished by one or both of these factors.

We therefore remove ltp from the dataset.


North Carolina, Virginia and West Virginia have no data for ticket quantity. The quantity of tickets ordered is unlikely to directly explain variations in sales.


Table 3. Number of missing values per variable by jurisdiction

jurisdiction quantity family play.style feature theme vendor start.year breakeven
CA 0 0 0 0 0 0 0 1
FL 0 1 1 1 1 1 1 1
NC 174 0 0 0 0 0 0 0
NE 0 6 0 0 0 0 0 0
TX 0 3 3 3 3 3 3 4
VA 215 0 0 0 0 0 0 0
WV 157 3 3 3 3 3 3 4

The quantity predictor therefore has limited value and is removed from the dataset.
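A per-jurisdiction count of missing values, like the one in Table 3, can be produced along the following lines. The five-row data frame is made up; only the variable names come from the dataset.

```r
tickets <- data.frame(
  jurisdiction = c("CA", "CA", "NC", "NC", "VA"),
  quantity     = c(21283500, 18241605, NA, NA, NA),
  family       = c(TRUE, NA, FALSE, TRUE, FALSE)
)

vars <- c("quantity", "family")

# TRUE/FALSE missingness flags, summed within each jurisdiction
na_flags  <- as.data.frame(is.na(tickets[vars]))
na_counts <- aggregate(na_flags,
                       by = list(jurisdiction = tickets$jurisdiction),
                       FUN = sum)

tickets$quantity <- NULL  # drop the predictor once the pattern is confirmed
```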


The 1,222 cases in the dataset break down by jurisdiction as follows:

Table 4. Number of cases by jurisdiction

Jurisdiction Cases
TX 260
VA 215
NC 174
WV 157
NE 150
CA 143
FL 123


6.3 Summary Statistics

Numeric variables

Several features have 0 minimum values. This is to be expected in the case of features relating to prize fund allocation; 0% of the prize fund may be allocated to breakeven prizes, for example. However, 0 minimum values for odds, quantity, payout, num.win raise red flags. Every scratch ticket must have above 0 odds of winning a prize and the proportion of sales paid out to players must certainly be positive.

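Table 5. Descriptive statistics for numeric variables. Values are displayed to three significant figures (the minimum of sales_13, 2,135, appears as 2140), which is why start.year shows as 2020 throughout; the launch years in the dataset run from 2015 to 2018 (see Table 7).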

Overall
(n=1222)
odds
Mean (SD) 4.09 (0.734)
Median [Min, Max] 4.09 [0.00, 9.73]
Missing 20 (1.6%)
start.year
Mean (SD) 2020 (0.997)
Median [Min, Max] 2020 [2020, 2020]
Missing 7 (0.6%)
num.win
Mean (SD) 10.8 (9.31)
Median [Min, Max] 10.0 [0.00, 115]
Missing 15 (1.2%)
payout
Mean (SD) 0.660 (0.0700)
Median [Min, Max] 0.660 [0.00, 0.820]
Missing 7 (0.6%)
breakeven
Mean (SD) 0.160 (0.0545)
Median [Min, Max] 0.157 [0.00, 0.448]
Missing 10 (0.8%)
oneX_two.fiveX
Mean (SD) 0.225 (0.0970)
Median [Min, Max] 0.225 [0.00, 0.917]
Missing 10 (0.8%)
two.fiveX_fiveX
Mean (SD) 0.149 (0.0948)
Median [Min, Max] 0.137 [0.00, 0.703]
Missing 10 (0.8%)
fiveX_tenX
Mean (SD) 0.136 (0.0906)
Median [Min, Max] 0.126 [0.00, 0.714]
Missing 10 (0.8%)
tenX_twentyfiveX
Mean (SD) 0.169 (0.0876)
Median [Min, Max] 0.159 [0.00, 0.444]
Missing 10 (0.8%)
twentyfiveX_fiftyX
Mean (SD) 0.0419 (0.0530)
Median [Min, Max] 0.0220 [0.00, 0.334]
Missing 10 (0.8%)
fiftyX_hundredX
Mean (SD) 0.0439 (0.0478)
Median [Min, Max] 0.0320 [0.00, 0.453]
Missing 10 (0.8%)
hundredX_TP
Mean (SD) 0.0209 (0.0252)
Median [Min, Max] 0.0130 [0.00, 0.237]
Missing 10 (0.8%)
top.prize
Mean (SD) 0.0530 (0.0539)
Median [Min, Max] 0.0350 [0.00, 0.573]
Missing 10 (0.8%)
sales_13
Mean (SD) 17800000 (24900000)
Median [Min, Max] 8940000 [2140, 2.81e+08]


Table 6. Number of cases with 0 values.

odds payout num.win
1 8 55


0 values for odds, payout and num.win are recoded as missing.
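A sketch of the recoding, applied only to the three variables for which zero is not a plausible value. The rows are made up; `breakeven` is included to show that legitimate zeros are left untouched.

```r
tickets <- data.frame(
  odds      = c(4.33, 0, 3.29),
  payout    = c(0.62, 0.68, 0),
  num.win   = c(10, 0, 20),
  breakeven = c(0.08, 0, 0.17)  # a zero allocation is valid here
)

for (v in c("odds", "payout", "num.win")) {
  # which() skips values that are already NA
  tickets[[v]][which(tickets[[v]] == 0)] <- NA
}
```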


Games in market by year.

Table 7. Number of games in market by year in our dataset. The dataset does not cover all of 2018, hence the relatively small number of games for that year. Note that many more games (> 1,500) than are included in this dataset were launched in the US between 2015 and 2018.

Year Games
2015 350
2016 405
2017 373
2018 94


Games by price point

The $5 price point is the most important with the highest number of tickets launched and the most significant contributor to sales.

After $5 tickets, $1 and $2 tickets are most frequently launched, although the higher priced tickets generate more sales.

Note the large sales contribution of $10 and $20 tickets relative to the small number of tickets launched at those price points.


6.4 Data Distributions

Sales

The histogram shows strong right skewness with a high concentration of values at the low end of the data range.


Sample percentiles confirm strong right skewness in the data and highlight the presence of unusually low sales values.

Min Quartile_1 Median Mean Quartile_3 Max
2,135 2,025,878 8,935,066 17,833,058 23,984,254 281,491,560


Log Transformed Sales

The wide range and extreme skewness of the data obscure potentially significant information in the previous sales histogram. Plotting the data on a logarithmic scale reveals previously unseen details. The multi-modality of the distribution suggests the presence of distinct sub groups in the data.
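The effect of the transformation can be seen from the summary values reported above. The sketch below uses base-10 logarithms, which is an assumption; the report does not state which base was used for the plot.

```r
# Minimum, median, mean and maximum of 13-week sales, as reported above
sales_summary <- c(min = 2135, median = 8935066, mean = 17833058, max = 281491560)

log10_summary <- log10(sales_summary)

# Distance between the smallest and largest value, in orders of magnitude
spread <- unname(log10_summary["max"] - log10_summary["min"])
```

On the raw scale the maximum is more than 100,000 times the minimum, so most histogram bins are nearly empty. On the log scale the same data span about five units, which is what lets the separate modes become visible.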

The significant differences in weekly sales between jurisdictions may explain the multimodality observed in the histogram above.

Looking at the sales boxplots, we can roughly divide the jurisdictions into 3 tiers: large jurisdictions (California, Texas and Florida), mid-size (North Carolina and Virginia) and small (West Virginia and Nebraska).

We will focus the following steps of our data exploration on the top 2 tiers.


6.5 Quantitative Variables

6.5.1 Correlations

Correlations between all variable pairs

Correlations between most variable pairs are either non-existent or very weak.

3 pairs can be described as moderately correlated.

Higher payouts are moderately correlated with higher sales. Such tickets also tend to have more ways to win (num.win) built into their prize structures.

These relationships are well understood within the industry and merely serve to validate our data.

Nevertheless, the relative weakness of the associations is very informative.

The pink and grey error bars are the confidence intervals around the correlation coefficients. The bars are grey when they touch the dotted centerline indicating that the true coefficient is not significantly different from 0.
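Confidence intervals of this kind can be obtained in base R with `cor.test`. The sketch below uses synthetic data, and the helper `significant` is introduced for the example; it simply checks whether the interval excludes zero, which corresponds to the pink versus grey distinction in the plot.

```r
set.seed(42)
payout    <- runif(200, 0.55, 0.80)
log_sales <- 14 + 10 * payout + rnorm(200)  # related to payout by construction
noise     <- rnorm(200)                     # unrelated to sales by construction

ct_payout <- cor.test(payout, log_sales)    # Pearson correlation with a 95% interval
ct_noise  <- cor.test(noise, log_sales)

# TRUE when the confidence interval does not contain zero
significant <- function(ct) ct$conf.int[1] > 0 || ct$conf.int[2] < 0

significant(ct_payout)
significant(ct_noise)
```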


Correlations with sales

As noted above, the variables with the strongest positive correlations with sales are price and payout.

Weak correlations between sales and most prize tiers are noteworthy because considerable effort is expended by game designers to calibrate prize tiers in the belief that this is the main factor that explains the performance of a scratch ticket.

These results should be interpreted with caution. Several variables are inter-related. Higher priced tickets also have higher payouts and lower overall odds of winning, for example.

Also, correlations describe linear relationships between variables. Several of these relationships are likely non-linear as we discover below. Part II of this report covers regression modelling where we define more complex models with a view to developing a sophisticated understanding of the relationships between the variables in our dataset.


6.5.2 Sales vs. Odds of Winning

We examine the relationship between a key element of the prize structure of a scratch ticket - the overall odds of winning a prize - and sales performance.

Most lotteries design their scratch ticket portfolios to ensure that overall odds generally decrease as you go up in price point. Odds are quoted as 1 in N, so a lower figure means more frequent wins. The idea is to offer more winning opportunities to players who bet on higher priced tickets. Breakeven prizes are considered wins for these purposes.

There are exceptions - high price tickets with high odds that offer fewer, more substantial wins have recently proved successful.

Low price points $1, $2 and $3

The relationship between sales and odds is non-linear for low priced tickets.

Note, on the $3 chart, the consistently negative curve across most of the data range (between 3 and 4.25).

The $2 chart shows cases with odds between 4.25 and 4.75 separated into 2 clusters, one with high sales and another with weaker sales. It might be useful to perform a separate cluster analysis to understand the features that characterize those sub-groups.

Figure 11 offers some initial insights.

Most of the tickets grouped at the higher end of the sales range are from Florida, a large, top performing lottery.

The relationship between sales and odds is not identical in the different markets. Consider Florida and Texas (also a large, top performing lottery). Texas $2 tickets perform significantly worse than Florida's.

The Florida lottery’s $2 tickets with higher odds generally perform better. The same is not true for Texas.

There are clearly other factors at work.

$5 price point

Figure 12 shows a negative relationship between $5 ticket odds and sales. This suggests that sales of $5 tickets decrease with higher odds.

However, the tight cluster of cases at the bottom right of the plot indicates the possible presence of a distinct sub-group.

Figure 13 shows us that almost all the cases in question are Virginia tickets. This lottery’s $5 tickets have higher odds and lower sales than the other jurisdictions - note that Virginia is a smaller lottery than California or Texas. Those 2 factors combine to accentuate the negative slope of the smooth curve.

The negative relationship between sales and odds of $5 tickets does not seem to hold when we look at each market individually, including Virginia. Indeed, the opposite is true for some jurisdictions - Texas and North Carolina, for example.
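The way a between-market difference can reverse the sign of a pooled trend is easy to reproduce. In the sketch below all numbers are made up: one market has lower odds and high sales, the other has higher odds and low sales, and within each market sales rise slightly with odds.

```r
five <- data.frame(
  jurisdiction = rep(c("TX", "VA"), each = 3),
  odds  = c(3.8, 4.0, 4.2, 4.6, 4.8, 5.0),
  sales = c(30, 31, 32, 5, 6, 7)  # millions of dollars, illustrative only
)

# Slope of a single line fitted to all six tickets
pooled_slope <- coef(lm(sales ~ odds, data = five))[["odds"]]

# Slope fitted separately within each jurisdiction
within_slopes <- sapply(split(five, five$jurisdiction),
                        function(d) coef(lm(sales ~ odds, data = d))[["odds"]])
```

Here `pooled_slope` is strongly negative while both `within_slopes` are positive, which is why the aggregate curve should not be read as the effect of odds within any one market.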


$10 price point

The prevailing thinking among many industry experts is that higher priced tickets with tougher odds are likely to perform better because players attracted to such tickets prefer tougher odds in exchange for the opportunity to win “significant” prizes (approximately > 3X for a $10 ticket).

Figure 14 appears to confirm this.

However, it is noticeable that whilst the overwhelming majority of observations lie within a tight odds range (3.25 to 3.75), the corresponding sales values range widely.

The cases with very high odds (> 6) that seem like outliers are almost certainly tickets with non-traditional prize structures, for example two-tier prize structures with all prizes paying out either $50 or $100. Because the prize fund is fixed (say 70% of revenue for a $10 ticket), the total number of winning tickets is significantly reduced (pushing up the overall odds of winning) to make the structure work. Some players are happy to accept the much harder odds in exchange for the opportunity to win a “meaningful” sum.
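The arithmetic behind this can be worked through with the figures quoted in the text: a $10 ticket returning 70% of sales as prizes. The even split between $50 and $100 prizes in the second case is an assumption made for the example.

```r
price  <- 10    # ticket price in dollars
payout <- 0.70  # share of sales returned as prizes

# Expected prize money per ticket sold is payout * price = $7.
# Case 1: every prize is $50, so the share of winning tickets is 7 / 50.
win_share_50 <- payout * price / 50
odds_50      <- 1 / win_share_50          # about 1 in 7.1

# Case 2: winners split evenly between $50 and $100 (average prize $75)
odds_mix <- 1 / (payout * price / 75)     # about 1 in 10.7
```

Both results sit well above the odds range of 3.25 to 3.75 that contains most $10 tickets, consistent with these games appearing as outliers on the odds axis.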

When we add information about jurisdiction, we see that the positive trend is less pronounced within each individual market.

$20 price point

There is not an easily recognizable pattern in the relationship between $20 ticket odds and sales performance.

With only 56 cases spread across 5 distinct markets, we must be careful in interpreting this plot. The wide confidence interval around the smooth curve is an important reminder.


6.5.3 Sales vs. Payout Ratio

The payout ratio refers to the proportion of sales that is paid out to scratch ticket buyers in the form of cash prizes. As figure 17 shows, payout ratio increases with ticket price. As we noted above, payout is the variable (in our dataset) most correlated with sales. Higher payout tickets typically sell more, on average.

Note that this positive relationship does not hold above $25. There are a couple of possible explanations. $25 tickets are often anniversary or other sorts of special tickets whose sales may be artificially boosted by heavy advertising and promotion. Notice the very small number of cases. $50 tickets are only available in a small number of lotteries such as Texas, and the price point represents a significant spend for a scratch ticket. Only the most committed players are willing to make the purchase.


The Texas cases at the extreme right of the payout range are $50 tickets. Sales for those tickets are lower for the reasons described above.


6.5.4 Sales vs. Prize tier allocations

As previously mentioned, the distribution of the prize fund between prize tiers (and indeed the value of the prize tiers themselves) is increasingly seen as the main objective factor affecting the performance of a ticket in market. This section examines the association between sales and breakeven prizes and between sales and top prizes. These tiers are arguably the most important as they act as key anchor points for the entire prize structure: the top prize is highly visible to players, and the breakeven tier is crucial for controlling the overall odds of winning, a key variable for players.

Breakeven

The graph shows sales gently declining as breakeven allocation increases up to about 15%. Beyond that, sales begin to rise with further increases in the proportion of the prize fund allocated to breakeven prizes. This observation should be interpreted with great caution because the data points represent all the tickets in the dataset and the following caveats apply, as they do to all analyses relating to prize structures:

  • Scratch ticket price points should almost be considered as separate instant game product lines. Each price point will generally comprise tickets of various themes, play styles, features etc. Lotteries will generally take a portfolio approach in designing prize structures and will take care to differentiate them to ensure that structures correspond to player expectations of tickets purchased at different prices. It is therefore useful to examine prize structures globally and price point by price point.

  • Ticket sales volumes vary significantly by jurisdiction and individual state lotteries do not adopt uniform approaches to prize structure design. It is worth exploring sales vs prize structure relationships by individual jurisdiction.




The proportion of the prize fund dedicated to breakeven prizes for low-priced tickets ($1 - $3) ranges between about 10% and 25%.

A clear negative relationship is apparent in the plots. Increasing the allocation to breakeven is associated with lower sales. This runs counter to the widely held view within the industry that low-priced tickets should have high breakeven allocations to maximize the winning experiences (research suggests players feel they have won even when they are merely getting their money back) of the casual/new players who are thought to favour these price points.

This observation does validate the growing view that most players prefer fewer, more meaningful wins; simply getting your money back may not be as satisfying for players as previously thought by scratch ticket marketers.


The proportion of the prize fund dedicated to breakeven prizes for mid-priced tickets ($5 - $10) ranges between about 7.5% and 20%. The range is wider and we can observe some non-breakeven tickets: games with prize structures that completely eschew parity and focus the prize fund on mid-tier prizes (roughly 5X to 10X ticket price). These non-traditional tickets are gaining in popularity with lotteries as they prove successful in market.

The negative association observed with low-priced tickets is less pronounced. The relationship is more complex, certainly at the $10 price point.


Top prize

The top prize is the most visible element of a scratch ticket’s prize structure. The top prize is almost always clearly displayed on the face of the ticket. Indeed this number is often used as a marketing tool and is sometimes even built into the ticket’s name - “$10,000,000 Cash Spectacular!”. Most lottery players play in the hope of winning a life transforming prize. The top prize is usually the only prize tier in the structure large enough to make such dreams come true, improbable as this may be; most games will only have a handful of such prizes.


The plot showing all data points together suggests there is a moderately strong, positive association between the size of the top prize and the sales performance of a ticket.

Figure 23B presents a clearer view of the data by ignoring outliers and zooming in on the section of the plot that concentrates the vast majority of the information.


The price point plots in figure 24 and figure 25 underscore the need to take a more granular view when exploring scratch ticket data. Looking at the data in aggregate (figure 23), we clearly notice a moderate, positive relationship between top prize and sales. However, looking at the data price point by price point we observe the relationship to be at times positive, negative or barely existent. In some cases we can clearly see that the association between the 2 variables may be non-linear. This reinforces the motivation for the application of sophisticated statistical methods that can handle the complexity of these associations. Part II of this study is dedicated to regression modelling on the data explored here.


6.6 Categorical Variables - ANOVA

So far we have examined the numerical variables in our dataset. However, about half the predictors available to us are categorical and those require separate data exploration techniques.


The table and stacked barplot below show the relative frequencies of the categorical variables. We can observe, for example, that the most common ticket color is blue and that the most frequently occurring features are multipliers and doublers. By far the most dominant play style among scratch-off tickets in the dataset is the key number match, where players are invited to match their numbers to winning numbers on the ticket in order to win prizes.

Table 8. Relative frequency of categorical variables

col_name      cnt  common                common_pcnt
color          21  Blue                     19.56284
family          3  FALSE                    78.03279
feature        16  Multiplier - Doubler     19.78142
jurisdiction    5  TX                       28.41530
license         3  FALSE                    93.55191
play.style     20  Key Number Match         61.85792
price          10  5                        29.83607
theme          36  Money / Cash             30.16393
vendor          7  Unknown                  55.95628
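A summary with the same column layout as Table 8 can be sketched in base R. Here `cnt` is read as the number of distinct levels, `common` as the most frequent level and `common_pcnt` as that level's share of observations in percent. The helper `summarise_categorical` and the data frame `toy` are hypothetical names with invented values; this is not necessarily how the table in this report was produced.

```r
# Hypothetical toy data; the column names echo Table 8 but the values are invented.
toy <- data.frame(
  color      = c("Blue", "Blue", "Red", "Green", "Blue", "Red"),
  play.style = c("Key Number Match", "Key Number Match", "Bingo",
                 "Key Number Match", "Crossword", "Key Number Match"),
  stringsAsFactors = FALSE
)

# For each column: number of distinct levels, most frequent level, and
# that level's share of the non-missing observations (in percent).
summarise_categorical <- function(df) {
  rows <- lapply(names(df), function(col) {
    freq <- sort(table(df[[col]]), decreasing = TRUE)
    data.frame(
      col_name    = col,
      cnt         = length(freq),
      common      = names(freq)[1],
      common_pcnt = 100 * as.numeric(freq[1]) / sum(freq),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

cat_summary <- summarise_categorical(toy)
print(cat_summary)
```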


More significantly, we aim to understand the nature and strength of any differential associations between individual groups and sales. We attempt to answer these questions using ANOVA (analysis of variance) methods to test for significant differences in our continuous dependent variable (sales) across the levels of the categorical independent variables (e.g. price, with each price point constituting a group).

For example, do sales vary significantly by price point? Do $5 games typically generate more or less sales revenue than $10 games or games at any other price point and, if so, are the observed differences statistically significant?

6.6.1 Ticket Price

The top bar chart in figure 27 shows that the highest priced tickets generate substantially greater sales revenue on average than the lower priced options. The bottom chart reminds us that the relative frequency of these high priced tickets in the dataset is very low. Lotteries launch far fewer $20+ tickets than they do $2 or $5 tickets.

The marked difference in the sample size of the groups (price points) underlines the need to verify the validity of a key condition required for the reliability of ANOVA results - the variances within each of the groups are assumed to be (roughly) equal.

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   9  40.493 < 2.2e-16 ***
##       905                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
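The logic of the median-centred Levene test can be sketched in base R: take each observation's absolute deviation from its own group median and run a one-way ANOVA on those deviations. The data below are simulated and the names `toy`, `abs_dev` and `levene_fit` are hypothetical; the output above appears to come from a packaged implementation, so this is an illustration of the idea only.

```r
# Simulated, hypothetical data: three "price points" with unequal group
# sizes and unequal spreads.
set.seed(1)
toy <- data.frame(
  price = factor(rep(c(1, 5, 20), times = c(60, 40, 10))),
  sales = c(rlnorm(60, meanlog = 0, sdlog = 0.3),
            rlnorm(40, meanlog = 1, sdlog = 0.6),
            rlnorm(10, meanlog = 2, sdlog = 1.0))
)

# Absolute deviation of each observation from its own group median.
group_median <- ave(toy$sales, toy$price, FUN = median)
toy$abs_dev  <- abs(toy$sales - group_median)

# One-way ANOVA on the deviations: Levene's test with center = median.
levene_fit <- anova(lm(abs_dev ~ price, data = toy))
levene_F   <- levene_fit[["F value"]][1]
levene_p   <- levene_fit[["Pr(>F)"]][1]

cat("F =", levene_F, " p =", levene_p, "\n")
```

A small p-value indicates that the spread of sales differs between groups, which is the condition that undermines a classical ANOVA.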

The unequal variances, combined with the extreme skewness of the data and the presence of significant outliers (see figures 5, 6 and 7), lead us to use a non-parametric test that does not assume normally distributed data. The Kruskal-Wallis test concludes that there are significant differences between the groups. The p-value is far below the 5% cutoff, indicating a very low probability of observing variations this large if there were no differences between the groups.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  sales_13 by as.factor(price)
## Kruskal-Wallis chi-squared = 427.25, df = 9, p-value < 2.2e-16
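As a minimal sketch of what the Kruskal-Wallis statistic measures, the example below runs `kruskal.test` on simulated data and recomputes the statistic by hand from the ranks. The data frame `toy` and its values are hypothetical. The hand calculation omits the tie correction, so it matches only because the simulated values contain no ties.

```r
# Simulated, hypothetical data with three price points.
set.seed(2)
toy <- data.frame(
  price = factor(rep(c(1, 5, 20), times = c(60, 40, 10))),
  sales = c(rlnorm(60, 0, 0.5), rlnorm(40, 1, 0.5), rlnorm(10, 2, 0.5))
)

# Built-in test from the stats package.
kw <- kruskal.test(sales ~ price, data = toy)

# Hand calculation: rank all observations together, then compare the
# rank sums of the groups.
r         <- rank(toy$sales)
n         <- length(r)
rank_sums <- sapply(split(r, toy$price), sum)
group_n   <- lengths(split(r, toy$price))
H <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_n) - 3 * (n + 1)

cat("kruskal.test statistic:", unname(kw$statistic), "\n")
cat("hand-computed H:       ", H, "\n")
```

Because the test works on ranks, a handful of extreme sales figures cannot dominate the result in the way they would in a test based on means.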

The Kruskal-Wallis test tells us only that at least one group differs significantly from the others; it does not tell us which pairs of groups differ.

The non-parametric pairwise Wilcoxon rank sum test calculates pairwise comparisons between the groups. The table of p-values shows the statistical significance of the differences between each pair. We can see, for example, that sales generated by the typical $20 game are significantly different (we presume higher, based on figure 27) from those of $1, $2, $3 and $5 games but not from those of $7 and $10 games. This may be relevant for state lotteries whose instant game portfolios are still overwhelmingly dominated by low priced tickets, usually in the mistaken view that low priced tickets, being more accessible to a larger number of players, will generate higher overall sales.

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  as.numeric(scratch_data_toptiers$sales_13) and scratch_data_toptiers$price 
## 
##    1       2       3       5       7       10      20      25      30     
## 2  2.8e-10 -       -       -       -       -       -       -       -      
## 3  1.9e-10 0.08298 -       -       -       -       -       -       -      
## 5  < 2e-16 < 2e-16 7.7e-07 -       -       -       -       -       -      
## 7  0.11499 0.10652 0.10652 0.10652 -       -       -       -       -      
## 10 < 2e-16 < 2e-16 5.0e-16 2.9e-11 0.10652 -       -       -       -      
## 20 < 2e-16 < 2e-16 6.9e-15 2.9e-11 0.10917 0.08417 -       -       -      
## 25 0.00552 0.00552 0.00569 0.00552 0.52326 0.00569 0.00764 -       -      
## 30 5.8e-06 7.0e-06 2.0e-05 3.1e-05 0.24390 0.00113 0.00744 0.10652 -      
## 50 8.1e-05 9.5e-05 0.00018 0.00552 0.30612 0.52573 0.71209 0.03571 0.04254
## 
## P value adjustment method: BH
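The pairwise procedure and the Benjamini-Hochberg (BH) adjustment reported above can be sketched with base R's `pairwise.wilcox.test` and `p.adjust`. The data are simulated and `toy`, `pw_bh` and `pw_raw` are hypothetical names; the sketch only shows that the reported values are the raw pairwise p-values after a BH adjustment across all pairs.

```r
# Simulated, hypothetical data with three price points (no ties).
set.seed(3)
toy <- data.frame(
  price = factor(rep(c(1, 5, 20), times = c(30, 20, 8))),
  sales = c(rlnorm(30, 0, 0.5), rlnorm(20, 1, 0.5), rlnorm(8, 2, 0.5))
)

# Pairwise Wilcoxon rank sum tests with BH-adjusted p-values.
pw_bh <- pairwise.wilcox.test(toy$sales, toy$price, p.adjust.method = "BH")

# The same comparisons without adjustment, adjusted by hand afterwards.
pw_raw <- pairwise.wilcox.test(toy$sales, toy$price, p.adjust.method = "none")
lower  <- lower.tri(pw_raw$p.value, diag = TRUE)
manual <- p.adjust(pw_raw$p.value[lower], method = "BH")

print(pw_bh$p.value)  # lower-triangular matrix; unused cells are NA
print(manual)
```

The adjustment matters because the number of comparisons grows quickly with the number of groups: ten price points give 45 pairs, so some unadjusted p-values would fall below 5% by chance alone.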


Dunn’s test also identifies which pairs are significantly different from one another. In addition, the sign of each test statistic (column mean rank minus row mean rank) shows the direction of the difference (higher or lower) and its size gives an indication of the magnitude.

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 427.2458, df = 9, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                              (Benjamini-Hochberg)                              
## Col Mean-|
## Row Mean |          1         10          2         20         25          3
## ---------+------------------------------------------------------------------
##       10 |  -15.16835
##          |    0.0000*
##          |
##        2 |  -4.781530   11.28802
##          |    0.0000*    0.0000*
##          |
##       20 |  -13.29149  -1.055345  -10.14171
##          |    0.0000*     0.1725    0.0000*
##          |
##       25 |  -4.575735  -1.365293  -3.721347  -1.054627
##          |    0.0000*     0.1107    0.0002*     0.1682
##          |
##        3 |  -5.658379   8.060847  -1.778457   7.798167   3.310132
##          |    0.0000*    0.0000*     0.0530    0.0000*    0.0009*
##          |
##       30 |  -6.907189  -1.723846  -5.536920  -1.210984   0.247084  -4.817075
##          |    0.0000*     0.0578    0.0000*     0.1374     0.4024    0.0000*
##          |
##        5 |  -13.21820   5.130171  -8.268639   5.190616   2.388291  -4.640317
##          |    0.0000*    0.0000*    0.0000*    0.0000*    0.0131*    0.0000*
##          |
##       50 |  -5.258462  -0.758147  -4.062113  -0.334778   0.680488  -3.467747
##          |    0.0000*     0.2461    0.0001*     0.3773     0.2658    0.0006*
##          |
##        7 |   0.763030   2.617768   1.260896   2.779721   2.969962   1.480201
##          |     0.2506    0.0071*     0.1296    0.0047*    0.0027*     0.0919
## Col Mean-|
## Row Mean |         30          5         50
## ---------+---------------------------------
##        5 |   3.398862
##          |    0.0007*
##          |
##       50 |   0.581231  -2.193448
##          |     0.2936    0.0212*
##          |
##        7 |   3.075574   2.039249   2.729539
##          |    0.0020*     0.0301    0.0053*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2
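To show how the sign of a Dunn statistic can be read, the sketch below computes the z statistic by hand as the difference in mean ranks (column group minus row group) divided by its standard error. The data are simulated, the helper `dunn_z` is hypothetical, and the formula omits the tie correction, so it is an approximation of what a packaged implementation reports rather than a reproduction of it. The one-sided p-value convention is an assumption based on the "Reject Ho if p <= alpha/2" line in the output above.

```r
# Simulated, hypothetical data with three price points (no ties).
set.seed(4)
toy <- data.frame(
  price = factor(rep(c(1, 10, 20), times = c(30, 30, 10))),
  sales = c(rlnorm(30, 0, 0.5), rlnorm(30, 2, 0.5), rlnorm(10, 2.2, 0.5))
)

r         <- rank(toy$sales)
n         <- length(r)
mean_rank <- sapply(split(r, toy$price), mean)
group_n   <- lengths(split(r, toy$price))

# z = (mean rank of column group - mean rank of row group) / standard error.
dunn_z <- function(col, row) {
  se <- sqrt(n * (n + 1) / 12 * (1 / group_n[[col]] + 1 / group_n[[row]]))
  (mean_rank[[col]] - mean_rank[[row]]) / se
}

pairs <- combn(levels(toy$price), 2)
z     <- apply(pairs, 2, function(p) dunn_z(p[1], p[2]))
names(z) <- apply(pairs, 2, paste, collapse = " - ")

# One-sided p-values (assumed convention), then BH adjustment across pairs.
p_raw <- pnorm(abs(z), lower.tail = FALSE)
p_adj <- p.adjust(p_raw, method = "BH")

z_1_vs_10 <- dunn_z("1", "10")
print(round(z, 3))
print(round(p_adj, 4))
```

Under this convention a negative value means the column group ranks lower than the row group. In the output above, for instance, the entry in column 1 and row 10 is negative, which is consistent with $1 games ranking below $10 games on sales.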


We apply the same procedure to other key categorical variables.


6.6.2 Theme

The most common ticket themes are Money/Cash, Multiplier, Numbers and Holiday tickets.


Note that some of the themes with the highest median sales have relatively few observations (Novelty, Annuity, Extended play - other) leading us to refrain from making any general inferences from this information.

We test for homogeneity of variance because of the non-normality of group distributions.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group  34  1.4202 0.05788 .
##       876                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test does not reject homogeneity (equality) of variance at the 5% level (p = 0.058), although the result is borderline.

Multiple pairwise comparison between groups.


6.6.3 Play Style

About 62% of tickets are key number match (table 8). The top 4 play styles make up more than 80% of all cases.


Note the very small sample sizes of some of the top ranked play styles. For example, Poker, the play style with the highest median sales, has just nrow(filter(scratch_data_II, play.style == "Poker")) observations.

Test for homogeneity of variance

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group  18  1.7252 0.03037 *
##       892                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.


6.6.4 Features

Multiplier features are the most frequent and the best performing in the market.


Test for homogeneity of variance

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group  14  7.7524 1.191e-15 ***
##       896                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene test for homogeneity (equality) of variance shows that the group variances are unequal.


Data exploration is an iterative, non-exhaustive process. Whilst we recognize that we could have developed a more complete view of the data with an even more thorough exploratory data analysis, we are satisfied that our current understanding of the data is sufficient for us to proceed with the regression analysis portion of the project. Please see Part II of this report.

7 Appendix

7.1 R Session Information

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1.1      dunn.test_1.3.5     formattable_0.2.0.1
##  [4] car_3.0-3           carData_3.0-2       stringi_1.4.3      
##  [7] gridExtra_2.3       ggthemes_4.2.0      kableExtra_1.1.0   
## [10] knitr_1.28          scales_1.0.0        inspectdf_0.0.5    
## [13] table1_1.1          DataExplorer_0.8.0  lubridate_1.7.4    
## [16] forcats_0.4.0       stringr_1.4.0       dplyr_0.8.3        
## [19] purrr_0.3.2         readr_1.3.1         tidyr_0.8.3        
## [22] tibble_2.1.3        ggplot2_3.2.1       tidyverse_1.2.1    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-139       bitops_1.0-6       webshot_0.5.1      progress_1.2.2    
##  [5] httr_1.4.1         tools_3.6.0        backports_1.1.4    R6_2.4.0          
##  [9] KernSmooth_2.23-15 lazyeval_0.2.2     colorspace_1.4-1   withr_2.1.2       
## [13] tidyselect_0.2.5   prettyunits_1.0.2  curl_4.0           compiler_3.6.0    
## [17] cli_1.1.0          rvest_0.3.4        xml2_1.2.2         labeling_0.3      
## [21] caTools_1.17.1.2   digest_0.6.20      foreign_0.8-71     rmarkdown_1.15    
## [25] rio_0.5.16         pkgconfig_2.0.2    htmltools_0.3.6    highr_0.8         
## [29] htmlwidgets_1.3    rlang_0.4.0        readxl_1.3.1       rstudioapi_0.10   
## [33] generics_0.0.2     jsonlite_1.6       gtools_3.8.1       zip_2.0.3         
## [37] magrittr_1.5       Formula_1.2-3      Rcpp_1.0.2         munsell_0.5.0     
## [41] ggfittext_0.8.1    abind_1.4-5        yaml_2.2.0         plyr_1.8.4        
## [45] grid_3.6.0         parallel_3.6.0     gdata_2.18.0       crayon_1.3.4      
## [49] lattice_0.20-38    haven_2.1.1        hms_0.5.1          zeallot_0.1.0     
## [53] pillar_1.4.2       igraph_1.2.4.1     reshape2_1.4.3     glue_1.3.1        
## [57] evaluate_0.14      data.table_1.12.2  modelr_0.1.5       vctrs_0.2.0       
## [61] networkD3_0.4      cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1  
## [65] xfun_0.9           openxlsx_4.1.0.1   broom_0.5.2        viridisLite_0.3.0 
## [69] ellipsis_0.2.0.1