Introduction:

The article we selected, titled “What digits should you bet on in Super Bowl squares?”, looks at the betting game Super Bowl Squares from a data-driven perspective. The article is organized into three parts: the most common digits and pairs, whether the digits are independent, and how the digits change over time. Using these three observations, the author shows that by looking at the statistics behind the game you can gain an advantage.

Analysis:

Most Common Digits

We use data from all NFL games since 1978. The dataset, maintained by James Every, is available on GitHub; it contains NFL results going back to 1966, but we only use the data from the 1978 through 2014 seasons.

library(tidyverse)
theme_set(theme_light())

# Git clone first from git@github.com:devstopfix/nfl_results.git
games <- dir("nfl_results/", pattern = ".csv", full.names = TRUE) %>%
  map_df(read_csv)

# Gather into one tall data frame of scores, and calculate the digit
# The dataset has a couple games from before 1978, but only a few per year
scores <- games %>%
  filter(season >= 1978) %>%
  gather(type, score, home_score, visitors_score) %>%
  mutate(digit = score %% 10)

scores
## # A tibble: 18,148 x 9
##    season  week kickoff             home_team    X5 visiting_team type 
##     <int> <int> <dttm>              <chr>     <int> <chr>         <chr>
##  1   1978     1 1978-09-02 00:00:00 Buccanee…    NA Giants        home…
##  2   1978     1 1978-09-03 00:00:00 Bears        NA Cardinals     home…
##  3   1978     1 1978-09-03 00:00:00 Bengals      NA Chiefs        home…
##  4   1978     1 1978-09-03 00:00:00 Bills        NA Steelers      home…
##  5   1978     1 1978-09-03 00:00:00 Broncos      NA Raiders       home…
##  6   1978     1 1978-09-03 00:00:00 Browns       NA 49ers         home…
##  7   1978     1 1978-09-03 00:00:00 Eagles       NA Rams          home…
##  8   1978     1 1978-09-03 00:00:00 Falcons      NA Oilers        home…
##  9   1978     1 1978-09-03 00:00:00 Jets         NA Dolphins      home…
## 10   1978     1 1978-09-03 00:00:00 Lions        NA Packers       home…
## # ... with 18,138 more rows, and 2 more variables: score <int>,
## #   digit <dbl>

We can see that there are 18,148 football scores from over 9,000 games available to analyze.
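
As a quick sanity check (a minimal sketch, assuming the scores and games data frames created above), each game contributes two scores, so the number of games should be roughly half of 18,148:

# Each game contributes a home score and a visitor score
nrow(scores)

games %>%
  filter(season >= 1978) %>%
  nrow()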

scores %>%
  count(score, sort = TRUE)
## # A tibble: 60 x 2
##    score     n
##    <int> <int>
##  1    17  1412
##  2    20  1197
##  3    24  1189
##  4    10  1089
##  5    13   952
##  6    14   946
##  7    27   931
##  8    21   810
##  9    31   781
## 10    23   754
## # ... with 50 more rows

To find the most common scores, we counted each score and sorted the counts. The most common score is 17, followed by 20, 24, and 10, so we could guess that the most common final digits would be 7, 0, and 4.
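
To check that guess directly (a short sketch using the scores data frame from above), we can look at the final digits of the most common scores:

# Final digits of the ten most common scores
scores %>%
  count(score, sort = TRUE) %>%
  head(10) %>%
  mutate(digit = score %% 10)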

ggplot(scores, aes(digit)) +
  geom_histogram(fill = "#8D9093", binwidth = .5) +
  scale_x_continuous(breaks = 0:9) +
  labs(title = "Frequency of final digits in NFL games since 1978")

To get a better view of the most common digits, we plotted a histogram of the frequency of the final digit. The most common digit is 7, closely followed by 0, 4, and 3. From the histogram we can also conclude that 2 and 5 are the two digits we really don’t want to bet on.
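
To quantify this (a sketch using the same scores data frame), the share of scores ending in each digit makes clear how rarely 2 and 5 appear:

# Share of all scores ending in each digit, least common first
scores %>%
  count(digit) %>%
  mutate(percent = n / sum(n)) %>%
  arrange(percent)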

Are they independent?

The data we used from the article showed that two score combinations were the most likely to win this game: 7-0 and 7-4. Because either team can arrive at these scores, the reverse pairs, 0-7 and 4-7, are equally common. These digits are not independent of each other, either. The digits the two teams end on are shaped by the same factors. For instance, if a team plays a highly defensive game, both teams will likely post lower scores, and low scores tend to avoid digits such as 1 or 2, since scoring exactly one point in football is essentially impossible and finishing with exactly two is very rare. More realistically, 7 and 3 are the most common scoring increments in football, which means low-scoring games will likely end in 0, 3, 4, or 7.
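
One way to check this directly (a rough sketch, assuming the games data frame loaded above) is to compare the observed share of 7-0 games with the share we would expect if the two digits were independent, and to run a chi-squared test over all 100 digit pairs:

# Last digit of each team's score for every game since 1978
digit_pairs <- games %>%
  filter(season >= 1978) %>%
  transmute(home_digit = home_score %% 10,
            visitors_digit = visitors_score %% 10)

# Observed share of 7-0 games vs. the product of the marginal shares
observed <- mean(digit_pairs$home_digit == 7 & digit_pairs$visitors_digit == 0)
expected <- mean(digit_pairs$home_digit == 7) * mean(digit_pairs$visitors_digit == 0)
c(observed = observed, expected = expected)

# Chi-squared test of independence across all digit pairs
chisq.test(table(digit_pairs$home_digit, digit_pairs$visitors_digit))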

games %>%
  count(home_digit = home_score %% 10,
        visitors_digit = visitors_score %% 10) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(home_digit, visitors_digit, fill = percent)) +
  geom_tile() +
  scale_x_continuous(breaks = 0:9) +
  scale_y_continuous(breaks = 0:9) +
  scale_fill_gradient(high = "#C8102E", low = "white",
                      labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Last digit of home team score",
       y = "Last digit of visitors team score",
       title = "Common pairs of NFL scores",
       fill = "% of games since 1978")

There are other factors that make these digits dependent on each other as well. Weather is another driver of low scoring: in rain the ball is harder to throw and catch, forcing more runs, fewer yards gained, and therefore fewer points. Because common factors like these affect both teams at once, the two final digits depend on them together.

Change Over Time

Looking at the change in the ending digit across the whole dataset, you don’t see much movement. Over the years the order and rough percentages of each digit stay the same, so not much is gained from an overview at this scale.

scores %>%
  count(decade = 10 * season %/% 10,
        digit) %>%
  mutate(digit = reorder(digit, -n)) %>%
  group_by(decade) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(decade, percent, color = digit)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  labs(title = "Most common ending digits of NFL games by decade",
       y = "% of scores ending in this digit")

The original hypothesis the article presents is that strategy has changed over time and that this affects which scores appear. Looking at the whole dataset, it is apparent that not much has changed, but drilling into shorter windows, such as 1980 to 1990, reveals more drastic shifts. Looking into what happened around football in the 80s might explain why the percentages change so much there.
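
To zoom in on that period (a sketch using the scores data frame from above), we can plot digit shares by individual season instead of by decade:

# Digit shares by season within the 1980s
scores %>%
  filter(season >= 1980, season < 1990) %>%
  count(season, digit) %>%
  group_by(season) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(season, percent, color = factor(digit))) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Ending digits of NFL scores by season, 1980-1989",
       y = "% of scores ending in this digit",
       color = "digit")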

Follow-up:

We thought it would be interesting for each of us to take a look at the data and come up with our own way of analyzing it. For this reason we have three follow-up additions to the original article.

Super Bowl Games

We wanted to figure out what the scores look like for games that were actually Super Bowls. The sample is much smaller, but these games play differently: with so much at stake they tend to be more defensive, which can produce different scoring than regular games. That is why we chose to look at them, to see how different they would be.

# Custom csv containing only the Super Bowl games (linked below)
supbowlscores <- read.csv("supbowlscores.csv", header = TRUE)

supbowlscores %>%
  count(home_digit = team1 %% 10,
        visitors_digit = team2 %% 10) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(home_digit, visitors_digit, fill = percent)) +
  geom_tile() +
  scale_x_continuous(breaks = 0:9) +
  scale_y_continuous(breaks = 0:9) +
  scale_fill_gradient(high = "#C8102E", low = "white",
                      labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Last digit of team 1 score",
       y = "Last digit of team 2 score",
       title = "Common pairs of NFL Super Bowl scores",
       fill = "% of Super Bowl games since 1978")

To make processing simpler, a custom csv was created with just the Super Bowl games. It can be found here

Top Digits Per State

We thought it might be interesting to look at the top digit in each state where games were played. Knowing the number one digit overall is seven, it was surprising to see that this is not the case across all states. Several states had 0, the second most common ending digit, as their top digit.

library(mapdata)

# State polygons for the map, plus a lookup of each home team's state
states <- map_data("state")
teamloc <- read.csv("team_location.csv", header = TRUE)

# Attach each home team's state (region) to the scores
games_withlocation <- merge(scores, teamloc, by = "home_team")

# Count ending digits per state, then keep each state's most common digit
digitcounts <- games_withlocation %>%
  count(digit, region, sort = TRUE)

digitagg <- aggregate(n ~ region, digitcounts, max)
digitmax <- merge(digitagg, digitcounts)

ggplot() +
  geom_map(data=states, map=states,
           aes(x=long, y=lat, map_id=region),
           fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=digitmax, map=states,
           aes(fill=factor(digit), map_id=region),
           color="#ffffff", size=0.15) +
  coord_map() +
  labs(x="Longitude", y="Latitude", fill="Digit", title="Home Team Digits")

The original article shows that you should generally aim for a 7-0 combination when betting; you can also look at the state where the game is being played to gain another point of advantage.

For this look at the data we had to build and append a new column, because the original dataset did not include each team's state, which is key for the map visualization. This was easy enough to collect: the name of each unique home team was put in a csv along with its home state. This dataset can be found here.
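
For reference, the lookup table only needs two columns: the home_team name used in the NFL dataset and the state name that map_data("state") expects as region. The rows below are illustrative, not the full file, and assume the state names are spelled in the lowercase form map_data("state") uses:

# Illustrative rows only; the real team_location.csv covers every home team
teamloc_example <- tibble::tribble(
  ~home_team, ~region,
  "Packers",  "wisconsin",
  "Bears",    "illinois"
)

# With the full table merged in, the top digit for any state can be looked up
digitmax %>%
  filter(region == "wisconsin")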

Top Digits Per Week

We wanted to figure out in which weeks of the season each digit appears most frequently.

# Cross-tabulate ending-digit counts by week
week <- scores$week
digit <- scores$digit
tabmm <- table(week, digit)
tabmm
##     digit
## week   0   1   2   3   4   5   6   7   8   9
##   1  174 108  27 135 200  36 102 193  74  51
##   2  203 108  42 149 147  41  91 196  64  57
##   3  181  95  29 146 165  35  80 178  54  37
##   4  157  94  32 130 132  37  83 181  70  38
##   5  155 103  28 118 151  35  86 163  56  51
##   6  154  94  26 129 157  43  70 173  57  39
##   7  160  92  38 119 154  26  72 202  62  43
##   8  163 101  30 132 131  50  73 167  63  56
##   9  184 108  37 129 137  38  63 185  59  44
##   10 161  96  34 134 147  36  89 199  61  55
##   11 201  96  30 150 170  39  70 225  47  54
##   12 185 126  36 147 163  36  88 193  59  61
##   13 179 113  40 160 179  52  92 178  69  40
##   14 201 111  34 137 152  38  87 224  69  43
##   15 205 117  37 136 163  48  80 197  68  51
##   16 209 116  36 134 169  39  88 211  59  41
##   17 135  89  29 117 118  24  62 151  59  54
##   18  70  33  11  35  49  10  24  59  17  16
##   19  43  29   7  28  41  15  13  41  18   9
##   20  21  12   3  15  18   3   6  16   9   5
##   21  11   6   1   5   2   3   5   5   8   4
##   22   2   7   2   2   7   1   1  10   0   4
# Standardize the frequency as an absolute z-score, then plot digit frequency
# in each week, mapping frequency to color and the standardized value to size
df <- as.data.frame(tabmm)
aa <- mean(df$Freq)
bb <- sd(df$Freq)
stan <- abs((df$Freq - aa) / bb)

ggplot(df, aes(week, digit, color = Freq)) +
  geom_point(shape = 16, size = 10 * stan) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e") +
  labs(title = "Digit Frequency in Each Week")

We standardized the frequency of each digit and used color and circle size to show it: the closer the color is to deep red and the larger the circle, the higher the frequency of that digit in a given week; the closer the color is to deep blue and the smaller the circle, the lower the frequency.

We can observe that digits 7 and 0 have the reddest and largest circles, followed by digits 4 and 3, while the other digits mostly appear as small blue circles. Looking at the x-axis, the weeks, most of the deep red, large circles are concentrated in weeks 14, 16, 15, and 11. Based on this result, we would say it is better to play Super Bowl Squares in weeks 14, 16, 15, and 11 of the season and to choose digits 7 and 0.
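
That reading can be checked with a quick count (a sketch using the scores data frame from above) of how many scores end in 7 or 0 in each week:

# Weeks with the most scores ending in 7 or 0
scores %>%
  filter(digit %in% c(0, 7)) %>%
  count(week, sort = TRUE) %>%
  head(5)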

Discussion:

Within our initial analysis we showed that picking a 7-0 or 0-7 combination is the best choice when playing Super Bowl Squares. 7 is your safest bet, whether you look at the raw frequencies, the dependence between digits, or the historical trends. The digits that come next, 0, 4, and 3, can be seen in our follow-up visualizations. Looking at just the Super Bowls there is a lot of noise, but the top combinations still revolve around 7, along with pairs involving 0, 5, 4, and 1. And looking at where games are played, while most states still favor 7, certain states do favor 0 over 7 for the home team.

This lab went much more smoothly than lab 1. We actively communicated each day, helped each other, and contributed equally. For future labs where we get to choose our own data we will aim for data with more variables. We struggled with the follow-up and in the end had to collect some small supplementary data to create further visualizations. This led to new learning experiences, working with maps and building custom visualizations, but it added more work than we originally planned.