The article we selected, titled “What digits should you bet on in Super Bowl squares?”, looks at the betting game Super Bowl Squares from a data-driven perspective. The article is organized into three parts: the most common digits and pairs, whether the two teams' digits are independent, and how the digits change over time. Using these three observations, the author shows that looking at the statistics behind the game can give you an advantage.
We use data from all NFL games since 1978. The dataset, available on GitHub from James Every, contains NFL results going back to 1966, but we will only use the games from 1978 to 2014.
library(tidyverse)
theme_set(theme_light())
# Git clone first from git@github.com:devstopfix/nfl_results.git
games <- dir("nfl_results/", pattern = ".csv", full.names = TRUE) %>%
  map_df(read_csv)
# Gather into one tall data frame of scores, and calculate the digit
# The dataset has a couple games from before 1978, but only a few per year
scores <- games %>%
  filter(season >= 1978) %>%
  gather(type, score, home_score, visitors_score) %>%
  mutate(digit = score %% 10)
scores
## # A tibble: 18,148 x 9
## season week kickoff home_team X5 visiting_team type
## <int> <int> <dttm> <chr> <int> <chr> <chr>
## 1 1978 1 1978-09-02 00:00:00 Buccanee… NA Giants home…
## 2 1978 1 1978-09-03 00:00:00 Bears NA Cardinals home…
## 3 1978 1 1978-09-03 00:00:00 Bengals NA Chiefs home…
## 4 1978 1 1978-09-03 00:00:00 Bills NA Steelers home…
## 5 1978 1 1978-09-03 00:00:00 Broncos NA Raiders home…
## 6 1978 1 1978-09-03 00:00:00 Browns NA 49ers home…
## 7 1978 1 1978-09-03 00:00:00 Eagles NA Rams home…
## 8 1978 1 1978-09-03 00:00:00 Falcons NA Oilers home…
## 9 1978 1 1978-09-03 00:00:00 Jets NA Dolphins home…
## 10 1978 1 1978-09-03 00:00:00 Lions NA Packers home…
## # ... with 18,138 more rows, and 2 more variables: score <int>,
## # digit <dbl>
We can see that there are 18,148 football scores from over 9,000 games available to analyze.
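As a quick arithmetic check, scores holds one row per team per game, so halving its row count gives the number of games:

# One score row per team per game, so games = rows / 2
nrow(scores) / 2

## [1] 9074

With that confirmed, we can look at the most common scores.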
scores %>%
  count(score, sort = TRUE)
## # A tibble: 60 x 2
## score n
## <int> <int>
## 1 17 1412
## 2 20 1197
## 3 24 1189
## 4 10 1089
## 5 13 952
## 6 14 946
## 7 27 931
## 8 21 810
## 9 31 781
## 10 23 754
## # ... with 50 more rows
To find the most common scores, we counted the occurrences of each score and sorted them. The most common score is 17, followed by 20, 24, and 10, so we could guess that the most common ending digits will include 7, 0, and 4.
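These digit frequencies can also be tabulated directly before plotting; a quick check using the scores data frame from above:

# Tabulate the ending digits directly, with percentages
scores %>%
  count(digit, sort = TRUE) %>%
  mutate(percent = n / sum(n))

The histogram below shows the same distribution visually.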
ggplot(scores, aes(digit)) +
  geom_histogram(fill = "#8D9093", binwidth = .5) +
  scale_x_continuous(breaks = 0:9) +
  labs(title = "Frequency of final digits in NFL games since 1978")
To get a better view of the most common digit, we plotted a histogram of the frequency of the final digit. We can observe that the most common digit is 7, closely followed by 0, 4, and 3. From the histogram, we can conclude that 2 and 5 are the two digits we really don't want to bet on.
The data from the article showed that two score combinations were the most likely to win this game: 7-0 and 7-4. Because either team can arrive at these scores, the reverse is true as well, so 0-7 and 4-7 are equally common. These digits are not independent of each other, either; the digits the two teams end on can be affected by the same factors. For instance, if a team is playing a highly defensive game, both teams will likely have lower scores. Low scores tend to avoid ending in digits such as 1 or 2, since it is impossible to score just a single point in football and ending a game at exactly two points (a lone safety) is incredibly rare. More realistically, 7 and 3 are the most common point values in football, which means low-scoring games will likely end in 0, 3, 4, or 7.
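We can sanity-check this reasoning against the data by restricting to low-scoring team performances and tabulating their ending digits; the 16-point cutoff below is our own arbitrary choice for illustration:

# Ending digits among low-scoring team performances only;
# the 16-point cutoff is an arbitrary choice for illustration
scores %>%
  filter(score <= 16) %>%
  count(digit, sort = TRUE) %>%
  mutate(percent = n / sum(n))

The heatmap below then looks at how the two teams' digits pair up.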
# Restrict to the same 1978+ window used above
games %>%
  filter(season >= 1978) %>%
  count(home_digit = home_score %% 10,
        visitors_digit = visitors_score %% 10) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(home_digit, visitors_digit, fill = percent)) +
  geom_tile() +
  scale_x_continuous(breaks = 0:9) +
  scale_y_continuous(breaks = 0:9) +
  scale_fill_gradient(high = "#C8102E", low = "white",
                      labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Last digit of home team score",
       y = "Last digit of visitors team score",
       title = "Common pairs of NFL scores",
       fill = "% of games since 1978")
There are other factors that make these digits dependent on each other as well. Weather is another driver of low scores: in the rain the ball is harder to throw and catch, forcing more running plays, fewer yards gained, and therefore fewer points. Common factors like these affect both teams at once, which is what makes the two digits dependent.
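One informal way to quantify this dependence, beyond the heatmap, is a chi-squared test of independence on the home and visitor ending digits. This is a sketch of our own, reusing the games data frame from above; a small p-value is evidence against independence, though the test treats games as independent draws, which is only approximately true:

# Chi-squared test of independence between home and visitor
# ending digits (an informal check, not from the original article)
digit_pairs <- games %>%
  filter(season >= 1978) %>%
  transmute(home_digit = home_score %% 10,
            visitors_digit = visitors_score %% 10)

chisq.test(table(digit_pairs$home_digit, digit_pairs$visitors_digit))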
Looking at how the ending digits change across the dataset, you don't see much movement. Throughout the years, the order and rough percentages of each digit stay about the same, so not much is gained from an overview at this scale.
scores %>%
  count(decade = 10 * season %/% 10,
        digit) %>%
  mutate(digit = reorder(digit, -n)) %>%
  group_by(decade) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(decade, percent, color = digit)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  labs(title = "Most common ending digits of NFL games by decade",
       y = "% of scores ending in this digit")
The original article hypothesizes that strategy has changed over time, and that this affects which scores appear. Looking at the whole dataset, it is apparent that not much has changed, but zooming into shorter windows, such as 1980 to 1990, reveals more drastic changes. Looking into what happened around football in the 80s might give a greater glimpse into why the percentages change so much; a finer-grained view is sketched below.
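As an addition of our own beyond the decade plot, the same breakdown can be done season by season within the 1980s, reusing the scores data frame:

# Ending-digit percentages season by season, zoomed in on the 1980s
scores %>%
  filter(season >= 1980, season <= 1989) %>%
  count(season, digit) %>%
  group_by(season) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(season, percent, color = factor(digit))) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  labs(title = "Ending digits of NFL scores by season, 1980-1989",
       y = "% of scores ending in this digit",
       color = "digit")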
We thought it would be interesting for each of us to take a look at the data and come up with our own unique way of examining and analyzing it. For this reason we have three follow-up additions to the original article.
We wanted to see what the score digits look like for Super Bowl games alone. The sample is much smaller, but these games play out differently: with neither team wanting to lose, they can become far more defensive and produce different results than regular-season games. That is why we chose to look at them, to see how different they would be.
supbowlscores <- read.csv("supbowlscores.csv", header = TRUE)
supbowlscores %>%
  count(home_digit = team1 %% 10,
        visitors_digit = team2 %% 10) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(home_digit, visitors_digit, fill = percent)) +
  geom_tile() +
  scale_x_continuous(breaks = 0:9) +
  scale_y_continuous(breaks = 0:9) +
  scale_fill_gradient(high = "#C8102E", low = "white",
                      labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Last digit of team 1 score",
       y = "Last digit of team 2 score",
       title = "Common pairs of NFL Super Bowl scores",
       fill = "% of games since 1978")
To simplify processing, a custom CSV was created with just the Super Bowl games. This can be found here.
We thought it might be interesting to look at the top ending digit in each state where games were played. Knowing the number one digit overall was 7, it was surprising to see that this is not the case across all states. Several states had 0, the second most common ending digit, as their top digit.
library(mapdata)

states <- map_data("state")

# Join each score to the home team's state (region)
teamloc <- read.csv("team_location.csv", header = TRUE)
games_withlocation <- merge(scores, teamloc, by = "home_team")

# Count ending digits per state, then keep each state's most common digit
digitcounts <- games_withlocation %>%
  count(digit, region, sort = TRUE)
digitagg <- aggregate(n ~ region, digitcounts, max)
digitmax <- merge(digitagg, digitcounts)

# Choropleth of the most common home-team ending digit by state
ggplot() +
  geom_map(data = states, map = states,
           aes(x = long, y = lat, map_id = region),
           fill = "#ffffff", color = "#ffffff", size = 0.15) +
  geom_map(data = digitmax, map = states,
           aes(fill = factor(digit), map_id = region),
           color = "#ffffff", size = 0.15) +
  coord_map() +
  labs(x = "Longitude", y = "Latitude", fill = "Digit",
       title = "Home Team Digits")
The original article shows that you should generally aim for a 7-0 combination when betting; looking at the state the game is being played in can give you another point of advantage.
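For example, a single state's top home-team digit can be pulled from the digitmax table built above; "wisconsin" here is just an illustrative region name, since map_data("state") uses lowercase state names:

# Most common home-team ending digit for one state
digitmax %>%
  filter(region == "wisconsin")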
For this look at the data, we had to build and append a new column, because the original data did not include each team's state, which is key for the map visualization. This data was easy enough to collect: the name of each unique team was found and put in a CSV with its home state. This dataset can be found here.
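As a sketch of how that CSV was seeded, the unique home-team names can be pulled from the games data; the home-state column was then filled in by hand, and the write.csv call is illustrative only:

# Pull the unique home team names to seed the team -> state CSV;
# the state column was then filled in manually
unique_teams <- games %>%
  distinct(home_team) %>%
  arrange(home_team)

# write.csv(unique_teams, "team_location.csv", row.names = FALSE)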
We wanted to figure out during which weeks of the season the digits have the highest frequency.
# Cross-tabulate week of the season against ending digit
week <- scores$week
digit <- scores$digit
tabmm <- table(week, digit)
tabmm
## digit
## week 0 1 2 3 4 5 6 7 8 9
## 1 174 108 27 135 200 36 102 193 74 51
## 2 203 108 42 149 147 41 91 196 64 57
## 3 181 95 29 146 165 35 80 178 54 37
## 4 157 94 32 130 132 37 83 181 70 38
## 5 155 103 28 118 151 35 86 163 56 51
## 6 154 94 26 129 157 43 70 173 57 39
## 7 160 92 38 119 154 26 72 202 62 43
## 8 163 101 30 132 131 50 73 167 63 56
## 9 184 108 37 129 137 38 63 185 59 44
## 10 161 96 34 134 147 36 89 199 61 55
## 11 201 96 30 150 170 39 70 225 47 54
## 12 185 126 36 147 163 36 88 193 59 61
## 13 179 113 40 160 179 52 92 178 69 40
## 14 201 111 34 137 152 38 87 224 69 43
## 15 205 117 37 136 163 48 80 197 68 51
## 16 209 116 36 134 169 39 88 211 59 41
## 17 135 89 29 117 118 24 62 151 59 54
## 18 70 33 11 35 49 10 24 59 17 16
## 19 43 29 7 28 41 15 13 41 18 9
## 20 21 12 3 15 18 3 6 16 9 5
## 21 11 6 1 5 2 3 5 5 8 4
## 22 2 7 2 2 7 1 1 10 0 4
# Standardize the frequencies: distance from the mean in standard
# deviations, so circle size reflects how unusual each count is
df <- as.data.frame(tabmm)
df$stan <- abs((df$Freq - mean(df$Freq)) / sd(df$Freq))

# Plot digit frequency by week: color and size both encode frequency
ggplot(df, aes(week, digit, color = Freq, size = stan)) +
  geom_point(shape = 16) +
  scale_size(range = c(1, 10), guide = "none") +
  scale_color_gradient(low = "#0091ff", high = "#f0650e") +
  theme_minimal() +
  labs(title = "Digit Frequency in Each Week",
       color = "Frequency")
We standardized the frequency of the digits and used the color and size of the circles to show frequency. The closer the color is to deep red and the larger the circle, the higher the frequency of that digit in a given week; the closer to deep blue and the smaller the circle, the lower the frequency.
We can observe that digits 7 and 0 have the reddest and largest circles, followed by digits 4 and 3; the rest are mostly small blue circles. Along the x-axis, most of the deep red, large circles are concentrated in weeks 11, 14, 15, and 16. Based on this result, we would say it is better to play Super Bowl Squares on games in weeks 11, 14, 15, and 16 of each season and to choose digits 7 and 0.
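That visual reading can be checked by sorting the week-by-digit table from above directly:

# Top week-digit combinations by raw count, as a check on the plot
df %>%
  arrange(desc(Freq)) %>%
  head(10)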
In our initial analysis we showed that picking a 7-0 or 0-7 combination is the best choice when playing Super Bowl Squares. 7 is your safest bet whether you look at raw frequency, independence, or historical trends. The digits that follow in sequence, 0, 4, and 3, can be seen in our follow-up visualizations. Looking at just the Super Bowl there is a lot of noise, but the top combinations still revolve around 7, while also including pairs with 0, 5, 4, and 1. And looking at where games are played, while the majority of states still favor 7, certain states favor 0 over 7 for the home team.
This lab went much more smoothly than lab 1. We actively communicated each day, helped each other, and contributed equally. For future labs where we choose our own data, we will aim for data with more variables. We struggled with the follow-up and in the end had to collect some small supplementary datasets for the additional visualizations. This led to new learning experiences, working with maps and creating custom visualizations, but it added more work than we originally planned.