This week, you will look individual rows of data, and groups of them. For each row/group, you’ll investigate the probability that row (or group) should occur, given your data. This, in a way, is a kind of anomaly detection.
Your RMarkdown notebook for this data dive should contain the following:
At least 3 “group by” data frames, and an investigation into
each. You’ll need to use categorical columns, or one of the
cut_
functions here.
Use the group_by
function to group your data into
(at least) 3 different sets of groups, each summarizing different
variables.
For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical columns).
Each group in a group_by
dataframe will have a
number of rows associated with it (e.g., if you only group by a single
column, then this is the result of count
). So, if we were
to randomly select a row from your dataset, the smallest groups have a
lowest probability of being selected.
Assign the lowest probability group(s) a special tag, and then translate that back into your original data frame. Draw some conclusions about the groups you’ve found. (I.e., in other words, what does it mean that Group X is the smallest? Can you phrase this in terms of probability?)
Try to draw a testable hypothesis for why some groups are rarer than others (i.e., something quantifiable).
Build at least one visualization for each of these three groupings.
Pick two categorical variables.
(If you can) Find a combination that does not exist in the data. (For example, if you have a column for “color” and a column for “size”, you might be missing a row that is both “Blue” and “Medium”.) Why do you think these are missing?
Which combinations are the most/least common, and why might that be?
Try (i.e., no need if you can’t figure this one out) to find a way to visualize at least one of these combinations.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
For this weeks data dive I will be using NFL Standings data which comes from Pro Football Reference team standings. I want to investigate Superbowl winning teams, by finding the patterns that explain their success.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(standings)
## # A tibble: 6 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
I want to investigate the group of teams who won a Superbowl vs the group of teams who did not.
First I will create a data frame called superbowl using the group_by function, grouping by the sb_winner categorical variable, and finding the mean of wins, losses, points for, points against, margin of victory, and strength of schedule. I will also count the number of instances of each group
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
superbowl <- standings |>
group_by(sb_winner) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n())
superbowl
## # A tibble: 2 × 8
## sb_winner mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos count_n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 No Superbowl 7.86 8.11 349. 352. -0.247 -0.00744 618
## 2 Won Superbowl 11.8 4.15 404. 283. 7.57 0.265 20
The results show what the average Superbowl winning team looks like vs the average non-Superbowl winning team.
standings |>
ggplot(mapping = aes(x = points_for, y = points_against)) +
geom_point() +
geom_point(x = 348.5356, y = 352.52, color = 'red',size=5,) +
annotate('text', x = 500, y = 200, color = 'red', label = ' Average Non-Superbowl Winning Team', size = 3) +
annotate('text', x = 550, y = 170, color = 'blue', label = ' Average Superbowl Winning Team', size = 3) +
geom_point(x = 404.3, y = 283.3, color = 'blue',size = 5) +
labs(title = 'NFL Points For vs Points Against', x = 'Points For', y = 'Points Against') +
theme_classic()
Looking at the group of Superbowl winners, I noticed that the average number of wins was slightly less than 12. Lets compare the group of teams who won at least 12 games vs Superbowl winners.
First I will create a data frame grouped by teams who won 12+ games, and those who did not.
mean_wins_group <- standings |>
group_by(wins >= 12) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n())
mean_wins_group
## # A tibble: 2 × 8
## `wins >= 12` mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos count_n
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 FALSE 7.18 8.78 336. 360. -1.52 0.0681 545
## 2 TRUE 12.7 3.30 434. 291. 8.90 -0.391 93
To make comparing easier, I want everything to be in the same data frame. I will do this by making each column in each data frame have the same name and then combine rows.
names(superbowl)[1] <- 'Group'
names(mean_wins_group)[1] <- 'Group'
combined_performance <- rbind(superbowl,mean_wins_group)
combined_performance
## # A tibble: 4 × 8
## Group mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos count_n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 No Superbowl 7.86 8.11 349. 352. -0.247 -0.00744 618
## 2 Won Superbowl 11.8 4.15 404. 283. 7.57 0.265 20
## 3 FALSE 7.18 8.78 336. 360. -1.52 0.0681 545
## 4 TRUE 12.7 3.30 434. 291. 8.90 -0.391 93
I am interested in the difference between the average Superbowl winning team, and the average 12+ win team. I will filter by Superbowl winners and 12+ win teams, then find the difference between them.
difference <- combined_performance |>
filter(Group == 'Won Superbowl' | Group == 'TRUE') |>
summarise(across(where(is.numeric), ~ .[1] - .[2]))
difference
## # A tibble: 1 × 7
## mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos count_n
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 -0.849 0.849 -29.2 -7.81 -1.33 0.656 -73
From this result we see that on average a Superbowl winning team will have about 1 less win, or about 1 more loss.
We also can see that the average points scored in a season for a Superbowl winning team is 29 points less than a 12+ win team.
Teams who won the Superbowl had strength of schedule 0.65 points higher than teams who won 12+ games. This could suggests a few things.
Superbowl teams are battle tested, meaning they learn how to win by playing against good teams.
Teams who won 12+ games managed this feat because they played against more inferior opponents.
There are 73 more teams who won 12+ games than teams who went on to win a Superbowl, suggesting that it is harder to win a Superbowl than to win 12+ games.
The biggest question that needs answering is do any of these data points have significance?
My hypothesis is that teams with a higher strength of schedule do have a better chance of winning a Superbowl.
p <- standings |>
ggplot(mapping = aes(x = points_for, y = points_against)) +
geom_point(aes(color = wins)) +
scale_color_gradient(low = 'red', high = 'blue') +
theme_classic()
p <- p +
geom_point(x = 404.3000, y = 283.3000, color = 'black',shape = 0,size = 5) +
geom_point(x = 433.5376, y = 291.1075, color = 'black',shape = 2,size = 5)
p <- p +
annotate('text', x = 550, y = 510, color = 'black', label = ' Average 12+ Win Team', size = 3) +
annotate('text', x = 575, y = 490, color = 'black', label = ' Average Superbowl Winning Team', size = 3)
p <- p + geom_point(x = 485, y = 488, color = 'black',shape = 0,size = 4) +
geom_point(x = 485, y = 508, color = 'black',shape = 2,size = 4)
p <- p + labs(titles = 'NFL Points For Vesus Points Against',
x = 'Points For',
y = 'Points Against')
p
I want to understand how strength of schedule impacts whether a team wins the Superbowl. I want to group each team by different levels of strength of schedule, then count how many Superbowl winning teams are in each.
I will group by strength of schedule by using a cut_ function. I will create a summary of these groups, including the number of Superbowl winners.
sos_group <- standings |>
group_by(cut(strength_of_schedule, breaks = 4)) |>
summarise(
mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n(),
sb_win_count = sum(sb_winner == 'Won Superbowl'),
sb_win_percentile = sb_win_count / count_n
)
sos_group
## # A tibble: 4 × 10
## cut(strength_of_sched…¹ mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (-4.61,-2.38] 9.09 6.89 369 329. 2.49 -2.90
## 2 (-2.38,-0.15] 8.35 7.62 358. 349. 0.534 -1.09
## 3 (-0.15,2.07] 7.79 8.18 347. 352. -0.291 0.829
## 4 (2.07,4.31] 6.65 9.31 321. 362. -2.59 2.76
## # ℹ abbreviated name: ¹`cut(strength_of_schedule, breaks = 4)`
## # ℹ 3 more variables: count_n <int>, sb_win_count <int>,
## # sb_win_percentile <dbl>
Next we plot the number of Superbowl wins within each group. I will color each group by the total number of teams in each group. This will give us a sense of how often teams fall into categories ordinarily.
sos_group |>
ggplot(mapping = aes(x = 1:nrow(sos_group), y = sb_win_count, fill = count_n)) +
geom_col() +
labs(title = 'Strength of Schedule by Superbowl Winners',
x = 'Strength of Schedule',
y = 'Number of Superbowl Wins') +
theme_minimal()
We can notice that the outer columns are colored darker than the inner columns of the chart. This would make sense since it is less likely a team will face all weaker/stronger opponents, and more likely they will face a mix of both.
We can also notice that Superbowl winners have a range of schedule strengths, but there is a spike at the middle high values.
Looking at the counts of Superbowl wins we can see only 1 team won a Superbowl in the highest strength of schedule group. That’s the fewest Superbowl wins out of all the groups.
sb_winners <- standings |>
filter(sb_winner == 'Won Superbowl')
max_schedule <- max(sb_winners$strength_of_schedule)
standings |>
ggplot(mapping = aes(x = team_name, y = strength_of_schedule))+
geom_point(mapping = aes(color = sb_winner)) +
geom_hline(yintercept = max_schedule, color = 'black',linewidth = 0.2) +
scale_fill_brewer(palette = 'Blues') +
theme_minimal() +
labs(title = 'NFL Strength of Schedule From 2000-2019',
y = 'Strength of Schedule',
x = ' Team Name') +
annotate('text', x = 5, y = 3.5, label = 'Highest SOS Tier', color = 'black',size = 2.5) +
coord_flip()
The graph shows that the Broncos were the only team to win a Superbowl with a strength of schedule in the highest tier.
sos_group$sb_win_percentile
## [1] 0.03773585 0.02469136 0.04059041 0.01408451
The probability of this happening again is about a 1% chance.
I hypothesize that the reason so few teams win a Superbowl within the highest strength of schedule tier is because more of these teams fail to make the playoffs, and therefore cannot win the Superbowl.
It is important to note that these groups were cut arbitrarily, and the teams could have been divided into even smaller groups or larger groups.
sos_group_small <- standings |>
group_by(cut(strength_of_schedule, breaks = 8)) |>
summarise(
mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n(),
sb_win_count = sum(sb_winner == 'Won Superbowl'),
sb_win_percentile = sb_win_count / count_n
)
sos_group_small
## # A tibble: 8 × 10
## cut(strength_of_sched…¹ mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (-4.61,-3.49] 9.33 6.67 371 334. 2.3 -4.03
## 2 (-3.49,-2.38] 9.06 6.91 369. 329. 2.51 -2.76
## 3 (-2.38,-1.26] 8.85 7.12 374. 345. 1.83 -1.79
## 4 (-1.26,-0.15] 8.05 7.91 349. 352. -0.217 -0.690
## 5 (-0.15,0.962] 7.69 8.29 348. 355. -0.424 0.339
## 6 (0.962,2.07] 7.93 8.03 346. 348. -0.119 1.46
## 7 (2.07,3.19] 6.93 9.04 327. 356. -1.82 2.49
## 8 (3.19,4.31] 5.76 10.2 299. 379. -5.02 3.61
## # ℹ abbreviated name: ¹`cut(strength_of_schedule, breaks = 8)`
## # ℹ 3 more variables: count_n <int>, sb_win_count <int>,
## # sb_win_percentile <dbl>
sos_group_small |>
ggplot(mapping = aes(x = 1:nrow(sos_group_small), y = sb_win_count, fill = count_n)) +
geom_col() +
labs(title = 'Strength of Schedule by Superbowl Winners',
x = 'Strength of Schedule',
y = 'Number of Superbowl Wins') +
theme_minimal()
The results of smaller groups are similar to that of the original groups. Darker filled columns on the outside, lighter filled columns on the inside. There exists a spike in the high middle portion of strength of schedule. There were no teams that occupied the easiest or hardest strength of schedule groups that went on to win a Superbowl. This makes sense if we consider how often any team would have a schedule like this.
sos_group_small$count_n
## [1] 6 47 89 154 153 118 54 17
If we make the groups larger in size…
sos_group_large <- standings |>
group_by(cut(strength_of_schedule, breaks = 2)) |>
summarise(
mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n(),
sb_win_count = sum(sb_winner == 'Won Superbowl'),
sb_win_percentile = sb_win_count / count_n
)
sos_group_large
## # A tibble: 2 × 10
## cut(strength_of_sched…¹ mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (-4.61,-0.15] 8.48 7.49 360. 346. 0.883 -1.42
## 2 (-0.15,4.31] 7.56 8.41 342. 354. -0.768 1.23
## # ℹ abbreviated name: ¹`cut(strength_of_schedule, breaks = 2)`
## # ℹ 3 more variables: count_n <int>, sb_win_count <int>,
## # sb_win_percentile <dbl>
sos_group_large |>
ggplot(mapping = aes(x = 1:nrow(sos_group_large), y = sb_win_count, fill = count_n)) +
geom_col() +
labs(title = 'Strength of Schedule by Superbowl Winners',
x = 'Strength of Schedule',
y = 'Number of Superbowl Wins') +
theme_minimal()
The color of each group becomes less useful here. There are more Superbowl winning teams with higher strength of schedules. Is this difference significant?
sos_group_large$count_n
## [1] 296 342
sos_group_large$sb_win_percentile
## [1] 0.02702703 0.03508772
Lets re-examine the group of non-Superbowl winners and Superbowl winners. Previously we investigated 12+ win teams because that was the average win total of Superbowl winners. This time, lets look at the average wins of non-Superbowl winning teams. More specifically, I want to investigate teams who have 8 or fewer wins. Have any of these teams won a Superbowl?
low_wins_group <- standings |>
group_by(wins <= 8) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
count_n = n(),
playoff_n = sum(playoffs == 'Playoffs'),
sb_winner_n = sum(sb_winner == 'Won Superbowl'))
low_wins_group
## # A tibble: 2 × 10
## `wins <= 8` mean_wins mean_loses mean_pf mean_pa mean_mov mean_sos count_n
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 FALSE 10.8 5.15 397. 310. 5.40 -0.272 285
## 2 TRUE 5.69 10.3 313. 383. -4.37 0.221 353
## # ℹ 2 more variables: playoff_n <int>, sb_winner_n <int>
low_wins_group$playoff_n
## [1] 232 8
Of the teams who won 8 or fewer games, only 8 of them made the playoffs.
low_wins_group$sb_winner_n
## [1] 20 0
No team, having 8 or fewer wins, has gone on to win a Superbowl.