The purpose of this week’s data dive is for you to learn a bit more about your data by running summary statistics and basic plots.
Your RMarkdown notebook for this data dive should contain the following:
A numeric summary of data for at least 2 columns of data
For categorical columns, this should include unique values and counts [x]
For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles) [x]
These summaries can be combined
A set of at least 3 novel questions to investigate, informed by the following:
column summaries (i.e., the above bullet)
data documentation
your project’s goals/purpose
Address at least one of the above questions using an aggregation function
Visual summaries (i.e., visualizations) of 2 or more columns of your data
This should include distributions at least
In addition, you should consider trends, correlations, and interactions between variables
Use different channels (e.g., color) to show how categorical variables interact with continuous variables
For each of the above tasks, you must explain to the reader what insight was gathered, why it is significant, and which further questions still need to be investigated.
I will be using NFL Standings data in this week’s data dive.
I first want to import my data, then get a better idea of what I am working with by running head(). This will let me see the first few rows of the data.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(standings)
## # A tibble: 6 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
From the column specification and the first few rows, I can see that the data has 4 character columns (team, team_name, playoffs, sb_winner) and 11 double (numeric) columns.
How many teams are there in the NFL?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
standings |>
distinct(team_name) |>
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 32
How many wins does a team need to be in the top quartile of wins (that is, above the 75th percentile)?
standings |>
pluck('wins') |>
quantile()
## 0% 25% 50% 75% 100%
## 0 6 8 10 16
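Since the question only needs one number, `quantile()` can also be asked for the 75th percentile directly through its `probs` argument. The sketch below is a minimal illustration on a made-up `wins` vector (the values are hypothetical, not taken from `standings`), and it also shows how to pull out the seasons that clear that bar.

```r
# Toy win totals (hypothetical values, not from the standings data)
wins <- c(0, 4, 6, 7, 8, 8, 9, 10, 12, 16)

# Ask for the 75th percentile only, instead of all five default quantiles
q75 <- quantile(wins, probs = 0.75)
q75

# Seasons strictly above the 75th percentile, i.e. the top quarter
top_quarter <- wins[wins > q75]
top_quarter
```

On the real data the same call would be `quantile(standings$wins, probs = 0.75)`.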
What is the average number of wins in a season?
standings |>
pluck('wins') |>
mean() |>
round()
## [1] 8
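The minimum, maximum, and central tendency of a numeric column can also be collected in one table with `summarise()`. This is a minimal sketch on a small made-up tibble; the name `toy` and its values are assumptions for illustration, and only the column name `wins` mirrors the real data.

```r
library(dplyr)

# Toy stand-in for a few rows of `standings` (values are made up)
toy <- tibble(
  team_name = c("A", "B", "C", "D"),
  wins      = c(11, 10, 5, 8)
)

# One row holding the range, central tendency, and spread of `wins`
wins_summary <- toy |>
  summarise(
    min    = min(wins),
    median = median(wins),
    mean   = mean(wins),
    max    = max(wins),
    sd     = sd(wins)
  )
wins_summary
```

Swapping `toy` for `standings` would produce the same summary for the full data set.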
How many times has each team made it to the playoffs, and of those playoff teams, how many won the Super Bowl?
I want to start by counting how many times each team made it to the playoffs.
standings |>
group_by(team_name) |>
filter(playoffs == 'Playoffs') |>
count()
## # A tibble: 32 × 2
## # Groups: team_name [32]
## team_name n
## <chr> <int>
## 1 49ers 6
## 2 Bears 5
## 3 Bengals 7
## 4 Bills 2
## 5 Broncos 9
## 6 Browns 1
## 7 Buccaneers 5
## 8 Cardinals 4
## 9 Chargers 7
## 10 Chiefs 9
## # ℹ 22 more rows
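The same count can be written a little more compactly by filtering first and passing the grouping column to `count()`, which also accepts `sort = TRUE`. The sketch below uses a made-up tibble (team names are real, but the rows are hypothetical) and adds one caveat: a team with no playoff seasons disappears after `filter()`, so summing a logical condition is one way to keep it with a count of zero.

```r
library(dplyr)

# Toy data: hypothetical seasons, not taken from `standings`
toy <- tibble(
  team_name = c("Ravens", "Ravens", "Browns", "Browns", "Steelers", "Steelers"),
  playoffs  = c("Playoffs", "Playoffs", "No Playoffs", "Playoffs",
                "No Playoffs", "No Playoffs")
)

# Filter to playoff seasons, then count per team, largest first
playoff_counts_toy <- toy |>
  filter(playoffs == "Playoffs") |>
  count(team_name, sort = TRUE)
playoff_counts_toy

# Alternative that keeps teams with zero playoff seasons
all_teams_toy <- toy |>
  group_by(team_name) |>
  summarise(n = sum(playoffs == "Playoffs"))
all_teams_toy
```

In the real data every one of the 32 teams has at least one playoff season, so both versions would return 32 rows there.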
I want to plot this to better understand how each team performed.
playoff_counts <- standings |>
group_by(team_name) |>
filter(playoffs == 'Playoffs') |>
count()
# There are too many teams to display at once, so I will only display teams above the 75th percentile of playoff counts (more than 9 appearances).
playoff_counts |>
pluck('n') |>
quantile()
## 0% 25% 50% 75% 100%
## 1.00 4.75 7.00 9.00 17.00
high_playoff_counts <- playoff_counts |>
filter(n > 9)
high_playoff_counts |>
ggplot(aes(x = reorder(team_name, n), y = n)) +
geom_col(fill = 'blue') +
theme_minimal() +
labs(title = "Top NFL Team Playoff Appearances",
x = "Team",
y = "Number of Playoff Appearances") +
coord_flip()
Next, I want to find the number of Super Bowl wins for each team. I will use a similar process to the one I used to find the playoff counts.
superbowl_wins <- standings |>
group_by(team_name) |>
filter(sb_winner == 'Won Superbowl') |>
count() |>
arrange(desc(n))
Next, I want to visualize the number of Super Bowl wins alongside the number of playoff appearances for each Super Bowl-winning team.
To do this, I need to combine the playoff counts with the Super Bowl win counts.
playoff_counts |>
inner_join(superbowl_wins, by = 'team_name') |>
ggplot(aes(y = reorder(team_name, n.x))) +
geom_col(mapping = aes(x = n.x), fill = 'blue') +
geom_col(mapping = aes(x = n.y), fill = 'red') +
theme_minimal() +
labs(title = "NFL Playoff Appearances and Super Bowl Wins",
y = "Team",
x = "Number of Playoff Appearances (Blue) and Super Bowl Wins (Red)")
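An alternative to overlaying two bar layers is to reshape the joined table to long format and map the measure to `fill`, so ggplot2 draws side-by-side bars and builds the legend itself. This is only a sketch under assumptions: `joined_toy`, its column names, and its counts are made up for illustration and stand in for the joined playoff and Super Bowl table.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy counts (made-up numbers standing in for the joined table)
joined_toy <- tibble(
  team_name           = c("A", "B", "C"),
  playoff_appearances = c(17, 9, 6),
  super_bowl_wins     = c(6, 1, 2)
)

# One row per team and measure, so the measure can drive the fill color
long_toy <- joined_toy |>
  pivot_longer(
    cols      = c(playoff_appearances, super_bowl_wins),
    names_to  = "measure",
    values_to = "count"
  )

# Dodged bars: one bar per measure within each team, with an automatic legend
p <- long_toy |>
  ggplot(aes(x = count, y = reorder(team_name, count), fill = measure)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(playoff_appearances = "blue",
                               super_bowl_wins     = "red")) +
  theme_minimal() +
  labs(x = "Count", y = "Team", fill = "")
p
```

With a legend in place, the axis label no longer has to explain which color means what.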
What range of years does ‘standings’ capture?
standings |>
pluck('year') |>
range()
## [1] 2000 2019
As a Baltimore Ravens fan, I remember 2000 and 2019 as two very memorable seasons. In 2000 (and again in 2012), the Baltimore Ravens won the Super Bowl…
standings |>
filter(team == 'Baltimore') |>
filter(sb_winner == 'Won Superbowl') |>
pluck('year')
## [1] 2000 2012
… and in 2019, Lamar Jackson, the Ravens' quarterback, won the Most Valuable Player award. Many fans believed the team would go on to win the Super Bowl. However, the Ravens' season was cut short in their first playoff game, leaving fans very disappointed.
standings |>
filter(team == 'Baltimore') |>
filter(year == 2019) |>
pluck('playoffs')
## [1] "Playoffs"
standings |>
filter(team == 'Baltimore') |>
filter(year == 2019) |>
pluck('sb_winner')
## [1] "No Superbowl"
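Both lookups can be done in one pipeline by giving `filter()` several conditions and keeping the columns of interest with `select()`. The sketch below runs on a small made-up tibble; `toy` and its rows are assumptions for illustration, with column names chosen to mirror `standings`.

```r
library(dplyr)

# Toy rows shaped like `standings` (values are for illustration only)
toy <- tibble(
  team      = c("Baltimore", "Baltimore", "Cleveland"),
  year      = c(2018, 2019, 2019),
  playoffs  = c("Playoffs", "Playoffs", "No Playoffs"),
  sb_winner = c("No Superbowl", "No Superbowl", "No Superbowl")
)

# Comma-separated conditions in filter() are combined with AND;
# `year` is numeric, so it is compared with a number rather than a string
ravens_2019 <- toy |>
  filter(team == "Baltimore", year == 2019) |>
  select(year, playoffs, sb_winner)
ravens_2019
```

This returns a one-row tibble with both answers side by side instead of two separate vectors.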
Let's further examine the Baltimore Ravens in both of their Super Bowl seasons and in Lamar Jackson's 2019 MVP season.
good_ravens_seasons <- standings |>
filter(team == 'Baltimore') |>
filter(year %in% c(2000, 2012, 2019))
Let's compare these three teams by visualizing points_for and points_against.
good_ravens_seasons |>
ggplot(mapping = aes(x = points_for, y = points_against, color = as.factor(year))) +
geom_point() +
labs(title = 'Baltimore Ravens: Points For and Against in Memorable Seasons', color = "Year", x = "Points For", y = "Points Against") +
theme_minimal()
The 2019 Baltimore Ravens scored more points than both of their Super Bowl-winning teams and conceded fewer points than the 2012 team. Let's look at more than just the Ravens and see how offense (points for) and defense (points against) across the league relate to making the playoffs and winning the Super Bowl.
standings |>
ggplot(mapping = aes(x = points_for, y = points_against, color = factor(playoffs))) +
geom_point() +
scale_color_manual(values = c('red', 'blue')) +
labs(title = 'NFL Points For vs. Points Against: What It Takes to Make the Playoffs', color = '', x = "Points For", y = "Points Against") +
theme_minimal()
standings |>
ggplot(mapping = aes(x = points_for, y = points_against, color = factor(sb_winner))) +
geom_point() +
scale_color_manual(values = c('red', 'blue')) +
labs(title = 'NFL Points For vs. Points Against: What It Takes to Win a Super Bowl', color = 'Super Bowl Winner', x = "Points For", y = "Points Against") +
theme_minimal()
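A grouped aggregation can put numbers next to what the scatter plots show visually. The sketch below computes mean points for and against per playoff status on a made-up tibble; `toy` and its values are hypothetical, so the result says nothing about the real league, and only the column names mirror `standings`.

```r
library(dplyr)

# Toy seasons (made-up point totals, not from the standings data)
toy <- tibble(
  playoffs       = c("Playoffs", "Playoffs", "No Playoffs", "No Playoffs"),
  points_for     = c(400, 420, 300, 320),
  points_against = c(300, 320, 380, 400)
)

# One row per playoff status with the season count and average scoring
by_playoffs <- toy |>
  group_by(playoffs) |>
  summarise(
    seasons             = n(),
    mean_points_for     = mean(points_for),
    mean_points_against = mean(points_against)
  )
by_playoffs
```

Running the same pipeline on `standings`, or grouping by `sb_winner` instead, would give a numeric companion to each plot.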