NFL Statistics & Trends

Dataset Introduction

This NFL dataset contains a wide range of information about NFL teams, games, and seasons. It is based on three separate CSV files that together tell one comprehensive story: details about the stadiums where games are played, the teams’ performance in terms of score spreads and win rates, and weather information for individual games.

# Load the necessary libraries
library(ggplot2)      # For creating visuals
library(dplyr)        # For data manipulation
library(scales)       # For percentage scales on visuals
library(data.table)   # For efficient data reading and manipulation
library(tools)        # For text manipulation
library(tidyr)        # For data tidying

# Set the path
setwd("C:/Users/themi/OneDrive/Documents/DS736/NFL Scores R")

# Read the CSV files using fread() from data.table (I do not see a join that would be beneficial for these visualizations)
nfl_teams <- fread("nfl_teams.csv")
nfl_stadiums <- fread("nfl_stadiums.csv")
spreadspoke_scores <- fread("spreadspoke_scores.csv")

Visualizations

Distribution of NFL Stadium Types (Pie Chart)

This pie chart shows the percentage distribution of NFL stadium types across three categories: Indoor, Outdoor, and Retractable. Outdoor stadiums make up the large majority (76.8%), followed by indoor stadiums (16.2%) and retractable-roof stadiums (7.1%). The labels sum to 100.1% because each is rounded to one decimal place. The caption reports the total number of stadiums analyzed, which is 99.

# Visualization 1: Stadium Type Distribution (Pie Chart)
# Count the frequency of each stadium type
stadium_types <- table(nfl_stadiums$stadium_type)
# Convert the table to a data frame for ggplot
stadium_types_df <- as.data.frame(stadium_types)
# Remove rows with empty stadium types
stadium_types_df <- stadium_types_df[stadium_types_df$Var1 != "", ]

# Renaming the 'Var1' column to 'StadiumType' and capitalizing the first letter
stadium_types_df <- rename(stadium_types_df, StadiumType = Var1) %>%
  mutate(StadiumType = tools::toTitleCase(as.character(StadiumType)))

# Calculate the percentage of each stadium type
stadium_types_df$Percentage <- stadium_types_df$Freq / sum(stadium_types_df$Freq) * 100

# Calculate the total number of stadiums being analyzed
total_stadiums <- sum(stadium_types_df$Freq)

# Generate a pie chart using ggplot
ggplot(stadium_types_df, aes(x = "", y = Freq, fill = StadiumType)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") + # Use polar coordinates for pie chart
  geom_text(aes(label = scales::percent(Percentage/100)), position = position_stack(vjust = 0.5)) +
  labs(title = "Distribution of NFL Stadium Types", 
       fill = "Stadium Type",
       caption = paste("Total number of stadiums analyzed:", total_stadiums)) +
  theme_void() + # Remove axes and background
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0, size = 10))
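
As a quick arithmetic check on the chart labels, the reported percentages are consistent with 76 outdoor, 16 indoor, and 7 retractable stadiums out of 99. These counts are inferred from the percentages above, not read from nfl_stadiums.csv, so the following is only a sketch under that assumption:

```r
# Counts inferred from the reported percentages (an assumption, not values read from the data)
counts <- c(Outdoor = 76, Indoor = 16, Retractable = 7)

# Share of each stadium type, rounded to one decimal place as on the chart labels
shares <- round(counts / sum(counts) * 100, 1)
print(shares)

# Rounded shares do not have to add up to exactly 100
print(sum(shares))
```

Running the same check against the real stadium_types_df$Freq column would confirm whether the inferred counts match the data.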

Average Score Spread by Team in 2021 (Bar Plot with Value Labels)

This bar plot compares the average score spread by team for the 2021 season. The score spread is the average number of points by which a team wins or loses its games. Positive values mark teams that win by that many points on average, and negative values mark teams that lose by that margin on average. A tall positive bar (e.g., 7) suggests strong performance, with the team typically winning by about a touchdown. Because the data is grouped by team_favorite_id, each bar covers only the 2021 games in which that team was the betting favorite.

# Visualization 2: Average Score Spread by Team in 2021 (Bar Graph)
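# Filter and summarize data for the 2021 season
# team_home holds full team names, so map it to a team ID before comparing it with team_favorite_id
# (assumes team_favorite_id uses the IDs in nfl_teams$team_id, keyed by nfl_teams$team_name)
home_ids <- setNames(nfl_teams$team_id, nfl_teams$team_name)

spread_data <- spreadspoke_scores %>%
  filter(schedule_season == 2021, !is.na(score_home), !is.na(score_away)) %>% # Played 2021 games only
  mutate(home_id = unname(home_ids[team_home]),
         # Point margin from the favorite's point of view
         team_spread = ifelse(home_id == team_favorite_id, score_home - score_away, score_away - score_home)) %>%
  group_by(team_favorite_id) %>%
  summarise(AverageSpread = mean(team_spread, na.rm = TRUE)) %>%
  arrange(desc(AverageSpread))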

# Add a new variable for fill color based on AverageSpread
spread_data$color <- ifelse(spread_data$AverageSpread < 0, "brown3", "darkseagreen")

# Plot as a bar chart, ordering the bars from highest to lowest average spread
ggplot(spread_data, aes(x = reorder(team_favorite_id, -AverageSpread), y = AverageSpread, label = round(AverageSpread, 2), fill = color)) +
  geom_bar(stat = "identity", color = "black", width = 0.7) + # Bar color is now determined by the fill aesthetic
  geom_text(nudge_y = ifelse(spread_data$AverageSpread < 0, -1.5, 1.5), color = "black", size = 3, vjust = -0.35) + # Adjust position of text based on sign
  scale_fill_identity() + # Use the colors as they are in the data
  theme_minimal(base_size = 12) + # Use the minimal theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), # Improve X axis text readability
        plot.title = element_text(hjust = 0.5, size = 14),
        legend.position = "none") + # Remove legend
  labs(title = "Average Score Spread by Team in 2021", x = "Team", y = "Average Spread")
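
The plot above groups games by the favorite. If the goal is each team’s average point margin over every game it played, one possible approach is to stack a home-side and an away-side view of each game before averaging. The sketch below uses a made-up three-game schedule with hypothetical teams "A", "B", and "C"; the column names mirror spreadspoke_scores, but the names team_margins and average_margin are illustrative only.

```r
library(dplyr)

# Toy schedule (made-up scores, hypothetical teams)
games <- data.frame(
  team_home  = c("A", "B", "C"),
  team_away  = c("B", "C", "A"),
  score_home = c(24, 17, 10),
  score_away = c(20, 27, 17)
)

# One row per team per game, with the margin seen from that team's side
team_margins <- bind_rows(
  games %>% transmute(team = team_home, margin = score_home - score_away),
  games %>% transmute(team = team_away, margin = score_away - score_home)
)

# Average margin over all games each team played
average_margin <- team_margins %>%
  group_by(team) %>%
  summarise(AverageMargin = mean(margin), games_played = n()) %>%
  arrange(desc(AverageMargin))

print(average_margin)
```

With this layout every team appears in the result regardless of how often it was favored, at the cost of plotting full team names instead of team IDs.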

Home and Away Wins by Team and Season, Since 2000 (Heatmap)

This heatmap shows the number of home and away wins for each NFL team in every season since 2000. Teams are listed on the y-axis and seasons on the x-axis. Color intensity represents the number of wins, with darker shades indicating more wins in that category. The visualization can be used to identify trends in team performance over time and to compare how teams perform at home versus on the road.

# Visualization 3: Home and Away Wins by Team and Season (Since 2000)
# Keep played games from seasons since 2000, then count home and away wins separately.
played_games <- spreadspoke_scores %>%
  filter(schedule_season >= 2000, !is.na(score_home), !is.na(score_away))

# Home wins: grouped by the home team
home_wins <- played_games %>%
  group_by(schedule_season, team = team_home) %>%
  summarise(wins = sum(score_home > score_away), .groups = "drop") %>%
  mutate(win_type = "Home Win")

# Away wins: grouped by the away team, so each win is credited to the team that earned it
away_wins <- played_games %>%
  group_by(schedule_season, team = team_away) %>%
  summarise(wins = sum(score_away > score_home), .groups = "drop") %>%
  mutate(win_type = "Away Win")

# Stack both tables into one long table for plotting
win_counts_by_location <- bind_rows(home_wins, away_wins)

# Create heatmap of wins, one panel per win type.
ggplot(win_counts_by_location, aes(x = schedule_season, y = team, fill = wins)) +
  geom_tile() + # Heatmap tiles
  facet_wrap(~ win_type) + # Separate panels so home and away tiles do not overlap
  scale_fill_gradient(low = "azure1", high = "darkgoldenrod") + # Gradient for win counts
  labs(title = "Home and Away Wins by Team and Season (Since 2000)", x = "Season", y = "Team", fill = "Win Count") +
  theme_minimal() + # Clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5)) # Adjust text

Win Rate Trend of Baltimore Ravens Since 2000 (Line Graph)

The line graph shows the Baltimore Ravens’ win rate, as a percentage of games played, for each season since 2000. Performance fluctuates noticeably, with peaks marking seasons of high win rates and troughs marking less successful ones. The percentage label at each point lets viewers quickly gauge the team’s success in any given season. This graph is helpful for analyzing the stability of, and changes in, a team’s performance over time. The mean of the seasonal win rates is shown in the caption at the bottom left of the chart.

# Visualization 4: Baltimore Ravens Performance Trend
selected_team <- "Baltimore Ravens"

# Calculate win rates for the selected team by season
team_performance_trend <- spreadspoke_scores %>%
  filter(schedule_season >= 2000, !is.na(score_home), !is.na(score_away)) %>% # Only played games in seasons since 2000
  filter(team_home == selected_team | team_away == selected_team) %>% # Filter games involving the selected team
  mutate(win = ifelse((team_home == selected_team & score_home > score_away) | 
                        (team_away == selected_team & score_away > score_home), 1, 0)) %>% # Determine wins (ties count as non-wins)
  group_by(schedule_season) %>%
  summarise(games_played = n(), wins = sum(win), win_rate = wins / games_played * 100)

# Calculate the mean win rate
mean_win_rate <- mean(team_performance_trend$win_rate, na.rm = TRUE)

# Plot win rate trend with mean win rate as caption
ggplot(team_performance_trend, aes(x = schedule_season, y = win_rate)) +
  geom_line(color = "blueviolet", size = 1) + # Line for win rate trend
  geom_point(color = "darkgoldenrod2", size = 2) + # Points for each season's win rate
  geom_text(aes(label = sprintf("%.1f%%", win_rate)), nudge_y = 3, color = "black", size = 3.5) + # Win rate labels
  theme_minimal() + # Minimalistic theme
  labs(title = paste("Win Rate Trend of", selected_team, "Since 2000"), 
       x = "Season", y = "Win Rate (%)", 
       caption = paste("Mean win rate:", sprintf("%.1f%%", mean_win_rate))) + # Adding mean win rate as caption
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # Slant x-axis labels for readability
        plot.title = element_text(hjust = 0.5), # Center the title
        plot.caption = element_text(hjust = 0, vjust = 1)) # Align caption to the left
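
The win flag above treats a tie the same as a loss, whereas NFL standings count a tie as half a win. A minimal sketch of that alternative follows, using made-up results and hypothetical opponents ("Team X", "Team Y", "Team Z"); the column names follow spreadspoke_scores, and the names win_rate_plain and win_rate_ties_half are illustrative only.

```r
library(dplyr)

selected_team <- "Baltimore Ravens"

# Toy results (made-up scores, hypothetical opponents)
games <- data.frame(
  schedule_season = c(2020, 2020, 2020, 2020),
  team_home  = c("Baltimore Ravens", "Team X", "Baltimore Ravens", "Team Z"),
  team_away  = c("Team X", "Baltimore Ravens", "Team Y", "Baltimore Ravens"),
  score_home = c(27, 20, 17, 30),
  score_away = c(13, 20, 10, 24)
)

win_rates <- games %>%
  filter(team_home == selected_team | team_away == selected_team) %>%
  mutate(
    # Scores from the selected team's point of view
    team_score = ifelse(team_home == selected_team, score_home, score_away),
    opp_score  = ifelse(team_home == selected_team, score_away, score_home),
    # 1 for a win, 0.5 for a tie, 0 for a loss
    result = case_when(team_score > opp_score ~ 1,
                       team_score == opp_score ~ 0.5,
                       TRUE ~ 0)
  ) %>%
  group_by(schedule_season) %>%
  summarise(games_played = n(),
            win_rate_plain = mean(team_score > opp_score) * 100,
            win_rate_ties_half = mean(result) * 100)

print(win_rates)
```

Ties are rare, so the two measures will usually agree; the difference only matters in seasons that include one.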

Average Total Game Score vs. Weather Temperature (Scatter Plot with Trend Line)

This scatter plot examines the relationship between total game score and game-time temperature (in degrees Fahrenheit). Each point represents one recorded temperature value: its horizontal position is the temperature, and its vertical position is the average total score (home plus away) of all games played at that temperature. A LOESS-smoothed trend line is fitted to these averages and suggests a slight increase in total score as the temperature rises. The plot can be used to assess whether weather conditions have a noticeable effect on scoring.

# Visualization 5: Total Game Score vs Weather Temperature
# Filter non-missing temperature and scores, calculate total scores, group by temperature, and compute average scores while removing extreme cold temperatures.
weather_scores <- spreadspoke_scores %>%
  filter(!is.na(weather_temperature), !is.na(score_home), !is.na(score_away)) %>% # Remove NA values
  mutate(total_score = score_home + score_away) %>% # Calculate total game score
  group_by(weather_temperature) %>%
  summarise(average_total_score = mean(total_score, na.rm = TRUE)) %>%
  filter(weather_temperature > -50) # Exclude extreme cold temperatures

# Create the plot: Scatter plot of average total scores vs. temperature with color gradient and LOESS smoothed trend line to visualize potential patterns or trends.
ggplot(weather_scores, aes(x = weather_temperature, y = average_total_score)) +
  geom_point(aes(color = average_total_score), alpha = 0.6) + # Colored points for scores
  geom_smooth(method = "loess", color = "cornflowerblue", se = FALSE) + # Trend line
  scale_color_viridis_c(option = "plasma", begin = 0.3, end = 0.9, guide = "none") + # Color scale for points
  theme_minimal() + # Minimal theme for clean look
  labs(title = "Average Total Game Score vs. Weather Temperature",
       x = "Weather Temperature (Fahrenheit)", y = "Average Total Score") + # Labels and title
  theme(plot.title = element_text(hjust = 0.5)) # Center plot title
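
The LOESS curve shows the shape of the relationship but gives no single number for its strength. One way to quantify it is a linear fit and a correlation on per-game rows, sketched below on made-up data. The values in toy_games are illustrative only and are not taken from spreadspoke_scores.

```r
# Toy per-game data (made-up values, for illustration only)
toy_games <- data.frame(
  weather_temperature = c(20, 35, 50, 65, 80),
  total_score         = c(38, 41, 43, 46, 47)
)

# Linear fit: the slope estimates the change in total score per degree Fahrenheit
fit <- lm(total_score ~ weather_temperature, data = toy_games)
slope <- unname(coef(fit)["weather_temperature"])

# Pearson correlation as a rough measure of the strength of the linear relationship
r <- cor(toy_games$weather_temperature, toy_games$total_score)

print(slope)
print(r)
```

On the real data, fitting on individual games instead of per-temperature averages lets temperatures with many games carry more weight than temperatures observed only once.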

Summary of Findings

The analysis uncovers intriguing patterns:

The predominance of outdoor stadiums reflects a traditional approach to football environments. It may also reflect the age of many stadiums, as more recently built stadiums tend to be enclosed or to have retractable roofs.
The score spread analysis reveals the competitive landscape in 2021, with clear front-runners and underperformers. For example, Houston’s average spread was well above that of most other teams in 2021, although each team’s average covers only the games in which it was the betting favorite.
The heatmap highlights the consistency of certain teams over the years and the fluctuating fortunes of others.
The Baltimore Ravens’ win rate trend underscores the ebb and flow of success, with seasons of high triumphs and others of lesser achievement. 2019 appears to be their peak in terms of win rate.
The temperature analysis tentatively suggests that milder weather could be conducive to higher-scoring games, although further investigation would be needed to confirm any causal relationship.

These insights provide a valuable perspective on the dynamics of NFL games and the factors that may influence outcomes, contributing to the broader understanding of the sport’s statistical trends.