Assignment 7 - Ethical Web Scraping

Author

Chase Gray

Important Resources

library(tidyverse)  # The tidyverse collection of packages
library(httr)       # Useful for web authentication
library(rvest)      # Useful tools for working with HTML and XML
library(polite)     # Promoting responsible web scraping
library(lubridate)  # Working with dates
library(magrittr)   # Pipe operators
library(dplyr)      # Data manipulation
library(ggplot2)    # for creative visuals

Question to answer

In this report, I want to answer a question many football fans ask: “Statistically, which teams should have been in the playoffs last season?” We all know the Philadelphia Eagles won the Super Bowl, but they may have benefited from factors such as a lack of competition in the NFC outside of a few teams. Some teams posted very good statistics in one category or another but could not piece together other parts of the game. Many statistics can help answer this question, including defensive rating, offensive rating, overall team rating, win/loss percentage, point differential, and the rest of the division.

Data Set

NFL_table <-
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/grayc11_xavier_edu/EWM_QBSexU5Fvpp8BUKWq6YBzlbhgiwzrxt-XThefOaQSQ?download=1")

print(NFL_table)
# A tibble: 32 × 13
   Tm             W     L `W-L%`    PF    PA    PD   MoV   SoS   SRS  OSRS  DSRS
   <chr>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Buffalo B…    13     4  0.765   525   368   157   9.2  -1.1   8.1   7.8   0.3
 2 Miami Dol…     8     9  0.471   345   364   -19  -1.1  -1.9  -3    -3.5   0.4
 3 New York …     5    12  0.294   338   404   -66  -3.9  -0.5  -4.3  -3    -1.4
 4 New Engla…     4    13  0.235   289   417  -128  -7.5  -0.6  -8.1  -6.2  -1.9
 5 Baltimore…    12     5  0.706   518   361   157   9.2   0.6   9.9   8     1.9
 6 Pittsburg…    10     7  0.588   380   347    33   1.9   0.1   2.1  -0.7   2.8
 7 Cincinnat…     9     8  0.529   472   434    38   2.2  -0.8   1.4   5    -3.6
 8 Cleveland…     3    14  0.176   258   435  -177 -10.4   1.2  -9.2  -7.1  -2.1
 9 Houston T…    10     7  0.588   372   372     0   0    -0.7  -0.7  -1.5   0.8
10 Indianapo…     8     9  0.471   377   427   -50  -2.9  -0.7  -3.7  -0.5  -3.2
# ℹ 22 more rows
# ℹ 1 more variable: `AFC or NFC` <chr>

This data set contains 2024 season statistics for all 32 NFL teams: Wins, Losses, Win/Loss Percentage, Points Scored, Points Allowed, Point Differential, Margin of Victory, Strength of Schedule, Simple Rating System, Offensive Simple Rating System, Defensive Simple Rating System, and Conference (the “AFC or NFC” column).

How will it be retrieved?

This data can be harvested from the website Pro Football Reference. The site records and publishes statistics for the National Football League and organizes them into tables that are easy to read and understand. That structure is key to scraping the data programmatically while still obeying the restrictions the site has put in place. Using the rvest package, we can step through the elements of the page and extract the statistics we need to compare NFL teams. We can then convert the values into usable data types and build our own table. After some cleaning and additions to the data set, the analysis is fairly straightforward.
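The retrieval step described above could be sketched roughly as follows. This is a minimal sketch, not the exact code used for this report: the URL, the table ids "AFC" and "NFC", the user agent string, and the helper names parse_standings and scrape_standings are all assumptions made for illustration. The polite functions bow() and scrape() check the site's robots.txt and rate-limit requests before the page is parsed with rvest.

```r
library(polite)  # bow() and scrape() respect robots.txt and rate limits
library(rvest)   # HTML parsing helpers

# Pull one standings table out of an already-parsed page.
# `table_id` is an assumed HTML id, not confirmed by this report.
parse_standings <- function(page, table_id) {
  page %>%
    html_element(paste0("table#", table_id)) %>%
    html_table()
}

# Politely fetch the page, then parse both conference tables.
# The default URL and the ids "AFC"/"NFC" are assumptions for illustration.
scrape_standings <- function(url = "https://www.pro-football-reference.com/years/2024/") {
  session <- bow(url, user_agent = "student project (hypothetical contact)")
  page <- scrape(session)
  list(
    AFC = parse_standings(page, "AFC"),
    NFC = parse_standings(page, "NFC")
  )
}

# Offline check of the parsing step on a tiny hand-written page,
# so the example runs without touching the network.
demo_page <- minimal_html('
  <table id="AFC">
    <tr><th>Tm</th><th>W</th><th>L</th></tr>
    <tr><td>Team A*</td><td>13</td><td>4</td></tr>
    <tr><td>Team B+</td><td>10</td><td>7</td></tr>
  </table>')
demo_table <- parse_standings(demo_page, "AFC")
```

Calling scrape_standings() would make the real request. The live tables may contain extra header rows or differently named ids, so the result would likely need filtering before it matches the table printed above.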

Deeper dive into the data

Thankfully, Pro Football Reference’s website is free of most things that would hinder our data collection, so little to no data cleaning is required. The observed variables, Wins through Point Differential, come directly from game results. The other variables are calculated from those observed statistics. Margin of Victory is points scored minus points allowed, divided by the number of games played. Strength of Schedule is the average Simple Rating System of a team’s opponents. The Simple Rating System measures team quality relative to the average team. The Offensive Simple Rating System compares the team’s offensive quality to the average offense. Similarly, the Defensive Simple Rating System compares the team’s defensive quality to the average defense. The last column, the team’s conference, was added to compare the AFC and NFC and show which is more difficult.

In the team name variable, a “+” or “*” follows the name of each team that made the playoffs. A “+” denotes a team that took one of the 6 wild card spots across both conferences. A “*” denotes a team that took one of the 8 division-winner spots.
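The relationships between these variables can be checked against the first two printed rows of NFL_table. The sketch below is illustrative only: the numeric values are copied from the printed output, while the team labels in the second half ("Team A*" and so on) are hypothetical and only show how the playoff markers could be split from the names. The printed rows are also consistent with SRS being roughly Margin of Victory plus Strength of Schedule, and roughly OSRS plus DSRS, up to rounding.

```r
library(dplyr)
library(stringr)

# Values copied from the first two printed rows of NFL_table.
printed_rows <- tibble(
  W = c(13, 8), L = c(4, 9),
  PF = c(525, 345), PA = c(368, 364),
  MoV = c(9.2, -1.1), SoS = c(-1.1, -1.9),
  SRS = c(8.1, -3.0), OSRS = c(7.8, -3.5), DSRS = c(0.3, 0.4)
)

checks <- printed_rows %>%
  mutate(
    PD_calc = PF - PA,                        # point differential
    MoV_calc = round(PD_calc / (W + L), 1),   # average margin per game
    SRS_from_schedule = MoV + SoS,            # matches SRS up to rounding
    SRS_from_units = OSRS + DSRS              # matches SRS up to rounding
  )

# Hypothetical team labels, used only to show how the markers could be parsed.
teams <- tibble(Tm = c("Team A*", "Team B+", "Team C")) %>%
  mutate(
    playoff_status = case_when(
      str_detect(Tm, "\\*$") ~ "division winner",
      str_detect(Tm, "\\+$") ~ "wild card",
      TRUE ~ "missed playoffs"
    ),
    Tm = str_remove(Tm, "[*+]$")              # drop the trailing marker
  )
```

Comparing MoV_calc with the printed MoV column is a quick way to confirm that the scraped numbers were converted to the right data types.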

Analysis of the data

In the analysis, we will use several visualizations and observations to answer the question posed at the start: “Statistically, which teams should have made the playoffs?”

Visualizations and Analysis

We will compare teams statistically, using visualizations to support the analysis. Common statistics like these are crucial for finding out whether there is any way to tell which teams should have made the playoffs or the championship.

Simple Rating System

To start off the visualizations, we will examine the distribution of the Simple Rating System. This chart shows the distribution of the SRS statistic among all NFL teams for the 2024 season. A large majority of teams sit near the middle of the pack, close to the average NFL team, so this observation makes sense. What we should look for is the number of teams above the average, because these are the teams that should be better than the others. About 8 teams have a large score here, so those teams should be our 8 playoff teams that secured a seed by winning the division. However, there is a wide variety of teams that could potentially take the wild card spots.
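A chart like the one described above could be produced with a histogram. The sketch below is an assumption about how such a chart might be built, not the code behind the original figure. It uses only the SRS values from the ten printed rows of NFL_table, so its shape will differ from the full 32-team distribution; the bin width and the name srs_sample are illustrative choices.

```r
library(ggplot2)

# SRS values from the ten printed rows only; the full table has 32 teams.
srs_sample <- data.frame(
  SRS = c(8.1, -3.0, -4.3, -8.1, 9.9, 2.1, 1.4, -9.2, -0.7, -3.7)
)

# Histogram of SRS with a dashed line at 0, the league-average team.
srs_hist <- ggplot(srs_sample, aes(x = SRS)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.8) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    title = "Distribution of Simple Rating System",
    x = "SRS",
    y = "Number of teams")

# Count how many of the sampled teams rate above the average team.
above_average <- sum(srs_sample$SRS > 0)
```

With the full data set, replacing srs_sample with NFL_table would give the league-wide picture, and above_average would count every team with a positive SRS.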

Win Loss Percentage by Division

From this visualization, we can observe that the conference with the higher win percentage is the NFC. However, there is not much disparity between the two conferences, so neither is lopsided in terms of being able to win. This gives us good insight into the kinds of statistics we need to look at to better understand why a team may have a better chance of getting to the playoffs and possibly even the championship.
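The comparison described above amounts to grouping teams by conference and averaging their win percentage. The sketch below shows one way this could be done. The rows in toy are made-up values for hypothetical teams; only the column names W-L% and AFC or NFC come from the data set, and the values "AFC" and "NFC" inside that column are assumed.

```r
library(dplyr)
library(ggplot2)

# Toy rows with made-up values; only the column names come from the report.
toy <- tibble(
  Tm = c("Team A", "Team B", "Team C", "Team D"),
  `W-L%` = c(0.750, 0.250, 0.600, 0.500),
  `AFC or NFC` = c("AFC", "AFC", "NFC", "NFC")
)

# Average win percentage and team count per conference.
conference_summary <- toy %>%
  group_by(`AFC or NFC`) %>%
  summarise(mean_win_pct = mean(`W-L%`), teams = n(), .groups = "drop")

# One bar per conference.
conference_chart <- ggplot(conference_summary,
                           aes(x = `AFC or NFC`, y = mean_win_pct)) +
  geom_col(fill = "red", color = "black", alpha = 0.8) +
  labs(
    title = "Average Win Percentage by Conference",
    x = "Conference",
    y = "Average win percentage")
```

Because every game has one winner and one loser (ties aside), any gap between the two conference averages can only come from games played between AFC and NFC teams, which is one reason the difference tends to be small.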

Points Scored

NFL_table %>%
  ggplot(aes(x = PF, y = reorder(Tm, PF))) +
  geom_col(fill = "red", color ="black", alpha = 0.8) +
  labs(
    title = "Points Scored By Team",
    x = "Points scored",
    y = "Team")

In this visualization, we can observe that many of the teams at the top of the scoring statistics made the playoffs. Offense may just be the new defense: 9 of the top 10 teams made the playoffs, emphasizing the need for a high-powered offense in the NFL.

Points Allowed

A common belief among NFL fans is that “defense wins championships”. By analyzing the points allowed statistic, we can see whether defense is influential in making the playoffs. Many of the teams that made the playoffs ranked from low to average in points allowed. This supports the theory that defense is important in getting to the playoffs.
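A points-allowed chart can mirror the points-scored chart, with PA on the x-axis. The sketch below is one possible version, not the code behind the original figure. It uses only the PA values from the ten printed rows of NFL_table and labels each bar by its printed row number, since the team names are truncated in the printed output; the fill color is an arbitrary choice.

```r
library(ggplot2)

# PA values from the ten printed rows only; `row` is the printed row number.
pa_sample <- data.frame(
  row = 1:10,
  PA = c(368, 364, 404, 417, 361, 347, 434, 435, 372, 427)
)

# reorder(..., -PA) puts the team that allowed the fewest points at the top.
pa_chart <- ggplot(pa_sample, aes(x = PA, y = reorder(factor(row), -PA))) +
  geom_col(fill = "blue", color = "black", alpha = 0.8) +
  labs(
    title = "Points Allowed By Team",
    x = "Points allowed",
    y = "Printed row")

# Row with the fewest points allowed among the sampled teams.
best_defense_row <- pa_sample$row[which.min(pa_sample$PA)]
```

On the full table, the same call with y = reorder(Tm, -PA) would list every team with the stingiest defense first.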

Looking at both graphics, it is clearly very tough to have a top defense and a top offense in the same year. Still, teams like the Detroit Lions and Minnesota Vikings came close to the best of both worlds. Neither team won the championship, though, so some factor other than scoring and points allowed must also matter once a team reaches the playoffs.
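One way to look at both graphics at once is to rank each team on offense and on defense and keep the teams near the top of both lists. The sketch below is a rough illustration of that idea. It ranks only the ten printed rows of NFL_table against each other, so the ranks are not league-wide, and the cutoff of 3 is an arbitrary threshold chosen for the example.

```r
library(dplyr)

# PF and PA from the ten printed rows; `row` is the printed row number.
scoring_sample <- tibble(
  row = 1:10,
  PF = c(525, 345, 338, 289, 518, 380, 472, 258, 372, 377),
  PA = c(368, 364, 404, 417, 361, 347, 434, 435, 372, 427)
)

ranked <- scoring_sample %>%
  mutate(
    offense_rank = min_rank(desc(PF)),  # 1 = most points scored
    defense_rank = min_rank(PA)         # 1 = fewest points allowed
  )

# `cutoff` is an arbitrary illustrative threshold, not from the report.
cutoff <- 3
balanced <- ranked %>%
  filter(offense_rank <= cutoff, defense_rank <= cutoff)
```

Within this small sample only one row clears both cutoffs, which fits the observation that pairing a top offense with a top defense is rare.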

Strength of schedule

In this bar chart, we can observe the strength of schedule value assigned to each team based on the other teams they played. The split between harder and easier schedules is fairly even, but a few teams clearly had the easiest schedules in the league. A negative score indicates an easier-than-average schedule, and a positive score a harder one. In this category, the Super Bowl winners have one of the lowest scores, meaning they had one of the easiest schedules in the NFL in the 2024 season. Most teams have a higher rating in this category, so it seems the Super Bowl winners had an easier path than other teams.
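The split between easier and harder schedules can be made explicit by labeling each team by the sign of its SoS value. The sketch below is one way this might be done, not the code behind the original chart. It uses only the SoS values from the ten printed rows of NFL_table, so the counts do not describe the whole league, and the labels and column name schedule are illustrative.

```r
library(ggplot2)

# SoS values from the ten printed rows only; `row` is the printed row number.
sos_sample <- data.frame(
  row = 1:10,
  SoS = c(-1.1, -1.9, -0.5, -0.6, 0.6, 0.1, -0.8, 1.2, -0.7, -0.7)
)

# Negative SoS = easier than average, positive SoS = harder than average.
sos_sample$schedule <- ifelse(sos_sample$SoS < 0,
                              "easier than average",
                              "harder than average")

# How many sampled teams fall on each side of zero.
schedule_counts <- table(sos_sample$schedule)

# Bars sorted by SoS and colored by which side of zero they fall on.
sos_chart <- ggplot(sos_sample,
                    aes(x = SoS, y = reorder(factor(row), SoS), fill = schedule)) +
  geom_col(color = "black", alpha = 0.8) +
  labs(
    title = "Strength of Schedule By Team",
    x = "Strength of schedule",
    y = "Printed row")
```

Running the same count on the full table would show directly whether the league leans toward positive or negative schedule ratings.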

Final Analysis

In my final analysis of the data, I want to combine our findings into one short discussion. We observed a lot of variability in each of these variables because there is more to a season than these statistics. Things like coaching, staff, organizations, and even equipment are not being taken into account. However, the data does not lie when it comes to predicting playoff teams. A high-powered offense or a high-powered defense affects the likelihood the most. Having both in the middle of the pack is not good enough. Strength of schedule has a large impact on getting to the playoffs. Realistically speaking, if your strength of schedule is lower, it is a lot easier to make the playoffs. Conference has a slight impact on making the playoffs, but it doesn’t tend to make much of a difference.