In this exercise, we analyze a dataset of historical game scores for the Los Angeles Chargers (previously the San Diego Chargers) of the NFL. The idea of using this dataset came from my post on the Week 5 Discussion Board. The source of the dataset can be found at:
https://www.scoreboard.com/team/los-angeles-chargers/K0iU7PXK/results/
The dataset includes game scores from the 2008 season through 9/30/2018, a time span that includes 220 games played during 10 complete seasons (2008 through 2017) and an early portion of the 2018 season. For each game the data includes the date, the opposing team, an indicator whether the game was played at home (in LA or San Diego) or away, the final score, and an indicator whether the game was played in the pre-season, regular season, or post-season.
In analyzing the data, I was interested in addressing three questions:
The Chargers game score data is available at the URL above. In addition, to address the question of the Charger’s performance playing in warm weather vs. cold weather, we need temperature data by city. To do an accurate analysis, we should use the actual historical temperature at game time for each game, which we could pull from a source like the National Weather Service (https://www.weather.gov/). However, to keep things simple, let’s use an average temperature by city, which we can pull from the US Climate Data website (https://www.usclimatedata.com/). I chose to use average November temperatures, since that’s roughly the mid-to-end month of the NFL regular season (typically September through December).
Both datasets are saved as CSV files on GitHub:
chargers_scores: Chargers game score data
nov_temp: Average daily temperature in the month of November by city
First, read in the data from the CSV files stored on GitHub. Note that several columns in the game score data frame are all NA, and several rows are NA. We remove the NA columns and rows in this step, along with renaming the remaining columns.
library(tidyverse)
library(lubridate)
library(knitr)
file <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/chargers_scores.csv"
raw <- read_csv(file, col_names = FALSE)
raw
## # A tibble: 244 x 7
## X1 X2 X3 X4 X5 X6 X7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> "StandingsUSA:\x~ <NA> <NA> <NA> <NA> <NA>
## 2 <NA> Sep 30, 04:25 PM San Franci~ "@\xa0Los Ang~ "27\xa0~ <NA> <NA>
## 3 <NA> Sep 23, 04:05 PM Los Angele~ "@\xa0Los Ang~ "23\xa0~ <NA> <NA>
## 4 <NA> Sep 16, 01:00 PM Los Angele~ "@\xa0Buffalo~ "31\xa0~ <NA> <NA>
## 5 <NA> Sep 09, 04:05 PM Kansas Cit~ "@\xa0Los Ang~ "38\xa0~ <NA> <NA>
## 6 <NA> "StandingsUSA:\x~ <NA> <NA> <NA> <NA> <NA>
## 7 <NA> Aug 30, 10:00 PM Los Angele~ "@\xa0San Fra~ "23\xa0~ <NA> <NA>
## 8 <NA> Aug 25, 08:00 PM New Orlean~ "@\xa0Los Ang~ "36\xa0~ <NA> <NA>
## 9 <NA> Aug 18, 10:00 PM Seattle Se~ "@\xa0Los Ang~ "14\xa0~ <NA> <NA>
## 10 <NA> Aug 11, 10:00 PM Los Angele~ "@\xa0Arizona~ "17\xa0~ <NA> <NA>
## # ... with 234 more rows
# remove columns that are all NA & remove rows that are NA
all(is.na(c(raw$X1, raw$X6, raw$X7)))
## [1] TRUE
df1 <- raw %>% select(2:5) %>% filter(is.na(X3) == FALSE)
# rename columns
names(df1) <- c("Date", "Away", "Home", "Score")
df1
## # A tibble: 209 x 4
## Date Away Home Score
## <chr> <chr> <chr> <chr>
## 1 Sep 30, 04:25 ~ San Francisco 49~ "@\xa0Los Angeles Char~ "27\xa0:\xa0~
## 2 Sep 23, 04:05 ~ Los Angeles Char~ "@\xa0Los Angeles Rams" "23\xa0:\xa0~
## 3 Sep 16, 01:00 ~ Los Angeles Char~ "@\xa0Buffalo Bills" "31\xa0:\xa0~
## 4 Sep 09, 04:05 ~ Kansas City Chie~ "@\xa0Los Angeles Char~ "38\xa0:\xa0~
## 5 Aug 30, 10:00 ~ Los Angeles Char~ "@\xa0San Francisco 49~ "23\xa0:\xa0~
## 6 Aug 25, 08:00 ~ New Orleans Sain~ "@\xa0Los Angeles Char~ "36\xa0:\xa0~
## 7 Aug 18, 10:00 ~ Seattle Seahawks "@\xa0Los Angeles Char~ "14\xa0:\xa0~
## 8 Aug 11, 10:00 ~ Los Angeles Char~ "@\xa0Arizona Cardinal~ "17\xa0:\xa0~
## 9 Dec 31, 04:25 ~ Oakland Raiders "@\xa0Los Angeles Char~ "10\xa0:\xa0~
## 10 24-Dec-17 Los Angeles Char~ "@\xa0New York Jets" "14\xa0:\xa0~
## # ... with 199 more rows
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/nov_temp.csv"
nov_temp <- read_csv(file1)
nov_temp
## # A tibble: 32 x 2
## Team `Avg Nov Temp`
## <chr> <int>
## 1 Arizona Cardinals 76
## 2 Atlanta Falcons 64
## 3 Baltimore Ravens 58
## 4 Buffalo Bills 48
## 5 Carolina Panthers 62
## 6 Chicago Bears 48
## 7 Cincinnati Bengals 55
## 8 Cleveland Browns 51
## 9 Dallas Cowboys 68
## 10 Denver Broncos 52
## # ... with 22 more rows
The average temperature data frame looks fine. However we need to do some clean-up on the game score data, including (a) converting the Date column from “string” to “date” data type and (b) removing the special characters from the Away, Home, and Score columns.
# convert date column to dates
df1$Date <- df1$Date %>% str_replace("^([:alpha:]{3}) (\\d{1,2})(.+)$", "\\2-\\1-18") %>% dmy()
# fix away & home team columns - remove special characters
df1$Away <- df1$Away %>% str_extract("[[:alnum:]\\s]+")
df1$Home <- df1$Home %>% str_extract("[[:alnum:]\\s]+")
# check that teams are spelled consistently in both team columns
all(sort(unique(df1$Home)) == sort(unique(df1$Away)))
## [1] TRUE
# fix scores column - remove special characters
df1$Score <- df1$Score %>% str_replace("^(\\d+)[^\\d]+(\\d+)", "\\1:\\2")
df1
## # A tibble: 209 x 4
## Date Away Home Score
## <date> <chr> <chr> <chr>
## 1 2018-09-30 San Francisco 49ers Los Angeles Chargers 27:29
## 2 2018-09-23 Los Angeles Chargers Los Angeles Rams 23:35
## 3 2018-09-16 Los Angeles Chargers Buffalo Bills 31:20
## 4 2018-09-09 Kansas City Chiefs Los Angeles Chargers 38:28
## 5 2018-08-30 Los Angeles Chargers San Francisco 49ers 23:21
## 6 2018-08-25 New Orleans Saints Los Angeles Chargers 36:7
## 7 2018-08-18 Seattle Seahawks Los Angeles Chargers 14:24
## 8 2018-08-11 Los Angeles Chargers Arizona Cardinals 17:24
## 9 2018-12-31 Oakland Raiders Los Angeles Chargers 10:30
## 10 2017-12-24 Los Angeles Chargers New York Jets 14:7
## # ... with 199 more rows
Next we add some new fields to the game score data that will help in our analysis. We transform the data frame through the following steps:
These steps are shown in the R code below. Note that in several places we had to do error checking and bug fixes. For instance, in step (g), the initial code assigned a “W” or “L” by comparing the Chargers points scored to the opponent points scored; however this didn’t work because the points scored columns both had “character” data types. Once the columns were converted to “integer” data types, the logic worked.
############################################################
# (a) & (b): fix dates and add pre, regular, or post season
############################################################
game_type <- function(d) {
# define dates for pre and post seasons
pre <- (month(d) == 8) | ((month(d) == 9) & (mday(d) < 7))
post <- (month(d) == 2) | ((month(d) ==1) & (mday(d) > 3))
if (pre)
"Pre"
else {
if (post)
"Post"
else
"Reg"
}
}
type <- sapply(df1$Date, game_type)
df2 <- df1 %>% mutate(Year = year(Date), Month = month(Date), Day = mday(Date), Type = type)
# fix one date for post season that doesn't follow logic above
df2[df2$Date == ymd("2009-01-03"), 8] <- "Post"
# check this was done correctly: there should be 40 pre-season and 5 post-season games
table(df2$Type)
##
## Post Pre Reg
## 5 40 164
df2
## # A tibble: 209 x 8
## Date Away Home Score Year Month Day Type
## <date> <chr> <chr> <chr> <dbl> <dbl> <int> <chr>
## 1 2018-09-30 San Francisco ~ Los Angeles C~ 27:29 2018 9 30 Reg
## 2 2018-09-23 Los Angeles Ch~ Los Angeles R~ 23:35 2018 9 23 Reg
## 3 2018-09-16 Los Angeles Ch~ Buffalo Bills 31:20 2018 9 16 Reg
## 4 2018-09-09 Kansas City Ch~ Los Angeles C~ 38:28 2018 9 9 Reg
## 5 2018-08-30 Los Angeles Ch~ San Francisco~ 23:21 2018 8 30 Pre
## 6 2018-08-25 New Orleans Sa~ Los Angeles C~ 36:7 2018 8 25 Pre
## 7 2018-08-18 Seattle Seahaw~ Los Angeles C~ 14:24 2018 8 18 Pre
## 8 2018-08-11 Los Angeles Ch~ Arizona Cardi~ 17:24 2018 8 11 Pre
## 9 2018-12-31 Oakland Raiders Los Angeles C~ 10:30 2018 12 31 Reg
## 10 2017-12-24 Los Angeles Ch~ New York Jets 14:7 2017 12 24 Reg
## # ... with 199 more rows
############################################################
# (c) & (d): add home/away indicator and opponent name
############################################################
df2 <- df2 %>% mutate(Home_Away = ifelse(str_detect(Away, "Charger"), "Away", "Home"),
Opp = ifelse(Home_Away == "Away", Home, Away)
)
# check this was done correctly: the Chargers should never appear in the Opp column
all(str_detect(df2$Opp, "Charger") == FALSE)
## [1] TRUE
df2
## # A tibble: 209 x 10
## Date Away Home Score Year Month Day Type Home_Away Opp
## <date> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr>
## 1 2018-09-30 San Fr~ Los A~ 27:29 2018 9 30 Reg Home San ~
## 2 2018-09-23 Los An~ Los A~ 23:35 2018 9 23 Reg Away Los ~
## 3 2018-09-16 Los An~ Buffa~ 31:20 2018 9 16 Reg Away Buff~
## 4 2018-09-09 Kansas~ Los A~ 38:28 2018 9 9 Reg Home Kans~
## 5 2018-08-30 Los An~ San F~ 23:21 2018 8 30 Pre Away San ~
## 6 2018-08-25 New Or~ Los A~ 36:7 2018 8 25 Pre Home New ~
## 7 2018-08-18 Seattl~ Los A~ 14:24 2018 8 18 Pre Home Seat~
## 8 2018-08-11 Los An~ Arizo~ 17:24 2018 8 11 Pre Away Ariz~
## 9 2018-12-31 Oaklan~ Los A~ 10:30 2018 12 31 Reg Home Oakl~
## 10 2017-12-24 Los An~ New Y~ 14:7 2017 12 24 Reg Away New ~
## # ... with 199 more rows
############################################################
# (e), (f) & (g): separate points scored and add win/loss indicator
############################################################
temp <- str_split(df2$Score, ":", simplify = TRUE)
df2 <- df2 %>% mutate(Pts_Ch = ifelse(Home_Away == "Home", temp[ , 2], temp[ , 1]),
Pts_Opp = ifelse(Home_Away == "Home", temp[ , 1], temp[ , 2]),
Won_Loss = ifelse(Pts_Ch > Pts_Opp, "W", ifelse(Pts_Ch < Pts_Opp, "L", "T"))
)
df2
## # A tibble: 209 x 13
## Date Away Home Score Year Month Day Type Home_Away Opp
## <date> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr>
## 1 2018-09-30 San ~ Los ~ 27:29 2018 9 30 Reg Home San ~
## 2 2018-09-23 Los ~ Los ~ 23:35 2018 9 23 Reg Away Los ~
## 3 2018-09-16 Los ~ Buff~ 31:20 2018 9 16 Reg Away Buff~
## 4 2018-09-09 Kans~ Los ~ 38:28 2018 9 9 Reg Home Kans~
## 5 2018-08-30 Los ~ San ~ 23:21 2018 8 30 Pre Away San ~
## 6 2018-08-25 New ~ Los ~ 36:7 2018 8 25 Pre Home New ~
## 7 2018-08-18 Seat~ Los ~ 14:24 2018 8 18 Pre Home Seat~
## 8 2018-08-11 Los ~ Ariz~ 17:24 2018 8 11 Pre Away Ariz~
## 9 2018-12-31 Oakl~ Los ~ 10:30 2018 12 31 Reg Home Oakl~
## 10 2017-12-24 Los ~ New ~ 14:7 2017 12 24 Reg Away New ~
## # ... with 199 more rows, and 3 more variables: Pts_Ch <chr>,
## # Pts_Opp <chr>, Won_Loss <chr>
table(df2$Won_Loss)
##
## L W
## 114 95
# need to fix W/L indicator; above didn't work because Pts data need to have integer type
df2$Pts_Ch <- as.integer(df2$Pts_Ch)
df2$Pts_Opp <- as.integer(df2$Pts_Opp)
df2 <- df2 %>% mutate(Won_Loss = ifelse(Pts_Ch > Pts_Opp, "W", ifelse(Pts_Ch < Pts_Opp, "L", "T")))
df2
## # A tibble: 209 x 13
## Date Away Home Score Year Month Day Type Home_Away Opp
## <date> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr>
## 1 2018-09-30 San ~ Los ~ 27:29 2018 9 30 Reg Home San ~
## 2 2018-09-23 Los ~ Los ~ 23:35 2018 9 23 Reg Away Los ~
## 3 2018-09-16 Los ~ Buff~ 31:20 2018 9 16 Reg Away Buff~
## 4 2018-09-09 Kans~ Los ~ 38:28 2018 9 9 Reg Home Kans~
## 5 2018-08-30 Los ~ San ~ 23:21 2018 8 30 Pre Away San ~
## 6 2018-08-25 New ~ Los ~ 36:7 2018 8 25 Pre Home New ~
## 7 2018-08-18 Seat~ Los ~ 14:24 2018 8 18 Pre Home Seat~
## 8 2018-08-11 Los ~ Ariz~ 17:24 2018 8 11 Pre Away Ariz~
## 9 2018-12-31 Oakl~ Los ~ 10:30 2018 12 31 Reg Home Oakl~
## 10 2017-12-24 Los ~ New ~ 14:7 2017 12 24 Reg Away New ~
## # ... with 199 more rows, and 3 more variables: Pts_Ch <int>,
## # Pts_Opp <int>, Won_Loss <chr>
table(df2$Won_Loss)
##
## L W
## 107 102
############################################################
# (h) & (i) add avg temp for game location and reduce variable set
# say COLD = avg Nov temp < 52
# WARM = avg Nov temp > 64
# MED = all else
############################################################
df2 <- df2 %>% left_join(nov_temp, by = c("Home" = "Team")) %>% rename(Nov_Temp = "Avg Nov Temp") %>%
mutate(Temp = ifelse(Nov_Temp < 52, "COLD", ifelse(Nov_Temp > 64, "WARM", "MED")))
# select final variable set
df3 <- df2 %>% select(Year, Month, Day, Type, Home_Away, Temp, Opp, Pts_Ch, Pts_Opp, Won_Loss)
str(df3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 209 obs. of 10 variables:
## $ Year : num 2018 2018 2018 2018 2018 ...
## $ Month : num 9 9 9 9 8 8 8 8 12 12 ...
## $ Day : int 30 23 16 9 30 25 18 11 31 24 ...
## $ Type : chr "Reg" "Reg" "Reg" "Reg" ...
## $ Home_Away: chr "Home" "Away" "Away" "Home" ...
## $ Temp : chr "WARM" "WARM" "COLD" "WARM" ...
## $ Opp : chr "San Francisco 49ers" "Los Angeles Rams" "Buffalo Bills" "Kansas City Chiefs" ...
## $ Pts_Ch : int 29 23 31 28 23 7 24 17 30 14 ...
## $ Pts_Opp : int 27 35 20 38 21 36 14 24 10 7 ...
## $ Won_Loss : chr "W" "L" "W" "L" ...
kable(head(df3, 25))
| Year | Month | Day | Type | Home_Away | Temp | Opp | Pts_Ch | Pts_Opp | Won_Loss |
|---|---|---|---|---|---|---|---|---|---|
| 2018 | 9 | 30 | Reg | Home | WARM | San Francisco 49ers | 29 | 27 | W |
| 2018 | 9 | 23 | Reg | Away | WARM | Los Angeles Rams | 23 | 35 | L |
| 2018 | 9 | 16 | Reg | Away | COLD | Buffalo Bills | 31 | 20 | W |
| 2018 | 9 | 9 | Reg | Home | WARM | Kansas City Chiefs | 28 | 38 | L |
| 2018 | 8 | 30 | Pre | Away | MED | San Francisco 49ers | 23 | 21 | W |
| 2018 | 8 | 25 | Pre | Home | WARM | New Orleans Saints | 7 | 36 | L |
| 2018 | 8 | 18 | Pre | Home | WARM | Seattle Seahawks | 24 | 14 | W |
| 2018 | 8 | 11 | Pre | Away | WARM | Arizona Cardinals | 17 | 24 | L |
| 2018 | 12 | 31 | Reg | Home | WARM | Oakland Raiders | 30 | 10 | W |
| 2017 | 12 | 24 | Reg | Away | MED | New York Jets | 14 | 7 | W |
| 2017 | 12 | 16 | Reg | Away | MED | Kansas City Chiefs | 13 | 30 | L |
| 2017 | 12 | 10 | Reg | Home | WARM | Washington Redskins | 30 | 13 | W |
| 2017 | 12 | 3 | Reg | Home | WARM | Cleveland Browns | 19 | 10 | W |
| 2017 | 11 | 23 | Reg | Away | WARM | Dallas Cowboys | 28 | 6 | W |
| 2017 | 11 | 19 | Reg | Home | WARM | Buffalo Bills | 54 | 24 | W |
| 2017 | 11 | 12 | Reg | Away | WARM | Jacksonville Jaguars | 17 | 20 | L |
| 2017 | 10 | 29 | Reg | Away | COLD | New England Patriots | 13 | 21 | L |
| 2017 | 10 | 22 | Reg | Home | WARM | Denver Broncos | 21 | 0 | W |
| 2017 | 10 | 15 | Reg | Away | WARM | Oakland Raiders | 17 | 16 | W |
| 2017 | 10 | 8 | Reg | Away | MED | New York Giants | 27 | 22 | W |
| 2017 | 10 | 1 | Reg | Home | WARM | Philadelphia Eagles | 24 | 26 | L |
| 2017 | 9 | 24 | Reg | Home | WARM | Kansas City Chiefs | 10 | 24 | L |
| 2017 | 9 | 17 | Reg | Home | WARM | Miami Dolphins | 17 | 19 | L |
| 2017 | 9 | 11 | Reg | Away | MED | Denver Broncos | 21 | 24 | L |
| 2017 | 8 | 31 | Pre | Away | MED | San Francisco 49ers | 13 | 23 | L |
Now that we have a tidy data frame, we can proceed to answer our questions.
The Chargers have won 83 games out of 164 regular season games since the 2008 season, which equates to a 50.6% win percentage.
df3 %>% group_by(Type) %>%
summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"),
Win_Pct = round(Wins / Number_Games * 100, 1) ) %>%
kable(caption = "Chargers Win-Loss Record: 2008 - 2018/9/30")
| Type | Number_Games | Wins | Losses | Win_Pct |
|---|---|---|---|---|
| Post | 5 | 2 | 3 | 40.0 |
| Pre | 40 | 17 | 23 | 42.5 |
| Reg | 164 | 83 | 81 | 50.6 |
The Chargers definitely play better at home than on the road, as their win percentage is 57% at home vs. 44% away.
df3 %>% filter(Type == "Reg") %>% group_by(Home_Away) %>%
summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"),
Win_Pct = round(Wins / Number_Games * 100, 1) ) %>%
kable(caption = "Chargers Win-Loss Record at Home vs. Away: 2008 - 2018/9/30")
| Home_Away | Number_Games | Wins | Losses | Win_Pct |
|---|---|---|---|---|
| Away | 82 | 36 | 46 | 43.9 |
| Home | 82 | 47 | 35 | 57.3 |
The Chargers are definitely a warm-weather team, as their win percentage and points scored-versus-allowed all show a clear bias toward better performance playing in warm weather. The Chargers have a 56% win percentage in warm weather, but only a 23% win percentage in cold weather.
df3 %>% filter(Type == "Reg") %>% group_by(Temp) %>%
summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"),
Win_Pct = round(Wins / Number_Games * 100, 1), Tot_Pts_Scored = sum(Pts_Ch),
Tot_Pts_Allowed = sum(Pts_Opp) ) %>%
kable(caption = "Chargers Win-Loss Record and Points Scored/Allowed in Warm & Cold Weather Cities: 2008 - 2018/9/30")
| Temp | Number_Games | Wins | Losses | Win_Pct | Tot_Pts_Scored | Tot_Pts_Allowed |
|---|---|---|---|---|---|---|
| COLD | 13 | 3 | 10 | 23.1 | 238 | 313 |
| MED | 41 | 19 | 22 | 46.3 | 1026 | 995 |
| WARM | 110 | 61 | 49 | 55.5 | 2766 | 2317 |