Introduction

In this exercise, we analyze a dataset of historical game scores for the Los Angeles Chargers (previously the San Diego Chargers) of the NFL. The idea of using this dataset came from my post on the Week 5 Discussion Board. The source of the dataset can be found at:
https://www.scoreboard.com/team/los-angeles-chargers/K0iU7PXK/results/

The dataset includes game scores from the 2008 season through 9/30/2018, a time span that includes 220 games played during 10 complete seasons (2008 through 2017) and an early portion of the 2018 season. For each game the data includes the date, the opposing team, an indicator whether the game was played at home (in LA or San Diego) or away, the final score, and an indicator whether the game was played in the pre-season, regular season, or post-season.

In analyzing the data, I was interested in addressing three questions:

Create a CSV file that includes all the information in the dataset

The Chargers game score data is available at the URL above. In addition, to address the question of the Charger’s performance playing in warm weather vs. cold weather, we need temperature data by city. To do an accurate analysis, we should use the actual historical temperature at game time for each game, which we could pull from a source like the National Weather Service (https://www.weather.gov/). However, to keep things simple, let’s use an average temperature by city, which we can pull from the US Climate Data website (https://www.usclimatedata.com/). I chose to use average November temperatures, since that’s roughly the mid-to-end month of the NFL regular season (typically September through December).

Both datasets are saved as CSV files on GitHub:

Read the dataset and use tidyr and dplyr to tidy and transform the data

First, read in the data from the CSV files stored on GitHub. Note that several columns in the game score data frame are all NA, and several rows are NA. We remove the NA columns and rows in this step, along with renaming the remaining columns.

library(tidyverse)
library(lubridate)
library(knitr)

file <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/chargers_scores.csv"
raw <- read_csv(file, col_names = FALSE)
raw
## # A tibble: 244 x 7
##    X1    X2                X3          X4             X5       X6    X7   
##    <chr> <chr>             <chr>       <chr>          <chr>    <chr> <chr>
##  1 <NA>  "StandingsUSA:\x~ <NA>        <NA>           <NA>     <NA>  <NA> 
##  2 <NA>  Sep 30, 04:25 PM  San Franci~ "@\xa0Los Ang~ "27\xa0~ <NA>  <NA> 
##  3 <NA>  Sep 23, 04:05 PM  Los Angele~ "@\xa0Los Ang~ "23\xa0~ <NA>  <NA> 
##  4 <NA>  Sep 16, 01:00 PM  Los Angele~ "@\xa0Buffalo~ "31\xa0~ <NA>  <NA> 
##  5 <NA>  Sep 09, 04:05 PM  Kansas Cit~ "@\xa0Los Ang~ "38\xa0~ <NA>  <NA> 
##  6 <NA>  "StandingsUSA:\x~ <NA>        <NA>           <NA>     <NA>  <NA> 
##  7 <NA>  Aug 30, 10:00 PM  Los Angele~ "@\xa0San Fra~ "23\xa0~ <NA>  <NA> 
##  8 <NA>  Aug 25, 08:00 PM  New Orlean~ "@\xa0Los Ang~ "36\xa0~ <NA>  <NA> 
##  9 <NA>  Aug 18, 10:00 PM  Seattle Se~ "@\xa0Los Ang~ "14\xa0~ <NA>  <NA> 
## 10 <NA>  Aug 11, 10:00 PM  Los Angele~ "@\xa0Arizona~ "17\xa0~ <NA>  <NA> 
## # ... with 234 more rows
# remove columns that are all NA & remove rows that are NA
all(is.na(c(raw$X1, raw$X6, raw$X7)))
## [1] TRUE
df1 <- raw %>% select(2:5) %>% filter(is.na(X3) == FALSE)
# rename columns
names(df1) <- c("Date", "Away", "Home", "Score")
df1
## # A tibble: 209 x 4
##    Date            Away              Home                    Score        
##    <chr>           <chr>             <chr>                   <chr>        
##  1 Sep 30, 04:25 ~ San Francisco 49~ "@\xa0Los Angeles Char~ "27\xa0:\xa0~
##  2 Sep 23, 04:05 ~ Los Angeles Char~ "@\xa0Los Angeles Rams" "23\xa0:\xa0~
##  3 Sep 16, 01:00 ~ Los Angeles Char~ "@\xa0Buffalo Bills"    "31\xa0:\xa0~
##  4 Sep 09, 04:05 ~ Kansas City Chie~ "@\xa0Los Angeles Char~ "38\xa0:\xa0~
##  5 Aug 30, 10:00 ~ Los Angeles Char~ "@\xa0San Francisco 49~ "23\xa0:\xa0~
##  6 Aug 25, 08:00 ~ New Orleans Sain~ "@\xa0Los Angeles Char~ "36\xa0:\xa0~
##  7 Aug 18, 10:00 ~ Seattle Seahawks  "@\xa0Los Angeles Char~ "14\xa0:\xa0~
##  8 Aug 11, 10:00 ~ Los Angeles Char~ "@\xa0Arizona Cardinal~ "17\xa0:\xa0~
##  9 Dec 31, 04:25 ~ Oakland Raiders   "@\xa0Los Angeles Char~ "10\xa0:\xa0~
## 10 24-Dec-17       Los Angeles Char~ "@\xa0New York Jets"    "14\xa0:\xa0~
## # ... with 199 more rows
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/nov_temp.csv"
nov_temp <- read_csv(file1)
nov_temp
## # A tibble: 32 x 2
##    Team               `Avg Nov Temp`
##    <chr>                       <int>
##  1 Arizona Cardinals              76
##  2 Atlanta Falcons                64
##  3 Baltimore Ravens               58
##  4 Buffalo Bills                  48
##  5 Carolina Panthers              62
##  6 Chicago Bears                  48
##  7 Cincinnati Bengals             55
##  8 Cleveland Browns               51
##  9 Dallas Cowboys                 68
## 10 Denver Broncos                 52
## # ... with 22 more rows

The average temperature data frame looks fine. However we need to do some clean-up on the game score data, including (a) converting the Date column from “string” to “date” data type and (b) removing the special characters from the Away, Home, and Score columns.

# convert date column to dates
df1$Date <- df1$Date %>% str_replace("^([:alpha:]{3}) (\\d{1,2})(.+)$", "\\2-\\1-18") %>% dmy()

# fix away & home team columns - remove special characters
df1$Away <- df1$Away %>% str_extract("[[:alnum:]\\s]+")
df1$Home <- df1$Home %>% str_extract("[[:alnum:]\\s]+")
# check that teams are spelled consistently in both team columns
all(sort(unique(df1$Home)) == sort(unique(df1$Away)))
## [1] TRUE
# fix scores column - remove special characters
df1$Score <- df1$Score %>% str_replace("^(\\d+)[^\\d]+(\\d+)", "\\1:\\2")

df1
## # A tibble: 209 x 4
##    Date       Away                 Home                 Score
##    <date>     <chr>                <chr>                <chr>
##  1 2018-09-30 San Francisco 49ers  Los Angeles Chargers 27:29
##  2 2018-09-23 Los Angeles Chargers Los Angeles Rams     23:35
##  3 2018-09-16 Los Angeles Chargers Buffalo Bills        31:20
##  4 2018-09-09 Kansas City Chiefs   Los Angeles Chargers 38:28
##  5 2018-08-30 Los Angeles Chargers San Francisco 49ers  23:21
##  6 2018-08-25 New Orleans Saints   Los Angeles Chargers 36:7 
##  7 2018-08-18 Seattle Seahawks     Los Angeles Chargers 14:24
##  8 2018-08-11 Los Angeles Chargers Arizona Cardinals    17:24
##  9 2018-12-31 Oakland Raiders      Los Angeles Chargers 10:30
## 10 2017-12-24 Los Angeles Chargers New York Jets        14:7 
## # ... with 199 more rows

Next we add some new fields to the game score data that will help in our analysis. We transform the data frame through the following steps:

  1. Convert dates to year, month, and day
  2. Determine the game type: pre, regular, or post season
  3. Add home or away game indicator
  4. Add opponent name
  5. Add Chargers points scored
  6. Add opponent points scored
  7. Add win/loss indicator
  8. Add average temperature indicator given the game location (depending on whether it’s a home or away game), by joining the game score data with the November temperature data
  9. Reduce data frame to the final variable set.

These steps are shown in the R code below. Note that in several places we had to do error checking and bug fixes. For instance, in step (g), the initial code assigned a “W” or “L” by comparing the Chargers points scored to the opponent points scored; however this didn’t work because the points scored columns both had “character” data types. Once the columns were converted to “integer” data types, the logic worked.

############################################################
# (a) & (b): fix dates and add pre, regular, or post season
############################################################
game_type <- function(d) {
                # define dates for pre and post seasons
                pre <- (month(d) == 8) | ((month(d) == 9) & (mday(d) < 7))
                post <- (month(d) == 2) | ((month(d) ==1) & (mday(d) > 3)) 
                    
                if (pre) 
                    "Pre"
                else {
                    if (post) 
                        "Post"
                    else 
                        "Reg"
                }
            }
type <- sapply(df1$Date, game_type)

df2 <- df1 %>% mutate(Year = year(Date), Month = month(Date), Day = mday(Date), Type = type)
# fix one date for post season that doesn't follow logic above
df2[df2$Date == ymd("2009-01-03"), 8] <- "Post"
# check this was done correctly: there should be 40 pre-season and 5 post-season games
table(df2$Type)
## 
## Post  Pre  Reg 
##    5   40  164
df2
## # A tibble: 209 x 8
##    Date       Away            Home           Score  Year Month   Day Type 
##    <date>     <chr>           <chr>          <chr> <dbl> <dbl> <int> <chr>
##  1 2018-09-30 San Francisco ~ Los Angeles C~ 27:29  2018     9    30 Reg  
##  2 2018-09-23 Los Angeles Ch~ Los Angeles R~ 23:35  2018     9    23 Reg  
##  3 2018-09-16 Los Angeles Ch~ Buffalo Bills  31:20  2018     9    16 Reg  
##  4 2018-09-09 Kansas City Ch~ Los Angeles C~ 38:28  2018     9     9 Reg  
##  5 2018-08-30 Los Angeles Ch~ San Francisco~ 23:21  2018     8    30 Pre  
##  6 2018-08-25 New Orleans Sa~ Los Angeles C~ 36:7   2018     8    25 Pre  
##  7 2018-08-18 Seattle Seahaw~ Los Angeles C~ 14:24  2018     8    18 Pre  
##  8 2018-08-11 Los Angeles Ch~ Arizona Cardi~ 17:24  2018     8    11 Pre  
##  9 2018-12-31 Oakland Raiders Los Angeles C~ 10:30  2018    12    31 Reg  
## 10 2017-12-24 Los Angeles Ch~ New York Jets  14:7   2017    12    24 Reg  
## # ... with 199 more rows
############################################################
# (c) & (d): add home/away indicator and opponent name
############################################################
df2 <- df2 %>% mutate(Home_Away = ifelse(str_detect(Away, "Charger"), "Away", "Home"),
                      Opp = ifelse(Home_Away == "Away", Home, Away)
)
# check this was done correctly: the Chargers should never appear in the Opp column
all(str_detect(df2$Opp, "Charger") == FALSE)            
## [1] TRUE
df2
## # A tibble: 209 x 10
##    Date       Away    Home   Score  Year Month   Day Type  Home_Away Opp  
##    <date>     <chr>   <chr>  <chr> <dbl> <dbl> <int> <chr> <chr>     <chr>
##  1 2018-09-30 San Fr~ Los A~ 27:29  2018     9    30 Reg   Home      San ~
##  2 2018-09-23 Los An~ Los A~ 23:35  2018     9    23 Reg   Away      Los ~
##  3 2018-09-16 Los An~ Buffa~ 31:20  2018     9    16 Reg   Away      Buff~
##  4 2018-09-09 Kansas~ Los A~ 38:28  2018     9     9 Reg   Home      Kans~
##  5 2018-08-30 Los An~ San F~ 23:21  2018     8    30 Pre   Away      San ~
##  6 2018-08-25 New Or~ Los A~ 36:7   2018     8    25 Pre   Home      New ~
##  7 2018-08-18 Seattl~ Los A~ 14:24  2018     8    18 Pre   Home      Seat~
##  8 2018-08-11 Los An~ Arizo~ 17:24  2018     8    11 Pre   Away      Ariz~
##  9 2018-12-31 Oaklan~ Los A~ 10:30  2018    12    31 Reg   Home      Oakl~
## 10 2017-12-24 Los An~ New Y~ 14:7   2017    12    24 Reg   Away      New ~
## # ... with 199 more rows
############################################################
# (e), (f) & (g): separate points scored and add win/loss indicator
############################################################
temp <- str_split(df2$Score, ":", simplify = TRUE)
df2 <- df2 %>% mutate(Pts_Ch = ifelse(Home_Away == "Home", temp[ , 2], temp[ , 1]),
                      Pts_Opp = ifelse(Home_Away == "Home", temp[ , 1], temp[ , 2]), 
                      Won_Loss = ifelse(Pts_Ch > Pts_Opp, "W", ifelse(Pts_Ch < Pts_Opp, "L", "T"))
)
df2                          
## # A tibble: 209 x 13
##    Date       Away  Home  Score  Year Month   Day Type  Home_Away Opp  
##    <date>     <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>     <chr>
##  1 2018-09-30 San ~ Los ~ 27:29  2018     9    30 Reg   Home      San ~
##  2 2018-09-23 Los ~ Los ~ 23:35  2018     9    23 Reg   Away      Los ~
##  3 2018-09-16 Los ~ Buff~ 31:20  2018     9    16 Reg   Away      Buff~
##  4 2018-09-09 Kans~ Los ~ 38:28  2018     9     9 Reg   Home      Kans~
##  5 2018-08-30 Los ~ San ~ 23:21  2018     8    30 Pre   Away      San ~
##  6 2018-08-25 New ~ Los ~ 36:7   2018     8    25 Pre   Home      New ~
##  7 2018-08-18 Seat~ Los ~ 14:24  2018     8    18 Pre   Home      Seat~
##  8 2018-08-11 Los ~ Ariz~ 17:24  2018     8    11 Pre   Away      Ariz~
##  9 2018-12-31 Oakl~ Los ~ 10:30  2018    12    31 Reg   Home      Oakl~
## 10 2017-12-24 Los ~ New ~ 14:7   2017    12    24 Reg   Away      New ~
## # ... with 199 more rows, and 3 more variables: Pts_Ch <chr>,
## #   Pts_Opp <chr>, Won_Loss <chr>
table(df2$Won_Loss)
## 
##   L   W 
## 114  95
# need to fix W/L indicator; above didn't work because Pts data need to have integer type
df2$Pts_Ch <- as.integer(df2$Pts_Ch)
df2$Pts_Opp <- as.integer(df2$Pts_Opp)
df2 <- df2 %>% mutate(Won_Loss = ifelse(Pts_Ch > Pts_Opp, "W", ifelse(Pts_Ch < Pts_Opp, "L", "T")))
df2
## # A tibble: 209 x 13
##    Date       Away  Home  Score  Year Month   Day Type  Home_Away Opp  
##    <date>     <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>     <chr>
##  1 2018-09-30 San ~ Los ~ 27:29  2018     9    30 Reg   Home      San ~
##  2 2018-09-23 Los ~ Los ~ 23:35  2018     9    23 Reg   Away      Los ~
##  3 2018-09-16 Los ~ Buff~ 31:20  2018     9    16 Reg   Away      Buff~
##  4 2018-09-09 Kans~ Los ~ 38:28  2018     9     9 Reg   Home      Kans~
##  5 2018-08-30 Los ~ San ~ 23:21  2018     8    30 Pre   Away      San ~
##  6 2018-08-25 New ~ Los ~ 36:7   2018     8    25 Pre   Home      New ~
##  7 2018-08-18 Seat~ Los ~ 14:24  2018     8    18 Pre   Home      Seat~
##  8 2018-08-11 Los ~ Ariz~ 17:24  2018     8    11 Pre   Away      Ariz~
##  9 2018-12-31 Oakl~ Los ~ 10:30  2018    12    31 Reg   Home      Oakl~
## 10 2017-12-24 Los ~ New ~ 14:7   2017    12    24 Reg   Away      New ~
## # ... with 199 more rows, and 3 more variables: Pts_Ch <int>,
## #   Pts_Opp <int>, Won_Loss <chr>
table(df2$Won_Loss)
## 
##   L   W 
## 107 102
############################################################
# (h) & (i) add avg temp for game location and reduce variable set
# say   COLD = avg Nov temp < 52
#       WARM = avg Nov temp > 64
#       MED = all else
############################################################
df2 <- df2 %>% left_join(nov_temp, by = c("Home" = "Team")) %>% rename(Nov_Temp = "Avg Nov Temp") %>% 
            mutate(Temp = ifelse(Nov_Temp < 52, "COLD", ifelse(Nov_Temp > 64, "WARM", "MED")))
# select final variable set
df3 <- df2 %>% select(Year, Month, Day, Type, Home_Away, Temp, Opp, Pts_Ch, Pts_Opp, Won_Loss) 
str(df3)
## Classes 'tbl_df', 'tbl' and 'data.frame':    209 obs. of  10 variables:
##  $ Year     : num  2018 2018 2018 2018 2018 ...
##  $ Month    : num  9 9 9 9 8 8 8 8 12 12 ...
##  $ Day      : int  30 23 16 9 30 25 18 11 31 24 ...
##  $ Type     : chr  "Reg" "Reg" "Reg" "Reg" ...
##  $ Home_Away: chr  "Home" "Away" "Away" "Home" ...
##  $ Temp     : chr  "WARM" "WARM" "COLD" "WARM" ...
##  $ Opp      : chr  "San Francisco 49ers" "Los Angeles Rams" "Buffalo Bills" "Kansas City Chiefs" ...
##  $ Pts_Ch   : int  29 23 31 28 23 7 24 17 30 14 ...
##  $ Pts_Opp  : int  27 35 20 38 21 36 14 24 10 7 ...
##  $ Won_Loss : chr  "W" "L" "W" "L" ...
kable(head(df3, 25))
Year Month Day Type Home_Away Temp Opp Pts_Ch Pts_Opp Won_Loss
2018 9 30 Reg Home WARM San Francisco 49ers 29 27 W
2018 9 23 Reg Away WARM Los Angeles Rams 23 35 L
2018 9 16 Reg Away COLD Buffalo Bills 31 20 W
2018 9 9 Reg Home WARM Kansas City Chiefs 28 38 L
2018 8 30 Pre Away MED San Francisco 49ers 23 21 W
2018 8 25 Pre Home WARM New Orleans Saints 7 36 L
2018 8 18 Pre Home WARM Seattle Seahawks 24 14 W
2018 8 11 Pre Away WARM Arizona Cardinals 17 24 L
2018 12 31 Reg Home WARM Oakland Raiders 30 10 W
2017 12 24 Reg Away MED New York Jets 14 7 W
2017 12 16 Reg Away MED Kansas City Chiefs 13 30 L
2017 12 10 Reg Home WARM Washington Redskins 30 13 W
2017 12 3 Reg Home WARM Cleveland Browns 19 10 W
2017 11 23 Reg Away WARM Dallas Cowboys 28 6 W
2017 11 19 Reg Home WARM Buffalo Bills 54 24 W
2017 11 12 Reg Away WARM Jacksonville Jaguars 17 20 L
2017 10 29 Reg Away COLD New England Patriots 13 21 L
2017 10 22 Reg Home WARM Denver Broncos 21 0 W
2017 10 15 Reg Away WARM Oakland Raiders 17 16 W
2017 10 8 Reg Away MED New York Giants 27 22 W
2017 10 1 Reg Home WARM Philadelphia Eagles 24 26 L
2017 9 24 Reg Home WARM Kansas City Chiefs 10 24 L
2017 9 17 Reg Home WARM Miami Dolphins 17 19 L
2017 9 11 Reg Away MED Denver Broncos 21 24 L
2017 8 31 Pre Away MED San Francisco 49ers 13 23 L

Use the tidy data to address the questions

Now that we have a tidy data frame, we can proceed to answer our questions.

Q1: What is the Chargers’ regular season win-loss record and win percentage over the last decade (2008-2018)?

The Chargers have won 83 games out of 164 regular season games since the 2008 season, which equates to a 50.6% win percentage.

df3 %>% group_by(Type) %>% 
    summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"), 
              Win_Pct = round(Wins / Number_Games * 100, 1) ) %>% 
    kable(caption = "Chargers Win-Loss Record: 2008 - 2018/9/30") 
Chargers Win-Loss Record: 2008 - 2018/9/30
Type Number_Games Wins Losses Win_Pct
Post 5 2 3 40.0
Pre 40 17 23 42.5
Reg 164 83 81 50.6

Q2: What is the Chargers’ regular season win-loss record and win percentage playing at home vs. away, over this time?

The Chargers definitely play better at home than on the road, as their win percentage is 57% at home vs. 44% away.

df3 %>% filter(Type == "Reg") %>% group_by(Home_Away) %>% 
    summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"), 
              Win_Pct = round(Wins / Number_Games * 100, 1) ) %>% 
    kable(caption = "Chargers Win-Loss Record at Home vs. Away: 2008 - 2018/9/30") 
Chargers Win-Loss Record at Home vs. Away: 2008 - 2018/9/30
Home_Away Number_Games Wins Losses Win_Pct
Away 82 36 46 43.9
Home 82 47 35 57.3

Q3: What is the Chargers’ regular season win-loss record, win percentage, and points scored vs. points allowed, when playing in cold weather cities vs. warm weather cities?

The Chargers are definitely a warm-weather team, as their win percentage and points scored-versus-allowed all show a clear bias toward better performance playing in warm weather. The Chargers have a 56% win percentage in warm weather, but only a 23% win percentage in cold weather.

df3 %>% filter(Type == "Reg") %>% group_by(Temp) %>% 
    summarize(Number_Games = n(), Wins = sum(Won_Loss == "W"), Losses = sum(Won_Loss == "L"), 
              Win_Pct = round(Wins / Number_Games * 100, 1), Tot_Pts_Scored = sum(Pts_Ch), 
              Tot_Pts_Allowed = sum(Pts_Opp) ) %>% 
    kable(caption = "Chargers Win-Loss Record and Points Scored/Allowed in Warm & Cold Weather Cities: 2008 - 2018/9/30") 
Chargers Win-Loss Record and Points Scored/Allowed in Warm & Cold Weather Cities: 2008 - 2018/9/30
Temp Number_Games Wins Losses Win_Pct Tot_Pts_Scored Tot_Pts_Allowed
COLD 13 3 10 23.1 238 313
MED 41 19 22 46.3 1026 995
WARM 110 61 49 55.5 2766 2317