Good work so far, you are on your way to becoming full fledged data analysts with R. This week’s course was a primer for when we are getting into Multiple and Logarithmic Regression. In the DataCamp courses we learned about different ways to visualize your data and when they are appropriate.

The data science process has five basic steps:

  1. Obtain Data
    • Importing data from source
  2. Clean Data
    • Remove Unwanted observations
    • Remove Duplicate Observations
    • Remove Irrelevant observations
    • Fix structural error(typos and capitalization)
    • Fix class types
  3. Exploratory Data Analysis
    • Variable Identification
    • Univariate Analysis
    • Bi-variate Analysis
    • Missing values treatment
    • Outlier treatment
    • Variable transformation
    • Feature engineering
  4. Data Modeling
    • Accomplish Task

This assignment will be the first one that you will go through four of the steps for the data science process. By the end of this assignment you will have a nice report when you knit the Rmarkdown file.


This week you are a Sports Data Analyst for the prestigious Brigham Young University. You have been tasked with examining BYU football attendance compared with other football teams around the country. You will be visualizing data, building linear regression models, and interpreting your results. Good luck and have fun!

#Part 1. Obtaining the Data

This dataset was created for MRKT 585R, the data comes from extracting data from approximately 600 Wikipedia pages for attendance numbers for each football game from 2000-2018, after that data was added, it was bound to the weather data from the National Climate Data Center. Following is a list of variables and description of each, this format will create a table when you knit:

Variable Description
Date The Date the Game was played on
Team Home team of the Football Game
Time Kickoff Time
Opponent Away team of the Football Game
Rank Rank of the Home Team for the AP poll
Site Location for the game
TV TV channel that the game was played on
Result The outcome for the football game
Attendance How many people attended the game
Current Wins How many wins the team has leading up to the game
Current Losses How many losses the team has leading up to the game
Stadium Capacity How many people fit into the stadium
Fill Rate Attendance / Stadium Capacity
New Coach If the team has a first year head coach
Tailgating If the team is a Top 25 tailgate destination
PRCP Precipitation
SNOW Snowfall
SNWD Snow Depth (Snow on ground)
TMAX Max Temperature for the day
TMIN Min Temperature for the day
Opponent_Rank Rank of the Opponent at the time of the game
Conference What football conference the team belongs
  1. Import ‘CFB.xlsx’ to RStudio Cloud and print the head: Don’t remember how? Review
cfb <- readxl::read_excel("CFB.xlsx", sheet = 1)
## New names:
## * `` -> ...1
head(cfb)
## # A tibble: 6 x 25
##   ...1  Date                Team  Time  Opponent Rank  Site  TV    Result
##   <chr> <dttm>              <chr> <chr> <chr>    <chr> <chr> <chr> <chr> 
## 1 1     2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2     2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 3     2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 4     2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 5     2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 6     2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <chr>

Before cleaning the data, take a moment and become familiar with the dataset that you are working with. There are a multitude of variables, it would be important to

#Part 2. Cleaning the Data

During this step we will look at the data to see if there are any unwanted observations, duplicate observations, irrelevant observations, structural issues or class type issues.

  1. Let’s start by looking at the class types for the dataset to see if we need to change any of the data types
## Observations: 3,732
## Variables: 25
## $ ...1               <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <chr> "NR", "NR", "NR", "NR", "NR", "NR", "NR", "NR…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins`     <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses`   <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ Tailgating         <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <chr> "NR", "NR", "NR", " 25", "NR", "NR", " 24", "…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <chr> "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SE…

A couple of the variables have issues with the. Here they are and their problems:

  1. Now let’s try to fix the columns, start by subsetting the data to remove the index (..1), output the head for the new dataset
cfb <- cfb[,-1]
head(cfb)
## # A tibble: 6 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <chr> <chr> <chr>    <chr> <chr> <chr> <chr> 
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <chr>
  1. Next, fix the columns with categorical data. Factor the Team, Rank, New Coach, Tailgating, Opponent_Rank, and Conference columns so that the categorical nature of the columns is respected. Output the structure of the data using glimpse or str to confirm it worked.

Note: In actual analysis, you would want Rank, and Opponent Rank as ordered factors. Also, the New Coach column has a space in it, to type in the column without autofill, you surround the column name with `` symbols

cfb$Team <- as.factor(cfb$Team)
cfb$Rank <- as.factor(cfb$Rank)
cfb$`New Coach` <- as.factor(cfb$`New Coach`)
cfb$Tailgating <- as.factor(cfb$Tailgating)
cfb$Opponent_Rank <- as.factor(cfb$Opponent_Rank)
cfb$Conference <- as.factor(cfb$Conference)

glimpse(cfb)
## Observations: 3,732
## Variables: 24
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins`     <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses`   <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <fct> NR, NR, NR,  25, NR, NR,  24, NR,  8, NR,  9,…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …
  1. Next, we will fix the Attendance column. Use the gsub function to remove the commas in the text.
cfb$Attendance <- gsub(",", "", cfb$Attendance)

head(cfb)
## # A tibble: 6 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <fct> <chr> <chr>    <fct> <chr> <chr> <chr> 
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <fct>
  1. Next the Attendance, Current Wins and Current Losses are both characters right now, make them integers. Output the structure with glimpse or str again.
cfb$Attendance <- as.integer(cfb$Attendance)
cfb$`Current Losses` <- as.integer(cfb$`Current Losses`)
cfb$`Current Wins` <- as.integer(cfb$`Current Wins`)

glimpse(cfb)
## Observations: 3,732
## Variables: 24
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <int> 53946, 54286, 51482, 51162, 50947, 49647, 439…
## $ `Current Wins`     <int> 0, 1, 2, 3, 3, 4, 5, 0, 1, 1, 2, 3, 5, 6, 0, …
## $ `Current Losses`   <int> 0, 0, 0, 0, 1, 3, 5, 0, 0, 3, 3, 3, 3, 3, 0, …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <fct> NR, NR, NR,  25, NR, NR,  24, NR,  8, NR,  9,…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …

You have now cleaned the data for our College Football Dataset.

#Part 3. Exploratory Data Analysis

Thus far we have imported and cleaned the data that we have for each of the games. Now we begin the exploratory data analysis, this step includes Variable Identification, Univariate Analysis, Bi-variate Analysis, Missing values treatment, Outlier treatment, Variable transformation, and Feature engineering. These are alot of big words, don’t worry though. I am going to explain what each means.

Variable Identification - identifying the input and the output variables, there are many different names for these. Input variables are also predictors, features, and independent variables. Output variable are also the response or dependent variable.

In this assignment, I am going to tell you that the output variable is the Fill Rate unless I specify otherwise. This means that the Fill Rate will be on the y axis for most graphs that we do.

Missing Value Treatment - dealing with missing data in the dataset

  1. Let’s start by identifying which variables might have missing data in it. Use the summary function to identify the NAs in the dataset
summary(cfb)
##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Arkansas      : 130   Length:3732       
##  1st Qu.:2005-06-24 12:00:00   Nebraska      : 130   Class :character  
##  Median :2009-10-31 00:00:00   Penn State    : 130   Mode  :character  
##  Mean   :2009-12-03 10:01:09   Michigan State: 129                     
##  3rd Qu.:2014-09-27 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2958                     
##    Opponent              Rank          Site                TV           
##  Length:3732        NR     :2668   Length:3732        Length:3732       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  51                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 807                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3732        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 36018   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53663   Median : 3.000   Median : 2.000  
##                     Mean   : 54607   Mean   : 3.049   Mean   : 2.061  
##                     3rd Qu.: 75504   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                     NA's   :2                                         
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3129   FALSE:2514  
##  1st Qu.: 49225   1st Qu.:0.78404   TRUE : 603   TRUE :1218  
##  Median : 60000   Median :0.94100                            
##  Mean   : 60354   Mean   :0.87895                            
##  3rd Qu.: 80321   3rd Qu.:1.00255                            
##  Max.   :107282   Max.   :1.40399                            
##                   NA's   :2                                  
##       PRCP              SNOW             SNWD           TMAX       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.: 61.00  
##  Median :0.00000   Median :0.0000   Median :0.00   Median : 72.00  
##  Mean   :0.08403   Mean   :0.0166   Mean   :0.03   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.3000   Max.   :7.10   Max.   :111.00  
##  NA's   :20        NA's   :1005     NA's   :1049   NA's   :29      
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2936   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  53   1st Qu.:12958   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14548   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14581   Mean   :23548  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16340   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##  NA's   :29      (Other): 596                                  
##         Conference 
##  Big-10      :755  
##  ACC         :697  
##  Big-12      :570  
##  SEC         :374  
##  Pac-12      :336  
##  Mid-American:281  
##  (Other)     :719

From the output, we learn that TV has 582 NA’s, Attendance has 2 NA’s, Fill Rate has 2 NA’s, PRCP has 20 NA’s, SNOW and SNWD have over 1000 NA’s, and finally TMAX and TMIN have 29 NA’s.

  1. Let’s start to deal with these, we can safely assume that an NA in TV means that the game was not on TV, the same goes with PRCP, SNOW and SNWD, we can assume that there was no precipitation of any kind. Assign the NA values for the weather with 0. Follow my example.
cfb$TV[is.na(cfb$TV)] <- "Not on Television"          #Example of how to fill na values
cfb$PRCP[is.na(cfb$PRCP)] <- 0
cfb$SNOW[is.na(cfb$SNOW)] <- 0
cfb$SNWD[is.na(cfb$SNWD)] <- 0
summary(cfb)
##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Arkansas      : 130   Length:3732       
##  1st Qu.:2005-06-24 12:00:00   Nebraska      : 130   Class :character  
##  Median :2009-10-31 00:00:00   Penn State    : 130   Mode  :character  
##  Mean   :2009-12-03 10:01:09   Michigan State: 129                     
##  3rd Qu.:2014-09-27 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2958                     
##    Opponent              Rank          Site                TV           
##  Length:3732        NR     :2668   Length:3732        Length:3732       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  51                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 807                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3732        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 36018   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53663   Median : 3.000   Median : 2.000  
##                     Mean   : 54607   Mean   : 3.049   Mean   : 2.061  
##                     3rd Qu.: 75504   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                     NA's   :2                                         
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3129   FALSE:2514  
##  1st Qu.: 49225   1st Qu.:0.78404   TRUE : 603   TRUE :1218  
##  Median : 60000   Median :0.94100                            
##  Mean   : 60354   Mean   :0.87895                            
##  3rd Qu.: 80321   3rd Qu.:1.00255                            
##  Max.   :107282   Max.   :1.40399                            
##                   NA's   :2                                  
##       PRCP              SNOW              SNWD              TMAX       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 61.00  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median : 72.00  
##  Mean   :0.08357   Mean   :0.01217   Mean   :0.02157   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.30000   Max.   :7.10000   Max.   :111.00  
##                                                        NA's   :29      
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2936   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  53   1st Qu.:12958   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14548   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14581   Mean   :23548  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16340   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##  NA's   :29      (Other): 596                                  
##         Conference 
##  Big-10      :755  
##  ACC         :697  
##  Big-12      :570  
##  SEC         :374  
##  Pac-12      :336  
##  Mid-American:281  
##  (Other)     :719
  1. Now let’s fix the Attendance and Fill Rate, determine which rows have NA values in them for the Attendance variable and output those row numbers. Now look at those rows and answer why is there no Attendance for those games? ________________(They were cancelled)
which(is.na(cfb$Attendance))
## [1]  662 3040
cfb[c(662, 3040),]
## # A tibble: 2 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <fct> <chr> <chr>    <fct> <chr> <chr> <chr> 
## 1 2013-09-14 00:00:00 Colo… 12:0… Fresno … NR    Fols… P12N  Cance…
## 2 2015-09-03 00:00:00 Tole… 7:00… Stony B… NR    Glas… ESPN3 Cance…
## # … with 16 more variables: Attendance <int>, `Current Wins` <int>,
## #   `Current Losses` <int>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <fct>
  1. Next subset cfb to exclude those rows.
cfb <- cfb[-c(662, 3040),]
  1. Now, let’s deal with the rest of the TMAX and TMIN. This variable is hardest to infer, so let’s drop the NA values now. You will have a dataset that is 3700 observations long.
cfb <- cfb[complete.cases(cfb$TMAX),]
cfb <- cfb[complete.cases(cfb$TMIN),]
  1. Run the summary again on the data to confirm that we removed all the NA values
summary(cfb)
##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Nebraska      : 130   Length:3700       
##  1st Qu.:2004-11-26 18:00:00   Penn State    : 130   Class :character  
##  Median :2009-10-24 00:00:00   Arkansas      : 129   Mode  :character  
##  Mean   :2009-11-21 00:27:14   Michigan State: 129                     
##  3rd Qu.:2014-09-20 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2927                     
##    Opponent              Rank          Site                TV           
##  Length:3700        NR     :2640   Length:3700        Length:3700       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  50                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 804                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3700        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 35986   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53654   Median : 3.000   Median : 2.000  
##                     Mean   : 54633   Mean   : 3.051   Mean   : 2.056  
##                     3rd Qu.: 75579   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                                                                       
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3106   FALSE:2502  
##  1st Qu.: 49225   1st Qu.:0.78399   TRUE : 594   TRUE :1198  
##  Median : 60000   Median :0.94129                            
##  Mean   : 60373   Mean   :0.87926                            
##  3rd Qu.: 80321   3rd Qu.:1.00275                            
##  Max.   :107282   Max.   :1.40399                            
##                                                              
##       PRCP              SNOW              SNWD              TMAX       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 61.00  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median : 72.00  
##  Mean   :0.08396   Mean   :0.01227   Mean   :0.02176   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.30000   Max.   :7.10000   Max.   :111.00  
##                                                                        
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2913   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  51   1st Qu.:12749   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14541   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14569   Mean   :23517  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16333   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##                  (Other): 589                                  
##         Conference 
##  Big-10      :753  
##  ACC         :696  
##  Big-12      :567  
##  SEC         :354  
##  Pac-12      :335  
##  Mid-American:279  
##  (Other)     :716
cfb %>%
  filter(is.na(TMAX))
## # A tibble: 0 x 24
## # … with 24 variables: Date <dttm>, Team <fct>, Time <chr>,
## #   Opponent <chr>, Rank <fct>, Site <chr>, TV <chr>, Result <chr>,
## #   Attendance <int>, `Current Wins` <int>, `Current Losses` <int>,
## #   `Stadium Capacity` <dbl>, `Fill Rate` <dbl>, `New Coach` <fct>,
## #   Tailgating <fct>, PRCP <dbl>, SNOW <dbl>, SNWD <dbl>, TMAX <dbl>,
## #   TMIN <dbl>, Opponent_Rank <fct>, numericDate <dbl>, numericTime <dbl>,
## #   Conference <fct>

Univariate Analysis - exploring variables one by one.

  1. Let’s start by using ggplot to create a histogram of the Fill Rate, names this chart “Histogram of Fill Rate”, set the binwidth = 0.5

  1. Now do the same thing with the Attendance variable, name this chart, “Histogram of Attendance”
ggplot(cfb, aes(x = Attendance)) +
  geom_histogram() +
  ggtitle("Histogram of Attendance")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at these two charts illustrates why standardization is important. The Fill Rate is the standardized version of the Attendance data.

Bi-variate Analysis - exploring how variables relate to one another

Now we are beginning to get into some fun with the data. We will be looking at the apparent relationship between two variables. In this case, we will be looking at the relationship of a couple of the variables against the Fill Rate

  1. Let’s start with the graph that is most demanding computationally first. Create a scatter plot with the following conditions met:
ggplot(cfb, aes(x = Date, y = `Fill Rate`, color = Conference)) +
  geom_point() +
  facet_wrap(~ Team) +
  ggtitle("Fill Rate by Team")

  1. Since 2000, BYU has had 3 new head coaches. Do the following to create a scatter plot:
cfb %>%
  filter(Team == "BYU") %>%
  ggplot( aes(x = Date, y = `Fill Rate`, color = `New Coach`)) +
  geom_point() +
  ggtitle("New Coaches at BYU")

  1. From the scatter plot, we see that something happened in 2011, since then BYU Football attendance has been decreasing every year. Let’s investigate this more and look to see if it’s because BYU is losing more. Make a scatter plot that:
cfb %>%
  filter(Team == "BYU") %>%
  ggplot(aes(x = Date, y = `Current Losses`)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  ggtitle("Losses by Year for BYU Football")

This might help to explain why 2008-2010 have such a full stadium but it doesn’t explain the drop in fullness starting in 2011. This might be a factor also

Outlier treatment - dealing with observations that appear to be too far away from the pattern of the data.

For this assignment, there are no outliers.

Feature engineering - creating new variables to extract more information from the data there are two ways to do this. First, by transforming the scale or relationship of the data. Second, by creating new variables based off the ones that we have.

  1. The feature engineering was done to the data before the assignment started. That being said, we should practice writing functions before moving on. Write a function that outputs the following:
teamMath <- function(data = cfb, team) {
  data %>%
    filter(Team == team) %>%
    summarize(team,SD = sd(Attendance), Avg = mean(Attendance),Upper = Avg + SD, Lower = Avg - SD)
}

#Use your function with BYU as the team, another time with SMU, and another time with Nebraska 
teamMath(cfb, "BYU")
## # A tibble: 1 x 5
##   team     SD    Avg  Upper  Lower
##   <chr> <dbl>  <dbl>  <dbl>  <dbl>
## 1 BYU   4297. 60006. 64303. 55709.
teamMath(cfb, "SMU")
## # A tibble: 1 x 5
##   team     SD    Avg  Upper  Lower
##   <chr> <dbl>  <dbl>  <dbl>  <dbl>
## 1 SMU   8064. 19889. 27953. 11825.
teamMath(cfb, "Nebraska")
## # A tibble: 1 x 5
##   team        SD    Avg  Upper  Lower
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
## 1 Nebraska 5075. 84494. 89569. 79419.

#Part 4. Data Modeling

Now we have reviewed most of what we have learned in R up to this point. Next we will begin what we have learned about correlation and regression in the class. In this section we will be looking at the linear correlation of values, adding best fit lines, statistical models.

First reminder, correlation only good for linear values. You get the correlation by subtracting the mean of the x axis from each value multiplied by the subtracted mean of the y axis from each value. divided by the squared version of each.

Second reminder, the line of best fit, or trendline is one that minimizes the squared distance from all points to it. The average distance of all the points above and below the line should be the same.

Third reminder, the generic statistical model is y = f(x) + noise. Most of what you do in creating models from here on out is determine ways to better fit the response variable or y with the the inputs or x.

  1. Start by taking the correlation of the Fill Rate by the TMAX, earlier we didn’t remove all the NA values from TMAX, use the appropriate argument to disregard the NA values. Look here
cfb %>%
  summarize(Correlation = cor(`Fill Rate`, TMAX, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
##   Correlation
##         <dbl>
## 1      0.0573

.0597 would be a very weak correlation

  1. Let’s now look and see if any of the other numeric value are more correlated to the Fill Rate. Check its correlation against Current Wins, Current Losses, PRCP, SNOW, and TMIN. Make sure to use the correct argument when there are NA values.
cfb %>%
  summarize(CorrWins = cor(`Fill Rate`, `Current Wins`), CorrLosses = cor(`Fill Rate`, `Current Losses`), CorrPRCP = cor(`Fill Rate`, PRCP), CorrSNOW = cor(`Fill Rate`, SNOW), CorrTMIN = cor(`Fill Rate`, TMIN, use = "pairwise.complete.obs"))
## # A tibble: 1 x 5
##   CorrWins CorrLosses CorrPRCP CorrSNOW CorrTMIN
##      <dbl>      <dbl>    <dbl>    <dbl>    <dbl>
## 1    0.136     -0.348  0.00919 -0.00581   0.0355

We have already fitted linear models simply within ggplot, now we want to practice creating the models themselves so that we can learn from them.

  1. Create your own simple linear model
lm1 <- lm(`Fill Rate` ~ TMAX ,data = cfb)
summary(lm1)
## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX, data = cfb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80001 -0.09716  0.06258  0.12531  0.54336 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.8318411  0.0138903  59.886  < 2e-16 ***
## TMAX        0.0006695  0.0001920   3.487 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.173 on 3698 degrees of freedom
## Multiple R-squared:  0.003278,   Adjusted R-squared:  0.003008 
## F-statistic: 12.16 on 1 and 3698 DF,  p-value: 0.0004934
  1. Now we will make things a little more complicated.
lm2 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = cfb)
summary(lm2)
## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + 
##     `Current Wins` + `Current Losses` + Team, data = cfb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52942 -0.05310  0.00461  0.05392  0.57294 
## 
## Coefficients: (10 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.6180455  0.0238387  25.926  < 2e-16 ***
## TMAX                    0.0008184  0.0002603   3.145 0.001677 ** 
## TMIN                   -0.0006619  0.0002848  -2.324 0.020175 *  
## ConferenceACC           0.1893057  0.0153343  12.345  < 2e-16 ***
## ConferenceBig-10        0.1276783  0.0213279   5.986 2.35e-09 ***
## ConferenceBig-12        0.2703458  0.0171358  15.777  < 2e-16 ***
## ConferenceCUSA          0.1361626  0.0190918   7.132 1.19e-12 ***
## ConferenceIndependent   0.1349430  0.0215127   6.273 3.96e-10 ***
## ConferenceMid-American  0.0500580  0.0180696   2.770 0.005629 ** 
## ConferenceMWC           0.0146947  0.0160635   0.915 0.360364    
## ConferencePac-12        0.0728579  0.0152291   4.784 1.79e-06 ***
## ConferenceSEC           0.3280568  0.0150522  21.795  < 2e-16 ***
## TailgatingTRUE          0.2270409  0.0153324  14.808  < 2e-16 ***
## `Current Wins`          0.0070846  0.0010809   6.555 6.36e-11 ***
## `Current Losses`       -0.0181029  0.0011859 -15.265  < 2e-16 ***
## TeamBaylor             -0.1095555  0.0165408  -6.623 4.02e-11 ***
## TeamBoise State         0.2811864  0.0155138  18.125  < 2e-16 ***
## TeamBYU                 0.1738787  0.0215989   8.050 1.11e-15 ***
## TeamClemson            -0.0902762  0.0216874  -4.163 3.22e-05 ***
## TeamColorado            0.2044518  0.0156644  13.052  < 2e-16 ***
## TeamFlorida State       0.1103872  0.0150156   7.352 2.40e-13 ***
## TeamGeorgia Tech        0.0836747  0.0149675   5.590 2.43e-08 ***
## TeamIndiana             0.0022155  0.0209931   0.106 0.915958    
## TeamIowa State          0.0559452  0.0161525   3.464 0.000539 ***
## TeamKansas State        0.0749136  0.0159461   4.698 2.72e-06 ***
## TeamMarshall           -0.0748676  0.0188875  -3.964 7.52e-05 ***
## TeamMichigan State      0.0139486  0.0140184   0.995 0.319795    
## TeamNC State            0.1543451  0.0146479  10.537  < 2e-16 ***
## TeamNebraska            0.0587729  0.0140686   4.178 3.01e-05 ***
## TeamNevada                     NA         NA      NA       NA    
## TeamNorthern Illinois  -0.0179518  0.0182483  -0.984 0.325303    
## TeamNotre Dame                 NA         NA      NA       NA    
## TeamOklahoma           -0.1242826  0.0227609  -5.460 5.07e-08 ***
## TeamOle Miss           -0.2422276  0.0216531 -11.187  < 2e-16 ***
## TeamPenn State         -0.0112770  0.0140222  -0.804 0.421320    
## TeamRutgers             0.0945303  0.0209131   4.520 6.38e-06 ***
## TeamSMU                        NA         NA      NA       NA    
## TeamSyracuse           -0.0233726  0.0148049  -1.579 0.114490    
## TeamTexas A&M          -0.2139975  0.0214652  -9.970  < 2e-16 ***
## TeamToledo              0.1342723  0.0180163   7.453 1.13e-13 ***
## TeamTroy                       NA         NA      NA       NA    
## TeamUCLA                       NA         NA      NA       NA    
## TeamVirginia                   NA         NA      NA       NA    
## TeamWashington                 NA         NA      NA       NA    
## TeamWest Virginia              NA         NA      NA       NA    
## TeamWestern Kentucky           NA         NA      NA       NA    
## TeamWisconsin                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.112 on 3663 degrees of freedom
## Multiple R-squared:  0.5861, Adjusted R-squared:  0.582 
## F-statistic: 144.1 on 36 and 3663 DF,  p-value: < 2.2e-16
  1. Now let’s look and see how well our model is at predicting the Fill Rate for the stadium. I have split the data into a training and validation datasets so that you can predict how good your model is against the data.
# Test and Training set
set.seed(121) #set.seed is included so that everyone gets the same results
testNum <- sample(nrow(cfb), 100) #Creates a vector of row numbers to include
test <- cfb[c(testNum),] #Subsets the data so that it only has the test rows in it
train <- cfb[-c(testNum),] #Takes the test numbers out

# Build Your Model and Predict 
lm3 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = train)
test$pred_lm3 <- predict(lm3, test)
  1. Now let’s see how well we did. Write a for loop that will print the average difference between your prediction and the actual value.
accuracy <- rep(0, nrow(test))
for (i in 1:nrow(test)) {
  accuracy[i] <- abs(test$pred_lm3[i] - test$`Fill Rate`[i])
}
print(mean(accuracy))
## [1] 0.09386146

Well, looks like in the future we will have to improve our model. The predicted Fill Rate is currently off by about %10 for each value.

Now knit the file to an HTML and submit it :)