Week 11: Correlation, regression

Good work so far, you are on your way to becoming full fledged data analysts with R. This week’s course was a primer for when we are getting into Multiple and Logarithmic Regression. In the DataCamp courses we learned about different ways to visualize your data and when they are appropriate.

The data science process has five basic steps:

Obtain Data
- Importing data from source
Clean Data
- Remove Unwanted observations
- Remove Duplicate Observations
- Remove Irrelevant observations
- Fix structural error(typos and capitalization)
- Fix class types
Exploratory Data Analysis
- Variable Identification
- Univariate Analysis
- Bi-variate Analysis
- Missing values treatment
- Outlier treatment
- Variable transformation
- Feature engineering
Data Modeling
- Accomplish Task

This assignment will be the first one that you will go through four of the steps for the data science process. By the end of this assignment you will have a nice report when you knit the Rmarkdown file.

This week you are a Sports Data Analyst for the prestigious Brigham Young University. You have been tasked with examining BYU football attendance compared with other football teams around the country. You will be visualizing data, building linear regression models, and interpreting your results. Good luck and have fun!

#Part 1. Obtaining the Data

This dataset was created for MRKT 585R, the data comes from extracting data from approximately 600 Wikipedia pages for attendance numbers for each football game from 2000-2018, after that data was added, it was bound to the weather data from the National Climate Data Center. Following is a list of variables and description of each, this format will create a table when you knit:

Variable	Description
Date	The Date the Game was played on
Team	Home team of the Football Game
Time	Kickoff Time
Opponent	Away team of the Football Game
Rank	Rank of the Home Team for the AP poll
Site	Location for the game
TV	TV channel that the game was played on
Result	The outcome for the football game
Attendance	How many people attended the game
Current Wins	How many wins the team has leading up to the game
Current Losses	How many losses the team has leading up to the game
Stadium Capacity	How many people fit into the stadium
Fill Rate	Attendance / Stadium Capacity
New Coach	If the team has a first year head coach
Tailgating	If the team is a Top 25 tailgate destination
PRCP	Precipitation
SNOW	Snowfall
SNWD	Snow Depth (Snow on ground)
TMAX	Max Temperature for the day
TMIN	Min Temperature for the day
Opponent_Rank	Rank of the Opponent at the time of the game
Conference	What football conference the team belongs

Import ‘CFB.xlsx’ to RStudio Cloud and print the head: Don’t remember how? Review

cfb <- readxl::read_excel("CFB.xlsx", sheet = 1)

## New names:
## * `` -> ...1

head(cfb)

## # A tibble: 6 x 25
##   ...1  Date                Team  Time  Opponent Rank  Site  TV    Result
##   <chr> <dttm>              <chr> <chr> <chr>    <chr> <chr> <chr> <chr> 
## 1 1     2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2     2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 3     2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 4     2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 5     2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 6     2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <chr>

Before cleaning the data, take a moment and become familiar with the dataset that you are working with. There are a multitude of variables, it would be important to

#Part 2. Cleaning the Data

During this step we will look at the data to see if there are any unwanted observations, duplicate observations, irrelevant observations, structural issues or class type issues.

Let’s start by looking at the class types for the dataset to see if we need to change any of the data types

## Observations: 3,732
## Variables: 25
## $ ...1               <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <chr> "NR", "NR", "NR", "NR", "NR", "NR", "NR", "NR…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins`     <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses`   <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ Tailgating         <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <chr> "NR", "NR", "NR", " 25", "NR", "NR", " 24", "…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <chr> "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SE…

A couple of the variables have issues with the. Here they are and their problems:

...1 - This was the row index, it is unneeded
Team - There are 34 teams in total, therefore the data is categorical
Rank - This data is also categorical
TV - This data is also categorical (We will deal with this later)
Attendance - There are commas that prevent the data from being numeric
Current Wins - Loaded in as characters
Current Losses - Loaded in as characters
New Coach - This data is categorical
Tailgating - This data is categorical
Opponent_Rank - This data is categorical
Conference - This data is categorical

Now let’s try to fix the columns, start by subsetting the data to remove the index (..1), output the head for the new dataset

cfb <- cfb[,-1]
head(cfb)

## # A tibble: 6 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <chr> <chr> <chr>    <chr> <chr> <chr> <chr> 
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <chr>

Next, fix the columns with categorical data. Factor the Team, Rank, New Coach, Tailgating, Opponent_Rank, and Conference columns so that the categorical nature of the columns is respected. Output the structure of the data using glimpse or str to confirm it worked.

Note: In actual analysis, you would want Rank, and Opponent Rank as ordered factors. Also, the New Coach column has a space in it, to type in the column without autofill, you surround the column name with `` symbols

cfb$Team <- as.factor(cfb$Team)
cfb$Rank <- as.factor(cfb$Rank)
cfb$`New Coach` <- as.factor(cfb$`New Coach`)
cfb$Tailgating <- as.factor(cfb$Tailgating)
cfb$Opponent_Rank <- as.factor(cfb$Opponent_Rank)
cfb$Conference <- as.factor(cfb$Conference)

glimpse(cfb)

## Observations: 3,732
## Variables: 24
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins`     <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses`   <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <fct> NR, NR, NR,  25, NR, NR,  24, NR,  8, NR,  9,…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …

Next, we will fix the Attendance column. Use the gsub function to remove the commas in the text.

cfb$Attendance <- gsub(",", "", cfb$Attendance)

head(cfb)

## # A tibble: 6 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <fct> <chr> <chr>    <fct> <chr> <chr> <chr> 
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <fct>

Next the Attendance, Current Wins and Current Losses are both characters right now, make them integers. Output the structure with glimpse or str again.

cfb$Attendance <- as.integer(cfb$Attendance)
cfb$`Current Losses` <- as.integer(cfb$`Current Losses`)
cfb$`Current Wins` <- as.integer(cfb$`Current Wins`)

glimpse(cfb)

## Observations: 3,732
## Variables: 24
## $ Date               <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team               <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time               <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent           <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank               <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site               <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV                 <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result             <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance         <int> 53946, 54286, 51482, 51162, 50947, 49647, 439…
## $ `Current Wins`     <int> 0, 1, 2, 3, 3, 4, 5, 0, 1, 1, 2, 3, 5, 6, 0, …
## $ `Current Losses`   <int> 0, 0, 0, 0, 1, 3, 5, 0, 0, 3, 3, 3, 3, 3, 0, …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate`        <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach`        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP               <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX               <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN               <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank      <fct> NR, NR, NR,  25, NR, NR,  24, NR,  8, NR,  9,…
## $ numericDate        <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime        <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference         <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …

You have now cleaned the data for our College Football Dataset.

#Part 3. Exploratory Data Analysis

Thus far we have imported and cleaned the data that we have for each of the games. Now we begin the exploratory data analysis, this step includes Variable Identification, Univariate Analysis, Bi-variate Analysis, Missing values treatment, Outlier treatment, Variable transformation, and Feature engineering. These are alot of big words, don’t worry though. I am going to explain what each means.

Variable Identification - identifying the input and the output variables, there are many different names for these. Input variables are also predictors, features, and independent variables. Output variable are also the response or dependent variable.

In this assignment, I am going to tell you that the output variable is the Fill Rate unless I specify otherwise. This means that the Fill Rate will be on the y axis for most graphs that we do.

Missing Value Treatment - dealing with missing data in the dataset

Let’s start by identifying which variables might have missing data in it. Use the summary function to identify the NAs in the dataset

summary(cfb)

##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Arkansas      : 130   Length:3732       
##  1st Qu.:2005-06-24 12:00:00   Nebraska      : 130   Class :character  
##  Median :2009-10-31 00:00:00   Penn State    : 130   Mode  :character  
##  Mean   :2009-12-03 10:01:09   Michigan State: 129                     
##  3rd Qu.:2014-09-27 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2958                     
##    Opponent              Rank          Site                TV           
##  Length:3732        NR     :2668   Length:3732        Length:3732       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  51                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 807                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3732        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 36018   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53663   Median : 3.000   Median : 2.000  
##                     Mean   : 54607   Mean   : 3.049   Mean   : 2.061  
##                     3rd Qu.: 75504   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                     NA's   :2                                         
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3129   FALSE:2514  
##  1st Qu.: 49225   1st Qu.:0.78404   TRUE : 603   TRUE :1218  
##  Median : 60000   Median :0.94100                            
##  Mean   : 60354   Mean   :0.87895                            
##  3rd Qu.: 80321   3rd Qu.:1.00255                            
##  Max.   :107282   Max.   :1.40399                            
##                   NA's   :2                                  
##       PRCP              SNOW             SNWD           TMAX       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.: 61.00  
##  Median :0.00000   Median :0.0000   Median :0.00   Median : 72.00  
##  Mean   :0.08403   Mean   :0.0166   Mean   :0.03   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.3000   Max.   :7.10   Max.   :111.00  
##  NA's   :20        NA's   :1005     NA's   :1049   NA's   :29      
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2936   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  53   1st Qu.:12958   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14548   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14581   Mean   :23548  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16340   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##  NA's   :29      (Other): 596                                  
##         Conference 
##  Big-10      :755  
##  ACC         :697  
##  Big-12      :570  
##  SEC         :374  
##  Pac-12      :336  
##  Mid-American:281  
##  (Other)     :719

From the output, we learn that TV has 582 NA’s, Attendance has 2 NA’s, Fill Rate has 2 NA’s, PRCP has 20 NA’s, SNOW and SNWD have over 1000 NA’s, and finally TMAX and TMIN have 29 NA’s.

Let’s start to deal with these, we can safely assume that an NA in TV means that the game was not on TV, the same goes with PRCP, SNOW and SNWD, we can assume that there was no precipitation of any kind. Assign the NA values for the weather with 0. Follow my example.

cfb$TV[is.na(cfb$TV)] <- "Not on Television"          #Example of how to fill na values
cfb$PRCP[is.na(cfb$PRCP)] <- 0
cfb$SNOW[is.na(cfb$SNOW)] <- 0
cfb$SNWD[is.na(cfb$SNWD)] <- 0
summary(cfb)

##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Arkansas      : 130   Length:3732       
##  1st Qu.:2005-06-24 12:00:00   Nebraska      : 130   Class :character  
##  Median :2009-10-31 00:00:00   Penn State    : 130   Mode  :character  
##  Mean   :2009-12-03 10:01:09   Michigan State: 129                     
##  3rd Qu.:2014-09-27 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2958                     
##    Opponent              Rank          Site                TV           
##  Length:3732        NR     :2668   Length:3732        Length:3732       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  51                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 807                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3732        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 36018   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53663   Median : 3.000   Median : 2.000  
##                     Mean   : 54607   Mean   : 3.049   Mean   : 2.061  
##                     3rd Qu.: 75504   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                     NA's   :2                                         
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3129   FALSE:2514  
##  1st Qu.: 49225   1st Qu.:0.78404   TRUE : 603   TRUE :1218  
##  Median : 60000   Median :0.94100                            
##  Mean   : 60354   Mean   :0.87895                            
##  3rd Qu.: 80321   3rd Qu.:1.00255                            
##  Max.   :107282   Max.   :1.40399                            
##                   NA's   :2                                  
##       PRCP              SNOW              SNWD              TMAX       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 61.00  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median : 72.00  
##  Mean   :0.08357   Mean   :0.01217   Mean   :0.02157   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.30000   Max.   :7.10000   Max.   :111.00  
##                                                        NA's   :29      
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2936   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  53   1st Qu.:12958   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14548   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14581   Mean   :23548  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16340   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##  NA's   :29      (Other): 596                                  
##         Conference 
##  Big-10      :755  
##  ACC         :697  
##  Big-12      :570  
##  SEC         :374  
##  Pac-12      :336  
##  Mid-American:281  
##  (Other)     :719

Now let’s fix the Attendance and Fill Rate, determine which rows have NA values in them for the Attendance variable and output those row numbers. Now look at those rows and answer why is there no Attendance for those games? ________________(They were cancelled)

which(is.na(cfb$Attendance))

## [1]  662 3040

cfb[c(662, 3040),]

## # A tibble: 2 x 24
##   Date                Team  Time  Opponent Rank  Site  TV    Result
##   <dttm>              <fct> <chr> <chr>    <fct> <chr> <chr> <chr> 
## 1 2013-09-14 00:00:00 Colo… 12:0… Fresno … NR    Fols… P12N  Cance…
## 2 2015-09-03 00:00:00 Tole… 7:00… Stony B… NR    Glas… ESPN3 Cance…
## # … with 16 more variables: Attendance <int>, `Current Wins` <int>,
## #   `Current Losses` <int>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <fct>

Next subset cfb to exclude those rows.

cfb <- cfb[-c(662, 3040),]

Now, let’s deal with the rest of the TMAX and TMIN. This variable is hardest to infer, so let’s drop the NA values now. You will have a dataset that is 3700 observations long.

cfb <- cfb[complete.cases(cfb$TMAX),]
cfb <- cfb[complete.cases(cfb$TMIN),]

Run the summary again on the data to confirm that we removed all the NA values

summary(cfb)

##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Nebraska      : 130   Length:3700       
##  1st Qu.:2004-11-26 18:00:00   Penn State    : 130   Class :character  
##  Median :2009-10-24 00:00:00   Arkansas      : 129   Mode  :character  
##  Mean   :2009-11-21 00:27:14   Michigan State: 129                     
##  3rd Qu.:2014-09-20 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2927                     
##    Opponent              Rank          Site                TV           
##  Length:3700        NR     :2640   Length:3700        Length:3700       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  50                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 804                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3700        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 35986   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53654   Median : 3.000   Median : 2.000  
##                     Mean   : 54633   Mean   : 3.051   Mean   : 2.056  
##                     3rd Qu.: 75579   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                                                                       
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3106   FALSE:2502  
##  1st Qu.: 49225   1st Qu.:0.78399   TRUE : 594   TRUE :1198  
##  Median : 60000   Median :0.94129                            
##  Mean   : 60373   Mean   :0.87926                            
##  3rd Qu.: 80321   3rd Qu.:1.00275                            
##  Max.   :107282   Max.   :1.40399                            
##                                                              
##       PRCP              SNOW              SNWD              TMAX       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 61.00  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median : 72.00  
##  Mean   :0.08396   Mean   :0.01227   Mean   :0.02176   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.30000   Max.   :7.10000   Max.   :111.00  
##                                                                        
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2913   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  51   1st Qu.:12749   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14541   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14569   Mean   :23517  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16333   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##                  (Other): 589                                  
##         Conference 
##  Big-10      :753  
##  ACC         :696  
##  Big-12      :567  
##  SEC         :354  
##  Pac-12      :335  
##  Mid-American:279  
##  (Other)     :716

cfb %>%
  filter(is.na(TMAX))

## # A tibble: 0 x 24
## # … with 24 variables: Date <dttm>, Team <fct>, Time <chr>,
## #   Opponent <chr>, Rank <fct>, Site <chr>, TV <chr>, Result <chr>,
## #   Attendance <int>, `Current Wins` <int>, `Current Losses` <int>,
## #   `Stadium Capacity` <dbl>, `Fill Rate` <dbl>, `New Coach` <fct>,
## #   Tailgating <fct>, PRCP <dbl>, SNOW <dbl>, SNWD <dbl>, TMAX <dbl>,
## #   TMIN <dbl>, Opponent_Rank <fct>, numericDate <dbl>, numericTime <dbl>,
## #   Conference <fct>

Univariate Analysis - exploring variables one by one.

Let’s start by using ggplot to create a histogram of the Fill Rate, names this chart “Histogram of Fill Rate”, set the binwidth = 0.5

Now do the same thing with the Attendance variable, name this chart, “Histogram of Attendance”

ggplot(cfb, aes(x = Attendance)) +
  geom_histogram() +
  ggtitle("Histogram of Attendance")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at these two charts illustrates why standardization is important. The Fill Rate is the standardized version of the Attendance data.

Bi-variate Analysis - exploring how variables relate to one another

Now we are beginning to get into some fun with the data. We will be looking at the apparent relationship between two variables. In this case, we will be looking at the relationship of a couple of the variables against the Fill Rate

Let’s start with the graph that is most demanding computationally first. Create a scatter plot with the following conditions met:

Shows the Fill Rate over the Date
Each conference is a difference color
Facet wrapped by team
Titled “Fill Rate by Team”

ggplot(cfb, aes(x = Date, y = `Fill Rate`, color = Conference)) +
  geom_point() +
  facet_wrap(~ Team) +
  ggtitle("Fill Rate by Team")

Since 2000, BYU has had 3 new head coaches. Do the following to create a scatter plot:

Filter the teams so that we only have BYU
Create a scatter plot with Fill Rate over the Date, and color set to New Coach
Title it “New Coach Fill Rate at BYU”

cfb %>%
  filter(Team == "BYU") %>%
  ggplot( aes(x = Date, y = `Fill Rate`, color = `New Coach`)) +
  geom_point() +
  ggtitle("New Coaches at BYU")

From the scatter plot, we see that something happened in 2011, since then BYU Football attendance has been decreasing every year. Let’s investigate this more and look to see if it’s because BYU is losing more. Make a scatter plot that:

Filters the team to BYU
Puts Current Losses over Date
Uses geom_smooth to add a trendline without the standard error
Named “Losses by Year for BYU Football”

cfb %>%
  filter(Team == "BYU") %>%
  ggplot(aes(x = Date, y = `Current Losses`)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  ggtitle("Losses by Year for BYU Football")

This might help to explain why 2008-2010 have such a full stadium but it doesn’t explain the drop in fullness starting in 2011. This might be a factor also

Outlier treatment - dealing with observations that appear to be too far away from the pattern of the data.

For this assignment, there are no outliers.

Feature engineering - creating new variables to extract more information from the data there are two ways to do this. First, by transforming the scale or relationship of the data. Second, by creating new variables based off the ones that we have.

The feature engineering was done to the data before the assignment started. That being said, we should practice writing functions before moving on. Write a function that outputs the following:

Filters the data down to one team
The Standard deviation of attendance for one team
The average attendance for one team
The Upper bound (the Average plus standard deviation)
The Lower bound (the Average minus standard deviation)

teamMath <- function(data = cfb, team) {
  data %>%
    filter(Team == team) %>%
    summarize(team,SD = sd(Attendance), Avg = mean(Attendance),Upper = Avg + SD, Lower = Avg - SD)
}

#Use your function with BYU as the team, another time with SMU, and another time with Nebraska 
teamMath(cfb, "BYU")

## # A tibble: 1 x 5
##   team     SD    Avg  Upper  Lower
##   <chr> <dbl>  <dbl>  <dbl>  <dbl>
## 1 BYU   4297. 60006. 64303. 55709.

teamMath(cfb, "SMU")

## # A tibble: 1 x 5
##   team     SD    Avg  Upper  Lower
##   <chr> <dbl>  <dbl>  <dbl>  <dbl>
## 1 SMU   8064. 19889. 27953. 11825.

teamMath(cfb, "Nebraska")

## # A tibble: 1 x 5
##   team        SD    Avg  Upper  Lower
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
## 1 Nebraska 5075. 84494. 89569. 79419.

#Part 4. Data Modeling

Now we have reviewed most of what we have learned in R up to this point. Next we will begin what we have learned about correlation and regression in the class. In this section we will be looking at the linear correlation of values, adding best fit lines, statistical models.

First reminder, correlation only good for linear values. You get the correlation by subtracting the mean of the x axis from each value multiplied by the subtracted mean of the y axis from each value. divided by the squared version of each.

Second reminder, the line of best fit, or trendline is one that minimizes the squared distance from all points to it. The average distance of all the points above and below the line should be the same.

Third reminder, the generic statistical model is y = f(x) + noise. Most of what you do in creating models from here on out is determine ways to better fit the response variable or y with the the inputs or x.

Start by taking the correlation of the Fill Rate by the TMAX, earlier we didn’t remove all the NA values from TMAX, use the appropriate argument to disregard the NA values. Look here

cfb %>%
  summarize(Correlation = cor(`Fill Rate`, TMAX, use = "pairwise.complete.obs"))

## # A tibble: 1 x 1
##   Correlation
##         <dbl>
## 1      0.0573

.0597 would be a very weak correlation

Let’s now look and see if any of the other numeric value are more correlated to the Fill Rate. Check its correlation against Current Wins, Current Losses, PRCP, SNOW, and TMIN. Make sure to use the correct argument when there are NA values.

cfb %>%
  summarize(CorrWins = cor(`Fill Rate`, `Current Wins`), CorrLosses = cor(`Fill Rate`, `Current Losses`), CorrPRCP = cor(`Fill Rate`, PRCP), CorrSNOW = cor(`Fill Rate`, SNOW), CorrTMIN = cor(`Fill Rate`, TMIN, use = "pairwise.complete.obs"))

## # A tibble: 1 x 5
##   CorrWins CorrLosses CorrPRCP CorrSNOW CorrTMIN
##      <dbl>      <dbl>    <dbl>    <dbl>    <dbl>
## 1    0.136     -0.348  0.00919 -0.00581   0.0355

We have already fitted linear models simply within ggplot, now we want to practice creating the models themselves so that we can learn from them.

Create your own simple linear model

Create a linear model for the Fill Rate determined by TMAX
Show the summary for the linear model that you have made
What amount of variance is explained by our model? _______(0.003572)

lm1 <- lm(`Fill Rate` ~ TMAX ,data = cfb)
summary(lm1)

## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX, data = cfb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80001 -0.09716  0.06258  0.12531  0.54336 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.8318411  0.0138903  59.886  < 2e-16 ***
## TMAX        0.0006695  0.0001920   3.487 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.173 on 3698 degrees of freedom
## Multiple R-squared:  0.003278,   Adjusted R-squared:  0.003008 
## F-statistic: 12.16 on 1 and 3698 DF,  p-value: 0.0004934

Now we will make things a little more complicated.

Create a linear model for the Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses
Show the summary of the model
What amount of variance is explained by this model? ________(0.5871)
Which model is better? __________(The second one)

lm2 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = cfb)
summary(lm2)

## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + 
##     `Current Wins` + `Current Losses` + Team, data = cfb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52942 -0.05310  0.00461  0.05392  0.57294 
## 
## Coefficients: (10 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.6180455  0.0238387  25.926  < 2e-16 ***
## TMAX                    0.0008184  0.0002603   3.145 0.001677 ** 
## TMIN                   -0.0006619  0.0002848  -2.324 0.020175 *  
## ConferenceACC           0.1893057  0.0153343  12.345  < 2e-16 ***
## ConferenceBig-10        0.1276783  0.0213279   5.986 2.35e-09 ***
## ConferenceBig-12        0.2703458  0.0171358  15.777  < 2e-16 ***
## ConferenceCUSA          0.1361626  0.0190918   7.132 1.19e-12 ***
## ConferenceIndependent   0.1349430  0.0215127   6.273 3.96e-10 ***
## ConferenceMid-American  0.0500580  0.0180696   2.770 0.005629 ** 
## ConferenceMWC           0.0146947  0.0160635   0.915 0.360364    
## ConferencePac-12        0.0728579  0.0152291   4.784 1.79e-06 ***
## ConferenceSEC           0.3280568  0.0150522  21.795  < 2e-16 ***
## TailgatingTRUE          0.2270409  0.0153324  14.808  < 2e-16 ***
## `Current Wins`          0.0070846  0.0010809   6.555 6.36e-11 ***
## `Current Losses`       -0.0181029  0.0011859 -15.265  < 2e-16 ***
## TeamBaylor             -0.1095555  0.0165408  -6.623 4.02e-11 ***
## TeamBoise State         0.2811864  0.0155138  18.125  < 2e-16 ***
## TeamBYU                 0.1738787  0.0215989   8.050 1.11e-15 ***
## TeamClemson            -0.0902762  0.0216874  -4.163 3.22e-05 ***
## TeamColorado            0.2044518  0.0156644  13.052  < 2e-16 ***
## TeamFlorida State       0.1103872  0.0150156   7.352 2.40e-13 ***
## TeamGeorgia Tech        0.0836747  0.0149675   5.590 2.43e-08 ***
## TeamIndiana             0.0022155  0.0209931   0.106 0.915958    
## TeamIowa State          0.0559452  0.0161525   3.464 0.000539 ***
## TeamKansas State        0.0749136  0.0159461   4.698 2.72e-06 ***
## TeamMarshall           -0.0748676  0.0188875  -3.964 7.52e-05 ***
## TeamMichigan State      0.0139486  0.0140184   0.995 0.319795    
## TeamNC State            0.1543451  0.0146479  10.537  < 2e-16 ***
## TeamNebraska            0.0587729  0.0140686   4.178 3.01e-05 ***
## TeamNevada                     NA         NA      NA       NA    
## TeamNorthern Illinois  -0.0179518  0.0182483  -0.984 0.325303    
## TeamNotre Dame                 NA         NA      NA       NA    
## TeamOklahoma           -0.1242826  0.0227609  -5.460 5.07e-08 ***
## TeamOle Miss           -0.2422276  0.0216531 -11.187  < 2e-16 ***
## TeamPenn State         -0.0112770  0.0140222  -0.804 0.421320    
## TeamRutgers             0.0945303  0.0209131   4.520 6.38e-06 ***
## TeamSMU                        NA         NA      NA       NA    
## TeamSyracuse           -0.0233726  0.0148049  -1.579 0.114490    
## TeamTexas A&M          -0.2139975  0.0214652  -9.970  < 2e-16 ***
## TeamToledo              0.1342723  0.0180163   7.453 1.13e-13 ***
## TeamTroy                       NA         NA      NA       NA    
## TeamUCLA                       NA         NA      NA       NA    
## TeamVirginia                   NA         NA      NA       NA    
## TeamWashington                 NA         NA      NA       NA    
## TeamWest Virginia              NA         NA      NA       NA    
## TeamWestern Kentucky           NA         NA      NA       NA    
## TeamWisconsin                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.112 on 3663 degrees of freedom
## Multiple R-squared:  0.5861, Adjusted R-squared:  0.582 
## F-statistic: 144.1 on 36 and 3663 DF,  p-value: < 2.2e-16

Now let’s look and see how well our model is at predicting the Fill Rate for the stadium. I have split the data into a training and validation datasets so that you can predict how good your model is against the data.

Build a new model on the train dataset
Create a linear model for the Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses
Predict the Fitted values of the test dataset, assign this to a new column of test

# Test and Training set
set.seed(121) #set.seed is included so that everyone gets the same results
testNum <- sample(nrow(cfb), 100) #Creates a vector of row numbers to include
test <- cfb[c(testNum),] #Subsets the data so that it only has the test rows in it
train <- cfb[-c(testNum),] #Takes the test numbers out

# Build Your Model and Predict 
lm3 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = train)
test$pred_lm3 <- predict(lm3, test)

Now let’s see how well we did. Write a for loop that will print the average difference between your prediction and the actual value.

accuracy <- rep(0, nrow(test))
for (i in 1:nrow(test)) {
  accuracy[i] <- abs(test$pred_lm3[i] - test$`Fill Rate`[i])
}
print(mean(accuracy))

## [1] 0.09386146

Well, looks like in the future we will have to improve our model. The predicted Fill Rate is currently off by about %10 for each value.

Now knit the file to an HTML and submit it :)