Week 12 - Correlation and regression

Good work so far! You are on your way to becoming full-fledged data analysts with R. This week’s course is a primer for when we get into Multiple and Logarithmic Regression. In the DataCamp courses we learned about different ways to visualize data and when each is appropriate.

The data science process has five basic steps:

Obtain Data
- Importing data from source
Clean Data
- Remove Unwanted observations
- Remove Duplicate Observations
- Remove Irrelevant observations
- Fix structural errors (typos and capitalization)
- Fix class types
Exploratory Data Analysis
- Variable Identification
- Univariate Analysis
- Bi-variate Analysis
- Missing values treatment
- Outlier treatment
- Variable transformation
- Feature engineering
Data Modeling
- Accomplish Task

This assignment is the first one in which you will go through four of the steps of the data science process. By the end of this assignment you will have a nice report when you knit the Rmarkdown file.

This week you are a Sports Data Analyst for the prestigious Brigham Young University. You have been tasked with examining BYU football attendance compared with other football teams around the country. You will be visualizing data, building linear regression models, and interpreting your results. Good luck and have fun!

#Part 1. Obtaining the Data

This dataset was created for MRKT 585R. Attendance numbers for each football game from 2000-2018 were extracted from approximately 600 Wikipedia pages, and the result was then joined to weather data from the National Climate Data Center. Following is a list of the variables and a description of each; this format will create a table when you knit:

Variable	Description
Date	The Date the Game was played on
Team	Home team of the Football Game
Time	Kickoff Time
Opponent	Away team of the Football Game
Rank	Rank of the Home Team for the AP poll
Site	Location for the game
TV	TV channel that the game was played on
Result	The outcome for the football game
Attendance	How many people attended the game
Current Wins	How many wins the team has leading up to the game
Current Losses	How many losses the team has leading up to the game
Stadium Capacity	How many people fit into the stadium
Fill Rate	Attendance / Stadium Capacity
New Coach	If the team has a first year head coach
Tailgating	If the team is a Top 25 tailgate destination
PRCP	Precipitation
SNOW	Snowfall
SNWD	Snow Depth (Snow on ground)
TMAX	Max Temperature for the day
TMIN	Min Temperature for the day
Opponent_Rank	Rank of the Opponent at the time of the game
Conference	What football conference the team belongs to

Import ‘CFB.xlsx’ to RStudio Cloud and print the head: Don’t remember how? Review

library(readxl)  # provides read_excel()
CFB <- read_excel("CFB.xlsx")

## New names:
## * `` -> ...1

head(CFB)

## # A tibble: 6 x 25
##   ...1  Date                Team  Time  Opponent Rank  Site  TV    Result
##   <chr> <dttm>              <chr> <chr> <chr>    <chr> <chr> <chr> <chr> 
## 1 1     2000-09-02 00:00:00 Arka… 8:00… Southwe… NR    War … <NA>  W 38–0
## 2 2     2000-09-16 00:00:00 Arka… 6:00… Boise S… NR    War … <NA>  W 38–…
## 3 3     2000-09-23 00:00:00 Arka… 8:00… Alabama  NR    Razo… ESPN2 W 28–…
## 4 4     2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR    Razo… JPS   L 7–38
## 5 5     2000-10-07 00:00:00 Arka… 6:00… Louisia… NR    Razo… <NA>  W 52–6
## 6 6     2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR    Razo… <NA>  L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## #   `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## #   `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## #   SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## #   numericDate <dbl>, numericTime <dbl>, Conference <chr>

Before cleaning the data, take a moment to become familiar with the dataset that you are working with; there are a multitude of variables.

#Part 2. Cleaning the Data

During this step we will look at the data to see if there are any unwanted observations, duplicate observations, irrelevant observations, structural issues or class type issues.

Let’s start by looking at the class types for the dataset with summary to see if we need to change any of the data types.

##      ...1                Date                         Team          
##  Length:3732        Min.   :2000-08-31 00:00:00   Length:3732       
##  Class :character   1st Qu.:2005-06-24 12:00:00   Class :character  
##  Mode  :character   Median :2009-10-31 00:00:00   Mode  :character  
##                     Mean   :2009-12-03 10:01:09                     
##                     3rd Qu.:2014-09-27 00:00:00                     
##                     Max.   :2018-12-01 00:00:00                     
##                                                                     
##      Time             Opponent             Rank          
##  Length:3732        Length:3732        Length:3732       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Site                TV               Result         
##  Length:3732        Length:3732        Length:3732       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Attendance        Current Wins       Current Losses     Stadium Capacity
##  Length:3732        Length:3732        Length:3732        Min.   : 22113  
##  Class :character   Class :character   Class :character   1st Qu.: 49225  
##  Mode  :character   Mode  :character   Mode  :character   Median : 60000  
##                                                           Mean   : 60354  
##                                                           3rd Qu.: 80321  
##                                                           Max.   :107282  
##                                                                           
##    Fill Rate        New Coach          Tailgating             PRCP        
##  Min.   :0.08137   Length:3732        Length:3732        Min.   :0.00000  
##  1st Qu.:0.78404   Class :character   Class :character   1st Qu.:0.00000  
##  Median :0.94100   Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :0.87895                                         Mean   :0.08403  
##  3rd Qu.:1.00255                                         3rd Qu.:0.01000  
##  Max.   :1.40399                                         Max.   :6.45000  
##  NA's   :2                                               NA's   :20       
##       SNOW             SNWD           TMAX             TMIN      
##  Min.   :0.0000   Min.   :0.00   Min.   : 19.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.: 61.00   1st Qu.:38.00  
##  Median :0.0000   Median :0.00   Median : 72.00   Median :49.00  
##  Mean   :0.0166   Mean   :0.03   Mean   : 70.82   Mean   :48.51  
##  3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.: 82.00   3rd Qu.:59.00  
##  Max.   :5.3000   Max.   :7.10   Max.   :111.00   Max.   :81.00  
##  NA's   :1005     NA's   :1049   NA's   :29       NA's   :29     
##  Opponent_Rank       numericDate     numericTime     Conference       
##  Length:3732        Min.   :11200   Min.   : 3600   Length:3732       
##  Class :character   1st Qu.:12958   1st Qu.:12600   Class :character  
##  Mode  :character   Median :14548   Median :21600   Mode  :character  
##                     Mean   :14581   Mean   :23548                     
##                     3rd Qu.:16340   3rd Qu.:39600                     
##                     Max.   :17866   Max.   :45000                     
##

Several of the variables have issues with their class types. Here they are, along with their problems:

...1 - This was the row index, it is unneeded
Team - There are 33 teams in total, therefore the data is categorical
Rank - This data is also categorical
TV - This data is also categorical (We will deal with this later)
Attendance - There are commas that prevent the data from being numeric
Current Wins - Loaded in as characters
Current Losses - Loaded in as characters
New Coach - This data is categorical
Tailgating - This data is categorical
Opponent_Rank - This data is categorical
Conference - This data is categorical

Now let’s try to fix the columns. Start by subsetting the data to remove the index (...1), then output the head of the new dataset.

library(dplyr)  # provides select(), filter(), summarize() and %>%
CFB <- select(CFB, -c(...1))

Next, fix the columns with categorical data. Factor the Team, Rank, New Coach, Tailgating, Opponent_Rank, and Conference columns so that the categorical nature of the columns is respected. Output the structure of the data using glimpse or str to confirm it worked.

Note: In actual analysis, you would want Rank and Opponent_Rank as ordered factors. Also, the New Coach column has a space in its name; to type the column name without autofill, surround it with backticks (`).

CFB$Team <- as.factor(CFB$Team)
CFB$Rank <- as.factor(CFB$Rank)
CFB$`New Coach` <- as.factor(CFB$`New Coach`)
CFB$Tailgating <- as.factor(CFB$Tailgating)
CFB$Opponent_Rank <- as.factor(CFB$Opponent_Rank)
CFB$Conference <- as.factor(CFB$Conference)

str(CFB)

## Classes 'tbl_df', 'tbl' and 'data.frame':    3732 obs. of  24 variables:
##  $ Date            : POSIXct, format: "2000-09-02" "2000-09-16" ...
##  $ Team            : Factor w/ 33 levels "Arkansas","Baylor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time            : chr  "8:00 pm" "6:00 pm" "8:00 pm" "11:30 pm" ...
##  $ Opponent        : chr  "Southwest Missouri State*" "Boise State*" "Alabama" "No. 25 Georgia" ...
##  $ Rank            : Factor w/ 29 levels "1","1(3)","10",..: 29 29 29 29 29 29 29 29 29 29 ...
##  $ Site            : chr  "War Memorial StadiumLittle Rock, AR" "War Memorial StadiumLittle Rock, AR" "Razorback StadiumFayetteville, AR" "Razorback StadiumFayetteville, AR" ...
##  $ TV              : chr  NA NA "ESPN2" "JPS" ...
##  $ Result          : chr  "W 38–0" "W 38–31" "W 28–21" "L 7–38" ...
##  $ Attendance      : chr  "53,946" "54,286" "51,482" "51,162" ...
##  $ Current Wins    : chr  "0" "1" "2" "3" ...
##  $ Current Losses  : chr  "0" "0" "0" "0" ...
##  $ Stadium Capacity: num  53727 53727 50019 50019 50019 ...
##  $ Fill Rate       : num  1 1.01 1.03 1.02 1.02 ...
##  $ New Coach       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tailgating      : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRCP            : num  0 0 2.12 0 0 0.11 0.94 0 NA 0 ...
##  $ SNOW            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SNWD            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TMAX            : num  105 79 85 77 50 55 49 84 NA 62 ...
##  $ TMIN            : num  65 44 63 45 28 49 43 63 NA 31 ...
##  $ Opponent_Rank   : Factor w/ 32 levels " "," 1"," 1.",..: 32 32 32 21 32 32 20 32 27 32 ...
##  $ numericDate     : num  11202 11216 11223 11230 11237 ...
##  $ numericTime     : num  28800 21600 28800 41400 21600 7200 5400 23400 28800 21600 ...
##  $ Conference      : Factor w/ 10 levels "AAC","ACC","Big-10",..: 10 10 10 10 10 10 10 10 10 10 ...
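The note above about ordered factors can be sketched on a toy vector. The values below are made up for illustration and are not read from CFB.xlsx, and the level order (1 is best, NR comes last) is an assumption about how you would want ranks sorted. Values such as "1(3)" that appear in the str output would need cleaning first, otherwise they turn into NA.

```r
# Toy ranks (hypothetical values, not from CFB.xlsx)
rank_chr <- c("NR", "10", "2", "1", "25")

# as.factor() sorts the levels as text, so "10" lands before "2"
levels(as.factor(rank_chr))

# An ordered factor with explicit levels keeps the ranking order
rank_levels <- c(as.character(1:25), "NR")
rank_ord <- factor(rank_chr, levels = rank_levels, ordered = TRUE)

# Comparisons and sorting now follow the ranking: "2" is ahead of "10"
rank_ord[3] < rank_ord[2]
sort(rank_ord)
```

With an ordered factor, plots and sorted tables list the ranks from 1 to 25 and then NR, instead of in alphabetical order.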

Next, we will fix the Attendance column. Use the gsub function to remove the commas in the text.

CFB$Attendance <- gsub(",", "", CFB$Attendance)

Next, Attendance, Current Wins and Current Losses are all characters right now; make them integers. Output the structure with glimpse or str again.

CFB$Attendance <- as.integer(CFB$Attendance)
CFB$`Current Losses` <- as.integer(CFB$`Current Losses`)
CFB$`Current Wins`<- as.integer(CFB$`Current Wins`)

str(CFB)

## Classes 'tbl_df', 'tbl' and 'data.frame':    3732 obs. of  24 variables:
##  $ Date            : POSIXct, format: "2000-09-02" "2000-09-16" ...
##  $ Team            : Factor w/ 33 levels "Arkansas","Baylor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time            : chr  "8:00 pm" "6:00 pm" "8:00 pm" "11:30 pm" ...
##  $ Opponent        : chr  "Southwest Missouri State*" "Boise State*" "Alabama" "No. 25 Georgia" ...
##  $ Rank            : Factor w/ 29 levels "1","1(3)","10",..: 29 29 29 29 29 29 29 29 29 29 ...
##  $ Site            : chr  "War Memorial StadiumLittle Rock, AR" "War Memorial StadiumLittle Rock, AR" "Razorback StadiumFayetteville, AR" "Razorback StadiumFayetteville, AR" ...
##  $ TV              : chr  NA NA "ESPN2" "JPS" ...
##  $ Result          : chr  "W 38–0" "W 38–31" "W 28–21" "L 7–38" ...
##  $ Attendance      : int  53946 54286 51482 51162 50947 49647 43982 52213 70470 52683 ...
##  $ Current Wins    : int  0 1 2 3 3 4 5 0 1 1 ...
##  $ Current Losses  : int  0 0 0 0 1 3 5 0 0 3 ...
##  $ Stadium Capacity: num  53727 53727 50019 50019 50019 ...
##  $ Fill Rate       : num  1 1.01 1.03 1.02 1.02 ...
##  $ New Coach       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tailgating      : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRCP            : num  0 0 2.12 0 0 0.11 0.94 0 NA 0 ...
##  $ SNOW            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SNWD            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TMAX            : num  105 79 85 77 50 55 49 84 NA 62 ...
##  $ TMIN            : num  65 44 63 45 28 49 43 63 NA 31 ...
##  $ Opponent_Rank   : Factor w/ 32 levels " "," 1"," 1.",..: 32 32 32 21 32 32 20 32 27 32 ...
##  $ numericDate     : num  11202 11216 11223 11230 11237 ...
##  $ numericTime     : num  28800 21600 28800 41400 21600 7200 5400 23400 28800 21600 ...
##  $ Conference      : Factor w/ 10 levels "AAC","ACC","Big-10",..: 10 10 10 10 10 10 10 10 10 10 ...
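To see why the commas had to be removed before the conversion, here is a minimal sketch on a toy character vector (the strings are illustrative and the variable names are hypothetical).

```r
# Toy attendance strings (illustrative values)
attendance_chr <- c("53,946", "4,513", "110,889")

# Converting directly fails: the comma makes the text non-numeric, so R returns NA
suppressWarnings(as.integer(attendance_chr))

# Strip the commas first, then convert
attendance_int <- as.integer(gsub(",", "", attendance_chr, fixed = TRUE))
attendance_int
```

The fixed = TRUE argument treats the comma as plain text rather than as a regular expression; for a comma the result is the same either way.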

You have now cleaned the data for our College Football Dataset.

#Part 3. Exploratory Data Analysis

Thus far we have imported and cleaned the data that we have for each of the games. Now we begin the exploratory data analysis. This step includes Variable Identification, Univariate Analysis, Bi-variate Analysis, Missing values treatment, Outlier treatment, Variable transformation, and Feature engineering. These are a lot of big words, but don’t worry; I am going to explain what each means.

Variable Identification - identifying the input and the output variables. There are many different names for these: input variables are also called predictors, features, or independent variables, and the output variable is also called the response or dependent variable.

In this assignment, I am going to tell you that the output variable is the Fill Rate unless I specify otherwise. This means that the Fill Rate will be on the y axis for most graphs that we do.

Missing Value Treatment - dealing with missing data in the dataset

Let’s start by identifying which variables have missing data. Use the summary function to identify the NAs in the dataset.

summary(CFB)

##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Arkansas      : 130   Length:3732       
##  1st Qu.:2005-06-24 12:00:00   Nebraska      : 130   Class :character  
##  Median :2009-10-31 00:00:00   Penn State    : 130   Mode  :character  
##  Mean   :2009-12-03 10:01:09   Michigan State: 129                     
##  3rd Qu.:2014-09-27 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2958                     
##    Opponent              Rank          Site                TV           
##  Length:3732        NR     :2668   Length:3732        Length:3732       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  51                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 807                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3732        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 36018   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53663   Median : 3.000   Median : 2.000  
##                     Mean   : 54607   Mean   : 3.049   Mean   : 2.061  
##                     3rd Qu.: 75504   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                     NA's   :2                                         
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3129   FALSE:2514  
##  1st Qu.: 49225   1st Qu.:0.78404   TRUE : 603   TRUE :1218  
##  Median : 60000   Median :0.94100                            
##  Mean   : 60354   Mean   :0.87895                            
##  3rd Qu.: 80321   3rd Qu.:1.00255                            
##  Max.   :107282   Max.   :1.40399                            
##                   NA's   :2                                  
##       PRCP              SNOW             SNWD           TMAX       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.: 61.00  
##  Median :0.00000   Median :0.0000   Median :0.00   Median : 72.00  
##  Mean   :0.08403   Mean   :0.0166   Mean   :0.03   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.3000   Max.   :7.10   Max.   :111.00  
##  NA's   :20        NA's   :1005     NA's   :1049   NA's   :29      
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2936   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  53   1st Qu.:12958   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14548   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14581   Mean   :23548  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16340   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##  NA's   :29      (Other): 596                                  
##         Conference 
##  Big-10      :755  
##  ACC         :697  
##  Big-12      :570  
##  SEC         :374  
##  Pac-12      :336  
##  Mid-American:281  
##  (Other)     :719

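From the output, we learn that Attendance has 2 NA’s, Fill Rate has 2 NA’s, PRCP has 20 NA’s, SNOW and SNWD have over 1000 NA’s each, and TMAX and TMIN have 29 NA’s each. TV has 582 NA’s, but summary does not report NA’s for character columns, so that count has to be found another way, such as sum(is.na(CFB$TV)).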
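Because summary reports NA counts only for numeric and factor columns, a per-column count that works for every type can be handy. The sketch below uses a small made-up data frame (the name games and its values are hypothetical), not the real CFB data.

```r
# Toy data frame (hypothetical values) with a character and a numeric column
games <- data.frame(
  TV   = c("ESPN2", NA, NA, "JPS"),
  PRCP = c(0, 2.12, NA, 0),
  stringsAsFactors = FALSE
)

# is.na() marks every missing cell; colSums() adds the marks up per column
na_counts <- colSums(is.na(games))
na_counts
```

Run against CFB, the same call would list the NA count of every column in one named vector.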

Let’s start to deal with these. We can safely assume that an NA in TV means that the game was not on TV. The same goes for PRCP, SNOW and SNWD: we can assume that there was no precipitation of any kind. Replace the NA values for the weather variables with 0. Follow my example.

# Example of how to fill NA values

CFB$TV[is.na(CFB$TV)] <- "Not on Televison"
CFB$PRCP[is.na(CFB$PRCP)] <- 0
CFB$SNOW[is.na(CFB$SNOW)] <- 0
CFB$SNWD[is.na(CFB$SNWD)] <- 0

Now let’s fix Attendance and Fill Rate. Determine which rows have NA values for the Attendance variable and output those row numbers. Now look at those rows and answer: why is there no Attendance for those games? ______The game was canceled__________

AttendanceNA <- CFB[is.na(CFB$Attendance), ]  # the rows themselves, for inspection
which(is.na(CFB$Attendance))                  # their row numbers

## [1]  662 3040

Next, subset CFB to exclude those rows.

CFB <- CFB[-c(662,3040),]

Now, let’s deal with the remaining NA values in TMAX and TMIN. These variables are the hardest to infer, so let’s drop the rows with NA values now. You will have a dataset that is 3700 observations long.

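# na.omit() on a data frame has no `cols` argument; it drops every row that has an NA in any column.
# After the fills above, this removes the remaining incomplete rows, including those missing TMAX or TMIN.
CFB <- na.omit(CFB)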
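It helps to know exactly what na.omit does on a data frame: it takes no column argument and removes any row with an NA anywhere. The sketch below, on a made-up data frame (the name games and its values are hypothetical), contrasts that with a check restricted to the two temperature columns using complete.cases from base R.

```r
# Toy data (hypothetical values): row B has an NA only in TV
games <- data.frame(
  Team = c("A", "B", "C", "D"),
  TV   = c("ESPN", NA, "ABC", "FOX"),
  TMAX = c(80, 75, NA, 60),
  TMIN = c(60, 50, 40, NA),
  stringsAsFactors = FALSE
)

# na.omit() does not care which column holds the NA: rows B, C and D all go
all_complete <- na.omit(games)
nrow(all_complete)

# Restricting the check to TMAX and TMIN keeps row B
temp_complete <- games[complete.cases(games[, c("TMAX", "TMIN")]), ]
temp_complete$Team
```

The two approaches give the same result only when the temperature columns are the last ones that still contain NA values.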

Run the summary again on the data to confirm that we removed all the NA values

summary(CFB)

##       Date                                 Team          Time          
##  Min.   :2000-08-31 00:00:00   Nebraska      : 130   Length:3700       
##  1st Qu.:2004-11-26 18:00:00   Penn State    : 130   Class :character  
##  Median :2009-10-24 00:00:00   Arkansas      : 129   Mode  :character  
##  Mean   :2009-11-21 00:27:14   Michigan State: 129                     
##  3rd Qu.:2014-09-20 00:00:00   Kansas State  : 128                     
##  Max.   :2018-12-01 00:00:00   Texas A&M     : 127                     
##                                (Other)       :2927                     
##    Opponent              Rank          Site                TV           
##  Length:3700        NR     :2640   Length:3700        Length:3700       
##  Class :character   19     :  59   Class :character   Class :character  
##  Mode  :character   23     :  51   Mode  :character   Mode  :character  
##                     24     :  50                                        
##                     11     :  49                                        
##                     10     :  47                                        
##                     (Other): 804                                        
##     Result            Attendance      Current Wins    Current Losses  
##  Length:3700        Min.   :  4513   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 35986   1st Qu.: 1.000   1st Qu.: 0.000  
##  Mode  :character   Median : 53654   Median : 3.000   Median : 2.000  
##                     Mean   : 54633   Mean   : 3.051   Mean   : 2.056  
##                     3rd Qu.: 75579   3rd Qu.: 5.000   3rd Qu.: 3.000  
##                     Max.   :110889   Max.   :12.000   Max.   :11.000  
##                                                                       
##  Stadium Capacity   Fill Rate       New Coach    Tailgating  
##  Min.   : 22113   Min.   :0.08137   FALSE:3106   FALSE:2502  
##  1st Qu.: 49225   1st Qu.:0.78399   TRUE : 594   TRUE :1198  
##  Median : 60000   Median :0.94129                            
##  Mean   : 60373   Mean   :0.87926                            
##  3rd Qu.: 80321   3rd Qu.:1.00275                            
##  Max.   :107282   Max.   :1.40399                            
##                                                              
##       PRCP              SNOW              SNWD              TMAX       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 19.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 61.00  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median : 72.00  
##  Mean   :0.08396   Mean   :0.01227   Mean   :0.02176   Mean   : 70.82  
##  3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 82.00  
##  Max.   :6.45000   Max.   :5.30000   Max.   :7.10000   Max.   :111.00  
##                                                                        
##       TMIN       Opponent_Rank   numericDate     numericTime   
##  Min.   : 0.00   NR     :2913   Min.   :11200   Min.   : 3600  
##  1st Qu.:38.00    1     :  51   1st Qu.:12749   1st Qu.:12600  
##  Median :49.00    2     :  42   Median :14541   Median :21600  
##  Mean   :48.51    13    :  39   Mean   :14569   Mean   :23517  
##  3rd Qu.:59.00    10    :  34   3rd Qu.:16333   3rd Qu.:39600  
##  Max.   :81.00    3     :  32   Max.   :17866   Max.   :45000  
##                  (Other): 589                                  
##         Conference 
##  Big-10      :753  
##  ACC         :696  
##  Big-12      :567  
##  SEC         :354  
##  Pac-12      :335  
##  Mid-American:279  
##  (Other)     :716

Univariate Analysis - exploring variables one by one.

Let’s start by using ggplot to create a histogram of the Fill Rate. Name this chart “Histogram of Fill Rate” and set the binwidth = 0.05.

Now do the same thing with the Attendance variable. Name this chart “Histogram of Attendance” and do not set a binwidth.

library(ggplot2)  # provides ggplot() and the geom_*() layers
ggplot(CFB, aes(x = Attendance)) + geom_histogram() + ggtitle("Histogram of Attendance")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Comparing these two charts illustrates why standardization is important: the Fill Rate is the Attendance divided by the Stadium Capacity, which puts stadiums of different sizes on the same scale.
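A quick arithmetic sketch of that idea, using two hypothetical stadiums (the numbers are invented for illustration):

```r
# Two hypothetical stadiums with very different capacities
attendance <- c(big = 90000, small = 27000)
capacity   <- c(big = 100000, small = 30000)

# Raw attendance differs by 63,000 people...
attendance[["big"]] - attendance[["small"]]

# ...but both stadiums are equally full once we divide by capacity
fill_rate <- attendance / capacity
fill_rate
```

On the raw scale the small stadium looks like a poor draw; on the fill rate scale the two crowds are directly comparable.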

Bi-variate Analysis - exploring how variables relate to one another

Now we are beginning to get into some fun with the data. We will be looking at the apparent relationship between two variables. In this case, we will be looking at the relationship of a couple of the variables against the Fill Rate

Let’s start with the graph that is the most computationally demanding. Create a scatter plot with the following conditions met:

Shows the Fill Rate over the Date
Each conference is a different color
Facet wrapped by team
Titled “Fill Rate by Team”

# Fill Rate is the output variable, so it goes on the y axis with Date on the x axis
ggplot(CFB, aes(x = Date, y = `Fill Rate`, color = Conference)) + geom_point(size = 0.1) + ggtitle("Fill Rate by Team") + facet_wrap(~Team)

Since 2000, BYU has had 3 new head coaches. Do the following to create a scatter plot:

Filter the teams so that we only have BYU
Create a scatter plot with Fill Rate over the Date, and color set to New Coach
Title it “New Coach Fill Rate at BYU”

byu <- CFB %>% filter(Team == "BYU")

ggplot(byu, aes(x = Date, y = `Fill Rate`, color = `New Coach`)) + geom_point() + ggtitle("New Coach Fill Rate at BYU")

From the scatter plot, we see that something happened in 2011; since then, BYU Football attendance has been decreasing every year. Let’s investigate this further and see if it’s because BYU is losing more. Make a scatter plot that:

Filters the team to BYU
Puts Current Losses over Date
Uses geom_smooth to add a linear trendline without the standard error
Named “Losses by Year for BYU Football”

# Current Losses over Date: losses on the y axis, with a linear trendline and no standard error band
ggplot(byu, aes(x = Date, y = `Current Losses`)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + ggtitle("Losses by Year for BYU Football")

This might help to explain why 2008-2010 have such a full stadium, but it doesn’t explain the drop in fullness starting in 2011. Losses may be one factor, but they are not the whole story.

Outlier treatment - dealing with observations that appear to be too far away from the pattern of the data.

For this assignment, there are no outliers.

Feature engineering - creating new variables to extract more information from the data. There are two ways to do this. First, by transforming the scale or relationship of the data. Second, by creating new variables based off the ones that we have.

The feature engineering was done to the data before the assignment started. That being said, we should practice writing functions before moving on. Write a function that takes a data argument (defaulting to CFB) and a team argument and does the following (Hint: the function should look fairly tidy when done, if you know what I mean):

Filters the data down to one team
The Standard deviation of attendance for one team
The average attendance for one team
The Upper bound (the Average plus standard deviation)
The Lower bound (the Average minus standard deviation)

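engineer <- function(df = CFB, team) {
  # Keep only the games for the requested team
  team_games <- df %>% filter(Team == team)
  Dev <- sd(team_games$Attendance)    # standard deviation of attendance
  Avg <- mean(team_games$Attendance)  # average attendance
  Avgplus <- Avg + Dev                # upper bound
  Avgminus <- Avg - Dev               # lower bound
  print(Dev)
  print(Avg)
  print(Avgplus)
  print(Avgminus)
}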




#Use your function with BYU as the team, another time with SMU, and another time with Nebraska 


engineer(CFB, "BYU")

## [1] 4296.921
## [1] 60006.33
## [1] 64303.25
## [1] 55709.41

engineer(CFB, "SMU")

## [1] 8063.892
## [1] 19889.25
## [1] 27953.14
## [1] 11825.36

engineer(CFB, "Nebraska")

## [1] 5075.05
## [1] 84493.94
## [1] 89568.99
## [1] 79418.89
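Printing four unlabeled numbers makes the output hard to read. One possible variation, shown here only as a sketch, returns a named vector instead. The function name attendance_bounds and the toy data are hypothetical, and base R subsetting stands in for the dplyr filter so the example runs without extra packages.

```r
# Hypothetical alternative to engineer(): returns a named vector instead of printing
attendance_bounds <- function(df, team) {
  att <- df$Attendance[df$Team == team]  # attendance for one team only
  avg <- mean(att)
  dev <- sd(att)
  c(sd = dev, mean = avg, upper = avg + dev, lower = avg - dev)
}

# Toy data (made-up numbers, not from CFB.xlsx)
toy <- data.frame(
  Team       = c("A", "A", "A", "B", "B"),
  Attendance = c(100, 110, 120, 50, 70)
)

bounds_a <- attendance_bounds(toy, "A")
bounds_a
```

Returning the values also lets you store the result and reuse the bounds later, for example to flag games outside one standard deviation.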

#Part 4. Data Modeling

We have now reviewed most of what we have learned in R up to this point. Next, we will start on what we have learned about correlation and regression in this week’s DataCamp. In this section, we will practice finding linear correlations between variables, adding best fit lines to data, and creating statistical models.

First, a reminder: correlation only measures linear relationships! To find the correlation, subtract the mean of x from each x value and the mean of y from each y value, multiply the paired differences, and add them up. Then divide that sum by the square root of the product of the summed squared x differences and the summed squared y differences. [For those interested: https://slideplayer.com/slide/6992631/]
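The calculation described above can be checked by hand on a tiny made-up dataset. The vectors x and y below are illustrative only; cor is the base R function that does the same computation.

```r
# Toy data (made-up values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Differences from the mean for each variable
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of paired products over the square root of the product of summed squares
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_by_hand   # about 0.775

# The built-in function gives the same value
cor(x, y)
```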

Second reminder: the line of best fit, or trendline, is the line that minimizes the sum of the squared vertical distances from the points to the line. The distances of the points above the line and the distances of the points below it add up to the same total, so the residuals sum to zero.
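Both properties of the best fit line can be verified on a small made-up dataset. In this sketch the helper sse is hypothetical, and the closed-form slope (covariance of x and y divided by the variance of x) is the standard least squares result for one predictor.

```r
# Toy data (made-up values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Fit the line with lm() and with the closed-form least squares formulas
fit <- lm(y ~ x)
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
coef(fit)

# Residuals above and below the line cancel out (zero up to rounding)
sum(residuals(fit))

# Hypothetical helper: sum of squared errors for any candidate line
sse <- function(a, b) sum((y - (a + b * x))^2)

# Nudging the slope away from the fitted value makes the squared error larger
sse(intercept, slope) < sse(intercept, slope + 0.1)
```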

Third reminder: the generic statistical model is y = f(x) + noise. Most of what you will do in creating models from here on out is determining ways to better fit the response variable (y) with the inputs (x).

Start by finding the correlation of the Fill Rate to the TMAX. Assume that earlier we didn’t remove all the NA values from TMAX, so please use the appropriate argument to disregard the NA values. Look here for the format we’re looking for.

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `TMAX`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N      r
##   <int>  <dbl>
## 1  3700 0.0573

Remember that .99 would be a very strong linear correlation and .01 would be a very weak correlation. The 0.0573 found here is also a weak correlation.
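The use argument matters as soon as a column contains an NA. A minimal sketch with made-up vectors (the names fill_rate and tmax are hypothetical stand-ins for the CFB columns):

```r
# Toy data (made-up values) with one missing temperature
fill_rate <- c(0.90, 0.95, 1.00, 0.80, 0.85)
tmax      <- c(70, NA, 85, 60, 65)

# By default a single NA makes the whole correlation NA
cor(fill_rate, tmax)

# With use = "pairwise.complete.obs" the incomplete pair is skipped
r <- cor(fill_rate, tmax, use = "pairwise.complete.obs")
r
```

Skipping the pair gives the same answer as removing the second observation from both vectors by hand.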

Let’s now look and see if any of the other numeric values are more correlated with the Fill Rate. Check its correlation against Current Wins, Current Losses, PRCP, SNOW, and TMIN. Make sure to use the correct argument when there are NA values.

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `Current Wins`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N     r
##   <int> <dbl>
## 1  3700 0.136

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `Current Losses`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N      r
##   <int>  <dbl>
## 1  3700 -0.348

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `PRCP`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N       r
##   <int>   <dbl>
## 1  3700 0.00919

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `SNOW`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N        r
##   <int>    <dbl>
## 1  3700 -0.00581

CFB%>%
  summarize(N = n(), r = cor(`Fill Rate`, `TMIN`, use = "pairwise.complete.obs"))

## # A tibble: 1 x 2
##       N      r
##   <int>  <dbl>
## 1  3700 0.0355
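Repeating the same call once per variable works, but the checks can also be run in one pass. This sketch uses base R sapply on a made-up data frame; the column names fill, wins and losses are hypothetical stand-ins for the CFB columns.

```r
# Toy data (made-up values)
games <- data.frame(
  fill   = c(0.90, 0.95, 1.00, 0.80, 0.85, 0.70),
  wins   = c(3, 4, 5, 2, 2, 1),
  losses = c(1, 0, 0, 3, 2, 4)
)

# Correlate the output variable with each predictor in turn
predictors <- c("wins", "losses")
r_values <- sapply(games[predictors], function(col) {
  cor(games$fill, col, use = "pairwise.complete.obs")
})
r_values
```

The result is a named vector, which makes it easy to see at a glance which predictor has the strongest linear relationship with the output.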

We have already fitted linear models simply within ggplot, now we want to practice creating the models themselves so that we can learn from them.

Create your own simple linear model

Create a linear model for the Fill Rate determined by TMAX
Show the summary for the linear model that you have made
What amount of variance is explained by our model? 0.003278

lm1 <- lm(`Fill Rate` ~ TMAX, data = CFB)  # lm1 avoids reusing the name of the lm() function
summary(lm1)

## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX, data = CFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80001 -0.09716  0.06258  0.12531  0.54336 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.8318411  0.0138903  59.886  < 2e-16 ***
## TMAX        0.0006695  0.0001920   3.487 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.173 on 3698 degrees of freedom
## Multiple R-squared:  0.003278,   Adjusted R-squared:  0.003008 
## F-statistic: 12.16 on 1 and 3698 DF,  p-value: 0.0004934
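With a single predictor, the R-squared of the model is the square of the correlation between the two variables. That is why the 0.003278 here is so close to the square of the correlation found earlier for Fill Rate and TMAX (0.0573 squared is roughly 0.0033; the small gap comes from rounding in the printed r). A sketch on made-up data:

```r
# Toy data (made-up values)
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, 4)

# R-squared reported by the simple linear model
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
r_squared

# Squaring the correlation gives the same number
cor(x, y)^2
```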

Now we will make things a little more complicated.

Create a linear model for the Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses
Show the summary of the model
What amount of variance is explained by this model? 0.5861
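Which model is better? The linear model with multiple variables is better because it explains more of the variance. Since R-squared never decreases when predictors are added, compare the adjusted R-squared values to confirm.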

lm2 <- lm(`Fill Rate` ~ `TMAX` + `Team` + `TMIN` + `Conference` + `Tailgating` + `Current Wins` + `Current Losses`, data = CFB)
summary(lm2)

## 
## Call:
## lm(formula = `Fill Rate` ~ TMAX + Team + TMIN + Conference + 
##     Tailgating + `Current Wins` + `Current Losses`, data = CFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52942 -0.05310  0.00461  0.05392  0.57294 
## 
## Coefficients: (10 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.9461023  0.0212643  44.493  < 2e-16 ***
## TMAX                    0.0008184  0.0002603   3.145 0.001677 ** 
## TeamBaylor             -0.1672665  0.0144763 -11.555  < 2e-16 ***
## TeamBoise State        -0.0321757  0.0145398  -2.213 0.026964 *  
## TeamBYU                -0.0192351  0.0145047  -1.326 0.184881    
## TeamClemson            -0.0019864  0.0145751  -0.136 0.891604    
## TeamColorado           -0.0507471  0.0147523  -3.440 0.000588 ***
## TeamFlorida State      -0.0283639  0.0145734  -1.946 0.051697 .  
## TeamGeorgia Tech       -0.0550764  0.0145214  -3.793 0.000151 ***
## TeamIndiana            -0.1981630  0.0142942 -13.863  < 2e-16 ***
## TeamIowa State         -0.0017658  0.0142410  -0.124 0.901325    
## TeamKansas State        0.0172026  0.0139940   1.229 0.219046    
## TeamMarshall           -0.2667618  0.0145135 -18.380  < 2e-16 ***
## TeamMichigan State      0.0406111  0.0140807   2.884 0.003947 ** 
## TeamNC State            0.0155940  0.0142061   1.098 0.272410    
## TeamNebraska            0.0854354  0.0139851   6.109 1.11e-09 ***
## TeamNevada             -0.3133621  0.0149575 -20.950  < 2e-16 ***
## TeamNorthern Illinois  -0.2959506  0.0148346 -19.950  < 2e-16 ***
## TeamNotre Dame          0.0339272  0.0143101   2.371 0.017798 *  
## TeamOklahoma            0.0450473  0.0144802   3.111 0.001879 ** 
## TeamOle Miss           -0.0151866  0.0150482  -1.009 0.312946    
## TeamPenn State          0.0153855  0.0140734   1.093 0.274365    
## TeamRutgers            -0.1058482  0.0142791  -7.413 1.53e-13 ***
## TeamSMU                -0.3280568  0.0150522 -21.795  < 2e-16 ***
## TeamSyracuse           -0.1621237  0.0143989 -11.259  < 2e-16 ***
## TeamTexas A&M           0.0130434  0.0143274   0.910 0.362680    
## TeamToledo             -0.1437265  0.0145384  -9.886  < 2e-16 ***
## TeamTroy               -0.2779988  0.0174046 -15.973  < 2e-16 ***
## TeamUCLA               -0.2551988  0.0145898 -17.492  < 2e-16 ***
## TeamVirginia           -0.1387511  0.0144396  -9.609  < 2e-16 ***
## TeamWashington         -0.0281579  0.0146768  -1.919 0.055121 .  
## TeamWest Virginia      -0.0577110  0.0159326  -3.622 0.000296 ***
## TeamWestern Kentucky   -0.1918942  0.0184774 -10.385  < 2e-16 ***
## TeamWisconsin           0.0266625  0.0141477   1.885 0.059565 .  
## TMIN                   -0.0006619  0.0002848  -2.324 0.020175 *  
## ConferenceACC                  NA         NA      NA       NA    
## ConferenceBig-10               NA         NA      NA       NA    
## ConferenceBig-12               NA         NA      NA       NA    
## ConferenceCUSA                 NA         NA      NA       NA    
## ConferenceIndependent          NA         NA      NA       NA    
## ConferenceMid-American         NA         NA      NA       NA    
## ConferenceMWC                  NA         NA      NA       NA    
## ConferencePac-12               NA         NA      NA       NA    
## ConferenceSEC                  NA         NA      NA       NA    
## TailgatingTRUE                 NA         NA      NA       NA    
## `Current Wins`          0.0070846  0.0010809   6.555 6.36e-11 ***
## `Current Losses`       -0.0181029  0.0011859 -15.265  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.112 on 3663 degrees of freedom
## Multiple R-squared:  0.5861, Adjusted R-squared:  0.582 
## F-statistic: 144.1 on 36 and 3663 DF,  p-value: < 2.2e-16
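
The note "10 not defined because of singularities" and the NA rows mean that those columns are exact linear combinations of columns already in the model. Here the Conference and Tailgating values appear to be fixed by Team, so once Team is included they carry no new information and R drops them. A minimal sketch of the same effect, using an invented data frame `toy` in which `conference` is fully determined by `team` (all names and values are made up for illustration):

```r
set.seed(1)

# Four made-up teams, five games each.
toy <- data.frame(team = rep(c("A", "B", "C", "D"), each = 5))

# Each team belongs to exactly one conference, so conference is
# completely determined by team.
toy$conference <- ifelse(toy$team %in% c("A", "B"), "East", "West")

# Invented fill rates.
toy$fill <- rnorm(20, mean = 0.8, sd = 0.05)

fit <- lm(fill ~ team + conference, data = toy)

# The conferenceWest column equals teamC + teamD, so it cannot be
# estimated separately and its coefficient is reported as NA.
print(coef(fit))
```

Dropping the redundant predictors from the formula gives the same fitted values without the NA rows.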

Now let’s see how good our model is at predicting the Fill Rate for the stadium. I have split the data into training and test datasets so that you can check your model against data it was not built on.

Build a new model on the train dataset
Create a linear model for the Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses
Predict the Fill Rate for the test dataset and assign the predictions to a new column of test

# Test and Training set
set.seed(121) #set.seed is included so that everyone gets the same results
testNum <- sample(nrow(CFB), 100) #Creates a vector of row numbers to include
test <- CFB[c(testNum),] #Subsets the data so that it only has the test rows in it
train <- CFB[-c(testNum),] #Takes the test numbers out

# Build Your Model and Predict
lm3 <- lm(`Fill Rate` ~ `TMAX` + `Team` + `TMIN` + `Conference` + `Tailgating` + `Current Wins` + `Current Losses`, data = train)
prediction <- predict(lm3, test)

test <- cbind(test, prediction) # Adds the predictions as a new column of test

Now let’s see how well we did. Write a for loop that will print the average difference between your prediction and the actual value.

totaldiff <- 0
for (row in 1:nrow(test)) {
  fillrate <- as.numeric(test[row, "Fill Rate"])      # actual value
  fillratepred <- as.numeric(test[row, "prediction"]) # predicted value
  diff <- fillratepred - fillrate                     # signed difference
  totaldiff <- totaldiff + diff
}
avgdiff <- totaldiff / nrow(test)
print(avgdiff)

## [1] 0.01585466

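Well, looks like we will have to improve our model in the future. On average, the predicted Fill Rate is about 0.016 (1.6 percentage points) higher than the actual value. Because positive and negative differences cancel in this average, individual predictions can be off by more than that.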
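
The loop above averages signed differences, so over-predictions and under-predictions cancel each other out. As a minimal sketch, the same average can be computed without a loop, alongside two measures that do not cancel: the mean absolute error and the root mean squared error. The built-in `mtcars` data stands in for `CFB` here, and the model `mpg ~ wt + hp` and the test size of 8 are arbitrary choices for illustration.

```r
set.seed(121)

# Hold out 8 rows of mtcars as a test set (mtcars stands in for CFB).
test_rows <- sample(nrow(mtcars), 8)
test  <- mtcars[test_rows, ]
train <- mtcars[-test_rows, ]

# Fit on the training rows, then predict the held-out rows.
fit <- lm(mpg ~ wt + hp, data = train)
test$prediction <- predict(fit, newdata = test)

# Signed errors: positive means the prediction was too high.
errors <- test$prediction - test$mpg

mean_error <- mean(errors)          # same quantity as the for loop
mae        <- mean(abs(errors))     # typical size of an error
rmse       <- sqrt(mean(errors^2))  # penalizes large errors more

print(c(mean_error = mean_error, mae = mae, rmse = rmse))
```

A mean error near zero with a much larger mean absolute error would indicate a model that is unbiased on average but still imprecise for individual games.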

Now knit the file to HTML and submit it :)

Sariah Nokes

11/11/2019