Good work so far, you are on your way to becoming full fledged data analysts with R. This week’s course was a primer for when we are getting into Multiple and Logarithmic Regression. In the DataCamp courses we learned about different ways to visualize your data and when they are appropriate.
The data science process has five basic steps:
This assignment will be the first one that you will go through four of the steps for the data science process. By the end of this assignment you will have a nice report when you knit the Rmarkdown file.
This week you are a Sports Data Analyst for the prestigious Brigham Young University. You have been tasked with examining BYU football attendance compared with other football teams around the country. You will be visualizing data, building linear regression models, and interpreting your results. Good luck and have fun!
#Part 1. Obtaining the Data
This dataset was created for MRKT 585R, the data comes from extracting data from approximately 600 Wikipedia pages for attendance numbers for each football game from 2000-2018, after that data was added, it was bound to the weather data from the National Climate Data Center. Following is a list of variables and description of each, this format will create a table when you knit:
Variable | Description |
---|---|
Date | The Date the Game was played on |
Team | Home team of the Football Game |
Time | Kickoff Time |
Opponent | Away team of the Football Game |
Rank | Rank of the Home Team for the AP poll |
Site | Location for the game |
TV | TV channel that the game was played on |
Result | The outcome for the football game |
Attendance | How many people attended the game |
Current Wins | How many wins the team has leading up to the game |
Current Losses | How many losses the team has leading up to the game |
Stadium Capacity | How many people fit into the stadium |
Fill Rate | Attendance / Stadium Capacity |
New Coach | If the team has a first year head coach |
Tailgating | If the team is a Top 25 tailgate destination |
PRCP | Precipitation |
SNOW | Snowfall |
SNWD | Snow Depth (Snow on ground) |
TMAX | Max Temperature for the day |
TMIN | Min Temperature for the day |
Opponent_Rank | Rank of the Opponent at the time of the game |
Conference | What football conference the team belongs |
cfb <- readxl::read_excel("CFB.xlsx", sheet = 1)
## New names:
## * `` -> ...1
head(cfb)
## # A tibble: 6 x 25
## ...1 Date Team Time Opponent Rank Site TV Result
## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR War … <NA> W 38–0
## 2 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR War … <NA> W 38–…
## 3 3 2000-09-23 00:00:00 Arka… 8:00… Alabama NR Razo… ESPN2 W 28–…
## 4 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR Razo… JPS L 7–38
## 5 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR Razo… <NA> W 52–6
## 6 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR Razo… <NA> L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## # `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## # `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## # SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## # numericDate <dbl>, numericTime <dbl>, Conference <chr>
Before cleaning the data, take a moment and become familiar with the dataset that you are working with. There are a multitude of variables, it would be important to
#Part 2. Cleaning the Data
During this step we will look at the data to see if there are any unwanted observations, duplicate observations, irrelevant observations, structural issues or class type issues.
## Observations: 3,732
## Variables: 25
## $ ...1 <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ Date <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team <chr> "Arkansas", "Arkansas", "Arkansas", "Arkansas…
## $ Time <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank <chr> "NR", "NR", "NR", "NR", "NR", "NR", "NR", "NR…
## $ Site <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins` <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses` <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate` <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach` <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ Tailgating <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", …
## $ PRCP <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank <chr> "NR", "NR", "NR", " 25", "NR", "NR", " 24", "…
## $ numericDate <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference <chr> "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SE…
A couple of the variables have issues with the. Here they are and their problems:
...1
- This was the row index, it is unneededTeam
- There are 34 teams in total, therefore the data is categoricalRank
- This data is also categoricalTV
- This data is also categorical (We will deal with this later)Attendance
- There are commas that prevent the data from being numericCurrent Wins
- Loaded in as charactersCurrent Losses
- Loaded in as charactersNew Coach
- This data is categoricalTailgating
- This data is categoricalOpponent_Rank
- This data is categoricalConference
- This data is categorical..1
), output the head for the new datasetcfb <- cfb[,-1]
head(cfb)
## # A tibble: 6 x 24
## Date Team Time Opponent Rank Site TV Result
## <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR War … <NA> W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR War … <NA> W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama NR Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR Razo… JPS L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR Razo… <NA> W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR Razo… <NA> L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## # `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## # `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## # SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## # numericDate <dbl>, numericTime <dbl>, Conference <chr>
Team
, Rank
, New Coach
, Tailgating
, Opponent_Rank
, and Conference
columns so that the categorical nature of the columns is respected. Output the structure of the data using glimpse or str to confirm it worked.Note: In actual analysis, you would want Rank, and Opponent Rank as ordered factors. Also, the New Coach
column has a space in it, to type in the column without autofill, you surround the column name with `` symbols
cfb$Team <- as.factor(cfb$Team)
cfb$Rank <- as.factor(cfb$Rank)
cfb$`New Coach` <- as.factor(cfb$`New Coach`)
cfb$Tailgating <- as.factor(cfb$Tailgating)
cfb$Opponent_Rank <- as.factor(cfb$Opponent_Rank)
cfb$Conference <- as.factor(cfb$Conference)
glimpse(cfb)
## Observations: 3,732
## Variables: 24
## $ Date <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance <chr> "53,946", "54,286", "51,482", "51,162", "50,9…
## $ `Current Wins` <chr> "0", "1", "2", "3", "3", "4", "5", "0", "1", …
## $ `Current Losses` <chr> "0", "0", "0", "0", "1", "3", "5", "0", "0", …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate` <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach` <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank <fct> NR, NR, NR, 25, NR, NR, 24, NR, 8, NR, 9,…
## $ numericDate <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …
Attendance
column. Use the gsub function to remove the commas in the text.cfb$Attendance <- gsub(",", "", cfb$Attendance)
head(cfb)
## # A tibble: 6 x 24
## Date Team Time Opponent Rank Site TV Result
## <dttm> <fct> <chr> <chr> <fct> <chr> <chr> <chr>
## 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR War … <NA> W 38–0
## 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR War … <NA> W 38–…
## 3 2000-09-23 00:00:00 Arka… 8:00… Alabama NR Razo… ESPN2 W 28–…
## 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR Razo… JPS L 7–38
## 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR Razo… <NA> W 52–6
## 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR Razo… <NA> L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## # `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## # `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## # SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## # numericDate <dbl>, numericTime <dbl>, Conference <fct>
Attendance
, Current Wins
and Current Losses
are both characters right now, make them integers. Output the structure with glimpse or str again.cfb$Attendance <- as.integer(cfb$Attendance)
cfb$`Current Losses` <- as.integer(cfb$`Current Losses`)
cfb$`Current Wins` <- as.integer(cfb$`Current Wins`)
glimpse(cfb)
## Observations: 3,732
## Variables: 24
## $ Date <dttm> 2000-09-02, 2000-09-16, 2000-09-23, 2000-09-…
## $ Team <fct> Arkansas, Arkansas, Arkansas, Arkansas, Arkan…
## $ Time <chr> "8:00 pm", "6:00 pm", "8:00 pm", "11:30 pm", …
## $ Opponent <chr> "Southwest Missouri State*", "Boise State*", …
## $ Rank <fct> NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, NR, N…
## $ Site <chr> "War Memorial StadiumLittle Rock, AR", "War M…
## $ TV <chr> NA, NA, "ESPN2", "JPS", NA, NA, "CBS", "ESPN"…
## $ Result <chr> "W 38–0", "W 38–31", "W 28–21", "L 7–38", "W …
## $ Attendance <int> 53946, 54286, 51482, 51162, 50947, 49647, 439…
## $ `Current Wins` <int> 0, 1, 2, 3, 3, 4, 5, 0, 1, 1, 2, 3, 5, 6, 0, …
## $ `Current Losses` <int> 0, 0, 0, 0, 1, 3, 5, 0, 0, 3, 3, 3, 3, 3, 0, …
## $ `Stadium Capacity` <dbl> 53727, 53727, 50019, 50019, 50019, 50019, 537…
## $ `Fill Rate` <dbl> 1.0040762, 1.0104045, 1.0292489, 1.0228513, 1…
## $ `New Coach` <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tailgating <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRCP <dbl> 0.00, 0.00, 2.12, 0.00, 0.00, 0.11, 0.94, 0.0…
## $ SNOW <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SNWD <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TMAX <dbl> 105, 79, 85, 77, 50, 55, 49, 84, NA, 62, 62, …
## $ TMIN <dbl> 65, 44, 63, 45, 28, 49, 43, 63, NA, 31, 45, 2…
## $ Opponent_Rank <fct> NR, NR, NR, 25, NR, NR, 24, NR, 8, NR, 9,…
## $ numericDate <dbl> 11202, 11216, 11223, 11230, 11237, 11265, 112…
## $ numericTime <dbl> 28800, 21600, 28800, 41400, 21600, 7200, 5400…
## $ Conference <fct> SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, SEC, …
You have now cleaned the data for our College Football Dataset.
#Part 3. Exploratory Data Analysis
Thus far we have imported and cleaned the data that we have for each of the games. Now we begin the exploratory data analysis, this step includes Variable Identification, Univariate Analysis, Bi-variate Analysis, Missing values treatment, Outlier treatment, Variable transformation, and Feature engineering. These are alot of big words, don’t worry though. I am going to explain what each means.
Variable Identification - identifying the input and the output variables, there are many different names for these. Input variables are also predictors, features, and independent variables. Output variable are also the response or dependent variable.
In this assignment, I am going to tell you that the output variable is the Fill Rate
unless I specify otherwise. This means that the Fill Rate
will be on the y axis for most graphs that we do.
Missing Value Treatment - dealing with missing data in the dataset
summary(cfb)
## Date Team Time
## Min. :2000-08-31 00:00:00 Arkansas : 130 Length:3732
## 1st Qu.:2005-06-24 12:00:00 Nebraska : 130 Class :character
## Median :2009-10-31 00:00:00 Penn State : 130 Mode :character
## Mean :2009-12-03 10:01:09 Michigan State: 129
## 3rd Qu.:2014-09-27 00:00:00 Kansas State : 128
## Max. :2018-12-01 00:00:00 Texas A&M : 127
## (Other) :2958
## Opponent Rank Site TV
## Length:3732 NR :2668 Length:3732 Length:3732
## Class :character 19 : 59 Class :character Class :character
## Mode :character 23 : 51 Mode :character Mode :character
## 24 : 51
## 11 : 49
## 10 : 47
## (Other): 807
## Result Attendance Current Wins Current Losses
## Length:3732 Min. : 4513 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 36018 1st Qu.: 1.000 1st Qu.: 0.000
## Mode :character Median : 53663 Median : 3.000 Median : 2.000
## Mean : 54607 Mean : 3.049 Mean : 2.061
## 3rd Qu.: 75504 3rd Qu.: 5.000 3rd Qu.: 3.000
## Max. :110889 Max. :12.000 Max. :11.000
## NA's :2
## Stadium Capacity Fill Rate New Coach Tailgating
## Min. : 22113 Min. :0.08137 FALSE:3129 FALSE:2514
## 1st Qu.: 49225 1st Qu.:0.78404 TRUE : 603 TRUE :1218
## Median : 60000 Median :0.94100
## Mean : 60354 Mean :0.87895
## 3rd Qu.: 80321 3rd Qu.:1.00255
## Max. :107282 Max. :1.40399
## NA's :2
## PRCP SNOW SNWD TMAX
## Min. :0.00000 Min. :0.0000 Min. :0.00 Min. : 19.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 61.00
## Median :0.00000 Median :0.0000 Median :0.00 Median : 72.00
## Mean :0.08403 Mean :0.0166 Mean :0.03 Mean : 70.82
## 3rd Qu.:0.01000 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.: 82.00
## Max. :6.45000 Max. :5.3000 Max. :7.10 Max. :111.00
## NA's :20 NA's :1005 NA's :1049 NA's :29
## TMIN Opponent_Rank numericDate numericTime
## Min. : 0.00 NR :2936 Min. :11200 Min. : 3600
## 1st Qu.:38.00 1 : 53 1st Qu.:12958 1st Qu.:12600
## Median :49.00 2 : 42 Median :14548 Median :21600
## Mean :48.51 13 : 39 Mean :14581 Mean :23548
## 3rd Qu.:59.00 10 : 34 3rd Qu.:16340 3rd Qu.:39600
## Max. :81.00 3 : 32 Max. :17866 Max. :45000
## NA's :29 (Other): 596
## Conference
## Big-10 :755
## ACC :697
## Big-12 :570
## SEC :374
## Pac-12 :336
## Mid-American:281
## (Other) :719
From the output, we learn that TV
has 582 NA’s, Attendance
has 2 NA’s, Fill Rate
has 2 NA’s, PRCP
has 20 NA’s, SNOW
and SNWD
have over 1000 NA’s, and finally TMAX
and TMIN
have 29 NA’s.
TV
means that the game was not on TV, the same goes with PRCP
, SNOW
and SNWD
, we can assume that there was no precipitation of any kind. Assign the NA values for the weather with 0. Follow my example.cfb$TV[is.na(cfb$TV)] <- "Not on Television" #Example of how to fill na values
cfb$PRCP[is.na(cfb$PRCP)] <- 0
cfb$SNOW[is.na(cfb$SNOW)] <- 0
cfb$SNWD[is.na(cfb$SNWD)] <- 0
summary(cfb)
## Date Team Time
## Min. :2000-08-31 00:00:00 Arkansas : 130 Length:3732
## 1st Qu.:2005-06-24 12:00:00 Nebraska : 130 Class :character
## Median :2009-10-31 00:00:00 Penn State : 130 Mode :character
## Mean :2009-12-03 10:01:09 Michigan State: 129
## 3rd Qu.:2014-09-27 00:00:00 Kansas State : 128
## Max. :2018-12-01 00:00:00 Texas A&M : 127
## (Other) :2958
## Opponent Rank Site TV
## Length:3732 NR :2668 Length:3732 Length:3732
## Class :character 19 : 59 Class :character Class :character
## Mode :character 23 : 51 Mode :character Mode :character
## 24 : 51
## 11 : 49
## 10 : 47
## (Other): 807
## Result Attendance Current Wins Current Losses
## Length:3732 Min. : 4513 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 36018 1st Qu.: 1.000 1st Qu.: 0.000
## Mode :character Median : 53663 Median : 3.000 Median : 2.000
## Mean : 54607 Mean : 3.049 Mean : 2.061
## 3rd Qu.: 75504 3rd Qu.: 5.000 3rd Qu.: 3.000
## Max. :110889 Max. :12.000 Max. :11.000
## NA's :2
## Stadium Capacity Fill Rate New Coach Tailgating
## Min. : 22113 Min. :0.08137 FALSE:3129 FALSE:2514
## 1st Qu.: 49225 1st Qu.:0.78404 TRUE : 603 TRUE :1218
## Median : 60000 Median :0.94100
## Mean : 60354 Mean :0.87895
## 3rd Qu.: 80321 3rd Qu.:1.00255
## Max. :107282 Max. :1.40399
## NA's :2
## PRCP SNOW SNWD TMAX
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. : 19.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 61.00
## Median :0.00000 Median :0.00000 Median :0.00000 Median : 72.00
## Mean :0.08357 Mean :0.01217 Mean :0.02157 Mean : 70.82
## 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 82.00
## Max. :6.45000 Max. :5.30000 Max. :7.10000 Max. :111.00
## NA's :29
## TMIN Opponent_Rank numericDate numericTime
## Min. : 0.00 NR :2936 Min. :11200 Min. : 3600
## 1st Qu.:38.00 1 : 53 1st Qu.:12958 1st Qu.:12600
## Median :49.00 2 : 42 Median :14548 Median :21600
## Mean :48.51 13 : 39 Mean :14581 Mean :23548
## 3rd Qu.:59.00 10 : 34 3rd Qu.:16340 3rd Qu.:39600
## Max. :81.00 3 : 32 Max. :17866 Max. :45000
## NA's :29 (Other): 596
## Conference
## Big-10 :755
## ACC :697
## Big-12 :570
## SEC :374
## Pac-12 :336
## Mid-American:281
## (Other) :719
Attendance
and Fill Rate
, determine which rows have NA values in them for the Attendance
variable and output those row numbers. Now look at those rows and answer why is there no Attendance for those games? ________________(They were cancelled)which(is.na(cfb$Attendance))
## [1] 662 3040
cfb[c(662, 3040),]
## # A tibble: 2 x 24
## Date Team Time Opponent Rank Site TV Result
## <dttm> <fct> <chr> <chr> <fct> <chr> <chr> <chr>
## 1 2013-09-14 00:00:00 Colo… 12:0… Fresno … NR Fols… P12N Cance…
## 2 2015-09-03 00:00:00 Tole… 7:00… Stony B… NR Glas… ESPN3 Cance…
## # … with 16 more variables: Attendance <int>, `Current Wins` <int>,
## # `Current Losses` <int>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## # `New Coach` <fct>, Tailgating <fct>, PRCP <dbl>, SNOW <dbl>,
## # SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <fct>,
## # numericDate <dbl>, numericTime <dbl>, Conference <fct>
cfb <- cfb[-c(662, 3040),]
TMAX
and TMIN
. This variable is hardest to infer, so let’s drop the NA values now. You will have a dataset that is 3700 observations long.cfb <- cfb[complete.cases(cfb$TMAX),]
cfb <- cfb[complete.cases(cfb$TMIN),]
summary(cfb)
## Date Team Time
## Min. :2000-08-31 00:00:00 Nebraska : 130 Length:3700
## 1st Qu.:2004-11-26 18:00:00 Penn State : 130 Class :character
## Median :2009-10-24 00:00:00 Arkansas : 129 Mode :character
## Mean :2009-11-21 00:27:14 Michigan State: 129
## 3rd Qu.:2014-09-20 00:00:00 Kansas State : 128
## Max. :2018-12-01 00:00:00 Texas A&M : 127
## (Other) :2927
## Opponent Rank Site TV
## Length:3700 NR :2640 Length:3700 Length:3700
## Class :character 19 : 59 Class :character Class :character
## Mode :character 23 : 51 Mode :character Mode :character
## 24 : 50
## 11 : 49
## 10 : 47
## (Other): 804
## Result Attendance Current Wins Current Losses
## Length:3700 Min. : 4513 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 35986 1st Qu.: 1.000 1st Qu.: 0.000
## Mode :character Median : 53654 Median : 3.000 Median : 2.000
## Mean : 54633 Mean : 3.051 Mean : 2.056
## 3rd Qu.: 75579 3rd Qu.: 5.000 3rd Qu.: 3.000
## Max. :110889 Max. :12.000 Max. :11.000
##
## Stadium Capacity Fill Rate New Coach Tailgating
## Min. : 22113 Min. :0.08137 FALSE:3106 FALSE:2502
## 1st Qu.: 49225 1st Qu.:0.78399 TRUE : 594 TRUE :1198
## Median : 60000 Median :0.94129
## Mean : 60373 Mean :0.87926
## 3rd Qu.: 80321 3rd Qu.:1.00275
## Max. :107282 Max. :1.40399
##
## PRCP SNOW SNWD TMAX
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. : 19.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 61.00
## Median :0.00000 Median :0.00000 Median :0.00000 Median : 72.00
## Mean :0.08396 Mean :0.01227 Mean :0.02176 Mean : 70.82
## 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 82.00
## Max. :6.45000 Max. :5.30000 Max. :7.10000 Max. :111.00
##
## TMIN Opponent_Rank numericDate numericTime
## Min. : 0.00 NR :2913 Min. :11200 Min. : 3600
## 1st Qu.:38.00 1 : 51 1st Qu.:12749 1st Qu.:12600
## Median :49.00 2 : 42 Median :14541 Median :21600
## Mean :48.51 13 : 39 Mean :14569 Mean :23517
## 3rd Qu.:59.00 10 : 34 3rd Qu.:16333 3rd Qu.:39600
## Max. :81.00 3 : 32 Max. :17866 Max. :45000
## (Other): 589
## Conference
## Big-10 :753
## ACC :696
## Big-12 :567
## SEC :354
## Pac-12 :335
## Mid-American:279
## (Other) :716
cfb %>%
filter(is.na(TMAX))
## # A tibble: 0 x 24
## # … with 24 variables: Date <dttm>, Team <fct>, Time <chr>,
## # Opponent <chr>, Rank <fct>, Site <chr>, TV <chr>, Result <chr>,
## # Attendance <int>, `Current Wins` <int>, `Current Losses` <int>,
## # `Stadium Capacity` <dbl>, `Fill Rate` <dbl>, `New Coach` <fct>,
## # Tailgating <fct>, PRCP <dbl>, SNOW <dbl>, SNWD <dbl>, TMAX <dbl>,
## # TMIN <dbl>, Opponent_Rank <fct>, numericDate <dbl>, numericTime <dbl>,
## # Conference <fct>
Univariate Analysis - exploring variables one by one.
Fill Rate
, names this chart “Histogram of Fill Rate”, set the binwidth = 0.5Attendance
variable, name this chart, “Histogram of Attendance”ggplot(cfb, aes(x = Attendance)) +
geom_histogram() +
ggtitle("Histogram of Attendance")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at these two charts illustrates why standardization is important. The Fill Rate
is the standardized version of the Attendance data.
Bi-variate Analysis - exploring how variables relate to one another
Now we are beginning to get into some fun with the data. We will be looking at the apparent relationship between two variables. In this case, we will be looking at the relationship of a couple of the variables against the Fill Rate
Fill Rate
over the Date
ggplot(cfb, aes(x = Date, y = `Fill Rate`, color = Conference)) +
geom_point() +
facet_wrap(~ Team) +
ggtitle("Fill Rate by Team")
Fill Rate
over the Date
, and color set to New Coach
cfb %>%
filter(Team == "BYU") %>%
ggplot( aes(x = Date, y = `Fill Rate`, color = `New Coach`)) +
geom_point() +
ggtitle("New Coaches at BYU")
Current Losses
over Date
cfb %>%
filter(Team == "BYU") %>%
ggplot(aes(x = Date, y = `Current Losses`)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
ggtitle("Losses by Year for BYU Football")
This might help to explain why 2008-2010 have such a full stadium but it doesn’t explain the drop in fullness starting in 2011. This might be a factor also
Outlier treatment - dealing with observations that appear to be too far away from the pattern of the data.
For this assignment, there are no outliers.
Feature engineering - creating new variables to extract more information from the data there are two ways to do this. First, by transforming the scale or relationship of the data. Second, by creating new variables based off the ones that we have.
teamMath <- function(data = cfb, team) {
data %>%
filter(Team == team) %>%
summarize(team,SD = sd(Attendance), Avg = mean(Attendance),Upper = Avg + SD, Lower = Avg - SD)
}
#Use your function with BYU as the team, another time with SMU, and another time with Nebraska
teamMath(cfb, "BYU")
## # A tibble: 1 x 5
## team SD Avg Upper Lower
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BYU 4297. 60006. 64303. 55709.
teamMath(cfb, "SMU")
## # A tibble: 1 x 5
## team SD Avg Upper Lower
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 SMU 8064. 19889. 27953. 11825.
teamMath(cfb, "Nebraska")
## # A tibble: 1 x 5
## team SD Avg Upper Lower
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Nebraska 5075. 84494. 89569. 79419.
#Part 4. Data Modeling
Now we have reviewed most of what we have learned in R up to this point. Next we will begin what we have learned about correlation and regression in the class. In this section we will be looking at the linear correlation of values, adding best fit lines, statistical models.
First reminder, correlation only good for linear values. You get the correlation by subtracting the mean of the x axis from each value multiplied by the subtracted mean of the y axis from each value. divided by the squared version of each.
Second reminder, the line of best fit, or trendline is one that minimizes the squared distance from all points to it. The average distance of all the points above and below the line should be the same.
Third reminder, the generic statistical model is y = f(x) + noise
. Most of what you do in creating models from here on out is determine ways to better fit the response variable or y with the the inputs or x.
Fill Rate
by the TMAX
, earlier we didn’t remove all the NA values from TMAX, use the appropriate argument to disregard the NA values. Look herecfb %>%
summarize(Correlation = cor(`Fill Rate`, TMAX, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
## Correlation
## <dbl>
## 1 0.0573
.0597 would be a very weak correlation
Fill Rate
. Check its correlation against Current Wins
, Current Losses
, PRCP
, SNOW
, and TMIN
. Make sure to use the correct argument when there are NA values.cfb %>%
summarize(CorrWins = cor(`Fill Rate`, `Current Wins`), CorrLosses = cor(`Fill Rate`, `Current Losses`), CorrPRCP = cor(`Fill Rate`, PRCP), CorrSNOW = cor(`Fill Rate`, SNOW), CorrTMIN = cor(`Fill Rate`, TMIN, use = "pairwise.complete.obs"))
## # A tibble: 1 x 5
## CorrWins CorrLosses CorrPRCP CorrSNOW CorrTMIN
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.136 -0.348 0.00919 -0.00581 0.0355
We have already fitted linear models simply within ggplot, now we want to practice creating the models themselves so that we can learn from them.
Fill Rate
determined by TMAX
lm1 <- lm(`Fill Rate` ~ TMAX ,data = cfb)
summary(lm1)
##
## Call:
## lm(formula = `Fill Rate` ~ TMAX, data = cfb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80001 -0.09716 0.06258 0.12531 0.54336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8318411 0.0138903 59.886 < 2e-16 ***
## TMAX 0.0006695 0.0001920 3.487 0.000493 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.173 on 3698 degrees of freedom
## Multiple R-squared: 0.003278, Adjusted R-squared: 0.003008
## F-statistic: 12.16 on 1 and 3698 DF, p-value: 0.0004934
Fill Rate
determined by Team
, TMAX
, TMIN
, Conference
, Tailgating
, Current Wins
, and Current Losses
lm2 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = cfb)
summary(lm2)
##
## Call:
## lm(formula = `Fill Rate` ~ TMAX + TMIN + Conference + Tailgating +
## `Current Wins` + `Current Losses` + Team, data = cfb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52942 -0.05310 0.00461 0.05392 0.57294
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6180455 0.0238387 25.926 < 2e-16 ***
## TMAX 0.0008184 0.0002603 3.145 0.001677 **
## TMIN -0.0006619 0.0002848 -2.324 0.020175 *
## ConferenceACC 0.1893057 0.0153343 12.345 < 2e-16 ***
## ConferenceBig-10 0.1276783 0.0213279 5.986 2.35e-09 ***
## ConferenceBig-12 0.2703458 0.0171358 15.777 < 2e-16 ***
## ConferenceCUSA 0.1361626 0.0190918 7.132 1.19e-12 ***
## ConferenceIndependent 0.1349430 0.0215127 6.273 3.96e-10 ***
## ConferenceMid-American 0.0500580 0.0180696 2.770 0.005629 **
## ConferenceMWC 0.0146947 0.0160635 0.915 0.360364
## ConferencePac-12 0.0728579 0.0152291 4.784 1.79e-06 ***
## ConferenceSEC 0.3280568 0.0150522 21.795 < 2e-16 ***
## TailgatingTRUE 0.2270409 0.0153324 14.808 < 2e-16 ***
## `Current Wins` 0.0070846 0.0010809 6.555 6.36e-11 ***
## `Current Losses` -0.0181029 0.0011859 -15.265 < 2e-16 ***
## TeamBaylor -0.1095555 0.0165408 -6.623 4.02e-11 ***
## TeamBoise State 0.2811864 0.0155138 18.125 < 2e-16 ***
## TeamBYU 0.1738787 0.0215989 8.050 1.11e-15 ***
## TeamClemson -0.0902762 0.0216874 -4.163 3.22e-05 ***
## TeamColorado 0.2044518 0.0156644 13.052 < 2e-16 ***
## TeamFlorida State 0.1103872 0.0150156 7.352 2.40e-13 ***
## TeamGeorgia Tech 0.0836747 0.0149675 5.590 2.43e-08 ***
## TeamIndiana 0.0022155 0.0209931 0.106 0.915958
## TeamIowa State 0.0559452 0.0161525 3.464 0.000539 ***
## TeamKansas State 0.0749136 0.0159461 4.698 2.72e-06 ***
## TeamMarshall -0.0748676 0.0188875 -3.964 7.52e-05 ***
## TeamMichigan State 0.0139486 0.0140184 0.995 0.319795
## TeamNC State 0.1543451 0.0146479 10.537 < 2e-16 ***
## TeamNebraska 0.0587729 0.0140686 4.178 3.01e-05 ***
## TeamNevada NA NA NA NA
## TeamNorthern Illinois -0.0179518 0.0182483 -0.984 0.325303
## TeamNotre Dame NA NA NA NA
## TeamOklahoma -0.1242826 0.0227609 -5.460 5.07e-08 ***
## TeamOle Miss -0.2422276 0.0216531 -11.187 < 2e-16 ***
## TeamPenn State -0.0112770 0.0140222 -0.804 0.421320
## TeamRutgers 0.0945303 0.0209131 4.520 6.38e-06 ***
## TeamSMU NA NA NA NA
## TeamSyracuse -0.0233726 0.0148049 -1.579 0.114490
## TeamTexas A&M -0.2139975 0.0214652 -9.970 < 2e-16 ***
## TeamToledo 0.1342723 0.0180163 7.453 1.13e-13 ***
## TeamTroy NA NA NA NA
## TeamUCLA NA NA NA NA
## TeamVirginia NA NA NA NA
## TeamWashington NA NA NA NA
## TeamWest Virginia NA NA NA NA
## TeamWestern Kentucky NA NA NA NA
## TeamWisconsin NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.112 on 3663 degrees of freedom
## Multiple R-squared: 0.5861, Adjusted R-squared: 0.582
## F-statistic: 144.1 on 36 and 3663 DF, p-value: < 2.2e-16
Fill Rate
for the stadium. I have split the data into a training and validation datasets so that you can predict how good your model is against the data.train
datasetFill Rate
determined by Team
, TMAX
, TMIN
, Conference
, Tailgating
, Current Wins
, and Current Losses
test
dataset, assign this to a new column of test
# Test and Training set
set.seed(121) #set.seed is included so that everyone gets the same results
testNum <- sample(nrow(cfb), 100) #Creates a vector of row numbers to include
test <- cfb[c(testNum),] #Subsets the data so that it only has the test rows in it
train <- cfb[-c(testNum),] #Takes the test numbers out
# Build Your Model and Predict
lm3 <- lm(`Fill Rate` ~ TMAX + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses` + Team, data = train)
test$pred_lm3 <- predict(lm3, test)
accuracy <- rep(0, nrow(test))
for (i in 1:nrow(test)) {
accuracy[i] <- abs(test$pred_lm3[i] - test$`Fill Rate`[i])
}
print(mean(accuracy))
## [1] 0.09386146
Well, looks like in the future we will have to improve our model. The predicted Fill Rate is currently off by about %10 for each value.
Now knit the file to an HTML and submit it :)