Good work so far; you are on your way to becoming full-fledged data analysts with R. This week’s course was a primer for when we get into Multiple and Logarithmic Regression. In the DataCamp courses we learned about different ways to visualize data and when each is appropriate.
The data science process has five basic steps:
This assignment is the first one in which you will go through four of the steps of the data science process. By the end of this assignment you will have a nice report when you knit the Rmarkdown file.
This week you are a Sports Data Analyst for the prestigious Brigham Young University. You have been tasked with examining BYU football attendance compared with other football teams around the country. You will be visualizing data, building linear regression models, and interpreting your results. Good luck and have fun!
# Part 1. Obtaining the Data
This dataset was created for MRKT 585R. The data comes from extracting attendance numbers for each football game from 2000-2018 from approximately 600 Wikipedia pages. After that data was added, it was bound to the weather data from the National Climate Data Center. Following is a list of the variables and a description of each. This format will create a table when you knit:
| Variable | Description |
|---|---|
| Date | The Date the Game was played on |
| Team | Home team of the Football Game |
| Time | Kickoff Time |
| Opponent | Away team of the Football Game |
| Rank | Rank of the Home Team for the AP poll |
| Site | Location for the game |
| TV | TV channel that the game was played on |
| Result | The outcome for the football game |
| Attendance | How many people attended the game |
| Current Wins | How many wins the team has leading up to the game |
| Current Losses | How many losses the team has leading up to the game |
| Stadium Capacity | How many people fit into the stadium |
| Fill Rate | Attendance / Stadium Capacity |
| New Coach | If the team has a first year head coach |
| Tailgating | If the team is a Top 25 tailgate destination |
| PRCP | Precipitation |
| SNOW | Snowfall |
| SNWD | Snow Depth (Snow on ground) |
| TMAX | Max Temperature for the day |
| TMIN | Min Temperature for the day |
| Opponent_Rank | Rank of the Opponent at the time of the game |
| Conference | What football conference the team belongs to |
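library(readxl) # provides read_excel()
library(tidyverse) # assumed here for the dplyr verbs, %>%, and ggplot2 used below
CFB <- read_excel("CFB.xlsx")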
## New names:
## * `` -> ...1
head(CFB)
## # A tibble: 6 x 25
## ...1 Date Team Time Opponent Rank Site TV Result
## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 2000-09-02 00:00:00 Arka… 8:00… Southwe… NR War … <NA> W 38–0
## 2 2 2000-09-16 00:00:00 Arka… 6:00… Boise S… NR War … <NA> W 38–…
## 3 3 2000-09-23 00:00:00 Arka… 8:00… Alabama NR Razo… ESPN2 W 28–…
## 4 4 2000-09-30 00:00:00 Arka… 11:3… No. 25 … NR Razo… JPS L 7–38
## 5 5 2000-10-07 00:00:00 Arka… 6:00… Louisia… NR Razo… <NA> W 52–6
## 6 6 2000-11-04 00:00:00 Arka… 2:00… Ole Miss NR Razo… <NA> L 24–…
## # … with 16 more variables: Attendance <chr>, `Current Wins` <chr>,
## # `Current Losses` <chr>, `Stadium Capacity` <dbl>, `Fill Rate` <dbl>,
## # `New Coach` <chr>, Tailgating <chr>, PRCP <dbl>, SNOW <dbl>,
## # SNWD <dbl>, TMAX <dbl>, TMIN <dbl>, Opponent_Rank <chr>,
## # numericDate <dbl>, numericTime <dbl>, Conference <chr>
Before cleaning the data, take a moment and become familiar with the dataset that you are working with. There are a multitude of variables, it would be important to
# Part 2. Cleaning the Data
During this step we will look at the data to see if there are any unwanted observations, duplicate observations, irrelevant observations, structural issues or class type issues.
summary(CFB)
## ...1 Date Team
## Length:3732 Min. :2000-08-31 00:00:00 Length:3732
## Class :character 1st Qu.:2005-06-24 12:00:00 Class :character
## Mode :character Median :2009-10-31 00:00:00 Mode :character
## Mean :2009-12-03 10:01:09
## 3rd Qu.:2014-09-27 00:00:00
## Max. :2018-12-01 00:00:00
##
## Time Opponent Rank
## Length:3732 Length:3732 Length:3732
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Site TV Result
## Length:3732 Length:3732 Length:3732
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Attendance Current Wins Current Losses Stadium Capacity
## Length:3732 Length:3732 Length:3732 Min. : 22113
## Class :character Class :character Class :character 1st Qu.: 49225
## Mode :character Mode :character Mode :character Median : 60000
## Mean : 60354
## 3rd Qu.: 80321
## Max. :107282
##
## Fill Rate New Coach Tailgating PRCP
## Min. :0.08137 Length:3732 Length:3732 Min. :0.00000
## 1st Qu.:0.78404 Class :character Class :character 1st Qu.:0.00000
## Median :0.94100 Mode :character Mode :character Median :0.00000
## Mean :0.87895 Mean :0.08403
## 3rd Qu.:1.00255 3rd Qu.:0.01000
## Max. :1.40399 Max. :6.45000
## NA's :2 NA's :20
## SNOW SNWD TMAX TMIN
## Min. :0.0000 Min. :0.00 Min. : 19.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 61.00 1st Qu.:38.00
## Median :0.0000 Median :0.00 Median : 72.00 Median :49.00
## Mean :0.0166 Mean :0.03 Mean : 70.82 Mean :48.51
## 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.: 82.00 3rd Qu.:59.00
## Max. :5.3000 Max. :7.10 Max. :111.00 Max. :81.00
## NA's :1005 NA's :1049 NA's :29 NA's :29
## Opponent_Rank numericDate numericTime Conference
## Length:3732 Min. :11200 Min. : 3600 Length:3732
## Class :character 1st Qu.:12958 1st Qu.:12600 Class :character
## Mode :character Median :14548 Median :21600 Mode :character
## Mean :14581 Mean :23548
## 3rd Qu.:16340 3rd Qu.:39600
## Max. :17866 Max. :45000
##
A number of the variables have issues. Here they are, along with their problems:

- ...1 - This was the row index; it is unneeded
- Team - There are 33 teams in total, therefore the data is categorical
- Rank - This data is also categorical
- TV - This data is also categorical (We will deal with this later)
- Attendance - There are commas that prevent the data from being numeric
- Current Wins - Loaded in as characters
- Current Losses - Loaded in as characters
- New Coach - This data is categorical
- Tailgating - This data is categorical
- Opponent_Rank - This data is categorical
- Conference - This data is categorical

Remove the row index column (...1), output the head for the new dataset
CFB <- select(CFB, -c(...1))
Convert the Team, Rank, New Coach, Tailgating, Opponent_Rank, and Conference columns to factors so that the categorical nature of the columns is respected. Output the structure of the data using glimpse or str to confirm it worked.
Note: In an actual analysis, you would want Rank and Opponent_Rank as ordered factors. Also, the New Coach column name has a space in it; to type the column name without autofill, surround it with backticks (`).
CFB$Team <- as.factor(CFB$Team)
CFB$Rank <- as.factor(CFB$Rank)
CFB$`New Coach` <- as.factor(CFB$`New Coach`)
CFB$Tailgating <- as.factor(CFB$Tailgating)
CFB$Opponent_Rank <- as.factor(CFB$Opponent_Rank)
CFB$Conference <- as.factor(CFB$Conference)
str(CFB)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3732 obs. of 24 variables:
## $ Date : POSIXct, format: "2000-09-02" "2000-09-16" ...
## $ Team : Factor w/ 33 levels "Arkansas","Baylor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : chr "8:00 pm" "6:00 pm" "8:00 pm" "11:30 pm" ...
## $ Opponent : chr "Southwest Missouri State*" "Boise State*" "Alabama" "No. 25 Georgia" ...
## $ Rank : Factor w/ 29 levels "1","1(3)","10",..: 29 29 29 29 29 29 29 29 29 29 ...
## $ Site : chr "War Memorial StadiumLittle Rock, AR" "War Memorial StadiumLittle Rock, AR" "Razorback StadiumFayetteville, AR" "Razorback StadiumFayetteville, AR" ...
## $ TV : chr NA NA "ESPN2" "JPS" ...
## $ Result : chr "W 38–0" "W 38–31" "W 28–21" "L 7–38" ...
## $ Attendance : chr "53,946" "54,286" "51,482" "51,162" ...
## $ Current Wins : chr "0" "1" "2" "3" ...
## $ Current Losses : chr "0" "0" "0" "0" ...
## $ Stadium Capacity: num 53727 53727 50019 50019 50019 ...
## $ Fill Rate : num 1 1.01 1.03 1.02 1.02 ...
## $ New Coach : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tailgating : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ PRCP : num 0 0 2.12 0 0 0.11 0.94 0 NA 0 ...
## $ SNOW : num NA NA NA NA NA NA NA NA NA NA ...
## $ SNWD : num NA NA NA NA NA NA NA NA NA NA ...
## $ TMAX : num 105 79 85 77 50 55 49 84 NA 62 ...
## $ TMIN : num 65 44 63 45 28 49 43 63 NA 31 ...
## $ Opponent_Rank : Factor w/ 32 levels " "," 1"," 1.",..: 32 32 32 21 32 32 20 32 27 32 ...
## $ numericDate : num 11202 11216 11223 11230 11237 ...
## $ numericTime : num 28800 21600 28800 41400 21600 7200 5400 23400 28800 21600 ...
## $ Conference : Factor w/ 10 levels "AAC","ACC","Big-10",..: 10 10 10 10 10 10 10 10 10 10 ...
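The note above mentions that Rank would normally be an ordered factor. Here is a minimal sketch of what that could look like, using a small made-up vector of ranks rather than the real column (the real Rank column contains extra values such as 1(3) that would need cleaning first). The names `rank_raw`, `rank_levels`, and `rank_ord` are invented for illustration.

```r
# Made-up ranks for illustration only; "NR" means not ranked
rank_raw <- c("NR", "3", "25", "10", "NR", "1")

# Order the levels from best (1) to worst (25), with "NR" last
rank_levels <- c(as.character(1:25), "NR")
rank_ord <- factor(rank_raw, levels = rank_levels, ordered = TRUE)

# Ordered factors support comparisons: which entries are ranked better than 10th?
better_than_10 <- rank_ord < "10"
print(rank_ord)
print(better_than_10)
```

With a plain (unordered) factor the comparison in the last step is not meaningful, which is why the ordering matters when rank is used as more than a label.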
Clean up the Attendance column. Use the gsub function to remove the commas in the text.
CFB$Attendance <- gsub(",", "", CFB$Attendance)
Attendance, Current Wins and Current Losses are all characters right now; make them integers. Output the structure with glimpse or str again.
CFB$Attendance <- as.integer(CFB$Attendance)
CFB$`Current Losses` <- as.integer(CFB$`Current Losses`)
CFB$`Current Wins` <- as.integer(CFB$`Current Wins`)
str(CFB)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3732 obs. of 24 variables:
## $ Date : POSIXct, format: "2000-09-02" "2000-09-16" ...
## $ Team : Factor w/ 33 levels "Arkansas","Baylor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : chr "8:00 pm" "6:00 pm" "8:00 pm" "11:30 pm" ...
## $ Opponent : chr "Southwest Missouri State*" "Boise State*" "Alabama" "No. 25 Georgia" ...
## $ Rank : Factor w/ 29 levels "1","1(3)","10",..: 29 29 29 29 29 29 29 29 29 29 ...
## $ Site : chr "War Memorial StadiumLittle Rock, AR" "War Memorial StadiumLittle Rock, AR" "Razorback StadiumFayetteville, AR" "Razorback StadiumFayetteville, AR" ...
## $ TV : chr NA NA "ESPN2" "JPS" ...
## $ Result : chr "W 38–0" "W 38–31" "W 28–21" "L 7–38" ...
## $ Attendance : int 53946 54286 51482 51162 50947 49647 43982 52213 70470 52683 ...
## $ Current Wins : int 0 1 2 3 3 4 5 0 1 1 ...
## $ Current Losses : int 0 0 0 0 1 3 5 0 0 3 ...
## $ Stadium Capacity: num 53727 53727 50019 50019 50019 ...
## $ Fill Rate : num 1 1.01 1.03 1.02 1.02 ...
## $ New Coach : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tailgating : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ PRCP : num 0 0 2.12 0 0 0.11 0.94 0 NA 0 ...
## $ SNOW : num NA NA NA NA NA NA NA NA NA NA ...
## $ SNWD : num NA NA NA NA NA NA NA NA NA NA ...
## $ TMAX : num 105 79 85 77 50 55 49 84 NA 62 ...
## $ TMIN : num 65 44 63 45 28 49 43 63 NA 31 ...
## $ Opponent_Rank : Factor w/ 32 levels " "," 1"," 1.",..: 32 32 32 21 32 32 20 32 27 32 ...
## $ numericDate : num 11202 11216 11223 11230 11237 ...
## $ numericTime : num 28800 21600 28800 41400 21600 7200 5400 23400 28800 21600 ...
## $ Conference : Factor w/ 10 levels "AAC","ACC","Big-10",..: 10 10 10 10 10 10 10 10 10 10 ...
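To see why the commas have to go before the conversion to integers, here is a small sketch on made-up strings written in the same style as the raw Attendance column. The vector names are invented, and `fixed = TRUE` is an optional addition that is not in the assignment code.

```r
# Made-up attendance strings in the same "53,946" style as the raw column
attendance_raw <- c("53,946", "54,286", NA, "110,889")

# fixed = TRUE treats "," as a literal character rather than a regular expression
attendance_clean <- gsub(",", "", attendance_raw, fixed = TRUE)
attendance_int <- as.integer(attendance_clean)
print(attendance_int)

# With the comma still in place, as.integer() cannot parse the text and gives NA
with_comma <- suppressWarnings(as.integer("53,946"))
print(with_comma)
```

Missing values stay missing through both steps, so the two NA games in Attendance are still visible afterwards.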
You have now cleaned the data for our College Football Dataset.
# Part 3. Exploratory Data Analysis
Thus far we have imported and cleaned the data that we have for each of the games. Now we begin the exploratory data analysis. This step includes Variable Identification, Univariate Analysis, Bi-variate Analysis, Missing values treatment, Outlier treatment, Variable transformation, and Feature engineering. These are a lot of big words, but don’t worry. I am going to explain what each means.
Variable Identification - identifying the input and the output variables. There are many different names for these. Input variables are also called predictors, features, and independent variables. The output variable is also called the response or dependent variable.
In this assignment, I am going to tell you that the output variable is the Fill Rate unless I specify otherwise. This means that the Fill Rate will be on the y axis for most graphs that we do.
Missing Value Treatment - dealing with missing data in the dataset
summary(CFB)
## Date Team Time
## Min. :2000-08-31 00:00:00 Arkansas : 130 Length:3732
## 1st Qu.:2005-06-24 12:00:00 Nebraska : 130 Class :character
## Median :2009-10-31 00:00:00 Penn State : 130 Mode :character
## Mean :2009-12-03 10:01:09 Michigan State: 129
## 3rd Qu.:2014-09-27 00:00:00 Kansas State : 128
## Max. :2018-12-01 00:00:00 Texas A&M : 127
## (Other) :2958
## Opponent Rank Site TV
## Length:3732 NR :2668 Length:3732 Length:3732
## Class :character 19 : 59 Class :character Class :character
## Mode :character 23 : 51 Mode :character Mode :character
## 24 : 51
## 11 : 49
## 10 : 47
## (Other): 807
## Result Attendance Current Wins Current Losses
## Length:3732 Min. : 4513 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 36018 1st Qu.: 1.000 1st Qu.: 0.000
## Mode :character Median : 53663 Median : 3.000 Median : 2.000
## Mean : 54607 Mean : 3.049 Mean : 2.061
## 3rd Qu.: 75504 3rd Qu.: 5.000 3rd Qu.: 3.000
## Max. :110889 Max. :12.000 Max. :11.000
## NA's :2
## Stadium Capacity Fill Rate New Coach Tailgating
## Min. : 22113 Min. :0.08137 FALSE:3129 FALSE:2514
## 1st Qu.: 49225 1st Qu.:0.78404 TRUE : 603 TRUE :1218
## Median : 60000 Median :0.94100
## Mean : 60354 Mean :0.87895
## 3rd Qu.: 80321 3rd Qu.:1.00255
## Max. :107282 Max. :1.40399
## NA's :2
## PRCP SNOW SNWD TMAX
## Min. :0.00000 Min. :0.0000 Min. :0.00 Min. : 19.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 61.00
## Median :0.00000 Median :0.0000 Median :0.00 Median : 72.00
## Mean :0.08403 Mean :0.0166 Mean :0.03 Mean : 70.82
## 3rd Qu.:0.01000 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.: 82.00
## Max. :6.45000 Max. :5.3000 Max. :7.10 Max. :111.00
## NA's :20 NA's :1005 NA's :1049 NA's :29
## TMIN Opponent_Rank numericDate numericTime
## Min. : 0.00 NR :2936 Min. :11200 Min. : 3600
## 1st Qu.:38.00 1 : 53 1st Qu.:12958 1st Qu.:12600
## Median :49.00 2 : 42 Median :14548 Median :21600
## Mean :48.51 13 : 39 Mean :14581 Mean :23548
## 3rd Qu.:59.00 10 : 34 3rd Qu.:16340 3rd Qu.:39600
## Max. :81.00 3 : 32 Max. :17866 Max. :45000
## NA's :29 (Other): 596
## Conference
## Big-10 :755
## ACC :697
## Big-12 :570
## SEC :374
## Pac-12 :336
## Mid-American:281
## (Other) :719
From the output, we learn that TV has 582 NA’s, Attendance has 2 NA’s, Fill Rate has 2 NA’s, PRCP has 20 NA’s, SNOW and SNWD have over 1000 NA’s, and finally TMAX and TMIN have 29 NA’s.
An NA in TV means that the game was not on TV. The same goes for PRCP, SNOW and SNWD: we can assume that there was no precipitation of any kind. Replace the NA values for the weather variables with 0. Follow my example.
#Example of how to fill na values
CFB$TV[is.na(CFB$TV)] <- "Not on Television"
CFB$PRCP[is.na(CFB$PRCP)] <- 0
CFB$SNOW[is.na(CFB$SNOW)] <- 0
CFB$SNWD[is.na(CFB$SNWD)] <- 0
For Attendance and Fill Rate, determine which rows have NA values in the Attendance variable and output those row numbers. Now look at those rows and answer: why is there no Attendance for those games?
Answer: The game was canceled.
AttendanceNA <- CFB[is.na(CFB$Attendance),]
which(is.na(CFB$Attendance))
## [1] 662 3040
CFB <- CFB[-c(662,3040),]
Now for TMAX and TMIN. These variables are the hardest to infer, so let’s drop the NA values now. You will have a dataset that is 3700 observations long.
# na.omit() on a data frame takes no cols argument; it drops every row that has an NA in any column
CFB <- na.omit(CFB)
summary(CFB)
## Date Team Time
## Min. :2000-08-31 00:00:00 Nebraska : 130 Length:3700
## 1st Qu.:2004-11-26 18:00:00 Penn State : 130 Class :character
## Median :2009-10-24 00:00:00 Arkansas : 129 Mode :character
## Mean :2009-11-21 00:27:14 Michigan State: 129
## 3rd Qu.:2014-09-20 00:00:00 Kansas State : 128
## Max. :2018-12-01 00:00:00 Texas A&M : 127
## (Other) :2927
## Opponent Rank Site TV
## Length:3700 NR :2640 Length:3700 Length:3700
## Class :character 19 : 59 Class :character Class :character
## Mode :character 23 : 51 Mode :character Mode :character
## 24 : 50
## 11 : 49
## 10 : 47
## (Other): 804
## Result Attendance Current Wins Current Losses
## Length:3700 Min. : 4513 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 35986 1st Qu.: 1.000 1st Qu.: 0.000
## Mode :character Median : 53654 Median : 3.000 Median : 2.000
## Mean : 54633 Mean : 3.051 Mean : 2.056
## 3rd Qu.: 75579 3rd Qu.: 5.000 3rd Qu.: 3.000
## Max. :110889 Max. :12.000 Max. :11.000
##
## Stadium Capacity Fill Rate New Coach Tailgating
## Min. : 22113 Min. :0.08137 FALSE:3106 FALSE:2502
## 1st Qu.: 49225 1st Qu.:0.78399 TRUE : 594 TRUE :1198
## Median : 60000 Median :0.94129
## Mean : 60373 Mean :0.87926
## 3rd Qu.: 80321 3rd Qu.:1.00275
## Max. :107282 Max. :1.40399
##
## PRCP SNOW SNWD TMAX
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. : 19.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 61.00
## Median :0.00000 Median :0.00000 Median :0.00000 Median : 72.00
## Mean :0.08396 Mean :0.01227 Mean :0.02176 Mean : 70.82
## 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 82.00
## Max. :6.45000 Max. :5.30000 Max. :7.10000 Max. :111.00
##
## TMIN Opponent_Rank numericDate numericTime
## Min. : 0.00 NR :2913 Min. :11200 Min. : 3600
## 1st Qu.:38.00 1 : 51 1st Qu.:12749 1st Qu.:12600
## Median :49.00 2 : 42 Median :14541 Median :21600
## Mean :48.51 13 : 39 Mean :14569 Mean :23517
## 3rd Qu.:59.00 10 : 34 3rd Qu.:16333 3rd Qu.:39600
## Max. :81.00 3 : 32 Max. :17866 Max. :45000
## (Other): 589
## Conference
## Big-10 :753
## ACC :696
## Big-12 :567
## SEC :354
## Pac-12 :335
## Mid-American:279
## (Other) :716
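One detail about the drop step above: for a data frame, base R's `na.omit()` removes every row that has an NA in any column, not only in the columns you have in mind. If the goal is to drop rows based on chosen columns only, one option is `complete.cases()` applied to just those columns. A sketch on made-up data, where the `weather` data frame and its `Wind` column are hypothetical:

```r
# Made-up weather rows; Wind is a hypothetical extra column with its own NA
weather <- data.frame(
  TMAX = c(80, NA, 75, 60),
  TMIN = c(60, 50, NA, 40),
  Wind = c(5, 3, 4, NA)
)

# na.omit() removes rows with an NA in ANY column
all_cols <- na.omit(weather)

# complete.cases() on selected columns removes rows with an NA in TMAX or TMIN only
temp_only <- weather[complete.cases(weather[, c("TMAX", "TMIN")]), ]

print(nrow(all_cols))   # rows with no NA anywhere
print(nrow(temp_only))  # rows with both temperatures present
```

Which of the two is appropriate depends on whether the NA values in the other columns have already been handled, as they were for TV, PRCP, SNOW and SNWD here.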
Univariate Analysis - exploring variables one by one.
Create a histogram of Fill Rate, name this chart “Histogram of Fill Rate”, and set the binwidth = 0.05.
Create a histogram of the Attendance variable, name this chart “Histogram of Attendance”, with no binwidth.
ggplot(CFB, aes(x = Attendance)) + geom_histogram() + ggtitle("Histogram of Attendance")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at these two charts illustrates why standardization is important. The Fill Rate is the standardized version of the Attendance data.
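A sketch of the Fill Rate histogram described above, together with the standardization idea, on made-up numbers. The `games` data frame, its column names, and its values are invented for illustration; with the real data you would pass CFB and the `Fill Rate` column instead.

```r
library(ggplot2)

# Invented games for two stadiums of very different size
games <- data.frame(
  Attendance = c(20000, 21000, 19000, 90000, 95000, 85000),
  Capacity   = c(22000, 22000, 22000, 100000, 100000, 100000)
)

# Dividing by capacity puts both stadiums on the same scale
games$FillRate <- games$Attendance / games$Capacity

# Histogram with the bin width requested in the task
p <- ggplot(games, aes(x = FillRate)) +
  geom_histogram(binwidth = 0.05) +
  ggtitle("Histogram of Fill Rate")
print(p)
```

The raw attendance values fall into two groups that are far apart, while the fill rates of both stadiums sit in the same narrow range, which is the point of standardizing.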
Bi-variate Analysis - exploring how variables relate to one another
Now we are beginning to get into some fun with the data. We will be looking at the apparent relationship between two variables. In this case, we will be looking at the relationship of a couple of the variables against the Fill Rate.
Plot Fill Rate over the Date for each team.
ggplot(CFB, aes(x = Date, y = `Fill Rate`, color = Conference)) + geom_point(size = .1) + ggtitle("Fill Rate by Team") + facet_wrap(~Team)
For BYU, plot Fill Rate over the Date, with color set to New Coach.
byu <- CFB %>% filter(Team == "BYU")
ggplot(byu, aes(x = Date, y = `Fill Rate`, color = `New Coach`)) + geom_point() + ggtitle("New Coach Fill Rate at BYU")
For BYU, plot Current Losses over Date.
ggplot(byu, aes(x = Date, y = `Current Losses`)) + geom_point() + ggtitle("Losses by Year for BYU Football") + geom_smooth(method = "lm", se = FALSE)
This might help to explain why the stadium was so full in 2008-2010, but it doesn’t explain the drop in fill rate starting in 2011. This might be a factor also.
Outlier treatment - dealing with observations that appear to be too far away from the pattern of the data.
For this assignment, there are no outliers.
Feature engineering - creating new variables to extract more information from the data. There are two ways to do this. First, by transforming the scale or relationship of the data. Second, by creating new variables based off the ones that we have.
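engineer <- function(df = CFB, team) {
  # Keep only the games for the requested team
  team_games <- df[df$Team == team, ]
  Dev <- sd(team_games$Attendance)    # standard deviation of attendance
  Avg <- mean(team_games$Attendance)  # average attendance
  Avgplus <- Avg + Dev                # one standard deviation above the average
  Avgminus <- Avg - Dev               # one standard deviation below the average
  print(Dev)
  print(Avg)
  print(Avgplus)
  print(Avgminus)
}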
#Use your function with BYU as the team, another time with SMU, and another time with Nebraska
engineer(CFB, "BYU")
## [1] 4296.921
## [1] 60006.33
## [1] 64303.25
## [1] 55709.41
engineer(CFB, "SMU")
## [1] 8063.892
## [1] 19889.25
## [1] 27953.14
## [1] 11825.36
engineer(CFB, "Nebraska")
## [1] 5075.05
## [1] 84493.94
## [1] 89568.99
## [1] 79418.89
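Printed values are hard to reuse. As a sketch of the second kind of feature engineering (creating a new variable from the ones we have), the variant below returns the statistics and uses them to flag unusually well-attended games. The function name `attendance_band`, the column `HighAttendance`, and the data are all made up for illustration.

```r
# Made-up attendance figures for two hypothetical teams
toy <- data.frame(
  Team = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Attendance = c(100, 110, 90, 140, 20, 22, 18, 40),
  stringsAsFactors = FALSE
)

# Return the mean and the mean +/- one standard deviation for one team
attendance_band <- function(df, team) {
  x <- df$Attendance[df$Team == team]
  c(avg = mean(x), lower = mean(x) - sd(x), upper = mean(x) + sd(x))
}

band_a <- attendance_band(toy, "A")
print(band_a)

# New feature: TRUE when a game is more than one SD above its own team's mean
upper_by_team <- sapply(unique(toy$Team), function(tm) attendance_band(toy, tm)[["upper"]])
toy$HighAttendance <- unname(toy$Attendance > upper_by_team[toy$Team])
print(toy)
```

Comparing each game with its own team's band, rather than with one overall threshold, keeps small and large stadiums comparable.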
# Part 4. Data Modeling
We have now reviewed most of what we have learned in R up to this point. Next, we will start on what we have learned about correlation and regression in this week’s DataCamp. In this section, we will practice finding linear correlations between variables, adding best fit lines to data, and creating statistical models.
First, a reminder: correlation is only good for linear data! We can find the correlation by multiplying each x value’s deviation from the mean of x by the matching y value’s deviation from the mean of y, adding up those products, and dividing by the square root of the product of the summed squared deviations of x and of y. [For those interested: https://slideplayer.com/slide/6992631/]
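The calculation in the first reminder can be written out step by step and checked against R's built-in `cor()`. This is a sketch on made-up numbers; the variable names are invented for illustration.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviation of each value from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation: sum of the products over the root of the summed squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r_manual)
print(cor(x, y))  # built-in function, should agree with r_manual
```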
Second reminder: the line of best fit, or trendline, is the line that minimizes the sum of the squared vertical distances from the points to it. When the line has an intercept, the total distance of the points above the line equals the total distance of the points below it.
Third reminder: the generic statistical model is y = f(x) + noise. Most of what you will do in creating models from here on out is determining ways to better fit the response variable (y) with the inputs (x).
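The second and third reminders can be tried out together on simulated data. The sketch below assumes a made-up f(x) = 2 + 0.5 * x and normal noise. It then checks that the distances above and below the fitted line cancel, and that moving the line makes the squared distances larger.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulate y = f(x) + noise with a made-up f(x) = 2 + 0.5 * x
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# With an intercept, the residuals above and below the line cancel out
print(sum(res))

# Shifting the fitted line up by 0.2 increases the sum of squared residuals
sse_fit <- sum(res^2)
sse_shifted <- sum((y - (fitted(fit) + 0.2))^2)
print(c(sse_fit, sse_shifted))
```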
Find the correlation of Fill Rate to the TMAX. Assume that earlier we didn’t remove all the NA values from TMAX, so please use the appropriate argument to disregard the NA values. Look here for the format we’re looking for.
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `TMAX`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 0.0573
Remember that .99 would be a very strong linear correlation and .01 would be a very weak correlation. Our value of 0.0573 would also be a weak correlation.
Look at other correlations with Fill Rate. Check its correlation against Current Wins, Current Losses, PRCP, SNOW, and TMIN. Make sure to use the correct argument when there are NA values.
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `Current Wins`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 0.136
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `Current Losses`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 -0.348
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `PRCP`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 0.00919
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `SNOW`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 -0.00581
CFB%>%
summarize(N = n(), r = cor(`Fill Rate`, `TMIN`, use = "pairwise.complete.obs"))
## # A tibble: 1 x 2
## N r
## <int> <dbl>
## 1 3700 0.0355
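The five calls above repeat the same pattern with a different column each time. As a sketch of a more compact alternative, the correlations can be computed in one pass with `sapply()`. The `toy` data frame and its column names are made-up stand-ins for CFB.

```r
# Made-up stand-in for CFB with a few numeric columns and one NA
toy <- data.frame(
  FillRate = c(0.90, 0.95, 0.80, 1.00, 0.70, 0.85),
  Wins     = c(3, 4, 1, 5, 0, 2),
  Losses   = c(1, 0, 3, 0, 5, 2),
  PRCP     = c(0, 0.1, NA, 0, 0.5, 0)
)

predictors <- c("Wins", "Losses", "PRCP")

# Correlate FillRate with each predictor, skipping pairs that contain an NA
r_values <- sapply(predictors, function(v) {
  cor(toy$FillRate, toy[[v]], use = "pairwise.complete.obs")
})
print(round(r_values, 3))
```

With the real data, column names that contain spaces work the same way inside the double brackets, for example `CFB[["Current Wins"]]`.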
We have already fitted linear models simply within ggplot, now we want to practice creating the models themselves so that we can learn from them.
Create a linear model of Fill Rate determined by TMAX.
lm1 <- lm(`Fill Rate` ~ TMAX, data = CFB)
summary(lm1)
##
## Call:
## lm(formula = `Fill Rate` ~ TMAX, data = CFB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80001 -0.09716 0.06258 0.12531 0.54336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8318411 0.0138903 59.886 < 2e-16 ***
## TMAX 0.0006695 0.0001920 3.487 0.000493 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.173 on 3698 degrees of freedom
## Multiple R-squared: 0.003278, Adjusted R-squared: 0.003008
## F-statistic: 12.16 on 1 and 3698 DF, p-value: 0.0004934
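One way to connect this model back to the correlations: in a regression with a single predictor, Multiple R-squared is the square of the correlation. In the output above, 0.0573 squared is about 0.0033, which matches the reported 0.003278 up to the rounding of r. A sketch of the same check on made-up data (the vectors `temp` and `fill` are invented):

```r
# Made-up data for illustration
temp <- c(55, 60, 65, 70, 75, 80, 85, 90)
fill <- c(0.80, 0.86, 0.82, 0.90, 0.88, 0.93, 0.89, 0.95)

fit <- lm(fill ~ temp)
fit_summary <- summary(fit)

# Slope: estimated change in fill for a one-unit increase in temp
slope <- coef(fit)[["temp"]]
r_squared <- fit_summary$r.squared

print(slope)
print(r_squared)
print(cor(temp, fill)^2)  # agrees with r_squared for a single-predictor model
```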
Create a linear model of Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses.
lm2 <- lm(`Fill Rate` ~ TMAX + Team + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses`, data = CFB)
summary(lm2)
##
## Call:
## lm(formula = `Fill Rate` ~ TMAX + Team + TMIN + Conference +
## Tailgating + `Current Wins` + `Current Losses`, data = CFB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52942 -0.05310 0.00461 0.05392 0.57294
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9461023 0.0212643 44.493 < 2e-16 ***
## TMAX 0.0008184 0.0002603 3.145 0.001677 **
## TeamBaylor -0.1672665 0.0144763 -11.555 < 2e-16 ***
## TeamBoise State -0.0321757 0.0145398 -2.213 0.026964 *
## TeamBYU -0.0192351 0.0145047 -1.326 0.184881
## TeamClemson -0.0019864 0.0145751 -0.136 0.891604
## TeamColorado -0.0507471 0.0147523 -3.440 0.000588 ***
## TeamFlorida State -0.0283639 0.0145734 -1.946 0.051697 .
## TeamGeorgia Tech -0.0550764 0.0145214 -3.793 0.000151 ***
## TeamIndiana -0.1981630 0.0142942 -13.863 < 2e-16 ***
## TeamIowa State -0.0017658 0.0142410 -0.124 0.901325
## TeamKansas State 0.0172026 0.0139940 1.229 0.219046
## TeamMarshall -0.2667618 0.0145135 -18.380 < 2e-16 ***
## TeamMichigan State 0.0406111 0.0140807 2.884 0.003947 **
## TeamNC State 0.0155940 0.0142061 1.098 0.272410
## TeamNebraska 0.0854354 0.0139851 6.109 1.11e-09 ***
## TeamNevada -0.3133621 0.0149575 -20.950 < 2e-16 ***
## TeamNorthern Illinois -0.2959506 0.0148346 -19.950 < 2e-16 ***
## TeamNotre Dame 0.0339272 0.0143101 2.371 0.017798 *
## TeamOklahoma 0.0450473 0.0144802 3.111 0.001879 **
## TeamOle Miss -0.0151866 0.0150482 -1.009 0.312946
## TeamPenn State 0.0153855 0.0140734 1.093 0.274365
## TeamRutgers -0.1058482 0.0142791 -7.413 1.53e-13 ***
## TeamSMU -0.3280568 0.0150522 -21.795 < 2e-16 ***
## TeamSyracuse -0.1621237 0.0143989 -11.259 < 2e-16 ***
## TeamTexas A&M 0.0130434 0.0143274 0.910 0.362680
## TeamToledo -0.1437265 0.0145384 -9.886 < 2e-16 ***
## TeamTroy -0.2779988 0.0174046 -15.973 < 2e-16 ***
## TeamUCLA -0.2551988 0.0145898 -17.492 < 2e-16 ***
## TeamVirginia -0.1387511 0.0144396 -9.609 < 2e-16 ***
## TeamWashington -0.0281579 0.0146768 -1.919 0.055121 .
## TeamWest Virginia -0.0577110 0.0159326 -3.622 0.000296 ***
## TeamWestern Kentucky -0.1918942 0.0184774 -10.385 < 2e-16 ***
## TeamWisconsin 0.0266625 0.0141477 1.885 0.059565 .
## TMIN -0.0006619 0.0002848 -2.324 0.020175 *
## ConferenceACC NA NA NA NA
## ConferenceBig-10 NA NA NA NA
## ConferenceBig-12 NA NA NA NA
## ConferenceCUSA NA NA NA NA
## ConferenceIndependent NA NA NA NA
## ConferenceMid-American NA NA NA NA
## ConferenceMWC NA NA NA NA
## ConferencePac-12 NA NA NA NA
## ConferenceSEC NA NA NA NA
## TailgatingTRUE NA NA NA NA
## `Current Wins` 0.0070846 0.0010809 6.555 6.36e-11 ***
## `Current Losses` -0.0181029 0.0011859 -15.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.112 on 3663 degrees of freedom
## Multiple R-squared: 0.5861, Adjusted R-squared: 0.582
## F-statistic: 144.1 on 36 and 3663 DF, p-value: < 2.2e-16
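The NA rows for Conference and Tailgating, and the note "10 not defined because of singularities", are what `lm` reports when a predictor adds no information beyond the other predictors. That is consistent with each team having a single Conference and Tailgating value in this dataset, so that Team already determines both. A sketch of the same effect on made-up data:

```r
# Made-up data: each team belongs to exactly one conference
toy <- data.frame(
  Team       = factor(rep(c("A", "B", "C", "D"), each = 3)),
  Conference = factor(rep(c("East", "East", "West", "West"), each = 3)),
  y          = c(1.0, 1.2, 0.9, 2.1, 2.0, 2.3, 3.1, 2.9, 3.0, 4.2, 4.0, 3.9)
)

fit <- lm(y ~ Team + Conference, data = toy)

# ConferenceWest comes back NA: it is fully determined by the Team columns
print(coef(fit))

# Dropping the redundant predictor leaves the fitted values unchanged
fit_team <- lm(y ~ Team, data = toy)
same_fit <- all.equal(fitted(fit), fitted(fit_team))
print(same_fit)
```

In other words, the NA coefficients do not indicate a failed model; they mark terms that could be removed without changing the predictions.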
Predict the Fill Rate for the stadium. I have split the data into training and validation datasets so that you can check how good your model is against the data.

- Create a linear model on the train dataset of Fill Rate determined by Team, TMAX, TMIN, Conference, Tailgating, Current Wins, and Current Losses
- Predict the Fill Rate for the test dataset, assign this to a new column of test

# Test and Training set
set.seed(121) #set.seed is included so that everyone gets the same results
testNum <- sample(nrow(CFB), 100) #Creates a vector of row numbers to include
test <- CFB[c(testNum),] #Subsets the data so that it only has the test rows in it
train <- CFB[-c(testNum),] #Takes the test numbers out
# Build Your Model and Predict
lm3 <- lm(`Fill Rate` ~ TMAX + Team + TMIN + Conference + Tailgating + `Current Wins` + `Current Losses`, data = train)
# Predicted Fill Rate for each test game, stored as a new column of test
test$prediction <- predict(lm3, test)
# Average signed difference (predicted minus actual) over the test rows
avgdiff <- mean(test$prediction - test$`Fill Rate`)
print(avgdiff)
## [1] 0.01585466
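Well, looks like in the future we will have to improve our model. On average, the predicted Fill Rate is about 0.016 (roughly 1.6 percentage points) above the actual Fill Rate. Because this is a signed average, over-predictions and under-predictions cancel each other out, so it says more about bias than about how far off each individual prediction is.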
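The average computed above keeps the sign of each difference, so a prediction that is too high can offset one that is too low. Two common alternatives that measure the typical size of a miss are the mean absolute error and the root mean squared error. A sketch on made-up numbers (the vectors `actual` and `predicted` are invented):

```r
# Made-up actual and predicted fill rates for illustration
actual    <- c(0.90, 0.80, 1.00, 0.70)
predicted <- c(1.00, 0.70, 0.90, 0.80)

errors <- predicted - actual

mean_error <- mean(errors)    # signed: positive and negative errors cancel
mae  <- mean(abs(errors))     # mean absolute error: typical size of a miss
rmse <- sqrt(mean(errors^2))  # root mean squared error: penalizes big misses

print(c(mean_error = mean_error, mae = mae, rmse = rmse))
```

Here every prediction misses by 0.1, yet the signed average is essentially zero, which is why a small signed average alone does not show that a model predicts well.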
Now knit the file to an HTML and submit it :)