Due: Mon March 4th, 2024 11:59pm
The purpose of this week’s data dive is for you to get experience running ANOVA tests and building regression models.
Your RMarkdown notebook for this data dive should contain the following:
Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.
Select a categorical column of data (explanatory variable) that you expect might influence the response variable.
Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions.
If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.
Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.
Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.
Build a linear regression model of the response using just this column, and evaluate its fit.
Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
For this week’s data dive I will be using NFL standings data, which comes from the Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
The most valuable continuous column given the context of the data is wins. In the NFL, the success of a team’s regular season is judged by its record, which this data captures in the wins and loss columns. I will use wins as my response variable.
For my categorical (explanatory) variable I will use the sb_winner column, which describes whether the team won the Super Bowl that year. Since it has only two categories, no consolidation is needed.
Null Hypothesis: Teams that won the Super Bowl have the same average number of wins as teams that did not.
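Before running the test, it can help to look at the size, mean, and spread of each group, since the ANOVA table only says whether the means differ and not by how much. The following is a minimal sketch: `describe_groups` is a hypothetical helper that is not part of the original notebook, and the toy data frame and its category labels are invented for illustration, not taken from `standings`.

```r
# Hypothetical helper (not in the original notebook): per-group count,
# mean, and standard deviation of a response column.
describe_groups <- function(df, response, group) {
  y <- df[[response]]
  g <- df[[group]]
  counts <- tapply(y, g, length)
  data.frame(
    group = names(counts),
    n     = as.vector(counts),
    mean  = as.vector(tapply(y, g, mean)),
    sd    = as.vector(tapply(y, g, sd))
  )
}

# Toy data for illustration only; labels and values are assumptions.
toy <- data.frame(
  wins      = c(12, 13, 8, 6, 10, 4),
  sb_winner = c("Won Superbowl", "Won Superbowl", "No Superbowl",
                "No Superbowl", "No Superbowl", "No Superbowl")
)
describe_groups(toy, "wins", "sb_winner")

# With the real data the call would be:
# describe_groups(standings, "wins", "sb_winner")
```

If the group sizes turn out to be very unequal, that is worth keeping in mind when reading the ANOVA result.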
m <- aov(wins ~ sb_winner, data = standings)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## sb_winner 1 309 308.54 34.13 8.23e-09 ***
## Residuals 636 5749 9.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table gives an F value of 34.13. The F value is the ratio of the variance between the group means to the variance within the groups, so a value this far above 1 indicates that the difference in wins between Super Bowl winners and other teams is much larger than the variation within each group would explain.
Additionally, we see a very small p-value (8.23e-09). If the null hypothesis were true, there would be a very small probability of observing an F value at least as extreme as ours.
We therefore reject the null hypothesis: there is strong evidence that the average number of wins differs between Super Bowl winners and teams that did not win the Super Bowl.
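To connect the R output to the conclusion, the F value and p-value can be reproduced by hand from the numbers printed in the ANOVA table. This is a sketch using the rounded values copied from the table, so the results will match the printed output only up to rounding. The eta-squared line is an addition of mine (not in the original analysis) that shows what share of the total variation in wins is associated with sb_winner.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_between <- 308.54  # Mean Sq for sb_winner
ms_within  <- 9.04    # Mean Sq for Residuals
df_between <- 1
df_within  <- 636

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# p-value: upper tail of the F distribution with (1, 636) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: sum of squares for sb_winner over the total sum of squares
eta_sq <- 309 / (309 + 5749)

round(f_value, 2)
signif(p_value, 3)
round(eta_sq, 3)
```

The eta-squared value comes out to roughly 5%, so although the difference in means is statistically clear, sb_winner accounts for only a small share of the overall variation in wins.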
For my linear regression model I will model wins as a function of points_for. First, I plot the two columns to check that the relationship is roughly linear.
library(ggplot2)
# points_for is the explanatory variable (x) and wins is the response (y),
# matching the model wins ~ points_for fitted below
standings |>
  ggplot(mapping = aes(x = points_for, y = wins)) +
  geom_point(size = 2, color = 'darkblue')
Using geom_smooth() with the linear model method, we can draw a line of best fit.
standings |>
  ggplot(mapping = aes(x = points_for, y = wins)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'
To obtain the coefficients, we need to fit a linear model.
model <- lm(wins ~ points_for, standings)
model$coefficients
## (Intercept) points_for
## -3.06141069 0.03153369
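To understand the coefficients we need to know what they represent. The model has the form
y = mx + b
where y is wins and x is points_for. The value of m represents the slope of the line. b represents the y-intercept, or the point on the y-axis that the line passes through at x = 0. In our model, the y-intercept is -3.06141069. Taken literally, this is the predicted number of wins for a team that scores zero points in a season. A negative win total is impossible, so the intercept is an extrapolation outside the range of the data and should not be interpreted on its own.
The slope of our model is 0.03153369, which means that each additional point scored over a season is associated with an increase of 0.03153369 in predicted wins. Equivalently, about 32 additional points are associated with one additional win.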
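The coefficients can be turned into concrete predictions. The sketch below plugs the printed intercept and slope into the line equation. `predict_wins` and `points_needed` are hypothetical helper names introduced here for illustration, and the 323-point input is the Miami 2000 row shown in the tibble printout above (11 actual wins).

```r
# Coefficients copied from the model output above
intercept <- -3.06141069
slope     <- 0.03153369

# Hypothetical helper: predicted wins for a given season points total
predict_wins <- function(points_for) {
  intercept + slope * points_for
}

# Hypothetical helper: season points total at which the line predicts
# a target number of wins (the line equation solved for x)
points_needed <- function(target_wins) {
  (target_wins - intercept) / slope
}

predict_wins(323)   # Miami 2000 scored 323 points and actually won 11 games
1 / slope           # additional points associated with one additional win
points_needed(8)    # points total at which the line predicts an 8-win season
```

The gap between the prediction for Miami and its actual 11 wins shows that points_for alone leaves a fair amount unexplained for individual teams.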
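The model suggests that teams that score more points tend to win more games, which makes sense! Because this is observational data, the slope describes an association across team seasons and not a guaranteed effect of scoring more.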
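To evaluate how well the model fits, one option is to pull the R-squared and residual standard error out of `summary()` and to look at a residual plot. The following is a minimal sketch: `fit_summary` is a hypothetical helper, and the toy data frame is invented for illustration so that the block runs on its own. The values it prints are not results for the NFL model.

```r
# Hypothetical helper: key fit statistics for a fitted lm object
fit_summary <- function(model) {
  s <- summary(model)
  c(
    r_squared     = s$r.squared,      # share of variance in the response explained
    adj_r_squared = s$adj.r.squared,  # R-squared adjusted for number of predictors
    residual_se   = s$sigma           # typical size of a residual, in wins
  )
}

# Toy data for illustration only; these are not values from `standings`.
toy <- data.frame(
  points_for = c(200, 250, 300, 350, 400, 450),
  wins       = c(3, 5, 6, 9, 10, 13)
)
toy_model <- lm(wins ~ points_for, data = toy)
fit_summary(toy_model)

# With the notebook's model the calls would be:
# fit_summary(model)
# plot(model, which = 1)  # residuals vs fitted, to check linearity
```

An R-squared close to 1 and residuals scattered evenly around zero would support the linear fit, while a curved pattern in the residual plot would suggest the relationship is not linear.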