Due: Mon Feb 19, 2024 11:59pm
Your RMarkdown notebook for this data dive should contain the following:
- Build at least three pairs of variables.
- For each pair of variables, include at least one column that you created (i.e., calculated based on others).
- All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., ['small', 'medium', 'large'] is okay, but ["apples", "oranges", "bananas"] is not).
- At least one pair should be a response variable and an explanatory variable.
- Plot a visualization for each relationship, and draw some conclusions based on the plot.
- Calculate the appropriate correlation coefficient for each of these combinations.
- Build a confidence interval for each of the response variable(s). Provide a detailed conclusion about the response variable (i.e., the population) based on your confidence interval.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be investigated further.
For this week's data dive I will be using NFL standings data, which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
The goal of this data dive is to investigate the correlation between at least three variable pairs. Additionally, I need to calculate a third variable for each pair based on the others. Looking at the standings data, I have decided to create the following variable pairs:
| Pair 1 | Pair 2 | Pair 3 |
|---|---|---|
| year | wins | points_for |
| points_for | points_against | defensive_ranking |
| points_for_year_avg (each year's average points_for) | points_against / wins | defensive_ranking / points_for |
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(boot)
# Create the points_for_year_avg column
standings <- standings |>
  group_by(year) |>
  mutate(points_for_year_avg = mean(points_for))

# Select the columns for pair 1
p1 <- standings |>
  select(year, points_for, points_for_year_avg)

# Ungroup standings by year
standings <- ungroup(standings)

# Create the points_against_per_win column
standings <- standings |>
  mutate(points_against_per_win = points_against / wins)

# Select the columns for pair 2
p2 <- standings |>
  select(wins, points_against, points_against_per_win)

# Create the defensive_rank_per_points_for column
standings <- standings |>
  mutate(defensive_rank_per_points_for = defensive_ranking / points_for)

# Select the columns for pair 3
p3 <- standings |>
  select(points_for, defensive_ranking, defensive_rank_per_points_for)
For each of the pair groups, we will investigate the correlation between the variables by plotting. In each plot, the x values represent the explanatory variable, while the y values represent the response variable. Following each visual I will:
- Explain the meaning of the results.
- Calculate the correlation coefficient between the variables.
- Calculate a 95% confidence interval via bootstrapping, using the following function.
# Bootstrap confidence interval function
boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 10000) {
  set.seed(123)
  # the `boot` library needs the statistic function in this format
  boot_func <- \(x, i) func(x[i])
  b <- boot(data = v, statistic = boot_func, R = n_iter)
  boot.ci(b, conf = conf, type = "perc")
}
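For intuition, here is a minimal base-R sketch of the same percentile bootstrap without the `boot` package. The helper name manual_boot_ci is hypothetical and purely illustrative; the analysis below uses boot_ci.

# A hand-rolled percentile bootstrap (illustrative sketch, not used below)
manual_boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 10000) {
  set.seed(123)
  # resample the vector with replacement, applying `func` to each resample
  stats <- replicate(n_iter, func(sample(v, length(v), replace = TRUE)))
  # the percentile interval is just the empirical quantiles of the statistics
  alpha <- 1 - conf
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}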
# Visualization
p1 |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = points_for)) +
  theme_minimal()
# Correlation Calculation - spearman method used since year is discrete
cor(p1$year, p1$points_for, method = 'spearman')
## [1] 0.1714296
Analyzing the results, we can see there is a weak positive correlation between time and points_for, suggesting that points_for has increased slightly over the years.
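As a quick sanity check on the choice of method, one could also compute the Pearson coefficient on the same pair; if the relationship is roughly monotone, the two methods should tell a similar story (a sketch, output not shown here):

# Compare against Pearson as a sanity check (output not shown)
cor(p1$year, p1$points_for, method = 'pearson')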
# Bootstrapped Confidence Interval for points_for mean.
boot_ci(v = p1$points_for, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (344.9, 355.8 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_for median
boot_ci(v = p1$points_for, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (341, 355 )
## Calculations and Intervals on Original Scale
Additionally, we can be 95% confident that the true mean of points_for falls between 344.9 and 355.8, and 95% confident that the true median of points_for falls between 341 and 355.
p1 |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = points_for_year_avg)) +
  theme_minimal()
# Correlation Calculation - spearman method used since year is discrete
cor(p1$year, p1$points_for_year_avg, method = 'spearman')
## [1] 0.8417326
Our results show a very strong positive correlation between time and the average points_for in a year!
# Bootstrapped Confidence Interval for points_for_year_avg mean.
boot_ci(v = p1$points_for_year_avg, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (349.2, 351.4 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_for_year_avg median
boot_ci(v = p1$points_for_year_avg, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (347.5, 352.5 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_for_year_avg falls between 349.2 and 351.4, and the true median between 347.5 and 352.5. These results are similar to the mean and median of points_for, which we would expect. It does appear that points_for_year_avg has tighter confidence intervals. Perhaps there are a few outliers in points_for that widen the bootstrap intervals, whereas the per-year averages vary much less from sample to sample.
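One quick way to probe this hypothesis is to compare the spreads directly; a sketch (output not shown here):

# Compare the spread of raw points_for with the spread of the per-year averages
standings |>
  summarise(
    sd_points_for          = sd(points_for),
    sd_points_for_year_avg = sd(points_for_year_avg)
  )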
# Visualization
p1 |>
  ggplot() +
  geom_point(mapping = aes(y = points_for, x = points_for_year_avg)) +
  theme_minimal()
cor(p1$points_for, p1$points_for_year_avg)
## [1] 0.2080149
There does appear to be a slight positive correlation between the average number of points_for in a year and points_for. Remember that points_for_year_avg is the mean of points_for for a single year, so if a particular year has high points_for, it makes sense that points_for_year_avg would also be higher.
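Since points_for_year_avg is constant within a year, this correlation reflects how much of the variation in points_for sits between years rather than within them. A sketch that quantifies the within-year spread (output not shown here):

# Average within-year spread of points_for
standings |>
  group_by(year) |>
  summarise(within_year_sd = sd(points_for)) |>
  summarise(avg_within_year_sd = mean(within_year_sd))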
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = wins, y = points_against)) +
  theme_minimal()
cor(p2$wins, p2$points_against, method = 'spearman')
## [1] -0.6798108
It appears as though the number of wins a team has is negatively correlated with the number of points scored against it.
# Bootstrapped Confidence Interval for points_against mean.
boot_ci(v = p2$points_against, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (345.6, 354.9 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_against median
boot_ci(v = p2$points_against, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (341, 353 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_against falls between 345.6 and 354.9, and the true median between 341 and 353. Interestingly, these intervals closely resemble the intervals for points_for.
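The close match is not entirely a coincidence: every point scored counts as points_for for one team and points_against for its opponent, so the league-wide totals (and hence the means) must balance exactly. A quick check (output not shown here):

# League-wide points_for and points_against totals should match
standings |>
  summarise(
    total_for     = sum(points_for),
    total_against = sum(points_against)
  )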
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = wins, y = points_against_per_win)) +
  theme_minimal()
cor(p2$wins, p2$points_against_per_win, method = 'spearman')
## [1] -0.9691891
This result shows a strong negative correlation between the number of wins and the number of points against per win. Remember that points_against_per_win is simply
points_against / wins
It makes sense that this value would be highly correlated with the number of wins.
The shape of this plot can be explained by knowing that teams with fewer wins will have many more points scored against them since their last win than a team that wins a lot of games. Teams with 0 wins will have an infinitely high points_against_per_win because they have yet to win! And a team with lots of wins will have a much lower value because it divides its points_against among all of its wins.
The ‘thickness’ of points_against_per_win at each win count is also interesting to assess. It appears the thickness decreases as the number of wins increases.
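One caveat worth checking before bootstrapping: winless seasons make points_against_per_win infinite, which can interfere with resampled means. A quick sketch to see which team-seasons are affected (output not shown here):

# List winless seasons, where points_against_per_win is infinite
standings |>
  filter(wins == 0) |>
  select(team_name, year, points_against, points_against_per_win)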
# Bootstrapped Confidence Interval for points_against_per_win mean.
boot_ci(v = p2$points_against_per_win, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1302 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (54, 63 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_against_per_win median
boot_ci(v = p2$points_against_per_win, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (40.65, 47.31 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_against_per_win falls between 54 and 63, and the true median between 40.65 and 47.31.
The mean represents the average points scored against a team before its next win, which could be a few games' worth of points if a team goes on a losing streak. You could estimate the average number of games a team loses in a row by dividing the mean of points_against_per_win by the average points_against in a single game. Let's try this out.
First I need to calculate the average points_against in a single game. To do this I need the number of games in a season. The number of games in an NFL season has changed in the past, so it is worth verifying this number.
standings |>
  group_by(year) |>
  summarise(games_played = mean(wins + loss))
## # A tibble: 20 × 2
## year games_played
## <dbl> <dbl>
## 1 2000 16
## 2 2001 16
## 3 2002 15.9
## 4 2003 16
## 5 2004 16
## 6 2005 16
## 7 2006 16
## 8 2007 16
## 9 2008 15.9
## 10 2009 16
## 11 2010 16
## 12 2011 16
## 13 2012 15.9
## 14 2013 15.9
## 15 2014 15.9
## 16 2015 16
## 17 2016 15.9
## 18 2017 16
## 19 2018 15.9
## 20 2019 15.9
Some years average 16 games and others slightly fewer. The years with fewer games are years with ties, which are not recorded in this dataset. Let's assume 16-game seasons.
Now we can calculate the average points_against per game:
standings |>
  summarise(
    avg_points_against_per_game = mean(points_against / 16)
  )
## # A tibble: 1 × 1
## avg_points_against_per_game
## <dbl>
## 1 21.9
Now we can calculate the number of games a team is expected to lose in a row.
If the average team gives up between 54 and 63 points per win, then we would expect the average team to lose between…
54 / 21.89273
## [1] 2.466572
63 / 21.89273
## [1] 2.877668
2.47 and 2.88 games in a row.
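For reference, the same arithmetic consolidated into a few lines, reusing the interval bounds from the bootstrap output above (the variable names here are hypothetical):

# Implied games' worth of points allowed per win
avg_pa_per_game <- mean(standings$points_against / 16)  # about 21.9
ci_mean <- c(54, 63)  # bootstrap CI for the mean of points_against_per_win
ci_mean / avg_pa_per_game  # roughly 2.47 to 2.88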
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = points_against_per_win, y = points_against)) +
  theme_minimal()
cor(p2$points_against_per_win, p2$points_against, method = 'spearman')
## [1] 0.8307888
There is a strong positive correlation between the number of points against per win and the number of points against.
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_ranking, y = points_for)) +
  theme_minimal()
cor(p3$defensive_ranking, p3$points_for, method = 'spearman')
## [1] 0.2250367
Somewhat unsurprisingly, the correlation between defensive ranking and points for is weak. The correlation is slightly positive, which could suggest many things, but my hypothesis is that teams who score lots of points make the defense's job slightly easier, since the defense doesn't need to worry as much about losing and can potentially play with more confidence.
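One way to probe this hypothesis with the data at hand would be to check whether offensive and defensive rankings move together; a sketch (output not shown here):

# If strong offenses ease the defense's job, these should correlate
cor(standings$offensive_ranking, standings$defensive_ranking,
    method = 'spearman')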
I have already built confidence intervals for points_for in the pair 1 analysis, so I will skip them this time.
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_ranking, y = defensive_rank_per_points_for)) +
  theme_minimal()
cor(p3$defensive_ranking, p3$defensive_rank_per_points_for, method = 'pearson')
## [1] 0.9785555
This result shows a very strong correlation between a team's defensive_ranking and its defensive_rank_per_points_for, which suggests that points_for has little influence here. Since the correlation of defensive_ranking with itself is exactly 1, and its correlation with defensive_rank_per_points_for is nearly 1, dividing by points_for mostly changes the scale of defensive_ranking and not much else.
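One way to see why the division barely matters is that points_for varies relatively little around its mean, so dividing by it acts almost like scaling by a constant. A sketch (output not shown here):

# Coefficient of variation of points_for; a small value supports the
# rescaling interpretation above
standings |>
  summarise(cv_points_for = sd(points_for) / mean(points_for))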
# Bootstrapped Confidence Interval for defensive_rank_per_points_for mean.
boot_ci(v = p3$defensive_rank_per_points_for, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-0.0014, 0.0003 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for defensive_rank_per_points_for median
boot_ci(v = p3$defensive_rank_per_points_for, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-0.0011, 0.0011 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of defensive_rank_per_points_for falls between -0.0014 and 0.0003, and the true median between -0.0011 and 0.0011. It would make sense for the median to be near 0, since defensive_ranking is centered at 0.
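A quick way to verify the centering claim (output not shown here):

# defensive_ranking should be centered near 0 by construction
summary(standings$defensive_ranking)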
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_rank_per_points_for, y = points_for)) +
  theme_minimal()
cor(p3$defensive_rank_per_points_for, p3$points_for, method = 'pearson')
## [1] 0.2337394
The result shows a slightly positive correlation between points_for and defensive_rank_per_points_for, which again suggests that teams that score more points tend to have a slightly higher defensive ranking.