Due: Mon Feb 19, 2024 11:59pm
Your RMarkdown notebook for this data dive should contain the following:
- Build at least three pairs of variables.
- For each pair of variables, include at least one column that you created (i.e., calculated based on others).
- All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., ['small', 'medium', 'large'] is okay, but ["apples", "oranges", "bananas"] is not).
- At least one pair should be a response variable and an explanatory variable.
- Plot a visualization for each relationship, and draw some conclusions based on the plot.
- Calculate the appropriate correlation coefficient for each of these combinations.
- Build a confidence interval for each of the response variable(s). Provide a detailed conclusion about the response variable (i.e., the population) based on your confidence interval.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be investigated further.
For this week's data dive I will be using NFL standings data, which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
The goal of this data dive is to investigate the correlation between at least three variable pairs. Additionally, I need to calculate a third variable for each pair based on the others. Looking at the standings data, I have decided to create the following variable pairs:
| Pair 1 | Pair 2 | Pair 3 |
|---|---|---|
| year | wins | points_for |
| points_for | points_against | defensive_ranking |
| points_for_year_avg (each year's average points_for) | points_against / wins | defensive_ranking / points_for |
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(boot)
# Create the points_for_year_avg column
standings <- standings |>
  group_by(year) |>
  mutate(points_for_year_avg = mean(points_for))

# Select the columns for pair 1
p1 <- standings |>
  select(year, points_for, points_for_year_avg)

# Ungroup standings by year
standings <- ungroup(standings)

# Create the points_against_per_win column
standings <- standings |>
  mutate(points_against_per_win = points_against / wins)

# Select the columns for pair 2
p2 <- standings |>
  select(wins, points_against, points_against_per_win)

# Create the defensive_rank_per_points_for column
standings <- standings |>
  mutate(defensive_rank_per_points_for = defensive_ranking / points_for)

# Select the columns for pair 3
p3 <- standings |>
  select(points_for, defensive_ranking, defensive_rank_per_points_for)
For each of the pair groups, we will investigate the correlation between the variables by plotting. In each plot, the x values represent the explanatory variable, while the y values represent the response variable. Following each visual I will:
- Explain the meaning of the results.
- Calculate the correlation coefficient between the variables.
- Calculate a 95% confidence interval via bootstrapping, using the following function.
# Bootstrap confidence interval function
boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 10000) {
  set.seed(123)
  # the `boot` library needs the statistic function in this format
  boot_func <- \(x, i) func(x[i])
  b <- boot(data = v, statistic = boot_func, R = n_iter)
  boot.ci(b, conf = conf, type = "perc")
}
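For intuition, here is a minimal base-R sketch of the same percentile bootstrap without the `boot` package. The helper name manual_boot_ci is hypothetical and purely illustrative; the analysis below uses boot_ci.

# A hand-rolled percentile bootstrap (illustrative sketch, not used below)
manual_boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 10000) {
  set.seed(123)
  # resample the vector with replacement, applying `func` to each resample
  stats <- replicate(n_iter, func(sample(v, length(v), replace = TRUE)))
  # the percentile interval is just the empirical quantiles of the statistics
  alpha <- 1 - conf
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}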
# Visualization
p1 |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = points_for)) +
  theme_minimal()
# Correlation Calculation - spearman method used since year is discrete
cor(p1$year, p1$points_for, method = 'spearman')
## [1] 0.1714296
Analyzing the results, we can see there is a weak positive correlation between time and points_for, suggesting that points_for has increased slightly over the years.
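As a quick sanity check on the choice of method, one could also compute the Pearson coefficient on the same pair; if the relationship is roughly monotone, the two methods should tell a similar story (a sketch, output not shown here):

# Compare against Pearson as a sanity check (output not shown)
cor(p1$year, p1$points_for, method = 'pearson')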
# Bootstrapped Confidence Interval for points_for mean.
boot_ci(v = p1$points_for, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (344.9, 355.8 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_for median
boot_ci(v = p1$points_for, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (341, 355 )
## Calculations and Intervals on Original Scale
Additionally, we can be 95% confident that the true mean of points_for falls between 344.9 and 355.8, and 95% confident that the true median of points_for falls between 341 and 355.
p1 |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = points_for_year_avg)) +
  theme_minimal()
# Correlation Calculation - spearman method used since year is discrete
cor(p1$year, p1$points_for_year_avg, method = 'spearman')
## [1] 0.8417326
Our results show a very strong positive correlation between time and the average points_for in a year!
# Bootstrapped Confidence Interval for points_for_year_avg mean.
boot_ci(v = p1$points_for_year_avg, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (349.2, 351.4 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_for_year_avg median
boot_ci(v = p1$points_for_year_avg, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (347.5, 352.5 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_for_year_avg falls between 349.2 and 351.4, and the true median between 347.5 and 352.5. These results are similar to the mean and median of points_for, which we would expect. It does appear that points_for_year_avg has tighter confidence intervals. Perhaps there are a few outliers in points_for that widen the bootstrap intervals, whereas the per-year averages vary much less from sample to sample.
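One quick way to probe this hypothesis is to compare the spreads directly; a sketch (output not shown here):

# Compare the spread of raw points_for with the spread of the per-year averages
standings |>
  summarise(
    sd_points_for          = sd(points_for),
    sd_points_for_year_avg = sd(points_for_year_avg)
  )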
# Visualization
p1 |>
  ggplot() +
  geom_point(mapping = aes(y = points_for, x = points_for_year_avg)) +
  theme_minimal()
cor(p1$points_for, p1$points_for_year_avg)
## [1] 0.2080149
There does appear to be a slight positive correlation between the average number of points_for in a year and points_for. Remember that points_for_year_avg is the mean of points_for for a single year, so if a particular year has high points_for, it makes sense that points_for_year_avg would also be higher.
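Since points_for_year_avg is constant within a year, this correlation reflects how much of the variation in points_for sits between years rather than within them. A sketch that quantifies the within-year spread (output not shown here):

# Average within-year spread of points_for
standings |>
  group_by(year) |>
  summarise(within_year_sd = sd(points_for)) |>
  summarise(avg_within_year_sd = mean(within_year_sd))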
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = wins, y = points_against)) +
  theme_minimal()
cor(p2$wins, p2$points_against, method = 'spearman')
## [1] -0.6798108
It appears as though the number of wins a team has is negatively correlated with the number of points scored against it.
# Bootstrapped Confidence Interval for points_against mean.
boot_ci(v = p2$points_against, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (345.6, 354.9 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_against median
boot_ci(v = p2$points_against, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (341, 353 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_against falls between 345.6 and 354.9, and the true median between 341 and 353. Interestingly, these intervals closely resemble the intervals for points_for.
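The close match is not entirely a coincidence: every point scored counts as points_for for one team and points_against for its opponent, so the league-wide totals (and hence the means) must balance exactly. A quick check (output not shown here):

# League-wide points_for and points_against totals should match
standings |>
  summarise(
    total_for     = sum(points_for),
    total_against = sum(points_against)
  )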
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = wins, y = points_against_per_win)) +
  theme_minimal()
cor(p2$wins, p2$points_against_per_win, method = 'spearman')
## [1] -0.9691891
This result shows a strong negative correlation between the number of wins and the number of points against per win. Remember that points_against_per_win is simply
points_against / wins
It makes sense that this value would be highly correlated with the number of wins.
The shape of this plot can be explained by knowing that teams with fewer wins will have many more points scored against them since their last win than a team that wins a lot of games. Teams with 0 wins will have an infinitely high points_against_per_win because they have yet to win! And a team with lots of wins will have a much lower value because it divides its points_against among all of its wins.
The ‘thickness’ of points_against_per_win at each win count is also interesting to assess. It appears the thickness decreases as the number of wins increases.
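One caveat worth checking before bootstrapping: winless seasons make points_against_per_win infinite, which can interfere with resampled means. A quick sketch to see which team-seasons are affected (output not shown here):

# List winless seasons, where points_against_per_win is infinite
standings |>
  filter(wins == 0) |>
  select(team_name, year, points_against, points_against_per_win)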
# Bootstrapped Confidence Interval for points_against_per_win mean.
boot_ci(v = p2$points_against_per_win, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1302 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (54, 63 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for points_against_per_win median
boot_ci(v = p2$points_against_per_win, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (40.65, 47.31 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of points_against_per_win falls between 54 and 63, and the true median between 40.65 and 47.31.
The mean represents the average points scored against a team before its next win, which could be a few games' worth of points if a team goes on a losing streak. You could estimate the average number of games a team loses in a row by dividing the mean of points_against_per_win by the average points_against in a single game. Let's try this out.
First I need to calculate the average points_against in a single game. To do this I need the number of games in a season. The number of games in an NFL season has changed in the past, so it is worth verifying this number.
standings |>
  group_by(year) |>
  summarise(games_played = mean(wins + loss))
## # A tibble: 20 × 2
## year games_played
## <dbl> <dbl>
## 1 2000 16
## 2 2001 16
## 3 2002 15.9
## 4 2003 16
## 5 2004 16
## 6 2005 16
## 7 2006 16
## 8 2007 16
## 9 2008 15.9
## 10 2009 16
## 11 2010 16
## 12 2011 16
## 13 2012 15.9
## 14 2013 15.9
## 15 2014 15.9
## 16 2015 16
## 17 2016 15.9
## 18 2017 16
## 19 2018 15.9
## 20 2019 15.9
Some years average 16 games and others slightly fewer. The years with fewer games are years with ties, which are not recorded in this dataset. Let's assume 16-game seasons.
Now we can calculate the average points_against per game:
standings |>
  summarise(
    avg_points_against_per_game = mean(points_against / 16)
  )
## # A tibble: 1 × 1
## avg_points_against_per_game
## <dbl>
## 1 21.9
Now we can calculate the number of games a team is expected to lose in a row.
If the average team gives up between 54 and 63 points per win, then we would expect the average team to lose between…
54 / 21.89273
## [1] 2.466572
63 / 21.89273
## [1] 2.877668
2.47 and 2.88 games in a row.
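For reference, the same arithmetic consolidated into a few lines, reusing the interval bounds from the bootstrap output above (the variable names here are hypothetical):

# Implied games' worth of points allowed per win
avg_pa_per_game <- mean(standings$points_against / 16)  # about 21.9
ci_mean <- c(54, 63)  # bootstrap CI for the mean of points_against_per_win
ci_mean / avg_pa_per_game  # roughly 2.47 to 2.88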
# Visualization
p2 |>
  ggplot() +
  geom_point(mapping = aes(x = points_against_per_win, y = points_against)) +
  theme_minimal()
cor(p2$points_against_per_win, p2$points_against, method = 'spearman')
## [1] 0.8307888
There is a strong positive correlation between the number of points against per win and the number of points against.
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_ranking, y = points_for)) +
  theme_minimal()
cor(p3$defensive_ranking, p3$points_for, method = 'spearman')
## [1] 0.2250367
Somewhat unsurprisingly, the correlation between defensive ranking and points for is weak. The correlation is slightly positive, which could suggest many things, but my hypothesis is that teams who score lots of points make the defense's job slightly easier, since the defense doesn't need to worry as much about losing and can potentially play with more confidence.
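One way to probe this hypothesis with the data at hand would be to check whether offensive and defensive rankings move together; a sketch (output not shown here):

# If strong offenses ease the defense's job, these should correlate
cor(standings$offensive_ranking, standings$defensive_ranking,
    method = 'spearman')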
I have already built confidence intervals for points_for in the pair 1 analysis, so I will skip them this time.
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_ranking, y = defensive_rank_per_points_for)) +
  theme_minimal()
cor(p3$defensive_ranking, p3$defensive_rank_per_points_for, method = 'pearson')
## [1] 0.9785555
This result shows a very strong correlation between a team's defensive_ranking and its defensive_rank_per_points_for, which suggests that points_for has little influence here. Since the correlation of defensive_ranking with itself is exactly 1, and its correlation with defensive_rank_per_points_for is nearly 1, dividing by points_for mostly changes the scale of defensive_ranking and not much else.
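One way to see why the division barely matters is that points_for varies relatively little around its mean, so dividing by it acts almost like scaling by a constant. A sketch (output not shown here):

# Coefficient of variation of points_for; a small value supports the
# rescaling interpretation above
standings |>
  summarise(cv_points_for = sd(points_for) / mean(points_for))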
# Bootstrapped Confidence Interval for defensive_rank_per_points_for mean.
boot_ci(v = p3$defensive_rank_per_points_for, func = mean, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-0.0014, 0.0003 )
## Calculations and Intervals on Original Scale
# Bootstrapped Confidence Interval for defensive_rank_per_points_for median
boot_ci(v = p3$defensive_rank_per_points_for, func = median, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (-0.0011, 0.0011 )
## Calculations and Intervals on Original Scale
The analysis shows, with 95% confidence, that the true mean of defensive_rank_per_points_for falls between -0.0014 and 0.0003, and the true median between -0.0011 and 0.0011. It would make sense for the median to be near 0, since defensive_ranking is centered at 0.
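A quick way to verify the centering claim (output not shown here):

# defensive_ranking should be centered near 0 by construction
summary(standings$defensive_ranking)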
# Visualization
p3 |>
  ggplot() +
  geom_point(mapping = aes(x = defensive_rank_per_points_for, y = points_for)) +
  theme_minimal()
cor(p3$defensive_rank_per_points_for, p3$points_for, method = 'pearson')
## [1] 0.2337394
The result shows a slightly positive correlation between points_for and defensive_rank_per_points_for, which again suggests that teams that score more points tend to have a slightly higher defensive ranking.