
Week 8 Data Dive - Regression Modeling

Due: Mon March 4th, 2024 11:59pm

Task(s)

The purpose of this week’s data dive is for you to get experience running ANOVA tests and building regression models.

Your RMarkdown notebook for this data dive should contain the following:

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.
- For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses. What is that thing (or, one of “those” things) for your data?
Select a categorical column of data (explanatory variable) that you expect might influence the response variable.
- Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions.
- If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.
- Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.
Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.
- Build a linear regression model of the response using just this column, and evaluate its fit.
- Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

The Data

For this week's data dive I will be using NFL standings data, which comes from the Pro Football Reference team standings.

Link to Data

standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')

## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

standings

## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

Selecting a Continuous Variable

The most valuable continuous (ordered integer) column given the context of the data is wins. In the NFL, the success of a team's regular season is judged by its record, which this data captures in the wins and loss columns.

Selecting a Categorical Variable

For my categorical variable I will be using the sb_winner column. This column describes whether the team won the Super Bowl that year. Since the column has only two categories (won or did not win), there is no need to consolidate them.

Null Hypothesis

Null Hypothesis: Teams that won the Super Bowl have the same average number of wins as teams that did not win it.

ANOVA Summary

m <- aov(wins ~ sb_winner, data = standings)
summary(m)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## sb_winner     1    309  308.54   34.13 8.23e-09 ***
## Residuals   636   5749    9.04                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

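The ANOVA table reports an F value of 34.13. The F statistic is the ratio of the variance between the group means (Mean Sq for sb_winner, 308.54) to the variance within the groups (Mean Sq for Residuals, 9.04). A ratio this far above 1 indicates that the difference in average wins between Super Bowl winners and all other teams is much larger than the variation within each group would explain.

The p-value, Pr(>F) = 8.23e-09, is the probability of observing an F value at least this large if the null hypothesis were true. It is far below the conventional 0.05 threshold.

We therefore reject the null hypothesis: there is strong evidence that Super Bowl winners and non-winners do not have the same average number of wins. For anyone interested in this data, it would not be safe to assume that a champion's record looks like that of a typical team. The ANOVA table alone does not show the direction or size of the difference; comparing the two group means would answer that.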
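
As a quick sanity check, the F value and p-value can be reproduced by hand from the rounded numbers printed in the ANOVA table. This is only a sketch: the variable names are my own, and because the inputs are rounded, the results will match the table only approximately.

```r
# Values copied from the aov() summary (rounded as printed)
ms_between <- 308.54   # Mean Sq for sb_winner
ms_within  <- 9.04     # Mean Sq for Residuals
df_between <- 1        # Df for sb_winner
df_within  <- 636      # Df for Residuals

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# p-value: upper tail of the F distribution under the null hypothesis
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

f_value
p_value
```

Seeing the F value come out of a simple division makes it clearer why a large ratio counts as evidence against the null hypothesis.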

Building a Linear Regression Model

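For my linear regression model I will use wins as the response variable and points_for as the explanatory variable. A scatter plot is a quick way to check that the relationship between the two is roughly linear.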

library(ggplot2)

# Explanatory variable on the x-axis, response on the y-axis,
# so the plot matches the model wins ~ points_for
standings |>
  ggplot(mapping = aes(x = points_for, y = wins)) +
  geom_point(size = 2, color = 'darkblue')

Using geom_smooth() with method = "lm", we can draw the line of best fit over the points.

standings |>
  ggplot(mapping = aes(x = points_for, y = wins)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')

## `geom_smooth()` using formula = 'y ~ x'

Analyzing the coefficients

To obtain the coefficients, we first need to fit a model.

model <- lm(wins ~ points_for, standings)
model$coefficients

## (Intercept)  points_for 
## -3.06141069  0.03153369

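To understand the coefficients we need to know what they represent. Using the equation

y = mx + b

the value of m represents the slope of the line, and b represents the y-intercept, the value of y where the line crosses the y-axis at x = 0. In our model, y is wins and x is points_for. The y-intercept is -3.06141069, meaning a team that scored zero points all season would be predicted to win about -3 games. That value is not meaningful on its own, because zero points is far outside the range of the data; the intercept mainly positions the line.

The slope of our model is 0.03153369, which means each additional point scored over a season is associated with an average increase of about 0.0315 wins. Put another way, it takes roughly 32 additional points to add one expected win.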
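
To make the coefficients concrete, the fitted line can be used to turn a season's point total into an expected win total. The sketch below hard-codes the two coefficients printed above; the function name predict_wins and the example point totals are my own choices for illustration.

```r
# Coefficients copied from model$coefficients
intercept <- -3.06141069
slope     <- 0.03153369

# Hypothetical helper: expected wins for a given season point total
predict_wins <- function(points_for) {
  intercept + slope * points_for
}

# Expected wins for teams scoring 300, 400, and 500 points
predict_wins(c(300, 400, 500))

# Number of points associated with one additional expected win
points_per_win <- 1 / slope
points_per_win
```

With the full data loaded, predict(model, newdata = data.frame(points_for = c(300, 400, 500))) should return the same values without copying the coefficients by hand.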

The model suggests that to win more games, you need to score more points, which makes sense!
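
The coefficients alone do not say how well the line fits. One common way to evaluate fit is to look at R-squared and the residuals. The sketch below uses only the ten rows printed in the tibble preview earlier, so that it runs without downloading anything. The numbers it produces will differ from the full 638-row model and should not be read as results for the whole dataset. The names preview and preview_model are my own.

```r
# The ten rows shown in the tibble preview (year 2000 only)
preview <- data.frame(
  wins       = c(11, 10, 9, 8, 5, 13, 12, 9, 7, 4),
  points_for = c(323, 429, 321, 315, 276, 346, 333, 321, 367, 185)
)

# Same model formula as above, fitted to the small preview sample
preview_model <- lm(wins ~ points_for, data = preview)

# R-squared: share of the variance in wins explained by points_for
r_squared <- summary(preview_model)$r.squared
r_squared

# Residuals should scatter around zero with no obvious pattern
plot(fitted(preview_model), resid(preview_model),
     xlab = "Fitted wins", ylab = "Residual")
abline(h = 0, lty = 2)
```

On the full data, summary(model) reports R-squared and the residual standard error for the model fitted earlier, and the same residual plot can be drawn from fitted(model) and resid(model).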