Data Preparation

The original file had to be reduced in size so it could be uploaded to a GitHub repository.

library(ggplot2)
away_games <- read.csv("https://raw.githubusercontent.com/wco1216/DATA-606/master/awaygames.csv", header = TRUE, sep = ",")
home_games <- read.csv("https://raw.githubusercontent.com/wco1216/DATA-606/master/homegames.csv", header = TRUE, sep = ",")

Research question

What conditions allowed for the most rushing yards by NFL running backs in 2018?

Cases

There are 247,962 cases.

Data collection

The data comes from the "NFL Big Data Bowl" competition hosted on Kaggle (https://www.kaggle.com/c/nfl-big-data-bowl-2020/data).

Type of study

This is an observational study.

Data Source

https://www.kaggle.com/c/nfl-big-data-bowl-2020/data

Dependent Variable

The dependent variable in this study is yards gained on a play (Yards).

Independent Variable

There are multiple independent variables; the main ones of interest are player name, offensive formation, and player height and weight.
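As a rough sketch of how one of these independent variables could be related to the dependent variable, the helper below summarises Yards within each level of a grouping column. The function name `summarise_yards_by` is hypothetical, the column name `OffenseFormation` is assumed to match the Kaggle data, and the toy data frame is made up purely for illustration.

```r
# Hypothetical helper: count, mean and median of Yards for each level of a grouping column.
summarise_yards_by <- function(games, group_col) {
  yards <- as.numeric(games$Yards)
  # split() groups the yardage values by the chosen column (levels are sorted).
  groups <- split(yards, games[[group_col]])
  out <- data.frame(
    group  = names(groups),
    n      = vapply(groups, length, integer(1)),
    mean   = vapply(groups, mean, numeric(1)),
    median = vapply(groups, median, numeric(1)),
    row.names = NULL
  )
  # Sort so the grouping level with the highest mean yardage comes first.
  out[order(-out$mean), ]
}

# Toy stand-in for away_games / home_games; these values are invented.
toy_games <- data.frame(
  OffenseFormation = c("SHOTGUN", "SHOTGUN", "I_FORM", "I_FORM", "SINGLEBACK"),
  Yards = c(2, 6, 3, 5, 10)
)
formation_summary <- summarise_yards_by(toy_games, "OffenseFormation")
print(formation_summary)
```

With the real data, the same call would be `summarise_yards_by(home_games, "OffenseFormation")`, assuming that column is present in the reduced files.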

Relevant summary statistics

Yards

summary(away_games$Yards)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12.00    1.00    3.00    4.42    6.00   99.00
summary(home_games$Yards)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12.00    1.00    3.00    4.42    6.00   99.00
ggplot(home_games, aes(Yards)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(limits = c(-12,99))

ggplot(away_games, aes(Yards)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(limits = c(-12,99))

PlayerWeight

summary(away_games$PlayerWeight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   159.0   210.0   245.0   252.7   305.0   380.0
summary(home_games$PlayerWeight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   159.0   210.0   245.0   253.1   305.0   380.0
ggplot(away_games, aes(PlayerWeight)) +
  geom_histogram(binwidth = 4) +
  scale_x_continuous(limits = c(159,380))
## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(home_games, aes(PlayerWeight)) +
  geom_histogram(binwidth = 4) +
  scale_x_continuous(limits = c(159,380))
## Warning: Removed 2 rows containing missing values (geom_bar).
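
A likely cause of the "Removed 2 rows" warnings is that `scale_x_continuous(limits = ...)` discards anything outside the limits, including histogram bars whose edges extend past them, so the first and last bins are dropped. The sketch below shows one possible alternative, `coord_cartesian()`, which zooms the view without removing data. The toy data frame is made up for illustration and stands in for `away_games` or `home_games`.

```r
library(ggplot2)

# Toy stand-in for the games data; these weights are invented.
toy_weights <- data.frame(PlayerWeight = c(159, 180, 210, 245, 245, 305, 380))

# coord_cartesian() only changes the visible range, so no bins are removed.
weight_plot <- ggplot(toy_weights, aes(PlayerWeight)) +
  geom_histogram(binwidth = 4) +
  coord_cartesian(xlim = c(159, 380))

# Every observation should still be counted in some bin.
binned <- ggplot_build(weight_plot)$data[[1]]
sum(binned$count) == nrow(toy_weights)
```

The edge bars may still be partly cut off visually at the axis limits, but they stay in the plot data.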