MoneyMaking

Importing and ggplots:

remove(list = ls()) # Clean the environment.
gc() 
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  595867 31.9    1352355 72.3         NA   700242 37.4
Vcells 1098570  8.4    8388608 64.0      16384  1963194 15.0
cat("\f") 
getwd() 
[1] "/Users/hannahrobinson/Downloads"
library(stargazer) # Load the packages.

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
library(ggplot2) 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(visdat)
library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(conflicted)
?stargazer
money <- read.csv("~/Downloads/moneyball-training-data.csv") # Rename the dataset

money_clean <- money[, colSums(is.na(money)) == 0 ] # Clean data and rename

stargazer(money_clean, type = "text", title = "Summary Statistics", digits = 2, omit.summary.stat = "n", notes = "n = 2276") # Summary statistics

Summary Statistics
===============================================
Statistic          Mean   St. Dev.  Min   Max  
-----------------------------------------------
INDEX            1,268.46  736.35    1   2,535 
TARGET_WINS       80.79    15.75     0    146  
TEAM_BATTING_H   1,469.27  144.59   891  2,554 
TEAM_BATTING_2B   241.25   46.80    69    458  
TEAM_BATTING_3B   55.25    27.94     0    223  
TEAM_BATTING_HR   99.61    60.55     0    264  
TEAM_BATTING_BB   501.56   122.67    0    878  
TEAM_PITCHING_H  1,779.21 1,406.84 1,137 30,132
TEAM_PITCHING_HR  105.70   61.30     0    343  
TEAM_PITCHING_BB  553.01   166.36    0   3,645 
TEAM_FIELDING_E   246.48   227.77   65   1,898 
-----------------------------------------------
n = 2276                                       
ggplot(data = money_clean, mapping = aes(x = TEAM_BATTING_HR, y = TARGET_WINS)) + geom_point()

library("openxlsx")

# setwd("~/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 4")

# Write the data frame to an Excel file.
write.xlsx(money_clean, 
           file = "Final_Moneyball_Project.xlsx"
           )

Histograms:

?ggplot

# Creating histograms for each variable.

ggplot(data = money_clean, mapping = aes(x = TARGET_WINS)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = money_clean, mapping = aes(x = TEAM_BATTING_HR)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = money_clean, mapping = aes(x = TEAM_PITCHING_HR)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = money_clean, mapping = aes(x = TEAM_FIELDING_E)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = money_clean, mapping = aes(x = TEAM_BATTING_3B)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Five Key Observations:

  1. The team fielding E histogram is extremely skewed-right.
  2. The histogram for target wins is relatively uni-modal and symmetric, meaning the data is relatively normal.
  3. The distribution of team batting hr is bi-modal.
  4. The distribution of team pitching home runs is bi-modal as well.
  5. The team batting 3B distribution is skewed-right with a peak around 30.

Data Exploration:

#install.packages("readr")
library(readr)
library(dplyr)

Target Wins…

mean(money_clean$TARGET_WINS) # Finding the mean.
[1] 80.79086
sd(money_clean$TARGET_WINS) # Finding the standard deviation.
[1] 15.75215
median(money_clean$TARGET_WINS) # Finding the median.
[1] 82
boxplot(money_clean$TARGET_WINS, horizontal = TRUE) # Creating a boxplot with cleaned data.

?boxplot

#There are a lot of outliers both high and low.

Team Batting H…

mean(money_clean$TEAM_BATTING_H)
[1] 1469.27
sd(money_clean$TEAM_BATTING_H)
[1] 144.5912
median(money_clean$TEAM_BATTING_H)
[1] 1454
boxplot(money_clean$TEAM_BATTING_H, horizontal = TRUE)

?boxplot

# Has a lot of outliers on the high end.

plot(TARGET_WINS ~ TEAM_BATTING_H, data = money_clean)

Team Pitching HR…

mean(money_clean$TEAM_PITCHING_HR)
[1] 105.6986
sd(money_clean$TEAM_PITCHING_HR)
[1] 61.29875
median(money_clean$TEAM_PITCHING_HR)
[1] 107
boxplot(money_clean$TEAM_PITCHING_HR, horizontal = TRUE)

?boxplot

# There are three outliers on the high end and the distribution is skewed right.

plot(TARGET_WINS ~ TEAM_PITCHING_HR, data = money_clean) # Creating a dot plot for team pitching hr and target wins.

Team Fielding E…

mean(money_clean$TEAM_FIELDING_E)
[1] 246.4807
sd(money_clean$TEAM_FIELDING_E)
[1] 227.771
median(money_clean$TEAM_FIELDING_E)
[1] 159
boxplot(money_clean$TEAM_FIELDING_E, horizontal = TRUE)

?boxplot

# There are en extreme amount of outliers on the high end and the distribution is extremely skewed right with the center being less than 250.

plot(TARGET_WINS ~ TEAM_FIELDING_E, data = money_clean) # Dot plot with team fielding E and target wins.

Relationship Between Variables:

?ggplot

ggplot(data = money_clean, mapping = aes(x = TEAM_BATTING_H, y = TARGET_WINS)) + geom_point() + ggtitle("Relationship Between Home Runs and Target Wins") + geom_point(colour = "Pink")

?ggplot

ggplot(data = money_clean, mapping = aes(x = TEAM_PITCHING_H, y = TARGET_WINS)) + geom_point() + ggtitle("Relationship Between Home Runs and Target Wins") + geom_point(colour = "Orange")

Export Data:

library(writexl)

write_xlsx(money_clean, "MoneyballGroupProject.xlsx")

Excel:

  • The screenshot above is only a piece of the pivot table we created.

  • There appears to be relatively no relationship between team batting hits and the average of target wins for a team. However, this is hard to determine by just the numbers so a visual, like those seen earlier, may be more beneficial in determining a relationship between the two variables.

Summarize:

We began with 2276 objects and 17 variables, however, after cleaning the data and removing the na’s, we are left with 2276 objects and 11 variables. The six variables removed are: TEAM_FIELDING_DP, TEAM_PITCHING_SO, TEAM_BATTING_HBP, TEAM_BASERUN_SB, TEAM_BASERUN_CS, AND TEAM_BATTING_SO. Next, we ran summary statistics on the cleaned dataset; including min, max, standard deviation, and mean. Then we began to create visuals. We started with ggplots for all variables to show us the skew and overall shape of each distribution. As for standout distributions, TEAM_FIELDING_E and TEAM_BATTING_3B appeared to have the most defined and extreme skew (right). However, TEAM_BATTING_HR and TEAM_PITCHING_HR were roughly bimodal. We then began to run each variable by themselves; running the summary statistics, boxplots, and dot plots. Lastly, we created dot plots of relationships between two variables. In excel, with the imported data, we were able to create pivot tables to further analyze the data. There is an unclear relationship between Team Batting Homeruns and the Average of Target Wins. The data would be more clear in a visual diagram, such as a dot plot or histogram. We would like to continue researching and analyzing this relationship, because it is unclear. As well, we would like to continue looking at the relationship between Home Runs and Target Wins because the ggplot shows a generally positive trend but has a large cluster.