library("openxlsx")
# setwd("~/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 4")
# Write the data frame to an Excel file.
write.xlsx(money_clean, file = "Final_Moneyball_Project.xlsx")
Histograms:
?ggplot
# Creating histograms for each variable.
ggplot(data = money_clean, mapping = aes(x = TARGET_WINS)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
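The message above is ggplot2 reporting that it fell back to its default of 30 bins. A minimal sketch of setting the bin width explicitly follows. The data frame `wins_demo` is made-up stand-in data (not the Moneyball data), and the width of 5 wins is an arbitrary assumption chosen only for illustration.

```r
library(ggplot2)

# Made-up stand-in data; the real analysis would use money_clean.
set.seed(1)
wins_demo <- data.frame(TARGET_WINS = round(rnorm(200, mean = 80, sd = 15)))

# binwidth = 5 groups wins into 5-win bins instead of the default 30 bins.
p <- ggplot(data = wins_demo, mapping = aes(x = TARGET_WINS)) +
  geom_histogram(binwidth = 5, colour = "white")
print(p)
```

Trying a few widths and comparing the shapes is usually the quickest way to settle on a value that shows the skew without too much noise.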
?boxplot
# There are three outliers on the high end and the distribution is skewed right.
plot(TARGET_WINS ~ TEAM_PITCHING_HR, data = money_clean) # Creating a dot plot for team pitching HR and target wins.
?boxplot
# There is an extreme number of outliers on the high end, and the distribution is
# extremely skewed right, with the center below 250.
plot(TARGET_WINS ~ TEAM_FIELDING_E, data = money_clean) # Dot plot with team fielding E and target wins.
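The outlier counts in the comments above are read off the boxplots. As a sketch of how such a count could be checked numerically, base R's `boxplot.stats()` returns the points that fall beyond the whiskers (1.5 times the interquartile range by default). The vector `errors_demo` is made-up data for illustration, not values from the dataset.

```r
# Made-up right-skewed values standing in for a column such as TEAM_FIELDING_E.
errors_demo <- c(100, 110, 120, 125, 130, 135, 140, 150, 160, 400, 900)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges.
stats <- boxplot.stats(errors_demo)

# Keep only the outliers above the median, i.e. those on the high end.
high_outliers <- stats$out[stats$out > median(errors_demo)]

length(high_outliers)  # number of outliers on the high end
```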
Relationship Between Variables:
?ggplot
ggplot(data = money_clean, mapping = aes(x = TEAM_BATTING_H, y = TARGET_WINS)) +
  geom_point(colour = "pink") +
  ggtitle("Relationship Between Team Batting Hits and Target Wins")
?ggplot
ggplot(data = money_clean, mapping = aes(x = TEAM_PITCHING_H, y = TARGET_WINS)) +
  geom_point(colour = "orange") +
  ggtitle("Relationship Between Team Pitching Hits and Target Wins")
The screenshot above is only a piece of the pivot table we created.
There appears to be little to no relationship between team batting hits and the average of target wins for a team. However, this is hard to judge from the numbers alone, so a visual, like those seen earlier, may be more helpful in determining whether the two variables are related.
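The pivot table described above was built in Excel. A rough R analogue can be sketched with `cut()` and `aggregate()`, assuming the pivot averaged TARGET_WINS within ranges of TEAM_BATTING_H (the exact grouping used in Excel is not stated). The data frame `hits_demo` and the bin edges are made up for illustration.

```r
# Made-up stand-in data; the real analysis would use money_clean.
hits_demo <- data.frame(
  TEAM_BATTING_H = c(1250, 1300, 1380, 1420, 1460, 1510, 1550, 1620),
  TARGET_WINS    = c(70, 74, 78, 80, 79, 85, 88, 90)
)

# Group hits into ranges (the bin edges are arbitrary assumptions).
hits_demo$HITS_RANGE <- cut(hits_demo$TEAM_BATTING_H,
                            breaks = c(1200, 1400, 1500, 1700))

# Average target wins per range, like a pivot table showing "Average of TARGET_WINS".
pivot <- aggregate(TARGET_WINS ~ HITS_RANGE, data = hits_demo, FUN = mean)
print(pivot)

# A single-number summary of the linear association between the two variables.
cor(hits_demo$TEAM_BATTING_H, hits_demo$TARGET_WINS)
```

A correlation near zero on the real data would support the "little to no relationship" reading; a value well away from zero would suggest the pivot table is hiding a trend.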
Summarize:
We began with 2276 objects and 17 variables. After cleaning the data and removing the NAs, we are left with 2276 objects and 11 variables. The six variables removed are TEAM_FIELDING_DP, TEAM_PITCHING_SO, TEAM_BATTING_HBP, TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_BATTING_SO. Next, we ran summary statistics on the cleaned dataset, including the min, max, standard deviation, and mean.
Then we began to create visuals. We started with ggplot histograms for all variables to show the skew and overall shape of each distribution. Among the standout distributions, TEAM_FIELDING_E and TEAM_BATTING_3B had the most pronounced right skew, while TEAM_BATTING_HR and TEAM_PITCHING_HR were roughly bimodal. We then examined each variable individually, running summary statistics, boxplots, and dot plots. Lastly, we created dot plots of the relationships between pairs of variables.
In Excel, with the imported data, we created pivot tables to analyze the data further. The relationship between Team Batting Home Runs and the Average of Target Wins is unclear from the table; it would be clearer in a visual, such as a dot plot or histogram, so we would like to continue researching and analyzing it. We would also like to keep looking at the relationship between Home Runs and Target Wins, because the ggplot shows a generally positive trend but most of the points fall in one large cluster.
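The cleaning and summary steps described above can be sketched as follows, assuming the six variables were dropped because they contained missing values (the summary implies this but the cleaning code is not shown). The data frame `raw_demo` is made-up stand-in data with hypothetical values.

```r
# Made-up stand-in for the raw data: two complete columns and one with NAs.
raw_demo <- data.frame(
  TARGET_WINS     = c(70, 85, 92, 61),
  TEAM_BATTING_H  = c(1400, 1500, 1550, 1320),
  TEAM_BATTING_SO = c(900, NA, 1050, NA)
)

# Keep only the columns with no missing values. Rows are untouched,
# which is why the row count is the same before and after cleaning.
clean_demo <- raw_demo[, colSums(is.na(raw_demo)) == 0, drop = FALSE]
dim(clean_demo)

# Summary statistics for each remaining variable: min, max, mean, standard deviation.
sapply(clean_demo, function(x) c(min = min(x), max = max(x),
                                 mean = mean(x), sd = sd(x)))
```

Dropping columns rather than rows keeps every team-season in the data, at the cost of losing those variables entirely.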