Attendance in the Modern Era
First off, I believe that technically the modern era of baseball is still defined as anything after 1900. The common belief is that baseball is too slow and boring for today's sports fans, but home game attendance alone seems to show steady growth.
library(ggplot2)
library(scales)
library(readr)
library(dplyr)
teams <- read.csv("./team.csv", na.strings=c("","NA"))
teams <- subset(teams, !is.na(attendance)) # drop team-seasons with no attendance figure
teams <- subset(teams, year > 1969) # keep 1970 onward
#it keeps growing!
ggplot(teams, aes(year, attendance)) + geom_bar(stat="identity", aes(fill=g>157))+
guides(fill="none")+
ggtitle("Total Home Attendance per Year (Strikes in 72, 81, 94, and 95)")+
scale_y_continuous(labels = comma) +
theme_minimal()
Thoughts
The last few years do seem stagnant, but it would be interesting to compare this with TV ratings!
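Because `geom_bar(stat="identity")` stacks one segment per team, each bar's height is the league-wide total for that year. To check the "stagnant" impression, those totals can be computed directly. This is only a minimal sketch: the `toy` data frame and its values are invented for illustration, and only the column names `year` and `attendance` come from the code above.

```r
# Toy stand-in for `teams`: values are made up for illustration only.
toy <- data.frame(
  year       = c(2013, 2013, 2014, 2014, 2015, 2015),
  attendance = c(100, 200, 110, 210, 105, 215)
)

# Sum attendance across all teams within each year.
yearly <- aggregate(attendance ~ year, data = toy, FUN = sum)

# Percent change from the previous season (NA for the first year).
yearly$pct_change <- c(NA, diff(yearly$attendance) / head(yearly$attendance, -1) * 100)

print(yearly)
```

With the real data, swapping `toy` for `teams` should give one row per season.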
Wins VS Attendance
ggplot(teams, aes(w, attendance)) + geom_point(color="green") +
labs(x="Wins", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Wins & Attendance")+
theme_minimal()
ggplot(teams, aes(hr, attendance)) + geom_point(color="green") +
labs(x="Home Runs", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Home Runs & Attendance")+
theme_minimal()
Thoughts
I think that shows a fairly solid positive correlation in both plots. There are certainly plenty of outliers, especially the 1993 Colorado Rockies, who had a total attendance of 4,483,350 despite winning only 67 games. But it was their first season as a franchise, so the fans must have been engaged no matter what happened. On the other end there are some really bad teams whose seasons and attendance were hurt during the strike years. The 1981 Cubs, for example, went 38-65 with an attendance of 565,637 in a season cut short by the strike.
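To put a number on "fairly solid", one option is Pearson's correlation coefficient from base R's `cor()`. The sketch below rests on assumptions: `attendance_cor` is a hypothetical helper, and the `toy` data frame is invented for illustration, reusing only the column names `w`, `l`, and `attendance` from the team data.

```r
# Hypothetical helper: Pearson correlation between one column and attendance.
attendance_cor <- function(df, column) {
  cor(df[[column]], df$attendance, use = "complete.obs")
}

# Toy stand-in for `teams`; the numbers are invented for illustration only.
toy <- data.frame(
  w          = c(60, 70, 80, 90, 100),
  l          = c(102, 92, 82, 72, 62),
  attendance = c(1.2e6, 1.5e6, 2.1e6, 2.4e6, 3.0e6)
)

r_wins   <- attendance_cor(toy, "w")  # positive in this toy data
r_losses <- attendance_cor(toy, "l")  # negative in this toy data

print(c(wins = r_wins, losses = r_losses))
```

Calling `attendance_cor(teams, "w")` on the real data would give a single figure to set beside the scatter plot. Outliers such as the 1993 Rockies will pull it toward zero.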
The other side
Now I will plot attendance against losses and total errors (who likes sloppy baseball?)
ggplot(teams, aes(l, attendance)) + geom_point(color="red") +
labs(x="Losses", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Losses & Attendance")+
theme_minimal()
Thoughts
A similar correlation, in the opposite direction: fans just don't want to pay to watch a bad team (unless it is the 1993 Colorado Rockies!)
Moneyball!
Here I will join the team data with a summary of the salary data to see if payroll plays any part in attendance.
salary <- read.csv("./salary.csv",na.strings=c("","NA"))
salary <- select(salary, year, team_id, salary)
salary <- group_by(salary, year, team_id) %>% summarise(dollars = sum(salary))
team_salary <- left_join(teams, salary)
## Joining, by = c("year", "team_id")
## Warning: Column `team_id` joining factors with different levels, coercing
## to character vector
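The message and warning above come from two defaults. `left_join()` guesses the join keys from the shared column names, and `read.csv()` in R versions before 4.0 converts strings to factors, so `team_id` arrives as a factor with different levels in each table. One way to make both explicit is sketched below. The toy tables and their values are invented stand-ins for `teams` and `salary`.

```r
library(dplyr)

# Toy tables; values are invented for illustration only.
toy_teams <- data.frame(
  year = c(1985, 1985), team_id = c("ATL", "BOS"),
  attendance = c(1000, 2000), stringsAsFactors = FALSE
)
toy_salary <- data.frame(
  year = c(1985, 1985, 1985), team_id = c("ATL", "ATL", "BOS"),
  salary = c(10, 20, 40), stringsAsFactors = FALSE
)

# One payroll total per team-season.
payroll <- toy_salary %>%
  group_by(year, team_id) %>%
  summarise(dollars = sum(salary)) %>%
  ungroup()

# Name the keys explicitly so the join does not depend on guessed columns.
joined <- left_join(toy_teams, payroll, by = c("year", "team_id"))

print(joined)
```

With the real files, passing `stringsAsFactors = FALSE` to both `read.csv()` calls should keep `team_id` as character in each table, which avoids the coercion warning.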
team_salary <- subset(team_salary, year > 1984) # keep 1985 onward (seasons with salary data)
ggplot(team_salary, aes(dollars, attendance)) + geom_point(color="darkgreen",size=4, shape=36) +
labs(x="Team Salary", y="Attendance") +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma) +
ggtitle("Team Salary & Attendance")+
theme_minimal()