This shows some of the data and code used in Lecture 1.1, the Oakland A’s under Billy Beane.
You can see how the data were processed and how the plots were generated.
Read in the csv (comma-separated value) file, and assign it to ‘oakland’.
library(tidyverse)  # provides read_csv(), mutate(), filter(), and ggplot()
oakland <- read_csv("Oakland_As_Seasons.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Standing = col_character(),
## WINS = col_double(),
## LOSSES = col_double(),
## BillyBeane = col_double()
## )
After R reads the csv file, it shows the names and types of the columns in the file. The file contains 53 years of Oakland A’s seasons with 5 columns: Year | Standing = standing at the end of the year | WINS = number of wins | LOSSES = number of losses | BillyBeane = whether Billy Beane was the GM in that season (0 or 1).
ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Wins") +
geom_col(aes(x=Year, y=WINS), fill="darkgreen", col="yellow") + ylim(0,162) + ggtitle("Number of A's wins in each season")
It looks like the A’s did very poorly in 1981, 1994, and 2020. A baseball season is conveniently always 162 games. Oops, except for the strike-shortened seasons of 1981 and 1994 and, of course, the COVID-shortened 2020 season.
Next we manipulate the dataframe. With the ‘mutate’ function we add the winning percentage for each season: a new variable, ‘win_perc’, is defined as wins divided by games played (wins plus losses) in each year.
oakland <- mutate(oakland, win_perc = WINS/(WINS+LOSSES))
Now the data are more comparable from year to year, and we see that 1981 and 2020 were great seasons, while 1994 was still bad.
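To make the arithmetic concrete, here is a small sketch in base R that repeats the calculation for the three shortened seasons only. The win-loss records below are assumed from public records and are not read from the csv file, and the data frame name ‘seasons’ is made up for this example.

```r
# Win-loss records assumed from public records, not read from the csv file.
seasons <- data.frame(
  Year   = c(1981, 1994, 2020),
  WINS   = c(64, 51, 36),
  LOSSES = c(45, 63, 24)
)

# Same arithmetic as the mutate() call above, written in base R.
seasons$games    <- seasons$WINS + seasons$LOSSES
seasons$win_perc <- seasons$WINS / seasons$games

print(seasons)
```

All three seasons have far fewer than 162 games, which is why the raw win counts look poor, but only 1994 falls below a winning percentage of 0.5.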
ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Winning Percentage") +
geom_col(aes(x=Year, y=win_perc), fill="darkgreen", col="yellow") + ggtitle("A's winning percentage each season")
That said, winning percentage is probably better represented with points. Let’s replot with points instead of columns.
ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Winning Percentage") + ylim(0,1) +
geom_point(aes(x=Year, y=win_perc), col="darkgreen", size=3) + ggtitle("A's winning percentage each season")
Create a data set for Billy Beane’s A’s seasons. Beane succeeded Alderson as GM on October 17, 1997. On October 5, 2015, the Athletics announced that Beane had been promoted to executive vice president of baseball operations. We filter for the years when Beane was GM or VP (i.e. 1998 and forward) and assign the result to ‘billy_oakland’. In the code below, the dataframe ‘oakland’ is passed to the ‘filter’ function, which keeps only the rows where Year is greater than (>) 1997.
billy_oakland <- filter(oakland, Year>1997)
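As a rough illustration of what the filter step does, the sketch below applies the same row selection to a toy data frame in base R, and compares it with selecting on the 0/1 indicator column. The data frame ‘toy’ and its win totals are made up; whether the BillyBeane column in the real file also flags the seasons after 2015 is not stated here, so the two selections may not agree on the real data.

```r
# Toy stand-in for 'oakland'; the win totals are made up for illustration.
toy <- data.frame(
  Year       = 1995:2000,
  WINS       = c(60, 70, 80, 90, 85, 95),
  BillyBeane = c(0, 0, 0, 1, 1, 1)
)

# Base R equivalent of filter(toy, Year > 1997): keep rows where the test is TRUE.
by_year <- toy[toy$Year > 1997, ]

# Alternative: select on the 0/1 indicator column instead of the year.
by_flag <- toy[toy$BillyBeane == 1, ]

print(by_year)
print(identical(by_year$Year, by_flag$Year))
```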
What is Billy Beane’s overall winning percentage? Calculate the mean over all of the Billy Beane seasons with the ‘mean’ function and assign this value to ‘billy’. The code below takes ‘win_perc’ (each season’s winning percentage, which we created earlier) from the ‘billy_oakland’ dataframe and calculates its mean.
billy <- mean(billy_oakland$win_perc)
Billy Beane’s overall winning percentage is 0.5336101, or about 53.4%.
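One thing worth noting is that the mean of the season percentages gives every season equal weight, so a 60-game season counts as much as a 162-game season. The sketch below contrasts this with a pooled percentage (total wins over total games) on two made-up seasons; the names ‘two_seasons’, ‘mean_of_seasons’ and ‘pooled’ are hypothetical, and the numbers are not from the csv file.

```r
# Two made-up seasons: one full length (81-81) and one shortened (36-24).
two_seasons <- data.frame(
  WINS   = c(81, 36),
  LOSSES = c(81, 24)
)
two_seasons$win_perc <- two_seasons$WINS / (two_seasons$WINS + two_seasons$LOSSES)

# Mean of season percentages: each season counts equally.
mean_of_seasons <- mean(two_seasons$win_perc)

# Pooled percentage: each game counts equally.
pooled <- sum(two_seasons$WINS) / sum(two_seasons$WINS + two_seasons$LOSSES)

print(c(mean_of_seasons = mean_of_seasons, pooled = pooled))

# Formatting a proportion as a percentage for reporting.
print(sprintf("%.1f%%", 100 * 0.5336101))
```

The two summaries differ whenever season lengths differ; the value computed above with ‘mean’ is the first kind.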
##### Histogram of the Oakland A’s winning percentage during the Billy Beane era
A basic histogram with the ‘hist’ function:
hist(billy_oakland$win_perc)
Make a better plot by adding some color, and changing the labels to something more readable.
hist(billy_oakland$win_perc, col="darkgreen", border="yellow",
cex.axis=1.4, cex.lab=1.6,
xlab="Winning percentage", main=NULL, ylim=c(0, 4),
breaks=seq(0.4, 0.65, length.out = 22))
Oops - there are too many bins (breaks). Let’s try again and make an even better plot. You can explore changing the number of bins in this histogram with the Shiny app, M1.1 Oakland A’s winning % under Billy Beane.
hist(billy_oakland$win_perc, cex.axis=1.4, cex.lab=1.6, cex.main=1.6,
col="darkgreen", border="yellow", main="Billy Beane's Oakland A's",
xlab="Winning percentage", ylim=c(0,10),
breaks=seq(0.4, 0.65, length.out = 6))
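The number of bins follows directly from the ‘breaks’ argument: a vector of n break points defines n - 1 bins. The sketch below checks this for the two ‘seq’ calls used above and counts a handful of made-up winning percentages without drawing a plot; the vector ‘x’ and the names ‘coarse’, ‘fine’ and ‘h’ are hypothetical.

```r
# Break points from the two hist() calls above.
fine   <- seq(0.4, 0.65, length.out = 22)
coarse <- seq(0.4, 0.65, length.out = 6)

# n break points define n - 1 bins.
print(length(fine) - 1)    # number of bins in the first attempt
print(length(coarse) - 1)  # number of bins in the second attempt
print(diff(coarse))        # width of each coarse bin

# Made-up winning percentages, counted per bin without plotting.
x <- c(0.42, 0.46, 0.51, 0.54, 0.58, 0.63)
h <- hist(x, breaks = coarse, plot = FALSE)
print(h$counts)
```

With only about two dozen seasons in ‘billy_oakland’, 21 narrow bins leave most bins nearly empty, while 5 bins of width 0.05 give a readable shape.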