This page shows some of the data and code used in Lecture 1.1, the Oakland A’s under Billy Beane.
You can see how the data were processed and how the plots were generated.

1st load data into R

Read in the csv (comma-separated value) file, and assign it to ‘oakland’.

library(tidyverse)  # provides read_csv, mutate, filter, and ggplot
oakland <- read_csv("Oakland_As_Seasons.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Standing = col_character(),
##   WINS = col_double(),
##   LOSSES = col_double(),
##   BillyBeane = col_double()
## )

After R reads the csv file, it prints the name and type of each column. The file contains 53 seasons of the Oakland A’s with 5 columns: Year | Standing = standing at the end of the year | WINS = number of wins | LOSSES = number of losses | BillyBeane = whether Billy Beane was the GM in that season (0 = no, 1 = yes).
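The column specification printed above can also be passed back to ‘read_csv’ through its ‘col_types’ argument, which suppresses the message and makes the expected types explicit. The sketch below assumes only the five column names shown above; the temporary file and its two rows are placeholders made up for illustration, not real A’s records.

```r
library(readr)

# Hypothetical two-row stand-in for Oakland_As_Seasons.csv.
# The values are placeholders, not real season records.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Year,Standing,WINS,LOSSES,BillyBeane",
  "2001,2nd,90,72,1",
  "2002,1st,80,82,1"
), tmp)

# Declaring the types up front silences the column specification message.
demo <- read_csv(tmp, col_types = cols(
  Year       = col_double(),
  Standing   = col_character(),
  WINS       = col_double(),
  LOSSES     = col_double(),
  BillyBeane = col_double()
))

str(demo)  # compact overview: one line per column with its type
```

With the real file, the same ‘col_types’ argument could be added to the ‘read_csv’ call that creates ‘oakland’.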

Take a quick look at the data.
ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Wins") +
  geom_col(aes(x=Year, y=WINS), fill="darkgreen", col="yellow") + ylim(0,162) + ggtitle("Number of A's wins in each season")

It looks like the A’s did very poorly in 1981, 1994, and 2020. A baseball season is conveniently always 162 games. Oops, except for the strike-shortened seasons of 1981 and 1994 and, of course, the COVID-shortened 2020 season.

2nd manipulate the data in R

Next we manipulate the dataframe. With the ‘mutate’ function we add a new column, ‘win_perc’, holding the winning percentage for each season: wins divided by games played (wins plus losses). The calculation is applied to every row, i.e. to every year.

oakland <- mutate(oakland, win_perc = WINS/(WINS+LOSSES))
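To see what ‘mutate’ does row by row, here is a small sketch on a toy dataframe that uses the same column names as ‘oakland’. The three win-loss records are typed in by hand for illustration and are not read from the csv; the extra ‘games’ column is an addition that is not part of the lecture code.

```r
library(dplyr)

# Toy dataframe, entered by hand for illustration (not read from the csv).
toy <- data.frame(
  Year   = c(1981, 1994, 2020),
  WINS   = c(64, 51, 36),
  LOSSES = c(45, 63, 24)
)

# Same calculation as above, plus games played for context.
toy <- mutate(toy,
              games    = WINS + LOSSES,
              win_perc = WINS / (WINS + LOSSES))

print(toy)
```

In this toy table the 2020 row has the fewest wins but works out to 36/60 = 0.6, which is why dividing by games played changes the picture for short seasons.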

Now the data are more comparable from year to year, and we see that 1981 and 2020 were great seasons, while 1994 was still bad.

ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Winning Percentage") +
  geom_col(aes(x=Year, y=win_perc), fill="darkgreen", col="yellow") + ggtitle("A's winning percentage each season") 

This winning percentage data is probably better represented with points, though. Let’s replot with points instead of columns.

ggplot(data=oakland) + theme(text = element_text(size=18)) + ylab("Winning Percentage") + ylim(0,1) +
  geom_point(aes(x=Year, y=win_perc), col="darkgreen", size=3) + ggtitle("A's winning percentage each season")

Create a data set for Billy Beane’s A’s seasons. Beane succeeded Alderson as GM on October 17, 1997. On October 5, 2015, the Athletics announced that Beane had been promoted to executive vice president of baseball operations. We keep the years when Beane was GM or VP (i.e. 1998 and forward) and assign them to ‘billy_oakland’. In the code below, the dataframe ‘oakland’ is passed to the ‘filter’ function, which keeps only the rows where Year is greater than (>) 1997.

billy_oakland <- filter(oakland, Year>1997)
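Because the comparison is strict, 1997 itself is excluded and 1998 is the first season kept. A minimal sketch on a hypothetical dataframe that contains only a Year column:

```r
library(dplyr)

# Hypothetical mini dataframe; only the Year column matters here.
toy <- data.frame(Year = 1995:2000)

after_1997 <- filter(toy, Year > 1997)   # strict comparison: 1997 is dropped
from_1998  <- filter(toy, Year >= 1998)  # the same rows, written inclusively

print(after_1997$Year)
```

Both conditions select the same rows when Year holds whole numbers, so the choice between them is a matter of readability.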

What is Billy Beane’s overall winning percentage? Calculate the mean over all of Billy Beane’s seasons with the ‘mean’ function and assign this value to ‘billy’. The code below takes ‘win_perc’ (each season’s winning percentage, which we created earlier) from the ‘billy_oakland’ dataframe and calculates its mean.

billy <- mean(billy_oakland$win_perc)

Billy Beane’s overall winning percentage is 0.5336101, or about 53.4%.
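Note that ‘mean(win_perc)’ averages the season percentages, so a short season counts as much as a full one. That is not quite the same as total wins divided by total games. The sketch below uses two made-up seasons of different length to show the difference, and also shows one way to print a proportion as a percentage; all variable names are hypothetical.

```r
# Made-up records for two seasons of different length (not real data).
wins   <- c(90, 20)
losses <- c(72, 40)

season_pct <- wins / (wins + losses)

mean_of_seasons <- mean(season_pct)                # what mean(win_perc) computes
pooled          <- sum(wins) / sum(wins + losses)  # total wins / total games

print(c(mean_of_seasons = mean_of_seasons, pooled = pooled))

# Print the value reported above as a percentage with one decimal place.
label <- sprintf("%.1f%%", 100 * 0.5336101)
print(label)
```

When every season has the same number of games the two numbers coincide, so for this data set the difference is likely to be small.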

3rd visualize your data - create a basic plot

A basic histogram with the ‘hist’ function.

Histogram of Oakland A’s winning percentage during the Billy Beane era

hist(billy_oakland$win_perc)

Create a better plot

Make a better plot by adding some color, and changing the labels to something more readable.

Histogram of Oakland A’s winning percentage during the Billy Beane era
hist(billy_oakland$win_perc, col="darkgreen", border="yellow",
     cex.axis=1.4, cex.lab=1.6,
     xlab="Winning percentage", main=NULL, ylim=c(0, 4),
     breaks=seq(0.4, 0.65, length.out = 22))

Oops - there are too many bins: 22 break points produce 21 bins. Let’s try again and make an even better plot. You can explore changing the number of bins in this histogram with the Shiny app, M1.1 Oakland A’s winning % under Billy Beane.

Histogram of Oakland A’s winning percentage during the Billy Beane era
hist(billy_oakland$win_perc, cex.axis=1.4, cex.lab=1.6, cex.main=1.6,
     col="darkgreen", border="yellow", main="Billy Beane's Oakland A's",
     xlab="Winning percentage", ylim=c(0,10),
     breaks=seq(0.4, 0.65, length.out = 6))
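The two histograms differ only in the ‘length.out’ value passed to ‘seq’. As a quick check of what that value means, the sketch below counts the bins each setting produces and tallies a small sample without drawing anything. The vector ‘x’ is made up for illustration and is not the A’s data.

```r
# seq() with length.out = n returns n break points, i.e. n - 1 bins.
fine   <- seq(0.4, 0.65, length.out = 22)
coarse <- seq(0.4, 0.65, length.out = 6)

n_fine   <- length(fine) - 1    # number of bins in the first attempt
n_coarse <- length(coarse) - 1  # number of bins in the second attempt
width    <- diff(coarse)[1]     # width of each coarse bin

print(c(n_fine = n_fine, n_coarse = n_coarse, width = width))

# hist() can return the bin counts without plotting.
x <- c(0.41, 0.47, 0.52, 0.53, 0.58, 0.64)  # made-up sample
h <- hist(x, breaks = coarse, plot = FALSE)
print(h$counts)
```

With ‘length.out = 6’ the breaks fall at 0.40, 0.45, ..., 0.65, giving five bins that are each 0.05 wide.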