The `Lahman` package in R contains a plethora of baseball data. The package is used extensively in the book Analyzing Baseball Data with R by Max Marchi and Jim Albert (2013). This tutorial uses a subset of data from the `Lahman` package to expose you to some basic descriptive statistical functions and data subsetting within the R Markdown environment.
I have extracted a subset of the data and stored it in an Rdata file. To read in the data correctly, save the Rmd file in a folder with the rest of your course work and place the file baseball.Rdata in the same folder. Then go to Session > Set Working Directory > To Source File Location. Now you may run the code below to load the data.
load("baseball.Rdata")
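If `load()` stops with a "cannot open file" error, the working directory is usually not the folder that holds the data file. The following is a minimal sketch of how one might check this before loading. The file name `demo.Rdata` and the object `demo_vec` are hypothetical stand-ins, written to a temporary folder so that the example runs on its own.

```r
# Hypothetical stand-in for baseball.Rdata, saved to a temporary folder
demo_file <- file.path(tempdir(), "demo.Rdata")
demo_vec <- c(a = 1, b = 2)
save(demo_vec, file = demo_file)
rm(demo_vec)

# Show the folder R currently treats as the working directory
getwd()

# Load only if the file is where we expect it
if (file.exists(demo_file)) {
  loaded <- load(demo_file)  # load() returns the names of the restored objects
  print(loaded)
} else {
  message("File not found: ", demo_file)
}
```

With the real data, the equivalent check would be `file.exists("baseball.Rdata")` after setting the working directory.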
To see all the objects loaded from the baseball data set in this tutorial, use the function `ls()`. The function `ls()` lists all the objects in the current R environment. You may see other objects from previous instances of work in R.
ls()
[1] "batting_stats" "CarltonFiskBA" "CarltonFiskHR" "CarltonFiskRBI"
[5] "JimRiceBA" "JimRiceHR" "JimRiceRBI" "TedWilliamsBA"
[9] "TedWilliamsHR" "TedWilliamsRBI"
To access the content of an object in R, use the object’s name. Keep in mind that R is case sensitive. Thus, we need to type an object’s name exactly as it appears.
Above we see the object `JimRiceBA`. Run the code below to see the contents of `JimRiceBA`.
JimRiceBA
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
0.269 0.309 0.282 0.320 0.315 0.325 0.294 0.284 0.309 0.305 0.280 0.291
1986 1987 1988 1989
0.324 0.277 0.264 0.234
Baseball abbreviations
Abbreviation | Meaning |
---|---|
BA | Batting Average |
HR | Home Runs |
RBI | Runs Batted In |
These are some measures of a batter’s success.
The names of many functions in R are self-explanatory. To compute the minimum, maximum, and mean of Jim Rice’s career batting averages, we can use the corresponding functions given below.
min(JimRiceBA)
[1] 0.234
max(JimRiceBA)
[1] 0.325
mean(JimRiceBA)
[1] 0.292625
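Several of these summaries can also be requested at once. The sketch below rebuilds the vector from the values printed earlier so that it runs without the data file. `range()`, `summary()`, and `round()` are base R functions; using them here is a suggestion, not part of the original exercise.

```r
# Rebuild Jim Rice's batting averages from the printed output
JimRiceBA <- c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
               0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234)
names(JimRiceBA) <- 1974:1989

range(JimRiceBA)           # minimum and maximum together
summary(JimRiceBA)         # minimum, quartiles, mean, and maximum
round(mean(JimRiceBA), 3)  # mean rounded to three decimal places
```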
To find the year in which Jim Rice had his lowest batting average and the year in which he had his highest batting average, we can make use of the functions `which.min()` and `which.max()`, respectively.
which.min(JimRiceBA)
1989
16
which.max(JimRiceBA)
1979
6
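Each result above has two parts: the element's name (the year) and its position in the vector. The following sketch shows one way to pull out only the year or only the value. It rebuilds the vector from the printed output so that it runs on its own, and the object name `best` is introduced here only for illustration.

```r
# Rebuild Jim Rice's batting averages from the printed output
JimRiceBA <- c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
               0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234)
names(JimRiceBA) <- 1974:1989

best <- which.max(JimRiceBA)  # named integer: the name is the year, the value is the position
names(best)                   # the year, as a character string
unname(best)                  # the position only
JimRiceBA[best]               # the batting average in that year
JimRiceBA["1979"]             # the same element, selected by name
```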
Let’s examine how Rice’s batting average changed throughout his career. First, we compute year-over-year differences, then view the results. Second, we will look at which year he had the largest increase and which year he had the largest decrease.
# compute differences
JimRiceBA_diffs <- diff(JimRiceBA, lag = 1)
JimRiceBA_diffs
1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
0.040 -0.027 0.038 -0.005 0.010 -0.031 -0.010 0.025 -0.004 -0.025
1985 1986 1987 1988 1989
0.011 0.033 -0.047 -0.013 -0.030
# find years
which.max(JimRiceBA_diffs)
1975
1
which.min(JimRiceBA_diffs)
1987
13
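To see what `diff()` computes, it can help to reproduce it by hand. The sketch below rebuilds the vector from the printed output and subtracts each season from the one that follows it. The object name `by_hand` is introduced only for illustration.

```r
# Rebuild Jim Rice's batting averages from the printed output
JimRiceBA <- c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
               0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234)
names(JimRiceBA) <- 1974:1989

# Drop the first season, drop the last season, and subtract element by element
by_hand <- JimRiceBA[-1] - JimRiceBA[-length(JimRiceBA)]
all.equal(by_hand, diff(JimRiceBA, lag = 1))  # checks that both approaches agree

# Size of the largest change in either direction, and the season it occurred
JimRiceBA_diffs <- diff(JimRiceBA)
max(abs(JimRiceBA_diffs))
names(which.max(abs(JimRiceBA_diffs)))
```

Taking `abs()` before `max()` treats increases and decreases alike, which is useful when only the size of a change matters.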
The `#` symbol was used to add comments. R does not execute anything following `#` on a line. Use `#` for code documentation to explain to others why you are doing what you are doing with your code. Good code documentation is also beneficial for your future self.
Recall that the correlation measures the strength of the linear relationship between two quantitative variables. Let’s look at the correlation between each pair of available variables for Jim Rice: batting average, home runs, and RBIs.
cor(JimRiceBA, JimRiceHR)
[1] 0.7576232
cor(JimRiceBA, JimRiceRBI)
[1] 0.7942272
cor(JimRiceHR, JimRiceRBI)
[1] 0.9305225
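Two properties of `cor()` are worth checking for yourself: the order of the arguments does not matter, and an exact linear relationship gives a correlation of 1. The sketch below uses made-up numbers (the vectors `hr`, `rbi`, and `ba` are hypothetical, not real player data) and also shows how all pairwise correlations could be computed in one call.

```r
# Made-up season totals for illustration only (not real player data)
hr  <- c(10, 22, 25, 31, 39)
rbi <- c(45, 85, 102, 110, 130)
ba  <- c(0.270, 0.300, 0.285, 0.310, 0.320)

all.equal(cor(hr, rbi), cor(rbi, hr))  # correlation is symmetric
all.equal(cor(hr, 2 * hr + 1), 1)      # an exact linear relationship gives 1

# All pairwise correlations at once, returned as a matrix
cor(cbind(ba, hr, rbi))
```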
To view a simple plot of Rice’s home runs versus his RBIs, we can use the `plot()` function.
plot(JimRiceHR, JimRiceRBI)
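`plot()` also accepts arguments for axis labels, a title, and the plot type. Because the home run and RBI values are not printed in this tutorial, the sketch below instead plots batting average against season, rebuilding the vector from the printed output. Converting the names with `as.numeric()` is one possible way to get the years onto the x-axis; the labels and title are illustrative choices.

```r
# Rebuild Jim Rice's batting averages from the printed output
JimRiceBA <- c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
               0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234)
names(JimRiceBA) <- 1974:1989

# Names are stored as text, so convert them to numbers for the x-axis
years <- as.numeric(names(JimRiceBA))

plot(years, JimRiceBA,
     type = "b",                # points joined by lines
     xlab = "Season",
     ylab = "Batting average",
     main = "Jim Rice batting average by season")
```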
Ted Williams, 1939
Answer parts a-i below. Use a separate code chunk for each part that requires code. You will examine data on Ted Williams.
a. To remind yourself of the variable names, use the function `ls()`.
b. Use the `length()` function to determine how many seasons Ted Williams played.
c. In which season did Ted Williams have his highest batting average?
d. Plot Williams’ batting average over time. To put the years on the x-axis, use `names(TedWilliamsBA)`.
e. What was Williams’ highest batting average?
f. What was Williams’ career mean batting average?
g. What was the correlation between Williams’ home runs and RBIs? Was it higher than Jim Rice’s correlation?
h. What was the largest absolute change in Williams’ RBIs year-over-year?
i. Why does Ted Williams not have any statistics from 1943 to 1945? Was he hurt?
Which of the three players (Fisk, Rice, Williams) was most consistent year-over-year with regards to the batting average metric? How did you define consistency?
Create the R Markdown file that produced the HTML file ica-01-10-19. All formatting should match, but you may replace my name with your own name. Below are some helpful hints.
---
title: "Console only"
author: "Shawn Santo"
date: "January 10, 2019"
output:
  html_document:
    toc: true
    number_sections: true
    toc_float: true
    df_print: paged
---
In section 3.1 the `longley` data frame is shown, but the code is not. You will need to use an appropriate chunk option to get your HTML file to match.
To create the final plot, use the code below. You will need to install each package before you can load it with the `library()` function.
library(ggplot2)
library(rvg)
library(ggiraph)
longley$tooltip <- paste("GNP: ", longley$GNP, sep = "")
gg_point_1 <- ggplot(longley,
                     aes(x = Year, y = Employed, tooltip = tooltip)) +
  geom_point_interactive(size = 4, color = "darkblue") +
  theme_bw()
ggiraph(ggobj = gg_point_1)
Lahman, S. (2017) Lahman’s Baseball Database, 1871-2016, Main page, http://www.seanlahman.com/baseball-archive/statistics/